Motivation
With recommendation systems being omnipresent, I've always been interested in learning about the inner workings of such engines and how they influence our decisions.
These systems are designed to deliver personalized and relevant suggestions, catering to individual user preferences and the characteristics of the items themselves.
This article walks you through the process of creating a movie recommender engine from scratch, culminating in a Streamlit app. I will go through essential concepts such as word vectorization and text similarity, as well as different types of recommendation systems.
Here is a snapshot of the app!
You can also check out the demo below!
Project Overview
Built two recommendation systems to recommend movies based on movie description and metadata (director, genre, cast, etc.)
Preprocessed Netflix Movies and TV Shows dataset to extract relevant movie description and metadata
Utilized TF-IDF technique to represent text data (description, names of cast, etc.) as vectors and rank similar movies using Cosine Similarity
Designed app layout and UI integration using Streamlit
Resources and Tools
Language: Python
Packages: pandas, numpy, scikit-learn, streamlit, nltk
Table of Contents:
I. Recommendation Systems Overview
II. Data Preprocessing
a. Data Exploration
b. Text Representation & Similarity
III. App Building
IV. Deployment
I. Recommendation Systems Overview
There are several types of recommendation systems. Let's take a high-level look at the most common ones:
1. Content-based filtering: This approach focuses on item attributes like genre or keywords, suggesting items similar to the ones users have shown interest in, based on these shared attributes.
2. Collaborative filtering: This approach focuses on user behavior, recommending items based on similarities between item ratings or users' preferences.
3. Hybrid recommender system: These systems combine multiple techniques to provide more personalized suggestions.
In this article, we will focus on building content-based recommendation engines for movies, where recommendations are based on movie description and metadata information.
II. Data Preprocessing
For this project, we will use the Netflix Movies and TV Shows dataset. This dataset contains ~8,800 titles, along with their descriptions and metadata (director, cast, genre).
a. Data Exploration
A first glance at the dataset shows that it contains no user ratings or user IDs, which a collaborative filtering approach would require. This confirms our choice of content-based recommendation.
b. Text Representation & Similarity
Concept Overview - Text Representation
For an ML algorithm to process text data efficiently, we need to transform the text into a mathematical form. This means that text units are encoded as vectors of numbers.
For Natural Language Processing, there are two common representation techniques:
1. Bag of Words
This approach turns text into a bag of words, treating each word equally without considering their order. It assumes that similar text shares similar words. If two bodies of text have similar words, their vector representations will be close to each other in the vector space.
However, this approach results in sparse representation, with the majority of vector entries being 0s.
2. TF-IDF (Term Frequency-Inverse Document Frequency)
This approach measures the importance of words in a document relative to a collection of documents or corpus. Essentially, it helps identify words that are frequent in a document but not common across all documents. Thus, if a word appears often in a specific document but not in many other documents, it gets a high TF-IDF score for that document. This helps highlight words that are distinctive and meaningful for characterizing the content of a particular document.
Concept Overview - Text Similarity
With the text represented as vectors, we need a way to quantify the similarity between those vectors. Several methods are commonly used for this purpose.
For this project, we will use Cosine Similarity. Cosine Similarity measures the cosine of the angle between two vector representations. The closer this value is to 1 (that is, the smaller the angle), the more similar the vectors.
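As a small illustration, cosine similarity is the dot product of two vectors divided by the product of their lengths. The helper below is a hand-written sketch with NumPy (the name cosine_sim_pair is hypothetical and not part of the project code); it assumes neither vector is all zeros.

```python
import numpy as np

def cosine_sim_pair(a, b):
    """Cosine of the angle between vectors a and b (assumes non-zero vectors)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vectors pointing in the same direction: similarity of 1, regardless of length
same_direction = cosine_sim_pair([1, 2, 0], [2, 4, 0])

# Vectors with no terms in common: similarity of 0
no_overlap = cosine_sim_pair([1, 0, 0], [0, 3, 0])

print(same_direction, no_overlap)
```

Because TF-IDF weights are never negative, the similarity between two documents falls between 0 (no shared terms) and 1 (same direction).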
Feature Engineering
Before we can transform our text data into vectors and calculate cosine similarity, we need to perform some data cleaning and feature engineering.
As mentioned previously, we will be building two recommendation systems: (1) based on movie description and (2) based on metadata (director, cast, genre, type).
For each model, we will do the following:
1. Description-based Recommender
- Clean and tokenize description information
2. Metadata-based Recommender
- Clean data on director, actors, genres and type
- Combine these data into a single corpus to prepare for vectorization
The detailed feature engineering code can be found in this notebook.
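The metadata steps listed above could look roughly like the sketch below. It is not the notebook's code: the toy rows, the column names (type, director, cast, listed_in) and the helper clean_names are assumptions made for illustration. Joining each name into a single token is one common trick so that "Ann Lee" and "Ann Smith" do not match on "ann" alone.

```python
import pandas as pd

# Toy rows standing in for the dataset; column names are assumptions
df = pd.DataFrame({
    "title": ["Movie A", "Show B"],
    "type": ["Movie", "TV Show"],
    "director": ["Jane Doe", None],
    "cast": ["John Smith, Ann Lee", "Ann Lee"],
    "listed_in": ["Dramas, Thrillers", "Crime TV Shows"],
})

def clean_names(value):
    """Turn 'John Smith, Ann Lee' into 'johnsmith annlee'; missing values become ''."""
    if pd.isna(value):
        return ""
    names = [name.strip().lower().replace(" ", "") for name in str(value).split(",")]
    return " ".join(name for name in names if name)

metadata_cols = ["type", "director", "cast", "listed_in"]

# Clean every metadata field and join the non-empty parts into one string per title
df["metadata"] = df[metadata_cols].apply(
    lambda row: " ".join(part for part in map(clean_names, row) if part),
    axis=1,
)
print(df[["title", "metadata"]])
```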
Vectorization using TfidfVectorizer & Calculating Cosine Similarity
We can now vectorize the data and calculate Cosine Similarity.
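A minimal sketch of this step with scikit-learn is shown below. The three descriptions are invented, and the stop_words="english" setting is an assumption; the notebook may clean the text differently before vectorizing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy descriptions, invented for this example
descriptions = [
    "a detective hunts a serial killer in the city",
    "a young detective tracks a killer across the city",
    "two friends open a bakery in a small town",
]

# One row per title, one column per term
tfidf = TfidfVectorizer(stop_words="english")
tfidf_matrix = tfidf.fit_transform(descriptions)

# Pairwise similarity: entry [i, j] compares title i with title j
cosine_sim = cosine_similarity(tfidf_matrix)
print(cosine_sim.round(2))
```

The result is a square matrix with 1s on the diagonal (each title compared with itself). Here the two detective stories score higher with each other than either does with the bakery story.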
The entire data preprocessing code can be found in this notebook.
III. App Building
With the preprocessed data, we can now start building the app.
If you looked at the preprocessing code above, you will have seen that we define two functions that fetch the top n results using the cosine similarity matrices created earlier.
The functions are: get_recommendations() and get_keywords_recommendations()
These two functions will be the recommendation engines for our app.
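For readers who have not opened the notebook, a function of this kind might look like the sketch below. The signature, the title column and the toy similarity matrix are assumptions for illustration, not the project's actual code, and the sketch assumes titles are unique.

```python
import numpy as np
import pandas as pd

def get_recommendations(title, df, cosine_sim, n=5):
    """Return the n titles most similar to `title`, best match first.

    Assumes df has a unique "title" column whose row order matches cosine_sim.
    """
    # Map each title to its row position in the similarity matrix
    indices = pd.Series(range(len(df)), index=df["title"])
    idx = indices[title]

    # Rank all titles by similarity, highest first (stable sort keeps ties in row order)
    ranked = np.argsort(-cosine_sim[idx], kind="stable")

    # Drop the query title itself and keep the top n
    top = [i for i in ranked if i != idx][:n]
    return df["title"].iloc[top].tolist()

# Toy data: four titles and a hand-written similarity matrix
df = pd.DataFrame({"title": ["A", "B", "C", "D"]})
cosine_sim = np.array([
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.9],
    [0.3, 0.4, 0.9, 1.0],
])

recs = get_recommendations("A", df, cosine_sim, n=2)
print(recs)  # ['B', 'D']
```

A metadata-based variant would follow the same pattern with the similarity matrix built from the metadata corpus.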
Then, so that the app can reuse the similarity matrices and the DataFrame without recomputing them, we save them as binary files using either joblib or pickle.
This way, we can easily load and use them within our applications.
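A minimal sketch of the save and load round trip with pickle follows. The file names and the temporary directory are placeholders chosen for this example; in the project the files would live alongside the app script.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Toy stand-ins for the real DataFrame and similarity matrix
df = pd.DataFrame({"title": ["A", "B"]})
cosine_sim = np.array([[1.0, 0.5], [0.5, 1.0]])

out_dir = Path(tempfile.mkdtemp())  # placeholder location for this sketch

# Save after preprocessing
with open(out_dir / "cosine_sim.pkl", "wb") as f:
    pickle.dump(cosine_sim, f)
df.to_pickle(out_dir / "movies.pkl")

# Load later, inside the app
with open(out_dir / "cosine_sim.pkl", "rb") as f:
    loaded_sim = pickle.load(f)
loaded_df = pd.read_pickle(out_dir / "movies.pkl")

print(loaded_sim.shape, len(loaded_df))
```

joblib offers the same round trip through joblib.dump(obj, path) and joblib.load(path) and is often preferred for large NumPy arrays. With either library, only load files you created yourself, since unpickling can execute arbitrary code.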
Design the app
Designing the app is fairly straightforward. The entire code can be found in this script.
Overall, we will take the following steps:
1. Load matrices and DataFrame binary files
2. Redefine recommendation functions - fetching top n results
3. Write app layout code
Here is what the final app looks like!
You can play with it here!
Below is a demo video showing how the app works!
IV. Deployment
Now that our app is up and running, we can deploy it to share it conveniently with others. There are many options available, but I personally find deploying on Streamlit Cloud very simple. With just a few clicks, you can deploy the app, and others can access it using the generated link.
The code shown throughout this article is available in this GitHub repo!