The Internet Movie Database (IMDb) is a great source of information about movies. Keras provides access to part of the cleaned dataset (e.g. for sentiment classification). While sentiment classification is an interesting topic, I wanted to see whether it is possible to identify a movie’s genre from its description.
The image illustrates the task.
To find out, I downloaded the raw data from an FU Berlin FTP server. Most movies have multiple genres assigned (e.g. Action and Sci-Fi). In those cases I picked one of the assigned genres at random.
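As a minimal sketch of the random genre pick, assuming each movie is available as a title with a list of genres (the sample records and the helper name `pick_single_genre` are made up for illustration and are not taken from the raw data):

```python
import random


def pick_single_genre(genres, rng):
    """Return one genre chosen uniformly at random from a movie's genre list."""
    return rng.choice(genres)


# Hypothetical sample records: (title, list of assigned genres).
movies = [
    ("Movie A", ["Action", "Sci-Fi"]),
    ("Movie B", ["Drama"]),
    ("Movie C", ["Comedy", "Romance", "Drama"]),
]

rng = random.Random(42)  # fixed seed so the chosen labels are reproducible
labels = {title: pick_single_genre(genres, rng) for title, genres in movies}
```

Because only one of several valid genres survives, the resulting label is noisy: the same kind of description can end up under different genres.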
So the task at hand is to use a lengthy description to infer a (noisy) label. This makes it similar to the Reuters news categorization task, and I used the code for that task as a guideline for the model.
However, looking at that code, it becomes clear that the data preprocessing part is skipped. To make it easier for practitioners to create their own applications, I will try to detail the necessary preprocessing.
The texts are represented as vectors of integers (indexes). In other words, one builds a dictionary in which each index refers to a particular word.
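To make the idea concrete, here is a small pure-Python sketch of such a dictionary. It is only an illustration under my own assumptions (whitespace tokenization, lowercasing, more frequent words get smaller indexes, index 0 left unused); the helper names `build_word_index` and `text_to_indexes` are hypothetical and not part of Keras.

```python
from collections import Counter


def build_word_index(texts, max_words=None):
    """Map each word to an integer index; more frequent words get smaller indexes."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Index 0 is left unused, so numbering starts at 1.
    return {
        word: index
        for index, (word, _) in enumerate(counts.most_common(max_words), start=1)
    }


def text_to_indexes(text, word_index):
    """Replace every known word by its index and drop unknown words."""
    return [word_index[word] for word in text.lower().split() if word in word_index]


# Hypothetical toy descriptions.
descriptions = ["a robot saves the city", "the detective saves a friend"]
word_index = build_word_index(descriptions)
sequences = [text_to_indexes(text, word_index) for text in descriptions]
```

The optional `max_words` argument caps the vocabulary at the most frequent words, which is one of the preprocessing knobs worth experimenting with.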
To get trainable data, we first balance the dataset so that all classes have the same frequency.
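One simple way to balance the classes is to downsample every class to the size of the smallest one. The sketch below assumes that strategy and a list of `(description, genre)` pairs; both the strategy and the helper name `balance_by_downsampling` are assumptions made for illustration.

```python
import random
from collections import defaultdict


def balance_by_downsampling(samples, rng):
    """Keep the same number of samples per label by shrinking the larger classes."""
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))

    smallest = min(len(group) for group in by_label.values())

    balanced = []
    for group in by_label.values():
        # Draw `smallest` samples without replacement from each class.
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)  # avoid long runs of a single class
    return balanced


# Hypothetical toy data: three dramas, two comedies, one action movie.
samples = [
    ("d1", "Drama"), ("d2", "Drama"), ("d3", "Drama"),
    ("c1", "Comedy"), ("c2", "Comedy"),
    ("a1", "Action"),
]
balanced = balance_by_downsampling(samples, random.Random(0))
```

Downsampling throws data away; if the smallest class is tiny, oversampling or class weights may be a better fit.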
Then we convert the raw text descriptions into such an index-based representation. As always, we split the dataset into training (90%) and test data. Finally, we transform the index-based representation into a matrix representation and one-hot encode the classes.
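The last three steps can be sketched in plain Python as follows. This assumes a binary bag-of-words matrix (a 1 in column `i` if word index `i` occurs in the description), and the helper names are hypothetical. In practice one would use the corresponding Keras or NumPy utilities instead of these hand-written loops.

```python
import random


def split_train_test(items, train_fraction, rng):
    """Shuffle the items and split them into a training and a test part."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def sequences_to_binary_matrix(sequences, num_words):
    """One row per text; column i is 1 if word index i occurs in that text."""
    matrix = []
    for sequence in sequences:
        row = [0] * num_words
        for index in sequence:
            if index < num_words:  # ignore indexes outside the vocabulary cap
                row[index] = 1
        matrix.append(row)
    return matrix


def one_hot(labels, classes):
    """Encode each label as a vector with a single 1 at the class position."""
    position = {name: i for i, name in enumerate(classes)}
    return [
        [1 if position[label] == i else 0 for i in range(len(classes))]
        for label in labels
    ]


# Hypothetical toy data: ten (index sequence, genre) pairs.
data = [([1, 3], "Drama"), ([2, 4], "Action")] * 5
train, test = split_train_test(data, 0.9, random.Random(1))

classes = ["Action", "Drama"]
x_train = sequences_to_binary_matrix([seq for seq, _ in train], num_words=5)
y_train = one_hot([label for _, label in train], classes)
```

Each row of `x_train` has a fixed length of `num_words`, regardless of how long the original description was, which is what a dense input layer needs.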
After setting up the data, we can define the model. I tried different combinations (depth, dropouts, regularizers and input units) and the following layout seems to work the best:
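For orientation, here is a minimal sketch of a model of this kind, in the spirit of the Keras Reuters MLP example. It is not the exact layout referred to above: the single hidden layer, the 512 units, the dropout rate of 0.5, and the Adam optimizer are assumptions chosen for illustration, and the function name `build_genre_classifier` is hypothetical. It assumes TensorFlow with its bundled Keras is installed.

```python
def build_genre_classifier(num_words, num_classes, hidden_units=512, dropout_rate=0.5):
    """Build a small multi-layer perceptron for bag-of-words genre classification."""
    # Imported inside the function so that defining it does not require TensorFlow.
    from tensorflow import keras

    model = keras.Sequential([
        keras.Input(shape=(num_words,)),                        # one column per word index
        keras.layers.Dense(hidden_units, activation="relu"),
        keras.layers.Dropout(dropout_rate),                     # regularization against overfitting
        keras.layers.Dense(num_classes, activation="softmax"),  # one probability per genre
    ])
    model.compile(
        loss="categorical_crossentropy",  # matches one-hot encoded labels
        optimizer="adam",
        metrics=["accuracy"],
    )
    return model
```

A call such as `build_genre_classifier(num_words=1000, num_classes=10)` returns a compiled model that can be trained with `model.fit` on the matrix representation and the one-hot encoded classes.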
Finally, we plot the training progress and conclude that it is possible to train a classifier without too much effort.
I hope this short tutorial illustrated how to preprocess text in order to build a text-based deep-learning classifier. I am pretty sure that there are better parameters for tuning the model.
If you want to implement such a model in a production environment, I would recommend playing with the text-preprocessing parameters. The text tokenizer and the text_to_sequence functions hold a lot of untapped value.