Tuesday, February 28, 2012

Paper #5: Extraction and Integration of MovieLens and IMDb Data (part 2)

Last week I summarized this document; that post focused mainly on the IMDb data extraction and processing. This time I will talk about the rest of the document: the MovieLens data.

The paper describes in detail the process of integrating a MovieLens dataset with the IMDb contents. The process had two phases: data extraction and merging. First, the authors had to make sense of the numerous text files from both databases and prepare them so that they could be loaded into Microsoft Access. Then, after further processing and cleanup, the movie titles from the MovieLens ratings were matched to their IMDb counterparts. Finally, the result of this processing in Access was exported to an Oracle database.

The researchers got hold of two different sets of data from MovieLens. The main differences between them were their size and the fact that the smaller one linked its movies to the corresponding IMDb titles. Another important difference was that the big dataset stored the genres of each movie as concatenated words, whereas the smaller dataset used binary flags in preset fields to indicate whether a given genre applied to the movie or not. Both datasets were much cleaner than their IMDb counterparts, as they featured only key information. The only piece of information that really stands out is an occupation field for users; I believe it is most likely related to MovieLens' original research project.
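To make the two genre encodings concrete, here is a minimal Python sketch of converting between them. The genre list, the "|" separator, and the function names are my own assumptions for illustration; the paper's actual field layout may differ.

```python
# Hypothetical, shortened genre list; the real datasets define their own.
GENRES = ["Action", "Comedy", "Drama", "Thriller"]


def genres_to_flags(genre_string, sep="|"):
    """Turn a concatenated genre string into one 0/1 flag per known genre."""
    present = set(genre_string.split(sep)) if genre_string else set()
    return [1 if genre in present else 0 for genre in GENRES]


def flags_to_genres(flags, sep="|"):
    """Turn a list of 0/1 flags back into a concatenated genre string."""
    return sep.join(genre for genre, flag in zip(GENRES, flags) if flag)
```

With this list, genres_to_flags("Action|Drama") gives [1, 0, 1, 0], and flags_to_genres maps that list back to "Action|Drama". Putting both datasets in the same form is the kind of normalization you would need before comparing them.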

The other important part of the paper was the data integration process. Simple title matching between the two databases yielded only 79% matches. The researchers went through 8 steps to bring that number to 100%. The steps were:

Join movie by title
Match using the MovieLens small data set
Match extracting foreign title
Match ignoring running year
Match on the first 20 characters
Match on the first 10 characters
Manual look-up
Web look-up
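
The automatic steps above can be pictured as a cascade: each pass only looks at the titles the previous passes failed to match. Below is a rough Python sketch of that idea, not the authors' implementation (they worked in Access). It covers only the exact, year-ignoring, and prefix passes; the "Title (Year)" format, the function names, and the pass labels are assumptions of mine.

```python
import re

# Assumes titles end with a four-digit year in parentheses, e.g. "Heat (1995)".
YEAR_SUFFIX = re.compile(r"\s*\(\d{4}\)\s*$")


def strip_year(title):
    """Drop a trailing "(YYYY)" from a title, if present."""
    return YEAR_SUFFIX.sub("", title)


def match_titles(movielens_titles, imdb_titles):
    """Run progressively looser matching passes over the unmatched titles."""
    passes = [
        ("exact", lambda t: t),
        ("no_year", strip_year),
        ("prefix20", lambda t: t[:20]),
        ("prefix10", lambda t: t[:10]),
    ]
    matches = {}
    unmatched = list(movielens_titles)
    for name, key in passes:
        # Index IMDb titles by this pass's key; the first title seen wins.
        index = {}
        for title in imdb_titles:
            index.setdefault(key(title), title)
        still_unmatched = []
        for title in unmatched:
            hit = index.get(key(title))
            if hit is not None:
                matches[title] = (hit, name)
            else:
                still_unmatched.append(title)
        unmatched = still_unmatched
    return matches, unmatched
```

Whatever is left in unmatched after the last pass is what would go on to the manual and web look-ups. The prefix passes can easily produce false matches (many titles share their first 10 characters), so in practice their results would need to be reviewed.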

Finally, the researchers added a link to a website where they posted the starting and resulting datasets from their project. I tried accessing the website, but the data was no longer available.

http://apmd.prism.uvsq.fr/public/Publications/Rapports/Extraction%20and%20Integration%20of%20MovieLens%20and%20IMDb%20Data_Veronika%20Peralta.pdf
