Tuesday, February 21, 2012

Paper #4:
Extraction and Integration of MovieLens and IMDb Data

This was a rather extensive paper. Its purpose was to describe the process of extracting data from two different movie databases (IMDb and MovieLens) and integrating it into a single database. Extracting and merging this data introduced challenges such as cleaning up the information, eliminating duplicate entries, and matching records that refer to the same item in both sources.

In this post, I will focus on the data extraction from IMDb, since this is the part most closely related to our project.

The data extraction process began by loading the text data into a relational database. The data was then normalized and duplicates were eliminated.
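To make the load, normalize, and de-duplicate steps more concrete, here is a minimal sketch using Python's built-in sqlite3 module. This is not the authors' implementation: the table names (staging, movie), the columns, and the sample rows are all hypothetical, and the paper may well use a different database and different cleaning rules.

```python
import sqlite3

# Hypothetical raw rows (title, year) as they might come out of a text file.
# These values are made up for illustration; they are not from IMDb.
RAW_ROWS = [
    ("Movie A", "1999"),
    ("  Movie A ", "1999"),  # same movie with stray whitespace
    ("Movie B", "2001"),
    ("Movie B", "2001"),     # exact duplicate
]


def load_and_deduplicate(rows):
    """Load raw text rows into a staging table, then copy them into a
    cleaned table that rejects duplicates."""
    conn = sqlite3.connect(":memory:")

    # Step 1: load the text data as-is into a staging table.
    conn.execute("CREATE TABLE staging (title TEXT, year TEXT)")
    conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)

    # Step 2: the target table has typed columns and a uniqueness rule.
    conn.execute(
        "CREATE TABLE movie ("
        " id INTEGER PRIMARY KEY,"
        " title TEXT NOT NULL,"
        " year INTEGER,"
        " UNIQUE (title, year))"
    )

    # Step 3: normalize (trim whitespace, cast the year) while copying.
    # INSERT OR IGNORE skips rows that violate the UNIQUE constraint.
    conn.execute(
        "INSERT OR IGNORE INTO movie (title, year) "
        "SELECT TRIM(title), CAST(year AS INTEGER) FROM staging"
    )
    conn.commit()
    return conn


conn = load_and_deduplicate(RAW_ROWS)
movies = conn.execute("SELECT title, year FROM movie ORDER BY title").fetchall()
print(movies)
```

Keeping a separate staging table means the raw text stays untouched, so the cleaning rules can be changed and re-run without reloading the files.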

The IMDb data set is spread across 49 files (at the time the paper was written), each covering a different category of data. The authors were interested in only 23 of those 49 lists. Unfortunately, the lists do not all share the same format. The authors identified 4 main file formats:

a- Fixed-length columned
b- Tab-separated columned
c- Tagged
d- Hierarchical-structured
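As a rough illustration of why four formats need four different parsers, here is a small Python sketch with one function per format. The line layouts, column widths, and tag names used here are my own assumptions for the example; the real IMDb layouts are the ones described in the paper and will differ in detail.

```python
def parse_fixed_width(line, widths):
    """Fixed-length columns: cut the line at assumed column widths."""
    fields, pos = [], 0
    for width in widths:
        fields.append(line[pos:pos + width].strip())
        pos += width
    return fields


def parse_tab_separated(line):
    """Tab-separated columns: split on tabs and drop the empty fields
    produced by runs of consecutive tabs."""
    return [field.strip() for field in line.split("\t") if field.strip()]


def parse_tagged(lines):
    """Tagged: each line is assumed to look like 'TAG: value'."""
    record = {}
    for line in lines:
        tag, _, value = line.partition(":")
        record[tag.strip()] = value.strip()
    return record


def parse_hierarchical(lines):
    """Hierarchical: an unindented line is assumed to start a parent entry,
    and the indented lines that follow are its children."""
    tree = {}
    parent = None
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        if line[0].isspace():
            if parent is not None:
                tree[parent].append(line.strip())
        else:
            parent = line.strip()
            tree[parent] = []
    return tree


# Made-up sample inputs, one per format.
fixed = parse_fixed_width("Movie A   1999", [10, 4])
tabbed = parse_tab_separated("Movie A\t\t1999")
tagged = parse_tagged(["TITLE: Movie A", "YEAR: 1999"])
nested = parse_hierarchical(
    ["Person X", "  Movie A", "  Movie B", "Person Y", "  Movie C"]
)
print(fixed, tabbed, tagged, nested)
```

In practice each of the 23 lists would be routed to one of these parsers according to its format, which is the kind of mapping the paper tabulates.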

The document explains each format and describes how the data is arranged, with examples. Table 5 lists each file and the format it uses. Figure 20 shows a diagram of all the lists, their attributes, and the relations between those attributes; this is what the authors call the IMDb schema. Table 6 gives more detail about those attributes: their names, types, and lengths.

Source: http://apmd.prism.uvsq.fr/public/Publications/Rapports/Extraction%20and%20Integration%20of%20MovieLens%20and%20IMDb%20Data_Veronika%20Peralta.pdf
