Tuesday, February 28, 2012

Paper #5: Extraction and Integration of MovieLens and IMDb Data (part 2)

Last week I summarized this document; my post focused mainly on the IMDb data extraction and processing. This time I will cover the remainder of the document: the MovieLens data.

The paper describes in detail the process of integrating a MovieLens dataset with the IMDb contents. The process had two phases: data extraction and merging. First, the authors had to make sense of the numerous text files from both databases and prepare them so they could be loaded into Microsoft Access. Then, after further processing and cleanup, the movie titles from the MovieLens ratings were matched to their IMDb counterparts. Finally, the result of this processing in Access was exported to an Oracle database.

The researchers got hold of two different datasets from MovieLens. The main differences between them were size and the fact that the smaller one linked its movies to the corresponding IMDb titles. Another important difference was that the larger dataset listed each movie's genres as concatenated words, whereas the smaller dataset used binary flags in preset fields to indicate whether a given genre applied to the movie or not. Both datasets were much cleaner than their IMDb counterparts, as they contained only key information. The only piece of information that really stands out is an occupation field for users; I believe it is most likely related to MovieLens' original research project.
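
Just to make that difference concrete, here is a minimal sketch of what converting between the two genre representations could look like. The genre list and the "|" separator are my own assumptions for illustration; I am not reproducing the exact field layouts from the paper.

```python
# Sketch: turn a concatenated genre string (as in the larger dataset) into
# the binary genre flags used by the smaller dataset.
# The genre list and the "|" separator are assumptions for illustration.
GENRES = ["Action", "Adventure", "Animation", "Comedy", "Drama", "Thriller"]

def genres_to_flags(genre_string):
    """Map e.g. 'Comedy|Drama' to [0, 0, 0, 1, 1, 0]."""
    present = set(genre_string.split("|"))
    return [1 if g in present else 0 for g in GENRES]

print(genres_to_flags("Comedy|Drama"))  # -> [0, 0, 0, 1, 1, 0]
```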

The other important part of the paper was the data integration process. A simple title match between the two databases only yielded matches for 79% of the movies. The researchers went through 8 steps to bring that number to 100% (a rough sketch of this kind of cascading match appears after the list). The steps were:

a- Join movies by title
b- Match using the small MovieLens dataset
c- Match extracting the foreign title
d- Match ignoring the running year
e- Match on the first 20 characters
f- Match on the first 10 characters
g- Manual look-up
h- Web look-up
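
This is not the authors' code, but a minimal sketch of the cascading idea behind the automatic steps: try the strictest match first and only fall back to looser comparisons for the titles that are still unmatched. The normalization rules and the year-stripping pattern are my assumptions.

```python
# Sketch of a cascading title match: each pass only touches titles that
# earlier, stricter passes failed to resolve. The normalization rules and
# the year-stripping pattern are assumptions for illustration.
import re

def strip_year(title):
    """Lower-case the title and drop a trailing '(yyyy)' year."""
    return re.sub(r"\s*\(\d{4}\)\s*$", "", title).strip().lower()

def cascade_match(movielens_titles, imdb_titles):
    passes = [
        lambda t: t.lower(),           # exact title, case-insensitive
        strip_year,                    # ignore the year in parentheses
        lambda t: strip_year(t)[:20],  # first 20 characters
        lambda t: strip_year(t)[:10],  # first 10 characters
    ]
    matches, unmatched = {}, list(movielens_titles)
    for key in passes:
        index = {key(t): t for t in imdb_titles}
        still_unmatched = []
        for title in unmatched:
            hit = index.get(key(title))
            if hit is None:
                still_unmatched.append(title)
            else:
                matches[title] = hit
        unmatched = still_unmatched
    return matches, unmatched  # leftovers go on to manual/web look-up

ml_titles = ["Toy Story (1995)", "City of Lost Children, The (1995)"]
imdb_titles = ["Toy Story (1995)", "City of Lost Children, The"]
print(cascade_match(ml_titles, imdb_titles))
```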

Finally, the researchers included a link to a website where they posted the original and resulting datasets from their project. I tried accessing the website, but the data was no longer available.

http://apmd.prism.uvsq.fr/public/Publications/Rapports/Extraction%20and%20Integration%20of%20MovieLens%20and%20IMDb%20Data_Veronika%20Peralta.pdf

Tuesday, February 21, 2012

Paper #4:
Extraction and Integration of MovieLens and IMDb Data

This was a rather extensive paper. Its purpose was to describe the process of extracting data from two different movie databases (IMDb and MovieLens) and integrating them into a single database. Extracting and merging this data introduced challenges such as cleaning up the information, eliminating duplicate entries, and matching common data.

In this post, I will focus on the IMDb data extraction, since this is the part most relevant to our project.

The data extraction process began by loading the text data into a relational database. The data was then normalized and duplicates were eliminated.

The IMDb dataset is spread across 49 files (at the time the paper was written), each covering a different category. The authors were interested in only 23 of those 49 lists. Unfortunately, the lists come in different formats; the authors identified 4 main file formats:

a- Fixed-length columned
b- Tab-separated columned
c- Tagged
d- Hierarchical-structured

The document explains each format and describes how the data is arranged, including examples. Table 5 also lists each file and the format it uses. Figure 20 shows a diagram of all the different lists, their attributes, and the relations between those attributes; this is what the authors call the IMDb schema. Table 6 gives further details on those attributes, including their names, types, and lengths.
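
To make the extraction step more concrete, here is a minimal sketch of how one of the tab-separated lists could be parsed before being loaded into a relational database. The file layout, the header marker, and the two-column assumption are mine; the paper's actual scripts are not reproduced here.

```python
# Sketch: parse a tab-separated IMDb-style list into (key, value) rows
# ready to be bulk-loaded into a relational table. The two-column layout
# and the header marker are assumptions.
import csv

def parse_tab_separated(path, header_marker="="):
    rows, in_data = [], False
    with open(path, encoding="latin-1") as f:
        for line in f:
            line = line.rstrip("\n")
            if not in_data:
                # Skip the free-text preamble until the separator line.
                in_data = line.startswith(header_marker)
                continue
            if not line.strip():
                continue
            # Consecutive tabs are treated as a single separator.
            fields = [p for p in line.split("\t") if p]
            if len(fields) >= 2:
                rows.append((fields[0], fields[1]))
    return rows

def export_csv(rows, out_path):
    # Write the cleaned rows out, e.g. for a bulk import into Access/Oracle.
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
```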

Source: http://apmd.prism.uvsq.fr/public/Publications/Rapports/Extraction%20and%20Integration%20of%20MovieLens%20and%20IMDb%20Data_Veronika%20Peralta.pdf

Tuesday, February 14, 2012

Paper #3: Low Power Techniques for an Android Based Phone

The document opens with a definition of Android: a "software stack, which includes Linux kernel as underlying operating system, middleware software such as application frameworks and libraries and the some of the applications specific to mobile platform." The point of this definition is to highlight that the Linux kernel is responsible for OS-level management such as power control.

According to the author, there are 3 main power management categories to be considered:
a- Static Power Management: when the user is not interacting with the device
b- Active Power Management: power saving during short idle periods
c- Android Power Management: power management specific to Android

The first solution discussed is related to active power management. The researchers looked at reducing the sampling rate of the OS. To do this, they created a daemon process that calculates the system workload and performs frequency scaling depending on the requirements.
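
The paper does not include the daemon's source, so the following is only a rough sketch of the general idea on a Linux system: sample the load periodically, then pick a CPU frequency through the cpufreq sysfs interface. The paths, thresholds, and polling interval are assumptions, and it needs root plus the "userspace" governor.

```python
# Sketch of a frequency-scaling daemon: sample system load periodically
# and pick a cpufreq frequency accordingly. Paths, thresholds, and the
# polling interval are assumptions; requires root and the "userspace"
# cpufreq governor to be active.
import time

CPUFREQ = "/sys/devices/system/cpu/cpu0/cpufreq"

def read_load():
    """1-minute load average from /proc/loadavg."""
    with open("/proc/loadavg") as f:
        return float(f.read().split()[0])

def available_frequencies():
    with open(f"{CPUFREQ}/scaling_available_frequencies") as f:
        return sorted(int(x) for x in f.read().split())

def set_frequency(freq_khz):
    with open(f"{CPUFREQ}/scaling_setspeed", "w") as f:
        f.write(str(freq_khz))

def run(poll_seconds=2.0, high_load=1.0):
    freqs = available_frequencies()
    while True:
        load = read_load()
        # Crude policy: full speed under high load, lowest speed otherwise.
        set_frequency(freqs[-1] if load >= high_load else freqs[0])
        time.sleep(poll_seconds)

if __name__ == "__main__":
    run()
```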

For static power management, the authors modified the debug file system. These modifications targeted suspending and resuming the system, as well as data retention during long idle periods.

For the last part it is important to understand the wake_lock. This mechanism keeps Android from going into suspend mode and is held for as long as CPU usage is needed; for an application to run, it needs to acquire a wake_lock. The researchers recommend that Android developers look at the different types of wake_lock and use the lowest one possible for their application's context. This way the application avoids waiting for Android's automatic timeouts when it is not being used.


Tuesday, February 7, 2012


Paper #2: Preference-based User Rate Correction Process for Interactive Recommendation Systems

The objective of the paper is to improve the performance of rating systems by accounting for user error. This analysis matters because recommender systems are based on user ratings; by improving the ratings, the system would give more accurate recommendations.

User ratings are subject to two phenomena: the missing value problem (a user doesn't rate an item) and the noisy rating problem (a user makes a mistake when giving a specific rating). Rating noise comes from users changing their opinion over time and from users failing to express their actual preference. The paper focuses on the second problem: users making mistakes when rating.


In their approach they "have focused on a set of items on which the user has taken the action (i.e., rating), and designed an attribute selection scheme to represent user preferences." The example given in the document features a user who likes a particular movie director (the attribute): the user gave high ratings to most of that director's movies but a much lower rating to one of them. The system assumes that the user has a preference for that attribute and compensates for the lower score on that movie.
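
The paper's model is more elaborate than this, but a minimal sketch of the intuition might look as follows: compute the user's mean rating per attribute (here, the director), flag ratings that deviate strongly from that mean, and pull them toward it. The deviation threshold and blending weight are my assumptions, not the paper's parameters.

```python
# Sketch of attribute-based rating correction: if a user consistently
# rates a director's movies highly, a single outlying low rating is
# pulled toward the director-level mean. Threshold and blending weight
# are assumptions, not the paper's parameters.
from collections import defaultdict

def correct_ratings(ratings, deviation_threshold=1.5, blend=0.5):
    """ratings: list of (movie, director, score). Returns a corrected list."""
    by_director = defaultdict(list)
    for _, director, score in ratings:
        by_director[director].append(score)
    means = {d: sum(s) / len(s) for d, s in by_director.items()}

    corrected = []
    for movie, director, score in ratings:
        mean = means[director]
        if abs(score - mean) > deviation_threshold and len(by_director[director]) > 2:
            # Likely noisy: blend the observed score with the attribute mean.
            score = blend * score + (1 - blend) * mean
        corrected.append((movie, director, round(score, 2)))
    return corrected

ratings = [("A", "Nolan", 5), ("B", "Nolan", 5), ("C", "Nolan", 1), ("D", "Other", 3)]
print(correct_ratings(ratings))  # the outlier ("C", "Nolan", 1) is pulled upward
```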

Section 4 presents some of the techniques that can be used together with the User-Item-Attribute rating model the paper describes in Section 3. The first method calculates a dominant attribute and item, and from there finds discrepancies in the ratings. The second method determines a set of 'expert' users whom average users can look to for movie preferences. The third method is self-based correction, and the last is a hybrid of the previous two.



Thursday, February 2, 2012


Paper #1:
Putting Things in Context: Challenge on Context-Aware Movie Recommendation

The document discusses the Context-Aware Movie Recommendation challenge (CAMRa 2010). A total of 40 teams submitted papers, out of which 10 were chosen to be presented at the event. The challenge provided two datasets (a train set and a test set), and submissions were evaluated under 3 statistical measures to determine how good each recommendation was. There were 3 recommendation tracks to choose from: a- time of the year and special events, b- social relations of users, and c- the user's (implicit) mood.

The closing section of the paper briefly describes some of the results obtained by the ten chosen papers. Out of those, there are two papers my team would be particularly interested in; both relate to the social-relations-of-users track. The social approach in [10] was based on matrix factorization, a feasible approach if we choose to do the recommendation calculations on the server side. Finally, [11] presented a kNN approach based on linear combinations of similarities between users.
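
As a reference for later, here is a minimal sketch of the general user-based kNN idea (not the exact method from [11]): predict a user's rating for an unseen item as a similarity-weighted combination of the ratings given by the most similar users. Cosine similarity and k=2 are choices I made for illustration.

```python
# Sketch of user-based kNN recommendation: predict a user's rating for an
# item as a similarity-weighted average over the k most similar users.
# Cosine similarity and k are illustration choices, not those of [11].
from math import sqrt

def cosine(u, v):
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    return dot / (sqrt(sum(x * x for x in u.values())) *
                  sqrt(sum(x * x for x in v.values())))

def predict(ratings, user, item, k=2):
    """ratings: {user: {item: score}}. Predict `user`'s score for `item`."""
    neighbors = [(cosine(ratings[user], r), r[item])
                 for other, r in ratings.items()
                 if other != user and item in r]
    neighbors = sorted(neighbors, reverse=True)[:k]
    total_sim = sum(sim for sim, _ in neighbors)
    if total_sim == 0:
        return None
    return sum(sim * score for sim, score in neighbors) / total_sim

ratings = {
    "alice": {"Heat": 5, "Up": 3},
    "bob":   {"Heat": 4, "Up": 3, "Alien": 4},
    "carol": {"Heat": 1, "Alien": 2},
}
print(predict(ratings, "alice", "Alien"))
```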