Paper #8
Key design considerations for recommendation systems
The document aims to provide the reader with an overview of the study in the field of recommendation systems (RS). The order in which the information is presented is mean to help readers make the design choices for their own implementation of a RS.
According to the author, recommendation systems "provide personalized information through pull or push services to support a user in decision-making processes that involve great amounts of information."
The RS taxonomy.
The author made a note that this taxonomy is oriented for systems that aim to predict a unknown rating.
a) RS may offer a horizontal or vertical solution
Horizontal: explicitly designed and optimized for a particular application in a specific domain
Vertical: may offer recommendations in different domains
b) RS may be traditional or non-traditional
Traditional: support for only two dimensions (users and items); based on the availability of a single utiliry that is either explicitly assigned by a user or inferred from a transaction or other sources.
Non-traditional: assumptions made on traditional RS are relaxed and treated in a flexible manner.
c) Data used
Explicit ratings: utility is viewed as a user-given rating.
Transaction data: the revealed utility is obtained from implicit user input.
Both (hybrid): both can be used.
d) Approach used to solve the problem (method used to estimate ratings)
Content-based: uses product descriptions, known ratings of a particular user, recommend items similar in content to those items the user has liked in the past
Collaborative: recommend items based on known ratings of users with similar tastes and preferences.
Both: combine both methods.
e) Undefined trait
Heuristic-based RS: rely on a heuristic, built upon an assumption, without verifying correctness of the assumption or the optimality of the solutions.
Model-based: presume that utility data predictions can be obtained from a certain model
Fundamental design aspects
The authors classify decisions in two stages: building a foundation and sculpting a RS.
Building a foundation
The first task is to define personalization and RS goals (both goals should be consistent with each other). Then, the designers should analyze the costs of of accomplishing the previously defined goals. In addition, the designers need to define the type of personalization the system will have. Once the previous step is done, the developers should decide at what level the personalization will reside (user level, or system level).
Sculpting a RS
Data collection: which are the available data sources? what data can be retrieved? how is data going to be retrieved? what information is actually needed?
Building profiles (consumer profiles, item profiles, etc.)
How is information going to be represented? How is the initial profile going to be built?
Matchmaking
Which recommendation technique is going to be used? which information is available? which information should be used to make a particular kind of recommendation?
Delivery and Presentation
How are recommendations going to be presented to the user?
Measuring personalization impact
Which evaluation metrics are going to be considered when evaluating the RS? how to measure the chosen metrics?
Adjusting personalization strategy
Is the personalization strategy going to be continuously refined according to evaluation results? what are the causes for poor evaluation results? how can the different recommendation components be refined given the found causes for poor results?
Thursday, April 5, 2012
Monday, March 19, 2012
Paper # 6
SPARK: A Keyword Search Engine on Relational Databases
Relational databases are one of the most popular ways to store text data into a database system. Unfortunately, keyword search systems for these databases are not effective since they use query templates to " map keyword search to full-text matching within one or more attributes." IMDb uses this technique.
The main purpose of the document is to demo the system they developed to address this problem; it is named SPARK.
The SPARK Server contains three main components:
-Nonfree tuple set constructor: This component is built with the set of tuples from a Relation that contain at least one match to a query keyword.
-Candidate network generator: Uses relation algebra to process the chosen tuples from the previous step to generate a set of minimal candidate networks whose sizes are within a user-defined threshold.
-Query processor: It uses four algorithms (Sparse and global pipeline, skyline sweeping and block pipeline) to find the top-k results.
The third part of the paper is an explanatory demonstration of SPARK's features. The casual user mode is quite simple and doesn't require much explanation, but the professional search mode allows for much more customization within queries. For example, the user can choose to search an specific keyword only in a specific attribute of a chosen table. The example they give is a search of the name cage, under the table actors and under the attribute actor.name. A user can also include parameters for the three main SPARK components; one such example would be specifying the candidate network threshold.
SPARK: A Keyword Search Engine on Relational Databases
Relational databases are one of the most popular ways to store text data into a database system. Unfortunately, keyword search systems for these databases are not effective since they use query templates to " map keyword search to full-text matching within one or more attributes." IMDb uses this technique.
The main purpose of the document is to demo the system they developed to address this problem; it is named SPARK.
The SPARK Server contains three main components:
-Nonfree tuple set constructor: This component is built with the set of tuples from a Relation that contain at least one match to a query keyword.
-Candidate network generator: Uses relation algebra to process the chosen tuples from the previous step to generate a set of minimal candidate networks whose sizes are within a user-defined threshold.
-Query processor: It uses four algorithms (Sparse and global pipeline, skyline sweeping and block pipeline) to find the top-k results.
The third part of the paper is an explanatory demonstration of SPARK's features. The casual user mode is quite simple and doesn't require much explanation, but the professional search mode allows for much more customization within queries. For example, the user can choose to search an specific keyword only in a specific attribute of a chosen table. The example they give is a search of the name cage, under the table actors and under the attribute actor.name. A user can also include parameters for the three main SPARK components; one such example would be specifying the candidate network threshold.
Tuesday, February 28, 2012
Paper #5: Extraction and Integration of MovieLens and IMDb Data (part 2)
Last week I made a summary of this document; my post was mainly focusing on the IMDb data extraction and processing. This time I will talk about the remainder of the document: the MovieLens data.
The paper describes in a detailed way the process to integrate a MovieLens dataset with the IMDb contents. The process had two phases, data extraction and merging. First, the authors of the paper had to make sense out of the numerous text files from both databases and prepare the files in such a way that they could be loaded into Microsoft Access. Then, after further processing and cleaning up, the movie titles from MovieLens ratings were matched to their IMDb counterparts. Finally, the result from this processing on Access was exported to an Oracle database.
The researchers got a hold of two different sets of data from MovieLens. The main differences between them were size, and the fact that the smaller one linked its movies to the corresponding IMDb titles. Another important difference was that the big dataset had the genres for the movies as concatenated words, whereas the smaller dataset had binary numbers in preset fields to determine whether a given genre was "present" on the movie or not. Both datasets where much cleaner than their IMDb counterparts, as they featured only key information. The only part of the information that really stands up, is an occupation field for users; I believe it is most likely related to MovieLens' original research project.
In turn, the other important part of the paper was the data integration process. A simple title matching of movies between databases only yielded 79% matches. The researchers went through 8 steps to bring that number to 100%. The steps were:
Join movie by title
Match using the MovieLens small data set
Match extracting foreign title
Match ignoring running year
Matching of 20 first characters
Matching of 10 first characters
Manual look-up
Web look-up
Finally, the researchers added a link to a website where they posted the starting and resulting datasets from their project. I tried accessing the website, but the data was not available anymore.
http://apmd.prism.uvsq.fr/public/Publications/Rapports/Extraction%20and%20Integration%20of%20MovieLens%20and%20IMDb%20Data_Veronika%20Peralta.pdf
Last week I made a summary of this document; my post was mainly focusing on the IMDb data extraction and processing. This time I will talk about the remainder of the document: the MovieLens data.
The paper describes in a detailed way the process to integrate a MovieLens dataset with the IMDb contents. The process had two phases, data extraction and merging. First, the authors of the paper had to make sense out of the numerous text files from both databases and prepare the files in such a way that they could be loaded into Microsoft Access. Then, after further processing and cleaning up, the movie titles from MovieLens ratings were matched to their IMDb counterparts. Finally, the result from this processing on Access was exported to an Oracle database.
The researchers got a hold of two different sets of data from MovieLens. The main differences between them were size, and the fact that the smaller one linked its movies to the corresponding IMDb titles. Another important difference was that the big dataset had the genres for the movies as concatenated words, whereas the smaller dataset had binary numbers in preset fields to determine whether a given genre was "present" on the movie or not. Both datasets where much cleaner than their IMDb counterparts, as they featured only key information. The only part of the information that really stands up, is an occupation field for users; I believe it is most likely related to MovieLens' original research project.
In turn, the other important part of the paper was the data integration process. A simple title matching of movies between databases only yielded 79% matches. The researchers went through 8 steps to bring that number to 100%. The steps were:
Join movie by title
Match using the MovieLens small data set
Match extracting foreign title
Match ignoring running year
Matching of 20 first characters
Matching of 10 first characters
Manual look-up
Web look-up
Finally, the researchers added a link to a website where they posted the starting and resulting datasets from their project. I tried accessing the website, but the data was not available anymore.
http://apmd.prism.uvsq.fr/public/Publications/Rapports/Extraction%20and%20Integration%20of%20MovieLens%20and%20IMDb%20Data_Veronika%20Peralta.pdf
Tuesday, February 21, 2012
Paper #4:
Extraction and Integration of MovieLens and IMDb Data
This was a rather extensive paper. The purpose of the document was to describe the process of extracting data from two different movie databases (IMDb and MovieLens) and integrate to form a single database. Extracting and merging this data introduced challenges such as cleaning up the information, elimination of double entries, matching common data, etc.
In this blog, I will focus on the data extraction from IMDb, since this is the part most related to our project.
The data extraction process began by loading the text data to a relational data base. Then the data was normalized and duplicates were eliminated.
The IMDb data set is spread throughout 49 files (at the time this paper was written) each including different categories. The authors were interested in only 23 lists out of those 49. Unfortunately, the lists may contain different formats. The authors identified 4 main file formats:
a- Fixed-length columned
b- Tab-separated columned
c- Tagged
d- Hierarchical-structured
The documents explains each format and describes the arrangement of the data including examples. Table 5 also lists each file and the format it uses for the data. Figure 20 shows a diagram with all the different lists, their attributes and relations between those attributes. This is what they call the IMDb schema. Table 6 includes more explanatory data regarding those attributes, their types, lengths and names.
Source: http://apmd.prism.uvsq.fr/public/Publications/Rapports/Extraction%20and%20Integration%20of%20MovieLens%20and%20IMDb%20Data_Veronika%20Peralta.pdf
Extraction and Integration of MovieLens and IMDb Data
This was a rather extensive paper. The purpose of the document was to describe the process of extracting data from two different movie databases (IMDb and MovieLens) and integrate to form a single database. Extracting and merging this data introduced challenges such as cleaning up the information, elimination of double entries, matching common data, etc.
In this blog, I will focus on the data extraction from IMDb, since this is the part most related to our project.
The data extraction process began by loading the text data to a relational data base. Then the data was normalized and duplicates were eliminated.
The IMDb data set is spread throughout 49 files (at the time this paper was written) each including different categories. The authors were interested in only 23 lists out of those 49. Unfortunately, the lists may contain different formats. The authors identified 4 main file formats:
a- Fixed-length columned
b- Tab-separated columned
c- Tagged
d- Hierarchical-structured
The documents explains each format and describes the arrangement of the data including examples. Table 5 also lists each file and the format it uses for the data. Figure 20 shows a diagram with all the different lists, their attributes and relations between those attributes. This is what they call the IMDb schema. Table 6 includes more explanatory data regarding those attributes, their types, lengths and names.
Source: http://apmd.prism.uvsq.fr/public/Publications/Rapports/Extraction%20and%20Integration%20of%20MovieLens%20and%20IMDb%20Data_Veronika%20Peralta.pdf
Tuesday, February 14, 2012
Paper #3: Low Power Techniques for an Android Based Phone
The document opens up with a definition of Android: a "software stack, which includes Linux kernel as underlying operating system, middleware software such as application frameworks and libraries and the some of the applications specific to mobile platform." The importance of this definition is to highlight the fact that the linux kernel is responsible for OS related management such as power control.
According to the author, there are 3 main power management categories to be considered:
a- Static Power Management: when the user is not interacting with the device
b- Active Power Management: power saving during short idle periods
c- Android Power Management: power management specific for android
The first solution discussed is related to active power management. The researchers looked at reducing the sampling rate of the OS. To do this they created a daemon process that would calculate system workload and perform frequency scaling depending on the requirements.
To work around the static power management the authors of the document decided to include modifications to the debug file system. These modifications targeted suspending and resuming the system, as well as data retention during long idle periods.
For the last part it is important to understand the wake_lock. This mechanism keeps Andriod from going into suspend mode, and is kept running for as long as CPU usage is needed. For an application to run, it needs to acquire the wake_lock. The researchers recommend Android developers to look at the different types of wake_lock and use the lowest one possible for the context of their application. This way the application would avoid waiting for Android automatic timeouts when the application is not being used.
The document opens up with a definition of Android: a "software stack, which includes Linux kernel as underlying operating system, middleware software such as application frameworks and libraries and the some of the applications specific to mobile platform." The importance of this definition is to highlight the fact that the linux kernel is responsible for OS related management such as power control.
According to the author, there are 3 main power management categories to be considered:
a- Static Power Management: when the user is not interacting with the device
b- Active Power Management: power saving during short idle periods
c- Android Power Management: power management specific for android
The first solution discussed is related to active power management. The researchers looked at reducing the sampling rate of the OS. To do this they created a daemon process that would calculate system workload and perform frequency scaling depending on the requirements.
To work around the static power management the authors of the document decided to include modifications to the debug file system. These modifications targeted suspending and resuming the system, as well as data retention during long idle periods.
For the last part it is important to understand the wake_lock. This mechanism keeps Andriod from going into suspend mode, and is kept running for as long as CPU usage is needed. For an application to run, it needs to acquire the wake_lock. The researchers recommend Android developers to look at the different types of wake_lock and use the lowest one possible for the context of their application. This way the application would avoid waiting for Android automatic timeouts when the application is not being used.
Tuesday, February 7, 2012
Paper #2: Preference-based User Rate Correction Process for Interactive Recommendation Systems
The objective of the paper is to improve the performance of rating systems to account for user error. The importance of this analysis is due to recommeder systems being based on user ratings; by improving the user ratings the system would give more accurate recommendations.
User ratings are subject to two phenomenon: the missing value problem (user doesn't rate an item), and the noisy rating problem (a user makes a mistake giving a specific rating). Rating noise accounts for: user changing their opinion over time and the user can fail to express their personal preference. The paper focuses on the second problem, users making mistakes when rating.
In their approach they "have focused on a set of items on which the user has taken the action (i.e., rating), and designed an attribute selection scheme to represent user preferences." The example given in the document features a user who likes a movie director (attribute), however the user gave a high rating to most of that director's movies but gave a much lower rating to a particular movie made by the same director. The system assumes that the user has a preference over that attribute and compensates for the lower score in the aforementioned movie.
Section 4 features some of the different techniques which can be used together with the User-Item-Attribute-rating model the paper describes on section 3. The first method calculates a dominant attribute and item, and from there find discrepancies in the ratings. The second method determines a set of 'expert' users, to which the average users can look up to in terms of movie preferences. The third method is self based correction, and the last is a hybrid between the previous two.
Thursday, February 2, 2012
Paper #1:
Putting Things in Context: Challenge on Context-Aware Movie Recommendation
The document discusses the context-aware movie recommendation challenge (CAMRa 2010). A total of 40 team submitted papers out of which 10 were chosen to be presented at the event. The rules of the challege provided the following: two data sets (train set and test set) to be evaluated under 3 statistical rules in order to determine how good each recommendation was. There were 3 recommendation tracks to choose from: a- time of the year and special events, b- social relations of users, and c- user's (implicit) mood.
The closing section of the paper briefly describes some of the results obtained by the ten chose papers. Out of those, there are two particular papers my team would be interested on; they are related to the social relation of users track. The social approach on [10] was based on matrix factorization. A feasable approach if we choose to do the recommendation calculations on server side. Finally, [11] presented a kNN approach based on linear combinations of similarity between users.
Subscribe to:
Posts (Atom)