8/01/2013

Mahout - A Non-Hadoop Implementation of Collaborative Filtering

It's been a week of some serious learning. I stumbled upon a technique called 'Collaborative Filtering' and found it very interesting. Though the concept has been around for ages, new implementations of it keep appearing. With a little googling I managed to run a standalone (non-Hadoop) program using Apache Mahout. For those to whom it is new, read on.

Collaborative Filtering is a technique for predicting a user's preferences given the preferences of others in the group. There are two main approaches - user-based and item-based. In user-based filtering, the objective is to look for users similar to the given user, then use the ratings from these similar users to predict a preference for the given user. In item-based recommendation, similarities between pairs of items are computed, and preferences are then predicted for the given user using a combination of the user's current item preferences and the similarity matrix. The example shown below uses user-based recommendation (think of something like Amazon's recommendations).
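To make the user-based idea concrete, here is a minimal sketch (plain Java, not Mahout code) of the core prediction step: the estimated rating for an item is a similarity-weighted average of the neighbours' ratings for that item. The class name and the sample similarity values are made up for illustration.

```java
public class UserBasedPrediction {

    // Similarity-weighted average: sum(sim_i * rating_i) / sum(|sim_i|).
    // Each index i is one neighbour who has rated the candidate item.
    static double predict(double[] neighbourSimilarities, double[] neighbourRatings) {
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (int i = 0; i < neighbourRatings.length; i++) {
            weightedSum += neighbourSimilarities[i] * neighbourRatings[i];
            totalWeight += Math.abs(neighbourSimilarities[i]);
        }
        return weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        // Two neighbours rated the candidate item 4.5 and 4.0; the more
        // similar neighbour (similarity 0.9) pulls the estimate toward 4.5.
        double estimate = predict(new double[] {0.9, 0.4}, new double[] {4.5, 4.0});
        System.out.println(estimate);
    }
}
```

This is only the final aggregation step; finding the neighbours and their similarities is what the DataModel, UserSimilarity and UserNeighborhood classes described below take care of.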

Inputs...
The input to such a system is either a 3-tuple of (UserID, ItemID, Rating) or a 2-tuple of (UserID, ItemID). In Mahout, this input is represented by a DataModel class, which can be created from a file of these tuples (either CSV or TSV), one per line. Other ways to populate the DataModel also exist - for example, programmatically or from a database.
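Just to illustrate the tuple format (this is not Mahout's FileDataModel, which handles the parsing for you), here is a minimal sketch of reading such lines into a per-user preference map. The class name is made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class TupleFormat {

    // Parse one "userID,itemID,rating" line into a per-user preference map.
    static void addLine(String line, Map<Long, Map<Long, Float>> prefs) {
        String[] parts = line.split(",");
        long userId = Long.parseLong(parts[0].trim());
        long itemId = Long.parseLong(parts[1].trim());
        // A 2-tuple (no rating) is a "boolean" preference; 1.0 is a placeholder value.
        float rating = parts.length > 2 ? Float.parseFloat(parts[2].trim()) : 1.0f;
        prefs.computeIfAbsent(userId, k -> new HashMap<>()).put(itemId, rating);
    }

    public static void main(String[] args) {
        Map<Long, Map<Long, Float>> prefs = new HashMap<>();
        addLine("1,101,5.0", prefs);
        addLine("1,102,3.0", prefs);
        System.out.println(prefs.get(1L).get(101L)); // prints 5.0
    }
}
```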

Under the hood...
A user-based Recommender is built out of a DataModel, a UserNeighborhood and a UserSimilarity. A UserNeighborhood defines the concept of a group of users similar to the current user - the two available implementations are Nearest and Threshold. The nearest neighborhood consists of the nearest N users for the given user, where nearness is defined by the similarity implementation. The threshold neighborhood consists of users who are at least as similar to the given user as defined by the similarity implementation. The UserSimilarity defines the similarity between two users - implementations include EuclideanDistance, Pearson Correlation, Uncentered Cosine, Caching, City Block, Dummy, Generic User, Log Likelihood, Spearman Correlation and Tanimoto Coefficient similarity.
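As a rough illustration of what a UserSimilarity computes, here is a hand-rolled Pearson correlation over the items two users have both rated. This is just a sketch of the underlying formula, not Mahout's PearsonCorrelationSimilarity; the class name is made up.

```java
public class PearsonSketch {

    // Pearson correlation between two users' ratings of the SAME items
    // (co-rated items only). Returns a value in [-1, 1]; higher means more similar.
    static double pearson(double[] a, double[] b) {
        int n = a.length;
        double meanA = 0.0, meanB = 0.0;
        for (int i = 0; i < n; i++) {
            meanA += a[i];
            meanB += b[i];
        }
        meanA /= n;
        meanB /= n;
        double cov = 0.0, varA = 0.0, varB = 0.0;
        for (int i = 0; i < n; i++) {
            double da = a[i] - meanA, db = b[i] - meanB;
            cov += da * db;
            varA += da * da;
            varB += db * db;
        }
        return cov / Math.sqrt(varA * varB);
    }

    public static void main(String[] args) {
        // Users 1 and 5 from the sample data below co-rated items 101, 102, 103.
        double sim = pearson(new double[] {5.0, 3.0, 2.5}, new double[] {4.0, 3.0, 2.0});
        System.out.println(sim); // roughly 0.94 -- a strongly similar pair
    }
}
```

A NearestNUserNeighborhood built on this similarity would then pick the N users with the highest such scores.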

Input File:
There are five users in the (user,item,rating) matrix below. The sample program shown below will recommend items for a given user based on this list.

1,101,5.0
1,102,3.0
1,103,2.5

2,101,2.0
2,102,2.5
2,103,5.0
2,104,2.0

3,101,2.5
3,104,4.0
3,105,4.5
3,107,5.0

4,101,5.0
4,103,3.0
4,104,4.5
4,106,4.0

5,101,4.0
5,102,3.0
5,103,2.0
5,104,4.0
5,105,3.5
5,106,4.0

Sample Code - Java Implementation:
package com.cf;

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TestClass {

    /**
     * @param args
     */
    public static void main(String[] args) throws Exception {

        File file = new File("C:\\Users\\Kaveesh\\workspace\\Collaborative_Filtering\\ds\\test.txt");
        DataModel model = new FileDataModel(file);

        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2,
                similarity, model);

        Recommender recommender = new GenericUserBasedRecommender(model,
                neighborhood, similarity);

        // Now we can get 2 recommendations for user ID 1
        List<RecommendedItem> recommendations = recommender.recommend(1, 2);

        for (RecommendedItem recommendation : recommendations) {
            System.out.println(recommendation);
        }

    }

}

Output:
RecommendedItem[item:104, value:4.257081]
RecommendedItem[item:106, value:4.0]


Jars Needed:
 mahout-core-0.8-job.jar
 mahout-core-0.8.jar
 mahout-integration-0.8.jar
 mahout-math-0.8.jar
 slf4j-api-1.7.5.jar
 slf4j-jcl-1.7.5.jar


Ah, sounds simple? Not really. Now the bigger picture: think of Amazon/Netflix running their code on Mahout and Hadoop. Personalized recommendations are produced for each online user by scanning millions of users and millions of items. Needless to say, the results of the recommendations are served in less than a second.

So my next step is to create a database in MySQL and integrate it with Mahout and Hadoop using the MySQL Applier for Hadoop.




