Project Description

Recommendation and prediction algorithms are a driving force behind web commerce. Online stores, advertising, and even search rely heavily on these algorithms to present relevant products to potential consumers. Movie ratings and recommendations are so important that Netflix™ (an on-line movie rental company) is offering a one million dollar prize to anyone who can improve upon their rating system. We are leveraging the power of new, large-scale computing systems and programming technologies to try to win the Netflix Prize. By building a fully distributed data mining application using MapReduce, and running it on Amazon's EC2 through RightScale's server management environment, we are exploring how these emerging technologies can be used in concert to solve challenging new data-intensive problems at scale. And perhaps win a million dollars, divided five ways of course.

After registering for the Netflix Prize, teams are granted access to a dataset containing over 100 million movie ratings from 480,000 Netflix users. Netflix has withheld a subset of the ratings, and contestants must predict those withheld values as closely as possible in order to win the prize. There are currently over 36,000 teams across the globe working to attain this prize (see Leaderboard). The leading teams have used very complex combinations of statistical algorithms running for weeks on single machines. Unlike those teams, our approach is to use simple statistical analysis over an enhanced dataset in order to make predictions.
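To make "as closely as possible" concrete: the Netflix Prize scores submissions by the root mean squared error (RMSE) between predicted and withheld ratings, so a lower score is better. The sketch below shows that calculation on a handful of made-up ratings. The function name `rmse` and the sample values are ours for illustration and are not part of any contest tooling.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("need at least one rating")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical data: ratings on Netflix's 1-5 star scale.
withheld = [4, 3, 5, 1]
guesses = [3.5, 3.0, 4.0, 2.0]
score = rmse(guesses, withheld)
```

Because the errors are squared before averaging, one badly wrong prediction hurts the score more than several slightly wrong ones.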

In order to increase both the quantity and quality of our data set, we have collected information from Flixster, a social networking website that focuses on movies. Flixster provides a wealth of high quality data, with over 53 million users. Since Flixster users are likely avid movie fans, we expect their ratings to be more informed than those of the users in the provided data set.

Given the large amount of data and the computations required to create predictions, our project is simply too large for one machine. With the aid of RightScale and Google, we are able to fully distribute work between hundreds of machines. RightScale provides an easy interface for managing a cluster of Amazon EC2 machines. In order to utilize such a large number of machines, we adopted the MapReduce programming paradigm, which has been proven to scale extremely efficiently by industry-leading companies like Google and Yahoo.
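As a rough illustration of the MapReduce paradigm, the sketch below computes each movie's average rating in three stages: map, shuffle (group by key), and reduce. It is a single-process simulation in plain Python, not our actual job and not the Hadoop API. The function names and the "user,movie,rating" record layout are assumptions made for the example.

```python
from collections import defaultdict

def map_rating(record):
    """Map step: emit (movie_id, rating) for one 'user,movie,rating' line."""
    user_id, movie_id, rating = record.split(",")
    yield movie_id, float(rating)

def reduce_ratings(movie_id, ratings):
    """Reduce step: collapse all ratings for one movie into its mean."""
    return movie_id, sum(ratings) / len(ratings)

def run_job(records):
    """Run map, shuffle (group by key), and reduce in a single process."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_rating(record):
            grouped[key].append(value)
    return dict(reduce_ratings(k, v) for k, v in grouped.items())

# Hypothetical input lines; the real dataset's layout may differ.
sample = ["u1,m1,4", "u2,m1,2", "u1,m2,5"]
averages = run_job(sample)
```

Each record is mapped independently and each movie is reduced independently, which is what lets a real framework spread the same two functions across many machines and handle the grouping step itself.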

We believe this model will enable better recommendations to be produced not only for movies, but also for music, shopping, advertising, and many other areas.


The Team

Name | Email | Major | Astrological Sign
Jonathan Kupferman | jkupferman@umail | Computer Science | Cancer
Jeff Silverman | jdsilverman@umail | Computer Science | Virgo
John Morse | jjmorse@umail | Computer Science | Scorpio
Frank Jones | fjones@umail | Computer Science | Scorpio
Jesse Wang | white_eat@hotmail | Computer Engineering | Gemini

Note: @umail = @umail.ucsb.edu


Project Documentation

Vision Statement - 1/18/2008
Software Requirement Specification - 2/6/2008
Architecture Specification - 2/20/2008
Detailed Design Specification - 3/5/2008


We Won!

After months of hard work, many late nights, and quite a few passionate debates, our project has come to a conclusion. June 5th was the UCSB CS189 "Capstone" project presentation day, where friends, family, and many companies and individuals from industry were invited to watch us present our projects. Overall the presentations from the CS189 groups were nothing short of stunning; the technologies displayed and the quality of all of the presentations were extremely high. With that being said, we were extremely pleased and honored to be awarded the "Best CS189 Project" prize. It's documented here, and we even got a nice blurb on the CS department site here, where our project is summed up nicely. Afterwards there was a poster session where the audience could talk to us and ask more in-depth questions about our project. We were extremely pleased to be asked quite a few questions by people who were very interested in learning about our project. It was great to be able to spend more time explaining ideas and just talking with people. The CS department even took some photos of it; see them here. Thank you everyone!

Links

Netflix Prize
MapReduce
Hadoop
HBase
Flixster
Amazon EC2

Project Poster - 5/30/08
Project Presentation Slides - 6/5/08


Thank you to our generous sponsors

RightScale Google
A special thank you to Martin Rhoads from RightScale and Mohamed Hafez from Google.
Thanks for all of your help and guidance.