CS294-1 Spring 2012, Behavioral Data Mining
Section: 1-2pm Weds, in 310 Soda
Lectures: TuTh 9:30-11am, in 310 Soda
CCN: 26899
This is a course about large-scale data mining of behavioral data, that is, data generated by people. Examples include shopping (Amazon, eBay), messaging (Facebook, Twitter, LiveJournal), tagging (Flickr, Digg), repositories (Wikipedia, Stack Overflow), and recommenders (Netflix, Amazon, etc.). We focus on event data: discrete actions by people. As a result, the largest datasets are in the terabyte range, and they share some common structural features. We will not cover sensor data, medical images, genetic data, etc., which can be many orders of magnitude larger. The course will be hands-on, with several assignments on large datasets and a final project. It covers a number of data mining algorithms (mostly from machine learning) and how to implement them at scale. The tools will include Hadoop, MarkLogic 5 (text and XML processing), and a matrix algebra toolkit (Matlab or Scalala).
- Teaching Assistant: Keng-hao Chang: 360 Hearst Mining Bldg. Office hour: T 2-3
Prerequisites
This course will have several programming assignments, and fluency with a high-level language (ideally Java) is the main requirement. Most of the algorithms rely on linear algebra and will be expressed in a matrix-based language like Matlab or Scalala, so you should have preparation equivalent to Math 54. Undergraduate knowledge of AI (CS188) or statistics will make it easier to make sense of the algorithms, but it is not a prerequisite for now. Students will work on assignments in two-person teams with the goal of maximizing performance, so undergraduate knowledge of systems is a plus. Teams with complementary strengths in math/stat and systems are likely to do very well.
Tentative Outline:
01/17/2012: Introduction [Slides], example problems
01/19/2012: Basic stats - naive Bayes classifier [Slides]
- Assignment due by Feb 7 (deadline extended from Feb 2): Programming Assignment 1; Assignment 1 deliverables. Note: set up Scalala and ScalaNLP.
01/24/2012: Performance measurement [Slides], precision/recall, ROC plots, significance tests
01/26/2012: Dealing with text [Slides], n-grams, smoothing, indexing
01/31/2012: XML search and analytics [Slides], Guest lecture, Ron Avnur, CTO MarkLogic
02/02/2012: Map Reduce [Slides]
02/07/2012: Regression [Slides] - linear and logistic
- Assignment due by Feb 24 (deadline extended): [Assignment 2]
02/09/2012: Machine biology [Slides] - b/w hierarchy, caching, disks
02/14/2012: About people [Slides] - power laws, personality factors, social network structure, sentiment
02/16/2012: SVM Classifiers [Slides]
02/21/2012: Spark [Slides] - Guest lecture, Matei Zaharia, UCB
02/23/2012: Computational advertising and recommendation [Slides] - Guest lecture, Ye Chen, Microsoft
- Assignment due by March 13: Project Proposal, List of Projects
02/28/2012: Query languages and architectures [Slides] - Hive, Pig, Sawzall
03/01/2012: Excavating [Slides] - crawling, web services
03/06/2012: Project proposals (Schedule)
03/08/2012: Project proposals
03/13/2012: Visualization I [Slides]
03/15/2012: Visualization II and project proposal wrap-up [Slides]
03/20/2012: Dimension Reduction [Slides] - SVD, LSI, random projection
03/22/2012: Clustering 1 [Slides] - k-means, spectral
- Assignment due by April 19: Assignment 3
03/27-29/2012: Spring Break
04/03/2012: Causal Inference I [Slides]
04/05/2012: Causal Inference II [Slides]
04/10/2012: Clustering 2 [Slides] - Naive Bayes, LDA, GaP
04/12/2012: Prediction [Slides] - kNN, kd-trees, kNC
04/17/2012: Sequence models I [Slides] - guest lecture by David Hall
04/19/2012: Sequence models II - guest lecture by David Hall
This week: progress reviews for all projects
04/24/2012: Network algorithms [Slides] - HITS and PageRank
04/26/2012: Diffusion and meme tracking [Slides]
05/01/2012: 1-3pm, Project Posters, Soda Hall 5th floor Atrium
05/03/2012: 11am-2pm, Project Presentations, 306 Soda Hall
05/07/2012: Final Report Due