CS294-1 Spring 2012, Behavioral Data Mining

Section: 1-2pm Weds, in 310 Soda

Lectures TuTh 9:30-11am in 310 Soda

CCN: 26899

This is a course about large-scale data mining of behavioral data, that is, data generated by people. Examples include shopping (Amazon, eBay), messaging (Facebook, Twitter, LiveJournal), tagging (Flickr, Digg), repositories (Wikipedia, Stack Overflow), and recommenders (Netflix, Amazon, etc.). We focus on event data (discrete actions by people), which means the largest datasets are in the terabyte range and share some common structural features. We will not cover sensor data, medical images, genetic data, etc., which can be many orders of magnitude larger. The course will be hands-on and will include several assignments on large datasets and a final project. It covers a number of data mining algorithms (mostly from machine learning) and how to implement them at scale. The tools will include Hadoop, MarkLogic 5 (text and XML processing), and a matrix algebra toolkit (Matlab or Scalala).

  • Teaching Assistant: Keng-hao Chang: 360 Hearst Mining Bldg. Office hour: T 2-3

Tips and Tricks

Data Dives


This course will have several programming assignments, and fluency with a high-level language (ideally Java) is the main requirement. Most of the algorithms rely on linear algebra and will be expressed in a matrix-based language like Matlab or Scalala, so you should have preparation equivalent to Math 54. Undergraduate knowledge of AI (CS188) or statistics will make it easier to make sense of the algorithms, but is not currently a prerequisite. Students will work on assignments in two-person teams with the goal of maximizing performance, so undergraduate knowledge of systems is a plus. Teams with complementary strengths in math/stat and systems are likely to do very well.

Tentative Outline:

01/17/2012: Introduction [Slides ] Introduction, example problems

01/19/2012: Basic stats - naive Bayes classifier [Slides ]

Assignment due by Feb 7 (deadline extended from Feb 2): Programming Assignment 1, Assignment1 deliverables. Note: Set up Scalala and ScalaNLP

01/24/2012: Performance measurement [Slides ], precision/recall, ROC plots, significance tests

01/26/2012: Dealing with text [Slides ], n-grams, smoothing, indexing

01/31/2012: XML search and analytics [Slides ] - guest lecture, Ron Avnur, CTO, MarkLogic

02/02/2012: Map Reduce [Slides ]

02/07/2012: Regression [Slides ] - linear and logistic

Assignment (Extension) due by Feb 24: [Assignment 2 ]

02/09/2012: Machine biology [Slides ] - bandwidth hierarchy, caching, disks

02/14/2012: About people [Slides ] - power laws, personality factors, social network structure, sentiment

02/16/2012: SVM Classifiers [Slides ]

02/21/2012: Spark [Slides ] - Guest lecture, Matei Zaharia, UCB

02/23/2012: Computational advertising and recommendation [Slides ] - guest lecture, Ye Chen, Microsoft

Assignment due by March 13: Project Proposal, List of Projects

02/28/2012: Query languages and architectures [Slides ] - Hive, Pig, Sawzall

03/01/2012: Excavating [Slides ] - crawling, web services

03/06/2012: Project proposals (Schedule)

03/08/2012: Project proposals

03/13/2012: Visualization I [Slides ]

03/15/2012: Visualization II and project proposal wrap-up [Slides ]

03/20/2012: Dimension Reduction [Slides ] - SVD, LSI, random projection

03/22/2012: Clustering 1 [Slides ] - k-means, spectral

Assignment due by April 19: Assignment 3

03/27-29/2012: Spring Break

04/03/2012: Causal Inference I [Slides ]

04/05/2012: Causal Inference II [Slides ]

04/10/2012: Clustering 2 - Naive Bayes, LDA, GaP [Slides ]

04/12/2012: Prediction - kNN, kd-trees, kNC [Slides ]

04/17/2012: Sequence models I - guest lecture by David Hall [Slides ]

04/19/2012: Sequence models II - guest lecture by David Hall

This week: progress reviews for all projects

04/24/2012: Network algorithms - HITS and PageRank [Slides ]

04/26/2012: Diffusion and meme tracking [Slides ]

05/01/2012: 1-3pm, Project Posters, Soda Hall 5th floor Atrium

05/03/2012: 11am-2pm, Project Presentations, 306 Soda Hall

05/07/2012: Final Report Due
