CS294-1 Spring 2013, Behavioral Data Mining

Sections: W 2-3, and 3-4pm, 310 Soda

Lectures MW 9-10:30am in 306 Soda

CCN: 26922

This is a course about large-scale mining of behavioral data, that is, data generated by people. Examples include the web itself, social media (Facebook, Twitter, LiveJournal), digital mega-libraries, shopping (Amazon, eBay), tagging (Flickr, Digg), repositories (Wikipedia, Stack Overflow), MOOC data, server logs, and recommenders (Netflix, Amazon, etc.). These datasets have an enormous variety of potential uses in health care, government, education, commerce, etc. They provide previously unavailable opportunities to understand human behavior and to provide better services to people.

The course will be hands-on and include several assignments on large datasets and a final project. It covers data mining algorithms from general machine learning, causal analysis, social networks, and natural language processing. There will be modest coverage of systems issues: big data toolkits and their affordances, some recent innovations from scientific computing, and GPU programming. Students should have some familiarity with Java or Scala, and will be working with Hadoop and/or Spark, Matlab, and BIDMat/BIDMach. Projects will have access to a large number of experimental datasets totaling approximately 20 TB.

  • Instructor: John Canny, Office Hours TuTh 4-5pm
  • Teaching Assistant: Huasha Zhao

Tips and Tricks


4/9/2013: Assignment 3 deadline extended to 4/18/2013.


This course will have several programming assignments, and fluency with a high-level language (ideally Java) is essential. Following the algorithm descriptions will require good familiarity with linear algebra and at least one upper-division statistics or machine learning course. Students will work on assignments in two-person teams with the goal of maximizing performance, so undergraduate knowledge of systems is a plus. Teams with complementary strengths in math/stat and systems are likely to do very well.


01/23/2013: Introduction [Slides ] Introduction, example problems

01/28/2013: Basic stats [Slides ] Statistical Learning, Regression, bias/variance tradeoff

Assignment [Assignment 1 ]

01/30/2013: Naive Bayes and Generalized Linear Models [Slides ]

02/04/2013: Performance Measurement [Slides ] Significance tests, ROC plots, permutation and bootstrap tests

02/06/2013: About People [Slides ] Power laws, traits, social network structures

02/11/2013: Optimizers [Slides ] SGD, MCMC

02/13/2013: MapReduce [Slides ]

Assignment due March 4: [Assignment 2 ]

02/20/2013: Query Languages and Systems [Slides ] Spark, Hyracks, Pig

02/25/2013: Causal Analysis [Slides ] Matching, propensity scores

02/27/2013: Excavating [Slides ] - crawling, web services, datasets

03/04/2013: Bagging, Boosting, Random Forests [Slides ]

03/06/2013: Machine Biology [Slides ] - bandwidth hierarchy, caching, disks

Assignment due by March 20: [Project Proposal ]

03/11/2013: GPU programming I [Slides ]

03/13/2013: GPU programming II [Slides ]

03/18/2013: Project proposals (Schedule)

03/20/2013: Project proposals

Assignment due by April 18: Assignment 3

03/25-29/2013: Spring Break

04/01/2013: Factor Models [Slides ]

04/03/2013: Natural Language Processing I [Slides ] Part-of-speech tagging, entity recognition

04/08/2013: Natural Language Processing II [Slides ] Parsing

04/10/2013: Clustering [Slides ] k-means, Spectral

04/15/2013: Causal Analysis II [Slides ] Causal Regression

04/17/2013: Causal Graphical Models [Slides ]

04/18/2013: Assignment due: Assignment 3

04/22/2013: Network Algorithms I [Slides ] HITS and PageRank

04/24/2013: Network Algorithms II [Slides ] Diffusion and meme tracking

This week: progress reviews for all projects

04/29/2013: Visualization I [Slides ]

05/01/2013: Visualization II

05/07/2013: Project Presentations 2-6pm, 306 Soda Hall.

05/08/2013: Project Posters 4-5pm, 5th floor Soda Atrium.

05/15/2013: Final Report Due
