CS294-1 Spring 2013, Behavioral Data Mining
Sections: W 2-3pm and 3-4pm, 310 Soda
Lectures: MW 9-10:30am, 306 Soda
CCN: 26922
This is a course about large-scale mining of behavioral data, that is, data generated by people. Examples include the web itself, social media (Facebook, Twitter, LiveJournal), digital mega-libraries, shopping (Amazon, eBay), tagging (Flickr, Digg), repositories (Wikipedia, Stack Overflow), MOOC data, server logs, and recommenders (Netflix, Amazon, etc.). These datasets have an enormous variety of potential uses in health care, government, education, commerce, and other areas. They provide unprecedented opportunities to understand human behavior and to provide better services to people. The course will be hands-on and will include several assignments on large datasets and a final project. It covers data mining algorithms from general machine learning, causal analysis, social networks, and natural language processing. There will be modest coverage of systems issues: big data toolkits and their affordances, some recent innovations from scientific computing, and GPU programming. Students should have some familiarity with Java or Scala, and will be working with Hadoop and/or Spark, MATLAB, and BIDMat/BIDMach. Projects will have access to a large number of experimental datasets totaling approximately 6 TB compressed.
- Instructor: John Canny, Office Hours: TuTh 4-5pm
- Teaching Assistant: Huasha Zhao, Office Hours: TBD
News
4/9/2013: Assignment 3 deadline extended to 4/18/2013.
Prerequisites
This course will have several programming assignments, so fluency with a high-level language (ideally Java) is essential. Descriptions of algorithms will require good familiarity with linear algebra and at least one upper-division statistics or machine learning course. Students will work on assignments in two-person teams with the goal of maximizing performance, so undergraduate-level knowledge of systems is a plus. Teams with complementary strengths in math/stat and systems are likely to do very well.
Outline
01/23/2013: Introduction [Slides] Introduction, example problems
01/28/2013: Basic stats [Slides] Statistical learning, regression, bias/variance tradeoff
- Assignment: [Assignment 1]
01/30/2013: Naive Bayes and Generalized Linear Models [Slides]
02/04/2013: Performance Measurement [Slides] Significance tests, ROC plots, permutation and bootstrap tests
02/06/2013: About People [Slides] Power laws, traits, social network structures
02/11/2013: Optimizers [Slides] SGD, MCMC
02/13/2013: MapReduce [Slides]
- Assignment due March 4: [Assignment 2]
02/20/2013: Query Languages and Systems [Slides] Spark, Hyracks, Pig
02/25/2013: Causal Analysis [Slides] Matching, propensity scores
02/27/2013: Excavating [Slides] Crawling, web services, datasets
03/04/2013: Bagging, Boosting, Random Forests [Slides]
03/06/2013: Machine Biology [Slides] Bandwidth hierarchy, caching, disks
- Assignment due by March 20: [Project Proposal]
03/11/2013: GPU programming I [Slides]
03/13/2013: GPU programming II [Slides]
03/18/2013: Project proposals (Schedule)
03/20/2013: Project proposals
- Assignment due by April 18: Assignment 3
03/25-29/2013: Spring Break
04/01/2013: Factor Models [Slides]
04/03/2013: Natural Language Processing I [Slides] Part-of-speech tagging, entity recognition
04/08/2013: Natural Language Processing II [Slides] Parsing
04/10/2013: Clustering [Slides] k-means, spectral
04/15/2013: Causal Analysis II [Slides] Causal regression
04/17/2013: Causal Graphical Models [Slides]
- Assignment due 04/18/2013: Assignment 3
04/22/2013: Network Algorithms I [Slides] HITS and PageRank
04/24/2013: Network Algorithms II [Slides] Diffusion and meme tracking
This week: progress reviews for all projects
04/29/2013: Visualization I [Slides]
05/01/2013: Visualization II
05/07/2013: Project Presentations 2-6pm, 306 Soda Hall.
05/08/2013: Project Posters 4-5pm, 5th floor Soda Atrium.
05/15/2013: Final Report Due