# Main Page

## CS294-1 Spring 2012, Behavioral Data Mining

### Section: 1-2pm Weds, in 310 Soda

Lectures TuTh 9:30-11am in **310** Soda

CCN: 26899

This is a course about large-scale data mining of behavioral data, that is, data generated by people. Examples include shopping (Amazon, eBay), messaging (Facebook, Twitter, LiveJournal), tagging (Flickr, Digg), repositories (Wikipedia, Stack Overflow), and recommenders (Netflix, Amazon, etc.). We focus on event data: discrete actions by people. As a result, the largest datasets are in the terabyte range, and they share some common structural features. We will not cover sensor data, medical images, genetic data, etc., which can be many orders of magnitude larger. The course will be hands-on and will include several assignments on large datasets and a final project. It covers a number of data mining algorithms (mostly from machine learning) and how to implement them at scale. The tools will include Hadoop, MarkLogic 5 (text and XML processing), and a matrix algebra toolkit (Matlab or Scalala).

- Teaching Assistant: Keng-hao Chang, 360 Hearst Mining Bldg. Office hour: T 2-3

### Prerequisites

This course will have several programming assignments, and fluency with a high-level language (ideally Java) is the main requirement. Most of the algorithms rely on linear algebra and will be expressed in a matrix-based language like Matlab or Scalala, so you should have preparation equivalent to Math 54. Undergraduate knowledge of AI (CS188) or statistics will make it easier to make sense of the algorithms, but is not a prerequisite for now. Students will work on assignments in two-person teams with the goal of maximizing performance, so undergraduate knowledge of systems is a plus. Teams with complementary strengths in math/stat and systems are likely to do very well.

### Tentative Outline:

**01/17/2012:** Introduction [*Slides* ]
Introduction, example problems

**01/19/2012:** Basic stats - naive Bayes classifier [*Slides* ]

**Assignment** due by ~~Feb 2~~ Feb 7 (deadline extended): Programming Assignment 1, Assignment 1 deliverables. Note: Set up Scalala and ScalaNLP.

**01/24/2012:** Performance measurement [*Slides* ], precision/recall, ROC plots, significance tests

**01/26/2012:** Dealing with text [*Slides* ], n-grams, smoothing, indexing

**01/31/2012:** XML search and analytics [*Slides* ], Guest lecture, Ron Avnur, CTO, MarkLogic

**02/02/2012:** Map Reduce [*Slides* ]

**02/07/2012:** Regression [*Slides* ] - linear and logistic

**Assignment (extension) due by Feb 24**: Assignment 2

**02/09/2012:** Machine biology [*Slides* ] - bandwidth hierarchy, caching, disks

**02/14/2012:** About people [*Slides* ] - power laws, personality factors, social network structure, sentiment

**02/16/2012:** SVM Classifiers [*Slides* ]

**02/21/2012:** Spark [*Slides* ] - Guest lecture, Matei Zaharia, UCB

**02/23/2012:** Computational advertising and recommendation [*Slides* ] - Guest lecture, Ye Chen, Microsoft

**Assignment due by March 13**: Project Proposal, List of Projects

**02/28/2012:** Query languages and architectures [*Slides* ] - Hive, Pig, Sawzall

**03/01/2012:** Excavating [*Slides* ] - crawling, web services

**03/06/2012:** Project proposals (Schedule)

**03/08/2012:** Project proposals

**03/13/2012:** Visualization I [*Slides* ]

**03/15/2012:** Visualization II and project proposal wrap-up [*Slides* ]

**03/20/2012:** Dimension Reduction [*Slides* ] - SVD, LSI, random projection

**03/22/2012:** Clustering 1 [*Slides* ] - k-means, spectral

**Assignment** due by **April 19**: Assignment 3

**03/27-29/2012:** Spring Break

**04/03/2012:** Causal Inference I [*Slides* ]

**04/05/2012:** Causal Inference II [*Slides* ]

**04/10/2012:** Clustering 2 - Naive Bayes, LDA, GaP [*Slides* ]

**04/12/2012:** Prediction - kNN, kd-trees, kNC [*Slides* ]

**04/17/2012:** Sequence models I - guest lecture by David Hall [*Slides* ]

**04/19/2012:** Sequence models II - guest lecture by David Hall

**This week:** progress reviews for all projects

**04/24/2012:** Network algorithms - HITS and PageRank [*Slides* ]

**04/26/2012:** Diffusion and meme tracking [*Slides* ]

**05/01/2012:** 1-3pm, Project Posters, Soda Hall 5th floor Atrium

**05/03/2012:** 11am-2pm, Project Presentations, 306 Soda Hall

**05/07/2012:** Final Report Due