TODO
- Put readings on todo list (can be found on class webpage)
- Put homeworks on todo list
- Midterm 1: Monday, March 16, in class
- Discussion sections: Monday-Wednesday
Core material
- Find patterns in data and use them to make predictions
- Models + stats help us understand patterns
- Use optimization algorithms to learn the patterns
Classification
- Simplest Case: you have two choices, given data, make a prediction
- k-NN (k-nearest neighbors) works best when the data has few outliers
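A minimal sketch of a two-class k-nearest-neighbors prediction, to make the idea concrete. The function name `knn_predict`, the toy points, and the choice of Euclidean distance are assumptions made for illustration, not something fixed by the lecture.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    # Majority vote among the k nearest neighbors.
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy two-class data (made up for illustration).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labels = [0, 0, 0, 1, 1, 1]

pred_low = knn_predict(points, labels, (0.15, 0.15), k=3)
pred_high = knn_predict(points, labels, (0.95, 0.95), k=3)
```

With two classes, an odd k avoids tied votes.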
Classifying Numbers
- Turn each image into a grid of 0s and 1s based on the color of each grid cell
- Turn the grid into a vector by flattening it
- Create a hyperplane in the n-dimensional space to separate the classes
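The steps above can be sketched as follows, assuming a tiny 3x3 grid. The weights `w` and offset `b` are hypothetical values chosen by hand for illustration; in practice they would be learned from training data.

```python
def flatten(grid):
    """Turn a 2-D grid of 0s and 1s into a single vector, row by row."""
    return [pixel for row in grid for pixel in row]

def hyperplane_predict(w, b, x):
    """Return 1 if x lies on the positive side of the hyperplane w.x + b = 0, else 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else 0

# A made-up 3x3 "image" that loosely resembles a 7.
grid = [
    [1, 1, 1],
    [0, 0, 1],
    [0, 1, 0],
]
x = flatten(grid)  # a 9-dimensional vector

# Hypothetical hand-picked parameters (not learned).
w = [1, 1, 1, -1, -1, 1, -1, 1, -1]
b = -3.5

is_seven = hyperplane_predict(w, b, x)
blank = hyperplane_predict(w, b, flatten([[0, 0, 0]] * 3))
```

Here an n-pixel image becomes a point in n-dimensional space, and the sign of `w.x + b` decides which side of the hyperplane it falls on.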
Testing and Validation
- Train a classifier - it learns to distinguish 7 from not 7
- Test the classifier on NEW images
There are two types of error:
- Training set error: Fraction of training images not classified correctly
- Test set error: Fraction of misclassified NEW images, not seen during training
- Outliers: Points whose labels are atypical (e.g., solvent borrowers who defaulted anyway)
- Overfitting: When the test error deteriorates because the classifier becomes too sensitive to outliers
- Hyperparameters: Most ML algorithms have a few hyperparameters that control over/underfitting, e.g., k in k-nearest neighbors
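A small sketch of how the two error rates can diverge when a classifier is too sensitive to an outlier. The 1-D data, the outlier at 0.15, and the two classifiers (`memorizer`, `threshold_rule`, with a 0.6 cutoff) are all made up for illustration.

```python
def error_rate(predict, points, labels):
    """Fraction of points whose predicted label differs from the true label."""
    wrong = sum(1 for x, y in zip(points, labels) if predict(x) != y)
    return wrong / len(points)

# Toy 1-D training set (made up); the point 0.15 labeled 1 is an outlier.
train_x = [0.0, 0.1, 0.2, 0.15, 1.0, 1.1, 1.2]
train_y = [0, 0, 0, 1, 1, 1, 1]
# New points not seen during training.
test_x = [0.05, 0.14, 0.16, 1.05]
test_y = [0, 0, 0, 1]

def memorizer(x):
    """1-nearest-neighbor: copy the label of the closest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def threshold_rule(x):
    """A simpler classifier: predict 1 when x exceeds a fixed cutoff."""
    return 1 if x > 0.6 else 0

memorizer_train_err = error_rate(memorizer, train_x, train_y)
memorizer_test_err = error_rate(memorizer, test_x, test_y)
threshold_train_err = error_rate(threshold_rule, train_x, train_y)
threshold_test_err = error_rate(threshold_rule, test_x, test_y)
```

In this toy setup the memorizer has the lower training error but the higher test error, because it reproduces the outlier's label for new points near it.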
Select classifiers by validation
- Validation Set: Hold back a subset of the labeled data
- Train the classifier multiple times with different hyperparameter settings
- Choose the setting (hyperparameters + learning algorithm) that works best on the validation set
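The selection procedure above might look like this for k in k-nearest neighbors. This is a sketch under stated assumptions: the 1-D data, the candidate values `[1, 3, 5]`, and the helper names are hypothetical, and k-NN needs no explicit training step, so "training multiple times" reduces to predicting with each k.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority label among the k training points closest to x (1-D data)."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def validation_error(train, validation, k):
    """Fraction of held-back validation points misclassified with this k."""
    wrong = sum(1 for x, y in validation if knn_predict(train, x, k) != y)
    return wrong / len(validation)

# Made-up labeled data as (feature, label) pairs, already split in two.
train = [(0.0, 0), (0.1, 0), (0.2, 0), (0.15, 1), (1.0, 1), (1.1, 1), (1.2, 1)]
validation = [(0.05, 0), (0.14, 0), (0.16, 0), (1.05, 1)]

# Try each hyperparameter setting and keep the one with the lowest validation error.
candidate_ks = [1, 3, 5]
errors = {k: validation_error(train, validation, k) for k in candidate_ks}
best_k = min(candidate_ks, key=lambda k: errors[k])  # ties go to the earlier candidate
```

The validation points play no part in the nearest-neighbor vote itself; they are only used to compare settings.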
Now, we have 3 sets:
- Training set: Used to learn model weights
- Validation set: Used to tune hyperparameters, choose among different models
- Test set: Used as FINAL evaluation. The test set is kept in a vault and run once, at the very end
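One way to produce the three sets is a single shuffled split, sketched below. The function name `three_way_split`, the 60/20/20 proportions, and the fixed seed are assumptions for illustration; the notes do not prescribe split sizes.

```python
import random

def three_way_split(examples, validation_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle once, then carve out disjoint test, validation, and training sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * validation_fraction)
    test = shuffled[:n_test]                      # set aside until the very end
    validation = shuffled[n_test:n_test + n_val]  # used to tune hyperparameters
    train = shuffled[n_test + n_val:]             # used to learn model weights
    return train, validation, test

# Stand-in for 100 labeled examples.
examples = list(range(100))
train, validation, test = three_way_split(examples)
```

Because the sets are disjoint, every example is used for exactly one purpose.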
Kaggle.com
- Runs ML competitions, including our HWs
We use 2 data sets:
- Public set: labels available during the competition
- Private set: labels known only to Kaggle
Techniques of Machine Learning Taught in Class
Supervised learning
- Classification: Is this email spam?
- Regression: How likely is it that this patient has cancer?
Unsupervised learning
- Clustering: which DNA sequences are similar to each other?
- Dimensionality Reduction: What are common features of faces?