All Articles

Lecture 1

TODO

Put readings on todo list can be found on class webpage Put homeworks on todo list Midterm 1 Monday march 16 in class Discussion sections Monday-Wednesday

Core material

  • Find patterns in data and use them to make predictions
  • Models + stats help us understand patterns
  • Use optimization algorithms to learn the patterns

Classification

  • Simplest Case: you have two choices, given data, make a prediction
  • Knn best when few outliers

Classifying Numbers

  1. Turn points into grid of 0s and 1s based on the color of the grid
  2. Turn grid into a vector by flattening the vector
  3. Create hyperplane in the n-dimensional space to group things

Testing and Validation

  • Train a classifier - it learns to distinguish 7 from not 7
  • Test the classifier on NEW images
  • There are two types of error

    • Training set error: Fraction of training images not classified correctly
    • Test set error: Fraction of misclassified NEW images, not seen during training
  • Outliers: Points whose labels are atypical (e.g solvent borrowers who defaulted anyway)
  • Overfitting: When the test error deteriorates because the classifier becomes too sensitive to outliers
  • Hyperparameters: Most ML algorithms have a few hyperparameters that control over/underfitting. eg k in k-nearest neighbors

Select classifiers by validation

  • Validation Set: Hold back a subset of the labeled data
  • Train the classifier multiple times with different hyperparameter settings
  • Choose setting(hyperparameter + learning algorithm) that works best on validation set

Now, we have 3 sets:

  • Training set: Used to learn model weights
  • Validation set: Used to tune hyperparameters, choose among different models
  • Test set: Used as FINAL evaluation. Test set kept in vault, ran once, at the very end

Kaggle.com

  • Runs ML competitions, including our HWs
  • We use 2 data sets:

    • public set labels available during the competition
    • private set labels known only to Kaggles

Techniques of Machine Learning Taught in Class

Supervised learning

  • Classification: Is this email spam?
  • Regression: How likely does this patient have cancer?

Unsupervised learning

  • Clustering: which DNA sequences are similar to each other?
  • Dimensionality Reduction: What are common features of faces?

Questions