The Train-Test Split
- Training Data: Used to fit the model
- Test Data: Used to estimate the generalization error
- How to split? Randomly, temporally, or geographically (usually randomly)
- What size? A larger training set supports more complex models; a larger test set gives a better estimate of the generalization error (usually 90%-10% or 75%-25%)
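A random 90%-10% split can be sketched in a few lines of NumPy; the helper below is a minimal illustration (the function name and `test_frac` parameter are this sketch's own, not from the notes):

```python
import numpy as np

def train_test_split(X, y, test_frac=0.1, seed=0):
    """Randomly split (X, y) into training and test sets (illustrative helper)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))              # shuffle the row indices
    n_test = int(len(X) * test_frac)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# A 90%-10% split of 50 rows gives 45 training rows and 5 test rows
X = np.arange(100).reshape(50, 2)
y = np.arange(50)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_frac=0.1)
```

Fixing the seed makes the split reproducible, which matters when you return to earlier steps of the recipe below.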
Recipe for Successful Generalization
1. Split the data into a training set (90%) and a test set (10%)
2. Use only the training data during model development; use cross-validation to estimate generalization (do not look at the test data!)
3. Commit to a model and train it once more using only the training data
4. Evaluate the model on the test data. If accuracy is not acceptable, return to step 2
5. Train on all available data and ship it
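The cross-validation step of the recipe can be sketched as a k-fold loop over the training data only. This is an assumed minimal implementation using ordinary least squares as the model (the notes do not prescribe k or the model):

```python
import numpy as np

def cross_val_mse(X, y, k=5, seed=0):
    """Estimate generalization error by k-fold cross-validation (sketch).
    Only training data should be passed in -- the test set stays untouched."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = []
    for i in range(k):
        val = folds[i]                                   # held-out fold
        tr = np.concatenate(folds[:i] + folds[i + 1:])   # remaining folds
        theta, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
        errors.append(np.mean((X[val] @ theta - y[val]) ** 2))
    return np.mean(errors)

# Noise-free linear data: the cross-validated MSE should be essentially 0
X = np.column_stack([np.ones(40), np.arange(40.0)])
y = 2.0 + 3.0 * np.arange(40.0)
cv_error = cross_val_mse(X, y)
```

The averaged fold error stands in for the test error during model selection, so the test set is only consulted once, at step 4.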
Regularization
Parametrically controlling the model complexity. Naive solution: find the best value of theta that uses at most B features. Unfortunately, this is an NP-hard combinatorial search problem.
- L0 Norm Ball: Ideal for feature selection, but combinatorially difficult to optimize
- L1 Norm Ball: Encourages sparse solutions and is convex!
- L2 Norm Ball: Spreads weight over features (robust) but does not encourage sparsity
- L1 + L2 Norm (Elastic Net): A compromise, but you need to tune two regularization parameters
Standardization: Ensure each dimension has the same scale and is centered around 0
Intercept Term: Typically we do not regularize the intercept term
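Standardization can be sketched as follows; note that the mean and standard deviation are computed on the training set only and then reused for the test set, so no test information leaks into the transform (a common convention, assumed here rather than stated in the notes):

```python
import numpy as np

def standardize(X_train, X_test):
    # Fit the statistics on the training set only, then apply the same
    # transform to the test set to avoid leaking test information
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    return (X_train - mu) / sigma, (X_test - mu) / sigma

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))
Z_train, Z_test = standardize(X_train, X_test)
```

After this transform each training-set column has mean 0 and standard deviation 1, so the regularization penalty treats every feature on the same scale.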
Ridge Regression
Model: Y = X * theta
Loss: Squared Loss
Regularization: L2 regularization
Objective Function: Squared Loss + L2 penalty
Note that there is always a unique optimal parameter vector for Ridge Regression, since the L2 penalty makes the objective strictly convex!
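The unique ridge solution has the closed form theta = (X^T X + lam I)^{-1} X^T y, which can be sketched directly in NumPy (the intercept is omitted here for simplicity, consistent with not regularizing it):

```python
import numpy as np

def ridge_fit(X, y, lam):
    # theta = (X^T X + lam * I)^{-1} X^T y
    # Adding lam * I makes the matrix positive definite for any lam > 0,
    # so the solution exists and is unique even when X^T X is singular.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# A duplicated column makes X^T X singular (OLS has no unique solution),
# yet ridge still returns a unique answer that splits the weight evenly
X = np.column_stack([np.arange(10.0), np.arange(10.0)])
y = 4.0 * np.arange(10.0)
theta = ridge_fit(X, y, lam=0.1)
```

This illustrates the "spreads weight over features" behavior of the L2 penalty: the two identical columns receive equal coefficients.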
Lasso Regression
Model: Y = X * theta
Loss: Squared Loss
Regularization: L1 regularization
Objective Function: Average Squared Loss + L1 penalty
Note that there is NO closed-form solution for the optimal LASSO parameter vector, so we must use numerical methods such as gradient descent
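One standard numerical method is proximal gradient descent (ISTA): a gradient step on the squared loss followed by a soft-threshold step that handles the L1 penalty, which is not differentiable at 0. This is a sketch of that technique, not the notes' prescribed solver:

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of the L1 norm: shrink each entry toward 0 by t,
    # setting small entries exactly to 0 -- this is where sparsity comes from
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    # Minimizes (1/n) * ||X @ theta - y||^2 + lam * ||theta||_1
    n, d = X.shape
    theta = np.zeros(d)
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n   # Lipschitz constant of the smooth part
    for _ in range(n_iter):
        grad = (2.0 / n) * X.T @ (X @ theta - y)
        theta = soft_threshold(theta - grad / L, lam / L)
    return theta

# y depends only on the first feature; LASSO drives the irrelevant
# coefficient to (essentially) zero, illustrating sparsity
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))
y = 3.0 * X[:, 0]
theta = lasso_ista(X, y, lam=0.1)
```

Compare with ridge, which would shrink the irrelevant coefficient but not zero it out.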