Inference
- Moving from premises to logical consequences
- Induction is inference from a particular premise to a universal conclusion
- Statistical inference: Using data analysis to deduce properties of underlying distribution
Prediction vs Inference
- Prediction: Using our model to make predictions for unseen data. (Given attributes of house, how much is it worth?) We don’t care about how
- Inference: Using our model to draw conclusions about the underlying true relationships between our features and response. (How much extra will a house be worth if it has a view of the river?) We care about model parameters that are interpretable and meaningful
Statistical Inference
- Draw conclusions about a population parameters given only a random sample
- Parameter: some function of population, ie population mean
-
Estimator: some function of a sample, whose goal is to estimate a population parameter, ie sample mean. Estimators are random variables
- Bias of an estimator: difference between estimator’s expected value and the true value of the parameter being estimated.
- Variance of an estimator: Expected squared deviation of an estimator from its mean
Bootstrapping
- Idea: treat our random sample as a population and resample from it
-
Psuedocode
collect random sample of size n(aka bootstrap population) initialize list of estimates repeat 10,000 times: resample with replacement from bootstrap population apply estimator f to resample store in list list of estimates is the bootstrapped sampling distribution of f - The median cannot be accurately drawn from bootstrapping
- If the sample is too small, bootstrapping won’t work
Confidence Interval
-
What does a confidence interval p% mean?
- If we take a sample from the population and compute P% confidence interval for the true population parameter, and repeat this many times, our population parameter will be in our interval P% of the time.
- to compute confidence interval(s,f,P), approximate sampling distribution of f using sample s. Choose middle P% of samples from this approximate distribution
- A 95% confidence interval does not mean that there is a 95% chance that the population parameter is in the interval, either the population parameter is in or isn’t
Bootstrapping Model Parameters
- Our estimate for theta depends on what our training data was.
- We want to think about all of the different ways that our training data and our parameter estimate could have come out
-
We want to test whether a feature has any effect on the outcome. This works for linear and logistic regression models with any number of features.
Estimate theta1 each time Make confidence interval for theta 1 and see if 0 is in the interval If yes, theta 1 is not significantly different than 0 If no, theta 1 is significantly different than 0
Multicollinearity
- If features are related to one another, it might not be possible to have a change in one while holding the others constant
-
Multicollinearity: Where a feature can be predicted fairly accurately by a lienar combination of other features.
- Doesn’t impact model predictability, only interpretability
- Perfect Multicollinearity: one feature can be written exactly as linear combination of other features
Summary
- Estimators are functions that provide estimates of true population parameters
- We can bootstrap to estimate the sampling distribution of an estimator
-
Using the bootstrapped sampling distribution, we can compute a confidence interval for our estimator
- This gives a rough idea of how uncertain we are about the true population parameter
- Only valid if the original random sample is representative
-
The assumption when performing linear regression is that there is some true parameter theta that defines the linear relationship between features X and response Y
- We can use bootstrap to determine whether or not an individual feature is significant
- Multicollinearity arises when features are correlated with one another
- Supervised Learning = We have an X and Y
-
Unsupervised Learning = We only have x, want to learn about X
Questions
What is confidence interval