Inference

Moving from premises to logical consequences
Induction is inference from a particular premise to a universal conclusion
Statistical inference: Using data analysis to infer properties of an underlying probability distribution

Prediction vs Inference

Prediction: Using our model to make predictions for unseen data. (Given attributes of a house, how much is it worth?) We don’t care about how the model arrives at its predictions.
Inference: Using our model to draw conclusions about the underlying true relationships between our features and response. (How much extra will a house be worth if it has a view of the river?) We care about model parameters that are interpretable and meaningful

Statistical Inference

Draw conclusions about a population parameter given only a random sample
Parameter: some function of the population, e.g. the population mean
Estimator: some function of a sample, whose goal is to estimate a population parameter, e.g. the sample mean. Estimators are random variables, because their value depends on which random sample was drawn
- Bias of an estimator: difference between estimator’s expected value and the true value of the parameter being estimated.
- Variance of an estimator: Expected squared deviation of an estimator from its mean
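Bias and variance can be made concrete with a small simulation. The sketch below is only an illustration: the normal population, its variance of 4, the sample size and the number of trials are all assumed values, and the helper names bias and variance are hypothetical. It compares two estimators of the population variance, one dividing by n and one dividing by n - 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed population for illustration: normal with mean 5 and variance 4.
true_var = 4.0
n = 10            # size of each random sample
trials = 20_000   # number of simulated random samples

samples = rng.normal(loc=5.0, scale=np.sqrt(true_var), size=(trials, n))

# Two estimators of the population variance, applied to every sample.
var_div_n = samples.var(axis=1, ddof=0)          # divides by n
var_div_n_minus_1 = samples.var(axis=1, ddof=1)  # divides by n - 1

def bias(estimates, true_value):
    """Average of the estimates minus the true parameter value."""
    return estimates.mean() - true_value

def variance(estimates):
    """Average squared deviation of the estimates from their own mean."""
    return np.mean((estimates - estimates.mean()) ** 2)

print("bias, divide by n:        ", bias(var_div_n, true_var))
print("bias, divide by n - 1:    ", bias(var_div_n_minus_1, true_var))
print("variance, divide by n:    ", variance(var_div_n))
print("variance, divide by n - 1:", variance(var_div_n_minus_1))
```

Under these assumptions the divide-by-n estimator should show a clearly negative bias while the divide-by-(n - 1) estimator should sit close to zero bias, at the cost of a somewhat larger variance.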

Bootstrapping

Idea: treat our random sample as a population and resample from it

Pseudocode

collect random sample of size n (aka bootstrap population)
initialize list of estimates
repeat 10,000 times:
    resample with replacement from bootstrap population
    apply estimator f to resample
    store in list
list of estimates is the bootstrapped sampling distribution of f
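The pseudocode above could be written in Python roughly as follows. This is a minimal sketch, not a definitive implementation: the function name bootstrap, its parameters and the ten data values are made up for illustration.

```python
import numpy as np

def bootstrap(sample, estimator, n_resamples=10_000, seed=0):
    """Approximate the sampling distribution of `estimator` by resampling."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    n = len(sample)
    estimates = np.empty(n_resamples)
    for i in range(n_resamples):
        # Resample with replacement from the bootstrap population.
        resample = rng.choice(sample, size=n, replace=True)
        # Apply the estimator to the resample and store the result.
        estimates[i] = estimator(resample)
    return estimates

# Made-up random sample of size n = 10.
sample = np.array([2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9, 3.7, 3.0])

boot_means = bootstrap(sample, np.mean)
print("sample mean:             ", sample.mean())
print("mean of bootstrap means: ", boot_means.mean())
print("std of bootstrap means:  ", boot_means.std())
```

The standard deviation of the stored estimates is a bootstrap estimate of the estimator's standard error.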

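Bootstrapping works poorly for some estimators. The median is a common example: with a small sample, the resampled medians can only take a handful of distinct values, so the bootstrapped distribution is a coarse approximation
If the sample is too small, bootstrapping won’t work well, because the sample is a poor stand-in for the population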

Confidence Interval

What does a P% confidence interval mean?
- If we take a sample from the population and compute a P% confidence interval for the true population parameter, and repeat this many times, about P% of those intervals will contain the true population parameter.
To compute confidence interval(s, f, P), approximate the sampling distribution of f using sample s, then take the middle P% of the estimates from this approximate distribution
A 95% confidence interval does not mean that there is a 95% chance that the population parameter is in one particular interval. Once the interval is computed, the population parameter either is in it or isn’t. The 95% describes how often the procedure produces an interval that contains the parameter
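A minimal sketch of the "middle P%" recipe, using bootstrap resampling to approximate the sampling distribution. The function name confidence_interval mirrors the notation above, but its extra parameters (n_resamples, seed) and the data are assumptions made for illustration.

```python
import numpy as np

def confidence_interval(s, f, P, n_resamples=10_000, seed=0):
    """Percentile interval: middle P% of the bootstrapped estimates of f."""
    rng = np.random.default_rng(seed)
    s = np.asarray(s)
    # Approximate the sampling distribution of f by bootstrapping s.
    estimates = np.array([
        f(rng.choice(s, size=len(s), replace=True))
        for _ in range(n_resamples)
    ])
    # Cut off (100 - P) / 2 percent from each tail.
    tail = (100 - P) / 2
    lower, upper = np.percentile(estimates, [tail, 100 - tail])
    return lower, upper

# Made-up random sample.
sample = np.array([2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9, 3.7, 3.0])

low95, high95 = confidence_interval(sample, np.mean, 95)
low50, high50 = confidence_interval(sample, np.mean, 50)
print("95% interval for the mean:", (low95, high95))
print("50% interval for the mean:", (low50, high50))
```

A lower confidence level keeps a smaller middle slice of the same estimates, so the 50% interval sits inside the 95% one.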

Bootstrapping Model Parameters

Our estimate for theta depends on what our training data was.
We want to think about all of the different ways that our training data and our parameter estimate could have come out

We want to test whether a feature has any effect on the outcome. This works for linear and logistic regression models with any number of features.

Resample the training data many times and estimate theta1 on each resample
Make a confidence interval for theta1 and see if 0 is in the interval
If yes, theta1 is not significantly different from 0
If no, theta1 is significantly different from 0
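The procedure might look like this for a linear regression. Everything about the data is an assumption for illustration: the response is simulated so that the first feature has a true coefficient of 3 and the second a true coefficient of 0, and the helper fit is a hypothetical name for an ordinary least squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated training data (assumed): y = 2 + 3*x1 + 0*x2 + noise.
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 3.0 * x1 + 0.0 * x2 + rng.normal(size=n)

# Design matrix with an intercept column, so theta = [theta0, theta1, theta2].
X = np.column_stack([np.ones(n), x1, x2])

def fit(X, y):
    """Ordinary least squares estimate of theta."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

# Resample rows with replacement and re-estimate theta each time.
n_resamples = 2_000
thetas = np.empty((n_resamples, X.shape[1]))
for b in range(n_resamples):
    idx = rng.integers(0, n, size=n)
    thetas[b] = fit(X[idx], y[idx])

# 95% percentile interval for every coefficient.
lower, upper = np.percentile(thetas, [2.5, 97.5], axis=0)
for j in (1, 2):
    contains_zero = lower[j] <= 0 <= upper[j]
    verdict = "not significantly" if contains_zero else "significantly"
    print(f"theta{j}: [{lower[j]:.3f}, {upper[j]:.3f}] -> {verdict} different from 0")
```

Rows are resampled as whole (features, response) pairs so that each resample keeps the relationship between X and Y intact.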

Multicollinearity

If features are related to one another, it might not be possible to have a change in one while holding the others constant
Multicollinearity: Where a feature can be predicted fairly accurately by a linear combination of other features.
- Doesn’t impact the model’s predictive accuracy, only the interpretability of its parameters
Perfect Multicollinearity: one feature can be written exactly as linear combination of other features
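Both definitions can be checked numerically. The sketch below uses simulated features (an assumption for illustration) and a hypothetical helper, r_squared_from_others, that regresses one feature on the rest: an R^2 near 1 signals multicollinearity, and a design matrix that loses rank signals perfect multicollinearity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Perfect: x3 is exactly a linear combination of x1 and x2.
x3_perfect = 2 * x1 - x2
# Near: the same combination plus a little noise.
x3_near = 2 * x1 - x2 + rng.normal(scale=0.01, size=n)

def r_squared_from_others(target, others):
    """R^2 from regressing one feature on the others (with an intercept)."""
    A = np.column_stack([np.ones(len(target))] + list(others))
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    return 1 - resid.var() / target.var()

X_perfect = np.column_stack([x1, x2, x3_perfect])
X_near = np.column_stack([x1, x2, x3_near])

rank_perfect = np.linalg.matrix_rank(X_perfect)
rank_near = np.linalg.matrix_rank(X_near)
r2_near = r_squared_from_others(x3_near, [x1, x2])

print("rank with perfect multicollinearity:", rank_perfect)
print("rank with near multicollinearity:   ", rank_near)
print("R^2 of x3_near given x1 and x2:     ", r2_near)
```

With a rank-deficient design matrix the least squares solution is not unique, which is why the individual coefficients stop being interpretable.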

Summary

Estimators are functions that provide estimates of true population parameters
We can bootstrap to estimate the sampling distribution of an estimator
Using the bootstrapped sampling distribution, we can compute a confidence interval for the population parameter
- This gives a rough idea of how uncertain we are about the true population parameter
- Only valid if the original random sample is representative
The assumption when performing linear regression is that there is some true parameter theta that defines the linear relationship between features X and response Y
- We can use bootstrap to determine whether or not an individual feature is significant
Multicollinearity arises when features are correlated with one another
Supervised Learning = We have an X and Y
Unsupervised Learning = We only have X, want to learn about X

Questions

What is a confidence interval?

Published Oct 21, 2021

Albert Su: Just a kid looking to make it
