Data 100 Modeling

Motivation

Predict unknown values based on known values

Modeling Process

Choose a Model: Constant model
Choose a Loss Function: Squared Loss, Absolute Loss
Minimize average loss across the entire dataset to determine optimal parameters
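As a minimal sketch of this process for the constant model, the snippet below searches a grid of candidate constants and picks the one with the lowest average loss. The data values and the names `mse`, `mae`, and `candidates` are made up for illustration. Under squared loss the best constant should land on the mean, and under absolute loss on the median.

```python
import numpy as np

# Hypothetical toy data; the values are made up for illustration.
y = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

def mse(theta):
    """Average squared loss of the constant prediction theta."""
    return np.mean((y - theta) ** 2)

def mae(theta):
    """Average absolute loss of the constant prediction theta."""
    return np.mean(np.abs(y - theta))

# Brute-force search over a grid of candidate constants (step 0.01).
candidates = np.linspace(0, 12, 1201)
best_mse = candidates[np.argmin([mse(t) for t in candidates])]
best_mae = candidates[np.argmin([mae(t) for t in candidates])]

print(best_mse, np.mean(y))    # squared loss picks (approximately) the mean
print(best_mae, np.median(y))  # absolute loss picks (approximately) the median
```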

Correlation

Correlation Coefficient (r): Measures the strength of the LINEAR association between two variables

r is unitless and ranges between -1 and 1
- if r = 1, all points fall exactly on a line with positive slope
- if r = -1, all points fall exactly on a line with negative slope
- if r = 0, there is no linear association between x and y
r says nothing about causation or non-linear association. Remember correlation does not imply causation!

r = average of the product of x and y, both measured in standard units
x_i in standard units = (x_i - xbar) / sigma_x, where xbar is the mean of x and sigma_x is its standard deviation

Covariance: r * stdx * stdy
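The definitions above can be sketched in NumPy as follows. The arrays and the helper name `standard_units` are illustrative assumptions, and `np.std` is used with its default (population) standard deviation so that the average-of-products formula holds exactly.

```python
import numpy as np

# Hypothetical paired data, made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

def standard_units(v):
    """Convert values to standard units: (value - mean) / SD."""
    return (v - np.mean(v)) / np.std(v)

# r is the average of the product of x and y in standard units.
r = np.mean(standard_units(x) * standard_units(y))

# Covariance recovered from r and the two standard deviations.
cov_xy = r * np.std(x) * np.std(y)

print(r, cov_xy)
```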

Simple Linear Regression

Motivation: Want to predict the value of y for any given x. A naive attempt at predicting children's heights from parent heights is to compute the average value of y for each x value, then use those averages as predictions.

Simple linear regression: y = a + bx
To determine the optimal a and b, choose a loss function. If the loss function is squared loss, the objective function is mean squared error (MSE).

To solve for the optimal parameters, we use the objective function and minimize the mean squared error by hand using calculus.

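b = r * (sigma_y / sigma_x)
a = ybar - b * xbar
Here sigma_x and sigma_y are the standard deviations of x and y, and xbar and ybar are their means. This gives us estimates for the parameters a and b.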
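A minimal sketch of these two formulas, using made-up parent and child heights. The data and variable names are assumptions for illustration, and `np.polyfit` is only used as an independent check of the result.

```python
import numpy as np

# Hypothetical parent (x) and child (y) heights in inches.
x = np.array([64.0, 66.0, 68.0, 70.0, 72.0])
y = np.array([65.0, 66.0, 69.0, 69.0, 73.0])

r = np.corrcoef(x, y)[0, 1]

# Slope and intercept from the closed-form least squares solution.
b = r * np.std(y) / np.std(x)
a = np.mean(y) - b * np.mean(x)

# Cross-check against NumPy's degree-1 polynomial fit (slope first).
slope_check, intercept_check = np.polyfit(x, y, 1)
print(a, b, intercept_check, slope_check)
```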

Loss Surfaces

Usually 3D, with axes a, b, and the loss of the model y = a + bx at those parameter values

Model Interpretation

Slope: measured in units of y per unit of x
New data needs to be similar to the original data: you cannot predict a chihuahua’s weight using a model fit on golden retrievers
Visualize, then quantify: watch out for Anscombe’s quartet

Terminology

Names for the x variable:

Feature
Covariate
Independent variable
Explanatory variable
Predictor
Input
Regressor

Names for the y variable:

Output
Outcome
Response
Dependent variable

Adding independent variables

Use a weighted sum of the input variables, where the weights are the model coefficients.

Evaluating Models

Look at Mean Squared Error(MSE) or Root Mean Squared Error(RMSE)
Look at the correlations
Look at a residual plot

Root Mean Squared Error: Square root of the mean squared error. RMSE is in the same units as y. A lower RMSE indicates more accurate predictions. With least squares, adding features never increases the RMSE on the same data, so a lower training RMSE alone does not show that the added features are useful.

R squared: Also known as the coefficient of determination. Measures the strength of the linear association between the actual y and the predicted y.

R^2 = variance of fitted values / variance of y
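Both metrics can be computed directly from fitted values. The sketch below assumes a simple linear model fit with `np.polyfit` on made-up data. For a least squares fit with an intercept, the variance ratio should match the squared correlation between the actual and predicted y.

```python
import numpy as np

# Hypothetical data, made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.5, 3.9, 5.1, 8.2, 9.0, 12.5])

# Least squares line (np.polyfit returns the slope first).
b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

# RMSE is in the same units as y.
rmse = np.sqrt(np.mean((y - y_hat) ** 2))

# R^2 as the ratio of the variance of fitted values to the variance of y.
r2 = np.var(y_hat) / np.var(y)

print(rmse, r2)
```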

Ordinary Least Squares

Multiple regression using matrix multiplication

Multiple regression is of the form yhat = theta_0 + theta_1 * x_1 + theta_2 * x_2 + ... + theta_p * x_p
We can restate this as a dot product, yhat = x^T * theta, where x = (1, x_1, ..., x_p) includes a leading 1 for the intercept theta_0

Design Matrix

Motivation: The mean squared error involves all observations at once, so it would be nice to express our model in terms of all observations, not just one. We can stack the observations into a design matrix.

Rows: Correspond to observations, e.g. all features for data point 3
Columns: Correspond to features, e.g. feature 1 for all data points

Residuals

Residuals are the difference between an actual and predicted value, in the regression context. We use the letter e to denote residuals, e_i = y_i - yhat_i

The mean squared error is equal to the mean of the squared residuals. We can stack all n residuals into a vector, called the residual vector.
residual vector = true y values - predicted y values

Residuals are orthogonal to the span of X. If our model has an intercept term (that is, our design matrix has a column of all 1s):

The sum and mean of the residuals are 0
The average true y value is equal to the average predicted y value
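These residual properties can be checked numerically. The sketch below builds a design matrix with an intercept column from simulated data and solves the least squares problem through the normal equations, X^T X theta = X^T y. The simulated coefficients, the seed, and the variable names are assumptions for illustration.

```python
import numpy as np

# Simulated data; the true coefficients and noise level are made up.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix: one row per observation, one column per feature,
# plus a leading column of 1s for the intercept.
X = np.column_stack([np.ones(n), x1, x2])

# Solve the normal equations for the least squares parameters.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ theta_hat
e = y - y_hat  # residual vector

print(X.T @ e)                  # approximately the zero vector
print(e.sum())                  # approximately 0
print(y.mean(), y_hat.mean())   # approximately equal
```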

Residual Plots:

With simple linear regression (only 1 independent variable), we plot residuals vs x
In the general case, plot residuals on the y axis vs fitted values on the x axis
A good residual plot has no pattern. If there is a curve, this is a sign that transformations or additional variables can help
A residual plot should have a similar vertical spread throughout the entire plot. If it doesn’t, the accuracy of the predictions is not reliable across the plot

Unique Solutions

There is always at least one model parameter that minimizes average loss.
Constant models with a squared loss: a unique solution always exists
Simple linear model with a squared loss: a unique solution exists whenever x is not constant, since any non-constant x has a unique mean, SD, and correlation coefficient with y
Constant model with absolute loss: unique when there is an odd number of y values. If there is an even number of y values, there can be infinitely many solutions (any value between the two middle values).

Invertibility of X transpose * X

Invertible iff it is full rank
X transpose * X and X have the same rank
Thus, X^T * X is invertible iff X has rank p + 1 (full column rank)
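A quick way to see the rank condition is to compare a design matrix with independent columns against one where a column is an exact multiple of another. The matrices below are made up for illustration, and `np.linalg.matrix_rank` computes a numerical rank.

```python
import numpy as np

# Hypothetical design matrix with an intercept column and 2 features.
X_full = np.array([[1.0, 2.0, 3.0],
                   [1.0, 3.0, 5.0],
                   [1.0, 5.0, 4.0],
                   [1.0, 7.0, 9.0]])

# Same intercept and first feature, but the second feature is
# exactly twice the first, so the columns are linearly dependent.
X_deficient = np.column_stack([X_full[:, 0], X_full[:, 1], 2 * X_full[:, 1]])

rank_full = np.linalg.matrix_rank(X_full)
rank_deficient = np.linalg.matrix_rank(X_deficient)

# X^T X has the same rank as X, so it is invertible only in the first case.
rank_gram_full = np.linalg.matrix_rank(X_full.T @ X_full)
rank_gram_deficient = np.linalg.matrix_rank(X_deficient.T @ X_deficient)

print(rank_full, rank_gram_full)            # full column rank
print(rank_deficient, rank_gram_deficient)  # rank deficient
```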

Real World Example - Fairness in Housing Appraisal

Situation

The Cook County Assessor’s Office is in charge of assessing property values in order to determine property taxes.

Problem

The biased property value assessment resulted in a regressive tax, where rich people paid less and poor people paid more. In addition, rich people appealed more often than poor people, resulting in an even greater reduction of property tax.

Solution

Ask a Question: What do we want to know? How to fairly value things for tax purposes. What are our metrics for success? Have both fairness and transparency in projections.
Data Acquisition and Cleaning: What data do we have and what do we need? Housing sales data between 2013-2019 and property characteristics (e.g. age, bedrooms, baths, etc.). How will we sample more data? Is our sample representative?
Exploratory Data Analysis and Visualization: What attributes are most predictive of sales price? Which are potentially problematic? Is the data predictive of sales price?

Takeaways

Accuracy is a necessary, but not sufficient condition of a fair system.
Fairness and transparency are context-dependent
Learn to work with contexts and consider how your data analysis will reshape them
Keep in mind the power and limits of data analysis

Probability and Generalization

Random Variables

Random Variable: Represents a numerical value determined by a probabilistic event.

Probability Mass Function

The distribution of a discrete random variable X provides the probability that X takes on each of its possible values
The probabilities for all possible values of random variable X in a Probability Mass Function must sum to 1
Each individual probability for a given value of X must be between 0 and 1

Joint Distributions: Probability of two or more random variables taking on a specific set of values, e.g. P(X=0, Y=10) = (0.5) ** 10 for 10 fair coin flips, where X is the number of heads and Y is the number of tails

Marginal Distribution: A way to go from the joint distribution to the distribution for a single variable, i.e. consider all possible values of Y that can simultaneously happen with X = x and sum over all of the joint probabilities.
∑y∈Y P(X=x,Y=y) = P(X=x)
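As a small sketch of marginalization, the joint distribution below is stored as a table with one row per value of X and one column per value of Y. The probabilities are made up for illustration. Summing across a row gives P(X=x), and summing down a column gives P(Y=y).

```python
import numpy as np

# Hypothetical joint PMF: rows are X in {0, 1}, columns are Y in {0, 1, 2}.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])

# Marginal of X: sum the joint probabilities over all values of Y.
marginal_x = joint.sum(axis=1)

# Marginal of Y: sum the joint probabilities over all values of X.
marginal_y = joint.sum(axis=0)

print(marginal_x)  # P(X=0), P(X=1)
print(marginal_y)  # P(Y=0), P(Y=1), P(Y=2)
```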

Independent Random Variables: Two random variables are independent if and only if knowing the outcome of one does not alter the probability of observing any outcome of the other.

Expectation and Variance

Expectation

The long run average of a random variable, also known as the expected value or expectation of a random variable.
E[X] = ∑ x∈X x ⋅ P(X=x)

Linearity of Expectation:

Use when working with linear combinations of random variables. This holds true even when the random variables are dependent on each other.
E[X+Y] = E[X] + E[Y]
E[cX] = c * E[X]
E[X−Y] = E[X] − E[Y]

However, E[XY] = E[X]E[Y] does not hold in general. It is guaranteed when X and Y are independent random variables.
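To illustrate, take a fair six-sided die roll X and let Y = X ** 2, so Y is fully determined by X. This example is an assumption chosen for illustration. The sum rule should still hold exactly, while the product rule fails.

```python
import numpy as np

# Fair six-sided die: values and their probabilities.
values = np.arange(1, 7)
probs = np.full(6, 1 / 6)

# Y = X ** 2 depends entirely on X.
y_values = values ** 2

E_X = np.sum(values * probs)
E_Y = np.sum(y_values * probs)

# Expectations of the sum and the product, computed from the definition.
E_sum = np.sum((values + y_values) * probs)
E_prod = np.sum((values * y_values) * probs)

print(E_sum, E_X + E_Y)   # equal, even though X and Y are dependent
print(E_prod, E_X * E_Y)  # not equal
```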

Variance

The variance of a random variable is a description of the variable’s spread, or how far its values tend to fall from the expectation.
Var(X) = E[(X − E[X])**2]
Var(X) = E[X**2] − (E[X])**2
Var(aX + b) = a**2 * Var(X) (holds for any random variable X and constants a and b)
Var(X + Y) = Var(X) + Var(Y) (holds when X and Y are independent)
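The first three identities can be checked on a fair die roll. This is a sketch with a hypothetical helper `expectation` that computes E[g(X)] from the probability mass function, and the constants a = 3 and b = 2 are arbitrary.

```python
import numpy as np

# Fair six-sided die: values and their probabilities.
values = np.arange(1, 7)
probs = np.full(6, 1 / 6)

def expectation(g):
    """E[g(X)] for the die roll X, computed from the PMF."""
    return np.sum(g(values) * probs)

mean_x = expectation(lambda x: x)

# Variance from the definition and from the shortcut formula.
var_definition = expectation(lambda x: (x - mean_x) ** 2)
var_shortcut = expectation(lambda x: x ** 2) - mean_x ** 2

# Variance of a linear transformation aX + b.
a, b = 3.0, 2.0
mean_linear = expectation(lambda x: a * x + b)
var_linear = expectation(lambda x: (a * x + b - mean_linear) ** 2)

print(var_definition, var_shortcut)       # the two formulas agree
print(var_linear, a ** 2 * var_definition)  # shifting by b has no effect
```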

Covariance

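If the covariance is positive, the random variables are positively correlated (i.e. they tend to move in the same direction, like two stocks that rise and fall together). If the covariance is negative, the random variables are negatively correlated. A covariance of 0 means the variables are uncorrelated. Independent variables always have covariance 0, but a covariance of 0 does not by itself imply independence.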

Cov(X,Y) = E[(X − E[X]) * (Y − E[Y])]
Cov(X,Y)= E[XY] − E[X]*E[Y]
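One caveat worth checking numerically: independence implies a covariance of 0, but the reverse can fail. In the sketch below, X is uniform on {-1, 0, 1} and Y = X ** 2, an example chosen for illustration. The covariance comes out to 0 even though Y is completely determined by X.

```python
import numpy as np

# X is uniform on {-1, 0, 1}; Y = X ** 2 is a function of X.
x_vals = np.array([-1, 0, 1])
probs = np.full(3, 1 / 3)
y_vals = x_vals ** 2

# Cov(X, Y) = E[XY] - E[X] * E[Y]
E_X = np.sum(x_vals * probs)
E_Y = np.sum(y_vals * probs)
E_XY = np.sum(x_vals * y_vals * probs)
cov_xy = E_XY - E_X * E_Y

# Independence would require P(X=0, Y=0) = P(X=0) * P(Y=0).
p_x0 = probs[x_vals == 0].sum()
p_y0 = probs[y_vals == 0].sum()
p_joint = probs[(x_vals == 0) & (y_vals == 0)].sum()

print(cov_xy)                # 0: uncorrelated
print(p_joint, p_x0 * p_y0)  # different: not independent
```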

Risk

Risk: Statistical risk is known as the expected loss, or the expected value of the model’s loss on randomly chosen points from the population.

R(θ) = E[(X − θ)**2]
To minimize risk, use the decomposition R(θ) = E[(X − E[X]) ** 2] + (E[X] − θ) ** 2
R(θ) = Bias ** 2 + Variance = (E[X] − θ) ** 2 + E[(X − E[X]) ** 2]
Only the bias term depends on θ, so the risk is minimized at θ = E[X]. A low variance means the random variable will likely take a value close to E[X], while a high variance means its values are spread far from E[X]
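The decomposition of the risk follows from adding and subtracting E[X] inside the square. A sketch of the steps, using only linearity of expectation and the fact that E[X − E[X]] = 0:

```latex
\begin{aligned}
R(\theta) &= E\left[(X - \theta)^2\right] \\
&= E\left[\big((X - E[X]) + (E[X] - \theta)\big)^2\right] \\
&= E\left[(X - E[X])^2\right] + 2\,(E[X] - \theta)\,E\left[X - E[X]\right] + (E[X] - \theta)^2 \\
&= E\left[(X - E[X])^2\right] + (E[X] - \theta)^2
\end{aligned}
```

The first term does not involve θ, and the second term is a square that reaches 0 at θ = E[X].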

Empirical Risk Minimization

Calculating the expected value of X requires complete knowledge of the population, since the expected value is defined as the sum, over every value X can take, of that value * the probability X takes it.
We can use a large random sample instead of the population when calculating the expected value of X
Thus we can approximate E[X] ~ mean(x)
Therefore, the empirical risk is the average loss computed on the large random sample instead of the expected loss over the population
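A minimal sketch of empirical risk minimization for the constant model with squared loss. The population here is assumed to be exponential with mean 2.0, and the seed and sample size are arbitrary. The sample mean should minimize the empirical risk and land near the population mean.

```python
import numpy as np

# Draw a large random sample from an assumed population
# (exponential with mean 2.0, chosen only for illustration).
rng = np.random.default_rng(42)
sample = rng.exponential(scale=2.0, size=10_000)

def empirical_risk(theta):
    """Average squared loss of the constant theta on the sample."""
    return np.mean((sample - theta) ** 2)

# For squared loss, the empirical risk minimizer is the sample mean.
theta_hat = sample.mean()

print(theta_hat)                        # close to the population mean 2.0
print(empirical_risk(theta_hat))        # lowest empirical risk
print(empirical_risk(theta_hat + 0.1))  # slightly higher
```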

Multiple Linear Regression

Questions to Ask

Published Oct 10, 2021

Just a kid looking to make it
Albert Su on Twitter