{"componentChunkName":"component---src-templates-post-template-js","path":"/posts/datascience/data-modeling","result":{"data":{"markdownRemark":{"id":"c6188013-2433-5871-8d25-8d885d6d1463","html":"<h2 id=\"motivation\" style=\"position:relative;\"><a href=\"#motivation\" aria-label=\"motivation permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Motivation</h2>\n<p>Predict unknown values based on known values</p>\n<h2 id=\"modeling-process\" style=\"position:relative;\"><a href=\"#modeling-process\" aria-label=\"modeling process permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Modeling Process</h2>\n<p><strong>Choose a Model:</strong> Constant model\n<strong>Choose a Loss Function:</strong> Squared Loss, Absolute Loss\nMinimize the average loss across the entire dataset to determine the optimal parameters</p>\n<h2 id=\"correlation\" style=\"position:relative;\"><a href=\"#correlation\" aria-label=\"correlation permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path 
fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Correlation</h2>\n<p><strong>Correlation Coefficient (r):</strong> Measures the strength of the LINEAR association between two variables</p>\n<ul>\n<li>\n<p>r is unitless and ranges between -1 and 1</p>\n<ul>\n<li>if r = 1, all points fall exactly on a line with positive slope</li>\n<li>if r = -1, all points fall exactly on a line with negative slope</li>\n<li>if r = 0, there is no linear association between x and y</li>\n</ul>\n</li>\n<li>r says nothing about causation or non-linear association. Remember correlation does not imply causation!</li>\n</ul>\n<p>r = average of the product of x and y, both measured in standard units\nx_i in standard units = (x_i - x_bar) / o_x</p>\n<p><strong>Covariance:</strong> r * std_x * std_y</p>\n<h2 id=\"simple-linear-regression\" style=\"position:relative;\"><a href=\"#simple-linear-regression\" aria-label=\"simple linear regression permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Simple Linear Regression</h2>\n<p>Motivation: Want to predict the value of y for any given x. 
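The correlation quantities above (r as the average product in standard units, and covariance as r * std_x * std_y) can be sketched in a few lines of numpy; the arrays below are made-up illustration data:

```python
import numpy as np

# Made-up sample data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.9, 3.5, 3.8, 5.1])

# Standard units: subtract the mean, divide by the SD
x_su = (x - x.mean()) / x.std()
y_su = (y - y.mean()) / y.std()

# r is the average of the product of x and y in standard units
r = np.mean(x_su * y_su)

# Covariance = r * std_x * std_y
cov = r * x.std() * y.std()

# Sanity check against numpy's built-in correlation coefficient
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
```

Note that np.std defaults to the population SD (ddof=0), which is what the standard-units formula here assumes.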
A naive attempt at predicting children’s heights from their parents’ heights is to compute the average value of y for each x value, then use those averages as predictions.</p>\n<p>Simple linear regression: y = a + bx\nTo determine the optimal a and b, choose a loss function. If the loss function is squared loss, the objective function is mean squared error (MSE).</p>\n<p>To solve for the optimal parameters, we use the objective function and minimize the mean squared error by hand using calculus.</p>\n<p>b = r * (o_y / o_x)\na = y_bar - b * x_bar\nThis gives us the optimal estimates of the parameters a and b.</p>\n<h2 id=\"loss-surfaces\" style=\"position:relative;\"><a href=\"#loss-surfaces\" aria-label=\"loss surfaces permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Loss Surfaces</h2>\n<p>Usually 3D, with the axes being a, b, and the loss of the model y = a + bx</p>\n<h2 id=\"model-interpretation\" style=\"position:relative;\"><a href=\"#model-interpretation\" aria-label=\"model interpretation permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Model 
Interpretation</h2>\n<p>Slope - measured in units of y per unit of x\nNew data needs to be similar to the original data - you cannot predict a chihuahua’s weight given a model for golden retrievers\nVisualize, then quantify - watch out for Anscombe’s quartet\n<span\n      class=\"gatsby-resp-image-wrapper\"\n      style=\"position: relative; display: block; margin-left: auto; margin-right: auto; max-width: 449px; \"\n    >\n      <a\n    class=\"gatsby-resp-image-link\"\n    href=\"/static/6ac2ca895466097117e4582ba0a2174c/93dc1/Anscombes%20quartet.jpg\"\n    style=\"display: block\"\n    target=\"_blank\"\n    rel=\"noopener\"\n  >\n    <span\n    class=\"gatsby-resp-image-background-image\"\n    style=\"padding-bottom: 92.91666666666667%; position: relative; bottom: 0; left: 0; background-image: url('data:image/jpeg;base64,/9j/2wBDABALDA4MChAODQ4SERATGCgaGBYWGDEjJR0oOjM9PDkzODdASFxOQERXRTc4UG1RV19iZ2hnPk1xeXBkeFxlZ2P/2wBDARESEhgVGC8aGi9jQjhCY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2NjY2P/wgARCAATABQDASIAAhEBAxEB/8QAFwABAQEBAAAAAAAAAAAAAAAAAAIBBf/EABQBAQAAAAAAAAAAAAAAAAAAAAD/2gAMAwEAAhADEAAAAe8A0JCgf//EABYQAAMAAAAAAAAAAAAAAAAAAAEQIP/aAAgBAQABBQKwv//EABQRAQAAAAAAAAAAAAAAAAAAACD/2gAIAQMBAT8BH//EABQRAQAAAAAAAAAAAAAAAAAAACD/2gAIAQIBAT8BH//EABQQAQAAAAAAAAAAAAAAAAAAADD/2gAIAQEABj8CH//EABgQAAIDAAAAAAAAAAAAAAAAAAEQIDFx/9oACAEBAAE/IYY7Ff/aAAwDAQACAAMAAAAQow88/8QAFBEBAAAAAAAAAAAAAAAAAAAAIP/aAAgBAwEBPxAf/8QAFBEBAAAAAAAAAAAAAAAAAAAAIP/aAAgBAgEBPxAf/8QAGxAAAwADAQEAAAAAAAAAAAAAAAERITFhEIH/2gAIAQEAAT8QduR34JDTT4WvInIS7EWhY1zz/9k='); background-size: cover; display: block;\"\n  ></span>\n  <picture>\n          <source\n              srcset=\"/static/6ac2ca895466097117e4582ba0a2174c/8ac56/Anscombes%20quartet.webp 240w,\n/static/6ac2ca895466097117e4582ba0a2174c/57bab/Anscombes%20quartet.webp 449w\"\n              sizes=\"(max-width: 449px) 100vw, 449px\"\n              type=\"image/webp\"\n            />\n          <source\n            
srcset=\"/static/6ac2ca895466097117e4582ba0a2174c/09b79/Anscombes%20quartet.jpg 240w,\n/static/6ac2ca895466097117e4582ba0a2174c/93dc1/Anscombes%20quartet.jpg 449w\"\n            sizes=\"(max-width: 449px) 100vw, 449px\"\n            type=\"image/jpeg\"\n          />\n          <img\n            class=\"gatsby-resp-image-image\"\n            src=\"/static/6ac2ca895466097117e4582ba0a2174c/93dc1/Anscombes%20quartet.jpg\"\n            alt=\"AnscombesQuartetGraphs\"\n            title=\"AnscombesQuartetGraphs\"\n            loading=\"lazy\"\n            style=\"width:100%;height:100%;margin:0;vertical-align:middle;position:absolute;top:0;left:0;\"\n          />\n        </picture>\n  </a>\n    </span></p>\n<h2 id=\"terminology\" style=\"position:relative;\"><a href=\"#terminology\" aria-label=\"terminology permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Terminology</h2>\n<p><strong>Names for the x variable:</strong></p>\n<ul>\n<li>Feature</li>\n<li>Covariate</li>\n<li>Independent variable</li>\n<li>Explanatory variable</li>\n<li>Predictor</li>\n<li>Input</li>\n<li>Regressor</li>\n</ul>\n<p><strong>Names for the y variable:</strong></p>\n<ul>\n<li>Output</li>\n<li>Outcome</li>\n<li>Response</li>\n<li>Dependent variable</li>\n</ul>\n<h2 id=\"adding-independent-variables\" style=\"position:relative;\"><a href=\"#adding-independent-variables\" aria-label=\"adding independent variables permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" 
viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Adding independent variables</h2>\n<p>Use a weighted sum of coefficients and input variables.</p>\n<h2 id=\"evaluating-models\" style=\"position:relative;\"><a href=\"#evaluating-models\" aria-label=\"evaluating models permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Evaluating Models</h2>\n<ul>\n<li>Look at Mean Squared Error (MSE) or Root Mean Squared Error (RMSE)</li>\n<li>Look at the correlations</li>\n<li>Look at a residual plot</li>\n</ul>\n<p>Root Mean Squared Error: Square root of the mean squared error. RMSE is in the same units as y. A lower RMSE indicates more accurate predictions. Adding features to a model fit on the same data can never increase the RMSE; it can only keep it the same or lower it</p>\n<p><strong>R squared</strong>: Used to measure the strength of the linear association between our actual y and predicted y. 
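As a rough sketch of these evaluation metrics (made-up data; np.polyfit used here for the least-squares line), RMSE and R^2 can be computed like this:

```python
import numpy as np

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.9, 3.5, 3.8, 5.1])

# Fit a least-squares line y = a + b*x
b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

# Root Mean Squared Error: same units as y, lower is better
rmse = np.sqrt(np.mean((y - y_hat) ** 2))

# R^2 = variance of fitted values / variance of y
# (this identity holds for least-squares fits with an intercept)
r_squared = np.var(y_hat) / np.var(y)
```

For a simple linear fit like this one, r_squared also equals the square of the correlation coefficient between x and y.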
aka coefficient of determination.</p>\n<p>R^2 = variance of fitted values / variance of y</p>\n<h2 id=\"ordinary-least-squares\" style=\"position:relative;\"><a href=\"#ordinary-least-squares\" aria-label=\"ordinary least squares permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Ordinary Least Squares</h2>\n<h3 id=\"multiple-regression-using-matrix-multiplication\" style=\"position:relative;\"><a href=\"#multiple-regression-using-matrix-multiplication\" aria-label=\"multiple regression using matrix multiplication permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Multiple regression using matrix multiplication</h3>\n<p>Multiple regression is of the form\n<code class=\"language-text\">y = theta_0 + theta_1 * x_1 + theta_2 * x_2 + ... 
+ theta_p * x_p</code>\nWe can restate this as a dot product\n<code class=\"language-text\">y = x^T * theta</code></p>\n<h3 id=\"design-matrix\" style=\"position:relative;\"><a href=\"#design-matrix\" aria-label=\"design matrix permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Design Matrix</h3>\n<p>Motivation: the mean squared error involves all observations at once, it would be nice to express our model in terms of all observations, not just one. We can put them into a design matrix.</p>\n<p><strong>Rows:</strong> Correspond to observations. e.g. all features for data point 3\n<strong>Columns:</strong> Correspond to features. e.g. feature 1, for all data points</p>\n<h2 id=\"residuals\" style=\"position:relative;\"><a href=\"#residuals\" aria-label=\"residuals permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Residuals</h2>\n<p>Residuals are the difference between an actual and predicted value, in the regression context. 
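A small numpy sketch (made-up data) ties the design matrix, the least-squares solution, and residuals together; np.linalg.lstsq solves the least-squares problem numerically:

```python
import numpy as np

# Made-up data: 5 observations, 2 features
X_raw = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 4.0],
                  [4.0, 3.0],
                  [5.0, 6.0]])
y = np.array([3.0, 3.5, 8.0, 8.5, 13.0])

# Design matrix: rows are observations, columns are features,
# with a leading column of 1s for the intercept term
X = np.hstack([np.ones((len(y), 1)), X_raw])

# Least-squares estimate of theta (solution of X^T X theta = X^T y)
theta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals: actual minus predicted
e = y - X @ theta

# Residuals are orthogonal to the span of X; with an intercept
# column this forces the residuals to sum to (approximately) zero
assert np.allclose(X.T @ e, 0)
```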
We use the letter <code class=\"language-text\">e</code> to denote residuals, <code class=\"language-text\">e_i = y_i - yhat_i</code></p>\n<p>The mean squared error is equal to the mean of the squares of its residuals.\nWe can stack all n residuals into a vector, called the residual vector.\n<code class=\"language-text\">residual vector = true y values - predicted y values</code></p>\n<p>Residuals are orthogonal to the span of X.\nIf our model has an intercept term (when our design matrix has a column of all 1s):</p>\n<ul>\n<li>The sum and mean of the residuals is 0</li>\n<li>The average true y value is equal to the average predicted y value</li>\n</ul>\n<p><strong>Residual Plots:</strong></p>\n<ul>\n<li>With simple linear regression with only one independent variable, we plot residuals vs x</li>\n<li>In the general case, use residuals on the y axis vs fitted values on the x axis</li>\n<li>A good residual plot has no pattern; a curve is a sign that transformations or additional variables can help</li>\n<li>A residual plot should have a similar vertical spread throughout the entire plot. 
If it doesn’t, there are probably issues with the accuracy of the predictions</li>\n</ul>\n<h2 id=\"unique-solutions\" style=\"position:relative;\"><a href=\"#unique-solutions\" aria-label=\"unique solutions permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Unique Solutions</h2>\n<ul>\n<li>There is always at least one model parameter that minimizes average loss.</li>\n<li>Constant model with a squared loss: a unique solution always exists</li>\n<li>Simple linear model with a squared loss: a unique solution exists whenever x is not constant, since any non-constant x has a unique mean, SD, and correlation coefficient</li>\n<li>Constant model with absolute loss: unique when there is an odd number of y values; if there is an even number of y values, there are infinitely many solutions.</li>\n</ul>\n<h2 id=\"invertability-of-x-transpose--x\" style=\"position:relative;\"><a href=\"#invertability-of-x-transpose--x\" aria-label=\"invertability of x transpose  x permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Invertibility of X transpose * X</h2>\n<ul>\n<li>X^T * X is invertible iff it is full 
rank</li>\n<li>X transpose * X and X have the same rank</li>\n<li>Thus, X^T * X is invertible iff X has rank p + 1 (full column rank)</li>\n</ul>\n<h1 id=\"real-world-example---fairness-in-housing-appraisal\" style=\"position:relative;\"><a href=\"#real-world-example---fairness-in-housing-appraisal\" aria-label=\"real world example   fairness in housing appraisal permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Real World Example - Fairness in Housing Appraisal</h1>\n<h2 id=\"situation\" style=\"position:relative;\"><a href=\"#situation\" aria-label=\"situation permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Situation</h2>\n<p>The Cook County Assessor’s Office is in charge of assessing property values in order to determine property taxes. 
</p>\n<h2 id=\"problem\" style=\"position:relative;\"><a href=\"#problem\" aria-label=\"problem permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Problem</h2>\n<p>The biased property value assessment resulted in a regressive tax, where rich people paid less and poor people paid more. In addition, rich people appealed more often than poor people, resulting in an even greater reduction of their property tax. </p>\n<h2 id=\"solution\" style=\"position:relative;\"><a href=\"#solution\" aria-label=\"solution permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Solution</h2>\n<ol>\n<li><strong>Ask a Question:</strong> What do we want to know? How to fairly value things for tax purposes. What are our metrics for success? Have both fairness and transparency in projections.</li>\n<li><strong>Data Acquisition and Cleaning:</strong> What data do we have and what do we need? Housing sales data from 2013 to 2019 and property characteristics (e.g., age, bedrooms, baths). How will we sample more data? 
Is our sample representative?</li>\n<li><strong>Exploratory Data Analysis and Visualization:</strong> What attributes are most predictive of sales price? Which are potentially problematic? Is the data predictive of sales price? </li>\n</ol>\n<h2 id=\"takeaways\" style=\"position:relative;\"><a href=\"#takeaways\" aria-label=\"takeaways permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Takeaways</h2>\n<ol>\n<li>Accuracy is a necessary, but not sufficient, condition for a fair system.</li>\n<li>Fairness and transparency are context-dependent.</li>\n<li>Learn to work with contexts and consider how your data analysis will reshape them.</li>\n<li>Keep in mind the power and limits of data analysis.</li>\n</ol>\n<h1 id=\"probability-and-generalization\" style=\"position:relative;\"><a href=\"#probability-and-generalization\" aria-label=\"probability and generalization permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Probability and Generalization</h1>\n<h2 id=\"random-variables\" style=\"position:relative;\"><a 
href=\"#random-variables\" aria-label=\"random variables permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Random Variables</h2>\n<p><strong>Random Variable:</strong> Represents a numerical value determined by a probabilistic event. </p>\n<h3 id=\"probability-mass-function\" style=\"position:relative;\"><a href=\"#probability-mass-function\" aria-label=\"probability mass function permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Probability Mass Function</h3>\n<ul>\n<li>The distribution of a random variable X provides the probability that X takes on each of its possible values(discrete)</li>\n<li>The probabilities for all possible values of random variable X in a Probability Mass Function must sum to 1</li>\n<li>Each individual probability for a given value X must be between 0 and 1.\n<strong>Joint Distributions:</strong> Probability of two or more random variables taking on a specific set of values. 
E.g., for 10 coin flips where X is the number of heads and Y the number of tails, P(X=0, Y=10) = (0.5) ** 10\n<strong>Marginal Distribution:</strong> A way to go from the joint distribution to the distribution for a single variable. I.e., consider all possible values of Y that can simultaneously happen with X and sum over all of the joint probabilities.<br>\n∑y∈Y P(X=x,Y=y) = P(X=x)</li>\n</ul>\n<p><strong>Independent Random Variables:</strong> Any two random variables are independent if and only if knowing the outcome of one variable does not alter the probability of observing any outcomes of the other variable.</p>\n<h2 id=\"expectation-and-variance\" style=\"position:relative;\"><a href=\"#expectation-and-variance\" aria-label=\"expectation and variance permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Expectation and Variance</h2>\n<h3 id=\"expectation\" style=\"position:relative;\"><a href=\"#expectation\" aria-label=\"expectation permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Expectation</h3>\n<ul>\n<li>The long run average of a random 
variable, also known as the expected value or expectation of a random variable.  </li>\n<li>E[X] = ∑ x∈X x ⋅ P(X=x)</li>\n</ul>\n<p><strong>Linearity of Expectation:</strong> </p>\n<ul>\n<li>Use when working with linear combinations of random variables. This holds true even when the random variables are dependent on each other.</li>\n<li>E[X+Y] = E[X] + E[Y]  </li>\n<li>E[cX] = c * E[X]  </li>\n<li>E[X−Y] = E[X] − E[Y]  </li>\n</ul>\n<p>However, E[XY] = E[X]E[Y] is guaranteed only when X and Y are independent random variables.</p>\n<h3 id=\"variance\" style=\"position:relative;\"><a href=\"#variance\" aria-label=\"variance permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Variance</h3>\n<ul>\n<li>The variance of a random variable is a description of the variable’s spread - how far its values tend to be from the expected value.  
</li>\n<li>Var(X) = E[(X − E[X])**2]</li>\n<li>Var(X) = E[X**2] − (E[X])**2  </li>\n<li>Var(aX + b) = a**2 * Var(X) holds for any random variable X and constants a and b</li>\n<li>Var(X + Y) = Var(X) + Var(Y) holds if X and Y are independent  </li>\n</ul>\n<h3 id=\"covariance\" style=\"position:relative;\"><a href=\"#covariance\" aria-label=\"covariance permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Covariance</h3>\n<p>If the covariance is positive, the random variables are positively correlated (i.e., they tend to move in the same direction, as with stocks). If the covariance is negative, the random variables are negatively correlated. 
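A quick simulation sketch (assumed setup) of why zero covariance is weaker than independence: here Y = X**2 is completely determined by X, yet Cov(X, Y) comes out near zero because X is symmetric about 0.

```python
import numpy as np

rng = np.random.default_rng(42)

# X symmetric around 0; Y is a deterministic function of X,
# so X and Y are clearly dependent
x = rng.uniform(-1, 1, size=100_000)
y = x ** 2

# Cov(X, Y) = E[XY] - E[X]E[Y]
cov = np.mean(x * y) - np.mean(x) * np.mean(y)
# cov is close to 0 even though X and Y are not independent
```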
A covariance of 0 indicates the variables are uncorrelated; independence implies zero covariance, but zero covariance does not imply independence.</p>\n<ul>\n<li>Cov(X,Y) = E[(X − E[X]) * (Y − E[Y])]</li>\n<li>Cov(X,Y) = E[XY] − E[X]*E[Y]</li>\n</ul>\n<h2 id=\"risk\" style=\"position:relative;\"><a href=\"#risk\" aria-label=\"risk permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Risk</h2>\n<p><strong>Risk:</strong> Statistical risk is the expected loss: the expected value of the model’s loss on randomly chosen points from the population.</p>\n<ul>\n<li>R(θ) = E[(X − θ)**2]</li>\n<li>The risk decomposes as R(θ) = E[(X − E[X]) ** 2] + (E[X] − θ) ** 2, so it is minimized by choosing θ = E[X]</li>\n<li>R(θ) = Bias**2 + Variance = (E[X] − θ) ** 2 + E[(X − E[X]) ** 2]</li>\n<li>A low variance means the random variable will likely take a value close to its expected value, while a high variance means it will often take values far from it</li>\n</ul>\n<h3 id=\"empirical-risk-minimization\" style=\"position:relative;\"><a href=\"#empirical-risk-minimization\" aria-label=\"empirical risk minimization permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 
6z\"></path></svg></a>Empirical Risk Minimization</h3>\n<ul>\n<li>Calculating the expected value of X requires complete knowledge of the population, since the expected value is defined as the sum, over each possible value, of that value times the probability that X takes it. </li>\n<li>We can use a large random sample instead of the population when calculating the expected value of X</li>\n<li>Thus we can approximate E[X] ~ mean(x)</li>\n<li>Therefore, the empirical risk is the average loss over the large random sample, used as an approximation of the true risk over the population</li>\n</ul>\n<h1 id=\"multiple-linear-regression\" style=\"position:relative;\"><a href=\"#multiple-linear-regression\" aria-label=\"multiple linear regression permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Multiple Linear Regression</h1>\n<h2 id=\"questions-to-ask\" style=\"position:relative;\"><a href=\"#questions-to-ask\" aria-label=\"questions to ask permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Questions to 
Ask</h2>","fields":{"slug":"/posts/datascience/data-modeling","tagSlugs":["/tag/notes/","/tag/data-science/"]},"frontmatter":{"date":"2021-10-10T23:46:37.121Z","description":"Notes on the different models","tags":["Notes","Data Science"],"title":"Data 100 Modeling"}}},"pageContext":{"slug":"/posts/datascience/data-modeling"}},"staticQueryHashes":["251939775","401334301","825871152"]}