{"componentChunkName":"component---src-templates-post-template-js","path":"/posts/academics/probability","result":{"data":{"markdownRemark":{"id":"791c44b2-5399-5847-83b5-366f1e238e11","html":"<p>Introduction to Probability, 2nd Edition\nFollows the course, can borrow from engineering library</p>\n<p>TODO:\nSet calendar for homework self grades and resubmission time\nSet midterm time on calendar\nChoose discussion time\nCountable vs uncountable</p>\n<h2 id=\"probability-math--a-way-of-thinking-about-uncertainty\" style=\"position:relative;\"><a href=\"#probability-math--a-way-of-thinking-about-uncertainty\" aria-label=\"probability math  a way of thinking about uncertainty permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Probability: Math + a way of thinking about uncertainty</h2>\n<p>Humans are naturally good at thinking about probability. For example, you can easily walk down a crowded street and estimate what path allows you to avoid bumping into people. However, humans are pretty awful at turning those innate estimates into quantifiable numbers. The goal is to turn that natural probability estimation into something that a robot can understand.</p>\n<p><strong>Quantify + Model</strong></p>\n<p>Inference. Humans are also really good at taking in noisy observations and inferring what is really happening. </p>\n<p>How do I generate some image from AI? Take a natural image and repeatedly add noise to it. You then end up with an image of pure noise. 
The idea behind Stable Diffusion and DALL-E 2 is to reverse this process. We generate a bunch of noise, then reduce noise until we end up with an image.</p>\n<p>Modern theory of probability has an axiomatic starting point. Before the 1930s, probability was a heuristic field.\nAll of probability consists of logical deductions from that starting point.</p>\n<p><em>Probability space:</em> A probability space (Ω, F, P) is a triple.\nΩ is a set, the sample space.\nF is a family of subsets of Ω, called events\nP is a probability measure</p>\n<p>Rules: </p>\n<ul>\n<li>\n<p>F is a σ-algebra: it contains Ω itself and is closed under complements and countable unions. </p>\n<ul>\n<li>complements: If A ∈ F, then its complement A^C ∈ F </li>\n<li>countable unions: If A1, A2, … ∈ F, then the union of the Ai ∈ F</li>\n<li>By De Morgan’s Laws, this implies that F is also closed under countable intersections</li>\n</ul>\n</li>\n<li>\n<p>P is a function that maps events to a number between 0 and 1. <code class=\"language-text\">P: F -> [0,1]</code></p>\n<ul>\n<li>\n<p>Probability measures must obey the Kolmogorov axioms</p>\n<ol>\n<li><code class=\"language-text\">P(A) >= 0</code> The probability of any event must be non-negative</li>\n<li><code class=\"language-text\">P(Ω) = 1</code> The probability that some outcome occurs must be equal to 1</li>\n<li>σ-additivity aka countable additivity. If A1, A2, … is a countable sequence of pairwise disjoint events, then <code class=\"language-text\">P(U Ai) = ΣP(Ai)</code></li>\n<li>Why only countable additivity: if <code class=\"language-text\">P = Unif(0,1)</code>, then P({x}) = 0 for every x ∈ [0,1]. Yet <code class=\"language-text\">1 = P([0,1])</code> while <code class=\"language-text\">Σ P({x}) = 0</code>, so additivity cannot hold over the uncountable union of singletons &#x3C;- Dive into this later</li>\n</ol>\n</li>\n</ul>\n</li>\n</ul>\n<p>Examples of the Above Rules</p>\n<ol>\n<li>\n<p>Flip a coin with bias p (heads with probability p, tails with probability 1-p). 
</p>\n<ul>\n<li>Ω = {H, T} The sample space is just heads and tails</li>\n<li>F = 2^Ω = {Ø, {H}, {T}, {H, T}} These are all subsets of Ω</li>\n<li>P({H}) = p, P({T}) = 1-p, P(Ø) = 0, P({H, T}) = 1. Ø and {H, T} are here because F must contain Ω and be closed under complements</li>\n<li>We then want to create an experiment of n “independent” flips. Is our vocabulary rich enough for this? No</li>\n<li>Ω = {H, T}^n, F = 2^Ω, P({(f1, f2, … , fn)}) = p^{number of flips fi = H} * (1-p)^{number of flips fi = T}</li>\n<li>There can be multiple valid choices of probability space. It just needs to be rich enough to capture all the probabilities of interest</li>\n</ul>\n</li>\n<li>\n<p>Ω now equals all possible configurations of atoms in the universe. How do we represent this?</p>\n<ul>\n<li>Let A = {set of configurations that lead to the flip landing heads}</li>\n<li>Let B = {set of configurations that lead to the flip landing tails}</li>\n<li>F = {Ø, A, B, Ω}, where Ω = A U B</li>\n<li>P(A) = p, P(B) = 1-p, P(Ø) = 0, P(Ω) = 1</li>\n</ul>\n</li>\n</ol>\n<p>Note: The probability space is usually implicitly described, except for HW 1.\nLol</p>\n<h2 id=\"all-of-the-rules-of-probability-you-are-likely-familiar-with-are-consequences-of-the-three-axioms\" style=\"position:relative;\"><a href=\"#all-of-the-rules-of-probability-you-are-likely-familiar-with-are-consequences-of-the-three-axioms\" aria-label=\"all of the rules of probability you are likely familiar with are consequences of the three axioms permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>All of the rules of probability you are likely familiar with are 
consequences of the three axioms.</h2>\n<p>Ex: If <code class=\"language-text\">A ∈ F</code>, <code class=\"language-text\">P(A^C) = 1 - P(A)</code>. The probability of A’s complement is 1 - probability of A.</p>\n<ul>\n<li>\n<p>Since F is a σ-algebra, A^C ∈ F. </p>\n<ul>\n<li>Ω = A U A^C. A and its complement form a disjoint union equal to Ω</li>\n<li>1 = P(Ω) Using axiom 2</li>\n<li>P(Ω) = P(A U A^C) = P(A) + P(A^C) Using axiom 3</li>\n</ul>\n</li>\n</ul>\n<p>Ex: If A and B are events and A is a subset of B, then <code class=\"language-text\">P(A) &lt;= P(B)</code>\nB = A U (B \\ A). B is equal to the disjoint union of A and B minus the elements of A\nP(B) = P(A U (B \\ A)) = P(A) + P(B\\A)  Using axiom 3\nP(A) + P(B\\A) >= P(A) Using axiom 1</p>\n<p>Ex: If A and B are events, then P(A U B) = P(A) + P(B) - P(A ∩ B). Aka the inclusion-exclusion principle\nProof: <code class=\"language-text\">AUB, A∩B ∈ F</code>\nA U B = B U (A \\ B), a disjoint union, and A \\ B = A\\(A∩B)\nP(A U B) = P(B) + P(A\\(A∩B)) Using axiom 3\nP(B) + P(A\\(A∩B)) = P(B) + P(A) - P(A ∩B) Using axiom 3 on the disjoint union A = (A\\(A∩B)) U (A∩B)</p>\n<p>Ex: Ω = countable set\nF = 2^Ω, each individual outcome is an event {w} ∈ F, for all w ∈ Ω\nP(A) = Σ<sub>w ∈ A</sub> P({w}) for all A ⊆ Ω\nThis is a discrete sample space.</p>\n<p>Ex: Law of total probability. Suppose A1, A2, … partition Ω (Ω = U Ai, with the Ai pairwise disjoint). 
Mutually exclusive, collectively exhaustive.\nThen, for any B ∈ F, <code class=\"language-text\">P(B) = Σ P(B ∩ Ai)</code>\nB = U(Ai ∩ B) is a disjoint union, so apply axiom 3</p>\n<p>The goal of probability is to take complicated events and break them down into, or approximate them with, multiple simpler events.</p>\n<p>Mathematicians: Given a probability space (model), what can I say about outcomes of experiments that derive from this model?\nStatisticians: Given outcomes, how do I choose a good model (probability space)?\nEngineers: Given a real world problem, how do I choose a model that captures the essence of the problem and then use the model to draw insight?</p>\n<h1 id=\"lecture-2-conditional-probability-independence-random-variables\" style=\"position:relative;\"><a href=\"#lecture-2-conditional-probability-independence-random-variables\" aria-label=\"lecture 2 conditional probability independence random variables permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Lecture 2: Conditional Probability, Independence, Random Variables</h1>\n<p><strong>Conditional Probability:</strong> If <code class=\"language-text\">B ∈ F</code> and <code class=\"language-text\">P(B) > 0</code>, then the conditional probability of A given B is P(A|B) = P(A ∩ B) / P(B)</p>\n<p><em>Intuition</em>: The probability that A occurs given we know that B occurred</p>\n<p>Formal Definition: P(.|B) gives a restriction of our model (Ω, F, P) to those samples in B\n<code class=\"language-text\">(B, F|B, P(.|B))</code> is a probability space itself, where <code 
class=\"language-text\">F|B = {A ∩ B: A ∈ F}</code></p>\n<p><em>Ex.</em> If A1, A2, … ∈ F partition Ω and P(Ai) > 0 for all i, then\n<code class=\"language-text\">P(B) = Σ P(B ∩ Ai) = Σ P(B|Ai) * P(Ai)</code></p>\n<h2 id=\"bayes-rule\" style=\"position:relative;\"><a href=\"#bayes-rule\" aria-label=\"bayes rule permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Bayes Rule</h2>\n<p>Sometimes, it’s easy to express P(B|A), but we are really interested in P(A|B)\nLet A be the state of the experiment, B be the observation data</p>\n<p>P(A|B) - given the observation, we want the state. This is the task of inference.</p>\n<p>Suppose we forgot Bayes Rule. Let’s try to recreate it from the definition of conditional probability\n<code class=\"language-text\">P(A|B) = P(B ∩ A) / P(B) = P(A) * P(B|A) / P(B)</code></p>\n<p><em>Ex.</em> Suppose we have the following model: 85% of students got a pass (P), 15% of students got a no pass (NP). Of the students that passed, 60% went to lecture (Y) and 40% did not (N). Of the students that got a no pass, 10% went to lecture and 90% didn’t.</p>\n<p>We want P(NP | N) / P(NP | Y). How much more likely are you to NP when you don’t attend lecture vs when you do attend lecture?\nP(NP | N) / P(NP | Y) = P(NP ∩ N) / P(NP ∩ Y) * P(Y) / P(N)\nP(NP ∩ N) = 0.15 * 0.9 = 0.135, P(NP ∩ Y) = 0.15 * 0.1 = 0.015, P(Y) = 0.85 * 0.6 + 0.015 = 0.525, P(N) = 0.475\nSo the ratio is (0.135 / 0.015) * (0.525 / 0.475) ≈ 9.9</p>\n<p><em>Ex.</em> Suppose we roll two dice and the sum is 10. 
What is the probability that the first roll was 4?</p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre class=\"language-text\"><code class=\"language-text\">B = {sum of rolls is 10}\nA = {first roll is 4}\n\nP(A|B) = P(A ∩ B) / P(B)\n       = P({first roll is 4, sum of rolls is 10}) / P({sum of rolls is 10})\n       = P({(4,6)}) / P({(4,6), (5,5), (6,4)})\n       = (1/36) / (3/36)\n       = 1/3</code></pre></div>\n<h3 id=\"conditioning-also-allows-us-to-usefully-decompose-intersections-of-events\" style=\"position:relative;\"><a href=\"#conditioning-also-allows-us-to-usefully-decompose-intersections-of-events\" aria-label=\"conditioning also allows us to usefully decompose intersections of events permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Conditioning also allows us to usefully decompose intersections of events.</h3>\n<p><em>Ex.</em> Consider events A<sub>1</sub>, …, A<sub>n</sub>\nP(∩<sup>n</sup><sub>i = 1</sub> A<sub>i</sub>) = P(A<sub>1</sub> | ∩<sup>n</sup><sub>i = 2</sub> A<sub>i</sub>) P(∩<sup>n</sup><sub>i = 2</sub> A<sub>i</sub>)</p>\n<p><em>Ex.</em> Given n people in a room, what is the probability that at least 2 people share a birthday?</p>\n<p>A<sub>i</sub> = {person i does not share a birthday with any of the people j = 1, …, i-1}</p>\n<p>P(A<sub>i</sub> | ∩ <sup>i-1</sup><sub>j=1</sub> A<sub>j</sub>) = (365 - (i-1)) / 365</p>\n<p>This is because person i’s birthday needs to land on one of the 365 - (i-1) days not already taken by the first i-1 people.\nP(no shared birthdays) = P(∩ A<sub>i</sub>) = ∏<sub>i=1</sub><sup>n</sup> (1 - (i-1)/365)</p>\n<p>Use 1-x &#x3C;= e^-x on each factor of the product.\nThe bound is the first-order Taylor series approximation of e^-x</p>\n<p>Let n = 23 - there 
are 23 kids in a class.\nP(shared birthday) >= 1 - e^(-Σ<sub>i=1</sub><sup>n</sup> (i-1)/365) = 1 - e^(-n(n-1)/730) ≈ 0.5</p>\n<h2 id=\"independence\" style=\"position:relative;\"><a href=\"#independence\" aria-label=\"independence permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Independence</h2>\n<p>Events A, B are independent if P(A ∩ B) = P(A) * P(B)\nDisjoint events with positive probability are not independent\n<em>Note:</em> in the special case where P(A) > 0: A,B independent &#x3C;=> P(B|A) = P(B)</p>\n<p>In general, a collection A<sub>1</sub>, A<sub>2</sub>, … ∈ F are independent events if P(∩<sub>i ∈ S</sub> A<sub>i</sub>) = ∏ <sub>i ∈ S</sub> P(A<sub>i</sub>) for all finite sets of indices S</p>\n<p>If A<sub>1</sub>, A<sub>2</sub>, … ∈ F are independent, then B<sub>1</sub>, B<sub>2</sub>, … are independent, where each B<sub>i</sub> = A<sub>i</sub> or A<sub>i</sub><sup>C</sup>\nIntuitively, knowing whether A occurred is the same as knowing whether A’s complement occurred</p>\n<p>Let D = ∩<sub>i=2</sub><sup>n</sup> A<sub>i</sub>\nP(D) = P(A<sub>1</sub> ∩ D) + P(A<sub>1</sub><sup>C</sup> ∩ D), and by independence P(A<sub>1</sub> ∩ D) = ∏<sub>i=1</sub><sup>n</sup> P(A<sub>i</sub>) and P(D) = ∏<sub>i=2</sub><sup>n</sup> P(A<sub>i</sub>)\nP(A<sub>1</sub><sup>C</sup> ∩ D) = (1 - P(A<sub>1</sub>)) ∏<sub>i=2</sub><sup>n</sup> P(A<sub>i</sub>) = P(A<sub>1</sub><sup>C</sup>) ∏<sub>i=2</sub><sup>n</sup> P(A<sub>i</sub>)\nThe same argument works for any finite subset of indices, which shows A<sub>1</sub><sup>C</sup>, A<sub>2</sub>, … , A<sub>n</sub> are independent\nRepeating this one index at a time, B<sub>1</sub>, B<sub>2</sub>, … , B<sub>n</sub> are independent</p>\n<h3 id=\"conditional-independence\" style=\"position:relative;\"><a href=\"#conditional-independence\" aria-label=\"conditional independence permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 
9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Conditional Independence</h3>\n<p>Often we have two events that we think of as independent but aren’t. There might be a confounding variable C\nIf A,B,C are such that P(C) > 0 and P(A∩B|C) = P(A|C) P(B|C), then A,B are said to be conditionally independent given C</p>\n<p>Consider two coins with biases p != q\nPick a coin at random, and flip it twice. H<sub>i</sub> = event that flip i is heads\nAre the two coin flips independent? No. If p is 1 and q is 0, then we know what the second coin flip is given our first coin flip.</p>\n<p>P(H<sub>i</sub>) = (p + q) / 2\nP(H<sub>1</sub> ∩ H<sub>2</sub>) = (p<sup>2</sup> + q<sup>2</sup>) / 2 != P(H<sub>1</sub>) P(H<sub>2</sub>) = (p+q)<sup>2</sup> / 4\nC = {pick coin p}. H<sub>1</sub>, H<sub>2</sub> are conditionally independent given C, since P(H<sub>1</sub> ∩ H<sub>2</sub> | C) = p<sup>2</sup> = P(H<sub>1</sub> | C) P(H<sub>2</sub> | C)</p>\n<p>TODO:\nSet calendar for homework self grades and resubmission time\nSet midterm time on calendar\nCountable vs uncountable</p>\n<h1 id=\"random-variables\" style=\"position:relative;\"><a href=\"#random-variables\" aria-label=\"random variables permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Random Variables</h1>\n<p>Password: Teapot</p>\n<p><em>Def:</em> A random variable X is a function 
X: Ω -> R. This makes it valid to write things like P(X &#x3C;= 3). P(X &#x3C;= 3) is shorthand for P({ω: X(ω) &#x3C;= 3})\nRandom variables have a measurability condition: {ω: X(ω) &#x3C;= x} ∈ F for every x ∈ R</p>\n<p>Often we want to compute more complex probabilities, such as P(α &#x3C; X &#x3C; β).\n{ω: X(ω) &#x3C; β} = U<sub>n >=1</sub>{ω: X(ω) &#x3C;= β - 1/n} ∈ F\n{ω: X(ω) > α} = {ω: X(ω) &#x3C;= α}<sup>C</sup> ∈ F\nIntersecting the two gives {ω: α &#x3C; X(ω) &#x3C; β} ∈ F, so P(α &#x3C; X &#x3C; β) is well defined\nIn the same way, P(X ∈ B) makes sense for pretty much any B (subset of R) that you want\nThe technical name for these sets B is <em>Borel sets</em></p>\n<p>Another consequence of the definition of random variables:\nIf X,Y are random variables on (Ω, F, P):</p>\n<ul>\n<li>X+Y is a random variable</li>\n<li>X*Y is a random variable</li>\n<li>|X|<sup>p</sup> is a random variable</li>\n</ul>\n<h2 id=\"probability-distributions\" style=\"position:relative;\"><a href=\"#probability-distributions\" aria-label=\"probability distributions permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Probability Distributions</h2>\n<p>For any random variable X on probability space (Ω, F, P), we can define its distribution (aka its “law”) L<sub>x</sub> via L<sub>x</sub>(B) := P(X ∈ B), B ⊆ R\nFor example, L<sub>x</sub>({x}) = P(X=x)</p>\n<p>Intuitively, the distribution is the histogram of values that the random variable would take.</p>\n<p>It is itself a probability measure, on R rather than on Ω.</p>\n<p>In practice, we often describe our model for experimental outcomes in terms of distributions. 
Given this, I can always construct a probability space and random variable X that has this distribution.</p>\n<p>Ex: Given distribution L, we can consider the probability space (R, B, L), where B is the collection of Borel sets, and the random variable X(ω) = ω, ω ∈ Ω = R\nThen the distribution of X is L</p>\n<p>Important class of random variables:\n<em>Discrete Random Variables</em>: A random variable that takes countably many values\nEx: X = flip of a p-biased coin. This is a Bernoulli distribution with probability p</p>\n<p>X = roll of a die. This is a uniform distribution over {1,2,3,4,5,6}\nX = number of coin flips until I flip heads. Geometric Distribution\nX = number of heads in n flips. Binomial Distribution (n, p)</p>\n<p>In the case of a discrete random variable X, the distribution of X can be summarized by its probability mass function.\nP<sub>x</sub>(x) := P({X=x}) = P({ω: X(ω) = x})</p>\n<p>By the axioms, P<sub>x</sub>(x) >= 0 and Σ<sub>x</sub> P<sub>x</sub>(x) = 1</p>\n<h3 id=\"joint-distributions\" style=\"position:relative;\"><a href=\"#joint-distributions\" aria-label=\"joint distributions permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Joint Distributions</h3>\n<p>For a pair of discrete random variables X,Y defined on the common probability space (Ω, F, P), their joint distribution is\nsummarized by the joint probability mass function defined via\nP<sub>xy</sub>(x,y) = P(X=x, Y=y) = P({ω: X(ω) = x &#x26;&#x26; Y(ω) = y}) = P({ω: X(ω) = x} ∩ {ω: Y(ω) = y})\nWe can obtain the “marginal” distributions by P<sub>x</sub>(x) = Σ<sub>y</sub> P<sub>xy</sub>(x,y) using the law of total 
probability. As a check, Σ<sub>x</sub> P<sub>x</sub>(x) = 1</p>\n<h2 id=\"independent-random-variables\" style=\"position:relative;\"><a href=\"#independent-random-variables\" aria-label=\"independent random variables permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Independent Random Variables</h2>\n<p>Def: Discrete random variables X,Y are independent if their joint probability mass function factors into the product of their marginals\nP<sub>xy</sub>(x,y) = P<sub>x</sub>(x)P<sub>y</sub>(y)</p>\n<p>This is equivalent to saying {ω: X(ω) = x} and {ω: Y(ω) = y} are independent events for all x,y</p>\n<p>Examples of Joint Distributions:\nX = {0 if patient tests negative with probability 0.9, 1 if patient tests positive with probability 0.1}\nY = {0 if patient is negative, 1 if patient is positive}</p>\n<p>We know the false positive rate of the test is 5%, and the false negative rate of the test is 1%</p>\n<p>Question: What is P<sub>XY</sub>?\nI test my patient. 
The test is either negative (0.9) or positive (0.1).</p>\n<h2 id=\"expectation---used-for-discrete-random-variables\" style=\"position:relative;\"><a href=\"#expectation---used-for-discrete-random-variables\" aria-label=\"expectation   used for discrete random variables permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Expectation - Used for discrete random variables</h2>\n<p>Takes in a random variable, spits out a number\nFor a discrete random variable X on (Ω, F, P), we define its expectation E[X] = Σ<sub>x</sub> x P{X=x} = Σ<sub>x</sub> x p<sub>x</sub>(x)</p>\n<p>Ex: Let Ω = {0,1}<sup>n</sup>, P({ω}) = p<sup>{number of heads}</sup> (1-p)<sup>{number of tails}</sup>\ni.e. a sequence of n independent p-biased coin flips, where 1 = heads and 0 = tails</p>\n<p>X(ω) = Σ<sub>i=1</sub><sup>n</sup> ω<sub>i</sub>. 
&#x3C;= Number of heads\nE[X] = Σ<sub>ω</sub> X(ω) P({ω}) = Σ<sub>ω</sub> |{i : ω<sub>i</sub> = 1}| p<sup>|{i : ω<sub>i</sub> = 1}|</sup> (1-p)<sup>|{i : ω<sub>i</sub> = 0}|</sup></p>\n<p>= Σ<sub>k=0</sub><sup>n</sup> k (n k) p<sup>k</sup> (1-p)<sup>n-k</sup> = np</p>\n<p>The most important property of expectation is that it is linear.\nExpectation is an operator: it takes a function and spits out a number, just like an integral.\nJust like how integrals are linear, expectation is also linear.</p>\n<p>The most important and only property of expectation <code class=\"language-text\">E[X+Y] = E[X] + E[Y]</code></p>\n<h3 id=\"linearity-of-expectation\" style=\"position:relative;\"><a href=\"#linearity-of-expectation\" aria-label=\"linearity of expectation permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Linearity of Expectation</h3>\n<p>Let X,Y be random variables defined on a common probability space. 
<code class=\"language-text\">E[aX + bY] = aE[X] + bE[Y]</code> &#x3C;= There is no need for X and Y to be independent or discrete.</p>\n<p>Lemma: E[g(Z)] = Σ<sub>z</sub> g(z) p<sub>z</sub>(z)</p>\n<p>By the law of the unconscious statistician\nE[aX + bY] = Σ<sub>x,y</sub> (ax + by)P<sub>xy</sub>(x,y) = aΣ<sub>x,y</sub> x P<sub>xy</sub>(x,y) + bΣ<sub>x,y</sub> y P<sub>xy</sub>(x,y) = aΣ<sub>x</sub> x P<sub>x</sub>(x) + bΣ<sub>y</sub> y P<sub>y</sub>(y) = aE[X] + bE[Y]</p>\n<h1 id=\"sum-of-independent-binomials-variance-covariance-correlation-coefficient-entropy\" style=\"position:relative;\"><a href=\"#sum-of-independent-binomials-variance-covariance-correlation-coefficient-entropy\" aria-label=\"sum of independent binomials variance covariance correlation coefficient entropy permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Sum of Independent Binomials, Variance, Covariance, Correlation Coefficient, Entropy</h1>\n<h2 id=\"general-strategy-for-expectation\" style=\"position:relative;\"><a href=\"#general-strategy-for-expectation\" aria-label=\"general strategy for expectation permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>General Strategy for 
Expectation</h2>\n<ol>\n<li>We introduce indicator random variables and express the random variable of interest in terms of the indicators, then use linearity of expectation</li>\n<li>In general, the indicator for event A is denoted by 1<sub>A</sub>(ω) = {0 if ω ∉ A, 1 if ω ∈ A}\na. 1<sub>A</sub> ~ Bernoulli(P(A)), E[1<sub>A</sub>] = P(A)</li>\n</ol>\n<p>Ex: n people put their hats in a bucket, and each draws 1 hat. Let X = number of people who get their own hat back.\nX = Σ<sub>i=1</sub><sup>n</sup> 1<sub>A<sub>i</sub></sub>\nA<sub>i</sub> = {person i gets own hat back}\nE[X] = Σ<sub>i=1</sub><sup>n</sup> E[1<sub>A<sub>i</sub></sub>] = Σ<sub>i=1</sub><sup>n</sup> P(A<sub>i</sub>) = n * 1/n = 1</p>\n<p>Ex: Coupon collector problem\nHow many boxes on average do I need to buy before I collect all N coupons?\nX<sub>i</sub> = number of boxes to buy to get the i<sup>th</sup> new coupon, after getting the (i-1)<sup>th</sup>\nX<sub>i</sub> ~ Geometric((N-i+1)/N), so E[X<sub>i</sub>] = N/(N-i+1)\nE[X] = Σ<sub>i=1</sub><sup>N</sup> E[X<sub>i</sub>] = N(1 + 1/2 + … + 1/N)</p>\n<h2 id=\"tail-sum-formula-another-way-to-compute-expectation-without-using-linearity-of-expectation\" style=\"position:relative;\"><a href=\"#tail-sum-formula-another-way-to-compute-expectation-without-using-linearity-of-expectation\" aria-label=\"tail sum formula another way to compute expectation without using linearity of expectation permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Tail sum formula: Another way to compute expectation without using linearity of expectation.</h2>\n<p>For a non-negative integer-valued random variable X\nE[X] = Σ<sub>k=1</sub><sup>∞</sup> P{X >= k}</p>\n<p>Proof: write P{X >= k} = Σ<sub>x>=k</sub> 
P(X=x), then sum over k >= 1: each term P(X=x) is counted x times, giving Σ<sub>x</sub> x P(X=x) = E[X]</p>\n<p>X<sub>1</sub> ~ Geom(p), X<sub>2</sub> ~ Geom(q), independent\nM = min{X<sub>1</sub>, X<sub>2</sub>}\nP{M >= k} = P{X<sub>1</sub> >= k} P{X<sub>2</sub> >= k} = ((1-p)(1-q))<sup>k-1</sup></p>\n<p>E[M] = Σ<sub>k>=1</sub> P{X<sub>1</sub> >= k} P{X<sub>2</sub> >= k}\n= 1 / (1 - (1-p)*(1-q))</p>\n<h2 id=\"variance\" style=\"position:relative;\"><a href=\"#variance\" aria-label=\"variance permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Variance</h2>\n<p>Quantitative notion of the variability of X around E[X]\nVar(X) = E[(X - E[X])<sup>2</sup>] = E[X<sup>2</sup> - 2XE[X] + (E[X])<sup>2</sup>] = E[X<sup>2</sup>] - (E[X])<sup>2</sup></p>\n<p>If expectation is the price of a stock, variance is the volatility.\nVariance is not the only way to measure the variability of a random variable. Another way to measure the variability of a random variable is entropy.\nEntropy = Σ<sub>x</sub> P<sub>x</sub>(x) log(1/P<sub>x</sub>(x)) = E[log(1/P<sub>x</sub>(X))]</p>\n<p>Unlike expectation, variance is not linear with respect to its arguments. 
Var(X+X) = Var(2X) = 4 Var(X)\nIn general, Var(X+Y) != Var(X) + Var(Y)\nVar(X+Y) </p>\n<p>By definition of variance\n= E[(X+Y - E[X] - E[Y])<sup>2</sup>]</p>\n<p>Open up the square using FOIL\n= Var(X) + Var(Y) + 2 E[(X-E[X])(Y-E[Y])]</p>\n<h2 id=\"x---exy-ey--covariance\" style=\"position:relative;\"><a href=\"#x---exy-ey--covariance\" aria-label=\"x   exy ey  covariance permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>E[(X - E[X])(Y - E[Y])] = Covariance</h2>\n<p>Positive covariance means X and Y tend to move in the same direction\nIf X and Y are independent, covariance is 0</p>\n<p>If X,Y are uncorrelated (covariance of X,Y = 0), then Var(X+Y) = Var(X) + Var(Y)</p>\n<p>Correlation coefficient: Covariance(X, Y) / sqrt(Var(X)*Var(Y)). 
It is always between -1 and +1</p>\n<p>|Covariance(X,Y)| &#x3C;= sqrt(Var(X) * Var(Y))\nThis is the Cauchy-Schwarz inequality, just as x<sup>T</sup>y / (||x|| ||y||) is in [-1, 1] for vectors</p>\n<p>E[XY] &#x3C;= E[X<sup>2</sup>]<sup>1/2</sup> E[Y<sup>2</sup>]<sup>1/2</sup>\nE[XY] = Σ<sub>xy</sub> P<sub>xy</sub>(x,y)<sup>1/2</sup> x * P<sub>xy</sub>(x,y)<sup>1/2</sup> y &#x3C;= (Σ<sub>xy</sub>P<sub>xy</sub>(x,y)x<sup>2</sup>)<sup>1/2</sup> (Σ<sub>xy</sub>P<sub>xy</sub>(x,y)y<sup>2</sup>)<sup>1/2</sup></p>\n<p>Let X ~ Binomial(n, p)\nX = Σ<sub>i=1</sub><sup>n</sup> X<sub>i</sub>, where the X<sub>i</sub> ~ Bernoulli(p) are independent\nVar(X) = Σ Var(X<sub>i</sub>) = n Var(X<sub>1</sub>) = np(1-p)</p>\n<h2 id=\"poisson-random-variables\" style=\"position:relative;\"><a href=\"#poisson-random-variables\" aria-label=\"poisson random variables permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Poisson random variables</h2>\n<p>Poisson means fish in French. Think number of arrivals when you see Poisson\nIf I dip a net into the water and pull it out, a Poisson distribution is a good way to model how many fish I pull out.</p>\n<p>X ~ Poisson(λ) λ = rate\nP<sub>X</sub>(k) = λ<sup>k</sup>/k! e<sup>-λ</sup>, k = 0,1,2,…</p>\n<p>Imagine trying to take a bus in Berkeley. They are supposed to be at each station every 30 minutes. However, there are times when no buses come for 45 minutes then two buses come in the 46th and 47th minute. 
You can model the number of bus arrivals in a time window using a Poisson distribution</p>\n<p>Where does Poisson come from?\nSplit a unit of time into n small intervals. In each interval, independently, something arrives with probability p<sub>n</sub> = λ/n. The number of arrivals is then X<sub>n</sub> ~ Binomial(n, p<sub>n</sub>)</p>\n<p>P(X<sub>n</sub>=k) = Binomial PMF = (n k) * (λ/n)<sup>k</sup> * (1-λ/n)<sup>n-k</sup>\n= n(n-1)…(n-k+1) * (1/(n-λ))<sup>k</sup> * λ<sup>k</sup>/k! * (1-λ/n)<sup>n</sup>\nAs n goes to infinity, n(n-1)…(n-k+1)/(n-λ)<sup>k</sup> goes to 1 and (1-λ/n)<sup>n</sup> goes to e<sup>-λ</sup>, so this converges to the Poisson PMF\nλ<sup>k</sup>/k! * e<sup>-λ</sup></p>\n<h1 id=\"continuous-distributions-continuous-random-variables\" style=\"position:relative;\"><a href=\"#continuous-distributions-continuous-random-variables\" aria-label=\"continuous distributions continuous random variables permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Continuous Distributions, Continuous Random Variables</h1>\n<h2 id=\"conditional-distributions\" style=\"position:relative;\"><a href=\"#conditional-distributions\" aria-label=\"conditional distributions permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 
1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Conditional Distributions</h2>\n<p><code class=\"language-text\">P(A|C) := P(A ∩ C) / P(C)</code></p>\n<p>When X,Y are discrete: their joint pmf is P<sub>xy</sub></p>\n<p>P<sub>x|y</sub>(x|y) := P<sub>xy</sub>(x,y) / P<sub>y</sub>(y) = P({X=x} ∩ {Y=y}) / P({Y=y})</p>\n<p>Since P<sub>x|y</sub> is a pmf for each y with <code class=\"language-text\">P_y(y) > 0</code>, we can take the expectation of X with respect to it.</p>\n<p>This is defined as conditional expectation\nE[X|Y=y] := Σ<sub>x</sub> x P<sub>x|y</sub>(x|y)\nOnce we know the value of Y, what value do we expect X to take?\nWe usually just write E[X|Y] to denote Σ<sub>x</sub> x P<sub>x|y</sub>(x|y) evaluated at the random variable Y. Thus, the conditional expectation is itself a random variable since it is a function of the random variable Y.</p>\n<h2 id=\"tower-property\" style=\"position:relative;\"><a href=\"#tower-property\" aria-label=\"tower property permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Tower Property</h2>\n<p>Most important property of conditional expectation: </p>\n<ul>\n<li>\n<p>For all functions f, we can say <code class=\"language-text\">E[f(Y)X] = E[f(Y)*E[X|Y]]</code></p>\n<ul>\n<li>E[f(Y)E[X|Y]] = Σ<sub>y</sub> p<sub>y</sub>(y)f(y) * Σ<sub>x</sub> x P<sub>x|y</sub>(x|y)</li>\n<li>= Σ<sub>x,y</sub>P<sub>xy</sub>(x,y)*f(y)x = E[f(Y)X]</li>\n</ul>\n</li>\n</ul>\n<p>Example: Iterated Expectation\nE[X] = E[E[X|Y]]\nWe get this using the tower 
property by setting f(Y) = 1</p>\n<p>Example: Let N >= 0 be an integer valued random variable. Let’s flip a fair coin N times (with N independent of the flips), and let X = the number of heads.\nThere are two sources of randomness, the number of flips and whether each flip is heads or tails.\nE[X] = E[E[X|N]]\nFix the number of coin tosses. Given N, each toss is heads with probability 1/2, so\nE[X|N] = N / 2\nSubstitute\nE[X] = E[N/2]\nUse linearity of expectation\nE[X] = 1/2 E[N]</p>\n<p>Var(X) = E[Var(X|N)] + Var(E[X|N])\nUse the same substitution as above, E[X|N] = N/2\nGiven N, X ~ Binomial(N, 1/2), and the variance of a Binomial(n, p) is np(1-p), so Var(X|N) = N/4 and E[Var(X|N)] = E[N/4]\nVar(X) = E[N/4] + Var(N/2)\nVar(X) = 1/4 * E[N] + 1/4 * Var(N)</p>\n<p>Recall Var(X) = E[(X-E[X])<sup>2</sup>]\nDefinition: Conditional Variance of X given {Y=y}\nVar(X|Y=y) = Σ<sub>x</sub> p<sub>x|y</sub>(x|y)(x-E[X|Y=y])<sup>2</sup></p>\n<p>Just like conditional expectation, we write Var(X|Y) to denote the random variable evaluated at Y.</p>\n<p>Theorem: Law of total variance</p>\n<p>Var(X) = E[Var(X|Y)] + Var(E[X|Y])</p>\n<p>Proof:</p>\n<p>Var(X) = E[X<sup>2</sup>] - (E[X])<sup>2</sup></p>\n<p>= E[E[X<sup>2</sup>|Y]] - (E[E[X|Y]])<sup>2</sup></p>\n<p>= E[Var(X|Y)] + E[(E[X|Y])<sup>2</sup>] - (E[E[X|Y]])<sup>2</sup></p>\n<p>= E[Var(X|Y)] + Var(E[X|Y])</p>\n<p>The third line uses the definition of conditional variance. Start with Var(X|Y=y)</p>\n<p>= E[X<sup>2</sup>|Y=y] - (E[X|Y=y])<sup>2</sup></p>\n<p>so E[X<sup>2</sup>|Y] = Var(X|Y) + (E[X|Y])<sup>2</sup>, and taking expectations gives E[E[X<sup>2</sup>|Y]] = E[Var(X|Y)] + E[(E[X|Y])<sup>2</sup>]</p>\n<h2 id=\"continuous-random-variables-and-continuous-distributions\" style=\"position:relative;\"><a href=\"#continuous-random-variables-and-continuous-distributions\" aria-label=\"continuous random variables and continuous distributions permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 
9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Continuous Random Variables and Continuous Distributions</h2>\n<p>Note: Random variables need not be discrete or continuous or combinations thereof\nFor a random variable X, we can always define its cumulative distribution function, abbreviated CDF, via </p>\n<p>F<sub>x</sub>(x) := P{X &#x3C;= x}</p>\n<ol>\n<li>F<sub>x</sub> is nondecreasing </li>\n<li>F<sub>x</sub>(x) -> 0 as x -> -inf, and F<sub>x</sub>(x) -> 1 as x -> inf</li>\n<li>F<sub>x</sub> is continuous from the right</li>\n</ol>\n<p>Definition: A random variable X has a continuous distribution if there exists a function f<sub>x</sub> such that </p>\n<p>F<sub>x</sub>(x) = ∫<sub>-inf</sub><sup>x</sup> f<sub>x</sub>(u)du</p>\n<p>f<sub>x</sub> is called the density of X, aka a pdf (probability density function)</p>\n<p>For f<sub>x</sub> to be a density, it just needs to be non-negative and integrate to 1, since total probability = 1\nf<sub>x</sub> >= 0\n∫f<sub>x</sub>(x)dx = 1</p>\n<h3 id=\"continous-random-variables\" style=\"position:relative;\"><a href=\"#continous-random-variables\" aria-label=\"continous random variables permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Continuous Random Variables</h3>\n<p>Good for modeling analog signals from the real world.</p>\n<ul>\n<li>Time we wait until a bus arrives (Exponential 
Distribution)</li>\n<li>Voltage across a resistor (Gaussian Distribution)</li>\n<li>Phase of a received wireless signal (Uniform Distribution)</li>\n</ul>\n<p>Continuous distributions are usually described by their density.\nX ~ Uniform(a,b): f<sub>x</sub>(x) = {1/(b-a) if a &#x3C;= x &#x3C;= b, 0 otherwise}\nX ~ Exp(λ): f<sub>x</sub>(x) = {λe<sup>-λx</sup> if x >= 0, 0 otherwise}\nX ~ N(µ,σ<sup>2</sup>): f<sub>x</sub>(x) = 1/sqrt(2pi*σ<sup>2</sup>) * exp(-(x-µ)<sup>2</sup> / (2σ<sup>2</sup>))</p>\n<h3 id=\"jointly-continuous-random-variables\" style=\"position:relative;\"><a href=\"#jointly-continuous-random-variables\" aria-label=\"jointly continuous random variables permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Jointly Continuous Random Variables</h3>\n<p>We say X<sub>1</sub>, X<sub>2</sub>, …, X<sub>n</sub> are jointly continuous if there exists a joint density f<sub>X<sub>1</sub>X<sub>2</sub>…X<sub>n</sub></sub> such that\nF<sub>X<sub>1</sub>X<sub>2</sub>…X<sub>n</sub></sub>(x<sub>1</sub>, x<sub>2</sub>, …, x<sub>n</sub>) = P{X<sub>1</sub>&#x3C;=x<sub>1</sub>, X<sub>2</sub>&#x3C;=x<sub>2</sub>, … X<sub>n</sub>&#x3C;=x<sub>n</sub>}</p>\n<p>= ∫<sub>-inf</sub><sup>x<sub>1</sub></sup>…∫<sub>-inf</sub><sup>x<sub>n</sub></sup> f<sub>X<sub>1</sub>X<sub>2</sub>…X<sub>n</sub></sub>(u<sub>1</sub>, u<sub>2</sub>, …, u<sub>n</sub>)du<sub>1</sub>du<sub>2</sub>…du<sub>n</sub></p>\n<p>Example:\nLet a dart land uniformly at random on a 2d dartboard of radius r.\nThe joint density will be flat over the dartboard</p>\n<p>f<sub>xy</sub>(x,y) = {1/(pi*r<sup>2</sup>) if x<sup>2</sup> + y<sup>2</sup> &#x3C;= r<sup>2</sup>, 0 otherwise}\nIndependence: Random Variables X,Y are independent if F<sub>xy</sub>(x,y) = F<sub>x</sub>(x) * F<sub>y</sub>(y) for all x, y\nIf X,Y are continuous and independent, their joint density is the 
product of the marginals</p>\n<h3 id=\"expectation-of-continuous-random-variables\" style=\"position:relative;\"><a href=\"#expectation-of-continuous-random-variables\" aria-label=\"expectation of continuous random variables permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Expectation of continuous random variables</h3>\n<p>Same as the discrete case, replace sums with integrals\nE[X] = ∫x f<sub>x</sub>(x)dx</p>\n<p>More generally:\nE[g(X<sub>1</sub>…X<sub>n</sub>)] = ∫…∫g(x<sub>1</sub>…x<sub>n</sub>) f<sub>x<sub>1</sub> … x<sub>n</sub></sub>(x<sub>1</sub> … x<sub>n</sub>)dx<sub>1</sub> … dx<sub>n</sub></p>\n<p>Example:\nVar(X) = ∫(x-E[X])<sup>2</sup> f<sub>x</sub>(x)dx</p>\n<p>Example: X ~ Unif(a,b)\nE[X] = ∫ x f<sub>x</sub>(x)dx, with f<sub>x</sub> the uniform density</p>\n<p>= ∫<sub>a</sub><sup>b</sup> x * 1 / (b-a) dx</p>\n<p>= 1/2 * (b<sup>2</sup> - a<sup>2</sup>) / (b-a)</p>\n<p>= 1/2 * (b+a)</p>\n<p>Example: Var(X) = E[X<sup>2</sup>] - (E[X])<sup>2</sup>\n= (a<sup>2</sup> + ab + b<sup>2</sup>)/3 - (a+b)<sup>2</sup>/4\n= 1/12 (b-a)<sup>2</sup></p>\n<p>Back to the dartboard: let R = sqrt(X<sup>2</sup> + Y<sup>2</sup>) be the distance of the dart from the center\nCompute the probability that the dart lands within half the radius of the center.</p>\n<p>P{R &#x3C;= r/2} = P{X<sup>2</sup> + Y<sup>2</sup> &#x3C;= r<sup>2</sup> / 4}</p>\n<p>= E[1{X<sup>2</sup> + Y<sup>2</sup> &#x3C;= r<sup>2</sup> / 4}]</p>\n<p>= 1/(pi * r<sup>2</sup>) ∫∫ 1{x<sup>2</sup> + y<sup>2</sup> &#x3C;= r<sup>2</sup> / 4} dx dy</p>\n<p>= 1/(pi * r<sup>2</sup>) * pi * r<sup>2</sup> / 4 = 1/4</p>\n<h1 id=\"gaussian-distribution-derived-distributions-continuous-bayes\" style=\"position:relative;\"><a href=\"#gaussian-distribution-derived-distributions-continuous-bayes\" 
aria-label=\"gaussian distribution derived distributions continuous bayes permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Gaussian Distribution, Derived Distributions, Continuous Bayes</h1>\n<p>Discrete: Expectations come from PMF and sums\nContinuous: Expectations come from PDF and integrals</p>\n<h3 id=\"examples-of-continuous-distributions\" style=\"position:relative;\"><a href=\"#examples-of-continuous-distributions\" aria-label=\"examples of continuous distributions permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Examples of continuous distributions</h3>\n<p>Uniform(a,b)\nExp(λ)\nN(µ, σ<sup>2</sup>)</p>\n<h2 id=\"exponential-distribution\" style=\"position:relative;\"><a href=\"#exponential-distribution\" aria-label=\"exponential distribution permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 
3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Exponential Distribution</h2>\n<p>Suppose we want to model a memoryless process. For example, for a switch handling packets, it doesn’t matter how long you have already waited for a packet to show up: the expected remaining wait for the next packet doesn’t change.</p>\n<p><strong>Mathematically</strong>: Let X be a non-negative random variable with the memoryless property.\nMemoryless: P{X>t+s|X>s} = P{X>t} for all t,s > 0\nIf I’ve already waited s seconds, the probability of waiting more than t additional seconds is still P{X>t}</p>\n<p>P({X>t+s} ∩ {X > s}) = P({X > t}) P({X > s})\nGet rid of {X > s} on the left since if X > t + s, X must be greater than s</p>\n<p>Writing G(t) := P{X > t} = 1 - F<sub>x</sub>(t), this says G(t + s) = G(t) G(s)\nThe only solutions of this functional equation consistent with a CDF (G must take values in [0,1] and be nonincreasing) are of the form G(t) = e<sup>-λt</sup></p>\n<p>If X has the memoryless property, F<sub>x</sub>(t) = 1 - e<sup>-λt</sup> for some λ > 0</p>\n<h2 id=\"gaussian-random-variables\" style=\"position:relative;\"><a href=\"#gaussian-random-variables\" aria-label=\"gaussian random variables permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Gaussian Random Variables</h2>\n<p>The sum of independent effects. 
Height of NBA players, job satisfaction, sum of voltages across a resistor.</p>\n<p>We call N(0,1) the standard normal distribution. </p>\n<p>CDF: Φ(x) = 1/sqrt(2pi) ∫<sub>-inf</sub><sup>x</sup> exp(-u<sup>2</sup>/2)du</p>\n<p>For X ~ N(µ,σ<sup>2</sup>): P(X &#x3C;= x) = P((X-µ)/σ &#x3C;= (x-µ)/σ) = Φ((x-µ)/σ)\nGaussians have many nice properties</p>\n<p>If X is Gaussian, then so is aX + b.\nIf X,Y are independent Gaussians, then X+Y is Gaussian</p>\n<p>Example: Let V ~ N(1, 5<sup>2</sup>) be the input voltage to a chip, averaged over 1 second.\nOur chip fails if the voltage dips below 0.5 volts or exceeds 2.5 volts for any one second period</p>\n<p>By the union bound, Probability(chip fails in 60 second duration) &#x3C;= 60 * Probability(chip fails in 1 second)\n60 * Probability(chip fails in 1 second)\n= 60 * (Probability(V &#x3C; 0.5) + Probability(V > 2.5))\n= 60 * (Φ((0.5 - 1)/σ) + (1 - Φ((2.5-1)/σ))), with σ = 5</p>\n<p>Example: a cellphone sends a bit B ∈ {-1, 1} to a tower (assume each value has probability 1/2). The tower receives Y = B + N, N ~ N(0,1) independent of B. The tower decides B(Y) = sign(Y)\nP(err | B = 1) = P(1 + N &#x3C; 0) = P(N &#x3C; -1) = Φ(-1)</p>\n<p>Given that I receive {Y=y}, what is the probability that B = 1?\nP(B = 1 | Y ∈ [y, y + δ]) = P(Y ∈ [y, y + δ] | B = 1) P<sub>B</sub>(1) / P(Y ∈ [y, y + δ])</p>\n<p>Letting δ -> 0: P<sub>B|Y</sub>(1 | y) = P<sub>B</sub>(1) * f<sub>Y|B</sub>(y|1) / f<sub>y</sub>(y)\nThis is Bayes rule for one discrete and one continuous random variable</p>\n<p>= (1/2 * 1/sqrt(2pi) * exp(-(y-1)<sup>2</sup>/2)) / (1/2 * 1/sqrt(2pi) * exp(-(y-1)<sup>2</sup>/2) + 1/2 * 1/sqrt(2pi) * exp(-(y+1)<sup>2</sup>/2))\n= 1 / (1 + e<sup>-2y</sup>)</p>\n<p>The last example motivates the definition of conditional density\nIf X,Y are jointly continuous, we define the conditional density of X given Y:</p>\n<p>f<sub>x|y</sub>(x|y) = f<sub>xy</sub>(x,y) / f<sub>y</sub>(y)\nInterpretation: the density of X given that I know {Y=y}</p>\n<p>Bayes rule\nf<sub>x|y</sub>(x|y) = f<sub>x</sub>(x) / f<sub>y</sub>(y) * f<sub>y|x</sub>(y|x)</p>\n<p>Conditional expectation:\nFor X,Y that are jointly continuous, we have\nE[X|Y=y] = ∫ x f<sub>x|y</sub>(x|y)dx\nConditional expectation still satisfies the tower property: E[g(Y)X] = E[g(Y)E[X|Y]]</p>\n<h2 id=\"derived-distributions\" style=\"position:relative;\"><a 
href=\"#derived-distributions\" aria-label=\"derived distributions permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Derived Distributions</h2>\n<p>Suppose the distribution of X is given by its CDF F<sub>x</sub>, and we define <code class=\"language-text\">Y = g(X)</code> for some function g</p>\n<p>What is the distribution of Y?</p>\n<ol>\n<li>Do you really need this distribution?</li>\n<li>If you really need the distribution, it is best to work with CDFs</li>\n</ol>\n<p>Example:\nX is a continuous random variable, Y = aX + b with a != 0\nF<sub>y</sub>(y) = P{aX + b &#x3C;= y} = P{aX &#x3C;= y-b}\n= {P{X &#x3C;= (y-b)/a} if a > 0, P{X >= (y-b)/a} if a &#x3C; 0}\n= {F<sub>x</sub>((y-b)/a) if a > 0, 1 - F<sub>x</sub>((y-b)/a) if a &#x3C; 0}</p>\n<p>Differentiate with respect to y:\nf<sub>y</sub>(y) = {1/a * f<sub>x</sub>((y-b)/a) if a > 0, -1/a * f<sub>x</sub>((y-b)/a) if a &#x3C; 0}\n= 1/|a| * f<sub>x</sub>((y-b)/a)</p>\n<p>Vector case: Y = AX + b with A invertible\nf<sub>y</sub>(y) = 1/|det(A)| * f<sub>x</sub>(A<sup>-1</sup>(y-b))</p>\n<h1 id=\"information-theory-and-digital-communication-capacity-of-the-binary-erasure-channel-bec\" style=\"position:relative;\"><a href=\"#information-theory-and-digital-communication-capacity-of-the-binary-erasure-channel-bec\" aria-label=\"information theory and digital communication capacity of the binary erasure channel bec permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 
3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Information Theory and Digital Communication, Capacity of the Binary Erasure Channel (BEC)</h1>\n<h2 id=\"modes-of-convergence\" style=\"position:relative;\"><a href=\"#modes-of-convergence\" aria-label=\"modes of convergence permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Modes of Convergence</h2>\n<p>Weak law of large numbers: Everyone in the room takes a fair coin and flips it 500 times. Then we can say that, with high probability, the empirical frequency of heads is between .49 and .51. Then everyone flips it 500 more times. Then, with high probability, the empirical frequency is between .495 and .505. 
However, someone might have been between .49 and .51 in the first 500 but not between .495 and .505 in the first 1000\nStrong law of large numbers: The empirical frequency converges to the true probability with probability 1.</p>\n<p>The CLT is a statement about convergence in distribution</p>\n<h1 id=\"information-theory\" style=\"position:relative;\"><a href=\"#information-theory\" aria-label=\"information theory permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Information Theory</h1>\n<h2 id=\"source-coding-theorem\" style=\"position:relative;\"><a href=\"#source-coding-theorem\" aria-label=\"source coding theorem permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Source Coding Theorem</h2>\n<h3 id=\"protocol\" style=\"position:relative;\"><a href=\"#protocol\" aria-label=\"protocol permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 
1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Protocol</h3>\n<p>If I observe (x<sub>1</sub>, …, x<sub>n</sub>) in the typical set A<sub>E/2</sub>, then I describe it via the bitstring (1, xxx…x): a flag bit followed by the index of the sequence within the typical set, which takes about log|A<sub>E/2</sub>| ≤ n(H(X) + E/2) bits\nIf I observe a sequence not in the typical set, I describe it by brute force via a bitstring of about n log|𝒳| bits plus a flag, where 𝒳 is the alphabet</p>\n<p>What is the performance (average number of bits needed per symbol observed) for this scheme?\n1/n * E[number of bits in representation] ≤ 1/n (2 + n(H(X) + E/2)) * P{sequence in typical set} + 1/n (2 + n log|𝒳|) * P{sequence not in typical set} ≤ H(X) + E/2 + 4/n + log|𝒳| * P{sequence not in typical set}</p>\n<h3 id=\"results\" style=\"position:relative;\"><a href=\"#results\" aria-label=\"results permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Results</h3>\n<ul>\n<li>Descriptions using ≤ H(X) + E bits per symbol on average exist (for any E > 0 and n large enough)</li>\n<li>Lossless descriptions using &#x3C; H(X) bits per symbol on average don’t exist</li>\n<li>Huffman encoding: If I know p<sub>x</sub>, I can design a Huffman code requiring ≤ H(X) + 1/n bits per symbol on average to compress sequences of length n</li>\n</ul>\n<p>Question: How are we going to show we can compress down to the entropy?\nAnswer: Use concentration. 
For a sequence (x<sub>1</sub>, x<sub>2</sub>, …, x<sub>n</sub>), let the probability of observing it be p(x<sub>1</sub>…x<sub>n</sub>) = Π<sub>i=1</sub><sup>n</sup> P<sub>x</sub>(x<sub>i</sub>)</p>\n<p>Theorem: Asymptotic Equipartition Property - If X<sub>1</sub>, X<sub>2</sub>, … are iid with distribution P<sub>x</sub>, then -1/n log p(X<sub>1</sub>…X<sub>n</sub>) converges to H(X) in probability</p>\n<p>Typical Set: For E > 0, for each n ≥ 1 define the typical set\nA := { (x<sub>1</sub>…x<sub>n</sub>) : p(x<sub>1</sub>…x<sub>n</sub>) >= 2<sup>-n(H(X) + E)</sup>}</p>\n<p>Why is this called a typical set? By the AEP, the observed sequence lands in it with probability approaching 1, so it is a set of typical sequences.</p>\n<p>Question: Suppose I have N objects. What is the max number of bits I need to represent each object?\nAnswer: at most ceil(log<sub>2</sub>(N)) bits</p>\n<h1 id=\"poisson-processes\" style=\"position:relative;\"><a href=\"#poisson-processes\" aria-label=\"poisson processes permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Poisson Processes</h1>\n<p>Markov models: Model events that have memory\nPoisson Processes: Random arrivals: customers coming into a store, beta particles detected by a Geiger counter, vehicles passing a tollbooth\nCounting Process: Starts at 0, nondecreasing, right continuous, integer valued\nPoisson Process: a counting process with iid Exp(λ) interarrival times</p>\n<p>Thm: If N<sub>t</sub> is a Poisson process with rate λ, then N<sub>t</sub> ~ Poisson(λt)\nP{N<sub>t</sub> = n} = e<sup>-λt</sup> * (λt)<sup>n</sup> / n!</p>\n<p>Proof (T<sub>n</sub> is the time of the n-th arrival):\nP{N<sub>t</sub> = n} = P{T<sub>n</sub> &#x3C;= t &#x3C; T<sub>n+1</sub>}\n= 
E[1<sub>{T<sub>n</sub> &#x3C;= t}</sub> 1<sub>{t &#x3C; T<sub>n+1</sub>}</sub>]</p>\n<p>Increments of a Poisson process are stationary: N<sub>t+s</sub> - N<sub>s</sub> and N<sub>t + tau</sub> - N<sub>tau</sub> have the same distribution, Poisson(λt)</p>\n<p>Poisson processes have independent increments:\nif t<sub>0</sub> &#x3C; t<sub>1</sub> &#x3C; … &#x3C; t<sub>k</sub><br>\n=> increments N<sub>t1</sub> - N<sub>t0</sub>, N<sub>t2</sub> - N<sub>t1</sub>, … , N<sub>tk</sub> - N<sub>t(k-1)</sub> are independent</p>\n<p>Thm: If N<sub>t</sub> is a counting process with independent stationary increments and N<sub>t</sub> ~ Poisson(λt), then N<sub>t</sub> ~ Poisson Process(λ)</p>\n<p>Conditional distribution of Arrivals\nThm: Conditioned on {N<sub>t</sub> = n}, (T<sub>1</sub>, T<sub>2</sub>, …, T<sub>n</sub>) has the same distribution as (U<sub>(1)</sub>, …, U<sub>(n)</sub>), the order statistics of n iid Unif(0,t) random variables</p>\n<p>Example: Let cars pass through a tollbooth according to a Poisson Process(λ), with λ in cars per minute.\nQuestion: What is the probability that no cars pass in 2 minutes?\nP{N<sub>2</sub> = 0} = (λ2)^0 * e^(-2λ) / 0! = e^(-2λ)</p>\n<p>Question: If 10 vehicles pass in 2 minutes, what is the distribution of the number of vehicles that passed in the first 30 seconds?\nBinomial(10, 1/4), since each of the 10 arrival times is independently uniform over the 2 minutes and falls in the first 30 seconds with probability 1/4</p>\n<h2 id=\"mergingsuperposition-and-splittingthinning\" style=\"position:relative;\"><a href=\"#mergingsuperposition-and-splittingthinning\" aria-label=\"mergingsuperposition and splittingthinning permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Merging(superposition) and Splitting(thinning)</h2>\n<p>Merging: If N<sub>1,t</sub> is a Poisson process with rate λ and 
N<sub>2,t</sub> is an independent Poisson process with rate u, then N<sub>1,t</sub> + N<sub>2,t</sub> ~ PP(λ+u)</p>\n<p>Splitting: Independently mark each arrival with probability q. The marked arrivals form a PP(λq), the unmarked arrivals form a PP(λ(1-q)), and the two processes are independent.\nExample: Let vehicles pass through a tollbooth with PP(λ). Let each vehicle independently be a car with probability 1/3 and a truck with probability 2/3</p>\n<p>If 10 vehicles pass in the first 2 minutes, what is the distribution of the number of cars that pass in the first 30 seconds?\nEach of the 10 vehicles independently arrives in the first 30 seconds with probability 1/4 and is a car with probability 1/3, so the answer is Binomial(10, 1/12)</p>\n<h2 id=\"random-graphs-and-inference\" style=\"position:relative;\"><a href=\"#random-graphs-and-inference\" aria-label=\"random graphs and inference permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Random Graphs and Inference</h2>\n<p>G ~ G(n,p): G is a graph with n vertices where each edge appears independently with probability p.</p>\n<p>Monotone graph properties have sharp thresholds: for a threshold function t(n), P{G ~ G(n,p) has property P} -> {0 if p &#x3C;&#x3C; t(n), 1 if p >> t(n)} as n -> infinity\nThis is a phase transition phenomenon.</p>\n<p>Ex: P = {graph G has >= 1 edge}. This is a monotone property.\nClaim: t(n) = 1/n^2\nProof: Let X = # of edges in G, p = c / n^2\nWhat is the probability G has zero edges?\nP{G has zero edges} = P{X=0} = (1-p)^(n C 2) -> e^(-c/2) as n -> infinity\nThis is close to 1 if c &#x3C;&#x3C; 1 and close to 0 if c >> 1</p>\n<p>Goal: For property P = {graph G is connected}, show that t(n) = log(n)/n. Let p = lambda * log(n)/n\nIf lambda > 1, then P{G~G(n,p) is connected} -> 1 as n->infinity\nIf lambda &#x3C; 1, then P{G~G(n,p) is connected} -> 0 as n->infinity</p>\n<p>Lemma: If X is a random variable with E[X] != 0, then P(X=0) &#x3C;= 
Var(X)/(E[X])^2\nProof:\nVar(X) = E[(X-E[X])^2]\n= P(X=0) * E[(X-E[X])^2 | X=0] + P(X!=0) * E[(X-E[X])^2 | X!=0]\n= P(X=0) * (E[X])^2 + P(X!=0) * E[(X-E[X])^2 | X!=0]\n>= P(X=0) * (E[X])^2</p>\n<p>Case lambda &#x3C; 1:\nWill show: P{G has isolated vertex} -> 1\n{G has isolated vertex} ⊆ {G is disconnected}\nX = Σ<sub>i</sub> I<sub>i</sub> = # of isolated vertices, where I<sub>i</sub> = {0 if i not isolated, 1 if i isolated}\nI<sub>i</sub> ~ Bern(q) where q := (1-p)^(n-1)\nVar(X) = Σ<sub>i</sub> Var(I<sub>i</sub>) + Σ<sub>i != j</sub> Cov(I<sub>i</sub>, I<sub>j</sub>)\n= nq(1-q) + n(n-1)Cov(I<sub>1</sub>, I<sub>2</sub>)</p>\n<p>Where Cov(I<sub>1</sub>, I<sub>2</sub>) = E[I<sub>1</sub>I<sub>2</sub>] - (E[I<sub>1</sub>])^2 = (1-p)^(2n-3) - q^2 = q^2 * p/(1-p)\nP{G has no isolated vertices} = P{X=0} &#x3C;= (nq(1-q) + n(n-1) * pq^2/(1-p))/(n^2q^2)\n&#x3C;= (1-q)/(nq) + p/(1-p)\n(1-q)/(nq) goes to zero since nq -> infinity when lambda &#x3C; 1\np/(1-p) goes to zero since p -> 0</p>\n<p>Case lambda > 1:\nP{G disconnected} = P{U_{k=1}^{n/2} {There exists a set of k vertices separated from the rest of G}}\nApply the union bound\n&#x3C;= Σ_{k=1}^{n/2} (n C k) P{vertices 1, …, k are separated from the rest of G}</p>\n<h3 id=\"inference\" style=\"position:relative;\"><a href=\"#inference\" aria-label=\"inference permalink\" class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Inference</h3>\n<p>Given data, how do I choose the model that generated the data?\nX is the state of nature, often a parameter or hypothesis -> P_{Y|X} -> Y is the observation (data) it produces\nX may or may not be a random variable\nIf it is, then the distribution of X is called a prior (Bayesian inference)</p>\n<h2 id=\"inference-1\" style=\"position:relative;\"><a href=\"#inference-1\" aria-label=\"inference 1 permalink\" 
class=\"anchor before\"><svg aria-hidden=\"true\" focusable=\"false\" height=\"16\" version=\"1.1\" viewBox=\"0 0 16 16\" width=\"16\"><path fill-rule=\"evenodd\" d=\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\"></path></svg></a>Inference</h2>\n<p>Goal: Infer X from Y\nX -> Model -> Y</p>\n<p>Reasonable Approaches:</p>\n<ul>\n<li>MLE: X(Y) = argmax over x of P(Y | X = x)</li>\n<li>Binary hypothesis testing: X(Y) = 1 if L(Y) > lambda, 0 if L(Y) &#x3C; lambda, where L(Y) = P(Y | X=1) / P(Y | X=0) is the likelihood ratio</li>\n</ul>\n<p>For any test X(Y), there are two fundamental error rates: false positives (type I) and false negatives (type II)</p>\n<p>False Positive: P{X(Y) = 1 | X = 0}</p>\n<p>False Negative: P{X(Y) = 0 | X = 1}</p>\n<p>Question: Given a constraint on the type I error rate, how do we find the test that minimizes the type II error rate?\nSolution:\nSolve X = argmin P{X(Y) = 0 | X=1} subject to P{X(Y) = 1 | X=0} &#x3C;= beta\nAnswer: Randomized threshold tests are optimal: Neyman-Pearson Theorem\nGiven beta, the optimal decision rule is X(Y) = 1 if L(Y) > lambda, 0 if L(Y) &#x3C; lambda, Bernoulli(gamma) if L(Y) = lambda\nWhere lambda and gamma are chosen such that P{X(Y) = 1 | X=0} = beta</p>\n<p>Ex (this computation appears to assume Y ~ N(X, sigma^2) with X in {0,1}):\nbeta = P(L(Y) >= lambda | X=0)\n= P(Y/sigma >= 1/(2*sigma) + sigma*log(lambda) | X=0)\n= 1 - Φ(1/(2*sigma) + sigma*log(lambda)), where Φ is the standard normal CDF</p>\n<p>Question: Where does randomization enter the picture?\nAnswer: Deterministic threshold tests lie on the optimal error curve, but they do not necessarily trace out all of it\nEx: If Y is discrete, say binary-valued, then L(Y) takes at most 2 values\nRandomization takes two deterministic threshold tests and mixes between them, which fills in the points in between</p>\n<p>Threshold tests are optimal</p>\n<p>Estimation under mean squared error\nGoal: Estimate X based on Y under some loss function\nX -> Model -> 
Y</p>","fields":{"slug":"/posts/academics/probability","tagSlugs":["/tag/list/","/tag/uc-berkeley/"]},"frontmatter":{"date":"2023-01-10T23:46:37.121Z","description":"Real applications of probability","tags":["List","UC Berkeley"],"title":"Probability in Electrical Engineering and Computer Science"}}},"pageContext":{"slug":"/posts/academics/probability"}},"staticQueryHashes":["251939775","401334301","825871152"]}