
Probability in Electrical Engineering and Computer Science

Text: Introduction to Probability, 2nd Edition. It follows the course; copies can be borrowed from the engineering library.

TODO: set calendar for homework self-grades and resubmission time; set midterm time on calendar; choose discussion time; countable vs. uncountable.

Probability: Math + a way of thinking about uncertainty

Humans are naturally good at thinking about probability. For example, you can easily walk down a crowded street and estimate which path lets you avoid bumping into people. However, humans are pretty awful at turning those innate estimates into quantifiable numbers. The goal is to turn that natural probability estimation into something a robot can understand.

Quantify + Model

Inference: humans are also really good at taking in noisy observations and inferring what is really happening.

How does an AI generate an image? Take a natural image and repeatedly add noise to it until you end up with an image of pure noise. The idea behind Stable Diffusion and DALL·E 2 is to reverse this process: generate pure noise, then repeatedly reduce the noise until we end up with an image.

The modern theory of probability has an axiomatic starting point. Before the 1930s, probability was a heuristic field. All of probability consists of logical deductions from this starting point.

Probability space: a probability space (Ω, F, P) is a triple. Ω is a set, the sample space. F is a family of subsets of Ω, called events. P is a probability measure.

Rules:

  • F is a σ-algebra: it contains Ω itself, and it is closed under complements and countable unions.

    • complements: if A ∈ F, then A^C ∈ F
    • countable unions: if A1, A2, … ∈ F, then ∪i Ai ∈ F
    • by De Morgan's laws, this implies that F is also closed under countable intersections
  • P is a function that maps events to a number between 0 and 1. P: F -> [0,1]

    • Probability measures must obey the Kolmogorov axioms

      1. P(A) >= 0: the probability of any event must be non-negative
      2. P(Ω) = 1: the probability that some outcome occurs is 1
      3. σ-additivity, aka countable additivity: if A1, A2, … is a countable sequence of pairwise disjoint events, then P(∪i Ai) = Σi P(Ai)
      4. Why only countable additivity? Counterexample: P = Unif(0,1) has P({x}) = 0 for every x ∈ [0,1], yet P([0,1]) = 1; "uncountable additivity" would force 1 = Σx P({x}) = 0 <- Dive into this later
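The axioms above can be machine-checked on a toy finite space. A minimal sketch (the variable names are mine, not from the notes); on a finite space, finite additivity is equivalent to axiom 3:

```python
p = 0.6  # hypothetical coin bias
omega = frozenset({"H", "T"})
prob = {
    frozenset(): 0.0,
    frozenset({"H"}): p,
    frozenset({"T"}): 1 - p,
    omega: 1.0,
}

assert all(v >= 0 for v in prob.values())   # axiom 1: non-negativity
assert prob[omega] == 1.0                   # axiom 2: P(Omega) = 1
for A in prob:                              # axiom 3 (finite additivity
    for B in prob:                          # suffices on a finite space)
        if A.isdisjoint(B):
            assert abs(prob[A | B] - (prob[A] + prob[B])) < 1e-12
```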

Examples of the Above Rules

  1. Flip a coin with bias p (Heads with probability p, Tails with probability 1-p).

    • Ω = {H, T} The sample space is just heads and tails
    • F = 2^Ω = {Ø, {H}, {T}, {H, T}} These are all the subsets of Ω
    • P(H) = p, P(T) = 1-p, P(Ø) = 0, P({H, T}) = 1. Why are Ø and {H, T} here? Because F must be a σ-algebra, and P must assign a probability to every event in F.
    • We then want to create an experiment of n “independent” flips. Is our vocabulary rich enough for this? No.
    • Ω = {H, T}^n, F = 2^Ω, P(f1, f2, … , fn) = p^{number of flips fi = H} * (1-p)^{number of flips fi = T}
    • There can be multiple answers for a probability space. It just needs to be rich enough to capture all the probabilities of interest.
  2. Ω now equals all possible configurations of atoms in the universe. How do we represent this?

    • Let A = {set of configurations that lead to flipping heads}
    • Let B = {set of configurations that lead to flipping tails}
    • F = {Ø, A, B, Ω}, where Ω = A ∪ B
    • P(A) = p, P(B) = 1-p, P(Ø) = 0, P(Ω) = 1
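The n-flip space from example 1 can be built and sanity-checked directly. A sketch (the function name `flip_space` is mine):

```python
from itertools import product

def flip_space(n, p):
    """Map each outcome in {H,T}^n to p^{#heads} * (1-p)^{#tails}."""
    return {"".join(w): p ** w.count("H") * (1 - p) ** w.count("T")
            for w in product("HT", repeat=n)}

P = flip_space(3, 0.6)
assert abs(sum(P.values()) - 1.0) < 1e-12   # axiom 2: total probability is 1
# The event {first flip is H} gets probability p, as it should:
assert abs(sum(pr for w, pr in P.items() if w[0] == "H") - 0.6) < 1e-12
```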

Note: The probability space is usually implicitly described, except for HW 1. Lol

All of the rules of probability you are likely familiar with are consequences of the three axioms.

Ex: If A ∈ F, P(A^C) = 1 - P(A). The probability of A’s complement is 1 minus the probability of A.

  • Since F is a σ-algebra, A^C ∈ F.

    • Ω = A ∪ A^C, a disjoint union
    • 1 = P(Ω) Using axiom 2
    • P(Ω) = P(A ∪ A^C) = P(A) + P(A^C) Using axiom 3, so P(A^C) = 1 - P(A)

Ex: If A and B are events and A ⊆ B, then P(A) <= P(B). Write B = A ∪ (B \ A), the disjoint union of A and the elements of B not in A. Then P(B) = P(A ∪ (B \ A)) = P(A) + P(B \ A) using axiom 3, and P(A) + P(B \ A) >= P(A) using axiom 1.

Ex: If A and B are events, then P(A ∪ B) = P(A) + P(B) - P(A ∩ B), aka the inclusion-exclusion principle. Proof: A ∪ B, A ∩ B ∈ F. Write A ∪ B = B ∪ (A \ B) as a disjoint union, so P(A ∪ B) = P(B) + P(A \ (A ∩ B)) using axiom 3. Since A = (A ∩ B) ∪ (A \ (A ∩ B)) is also a disjoint union, P(A \ (A ∩ B)) = P(A) - P(A ∩ B), again by axiom 3, giving P(A ∪ B) = P(B) + P(A) - P(A ∩ B).

Ex: Ω = countable set, F = 2^Ω, and each singleton is an event: {ω} ∈ F for all ω ∈ Ω. P(A) = Σ_{ω ∈ A} P({ω}) for all A ⊆ Ω. This is a discrete sample space.

Ex: Law of total probability. If A1, A2, … partition Ω (mutually exclusive, collectively exhaustive: Ω = ∪ Ai), then for any B ∈ F, P(B) = Σ P(B ∩ Ai). Proof: B = ∪ (Ai ∩ B), a disjoint union, then apply axiom 3.

The goal of probability is to take complicated events and decompose them into multiple simpler events.

Mathematician: given a probability space (model), what can I say about outcomes of experiments drawn from this model? Statistician: given outcomes, how do I choose a good model (probability space)? Engineer: given a real-world problem, how do I choose a model that captures the essence of the problem, and then use the model to draw insight?

Lecture 2: Conditional Probability, Independence, Random Variables

Conditional Probability: If B ∈ F and P(B) > 0, then conditional probability P(A|B) = P(A ∩ B) / P(B)

Intuition: Probability A occurs given we know that B occurred

Formal definition: P(·|B) restricts our model (Ω, F, P) to those samples in B. (B, F|B, P(·|B)) is a probability space itself, where F|B = {A ∩ B: A ∈ F}.

Ex. If A1, A2, … ∈ F with P(Ai) > 0 and the Ai partition Ω, then P(B) = Σ P(B ∩ Ai) = Σ P(B|Ai) P(Ai).

Bayes Rule

Sometimes it’s easy to express P(B|A), but we are really interested in P(A|B). Let A be the state of the experiment and B the observed data.

P(A|B) - given the observation, we want the state. This is the task of inference.

Suppose we forgot Bayes rule; let’s recreate it from the definition of conditional probability: P(A|B) = P(B ∩ A) / P(B) = P(A) P(B|A) / P(B).

Ex. Suppose we have the following model: 85% of students pass, 15% get a no-pass (NP). Of the students that passed, 60% went to lecture and 40% did not. Of the students that got an NP, 10% went to lecture and 90% did not.

We want P(NP | N) / P(NP | Y): how much more likely are you to NP when you don’t attend lecture (N) than when you do (Y)? By the definition of conditional probability, this ratio equals [P(NP ∩ N) / P(NP ∩ Y)] * [P(Y) / P(N)].
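The ratio above can be evaluated numerically with Bayes rule and total probability, using the percentages from the model:

```python
p_np, p_pass = 0.15, 0.85
p_lec_pass, p_lec_np = 0.60, 0.10          # P(attend | pass), P(attend | NP)

p_Y = p_lec_pass * p_pass + p_lec_np * p_np   # P(attend), total probability
p_N = 1 - p_Y                                 # P(skip)

p_np_given_N = (1 - p_lec_np) * p_np / p_N    # Bayes rule
p_np_given_Y = p_lec_np * p_np / p_Y
ratio = p_np_given_N / p_np_given_Y
print(round(ratio, 2))   # ~9.95: skipping lecture makes NP about 10x likelier
```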

Ex. Suppose we roll two dice and the sum is 10. What is the probability roll 1 was = 4?

B = {sum of rolls = 10}
A = {first roll is 4}

P(A|B) = P(A ∩ B) / P(B)
       = P({First roll is 4, sum of rolls is 10}) / P({sum of rolls is 10})
       = P({first: 4, second: 6}) / P({(4,6), (5,5), (6,4)})
       = (1/36) / (3/36)
       = 1/3

Conditioning also allows us to usefully decompose intersections of events.

Ex. Consider events A1, …, An. The chain rule: P(∩_{i=1}^n Ai) = P(A1 | ∩_{i=2}^n Ai) P(∩_{i=2}^n Ai); recursing decomposes the intersection into a product of conditional probabilities.

Ex. Given n people in a room, what is the probability that more than 2 people share a birthday?

Ai = {person i does not share a birthday with any of the people j = 1, …, i-1}

P(Ai | ∩_{j=1}^{i-1} Aj) = (365 - (i-1)) / 365

This is because person i needs to land on one of the 365 - (i-1) days not already taken. P(no shared birthdays) = P(∩ Ai) = ∏_{i=1}^n (365 - (i-1))/365.

Use 1 - x <= e^{-x} (apparent from the Taylor series of e^{-x}). Then P(no shared birthdays) <= ∏_{i=1}^n e^{-(i-1)/365} = e^{-n(n-1)/730}.

Let n = 23 (there are 23 kids in a class): P(shared birthday) >= 1 - e^{-n(n-1)/730} ≈ 1 - e^{-0.69} ≈ 1/2.
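Both the exact product and the exponential bound are easy to compute (function names are mine):

```python
import math

def p_shared(n, days=365):
    """Exact P(at least two of n people share a birthday)."""
    p_none = 1.0
    for i in range(1, n + 1):
        p_none *= (days - (i - 1)) / days
    return 1 - p_none

def lower_bound(n, days=365):
    """From 1 - x <= e^{-x}: P(shared) >= 1 - e^{-n(n-1)/(2*days)}."""
    return 1 - math.exp(-n * (n - 1) / (2 * days))

print(p_shared(23), lower_bound(23))   # ≈ 0.5073 and ≈ 0.5000
```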

Independence

Events A, B are independent if P(A ∩ B) = P(A) * P(B). Disjoint events with positive probability are never independent. Note: in the special case where P(A) > 0: A, B independent <=> P(B|A) = P(B).

In general, collection A1, A2, … ∈ F are independent events if P(∩i ∈ S Ai) = ∏ i ∈ S P(Ai) for all finite sets of indices S

If A1, A2, … ∈ F are independent, then B1, B2, … are independent, where each Bi = Ai or Ai^C. Intuitively, knowing whether A occurred means we also know whether A’s complement occurred.

∏_{i=1}^n P(Ai) = P(∩_{i=1}^n Ai), so P(A1^C ∩ ∩_{i=2}^n Ai) = P(∩_{i=2}^n Ai) - P(∩_{i=1}^n Ai) = (1 - P(A1)) ∏_{i=2}^n P(Ai) = P(A1^C) ∏_{i=2}^n P(Ai). This shows A1^C, A2, …, An are independent; repeating the argument shows B1, B2, …, Bn are independent.

Conditional Independence

Oftentimes we have two events that we think of as independent but aren’t; there might be a confounding variable C. If A, B, C are such that P(C) > 0 and P(A ∩ B|C) = P(A|C) P(B|C), then A, B are said to be conditionally independent given C.

Consider two coins with biases p != q. Pick a coin at random and flip it twice. Hi = event that flip i is heads. Are the two coin flips independent? No. If p is 1 and q is 0, the first flip tells us which coin we have, and hence what the next flip will be.

P(Hi) = (p + q)/2. P(H1 ∩ H2) = (p² + q²)/2 != P(H1) P(H2) = ((p + q)/2)². With C = {picked coin p}, H1 and H2 are conditionally independent given C.
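A quick simulation matching the calculation above: unconditionally the two flips are dependent, with joint probability (p²+q²)/2 rather than ((p+q)/2)². The biases below are arbitrary test values:

```python
import random

random.seed(0)
p, q = 0.9, 0.2
N = 200_000
h1 = h2 = h1h2 = 0
for _ in range(N):
    bias = random.choice((p, q))    # pick a coin at random
    f1 = random.random() < bias
    f2 = random.random() < bias     # flip the SAME coin again
    h1 += f1
    h2 += f2
    h1h2 += f1 and f2

print(h1h2 / N, (h1 / N) * (h2 / N))   # ≈ 0.425 vs ≈ 0.3025
```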


Random Variables


Def: A random variable X is a function X: Ω -> R. This makes it valid to write things like P(X <= 3), which is shorthand for P({ω: X(ω) <= 3}). Random variables must satisfy a measurability condition: {ω: X(ω) <= x} ∈ F for every x.

Oftentimes we want to compute more complex probabilities, such as P(α < X < β): {ω: X(ω) < β} = ∪_{n>=1} {ω: X(ω) <= β - 1/n} ∈ F, and {ω: X(ω) > α} = {ω: X(ω) <= α}^C ∈ F. In fact P(X ∈ B) makes sense for pretty much any B ⊆ R that you want; the technical name for these B is Borel sets.

Another consequence of the definition of random variables: If X,Y are random variables on (Ω, F, P):

  • X+Y is a random variable
  • X*Y is a random variable
  • |X|^p is a random variable

Probability Distributions

For any random variable X on probability space (Ω, F, P), we can define its distribution (aka its “law”) Lx via Lx(B) := P(X ∈ B) for (Borel) B ⊆ R. For example, Lx({x}) = P(X = x).

Think of it as the histogram of values that the random variable takes.

Similar to a probability measure.

In practice, we often describe our model for experimental outcomes in terms of distributions. Given this, I can always construct a probability space and random variable X that has this distribution.

Ex: Given distribution L, we can consider the probability space (R, B, L) and random variable X(ω) = ω for ω ∈ Ω = R; then X has distribution L.

Important class of random variables, discrete random variables: random variables that take countably many values. Ex: X = flip of a p-biased coin. This is a Bernoulli distribution with parameter p.

X = roll of a die: uniform distribution over {1,2,3,4,5,6}. X = number of coin flips until I flip heads: Geometric(p) distribution. X = number of heads in n flips: Binomial(n, p) distribution.

In the case of a discrete random variable X, the distribution of X can be summarized by its probability mass function: px(x) := P(X = x) = P({ω: X(ω) = x}).

By axioms, px(x) >= 0 and Σpx(x) = 1

Joint Distributions

For a pair of discrete random variables X, Y defined on a common probability space (Ω, F, P), their joint distribution is summarized by the joint probability mass function, defined via pxy(x,y) = P(X=x, Y=y) = P({ω: X(ω) = x and Y(ω) = y}) = P({ω: X(ω) = x} ∩ {ω: Y(ω) = y}). We can obtain the “marginal” distributions by px(x) = Σy pxy(x,y), using the law of total probability. Σx px(x) = 1.

Independent Random Variables

Def: Discrete Random Variables X,Y are independent if their probability mass function factors into their marginals Pxy(x,y) = Px(x)Py(y)

This is equivalent to saying {ω: X(ω) = x} and {ω: Y(ω) = y} are independent events for all x, y.

Examples of joint distributions: X = {1 if patient tests positive (probability 0.1), 0 if patient tests negative (probability 0.9)}. Y = {1 if patient is positive, 0 if patient is negative}.

We know the false positive rate of the test is 5%, the false negative rate of test is 1%

Question: what is pXY? We know the marginal of X: the test is either negative (0.9) or positive (0.1).
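A sketch reconstructing the joint pmf, under my reading of the notes: the 0.9/0.1 figures are the marginal of the test result X, with FPR = P(X=1|Y=0) = 0.05 and FNR = P(X=0|Y=1) = 0.01:

```python
fpr, fnr, p_x1 = 0.05, 0.01, 0.10

# Solve P(X=1) = (1-FNR)*q + FPR*(1-q) for the prevalence q = P(Y=1):
q = (p_x1 - fpr) / ((1 - fnr) - fpr)
joint = {(1, 1): (1 - fnr) * q, (0, 1): fnr * q,
         (1, 0): fpr * (1 - q), (0, 0): (1 - fpr) * (1 - q)}

assert abs(sum(joint.values()) - 1.0) < 1e-12   # it is a valid pmf
print(q, joint[(1, 1)])   # prevalence ≈ 0.0532, P(X=1, Y=1) ≈ 0.0527
```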

Expectation (here, for discrete random variables)

Expectation takes in a random variable and spits out a number. For a discrete random variable X on (Ω, F, P), we define its expectation E[X] = Σx x P{X=x} = Σx x px(x).

Ex: Ω = {0,1}^n, P({ω}) = p^{number of heads} (1-p)^{number of tails}, i.e. a sequence of n independent p-biased coin flips.

X(ω) = |{i : ωi = 1}|, the number of heads.

E[X] = Σ_{k=0}^n k (n choose k) p^k (1-p)^{n-k} = np

The most important property of expectation is that it is linear. Expectation is an operator: it takes a random variable and spits out a number, just like an integral. And just like integrals, expectation is linear:

E[X+Y] = E[X] + E[Y]

Linearity of Expectation

If X, Y are random variables defined on a common probability space, E[aX + bY] = aE[X] + bE[Y]. X and Y do not need to be independent (or even discrete).

Lemma (law of the unconscious statistician): E[g(Z)] = Σz g(z) pz(z)

By the law of the unconscious statistician, E[aX + bY] = Σ_{x,y} (ax + by) pxy(x,y) = a Σ_{x,y} x pxy(x,y) + b Σ_{x,y} y pxy(x,y) = aE[X] + bE[Y].

Sum of Independent Binomials, Variance, Covariance, Correlation Coefficient, Entropy

General Strategy for Expectation

  1. We introduce indicator random variables and express the random variable of interest in terms of the indicators, then use linearity of expectation
  2. In general, the indicator for event A is denoted by 1A(ω) = {1 if ω ∈ A, 0 if ω ∉ A} a. 1A ~ Bernoulli(P(A)), so E[1A] = P(A)

Ex: n people put their hats in a bucket, and each draws 1 hat. Let X = number of people who get their own hat back. X = Σ_{i=1}^n 1_{Ai}, where Ai = {person i gets own hat back}. E[X] = Σ_{i=1}^n E[1_{Ai}] = n * 1/n = 1.
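The hat-matching answer E[X] = 1 can be checked exactly by enumerating every assignment for a small n:

```python
import math
from itertools import permutations

n = 6
total_matches = sum(sum(hat == person for person, hat in enumerate(perm))
                    for perm in permutations(range(n)))
avg_matches = total_matches / math.factorial(n)
print(avg_matches)   # exactly 1.0, independent of n
```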

Ex: Coupon collector problem. How many boxes on average do I need to buy before I collect all N coupons? Let Xi = number of boxes to buy to get the i-th distinct coupon after getting the (i-1)-th; Xi ~ Geometric((N - i + 1)/N), so E[Σ Xi] = Σ_{i=1}^N N/(N - i + 1) = N (1 + 1/2 + … + 1/N) ≈ N ln N.
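Summing the geometric expectations above gives N times a harmonic sum; a one-liner (the function name is mine):

```python
def expected_boxes(N):
    """E[boxes to collect all N coupons] = N * (1 + 1/2 + ... + 1/N)."""
    return sum(N / k for k in range(1, N + 1))

print(expected_boxes(10))   # ≈ 29.29 boxes for 10 coupons
```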

Tail sum formula: Another way to compute expectation without using linearity of expectation.

For a non-negative integer-valued random variable X: E[X] = Σ_{k=1}^∞ P{X >= k}

Proof: Σ_{k>=1} P{X >= k} = Σ_{k>=1} Σ_{x>=k} P(X = x) = Σx x P(X = x) = E[X], since each term P(X = x) is counted exactly x times.

X1 ~ Geom(p), X2 ~ Geom(q), M = min{X1, X2}. By independence, P{M >= k} = P{X1 >= k} P{X2 >= k} = ((1-p)(1-q))^{k-1}.

E[M] = Σ_{k>=1} ((1-p)(1-q))^{k-1} = 1 / (1 - (1-p)(1-q))
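The tail sum and the closed form above agree numerically (the biases are arbitrary test values):

```python
p, q = 0.3, 0.5
r = (1 - p) * (1 - q)
tail_sum = sum(r ** (k - 1) for k in range(1, 200))   # Σ_k P{M >= k}
closed_form = 1 / (1 - r)
assert abs(tail_sum - closed_form) < 1e-9
print(closed_form)   # ≈ 1.538; M is itself Geom(1 - (1-p)(1-q))
```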

Variance

Quantitative notion of the variability of X around E[X]: Var(X) = E[(X - E[X])²] = E[X² - 2X E[X] + (E[X])²] = E[X²] - (E[X])²

If expectation is the price of a stock, variance is the volatility. Variance is not the only way to measure the variability of a random variable; another is entropy: H(X) = Σx px(x) log(1/px(x)) = E[log(1/px(X))].

Unlike expectation, variance is not linear in its arguments: Var(X+X) = Var(2X) = 4 Var(X) != 2 Var(X), and in general Var(X+Y) != Var(X) + Var(Y). Expand Var(X+Y):

By the definition of variance, Var(X+Y) = E[(X + Y - E[X] - E[Y])²]

Opening up the square (FOIL), = Var(X) + Var(Y) + 2 E[(X - E[X])(Y - E[Y])]

Cov(X,Y) := E[(X - E[X])(Y - E[Y])] is the covariance.

Positive covariance means X and Y tend to move in the same direction. If X and Y are independent, their covariance is 0 (the converse is false).

If X,Y are uncorrelated(covariance of X,Y = 0), then Var(X+Y) = Var(X) + Var(Y)

Correlation coefficient: Cov(X, Y) / sqrt(Var(X) Var(Y)). It is always between -1 and +1.

Cov(X,Y) <= sqrt(Var(X) Var(Y)), using the Cauchy-Schwarz inequality; compare x·y / (||x|| ||y||) ∈ [-1, 1] for vectors.

E[XY] <= E[X²]^{1/2} E[Y²]^{1/2}: write E[XY] = Σ_{x,y} (pxy(x,y)^{1/2} x)(pxy(x,y)^{1/2} y) <= (Σ_{x,y} pxy(x,y) x²)^{1/2} (Σ_{x,y} pxy(x,y) y²)^{1/2}.

Let X ~ Binomial(n, p), so X = Σ Xi with Xi ~ Bernoulli(p) independent. Var(X) = Σ Var(Xi) = n Var(X1) = np(1-p).

Poisson random variables

“Poisson” means fish in French. Think “number of arrivals” when you see Poisson. If I dip a net into the water and pull it out, a Poisson distribution is a good model for how many fish I pull out.

X ~ Poisson(λ), λ = rate. px(k) = e^{-λ} λ^k / k!, k = 0, 1, 2, …

Imagine trying to take a bus in Berkeley. Buses are supposed to arrive at each station every 30 minutes. However, there are times when no buses come for 45 minutes and then two buses come in the 46th and 47th minutes. You can model the arrivals of the buses using a Poisson distribution.

Where does Poisson come from? Split time into n intervals; in each interval an arrival occurs independently with probability pn = λ/n. The number of arrivals is then Xn ~ Binomial(n, pn).

P(Xn = k) = (n choose k) (λ/n)^k (1 - λ/n)^{n-k} = [n(n-1)…(n-k+1)/n^k] (λ^k/k!) (1 - λ/n)^{n-k}. As n goes to infinity, this converges to the Poisson pmf λ^k/k! * e^{-λ}.
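The Binomial(n, λ/n) → Poisson(λ) convergence is easy to see numerically (function names are mine):

```python
import math

def binom_pmf(n, p, k):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(lam, k):
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam, k = 2.0, 3
for n in (10, 100, 10_000):
    print(n, binom_pmf(n, lam / n, k))
print("Poisson limit:", poisson_pmf(lam, k))   # ≈ 0.1804
```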

Continuous Distributions, Continuous Random Variables


Conditional Distributions

P(A|C) := P(A ∩ C) / P(C)

When X,Y are discrete: their joint pmf is Pxy

px|y(x|y) := pxy(x,y) / py(y) = P({X=x} ∩ {Y=y}) / P({Y=y})

Since px|y(·|y) is a pmf for each y with py(y) > 0, we can take the expectation of X with respect to it.

This is defined as the conditional expectation E[X|Y=y] := Σx x px|y(x|y). Once we know the value of Y, what is the expected value of X? We usually just write E[X|Y] to denote this sum evaluated at the random variable Y. Thus, the conditional expectation is itself a random variable, since it is a function of the random variable Y.

Tower Property

Most important property of Conditional expectation:

  • For all functions f, we can say E[f(Y)X] = E[f(Y) * E[X|Y]]

    • E[f(Y) E[X|Y]] = Σy py(y) f(y) Σx x px|y(x|y)
    • E[f(Y)X] = Σ_{x,y} pxy(x,y) f(y) x, which is the same sum

Example: Iterated Expectation E[X] = E[E[X|Y]] We get this using the tower property by setting f(Y) = 1

Example: Let N >= 0 be an integer-valued random variable. Flip a fair coin N times, and let X = the number of heads. There are two sources of randomness: the number of flips and whether each flip is heads or tails. E[X] = E[E[X|N]]. Fixing the number of coin tosses, half of the tosses are heads in expectation, so E[X|N] = N/2. Substituting and using linearity of expectation, E[X] = E[N/2] = (1/2) E[N].

Var(X) = E[Var(X|N)] + Var(E[X|N]). Using E[X|N] = N/2 as above, and the binomial variance Var(X|N) = N * (1/2)(1/2) = N/4: Var(X) = E[N/4] + Var(N/2) = (1/4) E[N] + (1/4) Var(N).
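A brute-force check of this decomposition, for a hypothetical N uniform on {0,…,4} with X | N=n ~ Binomial(n, 1/2) (my choice of N, not from the notes):

```python
from math import comb

ns, p_n = range(5), 1 / 5

E_N = sum(n * p_n for n in ns)
Var_N = sum(n * n * p_n for n in ns) - E_N ** 2
rhs = E_N / 4 + Var_N / 4            # E[Var(X|N)] + Var(E[X|N])

pmf = {}                             # full marginal distribution of X
for n in ns:
    for k in range(n + 1):
        pmf[k] = pmf.get(k, 0.0) + p_n * comb(n, k) * 0.5 ** n
E_X = sum(k * v for k, v in pmf.items())
Var_X = sum(k * k * v for k, v in pmf.items()) - E_X ** 2
assert abs(Var_X - rhs) < 1e-12
print(Var_X)   # 1.0
```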

Recall Var(X) = E[(X - E[X])²]. Definition: the conditional variance of X given {Y=y} is Var(X|Y=y) = Σx px|y(x|y) (x - E[X|Y=y])².

Just like conditional expectation, we write Var(X|Y) to denote the random variable evaluated at Y.

Theorem: Law of total variance

Var(X) = E[Var(X|Y)] + Var(E[X|Y])

Proof:

Var(X) = E[X²] - (E[X])²

= E[E[X²|Y]] - (E[E[X|Y]])²

= E[Var(X|Y)] + E[E[X|Y]²] - (E[E[X|Y]])²

= E[Var(X|Y)] + Var(E[X|Y])

The third line uses Var(X|Y=y) = E[X²|Y=y] - (E[X|Y=y])², i.e. E[X²|Y] = Var(X|Y) + E[X|Y]², so taking expectations gives E[X²] = E[Var(X|Y)] + E[E[X|Y]²].

Continuous Random Variables and Continuous Distributions

Note: Random Variables need not be discrete or continuous or combinations thereof For a random variable X, we can always define its cumulative distribution function abbreviated CDF via

Fx(x) := P{X <= x}

  1. Fx is nondecreasing
  2. Fx(x) -> 0 as x -> -inf, and Fx(x) -> 1 as x -> inf
  3. Fx is continuous from the right

Definition: A random variable X has continuous distribution if there exists a function fx such that

Fx(x) = ∫_{-inf}^{x} fx(u) du

fx is called the density of X aka a pdf(probability density function)

For fx to be a density, it just needs to be non-negative and integrate to 1, since the total probability is 1: fx >= 0 and ∫ fx(x) dx = 1.

Continuous Random Variables

Good for modeling analog signals from the real world.

  • time we wait until a bus arrives(Exponential Distribution)
  • Voltage across resistor(Gaussian Distribution)
  • Phase of a received wireless signal(Uniform Distribution)

Continuous distributions are usually described by their density. X ~ Uniform(a,b): fx(x) = 1/(b-a) for a <= x <= b, 0 otherwise. X ~ Exp(λ): fx(x) = λ e^{-λx} for x >= 0, 0 otherwise. X ~ N(µ, σ²): fx(x) = 1/(σ sqrt(2π)) exp(-(x-µ)²/(2σ²)).

Jointly Continuous Random Variables

We say X1, X2, …, Xn are jointly continuous if Fx1…xn(x1, …, xn) = P{X1 <= x1, X2 <= x2, …, Xn <= xn}

= ∫_{-inf}^{x1} … ∫_{-inf}^{xn} fx1…xn(u1, …, un) du1 … dun

for some joint density fx1…xn.

Example: Let a dart land uniformly at random on a 2d dartboard of radius r. The joint density is flat over the dartboard:

fxy(x,y) = 1/(πr²) if x² + y² <= r², 0 otherwise. Independence: random variables X, Y are independent if Fxy(x,y) = Fx(x) Fy(y). If X, Y are continuous and independent, their joint density is the product of the marginals.

Expectation of continuous random variables

Same as discrete case, replace sums with integrals E[X] = ∫x fx(x)dx

More generally: E[g(X1…Xn)] = ∫…∫g(x1…xn) fx1 … xn(x1 … xn)dx1 … dxn

Example: Var(X) = ∫ (x - E[X])² fx(x) dx

Example: X ~ Unif(a,b).

E[X] = ∫ x fx(x) dx

= ∫_a^b x * 1/(b-a) dx

= (1/2) (b² - a²) / (b-a)

= (a + b)/2

Example: Var(X) = E[X²] - (E[X])² = (b-a)²/12 for X ~ Unif(a,b).
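A numeric sanity check of the uniform mean and variance via a simple midpoint Riemann sum:

```python
a, b = 2.0, 5.0
N = 100_000
dx = (b - a) / N
xs = [a + (i + 0.5) * dx for i in range(N)]   # midpoints of each slice

mean = sum(x * dx / (b - a) for x in xs)           # ∫ x f(x) dx
var = sum((x - mean) ** 2 * dx / (b - a) for x in xs)

print(mean, var)   # 3.5 and 0.75 = (a+b)/2 and (b-a)^2/12
```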

Back to the dartboard: let R = sqrt(X² + Y²). Compute the probability that the dart lands in the middle half (radius r/2) of the dartboard.

P{R <= r/2} = P{X² + Y² <= r²/4}

= E[1{X² + Y² <= r²/4}]

= 1/(πr²) ∫∫_{x² + y² <= r²/4} dx dy = (πr²/4)/(πr²) = 1/4

Gaussian Distribution, Derived Distributions, Continuous Bayes

Discrete: Expectations come from PMF and sums Continuous: Expectations come from PDF and integrals

Examples of continuous distributions

Uniform(a,b), Exp(λ), N(µ, σ²)

Exponential Distribution

Suppose we want to model a memoryless process. For example, for a switch receiving packets, it doesn’t matter how long you have already waited for a packet to show up: the expected wait time for the next packet doesn’t change.

Mathematically: let X be a non-negative random variable with the memoryless property: P{X > t+s | X > s} = P{X > t} for all t, s > 0. If I’ve waited s seconds already, the distribution of the remaining wait is unchanged.

P({X > t+s} ∩ {X > s}) = P({X > t}) P({X > s}). We can drop {X > s} on the left, since X > t+s already implies X > s; so P(X > t+s) = P(X > t) P(X > s).

Writing G(t) = P(X > t) for the survival function, G(t+s) = G(t) G(s). The only solutions to this functional equation that give a valid CDF (non-negative, monotone) are of the form G(t) = e^{-λt}.

So if X has the memoryless property, Fx(t) = 1 - e^{-λt} for some λ > 0: X is exponential.
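Checking the memoryless identity for the exponential survival function G(t) = e^{-λt}:

```python
import math

lam = 1.5
G = lambda u: math.exp(-lam * u)   # survival function P(X > u)
for s in (0.5, 2.0, 10.0):
    for t in (0.1, 1.0, 3.0):
        # P(X > t+s | X > s) = G(t+s)/G(s) should equal P(X > t) = G(t)
        assert abs(G(t + s) / G(s) - G(t)) < 1e-12
print("memoryless identity holds")
```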

Gaussian Random Variables

Gaussians model sums of many independent effects: heights of NBA players, job satisfaction, noise voltage across a resistor.

We call N(0,1), the standard normal distribution.

CDF: Φ(x) = 1/sqrt(2π) ∫_{-inf}^{x} exp(-u²/2) du

For X ~ N(µ, σ²): P(X <= x) = P((X-µ)/σ <= (x-µ)/σ) = Φ((x-µ)/σ). Gaussians have many nice properties.

If X is gaussian, then so is aX + b. If X,Y are independent gaussians, then X+Y is gaussian

Example: Let V ~ N(1, 5²) be the input voltage to a chip, averaged over 1 second. Our chip fails if the voltage dips below 0.5 volts or exceeds 2.5 volts in any one-second period.

By the union bound, P(chip fails in a 60-second duration) <= 60 * P(chip fails in 1 second) = 60 * [P(V < 0.5) + P(V > 2.5)] = 60 * [Φ((0.5 - 1)/σ) + (1 - Φ((2.5 - 1)/σ))] with σ = 5.
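Evaluating the bound numerically, with Φ built from `math.erf`. Note that with σ = 5 the one-second failure probability is already ≈ 0.84, so the 60x union bound exceeds 1 and is vacuous here; the method is informative for smaller σ:

```python
from math import erf, sqrt

Phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))   # standard normal CDF

mu, sigma = 1.0, 5.0
p_fail_1s = Phi((0.5 - mu) / sigma) + (1 - Phi((2.5 - mu) / sigma))
print(p_fail_1s, 60 * p_fail_1s)
```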

Example: a cellphone sends a bit B ∈ {-1, +1} to a tower. The tower receives Y = B + N, N ~ N(0,1), and decides B(Y) = sign(Y). What is P(err | B = 1)?

Given that I receive {Y = y}, what is the probability that B = 1? Heuristically, P(B = 1 | Y ≈ y) ∝ P(Y ∈ [y, y + 𝛿] | B = 1) P(B = 1), which motivates working with densities:

pB|Y(1|y) = pB(1) fY|B(y|1) / fY(y). This is Bayes rule with a density for Y.

= (1/2 * 1/sqrt(2π) exp(-(y-1)²/2)) / (1/2 * 1/sqrt(2π) exp(-(y-1)²/2) + 1/2 * 1/sqrt(2π) exp(-(y+1)²/2)) = 1 / (1 + e^{-2y})
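The algebra above reduces to a logistic function of y; a quick numeric check that the two forms agree:

```python
import math

phi = lambda u: math.exp(-u * u / 2) / math.sqrt(2 * math.pi)  # N(0,1) density
for y in (-2.0, -0.3, 0.0, 1.7):
    # the 1/2 priors cancel in the ratio
    posterior = phi(y - 1) / (phi(y - 1) + phi(y + 1))
    assert abs(posterior - 1 / (1 + math.exp(-2 * y))) < 1e-12
print("posterior matches 1/(1+e^{-2y})")
```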

The last example motivates the definition of conditional density. If X, Y are jointly continuous, we define the conditional density of X given Y:

fx|y(x|y) = fxy(x,y) / fy(y) interpretation: is the density of X given I know {Y=y}

Bayes rule: fx|y(x|y) = fx(x) fy|x(y|x) / fy(y)

Example: For X, Y jointly continuous, we have E[X|Y=y] = ∫ x fx|y(x|y) dx. Conditional expectation still satisfies the tower property: E[g(Y)X] = E[g(Y) E[X|Y]].

Derived Distributions

The distribution of X is given by CDF Fx. Suppose we define Y = g(X) for some function g.

What is the distribution of Y?

  1. Do you really need this distribution?
  2. If you really need the distribution, it is best to work with CDFs

Example: X continuous random variable, Y = aX + b. Fy(y) = P{aX <= y - b} = P{X <= (y-b)/a} if a > 0, P{X >= (y-b)/a} if a < 0. So Fy(y) = Fx((y-b)/a) if a > 0, and 1 - Fx((y-b)/a) if a < 0.

Differentiating, fy(y) = (1/a) fx((y-b)/a) if a > 0, and (-1/a) fx((y-b)/a) if a < 0; in both cases fy(y) = (1/|a|) fx((y-b)/a).

In the vector case, Y = AX + b gives fy(y) = 1/|det(A)| fx(A^{-1}(y - b)).
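A numeric check of the scalar formula fY(y) = (1/|a|) fX((y-b)/a), comparing it against a finite-difference derivative of FY (my choice of X ~ Exp(1), a = 2, b = 1):

```python
import math

a, b = 2.0, 1.0
F_X = lambda x: 1 - math.exp(-x) if x > 0 else 0.0   # Exp(1) CDF
f_X = lambda x: math.exp(-x) if x > 0 else 0.0       # Exp(1) density

F_Y = lambda y: F_X((y - b) / a)                     # a > 0 case
f_Y = lambda y: f_X((y - b) / a) / abs(a)            # derived density

y, h = 2.5, 1e-6
numeric = (F_Y(y + h) - F_Y(y - h)) / (2 * h)        # dF_Y/dy at y
assert abs(numeric - f_Y(y)) < 1e-6
print(f_Y(y))   # ≈ 0.236
```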

Information Theory and Digital Communication, Capacity of the Binary Erasure Channel (BEC)

Modes of Convergence

Weak law of large numbers: everyone in the room takes a coin and flips it 500 times; then we can say the empirical frequency is (with high probability) between .49 and .51. Then everyone flips 500 more times; now the empirical frequency is between .495 and .505. However, someone might have been between .49 and .51 in the first 500 flips but not between .495 and .505 in the first 1000. Strong law of large numbers: the empirical frequency converges to the true probability with probability 1.

The CLT is a statement about convergence in distribution.

Information Theory

Source Coding Theorem

Protocol

If I observe (x1, …, xn) in the typical set, I describe it via a bitstring (a flag bit 1, followed by the index of the sequence within the typical set, costing about log|A| ≈ n(H(X) + ε/2) bits). If I observe a sequence not in the typical set, I describe it by brute force via a bitstring (a flag bit 0, followed by the raw n log|X|-bit encoding).

What is the performance (average number of bits needed per symbol observed) of this scheme? 1/n * E[number of bits in representation] <= 1/n (2 + n(H(X)+ε/2)) P(typical) + 1/n (2 + n log|X|) P(not typical) <= H(X) + ε/2 + 4/n + log|X| * P(sequence not in the typical set).

Results

  • Descriptions using <= H(X) + ε bits per symbol on average exist
  • Lossless descriptions using fewer than H(X) bits per symbol do not exist
  • Huffman coding: if I know px, I can design a Huffman code requiring <= H(X) + 1/n bits per symbol on average to compress sequences of length n

Question: how are we going to show we can compress down to the entropy? Answer: use concentration. For a sequence (x1, x2, …, xn) of i.i.d. draws, the probability of observing it is p(x1…xn) = ∏_{i=1}^n px(xi).

Theorem: Asymptotic Equipartition Property. If the Xi are i.i.d. with pmf px, then -1/n log p(X1…Xn) converges to H(X) in probability.

Typical set: for ε > 0 and each n >= 1, define the typical set A := {(x1…xn) : p(x1…xn) >= 2^{-n(H(X) + ε)}}.

Why is this called a typical set? It is a set of typical sequences.

Question: suppose I have N objects. What is the max number of bits I need to represent each object? Answer: ceil(log2(N)) bits.
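The AEP can be seen empirically: the per-symbol log-probability of an i.i.d. sample concentrates around H(X). A sketch with a hypothetical 3-symbol source (my choice of pmf):

```python
import math
import random

random.seed(1)
px = {"a": 0.5, "b": 0.25, "c": 0.25}
H = sum(p * math.log2(1 / p) for p in px.values())   # 1.5 bits/symbol

def empirical_rate(n):
    """-1/n log2 p(X1..Xn) for one i.i.d. sample of length n."""
    xs = random.choices(list(px), weights=list(px.values()), k=n)
    return -sum(math.log2(px[x]) for x in xs) / n

for n in (10, 100, 10_000):
    print(n, empirical_rate(n), "vs H =", H)
```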

Poisson Processes

Markov models: model events that have memory. Poisson processes: random arrivals: customers coming into a store, beta particles detected by a Geiger counter, vehicles passing a tollbooth. Counting process: starts at 0, right-continuous, integer-valued, non-decreasing. Poisson process: a counting process with i.i.d. Exp(λ) interarrival times.

Thm: If Nt is a Poisson process with rate λ, then Nt ~ Poisson(λt): P{Nt = n} = e^{-λt} (λt)^n / n!

Proof: P{Nt = n} = P{Tn <= t < Tn+1} = E[1{Tn <= t} 1{t < Tn+1}], where Tn is the time of the n-th arrival.

Increments of a Poisson process are stationary: Nt+s - Ns has the same distribution as Nt (equivalently, as Nτ+t - Nτ for any τ).

Poisson processes have independent increments: if t0 < t1 < … < tk, then the increments Nt1 - Nt0, Nt2 - Nt1, …, Ntk - Ntk-1 are independent.

Thm: If Nt is a counting process with independent stationary increments and Nt ~ Poisson(λt), then Nt ~ Poisson Process(λ)

Conditional distribution of arrivals. Thm: conditioned on {Nt = n}, the arrival times (T1, T2, …, Tn) are distributed as the order statistics of n i.i.d. Uniform(0, t) random variables.

Example: Let cars pass through a tollbooth according to Poisson Process(λ). Question: What is the probability that no cars pass in 2 minutes P{N2 = 0} = (λ2)^0 * e^(-2λ) / 0! = e^(-2λ)

Question: If 10 vehicles pass in 2 minutes, what is the distribution of vehicles that passed in the first 30 seconds? Binomial(10, 1/4)
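The tollbooth facts above can be checked by simulating the process from its definition (Exp(λ) interarrivals) and verifying E[Nt] ≈ λt; here λ = 2, t = 3, so λt = 6:

```python
import random

random.seed(42)
lam, t, trials = 2.0, 3.0, 20_000

def arrivals_by(t):
    """Count arrivals in [0, t] for one realization of the process."""
    clock, n = 0.0, 0
    while True:
        clock += random.expovariate(lam)   # Exp(lam) interarrival time
        if clock > t:
            return n
        n += 1

mean_count = sum(arrivals_by(t) for _ in range(trials)) / trials
print(mean_count)   # ≈ 6.0 = lam * t
```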

Merging(superposition) and Splitting(thinning)

Merging: If N1,t is a Poisson process with rate λ and N2,t is an independent Poisson process with rate µ, then N1,t + N2,t ~ PP(λ+µ).

Splitting: independently keep each arrival with probability p; the kept arrivals form a PP(pλ). Example: let vehicles pass through a tollbooth as PP(λ), where 1/3 of the vehicles are cars and 2/3 are trucks.

If 10 vehicles pass in the first 2 minutes, what is the distribution of the number of cars that pass in the first 30 seconds? Each of the 10 vehicles is independently a car with probability 1/3 and lands in the first 30 seconds with probability 1/4, so the answer is Binomial(10, 1/12).

Random Graphs and Inference

G ~ G(n,p): G is a graph with n vertices where each edge appears independently with probability p.

Monotone graph properties have sharp thresholds: P{G ~ G(n,p) has property P} -> 0 if p << t(n) and -> 1 if p >> t(n), for some threshold t(n). There is a phase transition phenomenon.

Ex: P = {graph G has >= 1 edge}, a monotone property. Claim: t(n) = 1/n². Proof: let X = # of edges in G and p = c/n². P{G has zero edges} = P{X=0} = (1-p)^(n choose 2) ≈ e^{-c/2}, which is basically 1 if c << 1 and basically 0 if c >> 1.

Goal: for the property P = {graph G is connected}, show that t(n) = log(n)/n. That is, for p = λ log(n)/n: if λ > 1, then P{G ~ G(n,p) is connected} -> 1 as n -> infinity; if λ < 1, then P{G ~ G(n,p) is connected} -> 0 as n -> infinity.

Lemma (second moment method): if X is a random variable, then P(X=0) <= Var(X)/(E[X])². Proof: Var(X) = E[(X - E[X])²] = P(X=0) E[(X - E[X])² | X=0] + P(X!=0) E[(X - E[X])² | X!=0] >= P(X=0) (E[X])², since on {X=0} we have (X - E[X])² = (E[X])².

Case λ < 1: we will show P{G has an isolated vertex} -> 1, and {G has an isolated vertex} ⊆ {G is disconnected}. Let X = Σ Ii = # of isolated vertices, where Ii = 1 if vertex i is isolated and 0 otherwise. Ii ~ Bern(q) where q := (1-p)^{n-1}. Var(X) = Σ Var(Ii) + Σ_{i != j} Cov(Ii, Ij) = nq(1-q) + n(n-1) Cov(I1, I2).

Where Cov(I1, I2) = E[I1 I2] - (E[I1])². P{G has no isolated vertices} >= P{X = 0} is what we bound: P{X = 0} <= (nq(1-q) + n(n-1) p q²/(1-p)) / (n²q²) <= (1-q)/(nq) + p/(1-p). The first term goes to zero since nq -> infinity, and p/(1-p) goes to zero as p -> 0.

Case λ > 1: P{G disconnected} = P{∪_{k=1}^{n/2} {there exists a set of k vertices separated from the rest of G}} <= Σ_{k=1}^{n/2} (n choose k) P{vertices 1, …, k are separated from the rest}, by the union bound.
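The phase transition can be seen empirically by sampling G(n,p) on both sides of log(n)/n and checking connectivity with a DFS (a sketch; sizes are chosen small for speed):

```python
import math
import random

random.seed(0)

def is_connected(n, p):
    """Sample G(n,p) and test connectivity by DFS from vertex 0."""
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if random.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    seen, stack = {0}, [0]
    while stack:
        for v in adj[stack.pop()]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == n

n, trials = 200, 100
thr = math.log(n) / n
frac_lo = sum(is_connected(n, 0.5 * thr) for _ in range(trials)) / trials
frac_hi = sum(is_connected(n, 2.0 * thr) for _ in range(trials)) / trials
print(frac_lo, frac_hi)   # below threshold: rarely connected; above: usually
```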

Inference

Given data, how do I choose the model that generated the data? X is the state of nature, often a parameter or hypothesis; it passes through the model P_{Y|X}, which produces the observation (data) Y. X may or may not be a random variable; if it is, the distribution of X is called a prior (Bayesian inference).

Inference

Goal: Infer X from Y X -> Model -> Y

Reasonable Approaches:

  • MLE: X(Y) = argmax_x P(Y | X=x)
  • Binary hypothesis testing: X(Y) = 1 if L(Y) > λ, 0 if L(Y) < λ, where L(y) = P(Y=y | X=1) / P(Y=y | X=0) is the likelihood ratio

For any test X(Y), there are two fundamental error rates: false positives and false negatives.

False Positive: P{X(Y) = 1 | X = 0}

False Negative: P{X(Y) = 0 | X = 1}

Question: given a constraint on the type I (false positive) error rate, how do we find the test that minimizes the type II (false negative) error rate? Solution: solve X = argmin P{X(Y) = 0 | X=1} subject to the constraint. Answer: randomized threshold tests are optimal (Neyman-Pearson theorem). Given β, the optimal decision rule is X(Y) = 1 if L(Y) > λ, 0 if L(Y) < λ, and Bernoulli(γ) if L(Y) = λ, where λ and γ are chosen such that P{X(Y) = 1 | X=0} = β.

Example (Gaussian observations, Y | X=x ~ N(x, σ²)): β = P(L(Y) >= λ | X=0) = P(Y/σ >= 1/(2σ) + σ log(λ) | X=0) = 1 - Φ(1/(2σ) + σ log(λ)).

Question: where does randomization enter the picture? Answer: deterministic threshold tests define and lie on the error curve, but don’t necessarily sweep the whole thing. Ex: if Y is discrete, say binary-valued, then L(Y) takes at most 2 values; take two tests and randomize between them.

Threshold tests are optimal

Estimation under mean squared error Goal: Estimate X based on Y under some loss function X -> Model -> Y