[Probability] Bayesian & Likelihood

2023. 12. 11. 18:29

1. Bayesian Statistical Inference

1.1. Terminology of Bayesian Inference 

$x = (x_1, \cdots, x_n)$: observation vector of $X$

$p_\Theta$ or $f_\Theta$: prior distribution

the distribution of the unknown parameter $\Theta$, assumed before observing $x$

$p_{\Theta|X}$ or $f_{\Theta|X}$: posterior distribution

the distribution of the unknown parameter $\Theta$, obtained after observing $x$

1.2. Summary of Bayesian Inference 

1. We start with a prior distribution $p_\Theta$ or $f_\Theta$ for the unknown random variable $\Theta$.

2. We have a model $p_{X|\Theta}$ or $f_{X|\Theta}$ for the observation vector $X$.

3. After observing the value $x$, we form the posterior distribution of $\Theta$ using the appropriate version of Bayes' rule.

1.3. The four versions of Bayes' rule

1. if $\Theta$ is discrete, $X$ is discrete, then

$$p_{\Theta|X}(\theta|x) = \frac{p_\Theta(\theta) p_{X|\Theta}(x|\theta)}{\sum_{\theta^\prime} p_\Theta(\theta^\prime) p_{X|\Theta}(x|\theta^\prime)}$$

2. if $\Theta$ is discrete, $X$ is continuous, then

$$p_{\Theta|X}(\theta|x) = \frac{p_\Theta(\theta) f_{X|\Theta}(x|\theta)}{\sum_{\theta^\prime} p_\Theta(\theta^\prime) f_{X|\Theta}(x|\theta^\prime)}$$

3. if $\Theta$ is continuous, $X$ is discrete, then

$$f_{\Theta|X}(\theta|x) = \frac{f_\Theta(\theta) p_{X|\Theta}(x|\theta)}{\int f_\Theta(\theta^\prime) p_{X|\Theta}(x|\theta^\prime)\,d\theta^\prime}$$

4. if $\Theta$ is continuous, $X$ is continuous, then

$$f_{\Theta|X}(\theta|x) = \frac{f_\Theta(\theta) f_{X|\Theta}(x|\theta)}{\int f_\Theta(\theta^\prime) f_{X|\Theta}(x|\theta^\prime)\,d\theta^\prime}$$
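
As a concrete illustration, here is a minimal Python sketch of version 1 (discrete $\Theta$, discrete $X$), assuming a hypothetical setup where $\Theta \in \{0.3, 0.5, 0.8\}$ is an unknown head probability and $X$ is the number of heads in 5 tosses; the prior weights are made up for the example.

```python
from math import comb

thetas = [0.3, 0.5, 0.8]
prior = {0.3: 0.25, 0.5: 0.50, 0.8: 0.25}        # p_Theta(theta), assumed

def likelihood(x, theta, n=5):
    """p_{X|Theta}(x | theta): Binomial(n, theta) pmf."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

x = 4                                            # observed: 4 heads out of 5
unnormalized = {t: prior[t] * likelihood(x, t) for t in thetas}
evidence = sum(unnormalized.values())            # denominator of Bayes' rule
posterior = {t: u / evidence for t, u in unnormalized.items()}
print(posterior)   # {0.3: ~0.038, 0.5: ~0.416, 0.8: ~0.546}
```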

1.4. Maximum a Posteriori Probability (MAP) rule

$$\hat{\theta_n} = \arg\underset{\theta}\max p_{\Theta|X}(\theta|x) \ (\Theta \text{ is discrete}), \quad \hat{\theta_n} = \arg\underset{\theta}\max f_{\Theta|X}(\theta|x) \ (\Theta \text{ is continuous})$$
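
Carrying over the posterior values computed in the sketch above (rounded), the MAP estimate is simply the $\theta$ with the largest posterior probability:

```python
# Posterior carried over from the Bayes' rule sketch above (values rounded).
posterior = {0.3: 0.038, 0.5: 0.416, 0.8: 0.546}   # p_{Theta|X}(theta | x)
theta_map = max(posterior, key=posterior.get)      # arg max over theta
print(theta_map)                                   # -> 0.8
```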

1.5. Example of Bayesian Inference

Suppose that Juliet is always late for her dates with Romeo by $X \sim \text{Uniform}[0, \theta]$.

Given that Juliet was $x_1$ hours late on the first date, update $f_\Theta$; that is, find $f_{\Theta|X}$ and $\hat \theta$.

 

$$f_\Theta(\theta) = \begin{cases} 1, & \text{if } 0 \le \theta \le 1 \\ 0, & \text{otherwise} \end{cases} \quad f_{X|\Theta}(x|\theta) = \begin{cases} 1/\theta, & \text{if } 0 \le x \le \theta \\ 0, & \text{otherwise} \end{cases}$$ $$f_{\Theta|X}(\theta|x_1) =  \frac{f_\Theta(\theta) f_{X|\Theta}(x_1|\theta)}{\int_0^1 f_\Theta(\theta^\prime) f_{X|\Theta}(x_1|\theta^\prime)\,d\theta^\prime} = \frac{1/\theta}{\int_{x_1}^1 \frac{1}{\theta^\prime}\,d\theta^\prime} = \frac{1}{\theta \cdot |\log x_1|}, \quad \text{if } x_1 \le \theta \le 1$$

 

Since $f_{\Theta|X}$ is largest when $\theta = x_1$, the MAP estimate is $\hat \theta = x_1$.
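
A quick numerical sanity check of this posterior, assuming a hypothetical observation $x_1 = 0.5$: the density $1/(\theta \, |\log x_1|)$ on $[x_1, 1]$ should integrate to 1 and be largest at $\theta = x_1$.

```python
import numpy as np

x1 = 0.5                                   # hypothetical observed lateness
theta = np.linspace(x1, 1.0, 100_001)
density = 1.0 / (theta * abs(np.log(x1)))  # f_{Theta|X}(theta | x1)

# Riemann-sum approximation of the integral over [x1, 1].
print(density.sum() * (theta[1] - theta[0]))  # ~1.0: properly normalized
print(theta[np.argmax(density)])              # 0.5 = x1: the MAP estimate
```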


2. Classical Statistical Inference

2.1. Estimation of the Mean and Variance of a Random Variable

Let $X_1, X_2, \cdots$ be i.i.d. random variables whose mean $\mu$ and variance $\sigma^2$ are unknown.

The estimator of the mean is the sample mean:

$$M_n=\frac{X_1+X_2+\ldots+X_n}{n},\ \ E\left[M_n\right]=\mu,\ \ \text{var}\left(M_n\right)=\frac{\sigma^2}{n}$$

The estimator of the variance is the sample variance:

$$\bar{S}_n^2=\frac{1}{n}\sum_{i=1}^{n}(X_i-M_n)^2,\ \ \ \ E\left[\bar{S}_n^2\right]=\frac{n-1}{n}\sigma^2,\ \ \hat{S}_n^2=\frac{1}{n-1}\sum_{i=1}^{n}(X_i-M_n)^2,\ \ \ \ E\left[\hat{S}_n^2\right]=\sigma^2$$

Proof:

$$\begin{aligned}E\left[\bar{S}_n^2\right]&=\frac{1}{n}E\left[\sum_{i=1}^{n}\left(X_i^2-2X_iM_n+M_n^2\right)\right] \\ &=\frac{1}{n}E\left[\sum_{i=1}^{n}X_i^2-2M_n\sum_{i=1}^{n}X_i+nM_n^2\right] \\ &=E\left[\frac{1}{n}\sum_{i=1}^{n}X_i^2-2M_n^2+M_n^2\right]\\&=E\left[\frac{1}{n}\sum_{i=1}^{n}X_i^2-M_n^2\right]\\&=\mu^2+\sigma^2-\left(\mu^2+\frac{\sigma^2}{n}\right)\\&=\frac{n-1}{n}\sigma^2 \end{aligned}$$

where the third line uses $\sum_{i=1}^n X_i = nM_n$, and the fifth uses $E[X_i^2]=\mu^2+\sigma^2$ and $E[M_n^2]=\mu^2+\sigma^2/n$.
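
The bias of $\bar{S}_n^2$ is easy to see in simulation. A minimal sketch, assuming hypothetical $X_i \sim \text{Normal}(0, \sigma^2)$ with $\sigma^2 = 4$ and $n = 5$:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, n, trials = 4.0, 5, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
m = samples.mean(axis=1, keepdims=True)     # sample mean M_n of each trial
sq_dev = ((samples - m) ** 2).sum(axis=1)   # sum of squared deviations

print(sq_dev.mean() / n)        # ~3.2 = (n-1)/n * sigma^2: biased low
print(sq_dev.mean() / (n - 1))  # ~4.0 = sigma^2: unbiased
```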


2.2. Maximum Likelihood Estimation (ML estimation)

Suppose the vector of observations $X = (X_1, \cdots, X_n)$ is described by a distribution $p_X(x; \theta)$ whose form depends on an unknown parameter $\theta$.

Suppose that we observe a particular value $x = (x_1, \cdots, x_n)$ of $X$.

The maximum likelihood estimate is the value $\hat \theta$ of the parameter that maximizes $p_X(x_1, \cdots, x_n; \theta)$ over all $\theta$:

$$\hat{\theta_n} = \arg\underset{\theta}\max p_X(x_1, \cdots, x_n; \theta) \ (X \text{ is discrete}) \\ \hat{\theta_n} = \arg\underset{\theta}\max f_X(x_1, \cdots, x_n; \theta) \ (X \text{ is continuous})$$

It is natural to maximize the probability that the observed results occur,

so we estimate the parameter $\hat \theta$ by maximizing $p_X(x; \theta)$, which is referred to as the likelihood function.

 

In many experiments, the observations $X_i$ are assumed to be independent, so the likelihood function takes the form

$$p_X(x_1, \cdots, x_n; \theta) = \prod_{i=1}^n p_{X_i}(x_i;\theta) \ (X \text{ is discrete}), \quad f_X(x_1, \cdots, x_n; \theta) = \prod_{i=1}^n f_{X_i}(x_i;\theta) \ (X \text{ is continuous})$$
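
In practice this product is almost always computed on the log scale: a product of many factors in $(0,1)$ underflows double precision, while the sum of logs stays finite. A minimal sketch, assuming a hypothetical i.i.d. Bernoulli sample:

```python
import math

def log_likelihood(xs, theta):
    """log p_X(x_1, ..., x_n; theta) = sum_i log p_{X_i}(x_i; theta)."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in xs)

xs = [1, 0, 1, 1, 0] * 400        # 2000 i.i.d. Bernoulli observations
ll = log_likelihood(xs, 0.6)
print(ll)                         # ~ -1346: a perfectly usable number
print(math.exp(ll))               # 0.0: the raw product underflows
```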

 

- Important -

$p_X(x_i; \theta)$ does not mean "the probability that $\Theta$ equals $\theta$";

it means "the likelihood (probability) that $x_i$ occurs when $\Theta$ is set to $\theta$."

In short, MLE is the process of finding the $\hat \theta$ that makes the observation $x$ most likely.

2.2.1. Example of ML estimation

Likelihood function: $X \sim \text{Bernoulli}(\theta), \quad p_X(x; \theta) = \begin{cases} \theta, & \text{if } x = 1 \\ 1-\theta, & \text{if } x = 0 \end{cases}$, where $\theta$ is the probability that Jeongwan wins a game.

Given that Jeongwan won 6 and lost 4 of his first 10 placement matches ($x = (1, 0, 0, 0, 1, 1, 1, 0, 1, 0)$), find $\hat \theta$.

 

$$p_X(x;\theta) = \theta^6(1-\theta)^4 \\ \frac{dp_X}{d\theta}(x; \theta) = 6\theta^5(1-\theta)^4 - 4\theta^6(1-\theta)^3 = (6 - 10\theta)\,\theta^5(1-\theta)^3 \\ 6 - 10\theta = 0 \ \rightarrow \ \theta = \frac{6}{10}$$

 

Since $p_X(x; \theta)$ is largest when $\theta = \frac{6}{10}$, the ML estimate is $\hat \theta = \frac{6}{10}$.
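
A quick numerical check of this answer: evaluate the likelihood $\theta^6(1-\theta)^4$ on a grid and confirm that the maximizer is $6/10$.

```python
import numpy as np

theta = np.linspace(0.0, 1.0, 1_001)       # grid with step 0.001
likelihood = theta**6 * (1.0 - theta)**4   # p_X(x; theta) from the example
print(theta[np.argmax(likelihood)])        # -> 0.6
```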


- Bayesian Statistics vs. Classical Statistics -

 

1. How $\Theta$ is estimated: Bayesian statistics estimates the parameter by updating the probability distribution of $\Theta$ through Bayes' rule,

while classical statistics estimates the parameter using the likelihood function, which measures how likely the observations $x$ are.

 

2. View of $\Theta$: classical statistics treats the unknown parameter $\Theta$ as a constant,

whereas Bayesian statistics treats the unknown parameter $\Theta$ as a random variable.

 

3. Pros and cons: in Bayesian statistics, the initial prior $p_\Theta$ is chosen subjectively.

However, as the observations $x$ accumulate, the inference becomes increasingly objective.

Classical statistics estimates the distribution purely from the observations $x$, with no subjective input.

However, with few observations $x$ its reliability drops sharply, making it hard to use.

 

 
