Central Limit Theorem Explained

The Central Limit Theorem (CLT) is a statistical result stating that, given a sufficiently large sample size from a population with finite variance, the distribution of the sample means will be approximately normal, and the mean of that distribution will be approximately equal to the mean of the population.

Working with real data

To explain the concepts of the Central Limit Theorem, I am taking a dataset from here.

In this dataset we have 537577 data points, and below is the distribution plot of the prices of the items.

Getting samples

To demonstrate the Central Limit Theorem, let's do a simple experiment: collect a few samples from the data and calculate the mean of each sample.

The image below shows the distribution of those means.

In the image above I have plotted 6 distributions. The top-left one is generated with 100 samples, each of size 50: for each sample I calculated the mean and then drew a KDE plot of those means. The blue line shows the mean of the sample means, i.e. mean(100 sample means), and the red line is drawn at the population mean. The remaining plots were generated in a similar fashion.

If we look at the third-row distribution plots, we can say that the larger the sample size, the more the distribution of sample means looks Gaussian.

I have tabulated all the values so that we can have a better look at the numbers and relate them to our experiment. In the above table, P_mean is the population mean, P_std is the population standard deviation, and $n$ is the number of elements in each sample.

Observations: From the above stats, we can see that the distribution of sample means has mean $\mu_{x} \approx \mu$ and standard deviation $\sigma_{x} \approx \frac{\sigma}{\sqrt{n}}$.
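The two observations above can be checked with a minimal sketch. It assumes a synthetic, right-skewed (exponential) population in place of the purchase data, and the helper name `sample_means` is hypothetical. It uses only the Python standard library, so it is an illustration of the experiment rather than the exact code behind the plots.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical skewed population (exponential), standing in for the purchase data.
population = [random.expovariate(1 / 100) for _ in range(50_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

def sample_means(data, n, num_samples):
    """Draw num_samples random samples of size n and return their means."""
    return [statistics.fmean(random.sample(data, n)) for _ in range(num_samples)]

n = 50
means = sample_means(population, n, 2000)
mean_of_means = statistics.fmean(means)
std_of_means = statistics.stdev(means)

print("population mean      :", round(mu, 2))
print("mean of sample means :", round(mean_of_means, 2))
print("sigma / sqrt(n)      :", round(sigma / n ** 0.5, 2))
print("std of sample means  :", round(std_of_means, 2))
```

The first two printed values should be close to each other, and so should the last two, even though the population itself is far from normal.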

Central Limit Theorem says:

The sampling distribution of the sample mean $\overline{X}$ is approximately normally distributed with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$, even if the original distribution is non-normal, provided the sample size $n$ is sufficiently large.

The larger the sample size $n$ is, the more normally distributed the sampling distribution will be and the more tightly it will converge about the true population mean $\mu$.

The sampling distribution of the sample mean $\overline{X}$ is exactly normally distributed with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$ if the original distributions are normal.

Note: here $\overline{x}$ is the mean of a single sample, $\mu_{x}$ is the mean of the sampling distribution of the sample mean, and $\mu$ is the population mean.

Population Mean Confidence Intervals for Larger Samples

Suppose the population mean is $\mu$, and you have taken a sample and calculated its mean as $\overline{x}$.
Often, for a given population, we don't know the value of $\mu$.
Here we try to estimate the unknown $\mu$ with the help of the known value $\overline{x}$.

Q: “Why do we need a sample mean to estimate the population mean? Why can’t we directly calculate the mean of the population?”

Ans: Suppose your population of interest is in Delhi, and you want to know the mean age of the population.

Due to lack of time, energy, and money, you cannot obtain the age of every person in Delhi.
Instead, you can select a sample (e.g. a simple random sample) and calculate the mean of that sample, $\overline{x}$.

Q: Then “Why don’t we just use the sample mean $\overline{x}$ to estimate the population mean $\mu$?”

We can – but the sample mean $\overline{x}$ may be quite different from the population mean $\mu$, even if we obtained the sample correctly.
In addition, a single number estimate by itself, such as $\overline{x}$, provides no information about the precision and reliability of the estimate with respect to the larger population.

Statisticians use the sample statistic $\overline{x}$ and the population standard deviation ($\sigma$) or the sample standard deviation to provide an interval of plausible estimates for the population parameter $\mu$. This interval is called a confidence interval.

Definition: A confidence interval is an entire interval of plausible values for a population parameter, such as $\mu$, based on observations obtained from a random sample of size $n$.

Let us answer a question

What is the average amount of money spent by the male population on Black Friday?

Before we see how to estimate that, let's have a look at a couple of concepts.

Standard error

The standard error (SE) of a statistic (usually an estimate of a parameter) is the standard deviation of its sampling distribution or an estimate of that standard deviation.

If the parameter or the statistic is the mean, it is called the standard error of the mean (SEM).

The standard error of the mean (SEM) can be expressed as:

$\sigma_{\overline{x}}=\frac{\sigma}{\sqrt{n}}$

Since the population standard deviation is seldom known, the standard error of the mean (SEM) is usually estimated as the sample standard deviation divided by the square root of the sample size (assuming statistical independence of the values in the sample).

${\sigma }_{\bar {x}}\ \approx {\frac {s}{\sqrt {n}}}$
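As a small illustration of the estimate above, here is a sketch that computes the SEM from a sample. The ten purchase amounts are made up for illustration only, and `statistics.stdev` is used because it divides by $n-1$, which is the usual sample standard deviation $s$.

```python
import math
import statistics

# Hypothetical purchase amounts, made up for illustration only.
sample = [120, 95, 210, 180, 75, 160, 140, 300, 110, 130]

n = len(sample)
s = statistics.stdev(sample)   # sample standard deviation (divides by n - 1)
sem = s / math.sqrt(n)         # estimated standard error of the mean

print("n   =", n)
print("s   =", round(s, 2))
print("SEM =", round(sem, 2))
```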

(zScore) and Confidence Levels:

Let $\alpha$ be a number between 0 and 1, and let 100 * (1 – $\alpha$)% denote the confidence level.
For example,
- if $\alpha$ = 0.05, then the corresponding confidence level is 95%.
- If $\alpha$= 0.01, then the confidence level is 99%.
Suppose we have a standard normal distribution $Z$.
Let $z_{\frac{\alpha}{2}}$ denote a $zScore$ with α/2 probability to its right.
Similarly, let $-z_{\frac{\alpha}{2}}$ denote a $zScore$ with α/2 probability to its left.
Example: The value $z_{0.10}$ is the positive z-score that has α/2 = 0.1 probability to its right. The desired $zScore$ is 1.282.
The value $-z_{0.25}$ is the negative z-score that has α/2 = 0.25 probability to its left. The desired $zScore$ is -0.6745.

# https://stackoverflow.com/a/20864883/4084039
import scipy.stats as st
print("zScore for 0.1 probability to right is",st.norm.ppf(1-0.10))
print("zScore for 0.25 probability to left is",st.norm.ppf(0.25))

zScore for 0.1 probability to right is 1.2815515655446004
zScore for 0.25 probability to left is -0.6744897501960817

Note: the data we have in hand might not include all the purchases that were made, so assume we are given the population standard deviation as 5051.

# we take a random sample of 100 male customers and calculate its mean
# (assumes `df` is the dataset already loaded as a pandas DataFrame)
import random
import numpy as np

data_male = np.array(df[df['Gender']=='M']['Purchase'].values)
samples = random.sample(range(0, data_male.shape[0]), 100)
print("the mean of money spent by sample set of 100 persons :",data_male[samples].mean())
print("Given that we have population standard deviation : 5051")
print("From central limit theorem we can say that, the std of sampling distribution of the sample mean is \u03C3/\u221An :", 5051/10)

the mean of money spent by sample set of 100 persons : 10474.31
Given that we have population standard deviation : 5051
From central limit theorem we can say that, the std of sampling distribution of the sample mean is σ/√n : 505.1

We know that in a normal distribution, a given data point has about a 95% probability of lying within the range [ $\mu-2\sigma$, $\mu+2\sigma$].

The sampling distribution of the sample means is approximately a normal distribution.
For any sample mean $\overline{x}$ we take, there is about a 95% probability that it lies within the range [ $\mu_{x}-2\sigma_{x}$, $\mu_{x}+2\sigma_{x}$], i.e. for every 100 sample means, typically 95 of them are in this range [ $\mu_{x}-2\sigma_{x}$, $\mu_{x}+2\sigma_{x}$].
Equivalently, for any sample mean $\overline{x}$ we take, there is about a 95% probability that the range [ $\overline{x}-2\sigma_{x}$, $\overline{x}+2\sigma_{x}$] will contain the distribution mean $\mu_{x} [\approx \mu]$.

Which is the best interpretation of a 95% confidence interval for the population mean μ?

Option 1:

If repeated random samples were taken and the 95% confidence interval was computed for each sample, 95% of the intervals would contain the population mean.

Option 2:

The probability that the population mean μ is in the confidence interval is 0.95

Option 3:

95% of the population distribution is contained in the confidence interval.

Answer:

The correct answer is Option 1 (compare it with the three statements preceding the question). Option 2 is incorrect because it places the probability on μ instead of on the confidence interval. Option 3 is incorrect since the confidence interval for the population mean is built using sample means and not values from the population distribution. Using population distribution values would give us an interval that is wider than the one for the population mean.
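Option 1 can be illustrated with a short simulation. This is a sketch under stated assumptions: the population is synthetic (drawn from a normal distribution with made-up parameters), its standard deviation is treated as known, and only the Python standard library is used. It repeatedly draws samples, builds the 95% interval around each sample mean, and counts how often the interval contains the population mean.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population with made-up mean and standard deviation.
population = [random.gauss(9000, 5000) for _ in range(20_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

n = 100
z = statistics.NormalDist().inv_cdf(0.975)   # 0.975 quantile, about 1.96
half_width = z * sigma / n ** 0.5

trials = 2000
hits = 0
for _ in range(trials):
    x_bar = statistics.fmean(random.sample(population, n))
    # Does this sample's interval contain the population mean?
    if x_bar - half_width <= mu <= x_bar + half_width:
        hits += 1

coverage = hits / trials
print("fraction of intervals containing mu:", coverage)
```

The printed fraction should land close to 0.95, which is what "95% of the intervals would contain the population mean" refers to.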

From the above equations, let us construct an interval [$\overline{x}$- 2*505.1, $\overline{x}$+2*505.1] = [ $\overline{x}$- 2*$\frac{\sigma}{\sqrt{n}}$, $\overline{x}$+2*$\frac{\sigma}{\sqrt{n}}$]

In the above figure, the red line shows the sample mean $\overline{x}$ and the two green lines show the interval [ $\overline{x}$- 2*$\frac{\sigma}{\sqrt{n}}$, $\overline{x}$+2*$\frac{\sigma}{\sqrt{n}}$]

Note: We are making a big assumption here: that we know the population standard deviation to be 5051.
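As a worked instance of this formula, plugging in the sample mean printed earlier, $\overline{x} = 10474.31$, and $\frac{\sigma}{\sqrt{n}} = 505.1$ (with the multiplier rounded to 2) gives:

$[\,10474.31 - 2 \times 505.1,\ 10474.31 + 2 \times 505.1\,] = [\,9464.11,\ 11484.51\,]$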

Confidence interval when we don’t have knowledge of the population standard deviation

We know the confidence interval [ $\overline{x}$- 2*$\frac{\sigma}{\sqrt{n}}$, $\overline{x}$+2*$\frac{\sigma}{\sqrt{n}}$] when we know the population standard deviation. Notice that here we are estimating the population mean with the sample mean (from the above pdf plots, the sample mean is close to the population mean).

Can we similarly estimate the population standard deviation using the sample standard deviation? Answer: Yes, we can.

SE is used to make confidence intervals for the unknown population mean. If the sampling distribution is normally distributed, the sample mean, the standard error, and the quantiles of the normal distribution can be used to calculate confidence intervals for the true population mean.

The following expressions can be used to calculate the upper and lower 95% confidence limits:

$\text{upper 95\% limit} = \bar{x} + \text{SE} \times 1.96$
$\text{lower 95\% limit} = \bar{x} - \text{SE} \times 1.96$
where $\bar{x}$ is the sample mean (an estimate of the population mean), SE is the standard error of the sample mean, and
1.96 is the 0.975 quantile of the standard normal distribution.

But why have we taken the 0.975 quantile?

Answer: As we need a confidence level of 95%, the ${\alpha}$ value will be 0.05, so $\frac{\alpha}{2}=0.025$. The z-score with 0.025 probability to its right is the 0.975 quantile of the standard normal distribution, so:

$\text{Upper 95\% limit} = \bar{x} + \text{SE} \times z_{\frac{\alpha}{2}} = \bar{x} + \text{SE} \times z_{0.025} = \bar{x} + \text{SE} \times 1.96$
$\text{Lower 95\% limit} = \bar{x} - \text{SE} \times z_{\frac{\alpha}{2}} = \bar{x} - \text{SE} \times z_{0.025} = \bar{x} - \text{SE} \times 1.96$
From the above equations, rounding 1.96 to 2 and estimating SE as $\frac{s}{\sqrt{n}}$, we can construct an interval [ $\overline{x}$- 2*$\frac{s}{\sqrt{n}}$, $\overline{x}$+2*$\frac{s}{\sqrt{n}}$]

Conclusion: Finding the confidence interval of the population mean

Case 1: Knowing the population standard deviation ${\sigma}$
1. Get a sample of decent size ($n$) from the population and calculate its mean $\overline{x}$.
2. Report the confidence interval as [ $\overline{x}$- 2*$\frac{\sigma}{\sqrt{n}}$, $\overline{x}$+2*$\frac{\sigma}{\sqrt{n}}$]
Case 2: Without knowing the population standard deviation
1. Get a sample of decent size ($n$) from the population and calculate its mean $\overline{x}$.
2. Calculate the sample standard deviation $s$ and find the standard error of the mean (SE mean) as $\frac{s}{\sqrt{n}}$.
3. Report the confidence interval as [ $\overline{x}$- 2*$\frac{s}{\sqrt{n}}$, $\overline{x}$+2*$\frac{s}{\sqrt{n}}$] or [ $\overline{x}$- 2*$\text{SE mean}$, $\overline{x}$+2*$\text{SE mean}$]
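Both cases can be sketched as one small function. The name `mean_confidence_interval` and its parameters are hypothetical, the sample is generated synthetically for illustration, and the code uses the exact normal quantile (about 1.96 for 95%) from the standard library's `statistics.NormalDist` instead of the rounded multiplier 2 used in the steps above.

```python
import math
import random
import statistics

def mean_confidence_interval(sample, confidence=0.95, sigma=None):
    """Return (lower, upper) for the population mean using a z-score.

    If sigma (the population standard deviation) is None, the sample
    standard deviation is used in its place (Case 2).
    """
    n = len(sample)
    x_bar = statistics.fmean(sample)
    spread = sigma if sigma is not None else statistics.stdev(sample)
    se = spread / math.sqrt(n)                       # standard error of the mean
    alpha = 1 - confidence
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    return x_bar - z * se, x_bar + z * se

random.seed(2)
# Hypothetical sample of 100 purchase amounts, generated for illustration.
sample = [random.gauss(9500, 5000) for _ in range(100)]

print("Case 1 (sigma known)  :", mean_confidence_interval(sample, sigma=5051))
print("Case 2 (sigma unknown):", mean_confidence_interval(sample))
```

Passing `sigma=5051` reproduces Case 1 with the standard deviation assumed in this article, while omitting it falls back to the sample standard deviation as in Case 2.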
