The Central Limit Theorem (CLT) states that, given a sufficiently large sample size from a population with finite variance, the distribution of the sample means is approximately normal, and the mean of those sample means is approximately equal to the mean of the population.
To explain the concepts behind the Central Limit Theorem, I am taking a dataset from here.
This dataset has 537577 data points, and below is the distribution plot of the prices of the items.
To demonstrate the Central Limit Theorem, let's do a simple experiment: collect a few samples from the data and calculate the mean of each sample.
The image below shows the distribution of those means.
In the image above I have plotted 6 distributions. The top-left one is generated from 100 samples, each of size 50; for each sample I calculated the mean and then drew a KDE plot of those means. The blue line shows the mean of the sample means, i.e. mean(100 sample means), and the red line is drawn at the population mean. The remaining plots were generated in a similar fashion.
If we look at the third-row distribution plots, we can see that the larger the sample size, the more the distribution of sample means looks Gaussian.
I have tabulated all the values so that we can have a better look at the numbers and relate them to our experiment. In the above table, P_mean is the population mean, P_std is the population standard deviation, and \(n\) is the number of elements in each sample.
Observations: If you check the above stats, the distribution of sample means has mean \(\mu_{x} \approx \mu\) and standard deviation \(\sigma_{x} \approx \frac{\sigma}{\sqrt{n}}\).
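The observation can be checked with a small simulation. The sketch below is not the experiment from this post: it assumes a synthetic, right-skewed population (an exponential distribution with an arbitrary scale) instead of the purchase data, and the names `population`, `sample_means`, `n` and `num_samples` are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Assumption: a synthetic right-skewed population, not the dataset from the post
population = rng.exponential(scale=9000.0, size=200_000)
mu, sigma = population.mean(), population.std()

n = 50              # number of elements in each sample
num_samples = 2000  # number of samples drawn

# Draw the samples (with replacement) and take the mean of each one
idx = rng.integers(0, population.size, size=(num_samples, n))
sample_means = population[idx].mean(axis=1)

print("population mean      :", mu)
print("mean of sample means :", sample_means.mean())
print("sigma / sqrt(n)      :", sigma / np.sqrt(n))
print("std of sample means  :", sample_means.std())
```

Even though the population here is far from Gaussian, the two pairs of printed numbers should come out close to each other, which is what the table suggests for the real data.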
Central Limit Theorem says:
Note: here \(\overline{x}\) is the mean of a single sample, \(\mu_{x}\) is the mean of the sampling distribution of the sample mean, and \(\mu\) is the population mean.
Q: Why do we need a sample mean to calculate the population mean? Why can't we directly calculate the mean of the population?
Ans: Suppose your population of interest is in Delhi, and you want to know the mean age of the population. Recording the age of every resident is rarely practical, so we work with a sample instead.
Q: Then why don't we just use the sample mean \(\overline{x}\) to estimate the population mean \(\mu\)?
Statisticians use the sample statistic \(\overline{x}\), together with the population standard deviation \(\sigma\) or the sample standard deviation, to provide an interval of plausible estimates for the population parameter \(\mu\). This interval is called a confidence interval.
Definition: A confidence interval is an entire interval of plausible values for a population parameter, such as \(\mu\), based on observations obtained from a random sample of size \(n\).
What is the average money spent by the male population on Black Friday?
Before we see how to estimate that, let's have a look at a couple of concepts.
Since the population standard deviation is seldom known, the standard error of the mean (SEM) is usually estimated as the sample standard deviation divided by the square root of the sample size (assuming statistical independence of the values in the sample).
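As a rough sketch of that estimate: the helper `standard_error` and the values in `toy_sample` below are hypothetical, made up for illustration and not taken from the dataset.

```python
import numpy as np

def standard_error(sample):
    """Estimate the SEM as s / sqrt(n), where s is the sample standard deviation."""
    sample = np.asarray(sample, dtype=float)
    # ddof=1 gives the sample (n - 1) standard deviation
    return sample.std(ddof=1) / np.sqrt(sample.size)

# Hypothetical purchase amounts, made up for illustration
toy_sample = [8000, 12000, 9500, 11000, 10500, 7000, 13000, 9000]
sem = standard_error(toy_sample)
print("estimated SEM:", sem)
```

Using `ddof=1` matters mostly for small samples; for a sample of a few hundred values the difference from `ddof=0` is small.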
# https://stackoverflow.com/a/20864883/4084039
import scipy.stats as st
print("zScore for 0.1 probability to right is",st.norm.ppf(1-0.10))
print("zScore for 0.25 probability to left is",st.norm.ppf(0.25))
zScore for 0.1 probability to right is 1.2815515655446004
zScore for 0.25 probability to left is -0.6744897501960817
Note: the data we have in hand might not include all the purchases that were made, so assume we have been given the population standard deviation as 5051.
import random
import numpy as np

# df is assumed to be the dataset, already loaded as a pandas DataFrame
# take a random sample of 100 male customers and calculate their mean purchase
data_male = np.array(df[df['Gender']=='M']['Purchase'].values)
samples = random.sample(range(0, data_male.shape[0]), 100)
print("the mean of money spent by sample set of 100 persons :",data_male[samples].mean())
print("Given that we have population standard deviation : 5051")
# sigma / sqrt(n) with sigma = 5051 and n = 100
print("From central limit theorem we can say that, the std of sampling distribution of the sample mean is \u03C3/\u221An :", 5051/10)
the mean of money spent by sample set of 100 persons : 10474.31
Given that we have population standard deviation : 5051
From central limit theorem we can say that, the std of sampling distribution of the sample mean is σ/√n : 505.1
We know that in a normal distribution, a data point has about a 95% probability of lying within the range [\(\mu-2\sigma\), \(\mu+2\sigma\)].
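The 95% figure is a rounded value, and it can be checked with the same `scipy.stats` module used earlier. This is only a quick numerical check; the variable names are illustrative.

```python
import scipy.stats as st

# Probability mass of a standard normal within 2 and within 1.96 standard deviations
within_2 = st.norm.cdf(2) - st.norm.cdf(-2)
within_196 = st.norm.cdf(1.96) - st.norm.cdf(-1.96)

print("P(|Z| <= 2)    =", within_2)
print("P(|Z| <= 1.96) =", within_196)
```

Two standard deviations cover slightly more than 95% of the mass, while 1.96 standard deviations cover almost exactly 95%.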
Choose the best interpretation of a 95% confidence interval for the population mean μ.
Option 1: If repeated random samples were taken and the 95% confidence interval was computed for each sample, 95% of the intervals would contain the population mean.
Option 2: The probability that the population mean μ is in the confidence interval is 0.95.
Option 3: 95% of the population distribution is contained in the confidence interval.
The correct answer is Option 1. Looking at the three options above: Option 2 is incorrect because it places the probability on μ instead of on the confidence interval. Option 3 is incorrect because the confidence interval for the population mean is built using sample means and not values from the population distribution. Using population distribution values would give us an interval that is wider than the one for the population mean.
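The repeated-sampling reading in Option 1 can be illustrated with a simulation. This is a sketch under stated assumptions: the population is a synthetic normal distribution with made-up `mu` and `sigma`, the population standard deviation is treated as known, and the 1.96 multiplier is used for the 95% interval.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the run is repeatable

# Assumption: a synthetic normal population with known mu and sigma
mu, sigma = 9000.0, 5000.0
n = 100               # size of each sample
num_intervals = 5000  # number of repeated samples

# One row per repeated sample, then one sample mean per row
samples = rng.normal(mu, sigma, size=(num_intervals, n))
x_bar = samples.mean(axis=1)

# Half-width of the 95% interval when sigma is known
half_width = 1.96 * sigma / np.sqrt(n)

# For each interval, check whether it contains the true population mean
covered = (x_bar - half_width <= mu) & (mu <= x_bar + half_width)
coverage = covered.mean()
print("fraction of intervals containing mu:", coverage)
```

The population mean never moves in this simulation; what varies from sample to sample is the interval, which is why the probability statement belongs to the interval and not to μ.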
From the above equations, let us construct an interval [\(\overline{x} - 2 \times 505.1\), \(\overline{x} + 2 \times 505.1\)] = [\(\overline{x} - 2\frac{\sigma}{\sqrt{n}}\), \(\overline{x} + 2\frac{\sigma}{\sqrt{n}}\)].
In the above figure, the red line shows the sample mean \(\overline{x}\) and the two green lines show [\(\overline{x} - 2\frac{\sigma}{\sqrt{n}}\), \(\overline{x} + 2\frac{\sigma}{\sqrt{n}}\)].
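Plugging in the numbers from this post gives the interval explicitly. The sketch below only redoes the arithmetic with the sample mean 10474.31 printed earlier and the assumed population standard deviation 5051; the variable names are illustrative.

```python
import math

x_bar = 10474.31  # sample mean printed earlier in the post
sigma = 5051      # population standard deviation assumed in the post
n = 100           # sample size

se = sigma / math.sqrt(n)  # std of the sampling distribution of the sample mean
lower = x_bar - 2 * se
upper = x_bar + 2 * se
print("interval: [%.2f, %.2f]" % (lower, upper))
```

A different random sample of 100 customers would give a different \(\overline{x}\) and therefore a shifted interval of the same width.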
Note: We are making a big assumption here: that we know the population standard deviation to be 5051.
We know the confidence interval [\(\overline{x} - 2\frac{\sigma}{\sqrt{n}}\), \(\overline{x} + 2\frac{\sigma}{\sqrt{n}}\)] when we know the population standard deviation. Notice that here we are estimating the population mean with the sample mean (from the above pdf plots, the sample mean is close to the population mean).
Can we similarly estimate the population standard deviation using the sample standard deviation? Answer: Yes, we can estimate it.
SE is used to make confidence intervals for the unknown population mean. If the sampling distribution is normally distributed, the sample mean, the standard error, and the quantiles of the normal distribution can be used to calculate confidence intervals for the true population mean.
The following expressions can be used to calculate the upper and lower 95% confidence limits:
\(\text{upper 95\% limit} = \overline{x} + \text{SE} \times 1.96\)
\(\text{lower 95\% limit} = \overline{x} - \text{SE} \times 1.96\)
But why have we taken the 0.975 quantile?
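A two-sided 95% interval leaves 5% of the probability outside it, split equally as 2.5% in each tail, so the upper cut-off is the point with 97.5% of the mass to its left. The sketch below checks this with `st.norm.ppf`, the same function used earlier; the variable names are illustrative.

```python
import scipy.stats as st

confidence = 0.95
alpha = 1 - confidence  # total probability left outside the interval

# Two-sided interval: alpha / 2 in each tail, so use the 1 - alpha/2 quantile
z_two_sided = st.norm.ppf(1 - alpha / 2)  # 0.975 quantile
z_one_sided = st.norm.ppf(confidence)     # 0.95 quantile, for comparison

print("0.975 quantile:", z_two_sided)
print("0.95 quantile :", z_one_sided)
```

The 0.975 quantile is the 1.96 used in the confidence limits, while the 0.95 quantile would leave 5% in the upper tail alone and give a two-sided interval that covers only 90%.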
Conclusion: Finding the confidence interval of the population mean