In real life scenarios we don't know the population parameter and
, in this case we have to reply on the sample statistics to estimate these population parameters. When we take a single sample and caclutate the sample mean
it is called the point estimator of the population mean
. The population parameters is always fixed, however a statistics can assume different values depending on the sample from which it is computed. If we repeat the process of selecting a simple random sample of size n from a population of size N over and over again, each time computing the values of sample mean
we will have a distribution of sample mean
. This distribution of the statistics is called the sampling distribution.
If we consider the process of selecting a simple random sample as an experiement, the sample mean is the numerical description of the outcome of the experiment. Thus, the sample mean
is a random variable and just like any other random variable it has a mean or expected value
, a standard deviation
, and a probability distribution of all possible values of the sample mean
.
A simple random sample of size n from a population of size N is a sample selected such that each member in the sample has the same probability 1/N of being selected in the sample. There are two ways of selecting a unit from a sample - with replacement and without replacement.
- Population has a normal distribution: Here, the sampling distribution of
is also normal for any sample size.
- Population does not have a normal distribution: Here, the sampling distribution of
is non-normal for small samples, but normal for large samples (
).
In the example below we will use R-programming to illustrate the effect of sampling distribution of
.
-
Example: Suppose we have been asked to develop a profile of the company's all data science managers. The characteristics to be identified include the mean annual salary for the managers. Now suppose using the details of all the managers in the personnel database we could compute the population mean annual salary
= $51,800, and population standard deviation
= $4000. We will use these population parameters as a reference to check how close our sample mean
comes close to the population mean
. Now, suppose that we don't have the necessary information on all the managers, and the mean annual salary details is not readily available to us. The question now is how we can obtain estimates of the population parameters by using a sample of managers rather than all the managers in the entire population. Suppose that a sample without replacement of 30 managers will be used, and we will repeat this experiment for 500 times. Clearly, the time and the cost of developing a profile would be substantially less for 30 managers than for the entire population. We will expect that the largest concentration of the
values and the mean of the 500
values is near to the population mean
= $51,800. As per the Central Limit Theorem the mean of all possible
values, which is referred to as the expected value of
is
=
, the expected value or mean of the sampling distribution of
is equal to the mean of the population. Hence the mean of all possible sample mean for our example is also $51800. Now, if the population has a normal distribution, the sampling distribution of
is normally distributed. If the population does not have a normal distribution, the simple random sample of 30 managers and the Central Limit Theorem enable us to conclude that the sampling distribution of
can be approximated by a normal distribution.
-
Below is the implementation of this example in R-programming:
mean_x = 0
for (i in 1:500){
# Random Sample with mu=51800 sd=4000
x <- rnorm(30, mean = 51800, sd = 4000)
mean_x[i] = mean(x)
}
length(mean_x)
summary(mean_x)
mean(mean_x)
sd(mean_x)
hist(mean_x, main="Sampling Distribution of sample mean",
xlab="Sample Means")
Let's create bins to see the distribution of all the values of the sample means.
bin_names <- c("[49,500.00-49,999.99)","[50,000.00-50,499.99)", "[50,500.00-50,999.99)",
"[51,000.00-51,499.99)", "[51,500.00-51,999.99)", "[52,000.00-52,499.99)",
"[52,500.00-52,999.99)","[53,000.00-53,499.99)", "[53,500.00-53,999.99)")
bin_distribution <- cut(mean_x, breaks=9, include.lowest=TRUE, right=FALSE,
labels=bin_names)
summary(bin_distribution)
# Plotting the Bin Distribution
install.packages("tidyverse")
library(tidyverse)
ggplot(data = as_tibble(bin_distribution), mapping = aes(x=value)) +
geom_bar(fill="bisque",color="white",alpha=0.7) +
stat_count(geom="text", aes(label=sprintf("%.4f",..count../length(bin_distribution))), vjust=-0.5) +
labs(x='Sampling Distribution of mean') +
theme_minimal()
- Because we now have identified the properties of the sampling distribution of sample mean
, we can use this distribution to answer the question like: What is the probability that the sample mean computed using a simple random sample of 30 managers will be within $500 of the population mean? i.e.
, where
=
= $51,800 and standard deviation of sample mean
is
= 730.30.
- In R-Programming we can use the below code to calculate the required probabilities under the curve between these two values:
pnorm(52300, mean=51800, sd=730.30, lower.tail = TRUE) - pnorm(51300, mean=51800, sd=730.30, lower.tail = TRUE)