Contents
1. Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats
2. sample()
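Draws a random sample from a DataFrame or Series.
n: the sample size.
replace: whether each drawn row is put back before the next draw (with replacement, True) or not (without replacement, False).
random_state: the seed for the random number generator, so the same sample is drawn on every run.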
sampled_data = education_districtwise.sample(n=50, replace=True, random_state=31208)
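A minimal sketch of how these three parameters behave. The toy DataFrame below is a hypothetical stand-in for education_districtwise, and its values are made up for illustration only.

```python
import pandas as pd

# Hypothetical toy data standing in for education_districtwise (assumption).
df = pd.DataFrame({"OVERALL_LI": [61.5, 72.3, 58.9, 80.1, 66.4]})

# The same random_state returns the same rows on every call.
first = df.sample(n=3, replace=False, random_state=42)
second = df.sample(n=3, replace=False, random_state=42)
same_rows = first.index.equals(second.index)

# Without replacement, each row can be drawn at most once.
no_repeat = first.index.is_unique

# With replacement, n may exceed the number of rows;
# drawing 8 times from 5 rows must repeat at least one row.
with_repl = df.sample(n=8, replace=True, random_state=42)
has_repeats = with_repl.index.duplicated().any()

# Without replacement, asking for more rows than exist raises ValueError.
try:
    df.sample(n=8, replace=False)
    oversample_error = False
except ValueError:
    oversample_error = True

print(same_rows, no_repeat, has_repeats, oversample_error)
```

Sampling with replacement is what the rest of these notes use, since it keeps the draws independent of one another.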
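Take the mean of one sample's values. It is stored as sampled_mean so the plot in section 4 can use it.
sampled_mean = education_districtwise['OVERALL_LI'].sample(n=50, replace=True, random_state=56810).mean()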
Draw 10,000 samples and append the mean of each sample to estimate_list.
estimate_list = []
for i in range(10000):
    estimate_list.append(education_districtwise['OVERALL_LI'].sample(n=50, replace=True).mean())
Put the sample means in a DataFrame and take their mean.
estimate_df = pd.DataFrame(data={'estimate': estimate_list})
mean_sample_means = estimate_df['estimate'].mean()
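The steps above can be tried end to end without the original dataset. This is only a sketch: the population is synthetic (uniform values between 40 and 100, an assumption), the number of samples is reduced to 1,000 to keep it quick, and each draw is seeded with its loop index so the run is reproducible.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for education_districtwise['OVERALL_LI'] (assumption).
rng = np.random.default_rng(0)
population = pd.Series(rng.uniform(40, 100, size=5000), name="OVERALL_LI")

# Draw 1,000 samples of size 50 with replacement and keep each sample's mean.
estimate_list = [
    population.sample(n=50, replace=True, random_state=i).mean()
    for i in range(1000)
]
estimate_df = pd.DataFrame(data={"estimate": estimate_list})

# The mean of the sample means should land close to the population mean.
mean_sample_means = estimate_df["estimate"].mean()
population_mean = population.mean()
print(round(mean_sample_means, 2), round(population_mean, 2))
```

The two printed values should be close to each other, which is the point of averaging many sample means: any single sample mean can miss the population mean, but their average rarely does.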
3. std()
The standard error of a statistic is the standard deviation of the sampling distribution associated with that statistic. It provides a numerical measure of sampling variability and answers the question: "How far is a statistic based on one particular sample from the typical value of the statistic?" Here, the standard deviation of the 10,000 sample means estimates the standard error of the mean.
estimate_df['estimate'].std()
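As a rough check on this definition, the standard deviation of many sample means can be compared with the textbook formula sigma / sqrt(n). The sketch below uses a synthetic normal population (an assumption, not the original data), and the names empirical_se, theoretical_se, and single_sample_se are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic population (assumption): mean 70, standard deviation 10.
rng = np.random.default_rng(1)
population = pd.Series(rng.normal(loc=70, scale=10, size=5000))
n = 50

# Empirical standard error: standard deviation of many sample means.
sample_means = pd.Series(
    [population.sample(n=n, replace=True, random_state=i).mean() for i in range(2000)]
)
empirical_se = sample_means.std()

# Theoretical standard error when sampling with replacement: sigma / sqrt(n),
# where sigma is the population standard deviation (ddof=0).
theoretical_se = population.std(ddof=0) / np.sqrt(n)

# In practice only one sample is available, so sigma is replaced
# by that sample's own standard deviation.
one_sample = population.sample(n=n, replace=True, random_state=0)
single_sample_se = one_sample.std() / np.sqrt(n)

print(round(empirical_se, 3), round(theoretical_se, 3), round(single_sample_se, 3))
```

The first two values should agree closely. The single-sample estimate is noisier, because it depends on one sample's standard deviation.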
4. Visualization
1) hist()
plt.hist(estimate_df['estimate'], bins=25, density=True, alpha=0.4, label = "histogram of sample means of 10000 random samples")
2) plot()
standard_error=estimate_df['estimate'].std()
population_mean = education_districtwise['OVERALL_LI'].mean()  # same column the sample means were drawn from
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100) # generate a grid of 100 values from xmin to xmax.
p = stats.norm.pdf(x, population_mean, standard_error)
plt.plot(x, p, 'k', linewidth=2, label = 'normal curve from central limit theorem')
3) axvline()
plt.axvline(x=sampled_mean, color='r', linestyle='--', label='sample mean of the first random sample')
plt.axvline(x=population_mean, color='y', linestyle='solid', label='population mean')
plt.axvline(x=mean_sample_means, color='g', linestyle=':', label='mean of sample means of 10000 random samples')
4) legend()
plt.legend(bbox_to_anchor=(1, 1))
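The four plotting calls above can be combined into one figure. The following is a sketch under stated assumptions: the population is synthetic, the number of samples is reduced to 1,000, and the non-interactive "Agg" backend is selected only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the population column (assumption).
rng = np.random.default_rng(2)
population = pd.Series(rng.uniform(40, 100, size=5000))
n = 50

# Mean of one sample, then the means of 1,000 samples.
sampled_mean = population.sample(n=n, replace=True, random_state=0).mean()
estimate_df = pd.DataFrame(data={"estimate": [
    population.sample(n=n, replace=True, random_state=i).mean() for i in range(1000)
]})

population_mean = population.mean()
mean_sample_means = estimate_df["estimate"].mean()
standard_error = estimate_df["estimate"].std()

plt.figure()
# 1) Histogram of the sample means, scaled to a density.
plt.hist(estimate_df["estimate"], bins=25, density=True, alpha=0.4,
         label="histogram of sample means")
# 2) Normal curve centered on the population mean with the standard error as its spread.
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = stats.norm.pdf(x, population_mean, standard_error)
plt.plot(x, p, "k", linewidth=2, label="normal curve from central limit theorem")
# 3) Vertical reference lines.
plt.axvline(x=sampled_mean, color="r", linestyle="--", label="mean of one sample")
plt.axvline(x=population_mean, color="y", linestyle="solid", label="population mean")
plt.axvline(x=mean_sample_means, color="g", linestyle=":", label="mean of sample means")
# 4) Legend anchored at the upper-right corner of the axes.
plt.legend(bbox_to_anchor=(1, 1))
```

To view the result, call plt.show() with an interactive backend or plt.savefig() with a file name of your choice. The histogram should roughly follow the black curve, and the yellow and green lines should nearly overlap.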