[Sampling] Python, statistics libraries, sample(), std(), hist(), axvline(), plot(), legend()



Olivia-BlackCherry 2023. 7. 18. 09:30

1. Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

2. sample()

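Draws a random sample.

n is the sample size.

replace controls whether each drawn row is put back before the next draw (with replacement, True) or not (without replacement, False).

random_state is the seed for the random number generator; reusing the same value reproduces the same sample.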

sampled_data = education_districtwise.sample(n=50, replace=True, random_state=31208)
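A minimal sketch of how these three parameters behave. It assumes a small made-up DataFrame (`toy`, with an `OVERALL_LI` column) in place of `education_districtwise`, which is not reproduced in this post; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for education_districtwise (5 made-up rows).
toy = pd.DataFrame({"OVERALL_LI": [61.5, 70.2, 58.9, 81.4, 66.0]})

# Same random_state -> the same rows are drawn every time.
first = toy.sample(n=3, replace=False, random_state=0)
second = toy.sample(n=3, replace=False, random_state=0)
same_draw = first.equals(second)

# Without replacement, no row can appear twice.
no_repeats = first.index.is_unique

# With replacement, n may exceed the number of rows; repeats are then unavoidable.
oversized = toy.sample(n=8, replace=True, random_state=0)
has_repeats = bool(oversized.index.duplicated().any())

# Without replacement, asking for more rows than exist raises ValueError.
try:
    toy.sample(n=8, replace=False)
    raised = False
except ValueError:
    raised = True
```

With replace=True a sample can be larger than the data itself, which is why the 50-row samples in this post work even on small tables.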

Take the mean of the values in the drawn sample.

education_districtwise['OVERALL_LI'].sample(n=50, replace=True, random_state=56810).mean()

Draw 10,000 samples and append the mean of each one to estimate_list.

estimate_list = []
for i in range(10000):
    estimate_list.append(education_districtwise['OVERALL_LI'].sample(n=50, replace=True).mean())

Build a DataFrame from the list and take the mean of the sample means.

estimate_df = pd.DataFrame(data={'estimate': estimate_list})
mean_sample_means = estimate_df['estimate'].mean()
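As a rough check of what this step produces, the sketch below repeats it on a made-up population and compares the mean of the sample means with the population mean. It is a vectorized NumPy variant, not the loop used above, and the population values (`loc=73.0`, `scale=10.0`, 600 rows) are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical population standing in for education_districtwise['OVERALL_LI'].
population = pd.Series(rng.normal(loc=73.0, scale=10.0, size=600))

# Draw 10,000 samples of size 50 with replacement in one call;
# each row of `draws` is one sample.
draws = rng.choice(population.to_numpy(), size=(10_000, 50), replace=True)
estimate_df = pd.DataFrame({"estimate": draws.mean(axis=1)})

mean_sample_means = estimate_df["estimate"].mean()
population_mean = population.mean()

# The two values should be very close to each other.
gap = abs(mean_sample_means - population_mean)
```

The gap shrinks as the number of samples grows, which is the behaviour the 10,000-sample loop relies on.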

3. std()

The standard error of a statistic is the standard deviation of its sampling distribution. It gives a numerical measure of sampling variability and answers the question: "How far is a statistic based on one particular sample from the typical value of that statistic?" Here, the standard deviation of the 10,000 sample means estimates the standard error of the mean.

estimate_df['estimate'].std()
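For comparison, the standard error of the mean can also be computed from the formula sigma / sqrt(n). The sketch below, on a made-up population (not the data used in this post), puts the empirical value next to the formula and next to the single-sample estimate s / sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population (invented values).
population = rng.normal(loc=73.0, scale=10.0, size=600)
n = 50

# Empirical standard error: spread of 10,000 sample means.
sample_means = rng.choice(population, size=(10_000, n), replace=True).mean(axis=1)
empirical_se = sample_means.std(ddof=1)

# Formula for sampling with replacement: population sigma / sqrt(n).
theoretical_se = population.std(ddof=0) / np.sqrt(n)

# Estimate from one sample only: sample s / sqrt(n).
one_sample = rng.choice(population, size=n, replace=True)
single_sample_se = one_sample.std(ddof=1) / np.sqrt(n)
```

In practice only one sample is usually available, so the single-sample estimate is the one that gets used; the simulation just shows that it targets the same quantity.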

4. Visualization

1) hist()

plt.hist(estimate_df['estimate'], bins=25, density=True, alpha=0.4, label = "histogram of sample means of 10000 random samples")

2) plot()

standard_error = estimate_df['estimate'].std()
population_mean = education_districtwise['OVERALL_LI'].mean()  # same column the samples were drawn from

xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100) # generate a grid of 100 values from xmin to xmax.
p = stats.norm.pdf(x, population_mean, standard_error)
plt.plot(x, p, 'k', linewidth=2, label = 'normal curve from central limit theorem')
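The second and third positional arguments of stats.norm.pdf are loc (the mean) and scale (the standard deviation). A small sketch with made-up numbers (73.0 and 1.4 are assumptions, not values from the data) shows what that curve encodes:

```python
import numpy as np
from scipy import stats

loc, scale = 73.0, 1.4  # hypothetical population mean and standard error

# Positional form, as in the plotting code: pdf(x, loc, scale).
peak_positional = stats.norm.pdf(loc, loc, scale)
# Equivalent keyword form.
peak_keyword = stats.norm.pdf(loc, loc=loc, scale=scale)
# Closed-form height of a normal density at its mean: 1 / (scale * sqrt(2*pi)).
peak_formula = 1.0 / (scale * np.sqrt(2.0 * np.pi))

# Probability mass within two standard errors of the mean.
mass_within_2se = (stats.norm.cdf(loc + 2 * scale, loc, scale)
                   - stats.norm.cdf(loc - 2 * scale, loc, scale))
```

Because the histogram is drawn with density=True, its bars are on the same scale as this density, so the curve and the histogram can share one axis.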

3) axvline()

sampled_mean = sampled_data['OVERALL_LI'].mean()  # mean of the first random sample
plt.axvline(x=sampled_mean, color='r', linestyle='--', label='sample mean of the first random sample')
plt.axvline(x=population_mean, color='y', linestyle='solid', label='population mean')
plt.axvline(x=mean_sample_means, color='g', linestyle=':', label='mean of sample means of 10000 random samples')

4) legend()

plt.legend(bbox_to_anchor=(1, 1))
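The four plotting steps can be combined into one figure. The following is a self-contained sketch on made-up data; the population values, the shortened labels, and the off-screen "Agg" backend are assumptions for illustration, not part of the original notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
population = rng.normal(73.0, 10.0, size=600)  # hypothetical data
n = 50

# 10,000 sample means, each from a sample of size n drawn with replacement.
sample_means = rng.choice(population, size=(10_000, n), replace=True).mean(axis=1)
sampled_mean = sample_means[0]          # mean of the first random sample
mean_sample_means = sample_means.mean()
population_mean = population.mean()
standard_error = sample_means.std(ddof=1)

fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(sample_means, bins=25, density=True, alpha=0.4,
        label="histogram of sample means")

# Normal curve centred on the population mean, with the standard error as its spread.
xmin, xmax = ax.get_xlim()
x = np.linspace(xmin, xmax, 100)
ax.plot(x, stats.norm.pdf(x, population_mean, standard_error), "k",
        linewidth=2, label="normal curve")

ax.axvline(sampled_mean, color="r", linestyle="--", label="first sample mean")
ax.axvline(population_mean, color="y", linestyle="solid", label="population mean")
ax.axvline(mean_sample_means, color="g", linestyle=":", label="mean of sample means")

# Anchor the legend at the upper-right corner of the axes.
legend = ax.legend(bbox_to_anchor=(1, 1))
fig.tight_layout()
```

The yellow and green lines should nearly overlap, while the red line (a single sample) can sit noticeably off-centre.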

