
[Sampling] Python, statistics libraries, sample(), std(), hist(), axvline(), plot(), legend()

Olivia-BlackCherry 2023. 7. 18. 09:30


    1. Libraries

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import statsmodels.api as sm
    from scipy import stats

     

     

    2. sample()

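    sample() draws a random sample of rows from a DataFrame or Series.

    n is the sample size.

    replace controls whether each drawn row is put back before the next draw (with replacement, replace=True) or not (without replacement, replace=False).

    random_state is the random seed; fixing it makes the draw reproducible.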

    sampled_data = education_districtwise.sample(n=50, replace=True, random_state=31208)
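To make the three parameters concrete, here is a minimal sketch on a tiny made-up table. The `toy` DataFrame and its values are assumptions for illustration only; they stand in for `education_districtwise`, which is not included in this post.

```python
import pandas as pd

# Hypothetical stand-in for the education_districtwise table used in this post.
toy = pd.DataFrame({"OVERALL_LI": [55.0, 61.5, 70.2, 48.9, 66.3]})

# Same seed, same rows: random_state makes the draw reproducible.
first = toy.sample(n=3, replace=False, random_state=0)
second = toy.sample(n=3, replace=False, random_state=0)
same_draw = first.equals(second)

# With replacement, n may exceed the number of rows, so some rows must repeat.
oversample = toy.sample(n=8, replace=True, random_state=0)

# Without replacement, asking for more rows than exist raises ValueError.
try:
    toy.sample(n=8, replace=False, random_state=0)
    raised = False
except ValueError:
    raised = True
```

Sampling with replacement is what the rest of this post relies on, since each of the 50 draws is then independent of the others.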

     

    Compute the mean of the sampled values.

    education_districtwise['OVERALL_LI'].sample(n=50, replace=True, random_state=56810).mean()

     

    Draw 10,000 samples and append the mean of each one to estimate_list.

    estimate_list = []
    for i in range(10000):
        estimate_list.append(education_districtwise['OVERALL_LI'].sample(n=50, replace=True).mean())
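The loop above is not seeded, so its results change from run to run, and 10,000 separate `sample()` calls can be slow. As one possible alternative (not the method used in the course material), the sketch below draws all samples at once with a seeded NumPy generator. The `population` Series is synthetic and only stands in for `education_districtwise['OVERALL_LI']`; the seed, mean, and spread are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed, chosen only for reproducibility

# Hypothetical population standing in for education_districtwise['OVERALL_LI'].
population = pd.Series(rng.normal(loc=73.0, scale=10.0, size=600), name="OVERALL_LI")

n_samples, sample_size = 10_000, 50

# Draw row positions with replacement: one row of 50 positions per sample.
positions = rng.integers(0, len(population), size=(n_samples, sample_size))

# Look up the values and average across each row to get 10,000 sample means.
estimate_list = population.to_numpy()[positions].mean(axis=1).tolist()
```

Because every position is drawn independently and uniformly, each row is equivalent to one `sample(n=50, replace=True)` call, and the mean of `estimate_list` should land very close to `population.mean()`.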

     

    Put the estimates in a DataFrame and take the mean of the sample means.

    estimate_df = pd.DataFrame(data={'estimate': estimate_list})
    mean_sample_means = estimate_df['estimate'].mean()

     

     

     

    3. std()

    The standard error of a statistic is the standard deviation of its sampling distribution. It gives a numerical measure of sampling variability and answers the question: "How far is a statistic based on one particular sample from the typical value of that statistic?" Here, the standard deviation of the 10,000 sample means serves as an estimate of the standard error of the mean.

    estimate_df['estimate'].std()
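As a sanity check on this idea, the sketch below compares the empirical standard error (the spread of many sample means) with the textbook value, population standard deviation divided by the square root of n, and with the estimate available from a single sample. The population here is synthetic and its parameters are assumptions, not the post's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

# Hypothetical population; stands in for the OVERALL_LI column.
population = rng.normal(loc=73.0, scale=10.0, size=600)
sample_size = 50

# Empirical standard error: std of 10,000 sample means (drawn with replacement).
positions = rng.integers(0, population.size, size=(10_000, sample_size))
sample_means = population[positions].mean(axis=1)
empirical_se = sample_means.std(ddof=1)  # ddof=1 matches the pandas .std() default

# Theoretical standard error: population std divided by sqrt(n).
theoretical_se = population.std(ddof=0) / np.sqrt(sample_size)

# In practice only one sample is available, so its own std is used instead.
one_sample = population[positions[0]]
estimated_se = one_sample.std(ddof=1) / np.sqrt(sample_size)
```

The first two values should agree closely; the single-sample estimate is noisier but usually in the same range.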

     

     

    4. Visualization

    Source: Google

    1) hist()

    plt.hist(estimate_df['estimate'], bins=25, density=True, alpha=0.4, label = "histogram of sample means of 10000 random samples")

     

    2) plot()

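    standard_error = estimate_df['estimate'].std()  # spread of the 10,000 sample means
    population_mean = education_districtwise['OVERALL_LI'].mean()  # same column the samples were drawn from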
    
    xmin, xmax = plt.xlim()
    x = np.linspace(xmin, xmax, 100) # generate a grid of 100 values from xmin to xmax.
    p = stats.norm.pdf(x, population_mean, standard_error)
    plt.plot(x, p, 'k', linewidth=2, label = 'normal curve from central limit theorem')

    3) axvline()

    sampled_mean = sampled_data['OVERALL_LI'].mean()  # mean of the first random sample from section 2
    plt.axvline(x=sampled_mean, color='r', linestyle='--', label='sample mean of the first random sample')
    plt.axvline(x=population_mean, color='y', linestyle='solid', label='population mean')
    plt.axvline(x=mean_sample_means, color='g', linestyle=':', label='mean of sample means of 10000 random samples')

     

     

    4) legend()

    plt.legend(bbox_to_anchor=(1, 1))
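Putting the four calls together, the following is a self-contained sketch of the whole figure. It assumes a synthetic population in place of the post's dataset, uses the object-oriented `ax` interface rather than the `plt` calls above, and selects the headless `Agg` backend so it can run without a display; all of these are choices made for this example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed

# Hypothetical population and 10,000 sample means of size-50 samples.
population = rng.normal(73.0, 10.0, size=600)
positions = rng.integers(0, population.size, size=(10_000, 50))
sample_means = population[positions].mean(axis=1)

population_mean = population.mean()
standard_error = sample_means.std(ddof=1)
first_sample_mean = sample_means[0]
mean_sample_means = sample_means.mean()

fig, ax = plt.subplots(figsize=(8, 5))

# 1) Histogram of the sample means, scaled to a density.
ax.hist(sample_means, bins=25, density=True, alpha=0.4,
        label="histogram of sample means of 10000 random samples")

# 2) Normal curve centred on the population mean with the standard error as spread.
xmin, xmax = ax.get_xlim()
x = np.linspace(xmin, xmax, 100)
ax.plot(x, stats.norm.pdf(x, population_mean, standard_error), "k",
        linewidth=2, label="normal curve from central limit theorem")

# 3) Vertical reference lines.
ax.axvline(x=first_sample_mean, color="r", linestyle="--",
           label="sample mean of the first random sample")
ax.axvline(x=population_mean, color="y", linestyle="solid", label="population mean")
ax.axvline(x=mean_sample_means, color="g", linestyle=":",
           label="mean of sample means of 10000 random samples")

# 4) Legend anchored at the upper-right corner of the axes, drawn outside them.
legend = ax.legend(bbox_to_anchor=(1, 1), loc="upper left")
```

In a notebook, drop the `matplotlib.use("Agg")` line and call `plt.show()`; in a script, `fig.savefig(...)` with `bbox_inches="tight"` keeps the outside legend from being cropped.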

     

     
