exploratory data analysis(EDA), grouping data, correlation, p-value, correlation sufficient, association between two categorical variable, chi-square


Olivia-BlackCherry 2023. 5. 4. 10:57

Exploratory data analysis (EDA) is a preliminary step in data analysis, used to

-summarize the main characteristics of the data, gain a better understanding of the data set, uncover relationships between variables, and extract important variables

1. descriptive statistics

describe basic features of data.

-describe()

-df.describe(include=['object'])

-value_counts()

df['drive-wheels'].value_counts()

drive_wheels_counts = df['drive-wheels'].value_counts().to_frame()

drive_wheels_counts.index.name = 'drive-wheels'
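The calls above can be tried on a small table. This is a minimal sketch: df_toy and its values are made up for illustration and are not the course's car data set, and the column name value_counts is just a label chosen here.

```python
import pandas as pd

# Hypothetical toy data standing in for the car data set
df_toy = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "fwd", "4wd"],
    "price": [10000.0, 8000.0, 20000.0, 9000.0, 12000.0],
})

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns
numeric_summary = df_toy.describe()

# For a text column, describe() reports count, unique, top and freq instead
category_summary = df_toy["drive-wheels"].describe()

# Frequency of each category, as a DataFrame with an explicit column name
counts = df_toy["drive-wheels"].value_counts().to_frame(name="value_counts")
counts.index.name = "drive-wheels"

print(numeric_summary)
print(category_summary)
print(counts)
```

Passing name= to to_frame() keeps the column label the same regardless of which pandas version is installed.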

descriptive statistics

-box plots

import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(x="drive-wheels", y="price", data=df)

-scatter plot

sns.regplot(x="engine-size", y="price", data=df)
plt.ylim(0,)

2. grouping data

-groupby()

df['drive-wheels'].unique()

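# keep only the grouping key and the numeric column; in pandas 2.x, mean() raises a TypeError on the text column 'body-style'
df_group_one = df[['drive-wheels','price']]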

# grouping results
df_group_one = df_group_one.groupby(['drive-wheels'],as_index=False).mean()
df_group_one

# grouping results
df_gptest = df[['drive-wheels','body-style','price']]
grouped_test1 = df_gptest.groupby(['drive-wheels','body-style'],as_index=False).mean()
grouped_test1

-pivot()

grouped_pivot = grouped_test1.pivot(index='drive-wheels',columns='body-style')
grouped_pivot = grouped_pivot.fillna(0) #fill missing values with 0
grouped_pivot
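The groupby-then-pivot steps above can be checked on a small example. This is a sketch under assumptions: df_toy and its prices are invented for illustration, not taken from the course data.

```python
import pandas as pd

# Hypothetical toy data standing in for the car data set
df_toy = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "rwd", "4wd"],
    "body-style": ["sedan", "hatchback", "sedan", "sedan", "wagon"],
    "price": [10000.0, 8000.0, 20000.0, 30000.0, 12000.0],
})

# Average price for every (drive-wheels, body-style) combination
grouped = df_toy.groupby(["drive-wheels", "body-style"], as_index=False).mean()

# Reshape: one row per drive-wheels value, one column per body-style
pivot = grouped.pivot(index="drive-wheels", columns="body-style")

# Combinations that never occur show up as NaN; replace them with 0
pivot = pivot.fillna(0)
print(pivot)
```

Filling with 0 is convenient for plotting, but a 0 here means "no cars in this cell", not "average price of zero".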

>> shown as a heatmap

import matplotlib.pyplot as plt
%matplotlib inline

#use the grouped results
plt.pcolor(grouped_pivot, cmap='RdBu')
plt.colorbar()
plt.show()

- with axis labels, so the plot carries more information

import numpy as np

fig, ax = plt.subplots()
im = ax.pcolor(grouped_pivot, cmap='RdBu')

#label names
row_labels = grouped_pivot.columns.levels[1]
col_labels = grouped_pivot.index

#move ticks and labels to the center
ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)

#insert labels
ax.set_xticklabels(row_labels, minor=False)
ax.set_yticklabels(col_labels, minor=False)

#rotate label if too long
plt.xticks(rotation=90)

fig.colorbar(im)
plt.show()

3. correlation

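-corr(): correlation analysis

corr() is a method that pandas provides on DataFrame and Series objects for computing correlation. Correlation is a measure of the relationship between two variables; it quantifies how strongly the two are linearly related.

By default, corr() computes the Pearson correlation coefficient. It can also compute the Spearman and Kendall correlation coefficients through its method parameter.

To use corr(), first put the data in a DataFrame or Series, then call corr() on it. On a DataFrame it returns the full correlation matrix; Series.corr() takes the other Series as its argument.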
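As a small illustration of corr() and its method parameter, here is a sketch on made-up numbers (df_toy and its values are assumptions for the example, not course data):

```python
import pandas as pd

# Hypothetical toy data: price rises with engine size, mpg falls
df_toy = pd.DataFrame({
    "engine-size": [90, 110, 130, 150, 200],
    "price": [7000, 9500, 12000, 15000, 30000],
    "highway-mpg": [40, 36, 33, 28, 20],
})

# Pearson (default): strength of the linear relationship
pearson = df_toy.corr()

# Spearman: rank-based, so it measures any monotonic relationship
spearman = df_toy.corr(method="spearman")

# Correlation between two individual columns
r_engine_price = df_toy["engine-size"].corr(df_toy["price"])

print(pearson)
print(spearman)
print(r_engine_price)
```

Because price increases every time engine size does, the Spearman coefficient for that pair is 1 even though the relationship is not perfectly linear.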

-measures to what extent different variables are interdependent.

ex) smoking and lung cancer, rain and umbrella use

- positive/negative linear relationship or weak correlation

-regplot()

4. correlation -statistics

1) pearson correlation

from scipy import stats

pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P =", p_value)

Pearson correlation is one way to measure the linear relationship between two variables. It returns a correlation coefficient between -1 and 1.

The Pearson correlation of two variables x and y is defined as
r = Σ((xi - x_mean) * (yi - y_mean)) / (sqrt(Σ(xi - x_mean)^2) * sqrt(Σ(yi - y_mean)^2))

Here x_mean and y_mean are the means of x and y, and xi and yi are the individual observations. A positive r indicates a positive correlation and a negative r indicates a negative correlation. r = 0 means there is no linear relationship between the two variables.

close to 1: large positive relationship

close to -1: large negative relationship

close to 0 : no linear relationship

Pearson correlation measures only linear relationships, so when the relationship is non-linear a different coefficient should be used. Correlation also does not imply causation; establishing a causal relationship between two variables requires further analysis.
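The formula above can be written out directly and compared with stats.pearsonr. This is a minimal sketch: the helper name pearson_r and the sample arrays are made up for illustration.

```python
import numpy as np
from scipy import stats

def pearson_r(x, y):
    """Pearson r computed term by term from the definition (illustrative helper)."""
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / (np.sqrt((dx ** 2).sum()) * np.sqrt((dy ** 2).sum()))

# Roughly linear data: r should be close to 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
r_manual = pearson_r(x, y)
r_scipy, p_value = stats.pearsonr(x, y)

# Perfect but non-linear dependence (y = x^2 on symmetric x): r is 0
x_sym = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
r_nonlinear = pearson_r(x_sym, x_sym ** 2)

print(r_manual, r_scipy, p_value)
print(r_nonlinear)
```

The second case shows why r near 0 only rules out a linear relationship: y is fully determined by x there, yet the coefficient is zero.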

2) P-value

The p-value is a statistic used in hypothesis testing. Hypothesis testing is a way of checking statistically whether a claim (a hypothesis) is supported by the data.

The p-value is the probability, assuming the null hypothesis is true, of observing data at least as extreme as what was actually observed. The null hypothesis is the default assumption, typically "no effect" or "no difference"; it is the opposite of the hypothesis you want to support, the alternative hypothesis.

In a hypothesis test, this probability is computed under the assumption that the null hypothesis is true. If it falls below a chosen threshold (usually 0.05), the null hypothesis is rejected in favor of the alternative. A small p-value is strong evidence against the null hypothesis.

For example, to test whether the difference between two groups is significant, set the null hypothesis "the two group means are equal" against the alternative "the two group means differ", and compute the p-value. If the p-value is below the threshold, reject the null hypothesis and conclude that the difference between the groups is significant.

The p-value is one of the central concepts in statistics and appears in many settings besides simple comparisons, for example in regression analysis.

In short, a hypothesis test starts from the assumption that "nothing happened" and checks whether the data contradict it. A small p-value means the observed data would be unlikely if the null hypothesis were true. Note that the p-value is not the probability that the null hypothesis is true; it is the probability of data this extreme given that the null hypothesis is true.

P-value < 0.0001: strong certainty in the result

< 0.05: moderate certainty in the result

< 0.1: weak certainty in the result

> 0.1: no certainty in the result
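The two-group example described above can be sketched with a two-sample t-test. Everything here is an assumption for illustration: the groups are synthetic random draws, and scipy's stats.ttest_ind is one common choice of test, not something the course prescribes.

```python
import numpy as np
from scipy import stats

# Synthetic data: two groups whose true means differ by one standard deviation
rng = np.random.default_rng(0)
group_a = rng.normal(loc=50.0, scale=5.0, size=100)
group_b = rng.normal(loc=55.0, scale=5.0, size=100)

# Null hypothesis: both groups have the same mean
t_stat, p_value = stats.ttest_ind(group_a, group_b)

alpha = 0.05  # significance threshold chosen before looking at the data
reject_null = p_value < alpha

print("t =", t_stat, "p =", p_value)
print("reject null hypothesis:", reject_null)
```

With a true difference this large and 100 samples per group, the p-value comes out far below 0.05, so the null hypothesis of equal means is rejected.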

3) heatmap

5. Chi-square

association between two categorical variables

The chi-square test is a statistical method for testing whether two (or more) categorical variables are associated.
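A chi-square test of association can be sketched with pd.crosstab and scipy's stats.chi2_contingency. The data below are made up for illustration; the column names fuel-type and aspiration are assumptions chosen to resemble a car data set.

```python
import pandas as pd
from scipy import stats

# Hypothetical categorical data: 60 cars with fuel type and aspiration
df_toy = pd.DataFrame({
    "fuel-type": ["gas"] * 40 + ["diesel"] * 20,
    "aspiration": ["std"] * 30 + ["turbo"] * 10 + ["std"] * 5 + ["turbo"] * 15,
})

# Contingency table: observed counts for each pair of categories
table = pd.crosstab(df_toy["fuel-type"], df_toy["aspiration"])

# Null hypothesis: the two variables are independent
chi2, p_value, dof, expected = stats.chi2_contingency(table)

print(table)
print("chi2 =", chi2, "p =", p_value, "dof =", dof)
print(expected)
```

expected holds the counts that independence would predict; the test statistic grows as the observed table moves away from them. For a 2x2 table, chi2_contingency applies Yates' continuity correction by default.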

License: Attribution, NonCommercial, NoDerivatives
