Learning pandas, Part 4: Data Preprocessing: Upsampling, Outliers, Correlation, and Reshaping


Olivia-BlackCherry 2024. 11. 8. 13:39

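This post covers four preprocessing topics: upsampling, outlier detection, correlation analysis, and reshaping data so it can be fed into and read back from machine learning models.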

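1. Outliers

1) Box plot

2) quantile
25th percentile: quantile(0.25)
75th percentile: quantile(0.75)

# IQR = 75th percentile - 25th percentile
# Compute the limit from the same column that gets capped below.
percentile25 = data.video_like_count.quantile(0.25)
percentile75 = data.video_like_count.quantile(0.75)
iqr = percentile75 - percentile25

# upper limit = 75th percentile + 1.5 * IQR
up_limit = percentile75 + 1.5 * iqr

# Cap outliers: values above the upper limit are replaced with the limit
data.loc[data.video_like_count > up_limit, 'video_like_count'] = up_limit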
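
The IQR rule above can be checked by hand on a tiny dataset. The sketch below uses made-up numbers (the DataFrame is hypothetical, only the column name mirrors the one in this post) and uses clip(upper=...) as an alternative way to apply the same cap.

```python
import pandas as pd

# Hypothetical toy data; 500 is the obvious outlier.
data = pd.DataFrame({"video_like_count": [10, 12, 14, 16, 18, 20, 22, 500]})

q1 = data["video_like_count"].quantile(0.25)
q3 = data["video_like_count"].quantile(0.75)
iqr = q3 - q1
up_limit = q3 + 1.5 * iqr

# Count the rows above the limit before capping them.
n_outliers = int((data["video_like_count"] > up_limit).sum())

# clip(upper=...) replaces every value above up_limit with up_limit.
data["video_like_count"] = data["video_like_count"].clip(upper=up_limit)

print(up_limit, n_outliers)
print(data["video_like_count"].tolist())
```

Only the upper side is capped here, as in the notes above. A lower limit would be the 25th percentile minus 1.5 * IQR.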

2. upsampling

To address class imbalance, upsample the minority class until it has as many rows as the majority class.
1) Check the target distribution

value_counts(normalize=True)

data.verified_status.value_counts(normalize=True)

verified_status
not verified    0.93712
verified        0.06288

2) Split by class

# 'not verified' is about 94% of the rows, so it is the majority class
majority = data[data['verified_status'] == 'not verified']
minority = data[data['verified_status'] == 'verified']

3) resample

from sklearn.utils import resample
data_minority_upsampled = resample(minority,                 # rows to sample from
                                   replace=True,             # sample with replacement
                                   n_samples=len(majority),  # match the majority class size
                                   random_state=0)           # reproducible result

4) concat

data_upsampled = pd.concat([majority, data_minority_upsampled], axis=0).reset_index(drop=True)
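
Steps 1) to 4) can be run end to end on a small example. This is a minimal sketch on hypothetical toy data with a 9:1 imbalance; only the column name and labels are taken from this post.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy data: 9 'not verified' rows and 1 'verified' row.
data = pd.DataFrame({
    "verified_status": ["not verified"] * 9 + ["verified"] * 1,
    "value": range(10),
})

# The larger class is the majority, the smaller one the minority.
majority = data[data["verified_status"] == "not verified"]
minority = data[data["verified_status"] == "verified"]

# Sample the minority class with replacement up to the majority size.
minority_upsampled = resample(
    minority,
    replace=True,
    n_samples=len(majority),
    random_state=0,
)

data_upsampled = pd.concat([majority, minority_upsampled], axis=0).reset_index(drop=True)
counts = data_upsampled["verified_status"].value_counts()
print(counts)
```

Both classes end up with the same number of rows. Because the upsampled rows are duplicates, it is usually safer to upsample only the training portion after splitting, so that copies of the same row do not land in both train and test.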

3. Correlation

corr(numeric_only=True)

★ numeric_only=True computes correlations over the numeric columns only; non-numeric columns are left out.
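
A small sketch of what numeric_only=True does, using a hypothetical DataFrame with one string column. In recent pandas versions (2.0 and later, as far as I know), calling corr() without it on a frame like this raises an error instead of silently skipping the string column.

```python
import pandas as pd

# Hypothetical data: two numeric columns and one string column.
df = pd.DataFrame({
    "views": [100, 200, 300, 400],
    "likes": [10, 20, 30, 40],
    "status": ["a", "b", "a", "b"],
})

# Only 'views' and 'likes' appear in the correlation matrix.
corr = df.corr(numeric_only=True)
print(corr)
```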

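4. One-hot encoding

1) get_dummies()
★ drop_first=True
Drops the dummy column for the first category of each encoded feature.
If a feature with three categories a, b, c is expanded into one dummy variable per category, knowing any two of the three values determines the third, so the dummy columns are perfectly collinear.
Setting drop_first=True avoids this dummy variable trap and so reduces multicollinearity.
It also reduces the total number of variables, which lowers model complexity and makes the results easier to interpret.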

oh_x = pd.get_dummies(x, columns = ['claim_status', 'author_ban_status'], drop_first=True)
oh_x.head()
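
To see which columns drop_first=True removes, compare the output with and without it. The three-row DataFrame below is hypothetical; the column names follow the ones used in this post.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns and one numeric column.
x = pd.DataFrame({
    "claim_status": ["claim", "opinion", "claim"],
    "author_ban_status": ["active", "banned", "under review"],
    "views": [1, 2, 3],
})

cols = ["claim_status", "author_ban_status"]

# One dummy column per category.
oh_full = pd.get_dummies(x, columns=cols)

# The first category of each feature ('claim', 'active') is dropped.
oh_x = pd.get_dummies(x, columns=cols, drop_first=True)

print(list(oh_full.columns))
print(list(oh_x.columns))
```

The dropped categories become the baseline: a row with zeros (or False) in all remaining dummy columns of a feature belongs to the dropped category.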

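2) OneHotEncoder()
★ drop='first'
★ sparse_output=False
sparse_output: when False, the encoder returns a dense NumPy array instead of a sparse matrix.
A sparse matrix is memory-efficient but harder to manipulate directly.
A dense array stores every value, zeros included, so it is easier to convert to a DataFrame and work with.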

from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder(drop='first', sparse_output=False)
X_encoder = onehot.fit_transform(x[['claim_status', 'author_ban_status']])
X_encoder

★ get_feature_names_out()
Generates the names of the encoded output columns automatically.

X_encoder_df = pd.DataFrame(X_encoder, columns=onehot.get_feature_names_out())
X_encoder_df.head()

5. Reshaping (1D -> 2D)

OneHotEncoder expects 2D input when fitting.
At this point y is still 1D.

1) to_frame()

reshape_y = y.to_frame()
reshape_y

2) reshape(-1,1)
The number of rows is inferred automatically and the number of columns is set to 1, so the result has shape (n_samples, 1).

reshape_y = y.values.reshape(-1,1)
reshape_y

array([['not verified'],
       ['not verified'],
       ['not verified'],
       ...,
       ['verified'],
       ['verified'],
       ['verified']], dtype=object)
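
The two options differ in what they return. A small sketch on a hypothetical three-element Series: to_frame() gives a DataFrame that keeps the index and name, while values.reshape(-1, 1) gives a plain NumPy array.

```python
import pandas as pd

# Hypothetical 1D target.
y = pd.Series(["not verified", "verified", "not verified"], name="verified_status")

as_frame = y.to_frame()             # DataFrame, keeps index and column name
as_array = y.values.reshape(-1, 1)  # NumPy array, no index or name

print(y.shape, as_frame.shape, as_array.shape)
```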

6. Reshaping (2D -> 1D)

OneHotEncoder takes 2D input,
and its output is also 2D.

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(drop='first', sparse_output= False)
y_encoder = ohe.fit_transform(reshape_y)
y_encoder

array([[0.],
       [0.],
       [0.],
       ...,
       [1.],
       [1.],
       [1.]])

★ ravel()
Flattens a 2D array to 1D so the data is easier to use when training other models later.
Most machine learning models expect the target variable y as a 1D array.

By definition, the target y holds one value per sample.
A 2D y can be read as several values per sample, which is usually not what is intended.
e.g. for a single target, models such as LogisticRegression and RandomForestClassifier expect y as a 1D array with one value per sample.

y_encoder=y_encoder.ravel()
y_encoder

array([0., 0., 0., ..., 1., 1., 1.])
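
A quick shape check makes the change visible. The array below is a hypothetical stand-in for the encoder output; reshape(-1, 1) goes back the other way if a 2D column is needed again.

```python
import numpy as np

# Hypothetical stand-in for the 2D encoder output, shape (3, 1).
y_2d = np.array([[0.0], [0.0], [1.0]])

y_1d = y_2d.ravel()            # shape (3,)
y_back = y_1d.reshape(-1, 1)   # shape (3, 1) again

print(y_2d.shape, y_1d.shape, y_back.shape)
```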

7. train_test_split

1) train, test
Pass x first, then y.
The return order is x_train, x_test, y_train, y_test.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((11450, 10), (7634, 10), (11450,), (7634,))

2) train, validation
x_tr, x_val, y_tr, y_val

x_tr, x_val, y_tr, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=1)
x_tr.shape, x_val.shape, y_tr.shape, y_val.shape

((9160, 10), (2290, 10), (9160,), (2290,))
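
Splitting twice means the second test_size applies to the training portion, not to the full data. A sketch on hypothetical arrays of 100 rows shows the resulting proportions; the same 0.4 and 0.2 settings are used as above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 2 features, binary target.
x = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# First split: 60% train, 40% test.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)

# Second split: 20% of the training portion becomes the validation set.
x_tr, x_val, y_tr, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state=1)

n = len(x)
fractions = (len(x_tr) / n, len(x_val) / n, len(x_test) / n)
print(fractions)
```

With these settings the data ends up roughly 48% train, 12% validation, and 40% test, which matches the shapes printed above (9160, 2290, and 7634 rows).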
