[Machine learning - plan, analyze] Python, class imbalance, upsampling, downsampling


Olivia-BlackCherry 2023. 8. 1. 09:21

<PACE: Plan, Analyze step>

Class imbalance

Class imbalance occurs when the target (outcome) variable of a dataset contains many more instances of one class than another.

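majority class (the more frequent class) vs minority class (the less frequent class)

The majority and minority classes do not have to be perfectly balanced. Problems typically arise when the majority class accounts for 90% or more of the data. There are two ways to address this.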

1) upsampling

- Useful when the dataset is small.

2) downsampling

- Better suited when the dataset is very large.

Rows are chosen either at random or by using a mathematical formula.
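
As a minimal sketch of random upsampling and downsampling, the example below uses pandas `DataFrame.sample`. The dataset is synthetic, and the column names `feature` and `label`, the 95/5 class split, and the `random_state` values are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic, illustrative data: 95 majority rows (label 0) and 5 minority rows (label 1).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature": rng.normal(size=100),
    "label": [0] * 95 + [1] * 5,
})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Upsampling: draw minority rows WITH replacement until they match the majority count.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=0)
upsampled = pd.concat([majority, minority_upsampled]).reset_index(drop=True)

# Downsampling: draw majority rows WITHOUT replacement down to the minority count.
majority_downsampled = majority.sample(n=len(minority), replace=False, random_state=0)
downsampled = pd.concat([majority_downsampled, minority]).reset_index(drop=True)

print(upsampled["label"].value_counts())
print(downsampled["label"].value_counts())
```

Resampling is usually applied only to the training split, so that the test data keeps the original class distribution.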

Python1

- customer churn

When a customer stops using the bank's services.

1. Libraries

import numpy as np
import pandas as pd

2. Data

3. feature selection

Drop features that are just indexes or that contain personal information.

churn_df = df_original.drop(['RowNumber', 'CustomerId', 'Surname', 'Gender'], axis=1)

Note that you should consider the ethical implications of the results your model produces. For example, including gender could lead to discriminatory predictions and raise socially sensitive issues.

4. feature extraction

churn_df['Loyalty'] = churn_df['Tenure'] / churn_df['Age']

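This creates a new column called Loyalty. Its value is Tenure (years as a customer of the bank) divided by Age, which represents the fraction of the customer's life spent with this bank.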
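
To make the ratio concrete, here is a small sketch with hypothetical customers. The `Tenure` and `Age` values are made up for illustration and are not taken from the churn dataset.

```python
import pandas as pd

# Hypothetical customers; the values are made up for illustration.
example = pd.DataFrame({"Tenure": [2, 10, 5], "Age": [40, 50, 25]})

# Same formula as above: years with the bank divided by age.
example["Loyalty"] = example["Tenure"] / example["Age"]
print(example)
```

A 50-year-old with 10 years of tenure and a 25-year-old with 5 years both get the same Loyalty value, because each has spent one fifth of their life with the bank.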

5. feature transformation

Geography is a categorical variable with three values: France, Spain, and Germany. Encode it as Boolean columns.

Use the get_dummies() function.

drop_first=True is used so that two new columns are created instead of three, which keeps the dataset more compact. If Geography_Germany and Geography_Spain are both 0, the customer is from France.

# Dummy encode categorical variables
churn_df = pd.get_dummies(churn_df, drop_first=True)
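
The effect of `drop_first=True` can be seen on a tiny frame. This is only a sketch: the rows and the `Balance` column are hypothetical, and only the three Geography categories come from the text above.

```python
import pandas as pd

# Hypothetical rows; only the Geography categories come from the text above.
demo = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Balance": [0.0, 1500.0, 320.5, 78.0],
})

# Categories are sorted alphabetically, so the first one (France) is dropped.
encoded = pd.get_dummies(demo, drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

Rows from France have both dummy columns set to 0 (or False, depending on the pandas version), so no information is lost by dropping the third column.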

Python2

data

1. columns

data.columns

Index(['name', 'gp', 'min', 'pts', 'fgm', 'fga', 'fg', '3p_made', '3pa', '3p',
       'ftm', 'fta', 'ft', 'oreb', 'dreb', 'reb', 'ast', 'stl', 'blk', 'tov',
       'target_5yrs'],
      dtype='object')

2. analyze: info, isna, shape, value_counts()...

data['target_5yrs'].value_counts(normalize=True)
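
As a rough sketch of how this output can be used to check for class imbalance, the example below runs the same call on a synthetic target column. The 62/38 split is invented for illustration and is not the real distribution of `target_5yrs`; the 0.9 cutoff is the rule of thumb mentioned earlier, not a hard limit.

```python
import pandas as pd

# Synthetic target column for illustration only; not the real data.
target = pd.Series([1] * 62 + [0] * 38, name="target_5yrs")

# normalize=True returns class proportions instead of raw counts.
proportions = target.value_counts(normalize=True)
print(proportions)

# Flag the data as heavily imbalanced if one class holds 90% or more of the rows.
is_heavily_imbalanced = proportions.max() >= 0.9
print(is_heavily_imbalanced)
```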

3. selection

Features were selected on several grounds. The target variable, target_5yrs, must be included.

selected_data= data[['gp','min','pts','fg','3p','ft','ast','stl','blk','tov','target_5yrs']]

4. extraction

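# Work on a copy so selected_data is left unchanged
extracted_data = selected_data.copy()
# Total points = games played * points per game
extracted_data['total_points'] = extracted_data['gp'] * extracted_data['pts']
# Points per minute = total points / total minutes played (min * gp)
extracted_data['efficiency'] = extracted_data['total_points'] / (extracted_data['min'] * extracted_data['gp'])

# Drop the raw columns that the new features were built from
extracted_data = extracted_data.drop(columns=['gp', 'pts', 'min'])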
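
A short worked example of the two extracted features, using hypothetical players. The values are made up, and the sketch assumes `gp` is games played while `min` and `pts` are per-game averages.

```python
import pandas as pd

# Hypothetical player rows; values are made up for illustration.
players = pd.DataFrame({"gp": [80, 40], "min": [30.0, 10.0], "pts": [15.0, 4.0]})

# Total points over all games played.
players["total_points"] = players["gp"] * players["pts"]
# Points per minute: total points divided by total minutes played.
players["efficiency"] = players["total_points"] / (players["min"] * players["gp"])
print(players)
```

Since `gp` appears in both the numerator and the denominator, efficiency reduces to `pts / min`. A player with 0 minutes would produce a division by zero (NaN or inf in pandas), so such rows may need handling first.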


Olivia Coding School