Contents
1. model.score(x_test, y_test)
Used to evaluate a model's performance.
Given the test data x_test and y_test, it returns the model's default performance metric, which depends on the estimator type:
- Classification models: accuracy
- Regression models: R² (coefficient of determination)
2. accuracy_score(y_test, y_pred)
Returns only the accuracy.
For classifiers, model.score() and accuracy_score() return the same value, since a classifier's score() is defined as accuracy.
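A minimal sketch on toy data (not this post's dataset; make_classification/make_regression are just stand-ins) showing both behaviors:

from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, r2_score

# For a classifier, .score() is accuracy
Xc, yc = make_classification(random_state=0)
clf = LogisticRegression().fit(Xc, yc)
print(clf.score(Xc, yc) == accuracy_score(yc, clf.predict(Xc)))  # True

# For a regressor, .score() is R^2
Xr, yr = make_regression(noise=10, random_state=0)
reg = LinearRegression().fit(Xr, yr)
print(reg.score(Xr, yr) == r2_score(yr, reg.predict(Xr)))  # True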
Example data (loading code omitted)
Data scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_scaled = scaler.fit_transform(x)
data_scaled
array([[ 1.09706398, -2.07333501, 1.26993369, ..., 2.29607613,
2.75062224, 1.93701461],
[ 1.82982061, -0.35363241, 1.68595471, ..., 1.0870843 ,
-0.24388967, 0.28118999],
[ 1.57988811, 0.45618695, 1.56650313, ..., 1.95500035,
1.152255 , 0.20139121],
...,
[ 0.70228425, 2.0455738 , 0.67267578, ..., 0.41406869,
-1.10454895, -0.31840916],
[ 1.83834103, 2.33645719, 1.98252415, ..., 2.28998549,
1.91908301, 2.21963528],
[-1.80840125, 1.22179204, -1.81438851, ..., -1.74506282,
-0.04813821, -0.75120669]])
Data split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data_scaled, target, test_size=0.3, random_state=0)
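One caveat with the flow above: the scaler was fit on all of x before splitting, which leaks test-set statistics into preprocessing. A leak-free variant (a sketch, assuming the same x and target from the example data) splits first:

# Split first, then fit the scaler on the training fold only.
x_train, x_test, y_train, y_test = train_test_split(x, target, test_size=0.3, random_state=0)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)  # reuse the training-fold statistics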
Building a logistic regression model
from sklearn.linear_model import LogisticRegression
lr_clf = LogisticRegression()
model = lr_clf.fit(x_train, y_train)
y_pred = model.predict(x_test)
Evaluation
from sklearn.metrics import accuracy_score, roc_auc_score
print(model.score(x_test, y_test), "model score")
print(accuracy_score(y_test, y_pred), "accuracy")
print(roc_auc_score(y_test, y_pred), "ROC_AUC")
0.9766081871345029 model score
0.9766081871345029 accuracy
0.9715608465608465 ROC_AUC
3. confusion_matrix(y_test, y_pred)
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)
array([[ 60, 3],
[ 1, 107]])
[[TN, FP],
 [FN, TP]]
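The four cells can also be unpacked by name with .ravel(), a small sketch:

# Unpack the confusion matrix into named counts
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(tn, fp, fn, tp)  # 60 3 1 107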
4. recall, precision
from sklearn.metrics import precision_score, recall_score
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print("precision", precision)
print("recall", recall)
precision 0.9727272727272728
recall 0.9907407407407407
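As a cross-check (a sketch), the same numbers fall straight out of the confusion-matrix cells:

from sklearn.metrics import confusion_matrix

# precision = TP / (TP + FP), recall = TP / (TP + FN)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("precision", tp / (tp + fp))  # 107 / 110 = 0.9727...
print("recall   ", tp / (tp + fn))  # 107 / 108 = 0.9907...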
5. Binarizer
1) y_pred vs pred_proba
y_pred
y_pred = model.predict(x_test)
y_pred[:3]
array([0, 1, 1])
pred_proba
pred_proba = model.predict_proba(x_test)
pred_proba[:3]
array([[0.99864569, 0.00135431],
       [0.03842822, 0.96157178],
       [0.00130563, 0.99869437]])
predict() picks whichever of the two class probabilities is larger.
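A quick equivalence check (a sketch): for a binary classifier, predict() amounts to an argmax over predict_proba, which for two classes is the same as thresholding the class-1 probability at 0.5 (up to exact-tie edge cases):

import numpy as np

# predict() == argmax over the two class probabilities
print(np.array_equal(pred_proba.argmax(axis=1), y_pred))              # True
# ...which equals a 0.5 threshold on the positive-class probability
print(np.array_equal((pred_proba[:, 1] >= 0.5).astype(int), y_pred))  # True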
2) Changing the probability threshold with Binarizer
Extract only the positive-class probabilities
pred_proba_1 = pred_proba[:,1]
pred_proba_1[:3]
array([0.00135431, 0.96157178, 0.99869437])
Reshape to a 2-D column vector (Binarizer expects 2-D input)
pred_proba_1 = pred_proba_1.reshape(-1,1)
pred_proba_1[:3]
array([[0.00135431],
[0.96157178],
[0.99869437]])
Preprocessing with Binarizer to apply a custom threshold
from sklearn.preprocessing import Binarizer
custom_threshold = 0.5
binarizer = Binarizer(threshold=custom_threshold)
custom_pred = binarizer.fit_transform(pred_proba_1)
custom_pred[:3]
>> custom_pred is created (note that Binarizer returns floats, 0./1.)
array([[0.],
[1.],
[1.]])
3) Checking the confusion matrix
At threshold 0.5 it matches confusion_matrix(y_test, y_pred) exactly, as expected.
confusion_matrix(y_test, custom_pred)
array([[ 60, 3],
[ 1, 107]])
4) Changing the threshold
from sklearn.preprocessing import Binarizer
custom_threshold = 0.8
binarizer = Binarizer(threshold=custom_threshold)
custom_pred = binarizer.fit_transform(pred_proba_1)
confusion_matrix(y_test, custom_pred)
array([[62, 1],
[10, 98]])
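To watch the shift develop, one can sweep several thresholds (a sketch, reusing pred_proba_1 from above):

# As the threshold rises, false positives fall but false negatives grow.
for t in [0.3, 0.5, 0.7, 0.9]:
    pred_t = Binarizer(threshold=t).fit_transform(pred_proba_1)
    print(f"threshold={t}")
    print(confusion_matrix(y_test, pred_t))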
5) Evaluation
<threshold = 0.8>
from sklearn.metrics import accuracy_score, roc_auc_score
print(model.score(x_test, y_test), "모델 score")
print(accuracy_score(y_test, custom_pred), "정확도")
print(roc_auc_score(y_test, custom_pred), "ROC_AUC")
precision = precision_score(y_test, custom_pred)
recall = recall_score(y_test, custom_pred)
print("precision", precision)
print("recall", recall)
0.9766081871345029 model score
0.935672514619883 accuracy
0.9457671957671958 ROC_AUC
precision 0.98989898989899
recall 0.9074074074074074
<threshold = 0.5>
from sklearn.metrics import accuracy_score, roc_auc_score
print(model.score(x_test, y_test), "모델 score")
print(accuracy_score(y_test, y_pred), "정확도")
print(roc_auc_score(y_test, y_pred), "ROC_AUC")
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print("precision", precision)
print("recall", recall)
0.9766081871345029 model score
0.9766081871345029 accuracy
0.9715608465608465 ROC_AUC
precision 0.9727272727272728
recall 0.9907407407407407
---> precision and recall trade off: raising the threshold from 0.5 to 0.8 pushed precision up (0.973 -> 0.990) but pulled recall down (0.991 -> 0.907).
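sklearn can trace this tradeoff across all candidate thresholds at once with precision_recall_curve; a sketch (pred_proba_1 flattened back to 1-D):

from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

precisions, recalls, thresholds = precision_recall_curve(y_test, pred_proba_1.ravel())
# precisions/recalls have one more entry than thresholds; drop the last point
plt.plot(thresholds, precisions[:-1], label='precision')
plt.plot(thresholds, recalls[:-1], label='recall')
plt.xlabel('threshold')
plt.legend()
plt.show()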
6. F1 score
from sklearn.metrics import f1_score
f1 = f1_score(y_test, y_pred)
f1
0.981651376146789
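A sanity check (sketch): F1 is the harmonic mean of precision and recall, so it can be reproduced from the precision/recall values computed above for y_pred:

# F1 = 2PR / (P + R)
f1_manual = 2 * precision * recall / (precision + recall)
print(f1_manual)  # 0.981651..., matching f1_score(y_test, y_pred)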
7. ROC
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# positive-class probabilities
pred_proba_class1 = model.predict_proba(x_test)[:,1]
pred_proba_class1[:3]
# fprs, tprs --> draw the ROC curve
fprs, tprs, thresholds = roc_curve(y_test, pred_proba_class1)
plt.plot(fprs, tprs, label='ROC')
plt.plot([0,1], [0,1], 'k--', label='Random')
plt.legend()
plt.show()
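roc_curve also returns the candidate thresholds it evaluated, so the raw (threshold, FPR, TPR) triples can be inspected directly (a sketch; the first threshold is a sentinel above the maximum score):

for f, t, th in list(zip(fprs, tprs, thresholds))[:5]:
    print(f"threshold={th:.4f}  FPR={f:.4f}  TPR={t:.4f}")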
8. AUC
The ROC curve itself is used to look at how TPR moves against FPR; the actual classification performance metric is AUC, the area under the ROC curve.
The closer to 1, the better.
You want TPR to be high while FPR is still low!
Note that passing the class-1 probabilities (below) scores the full ranking and gives 0.9947, whereas passing the hard labels y_pred (as in section 2) gave only 0.9716.
from sklearn.metrics import roc_auc_score
pred_proba_class1 = model.predict_proba(x_test)[:,1]
roc_score = roc_auc_score(y_test, pred_proba_class1)
print("ROC_AUC: ", roc_score)
ROC_AUC: 0.9947089947089947