
Lesson 14: Classification evaluation metrics, score(), accuracy, precision, recall, confusion_matrix, Binarizer, predict_proba, f1 score, ROC, AUC, roc_auc

Olivia-BlackCherry 2024. 4. 27. 07:05


    1. model.score(x_test, y_test)

    Used to evaluate a trained model's performance.

    Given the test data x_test and y_test, it returns the model's default performance metric; which metric you get depends on the estimator type:

    - Classification model: accuracy

    - Regression model: R2

     

     

    2. accuracy_score(y_test, y_pred)

    Returns only the accuracy.

    For classification models, model.score() and accuracy_score() return the same value, since score() computes accuracy internally; see the quick check below.
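
    A minimal sketch, assuming the fitted model and the x_test/y_test split built later in this post:

    from sklearn.metrics import accuracy_score

    # For a fitted classifier, score() and accuracy_score() agree exactly
    print(model.score(x_test, y_test))                    # accuracy via score()
    print(accuracy_score(y_test, model.predict(x_test)))  # same value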

     

     

    Python code example

    Example data
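
    The data-loading cell is missing from the post; a plausible reconstruction, assuming scikit-learn's built-in breast cancer dataset (the 30-column standardized array below matches it):

    from sklearn.datasets import load_breast_cancer

    # Assumption: the x and target used throughout this post come from this dataset
    cancer = load_breast_cancer()
    x = cancer.data          # 569 samples x 30 numeric features
    target = cancer.target   # binary labels: 0 = malignant, 1 = benign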

     

     

    Data scaling

    from sklearn.preprocessing import StandardScaler

    # Standardize each feature to zero mean and unit variance
    scaler = StandardScaler()
    data_scaled = scaler.fit_transform(x)
    data_scaled

     

    array([[ 1.09706398, -2.07333501,  1.26993369, ...,  2.29607613,
             2.75062224,  1.93701461],
           [ 1.82982061, -0.35363241,  1.68595471, ...,  1.0870843 ,
            -0.24388967,  0.28118999],
           [ 1.57988811,  0.45618695,  1.56650313, ...,  1.95500035,
             1.152255  ,  0.20139121],
           ...,
           [ 0.70228425,  2.0455738 ,  0.67267578, ...,  0.41406869,
            -1.10454895, -0.31840916],
           [ 1.83834103,  2.33645719,  1.98252415, ...,  2.28998549,
             1.91908301,  2.21963528],
           [-1.80840125,  1.22179204, -1.81438851, ..., -1.74506282,
            -0.04813821, -0.75120669]])

     

     

    Data split

    from sklearn.model_selection import train_test_split

    # 70/30 train-test split with a fixed seed for reproducibility
    x_train, x_test, y_train, y_test = train_test_split(data_scaled, target, test_size=0.3, random_state=0)

     

     

    Building a logistic regression model

    from sklearn.linear_model import LogisticRegression

    # Fit a logistic regression classifier and predict on the test set
    lr_clf = LogisticRegression()
    model = lr_clf.fit(x_train, y_train)
    y_pred = model.predict(x_test)

     

     

    Evaluation

    from sklearn.metrics import accuracy_score, roc_auc_score
    print(model.score(x_test, y_test), "model score")
    print(accuracy_score(y_test, y_pred), "accuracy")
    print(roc_auc_score(y_test, y_pred), "ROC_AUC")
    0.9766081871345029 model score
    0.9766081871345029 accuracy
    0.9715608465608465 ROC_AUC

     

     

     

    3. confusion_matrix(y_test, y_pred)

    from sklearn.metrics import confusion_matrix

    # Rows = actual class, columns = predicted class
    confusion_matrix(y_test, y_pred)
    array([[ 60,   3],
           [  1, 107]])

    The cells are laid out as:

    [[TN, FP],
     [FN, TP]]
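
    ravel() flattens the matrix in that order, which is handy for pulling out the four cells; a minimal sketch:

    # Unpack the 2x2 matrix: row 0 = actual negative, row 1 = actual positive
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    print(tn, fp, fn, tp)   # 60 3 1 107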

     

     

     

    4. recall, precision

    from sklearn.metrics import precision_score, recall_score

    # precision = TP / (TP + FP), recall = TP / (TP + FN)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    print("precision", precision)
    print("recall", recall)
    precision 0.9727272727272728
    recall 0.9907407407407407
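
    These values match a hand computation from the confusion matrix above, which is a quick sanity check:

    # precision = TP / (TP + FP), recall = TP / (TP + FN)
    print(107 / (107 + 3))   # 0.9727... (precision)
    print(107 / (107 + 1))   # 0.9907... (recall)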

     

     

     

    5. Binarizer

    1) y_pred vs y_proba

    y_pred

    # Hard class labels from the default 0.5 threshold
    y_pred = model.predict(x_test)
    y_pred[:3]
    array([0, 1, 1])

     

     

    y_proba

    pred_proba = model.predict_proba(x_test)
    pred_proba[:3]
    array([[0.99864569, 0.00135431],
           [0.03842822, 0.96157178],
           [0.00130563, 0.99869437]])

    Each row holds the probability of each class; predict() picks the class with the larger of the two probabilities.

     

     

    2) Changing the probability threshold with Binarizer

    Extract only the positive-class probabilities

    # keep only the positive-class (column 1) probabilities
    pred_proba_1 = pred_proba[:,1]
    pred_proba_1[:3]
    array([0.00135431, 0.96157178, 0.99869437])

     

     

    Reshaping

    # Binarizer expects a 2D array, so reshape to a single column
    pred_proba_1 = pred_proba_1.reshape(-1,1)
    pred_proba_1[:3]
    array([[0.00135431],
           [0.96157178],
           [0.99869437]])

     

     

    Changing the threshold by preprocessing with Binarizer

    from sklearn.preprocessing import Binarizer

    # values strictly greater than the threshold become 1, the rest 0
    custom_threshold = 0.5
    binarizer = Binarizer(threshold=custom_threshold)
    custom_pred = binarizer.fit_transform(pred_proba_1)
    custom_pred[:3]

    >> custom_pred is created

    array([[0.],
           [1.],
           [1.]])
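
    Binarizer maps values strictly greater than the threshold to 1 and everything else to 0; a plain NumPy comparison gives the same result without the reshape step, as a minimal sketch:

    # Same effect as Binarizer(threshold=0.5) on the 1-D probability array
    custom_pred_np = (pred_proba[:, 1] > 0.5).astype(int)
    custom_pred_np[:3]   # array([0, 1, 1])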

     

     

    3) Checking the confusion matrix

    confusion_matrix(y_test, custom_pred)
    array([[ 60,   3],
           [  1, 107]])

    With the default 0.5 threshold, this is identical to the matrix from y_pred above.

     

     

    4) Changing the threshold

    from sklearn.preprocessing import Binarizer

    # Raise the threshold: a sample must score above 0.8 to be predicted positive
    custom_threshold = 0.8
    binarizer = Binarizer(threshold=custom_threshold)
    custom_pred = binarizer.fit_transform(pred_proba_1)
    confusion_matrix(y_test, custom_pred)
    array([[62,  1],
           [10, 98]])

    Raising the threshold cuts false positives (3 -> 1) but increases false negatives (1 -> 10).

     

     

    5) Evaluation

    <When the threshold is 0.8>

    from sklearn.metrics import accuracy_score, roc_auc_score
    print(model.score(x_test, y_test), "model score")
    print(accuracy_score(y_test, custom_pred), "accuracy")
    print(roc_auc_score(y_test, custom_pred), "ROC_AUC")
    precision = precision_score(y_test, custom_pred)
    recall = recall_score(y_test, custom_pred)
    print("precision", precision)
    print("recall", recall)
    0.9766081871345029 model score
    0.935672514619883 accuracy
    0.9457671957671958 ROC_AUC
    precision 0.98989898989899
    recall 0.9074074074074074

     

     

    <When the threshold is 0.5>

    from sklearn.metrics import accuracy_score, roc_auc_score
    print(model.score(x_test, y_test), "model score")
    print(accuracy_score(y_test, y_pred), "accuracy")
    print(roc_auc_score(y_test, y_pred), "ROC_AUC")
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    print("precision", precision)
    print("recall", recall)
    0.9766081871345029 model score
    0.9766081871345029 accuracy
    0.9715608465608465 ROC_AUC
    precision 0.9727272727272728
    recall 0.9907407407407407
    
    ---> precision and recall are in a trade-off: raising the threshold pushed precision up (0.973 -> 0.990) and recall down (0.991 -> 0.907). A threshold sweep makes this visible, as shown below.
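
    scikit-learn can sweep every candidate threshold at once with precision_recall_curve; a short sketch using the positive-class probabilities from above:

    from sklearn.metrics import precision_recall_curve
    import matplotlib.pyplot as plt

    # precision/recall at every candidate threshold (the final precision/recall
    # entries have no matching threshold, hence the [:-1] slices)
    precisions, recalls, thresholds = precision_recall_curve(y_test, pred_proba[:, 1])
    plt.plot(thresholds, precisions[:-1], label='precision')
    plt.plot(thresholds, recalls[:-1], label='recall')
    plt.xlabel('threshold')
    plt.legend()
    plt.show()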
     
     

    6. F1 score

    from sklearn.metrics import f1_score

    # F1 = harmonic mean of precision and recall
    f1 = f1_score(y_test, y_pred)
    f1
    0.981651376146789
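
    Plugging the threshold-0.5 precision and recall from section 4 into the harmonic-mean formula reproduces the score:

    # F1 = 2 * precision * recall / (precision + recall)
    p, r = 0.9727272727272728, 0.9907407407407407
    print(2 * p * r / (p + r))   # 0.981651376146789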

     

     

    7. ROC

    # class-1 (positive) probabilities
    pred_proba_class1 = model.predict_proba(x_test)[:,1]
    pred_proba_class1[:3]

    # fprs, tprs --> draw the ROC curve
    from sklearn.metrics import roc_curve
    import matplotlib.pyplot as plt

    fprs, tprs, thresholds = roc_curve(y_test, pred_proba_class1)
    plt.plot(fprs, tprs, label='ROC')
    plt.plot([0,1], [0,1], 'k--', label='Random')
    plt.legend()
    plt.show()
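
    roc_curve also returns the threshold that produces each (FPR, TPR) point; printing a few of them shows how the curve is traced (every 10th point here, just for illustration):

    # Sample a few operating points along the curve
    for fpr, tpr, thr in list(zip(fprs, tprs, thresholds))[::10]:
        print(f"threshold={thr:.3f}  FPR={fpr:.3f}  TPR={tpr:.3f}")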

     

     

     

    8. AUC

    The ROC curve itself is used to see how TPR changes against FPR; the actual classification performance metric is AUC, the area under the ROC curve.

    The closer it is to 1, the better.

    The larger the TPR while the FPR is still small, the better!

    from sklearn.metrics import roc_auc_score

    pred_proba_class1 = model.predict_proba(x_test)[:,1]
    roc_score = roc_auc_score(y_test, pred_proba_class1)
    print("ROC_AUC: ", roc_score)
    ROC_AUC:  0.9947089947089947
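
    Note the difference from section 2: passing predict_proba scores measures ranking quality across every threshold, while passing hard 0/1 labels collapses the curve to the single 0.5 operating point:

    # Probabilities -> area under the full curve; labels -> one operating point
    print(roc_auc_score(y_test, pred_proba_class1))  # 0.9947...
    print(roc_auc_score(y_test, y_pred))             # 0.9715..., as in section 2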

     

     
