clustering, k-means

Olivia-BlackCherry 2023. 5. 18. 22:37

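Clustering

Clustering is an unsupervised learning technique. A cluster is a group of data points that are similar to one another.

Some datasets come without labels. In that case, we form several clusters by grouping data points with similar properties, and assign each group a cluster label.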

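Clustering applications

- Finding customer purchase patterns

- Recommending new books and movies to new customers

- Detecting credit card fraud

- Segmenting customers

- Assessing customer credit risk

- Recommending articles to customers

- Analyzing patient behavior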

Why use clustering

It is well suited to exploratory data analysis.

It can summarize the data roughly or reduce its size.

It detects outliers.

It finds duplicates.

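Types of clustering algorithms

1) Partition-based

Produces roughly spherical clusters.

Used on medium and large datasets, and relatively efficient.

K-means

2) Hierarchical clustering

Produces a tree of clusters with parent-child relationships. It is intuitive and is generally used on small datasets.

Agglomerative, Divisive

3) Density-based clustering

Produces clusters of arbitrary shape, and is used on datasets that contain noise.

DBSCAN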
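To get a feel for how the three families behave, here is a minimal sketch that runs one algorithm from each family on the same toy data. The half-moon dataset, `eps=0.2`, `min_samples=5`, and the names `X_demo`, `models`, and `results` are my own choices for illustration, not something from the course material.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a shape that is clearly not sphere-like.
X_demo, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# One representative per family (parameter values are assumptions).
models = {
    "K-means (partition-based)": KMeans(n_clusters=2, n_init=10, random_state=0),
    "Agglomerative (hierarchical)": AgglomerativeClustering(n_clusters=2),
    "DBSCAN (density-based)": DBSCAN(eps=0.2, min_samples=5),
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X_demo)
    results[name] = labels
    n_clusters = len(set(labels) - {-1})   # DBSCAN marks noise points with -1
    n_noise = int(np.sum(labels == -1))
    print(f"{name}: {n_clusters} clusters, {n_noise} noise points")
```

Note that K-means and agglomerative clustering need the number of clusters up front, while DBSCAN derives it from the density parameters.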

K-means clustering

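- partitioning clustering

- K-means divides the data into non-overlapping subsets without any cluster-internal structure

- examples within a cluster are very similar

- examples across different clusters are very different

K-means splits the samples into several clusters, and each sample belongs to exactly one of them.

Samples are homogeneous within a cluster and heterogeneous across clusters.

The closer together the samples within a cluster are, and the farther apart the clusters are from each other, the better the clustering is considered to be.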
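The "close inside, far apart outside" idea can be checked with a few lines of NumPy. This is only a sketch: the helper name `cohesion_and_separation` and the toy data are made up for illustration, and using centroid distances is just one simple way to measure it.

```python
import numpy as np

def cohesion_and_separation(X, labels):
    """Return (mean distance of points to their own centroid,
    mean distance between pairs of centroids)."""
    ids = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    # Within-cluster: distance of every point to the centroid of its cluster.
    idx = np.searchsorted(ids, labels)
    within = np.linalg.norm(X - centroids[idx], axis=1).mean()
    # Between-cluster: distance between every pair of distinct centroids.
    diffs = centroids[:, None, :] - centroids[None, :, :]
    pairwise = np.linalg.norm(diffs, axis=2)
    between = pairwise[np.triu_indices(len(ids), k=1)].mean()
    return float(within), float(between)

# Toy data: two tight groups that are far apart.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = np.array([0, 0, 1, 1])   # groups nearby points together
bad = np.array([0, 1, 0, 1])    # mixes far-apart points

print(cohesion_and_separation(X_toy, good))   # prints (0.5, 10.0)
print(cohesion_and_separation(X_toy, bad))    # prints (5.0, 1.0)
```

The good assignment has a small within-cluster distance and a large between-cluster distance. The bad assignment is the other way around.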

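How is distance measured?

- Euclidean distance

- Cosine similarity

- Average distance

and so on. Choose the measure that fits your understanding of the dataset and the data types of its features.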
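As a small illustration of why the choice matters, the sketch below computes the first two measures for the same pair of vectors. The function names and the example vectors are my own. Average distance is left out here.

```python
import numpy as np

def euclidean_distance(a, b):
    # Straight-line distance between two points.
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors (1.0 means same direction).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])   # same direction as a, twice the length

print(euclidean_distance(a, b))   # prints 5.0
print(cosine_similarity(a, b))    # prints 1.0
```

Euclidean distance says the two points are 5 units apart, while cosine similarity says they are identical in direction. Which answer is "right" depends on whether magnitude matters for your features.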

How does it work?

1) Initialize k = 3 (k is the number of clusters)

Centroids: set randomly

2) Distance calculation

Compute the distance from each data point to each centroid.

3) Assign each point to the closest centroid

Each data point is assigned to the centroid nearest to it.

If we measure the SSE (sum of squared errors) at this point, the error is very high. That is expected, because the centroids were chosen at random.

The following steps reduce this error.

4) Compute the new centroids for each cluster

Each centroid moves to the mean of the data points assigned to it.

5) Repeat until there are no changes

Repeat steps 2 to 4 until the centroids no longer move.
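The five steps can be sketched in plain NumPy roughly as follows. This is a learning sketch, not how scikit-learn implements it. The function name `kmeans_sketch`, picking random data points as the initial centroids, and the handling of empty clusters are all my own assumptions.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1) Initialize: pick k distinct data points as the starting centroids
    #    (one possible way to "set randomly"; this is an assumption).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    sse_history = []
    for _ in range(max_iter):
        # 2) Distance from every point to every centroid, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # 3) Assign each point to its closest centroid.
        labels = dists.argmin(axis=1)
        # SSE: sum of squared distances to the assigned centroid.
        sse_history.append(float((dists[np.arange(len(X)), labels] ** 2).sum()))
        # 4) Move each centroid to the mean of its points
        #    (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 5) Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels, sse_history

# Usage on three made-up blobs.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(loc, 0.5, size=(50, 2))
                    for loc in ([0, 0], [5, 5], [0, 5])])
centroids, labels, sse_history = kmeans_sketch(X_demo, k=3)
print(sse_history)   # SSE after each assignment step; it never increases
```

Each pass through steps 3 and 4 can only keep the SSE the same or lower it, which is why the loop eventually settles.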

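Evaluation

How do we check whether a fitted K-means model is good?

Because this is unsupervised learning, there are no ground-truth labels, so accuracy cannot be computed directly as it was for the earlier supervised algorithms.

Instead, the quality is measured with distances, for example the average distance between the data points and their cluster centroids.

Plotting this distance against k shows that it shrinks as the number of clusters k increases.

Because the distance always shrinks as k grows, the minimum is not useful. Instead, look for the elbow point, where the rate of decrease changes sharply, and take that value as the k to use.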
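A minimal sketch of the elbow search, assuming scikit-learn's `inertia_` attribute (the sum of squared distances from each sample to its closest centroid) as the distance measure. The blob data, the range of k, and the names `X_demo`, `ks`, and `sse` are illustrative choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Made-up data with four true centers.
X_demo, _ = make_blobs(n_samples=500,
                       centers=[[4, 4], [-2, -1], [2, -3], [1, 1]],
                       cluster_std=0.9, random_state=0)

ks = list(range(1, 9))
sse = []
for k in ks:
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    model.fit(X_demo)
    sse.append(model.inertia_)   # sum of squared distances to the closest centroid

for k, value in zip(ks, sse):
    print(k, round(value, 1))
```

Plotting `ks` against `sse` (for example with `plt.plot(ks, sse, marker='o')`) gives the elbow curve. The k where the curve bends is the candidate to try first.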

Practicing k-means on randomly generated data

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Jupyter magic; remove this line when running as a plain Python script.
%matplotlib inline

# 1. Prepare the data
# Set a random seed so that the results are reproducible.
np.random.seed(0)

# Generate 5000 points around four centers with make_blobs
X, y = make_blobs(n_samples=5000, centers=[[4,4], [-2, -1], [2, -3], [1, 1]], cluster_std=0.9)

# Visualize
plt.scatter(X[:, 0], X[:, 1], marker='.')

# 2. Fit the K-means model
k_means = KMeans(init = "k-means++", n_clusters = 4, n_init = 12)
k_means.fit(X)
# Cluster index (0 to 3) assigned to each sample
k_means_labels = k_means.labels_
k_means_labels
# Coordinates of the four cluster centers
k_means_cluster_centers = k_means.cluster_centers_
k_means_cluster_centers

#visualization
fig = plt.figure(figsize=(6, 4))
colors = plt.cm.Spectral(np.linspace(0, 1, len(set(k_means_labels))))
ax = fig.add_subplot(1, 1, 1)
for k, col in zip(range(len(k_means_cluster_centers)), colors):

    # Create a list of all data points, where the data points that are 
    # in the cluster (ex. cluster 0) are labeled as true, else they are
    # labeled as false.
    my_members = (k_means_labels == k)
    
    # Define the centroid, or cluster center.
    cluster_center = k_means_cluster_centers[k]
    
    # Plots the datapoints with color col.
    ax.plot(X[my_members, 0], X[my_members, 1], 'w', markerfacecolor=col, marker='.')
    
    # Plots the centroids with specified color, but with a darker outline
    ax.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,  markeredgecolor='k', markersize=6)
ax.set_title('KMeans')
ax.set_xticks(())
ax.set_yticks(())
plt.show()

Practicing with real data

import pandas as pd
cust_df = pd.read_csv("Cust_Segmentation.csv")
cust_df.head()

# Drop the categorical column
df = cust_df.drop('Address', axis=1)
df.head()

# Standardize the features
from sklearn.preprocessing import StandardScaler
X = df.values[:,1:]
X = np.nan_to_num(X)
Clus_dataSet = StandardScaler().fit_transform(X)
Clus_dataSet

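# Modeling
clusterNum = 3
k_means = KMeans(init = "k-means++", n_clusters = clusterNum, n_init = 12)
# Fit on the standardized features so that no single column dominates the distances
k_means.fit(Clus_dataSet)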
labels = k_means.labels_
print(labels)

# Extract insights
df["Clus_km"] = labels
df.head(5)
df.groupby('Clus_km').mean()

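# 2D visualization
# Marker size is scaled by the education column (X[:, 1])
area = np.pi * ( X[:, 1])**2
# np.float was removed in NumPy 1.24; the built-in float does the same job
plt.scatter(X[:, 0], X[:, 3], s=area, c=labels.astype(float), alpha=0.5)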
plt.xlabel('Age', fontsize=18)
plt.ylabel('Income', fontsize=16)

plt.show()

# 3D visualization
fig = plt.figure(1, figsize=(8, 6))
plt.clf()
# Axes3D(fig, ...) no longer attaches itself to the figure in recent Matplotlib,
# so create the 3D axes through the figure instead.
ax = fig.add_axes([0, 0, .95, 1], projection='3d', elev=48, azim=134)

ax.set_xlabel('Education')
ax.set_ylabel('Age')
ax.set_zlabel('Income')

ax.scatter(X[:, 1], X[:, 0], X[:, 3], c=labels.astype(float))
plt.show()
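On the standardization step used above: K-means relies on distances, so a feature with a large numeric range (such as income) can drown out the others unless the columns are rescaled first. The sketch below shows what `StandardScaler` does on a tiny made-up array. The values in `X_toy` are hypothetical and not taken from `Cust_Segmentation.csv`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column 0 is an age, column 1 an income in thousands.
X_toy = np.array([[25.0, 20.0], [35.0, 60.0], [45.0, 100.0]])

scaled = StandardScaler().fit_transform(X_toy)

# The same thing by hand: subtract the column mean, divide by the column
# standard deviation (population std, ddof=0).
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(scaled, manual))   # prints True
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, so each feature contributes on a comparable scale to the distance.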
