Comparing clustering methods manually can be tricky and cumbersome. We propose an automated method that outputs the number of entries that two clustering methods group in the same way.
Cluster analysis is an unsupervised machine learning task that involves automatically discovering groupings in data. Entries can be assigned to clusters using different methods and functions. For example, the AgglomerativeClustering class from the sklearn library in Python provides four built-in linkage methods:
- ward minimizes the variance of the clusters being merged.
- average uses the average of the distances of each observation of the two sets.
- complete (or maximum) linkage uses the maximum distances between all observations of the two sets.
- single uses the minimum of the distances between all observations of the two sets.

We are considering single, average, and ward. These methods may produce different outputs, and at present the user can only spot the differences by comparing the results manually.
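
To make the linkage definitions above concrete, here is a minimal sketch that computes each criterion by hand for two tiny clusters. The points and the helper names (`single_linkage`, `ward_cost`, and so on) are made up for illustration and are not part of sklearn; the Ward value is computed as the increase in within-cluster sum of squares caused by the merge, which is the usual reading of "minimizes the variance of the clusters being merged".

```python
from itertools import product
from math import dist  # Euclidean distance, available from Python 3.8


def pairwise_distances(a, b):
    """All Euclidean distances between a point of a and a point of b."""
    return [dist(p, q) for p, q in product(a, b)]


def single_linkage(a, b):
    return min(pairwise_distances(a, b))


def complete_linkage(a, b):
    return max(pairwise_distances(a, b))


def average_linkage(a, b):
    d = pairwise_distances(a, b)
    return sum(d) / len(d)


def sum_of_squares(points):
    """Sum of squared distances of the points to their centroid."""
    centroid = tuple(sum(c) / len(points) for c in zip(*points))
    return sum(sum((x - m) ** 2 for x, m in zip(p, centroid)) for p in points)


def ward_cost(a, b):
    """Increase in within-cluster sum of squares when a and b are merged."""
    return sum_of_squares(a + b) - sum_of_squares(a) - sum_of_squares(b)


# Two toy clusters on a line (hypothetical points, not from insurance.csv).
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

print(single_linkage(cluster_a, cluster_b))    # 2.0
print(complete_linkage(cluster_a, cluster_b))  # 5.0
print(average_linkage(cluster_a, cluster_b))   # 3.5
print(ward_cost(cluster_a, cluster_b))         # 12.25
```

Because each criterion scores the same pair of clusters differently, the merge order, and therefore the final grouping, can differ from one linkage to another.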
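We created an automated function that compares the three methods against each other and outputs how many entries are clustered the same way. The inputs required by the algorithm are the clustering labels of each method and the respective entry names. For each pair of methods, the labels of one method combined with the entry names serve as the ground truth, i.e. the reference clustering against which the candidate clustering of the other method is compared. The algorithm can be applied to any csv file and proceeds as follows for each pair of methods:
1. Sort the entries by cluster label and collect the entry names of each cluster into a sublist, once for the candidate and once for the ground truth.
2. Take the largest remaining candidate cluster.
3. Find the remaining ground-truth cluster that shares the most entries with it and add the number of shared entries to the match count.
4. Remove both clusters and repeat from step 2 until no clusters are left.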
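
One reason a dedicated comparison is needed is that cluster labels are arbitrary: two methods can produce the same groups under different label numbers, so comparing labels position by position undercounts the agreement. The sketch below illustrates this with a compact, set-based variant of the greedy matching idea. It is not the exact implementation used in this post; the names `clusters_by_label` and `count_matches` and the toy label lists are assumptions made for the example.

```python
from collections import defaultdict


def clusters_by_label(labels):
    """Return the clusters as a list of index sets, ordered by label."""
    groups = defaultdict(set)
    for index, label in enumerate(labels):
        groups[label].add(index)
    return [groups[label] for label in sorted(groups)]


def count_matches(reference_labels, candidate_labels):
    """Greedy overlap count between two clusterings of the same entries."""
    reference = clusters_by_label(reference_labels)
    candidate = clusters_by_label(candidate_labels)
    # Match the largest candidate clusters first (the sort is stable for ties).
    candidate.sort(key=len, reverse=True)
    matches = 0
    for cluster in candidate:
        if not reference:
            break
        # Pick the remaining reference cluster with the largest overlap.
        best = max(range(len(reference)),
                   key=lambda j: len(cluster & reference[j]))
        matches += len(cluster & reference[best])
        reference.pop(best)
    return matches


# Hypothetical labels for six entries from two methods.
reference = [0, 0, 0, 1, 1, 2]
candidate = [2, 2, 1, 1, 1, 0]

naive = sum(r == c for r, c in zip(reference, candidate))
print(naive)                                       # 2
print(count_matches(reference, candidate))         # 5
# Same groups, different label numbers: every entry matches.
print(count_matches([0, 0, 1, 1], [1, 1, 0, 0]))   # 4
```

Note that the matching is greedy, so the count depends on the order in which clusters are processed and is not guaranteed to be the best possible one-to-one assignment.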
We used the insurance.csv file from Kaggle as a sample data set. The features that we used for the clustering are bmi and charges. Here are the first few rows of the file:
| age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|
| 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 28 | male | 33 | 3 | no | southeast | 4449.462 |
| 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |
And here is the code:
```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering


def max_len(lst):
    # Return the longest sublist (the first one in case of ties).
    maxList = max(lst, key=len)
    return maxList


def match_clustering_groups(ground_truth, files_names, clustering_labels, n_clusters):
    match = 0
    ls_1 = files_names
    ls_2 = clustering_labels.tolist()
    ls_3 = ground_truth.tolist()
    d_candidate = {'names': ls_1, 'groups': ls_2}
    d_truth = {'names': ls_1, 'groups': ls_3}
    df_candidate = pd.DataFrame(d_candidate)
    df_truth = pd.DataFrame(d_truth)
    # Sort by label so that entries of the same cluster are adjacent.
    df_c_sorted = df_candidate.sort_values(by=['groups'])
    df_t_sorted = df_truth.sort_values(by=['groups'])

    # Split the candidate entry names into one sublist per cluster.
    count = 0
    clustering_groups_sublists_c = [[] for i in range(0, n_clusters)]
    for i in range(len(df_c_sorted)):
        if i == 0:
            clustering_groups_sublists_c[count].append(df_c_sorted.names.iloc[0])
        elif df_c_sorted.groups.iloc[i] == df_c_sorted.groups.iloc[i - 1]:
            clustering_groups_sublists_c[count].append(df_c_sorted.names.iloc[i])
        else:
            # The label changed, so start filling the next sublist.
            count = count + 1
            clustering_groups_sublists_c[count].append(df_c_sorted.names.iloc[i])

    # Do the same for the ground-truth entry names.
    count = 0
    clustering_groups_sublists_t = [[] for i in range(0, n_clusters)]
    for i in range(len(df_t_sorted)):
        if i == 0:
            clustering_groups_sublists_t[count].append(df_t_sorted.names.iloc[0])
        elif df_t_sorted.groups.iloc[i] == df_t_sorted.groups.iloc[i - 1]:
            clustering_groups_sublists_t[count].append(df_t_sorted.names.iloc[i])
        else:
            count = count + 1
            clustering_groups_sublists_t[count].append(df_t_sorted.names.iloc[i])

    # Greedily pair the largest remaining candidate cluster with the
    # ground-truth cluster that shares the most entries with it.
    while len(clustering_groups_sublists_t) > 0:
        maxList = max_len(clustering_groups_sublists_c)
        # find best match between candidate and ground truth
        n_match = 0
        best_entry = 0
        for j in range(len(clustering_groups_sublists_t)):
            best_match_i = sum(el in clustering_groups_sublists_t[j] for el in maxList)
            if best_match_i > n_match:
                n_match = best_match_i
                best_entry = j
        match += n_match
        # Remove the matched pair so that each cluster is used only once.
        clustering_groups_sublists_t.pop(best_entry)
        ind = clustering_groups_sublists_c.index(maxList)
        clustering_groups_sublists_c.pop(ind)
    return match


path_to_csv = './agglomerative_clustering_automation/insurance.csv'
df_insurance = pd.read_csv(path_to_csv)
methods = ['single', 'ward', 'average']
clusterings_list = []
# The row indices 0..n-1 serve as entry names.
entries_len = range(len(df_insurance.charges))
frames = {'bmi': df_insurance.bmi, 'charges': df_insurance.charges}
df_final = pd.DataFrame(data=frames)

# Cluster the same two features once per linkage method.
for method in methods:
    clustering = AgglomerativeClustering(n_clusters=3, linkage=method).fit(df_final)
    clusterings_list.append(clustering.labels_)

# Compare every pair of methods exactly once.
for i in range(len(methods)):
    for j in range(i):
        n_matches = match_clustering_groups(clusterings_list[i],
                                            entries_len,
                                            clusterings_list[j],
                                            3)
        print(str(methods[i]) + " and " +
              str(methods[j]) + " have " +
              str(n_matches) + " clustering matches")
```
By running the algorithm, we obtained the following match counts:
ward and single have 988 clustering matches
average and single have 1110 clustering matches
average and ward have 1064 clustering matches
If two methods had the same clustering groups, the total matches would amount to 1338 (i.e. the total number of entries).
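
To put the counts above in relation to the 1338 entries, they can be turned into agreement percentages. This is a small arithmetic sketch using only the numbers reported in this post; the variable names are chosen for illustration.

```python
total_entries = 1338

# Match counts as reported above, keyed by the pair of linkage methods.
matches = {
    ("ward", "single"): 988,
    ("average", "single"): 1110,
    ("average", "ward"): 1064,
}

# Share of entries that each pair of methods clusters the same way.
agreement = {pair: 100 * n / total_entries for pair, n in matches.items()}

for (first, second), pct in agreement.items():
    print(f"{first} and {second}: {pct:.1f}% of entries matched")
# ward and single: 73.8% of entries matched
# average and single: 83.0% of entries matched
# average and ward: 79.5% of entries matched
```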
This algorithm can also be used to check any other method, so feel free to try out our algorithm and let us know what you think!