Comparing clustering methods manually can be tricky and cumbersome. We propose an automated method that outputs the number of entries that two clustering methods group in the same way.
Cluster analysis is an unsupervised machine learning task that involves automatically discovering groupings in data. Entries can be assigned to clusters using different methods and functions. For example, the AgglomerativeClustering class from the sklearn library in Python provides four built-in linkage methods:
 ward minimizes the variance of the clusters being merged.
 average uses the average of the distances of each observation of the two sets.
 complete or maximum linkage uses the maximum distances between all observations of the two sets.
 single uses the minimum of the distances between all observations of the two sets.
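To make the three distance-based criteria concrete, here is a minimal sketch in plain Python that computes the single, complete, and average linkage distance between two small sets of one-dimensional points. The function name linkage_distances and the sample values are our own illustration and not part of sklearn; ward is left out because it is defined through variance rather than pairwise distances.

```python
from itertools import product


def linkage_distances(set_a, set_b):
    """Return single, complete and average linkage distances between two sets of 1-D points."""
    # All pairwise distances between an observation of set_a and one of set_b
    pairwise = [abs(a - b) for a, b in product(set_a, set_b)]
    return {
        "single": min(pairwise),                    # closest pair
        "complete": max(pairwise),                  # farthest pair
        "average": sum(pairwise) / len(pairwise),   # mean over all pairs
    }


# Hypothetical toy clusters
cluster_a = [1.0, 2.0]
cluster_b = [5.0, 9.0]
distances = linkage_distances(cluster_a, cluster_b)
print(distances)  # {'single': 3.0, 'complete': 8.0, 'average': 5.5}
```

Since the criteria measure the distance between the same two clusters differently, they can merge clusters in a different order, which is why the final groupings may differ.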
We consider single, average, and ward. These methods can produce different clusterings, and at present the differences can only be spotted by comparing the outputs manually.
We created an automated function that compares the three methods against each other and outputs how many entries are clustered the same. The inputs required by the algorithm are the clustering labels for each method and the respective entry names. For each pair of methods, the labels of one method serve as the ground truth, i.e. the reference clustering against which the candidate clustering is compared. The algorithm works for any csv file and proceeds as follows for each pair of methods:
 merge the clustering labels with the list of respective entry names in a data frame
 sort the data frame by clustering label
 create a list of sublists, where each sublist is a different clustering group (candidate list)
 do the same for the ground truth (reference method of the current pass)
 compare the ground truth with the candidate list:
 extract the largest remaining sublist (cluster) from the candidate list
 find the ground-truth sublist that matches it the most, i.e. the one that shares the highest number of elements with it
 remove both sublists from their lists
 repeat until no sublists remain, adding up the number of matches
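The steps above can be sketched compactly without pandas. The version below is only an illustration under our own assumptions: the helper names group_by_label and count_matches are hypothetical, clusters are stored as sets, and the largest remaining cluster is taken from the candidate side, as in the full listing in the example section.

```python
from collections import defaultdict


def group_by_label(names, labels):
    """Collect entry names into one set per cluster label."""
    groups = defaultdict(set)
    for name, label in zip(names, labels):
        groups[label].add(name)
    return list(groups.values())


def count_matches(names, reference_labels, candidate_labels):
    """Greedily pair the clusters of two labelings and count the shared entries."""
    reference = group_by_label(names, reference_labels)
    candidate = group_by_label(names, candidate_labels)
    matches = 0
    while reference and candidate:
        # Largest remaining candidate cluster
        largest = max(candidate, key=len)
        # Reference cluster sharing the most entries with it
        best = max(reference, key=lambda group: len(group & largest))
        matches += len(best & largest)
        # Each cluster is paired at most once
        candidate.remove(largest)
        reference.remove(best)
    return matches


# Hypothetical toy data: six entries clustered by two methods
names = ["a", "b", "c", "d", "e", "f"]
print(count_matches(names, [0, 0, 0, 1, 1, 2], [2, 2, 1, 1, 1, 0]))  # 5
print(count_matches(names, [0, 0, 0, 1, 1, 2], [5, 5, 5, 7, 7, 9]))  # 6
```

The second call shows that only the grouping matters, not the label values: the same partition with renamed labels matches on all six entries.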
Example
We used the insurance.csv file from Kaggle as a sample data set. The features that we used for the clustering are bmi and charges. Here are the first few rows of the file:
age  sex     bmi     children  smoker  region     charges
19   female  27.9    0         yes     southwest  16884.924
18   male    33.77   1         no      southeast  1725.5523
28   male    33      3         no      southeast  4449.462
33   male    22.705  0         no      northwest  21984.47061
32   male    28.88   0         no      northwest  3866.8552
And here is the code:
import pandas as pd
from sklearn.cluster import AgglomerativeClustering


def max_len(lst):
    # Return the longest sublist, i.e. the largest cluster
    maxList = max(lst, key=len)
    return maxList


def match_clustering_groups(ground_truth, files_names, clustering_labels, n_clusters):
    match = 0
    ls_1 = files_names
    ls_2 = clustering_labels.tolist()
    ls_3 = ground_truth.tolist()
    # Pair every entry name with its cluster label, once per clustering
    d_candidate = {'names': ls_1, 'groups': ls_2}
    d_truth = {'names': ls_1, 'groups': ls_3}
    df_candidate = pd.DataFrame(d_candidate)
    df_truth = pd.DataFrame(d_truth)
    # Sort by label so that entries of the same cluster are adjacent
    df_c_sorted = df_candidate.sort_values(by=['groups'])
    df_t_sorted = df_truth.sort_values(by=['groups'])
    # Split the sorted candidate entries into one sublist per cluster
    count = 0
    clustering_groups_sublists_c = [[] for i in range(0, n_clusters)]
    for i in range(len(df_c_sorted)):
        if i == 0:
            clustering_groups_sublists_c[count].append(df_c_sorted.names.iloc[0])
        elif df_c_sorted.groups.iloc[i] == df_c_sorted.groups.iloc[i - 1]:
            clustering_groups_sublists_c[count].append(df_c_sorted.names.iloc[i])
        else:
            # The label changed, so a new cluster starts here
            count = count + 1
            clustering_groups_sublists_c[count].append(df_c_sorted.names.iloc[i])
    # Do the same for the ground truth
    count = 0
    clustering_groups_sublists_t = [[] for i in range(0, n_clusters)]
    for i in range(len(df_t_sorted)):
        if i == 0:
            clustering_groups_sublists_t[count].append(df_t_sorted.names.iloc[0])
        elif df_t_sorted.groups.iloc[i] == df_t_sorted.groups.iloc[i - 1]:
            clustering_groups_sublists_t[count].append(df_t_sorted.names.iloc[i])
        else:
            count = count + 1
            clustering_groups_sublists_t[count].append(df_t_sorted.names.iloc[i])
    while len(clustering_groups_sublists_t) > 0:
        # Largest remaining candidate cluster
        maxList = max_len(clustering_groups_sublists_c)
        # find best match between candidate and ground truth
        n_match = 0
        best_entry = 0
        for j in range(len(clustering_groups_sublists_t)):
            best_match_i = sum(el in clustering_groups_sublists_t[j] for el in maxList)
            if best_match_i > n_match:
                n_match = best_match_i
                best_entry = j
        match += n_match
        # Remove the matched pair so that each cluster is used only once
        clustering_groups_sublists_t.pop(best_entry)
        ind = clustering_groups_sublists_c.index(maxList)
        clustering_groups_sublists_c.pop(ind)
    return match


path_to_csv = './agglomerative_clustering_automation/insurance.csv'
df_insurance = pd.read_csv(path_to_csv)
methods = ['single', 'ward', 'average']
clusterings_list = []
# The row indices serve as entry names
entries_len = range(len(df_insurance.charges))
# Cluster on bmi and charges only
frames = {'bmi': df_insurance.bmi, 'charges': df_insurance.charges}
df_final = pd.DataFrame(data=frames)
for i in range(len(methods)):
    clustering = AgglomerativeClustering(n_clusters=3, linkage=methods[i]).fit(df_final)
    clusterings_list.append(clustering.labels_)
# Compare every pair of methods once
for i in range(len(methods)):
    for j in range(i + 1):
        if i != j:
            n_matches = match_clustering_groups(clusterings_list[i],
                                                entries_len,
                                                clusterings_list[j],
                                                3)
            print(str(methods[i]) + " and " +
                  str(methods[j]) + " have " +
                  str(n_matches) + " clustering matches")
By running the algorithm, we obtained the following results:

ward and single have 988 clustering matches

average and single have 1110 clustering matches

average and ward have 1064 clustering matches
If two methods had the same clustering groups, the total matches would amount to 1338 (i.e. the total number of entries).
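To put the counts on a common scale, they can be divided by the total number of entries. The snippet below is a small sketch of that arithmetic using only the figures reported above; the variable names are our own.

```python
# Match counts reported above, out of 1338 entries in total
total_entries = 1338
match_counts = {
    ("ward", "single"): 988,
    ("average", "single"): 1110,
    ("average", "ward"): 1064,
}

# Share of entries on which each pair of methods agrees
agreement = {pair: count / total_entries for pair, count in match_counts.items()}

for (first, second), share in agreement.items():
    print(f"{first} and {second}: {share:.1%} of entries matched")
# ward and single: 73.8% of entries matched
# average and single: 83.0% of entries matched
# average and ward: 79.5% of entries matched
```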
This algorithm can also be used to check any other method, so feel free to try out our algorithm and let us know what you think!