Mirai Solutions :: Agglomerative clustering methods comparison

Back to overview

Comparing clustering methods manually can be tricky and cumbersome. We propose an automated method that would output the number of correctly clustered entries.

Cluster analysis is an unsupervised machine learning task, which involves automatically discovering groupings of data. Placing any sort of entry in the correct cluster can be done using different methods and functions. For example, the AgglomerativeClustering function from the sklearn library in Python provides 4 built-in methods:

ward minimizes the variance of the clusters being merged.
average uses the average of the distances of each observation of the two sets.
complete or maximum linkage uses the maximum distances between all observations of the two sets.
single uses the minimum of the distances between all observations of the two sets.

We are considering single, average, and ward. These methods may have varying outputs, which as of right now can only be spotted by performing a manual comparison by the user.

We created an automated function that compares the three methods against each other and outputs how many entries are clustered the same. The inputs required by the algorithm are the clustering labels for each method and the respective entry names. The combination of the two is the ground truth, i.e. the reference clustering against which the candidate clustering set is compared. The algorithm is automated for any csv file and proceeds as following for each pair of methods:

merge clustering groups labels with list of respective entry labels in a data frame
sort data frame with respect to clustering labels
create a list with sublists, each sublist is a different clustering group (candidate list)
do the same for the ground truth (reference method of current pass)
compare ground truth with candidate list
extract list (cluster) with highest number of elements from ground truth (reference list)
check what candidate list the reference list matches the most, i.e. the lists that have the highest match in number of elements
remove both sublists from lists
iterate over all of the sublists and count the number of matches

Example

We used the insurance.csv file from Kaggle as a sample data frame. The parameters that we used for the clustering are bmi and charges. Here are the first few cells of the file:

age	sex	bmi	children	smoker	region	charges
19	female	27.9	0	yes	southwest	16884.924
18	male	33.77	1	no	southeast	1725.5523
28	male	33	3	no	southeast	4449.462
33	male	22.705	0	no	northwest	21984.47061
32	male	28.88	0	no	northwest	3866.8552

And here is the code:

import pandas as pd
from sklearn.cluster import AgglomerativeClustering

def max_len(lst):
    maxList = max(lst, key=len)
    return maxList

def match_clustering_groups(ground_truth, files_names, clustering_labels, n_clusters):
    match = 0
    ls_1 = files_names
    ls_2 = clustering_labels.tolist()
    ls_3 = ground_truth.tolist()
    d_candidate = {'names': ls_1, 'groups': ls_2}
    d_truth = {'names': ls_1, 'groups': ls_3}
    df_candidate = pd.DataFrame(d_candidate)
    df_truth = pd.DataFrame(d_truth)
    df_c_sorted = df_candidate.sort_values(by=['groups'])
    df_t_sorted = df_truth.sort_values(by=['groups'])

    count = 0
    clustering_groups_sublists_c = [[] for i in range(0, n_clusters)]
    for i in range(len(df_c_sorted)):
        if i == 0:
            clustering_groups_sublists_c[count].append(df_c_sorted.names.iloc[0])
        elif df_c_sorted.groups.iloc[i] == df_c_sorted.groups.iloc[i - 1]:
            clustering_groups_sublists_c[count].append(df_c_sorted.names.iloc[i])
        else:
            count = count + 1
            clustering_groups_sublists_c[count].append(df_c_sorted.names.iloc[i])

    count = 0
    clustering_groups_sublists_t = [[] for i in range(0, n_clusters)]
    for i in range(len(df_t_sorted)):
        if i == 0:
            clustering_groups_sublists_t[count].append(df_t_sorted.names.iloc[0])
        elif df_t_sorted.groups.iloc[i] == df_t_sorted.groups.iloc[i - 1]:
            clustering_groups_sublists_t[count].append(df_t_sorted.names.iloc[i])
        else:
            count = count + 1
            clustering_groups_sublists_t[count].append(df_t_sorted.names.iloc[i])

    while len(clustering_groups_sublists_t) > 0:
        maxList = max_len(clustering_groups_sublists_c)
        # find best match between candidate and ground truth
        n_match = 0
        best_entry = 0
        for j in range(len(clustering_groups_sublists_t)):
            best_match_i = sum(el in clustering_groups_sublists_t[j] for el in maxList)
            if best_match_i > n_match:
                n_match = best_match_i
                best_entry = j
        match += n_match
        clustering_groups_sublists_t.pop(best_entry)
        ind = clustering_groups_sublists_c.index(maxList)
        clustering_groups_sublists_c.pop(ind)

    return match

path_to_csv = './agglomerative_clustering_automation/insurance.csv'
df_insurance = pd.read_csv(path_to_csv)
methods = ['single', 'ward', 'average']
clusterings_list = []
entries_len = range(len(df_insurance.charges))
frames = {'bmi': df_insurance.bmi, 'charges': df_insurance.charges}
df_final = pd.DataFrame(data=frames)

for i in range(len(methods)):
    clustering = AgglomerativeClustering(n_clusters=3, linkage=methods[i]).fit(df_final)
    clusterings_list.append(clustering.labels_)

for i in range(len(methods)):
    for j in range(i + 1):
        if i != j:
            n_matches = match_clustering_groups(clusterings_list[i],
                                                entries_len,
                                                clusterings_list[j],
                                                3)
            print(str(methods[i]) + " and " +
                  str(methods[j]) + " have " +
                  str(n_matches) + " clustering matches")

By running the algorithm, we observed the following differences:

ward and single have 988 clustering matches
average and single have 1110 clustering matches
average and ward have 1064 clustering matches

If two methods had the same clustering groups, the total matches would amount to 1338 (i.e. the total number of entries).

This algorithm can also be used to check any other method, so feel free to try out our algorithm and let us know what you think!