LinkedIn link GitHub link

Agglomerative clustering methods comparison

Comparing clustering methods manually can be tricky and cumbersome. We propose an automated method that would output the number of correctly clustered entries.

Cluster analysis is an unsupervised machine learning task, which involves automatically discovering groupings of data. Placing any sort of entry in the correct cluster can be done using different methods and functions. For example, the AgglomerativeClustering function from the sklearn library in Python provides 4 built-in methods:

  • ward minimizes the variance of the clusters being merged.
  • average uses the average of the distances of each observation of the two sets.
  • complete or maximum linkage uses the maximum distances between all observations of the two sets.
  • single uses the minimum of the distances between all observations of the two sets.

We are considering single, average, and ward. These methods may have varying outputs, which as of right now can only be spotted by performing a manual comparison by the user.

We created an automated function that compares the three methods against each other and outputs how many entries are clustered the same. The inputs required by the algorithm are the clustering labels for each method and the respective entry names. The combination of the two is the ground truth, i.e. the reference clustering against which the candidate clustering set is compared. The algorithm is automated for any csv file and proceeds as following for each pair of methods:

  • merge clustering groups labels with list of respective entry labels in a data frame
  • sort data frame with respect to clustering labels
  • create a list with sublists, each sublist is a different clustering group (candidate list)
  • do the same for the ground truth (reference method of current pass)
  • compare ground truth with candidate list
  • extract list (cluster) with highest number of elements from ground truth (reference list)
  • check what candidate list the reference list matches the most, i.e. the lists that have the highest match in number of elements
  • remove both sublists from lists
  • iterate over all of the sublists and count the number of matches

Example

We used the insurance.csv file from Kaggle as a sample data frame. The parameters that we used for the clustering are bmi and charges. Here are the first few cells of the file:

age sex bmi children smoker region charges
19 female 27.9 0 yes southwest 16884.924
18 male 33.77 1 no southeast 1725.5523
28 male 33 3 no southeast 4449.462
33 male 22.705 0 no northwest 21984.47061
32 male 28.88 0 no northwest 3866.8552

And here is the code:

import pandas as pd
from sklearn.cluster import AgglomerativeClustering
def max_len(lst):
    maxList = max(lst, key=len)
    return maxList
def match_clustering_groups(ground_truth, files_names, clustering_labels, n_clusters):
    match = 0
    ls_1 = files_names
    ls_2 = clustering_labels.tolist()
    ls_3 = ground_truth.tolist()
    d_candidate = {'names': ls_1, 'groups': ls_2}
    d_truth = {'names': ls_1, 'groups': ls_3}
    df_candidate = pd.DataFrame(d_candidate)
    df_truth = pd.DataFrame(d_truth)
    df_c_sorted = df_candidate.sort_values(by=['groups'])
    df_t_sorted = df_truth.sort_values(by=['groups'])

    count = 0
    clustering_groups_sublists_c = [[] for i in range(0, n_clusters)]
    for i in range(len(df_c_sorted)):
        if i == 0:
            clustering_groups_sublists_c[count].append(df_c_sorted.names.iloc[0])
        elif df_c_sorted.groups.iloc[i] == df_c_sorted.groups.iloc[i - 1]:
            clustering_groups_sublists_c[count].append(df_c_sorted.names.iloc[i])
        else:
            count = count + 1
            clustering_groups_sublists_c[count].append(df_c_sorted.names.iloc[i])

    count = 0
    clustering_groups_sublists_t = [[] for i in range(0, n_clusters)]
    for i in range(len(df_t_sorted)):
        if i == 0:
            clustering_groups_sublists_t[count].append(df_t_sorted.names.iloc[0])
        elif df_t_sorted.groups.iloc[i] == df_t_sorted.groups.iloc[i - 1]:
            clustering_groups_sublists_t[count].append(df_t_sorted.names.iloc[i])
        else:
            count = count + 1
            clustering_groups_sublists_t[count].append(df_t_sorted.names.iloc[i])

    while len(clustering_groups_sublists_t) > 0:
        maxList = max_len(clustering_groups_sublists_c)
        # find best match between candidate and ground truth
        n_match = 0
        best_entry = 0
        for j in range(len(clustering_groups_sublists_t)):
            best_match_i = sum(el in clustering_groups_sublists_t[j] for el in maxList)
            if best_match_i > n_match:
                n_match = best_match_i
                best_entry = j
        match += n_match
        clustering_groups_sublists_t.pop(best_entry)
        ind = clustering_groups_sublists_c.index(maxList)
        clustering_groups_sublists_c.pop(ind)

    return match
path_to_csv = './agglomerative_clustering_automation/insurance.csv'
df_insurance = pd.read_csv(path_to_csv)
methods = ['single', 'ward', 'average']
clusterings_list = []
entries_len = range(len(df_insurance.charges))
frames = {'bmi': df_insurance.bmi, 'charges': df_insurance.charges}
df_final = pd.DataFrame(data=frames)

for i in range(len(methods)):
    clustering = AgglomerativeClustering(n_clusters=3, linkage=methods[i]).fit(df_final)
    clusterings_list.append(clustering.labels_)

for i in range(len(methods)):
    for j in range(i + 1):
        if i != j:
            n_matches = match_clustering_groups(clusterings_list[i],
                                                entries_len,
                                                clusterings_list[j],
                                                3)
            print(str(methods[i]) + " and " +
                  str(methods[j]) + " have " +
                  str(n_matches) + " clustering matches")

By running the algorithm, we observed the following differences:

  • ward and single have 988 clustering matches

  • average and single have 1110 clustering matches

  • average and ward have 1064 clustering matches

If two methods had the same clustering groups, the total matches would amount to 1338 (i.e. the total number of entries).

This algorithm can also be used to check any other method, so feel free to try out our algorithm and let us know what you think!