Clustering is the general study of grouping similar objects together. This definition is purposely very abstract, because what the objects are and how you define the "distance" between two objects both make a huge difference in the output, but the general algorithms remain similar in all cases. There are numerous clustering algorithms with different properties and different performance characteristics, but in this article I'll talk about single link clustering, or SLINK for short. Single link is one of the simplest clustering algorithms: it starts out with N clusters consisting of one object each, and at each step merges the two "closest" clusters. Single link clustering is easy to implement and I often use it to gain insight into hard-to-grasp data sets.

Clustering Geospatial Coordinates

An easy way to visualize how clustering works is by clustering points on a map. The distance metric is pretty obvious and the correct answer is easy to see at a glance. For example, below is a map of the approximate coordinates of 8 MongoDB offices: 4 in North America, 2 in Europe, 1 in the Middle East, and 1 in Australia. Intuitively, if you were to break these locations into 4 clusters based on distance, you'd cluster them into Australia, Europe, Middle East, and Australia. Click the below map for an interactive view.

First, let's set up this data in Node.js using some Turf.js helpers. First, install Turf. Over the last year Turf has been going overboard on micro-modules, npm scoped packages, and ES6 imports, even dropping support for require(), but the new cumbersome Turf workflow is not quite bad enough to consider forking, so please bear with me.

npm install @turf/helpers @turf/distance @turf/centroid
const { point } = require('@turf/helpers');
const centroid = require('@turf/centroid')['default'];
const distance = require('@turf/distance')['default'];

const cities = [
  'New York',
  'Palo Alto',
  'Austin',
  'Los Angeles',
  'Dublin',
  'London',
  'Tel Aviv',
  'Sydney'
];
const points = [
  [-73.9895767, 40.7572784], // New York
  [-122.1633558, 37.4419716], // Palo Alto
  [-97.8063463, 30.2453212], // Austin
  [-118.4674619, 34.1539733], // Los Angeles
  [-6.2325547, 53.332175], // Dublin
  [-0.1090697, 51.5065462], // London
  [34.7915409, 32.1039053], // Tel Aviv
  [151.2022304, -33.878484] // Sydney
].map((p, index) => {
  const pt = point(p);
  pt.properties.name = cities[index];
  return pt;
});

Now, the goal is to cluster these points using single link clustering. The general idea of the algorithm is this:

  • Start out with 8 clusters, each with one location
  • Pick the two closest clusters. In this case, "closest" is defined by minimum distance between the center of all points in cluster 1 and the center of all points in cluster 2.
  • Repeat until there are 4 clusters left.

Below is the actual implementation of the clustering algorithm:

// Create 8 clusters, each with one location. Each cluster has a
// center so we can compute the "distance" between two clusters
// by seeing how far apart their centers are.
const clusters = points.map(p => ({ locs: [p], center: p }));

while (clusters.length > 4) {
  // Find two closest clusters
  let minClusterDistance = Number.POSITIVE_INFINITY;
  let minClusterI = -1;
  let minClusterJ = -1;
  for (let i = 0; i < clusters.length; ++i) {
    for (let j = i + 1; j < clusters.length; ++j) {
      const dist = distance(clusters[i].center, clusters[j].center);
      if (dist < minClusterDistance) {
        minClusterDistance = dist;
        minClusterI = i;
        minClusterJ = j;
      }
    }
  }

  // Combine the two closest clusters by adding minClusterJ into minClusterI...
  clusters[minClusterI].locs = clusters[minClusterI].locs.concat(clusters[minClusterJ].locs);
  // `centroid()` takes a GeoJSON FeatureCollection and spits out a point,
  // so need to convert our list of Features into a FeatureCollection
  clusters[minClusterI].center = centroid({
    type: 'FeatureCollection',
    features: clusters[minClusterI].locs
  });
  // and removing minClusterJ
  clusters.splice(minClusterJ, 1);
}

// Print out the result
for (const cluster of clusters) {
  for (const loc of cluster.locs) {
    console.log(loc.properties.name);
  }
  console.log('-----');
}

As expected, the result shows the algorithm grouped all the North America, Europe, Middle East, and Australia locations together.

$ node cluster-geo.js
New York
Palo Alto
Los Angeles
Austin
-----
Dublin
London
-----
Tel Aviv
-----
Sydney
-----
$

Single link clustering has the nice property that you specify ahead of time the number of clusters you want. Depending on your use case, this can be good or bad, because it is prone to creating clusters with objects that are too dissimilar if the desired number of clusters is small. But, if you know the number of clusters you want and don't really care about outliers, single link is excellent.

Clustering User Behavior

The above example isn't that insightful because you can cluster more easily at a glance than by writing out an algorithm. However, clustering's usefulness isn't limited to simple two dimensional cases that you can easily visualize. Recently, I found myself putting together a single link algorithm to cluster users based on the site functionality they use. For example, below is a CSV file access.csv that contains data on users and what endpoints they accessed.

User 1,GET /v2/Driver,2
User 1,GET /v2/Assignment,1
User 1,PUT /v2/Assignment,5
User 1,GET /v2/Region,1
User 1,GET /v2/Tanker,2
User 2,GET /v2/Promotion,7
User 2,GET /v2/Customer,7
User 2,PUT /v2/Customer,1
User 2,GET /v2/Region,5
User 3,GET /v2/Driver,9
User 3,GET /v2/Assignment,11
User 3,PUT /v2/Assignment,120
User 3,GET /v2/Region,11
User 3,GET /v2/Tanker,10
User 4,GET /v2/Promotion, 34
User 4,GET /v2/Customer,22
User 4,GET /v2/Region,8

This data set is small and relatively easy to wrap your head around, but if you had thousands of rows like this, you would have a hard time grouping users manually. For the purposes of this data, the distance metric for clustering will be the minimum between what endpoints are in cluster i but not cluster j, and what endpoints are in cluster j but not cluster i, weighted by the number of times the endpoint was hit. Intuitively, if one user used a subset of the endpoints another user used, they should be considered as part of the same group. This wouldn't work well if one user was a "super user" that accessed all the endpoints, but in this simple example this distance metric is sufficient to get a sensible result. Below is the implementation.

const userDataCSV = fs.readFileSync('./access.csv', 'utf8').trim();

const data = userDataCSV.split('\n').map(line => line.split(',')).map(cols => ({
  name: cols[0],
  url: cols[1],
  number: parseInt(cols[2], 10)
}));

const users = data.reduce((cur, obj) => {
  cur[obj.name] = cur[obj.name] || {};
  cur[obj.name][obj.url] = obj.number;
  return cur;
}, {});

const clusters = Object.keys(users).map(name => ({
  users: [name],
  totals: Object.assign({}, users[name])
}));

while (clusters.length > 2) {
  // Find two closest clusters
  let minClusterDistance = Number.POSITIVE_INFINITY;
  let minClusterI = -1;
  let minClusterJ = -1;
  for (let i = 0; i < clusters.length; ++i) {
    for (let j = i + 1; j < clusters.length; ++j) {
      const distanceIToJ = Object.keys(clusters[i].totals).reduce((cur, url) => {
        if (!clusters[j].totals[url]) {
          return cur + clusters[i].totals[url]
        }
        return cur
      }, 0)
      const distanceJToI = Object.keys(clusters[j].totals).reduce((cur, url) => {
        if (!clusters[i].totals[url]) {
          return cur + clusters[j].totals[url]
        }
        return cur
      }, 0)
      const dist = Math.min(distanceIToJ, distanceJToI);
      if (dist < minClusterDistance) {
        minClusterDistance = dist;
        minClusterI = i;
        minClusterJ = j;
      }
    }
  }

  clusters[minClusterI].users = clusters[minClusterI].users.concat(clusters[minClusterJ].users);
  Object.keys(clusters[minClusterJ].totals).forEach(key => {
    clusters[minClusterI].totals[key] = clusters[minClusterI].totals[key] || 0;
    clusters[minClusterI].totals[key] += clusters[minClusterJ].totals[key];
  });
  // and removing minClusterJ
  clusters.splice(minClusterJ, 1);
}

// Print out the result
for (const cluster of clusters) {
  for (const user of cluster.users) {
    console.log(user);
  }
  console.log('-----');
}

The output clusters users 1 and 3 together, and users 2 and 4 together, as expected.

$ node cluster-users.js
User 1
User 3
-----
User 2
User 4
-----
$

Moving On

Clustering is a very useful tool for analyzing data that you don't understand. Finding groups can help you categorize the data, and categorized data is easier to wrap your head around. The hard part is choosing a distance metric that makes sense: the above distance metric for user data isn't very robust and is prone to being rendered useless by one outlier. However, even if a distance metric provides you with a bad categorization, it can still provide you valuable insight.

Found a typo or error? Open up a pull request! This post is available as markdown on Github
comments powered by Disqus