A Tale of Three Algorithms — Different Views of the Same World

Torsten Volk
4 min read · Jan 27, 2022


Using three different algorithms to cluster the same web search results shows how machine learning can impact our perception of the real world. For this little experiment, I used the excellent Carrot Search 2 platform and entered “Infrastructure as Code” into the search box. Carrot Search 2 retrieves about 100 search results and treats each result as an individual input document for the three clustering algorithms. You can then switch between the algorithms to reproduce the cluster charts you find in this article. In a nutshell, all three algorithms have access to exactly the same content for clustering. Now let’s look at the results.
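For illustration only, here is a hypothetical sketch of how those roughly 100 search results could be represented before clustering. The field names and sample text are assumptions, not the actual Carrot Search 2 data model.

```python
# Hypothetical illustration: the ~100 search results, each treated as one
# input document. Field names are assumptions, not Carrot Search 2's format.
documents = [
    {
        "title": "Infrastructure as Code: What It Is and Why It Matters",
        "snippet": "Infrastructure as Code (IaC) manages infrastructure through "
                   "machine-readable definition files rather than manual setup...",
    },
    # ... roughly 100 entries, one per search result
]

# Every algorithm below sees exactly the same list; only the way each one
# groups and labels these documents differs.
corpus = [f"{doc['title']} {doc['snippet']}" for doc in documents]
```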

K-Means — Simple, But Useful, Word Correlations

K-Means groups together terms with the highest correlations through a simple iterative spatial analysis and then uses a list of these terms as the label for the group of underlying search results. “Common, Microsoft, Automatically” in the middle of this chart are simply the three most frequent keywords that the seven clustered results have in common. Unfortunately, a cluster label built from the most frequent shared keywords is often not descriptive enough for us to understand the content of the cluster. However, the chart lets you see the titles of each clustered result in the background, which is a great feature of the Carrot Search platform.

Algorithm used: K-Means (Carrot Search 2)

Upside: Simple and transparent approach

Downside: Cluster labels are not very descriptive
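To make the labeling mechanics concrete, here is a minimal sketch of the same idea in Python with scikit-learn. It is an illustration of the approach, not the Carrot Search 2 implementation; the cluster count and the number of label terms are arbitrary choices.

```python
# Minimal sketch (not Carrot Search 2's code): embed each result in a term-vector
# space, run K-Means, and label each cluster with its top-weighted terms.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)          # corpus: one string per search result

kmeans = KMeans(n_clusters=8, random_state=0, n_init=10).fit(X)
terms = vectorizer.get_feature_names_out()

for cluster_id in range(kmeans.n_clusters):
    # The cluster "label" is just the centroid's top-weighted terms,
    # e.g. "common, microsoft, automatically" -- descriptive only by accident.
    centroid = kmeans.cluster_centers_[cluster_id]
    top_terms = [terms[i] for i in centroid.argsort()[::-1][:3]]
    print(cluster_id, ", ".join(top_terms))
```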

STC — Quick Clustering Sometimes Lacks Balance

Suffix Tree Clustering (STC) decomposes each text into a tree hierarchy of strings and multiple levels of substrings. STC compares the tree branches between documents, in our case Google search results, and clusters together the results with the highest number of matches. Note that this process happens without regard for the actual words contained in each text, which makes it difficult for humans to intuitively understand the clustering results. However, comparing these tree branches (patterns) between texts brings the semantic context that the K-Means algorithm lacks, and it makes it possible to label each cluster based on the highest common hierarchy level of the document trees.

Algorithm used: STC (Carrot Search 2)

On the downside, STC often creates one large “catch-all” cluster that collects all the documents satisfying its pattern-matching requirements. This shows that simply correlating substrings, without considering the actual words and their relationships, is often unable to detect relevant similarities in a large part of the sample (our search results).

Source: Ilić, Spalević, Veinović, 2014: https://erk.fe.uni-lj.si/2014/ilic(suffix_tree)p.pdf

Upside: Adds context to clustering; shorter and often more meaningful labels than K-Means

Downside: Often large clusters
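The brute-force sketch below illustrates the phrase-matching intuition behind STC; a real implementation uses an actual suffix tree for efficiency, and the phrase lengths and document threshold here are arbitrary assumptions.

```python
# Heavily simplified sketch of the STC idea: find phrases (word n-grams) shared
# across documents and group documents by their strongest shared phrase.
from collections import defaultdict

def phrases(text, max_len=4):
    """All word n-grams of length 2..max_len in a text."""
    words = text.lower().split()
    return {
        " ".join(words[i:i + n])
        for n in range(2, max_len + 1)
        for i in range(len(words) - n + 1)
    }

phrase_to_docs = defaultdict(set)
for doc_id, text in enumerate(corpus):        # corpus: one string per search result
    for p in phrases(text):
        phrase_to_docs[p].add(doc_id)

# Candidate clusters: phrases shared by several documents; longer and more
# frequent phrases make better labels. Note the "catch-all" risk -- the search
# terms themselves ("infrastructure as code") match almost every result.
clusters = sorted(
    ((p, docs) for p, docs in phrase_to_docs.items() if len(docs) >= 3),
    key=lambda item: (len(item[1]), len(item[0])),
    reverse=True,
)
for phrase, docs in clusters[:5]:
    print(f'"{phrase}" -> {len(docs)} documents')
```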

Lingo — At Times, It Feels Like “True” AI

Term-document matrix (source: pending)

The Lingo algorithm addresses the downsides of both STC and K-Means and is able to create a highly diverse set of clusters with often sufficiently descriptive labels. Lingo is based on a term-document matrix, which allows the algorithm to identify patterns that stretch across an entire document. Clusters are then created based on how many of these patterns match between documents and how complex the matching patterns are.

Algorithm: Lingo (Carrot Search 2)

Upside: Often descriptive labels and granular clusters

Downside: Only recommended for small text collections, such as Google searches.
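As a rough illustration of the intuition, not the actual Lingo implementation, the sketch below builds a term-document matrix, extracts concept vectors with SVD, and labels each concept with its best-matching n-gram. The component count and the membership threshold are arbitrary assumptions.

```python
# Rough sketch of the Lingo intuition: term-document matrix -> SVD "concepts"
# that span whole documents -> label each concept with its strongest phrase.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import numpy as np

vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 3))
X = vectorizer.fit_transform(corpus)              # term-document matrix (docs x terms)

svd = TruncatedSVD(n_components=10, random_state=0)
doc_concepts = svd.fit_transform(X)               # document-to-concept strengths

terms = vectorizer.get_feature_names_out()
for concept_id, concept in enumerate(svd.components_):
    # Label: the n-gram whose weights best line up with this concept.
    label = terms[np.argmax(concept)]
    members = np.where(doc_concepts[:, concept_id] > 0.1)[0]   # threshold is arbitrary
    print(f"{label}: {len(members)} documents")
```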

Conclusion: 3 Different Views on the Same Topic

Applying these three algorithms to cluster the same 100 web search results demonstrates how algorithms make us see the world through their own lens. K-Means is the most transparent algorithm of the three, but delivered poor labelling and a number of irrelevant clusters. STC delivered nicely descriptive labels for each cluster, but clumped about 20% of the results into one large and meaningless cluster that was named exactly after our search terms (“Infrastructure as Code”). Then came Lingo. Lingo was able to find even smaller clusters that grouped results in a manner humans typically find intuitive. Lingo’s labelling is definitely better for our use case than STC’s, but still far from perfect.

In a nutshell, all three algorithms have their use cases, but it is up to the user to investigate and understand how the results were created.


Written by Torsten Volk

Industry analyst for application development and modernization at the Enterprise Strategy Group (by InformaTechTarget).
