Graph Algorithms - Strongly Connected Components in Spark 2

Ever since I generated doc2vec embeddings (word embeddings at the document level) for our documents, we have been finding interesting things by computing with and comparing these vectors. For example, we find similar documents using cosine similarity and other similarity measures. These vector representations of our documents give us a lot of flexibility. We also tried loading the vectors into an ANNOY index to quickly find the near neighbors of a document.
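
Concretely, the cosine similarity computation and the ANNOY lookup look roughly like this. This is a minimal sketch: the dimensionality and the random vectors are placeholders for the real doc2vec output.

```python
import numpy as np
from annoy import AnnoyIndex

def cosine_similarity(a, b):
    """Cosine similarity between two document vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for the doc2vec vectors (the real ones are learned).
dim = 300
rng = np.random.default_rng(0)
doc_vectors = {i: rng.random(dim) for i in range(100)}

print(cosine_similarity(doc_vectors[0], doc_vectors[1]))

# ANNOY index over the same vectors; "angular" distance is a
# monotone transform of cosine similarity.
index = AnnoyIndex(dim, "angular")
for doc_id, vec in doc_vectors.items():
    index.add_item(doc_id, vec.tolist())
index.build(10)  # 10 trees; more trees trade build time for accuracy

# The ten approximate nearest neighbors of document 0.
print(index.get_nns_by_item(0, 10))
```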

Now I am exploring these same vectors to find documents that are written repeatedly and discuss the same topic. I figured that such documents would be highly similar to one another; documents that address the same topic will probably share a common vocabulary. What if I want to find the most influential document among these related documents? To do that, we need to define the connections between the documents. When we say "connections," I can't help but think of a graph (or network). Another approach is to use a clustering algorithm to group related documents based on the doc2vec vectors and other features. But some of the unsupervised machine learning algorithms for clustering need an arbitrary number of clusters (a good guess of how many clusters you might want to find) as a parameter. That makes unsupervised clustering feel like trial and error until somebody figures out the right number of clusters that groups the documents as expected; a sketch of that loop follows below.
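
For illustration, here is what that trial-and-error loop looks like with Spark's KMeans. The toy vectors and the range of k values are made up; in practice the features column would hold the doc2vec vectors.

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("kmeans-sketch").getOrCreate()

# Tiny 2-D stand-ins for the doc2vec vectors, just to show the loop.
docs_df = spark.createDataFrame(
    [(Vectors.dense(v),) for v in
     [[0.1, 0.9], [0.2, 0.8], [0.8, 0.2], [0.9, 0.1], [0.5, 0.5], [0.4, 0.6]]],
    ["features"])

# Nothing in the data says which k is "right"; you just try values.
for k in [2, 3, 4]:
    model = KMeans(k=k, seed=42).fit(docs_df)
    # Within-cluster sum of squared distances (Spark 2.x API); it
    # always shrinks as k grows, so it only loosely guides the choice.
    print(k, model.computeCost(docs_df))
```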

GraphFrames


So enter GraphFrames (GraphFrames - Databricks), a Spark library that supports general graph processing. The library was introduced in March 2016; more info here (GraphFrames). GraphFrames provides several graph algorithms, and there were a few I wanted to try against the document vectors we have. One of them is PageRank (PageRank). You mean I can use Google's search algorithm just like that? Well, the algorithms in the library are only tools; the real clincher is how we define the graph that will help us find what we are looking for. In the context of looking for documents that talk about the same topic, PageRank would just turn up a lot of similar documents at the top, but it would not define a grouping of documents that discuss the same topic. So I tried Strongly Connected Components (Strongly Connected Components), an algorithm that finds the subgraphs in which every vertex can reach every other vertex, i.e., the connections go in both directions. There is a quick discussion on Wikipedia. SCC is exactly what I need. So we just need to define the vertices and the edges of the graph we are going to use; a sketch follows below.
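
To make the "define the vertices and the edges" step concrete, here is a minimal GraphFrames sketch. The document IDs and similarity values are invented, and the graphframes package is assumed to be on the Spark classpath (e.g., added via --packages).

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("scc-sketch").getOrCreate()

# Vertices: one row per document; GraphFrames expects an "id" column.
vertices = spark.createDataFrame(
    [("doc1",), ("doc2",), ("doc3",), ("doc4",)], ["id"])

# Edges: GraphFrames expects "src" and "dst" columns; the extra
# column carries the cosine similarity that defines the connection.
edges = spark.createDataFrame(
    [("doc1", "doc2", 0.91),
     ("doc2", "doc1", 0.91),
     ("doc2", "doc3", 0.87),
     ("doc4", "doc1", 0.85)],
    ["src", "dst", "similarity"])

g = GraphFrame(vertices, edges)

# Each vertex gets a "component" label; vertices share a label only
# when they can all reach each other (connections in both directions).
components = g.stronglyConnectedComponents(maxIter=10)
components.show()
```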


Since we already have the vectors for each document, I ran a Spark job that calculates the cosine similarity between two documents as long as both appear in the result of an Elasticsearch More Like This query. This approach saved me from doing a Cartesian join across about 30 million documents and eliminated the need to calculate similarity between unrelated documents. With each document as a vertex and each cosine-similarity pair as an edge, I ran the Strongly Connected Components algorithm. It was quick: I thought it would take a day for 2 million vertices and 140 million edges, but to my surprise it ran in less than 20 minutes. I initially thought I might have missed a step or done something wrong, but after verifying, I had indeed found groups of documents that are strongly connected. Not only that, I discovered that the library provides the degree of each vertex, so within a group I can find the document that has the most incoming edges; the sketch below shows that last step.
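
Carrying over the names from the sketch above, one way to pick out the most influential document per group is to join the component labels with the in-degrees the library exposes. The window-function approach here is my own illustration, not something prescribed by GraphFrames.

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# g.inDegrees yields one row per vertex: (id, inDegree).
ranked = (components
          .join(g.inDegrees, on="id", how="left")
          .na.fill({"inDegree": 0}))  # vertices with no incoming edges

# Within each strongly connected component, keep the document with
# the most incoming edges, a rough "most influential" pick.
w = Window.partitionBy("component").orderBy(F.desc("inDegree"))
most_influential = (ranked
                    .withColumn("rn", F.row_number().over(w))
                    .where(F.col("rn") == 1)
                    .drop("rn"))
most_influential.show()
```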


Conclusion


If you find that unsupervised clustering algorithms are too loose an approach for finding clusters, I recommend using graphs, as long as you define a graph that fits your problem context. The Strongly Connected Components algorithm will then find the connections easily.
