Posts

Showing posts with the label Spark

Graph Algorithms - Strongly Connected Components in Spark 2

Ever since I generated doc2vec (document embedding) vectors for our documents, we have found interesting things by doing computations and comparisons on these vectors. For example, we find similar documents using cosine similarity and other similarity measures. These representations of the documents give us a lot of flexibility. We tried using the vectors in an ANNOY index to quickly find near neighbors for a document. Now I am exploring these same vectors to find documents that are written over and over again and discuss the same topic. I figured such documents would be closely similar: documents that address the same topic will probably share a common vocabulary. What if I want to find the most influential document among these related documents? To do that, we need to define the connections between these documents, and when we say "connections," I can't help but think of a graph (or network). Another approach is to us
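As a rough sketch of the comparison step described above, here is what cosine similarity between two doc2vec vectors and an ANNOY near-neighbor lookup can look like in Python. The vector size (300), the random placeholder vectors, and the number of trees are illustrative assumptions, not the settings from the actual project.

```python
import numpy as np
from annoy import AnnoyIndex

def cosine_similarity(a, b):
    """Cosine similarity between two document vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vector_size = 300  # assumed doc2vec dimensionality

# Placeholder document vectors; in practice these come from the doc2vec model.
rng = np.random.default_rng(0)
doc_vectors = {doc_id: rng.standard_normal(vector_size) for doc_id in range(100)}

# Build an ANNOY index over the vectors for fast approximate nearest-neighbor search.
index = AnnoyIndex(vector_size, 'angular')  # angular distance ~ cosine similarity
for doc_id, vec in doc_vectors.items():
    index.add_item(doc_id, vec)
index.build(10)  # more trees -> better recall, slower build

# The 10 documents closest to document 0, plus an exact cosine check on the top hit.
nearest = index.get_nns_by_item(0, 10)
print(nearest)
print(cosine_similarity(doc_vectors[0], doc_vectors[nearest[1]]))
```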

Gensim Doc2Vec on Spark - a quest to get the right Vector

Ever since I joined the R&D group we have been doing a lot of cool things, like trying IBM Watson (see the previous blog entry). Now we are doing a lot of natural language processing. We wanted to compare the similarity of two documents. There is an excellent project, Gensim ( doc2vec ), that easily allows you to translate large blocks of text into a fixed-length feature vector for comparison. Here is the link to the original paper from researchers at Google that explains the approach. In essence, they wanted a representation that would overcome the weaknesses of the bag-of-words model. The doc2vec approach proves to be a reliable way to compare the similarity of documents because it takes into account the semantics and the order of words in context. With that, we wanted to use it on a corpus of 26 million documents. Calculating doc2vec for 26 million documents is not a small task, so we needed to process it in Spark. The problem is that there
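For reference, a minimal sketch of the Gensim side of this, written against the Gensim 4.x API (Doc2Vec, TaggedDocument, model.dv). The toy corpus, vector_size, and epochs are placeholders, not the settings used for the 26 million documents.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each document is tokenized and tagged with an integer id.
corpus = [
    "spark makes distributed processing easier",
    "doc2vec turns documents into fixed length vectors",
    "cosine similarity compares two document vectors",
]
tagged = [TaggedDocument(words=doc.split(), tags=[i]) for i, doc in enumerate(corpus)]

# Train a small Doc2Vec model; min_count=1 only because the corpus is tiny.
model = Doc2Vec(vector_size=100, min_count=1, epochs=40)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

# Infer a fixed-length vector for an unseen document and find similar training docs.
new_vec = model.infer_vector("comparing documents with vectors".split())
print(model.dv.most_similar([new_vec], topn=2))
```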

Spark DataFrame - Array[ByteBuffer] - IllegalArgumentException

IllegalArgumentException - ByteBuffer - Spark DataFrame I was processing several million documents (~20 million) from which we need to extract NLP features using NLP4J, OpenNLP, and WordNet. The combination of the three NLP feature sets blows up each record to 11 times its original size. We are using all three because we do not yet know which feature sets will be helpful to us. The original dataset is in Parquet files in HDFS (16 partitions). I thought it would be convenient to just use withColumn and pass a UDF (User Defined Function) over the column that the features are extracted from; withColumn adds the calculated column back to the DataFrame. So I created the Spark job for this (I am on Spark 1.5.2-cdh5.5.2), and things started to get nasty. I am blowing up the ByteBuffer array in the in-memory columnar storage. This is the exception I am getting; the stack trace contains no reference to my code. java.lang.IllegalArgumentException at java.nio.ByteB
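For context, a minimal sketch of the withColumn + UDF pattern described above, written against the current PySpark API (SparkSession; on Spark 1.5.x the entry point would be a SQLContext, but the pattern is the same). The extract_features function, column names, and HDFS paths are placeholders for the real NLP4J/OpenNLP/WordNet extraction.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("nlp-features").getOrCreate()

def extract_features(text):
    """Placeholder for the real NLP feature extraction (NLP4J, OpenNLP, WordNet)."""
    return text.upper() if text else None

extract_features_udf = udf(extract_features, StringType())

# Read the Parquet dataset, add the calculated column, and write the result back out.
df = spark.read.parquet("hdfs:///path/to/documents")               # assumed input location
df_with_features = df.withColumn("nlp_features", extract_features_udf(df["text"]))
df_with_features.write.mode("overwrite").parquet("hdfs:///path/to/documents_with_features")
```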