Gensim Doc2Vec on Spark - a quest to get the right Vector
Ever since I joined the R&D group we have been doing a lot of cool things, like trying IBM Watson (see the previous blog entry). Now we are doing a lot of natural language processing, and we wanted to compare the similarity of two documents. There is an excellent project, Gensim (doc2vec), that lets you easily translate large blocks of text into a fixed-length feature vector for making comparisons. Here is the link to the original paper from some people at Google that explains the approach. In essence, they wanted to find a representation that overcomes the weaknesses of the bag-of-words model.
The doc2vec approach is a reliable way to compare the similarity of documents because it takes into account the semantics and the order of words in context. With that in mind, we wanted to use it on a corpus of 26 million documents. Calculating doc2vec vectors for 26 million documents is not a small task, so we needed to process them in Spark. The problem is that there is no implementation of Gensim for Spark, so I tried the canonical version of gensim inside pyspark.
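To make the comparison idea concrete, here is a minimal sketch of how two documents can be compared with a pre-trained gensim Doc2Vec model and cosine similarity. The model path, example texts, and the naive whitespace tokenization are my own assumptions for illustration, not our actual pipeline.

```python
# Minimal sketch: comparing two documents with a pre-trained gensim Doc2Vec model.
# The model path and the example texts are hypothetical.
import numpy as np
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("doc2vec.model")  # hypothetical path to a pre-trained model

def doc_vector(model, text):
    # gensim expects a list of tokens; a real pipeline would use proper tokenization
    return model.infer_vector(text.lower().split())

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = doc_vector(model, "the cat sat on the mat")
v2 = doc_vector(model, "a cat was sitting on a rug")
print(cosine_similarity(v1, v2))  # closer to 1.0 means more similar
```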
So, using a pre-trained model, we needed to infer the vector of every document in our corpus. Using pyspark, I passed in the model with the --files option and then inferred the vector inside the mapping function. After all of that I did get vectors, BUT when I tested them they were useless: vectors for similar documents came out orthogonal (not similar at all under cosine similarity). I tried a few things, but nothing moved the vectors of related documents any closer together.

I have been reading through the gensim code to understand how the vectors are generated and make sense of what is happening. I also looked at several projects that use gensim on Spark, such as https://github.com/yiransheng/gensim-doc2vec-spark and http://deepdist.com/. Nothing worked, so as a workaround, taking inspiration from how deepdist uses Flask to distribute the training, I created a standalone Flask service that calculates the doc2vec vector and is called from Spark (sketched further below). Calling the service from Spark works, BUT at the rate the doc2vec vectors are being calculated, my Spark job is set to finish all 26 million in 2.7 days - which is still a better option than generating the doc2vec vectors for all 26 million locally using several threads, which might take up to 30 days. Still stuck on this problem - I might debug a Spark process to see what is happening to the weights in gensim.
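For reference, a rough sketch of what the first approach looks like: ship the model with spark-submit --files and call infer_vector inside the partitions. The file names, the (doc_id, text) input layout, and the output path are assumptions, and if the model was saved with large .npy sidecar files those would need to be shipped as well.

```python
# Sketch: infer doc2vec vectors inside a Spark job, assuming the model was shipped with
#   spark-submit --files doc2vec.model your_job.py
# File names, the (doc_id, text) record layout, and tokenization are assumptions.
from pyspark import SparkContext, SparkFiles
from gensim.models.doc2vec import Doc2Vec

sc = SparkContext(appName="doc2vec-infer")

def infer_partition(rows):
    # Load the model once per partition; --files makes it available on every executor.
    model = Doc2Vec.load(SparkFiles.get("doc2vec.model"))
    for doc_id, text in rows:
        yield doc_id, model.infer_vector(text.lower().split())

docs = (sc.textFile("hdfs:///corpus/docs.tsv")            # assumed tab-separated: doc_id \t text
          .map(lambda line: tuple(line.split("\t", 1))))

vectors = docs.mapPartitions(infer_partition)
vectors.saveAsPickleFile("hdfs:///corpus/doc2vec_vectors")
```

This is the setup that produced the near-orthogonal vectors described above, so treat it as a record of the attempt rather than a working recipe.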
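And here is a rough sketch of the Flask workaround: a standalone service that loads the model once and exposes inference over HTTP, which the Spark job then calls. The route name, payload shape, and model path are my own choices for illustration, not a fixed API.

```python
# Rough sketch of a standalone Flask service wrapping gensim's infer_vector,
# in the spirit of the workaround described above.
# Route name, payload shape ({"tokens": [...]}), and model path are assumptions.
from flask import Flask, jsonify, request
from gensim.models.doc2vec import Doc2Vec

app = Flask(__name__)
model = Doc2Vec.load("doc2vec.model")  # hypothetical path, loaded once at startup

@app.route("/infer", methods=["POST"])
def infer():
    tokens = request.get_json()["tokens"]      # expects a pre-tokenized document
    vector = model.infer_vector(tokens)
    return jsonify({"vector": vector.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000, threaded=True)
```

The Spark side then just POSTs each document's tokens to the service (for example with the requests library inside mapPartitions) and collects the returned vectors, which is why the throughput is bounded by how fast the service can infer.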