Showing posts with the label word2vec

Word Embeddings - Vectors that represent words

Ever since I started being part of the R & D group in our company, I have been working on understanding neural networks that generate representations of words and documents. Representing words and documents as vectors allows us to carry out Natural Language Processing tasks mathematically. For example, we can measure how similar two documents are (using cosine similarity), solve analogies (with vector operations), and rank documents. But how do you produce the vectors that represent the words? Well, there are many ways. There are traditional NLP approaches that still work very well, like matrix factorization (LDA, GloVe), and newer methods, many of which use neural networks (Word2Vec). I have been producing document vectors using gensim's doc2vec (which builds on Word2Vec). Specifically, I have been using the hierarchical softmax skip-gram model ( Word2Vec.py Neural network ). When I was reading this part of the code, I thought that if this is a neural network (shallow, not a deep learning model) w…
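Since the excerpt mentions cosine similarity, analogies via vector operations, and the skip-gram model with hierarchical softmax, here is a minimal sketch of that setup using gensim's Word2Vec. The toy corpus and parameter values are illustrative assumptions, not the ones used in the project, and the parameter names follow gensim 4.x (older versions use size instead of vector_size).

```python
# Minimal sketch: skip-gram word vectors with hierarchical softmax in gensim,
# then a cosine-similarity query and an analogy via vector arithmetic.
# The corpus and hyperparameters below are illustrative, not from the post.
from gensim.models import Word2Vec

# Tiny illustrative corpus: each document is a list of tokens.
corpus = [
    ["king", "rules", "the", "kingdom"],
    ["queen", "rules", "the", "kingdom"],
    ["man", "walks", "in", "the", "city"],
    ["woman", "walks", "in", "the", "city"],
]

# sg=1 selects the skip-gram architecture, hs=1 enables hierarchical softmax.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1, hs=1, epochs=50)

# Cosine similarity between two word vectors.
print(model.wv.similarity("king", "queen"))

# A simple analogy query via vector arithmetic: king - man + woman ~= ?
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```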

Gensim Doc2Vec on Spark - a quest to get the right Vector

Ever since I joined the R & D group, we have been doing a lot of cool things, like trying IBM Watson (see previous blog entry). Now we are doing a lot of natural language processing. We wanted to compare the similarity of two documents. There is this excellent project Gensim ( doc2vec ) that easily allows you to translate large blocks of text into fixed-length feature vectors for comparison. Here is the link to the original paper from researchers at Google that explains the approach. In essence, they wanted to find a representation that overcomes the weaknesses of the bag-of-words model. The doc2vec approach proves to be a reliable way to compare the similarity of documents because it takes into consideration the semantics and the order of words in context. With that, we wanted to use it on a corpus of 26 million documents. Calculating doc2vec vectors for 26 million documents is not a small task, so we needed to process it in Spark. The problem is that there…
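To make the doc2vec step concrete, here is a minimal sketch of how gensim's Doc2Vec turns documents into fixed-length vectors and ranks them by similarity. The corpus, tags, and settings are illustrative assumptions (gensim 4.x API names such as model.dv); this is not the Spark pipeline described in the post.

```python
# Minimal sketch: train Doc2Vec on a toy corpus, infer a vector for a new
# document, and rank the training documents by cosine similarity to it.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

raw_docs = [
    "natural language processing with neural networks",
    "processing natural language using neural models",
    "spark distributes computation across a cluster",
]

# Each document gets a unique tag; gensim learns one vector per tag.
tagged = [TaggedDocument(words=doc.split(), tags=[f"doc{i}"]) for i, doc in enumerate(raw_docs)]

model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

# Infer a fixed-length feature vector for an unseen document...
vec = model.infer_vector("neural networks for language".split())

# ...and rank the stored document vectors by cosine similarity to it
# (model.dv in gensim 4.x; older versions expose model.docvecs).
print(model.dv.most_similar([vec], topn=3))
```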