Word Embeddings - Vectors that represent words

Ever since I started being a part of the R&D team in our company, I have been working with neural networks that generate representations of words and documents. Representing words and documents as vectors allows us to carry out Natural Language Processing tasks mathematically. For example, we can measure how similar two documents are (using cosine similarity), answer analogy questions (with vector arithmetic), and rank documents.

But how do you produce the vectors that will represent the words?
Well, there are many ways. There are traditional NLP approaches that can still work very well, like matrix factorization (LDA, GloVe), and newer methods, many of which use neural networks (Word2Vec). I have been producing document vectors using gensim's doc2vec (which builds on Word2Vec), specifically the hierarchical softmax skip-gram model (Word2Vec.py Neural network). When I was reading that part of the code, I thought: if this is a neural network (a shallow one, not a deep learning model), couldn't I use Tensorflow to generate my embeddings? The creators of Tensorflow are way ahead of me on that.
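For concreteness, here is a minimal sketch of that kind of setup in gensim, training a skip-gram model with hierarchical softmax on a toy corpus. The corpus and parameter values are placeholders, and in newer gensim releases the size argument is named vector_size:

```python
# Minimal sketch: skip-gram + hierarchical softmax in gensim.
# The corpus and hyperparameters below are toy placeholders.
from gensim.models import Word2Vec

corpus = [
    ["representing", "words", "as", "vectors"],
    ["documents", "compared", "with", "cosine", "similarity"],
]

model = Word2Vec(
    corpus,
    size=50,       # dimensionality of the word vectors ("vector_size" in gensim >= 4.0)
    window=2,      # context window on each side of the current word
    sg=1,          # 1 = skip-gram
    hs=1,          # 1 = hierarchical softmax
    negative=0,    # turn off negative sampling since hs is on
    min_count=1,
)

print(model.wv["vectors"])   # the learned embedding for one word
```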

Background on Word2Vec (before we try doing embeddings in Tensorflow)
Although Word2Vec uses a neural network, we don't actually care about the task the network is trained on; the goal is to learn the weights of its hidden layer. We save those weights, and they are the "vectors" we are trying to learn.
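A tiny numpy illustration of that point (with made-up sizes): feeding a one-hot encoded word through the input-to-hidden weight matrix just selects one row of that matrix, and that row is the word's vector.

```python
# Toy illustration: the hidden-layer weights are the word vectors.
import numpy as np

vocab_size, embedding_dim = 5, 3
W_hidden = np.random.rand(vocab_size, embedding_dim)   # input -> hidden weights

word_index = 2                       # pretend this is the id of some word
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

hidden = one_hot @ W_hidden          # hidden-layer activation for that word
assert np.allclose(hidden, W_hidden[word_index])   # same as looking up a row
```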

There are several ways to train the neural network and several ways to represent the input layer. For this blog post, let us stick with the skip-gram model because it is the one I am familiar with.

But there is an issue with learning the "word vectors" using just a standard neural network. The network is trained to predict words: in the skip-gram model, the surrounding context words given the current word. Predicting a word is just like predicting a class, except that here every word in the vocabulary is its own class. A multi-class classifier built as a standard neural network needs an output layer with as many units as there are classes, so training on a large body of text with many unique words gets out of hand quickly.
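To see why, here is some back-of-the-envelope arithmetic with made-up but typical sizes:

```python
# Illustrative numbers only: a full softmax output layer has one unit per
# word in the vocabulary, so its weight matrix alone gets very large.
vocab_size = 100_000     # unique words in the corpus
embedding_dim = 300      # hidden layer size / word vector size

output_weights = embedding_dim * vocab_size
print(output_weights)    # 30,000,000 weights that every training step must update
```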

Softmax may not entirely cut it for a large vocabulary size
Here is a quick jsfiddle showing why a softmax becomes expensive for multi-class classification: look at the "for loop" - it is a summation over all the elements of the vector - Softmax in javascript
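The same idea in Python (the jsfiddle itself is in JavaScript): the denominator sums over every entry of the score vector, so the cost of a single prediction grows with the vocabulary size.

```python
# Softmax over a vocabulary-sized score vector; the normalizing sum runs
# over every word, which is what makes it expensive for large vocabularies.
import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)       # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)  # sum over the entire vocabulary

scores = np.random.rand(100_000)            # one raw score per word (made-up size)
probs = softmax(scores)
print(probs.sum())                          # ~1.0
```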

To deal with this issue, the neural network used to generate word vectors uses a technique called noise contrastive estimation (NCE). The idea behind NCE is to convert the multi-class classification problem into a binary one: distinguish the true context word from a handful of randomly sampled "noise" words.
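In Tensorflow (1.x style) this is what tf.nn.nce_loss does: for each true (input word, context word) pair it samples num_sampled noise words and trains a binary classifier to tell them apart. A rough sketch, with placeholder sizes:

```python
# Rough sketch of NCE in Tensorflow 1.x; all sizes are placeholders.
import tensorflow as tf

vocab_size, embedding_dim, num_sampled = 50_000, 128, 64

embeddings = tf.Variable(tf.random_uniform([vocab_size, embedding_dim], -1.0, 1.0))
nce_weights = tf.Variable(tf.truncated_normal([vocab_size, embedding_dim], stddev=0.05))
nce_biases = tf.Variable(tf.zeros([vocab_size]))

train_inputs = tf.placeholder(tf.int32, shape=[None])     # ids of the input (center) words
train_labels = tf.placeholder(tf.int32, shape=[None, 1])  # ids of the true context words

embed = tf.nn.embedding_lookup(embeddings, train_inputs)  # look up the current vectors

loss = tf.reduce_mean(
    tf.nn.nce_loss(
        weights=nce_weights,
        biases=nce_biases,
        labels=train_labels,
        inputs=embed,
        num_sampled=num_sampled,   # noise words drawn per example
        num_classes=vocab_size,
    )
)
optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
```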

Here is a sample of how embeddings can be generated in Tensorflow, which I wrote following the example in Learning Tensorflow. The code uses a skip-gram model - Tensorflow embeddings

The code sample is based on Vector representations of words
and Learning Tensorflow (Early Release)
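For reference, this is roughly how skip-gram training pairs are built (toy sentence and window size): every word becomes an input, and each word within the window around it becomes a label to predict.

```python
# Toy skip-gram pair generation: (input word, context word) pairs.
def skipgram_pairs(words, window=2):
    pairs = []
    for i, center in enumerate(words):
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

print(skipgram_pairs(["the", "quick", "brown", "fox", "jumps"]))
# [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown'), ...]
```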

References:
NCE Loss
Skip-gram Word2Vec Tutorial
Learning Word Embeddings efficiently using NCE
