Posts

Predicting Helpful Posts

Original paper: https://www.aclweb.org/anthology/N19-1318 Here is a quick summary: The research purpose is to identify helpful posts in forum discussion threads, especially long-running discussions. The approach is to model each post's relevance with respect to the original post and its novelty (information not already presented in the earlier posts of the thread) based on a windowed context. To model 'relevance,' the original post and the target post are encoded using an RNN (GRU), and the encoded sequences are then element-wise multiplied. To model 'novelty,' the target post and the past K posts (where K is the number of past posts taken into context; a K between 7 and 11 worked best for the Reddit dataset used in the experiments, since performance stops improving after a certain number of posts) are also encoded using the same RNN text encoder. Once the K posts are encoded, they are fed thru another R...
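The relevance piece is easy to picture in code. Below is a minimal sketch in PyTorch of how I read it: both posts run through the same GRU encoder and the two summary vectors are multiplied element-wise. The class name, embedding size, and hidden size are my own placeholders, not the paper's reference implementation.

# Minimal sketch of the 'relevance' component, using PyTorch.
# Names and dimensions are my own assumptions, not the paper's code.
import torch
import torch.nn as nn

class RelevanceEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def encode(self, token_ids):
        # Encode a post and keep the final GRU hidden state as its summary vector.
        _, hidden = self.gru(self.embed(token_ids))
        return hidden.squeeze(0)            # shape: (batch, hidden_dim)

    def forward(self, original_post, target_post):
        # The element-wise product of the two encodings models how relevant
        # the target post is with respect to the original post.
        return self.encode(original_post) * self.encode(target_post)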

Abusive Language Detection

Original paper - https://www.aclweb.org/anthology/N19-1221 Here is a quick summary: In this paper, submitted by Facebook AI, London to the recent NAACL (North American Chapter of the Association for Computational Linguistics) conference held in Minneapolis, they presented a novel approach using Graph Convolutional Networks that outperforms some of the best ways to detect abusive language on the internet. The approach makes use of a heterogeneous graph that contains an author's community network and tweets. The graph is then used to predict the class and generate an embedding. In the paper's experiments, the researchers used embeddings from node2vec (sample implementation here: https://snap.stanford.edu/node2vec/) and a 2-layer Graph Convolutional Network. The Graph Convolutional Network that represents the authors' profiles and tweets was used to classify an author's tweets into three classes using a softmax layer as the output layer of the network. To extract the embedding f...
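To make the 2-layer Graph Convolutional Network concrete, here is a minimal sketch in plain PyTorch of what such a network looks like. The layer sizes are my own assumptions, and a normalized adjacency matrix with self-loops is taken as given rather than built from the authors' heterogeneous graph.

# A minimal 2-layer GCN sketch (Kipf & Welling style) in plain PyTorch.
# Sizes and names are assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerGCN(nn.Module):
    def __init__(self, in_dim, hidden_dim=64, num_classes=3):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hidden_dim)
        self.w2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, x, adj):
        # x: node features, adj: symmetrically normalized adjacency with self-loops
        h = F.relu(adj @ self.w1(x))     # first graph convolution
        h = adj @ self.w2(h)             # second graph convolution; usable as a node embedding
        return F.log_softmax(h, dim=1)   # three-way classification over nodes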

Fake News and A New Life Philosophy

When I worked on Reuters Tracer for a little bit, one of the hot topics during that time was detecting Fake News. Fake News is a real issue for people wanting reliable information and data. People who want reliable information and data are often decision makers who need to take action. I tried to look into what "Fake News" is, which led me to a new life philosophy. This new philosophy is a moral imperative to check every piece of information we consume before we believe it, because beliefs are the basis of our actions. In this day and age, when everyone is streamed information all day thru social media, adopting this philosophy becomes a necessity. Why does anyone want to create "Fake News"? The short answer is to win people over. "Fake News" mostly appeals to the target audience's emotions. You can't win people over with logic, facts, and data. Have you ever been riled up to work harder when presented with charts and graphs of our e...

Graph Algorithms - Strongly Connected Components in Spark 2

Ever since I generated doc2vec vectors (document embeddings) for our documents, we have found interesting things by doing computations and comparisons on these vectors. For example, we tried to find similar documents using cosine similarity and other similarity measures. These representations of the documents give us the flexibility to do a lot of things. We tried using the vectors in an ANNOY index to quickly find near neighbors for a document. Now I am exploring these same vectors to find documents that are repeatedly written and discuss the same topic. I figured that such documents would be closely similar, since documents that address the same topic will probably share a common vocabulary. What if I want to find the most influential document among these related documents? To do that, we need to define the connections between these documents. When we say "connections," I can't help but think of a Graph (or Network). Another approach is to us...
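Since the idea is to connect documents whose vectors are close and then look at the structure of that graph, here is a small sketch of how that could look with Spark 2 and GraphFrames: build edges between documents above a cosine-similarity threshold and run strongly connected components. The toy vectors, threshold, and checkpoint path are placeholders I made up for illustration.

# Sketch: similarity graph over doc2vec vectors, then components via GraphFrames.
# Toy data and threshold are made up; requires the graphframes Spark package.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from graphframes import GraphFrame

spark = SparkSession.builder.appName("doc-similarity-graph").getOrCreate()

docs = [("a", Vectors.dense([0.9, 0.1])),
        ("b", Vectors.dense([0.85, 0.2])),
        ("c", Vectors.dense([0.1, 0.95]))]
vertices = spark.createDataFrame([(i,) for i, _ in docs], ["id"])

def cosine(u, v):
    return float(u.dot(v) / (u.norm(2) * v.norm(2)))

# Pairwise cosine similarity is fine for a sketch; at scale an ANNOY/LSH index would build the edges.
edges = spark.createDataFrame(
    [(i, j) for i, u in docs for j, v in docs
     if i != j and cosine(u, v) > 0.9],
    ["src", "dst"])

g = GraphFrame(vertices, edges)
spark.sparkContext.setCheckpointDir("/tmp/gf-checkpoints")  # needed by some GraphFrames algorithms; harmless here
components = g.stronglyConnectedComponents(maxIter=10)
components.show()

Note that because both edge directions are added for every similar pair, the strongly connected components here coincide with the ordinary connected components; the directed version matters only if the similarity edges are made asymmetric.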

Word Embeddings - Vectors that represent words

Ever since I started being part of the R & D group in our company, I have been dealing with understanding neural networks that generate representations of words and documents. Representing words and documents as vectors allows us to carry out Natural Language Processing tasks mathematically. For example, we can see how similar two documents are (using cosine similarity), do quick analogies (vector operations), and rank documents. But how do you produce the vectors that will represent the words? Well, there are many ways. There are traditional NLP approaches that can still work very well, like matrix factorization (LDA, GloVe), and newer methods, many of which use neural networks (Word2Vec). I have been producing document vectors using gensim's doc2vec (which builds on Word2Vec). I have been using the hierarchical softmax skip-gram model ( Word2Vec.py Neural network ). When I was reading this part of the code, I thought that if this is a neural network (shallow, not a deep learning model) w...
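As a concrete reference point, here is a minimal sketch of training a hierarchical softmax skip-gram model with gensim (using the current gensim 4 API); the toy corpus and parameter values are my own, not the settings I used on our corpus.

# Sketch: skip-gram word vectors with hierarchical softmax in gensim.
from gensim.models import Word2Vec

sentences = [["natural", "language", "processing", "is", "fun"],
             ["word", "vectors", "represent", "words", "as", "numbers"],
             ["document", "vectors", "represent", "documents"]]

model = Word2Vec(sentences,
                 vector_size=50,   # dimensionality of the word vectors
                 sg=1,             # skip-gram
                 hs=1,             # hierarchical softmax instead of negative sampling
                 negative=0,
                 min_count=1,
                 epochs=50)

# Cosine similarity between two words, and a simple analogy via vector arithmetic.
print(model.wv.similarity("words", "documents"))
print(model.wv.most_similar(positive=["vectors", "words"], negative=["documents"], topn=3))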

RE WORK - Boston Deep Learning Summit

I recently attended the Deep Learning Summit in Boston. The event was organized by RE WORK, which was founded in London and has an all-women team. The mission of the RE WORK team is to encourage conversations around entrepreneurship, technology, and science to shape the future. This is a quick recount of the event from my perspective. First of all, I had never been to Boston. The public transportation that I took from the airport to the conference venue was really easy to navigate (in short, I did not get lost). This is probably a result of the effort put in by the local government to make Boston a premier conference venue. Traffic congestion is another story. Schedule of Talks: The conference schedule was packed. The speakers were researchers from some of the top tech companies; Facebook, Google, Amazon, eBay, and Spotify were all represented. I was excited about two topics in the schedule. Here are some of the papers presented. The papers I chose...

Gensim Doc2Vec on Spark - a quest to get the right Vector

Ever since I joined the R & D group, we have been doing a lot of cool things, like trying IBM Watson (see previous blog entry). Now we are doing a lot of Natural Language Processing. We wanted to compare the similarity of two documents. There is this excellent project Gensim ( doc2vec ) that allows you to easily translate large blocks of text into fixed-length feature vectors for comparison. Here is the link to the original paper from researchers at Google that explains the approach. In essence, they wanted to find a representation that would overcome the weaknesses of the bag-of-words model. The doc2vec approach proves to be a reliable way to compare the similarity of documents because it takes into consideration the semantics and the order of words in context. So with that, we wanted to use it for a corpus of 26 million documents. Calculating doc2vec for 26 million documents is not a small task, so we needed to process it in Spark. The problem ...
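For context, here is a hedged sketch of one way to spread doc2vec inference over a Spark cluster: train a gensim Doc2Vec model on the driver, broadcast it, and call infer_vector inside the executors. The corpus, parameters, and structure below are placeholders for illustration, not necessarily how we ended up solving the problem.

# Sketch: broadcast a trained gensim Doc2Vec model and infer vectors on Spark executors.
# Training corpus and parameters are placeholders, not our production setup.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("doc2vec-inference").getOrCreate()

training_docs = [TaggedDocument(words=text.split(), tags=[i])
                 for i, text in enumerate(["spark makes distributed processing easier",
                                           "doc2vec turns documents into fixed length vectors"])]
model = Doc2Vec(training_docs, vector_size=100, min_count=1, epochs=40)

broadcast_model = spark.sparkContext.broadcast(model)

def infer_partition(rows):
    # Each executor reuses the broadcast model to infer a vector per document.
    m = broadcast_model.value
    for doc_id, text in rows:
        yield doc_id, m.infer_vector(text.split()).tolist()

corpus = spark.sparkContext.parallelize([("doc-1", "comparing document similarity with vectors"),
                                         ("doc-2", "processing a large corpus in spark")])
vectors = corpus.mapPartitions(infer_partition)
print(vectors.collect())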