Petabyte

Posts

Gensim Doc2Vec on Spark - a quest to get the right Vector

March 31, 2017

Ever since I joined the R&D group we have been doing a lot of cool things, like trying IBM Watson (see previous blog entry). Now we are doing a lot of natural language processing. We wanted to compare the similarity of two documents. There is this excellent project, Gensim (doc2vec), that makes it easy to translate large blocks of text into a fixed-length feature vector for comparison. Here is the link to the original paper from some people at Google that explains the approach. In essence, they wanted to find a representation that overcomes the weaknesses of the bag-of-words model. The doc2vec approach proves to be a reliable way to compare the similarity of documents because it takes into consideration the semantics and the order of words in context. So with that, we wanted to use it for a corpus of 26 million documents. Calculating the doc2vec for 26 million documents is not a small task, so we need to process it in Spark. The problem ...
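Once each document has a fixed-length vector, a comparison can be reduced to a single number. Here is a minimal sketch using cosine similarity, assuming the vectors are plain lists of floats; the function name and the sample vectors are made up for illustration, and the excerpt does not say which similarity measure the post actually uses.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as "not similar".
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical document vectors (real doc2vec vectors are much longer).
doc_a = [0.12, -0.40, 0.33]
doc_b = [0.10, -0.35, 0.30]
score = cosine_similarity(doc_a, doc_b)
```

A score near 1 means the two vectors point in almost the same direction, and a score near 0 means they are unrelated.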

Spark DataFrame - Array[ByteBuffer] - IllegalArgumentException

December 22, 2016

IllegalArgumentException - ByteBuffer - Spark DataFrame. I was processing several million documents (~20 million) from which we needed to extract NLP features using NLP4J, OpenNLP, and WordNet. The combination of the three NLP feature sets blows up each record to 11 times its original size. We are using all three because we do not know yet which feature sets will be helpful to us. The original dataset is in parquet files in HDFS (16 partitions). I thought it would be convenient to just use withColumn and pass a UDF (user-defined function) on the column that needs those features. withColumn adds the calculated column back to the DataFrame. So I created the Spark job (I am on Spark 1.5.2-cdh5.5.2) for the above, and things started to get nasty. I am blowing up the ByteBuffer array in the in-memory columnar storage. This is the exception that I am getting. There seems to be no reference to my code in this stack trace. java.lang.IllegalArgumentException at java.nio.B...

Watson - The mystery after jeopardy

November 21, 2016

We have been diving deep into cognitive computing. One of the best platforms that a business can leverage to hit the ground running on cognitive computing is IBM Watson. Watson has a lot of capabilities, especially with the acquisition of Alchemy's API as well. ( Alchemy Acquisition - IBM ). You get a language translator, language classifier, retrieve and rank, text to speech, tone analyzer, and a lot more. It is just a matter of how these capabilities can be integrated into your business use cases. As part of "the answer" company, we have tremendous and diverse use cases for search, and giving you answers in a way that makes sense, is relevant, and helps a user decide better is at the heart of what makes us "the answer" company. I was part of the team given the freedom to explore IBM Watson (no matter what the cost). So we tried the different APIs over a span of a few weeks. Of course, we had to take a look at Watson's retrieve and rank ( IBM Watson's Retrieve...

2016 - Movies Data Analysis - Linear Regression Modelling

October 14, 2016

Java 1.8 Migration - Performance and Garbage Collection

June 04, 2016

Java 7 to Java 8 - that is easy! I have been working on migrating our web application from Java 1.7 to Java 1.8. Migrating our web app is quite a challenge. What makes it more challenging is that our web application has a really unique process footprint (well, that can be said for all web applications). You have to know your application like the back of your hand, especially if you want to tune garbage collection for it. When I accepted the challenge of moving our web application from Java 1.7 to Java 1.8, I thought that it was going to be a breeze, considering that 1.7 to 1.8 was not that big a version jump. It turned out that I was totally wrong. Here are some of the major challenges that I encountered: 1. Permanent Generation turned into Metaspace. Before Java 1.8, class metadata was located in the permanent generation of the Java heap. Its size could be set using the -XX:PermSize option. This was removed in Java 1.8 ( Remove Permanent Generation ). The reason why it...
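To make the PermGen change concrete, here is a small sketch that rewrites a list of JVM options for Java 1.8. The helper name and the sample options are made up for illustration. The mapping is only a starting point and an assumption on my part: -XX:MetaspaceSize is the threshold that first triggers a metaspace collection rather than a reserved size, so the values should be re-tuned instead of copied blindly.

```python
# Java 8 ignores the PermGen options (with a warning), so map them to the
# closest Metaspace options. These are not exact equivalents.
PERMGEN_TO_METASPACE = {
    "-XX:PermSize": "-XX:MetaspaceSize",
    "-XX:MaxPermSize": "-XX:MaxMetaspaceSize",
}


def migrate_jvm_flags(flags):
    """Return a copy of flags with PermGen options renamed to Metaspace ones."""
    migrated = []
    for flag in flags:
        # Split "-XX:PermSize=256m" into name, "=", value; flags without
        # "=" (e.g. "-Xmx2g") come back unchanged.
        name, sep, value = flag.partition("=")
        new_name = PERMGEN_TO_METASPACE.get(name, name)
        migrated.append(new_name + sep + value)
    return migrated


# Hypothetical Java 1.7 options for a web application.
java7_flags = ["-Xmx2g", "-XX:PermSize=256m", "-XX:MaxPermSize=512m"]
java8_flags = migrate_jvm_flags(java7_flags)
```

Options that have nothing to do with the permanent generation pass through untouched, so the helper can be run over a full startup script's option list.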

Agile is not a process!

May 30, 2016

Agile - What is it? I have recently reinforced my understanding of what Agile is in relation to software development. One of the things that I realized is that the Agile manifesto does not dictate a process you have to follow but is more like a culture of what you need to value. One of the most popular agile methodologies is Scrum. Scrum Guide (The last time I read the guide was in 2009 - they released a new guide in 2013. They will probably release a new one soon) - If you follow Scrum, you need to follow everything, otherwise you are not doing Scrum. You can do stand-ups, but if that is all you are doing then it is not Scrum. Scrum will ensure that an Agile culture is developed within the team - and the person to see that through is the Scrum Master. Since Agile is in its teen years, we probably need to reconnect with it and revisit where it is being taken. Agile in its teen years!

Speech to Text - HTML 5

May 24, 2016

Technology Can Help. Recently there was a news story on Good Morning America where a deaf cashier was taking orders and customers were patiently writing out their orders. Here is a link: Deaf Cashier. Well, with a computer in every pocket, there is technology to help. Here is a quick demo that I created using HTML5 technology to build a speech-to-text web page. Here is a link to the working page: http://petabyte.github.io/textToSpeech.html https://www.youtube.com/watch?v=27SZISZAPEA