Spark DataFrame - Array[ByteBuffer] - IllegalArgumentException

I was processing several million documents (~20 million) from which we need to extract NLP features using NLP4J, OpenNLP, and WordNet. The combination of the three NLP feature sets blows up each record to 11 times its original size. We are using all three because we do not yet know which feature sets will be helpful to us.

The original dataset is in Parquet files in HDFS (16 partitions). I thought it would be convenient to just use withColumn and pass a UDF (user-defined function) over the column that needs those features; withColumn adds the calculated column back to the DataFrame.

So I created the Spark job for the above (I am on Spark 1.5.2-cdh5.5.2), and things started to get nasty. I am blowing up the ByteBuffer array in the in-memory columnar storage. This is the exception that I am getting; nothing in the stack trace references my own code.

java.lang.IllegalArgumentException at java.nio.ByteB