Hadoop - a Big Data Sneak Peek

What is Hadoop?
    According to the Apache website, Hadoop is a software library and framework that allows distributed processing of large datasets across clusters of computers using simple programming models.

So how is Hadoop different from a regular DBMS?
Current DBMSs (e.g. Oracle) can store large datasets in clustered settings, so what sets Hadoop apart? Aren't we already storing large datasets in database systems and accessing them with simple programming models? The answer is BIGDATA. As software systems mature and more complex ones are created, organizations are having a hard time storing, managing and analyzing the data that they have. So what differentiates BIGDATA from just data?
 The 3 V's
 Velocity
        - Velocity is the rate at which an organization's data grows, and data grows really fast. For example, Google processes about 24 petabytes per day (ACM White Paper - MapReduce).
 Variety
        - Data comes in all shapes and sizes. Most organizations have data scattered all over the place, and you cannot force all of it into a single schema. This is especially true when organizations merge (believe me, 30-year-old systems are still out there holding important data in a format that most college graduates would not recognize).
 and especially Volume
        - Don't think 30,000 employee records; think humongous, in the range of Data's memory capacity (Data from Star Trek, not so far-fetched). Petabytes of data!

How do you process this kind of data?
   That is the problem space that Hadoop is trying to solve. Again, Hadoop is a reliable, fault-tolerant, high-performance distributed parallel programming framework for large-scale data, written in Java.

Two Parts of Hadoop
    1. HDFS (Hadoop Distributed File System)
        This is a distributed file system that is fault tolerant (data is replicated) and runs on inexpensive commodity hardware (hardware you can buy from Best Buy).
    2. MapReduce
        This is the programming model that you use to process and analyze data in Hadoop. It is based on the MapReduce programming model that Google published.
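To give a feel for the MapReduce model, here is a rough sketch of the classic word count in plain Python. This is not Hadoop's API and nothing here runs on a cluster; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration. The idea is only to show the three steps: map each input record to key/value pairs, group the pairs by key, then reduce each group to a result.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (the framework does this between map and reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # On a real cluster, different lines (or blocks) could be mapped on different machines.
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data is everywhere"]))
```

Because each call to map_phase looks at only one line and each call to reduce_phase looks at only one key, the work can be split across many machines, which is what makes the model a good fit for a cluster.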

Now that you have an idea of what Hadoop is, you should also have an idea of what it is not for. Here are some suggestions:
Don't store small files - Hadoop is not efficient at storing small files. Think big!
Don't use it for online applications - HDFS is built for high-throughput batch access, not the low-latency, always-available access that online applications need.
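To put a rough number on the small-files advice above, here is a back-of-the-envelope sketch in Python. It assumes a 128 MB block size (the block size is configurable, so yours may differ) and the commonly quoted rule of thumb of about 150 bytes of NameNode memory per file or block entry. Both constants are assumptions for illustration, not measurements, and the function names are hypothetical.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS block size in bytes (configurable)
BYTES_PER_OBJECT = 150          # assumed rule-of-thumb metadata cost per file or block


def namenode_objects(num_files, file_size):
    """Count metadata entries (one per file plus one per block) for equal-sized, non-empty files."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    return num_files * (1 + blocks_per_file)


def metadata_bytes(num_files, file_size):
    """Estimate the metadata memory needed to track the files, under the assumptions above."""
    return namenode_objects(num_files, file_size) * BYTES_PER_OBJECT


ONE_MB = 1024 ** 2
ONE_GB = 1024 ** 3

# The same 1 TB of data, stored two different ways.
small = metadata_bytes(num_files=1024 * 1024, file_size=ONE_MB)  # about a million 1 MB files
large = metadata_bytes(num_files=1024, file_size=ONE_GB)         # 1024 files of 1 GB each

if __name__ == "__main__":
    print(f"small files: {small / ONE_MB:.1f} MB of metadata")
    print(f"large files: {large / ONE_MB:.1f} MB of metadata")
```

Under these assumptions the same terabyte costs a couple of hundred times more metadata when it is chopped into 1 MB files, which is the reason behind "Think big!".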

In my next post, I will try and discuss the parts of HDFS and touch a little bit on a MapReduce sample.

