Posts

Showing posts with the label MapReduce

Hadoop - a Big Data Sneak Peek (part 2)

Image
Here is the next installment for Hadoop. I hope this gives you an idea of what HDFS and MapReduce in Hadoop is. Please leave a comment if you find this useful. HDFS Architecture The diagram below shows the main components for the Hadoop File System (HDFS). Name Node - This is the central piece of HDFS. The Name Node tracks all the  file system's metadata (e.g. directory tree of all files, where the file is kept in the cluster). It does not store the data itself. The Name Node is the SPOF (Single Point of Failure in HDFS). Here is a link to a more detailed definition -> NameNode Secondary Name Node - optional component of the HDFS. Creates checkpoints of the namespaces Data Node - Stores data in large blocks in the filesystem. Reports the blocks of data it holds to the Name Node. MapReduce MapReduce is a programming model on a distributed processing platform that is scalable. It is actually as three step process. 1. Map - this steps creates a map (key-val...

Hadoop - a Big Data Sneak Peek

What is Hadoop ?     According to the Apache website  Hadoop . It is a software library framework that allows distributed processing of large datasets across clusters of computers using simple programming models. So How's Hadoop different than regular DBMS ? Current DBMS can store large datasets in clustered settings but what sets Hadoop apart than regular DBMS systems ?(e.g. Oracle).  Aren't we storing large datasets in the database system and can access them using simple programming models ? The answer to that is BIGDATA. Right now as software systems mature and more complex ones are created organizations are having a hard time storing, managing and analyze the data that they have. So what differentiates BIGDATA from just data.  The 3 V's    Velocity        - Velocity is the rate at which how an organizations data grows. Data grows really fast. For example Google processes about 24 petabytes per day ( ACM White Paper ...