Introduction To Big Data
A very common misconception is that big data is a technology or tool. In reality, big data is a very large, heterogeneous set of data. Much of this data arrives in unstructured or semi-structured form, so extracting useful information from it is difficult. With the growth of cloud technologies, the rate at which data is generated has increased tremendously.
Therefore, we need a solution that can process such “big data” at optimal speed without compromising data security. A cluster of technologies deals with this problem, and one of the best known is Hadoop.
“How does Hadoop provide a solution to big data problems?” This is a common question. The answer:
- Hadoop stores data in blocks on multiple nodes rather than on a single machine. This allows for separation of concerns, fault tolerance, and increased data security.
- There is no need to define a schema before data is stored. One of the major drawbacks of RDBMS systems is that they work on predefined schemas, which take away the user's flexibility to store different types of data.
- Another feature of Hadoop is that it brings the processing to the data: the computation is moved to where the data lives rather than the data being carried from one system to another. Because the architecture is distributed, the end user also has the flexibility to add any number of nodes.
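The block-based storage described in the first bullet can be sketched in a few lines of Python. This is a toy illustration only, with made-up block sizes and node names; real HDFS defaults to 128 MB blocks and uses rack-aware replica placement:

```python
from itertools import cycle

# Toy parameters (illustrative; real HDFS uses 128 MB blocks by default).
BLOCK_SIZE = 4          # bytes per block, tiny for demonstration
REPLICATION = 3         # copies kept of each block
NODES = ["node1", "node2", "node3", "node4"]

def place_blocks(data: bytes):
    """Split data into fixed-size blocks and assign each block to
    REPLICATION distinct nodes, round-robin across the cluster."""
    placement = {}
    node_ring = cycle(NODES)
    for i in range(0, len(data), BLOCK_SIZE):
        placement[i // BLOCK_SIZE] = {
            "data": data[i:i + BLOCK_SIZE],
            "replicas": [next(node_ring) for _ in range(REPLICATION)],
        }
    return placement

layout = place_blocks(b"hello big data!")
# 15 bytes -> 4 blocks, each replicated on 3 different nodes.
```

Because every block lives on several nodes, losing a single machine loses no data, which is where the fault tolerance claimed above comes from.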
All of this contributes to Hadoop being a reliable, economical (commodity local nodes being cheaper than RAID), scalable, and flexible system.
Hadoop comprises two main components: nodes and resource managers.
- Nodes (name node and data nodes): The name node acts as the master and holds all the metadata about the data being processed on the data nodes. Typically there is only one name node in a cluster, but the number can be increased as requirements grow. Data nodes are the actual workers: this is where the data resides, where processing takes place, and where results are stored. The name node holds only the mapping of data blocks to data nodes, not the data itself.
- Resource managers (MapReduce and YARN): The resource manager layer contains the algorithm(s) required to process the data. This is the heart of Hadoop, where the business logic for processing is written.
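The name node's metadata-only role described above can be sketched as a lookup table. This is a hypothetical illustration (real clients use the HDFS API, not dictionaries); the file path and node names are invented:

```python
# The name node holds only metadata: which blocks make up a file and
# where each block's replicas live. It never holds file contents.
namenode_metadata = {
    "/logs/2024-01.txt": {
        0: ["node1", "node3", "node4"],   # replicas of block 0
        1: ["node2", "node3", "node1"],   # replicas of block 1
    }
}

def locate_blocks(path: str):
    """A client asks the name node where a file's blocks live, then reads
    the blocks directly from the data nodes -- the file's bytes never
    pass through the master."""
    return namenode_metadata[path]

locations = locate_blocks("/logs/2024-01.txt")
```

This design keeps the master lightweight: it answers "where is the data?" while the heavy traffic flows between clients and data nodes.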
MapReduce comprises two jobs: map and reduce. “‘Map’ refers to taking a set of data and converting it into another set of data, where individual elements are broken down into key/value pairs. ‘Reduce’ refers to getting the output from a map as input and combines those data tuples into a smaller set of tuples.” (Source: IBM’s page on MapReduce) The important thing to note is that the reduce job is always performed after the map job.

Another resource manager, usable alongside MapReduce or as a standalone resource, is YARN. YARN stands for Yet Another Resource Negotiator and is a resource-management and job-scheduling technology. IBM mentioned in its article that, “according to Yahoo!, the practical limits of such a design are reached with a cluster of 5000 nodes and 40,000 tasks running concurrently.” Beyond this scaling limit, MRv1 also utilized computational resources inefficiently, and it tied the Hadoop framework to the MapReduce processing paradigm alone.

According to Hortonworks, “YARN also extends the power of Hadoop to incumbent and new technologies found within the data center so that they can take advantage of cost-effective, linear-scale storage and processing.” It provides ISVs and developers a consistent framework for writing data-access applications that run in Hadoop. YARN relieves MapReduce of resource management and job scheduling, and it gave Hadoop the ability to run non-MapReduce jobs within the Hadoop framework.
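The map and reduce phases quoted above can be illustrated with the classic word-count example. This is a minimal single-process sketch of the idea, not a real Hadoop job (those are written against the MapReduce API, typically in Java, and run distributed):

```python
from collections import defaultdict

def map_phase(line: str):
    # Map: break each input element into key/value pairs -- here (word, 1).
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Reduce: combine all values sharing a key into a smaller set of tuples.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["big data needs big tools", "hadoop handles big data"]

# Map first, then reduce -- the reduce job always runs after the map job.
intermediate = [pair for line in lines for pair in map_phase(line)]
result = reduce_phase(intermediate)
# result: {'big': 3, 'data': 2, 'needs': 1, 'tools': 1, 'hadoop': 1, 'handles': 1}
```

In a real cluster the map calls run in parallel on the data nodes holding each block, and the framework shuffles the intermediate pairs so that all values for a given key reach the same reducer.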