MODULE 1: BIG DATA — THE BIG PICTURE | Hadoop: Evolution, Overview, and Core Components
Evolution of Hadoop
Background:
- Hadoop originated from the need to handle massive amounts of data generated by web search engines.
- In the early 2000s, Doug Cutting and Mike Cafarella were building Apache Nutch, an open-source web crawler whose storage and processing layers were inspired by Google's Google File System (GFS, 2003) and MapReduce (2004) papers.
- In 2006, those layers were split out of Nutch to form Hadoop, initially a subproject of Apache Lucene.
Key Milestones:
- 2005: Development of Hadoop started as part of the Apache Nutch project.
- 2006: Hadoop was split out of Nutch into its own Apache subproject (initially under Lucene).
- 2008: Hadoop became a top-level Apache project, and Yahoo! announced that its production search Webmap ran on a large Hadoop cluster, demonstrating the platform's scalability and reliability.
- 2011: Hadoop 1.0 was released, marking the platform's first stable, production-ready version.
- 2013: Hadoop 2 reached general availability, introducing YARN (Yet Another Resource Negotiator), which separated cluster resource management from the MapReduce processing engine.
Overview of Hadoop
What is Hadoop?
Hadoop is an open-source framework designed for distributed storage and processing of large datasets using clusters of commodity hardware. It enables applications to work with thousands of nodes and petabytes of data.
Key Features:
- Scalability: Can scale out to accommodate more data by adding more nodes to the cluster.
- Fault Tolerance: Automatically handles hardware failures by replicating data across multiple nodes.
- Cost-Effective: Uses commodity hardware, reducing costs compared to traditional high-end servers.
- Flexibility: Can handle structured, semi-structured, and unstructured data.
Core Components of Hadoop
1. Hadoop Distributed File System (HDFS):
Purpose: A distributed file system that stores data across multiple nodes, providing high throughput access to data.
Architecture:
- NameNode: Manages the file system namespace and regulates access to files by clients.
- DataNode: Stores the actual data blocks (128 MB each by default in Hadoop 2). Multiple copies of each block are stored on different DataNodes for fault tolerance.
Features:
- Replication: Default replication factor is three, ensuring data availability even if nodes fail.
- Large File Support: Optimized for very large files and streaming (sequential) reads rather than low-latency random access; a minimal client sketch follows below.
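As a concrete illustration, here is a minimal sketch of writing and reading an HDFS file through the Hadoop Java FileSystem API. The NameNode address hdfs://namenode:9000 and the path /user/demo/hello.txt are hypothetical; in a real deployment the address normally comes from core-site.xml.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; usually read from core-site.xml instead.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/hello.txt");

        // Write a small file; HDFS replicates its blocks (3 copies by default).
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back as a stream and print it.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```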
2. MapReduce:
- Purpose: A programming model for processing large datasets in parallel across a Hadoop cluster.
Components:
- JobTracker (MRv1): Schedules MapReduce jobs and assigns tasks to specific nodes in the cluster.
- TaskTracker (MRv1): Runs on each worker node and executes the tasks assigned by the JobTracker. (In Hadoop 2, these daemons are replaced by YARN's ResourceManager and per-application ApplicationMaster, described below.)
Workflow:
- Map Phase: Input data is split into chunks and processed by Map tasks, which generate intermediate key-value pairs.
- Shuffle and Sort: Intermediate pairs are partitioned, transferred to the reducers, and sorted by key.
- Reduce Phase: Each Reduce task aggregates the values for its keys to produce the final output. The word-count sketch below illustrates both phases.
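The sketch below is the classic word-count example expressed with the Hadoop MapReduce Java API: the Mapper emits (word, 1) pairs, and the Reducer sums the counts per word after the shuffle and sort. Input and output paths are passed as command-line arguments and are assumed to be HDFS directories.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in each input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word after the shuffle/sort.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```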
3. Yet Another Resource Negotiator (YARN):
- Purpose: An enhanced resource management layer in Hadoop 2.0 that allows for better resource utilization and scalability.
Components:
- ResourceManager: A cluster-wide service that arbitrates resources and schedules containers among all running applications.
- NodeManager: A per-node agent that launches containers and monitors their resource usage (CPU, memory) on that node.
- ApplicationMaster: A per-application process that negotiates resources from the ResourceManager and manages the application's lifecycle.
Benefits:
- Allows multiple data processing engines (MapReduce, Spark, etc.) to run simultaneously.
- Improves cluster utilization by decoupling resource management from data processing; the sketch below queries the ResourceManager for node status.
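To make this division of labour concrete, here is a small sketch that uses the YarnClient API to ask the ResourceManager for a report on every running NodeManager. It assumes a yarn-site.xml on the classpath that points at a reachable ResourceManager.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnClusterInfo {
    public static void main(String[] args) throws Exception {
        // Reads yarn-site.xml from the classpath; the ResourceManager address
        // is assumed to be configured there.
        Configuration conf = new YarnConfiguration();

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the ResourceManager for a report on every running NodeManager.
        for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
            System.out.printf("%s  containers=%d  used=%s  capacity=%s%n",
                    node.getNodeId(),
                    node.getNumContainers(),
                    node.getUsed(),        // resources currently allocated on the node
                    node.getCapability()); // total resources the node offers
        }

        yarnClient.stop();
    }
}
```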
4. Hadoop Common:
- Purpose: A set of shared utilities and libraries that support other Hadoop components.
- Components:
- File System Abstractions: A common FileSystem interface over HDFS, the local file system, Amazon S3, and other storage backends (see the sketch below).
- Serialization and I/O Libraries: Writable types and related mechanisms for reading and writing data efficiently.
- Java Libraries: Shared utilities and code used by the other Hadoop modules.
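The sketch below illustrates the FileSystem abstraction from Hadoop Common: the backend is selected by URI scheme, so code written against the interface can target the local disk, HDFS, or S3 without changes. Only the local file:// case is exercised here; the hdfs:// and s3a:// variants mentioned in the comments would need a running cluster or the hadoop-aws module plus credentials.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsAbstractionDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The same FileSystem interface resolves different storage backends by URI
        // scheme: file:// uses the local disk, hdfs:// would need a reachable
        // NameNode, and s3a:// would need hadoop-aws and credentials (hypothetical).
        FileSystem local = FileSystem.get(URI.create("file:///"), conf);

        // List the contents of a local directory through the generic interface.
        for (FileStatus status : local.listStatus(new Path("/tmp"))) {
            System.out.println((status.isDirectory() ? "dir  " : "file ") + status.getPath());
        }
    }
}
```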
Additional Ecosystem Components
Hive:
- Purpose: A data warehouse infrastructure that provides data summarization, query, and analysis.
- Features: Uses a SQL-like language called HiveQL to query data stored in HDFS.
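As an illustration, the following sketch submits a HiveQL query through the HiveServer2 JDBC driver. The host name, credentials, and the web_logs table are hypothetical, and the hive-jdbc driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical HiveServer2 host, port, database, and credentials.
        String url = "jdbc:hive2://hiveserver:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL looks like SQL but is compiled into jobs that read files in HDFS.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT category, COUNT(*) AS cnt " +
                    "FROM web_logs GROUP BY category ORDER BY cnt DESC LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString("category") + "\t" + rs.getLong("cnt"));
                }
            }
        }
    }
}
```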
Pig:
- Purpose: A high-level platform for writing data-flow programs that run as MapReduce jobs on Hadoop.
- Features: Uses a scripting language called Pig Latin for expressing data transformations.
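A brief sketch of what Pig Latin looks like, embedded in Java via the PigServer API and run in local mode; the input file and field names are hypothetical.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigLatinExample {
    public static void main(String[] args) throws Exception {
        // Local mode runs the script against the local file system; MAPREDUCE mode
        // would submit jobs to a Hadoop cluster instead.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // A small Pig Latin pipeline: load, group, and aggregate (hypothetical data).
        pig.registerQuery("logs = LOAD 'access_log.txt' AS (user:chararray, bytes:long);");
        pig.registerQuery("grouped = GROUP logs BY user;");
        pig.registerQuery(
            "totals = FOREACH grouped GENERATE group AS user, SUM(logs.bytes) AS total;");

        // Storing the alias is what triggers execution of the pipeline.
        pig.store("totals", "user_totals");
        pig.shutdown();
    }
}
```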
HBase:
- Purpose: A distributed, scalable, big data store modeled after Google’s Bigtable.
- Features: Provides real-time read/write access to large datasets.
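The sketch below shows a single-row write and read with the HBase Java client, assuming an hbase-site.xml on the classpath and a pre-existing table named users with an info column family (both hypothetical).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseReadWrite {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath; the 'users' table and 'info'
        // column family are assumed to exist already (hypothetical names).
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Real-time write: a single-row put keyed by row key.
            Put put = new Put(Bytes.toBytes("user-1001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            table.put(put);

            // Real-time read: fetch the same row back by key.
            Result result = table.get(new Get(Bytes.toBytes("user-1001")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}
```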
Spark:
- Purpose: A fast and general-purpose cluster computing system.
- Features: Provides in-memory data processing, making it much faster than Hadoop MapReduce for iterative and interactive workloads.
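For comparison with the MapReduce version above, here is the same word count written against the Spark Java RDD API and run in local mode; the input path is hypothetical, and on a cluster the master would typically be YARN rather than local[*].

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // local[*] runs Spark inside this JVM using all available cores.
        SparkConf conf = new SparkConf().setAppName("spark-word-count").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {

            // Hypothetical input path; it could be a local file or an hdfs:// URI.
            JavaRDD<String> lines = sc.textFile("input.txt");

            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);   // intermediate data stays in memory

            counts.collect().forEach(pair -> System.out.println(pair._1() + "\t" + pair._2()));
        }
    }
}
```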
Conclusion:
Hadoop has evolved significantly since its inception, becoming a cornerstone of big data processing. Its core components — HDFS, MapReduce, YARN, and Hadoop Common — provide a robust framework for storing and processing large datasets efficiently. Additionally, its extensive ecosystem of tools and libraries makes it a versatile platform for a wide range of big data applications.