Distributed/Hadoop Storage Layer

Storage layer:

The storage layer is usually loaded with data using a batch process.

The integration component of the ingestion layer invokes mechanisms such as Sqoop, MapReduce jobs, and ETL jobs to upload data to the distributed Hadoop storage layer (DHSL).
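
For illustration, here is a minimal Java sketch of one such upload, written directly against the Hadoop FileSystem API; the NameNode address and file paths are placeholder assumptions, not part of any specific pipeline:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsUpload {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed cluster address

            try (FileSystem fs = FileSystem.get(conf)) {
                // Copy a locally staged file into the distributed storage layer.
                fs.copyFromLocalFile(new Path("/staging/orders.csv"),
                                     new Path("/data/raw/orders.csv"));
            }
        }
    }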

Hadoop:

  • Open source framework for distributed storage and processing.
  • Allows us to store huge volumes of data in a distributed fashion.
  • Decouples the distributed-computing plumbing (scheduling, fault tolerance, data movement) from the actual application logic you want to execute.
  • Lets applications interact with a logical cluster of processing and storage nodes rather than with individual machines, as the sketch below shows.
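
As a rough illustration of that last point, the sketch below treats the whole cluster as a single logical file system and asks it for aggregate capacity; the NameNode URI is an assumed placeholder:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FsStatus;

    public class ClusterView {
        public static void main(String[] args) throws Exception {
            try (FileSystem fs = FileSystem.get(
                    URI.create("hdfs://namenode:8020"), new Configuration())) {
                FsStatus status = fs.getStatus();
                // One handle, but the numbers aggregate every storage node.
                System.out.printf("capacity=%d used=%d remaining=%d%n",
                        status.getCapacity(), status.getUsed(), status.getRemaining());
            }
        }
    }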

Components:

1. HDFS

  • File system designed to store very large volumes of information (terabytes or petabytes) across a large number of machines in a cluster.
  • Stores data reliably and runs on commodity hardware.
  • Splits each file into large blocks (128 MB by default in Hadoop 2 and later) and stores replicated copies of each block across the cluster.
  • Supports a write-once-read-many model of data access, illustrated in the sketch below.
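
A minimal sketch of those last two points, assuming a reachable cluster (the NameNode address and path are placeholders): the file is written once and closed, after which any number of readers can ask where its blocks live.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockInspect {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed address
            Path file = new Path("/data/raw/events.log");

            try (FileSystem fs = FileSystem.get(conf)) {
                // Write once: the file becomes immutable after close (appends aside).
                try (FSDataOutputStream out = fs.create(file)) {
                    out.writeBytes("event-1\nevent-2\n");
                }

                // Read many: any client can now look up block placement.
                FileStatus st = fs.getFileStatus(file);
                for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
                    System.out.println(loc); // offset, length, and datanode hosts
                }
            }
        }
    }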

2. MapReduce

  • Batch-processing engine: computes results by running a map function over input records in parallel and aggregating the intermediate output with a reduce function (see the WordCount sketch below).
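
The canonical example of this model is WordCount; the compact version below uses the org.apache.hadoop.mapreduce API, taking input and output paths from the command line. The map function emits a (word, 1) pair per token and the reduce function sums the counts per word.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        ctx.write(word, ONE); // emit (word, 1)
                    }
                }
            }
        }

        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum)); // total count per word
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class); // pre-aggregate map output
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }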


Storage pattern:

  • Covers communication from the ingestion layer to the storage layer.
  • Can be implemented in different ways depending on performance, scalability, and availability requirements (a hypothetical sketch follows).
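
One hypothetical way to express that choice in code is a small strategy interface the ingestion layer programs against; every name below is invented for illustration and is not a Hadoop API:

    public class StoragePattern {
        interface StorageWriter { void write(String record); }

        enum Requirement { PERFORMANCE, SCALABILITY, AVAILABILITY }

        // Pick a writer implementation to match the dominant requirement.
        static StorageWriter writerFor(Requirement r) {
            switch (r) {
                case PERFORMANCE:
                    return rec -> System.out.println("buffered batch write: " + rec);
                case AVAILABILITY:
                    return rec -> System.out.println("write with extra replicas: " + rec);
                default: // SCALABILITY
                    return rec -> System.out.println("partitioned write: " + rec);
            }
        }

        public static void main(String[] args) {
            writerFor(Requirement.AVAILABILITY).write("order-42");
        }
    }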