Apache Hadoop’s Core: HDFS and MapReduce — Brief Summary

Uğur Sakarya
Apr 10, 2022

I would like to tell you about Hadoop and MapReduce briefly. I will not dive deep into these tools and will try to keep it as simple as possible. So:

What is HDFS and how does it work?

Hadoop Ecosystem (Source: geeksforgeeks)

HDFS allows data to be stored across an entire cluster in a distributed manner and allows your applications to analyze that data.

It is made for handling large files (e.g. sensor data, log files). It handles them by breaking these files up into blocks of up to 128 MB, which is a pretty large block size.

HDFS allows you to distribute the processing of one of those large files by splitting its blocks across different computers. So, different computers can access different parts of that data and process it in parallel.

In HDFS, more than one copy of each data block is stored, so when a data block fails, the data can be recovered from the other copies.

HDFS allows users to use multiple commodity computers to store their data, rather than one expensive one.
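To make the block and replication ideas concrete, here is a minimal sketch using Hadoop's Java FileSystem API to check a file's block size and replication factor. This is my own illustration, not part of the original notes, and the path /data/sensors.log is just a made-up example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the Hadoop config files on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path, used for illustration only
        FileStatus status = fs.getFileStatus(new Path("/data/sensors.log"));

        System.out.println("Block size (bytes): " + status.getBlockSize());   // often 128 MB
        System.out.println("Replication factor: " + status.getReplication()); // often 3
    }
}
```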

HDFS ARCHITECTURE

HDFS Architecture (SundogEducation)

HDFS has a master/slave architecture:

Name Node: A single node that keeps track of where all those data blocks live. It maintains an edit log of what is being created, modified, or deleted.

Data Node: Stores the blocks of a file. Data nodes talk to each other to maintain the blocks; if one fails, the data is re-replicated from copies on other nodes.
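As a rough illustration of this split of responsibilities, the sketch below (my own example, again with a hypothetical path) asks for a file's block locations. The Name Node answers the query from its metadata, and the hosts returned are the Data Nodes actually holding each block; no file data is read at all.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical file path, for illustration only
        FileStatus status = fs.getFileStatus(new Path("/data/sensors.log"));

        // A metadata-only query answered by the Name Node
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " lives on Data Nodes: " + String.join(", ", block.getHosts()));
        }
    }
}
```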

READING A FILE

Process of reading a file from HDFS (SundogEducation)

Reading a file from HDFS works simply like this:

The Client Node (CN), which is a running application, asks the Name Node: "Yo Name Node, where is this file? I want it!". The Name Node returns: "Hey, your data lives on these blocks of these data nodes.". After this answer, the Client Node immediately goes to those data nodes, asks them to retrieve what it wants, and appreciates the data nodes because they are right there and work really fast.
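In client code, that whole conversation is hidden behind a single open() call. Here is a minimal sketch using the Java FileSystem API (the path is again a made-up example): the client library asks the Name Node for block locations and then streams the bytes straight from the Data Nodes.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // open() asks the Name Node where the blocks live,
        // then the stream reads the data from the Data Nodes
        try (FSDataInputStream in = fs.open(new Path("/data/sensors.log"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```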

WRITING A FILE

Process of writing a File (SundogEducation)

Writing a file to HDFS is a bit more complicated than reading one. First of all, the Client Node tells the Name Node that it has a new file and wants to keep track of its data blocks. The Name Node replies: "Okay, I've created an entry for you. You can go and create it.". The Client Node then talks to the first Data Node and says: "Here is my data and file.". That Data Node then talks to other data nodes to replicate the data blocks across them. After the data is stored, acknowledgements come back to the Client Node and finally to the Name Node, which records that the file has been written successfully.
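On the client side, that whole pipeline is again hidden behind create(). A minimal sketch (my example, with a hypothetical output path): the Name Node records the new entry, and the Data Nodes replicate the blocks behind the scenes as the stream is written and closed.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteToHdfs {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // create() registers the new file with the Name Node;
        // the Data Nodes replicate each block as it is written
        try (FSDataOutputStream out = fs.create(new Path("/data/output.txt"))) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
        // When the stream closes, the Name Node can record the file as complete
    }
}
```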

So, what happens if the one and only Name Node fails?

If the Name Node fails, the files stored on HDFS will not be accessible. There are some solutions, but first note that if your client node was already reading data from a data node before the name node failed, that read will complete successfully, because the client node already has the information about where the data is stored.

Now, what precautions should you take against name node failure?

There are several options for this situation:

1. You can back up your metadata to your local file system or NFS.
2. Using a secondary NameNode, which copies the primary namenode's metadata at certain intervals.
3. HDFS Federation, where each namenode manages a specific namespace volume.
4. HDFS High Availability, where there is a hot standby namenode using a shared edit log, and ZooKeeper tracks the active namenode and activates the standby on any failure.

MAPREDUCE FUNDAMENTAL CONCEPTS

MapReduce is one of the built-in core components of Hadoop.
It distributes the processing of data across your cluster so that it runs in parallel.

Divide the data into partitions that are MAPPED (transforming data) and REDUCED (aggregating data) by mappers and reducers. This is basically what MapReduce is.

MapReduce is resilient to failure, which means an application master monitors your mappers and reducers on each partition and intervenes when needed.

Mapping: Extracts and organizes the data into key:value pairs.
Reducing: Aggregates the values for each key.
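To see mapping and reducing in code, here is the classic word-count example sketched with Hadoop's Java MapReduce API. It is a standard illustration rather than something specific to these notes: the mapper emits (word, 1) pairs, and the reducer sums the counts for each word.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // MAP: transform each input line into (word, 1) key:value pairs
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // REDUCE: aggregate all the 1s for each word into a total count
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```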

MapReduce On A Cluster

Basically, it will carve up your input data into different partitions and then assign each node to a partition to process.

Hadoop keeps track of whether each computer has finished processing its partition.

Then, before the reduce stage, the elements are shuffled and sorted (the merge step).

Later, the reducers come in, and they can also run in parallel, so each node might be responsible for reducing a given set of keys.
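Tying it together, a driver like the sketch below (assuming the hypothetical WordCount mapper and reducer from the earlier example, and made-up HDFS paths) is what actually submits the job. Hadoop then handles carving the input into splits, the shuffle-and-sort, and running the reducers in parallel.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // Mapper and reducer classes from the sketch above
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Hypothetical HDFS paths; each input split becomes one map task
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/wordcount-output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```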

Graphical Example of MapReduce. (projectpro.io)

HANDLING FAILURE

The application master monitors worker tasks for errors and hangs:
- Restart as needed.
- Preferably on a different node.

What if the application master goes down?
- YARN can try to restart it.

What if an entire node goes down?
- This could be the application master.
- The resource manager will try to restart it.

What if the resource manager goes down?
- You can set up High Availability (HA) using ZooKeeper to have a hot standby.

So, these were all of my notes about Hadoop and MapReduce technologies. Of course, there are tons of things to learn about these topics, but I tried my best to learn and teach the very basics.

Thanks for reading.
