What is HDFS (Hadoop Distributed File System)?
HDFS stands for Hadoop Distributed File System, one of the file system implementations available to Hadoop; other implementations include the local file system and S3. It distributes disk I/O, which becomes a serious bottleneck when the data size is enormous, and it uses a large block size (the unit of reading and writing) to reduce seek cost and improve throughput.
You can see that disk I/O can be a bottleneck from these figures: while a round trip within a data center takes about 0.5 ms, a disk seek takes about 10 ms, and reading 1 MB sequentially takes about 20 ms.
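Plugging those figures into a back-of-the-envelope calculation shows why a large block size helps. This is a rough sketch, assuming (pessimistically) one seek per block; the 64 KB and 128 MB block sizes are illustrative, though 128 MB is a common HDFS default.

```python
# Latency figures from the text: seek ~= 10 ms, sequential read of 1 MB ~= 20 ms.
SEEK_MS = 10.0
READ_1MB_MS = 20.0

def read_time_ms(total_mb, block_mb):
    """Time to read total_mb if every block costs one seek plus sequential transfer."""
    blocks = total_mb / block_mb
    return blocks * SEEK_MS + total_mb * READ_1MB_MS

small = read_time_ms(1024, 0.064)  # 64 KB blocks: one seek per tiny block
large = read_time_ms(1024, 128)    # 128 MB blocks: only a handful of seeks

print(f"64 KB blocks:  {small / 1000:.0f} s")   # seek time dominates
print(f"128 MB blocks: {large / 1000:.1f} s")   # transfer time dominates
```

With 64 KB blocks, reading 1 GB spends far more time seeking than transferring; with 128 MB blocks, the seek overhead becomes negligible.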
HDFS is not good at handling many small files, because about 150 bytes of metadata per file are kept in the NameNode's memory. In addition, files are append-only: data can be appended to a file, but existing data cannot be modified in place.
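The 150-bytes-per-file figure makes the small-files limit easy to estimate. A minimal sizing sketch (the file counts are hypothetical):

```python
# ~150 bytes of NameNode heap per file, per the text.
BYTES_PER_FILE = 150

def namenode_heap_mb(num_files):
    """Rough NameNode memory needed just for file metadata, in MB."""
    return num_files * BYTES_PER_FILE / 1024 / 1024

# 10 million files already cost ~1.4 GB of NameNode heap for metadata alone.
print(f"{namenode_heap_mb(10_000_000):.0f} MB for 10 million files")
```

So a billion tiny files would need well over 100 GB of NameNode heap, which is why packing data into fewer, larger files is preferred.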
NameNode and DataNode
The cluster has DataNodes, which store the distributed blocks, and, per namespace, a NameNode, which tracks which DataNodes hold the blocks of each file. The NameNode keeps an fsimage, a snapshot of the file system metadata, and an edit log of changes not yet contained in the fsimage. The edit log is periodically merged into the fsimage by the Secondary NameNode.
If the NameNode goes down, no data can be read or written, so a new NameNode must be started. At that point, the state is recovered by applying the edit log to the fsimage, so both files should be backed up to prevent loss.
Starting a NameNode can take over 30 minutes on a big cluster, so a Standby NameNode can be deployed instead of a Secondary NameNode to improve availability. The Standby NameNode shares the edit log with the active NameNode via shared storage and keeps the latest state in memory, so failover can complete within a few tens of seconds after the active NameNode is determined to be dead.
Write and read
Write
The file system communicates with the NameNode to check that the file does not already exist and that the client has permission to create it; if there is no problem, it returns an FSDataOutputStream to the client. Written data is first put into a data queue, and the DataStreamer then asks the NameNode which DataNodes to write to. As many DataNodes as the replication factor are selected, and a pipeline is created between them so that data is written along it in order. On success, an ack packet is returned from each DataNode, so each data packet is moved to an ack queue and held there until acks have been returned from all DataNodes. If a DataNode fails, the partially written data is discarded, the failed node is removed from the pipeline, and a new pipeline is created.
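The queueing and pipeline-rebuild behavior above can be sketched as a toy simulation. This is not Hadoop's real code; the DataNode names and packet handling here are purely illustrative.

```python
from collections import deque

def pipeline_write(packets, datanodes, failed=frozenset()):
    """Toy model of the HDFS write pipeline: returns the DataNodes that
    ended up in the (possibly rebuilt) pipeline after all packets are acked."""
    # On failure, the failed node is dropped and the pipeline is rebuilt.
    pipeline = [dn for dn in datanodes if dn not in failed]
    data_queue = deque(packets)
    ack_queue = deque()
    while data_queue:
        pkt = data_queue.popleft()
        ack_queue.append(pkt)   # held until every DataNode acknowledges it
        for dn in pipeline:     # data flows node-to-node along the pipeline
            pass                # (each node would store pkt and forward it)
        ack_queue.popleft()     # all acks received -> remove from ack queue
    return pipeline

# With replication 3 and one failed node, the pipeline is rebuilt with 2 nodes:
print(pipeline_write(["p1", "p2"], ["dn1", "dn2", "dn3"], failed={"dn2"}))
# -> ['dn1', 'dn3']
```

In real HDFS, the under-replicated block is later re-replicated to bring it back up to the configured replication factor.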
Read
The file system asks the NameNode where the blocks of the file are. The NameNode sorts the DataNodes holding each block by closeness in the network topology and returns the address of the nearest one first. The file system then returns an FSDataInputStream containing the addresses to the client, which reads the data through it.
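The "sort by closeness" step can be illustrated with the tree-style distance metric commonly described for Hadoop (0 for the same node, 2 for the same rack, 4 for a different rack). The node paths and helper names below are illustrative, not HDFS's actual API.

```python
def distance(a, b):
    """Tree-style metric: 0 = same node, 2 = same rack, 4 = different rack."""
    if a == b:
        return 0
    if a.rsplit("/", 1)[0] == b.rsplit("/", 1)[0]:  # same rack prefix
        return 2
    return 4

def sort_replicas(client, replicas):
    """Order replica locations from nearest to farthest from the client."""
    return sorted(replicas, key=lambda dn: distance(client, dn))

replicas = ["/rack2/dn5", "/rack1/dn2", "/rack1/dn1"]
print(sort_replicas("/rack1/dn1", replicas))
# -> ['/rack1/dn1', '/rack1/dn2', '/rack2/dn5']
```

The client then reads each block from the first (nearest) address, falling back to the next one if a DataNode is unreachable.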
Start on a Single-Node Cluster
$ yum --enablerepo=epel -y install pdsh
$ echo $JAVA_HOME
/usr/lib/jvm/jre
$ wget http://ftp.jaist.ac.jp/pub/apache/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
$ tar xvzf hadoop-2.7.3.tar.gz
$ cd hadoop-2.7.3
$ bin/hadoop version
Hadoop 2.7.3
Set the default file system to HDFS and the replication factor to 1.
$ vi etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
...
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
$ vi etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
...
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Set up passwordless ssh so the scripts can start/stop the Hadoop daemons.
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
$ ssh localhost
Start.
$ bin/hdfs namenode -format
$ sbin/start-dfs.sh
localhost: starting namenode, ...
localhost: starting datanode, ...
0.0.0.0: starting secondarynamenode, ...
Create a directory and a file and access them.
$ bin/hdfs dfs -mkdir /home
$ bin/hdfs dfs -mkdir /user/ec2-user
$ echo 'aaaaa' > hoge
$ bin/hdfs dfs -put hoge ./
$ bin/hdfs dfs -put hoge ./
put: `hoge': File exists
$ bin/hdfs dfs -appendToFile hoge hoge
$ bin/hdfs dfs -ls ./
Found 1 items
-rw-r--r-- 1 ec2-user supergroup 12 2017-08-14 13:44 hoge
$ bin/hdfs dfs -cat hoge
aaaaa
aaaaa
Run the filesystem check utility.
$ bin/hdfs fsck ./ -files -blocks
Connecting to namenode via http://localhost:50070/fsck?ugi=ec2-user&files=1&blocks=1&path=%2Fuser%2Fec2-user
FSCK started by ec2-user (auth:SIMPLE) from /127.0.0.1 for path /user/ec2-user at Mon Aug 14 13:44:48 UTC 2017
/user/ec2-user <dir>
/user/ec2-user/hoge 12 bytes, 1 block(s): OK
0. BP-478671077-172.31.3.159-1502715364675:blk_1073741825_1002 len=12 repl=1
Status: HEALTHY
Total size: 12 B
Total dirs: 1
Total files: 1
Total symlinks: 0
Total blocks (validated): 1 (avg. block size 12 B)
Minimally replicated blocks: 1 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 1
Average block replication: 1.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 1
Number of racks: 1
FSCK ended at Mon Aug 14 13:44:48 UTC 2017 in 2 milliseconds
References
Hadoop: The Definitive Guide, 4th Edition
A Guide to Checkpointing in Hadoop – Cloudera Engineering Blog