Friday, June 1, 2012

Managing a Hadoop Cluster for Newbies


When I installed Hadoop, I started with all the default settings, so everything was using directories under /tmp. These default locations are not advisable and should be changed immediately for any real, practical use.

Here is a table listing the default and suggested locations (assuming you have already created a hadoop user).

| Directory         | Description                                       | Default location                | Suggested location    |
|-------------------|---------------------------------------------------|---------------------------------|-----------------------|
| HADOOP_LOG_DIR    | Output location for log files from daemons        | ${HADOOP_HOME}/logs             | /var/log/hadoop       |
| hadoop.tmp.dir    | A base for other temporary directories            | /tmp/hadoop-${user.name}        | /tmp/hadoop           |
| dfs.name.dir      | Where the NameNode metadata should be stored      | ${hadoop.tmp.dir}/dfs/name      | /home/hadoop/dfs/name |
| dfs.data.dir      | Where DataNodes store their blocks                | ${hadoop.tmp.dir}/dfs/data      | /home/hadoop/dfs/data |
| mapred.system.dir | The in-HDFS path to shared MapReduce system files | ${hadoop.tmp.dir}/mapred/system | /hadoop/mapred/system |
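To put the suggested layout in place on a node, something like the following can be used. This is only a sketch: the ROOT variable is my own addition so you can dry-run it safely; on a real node you would run it as root with ROOT=/.

```shell
# Sketch: create the suggested local directories and hand them to the
# hadoop user. ROOT defaults to a scratch directory for a safe dry run;
# on a real node, run this as root with ROOT=/ .
ROOT="${ROOT:-$(mktemp -d)}"

mkdir -p "$ROOT/var/log/hadoop" \
         "$ROOT/tmp/hadoop" \
         "$ROOT/home/hadoop/dfs/name" \
         "$ROOT/home/hadoop/dfs/data"

# The hadoop user must own its log and storage directories
# (this fails harmlessly in a dry run where no hadoop user exists):
chown -R hadoop:hadoop "$ROOT/var/log/hadoop" "$ROOT/home/hadoop/dfs" 2>/dev/null \
  || echo "re-run chown as root once the hadoop user exists"

echo "created layout under $ROOT"
```

Note that mapred.system.dir is a path inside HDFS, so there is no local directory to create for it.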


The majority of Hadoop settings reside in XML configuration files. Prior to Hadoop 0.20, everything lived in hadoop-default.xml and hadoop-site.xml. As the names suggest, hadoop-default.xml contains all the default settings, and hadoop-site.xml is the file you edit to override any of them.

If you are like me (running Hadoop 1.x), or on any later version (anything > 0.20), hadoop-site.xml has been separated into:
- core-site.xml : where we specify the hostname and port of the NameNode
- hdfs-site.xml : where we specify HDFS settings such as the replication factor and storage directories
- mapred-site.xml : where we specify the hostname and port of the JobTracker
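For completeness, minimal fragments for the other two files might look like the following. The hostname master-node and the ports 9000/9001 are assumptions for illustration; substitute your own NameNode/JobTracker host.

core-site.xml:

<property>
  <name>fs.default.name</name>
  <value>hdfs://master-node:9000</value>
</property>

mapred-site.xml:

<property>
  <name>mapred.job.tracker</name>
  <value>master-node:9001</value>
</property>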

So, we can add the NameNode and DataNode directories to hdfs-site.xml:


<property>
  <name>dfs.name.dir</name>
  <value>/home/hadoop/dfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/home/hadoop/dfs/data</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop</value>
</property>
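A typo in a property name silently falls back to the default, so it is worth listing the name/value pairs back out after editing. This sketch writes a sample file to a temp path (the CONF variable is my own; point it at your real conf/hdfs-site.xml) and extracts the pairs with sed and paste:

```shell
# Sketch: echo back the name/value pairs from an hdfs-site.xml-style file
# so a mistyped property name is easy to spot.
CONF="${CONF:-$(mktemp)}"

# Sample file, in the same shape as conf/hdfs-site.xml:
cat > "$CONF" <<'EOF'
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/dfs/data</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop</value>
  </property>
</configuration>
EOF

# Print each <name> and <value>, then join them pairwise into one line each:
sed -n -e 's:.*<name>\(.*\)</name>.*:\1:p' \
       -e 's:.*<value>\(.*\)</value>.*:\1:p' "$CONF" | paste - -
```

Also remember that if you change dfs.name.dir on an existing installation, the NameNode must be reformatted (or the old data moved) before it will start.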


