Sunday, July 15, 2012

Hadoop MapReduce Job for analyzing Tweets - Part I

Recently, I uploaded my Twitter MapReduce jobs for analyzing tweets to GitHub:
https://github.com/satishvarmadandu/MyBigData

Audience:
If you have already worked through the Hadoop WordCount example and are looking for a real-world application of Hadoop, then this blog might be helpful. Are you
- looking to start working on real-world applications to see the power of Hadoop?
- looking for a ready-made open-source Hadoop MapReduce package that you can extend with your own features?
- wondering how to unit-test your MapReduce jobs?

If the answer is yes to any of these questions, this project might be helpful for you.

Why MyBigData:

MyBigData applies Hadoop concepts to real-world data (mainly a Twitter data set). Users can download the entire project and either run it as-is or extend it with their own features. I like Twitter because it is so open and there is such a wealth of information that we can derive real value from the tweets.

What is MyBigData:

MyBigData contains MapReduce jobs to perform tweet analytics. Users specify keywords to track in a file, and we use Twitter's Streaming API to collect all the tweets matching those keywords. For each tweet, we extract entities such as URLs, user_mentions, and hashtags. This project
  • contains a MapReduce job to find the most popular URLs for every hour (a small sketch follows this list).
  • includes some performance-tuning settings to improve MapReduce performance.
  • contains MRUnit and JUnit test cases to demonstrate unit testing for Hadoop MapReduce jobs.
  • contains MapReduce jobs written against both the old and the new Hadoop API (the old API was deprecated starting with 0.20.x) to demonstrate the migration.
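
To make the popular-URLs job concrete, here is a minimal sketch of an hourly URL-count job written against the new (org.apache.hadoop.mapreduce) API. This is not the project's actual code: the class names are illustrative, and it assumes an earlier step has already reduced each tweet to a tab-separated "timestamp<TAB>url" line.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PopularUrls {

    // Emits ((hour, url), 1) for every extracted URL record.
    public static class UrlMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            if (fields.length < 2) {
                return; // skip malformed records
            }
            // keep only the hour part of the timestamp, e.g. "2012-07-15T10"
            String hour = fields[0].length() >= 13 ? fields[0].substring(0, 13) : fields[0];
            outKey.set(hour + "\t" + fields[1]);
            context.write(outKey, ONE);
        }
    }

    // Sums the counts for each (hour, url) key.
    public static class UrlReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "popular-urls-per-hour");
        job.setJarByClass(PopularUrls.class);
        job.setMapperClass(UrlMapper.class);
        job.setCombinerClass(UrlReducer.class); // safe: the reducer only sums
        job.setReducerClass(UrlReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

And in the same spirit, a small MRUnit test (again just a sketch) that feeds the mapper one line and asserts the emitted (hour, url) pair:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class UrlMapperTest {
    @Test
    public void emitsHourAndUrl() throws Exception {
        new MapDriver<LongWritable, Text, Text, IntWritable>()
                .withMapper(new PopularUrls.UrlMapper())
                .withInput(new LongWritable(0), new Text("2012-07-15T10:05:00\thttp://t.co/abc"))
                .withOutput(new Text("2012-07-15T10\thttp://t.co/abc"), new IntWritable(1))
                .runTest();
    }
}

Note that this job only produces the per-hour counts; picking the top N URLs for each hour is a small follow-up step (a second job or a quick sort of the output).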


Twitter provides a Streaming API. The default access level has the following limits (a minimal client sketch follows this list):
- we can track up to 400 keywords
- we can follow up to 5,000 user IDs
- we receive at most 1% of the total firehose. Twitter reported 250M tweets/day as of October 2011, and the volume keeps growing.
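
Here is a minimal sketch of a keyword-tracking client. It assumes Twitter4J as the Streaming API client, which is an assumption on my part rather than a statement about MyBigData's implementation, and it expects OAuth credentials in a twitter4j.properties file on the classpath.

import twitter4j.FilterQuery;
import twitter4j.Status;
import twitter4j.StatusAdapter;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;
import twitter4j.URLEntity;

public class KeywordTracker {
    public static void main(String[] args) {
        // credentials are read from twitter4j.properties on the classpath (an assumption)
        TwitterStream stream = new TwitterStreamFactory().getInstance();
        stream.addListener(new StatusAdapter() {
            @Override
            public void onStatus(Status status) {
                // print one line per URL entity: tweet creation time + URL
                URLEntity[] urls = status.getURLEntities();
                if (urls == null) {
                    return;
                }
                for (URLEntity url : urls) {
                    System.out.println(status.getCreatedAt() + "\t" + url.getURL());
                }
            }
        });
        // the default access level allows tracking up to 400 keywords
        stream.filter(new FilterQuery().track(new String[] {"hadoop", "mapreduce"}));
    }
}

A real collector would write these per-tweet records to HDFS in whatever format the downstream MapReduce jobs expect, rather than printing them.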



In Part II, we will see how to get MyBigData from GitHub and run it to analyze tweets with Hadoop.

Monday, June 25, 2012

Hadoop cluster setup: Firewall issues


Expectations: This blog entry is not a step-by-step guide to setting up a Hadoop cluster; there are numerous articles on that already. The intent of this post is to provide solutions for a couple of issues I faced while setting up the cluster (unfortunately, I could not find a direct answer for these issues on Google, so I am blogging them here).


Recently, I was tasked with creating a new Hadoop cluster on our new CentOS machines. The first time I created a cluster, everything went smoothly, but with the new machines I ran into a few problems.

Issue 1: DataNode cannot connect to NameNode
Call to master/192.168.143.xxx:54310 failed on local exception: java.net.NoRouteToHostException: No route to host

I configured everything and started the NameNode and DataNodes using:
# cd $HADOOP_HOME
# ./bin/start-dfs.sh

NameNode logs:

 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: NameNode up at: master/192.168.143.211:54310
2012-06-25 19:27:40,338 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 54310: starting


The NameNode started successfully.


DataNode logs:
ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: 

Call to master/192.168.xxx.xxx:54310 failed on local exception: java.net.NoRouteToHostException: No route to host
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:1063)

        at org.apache.hadoop.ipc.Client.call(Client.java:1031)
        at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:198)
      at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)

        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
        ... 13 more     
Caused by: java.net.NoRouteToHostException: No route to host


This clearly says that the DataNode machines cannot connect to the NameNode. So I tried hitting the NameNode UI in the browser:

http://192.168.143.xxx:50070 (masked IP)

Result: the UI failed to connect and timed out, so it seemed something was wrong with the NameNode.

But when I did a telnet to that NameNode port:
# telnet 192.168.xxx.xxx 50070

Trying 192.168.xxx.xxx...
Connected to 192.168.xxx.xxx.
Escape character is '^]'.

So this tells us the NameNode is up and running but is not reachable from outside; the problem seems to be the firewall. I therefore tried disabling the firewall on my NameNode machine.

Log in as root on the NameNode machine and execute the following commands:
# service iptables save
# service iptables stop
# chkconfig iptables off

After disabling the firewall, I restarted DFS. Now my DataNodes can connect to my NameNode, and the NameNode UI works fine as well.


Issue 2:
Error: java.io.IOException: File /tmp/hadoop-hadoop/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1

The main cause of this problem (99% of the time) is configuration, usually your conf/slaves file or your /etc/hosts entries, and there are many blog posts addressing that. But the remaining 1% of the time it is due to firewall issues on your DataNodes, so run the commands above to disable the firewall on your DataNode machines as well.

I restarted MapReduce (bin/start-mapred.sh) and everything looks good now.

Issue 3: If you see the following exception while pushing a file to HDFS, then you need to disable the firewall on the slave machine.

INFO hdfs.DFSClient: Exception in createBlockOutputStream java.net.NoRouteToHostException: No route to host
INFO hdfs.DFSClient: Abandoning block blk_3519823924710640125_1087
INFO hdfs.DFSClient: Excluding datanode 192.168.xxx.xxx:50010


Friday, June 1, 2012

Managing Hadoop Cluster for newbies


When I installed Hadoop, I started with all default settings, so everything used directories under /tmp. The default locations are not advisable and should be changed immediately for any real practical use.

Here is a table that lists the default location and a suggested location for each directory (assuming you have already created a hadoop user).

Directory | Description | Default location | Suggested location
HADOOP_LOG_DIR | Output location for log files from daemons | ${HADOOP_HOME}/logs | /var/log/hadoop
hadoop.tmp.dir | A base for other temporary directories | /tmp/hadoop-${user.name} | /tmp/hadoop
dfs.name.dir | Where the NameNode metadata should be stored | ${hadoop.tmp.dir}/dfs/name | /home/hadoop/dfs/name
dfs.data.dir | Where DataNodes store their blocks | ${hadoop.tmp.dir}/dfs/data | /home/hadoop/dfs/data
mapred.system.dir | The in-HDFS path to shared MapReduce system files | ${hadoop.tmp.dir}/mapred/system | /hadoop/mapred/system


The majority of Hadoop settings reside in XML configuration files. Prior to Hadoop 0.20, everything was part of hadoop-default.xml and hadoop-site.xml. As the names convey, hadoop-default.xml contains all the default settings, and if you want to override anything, hadoop-site.xml is the file to work on.

If you are like me, running Hadoop 1.x (or any version later than 0.20), hadoop-site.xml has been separated into
- core-site.xml : where we specify the hostname and port of the NameNode (fs.default.name)
- hdfs-site.xml : where we specify HDFS settings such as the replication factor and the NameNode/DataNode directories
- mapred-site.xml : where we specify the hostname and port of the JobTracker (mapred.job.tracker)

So we can add the NameNode and DataNode directories to hdfs-site.xml (hadoop.tmp.dir itself normally belongs in core-site.xml):


<property>
  <name>dfs.name.dir</name>
  <value>/home/hadoop/dfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/home/hadoop/dfs/data</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop</value>
</property>



Thursday, May 31, 2012

Hadoop: Starting in Pseudo-distributed mode


In this blog entry, I am going to list common mistakes and gotchas while starting Hadoop:

1) If you get the following exception while starting Hadoop:

localhost: Exception in thread "main" java.lang.IllegalArgumentException: Does not contain a valid host:port authority: file:///


Then it is most likely that
a) either the mapred configuration is empty or not valid:
# cat conf/mapred-site.xml

<configuration>

</configuration>

b) or your conf still points to local (standalone) mode when you start Hadoop (refer to #2 below for switching between modes with a symlink).



2) Switching between modes:
For newbies: to switch between local mode, pseudo-distributed mode, and fully distributed mode, it is advisable to create three different configuration directories

conf.standalone, conf.pseudo, conf.full

Then just create a soft link that points to the appropriate mode. For example:
# cd $HADOOP_HOME
# ln -s conf.standalone conf

conf -> conf.standalone

Tuesday, March 27, 2012

How to fix: ANT Error - Could not create task or type of type: propertyfile


If you see this error while running an Ant build, then try the following three steps.

BUILD FAILED ..../build.xml:234: Could not create task or type of type: propertyfile.
Ant could not find the task or a class this task relies upon.

Solution (one of 1, 2, or 3 will solve the problem):
1) Check the Ant version
# which ant
# ant -version
Make sure you have at least version 1.6.x, as this task is not supported in older Ant versions.

2) Make sure you have the jar "ant-nodeps.jar" in $ANT_HOME/lib. This is the jar that contains PropertyFile.class.


3) Make sure that you DON'T have the file /etc/ant.conf. If it exists, it points to the JPackage version of Ant. Just remove this file and that's the end of the story.



How to fix: CoreData error "NSInternalInconsistencyException" - iPhone App Development


I am migrating this post from my old blog. It might help if someone gets stuck with this "NSInternalInconsistencyException".

If you see the following exception while working with Core Data (iPhone SDK):


[Session started at 2010-06-01 22:47:09 -0700.]
2010-06-01 22:47:11.150 MyCoreDataList[88744:20b] *** Terminating app due to uncaught exception 'NSInternalInconsistencyException', reason: '+entityForName: could not locate an NSManagedObjectModel for entity name 'Stock''
2010-06-01 22:47:11.153 MyCoreDataList[88744:20b] Stack: (
    9905243,
    2444394043,
    7893163,
...

This happens because you are trying to load a DB row from a managed object context that hasn't been set up yet. Your option: set it up right there, or just before loading this view.


Solution
Place the following segment in your RootViewController's viewDidLoad method:


if (managedObjectContext == nil) {
    managedObjectContext = [(MyCoreDataListAppDelegate *)[[UIApplication sharedApplication] delegate] managedObjectContext];
}

Fix: cvs setup error with libcom_err.so.3


Recently, I tried to set up CVS on one of my dev Linux boxes. When I ran the cvs command, I got:
# cvs

Error while loading shared libraries: libcom_err.so.3: cannot open shared Object File


libcom_err.so comes from Kerberos (krb5).

How to resolve this:
1) First, check your Red Hat version. There is something odd about the file-naming convention: on old Red Hat versions the file was named "libcom_err.so.3", whereas the latest Red Hat versions ship "libcom_err.so.2" and "libcom_err.so.2.1". I know, you would expect it to be the other way around, and I am not sure about the reasoning behind this.

2) Run the locate command to see whether you have this .so file:
# locate libcom_err
/lib/libcom_err.so.2
/lib/libcom_err.so.2.1

On my CentOS machine, the file is under /lib, so it seems I have version 2.x, which is newer than version 3. I tried making a symlink from "3" to "2.1":

# ln -s /usr/lib/libcom_err.so.2.1 /usr/lib/libcom_err.so.3 

This solved the cvs error.