Sunday, July 15, 2012

Hadoop MapReduce Job for analyzing Tweets - Part I

I recently uploaded my Twitter MapReduce jobs for analyzing tweets to GitHub:
https://github.com/satishvarmadandu/MyBigData

Audience:
If you have already worked through the Hadoop WordCount example and are looking for a real-world application of Hadoop, this blog might be helpful. Are you
- looking to start working on some real-world applications to see the power of Hadoop?
- looking for a ready-made open-source Hadoop MapReduce package that you can extend with your own features?
- wondering how to unit-test your MapReduce jobs?

If the answer is yes to any of the above questions, then this project might be helpful for you.

Why MyBigData:

MyBigData applies Hadoop concepts to real-world data (mainly a Twitter data set). Users can download the entire project and either run it as is or extend it to incorporate their own features (mainly around Twitter). I like Twitter because it is so open, and there is such a wealth of information in tweets that we can derive real value from them.

What is MyBigData:

MyBigData contains MapReduce jobs that perform tweet analytics. Users specify keywords to track in a file, and we use Twitter's Streaming API to collect all tweets matching those keywords. For each tweet, we extract entities such as URLs, user mentions, and hashtags. This project
  • contains a MapReduce job to find the most popular URLs for every hour (a minimal sketch of such a job, and an MRUnit test for it, follows this list).
  • includes performance-tuning settings to improve MapReduce performance.
  • contains MRUnit (MapReduce unit testing) and JUnit test cases to demonstrate unit testing of Hadoop MapReduce jobs.
  • contains MapReduce jobs written with both the old and the new Hadoop API (the old API was deprecated starting in 0.20.*) to demonstrate the migration.
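To make the hourly popular-URL job concrete, here is a minimal sketch of what such a mapper and reducer could look like. This is not the project's actual code: it assumes each input record is a tab-separated timestamp (e.g. "2012-07-15 13:45:01") followed by an extracted URL, and the class names are illustrative (each public class would live in its own .java file).

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical mapper: emits ("<hour>\t<url>", 1) for each tweet record.
public class PopularUrlMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text hourAndUrl = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed record layout: "yyyy-MM-dd HH:mm:ss<TAB>url"
        String[] fields = value.toString().split("\t");
        if (fields.length < 2) {
            return; // skip malformed records
        }
        // The first 13 characters ("yyyy-MM-dd HH") identify the hour bucket.
        hourAndUrl.set(fields[0].substring(0, 13) + "\t" + fields[1]);
        context.write(hourAndUrl, ONE);
    }
}

// Hypothetical reducer: sums the counts for each (hour, url) pair.
public class PopularUrlReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```

Jobs written against the new Hadoop API can be unit-tested without a cluster. The test below is a sketch assuming MRUnit 0.9's new-API MapDriver: it feeds the hypothetical mapper a single record and asserts the expected (hour + URL, 1) output.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class PopularUrlMapperTest {

    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        mapDriver = MapDriver.newMapDriver(new PopularUrlMapper());
    }

    @Test
    public void emitsHourAndUrlWithCountOne() throws Exception {
        // One input record should yield exactly one (hour + url, 1) pair.
        mapDriver
            .withInput(new LongWritable(0),
                       new Text("2012-07-15 13:45:01\thttp://example.com/a"))
            .withOutput(new Text("2012-07-15 13\thttp://example.com/a"),
                        new IntWritable(1))
            .runTest();
    }
}
```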


Twitter provides a Streaming API. Default access comes with the following limits:
- we can track up to 400 keywords
- we can follow up to 5,000 user IDs
- we receive at most 1% of the total firehose. Twitter reported 250M tweets/day as of Oct 2011, and that volume keeps growing.

A minimal sketch of connecting to the Streaming API with these defaults follows.
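As a rough illustration of how the tweets can be collected before the MapReduce jobs run, here is a sketch of tracking keywords via the twitter4j library (in its 2.2.x-era API). Whether MyBigData actually uses twitter4j, and the keyword values shown, are assumptions for the sketch; OAuth credentials are expected in a twitter4j.properties file on the classpath.

```java
import twitter4j.FilterQuery;
import twitter4j.Status;
import twitter4j.StatusDeletionNotice;
import twitter4j.StatusListener;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;

public class KeywordTracker {

    public static void main(String[] args) {
        // Reads OAuth credentials from twitter4j.properties on the classpath.
        TwitterStream stream = new TwitterStreamFactory().getInstance();

        stream.addListener(new StatusListener() {
            @Override
            public void onStatus(Status status) {
                // The real project would persist tweets (e.g. to HDFS) for the
                // MapReduce jobs; printing is just for this sketch.
                System.out.println(status.getUser().getScreenName()
                        + ": " + status.getText());
            }

            @Override
            public void onTrackLimitationNotice(int numberOfLimitedStatuses) {
                // Fired when matching tweets exceed the 1% firehose cap.
            }

            @Override public void onDeletionNotice(StatusDeletionNotice notice) { }
            @Override public void onScrubGeo(long userId, long upToStatusId) { }
            @Override public void onException(Exception ex) { ex.printStackTrace(); }
        });

        // filter() tracks the given keywords (up to 400 with default access);
        // "hadoop" and "bigdata" are placeholder keywords.
        stream.filter(new FilterQuery().track(new String[] { "hadoop", "bigdata" }));
    }
}
```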



In Part II, we will see how to get MyBigData from GitHub and run it to analyze tweets with Hadoop.