Sunday, July 15, 2012

Hadoop MapReduce Job for analyzing Tweets - Part I

I recently uploaded my Twitter MapReduce jobs for analyzing tweets to GitHub:
https://github.com/satishvarmadandu/MyBigData

Audience:
If you have already worked through the Hadoop WordCount example and are looking for a real-world application of Hadoop, this blog might be helpful. Are you
- looking to start working on some real-world applications to see the power of Hadoop?
- looking for a ready-made open-source Hadoop MapReduce package that you can extend with your own features?
- wondering how to unit-test your MapReduce jobs?

If the answer is yes to any of the above questions, then this project might be helpful for you.

Why MyBigData:

MyBigData applies Hadoop concepts to real-world data (mainly a Twitter data set). Users can download the entire project and either run it as is or extend it to incorporate their own features (mainly around Twitter). I like Twitter because it is so open, and there is such a wealth of information in tweets that we can derive real value from them.

What is MyBigData:

MyBigData contains MapReduce jobs that perform tweet analytics. Users specify keywords to track in a file, and we use Twitter's Streaming API to collect all tweets matching those keywords. For each tweet, we extract entities such as URLs, user mentions, and hashtags. This project
  • contains a MapReduce job to find the most popular URLs for every hour (a minimal sketch of such a job, and an MRUnit test for it, follows this list).
  • includes performance-tuning settings to improve MapReduce performance.
  • contains MRUnit (MapReduce unit testing) and JUnit test cases to demonstrate unit testing of Hadoop MapReduce jobs.
  • contains MapReduce jobs written with both the old and the new Hadoop API (the old API was deprecated starting in 0.20.*) to demonstrate the migration.
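To make the hourly popular-URL job concrete, here is a minimal sketch of what such a mapper and reducer could look like. This is not the project's actual code: it assumes each input record is a tab-separated timestamp (e.g. "2012-07-15 13:45:01") followed by an extracted URL, and the class names are illustrative (each public class would live in its own .java file).

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical mapper: emits ("<hour>\t<url>", 1) for each tweet record.
public class PopularUrlMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text hourAndUrl = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed record layout: "yyyy-MM-dd HH:mm:ss<TAB>url"
        String[] fields = value.toString().split("\t");
        if (fields.length < 2) {
            return; // skip malformed records
        }
        // The first 13 characters ("yyyy-MM-dd HH") identify the hour bucket.
        hourAndUrl.set(fields[0].substring(0, 13) + "\t" + fields[1]);
        context.write(hourAndUrl, ONE);
    }
}

// Hypothetical reducer: sums the counts for each (hour, url) pair.
public class PopularUrlReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```

Jobs written against the new Hadoop API can be unit-tested without a cluster. The test below is a sketch assuming MRUnit 0.9's new-API MapDriver: it feeds the hypothetical mapper a single record and asserts the expected (hour + URL, 1) output.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class PopularUrlMapperTest {

    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        mapDriver = MapDriver.newMapDriver(new PopularUrlMapper());
    }

    @Test
    public void emitsHourAndUrlWithCountOne() throws Exception {
        // One input record should yield exactly one (hour + url, 1) pair.
        mapDriver
            .withInput(new LongWritable(0),
                       new Text("2012-07-15 13:45:01\thttp://example.com/a"))
            .withOutput(new Text("2012-07-15 13\thttp://example.com/a"),
                        new IntWritable(1))
            .runTest();
    }
}
```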


Twitter provides a Streaming API. Default access comes with the following limits:
- we can track up to 400 keywords
- we can follow up to 5,000 user IDs
- we receive at most 1% of the total firehose. Twitter reported 250M tweets/day as of Oct 2011, and that volume keeps growing.

A minimal sketch of connecting to the Streaming API with these defaults follows.
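As a rough illustration of how the tweets can be collected before the MapReduce jobs run, here is a sketch of tracking keywords via the twitter4j library (in its 2.2.x-era API). Whether MyBigData actually uses twitter4j, and the keyword values shown, are assumptions for the sketch; OAuth credentials are expected in a twitter4j.properties file on the classpath.

```java
import twitter4j.FilterQuery;
import twitter4j.Status;
import twitter4j.StatusDeletionNotice;
import twitter4j.StatusListener;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;

public class KeywordTracker {

    public static void main(String[] args) {
        // Reads OAuth credentials from twitter4j.properties on the classpath.
        TwitterStream stream = new TwitterStreamFactory().getInstance();

        stream.addListener(new StatusListener() {
            @Override
            public void onStatus(Status status) {
                // The real project would persist tweets (e.g. to HDFS) for the
                // MapReduce jobs; printing is just for this sketch.
                System.out.println(status.getUser().getScreenName()
                        + ": " + status.getText());
            }

            @Override
            public void onTrackLimitationNotice(int numberOfLimitedStatuses) {
                // Fired when matching tweets exceed the 1% firehose cap.
            }

            @Override public void onDeletionNotice(StatusDeletionNotice notice) { }
            @Override public void onScrubGeo(long userId, long upToStatusId) { }
            @Override public void onException(Exception ex) { ex.printStackTrace(); }
        });

        // filter() tracks the given keywords (up to 400 with default access);
        // "hadoop" and "bigdata" are placeholder keywords.
        stream.filter(new FilterQuery().track(new String[] { "hadoop", "bigdata" }));
    }
}
```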



In Part II, we will see how to get MyBigData from GitHub and run it to analyze tweets with Hadoop.