Hadoop and Dumbo
Dumbo is a Python framework for writing MapReduce flows, with or without Hadoop. Until now it has been a pain to get going, as it relied on a number of patches to Hadoop (for byte streams, type codes, etc.) to make it work. No longer: the necessary patches have now made it into core as of Hadoop 1.0.2.
On Ubuntu 12.04, all I needed was the Debian package from here (installed as per these instructions), and then to run sudo easy_install dumbo.
The only catch is that Dumbo does not currently recognise the Debian package layout used by the Hadoop package maintainers, so I found that I had to make a one-line patch to compensate for it:
diff --git a/dumbo/util.py b/dumbo/util.py
index a57166d..cd35df3 100644
--- a/dumbo/util.py
+++ b/dumbo/util.py
@@ -267,6 +267,7 @@ def findjar(hadoop, name):
     hadoop home directory and component base name (e.g 'streaming')"""
     jardir_candidates = filter(os.path.exists, [
+        os.path.join(hadoop, 'share', 'hadoop', 'contrib', name),
         os.path.join(hadoop, 'mapred', 'build', 'contrib', name),
         os.path.join(hadoop, 'build', 'contrib', name),
         os.path.join(hadoop, 'mapred', 'contrib', name, 'lib'),
And then run the quick tutorial example from here like so:
hadoop fs -copyFromLocal /var/log/apache2/access.log /user/hduser/access.log
hadoop fs -ls /user/hduser/
dumbo start ipcount.py -hadoop /usr -input /user/hduser/access.log -output ipcounts
dumbo cat ipcounts/part* -hadoop /usr | sort -k2,2nr | head -n 5
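For reference, the kind of mapper and reducer the ipcount.py tutorial script contains can be sketched as below. This is a hedged reconstruction, not the tutorial's exact code: the mapper takes an Apache access-log line and emits the client IP (the first whitespace-separated field) with a count of 1, and the reducer sums the counts per IP. With Dumbo installed, these two functions would be wired together via dumbo.run(mapper, reducer); they are shown standalone here so the logic is testable without Hadoop.

```python
def mapper(key, value):
    # value is one access-log line; the client IP is the first field
    yield value.split(' ', 1)[0], 1

def reducer(key, values):
    # sum the per-IP counts grouped under this key
    yield key, sum(values)

# With Dumbo available, the script would end with:
#   if __name__ == "__main__":
#       import dumbo
#       dumbo.run(mapper, reducer)
```

The dumbo cat ... | sort -k2,2nr | head -n 5 pipeline at the end then sorts the (ip, count) output numerically on the second column and prints the top five IPs.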
Posted by PiersHarding at April 13, 2012 5:20 PM