I've been building up my background knowledge of the toolsets currently used in Data Science. One part of this is R, and another is Hadoop.

Hadoop is a big thing, and takes (to my mind) quite a lot of effort to get going and to understand how you can bend it to your will. Part of this learning process has been about finding a comfortable installation pattern for Linux - in particular Ubuntu - and the best help I've found so far has been from Michael Noll. The things I had to be careful about were getting ssh working and getting name resolution exactly right on every node that you put in your cluster, since you distribute things like /etc/hadoop/masters and the *-site.xml config files across them.
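As a rough illustration of the name resolution check, here is a minimal sketch that could be run on each node to see what every cluster hostname resolves to. The function name `check_hosts` and the hostnames `master` and `slave` are my own placeholders, not anything Hadoop provides, and this only covers resolution, not ssh.

```python
import socket


def check_hosts(hosts, resolver=socket.gethostbyname):
    """Map each hostname to the IP it resolves to, or None if lookup fails."""
    results = {}
    for host in hosts:
        try:
            results[host] = resolver(host)
        except OSError:
            # socket.gaierror (failed lookup) is a subclass of OSError.
            results[host] = None
    return results


def report(hosts):
    """Print one line per host so the output can be compared across nodes."""
    for host, ip in check_hosts(hosts).items():
        print(f"{host}\t{ip if ip else 'UNRESOLVED'}")


# Example call with hypothetical node names; replace with your own:
# report(["master", "slave"])
```

Running the same check on every node and comparing the output should show up a host that resolves differently (or not at all) on one machine.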

The next stage was to find a development pattern that let me avoid Java. The answer to this, for me, is Hadoop Streaming. This basically allows you to pipe data in and out of programs written in your favourite language via standard input and output - and in this case Michael does brilliantly again with Python and MapReduce.
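To give a flavour of what a streaming job looks like, here is a minimal word count sketch in Python. It is my own illustration rather than Michael's code: the file name `wc_stream.py` and the `map`/`reduce` command-line argument are assumptions made for the example. The mapper emits tab-separated `word<TAB>1` lines, and the reducer relies on its input arriving sorted by key, which is what the streaming framework provides between the two phases.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts for each word; input must be sorted by word."""
    pairs = (
        line.rstrip("\n").split("\t", 1)
        for line in sorted_lines
        if line.strip()
    )
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Hypothetical entry point: pass "map" or "reduce" as the first argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

Because it is all just stdin and stdout, the script can be tried locally without a cluster, with `sort` standing in for the shuffle phase: `cat input.txt | python wc_stream.py map | sort | python wc_stream.py reduce`.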

Posted by PiersHarding at March 26, 2012 9:23 AM