MapReduce

Technology

Jan 18

Written By

In the past couple weeks I've been playing with Hadoop -- or more specifically Amazon's AWS implementation called Elastic MapReduce (EMR).Taking a step back, Hadoop is a way of doing cluster computing when you're dealing with big datasets that can easily be broken up to have multiple machines work on them at once.I have to say the learning curve is quite a bit steeper than I would have guessed. It's not that Amazon did anything to make it hard -- on the contrary, Amazon did a lot to make it easy. I, for one, would not want to be responsible for setting up a Hadoop cluster let alone maintain that cluster. With AWS I can say something like "I want a cluster of five big machines to run this job for me until it's complete." Complete can take hours or days. When it's done, I stop paying for it.One thing I would recommend without any hesitation is MRJob. It's a Python package that wraps a lot of the functionality that EMR provides and gives you a nice, clean, repeatable call interface for EMR. I like it! Even with that it took a goodly while to figure out how to get it all working.Beyond me going gaga over EMR and such, I can't get into the details of what I'm working on. Maybe once things are launched I can say something. Maybe not. ;-)

MapReduce

Race

A very good read