Wednesday, March 4, 2009

Map Reduce in JavaScript pt. 2

About two months ago I discussed how to run map reduce in JavaScript, closing with:

If only there was a vast sea of computers, all running javascript interpreters, all connected to the internet, all capable of downloading and running your m/r jobs. :)


Seems I wasn't the only one thinking along these lines. See igvita.com: Collaborative Map-Reduce in the Browser.
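
The core of the idea is simple enough to sketch. Here's a minimal, hypothetical browser worker: it polls a coordinating server for a chunk of input, runs the map function it was handed, and posts the emitted pairs back. The /job and /result endpoints, the payload shape, and shipping the map function as a string are all illustrative assumptions, not details from either post.

```javascript
// Minimal sketch of a browser-based map worker (hypothetical protocol).
// Assumes the server exposes /job (GET) and /result (POST) and ships
// the map function as a string along with its input chunk.
function pollForWork() {
  var xhr = new XMLHttpRequest();
  xhr.open("GET", "/job", true);
  xhr.onreadystatechange = function () {
    if (xhr.readyState !== 4) return;
    if (xhr.status !== 200) {            // no work available; try again later
      setTimeout(pollForWork, 5000);
      return;
    }
    var job = JSON.parse(xhr.responseText); // { id, mapFn, input }
    var map = eval("(" + job.mapFn + ")");  // worker blindly trusts the server's code (!)
    var results = [];
    for (var i = 0; i < job.input.length; i++) {
      map(job.input[i], function (key, value) {
        results.push([key, value]);         // collect emitted key/value pairs
      });
    }
    postResult(job.id, results);
  };
  xhr.send(null);
}

function postResult(jobId, results) {
  var xhr = new XMLHttpRequest();
  xhr.open("POST", "/result", true);
  xhr.setRequestHeader("Content-Type", "application/json");
  xhr.onreadystatechange = function () {
    if (xhr.readyState === 4) pollForWork(); // ask for the next chunk
  };
  xhr.send(JSON.stringify({ id: jobId, results: results }));
}

pollForWork();
```

Note that the trust flows both ways: the server has to trust whatever the worker posts back, which is exactly where the trouble starts.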

While I still like the idea (it makes use of vast untapped resources), there are some fundamental problems.

Running this kind of job over the open internet instead of a fast local (and secure) network is asking for trouble.

Sabotage: Forget accidental corruption. Workers can intentionally poison your jobs if they have an incentive to. Suppose you want to use this m/r setup to produce a spam classifier: spammers could set up "workers" that submit bogus results, biasing the filter toward letting their spam through.

How do you know you can trust a worker? (It's much easier to answer this if you're running your map reduce on a fast, secure local network.)
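
One partial mitigation, borrowed from volunteer-computing projects like BOINC (and not something any of these browser prototypes actually implement), is redundant execution: hand the same chunk to several independent workers and only accept a result once a majority agree. Here's a rough server-side sketch; the redundancy factor and the use of serialized JSON as an equality check are assumptions for illustration.

```javascript
// Rough sketch: accept a chunk's result only when a majority of the
// REDUNDANCY workers assigned to it return the same answer.
// JSON-string equality is a crude stand-in for canonical comparison.
var REDUNDANCY = 3;              // how many workers see each chunk
var pending = {};                // chunkId -> { serializedResult: voteCount }

function submitResult(chunkId, result) {
  var key = JSON.stringify(result);
  var votes = pending[chunkId] || (pending[chunkId] = {});
  votes[key] = (votes[key] || 0) + 1;

  // Accept once a majority of the redundant copies agree.
  if (votes[key] > REDUNDANCY / 2) {
    delete pending[chunkId];
    return { accepted: true, result: result };
  }
  return { accepted: false };    // keep waiting for more copies
}
```

Notice the costs: every chunk now runs REDUNDANCY times, and a saboteur who controls a majority of the workers assigned to a chunk still wins.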

Economics and Speed: Map Reduce works on large data sets, and those data sets cost real money to move back and forth across the internet. Even if you host the data on Amazon S3, the JavaScript has to post its interim results back to your server (unless you want to embed your S3 key inside the worker JS, which I doubt ;) so you're paying your hosting provider for that bandwidth on top of whatever S3 charges. And even if the dollar cost isn't an issue, the total run time will be painful: a job is gated by its slowest worker, and the slowest worker on the open internet is far slower than the slowest machine on a cluster you control.

How do you decide if the cost (in time and dollars) of running the job justifies the value of obtaining its results? (It's much easier to answer this if you're running your map reduce on a fast, secure local network.)
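
To make that trade-off concrete, here's a back-of-the-envelope calculation. Every number below (data size, worker count, per-worker throughput, bandwidth price) is a made-up but plausible assumption, not a measurement:

```javascript
// Back-of-the-envelope: shipping a 1 TB dataset to volunteer browsers.
// All numbers are illustrative assumptions, not measurements.
var datasetGB = 1024;               // 1 TB of input
var workers = 1000;                 // simultaneous browser workers
var workerKBps = 100;               // ~100 KB/s sustained per worker (DSL-era)
var dollarsPerGB = 0.15;            // rough 2009-ish transfer pricing

var aggregateMBps = workers * workerKBps / 1024;
var hoursToShip = datasetGB * 1024 / aggregateMBps / 3600;
var transferCost = datasetGB * dollarsPerGB * 2;  // out to workers, back in

// With these numbers: about 3 hours just moving data one way, and
// roughly $300 in bandwidth, before any computation has happened.
console.log(hoursToShip.toFixed(1) + " hours, $" + transferCost.toFixed(0));
```

On a rack of machines you control, the same transfer is effectively free and takes minutes.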

So until someone does a lot of legwork to sort out the basic m/r infrastructure and then tackles the additional problems introduced by running over an open, slow, expensive network connection, the JavaScript MapReduce over HTTP idea is just an (admittedly fun) toy.

Recommendation: if you need to crunch a data set, use Hadoop. If you want to demonstrate Feats of Technical Strength Regardless of Utility (as I often do), try JSMRHTTP.

JSMRHTTP is not very catchy. There must be some other name for this concept.