Wednesday, March 4, 2009

Map Reduce in JavaScript pt. 2

About two months ago I discussed how to run map reduce in JavaScript, closing with:

If only there was a vast sea of computers, all running javascript interpreters, all connected to the internet, all capable of downloading and running your m/r jobs. :)


Seems I wasn't the only one thinking along these lines. See igvita.com: Collaborative Map-Reduce in the Browser.

While I still like the idea (it makes use of vast untapped resources), there are some fundamental problems.

Running this kind of job over the open internet instead of a fast local (and secure) network is asking for trouble.

Sabotage: Forget accidental corruption. Workers can intentionally poison your jobs if they have an incentive to. Suppose you want to use this m/r setup to produce a spam classifier. Spammers could set up "workers" that submit bogus results that bias the filter to let their spam in.

How do you know you can trust a worker? (It's much easier to answer this if you're running your map reduce on a fast, secure local network.)
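One common partial answer is redundancy: hand the same task to several workers and only accept a result that enough of them agree on. Here is a minimal sketch of that idea. The function name `acceptByQuorum`, the shape of a submission, and the quorum threshold are all my own assumptions for illustration, not part of any existing m/r framework.

```javascript
// Hypothetical helper (illustrative only): accept a task's result only when
// at least `quorum` distinct workers submitted the same value.
// Each submission is assumed to look like { workerId, result }.
function acceptByQuorum(submissions, quorum) {
  // Map from serialized result -> set of worker ids that reported it.
  const votes = new Map();
  for (const { workerId, result } of submissions) {
    // Assumes results serialize deterministically (same key order).
    const key = JSON.stringify(result);
    if (!votes.has(key)) votes.set(key, new Set());
    // A Set means one worker resubmitting does not count twice.
    votes.get(key).add(workerId);
  }
  for (const [key, workers] of votes) {
    if (workers.size >= quorum) return JSON.parse(key);
  }
  return null; // no agreement: reassign the task to other workers
}
```

Note that this only helps against independent bad workers. A spammer who controls enough "workers" to reach the quorum still wins, and every extra copy of a task multiplies the bandwidth bill.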

Economics and Speed: Map Reduce works on large data sets, and moving those data sets back and forth across the internet costs significant amounts of money. Even if you use Amazon S3, the JavaScript has to post the interim results back to your server (unless you want to put your S3 key inside the worker JS, which I doubt ;) so you're paying for that bandwidth from your hosting provider in addition to whatever S3 charges. Even if the dollar cost is not an issue, you're looking at really slow total run times, since a job finishes only when its slowest worker does, and the slowest worker on the open internet is much slower than the slowest one on a set of machines you control.

How do you decide if the cost (in time and dollars) of running the job justifies the value of obtaining its results? (It's much easier to answer this if you're running your map reduce on a fast, secure local network.)
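A back-of-the-envelope estimate makes the question concrete. The sketch below is a rough model under loud assumptions: a single flat per-GB bandwidth price, every worker running in parallel on one task, and a job that ends when its slowest worker reports back. The function `estimateJob` and all of its parameter names are hypothetical, and any numbers you feed it are yours to supply, not real price quotes.

```javascript
// Hypothetical estimator (illustrative only).
// datasetGB:       input data shipped out to workers
// interimGB:       interim results posted back to your server
// pricePerGB:      assumed flat bandwidth price, in dollars
// workerSeconds:   assumed per-worker completion times, in seconds
function estimateJob({ datasetGB, interimGB, pricePerGB, workerSeconds }) {
  // You pay for data going out and for interim results coming back.
  const transferCost = (datasetGB + interimGB) * pricePerGB;
  // With all workers in parallel, the slowest one sets the run time.
  const wallClockSeconds = Math.max(...workerSeconds);
  return { transferCost, wallClockSeconds };
}

// Example with made-up numbers: one slow worker dominates the run time.
const estimate = estimateJob({
  datasetGB: 100,
  interimGB: 20,
  pricePerGB: 0.5,
  workerSeconds: [30, 45, 600],
});
console.log(estimate);
```

If the value of the results doesn't clearly beat `transferCost`, or you can't wait out `wallClockSeconds`, the job isn't worth running this way.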

So until someone does a lot of legwork to sort out the basic m/r infrastructure and then tackles the additional problems introduced by running on an open, slow, expensive network connection, the JavaScript MapReduce over HTTP idea is just an (admittedly fun) toy.

Recommendation: if you need to crunch a data set, use Hadoop. If you want to demonstrate Feats of Technical Strength Regardless of Utility (as I often do), try JSMRHTTP.

JSMRHTTP is not very catchy. There must be some other name for this concept.

6 comments:

  1. AMRAX. The MR is for MapReduce, the other letters can be backronymed later. Or just take the same words as in AJAX, they make as much sense (or not) here.

  2. BTW, this comment widget has issues.

    Selected nothing to see what happens -- and it defaulted to Google.

    Sent me from your unbranded site to a Blogger login page, which trains people to be easy to phish.

    Backed up to see if I had another option, my comment was lost.

    Went forward to see if comment would survive if I relented and logged-in, got a 'duplicate action error'.
