Wednesday, November 25, 2009

Joining Promises for Parallel RPCs in Node.JS

I've been playing around a bit more with Node.JS since my last post and I decided to experiment with the asynchronous process.Promise object this time around. In other languages I believe this concept is sometimes referred to as a future.

Diving in, suppose the following:

  • You're writing a blogging engine.
  • Blog Posts are kept in one data store, and Comments are in another.
  • A request for a Post object takes 1 second to return.
  • A request for a list of Comments on a Post takes 2 seconds to return (the comments data store is run by a bunch of slackers who don't care about latency).
  • You want to have /posts/{postId} return an html page that renders both a Post and all the Comment objects on it.

When you make an RPC (or any I/O call) in Node.JS, you should wrap it in a Promise so the process doesn't block while it's handling your HTTP request.

So our getPost and getComments RPCs (faked out) look like this:

// Fake RPC: succeeds with a list of comments after 2 seconds.
var getCommentsPromise = function(postId) {
  var promise = new process.Promise();
  var comments = ["Comment 1 on " + postId, "Comment 2 on " + postId];
  setTimeout(function() { promise.emitSuccess(comments); }, 2000);
  return promise;
};

// Fake RPC: succeeds with a post object after 1 second.
var getPostPromise = function(postId) {
  var promise = new process.Promise();
  setTimeout(function() { promise.emitSuccess({title: "Post Title " + postId, body: "Post Body " + postId}); }, 1000);
  return promise;
};

Now, if all you had to render on /posts/{postId} was the Post object and not the comments, you could just put the rendering code inside the handler for the Post RPC and be done with it, like so (building on the URI template router from my last post):
var handlers = {
  '/posts/{postId}' : {
      GET : function(request, response, args) {
        response.sendHeader(200, {"Content-Type": "text/html"});
        var postPromise = getPostPromise(args.postId);

        postPromise.addCallback(function(post) {
          var postTemplate = tmpl['post-template.html'];
          var pageHtml = postTemplate({'post': post});
          response.sendBody(pageHtml);
          response.finish();
        });
      }
  }
};


But life is never that simple, and /posts/{postId} has to make two RPCs to get the data required to render a page. This is complicated because you can't render the page until both RPCs are complete.

There are at least two ways to deal with this situation. One sucks and the other doesn't suck as much.

Teh Suck: Serialize the RPCs, then render.

You can serialize the RPCs by nesting the call to the second one inside the handler for the first:

'/slowposts/{postId}' : {
      GET : function(request, response, args) {
        response.sendHeader(200, {"Content-Type": "text/html"});
        var postPromise = getPostPromise(args.postId);
        postPromise.addCallback(function(post) {
          // The comments RPC doesn't start until the post RPC has finished.
          var commentsPromise = getCommentsPromise(args.postId);
          commentsPromise.addCallback(function(comments) {
            var postTemplate = tmpl['post-template.html'];
            var pageHtml = postTemplate({'post': post, 'comments': comments});
            response.sendBody(pageHtml);
            response.finish();
          });
        });
      }
  }

This takes 3 seconds to complete: 1 for fetching the post, then 2 more for fetching the comments.

Teh Not So Suck: Parallelize the RPCs, join them and render when the join is complete.

'/fasterposts/{postId}' : {
      GET : function(request, response, args) {
        response.sendHeader(200, {"Content-Type": "text/html"});
        // Both RPCs are kicked off immediately, so they run in parallel.
        var commentsPromise = getCommentsPromise(args.postId);
        var postPromise = getPostPromise(args.postId);
        var templateVars = {};
        commentsPromise.addCallback(function(comments) {
          templateVars.comments = comments;
        });
        postPromise.addCallback(function(post) {
          templateVars.post = post;
        });

        var joinedPromise = join([commentsPromise, postPromise]);

        joinedPromise.addCallback(function() {
          var postTemplate = tmpl['post-template.html'];
          var pageHtml = postTemplate(templateVars);
          response.sendBody(pageHtml);
          response.finish();
        });
      }
  }

This takes 2 seconds to complete since the RPCs are made in parallel; the total time is just that of the slowest RPC (the 2-second comments fetch).

This method relies on a special function, join(), to make it work. join() takes a bunch of promise objects and returns another promise that fires once all of them are complete:

function join(promises) {
  var count = promises.length;
  var p = new process.Promise();
  for (var i = 0; i < promises.length; i++) {
    promises[i].addCallback(function() {
      // Fire the joined promise once every member promise has succeeded.
      if (--count == 0) { p.emitSuccess(); }
    });
  }

  return p;
}

Note that this example ignores error handling, which makes things even more complicated. What should join do when one of the promise objects fires an error instead of success? Probably a good topic for another post in the future.
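
In the meantime, here's one minimal fail-fast sketch, assuming the Promise's addErrback()/emitError() counterparts to addCallback()/emitSuccess():

function joinWithErrors(promises) {
  var count = promises.length;
  var failed = false;
  var p = new process.Promise();
  for (var i = 0; i < promises.length; i++) {
    promises[i].addCallback(function() {
      if (--count == 0 && !failed) { p.emitSuccess(); }
    });
    promises[i].addErrback(function(error) {
      // Fail fast: the first error wins, and any later results are ignored.
      if (!failed) { failed = true; p.emitError(error); }
    });
  }
  return p;
}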

Also, I've been using Jed Schmidt's tmpl-node engine to render html in this example. Templating in Node.JS appears to be an active area of debate, but this one works fine for my purposes.

Note that one could also parallelize the rendering of the templates as well, so the postPromise handler renders the html for the Post while commentsPromise is still fetching the comments. Then the join handler would stitch together the final html, as in the sketch below.
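
A rough sketch of that approach, using hypothetical postFragmentTemplate and commentsFragmentTemplate functions for the two pieces of the page:

var htmlParts = {};
var postPromise = getPostPromise(args.postId);
postPromise.addCallback(function(post) {
  // Render the post fragment as soon as its RPC returns.
  htmlParts.post = postFragmentTemplate({'post': post});
});
var commentsPromise = getCommentsPromise(args.postId);
commentsPromise.addCallback(function(comments) {
  htmlParts.comments = commentsFragmentTemplate({'comments': comments});
});
join([postPromise, commentsPromise]).addCallback(function() {
  // Stitch the pre-rendered fragments into the final page.
  response.sendBody(htmlParts.post + htmlParts.comments);
  response.finish();
});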

Sunday, November 22, 2009

Request Routing With URI Templates in Node.JS

I've been playing around with node.js, an asynchronous JavaScript server built on V8.

Node.js itself is pretty bare bones.  It's not a framework like Rails, but rather plain request-response handling.  It's sort of like Python's Twisted framework, from what I gather.

There are more full-featured frameworks for node.js if you look around on github, but I'm bored and feel like committing the sin of writing yet more framework code.

This morning I started with a request router for Node.JS that leverages URI templates*.  You specify the application as a series of request templates, paired with the functions that handle them.

For instance if you take the typical blogging application example, you might have a path like /posts/1234 - and the URI template would look like /posts/{postId}.

The magic is in turning {param}s in the URI template into parameters in the handler call.

Here's an example app that routes blog-like requests:

var handlers = {
  '/posts/{postId}' : {
      GET : function(postId) {
        this.response.sendHeader(200, {"Content-Type": "text/plain"});
        this.response.sendBody("GET Post ID: " + postId);
        this.response.finish();
      },
      POST : function(postId) {
        this.response.sendHeader(200, {"Content-Type": "text/plain"});
        this.response.sendBody("POST Post ID: " + postId);
        this.response.finish();        
      }
  },
  '/comments/{postId}/{commentId}' : {
      GET : function(postId, commentId) {
        this.response.sendHeader(200, {"Content-Type": "text/plain"});
        this.response.sendBody("GET Post ID: " + postId + " Comment ID: " + commentId);    
        this.response.finish();
      }
  }
};

As you can see, the individual handler functions are further distinguished by HTTP method.

I'm not sure how to pass POST bodies to the handlers. They could just be attached to the handler's this, I suppose; see the sketch below.
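
One possible sketch (untested), assuming the request object's setBodyEncoding() and its "body"/"complete" events, dropped into the createServer dispatch from the full source below where handler and values are already in scope:

var body = '';
request.setBodyEncoding('utf8');
request.addListener('body', function(chunk) { body += chunk; });
request.addListener('complete', function() {
  // Hand the buffered body to the handler via its this context.
  handler.apply({'request': request, 'response': response, 'body': body}, values);
});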

Here's the full source:

var sys = require("sys"), http = require("http");

var handlers = {
  '/posts/{postId}' : {
      GET : function(postId) {
        this.response.sendHeader(200, {"Content-Type": "text/plain"});
        this.response.sendBody("GET Post ID: " + postId);
        this.response.finish();
      },
      POST : function(postId) {
        this.response.sendHeader(200, {"Content-Type": "text/plain"});
        this.response.sendBody("POST Post ID: " + postId);
        this.response.finish();        
      }
  },
  '/comments/{postId}/{commentId}' : {
      GET : function(postId, commentId) {
        this.response.sendHeader(200, {"Content-Type": "text/plain"});
        this.response.sendBody("GET Post ID: " + postId + " Comment ID: " + commentId);    
        this.response.finish();
      }
  }
};

var Route = function(uriTemplate) {
  this.uriTemplate = uriTemplate;

  // Collect the parameter names. Using exec() in a loop lets us grab the
  // capture group directly, so the {} never end up on the param names.
  this.paramNames = [];
  var nameMatcher = new RegExp('{([^}]+)}', 'g');
  var match;
  while ((match = nameMatcher.exec(uriTemplate)) != null) {
    this.paramNames.push(match[1]);
  }

  this.matcherRegex = this.uriTemplate.replace(/\?/g, '\\?').replace(/{([^}]+)}/g, '([^/?&]+)');
  this.matcher = new RegExp(this.matcherRegex);
};

Route.prototype.parse = function(path) {
  if (this.matcher.test(path)) {
    var result = {};
    var paramValues = this.matcher.exec(path);
    // paramValues[0] is the full match; the captures line up with paramNames.
    for (var i = 1; i < paramValues.length; i++) {
      result[this.paramNames[i-1]] = paramValues[i];
    }
    return result;
  }
  return null; // throw an exception instead?
};
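
For example, given the '/comments/{postId}/{commentId}' template from above:

var route = new Route('/comments/{postId}/{commentId}');
var params = route.parse('/comments/12/34');
// params => { postId: '12', commentId: '34' }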

http.createServer(function (request, response) {
  var handled = false;

  for (var pathTemplate in handlers) {
    var route = new Route(pathTemplate);
    var params = route.parse(request.uri.full);
    if (params) {
      // Convert the results to an array so we can pass them in via apply().
      var values = [];
      for (var name in params) {
        values.push(params[name]);
      }

      var handler = handlers[pathTemplate][request.method];
      if (handler) {
        // So you can call this.request and this.response in the handlers.
        handler.apply({'request' : request, 'response' : response}, values);
        handled = true;
        break;  // First matching route wins; don't keep looking.
      }
    }
  }

  if (!handled) {
    response.sendHeader(404, {"Content-Type": "text/plain"});
    var output = "Couldn't route: " + request.uri.full + "\n";
    for (var name in request) {
      output += name + ": " + request[name] + "\n";
    }
    response.sendBody(output);
    response.finish();
  }
}).listen(8000);

sys.puts("Server running at http://127.0.0.1:8000/");

The route lookup in http.createServer could be a lot more efficient, for instance by memoizing the Route objects instead of rebuilding them on every request.
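
A minimal sketch of that memoization, building each Route once at startup and reusing it per request:

// Precompile the routes once instead of on every request.
var routes = [];
for (var pathTemplate in handlers) {
  routes.push({ route: new Route(pathTemplate), methods: handlers[pathTemplate] });
}

// The server callback then loops over the precompiled routes:
// for (var i = 0; i < routes.length; i++) {
//   var params = routes[i].route.parse(request.uri.full);
//   if (params) { /* look up routes[i].methods[request.method] and apply it */ }
// }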

Anyways, NodeJS looks pretty exciting. Combined with CouchDB you could have a full JavaScript application stack: from storage to app server to client.

*Yes, I realize this is not a full implementation of the URI template spec.  It's just a proof of concept.

Saturday, September 12, 2009

Thoughts on Processing for the Web

Mozilla's Processing for the Web aims to leverage processing.org's low learning curve and expressiveness while abandoning the JVM dependency. This is probably for the best: Processing is a great learning and experimentation environment, but applets are still clunky today, well over ten years after they were introduced.

My personal feeling is that Processing's Java roots hinder it slightly in the learning-curve department because of its static type system.  Collections are particularly messy since Processing doesn't do parameterized types, so you end up with lots of explicit casting if you're doing anything interesting at all.  JavaScript is much more lax in this regard and has language-level associative array support.  I know my own sketches would be much easier to work with in JS than in the current Java-based Processing language because of this.  I create lots of classes to represent objects in my sketches, and it's a pain to stuff them into and cast them back out of java.util.ArrayList and java.util.HashMap.

I met John Resig (author of jQuery and processing.js) when he came to speak at Google about JavaScript performance several months back.  I mentioned my backport of processing.js to processing to him, and he chuckled and said something to the effect of "Yeah, processing wasn't ready for the web."

It really wasn't, but I'm not sure that was a mistake.

You can do some really neat things with processing, specifically hardware controller interfaces, that you can't do in a browser without some serious compromises in the browser security model.  It would be a shame if Processing for the Web drew all the attention and developer resources away from the original Processing, but I doubt that will happen (soon, if ever).  P4Web will probably bring in new developers who wouldn't have touched the original Processing in the first place.

What I would like to see from this development is an evolution of the (JVM based) Processing language itself to be more JavaScripty. Perhaps P4Web will help nudge it in that direction.

Thursday, July 16, 2009

Thoughts on Leaked Twitter Docs

I couldn't stop myself from reading TechCrunch's leaked Twitter docs. [disclaimer: Arrington is a total douche, but you knew that already.]

The more I read, the more I wanted to know. At the same time, I feel like I really shouldn't be looking at it. A delicious but guilt-inducing infosnack.

The feelings of uncontrollable curiosity reminded me of another incident involving leaked data: the AOL Search Data Scandal of 2006.

Fun Fact: Abdur Chowdhury, the researcher responsible for the AOL data leak back in 2006, is now Chief Scientist at ... Twitter.

Not that that's anything more than a coincidence, of course.

Wednesday, April 22, 2009

How to MapReduce in JavaScript Like a Pr0n Star With CouchDB

<soapbox> While I can appreciate the attempt to add some color to a technical presentation, at the risk of sounding prude there are ways to entertain a professional audience that don't involve pr0n references. </soapbox>

Here's a possibly NSFW presentation on CouchDB (Blogger *really* needs an after-the-jump feature :)

[embedded presentation]

The tidbit that caught my eye though is the slide on how you build Views on CouchDB: CouchDB's Views are specified by JavaScript map/reduce functions, which is kinda cool. Here's a nifty interactive demo seasoned with jQuery.
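
For a flavor of what that looks like, here's a minimal sketch of a view inside a design document (the doc fields are hypothetical): the map function emits a key/value pair per matching document, and CouchDB's built-in sum() aggregates the values in reduce.

{
  "views": {
    "comments_per_post": {
      "map":    "function(doc) { if (doc.type == 'comment') { emit(doc.postId, 1); } }",
      "reduce": "function(keys, values, rereduce) { return sum(values); }"
    }
  }
}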

jQuery meets MapReduce. The JavaScript Singularity approaches.

Sunday, April 19, 2009

Alex Payne Speaks at Stanford

Watch the whole video here

Random notes, taken as I watched and listened to Twitter's API Lead Alex Payne talk about API as UI, and various Twitter details:

The fact that in many schools a student's first programming experience is with VB6 might explain why so many kids are turned off by CS. If they were exposed to APIs with more thought put into usability, they might be more likely to pursue CS.

API to HCI: millions of computer users are developers, and the way they interact with their computers is via APIs. API == super geeky UI.

100's of Millions of requests per day to the Twitter API.

Oooh Tweetie for OSX. Nice.

Twitter API support collaborates on one twitter account using CoTweet.

StockTwits: many of these users never go to Twitter.com after registering. StockTwits is how they interact with twitter.

Hardware Hack: BakerTweet: a baker in London turns the dial to "cupcakes" and hits a button to tweet that cupcakes are ready.
Hardware Hack: Kill-a-watt

inconsistency upsets developers
- actions that can be interpreted multiple ways
- don't provide UI conventions - devs have to use the twitter website first to understand what to do with the data.

java.util.Calendar/Date is indeed a fucking awful API.

import in Scala lets you do "import java.util.{Calendar, Date}" - Mmmmm love that syntactic sugar.

Yelp has an API? hehe aww how cute.

Yelp Docs don't match the actual API. Examples use single-quoted strings in JSON, which isn't valid JSON.

Yelp API puts response codes into the JSON payload instead of using HTTP headers to indicate OK, ERROR and so on.

! in Scala can mean either the traditional boolean NOT operator, or "send the following message to this actor": actorObj ! "message"
This is somewhat ambiguous. Al3x: "You get used to it." If it's okay for Scala, why is it not okay for Win32 APIs? Many developers have just gotten "used to it" in Win32. This is a nit, and I see his point.

REST vs. SOAP, WS-*: SOAP does a lot for the programmer: machine readable API definition -> auto-code generation.
REST doesn't write any code for you, but is intuitive enough not to require that.

"In my experience programmers don't like having details hidden from them" - but encapsulation and information hiding are fundamental to OO, if not API design.

"Smalltalk never really took off as a programming language" I know some pros who sold a shit ton of Smalltalk to banks in the 1980/90s who would disagree.

High traffic Twitter API users have requested Thrift access instead of REST API access. This makes total sense and I expect to see more of it.

"what reason do you have for not offering your own url shortening service, or not including URLs in the message itself?"
Al3x: "we don't have a good reason, which is why how we deal with urls is going to change" This sounds like twitter is going to offer its own URL shortener. Can't tell from his comments, but they are just dropping a lot of out-click data on the floor currently.

Al3x: The people who build the most interesting stuff bug us the least on mailing lists.

I have to wonder if in the future, as wifi becomes more prevalent, will Twitter : SMS :: Blogger : FTP ?

SignUp API coming soon?

how much did CNN pay for @cnnbrk? (would indicate $ value of followers to CNN)

>90% of their traffic is to the API rather than the web UI.

Wednesday, March 4, 2009

Map Reduce in JavaScript pt. 2

About two months ago I discussed how to run map reduce in JavaScript, closing with:

If only there was a vast sea of computers, all running javascript interpreters, all connected to the internet, all capable of downloading and running your m/r jobs. :)


Seems I wasn't the only one thinking along these lines. See igvita.com: Collaborative Map-Reduce in the Browser.

While I still like the idea (it makes use of vast untapped resources), there are some fundamental problems.

Running this kind of job over the open internet instead of a fast local (and secure) network is asking for trouble.

Sabotage: Forget accidental corruption. Workers can intentionally poison your jobs if they have an incentive to. Suppose you want to use this m/r setup to produce a spam classifier. Spammers could set up "workers" that submit bogus results that bias the filter to let their spam in.

How do you know you can trust a worker? (It's much easier to answer this if you're running your map reduce on a fast, secure local network.)

Economics and Speed: Map Reduce works on large data sets, and those data sets cost significant amounts of money to move back and forth across the internet. Even if you use Amazon S3, the JavaScript has to post the interim results back to your server (unless you want to put your S3 key inside the worker JS, which I doubt ;), so you're paying for that bandwidth from your hosting provider in addition to whatever S3 charges. Even if the dollar cost is not an issue, you're talking about some really slow total run times, since your slowest worker is much slower on the open internet than it would be on a set of machines you control.

How do you decide if the cost (in time and dollars) of running the job justifies the value of obtaining its results? (It's much easier to answer this if you're running your map reduce on a fast, secure local network.)

So until someone does a lot of legwork to sort out the basic m/r infrastructure and then tackles the additional problems introduced by running on an open, slow, expensive network connection, the JavaScript MapReduce over HTTP idea is just an (admittedly fun) toy.

Recommendation: if you need to crunch a data set, use Hadoop. If you want to demonstrate Feats of Technical Strength Regardless of Utility (as I often do), try JSMRHTTP.

JSMRHTTP is not very catchy. There must be some other name for this concept.