James Aylett: Running statsd on Heroku

Published at
Thursday 18th April, 2013

statsd is a “simple daemon for easy stats aggregation”: you send it stats whenever you can (such as when rendering a web page), and it aggregates them internally and passes them upstream to something that can store them and make them available for other clients for analysis, graphing and so on. Upstream stores from statsd might include the Carbon storage engine from Graphite that you can run yourself somewhere, or a hosted service such as Librato. You can combine the two by using Hosted Graphite, which does exactly what it says on the tin.

Heroku is an infrastructure as a service company that provides an abstraction over servers, virtual machines and so forth geared to web deployment, as well as a toolchain for working with that.

It would be nice if we could use them together, and the good news is that we can. I wrote this because I couldn’t find anything online that spells out how to. The code and configuration is available on github.

How we’re going to do this

A simple deployment of statsd is this: put one instance on each physical machine you have, and point them all at a storage system. (You can also chain instances together, and have instances send their data on to multiple receivers. Let’s just ignore all of that, because then you probably don’t want to host on Heroku, and if you do you can certainly figure out how this all applies to your setup.)

On Heroku, we don’t have physical machines; in fact there isn’t the concept of “machine” at all. Instead, Heroku has Dynos, which are described as “lightweight containers” for UNIX processes. From their documentation:

[A Dyno] can run any command available in its default environment combined with your app’s slug

(The slug is basically your codebase plus dependencies.)

When working with physical machines there’s a tendency to put a number of different types of process on each, to avoid having to buy and manage more of them. With virtualisation, and hosting systems such as Amazon EC2, this isn’t so important, and with Heroku their entire architecture is set up almost to mandate that you have different types of Dynos (called process types) for different jobs; almost always a web type that is basically your application server, probably a secondary worker type that handles any long-running operations asynchronously to web requests, and so on.

However this doesn’t mean we can’t run multiple UNIX processes within one Dyno. Providing each process type is still only doing one thing, it still fits the Heroku semantics. This means we can tuck a statsd instance away in each Dyno, so it will aggregate information from the work being done there, with each statsd sending its aggregated data upstream.

(Why not have a process type for statsd and send all data to one or two Dynos before aggregating it upstream? Because statsd works over UDP for various sound reasons, but Heroku doesn’t provide UDP routing for its Dynos. Even if it did, you wouldn’t want to do things that way because UDP between arbitrary Dynos running who knows where within Heroku’s virtualised infrastructure can fall foul of all sorts of intermediate network issues.)

A demonstration app

Process types are configured in your app’s Procfile, so we want a single command that launches both statsd and whatever the main work of this Dyno is going to be. Let’s start by making a simple Flask app and deploying it to Heroku without statsd.

# requirements.txt
Flask==0.9
gunicorn==0.17.2

# web.py
from flask import Flask
app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello World!"

if __name__ == "__main__":
    app.run()

And a simple Procfile to launch that:

# Procfile
web: gunicorn -b "0.0.0.0:$PORT" -w 4 web:app

If we turn this into a git repo, create a Heroku app and push everything up, we’ll be able to see our very boring homepage.

$ git init
$ git add requirements.txt Procfile web.py
$ git commit -a -m 'Simple Flask app for Heroku.'
$ heroku apps:create
Creating afternoon-reaches-9313... done, stack is cedar
http://afternoon-reaches-9313.herokuapp.com/ | git@heroku.com:afternoon-reaches-9313.git
Git remote heroku added
$ git push heroku master

(Lots of unimportant output removed; the important bit is the output from heroku apps:create which tells you the URL.)

Okay, all is well there. Let’s get statsd into play.

Doing two things at once in a Dyno

The key here is to put a command in the Procfile which launches both gunicorn and the statsd. A simple choice here is honcho, which is a python version of foreman. (If we were doing this using the Heroku Ruby runtime (say a Rails or Sinatra app) then it would make sense to use foreman instead.)

As we’re working in the python side of things, let’s add a simple statsd counter to our web app at the same time.

# requirements.txt
Flask==0.9
gunicorn==0.17.2
honcho==0.4.0
python-statsd==1.5.8

# web.py
import statsd
from flask import Flask
app = Flask(__name__)

@app.route("/")
def hello():
    counter = statsd.Counter("Homepage hits")
    counter += 1
    return "Hello World!"

if __name__ == "__main__":
    app.run()

Honcho uses a Procfile itself to figure out what to launch, so we need to give it separate configuration from the main Heroku one:

# Procfile.chain
web: gunicorn -b "0.0.0.0:$PORT" -w 4 web:app
statsd: cat /dev/zero

At this point we don’t know how to launch a statsd so we’ll just have it launch a dummy command that will keep running while gunicorn does its work. Then we need the main Heroku Procfile to launch honcho instead of gunicorn directly:

# Procfile
web: USER=nobody PORT=$PORT honcho -f Procfile.chain start

(The USER environment variable is needed because of how honcho defaults some of its options.)

And push it to Heroku:

$ git add requirements.txt Procfile Procfile.chain  web.py 
$ git commit -a -m 'Run gunicorn + dummy process; python will try to push to statsd'
$ git push heroku master

The python that tries to push a counter to statsd will fail silently if there isn’t one running, so all is well and you should still be able to get to your homepage at whichever URL Heroku gave you when you created the app.

Running statsd on Heroku

statsd is a node.js program, so we want the Heroku node.js support in order to run it. Heroku supports different languages using buildpacks – and we’re already using the Python buildpack to run Flask. Fortunately there are community-contributed buildpacks available, one of which suits our needs: heroku-buildpack-multi allows using multiple buildpacks at once. We need to set this as the buildpack for our app:

$ heroku config:add BUILDPACK_URL=https://github.com/ddollar/heroku-buildpack-multi.git

Then we can add a .buildpacks file that lists all the buildpacks we want to use.

http://github.com/heroku/heroku-buildpack-nodejs.git
http://github.com/heroku/heroku-buildpack-python.git

The node.js buildpack uses package.json to declare dependencies:

/* package.json */
{
  "name": "heroku-statsd",
  "version": "0.0.1",
  "dependencies": {
    "statsd": "0.6.0"
  },
  "engines": {
    "node": "0.10.x",
    "npm":  "1.2.x"
  }
}

statsd itself needs a tiny amount of configuration; at this point we’re not going to consider an upstream, so we want it to log every message it gets sent so we can see it in the Heroku logs:

/* statsd-config.js */
{
  dumpMessages: true
}

And finally we want to chain Procfile.chain so honcho knows to launch statsd:

web: gunicorn -b "0.0.0.0:$PORT" -w 4 web:app
statsd: node node_modules/statsd/stats.js statsd-config.js

Push that up to Heroku:

$ git add .buildpacks package.json statsd-config.js Procfile.chain
$ git commit -a -m 'Run statsd alongside gunicorn'
$ git push heroku master

If you hit your Heroku app’s URL you won’t see anything different, but when you check your Heroku logs:

$ heroku logs
2013-04-17T14:06:38.766960+00:00 heroku[router]: at=info method=GET path=/ host=afternoon-reaches-9313.herokuapp.com fwd="149.241.66.93" dyno=web.1 connect=2ms service=5ms status=200 bytes=12
2013-04-17T14:06:38.780056+00:00 app[web.1]: 14:06:38 statsd.1 | 17 Apr 14:06:38 - DEBUG: Homepage hits:1|c

Again I’ve removed a lot of boring output to focus on the two important lines: the first (from the Heroku routing layer; gunicorn itself doesn’t log by default) shows the request being successfully processed, and the second shows statsd getting our counter.

Pushing upstream

Both Librato and Hosted Graphite provide statsd backends so you can aggregate directly to them. For Librato the plugin is statsd-librato-backend, and for Hosted Graphite it’s statsd-hostedgraphite-backend. Other options will either have their own backends, or you can always write your own.

As well any configuration to support your chosen upstream, you probably want to drop the dumpMessages: true line so your Heroku logs are tidier.

Running locally

Everything we’ve done here will work locally as well. Assuming you have node.js (and npm) installed already, and you have virtualenv on your system for managing python virtual environments, just do:

$ virtualenv ENV
$ source ENV/bin/activate
$ ENV/bin/pip install -r requirements.txt
$ npm install
$ honcho -f Procfile.chain start

Caveats

I haven’t used this in production (yet), so beyond the concept being sound I can’t commit to its working without problems. In particular, things to think about include:

Certainly if you put this into production I’d pay attention to Heroku platform errors, do spot checks on data coming out of statsd if you can, and generally be cautious.