83% Bandwidth Reduction via API Response Change

A typical response to a StatHat stat API call looks like this:

HTTP/1.1 200 OK
Content-Type: application/json
Date: Tue, 02 May 2017 14:53:45 GMT
Content-Length: 25
Connection: keep-alive


It’s pretty small: 152 bytes. But users sent in 150 billion requests last month, and StatHat responded to every one of them with those 152 bytes, which adds up to about 20 terabytes.

After studying RFC 7231, we are going to change the default success response to:

HTTP/1.1 204 No Content

It is 25 bytes (maybe 26 with a blank line after the header), 127 bytes leaner.

The 204 response is an accurate description of a successful API request:

The 204 (No Content) status code indicates that the server has successfully fulfilled the request and that there is no additional content to send in the response payload body.

While the Date header field is encouraged, it is optional:

An origin server MUST NOT send a Date header field if it does not have a clock capable of providing a reasonable approximation of the current instance in Coordinated Universal Time.

So let’s just pretend we don’t have a good clock.

None of the official StatHat libraries look at the body of the response; they only care about the 2xx success status. We have tried this response in production on a subset of the requests and have not received any reports of it being a problem, so we are rolling it out for all requests.
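If you have your own client code, the same approach is easy to copy: treat any 2xx status as success and ignore the body. Here is a minimal sketch in Go. It is not taken from any StatHat library; `succeeded` and `postStat` are hypothetical names, and a local `httptest` server stands in for the real API so the example runs without touching the network.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// succeeded reports whether an HTTP status code is in the 2xx range.
// (Hypothetical helper, not part of any StatHat library.)
func succeeded(status int) bool {
	return status >= 200 && status < 300
}

// postStat sends a request to url and reports success based only on the
// status code. The response body is never read.
func postStat(url string) (bool, error) {
	resp, err := http.Get(url)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return succeeded(resp.StatusCode), nil
}

func main() {
	// Local stand-in for the stat API: replies 204 No Content with no body.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNoContent)
	}))
	defer srv.Close()

	ok, err := postStat(srv.URL)
	fmt.Println(ok, err)
}
```

A client written this way behaves the same whether the server answers 200 with a body or 204 without one.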

Note that we will still provide details for multiple stats uploaded in a JSON request and for any error cases.

If you have code that parses the original body and would like to continue doing so, include a vb=1 request parameter with your POST or GET request and the servers will respond with the original verbose body output.
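As a rough illustration, here is how the form parameters for a counter stat might be built in Go with the verbose flag switched on. The `vb=1` parameter is the one described above; the `ezkey`, `stat`, and `count` names are assumed to follow the EZ API, and `ezCountForm` is a hypothetical helper, so check the API docs before relying on it.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// ezCountForm builds the form parameters for a counter stat.
// The ezkey, stat, and count names are assumptions based on the EZ API;
// vb=1 asks the servers for the original verbose body.
func ezCountForm(ezkey, stat string, count int, verbose bool) url.Values {
	form := url.Values{}
	form.Set("ezkey", ezkey)
	form.Set("stat", stat)
	form.Set("count", strconv.Itoa(count))
	if verbose {
		form.Set("vb", "1")
	}
	return form
}

func main() {
	form := ezCountForm("you@example.com", "installs", 1, true)
	// The encoded form can be used as a POST body or a GET query string.
	fmt.Println(form.Encode())
}
```

Leaving `verbose` false gives you the new, leaner 204 response.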

This change should remove about 17 terabytes of useless data from the internet pipes each month.
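For anyone who wants to check the math, here is a small Go sketch that reproduces the numbers in this post. The byte counts and request volume come from the text above. The one assumption is the unit: the 20 and 17 terabyte figures line up with binary terabytes (TiB), so that is what the sketch uses.

```go
package main

import "fmt"

const (
	oldResponseBytes = 152     // original 200 OK response, per the post
	newResponseBytes = 25      // 204 No Content response, per the post
	requestsPerMonth = 150e9   // 150 billion requests, per the post
	tebibyte         = 1 << 40 // assumption: "terabytes" means TiB
)

// reduction returns the fraction of bytes saved per response.
func reduction(oldSize, newSize float64) float64 {
	return (oldSize - newSize) / oldSize
}

// monthlyTiB converts a per-response byte count into TiB per month.
func monthlyTiB(bytesPerResponse, requests float64) float64 {
	return bytesPerResponse * requests / tebibyte
}

func main() {
	fmt.Printf("reduction: %.1f%%\n", 100*reduction(oldResponseBytes, newResponseBytes))
	fmt.Printf("before:    %.1f TiB/month\n", monthlyTiB(oldResponseBytes, requestsPerMonth))
	fmt.Printf("saved:     %.1f TiB/month\n", monthlyTiB(oldResponseBytes-newResponseBytes, requestsPerMonth))
}
```

This works out to a reduction of roughly 83.6%, from about 20.7 TiB per month with about 17.3 TiB saved, which matches the rounded figures in the post.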

Is Pied Piper using StatHat?

No, of course a fictional company on a TV show isn't using a real service like StatHat.[1] But it sure looks like they are.

Season 3, Episode 9 ("Daily Active Users") was all about two stats: Installs and Daily Active Users. Tracking these with StatHat would be a piece of cake.

In the show, they compare Installs to Daily Active Users. "Installs" doesn't make much sense for what appears to be primarily a web service,[2] so let's just assume "Installs" means "New User Created". Whenever a new user is created, they could add one line of code to track this as a counter stat:

db.Exec("INSERT INTO users (id, email, created_at, active_at) VALUES (?, ?, NOW(), NOW())", id, email)
stathat.PostEZCount("installs", "stats@piedpiper.com", 1)

Whenever a user did something on the Pied Piper platform, they could update the user row in the database:

db.Exec("UPDATE users SET active_at=NOW() WHERE id=?", id)

Then they could have a script that ran via cron every minute to send the number of active users in the past 24 hours to StatHat:

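var dau int
if err := db.QueryRow("SELECT COUNT(*) FROM users WHERE active_at > NOW() - INTERVAL 24 HOUR").Scan(&dau); err != nil {
	log.Fatal(err)
}
// PostEZValue takes a float64, so convert the integer count.
stathat.PostEZValue("daily active users", "stats@piedpiper.com", float64(dau))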

Those two calls to StatHat are all it takes to track these stats. They don't need to be created on the website first as StatHat will create new stats dynamically when it receives a new stat name.

The installs stat is a counter. Every time someone signs up for Pied Piper, it sends a count of 1 to StatHat. StatHat will then sum these up over time.

The daily active users stat is a value. Every minute, the cron script sends the current daily active users value to StatHat. StatHat will then average this value over time.
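To make the counter versus value distinction concrete, here is a toy Go sketch of the two aggregations. It only illustrates the idea of summing counts and averaging values; it is not how StatHat implements them, and the function names and sample numbers are made up.

```go
package main

import "fmt"

// counterTotal sums the counts received over a timeframe,
// the way a counter stat such as "installs" is described above.
func counterTotal(counts []int) int {
	total := 0
	for _, c := range counts {
		total += c
	}
	return total
}

// valueAverage averages the values received over a timeframe,
// the way a value stat such as "daily active users" is described above.
func valueAverage(values []float64) float64 {
	if len(values) == 0 {
		return 0
	}
	sum := 0.0
	for _, v := range values {
		sum += v
	}
	return sum / float64(len(values))
}

func main() {
	installs := []int{1, 1, 1, 1}     // four signups, each sent as a count of 1
	dau := []float64{100, 104, 108}   // three made-up cron reports
	fmt.Println(counterTotal(installs)) // 4
	fmt.Println(valueAverage(dau))      // 104
}
```

This is why signups should be sent as counts and the active user total as a value: summing the per-minute active user reports would produce a meaningless number.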

Now that the data is going to StatHat, there are many options for viewing it. The web interface lets you inspect the data over any timeframe and compare multiple stats. It also has cards, which would be an easy way to create a dashboard of Installs and Daily Active Users. Panic's Status Board would be another easy choice.

But there's also an embed API that allows you to embed stat data on any web page, which would be one way to get a similar looking dashboard to the one on the show. The stat integrations page gives you a small block of JavaScript you can paste in any web page. Here's a stat embedded on this blog post, styled to look somewhat similar to the Pied Piper dashboard:


The code to do this looks like:

<script src="//www.stathat.com/js/embed.js"></script>
<script>StatHatEmbed.render({kind: 'text', s1: 'K6xI3hBsBACxjF9nAd7IhOPw8RbKH5XJ'});</script>

That's all that is needed. You should see the number displayed above change every minute. Since this stat is a counter, StatHat is displaying the total count received over the timeframe.

Perfect for milestone parties:


1: Betteridge's law of headlines is true in this case.

2: Yes, the "platform" is available on all devices so could be installable software, but it is also clearly a web service:

chartd.co: responsive, retina-compatible charts with just an img tag

We created a chart service that lives at chartd.co. It allows you to create responsive, retina-compatible charts with just an img tag. Like this:

[Image: chartd chart]

That whole chart came from this URL:


We built it a long time ago and StatHat has been using it to include charts in alert and report emails. While we told a few other companies about it, we never officially announced it.

All the documentation for it is on the main page at chartd.co.

Please let us know what you think of it!

Fix EC2 Network Issue: skb rides the rocket: 19 slots

Here’s the fix:

sudo ethtool -K eth0 sg off

Back story:

We’ve been noticing sporadic network issues with some of our EC2 instances. The main symptom is that once in a while, an RPC request will hang for around 10 minutes.

We’ve never been able to reproduce it in a development environment, only in production. Some days it happens a lot, some not at all. All of our attempts to debug the issue failed. We have automatic monitoring tools that check for this happening and make a server inactive until it clears up. These handle the situation fine, but it’s still annoying and suboptimal.

Last week, I was investigating an unrelated issue. I was reading a syslog and saw this:

xen_netfront: xennet: skb rides the rocket: 19 slots

If it weren’t so odd, I would have passed right over it. But I had to google it, which led me to this blog post and this Ubuntu bug report. Apparently, it’s based on getting hit by a rocket launcher in Quake, which takes me back to the daily capture the flag matches we used to have at Click working on Throne of Darkness…

Anyway, I found similar log messages on the servers with the intermittent RPC request issues. I ran the command to turn off scatter-gather on one of the servers:

sudo ethtool -K eth0 sg off

No errors for 6 hours. So I ran it on all of them. And we haven’t seen the RPC error since. Now it’s in rc.local in all of our AMIs. We haven’t noticed any performance problems. And getting rid of an occasional 10-minute RPC request is a definite performance boost.

So check your syslogs. If you see anything like “skb rides the rocket” then try turning off scatter-gather.
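If you would rather automate the check than eyeball the logs, a small Go sketch like this one counts the offending lines. The marker string comes from the log message above; `countRocketLines` is a hypothetical helper, the second sample line is made up, and in real use you would feed it the contents of your own syslog file, whose path varies by distribution.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// rocketMarker is the kernel message fragment to look for.
const rocketMarker = "skb rides the rocket"

// countRocketLines returns how many lines of the given log text
// contain the marker.
func countRocketLines(syslog string) int {
	n := 0
	sc := bufio.NewScanner(strings.NewReader(syslog))
	for sc.Scan() {
		if strings.Contains(sc.Text(), rocketMarker) {
			n++
		}
	}
	return n
}

func main() {
	// Sample text: the first line matches the message above,
	// the second is a made-up unrelated line.
	sample := "xen_netfront: xennet: skb rides the rocket: 19 slots\n" +
		"eth0: link is up\n"
	fmt.Println(countRocketLines(sample)) // 1
}
```

A nonzero count is a hint that turning off scatter-gather is worth trying on that server.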

Send alerts to PagerDuty, Webhooks, and Slack

We expanded where you can send your StatHat alerts. In addition to email addresses and Campfire chat rooms, you can send your alerts to PagerDuty, Slack channels, and generic webhooks.

We created a new system to set up your alert destinations. You can configure defaults for manual and automatic alerts as well as customized destinations for each manual alert.

Read all about alerts and the new destinations here.

We have a few more integrations in the queue, but let us know if there are other integrations you would love to have.