Fix EC2 Network Issue: skb rides the rocket: 19 slots

22 December 2014

Here’s the fix:

sudo ethtool -K eth0 sg off

Back story:

We’ve been noticing sporadic network issues with some of our EC2 instances. The main symptom is that once in a while, an RPC request will hang for around 10 minutes.

We’ve never been able to reproduce it in a development environment, only in production. Some days it happens a lot, some not at all. All of our attempts to debug the issue failed. We have automatic monitoring tools that check for this happening and make a server inactive until it clears up. These handle the situation fine, but it’s still annoying and suboptimal.

Last week, I was investigating an unrelated issue. I was reading a syslog and saw this:

xen_netfront: xennet: skb rides the rocket: 19 slots

If it weren’t so odd, I would have passed right over it. But I had to google it, which led me to this blog post and this Ubuntu bug report. Apparently, it’s based on getting hit by a rocket launcher in Quake, which takes me back to the daily capture-the-flag matches we used to have at Click working on Throne of Darkness…

Anyway, I found similar log messages on the servers with the intermittent RPC request issues. I ran the command to turn off scatter-gather on one of the servers:

sudo ethtool -K eth0 sg off

No errors for 6 hours. So I ran it on all of them. And we haven’t seen the RPC error since. The setting doesn’t persist across reboots, so now it’s in rc.local in all of our AMIs. We haven’t noticed any performance problems. And getting rid of an occasional 10-minute RPC request is a definite performance boost.

So check your syslogs. If you see anything like “skb rides the rocket”, try turning off scatter-gather.
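If you want to automate that check, here is a rough sketch rather than the exact script we run. It assumes the kernel messages land in /var/log/syslog and that the interface is eth0; both vary by distribution and instance type. The function names has_rocket_warning and fix_if_needed are made up for this example.

```shell
#!/bin/sh
# Sketch: look for the xen_netfront warning and, only if it is present,
# turn off scatter-gather. The default log path and interface name below
# are assumptions; adjust them for your setup.

# Returns 0 if the given log file contains the warning, non-zero otherwise.
has_rocket_warning() {
    grep -q "skb rides the rocket" "$1" 2>/dev/null
}

# Usage: fix_if_needed [log_file] [interface]
fix_if_needed() {
    log_file="${1:-/var/log/syslog}"
    iface="${2:-eth0}"
    if has_rocket_warning "$log_file"; then
        echo "warning found in $log_file; disabling scatter-gather on $iface"
        sudo ethtool -K "$iface" sg off
    else
        echo "no warning found in $log_file"
    fi
}
```

Calling fix_if_needed with no arguments checks /var/log/syslog and acts on eth0. Afterwards, ethtool -k eth0 (lowercase k) lists the current offload settings, so you can confirm scatter-gather is off.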

Send alerts to PagerDuty, Webhooks, and Slack

10 September 2014

We expanded where you can send your StatHat alerts. In addition to email addresses and Campfire chat rooms, you can send your alerts to PagerDuty, Slack channels, and generic webhooks.

We created a new system to set up your alert destinations. You can configure defaults for manual and automatic alerts as well as customized destinations for each manual alert.

Read all about alerts and the new destinations here.

We have a few more integrations in the queue, but let us know if there are other integrations you would love to have.

Export Daily Summary Data

18 August 2014

By popular demand, we made it easy to get daily summary data using the export API. This query will return 7 days of summary data:

You can specify any standard timeframe abbreviation, for example 2 months:

The export uses the timezone set on the settings page to split the data into days exactly at midnight. Any data received since the most recent midnight is not included.

What's powering the new web interface?

10 April 2014

We’ve had a lot of questions about what’s powering our new interface, so here’s a brief rundown:


Everything inside the <body> tag is rendered with React components.

We tried a lot of JS frameworks and libraries to help us make the UI more dynamic. The one that fit best was React. It provides a powerful way to make reusable components and the one-way data flow makes a lot of sense. With React, our code is understandable, small, and fast.


React doesn’t have anything for routing. We tried a bunch of routers as well and settled on Backbone’s. It’s simple, and it works.


We are using d3 to render our charts. d3 is pretty amazing. We’re using maybe 1% of its API, but we were able to replicate our hand-coded server-drawn charts quite easily.


We wrote some async JS code by hand a while back to get a bunch of datasets and split them up into a timeframe for a chart. Then we came back to it and had no idea how it worked. IcedCoffeeScript makes asynchronous JavaScript a breeze. We replaced our code with a few lines of IcedCoffeeScript and never looked back.


We use gulp to stitch all this JavaScript together, lint it, minify it, and compress it. Then we have a Go app that uploads the bundle to S3/CloudFront, updates an assets file, and HUPs the web servers so they reload the assets file.

» Here’s the gulp plugin we wrote for IcedCoffeeScript


Nothing has changed on the backend: we still use Go for everything. It does a marvelous job with JSON.

Web App Interface Changes: Stats

09 April 2014

We just deployed some significant changes to StatHat’s web interface. Here are some changes concerning viewing and analyzing stats:

We changed the stat details interface to let you select any timeframe you want. Our standard one-hour to one-year overview still exists, but you don’t need to switch to the analyze interface to view something like 7 hours at 2-minute intervals.

When you hover over the chart, you will now see the value at each data point.

We are also including more summary data:

  • most recent value
  • standard deviation
  • 95% and 99% confidence intervals

Pressing the Play button makes the chart automatically update every minute.

As always, the analyze interface allows you to compare up to five stats on the same chart. Comparing stats of vastly different scales is easier now: the left and right axes are independent, and each can be assigned any group of stats. For example, comparing the number of API calls (which ranges from 6M to 14M) with HTTP request times ranging from 100 to 1800 ms is now much clearer.

The charts are fully responsive, from large screens to medium to small.

We made changes to make StatHat easier to use when you have lots of stats:

  • stats are automatically collapsed into groups by common prefix
  • intelligent filtering and improved search
  • groups and saved searches visible in the stat list sidebar

There are many more minor tweaks and updates, all with the goal of making StatHat a more useful tool to analyze your stats. Please let us know what you think of the changes.

Upcoming posts will highlight changes to alerts and describe the technologies behind these interface changes.