Here’s the fix:
sudo ethtool -K eth0 sg off
We’ve been noticing sporadic network issues with some of our EC2 instances. The main
symptom is that once in a while, an RPC request will hang for around 10 minutes.
We’ve never been able to reproduce it in a development environment, only in production.
Some days it happens a lot, some not at all. All of our attempts to debug the issue
failed. We have automatic monitoring tools that check for this happening and make a
server inactive until it clears up. These handle the situation fine, but it’s still
annoying and suboptimal.
Last week, I was investigating an unrelated issue. I was reading a syslog and saw
xen_netfront: xennet: skb rides the rocket: 19 slots
If it hadn’t been so odd, I would have passed right over it. But I had to google it,
which led me to this blog post and this Ubuntu bug report. Apparently, the message refers to getting hit by a rocket launcher in Quake, which
takes me back to the daily capture the flag matches we used to have at Click working
on Throne of Darkness…
Anyway, I found similar log messages on the servers with the intermittent RPC request
issues. I ran the command to turn off scatter-gather on one of the servers:
sudo ethtool -K eth0 sg off
No errors for 6 hours. So I ran it on all of them. And we haven’t seen the RPC
error since. Now it’s in rc.local in all of our AMIs. We haven’t noticed any
performance problems. And getting rid of an occasional 10-minute RPC request is
a definite performance boost.
So check your syslogs. If you see anything like “skb rides the rocket” then
try turning off scatter-gather.
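If you have more than a handful of servers, the syslog check is easy to script. Here is a minimal Go sketch, not the tooling we actually use. The names `rocketMarker` and `RocketLines` are made up for this example, and it assumes you have already read the log file into a string (the file is often /var/log/syslog on Ubuntu, but that path is an assumption to verify on your own systems).

```go
package main

import "strings"

// rocketMarker is the kernel message quoted in this post.
const rocketMarker = "skb rides the rocket"

// RocketLines returns every line of log that contains rocketMarker.
// log is the full text of a syslog file, already read into memory.
// (Hypothetical helper for illustration only.)
func RocketLines(log string) []string {
	var hits []string
	for _, line := range strings.Split(log, "\n") {
		if strings.Contains(line, rocketMarker) {
			hits = append(hits, line)
		}
	}
	return hits
}
```

Any non-empty result means the instance is a candidate for the ethtool command above.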
We expanded where you can send your StatHat alerts. In
addition to email addresses and Campfire chat rooms, you
can send your alerts to PagerDuty, Slack channels, and
We created a new system to set up your alert destinations.
You can configure defaults for manual and automatic alerts
as well as customized destinations for each manual alert.
Read all about alerts and the new destinations here.
We have a few more integrations in the queue, but let us
know if there are other integrations you would love to
By popular demand, we made it easy to get daily summary data using the export API.
This query will return 7 days of summary data:
You can specify any standard timeframe abbreviation, for example 2 months:
It uses the timezone set on the settings page to split the data exactly at midnight.
Any data received since midnight is not included.
We’ve had a lot of questions about what’s powering our new interface, so
here’s a brief rundown:
Everything inside the <body> tag is rendered with React components.
We tried a lot of JS frameworks and libraries to help us make the UI more dynamic. The one that fit best was React.
It provides a powerful way to make reusable components, and the one-way data flow makes a lot of sense.
With React, our code is understandable, small, and fast.
React doesn’t have anything for routing. We tried a bunch of routers as well and settled on Backbone’s router. It’s simple,
We are using d3 to render our charts. d3 is pretty amazing. We’re using maybe 1% of its API, but we were able to replicate our
hand-coded server-drawn charts quite easily.
We wrote some async JS code by hand a while back to get a bunch of datasets and split them up into a timeframe for a chart.
We replaced our code with a few lines of IcedCoffeeScript and never looked back.
Then we have a Go app that uploads the bundle to S3/CloudFront, updates an assets file, and HUPs
the web servers so they reload the assets file.
» Here’s the gulp plugin we wrote for IcedCoffeeScript
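For anyone curious about the HUP step, here is a small Go sketch of a web server reloading an assets file when it receives SIGHUP. It is an illustration under assumptions, not our production code: the `Assets` type, its methods, and the JSON layout (a flat object mapping asset names to URLs) are all hypothetical.

```go
package main

import (
	"encoding/json"
	"log"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

// Assets maps a logical asset name to its published URL.
// The JSON layout is an assumption made for this example.
type Assets struct {
	mu    sync.RWMutex
	paths map[string]string
}

// Load replaces the current mapping with the JSON object in data.
// If data is not valid JSON, the old mapping is left untouched.
func (a *Assets) Load(data []byte) error {
	var m map[string]string
	if err := json.Unmarshal(data, &m); err != nil {
		return err
	}
	a.mu.Lock()
	a.paths = m
	a.mu.Unlock()
	return nil
}

// URL returns the published URL for name, or "" if it is unknown.
func (a *Assets) URL(name string) string {
	a.mu.RLock()
	defer a.mu.RUnlock()
	return a.paths[name]
}

// ReloadOnHUP re-reads the file at path every time the process
// receives SIGHUP. It blocks, so run it in its own goroutine.
func (a *Assets) ReloadOnHUP(path string) {
	hup := make(chan os.Signal, 1)
	signal.Notify(hup, syscall.SIGHUP)
	for range hup {
		data, err := os.ReadFile(path)
		if err == nil {
			err = a.Load(data)
		}
		if err != nil {
			log.Printf("assets reload failed, keeping old mapping: %v", err)
		}
	}
}
```

A server would call `go assets.ReloadOnHUP(path)` at startup, and the deploy tool would then only need to send the process a HUP signal after writing the new file.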
Nothing has changed on the backend: we still use Go for everything.
It does a marvelous job with JSON.
We just deployed some significant changes to StatHat’s web interface.
Here are some changes concerning viewing and analyzing stats:
We changed the stat details interface to let you select any timeframe you want.
Our standard one hour to one year overview still exists, but you don’t need to switch
to the analyze interface to view something like 7 hours at 2-minute intervals.
When you hover over the chart, you will now see the value at each data point.
We are also including more summary data:
- most recent value
- standard deviation
- 95% and 99% confidence intervals
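For reference, here is one common way to compute values like these from a slice of data points. This is a sketch, not necessarily the exact calculation StatHat performs: it assumes the sample standard deviation and a normal approximation for confidence intervals around the mean (z = 1.96 and z = 2.576). `Summary` and `Summarize` are hypothetical names.

```go
package main

import "math"

// Summary holds the mean, standard deviation, and confidence
// intervals for the mean of a set of values.
type Summary struct {
	Mean, StdDev      float64
	CI95Low, CI95High float64
	CI99Low, CI99High float64
}

// Summarize computes the sample mean, the sample standard deviation
// (n-1 denominator), and normal-approximation confidence intervals.
// It returns false when fewer than two values are supplied.
func Summarize(values []float64) (Summary, bool) {
	if len(values) < 2 {
		return Summary{}, false
	}
	n := float64(len(values))

	var sum float64
	for _, v := range values {
		sum += v
	}
	mean := sum / n

	var sumSq float64
	for _, v := range values {
		d := v - mean
		sumSq += d * d
	}
	stdDev := math.Sqrt(sumSq / (n - 1))
	stdErr := stdDev / math.Sqrt(n)

	// Assumed z-scores for two-sided 95% and 99% intervals.
	const z95, z99 = 1.96, 2.576
	return Summary{
		Mean: mean, StdDev: stdDev,
		CI95Low: mean - z95*stdErr, CI95High: mean + z95*stdErr,
		CI99Low: mean - z99*stdErr, CI99High: mean + z99*stdErr,
	}, true
}
```

With only a few data points, a t-distribution would give wider and more honest intervals than the fixed z-scores used here.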
Pressing the Play button makes the chart automatically update every minute.
As always, the analyze interface allows you to compare up to five stats on the same
chart. Comparing stats of vastly different scales is easier now: the left and right
axes are independent and can be set to any group of stats. Comparing the number of
API calls (which ranges from 6M - 14M) with HTTP request times ranging from 100 - 1800ms
The charts are fully responsive across screen sizes.
We made changes to make StatHat easier to use when you have lots of stats:
- stats are automatically collapsed into groups by common prefix
- intelligent filtering and improved search
- groups and saved searches visible in the stat list sidebar
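As an illustration of the prefix grouping idea, here is a tiny Go sketch. It assumes names are grouped by the text before the first occurrence of a separator, which is a simplification and may differ from the rule the interface applies. `GroupByPrefix` and the `sep` parameter are hypothetical.

```go
package main

import "strings"

// GroupByPrefix buckets stat names by the text before the first sep.
// Names that do not contain sep form a group of their own, keyed by
// the full name. Order within each group follows the input order.
func GroupByPrefix(names []string, sep string) map[string][]string {
	groups := make(map[string][]string)
	for _, name := range names {
		// Cut returns the whole name as prefix when sep is absent.
		prefix, _, _ := strings.Cut(name, sep)
		groups[prefix] = append(groups[prefix], name)
	}
	return groups
}
```

A list view could then collapse every group with more than one member and show single-member groups as plain stats.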
There are many more minor tweaks and updates, all with the goal of making StatHat
a more useful tool to analyze your stats. Please let us know what you think of the
Upcoming posts will highlight changes to alerts and describe the technologies behind
these interface changes.