Here’s the fix:
sudo ethtool -K eth0 sg off
Back story:
We’ve been noticing sporadic network issues with some of our EC2 instances. The main symptom is that once in a while, an RPC request will hang for around 10 minutes.
We’ve never been able to reproduce it in a development environment, only in production. Some days it happens a lot, some not at all. All of our attempts to debug the issue failed. We have automatic monitoring tools that check for this happening and make a server inactive until it clears up. These handle the situation fine, but it’s still annoying and suboptimal.
Last week, I was investigating an unrelated issue. I was reading a syslog and saw this:
xen_netfront: xennet: skb rides the rocket: 19 slots
If it weren’t so odd, I would have passed right over it. But I had to google it, which led me to this blog post and this Ubuntu bug report. Apparently, it’s based on getting hit by a rocket launcher in Quake, which takes me back to the daily capture-the-flag matches we used to have at Click working on Throne of Darkness…
Anyway, I found similar log messages on the servers with the intermittent RPC request issues. I ran the command to turn off scatter-gather on one of the servers:
sudo ethtool -K eth0 sg off
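If you want to confirm the setting took, ethtool's lowercase -k flag lists the current offload features. Here's a minimal sketch; sg_state is a helper name made up for illustration, and it assumes the usual top-level "scatter-gather: on" or "scatter-gather: off" line in that output:

```shell
# sg_state: hypothetical helper that reads `ethtool -k` output on stdin
# and prints the value of the top-level scatter-gather line.
# Indented sub-features such as tx-scatter-gather are ignored.
sg_state() {
    awk -F': ' '/^scatter-gather:/ { print $2 }'
}

# Usage (not run here):
#   ethtool -k eth0 | sg_state
```

After the fix, this should report the feature as off.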
No errors for 6 hours. So I ran it on all of them. And we haven’t seen the RPC error since. Now it’s in rc.local in all of our AMIs. We haven’t noticed any performance problems. And getting rid of an occasional 10-minute RPC hang is a definite performance boost.
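For reference, the rc.local entry can be as small as the command itself. The version below is a sketch, not our exact file: the interface check and the warning message are assumptions, added so a missing eth0 or a failed call doesn't abort the rest of rc.local on systems where it runs under sh -e.

```shell
# Sketch of an /etc/rc.local fragment (rc.local runs as root at boot, so no sudo).
# The eth0 check and the warning text are assumptions, not part of the original fix.
if [ -d /sys/class/net/eth0 ] && command -v ethtool >/dev/null 2>&1; then
    ethtool -K eth0 sg off || echo "warning: could not disable scatter-gather on eth0" >&2
fi
```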
So check your syslogs. If you see anything like “skb rides the rocket” then try turning off scatter-gather.
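A quick way to do that check is to count matching lines with grep. This is a rough sketch: count_rocket_msgs is a made-up helper name, and the default path /var/log/syslog is an assumption that fits Ubuntu-style systems, so pass your own log file if yours lives elsewhere.

```shell
# count_rocket_msgs: hypothetical helper that prints how many lines in a
# log file contain the xen_netfront "rides the rocket" message.
# The default path is an assumption (Ubuntu-style syslog location).
count_rocket_msgs() {
    logfile="${1:-/var/log/syslog}"
    # grep -c prints the number of matching lines (0 if there are none)
    grep -c "rides the rocket" "$logfile"
}

# Usage (not run here):
#   count_rocket_msgs                  # checks /var/log/syslog
#   count_rocket_msgs /var/log/kern.log
```

Anything above zero is a hint that the server is worth trying the fix on.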