This is a heavily truncated version of an internal blog post from August 2017. For more recent updates on Kafka, check out another blog post on compression, where we optimized throughput 4.5x for both disks and network.
Upgrading our systems to Debian Stretch
For quite some time we've been rolling out Debian Stretch, to the point where we have reached ~10% adoption in our core datacenters. As part of upgrading the underlying OS, we also evaluate the higher level software stack, e.g. taking a look at our ClickHouse and Kafka clusters.
During our upgrade of Kafka, we successfully migrated two smaller clusters, logs and dns, but ran into issues when attempting to upgrade one of our larger clusters, http.
Thankfully, we were able to roll back the http cluster upgrade relatively easily, due to heavy versioning of both the OS and the higher level software stack. If there's one takeaway from this blog post, it's to take advantage of consistent versioning.
High level differences
We upgraded one Kafka http node, and it did not go as planned:
Seeing 5x the CPU usage was definitely an unexpected outcome. For control datapoints, we compared against a node where no upgrade happened, and against an intermediary node that received a software stack upgrade but not an OS upgrade. Neither of these two nodes experienced the same CPU saturation issues, even though their setups were practically identical.
For debugging CPU saturation issues, we call on perf to fish out details. The command used was perf top -F 99.
RCU stalls
In addition to higher system CPU usage, we found secondary slowdowns, including read-copy update (RCU) stalls:
[ 4909.110009] logfwdr (26887) used greatest stack depth: 11544 bytes left
[ 4909.392659] oom_reaper: reaped process 26861 (logfwdr), now anon-rss:8kB, file-rss:0kB, shmem-rss:0kB
[ 4923.462841] INFO: rcu_sched self-detected stall on CPU
[ 4923.462843] 13-...: (2 GPs behind) idle=ea7/140000000000001/0 softirq=1/2 fqs=4198
[ 4923.462845] (t=8403 jiffies g=110722 c=110721 q=6440)
We've seen RCU stalls before, and our (suboptimal) solution was to reboot the machine.
However, one can only handle so many reboots before the problem becomes severe enough to warrant a deep dive. During our deep dive, we noticed in dmesg that we had issues allocating memory while trying to write out errors:
Aug 15 21:51:35 myhost kernel: INFO: rcu_sched detected stalls on CPUs/tasks:
Aug 15 21:51:35 myhost kernel: 26-...: (1881 ticks this GP) idle=76f/140000000000000/0 softirq=8/8 fqs=365
Aug 15 21:51:35 myhost kernel: (detected by 0, t=2102 jiffies, g=1837293, c=1837292, q=262)
Aug 15 21:51:35 myhost kernel: Task dump for CPU 26:
Aug 15 21:51:35 myhost kernel: java R running task 13488 1714 1513 0x00080188
Aug 15 21:51:35 myhost kernel: ffffc9000d1f7898 ffffffff814ee977 ffff88103f410400 000000000000000a
Aug 15 21:51:35 myhost kernel: 0000000000000041 ffffffff82203142 ffffc9000d1f78c0 ffffffff814eea10
Aug 15 21:51:35 myhost kernel: 0000000000000041 ffffffff82203142 ffff88103f410400 ffffc9000d1f7920
Aug 15 21:51:35 myhost kernel: Call Trace:
Aug 15 21:51:35 myhost kernel: [<ffffffff814ee977>] ? scrup+0x147/0x160
Aug 15 21:51:35 myhost kernel: [<ffffffff814eea10>] ? lf+0x80/0x90
Aug 15 21:51:35 myhost kernel: [<ffffffff814eecb5>] ? vt_console_print+0x295/0x3c0
Aug 15 21:51:35 myhost kernel: [<ffffffff810b1193>] ? call_console_drivers.isra.22.constprop.30+0xf3/0x100
Aug 15 21:51:35 myhost kernel: [<ffffffff810b1f51>] ? console_unlock+0x281/0x550
Aug 15 21:51:35 myhost kernel: [<ffffffff810b2498>] ? vprintk_emit+0x278/0x430
Aug 15 21:51:35 myhost kernel: [<ffffffff810b27ef>] ? vprintk_default+0x1f/0x30
Aug 15 21:51:35 myhost kernel: [<ffffffff811588df>] ? printk+0x48/0x50
Aug 15 21:51:35 myhost kernel: [<ffffffff810b30ee>] ? dump_stack_print_info+0x7e/0xc0
Aug 15 21:51:35 myhost kernel: [<ffffffff8142d41f>] ? dump_stack+0x44/0x65
Aug 15 21:51:35 myhost kernel: [<ffffffff81162e64>] ? warn_alloc+0x124/0x150
Aug 15 21:51:35 myhost kernel: [<ffffffff81163842>] ? __alloc_pages_slowpath+0x932/0xb80
Aug 15 21:51:35 myhost kernel: [<ffffffff81163c92>] ? __alloc_pages_nodemask+0x202/0x250
Aug 15 21:51:35 myhost kernel: [<ffffffff811ae9c2>] ? alloc_pages_current+0x92/0x120
Aug 15 21:51:35 myhost kernel: [<ffffffff81159d2f>] ? __page_cache_alloc+0xbf/0xd0
Aug 15 21:51:35 myhost kernel: [<ffffffff8115cdfa>] ? filemap_fault+0x2ea/0x4d0
Aug 15 21:51:35 myhost kernel: [<ffffffff8136dc95>] ? xfs_filemap_fault+0x45/0xa0
Aug 15 21:51:35 myhost kernel: [<ffffffff8118b3eb>] ? __do_fault+0x6b/0xd0
Aug 15 21:51:35 myhost kernel: [<ffffffff81190028>] ? handle_mm_fault+0xe98/0x12b0
Aug 15 21:51:35 myhost kernel: [<ffffffff8110756b>] ? __seccomp_filter+0x1db/0x290
Aug 15 21:51:35 myhost kernel: [<ffffffff8104fa5c>] ? __do_page_fault+0x22c/0x4c0
Aug 15 21:51:35 myhost kernel: [<ffffffff8104fd10>] ? do_page_fault+0x20/0x70
Aug 15 21:51:35 myhost kernel: [<ffffffff819bea02>] ? page_fault+0x22/0x30
This suggested that we were logging too many errors, and that the actual failure might lie earlier in the chain. Armed with this hypothesis, we looked at the very beginning of the error chain:
Aug 16 01:14:51 myhost systemd-journald[13812]: Missed 17171 kernel messages
Aug 16 01:14:51 myhost kernel: [<ffffffff81171754>] shrink_inactive_list+0x1f4/0x4f0
Aug 16 01:14:51 myhost kernel: [<ffffffff8117234b>] shrink_node_memcg+0x5bb/0x780
Aug 16 01:14:51 myhost kernel: [<ffffffff811725e2>] shrink_node+0xd2/0x2f0
Aug 16 01:14:51 myhost kernel: [<ffffffff811728ef>] do_try_to_free_pages+0xef/0x310
Aug 16 01:14:51 myhost kernel: [<ffffffff81172be5>] try_to_free_pages+0xd5/0x180
Aug 16 01:14:51 myhost kernel: [<ffffffff811632db>] __alloc_pages_slowpath+0x31b/0xb80
As much as shrink_node may scream "NUMA issues", you're looking primarily at:
Aug 16 01:14:51 myhost systemd-journald[13812]: Missed 17171 kernel messages
In addition, we found memory allocation issues:
[78972.506644] Mem-Info:
[78972.506653] active_anon:3936889 inactive_anon:371971 isolated_anon:0
[78972.506653] active_file:25778474 inactive_file:1214478 isolated_file:2208
[78972.506653] unevictable:0 dirty:1760643 writeback:0 unstable:0
[78972.506653] slab_reclaimable:1059804 slab_unreclaimable:141694
[78972.506653] mapped:47285 shmem:535917 pagetables:10298 bounce:0
[78972.506653] free:202928 free_pcp:3085 free_cma:0
[78972.506660] Node 0 active_anon:8333016kB inactive_anon:989808kB active_file:50622384kB inactive_file:2401416kB unevictable:0kB isolated(anon):0kB isolated(file):3072kB mapped:96624kB dirty:3422168kB writeback:0kB shmem:1261156kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:15744 all_unreclaimable? no
[78972.506666] Node 1 active_anon:7414540kB inactive_anon:498076kB active_file:52491512kB inactive_file:2456496kB unevictable:0kB isolated(anon):0kB isolated(file):5760kB mapped:92516kB dirty:3620404kB writeback:0kB shmem:882512kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:9080974 all_unreclaimable? no
[78972.506671] Node 0 DMA free:15900kB min:100kB low:124kB high:148kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15900kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
** 9 printk messages dropped ** [78972.506716] Node 0 Normal: 15336*4kB (UMEH) 4584*8kB (MEH) 2119*16kB (UME) 775*32kB (MEH) 106*64kB (UM) 81*128kB (MH) 29*256kB (UM) 25*512kB (M) 19*1024kB (M) 7*2048kB (M) 2*4096kB (M) = 236080kB
[78972.506725] Node 1 Normal: 31740*4kB (UMEH) 3879*8kB (UMEH) 873*16kB (UME) 353*32kB (UM) 286*64kB (UMH) 62*128kB (UMH) 28*256kB (MH) 20*512kB (UMH) 15*1024kB (UM) 7*2048kB (UM) 12*4096kB (M) = 305752kB
[78972.506726] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[78972.506727] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[78972.506728] 27531091 total pagecache pages
[78972.506729] 0 pages in swap cache
[78972.506730] Swap cache stats: add 0, delete 0, find 0/0
[78972.506730] Free swap = 0kB
[78972.506731] Total swap = 0kB
[78972.506731] 33524975 pages RAM
[78972.506732] 0 pages HighMem/MovableOnly
[78972.506732] 546255 pages reserved
[78972.620129] ntpd: page allocation stalls for 272380ms, order:0, mode:0x24000c0(GFP_KERNEL)
[78972.620132] CPU: 16 PID: 13099 Comm: ntpd Tainted: G O 4.9.43-cloudflare-2017.8.4 #1
[78972.620133] Hardware name: Quanta Computer Inc D51B-2U (dual 1G LoM)/S2B-MB (dual 1G LoM), BIOS S2B_3A21 10/01/2015
[78972.620136] ffffc90022f9b6f8 ffffffff8142d668 ffffffff81ca31b8 0000000000000001
[78972.620138] ffffc90022f9b778 ffffffff81162f14 024000c022f9b740 ffffffff81ca31b8
[78972.620140] ffffc90022f9b720 0000000000000010 ffffc90022f9b788 ffffc90022f9b738
[78972.620140] Call Trace:
[78972.620148] [<ffffffff8142d668>] dump_stack+0x4d/0x65
[78972.620152] [<ffffffff81162f14>] warn_alloc+0x124/0x150
[78972.620154] [<ffffffff811638f2>] __alloc_pages_slowpath+0x932/0xb80
[78972.620157] [<ffffffff81163d42>] __alloc_pages_nodemask+0x202/0x250
[78972.620160] [<ffffffff811aeae2>] alloc_pages_current+0x92/0x120
[78972.620162] [<ffffffff8115f6ee>] __get_free_pages+0xe/0x40
[78972.620165] [<ffffffff811e747a>] __pollwait+0x9a/0xe0
[78972.620168] [<ffffffff817c9ec9>] datagram_poll+0x29/0x100
[78972.620170] [<ffffffff817b9d48>] sock_poll+0x48/0xa0
[78972.620172] [<ffffffff811e7c35>] do_select+0x335/0x7b0
This specific error message did seem fun:
[78991.546088] systemd-network: page allocation stalls for 287000ms, order:0, mode:0x24200ca(GFP_HIGHUSER_MOVABLE)
You don't want your page allocations to stall for 5 minutes, especially when it's an order-zero allocation (the smallest possible allocation: a single 4 KiB page).
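As an aside, one quick way to spot this class of problem (our suggestion here, not part of the original investigation) is to watch the direct reclaim stall counters in /proc/vmstat:
$ # These counters increase every time an allocation has to stall for direct reclaim.
$ grep allocstall /proc/vmstat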
Comparing against our control nodes, there were only two possible explanations: the kernel upgrade, or the switch from Debian Jessie to Debian Stretch. We suspected the former, since high system CPU usage points at the kernel. However, just to be safe, we both rolled the kernel back to 4.4.55 and downgraded the affected nodes back to Debian Jessie. This was a reasonable compromise, since we needed to minimize downtime on production nodes.
Digging a bit deeper
Keeping servers running on an older kernel and distribution is not a viable long-term solution. Through bisection, we found that the issue lay in the Jessie to Stretch upgrade, contrary to our initial hypothesis.
Now that we knew what the problem was, we proceeded to investigate why. With the help of existing automation around perf and Java, we generated the following flamegraphs:
Jessie
Stretch
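Our flamegraph automation isn't included in this truncated post, but a rough manual equivalent uses perf together with Brendan Gregg's FlameGraph scripts. The PID, the 30 second window, and the ~/FlameGraph checkout path below are illustrative, and readable Java frames additionally require a symbol mapper such as perf-map-agent:
$ # Sample on-CPU stacks of the Kafka broker at 99 Hz for 30 seconds.
$ sudo perf record -F 99 -g -p 23485 -- sleep 30
$ # Fold the stacks and render an SVG flamegraph.
$ sudo perf script | ~/FlameGraph/stackcollapse-perf.pl | ~/FlameGraph/flamegraph.pl > kafka-oncpu.svg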
At first it looked like Jessie was doing writev instead of sendfile, but the full flamegraphs revealed that Stretch was executing sendfile a lot slower.
If you highlight sendfile:
Jessie
Stretch
And zoomed in:
Jessie
Stretch
These two look very different.
Some colleagues suggested that the differences in the graphs may be due to TCP offload being disabled, but upon checking our NIC settings, we found that the feature flags were identical.
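For reference, that check can be scripted with ethtool and diff, mirroring the comparison we use later for the VLAN interfaces (the hostnames and interface names are the same shorthand used in the diffs further down):
$ for nic in eth2 eth3 bond0; do
>   echo "== $nic =="
>   diff <(ssh jessie sudo ethtool -k "$nic") <(ssh stretch sudo ethtool -k "$nic")
> done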
We'll dive into the differences in the next section.
And deeper
To trace the latency distributions of sendfile syscalls between Jessie and Stretch, we used funclatency from bcc-tools:
Jessie
$ sudo /usr/share/bcc/tools/funclatency -uTi 1 do_sendfile
Tracing 1 functions for "do_sendfile"... Hit Ctrl-C to end.
23:27:25
usecs : count distribution
0 -> 1 : 9 | |
2 -> 3 : 47 |**** |
4 -> 7 : 53 |***** |
8 -> 15 : 379 |****************************************|
16 -> 31 : 329 |********************************** |
32 -> 63 : 101 |********** |
64 -> 127 : 23 |** |
128 -> 255 : 50 |***** |
256 -> 511 : 7 | |
Stretch
$ sudo /usr/share/bcc/tools/funclatency -uTi 1 do_sendfile
Tracing 1 functions for "do_sendfile"... Hit Ctrl-C to end.
23:27:28
usecs : count distribution
0 -> 1 : 1 | |
2 -> 3 : 20 |*** |
4 -> 7 : 46 |******* |
8 -> 15 : 56 |******** |
16 -> 31 : 65 |********** |
32 -> 63 : 75 |*********** |
64 -> 127 : 75 |*********** |
128 -> 255 : 258 |****************************************|
256 -> 511 : 144 |********************** |
512 -> 1023 : 24 |*** |
1024 -> 2047 : 27 |**** |
2048 -> 4095 : 28 |**** |
4096 -> 8191 : 35 |***** |
8192 -> 16383 : 1 | |
In the flamegraphs, you can see timers being set at the tip (the mod_timer function), with these timers taking locks. On Stretch we installed 3x more timers, resulting in 10x the amount of lock contention:
Jessie
$ sudo /usr/share/bcc/tools/funccount -T -i 1 mod_timer
Tracing 1 functions for "mod_timer"... Hit Ctrl-C to end.
00:33:36
FUNC COUNT
mod_timer 60482
00:33:37
FUNC COUNT
mod_timer 58263
00:33:38
FUNC COUNT
mod_timer 54626
$ sudo /usr/share/bcc/tools/funccount -T -i 1 lock_timer_base
Tracing 1 functions for "lock_timer_base"... Hit Ctrl-C to end.
00:32:36
FUNC COUNT
lock_timer_base 15962
00:32:37
FUNC COUNT
lock_timer_base 16261
00:32:38
FUNC COUNT
lock_timer_base 15806
Stretch
$ sudo /usr/share/bcc/tools/funccount -T -i 1 mod_timer
Tracing 1 functions for "mod_timer"... Hit Ctrl-C to end.
00:33:28
FUNC COUNT
mod_timer 149068
00:33:29
FUNC COUNT
mod_timer 155994
00:33:30
FUNC COUNT
mod_timer 160688
$ sudo /usr/share/bcc/tools/funccount -T -i 1 lock_timer_base
Tracing 1 functions for "lock_timer_base"... Hit Ctrl-C to end.
00:32:32
FUNC COUNT
lock_timer_base 119189
00:32:33
FUNC COUNT
lock_timer_base 196895
00:32:34
FUNC COUNT
lock_timer_base 140085
The Linux kernel includes debugging facilities for timers, which fire the timer:timer_start tracepoint every time a timer is armed. This allowed us to pull up the timer names:
Jessie
$ sudo perf record -e timer:timer_start -p 23485 -- sleep 10 && sudo perf script | sed 's/.* function=//g' | awk '{ print $1 }' | sort | uniq -c
[ perf record: Woken up 54 times to write data ]
[ perf record: Captured and wrote 17.778 MB perf.data (173520 samples) ]
6 blk_rq_timed_out_timer
2 clocksource_watchdog
5 commit_timeout
5 cursor_timer_handler
2 dev_watchdog
10 garp_join_timer
2 ixgbe_service_timer
36 reqsk_timer_handler
4769 tcp_delack_timer
171 tcp_keepalive_timer
168512 tcp_write_timer
Stretch
$ sudo perf record -e timer:timer_start -p 3416 -- sleep 10 && sudo perf script | sed 's/.* function=//g' | awk '{ print $1 }' | sort | uniq -c
[ perf record: Woken up 671 times to write data ]
[ perf record: Captured and wrote 198.273 MB perf.data (1988650 samples) ]
6 clocksource_watchdog
4 commit_timeout
12 cursor_timer_handler
2 dev_watchdog
18 garp_join_timer
4 ixgbe_service_timer
1 neigh_timer_handler
1 reqsk_timer_handler
4622 tcp_delack_timer
1 tcp_keepalive_timer
1983978 tcp_write_timer
1 writeout_period
So basically we install 12x more tcp_write_timer timers, resulting in higher kernel CPU usage.
Taking specific flamegraphs of the timers revealed the differences in their operation:
Jessie
Stretch
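One way to produce such graphs is to record call stacks on the timer:timer_start tracepoint and feed them through the same FlameGraph scripts as before (the script path is again an assumption):
$ # Capture the stacks that arm timers, system-wide, for 10 seconds.
$ sudo perf record -e timer:timer_start -a -g -- sleep 10
$ sudo perf script | ~/FlameGraph/stackcollapse-perf.pl | ~/FlameGraph/flamegraph.pl > timers.svg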
We then traced the functions that were different:
Jessie
$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_sendmsg
Tracing 1 functions for "tcp_sendmsg"... Hit Ctrl-C to end.
03:33:33
FUNC COUNT
tcp_sendmsg 21166
03:33:34
FUNC COUNT
tcp_sendmsg 21768
03:33:35
FUNC COUNT
tcp_sendmsg 21712
$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_push_one
Tracing 1 functions for "tcp_push_one"... Hit Ctrl-C to end.
03:37:14
FUNC COUNT
tcp_push_one 496
03:37:15
FUNC COUNT
tcp_push_one 432
03:37:16
FUNC COUNT
tcp_push_one 495
$ sudo /usr/share/bcc/tools/trace -p 23485 'tcp_sendmsg "%d", arg3' -T -M 100000 | awk '{ print $NF }' | sort | uniq -c | sort -n | tail
1583 4
2043 54
3546 18
4016 59
4423 50
5349 8
6154 40
6620 38
17121 51
39528 44
Stretch
$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_sendmsg
Tracing 1 functions for "tcp_sendmsg"... Hit Ctrl-C to end.
03:33:30
FUNC COUNT
tcp_sendmsg 53834
03:33:31
FUNC COUNT
tcp_sendmsg 49472
03:33:32
FUNC COUNT
tcp_sendmsg 51221
$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_push_one
Tracing 1 functions for "tcp_push_one"... Hit Ctrl-C to end.
03:37:10
FUNC COUNT
tcp_push_one 64483
03:37:11
FUNC COUNT
tcp_push_one 65058
03:37:12
FUNC COUNT
tcp_push_one 72394
$ sudo /usr/share/bcc/tools/trace -p 3416 'tcp_sendmsg "%d", arg3' -T -M 100000 | awk '{ print $NF }' | sort | uniq -c | sort -n | tail
396 46
409 4
1124 50
1305 18
1547 40
1672 59
1729 8
2181 38
19052 44
64504 4096
The traces showed huge differences in how often tcp_sendmsg and tcp_push_one are called within sendfile, and in the sizes passed to them.
To introspect further, we leveraged a kernel feature available since 4.9: the ability to count stacks. This let us measure what hits tcp_push_one:
Jessie
$ sudo /usr/share/bcc/tools/stackcount -i 10 tcp_push_one
Tracing 1 functions for "tcp_push_one"... Hit Ctrl-C to end.
tcp_push_one
inet_sendmsg
sock_sendmsg
sock_write_iter
do_iter_readv_writev
do_readv_writev
vfs_writev
do_writev
SyS_writev
do_syscall_64
return_from_SYSCALL_64
1
tcp_push_one
inet_sendpage
kernel_sendpage
sock_sendpage
pipe_to_sendpage
__splice_from_pipe
splice_from_pipe
generic_splice_sendpage
direct_splice_actor
splice_direct_to_actor
do_splice_direct
do_sendfile
sys_sendfile64
do_syscall_64
return_from_SYSCALL_64
4950
Stretch
$ sudo /usr/share/bcc/tools/stackcount -i 10 tcp_push_one
Tracing 1 functions for "tcp_push_one"... Hit Ctrl-C to end.
tcp_push_one
inet_sendmsg
sock_sendmsg
sock_write_iter
do_iter_readv_writev
do_readv_writev
vfs_writev
do_writev
SyS_writev
do_syscall_64
return_from_SYSCALL_64
123
tcp_push_one
inet_sendmsg
sock_sendmsg
sock_write_iter
__vfs_write
vfs_write
SyS_write
do_syscall_64
return_from_SYSCALL_64
172
tcp_push_one
inet_sendmsg
sock_sendmsg
kernel_sendmsg
sock_no_sendpage
tcp_sendpage
inet_sendpage
kernel_sendpage
sock_sendpage
pipe_to_sendpage
__splice_from_pipe
splice_from_pipe
generic_splice_sendpage
direct_splice_actor
splice_direct_to_actor
do_splice_direct
do_sendfile
sys_sendfile64
do_syscall_64
return_from_SYSCALL_64
735110
If you diff the most popular stacks, you'll get:
--- jessie.txt 2017-08-16 21:14:13.000000000 -0700
+++ stretch.txt 2017-08-16 21:14:20.000000000 -0700
@@ -1,4 +1,9 @@
tcp_push_one
+inet_sendmsg
+sock_sendmsg
+kernel_sendmsg
+sock_no_sendpage
+tcp_sendpage
inet_sendpage
kernel_sendpage
sock_sendpage
Let's look closer at tcp_sendpage:
int tcp_sendpage(struct sock *sk, struct page *page, int offset,
size_t size, int flags)
{
ssize_t res;
if (!(sk->sk_route_caps & NETIF_F_SG) ||
!sk_check_csum_caps(sk))
return sock_no_sendpage(sk->sk_socket, page, offset, size,
flags);
lock_sock(sk);
tcp_rate_check_app_limited(sk); /* is sending application-limited? */
res = do_tcp_sendpages(sk, page, offset, size, flags);
release_sock(sk);
return res;
}
It looks like on Stretch the check fails and we fall back to sock_no_sendpage instead of reaching do_tcp_sendpages, which matches the extra kernel_sendmsg and sock_no_sendpage frames in the stack above. We looked up what NETIF_F_SG does: scatter-gather I/O, which segmentation offload depends on. This difference is peculiar, since both OSes should have it enabled.
Even deeper, to the crux
It turned out that we had segmentation offload enabled for only a few of our NICs: eth2, eth3, and bond0. Our network setup can be described as follows:
eth2 -->| |--> vlan10
|---> bond0 -->|
eth3 -->| |--> vlan100
The missing piece was segmentation offload on the VLAN interfaces, where the actual IPs live.
Here's the diff from ethtool -k vlan10:
$ diff -rup <(ssh jessie sudo ethtool -k vlan10) <(ssh stretch sudo ethtool -k vlan10)
--- /dev/fd/63 2017-08-16 21:21:12.000000000 -0700
+++ /dev/fd/62 2017-08-16 21:21:12.000000000 -0700
@@ -1,21 +1,21 @@
Features for vlan10:
rx-checksumming: off [fixed]
-tx-checksumming: off
+tx-checksumming: on
tx-checksum-ipv4: off [fixed]
- tx-checksum-ip-generic: off
+ tx-checksum-ip-generic: on
tx-checksum-ipv6: off [fixed]
tx-checksum-fcoe-crc: off
tx-checksum-sctp: off
-scatter-gather: off
- tx-scatter-gather: off
+scatter-gather: on
+ tx-scatter-gather: on
tx-scatter-gather-fraglist: off
-tcp-segmentation-offload: off
- tx-tcp-segmentation: off [requested on]
- tx-tcp-ecn-segmentation: off [requested on]
- tx-tcp-mangleid-segmentation: off [requested on]
- tx-tcp6-segmentation: off [requested on]
-udp-fragmentation-offload: off [requested on]
-generic-segmentation-offload: off [requested on]
+tcp-segmentation-offload: on
+ tx-tcp-segmentation: on
+ tx-tcp-ecn-segmentation: on
+ tx-tcp-mangleid-segmentation: on
+ tx-tcp6-segmentation: on
+udp-fragmentation-offload: on
+generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: off [fixed]
So we enthusiastically enabled segmentation offload:
$ sudo ethtool -K vlan10 sg on
And it didn't help! Will the suffering ever end? Let's also enable TCP transmit checksum offload:
$ sudo ethtool -K vlan10 tx on
Actual changes:
tx-checksumming: on
tx-checksum-ip-generic: on
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: on
tx-tcp-mangleid-segmentation: on
tx-tcp6-segmentation: on
udp-fragmentation-offload: on
Nothing. The diff is essentially empty now:
$ diff -rup <(ssh jessie sudo ethtool -k vlan10) <(ssh stretch sudo ethtool -k vlan10)
--- /dev/fd/63 2017-08-16 21:31:27.000000000 -0700
+++ /dev/fd/62 2017-08-16 21:31:27.000000000 -0700
@@ -4,11 +4,11 @@ tx-checksumming: on
tx-checksum-ipv4: off [fixed]
tx-checksum-ip-generic: on
tx-checksum-ipv6: off [fixed]
- tx-checksum-fcoe-crc: off [requested on]
- tx-checksum-sctp: off [requested on]
+ tx-checksum-fcoe-crc: off
+ tx-checksum-sctp: off
scatter-gather: on
tx-scatter-gather: on
- tx-scatter-gather-fraglist: off [requested on]
+ tx-scatter-gather-fraglist: off
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: on
The last missing piece we found was that offload changes are applied only during connection initiation, so we restarted Kafka, and we immediately saw a performance improvement (green line):
Not enabling offload features when possible seems like a pretty bad regression, so we filed a ticket for systemd:
In the meantime, we work around the upstream issue by automatically enabling offload features on boot if they are disabled on VLAN interfaces.
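Our production workaround isn't part of this truncated post, but a minimal sketch of the idea looks like the following. The VLAN detection via DEVTYPE and the feature list are assumptions rather than our exact script, and as noted above, long-lived connections keep their cached capabilities, so affected services still need a restart to pick the change up:
#!/bin/bash
# Sketch: re-enable offload features on VLAN interfaces that came up with them disabled.
for dev in /sys/class/net/*; do
    iface=$(basename "$dev")
    # 802.1Q VLAN devices report DEVTYPE=vlan in their uevent file.
    grep -qx 'DEVTYPE=vlan' "$dev/uevent" 2>/dev/null || continue
    # Only touch interfaces where scatter-gather ended up disabled.
    if ethtool -k "$iface" | grep -q '^scatter-gather: off'; then
        ethtool -K "$iface" sg on tx on tso on gso on
    fi
done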
With the fix in place, we rebooted our logs Kafka cluster to upgrade to the latest kernel, and our 5 day CPU usage history yielded positive results:
The DNS cluster also yielded positive results, with just 2 nodes rebooted (purple line going down):
Conclusion
It was an error on our part not to have a good regression framework in place to catch this kind of performance degradation. Luckily, due to our heavy use of versioning, we managed to bisect the issue rather quickly, and had a temporary rollback in place while root-causing the problem.
In the end, enabling offload also removed the RCU stalls. It's not really clear whether the missing offload was the cause or just a catalyst, but the end result speaks for itself.
On the bright side, we dug pretty deep into Linux kernel internals, and although there were fleeting moments of wanting to give up and move to the woods to become a park ranger, we persevered and came out of the forest successful.
If deep diving from high level symptoms to kernel/OS issues makes you excited, drop us a line.