Description
When migrating containerd from v2.0.0-rc.4 to v2.0.0-rc.5 on the nodes of a large Kubernetes cluster, many fatal errors like the following started to appear, crashing containerd:
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: fatal error: concurrent map writes
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: goroutine 164 [running]:
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: github.com/containerd/ttrpc.MD.Set(...)
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/metadata.go:48
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: github.com/containerd/otelttrpc.(*metadataSupplier).Set(0x0?, {0x5c66b02ac976, 0xb}, {0xc00087e640, 0x37})
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: /go/src/github.com/containerd/containerd/vendor/github.com/containerd/otelttrpc/metadata_supplier.go:58 +0x91
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: go.opentelemetry.io/otel/propagation.TraceContext.Inject({}, {0x5c66b0a036e8?, 0xc000a08a20?}, {0x5c66b09fb520, 0xc0008be0e0})
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: /go/src/github.com/containerd/containerd/vendor/go.opentelemetry.io/otel/propagation/trace_context.go:64 +0x747
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: go.opentelemetry.io/otel/propagation.compositeTextMapPropagator.Inject(...)
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: /go/src/github.com/containerd/containerd/vendor/go.opentelemetry.io/otel/propagation/propagation.go:106
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: go.opentelemetry.io/otel/internal/global.(*textMapPropagator).Inject(0xc0000ba460?, {0x5c66b0a036e8, 0xc000a08a20}, {0x5c66b09fb520, 0xc0008be0e0})
The concurrent map operations always originate in the containerd/ttrpc dependency and occur in every possible combination: read/write, write/write, and iteration/write.
After bisecting v2.0.0-rc.4...v2.0.0-rc.5, here is the culprit PR: #10186
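For context, ttrpc.MD is a plain Go map with no synchronization of its own. Paraphrasing the shape of its Set from ttrpc's metadata.go (not verbatim, and it may differ slightly across versions):

```go
package ttrpcsketch

import "strings"

// Paraphrased shape of ttrpc.MD from metadata.go (not verbatim): an
// unguarded map whose Set is a plain map write, so two concurrent
// callers trip Go's "concurrent map writes" runtime check.
type MD map[string][]string

func (m MD) Set(key string, values ...string) {
	m[strings.ToLower(key)] = values
}
```

The traces above show otelttrpc's metadataSupplier writing trace-context headers into exactly this map, so any two goroutines sharing one MD instance can bring down the whole process.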
Steps to reproduce the issue
I could not find a simple reproducer for this issue, since it comes from concurrent operations that happen to run at the same time. On my side, deploying v2.0.0-rc.5 on a ~30-node cluster with ~10 pods per node consistently triggers it (though this likely depends on many other factors): at any given time, around 2 nodes are failing because of these containerd crashes.
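That said, the underlying pattern is easy to provoke in isolation. Here is a minimal synthetic sketch (my own construction, not the actual containerd code path) that shares one ttrpc.MD across goroutines the way the traces above show the otel interceptor doing; run it a few times and it should die with the same fatal error:

```go
package main

import (
	"fmt"
	"sync"

	"github.com/containerd/ttrpc"
)

func main() {
	md := ttrpc.MD{} // one metadata map shared by every "call"

	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			// Mimics otelttrpc's metadataSupplier.Set injecting trace
			// context: an unsynchronized write into the shared map.
			md.Set("traceparent", fmt.Sprintf("00-sketch-%02d", n))
		}(i)
	}
	wg.Wait()
}
```

Running it under `go run -race` should also flag the data race even on runs where the runtime's map-write detector doesn't happen to fire.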
Describe the results you received and expected
In addition to the trace above, here are two more examples:
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: fatal error: concurrent map iteration and map write
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: goroutine 20431 [running]:
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: github.com/containerd/ttrpc.MD.setRequest(...)
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/metadata.go:66
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: github.com/containerd/ttrpc.(*Client).Call(0xc000619290, {0x62c01be746e8, 0xc0013eee70}, {0x62c01b7364d4, 0x17}, {0x62c01b711e19, 0x4}, {0x62c01bcfa7a0?, 0xc0010c61e0?}, {0x62c01bcfa860, ...})
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:163 +0x1a5
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: github.com/containerd/containerd/api/runtime/task/v3.(*ttrpctaskClient).Wait(0xc000122ac0, {0x62c01be746e8, 0xc0013eee70}, 0xc0010c61e0)
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: /go/src/github.com/containerd/containerd/vendor/github.com/containerd/containerd/api/runtime/task/v3/shim_ttrpc.pb.go:273 +0x92
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: github.com/containerd/containerd/v2/core/runtime/v2.(*process).Wait(0xc0014c6e88, {0x62c01be746e8, 0xc0013eee70})
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: /go/src/github.com/containerd/containerd/core/runtime/v2/process.go:133 +0xbf
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: github.com/containerd/containerd/v2/plugins/services/tasks.(*local).Wait(0x62c01be74720?, {0x62c01be746e8, 0xc0013eee70}, 0xc000fa54a0, {0xc00132dfb0?, 0x2?, 0x62c01cc60250?})
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: /go/src/github.com/containerd/containerd/plugins/services/tasks/local.go:633 +0xde
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: github.com/containerd/containerd/v2/client.(*process).Wait.func1()
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: /go/src/github.com/containerd/containerd/client/process.go:175 +0x27d
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: created by github.com/containerd/containerd/v2/client.(*process).Wait in goroutine 20426
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: /go/src/github.com/containerd/containerd/client/process.go:168 +0xa5
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: goroutine 1 [chan receive, 17 minutes]:
Dec 10 13:30:57 <HOSTNAME> containerd[837]: fatal error: concurrent map iteration and map write
Dec 10 13:30:57 <HOSTNAME> containerd[837]: goroutine 28835 [running]:
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/otelttrpc.inject({0x55d4b49606e8, 0xc0013bc180}, {0x55d4b4958100, 0xc0001b5ad0}, 0xc000f14100)
Dec 10 13:30:57 <HOSTNAME> containerd[837]: /go/src/github.com/containerd/containerd/vendor/github.com/containerd/otelttrpc/metadata_supplier.go:86 +0x2e5
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/otelttrpc.UnaryClientInterceptor.func1({0x55d4b49606e8, 0xc000e86300}, 0xc000f14100, 0xc00011c640, 0x55d4b47e39e0?, 0xc00187c020)
Dec 10 13:30:57 <HOSTNAME> containerd[837]: /go/src/github.com/containerd/containerd/vendor/github.com/containerd/otelttrpc/interceptor.go:98 +0x285
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/ttrpc.(*Client).Call(0xc0008f2000, {0x55d4b49606e8, 0xc000e86300}, {0x55d4b42224d4, 0x17}, {0x55d4b41fdedd, 0x4}, {0x55d4b482e5c0?, 0xc00011c5f0?}, {0x55d4b4775020, ...})
Dec 10 13:30:57 <HOSTNAME> containerd[837]: /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:173 +0x323
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/containerd/api/runtime/task/v3.(*ttrpctaskClient).Kill(0xc000ea4030, {0x55d4b49606e8, 0xc000e86300}, 0xc00011c5f0)
Dec 10 13:30:57 <HOSTNAME> containerd[837]: /go/src/github.com/containerd/containerd/vendor/github.com/containerd/containerd/api/runtime/task/v3/shim_ttrpc.pb.go:233 +0x92
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/containerd/v2/core/runtime/v2.(*process).Kill(0xc001986210, {0x55d4b49606e8, 0xc000e86300}, 0x9, 0xc0?)
Dec 10 13:30:57 <HOSTNAME> containerd[837]: /go/src/github.com/containerd/containerd/core/runtime/v2/process.go:41 +0xc7
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/containerd/v2/plugins/services/tasks.(*local).Kill(0x55d4b4960720?, {0x55d4b49606e8, 0xc000e86300}, 0xc00191a1e0, {0xc000f70828?, 0x3?, 0x500708?})
Dec 10 13:30:57 <HOSTNAME> containerd[837]: /go/src/github.com/containerd/containerd/plugins/services/tasks/local.go:443 +0xbf
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/containerd/v2/client.(*process).Kill(0xc001838bd0, {0x55d4b4960720, 0xc00191a000}, 0x9, {0xc000ea4028, 0x1, 0x0?})
Dec 10 13:30:57 <HOSTNAME> containerd[837]: /go/src/github.com/containerd/containerd/client/process.go:157 +0x394
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/containerd/v2/client.WithProcessKill({0x55d4b49606e8, 0xc000e86240}, {0x55d4b496b950, 0xc001838bd0})
Dec 10 13:30:57 <HOSTNAME> containerd[837]: /go/src/github.com/containerd/containerd/client/task_opts.go:168 +0x10e
What version of containerd are you using?
containerd v2.0.0-rc.5 (the issue also occurs on v2.0.0)
Any other relevant information
As far as I understand the issue, it may be a good idea to make the ttrpc.MD object safer: prevent importing packages from accessing its underlying map directly, and guard all access with an RW mutex.
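Concretely, the shape I have in mind is something like the sketch below (a rough illustration of the approach, not the exact code in the PR linked underneath):

```go
package ttrpc

import (
	"strings"
	"sync"
)

// MD wraps the metadata map in a struct so importing packages can no
// longer index or range over it directly; all access goes through the
// RW mutex.
type MD struct {
	mu sync.RWMutex
	m  map[string][]string
}

func (md *MD) Get(key string) ([]string, bool) {
	md.mu.RLock()
	defer md.mu.RUnlock()
	v, ok := md.m[strings.ToLower(key)]
	return v, ok
}

func (md *MD) Set(key string, values ...string) {
	md.mu.Lock()
	defer md.mu.Unlock()
	if md.m == nil {
		md.m = make(map[string][]string)
	}
	md.m[strings.ToLower(key)] = values
}
```

The trade-off is that turning MD from a map type into a struct is API-breaking for importers that touch the map directly, but that direct access is exactly what needs to go away.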
Here is a fix I made on my side that resolves the issue: containerd/ttrpc#176
Feel free to let me know what you think.
Show configuration if it is related to CRI plugin.
No response