
Fatal errors: TTRPC metadata concurrent map operations #11138

@just1not2

Description

When migrating containerd from v2.0.0-rc.4 to v2.0.0-rc.5 on the nodes of a large Kubernetes cluster, many fatal errors like the following started appearing and crashing containerd:

Dec 10 15:24:57 <HOSTNAME> containerd[32960]: fatal error: concurrent map writes
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: goroutine 164 [running]:
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: github.com/containerd/ttrpc.MD.Set(...)
Dec 10 15:24:57 <HOSTNAME> containerd[32960]:         /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/metadata.go:48
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: github.com/containerd/otelttrpc.(*metadataSupplier).Set(0x0?, {0x5c66b02ac976, 0xb}, {0xc00087e640, 0x37})
Dec 10 15:24:57 <HOSTNAME> containerd[32960]:         /go/src/github.com/containerd/containerd/vendor/github.com/containerd/otelttrpc/metadata_supplier.go:58 +0x91
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: go.opentelemetry.io/otel/propagation.TraceContext.Inject({}, {0x5c66b0a036e8?, 0xc000a08a20?}, {0x5c66b09fb520, 0xc0008be0e0})
Dec 10 15:24:57 <HOSTNAME> containerd[32960]:         /go/src/github.com/containerd/containerd/vendor/go.opentelemetry.io/otel/propagation/trace_context.go:64 +0x747
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: go.opentelemetry.io/otel/propagation.compositeTextMapPropagator.Inject(...)
Dec 10 15:24:57 <HOSTNAME> containerd[32960]:         /go/src/github.com/containerd/containerd/vendor/go.opentelemetry.io/otel/propagation/propagation.go:106
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: go.opentelemetry.io/otel/internal/global.(*textMapPropagator).Inject(0xc0000ba460?, {0x5c66b0a036e8, 0xc000a08a20}, {0x5c66b09fb520, 0xc0008be0e0})

The concurrent map operations always originate in the containerd/ttrpc dependency, and every unsafe combination occurs: read/write, write/write, and iteration/write.

Bisecting v2.0.0-rc.4...v2.0.0-rc.5 points to this PR as the culprit: #10186

Steps to reproduce the issue

I could not find a simple reproducer, since the issue depends on concurrent operations happening to collide. On my side, deploying v2.0.0-rc.5 on a ~30-node cluster with ~10 pods per node consistently triggers the issue (though this likely depends on many other factors): at any given time, around 2 nodes are failing due to these containerd crashes.

Describe the results you received and expected

Here are a few examples:

Dec 10 15:24:57 <HOSTNAME> containerd[32960]: fatal error: concurrent map writes
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: goroutine 164 [running]:
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: github.com/containerd/ttrpc.MD.Set(...)
Dec 10 15:24:57 <HOSTNAME> containerd[32960]:         /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/metadata.go:48
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: github.com/containerd/otelttrpc.(*metadataSupplier).Set(0x0?, {0x5c66b02ac976, 0xb}, {0xc00087e640, 0x37})
Dec 10 15:24:57 <HOSTNAME> containerd[32960]:         /go/src/github.com/containerd/containerd/vendor/github.com/containerd/otelttrpc/metadata_supplier.go:58 +0x91
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: go.opentelemetry.io/otel/propagation.TraceContext.Inject({}, {0x5c66b0a036e8?, 0xc000a08a20?}, {0x5c66b09fb520, 0xc0008be0e0})
Dec 10 15:24:57 <HOSTNAME> containerd[32960]:         /go/src/github.com/containerd/containerd/vendor/go.opentelemetry.io/otel/propagation/trace_context.go:64 +0x747
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: go.opentelemetry.io/otel/propagation.compositeTextMapPropagator.Inject(...)
Dec 10 15:24:57 <HOSTNAME> containerd[32960]:         /go/src/github.com/containerd/containerd/vendor/go.opentelemetry.io/otel/propagation/propagation.go:106
Dec 10 15:24:57 <HOSTNAME> containerd[32960]: go.opentelemetry.io/otel/internal/global.(*textMapPropagator).Inject(0xc0000ba460?, {0x5c66b0a036e8, 0xc000a08a20}, {0x5c66b09fb520, 0xc0008be0e0})

Dec 10 15:24:52 <HOSTNAME> containerd[17813]: fatal error: concurrent map iteration and map write
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: goroutine 20431 [running]:
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: github.com/containerd/ttrpc.MD.setRequest(...)
Dec 10 15:24:52 <HOSTNAME> containerd[17813]:         /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/metadata.go:66
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: github.com/containerd/ttrpc.(*Client).Call(0xc000619290, {0x62c01be746e8, 0xc0013eee70}, {0x62c01b7364d4, 0x17}, {0x62c01b711e19, 0x4}, {0x62c01bcfa7a0?, 0xc0010c61e0?}, {0x62c01bcfa860, ...})
Dec 10 15:24:52 <HOSTNAME> containerd[17813]:         /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:163 +0x1a5
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: github.com/containerd/containerd/api/runtime/task/v3.(*ttrpctaskClient).Wait(0xc000122ac0, {0x62c01be746e8, 0xc0013eee70}, 0xc0010c61e0)
Dec 10 15:24:52 <HOSTNAME> containerd[17813]:         /go/src/github.com/containerd/containerd/vendor/github.com/containerd/containerd/api/runtime/task/v3/shim_ttrpc.pb.go:273 +0x92
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: github.com/containerd/containerd/v2/core/runtime/v2.(*process).Wait(0xc0014c6e88, {0x62c01be746e8, 0xc0013eee70})
Dec 10 15:24:52 <HOSTNAME> containerd[17813]:         /go/src/github.com/containerd/containerd/core/runtime/v2/process.go:133 +0xbf
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: github.com/containerd/containerd/v2/plugins/services/tasks.(*local).Wait(0x62c01be74720?, {0x62c01be746e8, 0xc0013eee70}, 0xc000fa54a0, {0xc00132dfb0?, 0x2?, 0x62c01cc60250?})
Dec 10 15:24:52 <HOSTNAME> containerd[17813]:         /go/src/github.com/containerd/containerd/plugins/services/tasks/local.go:633 +0xde
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: github.com/containerd/containerd/v2/client.(*process).Wait.func1()
Dec 10 15:24:52 <HOSTNAME> containerd[17813]:         /go/src/github.com/containerd/containerd/client/process.go:175 +0x27d
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: created by github.com/containerd/containerd/v2/client.(*process).Wait in goroutine 20426
Dec 10 15:24:52 <HOSTNAME> containerd[17813]:         /go/src/github.com/containerd/containerd/client/process.go:168 +0xa5
Dec 10 15:24:52 <HOSTNAME> containerd[17813]: goroutine 1 [chan receive, 17 minutes]:

Dec 10 13:30:57 <HOSTNAME> containerd[837]: fatal error: concurrent map iteration and map write
Dec 10 13:30:57 <HOSTNAME> containerd[837]: goroutine 28835 [running]:
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/otelttrpc.inject({0x55d4b49606e8, 0xc0013bc180}, {0x55d4b4958100, 0xc0001b5ad0}, 0xc000f14100)
Dec 10 13:30:57 <HOSTNAME> containerd[837]:         /go/src/github.com/containerd/containerd/vendor/github.com/containerd/otelttrpc/metadata_supplier.go:86 +0x2e5
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/otelttrpc.UnaryClientInterceptor.func1({0x55d4b49606e8, 0xc000e86300}, 0xc000f14100, 0xc00011c640, 0x55d4b47e39e0?, 0xc00187c020)
Dec 10 13:30:57 <HOSTNAME> containerd[837]:         /go/src/github.com/containerd/containerd/vendor/github.com/containerd/otelttrpc/interceptor.go:98 +0x285
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/ttrpc.(*Client).Call(0xc0008f2000, {0x55d4b49606e8, 0xc000e86300}, {0x55d4b42224d4, 0x17}, {0x55d4b41fdedd, 0x4}, {0x55d4b482e5c0?, 0xc00011c5f0?}, {0x55d4b4775020, ...})
Dec 10 13:30:57 <HOSTNAME> containerd[837]:         /go/src/github.com/containerd/containerd/vendor/github.com/containerd/ttrpc/client.go:173 +0x323
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/containerd/api/runtime/task/v3.(*ttrpctaskClient).Kill(0xc000ea4030, {0x55d4b49606e8, 0xc000e86300}, 0xc00011c5f0)
Dec 10 13:30:57 <HOSTNAME> containerd[837]:         /go/src/github.com/containerd/containerd/vendor/github.com/containerd/containerd/api/runtime/task/v3/shim_ttrpc.pb.go:233 +0x92
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/containerd/v2/core/runtime/v2.(*process).Kill(0xc001986210, {0x55d4b49606e8, 0xc000e86300}, 0x9, 0xc0?)
Dec 10 13:30:57 <HOSTNAME> containerd[837]:         /go/src/github.com/containerd/containerd/core/runtime/v2/process.go:41 +0xc7
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/containerd/v2/plugins/services/tasks.(*local).Kill(0x55d4b4960720?, {0x55d4b49606e8, 0xc000e86300}, 0xc00191a1e0, {0xc000f70828?, 0x3?, 0x500708?})
Dec 10 13:30:57 <HOSTNAME> containerd[837]:         /go/src/github.com/containerd/containerd/plugins/services/tasks/local.go:443 +0xbf
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/containerd/v2/client.(*process).Kill(0xc001838bd0, {0x55d4b4960720, 0xc00191a000}, 0x9, {0xc000ea4028, 0x1, 0x0?})
Dec 10 13:30:57 <HOSTNAME> containerd[837]:         /go/src/github.com/containerd/containerd/client/process.go:157 +0x394
Dec 10 13:30:57 <HOSTNAME> containerd[837]: github.com/containerd/containerd/v2/client.WithProcessKill({0x55d4b49606e8, 0xc000e86240}, {0x55d4b496b950, 0xc001838bd0})
Dec 10 13:30:57 <HOSTNAME> containerd[837]:         /go/src/github.com/containerd/containerd/client/task_opts.go:168 +0x10e

What version of containerd are you using?

containerd v2.0.0-rc.5 (the issue also occurs on v2.0.0)

Any other relevant information

As far as I understand the issue, it may be a good idea to make the ttrpc.MD object safer by preventing importing packages from accessing its contents directly and by guarding access with an RW mutex.
Here is a fix I made on my side that resolves the issue: containerd/ttrpc#176
Feel free to let me know what you think.

Show configuration if it is related to CRI plugin.

No response
