Add syscall sandboxing using seccomp-bpf (Linux secure computing mode)#20487
Add syscall sandboxing using seccomp-bpf (Linux secure computing mode)#20487laanwj merged 1 commit intobitcoin:masterfrom
Conversation
|
Concept ACK, nice work! |
fb9e303 to
2cba28c
Compare
|
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers. ConflictsReviewers, this pull request conflicts with the following ones:
If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first. |
|
Very cool, Concept ACK |
2cba28c to
54df39b
Compare
src/util/syscall_sandbox.cpp
Outdated
There was a problem hiding this comment.
I'm not entirely convinced it's a good idea to maintain a full list of Linux syscalls here. This definitely doesn't scale to multiple architectures. Yes, it's nice to able to print the name instead of the number in case of failures (that's all it's used for, right?), but this should be a rare enough event and we can look up the syscall number ourselves if a user reports something.
There was a problem hiding this comment.
Yes, it is only used for friendly printing of syscall names.
It might seem like a minor thing, but when developing this feature that translation (from syscall number to syscall name) has been extremely helpful. FWIW I tried developing this feature without it first and that was... very cumbersome! :)
I've now added a comment clarifying how LINUX_SYSCALLS (previously SYSCALLS_LINUX_X86_64) is used:
// This list of syscalls in LINUX_SYSCALLS is only used to map syscall numbers to syscall names in
// order to be able to print user friendly error messages which include the syscall name in addition
// to the syscall number.
//
// Example output in case of a syscall violation where the syscall is present in LINUX_SYSCALLS:
//
// ```
// 2021-06-09T12:34:56Z ERROR: The syscall "execve" (syscall number 59) is not allowed by the syscall sandbox in thread "msghand". Please report.
// ```
//
// Example output in case of a syscall violation where the syscall is not present in LINUX_SYSCALLS:
//
// ```
// 2021-06-09T12:34:56Z ERROR: The syscall "*unknown*" (syscall number 314) is not allowed by the syscall sandbox in thread "msghand". Please report.
// ``
//
// LINUX_SYSCALLS contains two types of syscalls:
// 1.) Syscalls that are present under all architectures or relevant Linux kernel versions for which
// we support the syscall sandbox feature (currently only Linux x86-64). Examples include read,
// write, open, close, etc.
// 2.) Syscalls that are present under a subset of architectures or relevant Linux kernel versions
// for which we support the syscall sandbox feature. This type of syscalls should be added to
// LINUX_SYSCALLS conditional on availability like in the following example:
// ...
// #if defined(__NR_arch_dependent_syscall)
// {__NR_arch_dependent_syscall, "arch_dependent_syscall"},
// #endif // defined(__NR_arch_dependent_syscall)
// ...
Note that if LINUX_SYSCALLS is incomplete for some reason (new architecture or new syscall), then it will fallback on the behaviour you're suggesting (just printing the syscall number).
I suggest we keep it as-is for now and revisit that decision when/if syscall sandbox support for the next architecture (non-x86-64) is added. Could that work for now? :)
There was a problem hiding this comment.
OK, I get your point better now, it's not needed to replicate this entire structure for every architecture, because it doesn't have the raw numbers, it uses the __NR_ constants.
I agree it can be useful, it too bad querying a syscall name isn't a standard libc call, or outside that, a standard linux command that the error message can be run through. To add insult to injury, the syscall numbers are in a different file on every architecture, it's not like you can just grep /usr/include/unistd.h and be done with it.
The thing is that even now, new syscalls are often added to the Linux kernel. It seems really annoying to track this. But maybe it's not needed? How often do we use a new syscall.
There was a problem hiding this comment.
or outside that, a standard linux command
Is there a standard linux command for a popular operating system? Maybe it would be enough to say: To get the syscall name, use grep ... on the latest Ubuntu/Debian?
There was a problem hiding this comment.
It looks like gdb is enough to print the name of the violating syscall. See for example #23248 (comment), which prints mincore.
|
Code review and lightly tested ACK. Sorry for only getting around to this now. |
|
Code review and lightly tested ACK 4747da3 |
| const sock_fprog prog = { | ||
| .len = static_cast<uint16_t>(filter.size()), | ||
| .filter = filter.data(), | ||
| }; |
There was a problem hiding this comment.
https://en.cppreference.com/w/cpp/language/aggregate_initialization
Neither I, nor CI could find a compiler that rejects this C++20 code, so does that mean we are allowed to use it in new code now?
There was a problem hiding this comment.
Looks like this is supported since clang 3.2 and gcc 4.7, but maybe not msvc?
See #23183
msvc: error C7555: use of designated initializers requires at least '/std:c++20'
There was a problem hiding this comment.
FYI adding -Wpedantic turns this into a warning:
$ g++ -Wall -std=c++17 -pedantic pedantic.cpp -c -o pedantic.o
pedantic.cpp: In function ‘int main()’:
pedantic.cpp:8:13: warning: C++ designated initializers only available with ‘-std=c++2a’ or ‘-std=gnu++2a’ [-Wpedantic]
8 | foo bar{.a = 1};
I actually thought we had -Wpedantic on by default, I'm not sure why we don't. @fanquake @dongcarl is there an obvious reason I'm forgetting?
Edit: Clang also has -Wc++20-designator.
Edit2: pushed a quick branch to demonstrate the changes needed to build bitcoind (only, everything else untested) with -Wpedantic, which is meant to warn about the use of compiler exensions: https://github.com/theuni/bitcoin/tree/build-pedantic
|
Just curious, but merging this seems to have caused a few build problems that required followups: #23178 #23179 #23196, and I guess some test problems in local developer checkouts. I'm wondering if there are any CI improvements that might have caught these problems. Are too many of the CI builds using old kernels, and could adding CI builds with newer kernels catch these problems? Are too many of the CI builds using depends and could adding CI builds with system dependencies catch these problems? Is there some extended draftbot build that might have caught these problems? It seems like it could make sense to run some extended CI on PRs that modify |
For the record, it's not the running kernel version that makes a difference here. But the kernel headers. What would have helped find the problem is compiling with older kernel headers. We also should have done a GUIX build. It's the first time these matter at all for our compilation. Besides I think it's a good lesson that we need to make sure to define everything we need from the kernel ourselves, to support compilation on a wide range of installations. I knew about this problem but underestimated how much it's still a problem nowadays. |
|
I just don't understand the basics here. What CI build could you theoretically add to catch these problems? Would it be a CI build with a really old kernel and really old headers? A really new kernel and new headers? An old kernel with new headers? And new kernel with old headers? |
Generally developers tend to run the latest software, so this is usually not something that will catch a lot of fish as a CI task. In fact, CI tasks with bleeding edge software might tend to break more often. (I do run them, but not part of this project).
Simply adding the "DrahtBot Guix build requested" label would have caught this. I think any pull request that is tagged with "Build system" should be Guix-built as well. As DrahtBot is running on limited resources, it might take a day or two to create the guix build, but I think this is an acceptable delay. If there is an "emergency-fix" devs can always do a local guix build. |
|
And when it comes to preventing build errors via the CI, I think this is not something we should try to achieve at all costs. Due to compiler bugs (or just software bugs in general), there will always be at least one software configuration and build flag configuration that will fail to build. I think it is reasonable to have tests for common software configurations on the latest LTS releases of operating systems. Though, when it comes to supporting any software configuration, waiting for an actual user to report the issue instead of integrating into the CI works probably better. |
|
Thanks for the replies. So is the answer to my question of what CI build would catch all these problems (#23178 #23179 #23196) that any CI build using newer kernel headers would catch all these problems? That all CI builds are using older headers and practicalswift was also using older headers, and this slipped past review? My motivation for asking is just to understand when I am reviewing a PR what class of errors I could expect CI to catch, and what class of errors I should be testing manually or asking the PR author about. In this case, I guess I failed to ask practicalswift how is this code affected when new syscalls are added? And in the future, when I'm reviewing a PR that makes significant changes to the build system, I can request GUIX builds? I don't really have opinions on what CI should check for. I just want to understand what CI is checking for so I can review PRs better. For example, it is useful to know if CI is only checking old kernels and we are depending on developers to manually test newer kernels. It is also useful to know more generally if CI builds tend to use older or different dependencies than actual the GUIX release. |
Sure, just add the label.
CI is a mix of cross-compile depends builds that try to mimic guix (older LTS releases with older gcc-8) and recent |
| versions = [None] * num_nodes | ||
| if self.is_syscall_sandbox_compiled() and not self.disable_syscall_sandbox: | ||
| for i in range(len(extra_args)): | ||
| if versions[i] is None or versions[i] >= 219900: |
There was a problem hiding this comment.
This is actually not in the v22.0 binary and also not backported (yet). I made a commit to bump the version: 4672c1f
Add experimental syscall sandboxing using seccomp-bpf (Linux secure computing mode).
Enable filtering of system calls using seccomp-bpf: allow only explicitly allowlisted (expected) syscalls to be called.
The syscall sandboxing implemented in this PR is an experimental feature currently available only under Linux x86-64.
To enable the experimental syscall sandbox the
-sandbox=<mode>option must be passed tobitcoind:The allowed syscalls are defined on a per thread basis.
I've used this feature since summer 2020 and I find it to be a helpful testing/debugging addition which makes it much easier to reason about the actual capabilities required of each type of thread in Bitcoin Core.
Quick start guide:
About seccomp and seccomp-bpf: