Skip to content

Add syscall sandboxing using seccomp-bpf (Linux secure computing mode)#20487

Merged
laanwj merged 1 commit intobitcoin:masterfrom
practicalswift:seccomp-bpf-2020-11
Oct 4, 2021
Merged

Add syscall sandboxing using seccomp-bpf (Linux secure computing mode)#20487
laanwj merged 1 commit intobitcoin:masterfrom
practicalswift:seccomp-bpf-2020-11

Conversation

@practicalswift
Copy link
Contributor

@practicalswift practicalswift commented Nov 25, 2020

Add experimental syscall sandboxing using seccomp-bpf (Linux secure computing mode).

Enable filtering of system calls using seccomp-bpf: allow only explicitly allowlisted (expected) syscalls to be called.

The syscall sandboxing implemented in this PR is an experimental feature currently available only under Linux x86-64.

To enable the experimental syscall sandbox the -sandbox=<mode> option must be passed to bitcoind:

  -sandbox=<mode>
       Use the experimental syscall sandbox in the specified mode
       (-sandbox=log-and-abort or -sandbox=abort). Allow only expected
       syscalls to be used by bitcoind. Note that this is an
       experimental new feature that may cause bitcoind to exit or crash
       unexpectedly: use with caution. In the "log-and-abort" mode the
       invocation of an unexpected syscall results in a debug handler
       being invoked which will log the incident and terminate the
       program (without executing the unexpected syscall). In the
       "abort" mode the invocation of an unexpected syscall results in
       the entire process being killed immediately by the kernel without
       executing the unexpected syscall.

The allowed syscalls are defined on a per thread basis.

I've used this feature since summer 2020 and I find it to be a helpful testing/debugging addition which makes it much easier to reason about the actual capabilities required of each type of thread in Bitcoin Core.


Quick start guide:

$ ./configure
$ src/bitcoind -regtest -debug=util -sandbox=log-and-abort
…
2021-06-09T12:34:56Z Experimental syscall sandbox enabled (-sandbox=log-and-abort): bitcoind will terminate if an unexpected (not allowlisted) syscall is invoked.
…
2021-06-09T12:34:56Z Syscall filter installed for thread "addcon"
2021-06-09T12:34:56Z Syscall filter installed for thread "dnsseed"
2021-06-09T12:34:56Z Syscall filter installed for thread "net"
2021-06-09T12:34:56Z Syscall filter installed for thread "msghand"
2021-06-09T12:34:56Z Syscall filter installed for thread "opencon"
2021-06-09T12:34:56Z Syscall filter installed for thread "init"
…
# A simulated execve call to show the sandbox in action:
2021-06-09T12:34:56Z ERROR: The syscall "execve" (syscall number 59) is not allowed by the syscall sandbox in thread "msghand". Please report.
…
Aborted (core dumped)
$

About seccomp and seccomp-bpf:

In computer security, seccomp (short for secure computing mode) is a facility in the Linux kernel. seccomp allows a process to make a one-way transition into a "secure" state where it cannot make any system calls except exit(), sigreturn(), and read() and write() to already-open file descriptors. Should it attempt any other system calls, the kernel will terminate the process with SIGKILL or SIGSYS. In this sense, it does not virtualize the system's resources but isolates the process from them entirely.

[…]

seccomp-bpf is an extension to seccomp that allows filtering of system calls using a configurable policy implemented using Berkeley Packet Filter rules. It is used by OpenSSH and vsftpd as well as the Google Chrome/Chromium web browsers on Chrome OS and Linux. (In this regard seccomp-bpf achieves similar functionality, but with more flexibility and higher performance, to the older systrace—which seems to be no longer supported for Linux.)

@laanwj
Copy link
Member

laanwj commented Nov 25, 2020

Concept ACK, nice work!

@DrahtBot
Copy link
Contributor

DrahtBot commented Nov 25, 2020

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #22956 (validation: log CChainState::CheckBlockIndex() consistency checks by jonatack)
  • #21943 (Dedup and RAII-fy the creation of a copy of CConnman::vNodes by vasild)
  • #21878 (Make all networking code mockable by vasild)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@jb55
Copy link
Contributor

jb55 commented Nov 25, 2020

Very cool, Concept ACK

Copy link
Member

@laanwj laanwj Sep 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely convinced it's a good idea to maintain a full list of Linux syscalls here. This definitely doesn't scale to multiple architectures. Yes, it's nice to able to print the name instead of the number in case of failures (that's all it's used for, right?), but this should be a rare enough event and we can look up the syscall number ourselves if a user reports something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is only used for friendly printing of syscall names.

It might seem like a minor thing, but when developing this feature that translation (from syscall number to syscall name) has been extremely helpful. FWIW I tried developing this feature without it first and that was... very cumbersome! :)

I've now added a comment clarifying how LINUX_SYSCALLS (previously SYSCALLS_LINUX_X86_64) is used:

// This list of syscalls in LINUX_SYSCALLS is only used to map syscall numbers to syscall names in
// order to be able to print user friendly error messages which include the syscall name in addition
// to the syscall number.
//
// Example output in case of a syscall violation where the syscall is present in LINUX_SYSCALLS:
//
// ```
// 2021-06-09T12:34:56Z ERROR: The syscall "execve" (syscall number 59) is not allowed by the syscall sandbox in thread "msghand". Please report.
// ```
//
// Example output in case of a syscall violation where the syscall is not present in LINUX_SYSCALLS:
//
// ```
// 2021-06-09T12:34:56Z ERROR: The syscall "*unknown*" (syscall number 314) is not allowed by the syscall sandbox in thread "msghand". Please report.
// ``
//
// LINUX_SYSCALLS contains two types of syscalls:
// 1.) Syscalls that are present under all architectures or relevant Linux kernel versions for which
//     we support the syscall sandbox feature (currently only Linux x86-64). Examples include read,
//     write, open, close, etc.
// 2.) Syscalls that are present under a subset of architectures or relevant Linux kernel versions
//     for which we support the syscall sandbox feature. This type of syscalls should be added to
//     LINUX_SYSCALLS conditional on availability like in the following example:
//         ...
//         #if defined(__NR_arch_dependent_syscall)
//             {__NR_arch_dependent_syscall, "arch_dependent_syscall"},
//         #endif // defined(__NR_arch_dependent_syscall)
//         ...

Note that if LINUX_SYSCALLS is incomplete for some reason (new architecture or new syscall), then it will fallback on the behaviour you're suggesting (just printing the syscall number).

I suggest we keep it as-is for now and revisit that decision when/if syscall sandbox support for the next architecture (non-x86-64) is added. Could that work for now? :)

Copy link
Member

@laanwj laanwj Oct 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I get your point better now, it's not needed to replicate this entire structure for every architecture, because it doesn't have the raw numbers, it uses the __NR_ constants.

I agree it can be useful, it too bad querying a syscall name isn't a standard libc call, or outside that, a standard linux command that the error message can be run through. To add insult to injury, the syscall numbers are in a different file on every architecture, it's not like you can just grep /usr/include/unistd.h and be done with it.

The thing is that even now, new syscalls are often added to the Linux kernel. It seems really annoying to track this. But maybe it's not needed? How often do we use a new syscall.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or outside that, a standard linux command

Is there a standard linux command for a popular operating system? Maybe it would be enough to say: To get the syscall name, use grep ... on the latest Ubuntu/Debian?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like gdb is enough to print the name of the violating syscall. See for example #23248 (comment), which prints mincore.

@laanwj
Copy link
Member

laanwj commented Sep 20, 2021

Code review and lightly tested ACK. Sorry for only getting around to this now.
It looks good to me. Found no serious issues but left some nits.

@laanwj
Copy link
Member

laanwj commented Oct 4, 2021

Code review and lightly tested ACK 4747da3

const sock_fprog prog = {
.len = static_cast<uint16_t>(filter.size()),
.filter = filter.data(),
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://en.cppreference.com/w/cpp/language/aggregate_initialization

Neither I, nor CI could find a compiler that rejects this C++20 code, so does that mean we are allowed to use it in new code now?

Copy link
Member

@maflcko maflcko Oct 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is supported since clang 3.2 and gcc 4.7, but maybe not msvc?

See #23183

msvc: error C7555: use of designated initializers requires at least '/std:c++20'

Copy link
Member

@theuni theuni Oct 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI adding -Wpedantic turns this into a warning:
$ g++ -Wall -std=c++17 -pedantic pedantic.cpp -c -o pedantic.o

pedantic.cpp: In function ‘int main()’:
pedantic.cpp:8:13: warning: C++ designated initializers only available with ‘-std=c++2a’ or ‘-std=gnu++2a’ [-Wpedantic]
8 | foo bar{.a = 1};

I actually thought we had -Wpedantic on by default, I'm not sure why we don't. @fanquake @dongcarl is there an obvious reason I'm forgetting?

Edit: Clang also has -Wc++20-designator.

Edit2: pushed a quick branch to demonstrate the changes needed to build bitcoind (only, everything else untested) with -Wpedantic, which is meant to warn about the use of compiler exensions: https://github.com/theuni/bitcoin/tree/build-pedantic

@ryanofsky
Copy link
Contributor

Just curious, but merging this seems to have caused a few build problems that required followups: #23178 #23179 #23196, and I guess some test problems in local developer checkouts.

I'm wondering if there are any CI improvements that might have caught these problems. Are too many of the CI builds using old kernels, and could adding CI builds with newer kernels catch these problems? Are too many of the CI builds using depends and could adding CI builds with system dependencies catch these problems? Is there some extended draftbot build that might have caught these problems?

It seems like it could make sense to run some extended CI on PRs that modify configure.ac or build scripts, but not clear what might have been useful in this case.

@laanwj
Copy link
Member

laanwj commented Oct 5, 2021

could adding CI builds with newer kernels catch these problems

For the record, it's not the running kernel version that makes a difference here. But the kernel headers. What would have helped find the problem is compiling with older kernel headers. We also should have done a GUIX build.

It's the first time these matter at all for our compilation. Besides syscall(SYS_getrandom… for random context gathering, there is no code that directly interfaces with the Linux kernel at all.

I think it's a good lesson that we need to make sure to define everything we need from the kernel ourselves, to support compilation on a wide range of installations. I knew about this problem but underestimated how much it's still a problem nowadays.

@ryanofsky
Copy link
Contributor

I just don't understand the basics here. What CI build could you theoretically add to catch these problems? Would it be a CI build with a really old kernel and really old headers? A really new kernel and new headers? An old kernel with new headers? And new kernel with old headers?

@maflcko
Copy link
Member

maflcko commented Oct 6, 2021

Are too many of the CI builds using old kernels, and could adding CI builds with newer kernels catch these problems?

Generally developers tend to run the latest software, so this is usually not something that will catch a lot of fish as a CI task. In fact, CI tasks with bleeding edge software might tend to break more often. (I do run them, but not part of this project).

Is there some extended draftbot build that might have caught these problems?

Simply adding the "DrahtBot Guix build requested" label would have caught this. I think any pull request that is tagged with "Build system" should be Guix-built as well. As DrahtBot is running on limited resources, it might take a day or two to create the guix build, but I think this is an acceptable delay. If there is an "emergency-fix" devs can always do a local guix build.

@maflcko
Copy link
Member

maflcko commented Oct 6, 2021

And when it comes to preventing build errors via the CI, I think this is not something we should try to achieve at all costs. Due to compiler bugs (or just software bugs in general), there will always be at least one software configuration and build flag configuration that will fail to build. I think it is reasonable to have tests for common software configurations on the latest LTS releases of operating systems. Though, when it comes to supporting any software configuration, waiting for an actual user to report the issue instead of integrating into the CI works probably better.

@ryanofsky
Copy link
Contributor

Thanks for the replies. So is the answer to my question of what CI build would catch all these problems (#23178 #23179 #23196) that any CI build using newer kernel headers would catch all these problems? That all CI builds are using older headers and practicalswift was also using older headers, and this slipped past review?

My motivation for asking is just to understand when I am reviewing a PR what class of errors I could expect CI to catch, and what class of errors I should be testing manually or asking the PR author about. In this case, I guess I failed to ask practicalswift how is this code affected when new syscalls are added? And in the future, when I'm reviewing a PR that makes significant changes to the build system, I can request GUIX builds?

I don't really have opinions on what CI should check for. I just want to understand what CI is checking for so I can review PRs better. For example, it is useful to know if CI is only checking old kernels and we are depending on developers to manually test newer kernels. It is also useful to know more generally if CI builds tend to use older or different dependencies than actual the GUIX release.

@maflcko
Copy link
Member

maflcko commented Oct 6, 2021

I can request GUIX builds?

Sure, just add the label.

It is also useful to know more generally if CI builds tend to use older or different dependencies than actual the GUIX release.

CI is a mix of cross-compile depends builds that try to mimic guix (older LTS releases with older gcc-8) and recent clang with native sanitizer builds from system packages (except for msan, which is also using depends). Obviously Ubuntu/Debian ship different compilers than guix, so the CI is just a proxy and not a replacement for a guix build.

versions = [None] * num_nodes
if self.is_syscall_sandbox_compiled() and not self.disable_syscall_sandbox:
for i in range(len(extra_args)):
if versions[i] is None or versions[i] >= 219900:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually not in the v22.0 binary and also not backported (yet). I made a commit to bump the version: 4672c1f

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.