doc: ReSTify seccomp_filter.txt
This updates seccomp_filter.txt for ReST markup, and moves it under the
user-space API index, since it describes how application author can use
seccomp.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
parent 5e33994dca
commit c061f33f35
3 changed files with 62 additions and 56 deletions
|
@ -16,6 +16,7 @@ place where this information is gathered.
|
||||||
.. toctree::
|
.. toctree::
|
||||||
:maxdepth: 2
|
:maxdepth: 2
|
||||||
|
|
||||||
|
seccomp_filter
|
||||||
unshare
|
unshare
|
||||||
|
|
||||||
.. only:: subproject and html
|
.. only:: subproject and html
|
||||||
|
|
|
@ -1,8 +1,9 @@
|
||||||
SECure COMPuting with filters
|
===========================================
|
||||||
=============================
|
Seccomp BPF (SECure COMPuting with filters)
|
||||||
|
===========================================
|
||||||
|
|
||||||
Introduction
|
Introduction
|
||||||
------------
|
============
|
||||||
|
|
||||||
A large number of system calls are exposed to every userland process
|
A large number of system calls are exposed to every userland process
|
||||||
with many of them going unused for the entire lifetime of the process.
|
with many of them going unused for the entire lifetime of the process.
|
||||||
|
@ -27,7 +28,7 @@ pointers which constrains all filters to solely evaluating the system
|
||||||
call arguments directly.
|
call arguments directly.
|
||||||
|
|
||||||
What it isn't
|
What it isn't
|
||||||
-------------
|
=============
|
||||||
|
|
||||||
System call filtering isn't a sandbox. It provides a clearly defined
|
System call filtering isn't a sandbox. It provides a clearly defined
|
||||||
mechanism for minimizing the exposed kernel surface. It is meant to be
|
mechanism for minimizing the exposed kernel surface. It is meant to be
|
||||||
|
@ -40,13 +41,13 @@ system calls in socketcall() is allowed, for instance) which could be
|
||||||
construed, incorrectly, as a more complete sandboxing solution.
|
construed, incorrectly, as a more complete sandboxing solution.
|
||||||
|
|
||||||
Usage
|
Usage
|
||||||
-----
|
=====
|
||||||
|
|
||||||
An additional seccomp mode is added and is enabled using the same
|
An additional seccomp mode is added and is enabled using the same
|
||||||
prctl(2) call as the strict seccomp. If the architecture has
|
prctl(2) call as the strict seccomp. If the architecture has
|
||||||
CONFIG_HAVE_ARCH_SECCOMP_FILTER, then filters may be added as below:
|
``CONFIG_HAVE_ARCH_SECCOMP_FILTER``, then filters may be added as below:
|
||||||
|
|
||||||
PR_SET_SECCOMP:
|
``PR_SET_SECCOMP``:
|
||||||
Now takes an additional argument which specifies a new filter
|
Now takes an additional argument which specifies a new filter
|
||||||
using a BPF program.
|
using a BPF program.
|
||||||
The BPF program will be executed over struct seccomp_data
|
The BPF program will be executed over struct seccomp_data
|
||||||
|
@ -55,24 +56,25 @@ PR_SET_SECCOMP:
|
||||||
acceptable values to inform the kernel which action should be
|
acceptable values to inform the kernel which action should be
|
||||||
taken.
|
taken.
|
||||||
|
|
||||||
Usage:
|
Usage::
|
||||||
|
|
||||||
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog);
|
prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog);
|
||||||
|
|
||||||
The 'prog' argument is a pointer to a struct sock_fprog which
|
The 'prog' argument is a pointer to a struct sock_fprog which
|
||||||
will contain the filter program. If the program is invalid, the
|
will contain the filter program. If the program is invalid, the
|
||||||
call will return -1 and set errno to EINVAL.
|
call will return -1 and set errno to ``EINVAL``.
|
||||||
|
|
||||||
If fork/clone and execve are allowed by @prog, any child
|
If ``fork``/``clone`` and ``execve`` are allowed by @prog, any child
|
||||||
processes will be constrained to the same filters and system
|
processes will be constrained to the same filters and system
|
||||||
call ABI as the parent.
|
call ABI as the parent.
|
||||||
|
|
||||||
Prior to use, the task must call prctl(PR_SET_NO_NEW_PRIVS, 1) or
|
Prior to use, the task must call ``prctl(PR_SET_NO_NEW_PRIVS, 1)`` or
|
||||||
run with CAP_SYS_ADMIN privileges in its namespace. If these are not
|
run with ``CAP_SYS_ADMIN`` privileges in its namespace. If these are not
|
||||||
true, -EACCES will be returned. This requirement ensures that filter
|
true, ``-EACCES`` will be returned. This requirement ensures that filter
|
||||||
programs cannot be applied to child processes with greater privileges
|
programs cannot be applied to child processes with greater privileges
|
||||||
than the task that installed them.
|
than the task that installed them.
|
||||||
|
|
||||||
Additionally, if prctl(2) is allowed by the attached filter,
|
Additionally, if ``prctl(2)`` is allowed by the attached filter,
|
||||||
additional filters may be layered on which will increase evaluation
|
additional filters may be layered on which will increase evaluation
|
||||||
time, but allow for further decreasing the attack surface during
|
time, but allow for further decreasing the attack surface during
|
||||||
execution of a process.
|
execution of a process.
|
||||||
|
@ -80,51 +82,52 @@ PR_SET_SECCOMP:
|
||||||
The above call returns 0 on success and non-zero on error.
|
The above call returns 0 on success and non-zero on error.
|
||||||
|
|
||||||
Return values
|
Return values
|
||||||
-------------
|
=============
|
||||||
|
|
||||||
A seccomp filter may return any of the following values. If multiple
|
A seccomp filter may return any of the following values. If multiple
|
||||||
filters exist, the return value for the evaluation of a given system
|
filters exist, the return value for the evaluation of a given system
|
||||||
call will always use the highest precedent value. (For example,
|
call will always use the highest precedent value. (For example,
|
||||||
SECCOMP_RET_KILL will always take precedence.)
|
``SECCOMP_RET_KILL`` will always take precedence.)
|
||||||
|
|
||||||
In precedence order, they are:
|
In precedence order, they are:
|
||||||
|
|
||||||
SECCOMP_RET_KILL:
|
``SECCOMP_RET_KILL``:
|
||||||
Results in the task exiting immediately without executing the
|
Results in the task exiting immediately without executing the
|
||||||
system call. The exit status of the task (status & 0x7f) will
|
system call. The exit status of the task (``status & 0x7f``) will
|
||||||
be SIGSYS, not SIGKILL.
|
be ``SIGSYS``, not ``SIGKILL``.
|
||||||
|
|
||||||
SECCOMP_RET_TRAP:
|
``SECCOMP_RET_TRAP``:
|
||||||
Results in the kernel sending a SIGSYS signal to the triggering
|
Results in the kernel sending a ``SIGSYS`` signal to the triggering
|
||||||
task without executing the system call. siginfo->si_call_addr
|
task without executing the system call. ``siginfo->si_call_addr``
|
||||||
will show the address of the system call instruction, and
|
will show the address of the system call instruction, and
|
||||||
siginfo->si_syscall and siginfo->si_arch will indicate which
|
``siginfo->si_syscall`` and ``siginfo->si_arch`` will indicate which
|
||||||
syscall was attempted. The program counter will be as though
|
syscall was attempted. The program counter will be as though
|
||||||
the syscall happened (i.e. it will not point to the syscall
|
the syscall happened (i.e. it will not point to the syscall
|
||||||
instruction). The return value register will contain an arch-
|
instruction). The return value register will contain an arch-
|
||||||
dependent value -- if resuming execution, set it to something
|
dependent value -- if resuming execution, set it to something
|
||||||
sensible. (The architecture dependency is because replacing
|
sensible. (The architecture dependency is because replacing
|
||||||
it with -ENOSYS could overwrite some useful information.)
|
it with ``-ENOSYS`` could overwrite some useful information.)
|
||||||
|
|
||||||
The SECCOMP_RET_DATA portion of the return value will be passed
|
The ``SECCOMP_RET_DATA`` portion of the return value will be passed
|
||||||
as si_errno.
|
as ``si_errno``.
|
||||||
|
|
||||||
SIGSYS triggered by seccomp will have a si_code of SYS_SECCOMP.
|
``SIGSYS`` triggered by seccomp will have a si_code of ``SYS_SECCOMP``.
|
||||||
|
|
||||||
SECCOMP_RET_ERRNO:
|
``SECCOMP_RET_ERRNO``:
|
||||||
Results in the lower 16-bits of the return value being passed
|
Results in the lower 16-bits of the return value being passed
|
||||||
to userland as the errno without executing the system call.
|
to userland as the errno without executing the system call.
|
||||||
|
|
||||||
SECCOMP_RET_TRACE:
|
``SECCOMP_RET_TRACE``:
|
||||||
When returned, this value will cause the kernel to attempt to
|
When returned, this value will cause the kernel to attempt to
|
||||||
notify a ptrace()-based tracer prior to executing the system
|
notify a ``ptrace()``-based tracer prior to executing the system
|
||||||
call. If there is no tracer present, -ENOSYS is returned to
|
call. If there is no tracer present, ``-ENOSYS`` is returned to
|
||||||
userland and the system call is not executed.
|
userland and the system call is not executed.
|
||||||
|
|
||||||
A tracer will be notified if it requests PTRACE_O_TRACESECCOMP
|
A tracer will be notified if it requests ``PTRACE_O_TRACESECCOM``P
|
||||||
using ptrace(PTRACE_SETOPTIONS). The tracer will be notified
|
using ``ptrace(PTRACE_SETOPTIONS)``. The tracer will be notified
|
||||||
of a PTRACE_EVENT_SECCOMP and the SECCOMP_RET_DATA portion of
|
of a ``PTRACE_EVENT_SECCOMP`` and the ``SECCOMP_RET_DATA`` portion of
|
||||||
the BPF program return value will be available to the tracer
|
the BPF program return value will be available to the tracer
|
||||||
via PTRACE_GETEVENTMSG.
|
via ``PTRACE_GETEVENTMSG``.
|
||||||
|
|
||||||
The tracer can skip the system call by changing the syscall number
|
The tracer can skip the system call by changing the syscall number
|
||||||
to -1. Alternatively, the tracer can change the system call
|
to -1. Alternatively, the tracer can change the system call
|
||||||
|
@ -138,19 +141,19 @@ SECCOMP_RET_TRACE:
|
||||||
allow use of ptrace, even of other sandboxed processes, without
|
allow use of ptrace, even of other sandboxed processes, without
|
||||||
extreme care; ptracers can use this mechanism to escape.)
|
extreme care; ptracers can use this mechanism to escape.)
|
||||||
|
|
||||||
SECCOMP_RET_ALLOW:
|
``SECCOMP_RET_ALLOW``:
|
||||||
Results in the system call being executed.
|
Results in the system call being executed.
|
||||||
|
|
||||||
If multiple filters exist, the return value for the evaluation of a
|
If multiple filters exist, the return value for the evaluation of a
|
||||||
given system call will always use the highest precedent value.
|
given system call will always use the highest precedent value.
|
||||||
|
|
||||||
Precedence is only determined using the SECCOMP_RET_ACTION mask. When
|
Precedence is only determined using the ``SECCOMP_RET_ACTION`` mask. When
|
||||||
multiple filters return values of the same precedence, only the
|
multiple filters return values of the same precedence, only the
|
||||||
SECCOMP_RET_DATA from the most recently installed filter will be
|
``SECCOMP_RET_DATA`` from the most recently installed filter will be
|
||||||
returned.
|
returned.
|
||||||
|
|
||||||
Pitfalls
|
Pitfalls
|
||||||
--------
|
========
|
||||||
|
|
||||||
The biggest pitfall to avoid during use is filtering on system call
|
The biggest pitfall to avoid during use is filtering on system call
|
||||||
number without checking the architecture value. Why? On any
|
number without checking the architecture value. Why? On any
|
||||||
|
@ -160,39 +163,40 @@ the numbers in the different calling conventions overlap, then checks in
|
||||||
the filters may be abused. Always check the arch value!
|
the filters may be abused. Always check the arch value!
|
||||||
|
|
||||||
Example
|
Example
|
||||||
-------
|
=======
|
||||||
|
|
||||||
The samples/seccomp/ directory contains both an x86-specific example
|
The ``samples/seccomp/`` directory contains both an x86-specific example
|
||||||
and a more generic example of a higher level macro interface for BPF
|
and a more generic example of a higher level macro interface for BPF
|
||||||
program generation.
|
program generation.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
Adding architecture support
|
Adding architecture support
|
||||||
-----------------------
|
===========================
|
||||||
|
|
||||||
See arch/Kconfig for the authoritative requirements. In general, if an
|
See ``arch/Kconfig`` for the authoritative requirements. In general, if an
|
||||||
architecture supports both ptrace_event and seccomp, it will be able to
|
architecture supports both ptrace_event and seccomp, it will be able to
|
||||||
support seccomp filter with minor fixup: SIGSYS support and seccomp return
|
support seccomp filter with minor fixup: ``SIGSYS`` support and seccomp return
|
||||||
value checking. Then it must just add CONFIG_HAVE_ARCH_SECCOMP_FILTER
|
value checking. Then it must just add ``CONFIG_HAVE_ARCH_SECCOMP_FILTER``
|
||||||
to its arch-specific Kconfig.
|
to its arch-specific Kconfig.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
Caveats
|
Caveats
|
||||||
-------
|
=======
|
||||||
|
|
||||||
The vDSO can cause some system calls to run entirely in userspace,
|
The vDSO can cause some system calls to run entirely in userspace,
|
||||||
leading to surprises when you run programs on different machines that
|
leading to surprises when you run programs on different machines that
|
||||||
fall back to real syscalls. To minimize these surprises on x86, make
|
fall back to real syscalls. To minimize these surprises on x86, make
|
||||||
sure you test with
|
sure you test with
|
||||||
/sys/devices/system/clocksource/clocksource0/current_clocksource set to
|
``/sys/devices/system/clocksource/clocksource0/current_clocksource`` set to
|
||||||
something like acpi_pm.
|
something like ``acpi_pm``.
|
||||||
|
|
||||||
On x86-64, vsyscall emulation is enabled by default. (vsyscalls are
|
On x86-64, vsyscall emulation is enabled by default. (vsyscalls are
|
||||||
legacy variants on vDSO calls.) Currently, emulated vsyscalls will honor seccomp, with a few oddities:
|
legacy variants on vDSO calls.) Currently, emulated vsyscalls will
|
||||||
|
honor seccomp, with a few oddities:
|
||||||
|
|
||||||
- A return value of SECCOMP_RET_TRAP will set a si_call_addr pointing to
|
- A return value of ``SECCOMP_RET_TRAP`` will set a ``si_call_addr`` pointing to
|
||||||
the vsyscall entry for the given call and not the address after the
|
the vsyscall entry for the given call and not the address after the
|
||||||
'syscall' instruction. Any code which wants to restart the call
|
'syscall' instruction. Any code which wants to restart the call
|
||||||
should be aware that (a) a ret instruction has been emulated and (b)
|
should be aware that (a) a ret instruction has been emulated and (b)
|
||||||
|
@ -200,7 +204,7 @@ legacy variants on vDSO calls.) Currently, emulated vsyscalls will honor seccom
|
||||||
emulation security checks, making resuming the syscall mostly
|
emulation security checks, making resuming the syscall mostly
|
||||||
pointless.
|
pointless.
|
||||||
|
|
||||||
- A return value of SECCOMP_RET_TRACE will signal the tracer as usual,
|
- A return value of ``SECCOMP_RET_TRACE`` will signal the tracer as usual,
|
||||||
but the syscall may not be changed to another system call using the
|
but the syscall may not be changed to another system call using the
|
||||||
orig_rax register. It may only be changed to -1 order to skip the
|
orig_rax register. It may only be changed to -1 order to skip the
|
||||||
currently emulated call. Any other change MAY terminate the process.
|
currently emulated call. Any other change MAY terminate the process.
|
||||||
|
@ -209,14 +213,14 @@ legacy variants on vDSO calls.) Currently, emulated vsyscalls will honor seccom
|
||||||
rip or rsp. (Do not rely on other changes terminating the process.
|
rip or rsp. (Do not rely on other changes terminating the process.
|
||||||
They might work. For example, on some kernels, choosing a syscall
|
They might work. For example, on some kernels, choosing a syscall
|
||||||
that only exists in future kernels will be correctly emulated (by
|
that only exists in future kernels will be correctly emulated (by
|
||||||
returning -ENOSYS).
|
returning ``-ENOSYS``).
|
||||||
|
|
||||||
To detect this quirky behavior, check for addr & ~0x0C00 ==
|
To detect this quirky behavior, check for ``addr & ~0x0C00 ==
|
||||||
0xFFFFFFFFFF600000. (For SECCOMP_RET_TRACE, use rip. For
|
0xFFFFFFFFFF600000``. (For ``SECCOMP_RET_TRACE``, use rip. For
|
||||||
SECCOMP_RET_TRAP, use siginfo->si_call_addr.) Do not check any other
|
``SECCOMP_RET_TRAP``, use ``siginfo->si_call_addr``.) Do not check any other
|
||||||
condition: future kernels may improve vsyscall emulation and current
|
condition: future kernels may improve vsyscall emulation and current
|
||||||
kernels in vsyscall=native mode will behave differently, but the
|
kernels in vsyscall=native mode will behave differently, but the
|
||||||
instructions at 0xF...F600{0,4,8,C}00 will not be system calls in these
|
instructions at ``0xF...F600{0,4,8,C}00`` will not be system calls in these
|
||||||
cases.
|
cases.
|
||||||
|
|
||||||
Note that modern systems are unlikely to use vsyscalls at all -- they
|
Note that modern systems are unlikely to use vsyscalls at all -- they
|
|
@ -11492,6 +11492,7 @@ F: kernel/seccomp.c
|
||||||
F: include/uapi/linux/seccomp.h
|
F: include/uapi/linux/seccomp.h
|
||||||
F: include/linux/seccomp.h
|
F: include/linux/seccomp.h
|
||||||
F: tools/testing/selftests/seccomp/*
|
F: tools/testing/selftests/seccomp/*
|
||||||
|
F: Documentation/userspace-api/seccomp_filter.rst
|
||||||
K: \bsecure_computing
|
K: \bsecure_computing
|
||||||
K: \bTIF_SECCOMP\b
|
K: \bTIF_SECCOMP\b
|
||||||
|
|
||||||
|
|