Set disable_rcu_irq_enter on not only rcu_eqs_enter_common() but also
rcu_eqs_exit(), since rcu_eqs_exit() suffers from the same issue as was
fixed for rcu_eqs_enter_common() by commit 03ecd3f48e ("rcu/tracing:
Add rcu_disabled to denote when rcu_irq_enter() will not work").
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The _rcu_barrier_trace() function is a wrapper for trace_rcu_barrier(),
which needs TPS() protection for strings passed through the second
argument. However, it has escaped prior TPS()-ification efforts because
it _rcu_barrier_trace() does not start with "trace_". This commit
therefore adds the needed TPS() protection
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
These RCU waits were set to use interruptible waits to avoid the kthreads
contributing to system load average, even though they are not interruptible
as they are spawned from a kthread. Use the new TASK_IDLE swaits which makes
our goal clear, and removes confusion about these paths possibly being
interruptible -- they are not.
When the system is idle the RCU grace-period kthread will spend all its time
blocked inside the swait_event_interruptible(). If the interruptible() was
not used, then this kthread would contribute to the load average. This means
that an idle system would have a load average of 2 (or 3 if PREEMPT=y),
rather than the load average of 0 that almost fifty years of UNIX has
conditioned sysadmins to expect.
The same argument applies to swait_event_interruptible_timeout() use. The
RCU grace-period kthread spends its time blocked inside this call while
waiting for grace periods to complete. In particular, if there was only one
busy CPU, but that CPU was frequently invoking call_rcu(), then the RCU
grace-period kthread would spend almost all its time blocked inside the
swait_event_interruptible_timeout(). This would mean that the load average
would be 2 rather than the expected 1 for the single busy CPU.
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There is currently event tracing to track when a task is preempted
within a preemptible RCU read-side critical section, and also when that
task subsequently reaches its outermost rcu_read_unlock(), but none
indicating when a new grace period starts when that grace period must
wait on pre-existing readers that have been been preempted at least once
since the beginning of their current RCU read-side critical sections.
This commit therefore adds an event trace at grace-period start in
the case where there are such readers. Note that only the first
reader in the list is traced.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
This commit saves a few lines in kernel/rcu/rcu.h by moving to single-line
definitions for trivial functions, instead of the old style where the
two curly braces each get their own line.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Strings used in event tracing need to be specially handled, for example,
using the TPS() macro. Without the TPS() macro, although output looks
fine from within a running kernel, extracting traces from a crash dump
produces garbage instead of strings. This commit therefore adds the TPS()
macro to some unadorned strings that were passed to event-tracing macros.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Currently, the exit-time support for TASKS_RCU is open-coded in do_exit().
This commit creates exit_tasks_rcu_start() and exit_tasks_rcu_finish()
APIs for do_exit() use. This has the benefit of confining the use of the
tasks_rcu_exit_srcu variable to one file, allowing it to become static.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
The actual use of TASKS_RCU is only when PREEMPT, otherwise RCU-sched
is used instead. This commit therefore makes synchronize_rcu_tasks()
and call_rcu_tasks() available always, but mapped to synchronize_sched()
and call_rcu_sched(), respectively, when !PREEMPT. This approach also
allows some #ifdefs to be removed from rcutorture.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
With the new lockdep crossrelease feature, which checks completions usage,
a false positive is reported in the workqueue code:
> Worker A : acquired of wfc.work -> wait for cpu_hotplug_lock to be released
> Task B : acquired of cpu_hotplug_lock -> wait for lock#3 to be released
> Task C : acquired of lock#3 -> wait for completion of barr->done
> (Task C is in lru_add_drain_all_cpuslocked())
> Worker D : wait for wfc.work to be released -> will complete barr->done
Such a dead lock can not happen because Task C's barr->done and Worker D's
barr->done can not be the same instance.
The reason of this false positive is we initialize all wq_barrier::done
at insert_wq_barrier() via init_completion(), which makes them belong to
the same lock class, therefore, impossible circles are reported.
To fix this, explicitly initialize the lockdep map for wq_barrier::done
in insert_wq_barrier(), so that the lock class key of wq_barrier::done
is a subkey of the corresponding work_struct, as a result we won't build
a dependency between a wq_barrier with a unrelated work, and we can
differ wq barriers based on the related works, so the false positive
above is avoided.
Also define the empty lockdep_init_map_crosslock() for !CROSSRELEASE
to make the code simple and away from unnecessary #ifdefs.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170817094622.12915-1-boqun.feng@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This implements refcount_t overflow protection on x86 without a noticeable
performance impact, though without the fuller checking of REFCOUNT_FULL.
This is done by duplicating the existing atomic_t refcount implementation
but with normally a single instruction added to detect if the refcount
has gone negative (e.g. wrapped past INT_MAX or below zero). When detected,
the handler saturates the refcount_t to INT_MIN / 2. With this overflow
protection, the erroneous reference release that would follow a wrap back
to zero is blocked from happening, avoiding the class of refcount-overflow
use-after-free vulnerabilities entirely.
Only the overflow case of refcounting can be perfectly protected, since
it can be detected and stopped before the reference is freed and left to
be abused by an attacker. There isn't a way to block early decrements,
and while REFCOUNT_FULL stops increment-from-zero cases (which would
be the state _after_ an early decrement and stops potential double-free
conditions), this fast implementation does not, since it would require
the more expensive cmpxchg loops. Since the overflow case is much more
common (e.g. missing a "put" during an error path), this protection
provides real-world protection. For example, the two public refcount
overflow use-after-free exploits published in 2016 would have been
rendered unexploitable:
http://perception-point.io/2016/01/14/analysis-and-exploitation-of-a-linux-kernel-vulnerability-cve-2016-0728/http://cyseclabs.com/page?n=02012016
This implementation does, however, notice an unchecked decrement to zero
(i.e. caller used refcount_dec() instead of refcount_dec_and_test() and it
resulted in a zero). Decrements under zero are noticed (since they will
have resulted in a negative value), though this only indicates that a
use-after-free may have already happened. Such notifications are likely
avoidable by an attacker that has already exploited a use-after-free
vulnerability, but it's better to have them reported than allow such
conditions to remain universally silent.
On first overflow detection, the refcount value is reset to INT_MIN / 2
(which serves as a saturation value) and a report and stack trace are
produced. When operations detect only negative value results (such as
changing an already saturated value), saturation still happens but no
notification is performed (since the value was already saturated).
On the matter of races, since the entire range beyond INT_MAX but before
0 is negative, every operation at INT_MIN / 2 will trap, leaving no
overflow-only race condition.
As for performance, this implementation adds a single "js" instruction
to the regular execution flow of a copy of the standard atomic_t refcount
operations. (The non-"and_test" refcount_dec() function, which is uncommon
in regular refcount design patterns, has an additional "jz" instruction
to detect reaching exactly zero.) Since this is a forward jump, it is by
default the non-predicted path, which will be reinforced by dynamic branch
prediction. The result is this protection having virtually no measurable
change in performance over standard atomic_t operations. The error path,
located in .text.unlikely, saves the refcount location and then uses UD0
to fire a refcount exception handler, which resets the refcount, handles
reporting, and returns to regular execution. This keeps the changes to
.text size minimal, avoiding return jumps and open-coded calls to the
error reporting routine.
Example assembly comparison:
refcount_inc() before:
.text:
ffffffff81546149: f0 ff 45 f4 lock incl -0xc(%rbp)
refcount_inc() after:
.text:
ffffffff81546149: f0 ff 45 f4 lock incl -0xc(%rbp)
ffffffff8154614d: 0f 88 80 d5 17 00 js ffffffff816c36d3
...
.text.unlikely:
ffffffff816c36d3: 48 8d 4d f4 lea -0xc(%rbp),%rcx
ffffffff816c36d7: 0f ff (bad)
These are the cycle counts comparing a loop of refcount_inc() from 1
to INT_MAX and back down to 0 (via refcount_dec_and_test()), between
unprotected refcount_t (atomic_t), fully protected REFCOUNT_FULL
(refcount_t-full), and this overflow-protected refcount (refcount_t-fast):
2147483646 refcount_inc()s and 2147483647 refcount_dec_and_test()s:
cycles protections
atomic_t 82249267387 none
refcount_t-fast 82211446892 overflow, untested dec-to-zero
refcount_t-full 144814735193 overflow, untested dec-to-zero, inc-from-zero
This code is a modified version of the x86 PAX_REFCOUNT atomic_t
overflow defense from the last public patch of PaX/grsecurity, based
on my understanding of the code. Changes or omissions from the original
code are mine and don't reflect the original grsecurity/PaX code. Thanks
to PaX Team for various suggestions for improvement for repurposing this
code to be a refcount-only protection.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Hans Liljestrand <ishkamiel@gmail.com>
Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
Cc: Jann Horn <jannh@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arozansk@redhat.com
Cc: axboe@kernel.dk
Cc: kernel-hardening@lists.openwall.com
Cc: linux-arch <linux-arch@vger.kernel.org>
Link: http://lkml.kernel.org/r/20170815161924.GA133115@beast
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEcQCq365ubpQNLgrWVeRaWujKfIoFAlmUlmUUHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQVeRaWujKfIo92hAAqbffYKqih+3VPCYg0bx7N9pCl8Ya
k9RNxyRPv9+IxJGTrnG00x6k8GIv3hjyJIYmqGQl/GWdbZadmySazl20YI9ls47p
7ydJAJELRPnfKFLJ9T2mqi6Az8qDtRoV2DwLCSCnsBCJdsK4wcUxtM3/qV2JGxzJ
O2YIw4C4kuoM2SRl6weGnCUTVkdaDdHk6GcC2GClIlsjapUpNB+UieGijN/3HqHi
YpSofAXD1lkZ4DZCM51t/3vuIlNTGSQOVvXqsVZWJv4fFR1qZbGiYuVQervYaaP2
sRN+2OwNtdy5yUStQ5BMHT44zTc49ACizSqU3j96yzEa5H3IfMSN9U5Aa+GYIy5N
um6qeUz7wKOto0/hBtDpabGeeBkdLZBY6L7Dt2NLTcC8vT65b8NveGj4rvVGt0b5
REjoT0Slja4yQeER3IgUByR5H6h983Em/cjDmL6V/oLqxfOGGLkLQgKyfGoF+aSK
DrpCWS/XiGU/Q2W3XhLSSIlJXbZ6y/dttM4tFOrk6omekLpdzdJwgo8DRz91dIZI
vB5DAHG+Pvxw6sYFz2eAF2/3UYeEdxhAsQs8V3NJWz+7BD/AxAdfMDriGQnQ6jfU
NIWRcCxkU/FtrqsznIqp0BkitOQ7ZwDqusUebWl34y8iNa/m2f9Jp+rvSnxq8+Zu
Zw0EjuRyfwu2SE0=
=tP6Y
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20170816' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit fixes from Paul Moore:
"Two small fixes to the audit code, both explained well in the
respective patch descriptions, but the quick summary is one
use-after-free fix, and one silly fanotify notification flag fix"
* tag 'audit-pr-20170816' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: Receive unmount event
audit: Fix use after free in audit_remove_watch_rule()
Resolve issues with !CONFIG_BPF_SYSCALL and !STREAM_PARSER
net/core/filter.c: In function ‘do_sk_redirect_map’:
net/core/filter.c:1881:3: error: implicit declaration of function ‘__sock_map_lookup_elem’ [-Werror=implicit-function-declaration]
sk = __sock_map_lookup_elem(ri->map, ri->ifindex);
^
net/core/filter.c:1881:6: warning: assignment makes pointer from integer without a cast [enabled by default]
sk = __sock_map_lookup_elem(ri->map, ri->ifindex);
Fixes: 174a79ff95 ("bpf: sockmap with sk redirect support")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
psock will uninitialized in default case we need to do the same psock lookup
and check as in other branch. Fixes compile warning below.
kernel/bpf/sockmap.c: In function ‘smap_state_change’:
kernel/bpf/sockmap.c:156:21: warning: ‘psock’ may be used uninitialized in this function [-Wmaybe-uninitialized]
struct smap_psock *psock;
Fixes: 174a79ff95 ("bpf: sockmap with sk redirect support")
Reported-by: David Miller <davem@davemloft.net>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the devmap alloc map logic we check to ensure that the sizeof the
values are not greater than KMALLOC_MAX_SIZE. But, in the dev map case
we ensure the value size is 4bytes earlier in the function because all
values should be netdev ifindex values.
The second check is harmless but is not needed so remove it.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recently we added a new map type called dev map used to forward XDP
packets between ports (6093ec2dc3). This patches introduces a
similar notion for sockets.
A sockmap allows users to add participating sockets to a map. When
sockets are added to the map enough context is stored with the
map entry to use the entry with a new helper
bpf_sk_redirect_map(map, key, flags)
This helper (analogous to bpf_redirect_map in XDP) is given the map
and an entry in the map. When called from a sockmap program, discussed
below, the skb will be sent on the socket using skb_send_sock().
With the above we need a bpf program to call the helper from that will
then implement the send logic. The initial site implemented in this
series is the recv_sock hook. For this to work we implemented a map
attach command to add attributes to a map. In sockmap we add two
programs a parse program and a verdict program. The parse program
uses strparser to build messages and pass them to the verdict program.
The parse programs use the normal strparser semantics. The verdict
program is of type SK_SKB.
The verdict program returns a verdict SK_DROP, or SK_REDIRECT for
now. Additional actions may be added later. When SK_REDIRECT is
returned, expected when bpf program uses bpf_sk_redirect_map(), the
sockmap logic will consult per cpu variables set by the helper routine
and pull the sock entry out of the sock map. This pattern follows the
existing redirect logic in cls and xdp programs.
This gives the flow,
recv_sock -> str_parser (parse_prog) -> verdict_prog -> skb_send_sock
\
-> kfree_skb
As an example use case a message based load balancer may use specific
logic in the verdict program to select the sock to send on.
Sample programs are provided in future patches that hopefully illustrate
the user interfaces. Also selftests are in follow-on patches.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bpf_prog_inc_not_zero will be used by upcoming sockmap patches this
patch simply exports it so we can pull it in.
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The complete_all() function modifies the completion's "done" variable to
UINT_MAX, and no other caller (wait_for_completion(), etc) will modify
it back to zero. That means that any call to complete_all() must have a
reinit_completion() before that completion can be used again.
Document this fact by the complete_all() function.
Also document that completion_done() will always return true if
complete_all() is called.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170816131202.195c2f4b@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add a resource managed version of irq_sim_init(). This can be
conveniently used in device drivers.
Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-gpio@vger.kernel.org
Cc: Bamvor Jian Zhang <bamvor.zhangjian@linaro.org>
Cc: Jonathan Cameron <jic23@kernel.org>
Link: http://lkml.kernel.org/r/20170814145318.6495-3-brgl@bgdev.pl
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Implement a simple, irq_work-based framework for simulating
interrupts. Currently the API exposes routines for initializing and
deinitializing the simulator object, enqueueing the interrupts and
retrieving the allocated interrupt numbers based on the offset of the
dummy interrupt in the simulator struct.
Signed-off-by: Bartosz Golaszewski <brgl@bgdev.pl>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: linux-doc@vger.kernel.org
Cc: linux-gpio@vger.kernel.org
Cc: Bamvor Jian Zhang <bamvor.zhangjian@linaro.org>
Cc: Jonathan Cameron <jic23@kernel.org>
Link: http://lkml.kernel.org/r/20170814145318.6495-2-brgl@bgdev.pl
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull networking fixes from David Miller:
1) Fix TCP checksum offload handling in iwlwifi driver, from Emmanuel
Grumbach.
2) In ksz DSA tagging code, free SKB if skb_put_padto() fails. From
Vivien Didelot.
3) Fix two regressions with bonding on wireless, from Andreas Born.
4) Fix build when busypoll is disabled, from Daniel Borkmann.
5) Fix copy_linear_skb() wrt. SO_PEEK_OFF, from Eric Dumazet.
6) Set SKB cached route properly in inet_rtm_getroute(), from Florian
Westphal.
7) Fix PCI-E relaxed ordering handling in cxgb4 driver, from Ding
Tianhong.
8) Fix module refcnt leak in ULP code, from Sabrina Dubroca.
9) Fix use of GFP_KERNEL in atomic contexts in AF_KEY code, from Eric
Dumazet.
10) Need to purge socket write queue in dccp_destroy_sock(), also from
Eric Dumazet.
11) Make bpf_trace_printk() work properly on 32-bit architectures, from
Daniel Borkmann.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (47 commits)
bpf: fix bpf_trace_printk on 32 bit archs
PCI: fix oops when try to find Root Port for a PCI device
sfc: don't try and read ef10 data on non-ef10 NIC
net_sched: remove warning from qdisc_hash_add
net_sched/sfq: update hierarchical backlog when drop packet
net_sched: reset pointers to tcf blocks in classful qdiscs' destructors
ipv4: fix NULL dereference in free_fib_info_rcu()
net: Fix a typo in comment about sock flags.
ipv6: fix NULL dereference in ip6_route_dev_notify()
tcp: fix possible deadlock in TCP stack vs BPF filter
dccp: purge write queue in dccp_destroy_sock()
udp: fix linear skb reception with PEEK_OFF
ipv6: release rt6->rt6i_idev properly during ifdown
af_key: do not use GFP_KERNEL in atomic contexts
tcp: ulp: avoid module refcnt leak in tcp_set_ulp
net/cxgb4vf: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
net/cxgb4: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
PCI: Disable Relaxed Ordering Attributes for AMD A1100
PCI: Disable Relaxed Ordering for some Intel processors
PCI: Disable PCIe Relaxed Ordering if unsupported
...
James reported that on MIPS32 bpf_trace_printk() is currently
broken while MIPS64 works fine:
bpf_trace_printk() uses conditional operators to attempt to
pass different types to __trace_printk() depending on the
format operators. This doesn't work as intended on 32-bit
architectures where u32 and long are passed differently to
u64, since the result of C conditional operators follows the
"usual arithmetic conversions" rules, such that the values
passed to __trace_printk() will always be u64 [causing issues
later in the va_list handling for vscnprintf()].
For example the samples/bpf/tracex5 test printed lines like
below on MIPS32, where the fd and buf have come from the u64
fd argument, and the size from the buf argument:
[...] 1180.941542: 0x00000001: write(fd=1, buf= (null), size=6258688)
Instead of this:
[...] 1625.616026: 0x00000001: write(fd=1, buf=009e4000, size=512)
One way to get it working is to expand various combinations
of argument types into 8 different combinations for 32 bit
and 64 bit kernels. Fix tested by James on MIPS32 and MIPS64
as well that it resolves the issue.
Fixes: 9c959c863f ("tracing: Allow BPF programs to call bpf_trace_printk()")
Reported-by: James Hogan <james.hogan@imgtec.com>
Tested-by: James Hogan <james.hogan@imgtec.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
State of a register doesn't matter if it wasn't read in reaching an exit;
a write screens off all reads downstream of it from all explored_states
upstream of it.
This allows us to prune many more branches; here are some processed insn
counts for some Cilium programs:
Program before after
bpf_lb_opt_-DLB_L3.o 6515 3361
bpf_lb_opt_-DLB_L4.o 8976 5176
bpf_lb_opt_-DUNKNOWN.o 2960 1137
bpf_lxc_opt_-DDROP_ALL.o 95412 48537
bpf_lxc_opt_-DUNKNOWN.o 141706 78718
bpf_netdev.o 24251 17995
bpf_overlay.o 10999 9385
The runtime is also improved; here are 'time' results in ms:
Program before after
bpf_lb_opt_-DLB_L3.o 24 6
bpf_lb_opt_-DLB_L4.o 26 11
bpf_lb_opt_-DUNKNOWN.o 11 2
bpf_lxc_opt_-DDROP_ALL.o 1288 139
bpf_lxc_opt_-DUNKNOWN.o 1768 234
bpf_netdev.o 62 31
bpf_overlay.o 15 13
Signed-off-by: Edward Cree <ecree@solarflare.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Although audit_watch_handle_event() can handle FS_UNMOUNT event, it is
not part of AUDIT_FS_WATCH mask and thus such event never gets to
audit_watch_handle_event(). Thus fsnotify marks are deleted by fsnotify
subsystem on unmount without audit being notified about that which leads
to a strange state of existing audit rules with dead fsnotify marks.
Add FS_UNMOUNT to the mask of events to be received so that audit can
clean up its state accordingly.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paul Moore <paul@paul-moore.com>
audit_remove_watch_rule() drops watch's reference to parent but then
continues to work with it. That is not safe as parent can get freed once
we drop our reference. The following is a trivial reproducer:
mount -o loop image /mnt
touch /mnt/file
auditctl -w /mnt/file -p wax
umount /mnt
auditctl -D
<crash in fsnotify_destroy_mark()>
Grab our own reference in audit_remove_watch_rule() earlier to make sure
mark does not get freed under us.
CC: stable@vger.kernel.org
Reported-by: Tony Jones <tonyj@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Tested-by: Tony Jones <tonyj@suse.de>
Signed-off-by: Paul Moore <paul@paul-moore.com>
In some cases, an architecture might wish its stacks to be aligned to a
boundary larger than THREAD_SIZE. For example, using an alignment of
double THREAD_SIZE can allow for stack overflows smaller than
THREAD_SIZE to be detected by checking a single bit of the stack
pointer.
This patch allows architectures to override the alignment of VMAP'd
stacks, by defining THREAD_ALIGN. Where not defined, this defaults to
THREAD_SIZE, as is the case today.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: linux-kernel@vger.kernel.org
The error variable in do_syslog() is preemptively set to the error code
before the error condition is checked, and then set to 0 if the error
condition is not encountered. This is not necessary, as it is likely
simpler to return immediately upon encountering the error condition. A
redundant set of the error variable to 0 is also removed.
This patch has been build-tested on x86_64, but not tested for
functionality.
Link: http://lkml.kernel.org/r/20170730033636.GA935@vostro
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Nikitas Angelinas <nikitas.angelinas@gmail.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Pull RCU fix from Paul McKenney:
" This pull request is for an RCU change that permits waiting for grace
periods started by CPUs late in the process of going offline. Lack of
this capability is causing failures:
http://lkml.kernel.org/r/db9c91f6-1b17-6136-84f0-03c3c2581ab4@codeaurora.org"
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Right now, SECCOMP_RET_KILL_THREAD (neé SECCOMP_RET_KILL) kills the
current thread. There have been a few requests for this to kill the entire
process (the thread group). This cannot be just changed (discovered when
adding coredump support since coredumping kills the entire process)
because there are userspace programs depending on the thread-kill
behavior.
Instead, implement SECCOMP_RET_KILL_PROCESS, which is 0x80000000, and can
be processed as "-1" by the kernel, below the existing RET_KILL that is
ABI-set to "0". For userspace, SECCOMP_RET_ACTION_FULL is added to expand
the mask to the signed bit. Old userspace using the SECCOMP_RET_ACTION
mask will see SECCOMP_RET_KILL_PROCESS as 0 still, but this would only
be visible when examining the siginfo in a core dump from a RET_KILL_*,
where it will think it was thread-killed instead of process-killed.
Attempts to introduce this behavior via other ways (filter flags,
seccomp struct flags, masked RET_DATA bits) all come with weird
side-effects and baggage. This change preserves the central behavioral
expectations of the seccomp filter engine without putting too great
a burden on changes needed in userspace to use the new action.
The new action is discoverable by userspace through either the new
actions_avail sysctl or through the SECCOMP_GET_ACTION_AVAIL seccomp
operation. If used without checking for availability, old kernels
will treat RET_KILL_PROCESS as RET_KILL_THREAD (since the old mask
will produce RET_KILL_THREAD).
Cc: Paul Moore <paul@paul-moore.com>
Cc: Fabricio Voznika <fvoznika@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
This introduces the BPF return value for SECCOMP_RET_KILL_PROCESS to kill
an entire process. This cannot yet be reached by seccomp, but it changes
the default-kill behavior (for unknown return values) from kill-thread to
kill-process.
Signed-off-by: Kees Cook <keescook@chromium.org>
In preparation for adding SECCOMP_RET_KILL_PROCESS, rename SECCOMP_RET_KILL
to the more accurate SECCOMP_RET_KILL_THREAD.
The existing selftest values are intentionally left as SECCOMP_RET_KILL
just to be sure we're exercising the alias.
Signed-off-by: Kees Cook <keescook@chromium.org>
Add a new action, SECCOMP_RET_LOG, that logs a syscall before allowing
the syscall. At the implementation level, this action is identical to
the existing SECCOMP_RET_ALLOW action. However, it can be very useful when
initially developing a seccomp filter for an application. The developer
can set the default action to be SECCOMP_RET_LOG, maybe mark any
obviously needed syscalls with SECCOMP_RET_ALLOW, and then put the
application through its paces. A list of syscalls that triggered the
default action (SECCOMP_RET_LOG) can be easily gleaned from the logs and
that list can be used to build the syscall whitelist. Finally, the
developer can change the default action to the desired value.
This provides a more friendly experience than seeing the application get
killed, then updating the filter and rebuilding the app, seeing the
application get killed due to a different syscall, then updating the
filter and rebuilding the app, etc.
The functionality is similar to what's supported by the various LSMs.
SELinux has permissive mode, AppArmor has complain mode, SMACK has
bring-up mode, etc.
SECCOMP_RET_LOG is given a lower value than SECCOMP_RET_ALLOW as allow
while logging is slightly more restrictive than quietly allowing.
Unfortunately, the tests added for SECCOMP_RET_LOG are not capable of
inspecting the audit log to verify that the syscall was logged.
With this patch, the logic for deciding if an action will be logged is:
if action == RET_ALLOW:
do not log
else if action == RET_KILL && RET_KILL in actions_logged:
log
else if action == RET_LOG && RET_LOG in actions_logged:
log
else if filter-requests-logging && action in actions_logged:
log
else if audit_enabled && process-is-being-audited:
log
else:
do not log
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Add a new filter flag, SECCOMP_FILTER_FLAG_LOG, that enables logging for
all actions except for SECCOMP_RET_ALLOW for the given filter.
SECCOMP_RET_KILL actions are always logged, when "kill" is in the
actions_logged sysctl, and SECCOMP_RET_ALLOW actions are never logged,
regardless of this flag.
This flag can be used to create noisy filters that result in all
non-allowed actions to be logged. A process may have one noisy filter,
which is loaded with this flag, as well as a quiet filter that's not
loaded with this flag. This allows for the actions in a set of filters
to be selectively conveyed to the admin.
Since a system could have a large number of allocated seccomp_filter
structs, struct packing was taken in consideration. On 64 bit x86, the
new log member takes up one byte of an existing four byte hole in the
struct. On 32 bit x86, the new log member creates a new four byte hole
(unavoidable) and consumes one of those bytes.
Unfortunately, the tests added for SECCOMP_FILTER_FLAG_LOG are not
capable of inspecting the audit log to verify that the actions taken in
the filter were logged.
With this patch, the logic for deciding if an action will be logged is:
if action == RET_ALLOW:
do not log
else if action == RET_KILL && RET_KILL in actions_logged:
log
else if filter-requests-logging && action in actions_logged:
log
else if audit_enabled && process-is-being-audited:
log
else:
do not log
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Adminstrators can write to this sysctl to set the seccomp actions that
are allowed to be logged. Any actions not found in this sysctl will not
be logged.
For example, all SECCOMP_RET_KILL, SECCOMP_RET_TRAP, and
SECCOMP_RET_ERRNO actions would be loggable if "kill trap errno" were
written to the sysctl. SECCOMP_RET_TRACE actions would not be logged
since its string representation ("trace") wasn't present in the sysctl
value.
The path to the sysctl is:
/proc/sys/kernel/seccomp/actions_logged
The actions_avail sysctl can be read to discover the valid action names
that can be written to the actions_logged sysctl with the exception of
"allow". SECCOMP_RET_ALLOW actions cannot be configured for logging.
The default setting for the sysctl is to allow all actions to be logged
except SECCOMP_RET_ALLOW. While only SECCOMP_RET_KILL actions are
currently logged, an upcoming patch will allow applications to request
additional actions to be logged.
There's one important exception to this sysctl. If a task is
specifically being audited, meaning that an audit context has been
allocated for the task, seccomp will log all actions other than
SECCOMP_RET_ALLOW despite the value of actions_logged. This exception
preserves the existing auditing behavior of tasks with an allocated
audit context.
With this patch, the logic for deciding if an action will be logged is:
if action == RET_ALLOW:
do not log
else if action == RET_KILL && RET_KILL in actions_logged:
log
else if audit_enabled && task-is-being-audited:
log
else:
do not log
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Userspace code that needs to check if the kernel supports a given action
may not be able to use the /proc/sys/kernel/seccomp/actions_avail
sysctl. The process may be running in a sandbox and, therefore,
sufficient filesystem access may not be available. This patch adds an
operation to the seccomp(2) syscall that allows userspace code to ask
the kernel if a given action is available.
If the action is supported by the kernel, 0 is returned. If the action
is not supported by the kernel, -1 is returned with errno set to
-EOPNOTSUPP. If this check is attempted on a kernel that doesn't support
this new operation, -1 is returned with errno set to -EINVAL meaning
that userspace code will have the ability to differentiate between the
two error cases.
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Kees Cook <keescook@chromium.org>
This patch creates a read-only sysctl containing an ordered list of
seccomp actions that the kernel supports. The ordering, from left to
right, is the lowest action value (kill) to the highest action value
(allow). Currently, a read of the sysctl file would return "kill trap
errno trace allow". The contents of this sysctl file can be useful for
userspace code as well as the system administrator.
The path to the sysctl is:
/proc/sys/kernel/seccomp/actions_avail
libseccomp and other userspace code can easily determine which actions
the current kernel supports. The set of actions supported by the current
kernel may be different than the set of action macros found in kernel
headers that were installed where the userspace code was built.
In addition, this sysctl will allow system administrators to know which
actions are supported by the kernel and make it easier to configure
exactly what seccomp logs through the audit subsystem. Support for this
level of logging configuration will come in a future patch.
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Both the upcoming logging improvements and changes to RET_KILL will need
to know which filter a given seccomp return value originated from. In
order to delay logic processing of result until after the seccomp loop,
this adds a single pointer assignment on matches. This will allow both
log and RET_KILL logic to work off the filter rather than doing more
expensive tests inside the time-critical run_filters loop.
Running tight cycles of getpid() with filters attached shows no measurable
difference in speed.
Suggested-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
As Boqun Feng pointed out, current->hist_id should be aligned with the
latest valid xhlock->hist_id so that hist_id_save[] storing current->hist_id
can be comparable with xhlock->hist_id. Fix it.
Additionally, the condition for overwrite-detection should be the
opposite. Fix the code and the comments as well.
<- direction to visit
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh (h: history)
^^ ^
|| start from here
|previous entry
current entry
Reported-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: kernel-team@lge.com
Cc: kirill@shutemov.name
Cc: linux-mm@kvack.org
Cc: npiggin@gmail.com
Cc: walken@google.com
Cc: willy@infradead.org
Link: http://lkml.kernel.org/r/1502694052-16085-3-git-send-email-byungchul.park@lge.com
[ Improve the comments some more. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
struct irq_desc does not have a member named "act". The correct
name is "action".
Currently, all users of this macro use an iterator named "action".
If a different name is used, it will cause a build error.
Fixes: f944b5a7af ("genirq: Use a common macro to go through the actions list")
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Jason Cooper <jason@lakedaemon.net>
Link: http://lkml.kernel.org/r/1502260341-28184-1-git-send-email-yamada.masahiro@socionext.com
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair. This commit therefore replaces the spin_unlock_wait() call in
do_task_dead() with spin_lock() followed immediately by spin_unlock().
This should be safe from a performance perspective because the lock is
this tasks ->pi_lock, and this is called only after the task exits.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
[ paulmck: Drop smp_mb() based on Peter Zijlstra's analysis:
http://lkml.kernel.org/r/20170811144150.26gowhxte7ri5fpk@hirez.programming.kicks-ass.net ]
"descendants" and "depth" are declared as int, so they can't be larger
than INT_MAX. Static checkers complain and it's slightly confusing for
humans as well so let's just remove these conditions.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Misc trivial changes to prepare for future changes. No functional
difference.
* Expose cgroup_get(), cgroup_tryget() and cgroup_parent().
* Implement task_dfl_cgroup() which dereferences css_set->dfl_cgrp.
* Rename cgroup_stats_show() to cgroup_stat_show() for consistency
with the file name.
Signed-off-by: Tejun Heo <tj@kernel.org>
Conflicts:
include/linux/mm_types.h
mm/huge_memory.c
I removed the smp_mb__before_spinlock() like the following commit does:
8b1b436dd1 ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")
and fixed up the affected commits.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rename struct platform_freeze_ops to platform_s2idle_ops to make it
clear that the callbacks in it are used during suspend-to-idle
suspend/resume transitions and rename the related functions,
variables and so on accordingly.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Rename the ->enter_freeze cpuidle driver callback to ->enter_s2idle
to make it clear that it is used for entering suspend-to-idle and
rename the related functions, variables and so on accordingly.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>