Commit graph

22800 commits

Author SHA1 Message Date
Jon Hunter
b62b2cf575 irqdomain: Fix handling of type settings for existing mappings
When mapping an IRQ, it is possible that a mapping for the IRQ already
exists. If mapping does exist then there are the following issues with
regard to the handling of the IRQ type settings ...
1. If the domain is part of a hierarchy, then:
   a. We do not check that the type settings for the existing mapping
      match those of the new mapping.
   b. We do not check to see if the type settings have been programmed
      yet (and they might not have been) and so we may never set the
      type.
2. If the domain is NOT part of a hierarchy, we will overwrite the
   current type settings programmed if they are different from the
   previous mapping. Please note that irq_create_mapping()
   calls irq_find_mapping() to check if a mapping already exists.

Although, it may be unlikely that the type settings for a shared
interrupt would not match, nonetheless we should check for this.
Therefore, to fix this check if a mapping exists (regardless of whether
the domain is part of a hierarchy or not) and if it does then:
1. Return the IRQ number if the type settings match or are not
   specified.
2. Program the type settings and return the IRQ number if the type
   settings have not been programmed yet.
3. Otherwise if the type setting do not match, then print a warning
   and don't return the IRQ number.

Furthermore, add a warning if the type return by irq_domain_translate()
has bits outside the sense mask set and then clear these bits. If these
bits are not cleared then this will cause the comparision of the type
settings for an existing mapping to fail with that of the new mapping
even if the sense bit themselves match. The reason being is that the
existing type settings are read by calling irq_get_trigger_type() which
will clear any bits outside the sense mask. This will allow us to detect
irqchips that are not correctly clearing these bits and fix them.

Signed-off-by: Jon Hunter <jonathanh@nvidia.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2016-06-13 11:53:50 +01:00
Linus Torvalds
f5364c150a Merge branch 'stacking-fixes' (vfs stacking fixes from Jann)
Merge filesystem stacking fixes from Jann Horn.

* emailed patches from Jann Horn <jannh@google.com>:
  sched: panic on corrupted stack end
  ecryptfs: forbid opening files without mmap handler
  proc: prevent stacking filesystems on top
2016-06-10 12:10:02 -07:00
Jann Horn
29d6455178 sched: panic on corrupted stack end
Until now, hitting this BUG_ON caused a recursive oops (because oops
handling involves do_exit(), which calls into the scheduler, which in
turn raises an oops), which caused stuff below the stack to be
overwritten until a panic happened (e.g.  via an oops in interrupt
context, caused by the overwritten CPU index in the thread_info).

Just panic directly.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-10 12:09:43 -07:00
Linus Torvalds
60e383037b Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
 "Two scheduler debugging fixes"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/debug: Fix 'schedstats=enable' cmdline option
  sched/debug: Fix /proc/sched_debug regression
2016-06-10 11:24:39 -07:00
Linus Torvalds
7fcbc230c6 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
 "A handful of tooling fixes, two PMU driver fixes and a cleanup of
  redundant code that addresses a security analyzer false positive"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Remove a redundant check
  perf/x86/intel/uncore: Remove SBOX support for Broadwell server
  perf ctf: Convert invalid chars in a string before set value
  perf record: Fix crash when kptr is restricted
  perf symbols: Check kptr_restrict for root
  perf/x86/intel/rapl: Fix pmus free during cleanup
2016-06-10 11:15:41 -07:00
Linus Torvalds
02b07bde61 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:
 "Misc fixes:

   - a file-based futex fix
   - one more spin_unlock_wait() fix
   - a ww-mutex deadlock detection improvement/fix
   - and a raw_read_seqcount_latch() barrier fix"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  futex: Calculate the futex key based on a tail page for file-based futexes
  locking/qspinlock: Fix spin_unlock_wait() some more
  locking/ww_mutex: Report recursive ww_mutex locking early
  locking/seqcount: Re-fix raw_read_seqcount_latch()
2016-06-10 10:53:46 -07:00
Linus Torvalds
698ea54dde Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) nfnetlink timestamp taken from wrong skb, fix from Florian Westphal.

 2) Revert some msleep conversions in rtlwifi as these spots are in
    atomic context, from Larry Finger.

 3) Validate that NFTA_SET_TABLE attribute is actually specified when we
    call nf_tables_getset().  From Phil Turnbull.

 4) Don't do mdio_reset in stmmac driver with spinlock held as that can
    sleep, from Vincent Palatin.

 5) sk_filter() does things other than run a BPF filter, so we should
    not elide it's call just because sk->sk_filter is NULL.  Fix from
    Eric Dumazet.

 6) Fix missing backlog updates in several packet schedulers, from Cong
    Wang.

 7) bnx2x driver should allow VLAN add/remove while the interface is
    down, from Michal Schmidt.

 8) Several RDS/TCP race fixes from Sowmini Varadhan.

 9) fq_codel scheduler doesn't return correct queue length in dumps,
    from Eric Dumazet.

10) Fix TCP stats for tail loss probe and early retransmit in ipv6, from
    Yuchung Cheng.

11) Properly initialize udp_tunnel_socket_cfg in l2tp_tunnel_create(),
    from Guillaume Nault.

12) qfq scheduler leaks SKBs if a kzalloc fails, fix from Florian
    Westphal.

13) sock_fprog passed into PACKET_FANOUT_DATA needs compat handling,
    from Willem de Bruijn.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (85 commits)
  vmxnet3: segCnt can be 1 for LRO packets
  packet: compat support for sock_fprog
  stmmac: fix parameter to dwmac4_set_umac_addr()
  net/mlx5e: Fix blue flame quota logic
  net/mlx5e: Use ndo_stop explicitly at shutdown flow
  net/mlx5: E-Switch, always set mc_promisc for allmulti vports
  net/mlx5: E-Switch, Modify node guid on vf set MAC
  net/mlx5: E-Switch, Fix vport enable flow
  net/mlx5: E-Switch, Use the correct error check on returned pointers
  net/mlx5: E-Switch, Use the correct free() function
  net/mlx5: Fix E-Switch flow steering capabilities check
  net/mlx5: Fix flow steering NIC capabilities check
  net/mlx5: Fix root flow table update
  net/mlx5: Fix MLX5_CMD_OP_MAX to be defined correctly
  net/mlx5: Fix masking of reserved bits in XRCD number
  net/mlx5: Fix the size of modify QP mailbox
  mlxsw: spectrum: Don't sleep during ndo_get_phys_port_name()
  mlxsw: spectrum: Make split flow match firmware requirements
  wext: Fix 32 bit iwpriv compatibility issue with 64 bit Kernel
  cfg80211: remove get/set antenna and tx power warnings
  ...
2016-06-10 08:32:24 -07:00
Linus Torvalds
524a3f2ca2 Power management fixes for v4.7-rc3
- Fix two intel_pstate initialization issues, one of which was
    introduced during the 4.4 cycle (Srinivas Pandruvada).
 
  - Fix kernel build with CONFIG_UBSAN set and CONFIG_CPU_IDLE
    unset (Catalin Marinas).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJXWgWVAAoJEILEb/54YlRxG5EP/itHzuh6Sq706zRRAL2zLAiW
 9ZEgnH7Q0rbcaGTkRUe9BInJgS1lOxXlnzaQvx0nMN8omahYasl0FjX+5a+pwV6T
 wE2tuw2wby0XI0gDwsEvTHpy6etCis46JfH8TJfPSkJOog83ZYoWeeolE9jwXPpP
 Zgw8Eby/PvLr89tiDJr1/keChnQu0u2Ejz2hHEygc1Z5uxthVWfM8aEVaq8TnAPx
 TBExCvjZ3+KuEjQTQdSy2Bw8h8z0AO086NARQJ7JwCtSqIyo6WEPvJ5q+Wcrt/Uq
 P0C7Lpkcv6323sHm5hMCy+M3ES5Kws7cRFK3D2qjpTdtGr0/ibbHcrp2Vcw6b1Pr
 8yRzN/ZJ4eLnLZLqggi42FByCygo1uy9zcqcbBCwDtmseM55+NgZGtmGgRBlRRw0
 eOZwNmDHV++uUMF0EclBpIPG4mgUa7EdlJBFBgsRDeRMk+7VqKuw2Nl+F8d+gfvn
 a0uUMHLUoSA7eI8/pKwGbdiaMVSKQahhPXhZcJOmBdV7eLFX5rCKiRujhNg2XNgU
 GaVCRgwP7HRs5RmrQQByQS09BeCgK/sqCZcWZh1VrHQFfww/cxnFQTYJuapgnz2i
 y0n6B2foMF2B3dxMDioqDkNWl7V/fgO0YuKCAea5eFrQA9GZf1zM8sbEtaq2khZw
 xCa0iom5cIrBOCc3gvTv
 =qmBM
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "Stable-candidate fixes for the intel_pstate driver and the cpuidle
  core.

  Specifics:

   - Fix two intel_pstate initialization issues, one of which was
     introduced during the 4.4 cycle (Srinivas Pandruvada)

   - Fix kernel build with CONFIG_UBSAN set and CONFIG_CPU_IDLE unset
     (Catalin Marinas)"

* tag 'pm-4.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpufreq: intel_pstate: Fix ->set_policy() interface for no_turbo
  cpufreq: intel_pstate: Fix code ordering in intel_pstate_set_policy()
  cpuidle: Do not access cpuidle_devices when !CONFIG_CPU_IDLE
2016-06-10 08:09:12 -07:00
Weongyo Jeong
ff5b706f51 genirq: Remove unnecessary memset() calls
sprintf() and snprintf() implementation of kernel guarantees that
its result is terminated with null byte if size is larger than 0. So we
don't need to call memset() at all.

Signed-off-by: Weongyo Jeong <weongyo.linux@gmail.com>
Link: http://lkml.kernel.org/r/1459451703-5744-1-git-send-email-weongyo.linux@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-06-10 17:07:50 +02:00
Jianyu Zhan
fe3464ca87 genirq: Remove redundant NULL check of irq_desc
for_each_irq_desc() macro has already skipped NULL irq_desc, don't
bother to check it again.

Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Cc: mingo@kernel.org
Cc: yhlu.kernel@gmail.com
Link: http://lkml.kernel.org/r/1458395959-7046-1-git-send-email-nasa4836@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-06-10 17:07:49 +02:00
Pratyush Patel
86721ab63b hrtimer: Remove redundant #ifdef block
Only need CONFIG_NO_HZ_COMMON as this block is already in a
CONFIG_SMP block.

Signed-off-by: Pratyush Patel <pratyushpatel.1995@gmail.com>
Link: http://lkml.kernel.org/r/20160301172849.GA18152@cyborg
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-06-10 11:25:35 +02:00
Bjorn Helgaas
b5227d03b7 timers: Clarify usleep_range() function comment
Update the usleep_range() function comment to make it clear that it can
only be used in non-atomic context.

Previously we claimed usleep_range() was a drop-in replacement for udelay()
where wakeup is flexible.  But that's only true in non-atomic contexts,
where it's possible to sleep instead of delay.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/20160531212302.28502.44995.stgit@bhelgaas-glaptop2.roam.corp.google.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-06-10 00:59:14 +02:00
Rafael J. Wysocki
3681196ae5 Merge branches 'pm-cpufreq-fixes' and 'pm-cpuidle'
* pm-cpufreq-fixes:
  cpufreq: intel_pstate: Fix ->set_policy() interface for no_turbo
  cpufreq: intel_pstate: Fix code ordering in intel_pstate_set_policy()

* pm-cpuidle:
  cpuidle: Do not access cpuidle_devices when !CONFIG_CPU_IDLE
2016-06-09 23:49:16 +02:00
Zhouyi Zhou
ba62bafe94 kernel/relay.c: fix potential memory leak
When relay_open_buf() fails in relay_open(), code will goto free_bufs,
but chan is nowhere freed.

Link: http://lkml.kernel.org/r/1464777927-19675-1-git-send-email-yizhouzhou@ict.ac.cn
Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-09 14:23:11 -07:00
Mel Gorman
077fa7aed1 futex: Calculate the futex key based on a tail page for file-based futexes
Mike Galbraith reported that the LTP test case futex_wake04 was broken
by commit 65d8fc777f ("futex: Remove requirement for lock_page()
in get_futex_key()").

This test case uses futexes backed by hugetlbfs pages and so there is an
associated inode with a futex stored on such pages. The problem is that
the key is being calculated based on the head page index of the hugetlbfs
page and not the tail page.

Prior to the optimisation, the page lock was used to stabilise mappings and
pin the inode is file-backed which is overkill. If the page was a compound
page, the head page was automatically looked up as part of the page lock
operation but the tail page index was used to calculate the futex key.

After the optimisation, the compound head is looked up early and the page
lock is only relied upon to identify truncated pages, special pages or a
shmem page moving to swapcache. The head page is looked up because without
the page lock, special care has to be taken to pin the inode correctly.
However, the tail page is still required to calculate the futex key so
this patch records the tail page.

On vanilla 4.6, the output of the test case is;

futex_wake04    0  TINFO  :  Hugepagesize 2097152
futex_wake04    1  TFAIL  :  futex_wake04.c:126: Bug: wait_thread2 did not wake after 30 secs.

With the patch applied

futex_wake04    0  TINFO  :  Hugepagesize 2097152
futex_wake04    1  TPASS  :  Hi hydra, thread2 awake!

Fixes: 65d8fc777f "futex: Remove requirement for lock_page() in get_futex_key()"
Reported-and-tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20160608132522.GM2469@suse.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-06-08 19:23:54 +02:00
Peter Zijlstra
6428671bae locking/mutex: Optimize mutex_trylock() fast-path
A while back Viro posted a number of 'interesting' mutex_is_locked()
users on IRC, one of those was RCU.

RCU seems to use mutex_is_locked() to avoid doing mutex_trylock(), the
regular load before modify pattern.

While the use isn't wrong per se, its curious in that its needed at all,
mutex_trylock() should be good enough on its own to avoid the pointless
cacheline bounces.

So fix those and remove the mutex_is_locked() (ab)use from RCU.

Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paul McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hpe.com>
Link: http://lkml.kernel.org/r/20160601185815.GW3190@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-08 15:17:01 +02:00
Waiman Long
ddd0fa73c2 locking/rwsem: Streamline the rwsem_optimistic_spin() code
This patch moves the owner loading and checking code entirely inside of
rwsem_spin_on_owner() to simplify the logic of rwsem_optimistic_spin()
loop.

Suggested-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1463534783-38814-6-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-08 15:17:00 +02:00
Waiman Long
bf7b4c472d locking/rwsem: Improve reader wakeup code
In __rwsem_do_wake(), the reader wakeup code will assume a writer
has stolen the lock if the active reader/writer count is not 0.
However, this is not as reliable an indicator as the original
"< RWSEM_WAITING_BIAS" check. If another reader is present, the code
will still break out and exit even if the writer is gone. This patch
changes it to check the same "< RWSEM_WAITING_BIAS" condition to
reduce the chance of false positive.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1463534783-38814-5-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-08 15:17:00 +02:00
Waiman Long
fb6a44f33b locking/rwsem: Protect all writes to owner by WRITE_ONCE()
Without using WRITE_ONCE(), the compiler can potentially break a
write into multiple smaller ones (store tearing). So a read from the
same data by another task concurrently may return a partial result.
This can result in a kernel crash if the data is a memory address
that is being dereferenced.

This patch changes all write to rwsem->owner to use WRITE_ONCE()
to make sure that store tearing will not happen. READ_ONCE() may
not be needed for rwsem->owner as long as the value is only used for
comparison and not dereferencing.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1463534783-38814-3-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-08 15:16:59 +02:00
Waiman Long
19c5d690e4 locking/rwsem: Add reader-owned state to the owner field
Currently, it is not possible to determine for sure if a reader
owns a rwsem by looking at the content of the rwsem data structure.
This patch adds a new state RWSEM_READER_OWNED to the owner field
to indicate that readers currently own the lock. This enables us to
address the following 2 issues in the rwsem optimistic spinning code:

 1) rwsem_can_spin_on_owner() will disallow optimistic spinning if
    the owner field is NULL which can mean either the readers own
    the lock or the owning writer hasn't set the owner field yet.
    In the latter case, we miss the chance to do optimistic spinning.

 2) While a writer is waiting in the OSQ and a reader takes the lock,
    the writer will continue to spin when out of the OSQ in the main
    rwsem_optimistic_spin() loop as the owner field is NULL wasting
    CPU cycles if some of readers are sleeping.

Adding the new state will allow optimistic spinning to go forward as
long as the owner field is not RWSEM_READER_OWNED and the owner is
running, if set, but stop immediately when that state has been reached.

On a 4-socket Haswell machine running on a 4.6-rc1 based kernel, the
fio test with multithreaded randrw and randwrite tests on the same
file on a XFS partition on top of a NVDIMM were run, the aggregated
bandwidths before and after the patch were as follows:

  Test      BW before patch     BW after patch  % change
  ----      ---------------     --------------  --------
  randrw         988 MB/s          1192 MB/s      +21%
  randwrite     1513 MB/s          1623 MB/s      +7.3%

The perf profile of the rwsem_down_write_failed() function in randrw
before and after the patch were:

   19.95%  5.88%  fio  [kernel.vmlinux]  [k] rwsem_down_write_failed
   14.20%  1.52%  fio  [kernel.vmlinux]  [k] rwsem_down_write_failed

The actual CPU cycles spend in rwsem_down_write_failed() dropped from
5.88% to 1.52% after the patch.

The xfstests was also run and no regression was observed.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Jason Low <jason.low2@hp.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1463534783-38814-2-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-08 15:16:59 +02:00
Jason Low
8ee62b1870 locking/rwsem: Convert sem->count to 'atomic_long_t'
Convert the rwsem count variable to an atomic_long_t since we use it
as an atomic variable. This also allows us to remove the
rwsem_atomic_{add,update}() "abstraction" which would now be an unnecesary
level of indirection. In follow up patches, we also remove the
rwsem_atomic_{add,update}() definitions across the various architectures.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jason Low <jason.low2@hpe.com>
[ Build warning fixes on various architectures. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jason Low <jason.low2@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Terry Rudd <terry.rudd@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Waiman Long <Waiman.Long@hpe.com>
Link: http://lkml.kernel.org/r/1465017963-4839-2-git-send-email-jason.low2@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-08 15:16:42 +02:00
Peter Zijlstra
055ce0fd1b locking/qspinlock: Add comments
I figured we need to document the spin_is_locked() and
spin_unlock_wait() constraints somwehere.

Ideally 'someone' would rewrite Documentation/atomic_ops.txt and we
could find a place in there. But currently that document is stale to
the point of hardly being useful.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <waiman.long@hpe.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-08 14:44:01 +02:00
Peter Zijlstra
8d53fa1904 locking/qspinlock: Clarify xchg_tail() ordering
While going over the code I noticed that xchg_tail() is a RELEASE but
had no obvious pairing commented.

It pairs with a somewhat unique address dependency through
decode_tail().

So the store-release of xchg_tail() is paired by the address
dependency of the load of xchg_tail followed by the dereference from
the pointer computed from that load.

The @old -> @prev transformation itself is pure, and therefore does
not depend on external state, so that is immaterial wrt. ordering.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <waiman.long@hpe.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-08 14:44:00 +02:00
Ingo Molnar
ae0b5c2f03 Merge branch 'locking/urgent' into locking/core, to pick up dependency
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-08 14:35:29 +02:00
Josh Poimboeuf
03c041c5bf sched/debug: Always show 'nr_migrations'
The nr_migrations field is updated independently of CONFIG_SCHEDSTATS,
so it can be displayed regardless.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/5b1b04057ae2b14d73c2d03f56582c1d38cfe066.1464994423.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-08 14:34:49 +02:00
Ingo Molnar
067b4f9342 Merge branch 'sched/urgent' into sched/core, to pick up dependency
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-08 14:33:28 +02:00
Josh Poimboeuf
4698f88c06 sched/debug: Fix 'schedstats=enable' cmdline option
The 'schedstats=enable' option doesn't work, and also produces the
following warning during boot:

  WARNING: CPU: 0 PID: 0 at /home/jpoimboe/git/linux/kernel/jump_label.c:61 static_key_slow_inc+0x8c/0xa0
  static_key_slow_inc used before call to jump_label_init
  Modules linked in:
  CPU: 0 PID: 0 Comm: swapper Not tainted 4.7.0-rc1+ #25
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
   0000000000000086 3ae3475a4bea95d4 ffffffff81e03da8 ffffffff8143fc83
   ffffffff81e03df8 0000000000000000 ffffffff81e03de8 ffffffff810b1ffb
   0000003d00000096 ffffffff823514d0 ffff88007ff197c8 0000000000000000
  Call Trace:
   [<ffffffff8143fc83>] dump_stack+0x85/0xc2
   [<ffffffff810b1ffb>] __warn+0xcb/0xf0
   [<ffffffff810b207f>] warn_slowpath_fmt+0x5f/0x80
   [<ffffffff811e9c0c>] static_key_slow_inc+0x8c/0xa0
   [<ffffffff810e07c6>] static_key_enable+0x16/0x40
   [<ffffffff8216d633>] setup_schedstats+0x29/0x94
   [<ffffffff82148a05>] unknown_bootoption+0x89/0x191
   [<ffffffff810d8617>] parse_args+0x297/0x4b0
   [<ffffffff82148d61>] start_kernel+0x1d8/0x4a9
   [<ffffffff8214897c>] ? set_init_arg+0x55/0x55
   [<ffffffff82148120>] ? early_idt_handler_array+0x120/0x120
   [<ffffffff821482db>] x86_64_start_reservations+0x2f/0x31
   [<ffffffff82148427>] x86_64_start_kernel+0x14a/0x16d

The problem is that it tries to update the 'sched_schedstats' static key
before jump labels have been initialized.

Changing jump_label_init() to be called earlier before
parse_early_param() wouldn't fix it: it would still fail trying to
poke_text() because mm isn't yet initialized.

Instead, just create a temporary '__sched_schedstats' variable which can
be copied to the static key later during sched_init() after jump labels
have been initialized.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: cb2517653f ("sched/debug: Make schedstats a runtime tunable that is disabled by default")
Link: http://lkml.kernel.org/r/453775fe3433bed65731a583e228ccea806d18cd.1465322027.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-08 14:33:05 +02:00
Josh Poimboeuf
9c57259117 sched/debug: Fix /proc/sched_debug regression
Commit:

  cb2517653f ("sched/debug: Make schedstats a runtime tunable that is disabled by default")

... introduced a bug when CONFIG_SCHEDSTATS is enabled and the
runtime tunable is disabled (which is the default).

The wait-time, sum-exec, and sum-sleep fields are missing from the
/proc/sched_debug file in the runnable_tasks section.

Fix it with a new schedstat_val() macro which returns the field value
when schedstats is enabled and zero otherwise.  The macro works with
both SCHEDSTATS and !SCHEDSTATS.  I put the macro in stats.h since it
might end up being useful in other places.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: cb2517653f ("sched/debug: Make schedstats a runtime tunable that is disabled by default")
Link: http://lkml.kernel.org/r/bcda7c2790cf2ccbe586a28c02dd7b6fe7749a2b.1464994423.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-08 14:31:58 +02:00
Alexander Shishkin
62a92c8f55 perf/core: Remove a redundant check
There is no way to end up in _free_event() with event::pmu being NULL.
The latter is initialized in event allocation path and remains set
forever. In case of allocation failure, the error path doesn't use
_free_event().

Having the check, however, suggests that it is possible to have a
event::pmu==NULL situation in _free_event() and confuses the robots.

This patch gets rid of the check.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: eranian@google.com
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/1465303455-26032-1-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-08 14:30:01 +02:00
Peter Zijlstra
2c61002271 locking/qspinlock: Fix spin_unlock_wait() some more
While this prior commit:

  54cf809b95 ("locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()")

... fixes spin_is_locked() and spin_unlock_wait() for the usage
in ipc/sem and netfilter, it does not in fact work right for the
usage in task_work and futex.

So while the 2 locks crossed problem:

	spin_lock(A)		spin_lock(B)
	if (!spin_is_locked(B)) spin_unlock_wait(A)
	  foo()			foo();

... works with the smp_mb() injected by both spin_is_locked() and
spin_unlock_wait(), this is not sufficient for:

	flag = 1;
	smp_mb();		spin_lock()
	spin_unlock_wait()	if (!flag)
				  // add to lockless list
	// iterate lockless list

... because in this scenario, the store from spin_lock() can be delayed
past the load of flag, uncrossing the variables and loosing the
guarantee.

This patch reworks spin_is_locked() and spin_unlock_wait() to work in
both cases by exploiting the observation that while the lock byte
store can be delayed, the contender must have registered itself
visibly in other state contained in the word.

It also allows for architectures to override both functions, as PPC
and ARM64 have an additional issue for which we currently have no
generic solution.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Giovanni Gherdovich <ggherdovich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <waiman.long@hpe.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: stable@vger.kernel.org # v4.2 and later
Fixes: 54cf809b95 ("locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-08 14:29:08 +02:00
Sebastian Andrzej Siewior
a461d58792 locking/rtmutex: Only warn once on a trylock from bad context
One warning should be enough to get one motivated to fix this. It is
possible that this happens more than once and that starts flooding the
output. Later the prints will be suppressed so we only get half of it.
Depending on the console system used it might not be helpful.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1464356838-1755-1-git-send-email-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-08 14:22:00 +02:00
Peter Zijlstra
dfaaf3fa01 locking/lockdep: Use __jhash_mix() for iterate_chain_key()
Use __jhash_mix() to mix the class_idx into the class_key. This
function provides better mixing than the previously used, home grown
mix function.

Leave hashing to the professionals :-)

Suggested-by: George Spelvin <linux@sciencehorizons.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-08 14:22:00 +02:00
Ingo Molnar
616d1c1b98 Merge branch 'linus' into perf/core, to refresh the branch
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-08 09:26:46 +02:00
David Carrillo-Cisneros
a4f144ebbd perf/core: Fix crash due to account/unaccount_sb_event() inconsistency
unaccount_pmu_sb_event() did not check for attributes in event->attr
before calling detach_sb_event(), while account_pmu_event() did.

This caused NULL pointer reference in cgroup events that did not
have any of the attributes checked by account_pmu_event().

To trigger the bug just wait for a cgroup event to terminate, e.g.:

  $ mkdir /dev/cgroup/devices/test
  $ perf stat -e cycles -a -G test sleep 0

... see crash ...

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
Reviewed-by: Stephane Eranian <eranian@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zheng <zheng.z.yan@intel.com>
Link: http://lkml.kernel.org/r/1464809585-66072-1-git-send-email-davidcc@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-08 09:18:45 +02:00
Daniel Borkmann
5b6c1b4d46 bpf, trace: use READ_ONCE for retrieving file ptr
In bpf_perf_event_read() and bpf_perf_event_output(), we must use
READ_ONCE() for fetching the struct file pointer, which could get
updated concurrently, so we must prevent the compiler from potential
refetching.

We already do this with tail calls for fetching the related bpf_prog,
but not so on stored perf events. Semantics for both are the same
with regards to updates.

Fixes: a43eec3042 ("bpf: introduce bpf_perf_event_output() helper")
Fixes: 35578d7984 ("bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-07 14:48:03 -07:00
Linus Torvalds
8c52b6dcdd Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
 - a few simple fixes for fallout from the recent gic-v3 changes
 - a workaround for a Cavium thunderX erratum
 - a bugfix for the pic32 irqchip to make external interrupts work proper
 - a missing return value in the generic IPI management code

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/irq-pic32-evic: Fix bug with external interrupts.
  irqchip/gicv3-its: numa: Enable workaround for Cavium thunderx erratum 23144
  irqchip/gic-v3: Fix quiescence check in gic_enable_redist
  irqchip/gic-v3: Fix copy+paste mistakes in defines
  irqchip/gic-v3: Fix ICC_SGI1R_EL1.INTID decoding mask
  genirq: Fix missing return value in irq_destroy_ipi()
2016-06-03 16:12:35 -07:00
Thomas Gleixner
2eec3707a3 irqchip updates for 4.7-rc1:
- A number of embarassing buglets (GICv3, PIC32)
 - A more substential errata workaround for Cavium's GICv3 ITS
   (kept for post-rc1 due to its dependency on NUMA)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXUXcjAAoJECPQ0LrRPXpDGFcQAICkj5cIsQghW2dgR0eo2d7S
 +ieyxr55tz3A0c1Cisw09ESHz6wCrQ+PmvXkKISIG1l2AHv9UOUZVrsB1cl2CvBg
 C9ClvUyMafiZ+FlhxMO1QM1Vfa3EUV2EPIx3mh5klUp8ph/cT+aArJe+WmJApS25
 nlYiobi2AE0+m2V5ekikMtVbM5xXWKHPRgzYqZUlNBV74k/FGgRlBk/bw1AWqnsd
 TfUF+QZpEd/4GPglbQLvwJwjQg+qanl59CJqi403U00emLuvRqdqTeMoqPEfw5id
 MgVPMtUF+N7fgtygo20oPFBriFBHFUTj+c5Oafd4ahgp6eU02HYX8A7w/jj84tTP
 cPa9bcoyKyec5vpO1mbU2a/VzqXPDNL17Dg9tRaf0NpksMeLvBh14jXWp5v8vEqU
 Qm4mFlmEYKivWTJhz6pGJmxFX/X5vMa2wrFY7xvOVYby5mSuEGD7+puuKuVNNBEa
 THaElOYM4ogTrUBM39dzfzXxSEQN/bcLHXNd2IDuUUK49NvNFjnke3PvLxesiDuT
 Pxk4mO912+/Ldk7K1LVVVWltVtOHFNG+I7a3R75gwftHXMYONuOpEiflA4QorxVk
 9Rq9yUI4h/69V5fIthBN8BUYU5hxaTtLi0DI1fWSweugZb5PUXiagKnXKaIOLQ/4
 A3pvoYEHdynDVO7nJ+Kd
 =E+j+
 -----END PGP SIGNATURE-----

Merge tag 'irqchip-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/urgent

Merge irqchip updates from Marc Zyngier:

- A number of embarassing buglets (GICv3, PIC32)
- A more substential errata workaround for Cavium's GICv3 ITS
  (kept for post-rc1 due to its dependency on NUMA)
2016-06-03 15:05:51 +02:00
Jason Low
6e2814745c locking/mutex: Set and clear owner using WRITE_ONCE()
The mutex owner can get read and written to locklessly.
Use WRITE_ONCE when setting and clearing the owner field
in order to avoid optimizations such as store tearing. This
avoids situations where the owner field gets written to with
multiple stores and another thread could concurrently read
and use a partially written owner value.

Signed-off-by: Jason Low <jason.low2@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Waiman Long <Waiman.Long@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Terry Rudd <terry.rudd@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1463782776.2479.9.camel@j-VirtualBox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-03 12:06:10 +02:00
Jason Low
c0fcb6c2d3 locking/rwsem: Optimize write lock by reducing operations in slowpath
When acquiring the rwsem write lock in the slowpath, we first try
to set count to RWSEM_WAITING_BIAS. When that is successful,
we then atomically add the RWSEM_WAITING_BIAS in cases where
there are other tasks on the wait list. This causes write lock
operations to often issue multiple atomic operations.

We can instead make the list_is_singular() check first, and then
set the count accordingly, so that we issue at most 1 atomic
operation when acquiring the write lock and reduce unnecessary
cacheline contention.

Signed-off-by: Jason Low <jason.low2@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long<Waiman.Long@hpe.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Jason Low <jason.low2@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Terry Rudd <terry.rudd@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/1463445486-16078-2-git-send-email-jason.low2@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-03 09:47:13 +02:00
Davidlohr Bueso
e38513905e locking/rwsem: Rework zeroing reader waiter->task
Readers that are awoken will expect a nil ->task indicating
that a wakeup has occurred. Because of the way readers are
implemented, there's a small chance that the waiter will never
block in the slowpath (rwsem_down_read_failed), and therefore
requires some form of reference counting to avoid the following
scenario:

rwsem_down_read_failed()		rwsem_wake()
  get_task_struct();
  spin_lock_irq(&wait_lock);
  list_add_tail(&waiter.list)
  spin_unlock_irq(&wait_lock);
					  raw_spin_lock_irqsave(&wait_lock)
					  __rwsem_do_wake()
  while (1) {
    set_task_state(TASK_UNINTERRUPTIBLE);
					    waiter->task = NULL
    if (!waiter.task) // true
      break;
    schedule() // never reached

   __set_task_state(TASK_RUNNING);
 do_exit();
					    wake_up_process(tsk); // boom

... and therefore race with do_exit() when the caller returns.

There is also a mismatch between the smp_mb() and its documentation,
in that the serialization is done between reading the task and the
nil store. Furthermore, in addition to having the overlapping of
loads and stores to waiter->task guaranteed to be ordered within
that CPU, both wake_up_process() originally and now wake_q_add()
already imply barriers upon successful calls, which serves the
comment.

Now, as an alternative to perhaps inverting the checks in the blocker
side (which has its own penalty in that schedule is unavoidable),
with lockless wakeups this situation is naturally addressed and we
can just use the refcount held by wake_q_add(), instead doing so
explicitly. Of course, we must guarantee that the nil store is done
as the _last_ operation in that the task must already be marked for
deletion to not fall into the race above. Spurious wakeups are also
handled transparently in that the task's reference is only removed
when wake_up_q() is actually called _after_ the nil store.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman.Long@hpe.com
Cc: dave@stgolabs.net
Cc: jason.low2@hp.com
Cc: peter@hurleysoftware.com
Link: http://lkml.kernel.org/r/1463165787-25937-3-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-03 09:47:12 +02:00
Davidlohr Bueso
133e89ef5e locking/rwsem: Enable lockless waiter wakeup(s)
As wake_qs gain users, we can teach rwsems about them such that
waiters can be awoken without the wait_lock. This is for both
readers and writer, the former being the most ideal candidate
as we can batch the wakeups shortening the critical region that
much more -- ie writer task blocking a bunch of tasks waiting to
service page-faults (mmap_sem readers).

In general applying wake_qs to rwsem (xadd) is not difficult as
the wait_lock is intended to be released soon _anyways_, with
the exception of when a writer slowpath will proactively wakeup
any queued readers if it sees that the lock is owned by a reader,
in which we simply do the wakeups with the lock held (see comment
in __rwsem_down_write_failed_common()).

Similar to other locking primitives, delaying the waiter being
awoken does allow, at least in theory, the lock to be stolen in
the case of writers, however no harm was seen in this (in fact
lock stealing tends to be a _good_ thing in most workloads), and
this is a tiny window anyways.

Some page-fault (pft) and mmap_sem intensive benchmarks show some
pretty constant reduction in systime (by up to ~8 and ~10%) on a
2-socket, 12 core AMD box. In addition, on an 8-core Westmere doing
page allocations (page_test)

aim9:
	 4.6-rc6				4.6-rc6
						rwsemv2
Min      page_test   378167.89 (  0.00%)   382613.33 (  1.18%)
Min      exec_test      499.00 (  0.00%)      502.67 (  0.74%)
Min      fork_test     3395.47 (  0.00%)     3537.64 (  4.19%)
Hmean    page_test   395433.06 (  0.00%)   414693.68 (  4.87%)
Hmean    exec_test      499.67 (  0.00%)      505.30 (  1.13%)
Hmean    fork_test     3504.22 (  0.00%)     3594.95 (  2.59%)
Stddev   page_test    17426.57 (  0.00%)    26649.92 (-52.93%)
Stddev   exec_test        0.47 (  0.00%)        1.41 (-199.05%)
Stddev   fork_test       63.74 (  0.00%)       32.59 ( 48.86%)
Max      page_test   429873.33 (  0.00%)   456960.00 (  6.30%)
Max      exec_test      500.33 (  0.00%)      507.66 (  1.47%)
Max      fork_test     3653.33 (  0.00%)     3650.90 ( -0.07%)

	     4.6-rc6     4.6-rc6
			 rwsemv2
User            1.12        0.04
System          0.23        0.04
Elapsed       727.27      721.98

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman.Long@hpe.com
Cc: dave@stgolabs.net
Cc: jason.low2@hp.com
Cc: peter@hurleysoftware.com
Link: http://lkml.kernel.org/r/1463165787-25937-2-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-03 09:47:10 +02:00
Vineet Gupta
a1396555ab perf/abi: Change the errno for sampling event not supported in hardware
Change the return code for sampling event not supported from -ENOTSUPP
to -EOPNOTSUPP.

This allows userspace to identify this case specifically, instead of
printing the catch-all error message it did previously.

Technically this is an ABI change, but we think we can get away
with it.

Old behavior:
 -------
 | # perf record ls
 | Error:
 | The sys_perf_event_open() syscall returned with 524 (Unknown error 524)
 | for event (cycles:ppp).
 | /bin/dmesg may provide additional information.
 | No CONFIG_PERF_EVENTS=y kernel support configured?

New behavior:
 -------
 | # perf record ls
 | Error:
 | PMU Hardware doesn't support sampling/overflow-interrupts.

Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <acme@redhat.com>
Cc: <linux-snps-arc@lists.infradead.org>
Cc: <vincent.weaver@maine.edu>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com>
Link: http://lkml.kernel.org/r/1462786660-2900-3-git-send-email-vgupta@synopsys.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-03 09:40:42 +02:00
Kan Liang
ab7fdefba6 perf/core: Fix implicitly enable dynamic interrupt throttle
This patch fixes an issue which was introduced by commit:

  91a612eea9 ("perf/core: Fix dynamic interrupt throttle")

... which commit unconditionally sets the perf_sample_allowed_ns value
to !0. But that could trigger a bug in the following corner case:

The user can disable the dynamic interrupt throttle mechanism by setting
perf_cpu_time_max_percent to 0. Then they change perf_event_max_sample_rate.
For this case, the mechanism will be enabled implicitly, because
perf_sample_allowed_ns becomes !0 - which is not what we want.

This patch only updates perf_sample_allowed_ns when the dynamic
interrupt throttle mechanism is enabled.

Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Link: http://lkml.kernel.org/r/1462260366-3160-1-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-03 09:40:16 +02:00
Peter Zijlstra
aab5b71ef2 perf/core: Rename the perf_event_aux*() APIs to perf_event_sb*(), to separate them from AUX ring-buffer records
There are now two different things called AUX in perf, the
infrastructure to deliver the mmap/comm/task records and the
AUX part in the mmap buffer (with associated AUX_RECORD).

Since the former is internal, rename it to side-band to reduce
the confusion factor.

No change in functionality.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-03 09:40:15 +02:00
Kan Liang
f2fb6bef92 perf/core: Optimize side-band event delivery
The perf_event_aux() function iterates all PMUs and all events in
their respective per-CPU contexts to find the events to deliver
side-band records to.

For example, the brk test case in lkp triggers many mmap() operations,
which, if we're also running perf, results in many perf_event_aux()
invocations.

If we enable uncore PMU support (even when uncore events are not used),
dozens of uncore PMUs will be iterated, which can significantly
decrease brk_test's throughput.

For example, the brk throughput:

  without uncore PMUs: 2647573 ops_per_sec
  with    uncore PMUs: 1768444 ops_per_sec

... a 33% reduction.

To get at the per-CPU events that need side-band records, this patch
puts these events on a per-CPU list, this avoids iterating the PMUs
and any events that do not need side-band records.

Per task events are unchanged to avoid extra overhead on the context
switch paths.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reported-by: Huang, Ying <ying.huang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1458757477-3781-1-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-03 09:40:15 +02:00
Oleg Nesterov
bac7857319 sched/fair: Use task_rcu_dereference()
Simplify task_numa_compare()'s task reference magic by using
task_rcu_dereference().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Kirill Tkhai <ktkhai@parallels.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Link: http://lkml.kernel.org/r/20160518195733.GA15914@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-03 09:18:58 +02:00
Oleg Nesterov
150593bf86 sched/api: Introduce task_rcu_dereference() and try_get_task_struct()
Generally task_struct is only protected by RCU if it was found on a
RCU protected list (say, for_each_process() or find_task_by_vpid()).

As Kirill pointed out rq->curr isn't protected by RCU, the scheduler
drops the (potentially) last reference without RCU gp, this means
that we need to fix the code which uses foreign_rq->curr under
rcu_read_lock().

Add a new helper which can be used to dereference rq->curr or any
other pointer to task_struct assuming that it should be cleared or
updated before the final put_task_struct(). It returns non-NULL
only if this task can't go away before rcu_read_unlock().

( Also add try_get_task_struct() to make it easier to use this API
  correctly. )

Suggested-by: Kirill Tkhai <ktkhai@parallels.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[ Updated comments; added try_get_task_struct()]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@ezchip.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Kirill Tkhai <tkhai@yandex.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Link: http://lkml.kernel.org/r/20160518170218.GY3192@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-03 09:18:57 +02:00
Gaurav Jindal (Gaurav Jindal)
df55f462b9 sched/idle: Optimize the generic idle loop
Currently, smp_processor_id() is used to fetch the current CPU in
cpu_idle_loop(). Every time the idle thread runs, it fetches the
current CPU using smp_processor_id().

Since the idle thread is per CPU, the current CPU is constant, so we
can lift the load out of the loop, saving execution cycles/time in the
loop.

x86-64:

Before patch (execution in loop):
	148:    0f ae e8                lfence
	14b:    65 8b 04 25 00 00 00 00 mov %gs:0x0,%eax
	152:    00
	153:    89 c0                   mov %eax,%eax
	155:    49 0f a3 04 24          bt %rax,(%r12)

After patch (execution in loop):
	150:    0f ae e8                lfence
	153:    4d 0f a3 34 24          bt %r14,(%r12)

ARM64:

Before patch (execution in loop):
	168:    d5033d9f        dsb     ld
	16c:    b9405661        ldr     w1,[x19,#84]
	170:    1100fc20        add     w0,w1,#0x3f
	174:    6b1f003f        cmp     w1,wzr
	178:    1a81b000        csel    w0,w0,w1,lt
	17c:    130c7000        asr     w0,w0,#6
	180:    937d7c00        sbfiz   x0,x0,#3,#32
	184:    f8606aa0        ldr     x0,[x21,x0]
	188:    9ac12401        lsr     x1,x0,x1
	18c:    36000e61        tbz     w1,#0,358

After patch (execution in loop):
	1a8:    d50339df        dsb     ld
	1ac:    f8776ac0        ldr     x0,[x22,x23]
	ab0:    ea18001f        tst     x0,x24
	1b4:    54000ea0        b.eq    388

Further observance on ARM64 for 4 seconds shows that cpu_idle_loop is
called 8672 times. Shifting the code will save instructions executed
in loop and eventually time as well.

Signed-off-by: Gaurav Jindal <gaurav.jindal@spreadtrum.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Sanjeev Yadav <sanjeev.yadav@spreadtrum.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160512101330.GA488@gauravjindalubtnb.del.spreadtrum.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-03 09:18:56 +02:00
Xunlei Pang
1a99ae3f00 sched/fair: Fix the wrong throttled clock time for cfs_rq_clock_task()
Two minor fixes for cfs_rq_clock_task():

 1) If cfs_rq is currently being throttled, we need to subtract the cfs
    throttled clock time.

 2) Make "throttled_clock_task_time" update SMP unrelated. Now UP cases
    need it as well.

Signed-off-by: Xunlei Pang <xlpang@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1462885398-14724-1-git-send-email-xlpang@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-03 09:18:56 +02:00
Chris Wilson
0422e83d84 locking/ww_mutex: Report recursive ww_mutex locking early
Recursive locking for ww_mutexes was originally conceived as an
exception. However, it is heavily used by the DRM atomic modesetting
code. Currently, the recursive deadlock is checked after we have queued
up for a busy-spin and as we never release the lock, we spin until
kicked, whereupon the deadlock is discovered and reported.

A simple solution for the now common problem is to move the recursive
deadlock discovery to the first action when taking the ww_mutex.

Suggested-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1464293297-19777-1-git-send-email-chris@chris-wilson.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-03 08:37:26 +02:00