This patch gives mirror the ability to handle device failures
during normal write operations.
The 'write_callback' function is called when a write completes.
If all the writes failed or succeeded, we report failure or
success respectively. If some of the writes failed, we call
fail_mirror; which increments the error count for the device, notes
the type of error encountered (DM_RAID1_WRITE_ERROR), and
selects a new primary (if necessary). Note that the primary
device can never change while the mirror is not in-sync (IOW,
while recovery is happening.) This means that the scenario
where a failed write changes the primary and gives
recovery_complete a chance to misread the primary never happens.
The fact that the primary can change has necessitated the change
to the default_mirror field. We need to protect against reading
garbage while the primary changes. We then add the bio to a new
list in the mirror set, 'failures'. For every bio in the 'failures'
list, we call a new function, '__bio_mark_nosync', where we mark
the region 'not-in-sync' in the log and properly set the region
state as, RH_NOSYNC. Userspace must also be notified of the
failure. This is done by 'raising an event' (dm_table_event()).
If fail_mirror is called in process context the event can be raised
right away. If in interrupt context, the event is deferred to the
kmirrord thread - which raises the event if 'event_waiting' is set.
Backwards compatibility is maintained by ignoring errors if
the DM_FEATURES_HANDLE_ERRORS flag is not present.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Provided sector_t is 64 bits, reduce the in-memory footprint of the
snapshot exception table by the simple method of using unused bits of
the chunk number to combine consecutive entries.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch adds additional information to the status line. It is added at the
end of the returned text so it will not interfere with existing
implementations using this data. The addition of this information will allow
for a common return interface to match that returned with the dm-raid1.c
status line (with Jonathan Brassow's patches).
Here is a sample of what is returned with a mirror "status" call:
isw_eeaaabgfg_mirror: 0 488390920 mirror 2 8:16 8:32 3727/3727 1 AA 1 core
Here's what's returned with this patch for a stripe "status" call:
isw_dheeijjdej_stripe: 0 976783872 striped 2 8:16 8:32 1 AA
Signed-off-by: Brian Wood <brian.j.wood@intel.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch adds the stripe_end_io function to process errors that might
occur after an IO operation. As part of this there are a number of
enhancements made to record and trigger events:
- New atomic variable in struct stripe to record the number of
errors each stripe volume device has experienced (could be used
later with uevents to report back directly to userspace)
- New workqueue/work struct setup to process the trigger_event function
- New end_io function. It is here that testing for BIO error conditions
take place. It determines the exact stripe that cause the error,
records this in the new atomic variable, and calls the queue_work() function
- New trigger_event function to process failure events. This
calls dm_table_event()
Signed-off-by: Brian Wood <brian.j.wood@intel.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
If the log type is not recognised, attempt to load the module
'dm-log-<type>.ko'.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add a single-thread workqueue for each mapped device
and move flushing of the lists of pushback and deferred bios
to this new workqueue.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
dm-crypt: Use crypto ablkcipher interface
Move encrypt/decrypt core to async crypto call.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
dm-crypt: Use crypto ablkcipher interface
Prepare callback function for async crypto operation.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
dm-crypt: Use crypto ablkcipher interface
Prepare completion for async crypto request.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
dm-crypt: Use crypto ablkcipher interface
Introduce mempool for async crypto requests.
cc->req is used mainly during synchronous operations
(to prevent allocation and deallocation of the same object).
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
dm-crypt: Use crypto ablkcipher interface
Move scatterlists to separate dm_crypt_struct and
pick out block processing from crypt_convert.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Process write request in separate function and queue
final bio through io workqueue.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add sector into dm_crypt_io instead of using local variable.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move error code setting outside of crypt_dec_pending function.
Use -EIO if crypt_convert_scatterlist() fails.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove write attribute from convert_context and use bio flag instead.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Drop the EXPERIMENTAL tag from well-established device-mapper targets, so
the newer ones stand out better.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Change io_locking to allow processing flush in separate thread.
Because we have DMF_BLOCK_IO already set, any possible
new ios are queued in dm_requests now.
In the case of interrupting previous wait there can be more
ios queued (we unlocked io_lock for a while) but this is safe.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Tidy dm_suspend function
- change return value logic in dm_suspend
- use atomic_read only once.
- move DMF_BLOCK_IO clearing into one place
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Refactor deferred bio_list processing.
- use separate _merge_pushback_list function
- move deferred bio list pick up to flush function
- use bio_list_pop instead of bio_list_get
- simplify noflush flag use
No real functional change in this patch.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Tidy labels in alloc_dev to make later patches more clear.
No functional change in this patch.
Signed-off-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
drivers/md/dm-ioctl.c:1405: warning: 'param' may be used uninitialized in this function
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
drivers/md/dm-table.c: In function 'dm_get_device':
drivers/md/dm-table.c:478: warning: 'dev' may be used uninitialized in this function
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
drivers/md/dm-exception-store.c: In function 'persistent_read_metadata':
drivers/md/dm-exception-store.c:452: warning: 'new_snapshot' may be used uninitialized in this function
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Since the source file already includes the log2.h header file, it
seems pointless to re-invent the necessary routine.
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch is some minor janitorish cleanup, using some macros
from linux/list.h (already #included via dm.h) to improve
readability.
Signed-off-by: Paul Jimenez <pj@place.org>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove lock_kernel() from the device-mapper ioctls - there should
be sufficient internal locking already where required.
Also remove some superfluous casts.
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
raid5's 'make_request' function calls generic_make_request on underlying
devices and if we run out of stripe heads, it could end up waiting for one of
those requests to complete. This is bad as recursive calls to
generic_make_request go on a queue and are not even attempted until
make_request completes.
So: don't make any generic_make_request calls in raid5 make_request until all
waiting has been done. We do this by simply setting STRIPE_HANDLE instead of
calling handle_stripe().
If we need more stripe_heads, raid5d will get called to process the pending
stripe_heads which will call generic_make_request from a
This change by itself causes a performance hit. So add a change so that
raid5_activate_delayed is only called at unplug time, never in raid5. This
seems to bring back the performance numbers. Calling it in raid5d was
sometimes too soon...
Neil said:
How about we queue it for 2.6.25-rc1 and then about when -rc2 comes out,
we queue it for 2.6.24.y?
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Neil Brown <neilb@suse.de>
Tested-by: dean gaudet <dean@arctic.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Finish ITERATE_ to for_each conversion.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As this is more in line with common practice in the kernel. Also swap the
args around to be more like list_for_each.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As this is more consistent with kernel style.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As suggested by Andrew Morton.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Due to possible deadlock issues we need to use a schedule work to kobject_del
an 'rdev' object from a different thread.
A recent change means that kobject_add no longer gets a refernce, and
kobject_del doesn't put a reference. Consequently, we need to explicitly hold
a reference to ensure that the last reference isn't dropped before the
scheduled work get a chance to call kobject_del.
Also, rename delayed_delete to md_delayed_delete to that it is more obvious in
a stack trace which code is to blame.
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, a given device is "claimed" by a particular array so that it cannot
be used by other arrays.
This is not ideal for DDF and other metadata schemes which have their own
partitioning concept.
So for externally managed metadata, just claim the device for md in general,
require that "offset" and "size" are set properly for each device, and make
sure that if a device is included in different arrays then the active sections
do not overlap.
This involves adding another flag to the rdev which makes it awkward to set
"->flags = 0" to clear certain flags. So now clear flags explicitly by name
when we want to clear things.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If you try to start an array for which the number of raid disks is listed as
zero, md will currently try to read metadata off any devices that have been
given. This was done because the value of raid_disks is used to signal
whether array details have been provided by userspace (raid_disks > 0) or must
be read from the devices (raid_disks == 0).
However for an array without persistent metadata (or with externally managed
metadata) this is the wrong thing to do. So we add a test in do_md_run to
give an error if raid_disks is zero for non-persistent arrays.
This requires that mddev->persistent is set corrently at this point, which it
currently isn't for in-kernel autodetected arrays.
So set ->persistent for autodetect arrays, and remove the settign in
super_*_validate which is now redundant.
Also clear ->persistent when stopping an array so it is consistently zero when
starting an array.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This allows userspace to control resync/reshape progress and synchronise it
with other activities, such as shared access in a SAN, or backing up critical
sections during a tricky reshape.
Writing a number of sectors (which must be a multiple of the chunk size if
such is meaningful) causes a resync to pause when it gets to that point.
Signed-off-by: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>