Disclaimer:
The original version of this article was first published on IBM
developerWorks, and is property of Westtech Information Services. This
document is an updated version of the original article, and contains
various improvements made by the Gentoo Linux Documentation team.
This document is not actively maintained.
Software RAID in the new Linux 2.4 kernel, Part 2
1. Setting up RAID-1 in a production environment
Real-world RAID
In my previous article, I introduced you to Linux 2.4's software RAID
functionality, showing you how to set up linear, RAID-0, and RAID-1
volumes. In this article, we look at what you need to know in order to
use RAID-1 to increase availability in a production environment. This
requires a lot more understanding and knowledge than just setting up
RAID-1 on a test server or at home -- specifically, you'll need to
know exactly what RAID-1 will protect you against, and how to keep
your RAID volume up and running in case of a disk failure. In this
article, we'll cover these topics, starting with an overview of what
RAID-1, 4, and 5 can and can't do for you, and ending with a complete
simulation of replacing a failed RAID-1 drive -- something that
you should actually do (with this article as your guide) if at all
possible. After going through the simulation, you'll have all the
experience you need to handle a RAID-1 failure in a real-world
environment.
What RAID doesn't do
The fault-tolerant features of RAID are designed to protect you from
the negative impacts of a spontaneous complete drive failure. That's
a good thing. But RAID isn't a perfect fix for every kind of
reliability problem. Before implementing a fault-tolerant form of RAID
(1, 4, 5) in a production environment, it's extremely important that you
know exactly what RAID will and will not do for you. When we're
in a situation where we're depending on RAID to perform, we don't want
to make any false assumptions about what it does. Let's start by
dispelling common myths about RAID 1, 4, and 5.
A lot of people think that if they place all their important data on a
RAID 1/4/5 volume, then they won't have to perform regular backups.
This is completely false -- here's why. RAID 1/4/5 helps to protect
against unplanned downtime caused by a random drive failure.
However, it offers no protection against accidental or malicious
data corruption. If you type "cd /; rm -rf *" as root on a
RAID volume, you'll lose a lot of very important data in a matter of
seconds, and the fact that you have a 10-drive RAID-5 configuration will
be of little significance. Also, RAID won't help you if your server is
physically stolen or if there's a fire in your building. And of course,
if you don't implement a backup strategy, you won't have an archive of
past data -- if someone in your office deletes a bunch of important
files, you won't be able to recover them. That alone should be enough
to convince you that, in most circumstances, you should plan and
implement a backup strategy before even thinking about tackling
RAID-1, 4, or 5.
Another mistake is to implement software RAID on a system composed of
low-quality hardware. If you're putting together a server that's going
to do something important, it makes sense to purchase the
highest-quality hardware that's still comfortably within your budget.
If your system is unstable or improperly cooled, you'll run into
problems that RAID can't solve. On a similar note, RAID obviously can't
give you any additional uptime in the case of a power outage. If your
server is going to be doing anything relatively important, make sure
that it's been equipped with an uninterruptible power supply (UPS).
Next, we move on to filesystem issues. The filesystem exists "on top"
of your software RAID volume. This means that using software RAID does
not allow you to escape filesystem issues, such as long and potentially
problematic fscks if you happen to be using a non-journalled or
flaky filesystem. So, software RAID isn't going to make the ext2
filesystem more reliable; that's why it's so important that the Linux
community has ReiserFS, as well as JFS and XFS in the works. Software
RAID and a reliable journalling filesystem make a great combination.
RAID - intelligent implementation
Hopefully, the previous section dispelled any RAID myths that you might
have had. When you implement RAID-1, 4, or 5, it's very important that
you view the technology as something that will enhance uptime.
When you implement one of these RAID levels, you're protecting yourself
against a very specific situation -- a spontaneous complete (single or
multiple) drive failure. If you experience this situation, software
RAID will allow the system to continue running, while you make
arrangements to replace the failed drive with a new one. In other words,
if you implement RAID 1, 4, or 5, you'll be reducing your risk of
having a long, unplanned downtime due to a complete drive failure.
Instead, you can have a short planned downtime -- just enough time to
replace the dead drive. Obviously, this means that if having a
highly-available system isn't a priority for you, then you shouldn't be
implementing software RAID, unless you plan to use it primarily as a
way to boost file I/O performance.
A smart system administrator uses software RAID for a specific purpose
-- to improve the reliability of an already very reliable server. If
you're a smart sysadmin, you've already covered the basics. You've
protected your organization against catastrophe by implementing a
regular backup plan. You've hooked your server up to a UPS, and have
the UPS monitoring software up and running so that your server will
shut down safely in the case of an extended power outage. Maybe you're
using a journalling filesystem such as ReiserFS to reduce fsck
time and increase filesystem reliability and performance. And hopefully,
your server is well-cooled and is composed of high-quality hardware,
and you've paid close attention to security issues. Now, and only now,
should you consider implementing software RAID-1, 4 or 5 -- by doing so,
you'll potentially give your server a few more percentage points of
uptime by guarding it against a complete drive failure. Software RAID
is that added layer of protection that makes an already rugged server
even better.
2. A RAID-1 walkthrough
Now that you've read about what RAID can and can't do, I hope you have
reasonable expectations and the right attitude. In this section, I'll
walk you through the process of simulating a disk failure, and then
bringing your RAID volume back out of degraded mode. If you have the
ability to set up a RAID-1 volume on a test machine and follow along
with me, I highly recommend that you do so. This kind of simulation can
be fun. And having a little fun right now will help to ensure that when
a drive really fails, you'll be calm and collected, and know exactly
what to do.
Important:
To perform this test, it's essential that you set up your RAID-1 volume
so that you can still boot your Linux system with one hard drive
unplugged, because this is how we're going to simulate a drive failure.
OK, our first step is to set up a RAID-1 volume; refer to my previous article if
you need a refresher on how to do this. Once you've set up your volume,
you'll see something like this if you cat /proc/mdstat:
Code Listing 2.1: Examining the RAID volume
# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5]
read_ahead 1024 sectors
md0 : active raid1 ide/host2/bus0/target0/lun0/part1[1] ide/host0/bus0/target0/lun0/part5[0]
4610496 blocks [2/2] [UU]
[======>..............] resync = 34.8% (1606276/4610496) finish=3.2min speed=15382K/sec
unused devices: <none>
Note that I'm using devfs, and that's why you see the extremely long
device names listed above. I'm actually using /dev/hda5
and /dev/hde1 as my RAID-1 disks. At the moment, the
kernel software RAID code is synchronizing the drives so that they're
exact mirrors of each other. If your RAID-1 volume is at this point,
you can go ahead and create a filesystem on the volume, and then
mount it somewhere. Copy some files over to it, and then set up your
/etc/fstab so that the volume (/dev/md0)
will be mounted when your system boots. Here's the line I added to my
fstab; yours may differ slightly:
Code Listing 2.2: fstab information
/dev/md0 /mnt/raid1 reiserfs defaults 0 0
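Before rebooting, it can be worth confirming that the entry really made it into the file. Here's a minimal sketch of such a check. The function name fstab_entry is made up for this example, and it assumes the usual whitespace-separated fstab layout:

```shell
# fstab_entry DEVICE [FILE]
# Print "mountpoint fstype" for DEVICE, or return 1 if DEVICE has no entry.
# FILE defaults to /etc/fstab; pass another path to check a copy instead.
# (fstab_entry is a hypothetical helper, not part of any RAID tool.)
fstab_entry() {
    awk -v dev="$1" '
        $1 ~ /^#/ { next }                      # skip comment lines
        $1 == dev { print $2, $3; found = 1 }   # device, mount point, type
        END       { exit !found }               # non-zero when nothing matched
    ' "${2:-/etc/fstab}"
}
```

With the line from the listing above in place, running fstab_entry /dev/md0 should print the mount point and filesystem type; no output and a non-zero exit status means the entry is missing.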
OK; we're almost ready to simulate a drive failure, but not quite.
First, cat /proc/mdstat again, and wait until all your
volume's disks are synchronized. When they are, your
/proc/mdstat will look like this:
Code Listing 2.3: Re-examining the RAID volume
# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5]
read_ahead 1024 sectors
md0 : active raid1 ide/host2/bus0/target0/lun0/part1[1] ide/host0/bus0/target0/lun0/part5[0]
4610496 blocks [2/2] [UU]
unused devices: <none>
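If you'd rather not keep re-running cat by hand, the wait can be scripted. This is only a sketch: md_syncing and wait_for_sync are names invented here, and the pattern assumes the progress line looks like the ones in the listings above ("resync = ..." or "recovery = ...").

```shell
# md_syncing [FILE]
# Succeed (exit 0) while FILE contains a resync or recovery progress
# line. FILE defaults to /proc/mdstat. Hypothetical helper.
md_syncing() {
    grep -Eq '(resync|recovery) *=' "${1:-/proc/mdstat}"
}

# wait_for_sync [FILE] [SECONDS]
# Poll every SECONDS (default 10) until no progress line is left.
# Note: an unreadable FILE also ends the loop, since grep then fails.
wait_for_sync() {
    while md_syncing "${1:-/proc/mdstat}"; do
        sleep "${2:-10}"
    done
}
```

Calling wait_for_sync with no arguments simply returns once the progress line has disappeared from /proc/mdstat, at which point the output should match the listing above.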
The simulation begins
OK, now that the resync is complete, we're ready for the
simulation. Go ahead and shut down your machine and power it
off. Then, open it up and unplug one of the hard disks that
make up your RAID-1 array. Of course, you won't want to unplug
the disk that contains your Linux root partition -- we'll need
to boot Linux again! OK, now that the hard drive is unplugged,
bring the machine back up. Once you log in, you should find that
/dev/md0 is mounted and that you're still able to
use the volume. When you cat /proc/mdstat, you'll see
the major change:
Code Listing 2.4: Missing a disk
# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5]
read_ahead 1024 sectors
md0 : active raid1 ide/host0/bus0/target0/lun0/part5[0]
4610496 blocks [2/1] [U_]
unused devices: <none>
Here, you can see that my /dev/md0 volume is running
in degraded mode. I unplugged drive /dev/hde, so
/dev/hde1 wasn't found when the kernel booted and
tried to autostart my array. Fortunately, the kernel found
/dev/hda5, and /dev/md0 was able to
start in degraded mode. As you can see, the /dev/hde1
partition isn't listed in /proc/mdstat, and one of the
RAID disks is marked as "down" ([U_] instead of
[UU]). But hey, since /dev/md0 is still going,
software RAID-1 is doing what it's supposed to do: keeping our data
available.
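In production you won't want to rely on someone happening to look at /proc/mdstat. The following sketch shows one way to spot a degraded array from a script. The function name md_degraded is hypothetical, and the parsing assumes the 2.4-style layout shown above, where the array name line is followed by a "blocks" line ending in a status map such as [U_].

```shell
# md_degraded [FILE]
# Print the name of every array whose status map (for example [U_])
# contains a "_" slot. Exit 0 if at least one array is degraded,
# 1 otherwise. FILE defaults to /proc/mdstat. Hypothetical helper.
md_degraded() {
    awk '
        # Remember the array name from lines like "md0 : active raid1 ..."
        $1 ~ /^md[0-9]+$/ && $2 == ":" { name = $1 }
        # The following "blocks" line ends with the status map
        /blocks/ && $NF ~ /^\[[U_]+\]$/ {
            if (index($NF, "_")) { print name; bad = 1 }
        }
        END { exit !bad }
    ' "${1:-/proc/mdstat}"
}
```

Something like md_degraded && echo "RAID degraded" could then be run periodically, with the echo swapped for whatever notification you prefer.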
Recovery
Right now, we're experiencing a simulated drive failure. If the
drive that currently doesn't have power actually failed while the
system was running, this is the kind of situation we'd be in. Our
RAID-1 volume would be running in degraded mode, meaning that our
volume is still available but without any redundancy. At a
convenient time, we'd want to shut down the system, replace the
failed drive, and start the system back up again. Our RAID-1 volume
would still be running in degraded mode at this point.
Once we have the new drive in the machine, we'd want to create a
RAID autodetect (FD) partition of the appropriate size on our
new disk. An additional reboot may be needed so that Linux can reread
the disk's partition tables. Once the new partition is visible to the
system, we're ready to restore our degraded RAID-1 array -- then,
we'll have some redundancy again.
Of course, we're only performing a simulation. To practice adding a
partition back into our RAID array, we can do one of two things,
depending on what kind of scenario you'd like to prepare for. The
first option is to shut down your machine, plug the drive in, boot it
up, and add the old partition back to the array. This simulates
something like a failed disk controller or a bad cable, where one of
your mirror drives was temporarily unavailable, causing /dev/md0 to
run in degraded mode until the problem was remedied. The second
option is to shut down your machine, plug the drive in, boot up, wipe
the drive, create a new RAID autodetect (FD) partition (of the correct
size, of course -- at least as big as the partition it's replacing),
and then add this brand-new partition to the array. This is closer to
what would happen in the event of a real drive failure. Whichever
simulation you choose, the "fix" is the same: after the partition is
ready, we need to manually add it back to the /dev/md0
volume.
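The size requirement is easy to check before you add anything. Here's a rough sketch, assuming you noted the size of the old partition (in 1 KiB blocks) while it was still healthy, and assuming the standard /proc/partitions columns (major, minor, #blocks, name). The names part_blocks and big_enough are invented for this example.

```shell
# part_blocks NAME [FILE]
# Print the size, in 1 KiB blocks, of partition NAME as listed in FILE
# (default /proc/partitions). Returns 1 if NAME is not listed.
part_blocks() {
    awk -v name="$1" '
        $4 == name { print $3; found = 1 }
        END        { exit !found }
    ' "${2:-/proc/partitions}"
}

# big_enough NAME MIN_BLOCKS [FILE]
# Succeed if partition NAME has at least MIN_BLOCKS blocks.
# Returns 2 if NAME cannot be found at all.
big_enough() {
    blocks=$(part_blocks "$1" "${3:-/proc/partitions}") || return 2
    [ "$blocks" -ge "$2" ]
}
```

The NAME argument has to match the last column of /proc/partitions exactly, which on a devfs system like mine means the long form (for example ide/host2/bus0/target0/lun0/part1) rather than hde1.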
Looking at dmesg
Before we add the partition back to our array, this would be a good
time to take a look at our kernel boot messages. Type
dmesg | more to page through them. You should see a bunch of text
similar to this:
Code Listing 2.5: Kernel boot messages
linear personality registered
raid0 personality registered
raid1 personality registered
raid5 personality registered
raid5: measuring checksumming speed
8regs : 1291.209 MB/sec
32regs : 1195.197 MB/sec
pII_mmx : 2110.740 MB/sec
p5_mmx : 2652.522 MB/sec
raid5: using function: p5_mmx (2652.522 MB/sec)
md driver 0.90.0 MAX_MD_DEVS=256, MD_SB_DISKS=27
md.c: sizeof(mdp_super_t) = 4096
autodetecting RAID arrays
(read) ide/host0/bus0/target0/lun0/part5's sb offset: 4610560 [events: 00000004]
(read) ide/host2/bus0/target0/lun0/part1's sb offset: 4610496 [events: 00000002]
autorun ...
considering ide/host2/bus0/target0/lun0/part1 ...
adding ide/host2/bus0/target0/lun0/part1 ...
adding ide/host0/bus0/target0/lun0/part5 ...
created md0
bind<ide/host0/bus0/target0/lun0/part5,1>
bind<ide/host2/bus0/target0/lun0/part1,2>
running: <ide/host2/bus0/target0/lun0/part1><ide/host0/bus0/target0/lun0/part5>
now!
ide/host2/bus0/target0/lun0/part1's event counter: 00000002
ide/host0/bus0/target0/lun0/part5's event counter: 00000004
md: superblock update time inconsistency -- using the most recent one
freshest: ide/host0/bus0/target0/lun0/part5
md: kicking non-fresh ide/host2/bus0/target0/lun0/part1 from array!
unbind<ide/host2/bus0/target0/lun0/part1,1>
export_rdev(ide/host2/bus0/target0/lun0/part1)
md0: max total readahead window set to 124k
md0: 1 data-disks, max readahead per data-disk: 124k
raid1: device ide/host0/bus0/target0/lun0/part5 operational as mirror 0
raid1: md0, not all disks are operational -- trying to recover array
raid1: raid set md0 active with 1 out of 2 mirrors
md: updating md0 RAID superblock on device
ide/host0/bus0/target0/lun0/part5 [events: 00000005](write) ide/host0/bus0/target0/lun0/part5's sb offset: 4610560
md: recovery thread got woken up ...
md0: no spare disk to reconstruct array! -- continuing in degraded mode
md: recovery thread finished ...
..
.... autorun DONE.
Now would be a good time to carefully read these messages, because
they'll help you to understand the process that the kernel uses to
autostart /dev/md0, giving you another valuable
insight into the inner workings of Linux software RAID. If you
read the kernel output listed above, you'll find that my kernel
found /dev/hda5 and /dev/hde1, but
hde1 was out of sync with hda5: its superblock event counter
(00000002) lagged behind hda5's (00000004). So, the kernel kicked
the non-fresh hde1 out of the array and started up /dev/md0
in degraded mode, using /dev/hda5 and not touching /dev/hde1 at
all. Now, it's time to add our original (or newly created)
partition to our volume. Here's how.
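One aside before we do. The freshness check that the kernel performs in the log above can be mimicked with a few lines of shell, which is a handy way to convince yourself that you understand it. This is only an illustration: stale_members is a made-up name, and the parsing assumes the exact "event counter:" line format shown in the listing, with counters printed as fixed-width lowercase hex.

```shell
# stale_members [FILE...]
# Read kernel log text (from FILE, or stdin if none is given) and print
# every device whose event counter is lower than the highest one seen.
# Hypothetical helper; the kernel does the real check itself.
stale_members() {
    awk '
        /event counter:/ {
            dev = $1
            sub(/.s$/, "", dev)       # drop the trailing apostrophe-s
            count[dev] = $NF ""       # force a string: fixed-width hex
            if (count[dev] > (max "")) max = count[dev]
        }
        END {
            for (dev in count)
                if (count[dev] < (max "")) print dev
        }
    ' "$@"
}
```

Fed with dmesg | stale_members on my system, this would name the same partition that the kernel reported kicking from the array.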
Restoration continues
First, if your replacement partition has a new device name, update
/etc/raidtab so that it reflects this new information.
Then, add the new partition to the volume using the following
command, replacing /dev/hde1 with the device name of
the partition you're adding:
Code Listing 2.6: Adding the new device
# raidhotadd /dev/md0 /dev/hde1
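If you want to double-check the raidtab step before running the command above, a small helper can list the devices that raidtab currently names for the array. This is a sketch under a couple of assumptions: raidtab_devices is a hypothetical name, and the file is expected to use the usual raidtools layout of a "raiddev" line followed by "device" lines.

```shell
# raidtab_devices ARRAY [FILE]
# Print the "device" entries listed under "raiddev ARRAY" in FILE
# (default /etc/raidtab), one per line. Hypothetical helper.
raidtab_devices() {
    awk -v array="$1" '
        # Each "raiddev" line starts a new section
        $1 == "raiddev" { in_array = ($2 == array) }
        # Only report devices that belong to the requested array
        in_array && $1 == "device" { print $2 }
    ' "${2:-/etc/raidtab}"
}
```

For example, raidtab_devices /dev/md0 | grep -qx /dev/hde1 succeeds only if the partition you are about to add is listed for /dev/md0.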
Your hard drive lights should begin glowing as reconstruction
begins. Go ahead and cat /proc/mdstat to check the status of
the RAID-1 reconstruction that's now in progress:
Code Listing 2.7: Check the status of the RAID-1 reconstruction
# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5]
read_ahead 1024 sectors
md0 : active raid1 ide/host2/bus0/target0/lun0/part1[2] ide/host0/bus0/target0/lun0/part5[0]
4610496 blocks [2/1] [U_]
[>....................] recovery = 1.8% (84480/4610496) finish=3.5min speed=21120K/sec
unused devices: <none>
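If you'd like a one-line summary instead of the whole file, the progress figure can be pulled out with awk. As before, this is just a sketch: md_progress is an invented name, and it assumes the progress line has the spacing shown above, with the action, an equals sign, and the percentage as separate words.

```shell
# md_progress [FILE]
# Print "<array> <action> <percent>" for each array that is rebuilding,
# for example "md0 recovery 1.8%". FILE defaults to /proc/mdstat.
# Hypothetical helper.
md_progress() {
    awk '
        # Remember the array name from lines like "md0 : active raid1 ..."
        $1 ~ /^md[0-9]+$/ && $2 == ":" { name = $1 }
        # Progress line: find the action word, then the percentage
        /(resync|recovery) *=/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "resync" || $i == "recovery") action = $i
                if ($i ~ /%$/) { print name, action, $i; break }
            }
        }
    ' "${1:-/proc/mdstat}"
}
```

When nothing is rebuilding, the function prints nothing at all.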
In a matter of minutes, your RAID-1 volume will be back to normal:
Code Listing 2.8: The normal RAID volume
# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5]
read_ahead 1024 sectors
md0 : active raid1 ide/host2/bus0/target0/lun0/part1[1] ide/host0/bus0/target0/lun0/part5[0]
4610496 blocks [2/2] [UU]
unused devices: <none>
Voila! We've successfully recovered from a simulated drive failure,
and you're ready to start using RAID-1 in a production environment.
You can now affix your homemade "RAID-1 certified" sticker to your
forehead and begin flapping your arms and running around the office
to the delight of your coworkers. Actually, maybe that isn't such a
great idea. See you in the next article.
Updated October 9, 2005
Summary:
In this two-part series, Daniel Robbins introduces you to Linux 2.4
Software RAID, a technology used to increase disk performance and
reliability by distributing data over multiple disks. In this article,
Daniel explains what software RAID-1, 4, and 5 can and cannot do for
you and how you should approach the implementation of these RAID
levels in a production environment. In the second half of the article,
Daniel walks you through the simulation of a RAID-1 failed drive
replacement.
Daniel Robbins
Author