Linux Snapshot Servers
I am writing this on the 4th of July, a holiday here in the US. It has been a terrific day. The space shuttle Discovery made a beautiful liftoff and climb to orbit, making the whole thing look so easy you'd think there was no stress about it at all. As a former NASA sub-contractor, such moments do my heart proud. I guess you can take the boy out of the space program, but you can't take the space program out of the boy.
As you might expect, my thoughts turned to Linux shortly after. I guess it was looking at all those computer displays in Mission Control, wondering how many behind-the-scenes computers were Linux these days, and knowing all the safeties of the safeties there are. Like Hubble having a backup power supply on its main camera. All that rolled about in the skull, and out popped this post about Linux snapshot servers. On a holiday. I know: I have a weird mind sometimes.
It may seem a little trite to think of mission critical file serving in the same day as watching such real life and death type drama. But we do have servers that are mission critical to us, even in R&D support.
One of the key features our current, 5 year old, soon to be replaced “mission critical” NAS server is missing is snapshotting: The ability to take point in time copies of a file system, and keep them on the same server. Snapshots usually are just a map of the changed blocks since the last time the snapshot was run, in order to save space. There is usually OS and file system (FS) trickery to make them look like complete copies of the FS being snapshotted.
You can keep many generations of these FS's around with most modern NAS appliances. Our Tru64 based TruCluster NAS server has many pluses to it (especially given how long ago we put it in), such as no single points of failure and rolling upgrades, but it was missing this snapshotting feature / option: the ADVFS file system native to Tru64 may be cluster aware, but it just did not have that.
ADVFS is a modern file system in most important ways: growing FS's on the fly, journaling, logical volumes, all the good stuff except snapshots. Even with a journaled FS, enterprise class storage like Compaq StorageWorks bolted in, and everything set up in RAID 5 groups, you can still lose a file system many ways:
Corruption of data structures.
RAID scrub failures.
RAID controller failures.
Accidental erasure by an authorized person.
Accidental erasure by a new bit of code that mounted the wrong FS.
Intentional erasure by a hostile MS Windows virus mounting the file systems via AFS or a Samba proxy.
Sure: it's backed up to tape, and the tapes rotated offsite and all the things you would expect from mission critical data. But it can take hours, even days to recover a large file system, once you get the tapes back from the offsite. I know, we went through this: In our case a corrupted data structure. That FS was toast. No recovery procedure we tried could fish out the data. The tape restores took forever. Our customer (R&D inside BMC) was not amused.
The math is pretty simple: here it conservatively costs more than 25,000 US dollars in wasted time for every hour this server is down and people are waiting on it. It does depend on where we are in a development cycle, and how many projects are getting ready to ship. That could be low by half or more even. Many hundreds of developers and QA people may not be able to work on what their project plan says they need to be working on because their data evaporated.
At your place just take the number of people idled by the downed server, multiply it by the average of their fully burdened salary, and then subtract off a fudge factor since they will be able to do other things like catch up on their email that made it through the spam filter while waiting for the server to return. Then add back in the time spent talking about “Superman Returns” by the coffee pot, and you should be in the ballpark.
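As a rough sketch of that back-of-the-envelope calculation, here is a small shell function. The function name and the example figures (400 people, 80 dollars an hour, 20% still productive) are made-up placeholders for illustration, not numbers from our shop, and the coffee pot adjustment is left as an exercise.

```shell
#!/bin/sh
# Rough downtime cost per hour. All figures used here are hypothetical.

# downtime_cost_per_hour PEOPLE HOURLY_RATE PRODUCTIVE_PCT
#   PEOPLE          number of people idled by the outage
#   HOURLY_RATE     average fully burdened cost per person-hour, whole dollars
#   PRODUCTIVE_PCT  fudge factor: percent of their time still spent usefully (0-100)
downtime_cost_per_hour() {
    people=$1
    rate=$2
    productive_pct=$3
    # Integer arithmetic only; the result is rounded down to whole dollars.
    echo $(( people * rate * (100 - productive_pct) / 100 ))
}

# Hypothetical example: 400 people, 80 dollars an hour, 20% still productive.
downtime_cost_per_hour 400 80 20    # prints 25600
```

Multiply the result by the number of hours the restore takes and the case for a cheap snapshot server tends to make itself.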
We decided after our last experience with this that we could fund a few inexpensive file servers to act as snapshot repositories for the main file server. The idea was that we could use the general design of our second tier archive storage NAS servers to add this new function. Here is the general outline of what we came up with:
4U or 5U cases with 12 to 16 hotplug drive slots and three 650 W power supplies
AMD Opteron mainboards, PATA (SATA these days) controller built in
Gig-E Intel PCI cards
3ware 9000 series 8 or 12 port SATA controllers
WD SATA disks (ours are 250 GB, but 500 GB disks are out)
Two 80 GB PATA disks in the spare slots (the new designs are SATA here too; PATA is going bye-bye)
If that recipe looks familiar, it is because it is the same one we used for the tier II NAS servers. And the Wiki server. It is a darn handy design really.
We set up the OS (FC4 with updated kernel) on the main PATA disk, and occasionally we do a 'dd' to the backup disk. A newer version of this design has 3 disks and an Adaptec RAID controller so that we have mirrored boot disks, with the 'dd' spare still there.
The idea here is to keep a working copy of the OS around in case we do something stupid like run 'rm -rf /' as root. Not that we would ever do that. On purpose. We also protect ourselves from disk hardware failures. This is less critical on the snapshot servers for reasons I'll get into in a bit.
The TruCluster NAS is not all that big by today's standards: less than 5 Terabytes. We have made a strong effort to keep only mission critical data here, and to move less critical data to the second tier NAS servers. Even then, moving around Terabytes of data can take some real time, especially from tape.
All told, our snapshot servers tip the scale at about 1000 USD a terabyte, and if we had the new Western Digital 500 GB disks, we would be at about 750 USD a terabyte.
Next, using scripts we wrote, we rsync file systems from the main NAS. The first time takes a while, but it is quick after that. We export read-only these copies of the file systems, and document on our Wiki where R&D can go to get a copy of the file if they should delete it from the main NAS accidentally.
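The actual scripts are not shown here, but a minimal sketch of the sync step might look something like the following. The host name "mainnas", the /export source path, and the DRY_RUN switch are all hypothetical; only the /snapshot/FSName/0 layout comes from the script fragment later in this post.

```shell
#!/bin/sh
# Sketch of a one-way sync from the main NAS to the snapshot server.
# MAIN_NAS, the /export source path, and DRY_RUN are assumptions.

MAIN_NAS=mainnas        # hypothetical host name of the main NAS
SNAP_ROOT=/snapshot     # generation 0 of each FS lives under here

# sync_fs FSNAME: mirror FSNAME from the main NAS into generation 0.
sync_fs() {
    fsname=$1
    src="${MAIN_NAS}:/export/${fsname}/"
    dst="${SNAP_ROOT}/${fsname}/0/"
    # -a        recurse; preserve permissions, owners, times, symlinks
    # -H        preserve hard links that exist on the source
    # --delete  remove files from the copy that were deleted on the source
    set -- rsync -aH --delete "$src" "$dst"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"       # show the command instead of running it
    else
        "$@"
    fi
}

# Dry run for a hypothetical file system named "projects":
DRY_RUN=1 sync_fs projects
# prints: rsync -aH --delete mainnas:/export/projects/ /snapshot/projects/0/
```

An /etc/exports entry along the lines of "/snapshot *(ro)" would then publish the copies read-only. One design note: by default rsync writes a changed file to a temporary name and renames it into place rather than overwriting it, which matters for the hard link generations described further down, so a sketch like this should not add --inplace.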
Already we are way ahead, since now R&D can manage their own recoveries from accidental deletes. Have an “oops” on the main NAS, manually mount up the snapshot export, and copy it back. The read-only nature of the snapshot means there is not a second “oops”.
If it is worse than that, we can switch the snapshot server's exports over to r/w, and it can act as the main NAS while we repair the problem with the main one. It is plenty of system to support a full build run against it. Truth be told, it is probably faster than the TruCluster, and the TruCluster is a pretty good sized system. Later we rsync the other way when there is no change happening. This already happened once after we built these (an accidental FS erasure this time around as I recall). These servers are paid for.
We run backups off the main server, but if we needed out-of-band backups, we could schedule them to run off the snapshot servers right after a snap.
Then there is one more trick: using “cp -l” out of cron over on the snapshot server to make generations of the data:
Script fragment, written by one of my R&D support team members for enabling the snapshot servers to keep multiple generations of a given file system.
#!/bin/ksh
.
(snipping out code for mounting file systems and whatnot)
.
# Age the incremental filesystem trees
#
# NOTE: Do not change this section unless you understand
# exactly how this works!
#
echo "INFO, Age the incrementals"
Incremental=${Age}
Current=$(( ${Incremental} - 1 ))
echo "INFO, deleting ${Incremental}"
rm -rf /snapshot/${FSName}/${Incremental}
#
while [ ${Current} -ge 1 ]
do
    if [ -d /snapshot/${FSName}/${Current} ]
    then
        echo "INFO, age ${Current} -> ${Incremental}"
        mv /snapshot/${FSName}/${Current} /snapshot/${FSName}/${Incremental}
    fi
    Incremental=$(( ${Incremental} - 1 ))
    Current=$(( ${Incremental} - 1 ))
done
echo "INFO, creating incremental tree 1"
cp -al /snapshot/${FSName}/0 /snapshot/${FSName}/1
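All of that code really is just setting up what happens in the very last line: the magic of the incremental copy using the Linux 'cp' command. The '-a' option copies the whole tree recursively while preserving ownership, permissions, and timestamps. From the man page, here is what '-l' is doing: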
-l, --link
link files instead of copying
This “well known” (look at the link to see the commands the script fragment above builds) hard link option of 'cp' is the tech-trick-bit that makes this snapshot work. Without it we would need to buy a ton more disk space on the snapshot server, rather than just reserving an extra approximately 20% of headroom space. That link will tell you a great deal more about the way this works.
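To see why the generations cost so little space, here is a small self-contained demonstration that runs in a scratch directory. It is a sketch rather than part of our script: the file names are invented, it assumes GNU 'cp' and 'stat' as found on Linux, and the write-then-rename step stands in for what rsync does by default when a file changes.

```shell
#!/bin/sh
# Demonstration of hard-link generations in a throwaway directory.
# Assumes GNU cp and GNU stat; file names are made up for the demo.
work=$(mktemp -d)
mkdir -p "$work/0"
echo "unchanged" > "$work/0/a.txt"
echo "version 1" > "$work/0/b.txt"

# Take a "snapshot": a new directory tree whose files are hard links
# to the same inodes as generation 0, so no file data is copied.
cp -al "$work/0" "$work/1"

# Simulate an update the way rsync does it by default: write a new
# file, then rename it over the old name. Generation 1 still points
# at the old inode, so the old contents survive there.
echo "version 2" > "$work/0/b.txt.tmp"
mv "$work/0/b.txt.tmp" "$work/0/b.txt"

links_a=$(stat -c %h "$work/0/a.txt")   # shared by generations 0 and 1
links_b=$(stat -c %h "$work/0/b.txt")   # new data, only in generation 0
old_b=$(cat "$work/1/b.txt")            # what the snapshot still holds
echo "a.txt links: $links_a, b.txt links: $links_b, snapshot b.txt: $old_b"
# prints: a.txt links: 2, b.txt links: 1, snapshot b.txt: version 1

rm -rf "$work"
```

Only files that change between generations consume new blocks; everything else is just another directory entry pointing at data already on disk. The flip side is that anything that modifies a file in place (instead of replacing it) would change every generation that links to it.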
Given what these servers cost, I suppose you could make a case for keeping complete copies, but there is also the speed of copy issue. By 'snapping' the FS, only the changed objects get copied, so it is orders of magnitude faster, even on all local disk.
Putting this together with the low cost of the Linux-driven hardware, we have created a snapshot capability for a server that needed it but did not have it, for a fraction of the cost of one outage. We added some new capabilities, like being able to run with the main server or its file systems completely down, and because this is an “offline” process, it is both invisible when it runs, and invisible when we take the snapshot server down to work on it. This last part is why we really don't absolutely need to build normal HA capabilities into the snapshot server. It only needs to be up when it is doing a copy, and when the main server is having a problem. Since we would no doubt need this at exactly the time when the snapshot server was down having a boot disk problem, and since 80 GB PATA disks are not expensive, we went with the belt AND the suspenders.
One other nice thing: Since the snapshot servers share the same general hardware design as the Tier 2 NAS servers and the Wiki server, all the SATA and PATA disks we keep on the shelves as cold spares for those work for these as well. And with COTS hardware, you want to keep spares around. Trust me on this.
We have a similar thing for our source code repository using vendor supported hardware, but that is a post for another day.