Even Clusters get the Blues
We designed our current, mission-critical file server based on previous experience with an uptime-challenged storage appliance. Between reboots, we decided to do a clean-sheet exercise (if writing on a great big white board that was recently erased qualifies as a “Clean Sheet”) on what should replace our not-so-favorite server. We laid out what we needed and also what we wanted, and drew a line between the two. To the extent this is ever possible, we did not know the answer ahead of time as we asked the questions.
These were some of the main points as I recall them:
- Tolerate our R&D network traffic (this was the major failing of our current “solution”)
- Rolling upgrade capable (another failing)
- No single points of failure in either the hardware or the software stack (and another)
- Proven technology from a stable vendor (yet another...)
- Expandable for the foreseeable lifetime of the hardware.
- Be able to handle at least 50 Megabytes a second on a single Gig-E wire
- Able to deal with Bimodal (NFS and CIFS) access to a single file system, as well as either NFS or CIFS only file systems.
- Logical Volume Management, so that file systems could grow without downtime.
- Able to be maintained by members of my team rather than needing a vendor house call every time we were working on it. Not that we wanted to have to work on it often.
The basic idea then was that it had to perform well enough to do our software builds, and that, even when we were servicing it, R&D would never see an outage from their end of things.
Once we had our list of things, we started calling vendors, asking questions, seeing presentations, getting in test systems, and doing general technical triage. At the end of the day, 4.5 years ago... we didn’t choose Linux. I know... I know... it was not an easy thing in some ways. We had quite a comfort zone with Linux, even back then. But the Linux of the day just was not ready for this particular role in the glass house.
We chose a Compaq TruCluster. What a machine! Two ES40 main nodes, each with 4 screaming Alpha processors, 4 GB RAM, multiple Gig-E cards on separate buses, two high speed Memory Channel (http://www.hp.com/techservers/systems/symc.html) interconnects, connected to the venerable Compaq Storage Works SAN via twin Brocade switches.
TruCluster runs on top of Tru64, is a Single System Image (SSI), active-active cluster, and is based on the best, most proven cluster technology known to humankind: VMS Cluster. With active-active and both nodes up, we’d get twice as much throughput, so no expensive system sitting around waiting for the other to fail. And because VMS Clustering has been around for decades, most of the complexity of such a solution would have been worked through. Normally something this complex could be expected to fail more often rather than less, but this was proven technology.
As an aside, this is also why we have been following the Linux OpenSSI project so closely over the years (http://openssi.org/cgi-bin/view?page=openssi.html), since the project is sponsored in large part by HP, who now of course owns TruCluster. More on this later.
From time to time over the last several years, we have assembled Linux and other clusters out of spare parts to evaluate their current state of the art relative to TruCluster. We are of course interested in how well these stack up against our original white board list, but also:
- How easy it is to build one, especially the level of customization
- How cluster-aware the NFS and CIFS software stacks are
- General speeds and feeds
- Basic Reliability / Availability / Serviceability (RAS)
We were doing this to remain educated and up to speed on the current state of the art... another part of our job is to build these clusters for R&D from time to time for various projects they have going: it’s nice to be ahead of the curve on requests like “Please build me a cluster, and I could really use it by Tuesday if at all possible”. We were also interested primarily in cluster technology that can be used to provide a NAS service: compute clusters or grids were not really in scope.
Then, our cluster got the blues. OK, really, several things came to pass that have made it time for us to re-visit this whole thing again. First and worst was that our client base shifted. As new UNIX and Linux versions have been released, there has been a drift towards using NFS version 3 over TCP/IP, and away from NFS version 2 or 3 over UDP. This is just a new R&D client default behavior. This makes sense really: NFS V3 over TCP/IP is better on the WAN than UDP. Unless you have a sudden bunch of them, and the server is a TruCluster from when most clients were UDP.
Because of the design of the TruCluster, and its particular implementation of cluster-aware NFS services, Memory Channel (MC) traffic has been increasing. In the classic performance and capacity planning scenario, what used to be no problem at all hit a knee in the curve, and suddenly we had a huge file server and some fairly low speed but very important clients waiting around while traffic cleared inside the MC. And of course, since all systems wait at the same speed, having all these high speed CPUs is not helping a bit (even by today’s standards, the 4.5 year old Alpha chips still get around pretty good). And with Compaq bought out by HP now, and Tru64 being end-of-lifed, we are probably not going to be able to get a major design change implemented on TruCluster to deal with this behavior. It’s not really “broke”, it’s just that NFS Version 3 over TCP/IP on a TruCluster doesn’t appear to scale. We can manage this of course by changing clients to use UDP one by one (and we have), but that is a short term solution.
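Finding which clients have drifted to TCP is the tedious part of that one-by-one approach. Here is a minimal sketch of how one might check a Linux client, assuming its mount table looks like /proc/mounts and reports the transport as a `proto=` option (or a bare `tcp` / `udp` option on older clients). The server name and paths in the sample are made up for illustration, and this is not the tooling we actually used.

```python
# Sketch only: report the transport used by each NFS mount in
# /proc/mounts-style text. Assumes the usual Linux field layout:
#   device  mount_point  fstype  options  dump  pass

# Hypothetical sample data; "filer" and the paths are invented.
SAMPLE = """\
filer:/build /build nfs rw,vers=3,proto=tcp,hard 0 0
filer:/home /home nfs rw,vers=3,proto=udp,hard 0 0
/dev/sda1 / ext3 rw 0 0
"""


def nfs_transports(mounts_text):
    """Return {mount_point: transport} for the NFS entries in mounts_text."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Skip short lines and anything that is not an NFS mount.
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        transport = "unknown"
        for opt in fields[3].split(","):
            if opt in ("tcp", "udp"):
                transport = opt
            elif opt.startswith("proto="):
                transport = opt.split("=", 1)[1]
        result[fields[1]] = transport
    return result


if __name__ == "__main__":
    for mount_point, transport in sorted(nfs_transports(SAMPLE).items()):
        print(mount_point, transport)
```

On a real client you would feed it the contents of /proc/mounts instead of the sample text, and anything reporting tcp is a candidate for remounting over UDP.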
So, it is time to look at Linux again. We have been building the Low Cost File Servers with Linux for a number of years now: we have a pretty good handle on the technology for non-HA purposes. A few things have changed since we went with the TruCluster that mean it is time to build some new test systems:
- Linux SSI now supports the 2.6 kernel, which we think may mean better scalability than a 2.4 kernel. Although we do have cause to wonder about this assertion: more on that next time.
- As mentioned, HP has been providing a great deal of support to Linux SSI, and HP has also been historically very supportive of Linux. Since our TruCluster is now essentially an HP system... well, it just seems like there ought to be some possibilities there.
- The relatively low cost of commodity-based hardware makes SSI not nearly as critical: we can afford to have inexpensive servers waiting around for other inexpensive servers to fail, so we need to look at the Linux HA project to see what it can offer: http://www.linux-ha.org/
- Also in Linux HA’s favor: it is a far simpler cluster technology, and so in theory it should be easier to get to a stable, mission-critical level of service when there are fewer complications. SSI on the TruCluster came from a 20+ year old technology base: nothing in Linux is quite that old and venerable yet...
The Low Cost File Servers have had their moments though. Next time: some of the things we have learned from them.