TalkBMC
Adventures in Linux, by Steve Carl

NAS Redeaux

The Next Generation of Tier II storage: Intro

I am going to assume here that not everyone reading this today will have read all of the 300,000 words and 130 or so posts to date. What I plan is to write three posts: today's will just be a general intro and restatement in many ways of things I have said here before. Then I'll talk about Linux in the second post, and the third one will be about our new hardware solution. My goal is to get all of these up in the next week or so.

This all started back in my second post, Linux and NAS. Not started... just the first time I wrote about it. I later posted a seven-part series about NAS testing that started with an Intro kind of like this one. There was another one focused on using Linux NAS appliances for snapshot servers. We have gotten back our money with those 10 times over. More, even. There are other references scattered throughout the other blogs, but those are some of the high points.

Let us assume for a moment that you do not want to read all those old, grungy, last-year's-model posts, though. Fresh recycled content, then....

The Goal of the Second Tier of NAS Storage

Not all storage is created equal. That is pretty handy, since not all storage needs are the same either. Whatever your environment (data warehousing, billing applications, web applications, whatever), a surprisingly small amount of the data in use is going to be "Hot": that is, beaten to death all the time. For what we do in R&D, that number is less than 10%.

That is not to say that "When I need data, I need it now" stops applying to the rest. Needing data quickly and pounding the data to death are two utterly different things.

Example: A build data set gets pounded here. We do nightly builds, and sometimes more often than that. During the build, that disk is beaten to a pulp and handed back shredded. We literally cannot put in disks that are fast enough or highly available enough for this.

Counter Example: The build data for releases of products that we retired. We may not have done anything with that code base for years. We don't need that code *at all*, right up until we get a customer call, and then we need that data *now*.

Both of those data types are spinning on disks, are on backups, and are part of offsite rotations and BC planning and so forth. But I do not put the same class of online storage underneath each type.

For six years, the first class of data has been running on a Compaq TruCluster. The most highly available data I have. No single points of failure. Rolling upgrades. Tons of arms. Enterprise class disks. Separate power. Everything. In those six years, that server has been down once for a hardware issue that defeated all our redundancy. Another time, one file system was corrupted, which idled one set of products. On four or five occasions data was accidentally erased by people with the access to do so (and recovered nearly as quickly from the above-mentioned mirror servers). The rest of the time, it has been data tone.

The cost was high: about 150,000 USD a terabyte, in 2001. But the server was also (to one way of thinking) pretty much free. All the times I avoided going down, times the number of people that would have been idled, equals "I made money on that server". We are looking to replace this server, but we are not in a hurry: it has been everything we wanted, and we have to be careful that the new solution is still "Data Tone".

My other storage is about 1000 USD a terabyte. It is free too, but in a different way. The first time a customer calls in and we solve a problem faster because we did not have to wait for a tape to come in from offsite and be restored, that is money well spent. Alternatively, the first time I have to rebuild a production file system from a mirror server, that mirror server just became free relative to the downtime I just avoided and the people I might have otherwise idled.
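
The "free server" arithmetic can be sketched in a few lines of Python. This is only an illustration of the reasoning: the two per-terabyte prices come from the text, but the number of outages avoided, the headcount, the outage length, and the hourly cost are made-up values, and the function names are hypothetical.

```python
# Rough sketch of the "the server paid for itself" reasoning.
# The per-terabyte prices are from the post; every other number is an
# assumed, illustrative value.

TIER1_USD_PER_TB = 150_000  # from the post (2001 pricing)
TIER2_USD_PER_TB = 1_000    # from the post


def avoided_downtime_value(outages_avoided, people_idled,
                           hours_per_outage, usd_per_person_hour):
    """Cost of the idle time that did not happen."""
    return (outages_avoided * people_idled
            * hours_per_outage * usd_per_person_hour)


def is_free(purchase_cost_usd, avoided_value_usd):
    """A server is 'free' once avoided downtime covers what it cost."""
    return avoided_value_usd >= purchase_cost_usd


price_ratio = TIER1_USD_PER_TB / TIER2_USD_PER_TB

# Assumed scenario: 1 TB of Tier I storage, 5 outages avoided,
# 100 people idled per outage, 8 hours each, 75 USD per person-hour.
value = avoided_downtime_value(5, 100, 8, 75)

print(f"Tier I costs {price_ratio:.0f}x Tier II per terabyte")
print(f"Avoided downtime worth {value} USD; free: "
      f"{is_free(1 * TIER1_USD_PER_TB, value)}")
```

Plug in your own outage history and headcount; the point is only that the comparison is between purchase price and downtime that never happened, not between the two price tags.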

To get to the price point I mentioned for Tier II, we hand-built the hardware, and of course we used Linux. Here was the thing with that: I have a staff of veteran people, averaging more than 15 years of experience each. They know hardware, they know software. When BMC needs a throat to choke on these servers, we are it. No abdication. No finger pointing. The buck stopped here. I talked about this issue in a post about who the integrator is with COTS (Consumer / Off The Shelf) hardware. I was not worried though: I have faith in my folks, and for six years they have not let me down.

So, the goal of the TII NAS was to provide the data BMC needed for R&D or customer support nearly instantly but at a reasonable cost.

Rethinking TII

NAS appliances are not nearly as expensive as they were in 2001, but they are also still not as inexpensive as our own in-house brew. Not even close. After six years of building these servers, they have not all stayed the same either. We have updated various parts of the design over time. Different motherboards, disk trays, cases, power supplies. The first Linux on the prototype (which we just retired) was Fedora Core 1, but we updated the kernel to a generic kernel.org 2.4 kernel (2.4.18 or 2.4.19 if memory serves: 2.4.34.4 is current), pulled in some reach-ahead NFS patches, and added IBM's EVMS for volume management. That prototype spawned a series of other systems, and over time we experimented and learned. We changed design features to make them more Enterprise class. COTS kept giving us more for less, and we kept leveraging that.

No design point was a sacred feature. We always re-examined the next set of servers in light of what we had learned, and the new COTS gear available. This made keeping onsite cold spares a bit of a challenge, but other than that it worked well.

The original concept was of an all-in-one appliance, sort of like a Sun X4500 is today. While this was partially a cost decision, it was also envisioned that a failed unit would just be replaced *as a unit*. Swap the hard drives out of the failed unit into the hot spare kind of thing. We started out at 8 drives in the chassis, and moved to 12, and were looking at 16 before we changed course completely. That design point was dictated by the 3Ware card we used.

We changed course because we found that, at least for us, it just did not work out that way. We kept fixing the unit in place instead of field swapping it. The reason, in part, was how hard it was to pull all the drives out, stick them in a new case, and get going again. The RAID data was intact on the drives and could be rebuilt. It was just slow and a pain, and we just never did it. Well... once we did it. But we did not like it.

The other reason was that the need for spinning NAS capacity always outstripped supply. If we had a fully built unit, just sans hard drives, and someone needed more space, the easy way to fix that was to buy the hard drives, deploy the unit, and re-order the cold spare parts. One way or another, it just never quite worked the way we thought it might.

There were two other things that fed the changes that we made to this new generation of TII.

First: the sentiment of the group was that we needed to separate that disk storage from the server head. This would make swapping out failed heads easy. We could even re-deploy a Linux PC if we had to in order to get running.

The second was one that is probably common in many places. Some on the team did not like being the neck to choke. It all came back to the post above about who the integrator was. My personal comfort zone and my faith perhaps exceeded their own. Or they just felt they didn't have time anymore. In any case, this was an area that I was willing to bend on, because we had something in mind.

Next

Next time I'll go into the things we did in the NAS operating system space with Linux. A tale of mystery and intrigue.... In the post after that, I'll cover the hardware solution we picked. At the end of it, we tested this thing out at about 100 megabytes a second of sustained sequential writes with cache disabled, which is close enough to wire speed that we are pretty happy with it. Happy enough that we are now back at the whiteboard thinking about the generation after this one...
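
For a sense of what "close enough to wire speed" means, here is a small Python sketch. The 100 megabytes a second figure is the measured number quoted above; the 1 gigabit per second network link is an assumption made here for illustration (the link speed is not stated in this post), and the function names are hypothetical.

```python
# Compare a measured write rate with the raw capacity of a network link.
# The 100 MB/s measurement is from the post; the 1 Gbit/s link speed is
# an assumption, not something the post states.


def wire_speed_mb_per_s(link_gbit_per_s):
    """Raw link capacity in megabytes per second.

    Uses decimal units (1 Gbit = 1000 Mbit) and ignores protocol overhead.
    """
    return link_gbit_per_s * 1000 / 8


def link_utilization(measured_mb_per_s, link_gbit_per_s):
    """Fraction of the raw link capacity that the measured rate uses."""
    return measured_mb_per_s / wire_speed_mb_per_s(link_gbit_per_s)


measured = 100                     # MB/s, from the post
wire = wire_speed_mb_per_s(1)      # assumed 1 Gbit/s link
util = link_utilization(measured, 1)

print(f"{wire:.0f} MB/s wire speed, {util:.0%} utilized")
# prints "125 MB/s wire speed, 80% utilized"
```

Real NFS traffic also pays for Ethernet, IP, TCP, and RPC headers, so the usable ceiling sits somewhat below the raw figure computed here.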


Tuesday, June 05, 2007