CentOS 5 NAS Cluster

Update on replacing the Tru64 NAS server with a Linux High Availability (HA) CentOS 5 server

As I noted in my last post, here is an update on where we are with replacing our trusty but aged Tru64 TruCluster NAS server with a new HA NAS server. The new server is a CentOS 5 based cluster with three nodes. I'll get into the particulars in a second, but first,

How We Got Here (in a nutshell)

Digital created the best cluster software in the world, VAXCluster. Digital ported this to Tru64. Digital was sold to Compaq. Compaq continued Tru64 and TruCluster. We had a NAS appliance. We bought another. It failed and failed and failed, for over a year. We replaced that with the TruCluster. HP bought Compaq and killed future development of the Alpha chip and Tru64 TruCluster. Our TruCluster aged, and we began to look at replacements. Two appliance vendors came in, were tested, failed. Tru64 started to have issues with new NFS clients. We started our in-house HA NAS testing based off our years of Tier II NAS using Linux. Pant Pant Pant. Whew. Twenty plus years of history in one paragraph!

What We Liked About Tru64 TruCluster

It may be true that we over-engineered the Tru64 NAS solution. After being burned so badly by the appliance, and having so many critical builds depend on the server, we were not prepared for anything other than the most reliable NAS we could figure out how to build. Tru64 was tried and true. TruCluster was the best cluster software there was for UNIX, and the Alpha chip was the hottest chip on the block back then. It all seemed to be a no-brainer.

Once built, we had rolling upgrades, and while a node might fail, the service stayed up. Customer-facing (my customer being, of course, BMC R&D) outages were few and far between, and while we had data loss issues once (leading to the Linux snapshot servers), we never had a server failure. TruCluster let us sleep at night.

We hoped that Linux clustering would one day catch up to TruCluster, and so watched things like the Linux SSI project with great interest.

Whatever we ultimately use, it has to pass the NAS server test suite.

Re-Thinking the NAS Solution

We knew what we liked about TruCluster, but after seven years, we also decided it was time to question some of its very basic design assumptions. We came up with a new set, tested the two new appliances against them, and then decided to try to build it ourselves out of Linux parts we found lying about the OSS World.

On the assumption that a picture is worth a large quantity of words, here is a Dia diagram (saved as PNG) that I drew of the new beastie:

http://lh3.google.com/stevecarl/RzTqSfdxlLI/AAAAAAAAADI/S3txOOy_KrQ/s144/lcfs-ha.png

Words Anyway

And now that we have that picture, a fair quantity of words is probably in order explaining what in the heck that is all about.

The servers are Sun X2200 M2s running CentOS 5 and Cluster Suite. An X2200 is small, but it is big enough to keep the gig pipe full, so we do not need anything bigger.

To make all the cluster stuff happen, we are using Cluster LVM over the top of the Linux Multipath drivers. Each device has two paths because there are two switches in the SAN fabric, and each cluster node is hooked to each switch. GFS sits on top of that to create the global file system across all the nodes.
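The layering described above can be sketched as the rough order of commands involved. This is a dry-run illustration and not our actual build procedure: the device name /dev/mapper/mpath0, the volume group and volume names, the cluster name lcfs, and the file system name data are all placeholders made up for the example. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run sketch of the storage stack: multipath -> clustered LVM -> GFS.
# All names below are hypothetical placeholders, not the real configuration.
DRY_RUN=${DRY_RUN:-1}
MPATH_DEV=/dev/mapper/mpath0   # multipath device (two paths, one per switch)
VG=vg_nas
LV=lv_data
CLUSTER=lcfs                   # must match the cluster name in cluster.conf
FSNAME=data
JOURNALS=3                     # one journal per cluster node

# Print a command, and execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

build_stack() {
    run multipath -ll                          # confirm both paths are visible
    run pvcreate "$MPATH_DEV"
    run vgcreate -c y "$VG" "$MPATH_DEV"       # -c y marks the VG as clustered
    run lvcreate -l 100%FREE -n "$LV" "$VG"
    run gfs_mkfs -p lock_dlm -t "$CLUSTER:$FSNAME" -j "$JOURNALS" "/dev/$VG/$LV"
}

build_stack
```

The one design point worth noticing is the journal count: GFS wants a journal for every node that will mount the file system, so a three node cluster needs at least three.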

Node one runs NFS. Node two runs Samba. Node three runs the backups. Should the NFS or Samba node fail, the service will restart on one of the surviving nodes, and since the file system is global to all three nodes, no magic is needed at the service level to move the file systems around.

The Spinning Bits

The disks are the ever nifty Apple Xserve RAID units. We burn a fair amount of capacity for HA on these: The RAID 5 is 5+1, with a hot spare, for a total of seven disks per RAID controller. The disks are 750 GB SATA. There are 14 disks in each shelf (two RAID controllers of seven disks each), and we have two shelves, for a total of 15 terabytes of usable capacity after RAID overhead, before formatting.
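For anyone who wants to check the math, here is the capacity arithmetic as a small sketch. The variable names are mine; the numbers are the ones given above.

```shell
#!/bin/sh
# Capacity arithmetic for the Xserve RAID layout described above.
DISK_GB=750               # size of each SATA disk
DATA_DISKS=5              # RAID 5 "5+1": five disks' worth of data
PARITY_DISKS=1            # one disk's worth of parity
SPARE_DISKS=1             # hot spare
CONTROLLERS_PER_SHELF=2   # each card manages half the shelf
SHELVES=2

disks_per_controller=$(( DATA_DISKS + PARITY_DISKS + SPARE_DISKS ))
controllers=$(( CONTROLLERS_PER_SHELF * SHELVES ))
raw_gb=$(( DISK_GB * disks_per_controller * controllers ))
usable_gb=$(( DISK_GB * DATA_DISKS * controllers ))

echo "disks per controller: $disks_per_controller"
echo "raw capacity:         $raw_gb GB"
echo "usable capacity:      $usable_gb GB (before formatting)"
```

That works out to 21,000 GB of raw disk and 15,000 GB after parity and hot spares, so two of every seven disks go to HA.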

There is a single point of failure here: there is a single RAID card on each side, and so even though there are two cards in the shelf, each card only manages half the disks. They do not talk to each other. This is not Enterprise grade storage.

We mitigate that risk by having bought the spares kits: We have spare disks in carriers, spare RAID card, and spare RAID card battery. This was part of the rethink: we decided to save some money on the disks but have a recoverability plan. It is not that it will never go down, but that we can get it going again quickly. The gear is all on three year hardware support, so broken bits are a matter of RMA'ing things, and everything should be designed to return to service quickly.

We have over a year of runtime on these units on the second tier storage, and have not had any serious issues so far, hence our willingness to try this configuration out.

Testing and Migration

In addition to all the run time on similar gear, we have been beating the heck out of these. By “We” I of course mean “Dan”, the master NAS blaster. Here is his Wiki record of the problems and the workarounds from the testing:


NFSV2 "STALE File Handle" with GFS filesystems

Problem Description

Only when using NFSV2 over a GFS filesystem! NFSV3 over GFS is OK. NFSV2 over XFS is also OK.

We were able to duplicate this from any NFSV2 client:

  • cd /data/rnd-clunfs-v2t - To trigger the automount

  • ls - Locate one of the test directories, a simple folder called "superman"

  • cd superman - Step down into the folder

  • ls - Attempt to look at the contents, returns the error:

ls: cannot open directory .: Stale NFS file handle
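The manual steps above can be turned into a small probe for checking many directories at once. This is a sketch of my own and not part of the original test suite: the function name probe_tree is made up, and it simply tries to list every subdirectory under a mount point and reports the ones that fail.

```shell
#!/bin/sh
# Walk the subdirectories of a mount point and try to list each one,
# reporting any that cannot be read (for example "Stale NFS file handle").
# Returns 0 only if every subdirectory could be listed.
probe_tree() {
    top=$1
    failures=0
    for dir in "$top"/*/; do
        [ -d "$dir" ] || continue
        # Capture only stderr; the listing itself is discarded.
        if err=$(ls "$dir" 2>&1 >/dev/null); then
            echo "OK   $dir"
        else
            echo "FAIL $dir: $err"
            failures=$(( failures + 1 ))
        fi
    done
    [ "$failures" -eq 0 ]
}
```

On an NFSV2 client, running probe_tree /data/rnd-clunfs-v2t would trigger the automount and then test each folder in turn.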

Note: This might be the same problem as in Red Hat bugzilla #229346.
We are not sure, and it appears to be in a status of ON_QA, so it is not yet released as an update. If this is the same problem, it's clearly a problem in the GFS code.

Problem Resolution

To verify that this was indeed the same bug as the Red Hat bugzilla #229346, I found the patch for the gfs kernel module and applied it to our CentOS cluster.
The patch does indeed fix this problem!

Instructions to apply the patch:

  • Download the gfs kernel module source, gfs-kmod-0.1.16-5.2.6.18_8.1.8.el5.src.rpm (if your kernel is 2.6.18-8.1.8.el5)

  • rpm -ivh gfs-kmod-0.1.16-5.2.6.18_8.1.8.el5.src.rpm - Install the source rpm, which unpacks it under /usr/src/redhat (sources and patches in SOURCES, spec file in SPECS)

  • cd /usr/src/redhat/SOURCES and add the following patch:

Filename: gfs-nfsv2.patch

--- gfs-kernel-0.1.16/src/gfs/ops_export.c_orig 2007-08-31 09:43:29.000000000 -0500
+++ gfs-kernel-0.1.16/src/gfs/ops_export.c      2007-08-31 09:43:52.000000000 -0500
@@ -61,9 +61,6 @@

        atomic_inc(&get_v2sdp(sb)->sd_ops_export);
 
-       if (fh_type != fh_len)
-               return NULL;
-
        memset(&parent, 0, sizeof(struct inode_cookie));
 
        switch (fh_type) {

  • cd /usr/src/redhat/SPECS and make the following changes:

Filename: gfs-kmod.spec

Name:           %{kmod_name}-kmod
Version:        0.1.16
Release:        99.%(echo %{kverrel} | tr - _) <--Change the release from 5 to 99--<<<<
Summary:        %{kmod_name} kernel modules

Source0:        gfs-kernel-%{version}.tar.gz
Patch0:         gfs-nfsv2.patch                <--Add this line--<<<<
Patch1:         gfs-kernel-extras.patch
Patch2:         gfs-kernel-lm_interface.patch

%setup -q -c -T -a 0
%patch0 -p0                                    <--Add this line--<<<<
pushd %{kmod_name}-kernel-%{version}*
%patch1 -p1 -b .extras
%patch2 -p1

  • rpmbuild -ba --target x86_64 gfs-kmod.spec - Build the new patched kmod-gfs rpm package

  • rpm -Uvh /usr/src/redhat/RPMS/x86_64/kmod-gfs-0.1.16-99.2.6.18_8.1.8.el5.x86_64.rpm - Install the patched gfs module

  • depmod -a - Required step to see the new module on reboot

  • Reboot the system to load the new kernel module
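One thing that is easy to trip over here: the kmod package is built for one specific kernel, and the kernel release is baked into the package release with the dashes turned into underscores (that is what the tr - _ in the spec file does). A minimal sketch for checking that a package lines up with a kernel follows. The function names are hypothetical helpers of mine, not part of rpm or the kmod tooling.

```shell
#!/bin/sh
# Map a kernel release to the suffix used in the kmod package release,
# mirroring the spec file's  %(echo %{kverrel} | tr - _).
kmod_suffix() {
    echo "$1" | tr - _
}

# Check whether a kmod package release was built for a given kernel.
# $1 = package release string, e.g. 99.2.6.18_8.1.8.el5
# $2 = kernel release (defaults to the running kernel from uname -r)
kmod_matches_kernel() {
    pkg_release=$1
    kernel=${2:-$(uname -r)}
    case "$pkg_release" in
        *".$(kmod_suffix "$kernel")") return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, kmod_matches_kernel 99.2.6.18_8.1.8.el5 2.6.18-8.1.8.el5 succeeds, while the same package checked against a different kernel release fails, which is the hint to rebuild before rebooting.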

NFSV2 Mount "Permission Denied" on Solaris clients

Problem Description

Certain Solaris clients (Solaris 7, 8, and maybe 9) fail with "Permission Denied" on mount when using NFSV2. This appears to be a known issue in Solaris when the NFS server (in this case CentOS) offers NFS ACL support: Solaris attempts to use NFS ACLs even with NFSV2, where they are NOT supported.

This problem has been fixed on more recent versions of Solaris (like some 9 and all 10+).

Note: This problem was detected on a previous test/evaluation of Red Hat AS 5 and was expected with CentOS 5.
Disclaimer: I think this is an accurate description of the problem.

Problem Resolution

Assume Solaris NFS clients will NOT use NFSV2?
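If a client does need a nudge toward NFSV3, the version can be pinned at mount time. The helper below is only a sketch: the function name is made up and nas.example.com:/export/home is a placeholder, not our server. It prints the mount command for the client OS rather than running it.

```shell
#!/bin/sh
# Print the mount command that forces NFS version 3 for a given client OS.
# $1 = output of `uname -s` on the client
# $2 = server:/export
# $3 = local mount point
nfs3_mount_cmd() {
    case "$1" in
        SunOS) echo "mount -F nfs -o vers=3 $2 $3" ;;
        Linux) echo "mount -t nfs -o nfsvers=3 $2 $3" ;;
        *)     echo "unsupported client OS: $1" >&2; return 1 ;;
    esac
}

# Hypothetical server and paths, for illustration only.
nfs3_mount_cmd SunOS nas.example.com:/export/home /mnt/home
```

The same options can go into an automount map entry instead of a one-off mount command.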

Cluster Recovery Fails on "Power Cord Yank Test"

Problem Description

The cluster software must fence a failed node successfully before it will recover a cluster service, like NFS or Samba. The fence method used in our configuration is the Sun X2200 Embedded LOM (ELOM) via remote IPMI. When the power cord on the X2200 servers is disconnected, the ELOM is also down. This causes the fence operation to the ELOM to fail. The cluster configuration allows multiple fence methods to be defined to address this issue. But there appears to be a bug in this version of the software that prevents the ccsd (Cluster Configuration Service Daemon) from answering the fenced "ccs_get" request for the alternate fence method when a node has failed.
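To make the intended behavior concrete, here is a toy model of fence method fallback: try each method in order, and only allow recovery once one of them succeeds. This is my own illustration of the idea and not how fenced is actually implemented. The two stand-in functions are hypothetical, and the idea of a switched power strip as the second method is an assumption for the example.

```shell
#!/bin/sh
# Toy model of fence-method fallback: try each method in order and stop at
# the first one that succeeds. Service recovery may proceed only on success.
# Each argument is a command name standing in for one fence method.
fence_node() {
    for method in "$@"; do
        if "$method"; then
            echo "fenced via $method"
            return 0
        fi
        echo "fence method $method failed, trying next" >&2
    done
    echo "all fence methods failed; services will not be recovered" >&2
    return 1
}

# Hypothetical stand-ins: the ELOM is unreachable when the cord is pulled,
# while a second method (say, a switched power strip) still answers.
fence_elom()   { return 1; }
fence_backup() { return 0; }

fence_node fence_elom fence_backup
```

With the bug described above, the cluster effectively never gets past the first method, which is the "all fence methods failed" branch in this sketch.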

Problem Resolution

None at this time. Waiting on a fix from CentOS. The assumption is that we can run with this configuration, but the cluster will not fail over services if a power cord or both power supplies on one of the X2200 nodes were to "fail". This would result in a service interruption.


And there you have it so far: We have our team's home directories running on the new servers, and other than being fast, we see no real difference yet. We are trading in a few problems on Tru64 for a few possible problems on CentOS 5, but we assume that we'll be able to either work around them (such as making Solaris clients use V3, which they tend to prefer anyway) or get a patch to Cluster Services at some point to deal with the power cord issue.

Next time: “The Numbers of NAS” -or- “Speeds and Feeds for the Geeks who just want to know”. And the new way we are going to do snapshots. And if I have time and space, some stuff about storage virtualization.


Friday, November 09, 2007
Steve Carl
