Adventures in Linux

Steve Carl, Senior Technologist at BMC Software, muses about his adventures in Linux.
Introducing a new Open Source tool from BMC, available at Sourceforge, for managing Linux guests under VM on the mainframe.

http://vmlmat.wiki.sourceforge.net/

If you have been following this blog, you know that my background is the mainframe. I started as a VM system programmer, and VM is still near and dear to my heart. When MF Linux became a reality, it was, to me, a natural thing that it would be a guest OS on VM the same way VM has hosted MF operating systems since it was invented over four decades ago (and the research that went into creating it stretches back even further).

Linux on the MF is just a logical progression of all the things that came before it on the mainframe... things like UTS and AIX/370. It is also natural for Linux: the ultimate in multi-platform OSes.

However...

Linux on the mainframe, like any other technology deployment of humankind, is not without its considerations and issues. If you have worked with the X86 spiritual baby brother of VM, VMware, you might know what some of those issues for Linux on the mainframe are: things like server sprawl, and the tendency for many end users to treat the resources as essentially infinite.

There is also a cultural issue. If you have programmers working on Linux, be it web apps or anything else, and they are less than four decades of age, they probably learned Linux on their Laptop / Desktop / small-server-in-the-cube-next-to-them (shades of "departmental computing") kind of thing. To them, for the most part, Linux is Linux, and it does not really matter where it is, and they most certainly do *not* want to learn anything about the MF in order to access their Linux there.

That means that they don't want to call the data center and have the operator autolog their Linux VM, they don't want to learn how to log into VM, IPL the Linux system, and then do a #CP SET RUN ON#CP DISC. No TN3270 stuff. No green screen. It is not the way of the GUI world, and very much not the way of the Web 2.0 world. CMS is only for us VM'ers it usually seems. I know my track record in teaching people how to use the mainframe who started out in computing on Linux, UNIX, or MS Windows is not good. It is not zero, but it is not the next generation of MF people either. The MF is just too different, and its attractions are often not obvious at first glance. Sure XEDIT is the best text editor *ever*, but it takes a while to come to appreciate it....

In general, Linux users are used to owning the resource, and being able to boot it whenever they want (during development anyway). There is a sort of security in owning the "Big Red Button" so that one is the master of their own computer's destiny.

This all stands at odds with a great deal of the culture of the MF: The ultimate in data center glass houses. To solve the problems above, often all the Linux VM's are just autologged when the MF is IPL'ed, and they run whether they are needed or not. This would not be a problem (at least not nearly as much of one) if these were all CMS VM's, but Linux in a VM turns out to have a few design points that operate against the "Just IPL and let it run" way of working. The main one is memory management. Think of the way Linux allocates memory, which is more or less "use everything I can". What is not programs is in-memory cache. It is a great design, and it makes perfect sense if you are running natively on real hardware: Why let the extra memory go to waste? Why not use it to speed things up? The idea is not even original to Linux. UNIX before it had it as a central design precept. The idea was that disks are way slow, and so if there was spare memory lying about, use it to cache the I/O, and speed up the programs. Let the I/O happen when it could. This design point is still valid today, as disks are still extremely slow relative to RAM.

As a guest on the MF though, that means that to the host OS, to VM, it looks like the guest memory is 100% busy all the time. That defeats the way VM likes to page out unused guest memory so that only active memory is in core. The solution is to keep virtual memory trimmed to just what the guest actually needs to do its job... At least it was before VMLMAT.
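
To see this effect on any Linux system, it helps to look at how much memory is page cache rather than program data. The following is a minimal sketch and not part of VMLMAT: the function name meminfo_summary is made up for illustration, and it assumes only the standard MemTotal, MemFree, Buffers, and Cached fields of /proc/meminfo.

```sh
#!/bin/sh
# Summarize how much of a Linux system's memory is cache rather than program
# data. Reads /proc/meminfo-style text on stdin, so it can be tried against a
# saved sample as well as a live system.
# NOTE: meminfo_summary is a hypothetical helper name, not a standard tool.
meminfo_summary() {
    awk '
        $1 == "MemTotal:" { total   = $2 }
        $1 == "MemFree:"  { free    = $2 }
        $1 == "Buffers:"  { buffers = $2 }
        $1 == "Cached:"   { cached  = $2 }
        END {
            if (total == 0) { print "no MemTotal line found"; exit 1 }
            cache = buffers + cached
            # cache_pct is truncated to a whole percent by %d
            printf "total_kb=%d free_kb=%d cache_kb=%d cache_pct=%d\n",
                   total, free, cache, (cache * 100) / total
        }'
}
```

On a live guest this would be run as `meminfo_summary < /proc/meminfo`. A guest that has been up for a while will typically show very little free memory and a large cache share, which is why the host cannot tell from the outside which pages really matter.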

We do R&D on MF Linux. In the pre-VMLMAT days, at any given time, using the above autologging method, we had over 100 Linux VM's running. VM was running out of real memory. VM would look around, find the least recently referenced pages, and page / swap them out to DASD. But Linux would then reference the page, and back in they would come. Paging / swapping is fine when what is being moved is rarely referenced, but it is called thrashing when it goes out and then comes right back in over and over. There are limits to what you can do to tune this, and we were at them.

Many of the VM's were idle in reality, but we in the support group had no way of knowing what was being actively used, what was up as reference, and what was up because it was autologged.

Sure, things like the Build systems need to be up all the time. That was just a handful of the total systems we are talking about here though. We needed a way to make it so that end users could, without knowing *anything* about VM, bring up and down their own Linux VM's. 

Enter VMLMAT.

VMLMAT

Virtual Machine / Linux Management and Archiving Tool.

As systems programmers, we are of course not marketers. Our name for our tool is descriptive, not beautiful. The VM part tells you right away this has nothing to do with LPARs or MVS or DOS/VSE. This is a VM tool, and that only makes sense: where else but the best hypervisor on the planet can you manipulate the guests without them knowing what you are up to?

The Linux part lets one know we are not dealing with CMS here. The Management and Archiving part is descriptive of function. That's just the way we roll. Since this is Open Source, maybe someone can contribute a snazzier name some day.

VMLMAT runs a standards compliant bit of HTML under an Apache web server: This is the way that the Linux users interact with the program. The system programmer has a few things to do on the install that do not relate to the Web interface, but the idea is that Linux users are used to doing stuff via the web browser. Moreover, we did not want to assume anything about where the MF Linux user was starting from. Could be AIX, Solaris, Linux, or MS Windows (perhaps with some sort of X loaded up on it). Given the platform diversity, standard HTML was a requirement even if we were not already a pretty standards driven group.

Now when we IPL VM, just the Build and Packaging Linux guests are autologged, and the end users can go to the web interface and IPL their Linux whenever they want. Or bring it down. It seems a simple thing, but that one feature saved us an MF upgrade. Only the Linuxii we need... that we are actually actively using... are up at any given time. 100+ Linux VM's up and running at the same time dropped to one fifth that number. 

That is just the tip of the VMLMAT iceberg though. Here is another nifty feature: disk space savings.

VMLMAT and DASD

MF disks are very expensive things. One of the primary criticisms of Linux on the MF is that Linux was normally run on commodity priced hardware, and now by running it on MF gear all the price advantage was lost. Many saw of course that there were advantages: I/O pathing and monster transaction capabilities, best in the business HA, and so forth: Perfect for the production Linux environment. Not so perfect for all the possible iterations of a development environment.

Since we in R&D Support do care and feeding of all sorts of Linuxii internally on the MF, all the way back to the very first bootable kernels, before RedHat or SUSE were making MF versions of their distros, up to the latest and greatest stuff, we had a real diversity issue. All those VM's were sitting around. Everyone needed a separate VM for each version they were working on, plus a set of their minus one versions of code, plus the betas and alphas for announced code: This was not server sprawl from the "It's all free" point of view I mentioned earlier, but it was expensive server sprawl nonetheless.

VMLMAT takes a different route to this. We do not store all the versions on the MF DASD at all. Everything is archived to inexpensive NAS. If you have been following "Adventures", I have been writing about that inexpensive NAS for a while now. Now you know one of the things it holds: Our MF Linux images. VMLMAT can package up via TAR any VM, and store it off to the NAS. Even better, it can restore that archive to ANY OTHER Virtual Machine. VMLMAT unrolls the archive, and then personalizes the archive to match the VM it is being restored to. 

We then leverage that in many ways:

  1. When a new release comes out, we install it for the first time and then archive it. Now everyone can use that new version without anyone in tech support ever getting involved. 

  2. If applications need to be added, the base archive can be installed, updated with, say for instance, Oracle. Now it can be archived again, and anyone who needs that release of Linux with that release of Oracle can leverage the work.

  3. Business Continuance / Disaster Recovery: Now we can take the NAS archives and replicate them wherever and however we like, and get back all this work, and it is simple and easy.

Only the currently being used copy of Linux is up on the MF DASD. The rest is spinning far less expensively out on the NAS. Multiply by the number of Linux VM's and the number of Linux versions and the number of application setups and that is a bunch of DASD being saved.

With the Web interface to VMLMAT, the end user is now an empowered individual. They can bring up their VM, shut it down, change it, install stuff, archive their changes, share their changes with other teams, and leverage other teams' work. Say they are running RH AS 5 with Oracle, and they want SUSE with MySQL for their next test. They can archive the work, retrieve SUSE with MySQL from archive, and start testing. The whole backup and restore takes less than 30 minutes in our shop. The new SUSE VM has the same name and same IP as it did before: VMLMAT took care of editing all the stuff in /etc so that the VM name never changes. Just the Linux version.
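
The archive, restore, and personalize cycle can be pictured with plain tar. This is only a sketch of the general idea and not VMLMAT's actual code: the function names archive_image and restore_image are invented here, and rewriting a single etc/hostname file stands in for whatever set of files under /etc a given distro really needs changed.

```sh
#!/bin/sh
# archive_image SRC_ROOT ARCHIVE
# Pack a directory tree into a gzip-compressed tar file. ARCHIVE could live on
# an NFS-mounted NAS path; it should not be inside SRC_ROOT.
# (Hypothetical helper, for illustration only.)
archive_image() {
    tar -czf "$2" -C "$1" .
}

# restore_image ARCHIVE DEST_ROOT NEW_HOSTNAME
# Unpack the archive into DEST_ROOT, then "personalize" it by writing the
# target machine's name. ASSUMPTION: one etc/hostname file is all that needs
# to change; real distros keep identity in several files.
restore_image() {
    mkdir -p "$2" &&
    tar -xzf "$1" -C "$2" &&
    mkdir -p "$2/etc" &&
    printf '%s\n' "$3" > "$2/etc/hostname"
}
```

The point of the design is the second function: because the identity is written after the unpack, the same archive can be restored onto any target and still come up under that target's name.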

At no point did a system programmer get involved in that transaction: We have seen a workload that was frankly swamping Ron drop to the place where rather than him needing to work crazy hours to keep up, he puts in a few hours a week on MF Linux maintenance (like creating new archives of new releases of Linux when they come out) and then moves on to other things. Not only was the end user enabled but we got back most of an employee for other tasks.

There is even more, and I'll write about that in future posts, but for now I want to tell you how anyone can get VMLMAT if they want it. Did I mention that BMC is Open Sourcing it yet?

Sourceforge and BSD

VMLMAT is available to anyone who is interested at Sourceforge:

http://vmlmat.wiki.sourceforge.net/

VMLMAT is licensed under BSD, one of the most open and permissive of open source licenses, and the one that we at BMC have chosen to use when we release Open Source projects. Our little site was just set up yesterday, and we are still knocking around learning how to use Sourceforge, so bear with us. We are loading up the documentation on the Wiki, and the tarball for the current 1.1.0 version is there. We'll be checking it all into SVN soon, but till then this is the way to get it. 

VMLMAT is created entirely out of Open Source projects like Apache, PHP, and Samba, and the VM portions are the very VM standard REXX. No C or assembler was harmed in the creation of the tool. The HTML is scanned and certified as being 100% open standards compliant. As we have it written, it leverages NFS to archive Linux images to NAS.

Ron did a slideshow style presentation to walk through some of the features of VMLMAT, and it is loaded to Sourceforge as well.

I'll be your host on the project, along with the BMC internal author of VMLMAT, Ron Michael.

Ron also has a blog at TalkBMC called "Open for Mainframe", and he has a bunch of posts in various stages of readiness for posting there about VMLMAT. We'll also start cross-posting with the blogs over at Sourceforge soon, so stay tuned for that.

In the meantime, I am jazzed. VMLMAT is a simple concept, and an amazing tool, and it has been saving us all sorts of time and money. I am thrilled to be able to share it with anyone else who is interested out there in two of my great geek-loves: VM and Linux. Further, we knew going in that VMLMAT was created to meet our real world requirements, but it also matches our R&D Support environment. Knowing that it might be Open Sourced, Ron created it so that it can be easily enhanced to add new features. For example, we chose to use Active Directory for end user authentication (via Samba) rather than maintaining a separate user / password file or perhaps having it in LDAP. The code is modular so that anyone can come in and write their own module to match their internal needs, and hopefully they will contribute that back so that VMLMAT can grow to meet a broader set of real world, Linux on the mainframe management challenges.

As Rachel Maddow says: "One More Thing:" Please do not confuse VMLMAT with BMC's VM Cloning Tool from a number of years ago. It shares no design and no code with that tool.



_____
Thursday, October 09, 2008
Helpful information for running a CentOS Cluster continued

As noted at the end of last post, today's post is more tasty documentation, straight off our internal R&D Support Wiki and as written by the Czar of NAS, Dan Goetzman, for running a CentOS cluster. Admittedly, this is the kind of post that is more useful from Google than as exciting reading. When you need to know this stuff, you need to know it though, and so I am posting it to help out whomever might come down the same path we have.

Next post I'll be talking about a new BMC Open Source initiative that I am intimately involved with: that one will be one of the most fun posts I have ever had the pleasure to write. In fact, I am ready to write it now, so let's get into the meat of this HOW-TO.

Take it away Dan:


HOWTO Shutdown or reboot a single node

The cluster software is started and stopped using the standard system startup scripts. So all that is required is to use the normal Linux system reboot or shutdown commands.

  • shutdown -h now - To shut down a single node
  • reboot - To reboot a single node
Note: 2 out of 3 nodes must remain running to keep the cluster "quorate", or running.
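
The "2 out of 3" rule is just simple-majority arithmetic, and it can be worth checking before taking a node down. The sketch below assumes one vote per node and a plain majority quorum; the function names are made up for illustration and are not part of the cluster software.

```sh
#!/bin/sh
# quorum_needed TOTAL_NODES
# Smallest number of nodes that is still a strict majority.
# ASSUMPTION: every node carries exactly one vote.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# can_stop_one TOTAL_NODES ONLINE_NODES
# Succeeds (exit 0) if one more node can go down without losing quorum.
can_stop_one() {
    [ $(( $2 - 1 )) -ge "$(quorum_needed "$1")" ]
}
```

For the three node cluster here, `quorum_needed 3` prints 2, so `can_stop_one 3 3` succeeds while `can_stop_one 3 2` fails: with one node already down, stopping another would take the cluster below quorum.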

HOWTO Remove a node from the cluster

  • Stop applications using cluster resources
  1. uvscan - Currently running only on rnd-fs03
  2. nfs - Actually runs on all cluster nodes but not controlled by rgmanager
  3. Samba - Controlled by rgmanager
  4. bbd - Controlled by rgmanager
  5. vsftpd - Actually runs on all cluster nodes but not controlled by rgmanager

Notes: Services controlled by rgmanager will be stopped when rgmanager is stopped. Non cluster applications, like DNS and NIS slaves, do not need to be stopped.

  • Stop cluster services in this order
  1. service rgmanager stop
  2. service gfs stop
  3. service clvmd stop
  4. service cman stop
  • Optional, disable services on reboot
  1. chkconfig uvscan off
  2. chkconfig nfs off
  3. chkconfig vsftpd off
  4. chkconfig rgmanager off
  5. chkconfig gfs off
  6. chkconfig clvmd off
  7. chkconfig cman off

Note: To add the node back into the cluster, run this procedure in reverse order.
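
Because the rejoin procedure is the stop procedure in reverse, it is easy to generate both from one ordered list. This is a hedged sketch rather than anything from our Wiki: stop_plan and start_plan are invented names, and they only print the commands so they can be reviewed before anything is run.

```sh
#!/bin/sh
# Cluster services in the order they must be STOPPED (taken from the list above).
CLUSTER_SERVICES="rgmanager gfs clvmd cman"

# Print the stop commands, in order. Nothing is executed.
stop_plan() {
    for svc in $CLUSTER_SERVICES; do
        echo "service $svc stop"
    done
}

# Print the start commands in the reverse order, for rejoining the cluster.
start_plan() {
    reversed=""
    for svc in $CLUSTER_SERVICES; do
        reversed="$svc $reversed"   # prepend, so the list ends up reversed
    done
    for svc in $reversed; do
        echo "service $svc start"
    done
}
```

Running `start_plan` prints the cman line first and the rgmanager line last. Once the output looks right, it could be piped to a root shell on the node.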


HOWTO Troubleshoot NFS serving

Local tests on the server

  • rpcinfo -p - Verify portmapper is responding
  • showmount -e - Verify mountd is responding
  • rpcinfo -u rnd-clunfs nfs - Verify NFS daemon is responding to UDP requests
  • rpcinfo -t rnd-clunfs nfs - Verify NFS daemon is responding to TCP requests

Test from a NFS client

  • rpcinfo -p rnd-clunfs - Verify portmapper is responding
  • showmount -e rnd-clunfs - Verify mountd is responding
  • rpcinfo -u rnd-clunfs nfs - Verify NFS daemon is responding to UDP requests
  • rpcinfo -t rnd-clunfs nfs - Verify NFS daemon is responding to TCP requests
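
The client-side checks above are easy to wrap so that they run in sequence and report which layer failed. The following is a minimal sketch under stated assumptions: run_checks and nfs_client_checks are hypothetical helper names, and the only real commands used are the rpcinfo and showmount calls already listed.

```sh
#!/bin/sh
# run_checks: read one command per line on stdin, run each, report OK or FAIL.
# Returns non-zero if any check failed. (Hypothetical helper.)
run_checks() {
    failed=0
    while IFS= read -r check; do
        [ -z "$check" ] && continue
        # </dev/null keeps the command from eating the rest of our input
        if sh -c "$check" >/dev/null 2>&1 </dev/null; then
            echo "OK   $check"
        else
            echo "FAIL $check"
            failed=1
        fi
    done
    return "$failed"
}

# nfs_client_checks HOST: print the four client-side checks for HOST.
nfs_client_checks() {
    cat <<EOF
rpcinfo -p $1
showmount -e $1
rpcinfo -u $1 nfs
rpcinfo -t $1 nfs
EOF
}
```

From a client, `nfs_client_checks rnd-clunfs | run_checks` walks from portmapper up to the NFS daemon. The first FAIL line shows how far up the stack the service is answering.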

HOWTO Troubleshoot Samba CIFS serving

The following procedure is what I do to verify that Samba is available. It starts from the server, and works its way back to testing on the client.

Local tests on the server

  • smbstatus - Verify normal Samba status.
  • cd /var/log/samba - Check for errors in the Samba event logs.

Remote tests from a client

  • ping rnd-fs - Verify TCP/IP connectivity to Samba
  • nbtstat -a rnd-fs - Verify if a simple NetBIOS operation will respond
  • net view \\rnd-fs - Verify if the shares can be browsed
  • net use * \\rnd-fs\${SHARE} * /user:adprod.bmc.com\username - Verify that a share can be mapped
Note: Samba is a cluster service and only runs on a single node at a time.

HOWTO start/stop/admin Samba

Samba is a layered software application that emulates a CIFS server, and on the cluster it is configured as a cluster service.

Note: You must use the cluster commands to start and stop the Samba service,
NOT the normal scripts in /etc/init.d.

Query the cluster services

  • clustat - To verify if the Samba service is running and on what node.

Start and Stop Samba

  • clusvcadm -d Samba - Stop Samba
  • clusvcadm -e Samba - Start Samba

Query commands for Samba

  • smbstatus - Show status

Adding shares and permissions

  • vi /etc/samba/smb.conf - Add/Modify/Remove a share

That is it for today. Next time, as noted above, Open Source at BMC.



_____
Tuesday, October 07, 2008
Continuing to look at some of the useful things one needs to know when going production with Linux. Applies to any other OS too in most ways, but examples are all from our Linux deployments, specifically our first year on our Linux based NAS.

Last post I talked about using a special version of "fsck" to repair GFS based file systems. As I thought about that post I realized that I had some more general things I wanted to get into in this area. I also noted that I would talk more specifics about useful commands and related things we have learned along the way that you should know about *before* your Linux based cluster fails.

As I also alluded to in my last post, the first year on the Linux cluster has not been utterly pain free. For one thing, we had the fencing set up wrong so that when a node failed for any reason, it could not be "shot in the head" and recovered by the surviving nodes. This has now been fixed: I know this because last night a heartbeat failed, the node was fenced, and the service recovered on a surviving node. Then, the heartbeat returned and the cluster was whole again. We did nothing, and there was no customer facing outage. Here is Dan's exact verbiage:

I see that node #1 on the file serving cluster here in Houston rebooted this morning. That's the node that normally handles the NFS service by default priority. It looks like node #1 lost the heartbeat token to the other nodes. Probably due to the NIC driver or something. This is the same thing that happened a couple of weeks back on node #3.

 This time, the cluster recovered without my help. As it should! With the fence configuration fixed now, node #2 was able to fence (reset) node #1 via the Sun ELOM (IPMI over LAN) and node #1 then rebooted and joined the cluster again. All is well!

 The cluster resource manager moved NFS to node #2 to maintain that service. For now, I have left NFS on node #2 with the Samba service that normally defaults to node #2.

The cluster did its thing, no service outage. Although I suspect NFS stalled for a few moments and then took off again... Life is Good!

We don't yet know why heartbeat is getting lost from time to time, but at least we now totally survive it when it happens. More on that in a second...

This takes me to the things I wanted to say about a few design choices we made in setting up the cluster, and I want to tie these back to another, different cluster we deployed with far less success 8 or so years ago as well as the TruCluster that the Linux cluster replaced.

Design point / choice number one: If you have read the previous series about our NAS cluster design, you might have followed a link to this picture. If not, and you do so now, you will see that we chose to implement three nodes: Sun X2200's in our case. Why?

Our TruCluster (may it rest in peace, and in this case, pieces) had magnificent uptime. It ran and ran with rarely a burp. However, the TruCluster was two ES40 nodes. If we took one down to apply patches, we were left literally "standing on one foot": We were not HA any more. At least one time, the *other* node failed while we had one down for service, which meant we had a customer facing outage.

With the price of an ES40, a third node would have been a significant bit of money for the insurance. Our thinking at the time: This is the best cluster software on the planet (when it was viable, before TruCluster was led to the firing line) so what are the chances we'll take a hit on a surviving node when we have one down for service?

As with all Disaster Recovery / Business Continuance math, that question is tricky, and the real answer is: "There is a 100% chance of the surviving node going down while the other node is offline... given enough time." In the seven years that the TruCluster was in service, it happened at least once to us.
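
The "given enough time" part is ordinary probability: if each maintenance window carries some small chance p that the surviving node fails, the chance of at least one such failure across n windows is 1 - (1 - p)^n, which creeps toward 1 as n grows. The sketch below just evaluates that formula. The function name is invented, the windows are assumed to be independent, and the numbers in the usage note are illustrative, not measured.

```sh
#!/bin/sh
# overlap_risk P N
# Chance of at least one failure of the surviving node across N maintenance
# windows, each with independent failure probability P (0..1).
# Formula: 1 - (1 - P)^N. (Hypothetical helper; inputs are illustrative.)
overlap_risk() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.4f\n", 1 - (1 - p) ^ n }'
}
```

For example, `overlap_risk 0.01 100` prints 0.6340: a made-up 1% risk per window becomes better than even odds of an outage after a hundred windows. With a third node in the cluster, a single failure during maintenance no longer means an outage.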

Commodity hardware and Linux change the spare-hardware-insurance math. The price of a third X2200 plus Linux is an order of magnitude less than another ES40 node would have been. More than an order. There is the issue of increased complexity though, and I'll come back to that in a bit. To lead into the complexity issue I want to go back to the heartbeat design.

Most clusters use a dedicated bit of hardware for the heartbeat internode signaling. If you use only one interconnect though, you have a single point of failure. The CentOS cluster software does not require a private network segment for heartbeat, and in fact the default is to use the public network segment. That appears to be considered "Best Practice".

If the cluster is done right, then at least two high speed, modern, supported, monitorable network switches are in play, and each of the three nodes connects to *both*. The heartbeat signaling is small, low bandwidth traffic. With the port to port switching, high speed switch backplane, and second switch redundancy, the heartbeat should be fine. To do this right on private networks would mean adding two *more* high speed switches, plus two more NIC's to each server. At some point the cost and complexity are not returning much in the way of value, and may in fact be adding more points of failure to your cluster such that it starts failing when nothing is really even wrong!

OK: That is the theory, but as Dan's note indicates the theory is being challenged by the occasional loss of heartbeat. I hate it when that happens!

That seems like a good way to move into my point about the cluster we used to have that lowered our uptime. A lot. It was not a Linux cluster, but it was a vendor supported, vendor installed, vendor configured solution using some of the better clustering technology of the day that was not TruCluster.

The problem was that the application that ran on the cluster was not cluster aware, and we were never able to fully script it so that all the bits and pieces from the application would fail over in cases where there was a problem. The app, not knowing all this redundant stuff was out there was often confused as to which node it was running on, and failed at least once a month. We finally took the cluster apart, created two stand alone servers, and uptime went up to over a year.

There were echoes of this when we first built the Linux cluster: Neither NFS nor Samba is really cluster aware, at least not yet. I think Samba will be soon. NFS, being stateless, does not really need to be as cluster aware as one might think. Since GFS is keeping file state, and all the underlying addressing mechanisms for the files the same across all the nodes, NFS can stop and restart anywhere.

You can see the results of this design in what happened in Dan's account of the failure. We are doing most of our cluster magic by using GFS as the file system so that all nodes could mount the same FS, yet not overwrite each other. The CentOS cluster software only has to worry about where a particular service is running, and moving them around. File state is not its job. We then set it up so that NFS runs on one node, Samba on another, and the third was insurance... That inexpensive insurance we could not afford on the TruCluster.

The takeaway from all this then: Clusters do not, in and of themselves make everything magically more HA. You have to start with the best of breed in cluster software, but you also have to know the cluster environment, and test it seven ways from yesterday to be sure that in failure mode it is actually doing what you think it should be doing. This ties back to something I said last post. To paraphrase: There is no substitute for knowing what you are doing. Linux is not magic. Clustering is not magic. All the magic comes from your people. Your business is only as good as your process: Process designed by people who do not know what they are doing will land you in a world of hurt.

Today's Linux Cluster Commands

Hopping down off my soapbox now, here is another bit of Cluster Wisdom (tm) as documented by our NAS Wizard, Dan Goetzman. First off, I want to back up and establish some common terms, which Dan provides here:

The "TruCluster Replacement" project is an evolution of our LCFS (Low Cost File Server) project using Linux clustering (CentOS) and the "Snapple" (SuN x2200 servers and APPLE XServe Raid (XSR) storage) hardware platform.
Main Features:

  • In addition to the features of the "Snapple" based LCFS platform...
  • CentOS 5 cluster technology to provide failover services for NFS and Samba.
  • GFS Cluster/SAN/Parallel filesystems for user data.
  • CLVM to make the SAN storage available on all cluster nodes.
  • Public network failover using Linux "bonding" driver.

The cluster consists of:

  1. rnd-fs - Main cluster name (NOT in dns).
  2. rnd-fs01 - Cluster node #1 (default node for NFS service).
  3. rnd-fs02 - Cluster node #2 (default node for Samba service).
  4. rnd-fs03 - Cluster node #3 (default backup and virus scanning service).

Cluster Services:

  1. rnd-clunfs.bmc.com - NFS server service.
  2. rnd-fs.bmc.com - Samba server service.
  3. yellow.bmc.com - Big Brother service.

So, now that we have some common terms and server names, the commands in this and future posts will have context. Finally for today, some helpful cluster commands:

clustat

  • clustat - To view the normal cluster configuration.
    # clustat
    Member Status: Quorate

      Member Name      ID   Status
      ------ ----      ---- ------
      rnd-fs01         1    Online, Local, rgmanager
      rnd-fs02         2    Online, rgmanager
      rnd-fs03         3    Online, rgmanager

      Service Name     Owner (Last)     State
      ------- ----     ----- ------     -----
      service:NFS      rnd-fs01         started
      service:Samba    rnd-fs02         started
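
When scripting around the cluster, it is handy to pull the current owner of a service out of that output. The sketch below assumes the service lines look like the sample above (name, owner, state, separated by whitespace); other versions of the cluster software may format them differently. service_owner is an invented helper name.

```sh
#!/bin/sh
# service_owner NAME
# Read clustat-style text on stdin and print the node that owns service:NAME.
# Exits non-zero if the service is not listed.
# ASSUMPTION: service lines are "service:NAME  OWNER  STATE" as in the sample.
service_owner() {
    awk -v want="service:$1" '
        $1 == want { print $2; found = 1 }
        END        { exit !found }'
}
```

On a cluster node this would be used as `clustat | service_owner NFS`.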

clusvcadm

  • clusvcadm -r NFS -m rnd-fs03 - To relocate the NFS service to node #3.

system-config-cluster

  • system-config-cluster - To configure the cluster.

ip

  • ip addr show - To display where the cluster is serving a cluster IP resource

That should do it for this time. Next time, shutting down and rebooting a single node, and removing a node from a cluster, plus some troubleshooting.



_____
Friday, September 26, 2008
Cost Index: Low. Misery Index: Low but not Zero

Just because a hurricane hit us doesn't mean I can't write a blog post!

Last September we "stood up", for the very first time, our CentOS Linux based cluster to replace the aged and unsupported Tru64 TruCluster. It was not all that long ago in fact that I wrote the wrap-up article to that adventure, so I guess this is a postscript.

First off, the fact that I have changed roles has influenced several things around the file server: A new manager took over my team, and when the server had a problem she suddenly found out she had a Linux file server she was responsible for. It is documented seven ways from Sunday on the Wiki: Dan is amazing about things like that. The problem of course is that when everything is working, no one reads the doc. When it fails they don't have time. Dan works with me on my new team, but went back and fixed the file server for the old team a couple of times, and here is the nut of what this article is about. What I am about to say here is going to be true about any and every complicated bit of technology that people rely on every day: it will not be limited to just Linux.

You have to know how to use the technology.

The Linux NAS was never advertised as being as good as the TruCluster that preceded it, but even the TruCluster, when it failed, took people who understood TruCluster / Tru64 / ADVFS to fix it. Same thing with any technology stack I have ever worked with.

Technology is only as good as the people and process that support it. See ITIL for details.

This is a truth that I think about all the time in my new role as a technologist. 10% of the work is designing the solution. The rest of it is training, communicating, and then going back and retraining some more (more than likely).

Along comes this hurricane named Ike, and it is huge: As big as the state of Texas from side to side. Houston's power grid crumbled before Ike. The Linux NAS server has a weak spot in the design: It will not run without electrons. I know, I know: We should have had wind power as a backup. Next time....

Upon the return of power, the Global File System that underlies the core design of the NAS marks many high I/O, high usage file systems as needing repair and they will not mount. The log says that the file system has been "withdrawn":

 --------------------- GFS Begin ------------------------

 WARNING: GFS filesystems withdraw
    GFS: fsid=rnd-fs:p4_gfs.1: withdrawn:

 WARNING: GFS withdraw events
     [<ffffffff884c3c94>] :gfs:gfs_lm_withdraw+0xc4/0xd3:
    GFS: fsid=rnd-fs:p4_gfs.1: about to withdraw from the cluster:
    GFS: fsid=rnd-fs:p4_gfs.1: telling LM to withdraw:

 WARNING: GFS fatal events
    GFS: fsid=rnd-fs:p4_gfs.1: fatal: filesystem consistency error:

 ---------------------- GFS End -------------------------

This is system admin 101 stuff: FSCK and fix stuff, and you are back running... except that in the cluster and GFS the command's name is not FSCK. And you can not just FSCK: here then is what Dan wrote on our Wiki about how to recover from this:


HOWTO: Recover a GFS filesystem from a "withdraw" state

When a corrupt GFS filesystem structure is discovered by a node, that node will "withdraw" from the filesystem. That is, all I/O for the corrupted filesystem will be blocked on that node to prevent further filesystem corruption. Note, other nodes may still have access to the filesystem as they have not discovered the corruption.

  • halt/reboot - Use a hardware halt on the node that is in the "withdraw" state and then reboot that node.

Note: A simple reboot command should work, but on our version of the cluster it seems to hang in the GFS umount stage on the withdrawn filesystem. So a hard reboot of the node seems to be required at this time.

  • umount ${MOUNT_POINT} - Un-mount the filesystem on ALL NODES!
  • gfs_fsck ${BLOCK_DEVICE} - To run a full fsck. Run on one node only!
  • mount ${MOUNT_POINT} - On all nodes to restore service.
Note: nfsd will hang on the withdrawn filesystem. You may
need to relocate the NFS service to a surviving node first!
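
The ordering in the steps above (unmount everywhere, fsck on one node, mount everywhere) can be sketched as a dry run that only prints the commands for review. The function name, the use of ssh to reach each node, and the node names are assumptions made for illustration and are not part of Dan's procedure. The hard reboot of the withdrawn node and any NFS service relocation are left as manual steps.

```shell
# Sketch only: print the recovery sequence; nothing is executed.
# Arguments: mount point, block device, the one node that runs gfs_fsck,
# then every node in the cluster (all placeholders supplied by the caller).
gfs_recovery_plan() {
    mount_point=$1
    block_device=$2
    fsck_node=$3
    shift 3
    # Step 1: the filesystem must be unmounted on ALL nodes first.
    for node in "$@"; do
        echo "ssh $node umount $mount_point"
    done
    # Step 2: run the GFS-specific fsck on one node only.
    echo "ssh $fsck_node gfs_fsck $block_device"
    # Step 3: remount on all nodes to restore service.
    for node in "$@"; do
        echo "ssh $node mount $mount_point"
    done
}
```

For example, `gfs_recovery_plan /mnt/p4 /dev/vg0/p4 node1 node1 node2` prints five lines: two umounts, one gfs_fsck on node1, and two mounts.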

Since being in production, Dan has had to do this particular recovery action about four times. Ike only gets credit for this last one. The other three times had to do with a single node failing and leaving I/O pending. This in turn appears to be the ILOM card in the node acting up.

Next time: Some other handy Linux cluster things to know before your Linux based cluster fails...



_____
Monday, September 22, 2008
Minor revision to a great Distro

I have made no secret here of my love for Mint. In the pantheon of Linux distros (and that is a huge pantheon full of worthies), it is the one that just works for me more than any other that I have tried. I admit I have not tried them all. That would be pretty well impossible. It is not just me that has found success with Mint either: I have corresponded with many people over the years of doing this blog who were having troubles installing Linux, tried Mint, and had it just slide in and solve their problem. Most recently someone with an IBM X30 laptop similar to mine, who was having issues getting their Wifi running with Fedora, decided to install Mint and that was it. Problem solved. This was with a Prism 2.5 chipped PCMCIA card too!

As I recently noted in this blog, I am currently living between two cities. Unless I want to be schlepping hardware all the time, that means setting up a new set of Linux gear in my new office. One of these new systems was a Dell laptop that, while it has been dropped and looks rough, runs OK. Its main problem was that it was running only Windows XP. In my new role, I do use MS Windows for some things: Mostly for VMware's Virtual Center native client.

Aside: What in the world is up with that? No Linux native client? VMware started off as a Linux based product! ESX uses Linux on the control console! Sigh.

A web interface would normally be my alternative as a Linux user (and as someone with as many feet as possible in the Web 2.0 world) but even the very most current version of Virtual Center does not support Firefox 3.0, and FF 3 is pretty much all I have everywhere. Grrr.

Mint 5r1 on a Dell Laptop Install

While I currently need XP from time to time for Virtual Center, the rest of the time I want to be on Linux, so I took the opportunity to install the new Mint 5 Revision 1 to the Dell laptop. Another aside: Odd nomenclature: Why 5r1 and not 5.1 or 5.0.1 I do not know. I will take the liberty of calling it 5r1 later here, just to speed my typing up.

Since I was planning on keeping XP, and it had a ton of tools installed, I needed to set aside 30 GB of the hard drive for XP. I know: Seems like a lot, but those tools look pretty useful, and at 80 GB the hard drive is big enough for both Linux and a 30 GB MSWin partition.

First off, I ran XP's hard drive optimization program to make sure everything was compacted together, and I also ran chkdsk, just to be sure the hard drive looked healthy. Then I booted up Mint 5r1 and went through the very familiar install sequence.

5r1 does not do anything to the time zone picker (The graphical view of the Earth that slips and slides around under the mouse) to make it better. Still easier just to pick the TZ off the menu than to use the graphical selector. A case of a bad use of a graphical interface if there ever was one.

Once I got to the disk partitioner, I over-rode the disk size it selected to give XP a bit more room: It wanted to go with 26GB, but I wanted a round 30GB. If it turns out XP never needs it, I can still read and write to the NTFS space from Linux, so it will not be wasted.

The partitioner failed, saying there was an error, but not what it was, or what to do about it. I was confused because I had done a 5.0 install on another Dell without any issue at all.

I poked around at the command line, invoking the "ntfsresize" command to see what kinds of errors the MSWin disk might be throwing that were causing such a problem, but none of the error messages were all that clear. I thought about it, and decided that the problem must be that the MS Windows disk was "unclean". Even though I had cleaned it before starting the process, something was left undone. A quick boot back to XP, a clean shutdown, and a boot back to Mint 5r1, and now the install / resize went like a champ.

Note to self: boot one last time after a chkdsk so that MSWin will mark the NTFS file system clean.
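
One way to make that note concrete is to ask ntfsresize for a report and a dry run before starting the installer. The sketch below only prints the two commands. The function name and the /dev/sda1 partition in the example are placeholders, and it assumes the ntfsresize shipped on the LiveCD accepts the --info, --no-action, and --size options described in its man page.

```shell
# Sketch: print the pre-install checks for an NTFS partition; nothing is run.
# ntfs_precheck_cmds is a hypothetical helper; $1 is the partition, $2 the
# target size for the Windows side (for example 30G).
ntfs_precheck_cmds() {
    part=$1
    size=$2
    # Report the current state and size of the volume.
    echo "ntfsresize --info $part"
    # Rehearse the resize without writing anything to disk.
    echo "ntfsresize --no-action --size $size $part"
}
```

Running the printed commands as root from the LiveCD, for example those from `ntfs_precheck_cmds /dev/sda1 30G`, should surface a complaint about the volume before the graphical partitioner ever gets involved.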

The Mint (and therefore, the underlying 8.04 Ubuntu code base) could have been a bit more useful here. I am willing to bet that unclean MSWin NTFS disks are extremely common, and that they are in fact the most common issue when one is trying to install a dual boot setup like this. Instead of 'Error' and little else, a message saying 'Here is something you might try' would have been really nifty.

Mint 5 updates on the Houston Dell

Warmed by the success of the 5r1 install, upon returning to Houston I decided to update the other Dell laptop. I decided that a simple MintUpdate would more than likely get me to the Revision 1 version. Nothing is ever simple. Immediately hit a brick wall. The repositories for medibuntu and Hardy security would not refresh no matter what I did. Arg!

This one was not directly a Mint or Ubuntu thing either, but a nasty interaction between the "apt-get update" process and the Internet cache inside our firewall. Since I have no control over the way Internet content is cached, it required a bit of research to work around. The solution came from a posting in the Ubuntu forums.

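sudo bash                  # open a root shell for the steps below

apt-get clean              # empty the downloaded-package cache
cd /var/lib/apt
mv lists lists.old         # set the stale package index files aside
mkdir -p lists/partial     # recreate the directory layout apt expects
apt-get clean
apt-get update             # fetch fresh package lists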

I also added the following lines, for good measure, to the file /etc/apt/apt.conf.d/10broken_proxy:

Acquire::http::No-Cache "true";
Acquire::http::Max-Age "0";
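
As a minimal sketch, the same file can be written from the shell with a here-document. The function name is invented for illustration, and the optional directory argument exists only so the function can be tried out somewhere other than the real /etc/apt/apt.conf.d, which needs root.

```shell
# Sketch: write the two no-cache settings into 10broken_proxy.
# write_apt_nocache_conf is a hypothetical helper; $1 optionally overrides
# the target directory (default: /etc/apt/apt.conf.d, which requires root).
write_apt_nocache_conf() {
    conf_dir=${1:-/etc/apt/apt.conf.d}
    cat > "$conf_dir/10broken_proxy" <<'EOF'
Acquire::http::No-Cache "true";
Acquire::http::Max-Age "0";
EOF
}
```

Calling `write_apt_nocache_conf` with no argument from a root shell writes the real file. A following `apt-get update` then asks any proxy in the path not to serve cached index files.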

Finally, just for fun, I refreshed the Medibuntu security keys:

sudo apt-get update && sudo apt-get install medibuntu-keyring

That did the trick.

Mint Everywhere?

One might be tempted to think that I just run Mint on all my computers... and I have to admit that is a temptation sometimes. I do not run Mint everywhere. I would never learn anything about the other Distros if I did that, so I keep some computers in reserve and running other OS's:

  • My main Houston Desktop is OpenSUSE 11 as I write this, but it has had some stability issues, and will *not* do a clean shutdown or reboot, so I may move that unit over to Mint in the near future.
  • I have OpenSUSE 11 on my IBM T41 laptop, where it runs very well.
  • My IBM X30 laptop runs plain-vanilla Ubuntu 8.04 at the moment.
  • My Acer 5610 dual boots Vista and Mint 5.
  • Both Dell laptops dual boot XP and Mint 5.
  • Another desktop runs PCLinuxOS.
  • My main Austin desktop runs CentOS 5, and I have an upcoming post about that.

There are subtle differences between various distros that sometimes end up making a big difference to me personally. Here is one: OpenSUSE packages NVU (and it is very unstable there), but Mint packages KompoZer (much more stable). NVU was developed by Linspire off the Mozilla Composer code base. Linspire stopped developing it some time ago: Well before they were acquired by Xandros in fact. KompoZer is an updated NVU, in the sense that it is based off NVU's code but it is still active. There are versions for both Linux and OS X, so no matter which platform I am using I can be writing stuff for one blog or another. That all assumes that I can not get to Google Docs of course. I wonder how many projects there are in the Open Source world like Composer / NVU / KompoZer. And with Seamonkey actively maintaining the Composer code base, I wonder if they pull back in anything that was done in NVU or KompoZer? But I digress.

Mint Still Going Strong

I have written about my brother's Mint system, and it bears repeating here as a proof point. My brother is not a computer person, and is not really interested in them other than as tools. Since he is a carpenter by trade, perhaps that is why to him everything is viewed from a tool-centric point of view. I built a computer out of parts that I later installed Ubuntu on and gave to him. Later, during a visit, I put Mint 4.0 on it. Last weekend I was at his house installing a new stick of RAM. He did not really need it: I just came into a spare 1 GB PC2700 stick from my mom and I thought it might fit his computer. It did, and now he has 2 GB RAM. Can you say "Disk Cache"?

In all the time he has had that computer, other than the time I had to replace his hard drive and update his video card, he has never called me about it. He and his wife have surfed the net, read email, taken classes at school, done papers in OpenOffice.org, etc. He doesn't even really see any reason to come up to Mint 5... or 5r1. It does everything he needs already. There is one big reason I have a tendency to put Mint everywhere. I don't have to support it. Stark contrast to when he and others in the family had MSWin systems.



_____
Thursday, August 28, 2008
Moves and grooves

Behind the scenes a great deal has been going on for me personally. I have not been posting a great deal here for a reason, and it is not that I lost interest in it, or ran out of Linux things to say.

First off, I am changing BMC offices, moving from our headquarters location to Austin, Texas. There is nothing sinister about any of that really: I just want to live in Austin, nearer to the Open Source action... the Bar Camps, and so forth.

Secondly, you might have noticed a change in the description of my title: I have also left management and returned to 100% technical work. Again, there is nothing deeply mysterious about that either. After being a first line manager for 20 years of my 30-year career, I noticed something: I had stayed technical. This blog is part of that, and herein over the last three years I have described in fair detail the technical things my team has been up to. In talking about that to my manager, we decided that perhaps it was time to be a full time techie again, and he helped me make that happen. That this also coincided with my move to Austin is no coincidence either. Everything sort of fell into place at the same time, and I have to say that while scary at first, I have been deeply looking forward to diving back in.

What that should mean for this blog is *more* material, not less.... once I get settled in to the groove of course. I have been doing my old job here for so long, it has taken me a while to get transitioned over. August also means vacation in Far West Texas for me of course, and I have been talking a little about my vacation adventures over in my personal blog.

Torn between two cities

Part of being straddled between two offices between now and December, when I make the big jump, is that I have two desks. Two offices. Two sets of machines to maintain. Fortunately, I can build computers with parts from the trash can and they are highly functional. My Houston office was stacked to the rafters with my computer resurrections. My PCLinuxOS unit came West as a place for me to land "here" (I'm in Austin as I write this) for starters. A test CentOS system also came out: the one that Dan had grabbed from me to set up a test system I talked about in the CentOS NAS cluster article series [part two ]. That is up and running, and so my new experiment was to look at CentOS as a user desktop OS. More on that in a different post, and later.

Another thing was resurrecting a Dell laptop and making it dual boot with WinXP and Mint 5. That too will be a different post. This is using the recently updated Mint 5 R1, so it will essentially be 'new'.

One other thing keeping me busy has been that, as BMC has bought a few companies, such as BladeLogic, we have had some opportunities to consolidate some of our regional R&D data centers. Here is a fun fact: about three years ago, we had over 15,000 computers in the CMDB listed as being assigned to various R&D missions. As we have moved towards various Green initiatives, and virtualized like crazy, we have taken that number to less than 9,000. I alluded to one small part of that in "Virtually Greener". Two data center consolidations will be coming up between now and next spring, and affect over 1500 of those computers. Getting that done and keeping R&D uninterrupted is a huge project, and this one does not generate a great deal of time with Linux other than as an end user. Lions and Tigers and Spreadsheets, oh my! Thank goodness OpenOffice.org has improved Calc with the 2.x releases!



_____
Monday, August 25, 2008
An unplanned quick look at LinuxWorld 2008

I did not actually have any plans to attend LinuxWorld this year, and I suppose that I barely actually did: I was there half a day as it turned out. Even in the little I saw today ("Today" being while I am writing this, which is Wednesday, August 6th, 2008) the show has changed. More about that in a bit.

I was in San Francisco for a completely different reason than LinuxWorld. I was in the Silicon Valley to do some work on a potential new BMC R&D datacenter. No new hidden announcements there: just consolidating six regional R&D data centers into one much larger and more modernly designed facility. It is amazing how fast a data center design goes retro.

Fun Fact: If you stacked all the computers we use for R&D in the Silicon Valley on top of each other, in the shortest possible dimension, that would equal a stack of computers 29 stories tall... and this is after we have retired hundreds and hundreds via virtualization. I know: Utterly useless knowledge, but kind of fun to know. If nothing else it helps me visualize the scope of the task it will be to move all this as smoothly as possible. Fortunately this is not my first time... and this isn't even our biggest R&D data center.

I finished up what I was in the Bay Area to do a little early, and someone at the BMC office had free passes to go to LinuxWorld, and asked if I wanted to attend for a half day. Not one to turn down serendipity, I of course went. I had to pay, but with half a day to spend there I was just going to get a floor pass anyway.

I have not presented at the SanFran LinuxWorld for a couple of years, and I have never been as a non-speaking attendee, so it was very interesting. Here are some of the things I noticed that seemed the same... and some that seemed very different. This is utterly my subjective experience of course. I was not really there long enough to square root the show, nor did I attend any sessions. From what I could see of the session list, that is still a very rich, fact filled experience.

  • Coming in, the lobby and the banners and the way everything was decorated was soothingly familiar. Very much like coming home. There was Tux all over the place, and the familiar light blue on white signage I have seen at so many of these events.
  • The T shirts and other stuff at the event store looked to have an even bigger and better selection than ever. I resisted another tie-dye LinuxWorld shirt only by sheer force of will.
  • There were fewer booths than last time I was here. I talked to one vendor who was in attendance but did not have a booth about why that might be, and they said that they used the Internet for a great deal of the things that they used to get from being on the floor. I get needing to spend the marketing money wisely, but it also made me sad: It looks like we have another endangered species on our hands.
  • The flavor of the vendor mix that was there was also interesting:
    • I saw lots of stuff about 10Gig Ethernet, FC over Ethernet, and 8Gig FC.
    • Lots of storage: I was especially interested in Promise Technology in that regard because of their recently replacing Apple's Xserve RAID product as Apple's solution for low cost data center storage. We liked our Xserve RAID gear quite a bit, but this gear looks better in every way but one: It is not as cool a physical design. Oh well, you clearly get more bang for your buck than with the Apple product of days gone by. Promise also actively supports Linux, which moves it to a new level for me personally.
    • Rackable systems had a Semi-trailer filled with a portable datacenter. Not the first time for that I know, but the first time I got to touch one. Very cool.
    • The .org area was as fun as usual: This time I spent some time at the DRBL / Clonezilla booth, and I will definitely be looking into these tools when I get back from vacation.
  • There must have been 1.5 Bazillion Linux powered netbook class laptops: in vendors' booths driving displays, and in attendees' hands as their mobile computing devices. Makes sense, since they have sold those by the truckloads. If it wasn't a netbook, it was an Apple, and several of the Apples were running Linux. The MacBook in the Clonezilla booth was running Ubuntu.
  • When I first started going to LinuxWorld, RedHat, SuSE, Xandros, and so forth were there. Even though Wednesday (the day I was there, which as I write this is still today) was "OpenSUSE day", they had no booth I could find. Neither did RedHat or Xandros or MS. MS dropped out pretty early I think. That just could not have been comfortable.
    • Side Note: RedHat used to give away red Fedora hats at LW: I always thought that was the best gimme ever at a trade show, even beating BMC's own combo laser pointer / pen (or, "laserwriter" as I used to call them when I was giving them away).
  • Ubuntu / Canonical *was* there. They were not before.
    • Is it just me, or is Ubuntu pretty much everywhere now?
  • I saw several products listing Mandriva as supported: Never noticed that before. Good sign.
  • Saw one vendor listing PCLinuxOS as supported. Also good sign.
  • Linux based hardware appliances were all over the place: WAPs, cell phones, general handhelds, and on and on.
  • Bumper sticker on an Apple: "My Other Computer is a Data Center": From Google.
    • At one point on this trip I could not get to the Internet... while I was writing this in fact, so I could not use Google Docs as I often do. Had to go with KompoZer, which is better for HTML generation anyway. Looking at all the Linux powered netbooks, and thinking about how Apple pulled the iPhone tethering application recently, it seemed to me that Sun's "the network is the computer" is still in force, and that we are still a ways away from ubiquitous network access.

My general feel, after walking around and talking to people and looking at stuff, was that Linux had turned a corner sometime between the last time I was here and this time. It used to be "Linux can do it", where "it" was defined pretty much as "Anything", from desktop replacement to embedded to server. The claim that it could do "it" was based on the fact that it factually could do it, not that it had huge market uptake or maturity.

This felt different. This looked like an event that was about something that was utterly mainstream. It felt like a mainframe conference of old, where all the vendors were selling things that made the MF work better  or analyzed it in some way or added missing functionality (Hey!  We do that!).

In a way it was a little hard to deal with. It was one thing to be an early adopter, but now, looking at all the netbook users running around I realized in some ways the Linux world has caught up to and even passed me a bit. For one thing, I left my XO-1 in Houston, although the netbooks looked more usable in the keyboard department than the XO-1 is in any case. Rats. Time to start saving my pennies.....

Welcome to the Linux World.



_____
Thursday, August 07, 2008
OpenSUSE 11 General Availability as an ELD, Now with secret sauce

I mentioned a few posts back that I had a test system stack: four identical older systems that I set up to be able to test Linux. The idea was that I could do back to back comparisons and have a good idea how each Distro of Linux stacked up on the same hardware at the same time. No sequential reloading of Distros on the same computer. Just a quick switch of the console via the KVM to look at the same thing (OpenOffice.org, Gnumeric, Evolution, Firefox, whatever is piquing my interest...) on the same type of computer, but two different Distros.

I took it all apart today. I did not reckon with two problems.

  1. Heat and noise while they were up (I left it all down when I was not using 'the stack'). The noise came from the fans in the KVM. Note to self: data center grade gear is lousy for office use.
  2. Even if the hardware looks the same, and specs out the same, when it is old, it does not necessarily act the same. This is probably true even when hardware is new, but as it ages, it becomes more pronounced. In particular, the video cards and how well they worked with the KVM, and the hard drives, and how some systems seemed to be in I/O wait for no apparent reason against /dev/sda.

I had a third reason for doing what I did, which was to learn how our new data center standard KVM switches work from actual setup type experience. I am always looking to stay as current as I can on all sorts of tech, and I had not had a chance to "play" with these yet. That being done, it was time to move on.

OpenSUSE 11

I mentioned in that post about the test stack that I was testing OpenSUSE 11 Alpha. It has since GA'ed, so it was time to go back and have a look. Unlike Fedora 9, OpenSUSE 11 had installed fairly easily even in Alpha state. I expected the GA to be smooth, and it was. All you have to do is look at all the trade reviews of OpenSUSE 11, and read all the praise for the changes that it has brought to the OpenSUSE party, to get the feeling that R11 is a significant upgrade over what came before it.

A great deal of the excitement surrounds the fact that the software installer and updating process are significantly improved. They are not yet quite Ubuntu / Mint easy, but they are light years better than they were, and are closing in on the leaders of the pack. It is now dead easy to enable alternate repositories, including ones that allow you to install binary only drivers like Nvidia and ATI's. This, as it turned out, would be key for me.

I did not want to install R11 on 'the stack'. I wanted to turn that off and take it out of my office. My IBM T41 was nominated instead. It has always worked well with SUSE in the past, so I assumed it would be easy, and it was. Boot the LiveCD, run the installer, answer a very similar to Ubuntu set of questions, lay out the hard drive manually as always, and then let it spin on down.

Since the T41 had been running Mint 4, the OpenSUSE look and feel was replaced from the get-go with my customized desktop: Space Shuttle landing at night picture, standard Gnome tool bar at the top of the screen. Some things are missing:

  • No gkrellm is available from any standard OpenSUSE repository. My favorite system monitor... well, other than Patrol of course. I am sure it is out there someplace, and when I get a spare moment, I'll find it.
  • Sensors, avahi, etc all have to be installed since they were not on the LiveCD image, but they are available. 
  • HDDtemp is not available! 
In no time at all the desktop looks more or less the way I like. The tool bars are stocked with goodies. The Wifi card works out of the box and with no muss or fuss (something that Fedora would not have done). Evolution finds the Mint created config files and appears to work well.

Phase 1 complete. No casualties.

Crispy Nvidia 7300

Shortly after I finished up the T41, my Dell 745 desktop, running Mint 4.0, starts acting flaky. It moaned and hummed and whined and wheezed. I opened the case, and watched the fan on the video card stop and start. Speed up, then slow down. Whine then run silently. Uh oh.

A few days later, video stops working on the second monitor. "lspci" says that there is no Nvidia card at all. 

I do what any geek faced with such a situation would do. I went to Fry's (I gotta love a store that has parts to build your own Linux computer and also sells Apple stuff). There I picked up an Nvidia 7200CS that had a big heat sink rather than a fan on it.

In the 7200CS went, and no luck. Mint acts like it can not see it. I decided to try OpenSUSE and see what it would do. My thinking was that OpenSUSE, being from Novell and the Open Source members of that project, should have the worlds best implementation of Evolution on it: Novell bought Ximian, creators of Evolution and the Evolution connector. In the past the SUSE version of Evolution had always been at least workable. This would give me a chance to see how well OpenSUSE worked on desktop hardware, with dual heads, with the Nvidia repositories, and with Evolution.

Late that night, I booted the OpenSUSE 11 LiveCD that I had used on the T41, and it all worked pretty much the same as it had. For fun I tried to use the Open Source Nvidia drivers first but they would not enable the second monitor. The closed source ones worked fine, and enabled the "twinhead" setup. I was back in business. Even Compiz worked, and that had never happened on the 745 with the Nvidia 7300 and Mint.

Evolution came up, found everything where Mint 4 had left it, and I was off and running. Well. Not so much.

Stable for 24 hours, then a failure. Evolution Connector crashed. 

Evolution 2.22

Evo 2.22 in SUSE has a slightly updated look and feel relative to that same app in Mint 5.0. A few more plugins appeared to ship, although I did not compare them line by line.

My desktop can *not* have an unstable version of Evolution on it. It is my main place to read email, check my calendar, open tasks to myself, update contacts, filter emails from various mailing lists into folders for offline reading, etc.

I installed the debugging symbols for Evolution and Connector, and went into the business of sending crashes into the Gnome project. At first it crashed when I was using it. Then it started to crash even when I was nowhere near the computer. More and more, faster and faster, closer and closer together.

When I say crash, I mean Connector crashed. Evolution stayed up and running. It was just useless.

I created a clean ~/.evolution directory, and slowly brought back over the mail folders from the backup copy now called .evolution.mint. I went through and disabled plugins that were not useful in our MS Exchange based shop, like Hula and Groupwise related things.

Crash. crash. crash. 

And now, the secret sauce....

I was trying to decide what to do, and had just about opted to move to Mint 5.0 on the desktop, with a fall back plan to Mint 4, which has been stable. Then, I noticed something odd. A pattern emerged. Every single time Evolution Connector had crashed when I was there to observe it, it had been when the inbox was being filtered: When the rules were running that kept my inbox cleared out. A little status message in the taskbar about filters running was there every time, and always at 0% complete. It looked like a new message was arriving, triggering the rule to run and parse it, but that the rule was immediately freezing and Connector was crashing shortly after that. I have about 20 Rules in the ruleset. I would not think that was a large number, but who knows? My quick looks at the crash dumps before I sent them in to Gnome made me think the crash was happening in the same way every time.

I decided to try something.

  • I disabled filters aka 'Rules' in Evo-speak on INBOX for Evolution Connector.
  • Created and enabled IMAP account to the same MS Exchange 2003 server Inbox
  • Turned on filtering on IMAP. Same exact rule set, same exact Inbox, just running via IMAP rather than Connector. 
  • I made IMAP my default account. The Connector account was there and active, just not default. This means, among other things that outbound email is being delivered via SMTP rather than through the Connector's WebDAV protocol.
My idea and experiment: use Connector *only* for Calendaring, Tasks, and Contacts (including GAL lookups). Take the stress off the Connector code. If this was a timing related or load related issue.....

It has not failed even once since I did this, which means about 5 working days of uptime. Other than the first 24 hours of stability, I could not get Connector to stay up for more than a few hours at a time. It appears that Evolution Connector and the built in rules facility are not compatible at this time, at least with OpenSUSE 11 and Evolution 2.22.

In retrospect, it probably should have been a clue that the OpenSUSE 11 installation on my T41 laptop never had an Evolution crash. I do not run filters there.

Enterprise

As usual, I have to ask the question: is OpenSUSE 11 a viable desktop for an enterprise? Not for geeks like me but for the average computer user that does not want to know anything about the computer itself: they just want a tool to get a job done.

The desktop itself is easy to use, easy to configure, easy to update, and a strong preview of what is to come in the next release of SLED (SUSE Linux Enterprise Desktop). It has all sorts of standard Open Support, from Wikis to mailing lists to online doc.

From what I have seen the system is pretty solid except for my corner case of Evolution against MS Exchange 2003 running a fairly large set of filters on my inbox via Connector. I'd have to say I would probably have no problem supporting it, and would prefer all the new shiny goodness of OpenSUSE R11 versus the getting-long-in-the-tooth SLED 10. For the first time ever, I have left OpenSUSE on my primary desktop to be used as my primary OS at the office.

Mint will stay my primary at-home Linux version. Instead of Mint-everywhere, I'll be jumping back and forth. A new experiment has begun.



_____
Sunday, June 29, 2008
Interesting, but not Enterprise. Not that they ever said it was

The problem with asking a technogeek whether or not something is possible is that you will almost always get back the answer "Yes".

"Can a program be written that monitors all the computers on the network, regardless of who makes it, or what OS it is running?"

"Yes"

"Can it be ready a week from Tuesday?"

"What year?"

There is the rub: a technical question needs a scale framed around it. Is Linux a viable desktop OS: Yes. Can we use Fedora at the office? Yes.

Those last two, while true, ignore scale and ignore training and ignore whether or not other 'flavors' of Linux would be better. Our recent experience with replacing our Tru64 TruCluster with a CentOS based cluster is a lesser example: yes it was possible, but it did require having a guy like Dan Goetzman to read the kernel code, read the traces, find the problem, and write a workaround. Since then, it has worked extremely well. You can not ignore the fact however that most companies do *not* have a Dan or even a Dan-like person on their team. Such skills, while not unavailable are rare enough that most folks just go with a vendor created solution.

That is the eternal tradeoff of IT: Roll your own and get exactly what you want, but then be forever locked in to being the maintenance and update group, or go with a vendor solution where all of this is essentially outsourced.

Our CentOS cluster is an enterprise grade solution, but in point of fact, only because Dan is standing behind it. CentOS has no vendor support. Without Dan, we would have used RedHat Enterprise Linux and bought support instead.

It is in this frame of reference that I went to look at Fedora 9. I know it is not supported, and that it is not meant to be an Enterprise Linux Desktop, any more than my recent foray with Mepis is or was. Fedora is a technology exploration, and I was exploring.

Back to the Stack

I started out a while back to create a test environment where I could compare various Linux environments side by side. At the time, Fedora 9 was pre-GA, and was not behaving well on the test gear. I was trying out the LiveCD version of the install, but Fedora was just not getting the video right, where pre-GA or just-recently-GA versions of Ubuntu, OpenSUSE, and Mandriva were working fine on the exact same type of computers. These are standard Dell desktops no less. Nothing too weird about them. Certainly not laptops and their more esoteric hardware.

Even before I did the test installs, I was starting to feel a certain level of frustration with Fedora. I could not quite figure it out. It used to be my *main* distro. I used it ahead of everything else: Where Mint sits today, once there sat Fedora: From Fedora releases 1-5, it was, for me, the *it* distro, replacing Mandrake.

With Fedora 1 through 5 I had to hack the wireless to work on all my laptops. I was getting downright fast at it: either finding the unsupported-by-Fedora-but-Linux-native-driver-stuff, like MadWifi, or shortcutting it with NDISWrapper. Either way, Fedora was on the air in short order. It was no harder to get going than SUSE back then, and Fedora hacks were better documented on the Internet. It seemed like everyone used it.

When I started using Linux as my full time desktop here at the office, it was Fedora. Not any more, and not for a while. Nowadays, I only install it to see what is happening in it, and it usually ends up being frustrating because in terms of ease of install and supported hardware it has been passed standing still. Fedora feels stuck in the past, with the Anaconda installer: In truth it is no different to install now, in terms of difficulty and the need to add in 3rd party repositories, than it was in the beginning, or at least that is the way that it feels. One person at the office (a fellow Linux desktop user) said that they felt that Anaconda itself was getting more fragile with every release.

Ubuntu, Mint, Xandros, OpenSUSE, Mandriva, PCLinuxOS... you name it. All of them are dead easy installs, and usually they just work out of the box.

Fedora stands alone

Fedora is outstanding in its field: That is where we found it. Out standing in a field... Sorry.

I have known intellectually for a long time that Fedora is different from all the other highly used Linux Distros. Knowing that and having a visceral understanding of it are not the same thing though. I used to think of OpenSUSE as being a kissing cousin to Fedora, once SUSE started to use the Fedora-like development model. But there is a big big difference, especially now.

Here is where I get into trouble sometimes when I am looking at things like this. I have to recall that Fedora may look exactly like any other Gnome-based Linux: same menus, same packages, same projects underneath it all, but it is assembled out of the bleeding edge stuff. Can it be made to work: yes. Is it interesting to see what some packages are doing? Yes. Should you use it as an ELD: Only if you don't need support.

The OLPC project has been working for a long time to create a production version of Fedora 7... and the Fedora project is two releases down the road from there. Support on OLPC is about what you'd get from Fedora too: online forums, Wiki pages for Doc, etc. No number to call, no throats to choke if you are so inclined.

You can get support, from a commercial company, for years, on Ubuntu (especially the LTS versions like the current 8.04). Mint is community supported but close enough to Ubuntu to be pretty supportable. Many of the things published in the Ubuntu forums work on Mint. Xandros and SUSE stand behind their versions with support options.

You want support on a Linux desktop from RH, you go with Red Hat Enterprise Linux 5 Desktop or one of its variants.

Part of what made the relative lack of support for Fedora pop back into focus for me was a note I got from the CodeWeavers folks:

...The bad news is that extensive testing on Fedora Core 9 has revealed
severe problems with FC9 itself.  There's a serious font-drawing
problem, and also a periodic crashing bug.  Both of these are problems
in Fedora and outside of our control, so this latest release is likely
to exhibit those problems as much as the betas did...

That was interesting in two ways: The obvious technical issue, but also that CodeWeavers was *trying* to support Fedora as a viable desktop for Linux.

Looking at BMC for a moment, we only support versions of Linux that have vendor support for the version: currently Novell and Red Hat GA releases. I personally would like to see Ubuntu added to that list, but I am sure that comes as no surprise to anyone that reads this blog. No, I am not announcing or hinting at anything. Just wishing.

ELD and Fedora

Anyone who is a Linux maven could make a go of Fedora as an ELD. I know lots of people here at BMC that do just that. Fragile installers do not scare them, and fixing drivers is no big deal, etc. Fedora, for them, is beauty because it is bleeding edge. Sure the Xorg server in R9 is experimental and causing screen tearing. Now. In a few days or weeks it will get fixed, and then they'll have access to the latest greatest features. The speed. The bleeding edge hardware support. It will have been worth it. To them.

As an ELD for the masses though, all Fedora is going to do is give you a clue as to what you will see at some point in the future: maybe RH ELD 6. And even that is not a dead certainty: RH will err on the side of stability, so some bleeding edge stuff will not make the cut. Maybe RH ELD 7. Maybe never.

For use in a shop like ours, an MS Exchange-based shop, I always look at what Evolution is looking like and how it is behaving. I have learned over the years that the point release of the project is not all you need to know. The way that the Distro packages and tests it is important. See what happened when I tried to run Evolution under Mepis for example.

I did test 2.22 on Fedora 9. It works almost the same as 2.12 did on the last Fedora, and it also works about the same as 2.12 or 2.22 does on Ubuntu or Mint. Recall that 2.12 and 2.22 are adjacent releases, despite the jump in the numbering. Evo has all the same features, and all the same problems. Do a mass delete from Evolution on one computer, and the other one will completely lose track of the inbox message count. Exchange back end crashes fairly often still. Finally, nothing has really happened (as I feared it would not) on the MAPI support front.

I did do one experiment I have never done before: I set up both IMAP and Connector at the same time and pointing at the same inbox on the same server. When the connector crashes, IMAP keeps right on running. This tells me that the instability is probably not in the base Evolution code, but in the protocol connector of "Connector" itself.

Install of Fedora 9


I originally planned this post to be about how the Fedora 9 installer has changed between the pre-GA and GA code. I changed my mind. I was interested in the 'fragile' comment that had been made. It matched my experience with the pre-GA LiveCD. I decided to go conservative, and download the install CD set (the Dell test computers not having bootable DVD media). It was by and large the same Anaconda install I have seen for a while now, except that it would not run in GUI mode. I had to run it in the character-based curses mode to see it. No big deal: Done that before.

When the final boot came, the same thing happened as with the LiveCD: The video mode was whacked (same as the GUI install, it appeared), and the boot messages were invisible. The Dell monitor said "This video mode can not be displayed".

I booted to single user, erased /etc/X11/xorg.conf, ran 'system-config-monitor', and got past that problem. But now 'firstboot' had not run, so I manually added userids and config-ed things on the system that the firstboot stuff normally does.

I don't know if this is a global thing or not, but I have to agree now about the fragile comment my co-worker made: the installer is not very solid. We both have Dell gear to work with so it could just be a limited sample type problem. Given the ubiquity of Dell gear, and the fact no other OS is having these issues with the same hardware, that seems odd. Perhaps by being on the bleeding edge some backward compatibility was left behind?

Once up and running, it is a very standard Gnome desktop: None of that MS look-and-feel stuff that SLED or Mint is going in for. It is fairly crisp on the older hardware, but that is a Linux hallmark. I would have been shocked to see it going slowly. 2.0 GHz and 512 MB of RAM is still a *big* Linux system, even if this hardware is over three years old.

Applying maintenance via yum makes the video break again on reboot. I guess a new xorg came in, and replaced t