Is your IT up and green?
Most managers know that acquiring technology designed to be more robust and reliable should be part of their IT strategy. In that analysis, we weigh many factors before making a decision. We might even look at the manufacturer's Mean Time To Failure (MTTF) specifications, along with our own experience with that vendor's products, before choosing.
Here are two studies on hard drive failures, presented in the Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST '07), that might change some commonly held beliefs:
http://www.cs.cmu.edu/~bianca/fast07.pdf
http://research.google.com/archive/disk_failures.pdf
The Google study of a large disk population found little correlation between failure rates and either higher utilization of ATA drives or operating temperature. The Carnegie Mellon study showed that failure rates were much higher than MTTF alone would suggest, and that SCSI, Fibre Channel, and ATA drives had similar rates of failure.
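To see why a datasheet MTTF can mislead, it helps to turn it into the annual failure rate it implies. The sketch below does that under the textbook assumption of a constant failure rate. The function names and the 1,000,000-hour figure are illustrative assumptions of mine, not numbers taken from either study.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760 powered-on hours, assuming 24x7 operation


def nominal_afr(mttf_hours: float) -> float:
    """Annual failure rate implied by an MTTF, assuming a constant
    (exponential) failure rate. This is the datasheet view, not a
    measurement from the field."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mttf_hours)


def expected_failures_per_year(drive_count: int, mttf_hours: float) -> float:
    """Expected number of drive failures per year in a population."""
    return drive_count * nominal_afr(mttf_hours)


if __name__ == "__main__":
    mttf = 1_000_000  # hypothetical datasheet MTTF, in hours
    print(f"Nominal AFR: {nominal_afr(mttf):.2%}")
    print(f"Expected failures among 10,000 drives: "
          f"{expected_failures_per_year(10_000, mttf):.0f}")
```

If the replacement rate you actually observe in your shop is several times the nominal figure, that is the kind of gap the Carnegie Mellon study describes.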
Historically, failures have been accepted as part of normal operation in IT. Therefore, contingency plans, formal or informal, have been used to repair or replace equipment and to plan for disaster recovery (DR) events. Redundancy in storage, such as RAID, is now a given in most shops. That stands in contrast to the days when disk failures plagued IT on all platforms and usually meant unplanned downtime for applications.
However, some may not realize that a higher number of servers can actually lower your uptime and reliability rather than improve it. Here is what I mean by that. When an application is spread out over many servers, each of which it depends on, there is a greater chance that a single failure will bring down that application. Many years ago, there was a push to move applications away from central computing to a distributed environment, perhaps to gain better control over IT service requests and to simplify the change management approvals groups needed to upgrade their own systems. This artificial fencing of applications onto separate servers, with even more servers now intertwined to support larger applications, began to have unintended consequences.
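The arithmetic behind that claim can be sketched in a few lines. This assumes the application needs every one of its servers to be up and that servers fail independently, which is a simplification. The availability figures and function names are hypothetical, chosen only to illustrate the trend.

```python
HOURS_PER_YEAR = 24 * 365  # 8760


def series_availability(per_server: float, servers: int) -> float:
    """Availability of an application that requires ALL of its servers,
    assuming independent failures: the individual availabilities multiply."""
    return per_server ** servers


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    per_server = 0.999  # hypothetical "three nines" per server
    for count in (1, 5, 10, 50):
        a = series_availability(per_server, count)
        print(f"{count:3d} servers: availability {a:.4f}, "
              f"about {annual_downtime_hours(a):.1f} hours down per year")
```

Each server added to the chain multiplies in another chance of failure, so the application as a whole is always less available than its weakest server.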
What became apparent was the need to abstract physical computing resources away from the application. The mainframe, which introduced virtualization years ago and has continued to improve its own reliability through redundancy and hardware component management, led the way in processor design. Having a system that can manage its own failures is critical to large applications, or to a large number of them in a consolidation project. Yes, it is true that the distributed environment led the way on RAID technology; put the two together and you can have a more monolithic computing environment with the best of both worlds. In large consolidations that include the mainframe, you can have failures that won't bring down the application.
Yet it would be naive to look only at hardware cost in the TCO without considering the cost of maintaining the equipment and the personnel required to do it. Clearly, 3000 servers require more work and more personnel cost than 100 larger ones, or than a move to a few mainframes. The cost of failure should be included as well. So green technology is about lowering all associated costs while keeping better uptime along the way. Today, for Linux and z/OS, you can have zSeries processors all the way up to a fully configured IBM z10 Model E64, whose performance IBM compares to that of 1500 x86 servers. For Unix, the Sun SPARC Enterprise M9000 is sold as more mainframe-like in its RAS (reliability, availability, and serviceability). These are but a few examples.
It is becoming clear to many datacenters that finding the right mix of platforms, where consolidation leads to enhanced manageability and reliability, is what going green is about. Yes, it's the environmental thing, but it is also green you can save in your wallet.
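The idea that a system with built-in redundancy can suffer a failure without bringing down the application can also be put into numbers. This is a minimal sketch assuming independent failures and instant failover, which real hardware only approximates. The 0.99 figure and the function name are hypothetical.

```python
def redundant_availability(per_component: float, copies: int) -> float:
    """Availability of a component group that stays up as long as at
    least ONE of its redundant copies is up (independent failures,
    instant failover assumed)."""
    return 1.0 - (1.0 - per_component) ** copies


if __name__ == "__main__":
    per_component = 0.99  # hypothetical availability of a single component
    for copies in (1, 2, 3):
        a = redundant_availability(per_component, copies)
        print(f"{copies} copies: availability {a:.6f}")
```

This is the same reasoning that makes RAID worthwhile for disks, applied to processors, memory, and power inside a single large system.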