You are here: Home » Blogs » David Wagner » Twenty-First Century Capacity Management

Twenty-First Century Capacity Management

When Planning, it is critical to discern the time to act

So much has been happening in the arena of optimization of Data Center Server resources, power, and cooling, along with the worldwide hype cycle around all things green, that it can make your head spin!

In my case, my head has been spinning of late around what MY personal plans should be.  As you all know, I'm hugely passionate about all things efficiency and server related; I've spent 25+ years now thinking about ways to optimize computers and their usage... The further I got into researching the implications of the game-changer of Virtualization, the more excited I got about all the possibilities.

Now we see the Green Grid kicking into high gear, trying to bring together multiple hardware and software companies around this entire macro-issue.  I believe we really are on the cusp of an all new era in capacity management and resource optimization. And I believe I can be a key contributor to the larger solution space.

I believe this so strongly, that I am taking the boldest possible action I can conceive: I am leaving BMC on May 1st to go off and start my own endeavor in this area!  For that reason, it is highly likely that this will be my last blog entry here at BMC. And I wanted to proffer my thanks to all of you who have contacted me, shared your thoughts, insights and ideas, and generally guided me during this exciting time.

In my new endeavor, I am excited and thrilled that I will continue to work closely with the BMC company, partner, and customer ecosystems, as well as expand beyond it to try and offer comprehensive, simple, and high ROI solutions for this complex area.  As such, I will continue to be in direct, and indirect, contact with many of you going forwards. To me that is a very much undeserved, but gratefully accepted, benefit.

Special thanks go out to BMC; a stronger, more passionate group of IT software solution professionals I've certainly not worked with over the last 25 years... That, more than anything else, is why I've continued to stay at BMC!  But sometimes, no matter how wonderful the environment, no matter how open to implementing new ideas, no matter how strong the capabilities of your co-workers, the relationships built up with customers, etc... Sometimes it is the right time to venture off on your own, and "live or die" on the strengths of your individual vision, passion, skills and desires to indeed make the world a better place. So, that is what I'm doing!

After I get things more established - I'm literally going to be hitting the ground running in this space! - I plan to do all the usual things like setting up a website, blog, etc... But you wouldn't imagine (or maybe you would - LOL) how much time-sinking work it is to actually set up corporations with the lawyers, accountants, etc...  That is important stuff that has to happen quickly!  So, much as I personally enjoy this communications part, well, it's going to have to wait a bit while the more mundane things take precedence.

A more personal thanks to the senior management here at BMC is also in order. I have really appreciated the leadership of Bob Beauchamp over the time I've worked under his organizations all the way up to and through his leadership as our CEO. I'll never forget the time a competitor came up to me at a user-event right after Bob had spoken and said (paraphrasing): "Wow, you guys are lucky to have a CEO that really gets it like Bob does!"... This was no marketing spin-meister, or dilbertian "sales weasel" (apologies in advance to the non-weasel sales professionals out there!) mind you... This was a highly experienced, technically proficient development director for one of BMC's direct competitors in the Performance and Capacity market segment... Bob's leadership has truly set the stage for an entire new set of solutions in the Systems Management space: Business Service Management.  And BSM has now in its turn set the stage for acceptance of more holistic, business-value based solutions. Such as solving the data center power capacity issues.  Without Bob's vision, there would be no BSM, no BMC blogging... and no "a lot of other great things from BMC" as well!

Personal thanks to our CTO office and Senior strategists, who have been very encouraging: Tom Bishop, Kia Behnia and Herb Van Hook. Your wisdom and grace not only continue to benefit BMC, they've benefitted me personally - Thanks!

For those of you who want to get in touch with me, I suggest contacting those you know at BMC; they'll be happy to point you to me (at least until I become "visible" in the new entity!). Or, you can find me on LinkedIn (which I highly recommend as a sort of professional's version of MySpace for the network-building and network-maintaining IT professional!).

So...

Best regards, keep it cool, and, please,  be efficient!

Dave



_____
tags:
Monday, April 30, 2007  |  Permalink |  Comments (0)
Wherein a "horror story" around Data Center Optimization is experienced first hand...

I know - first rule of blog-dom is to not go this long between updates... but well... I've been slammed (almost like I myself have been consolidated, virtualized, and now am running  multiple concurrent workloads)...

Seriously, in my last blog entry I predicted we're going to see more and more "horror stories"...

Here is a brand new one, experienced first-hand just a couple of weeks ago (well end of January actually, but who's counting!).

Context: sales opportunity call

I met with the Global Head of Data Center Services (reports to CIO) of a "Top 10 worldwide financial services" corporation. Owns responsibility to manage all servers worldwide. Their IT budget is around $2Bn per year, with an IT organization of over 20,000 personnel worldwide. And, no I will NOT name them, ever.

Everything in "quotes" is verbatim from my notes...

Their biggest challenge?

The current data centers are:

  • "In the cities"
  • "Obsolete - they are 30 years old"
  • "Out of power, floorspace and cooling"

We're hearing a lot of this... so far, so good!

They are going to implement 10 all new data centers, "completely from scratch in all new locations: 4 in Americas, 4 in Europe, 2 in Asia"

Seems sensible... after all, they are in a real tough position here!  (Our rep is getting excited! LOL) But then, here is where the going gets weird, and the result is scary!

Because of the urgency of their lack of data center capacity, and their focus on cost reduction, they are treating the question of "where to put these new data centers (geography, power costs, labor costs, security)" as their top priority.

Because their organizational culture, led by the CIO, is one of "urgent action", they are "planning an actual migration [that] ignores planning".  They have been given a timeframe to go from site selection to completion of implementation and migration in 18 months! For tens of thousands of servers. They don't even have sites selected yet!

"Existing applications and technology platforms will be migrated as-is... Basically, there will be an update of the underlying HW technology (latest server, cpu, etc.) but no paradigm shifts like consolidation or virtualization..."

"We plan to do no consolidation or virtualization projects that aren't already done until after the migration to the new Data Centers"

At this point in the meeting, every cell of my being wanted to shout from the roof-tops: "SAY WHAT????? What are you, nuts?"

Time for some "quick thinking" - this was, after all, a sales prospect call!

Lemme get this straight... typical IT shop, running at resource utilization levels of 5-10% (or less!)... They are going to invest in 10 new data centers, at a cost of several hundred million dollars... and do a one for one replacement of server hardware technology...

Wonder what their average utilization will be AFTER this is done... Replacing 3-5 year old x86 CPU-based 1U rack servers with multi-core and Blade Servers - without consolidation? Without Virtualization?

I'm guessing their average utilization will be so low they won't be able to measure it consistently!
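
To put rough numbers on that guess, here is a minimal sketch in Python. It assumes the same workload and the same server count after the migration, and that the new hardware is some multiple faster than the old. The function name and the 2x and 4x speedup figures are hypothetical illustrations, not numbers from the meeting.

```python
def utilization_after_refresh(current_utilization, speedup):
    """Estimate average utilization after a one-for-one hardware refresh.

    Assumes the workload and server count stay the same, and that each new
    server can do `speedup` times the work of the one it replaces.
    """
    if not 0 <= current_utilization <= 1:
        raise ValueError("current_utilization must be between 0 and 1")
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return current_utilization / speedup


# Hypothetical speedups applied to the 5-10% range mentioned above.
for current in (0.05, 0.10):
    for speedup in (2, 4):
        after = utilization_after_refresh(current, speedup)
        print(f"{current:.0%} today, {speedup}x faster hardware -> {after:.2%} after")
```

Under these assumptions, a shop at 5% utilization that swaps in hardware four times faster ends up a little above 1%, which is the direction the post is worried about.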

Doing my best to control my body language, I gently asked (while our sales rep cringed silently!):

"May I, ahem, ask why you aren't doing the capacity planning first?"

"We recognize this is a unique opportunity to redo all our data center kit, but we are not focusing on that right now, there is no time. We have no capacity planning team, and, frankly, nothing against our very talented ops teams, but I am not blessed with an IT Operations staff that have foresight"

I wish them all good luck... And I do hope they will be receptive to our efforts to show them how they CAN have time to do planning correctly (so they can at least right size the HW even if they don't do virtualization)... Some capacity planning is better than none!

Frankly, I don't see how this can possibly succeed without doing capacity management... Best case, they are wasting probably $100M...

Conclusions:

  • If you don't PLAN today, you won't have time to plan tomorrow
  • Customers who don't PLAN are HW vendors' favorite people
  • Your lack of planning doesn't constitute my emergency, it will constitute YOURS (sooner or later)
  • I'm glad I don't have any of MY personal assets invested with this company!

Excuse me now while I plan my next trip

Dave

 



_____
tags:
Thursday, March 22, 2007  |  Permalink |  Comments (0)
Some ramblings on observations from Gartner Data Center in Las Vegas last week

Gartner Data Center in Las Vegas was a really "validating" experience for me in terms of some of the topics for this blog...

Let's talk "Hot Vegas" first

First, I really have to offer some additional comments (apologies?) on SprayCool (whom I discussed before). I actually got to see them in action, live on the floor. It is quite cool (HA!) technology. It doesn't use water, but rather some kind of special liquid chemical that has an evaporation point around the optimal temperature for chips. So, it sprays the liquid onto the hardware (they have heatsink sized versions, as well as variants for other chipsets such as RAM, and they can also encapsulate the entire board!), and then evaporation cools it off.  The condensate (having removed the heat) is then collected and routed to a plenum at the back of the rack... it all then flows down to a heat exchanger (one per rack) at the bottom, where it can be pumped out of the rack and into the chilled water loop of the data center. Worries of toxicity (in case of leaks) were addressed by the bold claim by the marketing rep "I'd drink it if I had some here in a glass!" (shudder!)...

The good part is that this takes only 130 watts in total (for an entire rack). So, it's highly efficient at moving heat - a big challenge. The downside, as I discussed before, is that there is simply no way a data center can do this for all their servers and racks. Not only would the capital costs be outrageous, but the sheer complexity would be immense.

All this being said, it looks to be an ideal solution for solving "spot" heat issues (localized hot racks, hot zones) to buy some time in the data center "fighting the heat" battle... But you are only moving the heat caused by server inefficiency, you aren't solving the root cause of the problem...

Matt Stansbury of SearchDataCenter.com has an interesting summary/take from the show which really ties the challenges back to the types of solutions we had in the days dominated by proprietary mainframes. His number one conclusion? "Everybody is looking for metrics" - imagine that!  Again, if you can't measure it, you simply cannot manage it.

But it's really hard to get all the metrics needed, let alone into one solution!

Speaking of hard...

Had some interesting discussions with Aperture, whose Vista solution is a really nifty data center planning and optimization tool focused on the physical components of the data center.  In many ways it reminds me very much of the hardware/physical analog to our BMC Performance Assurance solutions (it has a hardware library, modeling capabilities, purpose-built graphics, etc.), all designed to allow data centers to understand their infrastructure capacity - not in terms of performance (response time, throughput, resource utilization of servers, etc. like our solution), but rather in terms of power, floorspace, rackspace, and similar capacity variables... Melding the knowledge provided by both types of solutions is clearly going to have to become a recommended best practice, I think...

The elephant in the room...

That nobody is talking about is the reality that all this complexity, all this wasting of power, and all this service risk, is caused by having too many servers! If things weren't so darned underutilized, everything would get orders of magnitude simpler. Why don't more folks realize this?

But, cooperation is sooooo difficult...

There was a great Gartner-hosted session (Ronni Colville and Kris Brittain - "IT Operations - the Three Tenors: Change, Configuration and Release Management") where they really drilled down on the unique challenges of managing Change, Configuration, and Release Management, especially the reality that these critical data center process disciplines typically span organizational silos, thus slowing their adoption! My informal survey of folks there (as well as some of Gartner's famous "real time/spot surveys") seemed to indicate that many folks are embarking on consolidation and virtualization (to reduce physical server counts) prior to implementation of complete Change, Configuration and Release Management... This seems to be due to this "different buying centers" phenomenon...

So, on the one hand, you can't get to Virtualization or a Real Time Infrastructure without Capacity Planning (presented and reinforced by both Tom Bittman in his keynote on Virtualization, as well as by Donna Scott in her keynote on the Real Time Infrastructure), and on the other hand, the complexities of configuration and change in consolidated and virtualized environments call out for implementing Change, Configuration and Release Management (CCRM) in order to meet compliance and reduce and manage risk of change...

Result? Most data centers seem to be implementing Consolidation and Virtualization projects before implementing CCRM, or perhaps in parallel in different silos... And most appear to be doing it with minimal to no performance oriented capacity planning.

Here's another prediction: over the next year, we are going to see lots of:

  • Horror stories of service unavailability due to poor/non-existent capacity planning, and/or
  • Horror stories of continued energy wastage, expense, and even regulation, and/or
  • A new buzz-phrase: Virtual Server Sprawl

You can at least start solving both of these problems without implementing complete CCRM by at least doing (even rudimentary) Capacity Management on a project by project basis before any/all migrations or configuration changes, and at least use whatever formal change process you currently have...  then try and break down those organizational silos, and implement a CCRM solution.

But certainly, by the time you are in production - on any scale - with virtualization - you had better have good control of CCRM...

Or, maybe I'll be writing about YOUR data center, perhaps?

Stay Cool! Be Efficient!

Dave

PS: I just got an email from a co-worker about a shopping experience that occurred on Friday, November 24th (busiest shopping day here in the US)... from his attempt to shop at Macy's online store. Here's the text verbatim:

"This was what the site said (keep in mind that I left my browser open all day and it never did let me in!):
 

'We'll be right with you.

It's a little crowded in here right now, and to make sure everyone enjoys shopping with us, we're asking new visitors to wait here a few moments (less than a minute!) while other shoppers finish up. We'll refresh your browser and welcome you in momentarily. Thanks for your patience!'

 



_____
tags:
Tuesday, December 05, 2006  |  Permalink |  Comments (0)
More ramblings on the data center heat and power problem, with some new "heat problems" of my own! But wait, there is more! Wait til you see some of these latest "solutions" to the heat problem in the data center I just tripped across!

Ok, sorry for the delay in posting, but things have been kinda hectic around these parts...

Following the "bouncing Dave":

First, I had a great visit to Interop in NYC the week of September 18th, where I co-presented on a Panel with the VP of Marketing from Opsware, Eric Vishria. We talked about Managing the Virtualized Data Center: Monitoring and Managing for Performance and Availability.  In the Q&A that followed, we had several attendees come up and talk further about challenges in the "we've got too much capacity, and its heat is killing us" vein... Yet another set of validation datapoints.

On a whim, I decided to swing by the floor, where I immediately gravitated to the APC exhibit. After some probing questions, I was steered to a helpful Sales Manager, Keith Markowitz, who worked with me to demo some of their latest approaches in this space.

They showed me a demonstration of their NetBotz solution/console (at least that is what they said it was!), which basically can give you access (via SNMP queries) to the voltage and current draws at an "outlet by outlet" level. Envision the inside of their rack technology having a bunch of fancy (basically!) power strips. You know, the kind you have when you need to plug 6 things into one outlet?  Anyway, they can query (and graph!) power consumption at the outlet level... the blades, or rack-mount servers, or whatever, are then plugged, on a one to one basis, into those outlets...

So, you can measure power fairly close to the physical server/cpu level over time... Problem is you don't know it's a server, unless you have an up to date CMDB which maps things like: "Rack 12, Strip 17, Outlet 2 == ExchangeServer01@Bldg3@USA" or the equivalent... This is a technological start/pre-requisite, but it's basically extremely hardware centric!
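
The outlet-to-server mapping described above can be sketched as a simple lookup joined against the power readings. This is only an illustration in Python: the `OUTLET_TO_CI` table stands in for a CMDB extract, and the function name, data layout, and wattage values are all hypothetical, not anything APC or a real CMDB exposes.

```python
from collections import defaultdict

# Hypothetical CMDB extract: (rack, strip, outlet) -> configuration item name.
OUTLET_TO_CI = {
    ("Rack 12", "Strip 17", "Outlet 2"): "ExchangeServer01@Bldg3@USA",
}


def power_by_server(readings, outlet_to_ci):
    """Average outlet-level power readings per server.

    `readings` is an iterable of (rack, strip, outlet, watts) tuples.
    Returns (average watts per server, list of outlets with no CMDB entry).
    """
    samples = defaultdict(list)
    unmapped = []
    for rack, strip, outlet, watts in readings:
        ci = outlet_to_ci.get((rack, strip, outlet))
        if ci is None:
            # Without the CMDB mapping we only know an outlet drew power.
            unmapped.append((rack, strip, outlet))
        else:
            samples[ci].append(watts)
    averages = {ci: sum(w) / len(w) for ci, w in samples.items()}
    return averages, unmapped


# Made-up readings, as if polled from the power strips over time.
readings = [
    ("Rack 12", "Strip 17", "Outlet 2", 240.0),
    ("Rack 12", "Strip 17", "Outlet 2", 260.0),
    ("Rack 12", "Strip 17", "Outlet 3", 180.0),
]
averages, unmapped = power_by_server(readings, OUTLET_TO_CI)
print(averages)
print(unmapped)
```

The join itself is trivial; the hard part is keeping the mapping table accurate, which is exactly the manual-entry problem for thousands of servers.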

(Naturally I just now went back to their website and tried to find more information on NetBotz, and all I can find on the web is how it offers video surveillance of your racks! Kinda what you would expect from a hardware vendor actually... just try and find a software solution on their website!)

Earlier that day in NYC, I had met with the new VP of Capacity Management for a very large worldwide Financial institution. They have 18,000 physical servers (Intel/AMD-based) and they have not yet begun any virtualization projects because they don't have a good handle on Capacity or where to start!  So now I am just envisioning the sheer management hassles of associating 18,000 servers with the specific electrical outlets they are plugged into! Someone is going to do that manually? And then type it in? Yeah, that will happen in our lifetimes! NOT!

New favorite Rant #1: Why is it that everyone in IT technology has to make everything so darn geeky and technically detailed? We don't want to identify the specific fungus spot on *that* leaf on *that* tree; we just want to know - is the forest healthy? Or Not? And Why? Argh - the approach is always: let's build yet another widget!

On the other hand, check out this Information Week article on our Heat/Power topic. There is actually a company called SprayCool, with a solution to heat they call "SprayCool M-Series direct chip-cooling technology". It's water injection cooling to spray a cooling mist directly onto the chips! Check out that gnarly rat's nest of water tubes! Looks almost like the inside of my water-cooled PC! I can only imagine how cheap/easy a solution that is going to be for customers with 1000's of servers! Sigh.

See Rant #1! (Disclaimer: yes they do make a rack solution, but now you have a proprietary rack that is HW vendor specific, and you still gotta move that heat somewhere - they conveniently say to "connect it to your building's cold water loop"! Wonder how long it will remain "cold water"?)...

Now, HP are further getting into the fray, naturally, as a HW vendor with yet another clever HW solution. Techno-geeks rejoice! They are introducing Thermal Logic Technology, a new type of rack with special, intelligent, patented fan technology! Apparently they spin faster, have better blades for lower cavitation and less noise, and some intelligent controllers! Way cool!

Think about this folks... It may be a cheaper/better way of cooling your blades, but how do you spell "Vendor lock-in"? See Rant #1!

Why, oh why can't people understand that most, if not all, of these servers are quite simply being wasted? It's not about cooling them better, it's about having only as many as you really need!

Then last week, it was off to that HOT vacation destination, Acapulco, Mexico for our annual BMC analyst event. I now understand why it is actually the off-season there right now... it was really HOT!  Even my compadres from Houston were really hurting... something about 16 North latitude, coupled with temps in the 90s and humidity in the 90s really gets to you! I guess we all needed the Human version of the "SprayCool" (is that the H-series?).

Speaking of things "capacity related", any of you ever have the pleasure of trying to change airlines in Mexico City using an e-Ticket? Let's just say this: the only people that can help you are two incredibly friendly (and massively overworked) ladies at a manual face-to-face help desk. They use walkie talkies to find out gate assignments. Because they don't have enough gates for "peak demand", Mexico City uses a system whereby gates are "dynamically assigned" at the "very last minute"... so you don't know where you have to run to, until it is literally almost too late... Any of you ever RUN at 7000' altitude in a city with the worst air pollution on earth and where they all smoke? Yeah, not fun...

I ended up with - no joke - a handwritten boarding pass created at the gate as they were boarding! Amazing! But, all was not awful - at least I got upgraded to First class!

Just yesterday, a quick "day trip" to NYC for some meetings with the CTO and staff of a very large life insurance company, and separately with the Sr. Management in charge of capacity and data center consolidation and virtualization at a very large brokerage. Once again, wonderful, first hand validations of the drivers behind virtualization: complexities and costs associated with too many physical servers!  At the brokerage, they admitted that across their thousands of servers, their average utilization was 7%... which means that they are using roughly 14x the power they actually need! And, btw, they are OUT of data center space, so they have a "one server in must be preceded by one server out" policy.
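
The 14x figure is just the reciprocal of the utilization, and a small Python sketch makes the assumptions explicit. It assumes power scales with the number of servers running and that the same work could in principle be packed onto fewer servers at some target utilization. The function name and the 60% target are hypothetical, added only to show how sensitive the multiplier is.

```python
def overprovision_factor(avg_utilization, target_utilization=1.0):
    """How many times more servers (and, roughly, power) are running than
    would be needed if the same work ran at `target_utilization`.

    Assumes power is proportional to server count, which ignores the fact
    that busier servers draw somewhat more power each.
    """
    if not 0 < avg_utilization <= 1:
        raise ValueError("avg_utilization must be in (0, 1]")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    return target_utilization / avg_utilization


# 7% average utilization, as reported by the brokerage.
print(f"vs. 100% target: {overprovision_factor(0.07):.1f}x")
# A more cautious, hypothetical 60% target still leaves a large multiple.
print(f"vs. 60% target:  {overprovision_factor(0.07, 0.60):.1f}x")
```

Nobody runs production at 100%, so the ideal-case multiplier overstates what is recoverable, but even a conservative target leaves most of the fleet as headroom.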

Folks: if you get one thing outta this blog, make it: ignoring the discipline and process of capacity management is gonna cost you. May not be today, but it will happen! It may cost you in service outages. It may cost you in running out of floorspace. Or Heat. Or Electrical bills. Or in loss of agility by having to purchase expensive, highly proprietary hardware "solutions" to a problem you wouldn't have if you could plan more accurately...

The merging of the knowledge discipline of capacity management with dynamic provisioning of virtualized and shared technology is going to happen, folks... It quite simply has to... the alternative is ever more bizarre hardware-oriented band-aids to the problem, with all the vendor lock-in associated with those approaches.

In the meantime, let the heat build (at your peril!)

regards

Dave



_____
tags:
Tuesday, October 03, 2006  |  Permalink |  Comments (0)
Digging into acquiring the measurements of power

Missed blogging on this topic the last two weeks, but had a really good excuse - probably the best one there is: I was at our Annual BMC User World event, with well over a thousand partners and customers across our various solutions. A great, informative and fun time was had by all!

While there, I took the opportunity to start "digging into" this issue with just a couple of our partners at the event, testing if they knew anything about this issue, if they had any ideas on how to further research it, and just generally spending a lot of time seeing if there were partner/vendor interest. Every customer I discussed this problem with was highly intrigued!

The award for most interested partner goes to Sun, perhaps not surprisingly since they own a current market(ing) advantage in terms of "greener servers" (see my earlier blog entry on SWaP, etc.). The technical gentleman at the Sun Demo pod (Spod?) noted they take advantage of the AMD API: PowerNow...  and suggested I look into it. So off I went to research just what that would allow.

After digging around both Sun (for Solaris) and Microsoft webpages, I came to the conclusion that none of them (separately or in combination) allow for a software way (from the Operating system via kernel calls or other APIs) to actually measure power over time. Easiest route to solve the problem is exhausted (for now!)... Time to check the HW platform angle some more...

Exploring the Intel angle just a bit further, I tripped across an in depth whitepaper exploring the relationships between CPU interrupts (over time) and power consumed by the CPU.  It was herein that I discovered the way that Intel measures power over time (at least for this whitepaper):

They use HW data acquisition equipment from Fluke. Now, I remember Fluke from way back in my EE days in the 70's, so it wasn't hard to veer down that thread (or have an inkling of where it'd lead, unfortunately).  Here's what I found:

The Fluke NetDAQ* 2686 has the ability to acquire and measure things like power over time, but requires hardware probes on all the device(s) that you want to measure power consumption across.

This is NOT what I'm looking for!  We can't require IT to go around hooking wires to each/every one of their CPU boards/Blades in order to just measure what is going on! Think of the tangle in a fairly typical data center with a couple thousand CPU/Blades! Think of the potential for ground-loops causing crashes. And just think about getting permission to physically instrument production servers.  Simply put, not gonna happen!

Next Step? Gotta dig into the bowels of the hardware. First stop (for today!), the Intel IPMI Specification. Over 650 pages of reading guaranteed to make any EE/Software guy's eyes glaze over... After a couple of hours reading, my current conclusion: Intel has an API to query status of just about everything... but NOT power consumption! Argh... Time to find humans inside Intel (get it? - haha) to ask my detailed questions...

Wish me LUCK! Without this type of instrumentation, well, let's just say, again - you can't manage what you can't measure!

Stay tuned for next update

Dave



_____
tags:
Friday, September 08, 2006  |  Permalink |  Comments (0)
More musings, the government (EPA) weighs in, and perspectives from industry biggie, APC

First, I want to thank those of you who've contacted me on this blog, you're pointing me to some additional information out there on this problem, as well as the current-state of efforts to solve it. Please do continue!

Next, check out this "hot off the presses" (August 18th, 2006) Draft EPA standard relating to Server Efficiency. What is fascinating about this is it directly talks to some of my earlier ramblings. And especially to the critical issue of "how do you measure performance" (i.e. useful work).

I'll spare you a lot of hunting and digging, but perhaps the most interesting information was in a documented footnote, where it references a paper by Bruce Nordman of Lawrence Berkeley National Laboratory. His March, 2005 conclusion? (my bolding):

"It is clear that many commercial servers operate at low levels of activity for much of the time. No current standard metric shows how this affects power consumption for current products, or could do so for future ones designed to exploit this fact. There is a need for such a metric, and for clear and consistent definition of relevant terms. There are a variety of benchmarks that could be applied to the problem. The simplest one that correctly reflects system performance should be selected and then used."

He actually mentions the potential of applying "SPEC" or "TCP" (typo: he meant TPC) to the problem, which directly relates to my anecdotal comments in a previous blog entry reflecting on my 4+ years representing Stratus on the TPC... hmmmm...

My prediction #1: the rub is going to be solving the "correctly reflects system performance" issue. For purposes of vendor benchmarking and selecting the most efficient hw to purchase and use going forwards, applying "measured power" versus SPEC's or TPC-(C, H, etc.) can give general guidance.

But it still doesn't reflect how underutilized (over time and the business cycle) the total HW compute infrastructure is. And this is where the vast cost-savings opportunity lies!

APC (one of the largest vendors in the Power distribution/management space) have an interesting paper that speaks directly to this point. In their paper on "Determining Total Cost of Ownership in the Data Center and Network Room Infrastructure", they have a really interesting breakdown on the various sources for the total (power-related) costs.

The really interesting conclusion (IMO) is in a table on page 7: "Rightsizing the system to the actual requirement over time" is 60.1% of the total cost savings TCO opportunity! They posit that even if you were able to be "obtaining all capital equipment at 50% discount from standard" you'd only save 12.3%!!! Or even if you made cooling 100% more efficient you'd only save a little over 4%...

So, the key is going to be not only purchasing more efficient technologies (as I previously pointed out, a "one shot" benefit, and one that offers little total savings opportunity), but also making sure you need what you purchase... and continually monitoring your total utilization versus power...

This is going to be something that has to be done continually as part of a Best Practices continuous improvement process, and those data centers that do this the best are going to deliver a huge competitive cost-advantage to their parent corporations - not to mention the potential for public relations good-will relating to environmental stewardship, etc...

Back to my simple question: anyone know how to measure the power consumed at the physical CPU/Board level over time? I'll shortly have out a podcast on this exact quest/topic - so keep an eye out for it!

Still searching, the quest continues

Regards

Dave



_____
tags:
Tuesday, August 22, 2006  |  Permalink |  Comments (0)
Musing on pre-requisites to addressing the challenge

Well, this is rapidly becoming something of an obsession. For those who know me, when I get passionate about something, well, lets just say I tend to dive completely in...

Part of that was looking for ever more background information in this area, to really get an assessment of the "state of the electrical environment" so to speak.

By far the most comprehensive article unearthed to date, The Balance of Server Powers is an excellent compendium of all the dimensions being considered in this area.  I think Timothy has done an excellent job of covering not only the dimensions of the problem, but also of beginning to unearth and at least discuss an obvious, and necessary, step that is a fundamental pre-requisite to ultimately getting a handle on the problem.

Namely: if we want to manage (optimize) it, we've gotta measure it... 

Timothy points out the introduction of a Sun proposal called SWaP - Space, Watts and Performance, an attempt at a "uniform metric" by which to measure a server's ability to conduct useful work against the actual power (and space) consumed by that server... Sounds like a good idea, right? At least as a proxy?
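
For reference, here is a minimal sketch of the SWaP ratio as I understand Sun's published definition: performance divided by the product of space (in rack units) and power (in watts). The function name and the two example servers are hypothetical, and which benchmark supplies the "performance" number is left open, which is exactly the sticking point.

```python
def swap_score(performance: float, rack_units: float, watts: float) -> float:
    """SWaP = performance / (space * power). Higher is better."""
    if rack_units <= 0 or watts <= 0:
        raise ValueError("rack_units and watts must be positive")
    return performance / (rack_units * watts)


# Hypothetical servers (made-up numbers, not measurements):
# an older 4U box versus a newer 2U box on the same benchmark.
old_server = swap_score(performance=200.0, rack_units=4, watts=500.0)
new_server = swap_score(performance=400.0, rack_units=2, watts=400.0)
print(f"old: {old_server:.3f}  new: {new_server:.3f}")
```

Note that the ratio is only as meaningful as the performance figure fed into it.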

Flashing back to my days as a founding member of the Transaction Processing Performance Council, I remember the amazingly long days, nights and negotiations behind trying to get 30+ Server vendors to agree on a single metric by which to measure net transaction processing power. I kid you not - it took us the better part of 18 months to get the first benchmark (TPC-A) agreed and standardized... I know, because as the secretary, I literally typed every single character of every single word for all the motions, counter motions, votes, and the standard itself. All on my little old, nicely portable (ha!) Mac SE/30...

And then, we immediately had to begin on TPC-B, and TPC-C for the *other* types of "work" that servers did. You see, different workloads use and need different ratios and types of resources... and different vendors had different, ahem, "needs"

PS: If you've never sat on a standards body, it's hard to describe the degree of "watching sausage being made". But, just like sausage making, if you've watched it, you probably will never eat it again! But I digress, back to workloads...

  • Some workloads are intensely CPU bound (think signals intelligence, or code cracking done by the dudes at Fort Meade for the ultimate example there!). These workloads are going to max out electrical consumption for CPU and Memory, with a proportionally smaller amount going to Storage
  • Others are more transactional in nature, with multiple read/writes needed per transaction, using proportionally less CPU (per unit of wall clock time, that is).
  • And there are a bunch of other variants all over the spectrum...

The point here is that "Performance" means different things to different:

  • Vendors
  • Application mixes
  • Customers - the only group that actually cares (non-altruistically) about efficiency by the way!

So, waiting for a "standard" by which to measure power consumption versus "useful work" (as valued by the end customer/user) is, in my humble opinion, a low return option... At best it is going to be a long while, at worst (more likely) we're going to have a plethora of different measures and "your mileage may vary"... Not helpful.

Will market forces generally trend everyone in the right direction? Yes, but who can afford to wait? Besides, think of the competitive advantage if YOU could get ahead of this dimension!

The right metric to measure data center efficiency has got to be something like:

Total Work (Transactions and Throughput) / Total Cost Of Ownership.

Then we have to (somehow) accomplish two things:

  1. Measure Total Work - in a fashion that the end user/customer cares about

    The point here is to get away from measuring CPU utilization (which is only averaging 10-20% in standard 1-2 CPU rack configurations anyways) and more towards measuring the useful work that is getting done on that CPU over time. This is not necessarily trivial (after all it varies customer by customer and application by application). But it is doable with today's technology, and it is doable at differing levels of accuracy (versus effort to implement).
  2. Decompose Total Cost Of Ownership into its main sub-constituent "buckets":
  • Hardware (e.g. # Servers)
    All the information I have reviewed shows that all the data center electrical costs scale in direct proportion to the number of physical servers installed in the data center. This includes sizing backup generators, cooling, feeding the battery backup systems, floorspace, lighting, software licenses, management costs (personnel and software)... Even storage capacity scales in a linear relationship here...the whole shebang. Net? Reduce the number of servers, reduce those other costs...
  • Energy (electricity used by servers)
    Here the problem is that it is easy to measure the total electrical load of the data center, and sometimes with custom HW measuring technology, it is possible to measure at the "row" level (or sometimes even at the rack level). But the measurement isn't related to the electricity used by each physical server! Or CPU/board(s)...
  • Software/Licenses
    Easily measurable with today's asset management solutions. Again, though, typically scales close to linearly with number of servers (rarely implemented/used variable workload licensing charges notwithstanding!)
  • Personnel
    Typically scales linearly with the number of servers... if you can get more work out of the same number of servers, the productivity metric here goes way up...
  • Other Capital Assets/Expenses
    Typical finance stuff here: amortization, asset inventory, etc.
  • Other Operating Expenses
    Miscellaneous OpEx not in above categories...
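
To make the Total Work / Total Cost Of Ownership ratio concrete, here is a rough sketch. The class and function names and every dollar and transaction figure are made up for illustration; the buckets simply mirror the list above.

```python
from dataclasses import dataclass


@dataclass
class TcoBuckets:
    """Annual cost per bucket, in dollars (hypothetical figures)."""
    hardware: float
    energy: float
    software_licenses: float
    personnel: float
    other_capital: float
    other_operating: float

    def total(self) -> float:
        return (self.hardware + self.energy + self.software_licenses
                + self.personnel + self.other_capital + self.other_operating)


def efficiency(total_work: float, tco: TcoBuckets) -> float:
    """Useful work delivered per dollar of total cost of ownership."""
    total = tco.total()
    if total <= 0:
        raise ValueError("total cost of ownership must be positive")
    return total_work / total


# Made-up data center: 50 million transactions a year, $1M annual TCO.
example = TcoBuckets(hardware=400_000, energy=150_000,
                     software_licenses=200_000, personnel=200_000,
                     other_capital=30_000, other_operating=20_000)
print(efficiency(50_000_000, example), "transactions per dollar")
```

The hard part is not the division; it is getting a trustworthy number for the numerator and for the energy bucket at the server level.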

When I was at LinuxWorld this last spring, I embarked on a desperate quest to find someone, anyone, who was measuring how much power each CPU, or rackable set of CPUs (e.g. 4 cores on a board, etc.) was using.
I couldn't find anyone doing this! Yeah, some have cool blinky lights monitoring workstations that will alert you if things are getting too hot, or using too much power... But actually measuring it in a way that could be used to do planning and optimization? Nope, nobody... Googling it looking for a solution? Es gibt nichts! The big zilch!

I've asked Intel, AMD, the blade vendors... nada... (all said, "hey, cool idea though"!) <Wink>

So, if we're gonna truly solve this issue, and give it the visibility it needs, we gotta measure it. Are there "proxies"? Yes! Can they help? Yes! But do we really know how much electricity is being used to do USEFUL work? Nope!

Anyone out there know of a way to do this? I'd love to hear from you!

Regards

Dave



_____
tags:
Monday, August 21, 2006  |  Permalink |  Comments (0)
Ramblings on Global Warming, Data Centers, and good old "Open Loop" Thermal Runaway...

Now, I don't know where any of you stand on the "Global Warming" spectrum (in terms of belief systems on causality). Personally, I think it would be the pinnacle of hubris to believe that we humans are the "sole cause/blame" in this developing saga, but I also believe it incumbent on us to be good stewards of our environment... And, it definitely is getting warmer, so, for today's topic, let's assume it is:

  • Real - i.e. it is warming now (cause is irrelevant)
  • Something we can and should do something about

In The Impact of Global Warming on IT, I was struck by the discussion centering around the reality that because of global warming, Data Centers would have to have more, and more powerful, AC systems (with failover, etc.). That - as temperatures rise - ACs will have to work harder, and use more energy... And get more AC systems, etc...

Strikes me that...

  1. More Energy = More power plants burning fuel to generate it
  2. More fuel burned = More emissions and more "global warming"

When I studied EE way back in college, I was really into Stereo... we were always searching to balance opposing goals:

  1. Clarity of analog signal through a circuit; which means little, or better no, negative feedback (it's a type of distortion)
  2. Stability of the safe operating area (SOA) of the transistors, which means negative feedback to prevent "thermal runaway" (among other things).

Thermal runaway (put very simply) goes like this: As a transistor is run closer to higher levels of output (i.e. efficiency) it heats up, and there is less of a voltage drop across the base/emitter junction for any given current level; but this causes it to draw more bias current, making it heat up further... This has the unfortunate side-effect of making it ever warmer, and a vicious cycle is set up whereupon it ultimately melts or explodes (sometimes quite spectacularly)... I not so fondly remember the time when I took the negative feedback loop down to zero in my 100 Watt amplifier on a test bench... it didn't smell good, trust me... [Maybe there is a reason I went into software after all!]
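
The vicious cycle can be sketched as a toy loop. This is not a transistor model: it assumes each round of self-heating adds a fixed base temperature rise plus a fraction (`loop_gain`) of the previous rise, and every name and number here is hypothetical. With a loop gain below 1 the temperature settles; at 1 or above it keeps climbing until something gives.

```python
from typing import Tuple


def simulate_temperature(loop_gain: float, base_rise_c: float = 20.0,
                         ambient_c: float = 25.0, steps: int = 50,
                         limit_c: float = 150.0) -> Tuple[float, bool]:
    """Iterate rise = base_rise_c + loop_gain * previous_rise.

    Returns (final temperature in C, True if the limit was exceeded).
    """
    rise = 0.0
    for _ in range(steps):
        rise = base_rise_c + loop_gain * rise
        if ambient_c + rise > limit_c:
            return ambient_c + rise, True  # runaway: stop at the "melt" limit
    return ambient_c + rise, False


for gain in (0.5, 1.2):
    temp, ran_away = simulate_temperature(gain)
    print(f"loop gain {gain}: {temp:.1f} C, runaway={ran_away}")
```

In this sketch, adding negative feedback amounts to pushing the loop gain back below 1.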

A perhaps less geeky analogy would be hearing a PA system in a gym get into a feedback "squeal" because the microphone is too close to the speaker... that's another (related) kind of runaway situation.

Negative feedback is basically a closed loop process whereby you take the output of a circuit, completely invert it, and feed a small amount of it back into the input of the circuit. It should be noted that a very similar conceptual approach has been used since the 1960's for reducing car emissions: take some of the engine's blow-by gases and reintroduce them into the intake side of the combustion process... Anyone remember PCV valves on their 1968 muscle cars?
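
For the electrically inclined, the textbook relation for an amplifier with open-loop gain A and feedback fraction beta is A / (1 + A * beta). The sketch below uses made-up gain values just to show the effect: with feedback, the overall gain barely moves even when the raw amplifier gain doubles.

```python
def closed_loop_gain(open_loop_gain: float, feedback_fraction: float) -> float:
    """Closed-loop gain of a negative feedback amplifier: A / (1 + A * beta)."""
    return open_loop_gain / (1.0 + open_loop_gain * feedback_fraction)


# Hypothetical amplifier: the raw gain doubles, the feedback fraction stays at 10%.
for a in (50_000.0, 100_000.0):
    with_feedback = closed_loop_gain(a, feedback_fraction=0.1)
    without_feedback = closed_loop_gain(a, feedback_fraction=0.0)
    print(f"A={a:.0f}: with feedback {with_feedback:.4f}, without {without_feedback:.0f}")
```

Both cases land just under 10 with feedback, while the no-feedback gain simply follows the raw device.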

So, seems to me this data center, electricity and global warming is the same situation: We need a "closed loop process" by which to ensure the "output" of the data center doesn't go into "thermal runaway"... Maybe something like:

  • Measure the "output" of the data center in terms of resources actually used to deliver acceptable service
  • Ensure that is fed back into the up-front decision process and steps by which more/different resources are put into service
  • Make sure you always have enough to meet service levels
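
As a rough sketch of what such a loop might look like in practice: measured utilization is fed back into the purchase decision, and new capacity is approved only when projected demand would threaten service levels. The function name, the 75% safe ceiling and the growth figures are all assumptions for illustration, not a recommended policy.

```python
def capacity_request_justified(peak_utilization: float,
                               forecast_growth: float,
                               max_safe_utilization: float = 0.75) -> bool:
    """Approve new capacity only if projected peak exceeds the safe ceiling.

    peak_utilization: measured peak, as a fraction (0.20 means 20%).
    forecast_growth: expected demand growth over the planning period (0.5 means +50%).
    max_safe_utilization: hypothetical ceiling above which service levels are at risk.
    """
    projected_peak = peak_utilization * (1.0 + forecast_growth)
    return projected_peak > max_safe_utilization


# A farm peaking at 20% with 50% growth ahead: "No, you cannot just buy more."
print(capacity_request_justified(0.20, 0.5))
# A farm peaking at 60% with the same growth: the request is justified.
print(capacity_request_justified(0.60, 0.5))
```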

I think what data centers need is some "Negative Feedback"!!

"No, you cannot just buy more, or newer servers or blades"

"No, you must justify the capacity requirements before you change anything"

In these go-go internet-pervades-every-aspect-of-life-and-commerce times, nobody likes to be told no. Nobody likes negative feedback... but without it, we have "data center thermal runaway".

But today's data center clearly needs the discipline of the "negative feedback" of a properly integrated capacity management process. Closed-loop Capacity Management anyone?

What are your thoughts?

Dave

 



_____
tags:
Monday, August 14, 2006  |  Permalink |  Comments (0)
Why moving to latest technologies is not quickest way to solve the data center capacity challenges

OK, so now that I have gotten the blog bug, I started to do some more validation on the web... Check this out:

http://www.cio.com/archive/041506/energy.html?page=1

They talk about this electricity, heat and data center problem, but then posit the following solutions:

  • More Efficient Computers - basically moving older technologies to newer chip architectures, etc...
  • The Latest in Cooling - basically getting fancy with plenums, hot and cold aisles, etc...
  • A More Efficient Data Center - basically retiring and replacing with new technologies, and using virtualization...

Here are some problems with these as I see it.

  • More efficient computers means adding more extraneous capacity to environments that are already overprovisioned (in most cases)... this means going through massive configuration changes to move from older to newer, or to add newer... But managing the magnitude of such a change is a big deal.  Most service failures are caused by unmanaged and poorly planned change.  Some customers I have talked to are doing it the slow way: issue a mandate such as "all new servers must be virtual" so that the change is a very gradual, semi-self-managing one... i.e. they eventually get there. In the meantime, their environment actually gets more complex and hence more brittle, while it also actually grows in capacity - making the inefficiency problem bigger! Moving backwards to go forward?
  • The problem with the best and latest cooling systems is that this is a one-time only approach to the problem. Once you've implemented hot/cold aisles, latest plenum technologies and such, then where do you go? The systems themselves are still inefficient and underutilized! And you're still adding capacity...
  • Finally, the problem with getting to a more efficient data center by replacing servers is that it simply takes too long, and involves incremental complexity and configuration inefficiency while you go through the changes needed to get there.

Net: everyone is ignoring the root-cause of the problem. Sure CPUs generate lotsa heat, sure latest technologies are getting more efficient (incrementally), but why are we simply wasting so much CPU capacity?  I mean, mainframes had the heat problem decades ago! Remember water cooling? But remember what came after water cooling? Capacity Planning! And then Virtualization, partitions, automated workload balancing, etc... The process came before the technology!

In my book, the simple definition of capacity planning is nothing more than "A process for getting the most out of what you have before buying more"... And with x86 architectures nobody is doing that. Nobody ever did it! It is <pick one of: faster, cheaper, easier> to just throw another server at it. And organizations, business units and application owners are set up with budgets accordingly. Result? Server sprawl and average utilizations - depending on who you talk to - of 10-20% at best! Mainframes are typically at 90% or more.

That means that we're pumping out 5-8x or more heat than needed. 5-8x the electricity. 5-8x the costs. Virtualization is a part of the solution (check out Fred's Blog: http://talk.bmc.com/blogs/blog-fjohan/fred-johannessen/ for more discussions on the virtualization angle).
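
The arithmetic behind that multiplier can be sketched as follows. It assumes, as a simplification, that heat, electricity and cost scale with the number of servers, and that the same work could run at a mainframe-like 90% utilization; the function name is hypothetical.

```python
def overprovisioning_factor(actual_utilization: float,
                            target_utilization: float = 0.90) -> float:
    """How many times more servers are running than the target would need."""
    if not 0 < actual_utilization <= 1:
        raise ValueError("actual_utilization must be in (0, 1]")
    return target_utilization / actual_utilization


for actual in (0.10, 0.20):
    factor = overprovisioning_factor(actual)
    print(f"{actual:.0%} utilized: about {factor:.1f}x the servers needed")
```

Under these assumptions the range works out to roughly 4.5x to 9x, which brackets the 5-8x figure.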

Last I heard from Gartner http://www.gartner.com/ was "without virtualization, a typical volume server will run at 10 percent utilization. With virtualization, many organizations are increasing that figure to about 40 percent" [Source: Consumerization of IT: The Gartner Analyst Keynote, Spring 2006 - July 3, 2006; Stephen Prentice, Simon Hayward, Brian Gammage, Nick Jones, David A. Willis, Kathy Harris, Martin Reynolds, and Daryl C. Plummer]

So, virtualize and get your average utilization to 40%... That's a great ROI - no wonder VMware http://www.vmware.com/solutions/consolidation/ is so full of stories like: "We implemented VMware and saved hundreds of thousands of dollars!" Remember: all that change had to be managed and survived, and they are still missing the fundamental root cause: lack of a capacity planning process! I can just see the headlines 3 years from now "Virtual Server Sprawl... " you fill the rest in. Treated this way, virtualization is a one-shot (just like building better cooling). If everyone in the world were somehow able to virtualize tomorrow with VMware, Xen, and/or Microsoft Virtual Server http://www.microsoft.com/windowsserversystem/virtualserver/evaluation/vsoverview.mspx, they'd still be buying more resources hand over fist without a disciplined process.

The root cause is that people are still buying and deploying resource capacity without knowing they really, REALLY need it. And how much, and when. They are putting the technology ahead of the process.

It's all about a process...

Now, on to my family to remind them our new process is to turn OFF things we don't need :)

Don't forget to turn off that PC!

Dave



_____
tags:
Friday, August 11, 2006  |  Permalink |  Comments (0)
Pondering the implications on the strange intersections between IT, Weather, and the Price of Oil

About 18 months ago, I co-wrote an article with Charles Rego of Intel Corp about some pending data center capacity and power related issues, because we had been working closely together to assist data centers' migration to the latest technologies.

With all that transpired in the intervening time, I kinda "back burnered" this area...

Just this last week as I was "sweating out" record highs here at my home office in Bolton, Massachusetts, I was reminded of just how critical Heat, or more importantly getting rid of it, is to today's IT environments.

The air conditioner in my home office was "set to stun", it was 102F with 85% humidity outside; I was trying to complete yet another marathon day of conference calls, webexes, and general voice communications. Problem was it was just too darn LOUD.

So, off would go the AC, I'd be able to use speaker phone for a while.... Cut forward to "getting warm in here"... back on with the AC, etc... and the cycle repeated. The major source for the heat (notwithstanding my ability to generate hot air ;) ??? My home PC/Server.

Built 3 years ago with an uber-high end (at that time) configuration. We're talking dual RAID arrays (0 for boot, 1 for data). The graphics card, 2GB memory, Pentium IV at 3.4GHz, yada yada... still pretty uber even 3 years later... but it puts out about 500-600 Watts of heat - not good.

Today's data centers have rack after rack of either monolithic Intel or AMD-based servers... sometimes 64 or more CPUs per rack... these puppies are putting out some serious heat... and it takes AC to remove the heat... It's doubling down on electricity. I just read this cool article: http://www.information-age.com/article/2006/july/fighting_the_fire where they talk about this, but it appears to me they are missing a fundamental concept here. They are so focused on the technology (inefficient chips, higher densities, more efficient and better planned cooling, etc...) that they are missing the real cause. And I'm just as guilty as anyone else! We're all wasting tons of electricity making and removing heat because most of the time, our systems aren't doing anything useful!

My uber PC is idle way over 90% of the time. I only need all its "juice" when I'm crunching my latest 15MB TIFF file in Photoshop... the rest of the time it's making heat (and noise!). Same thing with Servers...

And now the government wants to get involved! Does anyone believe *they* will make IT's life easier? (Heck, we're only now working through the realities implied by SarBox!) I can't wait.

Seems to me IT needs to get ahead of this. I sure am with my home PC... I just did the calculation: 600 Watts times 24 hours a day, 30 days a month = 432 kWh per month... My fully burdened cost of electricity here in Massachusetts is $.14 per kWh... That's $60 a month! Not counting the incremental load on the AC! $60 a month for 3 years would be $2160 - more than it would cost to REPLACE IT!
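
Here is that back-of-the-envelope calculation as a small sketch, assuming a constant 600 Watt draw, a 30-day month, and a rate of $0.14 per kWh (the rate that reproduces the roughly $60 a month figure). The function names are just for illustration.

```python
def monthly_kwh(watts: float, hours_per_day: float = 24.0,
                days_per_month: int = 30) -> float:
    """Energy used in a month by a device drawing a constant load."""
    return watts / 1000.0 * hours_per_day * days_per_month


def monthly_cost(watts: float, dollars_per_kwh: float) -> float:
    """Monthly electricity cost in dollars for a constant load."""
    return monthly_kwh(watts) * dollars_per_kwh


kwh = monthly_kwh(600)
cost = monthly_cost(600, dollars_per_kwh=0.14)
print(f"{kwh:.0f} kWh per month, ${cost:.2f} per month, "
      f"${cost * 36:.2f} over 3 years")
```

Rounding the monthly cost down to $60 gives the $2160 three-year total; the AC load needed to remove that heat would come on top.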

 This blog will explore every angle I can find relating to the intersection of Power, IT resource Capacity and how to meet service levels without wasting energy... Do we want our energy resources so blatantly wasted? Can we afford it?

Stay tuned

Dave

 

 



_____
tags:
Tuesday, August 08, 2006  |  Permalink |  Comments (0)
David Wagner
