Tag Archives: scalability

Inside the Gates of Hell

Last week, I called out Dilbert for avoiding the dreaded load test – it is hard, expensive and takes too much time.  Yeah, but.  I use my stuff to make money and if it is slow then nobody will like it and I will die in the gutter, gasping for air.  So, I simply must enter through the gates of hell and get some testing done – no way around it.

So, I checked with NBC and despite the recent loss of Olympic coverage they’re still not desperate enough to cover my idea that testing directly on a production environment is newsworthy.  They mumbled something about boring and too geeky.  Pfsst.  I’ll press on.

Beyond the obligatory warning, “Don’t blame me if you screw up and spend your weekend discovering that your disaster recovery plan has a few holes that you’ll need to fix on the fly”, let me bullet point the typical test-in-your-production-environment advice here (I will try to add some value later in this post):

  • Make sure you have a disaster recovery plan and have a backup or two handy
  • Conduct your test during the lowest traffic periods
  • (Optional) Redirect users to a “Temporarily Out of Service” page, if you don’t want to include any real users in your experiment
  • Increase load volumes gradually to avoid complete system crashes
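The last bullet, increasing load gradually, can be sketched as a simple stepped ramp. This is only an illustration: the function name `ramp_schedule` and the requests-per-second figures are hypothetical and not taken from any particular load tool.

```python
def ramp_schedule(start_rps, target_rps, steps):
    """Return evenly spaced load levels from start_rps up to target_rps.

    All values are illustrative; pick steps small enough that you can
    stop the test before the system falls over.
    """
    if steps < 2:
        raise ValueError("need at least two steps to form a ramp")
    increment = (target_rps - start_rps) / (steps - 1)
    return [round(start_rps + i * increment) for i in range(steps)]


# Hypothetical ramp: four plateaus between 100 and 1000 requests per second.
for level in ramp_schedule(100, 1000, 4):
    print(f"hold at {level} requests/second, check the dashboards, then continue")
```

Holding at each plateau long enough for caches and queues to settle gives a cleaner reading than a continuous climb.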

A quick sidebar, perhaps a good subject for a later post – disaster recovery plans that haven’t been validated through execution are virtually useless.  Not that I think combining these complicated and risky test scenarios is optimal, but I certainly think that validating a disaster recovery plan must precede any type of live load testing in your production environment.  So if you haven’t done this, stop here – now!

Many technology platforms today can be vertically subdivided front to back – in particular to enable maintenance and upgrades.  This is most often accomplished with a load balancer rule change that sends traffic down one path or another.  I can use this technique to send folks to the out of service page, or to maintain service while I use either all or part of my production gear to test.  Further back in the stack – caches, queues, indexers, databases – it gets a bit more complicated and depends highly on the architecture.

But this common, well covered approach only seems to solve the “where do I get a test environment” problem.  I still have all the data problems that I mentioned last post and I still have the load generation problems.

The technique that may be a little less obvious than just separating out a test environment within my production gear is what I’ll call load testing through reduction.  Basically, the idea is to test various parts of my infrastructure under load by removing a component member and measuring the increase in load being served by the remaining production environment.

For example (and of course I’ll use the easiest component here), let’s say I want to measure front end server performance under load and I have 10 front end servers.  I already know that under “normal” traffic my array of front end servers is performing well within the acceptable range at 25% of peak capacity – so removing one front end should not tip everything over.  The starting measurement is taken when all 10 servers are working and there is measurable “normal” traffic.  Removing one front end server should increase traffic to the remaining 9 servers.  It is also worth measuring the impact on other components in the stack – caches, queues, indexers, databases, etc.  Some components will have an initial reaction to the change in load and then should settle down to a new normal performance under load.  In a perfectly scalable world, I should notice approximately a 10% decrease in performance at the remaining front end servers.  In the real world, mileage varies and knowing the details is super valuable in forecasting and planning capacity.
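The arithmetic behind that expectation can be sketched in a few lines of Python. The 10 servers and the 25% baseline come from the example above; the function name and the assumption that the load balancer spreads traffic perfectly evenly are mine, purely for illustration.

```python
def utilization_after_removal(total_servers, removed, baseline_utilization):
    """Expected per-server utilization when the same traffic is spread
    evenly over the servers that remain (idealized, linear scaling)."""
    remaining = total_servers - removed
    if remaining <= 0:
        raise ValueError("at least one server must remain in service")
    return baseline_utilization * total_servers / remaining


# Walk the example: 10 front ends at 25% of peak, removed one at a time.
for removed in range(0, 6):
    u = utilization_after_removal(10, removed, 0.25)
    print(f"{10 - removed} servers in service: {u:.1%} of peak capacity each")
```

With one server removed, each remaining server carries 10/9 of its previous load, an increase of roughly 11%. Comparing what I actually measure against this idealized number is where the useful information is.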

I can continue using this technique, gathering multiple data points, until I reach various stress points.  Usually a stress point is defined as the point at which performance becomes unacceptable or, worse, failures emerge.  I typically stop before catastrophic failure and usually have plenty to work on well before I hit many stress points.  There is tons of learning available using this technique – developing a deep understanding of my product’s scalability on a component by component basis is incredibly useful in guiding future development investment strategies, inspiring consensus building around “acceptable” performance metrics and even budgeting future infrastructure spend.
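Once the data points are collected, spotting the first stress point is a simple scan. The sketch below assumes each data point is a pair of (servers in service, measured latency in milliseconds) and that “unacceptable” means exceeding an agreed latency limit. The function name, the sample numbers and the 200 ms limit are all made up for illustration, not real measurements.

```python
def first_stress_point(measurements, max_latency_ms):
    """Return the first (servers, latency_ms) pair that exceeds the limit
    as servers are removed, or None if every data point is acceptable."""
    # Sort by server count, highest first, to follow the order of removal.
    for servers, latency_ms in sorted(measurements, reverse=True):
        if latency_ms > max_latency_ms:
            return servers, latency_ms
    return None


# Hypothetical data points gathered while removing front ends one at a time.
samples = [(10, 120), (9, 135), (8, 160), (7, 240)]
print(first_stress_point(samples, max_latency_ms=200))
```

The same scan works per component (caches, queues, databases) as long as each one has its own agreed definition of acceptable.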

This technique allows me to avoid time consuming environment building, expensive and inaccurate approximations, as well as the nearly impossible and risky data moving and cleansing for testing purposes.

No data movement, cleaning, backups/restores, load generation needed.  Know thy product performance and scalability.


Even Dilbert Dodges Load Testing

While many folks think about, talk about and even worry about load testing their product, they end up not testing while they hide behind many excuses. Load testing is really hard to do, efficiently. It is expensive. By the time we’re done, it won’t be valid. We don’t have time. Even Dilbert dodges load testing.

Dilbert.com

Usually, load testing means setting up a load test environment, instrumenting for measurement, generating a load or replaying logged events and finally gathering results for analysis and comparison. Lather, rinse, repeat often.

The problems typically start immediately. Setting up a load test environment can be very expensive – it needs to be a reasonable approximation of your production environment. That gets very expensive, quickly. I wrote a few weeks back that Cloud computing environments can really be instrumental in cutting into the need to purchase tons of gear for temporary testing tasks.

Oh, by the way, it is not simple at all to generate a useful load. You can use a few client machines to replay some scripts, but realistic scripts are hard to write, especially dynamic scripts that drive fancy AJAX based front ends.

Enter the practice of replaying logged events – great idea on the surface, just take real world traffic logs and replay them. Heck, you can even grab the events from the Production environment – nothing like the real thing, right? Until you realize that these logged events typically require the state of your system to be consistent with when the events actually occurred. Is it at all practical to “snapshot” your production environment state so that it is synchronized to the point in time that matches your log data? Sometimes that can be very difficult to do correctly, if at all – the data set can be very large, it may really burden your production environment, it may sever transactions and simply not work at all.

Finally, of course, even if you are able to grab a synchronized production snapshot, move it to your load testing environment and find some way to crank up the front end loads, you’re not ready yet. There is the landmine of replaying events against a production data set – whoops. Do you have customer emails in your data and does everyone now get messages from your load test? Are there financial transactions – might not want to replay those again. Sure, you can isolate the environment so email doesn’t get out or transactions don’t take place, but every time you alter the environment you may miss something important about your load testing.

In my next post, I’ll share a practical idea that addresses these challenges and lets you get your load testing done efficiently.

