psilva's prophecies: Interopt Out

The word came from on high, ‘Stay at home, take care of yourself…Heal.’ I had surgery a couple weeks ago and had thought I’d be ready in time for Interop. Well, some stitches ripped, recovery got extended a bit and I’ll be ‘working from home’ next week. I’m slightly disappointed since this will be the first one I’ve missed in a few years – I even did Interop Tokyo in ‘06 – butt, the procedure I had will still make it difficult to ‘sit this one out,’ if you get my drift. :-)

The cool thing is ‘working from home’ doesn’t mean I can’t participate or get any of my regular work done – I can! Along with my usual tasks, I’ll also be alerting/updating our Twitter audience with our Theater schedule, Speaking Sessions, Booth Prize winners, McAdam’s keynote and a whole lot more. Follow us (@f5networks/#F5Interop) to get updates throughout the week along with a few ‘tweet-only’ special promos for cool schwag.

And, somehow, that gets me to what I really wanted to write about.

A couple weeks ago I wrote about preparing for such things as the H1N1 flu and, really, any unforeseen circumstance where it is not business as usual. These disruptions come in all forms at any time, sometimes without warning. I suggested that a hurried approach, reacting to incidents one by one, might not be best, and that it is better to always have a plan and extra capacity to handle any crisis that comes along. This all falls under the general theme of Disaster Recovery, Business Continuity & Workforce Continuity. Many have lumped these together (since it is all about protecting irreplaceable data) but I’d like to give my interpretation of each and how they can actually be somewhat different.

Disasters happen at any time – sometimes from nature, announced (or at least anticipated) like a hurricane or unannounced like a flooded server room. Sometimes they’re caused by human action, like hackers, a data breach or simply an infected host. I try to avoid using ‘DR’ a lot since disaster, to me, means destruction. Most (or many) large corporations have some redundancy (datacenter, db, network, storage, etc.) so they don’t have to drive somewhere to get the backup tape and rebuild their systems. Sure, that one location is a mess and they might have to replace it, but the business overall (hopefully) continues to function. Now certainly, those in the affected area might not be functioning and, admittedly, their lives probably are a disaster – especially if they lost their home, a loved one or something special during the event.

The key to successful, continuous productivity is planning. There are many reasons you need to keep your operation going smoothly – profitability, customer service agreements and regulatory requirements are just a few. It’s not about backing up and restoring, it’s about keeping things going. The length of acceptable downtime due to any of these issues has shortened significantly, from one day (long ago) to less than an hour, and if it’s an e-commerce site, then even seconds can mean lost revenue. That’s why you need a plan.

While your plan needs to focus on DATA, don’t forget that people should also be both counted on and counted out. That makes the challenge even greater. We understand this and have designed our products to help keep your operation running even in extreme circumstances. As a matter of fact, we have firsthand experience at F5… let me share some examples:

Disaster Recovery

A perfect Disaster Recovery story is one that happened at F5 a couple years ago. There were torrential rains in the UK during the summer of ‘06, and our Chertsey (London) office experienced a flood.

We got a call late on a Sunday evening from our office manager who was in our UK office with our IT guy. It was 9 am US Pacific, 5 pm in the UK. The two of them were literally bailing out the support area on the ground floor of the office with buckets. It had been raining for days and a nearby river had risen over its banks, flooding our offices.

Working in near darkness (they’d shut off the power for safety reasons), they were wrapping whatever they could in plastic bin liners to keep it from further damage. The lab floor was covered in about 5 inches of water, and water was dripping from the ceiling down onto the server racks. I don’t know if you’ve ever dealt with a flood, but it’s not uncommon for the flood to force water back up the waste pipes and then pour down through ceilings…in our case, directly onto our critical equipment. It was clear there was no way to open the office on Monday and that they would need to make alternate plans before 8am GMT, when the UK call center was supposed to go online. Our IT guy quickly identified 4 support reps with broadband at home, and they were able to verify access to one of our corporate FirePass SSL VPNs.

Come Monday morning, the team was split in two: some coming into the office to rebuild the call center with undamaged equipment on the upper floor, while others fielded cases from home. We distributed the incoming calls to members of our Seattle & Singapore teams, who then opened new cases and dispatched them to the appropriate queue. The EMEA team, working from home, could then pick up the cases and work them from there. Through Monday we were able to continue to work existing and new cases, and by mid-afternoon we had enough desktops and phones set up on the first floor to enable everyone to work from the UK office on Tuesday, a real achievement given the bleak situation we faced on Sunday night.

We pulled it off thanks to a distributed work environment that relies on SSL VPN and BIG-IP to give remote workers access while still protecting our valuable corporate network. Mind you, these workers were using a variety of OSes (different flavors of Windows, Mac OS X and Linux) and needed access to a number of client applications like email, the call tracking database, etc. And, since the UK server(s) had been damaged by the floodwaters, they needed to access the network through the US-based servers – causing a big shift in our typical network traffic flow.

Once connected, they could “remote desktop” to our BIG-IP load-balanced Terminal Server -- which gave them access to their email, call tracking database and diagnostic tools -- all on the remote systems.

That is firsthand proof that designing a distributed network not only lets your day-to-day operations run more smoothly, but can be a major factor in ensuring continuous workflow even in the face of calamitous events. Allowing a prescribed number of remote users is a no-brainer – many companies are doing that already, and can easily handle a slight unexpected jump. But what about when Mother Nature causes hundreds of your employees to stay home all at the same time?

Business Continuity

I look at Business Continuity in two ways. On the one hand, Business Continuity happens because of Disaster Recovery planning. No F5 customers were affected even though we had a flood in one of our offices. I also look at Business Continuity as handling those events that are not necessarily destructive on a massive scale but can disrupt normal daily life. Snowstorms, extreme cold and maybe even the ‘now less scary’ H1N1 flu outbreak. Seattle snow is a nice example.

The winter of ‘06 brought 7-10 inches of snow to Seattle (a lot for Seattle -- a city of many hills and few snow plows) and stopped everything. F5 alerted local employees to stay home instead of risking the icy roads. Thanks to our SSL VPN environment, more than 1200 remote sessions were initiated – up from the typical few hundred. Because we were prepared, F5 employees were able to get online and be productive at home rather than building snowmen (although that is fun!), or worse, risk becoming a snowman if their car fell into a ditch. It was also a relatively quiet day for the IT department, if you can believe it.

Workforce Continuity

I define Workforce Continuity as handling those things that might hinder an individual worker. When you think of disaster recovery, you may not be thinking of a plan for the more likely scenarios that keep an employee from getting to the office. Health/sickness, lack of transportation, accidents or family requirements fall into this category. It is, I must admit, sexier to discuss floods, storms and earthquakes, but it is also important to think about the small ways in which your business operations can be challenged by something as simple as an employee walking a dog and ending up in the hospital.

I know, I am that employee – twice now.

A couple years ago, I severely sprained my ankle late on a Saturday night. Doc said to stay off it, ice, elevate, etc. When the nurse asked if a doctor’s note was needed to say ‘unable to work,’ I said, ‘Not at all. In fact, the solutions that I work on allow me to work from home, or any internet location, as if I were sitting at my desk!’ I’m even writing this article from my home office.

At F5, we use our own equipment to create an infrastructure that allows employees access from wherever they happen to be. This benefits not only our sales team, who we expect to be out on the road, but also allows us to encourage telecommuting and, in unexpected circumstances like what happened with me, allows any employee to contribute even when they are away from the office for whatever reason.