Here’s an intriguing article from CIO Journal (WSJ) about a marquee company that did indeed (and still does) have a backup data center but made an explicit decision during Hurricane Sandy not to switch to it. They opted to suspend operations rather than flex to the backup.
Read the article to understand why.
Would you make the same decision for your business?
Is your company prepared — not just logistically, but also business-process-wise — to move from primary to standby systems?
Are your decision criteria for switching to a backup site well-defined and broadly understood among the decision-making team?
And are your employees “mentally prepared” to work with/from a backup data center? Have you practiced with actual transfer drills?
Emphasis in red added by me.
Brian Wood, VP Marketing
Knight Capital Punts on Using Backup Center
Knight Capital seemed to have all of the critical elements of a business continuity plan in place. Yet as the wind, rain and storm surges of Sandy battered the Northeast, that plan failed to keep the brokerage operating on Wednesday. It shut down most of its trading late that morning, out of fear that a backup power generator at its Jersey City, N.J., headquarters would run out of fuel.
And there was another problem. Even though the firm has a backup data center in Purchase, N.Y.—with a separate source of power—it decided early Wednesday to keep operating out of headquarters in Jersey City. And later in the day, it decided against switching data centers in the middle of the trading session because it feared operational risks. As the Wall Street Journal reported:
“Knight maintains a backup data center in Purchase, N.Y., and has the capacity to switch crucial trading systems over to that facility. But the company told some customers and others it was concerned about switching to the disaster site in the middle of the trading day, according to people familiar with the matter.”
The episode highlights one of the major problems with corporate backup systems. For all the money that companies invest in this infrastructure, management is often unable or unwilling to implement it when it is needed most. “It happens more often than you would think,” says Rachel Dines, senior analyst at Forrester Research.
It was the second time in three months that Knight, one of the largest handlers of individual equity trades, was forced to tell customers to route their business to its rivals. On August 1, the firm stopped trading after a software problem disrupted trading in more than 140 NYSE-listed companies, including GE. The firm ended up losing $461 million and had to seek help from outside investors, including Jefferies. That depressed the value of its own stock by as much as 75% at one point, and caused difficulty for CEO Tom Joyce.
Dines said the latest problem at Knight suggests the company was grappling with its business resiliency planning. The problems ranged from the basic—did it have enough fuel on hand?—to more complex issues about how and when to implement its backup data centers.
As the WSJ reported, Knight told employees Thursday morning that “at lower levels of fuel, generators can experience disruptions.” While the generators never actually ran out of fuel, the company was concerned that they might, and shut down preemptively.
Dines said “it’s best practice to store 48 to 72 hours worth of fuel to power backup generators,” and also to have contracts for emergency delivery of fuel. [BW: Good news: AIS has 16,000 gallons of diesel fuel onsite to keep our massive backup generators running for a long, long time.]
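The bracketed note cites 16,000 gallons of diesel on hand, but runtime depends entirely on how fast the generators burn it. A back-of-the-envelope sketch, where the burn rate is an illustrative assumption (not a figure from the article or from AIS):

```python
# Rough generator-runtime arithmetic. The burn rate below is a
# hypothetical figure for a large diesel genset, chosen only to
# illustrate the calculation -- it is not from the article.

def runtime_hours(fuel_gallons: float, burn_rate_gph: float) -> float:
    """Hours of generator runtime for a given fuel supply."""
    return fuel_gallons / burn_rate_gph

ASSUMED_BURN_RATE_GPH = 100.0  # assumption: gallons per hour at load

hours = runtime_hours(16_000, ASSUMED_BURN_RATE_GPH)
print(f"{hours:.0f} hours (~{hours / 24:.1f} days)")  # 160 hours (~6.7 days)
```

Under that assumed burn rate, 16,000 gallons comfortably clears the 48-to-72-hour best practice Dines describes; at a heavier burn rate the margin shrinks proportionally, which is why the emergency-delivery contracts she mentions still matter.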
Knight issued a statement late Wednesday saying it “has identified and resolved the power outages that were experienced earlier today.” But it did not respond to questions from CIO Journal about its power management, data centers, or business contingency planning.
While maintaining an adequate supply of fuel for backup generators may be a relatively straightforward issue, the question of when to switch, or “fail over” to a backup data center is more complex.
“Honestly, what I believe happened at Knight is that they didn’t have a clear-cut decision making process for managing when to switch over,” Dines said. If those business policies aren’t clear “people will stall” when it comes to implementing fail-over in the event of an actual disaster, she said.
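Dines’s point is that the switch-over criteria should be agreed on before the disaster, so no one has to improvise under pressure. A minimal sketch of what such a pre-agreed policy could look like as code; the thresholds and fields are hypothetical, not Knight’s actual policy:

```python
# Hypothetical pre-agreed failover policy, written down as code so there
# is nothing to debate mid-crisis. All thresholds here are illustrative
# assumptions, not anything reported in the article.
from dataclasses import dataclass

@dataclass
class SiteStatus:
    fuel_hours_remaining: float  # runtime left on backup generators
    utility_power_up: bool       # is grid power available at the site?
    market_session_open: bool    # mid-session switches carry extra risk

def should_fail_over(primary: SiteStatus,
                     min_fuel_hours: float = 12.0) -> bool:
    """Return True when the pre-agreed criteria say to switch sites.

    Policy (hypothetical): fail over whenever the primary site is on
    generator power with less than `min_fuel_hours` of fuel remaining --
    even mid-session, on the theory that a rehearsed switch beats an
    unplanned shutdown.
    """
    on_generator = not primary.utility_power_up
    return on_generator and primary.fuel_hours_remaining < min_fuel_hours

# Roughly the Wednesday-morning situation: on generator, fuel dwindling.
print(should_fail_over(SiteStatus(8.0, False, True)))   # True
print(should_fail_over(SiteStatus(48.0, False, True)))  # False
```

The value of writing the policy down is exactly what Dines describes: when the criteria are explicit and rehearsed, people do not stall.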
It’s also important for businesses to conduct regular exercises to test their backup systems. If companies have confidence in their ability to recover, they might be more willing to fail over to a backup site, she said.
It’s also possible that Knight made a rational decision not to fail over, said Alexander Tabb, a partner at TABB Group, a researcher and consultant to the financial services industry. Tabb, who specializes in crisis and continuity, said “disaster recovery does not mean there is no interruption in service.” When businesses fail over to backup data centers, basic functionality is preserved, but customer service or experience may be impaired. So businesses sometimes decide that it is better to shut down completely for a period of time and get their affairs in order, than it is to switch to a lower level of service for an extended period of time.
Looking ahead, however, new approaches to data center design may increase the odds that businesses can fail over without impairing their operations. In the meantime, businesses need to decide just how much protection and resiliency is necessary and cost-effective.
Indeed, a new type of data center architecture is emerging that replaces the traditional fail-over-site approach. The architecture involves building clusters of identical data centers located in close proximity and linked by a fast network, allowing applications to exist simultaneously in more than one location.
Dines said that sort of construction—known as load-balanced systems, or active-active architecture—is expensive to deploy. But it is already commonly used by many of the largest financial institutions. “They are very secretive about it,” Dines said. “It is probably why some of them didn’t go down.”
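The appeal of active-active is that there is no discrete "failover event" to decide on: every site already serves live traffic, so losing one degrades capacity rather than availability. A minimal sketch of that routing idea; the site names and the round-robin policy are illustrative assumptions, not details from the article:

```python
# Minimal active-active routing sketch: any healthy site can serve any
# request, so a site outage requires no switch-over decision at all.
# Site names and the round-robin policy are hypothetical.
from itertools import cycle

class ActiveActiveRouter:
    def __init__(self, sites):
        self.sites = sites                      # all sites take live traffic
        self.healthy = {s: True for s in sites}
        self._rr = cycle(sites)

    def mark_down(self, site):
        """A health check failed; stop routing to this site."""
        self.healthy[site] = False

    def route(self):
        """Pick the next healthy site round-robin."""
        for _ in range(len(self.sites)):
            site = next(self._rr)
            if self.healthy[site]:
                return site
        raise RuntimeError("no healthy sites")

router = ActiveActiveRouter(["dc-jersey-city", "dc-purchase"])
print(router.route())             # dc-jersey-city
router.mark_down("dc-jersey-city")
print(router.route())             # dc-purchase -- traffic keeps flowing
```

Contrast this with a cold or warm standby site, where someone must make the very judgment call Knight faced mid-session; the expense Dines cites buys away that decision entirely.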