Fisher Plaza: Service Restored, Portent Response

Ian Lurie Jul 4 2009

If you host a web site with Portent, you’ve been kept apprised of the situation that started Thursday night at 11 PM.
At this point, all sites are back ‘live’. The Fisher Plaza East data center is running on temporary power from Caterpillar generators that the City of Seattle set up outside.
Here’s a quick rundown of what happened, why, and how we can avoid it in the future:

Chain of Events

Thursday
11 PM: A fire breaks out in the electrical hub at Fisher Plaza East. The sprinkler system activates, but floods and shorts out the emergency generators. Fisher Plaza East goes dark some time before 12 AM Friday.
Friday
12 AM: We receive an alarm that our sites are ‘down’. We attempt to contact our servers, can’t connect, and then attempt to contact Adhost. Adhost’s phones and servers are all down, as well, because of the fire. We try to contact them repeatedly over the next several hours.
5 AM: We get news of the fire at Fisher Plaza. Until 5 AM Pacific, no news outlet or other resource had reported on it. We found the information via Twitter and by checking Seattle Fire Department incident logs. So much for Seattle’s ‘old media’. To be fair, though, two of Seattle’s primary news stations were also knocked out by the fire, so they couldn’t report any news.
5:30 AM: We contact all Portent hosting clients regarding the outage. At this point, it’s very clear this problem will last the day. I review alternatives and decide we have to sit tight and wait for power to be restored. We can’t get into the building yet; any web sites we repoint to another host would take about 24 hours to ‘repoint’ across the entire internet; and repointing them back to Adhost afterward would take another 24 hours.
5:45 AM: We shut down all pay per click advertising for sites that are offline.
We spend the rest of Friday monitoring the situation. Where possible, we use clients’ Facebook and Twitter accounts to notify their customers.
Saturday
1 AM: Building power is reconnected. Power supplies are recharged, and HVAC is restarted to cool the building before servers restart. The building got up to over 100 degrees inside during the day (it’s been over 85 degrees for the last several days).
4 AM: Servers come back online. Power restored to the building.

What Could Have Prevented This?

In the end, Portent had only one way to avoid this outage: Host all web sites in two physical locations. Then, when Fisher Plaza East lost power, we could have started up the sites in the other location.
The problem with this solution: It’s expensive. Extremely expensive. It requires that we duplicate the hosting environment – hardware and software – and that we constantly keep both hosting locations in sync. For a once-per-decade power outage, the cost/benefits balance just doesn’t work.
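To make that trade-off concrete, here is a rough sketch of the arithmetic in Python. The 26 hours and the once-per-decade frequency come from this post. The hourly loss and the yearly cost of redundant hosting are purely hypothetical placeholders, not Portent pricing or any client's real numbers; plug in your own.

```python
def expected_annual_outage_cost(hours_down, outages_per_year, loss_per_hour):
    """Average yearly loss from outages of this size and frequency."""
    return hours_down * outages_per_year * loss_per_hour


def break_even_loss_per_hour(hours_down, outages_per_year, redundancy_cost_per_year):
    """Hourly loss above which redundant hosting costs less than the outages."""
    return redundancy_cost_per_year / (hours_down * outages_per_year)


# From this post: one 26-hour outage, roughly once per decade.
HOURS_DOWN = 26
OUTAGES_PER_YEAR = 1 / 10

# Hypothetical placeholders: replace with your own figures.
loss_per_hour = 200.0              # revenue lost per hour offline
redundancy_cost_per_year = 6000.0  # yearly cost of a second hosting location

outage_cost = expected_annual_outage_cost(HOURS_DOWN, OUTAGES_PER_YEAR, loss_per_hour)
threshold = break_even_loss_per_hour(HOURS_DOWN, OUTAGES_PER_YEAR, redundancy_cost_per_year)

print(f"Expected yearly outage cost: {outage_cost:.2f}")
print(f"Redundancy breaks even at {threshold:.2f} lost per hour offline")
print("Redundancy pays off" if outage_cost > redundancy_cost_per_year else "Redundancy costs more")
```

This only counts direct losses. It leaves out harder-to-measure costs such as customer trust, which is part of why the numbers can feel wrong right after an outage.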

Future Plans

However, cost/benefits analysis doesn’t feel as good right after your sites are down for 26 hours. So Portent will be planning and offering two additional services:
For e-commerce and other database-driven sites: Redundant hosting in multiple physical locations. If there’s a major power outage, we can flip a switch to point at the backup location, and you only experience a few minutes of downtime.
For lower-priority sites: Hosting of ‘static’ files, such as images, video and pages that don’t change, using Amazon S3. Backup of databases and other dynamic files using Amazon S3, as well. That way, moving to a new location in an emergency might take time but is still possible, even if all connectivity to Adhost is lost.
These services will cost extra: Portent will have to pay for data transfer, hardware and power usage for the additional storage and the ‘mirroring’ necessary to keep separate locations synchronized. Once we’ve worked out pricing, we will help you with the cost/benefits analysis.
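For the curious, the ‘flip a switch’ idea behind redundant hosting can be sketched in a few lines of Python. This is only an illustration, not how Portent or Adhost will implement it: the class name, the host names and the three-failure threshold are all made up for the example, and the health check itself (actually contacting the server) is left out.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverSwitch:
    """Decides which location should serve a site, based on health checks."""
    primary: str
    backup: str
    failure_threshold: int = 3  # hypothetical: failed checks in a row before switching
    consecutive_failures: int = field(default=0, init=False)

    def record_check(self, primary_is_up: bool) -> str:
        """Record one health-check result and return the location to use."""
        if primary_is_up:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.active

    @property
    def active(self) -> str:
        """Use the backup only after enough failures in a row."""
        if self.consecutive_failures >= self.failure_threshold:
            return self.backup
        return self.primary


# Hypothetical host names, for illustration only.
switch = FailoverSwitch("primary.example.com", "backup.example.com")
for primary_is_up in [True, False, False, False]:
    print(switch.record_check(primary_is_up))
```

Waiting for several failed checks in a row keeps a single network hiccup from triggering a switch. In this sketch the site returns to the primary location as soon as one check succeeds; a real setup would likely be more cautious about switching back.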
I hope this information is helpful, and I appreciate everyone’s patience during the outage.