We had 52 minutes of downtime on the English-language Wikipedia site today; only en.wikipedia.org was affected. Our master database server was thrown into a funky state in which hundreds of access threads were stuck in the “statistics” state — which seems to be MySQL’s way of saying “I’ve fallen and I can’t get up”.
It’s unclear exactly what set it off, but basically nothing works until you restart MySQL. After switching the site to an alternate master database, all has been well.
At 52 minutes from the start of the event, this took us a bit longer to resolve than I’d like: we had to percolate through a couple of levels of alert calls before we finished diagnosing it and got the DB switch pushed through. (Sorry to wake you up early, Tim!)
A similar event in future should be fixable within a few minutes, thanks to Tim’s work on making the master-switch system more foolproof. We’re fixing up our internal documentation so all our site ops will now know how to run the database master switch script next time!
– brion

#1 by Pharos on July 1st, 2009
I have a brilliant idea!
From now on, our downtime screen should say, “This Wikipedia is broken. We recommend looking up this subject in your local library; while you’re at it, kindly take down notes and add them to the Wikipedia article later.”
#2 by pfctdayelise on July 1st, 2009
No donation link? A wiki was down; are donations up?
#3 by brion on July 2nd, 2009
In this case the donations page would have worked fine… we don’t always want a link though, since some sitewide outages would leave that broken too. :)
#4 by Fred from France on July 2nd, 2009
Well, it is doing it again: only partial access, and it’s including the Wikinews servers this time, with no access to the Wikinews page. I vote conspiracy theory. Is it safeguarded against malicious flooding? ~~~~