Posts Tagged downtime
PDF Export currently down (fixed)
Our PDF export server is presently down. It had to be rebooted so that we could organize and route some power cables in our racks. Since powering back on, it has been failing to load all of its software correctly. We are working on resolving it; I just wanted to post something here on the blog, since it is the first place many people check when they think a service is broken.
Intermittent media server load problems
We’ve been seeing some general slowdowns in our image and media file serving recently, including some instances in the last couple of days where the sites as a whole have been affected to the point of extreme slowness or temporary inaccessibility.
Domas believes this is related to a reported problem with NFS performance when ZFS snapshots are active. So far, we’ve had some luck improving things by dropping older snapshots (possibly along with restarting NFS and temporarily disabling the image scaler servers to give the file server a little breathing room to reset).
We’ve been planning for some time to redo the way we access our media files internally, which should help reduce the impact on the rest of the site when load problems occur on the file servers. We might also be able to spread the load among multiple servers to improve things even more.
Updates will come as we get things back on track…
Update 2009-07-15: We’re temporarily shutting off uploads while we apply the ZFS fix patch and reboot the main file server. You may see some missing images or funky error messages for a little bit, but the sites should otherwise continue working normally until the file server is back up.
Update 2: Server is patched and uploads are back online. This should resolve our performance problems while we continue rearranging the upload servers to be more future-proof.
Blog Downtime
Posted by RobH in open-source, software, wikimedia on June 29th, 2009
I am sure that many folks noticed that on the morning of 2009-06-26, techblog.wikimedia.org and blog.wikimedia.org went down. It turns out that some parts of our WordPress installations were compromised. I do not want to get into a direct show and tell of what the attackers did, but hopefully we have hardened the installation to the point that it will not occur again.
This is why the blogs exist on their own server: when things like this happen, we can minimize the impact. The blogs are both up and running now, along with the other services that were affected. Everything except techblog was back online before Friday was over; techblog lagged behind until today. (As techblog was the point of exploit, we got everything else back up first.) The other affected services were the Open Conference Systems site for Wikimania 2009 and our survey software. Both of those were back online ASAP after the incident, and the rest followed after.
Of course, it was hard to get this information out to folks when the blogs were down! It goes to show how easy the blogs have made it to get info out, since without them we had to scramble to get the information out through other channels.
Thanks to everyone who assisted in the restoration, and also thanks to everyone for their patience while the system was fixed.
csw2-knams seems to have gone down
CSW2-knams is down and with it a few servers: pascal, ragweed, clematis, iris, fuchsia and a couple of sql-text*.knams.
It seems this issue mostly affects the toolserver environment.
I am still working on figuring out a way of fixing this and will update once the issue has been resolved.
Sorry for the inconvenience.
Update: Mark was able to resolve the issue. Apparently, excess temperature caused by an HVAC malfunction at the datacenter made the servers shut down automatically.
Server named Singer has a sore throat?
While we were working on the servers, some Apache config files were made inoperable. This happened on a miscellaneous-services machine named Singer, which hosts our blogs as well as some other web-facing info. As such, the cached blogs are affected, but not the tech blog. (It was, but it was the easiest to get back online.)
Apologies for any annoyance this single-server downtime may have caused. Rest assured, it will be fixed, and steps will be taken to prevent it from occurring in the future.