FleepGrid Provisionally Back Up Following Hard Disk Failure

March 19, 2012 in Downtime

It’s been a long few days.  (This post is being updated as services are restored.)

The TL;DR Version

I’ll try to re-cap what happened in detail, but the long and short of it is that the FleepGrid server had a total hard drive failure and I had difficulty restoring things from backup.  Here’s the quick status of each service:

  • FleepGrid is ONLINE – but has time-warped back to October 2011 since that was the last complete database backup I was able to restore.  This means if you created an account on FleepGrid between October and now, your account, your inventory, and anything else is gone.  =(  I’m really sorry.  Really REALLY sorry.
  • The FleepGrid website is ONLINE - should be fine, no data loss.
  • The FleepGrid Web Shop is ONLINE - it has been upgraded to a newer version, and hopefully everything is completely restored.
  • The FleepGrid Portal region on OSGrid is ONLINE.  

The Tell Me Everything Version

Indications of Trouble

The first hint that something was wrong actually wasn’t obvious.  On Thursday last week, a few people notified me that they were having trouble importing IAR files from the FleepGrid Web Shop that they’d just downloaded.  This seemed quite strange, since everything had been working fine the day before (I’d imported an IAR file into OSGrid from my own website with no trouble).  I sent a message to the listserv asking if anyone had a hint about the error message:

20:26:41 - Command error: System.IO.InvalidDataException: The magic number in GZip header is not correct. Make sure you are passing in a GZip stream.
at System.IO.Compression.GZipDecoder.ReadGzipHeader()
at System.IO.Compression.Inflater.Decode()
at System.IO.Compression.Inflater.Inflate(Byte[] bytes, Int32 offset, Int32 length)
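For anyone who hits the same error: the message says the importer expected a GZip stream, and every gzip stream starts with the two bytes 0x1f 0x8b. Here is a minimal Python sketch for checking a downloaded file before trying to import it. The function names and the file name in the usage note are made up for illustration, and this is not what Opensim itself does internally.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_like_gzip(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole file decompresses without an error.

    Reading to the end makes the gzip module check the stored CRC,
    so truncated or corrupted downloads are caught as well.
    """
    if not looks_like_gzip(path):
        return False
    try:
        with gzip.open(path, "rb") as f:
            while f.read(chunk_size):
                pass  # discard the data; only errors matter here
        return True
    except (OSError, EOFError, zlib.error):
        return False
```

Calling `gzip_is_intact("my-download.iar")` (a hypothetical file name) returns False both for a file that was never gzip to begin with and for one that got damaged on disk or in transit.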

No one replied, and while I was trying to figure out what was wrong with the IAR files, I decided to further separate some of the region processes so I could shut off a few regions that weren’t really being used to improve performance on the grid.  Except when I tried to copy over my Opensim config files, I got an ominous Windows error:

Cannot copy opensim_0.7.2: Data error (cyclic redundancy check)

One quick Google search for “cyclic redundancy check” and I was suddenly (painfully) aware that the hard drive was probably failing and I needed to get any data I could ASAP.  Which I did.  And I had backups!  So, while I was very bummed, and dreading re-installing Windows and everything, I wasn’t terribly worried about data loss since I regularly take database dump backups and OAR backups and save them out to an external drive.

It will be a pain, I thought, but doable over the weekend.
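For the grab-what-you-can step, one option is a copy loop that keeps going when an individual file can't be read, instead of stopping at the first error. This is only a sketch, not the method I actually used. It assumes the read failure surfaces in Python as an OSError, and the function name and the paths in the usage note are made up for illustration.

```python
import shutil
from pathlib import Path


def salvage_copy(src_root, dst_root):
    """Copy every file that can still be read from src_root to dst_root.

    Returns a list of (path, error) pairs for the files that failed,
    so you know exactly what still has to come from a backup.
    """
    src_root = Path(src_root)
    dst_root = Path(dst_root)
    failed = []
    for src in src_root.rglob("*"):
        if not src.is_file():
            continue  # directories are created as needed below
        dst = dst_root / src.relative_to(src_root)
        try:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)  # copies contents plus timestamps
        except OSError as exc:
            failed.append((src, exc))  # note the failure and keep going
    return failed
```

Something like `salvage_copy("C:/opensim", "E:/rescue/opensim")` (hypothetical paths) would do it. The destination should be on a different physical drive than the one that is failing.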

Installing New Hard Drive, Installing Windows, Installing installing installing…

I hopped in the car and picked up a new drive and got started with the Windows install straight away.  By the next morning, Windows was done and I got going on installing all the other software I use.  I won’t bore you with too many of the details, but I tried to keep super good notes in case anyone else is prepping a Windows XP box for Opensim and wants to follow what I did.  (It’s the 3/17/2012 entry if you’re reading this post later.)

Note:  I also run the http://fleepgrid.com WordPress site and the FleepGrid Shop Opencart site from the same server, so I had to install a bunch of stuff for those services that you would NOT need to install for plain Opensim.

O NOES – Problems with my Backups

Everything was going swimmingly until I got to the point of restoring the MySQL database dumps.  All of the databases restored beautifully… except the one that was most important!  My Opensim database!  For some unknown reason (I still have to figure this out), all of my Opensim database dumps only contained ONE TABLE – the asset table.  All the other tables, poof.  Not there.

Now, I do have OAR backups of all the regions, but I really did panic when I realized everything in my inventory, all of the user accounts, all of the other stuff that makes up FleepGrid besides the stuff out on the sims was gone.  :(

The only saving grace was that I’d done a full NTBackup of the FleepGrid system back in October 2011 as a test, and I still had that file.  I was able to restore it and from there extract the MySQL data files to restore a snapshot of the Opensim database as it existed in October 2011.  Again, all the gory details of all the stupid things I tried in the interim are in the 3/17/12 change log entry.

What I Did Wrong

My primary mistake was not a lack of backups.  I had backups.  I was backing things up regularly, scheduled, automated even!

My mistake was that I never tested the backups.  If I’d tried it even once, I would have realized all the tables weren’t being dumped, and I would have saved myself a huge huge headache and lots of stress.

As an IT person, I know this.  You know this.  We all know this.  Still, when it comes to hobby projects like FleepGrid, we sometimes get lazy doing all the checklist things we know we should do, and then we pay for it.  Or at least I did, and any poor peeps whose accounts I just lost.  =(
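A backup test doesn't have to be elaborate.  Here is a minimal Python sketch that would have caught my problem: it scans a mysqldump file for the CREATE TABLE statements mysqldump writes for each table and reports which expected tables are missing.  The function names are made up for illustration, and the table names in the usage note are placeholders, not the real Opensim schema, so substitute the list from your own database.

```python
import re

# mysqldump writes one "CREATE TABLE `name` (" line per table it dumps.
CREATE_TABLE = re.compile(
    r"^CREATE TABLE (?:IF NOT EXISTS )?`([^`]+)`", re.IGNORECASE
)


def tables_in_dump(dump_path):
    """Return the set of (lowercased) table names defined in a dump file."""
    tables = set()
    # errors="replace" keeps binary asset data from stopping the scan.
    with open(dump_path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            match = CREATE_TABLE.match(line)
            if match:
                tables.add(match.group(1).lower())
    return tables


def missing_tables(dump_path, expected):
    """Return a sorted list of expected tables that are not in the dump."""
    wanted = {name.lower() for name in expected}
    return sorted(wanted - tables_in_dump(dump_path))
```

For example, `missing_tables("opensim.sql", ["assets", "useraccounts", "inventoryitems"])` with those placeholder names returns an empty list only if all three tables are in the dump.  This only checks that the tables are present, so an occasional real restore into a scratch database is still the better test.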

---

So that’s the update for now.  More to come: I’m sure lots of things still aren’t working quite right, but I wanted to give an update as soon as I had a good idea of what the situation was.  Please be on notice that FleepGrid might be going up and down a bit while I get everything repaired.  I probably won’t make a blog post each time; I’ll just leave this here for reference.

And my sincere apologies again to anyone who was/is inconvenienced by the outage.  I’m really glad FleepGrid is just a test grid.  Hopefully, if you run your own grid, you will go off right this minute and test your backups to make sure they work so you don’t have to make a post like this.  ;)