The Great KDE Disaster of 2013


    Frightening:

    On 2013-03-22, the server that hosts the git.kde.org virtual machine was taken down for security updates. Both virtual machines running on the server were shut down without incident, security updates were applied to the host, and the machine was rebooted.

    When the host came back up and started the VMs, the VMs immediately showed evidence of file system corruption (the file system in question was ext4). It is not known at this time (and we’ll probably never know) whether this corruption had been silently ongoing for a long period of time, or was the result of something specific that occurred during the shutdown or reboot of the VM or host. There is some evidence to suggest the former, but nothing concrete.

    As most of you reading this are well aware, KDE has a series of “anongit” machines whose purpose is to distribute the heavy load across the 1500 hosted Git repositories and to act as backups for the main server. However, when we checked the anongit machines, every single one of them had severely corrupted repositories and many or all repositories were missing.

    How could this happen?
    The saga unfolds...continue reading.

    #2
    Woah. So I skimmed through it; I have trouble understanding all the ins and outs anyway. As I understand it, all essential KDE data was in such a poor state that it was about to get 'nuked'?
    Just one day before this all happened, the anongit cloning system had been set up on the new server in preparation for the migration. That was by no means the only piece of luck: this single server happened to have the beginning of its syncing window – which happens once every twenty minutes on this box – fall into the time during the server reboot. As a result, the command to fetch the latest projects list timed out, and the script passed over it and simply attempted to fetch the latest revisions from the repositories on the server, which failed as the server could not produce a valid pack.
    One day away from total disaster? Scary...

    b.r

    Jonas
    ASUS M4A87TD | AMD Ph II x6 | 12 GB ram | MSI GeForce GTX 560 Ti (448 Cuda cores)
    Kubuntu 12.04 KDE 4.9.x (x86_64) - Debian "Squeeze" KDE 4.(5x) (x86_64)
    Acer TimelineX 4820 TG | intel i3 | 4 GB ram| ATI Radeon HD 5600
    Kubuntu 12.10 KDE 4.10 (x86_64) - OpenSUSE 12.3 KDE 4.10 (x86_64)
    - Officially free from windoze since 11 dec 2009
    >>>>>>>>>>>> Support KFN <<<<<<<<<<<<<

    Comment


      #3
      Wow.

      Even though Linux gearheads like to wax sentimental about how long some server has been running without a reboot, this almost-disaster highlights the value of a regular periodic reboot and fsck.

      It also highlights the value of a recent backup that is stored "somewhere/anywhere" else, away from the production system.

      Thanks Steve.

      Comment


        #4
        It's easy to play Monday-morning quarterback, but holy crap...

        KDE has a series of “anongit” machines whose purpose is to distribute the heavy load across the 1500 hosted Git repositories and to act as backups for the main server.
        I don't understand why they were using mirrors and git as some kind of backup strategy. On the face of it, at least to me, it seems like a really bad idea. Why not at least have tarball snapshots? (Maybe they did have these; I hope so.)

        I have about 40 Mercurial repos I am responsible for, but the idea of using Mercurial as part of a backup strategy would never have crossed my mind. Sure, DVCSs are great, but they also have a lot of moving parts, and they are not backups.

        IMHO any backup strategy should be as simple as possible. E.g. every night your data gets tarred into a dated snapshot, sent to a different machine, and checked for consistency at periodic intervals. How is this not better than using mirrors and git? In a worst-case scenario you're going to lose a few hours of work.
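        The strategy described above fits in a few lines of shell. This is just a minimal sketch; every path and the off-site hostname are made up for illustration, and a real setup would run it from cron:

        ```shell
        #!/bin/sh
        # Nightly snapshot sketch: tar the data into a dated archive,
        # checksum it, ship it off-box, verify it periodically.
        # All paths and hosts below are hypothetical.
        set -eu

        WORK=$(mktemp -d)            # stand-in for the real filesystem
        DATA_DIR="$WORK/repos"       # what to back up
        BACKUP_DIR="$WORK/backups"   # local staging area
        mkdir -p "$DATA_DIR" "$BACKUP_DIR"
        echo "some repository data" > "$DATA_DIR/README"

        STAMP=$(date +%Y-%m-%d)
        SNAPSHOT="$BACKUP_DIR/repos-$STAMP.tar.gz"

        # 1. Dated snapshot.
        tar -czf "$SNAPSHOT" -C "$WORK" repos

        # 2. Record a checksum so later consistency checks are trivial.
        ( cd "$BACKUP_DIR" && sha256sum "$(basename "$SNAPSHOT")" \
            > "$(basename "$SNAPSHOT").sha256" )

        # 3. Ship both files to a different machine (hypothetical host):
        # scp "$SNAPSHOT" "$SNAPSHOT.sha256" backup@offsite.example.org:/backups/

        # 4. Periodic consistency check (run from cron on the backup host):
        ( cd "$BACKUP_DIR" && sha256sum -c "$(basename "$SNAPSHOT").sha256" )
        ```

        The point is that there is almost nothing here that can fail silently: the archive either exists and matches its checksum, or the cron job screams.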

        Comment


          #5
          Originally posted by eggbert View Post
          Why not at least have tarball snapshots (maybe they did have these, I hope so)
          They did. See the follow-on post:

          I forgot to mention that we actually have tarballs of every single repository, updated every couple of days and also transferred to the anongits. These are not full, true mirrors of the repositories, simply normal clones, and they’re designed to make it easier for our contributors or users in low-bandwidth environments to fetch initial clones. They don’t contain our full repository metadata, so we would have lost all of that. And it’s not a perfect backup strategy by any means, as I’ll detail below. But it’s something.
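          The distinction the quoted post is drawing is between a plain clone and a mirror clone. A rough sketch against a throwaway local repository (the repo contents here are invented):

          ```shell
          #!/bin/sh
          # Plain clone vs. mirror clone, demonstrated on a local throwaway repo.
          set -eu
          WORK=$(mktemp -d)
          cd "$WORK"

          # A tiny upstream repository with an extra branch and a tag.
          git init -q upstream
          ( cd upstream \
            && git -c user.email=a@b -c user.name=a commit -q --allow-empty -m initial \
            && git branch topic \
            && git tag v1 )

          # Plain clone: a working copy. Other branches only show up as
          # remote-tracking refs (refs/remotes/origin/*), not as local ones.
          git clone -q upstream plain-copy

          # Mirror clone: every ref copied verbatim (refs/heads/topic,
          # refs/tags/v1, ...) -- what a copy needs to serve as a real backup.
          git clone -q --mirror upstream full-copy.git

          git -C full-copy.git show-ref   # lists every ref, including topic and v1
          ```

          Even a mirror clone, though, only carries what git itself tracks; server-side metadata (hooks, access rules, and so on) still needs its own backup, which is the gap the post is admitting to.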

          Comment
