AWS troubleshooting: how to fix a broken EBS volume (bad superblock on xfs)

As great as EBS volumes are on Amazon Web Services, they can break and refuse to ever mount again. Even if your data is still intact, a simple corruption in the filesystem structure can cause a lot of damage. In this post I show you how to move all that data onto a new EBS drive, so keep calm and read slowly.

So, you try to mount your drive after some updates and you get an error like this in the output of dmesg | tail:

[56439860.329754] XFS (xvdf): Corruption detected. Unmount and run xfs_repair

So you unmount your drive, run xfs_repair, and you get this…

$ sudo xfs_repair -n /dev/xvdf
Phase 1 - find and verify superblock...
bad primary superblock - bad magic number !!!

attempting to find secondary superblock...
..........................................

and no good secondary superblock is found.

Don’t panic; here’s what you have to do next to solve this issue:

  1. Go to your AWS dashboard, EC2 section.
  2. Click on “Volumes”
  3. Find the broken volume.
  4. Create a snapshot of the broken volume (this takes a while)
  5. Create a new volume from the snapshot you just created, the same size as (or larger than) your old drive (this also takes a while)
  6. Attach the new volume to the same EC2 instance (no need to reboot or anything). If the old drive was mapped to /dev/xvdf, the new one will show up as /dev/xvdg (note how the last letter increases alphabetically). If you prefer the command line, steps 4 through 6 are sketched with the AWS CLI right after this list.
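
Here is a minimal sketch of those same steps using the AWS CLI. The volume, snapshot and instance IDs below are placeholders, as is the availability zone, so swap in your own (and remember the new volume must live in the same availability zone as the instance you attach it to):

$ aws ec2 create-snapshot --volume-id vol-0aaa111bbb222ccc3 --description "broken xfs volume"
$ aws ec2 wait snapshot-completed --snapshot-ids snap-0ddd444eee555fff6
$ aws ec2 create-volume --snapshot-id snap-0ddd444eee555fff6 --availability-zone us-east-1a --size 200 --volume-type gp2
$ aws ec2 attach-volume --volume-id vol-0ggg777hhh888iii9 --instance-id i-0jjj000kkk111lll2 --device /dev/xvdg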

Now here’s a gotcha: Amazon will not create your new drive with the same filesystem type (XFS); for some reason it comes up as ext2.

$ sudo file -s /dev/xvdg
/dev/xvdg: Linux rev 1.0 ext2 filesystem data (mounted or unclean), UUID=2e35874f-1d21-4d2d-b42b-ae27966e0aab (large files)

Here you have two options:
1. Live with the new ext2 filesystem. Make sure your /etc/fstab is updated to look something like this:
/dev/xvdg /path/to/mount/to auto defaults,nobootwait,noatime 0 0
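
With that line in place (and assuming /path/to/mount/to already exists), a plain mount should pick it up, and df will confirm the filesystem type:

$ sudo mount /path/to/mount/to
$ df -hT /path/to/mount/to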

or 2. Copy the contents of your drive to a temporary location, usually somewhere inside /mnt, which has plenty of space thanks to the ephemeral drive EC2 instances come with; then mkfs.xfs the new volume and copy the contents back. (This is what I did, since I chose to create a larger drive and the ext2 filesystem that came on the new volume only recognized the size of the snapshot.)
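
Here is a rough sketch of that dance. /data and /mnt/ebs-backup are made-up paths, so adjust them to your setup, and note that the mkfs.xfs step wipes the new volume, so only run it once the copy into /mnt has finished:

$ sudo mount /dev/xvdg /data                  # mount the new ext2 volume that holds your data
$ sudo mkdir -p /mnt/ebs-backup
$ sudo rsync -aHAX /data/ /mnt/ebs-backup/    # stash everything on the ephemeral drive
$ sudo umount /data
$ sudo mkfs.xfs -f /dev/xvdg                  # reformat the new volume as xfs (this destroys whatever is on it)
$ sudo mount /dev/xvdg /data
$ sudo rsync -aHAX /mnt/ebs-backup/ /data/    # copy the data back onto the fresh xfs filesystem

If you used the fstab line from option 1, the "auto" filesystem type will happily cover the switch from ext2 to xfs.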

Hope this saved your ass; leave a note if it did.

Remember: never take any irreversible action until you have a disk snapshot, and try your best to never lose data.