btrfs RAID

These are notes describing briefly how to deal with btrfs RAID issues.

What is btrfs?

btrfs is a modern Linux file system that offers RAID support without additional third-party software tools or RAID cards. Whilst not as fast as hardware RAID, btrfs file systems are reliable and easy to manage, including remotely. StitchIt has been tested on btrfs RAID 1+0 with platter drives. Its IO-heavy functions read and write in parallel to take advantage of the file system.

What file system am I using?

Run df -Th to list disk usage and also file system type.
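For use in a script, the same check can be narrowed to a single path. This is only a sketch: fs_type is a made-up helper name, and it assumes a df that accepts -P together with -T (GNU df does), so that each file system stays on one output line.

```shell
#!/bin/sh
# Print the type of the file system that holds a given path.
# fs_type is a hypothetical helper name, not part of any tool.
fs_type() {
    # -T adds the Type column; -P keeps each file system on one line,
    # so the type is the second field of the second line.
    df -PT "$1" | awk 'NR == 2 { print $2 }'
}

# Example: report the type of the root file system.
fs_type /
```

Pointing it at the RAID mount point (for example fs_type /mnt/data) should print btrfs if the volume is set up as described in these notes.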

Which disks to use?

Western Digital Gold or Black disks have a long warranty and seem to be reliable. 4 TB or 6 TB disks should be adequate. Use four drives at a minimum; six if you anticipate very heavy workloads. Unless you routinely generate samples in excess of 1 TB, about 12 TB of storage is likely adequate. Figure that each sample will transiently occupy about 4 times the final stitched data size: raw data, plus compressed raw data, plus original stitched images, plus cropped stitched images. Perhaps you keep 10 to 20 samples on there at any one time whilst they await processing and transfer to the server. On a multi-user system, disk usage will always expand to fill the available space. It can become a headache to manage a large server with dozens of samples (especially if there is a disk failure), so err on the side of less space.
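The sizing argument above can be turned into rough arithmetic. The sketch below is illustrative only: both helper names are made up, the sample count and size are invented figures, and it ignores metadata overhead and the TB versus TiB difference. It assumes a RAID1 or RAID10 pool of equal-sized drives, where every block is stored twice.

```shell
#!/bin/sh
# Rough capacity arithmetic; both helper names are hypothetical.

# Usable space (TB) of a btrfs RAID1 or RAID10 pool of equal-sized drives.
# Every block is stored twice, so usable space is half the raw total.
usable_tb() {  # usage: usable_tb <number of drives> <TB per drive>
    echo $(( $1 * $2 / 2 ))
}

# Transient space (GB) taken by samples awaiting processing, using the
# factor of 4 from the text (raw + compressed raw + stitched + cropped).
transient_gb() {  # usage: transient_gb <samples> <final stitched GB per sample>
    echo $(( $1 * $2 * 4 ))
}

usable_tb 6 4        # six 4 TB drives
transient_gb 15 150  # 15 samples of 150 GB each (made-up figures)
```

With these inputs the first call prints 12 and the second prints 9000, i.e. about 9 TB of transient data on a pool with 12 TB usable.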

What does btrfs mean for me?

Not much, as long as it's working well. However, heavily used platter drives start to fail after about 3 years and you'll need to keep an eye out for this. The larger your RAID array, the more disks you have and so the greater the likelihood of a disk failure. At a minimum you will need to know how to identify whether a problem exists, which disk is the problem, and how to replace it.
Your first point of information for working with btrfs is the btrfs wiki.

Setting up notes

The following are remarks about btrfs on Ubuntu 16.04:
  • mkfs.btrfs -L data /dev/<drive1> /dev/<drive2> ... (where <drive1>, <drive2> etc. stand for the full drives used in the RAID pool) makes a RAID0 volume, not a redundant RAID1 or RAID10. Use mkfs.btrfs -L data -d raid1 /dev/<drive1> /dev/<drive2> ... to get RAID1 for data at creation time. Add -m raid1 to get RAID1 for metadata too (this actually seems to be the default).
  • To check the RAID type and usage, the command btrfs filesystem usage /mnt/data gives a nice summary.
  • To change the RAID type after creation, use the balance operation btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/data (here converting data and metadata to RAID1). There is no need to unmount /mnt/data. It can take quite some time, so running the command in a tmux session or in the background is generally a good idea.
  • btrfs RAID1 only ensures that data is duplicated (2 copies), unlike conventional RAID1, which keeps N copies (N >= 2). Apart from that, it is quite extensible: more drives can be added to a mounted file system, and you can use the machine while the new drive is being integrated into the RAID array.
  • RAID10 necessitates at least 4 drives. Depending on the benchmarks.
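When checking the RAID type from a script, for instance to confirm that a conversion has finished, the usage summary can be filtered down to the profile names. This is a sketch under an assumption about the output layout: it expects btrfs filesystem usage to print lines starting with "Data,<profile>:", "Metadata,<profile>:" and "System,<profile>:". The helper name raid_profiles is made up.

```shell
#!/bin/sh
# List the RAID profile of each block group type, reading the text of
# "btrfs filesystem usage" on standard input.
# raid_profiles is a hypothetical helper name.
# Assumption: summary lines look like "Data,<profile>: Size:..., Used:...".
raid_profiles() {
    # Split on commas and colons: field 1 is the type, field 2 the profile.
    awk -F'[,:]' '/^(Data|Metadata|System),/ { print $1, $2 }'
}

# Intended use (needs a mounted btrfs volume, here /mnt/data):
#   sudo btrfs filesystem usage /mnt/data | raid_profiles
```

While a conversion is still running, the same type may appear with both the old and the new profile.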

Diagnosing problems

Often the first indication that something is wrong is that IO becomes very slow. It is better to watch for problems before this point, but if you do notice slow IO, check for a RAID problem. First of all, look in dmesg:
$ dmesg | grep -i 'btrfs'
That will bring up any errors and also indicate on which drives the errors are happening.
You can find the serial number of the drive (say, /dev/sda) as follows:
$ udevadm info --query=all --name=/dev/sda | grep ID_SERIAL
With the serial number you can physically ID the drive in the machine.
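To label several drives at once, the serial lookup can be wrapped in a small filter. A minimal sketch, assuming the property lines printed by udevadm info --query=all have the form "E: ID_SERIAL=<value>"; the helper name serial_from_udev is made up.

```shell
#!/bin/sh
# Extract the ID_SERIAL value from "udevadm info --query=all" output
# read on standard input. serial_from_udev is a hypothetical helper name.
# Assumption: property lines look like "E: ID_SERIAL=<value>".
serial_from_udev() {
    # Matching "ID_SERIAL=" exactly skips related keys such as ID_SERIAL_SHORT.
    sed -n 's/^E: ID_SERIAL=//p'
}

# Intended use: print every /dev/sdX disk with its serial number.
#   for dev in /dev/sd?; do
#       printf '%s  %s\n' "$dev" "$(udevadm info --query=all --name="$dev" | serial_from_udev)"
#   done
```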
But what to do if there is a problem?
  • First start a scrub with sudo btrfs scrub start <mount point> and wait; it takes a long time (e.g. 7 hours for four 4 TB drives).
  • Check the scrub status with sudo btrfs scrub status <mount point> to see the number of unrecoverable errors.
  • Find affected files in dmesg messages: dmesg | grep BTRFS | grep path
  • Check drive health with smartctl (installed via sudo apt install smartmontools):
      • sudo smartctl -t short <dev path> or sudo smartctl -t long <dev path> starts a short or long self-test in the background.
      • sudo smartctl -a <dev path> or sudo smartctl -x <dev path> prints a short or extended report on the drive and the test outcomes.
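Drive age is one of the things worth pulling out of the smartctl report, given that heavily used platter drives tend to fail after about three years. The sketch below converts the Power_On_Hours attribute into whole years. It assumes the usual attribute table layout, with the attribute name in the second column and the raw value in the tenth; the helper name power_on_years is made up.

```shell
#!/bin/sh
# Convert Power_On_Hours from a smartctl report (read on standard input)
# into whole years of powered-on time.
# power_on_years is a hypothetical helper name.
# Assumption: attribute name is field 2 and the raw value is field 10.
power_on_years() {
    # 8760 hours = 365 days; int() truncates to whole years.
    awk '$2 == "Power_On_Hours" { print int($10 / 8760) }'
}

# Intended use:
#   sudo smartctl -a /dev/sdc | power_on_years
```

A result of 3 or more would put the drive in the age range where these notes suggest considering a replacement.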
If the errors cannot be recovered, if the above tests indicate a sick disk, or if the drive is older than about three years (smartctl --all /dev/sdc | grep Power_On_Hours), then you should probably change the disk. Ideally your PC has an empty hot-swap SATA bay, into which you can plug the new drive without powering down. Then:
  • If needed, wipe any existing file system information from the new drive using sudo wipefs -a <dev path of new drive> (CAREFUL!)
  • Use the replace command, sudo btrfs replace start <ID> <dev new> <mount point>, where <ID> is the btrfs number of the device to replace (it can be obtained using sudo btrfs device usage <mount point>, for example).
  • Do not use btrfs device delete to remove the problematic drive! btrfs will try to re-duplicate the data elsewhere, which takes ages, may not succeed depending on the remaining space, and cannot be interrupted.
  • Re-balance data across the RAID volume using sudo btrfs balance start <mount point> (use the -dusage option to avoid a full balance, which can take a very long time) and use sudo btrfs balance status <mount point> to monitor it.
  • More information in the link above.
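The replacement steps above can be collected into a dry-run helper that only prints the commands, so they can be reviewed before anything touches the array. This is a sketch, not a tested procedure: replace_plan is a made-up name, the device ID, device path and mount point in the example are placeholders, and the -dusage=50 threshold is an illustrative value rather than a recommendation from these notes.

```shell
#!/bin/sh
# Print, without running them, the commands for swapping a failing drive.
# replace_plan is a hypothetical helper name.
# -dusage=50 is an illustrative threshold, not a value from these notes.
replace_plan() {  # usage: replace_plan <btrfs device ID> <new device> <mount point>
    echo "sudo btrfs replace start $1 $2 $3"
    echo "sudo btrfs replace status $3"
    echo "sudo btrfs balance start -dusage=50 $3"
    echo "sudo btrfs balance status $3"
}

# Example with placeholder values: device ID 2, new drive /dev/sdf.
replace_plan 2 /dev/sdf /mnt/data
```

Printing rather than executing keeps the destructive steps under manual control; each line can be pasted into a tmux session once the device ID has been double-checked.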