These are notes describing briefly how to deal with `btrfs`, a modern Linux file system that offers RAID support without additional third-party software tools or RAID cards. Whilst not as fast as hardware RAID, `btrfs` file systems are reliable and easy to manage. Remote management is also easy. StitchIt has been tested on platter-drive `btrfs` RAID 1+0. IO-heavy functions read and write in parallel to leverage the advantages of the file system.
Use `df -Th` to list disk usage and also the file system type.
Western Digital Gold or Black disks have a long warranty and seem to be reliable. 4 TB or 6 TB disks should be adequate. Use four drives at a minimum; six if you anticipate very heavy workloads. Unless you routinely generate samples in excess of 1 TB, about 12 TB of storage is likely adequate. Figure that each sample will transiently occupy about four times the final stitched data size: raw data, plus compressed raw data, plus original stitched images, plus cropped stitched images. Perhaps you keep 10 to 20 samples on there at any one time whilst they await processing and transfer to the server. On a multi-user system, disk usage will always expand to fill the available space. It can become a headache to manage a large server with dozens of samples (especially if there is a disk failure), so err on the side of less space.
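To make the arithmetic above concrete, here is a rough capacity estimate as a small script. All figures are hypothetical; plug in your own final stitched size and sample count.

```shell
#!/bin/sh
# Rough capacity estimate based on the rule of thumb above: each sample
# transiently occupies ~4x its final stitched size (raw + compressed raw
# + original stitched + cropped stitched). Figures are hypothetical.
FINAL_GB=250    # final stitched size per sample, in GB (assumed)
SAMPLES=10      # samples resident at any one time (assumed)

PER_SAMPLE=$((FINAL_GB * 4))
TOTAL_GB=$((PER_SAMPLE * SAMPLES))

echo "Transient footprint per sample: ${PER_SAMPLE} GB"
echo "Total working space needed:     ${TOTAL_GB} GB (~$((TOTAL_GB / 1000)) TB)"
```

With these (made-up) numbers the estimate lands near the ~12 TB figure quoted above, which is why that amount is usually adequate.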
Not much, as long as it's working well. However, heavily used platter drives start to fail after about 3 years and you'll need to keep an eye out for this. The larger your RAID array, the more disks you have and so the greater the likelihood of a disk failure. You will, at minimum, need to know how to identify whether a problem exists, which disk is at fault, and how to replace it.
The following are remarks about `btrfs` on Ubuntu 16.04:

- `mkfs.btrfs -L data /dev/<drive1> /dev/<drive2>...` (where `<drive1>`, `<drive2>`, etc. stand for the whole drives used in the RAID pool) makes a RAID0 volume, not a redundant RAID1 or RAID10. Use `mkfs.btrfs -L data -d raid1 /dev/<drive1> /dev/<drive2>...` to get RAID1 for data at creation time. Use `-m raid1` to get RAID1 for the metadata too (this seems to be the default, actually).
- To check the RAID type and usage, the command `btrfs filesystem usage /mnt/data` gives a nice summary.
- To change the RAID type after creation, use a balance operation: `btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/data` (here converting data and metadata to RAID1). There is no need to unmount `/mnt/data`. It can take quite some time, so running the command in a `tmux` session or in the background is in general a good idea.
- btrfs RAID1 just ensures that data is duplicated (two copies), which is different from conventional RAID1 (which keeps N copies, N >= 2). Apart from that, it is quite extensible: more drives can be added to a mounted partition, and you can use the machine whilst the new drive is being integrated into the RAID array.
- RAID10 necessitates at least four drives; whether it is worthwhile over RAID1 will depend on the benchmarks.
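Since `mkfs.btrfs` destroys whatever is on the listed devices, it can be worth building the command in a small script and reviewing it before running anything. This is only a sketch: the device names are placeholders, and the script echoes the command rather than executing it.

```shell
#!/bin/sh
# Build the RAID10 creation command from a drive list (placeholder names).
# Nothing is executed: the command is echoed so the device list can be
# double-checked before it is run by hand with sudo.
DRIVES="/dev/sda /dev/sdb /dev/sdc /dev/sdd"   # hypothetical whole drives
LABEL=data

# -d raid10 / -m raid10: RAID10 for both data and metadata (needs >= 4 drives)
CMD="mkfs.btrfs -L $LABEL -d raid10 -m raid10 $DRIVES"
echo "$CMD"
# When satisfied:  sudo $CMD
```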
Often the first indication that something is wrong is that IO becomes very slow. It is probably a good idea to keep an eye out for trouble before this point, but if you do notice slow IO, check for a RAID problem. First of all look in the kernel log:

```
$ dmesg | grep -i 'btrfs'
```

That will bring up any errors and also indicate on which drives the errors are happening.
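To illustrate what that filter surfaces, the snippet below runs the same greps over fabricated kernel-log lines; the message text is made up but follows the usual shape of BTRFS errors.

```shell
#!/bin/sh
# Fabricated dmesg-style lines, to show how the grep narrows things down.
LOG="[1000.1] BTRFS error (device sdb): bdev /dev/sdb errs: wr 0, rd 12, flush 0
[1000.2] usb 1-1: new high-speed USB device
[1000.3] BTRFS info (device sdb): read error corrected: ino 1 off 4096"

# Keep btrfs lines, then count those mentioning an error; the matching
# lines also name the offending device (here /dev/sdb).
ERRORS=$(printf '%s\n' "$LOG" | grep -i 'btrfs' | grep -ci 'error')
echo "btrfs error lines: $ERRORS"
```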
You can find the serial number of a drive (say, `/dev/sda`) as follows:

```
udevadm info --query=all --name=/dev/sda | grep ID_SERIAL
```

With the serial number you can physically identify the drive in the machine.
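`udevadm` prints `KEY=VALUE` pairs, so the serial can be pulled out for scripting. The sample output below is fabricated; only the field names match what `udevadm` really prints.

```shell
#!/bin/sh
# Extract the short serial from captured udevadm output.
# The values here are fabricated samples for illustration.
UDEV_OUT="E: ID_SERIAL=WDC_WD4005FZBX-00K5WB0_VBH12345
E: ID_SERIAL_SHORT=VBH12345"

SERIAL=$(printf '%s\n' "$UDEV_OUT" | sed -n 's/^E: ID_SERIAL_SHORT=//p')
echo "Serial to look for on the drive label: $SERIAL"
```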
But what to do if there is a problem?
- First start a scrub with `sudo btrfs scrub start <mount point>` and wait a long time (e.g. about 7 hours for 4x4 TB drives).
- Check the scrub status with `sudo btrfs scrub status <mount point>` to see the number of unrecoverable errors.
- Find affected files in the dmesg messages: `dmesg | grep BTRFS | grep path`
- Check drive health with smartmontools (`sudo apt install smartmontools`): `sudo smartctl -t short <dev path>` or `sudo smartctl -t long <dev path>` starts a short or long self-test in the background; `sudo smartctl -a <dev path>` or `sudo smartctl -x <dev path>` then gives a brief or detailed report on the drive and the test outcomes.
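The replace-or-keep decision discussed below (unrecoverable errors, drive age) can be read straight out of the scrub and SMART reports. Both report texts in this sketch are fabricated samples; real field layouts vary a little between btrfs-progs and smartmontools versions.

```shell
#!/bin/sh
# Pull the two figures that matter out of captured report text.
# Both reports below are fabricated samples pasted in for illustration.
SCRUB="total bytes scrubbed: 7.21TiB with 3 errors
corrected errors: 3, uncorrectable errors: 0, unverified errors: 0"
SMART="  9 Power_On_Hours  0x0032   064   064   000  Old_age  Always  -  27000"

UNCORR=$(printf '%s\n' "$SCRUB" | sed -n 's/.*uncorrectable errors: \([0-9]*\).*/\1/p')
HOURS=$(printf '%s\n' "$SMART" | awk '/Power_On_Hours/ {print $NF}')
YEARS=$((HOURS / 8766))   # 8766 hours is roughly one year

echo "uncorrectable errors: $UNCORR; drive age: ~${YEARS} years"
```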
If the errors cannot be recovered, if the above tests indicate a sick disk, or if the drive is older than about three years (check with `smartctl --all /dev/sdc | grep Power_On_Hours`), then you should likely change the disk. Ideally you want your PC to have an empty hot-swap SATA bay, into which you can plug the new drive without powering down. Then:
- If needed, you can wipe file system information from the new drive using `sudo wipefs -a <dev path of new drive>` (CAREFUL!)
- Use the replace command, `sudo btrfs replace start <ID> <dev new> <mount point>`, where `<ID>` is the btrfs number of the device to replace (it can be obtained using `sudo btrfs device usage <mount point>`, for example).
- Do not use `btrfs device delete` to remove the problematic drive! `btrfs` will try to re-duplicate the data elsewhere; it will take ages, may not succeed depending on the actual remaining space, and is not interruptible.
- Re-balance data across the RAID volume using `sudo btrfs balance start <mount point>` (use the `-dusage` option to avoid a full balance, which can take a very long time) and use `sudo btrfs balance status <mount point>` to monitor it.
- More information in the link above.
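The replacement steps above can be strung together as a checklist script. Nothing here is executed: each command is echoed for review and should be run by hand with `sudo`. The mount point, device ID, and new device path are placeholders to be filled in from `btrfs device usage`.

```shell
#!/bin/sh
# Dry-run checklist for swapping out a failing drive. All values are
# placeholders; the script only echoes each step for manual review.
MNT=/mnt/data
BAD_ID=3            # btrfs ID of the sick drive, from: btrfs device usage
NEW_DEV=/dev/sde    # the freshly inserted replacement drive

echo "wipefs -a $NEW_DEV                    # wipe old fs signatures (CAREFUL)"
echo "btrfs replace start $BAD_ID $NEW_DEV $MNT"
echo "btrfs replace status $MNT             # watch progress"
echo "btrfs balance start -dusage=75 $MNT   # partial re-balance afterwards"
```

The `-dusage=75` filter in the last step is one illustrative choice: it balances only block groups under 75% full, avoiding the full balance warned about above.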