Do you check your hard drives using smartctl? You should, and here's why:
I have a family server. With the exception of moving (three times in the last four years), Hurricane Matthew (powered off for a couple of days), and one or two other power outages, the server has been up and running continuously for eight years.
The original drive configuration was four drives: 2x1TB and 2x2TB. At the time of the build, the 1TB drives were a year or so old; the 2TB drives were added one at a time to increase capacity, one in 2012 and the other in 2013. Later, wanting more capacity, I swapped out the 1TB drives for a single 6TB drive in 2014. Finally, wanting fully mirrored capacity, in 2017 I added a second 6TB drive, reconfigured both pairs into RAID1 for data, and added a single small SSD as the boot drive. This left me with 8TB of redundant storage for data, which has been plenty.
Here are the drive ages as of this morning, in years and days. These are not calendar ages; they are derived from each drive's Power_On_Hours attribute (see the sketch after the list):
sda (6TB) - 3 years, 247 days
sdb (2TB) - 4 years, 146 days
sdc (6TB) - 1 year, 42 days
sdd (2TB) - 6 years, 80 days
sde (64GB) - 164 days
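If you want to do the same math on your own drives, something like this sketch works. It assumes SMART attribute 9's raw value is a plain hour count (true for these WD drives; some models encode extra data in it):

Code:
#!/bin/bash
# Convert each drive's Power_On_Hours into years + days.
for d in /dev/sd[a-e]; do
    hours=$(sudo smartctl -A "$d" | awk '/Power_On_Hours/ {print $10}')
    years=$(( hours / 8760 ))        # 8760 hours in a year
    days=$(( (hours % 8760) / 24 ))  # remaining hours expressed as days
    echo "$d - $years years, $days days"
done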
Periodically, a few times a year, I check the drives' health with smartctl (the smartmontools package also runs the smartd daemon in the background for continuous monitoring). This week, I was a little startled to see that the two 2TB drives are no longer as young and spry as they used to be (me neither). In the output of sudo smartctl -a on the oldest drive, I got this:
Code:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed: read failure       90%     54342         1652502969
# 2  Extended offline    Completed: read failure       90%     54340         326819317
# 3  Conveyance offline  Completed: read failure       90%     54175         326821768
# 4  Extended offline    Completed: read failure       90%     54173         1652502969
# 5  Conveyance offline  Completed: read failure       90%     54007         1652503001
# 6  Extended offline    Completed: read failure       90%     54005         326821768
# 7  Conveyance offline  Completed: read failure       90%     53839         326821768
# 8  Extended offline    Completed: read failure       90%     53837         326819317
# 9  Conveyance offline  Completed: read failure       90%     53671         1652502969
#10  Extended offline    Completed: read failure       90%     53669         326819317
#11  Conveyance offline  Completed: read failure       90%     53503         326821768
#12  Extended offline    Completed: read failure       90%     53501         326819317
#13  Conveyance offline  Completed: read failure       90%     53337         1652502960
#14  Extended offline    Completed: read failure       90%     53335         326819317
#15  Conveyance offline  Completed: read failure       90%     53169         326821768
#16  Extended offline    Completed: read failure       90%     53167         326821768
#17  Conveyance offline  Completed: read failure       90%     53070         326819317
#18  Extended offline    Completed: read failure       90%     53068         1652502960
#19  Conveyance offline  Completed: read failure       90%     52903         1652502960
#20  Extended offline    Completed: read failure       90%     52735         326821768
#21  Conveyance offline  Completed: read failure       90%     52735         326819317
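For anyone who hasn't run these before: the entries above come from the drive's own self-tests, which you can start by hand and read back later. These are standard smartctl invocations:

Code:
sudo smartctl -t long /dev/sdd        # start an extended (long) self-test; the drive stays usable
sudo smartctl -t conveyance /dev/sdd  # shorter test, meant to catch transport/handling damage
sudo smartctl -l selftest /dev/sdd    # read the self-test log (the output shown above)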
and a couple other not-so-nice indicators:
Code:
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       6
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1
but these were OK:
Code:
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
The second 2TB drive had only one poor indication:
Code:
#16  Conveyance offline  Completed: read failure       90%     35957         5445952
#17  Extended offline    Completed: read failure       90%     35955         5445952
Clearly, it's time to retire the eldest of the drives, and maybe its twin, before an actual failure occurs. So yesterday I made full backups of all the data residing on the 2x2TB RAID1 file system. The next step will be to remove the oldest drive from the RAID by reconfiguring the RAID1 into a single-drive file system. (NOTE: All my file systems are BTRFS, so I can do ALL of this while still using the server, without powering down or rebooting. Love me some BTRFS!) Then I will remove the old drive and insert a new replacement.
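On the BTRFS side, that conversion and removal look roughly like this. A sketch only: /srv/data and /dev/sdd stand in for the actual mount point and the failing drive:

Code:
# convert data from RAID1 to single and metadata to dup, so one drive can be dropped
sudo btrfs balance start -dconvert=single -mconvert=dup /srv/data
# then detach the old drive from the (still mounted, still in use) file system
sudo btrfs device remove /dev/sdd /srv/data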
Since I will be buying a new drive, I went shopping. I prefer Western Digital drives and have had excellent results with them. The 2TB drives were purchased before WD released its "Red" drives designed for NAS and server systems. They are both "Black" drives, the performance version. They were not Enterprise-class drives, but I upgraded their firmware to the enterprise level years ago. They had a 3-year warranty, and both have outlived it in both calendar years and power-on hours.

After some research, I will be replacing the two 2TB drives with a single 10TB WD "Red Pro" drive. It has a couple of advantages over its smaller relatives that are worth the extra cost, and it carries a longer warranty.
By going to a single drive this large, I will also be moving all the data onto it, reconfiguring the 2x6TB array into stand-alone drives and removing RAID1 altogether. Instead of RAID, I will run automated backups, using the two 6TB drives as backup storage. Since I keep all my data in 15 separate BTRFS subvolumes, backups are easy to automate, and if the drive fails, a simple change to the fstab mount will return the system to service. Mounting RAID1 in degraded mode and removing the bad drive is more work than a simple re-mount. Since this is not a work-critical environment, that seems sufficient to me.
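Each per-subvolume backup can be as simple as a read-only snapshot piped through send/receive. A rough sketch, with made-up subvolume and mount-point names:

Code:
# snapshot one subvolume read-only (send requires a read-only source) ...
sudo btrfs subvolume snapshot -r /srv/data/photos /srv/data/.snapshots/photos-$(date +%F)
# ... then replicate it onto one of the 6TB backup drives
sudo btrfs send /srv/data/.snapshots/photos-$(date +%F) | sudo btrfs receive /mnt/backup1/

Loop that over the 15 subvolumes from a nightly cron job and the backups take care of themselves.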