[EdLUG] 4TB HDD showing bad blocks

Wed Feb 15 14:28:03 UTC 2023

Hi,

On Wed, 2023-02-15 at 03:31 +0000, Andrew Smith wrote:
> Guys
> My 4TB WD SATA HDD has started showing bad blocks.
> It has SMART.
> How do I configure it, probably with smartctl, to automatically map
> the bad blocks when they arise to good ones elsewhere on the drive?

These days, you don't; the drive will do it itself. There is no need
for the user to get involved, and it would not do any good even if you
tried. Tools like badblocks (from e2fsprogs) are really a relic of
earlier times when disks were much simpler.

Modern drives have a lot of internal logic to do this sort of thing
automatically. They can detect borderline-unreadable data and retry the
read until they recover it; they can then rewrite and verify the data
to make sure that the rewrite has stuck; and if the whole block is bad
and even rewrites don't work, they can permanently remap that logical
block to a new physical location by themselves.

They have much better access to the actual state of the data than the
user does, and the remapping is automatic, so no user intervention here
will actually help the situation.

What you CAN do is two-fold:

 * Keep an eye on the bad block counts to spot a drive starting to go
   into a death spiral.  It's a bit of black magic to determine that,
   and different drives behave differently, so this is of limited
   real-world use; and

 * Read the entire disk every so often.  This is the big one: by doing
   a full surface scan, the drive gets to stumble over any sectors that
   are on the threshold of becoming unreadable, and rewrite the data
   before it is lost.  In doing so it can also detect sectors which no
   longer hold newly written data reliably, and which therefore need
   to be remapped.
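
For the first point, a minimal sketch of how you might pull the
relevant counters out of "smartctl -A". The helper name
smart_raw_value is made up for illustration, /dev/sdX is a
placeholder, and the attribute names below are the ones many SATA
drives report; yours may use different ones:

```shell
# Print the raw value (10th column) of one SMART attribute, reading
# `smartctl -A` output on stdin.  smart_raw_value is a hypothetical
# helper, not part of smartmontools.
smart_raw_value() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Usage (attribute names are an assumption; check your drive's list):
#   sudo smartctl -A /dev/sdX | smart_raw_value Reallocated_Sector_Ct
#   sudo smartctl -A /dev/sdX | smart_raw_value Current_Pending_Sector
```

A value that keeps climbing between runs is the thing to look out for,
rather than any particular absolute number.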

The remapping may be automatic, but it can only work if the drive
actually accesses the bad data; forcing a surface scan is the one big
thing you can do to keep the drive healthy by enabling the drive to do
its magic.
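
For a plain disk that is not part of an MD array, one way to force
such a scan is simply to read the whole device and throw the data
away. This is only a sketch: surface_read is a made-up helper name,
/dev/sdX is a placeholder, and it assumes GNU dd:

```shell
# Read every block of the given device (or file) and discard the data.
# Nothing is written to the device; the only aim is to make the drive
# touch every sector.  surface_read is a hypothetical helper name.
surface_read() {
    dd if="$1" of=/dev/null bs=1M
}

# Usage (needs read access to the device, so typically run as root):
#   surface_read /dev/sdX
#
# Alternatively, ask the drive to run its own extended self-test:
#   sudo smartctl -t long /dev/sdX
```

Expect either approach to take many hours on a 4TB drive.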

This is common best practice, to the point that Fedora/Red Hat MD RAID
configurations are set up out-of-the-box to do an automatic full
background surface scan once a week:

/usr/lib/systemd/system/raid-check.timer

sets it to run at 1am on a Sunday by default.

You can force a surface check manually for MD devices with

   $ sudo mdadm --action=check /dev/md...

which will trigger the scan in the background; and you can then watch the
progress with 

   $ cat /proc/mdstat
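
Once the check has finished, the kernel also exposes the result in
sysfs. A small sketch, assuming the array is md0 (a placeholder) and
using a made-up helper name md_check_status:

```shell
# Print the current sync action and mismatch count for one MD array.
# $1 is the array's sysfs directory, e.g. /sys/block/md0/md.
# md_check_status is a hypothetical helper name.
md_check_status() {
    printf 'action=%s mismatches=%s\n' \
        "$(cat "$1/sync_action")" "$(cat "$1/mismatch_cnt")"
}

# Usage:
#   md_check_status /sys/block/md0/md
```

While the scan is running the action reads "check"; it goes back to
"idle" when the scan is done.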

> In addition, if I get a list of bad blocks using badblocks, how can I
> find the names of the files using these bad blocks?

Again, there's little point. If the drive finds a bad block, then it
will get remapped. If the old data is unrecoverable, then reads of an
existing file may fail with an EIO error, but *new* writes to the same
block will get remapped to a spare block on the disk, so the same
logical block will work fine once it is written again.

So any errors visible to the user are only transient, and future IOs
will use the remapped block transparently.

--Stephen