Back in July, I was greeted by this error on my Ceph dashboard while visiting
family: A Ceph error you generally don’t want to see while you’re 400 km away from your Homelab.
This error meant that during the nightly scrub, Ceph detected an error that was not trivially resolvable.
Ceph knows two kinds of scrubs: normal scrubs and deep scrubs. Scrubs run on placement groups. Normal scrubs happen very regularly and only compare object sizes and metadata between the primary and the replica copies of a PG to ensure they’re consistent. Deep scrubs, on the other hand, actually read all of the data and compare checksums of it, to make sure that no bits randomly flipped. Deep scrubs run on a weekly cadence. In my setup, scrubs run during the night, a couple of PGs per night.
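As an aside, the scrub cadence is configurable. The relevant OSD options can be inspected with the standard Ceph config commands (values are reported in seconds):
ceph config get osd osd_scrub_min_interval    # lower bound for normal scrubs, default one day
ceph config get osd osd_deep_scrub_interval   # deep scrub interval, default seven days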
This was the first time I had seen these errors, so I looked at the Ceph docs and followed the links to the section on handling scrub errors.
The first goal was to figure out what the actual error was, and on which of the six storage devices in my Ceph cluster it actually appeared. To figure that out, I used the following command:
rados list-inconsistent-obj 13.a --format=json-pretty
The command produced the following output:
{
  "epoch": 9294,
  "inconsistents": [
    {
      "object": {
        "name": "db0c7d6a-8b8b-48d5-85d8-f7f77dfac9eb.45990501.1_cache/media_attachments/files/114/881/148/093/347/525/original/8dd522a014922e6a.png",
        "nspace": "",
        "locator": "",
        "snap": "head",
        "version": 567448
      },
      "errors": [],
      "union_shard_errors": [
        "read_error"
      ],
      "selected_object_info": {
        "oid": {
          "oid": "db0c7d6a-8b8b-48d5-85d8-f7f77dfac9eb.45990501.1_cache/media_attachments/files/114/881/148/093/347/525/original/8dd522a014922e6a.png",
          "key": "",
          "snapid": -2,
          "hash": 2980556234,
          "max": 0,
          "pool": 13,
          "namespace": ""
        },
        "version": "9488'567448",
        "prior_version": "0'0",
        "last_reqid": "client.63557317.0:20953500",
        "user_version": 567448,
        "size": 2607565,
        "mtime": "2025-07-19T17:46:47.050900+0000",
        "local_mtime": "2025-07-19T17:46:47.063177+0000",
        "lost": 0,
        "flags": [
          "dirty",
          "data_digest"
        ],
        "truncate_seq": 0,
        "truncate_size": 0,
        "data_digest": "0xb75fd373",
        "omap_digest": "0xffffffff",
        "expected_object_size": 0,
        "expected_write_size": 0,
        "alloc_hint_flags": 0,
        "manifest": {
          "type": 0
        },
        "watchers": {}
      },
      "shards": [
        {
          "osd": 3,
          "primary": true,
          "errors": [],
          "size": 2607565,
          "omap_digest": "0xffffffff",
          "data_digest": "0xb75fd373"
        },
        {
          "osd": 7,
          "primary": false,
          "errors": [
            "read_error"
          ],
          "size": 2607565
        }
      ]
    }
  ]
}
The first piece of good news from this output was that it showed which object the error occurred in:
db0c7d6a-8b8b-48d5-85d8-f7f77dfac9eb.45990501.1_cache/media_attachments/files/114/881/148/093/347/525/original/8dd522a014922e6a.png
So that’s fine - that’s only an object in the media cache of my Mastodon instance.
The important piece of information was the shards object at the end:
"shards": [
{
"osd": 3,
"primary": true,
"errors": [],
"size": 2607565,
"omap_digest": "0xffffffff",
"data_digest": "0xb75fd373"
},
{
"osd": 7,
"primary": false,
"errors": [
"read_error"
],
"size": 2607565
}
]
It shows which OSD the error occurred on, guiding me to the HDD in one of my Ceph machines.
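By the way, if it isn’t obvious which host and block device sit behind a given OSD, Ceph can tell you, for example:
ceph osd find 7        # shows the host and CRUSH location of osd.7
ceph osd metadata 7    # includes the backing block device, among other details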
Logging into the machine and looking at dmesg, I was greeted with this error:
[1505504.381650] ata6.00: exception Emask 0x0 SAct 0x1780000 SErr 0x0 action 0x0
[1505504.381682] ata6.00: irq_stat 0x40000008
[1505504.381728] ata6.00: failed command: READ FPDMA QUEUED
[1505504.381738] ata6.00: cmd 60/00:98:e0:6e:c8/01:00:13:00:00/40 tag 19 ncq dma 131072 in
res 41/40:00:c8:6f:c8/00:00:13:00:00/00 Emask 0x409 (media error) <F>
[1505504.381764] ata6.00: status: { DRDY ERR }
[1505504.381772] ata6.00: error: { UNC }
[1505504.384181] ata6.00: configured for UDMA/133
[1505504.384269] sd 5:0:0:0: [sdc] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=3s
[1505504.384281] sd 5:0:0:0: [sdc] tag#19 Sense Key : Medium Error [current]
[1505504.384289] sd 5:0:0:0: [sdc] tag#19 Add. Sense: Unrecovered read error - auto reallocate failed
[1505504.384298] sd 5:0:0:0: [sdc] tag#19 CDB: Read(16) 88 00 00 00 00 00 13 c8 6e e0 00 00 01 00 00 00
[1505504.384303] I/O error, dev sdc, sector 331902920 op 0x0:(READ) flags 0x0 phys_seg 3 prio class 0
[1505504.384362] ata6: EH complete
Well, that was definitely a hardware error. Being 400 km away from the broken disk, I decided to see whether it could be fixed by reallocating the sector.
For a long time, HDDs have come with a few spare sectors on the platters, in case a sector gets damaged. The firmware would mark a sector as damaged and instead use a sector from this spare area as a replacement.
After a lot of googling, I found that this sector reallocation can normally be triggered by overwriting the damaged sector with new data. That signals to the HDD’s firmware that I don’t care about the previous data anymore, at which point it will mark the sector as bad and reallocate it. It can’t just do that on its own, because I might still have wanted to access the data in the sector and try to salvage it.
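Before overwriting anything, it’s worth confirming that the sector really is unreadable. hdparm can attempt a raw read of a single sector; on a pending bad sector, this read fails with an I/O error (the sector number is the one from the dmesg output above):
hdparm --read-sector 331902920 /dev/sdc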
A direct write to a sector is inherently dangerous. The data there will be completely overwritten, obviously.
Only do the following if you really don’t care about the data there anymore!
hdparm --yes-i-know-what-i-am-doing --write-sector 331902920 /dev/sdc
The sector number comes from the dmesg error message. Please note the --yes-i-know-what-i-am-doing flag, and make sure you really do!
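Whether the write actually cleared the problem can be checked via SMART: the Current_Pending_Sector count should drop back to zero, and Reallocated_Sector_Ct only goes up if the firmware really had to remap the sector instead of successfully rewriting it in place:
smartctl -A /dev/sdc | grep -E 'Current_Pending_Sector|Reallocated_Sector_Ct'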
This seemed to fix the error: after re-running the scrub, I was not getting any read errors anymore. An immediate deep scrub of a PG can be triggered like this:
ceph pg deep-scrub 13.a
After that, the Ceph cluster error also disappeared. Since I’ve got a replication factor of two for all of my data anyway, plus backups, I decided to wait for another error to show up before switching out the HDD. And that worked for another two months.
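Whether any inconsistencies remain can also be checked from the CLI, for example:
ceph health detail        # lists scrub errors and damaged PGs, if any
ceph pg ls inconsistent   # shows only the PGs currently flagged as inconsistent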
Switching out the disk
On September 20th, I got a very similar error, again on the same disk. I initially tried the same approach as before, overwriting the damaged sector to trigger a reallocation and then re-running the deep scrub in Ceph. But this time, the approach did not work. Instead, it produced additional read errors in neighboring sectors. I ran three scrubs and got read errors on different sectors in each of them.
At that point I decided to replace the disk. Here is the disk’s smartctl -a output:
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.8.0-79-generic] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD40EFRX-68N32N0
Serial Number: WD-WCC7K6EE326H
LU WWN Device Id: 5 0014ee 20f9d1545
Firmware Version: 82.00A82
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: In smartctl database 7.3/5528
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Sep 20 09:29:43 2025 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 113) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: (45360) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 482) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x303d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 253 164 021 Pre-fail Always - 1775
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 41
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 100 253 000 Old_age Always - 0
9 Power_On_Hours 0x0032 018 018 000 Old_age Always - 59917
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 39
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 19
193 Load_Cycle_Count 0x0032 197 197 000 Old_age Always - 10581
194 Temperature_Celsius 0x0022 119 101 000 Old_age Always - 31
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 1
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
I bought the disk in late 2018, and it has been running continuously since then.
For the disk replacement, I used the emergency HDD I keep in storage for just such an occasion. The new disk is an 8 TB Seagate IronWolf, in the 5400 RPM variant.
For the replacement, I followed these instructions.
But those don’t really work.
Let me explain. I started out by removing the disk from the Ceph cluster CRD. I don’t have automated OSD creation enabled, so I’ve got a list of storage devices in my cluster CRD, and I removed the entry for the broken HDD:
- name: "ceph3"
devices:
- name: "/dev/disk/by-id/wwn-0x50014ee20f9d1545"
config:
deviceClass: hdd
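However the manifest is managed, the change has to end up in the live CephCluster object; with a default Rook install (a CephCluster named rook-ceph in the rook-ceph namespace - adjust to your setup) it can also be edited directly:
kubectl -n rook-ceph edit cephcluster rook-ceph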
As the docs state, I started by scaling down the Rook operator:
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0
Then I scaled down the deployment of the broken OSD as well:
kubectl -n rook-ceph scale deployment rook-ceph-osd-7 --replicas=0
The next command is supposed to be run via the rook-ceph kubectl plugin:
kubectl rook-ceph rook purge-osd 7
That command is supposed to take the OSD out of the cluster and wait for rebalancing to finish before destroying the OSD. But this didn’t work; the command failed with this error:
Info: waiting for pod with label "app=rook-ceph-operator" in namespace "rook-ceph" to be running
So it looks like the command needs the operator to be running? But at the same time, the docs seem to state that I have to take down the operator first? I’m not getting it.
Anyway, I started up the operator again and then executed the command a second time. This triggered the rebalancing onto the remaining OSDs, and I went and did some other things.
Okay, that was a lie. 😅 I continuously watched the Ceph dashboard and my Ceph Grafana dashboard of course. 😁
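In concrete commands, restarting the operator is just the inverse of the earlier scale-down, and the progress watching can also be done from the CLI:
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=1
ceph -s        # overall health plus recovery/backfill progress
ceph pg stat   # one-line summary of PG states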
After about twelve hours of rebalancing, I realized that I might not have enough space on the two remaining OSDs, with the OSD overview on the Ceph dashboard looking like this:

I had forgotten to calculate whether I actually had enough space for all of the data to fit onto only two HDDs
That was when I decided to just replace the disk, instead of waiting for the rebalance to finish.
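In hindsight, that check is easy to do up front: ceph osd df lists size, used space, and utilization per OSD, and the tree variant groups the OSDs by host:
ceph osd df tree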
While doing so, I had another confirmation that I’m just utterly incompetent when it comes to doing things in the physical world:
No comment.
After installing the new disk into the server, I added it to the Ceph cluster CRD, triggering the creation of the Kubernetes deployment for the new OSD. The first thing I noticed was that the new OSD didn’t have a device class set:
ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME       STATUS  REWEIGHT  PRI-AFF
 -1         17.28386  root default
-10          4.54839      host ceph1
  6    hdd   3.63869          osd.6       up   1.00000  1.00000
  5    ssd   0.90970          osd.5       up   1.00000  1.00000
 -7          4.54839      host ceph2
  3    hdd   3.63869          osd.3       up   1.00000  1.00000
  2    ssd   0.90970          osd.2       up   1.00000  1.00000
-13          8.18709      host ceph3
  0          7.27739          osd.0       up   1.00000  1.00000
  4    ssd   0.90970          osd.4       up   1.00000  1.00000
Note how osd.0 is missing the CLASS. I’m not sure what’s going wrong here.
I did configure the class in the Ceph Cluster entry:
- name: "ceph3"
devices:
- name: "/dev/disk/by-id/wwn-0x5000c500e6f9fde3"
config:
deviceClass: hdd
I fixed the issue with this command:
ceph osd crush set-device-class hdd osd.0
Another problem was that, for some reason, the Rook operator was still trying to re-create OSD 7, the one from the broken HDD. This was the same symptom I had during my k8s migration of the Ceph cluster, and I had to do the same song and dance to clean up the removed OSD. See here.
To improve performance during the backfill to the new disk, I tweaked some Ceph configs:
ceph config set osd osd_mclock_profile high_recovery_ops
ceph config set osd osd_mclock_override_recovery_settings true
ceph config set osd osd_max_backfills 6
The first line instructs Ceph to give recovery ops a higher priority, the second allows me to override the mClock scheduler’s settings, and the third raises the maximum number of concurrently backfilling PGs per OSD to 6. The last option in particular increased recovery throughput from ~11 MB/s to ~50 MB/s.
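Once the backfill is done, those overrides can be removed again so that client traffic gets its normal priority back, using the standard ceph config commands:
ceph config rm osd osd_max_backfills
ceph config rm osd osd_mclock_override_recovery_settings
ceph config rm osd osd_mclock_profile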
And I don’t like it. I had to do the same during the aforementioned migration of the Ceph cluster to k8s. And I don’t understand why. My cluster has an average throughput below 10 MB/s. Why does it need a manual intervention from me to get Ceph to use the remaining IO for the backfills? It just doesn’t make any sense to me. I must be doing something wrong, but I have no idea what that might be. As I said in my post about the k8s migration, I will have to really dig into Ceph’s implementation at some point.
Let me also show you the timeline of the replacement, using the Ceph PG states over time:

State of the 265 PGs during the HDD switch
At around 10:15 on 2025-09-20, I launched the replacement of the OSD, still thinking I would wait for the rebalance to finish before taking out the old HDD. That triggered 63 of the 265 PGs to go into undersized+remapped state, waiting to be put onto a different OSD. That operation slowly continued for the next couple of hours, until I realized that I didn’t have enough space to store all the data on the two remaining HDDs. I then decided to switch out the HDD around 21:00 on the 20th. That required me to shut down the Ceph node, also making the PGs from the SSD in that node unavailable, leading to the spike in undersized+remapped PGs around that time. Once I booted the node up again, there were still more remapped PGs than before. That’s due to the fact that I replaced a 4 TB HDD with an 8 TB one, leading Ceph to remap additional PGs to that larger disk.
The danger zone, meaning the time with reduced data redundancy, lasted from 10:15 on the 20th to 07:12 on the 21st, when the last undersized PG was backfilled. Everything after that was just the rebalancing due to the additional space on the new HDD. I could have kept the danger time a lot shorter if I had just switched out the HDD right away, instead of waiting for a rebalance I ultimately had to forego anyway.
One last thing to mention is the change in available space in the Ceph cluster. I replaced a 4 TB HDD with an 8 TB one, so how much more space did that net me? Due to the way Ceph works and the fact that I would need some space for Ceph metadata on the device itself, I wouldn’t get an additional 4 TB.
As an example, let’s look at my S3 bucket data pool, which is entirely HDD based. With three 4 TB HDDs, I had about 1.13 TiB free in that pool at the time I removed the broken HDD. When the rebalance was done after the replacement, that same pool had 2.46 TiB free. So I gained about 1.33 TiB from a 4 TB increase in raw disk space. I don’t think that’s too surprising, considering that the pool is a 2x replica pool with a “host” failure domain. That means each piece of data has to be stored on two different hosts. And since the other two hosts still only have 4 TB HDDs, the 8 TB of the new HDD can’t be fully utilized - its replicas wouldn’t have anywhere to go.
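The per-pool free space numbers come straight from ceph df; its MAX AVAIL column is Ceph’s estimate of how much more data a pool can take, given the replication factor and the fullest relevant OSDs:
ceph df detail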
Finally, I’ve also bought a WD Red Plus 8 TB 5400 RPM HDD so I have a replacement on hand should another HDD fail. And that might happen sooner rather than later: I’ve got another 4 TB HDD of the same make and model, bought at the same time from the same shop as the failed one, so it might fail soon as well.
This whole exercise has yet again shown that I need to take a deep dive into Ceph’s performance behavior at some point.