Back in July, I was greeted by this error on my Ceph dashboard while visiting
family: A Ceph error you generally don’t want to see while you’re 400 km away from your Homelab.
This error meant that during the nightly scrub, Ceph detected an error that was not trivially resolvable.
Ceph knows two kinds of scrubs: normal scrubs and deep scrubs. Scrubs run on placement groups. Normal scrubs happen very regularly and only compare object sizes and metadata between the primary and the replica copies of a PG to ensure they’re consistent. Deep scrubs, on the other hand, actually read all of the data and compare checksums of it, to make sure that no bits randomly flipped. Deep scrubs run on a weekly cadence. In my setup, scrubs run during the night, a couple of PGs per night.
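As an aside, the scrub cadence is configurable. The relevant OSD options can be inspected with the standard Ceph config commands (values are reported in seconds):
ceph config get osd osd_scrub_min_interval    # lower bound for normal scrubs, default one day
ceph config get osd osd_deep_scrub_interval   # deep scrub interval, default seven days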
This was the first time I had seen these errors, so I looked at the Ceph docs and followed the links to the section on handling scrub errors.
The first goal was to figure out what the actual error was, and on which of the six storage devices in my Ceph cluster it actually appeared. To figure that out, I used the following command:
rados list-inconsistent-obj 13.a --format=json-pretty
The command produced the following output:
{
  "epoch": 9294,
  "inconsistents": [
    {
      "object": {
        "name": "db0c7d6a-8b8b-48d5-85d8-f7f77dfac9eb.45990501.1_cache/media_attachments/files/114/881/148/093/347/525/original/8dd522a014922e6a.png",
        "nspace": "",
        "locator": "",
        "snap": "head",
        "version": 567448
      },
      "errors": [],
      "union_shard_errors": [
        "read_error"
      ],
      "selected_object_info": {
        "oid": {
          "oid": "db0c7d6a-8b8b-48d5-85d8-f7f77dfac9eb.45990501.1_cache/media_attachments/files/114/881/148/093/347/525/original/8dd522a014922e6a.png",
          "key": "",
          "snapid": -2,
          "hash": 2980556234,
          "max": 0,
          "pool": 13,
          "namespace": ""
        },
        "version": "9488'567448",
        "prior_version": "0'0",
        "last_reqid": "client.63557317.0:20953500",
        "user_version": 567448,
        "size": 2607565,
        "mtime": "2025-07-19T17:46:47.050900+0000",
        "local_mtime": "2025-07-19T17:46:47.063177+0000",
        "lost": 0,
        "flags": [
          "dirty",
          "data_digest"
        ],
        "truncate_seq": 0,
        "truncate_size": 0,
        "data_digest": "0xb75fd373",
        "omap_digest": "0xffffffff",
        "expected_object_size": 0,
        "expected_write_size": 0,
        "alloc_hint_flags": 0,
        "manifest": {
          "type": 0
        },
        "watchers": {}
      },
      "shards": [
        {
          "osd": 3,
          "primary": true,
          "errors": [],
          "size": 2607565,
          "omap_digest": "0xffffffff",
          "data_digest": "0xb75fd373"
        },
        {
          "osd": 7,
          "primary": false,
          "errors": [
            "read_error"
          ],
          "size": 2607565
        }
      ]
    }
  ]
}
The first piece of good news from this output was that it showed which object the error occurred in:
db0c7d6a-8b8b-48d5-85d8-f7f77dfac9eb.45990501.1_cache/media_attachments/files/114/881/148/093/347/525/original/8dd522a014922e6a.png
So that’s fine - that’s only an object in the media cache of my Mastodon instance.
The important piece of information was the shards object at the end:
"shards": [
{
"osd": 3,
"primary": true,
"errors": [],
"size": 2607565,
"omap_digest": "0xffffffff",
"data_digest": "0xb75fd373"
},
{
"osd": 7,
"primary": false,
"errors": [
"read_error"
],
"size": 2607565
}
]
It shows which OSD the error occurred on, guiding me to the HDD in one of my Ceph machines.
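By the way, if it isn’t obvious which host and block device sit behind a given OSD, Ceph can tell you, for example:
ceph osd find 7        # shows the host and CRUSH location of osd.7
ceph osd metadata 7    # includes the backing block device, among other details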
Logging into the machine and looking at dmesg, I was greeted with this error:
[1505504.381650] ata6.00: exception Emask 0x0 SAct 0x1780000 SErr 0x0 action 0x0
[1505504.381682] ata6.00: irq_stat 0x40000008
[1505504.381728] ata6.00: failed command: READ FPDMA QUEUED
[1505504.381738] ata6.00: cmd 60/00:98:e0:6e:c8/01:00:13:00:00/40 tag 19 ncq dma 131072 in
res 41/40:00:c8:6f:c8/00:00:13:00:00/00 Emask 0x409 (media error) <F>
[1505504.381764] ata6.00: status: { DRDY ERR }
[1505504.381772] ata6.00: error: { UNC }
[1505504.384181] ata6.00: configured for UDMA/133
[1505504.384269] sd 5:0:0:0: [sdc] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=3s
[1505504.384281] sd 5:0:0:0: [sdc] tag#19 Sense Key : Medium Error [current]
[1505504.384289] sd 5:0:0:0: [sdc] tag#19 Add. Sense: Unrecovered read error - auto reallocate failed
[1505504.384298] sd 5:0:0:0: [sdc] tag#19 CDB: Read(16) 88 00 00 00 00 00 13 c8 6e e0 00 00 01 00 00 00
[1505504.384303] I/O error, dev sdc, sector 331902920 op 0x0:(READ) flags 0x0 phys_seg 3 prio class 0
[1505504.384362] ata6: EH complete
Well, that was definitely a hardware error. Being 400 km away from the broken disk, I decided to see whether it could be fixed by reallocating the sector.
For a long time, HDDs have come with a few spare sectors on the platters, in case a sector gets damaged. The firmware would mark a sector as damaged and instead use a sector from this spare area as a replacement.
After a lot of googling, I found that this sector reallocation can normally be triggered by overwriting the damaged sector with new data. That signals to the HDD’s firmware that I don’t care about the previous data anymore, at which point it will mark the sector as bad and reallocate it. It can’t just do that on its own, because I might still have wanted to access the data in the sector and try to salvage it.
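Before overwriting anything, it’s worth confirming that the sector really is unreadable. hdparm can attempt a raw read of a single sector; on a pending bad sector, this read fails with an I/O error (the sector number is the one from the dmesg output above):
hdparm --read-sector 331902920 /dev/sdc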
A direct write to a sector is inherently dangerous. The data there will be completely overwritten, obviously.
Only do the following if you really don’t care about the data there anymore!
hdparm --yes-i-know-what-i-am-doing --write-sector 331902920 /dev/sdc
The sector number comes from the dmesg error message. Please note the --yes-i-know-what-i-am-doing flag, and make sure you really do!
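Whether the write actually cleared the problem can be checked via SMART: the Current_Pending_Sector count should drop back to zero, and Reallocated_Sector_Ct only goes up if the firmware really had to remap the sector instead of successfully rewriting it in place:
smartctl -A /dev/sdc | grep -E 'Current_Pending_Sector|Reallocated_Sector_Ct'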
This seemed to fix the error: after re-running the scrub, I was not getting any read errors anymore. An immediate deep scrub of a PG can be triggered like this:
ceph pg deep-scrub 13.a
After that, the Ceph cluster error also disappeared. Since I’ve got a replication factor of two for all of my data anyway, plus backups, I decided to wait for another error to show up before switching out the HDD. And that worked for another two months.
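Whether any inconsistencies remain can also be checked from the CLI, for example:
ceph health detail        # lists scrub errors and damaged PGs, if any
ceph pg ls inconsistent   # shows only the PGs currently flagged as inconsistent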
Switching out the disk
On September 20th, I got a very similar error, again on the same disk. I initially tried the same approach as before, overwriting the damaged sector to trigger a reallocation and then re-running the deep scrub in Ceph. But this time, the approach did not work. Instead, it produced additional read errors in neighboring sectors. I ran three scrubs and got read errors on different sectors in each of them.
At that point I decided to replace the disk. Here is the disk’s smartctl -a output:
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.8.0-79-generic] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD40EFRX-68N32N0
Serial Number: WD-WCC7K6EE326H
LU WWN Device Id: 5 0014ee 20f9d1545
Firmware Version: 82.00A82
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: In smartctl database 7.3/5528
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Sep 20 09:29:43 2025 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 113) The previous self-test completed having
the read element of the test failed.
Total time to complete Offline
data collection: (45360) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 482) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x303d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 253 164 021 Pre-fail Always - 1775
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 41
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 100 253 000 Old_age Always - 0
9 Power_On_Hours 0x0032 018 018 000 Old_age Always - 59917
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 39
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 19
193 Load_Cycle_Count 0x0032 197 197 000 Old_age Always - 10581
194 Temperature_Celsius 0x0022 119 101 000 Old_age Always - 31
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 1
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
I bought the disk in late 2018, and it has been running continuously since then.
For the disk replacement, I used the emergency HDD I keep in storage for just such an occasion. The new disk is an 8 TB Seagate IronWolf, in the 5400 RPM variant.
For the replacement, I followed these instructions.
But those don’t really work.
Let me explain. I started out by removing the disk from the Ceph cluster CRD. I don’t have automated OSD creation enabled, so I’ve got a list of storage devices in my cluster CRD, and I removed the entry for the broken HDD:
- name: "ceph3"
devices:
- name: "/dev/disk/by-id/wwn-0x50014ee20f9d1545"
config:
deviceClass: hdd
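However the manifest is managed, the change has to end up in the live CephCluster object; with a default Rook install (a CephCluster named rook-ceph in the rook-ceph namespace - adjust to your setup) it can also be edited directly:
kubectl -n rook-ceph edit cephcluster rook-ceph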
As the docs state, I started by scaling down the Rook operator:
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0
Then I scaled down the deployment of the broken OSD as well:
kubectl -n rook-ceph scale deployment rook-ceph-osd-7 --replicas=0
The next command is supposed to be run via the rook-ceph kubectl plugin:
kubectl rook-ceph rook purge-osd 7
That command is supposed to take the OSD out of the cluster and wait for rebalancing to finish before destroying the OSD. But this didn’t work; the command failed with this error:
Info: waiting for pod with label "app=rook-ceph-operator" in namespace "rook-ceph" to be running
So it looks like the command needs the operator to be running? But at the same time, the docs seem to state that I have to take down the operator first? I’m not getting it.
Anyway, I started up the operator again and then executed the command a second time. This triggered the rebalancing onto the remaining OSDs, and I went and did some other things.
Okay, that was a lie. 😅 I continuously watched the Ceph dashboard and my Ceph Grafana dashboard of course. 😁
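In concrete commands, restarting the operator is just the inverse of the earlier scale-down, and the progress watching can also be done from the CLI:
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=1
ceph -s        # overall health plus recovery/backfill progress
ceph pg stat   # one-line summary of PG states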
After about twelve hours of rebalancing, I realized that I might not have enough space on the two remaining OSDs, with the OSD overview on the Ceph dashboard looking like this:

I had forgotten to calculate whether I actually had enough space for all of the data to fit onto only two HDDs
That was when I decided to just replace the disk, instead of waiting for the rebalance to finish.
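In hindsight, that check is easy to do up front: ceph osd df lists size, used space, and utilization per OSD, and the tree variant groups the OSDs by host:
ceph osd df tree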
While doing so, I had another confirmation that I’m just utterly incompetent when it comes to doing things in the physical world:
No comment.
After installing the new disk into the server, I added it to the Ceph cluster CRD, triggering the creation of the Kubernetes deployment for the new OSD. The first thing I noticed was that the new OSD didn’t have a device class set:
ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME       STATUS  REWEIGHT  PRI-AFF
 -1         17.28386  root default
-10          4.54839      host ceph1
  6    hdd   3.63869          osd.6       up   1.00000  1.00000
  5    ssd   0.90970          osd.5       up   1.00000  1.00000
 -7          4.54839      host ceph2
  3    hdd   3.63869          osd.3       up   1.00000  1.00000
  2    ssd   0.90970          osd.2       up   1.00000  1.00000
-13          8.18709      host ceph3
  0          7.27739          osd.0       up   1.00000  1.00000
  4    ssd   0.90970          osd.4       up   1.00000  1.00000
Note how osd.0 is missing the CLASS. I’m not sure what’s going wrong here.
I did configure the class in the Ceph Cluster entry:
- name: "ceph3"
devices:
- name: "/dev/disk/by-id/wwn-0x5000c500e6f9fde3"
config:
deviceClass: hdd
I fixed the issue with this command:
ceph osd crush set-device-class hdd osd.0
Another problem was that, for some reason, the Rook operator was still trying to re-create OSD 7, the one from the broken HDD. This was the same symptom I had during my k8s migration of the Ceph cluster, and I had to do the same song and dance to clean up the removed OSD. See here.
To improve performance during the backfill to the new disk, I tweaked some Ceph configs:
ceph config set osd osd_mclock_profile high_recovery_ops
ceph config set osd osd_mclock_override_recovery_settings true
ceph config set osd osd_max_backfills 6
The first line instructs Ceph to give recovery ops a higher priority, the second allows me to override the mClock scheduler’s settings, and the third raises the maximum number of concurrently backfilling PGs per OSD to 6. The last option in particular increased recovery throughput from ~11 MB/s to ~50 MB/s.
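Once the backfill is done, those overrides can be removed again so that client traffic gets its normal priority back, using the standard ceph config commands:
ceph config rm osd osd_max_backfills
ceph config rm osd osd_mclock_override_recovery_settings
ceph config rm osd osd_mclock_profile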
And I don’t like it. I had to do the same during the aforementioned migration of the Ceph cluster to k8s. And I don’t understand why. My cluster has an average throughput below 10 MB/s. Why does it need a manual intervention from me to get Ceph to use the remaining IO for the backfills? It just doesn’t make any sense to me. I must be doing something wrong, but I have no idea what that might be. As I said in my post about the k8s migration, I will have to really dig into Ceph’s implementation at some point.
Let me also show you the timeline of the replacement, using the Ceph PG states over time:

State of the 265 PGs during the HDD switch
At around 10:15 on 2025-09-20, I launched the replacement of the OSD, still thinking I would wait for the rebalance to finish before taking out the old HDD. That triggered 63 of the 265 PGs to go into undersized+remapped state, waiting to be put onto a different OSD. That operation slowly continued for the next couple of hours, until I realized that I didn’t have enough space to store all the data on the two remaining HDDs. I then decided to switch out the HDD around 21:00 on the 20th. That required me to shut down the Ceph node, also making the PGs from the SSD in that node unavailable, leading to the spike in undersized+remapped PGs around that time. Once I booted the node up again, there were still more remapped PGs than before. That’s due to the fact that I replaced a 4 TB HDD with an 8 TB one, leading Ceph to remap additional PGs to that larger disk.
The danger zone, meaning the time with reduced data redundancy, lasted from 10:15 on the 20th to 07:12 on the 21st, when the last undersized PG was backfilled. Everything after that was just the rebalancing due to the additional space on the new HDD. I could have kept the danger time a lot shorter if I had just switched out the HDD right away, instead of waiting for a rebalance I ultimately had to forego anyway.
One last thing to mention is the change in available space in the Ceph cluster. I replaced a 4 TB HDD with an 8 TB one, so how much more space did that net me? Due to the way Ceph works and the fact that I would need some space for Ceph metadata on the device itself, I wouldn’t get an additional 4 TB.
As an example, let’s look at my S3 bucket data pool, which is entirely HDD based. With three 4 TB HDDs, I had about 1.13 TiB free in that pool at the time I removed the broken HDD. When the rebalance was done after the replacement, that same pool had 2.46 TiB free. So I gained about 1.33 TiB from a 4 TB increase in raw disk space. I don’t think that’s too surprising, considering that the pool is a 2x replica pool with a “host” failure domain. That means each piece of data has to be stored on two different hosts. And since the other two hosts still only have 4 TB HDDs, the 8 TB of the new HDD can’t be fully utilized - its replicas wouldn’t have anywhere to go.
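The per-pool free space numbers come straight from ceph df; its MAX AVAIL column is Ceph’s estimate of how much more data a pool can take, given the replication factor and the fullest relevant OSDs:
ceph df detail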
Finally, I’ve also bought a WD Red Plus 8 TB 5400 RPM HDD so I have a replacement on hand should another HDD fail. And that might happen sooner rather than later: I’ve got another 4 TB HDD of the same make and model, bought at the same time from the same shop as the failed one, so it might fail soon as well.
This whole exercise has yet again shown that I need to take a deep dive into Ceph’s performance behavior at some point.