While working on some internal documentation of my Rook Ceph setup, I found that the placement group (PG) count on my pools was still at 1, even though I had already transferred about 350GB of data.
I have the PG autoscaler enabled by default on all pools, so that I don't have to keep an eye on the PG counts myself. But for some reason, scaling wasn't happening.
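For reference, this is how the symptom can be checked directly. rbd-fast is just one of my pools, used as an example here; the same queries work for any pool:
# pg_num was stuck at 1, even though the autoscaler was switched on
ceph osd pool get rbd-fast pg_num
ceph osd pool get rbd-fast pg_autoscale_mode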
Digging into the issue, I finally found the following log lines in the MGR logs:
pool rbd-fast won't scale due to overlapping roots: {-3, -1}
pool rbd-bulk won't scale due to overlapping roots: {-3, -1, -2}
pool homelab-fs-metadata won't scale due to overlapping roots: {-3, -1, -2}
pool homelab-fs-bulk won't scale due to overlapping roots: {-3, -1, -2}
pool rgw-bulk.rgw.control won't scale due to overlapping roots: {-3, -1, -2}
pool rgw-bulk.rgw.meta won't scale due to overlapping roots: {-3, -1, -2}
pool rgw-bulk.rgw.log won't scale due to overlapping roots: {-3, -1, -2}
pool rgw-bulk.rgw.buckets.index won't scale due to overlapping roots: {-3, -1, -2}
pool rgw-bulk.rgw.buckets.non-ec won't scale due to overlapping roots: {-3, -1, -2}
pool rgw-bulk.rgw.otp won't scale due to overlapping roots: {-3, -1, -2}
pool .rgw.root won't scale due to overlapping roots: {-3, -1, -2}
pool rgw-bulk.rgw.buckets.data won't scale due to overlapping roots: {-3, -1, -2}
pool 1 contains an overlapping root -1... skipping scaling
pool 2 contains an overlapping root -3... skipping scaling
pool 3 contains an overlapping root -2... skipping scaling
pool 4 contains an overlapping root -3... skipping scaling
pool 5 contains an overlapping root -2... skipping scaling
pool 6 contains an overlapping root -3... skipping scaling
pool 7 contains an overlapping root -3... skipping scaling
pool 8 contains an overlapping root -3... skipping scaling
pool 9 contains an overlapping root -3... skipping scaling
pool 10 contains an overlapping root -3... skipping scaling
pool 11 contains an overlapping root -3... skipping scaling
pool 12 contains an overlapping root -3... skipping scaling
pool 13 contains an overlapping root -2... skipping scaling
These lines told me that almost all of my pools were suffering from overlapping roots. Which was pretty weird to me - as far as I knew, I didn't have any.
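For completeness, this is roughly how I pulled those lines out of the MGR log. The namespace and deployment name below are the Rook defaults in my cluster and may differ in other setups:
# Fetch the MGR's log and filter for the autoscaler's complaints
# (mgr-a happens to be the active MGR here)
kubectl -n rook-ceph logs deploy/rook-ceph-mgr-a | grep "overlapping root"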
My CRUSH map looks like this:
ceph osd crush tree
ID  CLASS  WEIGHT    TYPE NAME
-1         13.64517  root default
-4          9.09679      host nakith
 0  hdd     7.27739          osd.0
 1  ssd     1.81940          osd.1
-7          4.54839      host neper
 3  hdd     3.63869          osd.3
 2  ssd     0.90970          osd.2
Compare that to the errors above - the only root I could see here was -1, the default root.
After some research, I came across the concept of shadow roots. These can be displayed by adding the --show-shadow option to the previous command:
ceph osd crush tree --show-shadow
ID  CLASS  WEIGHT    TYPE NAME
-3  ssd     2.72910  root default~ssd
-6  ssd     1.81940      host nakith~ssd
 1  ssd     1.81940          osd.1
-9  ssd     0.90970      host neper~ssd
 2  ssd     0.90970          osd.2
-2  hdd    10.91608  root default~hdd
-5  hdd     7.27739      host nakith~hdd
 0  hdd     7.27739          osd.0
-8  hdd     3.63869      host neper~hdd
 3  hdd     3.63869          osd.3
-1         13.64517  root default
-4          9.09679      host nakith
 0  hdd     7.27739          osd.0
 1  ssd     1.81940          osd.1
-7          4.54839      host neper
 3  hdd     3.63869          osd.3
 2  ssd     0.90970          osd.2
Now I saw all of the roots. But I still didn't understand where the overlapping roots were coming from. So I took a closer look at one of the mentioned pools, rbd-fast, which should be restricted to SSDs only. First, the pool info on it:
ceph osd pool get rbd-fast crush_rule
crush_rule: rbd-fast_host_ssd
Then looking closer at that rule:
ceph osd crush rule dump rbd-fast_host_ssd
{
    "rule_id": 2,
    "rule_name": "rbd-fast_host_ssd",
    "type": 1,
    "steps": [
        {
            "op": "take",
            "item": -3,
            "item_name": "default~ssd"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}
So this also looks perfectly fine. I googled and googled and read through the Ceph docs, but couldn't really find anything. I did find a few things that talked about the .mgr pool, but that was the one pool which didn't appear in the error messages from the MGR daemon above.
But it was the problem after all. Even though the autoscaler complained explicitly about all the other pools, claiming they had overlapping roots, the only pool which actually introduced those overlaps was the .mgr pool - the only one which did NOT produce an error log line!
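In hindsight, a quick way to spot this is to list the CRUSH rule of every pool and look for the odd one out. This is just a small shell loop around the commands already used above, nothing Rook- or setup-specific:
# Print each pool's CRUSH rule; only .mgr was still on replicated_rule
for pool in $(ceph osd pool ls); do
    echo -n "$pool "
    ceph osd pool get "$pool" crush_rule
done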
So what does the crush rule for this pool look like?
ceph osd pool get .mgr crush_rule
crush_rule: replicated_rule
And that replicated_rule looks like this:
ceph osd crush rule dump replicated_rule
{
    "rule_id": 0,
    "rule_name": "replicated_rule",
    "type": 1,
    "steps": [
        {
            "op": "take",
            "item": -1,
            "item_name": "default"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}
So yes, this rule takes its OSDs from the plain default root, which spans both device classes and therefore overlaps with the default~ssd and default~hdd shadow roots that all of my other rules use. Which, you know, Ceph could have been a lot clearer about. I finally realized that this was the problem thanks to this post in the Proxmox forums.
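A compact way to see which root every rule takes from is to dump all rules at once and filter for the relevant names - just grep over the JSON output:
# Show each rule together with the root its "take" step references;
# replicated_rule is the only one pointing at the class-less "default" root
ceph osd crush rule dump | grep -E '"rule_name"|"item_name"'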
So to fix this, the .mgr pool also needs a CRUSH rule which restricts it to a specific device class. I did it like this:
ceph osd crush rule create-replicated replicated-mgr default host ssd
ceph osd pool set .mgr crush_rule replicated-mgr
This creates a replicated CRUSH rule with a failure domain of host that places objects on SSDs only, and then assigns it to the .mgr pool.
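To verify the change, the pool's rule assignment and the new rule's take step can be checked with the same commands as before:
# The .mgr pool should now report the new rule...
ceph osd pool get .mgr crush_rule
# ...and the rule's "take" step should reference default~ssd instead of default
ceph osd crush rule dump replicated-mgr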
And with that, the autoscaler immediately fired up and increased the PG counts on my pools. 😒
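This can be confirmed in the autoscaler's status view, where the PG_NUM and NEW PG_NUM columns show the current and target counts per pool:
# Per-pool overview of current and target PG counts
ceph osd pool autoscale-status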