While working on some internal documentation of my Rook Ceph setup, I noticed that my pools' placement group (PG) counts were still sitting at 1, even though I had already transferred about 350GB of data.

I have the PG autoscaler enabled by default on all pools, so I don't have to keep an eye on the PG counts myself. But for some reason, no scaling was happening.
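
As a quick aside, the autoscaler mode can be checked (and set) per pool from the toolbox; rbd-fast here is just one of my pools, substitute any pool name:

ceph osd pool get rbd-fast pg_autoscale_mode
ceph osd pool set rbd-fast pg_autoscale_mode on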

Digging into the issue, I finally found the following log lines in the MGR logs:

pool rbd-fast won't scale due to overlapping roots: {-3, -1}
pool rbd-bulk won't scale due to overlapping roots: {-3, -1, -2}
pool homelab-fs-metadata won't scale due to overlapping roots: {-3, -1, -2}
pool homelab-fs-bulk won't scale due to overlapping roots: {-3, -1, -2}
pool rgw-bulk.rgw.control won't scale due to overlapping roots: {-3, -1, -2}
pool rgw-bulk.rgw.meta won't scale due to overlapping roots: {-3, -1, -2}
pool rgw-bulk.rgw.log won't scale due to overlapping roots: {-3, -1, -2}
pool rgw-bulk.rgw.buckets.index won't scale due to overlapping roots: {-3, -1, -2}
pool rgw-bulk.rgw.buckets.non-ec won't scale due to overlapping roots: {-3, -1, -2}
pool rgw-bulk.rgw.otp won't scale due to overlapping roots: {-3, -1, -2}
pool .rgw.root won't scale due to overlapping roots: {-3, -1, -2}
pool rgw-bulk.rgw.buckets.data won't scale due to overlapping roots: {-3, -1, -2}
pool 1 contains an overlapping root -1... skipping scaling
pool 2 contains an overlapping root -3... skipping scaling
pool 3 contains an overlapping root -2... skipping scaling
pool 4 contains an overlapping root -3... skipping scaling
pool 5 contains an overlapping root -2... skipping scaling
pool 6 contains an overlapping root -3... skipping scaling
pool 7 contains an overlapping root -3... skipping scaling
pool 8 contains an overlapping root -3... skipping scaling
pool 9 contains an overlapping root -3... skipping scaling
pool 10 contains an overlapping root -3... skipping scaling
pool 11 contains an overlapping root -3... skipping scaling
pool 12 contains an overlapping root -3... skipping scaling
pool 13 contains an overlapping root -2... skipping scaling

These lines told me that almost all of my pools were suffering from overlapping roots. That was pretty weird to me - I was pretty sure I didn't have any overlapping roots at all.
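
For anyone following along in a Rook cluster: the MGR log can be pulled with kubectl (assuming the default rook-ceph namespace and the standard app=rook-ceph-mgr label), and the numeric pool IDs in the second half of the output can be translated to pool names with ceph osd lspools:

kubectl -n rook-ceph logs -l app=rook-ceph-mgr --tail=500
ceph osd lspools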

My CRUSH map looks like this:

ceph osd crush tree
ID  CLASS  WEIGHT    TYPE NAME
-1         13.64517  root default   
-4          9.09679      host nakith
 0    hdd   7.27739          osd.0  
 1    ssd   1.81940          osd.1  
-7          4.54839      host neper 
 3    hdd   3.63869          osd.3  
 2    ssd   0.90970          osd.2  

Compare that to the errors above - the only root I could see here was -1, the default root.

After some research, I came across the concept of shadow roots: for every device class, Ceph maintains a shadow copy of the CRUSH hierarchy containing only the OSDs of that class. These shadow trees can be displayed by adding the --show-shadow option to the previous command:

ceph osd crush tree --show-shadow
ID  CLASS  WEIGHT    TYPE NAME          
-3    ssd   2.72910  root default~ssd   
-6    ssd   1.81940      host nakith~ssd
 1    ssd   1.81940          osd.1      
-9    ssd   0.90970      host neper~ssd 
 2    ssd   0.90970          osd.2      
-2    hdd  10.91608  root default~hdd   
-5    hdd   7.27739      host nakith~hdd
 0    hdd   7.27739          osd.0      
-8    hdd   3.63869      host neper~hdd 
 3    hdd   3.63869          osd.3      
-1         13.64517  root default       
-4          9.09679      host nakith    
 0    hdd   7.27739          osd.0      
 1    ssd   1.81940          osd.1      
-7          4.54839      host neper     
 3    hdd   3.63869          osd.3      
 2    ssd   0.90970          osd.2      
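
The ~ssd and ~hdd suffixes correspond to the device classes Ceph assigned to my OSDs, which can be listed with:

ceph osd crush class ls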

Now I could see all of the roots. But I still didn't understand where the overlapping roots were coming from. So I took a closer look at one of the pools mentioned in the logs, rbd-fast, which should be restricted to SSDs only. First, the CRUSH rule assigned to the pool:

ceph osd pool get rbd-fast crush_rule
crush_rule: rbd-fast_host_ssd

Then looking closer at that rule:

ceph osd crush rule dump rbd-fast_host_ssd
{
    "rule_id": 2,
    "rule_name": "rbd-fast_host_ssd",
    "type": 1,
    "steps": [
        {
            "op": "take",
            "item": -3,
            "item_name": "default~ssd"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}
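
In case you want to audit the other rules the same way: the full list of rule names can be printed with the command below, and each one dumped individually as above.

ceph osd crush rule ls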

So this also looks perfectly fine: the rule only ever takes from the default~ssd shadow root (-3). I googled and googled and read through the Ceph docs, but couldn't really find anything. I did find a few hits talking about the .mgr pool, but that was the one pool which didn't appear in the error messages from the MGR daemon above.

But that pool turned out to be the problem anyway. Even though the autoscaler complained explicitly about all of the other pools, claiming they had overlapping roots, the pool actually responsible for the overlap was the .mgr pool - the only one which did NOT produce an error log line!

So what does the crush rule for this pool look like?

ceph osd pool get .mgr crush_rule
crush_rule: replicated_rule

And that replicated_rule looks like this:

ceph osd crush rule dump replicated_rule
{
    "rule_id": 0,
    "rule_name": "replicated_rule",
    "type": 1,
    "steps": [
        {
            "op": "take",
            "item": -1,
            "item_name": "default"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

So there it is: this rule takes from the plain default root (-1), which contains the very same OSDs as the default~ssd (-3) and default~hdd (-2) shadow roots that all of my other pools use - and that is exactly the overlap the autoscaler was complaining about. Which, you know, Ceph could have been a lot clearer about. I finally realized that this was the problem thanks to this post in the Proxmox forums.
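
To double-check which rule every pool is using (and make sure no other pool is still sitting on the plain default root), a quick loop over all pools does the trick - just a sketch to run from the toolbox:

for pool in $(ceph osd pool ls); do
    echo -n "${pool}: "
    ceph osd pool get "${pool}" crush_rule
done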

So to fix this, the .mgr pool also needs a CRUSH rule that pins it to a specific device class. I did it like this:

ceph osd crush rule create-replicated replicated-mgr default host ssd
ceph osd pool set .mgr crush_rule replicated-mgr

This creates a replicated CRUSH rule with a failure domain of host, restricted to the ssd device class, and then assigns it to the .mgr pool.
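
To verify the change, the pool's rule assignment and the autoscaler's view of all pools can be checked again; the second command shows current vs. target PG counts per pool:

ceph osd pool get .mgr crush_rule
ceph osd pool autoscale-status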

And with that, the autoscaler immediately fired up and increased the PG counts on my pools. 😒