Wherein I migrate the last remaining data off of my baremetal Ceph cluster and shut it down.

This is part 24 of my k8s migration series.

I set up my baremetal Ceph cluster back in March of 2021, driven by how much I liked the idea of large pools of disk I could use to provide S3 storage, block devices and a POSIX-compatible filesystem. Since then, it has served me rather well, and I’ve been using it to provide S3 buckets and volumes for my Nomad cluster. Given how happy I was with it, I also wanted to continue using Ceph for my Kubernetes cluster.

To this end, I was quite happy to discover the Rook Ceph project, which at its core implements a Kubernetes operator capable of orchestrating an entire Ceph cluster. I’ve described my setup in far more detail in this blog post.

In the original baremetal cluster, I had three nodes, each with one HDD and one SSD for storage, running all Ceph daemons besides the MONs, which ran on my cluster controller Pi 4s. I ran the cluster with replicated pools at a size of two and a min_size of one, so that I could reboot a node, e.g. during maintenance, without all writes to the cluster having to stop. I was lucky in that all of my data comfortably fit on just two hosts with a 1 TB SSD and a 4 TB HDD each. So when the time to start the migration came, I took my emergency replacement HDD and SSD and put them into my old Homeserver. A VM running on that server became the first OSD node in the k8s cluster. I also drained the OSDs and other daemons from one of the original baremetal nodes and moved that node into the k8s cluster as well. So I still ended up with 2x replication, just with two clusters of two storage nodes each.

After I was finally done migrating all of my services from Nomad to Kubernetes, I still had the following datasets on the baremetal Ceph cluster:

  1. An NFS Ganesha cluster serving the boot partitions of all of my netbooting hosts
  2. A data dump CephFS volume that contained just some random data, like old slides and digital notes from my University days
  3. The root disks of all of my netbooting nodes, in the form of 50 GB RBDs

In the rest of this post, I will go over how I migrated all three of those, shut down the old baremetal cluster and migrated its two physical nodes into the Rook Ceph cluster.

Root disk migration

The first step was migrating the root disks of my netbooting hosts. Those hosts are eight Raspberry Pi CM4s and an x86 SBC, all without any local storage; each of them uses a 50 GB RBD as its root disk. Those RBDs needed to be migrated over to the new Rook Ceph cluster, and the hosts’ configuration changed to contact the Rook MON daemons. If you’re interested in the details of my netboot setup, have a look at this series of posts.

As these RBDs were block devices, I was initially at a bit of a loss when thinking about how to migrate them. Sure, those nine netbooters were as cattle-ish as it gets, so I could have just completely recreated them - but setting up fresh hosts is the weakest part of my Homelab, and it would have taken me a couple of evenings.

Luckily, Reddit to the rescue. It turns out that the rbd tool can both import and export RBD images, including via stdin/stdout.

I did the migration node by node, and because at this point all of the nodes were in the k8s cluster, I had to start with draining them:

kubectl drain --delete-emptydir-data=true --force=true --ignore-daemonsets=true examplehost

Then the node also needs to be shut down, because migrating the disk from one Ceph cluster to another really isn’t going to work online. Once the host was safely shut down, I could do the actual copy operation:

rbd --id admin export --no-progress hostdisks/examplehost - | rbd -c ceph-rook.conf -k client.admin.key --id admin import - hostdisks/examplehost

The first rbd invocation does not receive an explicit Ceph config file, so it uses the default /etc/ceph/ceph.conf, which at this point was still the config for the baremetal cluster. The second invocation gets the Rook cluster’s config and key, and hostdisks is the pool the image lives in on both the source and the destination cluster. One issue worth noting here is that the rbd tool as provided by the Rook kubectl plugin did not work as the receiving command; I was immediately getting broken pipe errors. Probably something to do with how it is implemented as a kubectl plugin.
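
Since a silently corrupted root disk would have been a nasty surprise, here is a hedged sketch of how the copy could be verified, by exporting the image from both clusters again and comparing checksums. It re-reads the full 50 GB twice, and the config/keyring paths are the same ones as in the command above:

# Hash the image on the old cluster, which uses the default /etc/ceph/ceph.conf
rbd --id admin export --no-progress hostdisks/examplehost - | sha256sum
# Hash the freshly imported image on the Rook cluster and compare the two digests
rbd -c ceph-rook.conf -k client.admin.key --id admin export --no-progress hostdisks/examplehost - | sha256sum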

With the copy done, which took about ten minutes per disk, I then had to adapt the configuration of the MON IPs in the host’s kernel command line. For one of my Pis, it looks something like this:

console=serial0,115200 dwc_otg.lpm_enable=0 console=tty1 root=LABEL=writable rootfstype=ext4 rootwait fixrtc  boot=rbd rbdroot=300.300.300.310,300.300.300.311,300.300.300.312:cephuser:pw:hostdisks:examplehost::_netdev,noatime hllogserver=logs.internal:12345

The list of three IPs after rbdroot= contains the MONs to use, so it had to be switched to the Rook cluster’s MONs. I also had to change the Ceph client key in the pw field to the one from the Rook cluster.
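
For completeness, a hedged sketch of where the new values can come from, assuming the Rook MONs are reachable from outside the k8s cluster the way mine are; client.cephuser is just the placeholder name from the example above:

# The MON addresses Ceph itself advertises
kubectl rook-ceph ceph mon dump
# The key for the client configured in the rbdroot parameter
kubectl rook-ceph ceph auth get-key client.cephuser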

And then I could reboot the host. And what can I say, all nine hosts went through without a single issue. I had expected at least some sort of problem, but it seems I was pretty well prepared.

Before going to the next migration, let’s have a look at some Ceph metrics for this copy operation. First the throughput:

A screenshot of a Grafana time series plot. It shows roughly three hours worth of data, time on the X axis and throughput in MB/s on the Y axis. During the quiet periods, the throughput is around 1-2 MB/s. But there are eight 'hills' in the plot, each about ten minutes long, which show a throughput between 30 and 50 MB/s.

Throughput graph for the receiving Ceph cluster for eight of the disk migrations.

Interesting things here are the duration of approximately ten minutes for each of the disk migrations and the fact that the maximum throughput reached is around 50 MB/s. It’s worth noting that, in contrast to a previous copy operation, the target disks were SSDs this time around. So 50 MB/s sounds a bit too little, doesn’t it? Well, yes and no. 🙂 This time I repeated a little mistake I had discussed previously: I ran the copy operation on my C&C host. And that means that the data needs to go through my router, because the Ceph cluster and the C&C host live on different VLANs and subnets.

That might already be part of the explanation, as the throughput on my router shows:

A screenshot of a Grafana time series plot. It shows the network utilization in Mb/s for the network interface both the Ceph hosts and the C&C host hang off of. Like the previous throughput plot, it shows eight load phases. In each of them, about 700 - 800 Mb/s come in and go out again.

Network graphs for the NIC both the Ceph hosts and the C&C host hang off of.

So while the 1 GbE interface doesn’t look saturated, there might be some other kind of bottleneck. Perhaps this is actually the maximum the router can do for this particular routing scenario? Then again, its CPU really should be capable of routing 1 Gbps.
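
Something for the to-do list: measuring the raw routing throughput between the two subnets, to see whether the router or the copy pipeline is the limit. A hedged sketch with iperf3, hostnames being placeholders:

# On one of the Ceph hosts
iperf3 -s
# On the C&C host, pushing traffic across the router for 30 seconds
iperf3 -c cephhost1.internal -t 30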

Next, let’s have a quick look at the IO utilization on the two receiving hosts:

A screenshot of a Grafana time series plot. It shows the IO utilization in percent of two hosts, each represented by its own plot. Again, the eight copy operations are clearly visible as hills in the plots. One host goes up to 20%, while the other goes up to almost 60%. Neither of them comes close to 100% IO utilization.

IO utilization of the two receiving hosts.

So clearly, this time around the IO utilization is not the problem. Neither is the CPU:

A screenshot of a Grafana time series plot. The two hosts shown have an idle CPU percentage of 95% and 90% respectively. The eight copy operations are again clearly visible as hills in the plots. For one of the hosts, the idle percentage doesn't move much, only from 95% to 90%. For the other host, the impact is more visible, moving the idle percentage down to 65-70%.

CPU idle percentage of the two receiving hosts.

Both hosts still have a lot of headroom here. But I did find this as well:

A screenshot of a Grafana time series plot. This time only one host's CPU idle percentage is shown. It is 97% idle at rest, but deep troughs down to around 55% idle are visible for the eight copy operations.

CPU idle percentage of the C&C host doing the rbd import/export.

This might be the explanation for why I’m reaching no more than 50 MB/s throughput even though this is a copy from SSD to SSD. The C&C host is a pretty weak one: it has an AMD Embedded G-Series GX-412TC CPU, very low-powered. Normally that’s more than enough, as it doesn’t need to do anything compute-heavy. But this might be too much for it. I’m not familiar with the rbd import/export implementation, but looking at the plot, I could theorize: this looks like two of the four cores being fully pegged, possibly one by the rbd export and one by the rbd import. And the roughly 50 MB/s is simply all it can really do?

I think I need to dig deeper into this at some point, running some proper testing of what I can really do when it comes to reads and writes in Ceph.
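
When I get around to that, rados bench is probably the first tool I’ll reach for. A hedged sketch against a throwaway pool (deleting pools requires mon_allow_pool_delete to be enabled):

# Scratch pool with 32 PGs
ceph osd pool create benchpool 32
# 30 seconds of writes with the default 16 concurrent 4 MB objects, keeping the objects for the read test
rados bench -p benchpool 30 write --no-cleanup
# 30 seconds of sequential reads against those objects
rados bench -p benchpool 30 seq
# Remove the benchmark objects and the pool
rados -p benchpool cleanup
ceph osd pool delete benchpool benchpool --yes-i-really-really-mean-it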

That’s it for the disk copying. Let’s move on to the second element of my netboot setup, the boot partitions sitting on NFS.

NFS setup

For the boot partitions, I needed to come up with something special, because those need to be “shared” between the host they belong to and my cluster master, which runs a TFTP server. That’s because to mount the RBD root disks, I need a kernel running, and that kernel needs to come from somewhere. Plus, the hosts should all be able to independently run updates, or even different operating systems. So I couldn’t just share one boot partition between all of them.

For this, again, I’m using my Ceph cluster and the integrated support for NFS Ganesha.

I configured the cluster with the Rook NFS CRD, looking like this:

apiVersion: ceph.rook.io/v1
kind: CephNFS
metadata:
  name: hl-nfs
spec:
  # Settings for the NFS server
  server:
    active: 1
    placement:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: "homelab/role"
                  operator: In
                  values:
                    - "ceph"
      tolerations:
        - key: "homelab/taint.role"
          operator: "Equal"
          value: "ceph"
          effect: "NoSchedule"
    resources:
      limits:
        memory: "1Gi"
      requests:
        cpu: "250m"
        memory: "1Gi"
    priorityClassName: "system-cluster-critical"
    logLevel: NIV_INFO

This creates a single NFS pod in the cluster. If I read the docs right, NFS doesn’t do HA very well, so there’s not much use in having more than one. One of the things Ceph does when an NFS cluster is set up is to create the .nfs pool as a location for some metadata. This in turn caused the Ceph PG autoscaler to stop working, with this warning:

debug 2025-03-10T13:54:45.313+0000 7fa1ad391640  0 [pg_autoscaler WARNING root] pool 6 contains an overlapping root -3... skipping scaling

I’ve written about the last time I encountered this issue here, so suffice it to say that the root cause is the new pool being created with a generic CRUSH rule whose root overlaps with the more specific, device-class rules my other pools use. It’s fixed by applying a more specific rule to the pool, as sketched below.
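
A hedged sketch of that fix, assuming a device-class specific replicated rule is what’s wanted; the rule name is made up:

# Replicated rule restricted to the ssd device class, with host as the failure domain
ceph osd crush rule create-replicated replicated-ssd default host ssd
# Point the .nfs pool at that rule so the autoscaler no longer sees overlapping roots
ceph osd pool set .nfs crush_rule replicated-ssd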

Because I wanted to use the NFS cluster from outside k8s as well, I also introduced this Service:

apiVersion: v1
kind: Service
metadata:
  name: nfs-rook-external
  labels:
    homelab/public-service: "true"
  annotations:
    external-dns.alpha.kubernetes.io/hostname: nfs.example.com
    io.cilium/lb-ipam-ips: "300.300.300.102"
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: rook-ceph-nfs
    ceph_nfs: hl-nfs
    instance: a
  ports:
    - name: nfs
      port: 2049

This gives me a stable address to put into the /etc/fstab of my hosts instead of tying them to a specific node, and NFS is perfectly happy to use DNS to get the IP of the NFS server.

Next is the NFS share itself. These shares can be backed by either a CephFS subvolume or an S3 bucket. But the S3 bucket backend has severe restrictions. I tried it once, and found that e.g. Git repos on such an NFS share don’t work, with Git commands returning Not Implemented errors. So I created a CephFS subvolume:

ceph fs subvolume create my-cephfs my-share

Then comes the creation of the NFS share:

ceph nfs export create cephfs --cluster-id hl-nfs --pseudo-path /my-share-path --fsname my-cephfs --path /volumes/_nogroup/my-share/UUID-HERE --client_addr 300.300.300.0/24 --client_addr 300.300.315.0/24

The --path parameter can be fetched via this command:

ceph fs subvolume getpath my-cephfs my-share

One thing I’m a bit sad about is that I had to use the command line to create those two objects, the subvolume and the NFS share, instead of being able to use CRDs in the k8s cluster.

The resulting NFS share definition, as fetched with ceph nfs export ls hl-nfs --detailed, looks like this:

[
  {
    "access_type": "none",
    "clients": [
      {
        "access_type": "rw",
        "addresses": [
          "300.300.300.0/24",
          "300.300.315.0/24"
        ],
        "squash": "None"
      }
    ],
    "cluster_id": "hl-nfs",
    "export_id": 1,
    "fsal": {
      "fs_name": "my-cephfs",
      "name": "CEPH",
      "user_id": "nfs.hl-nfs.1"
    },
    "path": "/volumes/_nogroup/my-share/UUID-HERE",
    "protocols": [
      4
    ],
    "pseudo": "/my-share-path",
    "security_label": true,
    "squash": "None",
    "transports": [
      "TCP"
    ]
  }
]

The end effect of all of this is an NFS share which can be mounted like this:

nfs.example.com:/my-share-path /mnt/example nfs defaults,timeo=900,_netdev 0 0

One small note on the migration: Ansible’s mount module does not seem to automatically remount when a mount is changed. That is likely a good idea, but it meant that I had to execute these commands on all of my netbooters:

ansible "host1:host2:host3" -a "umount /boot/firmware"
ansible "host1:host2:host3" -a "mount /boot/firmware"

After that, they all had the right NFS share mounted and I was one step closer to shutting down the baremetal cluster.

Copying my warehouse volume over

As I’ve mentioned above, I’ve got a “random bunch of stuff” CephFS subvolume that is mounted on my desktop. It really contains exactly that: A random assortment of data. Copies of old University slides and projects, backups for my OpenWRT WiFi router’s config and some old database dumps from services I’m no longer running. Overall, it’s about 129 GB, so not too much data, in contrast to my Linux ISO collection for example.

Here’s the rsync command and its output:

rsync -av --info=progress2 --info=name0 /mnt/temp1/* /mnt/temp2/
sending incremental file list
129,648,120,620  99%   41.86MB/s    0:49:14 (xfr#1415, to-chk=0/1537)

sent 129,679,927,316 bytes  received 27,564 bytes  43,892,352.30 bytes/sec
total size is 129,654,775,448  speedup is 1.00

Absolutely nothing interesting happened here; it took only about 49 minutes. If you’re interested in some metrics about a 1.7 TB copy operation from one CephFS subvolume on one cluster to another subvolume on another cluster, have a look at this recent post.

Final takedown of the baremetal cluster

So that’s it. With the warehouse volume transferred, there was, supposedly, nothing important on that cluster anymore.

But I wasn’t about to trust that. Instead, I ran ceph df to confirm and found that there was exactly 348 MB of data left. Deciding that that couldn’t be anything important, I ran the cluster purge by executing this command on all the remaining cluster hosts, meaning the two OSD nodes and the three cluster controllers hosting the MONs:

cephadm rm-cluster --force --zap-osds --fsid a84c7196-7ebf-11eb-b290-18c04d00217f

And just like that, the baremetal Ceph cluster was gone. It lived for almost exactly four years, having been created on 2021-03-06, at 21:05.

Adding the two baremetal hosts to the Rook cluster

After the old cluster had been removed, I needed to add the two OSD hosts to the Rook cluster. I did so by first adding them to the k8s cluster and then updating the Rook cluster’s Helm values:

      - name: "host1"
        devices:
          - name: "/dev/disk/by-id/wwn-0x5002538e90b5e22f"
            config:
              deviceClass: ssd
          - name: "/dev/disk/by-id/wwn-0x50014ee2ba48465d"
            config:
              deviceClass: hdd
      - name: "host2"
        devices:
          - name: "/dev/disk/by-id/wwn-0x5002538e90b68866"
            config:
              deviceClass: ssd
          - name: "/dev/disk/by-id/wwn-0x50014ee20f9d1545"
            config:
              deviceClass: hdd

Both hosts have one 1 TB SATA SSD and one 4 TB HDD. To be absolutely safe, I’m using the disks’ WWNs to identify them.
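
A hedged sketch of how the WWNs can be matched to the physical disks before putting them into the values file:

# Block devices with their WWN, size and model
lsblk -o NAME,WWN,SIZE,MODEL
# The by-id symlinks used above, pointing back at the kernel device names
ls -l /dev/disk/by-id/ | grep wwn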

After that, the rebalancing started:

A screenshot of a Grafana time series plot. It shows the state of the 265 Placement Groups in the cluster. At approximately 23:18, 130 PGs go into the remapped state. Initially, the count of remapped PGs goes down relatively quickly, reaching 67 remapped PGs around 01:02. But after that, the number of remapped PGs goes down only slowly, reaching zero around 19:44.

PG state during the rebalancing after adding the two additional hosts with four additional OSDs.

The initial, relatively rapid reduction in remapped PGs was probably the PGs on the SSD OSDs, and the rest were those on the HDD OSDs.

I would love to show you the overall throughput of the backfill operations, but it looks like there are no metrics for those. The ceph_osd_op_r_out_bytes and ceph_osd_op_w_in_bytes metrics I’m using for the general cluster throughput seem to only count actual client operations; that throughput definitely did not show the backfill load on the OSDs.

So let’s instead have a look at the throughput of the eight disks in the cluster:

A screenshot of a Grafana time series plot. At the beginning, it hovers somewhere around 10 MB/s, until it goes up to 30 MB/s around 23:19. The next jump comes at 23:49, to 70 MB/s. It goes down a bit again at 01:00, to around 40 MB/s. After that, the plot hovers anywhere between 30 MB/s and 50 MB/s until about 09:20, where it goes up to 70 MB/s for another 45 minutes or so, before coming down to 30 - 40 MB/s around 10:00 and staying there until 12:00. Then it goes up yet again to 50 MB/s. After 14:00, the plot slowly goes down towards the initial 10 MB/s range, which it reaches around 19:40.

Accumulated bytes written per second on all disks, HDD and SSD, in the Rook cluster during the rebalancing.

I just created that graph by adding up the written bytes per second from the node exporter data I’m gathering, specifically for the eight disks which are part of the Rook cluster at this point.
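
For the curious, a hedged sketch of how such a sum can be queried straight from Prometheus; the endpoint, instance and device label values are made up for illustration:

curl -s 'http://prometheus.internal:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(node_disk_written_bytes_total{instance=~"cephhost.*", device=~"sd[ab]"}[5m]))'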

The graph has a couple of points worth discussing. The first one to note is that there was not much client load on the cluster overall; it hovered around 1 - 2 MB/s, typical for my Homelab. And still, the rebalancing only used, at maximum, 70 MB/s worth of writes. And remember, these are not just HDDs, but also SSDs. I’m pretty sure that this is entirely due to Ceph itself. At the beginning of the plot, around 23:49, you can see a jump from around 30 MB/s to 70 MB/s. That happened after I entered the following two commands:

ceph config set osd osd_mclock_override_recovery_settings true
ceph config set osd osd_max_backfills 2

The first allows overriding the recovery settings managed by the mclock scheduler, and the second raises the number of concurrent backfills per OSD to two. Then you can also see another jump at 09:20 the next morning, where the throughput suddenly goes from around 40 MB/s to 70 MB/s again, at least for a short while. That was after I entered this command:

ceph config set osd osd_mclock_profile high_recovery_ops

I will refrain from any expletives at this point, because I don’t understand this well enough to judge whether I’m the problem here, or whether Ceph really works this way.

So, a little while ago, Ceph introduced a new IO scheduler, mclock. The config settings I showed above impact how that scheduler works.

Why do I have to change these settings at all? Why, with barely 1 - 2 MB/s of client traffic, do I actually have to tell Ceph to run more than one backfill per OSD? And why doesn’t Ceph use the OSDs’ full throughput for even that one default backfill? Because that graph above, that’s not a single disk. That’s the sum of the throughput on all of them. This kind of write throughput would be pathetic for a single HDD, and I really don’t understand why my mixed HDD/SSD cluster shows it. What does the scheduler actually do here? Don’t get me wrong - there is likely a good reason, but I don’t understand it. Why not use an OSD’s full write capacity for backfills when there is nearly no other traffic happening?

I was really stumped when I saw these numbers. And also, why even have a scheduler when I still need to manually set the maximum number of backfills allowed?

Anyway, there’s now a “Learn Ceph” task in my backlog. When the migration is done, I will not put my old home server back into storage. Instead, I will buy a couple more disks and use it as a Ceph playground. And if I have to read the entire Ceph source code from int main() to the end, I will. Because I’m now intensely curious as to why the backfill was so darned slow.

And now, let’s come to the “Michael utterly embarrasses himself” part of this post.

Arrogance

After the addition of the new hosts was done, I could shut down the Ceph VM running on my extension host, as it was no longer required.

And I learned that I have an unhealthy amount of arrogance. I went into this thinking “Well, I sure know how to remove a host from a Ceph cluster, I don’t need any docs!”.

Narrator: He did need docs.

So let’s start with what I should have done. I should have followed these Rook docs. They describe, in very nice detail, what to do to remove OSDs from a Rook Ceph cluster.

But no. That was of course not what I did. What I did instead was just wing it. I started by removing the host from the values.yaml file. That had only one effect, namely messages like this showing up in the logs of the Rook operator:

2025-03-15 19:20:15.090341 W | op-osd: not updating OSD 0 on node "oldhost". node no longer exists in the storage spec. if the user wishes to remove OSDs from the node, they must do so manually. Rook will not remove OSDs from nodes that are removed from the storage spec in order to prevent accidental data loss

After that slightly embarrassing failure, I deigned to actually skim the doc I linked to above, and found this command:

$ kubectl rook-ceph rook purge-osd 0,1 --force
Info: Running purge osd command
2025/03/15 19:48:59 maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined
2025-03-15 19:48:59.731856 W | cephcmd: loaded admin secret from env var ROOK_CEPH_SECRET instead of from file
2025-03-15 19:48:59.731894 I | rookcmd: starting Rook v1.16.5 with arguments 'rook ceph osd remove --osd-ids=0,1 --force-osd-removal=true'
2025-03-15 19:48:59.731897 I | rookcmd: flag values: --force-osd-removal=true, --help=false, --log-level=INFO, --osd-ids=0,1, --preserve-pvc=false
2025-03-15 19:48:59.737462 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2025-03-15 19:48:59.737529 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2025-03-15 19:48:59.921560 I | cephosd: validating status of osd.0
2025-03-15 19:48:59.921571 I | cephosd: osd.0 is healthy. It cannot be removed unless it is 'down'
2025-03-15 19:48:59.921573 I | cephosd: validating status of osd.1
2025-03-15 19:48:59.921575 I | cephosd: osd.1 is healthy. It cannot be removed unless it is 'down'

So that wasn’t exactly a success either. So what did I do? Did I now properly read the entire page? No, of course not. I instead decided that the right way to do this was to scale down the two OSDs I wanted to remove:

kubectl -n rook-cluster scale deployment rook-ceph-osd-0 --replicas=0
kubectl -n rook-cluster scale deployment rook-ceph-osd-1 --replicas=0

And then I repeated the previous command:

$ kubectl rook-ceph rook purge-osd 0,1 --force
Info: Running purge osd command
2025/03/15 19:50:33 maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined
2025-03-15 19:50:33.244662 W | cephcmd: loaded admin secret from env var ROOK_CEPH_SECRET instead of from file
2025-03-15 19:50:33.244700 I | rookcmd: starting Rook v1.16.5 with arguments 'rook ceph osd remove --osd-ids=0,1 --force-osd-removal=true'
2025-03-15 19:50:33.244704 I | rookcmd: flag values: --force-osd-removal=true, --help=false, --log-level=INFO, --osd-ids=0,1, --preserve-pvc=false
2025-03-15 19:50:33.250479 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2025-03-15 19:50:33.250539 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2025-03-15 19:50:33.432040 I | cephosd: validating status of osd.0
2025-03-15 19:50:33.432049 I | cephosd: osd.0 is marked 'DOWN'
2025-03-15 19:50:33.622957 I | cephosd: marking osd.0 out
2025-03-15 19:50:34.825971 I | cephosd: osd.0 is NOT ok to destroy but force removal is enabled so proceeding with removal
2025-03-15 19:50:34.828262 E | cephosd: failed to fetch the deployment "rook-ceph-osd-0". deployments.apps "rook-ceph-osd-0" not found
2025-03-15 19:50:34.828271 I | cephosd: purging osd.0
2025-03-15 19:50:35.055813 I | cephosd: attempting to remove host "oldhost" from crush map if not in use
2025-03-15 19:50:35.237143 I | cephosd: failed to remove CRUSH host "oldhost". exit status 39
2025-03-15 19:50:35.427664 I | cephosd: no ceph crash to silence
2025-03-15 19:50:35.427677 I | cephosd: completed removal of OSD 0
2025-03-15 19:50:35.427680 I | cephosd: validating status of osd.1
2025-03-15 19:50:35.427683 I | cephosd: osd.1 is marked 'DOWN'
2025-03-15 19:50:35.608670 I | cephosd: marking osd.1 out
2025-03-15 19:50:36.329162 I | cephosd: osd.1 is NOT ok to destroy but force removal is enabled so proceeding with removal
2025-03-15 19:50:36.331913 E | cephosd: failed to fetch the deployment "rook-ceph-osd-1". deployments.apps "rook-ceph-osd-1" not found
2025-03-15 19:50:36.331920 I | cephosd: purging osd.1
2025-03-15 19:50:36.655373 I | cephosd: attempting to remove host "oldhost" from crush map if not in use
2025-03-15 19:50:37.663211 I | cephosd: removed CRUSH host "oldhost"
2025-03-15 19:50:37.930419 I | cephosd: no ceph crash to silence
2025-03-15 19:50:37.930431 I | cephosd: completed removal of OSD 1

Note especially these lines:

2025-03-15 19:50:34.825971 I | cephosd: osd.0 is NOT ok to destroy but force removal is enabled so proceeding with removal
2025-03-15 19:50:36.329162 I | cephosd: osd.1 is NOT ok to destroy but force removal is enabled so proceeding with removal

That’s where my arrogance really bit me. I had just copy+pasted the rook purge-osd command from the docs, including the --force at the end. Not a good idea. Instead of taking the OSDs out first and letting the cluster rebalance, they were just removed, meaning I had a relatively long phase of reduced data redundancy.
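
For future me, a hedged sketch of what the gentler sequence probably looks like, pieced together from the docs after the fact; treat it as notes, not gospel:

# Take the OSDs out first so Ceph migrates their data away while they are still running
kubectl rook-ceph ceph osd out 0 1
# Wait until Ceph reports them as safe to remove
kubectl rook-ceph ceph osd safe-to-destroy 0 1
# Only then stop the OSD pods and purge, without --force
kubectl -n rook-cluster scale deployment rook-ceph-osd-0 rook-ceph-osd-1 --replicas=0
kubectl rook-ceph rook purge-osd 0,1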

Not my most stellar Homelab moment.

It took another 17 hours of rebalancing to recover the cluster. But I still wasn’t done yet, because now the Rook operator logs were showing these messages:

2025-03-20 20:31:32.598410 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down and a possible node drain is detected
2025-03-20 20:31:32.598473 I | clusterdisruption-controller: osd "rook-ceph-osd-1" is down and a possible node drain is detected
2025-03-20 20:31:32.814825 I | clusterdisruption-controller: osd is down in failure domain "oldhost". pg health: "all PGs in cluster are clean"

And again, had I read the docs properly, this would not have happened. I fixed the issue with the following commands:

kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0
kubectl delete deployments.apps -n rook-cluster rook-ceph-osd-0
kubectl delete deployments.apps -n rook-cluster rook-ceph-osd-1
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=1

With that, I had finally removed the old host cleanly from the Rook cluster. It could all have been a lot smoother if I’d just read the docs properly the first time. Again, no data loss of course, but it could have gone better.

The last step in the Ceph baremetal to Rook saga was to remove the old host from the Kubernetes cluster entirely:

kubectl drain oldhost --ignore-daemonsets --delete-local-data
kubectl delete node oldhost

And then resetting the Ceph scheduler options:

kubectl rook-ceph ceph config rm osd osd_mclock_profile
kubectl rook-ceph ceph config rm osd osd_mclock_override_recovery_settings

Conclusion

And that was it. The entire migration from baremetal Ceph took me quite a while, but a lot of it was just waiting for copying and rebalancing operations to finish. The effort I had to put in was relatively low. Even considering that somewhere towards the end I temporarily forgot the value of reading documentation from beginning to end.

The fact that I seemingly did not make full use of my storage performance, especially during the addition and removal of hosts, has reinforced my wish to do a real deep dive into Ceph, how it’s implemented and how it works. Luckily, it’s written in C++, which I’m working with at my day job as well. But I’m hoping I can also find some more high-level explanations of the algorithms used. I even plan to read Weil’s original thesis and the papers on RADOS and the CRUSH algorithm.

While writing these lines, I’m also working on the last step of the migration: migrating Vault into the cluster and then moving the cluster control plane nodes from VMs to my three Pi 4s.