Wherein I migrate the last remaining data off of my baremetal Ceph cluster and shut it down.
This is part 24 of my k8s migration series.
I set up my baremetal Ceph cluster back in March of 2021, driven by how much I liked the idea of large pools of disk I could use to provide S3 storage, block devices and a POSIX-compatible filesystem. Since then, it has served me rather well, and I’ve been using it to provide S3 buckets and volumes for my Nomad cluster. Given how happy I was with it, I also wanted to continue using it for my Kubernetes cluster.
To this end, I was quite happy to discover the Rook Ceph project, which at its core implements a Kubernetes operator capable of orchestrating an entire Ceph cluster. I’ve described my setup in far more detail in this blog post.
In the original baremetal cluster, I had three nodes, each with one HDD and one SSD for storage, running all Ceph daemons besides the MONs, which ran on my cluster controller Pi 4s. I ran the cluster with replicated pools at a replication factor of two and a minimum size of one, so I could reboot a node, e.g. during maintenance, without all writes to the cluster having to stop. I was lucky in that all of my data comfortably fit on just two hosts with a 1 TB SSD and a 4 TB HDD each. So when the time came to start the migration, I took my emergency replacement HDD and SSD and put them into my old Homeserver. A VM running on that server became the first OSD node in the k8s cluster. I also drained the OSDs and other daemons from one of the original baremetal nodes and moved that node into the k8s cluster as well. So I still ended up with 2x replication, just with two clusters of two storage nodes each.
After I was finally done migrating all of my services from Nomad to Kubernetes, I still had the following datasets on the baremetal Ceph cluster:
- An NFS Ganesha cluster serving the boot partitions of all of my netbooting hosts
- A data dump CephFS volume that contained just some random data, like old slides and digital notes from my University days
- The root disks of all of my netbooting nodes, in the form of 50 GB RBDs
In the rest of this post, I will go over how I migrated all three of those, shut down the old baremetal cluster and migrated its two physical nodes into the Rook Ceph cluster.
Root disk migration
The first step was migrating the root disks of my netbooting hosts. Those hosts are eight Raspberry Pi CM4s and one x86 SBC, all without any local storage. Each of them uses a 50 GB RBD as its root disk. These RBDs needed to be migrated over to the new Rook Ceph cluster, and the hosts’ configuration changed to contact the Rook MON daemons. If you’re interested in the details of my netboot setup, have a look at this series of posts.
As these RBDs are block devices, I was initially at a bit of a loss when thinking about how to migrate them over. Sure, those nine netbooters were as cattle-ish as it gets, so I could have just recreated them completely - but setting up fresh hosts is the weakest part of my Homelab, and it would have taken me a couple of evenings.
Luckily, Reddit came to the rescue. It turns out that the rbd tool can both import and export RBD images, including via stdin/stdout.
I did the migration node by node, and because at this point all of the nodes were in the k8s cluster, I had to start with draining them:
kubectl drain --delete-emptydir-data=true --force=true --ignore-daemonsets=true examplehost
Then the node also needs to be shut down, because migrating the disk from one Ceph cluster to another really isn’t going to work online. Once the host was safely shut down, I could do the actual copy operation:
rbd --id admin export --no-progress hostdisks/examplehost - | rbd -c ceph-rook.conf -k client.admin.key --id admin import - hostdisks/examplehost
The first rbd invocation does not receive an explicit Ceph config file, so it uses the default /etc/ceph/ceph.conf file, which at this point was still the config for the baremetal cluster. The second invocation points at the Rook cluster via the ceph-rook.conf config file and the matching admin keyring. The hostdisks pool was the destination pool for the copy operation.
One issue worth noting here is that the rbd tool as provided by the Rook kubectl plugin did not work as the receiving command - I was immediately getting broken pipe errors. Probably something to do with how it is implemented as a kubectl plugin.
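Put together, the per-host procedure looked roughly like the following. This is just a sketch with a placeholder host name; in reality, shutting the host down and waiting for it to be fully powered off were manual steps on my end:

# Per-host migration sketch; "examplehost" is a placeholder.
HOST=examplehost

# 1. Drain the node so Kubernetes stops scheduling onto it.
kubectl drain --delete-emptydir-data=true --force=true --ignore-daemonsets=true "$HOST"

# 2. Shut the host down so nothing holds the RBD open anymore, then wait until it is really off.
ssh "$HOST" sudo shutdown -h now

# 3. Copy the RBD from the baremetal cluster to the Rook cluster.
rbd --id admin export --no-progress hostdisks/"$HOST" - \
  | rbd -c ceph-rook.conf -k client.admin.key --id admin import - hostdisks/"$HOST"

# 4. After fixing the kernel command line, boot the host again and uncordon it.
kubectl uncordon "$HOST"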
With the copy done, which took about ten minutes per disk, I then had to adapt the configuration of the MON IPs in the host’s kernel command line. For one of my Pis, it looks something like this:
console=serial0,115200 dwc_otg.lpm_enable=0 console=tty1 root=LABEL=writable rootfstype=ext4 rootwait fixrtc boot=rbd rbdroot=300.300.300.310,300.300.300.311,300.300.300.312:cephuser:pw:hostdisks:examplehost::_netdev,noatime hllogserver=logs.internal:12345
The list of three IPs after rbdroot= contains the MONs to be used. Then I also had to change the Ceph key in the pw field.
And then I could reboot the host. And what can I say - all nine hosts went through without a single issue. I had expected at least some sort of problem, but I was seemingly pretty well prepared.
Before going to the next migration, let’s have a look at some Ceph metrics for this copy operation. First the throughput:
Throughput graph for the receiving Ceph cluster for eight of the disk migrations.
Part of the explanation might already show up when looking at the throughput on my router next:
Network graphs for the NIC both the Ceph hosts and the C&C host hang off of.
Next, let’s have a quick look at the IO utilization on the two receiving hosts:
IO utilization of the two receiving hosts.
So clearly, this time around the IO utilization is not the problem. Neither is the CPU:
CPU idle percentage of the two receiving hosts.
CPU idle percentage of the C&C host doing the rbd import/export.
This might be the explanation for why I’m reaching no more than 50 MB/s throughput, even though this is a copy from SSD to SSD. The C&C host is a pretty weak one: it has an AMD Embedded G-Series GX-412TC CPU, a very low-powered part. Normally that’s more than enough, as it doesn’t need to do anything compute-heavy, but this might be too much for it. I’m not familiar with the rbd import/export implementation, but looking at the plot, I could theorize: it looks like two of the four cores are fully pegged, possibly one by the rbd export and one by the rbd import. And the roughly 50 MB/s is simply all it can really do?
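If I want to verify that theory during the next big copy, watching per-core utilization on the C&C host while the pipe runs should be enough - for example with mpstat from the sysstat package, printing per-CPU numbers every five seconds:

mpstat -P ALL 5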
I think I need to dig deeper into this at some point, running some proper testing of what I can really do when it comes to reads and writes in Ceph.
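A starting point for that testing would be Ceph’s built-in RADOS benchmark, pointed at a scratch pool. A sketch, with a hypothetical pool name:

# Write for 60 seconds, keeping the objects around for the read test.
rados bench -p bench-test 60 write --no-cleanup
# Sequential read benchmark against the objects written above.
rados bench -p bench-test 60 seq
# Remove the benchmark objects again.
rados -p bench-test cleanup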
That’s it for the disk copying. Let’s move on to the second element of my netboot setup, the boot partitions sitting on NFS.
NFS setup
For the boot partitions, I needed to come up with something special, because those need to be “shared” between the host they belong to and my cluster master, which runs a TFTP server. That’s because to mount the RBD root disks, I need a kernel running, and that kernel needs to come from somewhere. Plus, the hosts should all be able to independently run updates, or even different operating systems. So I couldn’t just share one boot partition between all of them.
For this, again, I’m using my Ceph cluster and the integrated support for NFS Ganesha.
I configured the cluster with the Rook NFS CRD, looking like this:
apiVersion: ceph.rook.io/v1
kind: CephNFS
metadata:
  name: hl-nfs
spec:
  # Settings for the NFS server
  server:
    active: 1
    placement:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: "homelab/role"
                  operator: In
                  values:
                    - "ceph"
      tolerations:
        - key: "homelab/taint.role"
          operator: "Equal"
          value: "ceph"
          effect: "NoSchedule"
    resources:
      limits:
        memory: "1Gi"
      requests:
        cpu: "250m"
        memory: "1Gi"
    priorityClassName: "system-cluster-critical"
    logLevel: NIV_INFO
This creates a single NFS server pod in the cluster. If I read the docs right, NFS doesn’t do HA very well, so there’s not much use in having more than one.
One of the things Ceph does when an NFS cluster is set up is to create the .nfs pool as a location for some metadata. This in turn led the Ceph PG autoscaler to stop working, with this warning:
debug 2025-03-10T13:54:45.313+0000 7fa1ad391640 0 [pg_autoscaler WARNING root] pool 6 contains an overlapping root -3... skipping scaling
I’ve written about the last time I encountered this issue here, so suffice it to say that the root cause is that the new pool is created with a generic CRUSH rule whose root overlaps with the roots used by the more specific rules on my other pools. It’s fixed by applying a more specific rule to the pool.
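In practice, that fix boils down to two commands: create a replicated rule pinned to a device class (if a suitable one doesn’t already exist) and assign it to the new pool. The rule name here is just an example from my setup, not something Ceph creates for you:

# Create a replicated rule restricted to SSDs, with host as the failure domain.
ceph osd crush rule create-replicated replicated-ssd default host ssd
# Assign it to the pool the NFS module created.
ceph osd pool set .nfs crush_rule replicated-ssd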
Because I wanted to use the NFS cluster outside k8s, I also introduced this Service:
apiVersion: v1
kind: Service
metadata:
  name: nfs-rook-external
  labels:
    homelab/public-service: "true"
  annotations:
    external-dns.alpha.kubernetes.io/hostname: nfs.example.com
    io.cilium/lb-ipam-ips: "300.300.300.102"
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: rook-ceph-nfs
    ceph_nfs: hl-nfs
    instance: a
  ports:
    - name: nfs
      port: 2049
This meant I didn’t need to put a fixed host into the /etc/fstab of my hosts, and NFS is perfectly happy using DNS to get the IP of the NFS server.
Next is the NFS share itself. These shares can be backed by either a CephFS subvolume or an S3 bucket. But the S3 bucket backend has severe restrictions. I tried it once, and found that e.g. Git repos on such an NFS share don’t work, with Git commands returning Not Implemented errors. So I created a CephFS subvolume:
ceph fs subvolume create my-cephfs my-share
Then comes the creation of the NFS share:
ceph nfs export create cephfs --cluster-id hl-nfs --pseudo-path /my-share-path --fsname my-cephfs --path /volumes/_nogroup/my-share/UUID-HERE --client_addr 300.300.300.0/24 --client_addr 300.300.315.0/24
The --path parameter can be fetched via this command:
ceph fs subvolume getpath my-cephfs my-share
One thing I’m a bit sad about is that I had to use the command line to create those two objects, the subvolume and the NFS share, instead of being able to use CRDs in the k8s cluster.
The resulting NFS share definition, as fetched with ceph nfs export ls hl-nfs --detailed, looks like this:
[
  {
    "access_type": "none",
    "clients": [
      {
        "access_type": "rw",
        "addresses": [
          "300.300.300.0/24",
          "300.300.315.0/24"
        ],
        "squash": "None"
      }
    ],
    "cluster_id": "hl-nfs",
    "export_id": 1,
    "fsal": {
      "fs_name": "my-cephfs",
      "name": "CEPH",
      "user_id": "nfs.hl-nfs.1"
    },
    "path": "/volumes/_nogroup/my-share/UUID-HERE",
    "protocols": [
      4
    ],
    "pseudo": "/my-share-path",
    "security_label": true,
    "squash": "None",
    "transports": [
      "TCP"
    ]
  }
]
The end effect of all of this is an NFS share which can be mounted like this:
nfs.example.com:/my-share-path /mnt/example nfs defaults,timeo=900,_netdev 0 0
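Before committing that line to fstab, a quick manual mount is a good way to confirm that both the export and the DNS name work; the mount point here is just a placeholder:

sudo mount -t nfs4 nfs.example.com:/my-share-path /mnt/test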
One small note on the migration: Ansible’s mount module does not seem to automatically remount when a mount definition changes. That is likely a good default, but it meant that I had to execute these commands on all of my netbooters:
ansible "host1:host2:host3" -a "umount /boot/firmware"
ansible "host1:host2:host3" -a "mount /boot/firmware"
After that, they all had the right NFS share mounted and I was one step closer to shutting down the baremetal cluster.
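For reference, the mount definitions on the netbooters are managed with Ansible’s mount module. The ad-hoc equivalent looks roughly like this - the source path and options are illustrative, not my exact values:

ansible "host1:host2:host3" -m ansible.posix.mount \
  -a "src=nfs.example.com:/my-share-path path=/boot/firmware fstype=nfs opts=defaults,timeo=900,_netdev state=mounted"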
Copying my warehouse volume over
As I’ve mentioned above, I’ve got a “random bunch of stuff” CephFS subvolume that is mounted on my desktop. It really contains exactly that: a random assortment of data - copies of old University slides and projects, backups of my OpenWRT WiFi router’s config and some old database dumps from services I’m no longer running. Overall, it’s about 129 GB, so not too much data, in contrast to my Linux ISO collection for example.
Here’s the rsync command and its output:
rsync -av --info=progress2 --info=name0 /mnt/temp1/* /mnt/temp2/
sending incremental file list
129,648,120,620 99% 41.86MB/s 0:49:14 (xfr#1415, to-chk=0/1537)
sent 129,679,927,316 bytes received 27,564 bytes 43,892,352.30 bytes/sec
total size is 129,654,775,448 speedup is 1.00
Absolutely nothing interesting happened here; it took only about 49 minutes. If you’re interested in some metrics from a 1.7 TB copy operation from a CephFS subvolume on one cluster to a subvolume on another cluster, have a look at this recent post.
Final takedown of the baremetal cluster
So that’s it. With the warehouse volume transferred, there was, supposedly, nothing important on that cluster anymore.
But I wasn’t about to trust that. Instead, I ran ceph df to confirm, and found that there was exactly 348 MB of data left. Deciding that that couldn’t be anything important, I ran the cluster purge by executing this command on all the remaining cluster hosts, meaning the two OSD nodes and the three cluster controllers hosting the MONs:
cephadm rm-cluster --force --zap-osds --fsid a84c7196-7ebf-11eb-b290-18c04d00217f
And just like that, the baremetal Ceph cluster was gone. It lived for almost exactly four years, having been created on 2021-03-06, at 21:05.
Adding the two baremetal hosts to the Rook cluster
After the old cluster had been removed, I needed to add the two OSD hosts to the Rook cluster. I did so by first adding them to the k8s cluster and then updating the Rook cluster’s Helm values:
- name: "host1"
devices:
- name: "/dev/disk/by-id/wwn-0x5002538e90b5e22f"
config:
deviceClass: ssd
- name: "/dev/disk/by-id/wwn-0x50014ee2ba48465d"
config:
deviceClass: hdd
- name: "host2"
devices:
- name: "/dev/disk/by-id/wwn-0x5002538e90b68866"
config:
deviceClass: ssd
- name: "/dev/disk/by-id/wwn-0x50014ee20f9d1545"
config:
deviceClass: hdd
Both hosts have one 1 TB SATA SSD and one 4 TB HDD. To be absolutely safe, I’m using the disks’ WWNs to identify them.
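Rolling the new storage nodes out is then just a normal Helm upgrade of the cluster chart. A sketch, assuming the rook-ceph-cluster chart from the rook-release repo and my namespace and release names:

helm upgrade --namespace rook-cluster rook-ceph-cluster rook-release/rook-ceph-cluster -f values.yaml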
After that, the rebalancing started:
PG state during the rebalancing after adding the two additional hosts with four additional OSDs.
I would love to show you the overall throughput of the backfill operations, but it looks like there are no metrics for those. The ceph_osd_op_r_out_bytes and ceph_osd_op_w_in_bytes metrics I’m using for the general cluster throughput seem to only cover actual client operations - that throughput definitely did not show the backfill load on the OSDs.
So let’s instead have a look at the throughput of the six disks in the cluster:
Accumulated bytes written per second on all disks, HDD and SSD, in the Rook cluster during the rebalancing.
The graph has a couple of points worth discussing. The first one to note is that there was not much client load on the cluster overall - it hovered around 1 - 2 MB/s, typical for my Homelab. And still, the rebalancing only used, at maximum, 70 MB/s worth of writes. And remember, these are not just HDDs, but also SSDs. And I’m pretty sure that this is entirely due to Ceph itself. At the beginning of the plot, around 23:49, you can see a jump from around 30 MB/s to 70 MB/s. That happened after I entered the following two commands:
ceph config set osd osd_mclock_override_recovery_settings true
ceph config set osd osd_max_backfills 2
These instruct Ceph to allow more than one concurrent backfill per OSD. Then you can also see another jump at 09:20 the next morning, where the throughput suddenly goes from around 40 MB/s to 70 MB/s again, at least for a short while. That was after I entered this command:
ceph config set osd osd_mclock_profile high_recovery_ops
I will refrain from any expletives at this point, because I don’t understand this well enough to judge whether I’m the problem here, or whether Ceph really works this way.
So, a little while ago, Ceph introduced a new IO scheduler, mclock. The config settings I showed above impact how that scheduler works.
Why do I have to make these settings? Why, with barely 1 MB/s throughput, do I actually have to tell Ceph to run more than one backfill per OSD? And why doesn’t Ceph actually use the OSDs full throughput for even that one default backfill? Because that graph above, that’s not a single disk. That’s the sum of the throughput on all of them. This kind of write throughput would be pathetic for a single HDD. I really don’t understand why my mixed HDD/SSD cluster shows it. What does the scheduler actually do here? I mean, don’t get me wrong - there is likely a good reason, but I don’t understand it. Why not use an OSD’s full write capacity for backfills when there is nearly no other traffic happening?
I was really stumped when I saw these numbers. And also, why even have a scheduler when I still need to manually set the maximum number of backfills allowed?
Anyway, there’s now a “Learn Ceph” task in my backlog. When the migration is done, I will not put my old home server back into storage. Instead, I will buy a couple more disks and use it as a Ceph playground. And if I have to read the entire Ceph source code from int main() to the end, I will. Because I’m now intensely curious about why the backfill was so darned slow.
And now, let’s come to the “Michael utterly embarrasses himself” part of this post.
Arrogance
After the addition of the new hosts was done, I could shut down the Ceph VM running on my extension host, as it was no longer required.
And I learned that I have an unhealthy amount of arrogance. I went into this thinking “Well, I sure know how to remove a host from a Ceph cluster, I don’t need any docs!”.
Narrator: He did need docs.
So let’s start with what I should have done. I should have followed these Rook docs. They describe, in very nice detail, what to do to remove OSDs from a Rook Ceph cluster.
But no. That was of course not what I did. What I did was just winging it.
So I started by removing the host from the values.yaml file. That had only one effect, namely showing messages like this in the logs of the Rook operator:
2025-03-15 19:20:15.090341 W | op-osd: not updating OSD 0 on node "oldhost". node no longer exists in the storage spec. if the user wishes to remove OSDs from the node, they must do so manually. Rook will not remove OSDs from nodes that are removed from the storage spec in order to prevent accidental data loss
After that slightly embarrassing failure, I deigned to actually skim the doc I linked to above, and found this command:
$ kubectl rook-ceph rook purge-osd 0,1 --force
Info: Running purge osd command
2025/03/15 19:48:59 maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined
2025-03-15 19:48:59.731856 W | cephcmd: loaded admin secret from env var ROOK_CEPH_SECRET instead of from file
2025-03-15 19:48:59.731894 I | rookcmd: starting Rook v1.16.5 with arguments 'rook ceph osd remove --osd-ids=0,1 --force-osd-removal=true'
2025-03-15 19:48:59.731897 I | rookcmd: flag values: --force-osd-removal=true, --help=false, --log-level=INFO, --osd-ids=0,1, --preserve-pvc=false
2025-03-15 19:48:59.737462 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2025-03-15 19:48:59.737529 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2025-03-15 19:48:59.921560 I | cephosd: validating status of osd.0
2025-03-15 19:48:59.921571 I | cephosd: osd.0 is healthy. It cannot be removed unless it is 'down'
2025-03-15 19:48:59.921573 I | cephosd: validating status of osd.1
2025-03-15 19:48:59.921575 I | cephosd: osd.1 is healthy. It cannot be removed unless it is 'down'
So that wasn’t exactly a success either. So what did I do? Did I now properly read the entire page? No, of course not. I instead decided that the right way to do this was to scale down the two OSDs I wanted to remove:
kubectl -n rook-cluster scale deployment rook-ceph-osd-0 --replicas=0
kubectl -n rook-cluster scale deployment rook-ceph-osd-1 --replicas=0
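A quick look at the OSD tree via the same plugin would have confirmed that the two OSDs were now reported as down, which is the state the purge command insists on:

kubectl rook-ceph ceph osd tree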
Then I repeated the purge command:
$ kubectl rook-ceph rook purge-osd 0,1 --force
Info: Running purge osd command
2025/03/15 19:50:33 maxprocs: Leaving GOMAXPROCS=8: CPU quota undefined
2025-03-15 19:50:33.244662 W | cephcmd: loaded admin secret from env var ROOK_CEPH_SECRET instead of from file
2025-03-15 19:50:33.244700 I | rookcmd: starting Rook v1.16.5 with arguments 'rook ceph osd remove --osd-ids=0,1 --force-osd-removal=true'
2025-03-15 19:50:33.244704 I | rookcmd: flag values: --force-osd-removal=true, --help=false, --log-level=INFO, --osd-ids=0,1, --preserve-pvc=false
2025-03-15 19:50:33.250479 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2025-03-15 19:50:33.250539 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2025-03-15 19:50:33.432040 I | cephosd: validating status of osd.0
2025-03-15 19:50:33.432049 I | cephosd: osd.0 is marked 'DOWN'
2025-03-15 19:50:33.622957 I | cephosd: marking osd.0 out
2025-03-15 19:50:34.825971 I | cephosd: osd.0 is NOT ok to destroy but force removal is enabled so proceeding with removal
2025-03-15 19:50:34.828262 E | cephosd: failed to fetch the deployment "rook-ceph-osd-0". deployments.apps "rook-ceph-osd-0" not found
2025-03-15 19:50:34.828271 I | cephosd: purging osd.0
2025-03-15 19:50:35.055813 I | cephosd: attempting to remove host "oldhost" from crush map if not in use
2025-03-15 19:50:35.237143 I | cephosd: failed to remove CRUSH host "oldhost". exit status 39
2025-03-15 19:50:35.427664 I | cephosd: no ceph crash to silence
2025-03-15 19:50:35.427677 I | cephosd: completed removal of OSD 0
2025-03-15 19:50:35.427680 I | cephosd: validating status of osd.1
2025-03-15 19:50:35.427683 I | cephosd: osd.1 is marked 'DOWN'
2025-03-15 19:50:35.608670 I | cephosd: marking osd.1 out
2025-03-15 19:50:36.329162 I | cephosd: osd.1 is NOT ok to destroy but force removal is enabled so proceeding with removal
2025-03-15 19:50:36.331913 E | cephosd: failed to fetch the deployment "rook-ceph-osd-1". deployments.apps "rook-ceph-osd-1" not found
2025-03-15 19:50:36.331920 I | cephosd: purging osd.1
2025-03-15 19:50:36.655373 I | cephosd: attempting to remove host "oldhost" from crush map if not in use
2025-03-15 19:50:37.663211 I | cephosd: removed CRUSH host "oldhost"
2025-03-15 19:50:37.930419 I | cephosd: no ceph crash to silence
2025-03-15 19:50:37.930431 I | cephosd: completed removal of OSD 1
Note especially these lines:
2025-03-15 19:50:34.825971 I | cephosd: osd.0 is NOT ok to destroy but force removal is enabled so proceeding with removal
2025-03-15 19:50:36.329162 I | cephosd: osd.1 is NOT ok to destroy but force removal is enabled so proceeding with removal
That’s where my arrogance really bit me. I had just copy+pasted the rook purge-osd command from the docs, including the --force at the end. Not a good idea. Instead of taking the OSDs out and then letting the cluster rebalance, they were removed outright, meaning I had a relatively long phase of reduced data redundancy. Not my most stellar Homelab moment.
It took another 17 hours of rebalancing to recover the cluster. But I still wasn’t done yet, because now the Rook operator logs were showing these messages:
2025-03-20 20:31:32.598410 I | clusterdisruption-controller: osd "rook-ceph-osd-0" is down and a possible node drain is detected
2025-03-20 20:31:32.598473 I | clusterdisruption-controller: osd "rook-ceph-osd-1" is down and a possible node drain is detected
2025-03-20 20:31:32.814825 I | clusterdisruption-controller: osd is down in failure domain "oldhost". pg health: "all PGs in cluster are clean"
And again, had I read the docs properly, this would not have happened. I fixed the issue with the following commands:
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0
kubectl delete deployments.apps -n rook-cluster rook-ceph-osd-0
kubectl delete deployments.apps -n rook-cluster rook-ceph-osd-1
kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=1
With that, I had finally removed the old host cleanly from the Rook cluster. It could all have been a lot smoother if I’d just read the docs properly the first time. Again, no data loss of course, but it could have gone better.
The last step in the Ceph baremetal to Rook saga was to remove the old host from the Kubernetes cluster entirely:
kubectl drain oldhost --ignore-daemonsets --delete-local-data
kubectl delete node oldhost
And then resetting the Ceph scheduler options:
kubectl rook-ceph ceph config rm osd osd_mclock_profile
kubectl rook-ceph ceph config rm osd osd_mclock_override_recovery_settings
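To double-check that the overrides are really gone, the same plugin can query the current values, which should be back at their defaults:

kubectl rook-ceph ceph config get osd osd_mclock_profile
kubectl rook-ceph ceph config get osd osd_mclock_override_recovery_settings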
Conclusion
And that was it. The entire migration from baremetal Ceph took me quite a while, but a lot of that was just waiting for copy and rebalancing operations to finish. The effort I actually had to put in was relatively low - even considering that somewhere towards the end, I temporarily forgot the value of reading documentation from beginning to end.
The fact that I seemingly did not make full use of my storage performance, especially during the addition and removal of hosts, has reinforced my wish to do a real deep dive into Ceph - how it’s implemented and how it works. Luckily, it’s written in C++, which I also work with professionally. But I’m hoping I can also find some more high-level explanations of the algorithms used. I even plan to read Weil’s original thesis and the papers on RADOS and the CRUSH algorithm.
While writing these lines, I’m also working on the last step of the migration: moving Vault into the cluster and then migrating the cluster control plane nodes from VMs to my three Pi 4s.