In the course of spreading my homelab over a couple more machines, I finally arrived at the Ceph cluster’s MON daemons. Until now, these were running on three Ceph VMs on my main x86 server. In this post, I will describe how I moved them to three Raspberry Pis while the cluster stayed up the entire time.

First, a couple of considerations:

  • MON daemons use about 1 GB of memory each on average in my cluster
  • My cluster, and most of my services, went down during the migration, so please be cautious if you plan to do your own migration

The MON daemons are something of a control plane for Ceph clusters. They maintain the cluster maps, which describe the cluster’s daemons and where data is located. Every client which uses the Ceph cluster contacts them to get the map of available OSDs to work with.
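The current monmap, and with it the addresses clients are expected to use, can be inspected from any admin host. The output below is just an illustration with placeholder IPs:

ceph mon dump

    epoch 5
    ...
    0: [v2:10.0.0.1:3300/0,v1:10.0.0.1:6789/0] mon.oldhost1
    1: [v2:10.0.0.2:3300/0,v1:10.0.0.2:6789/0] mon.oldhost2
    2: [v2:10.0.0.3:3300/0,v1:10.0.0.3:6789/0] mon.oldhost3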

Please Note: Be cautious with this! If you lose all three of your Monitors, your cluster is broken.

Due to the centrality of the MON daemons for both the cluster itself and any clients, a lot of places potentially hold the IPs of your monitors. Most of the time, that will be in the form of ceph.conf files.
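For reference, the relevant part of such a ceph.conf usually looks something like this (the IPs are placeholders, and depending on how the file was generated the addresses may also appear in the longer v2/v1 bracket notation):

[global]
fsid = CLUSTER_ID
mon_host = 10.0.0.1,10.0.0.2,10.0.0.3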

Clients generally do not receive new MON addresses automatically. They will need to be updated manually!

So how did I do it all? I started out by migrating a single daemon. My thinking here: I migrate one daemon, then update that MON’s address to its new value everywhere, and then I migrate the other two daemons the same way.

For the sake of this article, let’s assume that the old MONs are located on oldhost1, oldhost2 and oldhost3, and the new hosts are called newhost1, newhost2 and newhost3.

Also note that I’m running a cephadm cluster.

So to begin with, a single daemon can be migrated by using the ceph orch apply command:

ceph orch apply mon --placement "newhost1,oldhost1,oldhost2"

This will disable the MON on oldhost3 and place a fresh one on newhost1. The MON daemons on oldhost1 and oldhost2 will not be touched at all and will continue running.
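To check that the new daemon actually came up and that the cluster is back to a full quorum of three MONs, something like the following should do (output omitted, the names of course depend on your hosts):

ceph orch ps --daemon-type mon
ceph -s

ceph -s should again show three MON daemons in quorum, now including newhost1.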

At this point, nothing much can go wrong in cluster operations. Any connected clients will automatically go searching for another MON daemon and find either oldhost1 or oldhost2. But note: those clients will not automagically get the IP of newhost1 added to their list of potential MONs. Many parts of the cluster, including the MON daemons on oldhost1 and oldhost2, will be informed about the new MON daemon. But other parts of the cluster will not. Among the daemons which do not automatically get the new MON address are the OSDs and the NFS daemons.

At this point, I was not yet aware that there was any problem.

I then adapted all of the ceph.conf files and other places where the MON IPs are mentioned. These were:

  • Ceph CSI jobs running in my Nomad cluster (see the config snippet after this list)
  • ceph.conf files on a number of unmanaged physical hosts
  • The kernel command lines of my netbooting hosts, which contain the MON addresses
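As an illustration of the first item: the Ceph CSI plugins read a small JSON config which lists the cluster’s monitors, and in my setup that file is part of the Nomad job definitions. Its shape is roughly this, with the cluster ID and IPs as placeholders:

[
  {
    "clusterID": "CLUSTER_ID",
    "monitors": [
      "10.0.0.4:6789",
      "10.0.0.5:6789",
      "10.0.0.6:6789"
    ]
  }
]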

This was where I diverged from my original plan. Instead of just replacing the IP of oldhost3 with that of newhost1, I went ahead and replaced all of them.

And here’s where the problems started. During reboots, my OSDs suddenly were no longer recognized in the ceph -s output. They were reported as down, even though I could see that they were up and running on their respective hosts.

The reason for this: the OSDs do not seem to be updated with new MON addresses automatically, and they also ignore their host’s ceph.conf file. Instead, they have their own config file, located at /var/lib/ceph/CLUSTER_ID/OSD_NAME/config. CLUSTER_ID here is the id: value from the ceph -s output, and OSD_NAME is, for example, osd.1. That file is essentially a per-daemon ceph.conf used by the OSD. Manually changing the MON addresses in there and restarting the daemons fixed the issue.
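Put together, the fix looked roughly like this on each affected host. The cluster ID and OSD name are placeholders, and restarting the daemon via systemctl and its cephadm unit instead of the orchestrator should work just as well:

ceph fsid
vi /var/lib/ceph/CLUSTER_ID/osd.1/config
ceph orch daemon restart osd.1

In the config file, it is the mon_host line which needs to point at the new MON addresses.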

I also observed that the NFS daemon I had running did not seem to be working anymore. It had the same problem, and the same solution fixed it.
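The NFS daemon’s name can be looked up via the orchestrator, and the analogous config file under /var/lib/ceph/CLUSTER_ID/ can then be edited and the daemon restarted in the same way (NFS_DAEMON_NAME is a placeholder for whatever ceph orch ps reports):

ceph orch ps --daemon-type nfs
vi /var/lib/ceph/CLUSTER_ID/NFS_DAEMON_NAME/config
ceph orch daemon restart NFS_DAEMON_NAME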

A final comment on performance: It seems that Raspberry Pis manage the load of MON daemons just fine. I’ve got three of them hosting the MONs now, and they are also running Nomad, Consul and Vault servers. The CPU utilization seldom goes above 10%.