Ceph NFS on ARM64: Not currently possible

Today, I wanted to migrate my single Ceph NFS daemon from an LXD VM on an x86 server to my Hardkernel H3. This did not work out too well. Yesterday, I had finally finished migrating my Ceph cluster’s MON daemons to my three Raspberry Pi controllers. This move had its own problems, which I will detail in another blog entry. Part of this migration was to also move the MGR daemons over to my Pis. This actually worked without any problem - or so I thought.

Ceph’s MGR daemon serves a number of functions related to managing a Ceph cluster. It provides the rather nicely made Ceph Dashboard, for example. Its most important task, though, is running the Orchestrator, which is accessible through the ceph orch CLI. Of particular interest for this post is the NFS module behind the ceph nfs commands. This module allows a user to run one or more NFS servers whose exports are backed by S3 buckets or CephFS volumes. In my setup, I’m using this functionality to provide an NFS export which houses the /boot partitions of my netbooting machines. For details of my setup, have a look at this previous post.

So now what happened today? I wanted to migrate my NFS daemon from the VM it was running on to my Hardkernel H3.

First step: Adding a second NFS daemon on the new host, while leaving the old daemon untouched:

ceph orch apply nfs my-nfs --placement "newhost,oldhost"

The result of this was an error message along these lines:

bash[61872]: debug 2022-12-21T11:57:05.302+0000 ffff84614280 -1 log_channel(cephadm) log [ERR] : Failed to apply nfs.my-nfs spec NFSServiceSpec.from_json(yaml.safe_load('''service_type: nfs
bash[61872]: service_id: hn-nfs
bash[61872]: service_name: nfs.hn-nfs
bash[61872]: placement:
bash[61872]:   hosts:
bash[61872]:   - khonsu
bash[61872]: ''')): [Errno 2] No such file or directory: 'ganesha-rados-grace': 'ganesha-rados-grace'
bash[61872]: Traceback (most recent call last):
bash[61872]:   File "/usr/share/ceph/mgr/cephadm/serve.py", line 507, in _apply_all_services
bash[61872]:     if self._apply_service(spec):
bash[61872]:   File "/usr/share/ceph/mgr/cephadm/serve.py", line 760, in _apply_service
bash[61872]:     daemon_spec = svc.prepare_create(daemon_spec)
bash[61872]:   File "/usr/share/ceph/mgr/cephadm/services/nfs.py", line 66, in prepare_create
bash[61872]:     daemon_spec.final_config, daemon_spec.deps = self.generate_config(daemon_spec)
bash[61872]:   File "/usr/share/ceph/mgr/cephadm/services/nfs.py", line 87, in generate_config
bash[61872]:     self.run_grace_tool(spec, 'add', nodeid)
bash[61872]:   File "/usr/share/ceph/mgr/cephadm/services/nfs.py", line 225, in run_grace_tool
bash[61872]:     timeout=10)
bash[61872]:   File "/lib64/python3.6/subprocess.py", line 423, in run
bash[61872]:     with Popen(*popenargs, **kwargs) as process:
bash[61872]:   File "/lib64/python3.6/subprocess.py", line 729, in __init__
bash[61872]:     restore_signals, start_new_session)
bash[61872]:   File "/lib64/python3.6/subprocess.py", line 1364, in _execute_child
bash[61872]:     raise child_exception_type(errno_num, err_msg, err_filename)
bash[61872]: FileNotFoundError: [Errno 2] No such file or directory: 'ganesha-rados-grace': 'ganesha-rados-grace'

So the file ganesha-rados-grace is missing from the Docker image Ceph uses. A quick search for the error message leads to this bug in Ceph’s GitHub repo. As indicated in the bug, the nfs-ganesha package is not available for ARM64. Ganesha is a user-space NFS server which can export many different types of storage via NFS.

I’m honestly not sure what the underlying problem is, but to me it reads as if Ganesha is simply not built for ARM64, so it doesn’t become part of the ARM64 Ceph container.

One of the reasons it took me so long to realize that this was the problem I had actually run into: The new NFS daemon was supposed to run on the Hardkernel H3 - and that’s an x86 machine. After some more tests, I finally realized that the above error wasn’t coming from a failed daemon start - it was coming from the MGR instances on my Raspberry Pis!

And this makes sense: There are a couple of ganesha commands which need to be run when a new daemon is initialized. Those commands run inside the MGR itself (the traceback points at the NFS service code of its cephadm module) - and so are executed on the MGR host, not on the host which is going to run the NFS daemon.
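The traceback fits this picture: subprocess.run looks for the executable in the environment of the Python process that calls it, which here is the MGR container. The following is a minimal sketch of that behaviour, not cephadm's actual code. The function names helper_available and run_helper are made up for illustration, and only the binary name is taken from the log.

```python
import shutil
import subprocess

# Binary name as it appears in the traceback; everything else is illustrative.
GRACE_TOOL = "ganesha-rados-grace"


def helper_available(name):
    """Return True if `name` is an executable on the PATH of this process."""
    return shutil.which(name) is not None


def run_helper(argv, timeout=10):
    """Run a helper binary locally; return None if it is not installed here.

    The lookup happens on the machine (or in the container) running this
    Python process, not on whichever host the resulting service will use.
    """
    try:
        return subprocess.run(argv, timeout=timeout,
                              capture_output=True, text=True)
    except FileNotFoundError as exc:
        # Same exception class as in the log: the executable was not found.
        print(f"cannot run {exc.filename!r} here: not installed locally")
        return None
```

Run inside an MGR container that lacks the Ganesha tools, helper_available(GRACE_TOOL) would be expected to return False, no matter which host the NFS daemon is placed on.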

My only workaround for now: adding another MGR instance on the H3 (luckily it has enough memory) and making it the active MGR whenever I need to run NFS commands, which should not be too often.
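Switching the active MGR by hand gets tedious, so here is a rough sketch of how it could be scripted. It is not taken from a real setup and rests on a few assumptions: that ceph mgr stat reports the active daemon as active_name, that cephadm names MGR daemons "<hostname>.<suffix>", and that failing the active MGR lets some standby take over. Ceph chooses which standby that is, so the loop simply retries a few times. The function names are hypothetical.

```python
import json
import subprocess
import time


def ceph_json(args, run=subprocess.run):
    """Run `ceph <args> --format json` and return the parsed output."""
    result = run(["ceph", *args, "--format", "json"],
                 check=True, capture_output=True, text=True)
    return json.loads(result.stdout)


def mgr_host(daemon_name):
    """Assumption: cephadm names MGR daemons '<hostname>.<suffix>'."""
    return daemon_name.split(".", 1)[0]


def ensure_active_mgr_on(host, attempts=5, wait_seconds=10,
                         run=subprocess.run, sleep=time.sleep):
    """Fail over the active MGR until it runs on `host`, or give up."""
    for _ in range(attempts):
        active = ceph_json(["mgr", "stat"], run=run)["active_name"]
        if mgr_host(active) == host:
            return active
        # Mark the current active MGR as failed so that a standby takes over.
        run(["ceph", "mgr", "fail", active], check=True)
        # Give the cluster a moment to promote a standby before re-checking.
        sleep(wait_seconds)
    raise RuntimeError(f"active MGR is still not on {host!r} "
                       f"after {attempts} attempts")
```

Calling ensure_active_mgr_on("h3") before any ceph orch apply nfs or ceph nfs command would then make sure an x86 MGR handles the request. Here "h3" is a placeholder host name. The run and sleep parameters exist only so the function can be tried out without a cluster.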

But as a consequence of this problem, I’m now considering whether I should just go ahead and order two more H3s to have all my Ceph nodes on x86. Decision to be made later.