Wherein I figure out why my Ceph S3 is so slow and think about potential hardware upgrades.

As part of my goaccess post, I had to copy almost 60 GB of logs from my laptop to my desktop. I decided to do that via my Ceph S3. And it was very, very slow. There were 185 files to copy, with a total size just shy of 60 GiB. The majority of that size comes from two Traefik log files, both around 30 GiB in size. I used Rclone to sync the files to an empty directory on my desktop with this command:

rclone sync -P my-s3:public/traefik-log ./

And that was very slow. I saw a maximum of 25 MiB/s, that’s it. And sure, the S3 bucket I was copying from was only backed by HDDs, so there’s an upper limit. And my network is only 1 Gb/s, so there’s another potential bottleneck. But, at the same time, 25 MiB/s is still a bit anemic, considering that the HDDs should be able to do 120 MB/s, and the network should be able to do about the same. This S3 performance issue has been dogging me for quite a while. I had noticed it in earlier copy operations as well, for example my backups, which also go into S3 buckets via restic.
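
To make sure rclone itself wasn’t the limiting factor, the client-side parallelism can be cranked up. A minimal sketch with arbitrary example values; as it turned out, that wouldn’t have changed much here, since the bottleneck sat elsewhere:

```
# More parallel file transfers, plus multi-threaded downloads for the two
# ~30 GiB log files (values are just examples, not tuned recommendations)
rclone sync -P --transfers 8 --multi-thread-streams 8 my-s3:public/traefik-log ./
```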

Before describing the problem in detail, here is what the overall setup looks like:

A diagram showing the setup. It shows five hosts. The first three have one HDD each. Two of them also have one RGW instance each. Then the fourth host has the Traefik instance, and the fifth host is my desktop machine. The two RGW instances each have connections to all three HDDs, and the Traefik instance has connections to both RGWs. The desktop only has one connection, to the Traefik instance.

Setup of RGW access in my Homelab.

So my desktop contacts the Traefik proxy for access to S3. Traefik, in turn, contacts the two load-balanced Ceph RGW instances. Those, in turn, are backed by three HDDs for data storage. Each HDD is in a different host, and the two RGW instances also run on different hosts. The HDDs and the RGW instances all run on my three Ceph hosts, while the Traefik instance runs on a non-Ceph host.
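
For completeness, this is roughly how that placement can be verified on the Ceph side. A small sketch, assuming a cephadm-managed cluster (adjust accordingly for Rook or similar):

```
# List all Ceph daemons (including the two radosgw instances) and their hosts
ceph orch ps
# Show the three HDD-backed OSDs and which host each one lives on
ceph osd tree
```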

With that out of the way, let’s have a look at the problem. Here is the transmission chart for my RGWs while copying the aforementioned 60 GB from an S3 bucket to my desktop:

A screenshot of a Grafana time series chart. It shows the bytes sent by my Ceph S3 setup. The graph starts at around 0. At 19:55, it starts going up, until it reaches the maximum of about 27 MB/s around 20:03. The plot hovers around that value until about 20:41, when it slowly goes back to zero.

Throughput of my RGW cluster during the transfer.

As I’ve said above, 27 MB/s isn’t exactly good throughput in my setup. Under ideal circumstances it should be topping out at four times that, as my HDDs should be able to do 120 MB/s at most.

Looking around, I first thought that the disks of my Ceph nodes were fully utilized by a mere 25 MB/s read. But that wasn’t the case. My next thought was the network on the Ceph nodes, but that also topped out at about 200 Mbit/s. The last thing I checked was the CPU utilization on the Ceph nodes running the RGW instances. My thinking being: Perhaps the load of the OSD for disk access combined with the RGW load was too much? But that also wasn’t it. Max CPU load was around 25%.
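
The same checks can also be done directly on the nodes instead of in Grafana. A quick sketch, assuming the usual sysstat tools are installed:

```
# Run on each Ceph node while the copy is in progress
iostat -dx 5      # per-disk utilization, watch the %util column
sar -n DEV 5      # per-interface RX/TX throughput
mpstat -P ALL 5   # per-core CPU usage, including %sys and %soft (softirq)
```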

Then I had an idea: All the traffic needed to go through the Traefik I’m using as my k8s Ingress. So what about the machine running that? And it turned out that the CPU for that machine was nearly fully loaded during the entire copy process:

A screenshot of a Grafana time series chart. It shows the CPU utilization during the copy process, starting at around 12% utilization until 20:00 and then increasing rapidly to 84%. It then stays around that value until about 20:43, when it returns to the previous 12%. The graph also shows the different types of CPU utilization. Throughout the entire duration, about 30% of the CPU is taken up by softirq, another 30% by sys and only 17% by user usage.

CPU utilization on the Pi 4 running Traefik during the copy operation.

The poor Pi 4 is pushed far beyond its capabilities here, it seems. To confirm that this was really down to CPU power, I repeated the test. The overall setup stayed the same, except that this time I scheduled the Traefik Ingress Pod on one of my Ceph hosts, which has a 12th Gen Intel i3-12100T. In contrast to the Pi, it was able to do about 93 MB/s, and instead of over 40 minutes, it only needed 12 minutes for the same copy operation:

And another screenshot of a Grafana time series plot. It again shows the throughput per second of my RGW cluster. It again starts near zero, and slowly goes up starting at 12:21, before it reaches its maximum of about 93 MB/s at 12:24. It keeps that throughput until 12:32, when it slowly starts going down again, hitting near-zero again at 12:34.

Throughput of my RGW cluster during the copy operation, with Traefik running on a beefier machine.
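
The rescheduling itself is nothing special. One way to do it is a plain nodeSelector patch; the namespace, deployment name and hostname below are placeholders for my actual setup:

```
# Pin the Traefik Deployment to the i3-12100T node (all names are placeholders)
kubectl -n traefik patch deployment traefik --type merge \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"kubernetes.io/hostname":"ceph-node-1"}}}}}'
```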

And these 93 MB/s are probably not the max that my RGW cluster can push, as I think that the Traefik Ingress is again holding it back. Only this time not due to CPU, but rather due to networking:

Another screenshot of a Grafana time series chart. This time, it shows the receiving and transmitting network traffic for the host running Traefik. It shows that for the whole transfer, about 924 Mbit/s are transmitted, and about 700 Mbit/s are received.

Network utilization of the host running Traefik. Left Y axis is receiving, right Y axis is sending.

While there is still some breathing room on the 1 Gbit/s connection in the RX direction, with only 700 Mbit/s used, the TX direction is pretty much full, with 924 Mbit/s. I believe there are likely still a few MB/s to be had for the S3 copy, because the host here is also a Ceph host. Besides sending data from Traefik to my desktop, it is also sending data from its Ceph OSD to the RGW running on the other host. And in the RX direction, it is receiving both the other RGW’s responses for Traefik and object reads from the remote OSDs for its local RGW, which likely make up the bulk of the 700 Mbit/s of incoming data.
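
As a sanity check, here is a rough back-of-the-envelope, assuming Traefik splits requests evenly across the two RGWs and reads are spread evenly across the three OSDs (both are assumptions, not measurements):

```
# TX from the Traefik/Ceph host:
#   Traefik -> desktop:         93 MB/s * 8            ≈ 744 Mbit/s
#   local OSD -> remote RGW:    1/3 of ~46 MB/s * 8    ≈ 125 Mbit/s
# RX on the Traefik/Ceph host:
#   remote RGW -> Traefik:      ~46 MB/s * 8           ≈ 370 Mbit/s
#   remote OSDs -> local RGW:   2/3 of ~46 MB/s * 8    ≈ 250 Mbit/s
```

That lands at roughly 870 Mbit/s out and 620 Mbit/s in, which is in the right ballpark of the observed 924 and 700 Mbit/s once protocol overhead and an uneven request split are accounted for.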

Sadly, I don’t have any powerful hosts which are not also Ceph hosts to test the theory. Everything else in the Homelab is currently either a Pi 4 or an old Pentium N3710.

Future hardware thoughts

I’ve been thinking about upgrading the 8 Raspberry Pi CM4 8 GB modules which form the main compute in my Homelab ever since my issues with my control plane nodes last year. The Pi 4 is now over 6 years old, and it wasn’t a performance beast to begin with. But the thing is: Besides very specific instances like what I described above or my control plane issues, the performance of the Pi 4 is still perfectly fine for everything I’m doing. Could my Grafana dashboards load a bit faster during the first load of the day? Sure. Could the Nextcloud UI be a bit more performant? Yes. But it really isn’t that bad.

Still, I think it’s time to at least consider an update. There are three basic options I’m currently seeing:

  1. Do a 1:1 replacement, replacing all of the Pi CM4 with Pi CM5
  2. Expand the RAM in the Ceph nodes and get rid of the 8 CM4
  3. Replace the Pi CM4 with three or four SFF machines

Switching to the Pi 5

By now, the Raspberry Pi CM5 has been released, and the Turing Pi 2 board, which I have my CM4 in at the moment, might support the CM5 out of the box. Might being the important word here. They had a blog post back in December which showed a Turing Pi 2 board with CM5, but that blog post has now vanished. I looked at their Discord, and it seems the general support wasn’t quite as good as that blog post made it sound. In particular, flashing the CM5 apparently wasn’t working properly yet. That wouldn’t matter to me too much - my worker nodes do netboot anyway. So let’s put this option into a maybe.

Provided that the CM5 actually works in the Turing Pi 2 boards, this would be the lowest-effort approach. I would just take out all of the CM4 and replace them with CM5. Looking at my trusted Pi dealer BerryBase, each CM5 in the no-eMMC, no-WiFi/Bluetooth variant with 8 GB of RAM would cost me 86,90 €, for a total of 695,20 €. Add to that some passive heatsinks for around 50 bucks total, and the entire upgrade would cost me about 750 € shipped.

Total Effort: Minimal
Total Costs: About 750 €

Moving the entire Homelab onto the three current Ceph hosts

First, what would I need to replace the 8 CM4? I don’t think CPU is that much of a bottleneck. Most of the time, my Homelab’s combined CPUs are about 87% idle. More interesting is that I would need 8x8 GB = 64 GB of RAM. Sure, I would probably need a bit less, because I wouldn’t need all the foundational services eight times, but let’s still use the 64 GB as a ballpark.

I’ve currently got three hosts running my Ceph cluster. First an Odroid H3. I would like to get rid of this one to be honest, as it’s not really fit for purpose. It has two SATA and SATA power connectors, and that’s it. So in this scheme, it would definitely need to be replaced entirely. Before looking at replacements, let’s look at the other two hosts.

The next one is my old home server, with an AMD A10-9700e 3 GHz CPU and 16 GB of RAM. The board only supports up to 32 GB. Not great, but also not terrible. Looking around, a 32 GB kit would be somewhere around 270 €. For my future readers: it’s the beginning of 2026, and LLM data centers are currently eating hardware like there’s no tomorrow. I’d need a 32 GB kit instead of a 16 GB kit because the board only has two RAM slots.

The last host is the newest, it’s running an Intel i3-12100T and already has 32 GB of RAM. Another 32 GB should be fine here, so that would be another 270 €.

Then there’s the replacement for the Odroid H3. The following is just a quick search, I could likely get it for cheaper.

| Part | Price |
| --- | --- |
| Intel Core Ultra 5 225T | 250 € |
| ASUS PRIME B860-PLUS | 158 € |
| GSkill Flare DDR5-6000 2x16 GB | 328 € |
| Noctua NH-L9x65 CPU Cooler | 60 € |
| beQuiet! Pure Power 13M 650W | 101 € |
| Kingston 512 GB NVMe SSD | 147 € |
| Total | 1044 € |

As I said, mostly quick and dirty. I would have loved to calculate with 64 GB, but those DDR5 RAM prices are certainly something else.

So with this, I would end up with the same amount of RAM as before.

The effort would be moderate. I would need to build the new machine, and then migrate the Ceph OSDs over to it from the H3. I could then of course leave the H3 running as well and use that as a worker?
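
For the OSD migration itself, the rough shape would be: drain the OSD on the H3, enroll the new machine, and create a fresh OSD on its disk. A sketch, assuming a cephadm-managed cluster and that I recreate the OSD on the new box rather than moving the disk with its data intact; the OSD ID, hostname and device path are placeholders:

```
ceph orch osd rm 2                                # drain and remove the OSD on the H3
ceph orch host add new-ceph-host                  # enroll the replacement machine
ceph orch daemon add osd new-ceph-host:/dev/sda   # create the OSD on its HDD
```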

But the cost would be quite significant, compared to the CM5 replacement.

Total Effort: Moderate
Total Costs: 270 € + 270 € + 1044 € = 1584 €

It would be interesting to see what this option would do with my Homelab’s power consumption. I’m currently at around 150 W. Replacing the H3 with a beefier machine would increase the consumption, but at the same time, removing the 8 CM4 is also going to do something. And by how much would the power consumption of the two current Ceph hosts increase when they also need to run workloads besides Ceph?

One issue I would see here: The ability to reboot machines. I only ended up with so many CM4 because I wanted the ability to reboot any physical host without having to take down the Homelab. With only three machines providing both the Ceph cluster and the rest of my workloads, could I still take one of them down?

Moving to SFF machines for my workers

The last, and to me most interesting option: Do a bigger change. Replace the CM4 with SFF PCs, probably at least three of them. Here, again, the main point is that I want to retain the ability to restart any physical host without having to take down the entire Homelab beforehand.

The main draw of this setup would be the room to experiment. Instead of adding the bare hosts to my k8s cluster, I’d want to install Incus and work with VMs, mostly so that I’ve got an easy way to spin up experimental VMs without having to run them on my desktop. I very much enjoyed the time when I was running VMs on my old home server while doing the k8s migration.
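
As a taste of what the day-to-day would look like, spinning up a throwaway VM with Incus is essentially a one-liner. A sketch with placeholder names and resource limits:

```
# Launch a small Debian VM for experiments (image, name and limits are placeholders)
incus launch images:debian/12 scratch-vm --vm -c limits.cpu=2 -c limits.memory=4GiB
incus exec scratch-vm -- bash
```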

The costs are a bit unclear; from some quick searching, I can tell I would need to do a lot more thinking and research to see what I can get at which price. I’m also a bit worried about both the power consumption and the noise levels. I will likely ask around a bit on the Fediverse and see what other Homelabbers have to say on those topics.

One big question with this option would be whether I would keep my three Pi 5 boards, which currently serve as k8s control plane nodes and run the Ceph MONs. I could put one on each of the SFF PCs in a separate VM, for example.

Conclusions

🤷 I’m honestly unsure at the moment. While I don’t want to spend insane amounts of money, the above cost estimates fall rather comfortably into the “eh, I can live with that” bracket.

I don’t think I will ultimately end up with option 1. Especially while playing around with Tinkerbell, I again got a bit annoyed with the idiosyncrasies of ARM SBCs. I really want something conforming to established standards for the next hardware iteration.

Option 2 would feel to me a bit too much like putting too many eggs into too few baskets. I’d gotten quite used to having my storage on separate hosts. But then again, looking at the CPU utilization of those hosts, I’m wasting a lot of compute by not running more things on them.

Option 3 is my current favorite, to be honest. I’d love to bring something new into the Homelab with Incus. Plus, it would definitely introduce some interesting challenges when it comes to my automated Homelab host OS update Ansible playbook. 😁

We shall see. For now, the current Homelab is still fine. Plus, I’m also planning a networking upgrade that will likely happen first. But it was interesting to think about, at least.