Wherein I update my container image build pipeline in Woodpecker with Buildah.
A couple of weekends ago, I naively thought: Hey, how about stepping away from my Tinkerbell experiments for a weekend and quickly setting up a Bookwyrm instance?
As such things tend to turn out, that rookie move turned into a rather deep rabbit hole, mostly on account of my container image build pipeline not really being up to snuff.
The current setup
Before going into details on the problem and ultimate solution, I’d like to sketch out my setup. For a detailed view, have a look at this post.
I’m running Woodpecker CI in my Kubernetes cluster and building container images via the docker-buildx plugin.
As I’m running Woodpecker with the Kubernetes backend, each step in a pipeline is executed in its own Pod. Each pipeline, in turn, gets a PersistentVolume mounted, which is shared between all steps of that pipeline. In my pipelines for the container image builds, I only run the docker-buildx plugin as a step: once for PRs, where the image is only built but not pushed, and once for pushes onto main, where the image is built and pushed.
The docker-buildx plugin uses Docker’s buildx command and the BuildKit instance it makes available to run the image build. Important to note for this post: BuildKit will happily build multi-arch images, and it does so using QEMU for the non-native architectures.
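For reference, a manual multi-arch build with buildx and QEMU looks roughly like this; this is just the underlying mechanism, not my pipeline configuration:

docker buildx build --platform linux/amd64,linux/arm64 -t testing:0.1 --push .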
Now the issue with that is: the majority of my Homelab consists of Raspberry Pi 4s and a single low-power x86 machine. As you might imagine, that makes emulation very slow, especially on the Pis, which do not have any virtualization instructions.
Now onto the problems I’m having with that setup.
The problems
Let’s start with the problem which triggered this particular rabbit hole, the Bookwyrm image build. I won’t go into the details of the image here, that will come in the next post when I describe the Bookwyrm setup.
The initial issue was one I had seen before on occasion. In this scenario, the build just gets canceled, with no indication of what went wrong in the Woodpecker logs for the build step. After quite a lot of digging, I finally found these lines in the logs of the machine running one of the failed CI Pods:
kubelet[1088]: I0728 21:07:42.763129 1088 eviction_manager.go:366] "Eviction manager: attempting to reclaim" resourceName="ephemeral-storage"
kubelet[1088]: I0728 21:07:42.763296 1088 container_gc.go:88] "Attempting to delete unused containers"
kubelet[1088]: I0728 21:07:43.131475 1088 image_gc_manager.go:404] "Attempting to delete unused images"
kubelet[1088]: I0728 21:07:43.172539 1088 eviction_manager.go:377] "Eviction manager: must evict pod(s) to reclaim" resourceName="ephemeral-storage"
kubelet[1088]: I0728 21:07:43.174677 1088 eviction_manager.go:395] "Eviction manager: pods ranked for eviction" pods=["woodpecker/wp-01k194yzh8bg8tzngrf7x6w3k4","monitoring/grafana-pg-cluster-1","harbor/harbor-pg-cluster-1","harbor/harbor-registry-5cb6c944f5-wm6np","wallabag/wallabag-679f44d9d5-9gl8m","harbor/harbor-portal-578db97949-d52sp","forgejo/forgejo-74948996b9-r94c2","harbor/harbor-jobservice-6cb7fc6d4b-gsswv","harbor/harbor-core-6569d4f449-grtrr","woodpecker/woodpecker-agent-1","taskd/taskd-6f9699f5f4-qkjkr","kube-system/cilium-5tx4t","fluentbit/fluentbit-fluent-bit-frskm","rook-ceph/csi-cephfsplugin-8f4jh","rook-ceph/csi-rbdplugin-cnxfz","kube-system/cilium-envoy-gx7ck"]
crio[780]: time="2025-07-28 21:07:43.179344359+02:00" level=info msg="Stopping container: 7ba324965ba9ed751bd08ac4b464631b2d5dfa05d31f36d98253b68a0d5ec7d0 (timeout: 30s)" id=b69f9664-c0ae-4505-9363-6966afa90b77 name=/runtime.v1.RuntimeService/StopContainer
crio[780]: time="2025-07-28 21:07:43.837431719+02:00" level=info msg="Stopped container 7ba324965ba9ed751bd08ac4b464631b2d5dfa05d31f36d98253b68a0d5ec7d0: woodpecker/wp-01k194yzh8bg8tzngrf7x6w3k4/wp-01k194yzh8bg8tzngrf7x6w3k4" id=b69f9664-c0ae-4505-9363-6966afa90b77 name=/runtime.v1.RuntimeService/StopContainer
kubelet[1088]: I0728 21:07:44.097018 1088 eviction_manager.go:616] "Eviction manager: pod is evicted successfully" pod="woodpecker/wp-01k194yzh8bg8tzngrf7x6w3k4"
The Pod just ran out of space while building the images. The fix was relatively simple, as Woodpecker already provides a Pipeline Volume. In the case of the Kubernetes backend, that volume is a PVC created per pipeline and then mounted into the Pods for all of the steps. In my case, that’s a 50 GB CephFS volume. But I wasn’t using that volume for anything, as the storage for BuildKit, which runs my image builds, was still at the default /var/lib/docker.
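Each pipeline volume shows up as a regular PVC in the woodpecker namespace, so its size and storage class are easy to check:

kubectl -n woodpecker get pvc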
So hooray, just move the Docker storage to the pipeline volume. I did so by using the parameter the docker-buildx plugin already provides, storage_path:

storage_path: "/woodpecker/docker-storage"
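To verify that the builds really use the pipeline volume afterwards, exec’ing into a running step Pod and checking the mount is enough; a sketch, with the Pod name being whatever Woodpecker generated for the step:

kubectl -n woodpecker exec -it <step-pod> -- df -h /woodpecker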
And just like that, I had fixed the problem. Or not.
21 minutes and running for a libsass build.
So much for that all too short moment of triumph. The storage issue was fixed, but the image still could not be built. Looking through previous runs, I saw that the issue wasn’t just the duration of the pip install, but also the initial pull of the Python image. In one of the test builds, the initial pull took over 50 minutes all on its own. Not much time left for the actual setup. The root cause was at least not I/O saturation. The CI run I was looking at ran from 22:25 to 23:25 in the below graph:
I/O utilization on the HDDs in my Ceph cluster, home of the CephFS data pool.
But I still had the feeling that storage was at least part of the problem. So I tried to use Ceph RBDs instead of CephFS, which also had the advantage of running on SATA SSDs instead of HDDs. But that also did not bring any real improvements. Sure, the build got a lot further and did not spend all its time just extracting the Python image, but it still didn’t finish within the 1h deadline.
I finally ended up figuring that the reason it was still timing out was emulation.
Removing emulation from my image build pipelines
As I’ve mentioned above, the docker-buildx Woodpecker plugin I was using relies on Docker’s BuildKit under the hood. BuildKit can do multi-arch builds out of the box, using QEMU for the non-native architectures. This gets pretty slow on a Raspberry Pi or a low-power x86 machine. So my next plan was to run the builds for all architectures in parallel, each on a host of the matching architecture.
BuildKit and docker-buildx already have support for doing this via remote builders. But as per the docker-buildx documentation, this can only be done via SSH. I initially thought that this would work with BuildKit daemons set up to receive external connections, but I was mistaken. Instead of using BuildKit’s built-in remote driver functionality, docker-buildx sets up normal builders whose connection strings point to the remote machines for which SSH was configured. BuildKit would then use those remote machines’ Docker sockets to run the builds.
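For illustration, the SSH-based approach would have meant registering the remote machines as builder nodes and pointing the build at that builder, roughly like this (host names made up):

docker buildx create --name homelab-builders --node amd64 ssh://ci@amd64-builder.example.com
docker buildx create --name homelab-builders --append --node arm64 ssh://ci@arm64-builder.example.com
docker buildx build --builder homelab-builders --platform linux/amd64,linux/arm64 -t testing:0.1 .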
After some thinking, I decided to dump docker-buildx altogether. I really didn’t like the idea of somehow setting up inter-Pod SSH connections. That just felt all kinds of wrong.
So I decided: I’ll just do it myself, using Buildah. I’ve had that on my list anyway, so here we go, a bit earlier than planned. Some inspiration for what follows was found in this blog post. It uses Tekton as the task engine, not Woodpecker, but still was a good starting point. It was especially useful for answering how to put together the images produced for different architectures in one manifest.
I started out by building the image for Buildah. The Containerfile ended up looking like this:
ARG alpine_ver
FROM alpine:$alpine_ver
RUN apk --no-cache update \
    && apk --no-cache add buildah netavark iptables bash jq
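I built and pushed it to my Harbor instance with something like the following, login omitted:

buildah build -t harbor.example.com/buildah/buildah:latest --build-arg alpine_ver=3.22.1 .
buildah push harbor.example.com/buildah/buildah:latest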
I then set up a simple test project in Woodpecker:
- name: build amd64 image
  image: harbor.example.com/buildah/buildah:latest
  commands:
    - buildah build -t testing:0.1 --build-arg alpine_ver=3.22.1 -f testing/Containerfile testing/
  depends_on: []
  backend_options:
    kubernetes:
      nodeSelector:
        kubernetes.io/arch: "amd64"
The Containerfile looked something like this:
ARG alpine_ver
FROM alpine:$alpine_ver
RUN apk --no-cache update \
    && apk --no-cache add buildah
Basically, a copy of my Buildah image, just to have something to test.
One thing which surprised me: Woodpecker doesn’t actually allow setting a platform per step. So I got lucky that the Kubernetes backend lets me specify the nodeSelector for the step’s Pod.
Right away, the first run produced the following error:
Error: error writing "0 0 4294967295\n" to /proc/16/uid_map: write /proc/16/uid_map: operation not permittedtime="2025-08-07T20:31:45Z" level=error msg="writing \"0 0 4294967295\\n\" to /proc/16/uid_map: write /proc/16/uid_map: operation not permitted"
Clearly, my dream of rootless image builds would not be fulfilled today, so I needed to allow the project to run privileged pipelines. Up to now, I had the docker-buildx plugin in a separate instance-wide list of privileged plugins. But my new container was, at this point, a simple step, not a plugin.
So my first step was to make my own user an admin; I had never needed admin privileges in Woodpecker before, so my user wasn’t one yet. This I did via the WOODPECKER_ADMIN environment variable in my values.yaml file for the Woodpecker chart:
server:
  env:
    WOODPECKER_ADMIN: "my-user"
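Rolling that out is a normal chart upgrade; release name and chart reference here are placeholders matching my setup:

helm upgrade woodpecker <woodpecker-chart> --namespace woodpecker --values values.yaml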
After that, the trusted project settings appeared in the Woodpecker settings page:
Trusted settings in the project configuration of Woodpecker. The options under the ‘Trusted’ heading only show up for admin users.
The Security option allowed me to run the Buildah containers in privileged mode, by adding the privileged: true option to the step.
The next error I got was this one:
Error: 'overlay' is not supported over overlayfs, a mount_program is required: backing file system is unsupported for this graph driver
time="2025-08-07T20:57:11Z" level=warning msg="failed to shutdown storage: \"'overlay' is not supported over overlayfs, a mount_program is required: backing file system is unsupported for this graph driver\""
At this point, my pipeline volume was still on a Ceph RBD, as I had not yet realized
that, with the plan of running multiple Buildah steps for the different platforms
in parallel, I would need RWX volumes for the pipelines. So I decided that the
right solution would be to move the storage onto my pipeline volume, where before
it just sat in the container’s own filesystem, leading to the above “OverlayFS on OverlayFS” error. I did this by adding --root /woodpecker
to the Buildah command.
And then I got the next one:
STEP 1/2: FROM alpine:3.22.1
Error: creating build container: could not find "netavark" in one of [/usr/local/libexec/podman /usr/local/lib/podman /usr/libexec/podman /usr/lib/podman]. To resolve this error, set the helper_binaries_dir key in the `[engine]` section of containers.conf to the directory containing your helper binaries.
This was fixed rather easily by adding netavark
to the Buildah image. I had a
similar error next, about iptables
not being available. So I installed that
one as well.
But that wasn’t all. Oh no, here’s another error:
buildah --root /woodpecker build -t testing:0.1 --build-arg alpine_ver=3.22.1 -f testing/Containerfile testing/
STEP 1/2: FROM alpine:3.22.1
WARNING: image platform (linux/arm64/v8) does not match the expected platform (linux/amd64)
STEP 2/2: RUN apk --no-cache update && apk --no-cache add buildah
exec container process `/bin/sh`: Exec format error
Error: building at STEP "RUN apk --no-cache update && apk --no-cache add buildah": while running runtime: exit status 1
That one confused me a little bit, to be honest. It wasn’t difficult to fix, I just
had to add the --platform linux/amd64
option to the Buildah command. What
confused me was that Buildah didn’t somehow figure that out for itself.
And this was the point where I realized that my two CI steps, one for amd64, one
for arm64, did not run in parallel. The one started only after the other had failed.
One kubectl describe -n woodpecker pods wp-...
later, I saw that that was
because the Pod which launched second failed to mount the pipeline volume. And
that in turn was because I had switched to an SSD-backed Ceph RBD for the volume to improve speed. But RBDs are, by their nature as block devices, RWO (ReadWriteOnce) volumes and cannot be mounted by multiple Pods at the same time.
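Whether a volume is RWO or RWX can be checked directly on the PVC, e.g.:

kubectl -n woodpecker get pvc -o custom-columns=NAME:.metadata.name,ACCESSMODES:.spec.accessModes,STORAGECLASS:.spec.storageClassName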
I switched the volumes back to CephFS and was met with the same error I had seen previously and “fixed” by moving Buildah’s storage onto the pipeline volume:
time="2025-08-07T21:56:14Z" level=error msg="'overlay' is not supported over <unknown> at \"/woodpecker/overlay\""
Error: kernel does not support overlay fs: 'overlay' is not supported over <unknown> at "/woodpecker/overlay": backing file system is unsupported for this graph driver
time="2025-08-07T21:56:14Z" level=warning msg="failed to shutdown storage: \"kernel does not support overlay fs: 'overlay' is not supported over <unknown> at \\\"/woodpecker/overlay\\\": backing file system is unsupported for this graph driver\""
I’m not sure why it said “unknown”, but the filesystem was CephFS. After some searching, I found out that OverlayFS and CephFS are seemingly incompatible. But the issue was fixable by adding --storage-driver=vfs to the Buildah command. The VFS driver is a bit older than OverlayFS and noticeably slower, since it stores every layer as a full copy instead of overlaying them. But at least it works on CephFS.
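A quick way to confirm which storage driver and root Buildah actually ends up using is buildah info with the same global flags:

buildah --root /woodpecker --storage-driver=vfs info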
And believe it or not, that was the last error. After adding the --storage-driver option, the build ran through cleanly. At this point, my Woodpecker workflow looked like this:
when:
  - event: push
    path:
      - '.woodpecker/testing.yaml'
      - 'testing/*'

variables:
  - &alpine-version '3.22.1'

steps:
  - name: build amd64 image
    image: harbor.example.com/homelab/buildah:0.4
    commands:
      - buildah --root /woodpecker build --storage-driver=vfs --platform linux/amd64 -t testing:0.1 --build-arg alpine_ver=3.22.1 -f testing/Containerfile testing/
    depends_on: []
    privileged: true
    backend_options:
      kubernetes:
        nodeSelector:
          kubernetes.io/arch: "amd64"
    when:
      - evaluate: 'CI_COMMIT_BRANCH != CI_REPO_DEFAULT_BRANCH'
  - name: build arm64 image
    image: harbor.example.com/homelab/buildah:0.4
    commands:
      - buildah --root /woodpecker build --storage-driver=vfs --platform linux/arm64 -t testing:0.1 --build-arg alpine_ver=3.22.1 -f testing/Containerfile testing/
    depends_on: []
    privileged: true
    backend_options:
      kubernetes:
        nodeSelector:
          kubernetes.io/arch: "arm64"
    when:
      - evaluate: 'CI_COMMIT_BRANCH != CI_REPO_DEFAULT_BRANCH'
  - name: push image
    image: harbor.example.com/homelab/buildah:0.4
    commands:
      - sleep 10000
    depends_on: ["build amd64 image", "build arm64 image"]
    privileged: true
    when:
      - evaluate: 'CI_COMMIT_BRANCH != CI_REPO_DEFAULT_BRANCH'
With this configuration, the two builds for amd64 and arm64 are run in parallel,
and the final push image
step would be responsible for combining the
images into a single manifest and pushing it all to my Harbor instance.
I ran a test build and then exec’d into the Pod when the pipeline arrived at the push image step. I used the following commands to combine the per-arch images into one manifest and push it up to Harbor:
buildah --root /woodpecker --storage-driver=vfs manifest create harbor.example.com/homelab/testing:0.1
buildah --root /woodpecker --storage-driver=vfs manifest add harbor.example.com/homelab/testing:0.1 3883d7a9067d
buildah --root /woodpecker --storage-driver=vfs manifest add harbor.example.com/homelab/testing:0.1 0130169db3bb
buildah login https://harbor.example.com
buildah --root /woodpecker --storage-driver=vfs manifest push harbor.example.com/homelab/testing:0.1 docker://harbor.example.com/homelab/testing:0.1
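The image IDs used in the manifest add calls can be looked up by listing the contents of the shared storage, e.g.:

buildah --root /woodpecker --storage-driver=vfs images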
The problematic thing about this approach was that I had no way of knowing the correct values for the image names in the manifest add commands; in this example I used the image hashes. I could of course set separate names for the images, e.g. with the platform in the name. But then I would have to remember to do that every time I create a new pipeline.
Instead, I decided to go one step further and check how painful it would be to turn my simple command-based steps into a Woodpecker plugin.
Building a Woodpecker plugin
And it turns out: it isn’t complicated at all. The docs for new Woodpecker plugins are rather short and sweet. Plugins need to be containerized, and they need to have their program set as the entrypoint in the image. And that’s it. Any options given in the step are forwarded to the step container via environment variables, so there’s nothing special to be done at all.
That was good news, as I was a bit afraid I would have to write some Go. But no, just pure bash was enough.
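To give an idea of what that looks like, here is a minimal sketch of the plugin image; the script name is just for illustration:

ARG alpine_ver
FROM alpine:$alpine_ver
RUN apk --no-cache update \
    && apk --no-cache add buildah netavark iptables bash jq
COPY plugin.sh /usr/local/bin/plugin.sh
ENTRYPOINT ["/usr/local/bin/plugin.sh"]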
In its final form, my pipeline for the testing image looks like this:
when:
  - event: push
    path:
      - '.woodpecker/testing.yaml'
      - 'testing/*'

variables:
  - &alpine-version '3.22.1'
  - &image-version '0.2'
  - &buildah-config
    type: build
    context: testing/
    containerfile: testing/Containerfile
    build_args:
      alpine_ver: *alpine-version

steps:
  - name: build amd64 image
    image: harbor.example.com/homelab/woodpecker-plugin-buildah:latest
    settings:
      <<: *buildah-config
      platform: linux/amd64
    depends_on: []
    backend_options:
      kubernetes:
        nodeSelector:
          kubernetes.io/arch: "amd64"
    when:
      - evaluate: 'CI_COMMIT_BRANCH != CI_REPO_DEFAULT_BRANCH'
  - name: build arm64 image
    image: harbor.example.com/homelab/woodpecker-plugin-buildah:latest
    settings:
      <<: *buildah-config
      platform: linux/arm64
      type: build
    depends_on: []
    backend_options:
      kubernetes:
        nodeSelector:
          kubernetes.io/arch: "arm64"
    when:
      - evaluate: 'CI_COMMIT_BRANCH != CI_REPO_DEFAULT_BRANCH'
  - name: push image
    image: harbor.example.com/homelab/woodpecker-plugin-buildah:latest
    settings:
      type: push
      manifest_platforms:
        - "linux/arm64"
        - "linux/amd64"
      tags:
        - latest
        - 1.5
      repo: harbor.example.com/homelab/testing
      username: ci
      password:
        from_secret: container-registry
    depends_on: ["build amd64 image", "build arm64 image"]
    privileged: true
    when:
      - evaluate: 'CI_COMMIT_BRANCH != CI_REPO_DEFAULT_BRANCH'
When a Woodpecker plugin is launched, it gets all of the values under settings: handed in as environment variables. A normal key/value pair like type: push appears as PLUGIN_TYPE="push" in the plugin’s container. Lists like tags or manifest_platforms appear as comma-separated values, e.g. PLUGIN_TAGS="latest,1.5". Objects are a bit more complicated: they are handed over as JSON, e.g. PLUGIN_BUILD_ARGS='{"alpine_ver": "3.22.1"}'.
First, there is a bit of a preamble in the script, to check whether required config options have been set and Buildah is available:
DATA_ROOT="/woodpecker"
if ! command -v buildah; then
  echo "buildah not found, exiting."
  exit 1
fi
if [[ -z "${PLUGIN_TYPE}" ]]; then
  echo "PLUGIN_TYPE not set, exiting."
  exit 1
fi
Then, depending on the PLUGIN_TYPE variable, either the build or the push function is executed, which either builds the image for a single platform or combines multiple platforms into a single manifest and pushes it all to the given registry:
if [[ "${PLUGIN_TYPE}" == "build" ]]; then
echo "Running build..."
build || exit $?
elif [[ "${PLUGIN_TYPE}" == "push" ]]; then
echo "Running push..."
push || exit $?
else
echo "Unknown type ${PLUGIN_TYPE}, exiting"
exit 1
fi
exit 0
And here is the build
function:
build() {
  if [[ -z "${PLUGIN_CONTEXT}" ]]; then
    echo "PLUGIN_CONTEXT not set, aborting."
    return 1
  fi
  if [[ -z "${PLUGIN_PLATFORM}" ]]; then
    echo "PLUGIN_PLATFORM not set, aborting."
    return 1
  fi
  if [[ -z "${PLUGIN_CONTAINERFILE}" ]]; then
    echo "PLUGIN_CONTAINERFILE not set, aborting."
    return 1
  fi
  if [[ -n "${PLUGIN_BUILD_ARGS}" ]]; then
    BUILD_ARGS=$(get_build_args "${PLUGIN_BUILD_ARGS}")
  fi
  command="buildah \
    --root ${DATA_ROOT} \
    build \
    --storage-driver=vfs \
    --platform ${PLUGIN_PLATFORM} \
    -t ${PLUGIN_PLATFORM}:0.0 \
    ${BUILD_ARGS} \
    -f ${PLUGIN_CONTAINERFILE} \
    ${PLUGIN_CONTEXT} \
  "
  echo "Running command: ${command}"
  ${command}
  return $?
}
It again starts out with some checks to make sure the required variables are set.
Then it runs the buildah build
command as in the previous setup with the manual
command. The one “special” thing I’m doing here is that I tag the new image with
the PLUGIN_PLATFORM
variable and the :0.0
version. The storage for the builders
is entirely temporary, so I will never have multiple versions in the storage,
and this allows me to make the names of the images predictable in the later
push
step. So at the end of the function’s run, I would have images linux/amd64:0.0
and linux/arm64:0.0
in the same storage.
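The get_build_args helper isn’t shown above; since PLUGIN_BUILD_ARGS arrives as the JSON object described earlier and jq is part of the image, a minimal version could look something like this (a sketch, not necessarily the exact implementation):

get_build_args() {
  # Turn {"alpine_ver": "3.22.1"} into: --build-arg alpine_ver=3.22.1
  echo "$1" | jq -r 'to_entries | map("--build-arg \(.key)=\(.value)") | join(" ")'
}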
Which then brings us to the push
function:
push() {
  if [[ -z "${PLUGIN_REPO}" ]]; then
    echo "PLUGIN_REPO not set, aborting."
    return 1
  fi
  if [[ -z "${PLUGIN_TAGS}" ]]; then
    echo "PLUGIN_TAGS not set, aborting."
    return 1
  else
    TAGS=$(echo "${PLUGIN_TAGS}" | tr ',' ' ')
  fi
  if [[ -z "${PLUGIN_MANIFEST_PLATFORMS}" ]]; then
    echo "PLUGIN_MANIFEST_PLATFORMS not set, aborting."
    return 1
  else
    PLATFORMS=$(echo "${PLUGIN_MANIFEST_PLATFORMS}" | tr ',' ' ')
  fi
  if [[ -z "${PLUGIN_USERNAME}" ]]; then
    echo "PLUGIN_USERNAME not set, aborting."
    return 1
  fi
  if [[ -z "${PLUGIN_PASSWORD}" ]]; then
    echo "PLUGIN_PASSWORD not set, aborting."
    return 1
  fi
  echo "Logging in..."
  buildah login -p "${PLUGIN_PASSWORD}" -u "${PLUGIN_USERNAME}" "${PLUGIN_REPO}" || return 1
  echo "Creating manifest..."
  buildah --root "${DATA_ROOT}" --storage-driver=vfs manifest create newimage || return 1
  for plt in ${PLATFORMS}; do
    echo "Adding platform ${plt}..."
    buildah --root "${DATA_ROOT}" --storage-driver=vfs manifest add newimage "${plt}:0.0" || return 1
  done
  echo "Pushing to registry..."
  for tag in ${TAGS}; do
    buildah --root "${DATA_ROOT}" --storage-driver=vfs manifest push newimage "docker://${PLUGIN_REPO}:${tag}" || return 1
  done
  buildah logout "${PLUGIN_REPO}"
  return 0
}
Here I need to do some more things than in the build step. First is the login,
which is done via buildah login
. Something which slightly annoys me here is
the fact that Buildah only seems to support either interactive input of the
password, or providing it via a CLI flag, but not e.g. via an environment
variable.
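If I’m not mistaken, buildah login also accepts --password-stdin, so a possible workaround would be piping the password in instead of passing it as a flag:

echo "${PLUGIN_PASSWORD}" | buildah login --username "${PLUGIN_USERNAME}" --password-stdin "${PLUGIN_REPO}"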
When the login succeeds, the code iterates over all platforms and adds the $PLATFORM:0.0 image to the new manifest. Once that’s all done, the resulting manifest containing all the required platforms’ images is pushed to the repository given in the repo option for the plugin.
I prefer having a plugin like this, because Woodpecker’s “command form” steps cannot re-use YAML anchors like I was able to do here, so there would have been a lot more repetition in the pipeline setups.
Performance
After I got the plugin working, I started migrating my existing image builds over to it. I started out with my Fluentd image, where I take the official Fluentd image and install a few additional plugins into it before deploying it into my Kubernetes cluster. The Containerfile looks like this:
ARG fluentd_ver
FROM fluent/fluentd:${fluentd_ver}
USER root
RUN ln -s /usr/bin/dpkg-split /usr/sbin/dpkg-split
RUN ln -s /usr/bin/dpkg-deb /usr/sbin/dpkg-deb
RUN ln -s /bin/rm /usr/sbin/rm
RUN ln -s /bin/tar /usr/sbin/tar
RUN buildDeps="sudo make gcc g++ libc-dev" \
    && apt-get update \
    && apt-get install -y --no-install-recommends $buildDeps curl \
    && gem install \
        fluent-plugin-grafana-loki \
        fluent-plugin-record-modifier \
        fluent-plugin-multi-format-parser \
        fluent-plugin-rewrite-tag-filter \
        fluent-plugin-route \
        fluent-plugin-http-healthcheck \
        fluent-plugin-kv-parser \
        fluent-plugin-parser-logfmt \
    && gem sources --clear-all \
    && SUDO_FORCE_REMOVE=yes \
        apt-get purge -y --auto-remove \
        -o APT::AutoRemove::RecommendsImportant=false \
        $buildDeps \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf /tmp/* /var/tmp/* /usr/lib/ruby/gems/*/cache/*.gem
USER fluent
And that’s where I discovered that my performance still wasn’t exactly up to snuff: the Fluentd image build took around 23 minutes, with the lion’s share of 1087s/18 minutes taken by the pull of the Fluentd base image. The problem here again seems to be CephFS and/or the nature of container images on disk, because for a long time, the Ceph cluster was adding 10k objects per 15s interval:
Objects added in a 15s interval to the pools of my Ceph cluster. Orange/top line is my CephFS storage pool.
The CI run produced about 180k new objects in the Ceph storage cluster.
After seeing all of this, I decided that the current setup might not be ideal when it comes to storage. One thought I had was that both builds using the same --root parameter on the shared volume might be part of the problem; perhaps Buildah did some locking of the storage area? So I switched the different platform builds to different directories on the shared volume. That did work somewhat, reducing the duration down to about 15 minutes:
Still with a shared volume, but not with a shared directory on that volume, the builds take less time.
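In the plugin, that change boiled down to deriving a per-platform storage root instead of pointing every build at the same directory, roughly like this sketch:

# e.g. linux/amd64 -> /woodpecker/linux-amd64
DATA_ROOT="/woodpecker/$(echo "${PLUGIN_PLATFORM}" | tr '/' '-')"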
This still seemed pretty long, so I started to consider the creation of a new CephFS with the data pool on SSDs, to hopefully improve the performance. But then I had a thought: How about removing the parallelism entirely? If I were to not run the steps in parallel, I could use a Ceph RBD instead, which would likely already be faster. I also already have a StorageClass for SSD-backed RBDs in my cluster, so no additional config would be necessary. And finally, using a Ceph RBD instead of CephFS, I would be able to use the faster OverlayFS storage driver for Buildah.
So I did all of that, switched the StorageClass for Woodpecker’s pipeline volumes to my SSD RBD class, and then disabled parallelism for the steps. The results were rather impressive:
Both builds done sequentially on an SSD-backed Ceph RBD are faster than the same builds done in parallel on a CephFS volume with the VFS storage driver.
The entire pipeline ran through in about six minutes, less time than the previous setup needed just for pulling down the Fluentd image.
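Disabling the parallelism itself was just a matter of chaining the steps via depends_on instead of giving both builds an empty list, roughly like this in the arm64 step:

  - name: build arm64 image
    # ...settings as before...
    depends_on: ["build amd64 image"]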
Final thoughts
Even with all the weird errors I had to fix and the wrong turns I took, this was fun, and the fact that I ended up without any parallelism was surprising. I really enjoyed working on this one.
There are still a few improvements to be made, and some things to dig into. One burning question I currently have is why the parallelized version, using the VFS storage driver on a CephFS shared volume, was so much slower. Was it mostly the slower VFS storage driver? Or was it CephFS? And if it was CephFS, what was actually the bottleneck? I wasn’t able to find one, neither in I/O utilization, nor network, nor CPU on any of the nodes involved. I checked both the nodes running the Buildah Pods and the Ceph nodes, and none of them showed an overload in any resource. So I’m a bit stumped.
Then there’s also the fact that my Woodpecker steps still need to run in privileged mode. I don’t like that, but I wasn’t able to figure out exactly what to do to remove that requirement. From everything I’ve read, this should be possible with Buildah, but might need some additional configuration on the Kubernetes nodes. I will have to check this in the future.
But for now, finally back to working on setting up a Bookwyrm instance.