Wherein I migrate my Mastodon instance to the k8s cluster.

This is part 21 of my k8s migration series.

Mastodon is currently serving as my presence in the Fediverse. You can find me here, although I’m pretty sure that most of my readers are coming from there already. 😄

If you’re at all interested in joining a genuine community around Homelabbing, I can only recommend joining the fun by following the HomeLab or SelfHosted hashtags and wildly following everyone appearing on there. It’s a great community of rather friendly people enjoying everything from a lonely Pi to several 42U 19" racks full of equipment. If you’re interested in learning more about my own experience with the Fediverse and hosting my own single-user instance, have a look at these older posts.

Preparations

There were two things which needed to be migrated from my Nomad cluster to the k8s deployment: The S3 bucket holding all of the media, and the database.

The database is, by a very large margin, the biggest in my Homelab, clocking in at 2.5 GB. I think it could be a lot smaller, but I completely disabled cleanups for remote posts a while ago. That was because the automated cleanup also deletes posts I had bookmarked for reading later - and since I’m not very good at actually keeping up with those, I eventually went through my bookmarks and became pretty convinced that some of the older ones had already gone missing. I will likely do some manual cleanups once the database really becomes too big to be manageable.

I will not describe the entire migration process here, because it is similar to previous migrations. If you’re interested, have a look at my post about the Gitea migration, where I describe the database migration with CNPG in detail. In short, it was very painless. I provided the database with a 15 GB volume, which seems a bit overboard in hindsight. At some point in the future I will have to figure out how to do database sizing and go through all of my CNPG clusters, because I’m pretty sure most of them are overprovisioned.
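
For illustration, the sizing happens in the CNPG Cluster resource’s storage section. A minimal sketch, assuming the mastodon-pg-cluster name referenced further down and the 15 GB volume mentioned above; everything else is omitted or made up:

# Minimal CNPG Cluster sketch - only the sizing-relevant parts.
# The instance count is an assumption, not necessarily my real setup.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: mastodon-pg-cluster
spec:
  instances: 2
  storage:
    size: 15Gi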

Next came the S3 bucket. The first mistake I made here was forgetting to exclude the cache/ prefix, so I copied all of the currently cached media over instead of just letting Mastodon re-fetch whatever it actually needed. That prefix currently holds 56 GB out of 61 GB total - which reminds me that I need to check whether the automatic cleanup is working on the k8s setup or not. But yeah, if I had remembered to exclude that prefix, I could have saved a lot of time on the copy operation. As it stands, these are the stats for the copy, which I did with rclone:

Transferred:       61.786 GiB / 61.786 GiB, 100%, 6.279 MiB/s, ETA 0s
Transferred:       384921 / 384921, 100%
Elapsed time:    3h7m29.8s

Those 6.279 MiB/s are utterly abysmal. Those of you who read my previous post on my media library copy operation probably already know: It was the 4 TB Seagate HDD, which was fully slammed again. There’s definitely something bad about this disk. But anyway, three hours later I was done and had everything copied over.
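
For future reference, excluding the prefix is a single rclone flag. This is just a sketch with made-up remote names, not the exact command I ran:

# Copy the media bucket, but skip the cache/ prefix so Mastodon
# simply re-fetches remote media on demand after the move.
rclone copy old-ceph:masto-media new-ceph:masto-media \
  --exclude "cache/**" \
  --progress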

Before I close the preparations, let’s have some fun and look at the CPU usage of the FluentD container in my k8s cluster:

A screenshot of a Grafana time series plot. It's showing the CPU usage, given on the Y axis in 'cores', of my FluentD instance over the three hours from 09:55 to 13:10 where the S3 bucket was copied. It hovers at 0.1 in the beginning and end, but goes up to 0.4, with spikes to 0.5 between 09:55 and 13:10, before then going down again.

CPU usage of my FluentD log aggregation container.

Not even the RGW or OSD containers were using more CPU during the copy. The reason seems to be that I’ve still got my ingress Traefik instance set to debug log level:
A screenshot of a Grafana time series plot. It shows the log rate of my Traefik ingress container during the S3 bucket copy. The rate goes from about 1 log entry per second to over 70 per second, where it stays throughout the copy operation, before finally going back to about 1 per second.

Log rate of my Traefik ingress container.

I’m now starting to wonder whether this might be part of the reason why the copy was so slow - the disk might have also been loaded by Loki pushing all of these log lines into its own S3 bucket. 🤦 Sadly, I don’t have precise enough metrics to tell, as I can only see throughput by pool in my Ceph stats, and the Mastodon bucket and the Loki bucket live in the same pool. Something to dig into a little bit later.
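
For reference, turning the log level back down should just be a values change. A sketch assuming the Traefik Helm chart’s logs.general.level key, which may differ depending on chart version:

# Traefik Helm values: dial the ingress logs back down from DEBUG.
logs:
  general:
    level: "INFO"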

The Mastodon setup

I deployed my Mastodon instance with the official Mastodon chart. One important note: this chart is going to be replaced with a new one at some point in the future, see the relevant issue.
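
For completeness, deploying from a local checkout of the chart repository looks roughly like this - release name and namespace are placeholders, not necessarily what I use:

# Assuming a local checkout of github.com/mastodon/chart
helm dependency update .
helm upgrade --install mastodon . \
  --namespace mastodon \
  --create-namespace \
  --values values.yaml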

I won’t go through every single option I set, but there were a couple of things which tripped me up.

The first and perhaps most important one: the default appVersion of the current chart is 4.2.17, but I was already on 4.3.3. The main issue this version discrepancy caused is that Mastodon has since been split into two containers, one for the streaming component and one for everything else. To fix this, I had to explicitly set the streaming image in the values.yaml:

mastodon:
  streaming:
    image:
      repository: "ghcr.io/mastodon/mastodon-streaming"

With that, the chart seems to work for 4.3.3 and 4.3.4 without issues.

Then there’s the Redis configuration. I’ve got a central Redis instance in my cluster instead of running one for every app. The chart does support this, but unless I’ve overlooked something, it requires the Redis instance to have a password, which mine does not. This shows in the mastodon-redis Secret being unconditionally added to each container’s env, for example in the mastodon-web deployment from here:

- name: "REDIS_PASSWORD"
  valueFrom:
    secretKeyRef:
      name: {{ template "mastodon.redis.secretName" . }}
      key: redis-password

There’s no condition around that block checking whether Redis is actually configured with a password. I also tried just setting an empty password in redis.auth.password, but then the Secret is not created by the chart at all, and my containers were left in CreateContainerConfigError state because of the missing Secret. The only way around this I found was to create a dummy Secret with an empty data.redis-password:

apiVersion: v1
kind: Secret
metadata:
  name: masto-redis-mock
  labels:
    homelab/part-of: mastodon
type: Opaque
data:
  redis-password: ""

And then using that Secret in the Helm chart:

redis:
  auth:
    existingSecret: "masto-redis-mock"

With that, the Redis password env variable is set, but to an empty value, which seems to make Mastodon use Redis properly, without adding a password of any kind to the connection string.
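
As an aside, the same dummy Secret can also be created imperatively instead of via a manifest - this is just the kubectl equivalent of the YAML above:

# Creates the same Secret with an empty redis-password key
# (the homelab/part-of label would have to be added separately).
kubectl create secret generic masto-redis-mock \
  --from-literal=redis-password=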

The next noteworthy configuration to be set was the mastodon.trusted_proxy_ip variable. This one needed the source IP of my Traefik ingress, but that doesn’t have a fixed IP, so I needed to add the Pod CIDR:

mastodon:
  trusted_proxy_ip: "300.300.300.1,127.0.0.1,10.8.0.0/16"

Without this setting, I got the following error in the mastodon-web logs:

[05332434-d3d6-40b1-950d-ae73da0d4967] ActionDispatch::RemoteIp::IpSpoofAttackError (IP spoofing attack?! client 10.8.4.103 is not a trusted proxy HTTP_CLIENT_IP=nil HTTP_X_FORWARDED_FOR="67.241.47.40, 10.86.10.10")
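
In case you need to look up the Pod CIDR of your own cluster, the node spec is one place to find it - a generic query, though the exact source of truth depends on your CNI:

# Print each node name and the Pod CIDR assigned to it.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'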

I also decided to switch off the CronJob for media removal:

mastodon:
  cron:
    removeMedia:
      enabled: false

This is because I recently spent quite some time digging into Mastodon’s internal media cleanup process. From what I can see, this CronJob uses the tootctl CLI, specifically the tootctl media remove command. I actually prefer that over the internal Mastodon cleanup, because back when I looked at it, tootctl worked a lot better since it made separate DELETE requests. The one thing that keeps me from using the CronJob is that I can’t configure the retention periods. I might still use it later and just live with the defaults.
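
If I ever do want a different retention period, running tootctl by hand remains an option. A hedged sketch, assuming the chart named the web Deployment mastodon-web:

# Remove locally cached remote media older than 14 days.
# The Deployment name depends on the Helm release; adjust as needed.
kubectl exec -it deployment/mastodon-web -- \
  bin/tootctl media remove --days 14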

And that’s really all I have to say. For completeness’ sake, here is the full values.yaml content:

mastodon:
  labels:
    homelab/part-of: mastodon
  createAdmin:
    enabled: false
  cron:
    removeMedia:
      enabled: false
  local_domain: "social.mei-home.net"
  trusted_proxy_ip: "300.300.300.1,127.0.0.1,10.8.0.0/16"
  singleUserMode: true
  authorizedFetch: false
  limitedFederationMode: false
  s3:
    enabled: true
    existingSecret: "mastodon-bucket"
    bucket: masto-media
    endpoint: "http://rook.service:80"
    alias_host: "s3-mastodon.mei-home.net"
  deepl:
    enabled: false
  hcaptcha:
    enabled: false
  secrets:
    existingSecret: "mastodon-secrets"
  sidekiq:
    resources:
      limits:
        memory: 1024Mi
      requests:
        cpu: 400m
  smtp:
    auth_method: "plain"
    from_address: "Meiers Mastodon <mastodon@mei-home.net>"
    openssl_verify_mode: "peer"
    port: "465"
    server: "mail.example.com"
    tls: false
    existingSecret: "mastodon-mail"
  streaming:
    image:
      repository: "ghcr.io/mastodon/mastodon-streaming"
    resources:
      requests:
        cpu: 500m
      limits:
        memory: 2000Mi
  web:
    resources:
      requests:
        cpu: 500m
      limits:
        memory: 1000Mi
  cacheBuster:
    enabled: false
  metrics:
    statsd:
      exporter:
        enabled: false
  otel:
    enabled: false

  extraEnvVars:
    SMTP_SSL: true
    OIDC_CLIENT_ID: "mastodon"
    OIDC_DISPLAY_NAME: "Login with Keycloak"
    OIDC_ISSUER: "https://login.example.com/realms/example"
    OIDC_DISCOVERY: true
    OIDC_SCOPE: "openid,profile,email"
    OIDC_UID_FIELD: "preferred_username"
    OIDC_REDIRECT_URI: "https://social.mei-home.net/auth/auth/openid_connect/callback"
    OIDC_SECURITY_ASSUME_EMAIL_IS_VERIFIED: true
    OIDC_END_SESSION_ENDPOINT: "https://login.example.com/realms/example/protocol/openid-connect/logout"
    OIDC_ENABLED: true
    OMNIAUTH_ONLY: true
    RAILS_SERVE_STATIC_FILES: true
    S3_BATCH_DELETE_LIMIT: 1
    S3_READ_TIMEOUT: 60
    S3_BATCH_DELETE_RETRY: 10
    ALLOWED_PRIVATE_ADDRESSES: "300.300.300.1"

ingress:
  enabled: true
  annotations:
    external-dns.alpha.kubernetes.io/controller: "none"
  hosts:
    - host: social.mei-home.net
      paths:
        - path: "/"
  tls: null
  streaming:
    enabled: false

elasticsearch:
  enabled: false

postgresql:
  enabled: false
  postgresqlHostname: "mastodon-pg-cluster-rw"
  postgresqlPort: "5432"
  auth:
    database: "mastodon"
    username: "mastodon"
    existingSecret: "mastodon-pg-cluster-app"

redis:
  enabled: false
  hostname: "redis.example"
  port: "6379"
  auth:
    existingSecret: "masto-redis-mock"
  sidekiq:
    enabled: false
  cache:
    enabled: false

Conclusion

To be honest, at some point during that Sunday I began thinking that starting the Mastodon migration on a Sunday morning might have been a mistake, but in the end it worked out well enough.

Now there are only a few services left to migrate over, chief amongst them my Keycloak instance. Let’s see whether I might even be able to clean out the entire cluster during this weekend. There’s definitely a light at the end of the migration tunnel. I guess this weekend will show whether it’s a freight train. 😅