While working on the logging setup for my Grafana, Loki and CloudNativePG deployments, I found that there were some things I disliked about my original k8s logging setup, which I described here.

This is the start of a new kind of post, where I try to keep the reading time reasonably short. Whenever I prefix a post with “Bit Sized:”, you can expect a short one. I’m trying to wean myself off the incredibly long, meandering posts I seem to keep putting out.

Hindsight: Well, at least I stayed under 10 minutes.

The trigger: CloudNativePG logs

I’m running CloudNativePG for my database needs in the new k8s cluster. Setting up Grafana for my metrics stack was the first time I deployed a production database with it. So obviously, I needed to get the logs properly ingested. To see what we’re dealing with, here are the first few log lines from the Postgres Pod in a CloudNativePG deployment:

{"level":"info","ts":"2024-03-29T10:15:37Z","msg":"Starting workers","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","worker count":1}
{"level":"info","ts":"2024-03-29T10:15:37Z","logger":"postgres","msg":"2024-03-29 10:15:37.836 UTC [23] LOG:  redirecting log output to logging collector process","pipe":"stderr","logging_pod":"grafana-pg-cluster-1"}
{"level":"info","ts":"2024-03-29T10:15:37Z","logger":"postgres","msg":"record","logging_pod":"grafana-pg-cluster-1","record":{"log_time":"2024-03-29 10:15:37.836 UTC","process_id":"23","session_id":"660694c9.17","session_line_num":"3","session_start_time":"2024-03-29 10:15:37 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"listening on IPv4 address \"0.0.0.0\", port 5432","backend_type":"postmaster","query_id":"0"}}

All three are JSON, which is nice already, meaning I won’t need any regexes.

The first line is the easiest to handle; it is produced by the management part of the CloudNativePG Pod. The second is already a bit more complicated, because it contains a full Postgres-formatted log message. From that line, I would at least like to extract the log level (LOG in this case) as well as the timestamp.

The third line is then the most complicated, because it contains the Postgres log in the record key, now in JSON format as well.

But the above is not actually how those logs arrive in my FluentD instance. First of all, because I use cri-o, the first log line would look like this in the actual log file:

2024-03-29T11:15:37.627916025+01:00 stderr F {"level":"info","ts":"2024-03-29T10:15:37Z","msg":"Starting workers","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","worker count":1}

The prefix <time> <stream> <_p> <log> is added by cri-o to each log line a container produces, and I think this might be a standardized format mandated for CRI implementations? To ingest these logs and send them on to FluentD, I’m using Fluentbit with its Kubernetes plugin. My mistake here was to leave the option Merge_Log enabled. If the content of <log> is a JSON string, the Kubernetes plugin takes that JSON object and adds its keys to the record directly. This means that by the time one of the log lines above arrives at my FluentD instance, it would look like this:

{[...]
 record="{\"log_time\"=>\"2024-03-29 22:55:27.814\" [...]
}

This is not JSON; it’s FluentD’s nested key format. The issue with this is that FluentD does not seem to have any good tools for operating on an entire nested key. So I can’t do something specific with the record key here - I can only access specific subkeys, e.g. record['record']['log_time'], at least as long as I don’t want to resort to Ruby. And I really, really don’t want to.

Time!

This led me to another issue: While scrolling through some logs in my Grafana dashboard, I saw that pretty much all of them were prefixed with a time key. Which looks odd, because time should not appear in a log record’s keys. It turns out that this is the record key into which Fluentbit’s CRI parser puts the <time> field from the cri-o log line! And because I generally don’t do anything with that key, preferring to use the timestamp from the actual application log, I had accidentally left it in, polluting the log lines with unnecessary information.

But, and here comes the real issue: I couldn’t just change this behavior. The cri-o multiline parser is embedded into Fluentbit and not configurable. So even though Fluentbit lets you configure whether a parser keeps the time field of the parsed log or not, it did not allow me to do so here.
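
For a regular parser, that knob is the Time_Keep flag in the parser definition. A minimal sketch of what that normally looks like - the parser name and time key here are made up for illustration:

[PARSER]
    Name        my_json_parser
    Format      json
    Time_Key    time
    # Drop the original time field from the record once it has been parsed
    Time_Keep   Off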

Next possibility: Just drop the time key in Fluentbit, before even sending the log on to FluentD. But this also wasn’t possible, because of the Merge_Log config of the Fluentbit Kubernetes plugin. Some apps, e.g. Traefik in my setup, produce JSON log lines that themselves contain a field called time. Consequently, when a Traefik log line runs through the Kubernetes plugin, it ends up with its own time field, not the one from the cri-o parser, at the top level of the log record. Removing it would mean deleting Traefik’s own timestamp, which I did not want.

Instead of investigating further, I decided to switch Merge_Log off. This way, Fluentbit never touches the actual log line produced by the app. It only parses the CRI parts and slaps on some Kubernetes Pod labels, then sends the entire enchilada on to FluentD for proper massaging.
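
In the Fluentbit config, that boils down to a single option on the Kubernetes filter. A rough sketch - the Match pattern is an assumption, and I’m leaving out the rest of the filter’s options:

[FILTER]
    Name          kubernetes
    Match         kube.*
    # Do not merge JSON app logs into the record - leave them untouched in the log key
    Merge_Log     Off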

This has the downside that I now don’t get “parsed” JSON logs for free, but quite honestly, it isn’t that much work to add something like this for the apps which produce JSON-formatted logs:

<filter services.monitoring.grafana>
    @type parser
    key_name log
    reserve_data true
    remove_key_name_field true
    <parse>
        @type json
        time_key t
        time_type string
        time_format %iso8601
        utc true
    </parse>
</filter>

This FluentD config only touches the log key of the incoming record: it tries to parse it as JSON, adds all of the JSON object’s keys to the record, and parses the time from the t key, dropping that key afterwards.

I find this setup more comforting, because I can be sure that I will always find the app’s original log output in the log key and can then do whatever parsing I want in FluentD, leaving Fluentbit mostly as a log forwarder/shipper.

I still had to drop the time key from all incoming logs, but now I could be sure that I was only removing the <time> part from the CRI log file entry, keeping the actual app log’s time key.
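
The drop itself is a small filter in FluentD, roughly like this - the match pattern is an assumption based on my tagging scheme:

<filter services.**>
    @type record_transformer
    # Remove the time key added by Fluentbit's CRI parsing
    remove_keys time
</filter>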

Parsing CloudNativePG logs

Finally back to the CloudNativePG logs. The first step of the parsing process is to parse the JSON log from the log key:

<filter services.*.postgres>
    @type parser
    key_name log
    reserve_data true
    remove_key_name_field true
    <parse>
        @type json
        time_key nil
    </parse>
</filter>

I’m ignoring the time from that log, for now.

Then I have to broadly split the records into two categories: the logs coming from the cnpg management plane, and the actual Postgres logs. That’s done as follows:

<match services.*.postgres>
  @type rewrite_tag_filter
  <rule>
    key record
    pattern /^.+$/
    tag pg-record.${tag}
    label @PGRECORD
  </rule>
  <rule>
    key record
    pattern /^.+$/
    tag pg-no-record.${tag}
    invert true
    label @PGNORECORD
  </rule>
</match>

This checks whether the record (after parsing of log) has a record key, and sends the records to different labels.

First comes the PGRECORD label, which handles log lines that look like this:

{"level":"info","ts":"2024-03-29T10:15:37Z","logger":"postgres","msg":"record","logging_pod":"grafana-pg-cluster-1","record":{"log_time":"2024-03-29 10:15:37.836 UTC","process_id":"23","session_id":"660694c9.17","session_line_num":"3","session_start_time":"2024-03-29 10:15:37 UTC","transaction_id":"0","error_severity":"LOG","sql_state_code":"00000","message":"listening on IPv4 address \"0.0.0.0\", port 5432","backend_type":"postmaster","query_id":"0"}}

They contain the JSON formatted Postgres log in the record key.

<label @PGRECORD>
  <filter **>
    @type record_transformer
    remove_keys record
    enable_ruby true
    <record>
      json_record ${record["record"].to_json}
    </record>
  </filter>
  <filter **>
    @type parser
    key_name json_record
    reserve_data true
    remove_key_name_field true
    <parse>
      @type json
      time_key nil
    </parse>
  </filter>
  <filter **>
    @type record_transformer
    remove_keys log_time,message,error_severity
    <record>
      ts ${record["log_time"]}
      level ${record["error_severity"]}
      msg ${record["message"]}
    </record>
  </filter>
  <match pg-record.**>
    @type relabel
    @label @PGNORECORD
  </match>
</label>

Okay, so let’s take it one step at a time. First, I need to transform the content of the record key back into JSON - because in the initial parsing of the log key, it was turned into a FluentD nested key structure. Then I parse that JSON again. The effect of this is that all the keys from the nested record object are now keys of the record itself, and the record field has been removed.

Then I’m rewriting the names of a couple of keys. I do this so they’re similar to the log lines produced by the cnpg management app. After the Postgres log records pass through this pipeline, all the logs coming from the Postgres Pod, regardless of whether they come from Postgres itself or the management plane, look approximately the same.
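
To illustrate: after this pipeline, the record from the example log line shown above comes out looking roughly like this. This is a hand-written approximation, with the Kubernetes metadata and a few of the record fields left out:

{"level":"LOG","ts":"2024-03-29 10:15:37.836 UTC","logger":"postgres","logging_pod":"grafana-pg-cluster-1","msg":"listening on IPv4 address \"0.0.0.0\", port 5432","process_id":"23","backend_type":"postmaster",[...]}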

The unified logs are then further massaged by this pipeline:

<label @PGNORECORD>
  <filter pg-no-record.**>
    @type parser
    key_name msg
    reserve_data true
    remove_key_name_field false
    <parse>
      @type multi_format
      <pattern>
        format regexp
        expression /^(?<ts>[0-9\-]* [0-9\:\.]* [^\ ]+) [^\ ]* (?<pglvl>[^\ ]*) (?<msg>.*)$/
        time_key nil
      </pattern>
      <pattern>
        format regexp
        expression /^(?<msg>.*)$/
        time_key nil
      </pattern>
    </parse>
  </filter>
  <filter **>
    @type parser
    key_name ts
    reserve_data true
    remove_key_name_field true
    <parse>
      @type multi_format
      <pattern>
        format regexp
        expression /^(?<logtime>[0-9\-]* [0-9\:\.]* [^\ ]+)$/
        time_key logtime
        time_type string
        time_format %F %T.%N %Z
      </pattern>
      <pattern>
        format regexp
        expression /^(?<logtime>[0-9]{4}-[01][0-9]-[0-3][0-9]T[0-2][0-9]:[0-6][0-9]:[0-6][0-9].*)$/
        time_key logtime
        time_type string
        time_format %iso8601
        utc true
      </pattern>
    </parse>
  </filter>
  <match **>
    @type rewrite_tag_filter
    remove_tag_regexp /^pg-(no-)?record\./
    <rule>
      key msg
      pattern /^.+$/
      tag parsed.${tag}
      label @K8S
    </rule>
  </match>
</label>

Again, going step by step. The first filter config takes the msg key and parses it, trying two regex parsers in order. The first one recognizes Postgres log lines in string format, like this one:

2024-03-29 10:15:37.836 UTC [23] LOG:  redirecting log output to logging collector process

It properly parses the timestamp and level, which are the two most important parts for me. All other content of the msg key is left unparsed and just written back to the msg key as-is.

Finally, I’m parsing the timestamp ts, which looks different now, depending on whether the log line came from Postgres or from cnpg. For Postgres, the timestamp looks like this: 2024-03-29 10:15:37.836 UTC, while for cnpg it is a properly formatted ISO8601 date/time.

As the last step, I rewrite the tag to start with parsed, which indicates in my config that the record should skip all further parsing pipelines.

Endless loops - Fluentbit edition

Before ending for today, one thing deserves an honorable mention: I built another endless loop! 🎉

This time, just by enabling Fluentbit’s debug log verbosity. At that level, it produces a log entry for every message it forwards to FluentD - including its own:

component=output:forward:forward.0 msg="send options records=0 chunk=''

The one advantage of having my rack right next to my desk: I heard the fans crank up the RPMs right away. 😅

Now you know why I was talking about Eldritch Horrors and the log pipeline being the most complicated part of my Homelab. I had actually planned to just tidy up the monitoring stack migration a bit over the long Easter weekend and write the blog post, so I could get started with the next k8s migration step this week. Instead, this is what I decided to spend my long weekend on. 🤦