Advice on where/how to write a new niche-ish blackbox exporter probe?

Stewart Webb

Dec 14, 2024, 2:51:59 AM

to Prometheus Users

Hi folks,

At my workplace we have been working recently with Fluentbit and its Fluent Forward protocol, which is an efficient binary msgpack protocol for transmitting string or key-value logs with timestamps - see https://github.com/fluent/fluentd/wiki/Forward-Protocol-Specification-v1.5 .

I've been hoping to set up a blackbox-style status check for Fluent Forward endpoints. However, after some research, it turns out the current blackbox_exporter repo can't cater to this need, because none of its probing modules will work for this. The tcp module almost gets me there, but its request/response probing support only works for newline-delimited protocols like SMTP. See https://github.com/nuclearpidgeon/fluent-forward-blackbox-testing for a writeup I've done on the details.

Now, I can fork blackbox_exporter and add my own prober for the Fluent Forward protocol, but maintaining a fork would be a pain, and I also feel like the main blackbox_exporter project probably wouldn't really want to host and maintain an extra prober type that's so specific to one particular program/protocol, given the currently supported protocols are really generic ones like HTTP, ICMP, DNS, and gRPC.

My current approach is to strip a fork of blackbox_exporter down to a bare shell that has none of the existing generic probers in it, and then add my one new prober back into that shell. The plan is to publish this as a standalone project that does what we need without me having to worry much about pulling in upstream updates.

In theory the TCP module in blackbox_exporter could alternatively be updated to support specifying binary-protocol request/response probing, but I feel like a generic feature like that might be a lot harder to build and get right, especially if any semantic parsing of the response is required.

If anyone here has some advice for my mild dilemma here, it'd be great to hear from you. (and if any maintainers of blackbox_exporter are around here it'd be really great to hear from you...)

Cheers,

Stewart Webb

Brian Candler

Dec 14, 2024, 12:15:13 PM

to Prometheus Users

If you want to minimize your work, you can write a test as a one-shot standalone program in any language of your choice, and either:

1. Run it from cron, write the results to a file, and pick them up by node_exporter textfile collector; OR

2. Run it on demand from exporter_exporter using the "exec" method; OR

3. Run it as a nagios plugin under nrped, and query it from nrpe_exporter

Options 1 and 2 involve generating openmetrics output, which is very simple. Option 2 is good if you want the check to be triggered on every scrape. Option 1 is good if the check is expensive and you don't want to overload the target.

I wouldn't choose option 3 unless you're already deep into nagios plugins. There are issues with the ancient crypto used to talk to nrped.
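For options 1 and 2, the one-shot program could look roughly like the sketch below. It only checks that a TCP connection can be opened and prints the result in the Prometheus text format. The metric names and the `renderMetrics` helper are made up for illustration, and the default address assumes the usual Forward port 24224 on localhost; a real check would go further than a bare connect.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

// renderMetrics formats a probe result in the Prometheus text exposition
// format. Both metric names are hypothetical, chosen for this example.
func renderMetrics(success bool, seconds float64) string {
	up := 0
	if success {
		up = 1
	}
	return fmt.Sprintf(
		"# TYPE fluent_forward_probe_success gauge\n"+
			"fluent_forward_probe_success %d\n"+
			"# TYPE fluent_forward_probe_duration_seconds gauge\n"+
			"fluent_forward_probe_duration_seconds %g\n",
		up, seconds)
}

func main() {
	// Assumed default target; pass host:port as the first argument to override.
	addr := "127.0.0.1:24224"
	if len(os.Args) > 1 {
		addr = os.Args[1]
	}

	start := time.Now()
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err == nil {
		conn.Close()
	}

	// Always print metrics, so that a failed probe is recorded as 0
	// instead of showing up as missing data.
	fmt.Print(renderMetrics(err == nil, time.Since(start).Seconds()))
}
```

For option 1 you would redirect the output to a temporary file inside the textfile collector directory and rename it to a name ending in `.prom`, so node_exporter never reads a half-written file.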

Chris Siebenmann

Dec 14, 2024, 2:23:46 PM

to Brian Candler, Prometheus Users, Chris Siebenmann

> If you want to minimize your work, you can write a test as a one-shot
> standalone program in any language of your choice, and either:
> 1. Run it from cron, write the results to a file, and pick them up by
> node_exporter textfile collector; OR
> 2. Run it on demand from exporter_exporter
> <https://github.com/QubitProducts/exporter_exporter> using the "exec"
> method; OR
> 3. Run it as a nagios plugin under nrped, and query it from nrpe_exporter
> <https://www.robustperception.io/nagios-nrpe-prometheus-exporter/>

Another 'run a program and provide its output as metrics' option is the
third-party script exporter,
https://github.com/ricoberger/script_exporter

The basic usage of the script exporter is very similar to the blackbox
exporter, but of course you have to start a program every time. We've
been happily using it for years for a variety of checks that require
more sophistication (and finer-grained metrics) than the Blackbox
exporter can handle.

(Another 'run it from cron' option is to have it push metrics into a
Pushgateway instance, but my view is that generally you want to use the
node_exporter textfile collector for that if it's possible. Pushgateway
usually has various drawbacks compared to the node_exporter approach.)

- cks

Matthias Rampke

Dec 16, 2024, 3:52:16 AM

to Chris Siebenmann, Brian Candler, Prometheus Users

Given the main function of fluentbit is to accept a message and send it elsewhere, what failure modes can and cannot be covered by the request-and-response style of the blackbox exporter? In other words, how far would this protocol support get you – or how much more could you do with an exporter that is specific to the system, and can e.g. receive the message back from a fluentbit output?

As you point out, there's a limitation to how generic an exporter can be and still be sensibly configurable, or at what point the configuration becomes just as complex as writing the code. Putting aside whether this protocol is widely used enough to justify adding support to the exporter, I wonder how much value you would get from that before you run into limitations from the fundamental model of the generic exporter.

/MR

Stewart Webb

Dec 25, 2024, 6:03:00 PM

to Prometheus Users

The main thing I'm hoping to track with this exporter at the moment is the availability of the endpoint that log data gets sent to, i.e. we want it recorded in Prometheus history if it goes down. We have a cluster of Fluentbit instances sitting behind an AWS NLB, which might help explain the motivation a bit further.

I might reconsider the "just run a script" option, but it may depend on whether the probing source can be set up to do that (my team doesn't run/maintain that particular service). Standard Prometheus HTTP scraping is definitely readily available, which is the main reason that's been my focus so far.

As per my writeup repo:

> [...] the [Forward] protocol specifies a "chunk" option that can be used to get the server to respond to a batch of messages sent up to it. This forms the basis of the testing performed in this repo to see if the ack can be caught in a way to get a more reliable availability check probe.
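To make the "chunk"/"ack" exchange a bit more concrete, here is a rough sketch of the bytes involved, with msgpack encoded by hand so it has no dependencies. It assumes a Message Mode entry of the form `[tag, time, record, option]` with an integer timestamp, and an ack of the form `{"ack": <chunk>}`, as I read the Forward spec. The tag, record contents, chunk id and all function names are invented for the example, and it only handles strings shorter than 256 bytes.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// packStr encodes s as a msgpack fixstr (< 32 bytes) or str8 (< 256 bytes).
func packStr(s string) []byte {
	switch {
	case len(s) < 32:
		return append([]byte{0xa0 | byte(len(s))}, s...)
	case len(s) < 256:
		return append([]byte{0xd9, byte(len(s))}, s...)
	}
	panic("packStr: string too long for this sketch")
}

// buildMessage encodes [tag, time, {"message": "probe"}, {"chunk": chunk}].
func buildMessage(tag string, ts uint32, chunk string) []byte {
	b := []byte{0x94} // fixarray with 4 elements
	b = append(b, packStr(tag)...)
	b = append(b, 0xce, byte(ts>>24), byte(ts>>16), byte(ts>>8), byte(ts)) // uint32
	b = append(b, 0x81) // record: fixmap with 1 entry
	b = append(b, packStr("message")...)
	b = append(b, packStr("probe")...)
	b = append(b, 0x81) // option: fixmap with 1 entry
	b = append(b, packStr("chunk")...)
	return append(b, packStr(chunk)...)
}

// readStr decodes one fixstr or str8 from the front of b.
func readStr(b []byte) (string, []byte, bool) {
	if len(b) == 0 {
		return "", nil, false
	}
	n, hdr := 0, 0
	switch {
	case b[0]&0xe0 == 0xa0: // fixstr
		n, hdr = int(b[0]&0x1f), 1
	case b[0] == 0xd9 && len(b) >= 2: // str8
		n, hdr = int(b[1]), 2
	default:
		return "", nil, false
	}
	if len(b) < hdr+n {
		return "", nil, false
	}
	return string(b[hdr : hdr+n]), b[hdr+n:], true
}

// parseAck returns the chunk id from a complete {"ack": "<id>"} response.
func parseAck(resp []byte) (string, bool) {
	if len(resp) == 0 || resp[0] != 0x81 {
		return "", false
	}
	key, rest, ok := readStr(resp[1:])
	if !ok || key != "ack" {
		return "", false
	}
	val, _, ok := readStr(rest)
	return val, ok
}

// probe sends one message and reports whether a matching ack came back
// before the timeout. It is not called from main below.
func probe(addr, tag, chunk string, timeout time.Duration) (bool, error) {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false, err
	}
	defer conn.Close()
	conn.SetDeadline(time.Now().Add(timeout))
	if _, err := conn.Write(buildMessage(tag, uint32(time.Now().Unix()), chunk)); err != nil {
		return false, err
	}
	var resp []byte
	buf := make([]byte, 64)
	for {
		n, err := conn.Read(buf)
		resp = append(resp, buf[:n]...)
		if got, ok := parseAck(resp); ok {
			return got == chunk, nil
		}
		if err != nil {
			return false, err
		}
	}
}

func main() {
	chunk := "cHJvYmUtY2h1bmstMDAwMQ==" // hypothetical chunk id
	fmt.Printf("% x\n", buildMessage("probe.test", 0, chunk))
	ack, ok := parseAck(append([]byte{0x81}, append(packStr("ack"), packStr(chunk)...)...))
	fmt.Println(ack, ok)
}
```

`main` only exercises the encoding and decoding offline; a real check would call something like `probe("host:24224", "probe.test", chunk, 5*time.Second)`. Note that the server treats the probe message as a real log record, so the tag would need to be routed or dropped accordingly.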

The main issue is that to catch this "ack", the prober needs to understand the binary msgpack response, which isn't newline-delimited, and the blackbox_exporter tcp probe is essentially hard-coded to work with newline-delimited protocols/responses:

> The [blackbox_exporter] tcp probe's response checking uses a Go bufio.Scanner (see https://github.com/prometheus/blackbox_exporter/blob/v0.25.0/prober/tcp.go#L135) seemingly with defaults, which means it will only ever be able to work with newline-separated chunks of bytes.

Fluentbit does come with an HTTP endpoint that can provide a status check (https://docs.fluentbit.io/manual/administration/monitoring), but this runs on a different port, which we'd have to route through the NLB separately. That starts to defeat the point of validating that the ingress traffic flow is working as expected, because it's essentially testing something else at that point.

In terms of Fluentbit sending the message elsewhere, the details of this are a config and policy thing (which, yes, would be valuable to test and track too, but will be a lot more involved as it will probably involve checking one or more destination services as well).