Unusual traffic in Prometheus nodes.


Uvais Ibrahim

Jul 27, 2023, 8:14:25 AM
to Prometheus Users
Hi,

Since last night, my Prometheus EC2 servers have been receiving unusually high traffic. When I check in Prometheus, I can see the metric scrape_samples_scraped with an increased value but without any labels. What could be the reason?


Thanks,
Uvais Ibrahim



Brian Candler

Jul 27, 2023, 8:36:10 AM
to Prometheus Users
scrape_samples_scraped always has the labels which prometheus itself adds (i.e. job and instance).
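
For example (made-up target values), an individual series typically looks something like this:

    scrape_samples_scraped{instance="10.0.0.5:9100", job="node"}  750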

Extraordinary claims require extraordinary evidence. Are you saying that the PromQL query scrape_samples_scraped{job="",instance=""} returns a result?  If so, what's the number?  What do you mean by "with increased size" - increased as compared to what? And what version of prometheus are you running?

In any case, what you see with scrape_samples_scraped may be completely unrelated to the "high traffic" issue.  Is your prometheus server exposed to the Internet? Maybe someone is accessing it remotely.  Even if not, you can use packet capture to work out where the traffic is going to and from.  A tool like https://www.sniffnet.net/ may be helpful.
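
If the extra traffic does turn out to be HTTP requests hitting Prometheus itself, its own metrics may show which endpoints are being hit. As a rough starting point (just a sketch; the exact handler label values depend on your setup):

    topk(10, sum by (handler, code) (rate(prometheus_http_requests_total[5m])))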

Uvais Ibrahim

Jul 27, 2023, 10:51:24 AM
to Prometheus Users
Hi Brian,

This is the query that I have used.

sum(scrape_samples_scraped)without(app,app_kubernetes_io_managed_by,clusterName,release,environment,instance,job,k8s_cluster,kubernetes_name,kubernetes_namespace,ou,app_kubernetes_io_component,app_kubernetes_io_name,app_kubernetes_io_version,kustomize_toolkit_fluxcd_io_name,kustomize_toolkit_fluxcd_io_namespace,application,name,role,app_kubernetes_io_instance,app_kubernetes_io_part_of,control_plane,beta_kubernetes_io_arch,beta_kubernetes_io_instance_type, beta_kubernetes_io_os, failure_domain_beta_kubernetes_io_region, failure_domain_beta_kubernetes_io_zone,kubernetes_io_arch, kubernetes_io_hostname, kubernetes_io_os, node_kubernetes_io_instance_type, nodegroup, topology_kubernetes_io_region, topology_kubernetes_io_zone,chart,heritage,revised,transit,component,namespace, pod_name, pod_template_hash, security_istio_io_tlsMode, service_istio_io_canonical_name, service_istio_io_canonical_revision,k8s_app,kubernetes_io_cluster_service,kubernetes_io_name,route_reflector)

This simply excludes every label, but I am still getting a result like this:

{}  7525871918


It shouldn't return any results, right?

Prometheus version: 2.36.2

By increased traffic I mean that the Prometheus servers have been receiving high traffic since a specific point in time. Currently Prometheus is receiving around 13 million packets, whereas earlier it was around 2 to 3 million packets on average. And the Prometheus endpoint is not public.

Stuart Clark

Jul 27, 2023, 11:23:14 AM
to Uvais Ibrahim, Prometheus Users
On 27/07/2023 15:51, Uvais Ibrahim wrote:
> Hi Brian,
>
> This is the query that I have used.
>
> sum(scrape_samples_scraped)without(app,app_kubernetes_io_managed_by,clusterName,release,environment,instance,job,k8s_cluster,kubernetes_name,kubernetes_namespace,ou,app_kubernetes_io_component,app_kubernetes_io_name,app_kubernetes_io_version,kustomize_toolkit_fluxcd_io_name,kustomize_toolkit_fluxcd_io_namespace,application,name,role,app_kubernetes_io_instance,app_kubernetes_io_part_of,control_plane,beta_kubernetes_io_arch,beta_kubernetes_io_instance_type,
> beta_kubernetes_io_os, failure_domain_beta_kubernetes_io_region,
> failure_domain_beta_kubernetes_io_zone,kubernetes_io_arch,
> kubernetes_io_hostname, kubernetes_io_os,
> node_kubernetes_io_instance_type, nodegroup,
> topology_kubernetes_io_region,
> topology_kubernetes_io_zone,chart,heritage,revised,transit,component,namespace,
> pod_name, pod_template_hash, security_istio_io_tlsMode,
> service_istio_io_canonical_name,
> service_istio_io_canonical_revision,k8s_app,kubernetes_io_cluster_service,kubernetes_io_name,route_reflector)
>
> Which simply excluded every label but still I am getting a result like
> this
>
> {}  7525871918
>
I'm not sure what you are expecting, as that sounds about right. The query is adding together all the different variants of the scrape_samples_scraped metric (removing all the different labels), so if that is indeed a list of every label, the query is going to return a value without any associated labels.

Instead, you want to graph the raw scrape_samples_scraped metric (no sum or without) and see how it varies over time. Is there a particular job or target which has a huge increase in the graph, or are new series appearing? (See the example query after the list below.) As to why that might happen, there could be many different reasons, but ideas include:

* a new version of some software which increases the number of exposed metrics (or adds more granular labels)
* a bug in software where a label is set to something with high cardinality (e.g. a "path" label from a web app, which has potentially infinite cardinality - a web scan could have produced millions of combinations)
* lots of changes to the targets, such as new instances of software or high churn of applications restarting
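
As a rough way to spot which target grew, something like this (just a sketch; the offset is arbitrary and assumes your retention still covers the period before the jump) compares each target's current sample count with a week ago:

    topk(10, scrape_samples_scraped - scrape_samples_scraped offset 1w)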

--
Stuart Clark

Brian Candler

Jul 27, 2023, 11:50:49 AM
to Prometheus Users
As Stuart says, that looks correct, assuming your metrics don't have any labels other than the ones you've excluded. You'd save a lot of typing just by doing:

    sum(scrape_samples_scraped)

which is expected to return a single value, with no labels (as it's summed across all timeseries of this metric).

The value 7,525,871,918 does seem quite high - what was it before? You can set an evaluation time for this query in the PromQL expression browser, or draw a graph of this expression over time, to see the historical values.
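
Or, as a quick one-off, compare against an earlier point directly in PromQL (the offset is arbitrary - pick a time from before the traffic jump):

    sum(scrape_samples_scraped offset 1w)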

You could also look at
    count(scrape_samples_scraped)

or more simply
    count(up)

and see if that has jumped up: it would imply that lots more targets have been added (e.g. more pods are being monitored).
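
The same offset trick works for spotting a jump in the number of targets, for example:

    count(up) - count(up offset 1d)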

If not, then as well as Stuart's suggestion of graphing "scrape_samples_scraped" by itself to see if one particular target is generating way more metrics than usual, you could try different summary variants like

sum by (instance,job) (scrape_samples_scraped)
sum by (clusterName) (scrape_samples_scraped)
... etc

and see if there's a spike in any of these.  This may help you drill down to the offending item(s).

Ben Kochie

Jul 28, 2023, 4:53:00 AM
to Brian Candler, Prometheus Users
That's over 7 billion metrics, which would require approximately 30-50 TiB of RAM.
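
(As a rough back-of-the-envelope check: at the commonly cited few KiB of RAM per active series, 7.5 billion series x roughly 4-7 KiB each works out to about 30-50 TiB.)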


Brian Candler

Jul 28, 2023, 5:28:58 AM
to Prometheus Users
Another query to try:
    topk(10, scrape_samples_scraped)