Hello!
I'm testing Prometheus on both a "test" Hadoop cluster and one of our "small" HPC clusters (so far using node exporter). The possibilities have me dreaming. For HPC jobs (parallel batch jobs submitted via a scheduler, which may run for hours or days), what would be the best way to gather both cumulative and individual host metrics for the node(s) a job runs on? For example, if Job A is running on nodes 1, 2, and 3, I want to see the metrics for nodes 1, 2, and 3 only for the lifespan of Job A on those nodes. I'm continuing to read and learn about Prometheus to see whether this scenario is possible, but I was hoping someone could point me in the right direction. Thanks!