trouble making prom assume role using web token from EKS mount


William Findley

Apr 10, 2020, 8:21:19 PM
to Prometheus Users

I'm having trouble getting ec2 service discovery to work using an IAM role bound to an EKS service account.  Here's what I have.

I have a pod that has successfully had a web identity token projected into it.  I'm fairly confident that there's no problem with this.  I have customers on this EKS cluster that I've set up with IAM roles and Kubernetes service accounts, and they're happily using AWS services.

/prometheus $ ls -la /var/run/secrets/eks.amazonaws.com/serviceaccount
total 0
drwxrwsrwt    3 root     2000           100 Apr 10 17:23 .
drwxr-xr-x    3 root     root            28 Apr 10 17:49 ..
drwxr-sr-x    2 root     2000            60 Apr 10 17:23 ..2020_04_10_17_23_59.145300320
lrwxrwxrwx    1 root     root            31 Apr 10 17:23 ..data -> ..2020_04_10_17_23_59.145300320
lrwxrwxrwx    1 root     root            12 Apr 10 17:23 token -> ..data/token


The information about which role and token to use is exposed in the following env vars:

      AWS_ROLE_ARN:                 arn:aws:iam::2XXXXXXXXXX0:role/prometheus-service-discovery-eks
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
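
For context, a minimal sketch of where those env vars come from, assuming the standard EKS "IAM roles for service accounts" setup: the EKS pod identity webhook injects AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE (and mounts the token) for pods whose service account carries the eks.amazonaws.com/role-arn annotation. The name and namespace here are hypothetical placeholders, not the real ones from my cluster.

```yaml
# Sketch of a service account bound to the IAM role (assumed setup).
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus          # hypothetical name
  namespace: monitoring     # hypothetical namespace
  annotations:
    # Role that the projected web identity token is exchanged for.
    eks.amazonaws.com/role-arn: arn:aws:iam::2XXXXXXXXXX0:role/prometheus-service-discovery-eks
```

The role's trust policy has to allow sts:AssumeRoleWithWebIdentity for this exact namespace and service account name, so it matters which service account the Prometheus pod actually runs under.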

Here's my scrape config.  I'm trying to discover and scrape node exporter on a box that I've tagged with prometheus.io/discover and whose name begins with the prefix I expect.
scrape_configs:
- ec2_sd_configs:
  - filters:
    - name: tag-key
      values:
      - prometheus.io/discover
    role_arn: arn:aws:iam::2XXXXXXXXXX0:role/prometheus-service-discovery-eks
  job_name: service-ec2
  relabel_configs:
  - action: keep
    regex: ^mycoolnameprefix-.*
    source_labels:
    - __meta_ec2_tag_Name
  - replacement: $1:9100
    source_labels:
    - __meta_ec2_private_ip
    target_label: __address__

My assumption, from the docs and from using the latest version of Prometheus and the AWS SDK it depends on, was that it would read these env vars to find the role and assume it.  However, these logs indicate otherwise:

level=debug ts=2020-04-10T21:08:03.271Z caller=manager.go:224 component="discovery manager scrape" msg="Starting provider" provider=*ec2.SDConfig/0 subs=[service-ec2]
level=debug ts=2020-04-10T21:08:03.271Z caller=manager.go:224 component="discovery manager notify" msg="Starting provider" provider=string/0 subs=[config-0]
level=info ts=2020-04-10T21:08:03.271Z caller=main.go:816 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=debug ts=2020-04-10T21:08:03.271Z caller=manager.go:242 component="discovery manager notify" msg="discoverer channel closed" provider=string/0
level=error ts=2020-04-10T21:08:03.493Z caller=refresh.go:79 component="discovery manager scrape" discovery=ec2 msg="Unable to refresh target groups" err="could not describe instances: WebIdentityErr: failed to retrieve credentials\ncaused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus code: 403, request id: 3317a2e2-5357-4535-9b53-085209fdfb5c"
level=error ts=2020-04-10T21:09:03.502Z caller=refresh.go:98 component="discovery manager scrape" discovery=ec2 msg="Unable to refresh target groups" err="could not describe instances: WebIdentityErr: failed to retrieve credentials\ncaused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus code: 403, request id: 455fddb6-9b42-449b-b603-d7f453923a7b"

Any tips on where I might have gone wrong?  I made the best effort I could to follow the existing documentation, but I don't feel like it's telling me everything I need to know.


William Findley

Apr 13, 2020, 6:53:38 PM
to Prometheus Users
On the off chance, I fired up a pod with an AWS CLI container under the service account I'm using, and it was able to make the ec2:DescribeInstances API call just fine.  I'm not sure how to track down what's happening here.  Maybe I've run into a bug?

Matthias Rampke

Apr 14, 2020, 7:54:48 AM
to William Findley, Prometheus Users
This is a bit of a guess (I haven't dug into the code to confirm it) – what happens if you remove the role from the SD config and only pass it through the environment? I can imagine that the explicit configuration causes us to not look at the environment in the same way. My hope is that by not passing any authentication information in the Prometheus config we fall back to the default SDK behaviour.

/MR

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/67f41258-7d11-44aa-92b2-43e60b58a616%40googlegroups.com.

William Findley

Apr 14, 2020, 8:05:19 AM
to Matthias Rampke, Prometheus Users
I actually ran an AWS CLI container under the service account again and managed to replicate the error independently of Prometheus.  Yes, indications are that I can just leave out the role and it'll get set by the environment.

William Findley

Apr 14, 2020, 6:52:21 PM
to Prometheus Users
I fixed my problem.  I was using the wrong service account and my angry eyes failed to notice.  I blame helm for making 2 service accounts that look almost exactly alike.  ;-)  However, you were *also* correct that I needed to declare fewer things to induce the default SDK behavior.  Where would it be appropriate to update the AWS service discovery docs with the advice that the SDK usually picks up the proper things from the environment?  It seems like that bit of guidance should go *someplace*.
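
For anyone who lands here later, a minimal sketch of the trimmed-down config, assuming everything else stays as in my original post: the only change is that role_arn is gone from ec2_sd_configs, so no credentials or role are declared in the Prometheus config and the AWS SDK falls back to its default behavior (picking up AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE from the pod's environment).

```yaml
scrape_configs:
- job_name: service-ec2
  ec2_sd_configs:
  # No role_arn, access_key, secret_key, or profile here:
  # credentials come from the environment via the default SDK behavior.
  - filters:
    - name: tag-key
      values:
      - prometheus.io/discover
  relabel_configs:
  # Keep only instances whose Name tag starts with the expected prefix.
  - action: keep
    regex: ^mycoolnameprefix-.*
    source_labels:
    - __meta_ec2_tag_Name
  # Scrape node exporter on the instance's private IP.
  - replacement: $1:9100
    source_labels:
    - __meta_ec2_private_ip
    target_label: __address__
```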


On Friday, April 10, 2020 at 8:21:19 PM UTC-4, William Findley wrote:

Matthias Rampke

Apr 17, 2020, 12:09:47 PM
to William Findley, Prometheus Users
I think it should (for now) go into the reference for the ec2_sd_config: https://github.com/prometheus/prometheus/blob/master/docs/configuration/configuration.md#ec2_sd_config

Right now it doesn't explain anything about authentication, so I think a general paragraph on that would be helpful.

/MR
