I'm having trouble getting EC2 service discovery to work using an IAM role bound to an EKS service account. Here's what I have.
I have a pod that has successfully had a web identity token projected into it, and I'm fairly confident there's no problem with that part: I have customers on this EKS cluster that I've set up with IAM roles and Kubernetes service accounts, and they're happily using services.
The information about which role and token to use is exposed through the following env vars:
AWS_ROLE_ARN: arn:aws:iam::2XXXXXXXXXX0:role/prometheus-service-discovery-eks
AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
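To double-check the token itself, a minimal sketch like the following could be run inside the pod to print the claims that STS will see. It assumes the projected token is a standard three-part JWT and it does not verify the signature; `decode_jwt_claims` is a hypothetical helper name, not part of any library. For a service account token the `sub` claim has the form `system:serviceaccount:<namespace>:<name>`.

```python
import base64
import json
import os


def decode_jwt_claims(token: str) -> dict:
    """Return the claims from a JWT payload WITHOUT verifying the signature."""
    payload = token.split(".")[1]
    # JWT segments are base64url-encoded without padding; restore it first.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


# Only runs when the env var is set and the file exists (i.e. inside the pod).
token_path = os.environ.get("AWS_WEB_IDENTITY_TOKEN_FILE")
if token_path and os.path.exists(token_path):
    with open(token_path) as f:
        claims = decode_jwt_claims(f.read().strip())
    print("AWS_ROLE_ARN =", os.environ.get("AWS_ROLE_ARN"))
    for key in ("iss", "sub", "aud"):
        print(key, "=", claims.get(key))
```

The printed `iss`, `sub`, and `aud` values are what I would compare against the role's trust policy.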
Here's my scrape config. I'm trying to discover and scrape node exporter on a box that I've tagged with prometheus.io/discover and whose name begins with the prefix I expect.
scrape_configs:
- ec2_sd_configs:
  - filters:
    - name: tag-key
      values:
      - prometheus.io/discover
    role_arn: arn:aws:iam::2XXXXXXXXXX0:role/prometheus-service-discovery-eks
  job_name: service-ec2
  relabel_configs:
  - action: keep
    regex: ^mycoolnameprefix-.*
    source_labels:
    - __meta_ec2_tag_Name
  - replacement: $1:9100
    source_labels:
    - __meta_ec2_private_ip
    target_label: __address__
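To make the intent of the two relabel rules explicit, here is a rough Python simulation of what I expect them to do to a discovered target. It is only a sketch: `relabel` is a hypothetical function, the instance names and IP in the usage lines are made up, and it relies on Prometheus fully anchoring relabel regexes and defaulting to the `replace` action with regex `(.*)`.

```python
import re


def relabel(labels):
    """Apply the two rules from the config above; return None if the target is dropped."""
    # Rule 1 (action: keep): drop the target unless the Name tag matches.
    # Prometheus anchors the regex at both ends, hence fullmatch.
    name = labels.get("__meta_ec2_tag_Name", "")
    if not re.fullmatch(r"^mycoolnameprefix-.*", name):
        return None
    # Rule 2 (default action: replace, default regex: (.*)):
    # write "<private ip>:9100" into __address__.
    match = re.fullmatch(r"(.*)", labels.get("__meta_ec2_private_ip", ""))
    result = dict(labels)
    if match:
        result["__address__"] = match.group(1) + ":9100"
    return result


# Hypothetical targets, for illustration only.
kept = relabel({"__meta_ec2_tag_Name": "mycoolnameprefix-web1",
                "__meta_ec2_private_ip": "10.0.0.5"})
dropped = relabel({"__meta_ec2_tag_Name": "other-box",
                   "__meta_ec2_private_ip": "10.0.0.6"})
print(kept["__address__"])  # 10.0.0.5:9100
print(dropped)              # None
```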
My assumption, based on the docs and on running the latest version of Prometheus with its dependent AWS SDK, was that it would pick up these env variables and use them to find the role and assume it. However, these logs indicate otherwise:
level=debug ts=2020-04-10T21:08:03.271Z caller=manager.go:224 component="discovery manager scrape" msg="Starting provider" provider=*ec2.SDConfig/0 subs=[service-ec2]
level=debug ts=2020-04-10T21:08:03.271Z caller=manager.go:224 component="discovery manager notify" msg="Starting provider" provider=string/0 subs=[config-0]
level=info ts=2020-04-10T21:08:03.271Z caller=main.go:816 msg="Completed loading of configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
level=debug ts=2020-04-10T21:08:03.271Z caller=manager.go:242 component="discovery manager notify" msg="discoverer channel closed" provider=string/0
level=error ts=2020-04-10T21:08:03.493Z caller=refresh.go:79 component="discovery manager scrape" discovery=ec2 msg="Unable to refresh target groups" err="could not describe instances: WebIdentityErr: failed to retrieve credentials\ncaused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus code: 403, request id: 3317a2e2-5357-4535-9b53-085209fdfb5c"
level=error ts=2020-04-10T21:09:03.502Z caller=refresh.go:98 component="discovery manager scrape" discovery=ec2 msg="Unable to refresh target groups" err="could not describe instances: WebIdentityErr: failed to retrieve credentials\ncaused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity\n\tstatus code: 403, request id: 455fddb6-9b42-449b-b603-d7f453923a7b"
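The failing call in the log above can presumably be reproduced outside Prometheus, since sts:AssumeRoleWithWebIdentity is an unsigned STS query API request. The sketch below is my own debugging idea, not something Prometheus does verbatim: `build_assume_role_url` and the session name `sd-debug` are hypothetical, and I'm assuming the global STS endpoint is reachable from the pod.

```python
import os
import urllib.error
import urllib.parse
import urllib.request

# Assumption: global endpoint; a regional STS endpoint could be used instead.
STS_ENDPOINT = "https://sts.amazonaws.com/"


def build_assume_role_url(role_arn, token, session_name="sd-debug"):
    """Build the query-API URL for an AssumeRoleWithWebIdentity request."""
    query = urllib.parse.urlencode({
        "Action": "AssumeRoleWithWebIdentity",
        "Version": "2011-06-15",
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "WebIdentityToken": token,
    })
    return STS_ENDPOINT + "?" + query


# Only attempts the call when both env vars are present (i.e. inside the pod).
role_arn = os.environ.get("AWS_ROLE_ARN")
token_path = os.environ.get("AWS_WEB_IDENTITY_TOKEN_FILE")
if role_arn and token_path and os.path.exists(token_path):
    with open(token_path) as f:
        url = build_assume_role_url(role_arn, f.read().strip())
    try:
        with urllib.request.urlopen(url) as resp:
            print("STS status:", resp.status)
    except urllib.error.HTTPError as err:
        # A 403 here would mirror the AccessDenied seen in the Prometheus log.
        print("STS status:", err.code)
        print(err.read().decode())
```

If this returns the same 403 from inside the pod, the problem would be between the token and the role rather than in Prometheus itself.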
Any tips on where I might have gone wrong? I made the best effort I could to follow the existing documentation, but I don't feel like it's telling me everything I need to know.