What happens if you scrape just the 625 targets? Or if you scrape just one or two targets, taken from the set of 625 problematic ones?
- If they still show as down, then it's a problem with those targets. Pick one, make a direct test scrape of it using curl from the Prometheus server, and debug the issue. It could be something like a firewall between your Prometheus server and the target.
- If the targets show as up on their own, but go down when you are scraping 2000 hosts, then maybe you don't have enough capacity on your central Prometheus server to get round everything. Increase your scrape interval, or spread the targets between multiple Prometheus servers, or reduce the number of metrics being exported from each target.
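
If you'd rather script the direct test scrape than run curl by hand, here is a rough Python sketch of the same check. It fetches one target's metrics page once, times the request, and counts the sample lines, which gives a hint as to whether the target is unreachable, slow, or just exporting a lot of metrics. The function names `probe` and `count_samples` and the example URL are made up for illustration; use whatever host, port and path your scrape config actually points at. Run it from the Prometheus server itself so it takes the same network path.

```python
import time
import urllib.request


def count_samples(exposition_text):
    """Count sample lines in text-format metrics output.

    Blank lines and comment lines (# HELP, # TYPE) are skipped.
    """
    return sum(
        1
        for line in exposition_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )


def probe(url, timeout=10.0):
    """Fetch a metrics URL once and return (duration_seconds, sample_count).

    Raises OSError (which includes URLError and socket timeouts) if the
    target cannot be reached, which is the case worth debugging further.
    """
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    return time.monotonic() - start, count_samples(body)
```

For example, `probe("http://target.example:9100/metrics")` (a hypothetical address) either raises an error you can chase up, or returns how long the scrape took and how many samples came back. If the duration is close to your configured scrape timeout, or the sample count is very large, that points towards the capacity explanation rather than a connectivity one.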