Hi all,
Back in March 2013, I set up an S3 bucket with some content for the Raft user study, including John Ousterhout's Raft and Paxos lecture videos. I had linked to the raft.mp4 video from the Raft website, since some people might prefer that over the YouTube link.
AWS had been charging me about 30 cents a month to host this, which was negligible. Weirdly, traffic escalated substantially starting on December 24, 2023. My bill for January 2024 was $39, February was $39, March was $256, and April was $523. So, this has quickly turned into a significant and unpredictable cost for me. I've just shut off public access to this bucket to make it stop, and I'll see if AWS is generous enough to refund me.
The traffic pattern is bizarre, and I thought I'd share some details in case someone here can explain what's been happening. I don't have detailed access logs because enabling them can incur additional AWS costs, but I turned them on for a brief period in February and again today. I captured 10 requests in February and 90 requests today, so the samples are pretty small. Here are my observations:
- All of the traffic comes from China, according to geo IP lookups. There are 46 unique IP addresses.
- All of the requests are for raft.mp4, even though the Paxos video is slightly more popular than the Raft video on YouTube. A request here means a GET request for some or all of raft.mp4.
- The 10 requests in February had the user agent "Apache-HttpClient/5.1.4 (Java/1.8.0_392)". 5 of today's requests had that same user agent, while 4 used "Apache-HttpClient/5.1.4 (Java/1.8.0_402)", 15 used "Apache-HttpClient/4.5.14 (Java/11.0.5)", and 3 used "Apache-HttpClient/4.5.14 (Java/21.0.2)". Most of today's requests with these user agents downloaded all or large portions of the file.
- The remaining 63 requests today used "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36", which might be Chrome 83 (released in 2020). Each of these requests downloaded 1-2% of the file.
- 121.56.158.23 is an example of one of a few IP addresses that showed up with both Chrome and Apache-HttpClient user agents today.
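In case anyone wants to run the same kind of tally on their own bucket, here is a minimal Python sketch of how it could be done. It is not the exact procedure I followed. It assumes the space-delimited S3 server access log layout and only matches the leading fields up to the user agent. The names `LOG_PATTERN`, `parse_line`, `summarize`, and `SAMPLE` are made up for illustration, and the sample records use invented values (documentation IP addresses, a placeholder bucket name), not my real logs.

```python
import re
from collections import Counter, defaultdict

# Assumed layout: leading fields of an S3 server access log record.
# Fields after the user agent are ignored.
LOG_PATTERN = re.compile(
    r'^(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] (?P<ip>\S+) '
    r'(?P<requester>\S+) (?P<request_id>\S+) (?P<operation>\S+) (?P<key>\S+) '
    r'"(?P<request>[^"]*)" (?P<status>\S+) (?P<error>\S+) '
    r'(?P<bytes_sent>\S+) (?P<object_size>\S+) (?P<total_time>\S+) '
    r'(?P<turnaround>\S+) "(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Numeric fields are logged as "-" when absent; treat those as 0.
    for field in ("bytes_sent", "object_size"):
        record[field] = int(record[field]) if record[field].isdigit() else 0
    return record


def summarize(lines, key="raft.mp4"):
    """Tally GET requests for `key` by user agent and by client IP."""
    requests_by_agent = Counter()
    agents_by_ip = defaultdict(set)
    fractions_by_agent = defaultdict(list)
    for line in lines:
        record = parse_line(line)
        if record is None:
            continue
        if record["operation"] != "REST.GET.OBJECT" or record["key"] != key:
            continue
        agent = record["user_agent"]
        requests_by_agent[agent] += 1
        agents_by_ip[record["ip"]].add(agent)
        # Fraction of the object sent in this response (0.0 if size unknown).
        size = record["object_size"]
        fractions_by_agent[agent].append(record["bytes_sent"] / size if size else 0.0)
    return {
        "requests_by_agent": requests_by_agent,
        "unique_ips": set(agents_by_ip),
        # IPs that appeared with more than one user agent.
        "mixed_ips": sorted(ip for ip, agents in agents_by_ip.items() if len(agents) > 1),
        "fractions_by_agent": dict(fractions_by_agent),
    }


# Invented sample records, for illustration only.
TEMPLATE = (
    'owner example-bucket [01/May/2024:00:00:0{n} +0000] {ip} - REQ{n} '
    'REST.GET.OBJECT {key} "GET /{key} HTTP/1.1" 206 - {sent} 100000 12 10 '
    '"-" "{ua}" -'
)
APACHE = "Apache-HttpClient/5.1.4 (Java/1.8.0_392)"
CHROME = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/83.0.4103.116"
SAMPLE = [
    TEMPLATE.format(n=1, ip="192.0.2.1", key="raft.mp4", sent=100000, ua=APACHE),
    TEMPLATE.format(n=2, ip="192.0.2.1", key="raft.mp4", sent=1500, ua=CHROME),
    TEMPLATE.format(n=3, ip="192.0.2.2", key="raft.mp4", sent=2000, ua=CHROME),
    TEMPLATE.format(n=4, ip="192.0.2.3", key="paxos.mp4", sent=100000, ua=APACHE),
]

if __name__ == "__main__":
    result = summarize(SAMPLE)
    for agent, count in result["requests_by_agent"].most_common():
        print(count, agent)
    print("unique IPs:", len(result["unique_ips"]))
    print("IPs with multiple user agents:", result["mixed_ips"])
```

Real log lines can contain quoting that this pattern does not handle (for example, a user agent with embedded double quotes), so treat unmatched lines as something to inspect by hand rather than to discard silently.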
It's also worth noting that the YouTube traffic for the same Raft and Paxos videos does not show any significant increase in views during this time.
Honestly, I don't know what to make of this. If this is a crawler or a bad actor, I don't know what benefit they would get from repeatedly downloading the same video file. If these are real people in China trying to learn Raft (perhaps because YouTube is unavailable there), I'm sure we can find some alternative.
Can anyone here explain it?
Thanks,
Diego