Diego Ongaro

May 14, 2024, 9:27:01 PM
Hi all,

Back in March 2013, I set up an S3 bucket with some content for the Raft user study, including John Ousterhout's Raft and Paxos lecture videos. I had linked to the raft.mp4 video from the Raft website, since some people might prefer that over the YouTube link.

AWS had been charging me about 30 cents a month to host this, which was negligible. Weirdly, traffic escalated substantially starting on December 24, 2023. My bill for January 2024 was $39, February was $39, March was $256, and April was $523. So, this has quickly turned into a significant and unpredictable cost for me. I've just shut off public access to this bucket to make it stop, and I'll see if AWS is generous enough to refund me.

The traffic pattern is bizarre, and I thought I'd share some details in case someone here can explain what's been happening. I don't have detailed access logs because those can potentially create more AWS costs, but I turned them on for a brief period in February and again today. I captured 10 requests in February and 90 requests today, so the samples are pretty small. Here are my observations:

- All of the traffic comes from China, according to geo IP lookups. There are 46 unique IP addresses.

- All of the requests are for raft.mp4 (even though the Paxos video is slightly more popular than the Raft video on YouTube). A request here means a GET request for some or all of raft.mp4.

- The 10 requests in February had the user agent "Apache-HttpClient/5.1.4 (Java/1.8.0_392)". 5 of today's requests had that same user agent, while 4 used "Apache-HttpClient/5.1.4 (Java/1.8.0_402)", 15 used "Apache-HttpClient/4.5.14 (Java/11.0.5)", and 3 used "Apache-HttpClient/4.5.14 (Java/21.0.2)". Most of today's requests with these user agents downloaded all or large portions of the file.

- The remaining 63 requests today used "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36", which might be Chrome 83 (released in 2020). Each of these requests downloaded 1-2% of the file.

- is an example of one of a few IP addresses that showed up with both Chrome and Apache-HttpClient user agents today.
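For anyone who wants to run the same kind of breakdown on their own access logs, here is a minimal Python sketch of the tallies above: requests per user agent, unique IPs, the average fraction of the file each user agent downloaded, and IPs seen with more than one user agent. It assumes the log lines have already been parsed into records; the `Request` type, the `summarize` function, and the sample records (using documentation-only 192.0.2.x addresses and a shortened "Chrome/83" label) are hypothetical and are not data from the actual logs.

```python
from collections import Counter, defaultdict
from typing import Iterable, NamedTuple


class Request(NamedTuple):
    """One parsed access-log entry (hypothetical shape, not the S3 log format)."""
    ip: str
    user_agent: str
    bytes_sent: int


def summarize(requests: Iterable[Request], object_size: int) -> dict:
    """Tally requests per user agent, unique IPs, and mean fraction of the object sent."""
    per_agent = Counter()
    bytes_per_agent = Counter()
    agents_per_ip = defaultdict(set)
    for r in requests:
        per_agent[r.user_agent] += 1
        bytes_per_agent[r.user_agent] += r.bytes_sent
        agents_per_ip[r.ip].add(r.user_agent)
    # Average share of the object transferred per request, by user agent.
    mean_fraction = {
        ua: bytes_per_agent[ua] / (per_agent[ua] * object_size) for ua in per_agent
    }
    # IPs that appeared with more than one distinct user agent.
    mixed_ips = sorted(ip for ip, uas in agents_per_ip.items() if len(uas) > 1)
    return {
        "requests_per_agent": dict(per_agent),
        "unique_ips": len(agents_per_ip),
        "mean_fraction_per_agent": mean_fraction,
        "ips_with_multiple_agents": mixed_ips,
    }


# Made-up sample records for illustration only.
SAMPLE = [
    Request("192.0.2.1", "Apache-HttpClient/5.1.4 (Java/1.8.0_392)", 1000),
    Request("192.0.2.1", "Chrome/83", 20),
    Request("192.0.2.2", "Chrome/83", 10),
]
report = summarize(SAMPLE, object_size=1000)
```

Feeding real records into `summarize` would need a parser for whatever log format is in use, which is left out here.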

It's also worth noting that the YouTube traffic for the same Raft and Paxos videos does not show any significant increase in views during this time.

Honestly, I don't know what to make of this. If this is a crawler or a bad actor, I don't know what benefit they would get from repeatedly downloading the same video file. If these are real people in China trying to learn Raft (perhaps because YouTube is unavailable there), I'm sure we can find some alternative.

Can anyone here explain it?


Keine Neco

May 14, 2024, 10:43:31 PM
Hi Diego,

The IP you pointed out belongs to China Telecom Inner Mongolia. I have no idea whether the traffic comes from students or from crawlers.
Your website and that mp4 file are definitely helpful to many Chinese students like me. Do you need some help to serve those files again?
Maybe my friends and I can help with that, e.g. set up Cloudflare, provide some other file hosting service in China, etc.


Diego Ongaro

May 15, 2024, 5:34:53 PM
Hi Dong,

Thanks for letting me know that access to the MP4 is useful to human learners in China. I just put a copy on my personal web server next to some existing Raft user study materials, and I updated the Raft website to link to that. I haven't included the URLs in this message to avoid artificial/crawler traffic, but you can look at the history of the Raft website to see what changed. If this ends up causing too much suspicious traffic again, I'll have to block that or turn it off. I expect my web site/server is accessible in China. Can you confirm, Dong?


Diego Ongaro

May 28, 2024, 8:33:19 PM
I'm just sending a brief follow-up here. Dong confirmed (off-list) that the mp4 is accessible from China in its new location. The access logs for it look reasonable, at least for now. Thankfully, AWS refunded/credited the highest 3 months of usage, which they weren't obligated to do. I still have no idea what this traffic was, but otherwise I think this is resolved.

