Alibaba Cloud Traffic -- to block or not?


Carolyn Sullivan

Apr 23, 2025, 1:29:16 PM
to DSpace Technical Support
Hello all,

We've been getting a ton of traffic from IP addresses with Alibaba Cloud out of Hong Kong and Singapore that's been overwhelming our servers. Is there any chance that legitimate users could be using Alibaba as an ISP, or would any accesses from Alibaba be coming from programs running on the cloud that we can unrepentantly block?

Thanks,
Carolyn.

Edmund Balnaves

Apr 23, 2025, 8:41:46 PM
to DSpace Technical Support
We have seen this also. It seems to be AI-content harvesting. Unfortunately these robots do not honour robots.txt and are poor crawlers: they hit the server from multiple IP addresses, typically with an anonymous user-agent string, and they get stuck in an endless crawl of the browse and search discovery endpoints. Our experience has been that these can be difficult to block, as the IP ranges are fairly dynamic. We have introduced some Apache-level filters to block searches that we can identify as bot-created rather than human (the query string is often irregular). This helps quite a lot. We are also looking at more detailed inspection to throw up a confirmation page when the activity is a suspected bot.
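For what it's worth, a filter of that general shape can be sketched at the Apache level with mod_rewrite. This is only an illustrative sketch, not the poster's actual rules; the endpoint paths and query-string heuristics here are assumptions you would tune to your own traffic:

```apache
# Hypothetical sketch: return 429 for requests to discovery endpoints
# whose query strings look machine-generated. Paths and patterns are
# illustrative assumptions, not the filters actually used above.
<IfModule mod_rewrite.c>
    RewriteEngine On
    # Only inspect the search/browse discovery endpoints
    RewriteCond %{REQUEST_URI} ^/(search|browse|discover) [NC]
    # Heuristic 1: an implausibly long run of facet filters, OR
    RewriteCond %{QUERY_STRING} (f\.[^=]+=[^&]*&){10,} [OR]
    # Heuristic 2: an implausibly long query string overall
    RewriteCond %{QUERY_STRING} ^.{1024,}$
    # A status outside 300-399 makes mod_rewrite return it directly
    RewriteRule ^ - [R=429,L]
</IfModule>
```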

If they are harvesting a content page, we generally let them go through on the general principle that it is better they populate their AI with "good" content from a curated repository.

DSpace v8 is also quite a bit better at cached delivery and SEO. However, when they hit the search endpoints they can easily overwhelm the server.

Edmund 

John Evans

Apr 24, 2025, 5:23:56 PM
to DSpace Technical Support
I block them unrepentantly.

jmis...@gmail.com

Apr 24, 2025, 5:39:26 PM
to DSpace Technical Support
Similarly here, we have recently observed a kind of informal "contest" among poorly behaved robots, with no shortage of strong contenders.

As a provisional mitigation strategy, we block for some time either by user agent or by IP range, e.g. for nginx:

map $http_user_agent $ignore_ua {
    default                 0;
    "~*(SemrushBot|Semrush|AhrefsBot|MJ12bot|YandexBot|YandexImages|MegaIndex\.ru|BLEXbot|BLEXBot|ZoominfoBot|YaK|VelenPublicWebCrawler|SentiBot|Vagabondo|SEOkicks|SEOkicks-Robot|mtbot/1.1.0i|SeznamBot|DotBot|Cliqzbot|coccocbot|python|Scrap|SiteCheck-sitecrawl|MauiBot|Java|GumGum|Clickagy|AspiegelBot|Yandex|TkBot|CCBot|Qwantify|MBCrawler|serpstatbot|AwarioSmartBot|Semantici|ScholarBot|proximic|MojeekBot|GrapeshotCrawler|IAScrawler|linkdexbot|contxbot|PlurkBot|PaperLiBot|BomboraBot|Leikibot|weborama-fetcher|NTENTbot)" 1;
    "~*(admantx-usaspb|Eyeotabot|VoluumDSP-content-bot|SirdataBot|adbeat_bot|TTD-Content|admantx|Nimbostratus-Bot|Mail\.RU_Bot|Quantcastboti|Onespot-ScraperBot|Taboolabot|Baidu|Jobboerse|VoilaBot|Sogou|Jyxobot|Exabot|ZGrab|Proximi|Sosospider|Accoona|aiHitBot|Genieo|BecomeBot|ConveraCrawler|NerdyBot|OutclicksBot|findlinks|JikeSpider|Gigabot|CatchBot|Huaweisymantecspider|SiteSnagger|TeleportPro|WebCopier|WebReaper|WebStripper|WebZIP|Xaldon_WebSpider|BackDoorBot|AITCSRoboti|Arachnophilia|BackRub|BlowFishi|perl|CherryPicker|CyberSpyder|EmailCollector|Foobot|GetURL|httplib|HTTrack|LinkScan|Openbot|Snooper|SuperBot|URLSpiderPro|MAZBot|EchoboxBot|SerendeputyBot|LivelapBot|linkfluence\.com|TweetmemeBot|LinkisBot|CrowdTanglebot)" 1;
    "~*Turnitin"            1;
    "~*Yandex"              1;
}

server {
    location / {
        if ($ignore_ua) {
            access_log /var/log/nginx/access.bots.log;
            return 429;
        }
        # ... normal serving/proxying continues here ...
    }
}

--
Jozef 

Rich Kulawiec

Apr 24, 2025, 8:06:30 PM
to DSpace Technical Support
On Wed, Apr 23, 2025 at 10:29:16AM -0700, Carolyn Sullivan wrote:
> We've been getting a ton of traffic from IP addresses with Alibaba Cloud
> out of Hong Kong and Singapore that's been overwhelming our servers. [...]

1. Unless you have an operational need to do otherwise, I recommend
dropping all incoming TCP traffic from Alibaba's clouds to (a) web
services and (b) password-protected services. (a) includes ports 80,
443, and anything else you're running; (b) includes 22 (ssh), 993 (imaps),
995 (pops), and anything else that requires authentication.

2. I recommend this because I've been watching received traffic from
them for many years, and -- so far -- it's been constant attacks and abuse,
the latter including some very badly behaving web crawlers.

3. To clarify (1), I mean "in your perimeter firewalls". There's no
reason to let this traffic get anywhere near a server. And I use
the word "drop" because I mean "just drop the incoming TCP SYN, don't
even bother sending back an RST". It's not worth wasting even the small
amount of CPU/bandwidth it would take to send those RST responses
(which will be ignored anyway); just silently discard whatever arrives,
if your firewall allows that. (If you're using BSD's pf, then
"set block-policy drop" in concert with the "block" directive
accomplishes this.)
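As a concrete sketch of point (3), a pf.conf fragment along these lines would silently drop rather than reject the traffic. The table name and file path here are illustrative assumptions, not a recommendation of specific names:

```
# pf.conf sketch: silently drop (no RST) inbound traffic from a table
# of blocked networks. Table name and file path are placeholders.
set block-policy drop
table <blocked_clouds> persist file "/etc/pf.blocked"
block in quick from <blocked_clouds> to any
```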

4. It's your call on whether to expand this policy beyond the ports
associated with web and password-protected services, and/or whether to
expand it to UDP. If you're observing port scans and/or DNS abuse/attacks
and/or attacks against other services, that might be wise.

5. Here's my working list of their network allocations. *This may be
incomplete or overinclusive or otherwise wrong* but I hope it'll
be useful as a starting point:

8.208.0.0/12 ASEPL-SG/AlibabaCloudSingaporePrivateLimited
47.52.0.0/16 AL-3/Alibaba
47.56.0.0/15 AL-3/ALICLOUD-HK
47.74.0.0/15 AL-3/Alibaba
47.76.0.0/14 AL-3/Alibaba
47.80.0.0/13 AL-3/Alibaba
47.88.0.0/14 AL-3/Alibaba
47.235.0.0/16 AL-3/Alibaba
47.236.0.0/14 AL-3/Alibaba
47.240.0.0/14 AL-3/Alibaba
47.244.0.0/15 AL-3/Alibaba
47.246.0.0/16 AL-3/Alibaba
47.250.0.0/15 AL-3/Alibaba
47.252.0.0/15 AL-3/Alibaba
47.254.0.0/16 AL-3/Alibaba
147.139.0.0/16 AL-3/Alibaba
163.181.0.0/16 AL-3/AlibabaCloudLLC
198.11.128.0/18 ALIBABA-US-CDN
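On Linux, the ranges above can be dropped with an nftables set; this is a sketch only, with arbitrary table/chain names and the port list taken from points (1) above. Verify the ranges yourself before deploying:

```
# nftables sketch: drop inbound TCP from the listed ranges to web and
# password-protected ports. Names are placeholders; ranges as above.
table inet perimeter {
    set blocked_clouds {
        type ipv4_addr
        flags interval
        elements = { 8.208.0.0/12, 47.52.0.0/16, 47.56.0.0/15,
                     47.74.0.0/15, 47.76.0.0/14, 47.80.0.0/13,
                     47.88.0.0/14, 47.235.0.0/16, 47.236.0.0/14,
                     47.240.0.0/14, 47.244.0.0/15, 47.246.0.0/16,
                     47.250.0.0/15, 47.252.0.0/15, 47.254.0.0/16,
                     147.139.0.0/16, 163.181.0.0/16, 198.11.128.0/18 }
    }
    chain input {
        type filter hook input priority 0; policy accept;
        ip saddr @blocked_clouds tcp dport { 22, 80, 443, 993, 995 } drop
    }
}
```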

6. There are various collaborative efforts underway to deal with very
badly behaved web crawlers, because unfortunately it's a rapidly
proliferating problem. If you're interested in this, please drop
me a note off-list.

---rsk

Carolyn Sullivan

Apr 25, 2025, 3:11:07 PM
to DSpace Technical Support
Thank you all for this input; I really appreciate all this information <3  Institutionally, it seems to make the most sense to block all AI harvesters, even if they are accessing content: if for-profit AI companies would like to improve their models, they really shouldn't be doing so at the expense of publicly funded infrastructure.

Please continue to drop your information and thoughts on this in the thread, and have a lovely weekend!

Edmund Balnaves

Apr 27, 2025, 11:06:34 PM
to DSpace Technical Support
Unfortunately, Alibaba is not the only, nor even the worst, culprit. We see bots accessing from IP ranges across the globe, so you will end up with a very large list of blocked ranges. Very few advertise a proper agent string. It would be nice if one list of IP ranges did the trick, but these ranges change rapidly and are sourced from many different countries. Given that many appear to be throw-away ranges, it is not desirable to block them permanently.

There is another dimension to this problem: DSpace does not handle load terribly well. While you might well see hits across several hundred IP addresses, this really shouldn't stress the server. Sadly, DSpace crumbles under what on other websites would be small-to-moderate load.

Under the hood, both the Angular frontend and the REST API are excessively verbose, adding to the problem, especially as these bots do not reach the bot cache.

We have had the most luck with bot detection and redirection to keep bots off the search and browse endpoints. Item pages are served quickly enough to scale out well.