Views on DIA, Scraping, API Restrictions

Tarunima

Jul 21, 2023, 2:48:31 AM

to data...@googlegroups.com

Hi All,

There have been a number of recent changes globally and nationally, in part triggered by Gen AI chat bots, that could restrict scraping. Listing the salient ones here:

India coming up with the Digital India Act, which would bear on data access
EU releasing the Digital Services Act (DSA), which has provisions for researcher access to data (https://algorithmwatch.org/en/dsa-data-access-explained/)
Twitter and Reddit revoking API access: https://www.fastcompany.com/90904038/reddit-restricts-third-party-apps

Does this group have any views/concerns about this?

Apologies if this has already been addressed in another thread.

Regards,

Tarunima

Nikhil VJ

Jul 27, 2023, 2:38:52 AM

to data...@googlegroups.com

Hi Tarunima

I've not delved too deep into this stuff, but want to share some inputs about one problem that might be one root cause for these changes. Apologies in advance if this is under-informed:

Problem

There's a major cost factor in terms of bandwidth consumption for any kind of situation where you have one centralized distribution point that is being queried for data by thousands to millions of consumers.

This cost is recurring, and there's also a slab-like cost: to be able to serve data at huge volumes at all, you have to buy or rent some pretty expensive infrastructure / services. That results in fixed costs, regardless of whether anyone actually consumes your data or not.
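To make the two cost components concrete, here is a minimal sketch of that arithmetic. Every figure and name in it (`monthly_cost`, `FIXED`, `RATE`) is a hypothetical placeholder chosen for illustration, not a real price from any provider.

```python
def monthly_cost(fixed_infra: float, egress_gb: float, rate_per_gb: float) -> float:
    """Fixed capacity cost plus the recurring per-GB bandwidth cost."""
    return fixed_infra + egress_gb * rate_per_gb


# Hypothetical figures, for illustration only.
FIXED = 50_000.0  # capacity rented up front, paid even if nobody downloads
RATE = 5.0        # cost per GB actually served

idle = monthly_cost(FIXED, 0, RATE)       # no consumers at all
busy = monthly_cost(FIXED, 20_000, RATE)  # heavy scraping month

print(idle, busy)
```

The point of the sketch is only that the fixed part is paid in both cases, while the second term grows with every extra consumer.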

And that is just the bandwidth part. Where the data has to be fetched out of a database, the database also has to bear really high read loads, on top of its main job of ingesting and maintaining the data. And things get more complex and expensive as we dive into multi-cluster databases etc.

When people in the research community talk about data access, I don't see anybody even mentioning this stuff. It's mostly "You have to open these endpoints with zero restrictions and I will scrape it at max speed and that's that!". They don't consider that there will be others like them. Imagine all the cars in a city deciding to go to a mall at the same time - that's what keeps happening here.

----------

Solutions

1. Torrents

Before today's high-speed internet access came about, one solution was widely used to get around this problem: BitTorrent. Basically, anyone who is downloading some data can also relay the chunks they already have to other consumers. So the serving load is spread out.

This works for fixed, static files that don't need to update over time, not for databases with changing data. Even today, if we go to download a Linux distro, we'll see that they share .torrent / magnet links and recommend downloading that way, which relieves the load on their servers, which are mostly volunteer-funded.

This might not be suitable for dynamically updating data, but for data dumps that have a fixed version / release date, why not? We should see governments / institutions releasing torrents the same way Linux distros do. And there are enough technical institutions and companies with hefty servers in the country that can seed these torrents, just like they're seeding Linux distros today.
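One reason relayed chunks can be trusted is that the publisher releases a hash for every piece of the file, so a downloader can check each piece no matter which peer sent it. The sketch below illustrates that idea only: it is not the BitTorrent wire protocol or the .torrent file format, and the function names and the tiny `dump` payload are made up for the example. It uses SHA-1 because that is what classic (v1) torrents use for piece hashes.

```python
import hashlib


def piece_hashes(data: bytes, piece_length: int) -> list[str]:
    """Split data into fixed-size pieces and return the SHA-1 hex digest of each."""
    if piece_length <= 0:
        raise ValueError("piece_length must be positive")
    return [
        hashlib.sha1(data[i:i + piece_length]).hexdigest()
        for i in range(0, len(data), piece_length)
    ]


def verify_piece(piece: bytes, expected_hex: str) -> bool:
    """Check a piece received from any peer against the published hash."""
    return hashlib.sha1(piece).hexdigest() == expected_hex


# Stand-in for a versioned data dump; a real one would be read from disk.
dump = b"x" * 10
hashes = piece_hashes(dump, piece_length=4)  # pieces of 4, 4 and 2 bytes

print(len(hashes), verify_piece(dump[0:4], hashes[0]))
```

Because the hashes are fixed at release time, this only fits data dumps with a fixed version, which is the case described above.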

For dynamically updating data and database-type queries, there are web3 technologies coming up that might help with similar decentralised distribution. I don't know the details, but it's worth digging into.

-----------

2. Consumer pays model

Another solution I've seen recently is the consumer-pays-for-egress model used in Amazon's OpenStreetMap data release. Here's one link; I couldn't find the exact article explaining it: https://registry.opendata.aws/

Under this model, the dataset is available as open and is query-able, meaning you can fetch just the parts you want. But if you as a consumer want to fetch a high amount of it, then you have to pay the bandwidth costs incurred.

And if you consume less, there might be a free slab you come under (I'm not sure what applies in the AWS case).

Provided we retain a basic free tier, I think this takes care of a lot of problems. Now I wouldn't want India's governments to be a beholden AWS customer (because USA etc already are and there's national security considerations that we SHOULD take seriously, not scoff at), but rather like what has been done with UPI, there should be Indian-owned infrastructure and service which offers the same deal. Maybe a prepaid wallet that gets deducted from when we exceed the free tier.
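A rough sketch of how a prepaid wallet with a free tier could be accounted for is shown below. The `Wallet` class, its fields, and the numbers in the usage lines are all hypothetical; this is not how AWS or any existing Indian service bills, just the arithmetic of "free up to a slab, deducted beyond it".

```python
from dataclasses import dataclass


@dataclass
class Wallet:
    balance: float        # prepaid amount remaining
    free_gb: float        # free allowance per billing period
    rate_per_gb: float    # price for each GB above the free allowance
    used_gb: float = 0.0  # total downloaded in this period

    def charge_for(self, gb: float) -> float:
        """Record a download of `gb` and deduct only the part above the free tier."""
        free_left = max(0.0, self.free_gb - self.used_gb)
        billable = max(0.0, gb - free_left)
        cost = billable * self.rate_per_gb
        if cost > self.balance:
            raise ValueError("insufficient balance; top up the wallet")
        self.used_gb += gb
        self.balance -= cost
        return cost


# Hypothetical usage: 10 GB free, then 2 units per GB.
wallet = Wallet(balance=100.0, free_gb=10.0, rate_per_gb=2.0)
first = wallet.charge_for(4.0)    # fully inside the free tier
second = wallet.charge_for(10.0)  # 6 GB free, 4 GB billable
print(first, second, wallet.balance)
```

Small consumers never touch their balance, while heavy consumers pay in proportion to what they pull.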

---------

3. Rate limiting

Many times it's not the quantity, but the velocity of data scraping that inflicts high costs on the provider. If scrapers scraped data slowly overnight instead of trying to fetch everything in 10 minutes at peak business hours, we might make things work with the same existing setup used to serve data to sites.

Example: the main concern for, say, Indian Railways in serving train schedule data would be that the server should not get so clogged up by scraping bots that people trying to book tickets face downtime. Enforcing rate limits can help a lot here. A basic figure: allow an IP or a user to make at most 4 requests per minute. I've recently implemented this using Kong Gateway, and was surprised by how easy it was.

In many cases I suspect that the people in charge had no idea that rate-limiting is even possible, so they went to the next option: captcha restrictions to disable automated data fetching entirely. Funnily, that's a far more expensive measure than rate-limiting! And we have an arms race now, with scrapers cracking the captcha and providers then making it so difficult to read that eventually humans won't be able to read it anymore. Maybe we can put down our weapons, take a few steps back and communicate that there are options available that work for both sides?
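For readers who have not seen rate limiting before, the idea fits in a few lines. The sketch below is a simple in-memory sliding-window limiter using the "4 requests per minute per IP" figure from above. It is not how Kong Gateway implements or configures it; the class name and the explicit `now` argument (passed in so the behaviour is easy to follow) are assumptions made for this example.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per key (IP or user) in any `window_seconds` span."""

    def __init__(self, max_requests: int = 4, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key: str, now: float) -> bool:
        hits = self._hits[key]
        # Forget accepted requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: the server would answer HTTP 429
        hits.append(now)
        return True


limiter = SlidingWindowLimiter()
# Five requests from one IP within five seconds: the fifth is refused.
burst = [limiter.allow("203.0.113.7", now=t) for t in (0, 1, 2, 3, 4)]
other_ip = limiter.allow("198.51.100.9", now=4)  # a different IP is unaffected
later = limiter.allow("203.0.113.7", now=60)     # the oldest hit has expired
print(burst, other_ip, later)
```

A scraper that spaces its requests out passes through untouched, while one that hammers the server is slowed down without blocking anyone else.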

--------

4. Load shifting

One load-shifting example I've seen in my netbanking: if I request a long-term account statement, instead of trying to give the data immediately, it queues the task in the backend and tells me to carry on and come back in a few minutes to download. Some other sites mail me the link when they're done gathering the requested data. So we could have something like this: it shifts the load on the server from peak-time spikes to the "lazy" times later when there's not much traffic. The institution can stay on cheaper infrastructure, doesn't incur higher costs, and still accomplishes the goal of providing data.
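The queue-now, download-later pattern can be sketched as below. This is a single-process illustration with hypothetical names (`DeferredReports`, `submit`, `run_off_peak`, `fetch`); a real service would persist the jobs and run the worker from a scheduler during quiet hours.

```python
from collections import deque


class DeferredReports:
    """Accept heavy requests immediately, but build the reports later."""

    def __init__(self):
        self._pending = deque()  # job ids waiting for an off-peak run
        self._requests = {}      # job id -> request parameters
        self._results = {}       # job id -> finished report
        self._next_id = 1

    def submit(self, params: dict) -> int:
        """Cheap: just record the request and hand back a ticket number."""
        job_id = self._next_id
        self._next_id += 1
        self._requests[job_id] = params
        self._pending.append(job_id)
        return job_id

    def run_off_peak(self, build_report, max_jobs: int) -> int:
        """Expensive part: process up to `max_jobs` queued requests, oldest first."""
        done = 0
        while self._pending and done < max_jobs:
            job_id = self._pending.popleft()
            self._results[job_id] = build_report(self._requests.pop(job_id))
            done += 1
        return done

    def fetch(self, job_id: int):
        """Return the report if it is ready, else None (come back later)."""
        return self._results.get(job_id)


queue = DeferredReports()
ticket = queue.submit({"months": 12})
print(queue.fetch(ticket))  # not ready yet
queue.run_off_peak(lambda p: f"statement for {p['months']} months", max_jobs=10)
print(queue.fetch(ticket))
```

The `max_jobs` cap is the knob that keeps the off-peak run itself from turning into a new spike.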

--------

A mental note:

I came across a quote the other day : A government that is expected to do everything for us, will take everything from us.

Too often I see this expectational attitude amongst folks like "everything must be provided for, free and openly accessible!", without taking into consideration what it all takes or the fact that they're not the only scrapers on the planet. Or the fact that there will always be an unlimited supply of idiots hogging up all the resources for no use other than to show off their latest Go code's concurrency stats.

I agree that we're paying taxes, but those taxes are already accounted for (very inefficiently, but yes), and demanding that they be used to fund all this new web infra for giving us more free stuff leads to the same convenient outcome: an increase in our taxes. It's already happening, and it's not sustainable. I especially don't appreciate having to pay more taxes just for the sake of that idiot with the Go code :D.

And you can bet that if we demand that so and so government institution make everything freely available without caring about the details, then they will do it in the most expensive and wasteful way imaginable. The solutions I've written above seem a lot better to me than increasing my taxes plus creating yet another black hole in the govt budget.

I don't know exactly how we can make things work out, but if we ditched the expectational attitude and thought more like team players, with the government being part of that team, we could go a long way.

--
Cheers,
Nikhil VJ
https://nikhilvj.co.in

--
Datameet is a community of Data Science enthusiasts in India. Know more about us by visiting http://datameet.org
---
You received this message because you are subscribed to the Google Groups "datameet" group.
To unsubscribe from this group and stop receiving emails from it, send an email to datameet+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/datameet/CACj7W2mAOhV4-Qi0Y6g4BqWUFUXCWtxQpyd1xewRpxwafXEb0g%40mail.gmail.com.

Tarunima

Aug 23, 2023, 1:13:05 AM

to data...@googlegroups.com

Hi Nikhil.

Thanks for such a detailed email. I apologize for the delay in replying to it.

I think you raise an important point around bandwidth consumption for the entity hosting the data. And I appreciate the comprehensive list of solutions. Projects like Pushshift were addressing this concern (the data was typically shared through torrents). Of course, not everyone used Pushshift to access Reddit data; some people preferred to scrape on their own terms. Removing the free Reddit API isn't just about offsetting the material cost of scraping; it also comes from a belief that Reddit too should get money from profits made off LLMs (if they do become profitable). That's not an unreasonable position, except that it flattens out the skill/resource disparity across projects and teams: it equates OpenAI with a high school student doing a side project. I thought the Twitter API model was a good one: there was a limited free tier for hobbyists, but researchers and corporations needed to pay more.

I was confused by your point about "A government that is expected to do everything for us, will take everything from us." I don't know if there is an argument that government should provide the infra to scrape, or host data that is scraped or will be scraped. That was definitely not what I was suggesting. In my understanding, government or judiciary is typically playing the role of legitimizing or delegitimizing the act of scraping, which remains an action between individuals and organizations. But happy to be corrected, if I have this wrong.

Regards,

Tarunima
