Error while scraping sites such as Amazon

Zia

Feb 19, 2021, 9:38:26 AM
to beautifulsoup
While trying to scrape the site, a different script shows up, which includes this line:
To discuss automated access to Amazon data please contact api-servic...@amazon.com.

Could someone please tell me how the site recognizes that a program is making the request, and what a solution for this would be?

Jairaj Sahgal

Feb 19, 2021, 11:59:05 AM
to beauti...@googlegroups.com
Add headers to the request function so that the request looks like it comes from Mozilla Firefox or Google Chrome.


--
You received this message because you are subscribed to the Google Groups "beautifulsoup" group.
To unsubscribe from this group and stop receiving emails from it, send an email to beautifulsou...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/beautifulsoup/d0b3db2d-c434-46b5-8dfc-5ddb47600ca3n%40googlegroups.com.

Zia

Feb 19, 2021, 12:08:58 PM
to beautifulsoup
I am really sorry, I didn't quite get that. Could you please explain what you mean by adding headers to request functions?

leonardr

Feb 19, 2021, 12:19:45 PM
to beautifulsoup
In general, when you're scraping a site and the site offers a way to get automated access through an API, you should use the API rather than scraping.

To answer your question: when you use a script to make an HTTP request, the HTTP library inserts HTTP headers such as "User-Agent", which identify the HTTP library to the server. Servers can use this information to deny access to automated clients.
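As a rough illustration of the point above, the sketch below inspects the default headers of Python's standard-library urllib without sending any request. It uses urllib rather than any particular third-party library only so that it runs on a plain Python install; other HTTP libraries set their own default User-Agent in a similar way.

```python
import urllib.request

# Build an opener without sending any request; it carries urllib's default headers.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# The default User-Agent names the library (and its version), not a browser.
default_user_agent = default_headers["User-agent"]
print(default_user_agent)
```

A server that sees a value like this can tell that the request came from a script rather than from a browser.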

In some cases, changing the User-Agent string to a string associated with a web browser will cause the server to think it's dealing with a human user rather than a script. Here's some documentation that explains the issue in a Python context.
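A minimal sketch of overriding the User-Agent, again using the standard-library urllib so that it runs without extra packages. The URL is a hypothetical placeholder and the browser string is only an example value; neither is a recommendation. No request is actually sent. With the requests library, the equivalent is to pass a headers dictionary to requests.get.

```python
import urllib.request

# Hypothetical target URL, used only for illustration; no request is sent here.
url = "https://example.com/"

# Example browser-style User-Agent string (an assumption, not a required value).
browser_user_agent = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36"
)

# Attach the header to the request object instead of relying on the default.
request = urllib.request.Request(url, headers={"User-Agent": browser_user_agent})

# urllib normalizes the header name to "User-agent" internally.
sent_user_agent = request.get_header("User-agent")
print(sent_user_agent)

# To actually fetch the page, one would call urllib.request.urlopen(request).
```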

This won't work all the time, because servers can use other mechanisms to detect automated clients, such as checking for JavaScript support. In those cases you may be able to use Selenium to script a real web browser.

Leonard

Jairaj Sahgal

Feb 19, 2021, 12:32:23 PM
to beauti...@googlegroups.com
Sure, I'm sorry I wasn't clear earlier.
By headers I mean the information that your browser sends to a website about itself.
That's how a website learns about your browser.
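The following code was used to get webpage data (HTML) from Google.
import requests

# Send a browser-like User-Agent instead of the library's default one.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'}
r = requests.get(url, headers=headers)  # url is the address of the page to fetch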


Zia

Feb 19, 2021, 10:49:10 PM
to beautifulsoup
That makes sense now. Thanks to both of you :)