Chrome Smart Proxy

Nicodemo Aidara

Aug 4, 2024, 2:44:16 PM

to dubspuncleca

I am thrilled to share some good news with all the Selenium users who are looking for an easy-to-integrate anti-ban solution, and with all the Zyte Smart Proxy Manager users who use Selenium (a web automation and headless browser library) to extract data from JavaScript-heavy websites. We have just launched a new library, Zyte SmartProxy Selenium.

At Zyte, the developer experience matters the most, and we wanted to give you a smooth experience of scraping dynamic websites with seamless integration between Selenium and our smart rotating proxy service, Zyte Smart Proxy Manager.

With this library, you will be able to make the best of the headless browser capabilities of Selenium and manage bans by unlocking the powerful proxy management tool - Zyte Smart Proxy Manager in your web scraping projects.

Important note: block_ads and static_bypass are enabled by default. Some websites may not work with block_ads and static_bypass enabled. Try disabling them if you encounter any issues. To know more about these functionalities, read here.

Using libraries like Zyte SmartProxy Selenium can make it much easier to work with dynamic websites and to manage bans and proxies together in a single piece of code. Later this month, on the 22nd of June, I will be hosting a webinar to demonstrate the true power of this new integration and show you how to make the most of it. So be sure to join me!

I have been away from using Selenium for many years. Because of this I now lack the context (and time, sorry) to go through the newer answers being provided and mark one as the solution to this problem. Does SO have a mechanism one could use to effectively delegate this function to someone who might be a current practitioner with expertise in this domain?

I have checked most of the solutions on the web, and in none of them does authentication via Chrome/Firefox desired capabilities work. Check this link: Finally, the temporary solution is to whitelist your IP address with the proxy provider.

Instead of these workarounds, I switched to Firefox, where I was able to fill in the username and password on the proxy authentication pop-up, as given below. The following code is for Ruby using Capybara. You should be able to do something like this on your platform.

This is the best solution I found and is the ONLY one that worked - all other answers on this question are outdated. It basically generates an auth extension for Chrome on the fly. Simply use the function as defined in the script as follows:
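The script that answer refers to is not reproduced here. As a rough sketch of the general technique (building a small Chrome extension on the fly that sets the proxy and answers the authentication prompt), something like the following could work. The function name, file name, and extension name are hypothetical, and the sketch assumes a Chrome build that still loads Manifest V2 extensions.

```python
import json
import zipfile

# Minimal Manifest V2 manifest (assumption: the target Chrome still accepts MV2).
MANIFEST = {
    "manifest_version": 2,
    "name": "Proxy Auth (sketch)",
    "version": "1.0.0",
    "permissions": ["proxy", "webRequest", "webRequestBlocking", "<all_urls>"],
    "background": {"scripts": ["background.js"]},
}

# Background script: point Chrome at the proxy and supply credentials on demand.
BACKGROUND_JS = """
var config = {
  mode: "fixed_servers",
  rules: {
    singleProxy: {scheme: "http", host: %(host)s, port: %(port)d},
    bypassList: ["localhost"]
  }
};
chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});
chrome.webRequest.onAuthRequired.addListener(
  function(details) {
    return {authCredentials: {username: %(user)s, password: %(password)s}};
  },
  {urls: ["<all_urls>"]},
  ["blocking"]
);
"""


def build_proxy_auth_extension(host, port, user, password, path="proxy_auth.zip"):
    """Write a zipped extension to `path` and return the path (hypothetical helper)."""
    js = BACKGROUND_JS % {
        "host": json.dumps(host),          # json.dumps yields a safely quoted JS string
        "port": int(port),
        "user": json.dumps(user),
        "password": json.dumps(password),
    }
    with zipfile.ZipFile(path, "w") as zf:
        zf.writestr("manifest.json", json.dumps(MANIFEST, indent=2))
        zf.writestr("background.js", js)
    return path
```

The resulting zip can then be handed to Selenium's ChromeOptions via add_extension(path) before the driver is created.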

If you have a permanent IP address (e.g. by leasing a machine in the cloud), you can ask your proxy provider to authorize access to the proxy server by your IP instead of by username and password. In my experience that's easier.

After trying many solutions that didn't actually work properly, I finally managed to set the authenticated proxy using the extension suggested in previous answers. What you need to do is to enter this link:

With an estimated 40% of websites using Cloudflare's Content Delivery Network (CDN), bypassing Cloudflare's anti-bot protection system has become a big requirement for developers looking to scrape some of the most popular websites on the internet.

Here, instead of having to trick Cloudflare into thinking your requests are from a real user, you bypass Cloudflare completely by finding the IP address of the origin server that hosts the website and sending your requests to that instead.

Sometimes accessing the website via the origin IP address by inserting it in your browser's address bar won't work, as the server may be expecting an HTTP Host header. When this is the case, you can query the origin server with a tool like curl or Postman, which allows you to set the Host header, or you can add a static mapping to your hosts file.
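The same idea can be sketched with Python's standard library: send the request to the origin IP while setting the Host header to the site's domain. The helper name and the example IP and domain are placeholders, and plain HTTP is assumed here because an HTTPS request to a bare IP would normally fail certificate validation.

```python
import urllib.request


def build_origin_request(origin_ip, hostname, path="/"):
    """Build a request aimed at the origin IP but carrying the site's Host header."""
    url = f"http://{origin_ip}{path}"
    # urllib keeps a caller-supplied Host header instead of deriving one from the URL.
    return urllib.request.Request(url, headers={"Host": hostname})


# Usage (performs a real network call, so it is left commented out):
# req = build_origin_request("203.0.113.10", "targetwebsite.com")
# with urllib.request.urlopen(req, timeout=10) as resp:
#     print(resp.status)
```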

Provided that the website isn't using a third-party email provider, one trick is to send an email to a non-existent email address at your target website, fake...@targetwebsite.com. Assuming the delivery fails, you should receive a notification from the email server which will contain its IP address.

The DNS history of every server is available on the internet so it is sometimes the case that the website is still being hosted on the same server as it was before they deployed it to the Cloudflare CDN. As a result, you can use a tool like CrimeFlare to find it.

Sometimes, even if you find the actual IP address of the website's server, it is not possible to access it, for example when the website's administrators correctly limit the server to only respond to Cloudflare IP ranges, redirect any requests to the Cloudflare CDN, or use Origin CA certificates.

If you find what looks like an origin server, it may in fact be a development or staging server for the real website. Although you can never be 100% sure that the server you found is the origin server, if you can browse around, the data looks the same as on the Cloudflare-protected site, and you can register an account on the "origin version" and log in to the real website with it, then it should be okay to treat it as the real website.

Some websites (like LinkedIn) tell Google not to cache their web pages, or Google's crawl frequency is too low, meaning some pages might not be cached yet. So this method doesn't work with every website.

When run, FlareSolverr starts a proxy server which forwards your requests to the Cloudflare-protected website using puppeteer and the stealth plugin, and waits until the Cloudflare challenge is solved (or times out) before returning the response and cookies to your scraper.

The advantage of this approach over using a fortified headless browser for every request is that you only need to use FlareSolverr to retrieve valid Cloudflare cookies and then can continue scraping with much less resource intensive HTTP clients (like Python Requests, HTTPX, Node Axios, etc.).
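A minimal sketch of that workflow, assuming FlareSolverr is listening on its default local endpoint and returns a "solution" object containing "cookies" and "userAgent". The helper names are hypothetical, and the response shape should be checked against the FlareSolverr version you run.

```python
import json
import urllib.request

# Assumption: FlareSolverr running locally on its default port.
FLARESOLVERR_URL = "http://localhost:8191/v1"


def build_payload(url, max_timeout_ms=60000):
    """JSON body asking FlareSolverr to fetch `url` and solve any challenge."""
    return {"cmd": "request.get", "url": url, "maxTimeout": max_timeout_ms}


def solve(url):
    """Call FlareSolverr and return the 'solution' part of its reply (network call)."""
    data = json.dumps(build_payload(url)).encode()
    req = urllib.request.Request(
        FLARESOLVERR_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=90) as resp:
        return json.load(resp)["solution"]


def headers_from_solution(solution):
    """Turn the solved cookies and user agent into headers for a plain HTTP client."""
    cookie = "; ".join(f"{c['name']}={c['value']}" for c in solution["cookies"])
    return {"User-Agent": solution["userAgent"], "Cookie": cookie}
```

The headers returned by headers_from_solution can then be reused with a lightweight client for subsequent requests. The user agent is kept identical to the one the challenge was solved with, since a cookie issued to one browser fingerprint may be rejected when presented with another.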

When run, FlareSolverr starts a server that uses Python Selenium with undetected-chromedriver to solve Cloudflare's JavaScript and browser fingerprinting challenges by impersonating a real web browser.

As headless browsers can consume a lot of memory and each request to FlareSolverr launches a new browser window, FlareSolverr can crash your server if you send too many requests to it and your machine doesn't have enough RAM. Therefore, you need to throttle the number of requests you send and/or deploy it on a larger server.
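One simple way to throttle, sketched here with the standard library, is to cap how many calls are in flight at once by using a small worker pool. The function name and the limit of 2 are illustrative assumptions; a sensible limit depends on how much RAM the machine has.

```python
from concurrent.futures import ThreadPoolExecutor


def run_throttled(func, items, max_concurrent=2):
    """Apply `func` to every item, never running more than `max_concurrent` at once."""
    # The pool size is the throttle: at most `max_concurrent` calls run concurrently.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(func, items))  # results keep the input order
```

Here func would be whatever function sends a single request to FlareSolverr.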

Sometimes Cloudflare not only presents mathematical computations and JavaScript browser tests to be solved, but also requires the user to solve a CAPTCHA. Although FlareSolverr does support CAPTCHA solving via third-party CAPTCHA solvers, currently none of the automated CAPTCHA solving solutions work, as Cloudflare uses hCaptcha.

Vanilla headless browsers leak their identity in their JS fingerprints, which anti-bot systems can easily detect. However, developers have released a number of fortified headless browsers that patch the biggest leaks:

For example, a commonly known leak present in headless browsers like Puppeteer, Playwright and Selenium is the value of navigator.webdriver. In normal browsers this is set to false, whereas in unfortified headless browsers it is set to true.
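To see this leak for yourself, you can read the flag from the page through Selenium's execute_script. The helper below is a small sketch; its name is hypothetical, and it only assumes the driver object exposes execute_script, as Selenium WebDriver instances do.

```python
def reports_webdriver(driver):
    """Return True if the page's navigator.webdriver flag is set in this browser."""
    # A missing or undefined flag comes back as None, which bool() maps to False.
    return bool(driver.execute_script("return navigator.webdriver"))
```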

There are over 200 known headless browser leaks which these stealth plugins attempt to patch. However, the real number is believed to be much higher, as browsers are constantly changing and it is in the interest of browser developers and anti-bot companies not to reveal all the leaks they know of.

Headless browser stealth plugins patch a large majority of these browser leaks, and can often bypass anti-bot services like Cloudflare, PerimeterX, Incapsula and DataDome, depending on the security level they have been configured with on the website.

Another way to make your headless browsers harder to detect is to pair them with high-quality residential or mobile proxies. These proxies typically have higher IP address reputation scores than datacenter proxies, and anti-bot services are more reluctant to block them, making them more reliable.

However, residential and mobile proxies are typically charged per GB of bandwidth used, and a page rendered with a headless browser can consume 2 MB on average (versus 250 KB without a headless browser), so it can get very expensive as you scale.
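To make the scaling concrete, here is a small back-of-the-envelope calculation using the page sizes above. The price per GB and the page count are purely hypothetical figures chosen for illustration, and decimal units (1 GB = 1,000,000 KB) are assumed.

```python
def bandwidth_gb(pages, kb_per_page):
    """Total bandwidth in GB, using decimal units (1 GB = 1,000,000 KB)."""
    return pages * kb_per_page / 1_000_000


def proxy_cost(pages, kb_per_page, price_per_gb):
    """Bandwidth cost for a given per-GB price (the price is an assumption)."""
    return bandwidth_gb(pages, kb_per_page) * price_per_gb


PAGES = 1_000_000        # hypothetical crawl size
PRICE_PER_GB = 10.0      # hypothetical price, not a quote from any provider

print(bandwidth_gb(PAGES, 2000))              # headless browser, 2 MB/page -> 2000.0
print(bandwidth_gb(PAGES, 250))               # plain HTTP client, 250 KB/page -> 250.0
print(proxy_cost(PAGES, 2000, PRICE_PER_GB))  # -> 20000.0
print(proxy_cost(PAGES, 250, PRICE_PER_GB))   # -> 2500.0
```

Whatever the actual price, the headless browser uses roughly eight times the bandwidth for the same number of pages.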

To enable the use of authenticated proxies, load undetected_chromedriver from seleniumwire instead of directly from the undetected-chromedriver package, and pass the proxy settings in the seleniumwire_options argument of the Chrome driver.
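A hedged sketch of that setup follows. The helper names and the proxy host, port, and credentials are placeholders, and the driver is only created inside make_driver, which needs the selenium-wire and undetected-chromedriver packages plus a local Chrome install.

```python
from urllib.parse import quote


def build_seleniumwire_options(user, password, host, port):
    """Build the seleniumwire_options dict for an authenticated HTTP proxy."""
    # Percent-encode credentials so characters like '@' or ':' don't break the URL.
    proxy_url = f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"
    return {
        "proxy": {
            "http": proxy_url,
            "https": proxy_url,
            "no_proxy": "localhost,127.0.0.1",
        }
    }


def make_driver(seleniumwire_options):
    """Start Chrome through selenium-wire's undetected_chromedriver wrapper."""
    # Imported here so the options helper above works without these packages installed.
    import seleniumwire.undetected_chromedriver as uc

    options = uc.ChromeOptions()
    return uc.Chrome(options=options, seleniumwire_options=seleniumwire_options)
```

Typical use would be make_driver(build_seleniumwire_options("user", "password", "proxy.example.com", 8000)), followed by driver.get(...) as usual.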

The downside of using open-source Cloudflare solvers and pre-fortified headless browsers is that anti-bot companies like Cloudflare can see how they bypass their anti-bot protection systems and easily patch the issues that they exploit.

These are typically more reliable, as it is harder for Cloudflare to develop patches for them, and they are developed by proxy companies who are financially motivated to stay one step ahead of Cloudflare and fix their bypasses the minute they stop working.

However, one of the best options is to use the ScrapeOps Proxy Aggregator as it integrates over 20 proxy providers into the same proxy API, and finds the best/cheapest proxy provider for your target domains.

You can activate ScrapeOps' Cloudflare Bypass by simply adding bypass=cloudflare_level_1 to your API request, and the ScrapeOps proxy will use the best & cheapest Cloudflare bypass available for your target domain.
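As an illustration, the request URL could be assembled like this. The endpoint and the api_key and url parameter names are assumptions based on how the ScrapeOps proxy API is commonly called and should be checked against the current ScrapeOps documentation; only the bypass=cloudflare_level_1 parameter comes from the text above.

```python
from urllib.parse import urlencode

# Assumption: ScrapeOps proxy API endpoint; verify against the official docs.
SCRAPEOPS_ENDPOINT = "https://proxy.scrapeops.io/v1/"


def build_scrapeops_url(api_key, target_url, bypass="cloudflare_level_1"):
    """Return the proxy API URL that asks for the Cloudflare bypass on `target_url`."""
    params = {"api_key": api_key, "url": target_url, "bypass": bypass}
    return SCRAPEOPS_ENDPOINT + "?" + urlencode(params)  # urlencode escapes the target URL
```

The returned URL can then be fetched with any HTTP client in place of the target URL itself.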

The final and most complex way to bypass the Cloudflare anti-bot protection is to actually reverse engineer Cloudflare's anti-bot protection system and develop a bypass that passes all of Cloudflare's anti-bot checks without the need to use a full fortified headless browser instance.

Advantages: If you are scraping at large scale and don't want to run hundreds (if not thousands) of costly full headless browser instances, you can instead develop the most resource-efficient Cloudflare bypass possible, one that is solely designed to pass the Cloudflare JS, TLS and IP fingerprint tests.