Getting Binary data


David Hansen

Jan 1, 2021, 12:25:31 PM
to Abot Web Crawler
Trying to crawl https://www.amazon.com/s?k=shredder but getting binary data. How do I get the HTML content?

Also, I'm trying to use the advanced options with the 30-day demo, but I get an error that I don't have a license. How do I run the demo without a license?

Thanks
David

sjdirect

Jan 1, 2021, 4:07:18 PM
to Abot Web Crawler
Hi,

Do you have a simple code snippet or unit test that demonstrates the issue? Also, can you verify whether you get the same result with a different website URL in the same code? I suspect Amazon is blocking your requests, since crawling their pages is against their usage policy; you are likely being blocked by a WAF.

As for the advanced features, you likely just need to make sure your license file ends up in the correct location and is available during execution. What you are describing is almost certainly a side effect of this.

David Hansen

Jan 1, 2021, 4:28:59 PM
to Abot Web Crawler
Hi, I pulled Abot2 (2.0.67) and AbotX (2.1.8) from NuGet and copied your examples exactly, adding only www.amazon.com as the URL, and I was getting binary content.
As for the trial license, where do I find it? I searched everywhere and can't locate it.

Thanks
David

David Hansen

Jan 1, 2021, 5:01:57 PM
to Abot Web Crawler
One other update: when using WebClient.DownloadString on the same URL, it retrieves the page content as text (not binary). Does it have anything to do with page encoding?
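
For reference, a minimal sketch of that WebClient comparison (the exact snippet wasn't posted in the thread, so this is an assumption of what the test looked like):

```csharp
using System;
using System.Net;

// Sketch of the WebClient comparison test. Note: WebClient does not send an
// Accept-Encoding: gzip header by default, so the server typically responds
// with uncompressed text. That may be why this returns readable HTML while
// the crawler sees what looks like binary data.
using (var client = new WebClient())
{
    string html = client.DownloadString("https://www.amazon.com/s?k=shredder");
    Console.WriteLine(html.Substring(0, Math.Min(200, html.Length)));
}
```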
Thanks
David

sjdi...@gmail.com

Jan 2, 2021, 11:18:48 AM
to David Hansen, Abot Web Crawler
Hi David,

What example code are you using specifically (give me a link to the page or send a snippet)? I can't make assumptions here, as there could be a thousand variations. If you use the same code but change the URL to a few other sites/urls, do you get the same issue?

Abot/AbotX doesn't use WebClient internally, so that test isn't helpful in pinpointing the problem. The trial license is sent through email after signing up here.

David Hansen

Jan 3, 2021, 1:28:51 PM
to sjdi...@gmail.com, Abot Web Crawler
Hi SJ,
I resolved part of the mystery. It appears that Visual Studio 2019 looks for the license files in the project root directory, not in bin.
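
One way to sidestep that path confusion (an assumption about the project setup, not something confirmed in the thread; the license file name is hypothetical) is to mark the license file as content that gets copied next to the built binaries in the .csproj:

```xml
<!-- Hypothetical .csproj fragment: copy the license file into the output
     directory on every build so it's found regardless of the working dir. -->
<ItemGroup>
  <None Include="AbotX.lic">
    <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
  </None>
</ItemGroup>
```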
I got JavaScript rendering working.
However, testing on Amazon, I am getting 
ReferenceError: Can't find variable: ReactDOM
Does AbotX support crawling React webpages?  
Again, I appreciate all your help, and I am planning on purchasing a license once this works.
Thanks
David

On Sat, Jan 2, 2021 at 11:41 AM David Hansen <dhansen...@gmail.com> wrote:
Hi SJ,
Sorry for all the back and forth.  I was going to attach files and did not want to include them in public.
I downloaded the license file and put it in the bin directory, but I'm still getting the same licensing errors (log file attached).
I've also attached my code. It was copied from the Quick Start in the abotx README.md (sjdirect/abotx on GitHub).
When I tried it on other sites (e.g. Walmart) it pulled the text version. With Amazon it shows as binary.
Also, in my code I included the .NET WebClient test; with Amazon it came back as text, not binary.

Thanks again for your help
David
abot-log-20210103_001.txt

sjdi...@gmail.com

Jan 4, 2021, 6:08:39 PM
to David Hansen, Abot Web Crawler
Hi David,

Glad you were able to work through most of the issues. I'm not having any issues with the Amazon shredder URL. When you say "binary data", are you looking somewhere other than args.CrawledPage.Content.Text as shown below? Maybe you haven't set JavascriptRenderingWaitTimeInMilliseconds? Here is a unit test that demonstrates it; your phantomjs exe path might be slightly different.

        [TestMethod]
        public async Task JavascriptRendering_ReactJs_Amazon()
        {
            await VerifyJavascriptHasRendered(new Uri("https://www.amazon.com/s?k=shredder"), "Amazon Basics 8-Sheet Capacity", 2500);
        }

        private async Task VerifyJavascriptHasRendered(Uri uri, string searchWord, int waitTime)
        {
            var config = new CrawlConfigurationX
            {
                IsJavascriptRenderingEnabled = true,
                JavascriptRendererPath = "..\\..\\..\\..\\packages\\PhantomJS.2.1.1\\tools\\phantomjs",
                IsSendingCookiesEnabled = true,
                MaxConcurrentThreads = 1,
                MaxPagesToCrawl = 1,
                JavascriptRenderingWaitTimeInMilliseconds = waitTime,
                CrawlTimeoutSeconds = 20
            };

            var crawler = new CrawlerX(config);
            bool javascriptHasRendered = false;
            crawler.PageCrawlCompleted += (sender, args) =>
            {
                javascriptHasRendered = args.CrawledPage.Content.Text.Contains(searchWord);
            };

            await crawler.CrawlAsync(uri);

            Assert.IsTrue(javascriptHasRendered);
        }

TestResult.png

Suresh Dayma

Jul 14, 2022, 5:57:46 AM
to Abot Web Crawler
In the configuration, set isHttpRequestAutomaticDecompressionEnabled="true"; this will fix the issue. The "binary" content is most likely a gzip/deflate-compressed response body that wasn't being decompressed.
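
When configuring the crawler in code rather than in a config file, the same setting would look roughly like this (a sketch based on Abot2's CrawlConfiguration; property and namespace names assumed from the library's conventions):

```csharp
using Abot2.Poco;

// Sketch: enable automatic gzip/deflate decompression so compressed
// responses are decoded to text instead of arriving as raw bytes.
var config = new CrawlConfiguration
{
    IsHttpRequestAutomaticDecompressionEnabled = true
};
```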
