PhantomJs crashes


vital...@gmail.com

Oct 28, 2018, 10:10:35 AM
to Abot Web Crawler
Hi,

My current config is (not sure how to fold it):

<abot>
  <crawlBehavior
    maxConcurrentThreads="6"
    maxPagesToCrawl="100000"
    maxPagesToCrawlPerDomain="0"
    maxPageSizeInBytes="0"
    userAgentString="Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko"
    crawlTimeoutSeconds="0"
    downloadableContentTypes="text/html, text/plain, text/xml"
    isUriRecrawlingEnabled="false"
    isExternalPageCrawlingEnabled="false"
    isExternalPageLinksCrawlingEnabled="false"
    httpServicePointConnectionLimit="200"
    httpRequestTimeoutInSeconds="15"
    httpRequestMaxAutoRedirects="7"
    isHttpRequestAutoRedirectsEnabled="true"
    isHttpRequestAutomaticDecompressionEnabled="false"
    isSendingCookiesEnabled="false"
    isSslCertificateValidationEnabled="false"
    isRespectUrlNamedAnchorOrHashbangEnabled="false"
    minAvailableMemoryRequiredInMb="0"
    maxMemoryUsageInMb="0"
    maxMemoryUsageCacheTimeInSeconds="0"
    maxCrawlDepth="1000"
    maxLinksPerPage="1000"
    isForcedLinkParsingEnabled="false"
    maxRetryCount="0"
    minRetryDelayInMilliseconds="0" />
  <authorization
    isAlwaysLogin="false"
    loginUser=""
    loginPassword="" />
  <politeness
    isRespectRobotsDotTextEnabled="true"
    isRespectMetaRobotsNoFollowEnabled="false"
    isRespectHttpXRobotsTagHeaderNoFollowEnabled="false"
    isRespectAnchorRelNoFollowEnabled="false"
    isIgnoreRobotsDotTextIfRootDisallowedEnabled="false"
    robotsDotTextUserAgentString="abot"
    maxRobotsDotTextCrawlDelayInSeconds="5"
    minCrawlDelayPerDomainMilliSeconds="2000" />
  <extensionValues>
    <add key="key1" value="value1" />
    <add key="key2" value="value2" />
  </extensionValues>
</abot>

<abotX
  maxConcurrentSiteCrawls="1"
  sitesToCrawlBatchSizePerRequest="25"
  minSiteToCrawlRequestDelayInSecs="15"
  isJavascriptRenderingEnabled="true"
  javascriptRenderingWaitTimeInMilliseconds="30000">
  <autoThrottling
    isEnabled="false"
    thresholdMed="5"
    thresholdHigh="10"
    thresholdTimeInMilliseconds="5000"
    minAdjustmentWaitTimeInSecs="30" />
  <autoTuning
    isEnabled="false"
    cpuThresholdMed="65"
    cpuThresholdHigh="85"
    minAdjustmentWaitTimeInSecs="30" />
  <accelerator
    concurrentSiteCrawlsIncrement="2"
    concurrentRequestIncrement="2"
    delayDecrementInMilliseconds="2000"
    minDelayInMilliseconds="0"
    concurrentRequestMax="10"
    concurrentSiteCrawlsMax="3" />
  <decelerator
    concurrentSiteCrawlsDecrement="2"
    concurrentRequestDecrement="2"
    delayIncrementInMilliseconds="2000"
    maxDelayInMilliseconds="15000"
    concurrentRequestMin="1"
    concurrentSiteCrawlsMin="1" />
</abotX>


The exception I'm getting is:

[2018-10-28 10:44:28,513] [7] [WARN ] - Error occurred while rendering javascript for page [%Crawling_page%], [NReco.PhantomJS.PhantomJSException: PhantomJS exit code -1073741819: Fatal Windows exception, code 0xc0000005.
PhantomJS has crashed. Please read the bug reporting guide at
<http://phantomjs.org/bug-reporting.html> and file a bug report.
at NReco.PhantomJS.PhantomJS.CheckExitCode(Int32 exitCode, List`1 errLines)
at NReco.PhantomJS.PhantomJS.Run(String jsFile, String[] jsArgs, Stream inputStream, Stream outputStream)
at NReco.PhantomJS.PhantomJS.RunScript(String javascriptCode, String[] jsArgs, Stream inputStream, Stream outputStream)
at AbotX.Core.PhantomJsRenderer.RunPhantomJsWithStaticWaitTime(Stream inputStream, MemoryStream outputStream)
at AbotX.Core.PhantomJsRenderer.Render(Uri pageUri, PageContent pageContent, CookieCollection cookieCollection)] - [AbotLogger]

sjdi...@gmail.com

Oct 28, 2018, 11:43:55 AM
to vital...@gmail.com, Abot Web Crawler
Hi, 

Can you give more context? Does this fail intermittently or on every link? Can you give us a reproducible set of steps? The only thing that sticks out in your configuration is that javascriptRenderingWaitTimeInMilliseconds is set a little higher than I'm used to seeing.

Steven


vital...@gmail.com

Oct 28, 2018, 12:20:12 PM
to Abot Web Crawler
Hi,

It's not failing every time; it seems pretty much random. Originally it happened too often, which is why I increased javascriptRenderingWaitTimeInMilliseconds. Now crashes are quite rare, but they still happen from time to time.

Actually, you can reproduce the issue if you run a crawler against https://www.target.com/c/clearance/-/N-5q0ga?lnk=dNav_clearance. That website relies heavily on JS, so the HTML loads pretty fast, but the JS might take 12-15 seconds to finish: http://prntscr.com/lbg8fh
So you should be able to hit the issue either on the first page or on one of the subsequent ones (it should take no more than 5-7 pages to reproduce).
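[Editor's note] For anyone trying to reproduce, a minimal crawl sketch in C#. This is an assumption-laden sketch, not AbotX's documented usage: the `CrawlerX` type is real in AbotX, but the namespace, the event name, and config pickup from App.config are assumed to mirror Abot's `PoliteWebCrawler` API of that era; adjust to your AbotX version. It also assumes the `<abot>`/`<abotX>` sections above are present in App.config.

```csharp
using System;
using Abot.Poco;        // CrawledPage
using AbotX.Crawler;    // assumed namespace for CrawlerX; check your AbotX version

class Repro
{
    static void Main()
    {
        // Assumed: CrawlerX reads the <abot>/<abotX> config sections from
        // App.config by default; otherwise pass a CrawlConfigurationX explicitly.
        var crawler = new CrawlerX();

        crawler.PageCrawlCompleted += (sender, e) =>
        {
            CrawledPage page = e.CrawledPage;
            Console.WriteLine("{0} -> {1}",
                page.Uri,
                page.WebException == null ? "OK" : page.WebException.Message);
        };

        // Heavily JS-driven page: the HTML arrives fast, but the JS can take
        // 12-15 seconds to settle, which is where the PhantomJS render crashes.
        crawler.Crawl(new Uri(
            "https://www.target.com/c/clearance/-/N-5q0ga?lnk=dNav_clearance"));
    }
}
```

With isJavascriptRenderingEnabled="true", the PhantomJS render runs for every crawled page, so per the report above the crash should surface within the first 5-7 pages.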
