My current config is below (not sure how to format it more compactly):
<abot>
<crawlBehavior maxConcurrentThreads="6" maxPagesToCrawl="100000" maxPagesToCrawlPerDomain="0" maxPageSizeInBytes="0"
userAgentString="Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko" crawlTimeoutSeconds="0"
downloadableContentTypes="text/html, text/plain, text/xml" isUriRecrawlingEnabled="false" isExternalPageCrawlingEnabled="false"
isExternalPageLinksCrawlingEnabled="false" httpServicePointConnectionLimit="200" httpRequestTimeoutInSeconds="15"
httpRequestMaxAutoRedirects="7" isHttpRequestAutoRedirectsEnabled="true" isHttpRequestAutomaticDecompressionEnabled="false"
isSendingCookiesEnabled="false" isSslCertificateValidationEnabled="false" isRespectUrlNamedAnchorOrHashbangEnabled="false"
minAvailableMemoryRequiredInMb="0" maxMemoryUsageInMb="0" maxMemoryUsageCacheTimeInSeconds="0" maxCrawlDepth="1000" maxLinksPerPage="1000"
isForcedLinkParsingEnabled="false" maxRetryCount="0" minRetryDelayInMilliseconds="0" />
<authorization isAlwaysLogin="false" loginUser="" loginPassword="" />
<politeness isRespectRobotsDotTextEnabled="true" isRespectMetaRobotsNoFollowEnabled="false" isRespectHttpXRobotsTagHeaderNoFollowEnabled="false"
isRespectAnchorRelNoFollowEnabled="false" isIgnoreRobotsDotTextIfRootDisallowedEnabled="false" robotsDotTextUserAgentString="abot"
maxRobotsDotTextCrawlDelayInSeconds="5" minCrawlDelayPerDomainMilliSeconds="2000" />
<extensionValues>
<add key="key1" value="value1" />
<add key="key2" value="value2" />
</extensionValues>
</abot>
<abotX
maxConcurrentSiteCrawls="1"
sitesToCrawlBatchSizePerRequest="25"
minSiteToCrawlRequestDelayInSecs="15"
isJavascriptRenderingEnabled="true"
javascriptRenderingWaitTimeInMilliseconds="30000"
>
<autoThrottling
isEnabled="false"
thresholdMed="5"
thresholdHigh="10"
thresholdTimeInMilliseconds="5000"
minAdjustmentWaitTimeInSecs="30"
/>
<autoTuning
isEnabled="false"
cpuThresholdMed="65"
cpuThresholdHigh="85"
minAdjustmentWaitTimeInSecs="30"
/>
<accelerator
concurrentSiteCrawlsIncrement="2"
concurrentRequestIncrement="2"
delayDecrementInMilliseconds="2000"
minDelayInMilliseconds="0"
concurrentRequestMax="10"
concurrentSiteCrawlsMax="3"
/>
<decelerator
concurrentSiteCrawlsDecrement="2"
concurrentRequestDecrement="2"
delayIncrementInMilliseconds="2000"
maxDelayInMilliseconds="15000"
concurrentRequestMin="1"
concurrentSiteCrawlsMin="1"
/>
</abotX>
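For reference, here is roughly how the config above gets consumed on my side (a simplified sketch in the plain Abot 1.x style; the real crawl goes through AbotX so that the <abotX> javascript settings apply - the AbotX-specific class names are left out of the sketch):

// Simplified sketch: loads the <abot> section from app.config and runs a single-site crawl.
// The production code goes through AbotX (so the <abotX> settings apply); this only shows
// the plain-Abot part.
using System;
using Abot.Core;
using Abot.Crawler;
using Abot.Poco;

class CrawlSketch
{
    static void Main()
    {
        // Reads the <abot> section shown above into a CrawlConfiguration.
        CrawlConfiguration config = AbotConfigurationSectionHandler.LoadFromXml().Convert();

        var crawler = new PoliteWebCrawler(config);
        crawler.PageCrawlCompleted += (sender, e) =>
            Console.WriteLine("Crawled: " + e.CrawledPage.Uri);

        CrawlResult result = crawler.Crawl(
            new Uri("https://www.target.com/c/clearance/-/N-5q0ga?lnk=dNav_clearance"));
        Console.WriteLine(result.ErrorOccurred
            ? "Crawl error: " + result.ErrorException.Message
            : "Crawl completed");
    }
}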
The exception I'm getting is:
[2018-10-28 10:44:28,513] [7] [WARN ] - Error occurred while rendering javascript for page [%Crawling_page%], [NReco.PhantomJS.PhantomJSException: PhantomJS exit code -1073741819: Fatal Windows exception, code 0xc0000005.
PhantomJS has crashed. Please read the bug reporting guide at
<http://phantomjs.org/bug-reporting.html> and file a bug report.
at NReco.PhantomJS.PhantomJS.CheckExitCode(Int32 exitCode, List`1 errLines)
at NReco.PhantomJS.PhantomJS.Run(String jsFile, String[] jsArgs, Stream inputStream, Stream outputStream)
at NReco.PhantomJS.PhantomJS.RunScript(String javascriptCode, String[] jsArgs, Stream inputStream, Stream outputStream)
at AbotX.Core.PhantomJsRenderer.RunPhantomJsWithStaticWaitTime(Stream inputStream, MemoryStream outputStream)
at AbotX.Core.PhantomJsRenderer.Render(Uri pageUri, PageContent pageContent, CookieCollection cookieCollection)] - [AbotLogger]
It's not failing every time; it seems to be pretty much random. Originally it happened too often, which is why I increased javascriptRenderingWaitTimeInMilliseconds - now crashes are quite rare, but they still happen from time to time.
Actually, you can reproduce the issue if you run a crawler against https://www.target.com/c/clearance/-/N-5q0ga?lnk=dNav_clearance. That website relies heavily on JS, so the HTML loads pretty fast, but the JS can take 12-15 seconds to finish: http://prntscr.com/lbg8fh
So you should hit the issue either on the first page or on one of the subsequent ones (it shouldn't take more than 5-7 pages to reproduce).
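For now the only mitigation I can think of is to retry the render when PhantomJS dies, along the lines of the sketch below. It uses the same NReco.PhantomJS Run overload that appears in the stack trace; the script file, arguments and retry count are placeholders of mine, not anything from AbotX internals:

// Rough workaround sketch: retry the PhantomJS render when it exits with a crash code.
// jsFile / jsArgs / maxAttempts are placeholders, not AbotX internals.
using System;
using System.IO;
using System.Threading;
using NReco.PhantomJS;

static class RenderRetry
{
    public static byte[] RunWithRetry(string jsFile, string[] jsArgs, byte[] input, int maxAttempts = 3)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                var phantom = new PhantomJS();
                using (var inStream = new MemoryStream(input))
                using (var outStream = new MemoryStream())
                {
                    // Same overload that fails in the stack trace above.
                    phantom.Run(jsFile, jsArgs, inStream, outStream);
                    return outStream.ToArray();
                }
            }
            catch (PhantomJSException) when (attempt < maxAttempts)
            {
                // PhantomJS crashed (e.g. exit code -1073741819 / 0xc0000005);
                // back off briefly and try again.
                Thread.Sleep(TimeSpan.FromSeconds(5 * attempt));
            }
        }
    }
}

That only helps if the render is driven directly, though; with AbotX itself I can only catch the WARN and re-crawl the page, so any better idea is very welcome.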