Web Scraper crashes when we try to export CSV after scraping for hours


Michael Stefani

Mar 20, 2015, 7:31:16 AM
to web-s...@googlegroups.com
Hi,
we are trying to scrape several bulletin boards for a university project. The scraping itself seems to work like a charm, but unfortunately we are not able to view or export the data.
When the scraper runs for more than, let's say, 1.5 hours, Web Scraper quits on its own, without any error message.
Is there any way to get to the CSV file?
Best regards, Michael

Mārtiņš Balodis

Mar 26, 2015, 3:24:56 AM
to Michael Stefani, web-scraper
Hi,
If the browser crashes, the data is still stored; you can export everything that was scraped up to the crash. Just remember that starting a new scraping job will erase all the previous data.


Michael Stefani

Mar 26, 2015, 5:25:45 AM
to web-s...@googlegroups.com, stefan...@gmail.com
Thank you for your reply. There might be a misunderstanding: it's not Chrome that crashes, it's the plugin itself, so I can't actually get to the data. When I click "Export data as CSV", the line "Waiting for the download button to appear." shows, but the plugin crashes before the "Download now" link pops up. This usually takes 2-3 minutes. I can restart the plugin from the menu, but it happens over and over again.
I took a look at the memory consumption: the crash happens at around 850 MB to 1 GB, although more than 3 GB are still available. Is there a limit built in?

Here is one of our sitemaps, but it happens with others too.

{"startUrl":"http://mein.dbna.de/webforum/","selectors":[{"parentSelectors":["_root"],"type":"SelectorLink","multiple":true,"id":"boards","selector":"h2 a","delay":""},{"parentSelectors":["boards","board_pagination"],"type":"SelectorLink","multiple":true,"id":"board_pagination","selector":"div.col-inner > a:nth-of-type(n+3)","delay":""},{"parentSelectors":["boards","board_pagination"],"type":"SelectorLink","multiple":true,"id":"thread","selector":"a.topictitle","delay":""},{"parentSelectors":["thread","thread_pagination"],"type":"SelectorLink","multiple":true,"id":"thread_pagination","selector":"div.col-inner > a","delay":""},{"parentSelectors":["thread","thread_pagination"],"type":"SelectorElement","multiple":true,"id":"Post","selector":"table.table > tbody > tr:nth-of-type(n+3) ","delay":""},{"parentSelectors":["Post"],"type":"SelectorText","multiple":false,"id":"nickname","selector":"a:nth-of-type(2)","regex":"","delay":""},{"parentSelectors":["Post"],"type":"SelectorText","multiple":false,"id":"date","selector":"tr span.postdetails","regex":"","delay":""},{"parentSelectors":["Post"],"type":"SelectorText","multiple":false,"id":"content","selector":"span.postbody > span:nth-of-type(1)","regex":"","delay":""}],"_id":"dbna"}

Best regards, Michael

Mārtiņš Balodis

Mar 27, 2015, 11:11:44 AM
to Michael Stefani, web-scraper
Hi,
There is no limit in the extension. The CSV export uses a lot of memory because the CSV file has to be created in memory, and only then can it be downloaded. This is a limitation of the JavaScript language and the Chrome browser.
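
For illustration, here is a minimal sketch of that in-memory export pattern. It is not the extension's actual source, and the row format is assumed; it only shows why the whole file must exist in memory before the download link can appear:

// Minimal sketch, not Web Scraper's actual code.
// Assumes rows is a non-empty array of flat objects, e.g.
// [{nickname: "...", date: "...", content: "..."}]
function exportCsv(rows) {
  var header = Object.keys(rows[0]).join(",");
  var lines = rows.map(function (row) {
    return Object.keys(row).map(function (key) {
      // quote every field and escape embedded quotes
      return '"' + String(row[key]).replace(/"/g, '""') + '"';
    }).join(",");
  });
  // the entire file is built as one string in memory...
  var csv = [header].concat(lines).join("\r\n");
  // ...and copied into a Blob, roughly doubling peak memory
  var blob = new Blob([csv], {type: "text/csv"});
  // only at this point can the "Download now" link be shown
  return URL.createObjectURL(blob);
}

With hundreds of thousands of scraped posts, the intermediate string plus the Blob copy can add up to the memory usage you observed.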

Can you send me a screenshot of any errors you are receiving? The errors might happen in the extension's background page or in the Web Scraper developer tools tab. Here is how you can get the errors.

To check the background page:

1. Go to the "Manage extensions" panel
2. Check "Developer mode"
3. Find Web Scraper
4. Open the background page
5. Open the console tab
6. Export the CSV and check for errors in the console

To check the Web Scraper developer tools tab:

1. Open the Web Scraper tab in developer tools
2. Click and hold the dock button to undock developer tools into a separate window
3. Open another developer tools window by pressing Ctrl+Shift+I in the Web Scraper tab (on a Mac it's Cmd+Opt+I)
4. Export the CSV from the first developer tools window. Any errors that happen in the Web Scraper tab will be printed in the other developer tools window.

Amit Rai

May 5, 2016, 8:32:12 AM
to Web Scraper, stefan...@gmail.com
Hi All, 
Has anyone managed to figure this one out? I scraped a lot of data (it took almost 24 hours), and the plug-in said scraping was finished.
However, I can't manage to download the CSV file. I get the message "Waiting for the download button to appear." and nothing happens (sometimes I get an error that the plug-in has crashed).
Is there any way to access the data? I'd hate to run the scraper again.

Amit 

Mārtiņš Balodis

May 10, 2016, 12:47:12 PM
to Amit Rai, Web Scraper, Michael Stefani
Hi,
You can split the sitemap into smaller parts and re-scrape it. Or you can try to export the data to an external CouchDB database. Here is a post with information about that: https://groups.google.com/forum/#!searchin/web-scraper/pouchdb/web-scraper/XmB0MCHd78I/ASYf04dFf9AJ
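
For example, a sitemap like the dbna one above could be split per board: drop the top-level "boards" selector, re-parent "board_pagination" and "thread" to "_root", and make one copy per board with that board's URL as the start URL. A sketch under those assumptions (the start URL is a placeholder to fill in):

{
  "_id": "dbna-board-1",
  "startUrl": "http://mein.dbna.de/webforum/BOARD_1_URL",
  "selectors": [
    {"parentSelectors":["_root","board_pagination"],"type":"SelectorLink","multiple":true,"id":"board_pagination","selector":"div.col-inner > a:nth-of-type(n+3)","delay":""},
    {"parentSelectors":["_root","board_pagination"],"type":"SelectorLink","multiple":true,"id":"thread","selector":"a.topictitle","delay":""},
    {"parentSelectors":["thread","thread_pagination"],"type":"SelectorLink","multiple":true,"id":"thread_pagination","selector":"div.col-inner > a","delay":""},
    {"parentSelectors":["thread","thread_pagination"],"type":"SelectorElement","multiple":true,"id":"Post","selector":"table.table > tbody > tr:nth-of-type(n+3)","delay":""},
    {"parentSelectors":["Post"],"type":"SelectorText","multiple":false,"id":"nickname","selector":"a:nth-of-type(2)","regex":"","delay":""},
    {"parentSelectors":["Post"],"type":"SelectorText","multiple":false,"id":"date","selector":"tr span.postdetails","regex":"","delay":""},
    {"parentSelectors":["Post"],"type":"SelectorText","multiple":false,"id":"content","selector":"span.postbody > span:nth-of-type(1)","regex":"","delay":""}
  ]
}

Each part's CSV then stays small enough to build in memory. Export the data between runs, since starting a new scraping job erases the previous results.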