Feature Request: Conditional Logic

1stClass

Mar 10, 2016, 1:15:41 PM
to Web Scraper
Hi Mārtiņš Balodis,

I wanted to personally extend my thanks for providing this product to the community. I have been looking for a product that could scrape the pages of an AJAX-enabled site with authentication controls for many years. I am sure that some are available, but none of the many products I tried were successful. I also love the fact that your scraper is an open source project. I found the project just two days ago, and it has been fairly easy to understand and use once you get started, but I can certainly see how users who are unfamiliar with the HTML DOM and with programming in general would have trouble getting started with Web Scraper. Your videos are helpful, but they are created from the perspective of an expert user, not a new user who is totally unfamiliar with the software.

The reason I am writing now is to request a new feature to allow conditional scraping: perhaps a new element that can compare the result of a text or element-attribute selector against a constant, then execute one selector if the comparison is true and another if it is false. It would enable more efficient scraping of data from many sites.

Also, when scraping data with selectors that match multiple elements, the scraper currently generates a new record for each element. I would like to request an option to make all of that data part of the same record. The current behaviour works fine for your example sites, where the first set of links are pages listing multiple products for which you want separate records. However, if you are scraping a list of links in a table (e.g. user names) that lead to detail pages containing a list of information (e.g. user preferences), it would be better to have all of the preferences within one record per user name, rather than a separate record for each preference.

It would be nice to load records from a file as the input for the site map or the start pages of a scrape. That would enable users to scrape one site, download the records and process them, and use that data to upload a new sitemap to begin a new scrape based upon the results from the first scrape.

I downloaded the code for the project today. I am not sure when I will have an opportunity to look at it, but I would love to be able to help you with this project.

Thanks

Mārtiņš Balodis

Mar 14, 2016, 5:16:27 AM
to 1stClass, Web Scraper
Hi,
Could you give an example of the conditional scraping? Why can't you create both selectors and use the data that you need after exporting it?

I have thought a lot about the record grouping feature. Right now there is the Grouped selector, which can extract multiple values and store them in a single record in JSON format. The only limitation of this selector is that it cannot group multiple selectors into a single record. For example, you might want to extract a table of item attributes and store it in a single record.
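
To make the distinction concrete, here is a plain-Python sketch (nothing specific to the extension; the field names are just for illustration) of one record per matched element versus all values grouped as JSON inside a single record, similar to what the Grouped selector produces:

```python
import json

# Hypothetical scraped values for one user: a name plus several
# preference elements found on the user's detail page.
user = "alice"
preferences = ["dark-mode", "weekly-digest", "2fa"]

# Current multiple-element behaviour: one record per element.
per_element = [{"user": user, "preference": p} for p in preferences]

# Grouped behaviour: all values stored as a JSON string inside
# one record for the user.
grouped = {"user": user, "preferences": json.dumps(preferences)}

print(per_element)
print(grouped)
```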

How did you mean that records scraped by one sitemap could be used in another one? Did you mean uploading a list of start urls for a sitemap?

1stClass

Mar 15, 2016, 5:56:34 PM
to Web Scraper, 1stclas...@gmail.com
The two most basic elements of any programming language are loops and conditional statements. You have enabled loops by allowing link selectors to be their own parent and execute recursively, but there is no way to break out of that loop. If you are scraping a site with thousands of pages, it is more likely that your browser will crash before the scrape ends. It would be nice to be able to compare the text selector for a page number against a literal and, if it is greater than or equal to that literal, execute one link selector (or nothing at all), else execute another link selector.

Sometimes it would be much more efficient if the link selector could accept the text of an actual link to a page, instead of having to navigate there by clicking buttons. There have been multiple occasions where I needed to start scraping on one page to get some general data before going to another page with rows of data. The general data is added to each row, which is redundant but acceptable. I guess what I would really like is a feature to combine multiple scrapes into one, so that I can scrape one row of general data from one start page, then go to another start page and scrape multiple rows from multiple pages. But being able to jump from page to page instead of navigating with links would be an improvement.

As an example, I am scraping a site by creating 20 duplicate link selectors which all select the next page, so that after 20 pages the scrape stops without me having to stop it manually. If there were a way to check for page 20 and either select the next page or not, only one conditional link selector executing itself recursively would be necessary.
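
The stop-at-page-20 idea boils down to a tiny predicate. A plain-Python sketch, just to make the requested behaviour concrete (the function name is made up for illustration and is not part of Web Scraper):

```python
def should_follow_next(page_label, limit=20):
    """Follow the "next page" link only while the current page
    number is below the limit; stop on anything non-numeric."""
    try:
        return int(page_label) < limit
    except ValueError:
        return False

# With a check like this, one self-referencing link selector could
# replace the 20 duplicated selectors described above.
print(should_follow_next("19"))  # keep going
print(should_follow_next("20"))  # stop
```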

I am using a workaround for the grouped selector which works for me, because it is a limited list. Instead of enabling "multiple" on an element selector, I just created a separate selector for each item that I needed in the row.

Yes, I did mean uploading a list of start urls. I was scraping a game site which has many different servers, but they are all identical, and the same sitemap could be used for each server except for a unique start url.

I would think that it would be nice to be able to load the selectors from a file anyway, then all of your sitemaps could be stored in a convenient folder on your computer. Perhaps you could execute an entire folder of sitemaps, loading the sitemap, performing the scrape, then downloading the records to a results file.

If you could then schedule those scrapes to run daily or weekly, I could see the potential for using your app to scrape many types of data that is updated on a regular basis, but that would begin to interfere with the services which your professional version provides.

Mārtiņš Balodis

Mar 24, 2016, 6:08:39 AM
to 1stClass, Web Scraper
I see where you are going with conditionals. This might be resolved if the CSS selector allowed selecting elements with a regex. For example, pages 1 to 10 could be selected with a regex like this: ^([0-9]|10)$. I added a feature request for this: https://github.com/martinsbalodis/web-scraper-chrome-extension/issues/147
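
The proposed regex is easy to check in plain Python (this is only an illustration of the pattern itself, not extension behaviour; note that as written it would also match a page labelled 0):

```python
import re

# The pattern suggested above for matching pages 1 to 10.
page_pattern = re.compile(r"^([0-9]|10)$")

labels = ["1", "5", "10", "11", "20", "Next"]
matched = [s for s in labels if page_pattern.match(s)]
print(matched)  # ['1', '5', '10'] -- "11", "20" and "Next" are filtered out
```

A regex-aware selector would therefore simply stop finding a "next page" link past page 10, ending the recursive loop.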

The start url upload feature will make it easier to split a job into multiple sitemaps. But right now you can manually edit the exported JSON and import a sitemap with a lot of urls.
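
Editing the exported JSON can also be scripted. A hypothetical sketch for the many-identical-servers case mentioned earlier (the field names "_id", "startUrl" and "selectors" are assumptions about the exported sitemap format, and the urls are made up):

```python
import json

# Hypothetical base sitemap: one set of selectors reused across servers.
base_sitemap = {
    "_id": "game-servers",
    "startUrl": [],
    "selectors": [
        {"id": "player", "type": "SelectorText",
         "parentSelectors": ["_root"], "selector": "td.name",
         "multiple": True},
    ],
}

# Same sitemap, many servers: list every start url, then import the JSON.
base_sitemap["startUrl"] = [
    "http://server{}.example.com/players".format(n) for n in range(1, 6)
]

print(json.dumps(base_sitemap, indent=2))
```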
