Regarding json template stored for annotated page


Shivam Malhotra

Dec 17, 2015, 1:16:46 AM
to portia-scraper
Hi everyone,

I have a small question about the source code that I could not figure out.
I have noticed (please correct me if I am wrong) that for each annotated page, a large JSON file is generated in the projects directory, which then acts as a template for matching elements on similar pages.
Can someone please explain how this file is parsed?
Specifically, where in the Portia source tree is the code that reads this JSON file for matching?
It would also be very helpful if you could give some idea of how the comparison actually works.

Thanks
Shivam

David Bengoa Rocandio

Dec 17, 2015, 5:42:59 AM
to Shivam Malhotra, portia-scraper
Hi Shivam,

You are correct. Parsing the annotated HTML and matching it against the scraped HTML is done by Scrapely ( https://github.com/scrapy/scrapely/ ). The Architecture section of its README explains how it works and the theoretical background behind it.
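
To give a rough feel for the kind of matching involved, here is a toy sketch in Python. It is not Scrapely's code, and the helper names (tokenize, learn_context, extract) are made up for illustration. It assumes a much simplified model: the template remembers a few tokens before and after an annotated region, and the same context is then searched for in a new page. The real implementation is more elaborate, so treat this only as an intuition aid.

```python
import re


def tokenize(html):
    """Split HTML into tag tokens and text tokens, dropping empty pieces."""
    parts = re.split(r"(<[^>]+>)", html)
    return [p.strip() for p in parts if p.strip()]


def learn_context(template_tokens, start, end, width=2):
    """Record the tokens just before and after the annotated span [start, end)."""
    prefix = template_tokens[max(0, start - width):start]
    suffix = template_tokens[end:end + width]
    return prefix, suffix


def extract(page_tokens, prefix, suffix):
    """Return the tokens between the first prefix match and the next suffix match."""
    n, m = len(prefix), len(suffix)
    for i in range(len(page_tokens) - n + 1):
        if page_tokens[i:i + n] != prefix:
            continue
        for j in range(i + n, len(page_tokens) - m + 1):
            if page_tokens[j:j + m] == suffix:
                return page_tokens[i + n:j]
    return None  # context not found in this page


# Hypothetical annotated page: the product name (token index 2) is annotated.
template = tokenize(
    '<div><h1 class="name">Red Chair</h1><span class="price">$10</span></div>'
)
prefix, suffix = learn_context(template, start=2, end=3)

# A similar page with different content and an extra wrapper element.
page = tokenize(
    '<html><div><h1 class="name">Blue Table</h1>'
    '<span class="price">$25</span></div></html>'
)
result = extract(page, prefix, suffix)
print(result)
```

In this sketch the stored "template" is just the prefix and suffix token lists; the JSON file Portia writes holds the annotated page itself, which Scrapely then processes.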

Regards,
David
