Including headings and subheadings and labelling data fields

Lisa Evans

Feb 4, 2015, 8:41:08 AM
to opencorporat...@googlegroups.com
Hello,

I've taken on mission 609 to scrape this Icelandic financial entity data http://en.fme.is/supervision/supervised-entities/.
The data about the financial entities is grouped under headings and sometimes subheadings. I've written a scraper whose JSON output keeps the groupings and subgroupings. Is this what you want? Or do you just want the entities output, without the groupings?

Here's the code at present: http://objectgroup.org/scraper.py

Also, there are no headings for the columns of data on the web page I'm scraping, so I've given the entity data fields titles that describe them, I hope, fairly well. Is this the right thing to do, or do you have your own naming scheme?

Many thanks
Lisa

Peter Evans

Feb 4, 2015, 12:22:28 PM
to opencorporat...@googlegroups.com

Hi Lisa,


Thank you for putting time into scraping the data for Mission 609.


You’re definitely correct in your assessment of the data source. Regarding the groupings / sub-groupings, the way we usually do it is to have a field in each row which indicates the entity “type”. I would then use the bottom-level groupings (e.g. “Commercial Banks”) for the “type” field.
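
A minimal sketch of that flattening, assuming the scraper's current JSON nests entities under groups shaped like {"heading": ..., "entities": [...], "subgroups": [...]}. Those key names, the "Credit Institutions" heading and the entity name are made up for illustration; only “Commercial Banks” comes from the page.

```python
def flatten(groups):
    """Yield one flat row per entity, tagged with its bottom-level grouping."""
    for group in groups:
        # Entities listed directly under this heading take it as their type.
        for entity in group.get("entities", []):
            row = dict(entity)
            row["type"] = group["heading"]
            yield row
        # Entities under a subheading take the subheading instead.
        yield from flatten(group.get("subgroups", []))


# Hypothetical nested output; the real structure in scraper.py may differ.
nested = [
    {
        "heading": "Credit Institutions",  # made-up top-level heading
        "subgroups": [
            {
                "heading": "Commercial Banks",
                "entities": [{"name": "Example Bank hf."}],  # made-up entity
            },
        ],
    },
]

rows = list(flatten(nested))
```

Each row then carries its own “type”, so the nesting itself no longer needs to appear in the output.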


Regarding headers, it’s important to distinguish between primary data, which is data as close to the original page as possible, and transformed data, which is standardised and can be used more readily. In primary data it’s best to capture data (and headers) which are as close to the source as possible; if no headers are present then anything that makes sense can be used, as you have done. Transformed data is stricter about headers. You can find the headers for the simple licence schema here: http://turbot.opencorporates.com/docs/supported_data_types - as an example, the “type” header for types of entity would be “jurisdiction_classification” in simple licence output.
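
As a rough sketch of the difference, assuming a primary row is a plain dict: the only header mapping confirmed here is “type” to “jurisdiction_classification”. The function name, the "name" field and its value are hypothetical, and the remaining mappings would need to be filled in from the schema documentation linked above.

```python
# Maps primary-data headers to simple licence headers.
# Only this one entry is confirmed; add the rest from the schema docs.
FIELD_MAP = {
    "type": "jurisdiction_classification",
}


def to_transformed(primary_row, field_map=FIELD_MAP):
    """Return a new row holding only the mapped fields, under schema headers.

    The primary row is left untouched so it can still be output as-is.
    """
    return {
        field_map[key]: value
        for key, value in primary_row.items()
        if key in field_map
    }


# Hypothetical primary row; "name" is an illustrative free-form header.
primary = {"name": "Example Bank hf.", "type": "Commercial Banks"}
transformed = to_transformed(primary)
```

Fields without a mapping are dropped from the transformed row rather than passed through, since the transformed output is strict about headers; the primary row still carries everything that was scraped.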


There is an example of adding a licence transformer here: http://turbot.opencorporates.com/docs/examples#structured-bots - this allows us to output transformed data as well as primary data.


Hope that answers your questions. Do feel free to get in touch.


Best,

Peter


p.s. Chris says Hi!