parse HTML query

Jevon, Graham

May 10, 2021, 12:22:58 PM

to openr...@googlegroups.com

Does anyone know how I can extract the text highlighted in yellow?

My various expression attempts have failed. The above expression is as close as I've got. Even so, I suspect that using a key rather than the index of 65 would be a more reliable way to get this far (as I plan to repeat this action regularly).

Thanks

Graham

******************************************************************************************************************

Experience the British Library online at www.bl.uk

The British Library’s latest Annual Report and Accounts : www.bl.uk/aboutus/annrep/index.html

Help the British Library conserve the world's knowledge. Adopt a Book. www.bl.uk/adoptabook

The Library's St Pancras site is WiFi - enabled

*****************************************************************************************************************

The information contained in this e-mail is confidential and may be legally privileged. It is intended for the addressee(s) only. If you are not the intended recipient, please delete this e-mail and notify the postm...@bl.uk : The contents of this e-mail must not be disclosed or copied without the sender's consent.

The statements and opinions expressed in this message are those of the author and do not necessarily reflect those of the British Library. The British Library does not take any responsibility for the views of the author.

*****************************************************************************************************************

Think before you print

Owen Stephens

May 10, 2021, 12:45:51 PM

to OpenRefine

Hi Graham,

You can select based on an attribute using a construct like

select("tag[attr='attribute value']")

So the approach I'd take would be to use something like

filter(value.parseHtml().select("div[class=field]"),v,v.select("div[class=field__label]")[0].ownText()=="Grant holder(s):")[0].select("div[class=field__item]")[0].ownText()

This assumes there is only ever one div with class "field__label" whose text is "Grant holder(s):", and that its enclosing "field" div only ever contains a single div with class "field__item". If those assumptions do not hold, you might need to tweak some of the places where I use [0].
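
For anyone who wants to try the same label-keyed lookup outside OpenRefine, here is a minimal Python sketch of the idea using only the standard library. It is not the GREL expression itself, just an illustration of selecting a field by its label text rather than by position. The names FieldParser, field_items and SAMPLE are made up for this example, and the sample HTML is a guess at the page structure based on the class names above.

```python
from html.parser import HTMLParser


class FieldParser(HTMLParser):
    """Collect {label: [items]} from <div class="field"> blocks (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.div_classes = []  # class attribute of each currently open <div>
        self.fields = {}       # label text -> list of item texts
        self.label = None      # label of the field block being read
        self.buffer = []       # text pieces of the current label or item

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        cls = dict(attrs).get("class") or ""
        self.div_classes.append(cls)
        if cls == "field":
            self.label = None  # a new field block starts with no label yet
        elif cls in ("field__label", "field__item"):
            self.buffer = []

    def handle_data(self, data):
        # Keep text only while the innermost open div is a label or an item.
        # Unlike GREL's ownText(), text inside non-div children (e.g. <a>) is kept too.
        if self.div_classes and self.div_classes[-1] in ("field__label", "field__item"):
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "div" or not self.div_classes:
            return
        cls = self.div_classes.pop()
        text = "".join(self.buffer).strip()
        if cls == "field__label":
            self.label = text
            self.fields.setdefault(text, [])
        elif cls == "field__item" and self.label is not None:
            self.fields[self.label].append(text)


def field_items(html, label):
    """Return every field__item text that sits under the given label."""
    parser = FieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields.get(label, [])


# Assumed page structure; the real page may differ.
SAMPLE = """
<div class="field"><div class="field__label">Funder:</div>
<div class="field__item">Example Fund</div></div>
<div class="field"><div class="field__label">Grant holder(s):</div>
<div class="field__item">A. Example</div></div>
"""

print(field_items(SAMPLE, "Grant holder(s):"))  # ['A. Example']
```

Because the lookup goes by label text, it keeps working if the page gains or loses other fields, which is the weakness of relying on a fixed index such as 65. A label that is not on the page gives an empty list rather than an error.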

Hope that helps

Owen

Jevon, Graham

May 11, 2021, 5:52:20 AM

to openr...@googlegroups.com

Thanks Owen

That works, and that will be really helpful for me to adapt and apply to other situations.

Thanks

Graham

