Extracting strings from HTML tag

Miquel Centelles

Dec 26, 2021, 7:34:57 AM

to OpenRefine

Hi,

I have a series of tags containing the following content pattern:

Titulo Original: Entrevista a Ricardo Alarcón de Quesada

Dirección: <a href="cineasta.aspx?cod=99">Daniel Díaz Torres</a>

País(es): Cuba

Idioma Original: Español

Formato: VHS

Categoría: Documental

Tipo: Color

Duración: 40 min.

Año de producción: 2004

I need to split the content of each of the nine subtags into a separate column, subject to the following conditions.

a) The label at the start of each subtag ("Titulo Original: " in the first instance) must be deleted, so that only the value following it is transferred to the corresponding column. That is: "Entrevista a Ricardo Alarcón de Quesada" for the first instance.

b) The second instance

Dirección: <a href="cineasta.aspx?cod=99">Daniel Díaz Torres</a>

should be split into two different columns: one for the href content (cineasta.aspx?cod=99) and another for the link text (Daniel Díaz Torres).
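Condition (b) can be sketched outside OpenRefine as follows. This is a minimal Python illustration, not a GREL expression; the function name `split_link` is hypothetical, and the pattern assumes each link has a single double-quoted href attribute and no nested tags.

```python
import re

# Sample cell taken from the post above.
cell = 'Dirección: <a href="cineasta.aspx?cod=99">Daniel Díaz Torres</a>'

# Assumption: one href attribute in double quotes, no tags nested inside the link.
LINK = re.compile(r'<a\s+href="([^"]*)"\s*>([^<]*)</a>')

def split_link(text):
    """Return (href, link_text), or (None, None) when the text holds no link."""
    m = LINK.search(text)
    if m is None:
        return None, None
    return m.group(1), m.group(2).strip()

href, name = split_link(cell)
print(href)  # cineasta.aspx?cod=99
print(name)  # Daniel Díaz Torres
```

Inside OpenRefine, GREL's HTML functions should cover the same ground, for example value.parseHtml().select("a")[0].htmlAttr("href") for the link target and value.parseHtml().select("a")[0].htmlText() for the link text.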

Any help?

Miquel Centelles

Dec 28, 2021, 11:46:33 AM

to OpenRefine

Hi,

I've been working on it and I have a preliminary solution:

1) First, I deleted all the HTML tags and attributes except for the separator tags and the links (i.e. <a href="cineasta.aspx?cod=99">). So, the text has become:

Titulo Original: Entrevista a Ricardo Alarcón de Quesada Dirección: <a href="cineasta.aspx?cod=99">Daniel Díaz Torres</a> País(es): Cuba Idioma Original: Español Formato: VHS Categoría: Documental Tipo: Color Duración: 40 min. Año de producción: 2004

2) Now I can use the remaining separator elements as milestones that mark the end of each substring. For example, for the first metadata field, I use the formula:

value.find(/Titulo Original: (.+ Di)/).join(',')

And I get:

Titulo Original: Entrevista a Ricardo Alarcón de Quesada Di

And then I erase "Titulo Original:" and " Di" in a new column.

I use value.find(/Dirección: (.+ P)/).join(',') for the Dirección metadata field, and so on.

As all the milestones are identical, I have to add the first letter(s) of the label that follows for each metadata field. I wonder if there is a way to say "until the closest occurrence", so that I can avoid adding extra characters to the substring. This would also help in cases where some of the subsequent metadata fields are missing from the text.

Any help with this will be appreciated.
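"Until the closest occurrence" is usually written with a lazy quantifier (.+? instead of .+), optionally followed by a lookahead that names the boundary without consuming it. The sketch below is plain Python rather than GREL and rests on two assumptions: the nine labels are known in advance, and no value contains a label followed by a colon. The names LABELS and extract are hypothetical. GREL regular expressions are Java regular expressions, which also support lazy quantifiers and lookaheads, so the same pattern should carry over to value.find().

```python
import re

# Flattened text from the post above.
text = ('Titulo Original: Entrevista a Ricardo Alarcón de Quesada '
        'Dirección: <a href="cineasta.aspx?cod=99">Daniel Díaz Torres</a> '
        'País(es): Cuba Idioma Original: Español Formato: VHS '
        'Categoría: Documental Tipo: Color Duración: 40 min. '
        'Año de producción: 2004')

# Assumption: the full set of labels is known in advance.
LABELS = ["Titulo Original", "Dirección", "País(es)", "Idioma Original",
          "Formato", "Categoría", "Tipo", "Duración", "Año de producción"]

# Alternation of all labels; any of them followed by ":" ends the previous value.
NEXT_LABEL = "|".join(re.escape(label) for label in LABELS)

def extract(label, s):
    """Return the value after `label`, up to the closest next label or end of text."""
    # (.+?) is lazy: it stops at the first position where the lookahead succeeds.
    pattern = re.escape(label) + r":\s*(.+?)\s*(?=(?:" + NEXT_LABEL + r"):|$)"
    m = re.search(pattern, s)
    return m.group(1) if m else None

print(extract("Titulo Original", text))  # Entrevista a Ricardo Alarcón de Quesada
print(extract("Duración", text))         # 40 min.
```

Because the boundary is "any next label or the end of the text", a record with missing fields still works: extract("Titulo Original", "Titulo Original: Ciclón País(es): Cuba") gives "Ciclón", and asking for a label that is absent gives None.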

Miquel Centelles

Owen Stephens

Jan 4, 2022, 6:39:30 AM

to OpenRefine

One option you might want to look at is the "Transpose" function. The way I'd initially try approaching this problem is:

Parse the HTML (using a combination of methods including 'parseHtml' and 'match' or 'find') until you have the key/value pairs you need

e.g. GREL:

filter(value.parseHtml().select("span")[0].innerHtml().split(" "),v,isNonBlank(v)).join("|")

then

forEach(value.split("|"),v,v.trim().match(/(.*)<\/span>(.*)/).join("~")).join("|")

This gets a single string of label~value pairs separated by pipes.

Then split this into multiple cells using "Split multi-valued cells" based on the pipe so you get rows like

Titulo Original: ~Entrevista a Ricardo Alarcón de Quesada

Dirección: ~<a href="cineasta.aspx?cod=99">Daniel Díaz Torres</a>

País(es): ~Cuba

Idioma Original: ~Español

Formato: ~VHS

Categoría: ~Documental

Tipo: ~Color

Duración: ~40 min.

Año de producción: ~2004

Then split across columns using the "Split into several columns" option based on the ~ character

Then finally use the "Transpose" option selecting "Columnize by key/value columns..." which should result in you getting one column per key (i.e. Titulo Original: , Dirección: , País(es): , Idioma Original: , Formato: , Categoría: , Tipo: , Duración: , Año de producción: ) with the appropriate value in each column.
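The last two steps (split on ~, then columnize by key/value) can be sketched in plain Python as below. This only illustrates the data reshaping and is not how OpenRefine implements it; the function name columnize is hypothetical, and stripping the trailing colon from each key is a cosmetic choice made here.

```python
# Rows as they look after "Split multi-valued cells" (copied from the list above).
rows = [
    "Titulo Original: ~Entrevista a Ricardo Alarcón de Quesada",
    'Dirección: ~<a href="cineasta.aspx?cod=99">Daniel Díaz Torres</a>',
    "País(es): ~Cuba",
    "Idioma Original: ~Español",
    "Formato: ~VHS",
    "Categoría: ~Documental",
    "Tipo: ~Color",
    "Duración: ~40 min.",
    "Año de producción: ~2004",
]

def columnize(rows, sep="~"):
    """Turn 'key~value' rows into one record with a column per key."""
    record = {}
    for row in rows:
        # partition splits on the first separator only, so a "~" inside a value survives.
        key, _, val = row.partition(sep)
        record[key.strip().rstrip(":")] = val.strip()
    return record

record = columnize(rows)
print(record["Titulo Original"])  # Entrevista a Ricardo Alarcón de Quesada
print(len(record))                # 9
```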

Hope this is of some help - please ask if I can clarify any of this

Owen

Miquel Centelles

Jan 10, 2022, 5:30:19 AM

to OpenRefine

Thank you very much, Owen.

I've been following the steps. The problem is that the results seem to degrade from the first record to the later ones. The first one includes the values of all the keys. The second one loses the values for "Idioma Original" and "Formato". The third one loses the values for "Idioma Original", "Formato", "Categoría", and "Tipo", and so on, although those values are present in the original HTML code. I've gone over the steps again and again, but the problem persists.

Miquel Centelles

Owen Stephens

Jan 10, 2022, 5:55:21 AM

to openr...@googlegroups.com

Hi Miquel

Are you able to share the original HTML - or a larger extract - so that I can try to recreate the behaviour you are seeing?

Thanks

Owen


Miquel Centelles

Jan 10, 2022, 6:25:49 AM

to OpenRefine

Thanks, Owen.

Here is an extract of it.

Titulo Original: Ciclón Dirección: <a href="cineasta.aspx?cod=238">Santiago Álvarez</a> País(es): Cuba Idioma Original: Español Formato: 35 mm Categoría: Documental Tipo: B/N Duración: 22 min. Año de producción: 1963

You can get the full HTML code at http://cinelatinoamericano.org/ficha.aspx?cod=1051.

Miquel Centelles

Owen Stephens

Jan 12, 2022, 5:44:33 AM

to OpenRefine

Thanks Miquel. I'm not sure, but if you have additional columns in your project you will need to make sure these are 'filled down' before you do the 'transpose' step. I've uploaded a screen video at https://drive.google.com/file/d/1PL-IfPnfLfX2QV8ByQq0yKZehDuVKT-7/view?usp=sharing of me doing the whole process starting with three URLs from that website, retrieving the HTML, and then getting the data into the columns. You can see here that because I've got the extra column containing the original URL I have to do a 'fill down' step before I do the final transpose step - without this step the transpose does not work correctly. I hope this is helpful and let me know if I can help any further
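To show why the fill down matters, here is a small plain-Python sketch of the idea. It is only an analogy, not OpenRefine's actual transpose logic: the record ids are hypothetical placeholders standing in for the URL column, only three of the nine keys are shown, and the function names are made up for the example.

```python
# (record id, key, value). After splitting multi-valued cells, the id column is
# filled only on the first row of each record. Ids are hypothetical placeholders.
rows = [
    ("record-1", "Titulo Original", "Entrevista a Ricardo Alarcón de Quesada"),
    (None, "País(es)", "Cuba"),
    (None, "Formato", "VHS"),
    ("record-2", "Titulo Original", "Ciclón"),
    (None, "País(es)", "Cuba"),
    (None, "Formato", "35 mm"),
]

def fill_down(rows):
    """Copy the last seen id into the blank id cells below it."""
    filled, last = [], None
    for rec_id, key, val in rows:
        if rec_id is not None:
            last = rec_id
        filled.append((last, key, val))
    return filled

def columnize_records(rows):
    """Group key/value rows by id: one dict (row) per record, one entry per key."""
    table = {}
    for rec_id, key, val in rows:
        table.setdefault(rec_id, {})[key] = val
    return table

table = columnize_records(fill_down(rows))
print(table["record-1"]["Formato"])  # VHS
print(table["record-2"]["Formato"])  # 35 mm
```

In this sketch, skipping fill_down leaves the continuation rows without an id, so their values can no longer be tied to the record they came from.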

Owen
