Extracting strings from HTML tag


Miquel Centelles

Dec 26, 2021, 7:34:57 AM
to OpenRefine
Hi,

I have a series of <span> tags containing the following content pattern:

<span id="_ctl0_ContentPlaceHolder1_lblDatos">
<span class="intenso">Titulo Original: </span>Entrevista a Ricardo Alarcón de Quesada<br>
<span class="intenso">Dirección: </span><a href="cineasta.aspx?cod=99">Daniel Díaz Torres</a><br>
<span class="intenso">País(es): </span>Cuba<br>
<span class="intenso">Idioma Original: </span>Español<br>
<span class="intenso">Formato: </span>VHS<br>
<span class="intenso">Categoría: </span>Documental<br>
<span class="intenso">Tipo: </span>Color<br>
<span class="intenso">Duración: </span>40 min.<br>
<span class="intenso">Año de producción: </span>2004<br>
</span> 

I need to split the content that follows each of the nine <span class="intenso"> subtags into its own column, subject to the following conditions.

a) The string between <span class="intenso"> and </span> ("Titulo Original: " in the first instance) must be deleted. So the only string to transfer to the corresponding column is the one between </span> and <br>, that is, "Entrevista a Ricardo Alarcón de Quesada" for the first instance.

b) The second instance

<span class="intenso">Dirección: </span><a href="cineasta.aspx?cod=99">Daniel Díaz Torres</a><br> 
 
should be split into two different columns: one for the href content (cineasta.aspx?cod=99) and another for the link text (Daniel Díaz Torres).

Any help?

Miquel Centelles
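
As an illustration of condition b), here is a minimal Python sketch (not GREL) that pulls the href value and the link text out of the "Dirección" line. It assumes the anchor always has the form <a href="...">text</a> as in the sample above; the variable names are made up for the example.

```python
import re

line = ('<span class="intenso">Dirección: </span>'
        '<a href="cineasta.aspx?cod=99">Daniel Díaz Torres</a><br>')

# Capture the href attribute and the link text of the first <a> element.
# Assumption: href is the first attribute and is double-quoted, as in the sample.
anchor = re.search(r'<a\s+href="([^"]*)"[^>]*>(.*?)</a>', line)
href, name = anchor.groups() if anchor else (None, None)

print(href)
print(name)
```

In OpenRefine itself the same two pieces would each go into their own new column, one expression per column.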




Miquel Centelles

Dec 28, 2021, 11:46:33 AM
to OpenRefine
Hi,

I've been working on it and have a preliminary solution:

1) First, I deleted all the HTML tags and attributes except for <br> and links (e.g. <a href="cineasta.aspx?cod=99">). So the text has become:

Titulo Original: Entrevista a Ricardo Alarcón de Quesada<br>Dirección: <a href="cineasta.aspx?cod=99">Daniel Díaz Torres</a><br>País(es): Cuba<br>Idioma Original: Español<br>Formato: VHS<br>Categoría: Documental<br>Tipo: Color<br>Duración: 40 min.<br>Año de producción: 2004<br>  

2) Now I have the <br> elements as milestones that mark the end of each substring. For example, for the first metadata field, I use the formula:

value.find(/Titulo Original: (.+<br>Di)/).join(',')

And I get:

Titulo Original: Entrevista a Ricardo Alarcón de Quesada<br>Di

And then I erase "Titulo Original:" and "<br>Di" in a new column.

value.find(/Dirección: (.+<br>P)/).join(',') for the Dirección metadata, and so on.

As all the milestones are the same (<br>), I have to add the first letter(s) that follow each one to tell the fields apart. I wonder if there is a way to say "until the closest <br> occurrence", in order to avoid giving additional characters for the substring. This would also help in cases where subsequent metadata are not included in the text.

Help on this will be appreciated.

Miquel Centelles
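
One common way to express "until the closest <br> occurrence" in a regular expression is a lazy quantifier (.*? or .+?), which stops at the first possible match instead of the last. The sketch below shows the idea in Python rather than GREL; GREL patterns follow Java regex syntax, which also supports lazy quantifiers, but the helper name field and the sample text are only illustrative.

```python
import re

text = ('Titulo Original: Entrevista a Ricardo Alarcón de Quesada<br>'
        'Dirección: <a href="cineasta.aspx?cod=99">Daniel Díaz Torres</a><br>'
        'País(es): Cuba<br>')

def field(label, text):
    """Return the text between 'label:' and the closest following <br>, or None."""
    # .*? is lazy: it stops at the first <br> rather than the last one,
    # so no extra characters from the next field are needed.
    m = re.search(re.escape(label) + r':\s*(.*?)<br>', text)
    return m.group(1) if m else None

titulo = field('Titulo Original', text)
pais = field('País(es)', text)
formato = field('Formato', text)  # this field is absent from the sample text

print(titulo, pais, formato, sep=' | ')
```

Because each field is matched on its own, a missing field simply yields no value instead of breaking the match for its neighbours.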

Owen Stephens

Jan 4, 2022, 6:39:30 AM
to OpenRefine
One option you might want to look at is the "Transpose" function. The way I'd initially try approaching this problem is:

Parse the HTML (using a combination of methods including 'parseHtml' and 'match' or 'find') until you have the key/value pairs you need
e.g. GREL:

filter(value.parseHtml().select("span")[0].innerHtml().split("<br>"),v,isNonBlank(v)).join("|")
then
forEach(value.split("|"),v,v.trim().match(/<span class="intenso">(.*)<\/span>(.*)/).join("~")).join("|")

This gets something like:
Titulo Original: ~Entrevista a Ricardo Alarcón de Quesada|Dirección: ~<a href="cineasta.aspx?cod=99">Daniel Díaz Torres</a>|País(es): ~Cuba|Idioma Original: ~Español|Formato: ~VHS|Categoría: ~Documental|Tipo: ~Color|Duración: ~40 min.|Año de producción: ~2004

Then split this into multiple cells using "Split multi-valued cells" based on the pipe so you get rows like
Titulo Original: ~Entrevista a Ricardo Alarcón de Quesada
Dirección: ~<a href="cineasta.aspx?cod=99">Daniel Díaz Torres</a>
País(es): ~Cuba
Idioma Original: ~Español
Formato: ~VHS
Categoría: ~Documental
Tipo: ~Color
Duración: ~40 min.
Año de producción: ~2004

Then split across columns using the "Split into several columns" option based on the ~ character
Finally, use the "Transpose" option, selecting "Columnize by key/value columns...", which should give you one column per key (i.e. Titulo Original: , Dirección: , País(es): , Idioma Original: , Formato: , Categoría: , Tipo: , Duración: , Año de producción: ) with the appropriate value in each column.
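
The steps above (split on <br>, separate key from value, then columnize) can be sketched outside OpenRefine as follows. This is a rough Python analogue under the assumption that every field looks like <span class="intenso">Key: </span>value<br>; the function name parse_record is hypothetical, and the sample is the "Ciclón" extract from this thread.

```python
import re

html = ('<span id="_ctl0_ContentPlaceHolder1_lblDatos">'
        '<span class="intenso">Titulo Original: </span>Ciclón<br>'
        '<span class="intenso">Dirección: </span>'
        '<a href="cineasta.aspx?cod=238">Santiago Álvarez</a><br>'
        '<span class="intenso">País(es): </span>Cuba<br>'
        '<span class="intenso">Idioma Original: </span>Español<br>'
        '<span class="intenso">Formato: </span>35 mm<br>'
        '<span class="intenso">Categoría: </span>Documental<br>'
        '<span class="intenso">Tipo: </span>B/N<br>'
        '<span class="intenso">Duración: </span>22 min.<br>'
        '<span class="intenso">Año de producción: </span>1963<br></span>')

def parse_record(html):
    """Turn one block of key/value spans into a dict with one entry per key."""
    record = {}
    for chunk in html.split('<br>'):          # one chunk per field
        m = re.search(r'<span class="intenso">(.*?)</span>(.*)', chunk)
        if not m:                             # e.g. the trailing </span>
            continue
        key = m.group(1).strip().rstrip(':')  # "Titulo Original: " -> "Titulo Original"
        record[key] = m.group(2).strip()
    return record

record = parse_record(html)
for key, value in record.items():
    print(key, '->', value)
```

As in the GREL version, the "Dirección" value still contains the raw <a> markup and needs one more step to separate the href from the name.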

Hope this is of some help - please ask if I can clarify any of this

Owen

Miquel Centelles

Jan 10, 2022, 5:30:19 AM
to OpenRefine
Thank you very much, Owen.

I've followed the steps. The problem is that the final results seem to degrade from the first record to the rest. The first record includes the values of all the keys. The second one loses the values for "Idioma Original" and "Formato". The third one loses the values for "Idioma Original", "Formato", "Categoría", and "Tipo". And so on, although those values are present in the original HTML code. I've gone over the steps again and again, but the problem persists.

Miquel Centelles

Owen Stephens

Jan 10, 2022, 5:55:21 AM
to openr...@googlegroups.com
Hi Miquel

Are you able to share the original HTML - or a larger extract - so that I can try to recreate the behaviour you are seeing?

Thanks

Owen


Miquel Centelles

Jan 10, 2022, 6:25:49 AM
to OpenRefine
Thanks, Owen.

Here is an extract of it.

<span id="_ctl0_ContentPlaceHolder1_lblDatos"><span class="intenso">Titulo Original: </span>Ciclón<br><span class="intenso">Dirección: </span><a href="cineasta.aspx?cod=238">Santiago Álvarez</a><br><span class="intenso">País(es): </span>Cuba<br><span class="intenso">Idioma Original: </span>Español<br><span class="intenso">Formato: </span>35 mm<br><span class="intenso">Categoría: </span>Documental<br><span class="intenso">Tipo: </span>B/N<br><span class="intenso">Duración: </span>22 min.<br><span class="intenso">Año de producción: </span>1963<br></span>

You can get the full HTML code at http://cinelatinoamericano.org/ficha.aspx?cod=1051.

Miquel Centelles

Owen Stephens

Jan 12, 2022, 5:44:33 AM
to OpenRefine
Thanks Miquel. I'm not sure, but if you have additional columns in your project you will need to make sure these are 'filled down' before you do the 'transpose' step. I've uploaded a screen video at https://drive.google.com/file/d/1PL-IfPnfLfX2QV8ByQq0yKZehDuVKT-7/view?usp=sharing of me doing the whole process: starting with three URLs from that website, retrieving the HTML, and then getting the data into the columns. You can see there that, because I have an extra column containing the original URL, I have to do a 'fill down' step before the final transpose step; without it the transpose does not work correctly. I hope this is helpful. Let me know if I can help any further.



Owen
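
To illustrate why the 'fill down' step matters, here is a small Python sketch. It does not reproduce how OpenRefine's transpose works internally; it only shows that grouping key/value rows by an extra column goes wrong when that column is blank on most rows. The identifiers url-A and url-B are placeholders, and the function names are hypothetical.

```python
# After "Split multi-valued cells", only the first row of each record
# keeps its URL; the following rows have a blank URL cell.
rows = [
    ("url-A", "Titulo Original", "Ciclón"),
    ("",      "Formato",         "35 mm"),
    ("url-B", "Titulo Original", "Entrevista a Ricardo Alarcón de Quesada"),
    ("",      "Formato",         "VHS"),
]

def fill_down(rows):
    """Copy the last non-blank URL into the blank cells below it."""
    filled, last = [], None
    for url, key, value in rows:
        if url:
            last = url
        filled.append((last, key, value))
    return filled

def columnize(rows):
    """Group key/value rows into one dict (one 'row of columns') per URL."""
    records = {}
    for url, key, value in rows:
        records.setdefault(url, {})[key] = value
    return records

without_fill = columnize(rows)           # blank-URL rows collapse together
with_fill = columnize(fill_down(rows))   # each record keeps all its values

print(without_fill)
print(with_fill)
```

In this toy model, skipping the fill-down leaves the "Formato" values detached from their records, which resembles the missing values described earlier in the thread.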


