How to clean previous formatting tags

31 views

Skip to first unread message

Jeremy G

unread,

Sep 6, 2022, 10:30:13 AM9/6/22

to OpenRefine

Hello,

First message here and total newbie with OpenRefine!

I am working on a pre-existing database made up partly of text fields. They were previously filled in by users who had access to a form, and this form proposed to format the text via WYSIWYG input tools for bold, italic, etc. In the database, all the text entered in these fields is therefore contained in formatting tags, making it hard to read. For the following text entered by the user:

> Hi this is a sample text.

The database recorded this:

> <Section xml:space="preserve" HasTrailingParagraphBreakOnPaste="False" xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"><Paragraph FontSize="11" FontFamily="Portable User Interface" Foreground="#FF000000" FontWeight="Normal" FontStyle="Normal" FontStretch="Normal" CharacterSpacing="0" Typography.AnnotationAlternates="0" Typography.EastAsianExpertForms="False" Typography.EastAsianLanguage="Normal" Typography.EastAsianWidths="Normal" Typography.StandardLigatures="True" Typography.ContextualLigatures="True" Typography.DiscretionaryLigatures="False" Typography.HistoricalLigatures="False" Typography.StandardSwashes="0" Typography.ContextualSwashes="0" Typography.ContextualAlternates="True" Typography.StylisticAlternates="0" Typography.StylisticSet1="False" Typography.StylisticSet2="False" Typography.StylisticSet3="False" Typography.StylisticSet4="False" Typography.StylisticSet5="False" Typography.StylisticSet6="False" Typography.StylisticSet7="False" Typography.StylisticSet8="False" Typography.StylisticSet9="False" Typography.StylisticSet10="False" Typography.StylisticSet11="False" Typography.StylisticSet12="False" Typography.StylisticSet13="False" Typography.StylisticSet14="False" Typography.StylisticSet15="False" Typography.StylisticSet16="False" Typography.StylisticSet17="False" Typography.StylisticSet18="False" Typography.StylisticSet19="False" Typography.StylisticSet20="False" Typography.Capitals="Normal" Typography.CapitalSpacing="False" Typography.Kerning="True" Typography.CaseSensitiveForms="False" Typography.HistoricalForms="False" Typography.Fraction="Normal" Typography.NumeralStyle="Normal" Typography.NumeralAlignment="Normal" Typography.SlashedZero="False" Typography.MathematicalGreek="False" Typography.Variants="Normal" TextOptions.TextHintingMode="Fixed" TextOptions.TextFormattingMode="Ideal" TextOptions.TextRenderingMode="Auto" TextAlignment="Left" LineHeight="0" LineStackingStrategy="MaxHeight"><Run>entablement surmonté d’un fronton courbe à deux faces de dimension réduite</Run></Paragraph><Paragraph FontSize="11" FontFamily="Portable User Interface" Foreground="#FF000000" FontWeight="Normal" FontStyle="Normal" FontStretch="Normal" CharacterSpacing="0" Typography.AnnotationAlternates="0" Typography.EastAsianExpertForms="False" Typography.EastAsianLanguage="Normal" Typography.EastAsianWidths="Normal" Typography.StandardLigatures="True" Typography.ContextualLigatures="True" Typography.DiscretionaryLigatures="False" Typography.HistoricalLigatures="False" Typography.StandardSwashes="0" Typography.ContextualSwashes="0" Typography.ContextualAlternates="True" Typography.StylisticAlternates="0" Typography.StylisticSet1="False" Typography.StylisticSet2="False" Typography.StylisticSet3="False" Typography.StylisticSet4="False" Typography.StylisticSet5="False" Typography.StylisticSet6="False" Typography.StylisticSet7="False" Typography.StylisticSet8="False" Typography.StylisticSet9="False" Typography.StylisticSet10="False" Typography.StylisticSet11="False" Typography.StylisticSet12="False" Typography.StylisticSet13="False" Typography.StylisticSet14="False" Typography.StylisticSet15="False" Typography.StylisticSet16="False" Typography.StylisticSet17="False" Typography.StylisticSet18="False" Typography.StylisticSet19="False" Typography.StylisticSet20="False" Typography.Capitals="Normal" Typography.CapitalSpacing="False" Typography.Kerning="True" Typography.CaseSensitiveForms="False" Typography.HistoricalForms="False" Typography.Fraction="Normal" Typography.NumeralStyle="Normal" Typography.NumeralAlignment="Normal" Typography.SlashedZero="False" Typography.MathematicalGreek="False" Typography.Variants="Normal" TextOptions.TextHintingMode="Fixed" TextOptions.TextFormattingMode="Ideal" TextOptions.TextRenderingMode="Auto" TextAlignment="Left" LineHeight="0" LineStackingStrategy="MaxHeight"><Run>Hi this is a sample text.</Run></Paragraph></Section>

I'm looking for a way to clean these fields. It looks like OpenRefine could be the good tool, but not sure how to proceed. The cleaning rule would be something like "In the cells of this field, delete the chevrons and every characters inside them".

Does it sound like something OpenRefine could do? Do you have any hint on how to do this?

Thanks!

Best,

Jeremy

Owen Stephens

unread,

Sep 6, 2022, 2:05:05 PM9/6/22

to OpenRefine

Welcome Jeremy,

I think that OpenRefine can help here. I'm going to describe a couple of different approaches to give you an idea - but the right approach might depend a bit on how you current have the data, how much data there is, and what you need as your eventual output.

The first option might look something like this:

1. If you have a file full of text like this, you should be able to import the file using the XML importer

2. Select the "Section" or "Paragraph" element (it may depend on the exact structure - basically you want to select the element that represents a single 'record'. You'll get a tabular output

3. Click "Create project" to actually import the data into OpenRefine

4. Scroll to the last column in the project - this contains the text (labelled "Section - Paragraph - Run" in the screenshot below)

5. From here you could export the text by using one of the exporters (Custom tabular exporter allows you to select which columns export) or if you need to combine together text which was from a single <Section> but different <Paragraph> you can do that as well (I'll leave the explanation now - just ask if you need to do this)

The second option is where you have all of the text in a single cell in a project like this:

From here you can extract particular information using the in built functions in OpenRefine using "GREL" (General Refine Expression Language) which I usually describe a little bit like writing an Excel formula. Anyway, with the right expression in a single step you can get from the mass of markup above to something like this :

If this is the approach needed I can provide some more information on how to achieve this

I hope this is a starter for what's possible with the tool - there are other tools that could also do the job, but OpenRefine certainly could be an option here. Please feel free to ask questions - I'm very happy to provide more detail as necessary.

Owen

Jeremy G

unread,

Sep 7, 2022, 9:16:14 AM9/7/22

to OpenRefine

Hi Owen,

Thanks for your help.

My situation looks like the second scenario. The data is contained inside cells on a csv file exported from the database. I plan to reimport it once the cleaning done.

There are 4 possible "possible content" in the same field/column, and I'd like to find the GREL instruction that allow me to process the whole field by covering the 4 cases. Here is what it could look like in natural language, with (for 3 & 4) the tested GREL instruction that works:

1. if there is no string in the cell, do not transform the cell content

2. if there is no XML tags and just raw string, do not transform the cell content

3. If some XML is present in the cell, return the string of the text within the "Section" element

parseXml(value).select("Section")[0].xmlText()

4. If some XML is present in the cell, return the string from the "Text" attribute of the "Run" element

parseXml(value).select("Run")[0].xmlAttr("Text")

The previous example I gave corresponds to 3. An example of 4 would be this:

Not sure in what order 3 & 4 should be launched, but I haven't encountered a cell with both cases at the same time yet.

Is it possible to create such a chain of instruction, and if so, what should I look for? Also Any help would be much appreciated for 1 & 2!

Many thanks,

Best regards,

Jeremy

Owen Stephens

unread,

Sep 7, 2022, 12:35:38 PM9/7/22

to OpenRefine

Hi Jeremy,

There are multiple ways of approaching this. I suspect you could write a single GREL expressions to try to capture all the cases but it would get quite long and instead I'd probably split this out to multiple steps and in this case work with an additional column to get my final text in, and then once that's ready replace/remove the original column with the new one.

So as a suggestion:

First deal with case 2 by:

1. From your starting column us "Add column based on this column" with an expression that will only work if the original cell contains XML. I'm not sure what the best expression to check this is and it might depend on your XML, but possibly a general expression that would work is:

if(value==value.parseXml().select("*").join("|"),value,null)

but if you know that a particular tag is always there you could do something like:

if(not(value.contains("</Section>")),value,null)

You should now have a new column that is populated with any raw strings

2. In the new column apply a facet by blank (under Facet -> Customized Facets) and select 'true' for to filter to only the remaining blank cells

3. In the new column use Edit Cells -> Transform and write GREL to extract the information from any <Section> elements but amending what you have below to pull the data from the original column like:

cells["XML Column name"].value.parseXml().select("Section")[0].xmlText()

4. This should lead to more of the cells in your new column being populated - refresh the facet to make sure these are now filtered by the 'Facet by blank' facet - so you only have blank cells again in the column

5. In the new column use Edit Cells -> Transform and write GREL to extract the information from any <Run> text attributes but amending what you have below to pull the data from the original column like:

cells["XML Column name"].value.parseXml().select("Run")[0].xmlAttr("Text")

6. Once these cells are populated and you have the facet by blank = true applied, you should be left only with the blank cells again ... which should mean that you are now left with only your scenario number 1 "if there is no string in the cell, do not transform the cell content" - but you can check this by examining the original column values that are left and make sure this is true and if necessary identify any more work that needs doing - which you can pursue in an appropriate way following the same pattern

Once the new column is fully populated as you want you can remove the Facet by blank and then either copy the data back to the original column, or just remove the original column and rename the new one (or however you want to manage it)

If there are situations where there is both text in side the Section element and also text in the Run element 'Text' attribute, the above method would only copy it from the Section element (because once you've done step 3 then there would be a populated cell in your column and so it would be filtered from future work). So it is probably worth checking up front if this ever happens and if so deciding how to deal with it. You can do this by writing a Custom Text Facet on the original XML column with the GREL like:

with(value.parseXml(),x,and(x.select("Section")[0].xmlText().length()>0,x.select("Run")[0].xmlAttr("Text").length()>0))

which checks for text appearing in both places - so if there is a 'true' result in that facet then you have situations where both are populated and need to decide how you handle this situation (get both texts? or prioritise one over the other?)

Sorry - I realise there's alot here - I hope it makes sense and please ask if not