removing everything within square brackets (including these brackets)

Agnieszka Podpora

unread,

Sep 1, 2022, 5:40:54 AM9/1/22

to OpenRefine

Hi All,

I am new to OpenRefine and to the language. In my data I have a column with cells that look like this:

Lewandowski, Ignacy (1936- ) [Tłumaczenie, Wstęp, Komentarz]

OR

Adamska, Ewa [Tł. .]|Dunin-Borkowski, Jan [Tłumaczenie, Opracowanie.]

I would like to get rid of the square brackets and everything within. It it weren't for the second example i would split the column before "[" and delete the second one afterwards, but I also have instances with the brackets "inside" the text and it's not that easy. Could anyone please help me with this? Assume I have almost no knowledge about GREL and the expressions (except for really basic ones) ;-).

Any help would be really appreciated. Thanks!

Agnieszka

Owen Stephens

unread,

Sep 1, 2022, 5:55:05 AM9/1/22

to OpenRefine

I can see potentially a couple of options but it may depend on the exact structure of the data.

In your second example:

Adamska, Ewa [Tł. .]|Dunin-Borkowski, Jan [Tłumaczenie, Opracowanie.]

You have a vertical pipe character separating these two names. If this is consistent you could use the option Edit Cells -> Split multi-valued cells... At this point you will be asked for the 'separator' and you can enter the pipe character

Then you would have each name in a separate cell:

Adamska, Ewa [Tł. .]

Dunin-Borkowski, Jan [Tłumaczenie, Opracowanie.]

And then you could use the approach you describe of splitting into two columns, and removing a column. Then, you can reverse the "split multi-valued cells" step by doing Edit Cells -> Join multi-valued cells...

Alternatively, you could use the Replace function with a regular expression. "Replace" can be done via a GREL transformation or via the menu option Edit Cells -> Replace

The regular expression you need is:

\[.+?\]

Which is essentially saying 'look for some square brackets with one or more characters inside'

To do this as a GREL transform you can do:

value.replace(/\[.+?\]/,"")

Or if you use the Replace function from the menu then into the Find option just enter

\[.+?\]

and check the 'regular expression' checkbox, and leave "Replace with" blank, and click OK

Hope this helps - feel free to ask some more information if it's not clear

Owen

Agnieszka Podpora

unread,

Sep 1, 2022, 6:38:07 AM9/1/22

to OpenRefine

Thanks Owen,

your expression \[.+?\]worked like a charm, thank you!

I have been trying along these lines, I know that the period stands for any character, the plus for any number of those - could you please write what does the question mark indicate?

I also have a (similiar?) problem in the column detailing number of pages (I have a library data set). The values I have in cells are like:

347 stron

OR

44, [2] strony

OR

[36] stron

OR

35, [1] s.

I would like to simplify this column to only indicate the principal number of pages, without brackets, text or puctuation after to eventually transform them into numbers and be able to sort by number of pages. I was thinking to first remove the brackets and commas (because of the instances I marked yellow) and then somehow split the column before the first space (??? or the end of the first string of characters ?) Could you please help me with this as well? Which expression woul you use?

Thanks a lot for your help,

Agnieszka

Owen Stephens

unread,

Sep 1, 2022, 7:58:43 AM9/1/22

to OpenRefine

On Thursday, September 1, 2022 at 12:38:07 PM UTC+2 agapo...@gmail.com wrote:

Thanks Owen,
your expression \[.+?\]worked like a charm, thank you!
I have been trying along these lines, I know that the period stands for any character, the plus for any number of those - could you please write what does the question mark indicate?

As you say the

.+

part of the expression means any characters (the plus means one or more). Because it's 'any character' than can include the closed square bracket which means that in the example you give

Adamska, Ewa [Tł. .]|Dunin-Borkowski, Jan [Tłumaczenie, Opracowanie.]

The regular expression

\[.+\]

could match either

Tł. .

or

Tł. .]|Dunin-Borkowski, Jan [Tłumaczenie, Opracowanie.

Because both of these fulfil the condition "an open square bracket, followed by one or more characters, followed by a closed square bracket". It happens that Regular expressions default to what's called "greedy" which is to use the longest possible match in this situation. The ? after the + changes this to the opposite - "non-greedy". So it will use the shortest possible match - which is what we want in this case

I'll answer your other question in a separate message

Best wishes

Owen

Owen Stephens

unread,

Sep 1, 2022, 8:54:17 AM9/1/22

to OpenRefine

On Thursday, September 1, 2022 at 12:38:07 PM UTC+2 agapo...@gmail.com wrote:

I also have a (similiar?) problem in the column detailing number of pages (I have a library data set). The values I have in cells are like:

347 stron
OR
44, [2] strony
OR
[36] stron
OR
35, [1] s.

I would like to simplify this column to only indicate the principal number of pages, without brackets, text or puctuation after to eventually transform them into numbers and be able to sort by number of pages. I was thinking to first remove the brackets and commas (because of the instances I marked yellow) and then somehow split the column before the first space (??? or the end of the first string of characters ?) Could you please help me with this as well? Which expression woul you use?

In my experience there tends to be quite a lot of variation in this type of bibliographic data and so it's likely that you can't resolve it with a single expression. The sort of approach I'd take is to do something like:

Use "Add column based on this column" (in my experience it's good to keep a copy of the original data)

You could start with the expression maybe

value.match(/(\d+).*/)[0]

This is looking for any values that start with a series of digits and will put it in the new column - you may need to check this is a good starting point (as for example if you had a value like "3,347 stron" it would only find the "3" and not the "347" - so it maybe you need to start from a different expression

Once you've hit the most common pattern to create the new column, you can then use 'facet by blank' on this new column to find all the rows where this hasn't populated any information in the new column and start to look at whether there are any other patterns that you need to take into consideration (lets say the page numbers aren't the first thing in the field for some reason). Depending on the number of rows you could either fix by hand (if there are few) or work out new patterns to capture other common situations - when dealing with the additional patterns you'll want to work from your new column but pull the information from your original column using expressions like

cells["Column name"].value.match(/regular expression/)

I hope this gives you a way of starting on the problem and feel free to ask or post examples and questions if I can help further

Owen

Reply all

Reply to author

Forward