
Parthasarathi Mukhopadhyay

Aug 4, 2022, 7:37:47 AM
to openr...@googlegroups.com
Dear all

I am preparing a training dataset for testing an AI/ML-based automated subject indexing system. After much struggle, I have built an OpenRefine project with 15.41 million rows in the format <MeSH subject descriptor URI> Subject Descriptor (two columns only), based on the PubMed dataset (see below).

After merging everything into a single input file for the training dataset, I realized that many rows in the merged dataset are duplicates. This is quite natural, as one document may have many descriptors and many documents share the same descriptors. It means the number of rows can be reduced, since I only need unique rows (URI + Descriptor) to train the program.

<http://id.nlm.nih.gov/mesh/D013123> Spinal Fusion
<http://id.nlm.nih.gov/mesh/D016896> Treatment Outcome
<http://id.nlm.nih.gov/mesh/D000293> Adolescent
<http://id.nlm.nih.gov/mesh/D000328> Adult
<http://id.nlm.nih.gov/mesh/D005260> Female
<http://id.nlm.nih.gov/mesh/D006801> Humans
<http://id.nlm.nih.gov/mesh/D008297> Male

<http://id.nlm.nih.gov/mesh/D012600> Scoliosis
<http://id.nlm.nih.gov/mesh/D013123> Spinal Fusion
<http://id.nlm.nih.gov/mesh/D013131> Spine
<http://id.nlm.nih.gov/mesh/D014057> Tomography, X-Ray Computed
<http://id.nlm.nih.gov/mesh/D016896> Treatment Outcome
<http://id.nlm.nih.gov/mesh/D000293> Adolescent
<http://id.nlm.nih.gov/mesh/D000328> Adult
<http://id.nlm.nih.gov/mesh/D002648> Child
<http://id.nlm.nih.gov/mesh/D005260> Female
<http://id.nlm.nih.gov/mesh/D006801> Humans
<http://id.nlm.nih.gov/mesh/D008297> Male

A text facet analysis in OpenRefine shows the magnitude of repetition in my dataset:

Humans 1005353
Male 500965
Female 469373
Animals 462998
Adult 298138
Middle Aged 237034
Aged 160809
Rats 148305
Adolescent 109725
Child 92493

I tried the 'Cluster & edit' option > key collision > almost all of the keying functions listed there, but none gives a suitable result for my requirement: reducing the dataset to one unique row for each MeSH descriptor.

What is the way out?

Best regards

Parthasarathi Mukhopadhyay

Professor, Department of Library and Information Science,

University of Kalyani, Kalyani - 741 235 (WB), India

Owen Stephens

Aug 4, 2022, 9:02:34 AM
to OpenRefine
I'm not sure what the performance might be like on such a large dataset, but the way I would normally approach this in OpenRefine is:

1. Sort by the column with the duplication (in this case the URI) and make the sort permanent
2. Use Edit cells -> Blank down on this column
3. Facet by blank on this column
4. Depending on requirements, either remove the blank rows or export the non-blank rows
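If OpenRefine itself struggles at this scale, a rough shell alternative keeps only the first occurrence of each row without needing a sort step (a sketch; the file names and contents below are just placeholders):

```shell
# Toy two-column file standing in for the real export (placeholder data)
printf 'a\t1\na\t1\nb\t2\n' > input.tsv

# awk prints a row only the first time it is seen, preserving input order
awk '!seen[$0]++' input.tsv > unique.tsv
```

Note this holds every distinct row in memory, so it suits cases where the number of unique rows (not total rows) is modest.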

Owen

Vladimir Alexiev

Aug 4, 2022, 9:07:36 AM
to openr...@googlegroups.com
Hi Parthasarathi!

If you can use a command prompt and have some basic unix utilities installed, this will do it for you in less than a second:

sort <file>  | uniq > <uniq-file>

If you want to see the number of repetitions:

sort <file> | uniq -c > <uniq-file-with-count>
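For example, with a toy file (file name and contents are just placeholders):

```shell
printf 'Humans\nHumans\nMale\n' > toy.txt

# Each distinct line is printed once, prefixed with its repeat count
sort toy.txt | uniq -c
```

uniq only collapses adjacent identical lines, which is why the sort must come first.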


--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/fd9b64c3-8753-4791-a500-312661c14915n%40googlegroups.com.

Owen Stephens

Aug 4, 2022, 9:19:09 AM
to OpenRefine
On Thursday, August 4, 2022 at 2:07:36 PM UTC+1 vladimir...@ontotext.com wrote:
Hi Parthasarathi!

If you can use a command prompt and have some basic unix utilities installed, this will do it for you in less than a second:

sort <file>  | uniq > <uniq-file>

I think this relies on duplicate rows being identical across the entire line - if that's the case then I completely agree this is the way to go. I was assuming there might be variations elsewhere in the row.

Parthasarathi Mukhopadhyay

Aug 4, 2022, 10:45:55 AM
to openr...@googlegroups.com
Dear Owen and Vladimir

Thanks a lot for your guidance.

The "blank down" route gives me an out-of-heap-memory error every time I try it, even though I'm using an i7/16 GB RAM/1 TB SSD laptop (as Owen anticipated in his reply).

The unix command-line solution really does the trick, within a few minutes. Amazingly, the 15.41 million rows are now down to 19,312 unique rows.

I would like to know something more on this approach -

My final work will involve 20 sets of tsv files (large files of 1.5 GB to 2.5 GB each) with two columns (URI and Descriptor) and a header row. How should I handle these 20 tsv files with headers?

In this experiment I first exported two sets in tsv format without column headers, then used the 'cat' command to merge them, and finally ran sort <file> | uniq > <uniq-file>.
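In shell terms, what I did was roughly this (toy files here stand in for the real exports):

```shell
# Two small header-less TSV exports (placeholder contents)
printf 'uri1\tHumans\nuri2\tMale\n' > set1.tsv
printf 'uri1\tHumans\nuri3\tFemale\n' > set2.tsv

cat set1.tsv set2.tsv > merged.tsv   # merge the exports
sort merged.tsv | uniq > unique.tsv  # keep one copy of each identical row
```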

Is there any way to merge all 20 large tsv files with headers, and then use the uniq command?
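(I wonder whether dropping the first line of each file before concatenating would work - a guess on my part, sketched here with toy files:)

```shell
# Toy TSV files with a header row (placeholder contents)
printf 'URI\tDescriptor\nuri1\tHumans\n' > a.tsv
printf 'URI\tDescriptor\nuri1\tHumans\nuri2\tMale\n' > b.tsv

# In awk, FNR restarts at 1 for every input file, so this skips each header
awk 'FNR > 1' a.tsv b.tsv | sort | uniq > unique.tsv
```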

Second -

Is there any way to display/store the output of sort <file> | uniq -c sorted by the number of occurrences (a reverse sort would be very useful in my case)?
The man page for uniq is not giving me any clue on this.
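(Perhaps piping the counts through sort a second time, numerically and reversed, would do it? A sketch with a toy file:)

```shell
printf 'Male\nHumans\nHumans\nHumans\nMale\n' > toy.txt

# sort -rn orders lines by the leading count, largest first
sort toy.txt | uniq -c | sort -rn
```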

Best regards




magdmartin

Aug 4, 2022, 3:44:17 PM
to OpenRefine
Parthasarathi,

You can try the OpenRefine 4.0-alpha release. It can use Spark in the back end to process such large datasets. https://github.com/OpenRefine/OpenRefine/releases/tag/4.0-alpha1
I've been using it for the last couple of weeks on large datasets and it is pretty stable (I still need to open a couple of issues about minor UI bugs).

Parthasarathi Mukhopadhyay

Aug 6, 2022, 1:12:31 PM
to openr...@googlegroups.com
Thanks, Martin, for pointing it out. Are there any installation instructions explaining how to link OpenRefine 4 with Apache Spark?

Regards

Antonin Delpeuch (lists)

Aug 6, 2022, 1:34:54 PM
to openr...@googlegroups.com
Hi both,

Great to see the interest in the new architecture! It motivates me even
more to finally stop working on Wikimedia integration and work on a
stable release for it. Martin, I am keen to see your issues :)

To select Spark, you can refer to the following documentation:
https://github.com/OpenRefine/OpenRefine/blob/4.0/docs/docs/technical-reference/workflow-execution/overview.md

Note that in many use cases, Spark will be slower than the default
runner, so I would recommend using Spark only in advanced cases where
you really want to run a workflow on a Spark cluster. This should become
easier once we have more tooling around running OpenRefine workflows in
"headless" mode (for instance via the command line).

Best,
Antonin

