Custom document corpus - a methodology


Parthasarathi Mukhopadhyay

Jun 29, 2022, 2:26:40 PM
to Annif Users
Dear all

We are following the methodology stated below to create a custom document corpus, but we need some expert opinions on the validity of the approach:

1. We have a small domain-specific vocabulary of about 3,000 preflabels, each with a URI, along with scope notes, altlabels etc.;
2. We have loaded this vocabulary into Annif without any issue (after fine-tuning the vocabulary dataset into TTL using Skosify);
3. Now comes the training dataset, which we are planning this way:
3.1> Import the vocabulary into a data-wrangling tool (in our case OpenRefine);
3.2> Use the preflabel column to fetch, via REST APIs, the title, abstract and DOI from different open-access bibliographic databases (Crossref, Semantic Scholar, Lens and so on; the sources are journal articles, book chapters, conference proceedings etc.) against each preflabel (one database at a time, creating columns with the JSON responses from the bibliographic databases in use);
3.3> Extract the title and abstract from the fetched JSON data;
3.4> Finally, prepare CSV/TSV files (one per bibliographic data source) in the format required by Annif;
3.5> The end result looks something like this for the preflabel "Gendercide" (Title ¤ Abstract format):

<https://cmslov.org/v3/cmlov0000580>

A strange case of gendercide: fascination, psychotic features, archaic elements, and phantasmatic metamorphosis ¤ This is a strange case of murder: a mafia pentito, an informer, reveals after ten years that the death of a man which had been considered natural, had in fact been a murder. The victim, a loan shark associated with crime people, was the companion of his lover at the time. The strange aspect is the way in which the mobster, with a couple of hired killers and helped by the woman, organised the murder: a shot of pesticide while the man was lying in bed with the woman. This modality was suggested by a veterinarian friend of the mobster who claimed that the pesticide would not be detected. A fragile and hidden common thread is hypothesised in this work. A common thread whose core is the desire for the narcissistic realisation of a woman who, in order to achieve it, puts eros at the service of thanatos. A red thread that connects all the different events, the real and the phantasmatic ones: from seeking a role as a woman of the criminal underworld to the magic fascination of an archaic Sicily, evoking primitive mechanisms of functioning typical of psychosis.
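Step 3.2 could be sketched roughly as follows for one source. This is only an illustrative sketch using the public Crossref works API (the `query` and `rows` parameters, and the `message.items`, `title` and `abstract` fields, follow Crossref's JSON response); other databases would need their own URLs and parsing logic:

```python
import json
import urllib.parse
import urllib.request

CROSSREF_API = "https://api.crossref.org/works"

def build_query_url(preflabel, rows=5):
    """Build a Crossref works query URL for one preferred label."""
    params = urllib.parse.urlencode({"query": preflabel, "rows": rows})
    return f"{CROSSREF_API}?{params}"

def extract_title_abstract(item):
    """Pull the title and abstract out of one Crossref works item.
    Crossref returns titles as a list; the abstract is optional."""
    title = " ".join(item.get("title", []))
    abstract = item.get("abstract", "")
    return title, abstract

def fetch_records(preflabel, rows=5):
    """Fetch up to `rows` (title, abstract) pairs for a preflabel query."""
    with urllib.request.urlopen(build_query_url(preflabel, rows)) as resp:
        data = json.load(resp)
    return [extract_title_abstract(item) for item in data["message"]["items"]]
```

In practice OpenRefine does this fetching through its "Add column by fetching URLs" feature rather than a script, but the URL construction and JSON extraction are the same.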


One known issue with this approach is that we get only one subject URI per title/abstract, whereas in a human-indexed gold standard the indexer would possibly assign more than one descriptor/URI to each title/abstract pair. The good part, however, is that we touch every preflabel of the vocabulary in use, and it can be done quickly and programmatically.

What do you think?

Best regards


Parthasarathi Mukhopadhyay

Professor, Department of Library and Information Science,

University of Kalyani, Kalyani - 741 235 (WB), India

juho.i...@helsinki.fi

Jul 6, 2022, 4:34:27 AM
to Annif Users
Hi Parthasarathi!

About the issue of one subject URI for one Title/Abstract document: My gut feeling is that this is not a problem for training a project, if you are able to get a sufficient number of training documents. For a new text, Annif will give multiple subject suggestions anyway, just as you mentioned a human indexer would.

However, I think training on documents with a single subject means that you need more of them to get a project of the same quality as one trained on documents with multiple subjects. But maybe you will actually have quite a lot of documents, as you mentioned that you are collecting them from multiple databases. In any case, it is beneficial to have multiple documents for each subject, if possible.

Another concern could arise if you used such single-subject documents for evaluating a project. Annif gives 10 subject suggestions for a text by default, so for such documents 9 of the suggestions would always be considered "wrong" (false positives), and most of the metrics given by the Annif eval command would not be applicable. Maybe you have a separate set of documents, with multiple subjects per document, that you can use for evaluating your final project. Having some test set is highly desirable, so you know how well a project is performing and whether a change to the project makes it better or worse.

In the end, I wonder if you could somehow merge back the "instances" of the same document that are separated by different subjects in the collecting process. I assume the documents in the online databases usually have multiple subjects. Maybe using the DOI, or just the whole title+abstract content. This would at least reduce the disk size of the corpus, if nothing else.

One point to consider is whether using the preflabels for fetching articles could lead to inconsistencies (e.g. the preflabel "rock" could return articles about both the music genre and actual stones). For this reason, using URIs for querying the articles would be optimal, but that may not be possible.

Also, I noticed that the columns in the example document for the preflabel "Gendercide" are the wrong way around: in the short-text document corpus format the document text goes in the first column, and the subject URIs go in the second, after the tab.
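A minimal illustration of the corrected line (text first, a tab, then the subject URI in angle brackets; multiple URIs would be space-separated), using the example from the previous message with the abstract truncated:

```python
# Short-text corpus format: document text, a tab, then subject URI(s)
# in angle brackets (multiple URIs separated by spaces).
text = "A strange case of gendercide: fascination, psychotic features, ..."
uri = "<https://cmslov.org/v3/cmlov0000580>"
line = f"{text}\t{uri}"
```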

I think you have seen the Jupyter notebook of the Annif tutorial for creating a custom corpus, but for general knowledge I link it here: https://github.com/NatLibFi/Annif-tutorial/blob/master/data-sets/arxiv/create-arxiv-corpus.ipynb

-Juho

Parthasarathi Mukhopadhyay

Jul 6, 2022, 12:43:29 PM
to Annif Users
Thanks, Juho, for a very comprehensive answer touching on almost all the major facets of the process. I'll try to answer inline:

> About the issue of one subject URI for one Title/Abstract document: My gut feeling is that this is not a problem for training a project, if you are able to get a sufficient number of training documents. For a new text, Annif will give multiple subject suggestions anyway, just as you mentioned a human indexer would.

Excellent! Thanks for this valuable input on the process we are thinking of following.

> However, I think training on documents with a single subject means that you need more of them to get a project of the same quality as one trained on documents with multiple subjects. But maybe you will actually have quite a lot of documents, as you mentioned that you are collecting them from multiple databases. In any case, it is beneficial to have multiple documents for each subject, if possible.

We are actually collecting 5 documents (based on the built-in relevancy ranking scores of the datasets in use) from each selected dataset, thereby collecting 5 documents × 5 datasets = 25 documents for a given preflabel query.

> Another concern could arise if you used such single-subject documents for evaluating a project. Annif gives 10 subject suggestions for a text by default, so for such documents 9 of the suggestions would always be considered "wrong" (false positives), and most of the metrics given by the Annif eval command would not be applicable. Maybe you have a separate set of documents, with multiple subjects per document, that you can use for evaluating your final project. Having some test set is highly desirable, so you know how well a project is performing and whether a change to the project makes it better or worse.

We have not thought about this yet. Thanks a ton for this valuable suggestion regarding the evaluation of the project as a whole.

> In the end, I wonder if you could somehow merge back the "instances" of the same document that are separated by different subjects in the collecting process. I assume the documents in the online databases usually have multiple subjects. Maybe using the DOI, or just the whole title+abstract content. This would at least reduce the disk size of the corpus, if nothing else.

If I understand this suggestion properly: OpenRefine allows us to generate a custom tabular format (CSV or TSV) structured like [doc1-dataset1] [<subject-URI-1>] (say ds1-set1.csv); [doc2-dataset1] [<subject-URI-1>] (say ds1-set2.csv) and so on. Then we merge all of these CSV files into a single one before passing it to Annif.

> One point to consider is whether using the preflabels for fetching articles could lead to inconsistencies (e.g. the preflabel "rock" could return articles about both the music genre and actual stones). For this reason, using URIs for querying the articles would be optimal, but that may not be possible.

Yes, quite true. Here we are depending on the relevancy score of the dataset, but a manual check is always a better option. I'll raise this point in the next meeting to discuss it further.

> Also, I noticed that the columns in the example document for the preflabel "Gendercide" are the wrong way around: in the short-text document corpus format the document text goes in the first column, and the subject URIs go in the second, after the tab.

Yes, my mistake. Extremely sorry for this confusion.

> I think you have seen the Jupyter notebook of the Annif tutorial for creating a custom corpus, but for general knowledge I link it here: https://github.com/NatLibFi/Annif-tutorial/blob/master/data-sets/arxiv/create-arxiv-corpus.ipynb

We have already explored this resource and are still learning how to apply it to our case.

Heartfelt thanks and best regards

Parthasarathi