Multilingual labels in results

Parthasarathi Mukhopadhyay

Feb 26, 2024, 3:14:12 AM

to Annif Users

Dear all

We are developing an Annif-based automated indexing system for documentary resources on LGBTQ+ topics. We translated Homosaurus (homosaurus.org), a globally known domain-specific vocabulary, into Bengali and Hindi (two major languages in India) and, after skosification, loaded the translated vocabulary into Annif in TTL format with load-vocab. The process went smoothly, and Annif created two files under the vocabs folder: subjects.csv and subjects.ttl.

The subjects.csv looks like this:

Then we trained different backends (lexical, associative, ensemble) on about 450,000 labeled records (resources indexed with the Homosaurus vocabulary), ran hyperparameter optimization for the Simple Ensemble against a validation dataset of 4,500 records not used in training, and applied the resulting weights in the Neural Network model. After comparing F1@5 and NDCG scores on a test dataset of 5,000 records (around 1% of the records, used for neither training nor validation), we found that the NN model has a better score profile, as expected. It can predict candidate indexing terms from Homosaurus with scores such as:

echo "Stigma and lesbian, gay, bisexual, transgender, and queer (and additional identities) (LGBTQ+) parent socialization self-efficacy: Mediating roles of identity and community. || In the United States, cultural forces have led to the stigmatization of lesbian, gay, bisexual, transgender, and queer (and additional identities) (LGBTQ+) parenthood. However, pushing back against this stigmatization, developing a positive LGBTQ+ identity, and investing in one's LGBTQ+ community may inform empowering narratives of future parenthood and related constructs, such as LGBTQ+ parent socialization. Perceived self-efficacy related to preparation for bias (i.e., discussions of discrimination, prejudice, or bias-based bullying) socialization is likely associated with an individual's own perceptions or experiences of stigmatization given the conceptual overlap of bias and stigma. However, other constructs related to stigmatization and socialization self-efficacy, such as positive LGBTQ+ identity or community connectedness, have yet to be simultaneously considered (to our knowledge). Further, previous research has rarely included different assessments of stigma (i.e., perceived and enacted) and/or dimensions of positive LGBTQ+ identity (i.e., authenticity and self-awareness). Thus, this study aimed to rectify these gaps and provide a greater understanding of sexual stigma and LGBTQ+ parent socialization self-efficacy. Using data from a survey-based, online, cross-sectional study of LGBTQ+ childfree adults (N = 433; Mage = 29.85 years old) in the United States, we found that experiences of enacted or perceived sexual stigma were differentially associated with LGBTQ+ parent socialization preparation for bias self-efficacy. Further, positive LGBTQ+ identity authenticity and self-awareness, as well as LGBTQ+ community connectedness played distinct roles as mediators of the relationships between sexual stigma and LGBTQ+ parent socialization self-efficacy. 
These findings have implications for how we might understand the role of stigma, identity, community, and socialization among future LGBTQ+ parents. (PsycInfo Database Record (c) 2024 APA, all rights reserved)." | annif suggest homoIT-nn

2024-02-26T08:05:24.850Z INFO [omikuji::model] Loading model from data/projects/homoIT-omikujiB/omikuji-model...
2024-02-26T08:05:24.850Z INFO [omikuji::model] Loading model settings from data/projects/homoIT-omikujiB/omikuji-model/settings.json...
2024-02-26T08:05:24.850Z INFO [omikuji::model] Loaded model settings Settings { n_features: 433577, classifier_loss_type: Hinge }...
2024-02-26T08:05:24.855Z INFO [omikuji::model] Loading tree from data/projects/homoIT-omikujiB/omikuji-model/tree0.cbor...
2024-02-26T08:05:25.176Z INFO [omikuji::model] Loading tree from data/projects/homoIT-omikujiB/omikuji-model/tree1.cbor...
2024-02-26T08:05:25.498Z INFO [omikuji::model] Loading tree from data/projects/homoIT-omikujiB/omikuji-model/tree2.cbor...
2024-02-26T08:05:25.828Z INFO [omikuji::model] Loaded model with 3 trees; it took 0.98s
2024-02-26T08:05:26.544Z INFO [omikuji::model] Loading model from data/projects/homoIT-omikujiP/omikuji-model...
2024-02-26T08:05:26.544Z INFO [omikuji::model] Loading model settings from data/projects/homoIT-omikujiP/omikuji-model/settings.json...
2024-02-26T08:05:26.544Z INFO [omikuji::model] Loaded model settings Settings { n_features: 433577, classifier_loss_type: Hinge }...
2024-02-26T08:05:26.549Z INFO [omikuji::model] Loading tree from data/projects/homoIT-omikujiP/omikuji-model/tree0.cbor...
2024-02-26T08:05:26.961Z INFO [omikuji::model] Loading tree from data/projects/homoIT-omikujiP/omikuji-model/tree1.cbor...
2024-02-26T08:05:27.389Z INFO [omikuji::model] Loading tree from data/projects/homoIT-omikujiP/omikuji-model/tree2.cbor...
2024-02-26T08:05:27.809Z INFO [omikuji::model] Loaded model with 3 trees; it took 1.27s

So far so good. Now comes the query part. Our configuration for the NN backend of the project is:

[homoIT-nn]
name=Homosaurus NN Ensemble project
language=en
backend=nn_ensemble
sources=homoIT-mllm:0.0966,homoIT-stwfsa:0.1608,homoIT-fastText:0.3379,homoIT-omikujiB:0.3339,homoIT-omikujiP:0.0709
limit=100
vocab=homoIT
nodes=100
dropout_rate=0.2
epochs=10
lmdb_map_size=2147483648

Can we configure the result display to show multilingual labels like this (in both the command prompt and WSGI)?

<https://homosaurus.org/v3/homoit0001075> LGBTQ+ parents <corresponding Hindi label from column label_hi> <corresponding Bengali label from column label_bn> 0.2292
<https://homosaurus.org/v3/homoit0000914> LGBTQ+ parenthood <corresponding Hindi label from column label_hi> <corresponding Bengali label from column label_bn> 0.1296
<https://homosaurus.org/v3/homoit0000297> LGBTQ+ communities <corresponding Hindi label from column label_hi> <corresponding Bengali label from column label_bn> 0.1007

Thanks and regards

Parthasarathi Mukhopadhyay

University of Kalyani, Kalyani - 741 235 (WB), India

https://orcid.org/0000-0003-0717-9413

juho.i...@helsinki.fi

Feb 26, 2024, 5:55:43 AM

to Annif Users

Hi Parthasarathi!

Glad to hear about your project!

> Can we configure the result display with multilingual labels here in this way (both command prompt and WSGI) ?

Annif has no feature for this, in either the CLI or the Web UI. In general, I would say it is best to work only with the URIs for as long as possible and to display labels to the (end) user only when needed. If you plan to serve the Annif suggestions via a web page, the display of labels there could be customized with JavaScript in some way.
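As a rough illustration of the URI-first approach, a post-processing step outside Annif could attach the extra labels after the suggestions are produced. The sketch below is not an Annif feature. It assumes that subjects.csv has the columns uri, label_en, label_hi and label_bn, and that each suggestion line starts with the URI in angle brackets and ends with the score. The function names load_labels and relabel and the placeholder labels are made up for the example.

```python
import csv
import io


def load_labels(csv_file, languages=("en", "hi", "bn")):
    """Map each subject URI to its labels in the requested languages."""
    labels = {}
    # skipinitialspace tolerates headers written as "uri, label_en, ..."
    for row in csv.DictReader(csv_file, skipinitialspace=True):
        uri = row["uri"].strip().strip("<>")
        labels[uri] = [(row.get(f"label_{lang}") or "").strip() for lang in languages]
    return labels


def relabel(line, labels):
    """Replace the single label in one suggestion line with all known labels."""
    fields = line.split()
    if len(fields) < 2:
        return line.strip()
    # Assumption: first field is <uri>, last field is the score.
    uri_field, score = fields[0], fields[-1]
    names = labels.get(uri_field.strip("<>"))
    if names is None:
        return line.strip()  # unknown URI: leave the line unchanged
    return "\t".join([uri_field, *names, score])


# Tiny in-memory demo with placeholder (not real) Hindi and Bengali labels.
vocab = io.StringIO(
    "uri,label_en,label_hi,label_bn,notation\n"
    "https://homosaurus.org/v3/homoit0001075,LGBTQ+ parents,HINDI LABEL,BENGALI LABEL,\n"
)
labels = load_labels(vocab)
print(relabel("<https://homosaurus.org/v3/homoit0001075>\tLGBTQ+ parents\t0.2292", labels))
```

In practice the vocabulary would be read from the real subjects.csv, and the lines would come from the output of annif suggest homoIT-nn rather than from a hard-coded string.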

However, I think the behavior you are asking for could be achieved with a hackish approach: create a new "language" for the vocabulary, "enbnhi", in which each label is the concatenation of the English, Bengali and Hindi labels, so that subjects.csv would look like this:

uri, label_en, label_bn, label_hi, label_enbnhi, notation
https://homosaurus.org/v3/homoit0001765, Children of aromantic people, <Children of aromantic people in Bn>, <Children of aromantic people in Hn>, Children of aromantic people <Children of aromantic people in Bn> <Children of aromantic people in Hn>,
https://homosaurus.org/v3/homoit0001766, Children of asexual people, <Children of asexual people in Bn>, <Children of asexual people in Hn>, Children of asexual people <Children of asexual people in Bn> <Children of asexual people in Hn>,
https://homosaurus.org/v3/homoit0000257, Children of bisexual people, <Children of bisexual people in Bn>, <Children of bisexual people in Hn>, Children of bisexual people <Children of bisexual people in Bn> <Children of bisexual people in Hn>,
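A CSV like the one above does not have to be edited by hand. The following is a minimal sketch, assuming the existing file already has label_en, label_bn and label_hi columns. The function name add_combined_label is hypothetical. The new column is placed before notation only to mirror the example above, not because of a known Annif requirement, and whether Annif accepts the made-up language code "enbnhi" is not verified here.

```python
import csv
import io


def add_combined_label(src, dst, languages=("en", "bn", "hi")):
    """Copy a subjects CSV, adding a column that concatenates the labels."""
    reader = csv.DictReader(src, skipinitialspace=True)
    combined = "label_" + "".join(languages)  # e.g. "label_enbnhi"
    fields = list(reader.fieldnames)
    if combined not in fields:
        # Mirror the example layout: put the new column before "notation".
        pos = fields.index("notation") if "notation" in fields else len(fields)
        fields.insert(pos, combined)
    writer = csv.DictWriter(dst, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        parts = [(row.get(f"label_{lang}") or "").strip() for lang in languages]
        # Join only the labels that are actually present.
        row[combined] = " ".join(p for p in parts if p)
        writer.writerow(row)


# In-memory demo with placeholder (not real) Bengali and Hindi labels.
src = io.StringIO(
    "uri,label_en,label_bn,label_hi,notation\n"
    "https://homosaurus.org/v3/homoit0001766,Children of asexual people,BN LABEL,HI LABEL,\n"
)
dst = io.StringIO()
add_combined_label(src, dst)
print(dst.getvalue())
```

With real files, src and dst would be opened with open(..., encoding="utf-8", newline="") so that the Bengali and Hindi text is preserved.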

If you would like Annif itself to show suggestion labels in multiple languages, please create a GitHub issue for it in the Annif repository: https://github.com/NatLibFi/Annif/issues/new/choose

-Juho

Parthasarathi Mukhopadhyay

Feb 26, 2024, 6:21:37 AM

to Annif Users

Hello Juho

Thanks a lot for your guidance as always.

Could you please explain how Annif picks the concatenated labels from the label_enbnhi column? Is it because that column comes immediately before the notation column?

As this feature will help us create 'see references' (from the Bengali/Hindi labels to the corresponding English label in a cataloguing environment), I am going to create a feature request on GitHub in a day or two.

Best regards

Parthasarathi

Parthasarathi Mukhopadhyay

Feb 26, 2024, 6:27:07 AM

to Annif Users

Sorry, I think I understand it now.

annif load-vocab --language=enbnhi homoIT homoIT.ttl

I think the project configuration will still include language=en.

Regards
