Download the UMLS dataset

Anonymity Klein

Apr 24, 2024, 9:51:52 AM

to LLMs4OL Challenge

Hi!

I have recently been preparing to solve Task B and am training my model. I noticed that for SubTask B.3 there is no corresponding training set available in your GitHub repository.

https://github.com/HamedBabaei/LLMs4OL-Challenge-ISWC2024/tree/main/TaskB-Taxonomy%20Discovery/SubTaskB.3-UMLS

I then noticed the following description on your contest site:

If you download the UMLS dataset, it is mandatory to agree to the user agreement by UMLS. Please follow this link to complete the procedure. Note the UMLS dataset cannot be downloaded without complying with UMLS terms of use. This is a strict requirement.

I have checked the link, but I am still confused about which dataset is allowed for the competition, since you seem to restrict participants to the dataset you provide rather than the full dataset in question.

On that point, please also clarify whether we as participants may train our models only on the dataset you provide in the GitHub repo listed above. Beyond that, are we only allowed to create or collect additional contextual data from other sources (e.g. descriptions of the GeoNames types Hill, Rock, etc.), while other GeoNames entities that are not included in your dataset are not permitted?

Looking forward to your reply! Greetings from RWTH :)

Best Regards,

Yixin Peng

Anonymity Klein

Apr 30, 2024, 11:15:34 AM

to LLMs4OL Challenge

Hi!

Can anyone give me some hints?

BR,

Yixin Peng

hamedbabaeigiglou

Apr 30, 2024, 11:49:41 AM

to LLMs4OL Challenge

Dear Yixin,

Sorry for the late reply.

Thanks for your question on UMLS. For UMLS, we curated the data in a specific format for the tasks. Due to the UMLS license, we cannot share the curated dataset unless participants obtain the license themselves and send us a screenshot by email; we can then share the curated task data with them. To obtain the dataset, please visit https://www.nlm.nih.gov/databases/umls.html and follow the "How to Request a License and Create a UTS Account" section to obtain the license, then share a screenshot or email with us via llms4ol....@gmail.com, and we will share the UMLS dataset.

About your second question: we provide participants with training data, and using external sources to collect contextual data to help training is welcome. Note that in the "Few-shot testing phase" we provide a training set and keep the test set for evaluation; if you want, you can collect data to increase the number of training samples or obtain descriptions for types (as you mentioned), etc. For the "Zero-shot testing phase", we will share an unseen test set from different ontologies for evaluation. The website will be updated with more information on the Zero-shot testing phase.

If you have any further questions, please don't hesitate to reach out. We are happy to have you on board for LLMs4OL.

Best Regards,

Hamed

On behalf of the LLMs4OL Organization Team

Anonymity Klein

Apr 30, 2024, 3:52:23 PM

to LLMs4OL Challenge

Hi Hamed,

Thanks for your reply and detailed explanation!

For the UMLS dataset, I have applied for a UTS account, and my request is still under review. Once I receive the verification email, I will take a screenshot (or forward the email directly) and send it to the email address you provided.

For the second question, I still have some uncertainty.

As you announced on the public challenge website:

Thus participants are strictly restricted to using the dataset we as the challenge organizers release. Training on the re-created datasets from the full source ontologies would imply unfair testing of the systems where the systems would not be considered solutions to the OL problem.

Taking the TaskB.1-GeoNames dataset as an example: you have released a training dataset of size 476 at this stage. There is also a training set of size 204 that has not been released.

I have some questions about this statement you made:

"and if you want you can collect data to increase the number of samples in training "

Are we allowed to collect more GeoNames data from this [data source](https://download.geonames.org/export/dump/) for training? In your previous project, you extracted 680 positive examples from that data source. (This is exactly the sum of your published test and training set sizes. Hmm, were you only able to extract that many examples from this data source, or are you still holding some back?) I suspect you don't expect us to train our model on that.

Or can we get data related to GeoNames from other data sources to expand the dataset you provide for training?

Can you explain it in more detail?

Looking forward to your reply!
BR,

Yixin

jenlindadsouza

May 1, 2024, 3:03:39 AM

to LLMs4OL Challenge

Dear Yixin,

Thank you for reaching out with your questions!

I'd like to clarify one important aspect regarding the ontologies for our shared tasks. As organizers, we specifically request that participants train their systems only on the types (Task A) and relations (Tasks B and C) provided by us. While it may be tempting, accessing the full ontologies from other sources is not allowed. This is because each ontology is intentionally split into two segments: one for system training and the other for testing during the few-shot evaluation phase. Using the complete ontology would compromise the fairness of the evaluations, as systems might inadvertently learn information meant only for the testing phase.

I believe you understand the rationale behind this, but it's crucial that all participants adhere to these guidelines to ensure a fair and meaningful assessment of all models involved in the task. However, for the training data we've released, you're not limited to the context sentences provided in the dataset. Feel free to enhance your models with additional context from external sources, such as Wikipedia.

Thank you for your cooperation, and I wish you the best in your preparations!

Warm regards,

Jennifer

Anonymity Klein

May 2, 2024, 11:47:24 AM

to LLMs4OL Challenge

Hi Jennifer,

Thanks for your kind explanation! I think I know what to do for the next step.