Fw: Fwd: Five New Machine Learning Datasets in Agriculture, Health, and Language Domains

Chris Fourie

Sep 2, 2024, 4:38:08 AM

to sisonk...@googlegroups.com

FYI

Chris Fourie | +27767681853 | chrisfourie.africa | Matrix | LinkedIn
MBBCh (Wits) MSc ComSci (Wits)

------- Forwarded Message -------
From: Vukosi Marivate <vukosi....@cs.up.ac.za>
Date: On Thursday, 29 August 2024 at 08:19
Subject: Fwd: Five New Machine Learning Datasets in Agriculture, Health, and Language Domains
To: ds...@googlegroups.com <ds...@googlegroups.com>, mlds-...@googlegroups.com <mlds-...@googlegroups.com>, masakh...@googlegroups.com <masakh...@googlegroups.com>

See below new datasets available.

Vukosi

ABSA Chair of Data Science, Assoc. Professor
Data Science for Social Impact Research Group
TEDxPretoria Talk
DS@UP Newsletter | MLDS Africa mailing list
Dept of Computer Science, University of Pretoria
🐦@vukosi 🔗vukosi.marivate

---------- Forwarded message ---------
From: Lacuna Fund <secre...@lacunafund.org>
Date: Thu, 29 Aug 2024 at 14:02
Subject: Five New Machine Learning Datasets in Agriculture, Health, and Language Domains
To: <vukosi....@cs.up.ac.za>

Our newest datasets!

Five New Machine Learning Datasets in Agriculture, Health, and Language Domains
Today, we are excited to announce five recently published datasets for training artificial intelligence in the domains of Agriculture, Health, and Natural Language Processing (NLP). These datasets harness the power of AI to address urgent social and economic problems in Africa and Latin America.

Learn more about these datasets and how to access them below!

Lacuna Fund is a coalition of funders, data scientists, and data users including The Rockefeller Foundation, Google.org, Canada’s International Development Research Centre (IDRC), German Federal Ministry for Economic Cooperation and Development (BMZ), Wellcome, Gordon and Betty Moore Foundation, Patrick J. McGovern Foundation, and Robert Wood Johnson Foundation, committed to filling data gaps and making machine learning and AI more equitable, accurate, and accessible worldwide.

We extend our deep gratitude to our funders who make the creation of these datasets possible.
Agriculture
A region-wide, multi-year set of crop field boundary labels for Africa
Contacts:
Mary Dziedzorm Afenyo | Farmerline | ma...@farmerline.co
Lyndon Estes | Clark University | les...@clarku.edu
Primož Kovačič | Spatial Collective | pri...@spatialcollective.com
This dataset provides continent-wide crop field labels for Africa, improving the availability and use of crop field boundary (parcel) maps. It contains 42,403 annotated geospatial polygons indicating the boundaries of individual crop fields spanning the years 2017-2023. These annotations, done by the project team, were created in combination with existing satellite imagery for 33,746 unique field boundary sites. The sites were defined as unique spatial locations of approximately 550 meters by 550 meters, overlaid on the satellite images.

The outputs from this project include GeoParquet field boundary files; a CSV file with ID, name, coordinates, date, and quality metrics; digitized Planet image chips for each site; a Jupyter notebook to filter the quality metric catalog and create rasterized labels; a CSV file with an example filtered catalog from the notebook; and a set of example rasterized labels from the notebook. Together, these can be used to label fields and to train models that map agricultural fields over large areas and multiple years.
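
As a rough illustration of what filtering a quality metric catalog can look like, here is a minimal Python sketch. It is not the project's notebook: the column names ("site_id", "quality"), the threshold, and every value in the sample catalog are invented for the example, so check the real CSV header before adapting it.

```python
import csv
import io

# Toy stand-in for the label catalog CSV. The column names ("site_id",
# "quality") and all values are invented for illustration only.
SAMPLE_CATALOG = """site_id,name,x,y,date,quality
A001,site-a,30.10,-1.95,2019-05-01,0.91
A002,site-b,36.82,-1.29,2021-03-15,0.42
A003,site-c,-1.02,7.95,2023-08-20,0.77
"""


def filter_catalog(csv_text, min_quality):
    """Return catalog rows whose quality score is at least min_quality."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if float(row["quality"]) >= min_quality]


kept = filter_catalog(SAMPLE_CATALOG, min_quality=0.7)
kept_ids = [row["site_id"] for row in kept]
print(kept_ids)
```

With the real catalog, the same function could be pointed at the file contents instead of the inline sample, and the threshold chosen to match the label quality a given experiment needs.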

This dataset can be used in a variety of ways to train and assess machine learning models for agricultural applications. Models could learn to distinguish between boundaries and the interiors of fields with boundary-aware semantic segmentation. It might also be used to create binary crop and non-crop labels. Finally, the full catalog can be used to test the impact of label quality on overall model performance.
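
To make the distinction between field boundaries and field interiors concrete, the sketch below derives three-class labels from an already rasterized binary field mask. It is an assumption-laden toy, not the project's method: the function name, the class codes (0 = non-field, 1 = interior, 2 = boundary), and the 4-neighbour rule are all choices made for this example, and the tiny grid is made up.

```python
def three_class_labels(mask):
    """Convert a binary field mask into 0 = non-field, 1 = interior, 2 = boundary.

    A field pixel counts as boundary when any 4-connected neighbour is
    non-field or lies outside the grid (an assumption of this sketch).
    """
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            neighbours = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            on_edge = any(
                not (0 <= nr < rows and 0 <= nc < cols) or not mask[nr][nc]
                for nr, nc in neighbours
            )
            out[r][c] = 2 if on_edge else 1
    return out


# Made-up 5x5 mask containing one 3x3 field.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
labels = three_class_labels(mask)

# Collapsing boundary and interior gives back binary crop / non-crop labels.
binary = [[1 if v else 0 for v in row] for row in labels]
```

The same three-class grid can serve boundary-aware segmentation, while the collapsed version corresponds to the binary crop and non-crop use mentioned above.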
Authors and Affiliations:
Wussah, A., Afenyo, M., Osei, A.K., Gathigi, M., Kovačič, P., Muhando, J., Addai, F., Akakpo, E.S., Allotey, M., Amkoya, P., Amponsem, E., Dadon, K.D., Gyan, V., Harrison, X.G., Heltzel, E., Juma, C., Mdawida, R., Miroyo, A., Mucha, J., Mugami, J., Mwawaza, F., Nyarko, D., Oduor, P., Ohemeng, K., Segbefia, S.I.D., Tumbula, T., Wambua, F., Yeboah, F., Estes, L.D., 2024.
Dataset:
Zenodo: https://zenodo.org/records/11060871
GitHub: https://github.com/agroimpacts/lacunalabels
AWS Open Data Registry: https://registry.opendata.aws/africa-field-boundary-labels/
Health
Childhood Malnutrition in Chile
Contact: Maria Paz Hermosilla | gob...@aui.cl
This data repository supports the evaluation of children's nutritional status in Chile, the factors that contribute to child malnutrition, and the associated costs. The focus at this stage is on estimating the health costs associated with child malnutrition and identifying the biopsychosocial determinants that lead to it. Before this project began, there was no integrated repository to inform policies on this issue in Chile.

There are a total of more than 1.4 billion records in this repository, classified by data source and by specific period. The longitudinal database of children under 18 years old contains information on health, family, school, social and cultural factors, health-related spending, and other related data such as information about family members that may be relevant for future studies. Most of the data comes from 2015-2022, although some of the databases include older data (e.g., births from 1992-2022; hospital discharges from 2001-2022).
Authors and Affiliations:
Ministry of Health, Chile
GobLab, School of Government, Adolfo Ibañez University, Chile
FONASA (Public health insurance agency)
Health Superintendency
JUNAEB (national school aid and scholarship board)
Dataset: Given the sensitive nature of the data in this repository, access is controlled and granted to relevant awarded research projects. Those interested can visit the project website: https://goblab.uai.cl/proyecto-reduccion-de-la-malnutricion-infantil-en-chile/
Lacuna Malaria Datasets
Contact: Rose Nakasi | g.naka...@gmail.com or rose....@mak.ac.ug
This dataset will aid in the diagnosis of malaria. The dataset contains annotated images of blood samples collected in Uganda and Ghana with objects of interest, including parasites and white blood cells. It significantly increases the number of available microscopy images — including metadata — by 6,000 thick blood slides and 2,000 thin blood slides for use in object detection research and other areas of inquiry.

This work is the product of a collaboration between Makerere Artificial Intelligence Lab and minoHealth. The team at Makerere University collected 4,000 images, including 1,000 thin blood slides (100% annotated) and 3,000 thick blood slides (82% annotated). The minoHealth team collected an additional 1,000 thin blood slides and 3,000 thick blood slides. For thick blood smear images, the annotations include bounding boxes showing malaria parasites and white blood cells. For thin blood smear images, they include malaria parasites, parasite type (Trophozoite or Gametocyte), and parasitized cells. Some images also include data on the physical slide from which the image was captured, such as the stage micrometer readings of the microscope and the microscope objective settings used to capture the image.
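
The per-team figures above can be tallied against the headline totals of 6,000 thick and 2,000 thin blood slides with a few lines of Python. This is only a bookkeeping sketch: the dictionary layout and variable names are invented here, and it assumes the 82% figure applies to the 3,000 thick-slide images from Makerere, which gives an approximate count rather than an official one.

```python
# Slide counts as stated in the announcement, per contributing team.
counts = {
    "Makerere": {"thin": 1000, "thick": 3000},
    "minoHealth": {"thin": 1000, "thick": 3000},
}

# Annotated share reported for the Makerere images only.
makerere_annotated_share = {"thin": 1.00, "thick": 0.82}

total_thin = sum(team["thin"] for team in counts.values())
total_thick = sum(team["thick"] for team in counts.values())

# Approximate number of annotated thick-slide images from Makerere.
makerere_annotated_thick = round(
    counts["Makerere"]["thick"] * makerere_annotated_share["thick"]
)

print(total_thin, total_thick, makerere_annotated_thick)  # 2000 6000 2460
```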
Authors and Affiliations:
Makerere Artificial Intelligence Lab
minoHealth
Dataset:
Harvard Dataverse: https://doi.org/10.7910/DVN/VEADSE
Language
BIG-C: A Multimodal Multi-Purpose Dataset for Bemba
Contact: Claytone Sikasote | clayton...@gmail.com
The BIG-C (Bemba Image Grounded Conversations) dataset consists of multi-turn dialogues between Bemba speakers, grounded on images, transcribed, and translated into English. Specifically, there are over 92,000 sentences, amounting to over 180 hours of speech data with corresponding Bemba transcriptions and English translations. Bemba is the most widely spoken language in Zambia, but a lack of linguistic data resources has constrained advancements and applications in language technologies and language processing research. This project has built the first-ever large-scale multimodal dataset for Bemba, for use in speech recognition, machine translation, speech translation, language modeling, multimodal translation systems, and grounded learning based on images. It is a crucial resource for research and development of language technologies for Bemba.

By making the dataset available to the public and research community, this project will foster research and encourage collaboration across the language, speech, and vision communities, especially for traditionally under-resourced languages.
Authors and Affiliations:
Claytone Sikasote — University of Zambia, Zambia
Eunice Mukonde-Mulenga — University of Zambia, Zambia
Md Mahfuz Ibn Alam — George Mason University, USA
Antonios Anastasopoulos — George Mason University, USA
Dataset:
GitHub: https://github.com/csikasote/bigc

Publication:
ACL Anthology: https://aclanthology.org/2023.acl-long.115
KALLAAMA
Contacts:
Aminata Ndiaye | amina....@jokalante.com
Elodie Gauthier | elodie....@orange.com
This dataset will strengthen natural language processing resources for Wolof, Pulaar, and Serer, the three most widely spoken languages in Senegal.

Although datasets exist in Wolof, there is a lack of data for Pulaar and Serer. This project has played a crucial role in filling this gap. This dataset’s repository of transcribed speech includes over 55 hours (12 files) of transcribed speech in Wolof, 38 hours (105 files) in Serer, and 31 hours (83 files) in Pulaar. The repository also includes over 12 hours of verified recordings in each language, textual data containing over 947,000 words in Wolof, and 593,000 in Pulaar. It also includes a pronunciation lexicon of over 54,000 phonetized entries in Wolof.

This dataset can be used to solve tasks including speech-to-text, question answering, and language learning, and can help fine-tune multilingual models. The data can also be used to develop speech modeling, automatic response modeling, local-language speech recognition, transcription systems, and personal assistants capable of answering questions relating to agricultural advisories for smallholder farmers.
Authors and Affiliations:
Project Leader: Aminata Ndiaye Diallo (Jokalante, Dakar, Senegal)
Stakeholders: Elodie Gauthier (Orange Innovation, Lannion, France), Abdoulaye Guissé (Ecole Polytechnique de Thiès, Senegal)
Intern: Boubacar Diallo (Assane Seck University, Ziguinchor, Senegal) - Collection of textual dataset
Trainees: Maimouna Diallo (Cheikh Anta Diop University, Dakar, Senegal) - Wolof transcription, Houleye Amadou Kane (Cheikh Anta Diop University, Dakar, Senegal) - Pulaar transcription, Fatou Diouf (Cheikh Anta Diop University, Dakar, Senegal) - Serer transcription
Dataset:
GitHub: https://github.com/gauthelo/kallaama-speech-dataset
OpenSLR: https://www.openslr.org/151/
Zenodo: https://zenodo.org/records/10892569
View all Lacuna Fund datasets!

Learn more about published Lacuna-funded datasets on our Datasets page! 

We share datasets on a quarterly basis on our website and social media platforms. Subscribe to the Lacuna Fund newsletter below and follow us on social media to stay updated on these announcements. 

Meridian Institute serves as Secretariat and fiscal agent for Lacuna Fund.

Lacuna Fund is a multi-stakeholder engagement composed of funders, technical experts, thought leaders, local beneficiaries, and end users. Collectively, we are committed to creating and mobilizing datasets that both solve urgent local problems and lead to a step change in machine learning’s potential worldwide.

Learn more about Lacuna Fund's funders and governance.
Copyright © 2024 Lacuna Fund, All rights reserved.
