Re: Image Comparer 3.8 Build 711 Multilingual Portable LS


Selesio Gurule

Jul 9, 2024, 7:49:22 AM
to brenunrothse

Image captioning is the machine learning task of automatically generating a fluent natural-language description for a given image. It matters for improving accessibility for visually impaired users and is a core problem in multimodal research, spanning both vision and language modeling.
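As a concrete illustration of the task (not part of the original post), here is a minimal sketch that generates a caption with the Hugging Face transformers image-to-text pipeline; the checkpoint name nlpconnect/vit-gpt2-image-captioning and the input file example.jpg are assumptions made only for illustration.

# Minimal captioning sketch, assuming the `transformers` package is installed
# and the publicly available nlpconnect/vit-gpt2-image-captioning checkpoint
# can be downloaded; illustrative only, not the models discussed below.
from transformers import pipeline

captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

# The pipeline accepts a local path or URL and returns generated captions,
# e.g. [{"generated_text": "a dog running on the beach"}].
result = captioner("example.jpg")  # hypothetical input image
print(result[0]["generated_text"])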

Image Comparer 3.8 Build 711 Multilingual Portable LS

Download Zip: https://blltly.com/2yW0pm

Today we present and make publicly available the Crossmodal 3600 (XM3600) image captioning evaluation dataset as a robust benchmark for multilingual image captioning that enables researchers to reliably compare research contributions in this emerging field. XM3600 provides 261,375 human-generated reference captions in 36 languages for a geographically diverse set of 3600 images. We show that the captions are of high quality and the style is consistent across languages.

Creating large training and evaluation datasets in multiple languages is a resource-intensive endeavor. Recent work has shown that it is feasible to build multilingual image captioning models trained on machine-translated data with English captions as the starting point. However, some of the most reliable automatic metrics for image captioning are much less effective when applied to evaluation sets with translated image captions, resulting in poorer agreement with human evaluations compared to the English case. As such, trustworthy model evaluation at present can only be based on extensive human evaluation. Unfortunately, such evaluations usually cannot be replicated across different research efforts, and therefore do not offer a fast and reliable mechanism to automatically evaluate multiple model parameters and configurations (e.g., model hill climbing) or to compare multiple lines of research.

XM3600 provides 261,375 human-generated reference captions in 36 languages for a geographically diverse set of 3600 images from the Open Images dataset. We measure the quality of generated captions by comparing them to the manually provided captions using the CIDEr metric, which ranges from 0 (unrelated to the reference captions) to 10 (perfectly matching the reference captions). When comparing pairs of models, we observed strong correlations between the differences in the CIDEr scores of the model outputs and side-by-side human evaluations of those outputs, making XM3600 a reliable tool for high-quality automatic comparisons between image captioning models across a wide variety of languages beyond English.
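To make that comparison procedure concrete, the following rough sketch scores two hypothetical models against a set of reference captions and takes the difference of their corpus-level CIDEr scores. It assumes the pycocoevalcap package, whose Cider scorer takes dicts mapping image ids to lists of caption strings; the data and helper names are illustrative, and this is not the official XM3600 evaluation code.

# Sketch: comparing two captioning models via CIDEr deltas.
# Assumes `pycocoevalcap` is installed; all data below is made up.
from pycocoevalcap.cider.cider import Cider

def corpus_cider(references, hypotheses):
    # references: {image_id: [reference caption, ...]}
    # hypotheses: {image_id: [exactly one generated caption]}
    score, _per_image = Cider().compute_score(references, hypotheses)
    return score  # 0 (unrelated) .. 10 (perfect match)

references = {
    "img_001": ["a woman holding food at a market", "a person holds a snack outdoors"],
    "img_002": ["a red car parked next to a smiling man"],
}
model_a = {"img_001": ["a woman holding food"], "img_002": ["a red car on a street"]}
model_b = {"img_001": ["a market stall"], "img_002": ["a car"]}

delta = corpus_cider(references, model_a) - corpus_cider(references, model_b)
print(f"CIDEr delta (A - B): {delta:+.3f}")  # a positive delta favours model A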

The images were selected from among those in the Open Images dataset that have location metadata. Since many regions have more than one spoken language, and some areas are not well covered by these images, we designed an algorithm to maximize the correspondence between selected images and the regions where the targeted languages are spoken. The algorithm starts with the languages for which we have the smallest pool of images with matching geo-data (e.g., Persian) and processes languages in increasing order of their candidate image pool size. If there aren't enough images in an area where a language is spoken, we gradually expand the geographic selection radius to: (i) a country where the language is spoken; (ii) a continent where the language is spoken; and, as a last resort, (iii) anywhere in the world. This strategy succeeded in providing our target of 100 images from an appropriate region for most of the 36 languages, except for Persian (where 14 continent-level images are used) and Hindi (where all 100 images are at the global level, because the in-region images were assigned to Bengali and Telugu).
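A self-contained sketch of this selection strategy follows; the 100-image target and the region/country/continent/global fallback come from the description above, while the data layout and function names are assumptions made for illustration.

import random

# Sketch of the greedy, pool-size-ordered image selection described above.
# `candidates[lang][level]` is assumed to hold the ids of Open Images photos
# whose location metadata matches that language at that geographic level.
TARGET_PER_LANGUAGE = 100
LEVELS = ["region", "country", "continent", "global"]

def select_images(candidates, target=TARGET_PER_LANGUAGE, seed=0):
    rng = random.Random(seed)
    taken = set()      # images already assigned to another language
    selection = {}
    # Process languages with the smallest in-region pools first (e.g., Persian),
    # so scarce in-region images are not consumed by well-covered languages.
    for lang in sorted(candidates, key=lambda l: len(candidates[l]["region"])):
        chosen = []
        # Gradually widen the geographic radius until the target is met:
        # region -> country -> continent -> anywhere in the world.
        for level in LEVELS:
            pool = [img for img in candidates[lang].get(level, [])
                    if img not in taken and img not in chosen]
            rng.shuffle(pool)
            chosen.extend(pool[: target - len(chosen)])
            if len(chosen) >= target:
                break
        taken.update(chosen)
        selection[lang] = chosen
    return selection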

Annotators work in batches of 15 images. The first screen shows all 15 images with their captions in English, as generated by a captioning model trained to output a consistent style of the form "<main salient objects> doing <activities> in the <environment>", often with object attributes such as a "smiling" person or a "red" car. The annotators are asked to rate the caption quality according to guidelines for a 4-point scale from "excellent" to "bad", plus an option for "not_enough_information". This step forces the annotators to carefully assess caption quality and primes them to internalize the style of the captions. The following screens show the images again, individually and without the English captions, and the annotators are asked to produce descriptive captions in the target language for each image.
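Purely as an illustration of this protocol (the rating labels beyond the quoted ones, the class names, and the batching helper are assumptions), the workflow could be modelled roughly as:

from enum import Enum

class CaptionRating(Enum):
    # 4-point quality scale plus the extra option mentioned above;
    # the intermediate label names are assumptions.
    EXCELLENT = 4
    GOOD = 3
    MEDIOCRE = 2
    BAD = 1
    NOT_ENOUGH_INFORMATION = -1

BATCH_SIZE = 15  # annotators work in batches of 15 images

def annotation_batches(image_ids, size=BATCH_SIZE):
    """Yield batches of image ids: the first screen of a batch rates the
    English captions, later screens collect target-language captions."""
    for start in range(0, len(image_ids), size):
        yield image_ids[start:start + size]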

We ran two to five pilot studies per language to troubleshoot the caption generation process and to ensure high-quality captions. We then manually evaluated a random subset of captions: first, we randomly selected a sample of 600 images; then, to measure the quality of captions in a particular language, we selected one of the manually generated captions per image for evaluation. We found that:

Recently, PaLI used XM3600 to evaluate model performance beyond English for image captioning, image-to-text retrieval, and text-to-image retrieval. The key takeaway from evaluating on XM3600 was that multilingual captioning benefits greatly from scaling up the PaLI models, especially for low-resource languages.

Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pre-trained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train a large, 4-billion parameter ViT (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.
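To illustrate the interface the abstract describes (text generated from combined visual and textual inputs), here is a toy encoder-decoder sketch in PyTorch. It is emphatically not the PaLI architecture: the patch projection standing in for a ViT, the tiny Transformer, and all dimensions are assumptions chosen only to show the data flow.

import torch
import torch.nn as nn

class ToyImageTextToText(nn.Module):
    """Toy illustration of an image+text -> text interface (not PaLI itself)."""
    def __init__(self, vocab_size=1000, d_model=128, patch_size=16, image_size=64):
        super().__init__()
        self.patch_size = patch_size
        patch_dim = 3 * patch_size * patch_size
        # Stand-in for the ViT image encoder: flatten patches and project them.
        self.visual_proj = nn.Linear(patch_dim, d_model)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image, prompt_ids, target_ids):
        b, c, _, _ = image.shape
        p = self.patch_size
        # (B, C, H, W) -> (B, num_patches, C*p*p): non-overlapping patches.
        patches = image.unfold(2, p, p).unfold(3, p, p)
        patches = patches.contiguous().view(b, c, -1, p, p)
        patches = patches.permute(0, 2, 1, 3, 4).reshape(b, -1, c * p * p)
        visual_tokens = self.visual_proj(patches)
        # Encoder input: visual tokens followed by the embedded text prompt.
        enc_in = torch.cat([visual_tokens, self.text_embed(prompt_ids)], dim=1)
        dec_in = self.text_embed(target_ids)
        causal = self.transformer.generate_square_subsequent_mask(target_ids.size(1))
        hidden = self.transformer(enc_in, dec_in, tgt_mask=causal)
        return self.lm_head(hidden)  # logits over the output vocabulary

model = ToyImageTextToText()
logits = model(
    image=torch.randn(2, 3, 64, 64),
    prompt_ids=torch.randint(0, 1000, (2, 8)),    # e.g. a "caption in <lang>" prompt
    target_ids=torch.randint(0, 1000, (2, 12)),   # shifted caption tokens
)
print(logits.shape)  # torch.Size([2, 12, 1000])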

Exploring multilingualism as a complex, context-related, societal and individual phenomenon, this book centres around perspectives on how multiple languages are made (in)visible within educational settings in the Global North. The authors of each chapter compare and contrast findings across geographical contexts with the goal of understanding the facets of multilingualism that, on the one hand, conform across contexts, and on the other, diverge context-specifically. The chapters range from contributions with a focus on national/state planning for the development of sustainable multilingual and intercultural educational policies, to chapters that deal with multilingual practices and identities of students and student teachers as well as the consequences for language practices, strategies and policies in diversifying societies. This cross-contextual, comparative and interdisciplinary exploration of multilingualism will be of great interest to researchers, administrators, practitioners and students within the fields of multilingual education, sociolinguistics, youth culture and identity studies. The book is open access under a CC BY NC ND licence.

Siv Björklund is Professor of Swedish Immersion and Multilingualism at Åbo Akademi University, Finland. Her research interests include language immersion and multilingual education and she serves as a board member of the Journal of Immersion and Content-Based Language Education.

Mikaela Björklund is City Manager of Närpes and a researcher at Åbo Akademi University, Finland. Her research focuses on multilingualism and multiculturalism in education in the Nordic countries, as well as CLIL, linguistic schoolscapes and Dominant Language Constellations.

To understand images thoroughly, we believe three key elements need to be added to existing datasets: a grounding of visual concepts to language (Kiros et al. 2014), a more complete set of descriptions and QAs for each image based on multiple image regions (Johnson et al. 2015), and a formalized representation of the components of an image (Hayes 1978). In the spirit of mapping out this complete information of the visual world, we introduce the Visual Genome dataset. The first release of the Visual Genome dataset uses 108,077 images from the intersection of the YFCC100M (Thomee et al. 2016) and MS-COCO (Lin et al. 2014) datasets. Section 5 provides a more detailed description of the dataset. Below we highlight the motivation and contributions of the three key elements that set Visual Genome apart from existing datasets.

With a set of dense descriptions of an image and the explicit correspondences between visual pixels (i.e., bounding boxes of objects) and textual descriptors (i.e., relationships, attributes), the Visual Genome dataset is poised to be the first image dataset capable of providing a structured, formalized representation of an image, in the form widely used in knowledge base representations in NLP (Zhou et al. 2007; GuoDong et al. 2005; Culotta and Sorensen 2004; Socher et al. 2012). For example, in Fig. 1, we can formally express the relationship between the woman and the food as holding(woman, food). Putting together all the objects and relations in a scene, we can represent each image as a scene graph (Johnson et al. 2015). The scene graph representation has been shown to improve semantic image retrieval (Johnson et al. 2015; Schuster et al. 2015) and image captioning (Farhadi et al. 2009; Chang et al. 2014; Gupta and Davis 2008). Furthermore, all objects, attributes, and relationships in each image in the Visual Genome dataset are canonicalized to their corresponding WordNet (Miller 1995) IDs (called synset IDs). This mapping connects all images in Visual Genome and provides an effective way to consistently query the same concept (object, attribute, or relationship) in the dataset. It can also potentially help train models that learn from contextual information across multiple images (Figs. 2, 3).
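As a rough sketch of what such a structured representation might look like in code (the class names and fields below are illustrative assumptions, not the dataset's actual schema; the WordNet lookup assumes NLTK with the wordnet corpus downloaded via nltk.download("wordnet")):

from dataclasses import dataclass, field
from typing import List
from nltk.corpus import wordnet as wn  # assumes nltk.download("wordnet") was run

@dataclass
class SceneObject:
    name: str
    bbox: tuple                          # (x, y, width, height) in pixels
    attributes: List[str] = field(default_factory=list)
    synset_id: str = ""                  # canonical WordNet synset, e.g. "woman.n.01"

@dataclass
class Relationship:
    predicate: str                       # e.g. "holding"
    subject: SceneObject
    obj: SceneObject

# Tiny scene graph for the holding(woman, food) example above.
woman = SceneObject("woman", (32, 40, 120, 260), ["smiling"], "woman.n.01")
food = SceneObject("food", (90, 180, 60, 50), synset_id="food.n.01")
graph = [Relationship("holding", woman, food)]

# Canonicalizing names to WordNet synsets lets the same concept be queried
# consistently across images.
print(wn.synset(woman.synset_id).definition())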
