Invitation to Participate in the "CoLI-Tunglish: Word-level Language Identification in Code-mixed Tulu Texts" Shared Task

Dr. Hamada Nayel

Jun 5, 2023, 12:20:05 PM
to ml-...@googlegroups.com
Language Identification (LI) is the automated process of identifying the languages used in a given text. It is often a preliminary step for applications such as sentiment analysis, machine translation, information retrieval, and natural language understanding. Word-level LI can be modeled as a sequence labeling task: each word in a sentence is assigned a language label from a predefined set of languages. Although much research has been conducted on LI, identifying languages in code-mixed text remains an open challenge.
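To make the sequence labeling framing concrete, here is a minimal sketch of a lexicon-lookup baseline that assigns one label to each word. It is not the organizers' method. The label names, the `TOY_LEXICON` entries, the `label_words` helper, and the default label are all assumptions made for illustration and are not taken from the CoLI-Tunglish dataset.

```python
from typing import Dict, List

# Hypothetical label set; the official tag names of the shared task may differ.
LABELS = ("Tulu", "Kannada", "English", "Mixed")

# Toy lexicon for illustration only (not taken from the CoLI-Tunglish dataset).
TOY_LEXICON: Dict[str, str] = {
    "encha": "Tulu",
    "tumba": "Kannada",
    "super": "English",
    "video": "English",
}


def label_words(tokens: List[str], lexicon: Dict[str, str],
                default: str = "Tulu") -> List[str]:
    """Return exactly one language label per token (lookup baseline).

    Words missing from the lexicon fall back to `default`, an arbitrary
    choice made for this sketch.
    """
    return [lexicon.get(token.lower(), default) for token in tokens]


sentence = "Super video encha".split()
labels = label_words(sentence, TOY_LEXICON)

# One word and its label per line, tab-separated.
for word, label in zip(sentence, labels):
    print(f"{word}\t{label}")
```

The key property of the framing is that the output has the same length as the input, with one label per word. A real system would replace the lookup with a trained sequence model.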

Tulu is a regional language of Karnataka, India, and Kannada is the state's official language. Tuluvas (people whose mother tongue is Tulu) can usually read, write, and speak both Tulu and Kannada fluently, and many Kannada words are used in Tulu. English is also known by many Tulu speakers, especially those who are active on social media platforms. Tulu songs, videos, movies, comedy programs, and skits are popular on social media, and the comments that Tulu users post on them are usually a code-mix of Tulu, Kannada, and English. Even though Tuluvas are proficient in reading, writing, and speaking Tulu, many of them find it difficult to use the Kannada script for messages or comments on social media because of the technological limitations of keyboards and keypads on computers and smartphones. In addition, the complexity of forming words with consonant conjuncts makes it challenging to write Tulu text in Kannada script. As a result, many users write in Roman script only, or in a combination of Kannada and Roman scripts. This has generated a large amount of trilingual code-mixed data that has rarely been explored for research purposes.

To address word-level LI in code-mixed Tulu-English (Tu-En) texts, comments were extracted from Tulu YouTube videos to construct the Code-mixed Tulu-English Language Identification (CoLI-Tunglish) dataset. We encourage participants to use the CoLI-Tunglish dataset, which consists of Tulu, Kannada, English, and mixed-language words in Roman script, and to submit their methods to the CoLI-Tunglish shared task, in which each word is to be identified and categorized into one of the predefined categories.

Shared task homepage

Important dates:

- 25th May: Open track websites and training data release
- 10th July: Test data release
- 1st August: Run submission deadline
- 15th August: Results declared
- 15th September: Working notes due
- 15th October: Camera-ready copies of working notes and overview paper due

Organizers

Hosahalli Lakshmaiah Shashirekha
Professor, Department of Computer Science, Mangalore University, India.

Hamada A. Nayel
Professor, Department of Computer Science, Faculty of Computers and Artificial Intelligence, Benha University, Egypt.

Asha Hegde
PhD student, Department of Computer Science, Mangalore University, India.

Fazlourrahman Balouchzahi
PhD student, Center for Computing Research (CIC), Instituto Politécnico Nacional (IPN), Mexico.

Sabur Butt
PhD student, Center for Computing Research (CIC), Instituto Politécnico Nacional (IPN), Mexico.

Sharal Coelho
PhD student, Department of Computer Science, Mangalore University, India.
