Overview of CoLI-Dravidian: Word-level Code-Mixed Language Identification in Dravidian Languages
Academic Article in Scopus
Overview
abstract
Language Identification (LI) has traditionally focused on detecting the language of documents or sentences, primarily for high-resource languages such as English, Spanish, German, and French. With growing technological advancement, however, LI challenges have gained prominence in multilingual countries like India, where users often create code-mixed content by blending local languages with English. One example is the mixing of the Dravidian languages Tamil, Kannada, Malayalam, and Tulu with English. Such code-mixed texts require LI at the word level before they can be analyzed and processed in multilingual settings, and word-level LI serves as a preliminary step for many applications. Code-mixed Dravidian languages have rarely been explored in the context of word-level LI. To address this gap, the CoLI-Dravidian shared task focuses on word-level LI in code-mixed datasets of four Dravidian languages, Tamil, Kannada, Malayalam, and Tulu, written in Roman script. Participants are asked to assign each word in a given sequence to one of a set of predefined categories. Ten teams submitted model predictions, and the top-performing models achieved macro F1 scores of 0.7656, 0.9293, 0.8939, and 0.8678 for code-mixed Tamil, Kannada, Malayalam, and Tulu texts respectively. These scores show that strong results are achievable while the task remains challenging, most visibly for Tamil. © 2024 Copyright for this paper by its authors.
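The abstract reports results as macro F1 over word-level category labels. As a rough illustration of that metric, the sketch below computes the unweighted mean of per-class F1 for one tagged sequence. The label names (`kn`, `en`, `sym`), the example labels, and the function name `macro_f1` are hypothetical and are not taken from the shared task; the organizers' actual label set and scoring script may differ.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1   # correct prediction for class g
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was the true class but was missed

    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical word-level labels for a five-token sequence (assumed label set).
gold = ["kn", "en", "kn", "kn", "sym"]
pred = ["kn", "en", "en", "kn", "sym"]

print(round(macro_f1(gold, pred), 4))
```

Because every class contributes equally to the average regardless of how many words it covers, macro F1 penalizes models that do well only on the frequent categories, which is one plausible reason to prefer it for imbalanced word-level label sets.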