PolyModNet: Advanced positional encodings and ethical bias mitigation in adaptive multimodal fusion for multilingual language understanding
Academic Article in Scopus
Abstract
Natural Language Understanding (NLU) plays a crucial role in Natural Language Processing (NLP), enabling machines to interpret and process human language across various applications. Despite advancements, challenges remain, including variations in data types, inconsistencies in labeling, computational demands, and biases in training datasets. These challenges emphasize the need for ethical and effective NLU solutions. To address these issues, the proposed PolyModNet combines techniques from NLP and computer vision to improve both text and image understanding. The model enhances data representation and compensates for limited training data using advanced augmentation methods such as mixup, gridmask, and positional encoding, optimized for the Vision Transformer. By integrating RoBERTa-BERT and the Vision Transformer, PolyModNet ensures accurate alignment of text and image features through Transformer-based encoding, specialized transformations, and structured positional encodings. Additionally, it employs a universal multilingual framework that enables language-independent retrieval and flexible task adaptation. Ethical concerns are addressed through bias detection and adversarial training, ensuring fairness in multimodal analysis. Extensive evaluations demonstrate the model's effectiveness across multiple NLP tasks, achieving 85.71% accuracy in sentiment analysis, strong text classification performance (CoLA: 64.1%, SST-2: 96.4%), and high accuracy in text-image retrieval (R@1: 72.00, R@5: 89.25, R@10: 92.10). The model also delivers competitive results in multimodal translation (BLEU: 45.36, METEOR: 55.62) and cross-modal retrieval (text-to-image R@1: 67.4, image-to-text R@1: 82.3). © 2025 Elsevier B.V.
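To make the fusion idea in the abstract concrete, the sketch below shows mixup augmentation combined with a joint Transformer encoder over text tokens and image patches with learned positional encodings. This is a minimal illustration, not the authors' released implementation: all module names, dimensions, the toy encoders standing in for RoBERTa-BERT and the Vision Transformer, and the plain nn.TransformerEncoder fusion block are assumptions for demonstration only.

```python
# Toy sketch of multimodal fusion with mixup, assuming PyTorch.
# The real PolyModNet uses pretrained RoBERTa-BERT and ViT encoders;
# the small embeddings below are illustrative stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mixup(x, y, alpha=0.4):
    """Mixup: convex-combine a batch with a shuffled copy of itself."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[idx]
    return x_mix, y, y[idx], lam  # loss = lam*L(y_a) + (1-lam)*L(y_b)

class ToyMultimodalFusion(nn.Module):
    def __init__(self, vocab=30522, d=256, text_len=64, n_patches=196):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, d)
        self.txt_pos = nn.Parameter(torch.zeros(1, text_len, d))   # learned positional encoding (text)
        self.patch_proj = nn.Linear(16 * 16 * 3, d)                 # ViT-style flattened-patch embedding
        self.img_pos = nn.Parameter(torch.zeros(1, n_patches, d))  # learned positional encoding (patches)
        layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)    # joint encoder over both modalities
        self.head = nn.Linear(d, 2)

    def forward(self, token_ids, patches):
        t = self.tok_emb(token_ids) + self.txt_pos[:, : token_ids.size(1)]
        v = self.patch_proj(patches) + self.img_pos[:, : patches.size(1)]
        fused = self.fusion(torch.cat([t, v], dim=1))               # cross-modal self-attention
        return self.head(fused.mean(dim=1))

# Smoke test on random data.
model = ToyMultimodalFusion()
tokens = torch.randint(0, 30522, (4, 64))
patches = torch.randn(4, 196, 16 * 16 * 3)
labels = torch.randint(0, 2, (4,))
patches_mix, y_a, y_b, lam = mixup(patches, labels)
logits = model(tokens, patches_mix)
loss = lam * F.cross_entropy(logits, y_a) + (1 - lam) * F.cross_entropy(logits, y_b)
print(logits.shape, loss.item())
```

Mixup is applied here to the continuous patch tensor, where convex combination is well defined; a gridmask-style augmentation would instead zero out a regular grid of patch regions before projection.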