Evaluating & Enhancing Marathi Sentence Similarity

An interactive exploration of adapting AI for a low-resource language.

Project Summary

Natural Language Processing (NLP) has made incredible strides, but most of those advances have focused on high-resource languages like English. This project addresses the challenge of building effective tools for Marathi, a language spoken by over 99 million people. The goal was to find and enhance the best AI model for understanding semantic similarity between Marathi sentences. This interactive report walks you through the process, from benchmarking existing models to fine-tuning a champion and testing its resilience.

At a glance: 6 models benchmarked, 92% accuracy, 0.98 final Pearson correlation.

The Research Journey

This research followed a structured, three-phase approach to systematically identify and improve upon the best model for the task. The three phases below outline the steps explored in detail throughout this report.

1. Benchmark: Compared six pre-trained models to find the best baseline.
2. Enhance: Fine-tuned the top model on a larger Marathi dataset.
3. Test Robustness: Evaluated performance on grammatically flawed text.

Phase 1: Finding the Best Baseline Model

The first step was to establish a performance baseline. We evaluated six different pre-trained transformer models on a standard set of 200 human-annotated Marathi sentence pairs. The models included both multilingual options and one specifically pre-trained on Marathi text (L3Cube-MahaBERT). The chart below shows their performance across different metrics. Use the buttons to switch between Pearson Correlation (higher is better), Mean Squared Error (lower is better), and Accuracy.
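
As a concrete illustration of this benchmarking loop, here is a minimal sketch assuming the sentence-transformers and scipy libraries; the model checkpoint, the two example pairs, and the benchmark() helper are illustrative stand-ins, not the study's exact data or code.

```python
# A rough sketch of the benchmarking step: score a candidate model's cosine
# similarities against human judgments. Data and checkpoint are illustrative.
from scipy.stats import pearsonr
from sentence_transformers import SentenceTransformer, util

# Hypothetical evaluation pairs: (sentence 1, sentence 2, human score in [0, 1]).
PAIRS = [
    ("ती शाळेत जाते.", "ती विद्यालयात जाते.", 0.9),
    ("आज पाऊस पडत आहे.", "तो क्रिकेट खेळतो.", 0.1),
]

def benchmark(model_name: str) -> float:
    """Pearson correlation between model cosine similarities and human scores."""
    model = SentenceTransformer(model_name)
    preds = [util.cos_sim(model.encode(s1), model.encode(s2)).item()
             for s1, s2, _ in PAIRS]
    golds = [gold for _, _, gold in PAIRS]
    return pearsonr(preds, golds)[0]

# One public L3Cube checkpoint; the exact models used in the study may differ.
print(benchmark("l3cube-pune/marathi-sentence-bert-nli"))
```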

The results clearly show that L3Cube-MahaBERT, the monolingual Marathi model, significantly outperforms the multilingual models, achieving the highest correlation with human judgments.

Phase 2: Enhancing the Champion with Fine-Tuning

After identifying L3Cube-MahaBERT as the strongest baseline model, the next step was to enhance its performance further. We fine-tuned the model on a larger dataset of approximately 5,700 Marathi sentence pairs. This process adapts the model's general language understanding to the specific task of semantic similarity. The chart below illustrates the significant performance improvement across all six key evaluation metrics after this fine-tuning process.
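
A minimal sketch of that fine-tuning step, following the Sentence-BERT (SBERT) recipe the Research Insights section credits: the checkpoint name, example pairs, output path, and hyperparameters below are illustrative assumptions, and the real training set would hold roughly 5,700 annotated pairs.

```python
# A minimal SBERT-style fine-tuning sketch; checkpoint, data, and
# hyperparameters are illustrative assumptions.
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("l3cube-pune/marathi-sentence-bert-nli")  # assumed baseline

# Stand-ins for the ~5,700 human-annotated Marathi pairs used in the study.
train_examples = [
    InputExample(texts=["ती शाळेत जाते.", "ती विद्यालयात जाते."], label=0.9),
    InputExample(texts=["आज पाऊस पडत आहे.", "तो क्रिकेट खेळतो."], label=0.1),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Pull the cosine similarity of each pair's embeddings toward its human score.
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_loader, train_loss)], epochs=4, warmup_steps=100)
model.save("mahabert-sts-finetuned")  # hypothetical output path
```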

Phase 3: The Ultimate Test of Robustness

A good model should not only be accurate but also robust. Real-world text is often imperfect, containing typos or grammatical errors. To test this, we evaluated the baseline L3Cube model and our new fine-tuned version on three different datasets: a clean one, one with basic grammatical errors, and one with more advanced errors. The results demonstrate that fine-tuning not only boosts accuracy on clean data but dramatically improves the model's resilience to noisy, imperfect input.
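
A sketch of how such a comparison might be scripted: the same Pearson-correlation scoring is run for both models over clean and error-injected splits. The tiny inline splits, checkpoint names, and pearson_on() helper are illustrative only.

```python
# Robustness comparison sketch: score baseline and fine-tuned checkpoints on
# clean, basic-error, and advanced-error splits. All names are assumptions.
from scipy.stats import pearsonr
from sentence_transformers import SentenceTransformer, util

def pearson_on(model, pairs):
    """pairs: (sentence 1, sentence 2, human score) tuples."""
    preds = [util.cos_sim(model.encode(s1), model.encode(s2)).item()
             for s1, s2, _ in pairs]
    return pearsonr(preds, [g for _, _, g in pairs])[0]

# Tiny stand-in splits; the real datasets hold many annotated pairs each.
SPLITS = {
    "clean": [("ती शाळेत जाते.", "ती विद्यालयात जाते.", 0.9),
              ("आज पाऊस पडत आहे.", "तो क्रिकेट खेळतो.", 0.1)],
    "basic_errors": [("ती शाळेत जाते", "ती विद्यालयत जाते.", 0.9),       # typos injected
                     ("आज पाउस पडत आहे.", "तो क्रिकेट खेळतो.", 0.1)],
    "advanced_errors": [("ती शाळा जाते.", "ती विद्यालयात जातो.", 0.9),    # agreement errors
                        ("आज पाऊस पडत.", "तो क्रिकेट खेळते.", 0.1)],
}

for label, ckpt in [("baseline", "l3cube-pune/marathi-sentence-bert-nli"),
                    ("fine-tuned", "mahabert-sts-finetuned")]:
    model = SentenceTransformer(ckpt)
    for split, pairs in SPLITS.items():
        print(f"{label} on {split}: r = {pearson_on(model, pairs):.3f}")
```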

Key Features of our Marathi NLP Project - Sentence Similarity Analysis

Our project focuses on advancing Natural Language Processing for Marathi, a low-resource language. We address the crucial task of sentence similarity detection using state-of-the-art transformer models. Key features include:

Systematic Model Evaluation

We benchmarked six diverse transformer models, including the Marathi-specific L3Cube-MahaBERT and several multilingual options, to identify the most effective baseline.

Performance Enhancement through Fine-tuning

We significantly improved the top-performing model by fine-tuning it on a large Marathi sentence pair dataset, demonstrating substantial gains in accuracy and correlation.

Robustness to Real-world Noise

A unique aspect of our research is the rigorous testing of models on grammatically erroneous datasets, demonstrating the fine-tuned model's superior resilience to imperfect text inputs.

Practical Implications

Our findings provide a clear roadmap for developing high-accuracy and robust NLP tools for Marathi and other low-resource languages, contributing to broader AI inclusivity.

Research Insights

The study yielded several critical insights into effective NLP development for low-resource languages:

Monolingual Models Excel

Language-specific pre-training (e.g., L3Cube-MahaBERT) is crucial for capturing the semantic nuances of Marathi; the monolingual model outperformed general multilingual models that lack such targeted pre-training.

Fine-tuning is Transformative

Task-specific fine-tuning, even on relatively smaller datasets, dramatically boosts performance, making models highly accurate and reliable for real-world applications.

Robustness is Key

Fine-tuning not only improves accuracy on clean data but also significantly enhances a model's ability to handle noisy, grammatically incorrect text, a common challenge in practical scenarios.

SBERT Paradigm & Data Quality

The Sentence-BERT (SBERT) fine-tuning paradigm drives much of the gain, and careful curation and expansion of human-annotated datasets are foundational to successful model development.

Practical Applications in Real-World Scenarios

This Marathi Sentence Similarity project has several practical applications in real-world scenarios, especially given its focus on a low-resource language and its robustness to errors. These applications can significantly enhance various Marathi-language services and tools:

Improved Search & Information Retrieval

Enhances search engines and databases for Marathi content by understanding the semantic meaning of queries, rather than just keywords. This leads to more accurate and relevant search results, even if the exact words don't match.
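
As a sketch of how the model could back such a search feature, assuming the sentence-transformers semantic_search utility; the corpus, query, and checkpoint name below are illustrative.

```python
# Semantic-search sketch: rank Marathi documents by embedding similarity
# rather than shared keywords. Corpus, query, and checkpoint are assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("mahabert-sts-finetuned")  # assumed local checkpoint
corpus = [
    "पुण्यात आज जोरदार पाऊस झाला.",
    "भारताने क्रिकेट सामना जिंकला.",
    "शेअर बाजारात मोठी घसरण झाली.",
]
corpus_emb = model.encode(corpus, convert_to_tensor=True)

query_emb = model.encode("आज पावसाची बातमी काय आहे?", convert_to_tensor=True)
# Returns the top matches by cosine similarity, even with no keyword overlap.
for hit in util.semantic_search(query_emb, corpus_emb, top_k=2)[0]:
    print(f"{hit['score']:.2f}  {corpus[hit['corpus_id']]}")
```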

Enhanced Chatbots & Virtual Assistants

Enables more natural interactions with chatbots and virtual assistants for Marathi speakers, improving user experience in customer service, educational platforms, and general information retrieval.

Content Moderation & Analysis

Useful for identifying duplicate content, detecting plagiarism, and flagging inappropriate or abusive texts in Marathi, even if rephrased. This aids in maintaining healthier online environments.
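
One hedged sketch of duplicate detection, using the paraphrase-mining utility from sentence-transformers; the posts, similarity threshold, and checkpoint name are illustrative assumptions.

```python
# Near-duplicate detection sketch for moderation: flag rephrased posts whose
# embeddings score above a similarity threshold. All names are assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("mahabert-sts-finetuned")  # assumed checkpoint
posts = [
    "हा चित्रपट खूप छान आहे.",
    "चित्रपट फारच सुंदर आहे.",
    "उद्या सुट्टी आहे.",
]
# Scores all post pairs and returns (score, i, j) triples, highest first.
for score, i, j in util.paraphrase_mining(model, posts):
    if score > 0.8:  # tunable duplicate threshold
        print(f"possible duplicate ({score:.2f}): {posts[i]} ~ {posts[j]}")
```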

Education & Language Learning

Can be used to develop tools that assess understanding in Marathi by comparing student answers to correct ones, or to provide feedback on sentence construction for language learners, facilitating better learning outcomes.

Key Conclusion

This research demonstrates a clear and effective path for developing NLP tools for low-resource languages like Marathi. The most successful strategy is to start with a language-specific pre-trained model and then fine-tune it on a task-specific dataset. This approach yields a model that is not only highly accurate but also robust enough to handle the complexities of real-world text, paving the way for more inclusive and capable AI.

Interactive Report created from the research by Sangam Sanjay Bhamare.