Indic-CLIP: Multimodal Understanding for Indic Languages
Indic languages, despite their cultural richness and widespread usage, have long been underserved in the field of multimodal AI. Inspired by the success of OpenAI’s CLIP (Contrastive Language-Image Pre-training), I aimed to adapt this powerful architecture to bridge the multimodal understanding gap specifically for Hindi, with an outlook towards Sanskrit. This project demonstrates how vision-language models can effectively associate images with relevant textual descriptions, opening pathways for applications such as cross-modal search and zero-shot image classification tailored to the Indian cultural context.
- Dataset Acquisition and Preprocessing
- Model Architecture
- Training Approach
- Evaluation Methods
- Results and Demonstration
- Future Directions and Limitations
- References
The core objective was straightforward yet ambitious: develop a foundational vision-language model leveraging the fast.ai framework. I utilized the Flickr8k-Hindi dataset, which contains approximately 8,000 images each paired with multiple Hindi captions, as my primary data source due to its accessibility and structure.
I set up scripts to efficiently download and preprocess the Flickr8k-Hindi dataset from Kaggle (images: adityajn105/flickr8k, captions: dsmeena/flickr8k-hindi-captions). The preprocessing pipeline included:
- Associating image filenames with their respective Hindi captions.
- Filtering out corrupted or missing images.
- Ensuring images met specified resolution and aspect ratio standards.
- Removing captions that were either too short or excessively long.
- Deduplicating image-caption pairs through perceptual hashing and exact caption matching.
The processed data was stored in JSONL format (filtered_data.jsonl) to seamlessly integrate with fast.ai’s DataBlock API.
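A minimal sketch of the filtering and deduplication step is given below. The thresholds, the `keep_pair`/`build_jsonl` helper names, and the use of the `imagehash` library for perceptual hashing are illustrative assumptions rather than the project's exact code; the caller is expected to supply (image path, Hindi caption) pairs parsed from the Kaggle caption file.

```python
import json
from pathlib import Path

from PIL import Image
import imagehash  # assumed here for perceptual hashing of near-duplicate images

# Illustrative thresholds; not the project's actual configuration.
MIN_SIDE, MIN_WORDS, MAX_WORDS = 224, 3, 60

def keep_pair(img_path: Path, caption: str) -> bool:
    """Drop corrupted or undersized images and captions that are too short or too long."""
    try:
        with Image.open(img_path) as im:
            im.verify()                      # cheap integrity check; raises on corrupted files
        with Image.open(img_path) as im:
            w, h = im.size
    except Exception:
        return False
    n_words = len(caption.split())
    return min(w, h) >= MIN_SIDE and MIN_WORDS <= n_words <= MAX_WORDS

def build_jsonl(pairs, out_path="filtered_data.jsonl"):
    """`pairs` is an iterable of (image_path, hindi_caption) tuples from the caption file."""
    seen = set()
    with open(out_path, "w", encoding="utf-8") as f:
        for img_path, caption in pairs:
            img_path = Path(img_path)
            if not keep_pair(img_path, caption):
                continue
            # Deduplicate on (perceptual image hash, exact caption text).
            with Image.open(img_path) as im:
                key = (str(imagehash.phash(im)), caption)
            if key in seen:
                continue
            seen.add(key)
            f.write(json.dumps({"image": img_path.name, "caption": caption},
                               ensure_ascii=False) + "\n")
```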
Challenges arose primarily in extending this pipeline to Sanskrit due to the complexity of manuscript processing and data scarcity, highlighting an area for future exploration.
The Indic-CLIP model was constructed using fast.ai, integrating components from timm and transformers. Key architectural elements included:
- Vision Encoder: Leveraged a pre-trained ResNet50 model (timm) initialized with ImageNet weights to extract visual features.
- Text Encoder: Utilized the ai4bharat/indic-bert model from Hugging Face, together with its dedicated tokenizer, to generate contextual embeddings from Hindi captions.
- Projection Layers: Added linear projections to map both visual and textual embeddings into a common 512-dimensional space.
- Contrastive Learning: Implemented a learnable temperature parameter (logit_scale) that scales the image-text similarity logits during training.
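A condensed sketch of this dual-encoder design is shown below. The mask-aware mean pooling of IndicBERT token embeddings and the temperature initialization at ln(1/0.07) are assumptions borrowed from the original CLIP recipe, not confirmed details of this implementation.

```python
import timm
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Tokenizer that turns Hindi captions into input_ids / attention_mask tensors.
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")

class IndicCLIP(nn.Module):
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        # Vision encoder: ImageNet-pretrained ResNet50 with its classifier head removed.
        self.vision = timm.create_model("resnet50", pretrained=True, num_classes=0)
        # Text encoder: IndicBERT (ALBERT-style) from AI4Bharat.
        self.text = AutoModel.from_pretrained("ai4bharat/indic-bert")
        # Linear projections into the shared 512-dimensional embedding space.
        self.image_proj = nn.Linear(self.vision.num_features, embed_dim)
        self.text_proj = nn.Linear(self.text.config.hidden_size, embed_dim)
        # Learnable temperature, initialized to ln(1/0.07) as in CLIP (assumed).
        self.logit_scale = nn.Parameter(torch.ones([]) * 2.6593)

    def encode_image(self, images):
        return self.image_proj(self.vision(images))

    def encode_text(self, input_ids, attention_mask):
        out = self.text(input_ids=input_ids, attention_mask=attention_mask)
        # Mask-aware mean pooling over token embeddings (an assumed pooling choice).
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        return self.text_proj(pooled)

    def forward(self, images, input_ids, attention_mask):
        return self.encode_image(images), self.encode_text(input_ids, attention_mask)
```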
Training was performed with a symmetric InfoNCE contrastive loss that pulls matching image-caption pairs together while pushing apart non-matching pairs within each batch; a minimal sketch of this loss follows the list below. The process included:
- Orchestrating training loops through fast.ai's Learner API.
- Employing the AdamW optimizer combined with fast.ai's fit_one_cycle learning rate strategy.
- Integrating optional techniques such as Automatic Mixed Precision (AMP) and Gradient Accumulation for efficient resource use.
- Utilizing Weights & Biases (wandb) for comprehensive tracking of metrics and training progress.
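The loss referenced above reduces to a pair of cross-entropies over the batch similarity matrix. The sketch below shows that loss plus one assumed way of wiring it into fast.ai's Learner; the `dls` and `model` objects, the loss-wrapper signature, and the OptimWrapper-around-AdamW pattern are illustrative rather than the project's exact code.

```python
from functools import partial

import torch
import torch.nn.functional as F
from fastai.vision.all import Learner, OptimWrapper, GradientAccumulation
from fastai.callback.wandb import WandbCallback

def clip_loss(image_emb, text_emb, logit_scale):
    """Symmetric InfoNCE over one batch: matching (i, i) pairs are the positive targets."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale.exp() * image_emb @ text_emb.t()     # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +                # image -> text direction
            F.cross_entropy(logits.t(), targets)) / 2         # text -> image direction

# Assumes `dls` was built from filtered_data.jsonl via the DataBlock API and `model`
# is the IndicCLIP module above, whose forward() returns (image_emb, text_emb).
def contrastive_loss(preds, *_):
    image_emb, text_emb = preds
    return clip_loss(image_emb, text_emb, model.logit_scale)

learn = Learner(
    dls, model,
    loss_func=contrastive_loss,
    opt_func=partial(OptimWrapper, opt=torch.optim.AdamW),    # AdamW via fastai's wrapper
    cbs=[WandbCallback(), GradientAccumulation(n_acc=4)],     # optional logging / accumulation
).to_fp16()                                                   # optional AMP

learn.fit_one_cycle(10, lr_max=1e-4)                          # one-cycle LR schedule
```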
Model evaluation focused on two core tasks:
- Cross-Modal Retrieval: Measuring performance via Recall@k (R@1, R@5, R@10) and Mean Recall (MR), assessing how accurately the model retrieves the correct text given an image (I2T) and vice versa (T2I).
- Zero-Shot Classification: Evaluating the model's ability to classify unseen categories using generated textual prompts, with performance measured via Top-1 accuracy.
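Both tasks reduce to operations on the cosine-similarity matrix between image and text embeddings. The sketch below assumes one ground-truth caption per image stored at the same row index; Flickr8k's five-captions-per-image setup would need an index map on top of this, and the MR computed here is simply the mean of the per-direction R@k values.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def recall_at_k(image_emb, text_emb, ks=(1, 5, 10)):
    """Image-to-text Recall@k; swap the arguments for text-to-image retrieval."""
    sims = F.normalize(image_emb, dim=-1) @ F.normalize(text_emb, dim=-1).t()
    ranked = sims.argsort(dim=-1, descending=True)              # captions ranked per image
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(1)
    scores = {f"R@{k}": (ranked[:, :k] == targets).any(dim=-1).float().mean().item()
              for k in ks}
    scores["MR"] = sum(scores.values()) / len(ks)               # mean of the R@k values
    return scores

@torch.no_grad()
def zero_shot_top1(image_emb, prompt_emb, labels):
    """Top-1 accuracy: assign each image the class whose prompt embedding is most similar."""
    sims = F.normalize(image_emb, dim=-1) @ F.normalize(prompt_emb, dim=-1).t()
    return (sims.argmax(dim=-1) == labels).float().mean().item()
```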
The Indic-CLIP model successfully demonstrated effective multimodal understanding on the Flickr8k-Hindi dataset. To showcase its capabilities, I developed an interactive application using Gradio, hosted on Hugging Face Spaces. Users can now explore functionalities such as image-to-text retrieval, text-to-image retrieval, and zero-shot classification interactively.
Demo available here:
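For reference, a stripped-down sketch of how a text-to-image retrieval tab could be wired up in Gradio is shown below. The precomputed `gallery_paths`/`gallery_emb` artifacts and the reuse of the `model` and `tokenizer` objects from the architecture sketch are assumptions; the hosted Space's actual code will differ.

```python
import gradio as gr
import torch
import torch.nn.functional as F

# Hypothetical precomputed artifacts: gallery image paths and their Indic-CLIP embeddings.
gallery_paths = [...]                               # list of image file paths
gallery_emb = torch.load("image_embeddings.pt")     # tensor of shape (N, 512), hypothetical file

@torch.no_grad()
def t2i_search(query: str, k):
    """Encode a Hindi query and return the k most similar gallery images."""
    tokens = tokenizer(query, return_tensors="pt", padding=True, truncation=True)
    q = model.encode_text(tokens["input_ids"], tokens["attention_mask"])
    sims = F.normalize(q, dim=-1) @ F.normalize(gallery_emb, dim=-1).t()
    top = sims.squeeze(0).topk(int(k)).indices.tolist()
    return [gallery_paths[i] for i in top]

demo = gr.Interface(
    fn=t2i_search,
    inputs=[gr.Textbox(label="Hindi query"),
            gr.Slider(1, 10, value=5, step=1, label="Top-k")],
    outputs=gr.Gallery(label="Closest images"),
    title="Indic-CLIP text-to-image retrieval (sketch)",
)

if __name__ == "__main__":
    demo.launch()
```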
Significant potential exists for expanding this framework through:
- Incorporating larger, more diverse datasets including web-scraped content and specialized synthetic datasets like IndicTTI.
- Improving Sanskrit data processing techniques, particularly advanced tokenization and Sandhi splitting.
- Scaling model complexity by experimenting with more advanced and larger model backbones.
- Extending the approach to support additional Indic languages.
This project establishes a robust foundation and working pipeline for future multimodal research in Indic languages, validated through an accessible and practical demonstration.
References
- Kakwani, D., Kunchukuttan, A., Golla, S., N.C., G., Bhattacharyya, A., Khapra, M. M., & Kumar, P. (2020). IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages. In Findings of EMNLP.
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770-778.