Indic-CLIP: Multimodal Understanding for Indic Languages
Indic languages, despite their cultural richness and widespread usage, have long been underserved in the field of multimodal AI. Inspired by the success of OpenAI’s CLIP (Contrastive Language-Image Pre-training), I aimed to adapt this powerful architecture to bridge the multimodal understanding gap specifically for Hindi, with an outlook towards Sanskrit. This project demonstrates how vision-language models can effectively associate images with relevant textual descriptions, opening pathways for applications such as cross-modal search and zero-shot image classification tailored to the Indian cultural context.
- Dataset Acquisition and Preprocessing
- Model Architecture
- Training Approach
- Evaluation Methods
- Results and Demonstration
- Future Directions and Limitations
- References
The core objective was straightforward yet ambitious: develop a foundational vision-language model leveraging the fast.ai framework. I utilized the Flickr8k-Hindi dataset, which contains approximately 8,000 images each paired with multiple Hindi captions, as my primary data source due to its accessibility and structure.
I set up scripts to efficiently download and preprocess the Flickr8k-Hindi dataset from Kaggle (images: adityajn105/flickr8k, captions: dsmeena/flickr8k-hindi-captions). The preprocessing pipeline included:
- Associating image filenames with their respective Hindi captions.
- Filtering out corrupted or missing images.
- Ensuring images met specified resolution and aspect ratio standards.
- Removing captions that were either too short or excessively long.
- Deduplicating image-caption pairs through perceptual hashing and exact caption matching.
The processed data was stored in JSONL format (filtered_data.jsonl) to seamlessly integrate with fast.ai’s DataBlock API.
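A minimal sketch of the filtering and deduplication step is given below. The thresholds, the `keep_pair`/`build_jsonl` helper names, and the use of the `imagehash` library for perceptual hashing are illustrative assumptions rather than the project's exact code; the caller is expected to supply (image path, Hindi caption) pairs parsed from the Kaggle caption file.

```python
import json
from pathlib import Path

from PIL import Image
import imagehash  # assumed here for perceptual hashing of near-duplicate images

# Illustrative thresholds; not the project's actual configuration.
MIN_SIDE, MIN_WORDS, MAX_WORDS = 224, 3, 60

def keep_pair(img_path: Path, caption: str) -> bool:
    """Drop corrupted or undersized images and captions that are too short or too long."""
    try:
        with Image.open(img_path) as im:
            im.verify()                      # cheap integrity check; raises on corrupted files
        with Image.open(img_path) as im:
            w, h = im.size
    except Exception:
        return False
    n_words = len(caption.split())
    return min(w, h) >= MIN_SIDE and MIN_WORDS <= n_words <= MAX_WORDS

def build_jsonl(pairs, out_path="filtered_data.jsonl"):
    """`pairs` is an iterable of (image_path, hindi_caption) tuples from the caption file."""
    seen = set()
    with open(out_path, "w", encoding="utf-8") as f:
        for img_path, caption in pairs:
            img_path = Path(img_path)
            if not keep_pair(img_path, caption):
                continue
            # Deduplicate on (perceptual image hash, exact caption text).
            with Image.open(img_path) as im:
                key = (str(imagehash.phash(im)), caption)
            if key in seen:
                continue
            seen.add(key)
            f.write(json.dumps({"image": img_path.name, "caption": caption},
                               ensure_ascii=False) + "\n")
```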
Challenges arose primarily in extending this pipeline to Sanskrit due to the complexity of manuscript processing and data scarcity, highlighting an area for future exploration.
The Indic-CLIP model was constructed using fast.ai, integrating components from timm and transformers. Key architectural elements included:
- Vision Encoder: Leveraged a pre-trained ResNet50 model (timm) initialized with ImageNet weights to extract visual features.
- Text Encoder: Utilized the ai4bharat/indic-bert model from Hugging Face, together with its dedicated tokenizer, to generate contextual embeddings from Hindi captions.
- Projection Layers: Added linear projections to map both visual and textual embeddings into a common 512-dimensional space.
- Contrastive Learning: Implemented a learnable temperature parameter (logit_scale) that scales the image-text similarity logits during training.
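A condensed sketch of this dual-encoder design is shown below. The mask-aware mean pooling of IndicBERT token embeddings and the temperature initialization at ln(1/0.07) are assumptions borrowed from the original CLIP recipe, not confirmed details of this implementation.

```python
import timm
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Tokenizer that turns Hindi captions into input_ids / attention_mask tensors.
tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")

class IndicCLIP(nn.Module):
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        # Vision encoder: ImageNet-pretrained ResNet50 with its classifier head removed.
        self.vision = timm.create_model("resnet50", pretrained=True, num_classes=0)
        # Text encoder: IndicBERT (ALBERT-style) from AI4Bharat.
        self.text = AutoModel.from_pretrained("ai4bharat/indic-bert")
        # Linear projections into the shared 512-dimensional embedding space.
        self.image_proj = nn.Linear(self.vision.num_features, embed_dim)
        self.text_proj = nn.Linear(self.text.config.hidden_size, embed_dim)
        # Learnable temperature, initialized to ln(1/0.07) as in CLIP (assumed).
        self.logit_scale = nn.Parameter(torch.ones([]) * 2.6593)

    def encode_image(self, images):
        return self.image_proj(self.vision(images))

    def encode_text(self, input_ids, attention_mask):
        out = self.text(input_ids=input_ids, attention_mask=attention_mask)
        # Mask-aware mean pooling over token embeddings (an assumed pooling choice).
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        return self.text_proj(pooled)

    def forward(self, images, input_ids, attention_mask):
        return self.encode_image(images), self.encode_text(input_ids, attention_mask)
```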
Training was performed with a symmetric InfoNCE contrastive loss that pulls matching image-caption pairs together while pushing apart non-matching pairs within each batch; a minimal sketch of this loss follows the list below. The process included:
- Orchestrating training loops through fast.ai's Learner API.
- Employing the AdamW optimizer combined with fast.ai's fit_one_cycle learning rate strategy.
- Integrating optional techniques such as Automatic Mixed Precision (AMP) and Gradient Accumulation for efficient resource use.
- Utilizing Weights & Biases (wandb) for comprehensive tracking of metrics and training progress.
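The loss referenced above reduces to a pair of cross-entropies over the batch similarity matrix. The sketch below shows that loss plus one assumed way of wiring it into fast.ai's Learner; the `dls` and `model` objects, the loss-wrapper signature, and the OptimWrapper-around-AdamW pattern are illustrative rather than the project's exact code.

```python
from functools import partial

import torch
import torch.nn.functional as F
from fastai.vision.all import Learner, OptimWrapper, GradientAccumulation
from fastai.callback.wandb import WandbCallback

def clip_loss(image_emb, text_emb, logit_scale):
    """Symmetric InfoNCE over one batch: matching (i, i) pairs are the positive targets."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale.exp() * image_emb @ text_emb.t()     # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +                # image -> text direction
            F.cross_entropy(logits.t(), targets)) / 2         # text -> image direction

# Assumes `dls` was built from filtered_data.jsonl via the DataBlock API and `model`
# is the IndicCLIP module above, whose forward() returns (image_emb, text_emb).
def contrastive_loss(preds, *_):
    image_emb, text_emb = preds
    return clip_loss(image_emb, text_emb, model.logit_scale)

learn = Learner(
    dls, model,
    loss_func=contrastive_loss,
    opt_func=partial(OptimWrapper, opt=torch.optim.AdamW),    # AdamW via fastai's wrapper
    cbs=[WandbCallback(), GradientAccumulation(n_acc=4)],     # optional logging / accumulation
).to_fp16()                                                   # optional AMP

learn.fit_one_cycle(10, lr_max=1e-4)                          # one-cycle LR schedule
```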
Model evaluation focused on two core tasks:
- Cross-Modal Retrieval: Measuring performance via Recall@k (R@1, R@5, R@10) and Mean Recall (MR), assessing how accurately the model retrieves the correct text given an image (I2T) and vice versa (T2I).
- Zero-Shot Classification: Evaluating the model's ability to classify unseen categories using generated textual prompts, with performance measured via Top-1 accuracy.
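Both tasks reduce to operations on the cosine-similarity matrix between image and text embeddings. The sketch below assumes one ground-truth caption per image stored at the same row index; Flickr8k's five-captions-per-image setup would need an index map on top of this, and the MR computed here is simply the mean of the per-direction R@k values.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def recall_at_k(image_emb, text_emb, ks=(1, 5, 10)):
    """Image-to-text Recall@k; swap the arguments for text-to-image retrieval."""
    sims = F.normalize(image_emb, dim=-1) @ F.normalize(text_emb, dim=-1).t()
    ranked = sims.argsort(dim=-1, descending=True)              # captions ranked per image
    targets = torch.arange(sims.size(0), device=sims.device).unsqueeze(1)
    scores = {f"R@{k}": (ranked[:, :k] == targets).any(dim=-1).float().mean().item()
              for k in ks}
    scores["MR"] = sum(scores.values()) / len(ks)               # mean of the R@k values
    return scores

@torch.no_grad()
def zero_shot_top1(image_emb, prompt_emb, labels):
    """Top-1 accuracy: assign each image the class whose prompt embedding is most similar."""
    sims = F.normalize(image_emb, dim=-1) @ F.normalize(prompt_emb, dim=-1).t()
    return (sims.argmax(dim=-1) == labels).float().mean().item()
```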
The Indic-CLIP model successfully demonstrated effective multimodal understanding on the Flickr8k-Hindi dataset. To showcase its capabilities, I developed an interactive application using Gradio, hosted on Hugging Face Spaces. Users can now explore functionalities such as image-to-text retrieval, text-to-image retrieval, and zero-shot classification interactively.
Demo available here:
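For reference, a stripped-down sketch of how a text-to-image retrieval tab could be wired up in Gradio is shown below. The precomputed `gallery_paths`/`gallery_emb` artifacts and the reuse of the `model` and `tokenizer` objects from the architecture sketch are assumptions; the hosted Space's actual code will differ.

```python
import gradio as gr
import torch
import torch.nn.functional as F

# Hypothetical precomputed artifacts: gallery image paths and their Indic-CLIP embeddings.
gallery_paths = [...]                               # list of image file paths
gallery_emb = torch.load("image_embeddings.pt")     # tensor of shape (N, 512), hypothetical file

@torch.no_grad()
def t2i_search(query: str, k):
    """Encode a Hindi query and return the k most similar gallery images."""
    tokens = tokenizer(query, return_tensors="pt", padding=True, truncation=True)
    q = model.encode_text(tokens["input_ids"], tokens["attention_mask"])
    sims = F.normalize(q, dim=-1) @ F.normalize(gallery_emb, dim=-1).t()
    top = sims.squeeze(0).topk(int(k)).indices.tolist()
    return [gallery_paths[i] for i in top]

demo = gr.Interface(
    fn=t2i_search,
    inputs=[gr.Textbox(label="Hindi query"),
            gr.Slider(1, 10, value=5, step=1, label="Top-k")],
    outputs=gr.Gallery(label="Closest images"),
    title="Indic-CLIP text-to-image retrieval (sketch)",
)

if __name__ == "__main__":
    demo.launch()
```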
Significant potential exists for expanding this framework through:
- Incorporating larger, more diverse datasets including web-scraped content and specialized synthetic datasets like IndicTTI.
- Improving Sanskrit data processing techniques, particularly advanced tokenization and Sandhi splitting.
- Scaling model complexity by experimenting with more advanced and larger model backbones.
- Extending the approach to support additional Indic languages.
This project establishes a robust foundation and working pipeline for future multimodal research in Indic languages, validated through an accessible and practical demonstration.
References
- Kakwani, D., Kunchukuttan, A., Golla, S., N.C., G., Bhattacharyya, A., Khapra, M. M., & Kumar, P. (2020). IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages. In Findings of EMNLP.
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770-778.