Indic languages, despite their cultural richness and widespread use, have long been underserved in multimodal AI. Inspired by the success of OpenAI’s CLIP (Contrastive Language-Image Pre-training), I set out to adapt this architecture to Hindi, with an eye toward extending it to Sanskrit. This project demonstrates how vision-language models can associate images with relevant textual descriptions, opening pathways to applications such as cross-modal search and zero-shot image classification tailored to the Indian cultural context.
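To give a flavor of the training objective behind this, here is a minimal PyTorch sketch of a CLIP-style symmetric contrastive loss. This is an illustration of the technique, not the project's actual training code; the function and argument names are mine.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) outputs of the two encoders;
    matching image/caption pairs share a row index, so the correct
    targets lie on the diagonal of the similarity matrix.
    """
    # Normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.t() / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image -> text and text -> image.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```

Each image embedding is pulled toward its paired caption embedding and pushed away from every other caption in the batch, which is what lets the model associate images with Hindi text.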
This is the first part of a multi-part series diving into the theoretical underpinnings of transformer models. In this installment, we'll explore the concepts of attention paths and rank collapse, two fundamental ideas that help explain how transformers actually work under the hood.
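As a small preview of the second idea, here is a tiny numerical sketch of rank collapse (my illustration, not code from the series): stacking pure self-attention layers, with the skip connections and MLPs that normally counteract the effect removed, and with random row-stochastic matrices standing in for learned attention weights, drives every token representation toward one shared vector, i.e. a rank-1 matrix.

```python
import torch

torch.manual_seed(0)
seq_len, dim, depth = 8, 16, 30
X = torch.randn(seq_len, dim)

for _ in range(depth):
    # A pure attention layer is a row-stochastic mixing of the tokens.
    A = torch.softmax(torch.randn(seq_len, seq_len), dim=-1)
    X = A @ X

# All rows converge toward a common vector, so X approaches rank 1:
# the deviation from the shared (mean) row shrinks layer by layer.
deviation = ((X - X.mean(dim=0, keepdim=True)).norm() / X.norm()).item()
print(f"relative deviation from rank-1 after {depth} layers: {deviation:.2e}")
```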
A tour of transformer position encodings, from absolute sinusoidal methods through the T5 and XLNet innovations to Rotary Position Embedding (RoPE), showing how complex exponentials and rotational geometry elegantly solve the challenge of encoding token positions in a way that preserves critical relative relationships.
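For RoPE specifically, the core operation is compact enough to sketch directly: each consecutive pair of query/key features is rotated by an angle proportional to the token's position, so the query-key dot product ends up depending only on the relative offset. A minimal sketch follows; the function name and pairing convention are illustrative, not the canonical implementation.

```python
import torch

def apply_rope(x, positions, base=10000.0):
    """Rotate consecutive feature pairs by position-dependent angles.

    x: (seq_len, dim) queries or keys, with dim even.
    positions: (seq_len,) integer token positions.
    After rotation, <q_m, k_n> depends only on the offset m - n,
    since R(m*w)^T R(n*w) = R((n - m)*w) for 2D rotations.
    """
    _, dim = x.shape
    half = dim // 2
    # One frequency per feature pair, following the sinusoidal schedule.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = positions[:, None].float() * freqs[None, :]  # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()

    x1, x2 = x[:, 0::2], x[:, 1::2]  # the two halves of each pair
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

In practice the rotation is applied to both queries and keys just before the attention dot product; nothing positional is ever added to the token embeddings themselves.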
Maze solving has been a classic computer science problem for decades. Traditionally, we've relied on algorithms like depth-first search, breadth-first search, or A* to find paths through labyrinths. But what if, instead of explicitly programming a solution, we could teach a neural network to visually understand and solve mazes? This is exactly what I set out to explore using a Pix2Pix GAN architecture.
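For contrast with the learned approach, here is the kind of explicit algorithm that paragraph refers to: a minimal breadth-first search over a grid maze. The 0/1 grid encoding is my illustrative choice, not the post's data format.

```python
from collections import deque

def bfs_solve(maze, start, goal):
    """Classical baseline: shortest path through a grid maze via BFS.

    maze: 2D list where 0 = open cell, 1 = wall.
    Returns the list of (row, col) cells from start to goal, or None.
    """
    rows, cols = len(maze), len(maze[0])
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk parent pointers back to reconstruct the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and maze[nr][nc] == 0 and (nr, nc) not in parent):
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # goal unreachable

# Example: solve a tiny 3x3 maze.
maze = [[0, 1, 0],
        [0, 1, 0],
        [0, 0, 0]]
print(bfs_solve(maze, (0, 0), (0, 2)))
# [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2)]
```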
A rigorous mathematical exploration of transformer positional encodings, revealing how sinusoidal functions elegantly encode sequence order through linear transformations, inner product properties, and asymptotic decay behaviors that balance local and global attention.
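As a companion to that discussion, here is a short NumPy sketch of the standard encoding together with a numerical check of the linear-transformation property. The encoding itself is from the original Transformer paper; the demo dimensions and offset are my own choices.

```python
import numpy as np

def sinusoidal_encoding(num_positions, dim, base=10000.0):
    """Sinusoidal positional encodings from "Attention Is All You Need".

    Even columns hold sin(pos * w_i), odd columns cos(pos * w_i),
    with frequencies w_i = base**(-2i/dim).
    """
    positions = np.arange(num_positions)[:, None]   # (P, 1)
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    angles = positions * freqs[None, :]             # (P, dim/2)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(64, 16)

# Linear-transformation property: shifting by a fixed offset k acts on
# each (sin, cos) pair as a rotation whose angle k * w_i is independent
# of the absolute position -- checked here for the first pair (w_0 = 1).
k, theta = 3, 3.0
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
assert np.allclose(pe[10, :2] @ R, pe[10 + k, :2])
```

The same rotation matrix works at every position, which is precisely why attention can learn to detect relative offsets from these encodings.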