Nilesh Sarkar / Projects

Language Modeling Experimentation

Project Overview

This is an ongoing experimental project exploring large language models (LLMs) through hands-on fine-tuning, compression, and deployment. Rather than purely theoretical research, I focus on practical experimentation—training models from scratch, adapting pre-trained architectures, and testing deployment across diverse hardware constraints (from A100 GPUs to 4GB edge devices).

My work compares models across a wide parameter spectrum (0.5B → 7B), systematically analyzing the effects of architectural constraints, quantization, and model compression techniques. Rather than treating compression purely as a post-training optimization step, the project uses it as an analytical tool to study information bottlenecks, representation degradation, and architectural redundancy within language models.

Experimental Focus Areas:

Motivation

Current LLMs range from 1B to over 400B parameters, with larger models typically performing better but requiring significant computational resources. Through hands-on experimentation, I'm exploring the practical limits of compression and edge deployment, addressing challenges like:

By experimenting with these techniques, the goal is to understand what's actually feasible for deploying intelligent language models in resource-constrained environments—from mobile devices to edge hardware.

Experimental Directions

Quantization & Edge Deployment

Reducing model precision (32-bit → 8-bit → 4-bit) via QLoRA and quantization-aware training. Testing deployment on constrained edge devices with minimal performance degradation.
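The motivation is easy to see from weight-storage arithmetic alone. A minimal back-of-the-envelope sketch (activations and KV cache are excluded, so real footprints run higher):

```python
def model_memory_gb(n_params: float, bits: int) -> float:
    """Approximate weight-storage footprint in GB, ignoring activations."""
    return n_params * bits / 8 / 1e9

# A 7B model's weights shrink ~8x going from fp32 down to 4-bit.
for bits in (32, 8, 4):
    print(f"7B @ {bits}-bit: {model_memory_gb(7e9, bits):.1f} GB")
# -> 28.0 GB, 7.0 GB, 3.5 GB
```

That 28 GB → 3.5 GB drop is the difference between needing a data-center GPU and fitting on a single consumer card, which is why precision reduction anchors the edge-deployment work.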

Teacher-Student Distillation

Using large teacher models (Qwen 72B) to generate high-quality instruction data, enabling smaller models to learn semantic instruction-following behaviors through supervised fine-tuning.

Multilingual Fine-Tuning

Investigating SFT effectiveness across low-resource languages (Hindi, Kannada) to understand parameter thresholds for semantic grounding in non-English contexts.

Sparse Layer Adaptation

LoRA and parameter-efficient fine-tuning to enable training on constrained hardware (A100, Jetson Nano, etc.) while maintaining model quality and reducing memory footprint.

Experiment Indic-V1: Architectural Trade-Offs in Fine-Tuning Qwen for Low-Resource Languages

January 2026 · Experimental Project
Focus: Multilingual Fine-Tuning, Capacity Thresholds, and Low-Resource Deployment

Experiment Hypothesis

Deploying Large Language Models on edge devices requires aggressive architectural constraints, often at the cost of representational density. This experiment was driven by a specific question: can a highly constrained, sub-1-billion-parameter model (Qwen 0.5B) successfully learn Indic instruction-following behaviors via Supervised Fine-Tuning (SFT), or is there a hard parameter threshold required for multilingual semantic grounding?

To test this, I designed a controlled experiment using:

Key Finding: Model capacity acts as a hard boundary for semantic instruction following. The Qwen 0.5B model exhibited catastrophic forgetting and mode collapse (Syntactic Mimicry), while the Qwen 7B model demonstrated emergent intelligence through accurate factual retrieval and linguistic alignment—despite being trained on identical noisy data.

1. The Data Bottleneck: Taming 95GB of Raw Corpora

The foundation of this experiment was high-quality instruction-response pairs for Hindi and Kannada. We sourced raw text from AI4Bharat's IndicCorp v2, but it was massive and unstructured:

Manual cleaning at this scale was infeasible. Initial attempts to train directly on raw subsets resulted in simple text-continuation behavior rather than instruction following: the models merely predicted the next word instead of answering questions.
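Before any model-based filtering, a cheap automated pass can screen passages by Unicode script. This is a hypothetical sketch (the threshold and minimum length are illustrative, not the project's actual cleaning code):

```python
# Unicode block ranges for the two target scripts.
DEVANAGARI = (0x0900, 0x097F)
KANNADA = (0x0C80, 0x0CFF)

def script_ratio(text, lo, hi):
    """Fraction of non-space characters that fall inside [lo, hi]."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(lo <= ord(c) <= hi for c in chars) / len(chars)

def keep(text, lo, hi, threshold=0.6, min_len=20):
    """Heuristic filter: long enough and mostly in the target script."""
    return len(text) >= min_len and script_ratio(text, lo, hi) >= threshold

print(keep("भारत की राजधानी दिल्ली है। यह एक बड़ा शहर है।", *DEVANAGARI))   # True
print(keep("Hello world, this is English text only.", *DEVANAGARI))    # False
```

A pass like this removes boilerplate and wrong-language contamination cheaply, leaving the expensive semantic filtering to the teacher model described next.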

The Teacher-Student Distillation Pipeline

To bypass manual curation, I deployed the massive Qwen 2.5 72B Instruct model as a "Teacher" to act as a semantic filter and transmute raw text into structured, high-quality instruction-response pairs.

Pipeline Architecture:
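A minimal sketch of this stage. The `ask_teacher` stub stands in for the real quantized 72B inference call, and the prompt wording and schema checks are illustrative assumptions rather than the production code:

```python
import json

# Hypothetical prompt: ask the teacher for one structured SFT sample per passage.
PROMPT_TEMPLATE = (
    "Read the passage below and produce ONE instruction-response pair in the "
    'passage\'s own language, as JSON with keys "instruction", "input", "output".\n\n'
    "Passage:\n{passage}"
)

def ask_teacher(prompt):
    # Stub: the real pipeline runs Qwen 2.5 72B Instruct here.
    return '{"instruction": "Q?", "input": "", "output": "A."}'

def distill(passage):
    """Turn one raw passage into a structured SFT sample; drop malformed output."""
    reply = ask_teacher(PROMPT_TEMPLATE.format(passage=passage))
    try:
        sample = json.loads(reply)
    except json.JSONDecodeError:
        return None  # teacher emitted invalid JSON: skip the passage
    if not sample.get("instruction") or not sample.get("output"):
        return None  # reject empty pairs
    return sample
```

Run over millions of passages, the validation path matters: any unparseable or empty teacher output should be silently dropped rather than crash the batch.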

Hardware & Implementation Details

Due to massive VRAM requirements, the 72B model could not be loaded natively:

Sample Generated Data:

KANNADA (GOVERNMENT & INFRASTRUCTURE)

{ "instruction": "ಜಿಲ್ಲಾ ಕ್ರೀಡಾಂಗಣದಲ್ಲಿ ನಡೆದ ಶಂಕುಸ್ಥಾಪನೆ ಹಾಗೂ ಉದ್ಘಾಟನಾ ಸಮಾರಂಭ ಯೋಜನೆಯ ಬೆಲೆ ಎಷ್ಟು?", "input": "", "output": "ಜಿಲ್ಲಾ ಕ್ರೀಡಾಂಗಣದಲ್ಲಿ ಭಾನುವಾರ 1,650 ಕೋಟಿ ರೂ. ವೆಚ್ಚದ ಹಲವು ಅಭಿವೃದ್ಧಿ ಯೋಜನೆಗಳ ಶಂಕುಸ್ಥಾಪನೆ ಹಾಗೂ ಉದ್ಘಾಟನಾ ಸಮಾರಂಭದಲ್ಲಿ ಮಾತನಾಡಿದರು." }

(English gloss: "What was the cost of the projects at the foundation-laying and inauguration ceremony held at the district stadium?" → "He spoke on Sunday at the district stadium at the foundation-laying and inauguration ceremony of several development projects worth Rs. 1,650 crore.")

HINDI (AUTOMOTIVE)

{ "instruction": "मारुति बलेनो पर बड़ा डिस्काउंट किस उद्देश्य से लाया गया है?", "input": "", "output": "इसके साथ ही कई जगहों पर मारुति बलेनो में कई तरह के डिस्काउंट व ऑफर भी उपलब्ध कराये जा रहे है। माना जा रहा है कि यह बड़ा डिस्काउंट इसके स्टॉक को जल्द खत्म करने के लिए लाया गया है।" }

(English gloss: "For what purpose has the big discount on the Maruti Baleno been introduced?" → "Along with this, various discounts and offers are also being made available on the Maruti Baleno in many places. It is believed that this big discount has been introduced to clear its stock quickly.")

HINDI (ECONOMICS)

{ "instruction": "चीन की अर्थव्यवस्था आगामी वर्ष में बांडों की अदायगी में डिफॉल्ट की समस्या का सामना कैसे करेगी?", "input": "", "output": "यह समस्या जितनी नजर आ रही है उससे भी कहीं अधिक गंभीर है क्योंकि चीन की 134 अरब डॉलर की बांड परिपक्वता को देखते हुए आगामी वर्ष में इसकी अदायगी में और भी अधिक डिफॉल्ट हो सकते हैं।" }

(English gloss: "How will China's economy face the problem of bond repayment defaults in the coming year?" → "This problem is even more serious than it appears because, given China's $134 billion in bond maturities, there could be even more repayment defaults in the coming year.")

2. Phase I: The Capacity Bottleneck (Qwen 0.5B)

The initial fine-tuning phase utilized the Qwen 2.5 0.5B Instruct model to test whether a highly constrained architecture could learn instruction-following behavior.

Debugging the Missing Labels Bug

Training initially crashed with ValueError: The model did not return a loss. The dataset was tokenized successfully but lacked a labels column, meaning the model received input sequences but had no "correct answer" against which to calculate error.

Solution: explicitly copy input_ids to labels during the data-mapping step:

def process_data(sample):
    text = format_instruction(sample)
    tokenized = tokenizer(text, truncation=True, max_length=512)
    # Causal LM training needs explicit labels; copying input_ids lets the
    # Trainer compute next-token cross-entropy against the same sequence.
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

dataset = dataset.map(process_data, remove_columns=dataset.column_names)

Training Dynamics & Mode Collapse

Once fixed, training on the A100 completed blisteringly fast: ~20 minutes end to end at ~1.6 seconds per step. However, inference revealed catastrophic forgetting and mode collapse. The loss curve looked healthy, but the model had learned to be a sophisticated noise repeater rather than an instruction follower.

Test Query | 0.5B Response | Diagnosis
"Who are you?" | "Said 27: The police arrested..." | Syntactic Mimicry: regurgitated reporting patterns instead of grasping semantic identity
"India Capital?" | "Modi visited the capital on..." | Associative Hallucination: triggered a statistical association between "Capital" and "Politics" instead of factual retrieval
"Karnataka CM?" | "Karnataka, also known as the 'Red Hills', is one of India's most populous states..." | Empty Head Problem: lacking parameters for world knowledge, hallucinated nonsensical monikers to fill the void

Detailed Failure Case: Identity Loss (Hindi)

User: नमस्ते, आप कौन हैं? (Hello, who are you?)

AI Output: आज सुबह 10:35 बाद में प्रधानमंत्री डा. रणवीर गोयल अभियान के लिए भगवान शिव की जड़ और उच्च इशारा के खिलाफ... (incoherent news-style text)

Analysis: The model generated statistically probable Hindi text (mimicking a news snippet about a campaign) but completely ignored the question. It hallucinated a "Dr. Ranveer Goyal" and a fabricated timestamp (10:35), reproducing the stylistic patterns of the raw training data (news articles, timestamps, official names).

The "Parrot" Phenomenon: With limited parameter space, the 0.5B model behaved like a linguistic echo chamber. It lacked the cognitive depth to understand the abstract concept of "Question & Answer." Instead, it focused on surface-level features—dates, colons, reporter syntax—and essentially memorized the noise in the dataset rather than learning the underlying task of instruction-following.

3. Phase II: Scaling to Emergent Intelligence (Qwen 7B)

Realizing that the 0.5B architecture lacked the requisite "neurons" for factual storage or semantic abstraction, the experiment escalated to the Qwen 2.5 7B Instruct model, a 14x parameter increase.

QLoRA Configuration

The model was fine-tuned using Quantized Low-Rank Adaptation (QLoRA) to fit within the compute budget.
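The memory win from LoRA falls out of simple parameter arithmetic. The hidden size below is an assumed figure for a 7B-class model, and the rank is a typical choice rather than the exact configuration used here:

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA factors the weight update as B @ A: a (d_out x r) matrix plus an (r x d_in) matrix."""
    return r * (d_in + d_out)

hidden = 3584   # assumed hidden size of a 7B-class model
rank = 16       # typical LoRA rank (illustrative)
dense = hidden * hidden
adapter = lora_trainable_params(hidden, hidden, rank)
print(f"dense: {dense:,}  adapter: {adapter:,}  trainable: {adapter / dense:.2%}")
# under 1% of the dense projection's weights end up trainable
```

Because gradients and optimizer states are only kept for that sub-1% slice, the frozen base can stay in 4-bit while training still fits on a single GPU.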

Inference Results: From Mimicry to Understanding

The 7B model demonstrated a qualitative leap in capability:

Sample 1: Fact Retrieval (Perfect Accuracy)

User: भारत की राजधानी क्या है? (What is the capital of India?)

AI: भारत की राजधानी दिल्ली है। (The capital of India is Delhi.)

Verdict: ✓ 100% Perfect. Concise and factually correct.

Sample 2: Hallucination vs. Structure (90% Success)

User: ಕರ್ನಾಟಕದ ಮುಖ್ಯಮಂತ್ರಿ ಯಾರು? (Who is the CM of Karnataka?)

AI: ಕರ್ನಾಟಕದ ಮುಖ್ಯಮಂತ್ರಿ ಸಿದ್ದರಾಮಯ್ಯ ಅವರು ಪ್ರಸ್ತುತವಾಗಿ ಬೆಳಗಾವಿಯಲ್ಲಿ ಉಪಸ್ಥಿತರು. (Karnataka's Chief Minister Siddaramaiah is currently present in Belagavi.)

Analysis:
• ✓ Success: Correctly identified Siddaramaiah, proving the model possesses the pre-trained world knowledge the 0.5B lacked
• ✗ Hallucination: The phrase "currently present in Belagavi" is a stylistic add-on. The model adopted a news-reporter persona from the training data, generating a live update rather than just answering.

Sample 3: Cultural Alignment & Persona Enforcement

User: Who are you? (in English)

AI: आप ने मुझे क्यों पूछा है? मैं आपकी सहायता करने के लिए यहाँ हूँ। (Why did you ask me? I am here to help you.)

Verdict: ✓ Pass (with attitude). The model prioritized the Indic-focused system prompt over the user's English input, answering in Hindi. Strong adherence to the injected persona.

The "Student" Phenomenon: At 7B parameters, the model exhibited emergent intelligence. It acted like a university student who could look at the same noisy training data, identify the underlying Q&A pattern, ignore the irrelevant syntactic fluff, and understand the intent of the instruction. Model capacity enables semantic abstraction rather than mere syntactic imitation.

Inference Architecture & System Prompt

To enforce language boundaries and persona constraints, a robust system prompt was designed:

SYSTEM_PROMPT = """You are a helpful, culturally aware AI assistant designed to speak fluent Hindi and Kannada. Your goal is to answer the user's questions accurately in the language they used (Hindi or Kannada).

GUIDELINES:
1. If the user asks in Hindi, reply in Hindi (Devanagari script).
2. If the user asks in Kannada, reply in Kannada (Kannada script).
3. Be concise and factual. Do not hallucinate news reports.
4. For technical topics, use English terms but explain in target language."""

Inference was configured with temperature 0.3 (favoring factual responses), top_p=0.9, and repetition_penalty=1.1 to prevent output looping.
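The effect of these settings can be reproduced on a toy distribution. The sketch below implements temperature scaling and nucleus (top-p) filtering in plain Python (repetition penalty omitted); it is illustrative, not the production decoding loop:

```python
import math
import random

def sample_token(logits, temperature=0.3, top_p=0.9, seed=0):
    """Toy decoder step: temperature-scaled softmax followed by a nucleus (top-p) cut."""
    rng = random.Random(seed)
    # Temperature < 1 sharpens the distribution toward the most likely token.
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(v - m) for t, v in scaled.items()}
    z = sum(exps.values())
    probs = sorted(((t, e / z) for t, e in exps.items()), key=lambda kv: -kv[1])
    # Keep the smallest prefix of tokens whose cumulative mass reaches top_p.
    kept, cum = [], 0.0
    for t, p in probs:
        kept.append((t, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize the surviving nucleus and draw one token.
    total = sum(p for _, p in kept)
    x = rng.random() * total
    for t, p in kept:
        x -= p
        if x <= 0:
            return t
    return kept[-1][0]
```

At temperature 0.3 on a peaked distribution, the nucleus collapses to the single most likely token, which is exactly the "favoring factual responses" behavior described above.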

4. Architectural Analysis & Key Insights

Insight 1: The "Empty Head" Problem—Capacity as a Hard Boundary

A critical finding: models cannot be fine-tuned to answer factual questions they never learned during pre-training. The 0.5B model's failure to identify the CM of Karnataka or accurately describe the state illustrates this limit. The "Red Hills" moniker is pure fabrication: with no pre-trained knowledge to retrieve, the model invented a plausible-sounding detail to fill the gap.

Why LoRA Works for 7B but Not 0.5B: Fine-tuning via LoRA is fundamentally about teaching a new style or language, not injecting new facts. The 7B model succeeded because it already possessed "Siddaramaiah" in its pre-trained weights; the LoRA adapter simply taught it to express that knowledge in Kannada. The 0.5B model had no knowledge to retrieve, so it hallucinated plausible-sounding nonsense instead.

Insight 2: Data Sensitivity & Noise Robustness Scale with Capacity

Noise Intolerance (0.5B): With limited parameter space, the 0.5B model was highly sensitive to data artifacts. Because ~5% of the training data resembled news reports (due to IndicCorp sourcing), the model overfitted to this pattern and became a "News Bot" that generated timestamps and reporter syntax.

Noise Filtering (7B): Despite training on identical noisy data, the 7B model demonstrated robustness by largely filtering out the reporter style. Its massive pre-trained signal for instruction-following outweighed the noise in the fine-tuning set. Hallucinations appeared ("currently present in Belagavi") but were minor stylistic add-ons rather than complete mode collapse.

Insight 3: Capacity Gates Semantic Abstraction

The difference between 0.5B and 7B is not just "more memory." It's a qualitative difference in abstraction capability:

This aligns with findings in mechanistic interpretability: scaling enables the emergence of abstraction capabilities that don't exist at smaller scales, even with the same training data.

Key Takeaway from Indic-V1: Model capacity acts as a hard threshold for semantic instruction following in multilingual contexts. A 14x parameter increase (0.5B → 7B) transforms a model from a syntactic mimic to a semantic reasoner. While the 0.5B model offers speed and efficiency, it sacrifices the representational density required for robust instruction-following behavior. This has direct implications for edge deployment: models must exceed a minimum capacity threshold to perform reliably on open-ended tasks, regardless of fine-tuning data quality.

Experiment Gemma 3 1B: On-Device Fine-Tuning & Inference on NVIDIA Jetson Nano

Experiment Motivation

While the Qwen 7B experiments demonstrated scalability on high-end hardware, the fundamental question remains: Can we fine-tune and deploy instruction-following models on ultra-constrained edge hardware? This experiment tests the extreme endpoint of model compression and edge deployment feasibility.

Hardware Constraints

Model & Optimization Strategy

Model Choice: Google Gemma 3 1B (1 billion parameters, pre-trained on diverse multilingual data)

Optimization Pipeline:

Fine-Tuning Results

Training Performance:

Key Achievement: Successfully trained a 1B parameter model on 4GB RAM without GPU Out-of-Memory errors. This validates that edge devices can perform model adaptation in real-time or at deployment time, enabling personalization without cloud dependencies.
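A rough memory-budget check shows why this fits. The figures below assume an 8-bit frozen base with fp16 LoRA weights, fp16 gradients, and fp32 Adam moments kept only for the adapter (the adapter size is an illustrative guess); activation memory is excluded:

```python
def finetune_footprint_gb(n_params, weight_bits, n_lora_params):
    """Frozen quantized base plus trainable adapter: fp16 weights + fp16 grads
    + fp32 Adam moments = 2 + 2 + 8 bytes per adapter parameter."""
    base_bytes = n_params * weight_bits / 8
    adapter_bytes = n_lora_params * (2 + 2 + 8)
    return (base_bytes + adapter_bytes) / 1e9

# 1B base in 8-bit plus an assumed ~5M LoRA parameters:
print(f"{finetune_footprint_gb(1e9, 8, 5e6):.2f} GB")  # -> 1.06 GB
```

Even with activations and framework overhead on top, that leaves headroom on a 4GB board, whereas full fp32 fine-tuning would need roughly 4GB for the weights alone, before any gradients or optimizer states.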

Inference on Jetson Nano

Inference Configuration:

Sample Inference Results:

Query 1: English Instruction

User: "Explain machine learning in simple terms."

Gemma 3 1B (Jetson Nano): "Machine learning is a type of artificial intelligence where computers learn patterns from data instead of being explicitly programmed. The model improves its performance as it sees more examples, similar to how humans learn through experience."

✓ Coherent, concise explanation; appropriate for edge device

Query 2: Multilingual Query (Hindi)

User: "भारत में सबसे बड़ा व्यावसायिक शहर कौन सा है?" (Which is the largest commercial city in India?)

Gemma 3 1B (Jetson Nano): "मुंबई भारत में सबसे बड़ा व्यावसायिक शहर है। यह देश का वित्तीय केंद्र भी माना जाता है।" (Mumbai is the largest commercial city in India. It is also considered the country's financial center.)

✓ Correct multilingual response; maintained context from fine-tuning

Performance Benchmarking

Metric | Value | Notes
Model Size | ~2.4GB (8-bit quantized) | Reduced from 4GB full precision
Tokens/Second | ~3-5 tok/s | Suitable for interactive assistants
End-to-End Latency (64 tok) | 15-20 seconds | Acceptable for resource-constrained scenarios
Power Consumption | ~5-7W (inference) | Battery-efficient for mobile/embedded use
Fine-Tuning Duration | 4-6 hours/epoch | On-device adaptation possible
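Two quick derived figures cross-check the benchmark table (formulas only; prompt-prefill overhead is ignored):

```python
def decode_latency_s(n_tokens, toks_per_sec):
    """Decode-only latency for n_tokens at a steady generation rate."""
    return n_tokens / toks_per_sec

def energy_per_token_j(power_w, toks_per_sec):
    """Joules per generated token at a steady power draw."""
    return power_w / toks_per_sec

print(decode_latency_s(64, 4.0))      # -> 16.0 s, inside the 15-20 s window
print(energy_per_token_j(6.0, 4.0))   # -> 1.5 J per token at ~6 W
```

At roughly 1.5 J per token, battery-powered operation is plausible for short interactive exchanges, consistent with the mobile/embedded use case noted above.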

Jetson Nano Deployment: What Worked

Edge Deployment Feasibility: A 1B parameter model can be fine-tuned, deployed, and run inference on 4GB Jetson Nano hardware using 8-bit quantization and LoRA. Inference latency (15-20 sec for 64 tokens) is acceptable for non-real-time applications (chatbots, document summarization, local RAG systems).

Practical Takeaway: Organizations can now deploy personalized LLMs on edge devices without cloud dependency, enabling privacy-preserving local inference for robotics, IoT, and embedded AI systems.

Key Learnings & Observations

Indic-V1 (Qwen): Model capacity is a hard boundary for semantic grounding. A 14x parameter increase (0.5B → 7B) transformed the model from a "Parrot" (syntactic mimicry) to a "Student" (semantic understanding). Teacher-student distillation successfully generated high-quality training data from raw corpora, enabling supervised fine-tuning on low-resource languages.

Gemma 3 on Jetson Nano: Sub-2GB quantized models with LoRA can be fine-tuned and deployed on 4GB edge hardware. Latency is acceptable for non-real-time applications, enabling privacy-preserving personalization on edge devices without cloud inference.

Ongoing Exploration: The trade-off between model capacity, quantization, and task-specific fine-tuning defines the deployment frontier. Through continued experimentation, the goal is to map out what's actually feasible across the extreme parameter range (0.5B → 7B) and hardware spectrum (edge → data center), while identifying where semantic competence breaks down.

Reference Papers & Learning Resources

Foundational papers and resources informing these experiments: