Project Overview
This is an ongoing experimental project exploring large language models (LLMs) through hands-on fine-tuning, compression, and deployment. Rather than purely theoretical research, I focus on practical experimentation—training models from scratch, adapting pre-trained architectures, and testing deployment across diverse hardware constraints (from A100 GPUs to 4GB edge devices).
My work involves contrasting models across a wide parameter spectrum (0.5B → 7B), systematically analyzing the effects of architectural constraints, quantization, and model compression techniques. Rather than treating compression purely as an optimization step after training, the project uses it as an analytical tool to study information bottlenecks, representation degradation, and architectural redundancy within language models.
Experimental Focus Areas:
- Fine-tuning models across the parameter spectrum (0.5B → 7B) for low-resource languages
- Testing quantization and compression techniques (QLoRA, 4-bit/8-bit precision)
- Edge deployment feasibility on constrained hardware (Jetson Nano, limited VRAM)
- Understanding capacity thresholds for instruction-following vs. syntactic mimicry
- Evaluating trade-offs between model size, latency, and semantic competence
Motivation
Current LLMs range from under 1B to over 400B parameters; larger models typically perform better but require significant computational resources. Through hands-on experimentation, I'm exploring the practical limits of compression and edge deployment for settings like:
- Edge devices and mobile applications
- Real-time inference systems with latency constraints
- Organizations with limited computational budgets
- Privacy-critical applications requiring on-device inference
By experimenting with these techniques, the goal is to understand what's actually feasible for deploying intelligent language models in resource-constrained environments—from mobile devices to edge hardware.
Experimental Directions
Quantization & Edge Deployment
Reducing model precision (32-bit → 8-bit → 4-bit) via QLoRA and quantization-aware training. Testing deployment on constrained edge devices with minimal performance degradation.
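To make the precision ladder concrete, here is a minimal, self-contained sketch (not the project's code) of symmetric 8-bit post-training quantization, round-tripped to show the reconstruction error that quantization trades for a 4x memory reduction:

```python
# Illustrative sketch: symmetric 8-bit quantization of a small weight
# vector, then dequantization to measure the precision loss.

def quantize_int8(weights):
    """Map float weights to int8 codes in [-127, 127] with one scale factor."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.42, -1.30, 0.07, 0.95, -0.61]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Memory drops 4x (32-bit float -> 8-bit int); error is bounded by scale/2.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(q)        # int8 codes
print(max_err)  # worst-case reconstruction error
```

Real pipelines (QLoRA's NF4, quantization-aware training) use more sophisticated codebooks and per-block scales, but the memory/fidelity trade-off is the same arithmetic.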
Teacher-Student Distillation
Using large teacher models (Qwen 72B) to generate high-quality instruction data, enabling smaller models to learn semantic instruction-following behaviors through supervised fine-tuning.
Multilingual Fine-Tuning
Investigating SFT effectiveness across low-resource languages (Hindi, Kannada) to understand parameter thresholds for semantic grounding in non-English contexts.
Sparse Layer Adaptation
LoRA and parameter-efficient fine-tuning to enable training on constrained hardware (A100, Jetson Nano, etc.) while maintaining model quality and reducing memory footprint.
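The memory savings come from the low-rank factorization itself. A quick numeric sketch (toy dimensions chosen for illustration, not the actual Qwen shapes) of why LoRA trains so few parameters:

```python
# LoRA trains two low-rank factors B (d_out x r) and A (r x d_in) and
# applies W' = W + (alpha / r) * (B @ A), instead of updating the full
# d_out x d_in weight matrix. Toy dimensions for illustration.

d_in, d_out, r, alpha = 1024, 1024, 8, 16

full_update_params = d_in * d_out    # what full fine-tuning would train
lora_params = r * (d_in + d_out)     # what LoRA trains instead

print(full_update_params)                 # 1048576
print(lora_params)                        # 16384
print(full_update_params / lora_params)   # 64x fewer trainable parameters
```

The base weights stay frozen (and can stay quantized), so optimizer state only exists for the small factors.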
Experiment Indic-V1: Architectural Trade-Offs in Fine-Tuning Qwen for Low-Resource Languages
Focus: Multilingual Fine-Tuning, Capacity Thresholds, and Low-Resource Deployment
Experiment Hypothesis
Deploying Large Language Models on edge devices requires aggressive architectural constraints, often at the cost of representational density. This experiment was driven by a specific hypothesis: Can a highly constrained, sub-1-billion parameter model (Qwen 0.5B) successfully learn Indic instruction-following behaviors via Supervised Fine-Tuning (SFT), or is there a hard parameter threshold required for multilingual semantic grounding?
To test this, I designed a controlled experiment using:
- Data Control: A unified, high-quality synthetic dataset generated via teacher-student distillation
- Architecture Variable: Contrasting a "Toy" architecture (0.5B) against a "Production" architecture (7B)
- Compute Constraint: Training on a single NVIDIA A100 (80GB VRAM) node
Key Finding: Model capacity acts as a hard boundary for semantic instruction following. The Qwen 0.5B model exhibited catastrophic forgetting and mode collapse (Syntactic Mimicry), while the Qwen 7B model demonstrated emergent intelligence through accurate factual retrieval and linguistic alignment—despite being trained on identical noisy data.
1. The Data Bottleneck: Taming 95GB of Raw Corpora
The foundation of this experiment required high-quality instruction-response pairs for Hindi and Kannada. I sourced raw data from the AI4Bharat IndicCorp v2 dataset, but it was massive and unstructured:
- Kannada: ~20GB of raw text (kn.txt)
- Hindi: ~75GB of raw text split across multiple files (hi_1.txt, hi_2.txt, hi_3.txt)
Manual cleaning was infeasible at this scale. Initial attempts to train directly on raw subsets produced simple text-continuation behavior rather than instruction-following: the models merely predicted the next word instead of answering questions.
The Teacher-Student Distillation Pipeline
To bypass manual curation, I deployed the massive Qwen 2.5 72B Instruct model as a "Teacher" to act as a semantic filter and transmute raw text into structured, high-quality instruction-response pairs.
Pipeline Architecture:
- Sampling: Extracted 6,000 lines from raw text files for both Hindi and Kannada
- Structuring: The 72B model converted raw text into standard Alpaca-style JSONL format
- Constraint: Characters per paragraph capped at 5,000 to ensure concise context usage
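The pipeline steps above can be sketched as a small loop. Note that `teacher_generate` is a hypothetical stub standing in for the actual Qwen 2.5 72B Instruct call, so only the structure is shown:

```python
# Sketch of the distillation loop: sample raw lines, cap paragraph length,
# and have the teacher emit Alpaca-style records serialized as JSONL.
# `teacher_generate` is a placeholder for the real 72B model call.

import json

MAX_CHARS = 5000  # paragraph cap from the pipeline constraints

def teacher_generate(paragraph):
    # Stub: the real teacher extracts an instruction/response pair
    # grounded in the paragraph. Hypothetical output for illustration.
    return {"instruction": "Summarize the passage.", "input": "",
            "output": paragraph[:100]}

def build_pairs(raw_lines):
    pairs = []
    for line in raw_lines:
        paragraph = line.strip()[:MAX_CHARS]  # enforce the 5,000-char cap
        if not paragraph:
            continue  # skip blank lines in the raw corpus
        pairs.append(teacher_generate(paragraph))
    return pairs

raw = ["Some raw Hindi or Kannada news text ...", "", "Another paragraph."]
jsonl = "\n".join(json.dumps(p, ensure_ascii=False) for p in build_pairs(raw))
print(jsonl)
```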
Hardware & Implementation Details
Due to its massive VRAM requirements, the 72B model could not be loaded into GPU memory in one piece:
- Compute: Single NVIDIA A100 (80GB VRAM), 64GB system RAM, Intel Xeon 42-core CPU
- Sharding: 72B parameter model broken into 37 shards to fit into memory
- Batch Sizing: Initial 24 parallel generations crashed the system; throttled to batch size 16
- Context Windows: Token limit increased from 64/256 to 512 tokens mid-process to handle agglutinative Kannada structures without truncation
- Processing Time: 15–19 hours for Kannada subset alone
- Output: 8,366 high-quality instruction pairs (≈4,500 Kannada, ≈3,800 Hindi)
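For a sense of scale, a back-of-envelope calculation using only the figures reported above puts the teacher's effective throughput at a few hundred pairs per hour:

```python
# Teacher-pass throughput implied by the reported numbers:
# ~4,500 Kannada pairs produced in 15-19 hours on a single A100.

pairs = 4500
hours = (15, 19)

rate = tuple(pairs / h for h in hours)
print(rate)  # (300.0, ~236.8) pairs/hour
```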
Sample Generated Data:
KANNADA (GOVERNMENT & INFRASTRUCTURE)
{
"instruction": "ಜಿಲ್ಲಾ ಕ್ರೀಡಾಂಗಣದಲ್ಲಿ ನಡೆದ ಶಂಕುಸ್ಥಾಪನೆ ಹಾಗೂ ಉದ್ಘಾಟನಾ ಸಮಾರಂಭ ಯೋಜನೆಯ ಬೆಲೆ ಎಷ್ಟು?",
"input": "",
"output": "ಜಿಲ್ಲಾ ಕ್ರೀಡಾಂಗಣದಲ್ಲಿ ಭಾನುವಾರ 1,650 ಕೋಟಿ ರೂ. ವೆಚ್ಚದ ಹಲವು ಅಭಿವೃದ್ಧಿ ಯೋಜನೆಗಳ ಶಂಕುಸ್ಥಾಪನೆ ಹಾಗೂ ಉದ್ಘಾಟನಾ ಸಮಾರಂಭದಲ್ಲಿ ಮಾತನಾಡಿದರು."
}
HINDI (AUTOMOTIVE)
{
"instruction": "मारुति बलेनो पर बड़ा डिस्काउंट किस उद्देश्य से लाया गया है?",
"input": "",
"output": "इसके साथ ही कई जगहों पर मारुति बलेनो में कई तरह के डिस्काउंट व ऑफर भी उपलब्ध कराये जा रहे है। माना जा रहा है कि यह बड़ा डिस्काउंट इसके स्टॉक को जल्द खत्म करने के लिए लाया गया है।"
}
HINDI (ECONOMICS)
{
"instruction": "चीन की अर्थव्यवस्था आगामी वर्ष में बांडों की अदायगी में डिफॉल्ट की समस्या का सामना कैसे करेगी?",
"input": "",
"output": "यह समस्या जितनी नजर आ रही है उससे भी कहीं अधिक गंभीर है क्योंकि चीन की 134 अरब डॉलर की बांड परिपक्वता को देखते हुए आगामी वर्ष में इसकी अदायगी में और भी अधिक डिफॉल्ट हो सकते हैं।"
}
2. Phase I: The Capacity Bottleneck (Qwen 0.5B)
The initial fine-tuning phase utilized the Qwen 2.5 0.5B Instruct model to test whether a highly constrained architecture could learn instruction-following behavior.
Debugging the Missing Labels Bug
Training initially crashed with ValueError: The model did not return a loss. The dataset was tokenized successfully but lacked a labels column, meaning the model received input sequences but had no "correct answer" against which to calculate error.
Solution: Explicitly copy input_ids to labels during data mapping phase:
```python
def process_data(sample):
    text = format_instruction(sample)
    tokenized = tokenizer(text, truncation=True, max_length=512)
    # Causal LM training needs a labels column; copying input_ids lets
    # the Trainer compute the shifted next-token loss.
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

dataset = dataset.map(process_data, remove_columns=dataset.column_names)
```
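The snippet assumes a `format_instruction` helper. A plausible sketch of it, rendering each record into the standard Alpaca-style template (the exact template isn't shown in this write-up, so treat this as a hypothetical reconstruction):

```python
# Hypothetical sketch of the `format_instruction` helper assumed above:
# render a sample into the Alpaca-style prompt, omitting the Input
# section when the field is empty (as in the generated data).

def format_instruction(sample):
    if sample.get("input"):
        return (f"### Instruction:\n{sample['instruction']}\n\n"
                f"### Input:\n{sample['input']}\n\n"
                f"### Response:\n{sample['output']}")
    return (f"### Instruction:\n{sample['instruction']}\n\n"
            f"### Response:\n{sample['output']}")

sample = {"instruction": "भारत की राजधानी क्या है?", "input": "",
          "output": "भारत की राजधानी दिल्ली है।"}
print(format_instruction(sample))
```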
Training Dynamics & Mode Collapse
Once fixed, training on the A100 completed blisteringly fast: ~20 minutes at ~1.6 seconds per step. However, inference revealed catastrophic forgetting and mode collapse. The loss curve appeared healthy, but the model had learned to be a sophisticated noise repeater rather than an instruction follower.
| Test Query | 0.5B Response | Diagnosis |
|---|---|---|
| "Who are you?" | "Said 27: The police arrested..." | Syntactic Mimicry: Regurgitated reporting patterns instead of grasping semantic identity |
| "India Capital?" | "Modi visited the capital on..." | Associative Hallucination: Triggered statistical association between "Capital" and "Politics" instead of factual retrieval |
| "Karnataka CM?" | "Karnataka, also known as the 'Red Hills', is one of India's most populous states..." | Empty Head Problem: Lacking the parameters for world knowledge, it hallucinated a nonsensical moniker to fill the void |
Detailed Failure Case: Identity Loss (Hindi)
User: नमस्ते, आप कौन हैं? (Hello, who are you?)
AI Output: आज सुबह 10:35 बाद में प्रधानमंत्री डा. रणवीर गोयल अभियान के लिए भगवान शिव की जड़ और उच्च इशारा के खिलाफ...
Analysis: The model generated statistically probable Hindi text (mimicking a news snippet about a campaign) but completely ignored the question. It hallucinated a "Dr. Ranveer Goyal" and a fabricated timestamp (10:35), reproducing the stylistic patterns of the raw training data (news articles, timestamps, official names).
The "Parrot" Phenomenon: With limited parameter space, the 0.5B model behaved like a linguistic echo chamber. It lacked the cognitive depth to understand the abstract concept of "Question & Answer." Instead, it focused on surface-level features—dates, colons, reporter syntax—and essentially memorized the noise in the dataset rather than learning the underlying task of instruction-following.
3. Phase II: Scaling to Emergent Intelligence (Qwen 7B)
Realizing that the 0.5B architecture lacked the requisite "neurons" for factual storage or semantic abstraction, the experiment escalated to the Qwen 2.5 7B Instruct model, a 14x parameter increase.
QLoRA Configuration
Model fine-tuned using Quantized Low-Rank Adaptation to fit within compute budget:
- Base Model Quantization: 4-bit NF4 (Normal Float)
- LoRA Adapters: 16-bit (Rank 32, Alpha 64)
- Trainable Parameters: 80,740,352 (~80 Million out of 7B total)
- Training Time: 2 hours 30 minutes (~6.6 sec/step throughput)
- Final Loss: 0.61 (The "Goldilocks Zone"—indicating learning without overfitting or memorization)
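The reported trainable-parameter count is consistent with rank-32 adapters on every linear projection of Qwen 2.5 7B. A quick check, assuming adapters on all seven projections and using Qwen 2.5 7B's published layer dimensions:

```python
# Back-of-envelope check of the 80,740,352 trainable-parameter figure:
# rank-32 LoRA on all linear projections of Qwen 2.5 7B (hidden 3584,
# GQA key/value width 512, MLP width 18944, 28 layers). Each adapted
# d_out x d_in matrix adds r * (d_in + d_out) parameters.

r = 32
hidden, kv, mlp, layers = 3584, 512, 18944, 28

projections = [
    (hidden, hidden),  # q_proj
    (kv, hidden),      # k_proj
    (kv, hidden),      # v_proj
    (hidden, hidden),  # o_proj
    (mlp, hidden),     # gate_proj
    (mlp, hidden),     # up_proj
    (hidden, mlp),     # down_proj
]

per_layer = sum(r * (d_in + d_out) for d_out, d_in in projections)
total = per_layer * layers
print(total)  # 80740352 -- matches the reported figure exactly
```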
Inference Results: From Mimicry to Understanding
The 7B model demonstrated a qualitative leap in capability:
Sample 1: Fact Retrieval (Perfect Accuracy)
User: भारत की राजधानी क्या है? (What is the capital of India?)
AI: भारत की राजधानी दिल्ली है। (The capital of India is Delhi.)
Verdict: ✓ 100% Perfect. Concise and factually correct.
Sample 2: Hallucination vs. Structure (90% Success)
User: ಕರ್ನಾಟಕದ ಮುಖ್ಯಮಂತ್ರಿ ಯಾರು? (Who is the CM of Karnataka?)
AI: ಕರ್ನಾಟಕದ ಮುಖ್ಯಮಂತ್ರಿ ಸಿದ್ದರಾಮಯ್ಯ ಅವರು ಪ್ರಸ್ತುತವಾಗಿ ಬೆಳಗಾವಿಯಲ್ಲಿ ಉಪಸ್ಥಿತರು. (Karnataka's Chief Minister Siddaramaiah is currently present in Belagavi.)
Analysis:
• ✓ Success: Correctly identified Siddaramaiah, proving it possesses the pre-trained world knowledge the 0.5B lacked
• ✗ Hallucination: The phrase "currently present in Belagavi" is a stylistic add-on. The model adopted a news-reporter persona from the training data, generating a live update rather than simply answering.
Sample 3: Cultural Alignment & Persona Enforcement
User: Who are you? (in English)
AI: आप ने मुझे क्यों पूछा है? मैं आपकी सहायता करने के लिए यहाँ हूँ।
(Why did you ask me? I am here to help you.)
Verdict: ✓ Pass (with Attitude). The model prioritized the Indic-focused system prompt over the user's English input, answering in Hindi. Strong adherence to the injected persona.
The "Student" Phenomenon: At 7B parameters, the model exhibited emergent intelligence. It acted like a university student who could look at the same noisy training data, identify the underlying Q&A pattern, ignore the irrelevant syntactic fluff, and understand the intent of the instruction. Model capacity enables semantic abstraction rather than mere syntactic imitation.
Inference Architecture & System Prompt
To enforce language boundaries and persona constraints, a robust system prompt was designed:
```python
SYSTEM_PROMPT = """You are a helpful, culturally aware AI assistant
designed to speak fluent Hindi and Kannada.
Your goal is to answer the user's questions accurately
in the language they used (Hindi or Kannada).

GUIDELINES:
1. If the user asks in Hindi, reply in Hindi (Devanagari script).
2. If the user asks in Kannada, reply in Kannada (Kannada script).
3. Be concise and factual. Do not hallucinate news reports.
4. For technical topics, use English terms but explain in target language."""
```
Inference was configured with temperature 0.3 (favoring factual responses), top_p=0.9, and repetition_penalty=1.1 to prevent output looping.
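The effect of those decoding knobs can be shown with toy arithmetic (this is not the actual inference stack, just the math behind temperature scaling, nucleus filtering, and the repetition penalty applied to a small logit vector):

```python
# Toy illustration of the decoding configuration: temperature sharpens
# the distribution, the repetition penalty discounts already-emitted
# tokens, and top-p keeps only the smallest set covering probability p.

import math

def adjust_logits(logits, prev_tokens, temperature=0.3,
                  repetition_penalty=1.1):
    out = []
    for tok, logit in enumerate(logits):
        if tok in prev_tokens:
            # HF-style penalty: shrink positive logits, push down negatives
            logit = logit / repetition_penalty if logit > 0 else logit * repetition_penalty
        out.append(logit / temperature)  # low temperature sharpens
    return out

def top_p_probs(logits, p=0.9):
    exps = [math.exp(l - max(logits)) for l in logits]
    probs = [e / sum(exps) for e in exps]
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep, mass = [], 0.0
    for i in order:                      # smallest set with mass >= p
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    z = sum(probs[i] for i in keep)
    return {i: probs[i] / z for i in keep}

logits = [2.0, 1.5, 0.2, -1.0]
dist = top_p_probs(adjust_logits(logits, prev_tokens={0}), p=0.9)
print(dist)  # token 0 is penalized; the low-probability tail is cut
```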
4. Architectural Analysis & Key Insights
Insight 1: The "Empty Head" Problem—Capacity as a Hard Boundary
A critical finding: models cannot be fine-tuned to answer factual questions they never learned during pre-training. The 0.5B model's failure to identify the CM of Karnataka or accurately describe the state illustrates this limit. The "Red Hills" moniker was pure invention: with no pre-trained knowledge to retrieve, the model filled the gap with plausible-sounding fabrication.
Why LoRA Works for 7B but Not 0.5B: Fine-tuning via LoRA is fundamentally about teaching a new style or language, not injecting new facts. The 7B model succeeded because it already possessed "Siddaramaiah" in its pre-trained weights; the LoRA adapter simply taught it to express that knowledge in Kannada. The 0.5B model had no knowledge to retrieve, so it hallucinated plausible-sounding nonsense instead.
Insight 2: Data Sensitivity & Noise Robustness Scale with Capacity
Noise Intolerance (0.5B): With limited parameter space, the 0.5B model was highly sensitive to data artifacts. Because ~5% of the training data resembled news reports (due to IndicCorp sourcing), the model overfitted to this pattern and became a "News Bot" that generated timestamps and reporter syntax.
Noise Filtering (7B): Despite training on identical noisy data, the 7B model demonstrated robustness by largely filtering out the reporter style. Its massive pre-trained signal for instruction-following outweighed the noise in the fine-tuning set. Hallucinations appeared ("currently present in Belagavi") but were minor stylistic add-ons rather than complete mode collapse.
Insight 3: Capacity Gates Semantic Abstraction
The difference between 0.5B and 7B is not just "more memory." It's a qualitative difference in abstraction capability:
- 0.5B: Operates at the syntactic level—learning surface patterns and correlations
- 7B: Operates at the semantic level—understanding intent, context, and factual grounding
This aligns with findings in mechanistic interpretability: scaling enables the emergence of abstraction capabilities that don't exist at smaller scales, even with the same training data.
Key Takeaway from Indic-V1: Model capacity acts as a hard threshold for semantic instruction following in multilingual contexts. A 14x parameter increase (0.5B → 7B) transforms a model from a syntactic mimic to a semantic reasoner. While the 0.5B model offers speed and efficiency, it sacrifices the representational density required for robust instruction-following behavior. This has direct implications for edge deployment: models must exceed a minimum capacity threshold to perform reliably on open-ended tasks, regardless of fine-tuning data quality.
Experiment Gemma 3 1B: On-Device Fine-Tuning & Inference on NVIDIA Jetson Nano
Experiment Motivation
While the Qwen 7B experiments demonstrated scalability on high-end hardware, the fundamental question remains: Can we fine-tune and deploy instruction-following models on ultra-constrained edge hardware? This experiment tests the extreme endpoint of model compression and edge deployment feasibility.
Hardware Constraints
- Device: NVIDIA Jetson Nano Developer Kit
- RAM: 4GB (system + GPU shared memory)
- Storage: Limited internal storage; offloading to USB storage
- Architecture: ARM-based (mobile/edge-optimized instruction set)
- Compute: 128 NVIDIA CUDA cores (vs. thousands in A100/H100)
Model & Optimization Strategy
Model Choice: Google Gemma 3 1B (1 billion parameters, pre-trained on diverse multilingual data)
Optimization Pipeline:
- Base Quantization: 8-bit (post-training quantization via qint8)
- LoRA Adaptation: Rank 8, Alpha 16 (minimal memory footprint)
- Batch Size: 1 (memory constraint requires sequential processing)
- Gradient Accumulation: Used to simulate larger batch sizes without memory overhead
- Max Sequence Length: 256 tokens (reduced from standard 512 due to 4GB RAM)
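Gradient accumulation is what makes batch size 1 workable here: gradients from several micro-batches are averaged before a single optimizer step, simulating a larger batch without the memory cost. A toy scalar sketch (the accumulation count of 8 is assumed for illustration; it isn't reported above):

```python
# Gradient accumulation sketch: average gradients over ACCUM_STEPS
# micro-batches of size 1, then take one optimizer step, giving the
# statistics of a larger batch with the memory of batch size 1.

ACCUM_STEPS = 8   # assumed effective batch size (illustrative)
lr = 0.1

w = 0.0
grad_buffer = 0.0
steps_taken = 0

samples = [1.0, 2.0, 0.5, 1.5, 1.0, 2.5, 0.5, 1.0]  # toy per-sample "gradients"

for i, g in enumerate(samples, start=1):
    grad_buffer += g / ACCUM_STEPS      # accumulate the averaged gradient
    if i % ACCUM_STEPS == 0:
        w -= lr * grad_buffer           # one optimizer step per N samples
        grad_buffer = 0.0
        steps_taken += 1

print(w, steps_taken)  # a single update built from 8 micro-batches
```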
Fine-Tuning Results
Training Performance:
- Training Duration: ~4-6 hours per epoch on 2K instruction pairs
- Throughput: ~0.3-0.5 samples/second (extremely slow but viable)
- Memory Peak: 3.8GB / 4GB (nearly saturated but stable)
- Convergence: Loss plateaued around 1.2 (higher than desktop training due to reduced capacity)
Key Achievement: Successfully trained a 1B parameter model on 4GB RAM without GPU Out-of-Memory errors. This validates that edge devices can perform model adaptation in real-time or at deployment time, enabling personalization without cloud dependencies.
Inference on Jetson Nano
Inference Configuration:
- Memory Usage: ~2.2GB (model weights + activations)
- Inference Latency: Single token: ~200-300ms; Full response (64 tokens): ~15-20 seconds
- Temperature: 0.5 (balancing coherence and diversity)
- Max New Tokens: 64 (practical limit for interactive use)
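A quick arithmetic check, using only the figures reported above, that the per-token latency, end-to-end latency, and throughput numbers are mutually consistent:

```python
# Consistency check on the Jetson Nano inference figures: 200-300 ms per
# token over 64 new tokens implies roughly the reported 15-20 second
# responses and the ~3-5 tokens/second throughput.

per_token_ms = (200, 300)
max_new_tokens = 64

latency_s = tuple(ms * max_new_tokens / 1000 for ms in per_token_ms)
tok_per_s = tuple(1000 / ms for ms in per_token_ms)

print(latency_s)   # (12.8, 19.2) seconds end to end
print(tok_per_s)   # (5.0, ~3.3) tokens/second
```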
Sample Inference Results:
Query 1: English Instruction
User: "Explain machine learning in simple terms."
Gemma 3 1B (Jetson Nano): "Machine learning is a type of artificial intelligence where computers learn patterns from data instead of being explicitly programmed. The model improves its performance as it sees more examples, similar to how humans learn through experience."
✓ Coherent, concise explanation; appropriate for edge device
Query 2: Multilingual Query (Hindi)
User: "भारत में सबसे बड़ा व्यावसायिक शहर कौन सा है?" (Which is the largest commercial city in India?)
Gemma 3 1B (Jetson Nano): "मुंबई भारत में सबसे बड़ा व्यावसायिक शहर है। यह देश का वित्तीय केंद्र भी माना जाता है।"
✓ Correct multilingual response; maintained context from fine-tuning
Performance Benchmarking
| Metric | Value | Notes |
|---|---|---|
| Model Size | ~2.4GB (8-bit quantized) | Reduced from 4GB full precision |
| Tokens/Second | ~3-5 tok/s | Suitable for interactive assistants |
| End-to-End Latency (64 tok) | 15-20 seconds | Acceptable for resource-constrained scenarios |
| Power Consumption | ~5-7W (inference) | Battery-efficient for mobile/embedded use |
| Fine-Tuning Duration | 4-6 hours/epoch | On-device adaptation possible |
Jetson Nano Deployment: What Worked
Edge Deployment Feasibility: A 1B parameter model can be fine-tuned, deployed, and run inference on 4GB Jetson Nano hardware using 8-bit quantization and LoRA. Inference latency (15-20 sec for 64 tokens) is acceptable for non-real-time applications (chatbots, document summarization, local RAG systems).
Practical Takeaway: Organizations can now deploy personalized LLMs on edge devices without cloud dependency, enabling privacy-preserving local inference for robotics, IoT, and embedded AI systems.
Key Learnings & Observations
Indic-V1 (Qwen): Model capacity is a hard boundary for semantic grounding. A 14x parameter increase (0.5B → 7B) transformed the model from a "Parrot" (syntactic mimicry) to a "Student" (semantic understanding). Teacher-student distillation successfully generated high-quality training data from raw corpora, enabling supervised fine-tuning on low-resource languages.
Gemma 3 on Jetson Nano: Sub-2GB quantized models with LoRA can be fine-tuned and deployed on 4GB edge hardware. Latency is acceptable for non-real-time applications, enabling privacy-preserving personalization on edge devices without cloud inference.
Ongoing Exploration: The trade-off between model capacity, quantization, and task-specific fine-tuning defines the deployment frontier. Through continued experimentation, the goal is to map out what's actually feasible across the parameter range (0.5B → 7B) and hardware spectrum (edge → data center), while identifying where semantic competence breaks down.
Reference Papers & Learning Resources
Foundational papers and resources informing these experiments:
- Attention Is All You Need (Vaswani et al., 2017): Foundation for all transformer-based LLMs; critical for understanding attention mechanisms in fine-tuning.
- Emergent Abilities of Large Language Models (Wei et al., 2022): Demonstrates emergent capabilities at scale; motivates the parameter-scaling hypothesis tested in Indic-V1.
- Training Compute-Optimal Large Language Models (Hoffmann et al., 2022): Optimal compute budget allocation; informs architectural trade-off decisions in capacity-constrained scenarios.
- QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023): Core technique used in both the Qwen 7B (Indic-V1) and Gemma 3 1B (Jetson Nano) fine-tuning experiments; enables training on constrained hardware.
- ALBERT: A Lite BERT for Self-supervised Learning of Language Representations (Lan et al., 2019): Parameter sharing and factorization techniques; informs lightweight architecture design for edge deployment.
- Distilling the Knowledge in a Neural Network (Hinton et al., 2015): Foundation for teacher-student distillation; directly applied in the Indic-V1 data curation pipeline (Qwen 72B → student data).
- LoRA: Low-Rank Adaptation of Large Language Models (Hu et al., 2021): Parameter-efficient fine-tuning enabling both the Indic-V1 and Jetson Nano experiments; core to all deployment scenarios in this research.
- Finetuned Language Models Are Zero-Shot Learners (Wei et al., 2021): Instruction fine-tuning methodology; informs the SFT strategy for both the Indic language and edge deployment experiments.
- Gemma 3 Technical Report (Gemma Team, Google DeepMind, 2025): Architecture and pre-training details for Gemma 3 1B; used as the foundation for the Jetson Nano deployment research.