This plan evaluates and improves the performance of the MahaVani Large Language Model (LLM) through five critical metrics. The first, "Number of Parameters", captures the trade-off between capability and cost: a 3B-parameter model is cheaper to train and serve than a 7B one, but may fall short on harder tasks. Tracking it keeps model size a deliberate choice, with room for optimised resource management and scaling.
Another essential metric, "Dataset Composition", examines how data sources such as web text, books, code, dialogue corpora, Indian regional languages, and multilingual content are represented in training data. Because each source contributes a different share, balancing these proportions and refreshing the data periodically leads to higher-quality output across diverse scenarios. Similarly, "Perplexity on Validation Datasets" measures how well the model predicts held-out text, where lower values indicate more accurate, better-calibrated results.
Inference Speed matters for practical deployment: it tracks tokens processed per second on GPUs, CPUs, and mobile devices against set benchmarks. Finally, "Edge-device Compatibility" tests whether the model delivers fast, high-quality responses on hardware with limited resources, ensuring a seamless user experience even in low-resource settings.
Top 5 metrics for AI Model Performance Evaluation
1. Number of Parameters
Counts the model's trainable weights and differentiates size options such as 1 billion (1B), 3B, 7B, or 14B parameters (a counting sketch follows this list)
What good looks like for this metric: 3B parameters is standard
How to improve this metric:
- Evaluate the scalability and resource constraints of the model
- Optimise parameter tuning
- Conduct comparative analysis for various model sizes
- Assess trade-offs between size and performance
- Leverage model size for specific tasks
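To make the count concrete, here is a minimal Python sketch that tallies trainable parameters, assuming a PyTorch model; the single `TransformerEncoderLayer` is just a toy stand-in for MahaVani, not its actual architecture.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total trainable parameters in a PyTorch model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Toy stand-in for MahaVani: a single transformer encoder layer.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
n = count_parameters(layer)
print(f"{n:,} parameters ({n / 1e9:.4f}B)")
```

The same one-liner applied to a full model gives the headline 1B/3B/7B figure used to compare size options.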
2. Dataset Composition
Percentage representation of data sources: web data, books, code, dialogue corpora, Indian regional languages, and multilingual content (see the composition check after this list)
What good looks like for this metric: Typical dataset: 60% web data, 15% books, 5% code, 10% dialogue, 5% Indian languages, 5% multilingual
How to improve this metric:
- Increase regional and language-specific content
- Ensure balanced dataset for diverse evaluation
- Perform periodic updates to dataset
- Utilise high-quality, curated sources
- Diversify datasets with varying domains
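A quick way to audit composition is to compute each source's share of total tokens and compare it against the target mix. The token counts below are hypothetical, chosen only to illustrate the 60/15/5/10/5/5 split; real figures would come from your corpus statistics.

```python
# Hypothetical token counts per source; real figures come from corpus stats.
token_counts = {
    "web data": 600e9,
    "books": 150e9,
    "code": 50e9,
    "dialogue": 100e9,
    "Indian languages": 50e9,
    "multilingual": 50e9,
}

total = sum(token_counts.values())
for source, count in token_counts.items():
    print(f"{source:>16}: {count / total:6.1%}")
```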
3. Perplexity on Validation Datasets
Measures how well the model predicts held-out validation text, computed as the exponential of the average cross-entropy loss, so lower is better (see the sketch after this list)
What good looks like for this metric: Perplexity range: 10-20
How to improve this metric:
- Enhance tokenization methods
- Refine sequence-to-sequence layers
- Adopt better pre-training techniques
- Implement data augmentation
- Leverage transfer learning from similar tasks
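Since perplexity is just the exponential of the average cross-entropy over held-out tokens, it is straightforward to compute from model outputs. A minimal PyTorch sketch, with random logits standing in for real model outputs:

```python
import math

import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """exp(mean cross-entropy) over all predicted tokens.

    logits: (batch, seq_len, vocab_size); targets: (batch, seq_len) token ids.
    """
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return math.exp(loss.item())

# Random logits over a 32k vocabulary: an untrained model scores on the
# order of the vocabulary size; a well-trained one lands in the 10-20 range.
logits = torch.randn(2, 128, 32_000)
targets = torch.randint(0, 32_000, (2, 128))
print(f"perplexity: {perplexity(logits, targets):,.0f}")
```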
4. Inference Speed
Tokens processed per second on CPU, GPU, and mobile devices (a timing harness follows this list)
What good looks like for this metric: GPU: 10k tokens/sec, CPU: 1k tokens/sec, Mobile: 500 tokens/sec
How to improve this metric:
- Optimise algorithm efficiency
- Reduce model complexity
- Implement hardware-specific enhancements
- Utilise parallel processing
- Explore alternative deployment strategies
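A simple way to track this metric is a throughput harness: time repeated generations and divide tokens produced by seconds elapsed. The `generate` callable below is a hypothetical stand-in for whatever inference API the deployment uses; run the same harness on GPU, CPU, and mobile builds to compare against the benchmarks above.

```python
import time

def measure_throughput(generate, prompt: str, warmup: int = 1, runs: int = 5) -> float:
    """Average tokens/second for a generate(prompt) -> list-of-tokens callable."""
    for _ in range(warmup):
        generate(prompt)                      # warm caches before timing
    total_tokens, total_seconds = 0, 0.0
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        total_seconds += time.perf_counter() - start
        total_tokens += len(tokens)
    return total_tokens / total_seconds

# Dummy generator standing in for the real inference API:
dummy = lambda p: p.split() * 100
print(f"{measure_throughput(dummy, 'namaste world'):,.0f} tokens/sec")
```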
5. Edge-device Compatibility
Evaluates the model's ability to deliver low-latency, high-quality responses on edge devices with limited resources (see the quantisation sketch after this list)
What good looks like for this metric: Latency: <200 ms for response generation
How to improve this metric:
- Optimise for low-resource environments
- Develop compact model architectures
- Incorporate adaptive and scalable quality features
- Implement quantisation and compression techniques
- Perform real-world deployment tests
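As one example of quantisation in practice, the sketch below applies PyTorch's dynamic int8 quantization to a toy model and times a single forward pass against the <200 ms target. The architecture and input shape are placeholders, not MahaVani's; on real hardware you would average over many runs and measure end-to-end response generation.

```python
import time

import torch
import torch.nn as nn

# Toy stand-in model; the real target is the deployed MahaVani variant.
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512)).eval()

# Dynamic int8 quantisation of Linear layers: ~4x smaller weights, faster CPU math.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
with torch.no_grad():
    start = time.perf_counter()
    quantized(x)
    latency_ms = (time.perf_counter() - start) * 1000
print(f"latency: {latency_ms:.2f} ms (target: <200 ms)")
```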
How to track AI Model Performance Evaluation metrics
It's one thing to have a plan; it's another to stick to it. We hope the examples above help you get started with your own strategy, but we also know how easy it is to get lost in the day-to-day effort.
That's why we built Tability: to help you track your progress, keep your team aligned, and make sure you're always moving in the right direction.

Give it a try and see how it can help you bring accountability to your metrics.