Azerbaijani LLM on SageMaker AI Delivers Gains
By Alexander Cole
Six weeks, a 23 percent throughput bump, and a 58 percent cut in peak GPU memory redefine Azerbaijani LLM training.
Azercell Telecom LLC, Azerbaijan’s leading telecommunications provider, teamed up with AWS to push the boundaries of language models for a morphologically rich, low-resource language. The goal was clear: adapt a foundation model to Azerbaijani for telecom use cases and a customer-facing chatbot, despite having no established blueprint for efficient training in a language with limited data. The collaboration with the AWS Generative AI Innovation Center culminated in a production-ready framework on Amazon SageMaker AI that delivers measurable engineering gains and offers a path for similar languages.
The project leans on open-source tooling, including PyTorch, Hugging Face Transformers, and Liger Kernel, to form a three-stage pipeline in which each stage produces artifacts that feed the next. The team reports that kernel-level optimizations on an ml.p5.48xlarge instance yielded 23 percent higher training throughput and a 58 percent lower peak GPU memory footprint, enabling faster iterations and larger experiments within the same hardware envelope. Even more striking is the tokenizer result: a custom monolingual tokenizer delivered a 2x improvement in tokens per word, meaning roughly half as many tokens for the same text, which effectively doubles the amount of Azerbaijani text that fits within the model’s context window.
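To make the reported figures concrete, here is a minimal back-of-the-envelope sketch. The 23 percent throughput gain and the 2x tokens-per-word improvement come from the team's report; the context window size and the baseline tokens-per-word value are hypothetical numbers chosen only for illustration.

```python
def words_that_fit(context_tokens: int, tokens_per_word: float) -> int:
    """Approximate number of words that fit in a context window."""
    return int(context_tokens / tokens_per_word)


def relative_wall_clock(throughput_gain: float) -> float:
    """Fraction of the original training time needed to process the same tokens."""
    return 1.0 / (1.0 + throughput_gain)


CONTEXT_TOKENS = 8192            # hypothetical context window, not from the article
BASELINE_TPW = 4.0               # hypothetical baseline tokens per word
CUSTOM_TPW = BASELINE_TPW / 2    # the reported 2x improvement halves tokens per word

baseline_words = words_that_fit(CONTEXT_TOKENS, BASELINE_TPW)
custom_words = words_that_fit(CONTEXT_TOKENS, CUSTOM_TPW)
time_fraction = relative_wall_clock(0.23)  # reported 23 percent throughput gain

print(f"words in context: {baseline_words} -> {custom_words}")
print(f"training time for the same tokens: {time_fraction:.1%} of baseline")
```

Note that a 23 percent throughput gain shortens a fixed-size training run by about 19 percent, not 23 percent, because time scales with the inverse of throughput.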
Stage 1 centers on tokenizer development, a critical step for morphologically rich Azerbaijani. The team evaluated three approaches (baseline English-optimized tokenizers, vocabulary extension, and custom monolingual tokenizers) and measured encoding efficiency with standardized metrics. The results, the team reports, favored the custom monolingual tokenizer, which encoded Azerbaijani more efficiently and set the stage for the subsequent training stages. Because each of the framework’s three sequential stages produces artifacts that feed the next, the process is practical and repeatable for low-resource languages that otherwise lack a blueprint for LLM training.
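The tokens-per-word metric mentioned in the report can be sketched as follows. This is an illustration, not the team's evaluation code: the two tokenizers are toy stand-ins (fixed two-character chunks versus greedy matching against a tiny hand-written Azerbaijani morpheme vocabulary), and the sample corpus is invented for the example.

```python
from typing import Callable, Iterable, List, Set

Tokenizer = Callable[[str], List[str]]


def tokens_per_word(tokenize: Tokenizer, texts: Iterable[str]) -> float:
    """Average number of tokens produced per whitespace-delimited word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("corpus contains no words")
    return total_tokens / total_words


def char_pair_tokenizer(text: str) -> List[str]:
    """Toy stand-in for a poorly fitted tokenizer: fixed 2-character chunks."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + 2] for i in range(0, len(word), 2))
    return pieces


def make_greedy_tokenizer(vocab: Set[str]) -> Tokenizer:
    """Toy stand-in for a language-specific tokenizer: greedy longest match."""
    max_len = max(len(piece) for piece in vocab)

    def tokenize(text: str) -> List[str]:
        tokens: List[str] = []
        for word in text.split():
            i = 0
            while i < len(word):
                # Try the longest candidate first; fall back to one character.
                for size in range(min(max_len, len(word) - i), 0, -1):
                    piece = word[i:i + size]
                    if piece in vocab or size == 1:
                        tokens.append(piece)
                        i += size
                        break
        return tokens

    return tokenize


# Hypothetical mini-corpus: "evlərdə" (in the houses), "kitablar" (books).
corpus = ["evlərdə kitablar"]
morpheme_tokenizer = make_greedy_tokenizer({"ev", "lər", "də", "kitab", "lar"})

baseline_score = tokens_per_word(char_pair_tokenizer, corpus)
custom_score = tokens_per_word(morpheme_tokenizer, corpus)
print(f"baseline: {baseline_score:.2f} tokens/word, custom: {custom_score:.2f} tokens/word")
```

Because `tokens_per_word` accepts any callable that maps text to a list of tokens, the same function could wrap a real tokenizer's tokenize method to compare candidates on a held-out corpus; lower values mean more text fits in a fixed context window.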
This is not just a story about speed. It is a story about feasibility. By combining domain-specific tokenizer design with kernel-level optimizations and a production-oriented framework, Azercell and AWS show that LLM training for a morphologically complex language can scale in a real-world setting. The improvements do not merely shave milliseconds off a benchmark; they expand how much text the model can learn from and reason over within a given context window. The company’s use case, focused on telecom workflows and a customer-facing chatbot, stands to benefit from faster iteration, more responsive dialogue capabilities, and better language handling in everyday customer interactions.
From an engineering perspective, the work offers three takeaways for practitioners working with languages that are underrepresented in AI. First, tokenizer design can dominate the gains in low-resource languages, where data is scarce and morphology is rich. Second, architecture and kernel-level tuning can unlock meaningful throughput and memory savings even when training on a single high-end instance. Third, a production-ready pipeline that produces concrete artifacts at each stage can help teams replicate success across similar languages and domains rather than reinventing the wheel.
For teams contemplating similar efforts, the key constraints to watch include data availability, the need for language-specific tokenization, and the balance between memory usage and throughput on a given hardware profile. The Azercell and AWS effort demonstrates that targeted engineering choices can yield outsized benefits, taking LLM training for a challenging, morphologically rich language to a production-ready state within a six-week window.
- Training Azerbaijani language models on Amazon SageMaker AI. AWS Machine Learning / Primary / Published May 28, 2026 / Accessed May 29, 2026
- Evaluating Deep Agents using LangSmith on AWS. AWS Machine Learning / Primary / Published May 28, 2026 / Accessed May 29, 2026