Omar Kamali
Independent AI Researcher & Builder
I build AI for the languages the industry ignores.
I grew up in Morocco, speaking Darija in a world where every piece of technology spoke back in someone else's language. That experience is the origin of everything I build.
In 2023, I built Sawalni, the first conversational AI for Moroccan Darija, supporting both Arabic and Latin script. Corpus, pipeline, model: all solved from scratch, with no prior art to lean on. Thousands of users. Presented at international conferences on Moroccan Arabic linguistics. Featured on Moroccan national television.
I run Omneity Labs, a private R&D lab focused on low-resource language AI. Current research: training base language models for underrepresented language families. The data pipelines, tokenization methods, and evaluation infrastructure that don't yet exist for these languages. I've published peer-reviewed work on multilingual phonetics, and I maintain open-source datasets and models used by researchers working on similar problems.
I also lead GenAI engineering at Blue Yonder, where I build and deploy LLM systems at enterprise scale. That work informs how I think about training infrastructure and production deployment, not just research prototypes.
My first line of code was BASIC on an Amstrad CPC at age six. I built my first website at nine (it's still online). My first serious research question: why can't I talk to a computer in the language I actually think in? I'm still working on the answer.
I'm interested in collaborating with researchers, communities, and organizations working on language equity, multilingual AI, and open-source NLP. If that's you, I'd like to hear from you.
Work
Sawalni
The first conversational AI for Moroccan Darija & Amazigh. Arabic, Latin, and Tifinagh script. Built from scratch.
sawalni.com
WikiLLM
In development
Open base models for low-resource language families, trained on Wikipedia data.
Coming 2026
wikilangs.org
NLP models derived from 340+ Wikipedia language editions to bootstrap LLM development.
HuggingFace
Open source
Tools, datasets, and infrastructure for multilingual NLP: tokenizers, embeddings, training frameworks, and data pipelines.
github.com/omarkamali
Recent writing
Why I stopped trusting the official Wikipedia dataset, and what I did about it
It all started with a DM from a friend, a member of and contributor to the Moroccan Wikipedia community. "Are you using the current version of Wikipedia? The official dataset is severely outdated. We added so many cool articles nowhere on huggingface." He was right. I was running a 2023 snapshot in 2025.
A Wordle for the Worldle
I built a word game for more than 300 languages, each drawing on its own Wikipedia as the source. Here's the thing nobody tells you: building a simple word game for most of these languages meant building things that didn't exist.
Picomon 0.2.0: From AMD Crash Fix to GPU Monitoring That Doesn’t Suck
Earlier this month, I whipped up a Python script with an LLM that parsed amd-smi output. It was ugly. It worked. I called it picomon.
Introducing Wikipedia Monthly: Fresh, Clean Wikipedia Dumps for NLP & AI Research
Announcing Wikipedia Monthly, an always-fresh dataset to support research for low-resource languages.
Getting Perfectly Structured Data from LLMs
If you've ever struggled to get consistent JSON output from large language models, I have a simple and clever solution for you.
Research
Sawtone: A universal framework for phonetic similarity and alignment across languages and scripts
Lingua Posnaniensis, Vol. 67, Issue 1, pp. 165-200, 2025
Moroccan Darija and Generative AI
7th International Congress for Moroccan Arabic · University of Navarra, Spain, 2024
TIM'24 Presentation
TIM'24 Conference · University Hassan II, Morocco, 2024