Omar Kamali

Independent AI Researcher & Builder

I build AI for the languages the industry ignores.

I grew up in Morocco, speaking Darija in a world where every piece of technology spoke back in someone else's language. That experience is the origin of everything I build.

In 2023, I built Sawalni, the first conversational AI for Moroccan Darija, supporting both Arabic and Latin script. Corpus, pipeline, model: all solved from scratch, with no prior art to lean on. Thousands of users. Presented at international conferences on Moroccan Arabic linguistics. Featured on Moroccan national television.

I run Omneity Labs, a private R&D lab focused on low-resource language AI. Current research: training base language models for underrepresented language families, and building the data pipelines, tokenization methods, and evaluation infrastructure that don't yet exist for these languages. I've published peer-reviewed work on multilingual phonetics, and I maintain open-source datasets and models used by researchers working on similar problems.

I also lead GenAI engineering at Blue Yonder, where I build and deploy LLM systems at enterprise scale. That work informs how I think about training infrastructure and production deployment, not just research prototypes.

My first line of code was BASIC on an Amstrad CPC at age six. I built my first website at nine (it's still online). My first serious research question: why can't I talk to a computer in the language I actually think in? I'm still working on the answer.

I'm interested in collaborating with researchers, communities, and organizations working on language equity, multilingual AI, and open-source NLP. If that's you, I'd like to hear from you.

Email · X · HuggingFace · LinkedIn

Work

Sawalni

The first conversational AI for Moroccan Darija & Amazigh. Arabic, Latin, and Tifinagh script. Built from scratch.

sawalni.com

WikiLLM

In development

Open base models for low-resource language families, trained on Wikipedia data.

Coming 2026

wikilangs.org

NLP models derived from 340+ Wikipedia language editions to bootstrap LLM development.

HuggingFace

Open source

Tools, datasets, and infrastructure for multilingual NLP: tokenizers, embeddings, training frameworks, and data pipelines.

github.com/omarkamali

Recent writing

Beyond Tokenization: The Four Taxes and the Path Forward

The compounding tax stack low-resource languages carry, why vision encoders might hold the key, and the open research questions.

Mar 2026

The Hidden Tax Your LLM Pays for Bad Tokenization

How bad tokenization forces language models to waste capacity on reconstruction instead of reasoning.

Mar 2026

Tokenization is Killing Our Multilingual LLM Dream

Why tokenization is the hidden bottleneck blocking truly multilingual AI — lessons from building Sawalni and Wikilangs.

Mar 2026

Why I stopped trusting the official Wikipedia dataset, and what I did about it

It all started with a DM from a friend, a member of and contributor to the Moroccan Wikipedia community. "Are you using the current version of Wikipedia? The official dataset is severely outdated. We added so many cool articles nowhere on huggingface." He was right. I was running a 2023 snapshot in 2025.

Mar 2026

A Wordle for the Worldle

I built a word game for more than 300 languages, each drawing on its own Wikipedia as the source. Here's the thing nobody tells you: building a simple word game for most of these languages meant building things that didn't exist.

Mar 2026

All posts

Research

Sawtone: A universal framework for phonetic similarity and alignment across languages and scripts

Lingua Posnaniensis, Vol. 67, Issue 1, pp. 165-200, 2025

Paper

Moroccan Darija and Generative AI

7th International Congress for Moroccan Arabic · University of Navarra, Spain, 2024

TIM'24 Presentation

TIM'24 Conference · University Hassan II, Morocco, 2024

All research

Interested in collaboration?

Get in touch

Or find me on X · HuggingFace · LinkedIn