Why is GPT-3 expensive and slow for non-English languages?

A quick look at GPT-3’s tokenization

Ramsri Goutham
3 min read · Jan 17, 2023
Image by Author!

How can a word like స్త్రీ (meaning “woman” in Telugu, an Indian language) come to 18 tokens in OpenAI’s GPT-3, whereas “woman” in English is 1 token?

GPT-3 usage is billed per token, and text is generated one token at a time. So it can be 10–20x more expensive, as well as 10–20x slower, to support a GPT-3-based app in Telugu, because in this case GPT-3 needs to generate 18 tokens for a single Telugu word!
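To make the arithmetic concrete, here is a minimal sketch of how per-token billing scales with token count. The token counts (1 and 18) come from the example above; the price per 1,000 tokens is a made-up placeholder for illustration, not OpenAI's actual pricing.

```python
# Hypothetical price, for illustration only (not OpenAI's real pricing).
PRICE_PER_1K_TOKENS_USD = 0.02


def cost_usd(num_tokens: int, price_per_1k: float = PRICE_PER_1K_TOKENS_USD) -> float:
    """Cost of processing num_tokens under simple per-token billing."""
    return num_tokens / 1000 * price_per_1k


english_tokens = 1   # "woman", per the example above
telugu_tokens = 18   # the Telugu word, per the example above

# The price cancels out, so the ratio depends only on the token counts.
ratio = cost_usd(telugu_tokens) / cost_usd(english_tokens)
print(f"The Telugu word costs about {ratio:.0f}x as much as the English one")
```

Whatever the real price is, it cancels out of the ratio, so the extra cost for this one word is driven entirely by the token count.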

Most people don’t understand how OpenAI GPT-3’s tokenization works, or how expensive and inefficient it makes building an app in a non-English language, especially an Indian language.

Token difference between the word “woman” in English and Telugu

First of all, what is స్త్రీ?

It is not a single character, and not just a word, but a sequence of “graphemes”.

A grapheme is the smallest functional unit of a writing system! It is composed of one or more Unicode code points.

Want an easy example?
á (grapheme) = a + ´

Acute a (á) can be written as a sum of 2 code points: the letter a (U+0061) plus a combining acute accent (U+0301)!
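A quick way to see this for yourself is Python's standard `unicodedata` module. The sketch below starts from the single-code-point form of á (U+00E1) and uses NFD normalization to split it into its base letter and combining mark; the variable names are just for illustration.

```python
import unicodedata

precomposed = "\u00e1"  # á stored as a single code point
# NFD normalization decomposes it into base letter + combining mark.
decomposed = unicodedata.normalize("NFD", precomposed)

# Print each code point with its official Unicode name.
for ch in decomposed:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

a_code_points = [f"U+{ord(ch):04X}" for ch in decomposed]
```

Both forms look identical on screen, which is why counting "characters" by eye tells you little about what the computer actually stores.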

Similarly,

స్త్రీ = a sum of 6 code points.
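The same check works for the Telugu word. The sketch below lists its code points and also counts its UTF-8 bytes. GPT-3's tokenizer is a byte-level BPE that operates on UTF-8 bytes, so the byte count is a useful hint. That the 18 tokens correspond to one token per byte is an assumption this sketch does not verify, since it does not run the tokenizer itself.

```python
import unicodedata

# The Telugu word from the article, written as escapes so each code point is visible.
word = "\u0c38\u0c4d\u0c24\u0c4d\u0c30\u0c40"

# Print each code point with its official Unicode name.
for ch in word:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

num_code_points = len(word)                 # Python 3 strings count code points
num_utf8_bytes = len(word.encode("utf-8"))  # each Telugu code point takes 3 bytes
english_utf8_bytes = len("woman".encode("utf-8"))

print(num_code_points, "code points,", num_utf8_bytes, "UTF-8 bytes")
print(english_utf8_bytes, "UTF-8 bytes for 'woman'")
```

Six code points at three bytes each give 18 bytes, which lines up with the 18 tokens reported above, while “woman” is only 5 bytes and common enough in English text to be a single token.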
