Why is GPT-3 expensive and slow for Non-English languages?
A quick look at GPT-3’s tokenization
How can a word like స్త్రీ (meaning "woman" in Telugu, an Indian language) add up to 18 tokens in OpenAI's GPT-3, when "woman" in English is just 1 token?
This also means it is 10–20x as expensive, and 10–20x as slow, to support a GPT-3 based app in Telugu, because GPT-3 has to generate 18 tokens for a single Telugu word in this case!
Most people don't understand OpenAI GPT-3's tokenization, or how expensive and inefficient it makes building an app in a non-English language, especially an Indian one.
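To see the gap concretely, here is a minimal sketch using OpenAI's open-source tiktoken library. It assumes the r50k_base encoding (the one used by the original GPT-3 davinci models); exact token ids vary by encoding, but the counts are the point:

```python
# A minimal sketch with OpenAI's tiktoken library, assuming the
# "r50k_base" encoding used by the original GPT-3 (davinci) models.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

for word in ["woman", "స్త్రీ"]:
    token_ids = enc.encode(word)
    print(f"{word!r} -> {len(token_ids)} tokens")

# Expected counts (counts, not ids, are the point):
# 'woman'  -> 1 token
# 'స్త్రీ' -> 18 tokens (one per UTF-8 byte)
```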
First of all, what is స్త్రీ?
It is not a character, not a word, but a sequence of "graphemes".
A grapheme is the smallest functional unit of a writing system! It is composed of code points.
Want to understand easily?
á (grapheme) = a + ´
The acute a (á) is the sum of 2 code points!
Similarly,
స్త్రీ = a sum of 6 code points.
And here is where the 18 tokens come from: in UTF-8, each of those Telugu code points takes 3 bytes, and GPT-3's byte-level BPE has learned almost no merges for Telugu text, so the 6 code points end up as 6 × 3 = 18 byte-level tokens.
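You can verify the code-point math yourself with a quick sketch using Python's standard unicodedata module (NFD normalization decomposes the precomposed á into its two code points):

```python
# A quick sketch with Python's standard unicodedata module: list every
# code point inside a string and count its UTF-8 bytes.
import unicodedata

def inspect(text):
    for ch in text:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
    print(f"-> {len(text)} code points, {len(text.encode('utf-8'))} UTF-8 bytes\n")

inspect(unicodedata.normalize("NFD", "á"))  # a + combining acute accent: 2 code points
inspect("స్త్రీ")                           # 6 code points, 18 UTF-8 bytes
```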