Languages Of Prayer
It was raised with me that I appeared to be focused entirely on the English language, and that any such view lacks kindness. That was never the intent, but it became clear to me that I had not been clear enough about the intent to ensure the opposite.
There are already various resources that illustrate the multilingual intent, but it seems this has not been clear enough. As such, I need to work on it as a matter of great importance.
This is the research notes document.
Polyglot: Large Language Models of Well-balanced Competence in Multi-languages
LINK: https://github.com/EleutherAI/polyglot
Polyglot-Ko
When we started our research, we already had 1.2 TB of Korean data collected by TUNiB. Before collecting a large amount of multilingual data, we decided to try Korean modeling with the dataset we already had. This Korean model can be used for performance comparison with the multilingual models, and the model itself would help many Korean companies and researchers.
Polyglot-East-Asian
We chose East Asian languages for our first multilingual dataset. This model includes Korean, Chinese, Japanese, Indonesian, Malay, Vietnamese, Thai, and English. We will train the model by collecting at least hundreds of billions of tokens of data from each language and balancing them. Some people may wonder why English is included on this list, but because English is now a global language, we believe it could synergize with any other language in the world.
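Note to self: the passage above does not say how the languages are balanced. One common approach in multilingual training, and purely an assumption here, is exponent-smoothed sampling: raise each language's share of the corpus to a power alpha between 0 and 1, then renormalize, so that smaller languages are sampled more often than their raw share would allow. A minimal sketch follows; the function name `balanced_sampling_probs`, the value of `alpha`, and the token counts are all hypothetical and are not taken from the Polyglot project.

```python
def balanced_sampling_probs(token_counts, alpha=0.5):
    """Return per-language sampling probabilities.

    alpha=1.0 keeps the raw corpus proportions;
    alpha=0.0 gives every language the same probability.
    """
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    if any(count <= 0 for count in token_counts.values()):
        raise ValueError("every token count must be positive")

    total = sum(token_counts.values())
    # Smooth each language's raw share, then renormalize so the result sums to 1.
    weights = {lang: (count / total) ** alpha for lang, count in token_counts.items()}
    norm = sum(weights.values())
    return {lang: weight / norm for lang, weight in weights.items()}


# Hypothetical token counts in billions, for illustration only.
example_counts = {"ko": 300, "zh": 900, "ja": 600, "th": 100, "en": 2000}
example_probs = balanced_sampling_probs(example_counts, alpha=0.3)
```

With a small alpha, a language such as Thai in this made-up example receives a larger sampling probability than its raw share of tokens, while English receives a smaller one.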
Polyglot-Romance
We also plan on testing several multilingual hypotheses on both typologically related and unrelated languages of the Romance family. The final model for the Romance languages will include billions of tokens for Spanish, French, Italian, Portuguese, and Romanian. English is also included for the same reasons stated above.