catherinearnett/montok
General NLP | AFR, ALS, AMH | apache-2.0
catherinearnett/montok is a general NLP dataset covering AFR, ALS, and AMH, distributed in Parquet format under the apache-2.0 license. It has been downloaded 10.4K times.
About catherinearnett/montok
MonTok: A Suite of Monolingual Tokenizers
This is a set of monolingual tokenizers for 98 languages. For each language, there are Unigram, BPE, and SuperBPE tokenizers, ranging in vocabulary size from around 6k to over 200k.
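As a rough illustration of how a BPE vocabulary grows with the number of merges, here is a toy pure-Python sketch. It is not the training procedure used for MonTok, which is not described here; `train_bpe` and `vocab_size` are hypothetical helper names introduced only for this example, and the input is a made-up word list.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy version)."""
    # Represent each word as a tuple of symbols, starting from single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; the pair itself breaks ties deterministically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Apply the chosen merge to every word in the corpus.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges


def vocab_size(words, merges):
    """Base characters plus one new symbol per distinct merged string."""
    base = {c for w in words for c in w}
    merged = {a + b for a, b in merges}
    return len(base) + len(merged)


if __name__ == "__main__":
    toy_words = ["aa", "aa", "ab"]  # made-up data for illustration
    learned = train_bpe(toy_words, num_merges=5)
    print(learned)
    print(vocab_size(toy_words, learned))
```

In this sketch the vocabulary size is controlled by the merge budget: each merge adds at most one symbol, so a larger budget yields a larger vocabulary until no pairs remain to merge.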
Training Det...
Details
- Task: General NLP
- Language: AFR, ALS, AMH
- Format: Parquet
- Rows / instances: N/A
- Creator: catherinearnett
- Year: 2025
- License: apache-2.0
- Downloads: 10398
- Likes: 4