
catherinearnett/montok

General NLP | AFR, ALS, AMH | apache-2.0

catherinearnett/montok is a general NLP dataset for AFR, ALS, and AMH, available in Parquet format. It is released under the apache-2.0 license and has been downloaded 10.4K times.

About catherinearnett/montok

MonTok: A Suite of Monolingual Tokenizers

This is a set of monolingual tokenizers for 98 languages. For each language, there are Unigram, BPE, and SuperBPE tokenizers, ranging in vocabulary size from around 6k to over 200k. Training Det...
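As a rough illustration of what "vocabulary size" means for a BPE tokenizer, the sketch below trains a toy BPE on a handful of words. It is not MonTok's training code: the function name `train_bpe`, the example words, and the merge count are all assumptions made for illustration. The point it shows is that each learned merge typically adds one token, so a larger vocabulary corresponds to more merges over the same base symbols.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Toy BPE trainer (illustrative only): returns (vocab, merges)."""
    # Represent each distinct word as a tuple of single-character symbols.
    corpus = Counter(tuple(w) for w in words)
    vocab = {ch for symbols in corpus for ch in symbols}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties go to the lexicographically largest pair.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merged = best[0] + best[1]
        # Rewrite every word, replacing occurrences of the chosen pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
        merges.append(best)
        vocab.add(merged)
    return vocab, merges


if __name__ == "__main__":
    # Hypothetical example words, not taken from any MonTok training data.
    vocab, merges = train_bpe(
        ["low", "low", "lower", "newest", "newest", "widest"], num_merges=3
    )
    print(len(vocab), merges)
```

Real tokenizers in this size range are trained on large corpora with dedicated libraries; the toy version only makes the relationship between merges and vocabulary size concrete.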

Details

Task: General NLP
Language: AFR, ALS, AMH
Format: Parquet
Rows / instances: N/A
Creator: catherinearnett
Year: 2025
License: apache-2.0
Downloads: 10398
Likes: 4