occiglot/tokenizer-wiki-bench
General NLP | AF, AR, BG
occiglot/tokenizer-wiki-bench is a general NLP dataset covering AF, AR, and BG, among other languages, distributed in Parquet format.
About occiglot/tokenizer-wiki-bench
Multilingual Tokenizer Benchmark
This dataset includes pre-processed Wikipedia data for tokenizer evaluation in 45 languages. We provide more information on the evaluation task in general in this blog post.
Usage
The dataset allows us to...
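As a rough illustration of what tokenizer evaluation on Wikipedia text can look like, the sketch below computes fertility (average tokens per whitespace-separated word), a common tokenizer metric. It is not taken from the dataset card: the metric choice, the `fertility` helper, and the `toy_tokenize` stand-in tokenizer are all assumptions made for illustration, and the sample sentences are placeholders rather than rows from the dataset. Tokenizing word by word is also a simplification, since real subword tokenizers usually operate on whole strings.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    n_words = 0
    n_tokens = 0
    for text in texts:
        for word in text.split():
            n_words += 1
            n_tokens += len(tokenize(word))
    if n_words == 0:
        raise ValueError("no words found in input texts")
    return n_tokens / n_words


def toy_tokenize(word: str, max_len: int = 4) -> List[str]:
    """Hypothetical stand-in tokenizer: chunks of at most max_len characters."""
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]


# Placeholder text; in practice this would be the text column of one language split.
sample = ["tokenizers split words", "into pieces"]
score = fertility(sample, toy_tokenize)
print(f"fertility: {score:.2f}")
```

A lower fertility means the tokenizer needs fewer tokens per word for that language. To evaluate a real tokenizer, `toy_tokenize` would be replaced with a function that wraps that tokenizer's own tokenize call.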
Details
- Task: General NLP
- Language: AF, AR, BG
- Format: Parquet
- Rows / instances: N/A
- Creator: occiglot
- Year: 2024