occiglot/tokenizer-wiki-bench

General NLP | AF, AR, BG

occiglot/tokenizer-wiki-bench is a general NLP dataset in languages including AF, AR, and BG, distributed in Parquet format.

About occiglot/tokenizer-wiki-bench

Multilingual Tokenizer Benchmark

This dataset includes pre-processed Wikipedia data for tokenizer evaluation in 45 languages. More information on the evaluation task in general is available in this blogpost.

Usage

The dataset allows us to...
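
As a rough illustration of what tokenizer evaluation on word-level text can look like, the sketch below computes fertility, the average number of tokens produced per word. Fertility is one commonly used tokenizer metric and is assumed here for illustration; this page does not say which metrics the benchmark uses. The names fertility and toy_tokenize are hypothetical, and the toy tokenizer is a stand-in for a real one, not part of the dataset.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    n_words = 0
    n_tokens = 0
    for text in texts:
        for word in text.split():
            n_words += 1
            n_tokens += len(tokenize(word))
    if n_words == 0:
        raise ValueError("no words to evaluate")
    return n_tokens / n_words


def toy_tokenize(word: str, max_len: int = 4) -> List[str]:
    # Hypothetical stand-in tokenizer (an assumption, not a real library):
    # cut each word into chunks of at most max_len characters.
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]


# Made-up sample sentences, not taken from the dataset.
sample = ["tokenizers split words", "into pieces"]
score = fertility(sample, toy_tokenize)
print(f"fertility: {score:.2f}")
```

A lower fertility means the tokenizer splits words into fewer pieces. To evaluate a real tokenizer, pass its word-to-tokens function in place of toy_tokenize.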

Details

Task: General NLP
Language: AF, AR, BG
Format: Parquet
Rows / instances: N/A
Creator: occiglot
Year: 2024