Japanese Datasets
We catalog 11 Japanese datasets for NLP and machine learning, including 1 benchmark. Browse the list below or narrow it down by task.
This page covers Japanese, a high-resource East Asian language with dedicated NLP tooling.
Updated June 2026
- CC100-Japanese: Text Corpora (Japanese)
- llm-jp/jhle: General NLP (JA)
- nyuuzyou/suno: Audio Classification, Text To Audio (EN, JA, MULTILINGUAL)
- izumi-lab/llm-japanese-dataset: General NLP (JA)
- deepghs/danbooru2024: Image Classification, Zero Shot Image Classification, Text To Image (EN, JA)
- joujiboi/japanese-anime-speech: Automatic Speech Recognition (JA)
- DeliberatorArchiver/asmr-archive-data-02: General NLP (JA) [Benchmark]
- JosephusCheung/GuanacoDataset: Text Generation, Question Answering (ZH, EN, JA)
- kunishou/databricks-dolly-15k-ja: General NLP (JA)
- KBlueLeaf/danbooru2023-metadata-database: Image Classification, Text To Image, Image To Text, Image To Image, Text Retrieval, Text Generation, Text Classification (EN, JA)
- NilanE/ParallelFiction-Ja_En-100k: Translation (JA, EN)
What tasks do Japanese datasets cover?
General NLP (4), Image Classification (2), Text To Image (2), Text Generation (2), Text Corpora (1), Audio Classification (1), Text To Audio (1), Zero Shot Image Classification (1), Automatic Speech Recognition (1), Question Answering (1), Image To Text (1), Image To Image (1), Text Retrieval (1), Text Classification (1), Translation (1)