HAERAE-HUB/KOREAN-WEBTEXT
General NLP, KO
HAERAE-HUB/KOREAN-WEBTEXT is a general NLP dataset in Korean (KO), created by HAERAE-HUB in 2024. It contains 1,284,879 records in Parquet format, which places it in the 1M<n<10M size category, and it has 476 downloads and 47 likes.
About HAERAE-HUB/KOREAN-WEBTEXT
KOREAN-WEBTEXT
KOREAN-WEBTEXT is a high-quality Korean language corpus consisting of 2.2 billion tokens. The data has been collected from the following sources:
cc100
oscar-corpus/OSCAR-2201
oscar-corpus/OSCAR-2109
oscar-corpus/OSCAR-2301
onto...
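As a rough illustration of how a corpus like this might be inspected without downloading all of it, the sketch below streams a few records with the Hugging Face `datasets` library. The split name `train` and the helper name `first_records` are assumptions made for this example and are not stated on this page; the record fields are not assumed either, so the helper returns each record as a plain dict.

```python
from itertools import islice

# Repository ID as given in the page title.
DATASET_ID = "HAERAE-HUB/KOREAN-WEBTEXT"


def first_records(n: int = 3, split: str = "train"):
    """Return the first n records as dicts, streaming instead of downloading.

    `split="train"` is an assumption; check the dataset card for the
    splits that actually exist.
    """
    # Imported lazily so this module loads even if `datasets` is not installed.
    from datasets import load_dataset

    # streaming=True yields records on demand rather than fetching every file.
    stream = load_dataset(DATASET_ID, split=split, streaming=True)
    return list(islice(stream, n))
```

Calling `first_records(3)` requires network access and the `datasets` package; printing `record.keys()` for one of the returned dicts is a simple way to discover the actual column names.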
Details
- Task: General NLP
- Language: KO
- Format: Parquet
- Rows / instances: 1,284,879
- Size: 1M<n<10M
- Creator: HAERAE-HUB
- Year: 2024
- Downloads: 476
- Likes: 47
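The figures above can be cross-checked with a little arithmetic. The sketch below maps the row count to a size bucket and estimates the average number of tokens per record. It assumes that the "2.2 billion tokens" figure is rounded and covers the same 1,284,879 records; the function name `size_category` and its boundary handling are illustrative choices, not an official definition.

```python
ROWS = 1_284_879          # "Rows / instances" from the details list
TOKENS = 2_200_000_000    # "2.2 billion tokens", a rounded figure


def size_category(n: int) -> str:
    """Map a row count to a bucket label such as '1M<n<10M'.

    Boundary handling (lower bound inclusive) is an assumption of this sketch.
    """
    bounds = [
        (1_000, "1K"),
        (10_000, "10K"),
        (100_000, "100K"),
        (1_000_000, "1M"),
        (10_000_000, "10M"),
        (100_000_000, "100M"),
        (1_000_000_000, "1B"),
    ]
    if n < bounds[0][0]:
        return "n<1K"
    # Walk adjacent pairs of bounds and return the bucket that contains n.
    for (low, low_label), (high, high_label) in zip(bounds, bounds[1:]):
        if low <= n < high:
            return f"{low_label}<n<{high_label}"
    return "n>1B"


# Approximate only, because the token total is rounded.
avg_tokens_per_record = TOKENS / ROWS

print(size_category(ROWS))
print(round(avg_tokens_per_record))
```

Under these assumptions the row count falls in the same bucket the listing reports, and each record averages on the order of 1,700 tokens, which suggests document-length rather than sentence-length entries.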