shjwudp/chinese-c4

General NLP, ZH

The shjwudp/chinese-c4 dataset is a Chinese (ZH) general NLP resource published by shjwudp in 2022.

About shjwudp/chinese-c4

Introduction: Chinese-C4 is a cleaned Chinese web-text dataset built from Common Crawl. The dataset is 46.29 GB and has gone through multiple cleaning steps, including Chinese filtering, heuristic cleaning based on punctuation, line-based hashing f...
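
The cleaning steps named in the introduction can be illustrated with a minimal sketch. This is not the dataset author's pipeline: the thresholds, the punctuation set, and every function name below are hypothetical. The source description is cut off after "line-based hashing", so using the hashes to drop repeated lines is an assumption made here for illustration only.

```python
import hashlib
from typing import Set

# Hypothetical values chosen for illustration; the real pipeline's settings are not given.
MIN_CJK_RATIO = 0.5
TERMINAL_PUNCTUATION = "。！？…；”"


def cjk_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the CJK Unified Ideographs block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


def keep_line(line: str) -> bool:
    """Keep a line only if it is mostly Chinese and ends with sentence punctuation."""
    stripped = line.strip()
    return (
        bool(stripped)
        and cjk_ratio(stripped) >= MIN_CJK_RATIO
        and stripped[-1] in TERMINAL_PUNCTUATION
    )


def line_hash(line: str) -> str:
    """Stable SHA-1 hash of a whitespace-stripped line."""
    return hashlib.sha1(line.strip().encode("utf-8")).hexdigest()


def clean_document(text: str, seen_hashes: Set[str]) -> str:
    """Filter lines, then drop any line whose hash was already seen (assumed use)."""
    kept = []
    for line in text.splitlines():
        if not keep_line(line):
            continue
        h = line_hash(line)
        if h in seen_hashes:
            continue
        seen_hashes.add(h)
        kept.append(line.strip())
    return "\n".join(kept)
```

Passing the same `seen_hashes` set across documents would make the repeated-line check corpus-wide rather than per document; whether the original pipeline works that way is not stated in the source.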

Details

Task: General NLP
Language: ZH
Format: Parquet
Rows / instances: N/A
Creator: shjwudp
Year: 2022