shjwudp/chinese-c4
General NLP · ZH
The shjwudp/chinese-c4 dataset is a Chinese (ZH) general NLP resource published by shjwudp in 2022.
About shjwudp/chinese-c4
Introduction
Chinese-C4 is a clean Chinese internet dataset based on Common Crawl. The dataset is 46.29 GB and has been cleaned with multiple strategies, including Chinese filtering, heuristic cleaning based on punctuation, line-based hashing f...
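To make the first two cleaning strategies concrete, here is a minimal Python sketch of a Chinese-character filter followed by a punctuation-based line heuristic. It is not the dataset author's pipeline: the function names, the 0.5 ratio threshold, and the set of terminal punctuation marks are all assumptions chosen for illustration.

```python
import re

# Hypothetical threshold: minimum share of Chinese characters for a
# document to count as Chinese. Not taken from the dataset's pipeline.
MIN_CHINESE_RATIO = 0.5

# Characters in the CJK Unified Ideographs block.
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")

# Assumed set of marks that end a complete Chinese sentence.
TERMINAL_PUNCTUATION = ("。", "！", "？", "…", "”", "」")


def chinese_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return len(CJK_PATTERN.findall(text)) / len(chars)


def keep_punctuated_lines(text: str) -> str:
    """Keep only lines that end in terminal punctuation (drops menus, nav bars)."""
    kept = [
        line.strip()
        for line in text.splitlines()
        if line.strip().endswith(TERMINAL_PUNCTUATION)
    ]
    return "\n".join(kept)


def clean_document(text: str, min_ratio: float = MIN_CHINESE_RATIO):
    """Apply both filters; return the cleaned text, or None if nothing survives."""
    if chinese_ratio(text) < min_ratio:
        return None
    cleaned = keep_punctuated_lines(text)
    return cleaned or None
```

For example, `clean_document("今天天气很好。\n首页 登录 注册\n我们去公园散步吧！")` keeps the two full sentences and drops the navigation-style middle line, while an all-English document returns `None`.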
Details
- Task: General NLP
- Language: ZH
- Format: Parquet
- Rows / instances: N/A
- Creator: shjwudp
- Year: 2022