codeparrot/github-code-clean
General NLP | English | apache-2.0
The codeparrot/github-code-clean dataset is an English-language General NLP resource published by codeparrot in 2022. It has about 45.8K downloads and 142 likes. It is released under the apache-2.0 license and falls in the 10M<n<100M size category.
About codeparrot/github-code-clean
The GitHub Code clean dataset is a more heavily filtered version of the codeparrot/github-code dataset. It consists of 115M code files from GitHub in 32 programming languages with 60 extensions, totaling almost 1 TB of text data.
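Because the corpus is close to 1 TB, reading it lazily is usually more practical than downloading it in full. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that each record exposes `language` and `path` fields (an assumption, not stated on this page). The helper names `stream_github_code_clean` and `take_by_language` are hypothetical, and depending on the installed `datasets` version, extra arguments to `load_dataset` may be required.

```python
from itertools import islice
from typing import Iterable, List


def stream_github_code_clean(split: str = "train"):
    """Open the dataset lazily (needs the `datasets` package and network access)."""
    # Imported inside the function so the filtering helper below works offline.
    from datasets import load_dataset

    # streaming=True iterates over records without downloading the whole corpus.
    return load_dataset("codeparrot/github-code-clean", split=split, streaming=True)


def take_by_language(records: Iterable[dict], language: str, limit: int) -> List[dict]:
    """Return up to `limit` records whose `language` field (assumed name) matches."""
    matches = (rec for rec in records if rec.get("language") == language)
    # islice stops consuming the stream as soon as `limit` matches are found.
    return list(islice(matches, limit))
```

A call such as `take_by_language(stream_github_code_clean(), "Python", 3)` would then pull only as many records from the stream as are needed to find three matches. The filtering helper accepts any iterable of dictionaries, so it can be tried on a small in-memory sample before touching the real dataset.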
Details
- Task: General NLP
- Language: English
- Format: Parquet
- Rows / instances: N/A
- Size: 10M<n<100M
- Creator: codeparrot
- Year: 2022
- License: apache-2.0
- Downloads: 45842
- Likes: 142