iNeil77/the-stack-dedup-filtered
General NLP | English
Created by iNeil77 in 2024, iNeil77/the-stack-dedup-filtered is a General NLP dataset in English containing 154,329,051 records in Parquet format, which places it in the 100M<n<1B size category. It has 10.8K downloads and 0 likes.
About iNeil77/the-stack-dedup-filtered
This is a filtered version of the near-deduped bigcode/the-stack-dedup dataset. We further apply the following filters:
For files forked more than 25 times, we retain them if the average line length is less than 140, the maximum line length is le...
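The one fully visible condition above, an average line length below 140, can be sketched as follows. This is a minimal illustration, not the dataset author's code: the function names are hypothetical, and "average line length" is assumed to mean the mean character count per line. The remaining conditions are truncated in the description and are deliberately not reproduced here.

```python
def average_line_length(content: str) -> float:
    """Mean number of characters per line (assumed definition); 0.0 if empty."""
    lines = content.splitlines()
    if not lines:
        return 0.0
    return sum(len(line) for line in lines) / len(lines)


def meets_avg_line_length_condition(content: str, limit: float = 140.0) -> bool:
    """Check only the stated 'average line length is less than 140' condition.

    Hypothetical helper: the description applies this to files forked more
    than 25 times, together with further conditions that are cut off in the
    text above and therefore not implemented here.
    """
    return average_line_length(content) < limit


if __name__ == "__main__":
    sample = "def add(a, b):\n    return a + b\n"
    print(average_line_length(sample))
    print(meets_avg_line_length_condition(sample))
```

Short, hand-written source files pass this check easily, while minified or generated files with a few very long lines tend to fail it.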
Details
- Task: General NLP
- Language: English
- Format: Parquet
- Rows / instances: 154,329,051
- Size: 100M<n<1B
- Creator: iNeil77
- Year: 2024
- Downloads: 10,832
- Likes: 0