iNeil77/the-stack-dedup-filtered

General NLP, English

Created by iNeil77 in 2024, iNeil77/the-stack-dedup-filtered is a General NLP dataset in English containing 154,329,051 records in Parquet format, which places it in the 100M<n<1B size category. It has 10.8K downloads and 0 likes.

About iNeil77/the-stack-dedup-filtered

This is a filtered version of the near-deduped bigcode/the-stack-dedup dataset. We further apply the following filters: For files forked more than 25 times, we retain them if the average line length is less than 140, the maximum line length is le...
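The line-length rule above can be sketched as a small predicate. This is only an illustration, not the dataset author's code: the function names, the `forks` argument, and the strict `<` comparison on maximum line length are assumptions. The description is truncated before the maximum line length threshold, so that limit is a required parameter with no default here. The behaviour for files forked 25 times or fewer is also not visible, so the sketch returns `None` for them.

```python
from typing import Optional, Tuple


def line_stats(content: str) -> Tuple[float, int]:
    """Return (average line length, maximum line length) for a file's text."""
    lines = content.splitlines() or [""]  # treat an empty file as one empty line
    lengths = [len(line) for line in lines]
    return sum(lengths) / len(lengths), max(lengths)


def keep_file(
    content: str,
    forks: int,
    max_line_limit: int,       # hypothetical: the real value is cut off in the description
    fork_threshold: int = 25,  # from the description: "forked more than 25 times"
    avg_line_limit: int = 140, # from the description: average line length below 140
) -> Optional[bool]:
    """Apply the visible part of the filter; None means the rule does not apply."""
    if forks <= fork_threshold:
        return None  # the description is truncated before covering this case
    avg_len, max_len = line_stats(content)
    # Assumption: the maximum line length is compared with a strict "<".
    return avg_len < avg_line_limit and max_len < max_line_limit
```

For example, `keep_file(source_text, forks=30, max_line_limit=1000)` checks one file, where `1000` is a placeholder value chosen by the caller rather than the dataset's actual threshold.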

Details

Task: General NLP
Language: English
Format: Parquet
Rows / instances: 154,329,051
Size: 100M<n<1B
Creator: iNeil77
Year: 2024
Downloads: 10,832
Likes: 0