ibm-granite/GneissWeb

General NLP | EN | apache-2.0

Created by ibm-granite in 2025, ibm-granite/GneissWeb is an English-language General NLP dataset in Parquet format. It has 746 downloads and 47 likes, and it is released under the apache-2.0 license.

About ibm-granite/GneissWeb

What is it? A recipe for producing a state-of-the-art LLM pre-training dataset of 10+ trillion tokens, derived from FineWeb V1.1.0. Evaluation results show more than 2% average improvement (across multiple random seeds) over FineWeb V1.1.0 token...

Details

Task: General NLP
Language: EN
Format: Parquet
Rows / instances: N/A
Creator: ibm-granite
Year: 2025
License: apache-2.0
Downloads: 746
Likes: 47
