ibm-granite/GneissWeb
General NLP, EN, apache-2.0
Created by ibm-granite in 2025, ibm-granite/GneissWeb is a General NLP dataset in English (EN), distributed in Parquet format. It has 746 downloads and 47 likes, and is released under the apache-2.0 license.
About ibm-granite/GneissWeb
What is it?
A recipe for producing a state-of-the-art LLM pre-training dataset of more than 10 trillion tokens, derived from FineWeb V1.1.0.
Evaluation results show an average improvement of more than 2% (across multiple random seeds) over FineWeb V1.1.0 token...
Details
- Task: General NLP
- Language: EN
- Format: Parquet
- Rows / instances: N/A
- Creator: ibm-granite
- Year: 2025
- License: apache-2.0
- Downloads: 746
- Likes: 47