hoskinson-center/proof-pile
Text GenerationENapache-2.0
Hoskinson-center/proof-pile is a text generation dataset in EN from hoskinson-center in Parquet format. It is distributed under the apache-2.0 license and falls in the 100K<n<1M size category, and has been downloaded 1.6K times.
About hoskinson-center/proof-pile
# Dataset Description
The `proof-pile` is a 13GB pre-training dataset of mathematical text that comprises 8.3 billion tokens (using the `gpt-neox` tokenizer). Models trained on this dataset are coming soon :) The dataset is composed of diverse sources of both informal and formal mathematics, namely
- ArXiv.math (10GB)
- Open-source math textbooks (50MB)
- Formal mathematics libraries (500MB)
- Lean mathlib and other Lean repositories
- Isabelle AFP
- Coq mathematical components and other Coq repositories
- HOL Light
- set.mm
- Mizar Mathematical Library
- Math Overflow and Math Stack Exchange (2.5GB)
- Wiki-style sources (50MB)
- ProofWiki
- Wikipedia math articles
- MATH dataset (6MB)
The construction of the dataset is reproducible using the code and instructions in the [proof-pile Github
repo](https://github.com/zhangir-azerbayev/proof-pile).
# Supported Tasks…
Details
- Task
- Text Generation
- Language
- EN
- Format
- Parquet
- Rows / instances
- N/A
- Size
- 100K<n<1M
- Creator
- hoskinson-center
- Year
- 2022
- License
- apache-2.0
- Downloads
- 1591
- Likes
- 68