Skip to content

hoskinson-center/proof-pile

Text GenerationENapache-2.0

Hoskinson-center/proof-pile is a text generation dataset in EN from hoskinson-center in Parquet format. It is distributed under the apache-2.0 license and falls in the 100K<n<1M size category, and has been downloaded 1.6K times.

About hoskinson-center/proof-pile

# Dataset Description The `proof-pile` is a 13GB pre-training dataset of mathematical text that comprises 8.3 billion tokens (using the `gpt-neox` tokenizer). Models trained on this dataset are coming soon :) The dataset is composed of diverse sources of both informal and formal mathematics, namely - ArXiv.math (10GB) - Open-source math textbooks (50MB) - Formal mathematics libraries (500MB) - Lean mathlib and other Lean repositories - Isabelle AFP - Coq mathematical components and other Coq repositories - HOL Light - set.mm - Mizar Mathematical Library - Math Overflow and Math Stack Exchange (2.5GB) - Wiki-style sources (50MB) - ProofWiki - Wikipedia math articles - MATH dataset (6MB) The construction of the dataset is reproducible using the code and instructions in the [proof-pile Github repo](https://github.com/zhangir-azerbayev/proof-pile). # Supported Tasks…

Details

Task
Text Generation
Language
EN
Format
Parquet
Rows / instances
N/A
Size
100K<n<1M
Creator
hoskinson-center
Year
2022
License
apache-2.0
Downloads
1591
Likes
68
Download Homepage

Related Text Generation datasets

FAQ