jhu-clsp/ettin-pretraining-data
Text Generation, Fill Mask, Text Classification, EN, mit
jhu-clsp/ettin-pretraining-data is an English-language dataset for text generation, fill-mask, and text classification tasks, distributed in Parquet format under the MIT license. It has been downloaded 165.9K times.
About jhu-clsp/ettin-pretraining-data
Ettin Pre-training Data
Phase 1 of 3: Diverse pre-training data mixture (1.7T tokens) used to train the Ettin model suite.
This dataset contains the pre-training phase data used to train all Ettin encoder and decoder models. The data is p...
Details
- Task: Text Generation, Fill Mask, Text Classification
- Language: EN
- Format: Parquet
- Rows / instances: N/A
- Creator: jhu-clsp
- Year: 2024
- License: mit
- Downloads: 165,903
- Likes: 9