lightonai/embeddings-pre-training

General NLP · English

Created by lightonai in 2025, lightonai/embeddings-pre-training is an English-language General NLP dataset distributed in Parquet format.

About lightonai/embeddings-pre-training

Overview

This large-scale dataset is designed for pre-training state-of-the-art text embedding models. Its goal is to reproduce and build upon the data recipe described in the mGTE technical report (Zhang et al., 2024), which details the data s...

Details

Task: General NLP
Language: English
Format: Parquet
Rows / instances: N/A
Creator: lightonai
Year: 2025