lightonai/embeddings-pre-training
General NLP, English
Created by lightonai in 2025, lightonai/embeddings-pre-training is an English-language General NLP dataset distributed in Parquet format.
About lightonai/embeddings-pre-training
Overview
This large-scale dataset is designed for pre-training state-of-the-art text embedding models. Its goal is to reproduce and build upon the data recipe described in the mGTE technical report (Zhang et al., 2024), which details the data s...
Details
- Task: General NLP
- Language: English
- Format: Parquet
- Rows / instances: N/A
- Creator: lightonai
- Year: 2025