Clarification on General-Purpose Training Data

#17 opened by zanga78

Hello,

Thank you for the impressive model and the really detailed paper, great work!

I had a couple of questions regarding the General-Purpose Training stage described in Section 4.1.3 of the paper. You mention that the model is trained on a “diverse collection of text pairs, drawn from over 300 datasets in over 30 languages.”

  1. Would you be open to sharing (fully or partially) the list of datasets used in this stage?
    Even a high-level breakdown (e.g., major public sources, proportions, or dataset categories) would be extremely helpful for understanding and reproducing the training pipeline.
  2. Is there any overlap between the datasets used in this stage and the datasets used for the Asymmetric Retrieval Adapter?
  3. If sharing the exact datasets is not possible, could you give any details on how the 300+ datasets were selected or filtered, whether
    synthetic data was included at this stage (vs. only in long-context training), and what the balance between domains (web, academic, QA, multilingual corpora, etc.) looks like?

Thanks again for your work and for any details you’re able to share!

Jina AI org

Hi,

I can't share the full list, but I can answer your questions.

Some of the larger datasets are s2orc, CC-News, and datasets derived from Wikipedia. Roughly 1/3 of the data is English. The collection also contains retrieval datasets, e.g. training data from HotpotQA, which are also used during the retrieval stage. However, for the retrieval adapter training we use hard negatives, while we don't use them in the first stage. Since our retrieval datasets are usually much smaller, they also make up a smaller portion of the first-stage training data, although smaller datasets may get larger sampling rates than larger ones.

For filtering, we used consistency filtering and removed duplicate pairs as well as pairs where the left and right side have a high syntactic overlap. We also verify that hard negatives are not false negatives using cross-encoders and LLMs. We do use some synthetic data; however, only the queries are synthetic, not the documents.

Regarding the domains, we haven't clearly categorized the training data by domain, but it contains a lot of web data, a significant portion of academic data (e.g. s2orc), data in the form of question-answer pairs, and data from the e-commerce domain. For the retrieval adapter, we use a larger portion of other domains, such as legal and finance-related data, which are not as well represented in the pairs used in the first stage.
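To make the duplicate/overlap part of the filtering a bit more concrete, here is a minimal, illustrative sketch of that kind of pair filtering. The Jaccard heuristic and the 0.8 threshold are placeholders, not the exact pipeline, and the consistency-filtering step is not shown here:

```python
# Illustrative only: drop exact duplicate pairs and pairs whose two sides
# overlap too much. Heuristic and threshold are placeholder assumptions.
from typing import Iterable, List, Tuple


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def filter_pairs(pairs: Iterable[Tuple[str, str]],
                 max_overlap: float = 0.8) -> List[Tuple[str, str]]:
    """Keep pairs that are neither duplicates nor near-identical on both sides."""
    seen, kept = set(), []
    for left, right in pairs:
        key = (left.strip(), right.strip())
        if key in seen:
            continue  # duplicate pair
        seen.add(key)
        if token_overlap(left, right) > max_overlap:
            continue  # near-identical sides carry little training signal
        kept.append((left, right))
    return kept
```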

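Similarly, a minimal sketch of the false-negative check on hard negatives, here using a cross-encoder from sentence-transformers (the model name and the score margin are illustrative assumptions, not the actual setup; the LLM-based check follows the same query/candidate scoring pattern):

```python
# Illustrative only: drop hard-negative candidates that a cross-encoder scores
# too close to the positive, i.e. likely false negatives. Model name and
# margin are placeholder assumptions.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def drop_false_negatives(query: str, positive: str, candidates: list[str],
                         margin: float = 2.0) -> list[str]:
    """Keep only candidates scoring at least `margin` below the positive."""
    pos_score = reranker.predict([(query, positive)])[0]
    neg_scores = reranker.predict([(query, c) for c in candidates])
    return [c for c, s in zip(candidates, neg_scores) if s < pos_score - margin]
```

Candidates that survive such a check are then usable as hard negatives in the retrieval adapter stage.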
michael-guenther changed discussion status to closed
