Known internally on the leaderboard as jade-ft-14-bert.
This is a sentence-transformers model finetuned from nomic-ai/modernbert-embed-base on the jade_embeddings_train_25.04.04 dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
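The Pooling and Normalize stages above can be illustrated with a minimal NumPy sketch. It assumes that `pooling_mode_mean_tokens` averages token vectors over non-padding positions and that Normalize rescales each vector to unit length; the function name `mean_pool_and_normalize` and the random inputs are hypothetical stand-ins for real transformer outputs, not the library's implementation.

```python
import numpy as np

def mean_pool_and_normalize(token_embeddings, attention_mask):
    """Average token vectors over non-padding positions, then L2-normalize."""
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                     # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                     # guard against empty rows
    pooled = summed / counts
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)

# Toy input: 2 sequences, 4 token positions, 768 dimensions (random stand-ins).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 4, 768))
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])  # 0 marks padding
sentence_vectors = mean_pool_and_normalize(tokens, mask)
print(sentence_vectors.shape)
```

Because the output vectors have unit length, their dot product equals their cosine similarity.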
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("lwoollett/jade-ft-14-bert-static")
# Run inference
sentences = [
'Which subclasses are associated with the JadeXMLCharacterData class?',
'## JadeXMLCharacterData Class\n\nThe **JadeXMLCharacterData** class is the abstract superclass of character-based nodes in an XML document tree; that is, the text, **CDATA**, and comment nodes.\n\nFor details about the property defined in the **JadeXMLCharacterData** class, see "[JadeXMLCharacterData Property](jadexmlcharacterdata_property.htm)", in the following section.\n\n[JadeXMLNode](../jadexmlnode_class/jadexmlnode_class.htm)\n\n[JadeXMLCDATA](../jadexmlcdata_class/jadexmlcdata_class.htm), [JadeXMLComment](../jadexmlcomment_class/jadexmlcomment_class.htm), [JadeXMLText](../jadexmltext_class/jadexmltext_class.htm)',
"### Minimizing the Working Set\n\nIn loops where there are multiple filters, apply the cheapest filters first and then the filters that reduce the working set the most. For example, consider the following code fragment, which finds sales of appliances in a specified city.\n\n```\nwhile iter.next(tran) do\r\n if tran.type = Type_Sale\r\n and tran.myBranch.myLocation.city = targetCity\r\n and tran.myProduct.isAppliance then\r\n <do something with tran>\r\n endif;\r\nendwhile;\n```\nIn this example, **tran.type** should be checked first, because it is the cheapest. The **tran** object must be fetched to evaluate all of the other conditions, so we may as well check the **type** attribute first. If we did the **isAppliance** check first, we would have to fetch all of the product objects for the transactions that were not sales. Regardless of how many transactions are sales and how many products are appliances, it will save time to check **tran.type** first.\n\nNow, assume that:\n\n- 80 percent of transactions are sales\n\n- 15 percent, on average, are likely to be in the target city\n\n- 90 percent of the products are appliances\n\nIt pays to check the city first, even though it means fetching the branch and location objects for the non‑appliance products. There are very few non‑appliance products, so the number of extra fetches is small. By contrast, checking for non‑appliance products for all other cities would result in a large number of extra fetches.\n\nIt doesn't matter if the filters are conditions of an [if](../../devref/ch1languageref/if_instruction.htm#if) instruction, multiple [if](../../devref/ch1languageref/if_instruction.htm#if) instructions, or multiple conditions in the [where](../../devref/ch1languageref/where_clause_optimization.htm#whereoptimization) clause of a [while](../../devref/ch1languageref/while_instruction.htm#while) statement; the end result is the same.\n\nThis code fragment example is simple and concise, to convey the concept. In the real world, each successive filter may be in another method, another class, or even another schema. It may take a bit of investigation to find all of the filters involved in a single loop.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
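In the example above, the first sentence is a question and the other two are documentation passages, so one row of the similarity matrix can be read as a retrieval ranking. The following is a minimal sketch of that ranking step, assuming cosine similarity; the helper `rank_by_cosine` and the three-dimensional toy vectors are hypothetical and stand in for real 768-dimensional embeddings.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_vecs, top_k=2):
    """Return (document index, cosine score) pairs, best match first."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                        # cosine similarity per document
    order = np.argsort(-scores)[:top_k]   # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors, not real model output.
query = np.array([1.0, 0.0, 0.0])
docs = np.array([
    [0.9, 0.1, 0.0],   # points almost the same way as the query
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [0.5, 0.5, 0.0],   # partially aligned
])
ranking = rank_by_cosine(query, docs)
print(ranking)
```

With real embeddings, `query` would be the encoded question and `docs` the encoded passages.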
Columns: anchor and positive

| | anchor | positive |
|---|---|---|
| type | string | string |
| details | | |

| anchor | positive |
|---|---|
| What is the format for defining a Byte constant in JADE? | ##### Constant Definition Tips |
| How does the replaceFrom__ method handle case sensitivity? | #### replaceFrom__ replacement: String; startIndex: Integer; bIgnoreCase: Boolean): String; ``` The replaceFrom__ method of the String primitive type replaces only the first occurrence of the substring specified in the target parameter with the substring specified in the replacement parameter, starting from the specified startIndex parameter. Case‑sensitivity is ignored if you set the value of the bIgnoreCase parameter to true. Set this parameter to false if you want the substring replacement to be case‑sensitive. This method raises exception 1413 (Index used in string operation is out of bounds) if the value specified in the startIndex parameter is less than 1 or it is greater than the length of the original string. In addition, it returns the original receiver String if the value specified in the target parameter has a length of zero (**... |
| What does the global constant Ex_Continue do? | ## Exceptions Category |
Loss: CachedMultipleNegativesRankingLoss with these parameters:
{
    "scale": 20.0,
    "similarity_fct": "cos_sim",
    "mini_batch_size": 32
}
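To make the `scale` and `similarity_fct` parameters concrete, here is a minimal NumPy sketch of an in-batch-negatives ranking objective: each anchor is scored against every positive in the batch with cosine similarity, the scores are multiplied by `scale`, and cross-entropy rewards the matching pair. The function name and the toy data are hypothetical, and this is a sketch of the general idea rather than the library's code.

```python
import numpy as np

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled cosine scores; row i should pick column i."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)                            # (batch, batch)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
# Positives that nearly match their anchors give a low loss.
aligned = multiple_negatives_ranking_loss(anchors, anchors + 0.01 * rng.normal(size=(4, 8)))
# Reversing the row order pairs every anchor with the wrong positive.
shuffled = multiple_negatives_ranking_loss(anchors, anchors[::-1])
print(aligned, shuffled)
```

A larger `scale` sharpens the softmax, so mismatched pairs are penalized more strongly.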
Columns: anchor and positive

| | anchor | positive |
|---|---|---|
| type | string | string |
| details | | |

| anchor | positive |
|---|---|
| What is the keyword list constant value for JADE_SYSTEMVARS? | ### changeKeywords keywordList: Integer; keywords: String); ``` The changeKeywords method of the JadeTextEdit class modifies one or more of the current keyword lists. The keyword lists are used by the current language lexical analyzer to classify the tokens found in the text. For the Jade language, this includes keywords, class names, constant names, and so on. The value of the action parameter can be one of the JadeTextEdit class constants listed in the following table. Class Constant |
| What should you click to abandon the deletion of a report in JADE? | #### Delete Report Command |
| What types of objects can be set for the userGroupObject in JadeMultiWorkerTcpTransport? | #### userGroupObject |
Loss: CachedMultipleNegativesRankingLoss with these parameters:
{
    "scale": 20.0,
    "similarity_fct": "cos_sim",
    "mini_batch_size": 32
}
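The `mini_batch_size` parameter relates to the gradient caching idea from Gao et al. (2021), cited below: embeddings are computed in small chunks, the loss gradient with respect to the embeddings is cached for the full batch, and each chunk is then re-run to accumulate weight gradients. The sketch below illustrates those three stages under simplifying assumptions that are not from the card: a linear encoder `w`, dot-product scores instead of `cos_sim`, and hand-written gradients. All function names are hypothetical.

```python
import numpy as np

def loss_and_embedding_grads(a, p, scale):
    """In-batch softmax loss on scaled dot products, plus gradients w.r.t. a and p."""
    n = a.shape[0]
    scores = scale * (a @ p.T)
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = float(-np.mean(np.diag(log_probs)))
    g = (np.exp(log_probs) - np.eye(n)) / n   # d loss / d scores
    return loss, scale * (g @ p), scale * (g.T @ a)

def grad_cache_step(x_a, x_p, w, scale=20.0, mini_batch_size=32):
    """Return (loss, d loss / d w) using chunked forward and backward passes."""
    n = x_a.shape[0]
    chunks = [slice(i, min(i + mini_batch_size, n)) for i in range(0, n, mini_batch_size)]
    # Stage 1: embed chunk by chunk; only the embedding matrices are kept.
    a = np.concatenate([x_a[c] @ w for c in chunks])
    p = np.concatenate([x_p[c] @ w for c in chunks])
    # Stage 2: full-batch loss and cached gradients w.r.t. the embeddings.
    loss, grad_a, grad_p = loss_and_embedding_grads(a, p, scale)
    # Stage 3: revisit each chunk and push its cached gradient into the weights.
    grad_w = np.zeros_like(w)
    for c in chunks:
        grad_w += x_a[c].T @ grad_a[c] + x_p[c].T @ grad_p[c]
    return loss, grad_w

rng = np.random.default_rng(0)
x_a = rng.normal(size=(70, 16))                 # toy anchor features
x_p = x_a + 0.1 * rng.normal(size=(70, 16))     # toy positive features
w = 0.1 * rng.normal(size=(16, 8))              # toy linear encoder
loss_chunked, grad_chunked = grad_cache_step(x_a, x_p, w, mini_batch_size=32)
loss_full, grad_full = grad_cache_step(x_a, x_p, w, mini_batch_size=70)
print(np.allclose(grad_chunked, grad_full))
```

The chunked and single-chunk runs agree because every anchor is still scored against all positives in the batch; only the encoder passes are split, which is what keeps peak memory bounded by the mini-batch size.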
Non-default hyperparameters:

- eval_strategy: steps
- per_device_train_batch_size: 18
- per_device_eval_batch_size: 18
- num_train_epochs: 4
- warmup_ratio: 0.1
- bf16: True
- batch_sampler: no_duplicates

All hyperparameters:

- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 18
- per_device_eval_batch_size: 18
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 4
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: True
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- tp_size: 0
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: proportional

| Epoch | Step | Training Loss | Validation Loss |
|---|---|---|---|
| 0.1761 | 100 | 0.0851 | 0.0243 |
| 0.3521 | 200 | 0.0262 | 0.0211 |
| 0.5282 | 300 | 0.0275 | 0.0217 |
| 0.7042 | 400 | 0.0216 | 0.0256 |
| 0.8803 | 500 | 0.0283 | 0.0241 |
| 1.0563 | 600 | 0.0226 | 0.0195 |
| 1.2324 | 700 | 0.0113 | 0.0170 |
| 1.4085 | 800 | 0.0114 | 0.0204 |
| 1.5845 | 900 | 0.0165 | 0.0182 |
| 1.7606 | 1000 | 0.0129 | 0.0219 |
| 1.9366 | 1100 | 0.0126 | 0.0181 |
| 2.1127 | 1200 | 0.0069 | 0.0207 |
| 2.2887 | 1300 | 0.0045 | 0.0212 |
| 2.4648 | 1400 | 0.0046 | 0.0187 |
| 2.6408 | 1500 | 0.0056 | 0.0206 |
| 2.8169 | 1600 | 0.0084 | 0.0196 |
| 2.9930 | 1700 | 0.0050 | 0.0214 |
| 3.1690 | 1800 | 0.0056 | 0.0202 |
| 3.3451 | 1900 | 0.0088 | 0.0190 |
| 3.5211 | 2000 | 0.0026 | 0.0202 |
| 3.6972 | 2100 | 0.0064 | 0.0205 |
| 3.8732 | 2200 | 0.0060 | 0.0202 |
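The hyperparameters and the log above can be combined to estimate the learning rate at each logged step. This sketch infers the number of steps per epoch from the first log row (step 100 at epoch 0.1761) and assumes the usual shape of a linear scheduler: a linear ramp over the warmup steps, then linear decay to zero. Rounding the warmup steps up is also an assumption, and the function name is hypothetical.

```python
import math

def linear_schedule_with_warmup(step, total_steps, warmup_steps, base_lr):
    """Linear ramp to base_lr during warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)

steps_per_epoch = round(100 / 0.1761)        # from the first log row
total_steps = steps_per_epoch * 4            # num_train_epochs: 4
warmup_steps = math.ceil(0.1 * total_steps)  # warmup_ratio: 0.1 (rounding is assumed)

for step in (100, warmup_steps, 1100, 2200):
    lr = linear_schedule_with_warmup(step, total_steps, warmup_steps, base_lr=5e-05)
    print(step, f"{lr:.2e}")
```

Under these assumptions the first logged step (100) falls inside the warmup phase, which is one possible reason its training loss is much higher than the rows that follow.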
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{gao2021scaling,
title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
year={2021},
eprint={2101.06983},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
Base model
answerdotai/ModernBERT-base