# NanoBERT-V2
This model was pre-trained from scratch using the same architecture as google-bert/bert-base-uncased.
A custom WordPiece tokenizer was trained on a corpus of 200,000 domain-specific papers and a subset of BookCorpus.
## Intended uses & limitations
The model is intended for fine-tuning on downstream tasks using nanoscience datasets. It can also be used directly to create dense vector representations for information retrieval.
## Training and evaluation data
Trained on 2 nodes of the Polaris supercomputer at ALCF: https://docs.alcf.anl.gov/polaris/hardware-overview/machine-overview/
## Example usage
### Masked language modeling (MLM)
```python
from transformers import pipeline

mlm_pipeline = pipeline("fill-mask", model="Flamenco43/NanoBERT-V2")

text = "The most common nanoparticle is [MASK]."
predictions = mlm_pipeline(text)

for prediction in predictions:
    print(f"Token: {prediction['token_str']}, Score: {prediction['score']:.4f}")
```
### Embedding generation (using the Hugging Face Inference API)
```python
import requests

api_token = "your_api_token"
model_name = "Flamenco43/NanoBERT-V2"
url = f"https://api-inference.huggingface.co/pipeline/feature-extraction/{model_name}"

headers = {"Authorization": f"Bearer {api_token}"}
data = {"inputs": "This is a sample text for embedding generation."}

response = requests.post(url, headers=headers, json=data)
embeddings = response.json()
```
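The feature-extraction endpoint returns nested lists of floats (one vector per token). To get a single sentence embedding for retrieval, a common approach is mean pooling over the token vectors. A minimal sketch in plain Python, using a tiny made-up payload in place of a real API response (the values and dimensionality are illustrative only):

```python
# Hypothetical token-level output, as would be parsed from response.json();
# real vectors from this model would be 768-dimensional.
token_embeddings = [
    [0.1, 0.2, 0.3],  # token 1
    [0.3, 0.2, 0.1],  # token 2
]

# Mean-pool across tokens to obtain one fixed-size sentence vector.
dim = len(token_embeddings[0])
sentence_embedding = [
    sum(tok[i] for tok in token_embeddings) / len(token_embeddings)
    for i in range(dim)
]
print(sentence_embedding)
```

Mean pooling is one simple choice; taking the `[CLS]` token vector is another common option.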
## Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- total_train_batch_size: 256
- total_eval_batch_size: 256
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- training_steps: 1000000
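The total train batch size of 256 follows from the per-device batch size and the device count. A quick sanity check (the gradient-accumulation factor of 1 is an assumption; the card does not state it):

```python
per_device_train_batch_size = 32  # train_batch_size above
num_devices = 8                   # num_devices above
gradient_accumulation_steps = 1   # assumed; not stated in the card

total_train_batch_size = (
    per_device_train_batch_size * num_devices * gradient_accumulation_steps
)
print(total_train_batch_size)  # 256
```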
## Framework versions
- Transformers 4.41.2
- PyTorch 2.3.0
- Datasets 2.20.0
- Tokenizers 0.19.1
## Model tree

Base model: google-bert/bert-base-uncased
## Evaluation results

- Accuracy on Flamenco43/custom_tokenized_nano_papers_cut-with_book_corpus (self-reported): 0.771