| --- |
| library_name: transformers |
| tags: |
| - citation |
| - text-classification |
| - science |
| license: apache-2.0 |
| language: |
| - af |
| - am |
| - ar |
| - as |
| - az |
| - be |
| - bg |
| - bn |
| - br |
| - bs |
| - ca |
| - cs |
| - cy |
| - da |
| - de |
| - el |
| - en |
| - eo |
| - es |
| - et |
| - eu |
| - fa |
| - fi |
| - fr |
| - fy |
| - ga |
| - gd |
| - gl |
| - gu |
| - ha |
| - he |
| - hi |
| - hr |
| - hu |
| - hy |
| - id |
| - is |
| - it |
| - ja |
| - jv |
| - ka |
| - kk |
| - km |
| - kn |
| - ko |
| - ku |
| - ky |
| - la |
| - lo |
| - lt |
| - lv |
| - mg |
| - mk |
| - ml |
| - mn |
| - mr |
| - ms |
| - my |
| - ne |
| - nl |
| - 'no' |
| - om |
| - or |
| - pa |
| - pl |
| - ps |
| - pt |
| - ro |
| - ru |
| - sa |
| - sd |
| - si |
| - sk |
| - sl |
| - so |
| - sq |
| - sr |
| - su |
| - sv |
| - sw |
| - ta |
| - te |
| - th |
| - tl |
| - tr |
| - ug |
| - uk |
| - ur |
| - uz |
| - vi |
| - xh |
| - yi |
| - zh |
| base_model: |
| - distilbert/distilbert-base-multilingual-cased |
| --- |
| |
| # Citation Pre-Screening |
|
|
|
|
| ## Overview |
|
|
| <details> |
| <summary>Click to expand</summary> |
| |
| - **Model type:** Language Model |
| - **Architecture:** DistilBERT |
| - **Language:** Multilingual |
| - **License:** Apache 2.0 |
| - **Task:** Binary Classification (Citation Pre-Screening) |
| - **Dataset:** SIRIS-Lab/citation-parser-TYPE |
| - **Additional Resources:** |
|   - [GitHub](https://github.com/sirisacademic/citation-parser) |
| </details> |
|
|
| ## Model description |
|
|
| The **Citation Pre-Screening** model is part of the [`Citation Parser`](https://github.com/sirisacademic/citation-parser) package. It is based on **DistilBERT**, fine-tuned to classify citation texts as valid or invalid, and designed for automated citation processing workflows such as the citation metadata extraction and validation performed by the **Citation Parser** tool. |
|
|
| The model was trained on a dataset containing citation texts, with the labels `True` (valid citation) and `False` (invalid citation). The dataset contains 3599 training samples and 400 test samples, with each example consisting of citation-related text and a corresponding label. |
|
|
| The model was fine-tuned from **DistilBERT-base-multilingual-cased**, so it can handle multilingual text. Note, however, that it was evaluated on English citation data. |
|
|
| ## Intended Usage |
|
|
| This model classifies raw citation text as either a valid or an invalid citation. It is suited to automating the pre-screening step in citation databases or manuscript workflows. |
|
|
| ## How to use |
|
|
| ```python |
| from transformers import pipeline |
| |
| # Load the model |
| citation_classifier = pipeline("text-classification", model="sirisacademic/citation-pre-screening") |
| |
| # Example citation text |
| citation_text = "MURAKAMI, H等: 'Unique thermal behavior of acrylic PSAs bearing long alkyl side groups and crosslinked by aluminum chelate', 《EUROPEAN POLYMER JOURNAL》" |
| |
| # Classify the citation |
| result = citation_classifier(citation_text) |
| print(result) |
| ``` |
|
|
| ## Training |
|
|
| The model was trained using the **Citation Pre-Screening Dataset** consisting of: |
|
|
| - **Training data**: 3599 samples |
| - **Test data**: 400 samples |
|
|
| The following hyperparameters were used for training: |
|
|
| - **Model Path**: `distilbert/distilbert-base-multilingual-cased` |
| - **Batch Size**: 32 |
| - **Number of Epochs**: 4 |
| - **Learning Rate**: 2e-5 |
| - **Max Sequence Length**: 512 |
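
As a rough illustration, the hyperparameters above can be collected in a small configuration object and used to estimate the length of the training run. This is a sketch, not the authors' training script: `FineTuneConfig` and `total_optimizer_steps` are hypothetical names, and the step count assumes a single device, no gradient accumulation, and that the final partial batch of each epoch is kept.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    # Values copied from the hyperparameter list in this card
    model_path: str = "distilbert/distilbert-base-multilingual-cased"
    batch_size: int = 32
    num_epochs: int = 4
    learning_rate: float = 2e-5
    max_seq_length: int = 512


def total_optimizer_steps(num_train_samples: int, cfg: FineTuneConfig) -> int:
    # Assumes one device, no gradient accumulation, last partial batch kept
    steps_per_epoch = math.ceil(num_train_samples / cfg.batch_size)
    return steps_per_epoch * cfg.num_epochs


cfg = FineTuneConfig()
# 3599 training samples: ceil(3599 / 32) = 113 steps per epoch, 4 epochs
print(total_optimizer_steps(3599, cfg))  # 452
```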
|
|
| ## Evaluation Metrics |
|
|
| The model's performance was evaluated on the test set, and the following results were obtained: |
|
|
| | Metric | Value | |
| |----------------------|--------| |
| | **Accuracy** | 0.95 | |
| | **Macro avg F1** | 0.94 | |
| | **Weighted avg F1** | 0.95 | |
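
For readers unfamiliar with the difference between the two F1 averages, the sketch below computes accuracy, macro-averaged F1, and support-weighted F1 from scratch. The labels are toy data made up for illustration, not the model's test set, and `f1_summary` is a hypothetical helper rather than the evaluation code used for this card.

```python
from collections import Counter
from typing import Dict, Sequence


def f1_summary(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Dict[str, float]:
    """Compute accuracy, macro F1, and support-weighted F1."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true examples per class

    per_class_f1 = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        per_class_f1[label] = 2 * tp / denom if denom else 0.0

    n = len(pairs)
    return {
        "accuracy": sum(t == p for t, p in pairs) / n,
        # Macro: every class counts equally
        "macro_f1": sum(per_class_f1.values()) / len(labels),
        # Weighted: each class counts in proportion to its support
        "weighted_f1": sum(per_class_f1[l] * support[l] for l in labels) / n,
    }


# Toy, imbalanced example (illustrative values only)
y_true = [True] * 6 + [False] * 2
y_pred = [True] * 5 + [False] + [False, True]
print(f1_summary(y_true, y_pred))
```

On this toy data the macro average (about 0.67) falls below the weighted average (0.75) because the smaller class is predicted less accurately, which is the usual reason the two values differ.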
|
|
| ## Additional information |
|
|
| ### Authors |
|
|
| - SIRIS Lab, Research Division of SIRIS Academic. |
|
|
| ### License |
|
|
| This work is distributed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). |
|
|
| ### Contact |
|
|
| For further information, send an email to either [nicolau.duransilva@sirisacademic.com](mailto:nicolau.duransilva@sirisacademic.com) or [info@sirisacademic.com](mailto:info@sirisacademic.com). |
|
|
|
|