Vietnamese Legal Base Model - Qwen3-1.7B (Pretrained)
This model is a Vietnamese legal-domain base model continually pretrained from Qwen3-1.7B, adapted specifically for legal text understanding and legal question answering tasks.
Overview
- Base model: Qwen/Qwen3-1.7B
- Domain: Vietnamese legal language
- Training objective: Continual pretraining on legal-domain texts
Training Data
The model was continually pretrained on a curated corpus of Vietnamese legal texts, including:
- Official legal documents (laws, codes, decrees, etc.)
- Legal news articles and commentary
Dataset Statistics
The training corpus comprises approximately 96,000 legal documents, drawn from official sources such as laws, decrees, and circulars.
Training Configuration
The model was trained with full-parameter continual pretraining (no quantization or LoRA), intentionally fit closely to the legal corpus. Below is the training setup:
- Base model: Qwen/Qwen3-1.7B
- Training setup: 4 nodes × 8 H100 GPUs (32 GPUs in total)
- Maximum sequence length: 8092
- Epochs: 25
- Per-device batch size: 2
- Gradient accumulation steps: 4
- Mixed precision: bf16
- LR warmup ratio: 0.05
- LR scheduler: linear
- Seed: 1308
- Training time: 15 hours
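Assuming the batch size of 2 is per device, the setup above implies the effective global batch size per optimizer step. A quick check of the arithmetic:

```python
# Values taken from the training configuration above
per_device_batch = 2
grad_accum_steps = 4
num_gpus = 32  # 4 nodes × 8 H100s

# Effective global batch size = per-device batch × accumulation steps × GPUs
effective_batch = per_device_batch * grad_accum_steps * num_gpus
print(effective_batch)  # 256
```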
Metrics
- Cross-entropy loss: 0.01
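A cross-entropy loss of 0.01 corresponds to a training perplexity of exp(0.01) ≈ 1.01, i.e. the model fits the training corpus almost perfectly, consistent with the 25-epoch setup:

```python
import math

cross_entropy = 0.01  # final training loss reported above
# Perplexity is the exponential of the mean cross-entropy loss
perplexity = math.exp(cross_entropy)
print(f"{perplexity:.4f}")  # 1.0101
```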
All texts were collected from publicly available and legally permitted sources, then preprocessed to ensure quality and consistency for domain adaptation.
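The exact preprocessing pipeline is not published; a minimal sketch of the kind of normalization and deduplication typically applied to scraped Vietnamese legal text (the helper names here are hypothetical, not the authors' code) might look like:

```python
import re
import unicodedata


def clean_legal_text(text: str) -> str:
    """Hypothetical cleaning step: canonicalize Unicode and whitespace."""
    # NFC normalization keeps Vietnamese diacritics in a single canonical form
    text = unicodedata.normalize("NFC", text)
    # Collapse whitespace runs left over from PDF/HTML extraction
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def deduplicate(docs):
    """Drop exact-duplicate documents while preserving order."""
    seen = set()
    return [d for d in docs if not (d in seen or seen.add(d))]


docs = ["Điều 1.  Phạm vi   điều chỉnh", "Điều 1. Phạm vi điều chỉnh"]
print(deduplicate([clean_legal_text(d) for d in docs]))
```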
Example Usage
```python
# Load the tokenizer and model directly from the Hugging Face Hub
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("huyhoangvbck/Qwen3-1.7B-Pretrain-Legal-Document")
model = AutoModelForCausalLM.from_pretrained("huyhoangvbck/Qwen3-1.7B-Pretrain-Legal-Document")
```
Maintainers
This model is developed and maintained by huyhoangvbck. For inquiries, please contact: huyhoangak4@gmail.com