---
license: cc-by-nc-4.0
---
# SentiV

**A Benchmark for Low-Resource Vietnamese Speech–Text Emotion Understanding**

This repository releases datasets, code, and pretrained checkpoints for **SentiV**, a benchmark for Vietnamese emotion understanding across **text**, **speech**, and **multimodal** settings, as described in our paper.

📄 **Paper**: *SentiV: A Benchmark for Low-Resource Vietnamese Speech–Text Emotion Understanding* 

---

## 1. Overview

SentiV focuses on realistic low-resource evaluation for Vietnamese emotion recognition under:

* Label imbalance
* Limited supervision (1–100% label budgets)
* Cross-dataset and cross-modal generalization
* Explicit label-space alignment between text and speech

We release:

* Text emotion dataset (data + code + checkpoints)
* Speech emotion annotations (labels + code + checkpoints)
* Reproducible training and evaluation scripts

---

## 2. Repository Structure

```
sentiv/
├── text-training/
│   ├── model/                # Text model checkpoints
│   ├── train_PhoBERT.py      # Training script (PhoBERT)
│   ├── train.xlsx            # Labeled text data
│   └── readme.MD
│
├── voice-training/
│   ├── hubert-large-ls960/   # Speech model checkpoints
│   ├── label/                # Emotion labels and split manifests (speech)
│   ├── label-text/           # Text samples paired with the speech data, annotated with the emotion labels
│   ├── train_hubert.py       # HuBERT fine-tuning script
│   └── readme.MD
│
└── README.md                 # This file
```

---

## 3. Tasks and Label Space

### Task A: Text Emotion Classification

* Labels (7):
  `Anger, Disgust, Enjoyment, Fear, Neutral, Sadness, Surprise`
* Dataset: social media text (comments, posts)
* Evaluation: Macro-F1, Accuracy
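
As a reference for the two metrics, the sketch below computes Macro-F1 (the unweighted mean of per-class F1) and accuracy in plain Python. It is illustrative only and is not taken from the released evaluation code; the function names and the zero-score convention for classes that never appear are assumptions.

```python
from collections import Counter

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 over `labels`."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # Assumption: a class with no true and no predicted samples scores 0.
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Because every class contributes equally, Macro-F1 penalizes a model that ignores rare emotions even when its accuracy looks high.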

### Task B: Speech Emotion Classification

* Labels (6):
  `Anger (includes Disgust), Enjoyment, Fear, Neutral, Sadness, Surprise`
* Disgust is merged into Anger because it is extremely scarce in the speech data

### Task C: Multimodal Speech–Text Classification

* Same 6-label space as speech
* Late fusion over text and speech logits

---

## 4. Text Modality (text-training)

### Data

* Source: public Vietnamese social media posts
* Size: 265,011 labeled samples
* Average length: ~20 words
* Labels: 7 emotions
* Anonymized and released strictly for research use

### Model

* Backbone: **PhoBERT (vinai/phobert-base)**
* Loss: Focal Loss with class reweighting
* Max sequence length: 256
* Metric: Macro-F1
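
For readers unfamiliar with the loss, here is a minimal, framework-free sketch of focal loss with per-class weights for a single example. The focusing parameter `gamma=2.0` and the inverse-frequency weighting scheme are assumptions made for illustration; the values and weighting used in `train_PhoBERT.py` may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def inverse_frequency_weights(class_counts):
    """Class weights proportional to 1/count, scaled so they average to 1."""
    inv = [1.0 / c for c in class_counts]
    scale = len(inv) / sum(inv)
    return [w * scale for w in inv]

def focal_loss(logits, target, class_weights, gamma=2.0):
    """-w_t * (1 - p_t)^gamma * log(p_t) for one example."""
    p_t = softmax(logits)[target]
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` and unit weights this reduces to ordinary cross-entropy; larger `gamma` shrinks the loss of already well-classified examples so that hard, often minority-class, examples dominate the gradient.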

### Training

```bash
cd text-training
python train_PhoBERT.py
```

The script supports:

* Class imbalance handling
* Oversampling
* Low-resource label budgets
* Fixed train/dev/test splits

---

## 5. Speech Modality (voice-training)

### Data

* Source audio: VietSpeech dataset (batches 0–10)
* We release:

  * Emotion labels
  * Split manifests
  * Training code
* Raw audio must be obtained from the original VietSpeech source under its license

### Label Mapping

* Disgust is merged into Anger: it is too scarce in the speech data to train stably as a separate class
* Final label space: 6 emotions
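
The mapping from the 7-label text space to the 6-label speech space can be sketched as follows. The label order and helper names are hypothetical, and summing the Anger and Disgust probabilities is one plausible way to align a 7-way text prediction with the speech label space, not necessarily the procedure used in the paper.

```python
TEXT_LABELS = ["Anger", "Disgust", "Enjoyment", "Fear", "Neutral", "Sadness", "Surprise"]
SPEECH_LABELS = ["Anger", "Enjoyment", "Fear", "Neutral", "Sadness", "Surprise"]

def to_speech_label(label):
    """Map a 7-class label to the 6-class space (Disgust -> Anger)."""
    return "Anger" if label == "Disgust" else label

def to_speech_id(label):
    """Integer id of a label in the 6-class space (order is an assumption)."""
    return SPEECH_LABELS.index(to_speech_label(label))

def collapse_text_probs(probs7):
    """Fold a 7-way probability vector into 6 classes by summing merged classes."""
    out = [0.0] * len(SPEECH_LABELS)
    for label, p in zip(TEXT_LABELS, probs7):
        out[to_speech_id(label)] += p
    return out
```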

### Model

* Backbone: **HuBERT Large (ls960)**
* Input: 16 kHz audio, max 8 seconds
* Loss: Weighted Cross-Entropy
* Sampler: WeightedRandomSampler
* Metric: Macro-F1
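
Two of the settings above translate directly into small preprocessing steps, sketched here in plain Python. Keeping the first 8 seconds (rather than a random crop) and weighting each sample by the inverse of its class count are assumptions; `train_hubert.py` may implement either step differently.

```python
from collections import Counter

SAMPLE_RATE = 16_000                      # Hz, from the model settings above
MAX_SECONDS = 8
MAX_SAMPLES = SAMPLE_RATE * MAX_SECONDS   # 128,000 samples

def crop_waveform(waveform):
    """Keep at most the first 8 s of a 16 kHz mono waveform (assumed head crop)."""
    return waveform[:MAX_SAMPLES]

def per_sample_weights(labels):
    """Weight each sample by 1 / (size of its class) so classes are drawn evenly."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]
```

In PyTorch, weights of this form are what `torch.utils.data.WeightedRandomSampler(weights, num_samples, replacement=True)` expects as its first argument.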

### Training

```bash
cd voice-training
python train_hubert.py
```

---

## 6. Multimodal Fusion

We adopt **late fusion at the logit level** for reproducibility.

### Fusion Strategy

* Average fusion
* Concatenation + MLP
* **Uncertainty-aware late fusion** (main method)

Each modality's confidence is estimated from the entropy or the maximum probability of its predicted distribution, and the fusion weights are adjusted dynamically to down-weight the less reliable modality.
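
A minimal sketch of the entropy-based variant is given below. It assumes both models emit logits over the same 6-label space, defines confidence as one minus normalized entropy, and normalizes the two confidences into fusion weights. The exact weighting function used in the paper is not specified here, so treat the formula and all function names as illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def confidence(probs):
    """1 - normalized entropy: near 1 for a peaked distribution, 0 for uniform."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return max(0.0, 1.0 - entropy / math.log(len(probs)))

def fuse(text_logits, speech_logits, eps=1e-8):
    """Confidence-weighted average of two logit vectors; returns (fused, weights)."""
    c_text = confidence(softmax(text_logits))
    c_speech = confidence(softmax(speech_logits))
    # eps keeps the weights defined when both modalities are maximally uncertain.
    w_text = (c_text + eps) / (c_text + c_speech + 2 * eps)
    w_speech = 1.0 - w_text
    fused = [w_text * t + w_speech * s for t, s in zip(text_logits, speech_logits)]
    return fused, (w_text, w_speech)
```

When the speech model outputs a near-uniform distribution (for example, on noisy audio), its weight approaches zero and the fused prediction follows the text model; when both are equally confident the result is plain average fusion.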

---

## 7. Low-Resource Evaluation Protocol

* Label budgets: 1%, 5%, 10%, 25%, 50%, 100%
* Fixed test set
* Only training data is subsampled
* 3–5 random seeds per setting
* Results are reported as mean ± std over seeds

This protocol is designed to reflect realistic variance under limited supervision.
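
The protocol can be sketched as a seeded subsampling step followed by aggregation over seeds. Class-stratified sampling and keeping at least one example per class are assumptions added for illustration; the released scripts may subsample differently. The helper names are hypothetical.

```python
import random
import statistics
from collections import defaultdict

BUDGETS = [0.01, 0.05, 0.10, 0.25, 0.50, 1.00]

def subsample_train(labels, budget, seed):
    """Return sorted indices of a class-stratified subset of the training set."""
    rng = random.Random(seed)  # local RNG so each seed is reproducible
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    keep = []
    for idxs in by_class.values():
        k = max(1, round(len(idxs) * budget))  # assumption: at least one per class
        keep.extend(rng.sample(idxs, k))
    return sorted(keep)

def summarize(scores):
    """Mean and sample standard deviation of a metric across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)
```

A typical loop would call `subsample_train(train_labels, budget, seed)` for each budget and seed, train on the selected indices, evaluate on the fixed test set, and pass the per-seed Macro-F1 values to `summarize`.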

---

## 8. Ethics and Licensing

### Text Data

* Collected from publicly available social media
* All user-identifying information removed
* Research-only use
* Takedown requests supported

### Speech Data

* Based on VietSpeech
* Speakers provided research consent
* We release labels and derived artifacts only

Users must comply with the licenses of the original datasets.

---

## 9. Access Policy

This repository is released via Hugging Face with **access control enabled**.

* Users must request access
* Access is granted manually for research purposes
* Redistribution without permission is not allowed

---

## 10. Citation

If you use SentiV, please cite it using the entry below:

```bibtex
@misc{pham_duc_dat_2026,
	author       = { Pham Duc Dat and Ngoc Tram Huynh Thi and Vo Ngoc Minh Anh and Nhan Le Thanh Pham and Le Anh Tien and Tan Duy Le and Kha Tu Huynh },
	title        = { sentiv (Revision 6b39b15) },
	year         = 2026,
	url          = { https://huggingface.co/ducdatit2002/sentiv },
	doi          = { 10.57967/hf/7805 },
	publisher    = { Hugging Face }
}
```