Azeri Handwriting Detection Dataset

Overview

This dataset contains 12 handwritten Azerbaijani document samples with transcriptions, covering a range of real-world document types. It serves as a pilot collection for developing and validating the Azeri handwritten text recognition (HTR) system.

Dataset Statistics:

Total Documents: 12
Total Lines: 80 (avg: 6.7 lines/document)
Total Characters: 2,384 (avg: 198.7 chars/document)
Language: Azerbaijani (Latin script)
Format: HEIC images + TXT transcriptions
Created: December 13, 2025

Directory Structure

data/
├── images/          # 12 HEIC image files
│   ├── 01.HEIC  →  az_formal_letter_01.txt
│   ├── 02.HEIC  →  az_handwritten_note_02.txt
│   ├── 03.HEIC  →  az_numeric_mixed_03.txt
│   ├── 04.HEIC  →  az_medical_form_04.txt
│   ├── 05.HEIC  →  az_utility_application_05.txt
│   ├── 06.HEIC  →  az_bank_statement_06.txt
│   ├── 07.HEIC  →  az_education_text_07.txt
│   ├── 08.HEIC  →  az_address_list_08.txt
│   ├── 09.HEIC  →  az_technical_report_09.txt
│   ├── 10.HEIC  →  az_contract_clause_10.txt
│   ├── 11.HEIC  →  az_daily_diary_11.txt
│   └── 12.HEIC  →  az_tabular_text_12.txt
├── labels/          # 12 transcription files
│   └── az_*_##.txt
└── README.md        # This file

Naming Convention: The number at the end of each label filename (e.g., _01, _12) corresponds exactly to the image number (e.g., 01.HEIC, 12.HEIC).
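
The naming convention above can be used to pair images with labels programmatically. The following is a minimal sketch, assuming the filenames follow the pattern shown in the directory tree; `pair_images_with_labels` and `LABEL_RE` are illustrative names, not part of the project.

```python
import re
from pathlib import Path

# Label names look like az_<type>_<NN>.txt; capture the two-digit ID
LABEL_RE = re.compile(r"^az_[a-z_]+_(\d{2})\.txt$")

def pair_images_with_labels(image_names, label_names):
    """Map each image filename to the label filename sharing its two-digit ID."""
    labels_by_id = {}
    for name in label_names:
        match = LABEL_RE.match(name)
        if match:
            labels_by_id[match.group(1)] = name
    pairs = {}
    for name in image_names:
        doc_id = Path(name).stem  # "01.HEIC" -> "01"
        if doc_id in labels_by_id:
            pairs[name] = labels_by_id[doc_id]
    return pairs

pairs = pair_images_with_labels(
    ["01.HEIC", "12.HEIC"],
    ["az_formal_letter_01.txt", "az_tabular_text_12.txt"],
)
print(pairs)
```

Images without a matching label are silently skipped here; a stricter version could raise an error instead.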

Document Types

The dataset contains 12 diverse document types representing real-world Azerbaijani documents:

| ID | Document Type | Lines | Chars | Description |
| --- | --- | --- | --- | --- |
| 01 | Formal Letter | 9 | 196 | Official business letter to director about 2025 reports |
| 02 | Handwritten Note | 8 | 171 | Personal reminder about bank appointment |
| 03 | Numeric Mixed | 6 | 147 | Contract with numbers, dates, amounts (VAT calculation) |
| 04 | Medical Form | 8 | 215 | Patient form with diagnosis and prescriptions |
| 05 | Utility Application | 8 | 251 | Electricity complaint with meter reading |
| 06 | Bank Statement | 6 | 165 | Account transactions with debits/credits |
| 07 | Education Text | 6 | 240 | Constitutional text about education rights |
| 08 | Address List | 7 | 184 | Addresses from Baku, Ganja, Sumqayit cities |
| 09 | Technical Report | 4 | 216 | System performance analysis (CPU, disk) |
| 10 | Contract Clause | 8 | 257 | Legal contract clause text |
| 11 | Daily Diary | 6 | 184 | Personal diary entry about work project |
| 12 | Tabular Text | 4 | 158 | Employee table with names, ages, departments |

Complexity Levels

Simple (4-6 lines):

Tabular text (12), Technical report (09), Numeric mixed (03)
Short, structured content

Medium (6-7 lines):

Bank statement (06), Education (07), Address list (08)
Moderate length with mixed content

Complex (8-9 lines):

Medical form (04), Handwritten note (02), Contract (10)
Longer documents with varied formatting

Image Characteristics

Format: HEIC (High Efficiency Image Container)

Codec: HEVC (H.265); HEIC is the default photo format on recent Apple devices
File Sizes: 820KB - 1.0MB per image (average: ~890KB)
Total Size: ~10.5MB

Important: Pillow does not read HEIC without a plugin such as pillow_heif, so convert the images to PNG/JPG before PyTorch processing:

from PIL import Image
import pillow_heif

pillow_heif.register_heif_opener()
img = Image.open('01.HEIC').convert('L')  # Convert to grayscale
img.save('01.png')

Label File Format

Format:

Line_number→Transcribed_text

Example (az_tabular_text_12.txt):

     1→Adı        | Yaşı | Şöbə
     2→-------------------------
     3→Rauf       | 32   | Maliyyə
     4→Aysel      | 28   | İnsan resursları
     5→Kamal      | 41   | İT dəstəyi

Characteristics:

Line numbers with leading spaces
Arrow delimiter (→) separates line number from text
Preserves original spacing and formatting
Includes punctuation and special characters exactly as written
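
A label file in this format can be parsed by splitting each line on the first arrow. This is a minimal sketch under the assumption that every transcribed line follows the `number→text` pattern shown above; `parse_label_text` is an illustrative helper, not an existing project function.

```python
def parse_label_text(text):
    """Return the transcribed lines of a label file, in line-number order."""
    numbered = []
    for raw in text.splitlines():
        if "→" not in raw:
            continue  # skip blank or malformed lines
        # Split on the first arrow only, so arrows inside the text survive
        number, transcription = raw.split("→", 1)
        numbered.append((int(number.strip()), transcription))
    return [t for _, t in sorted(numbered)]

sample = (
    "     1→Adı        | Yaşı | Şöbə\n"
    "     2→-------------------------\n"
    "     3→Rauf       | 32   | Maliyyə\n"
)
lines = parse_label_text(sample)
print(lines)
```

The transcription is returned untouched, so the original spacing inside each line is preserved.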

Azerbaijani Language Statistics

Character Distribution (Top 15)

Space:  262 occurrences (11% - word boundaries)
a:      154 (6.5%)
i:      137 (5.7%)
ə:      125 (5.2%) ← Azerbaijani-specific schwa
r:       98 (4.1%)
l:       91 (3.8%)
n:       90 (3.8%)
ı:       66 (2.8%) ← Azerbaijani dotless i
s:       66 (2.8%)
m:       57 (2.4%)
d:       51 (2.1%)
t:       49 (2.1%)
e:       42 (1.8%)
u:       39 (1.6%)
-:       35 (1.5%) ← Hyphenation

Azerbaijani-Specific Characters (Critical)

Lowercase:

ə: 125  (schwa - most common special character)
ı:  66  (dotless i)
ş:  27  (s with cedilla)
ü:  27  (u with diaeresis)
ğ:  13  (g with breve)
ö:   7  (o with diaeresis)
ç:   7  (c with cedilla)

Uppercase:

Ə:   4
İ:   3  (dotted capital I, the uppercase of i in Azerbaijani and Turkish)
Ş:   2
Ü:   1
Ö:   1

Total Azerbaijani-specific characters: 263 (11% of all characters)

Key Insight: Azerbaijani diacritics (ə, ı, ş, ü, ğ, ö, ç) are essential and must be preserved by the HTR model.
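
Counts like the ones above can be recomputed from the transcriptions with a short script. The sketch below is an assumption about how such a tally might be done, not the script used to produce the published figures; `AZ_SPECIFIC` and `count_az_specific` are illustrative names.

```python
import unicodedata
from collections import Counter

# Letters of the Azerbaijani Latin alphabet that fall outside a-z / A-Z
AZ_SPECIFIC = set("əışüğöçƏİŞÜĞÖÇ")

def count_az_specific(text):
    """Return per-character counts and the total of Azerbaijani-specific letters."""
    # NFC turns "letter + combining mark" pairs into single code points
    normalized = unicodedata.normalize("NFC", text)
    counts = Counter(ch for ch in normalized if ch in AZ_SPECIFIC)
    return counts, sum(counts.values())

counts, total = count_az_specific("Şikayətlər: baş ağrısı, halsızlıq")
print(dict(counts), total)
```

Normalizing to NFC first matters because a decomposed "o" plus combining diaeresis would otherwise be missed.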

Content Analysis

Character Breakdown

Letters (a-z, A-Z):  ~1,550 (65%)
Azerbaijani chars:      263 (11%)
Spaces:                 262 (11%)
Numbers:                ~150 (6%)
Punctuation:            ~160 (7%)

Numeric & Special Content

Numbers Present:

Dates: 14.06.2024, 01.02.2025, 03.11.1987
Amounts: 12 750.45 AZN, 15 045.53 AZN, 304.50 AZN
Percentages: 18%, 67%
Phone numbers: 050-3456789
Contract numbers: № 457/23
Account numbers: AZ21NABZ

Punctuation & Symbols:

Hyphens (35) - word breaks at line wraps, the table separator row, phone numbers
Periods (34) - dates, decimals, abbreviations, sentence ends
Commas (19) - clause and list separators
Colons (15) - field labels
Pipe symbols (|) - table formatting

Domain-Specific Vocabulary

The dataset contains rich domain terminology across multiple sectors:

Financial:

müqavilə (contract), məbləğ (amount), ƏDV (VAT)
hesabat (report), hesab nömrəsi (account number)
Maaş (salary), Komunal (utilities)

Medical:

pasiyent (patient), diaqnoz (diagnosis)
baş ağrısı (headache), arterial hipertenziya (hypertension)
dərman (medicine)

Legal/Formal:

Hörmətli (Dear/Honorable), direktor (director)
Konstitusiya (Constitution), qanunvericilik (legislation)
qarşılıqlı razılaşma (mutual agreement)

Technical:

sistem performans (system performance)
CPU yüklənməsi (CPU load), disk oxunuş (disk read)

Personal/General:

xahiş edirəm (I request), bildiririk (we inform)
ünvan (address), rayon (district)

Text Features & Challenges

Line Breaking & Hyphenation

Multiple documents show mid-word line breaks with hyphens:

hazır-lanması (preparation; split across lines)
yaran-mış (arising)
araşdırıl-masını (investigation)
hiper-tenziya (hypertension)

Implication: HTR model needs line-level recognition, and post-processing must reconstruct hyphenated words.
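
One way to reconstruct hyphenated words after line-level recognition is to merge a line that ends in "letter + hyphen" with the line that follows. This is a minimal heuristic sketch, not the project's post-processing step; `join_hyphenated_lines` is an illustrative name, and the rule will also join genuine compound words that happen to break at a hyphen.

```python
def join_hyphenated_lines(lines):
    """Merge a line ending in '-' with the next line when the break looks mid-word."""
    merged = []
    carry = ""
    for line in lines:
        # Prepend any word fragment carried over from the previous line
        text = carry + line.lstrip() if carry else line
        carry = ""
        stripped = text.rstrip()
        # "letter + hyphen" at the end of a line is treated as a mid-word break;
        # rows made only of hyphens (table separators) do not match
        if len(stripped) >= 2 and stripped.endswith("-") and stripped[-2].isalpha():
            carry = stripped[:-1]
        else:
            merged.append(text)
    if carry:
        merged.append(carry + "-")  # trailing hyphen with no continuation
    return merged

joined = join_hyphenated_lines(
    ["üzrə hesabatların hazır-", "lanması başa çatmaq", "üzrədir."]
)
print(joined)
```

Because merging changes the number of lines, this step belongs after recognition, on the output text, rather than on the line-level training labels.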

Tabular Formatting

Document 12 contains table structure:

```
Adı        | Yaşı | Şöbə
-------------------------
Rauf       | 32   | Maliyyə
Aysel      | 28   | İnsan resursları
```

**Implication:** Stage 1 (layout detection) is critical for preserving table structure.

### Mixed Case Usage

- **Proper nouns:** `Bakı`, `Gəncə`, `Neftçilər prospekti`
- **Abbreviations:** `ƏDV`, `AZN`, `ATM`, `CPU`, `IT`
- **Sentence case:** Standard for regular text

---

## Data Quality Assessment

### Strengths

✅ **Diverse document types** - Covers real-world use cases across multiple domains

✅ **Rich Azerbaijani vocabulary** - Proper diacritics preserved throughout

✅ **Mixed content** - Text, numbers, tables, addresses

✅ **Domain variety** - Medical, legal, financial, technical, personal

✅ **Proper formatting** - Line-level transcriptions with structure preservation

✅ **Clean transcriptions** - Accurate character-level annotations

### Limitations

⚠️ **Small dataset size** - Only 12 samples (insufficient for production training)

⚠️ **No writer diversity info** - Unknown if single/multiple writers

⚠️ **HEIC format** - Requires preprocessing for PyTorch

⚠️ **No bounding boxes** - Transcriptions are split by line, but images are page-level with no line coordinates or crops

⚠️ **No validation split** - Need to define train/val/test splits

⚠️ **No image metadata** - Resolution, DPI, quality information missing

⚠️ **Insufficient for LM training** - Only 2.4K chars vs. 100K-1M recommended

---

## Preprocessing Requirements

Before training, the following preprocessing steps are required:

### 1. Convert HEIC to PNG/JPG

```python
import os

import pillow_heif
from PIL import Image

# Register the HEIF/HEIC opener so Image.open() can read .HEIC files
pillow_heif.register_heif_opener()

image_dir = 'images/'  # relative to data/
for heic_file in sorted(os.listdir(image_dir)):
    stem, ext = os.path.splitext(heic_file)
    if ext.lower() == '.heic':  # match both .HEIC and .heic
        img = Image.open(os.path.join(image_dir, heic_file)).convert('L')  # grayscale
        img.save(os.path.join(image_dir, stem + '.png'))
```

2. Line Segmentation

Extract bounding boxes for each line from page images:

Use layout detection (YOLOv8) or manual annotation
Create line-level image crops
Map each line crop to its transcription

3. Create Vocabulary File

import glob
import json

# Label files live in labels/ (relative to data/)
label_files = sorted(glob.glob('labels/az_*.txt'))

# Extract all unique characters from labels
chars = set()
for label_file in label_files:
    with open(label_file, 'r', encoding='utf-8') as f:
        for line in f:
            # Drop the "line_number→" prefix and the trailing newline on every line
            text = line.rstrip('\n').split('→', 1)[-1]
            chars.update(text)

# Create vocabulary mapping
vocab = {char: idx for idx, char in enumerate(sorted(chars))}
vocab['[BLANK]'] = len(vocab)  # CTC blank token

with open('vocab.json', 'w', encoding='utf-8') as f:
    json.dump(vocab, f, ensure_ascii=False, indent=2)

4. Define Data Splits

Recommended split (document-wise to prevent data leakage):

train.txt: 01,02,03,04,05,06,07,08,09  (75% - 9 documents)
val.txt:   10,11                        (17% - 2 documents)
test.txt:  12                           (8% - 1 document)
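
The split above can be written down as data and sanity-checked before training. A minimal sketch; `SPLITS` and `check_splits` are illustrative names, and the document IDs are the ones listed in the recommended split.

```python
SPLITS = {
    "train": ["01", "02", "03", "04", "05", "06", "07", "08", "09"],
    "val": ["10", "11"],
    "test": ["12"],
}

def check_splits(splits, n_documents=12):
    """Verify that no document is shared between splits; return percentage shares."""
    all_ids = [doc_id for ids in splits.values() for doc_id in ids]
    # A document in two splits would leak lines from training into evaluation
    assert len(all_ids) == len(set(all_ids)), "a document appears in two splits"
    assert len(all_ids) == n_documents, "some documents are unassigned"
    return {name: round(100 * len(ids) / n_documents) for name, ids in splits.items()}

shares = check_splits(SPLITS)
print(shares)
```

Splitting by document rather than by line keeps all lines of one page, and therefore one handwriting sample, on the same side of the split.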

Recommended Character Set

Based on dataset analysis, the vocabulary should include:

Latin Letters:

Lowercase: a-z
Uppercase: A-Z

Azerbaijani Characters:

Lowercase: ə, ç, ğ, ı, ö, ş, ü
Uppercase: Ə, Ç, Ğ, İ, Ö, Ş, Ü

Digits: 0-9

Punctuation & Symbols:

. , : ; - – — ( ) [ ] / |
" ' « » ? ! № % + =

Special:

Space character
CTC blank token

Estimated Vocabulary Size: ~100 characters
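
The character set listed above can be assembled in code and used to flag unexpected characters in a transcription. This is a sketch under the assumption that the set is exactly the one listed; `CHARSET` and `out_of_charset` are illustrative names, and the CTC blank is left out because it is a model token rather than a text character.

```python
import string
import unicodedata

AZ_LOWER = "əçğıöşü"
AZ_UPPER = "ƏÇĞİÖŞÜ"
# \u2013 is the en dash and \u2014 the em dash from the punctuation list
PUNCTUATION = ".,:;-\u2013\u2014()[]/|\"'«»?!№%+="

CHARSET = set(
    string.ascii_lowercase + string.ascii_uppercase
    + AZ_LOWER + AZ_UPPER + string.digits + PUNCTUATION + " "
)

def out_of_charset(text):
    """Return the sorted characters of text that are not in CHARSET."""
    # NFC makes decomposed diacritics compare equal to the precomposed letters
    normalized = unicodedata.normalize("NFC", text)
    return sorted({ch for ch in normalized if ch not in CHARSET and ch != "\n"})

print(len(CHARSET))
print(out_of_charset("Hörmətli cənab direktor,"))
```

Running `out_of_charset` over every label file is a cheap check for stray characters, such as the arrow delimiter leaking into the text.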

Usage Guidelines

For Model Training

1. Data Augmentation is Critical - With only 12 samples, heavy augmentation is mandatory:
   - Rotation: ±3°
   - Scaling: 0.9-1.1
   - Elastic distortion
   - Blur, noise, random erasing
   - Synthetic overlays
2. Start with HTR-Lite - Use the lightweight model variant for proof-of-concept
3. Character-Level Tokenization - Recommended for Azerbaijani language
4. Preserve Diacritics - Critical for maintaining language integrity

For Dataset Expansion

Immediate Actions:

1. Collect More Data - Current 2.4K chars is far below recommended 100K-1M
   - Photograph additional handwritten documents
   - Use synthetic data generation
   - Apply pseudo-labeling on unlabeled scans
2. Line-Level Annotation - Convert page-level to line-level:
   - Extract individual line bounding boxes
   - Crop and save as separate line images
   - Create line-level transcriptions
3. Metadata Collection - Document:
   - Writer information (for stratified splits)
   - Image resolution and DPI
   - Document quality scores
   - Date of collection
4. Quality Control - Verify:
   - Transcription accuracy
   - Diacritic correctness
   - Proper character encoding (UTF-8)

Integration with Architecture

Alignment with Planned System (plan.md)

| Planned Feature | Current Data Status |
| --- | --- |
| Character-level vocab | ✅ Azerbaijani chars present |
| Document-wise split | ⚠️ Not defined yet |
| Line-level images | ❌ Only page-level currently |
| Bounding boxes | ❌ Not annotated |
| Mixed content | ✅ Numbers, text, tables |
| Domain diversity | ✅ 12 document types |
| 100K+ tokens for LM | ❌ Only 2.4K chars |

Next Steps for Implementation

1. Preprocessing Pipeline:
   - Convert HEIC → PNG
   - Segment pages into lines
   - Extract line bounding boxes
2. Dataset Preparation:
   - Create train/val/test splits
   - Generate vocab.json
   - Build data loader with augmentation
3. Baseline Model:
   - Train HTR-Lite on augmented data
   - Evaluate on validation set
   - Analyze error patterns
4. Data Expansion:
   - Collect 100+ more documents
   - Implement active learning loop
   - Build Azerbaijani language model corpus

Example Label Samples

Document 01 - Formal Letter

Hörmətli cənab direktor,
Bu məktub vasitəsilə
bildiririk ki,
2025-ci il
üzrə hesabatların
hazırlanması başa çatmaq
üzrədir.

Document 04 - Medical Form

Pasiyentin adı, soyadı:
Əliyev Rəşad Kamran oğlu
Doğum tarixi: 03.11.1987
Şikayətlər: baş ağrısı, halsızlıq,
yuxusuzluq
Diaqnoz: arterial hipertenziya

Document 12 - Tabular Text

Adı        | Yaşı | Şöbə
-------------------------
Rauf       | 32   | Maliyyə
Aysel      | 28   | İnsan resursları
Kamal      | 41   | İT dəstəyi

References

Project Plan: See ../plan.md for full architecture specification
Character Encoding: UTF-8
Language: Azerbaijani (Latin script, ISO 639-1: az)
Image Format: HEIC (requires conversion to PNG/JPG)

License & Usage

This dataset is collected for developing the Azeri Handwriting Detection system. Please ensure proper handling of any personal information that may appear in the documents.

Last Updated: December 13, 2025
Dataset Version: 1.0 (Pilot)
Total Samples: 12 documents, 80 lines, 2,384 characters
