YAML Metadata Warning: empty or missing yaml metadata in repo card

Check out the documentation for more information.

Azeri Handwriting Detection Dataset

Overview

This dataset contains 12 handwritten Azerbaijani document samples with transcriptions, representing diverse real-world document types. The dataset serves as a pilot collection for developing and validating the Azeri Handwriting Recognition (HTR) system.

Dataset Statistics:

  • Total Documents: 12
  • Total Lines: 80 (avg: 6.7 lines/document)
  • Total Characters: 2,384 (avg: 198.7 chars/document)
  • Language: Azerbaijani (Latin script)
  • Format: HEIC images + TXT transcriptions
  • Created: December 13, 2025

Directory Structure

data/
├── images/          # 12 HEIC image files
│   ├── 01.HEIC  →  az_formal_letter_01.txt
│   ├── 02.HEIC  →  az_handwritten_note_02.txt
│   ├── 03.HEIC  →  az_numeric_mixed_03.txt
│   ├── 04.HEIC  →  az_medical_form_04.txt
│   ├── 05.HEIC  →  az_utility_application_05.txt
│   ├── 06.HEIC  →  az_bank_statement_06.txt
│   ├── 07.HEIC  →  az_education_text_07.txt
│   ├── 08.HEIC  →  az_address_list_08.txt
│   ├── 09.HEIC  →  az_technical_report_09.txt
│   ├── 10.HEIC  →  az_contract_clause_10.txt
│   ├── 11.HEIC  →  az_daily_diary_11.txt
│   └── 12.HEIC  →  az_tabular_text_12.txt
├── labels/          # 12 transcription files
│   └── az_*_##.txt
└── README.md        # This file

Naming Convention: The number at the end of each label filename (e.g., _01, _12) corresponds exactly to the image number (e.g., 01.HEIC, 12.HEIC).


Document Types

The dataset contains 12 diverse document types representing real-world Azerbaijani documents:

ID Document Type Lines Chars Description
01 Formal Letter 9 196 Official business letter to director about 2025 reports
02 Handwritten Note 8 171 Personal reminder about bank appointment
03 Numeric Mixed 6 147 Contract with numbers, dates, amounts (VAT calculation)
04 Medical Form 8 215 Patient form with diagnosis and prescriptions
05 Utility Application 8 251 Electricity complaint with meter reading
06 Bank Statement 6 165 Account transactions with debits/credits
07 Education Text 6 240 Constitutional text about education rights
08 Address List 7 184 Addresses from Baku, Ganja, Sumqayit cities
09 Technical Report 4 216 System performance analysis (CPU, disk)
10 Contract Clause 8 257 Legal contract clause text
11 Daily Diary 6 184 Personal diary entry about work project
12 Tabular Text 4 158 Employee table with names, ages, departments

Complexity Levels

Simple (4-6 lines):

  • Tabular text (12), Technical report (09), Numeric mixed (03)
  • Short, structured content

Medium (6-8 lines):

  • Bank statement (06), Education (07), Address list (08)
  • Moderate length with mixed content

Complex (8-9 lines):

  • Medical form (04), Handwritten note (02), Contract (10)
  • Longer documents with varied formatting

Image Characteristics

Format: HEIC (High Efficiency Image Container)

  • Codec: HEVC (H.265) - Apple's modern image format
  • File Sizes: 820KB - 1.0MB per image (average: ~890KB)
  • Total Size: ~10.5MB

Important: HEIC format requires conversion to PNG/JPG for PyTorch processing:

from PIL import Image
import pillow_heif

pillow_heif.register_heif_opener()
img = Image.open('01.HEIC').convert('L')  # Convert to grayscale
img.save('01.png')

Label File Format

Format:

Line_number→Transcribed_text

Example (az_tabular_text_12.txt):

     1→Adı        | Yaşı | Şöbə
     2→-------------------------
     3→Rauf       | 32   | Maliyyə
     4→Aysel      | 28   | İnsan resursları
     5→Kamal      | 41   | İT dəstəyi

Characteristics:

  • Line numbers with leading spaces
  • Arrow delimiter () separates line number from text
  • Preserves original spacing and formatting
  • Includes punctuation and special characters exactly as written

Azerbaijani Language Statistics

Character Distribution (Top 15)

Space:  262 occurrences (11% - word boundaries)
a:      154 (6.5%)
i:      137 (5.7%)
ə:      125 (5.2%) ← Azerbaijani-specific schwa
r:       98 (4.1%)
l:       91 (3.8%)
n:       90 (3.8%)
ı:       66 (2.8%) ← Azerbaijani dotless i
s:       66 (2.8%)
m:       57 (2.4%)
d:       51 (2.1%)
t:       49 (2.1%)
e:       42 (1.8%)
u:       39 (1.6%)
-:       35 (1.5%) ← Hyphenation

Azerbaijani-Specific Characters (Critical)

Lowercase:

ə: 125  (schwa - most common special character)
ı:  66  (dotless i)
ş:  27  (s with cedilla)
ü:  27  (u with diaeresis)
ğ:  13  (g with breve)
ö:   7  (o with diaeresis)
ç:   7  (c with cedilla)

Uppercase:

Ə:   4
İ:   3  (i with dot - Turkish/Azeri uppercase)
Ş:   2
Ü:   1
Ö:   1

Total Azerbaijani-specific characters: 263 (11% of all characters)

Key Insight: Azerbaijani diacritics (ə, ı, ş, ü, ğ, ö, ç) are essential and must be preserved by the HTR model.


Content Analysis

Character Breakdown

Letters (a-z, A-Z):  ~1,550 (65%)
Azerbaijani chars:      263 (11%)
Spaces:                 262 (11%)
Numbers:                ~150 (6%)
Punctuation:            ~160 (7%)

Numeric & Special Content

Numbers Present:

  • Dates: 14.06.2024, 01.02.2025, 03.11.1987
  • Amounts: 12 750.45 AZN, 15 045.53 AZN, 304.50 AZN
  • Percentages: 18%, 67%
  • Phone numbers: 050-3456789
  • Contract numbers: № 457/23
  • Account numbers: AZ21NABZ

Punctuation & Symbols:

  • Hyphens (35) - word breaks, line wrapping
  • Periods (34) - decimals, abbreviations
  • Commas (19) - number separators
  • Colons (15) - field labels
  • Pipe symbols (|) - table formatting

Domain-Specific Vocabulary

The dataset contains rich domain terminology across multiple sectors:

Financial:

  • müqavilə (contract), məbləğ (amount), ƏDV (VAT)
  • hesabat (report), hesab nömrəsi (account number)
  • Maaş (salary), Komunal (utilities)

Medical:

  • pasiyent (patient), diaqnoz (diagnosis)
  • baş ağrısı (headache), arterial hipertenziya (hypertension)
  • dərman (medicine)

Legal/Formal:

  • Hörmətli (Dear/Honorable), direktor (director)
  • Konstitusiya (Constitution), qanunvericilik (legislation)
  • qarşılıqlı razılaşma (mutual agreement)

Technical:

  • sistem performans (system performance)
  • CPU yüklənməsi (CPU load), disk oxunuş (disk read)

Personal/General:

  • xahiş edirəm (I request), bildiririk (we inform)
  • ünvan (address), rayon (district)

Text Features & Challenges

Line Breaking & Hyphenation

Multiple documents show mid-word line breaks with hyphens:

  • hazır-lanması (prepared, split across lines)
  • yaran-mış (arising)
  • araşdırıl-masını (investigation)
  • hiper-tenziya (hypertension)

Implication: HTR model needs line-level recognition, and post-processing must reconstruct hyphenated words.

Tabular Formatting

Document 12 contains table structure: ``` Adı | Yaşı | Şöbə

Rauf | 32 | Maliyyə Aysel | 28 | İnsan resursları


**Implication:** Stage 1 (layout detection) is critical for preserving table structure.

### Mixed Case Usage

- **Proper nouns:** `Bakı`, `Gəncə`, `Neftçilər prospekti`
- **Abbreviations:** `ƏDV`, `AZN`, `ATM`, `CPU`, `IT`
- **Sentence case:** Standard for regular text

---

## Data Quality Assessment

### Strengths

✅ **Diverse document types** - Covers real-world use cases across multiple domains

✅ **Rich Azerbaijani vocabulary** - Proper diacritics preserved throughout

✅ **Mixed content** - Text, numbers, tables, addresses

✅ **Domain variety** - Medical, legal, financial, technical, personal

✅ **Proper formatting** - Line-level transcriptions with structure preservation

✅ **Clean transcriptions** - Accurate character-level annotations

### Limitations

⚠️ **Small dataset size** - Only 12 samples (insufficient for production training)

⚠️ **No writer diversity info** - Unknown if single/multiple writers

⚠️ **HEIC format** - Requires preprocessing for PyTorch

⚠️ **No bounding boxes** - Labels are page-level, not line-level

⚠️ **No validation split** - Need to define train/val/test splits

⚠️ **No image metadata** - Resolution, DPI, quality information missing

⚠️ **Insufficient for LM training** - Only 2.4K chars vs. 100K-1M recommended

---

## Preprocessing Requirements

Before training, the following preprocessing steps are required:

### 1. Convert HEIC to PNG/JPG

```python
import pillow_heif
from PIL import Image
import os

pillow_heif.register_heif_opener()

for heic_file in os.listdir('images/'):
    if heic_file.endswith('.HEIC'):
        img_path = os.path.join('images/', heic_file)
        img = Image.open(img_path).convert('L')  # Grayscale
        png_path = img_path.replace('.HEIC', '.png')
        img.save(png_path)

2. Line Segmentation

Extract bounding boxes for each line from page images:

  • Use layout detection (YOLOv8) or manual annotation
  • Create line-level image crops
  • Map each line crop to its transcription

3. Create Vocabulary File

import json

# Extract all unique characters from labels
chars = set()
for label_file in label_files:
    with open(label_file, 'r', encoding='utf-8') as f:
        text = f.read()
        # Remove line numbers and arrow delimiter
        text = '→'.join(text.split('→')[1:]) if '→' in text else text
        chars.update(text)

# Create vocabulary mapping
vocab = {char: idx for idx, char in enumerate(sorted(chars))}
vocab['[BLANK]'] = len(vocab)  # CTC blank token

with open('vocab.json', 'w', encoding='utf-8') as f:
    json.dump(vocab, f, ensure_ascii=False, indent=2)

4. Define Data Splits

Recommended split (document-wise to prevent data leakage):

train.txt: 01,02,03,04,05,06,07,08,09  (75% - 9 documents)
val.txt:   10,11                        (17% - 2 documents)
test.txt:  12                           (8% - 1 document)

Recommended Character Set

Based on dataset analysis, the vocabulary should include:

Latin Letters:

  • Lowercase: a-z
  • Uppercase: A-Z

Azerbaijani Characters:

  • Lowercase: ə, ç, ğ, ı, ö, ş, ü
  • Uppercase: Ə, Ç, Ğ, İ, Ö, Ş, Ü

Digits: 0-9

Punctuation & Symbols:

. , : ; - – — ( ) [ ] / |
" ' « » ? ! № % + =

Special:

  • Space character
  • CTC blank token

Estimated Vocabulary Size: ~100 characters


Usage Guidelines

For Model Training

  1. Data Augmentation is Critical - With only 12 samples, heavy augmentation is mandatory:

    • Rotation: ±3°
    • Scaling: 0.9-1.1
    • Elastic distortion
    • Blur, noise, random erasing
    • Synthetic overlays
  2. Start with HTR-Lite - Use the lightweight model variant for proof-of-concept

  3. Character-Level Tokenization - Recommended for Azerbaijani language

  4. Preserve Diacritics - Critical for maintaining language integrity

For Dataset Expansion

Immediate Actions:

  1. Collect More Data - Current 2.4K chars is far below recommended 100K-1M

    • Photograph additional handwritten documents
    • Use synthetic data generation
    • Apply pseudo-labeling on unlabeled scans
  2. Line-Level Annotation - Convert page-level to line-level:

    • Extract individual line bounding boxes
    • Crop and save as separate line images
    • Create line-level transcriptions
  3. Metadata Collection - Document:

    • Writer information (for stratified splits)
    • Image resolution and DPI
    • Document quality scores
    • Date of collection
  4. Quality Control - Verify:

    • Transcription accuracy
    • Diacritic correctness
    • Proper character encoding (UTF-8)

Integration with Architecture

Alignment with Planned System (plan.md)

Planned Feature Current Data Status
Character-level vocab ✅ Azerbaijani chars present
Document-wise split ⚠️ Not defined yet
Line-level images ❌ Only page-level currently
Bounding boxes ❌ Not annotated
Mixed content ✅ Numbers, text, tables
Domain diversity ✅ 12 document types
100K+ tokens for LM ❌ Only 2.4K chars

Next Steps for Implementation

  1. Preprocessing Pipeline:

    • Convert HEIC → PNG
    • Segment pages into lines
    • Extract line bounding boxes
  2. Dataset Preparation:

    • Create train/val/test splits
    • Generate vocabulary.json
    • Build data loader with augmentation
  3. Baseline Model:

    • Train HTR-Lite on augmented data
    • Evaluate on validation set
    • Analyze error patterns
  4. Data Expansion:

    • Collect 100+ more documents
    • Implement active learning loop
    • Build Azerbaijani language model corpus

Example Label Samples

Document 01 - Formal Letter

Hörmətli cənab direktor,
Bu məktub vasitəsilə
bildiririk ki,
2025-ci il
üzrə hesabatların
hazırlanması başa çatmaq
üzrədir.

Document 04 - Medical Form

Pasiyentin adı, soyadı:
Əliyev Rəşad Kamran oğlu
Doğum tarixi: 03.11.1987
Şikayətlər: baş ağrısı, halsızlıq,
yuxusuzluq
Diaqnoz: arterial hipertenziya

Document 12 - Tabular Text

Adı        | Yaşı | Şöbə
-------------------------
Rauf       | 32   | Maliyyə
Aysel      | 28   | İnsan resursları
Kamal      | 41   | İT dəstəyi

References

  • Project Plan: See ../plan.md for full architecture specification
  • Character Encoding: UTF-8
  • Language: Azerbaijani (Latin script, ISO 639-1: az)
  • Image Format: HEIC (requires conversion to PNG/JPG)

License & Usage

This dataset is collected for developing the Azeri Handwriting Detection system. Please ensure proper handling of any personal information that may appear in the documents.


Last Updated: December 13, 2025 Dataset Version: 1.0 (Pilot) Total Samples: 12 documents, 80 lines, 2,384 characters

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support