---
language:
- en
- fr
license: agpl-3.0
library_name: transformers
base_model: Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
tags:
- construction
- visual-analysis
- safety-inspection
- vlm
- qwen2_5_vl
- qwen2-vl
- lora
- horama
- btp
- structured-output
- json
- image-to-json
- peft
- safetensors
model-index:
- name: Horama_BTP
  results: []
---

<div align="center">

# HORAMA-BTP

### Vision-Language Model for Construction Site Analysis

**Image &rarr; Structured JSON** | Built on Qwen2.5-VL | Fine-tuned with LoRA

[![Model](https://img.shields.io/badge/Model-3B_params-blue)]()
[![License](https://img.shields.io/badge/License-AGPL--3.0-green)]()
[![Format](https://img.shields.io/badge/Output-Structured_JSON-orange)]()
[![Framework](https://img.shields.io/badge/Framework-Transformers-yellow)]()

---

*Horama-BTP transforms construction site photographs into comprehensive, machine-readable JSON reports covering progress tracking, safety compliance, quality assessment, and logistics.*

</div>

## Overview

**Horama-BTP** is a domain-specialized Vision-Language Model (VLM) that converts construction site images into structured JSON analyses. Given a single photograph, the model produces a detailed report spanning **15 analysis dimensions** -- from construction progress estimation and safety compliance to quality defects and environmental impact.

The model enforces a strict, validated JSON schema with controlled vocabularies, confidence scores, and an evidence-linking system that traces every observation back to visual evidence in the image.

### Key Capabilities

| Dimension | What the model extracts |
|---|---|
| **Progress** | Construction stage (earthworks &rarr; commissioning), estimated % completion, detected milestones |
| **Safety** | PPE compliance per worker, hazard identification (9 types), control measures present/missing |
| **Quality** | Structural defects (cracks, misalignment, corrosion...), non-conformities |
| **Observations** | Objects, materials, equipment, personnel, vehicles, structural parts with attributes |
| **Logistics** | Materials on site, equipment status (idle/operating), access constraints |
| **Environment** | Dust, noise, waste, spills; waste management assessment |
| **Evidence** | Traceable evidence entries with unique IDs linking every finding to visual proof |

## Architecture

```
Input Image ───┐
               ├──► Qwen2.5-VL-3B-Instruct ──► LoRA-adapted layers ──► Structured JSON
System Prompt ─┘         (backbone)              (r=32, alpha=64)
```

| Component | Details |
|---|---|
| **Backbone** | [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) -- 3B parameter multimodal transformer |
| **Adaptation** | LoRA (Low-Rank Adaptation) applied to all attention and MLP projections |
| **Target Modules** | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| **LoRA Rank** | r=32, alpha=64 (2x scaling), dropout=0.1 |
| **Precision** | BF16 (GPU) / FP32 (CPU/MPS) |
| **Output** | Deterministic JSON (temperature=0, greedy decoding) |

### Design Principles

- **Schema-first**: Every output is validated against a formal JSON Schema (draft 2020-12) with 15 required top-level fields and controlled enumerations
- **Evidence-linked**: All observations reference `evidence_id` entries -- no claim without visual justification
- **Confidence-scored**: Every detection carries a `[0, 1]` confidence score for downstream filtering
- **Honest by design**: When something is uncertain or not visible, the model uses `"unknown"`, `null`, or empty arrays -- never hallucinated details

## Quick Start

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image

# Load model and processor
model_id = "Horama/Horama_BTP"

model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load image
image = Image.open("construction_site.jpg").convert("RGB")

# System prompt -- instructs the model to output Horama-BTP v1 JSON
system_prompt = """You are Horama-BTP v1. Analyze construction site images. Output ONLY valid JSON. No text before/after.
CRITICAL RULES:
1. ONLY describe what you can CLEARLY SEE in the image
2. If you cannot see something -> use empty array [] or "unknown"
3. Output must follow the Horama-BTP v1 JSON schema exactly"""

user_prompt = "Analyze this construction site image and return the Horama-BTP v1 JSON output."

# Prepare messages
messages = [
    {"role": "system", "content": system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": user_prompt},
        ],
    },
]

# Generate
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=4096, do_sample=False)

result = processor.decode(output[0], skip_special_tokens=True)

# Extract JSON from response
import json
json_start = result.rfind("{")
json_end = result.rfind("}") + 1
analysis = json.loads(result[json_start:json_end])

print(json.dumps(analysis, indent=2))
```

## Output Schema

The model outputs a single JSON object with **15 required top-level fields**:

```
{
  "job_type":        "construction" | "renovation" | "infrastructure" | "unknown",
  "asset_type":      "house" | "building" | "road" | "bridge" | "tunnel" | "site" | "unknown",
  "scene_context":   { location_hint, weather_light, viewpoint },
  "summary":         { one_liner, confidence },
  "progress":        { overall_stage, stage_confidence, progress_percent_estimate, milestones_detected },
  "work_activities":  [{ activity, status, confidence, evidence_ids }],
  "observations":    [{ type, label, attributes, confidence, evidence_ids }],
  "safety":          { overall_risk_level, ppe[], hazards[], control_measures[] },
  "quality":         { issues[], non_conformities[] },
  "logistics":       { materials_on_site[], equipment_on_site[], access_constraints[] },
  "environment":     { impacts[], waste_management },
  "evidence":        [{ evidence_id, source, bbox_xyxy, description }],
  "unknown":         [{ question, why_unknown, needed_input }],
  "domain_fields":   { custom_kpis, lot_breakdown, client_specific },
  "metadata":        { model, version, generated_at }
}
```

### Controlled Vocabularies

The schema enforces controlled enumerations across all categorical fields:

| Field | Allowed values |
|---|---|
| `overall_stage` | `planning`, `earthworks`, `foundations`, `structure`, `envelope`, `mep`, `finishing`, `commissioning`, `unknown` |
| `ppe_item` | `helmet`, `vest`, `gloves`, `goggles`, `harness`, `boots`, `mask`, `other` |
| `hazard_type` | `fall_risk`, `open_trench`, `moving_vehicle`, `electrical`, `fire`, `unstable_load`, `poor_housekeeping`, `restricted_area`, `other` |
| `issue_type` | `crack`, `misalignment`, `water_infiltration`, `corrosion`, `spalling`, `poor_finish`, `missing_component`, `rework`, `other` |
| `observation_type` | `object`, `material`, `equipment`, `signage`, `defect`, `hazard`, `personnel`, `vehicle`, `structure_part`, `other` |

## Example Output

Given a drone photograph of a wood-framed house under construction:

```json
{
  "job_type": "construction",
  "asset_type": "house",
  "scene_context": {
    "location_hint": "outdoor",
    "weather_light": "day",
    "viewpoint": "drone"
  },
  "summary": {
    "one_liner": "Aerial view of a wood-framed house under construction; floor deck and wall framing visible, two workers on site.",
    "confidence": 0.88
  },
  "progress": {
    "overall_stage": "structure",
    "stage_confidence": 0.85,
    "progress_percent_estimate": 35,
    "progress_confidence": 0.35,
    "milestones_detected": []
  },
  "safety": {
    "overall_risk_level": "medium",
    "ppe": [
      { "role": "worker", "ppe_item": "helmet", "status": "compliant", "confidence": 0.8, "evidence_ids": ["ev_003"] },
      { "role": "worker", "ppe_item": "vest", "status": "compliant", "confidence": 0.8, "evidence_ids": ["ev_003"] }
    ],
    "hazards": [
      { "hazard_type": "fall_risk", "severity": "medium", "confidence": 0.6, "evidence_ids": ["ev_005"] }
    ],
    "control_measures": [
      { "measure": "barrier", "status": "present", "confidence": 0.6, "evidence_ids": ["ev_004"] }
    ]
  },
  "evidence": [
    { "evidence_id": "ev_001", "source": "image", "description": "Wood-framed structure with interior wall framing visible" },
    { "evidence_id": "ev_003", "source": "image", "description": "Two workers wearing high-visibility vests and hard hats" },
    { "evidence_id": "ev_005", "source": "image", "description": "Open edges and elevated framing suggesting fall risk" }
  ]
}
```

*(Truncated for readability -- full output includes all 15 top-level fields)*

## Training Details

| Parameter | Value |
|---|---|
| **Method** | LoRA (Parameter-Efficient Fine-Tuning) |
| **Epochs** | 15 |
| **Effective batch size** | 4 (batch=1, accumulation=4) |
| **Learning rate** | 1e-4 with cosine schedule |
| **Warmup** | 10% of training steps |
| **Weight decay** | 0.01 |
| **Gradient checkpointing** | Enabled |
| **Trainable parameters** | ~1.5% of total model parameters |
| **Framework** | Transformers + PEFT |
| **Hardware** | NVIDIA GPU with BF16 |

## Intended Uses

**Primary use cases:**
- Automated construction progress reporting from site photographs
- Safety compliance auditing (PPE detection, hazard identification)
- Quality control -- detecting visible defects and non-conformities
- Logistics monitoring -- tracking materials and equipment on site
- Environmental impact documentation

**Input requirements:**
- Single construction site image (JPEG, PNG, WebP, BMP)
- Supports ground-level, drone, and fixed-camera viewpoints
- Works best with daylight, well-lit images

## Limitations

- **Single-image analysis**: The model analyzes one image at a time; it cannot compare images across time for temporal progress tracking
- **Visible elements only**: Cannot infer hidden structural issues, underground utilities, or elements behind walls
- **No sensory data**: Cannot detect noise levels, dust concentration, or odors from static images
- **Resolution-dependent**: Small or distant objects (e.g., PPE details at long range) may have lower confidence
- **Schema-bound**: Output strictly follows the Horama-BTP v1 schema -- custom fields require the `domain_fields` extension point

## Hardware Requirements

| Setup | VRAM / RAM | Precision | Notes |
|---|---|---|---|
| **NVIDIA GPU** | ~8 GB VRAM | BF16 | Recommended for production |
| **Apple Silicon** | ~8 GB RAM | FP32 | Supported via MPS backend |
| **CPU** | ~12 GB RAM | FP32 | Functional but slower |

## License

**AGPL-3.0** -- This model can be freely used, modified, and redistributed as long as derivative work remains open-source under the same license.

For **commercial or closed-source** usage, please contact [Horama](https://horama.ai) for a commercial license.

## Citation

```bibtex
@misc{horama-btp-2025,
  title   = {Horama-BTP: Vision-Language Model for Construction Site Analysis},
  author  = {Horama},
  year    = {2025},
  url     = {https://huggingface.co/Horama/Horama_BTP},
  note    = {Fine-tuned from Qwen2.5-VL-3B-Instruct with LoRA for structured JSON construction analysis}
}
```

---

<div align="center">

**Built by [Horama](https://horama.ai)** | Construction intelligence, powered by vision AI

</div>