---
language:
- en
- fr
license: agpl-3.0
library_name: transformers
base_model: Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
tags:
- construction
- visual-analysis
- safety-inspection
- vlm
- qwen2_5_vl
- qwen2-vl
- lora
- horama
- btp
- structured-output
- json
- image-to-json
- peft
- safetensors
model-index:
- name: Horama_BTP
  results: []
---
# HORAMA-BTP

### Vision-Language Model for Construction Site Analysis

**Image → Structured JSON** | Built on Qwen2.5-VL | Fine-tuned with LoRA

[![Model](https://img.shields.io/badge/Model-3B_params-blue)]() [![License](https://img.shields.io/badge/License-AGPL--3.0-green)]() [![Format](https://img.shields.io/badge/Output-Structured_JSON-orange)]() [![Framework](https://img.shields.io/badge/Framework-Transformers-yellow)]()

---

*Horama-BTP transforms construction site photographs into comprehensive, machine-readable JSON reports covering progress tracking, safety compliance, quality assessment, and logistics.*
## Overview

**Horama-BTP** is a domain-specialized Vision-Language Model (VLM) that converts construction site images into structured JSON analyses. Given a single photograph, the model produces a detailed report spanning **15 analysis dimensions** -- from construction progress estimation and safety compliance to quality defects and environmental impact.

The model enforces a strict, validated JSON schema with controlled vocabularies, confidence scores, and an evidence-linking system that traces every observation back to visual evidence in the image.

### Key Capabilities

| Dimension | What the model extracts |
|---|---|
| **Progress** | Construction stage (earthworks → commissioning), estimated % completion, detected milestones |
| **Safety** | PPE compliance per worker, hazard identification (9 types), control measures present/missing |
| **Quality** | Structural defects (cracks, misalignment, corrosion...), non-conformities |
| **Observations** | Objects, materials, equipment, personnel, vehicles, structural parts with attributes |
| **Logistics** | Materials on site, equipment status (idle/operating), access constraints |
| **Environment** | Dust, noise, waste, spills; waste management assessment |
| **Evidence** | Traceable evidence entries with unique IDs linking every finding to visual proof |

## Architecture

```
Input Image ───┐
               ├──► Qwen2.5-VL-3B-Instruct ──► LoRA-adapted layers ──► Structured JSON
System Prompt ─┘         (backbone)              (r=32, alpha=64)
```

| Component | Details |
|---|---|
| **Backbone** | [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) -- 3B parameter multimodal transformer |
| **Adaptation** | LoRA (Low-Rank Adaptation) applied to all attention and MLP projections |
| **Target Modules** | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| **LoRA Rank** | r=32, alpha=64 (2x scaling), dropout=0.1 |
| **Precision** | BF16 (GPU) / FP32 (CPU/MPS) |
| **Output** | Deterministic JSON (temperature=0, greedy decoding) |

### Design Principles

- **Schema-first**: Every output is validated against a formal JSON Schema (draft 2020-12) with 15 required top-level fields and controlled enumerations
- **Evidence-linked**: All observations reference `evidence_id` entries -- no claim without visual justification
- **Confidence-scored**: Every detection carries a `[0, 1]` confidence score for downstream filtering (see the sketch after this list)
- **Honest by design**: When something is uncertain or not visible, the model uses `"unknown"`, `null`, or empty arrays -- never hallucinated details
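The evidence-linking and confidence-scoring principles are directly actionable downstream. The snippet below is a minimal sketch (illustrative, not official Horama tooling) that filters findings by confidence and resolves their `evidence_ids` back to evidence entries; the hand-written `analysis` fragment mirrors the example output later in this card, and the threshold value is arbitrary.

```python
# Minimal sketch of consuming confidence scores and evidence links downstream.
# The `analysis` fragment below is hand-written in the documented shape
# (see "Example Output"); not produced by the model here.
analysis = {
    "safety": {
        "hazards": [
            {"hazard_type": "fall_risk", "severity": "medium",
             "confidence": 0.6, "evidence_ids": ["ev_005"]},
        ],
    },
    "evidence": [
        {"evidence_id": "ev_005", "source": "image",
         "description": "Open edges and elevated framing suggesting fall risk"},
    ],
}

# Index evidence entries by ID so every finding can be traced to visual proof
evidence_by_id = {e["evidence_id"]: e for e in analysis["evidence"]}

# Keep only hazards above a chosen confidence threshold (value is illustrative)
CONFIDENCE_THRESHOLD = 0.5
for hazard in analysis["safety"]["hazards"]:
    if hazard["confidence"] >= CONFIDENCE_THRESHOLD:
        proofs = [evidence_by_id[eid]["description"] for eid in hazard["evidence_ids"]]
        print(f"{hazard['hazard_type']} ({hazard['severity']}): {'; '.join(proofs)}")
```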
## Quick Start

```python
import json

import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image

# Load model and processor
model_id = "Horama/Horama_BTP"
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load image
image = Image.open("construction_site.jpg").convert("RGB")

# System prompt -- instructs the model to output Horama-BTP v1 JSON
system_prompt = """You are Horama-BTP v1. Analyze construction site images.
Output ONLY valid JSON. No text before/after.
CRITICAL RULES:
1. ONLY describe what you can CLEARLY SEE in the image
2. If you cannot see something -> use empty array [] or "unknown"
3. Output must follow the Horama-BTP v1 JSON schema exactly"""

user_prompt = "Analyze this construction site image and return the Horama-BTP v1 JSON output."

# Prepare messages
messages = [
    {"role": "system", "content": system_prompt},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": user_prompt},
        ],
    },
]

# Generate (greedy decoding for deterministic JSON)
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=4096, do_sample=False)

# Decode only the newly generated tokens (skip the echoed prompt)
generated = output[0][inputs["input_ids"].shape[1]:]
result = processor.decode(generated, skip_special_tokens=True)

# Extract the JSON object from the response
json_start = result.find("{")
json_end = result.rfind("}") + 1
analysis = json.loads(result[json_start:json_end])
print(json.dumps(analysis, indent=2))
```

## Output Schema

The model outputs a single JSON object with **15 required top-level fields**:

```
{
  "job_type": "construction" | "renovation" | "infrastructure" | "unknown",
  "asset_type": "house" | "building" | "road" | "bridge" | "tunnel" | "site" | "unknown",
  "scene_context": { location_hint, weather_light, viewpoint },
  "summary": { one_liner, confidence },
  "progress": { overall_stage, stage_confidence, progress_percent_estimate, milestones_detected },
  "work_activities": [{ activity, status, confidence, evidence_ids }],
  "observations": [{ type, label, attributes, confidence, evidence_ids }],
  "safety": { overall_risk_level, ppe[], hazards[], control_measures[] },
  "quality": { issues[], non_conformities[] },
  "logistics": { materials_on_site[], equipment_on_site[], access_constraints[] },
  "environment": { impacts[], waste_management },
  "evidence": [{ evidence_id, source, bbox_xyxy, description }],
  "unknown": [{ question, why_unknown, needed_input }],
  "domain_fields": { custom_kpis, lot_breakdown, client_specific },
  "metadata": { model, version, generated_at }
}
```

### Controlled Vocabularies

The schema enforces controlled enumerations across all categorical fields:

| Field | Allowed values |
|---|---|
| `overall_stage` | `planning`, `earthworks`, `foundations`, `structure`, `envelope`, `mep`, `finishing`, `commissioning`, `unknown` |
| `ppe_item` | `helmet`, `vest`, `gloves`, `goggles`, `harness`, `boots`, `mask`, `other` |
| `hazard_type` | `fall_risk`, `open_trench`, `moving_vehicle`, `electrical`, `fire`, `unstable_load`, `poor_housekeeping`, `restricted_area`, `other` |
| `issue_type` | `crack`, `misalignment`, `water_infiltration`, `corrosion`, `spalling`, `poor_finish`, `missing_component`, `rework`, `other` |
| `observation_type` | `object`, `material`, `equipment`, `signage`, `defect`, `hazard`, `personnel`, `vehicle`, `structure_part`, `other` |
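Because outputs follow JSON Schema draft 2020-12 with enum-constrained fields, they can be machine-validated before entering a reporting pipeline. The full Horama-BTP v1 schema is not reproduced in this card, so the sketch below uses the third-party `jsonschema` package against a trimmed, illustrative subschema covering just two of the 15 required fields; `candidate` stands in for the `analysis` dict parsed in Quick Start.

```python
# Minimal validation sketch using the third-party `jsonschema` package
# (pip install jsonschema). SUBSCHEMA is an illustrative stand-in for the
# full Horama-BTP v1 schema, covering only two required top-level fields.
from jsonschema import Draft202012Validator

SUBSCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "required": ["job_type", "progress"],
    "properties": {
        "job_type": {
            "enum": ["construction", "renovation", "infrastructure", "unknown"],
        },
        "progress": {
            "type": "object",
            "required": ["overall_stage"],
            "properties": {
                "overall_stage": {
                    "enum": ["planning", "earthworks", "foundations", "structure",
                             "envelope", "mep", "finishing", "commissioning", "unknown"],
                },
                "stage_confidence": {"type": "number", "minimum": 0, "maximum": 1},
            },
        },
    },
}

validator = Draft202012Validator(SUBSCHEMA)

# Report every violation instead of raising on the first one
candidate = {"job_type": "construction", "progress": {"overall_stage": "structure"}}
for error in validator.iter_errors(candidate):
    print(f"Schema violation at {list(error.path)}: {error.message}")
```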
"evidence_ids": ["ev_005"] } ], "control_measures": [ { "measure": "barrier", "status": "present", "confidence": 0.6, "evidence_ids": ["ev_004"] } ] }, "evidence": [ { "evidence_id": "ev_001", "source": "image", "description": "Wood-framed structure with interior wall framing visible" }, { "evidence_id": "ev_003", "source": "image", "description": "Two workers wearing high-visibility vests and hard hats" }, { "evidence_id": "ev_005", "source": "image", "description": "Open edges and elevated framing suggesting fall risk" } ] } ``` *(Truncated for readability -- full output includes all 15 top-level fields)* ## Training Details | Parameter | Value | |---|---| | **Method** | LoRA (Parameter-Efficient Fine-Tuning) | | **Epochs** | 15 | | **Effective batch size** | 4 (batch=1, accumulation=4) | | **Learning rate** | 1e-4 with cosine schedule | | **Warmup** | 10% of training steps | | **Weight decay** | 0.01 | | **Gradient checkpointing** | Enabled | | **Trainable parameters** | ~1.5% of total model parameters | | **Framework** | Transformers + PEFT | | **Hardware** | NVIDIA GPU with BF16 | ## Intended Uses **Primary use cases:** - Automated construction progress reporting from site photographs - Safety compliance auditing (PPE detection, hazard identification) - Quality control -- detecting visible defects and non-conformities - Logistics monitoring -- tracking materials and equipment on site - Environmental impact documentation **Input requirements:** - Single construction site image (JPEG, PNG, WebP, BMP) - Supports ground-level, drone, and fixed-camera viewpoints - Works best with daylight, well-lit images ## Limitations - **Single-image analysis**: The model analyzes one image at a time; it cannot compare images across time for temporal progress tracking - **Visible elements only**: Cannot infer hidden structural issues, underground utilities, or elements behind walls - **No sensory data**: Cannot detect noise levels, dust concentration, or odors from static images - **Resolution-dependent**: Small or distant objects (e.g., PPE details at long range) may have lower confidence - **Schema-bound**: Output strictly follows the Horama-BTP v1 schema -- custom fields require the `domain_fields` extension point ## Hardware Requirements | Setup | VRAM / RAM | Precision | Notes | |---|---|---|---| | **NVIDIA GPU** | ~8 GB VRAM | BF16 | Recommended for production | | **Apple Silicon** | ~8 GB RAM | FP32 | Supported via MPS backend | | **CPU** | ~12 GB RAM | FP32 | Functional but slower | ## License **AGPL-3.0** -- This model can be freely used, modified, and redistributed as long as derivative work remains open-source under the same license. For **commercial or closed-source** usage, please contact [Horama](https://horama.ai) for a commercial license. ## Citation ```bibtex @misc{horama-btp-2025, title = {Horama-BTP: Vision-Language Model for Construction Site Analysis}, author = {Horama}, year = {2025}, url = {https://huggingface.co/Horama/Horama_BTP}, note = {Fine-tuned from Qwen2.5-VL-3B-Instruct with LoRA for structured JSON construction analysis} } ``` ---
**Built by [Horama](https://horama.ai)** | Construction intelligence, powered by vision AI