SynthVision: Building a 110K Synthetic Medical VQA Dataset with Cross-Model Validation
TLDR: We annotated 119K medical images with two frontier VLMs (Qwen 3.5, Kimi K2.5), cross-validated the outputs at a 93% agreement rate, and produced 110K training records, all for under $500. Fine-tuning three small models (2–3B parameters) improved every benchmark, with the best model gaining +15.0% average exact match. Everything is open-sourced: datasets, adapters, and code.
Table of Contents
- Data Curation
- Batch Annotation via Doubleword
- Cross-Model Validation
- Training Data Preparation
- Fine-Tuning
- Results & Analysis
- Lessons Learned
- When Does Synthetic Data Help?
- Artifacts & Reproducibility
- Future Work
- Conclusion
Medical Visual Question Answering (VQA) sits at the intersection of computer vision and clinical reasoning: given a medical image, answer natural-language questions about it. The applications are immediate: radiology triage, pathology education, clinical decision support. But the field has a data problem.
Existing medical VQA datasets are small (VQA-RAD has 314 training samples), narrow in modality coverage, and often restrictively licensed. Frontier vision-language models with hundreds of billions of parameters can produce multi-turn clinical analyses of medical images, but deploying them costs $10–50 per 1,000 images. Small 2–3B parameter models are deployable but lack the medical knowledge to produce useful answers.
SynthVision bridges this gap through knowledge distillation at dataset scale. The idea: use two frontier VLMs as "teachers" to annotate medical images with multi-turn clinical conversations, then cross-validate their outputs against each other to filter hallucinations. The result is a 110K-record synthetic dataset that transfers frontier-model reasoning into small, deployable models.
This project was built by OpenMed in collaboration with Hugging Face and Doubleword. OpenMed designed the pipeline, wrote the code, ran the experiments, and open-sourced everything. But a project at this scale doesn't happen in isolation:
- The Hugging Face team supported us throughout: the Hub hosted every dataset and model artifact, Hugging Face Jobs provided A100 compute for training and evaluation, and the open-source stack (transformers, PEFT, accelerate, datasets) handled everything from LoRA configuration to multi-GPU orchestration.
- The Doubleword team worked with us on batch annotation at scale. Their API made it economically viable to run two full annotation passes plus two cross-validation passes over 119K images with frontier reasoning models (Qwen 3.5 at 397B/17B active, Kimi K2.5 at 1T/32B active).
The pipeline in numbers: 119,137 seed images from 4 open datasets → annotated by Qwen 3.5 (397B/17B active) and Kimi K2.5 (1T/32B active) → cross-validated at 93% pass rate → 110,741 validated records → fine-tuned 3 model families (13 training runs) → best result: +15.0% average improvement on Qwen3.5-2B, with all 3 families improving across all benchmarks.
┌──────────┐ ┌─────────┐ ┌─────────────────┐ ┌────────────┐ ┌───────────────┐
│ 4 Source │──▶│ 119K │──▶│ Annotate │──▶│ Cross- │──▶│ Fine-Tune │
│ Datasets │ │ Seeds │ │ Qwen 3.5: 59K │ │ Validate │ │ 3 Families │
│ │ │ │ │ Kimi K2.5: 59K │ │ 93% → 110K │ │ Best: +15% EM │
└──────────┘ └─────────┘ └─────────────────┘ └────────────┘ └───────────────┘
Data Curation
Selecting seed datasets
We evaluated 9 medical image datasets on three criteria: open licensing, availability on the Hugging Face Hub, and modality diversity. Four made the cut:
| Dataset | Tier | Records | Modality | HF ID | Selection Reason |
|---|---|---|---|---|---|
| ROCO | B (caption) | 65,393 | Radiology | eltorio/ROCO-radiology | Large, diverse radiology with captions |
| MultiCaRe | B (caption) | 50,000 | Mixed | OpenMed/multicare-images | Multi-specialty clinical cases |
| PathVQA | C (Q&A) | 3,430 | Pathology | flaviagiammarino/path-vqa | Existing Q&A pairs for augmentation |
| VQA-RAD | C (Q&A) | 314 | Radiology | flaviagiammarino/vqa-rad | Gold-standard radiology VQA |
| Total | | 119,137 | | | |
We prioritized datasets that were freely downloadable from the Hub without manual steps or gating, maximizing reproducibility for anyone running the pipeline.
Image deduplication
Before annotation, we deduplicated images using SHA-256 hashes on raw image bytes. This removed cross-dataset duplicates (some ROCO images appear in PMC-OA) without losing unique medical content.
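The byte-level dedup step can be sketched in a few lines, assuming each record carries an `image_bytes` field (the field name is illustrative, not the pipeline's actual schema):

```python
import hashlib

def dedupe_by_bytes(records):
    """Keep the first record for each unique SHA-256 of the raw image bytes."""
    seen = set()
    unique = []
    for rec in records:
        digest = hashlib.sha256(rec["image_bytes"]).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique
```

Hashing raw bytes (rather than decoded pixels) is fast and catches exact file-level duplicates, which is the common case when the same source file appears in multiple datasets.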
The 3-tier annotation system
Not all seed datasets have the same metadata. A radiology image with a caption requires a different prompt than one with existing Q&A pairs. We designed three annotation tiers:
Tier A (label-based): For datasets like NIH ChestX-ray14 where metadata is a list of diagnostic labels (e.g., "Cardiomegaly, Effusion"). The prompt asks the model to generate a 4-turn conversation covering findings, clinical significance, urgency, and follow-up. Not used in this release; all Tier A datasets were disabled.
Tier B (caption-based): For ROCO and MultiCaRe where metadata is a free-text caption or report. The prompt provides the caption and asks for a 5-turn conversation: findings, mechanism, differential diagnosis, urgency assessment, and recommended follow-up.
Tier C (Q&A-based): For PathVQA and VQA-RAD where metadata includes existing question-answer pairs. The prompt preserves the original Q&A pairs and asks for 4 additional harder follow-up turns covering mechanism, differential, prognosis, and management.
All tiers produce the same output schema:
{
"report": "4-6 sentence clinical narrative",
"conversations": [{"from": "human", "value": "..."}, {"from": "assistant", "value": "..."}],
"reasoning_chain": "1. Assess cardiac borders... 2. Evaluate lung fields...",
"structured_findings": {"cardiomegaly": true, "pleural_effusion": "moderate"},
"difficulty": "intermediate"
}
All three tier prompts are in src/openmed/annotation/prompts.py.
Here are examples of real annotations from the dataset, spanning radiology and pathology:
Batch Annotation via Doubleword
Architecture
Annotating 119K images with frontier VLMs would be prohibitively expensive at real-time API rates. Doubleword solved this: their OpenAI-compatible batch inference API at api.doubleword.ai queues requests, processes them in bulk, and returns results asynchronously within a 24-hour window at reduced pricing. This made it viable to run two full annotation passes plus two cross-validation passes over 119K images with frontier reasoning models.
We iterated on two approaches during development, with guidance from the Doubleword team:
- Manual JSONL chunking (scripts 10–12): build JSONL request files, upload them, poll for completion, download results. Fine-grained control but manual orchestration across 200 MB file-size limits.
- Autobatcher library (script 13): Doubleword's Python client that handles batching, file uploads, polling, and result streaming transparently. We just call `client.chat.completions.create()` per record and `asyncio.gather()` the coroutines, and the library takes care of the rest.
Both approaches required solving the same constraint: each API request body is limited to 5MB, and base64 encoding inflates image sizes by ~33%. We set a 3.9MB threshold on raw image bytes and progressively resize larger images (80% per iteration, JPEG quality 85, up to 10 attempts, 512x512 fallback).
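The shrink-until-it-fits loop can be sketched as below. The callback style is an illustration (the pipeline operates on raw bytes directly); `encode((w, h))` stands in for "resize to this size and re-encode as JPEG at quality 85, e.g. with Pillow":

```python
def fit_under_limit(encode, size, max_bytes=3_900_000,
                    scale=0.8, max_attempts=10, fallback=(512, 512)):
    """Progressively downscale until the encoded payload fits under max_bytes.

    encode((w, h)) must return the image re-encoded at the given size.
    """
    w, h = size
    for _ in range(max_attempts):
        data = encode((int(w), int(h)))
        if len(data) <= max_bytes:
            return data
        w, h = w * scale, h * scale  # shrink 20% per iteration
    return encode(fallback)  # 512x512 last resort
```

Because each iteration shrinks the area to 64% of the previous one, even very large scans fit within a handful of attempts.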
Two-model annotation
To get diverse annotations and enable cross-validation, we split the 119K seeds deterministically:
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
half = len(df) // 2
first_half = df.iloc[:half] # → Qwen 3.5
second_half = df.iloc[half:] # → Kimi K2.5
- First half (59,476 records) → annotated by `Qwen/Qwen3.5-397B-A17B` (Qwen 3.5, a 397B MoE model with 17B active parameters)
- Second half (59,539 records) → annotated by `moonshotai/Kimi-K2.5` (Kimi K2.5, a 1T MoE model with 32B active parameters)
Total input to the API was approximately 16.7GB of JSONL across 476 batches, auto-chunked at Doubleword's 200MB per-file limit. All requests used max_tokens=16384 and temperature=0.2.
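One JSONL request line looks roughly like the following. The envelope fields (`custom_id`, `method`, `url`, `body`) follow the standard OpenAI-compatible batch schema and are an assumption here, not a copy of Doubleword's documented format:

```python
import base64
import json

def batch_request_line(record_id, image_bytes, prompt, model,
                       max_tokens=16384, temperature=0.2):
    """Build one OpenAI-compatible batch request as a JSONL line."""
    image_b64 = base64.b64encode(image_bytes).decode()
    body = {
        "model": model,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    }
    return json.dumps({"custom_id": record_id, "method": "POST",
                       "url": "/v1/chat/completions", "body": body})
```

The base64 data URL is what inflates the payload by ~33% and drives the resizing logic described earlier.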
Cost
The full pipeline (both annotation rounds and both cross-validation rounds) processed 586M tokens (109M input + 477M output) for $452.58 via Doubleword's batch API.
Detailed cost breakdown
| Model | Provider | Input $/1M | Output $/1M | Total Cost |
|---|---|---|---|---|
| Qwen3.5-397B | Doubleword | $0.15 | $1.20 | $182.26 |
| Qwen3.5-397B | Alibaba Cloud | $0.39 | $2.34 | $358.22 |
| Kimi-K2.5 | Doubleword | $0.50 | $0.70 | $270.32 |
| Kimi-K2.5 | Moonshot AI | $0.60 | $3.00 | $1,035.17 |
| Combined | Doubleword | $452.58 | ||
| Combined | Direct providers | $1,393.39 |
For comparison on other frontier models: Gemini 3 Flash ($1,486), GPT-5 ($4,909), Claude Sonnet 4.6 ($7,487).
| Model | Batches | Input Tokens | Output Tokens | Total Tokens |
|---|---|---|---|---|
| Qwen3.5-397B | 175 | 29.0M | 148.3M | 177.2M |
| Kimi-K2.5 | 301 | 80.0M | 329.1M | 409.0M |
| Total | 476 | 108.9M | 477.3M | 586.2M |
Note the ~4.4x output-to-input ratio, characteristic of reasoning models that generate extensive chain-of-thought before the final JSON. This is why output pricing dominates for this workload.
Parsing model outputs (the hard part)
Both Qwen 3.5 and Kimi K2.5 are reasoning models that "think" before responding. This creates parsing challenges:
- Qwen 3.5 puts its chain-of-thought in a `reasoning_content` field and the actual JSON in `content`, but `content` is sometimes `None`.
- Kimi K2.5 outputs reasoning as plain text before a `` ```json `` fenced block, all in the `content` field.
Beyond reasoning extraction, the models produce JSON in at least 6 different conversation formats ({from, value}, {question, answer}, {human, assistant}, {role, content}, {Q, A}, {user, assistant}). A normalisation layer unifies them into a consistent {from, value} schema.
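A minimal sketch of such a normalisation layer, covering the six formats named above (the real normaliser handles more edge cases):

```python
def normalize_turn(turn):
    """Map one turn from any observed schema to the {from, value} schema.

    Paired question/answer keys expand into two turns.
    """
    for q_key, a_key in (("question", "answer"), ("human", "assistant"),
                         ("Q", "A"), ("user", "assistant")):
        if q_key in turn and a_key in turn:
            return [{"from": "human", "value": turn[q_key]},
                    {"from": "assistant", "value": turn[a_key]}]
    if "role" in turn and "content" in turn:
        role = "human" if turn["role"] == "user" else "assistant"
        return [{"from": role, "value": turn["content"]}]
    if "from" in turn and "value" in turn:
        return [dict(turn)]
    raise ValueError(f"unrecognised turn schema: {sorted(turn)}")

def normalize_conversation(turns):
    out = []
    for t in turns:
        out.extend(normalize_turn(t))
    return out
```

Raising on unknown schemas (rather than guessing) is deliberate: a new format should surface loudly rather than silently corrupt training data.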
The JSON itself also requires progressive repair: trailing commas, missing commas between values, and unbalanced braces are common when models generate 5-10KB of structured output. The parser tries raw json.loads first, then applies regex fixups (trailing comma removal, missing comma insertion) before giving up. A critical guard: _extract_json() must verify isinstance(result, dict) because some models occasionally return a JSON array instead of an object.
See the full parsing pipeline in the source code.
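The progressive-repair idea can be condensed to a sketch like this (the production parser applies more fixups, but the shape — fenced-block extraction, `json.loads`, regex repair, `isinstance` guard — is the same):

```python
import json
import re

def extract_json(text):
    """Extract a JSON object from model output, with light repair."""
    # prefer a ```json fenced block if present (Kimi-style output)
    fenced = re.search(r"```json\s*(.*?)```", text, re.DOTALL)
    candidate = (fenced.group(1) if fenced
                 else text[text.find("{"):text.rfind("}") + 1])
    for attempt in (candidate,
                    re.sub(r",\s*([}\]])", r"\1", candidate)):  # strip trailing commas
        try:
            result = json.loads(attempt)
            # guard: some models return a top-level array, not an object
            if isinstance(result, dict):
                return result
        except json.JSONDecodeError:
            continue
    return None
```

Returning `None` (rather than raising) lets the batch pipeline count and skip unparseable records instead of aborting.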
Cross-Model Validation
The key insight
Qwen 3.5 and Kimi K2.5 hallucinate in different ways: different training data, different architectures, different alignment strategies. When both models agree that findings are consistent with an image, the probability of a shared hallucination drops substantially.
Validation protocol
Each model validates the other's annotations. The validator sees the original image plus the structured findings from the annotator, and answers whether the findings are consistent with what is visible:
VALIDATION_PROMPT = (
"Look at this {modality} image. "
"A model reported the following findings: \"{findings}\".\n"
"Do these findings appear consistent with what is visible in the image?\n"
"Answer with a JSON: {{\"consistent\": true/false, "
"\"confidence\": 0.0-1.0, \"reason\": \"brief explanation\"}}\n"
"Return ONLY valid JSON."
)
A record passes validation if consistent == true AND confidence >= 0.7.
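The acceptance rule is a one-liner; treating a malformed or missing verdict as a rejection keeps the filter conservative:

```python
def passes_validation(verdict, min_confidence=0.7):
    """Accept a record only if the validator's parsed JSON verdict
    says consistent == true with confidence >= min_confidence."""
    if not isinstance(verdict, dict):
        return False  # unparseable verdict counts as a rejection
    return (verdict.get("consistent") is True
            and float(verdict.get("confidence", 0.0)) >= min_confidence)
```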
Results
| Annotator | Validator | Input | Passed | Rate |
|---|---|---|---|---|
| Qwen 3.5 | Kimi K2.5 | 59,476 | 55,359 | 93.1% |
| Kimi K2.5 | Qwen 3.5 | 59,539 | 55,382 | 93.0% |
| Combined | | 119,015 | 110,741 | 93.0% |
Both directions landed at 93%. The ~7% rejection rate primarily catches cases where the annotator hallucinated findings not visible in the image, exactly the failure mode cross-validation is designed to detect.
Examples of validated samples across different imaging modalities:
Training Data Preparation
The annotations from both rounds need to be merged into a single training-ready format. The preparation script (scripts/15_prepare_training_data.py) performs three steps:
- Merge all Parquet files from both `data/validated_qwen_by_kimi/` and `data/validated_kimi_by_qwen/`.
- Deduplicate: if the same image ID appears in both sets, keep the record with the higher `quality_score`.
- Convert to ShareGPT JSONL, with the first human turn prefixed with `<image>`:
{
"id": "roco_train_00042",
"image": "images/ROCO_00042.jpg",
"conversations": [
{"from": "human", "value": "<image>\nWhat are the key findings in this chest X-ray?"},
{"from": "gpt", "value": "The chest X-ray demonstrates bilateral perihilar..."},
{"from": "human", "value": "What is the likely mechanism?"},
{"from": "gpt", "value": "The bilateral distribution suggests..."}
],
"metadata": {
"source": "roco",
"modality": "Radiology",
"difficulty": "intermediate",
"quality_score": 0.92,
"annotation_model": "Qwen/Qwen3.5-397B-A17B"
}
}
The final training set contains 118,995 deduplicated records across 24 JSONL shards (~320MB total).
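The deduplication step (keep the higher `quality_score` per image ID) can be sketched as follows, using the field names from the ShareGPT example above; this is an illustration of step 2, not the script itself:

```python
def dedupe_by_quality(records):
    """Keep the best-scoring record per id when both validation
    directions produced an annotation for the same image."""
    best = {}
    for rec in records:
        key = rec["id"]
        score = rec["metadata"]["quality_score"]
        if key not in best or score > best[key]["metadata"]["quality_score"]:
            best[key] = rec
    return list(best.values())
```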
Fine-Tuning
Infrastructure
We ran all training on 4x NVIDIA A100 80GB GPUs (48 vCPU, 568 GB RAM) via Hugging Face Jobs with bf16 mixed precision and cosine scheduling. Stack: transformers + PEFT + accelerate for multi-GPU LoRA training, datasets for benchmark loading, Hub for artifact hosting.
One implementation detail worth noting: the data collator masks everything except assistant responses. System prompts, user messages, and special tokens get -100 labels so only assistant tokens contribute to the loss, preventing the model from learning to parrot user questions. The masking logic scans for <|im_start|>assistant boundaries in the token IDs (see src/openmed/training/finetune_qwen.py).
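The masking logic can be illustrated on token strings (the real collator scans token IDs, but the control flow is the same): everything outside assistant spans, including the assistant role header itself, gets -100:

```python
IGNORE_INDEX = -100

def mask_non_assistant(tokens, start=("<|im_start|>", "assistant"),
                       end="<|im_end|>"):
    """Return labels where only assistant-response tokens survive;
    all other positions are set to IGNORE_INDEX (-100)."""
    masked = []
    in_assistant = False
    i = 0
    while i < len(tokens):
        if tuple(tokens[i:i + 2]) == start:
            in_assistant = True
            masked += [IGNORE_INDEX, IGNORE_INDEX]  # mask the role header itself
            i += 2
            continue
        if tokens[i] == end and in_assistant:
            in_assistant = False
            masked.append(tokens[i])  # keep the closing <|im_end|> in the loss
            i += 1
            continue
        masked.append(tokens[i] if in_assistant else IGNORE_INDEX)
        i += 1
    return masked
```

Whether the closing `<|im_end|>` is included in the loss is a design choice; including it teaches the model when to stop generating.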
Qwen2.5-VL-3B: 5 experiments
Qwen2.5-VL-3B-Instruct was our primary target, a vision-language model with strong general VQA capability that we hypothesized could benefit from medical domain adaptation. We ran 5 experiments varying LoRA rank, target modules, learning rate, and training steps.
| Model | VQA-RAD | PathVQA | SLAKE | Avg EM |
|---|---|---|---|---|
| Base | 0.5033 | 0.3038 | 0.5438 | 0.4503 |
| Best (v6) | 0.5211 | 0.3468 | 0.6032 | 0.4904 |
| Delta | +3.5% | +14.2% | +10.9% | +8.9% |
Best config: r=32, alpha=32, q/v/k/o projections, LR=7e-5, 700 steps. All three benchmarks improve, with PathVQA and SLAKE seeing the largest gains.
Full experiment log
| Exp | VQA-RAD | PathVQA | SLAKE | Avg EM |
|---|---|---|---|---|
| Base | 0.5033 | 0.3038 | 0.5438 | 0.4503 |
| v2 | 0.5078 | 0.3140 | 0.4439 | 0.4219 |
| v3@500 | 0.5011 | 0.3169 | 0.4788 | 0.4323 |
| v3@1860 | 0.5166 | 0.3210 | 0.4260 | 0.4212 |
| v4@500 | 0.5011 | 0.2983 | 0.5118 | 0.4370 |
| v4@1860 | 0.4812 | 0.3015 | 0.4873 | 0.4233 |
| v5 | 0.5055 | 0.3051 | 0.4976 | 0.4361 |
| v6 | 0.5211 | 0.3468 | 0.6032 | 0.4904 |
Early experiments (v2–v5) showed that aggressive LoRA configurations improve VQA-RAD and PathVQA but degrade SLAKE — our synthetic data teaches explanatory answers while SLAKE expects short categorical responses ("yes", "no", "liver"). The final experiment (v6) resolved this trade-off.
Ministral-3B: 4 experiments
Ministral-3B-Instruct is a vision-language model with a built-in Pixtral vision encoder. We ran 4 experiments with different LoRA configurations.
| Model | VQA-RAD | PathVQA | SLAKE | Avg EM |
|---|---|---|---|---|
| Base | 0.4701 | 0.3240 | 0.4948 | 0.4296 |
| Best (D) | 0.4789 | 0.3669 | 0.5664 | 0.4707 |
| Delta | +1.9% | +13.2% | +14.5% | +9.6% |
Best config: r=64, alpha=128, all 7 modules, LR=5e-5, 500 steps. The largest gains are on PathVQA and SLAKE.
Note: early experiments (A–C) were evaluated with standard PyTorch inference (model.generate()), which produces lower scores for Ministral due to suboptimal generation parameters. All final results use vLLM with tensor parallelism and the proper chat template.
Full experiment log
| Exp | VQA-RAD | PathVQA | SLAKE | Avg EM |
|---|---|---|---|---|
| Base (vLLM) | 0.4701 | 0.3240 | 0.4948 | 0.4296 |
| A | 0.2838 | 0.2582 | 0.2997 | 0.2806 |
| B | 0.2905 | 0.2712 | 0.3091 | 0.2903 |
| C | 0.2905 | 0.2747 | 0.3063 | 0.2905 |
| D | 0.4789 | 0.3669 | 0.5664 | 0.4707 |
Experiments A–C were evaluated with standard PyTorch inference (lower scores due to suboptimal generation parameters). Experiment D was evaluated with vLLM, matching the base evaluation setup.
Qwen3.5-2B: 4 experiments
Qwen3.5-2B is the highest-scoring base model we tested, scoring 0.5477 EM on VQA-RAD and 0.3822 on PathVQA out of the box.
| Model | VQA-RAD | PathVQA | SLAKE | Avg EM |
|---|---|---|---|---|
| Base | 0.5477 | 0.3822 | 0.5617 | 0.4972 |
| Best (D) | 0.5521 | 0.4748 | 0.6880 | 0.5716 |
| Delta | +0.8% | +24.2% | +22.5% | +15.0% |
Best config: r=64, alpha=64, all 7 modules, LR=1e-4, 350 steps. The largest gains are on PathVQA (+24.2%) and SLAKE (+22.5%), with VQA-RAD holding steady near its already-high baseline.
Full experiment log
| Exp | VQA-RAD | PathVQA | SLAKE | Avg EM |
|---|---|---|---|---|
| Base | 0.5477 | 0.3822 | 0.5617 | 0.4972 |
| A | 0.5388 | 0.3585 | 0.5193 | 0.4722 |
| B | 0.5455 | 0.3562 | 0.5316 | 0.4778 |
| C | 0.5521 | 0.3618 | 0.5203 | 0.4781 |
| D | 0.5521 | 0.4748 | 0.6880 | 0.5716 |
Experiments A–C regressed overall — fine-tuning with synthetic data pushed the model toward verbose answers that hurt exact match on short-answer benchmarks. Experiment D reversed this with gains across all three benchmarks.
Results & Analysis
Cross-model comparison
| Model | Params | VQA-RAD | PathVQA | SLAKE | Avg EM | vs Base |
|---|---|---|---|---|---|---|
| Qwen3.5-2B (D) | 2B | 0.5521 | 0.4748 | 0.6880 | 0.5716 | +15.0% |
| Qwen2.5-VL-3B (v6) | 3B | 0.5211 | 0.3468 | 0.6032 | 0.4904 | +8.9% |
| Ministral-3B (D) | 3B | 0.4789 | 0.3669 | 0.5664 | 0.4707 | +9.6% |
Base vs Fine-Tuned: Average Exact Match
0.60 ┤ ◆◆ 0.5716
│
0.55 ┤
│ ◇◇ 0.4972
0.50 ┤ ▲▲ 0.4904
│ ●● 0.4707
0.45 ┤ ○○ 0.4296 △△ 0.4503
│
0.40 ┤
│
0.35 ┤
└──────────────────────────────────────────────
Ministral-3B Qwen2.5-VL-3B Qwen3.5-2B
○○/●● Ministral △△/▲▲ Qwen2.5-VL ◇◇/◆◆ Qwen3.5
open = base filled = fine-tuned
All three model families improve across all three benchmarks:
- Qwen3.5-2B (D) reaches 0.5716 avg EM, with the largest gains on PathVQA (+24.2%) and SLAKE (+22.5%).
- Qwen2.5-VL-3B (v6) improves across all benchmarks, with the most notable gain on SLAKE (+10.9%), resolving the regression seen in earlier experiments.
- Ministral-3B (D) improves +9.6% overall when evaluated with vLLM, with gains concentrated on PathVQA (+13.2%) and SLAKE (+14.5%).
The SLAKE paradox, resolved
Early experiments (v2–v5 for Qwen2.5-VL, A–C for Qwen3.5 and Ministral) showed consistent SLAKE regression: our synthetic data trains models to give explanatory multi-sentence responses, while SLAKE expects short categorical answers ("yes", "no", "axial", "liver"). A model that answers "Yes, there is evidence of cardiomegaly with a cardiothoracic ratio exceeding 0.5" is medically correct but gets 0 exact match on a question expecting "Yes".
The final experiments (v6, D) resolved this: all three model families now improve on SLAKE: Qwen3.5-2B D (+22.5%), Ministral D (+14.5%), Qwen2.5-VL v6 (+10.9%). The models learned to produce both explanatory and concise answers as appropriate.
Lessons Learned
Eight practical lessons from building SynthVision end-to-end:
1. Reasoning models need massive max_tokens budgets. Both Qwen 3.5 and Kimi K2.5 use 5,000-26,000 thinking tokens from their shared max_tokens budget. Setting max_tokens=4096 (a common default) results in truncated outputs because the model spends its entire budget on internal reasoning. We used max_tokens=16384 for all annotation requests.
2. CUBLAS fix for modern PyTorch. Training with torch 2.9+cu128 on A100s produces cublasSgemmStridedBatched INVALID_VALUE errors. The fix is a single line at the top of the training script:
torch.backends.cuda.preferred_blas_library("cublaslt")
Alternatively, set TORCH_BLAS_PREFER_CUBLASLT=1 as an environment variable.
3. Ministral's sliding_window=None config bug. The Mistral model checkpoint shipped with sliding_window: null in its config, which crashes vLLM during inference. We patched config.json to set "sliding_window": 4096 before loading.
4. Qwen3.5-2B requires bleeding-edge dependencies. Qwen3.5-2B requires transformers>=5.3.0 and peft>=0.18.1. Do NOT install fla — it crashes on Python 3.10. For evaluation, we used vLLM 0.16.1rc (dev build) with tensor parallelism on a single GPU (tp=1).
5. Two corrupt samples crashed training at step 690/700. Two records in the training set had no user message in their conversations — just assistant turns. This triggered an assertion error in the collator. We added a defensive check to skip records where the first conversation turn is not from a human. These two samples were found and fixed only after losing a nearly-complete training run.
6. uv run enforces declared dependencies. We used uv as our package manager throughout. By design, uv sync removes packages not listed in pyproject.toml, and uv run does not guarantee their availability. For scripts that need autobatcher (Doubleword's batching library), which was installed via pip but never declared as a project dependency, we therefore invoked .venv/bin/python directly instead of uv run.
7. vLLM batched eval is 20x faster than PyTorch generate(). Evaluating Qwen2.5-VL-3B on all three benchmarks (8,231 samples total) takes 2+ hours with standard PyTorch inference (model.generate() loop). With vLLM using tensor parallelism across 4 A100s via Hugging Face Jobs, the same evaluation completes in ~6 minutes. Continuous batching and PagedAttention make the difference. All three model families (Qwen2.5-VL, Qwen3.5, and Ministral) were evaluated via vLLM for the final experiments.
8. Progressive image resizing for API body limits. The Doubleword API has a 5MB per-request body limit (common across batch APIs). With base64 encoding overhead (~1.33x), the effective image size limit is ~3.9MB. Rather than pre-processing all images, we resize on-the-fly: 80% scale per iteration, JPEG quality 85, up to 10 attempts before falling back to 512x512.
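The defensive check from lesson 5 is tiny but saved subsequent runs; a sketch using the ShareGPT field names from earlier in the post:

```python
def is_trainable(record):
    """Skip records whose conversation does not open with a human turn
    (the failure mode that crashed training at step 690/700)."""
    convs = record.get("conversations") or []
    return bool(convs) and convs[0].get("from") == "human"
```

Running a check like this over the full dataset before launching a multi-hour job is cheap insurance against losing a nearly-complete run.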
When Does Synthetic Data Help?
The pattern holds across all three capability levels:
- Qwen3.5-2B starts at 0.50 avg EM (the highest base) and still gains +15.0%, with PathVQA (+24.2%) and SLAKE (+22.5%) seeing the largest jumps.
- Qwen2.5-VL-3B starts at 0.45 avg EM and gains +8.9%, with all three benchmarks improving.
- Ministral-3B starts at 0.43 avg EM (vLLM eval) and gains +9.6%, with PathVQA and SLAKE seeing the largest improvements.
The gains are largest on benchmarks requiring short-answer formatting (SLAKE, PathVQA), suggesting the training data helps models learn when to give concise vs. explanatory answers.
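This format sensitivity is inherent to exact-match scoring. A typical normalized EM metric looks like the sketch below (an assumption about common practice; the repo's evaluation scripts may normalize differently):

```python
import re
import string

def exact_match(prediction, reference):
    """1.0 if prediction and reference agree after lowercasing,
    punctuation stripping, and whitespace collapsing; else 0.0."""
    def norm(s):
        s = s.lower().strip()
        s = s.translate(str.maketrans("", "", string.punctuation))
        return re.sub(r"\s+", " ", s).strip()
    return float(norm(prediction) == norm(reference))
```

Under this metric a clinically correct but verbose answer scores zero against a one-word reference, which is exactly the SLAKE failure mode described above.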
Artifacts & Reproducibility
Every intermediate artifact, from raw seeds through final models, is on the Hugging Face Hub. Any team can pick up at any stage: use our seeds, start from our validated annotations, or directly load a fine-tuned model.
All datasets and models are in the SynthVision collection.
For med tech teams: reproducing this pipeline
The full pipeline runs end-to-end for under $500 in annotation costs:
- Load seed datasets from the Hub
- Batch-annotate via Doubleword's API (119K images, $452)
- Cross-validate, merge, and deduplicate
- LoRA fine-tune on HF Jobs (YAML configs included)
- Evaluate via vLLM on HF Jobs
Source code: The complete pipeline (data loaders, annotation, validation, training, and evaluation) is at github.com/openmed-labs/synthvision.
Toolchain: Python 3.11, uv package manager, PyTorch 2.9, Hugging Face transformers, PEFT, accelerate, datasets, vLLM.
Compute: 4x NVIDIA A100 80GB (48 vCPU, 568 GB RAM, 320 GB disk) via Hugging Face Jobs ($10/hr): DDP training, tensor-parallel inference. Our longest training run (Qwen v5, 1h 14m) cost ~$12.33.
Future Work
Several directions could extend SynthVision:
- Enable additional seed datasets: NIH ChestX-ray14 (112K chest X-rays), HAM10000 (10K dermatoscopy), Quilt-1M (1M pathology) are already configured but disabled. Adding them would diversify modality coverage and increase dataset size to 300K+.
- GRPO post-training: recent work (MedVLM-R1, Med-R1) shows that Group Relative Policy Optimization on 2B medical VLMs can improve accuracy 15–30% beyond SFT alone, using exact-match rewards from benchmark ground truth. Our SFT-trained adapters could serve as initialization for a GRPO stage, potentially pushing accuracy further without additional annotation.
- DPO from cross-validation pairs: the ~8K rejected annotations form natural preference pairs (accepted vs. rejected) that could train a reward model or directly optimize via DPO.
- Larger target models: our experiments focused on 2–3B models. Fine-tuning 7–8B models (Qwen2.5-VL-7B, LLaVA-Med-7B) may show different gain patterns.
- Additional benchmarks: CheXpert, MedQA, and MMMU-Medical would provide broader evaluation coverage beyond VQA-RAD, PathVQA, and SLAKE.
- Instruction-tuned evaluation: current exact-match metrics penalize verbose answers. Using GPT-4 or Claude as judges for semantic correctness would better capture the clinical utility of explanatory answers.
Conclusion
Cross-model synthetic data generation works for medical VQA. Two frontier VLMs with different error patterns, cross-validating each other's outputs, achieve a 93% pass rate on 119K images, yielding 110K training records that are clinically grounded and diverse.
Fine-tuning with this data improves all three model families across all three benchmarks. Qwen3.5-2B (D) reaches 0.5716 avg EM (+15.0% over base), Qwen2.5-VL-3B (v6) gains +8.9%, and Ministral-3B (D) gains +9.6%.
The pipeline is fully reproducible — every dataset, adapter, and script is publicly available. We hope SynthVision serves as both a training resource for the medical AI community and a template for what a small team can build with the right tools.
All datasets and models are available at huggingface.co/OpenMed. Batch annotation powered by Doubleword. Source code at github.com/openmed-labs/synthvision.
