GEM Image: Building an AI That Actually Gets Educational Diagrams Right

Community Article Published February 21, 2026

When we think about AI image generation, we usually think about stunning artwork, photorealistic portraits, or creative visual concepts. But what happens when accuracy matters more than aesthetics? What happens when a wrong label on an anatomy diagram or a misshapen country border can actively mislead a student?

That's exactly the problem we at AIPrep Labs set out to solve with GEM Image, a proprietary image-generation system built from the ground up for educational settings, where structural fidelity isn't optional: it's the whole point.

The Problem: Pretty Isn't Enough

Modern text-to-image models are remarkable. They can conjure breathtaking visuals from a few words. But beneath the surface polish, they routinely fail in ways that matter enormously for education:

  • A heart diagram with a missing ventricle
  • A map of a country with incorrect borders or swapped regions
  • A historical figure rendered with the wrong facial features or clothing
  • A mechanical part with an impossible joint or missing component

These aren't just stylistic quirks; they're factual errors that can embed misconceptions in learners' minds. The research team behind GEM Image identified four recurring failure modes in existing models:

  • Identity Drift: generating a plausible-looking face or figure that doesn't match the actual person
  • Anatomical Hallucinations: inventing internal structures that don't exist or omitting ones that do
  • Topology Errors: incorrect region adjacency, boundary discontinuities, or missing/extra areas in maps
  • Geometry Errors: wrong proportions, impossible joints, or inconsistent symmetry in technical objects

The Solution: GEM Image

GEM Image takes a fundamentally different approach, built on three core pillars:

1. Style- and Format-Constrained Generation

Outputs are standardized to a clean educational rendering: a vector-like style on a white background. This isn't just an aesthetic choice. Reducing spurious visual variation makes structural errors much easier to spot and keeps images consistent across a curriculum.

2. Structure-Preserving Guidance

During generation, the model is actively guided away from common failure modes. Rather than letting the model freestyle, GEM Image enforces consistency of salient substructures, making sure the key parts, proportions, and topology that define an educational image are preserved.

3. Reference-Based Verification

Perhaps the most important innovation: GEM Image separates aesthetic quality from educational validity. An image is only considered successful if it preserves the structural features a learner needs. This is enforced by comparing generated outputs against curated ground-truth reference images, not as a training signal but as a post-generation quality check.

An image can look beautiful and still be educationally useless. GEM Image is designed to pass the test that actually matters.
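
As a rough illustration of what a post-generation check can look like, the sketch below accepts an image only when a verifier approves it against the reference. The `generate` and `verify` callables and the `max_attempts` retry budget are hypothetical stand-ins introduced for this example; they are not GEM Image's actual interfaces, and the article does not say whether failed outputs are retried.

```python
from typing import Callable, Optional


def generate_verified(
    name: str,
    reference: bytes,
    generate: Callable[[str], bytes],       # hypothetical generator
    verify: Callable[[bytes, bytes], bool],  # hypothetical verifier
    max_attempts: int = 3,                   # assumed retry budget
) -> Optional[bytes]:
    """Return the first candidate the verifier accepts, or None."""
    for _ in range(max_attempts):
        # The generator sees only the name, never the reference image.
        candidate = generate(name)
        # The reference is used purely as a check on the finished output.
        if verify(candidate, reference):
            return candidate
    return None
```

The point of the design is that the reference never flows into generation: it only decides whether a finished image is kept.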

Introducing GEM-WebGT100: A New Benchmark

To rigorously evaluate structural fidelity in educational image generation, the team created GEM-WebGT100, a benchmark of 100 items across four categories (25 each):

| Category | What's Being Tested |
| --- | --- |
| Objects | Technical objects with salient parts and geometry |
| Maps | Region/city maps where topology and boundaries matter |
| Portraits | Depictions where anatomy, identity, and pose consistency matter |
| Organs | Anatomical structures where shape and internal detail matter |

Each benchmark item includes a text name, a category label, and a web-sourced ground-truth image that is used only for evaluation and never provided to any generator. Because no generator ever sees the references, every model is judged on the same footing.
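
One way to picture a benchmark item is as a small record with the three fields listed above. The sketch below is an assumption about layout, not the published format of GEM-WebGT100: the class name, field names, and helper functions are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

CATEGORIES = ("Objects", "Maps", "Portraits", "Organs")


@dataclass(frozen=True)
class BenchmarkItem:
    name: str            # text name of the entity
    category: str        # one of CATEGORIES
    reference_path: str  # ground-truth image, for the verifier only


def generator_view(item: BenchmarkItem) -> dict:
    """Return the part of an item a generator may see (no reference)."""
    return {"name": item.name}


def is_balanced(items, per_category: int = 25) -> bool:
    """Check that every category has exactly `per_category` items."""
    counts = Counter(item.category for item in items)
    return set(counts) == set(CATEGORIES) and all(
        counts[c] == per_category for c in CATEGORIES
    )
```

Keeping the reference path out of `generator_view` makes the "never provided to any generator" rule something code can enforce rather than a convention.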

How Evaluation Works

Each model is given only the entity name and a fixed style constraint (vector style, white background). The generated image is then compared to the ground-truth image by Gemini 3 Flash, acting as an automated visual verifier.

The verifier prompt is deliberately focused on what matters:

"Compare the candidate image to the reference. Output only TRUE if the candidate matches the reference closely enough for educational use (structure, proportions, key parts/topology). Otherwise output only FALSE. Excluding the style as the Candidate image is gonna be vector style."
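
Because the prompt asks for only TRUE or FALSE, the verifier's reply can be reduced to a boolean with a strict parser. The sketch below is a minimal example; `parse_verdict` is a hypothetical helper, and raising an error on anything else is an assumption, since the article does not say how malformed replies are handled.

```python
def parse_verdict(raw: str) -> bool:
    """Map a verifier reply to a boolean, rejecting anything ambiguous."""
    token = raw.strip().upper()
    if token == "TRUE":
        return True
    if token == "FALSE":
        return False
    # Assumed policy: fail loudly rather than guess at a verdict.
    raise ValueError(f"unexpected verifier reply: {raw!r}")
```

Failing loudly keeps a chatty or truncated reply from being silently counted as a pass or a fail.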

The result is a simple pass rate: the fraction of items where the generated image is structurally correct enough for real educational use.
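
A per-category pass rate can be computed from a list of verdicts in a few lines. This is a sketch under the assumption that each result is a `(category, passed)` pair; the function name and the choice to report percentages are illustrative.

```python
from collections import defaultdict


def pass_rates(results):
    """Turn (category, passed) pairs into {category: pass rate in percent}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {c: 100.0 * passes[c] / totals[c] for c in totals}
```

For example, three passes and one fail in a category give that category a pass rate of 75 percent.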

Results: GEM Image Pulls Ahead

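The numbers speak clearly. Pass rates are in percent:

| Model | Overall | Objects | Maps | Portraits | Organs |
| --- | --- | --- | --- | --- | --- |
| GEM Image (ours) | 92.25 | 97 | 93 | 88 | 91 |
| NanoBanana Pro | 72.5 | 89 | 67 | 63 | 71 |
| GPT Image 1.5 | 58.75 | 87 | 34 | 49 | 65 |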
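
The Overall column is consistent with an unweighted mean of the four category scores, which is what equal-sized categories would produce. The check below uses only the figures from the table; the variable names are illustrative, and the averaging rule is inferred from the numbers rather than stated by the authors.

```python
from statistics import mean

# Category scores in the order Objects, Maps, Portraits, Organs.
category_scores = {
    "GEM Image": [97, 93, 88, 91],
    "NanoBanana Pro": [89, 67, 63, 71],
    "GPT Image 1.5": [87, 34, 49, 65],
}
reported_overall = {
    "GEM Image": 92.25,
    "NanoBanana Pro": 72.5,
    "GPT Image 1.5": 58.75,
}

# Recompute each overall score as the plain mean of the four categories.
recomputed = {model: mean(scores) for model, scores in category_scores.items()}

for model, value in recomputed.items():
    assert abs(value - reported_overall[model]) < 1e-9, model
```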

The gains are most dramatic in the categories where structural errors are most consequential:

  • Maps: GEM Image scores 93 vs. 67 (NanoBanana Pro) and a striking 34 (GPT Image 1.5), meaning GPT Image 1.5 fails on nearly two-thirds of map prompts when judged against a real ground truth
  • Portraits: 88 vs. 63 vs. 49
  • Organs: 91 vs. 71 vs. 65

For a student trying to learn geography or regional anatomy, these aren't just statistics; they represent lessons actively undermined by incorrect visuals.

Qualitative Takeaways

Looking at side-by-side comparisons, the pattern is consistent: baseline models produce outputs that look right at a glance but fall apart on closer inspection. Typical failures include missing anatomical segments, incorrect country boundaries, hallucinated facial features, and asymmetric machine parts.

GEM Image doesn't win on aesthetics, though its clean vector outputs are perfectly usable. It wins because the structures are correct.

Sample Outputs

[Figure: collage of sample outputs]

Why This Matters

Educational AI tools are proliferating rapidly. Tutoring platforms, digital textbooks, adaptive learning systems: many are beginning to incorporate AI-generated imagery. If those images are structurally wrong, the consequences aren't just cosmetic. Students build mental models from visual representations. A flawed diagram is a flawed lesson.

GEM Image and the GEM-WebGT100 benchmark represent a meaningful step toward holding AI-generated educational content to a higher standard, one where "good enough to fool a quick glance" isn't the bar, and "correct enough to actually teach something" is.

Research by AIPrep Labs. Benchmark: GEM-WebGT100. Verifier: Gemini 3 Flash.
