GEM Image: Building an AI That Actually Gets Educational Diagrams Right
When we think about AI image generation, we usually think about stunning artwork, photorealistic portraits, or creative visual concepts. But what happens when accuracy matters more than aesthetics? What happens when a wrong label on an anatomy diagram or a misshapen country border can actively mislead a student?
That's exactly the problem we at AIPrep Labs set out to solve with GEM Image, a proprietary image-generation system built from the ground up for educational settings where structural fidelity isn't optional: it's the whole point.
The Problem: Pretty Isn't Enough
Modern text-to-image models are remarkable. They can conjure breathtaking visuals from a few words. But beneath the surface polish, they routinely fail in ways that matter enormously for education:
- A heart diagram with a missing ventricle
- A map of a country with incorrect borders or swapped regions
- A historical figure rendered with the wrong facial features or clothing
- A mechanical part with an impossible joint or missing component
These aren't just stylistic quirks; they're factual errors that can embed misconceptions in learners' minds. The research team behind GEM Image identified four recurring failure modes in existing models:
- Identity Drift: generating a plausible-looking face or figure that doesn't match the actual person
- Anatomical Hallucinations: inventing internal structures that don't exist, or omitting ones that do
- Topology Errors: incorrect region adjacency, boundary discontinuities, or missing/extra areas in maps
- Geometry Errors: wrong proportions, impossible joints, or inconsistent symmetry in technical objects
The Solution: GEM Image
GEM Image takes a fundamentally different approach, built on three core pillars:
1. Style- and Format-Constrained Generation
Outputs are standardized to a clean educational rendering: a vector-like style on a white background. This isn't just an aesthetic choice. By reducing spurious visual variation, it becomes much easier to spot structural errors and to keep imagery consistent across a curriculum.
2. Structure-Preserving Guidance
During generation, the model is actively guided away from common failure modes. Rather than letting the model freestyle, GEM Image enforces consistency of salient substructures, making sure the key parts, proportions, and topology that define an educational image are preserved.
3. Reference-Based Verification
Perhaps the most important innovation: GEM Image separates aesthetic quality from educational validity. An image is only considered successful if it preserves the structural features a learner needs. This is enforced by comparing generated outputs against curated ground-truth reference images, not as a training signal, but as a post-generation quality check.
An image can look beautiful and still be educationally useless. GEM Image is designed to pass the test that actually matters.
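The generate-then-verify flow described above can be sketched roughly as follows. Note that `generate_image` and `structurally_matches` are hypothetical stand-ins for GEM Image's internal generator and reference-based check, which are not public; this is an illustration of the control flow, not the actual implementation.

```python
from typing import Callable, Optional

def generate_with_verification(
    prompt: str,
    reference: bytes,
    generate_image: Callable[[str], bytes],
    structurally_matches: Callable[[bytes, bytes], bool],
    max_attempts: int = 3,
) -> Optional[bytes]:
    """Generate an image and accept it only if it passes a
    reference-based structural check; retry otherwise."""
    for _ in range(max_attempts):
        candidate = generate_image(prompt)
        # Post-generation quality gate: aesthetics alone are not enough;
        # the candidate must preserve the structure a learner needs.
        if structurally_matches(candidate, reference):
            return candidate
    return None  # no structurally valid image within the attempt budget
```

The key design point is that verification happens after generation against a held-out reference, so a beautiful-but-wrong candidate is simply rejected.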
Introducing GEM-WebGT100: A New Benchmark
To rigorously evaluate structural fidelity in educational image generation, the team created GEM-WebGT100, a benchmark of 100 items across four categories (25 each):
| Category | What's Being Tested |
|---|---|
| Objects | Technical objects with salient parts and geometry |
| Maps | Region/city maps where topology and boundaries matter |
| Portraits | Depictions where anatomy, identity, and pose consistency matter |
| Organs | Anatomical structures where shape and internal detail matter |
Each benchmark item includes a text name, a category label, and a web-sourced ground-truth image used only for evaluation and never provided to any generator. This keeps the comparison clean and unbiased.
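The benchmark's shape is easy to pin down in code. The field names below are illustrative (the actual GEM-WebGT100 schema isn't published), but they capture the three pieces each item carries:

```python
from dataclasses import dataclass

CATEGORIES = ("Objects", "Maps", "Portraits", "Organs")

@dataclass(frozen=True)
class BenchmarkItem:
    name: str            # text name of the entity, e.g. "human heart"
    category: str        # one of the four categories above
    reference_path: str  # web-sourced ground truth, used only at evaluation time

def validate_benchmark(items: list[BenchmarkItem]) -> None:
    """Check the GEM-WebGT100 shape: 100 items, 25 per category."""
    assert len(items) == 100, "benchmark must contain exactly 100 items"
    for cat in CATEGORIES:
        count = sum(item.category == cat for item in items)
        assert count == 25, f"{cat} must contain exactly 25 items"
```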
How Evaluation Works
Each model is given only the entity name and a fixed style constraint (vector style, white background). The generated image is then compared against the ground truth by Gemini 3 Flash, acting as an automated visual verifier.
The verifier prompt is deliberately focused on what matters:
"Compare the candidate image to the reference. Output only TRUE if the candidate matches the reference closely enough for educational use (structure, proportions, key parts/topology). Otherwise output only FALSE. Excluding the style as the Candidate image is gonna be vector style."
The result is a simple pass rate: the fraction of items where the generated image is structurally correct enough for real educational use.
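A minimal sketch of the scoring step, assuming the verifier's raw output per item is the literal string TRUE or FALSE as the prompt requests:

```python
def parse_verdict(raw: str) -> bool:
    """Map the verifier's TRUE/FALSE output to a boolean pass/fail."""
    return raw.strip().upper() == "TRUE"

def pass_rate(verdicts: list[bool]) -> float:
    """Fraction of items judged structurally correct, as a percentage."""
    if not verdicts:
        return 0.0
    return 100.0 * sum(verdicts) / len(verdicts)
```

For example, 93 passes out of 100 items yields a pass rate of 93.0.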
Results: GEM Image Pulls Ahead
The numbers speak clearly:
| Model | Overall | Objects | Maps | Portraits | Organs |
|---|---|---|---|---|---|
| GEM Image (ours) | 92.25 | 97 | 93 | 88 | 91 |
| NanoBanana Pro | 72.5 | 89 | 67 | 63 | 71 |
| GPT Image 1.5 | 58.75 | 87 | 34 | 49 | 65 |
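As a sanity check on the table, each Overall figure is the unweighted mean of that model's four category scores (the categories are equally sized, so no weighting is needed). The arithmetic confirms all three rows:

```python
def overall_score(category_scores: dict[str, float]) -> float:
    """Unweighted mean across the four equally sized categories."""
    return sum(category_scores.values()) / len(category_scores)

gem_image = {"Objects": 97, "Maps": 93, "Portraits": 88, "Organs": 91}
print(overall_score(gem_image))  # 92.25, matching the Overall column
```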
The gains are most dramatic in the categories where structural errors are most consequential:
- Maps: GEM Image scores 93, vs. 67 for NanoBanana Pro and a striking 34 for GPT Image 1.5, meaning GPT Image 1.5 fails on nearly two-thirds of map prompts when judged against real ground truth
- Portraits: 88 vs. 63 vs. 49
- Organs: 91 vs. 71 vs. 65
For a student trying to learn geography or regional anatomy, these aren't just statistics; they represent lessons actively undermined by incorrect visuals.
Qualitative Takeaways
Looking at side-by-side comparisons, the pattern is consistent: baseline models produce outputs that look right at a glance but fall apart on closer inspection. Missing anatomical segments, incorrect country boundaries, hallucinated facial features, asymmetric machine parts.
GEM Image doesn't win on aesthetics, though its clean vector outputs are perfectly usable. It wins because the structures are correct.
Sample Outputs
Why This Matters
Educational AI tools are proliferating rapidly. Tutoring platforms, digital textbooks, adaptive learning systems: many are beginning to incorporate AI-generated imagery. If those images are structurally wrong, the consequences aren't just cosmetic. Students build mental models from visual representations. A flawed diagram is a flawed lesson.
GEM Image and the GEM-WebGT100 benchmark represent a meaningful step toward holding AI-generated educational content to a higher standard: one where "good enough to fool a quick glance" isn't the bar, and "correct enough to actually teach something" is.
Research by AIPrep Labs. Benchmark: GEM-WebGT100. Verifier: Gemini 3 Flash.