TextPecker-8B-InternVL3

TextPecker-8B-InternVL3 is an evaluator model presented in the paper TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering.

Standard multimodal LLMs often miss fine-grained text errors in generated images, such as distorted or misaligned characters. TextPecker is designed to perceive and quantify these structural anomalies, so that it can provide reliable reward signals for RL-based optimization of text-to-image models.

This checkpoint is based on the InternVL3-8B-Instruct architecture and was trained using the ms-swift framework on the TextPecker-1.5M dataset.

Model Details

Developed by: Hanshen Zhu, Yuliang Liu, et al. (Huazhong University of Science & Technology and ByteDance)
Model Type: Multimodal Large Language Model (MLLM)
Base Model: OpenGVLab/InternVL3-8B-Instruct
Task: Image-to-Text (Structural Anomaly Perception / OCR Evaluator)
License: Apache 2.0

Model Sources

Repository: https://github.com/CIawevy/TextPecker
Paper: https://huggingface.co/papers/2602.20903
Dataset: CIawevy/TextPecker-1.5M

Uses

TextPecker can be used to evaluate the structural quality and semantic consistency of text in generation or editing scenarios. By providing reliable feedback on character-level structural fidelity, it addresses a gap in Visual Text Rendering (VTR) optimization.
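
As an illustration of how an evaluator's feedback might feed an RL loop, the sketch below blends a structural quality score and a semantic consistency score into a single scalar reward. It is not TextPecker's actual scoring: the function name `combine_reward`, the assumption that both scores lie in [0, 1], and the linear weighting are all hypothetical. The real output format and reward definition are described in the paper and the official repository.

```python
def combine_reward(structural_quality: float,
                   semantic_consistency: float,
                   weight: float = 0.5) -> float:
    """Blend two evaluator scores into one scalar reward (hypothetical helper).

    Assumes both scores are already normalized to [0, 1]; `weight` is the
    share given to structural quality, the rest goes to semantic consistency.
    """
    for name, value in (("structural_quality", structural_quality),
                        ("semantic_consistency", semantic_consistency),
                        ("weight", weight)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return weight * structural_quality + (1.0 - weight) * semantic_consistency


if __name__ == "__main__":
    # Example: well-formed glyphs but partially wrong content.
    print(combine_reward(0.9, 0.5, weight=0.5))
```

A linear blend is only one possible choice; a product of the two scores, for example, would penalize an image more strongly when either score is low.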

To deploy the model or run evaluation, follow the instructions in the official repository: https://github.com/CIawevy/TextPecker

Citation

If you find TextPecker useful in your research, please cite:

@article{zhu2026TextPecker,
  title   = {TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering},
  author  = {Zhu, Hanshen and Liu, Yuliang and Wu, Xuecheng and Wang, An-Lan and Feng, Hao and Yang, Dingkang and Feng, Chao and Huang, Can and Tang, Jingqun and Bai, Xiang},
  journal = {arXiv preprint arXiv:2602.20903},
  year    = {2026}
}