Inference results from the Hugging Face code example differ from the checkpoint artifacts
I was using the Hugging Face code snippet from the model card to do some image-related inference. The results don't follow the prompt instructions very well: they are very lengthy and sentences are cut off at the end of the response.
Then I tested the checkpoint artifacts downloaded by the get_models.sh script from this repo, along with the predict.py script that comes with it. That path follows the prompt well and the response is sensible, without truncated sentences.
Hello @zhaoxin-liang !
I double-checked the code and found a missing pre-processing operation in the transformers example 🤦‍♂️. Model output did not change significantly for me when I added it, but I did confirm that results were exactly the same between the transformers version and predict.py with the updated pre-processing. Could you please confirm if that's the case for you as well? This is the updated example (the difference is just the call to expand2square):
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "apple/FastVLM-0.5B"
IMAGE_TOKEN_INDEX = -200 # what the model code looks for
# Load
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
)
# From llava.mm_utils
def expand2square(pil_img, background_color):
    width, height = pil_img.size
    if width == height:
        return pil_img
    elif width > height:
        result = Image.new(pil_img.mode, (width, width), background_color)
        result.paste(pil_img, (0, (width - height) // 2))
        return result
    else:
        result = Image.new(pil_img.mode, (height, height), background_color)
        result.paste(pil_img, ((height - width) // 2, 0))
        return result
# Build chat -> render to string (not tokens) so we can place <image> exactly
messages = [
{"role": "user", "content": "<image>\nDescribe the image in one paragraph."}
]
rendered = tok.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
pre, post = rendered.split("<image>", 1)
# Tokenize the text *around* the image token (no extra specials!)
pre_ids = tok(pre, return_tensors="pt", add_special_tokens=False).input_ids
post_ids = tok(post, return_tensors="pt", add_special_tokens=False).input_ids
# Splice in the IMAGE token id (-200) at the placeholder position
img_tok = torch.tensor([[IMAGE_TOKEN_INDEX]], dtype=pre_ids.dtype)
input_ids = torch.cat([pre_ids, img_tok, post_ids], dim=1).to(model.device)
attention_mask = torch.ones_like(input_ids, device=model.device)
# Preprocess image via the model's own processor
img = Image.open("rabbit.jpg").convert("RGB")
img = expand2square(img, 0)
image_processor = model.get_vision_tower().image_processor
px = image_processor(images=img, return_tensors="pt")["pixel_values"]
px = px.to(model.device, dtype=model.dtype)
# Generate
with torch.no_grad():
    out = model.generate(
        inputs=input_ids,
        attention_mask=attention_mask,
        do_sample=False,
        images=px,
        max_new_tokens=150,
    )
print(tok.decode(out[0], skip_special_tokens=True))
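As a quick sanity check on the splicing, you can also inspect the tensors right before generation. This just continues from the variables defined in the snippet above:
# Continuing from the example above: verify the <image> placeholder splice and shapes.
num_image_tokens = int((input_ids == IMAGE_TOKEN_INDEX).sum())
assert num_image_tokens == 1, f"expected exactly one image placeholder, got {num_image_tokens}"

print("input_ids shape:", tuple(input_ids.shape))
print("pixel_values shape:", tuple(px.shape))  # typically (1, 3, H, W) after the image processor
print("pixel_values dtype:", px.dtype, "| model dtype:", model.dtype)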
@pcuenq, I'm still seeing inconsistent and repetitive output from the model when using transformers compared to the official predict.py from the Apple ML team. Do you have any further suggestions on how to fix this?
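In case it helps narrow things down, this is the kind of check I can run to see whether the two pipelines feed the model the same image tensor. It is purely illustrative: the .pt files are tensors I would dump manually from each pipeline right before model.generate, and the filenames are placeholders:
import torch

# Tensor produced by the transformers example above (px), saved for comparison.
px_transformers = torch.load("pixel_values_from_transformers.pt")

# Tensor saved from inside predict.py right before generation (hypothetical dump).
px_predict = torch.load("pixel_values_from_predict.pt")

print("shapes:", px_transformers.shape, px_predict.shape)
print("max abs diff:", (px_transformers.float() - px_predict.float()).abs().max().item())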