Inference results from the Hugging Face code example differ from the checkpoint artifacts
I was using the Hugging Face code snippet from the model card to do some image-related inference. The results don't follow the prompt instructions very well: they are very lengthy and sentences are cut off at the end of the response.
Then I tested the checkpoint artifacts downloaded by the get_models.sh script from this repo, along with the predict.py script that comes with it. That path follows the prompt well and the response is sensible, without truncated sentences.
Hello @zhaoxin-liang !
I double-checked the code and found a missing pre-processing operation in the transformers example 🤦‍♂️. Model output did not change significantly for me when I added it, but I did confirm that results were exactly the same between the transformers version and predict.py with the updated pre-processing. Could you please confirm if that's the case for you as well? This is the updated example (the difference is just the call to expand2square):
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "apple/FastVLM-0.5B"
IMAGE_TOKEN_INDEX = -200 # what the model code looks for
# Load
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
)
# From llava.mm_utils
def expand2square(pil_img, background_color):
    width, height = pil_img.size
    if width == height:
        return pil_img
    elif width > height:
        result = Image.new(pil_img.mode, (width, width), background_color)
        result.paste(pil_img, (0, (width - height) // 2))
        return result
    else:
        result = Image.new(pil_img.mode, (height, height), background_color)
        result.paste(pil_img, ((height - width) // 2, 0))
        return result
# Build chat -> render to string (not tokens) so we can place <image> exactly
messages = [
{"role": "user", "content": "<image>\nDescribe the image in one paragraph."}
]
rendered = tok.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
pre, post = rendered.split("<image>", 1)
# Tokenize the text *around* the image token (no extra specials!)
pre_ids = tok(pre, return_tensors="pt", add_special_tokens=False).input_ids
post_ids = tok(post, return_tensors="pt", add_special_tokens=False).input_ids
# Splice in the IMAGE token id (-200) at the placeholder position
img_tok = torch.tensor([[IMAGE_TOKEN_INDEX]], dtype=pre_ids.dtype)
input_ids = torch.cat([pre_ids, img_tok, post_ids], dim=1).to(model.device)
attention_mask = torch.ones_like(input_ids, device=model.device)
# Preprocess image via the model's own processor
img = Image.open("rabbit.jpg").convert("RGB")
img = expand2square(img, 0)
image_processor = model.get_vision_tower().image_processor
px = image_processor(images=img, return_tensors="pt")["pixel_values"]
px = px.to(model.device, dtype=model.dtype)
# Generate
with torch.no_grad():
    out = model.generate(
        inputs=input_ids,
        attention_mask=attention_mask,
        do_sample=False,
        images=px,
        max_new_tokens=150,
    )
print(tok.decode(out[0], skip_special_tokens=True))
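As a quick sanity check on the splicing, you can also inspect the tensors right before generation. This just continues from the variables defined in the snippet above:
# Continuing from the example above: verify the <image> placeholder splice and shapes.
num_image_tokens = int((input_ids == IMAGE_TOKEN_INDEX).sum())
assert num_image_tokens == 1, f"expected exactly one image placeholder, got {num_image_tokens}"

print("input_ids shape:", tuple(input_ids.shape))
print("pixel_values shape:", tuple(px.shape))  # typically (1, 3, H, W) after the image processor
print("pixel_values dtype:", px.dtype, "| model dtype:", model.dtype)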
@pcuenq, I'm still seeing inconsistent and repetitive output from the model when using transformers compared to the official predict.py from the Apple ML team. Do you have any further suggestions on how to fix this?
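In case it helps narrow things down, this is the kind of check I can run to see whether the two pipelines feed the model the same image tensor. It is purely illustrative: the .pt files are tensors I would dump manually from each pipeline right before model.generate, and the filenames are placeholders:
import torch

# Tensor produced by the transformers example above (px), saved for comparison.
px_transformers = torch.load("pixel_values_from_transformers.pt")

# Tensor saved from inside predict.py right before generation (hypothetical dump).
px_predict = torch.load("pixel_values_from_predict.pt")

print("shapes:", px_transformers.shape, px_predict.shape)
print("max abs diff:", (px_transformers.float() - px_predict.float()).abs().max().item())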