Disappointing quality with scanned documents
Has anyone tried to use this model to convert documents that have been scanned?
Even on scans I would consider very high quality (not crooked, minimal artifacts, trivially readable by a human), I have found this model to produce really poor results and often get stuck in repeating loops.
Hi @Awschult , can you elaborate a bit on how you tried to run the model? This model is designed specifically to run with the docling library which includes very specific input formatting and prompting, so it's very easy to accidentally mess up the inputs (incorrect newlines, bad instructions, etc). The infinite looping is a very common symptom of subtly incorrect input.
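For reference, the intended path is to let docling drive the model rather than prompting it by hand. A minimal sketch of that, assuming a recent docling version (the exact pipeline-option names can differ between releases, so treat this as illustrative and check the docling docs for your installed version):

```python
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# VlmPipeline handles the model-specific input formatting and prompting,
# which is the part that is easy to get wrong when calling the model directly.
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_cls=VlmPipeline),
    }
)

result = converter.convert("scanned_page.pdf")
print(result.document.export_to_markdown())
```

Running the model through the pipeline this way avoids the subtly-incorrect-input failure mode described above, since docling constructs the prompt and image preprocessing itself.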
I ran it once with the WebGPU space and other times with my own local Ollama instance.
In both scenarios, I uploaded a PNG image of a page to the chat interface with an empty prompt.
I wanted to see the docling tags produced by the model before I mix docling in.
Some pages turned out fine, but others really didn't perform well. I can send you a document, or post a link to one that I was working with if that would help.
Thanks for the details! Any samples you have would be great feedback for future models and/or debugging edge failure cases with prompting.
my own local ollama instance
Can you clarify which specific model you ran with Ollama? With Ollama, running directly from a HF GGUF file is likely to end up with the wrong chat template which can cause problems, especially for a model as sensitive as this one. If you haven't already, you can try our official Ollama model: https://ollama.com/ibm/granite-docling.
Thank you for the quick feedback. Here is a link to the document that I was testing with.
As for which model, I was indeed using the official IBM granite-docling model from Ollama. However, there was another time when I used the WebGPU version, I think it was here on Hugging Face, and that also ended up in a repeating loop pattern with a bunch of broken messages.
Thank you for all of your responses to all of my comments. At this point, I am just going to patiently wait until llama.cpp/Ollama gets updated.
The Granite team has done amazing work, and your models are my favorite. If you could update the speech model to handle diarization (multiple-speaker labeling), that would be incredible and push you to the forefront of the industry.