
Granite Vision

This model was released on 2024-12-18 and added to Hugging Face Transformers on 2025-01-23.

Granite Vision is a variant of LLaVA-NeXT that pairs a Granite language model with a SigLIP visual encoder. Like VipLlava, it concatenates multiple vision hidden states to form its image features. It also uses a larger set of image grid pinpoints than the original LLaVA-NeXT models, which lets it support additional aspect ratios.
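
To illustrate why a larger set of grid pinpoints helps with unusual aspect ratios, here is a minimal plain-Python sketch of LLaVA-NeXT-style best-fit resolution selection. The pinpoint lists and the image size are hypothetical values chosen for illustration; they are not Granite Vision's actual configuration, and the function is a stand-in rather than the library's own code.

```python
def select_best_resolution(original_size, possible_resolutions):
    """Pick the (height, width) candidate that keeps the most image detail,
    breaking ties by the smallest amount of wasted (padded) area."""
    orig_h, orig_w = original_size
    best_fit = None
    max_effective = 0
    min_wasted = float("inf")
    for cand_h, cand_w in possible_resolutions:
        # Scale the image to fit inside the candidate while keeping its aspect ratio.
        scale = min(cand_w / orig_w, cand_h / orig_h)
        scaled_w, scaled_h = int(orig_w * scale), int(orig_h * scale)
        # Upscaling adds no real detail, so cap at the original pixel count.
        effective = min(scaled_w * scaled_h, orig_w * orig_h)
        wasted = cand_w * cand_h - effective
        if effective > max_effective or (
            effective == max_effective and wasted < min_wasted
        ):
            max_effective = effective
            min_wasted = wasted
            best_fit = (cand_h, cand_w)
    return best_fit


# Hypothetical pinpoints built from 384-pixel tiles (assumed, not the real config).
base_pinpoints = [(384, 384), (384, 768), (768, 384), (768, 768)]
extended_pinpoints = base_pinpoints + [(384, 1152), (384, 1536)]

image_size = (256, 1024)  # (height, width): a wide 1:4 image

print(select_best_resolution(image_size, base_pinpoints))      # (384, 768)
print(select_best_resolution(image_size, extended_pinpoints))  # (384, 1152)
```

With only the base set, the wide image has to be shrunk to fit a 2:1 grid. The extended set offers a 3:1 grid that preserves all of the original pixels.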

Tips:

  • This model is loaded into Transformers as an instance of LLaVA-NeXT, so the usage and tips for LLaVA-NeXT apply to it as well.

  • You can also apply the chat template through the tokenizer or processor, just as with LLaVA-NeXT. Example chat format:

"<|user|>\nWhat’s shown in this image?\n<|assistant|>\nThis image shows a red stop sign.<|end_of_text|><|user|>\nDescribe the image in more details.\n<|assistant|>\n"

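The structure of the example string above can be reproduced by hand, which makes the turn boundaries easier to see. The sketch below uses a hypothetical helper, `format_chat`, that mirrors only the format shown in that example; it ignores image placeholders and any system prompt, and in practice the processor's `apply_chat_template` should be used instead.

```python
def format_chat(turns, add_generation_prompt=True):
    """Hypothetical helper: join (role, text) turns in the format of the example above."""
    parts = []
    for role, text in turns:
        if role == "user":
            # User turns end with a newline before the assistant tag.
            parts.append(f"<|user|>\n{text}\n")
        elif role == "assistant":
            # Completed assistant turns are closed with the end-of-text token.
            parts.append(f"<|assistant|>\n{text}<|end_of_text|>")
        else:
            raise ValueError(f"unsupported role: {role!r}")
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|assistant|>\n")
    return "".join(parts)


prompt = format_chat(
    [
        ("user", "What is shown in this image?"),
        ("assistant", "This image shows a red stop sign."),
        ("user", "Describe the image in more detail."),
    ]
)
print(repr(prompt))
```
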
Sample inference:

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
from accelerate import Accelerator
device = Accelerator().device
model_path = "ibm-granite/granite-vision-3.1-2b-preview"
processor = LlavaNextProcessor.from_pretrained(model_path)
model = LlavaNextForConditionalGeneration.from_pretrained(model_path).to(device)
# prepare image and text prompt, using the appropriate prompt template
url = "https://github.com/haotian-liu/LLaVA/blob/1a91fc274d7c35a9b50b3cb29c4247ae5837ce39/images/llava_v1_5_radar.jpg?raw=true"
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
# autoregressively complete prompt
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
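
The decoded output of the sample above includes the prompt text as well as the reply. A common way to keep only the reply is to drop the prompt tokens before decoding. The sketch below shows the idea with plain lists and made-up token ids, assuming (as is usual for decoder-only generation) that each generated sequence starts with the prompt tokens; `strip_prompt_tokens` is a hypothetical helper, not a Transformers function.

```python
def strip_prompt_tokens(sequences, prompt_length):
    """Hypothetical helper: drop the first `prompt_length` ids from each sequence."""
    return [seq[prompt_length:] for seq in sequences]


# Made-up token ids standing in for the prompt and the generated continuation.
prompt_ids = [101, 102, 103]
generated = [prompt_ids + [7, 8, 9]]

new_tokens = strip_prompt_tokens(generated, len(prompt_ids))
print(new_tokens)  # [[7, 8, 9]]

# With the tensors from the sample above, the equivalent slice would be
# output[:, inputs["input_ids"].shape[1]:], followed by processor.batch_decode(...).
```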

This model was contributed by Alexander Brooks.

[[autodoc]] LlavaNextConfig

[[autodoc]] LlavaNextImageProcessor
    - preprocess

[[autodoc]] LlavaNextProcessor

[[autodoc]] LlavaNextForConditionalGeneration
    - forward