DeepseekVL
This model was released on 2024-03-08 and added to Hugging Face Transformers on 2025-07-25.
Deepseek-VL was introduced by the DeepSeek AI team. It is a vision-language model (VLM) that takes both text and images as input and generates text responses grounded in them. The model uses LLaMA as its language model and SigLIP to encode images.
You can find all the original Deepseek-VL checkpoints under the DeepSeek-community organization.
The example below demonstrates how to generate text based on an image with Pipeline or the AutoModel class.
import torch
from transformers import pipeline

pipe = pipeline(
    task="image-text-to-text",
    model="deepseek-community/deepseek-vl-1.3b-chat",
    device=0,
    dtype=torch.float16
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ]
    }
]

pipe(text=messages, max_new_tokens=20, return_full_text=False)

import torch
from transformers import DeepseekVLForConditionalGeneration, AutoProcessor
model = DeepseekVLForConditionalGeneration.from_pretrained(
    "deepseek-community/deepseek-vl-1.3b-chat",
    dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa"
)

processor = AutoProcessor.from_pretrained("deepseek-community/deepseek-vl-1.3b-chat")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
            },
            {"type": "text", "text": "Describe this image."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, dtype=model.dtype)

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.
The example below uses torchao to only quantize the weights to int4.
import torch
from transformers import TorchAoConfig, DeepseekVLForConditionalGeneration, AutoProcessor

quantization_config = TorchAoConfig(
    "int4_weight_only",
    group_size=128
)

model = DeepseekVLForConditionalGeneration.from_pretrained(
    "deepseek-community/deepseek-vl-1.3b-chat",
    dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config
)
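To give a rough sense of what weight-only int4 quantization saves, the sketch below estimates the size of the weights alone. It is back-of-the-envelope arithmetic, not a measurement. The parameter count is an assumption read off the "1.3b" in the checkpoint name, and the per-group overhead (one 16-bit scale plus one 16-bit zero point for every 128 weights) is an illustrative assumption rather than torchao's documented storage layout. It also assumes every parameter is quantized, which may not hold in practice. `estimate_weight_bytes` is a hypothetical helper, not a library function.

```python
def estimate_weight_bytes(num_params, bits_per_weight, group_size=None, meta_bytes_per_group=0):
    """Rough size of the weights alone, ignoring activations and runtime overhead."""
    total = num_params * bits_per_weight // 8
    if group_size is not None:
        # Integer ceiling division: one metadata record per group of weights.
        num_groups = -(-num_params // group_size)
        total += num_groups * meta_bytes_per_group
    return total


# Assumption: parameter count taken from the "1.3b" in the checkpoint name.
NUM_PARAMS = 1_300_000_000

fp16_bytes = estimate_weight_bytes(NUM_PARAMS, bits_per_weight=16)

# Assumption: 2-byte scale + 2-byte zero point per group of 128 weights.
int4_bytes = estimate_weight_bytes(
    NUM_PARAMS, bits_per_weight=4, group_size=128, meta_bytes_per_group=4
)

print(f"float16 weights: {fp16_bytes / 1e9:.2f} GB")
print(f"int4 weights:    {int4_bytes / 1e9:.2f} GB")
print(f"reduction:       {fp16_bytes / int4_bytes:.2f}x")
```

Under these assumptions the estimate is about 2.60 GB for float16 and about 0.69 GB for int4, a reduction of roughly 3.76x. Actual memory use depends on which layers are quantized and on the backend.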
Do inference with multiple images in a single conversation.
import torch
from transformers import DeepseekVLForConditionalGeneration, AutoProcessor

model = DeepseekVLForConditionalGeneration.from_pretrained(
    "deepseek-community/deepseek-vl-1.3b-chat",
    dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa"
)

processor = AutoProcessor.from_pretrained("deepseek-community/deepseek-vl-1.3b-chat")

messages = [
    [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What’s the difference between"},
                {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
                {"type": "text", "text": " and "},
                {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"}
            ]
        }
    ],
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"},
                {"type": "text", "text": "What do you see in this image?"}
            ]
        }
    ]
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    padding=True,
    truncation=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, dtype=model.dtype)

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
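The examples above slice each generated sequence with `out_ids[len(in_ids) :]` before decoding. The sketch below isolates that step using plain Python lists, on the assumption that the generated sequences begin with the prompt tokens, so the decoded text would otherwise repeat the prompt. The token IDs are made-up values for illustration and `trim_prompt_tokens` is a hypothetical helper, not part of Transformers.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the leading prompt tokens from each generated sequence."""
    return [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(input_ids, generated_ids)
    ]


# Made-up token IDs standing in for real tensors. Both prompts have the
# same length, as they would after padding a batch.
prompt_batch = [[101, 7, 8, 9], [101, 5, 6, 0]]

# Each generated row repeats its prompt and then appends two new tokens.
generated_batch = [[101, 7, 8, 9, 42, 43], [101, 5, 6, 0, 77, 78]]

new_tokens = trim_prompt_tokens(prompt_batch, generated_batch)
print(new_tokens)  # [[42, 43], [77, 78]]
```

Only the trimmed rows are passed to `batch_decode` in the examples, so the output contains the model's response without the prompt text.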
DeepseekVLConfig
[[autodoc]] DeepseekVLConfig
DeepseekVLProcessor
[[autodoc]] DeepseekVLProcessor
DeepseekVLImageProcessor
[[autodoc]] DeepseekVLImageProcessor
DeepseekVLImageProcessorFast
[[autodoc]] DeepseekVLImageProcessorFast
DeepseekVLModel
[[autodoc]] DeepseekVLModel
    - forward
DeepseekVLForConditionalGeneration
[[autodoc]] DeepseekVLForConditionalGeneration
    - forward