
Ernie 4.5 VL MoE

This model was released on 2025-06-30 and added to Hugging Face Transformers on TBD.

PyTorch FlashAttention SDPA Tensor parallelism

The Ernie 4.5 VL MoE model was released as part of the Ernie 4.5 model family by Baidu. This family contains multiple architectures and model sizes. The Vision-Language series in particular uses a novel multimodal heterogeneous structure that shares parameters across modalities while dedicating other parameters to specific modalities. This becomes especially apparent in the Mixture of Experts (MoE) layers, which are composed of

  • Dedicated Text Experts
  • Dedicated Vision Experts
  • Shared Experts

This architecture enhances multimodal understanding without compromising performance on text-related tasks, and can even improve it. A more detailed breakdown is given in the Technical Report.
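
To make the expert split concrete, here is a minimal, self-contained sketch of modality-aware MoE routing: text tokens are routed among text experts, image tokens among vision experts, and a shared expert processes every token. The module names, expert counts, and top-k routing below are assumptions for illustration, not the actual Ernie 4.5 VL MoE implementation.

```py
import torch
import torch.nn as nn

class ModalityMoE(nn.Module):
    """Toy modality-aware MoE layer (illustrative, not the real Ernie 4.5 design)."""

    def __init__(self, hidden, num_text_experts=4, num_vision_experts=4, top_k=2):
        super().__init__()
        # Experts are plain linear layers here; real MoE experts are full FFN blocks
        self.text_experts = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_text_experts))
        self.vision_experts = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_vision_experts))
        self.shared_expert = nn.Linear(hidden, hidden)  # active for every token, both modalities
        self.text_router = nn.Linear(hidden, num_text_experts)
        self.vision_router = nn.Linear(hidden, num_vision_experts)
        self.top_k = top_k

    def forward(self, x, is_vision):
        # x: (tokens, hidden); is_vision: (tokens,) bool mask marking image tokens
        out = self.shared_expert(x)  # shared expert sees every token
        for mask, router, experts in (
            (~is_vision, self.text_router, self.text_experts),
            (is_vision, self.vision_router, self.vision_experts),
        ):
            tokens = x[mask]
            if tokens.numel() == 0:
                continue
            # Route each token to its top-k experts within its own modality
            weights = router(tokens).softmax(-1)
            top_w, top_i = weights.topk(self.top_k, dim=-1)
            mixed = torch.zeros_like(tokens)
            for k in range(self.top_k):
                for e, expert in enumerate(experts):
                    sel = top_i[:, k] == e
                    if sel.any():
                        mixed[sel] += top_w[sel, k, None] * expert(tokens[sel])
            out[mask] = out[mask] + mixed
        return out

moe = ModalityMoE(hidden=16)
x = torch.randn(6, 16)
is_vision = torch.tensor([False, False, True, True, False, True])
print(moe(x, is_vision).shape)  # torch.Size([6, 16])
```

Swapping the linear layers for full feed-forward experts and adding a load-balancing loss would bring this sketch closer to a production MoE layer.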

Other models from the family can be found at Ernie 4.5 and Ernie 4.5 MoE.

The example below demonstrates how to generate text based on an image with Pipeline or the AutoModel class.

```py
from transformers import pipeline

pipe = pipeline(
    task="image-text-to-text",
    model="baidu/ERNIE-4.5-VL-28B-A3B-PT",
    device_map="auto",
    revision="refs/pr/10",
)
message = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What kind of dog is this?"},
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
            },
        ],
    }
]
print(pipe(text=message, max_new_tokens=20, return_full_text=False))
```

```py
from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "baidu/ERNIE-4.5-VL-28B-A3B-PT",
    dtype="auto",
    device_map="auto",  # Use tp_plan="auto" instead to enable Tensor Parallelism!
    revision="refs/pr/10",
)
processor = AutoProcessor.from_pretrained(
    "baidu/ERNIE-4.5-VL-28B-A3B-PT",
    # use_fast=False,  # closer to the original implementation, at the cost of speed
    revision="refs/pr/10",
)

message = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What kind of dog is this?"},
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
            },
        ],
    }
]
inputs = processor.apply_chat_template(
    message,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so only the newly generated text is decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
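
The tags at the top of this page mention SDPA, FlashAttention, and tensor parallelism. The sketch below shows how these are typically enabled through standard from_pretrained keyword arguments; exact availability depends on your Transformers version, the flash-attn package, and your hardware.

```py
from transformers import AutoModelForImageTextToText

# Attention backend: "sdpa" is the default on recent versions; use
# "flash_attention_2" with the flash-attn package and a supported GPU.
model = AutoModelForImageTextToText.from_pretrained(
    "baidu/ERNIE-4.5-VL-28B-A3B-PT",
    dtype="auto",
    attn_implementation="sdpa",
    device_map="auto",
    revision="refs/pr/10",
)

# Tensor parallelism: replace device_map with tp_plan and launch with torchrun,
# e.g. `torchrun --nproc-per-node 4 script.py`.
# model = AutoModelForImageTextToText.from_pretrained(
#     "baidu/ERNIE-4.5-VL-28B-A3B-PT",
#     dtype="auto",
#     tp_plan="auto",
#     revision="refs/pr/10",
# )
```

With tp_plan, the model's layers are sharded across all participating GPUs, which is why device_map is omitted in that variant.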

Using Ernie 4.5 VL MoE with video input is similar to using it with image input. The model can process video data and generate text based on the content of the video.

```py
from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "baidu/ERNIE-4.5-VL-28B-A3B-PT",
    dtype="auto",
    device_map="auto",  # Use tp_plan="auto" instead to enable Tensor Parallelism!
    revision="refs/pr/10",
)
processor = AutoProcessor.from_pretrained("baidu/ERNIE-4.5-VL-28B-A3B-PT", revision="refs/pr/10")

message = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Please describe what you can see during this video."},
            {
                "type": "video",
                "url": "https://huggingface.co/datasets/raushan-testing-hf/videos-test/resolve/main/tiny_video.mp4",
            },
        ],
    }
]
inputs = processor.apply_chat_template(
    message,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

Ernie4_5_VL_MoeConfig

[[autodoc]] Ernie4_5_VL_MoeConfig

Ernie4_5_VL_MoeTextConfig

[[autodoc]] Ernie4_5_VL_MoeTextConfig

Ernie4_5_VL_MoeVisionConfig

[[autodoc]] Ernie4_5_VL_MoeVisionConfig

Ernie4_5_VL_MoeImageProcessor

[[autodoc]] Ernie4_5_VL_MoeImageProcessor - preprocess

Ernie4_5_VL_MoeImageProcessorFast

[[autodoc]] Ernie4_5_VL_MoeImageProcessorFast - preprocess

Ernie4_5_VL_MoeVideoProcessor

[[autodoc]] Ernie4_5_VL_MoeVideoProcessor - preprocess

Ernie4_5_VL_MoeProcessor

[[autodoc]] Ernie4_5_VL_MoeProcessor

Ernie4_5_VL_MoeTextModel

[[autodoc]] Ernie4_5_VL_MoeTextModel - forward

Ernie4_5_VL_MoeVisionTransformerPretrainedModel

[[autodoc]] Ernie4_5_VL_MoeVisionTransformerPretrainedModel - forward

Ernie4_5_VL_MoeVariableResolutionResamplerModel

[[autodoc]] Ernie4_5_VL_MoeVariableResolutionResamplerModel - forward

Ernie4_5_VL_MoeModel

[[autodoc]] Ernie4_5_VL_MoeModel - forward

Ernie4_5_VL_MoeForConditionalGeneration

[[autodoc]] Ernie4_5_VL_MoeForConditionalGeneration - forward