Phi4 Multimodal
This model was released on 2025-03-03 and added to Hugging Face Transformers on 2025-03-25.
Phi4 Multimodal is a multimodal model capable of processing text, image, and audio inputs, or any combination of these. It features a mixture of LoRA adapters for handling the different modalities, and each input is routed to the appropriate encoder.
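The modality-specific adapters ship as subfolders of the checkpoint and are loaded and activated explicitly, as the full examples below show end to end; the snippet here is only a minimal sketch of that routing pattern.

```python
import torch
from transformers import AutoModelForCausalLM

model_path = "microsoft/Phi-4-multimodal-instruct"
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", dtype=torch.float16)

# Each modality has its own LoRA adapter stored in a subfolder of the checkpoint.
model.load_adapter(model_path, adapter_name="vision", adapter_kwargs={"subfolder": "vision-lora"})
model.load_adapter(model_path, adapter_name="speech", adapter_kwargs={"subfolder": "speech-lora"})

# Activate the adapter that matches the input modality before generating,
# e.g. "vision" for image + text inputs or "speech" for audio + text inputs.
model.set_adapter("vision")
```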
You can find all the original Phi4 Multimodal checkpoints under the Phi4 collection.
Click on the Phi-4 Multimodal models in the right sidebar for more examples of how to apply Phi-4 Multimodal to different tasks.
The examples below demonstrate how to generate text with Pipeline and how to generate text from an image with the AutoModel class.
```python
from transformers import pipeline

generator = pipeline("text-generation", model="microsoft/Phi-4-multimodal-instruct", dtype="auto", device=0)
prompt = "Explain the concept of multimodal AI in simple terms."
result = generator(prompt, max_length=50)
print(result[0]['generated_text'])
```
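The Pipeline call above only handles a plain text prompt. To pass an image through a pipeline as well, the `image-text-to-text` task can be used instead; this is a minimal sketch, and it assumes the task resolves this checkpoint and its vision adapter correctly, which the original examples do not show.

```python
from transformers import pipeline

# Assumption: the image-text-to-text pipeline task handles this checkpoint's
# image inputs end to end; verify against the model card before relying on it.
pipe = pipeline("image-text-to-text", model="microsoft/Phi-4-multimodal-instruct", dtype="auto", device=0)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]
result = pipe(text=messages, max_new_tokens=50)
print(result[0]["generated_text"])  # includes the generated assistant reply
```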
model_path = "microsoft/Phi-4-multimodal-instruct"device = Accelerator().device
processor = AutoProcessor.from_pretrained(model_path)model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device, dtype=torch.float16)
model.load_adapter(model_path, adapter_name="vision", device_map=device, adapter_kwargs={"subfolder": 'vision-lora'})
messages = [ { "role": "user", "content": [ {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"}, {"type": "text", "text": "What is shown in this image?"}, ], },]
model.set_adapter("vision")inputs = processor.apply_chat_template( messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt",).to(model.device)
generate_ids = model.generate( **inputs, max_new_tokens=1000, do_sample=False,)generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]response = processor.batch_decode( generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]print(f'>>> Response\n{response}')The example below demonstrates inference with an audio and text input.
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from accelerate import Accelerator

model_path = "microsoft/Phi-4-multimodal-instruct"
device = Accelerator().device

processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map=device, dtype=torch.float16)
model.load_adapter(model_path, adapter_name="speech", device_map=device, adapter_kwargs={"subfolder": 'speech-lora'})
model.set_adapter("speech")

audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "url": audio_url},
            {"type": "text", "text": "Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation."},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    do_sample=False,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
```
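For audio stored on disk rather than at a URL, the chat template content entry can point at a local file instead. This is a minimal sketch reusing the processor and model from the example above; the "path" key and the file name are assumptions, so check the processor's chat template documentation for the exact field it expects.

```python
# Hypothetical local file; replace with a real recording on disk.
# The "path" key mirrors the "url" usage above and is an assumption here.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "path": "my_recording.flac"},
            {"type": "text", "text": "Transcribe the audio to text."},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
```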
Phi4MultimodalFeatureExtractor
[[autodoc]] Phi4MultimodalFeatureExtractor
Phi4MultimodalImageProcessorFast
[[autodoc]] Phi4MultimodalImageProcessorFast
Phi4MultimodalProcessor
[[autodoc]] Phi4MultimodalProcessor
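The processor combines the image processor, the audio feature extractor, and the tokenizer behind one interface. A minimal sketch of loading it by its concrete class rather than through AutoProcessor, assuming the checkpoint used in the examples above:

```python
from transformers import Phi4MultimodalProcessor

# Equivalent to the AutoProcessor call in the examples above, using the concrete class.
processor = Phi4MultimodalProcessor.from_pretrained("microsoft/Phi-4-multimodal-instruct")
print(processor)  # shows the wrapped preprocessors and tokenizer
```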
Phi4MultimodalAudioConfig
[[autodoc]] Phi4MultimodalAudioConfig
Phi4MultimodalVisionConfig
[[autodoc]] Phi4MultimodalVisionConfig
Phi4MultimodalConfig
[[autodoc]] Phi4MultimodalConfig
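A minimal sketch of building a model from a configuration; this creates a randomly initialized model with default hyperparameters, not the released checkpoint:

```python
from transformers import Phi4MultimodalConfig, Phi4MultimodalForCausalLM

config = Phi4MultimodalConfig()            # default text, vision, and audio sub-configurations
model = Phi4MultimodalForCausalLM(config)  # randomly initialized weights
```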
Phi4MultimodalAudioModel
[[autodoc]] Phi4MultimodalAudioModel
Phi4MultimodalVisionModel
[[autodoc]] Phi4MultimodalVisionModel
Phi4MultimodalModel
[[autodoc]] Phi4MultimodalModel
    - forward
Phi4MultimodalForCausalLM
[[autodoc]] Phi4MultimodalForCausalLM
    - forward