# PE Audio (Perception Encoder Audio)

This model was released on {release_date} and added to Hugging Face Transformers on 2025-12-16.

## Overview

PE Audio (Perception Encoder Audio) is a state-of-the-art multimodal model that embeds audio and text into a shared (joint) embedding space, enabling cross-modal retrieval and understanding between audio and text.
**Text input**

- Produces a single embedding representing the full text.

**Audio input**

- `PeAudioFrameLevelModel`
  - Produces a sequence of embeddings, one every 40 ms of audio.
  - Suitable for audio event localization and fine-grained temporal analysis.
- `PeAudioModel`
  - Produces a single embedding for the entire audio clip.
  - Suitable for global audio-text retrieval tasks.
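To make the 40 ms frame rate concrete, here is a small sketch (plain Python, independent of the actual model classes; the function name is illustrative, not part of the API) of how long a frame-level embedding sequence is for a given clip:

```python
# Illustrative sketch: PeAudioFrameLevelModel emits one embedding per 40 ms
# of audio, so a clip's embedding-sequence length is roughly duration / 0.040.
FRAME_STRIDE_S = 0.040  # 40 ms per frame-level embedding

def num_frame_embeddings(duration_s: float) -> int:
    """Approximate number of frame-level embeddings for a clip (hypothetical helper)."""
    # round() avoids floating-point truncation artifacts from int().
    return round(duration_s / FRAME_STRIDE_S)

# A 10-second clip yields about 250 frame-level embeddings.
print(num_frame_embeddings(10.0))  # 250
```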
The resulting embeddings can be used for:
- Audio event localization
- Cross-modal (audio–text) retrieval and matching
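The retrieval use case can be illustrated with plain NumPy. The embeddings below are random placeholders standing in for real PE Audio outputs (the dimensions and data are assumptions); the point is only that retrieval in a joint space reduces to a cosine-similarity ranking:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder embeddings standing in for real PE Audio outputs:
# 3 audio clips and 3 captions, each a 512-dim vector in the joint space.
audio_emb = rng.normal(size=(3, 512))
text_emb = audio_emb + 0.1 * rng.normal(size=(3, 512))  # each caption lies near its clip

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

audio_emb = l2_normalize(audio_emb)
text_emb = l2_normalize(text_emb)

# Cosine-similarity matrix: entry (i, j) compares audio clip i with caption j.
sim = audio_emb @ text_emb.T

# Text-to-audio retrieval: best-matching clip for each caption.
best_clip = sim.argmax(axis=0)
print(best_clip)  # with this toy data, each caption matches its own clip: [0 1 2]
```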
## Basic usage

TODO

## PeAudioFeatureExtractor
[[autodoc]] PeAudioFeatureExtractor
    - __call__
## PeAudioProcessor

[[autodoc]] PeAudioProcessor
    - __call__
## PeAudioConfig

[[autodoc]] PeAudioConfig
## PeAudioEncoderConfig

[[autodoc]] PeAudioEncoderConfig
## PeAudioEncoder

[[autodoc]] PeAudioEncoder
    - forward
## PeAudioFrameLevelModel

[[autodoc]] PeAudioFrameLevelModel
    - forward
## PeAudioModel

[[autodoc]] PeAudioModel
    - forward