# GraniteMoeHybrid
This model was released on 2025-05-02 and added to Hugging Face Transformers on 2025-05-05.
## Overview

The GraniteMoeHybrid model builds on top of GraniteMoeSharedModel and Bamba. Its decoding layers consist of state space layers or MoE attention layers with shared experts. By default, the attention layers do not use positional encoding.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ibm-granite/granite-4.0-tiny-preview"
tokenizer = AutoTokenizer.from_pretrained(model_path)

# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
model.eval()

# change input text as desired
prompt = "Write a code to find the maximum value in a list of numbers."

# tokenize the text and move it to the model's device
input_tokens = tokenizer(prompt, return_tensors="pt").to(model.device)

# generate output tokens
output = model.generate(**input_tokens, max_new_tokens=100)

# decode output tokens into text
output = tokenizer.batch_decode(output)

# loop over the batch to print; in this example the batch size is 1
for i in output:
    print(i)
```

This HF implementation is contributed by Sukriti Sharma and Alexander Brooks.
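The mix of state space and MoE attention layers described in the overview is recorded in the model configuration. The snippet below is a minimal sketch for inspecting it; the `layer_types`/`layers_block_type` attribute names are assumptions that may vary between versions, so the code falls back to `None` if neither exists.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("ibm-granite/granite-4.0-tiny-preview")

# Total number of decoder layers (standard config attribute).
print(config.num_hidden_layers)

# Per-layer block type (state space vs. attention); the attribute names below
# are assumptions and may differ between transformers versions.
print(getattr(config, "layer_types", getattr(config, "layers_block_type", None)))
```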
- `GraniteMoeHybridForCausalLM` supports padding-free training, which concatenates distinct training examples while still processing inputs as separate batches. It can significantly accelerate inference by ~2x (depending on the model and data distribution) and reduce memory usage when examples have varying lengths, by avoiding the unnecessary compute and memory overhead of padding tokens.

  Padding-free training requires the `flash-attn`, `mamba-ssm`, and `causal-conv1d` packages, and the following arguments must be passed to the model in addition to `input_ids` and `labels`:

  - `position_ids: torch.LongTensor`: the position index of each token in each sequence.
  - `seq_idx: torch.IntTensor`: the index of each sequence in the batch.
  - Each of the `FlashAttentionKwargs`:
    - `cu_seq_lens_q: torch.LongTensor`: the cumulative sequence lengths of all queries.
    - `cu_seq_lens_k: torch.LongTensor`: the cumulative sequence lengths of all keys.
    - `max_length_q: int`: the longest query length in the batch.
    - `max_length_k: int`: the longest key length in the batch.

  The `attention_mask` inputs should not be provided. The `DataCollatorWithFlattening` programmatically generates the set of additional arguments above when created with `return_seq_idx=True` and `return_flash_attn_kwargs=True` (see also the sketch after this note). Refer to the Improving Hugging Face Training Efficiency Through Packing with Flash Attention blog post for additional information.

  ```python
  from transformers import DataCollatorWithFlattening

  # Example of using padding-free training
  data_collator = DataCollatorWithFlattening(
      tokenizer=tokenizer,
      return_seq_idx=True,
      return_flash_attn_kwargs=True,
  )
  ```
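As a rough illustration of what the collator produces, the sketch below builds a flattened batch from two dummy tokenized examples and prints the resulting keys. The token ids are placeholder values, and the exact set of returned keys is assumed to match the argument list above.

```python
from transformers import DataCollatorWithFlattening

# Two toy examples with dummy token ids, already tokenized to different lengths.
features = [
    {"input_ids": [101, 2023, 2003, 102]},
    {"input_ids": [101, 1037, 2936, 2742, 2005, 24977, 102]},
]

collator = DataCollatorWithFlattening(
    return_seq_idx=True,
    return_flash_attn_kwargs=True,
)
batch = collator(features)

# Expect input_ids, labels, position_ids, seq_idx, cu_seq_lens_q/k and
# max_length_q/k; attention_mask is intentionally absent.
print(sorted(batch.keys()))
```

In a typical setup, the configured collator is passed as the `data_collator` argument of `Trainer` so each training step receives the flattened inputs and Flash Attention kwargs directly.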
## GraniteMoeHybridConfig

[[autodoc]] GraniteMoeHybridConfig
## GraniteMoeHybridModel

[[autodoc]] GraniteMoeHybridModel
    - forward
## GraniteMoeHybridForCausalLM

[[autodoc]] GraniteMoeHybridForCausalLM
    - forward