
GPT-2

This model was released on 2019-02-14 and added to Hugging Face Transformers on 2020-11-16.

PyTorch FlashAttention SDPA

GPT-2 is a scaled-up version of GPT, a causal transformer language model, with 10x more parameters and training data. The model was pretrained on a 40GB dataset to predict the next word in a sequence based on all the previous words. This approach enabled the model to perform many downstream tasks in a zero-shot setting. OpenAI's announcement blog post provides additional background.

The model architecture uses a unidirectional (causal) attention mechanism where each token can only attend to previous tokens, making it particularly effective for text generation tasks.
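To make the causal constraint concrete, the short sketch below builds the lower-triangular mask that this kind of attention applies to its score matrix before the softmax. It is an illustrative toy, not the internals of the Transformers implementation; the tensor names are made up for the example.

import torch

seq_len = 5
# Lower-triangular mask: position i may attend to positions 0..i only
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

# Toy attention scores for a single head, masked before the softmax
scores = torch.randn(seq_len, seq_len)
scores = scores.masked_fill(~causal_mask, float("-inf"))
attn_weights = torch.softmax(scores, dim=-1)
print(attn_weights)  # row i has zero weight on every position j > i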

You can find all the original GPT-2 checkpoints under the OpenAI community organization.

The examples below demonstrate how to generate text with Pipeline, with AutoModel, and from the command line.

import torch
from transformers import pipeline

# Text-generation pipeline in half precision on the first GPU
generator = pipeline(task="text-generation", model="openai-community/gpt2", dtype=torch.float16, device=0)
generator("Hello, I'm a language model")

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model in half precision with the SDPA attention implementation
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2", dtype=torch.float16, device_map="auto", attn_implementation="sdpa")
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

# Tokenize the prompt and move it to the model's device before generating
inputs = tokenizer("Hello, I'm a language model", return_tensors="pt").to(model.device)
output = model.generate(**inputs, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
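The cache_implementation="static" argument preallocates the key-value cache to a fixed shape. Assuming a recent Transformers release, a static cache also lets you compile the decoding step with torch.compile; the continuation below is a minimal sketch of that pattern, reusing the model, tokenizer, and inputs from the example above.

# Optional: with a static cache the forward pass can be compiled (sketch)
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
output = model.generate(**inputs, cache_implementation="static", max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))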
echo -e "Hello, I'm a language model" | transformers run --task text-generation --model openai-community/gpt2 --device 0

One can also serve the model using vLLM with the transformers backend.

vllm serve openai-community/gpt2 --model-impl transformers
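Once the server is running, vLLM exposes an OpenAI-compatible HTTP API (by default on localhost port 8000). The snippet below is a sketch of how to query it; it assumes the requests package is installed and that the server uses the default host and port.

import requests

# Query the OpenAI-compatible completions endpoint started by vllm serve
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "openai-community/gpt2",
        "prompt": "Hello, I'm a language model",
        "max_tokens": 50,
    },
)
print(response.json()["choices"][0]["text"])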

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.

The example below uses bitsandbytes to quantize only the weights to 4 bits.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Quantize the weights to 4-bit NF4 with double quantization; compute in float16
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "openai-community/gpt2-xl",
    quantization_config=quantization_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2-xl")
inputs = tokenizer("Once upon a time, there was a magical forest", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

[[autodoc]] GPT2Config
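As a quick illustration of the configuration class, the sketch below instantiates a small GPT-2 configuration and builds a randomly initialized model from it; the hyperparameter values are arbitrary and chosen only for the example.

from transformers import GPT2Config, GPT2LMHeadModel

# A small, randomly initialized GPT-2 variant; the sizes here are illustrative
config = GPT2Config(n_layer=6, n_head=8, n_embd=512, n_positions=1024)
model = GPT2LMHeadModel(config)
print(sum(p.numel() for p in model.parameters()))  # rough parameter count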

[[autodoc]] GPT2Tokenizer - save_vocabulary

[[autodoc]] GPT2TokenizerFast

[[autodoc]] models.gpt2.modeling_gpt2.GPT2DoubleHeadsModelOutput

[[autodoc]] GPT2Model - forward

[[autodoc]] GPT2LMHeadModel - forward

[[autodoc]] GPT2DoubleHeadsModel - forward

[[autodoc]] GPT2ForQuestionAnswering - forward

[[autodoc]] GPT2ForSequenceClassification - forward

[[autodoc]] GPT2ForTokenClassification - forward