OpenAI introduces DALL-E

On January 5, 2021, OpenAI introduced the first DALL-E, a system that creates images from natural-language captions. The accompanying research paper, “Zero-Shot Text-to-Image Generation” by Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever, was submitted to arXiv on February 24, 2021. The paper describes “a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data.”

DALL-E was built on the same family of ideas as GPT-3: a large transformer trained to predict the next token, except that here each sequence consisted of text tokens followed by image tokens. Per the paper, the image tokens were not raw pixels or patches: a discrete variational autoencoder first compressed each image into a 32 × 32 grid of discrete tokens. The model showed it could combine unrelated concepts in plausible ways, render objects from a written description, and apply transformations to existing images, all driven by plain language. It was announced the same day as CLIP, OpenAI’s companion model for connecting images and text.
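To make the "single stream of data" idea concrete, here is a minimal Python sketch of how a caption and an image could be packed into one token sequence for next-token prediction. It is an illustration, not OpenAI's code. The sizes (up to 256 text tokens, 1024 image tokens, vocabularies of 16,384 and 8,192) follow the figures reported in the paper; the function names, the id-offset convention, and the omission of padding are assumptions made for this example.

```python
# Illustrative sketch only; not OpenAI's implementation.
# Sizes follow the figures reported in the paper; the names and the
# offset convention below are hypothetical choices for this example.

TEXT_VOCAB = 16384    # size of the text (BPE) vocabulary
IMAGE_VOCAB = 8192    # size of the image-token codebook
MAX_TEXT = 256        # maximum number of text tokens per caption
IMAGE_TOKENS = 1024   # 32 x 32 grid of image tokens


def build_stream(text_tokens, image_tokens):
    """Concatenate caption tokens and image tokens into one sequence.

    Image ids are shifted by TEXT_VOCAB so the two vocabularies do not
    collide (an assumed convention). Padding of short captions is omitted.
    """
    if len(text_tokens) > MAX_TEXT:
        raise ValueError("caption is longer than MAX_TEXT tokens")
    if len(image_tokens) != IMAGE_TOKENS:
        raise ValueError("expected exactly IMAGE_TOKENS image tokens")
    if any(not 0 <= t < TEXT_VOCAB for t in text_tokens):
        raise ValueError("text token id out of range")
    if any(not 0 <= t < IMAGE_VOCAB for t in image_tokens):
        raise ValueError("image token id out of range")
    return list(text_tokens) + [TEXT_VOCAB + t for t in image_tokens]


def next_token_pairs(stream):
    """Yield (context, target) pairs: given stream[:i], predict stream[i]."""
    for i in range(1, len(stream)):
        yield stream[:i], stream[i]


# Usage: a made-up 3-token caption and a blank image of all-zero codes.
example_stream = build_stream([5, 9, 2], [0] * IMAGE_TOKENS)
example_pairs = list(next_token_pairs(example_stream))
```

Because the text and the image live in one sequence, the same autoregressive objective covers both: at generation time the caption tokens serve as the prefix, and the model samples the image tokens one at a time after it.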

DALL-E marked the opening of the text-to-image era. It demonstrated that the next-token-prediction recipe behind large language models could extend to pictures, setting the stage for the wave of text-to-image tools that followed, including OpenAI’s own DALL-E 2 in 2022.

(Sourcing note: the arXiv paper is the fetchable Tier 1 primary. OpenAI’s announcement page openai.com/index/dall-e/ is the canonical product URL and is cited here, but openai.com blocks automated fetching; its January 5, 2021 date and framing were confirmed through multiple independent references to that page.)

Sources

Related