DALL-E is OpenAI’s line of text-to-image models, which generate pictures from natural-language descriptions. The family helped open the modern generative-image era; over several generations, OpenAI then folded image generation back into its flagship multimodal models.
The original DALL-E (January 2021) is documented in the paper “Zero-Shot Text-to-Image Generation” (arXiv 2102.12092), which describes “a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data”. This is the same next-token-prediction recipe behind GPT, applied to a sequence in which the text tokens are followed by image tokens. DALL-E 2 (April 2022) is documented in “Hierarchical Text-Conditional Image Generation with CLIP Latents” (arXiv 2204.06125), which pairs OpenAI’s CLIP image-text model with a diffusion-based generator to produce higher-resolution, more coherent images and to edit existing ones. DALL-E 3 followed in September 2023, with much tighter prompt adherence and integration into ChatGPT.
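The single-stream idea quoted above from the original DALL-E paper can be illustrated with a minimal sketch. It is not the paper's code: the vocabulary sizes are toy values, the function names are hypothetical, and the image tokenizer and the transformer itself are left out. It only shows how text and image tokens can share one sequence, and how that sequence yields ordinary next-token-prediction training pairs.

```python
from typing import List, Tuple

# Toy vocabulary sizes, chosen for illustration only (assumption, not the paper's values).
TEXT_VOCAB = 16
IMAGE_VOCAB = 8


def build_stream(text_tokens: List[int], image_tokens: List[int]) -> List[int]:
    """Join text and image tokens into one sequence.

    Image ids are shifted by TEXT_VOCAB so the two vocabularies
    occupy disjoint id ranges within a single shared vocabulary.
    """
    if any(not 0 <= t < TEXT_VOCAB for t in text_tokens):
        raise ValueError("text token id out of range")
    if any(not 0 <= i < IMAGE_VOCAB for i in image_tokens):
        raise ValueError("image token id out of range")
    return list(text_tokens) + [TEXT_VOCAB + i for i in image_tokens]


def next_token_pairs(stream: List[int]) -> List[Tuple[List[int], int]]:
    """Return (context, target) pairs: each token is predicted from all tokens before it."""
    return [(stream[:k], stream[k]) for k in range(1, len(stream))]


if __name__ == "__main__":
    stream = build_stream([3, 5], [0, 7])
    for context, target in next_token_pairs(stream):
        kind = "image" if target >= TEXT_VOCAB else "text"
        print(f"context={context} -> target={target} ({kind})")
```

At generation time the same setup runs in reverse: the model is given the text tokens as context and samples the image tokens one at a time, after which a separate decoder turns those tokens back into pixels.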
The succession is the notable part. Rather than maintaining a standalone image model indefinitely, OpenAI moved image generation into its flagship multimodal model: GPT-4o gained native image generation, so a single model both converses and renders images instead of calling out to a separate system. This collapsed the boundary between “the chat model” and “the image model”, and the line has continued in that direction. Because OpenAI’s product lineup and the boundary between DALL-E branding and GPT-native image generation shift over time, the live OpenAI site is the reference for what is currently offered; this entry records the historical papers and dates rather than a current product list.
Distribution is through OpenAI’s APIs and ChatGPT rather than open weights.
Note on sourcing: the arXiv papers are the fetchable Tier 1 primaries. OpenAI’s announcement pages (openai.com/index/dall-e/ and related) are the canonical product URLs and are cited, but openai.com blocks automated fetching, so their dates and framing were confirmed against the existing milestone entries, which were verified the same day.
Why business readers should care: the DALL-E story shows the industry consolidating modalities. What began as a separate image model became a feature of one general model, and buyers should expect the same pattern across vision, audio, and video.