I have written here before about “Photorealistic Image Synthesis from Text Prompts”, describing GLIDE, a publicly accessible system that creates custom images from text descriptions. Its creators, OpenAI, have now released the system toward which GLIDE was building, DALL-E 2, on a limited basis to researchers who request access via a waiting list. (Fourmilab has applied for access, but has not yet been granted it. You can apply for the waiting list via the previous link.)
The technology is described in the paper “Hierarchical Text-Conditional Image Generation with CLIP Latents”, which contains numerous examples of DALL-E 2 output, based upon a model with 3.5 billion parameters.
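As the paper's title suggests, generation proceeds in stages through CLIP's embedding space: a "prior" maps the text's CLIP embedding to a CLIP image embedding, and a diffusion decoder then renders an image conditioned on that embedding. The pipeline can be sketched very loosely as follows; every function, dimension, and value here is an illustrative stub of my own invention, not OpenAI's code or API:

```python
# Loose sketch of DALL-E 2's two-stage pipeline from
# "Hierarchical Text-Conditional Image Generation with CLIP Latents".
# All functions below are illustrative stubs, not the real models.
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 512  # dimensionality of the shared CLIP embedding space (assumed)

def clip_text_encoder(prompt: str) -> np.ndarray:
    """Stub: map a text prompt to a CLIP text embedding."""
    seed = abs(hash(prompt)) % (2**32)
    return np.random.default_rng(seed).standard_normal(EMBED_DIM)

def prior(text_embedding: np.ndarray) -> np.ndarray:
    """Stub for the prior: predict a CLIP *image* embedding from the
    text embedding (the paper implements this as a diffusion model)."""
    return text_embedding + 0.1 * rng.standard_normal(EMBED_DIM)

def decoder(image_embedding: np.ndarray) -> np.ndarray:
    """Stub for the diffusion decoder: generate an image conditioned on
    the CLIP image embedding. Here it just returns random pixels."""
    return rng.random((64, 64, 3))

def generate(prompt: str) -> np.ndarray:
    t = clip_text_encoder(prompt)  # encode the text with CLIP
    z = prior(t)                   # text embedding -> image embedding
    return decoder(z)              # image embedding -> image

image = generate("a corgi's head depicted as an explosion of a nebula")
print(image.shape)
```

Because generation passes through the image-embedding bottleneck, the same prompt can yield many distinct images by sampling the prior and decoder repeatedly, which is why the paper shows multiple random samples per prompt.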
Even by the standards of Roaring Twenties artificial intelligence milestones, the results are stunning. Here are a few examples from the paper. For each, I show the exact text prompt that elicited the image.
“panda mad scientist mixing sparkling chemicals, artstation”
“vibrant portrait painting of Salvador Dalí with a robotic half face”
“an espresso machine that makes coffee from human souls, artstation”
“a corgi’s head depicted as an explosion of a nebula”
“A teddybear on a skateboard in Times Square” (Random samples of multiple outputs)
Here is OpenAI’s video description of DALL-E 2.