Aditya Ramesh: The Inventor Of AI Text-To-Visual Tool Dall-E Has Indian Origins

Dall-E represents a progression from a notion initially introduced by OpenAI in June 2020, originally referred to as Image GPT.

I’m sure you remember the universally beloved science fiction movie The Matrix (1999), where the reality we live in is a programmatically generated world. Well, yesterday’s science fiction is today’s reality. Just as the programmer from The Matrix could conjure up a taekwondo dojo or a busy New York street by typing in a few lines of code, you too can conjure up just about anything these days by combining the power of your imagination, language and Artificial Intelligence. But at exactly which point did such science fiction become reality?

Introducing Aditya Ramesh, a homegrown tech wizard, creator of DALL·E and co-creator of DALL·E 2. The name DALL·E is inspired by the Spanish surrealist artist Salvador Dalí and the famous robot from the Disney-Pixar sci-fi film WALL-E. The name evokes the universal merging of art and technology. Ramesh introduced DALL·E to the world in January 2021, in collaboration with AI research and deployment company OpenAI. The technology builds on the GPT-3 large language model, using deep learning to comprehend user prompts expressed in natural language and produce novel images.

Dall-E represents a progression from a notion initially introduced by OpenAI in June 2020, originally referred to as Image GPT. This early endeavour aimed to showcase the potential of a neural network in generating high-quality images. With the development of Dall-E, OpenAI expanded upon the foundational concept of Image GPT, allowing users to generate fresh images based on textual prompts, much like how GPT-3 generates new text in response to natural language inputs.

Dall-E 2 was released in April 2022, representing an advancement over the original Dall-E. The original Dall-E used a dVAE (discrete variational autoencoder) for image generation. Dall-E 2, by contrast, pairs CLIP with a diffusion model, an approach capable of generating images of even higher quality. Diffusion models were a game-changer for Dall-E 2 as well as for contemporaries such as the open-source Stable Diffusion and Midjourney. According to OpenAI, Dall-E 2 images have four times the resolution of those created with Dall-E. Moreover, Dall-E 2 exhibits notable improvements in speed and image size compared to its predecessor, enabling users to generate larger images at a faster pace.

Illustration of the CLIP model (Towards Data Science)

CLIP consists of two neural networks: a text encoder and an image encoder. Through training on a vast collection of image-text pairs, these encoders map inputs to embeddings in a shared "concept space." During training, CLIP receives image-caption pairs and forms matching pairs (image with corresponding caption) and mismatching pairs (image with any other caption). The objective is to train the encoders to map matching pairs close together and mismatching pairs far apart. This contrastive training encourages CLIP to learn various image features, such as objects, aesthetics, colors, and materials. However, CLIP may struggle to differentiate between images with swapped object positions, as it focuses on matching captions rather than preserving positional information.
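For the technically curious, here is a rough PyTorch sketch of that contrastive objective. The encoders themselves (an image network and a text network) are left out, and every name in the snippet is illustrative rather than OpenAI's actual code; it only shows how matching image-caption pairs are pulled together and mismatching ones pushed apart.

```python
# Illustrative sketch of a CLIP-style contrastive loss (not OpenAI's code).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Pull matching image-caption pairs together, push mismatches apart."""
    # Normalise embeddings so similarity is the cosine of the angle between them.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Similarity matrix: entry (i, j) compares image i with caption j.
    logits = image_embeds @ text_embeds.t() / temperature

    # The matching caption for image i sits on the diagonal, at index i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: pick the right caption for each image
    # and the right image for each caption.
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
```

In the real model the temperature is a learned parameter and training runs over hundreds of millions of image-text pairs, but the shape of the objective is the same.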

Illustration of the process used to generate a new image with the diffusion model (Alex Nichol)

"A diffusion model is trained to undo the steps of a fixed corruption process. Each step of the corruption process adds a small amount of noise. Specifically, gaussian noise. to an image, which erases some of the information in it. After the final step, the image becomes indistinguishable from pure noise. The diffusion model is trained to reverse this process, and in doing so learns to regenerate what might have been erased in each step. To generate an image from scratch, we start with pure noise and suppose that it was the end result of the corruption process applied to a real image. Then, we repeatedly apply the model to reverse each step of this hypothetical corruption process. This gradually makes the image more and more realistic, eventually yielding a pristine, noiseless image."

Aditya Ramesh
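Here is a minimal sketch of the two halves Ramesh describes, written against the standard DDPM formulation rather than DALL·E 2's actual code. The noise schedule betas is assumed to be a 1-D tensor of small per-step variances, and the denoiser network is a placeholder for whatever model has been trained to predict the added noise.

```python
# Toy sketch of the forward (corruption) and reverse (generation) halves of a
# diffusion model. The schedule and denoiser are placeholders, not DALL-E 2.
import torch

def corrupt(image, t, betas):
    """Forward process: jump straight to step t by adding Gaussian noise."""
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]
    noise = torch.randn_like(image)
    noisy = alpha_bar.sqrt() * image + (1.0 - alpha_bar).sqrt() * noise
    return noisy, noise  # the denoiser is trained to recover `noise` from `noisy`

@torch.no_grad()
def generate(denoiser, shape, betas):
    """Reverse process: start from pure noise and undo one step at a time."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)  # pretend this is the fully corrupted end state
    for t in reversed(range(len(betas))):
        predicted_noise = denoiser(x, t)  # the model's guess at what was erased
        x = (x - betas[t] / (1.0 - alpha_bars[t]).sqrt() * predicted_noise) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # keep a little noise until the last step
    return x  # gradually becomes a realistic, noiseless image
```

Each pass through the loop makes the image slightly more plausible, which is exactly the "gradually makes the image more and more realistic" behaviour described in the quote.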

DALL-E 2 also has additional capabilities that its predecessor lacked:

Inpainting: It performs edits to an image using language (see the API sketch after this list).

Example of inpainting; artwork titled 'Outpainting: an apocalyptic Mona Lisa' (tonidl1989)

Variations: It generates new images that share the same essence as a given reference image, but differ in how the details are put together.

Variations from DALL·E 2 on a blackboard doodle by Lei Pan. The original doodle is in the center, and the generated variations are displayed around it. (Aditya Ramesh)

Text diffs: It transforms any aspect of an image using language.

Animation of a text diff used to transform a Victorian house into a modern one. The transformation is determined by the captions “a victorian house”, which describes the architecture of the house, and “a modern house”, which describes how the architecture should be changed. (Aditya Ramesh)
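For readers who want to try the first two capabilities themselves, the snippet below shows how inpainting and variations are typically invoked through OpenAI's Python SDK. The image files and the prompt are placeholders, and the exact SDK surface may differ between library versions, so treat it as a sketch rather than a recipe.

```python
# Rough sketch of DALL-E inpainting and variations via OpenAI's Python SDK.
# File names and prompt are placeholders; check the current SDK docs for details.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Inpainting: the transparent region of the mask is redrawn to match the prompt.
edit = client.images.edit(
    image=open("mona_lisa.png", "rb"),
    mask=open("mask.png", "rb"),
    prompt="an apocalyptic, ruined landscape behind the subject",
    n=1,
    size="1024x1024",
)

# Variations: new images that keep the essence of the reference image
# but rearrange the details.
variations = client.images.create_variation(
    image=open("blackboard_doodle.png", "rb"),
    n=4,
    size="1024x1024",
)

print(edit.data[0].url)
print([img.url for img in variations.data])
```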

"We felt like text-to-image generation was interesting because as humans, we’re able to construct a sentence to describe any situation that we might encounter in real life, but also fantastical situations or crazy scenarios that are impossible. So we wanted to see if we trained a model to just generate images from text well enough, whether it could do the same things that humans can as far as extrapolation. I knew that the technology was going to get to a point where it would be impactful to consumers and useful for many different applications, but I was still surprised by how quickly."

Aditya Ramesh, in an interview with Venture Beat

The foundational idea of DALL·E is to help artists. Just as Codex is a constant companion for programmers, Ramesh describes DALL·E as a “creative co-pilot” for artists. Like it or not, this is the future of creative work. While Ramesh’s research and execution have undoubtedly been ground-breaking, larger questions remain: Will all existing artists be able to adapt or fuse AI into their creative practice? Will traditional or purist art practices decline alongside the rise of AI art? And while AI certainly makes art more accessible, will it devalue the “skill” required to draw or paint?

Find out more about DALL·E here.
