We touched upon generative AI while talking about an AI-assisted content pipeline for XR at Thoughtworks XConf 2022. Generative AI is capable of producing outputs that closely resemble those created by humans. It is evolving in multiple forms; one of them is ChatGPT, a useful tool buried beneath the hype. It was touted as a Google killer and has disrupted the way we search, the way we communicate with machines and the way machines respond. The focus is now shifting toward building systems and tools that generate the right inputs, called prompts, to get acceptable responses from these generative AI systems.
ChatGPT and its rival Google Bard are based on large language models (LLMs). While they aim to revolutionize the text-based communication space, there have been advancements in text-to-image latent diffusion models (LDMs) that enable another disruption in the 3D content generation space. For example, DALL-E from OpenAI and Stable Diffusion from Stability AI open up great possibilities in hitherto untapped areas.
These LDMs are capable of converting natural language inputs into outputs ranging from photorealistic images to ultra-realistic art. The natural language inputs are referred to as “prompts”, and crafting them with the nomenclature that the model’s latent space represents is known as “prompt engineering”.
LDM’s building blocks
An LDM’s building blocks are training (or learning) and inference.
In the learning phase, noise drawn from a mathematical model such as a Gaussian distribution is progressively added to training images, each paired with a latent text description. This is known as the forward pass. From these noisy images, the neural network learns to predict the noise that was added, and in doing so it models what a latent space description may look like in the form of a very noisy image representation.
Figure 1: Forward pass → Photo blur to noise
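As a rough illustration of the forward pass, here is a minimal sketch that blends a few stand-in pixel values with Gaussian noise at two noise levels. The linear noise schedule, its start and end values, and the function names are assumptions chosen for illustration; they are not taken from any specific LDM. A real LDM also applies this to a compressed latent representation rather than to raw pixels.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal left after each step (assumed linear schedule)."""
    alpha_bars = []
    running = 1.0
    for step in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * step / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(pixels, alpha_bar, rng):
    """Blend clean values with Gaussian noise for one noise level."""
    signal = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [signal * p + noise_scale * n for p, n in zip(pixels, noise)]
    return noisy, noise


rng = random.Random(0)
alpha_bars = alpha_bar_schedule(1000)
image = [0.8, -0.2, 0.5, 0.1]  # stand-in for normalized pixel values
slightly_noisy, _ = add_noise(image, alpha_bars[0], rng)   # still close to the image
mostly_noise, _ = add_noise(image, alpha_bars[-1], rng)    # almost pure noise
```

During training, the returned `noise` is what the network would be asked to predict from the noisy input and the text description.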
In the inference phase, the learning is applied in a reverse pass. In the reverse pass, random RGB noise is generated and then denoised step by step using a sampling technique such as Euler or LPMS. In each step, the AI tries to bring the latent space description provided in the prompt into the image by denoising. Typically, within 10-15 steps the image will have most of the features described in the prompt. Every additional step brings more clarity and detail into the image.
Figure 2: Reverse pass → Noise to photo for the prompt “a red rose”
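The structure of the reverse pass can be sketched as a loop that starts from random noise and repeatedly applies a denoising update. The sketch below uses a deterministic DDIM-style step. The `oracle_noise` function is a hypothetical stand-in for the trained network: it simply knows the target values, whereas a real model would predict the noise from the noisy image and the prompt. The schedule values are likewise assumptions.

```python
import math
import random


def oracle_noise(x_t, alpha_bar_t, target):
    """Hypothetical stand-in for the trained network: it already knows the target."""
    signal = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [(x - signal * t) / noise_scale for x, t in zip(x_t, target)]


def ddim_step(x_t, noise, alpha_bar_t, alpha_bar_next):
    """Deterministic DDIM-style update to the next, less noisy level."""
    signal = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    # Estimate the clean image implied by the predicted noise.
    clean = [(x - noise_scale * n) / signal for x, n in zip(x_t, noise)]
    # Re-blend that estimate with the same noise at the lower noise level.
    next_signal = math.sqrt(alpha_bar_next)
    next_noise_scale = math.sqrt(1.0 - alpha_bar_next)
    return [next_signal * c + next_noise_scale * n for c, n in zip(clean, noise)]


def denoise(target, num_steps, rng):
    """Start from pure noise and walk the schedule up to a clean image."""
    # Signal fraction rises from 0.001 (almost pure noise) to exactly 1.0 (clean).
    alpha_bars = [0.001 + 0.999 * i / num_steps for i in range(num_steps)] + [1.0]
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for i in range(num_steps):
        noise = oracle_noise(x, alpha_bars[i], target)
        x = ddim_step(x, noise, alpha_bars[i], alpha_bars[i + 1])
    return x


target = [0.8, -0.2, 0.5]  # stand-in for the image the prompt describes
result = denoise(target, num_steps=15, rng=random.Random(0))
```

Because the stand-in predictor is exact, the loop lands on the target values; with a real network each step only moves the image closer to what the prompt describes.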
Knowing that, let’s see how we can generate 3D models with text inputs.
Creating 3D content with generative AI
Creating 3D models using LDMs involves four fundamental steps. Here, we discuss how you can optimize each step to generate the output you need.
Figure 3: Workflow for creating 3D content with LDMs
Prompt Engineering for Photography
Creating anything with generative AI begins with the text input, called a prompt. Writing clear and specific prompts will generate better outputs. To do that in LDMs, it’s important to bring best practices from photography to prompt engineering nomenclature.
In LDMs like Stable Diffusion, the latent space description is formed by tokens. A token can be a simple English word, a name or any number of technical parameters. Here is an indicative list of tokens you can use to produce good-looking images, with examples:
<subject/object description with pose> - An “Indian girl jumping”
<subject alignment> - An Indian girl jumping, “centered”
<detailing> - An Indian girl jumping in the middle “of a flower garden”
<lighting> - An Indian girl jumping in the middle of a flower garden “daylight, sun rays”
<resolution> - An Indian …, “Ultra high definition”
<camera angle> - An Indian …, “Aerial Shot”
<camera type> - An Indian …, “DSLR photo”
<lens parameters> - An Indian …, “f/1.4”
<composition> - An Indian … “at the edge of a lake”
<fashion> - An Indian … “south indian clothing”
<subject size> - An Indian … “close-up shot”
<studio setting> - An Indian … “Studio lighting”
<background> - An Indian …, “background rocky mountain range”
When used with the photography tokens, the following tokens can shape the artistic direction of the output:
<art medium>, <color palette>, <artist>, <stroke>, <mood>, <costume>, <make/material>
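To make the token idea concrete, here is a minimal sketch of assembling a prompt from a subject description and a handful of token categories. The `build_prompt` helper and its keyword names are hypothetical and purely illustrative; the resulting string is what would be passed to the model as the text prompt.

```python
def build_prompt(subject, **tokens):
    """Join a subject description and optional tokens into one comma-separated prompt."""
    parts = [subject.strip()]
    for value in tokens.values():
        if value:  # skip categories that were left empty
            parts.append(value.strip())
    return ", ".join(parts)


prompt = build_prompt(
    "An Indian girl jumping in the middle of a flower garden",
    alignment="centered",
    lighting="daylight, sun rays",
    resolution="ultra high definition",
    camera_angle="aerial shot",
    camera_type="DSLR photo",
    lens="f/1.4",
)
print(prompt)
```

Keeping each category as a separate argument makes it easy to iterate on one aspect, such as lighting or camera angle, while holding the rest of the prompt fixed.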
In addition to tokens and nomenclature, the generative AI model’s inference (of input) is influenced by the following four parameters:
- The text prompt
- The sampling method used for denoising (Euler, LPMS or DDIM)
- The number of steps (ranges from 20 to 200)
- The CFG (classifier-free guidance) scale, which sets the extent to which the AI adheres to the given text prompt. The lower the value, the more “freedom” the AI has to bring in elements further from the prompt.
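The effect of the CFG scale can be sketched with the commonly used classifier-free guidance formula: at each denoising step the model makes one noise prediction without the prompt and one with it, and the scale decides how far to push toward the prompt-conditioned prediction. The function name and the short lists standing in for noise predictions are illustrative assumptions.

```python
def apply_cfg(noise_uncond, noise_text, cfg_scale):
    """Combine prompt-free and prompt-conditioned noise predictions.

    A scale of 0 ignores the prompt, 1 uses the conditioned prediction as is,
    and larger values push the result further in the prompt's direction.
    """
    return [u + cfg_scale * (c - u) for u, c in zip(noise_uncond, noise_text)]


noise_uncond = [0.0, 1.0]  # stand-in prediction without the prompt
noise_text = [1.0, 3.0]    # stand-in prediction with the prompt
guided = apply_cfg(noise_uncond, noise_text, cfg_scale=7.5)
```

This is why low values give the AI more “freedom”: the guided prediction stays close to what the model would produce without any prompt at all.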
Here’s what the AI produced for the following prompt:
3d tiny isometric of a modern western living room in a (((cutaway box))), minimalist style, centered, warm colors, concept art, black background, 3d rendering, high resolution, sunlight, contrast, cinematic 8k, architectural rendering, trending on ArtStation, trending on CGSociety
Photo generated by DreamShaper (Stable Diffusion 1.5)
Once we have the photograph, the next step is to create continuous volumetric scenes from sparse sets of photographs/views. The following are a few approaches.
Neural Radiance Fields (NeRF) is a state-of-the-art technique for synthesizing novel views of a scene from a sparse set of input images. We used Google’s DreamFusion to initiate a NeRF model with a single photograph. Here’s a continuous rendering of a scene from a NeRF constructed from only a few photographs/views, with depth occlusion handled even in a complex scene.
DreamFusion takes NeRF as a building block and uses a mathematical process called probability density distillation to refine the initial 3D model formed from a single photograph. It uses gradient descent to adjust the 3D model until it fits the 2D image as closely as possible when viewed from random angles. Probability density distillation leverages the knowledge learned by the 2D image model to improve the creation of the 3D model. The output from this step is a point cloud.
Monocular depth sensing
Another approach is to construct 3D models using monocular depth sensing models. The first step in this approach is depth estimation from the monocular photo. The next step is to use a 3D point cloud encoder to predict and correct the depth shift in order to construct a realistic 3D scene shape. This approach is faster and can be executed even on low-end hardware, but the constructed 3D model may be incomplete.
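Once a depth value is available for each pixel, turning the photo into 3D points is a matter of back-projection. The sketch below assumes a simple pinhole camera with a known focal length and principal point, and a depth map whose shift has already been corrected; the learned depth estimation and the point cloud encoder themselves are not shown. All names and values are illustrative.

```python
def depth_to_point_cloud(depth, focal_length, cx, cy):
    """Back-project a depth map (rows of depth values) into (x, y, z) points."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # skip pixels without a valid depth estimate
            x = (u - cx) * z / focal_length
            y = (v - cy) * z / focal_length
            points.append((x, y, z))
    return points


depth_map = [
    [2.0, 2.0, 0.0],
    [2.0, 4.0, 4.0],
]
cloud = depth_to_point_cloud(depth_map, focal_length=2.0, cx=1.0, cy=0.5)
```

Since every point comes from a single viewpoint, surfaces hidden from the camera get no points, which is one reason the resulting 3D model may be incomplete.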
3D mesh model
The NeRF model constructed in the previous step provides a point cloud model of the scene. This point cloud is meshed using the Marching Cubes or Poisson surface reconstruction technique to produce the mesh model. The texture for the mesh model is generated from the RGB color values of the point cloud.
3D mesh model based on the photograph generated above
In a noise-to-photo generation model like Stable Diffusion, users have little control over the generation process. Even with identical parameters, any given prompt can generate a new variant of the image each time you try. To enhance an existing photograph without losing its shape and geometry details, ControlNet models are helpful.
ControlNet uses typical image processing outcomes such as Canny, Hough, HED and depth to preserve the shape and geometry information during the generation process. It can boost the productivity of content designers in iterating combinations of styles, materials, lighting, etc.
Variants produced using a ControlNet model preserving the shape, geometry and pose
Latent diffusion models and generative AI can not only transform XR applications but also prove useful in segments like media and entertainment, in building simulation environments for autonomous vehicles and so on.
In conclusion, the realm of 3D creation is undergoing a remarkable transformation. The paradigm shift is granting unrestricted opportunities for individualism while simultaneously streamlining the creative process, eliminating challenges centered on time, budgets and laborious manual work.
We expect generative AI-led 3D modeling to forge ahead and explore innovative functionalities. 3D authoring could become an effortless experience where people, irrespective of budget and industry, can unleash their creative potential.
A variation of this article was originally published at Thoughtworks Insights, co-authored by Raju Kandaswamy.