Authors: Aritra Roy Gosthipaty, Ritwik Raha
Date created: 2021/08/09
Last modified: 2023/11/13
Description: Minimal implementation of volumetric rendering as shown in NeRF.
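In this example, we present a minimal implementation of the research paper NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis by Ben Mildenhall et al. The authors propose an ingenious way to synthesize novel views of a scene by modelling the volumetric scene function through a neural network.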
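Figure 1: A neural network is given the coordinates of an image as input and asked to predict the color at those coordinates.
Figure 2: The trained neural network recreates the image from scratch.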
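The setup in Figures 1 and 2 amounts to a regression problem from pixel coordinates to colors. The sketch below only prepares such coordinate/color pairs with NumPy; the helper name `image_to_coordinate_dataset` and the random toy image are hypothetical and are not part of the NeRF code in this example.

```python
import numpy as np


def image_to_coordinate_dataset(image):
    """Turns an (H, W, 3) image into (N, 2) coordinates and (N, 3) colors."""
    height, width, _ = image.shape
    # Pixel-centre coordinates, normalised to the open interval (0, 1).
    ys, xs = np.meshgrid(
        (np.arange(height) + 0.5) / height,
        (np.arange(width) + 0.5) / width,
        indexing="ij",
    )
    coords = np.stack([xs, ys], axis=-1).reshape(-1, 2)
    colors = image.reshape(-1, 3)
    return coords, colors


# Hypothetical 4 x 6 RGB image standing in for a real photograph.
toy_image = np.random.default_rng(0).random((4, 6, 3))
coords, colors = image_to_coordinate_dataset(toy_image)
```

A network trained on these pairs would take a row of `coords` as input and be asked to reproduce the matching row of `colors`.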
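Now the question arises: how do we extend this idea to learn a 3-D volumetric scene? Implementing a process similar to the one above would require knowledge of every voxel (volume pixel). It turns out that this is quite a challenging task.
The authors of the paper propose a minimal and elegant way to learn a 3-D scene using a few images of the scene. They discard the use of voxels for training. The network learns to model the volumetric scene, and can therefore generate novel views (images) of the 3-D scene that the model was not shown at training time.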
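import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import tensorflow as tf

# Setting random seed to obtain reproducible results.
tf.random.set_seed(42)

import keras
from keras import layers
import glob
import imageio.v2 as imageio
import numpy as np
from tqdm import tqdm
import matplotlib.pyplot as plt

# Initialize global variables.
# NOTE: the numeric values are assumed from the published Keras example
# (EPOCHS matches the 20 epochs reported later); tune them for your hardware.
AUTO = tf.data.AUTOTUNE
BATCH_SIZE = 5
NUM_SAMPLES = 32
POS_ENCODE_DIMS = 16
EPOCHS = 20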
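The data file contains images, camera poses, and a focal length. The images are taken from multiple camera angles, as shown in Figure 3.
Figure 3: Multiple camera angles
Source: NeRF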
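To understand camera poses in this context, we first have to allow ourselves to think of a camera as a mapping between the real world and the 2-D image.
Figure 4: 3-D world to 2-D image mapping through a camera
Source: Mathworks
Consider the equation x = PX, where x is the 2-D image point, X is the 3-D world point, and P is the camera matrix. P is a 3 x 4 matrix that plays the crucial role of mapping the real-world object onto an image plane.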
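As a small worked instance of x = PX, the NumPy sketch below projects one 3-D point through a 3 x 4 camera matrix. It assumes the standard pinhole decomposition P = K [R | t], which is not spelled out in this example; the focal length, principal point, and identity pose are hypothetical values chosen so the arithmetic is easy to follow.

```python
import numpy as np


def project_point(camera_matrix, world_point):
    """Projects a 3-D world point to 2-D pixel coordinates via x = PX."""
    # Homogeneous world point X = [X, Y, Z, 1].
    X = np.append(np.asarray(world_point, dtype=np.float64), 1.0)
    x = camera_matrix @ X  # Homogeneous image point, shape (3,).
    return x[:2] / x[2]  # Divide by the third component to get pixels.


# Hypothetical intrinsics: focal length 100 px, principal point (50, 50).
focal_px, cx, cy = 100.0, 50.0, 50.0
K = np.array([[focal_px, 0.0, cx], [0.0, focal_px, cy], [0.0, 0.0, 1.0]])
# Hypothetical extrinsics: camera at the world origin, no rotation.
Rt = np.hstack([np.eye(3), np.zeros((3, 1))])
P = K @ Rt  # The 3 x 4 camera matrix.

pixel = project_point(P, [0.5, -0.25, 2.0])
```

Here the point lies two units in front of the camera, so its image coordinates are the homogeneous result [150, 75, 2] divided by 2.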
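The camera matrix is an affine transform matrix that is concatenated with a 3 x 1 column [image height, image width, focal length] to produce the pose matrix. This matrix has dimensions 3 x 5, where the first 3 x 3 block is expressed in the camera's point of view. The axes are [down, right, backwards] or [-y, x, z], where the camera faces forwards along -z.
Figure 5: The affine transformation.
The COLMAP frame is [right, down, forwards] or [x, -y, -z]. Read more about COLMAP here.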
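The 3 x 5 layout described above can be illustrated by concatenating a 3 x 4 transform with the [height, width, focal] column and then slicing it apart again. This is only a sketch of the layout in the text: every numeric value is hypothetical, and the arrays stored in the data file used by this example may be organised differently.

```python
import numpy as np

# Hypothetical image size and focal length.
height, width, focal_length = 100.0, 100.0, 138.0

# Hypothetical 3 x 4 transform: identity rotation, camera 4 units along +z.
transform = np.hstack([np.eye(3), np.array([[0.0], [0.0], [4.0]])])

# The 3 x 1 column [image height, image width, focal length].
hwf = np.array([[height], [width], [focal_length]])

# Concatenate along the columns to obtain the 3 x 5 pose matrix.
pose_3x5 = np.concatenate([transform, hwf], axis=1)

# Slice the pieces back out.
rotation = pose_3x5[:, :3]
translation = pose_3x5[:, 3]
h, w, f = pose_3x5[:, 4]
```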
# Download the data if it does not already exist.
url = (
data = keras.utils.get_file(origin=url)
data = np.load(data)
images = data["images"]
im_shape = images.shape
(num_images, H, W, _) = images.shape
(poses, focal) = (data["poses"], data["focal"])
# Plot a random image from the dataset for visualization.
plt.imshow(images[np.random.randint(low=0, high=num_images)])
plt.show()
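Now that you have understood the concept of the camera matrix and the mapping from a 3-D scene to 2-D images, let's talk about the inverse mapping, i.e. from a 2-D image to the 3-D scene.
Consider an image with N pixels. We shoot a ray through each pixel and sample some points on the ray. A ray is commonly parameterized by the equation r(t) = o + td, where t is the parameter, o is the origin, and d is the unit directional vector, as shown in Figure 6.
Figure 6: r(t) = o + td where t is 3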
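The parameterization r(t) = o + td can be evaluated for several values of t at once with broadcasting. The origin, direction, and t values below are hypothetical and only illustrate the equation.

```python
import numpy as np

origin = np.array([0.0, 0.0, 0.0])  # o: where the ray starts.
direction = np.array([0.0, 0.0, -1.0])  # d: unit vector along -z.
t = np.array([1.0, 2.0, 3.0])  # Three values of the parameter t.

# r(t) = o + t * d, evaluated for every t; the result has shape (3, 3).
points = origin + t[:, None] * direction
```

The last row of `points` is the location on the ray at t = 3, as in Figure 6.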
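In Figure 7, we consider a ray and sample some random points on it. Each sample point has a unique location (x, y, z), and the ray has a viewing angle (theta, phi). The viewing angle is particularly interesting because we can shoot a ray through a single pixel in many different ways, each with a unique viewing angle. Another interesting thing to notice is the noise that is added to the sampling process. We add uniform noise to each sample so that the samples correspond to a continuous distribution. In Figure 7, the blue points are the evenly distributed samples and the white points (t1, t2, t3) are randomly placed between the samples.
Figure 7: Sampling the points from a ray.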
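The effect of the uniform noise can be sketched in NumPy: start from evenly spaced t values and push each one forward by a random amount smaller than one bin. The helper `sample_t_values` is hypothetical, and the bounds 2.0 and 6.0 with 8 samples are illustrative values, not a statement about the settings used for training.

```python
import numpy as np


def sample_t_values(near, far, num_samples, rng=None):
    """Evenly spaced samples, optionally jittered by uniform noise."""
    t_vals = np.linspace(near, far, num_samples)
    if rng is not None:
        # Each sample moves forward by less than one bin width.
        bin_width = (far - near) / num_samples
        t_vals = t_vals + rng.uniform(0.0, bin_width, size=num_samples)
    return t_vals


even = sample_t_values(2.0, 6.0, 8)
jittered = sample_t_values(2.0, 6.0, 8, rng=np.random.default_rng(0))
```

Because the noise is smaller than the spacing between neighbouring samples, the jittered values stay sorted, yet over many draws they cover the whole interval rather than a fixed set of depths.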
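Figure 8 showcases the entire sampling process in 3-D, where you can see the rays coming out of the white image. This means that each pixel has its corresponding ray, and each ray is sampled at distinct points.
Figure 8: Shooting rays from all the pixels of an image in 3-D
These sampled points act as the input to the NeRF model. The model is then asked to predict the RGB color and the volume density at that point.
Figure 9: Data pipeline
Source: NeRF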
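Before the sampled points reach the network, this example expands each coordinate into Fourier features (see `encode_position`). The NumPy sketch below mirrors that idea so the output width can be checked by hand; `encode_position_np` and the choice of 4 frequencies are hypothetical and serve only as an illustration.

```python
import numpy as np


def encode_position_np(x, num_frequencies):
    """Concatenates x with sin/cos of x at doubling frequencies."""
    features = [x]
    for i in range(num_frequencies):
        for fn in (np.sin, np.cos):
            features.append(fn(2.0**i * x))
    return np.concatenate(features, axis=-1)


# Five 3-D points at the origin, encoded with 4 frequencies.
points = np.zeros((5, 3))
encoded = encode_position_np(points, num_frequencies=4)
```

Each 3-D point grows to 3 + 2 * 3 * 4 = 27 features, the same `2 * 3 * L + 3` pattern that sets the input width of the model for L frequencies.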
def encode_position(x):
    """Encodes the position into its corresponding Fourier feature.

    Args:
        x: The input coordinate.

    Returns:
        Fourier features tensors of the position.
    """
    positions = [x]
    for i in range(POS_ENCODE_DIMS):
        for fn in [tf.sin, tf.cos]:
            positions.append(fn(2.0**i * x))
    return tf.concat(positions, axis=-1)


def get_rays(height, width, focal, pose):
    """Computes origin point and direction vector of rays.

    Args:
        height: Height of the image.
        width: Width of the image.
        focal: The focal length between the images and the camera.
        pose: The pose matrix of the camera.

    Returns:
        Tuple of origin point and direction vector for rays.
    """
    # Build a meshgrid for the rays.
    i, j = tf.meshgrid(
        tf.range(width, dtype=tf.float32),
        tf.range(height, dtype=tf.float32),
        indexing="xy",
    )

    # Normalize the x axis coordinates.
    transformed_i = (i - width * 0.5) / focal

    # Normalize the y axis coordinates.
    transformed_j = (j - height * 0.5) / focal

    # Create the direction vectors (they are not rescaled to unit length).
    directions = tf.stack([transformed_i, -transformed_j, -tf.ones_like(i)], axis=-1)

    # Get the camera matrix. The last column of the pose is used as the
    # ray origin below.
    camera_matrix = pose[:3, :3]
    height_width_focal = pose[:3, -1]

    # Get origins and directions for the rays.
    transformed_dirs = directions[..., None, :]
    camera_dirs = transformed_dirs * camera_matrix
    ray_directions = tf.reduce_sum(camera_dirs, axis=-1)
    ray_origins = tf.broadcast_to(height_width_focal, tf.shape(ray_directions))

    # Return the origins and directions.
    return (ray_origins, ray_directions)


def render_flat_rays(ray_origins, ray_directions, near, far, num_samples, rand=False):
    """Renders the rays and flattens it.

    Args:
        ray_origins: The origin points for rays.
        ray_directions: The direction unit vectors for the rays.
        near: The near bound of the volumetric scene.
        far: The far bound of the volumetric scene.
        num_samples: Number of sample points in a ray.
        rand: Choice for randomising the sampling strategy.

    Returns:
        Tuple of flattened rays and sample points on each rays.
    """
    # Compute 3D query points.
    # Equation: r(t) = o+td -> Building the "t" here.
    t_vals = tf.linspace(near, far, num_samples)
    if rand:
        # Inject uniform noise into sample space to make the sampling
        # continuous.
        shape = list(ray_origins.shape[:-1]) + [num_samples]
        noise = tf.random.uniform(shape=shape) * (far - near) / num_samples
        t_vals = t_vals + noise

    # Equation: r(t) = o + td -> Building the "r" here.
    rays = ray_origins[..., None, :] + (
        ray_directions[..., None, :] * t_vals[..., None]
    )
    rays_flat = tf.reshape(rays, [-1, 3])
    rays_flat = encode_position(rays_flat)
    return (rays_flat, t_vals)


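def map_fn(pose):
    """Maps individual pose to flattened rays and sample points.

    Args:
        pose: The pose matrix of the camera.
    Returns:
        Tuple of flattened rays and sample points corresponding to the
        camera pose.
    """
    (ray_origins, ray_directions) = get_rays(height=H, width=W, focal=focal, pose=pose)
    # NOTE: near=2.0 and far=6.0 are assumed from the published Keras example.
    (rays_flat, t_vals) = render_flat_rays(
        ray_origins, ray_directions, near=2.0, far=6.0, num_samples=NUM_SAMPLES, rand=True
    )
    return (rays_flat, t_vals)

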
# Create the training split.
split_index = int(num_images * 0.8)

# Split the images into training and validation.
train_images = images[:split_index]
val_images = images[split_index:]

# Split the poses into training and validation.
train_poses = poses[:split_index]
val_poses = poses[split_index:]

# Make the training pipeline.
train_img_ds = tf.data.Dataset.from_tensor_slices(train_images)
train_pose_ds = tf.data.Dataset.from_tensor_slices(train_poses)
train_ray_ds = train_pose_ds.map(map_fn, num_parallel_calls=AUTO)
training_ds = tf.data.Dataset.zip((train_img_ds, train_ray_ds))
train_ds = (
    training_ds.shuffle(BATCH_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True, num_parallel_calls=AUTO)
    .prefetch(AUTO)
)

# Make the validation pipeline.
val_img_ds = tf.data.Dataset.from_tensor_slices(val_images)
val_pose_ds = tf.data.Dataset.from_tensor_slices(val_poses)
val_ray_ds = val_pose_ds.map(map_fn, num_parallel_calls=AUTO)
validation_ds = tf.data.Dataset.zip((val_img_ds, val_ray_ds))
val_ds = (
    validation_ds.shuffle(BATCH_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True, num_parallel_calls=AUTO)
    .prefetch(AUTO)
)
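The ray construction performed by `get_rays` can be checked on a tiny image in plain NumPy. The sketch below is an illustrative re-implementation, not the code used for training: `get_rays_np`, the 4 x 6 image, the focal length of 10, and the identity-rotation pose are all hypothetical.

```python
import numpy as np


def get_rays_np(height, width, focal, pose):
    """NumPy analogue of get_rays: one origin and direction per pixel."""
    i, j = np.meshgrid(
        np.arange(width, dtype=np.float32),
        np.arange(height, dtype=np.float32),
        indexing="xy",
    )
    # Camera-space directions: x to the right, y up, looking along -z.
    directions = np.stack(
        [(i - width * 0.5) / focal, -(j - height * 0.5) / focal, -np.ones_like(i)],
        axis=-1,
    )
    rotation = pose[:3, :3]
    translation = pose[:3, -1]
    # Rotate every direction into the world frame (a batched R @ d).
    ray_directions = np.sum(directions[..., None, :] * rotation, axis=-1)
    ray_origins = np.broadcast_to(translation, ray_directions.shape)
    return ray_origins, ray_directions


# Hypothetical pose: no rotation, camera placed at (0, 0, 4).
toy_pose = np.eye(4, dtype=np.float32)
toy_pose[:3, -1] = [0.0, 0.0, 4.0]
origins, dirs = get_rays_np(height=4, width=6, focal=10.0, pose=toy_pose)
```

With an identity rotation, the pixel at the image centre looks straight down -z, and all rays share the same origin, which is the camera position.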
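The model is a multi-layer perceptron (MLP) with ReLU as its non-linearity.
An excerpt from the paper:
"We encourage the representation to be multiview-consistent by restricting the network to predict the volume density sigma as a function of only the location x, while allowing the RGB color c to be predicted as a function of both location and viewing direction. To accomplish this, the MLP first processes the input 3D coordinate x with 8 fully-connected layers (using ReLU activations and 256 channels per layer), and outputs sigma and a 256-dimensional feature vector. This feature vector is then concatenated with the camera ray's viewing direction and passed to one additional fully-connected layer (using a ReLU activation and 128 channels) that outputs the view-dependent RGB color."
Here we have gone for a minimal implementation and have used 64 Dense units instead of the 256 mentioned in the paper.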
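To make the layer widths of this minimal MLP concrete, the NumPy sketch below runs a forward pass with random weights and records the feature width after each layer. It follows the same skip-connection rule as `get_nerf_model` (concatenate the inputs when `i % 4 == 0 and i > 0`), but `tiny_mlp_forward`, the random weights, the omitted biases, and the 27-wide input are hypothetical simplifications.

```python
import numpy as np


def tiny_mlp_forward(inputs, num_layers=8, units=64, seed=0):
    """Forward pass of a ReLU MLP that re-injects the inputs once."""
    rng = np.random.default_rng(seed)
    x = inputs
    widths = []
    for i in range(num_layers):
        weights = rng.normal(scale=0.1, size=(x.shape[-1], units))
        x = np.maximum(x @ weights, 0.0)  # Dense layer + ReLU, no bias.
        if i % 4 == 0 and i > 0:
            # Skip connection: append the original inputs to the features.
            x = np.concatenate([x, inputs], axis=-1)
        widths.append(x.shape[-1])
    out_weights = rng.normal(scale=0.1, size=(x.shape[-1], 4))
    return x @ out_weights, widths  # 4 outputs: RGB and sigma.


encoded_points = np.ones((10, 27))  # 10 hypothetical encoded samples.
outputs, widths = tiny_mlp_forward(encoded_points)
```

With 8 layers, only the layer at index 4 satisfies the condition, so the features widen to 64 + 27 = 91 there and return to 64 after the next Dense layer.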
def get_nerf_model(num_layers, num_pos):
    """Generates the NeRF neural network.

    Args:
        num_layers: The number of MLP layers.
        num_pos: The number of sample points per image
            (H * W * NUM_SAMPLES).

    Returns:
        The `keras` model.
    """
    inputs = keras.Input(shape=(num_pos, 2 * 3 * POS_ENCODE_DIMS + 3))
    x = inputs
    for i in range(num_layers):
        x = layers.Dense(units=64, activation="relu")(x)
        if i % 4 == 0 and i > 0:
            # Inject residual connection.
            x = layers.concatenate([x, inputs], axis=-1)
    outputs = layers.Dense(units=4)(x)
    return keras.Model(inputs=inputs, outputs=outputs)


def render_rgb_depth(model, rays_flat, t_vals, rand=True, train=True):
    """Generates the RGB image and depth map from model prediction.

    Args:
        model: The MLP model that is trained to predict the rgb and
            volume density of the volumetric scene.
        rays_flat: The flattened rays that serve as the input to
            the NeRF model.
        t_vals: The sample points for the rays.
        rand: Choice to randomise the sampling strategy.
        train: Whether the model is in the training or testing phase.

    Returns:
        Tuple of rgb image and depth map.
    """
    # Get the predictions from the nerf model and reshape it.
    if train:
        predictions = model(rays_flat)
    else:
        predictions = model.predict(rays_flat)
    predictions = tf.reshape(predictions, shape=(BATCH_SIZE, H, W, NUM_SAMPLES, 4))

    # Slice the predictions into rgb and sigma.
    rgb = tf.sigmoid(predictions[..., :-1])
    sigma_a = tf.nn.relu(predictions[..., -1])

    # Get the distance of adjacent intervals.
    delta = t_vals[..., 1:] - t_vals[..., :-1]
    # delta shape = (num_samples)
    if rand:
        delta = tf.concat(
            [delta, tf.broadcast_to([1e10], shape=(BATCH_SIZE, H, W, 1))], axis=-1
        )
        alpha = 1.0 - tf.exp(-sigma_a * delta)
    else:
        delta = tf.concat(
            [delta, tf.broadcast_to([1e10], shape=(BATCH_SIZE, 1))], axis=-1
        )
        alpha = 1.0 - tf.exp(-sigma_a * delta[:, None, None, :])

    # Get transmittance.
    exp_term = 1.0 - alpha
    epsilon = 1e-10
    transmittance = tf.math.cumprod(exp_term + epsilon, axis=-1, exclusive=True)
    weights = alpha * transmittance
    rgb = tf.reduce_sum(weights[..., None] * rgb, axis=-2)

    if rand:
        depth_map = tf.reduce_sum(weights * t_vals, axis=-1)
    else:
        depth_map = tf.reduce_sum(weights * t_vals[:, None, None], axis=-1)
    return (rgb, depth_map)
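The compositing arithmetic in `render_rgb_depth` is easier to follow on a single ray. The NumPy sketch below applies the same steps (interval lengths, alpha, exclusive cumulative product for the transmittance, weighted sums) to four hand-picked samples; `composite_ray` and all the sample values are hypothetical.

```python
import numpy as np


def composite_ray(rgb, sigma, t_vals, epsilon=1e-10):
    """Alpha-composites the samples of one ray into a color and a depth."""
    # Distance between adjacent samples; the last interval is "infinite".
    delta = np.append(t_vals[1:] - t_vals[:-1], 1e10)
    alpha = 1.0 - np.exp(-sigma * delta)
    # Exclusive cumulative product: light surviving up to each sample.
    survive = np.cumprod(1.0 - alpha + epsilon)
    transmittance = np.concatenate([[1.0], survive[:-1]])
    weights = alpha * transmittance
    color = np.sum(weights[:, None] * rgb, axis=0)
    depth = np.sum(weights * t_vals)
    return color, depth, weights


t_vals = np.array([2.0, 3.0, 4.0, 5.0])
sigma = np.array([0.0, 0.0, 50.0, 0.0])  # Only the third sample is dense.
rgb = np.array(
    [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
)
color, depth, weights = composite_ray(rgb, sigma, t_vals)
```

Because the first two samples are empty and the third is nearly opaque, almost all of the weight lands on the third sample: the ray comes out red and its expected depth is close to 4.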
The training step is implemented as part of a custom keras.Model subclass so that we can make use of the model.fit functionality.
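Besides the loss, the training and test steps track the PSNR between the rendered and the source images. For images scaled to [0, 1], PSNR is a logarithmic rescaling of the mean squared error, which the NumPy sketch below makes explicit. The helper `psnr_from_images` and the two constant images are hypothetical.

```python
import numpy as np


def psnr_from_images(reference, reconstruction, max_val=1.0):
    """Peak signal-to-noise ratio in decibels: 10 * log10(max^2 / MSE)."""
    mse = np.mean((reference - reconstruction) ** 2)
    return 10.0 * np.log10(max_val**2 / mse)


# Hypothetical images: every pixel is off by exactly 0.1.
reference = np.zeros((4, 4, 3))
reconstruction = np.full((4, 4, 3), 0.1)
psnr_value = psnr_from_images(reference, reconstruction)
```

An error of 0.1 everywhere gives an MSE of 0.01 and therefore a PSNR of 20 dB; halving the MSE raises the PSNR by about 3 dB. Identical images give a zero MSE, which this sketch does not guard against.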
class NeRF(keras.Model):
    def __init__(self, nerf_model):
        super().__init__()
        self.nerf_model = nerf_model

    def compile(self, optimizer, loss_fn):
        super().compile()
        self.optimizer = optimizer
        self.loss_fn = loss_fn
        self.loss_tracker = keras.metrics.Mean(name="loss")
        self.psnr_metric = keras.metrics.Mean(name="psnr")

    def train_step(self, inputs):
        # Get the images and the rays.
        (images, rays) = inputs
        (rays_flat, t_vals) = rays

        with tf.GradientTape() as tape:
            # Get the predictions from the model.
            rgb, _ = render_rgb_depth(
                model=self.nerf_model, rays_flat=rays_flat, t_vals=t_vals, rand=True
            )
            loss = self.loss_fn(images, rgb)

        # Get the trainable variables.
        trainable_variables = self.nerf_model.trainable_variables

        # Get the gradients of the trainable variables with respect to the loss.
        gradients = tape.gradient(loss, trainable_variables)

        # Apply the grads and optimize the model.
        self.optimizer.apply_gradients(zip(gradients, trainable_variables))

        # Get the PSNR of the reconstructed images and the source images.
        psnr = tf.image.psnr(images, rgb, max_val=1.0)

        # Compute our own metrics
        self.loss_tracker.update_state(loss)
        self.psnr_metric.update_state(psnr)
        return {"loss": self.loss_tracker.result(), "psnr": self.psnr_metric.result()}

    def test_step(self, inputs):
        # Get the images and the rays.
        (images, rays) = inputs
        (rays_flat, t_vals) = rays

        # Get the predictions from the model.
        rgb, _ = render_rgb_depth(
            model=self.nerf_model, rays_flat=rays_flat, t_vals=t_vals, rand=True
        )
        loss = self.loss_fn(images, rgb)

        # Get the PSNR of the reconstructed images and the source images.
        psnr = tf.image.psnr(images, rgb, max_val=1.0)

        # Compute our own metrics
        self.loss_tracker.update_state(loss)
        self.psnr_metric.update_state(psnr)
        return {"loss": self.loss_tracker.result(), "psnr": self.psnr_metric.result()}

    @property
    def metrics(self):
        return [self.loss_tracker, self.psnr_metric]


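test_imgs, test_rays = next(iter(train_ds))
test_rays_flat, test_t_vals = test_rays

loss_list = []


class TrainMonitor(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        loss = logs["loss"]
        loss_list.append(loss)
        # Render the monitoring batch in inference mode.
        test_recons_images, depth_maps = render_rgb_depth(
            model=self.model.nerf_model, rays_flat=test_rays_flat,
            t_vals=test_t_vals, rand=True, train=False,
        )

        # Plot the rgb, depth and the loss plot.
        fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(20, 5))
        ax[0].imshow(keras.utils.array_to_img(test_recons_images[0]))
        ax[0].set_title(f"Predicted Image: {epoch:03d}")
        ax[1].imshow(keras.utils.array_to_img(depth_maps[0, ..., None]))
        ax[1].set_title(f"Depth Map: {epoch:03d}")
        ax[2].plot(loss_list)
        ax[2].set_xticks(np.arange(0, EPOCHS + 1, 5.0))
        ax[2].set_title(f"Loss Plot: {epoch:03d}")

        # Save one frame per epoch so the frames can be turned into a GIF.
        fig.savefig(f"images/{epoch:03d}.png")
        plt.show()
        plt.close()

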
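num_pos = H * W * NUM_SAMPLES
nerf_model = get_nerf_model(num_layers=8, num_pos=num_pos)
model = NeRF(nerf_model)
model.compile(
    optimizer=keras.optimizers.Adam(), loss_fn=keras.losses.MeanSquaredError()
)

# Create a directory to save the images during training.
if not os.path.exists("images"):
    os.makedirs("images")
# Train with model.fit; the datasets are already batched.
model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS, callbacks=[TrainMonitor()])

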
def create_gif(path_to_images, name_gif):
    filenames = glob.glob(path_to_images)
    filenames = sorted(filenames)
    images = []
    for filename in tqdm(filenames):
        # Read each saved frame in filename order.
        images.append(imageio.imread(filename))
    kargs = {"duration": 0.25}
    imageio.mimsave(name_gif, images, "GIF", **kargs)


create_gif("images/*.png", "training.gif")
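Here we see the rendered 360-degree view of the scene. The model has successfully learned the entire volumetric space from the sparse set of images in only 20 epochs. You can view the rendered video, saved locally as rgb_video.mp4.
We have produced a minimal implementation of NeRF to provide an intuition of its core ideas and methodology. This method has been used in various other works in the computer graphics space.
We encourage our readers to use this code as an example, play with the hyperparameters, and visualize the outputs. Below we have also provided the outputs of the model trained for more epochs.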
Epochs | GIF of the training step |
100 | |
200 | |
If anyone is interested in going deeper into NeRF, we have built a three-part blog series at PyImageSearch.
You can try running the model on Hugging Face Spaces.