
Abstract

State-of-the-art capture for facial rendering typically involves large, expensive apparatus and a highly controlled capture environment, making it inaccessible to many. However, the growth of the 3D visualisation market means that demand has never been higher for photo-realistic avatars. This study therefore experiments with using mobile devices to capture facial geometry in less controlled environments with more accessible equipment. To achieve this, we use NeuS, a neural surface reconstruction method built on the highly popular neural radiance fields (NeRF) work. We find that many challenges remain for these 'in the wild' capture pipelines, particularly lighting, which strongly influences whether surface reconstruction succeeds. However, we demonstrate that, in these environments, our NeuS pipeline produces significantly better surface meshes than existing structure-from-motion approaches given the same inputs. Additionally, by leveraging depth-sensing hardware on mobile devices, we reduce NeuS training times by up to 4 hours, opening the door to exciting further research on using depth data in this pipeline.

Background

State-of-the-art facial capture requires large and expensive equipment, the leading approach being the Lightstage, first presented by Debevec et al. in 2000. Over the last 20 years, however, computer graphics researchers have attempted to shrink this equipment and experiment with less controlled capture environments. Lumirithmic, a deep-tech startup launched in 2022 by Imperial College London researchers, attempts to minimise the setup to fit on a single desk.

[Figure: Lightstage capture — Franco]

[Figure: Lightstage capture — Spider-Man]

Whilst the majority of methods used in industry rely on stereo techniques for extracting facial geometry, this report explores using neural radiance fields. The landmark paper in this field is Mildenhall et al.'s 2020 NeRF (Neural Radiance Fields), which trains a neural network to optimise a volumetric scene function, predicting colour and density for points in a volume. NeuS, a neural surface reconstruction method published by Wang et al. in 2021, builds on NeRF: it works similarly, but instead of outputting density it outputs a signed distance function (SDF) value for each point, resulting in smoother, more accurate surface reconstructions.
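To make the distinction concrete, below is a minimal sketch (in PyTorch, with illustrative function names and a fixed sharpness parameter rather than the learned one) of how NeuS-style rendering turns SDF samples along a ray into opacities and then accumulates colour with the usual NeRF-style weights, following the formulation in the NeuS paper:

```python
import torch

def neus_alpha(sdf, s=64.0):
    """Convert SDF samples along one ray into alpha (opacity) values,
    following the unbiased weighting scheme described in the NeuS paper.

    sdf: (num_samples,) signed distances at consecutive points on the ray.
    s:   inverse standard deviation of the logistic density (learned in
         the real method; fixed here purely for illustration).
    """
    # Phi_s: CDF of the logistic distribution, i.e. a scaled sigmoid.
    phi = torch.sigmoid(s * sdf)
    # Discrete opacity between consecutive samples, clamped at zero so
    # segments leaving the surface contribute no negative opacity.
    return ((phi[:-1] - phi[1:]) / (phi[:-1] + 1e-6)).clamp(min=0.0)

def render_ray(rgb, alpha):
    """Standard volume-rendering accumulation (as in NeRF) using the
    alphas above. rgb: (num_samples - 1, 3) colours at the samples."""
    # Transmittance: probability the ray reaches each sample unoccluded.
    trans = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-6]), dim=0)[:-1]
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(dim=0)
```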

This study attempts to use NeuS to extract facial geometry from a sparse set of input images, captured in an uncontrolled environment using a single iPad.

Method

Processing Pipeline

We develop a data capture pipeline for the iPad and feed the captured data into NeuS. This requires capturing the images, sampling them, and preprocessing them to extract masks and camera poses, as sketched below.

[Figure: Data processing pipeline]
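As an illustration of the preprocessing step, the sketch below samples frames from an iPad capture and recovers camera poses with COLMAP's standard structure-from-motion commands. Paths, the sampling rate, and function names are our own, and mask extraction is omitted:

```python
import subprocess
from pathlib import Path
import cv2  # pip install opencv-python

def sample_frames(video_path, out_dir, every_n=15):
    """Keep every n-th frame of the capture; a sparser image set keeps
    both COLMAP and NeuS training tractable."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cap, i, kept = cv2.VideoCapture(str(video_path)), 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            cv2.imwrite(str(out / f"{kept:04d}.png"), frame)
            kept += 1
        i += 1
    cap.release()

def estimate_poses(image_dir, workspace):
    """Recover camera poses with COLMAP's standard SfM pipeline."""
    db = f"{workspace}/database.db"
    Path(f"{workspace}/sparse").mkdir(parents=True, exist_ok=True)
    subprocess.run(["colmap", "feature_extractor",
                    "--database_path", db, "--image_path", image_dir],
                   check=True)
    subprocess.run(["colmap", "exhaustive_matcher",
                    "--database_path", db], check=True)
    subprocess.run(["colmap", "mapper", "--database_path", db,
                    "--image_path", image_dir,
                    "--output_path", f"{workspace}/sparse"], check=True)
```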

We also attempt to leverage the captured depth data to reduce training times via a more refined sampling strategy: the depth data informs our sampling bounds along each ray, reducing samples wasted on empty space and thereby allowing fewer samples overall and faster training. A sketch of this idea follows the figures below.

[Figure: NeuS ray sampling]

[Figure: Our depth-informed ray sampling]
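A minimal sketch of the depth-guided strategy, assuming a per-ray depth estimate from the iPad's sensor is available; the margin and sample count here are illustrative, not the values used in our experiments:

```python
import torch

def depth_guided_samples(depth, num_samples=16, margin=0.05):
    """Place ray samples in a tight band around the sensor depth rather
    than across the full scene bounds, so few samples land in empty space.

    depth:  (num_rays,) per-ray depth from the iPad sensor, in scene units.
    margin: half-width of the band around the expected surface.
    """
    near = (depth - margin).clamp(min=1e-3)  # never sample behind the camera
    far = depth + margin
    # Stratified sampling: one jittered sample per evenly spaced bin in [0, 1).
    bins = torch.linspace(0.0, 1.0, num_samples + 1)[:-1]
    t = bins + torch.rand(depth.shape[0], num_samples) / num_samples
    # Map the fractions into each ray's [near, far] band.
    return near[:, None] + (far - near)[:, None] * t
```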

Results

Rendering

[Table: Training times]

[Figure: NeuS face render]

[Figure: COLMAP face render]

Above is a comparison between geometry built with COLMAP structure from motion and geometry produced by our NeuS method, for the same data. Whilst neither produces particularly high-quality geometry, the NeuS pipeline clearly gives better results for data captured in this environment.
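For reference, a mesh like the NeuS result above can be pulled from a trained SDF network with marching cubes. The sketch below assumes a hypothetical `sdf_network` callable standing in for the trained model, and evaluates the grid in one pass for brevity (in practice this would be chunked):

```python
import torch
from skimage import measure  # pip install scikit-image

def extract_mesh(sdf_network, resolution=128, bound=1.0):
    """Evaluate the trained SDF on a regular grid over [-bound, bound]^3
    and pull out the zero level set with marching cubes."""
    xs = torch.linspace(-bound, bound, resolution)
    grid = torch.stack(torch.meshgrid(xs, xs, xs, indexing="ij"), dim=-1)
    with torch.no_grad():
        sdf = sdf_network(grid.reshape(-1, 3)).reshape(
            resolution, resolution, resolution)
    verts, faces, normals, _ = measure.marching_cubes(
        sdf.cpu().numpy(), level=0.0,
        spacing=(2 * bound / (resolution - 1),) * 3)
    # marching_cubes returns vertices in [0, 2*bound]; shift them back.
    return verts - bound, faces
```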

Novel View Synthesis

[Animation: novel view synthesis — interpolation between two input views]

The above GIF shows an interpolation between two input views, with the intermediate camera poses generated as sketched below. As shown, our pipeline produces accurate novel view synthesis and colour estimation.
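A minimal sketch of such pose interpolation, assuming camera-to-world matrices recovered during preprocessing; the step count is illustrative:

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def interpolate_poses(pose_a, pose_b, num_steps=30):
    """Interpolate between two camera-to-world poses (4x4 matrices):
    slerp for the rotations, lerp for the translations. Rendering each
    intermediate pose with the trained model yields the animation."""
    times = np.linspace(0.0, 1.0, num_steps)
    slerp = Slerp([0.0, 1.0],
                  Rotation.from_matrix([pose_a[:3, :3], pose_b[:3, :3]]))
    rots = slerp(times).as_matrix()  # (num_steps, 3, 3)
    trans = ((1 - times)[:, None] * pose_a[:3, 3]
             + times[:, None] * pose_b[:3, 3])
    poses = np.tile(np.eye(4), (num_steps, 1, 1))
    poses[:, :3, :3], poses[:, :3, 3] = rots, trans
    return poses
```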

Conclusion

To conclude, this study aimed to experiment with using mobile devices to capture data for facial geometry rendering, primarily utilising NeuS for volumetric neural rendering. Several observations were made through the results of this study. Firstly, when using NeuS, training with masks has little effect on reconstruction accuracy; in fact, a poorly defined mask can be harmful, causing the mesh to fuse with the background. Instead, we found it far more important to correctly define the region of interest encapsulated by the camera poses. Additionally, as with state-of-the-art reconstruction techniques, lighting conditions remain very influential in successful surface reconstruction: balanced, ambient lighting produces better surface reconstructions than lighting that creates heavy shadows and highly uneven pixel intensities. Despite these challenges, we still found that surface reconstructions achieved through our NeuS pipeline qualitatively outperformed surfaces acquired using structure from motion on the same data.

Furthermore, one of the advantages of mobile devices is access to depth-sensing hardware. We used these depth cameras to inform our neural volumetric rendering sampling strategy, yielding quicker training without reducing mesh quality. Even with this optimisation, however, one of the biggest limitations of the pipeline remains its long training times, which range between 14 and 25 hours. With the increasing availability of depth sensing on mobile devices, this work could be extended to make deeper use of depth data and potentially overcome some of the remaining challenges in our pipeline.