Submitted by alik31239 t3_117blae in MachineLearning
In the abstract of the NeRF paper (https://arxiv.org/abs/2003.08934), the framework is described as follows: the user inputs a set of images with known camera poses, and after training the network they can render images of the same scene from new viewpoints.
However, the paper itself builds a network that takes 5D vectors as input (3 location coordinates + 2 viewing angles) and outputs a color and volume density for each such coordinate. I don't understand where I would get those 5D coordinates from. My training data surely doesn't have them; I only have a collection of images. The same goes for inference. It seems the paper assumes not only a collection of images but also a 3D representation of the scene, while the abstract doesn't require the latter. What am I missing here?
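For concreteness, this is the kind of mapping I understand the paper to describe, written as a minimal sketch (the class name `TinyNeRF` and the hidden size are my own, and this ignores the positional encoding and the way the real model splits position and direction):

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):  # hypothetical, simplified stand-in for the paper's MLP
    def __init__(self, hidden=256):
        super().__init__()
        # Shared trunk over the 5D input: (x, y, z, theta, phi)
        self.trunk = nn.Sequential(
            nn.Linear(5, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(hidden, 1)                          # volume density
        self.rgb_head = nn.Sequential(nn.Linear(hidden, 3), nn.Sigmoid())  # color in [0, 1]

    def forward(self, x):  # x: (N, 5)
        h = self.trunk(x)
        return self.rgb_head(h), torch.relu(self.sigma_head(h))
```

My question is where the `(N, 5)` inputs to such a network come from in the first place.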
marixer t1_j9b0x65 wrote
The step you're missing is estimating the camera positions and angles with something like COLMAP, which predicts them by extracting features from the images, matching them across views, and triangulating. That data is then used alongside the RGB images to train the NeRF.
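As a rough illustration of where the 5D inputs then come from, here's a minimal numpy sketch (the helper names, the pinhole/axis convention, and the near/far bounds are my own choices, not from the paper): once COLMAP gives you a camera-to-world pose, each pixel defines a ray, sample points along that ray give the 3 position coordinates, and the ray direction gives the 2 viewing angles.

```python
import numpy as np

def rays_for_pixel(i, j, H, W, focal, c2w):
    """Ray origin and unit direction in world space for pixel (i, j),
    assuming a pinhole camera looking along -z with a 4x4 pose c2w."""
    d_cam = np.array([(i - W / 2) / focal, -(j - H / 2) / focal, -1.0])
    d_world = c2w[:3, :3] @ d_cam           # rotate into world frame
    d_world /= np.linalg.norm(d_world)
    origin = c2w[:3, 3]                     # camera center in world frame
    return origin, d_world

def five_d_samples(origin, direction, near=2.0, far=6.0, n_samples=64):
    """Sample points along the ray and attach the viewing direction as angles."""
    t = np.linspace(near, far, n_samples)
    points = origin[None, :] + t[:, None] * direction[None, :]   # (N, 3) positions
    theta = np.arccos(direction[2])                              # polar angle
    phi = np.arctan2(direction[1], direction[0])                 # azimuth
    angles = np.tile([theta, phi], (n_samples, 1))               # (N, 2) directions
    return np.concatenate([points, angles], axis=1)              # (N, 5) network input
```

The supervision is just the RGB value of that pixel: the colors and densities predicted along the ray are composited into one rendered color and compared against the pixel, so no 3D ground truth of the scene is ever needed.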