SpatialComputing

SpatialComputing OP t1_j67xr7u wrote

>Text-To-4D Dynamic Scene Generation
>
>Abstract
>
>We present MAV3D (Make-A-Video3D), a method for generating three-dimensional dynamic scenes from text descriptions. Our approach uses a 4D dynamic Neural Radiance Field (NeRF), which is optimized for scene appearance, density, and motion consistency by querying a Text-to-Video (T2V) diffusion-based model. The dynamic video output generated from the provided text can be viewed from any camera location and angle, and can be composited into any 3D environment. MAV3D does not require any 3D or 4D data and the T2V model is trained only on Text-Image pairs and unlabeled videos. We demonstrate the effectiveness of our approach using comprehensive quantitative and qualitative experiments and show an improvement over previously established internal baselines. To the best of our knowledge, our method is the first to generate 3D dynamic scenes given a text description.
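
For anyone curious what "querying a T2V model" looks like in practice, here is a minimal PyTorch sketch of a score-distillation loop against a frozen text-to-video model. Everything here (`Dynamic4DNeRF`, `FrozenT2VDiffusion`, `render_video`, `predict_noise`, the toy noise schedule) is a hypothetical stand-in for illustration, not MAV3D's actual code or API:

```python
# Rough sketch (not MAV3D's code): optimize a dynamic NeRF by score distillation
# from a frozen text-to-video diffusion model. Both modules are placeholders.
import torch
import torch.nn as nn

class Dynamic4DNeRF(nn.Module):
    """Placeholder 4D radiance field: maps a camera to rendered video frames."""
    def __init__(self, frames=8, res=32):
        super().__init__()
        self.video = nn.Parameter(torch.rand(1, 3, frames, res, res))  # toy scene state

    def render_video(self, camera):
        # A real model would volume-render frames along the given camera path.
        return torch.sigmoid(self.video + 0.01 * camera.sum())

class FrozenT2VDiffusion(nn.Module):
    """Placeholder frozen text-to-video diffusion model, used only for its score."""
    def predict_noise(self, noisy_video, t, text_embedding):
        # A real T2V model would run a denoising network conditioned on text and t.
        return noisy_video - text_embedding.view(1, 3, 1, 1, 1)

nerf, t2v = Dynamic4DNeRF(), FrozenT2VDiffusion()
text_embedding = torch.tensor([0.2, 0.5, 0.8])       # stand-in for a text encoder output
optimizer = torch.optim.Adam(nerf.parameters(), lr=1e-2)

for step in range(200):
    camera = torch.randn(3)                            # random camera per step
    video = nerf.render_video(camera)                  # (1, 3, T, H, W)
    t = torch.randint(1, 1000, (1,))
    noise = torch.randn_like(video)
    noisy = video + 0.1 * noise                        # toy forward-diffusion step
    with torch.no_grad():
        pred = t2v.predict_noise(noisy, t, text_embedding)
    # SDS-style update: gradient flows only through the rendered video,
    # pushing it so the frozen model's predicted noise matches the injected noise.
    loss = ((pred - noise).detach() * video).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```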

25

SpatialComputing OP t1_ixzqmh5 wrote

Yes. On the other hand: the glasses have a Snapdragon XR1 Gen 1, and if that's a Motorola Edge+, there's an SD 865 in there... neither is among the most efficient SoCs today. Hopefully QC can run this on the Snapdragon AR2 in the future.

2

SpatialComputing OP t1_iwyusth wrote

>With the Garment Transfer Custom Component, Lens Developers can utilize an image of an upper body garment that is transferred in real-time on a person in AR. The Garment Transfer Template offers a quick way for you to get started with the Garment Transfer Custom Component and provides a starting point for photorealistic Try-On experiences.
>
>https://docs.snap.com/lens-studio/references/templates/object/try-on/garment-transfer

18

SpatialComputing OP t1_iv6jn6t wrote

>In order for learning systems to be able to understand and create 3D spaces, progress in generative models for 3D is sorely needed. The quote "The creation continues incessantly through the media of humans." is often attributed to Antoni Gaudí, who we pay homage to with our method's name. We are interested in generative models that can capture the distribution of 3D scenes and then render views from scenes sampled from the learned distribution. Extensions of such generative models to conditional inference problems could have tremendous impact in a wide range of tasks in machine learning and computer vision. For example, one could sample plausible scene completions that are consistent with an image observation, or a text description (see Fig. 1 for 3D scenes sampled from GAUDI). In addition, such models would be of great practical use in model-based reinforcement learning and planning [12], SLAM [39], or 3D content creation.
>
>We introduce GAUDI, a generative model capable of capturing the distribution of complex and realistic 3D scenes that can be rendered immersively from a moving camera. We tackle this challenging problem with a scalable yet powerful approach, where we first optimize a latent representation that disentangles radiance fields and camera poses. This latent representation is then used to learn a generative model that enables both unconditional and conditional generation of 3D scenes. Our model generalizes previous works that focus on single objects by removing the assumption that the camera pose distribution can be shared across samples. We show that GAUDI obtains state-of-the-art performance in the unconditional generative setting across multiple datasets and allows for conditional generation of 3D scenes given conditioning variables like sparse image observations or text that describes the scene.

https://github.com/apple/ml-gaudi
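
As a rough illustration of the two-stage recipe described above: stage one auto-decodes per-scene latents split into a radiance-field code and a camera-pose code, and stage two fits a prior over those latents that can then be sampled for new scenes. This is a toy PyTorch sketch under those assumptions; the module names, dimensions, and the deliberately trivial Gaussian prior are placeholders, not the released ml-gaudi implementation:

```python
# Simplified sketch of a GAUDI-style two-stage pipeline (toy data and networks).
import torch
import torch.nn as nn

# Stage 1: auto-decode per-scene latents that disentangle scene content and camera path.
num_scenes, scene_dim, pose_dim = 16, 32, 8
scene_latents = nn.Parameter(torch.randn(num_scenes, scene_dim))   # radiance-field code
pose_latents = nn.Parameter(torch.randn(num_scenes, pose_dim))     # camera-pose code

scene_decoder = nn.Sequential(nn.Linear(scene_dim, 64), nn.ReLU(), nn.Linear(64, 3 * 8 * 8))
pose_decoder = nn.Sequential(nn.Linear(pose_dim, 32), nn.ReLU(), nn.Linear(32, 6))

target_views = torch.rand(num_scenes, 3 * 8 * 8)    # toy stand-in for rendered views
target_poses = torch.randn(num_scenes, 6)           # toy stand-in for camera poses

opt = torch.optim.Adam(
    [scene_latents, pose_latents,
     *scene_decoder.parameters(), *pose_decoder.parameters()], lr=1e-3
)
for step in range(500):
    recon_views = scene_decoder(scene_latents)
    recon_poses = pose_decoder(pose_latents)
    loss = ((recon_views - target_views) ** 2).mean() + ((recon_poses - target_poses) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Stage 2: fit a generative model over the optimized latents (here a trivial Gaussian;
# the paper's prior is far more expressive). Sampling it gives new scene/pose codes
# that the decoders can turn into novel scenes and camera paths.
latents = torch.cat([scene_latents, pose_latents], dim=1).detach()
mu, std = latents.mean(0), latents.std(0)
sampled = mu + std * torch.randn(scene_dim + pose_dim)
new_scene_code, new_pose_code = sampled[:scene_dim], sampled[scene_dim:]
```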

22

SpatialComputing OP t1_iud8iek wrote

>TOCH: Spatio-Temporal Object-to-Hand Correspondence for Motion Refinement
>
>We present TOCH, a method for refining incorrect 3D hand-object interaction sequences using a data prior. Existing hand trackers, especially those that rely on very few cameras, often produce visually unrealistic results with hand-object intersection or missing contacts. Although correcting such errors requires reasoning about temporal aspects of interaction, most previous work focuses on static grasps and contacts. The core of our method is TOCH fields, a novel spatio-temporal representation for modeling correspondences between hands and objects during interaction. The key component is a point-wise object-centric representation which encodes the hand position relative to the object. Leveraging this novel representation, we learn a latent manifold of plausible TOCH fields with a temporal denoising auto-encoder. Experiments demonstrate that TOCH outperforms state-of-the-art (SOTA) 3D hand-object interaction models, which are limited to static grasps and contacts. More importantly, our method produces smooth interactions even before and after contact. Using a single trained TOCH model, we quantitatively and qualitatively demonstrate its usefulness for 1) correcting erroneous reconstruction results from off-the-shelf RGB/RGB-D hand-object reconstruction methods, 2) de-noising, and 3) grasp transfer across objects.
>
>Project | Paper | Code
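
To make the representation a bit more concrete, here is a toy PyTorch sketch of a TOCH-field-like feature (per object point: distance and offset to its nearest hand point) plus a small temporal autoencoder trained to denoise sequences of such features. The feature layout, shapes, and GRU architecture are illustrative assumptions, not the paper's exact definition or network:

```python
# Illustrative sketch of the ideas in the TOCH abstract (toy shapes, generic network).
import torch
import torch.nn as nn

def toch_like_features(object_points, hand_points):
    """Per-object-point, object-centric encoding of the nearest hand point:
    distance plus relative offset, roughly in the spirit of a TOCH field."""
    # object_points: (N, 3), hand_points: (M, 3)
    d = torch.cdist(object_points, hand_points)           # (N, M) pairwise distances
    dist, idx = d.min(dim=1)                               # nearest hand point per object point
    offset = hand_points[idx] - object_points              # relative position, object-centric
    return torch.cat([dist.unsqueeze(1), offset], dim=1)   # (N, 4)

class TemporalDenoiser(nn.Module):
    """Toy temporal autoencoder: maps a noisy sequence of per-frame features
    back toward a learned manifold of plausible interaction sequences."""
    def __init__(self, feat_dim, hidden=64):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, feat_dim)

    def forward(self, seq):                                # seq: (B, T, feat_dim)
        h, _ = self.encoder(seq)
        return self.decoder(h)

# Toy sequence: T frames of N object points vs. M hand points.
T, N, M = 16, 128, 64
frames = [toch_like_features(torch.rand(N, 3), torch.rand(M, 3)) for _ in range(T)]
seq = torch.stack(frames).reshape(1, T, N * 4)             # flatten per-frame features

model = TemporalDenoiser(feat_dim=N * 4)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(100):
    noisy = seq + 0.05 * torch.randn_like(seq)              # simulate tracker noise
    recon = model(noisy)
    loss = ((recon - seq) ** 2).mean()                      # denoising objective
    opt.zero_grad()
    loss.backward()
    opt.step()
```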

14