Abstract:
Three-dimensional semantic understanding from only a few multi-view images can greatly reduce the communication burden on the network. In addition, while point clouds are extensively studied for 3D scene understanding, multi-view image data offers rich visual detail and texture information. However, challenges persist in lifting 2D semantic features to 3D space and in leveraging language for segmentation. Inspired by recent advancements, this paper proposes a method that combines CLIP features and SAM masks to create a feature field capable of segmenting objects via natural-language text across 2D multi-view images and 3D Gaussian splatting. It offers a promising way to extract 3D assets for game engines and the metaverse. Our method generates masks from video frames, extracts physical scales via RGB NeRF with the masks, and organizes hierarchical information for semantic comprehension. During training, affinity features preserve scale properties and guide CLIP feature generation with automatic weight blending for semantic robustness. For inference, a straightforward 3D-splatted CLIP feature approach and a canonical-text methodology enhance query robustness across 2D multi-view images and 3D splatting through relevance scores computed from text CLIP features. Experimental results demonstrate promising improvements in semantic understanding of 3D scenes. © 2024 IEEE.
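The abstract mentions querying the splatted CLIP feature field with relevance scores computed from text CLIP features against canonical phrases. The sketch below is a minimal, hypothetical illustration of such a LERF-style relevance score; the function name, the canonical phrases, and the embedding shapes are assumptions for illustration, not the authors' exact implementation.

```python
# Hypothetical sketch: relevance score between a text query and rendered CLIP
# features (per pixel in 2D multi-view renders or per 3D Gaussian).
# All names and canonical phrases are illustrative assumptions.
import torch
import torch.nn.functional as F

def relevance_score(feat: torch.Tensor,
                    query_emb: torch.Tensor,
                    canon_embs: torch.Tensor) -> torch.Tensor:
    """
    feat:       (N, D) rendered CLIP features, L2-normalized
    query_emb:  (D,)   CLIP text embedding of the query phrase, L2-normalized
    canon_embs: (C, D) CLIP text embeddings of canonical phrases, L2-normalized
    returns:    (N,)   relevance in [0, 1]; higher means the feature is closer
                       to the query than to every canonical phrase
    """
    q = feat @ query_emb          # (N,)   similarity to the query
    c = feat @ canon_embs.T       # (N, C) similarity to each canonical phrase
    # pairwise softmax of (query, canonical_i), then worst case over canonicals
    rel = torch.exp(q)[:, None] / (torch.exp(q)[:, None] + torch.exp(c))
    return rel.min(dim=1).values

# Usage with placeholder, normalized embeddings
feat = F.normalize(torch.randn(1000, 512), dim=-1)
query_emb = F.normalize(torch.randn(512), dim=-1)
canon_embs = F.normalize(torch.randn(4, 512), dim=-1)  # e.g. "object", "things", "stuff", "texture"
scores = relevance_score(feat, query_emb, canon_embs)
```

Thresholding or arg-maxing these scores over pixels or Gaussians would then yield the language-driven segmentation described in the abstract.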
Year: 2024
Page: 464-467
Language: English