1 Introduction

Understanding indoor scenes is one of the essential tasks for computer vision and intelligence. The rapid development of depth sensors and three-dimensional (3D) scanners, such as RGB-D cameras and LiDAR, has increased interest in 3D indoor scene comprehension for a variety of applications, such as robotics [1], navigation [2] and augmented/virtual reality [3, 4]. The objective of 3D indoor scene understanding is to discern the geometric and semantic context of each interior scene component. There are a variety of 3D data formats, including depth images, meshes, voxels, and point clouds. Among them, point clouds are the most common non-discretized data representation in 3D applications and can be acquired directly by 3D scanners or reconstructed from stereo or multi-view images.

Point cloud segmentation, which attempts to decompose indoor scenes into meaningful parts and label each point, is a fundamental and indispensable step in understanding 3D indoor scenes. Point clouds preserve the original spatial information, making them the preferred data format for segmenting indoor scenes. Segmentation of indoor scene point clouds can be divided into semantic segmentation and instance segmentation. Semantic segmentation assigns each point a scene-level object category label. Instance segmentation is more difficult and additionally requires individual objects to be identified and localized. Unlike outdoor point cloud segmentation, which must address dynamic objects, indoor point cloud segmentation commonly handles cluttered man-made objects with regularly designed shapes. Indoor point clouds are usually captured by consumer-level sensors with short ranges, while outdoor point clouds are commonly collected by LiDAR. Indoor point cloud segmentation faces several challenges. First, point cloud data are typically voluminous, with quality varying across sensors, which makes it difficult to process them efficiently and annotate them accurately. Second, indoor scenes are typically cluttered, with severe occlusions; it is challenging to accurately segment objects that are hidden or close together. Third, unlike the regular data structure of two-dimensional (2D) images, point cloud data are sparse and unorganized, making it difficult to apply sophisticated 2D segmentation methods directly to 3D point clouds. Moreover, annotating 3D data is time-consuming and labor-intensive, which limits fully supervised learning. Existing indoor point cloud datasets are limited in size and suffer from long-tailed distributions.

Much effort has been devoted to point cloud segmentation. Traditional geometry-based solutions mainly include clustering-based, model-based, and graph-based methods [5]. The majority of these methods rely on hand-crafted features with heuristic geometric constraints. Deep learning has made significant progress in 2D vision [6-8], leading to advances in point cloud segmentation. In recent years, point cloud based deep neural networks [9] have demonstrated the ability to extract more powerful features and provide more reliable geometric cues for better understanding 3D scenes. Learning from 3D data has become practical with the availability of public datasets such as ShapeNet, ModelNet, PartNet, ScanNet, Semantic3D, and KITTI. Recently, weakly supervised learning for point cloud segmentation has become a popular research topic because it attempts to learn features from limited annotated data.

This paper provides a comprehensive review of point cloud segmentation for indoor 3D scene understanding, especially methods based on deep learning. We will introduce the primary datasets and methods used for indoor scene point cloud segmentation, analyze the current research trends in this area, and discuss future directions for development. The structure of this paper is as follows. Section 2 begins by introducing 3D indoor datasets that are used for understanding 3D scenes. Section 3 presents a brief review of geometry-based point cloud segmentation methods. Section 4 reviews indirect learning approaches with structured data. Section 5 provides a comprehensive survey of existing point cloud based learning frameworks employed for 3D scene segmentation. Section 6 introduces recent learning-based segmentation methods with multimodal data. Section 7 summarizes the performance of indoor point cloud segmentation using different methods. Section 8 discusses open questions and future research directions. Section 9 concludes the paper.

2 3D indoor scene point cloud datasets

The emergence of 3D datasets has led to the development of deep learning-based segmentation methods, which play a crucial role in advancing the field and promoting progress in research and applications. Public benchmarks have proven to be highly effective in facilitating framework evaluation and comparison. By providing real-world data with ground truth annotations, these benchmarks offer a foundation for researchers to test their algorithms and enable fair comparisons between different approaches. The two most commonly used 3D indoor scene point cloud datasets are ScanNet [10] and S3DIS [11].

ScanNet. ScanNet [10] is an RGB-D video dataset encompassing more than 2.5 million views across more than 1500 scans. The dataset was captured with RGB-D cameras and is extensively annotated with essential information such as 3D camera poses, surface reconstructions, and instance-level semantic segmentations. It has led to remarkable advances in state-of-the-art performance across various 3D scene understanding tasks, including object detection, semantic segmentation, instance segmentation, and computer-aided design (CAD) model retrieval. ScanNet v2, the revised release, gathers 1,513 scans annotated with high surface coverage. For the semantic segmentation task, ScanNet v2 is labeled with annotations for 20 classes of 3D voxelized objects. Each class corresponds to a specific furniture category or room layout element, allowing for a more granular understanding and analysis of the captured indoor scenes. This makes ScanNet v2 one of the most active online evaluation benchmarks tailored for indoor scene semantic segmentation. Apart from semantic segmentation, ScanNet v2 also provides benchmarks for instance segmentation and scene type classification.

ScanNet200. ScanNet200 [12] was developed on the basis of ScanNet v2 to overcome its limited set of 20 class labels. It expands the number of classes to 200, an order of magnitude more than the previous version. This richer annotation enables a better capture and understanding of real-world indoor scenes with a more diverse range of objects. The benchmark also allows for a more comprehensive analysis of performance across different object class distributions by splitting the 200 classes into three sets: the “head” set comprises the 66 most frequent categories, the “common” set consists of 68 less frequent categories, and the “tail” set contains the remaining categories.

S3DIS. The Stanford large-scale 3D indoor spaces dataset (S3DIS) [11], acquired with the Matterport scanner, is another widely used dataset for point cloud segmentation. It comprises 272 room scenes divided into 6 distinct areas. Each point in a scene is assigned a semantic label from one of 13 pre-defined categories, such as wall, table, chair, and cabinet. The dataset is specifically curated for large-scale indoor semantic segmentation.

Cornell RGBD. This dataset [13] provides 52 labeled indoor scenes comprising point clouds with RGB values, consisting of 24 labeled office scenes and 28 labeled home scenes. The point clouds are generated from the original RGB-D images via the RGBD-SLAM method. The dataset contains approximately 550 views with 2495 labeled segments across 27 object classes, and has supported earlier research and development in indoor scene understanding.

Washington RGB-D dataset. This dataset [14] includes 14 indoor scene point clouds, which are obtained via RGB-D image registration and stitching. It provides annotations of 9 semantic category labels, such as sofas, teacups, and hats.

3 Geometry-based segmentation

Geometry-based solutions for understanding indoor scenes can be classified into clustering- or region-growing-based methods and model-fitting-based methods. Most of these methods rely on handcrafted features combined with heuristic geometric constraints. The intuition behind geometry-based methods is that man-made environments typically consist of regular geometric structures.

Clustering or region growing. These approaches assume that points in close proximity are more likely to belong to the same object or surface. By considering the geometric properties of neighboring points, such as spatial coordinates and surface normals, these methods identify regions that share similar properties. Mattausch et al. [15] proposed a method for segmenting indoor scenes by identifying repeated objects in multi-room indoor scanning data. They represented indoor scenes with a collection of nearly planar patches, which were clustered on the basis of a patch similarity matrix constructed from geometric shape descriptors. In this way, the method exploits the inherent structure of repeated objects to segment indoor scenes effectively. Hu et al. [16] partitioned point clouds into surface patches using a dynamic region growing method to generate an initial segmentation. By leveraging this intermediate representation, the model can better account for shape variations and enhance its ability to classify objects.
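
To make the idea concrete, the sketch below grows regions over a point cloud using spatial proximity and normal consistency. It is a minimal illustration rather than the method of [15] or [16]; the inputs `points` and `normals` and all thresholds are assumptions chosen for readability.

```python
import numpy as np
from scipy.spatial import cKDTree

def region_growing(points, normals, radius=0.05, angle_deg=15.0, min_region=50):
    """Group points whose neighbors are spatially close and have consistent normals.

    points:  (N, 3) array of xyz coordinates (assumed input)
    normals: (N, 3) array of unit surface normals (assumed input)
    """
    tree = cKDTree(points)
    cos_thresh = np.cos(np.deg2rad(angle_deg))
    labels = np.full(len(points), -1, dtype=int)  # -1 = unassigned
    region_id = 0
    for seed in range(len(points)):
        if labels[seed] != -1:
            continue
        labels[seed] = region_id
        queue, members = [seed], [seed]
        while queue:
            idx = queue.pop()
            for nb in tree.query_ball_point(points[idx], r=radius):
                if labels[nb] != -1:
                    continue
                # Grow only if the neighbor's normal is nearly parallel to the current point's.
                if abs(np.dot(normals[idx], normals[nb])) >= cos_thresh:
                    labels[nb] = region_id
                    members.append(nb)
                    queue.append(nb)
        if len(members) < min_region:
            labels[np.array(members)] = -1  # discard tiny regions as clutter
        else:
            region_id += 1
    return labels  # (N,) region label per point, -1 for unassigned points
```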

Model fitting. Model fitting has been proposed as a more efficient and robust strategy, particularly in scenarios where noise and outliers are present; Nan et al. [58] followed this strategy. MPRM [44] transforms the input points into a latent representation. This transformation, known as x-conv, is implemented using MLPs and allows traditional convolution, which is effective in capturing local and global patterns in regular data domains, to be applied.

GCN-based methods. Recent studies have explored the application of graph convolutional networks (GCNs) to point clouds, recognizing that points and their neighbors can form a graph structure [25, 46, 72]. The objective is to extract local geometric structure information while preserving permutation invariance. This is achieved by constructing a spatial or spectral adjacency graph using the features of vertices and their neighbors. DGCNN [46] employs MLPs to aggregate edge features, which consist of nodes and their spatial neighbors; the node features are then updated based on the edge features. RGCNN [45] considers the features of data points in a point cloud as graph signals and uses spectral graph convolution for point cloud classification and segmentation. The spectral graph convolution operation is defined using a Chebyshev polynomial approximation. Furthermore, the graph Laplacian matrix is updated at each layer of the network based on the learned features. This allows local structural information to be extracted while accounting for the unordered nature of the data. DGCNN and RGCNN thus demonstrate different ways of using GCNs: DGCNN focuses on edge feature aggregation and node feature updates, while RGCNN uses spectral graph convolution and updates the Laplacian matrix based on learned features. SPG [25] is a deep learning framework specifically designed for semantic segmentation of large-scale point clouds with millions of points. The framework introduces the superpoint graph (SPG), which effectively captures the inherent organization of 3D point clouds. By dividing the scanned scene into geometrically homogeneous elements, the SPG provides a compact representation that captures the contextual relationships between different object parts within the point cloud. Leveraging this rich representation, a GCN is employed to learn and infer semantic segmentation labels. The combination of the SPG structure and the GCN enables the capture of contextual relationships, resulting in accurate semantic segmentation of complex and voluminous point cloud data. PointWeb [52] designs an adaptive feature extraction module to model the interactions among densely connected neighbors. Unlike most point-based deep learning methods, PGCNet [60] incorporates geometric information as a prior and uses surface patches for data representation, based on the idea that man-made objects can be decomposed into a set of geometric primitives. The PGCNet framework first extracts surface patches from indoor scene point clouds using the region growing method. With surface patches and their geometric features as input, a GCN-based network is designed to explore patch features and contextual information. Specifically, a dynamic graph U-Net module, which employs dynamic edge convolution, aggregates hierarchical feature embeddings. Taking advantage of the surface patch representation, PGCNet achieves competitive semantic segmentation performance while requiring much less training.
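
As an illustration of the edge-feature aggregation described for DGCNN [46], the sketch below implements a single EdgeConv-style layer in PyTorch. The brute-force k-NN, layer sizes, and the single-cloud (unbatched) input are simplifying assumptions made for clarity, not the original implementation.

```python
import torch
import torch.nn as nn

class EdgeConv(nn.Module):
    """One EdgeConv-style layer: edge features [x_i, x_j - x_i] -> MLP -> max over neighbors."""

    def __init__(self, in_dim, out_dim, k=16):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.ReLU(inplace=True))

    def forward(self, x):                                   # x: (N, in_dim) point features
        dist = torch.cdist(x, x)                            # (N, N) pairwise distances (brute force)
        knn_idx = dist.topk(self.k + 1, largest=False).indices[:, 1:]  # (N, k), drop self
        neighbors = x[knn_idx]                              # (N, k, in_dim)
        center = x.unsqueeze(1).expand_as(neighbors)        # (N, k, in_dim)
        edge_feat = torch.cat([center, neighbors - center], dim=-1)    # (N, k, 2*in_dim)
        # Node features are updated by max-pooling the transformed edge features.
        return self.mlp(edge_feat).max(dim=1).values        # (N, out_dim)

# Example: lift 6-D inputs (xyz + rgb) of 1024 points to 64-D features.
features = EdgeConv(in_dim=6, out_dim=64)(torch.randn(1024, 6))
```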

Transformer-based methods. The Transformer technique has revolutionized natural language processing (NLP) and 2D vision [73]. The work in [77] introduced guided point contrastive learning, which improves feature representation in a semi-supervised network: augmented point clouds generated from the input point clouds are fed into an unsupervised branch for backbone training, and the backbone network, classifier, and projector are shared with the supervised branch to produce semantic scores. By incorporating self-supervised learning, Zhang et al. [71] utilized a Transformer-based module to directly predict instance masks. Semantic and geometric information is encoded into point features through a stacked Transformer decoder, which produces an instance heatmap indicating the similarities among points. Recently, SPFormer [86] was developed to directly predict instances in an end-to-end manner based on superpoint cross-attention, where superpoint features aggregated from the point cloud are used as input to the Transformer decoder.
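
The following sketch illustrates the query-based mask prediction pattern used by Transformer instance segmenters such as SPFormer [86]: learned instance queries cross-attend to superpoint features through a stacked decoder, and per-superpoint masks are read out from a query-feature dot product. The module names, dimensions, and the use of `nn.TransformerDecoderLayer` are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class QueryMaskDecoder(nn.Module):
    """Learned instance queries attend to superpoint features; masks come from a dot product."""

    def __init__(self, dim=256, num_queries=100, num_classes=20, num_layers=3):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))      # learned instance queries
        self.layers = nn.ModuleList(
            [nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
             for _ in range(num_layers)]
        )
        self.cls_head = nn.Linear(dim, num_classes + 1)                 # +1 for "no object"
        self.mask_proj = nn.Linear(dim, dim)

    def forward(self, sp_feat):                  # sp_feat: (S, dim) superpoint features
        q = self.queries.unsqueeze(0)            # (1, Q, dim)
        mem = sp_feat.unsqueeze(0)               # (1, S, dim)
        for layer in self.layers:                # cross-attention from queries to superpoints
            q = layer(q, mem)
        cls_logits = self.cls_head(q)[0]                     # (Q, num_classes + 1)
        mask_logits = self.mask_proj(q)[0] @ sp_feat.t()     # (Q, S) per-superpoint masks
        return cls_logits, mask_logits

# Example: predict class logits and superpoint masks for 1000 superpoints.
cls_logits, mask_logits = QueryMaskDecoder()(torch.randn(1000, 256))
```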

In recent years, detection-based methods, which perform instance segmentation via a separate detection step, have received less attention than detection-free approaches that aim for an end-to-end solution. Moreover, different backbones with varying levels of annotation have been explored. However, the accuracy of instance segmentation is still low, and the generality of existing methods lacks strong empirical evidence.

5.2.2 Weakly supervised instance segmentation

While fully supervised point cloud instance segmentation can suffer from performance degradation when dense annotations are unavailable, weakly supervised frameworks attempt to group points into object instances with only a small number of labels.

Liao et al. [87] proposed a semi-supervised framework for point cloud instance segmentation by using a bounding box for supervision. The input point clouds were decomposed into subsets by bounding box proposals. Semantic information and consistency constraints were used to produce final instance masks. Hou et al. [

6 Learning-based segmentation with multi-modality

Recent advances in foundation models of 2D vision and NLP have inspired the exploration of multi-modality methods in 3D models [12, 91-98]. For instance, Peng et al. [97] proposed a zero-shot approach that co-embeds point features with images and text. Rozenberszki et al. [12] presented a language-grounded method by learning a joint embedding space of point features and text features. Liu et al. [91] transferred knowledge from 2D to 3D for part segmentation. Wang et al. [92] trained a multi-modal model that learns from vision, language, and geometry to improve 3D semantic scene understanding. Xue et al. [93] introduced a unified representation of images, text, and 3D point clouds by aligning them during pre-training. Ding et al. [94] distilled knowledge from vision-language models for 3D scene understanding tasks. Zeng et al. [95] aligned 3D representations to open-world vocabularies via a cross-modal contrastive objective. Zhang et al. [98] performed text-scene paired semantic understanding with language-assisted learning. These methods utilize rich information from vision and text, enabling a more comprehensive representation of the indoor scene. However, they require high computational resources, and pre-training depends heavily on limited multi-modal datasets. How to facilitate and adapt multi-modalities with point clouds for better scene understanding is worth exploring.
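
As a minimal illustration of the zero-shot, open-vocabulary idea behind methods such as [95, 97], the sketch below labels each point by its most similar class prompt under cosine similarity, assuming per-point embeddings and text embeddings that already live in a shared space (e.g., distilled from a vision-language model). The inputs are hypothetical, and no specific model API is implied.

```python
import torch
import torch.nn.functional as F

def zero_shot_point_labels(point_feat, text_feat):
    """point_feat: (N, D) per-point embeddings; text_feat: (C, D) class-prompt embeddings."""
    p = F.normalize(point_feat, dim=-1)
    t = F.normalize(text_feat, dim=-1)
    sim = p @ t.t()                  # (N, C) cosine similarity between points and class prompts
    return sim.argmax(dim=-1)        # (N,) predicted class index per point
```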

7 Performance evaluation

7.1 Evaluation metrics

The widely adopted evaluation metrics for indoor point cloud semantic segmentation include overall accuracy (OA), mean intersection over union (mIoU), and mean accuracy (mAcc).
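
For reference, the sketch below computes these semantic segmentation metrics (OA, mAcc, and mIoU) from per-point predictions via a confusion matrix; it is an illustrative implementation, and classes absent from the ground truth are excluded from the means, following common practice.

```python
import numpy as np

def semantic_metrics(pred, gt, num_classes):
    """pred, gt: (N,) integer class labels per point."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt, pred), 1)                        # rows: ground truth, cols: prediction
    tp = np.diag(conf).astype(float)
    gt_count = conf.sum(axis=1)
    pred_count = conf.sum(axis=0)
    oa = tp.sum() / conf.sum()                            # overall accuracy
    acc = tp / np.maximum(gt_count, 1)                    # per-class accuracy (recall)
    iou = tp / np.maximum(gt_count + pred_count - tp, 1)  # per-class IoU
    valid = gt_count > 0                                  # ignore classes absent from the ground truth
    return oa, acc[valid].mean(), iou[valid].mean()       # OA, mAcc, mIoU
```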

The standard evaluation metric for indoor point cloud instance segmentation is mean average precision (mAP), averaged over IoU thresholds from 0.5 to 0.95. In particular, mAP@50 is the AP score at an IoU threshold of 0.5. Additionally, mean precision (mPrec) and mean recall (mRec) are frequently used criteria on the S3DIS dataset.

7.2 Results on public datasets

Semantic segmentation results. Tables 1 and 2 present indoor point cloud semantic segmentation results of different methods on S3DIS Area 5 and ScanNet v2, respectively. The state-of-the-art methods outperform the pioneering PointNet [41] by more than 20% mIoU. Transformer-based methods [48, 49, 65] have been dominant in recent years, following their great success in NLP and image understanding. Meanwhile, several weakly supervised methods show that semantic segmentation can be achieved with fewer annotations, reaching more than 65% mIoU on S3DIS Area 5 and 70% on ScanNet. These results are encouraging, although there is still a gap between fully supervised and weakly supervised approaches. It is desirable to further improve the ability to extract features from limited annotated data.

Table 1 Semantic segmentation results of different methods on S3DIS Area 5
Table 2 Semantic segmentation results of different methods on ScanNet v2 validation set and test set

Instance segmentation results. Tables 3 and 4 present indoor point cloud instance segmentation results of different methods on S3DIS Area 5 and ScanNet v2, respectively. Detection-free methods have received more attention than detection-based methods because they attempt to complete the instance segmentation task in an end-to-end manner. Several networks [31, 68, 70, 87, 89] have started to learn instance information from limited annotation. These results clearly show that there is still room for improvement in point cloud instance segmentation using weakly supervised learning.

Table 3 Instance segmentation results of different methods on S3DIS Area 5
Table 4 Instance segmentation results of different methods on ScanNet v2 validation set and test set

8 Discussion

Point cloud segmentation is a crucial task in 3D indoor scene understanding. With the availability of 3D datasets, deep learning-based segmentation methods have gained significant attention and have contributed to the progress in this field. However, obtaining accurate segmentation results often requires dense annotations, which is a laborious and costly process. In order to mitigate the reliance on extensive annotations and enable learning from limited labeled data, the research focus has shifted towards weakly supervised approaches in recent years. By exploring weakly supervised frameworks, researchers aim to achieve satisfactory segmentation results while minimizing the annotation efforts and costs involved. Despite the rapid development of point cloud segmentation, existing frameworks still face several challenges.

8.1 Datasets and representations

The size of annotated point cloud datasets is still limited compared to that of image datasets. Although acquiring point clouds has become affordable, annotating them remains time-consuming. Since both fully supervised learning and pre-training [99, 100] require large amounts of data, larger datasets with more diverse scenes are needed to advance learning-based point cloud segmentation. Efficient and user-friendly annotation methods for large datasets are therefore needed; these might be achieved by unsupervised approaches with geometric priors. Recently developed datasets, such as ScanNet200 [12], have drawn increasing attention to imbalanced learning [101] in point cloud segmentation.

Existing point cloud segmentation methods use different data formats, including point clouds, RGB-D images, voxels, and geometric primitives. Each format has its advantages and drawbacks for different 3D scene understanding tasks. With point-based networks, point clouds can now be processed directly for training and inference. However, not all points are needed for scene perception. Finding a better representation of indoor scene point cloud data is therefore still a promising research direction.

8.2 Data efficiency and multi-modality

Data-efficient learning frameworks are highly desirable because they alleviate the burden of collecting extensive dense annotations for training. Although current weakly supervised point cloud segmentation methods can achieve performance competitive with fully supervised learning, gaps remain. More importantly, the generality and robustness of these data-efficient methods are not yet convincing, as they are mainly evaluated on public datasets of limited size rather than on open-world scenes. Further exploration of generalist models is therefore a trend for the future.

One promising route is to integrate other modalities, such as images and natural language. Previous works [37, 102, 103] have explored the combination of 2D images and 3D point clouds for better scene understanding. Recent developments in foundation models for 2D vision and NLP have inspired the investigation of multi-modality in 3D data [12, 91-98]. While these methods achieve impressive results on various 3D tasks, adapting knowledge from other modalities to indoor point cloud segmentation remains challenging. In addition, collecting adequate multi-modal pre-training data can be costly. How to facilitate and adapt multi-modalities with point clouds for a better understanding of indoor scenes is worth exploring.

8.3 Learning methods for dynamic scene segmentation

Current learning-based indoor point cloud segmentation methods are mostly designed for static scenes. In real-world scenarios, however, indoor objects can be moved around, which static-scene methods cannot capture. Moreover, annotating such dynamic scenes is even more costly than annotating static 3D point clouds. 4D representation learning has become the core of dynamic feature exploitation. Recent work [104, 105] has explored 4D feature extraction and distillation to improve downstream tasks such as scene segmentation. Transferring such information to indoor scenes of varying scales is still challenging. The development of learning methods for dynamic scene segmentation is an interesting prospect for further investigation.

9 Conclusion

Point cloud segmentation plays a key role in 3D vision and intelligence. This paper aims to provide a concise overview of point cloud segmentation techniques for understanding 3D indoor scenes. First, we present public 3D point cloud datasets, which are the foundation of point cloud segmentation research, especially for deep learning-based methods. Second, we review representative approaches for indoor scene point cloud segmentation, including geometry-based and learning-based methods. Geometry-based approaches extract geometric information and can be combined with learning-based methods. Learning-based methods can be divided into structured data-based and point-based methods. We mainly consider point-based semantic and instance segmentation frameworks, including fully supervised networks and weakly supervised networks. Finally, we discuss the open problems in the field and outline future research directions. We expect that this review can provide insights into the field of indoor scene point cloud segmentation and stimulate new research.