PointNet and related works

A recent machine-learning course of mine had a paper-reading assignment, and I picked VoteNet, my old stomping ground, only to find that some of the implementation details still eluded me. Taking this opportunity, I am going back over the ideas starting from PointNet and walking through the source code. Explainer video

PointNet

PointNet is without question a milestone in 3D point cloud perception. It preserves the elegance of simplicity while also providing a rigorous proof; it is a work to be admired from every angle.

As the original paper notes, a set of n points in space has the following properties:

  • Unordered (the output should be invariant to the ordering of the n input points)

  • Points interact with their neighbors, so the network we design must be able to extract local features from neighboring points

  • Invariance under rotation and translation (though neither PointNet nor PointNet++ handles rotation particularly well)

PointNet therefore uses a max-pooling layer as the symmetric function that resolves the unordered nature of point clouds.
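As a quick illustration (my own toy example, not the official code), a shared MLP followed by max-pooling produces exactly the same global feature no matter how the input points are ordered:

import torch
import torch.nn as nn

mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 1024))  # shared per-point MLP

points = torch.randn(16, 3)            # 16 points with xyz coordinates
shuffled = points[torch.randperm(16)]  # the same points in a different order

feat1 = mlp(points).max(dim=0)[0]      # per-point features, then max over the point axis
feat2 = mlp(shuffled).max(dim=0)[0]
print(torch.allclose(feat1, feat2))    # True: the global feature is order-invariant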

The PointNet proof

Let $\chi=\{S : S\subseteq[0,1]^m \text{ and } |S|=n\}$. Here $S$ is our input, one point-set instance in Euclidean space; the coordinates are normalized, so each element of $S$ is an m-dimensional vector inside the unit cube. $\chi$ is then the collection of all such point sets. We want to approximate a continuous function defined on sets, $f:\chi\rightarrow\mathbb{R}$; note that $f$ maps an entire point set to a single real number.

Since this function is continuous, on point sets $S,S'\in\chi$ it satisfies:
$$
\forall\varepsilon>0,\ \exists\delta>0:\quad \textbf{if } d_H(S,S')<\delta \textbf{ then } |f(S)-f(S')|<\varepsilon
$$
Here $d_H$ is a distance between sets (the Hausdorff distance); the statement above is simply the ordinary notion of function continuity generalized to sets.
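For reference, the Hausdorff distance between two point sets is the usual two-sided nearest-neighbor bound:
$$
d_H(S,S')=\max\left\{\,\sup_{x\in S}\inf_{y\in S'}\lVert x-y\rVert,\ \sup_{y\in S'}\inf_{x\in S}\lVert x-y\rVert\right\}
$$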

The function that PointNet actually fits has the form $\gamma(\mathop{MAX}_{x_i\in S}\{h(x_i)\})$. Here $S=\{x_1,x_2,\ldots,x_n\}$ with $x_i\in \mathbb{R}^N$; for plain spatial coordinates, $N=3$. $h(x_i)$ is the shared MLP applied to each of the n points, whose outputs are max-pooled into a $1\times1024$ global feature, and $\gamma$ is the final MLP that processes this global feature.

What we need to prove is that
$$
\forall\varepsilon>0,\ \exists\, h,\gamma \text{ such that } \left|f(S)-\gamma\!\left(\mathop{MAX}_{x_i\in S}\{h(x_i)\}\right)\right|<\varepsilon
$$

A relatively intuitive analysis

We can loosely think of $h(\cdot)$ as mapping $x_i$ into one cell of a spatial grid of size $M\times M\times M$. Since each point falls into exactly one cell, $h(x_i)$ is a $1\times M^3$ vector with a single entry equal to 1 and all other entries 0.

The MAX function can then be viewed as reconstructing the input point cloud on this grid. If the grid is made dense enough (M large enough) that every cell contains at most one of the original points, then after max-pooling we obtain a $1\times M^3$ vector with n entries equal to 1 and the rest 0.

Because the grid density can be made arbitrarily high, the grid model approximates the original point set $S$ to arbitrary precision. The subsequent $\gamma$ is then just an MLP applied to this new $1\times M^3$ representation (the global feature). Since an MLP can approximate any function, it can in particular approximate $f$, which completes the argument. This sketch follows the 深蓝学院 point cloud course.
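To make the argument concrete, here is a toy NumPy illustration (my own sketch, not from the paper) of $h$ as a one-hot voxel indicator and MAX as the union of occupied voxels:

import numpy as np

M = 8                                        # grid resolution per axis
points = np.random.rand(100, 3)              # points in [0, 1]^3

voxel_ids = np.minimum((points * M).astype(int), M - 1)   # [100, 3] integer cell coordinates
flat_ids = np.ravel_multi_index(voxel_ids.T, (M, M, M))   # [100] indices in [0, M^3)

h = np.zeros((100, M ** 3))                  # h(x_i): one-hot voxel indicator per point
h[np.arange(100), flat_ids] = 1.0

global_feature = h.max(axis=0)               # max-pooling = an M^3 occupancy vector
print(int(global_feature.sum()), "occupied voxels approximate the point set")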

The purely mathematical analysis in the paper

See the write-ups on CSDN and 知乎.

PointNet++

PointNet uses a single max-pooling layer to aggregate global information, whereas PointNet++ uses a hierarchical structure to extract features layer by layer, abstracting progressively larger local regions at each level. In PointNet++ this is implemented mainly by the set abstraction layer, which consists of a sampling layer (farthest point sampling), a grouping layer, and a PointNet layer.

Sampling Layer

This part is simply farthest point sampling. The input has shape $B\times N \times(C+D)$, where usually $C=3$ and $D$ is the feature dimension, and we know the current layer should sample down to $\text{npoints}$ points. So the layer maps $B\times N \times (C+D)\rightarrow B\times\text{npoints}\times(C+D)$, as sketched below.
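Here is a minimal pure-PyTorch sketch of farthest point sampling; the repo's CUDA kernel is functionally equivalent but much faster. The helper name farthest_point_sample_py is mine.

import torch

def farthest_point_sample_py(xyz, npoint):
    """xyz: [B, N, 3] -> centroid indices [B, npoint]."""
    B, N, _ = xyz.shape
    centroids = torch.zeros(B, npoint, dtype=torch.long)
    distance = torch.full((B, N), 1e10)        # distance to the nearest chosen centroid so far
    farthest = torch.randint(0, N, (B,))       # start from a random point per batch
    batch_indices = torch.arange(B)
    for i in range(npoint):
        centroids[:, i] = farthest             # record the current farthest point
        centroid = xyz[batch_indices, farthest, :].view(B, 1, 3)
        dist = torch.sum((xyz - centroid) ** 2, dim=-1)
        distance = torch.min(distance, dist)   # update nearest-centroid distances
        farthest = torch.max(distance, dim=-1)[1]  # next centroid: farthest from all chosen ones
    return centroids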

Grouping Layer

The goal of this layer is to find each centroid's neighbors: gather at most $\text{nsample}$ points from the ball of radius $r$ around it.

PointNet Layer

This is just the original PointNet applied to extract features from each group.

The MSG refinement of the Set Abstraction Layer

The original set abstraction layer groups at a single fixed radius, so its receptive field has a fixed size. MSG instead extracts features at several different radii and concatenates the results.

Point Feature Propagation

In the set abstraction layers the original point set is downsampled. For point segmentation, however, we need a class label for every point, so we want a feature for each of the original points. One option is to always sample every point as a centroid in each set abstraction layer, but that is computationally very expensive; the alternative is point feature propagation.

A feature propagation layer takes input of shape $N_i\times(d+C)$ and produces output of shape $N_{i-1}\times(d+C)$, where $N_i$ is the point count at the i-th set abstraction layer. It is effectively a decoder: at each level, the upsampled global features are concatenated with the skip-linked local features from the corresponding set abstraction layer. A sketch of the upsampling step follows.
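The official implementation upsamples by interpolating with an inverse-distance-weighted average over the three nearest source points, then concatenates the skip features and applies a shared MLP. A simplified sketch of just the interpolation step (helper name mine):

import torch

def interpolate_features(xyz1, xyz2, features2, k=3):
    """xyz1: [B, N_{i-1}, 3] targets, xyz2: [B, N_i, 3] sources, features2: [B, N_i, C].
    Each target point receives an inverse-distance-weighted average of its
    k nearest source points' features."""
    dists = torch.cdist(xyz1, xyz2)                      # [B, N_{i-1}, N_i] pairwise distances
    dists, idx = dists.sort(dim=-1)
    dists, idx = dists[:, :, :k], idx[:, :, :k]          # k nearest sources per target
    weight = 1.0 / (dists + 1e-8)
    weight = weight / weight.sum(dim=-1, keepdim=True)   # normalize the interpolation weights
    B, N1, _ = xyz1.shape
    gathered = torch.gather(
        features2.unsqueeze(1).expand(-1, N1, -1, -1), 2,
        idx.unsqueeze(-1).expand(-1, -1, -1, features2.shape[-1]))  # [B, N_{i-1}, k, C]
    return (gathered * weight.unsqueeze(-1)).sum(dim=2)  # [B, N_{i-1}, C]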

Code

Next, let us work through the PointNet++ source code. PointNet++ ships code for three tasks: classification, part segmentation, and semantic segmentation, and the set abstraction layer comes in SSG (single-scale grouping) and MSG (multi-scale grouping) variants. There are two common PyTorch implementations: a pure-Python one and one whose utility functions are implemented in CUDA. Because the original open-source release was written in TensorFlow, both PyTorch versions borrow its variable naming, and almost none of it matches the parameter names in the paper. There are also places that break software abstraction: PointNetSetAbstractionMsg inlines logic similar to sample_and_group without reusing the code, and sample_and_group_all could be reduced to sample_and_group but is written as a separate function, among other issues. The code is therefore fairly painful to read, but once we have fully understood it we can treat PointNet++ as an off-the-shelf component and never worry about its internals again.

The implementation of sample_and_group

def sample_and_group(npoint, radius, nsample, xyz, features, returnfps=False):
    """
    Input:
        npoint: N_{i+1}, the number of sampled centroids
        radius: ball-query radius
        nsample: maximum number of neighborhood points to consider
        xyz: input points position data, [B, N, 3]
        features: input points feature data, [B, N, C]
    Return:
        new_xyz: sampled points position data, [B, npoint, 3]
        new_features: sampled points feature data, [B, npoint, nsample, 3+C]
    """
    B, N, d = xyz.shape  # d = 3
    fps_idx = farthest_point_sample(xyz, npoint)
    # Farthest point sampling: choose N_{i+1} centroids out of the N_i input points
    new_xyz = index_points(xyz, fps_idx)  # [B, N_i, 3] -> [B, N_{i+1}, 3]
    idx = query_ball_point(radius, nsample, xyz, new_xyz)
    grouped_xyz = index_points(xyz, idx)  # [B, N_{i+1}, nsample, 3]
    grouped_xyz_norm = grouped_xyz - new_xyz.view(B, npoint, 1, d)
    # new_xyz is first reshaped to [B, N_{i+1}, 1, d]; broadcasting then expands the
    # subtraction to [B, N_{i+1}, nsample, 3].
    # This converts every grouped point to coordinates relative to its centroid.

    if features is not None:
        grouped_features = index_points(features, idx)  # [B, N_{i+1}, nsample, C_i]
        new_features = torch.cat([grouped_xyz_norm, grouped_features], dim=-1)
        # [B, N_{i+1}, nsample, 3+C_i]
    else:
        new_features = grouped_xyz_norm
    if returnfps:
        return new_xyz, new_features, grouped_xyz, fps_idx
    else:
        return new_xyz, new_features
    # new_xyz = [B, N_{i+1}, 3]
    # new_features = [B, N_{i+1}, K_{i+1}, 3+C_i]

The implementation of query_ball_point

def query_ball_point(radius, K, xyz, new_xyz):
    """
    Input:
        radius: local region radius
        K: max sample number in the local region; note the neighbors are found among the original N_i points
        xyz: all points, [B, N_i, 3]
        new_xyz: query points, [B, N_{i+1}, 3]
    Return:
        group_idx: grouped points index, [B, N_{i+1}, K]
    """
    device = xyz.device
    B, N, C = xyz.shape
    _, npoint, _ = new_xyz.shape
    group_idx = torch.arange(N, dtype=torch.long).to(device).view(1, 1, N).repeat([B, npoint, 1])
    # Every centroid gets the index vector [0, 1, 2, ..., N-1]; an entry stays equal to
    # its own index if the point is inside the centroid's neighborhood, and will be
    # overwritten with N if it is not.
    # At this stage group_idx has shape [B, N_{i+1}, N_i].
    sqrdists = square_distance(new_xyz, xyz)  # [B, N_{i+1}, N_i], squared distance between each centroid and each point
    group_idx[sqrdists > radius ** 2] = N  # indices outside the radius are set to N
    group_idx = group_idx.sort(dim=-1)[0][:, :, :K]  # sort indices ascending, e.g. [1, 3, 4, 5, ..., N, N, N]
    # After sorting, keep the first K entries: [B, N_{i+1}, K], i.e. K neighbor indices per centroid.
    group_first = group_idx[:, :, 0].view(B, npoint, 1).repeat([1, 1, K])
    mask = group_idx == N
    group_idx[mask] = group_first[mask]
    # If a centroid has fewer than K neighbors within the radius, the padding slots
    # are all replaced with the index of its first neighbor.
    return group_idx
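A quick shape check of the two helpers (hypothetical usage, assuming farthest_point_sample, index_points, and square_distance are in scope):

import torch

xyz = torch.rand(2, 1024, 3)  # B = 2 batches of N_i = 1024 points
new_xyz, new_features = sample_and_group(npoint=512, radius=0.2, nsample=32,
                                         xyz=xyz, features=None)
print(new_xyz.shape)       # torch.Size([2, 512, 3])
print(new_features.shape)  # torch.Size([2, 512, 32, 3]): relative coordinates only, since features=None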

Classification Task

This part corresponds to the classification architecture figure in the paper:

Note that the implementation uses $N_1=512$, $N_2=128$, and $d=3$ for the Euclidean coordinates of the points; if the input point cloud carries normal vectors, $C=3$, otherwise $C=0$. The intermediate layers use $C_1=128$, $C_2=256$, $C_3=1024$, and $k=\text{num_class}$; on the ModelNet40 dataset, $k=\text{num_class}=40$.

class get_model(nn.Module):
    def __init__(self, num_class, normal_channel=True):
        super(get_model, self).__init__()
        in_channel = 6 if normal_channel else 3
        self.normal_channel = normal_channel
        self.sa1 = PointNetSetAbstraction(npoint=512, radius=0.2, K=32, in_channel=in_channel, mlp=[64, 64, 128], group_all=False)
        self.sa2 = PointNetSetAbstraction(npoint=128, radius=0.4, K=64, in_channel=128 + 3, mlp=[128, 128, 256], group_all=False)
        self.sa3 = PointNetSetAbstraction(npoint=None, radius=None, K=None, in_channel=256 + 3, mlp=[256, 512, 1024], group_all=True)
        self.fc1 = nn.Linear(1024, 512)
        self.bn1 = nn.BatchNorm1d(512)
        self.drop1 = nn.Dropout(0.4)
        self.fc2 = nn.Linear(512, 256)
        self.bn2 = nn.BatchNorm1d(256)
        self.drop2 = nn.Dropout(0.4)
        self.fc3 = nn.Linear(256, num_class)

    def forward(self, xyz):  # xyz = [B, 3 + C (0 or 3), N_0]
        B, _, _ = xyz.shape
        if self.normal_channel:
            norm = xyz[:, 3:, :]  # norm = [B, 3, N_0] (the normal vectors, when used)
            xyz = xyz[:, :3, :]   # xyz = [B, 3, N_0]
        else:
            norm = None
        l1_xyz, l1_features = self.sa1(xyz, norm)
        # l1_xyz = [B, 3, 512], l1_features = [B, 128, 512]
        l2_xyz, l2_features = self.sa2(l1_xyz, l1_features)
        # l2_xyz = [B, 3, 128], l2_features = [B, 256, 128]
        l3_xyz, l3_features = self.sa3(l2_xyz, l2_features)
        # l3_xyz = [B, 3, 1], l3_features = [B, 1024, 1]
        # If we only want PointNet++ as a feature extractor, we can stop here.
        x = l3_features.view(B, 1024)
        x = self.drop1(F.relu(self.bn1(self.fc1(x))))  # [B, 1024] -> [B, 512]
        x = self.drop2(F.relu(self.bn2(self.fc2(x))))  # [B, 512] -> [B, 256]
        x = self.fc3(x)  # [B, 256] -> [B, 40]
        x = F.log_softmax(x, -1)
        return x, l3_features

Next we look at the set abstraction module itself:

class PointNetSetAbstraction(nn.Module):
    def __init__(self, npoint, radius, K, in_channel, mlp, group_all):
        super(PointNetSetAbstraction, self).__init__()
        self.npoint = npoint
        self.radius = radius
        self.K = K
        self.mlp_convs = nn.ModuleList()
        self.mlp_bns = nn.ModuleList()
        last_channel = in_channel
        for out_channel in mlp:
            self.mlp_convs.append(nn.Conv2d(last_channel, out_channel, 1))  # 1x1 convolution kernel
            self.mlp_bns.append(nn.BatchNorm2d(out_channel))
            last_channel = out_channel
        # For sa1 this builds 3 MLP layers, each with a 1x1 kernel:
        # layer 1: [B, 3+C_i, K_{i+1}, N_{i+1}] -> [B, 64, K_{i+1}, N_{i+1}]
        # layer 2: [B, 64, K_{i+1}, N_{i+1}] -> [B, 64, K_{i+1}, N_{i+1}]
        # layer 3: [B, 64, K_{i+1}, N_{i+1}] -> [B, 128, K_{i+1}, N_{i+1}]

        self.group_all = group_all

    def forward(self, xyz, features):
        """
        Input:
            xyz: input points position data, [B, 3, N_i]
            features: input points feature data, [B, C_i, N_i]
        Return:
            new_xyz: sampled points position data, [B, 3, N_{i+1}]
            new_features_concat: sampled points feature data, [B, C_{i+1}, N_{i+1}]
        """
        xyz = xyz.permute(0, 2, 1)
        if features is not None:
            features = features.permute(0, 2, 1)

        if self.group_all:
            new_xyz, new_features = sample_and_group_all(xyz, features)
            # new_xyz = [B, 1, 3]
            # new_features = [B, 1, N_i, 3+C_i]
        else:
            new_xyz, new_features = sample_and_group(self.npoint, self.radius, self.K, xyz, features)
            # new_xyz: sampled points position data, [B, N_{i+1}, 3]
            # new_features: sampled points position and feature data, [B, N_{i+1}, K_{i+1}, 3+C_i]
        new_features_concat = new_features.permute(0, 3, 2, 1)  # [B, 3+C_i, K_{i+1}, N_{i+1}]
        for i, conv in enumerate(self.mlp_convs):
            bn = self.mlp_bns[i]
            new_features_concat = F.relu(bn(conv(new_features_concat)))
            # [B, C_{i+1}, K_{i+1}, N_{i+1}]

        new_features_concat = torch.max(new_features_concat, 2)[0]
        # [B, C_{i+1}, K_{i+1}, N_{i+1}] -> [B, C_{i+1}, N_{i+1}]
        # Take the maximum response over each neighborhood. Because each layer groups
        # with a different radius, the receptive field differs layer by layer, giving
        # feature extraction at multiple scales.
        new_xyz = new_xyz.permute(0, 2, 1)  # [B, N_{i+1}, 3] -> [B, 3, N_{i+1}]
        return new_xyz, new_features_concat

For a set abstraction layer, the input is $B\times N_{i} \times (3+C_i)$ and the output is $B\times N_{i+1}\times(3+C_{i+1})$.

In principle the classification task is now complete; this is the SSG (single-scale grouping) case. We still need to see how the MSG set abstraction module proposed in the paper is implemented.

Looking at the models, the overall architecture is unchanged; the only difference is that the two sa layers are replaced with PointNetSetAbstractionMsg.

self.sa1 = PointNetSetAbstractionMsg(npoint=512, radius_list=[0.1, 0.2, 0.4], K_list=[16, 32, 128], in_channel=in_channel, mlp_list=[[32, 32, 64], [64, 64, 128], [64, 96, 128]])
self.sa2 = PointNetSetAbstractionMsg(npoint=128, radius_list=[0.2, 0.4, 0.8], K_list=[32, 64, 128], in_channel=320, mlp_list=[[64, 64, 128], [128, 128, 256], [128, 128, 256]])
self.sa3 = PointNetSetAbstraction(npoint=None, radius=None, K=None, in_channel=640 + 3, mlp=[256, 512, 1024], group_all=True)

All we really need to understand is where each in_channel above comes from. Take sa2's in_channel as an example:
$$
\text{in_channel}=320=\sum_j\text{out_channel}_j=64+128+128
$$
So the features extracted at the different radii are simply concatenated into a 320-dimensional feature vector.

class PointNetSetAbstractionMsg(nn.Module):
    def __init__(self, npoint, radius_list, K_list, in_channel, mlp_list):
        super(PointNetSetAbstractionMsg, self).__init__()
        self.npoint = npoint
        self.radius_list = radius_list
        self.nsample_list = K_list
        self.conv_blocks = nn.ModuleList()
        self.bn_blocks = nn.ModuleList()
        for i in range(len(mlp_list)):
            convs = nn.ModuleList()
            bns = nn.ModuleList()
            last_channel = in_channel + 3
            for out_channel in mlp_list[i]:
                convs.append(nn.Conv2d(last_channel, out_channel, 1))
                bns.append(nn.BatchNorm2d(out_channel))
                last_channel = out_channel
            self.conv_blocks.append(convs)
            self.bn_blocks.append(bns)

    def forward(self, xyz, points):
        """
        Input:
            xyz: input points position data, [B, 3, N_i]
            points: input points data, [B, C_i, N_i]
        Return:
            new_xyz: sampled points position data, [B, 3, N_{i+1}]
            new_points_concat: sample points feature data, [B, C_{i+1}, N_{i+1}]
        """
        xyz = xyz.permute(0, 2, 1)  # [B, N_i, 3]
        if points is not None:
            points = points.permute(0, 2, 1)  # [B, N_i, C_i]; C_i is each point's extra feature vector

        B, N, C = xyz.shape
        new_xyz = index_points(xyz, farthest_point_sample(xyz, self.npoint))  # [B, N_{i+1}, 3]
        new_points_list = []
        for j, radius in enumerate(self.radius_list):  # run a ball query at each scale in radius_list
            K = self.nsample_list[j]  # number of neighbors per centroid at this radius

            # ============== from here on this repeats the logic of sample_and_group ==============
            group_idx = query_ball_point(radius, K, xyz, new_xyz)
            grouped_xyz = index_points(xyz, group_idx)  # [B, N_{i+1}, K, 3]
            grouped_xyz -= new_xyz.view(B, self.npoint, 1, C)  # convert to coordinates relative to the centroid
            if points is not None:
                grouped_points = index_points(points, group_idx)
                grouped_points = torch.cat([grouped_points, grouped_xyz], dim=-1)
            else:
                grouped_points = grouped_xyz
            # ======================================================================================

            grouped_points = grouped_points.permute(0, 3, 2, 1)
            # [B, N_{i+1}, K_j, 3+D] -> [B, 3+D, K_j, N_{i+1}]
            for k in range(len(self.conv_blocks[j])):
                # mlp_list is a 2D list: row j holds the channel sizes used at radius j
                conv = self.conv_blocks[j][k]
                bn = self.bn_blocks[j][k]
                grouped_points = F.relu(bn(conv(grouped_points)))
                # [B, out_channel_j, K_j, N_{i+1}]
            new_points = torch.max(grouped_points, 2)[0]  # [B, out_channel_j, N_{i+1}]
            new_points_list.append(new_points)
        # new_points_list now holds len(radius_list) tensors of shape [B, out_channel_j, N_{i+1}]

        new_xyz = new_xyz.permute(0, 2, 1)
        new_points_concat = torch.cat(new_points_list, dim=1)  # [B, sum_j(out_channel_j), N_{i+1}]
        return new_xyz, new_points_concat

Q: How do gradients flow back through PointNet++?

A: FPS in PointNet++ does not actually participate in gradient computation or backpropagation.

You can think of it as PointNet++ downsampling the point cloud with FPS at different scales, preparing that data ahead of time, and then feeding it into the network for training.
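A tiny check of this claim (my own illustration, reusing the farthest_point_sample_py sketch from the sampling-layer section): the FPS output is an integer index tensor with no grad_fn, so no gradient flows through the selection itself; gradients only reach the gathered coordinates and features.

import torch

xyz = torch.rand(1, 64, 3, requires_grad=True)
idx = farthest_point_sample_py(xyz.detach(), 16)  # LongTensor [1, 16]; the indices are not differentiable
sampled = xyz[0, idx[0]]                          # gathering values IS differentiable w.r.t. xyz
sampled.sum().backward()
print(xyz.grad.abs().sum() > 0)                   # tensor(True): gradients reach the selected points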

VoteNet

VoteNet is an end-to-end 3D object detection network built on deep point cloud networks and Hough voting.

Related material is available in the explainer video and the slides (PPT).

votenet.py

class VoteNet(nn.Module):
    r"""
    A deep neural network for 3D object detection with end-to-end optimizable hough voting.
    Parameters
    ----------
    num_class: int
        Number of semantics classes to predict over -- size of softmax classifier
    num_heading_bin: int
    num_size_cluster: int
    input_feature_dim: (default: 0)
        Input dim in the feature descriptor for each point. If the point cloud is Nx9, this
        value should be 6 as in an Nx9 point cloud, 3 of the channels are xyz, and 6 are feature descriptors
    num_proposal: int (default: 128)
        Number of proposals/detections generated from the network. Each proposal is a 3D OBB with a semantic class.
    vote_factor: (default: 1)
        Number of votes generated from each seed point.
    """

    def __init__(self, num_class, num_heading_bin, num_size_cluster, mean_size_arr,
                 input_feature_dim=0, num_proposal=128, vote_factor=1, sampling='vote_fps'):
        super().__init__()
        ...
        # omit variable init
        ...
        # Backbone point feature learning
        self.backbone_net = Pointnet2Backbone(input_feature_dim=self.input_feature_dim)
        # Hough voting
        self.vgen = VotingModule(self.vote_factor, 256)
        # Vote aggregation and detection
        self.pnet = ProposalModule(num_class, num_heading_bin, num_size_cluster,
                                   mean_size_arr, num_proposal, sampling)

    def forward(self, inputs):
        """ Forward pass of the network
        Args:
            inputs: dict
                {point_clouds}
                point_clouds: Variable(torch.cuda.FloatTensor)
                    (B, N, 3 + input_channels) tensor
                    Point cloud to run predicts on
                    Each point in the point-cloud MUST
                    be formated as (x, y, z, features...)
        Returns:
            end_points: dict
        """
        end_points = {}
        batch_size = inputs['point_clouds'].shape[0]

        end_points = self.backbone_net(inputs['point_clouds'], end_points)

        # --------- HOUGH VOTING ---------
        xyz = end_points['fp2_xyz']  # (B, M, 3)
        features = end_points['fp2_features']  # (B, 256, M)
        end_points['seed_inds'] = end_points['fp2_inds']
        end_points['seed_xyz'] = xyz  # (batch_size, num_seed, 3)
        end_points['seed_features'] = features  # (batch_size, 256, num_seed)

        xyz, features = self.vgen(xyz, features)
        # xyz : (batch_size, num_vote, 3)
        # features : (batch_size, out_dim, num_vote)
        features_norm = torch.norm(features, p=2, dim=1)
        # features_norm : (batch_size, num_vote)
        features = features.div(features_norm.unsqueeze(1))
        # features : (batch_size, out_dim, num_vote); the vote features are now L2-normalized
        end_points['vote_xyz'] = xyz
        end_points['vote_features'] = features

        end_points = self.pnet(xyz, features, end_points)

        return end_points

voting_module.py

class VotingModule(nn.Module):
    def __init__(self, vote_factor, seed_feature_dim):
        """ Votes generation from seed point features.
        Args:
            vote_factor: int
                number of votes generated from each seed point
            seed_feature_dim: int
                number of channels of seed point features
            vote_feature_dim: int
                number of channels of vote features
        """
        super().__init__()
        self.vote_factor = vote_factor
        self.in_dim = seed_feature_dim
        self.out_dim = self.in_dim  # due to residual feature, in_dim has to be == out_dim
        self.conv1 = torch.nn.Conv1d(self.in_dim, self.in_dim, 1)
        self.conv2 = torch.nn.Conv1d(self.in_dim, self.in_dim, 1)
        self.conv3 = torch.nn.Conv1d(self.in_dim, (3 + self.out_dim) * self.vote_factor, 1)
        self.bn1 = torch.nn.BatchNorm1d(self.in_dim)
        self.bn2 = torch.nn.BatchNorm1d(self.in_dim)

    def forward(self, seed_xyz, seed_features):
        """ Forward pass.
        Arguments:
            seed_xyz: (batch_size, num_seed, 3)
            seed_features: (batch_size, feature_dim, num_seed)
        Returns:
            vote_xyz: (batch_size, num_seed*vote_factor, 3)
            vote_features: (batch_size, vote_feature_dim, num_seed*vote_factor)
        """
        batch_size = seed_xyz.shape[0]
        num_seed = seed_xyz.shape[1]
        num_vote = num_seed * self.vote_factor
        net = F.relu(self.bn1(self.conv1(seed_features)))
        net = F.relu(self.bn2(self.conv2(net)))
        net = self.conv3(net)
        # (batch_size, feature_dim, num_seed) => (batch_size, (3+out_dim)*vote_factor, num_seed)

        net = net.transpose(2, 1).view(batch_size, num_seed, self.vote_factor, 3 + self.out_dim)
        # (batch_size, num_seed, vote_factor, 3+self.out_dim)
        # For each seed, generate vote_factor votes, each with 3+self.out_dim channels (3+256)
        offset = net[:, :, :, 0:3]
        # (batch_size, num_seed, vote_factor, 3)
        vote_xyz = seed_xyz.unsqueeze(2) + offset
        # (batch_size, num_seed, 1, 3) => (batch_size, num_seed, vote_factor, 3)
        vote_xyz = vote_xyz.contiguous().view(batch_size, num_vote, 3)
        # (batch_size, num_vote, 3)

        residual_features = net[:, :, :, 3:]
        # seed_features: (batch_size, feature_dim, num_seed)
        vote_features = seed_features.transpose(2, 1).unsqueeze(2) + residual_features
        # (batch_size, num_seed, feature_dim) => (batch_size, num_seed, vote_factor, out_dim)
        vote_features = vote_features.contiguous().view(batch_size, num_vote, self.out_dim)
        vote_features = vote_features.transpose(2, 1).contiguous()
        # (batch_size, out_dim, num_vote)

        return vote_xyz, vote_features

proposal_module.py

class ProposalModule(nn.Module):
    def __init__(self, num_class, num_heading_bin, num_size_cluster, mean_size_arr, num_proposal, sampling, seed_feat_dim=256):
        super().__init__()
        ...
        # omit variable init
        ...

        # Vote clustering
        self.vote_aggregation = PointnetSAModuleVotes(
            npoint=self.num_proposal,
            radius=0.3,
            nsample=16,
            mlp=[self.seed_feat_dim, 128, 128, 128],
            use_xyz=True,
            normalize_xyz=True
        )

        # Object proposal/detection
        # Objectness scores (2), center residual (3),
        # heading class+residual (num_heading_bin*2), size class+residual(num_size_cluster*4)
        self.conv1 = torch.nn.Conv1d(128, 128, 1)
        self.conv2 = torch.nn.Conv1d(128, 128, 1)
        self.conv3 = torch.nn.Conv1d(128, 2 + 3 + num_heading_bin * 2 + num_size_cluster * 4 + self.num_class, 1)
        self.bn1 = torch.nn.BatchNorm1d(128)
        self.bn2 = torch.nn.BatchNorm1d(128)

    def forward(self, xyz, features, end_points):
        """
        Args:
            xyz: (batch_size, num_vote, 3)
            features: (batch_size, out_dim, num_vote)
        Returns:
            scores: (batch_size, num_proposal, 2 + 3 + NH * 2 + NS * 4)
        """
        if self.sampling == 'vote_fps':
            # Farthest point sampling (FPS) on votes
            xyz, features, fps_inds = self.vote_aggregation(xyz, features)
            sample_inds = fps_inds
        elif self.sampling == 'seed_fps':
            # FPS on seed and choose the votes corresponding to the seeds
            # This gets us a slightly better coverage of *object* votes than vote_fps (which tends to get more cluster votes)
            sample_inds = pointnet2_utils.furthest_point_sample(end_points['seed_xyz'], self.num_proposal)
            xyz, features, _ = self.vote_aggregation(xyz, features, sample_inds)
        elif self.sampling == 'random':
            # Random sampling from the votes
            num_seed = end_points['seed_xyz'].shape[1]
            batch_size = end_points['seed_xyz'].shape[0]
            sample_inds = torch.randint(0, num_seed, (batch_size, self.num_proposal), dtype=torch.int).cuda()
            # (batch_size, num_proposal)
            xyz, features, _ = self.vote_aggregation(xyz, features, sample_inds)
        # xyz : (batch_size, num_proposal, 3)
        # features : (batch_size, 128, num_proposal)
        end_points['aggregated_vote_xyz'] = xyz  # (batch_size, num_proposal, 3)
        end_points['aggregated_vote_inds'] = sample_inds  # (batch_size, num_proposal,) # should be 0,1,2,...,num_proposal

        # --------- PROPOSAL GENERATION ---------
        net = F.relu(self.bn1(self.conv1(features)))
        net = F.relu(self.bn2(self.conv2(net)))
        net = self.conv3(net)
        # (batch_size, 2 + 3 + num_heading_bin * 2 + num_size_cluster * 4 + num_class, num_proposal)

        end_points = decode_scores(net, end_points, self.num_class, self.num_heading_bin, self.num_size_cluster, self.mean_size_arr)
        return end_points

def decode_scores(net, end_points, num_class, num_heading_bin, num_size_cluster, mean_size_arr):
    net_transposed = net.transpose(2, 1)
    # (batch_size, num_proposal, 2 + 3 + num_heading_bin * 2 + num_size_cluster * 4 + num_class)
    batch_size = net_transposed.shape[0]
    num_proposal = net_transposed.shape[1]

    objectness_scores = net_transposed[:, :, 0:2]
    end_points['objectness_scores'] = objectness_scores
    # (batch_size, num_proposal, 2)

    base_xyz = end_points['aggregated_vote_xyz']  # (batch_size, num_proposal, 3)
    center = base_xyz + net_transposed[:, :, 2:5]  # (batch_size, num_proposal, 3)
    end_points['center'] = center

    heading_scores = net_transposed[:, :, 5:5 + num_heading_bin]
    end_points['heading_scores'] = heading_scores
    # (batch_size, num_proposal, num_heading_bin)

    heading_residuals_normalized = net_transposed[:, :, 5 + num_heading_bin:5 + num_heading_bin * 2]
    end_points['heading_residuals_normalized'] = heading_residuals_normalized
    # (batch_size, num_proposal, num_heading_bin) (should be -1 to 1)

    end_points['heading_residuals'] = heading_residuals_normalized * (np.pi / num_heading_bin)
    # Each normalized residual lies in [-1, 1], so after this scaling
    # heading_residuals lies in [-np.pi / num_heading_bin, np.pi / num_heading_bin]

    size_scores = net_transposed[:, :, 5 + num_heading_bin * 2:5 + num_heading_bin * 2 + num_size_cluster]
    end_points['size_scores'] = size_scores
    # (batch_size, num_proposal, num_size_cluster)

    size_residuals_normalized = net_transposed[:, :, 5 + num_heading_bin * 2 + num_size_cluster:5 + num_heading_bin * 2 + num_size_cluster * 4].view([batch_size, num_proposal, num_size_cluster, 3])
    end_points['size_residuals_normalized'] = size_residuals_normalized
    # (batch_size, num_proposal, num_size_cluster, 3)

    end_points['size_residuals'] = size_residuals_normalized * torch.from_numpy(mean_size_arr.astype(np.float32)).cuda().unsqueeze(0).unsqueeze(0)

    sem_cls_scores = net_transposed[:, :, 5 + num_heading_bin * 2 + num_size_cluster * 4:]
    end_points['sem_cls_scores'] = sem_cls_scores
    # (batch_size, num_proposal, num_class)

    return end_points

By default, decode_scores outputs, for each of the 256 proposals, 12 heading-class scores and 12 heading residuals, 10 size-class scores and 10 size residuals, 3 center coordinates, 2 objectness scores, and confidences over 10 semantic classes.
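For this configuration (12 heading bins, 10 size clusters, 10 classes, i.e., the SUN RGB-D setup), the output channel count of conv3 works out to:
$$
2+3+2\cdot NH+4\cdot NS+NC=2+3+24+40+10=79
$$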

loss_helper.py

The total VoteNet loss is composed of the vote loss, objectness loss, box loss, and semantic classification loss.

def get_loss(end_points, config):
    # Loss functions

    # Vote loss
    vote_loss = compute_vote_loss(end_points)
    end_points['vote_loss'] = vote_loss

    # Obj loss
    objectness_loss, objectness_label, objectness_mask, object_assignment = compute_objectness_loss(end_points)
    end_points['objectness_loss'] = objectness_loss
    end_points['objectness_label'] = objectness_label
    end_points['objectness_mask'] = objectness_mask
    end_points['object_assignment'] = object_assignment
    total_num_proposal = objectness_label.shape[0] * objectness_label.shape[1]
    end_points['pos_ratio'] = torch.sum(objectness_label.float().cuda()) / float(total_num_proposal)
    end_points['neg_ratio'] = torch.sum(objectness_mask.float()) / float(total_num_proposal) - end_points['pos_ratio']

    # Box loss and sem cls loss
    center_loss, heading_cls_loss, heading_reg_loss, size_cls_loss, size_reg_loss, sem_cls_loss = \
        compute_box_and_sem_cls_loss(end_points, config)
    end_points['center_loss'] = center_loss
    end_points['heading_cls_loss'] = heading_cls_loss
    end_points['heading_reg_loss'] = heading_reg_loss
    end_points['size_cls_loss'] = size_cls_loss
    end_points['size_reg_loss'] = size_reg_loss
    end_points['sem_cls_loss'] = sem_cls_loss
    box_loss = center_loss + 0.1 * heading_cls_loss + heading_reg_loss + 0.1 * size_cls_loss + size_reg_loss
    end_points['box_loss'] = box_loss

    # Final loss function
    loss = vote_loss + 0.5 * objectness_loss + box_loss + 0.1 * sem_cls_loss
    loss *= 10
    end_points['loss'] = loss

    # --------------------------------------------
    # Some other statistics
    obj_pred_val = torch.argmax(end_points['objectness_scores'], 2)  # B,K
    obj_acc = torch.sum((obj_pred_val == objectness_label.long()).float() * objectness_mask) / (torch.sum(objectness_mask) + 1e-6)
    end_points['obj_acc'] = obj_acc

    return loss, end_points

vote loss

def compute_vote_loss(end_points):
    """ Compute vote loss: Match predicted votes to GT votes.
    Overall idea:
        If a seed point belongs to an object (votes_label_mask == 1), we want it to vote toward the object center.
        Each seed point may cast multiple votes (translations) v1, v2, v3,
        and a seed point may also lie inside the bounding boxes of several objects o1, o2, o3,
        whose corresponding GT votes are c1, c2, c3.
        The loss for that seed point is: min(d(v_i, c_j)) for i = 1,2,3 and j = 1,2,3
    """

    # Load ground truth votes and assign them to seed points
    batch_size = end_points['seed_xyz'].shape[0]
    num_seed = end_points['seed_xyz'].shape[1]  # (B, num_seed, 3)
    vote_xyz = end_points['vote_xyz']  # (B, num_seed * vote_factor, 3)
    seed_inds = end_points['seed_inds'].long()  # (B, num_seed) in [0, num_points-1]

    # Get groundtruth votes for the seed points
    # vote_label_mask: Use gather to select B,num_seed from B,num_point
    #   non-object point has no GT vote mask = 0, object point has mask = 1
    # vote_label: Use gather to select B,num_seed,9 from B,num_point,9
    #   with inds in shape B,num_seed,9 and 9 = GT_VOTE_FACTOR * 3
    seed_gt_votes_mask = torch.gather(end_points['vote_label_mask'], 1, seed_inds)
    seed_inds_expand = seed_inds.view(batch_size, num_seed, 1).repeat(1, 1, 3 * GT_VOTE_FACTOR)
    seed_gt_votes = torch.gather(end_points['vote_label'], 1, seed_inds_expand)
    seed_gt_votes += end_points['seed_xyz'].repeat(1, 1, 3)

    # Compute the min of min of distance
    vote_xyz_reshape = vote_xyz.view(batch_size * num_seed, -1, 3)
    # (B, num_seed * vote_factor, 3) => (B * num_seed, vote_factor, 3)
    seed_gt_votes_reshape = seed_gt_votes.view(batch_size * num_seed, GT_VOTE_FACTOR, 3)
    # (B, num_seed, 3 * GT_VOTE_FACTOR) => (B * num_seed, GT_VOTE_FACTOR, 3)
    # A predicted vote to no where is not penalized as long as there is a good vote near the GT vote.
    dist1, _, dist2, _ = nn_distance(vote_xyz_reshape, seed_gt_votes_reshape, l1=True)
    votes_dist, _ = torch.min(dist2, dim=1)
    votes_dist = votes_dist.view(batch_size, num_seed)
    # (B * num_seed, GT_VOTE_FACTOR) => (B * num_seed,) => (B, num_seed)
    vote_loss = torch.sum(votes_dist * seed_gt_votes_mask.float()) / (torch.sum(seed_gt_votes_mask.float()) + 1e-6)
    return vote_loss
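nn_distance is the two-sided nearest-neighbor distance helper used throughout these losses. Here is a simplified sketch of what it computes (the real helper has more options, e.g. a Huber variant):

import torch

def nn_distance_sketch(pc1, pc2, l1=False):
    """pc1: [B, N, 3], pc2: [B, M, 3] -> per-point nearest-neighbor distances both ways."""
    diff = pc1.unsqueeze(2) - pc2.unsqueeze(1)  # [B, N, M, 3] pairwise differences
    if l1:
        dist = torch.abs(diff).sum(dim=-1)      # [B, N, M] L1 distances
    else:
        dist = (diff ** 2).sum(dim=-1)          # [B, N, M] squared L2 distances
    dist1, idx1 = dist.min(dim=2)               # for each pc1 point: its nearest pc2 point
    dist2, idx2 = dist.min(dim=1)               # for each pc2 point: its nearest pc1 point
    return dist1, idx1, dist2, idx2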

objectness loss

def compute_objectness_loss(end_points):
    """ Compute objectness loss for the proposals.
    Args:
        end_points: dict (read-only)
    Returns:
        objectness_loss: scalar Tensor
        objectness_label: (batch_size, num_seed) Tensor with value 0 or 1
        objectness_mask: (batch_size, num_seed) Tensor with value 0 or 1
        object_assignment: (batch_size, num_seed) Tensor with long int
            within [0, num_gt_object-1]
    """
    # Associate proposal and GT objects by point-to-point distances
    aggregated_vote_xyz = end_points['aggregated_vote_xyz']
    gt_center = end_points['center_label'][:, :, 0:3]
    B = gt_center.shape[0]
    K = aggregated_vote_xyz.shape[1]
    K2 = gt_center.shape[1]
    dist1, ind1, dist2, _ = nn_distance(aggregated_vote_xyz, gt_center)  # dist1: BxK, dist2: BxK2

    # Generate objectness label and mask
    # objectness_label: 1 if pred object center is within NEAR_THRESHOLD of any GT object
    # objectness_mask: 0 if pred object center is in gray zone (DONOTCARE), 1 otherwise
    euclidean_dist1 = torch.sqrt(dist1 + 1e-6)
    objectness_label = torch.zeros((B, K), dtype=torch.long).cuda()
    objectness_mask = torch.zeros((B, K)).cuda()
    objectness_label[euclidean_dist1 < NEAR_THRESHOLD] = 1
    objectness_mask[euclidean_dist1 < NEAR_THRESHOLD] = 1
    objectness_mask[euclidean_dist1 > FAR_THRESHOLD] = 1

    # Compute objectness loss
    objectness_scores = end_points['objectness_scores']
    criterion = nn.CrossEntropyLoss(torch.Tensor(OBJECTNESS_CLS_WEIGHTS).cuda(), reduction='none')
    objectness_loss = criterion(objectness_scores.transpose(2, 1), objectness_label)
    objectness_loss = torch.sum(objectness_loss * objectness_mask) / (torch.sum(objectness_mask) + 1e-6)

    # Set assignment
    object_assignment = ind1  # (B,K) with values in 0,1,...,K2-1

    return objectness_loss, objectness_label, objectness_mask, object_assignment

box and sem cls loss

The paper describes it as follows:

The max-pooled features are further processed by MLP2 with output sizes of 128, 128, 5+2NH+4NS+NC where the output consists of 2 objectness scores, 3 center regression values, 2NH numbers for heading regression (NH heading bins) and 4NS numbers for box size regression (NS box anchors) and NC numbers for semantic classification

Following Frustum PointNets, which VoteNet cites, both works consider only the rotation about the up-axis as the heading angle, reducing the 3D bounding box to this single rotational degree of freedom. The scheme is then easy to understand: divide the full heading range into 12 bins, first classify which bin the angle falls into, then regress the offset within that bin. The size bin and size residual work analogously; see the sketch below.
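Here is a minimal sketch of this bin-plus-residual encoding, in the spirit of the angle2class helper from the Frustum PointNets / VoteNet codebases (details there may differ slightly):

import numpy as np

def angle2class_sketch(angle, num_heading_bin=12):
    """Encode a heading angle as (bin id, in-bin residual). Bins are centered at
    0, 2pi/N, 2*2pi/N, ..., so the residual lies in [-pi/N, pi/N), matching the
    heading_residuals normalization used in decode_scores above."""
    angle = angle % (2 * np.pi)
    angle_per_class = 2 * np.pi / num_heading_bin
    shifted = (angle + angle_per_class / 2) % (2 * np.pi)
    class_id = int(shifted / angle_per_class)
    residual = shifted - (class_id * angle_per_class + angle_per_class / 2)
    return class_id, residual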


def compute_box_and_sem_cls_loss(end_points, config):
    """ Compute 3D bounding box and semantic classification loss.
    Args:
        end_points: dict (read-only)
    Returns:
        center_loss
        heading_cls_loss
        heading_reg_loss
        size_cls_loss
        size_reg_loss
        sem_cls_loss
    """

    num_heading_bin = config.num_heading_bin
    num_size_cluster = config.num_size_cluster
    num_class = config.num_class
    mean_size_arr = config.mean_size_arr

    object_assignment = end_points['object_assignment']
    batch_size = object_assignment.shape[0]

    # Compute center loss
    # With K predicted centers and K2 GT objects, the loss splits into two parts:
    # centroid_reg_loss1 (B, K): distance from each predicted center to its nearest GT object
    # centroid_reg_loss2 (B, K2): distance from each GT object to its nearest predicted center
    # Together they account for the correspondence in both directions
    pred_center = end_points['center']
    gt_center = end_points['center_label'][:, :, 0:3]
    dist1, ind1, dist2, _ = nn_distance(pred_center, gt_center)  # dist1: BxK, dist2: BxK2
    box_label_mask = end_points['box_label_mask']
    objectness_label = end_points['objectness_label'].float()
    centroid_reg_loss1 = torch.sum(dist1 * objectness_label) / (torch.sum(objectness_label) + 1e-6)
    centroid_reg_loss2 = torch.sum(dist2 * box_label_mask) / (torch.sum(box_label_mask) + 1e-6)
    center_loss = centroid_reg_loss1 + centroid_reg_loss2

    # Compute heading loss
    heading_class_label = torch.gather(end_points['heading_class_label'], 1, object_assignment)
    # select (B,K) from (B,K2): the GT heading class for each proposal
    # end_points['heading_scores'].transpose(2,1): (batch_size, num_heading_bin, K)
    criterion_heading_class = nn.CrossEntropyLoss(reduction='none')
    heading_class_loss = criterion_heading_class(end_points['heading_scores'].transpose(2, 1), heading_class_label)
    # (B,K)
    heading_class_loss = torch.sum(heading_class_loss * objectness_label) / (torch.sum(objectness_label) + 1e-6)
    # Multi-class cross-entropy over the heading bins; the sum only counts proposals
    # with objectness_label == 1, then normalizes

    heading_residual_label = torch.gather(end_points['heading_residual_label'], 1, object_assignment)
    # select (B,K) from (B,K2)
    heading_residual_normalized_label = heading_residual_label / (np.pi / num_heading_bin)

    # Ref: https://discuss.pytorch.org/t/convert-int-into-one-hot-format/507/3
    heading_label_one_hot = torch.cuda.FloatTensor(batch_size, heading_class_label.shape[1], num_heading_bin).zero_()
    heading_label_one_hot.scatter_(2, heading_class_label.unsqueeze(-1), 1)
    # src==1 so it's *one-hot* (B,K,num_heading_bin)
    heading_residual_normalized_loss = huber_loss(torch.sum(end_points['heading_residuals_normalized'] * heading_label_one_hot, -1) - heading_residual_normalized_label, delta=1.0)  # (B,K)
    heading_residual_normalized_loss = torch.sum(heading_residual_normalized_loss * objectness_label) / (torch.sum(objectness_label) + 1e-6)

    # Compute size loss
    size_class_label = torch.gather(end_points['size_class_label'], 1, object_assignment)
    # select (B,K) from (B,K2)
    criterion_size_class = nn.CrossEntropyLoss(reduction='none')
    size_class_loss = criterion_size_class(end_points['size_scores'].transpose(2, 1), size_class_label)  # (B,K)
    size_class_loss = torch.sum(size_class_loss * objectness_label) / (torch.sum(objectness_label) + 1e-6)

    size_residual_label = torch.gather(end_points['size_residual_label'], 1, object_assignment.unsqueeze(-1).repeat(1, 1, 3))
    # select (B,K,3) from (B,K2,3)
    size_label_one_hot = torch.cuda.FloatTensor(batch_size, size_class_label.shape[1], num_size_cluster).zero_()
    size_label_one_hot.scatter_(2, size_class_label.unsqueeze(-1), 1)  # src==1 so it's *one-hot* (B,K,num_size_cluster)
    size_label_one_hot_tiled = size_label_one_hot.unsqueeze(-1).repeat(1, 1, 1, 3)  # (B,K,num_size_cluster,3)
    predicted_size_residual_normalized = torch.sum(end_points['size_residuals_normalized'] * size_label_one_hot_tiled, 2)  # (B,K,3)

    mean_size_arr_expanded = torch.from_numpy(mean_size_arr.astype(np.float32)).cuda().unsqueeze(0).unsqueeze(0)
    # (1,1,num_size_cluster,3)
    mean_size_label = torch.sum(size_label_one_hot_tiled * mean_size_arr_expanded, 2)  # (B,K,3)
    size_residual_label_normalized = size_residual_label / mean_size_label  # (B,K,3)
    size_residual_normalized_loss = torch.mean(huber_loss(predicted_size_residual_normalized - size_residual_label_normalized, delta=1.0), -1)  # (B,K,3) -> (B,K)
    size_residual_normalized_loss = torch.sum(size_residual_normalized_loss * objectness_label) / (torch.sum(objectness_label) + 1e-6)

    # Compute semantic cls loss: plain multi-class cross-entropy on the predictions
    sem_cls_label = torch.gather(end_points['sem_cls_label'], 1, object_assignment)
    # select (B,K) from (B,K2)
    criterion_sem_cls = nn.CrossEntropyLoss(reduction='none')
    sem_cls_loss = criterion_sem_cls(end_points['sem_cls_scores'].transpose(2, 1), sem_cls_label)  # (B,K)
    sem_cls_loss = torch.sum(sem_cls_loss * objectness_label) / (torch.sum(objectness_label) + 1e-6)

    return center_loss, heading_class_loss, heading_residual_normalized_loss, size_class_loss, size_residual_normalized_loss, sem_cls_loss