Posted 2022-01-18Updated 2022-02-13an hour read (About 11783 words)

(CVPR2020)GraspNet-1Billion

最近在研究各种各样的GraspNet，组内浩树学长的GraspNet虽然我已经用了很久，但是在方法论和具体实现上一直没有时间去研究，借着寒假的机会来集中学习一下。

其实GraspNet-1Billion包含三部分，大规模的数据集、网络和benchmark。

数据集部分

数据集包含88个日常常见的物体的3D模型。数据是从190个clutter scene中收集的，每个scene中，2个深度摄像机拍512张深度图，共97280张图片。每张图片中，我们通过计算force closure的方法稠密标注了6D grasp pose。每个场景中通常会有300万到900万个Grasp Pose。

数据收集

我原以为是仿真器里渲染的，居然真的是在现实世界中采集的，这样的话就不存在什么sim2real的问题了。两个深度相机同时拍摄场景并且合成成一个点云。摄像机所固连的机械臂的末端执行器在单位球面上找到了256个位置拍摄。地面上还放置了ArUco marker来协助摄像机标定，这样避免了计算fk所带来的误差。

数据标定

有了这97280张图以后，因为每256个位置的相对位置都是已知的，所以我们只需要标定每个的第一帧即可，即380张。但是Grasp Pose分布在一个大型的连续搜索空间中，是标不完的，手动标是巨大的工作量。因为我们的每个物体是已知的，文章中提出了一个2阶段的Grasp Pose标注方法。

首先，我们在单个物体上采样并标注Grasp Pose。为了达到这个目的，我们先把单个物体的网格模型降采样成均匀分布的Voxel space，其中的每个点称为grasp point。对于每个Grasp Point，我们从单位球找到V个均匀分布的approaching vector。然后，我们在二维网格$D\times A$上搜索（其中D是夹爪深度，而A是in-plane旋转角度）。

我们使用理论计算的方法来对每个Grasp打分，force-closure指标是很有用的：给定一个Grasp Pose、相关的Object以及摩擦系数$\mu$，它可以输出一个二分类的标签来判断这个Grasp是否可以在对应的摩擦系数下被抓起来。因为force-closure是基于物理的，所以比较鲁棒。此处我们采用24所提到的一个改进版本，使用$\Delta\mu=0.1$作为间隔，我们逐渐从1递减到0.1，直到Grasp不再antipodal。因为具有更低摩擦系数$\mu$的成功抓取更容易成功，所以我们定义Grasp的成功率为：$s=1.1-\mu$。

对于每个scene，我们把对应单个物体的Grasp投影到clutter的场景中，此外还做了collision-check来避免非法的情况。在这两部以后，我们就对每个scene创建了稠密的抓取集合$G_{(w)}$。根据统计，数据集中正例和反例的比例为1:2。

数据集评估

对于190个场景，100个用来训练、90个用来测试。我们进一步把测试集分类为3种： 30 scenes with seen objects, 30 with unseen but similar objects and 30 for novel objects。

为了评估Grasp Pose的预测表现，之前的方法通常认为一个正确的Grasp需要满足：

rotation error小于$30^{\circ}$
矩形的IOU（Intersection over Union）大于0.25

但是这样的度量指标有一些问题，比如它只能评估Grasp Pose的矩阵表示方法，并且它的错误容忍度太高了，康奈尔数据集已经可以达到99%的进度了。这篇工作提出了一个在线的评估算法来评估Grasp的精度。

我们首先演示如何来分类是否一个Grasp Pose是true positive的。对于每个预测出来的Grasp Pose$\hat{P}_i$，我们首先做collision-checking，第二部就是我们通过force-closure来判断在给定不同的摩擦系数$\mu$下是否可以抓取。

对于cluttered scene，我们的Grasp预测算法会预测出多个Grasp Pose。因为对于抓取来说，我们通常是预测以后再执行的，所以我们认为true positive的比例是最重要的。因此，我们采用Precision@k作为我们的评估指标，也就是前k个抓取的精度。$AP_\mu$在摩擦系数$\mu$代表Precision@k的均值（k从1取到50）。和COCO数据集类似，我们在不同的摩擦系数$\mu$来计算$AP_\mu$。为了避免相同的Grasp Pose，在评估之前，我们对Grasp Pose来使用pose-NMS。

网络结构整体上我觉得和VoteNet非常接近，也是生成M个seeds。然后通过ApproachNet把Grasp的接近向量做一个多分类出来。然后我们对这M个grasp proposal做KNN聚类，聚出来K个grasp proposal。对于每个接近向量，我们继续做多分类，在$D\times A$上找到一个分数最大的夹爪渐进距离和角度。

Loss Function

对于每个候选点，我们为它分配一个二分类标签来表示它是否是可以抓取的。

对于那些不在物体上的点，我们直接认为是失败的样本。
对于在物体上的点，我们在5mm半径内寻找它是否存在至少一个graspable的GT。如果存在，对应的graspable标签即为1。
如果物体上的点附近5mm半径内没有graspable的GT，那么我们忽略掉这种情况，因为对我们的训练没有贡献。

对于每个点，我们有V个approaching vector，我们定义第i个点的第j个approaching vector为$v_{ij}$。我们然后寻找它对应的GT向量：$\hat{v}_{ij}$。类似地，我们只考虑相差在5度以内的。最终，我们的损失函数定义如下：
$$
L^A({c_i},{s_{ij}})=\frac{1}{N_{cls}}\sum_iL_{cls}(c_i,c_i^*)+\lambda_1\frac{1}{N_{reg}}\sum_i\sum_j c_i^*
\mathbf{1}(|v_{ij},v_{ij}^*|<5^{\circ})L_{reg}(s_{ij},s_{ij}^*))
$$
其中$s_{ij}$代表网络预测的第i个点的confidence score。而$s_{ij}^*$是对应的GT，是通过EQ2选出来的最大的grasp的置信度。$|v_{ij},v^*_{ij}|$代表的就是角度差。指示函数$\mathbf{1}()$约束了一个approaching vector所产生的loss是通过它附近5度内的GT所提供的。

这个损失函数前半部分就是说，希望对每个点可以正确地二分类到graspable还是not graspable。后半部分就是说，对于每个graspable的点（满足$c_i^*==1$），它存在V个approaching vector，我们希望通过稠密的Grasp GT来让每个approaching vector都能预测出一个confidence score，也就是对应Grasp数据集创建时候，所计算出来的成功率。

Operation Network

在得到了graspable points的approaching vector之后，我们需要进一步预测in-plane rotation, approaching distance, gripper width和grasp confidence。

Cylinder Region Transformation

文章提出了一个统一的Grasp表示方式。因为approaching distance相对不那么敏感，所以分为了K个bin，对于每个给定的distance $d_k$，我们在圆柱中沿着approaching vector采样一些点。这些采样的点会转化到一个新的坐标系下，其原点是Grasp point，而z轴就是approaching vector $v_{ij}$。

Rotation和Width

在之前的文章中，证明了预测in-plane rotation时，分类比起回归有更好的效果。所以，rotation network把对齐过的点云作为输入，输出分类的分数、对每个rotation bin的归一化过的残差，以及对应的grasp width和grasp confidence。因为夹爪是对称的，所以我们只需要预测0~180度即可。目标函数如下：
$$
L^R(R_{ij},S_{ij},{W_{ij}})=\sum_{d=1}^K\left(\frac{1}{N_{cls}}\sum_{ij}L^d_{cls}(R_{ij},R_{ij}^*)+\lambda_2\frac{1}{N_{reg}}\sum_{ij}L^d_{reg}(S_{ij},S^*_{ij})+\lambda_3\frac{1}{N_{reg}}\sum_{ij}L^d_{reg}(W_{ij}, W^*_{ij})\right)
$$
其中$R_{ij}$代表binned rotation degree, $S_{ij}$代表grasp confidence score，$W_{ij}$代表夹爪闭合宽度，$d$代表approaching distance。其中$L^d$代表的是第d个binned distance的loss。此处的$L_{cls}$代表的是多分类任务的交叉熵损失。

Tolerance Network

现在我们已经有了一个end-to-end的网络了。本文进一步提出了Grasp Affinity Field（GAFs）的概念，它可以提升预测出来的Grasp的鲁棒性。因为合法的Grasp Pose是无限的，我们希望能够挑选出那些可以容忍更大error的鲁棒的Grasp。所以，GAFs就学习的是每个grasp对于扰动的鲁棒性。

给定一个GT的grasp pose，我们在球空间中搜索它的邻域，来找到满足grasp score > 0.5的最远的距离作为GATs。损失函数如下：
$$
L^F(A_{ij})=\frac{1}{N_{reg}}\sum_{d=1}^K\sum_{ij}L^d_{reg}(T_{ij},T_{ij}^*)
$$
其中$T_{ij}$代表了grasp pose可以忍受的最大的扰动。

在训练过程中，网络的总的目标函数如下
$$
L=L^A({c_i},{s_{ij}})+\alpha L^R(R_{ij}, S_{ij}, W_{ij})+\beta L^F(T_{ij})
$$
在推理阶段，我们把grasp根据分数分成10个bin，然后在每个bin中根据tolerance network的计算出来的扰动程度来排序。

Experiments

GT Evaluation

为了评估预测的Grasp Pose，本文设立了一个真机实验，因为在现实中需要获取到物体的6D Pose才能做投影，把ArUco code贴在物体上。

对于训练过程，我们把in-plane rotational angle分成了12个bin、approaching distance分成了4个bin(0.01, 0.02, 0.03, 0.04)m。我们设置$M=1024$和$V=300$。我们的ApproachNet有MLP(256, 302, 302)，OperationNet有MLP(128, 128, 36)和ToleranceNet有MLP(128, 64, 12)。

具体代码实现

数据集格式

数据集的官网说明如下：

1. Download, unzip all the files and place them in the following structure, 
   the train images and test images contain the 190 scenes in total.
|-- graspnet
    |-- scenes
    |   |-- scene_0000/
    |   |-- scene_0001/
    |   |-- ... ...
    |   `-- scene_0189/
    |
    |
    |-- models
    |   |-- 000/
    |   |-- 001/
    |   |-- ...
    |   `-- 087/
    |
    |
    |-- dex_models(optional but strongly recommended for accelerating evaluation)
    |   |-- 000.pkl
    |   |-- 001.pkl
    |   |-- ...
    |   `-- 087.pkl
    |   
    |
    |-- grasp_label
    |   |-- 000_labels.npz
    |   |-- 001_labels.npz
    |   |-- ...
    |   `-- 087_labels.npz
    |
    |
    `-- collision_label
        |-- scene_0000/
        |-- scene_0001/
        |-- ... ...
        `-- scene_0189/


2. Detail structure of each scene
|-- scenes
    |-- scene_0000
    |   |-- object_id_list.txt              # objects' id that appear in this scene, 0-indexed
    |   |-- rs_wrt_kn.npy                   # realsense camera pose with respect to kinect, shape: 256x(4x4)
    |   |-- kinect                          # data of kinect camera
    |   |   |-- rgb                         
    |   |   |   |-- 0000.png to 0255.png    # 256 rgb images
    |   |   `-- depth
    |   |   |   |-- 0000.png to 0255.png    # 256 depth images
    |   |   `-- label
    |   |   |   |-- 0000.png to 0255.png    # 256 object mask images, 0 is background, 1-88 denotes each object (1-indexed), same format as YCB-Video dataset
    |   |   `-- annotations
    |   |   |   |-- 0000.xml to 0255.xml    # 256 object 6d pose annotation. ‘pos_in_world' and'ori_in_world' denotes position and orientation w.r.t the camera frame. 
    |   |   `-- meta
    |   |   |   |-- 0000.mat to 0255.mat    # 256 object 6d pose annotation, same format as YCB-Video dataset for easy usage
    |   |   `-- rect
    |   |   |   |-- 0000.npy to 0255.npy    # 256 2D planar grasp labels
    |   |   |   
    |   |   `-- camK.npy                    # camera intrinsic, shape: 3x3, [[f_x,0,c_x], [0,f_y,c_y], [0,0,1]]
    |   |   `-- camera_poses.npy            # 256 camera poses with respect to the first frame, shape: 256x(4x4)
    |   |   `-- cam0_wrt_table.npy          # first frame's camera pose with respect to the table, shape: 4x4
    |   |
    |   `-- realsense
    |       |-- same structure as kinect
    |
    |
    `-- scene_0001
    |
    `-- ... ...
    |
    `-- scene_0189

我们可以看到190个场景，每个都有对应的256张RGB, depth，mask，以及每个场景中10个物体的id、Pose、以及两个摄像机的外参矩阵。

而Grasp Label的格式可以通过API官网找到，注意到是88种物体，每一个都有一组Grasp Label。

>>> import numpy as np
>>> l = np.load('000_labels.npz') # GRASPNET_ROOT/grasp_label/000_labels.npz
>>> l.files
['points', 'offsets', 'collision', 'scores']
>>> l['points'].shape
(3459, 3)
>>> l['offsets'].shape
(3459, 300, 12, 4, 3)
>>> l['collision'].shape
(3459, 300, 12, 4)
>>> l['collision'].dtype
dtype('bool')
>>> l['scores'].shape
(3459, 300, 12, 4)
>>> l['scores'][0][0][0][0]
-1.0

[‘points’] 记录了模型坐标系下的Grasp Center Point。
[‘offsets’] 记录了对应Grasp的in-plane rotation，夹爪深度和夹爪宽度。
[‘collision’] 记录了对应Grasp是否和物体模型存在碰撞。
[‘scores’] 记录了达到稳定Grasp时最小的摩擦系数。

还有一个比较重要的数据就是每个场景的collision_label，在官网中也被成为Collision Masks on Each Scene。具体地，因为我们每个场景中维护了物体的6D Pose，我们是知道每个Grasp的位置在哪里的，我们可以预先地把我们的夹爪模型放上去做碰撞检测。如果Gripper和场景中的Model有碰撞，对于的collision_label就设置为True，我们要在训练的时候把存在碰撞的Grasp Pose的score设置为0。

>>> import numpy as np
>>> c = np.load('collision_labels.npz') # GRASPNET_ROOT/collision_label/scene_0000/collision_labels.npz
>>> c.files
['arr_0', 'arr_4', 'arr_5', 'arr_2', 'arr_3', 'arr_7', 'arr_1', 'arr_8', 'arr_6']
>>> c['arr_0'].shape
(487, 300, 12, 4)
>>> c['arr_0'].dtype
dtype('bool')
>>> c['arr_0'][10][20][3]
array([ True,  True,  True,  True])

DataLoader

class GraspNetDataset(Dataset):
    def __init__(self, root, valid_obj_idxs, grasp_labels, camera='kinect', split='train', num_points=20000,
                 remove_outlier=False, remove_invisible=True, augment=False, load_label=True):
        assert(num_points<=50000)
		# ignore some self.x = x
        if split == 'train':
            self.sceneIds = list( range(100) )
        elif split == 'test':
            self.sceneIds = list( range(100,190) )
        elif split == 'test_seen':
            self.sceneIds = list( range(100,130) )
        elif split == 'test_similar':
            self.sceneIds = list( range(130,160) )
        elif split == 'test_novel':
            self.sceneIds = list( range(160,190) )
        self.sceneIds = ['scene_{}'.format(str(x).zfill(4)) for x in self.sceneIds]
        # ignore some dir concat

    def scene_list(self):
        return self.scenename

    def __len__(self):
        return len(self.depthpath)

    def __getitem__(self, index):
        if self.load_label:
            return self.get_data_label(index)
        else:
            return self.get_data(index)

    def get_data(self, index, return_raw_cloud=False):
        # 仅读入depth, pointcloud, mask，不读入grasp_label，
        color = np.array(Image.open(self.colorpath[index]), dtype=np.float32) / 255.0
        depth = np.array(Image.open(self.depthpath[index]))
        seg = np.array(Image.open(self.labelpath[index]))
        meta = scio.loadmat(self.metapath[index])
        scene = self.scenename[index]
        intrinsic = meta['intrinsic_matrix']
        factor_depth = meta['factor_depth']
        camera = CameraInfo(1280.0, 720.0, intrinsic[0][0], intrinsic[1][1], intrinsic[0][2], intrinsic[1][2], factor_depth)

        # 合成点云
        cloud = create_point_cloud_from_depth_image(depth, camera, organized=True)

        # 得到两个mask
        depth_mask = (depth > 0)
        seg_mask = (seg > 0)
        if self.remove_outlier:
            camera_poses = np.load(os.path.join(self.root, 'scenes', scene, self.camera, 'camera_poses.npy'))
            align_mat = np.load(os.path.join(self.root, 'scenes', scene, self.camera, 'cam0_wrt_table.npy'))
            trans = np.dot(align_mat, camera_poses[self.frameid[index]])
            workspace_mask = get_workspace_mask(cloud, seg, trans=trans, organized=True, outlier=0.02)
            mask = (depth_mask & workspace_mask)
        else:
            mask = depth_mask
        cloud_masked = cloud[mask]
        color_masked = color[mask]
        seg_masked = seg[mask]
        if return_raw_cloud:
            return cloud_masked, color_masked

        # 如果点太多，那么就采样固定的点输出
        if len(cloud_masked) >= self.num_points:
            idxs = np.random.choice(len(cloud_masked), self.num_points, replace=False)
        else:
            idxs1 = np.arange(len(cloud_masked))
            idxs2 = np.random.choice(len(cloud_masked), self.num_points-len(cloud_masked), replace=True)
            idxs = np.concatenate([idxs1, idxs2], axis=0)
        cloud_sampled = cloud_masked[idxs]
        color_sampled = color_masked[idxs]
        
        ret_dict = {}
        ret_dict['point_clouds'] = cloud_sampled.astype(np.float32)
        ret_dict['cloud_colors'] = color_sampled.astype(np.float32)

        return ret_dict

    def get_data_label(self, index):
		# 省略了get_data的所有逻辑，基于get_data的函数之后，这个函数做了如下的处理
        cloud_sampled = cloud_masked[idxs]
        color_sampled = color_masked[idxs]
        seg_sampled = seg_masked[idxs]
        objectness_label = seg_sampled.copy()
        objectness_label[objectness_label>1] = 1
        
        object_poses_list = []
        grasp_points_list = []
        grasp_offsets_list = []
        grasp_scores_list = []
        grasp_tolerance_list = []
        for i, obj_idx in enumerate(obj_idxs):	# 枚举场景中的10类物体
            if obj_idx not in self.valid_obj_idxs:
                continue
            if (seg_sampled == obj_idx).sum() < 50:	# 拍到的对应物体点云数量太少
                continue
            object_poses_list.append(poses[:, :, i])	# 加入到合法的object_poses_list中
            points, offsets, scores, tolerance = self.grasp_labels[obj_idx]	# 得到物体坐标系下创建的Grasp Pose
            collision = self.collision_labels[scene][i] # 得到场景中的点的collision mask(Np, V, A, D)

            # remove invisible grasp points
            if self.remove_invisible:
                visible_mask = remove_invisible_grasp_points(cloud_sampled[seg_sampled==obj_idx], points, poses[:,:,i], th=0.01)
                points = points[visible_mask]
                offsets = offsets[visible_mask]
                scores = scores[visible_mask]
                tolerance = tolerance[visible_mask]
                collision = collision[visible_mask]

            idxs = np.random.choice(len(points), min(max(int(len(points)/4),300),len(points)), replace=False)
            grasp_points_list.append(points[idxs])
            grasp_offsets_list.append(offsets[idxs])
            collision = collision[idxs].copy()
            scores = scores[idxs].copy()
            tolerance = tolerance[idxs].copy()
            scores[collision] = 0
            tolerance[collision] = 0	# 场景中存在collision的情况，我们设置Grasp分数为0
            grasp_scores_list.append(scores)
            grasp_tolerance_list.append(tolerance)
        
        # 如果设置了这个，每次__getitem__的时候得到的数据也会有不一样，会有轻微的扰动，但是这样就不需要在创建Label的时候就加随机，降低的数据集的大小和处理的复杂度
        if self.augment:
            cloud_sampled, object_poses_list = self.augment_data(cloud_sampled, object_poses_list)
        
        ret_dict = {}
        ret_dict['point_clouds'] = cloud_sampled.astype(np.float32)
        ret_dict['cloud_colors'] = color_sampled.astype(np.float32)
        ret_dict['objectness_label'] = objectness_label.astype(np.int64)
        ret_dict['object_poses_list'] = object_poses_list
        ret_dict['grasp_points_list'] = grasp_points_list
        ret_dict['grasp_offsets_list'] = grasp_offsets_list
        ret_dict['grasp_labels_list'] = grasp_scores_list
        ret_dict['grasp_tolerance_list'] = grasp_tolerance_list

        return ret_dict
    
def augment_data(self, point_clouds, object_poses_list):
   # Flipping along the YZ plane
    if np.random.random() > 0.5:
        flip_mat = np.array([[-1, 0, 0],
                            [ 0, 1, 0],
                            [ 0, 0, 1]])
        point_clouds = transform_point_cloud(point_clouds, flip_mat, '3x3')
        for i in range(len(object_poses_list)):
            object_poses_list[i] = np.dot(flip_mat, object_poses_list[i]).astype(np.float32)

    # 沿着z轴旋转
    rot_angle = (np.random.random()*np.pi/3) - np.pi/6 # -30 ~ +30 degree
    c, s = np.cos(rot_angle), np.sin(rot_angle)
    rot_mat = np.array([[1, 0, 0],
                        [0, c,-s],
                        [0, s, c]])
    point_clouds = transform_point_cloud(point_clouds, rot_mat, '3x3')
    for i in range(len(object_poses_list)):
        object_poses_list[i] = np.dot(rot_mat, object_poses_list[i]).astype(np.float32)
    return point_clouds, object_poses_list

def collate_fn(batch):
    # 在train.py创建DataLoader的时候会使用到这个函数。
    # collate_fn：如何取样本的，我们可以定义自己的函数来准确地实现想要的功能
    if type(batch[0]).__module__ == 'numpy':
        return torch.stack([torch.from_numpy(b) for b in batch], 0)
    elif isinstance(batch[0], container_abcs.Mapping):
        return {key:collate_fn([d[key] for d in batch]) for key in batch[0]}
    elif isinstance(batch[0], container_abcs.Sequence):
        return [[torch.from_numpy(sample) for sample in b] for b in batch]
    
    raise TypeError("batch must contain tensors, dicts or lists; found {}".format(type(batch[0])))

graspnet.py

我们首先来看GraspNet这个类。我们可以看到，GraspNetStage1其实就是我们的PointNet++ backbone和ApproachNet部分。而GraspNetStage2主要是OperationNet和ToleranceNet部分。

class GraspNet(nn.Module):
    def __init__(self, input_feature_dim=0, num_view=300, num_angle=12, num_depth=4, cylinder_radius=0.05, hmin=-0.02, hmax_list=[0.01,0.02,0.03,0.04], is_training=True):
        super().__init__()
        self.is_training = is_training
        self.view_estimator = GraspNetStage1(input_feature_dim, num_view)
        self.grasp_generator = GraspNetStage2(num_angle, num_depth, cylinder_radius, hmin, hmax_list, is_training)

    def forward(self, end_points):
        end_points = self.view_estimator(end_points)
        if self.is_training:
            end_points = process_grasp_labels(end_points)
        end_points = self.grasp_generator(end_points)
        return end_points

根据forward的顺序，我们先来看看GarspNetStage1的逻辑：

class GraspNetStage1(nn.Module):
    def __init__(self, input_feature_dim=0, num_view=300):
        super().__init__()
        self.backbone = Pointnet2Backbone(input_feature_dim)
        self.vpmodule = ApproachNet(num_view, 256)

    def forward(self, end_points):
        pointcloud = end_points['point_clouds']
        seed_features, seed_xyz, end_points = self.backbone(pointcloud, end_points)
        end_points = self.vpmodule(seed_xyz, seed_features, end_points)
        return end_points

其实就是走了一个Pointnet2Backbone和ApproachNet，关于PointNet不再赘述。我们来看看ApproachNet

class ApproachNet(nn.Module):
    def __init__(self, num_view, seed_feature_dim):
        """ Approach vector estimation from seed point features.
            Input:
                num_view: [int]
                    number of views generated from each each seed point
                seed_feature_dim: [int]
                    number of channels of seed point features
        """
        super().__init__()
        self.num_view = num_view
        self.in_dim = seed_feature_dim
        self.conv1 = nn.Conv1d(self.in_dim, self.in_dim, 1)
        self.conv2 = nn.Conv1d(self.in_dim, 2+self.num_view, 1)
        self.conv3 = nn.Conv1d(2+self.num_view, 2+self.num_view, 1)
        self.bn1 = nn.BatchNorm1d(self.in_dim)
        self.bn2 = nn.BatchNorm1d(2+self.num_view)

    def forward(self, seed_xyz, seed_features, end_points):
        """ Forward pass.
            Input:
                seed_xyz: (B, Ns, 3), coordinates of seed points
                seed_features: (batch_size,feature_dim,num_seed)
                    features of seed points
                end_points: [dict]
            Output:
                end_points: [dict]
        """
        B, num_seed, _ = seed_xyz.size()
        features = F.relu(self.bn1(self.conv1(seed_features)), inplace=True)
        # (B, C, Ns)
        features = F.relu(self.bn2(self.conv2(features)), inplace=True)
        # (B, 2 + V, Ns)
        features = self.conv3(features)
        # (B, 2 + V, Ns)
        objectness_score = features[:, :2, :] # (B, 2, Ns)
        view_score = features[:, 2:2+self.num_view, :].transpose(1,2).contiguous() 
        # (B, Ns, V)
        end_points['objectness_score'] = objectness_score
        end_points['view_score'] = view_score
        # 找到一个最大Grasp score的approach vector的编号
        top_view_scores, top_view_inds = torch.max(view_score, dim=2) # (B, Ns)
        top_view_inds_ = top_view_inds.view(B, num_seed, 1, 1).expand(-1, -1, -1, 3).contiguous()
        # expand代表把(B, M, 1, 1)复制成(B, M, 1, 3)，最后一维共享内存
        template_views = generate_grasp_views(self.num_view).to(features.device) # (V, 3)
        template_views = template_views.view(1, 1, self.num_view, 3).expand(B, num_seed, -1, -1).contiguous() #(B, Ns, V, 3)
        vp_xyz = torch.gather(template_views, 2, top_view_inds_).squeeze(2) #(B, Ns, 3)
        # torch.gather意味着我们可以使用index把V那一维的对应index选出来，然后我们再用squeeze压缩维度
        # 也就是 (B, M, V, 3) => (B, M, 1, 3) => (B, M, 3)
        vp_xyz_ = vp_xyz.view(-1, 3) # (B * M, 3)
        batch_angle = torch.zeros(vp_xyz_.size(0), dtype=vp_xyz.dtype, device=vp_xyz.device)
        # batch_angle : (B * M, 3)，此时的in-plane rotation
        vp_rot = batch_viewpoint_params_to_matrix(-vp_xyz_, batch_angle).view(B, num_seed, 3, 3)
        # batch_viewpoint_params_to_matrix是把approach vector和in-plane rotation转化为旋转矩阵
        end_points['grasp_top_view_inds'] = top_view_inds
        end_points['grasp_top_view_score'] = top_view_scores
        end_points['grasp_top_view_xyz'] = vp_xyz
        end_points['grasp_top_view_rot'] = vp_rot
        return end_points

注意到在GraspNet的forward过程中，我们的end_points需要经过process_grasp_labels这个函数。我们先来看看这个函数在做什么，它主要做了几件事情：

把物体坐标系下的Grasp投射到标准场景坐标系下，并且聚类到最近的场景系下的标准模板向量（V个approaching vector）
根据聚类结果对(Ns, V, A, D, 3)的第2维度V重排顺序，使得第二维满足相同的顺序，即不同物体的Grasp可以合成一个tensor：(Np’, V, A, D, 3)，其中满足$N_P’=\sum N_P$。
对物体的原先标签Grasp Label做转换，原先的意义是达到稳定抓取的最小摩擦系数，转换成Score时需要满足摩擦系数越大的Grasp的成功率越小，所以其中做了一个归一化的倒数处理。

def process_grasp_labels(end_points):
    """ 根据场景中的点和场景中物体的6D Pose处理grasp labels"""
    clouds = end_points['input_xyz'] #(B, N, 3)
    seed_xyzs = end_points['fp2_xyz'] #(B, Ns, 3)
    batch_size, num_samples, _ = seed_xyzs.size()

    batch_grasp_points = []
    batch_grasp_views = []
    batch_grasp_views_rot = []
    batch_grasp_labels = []
    batch_grasp_offsets = []
    batch_grasp_tolerance = []
    for i in range(batch_size):	# batch中的每个数据单独处理
        seed_xyz = seed_xyzs[i] # (Ns, 3)
        poses = end_points['object_poses_list'][i] # [<=10, (3, 4)]

        # get merged grasp points for label computation
        grasp_points_merged = []
        grasp_views_merged = []
        grasp_views_rot_merged = []
        grasp_labels_merged = []
        grasp_offsets_merged = []
        grasp_tolerance_merged = []
        for obj_idx, pose in enumerate(poses):	# 枚举当前场景中的所有物体
            # 得到每个物体单独时候的Grasps，数量为Np * V * A * D
            grasp_points = end_points['grasp_points_list'][i][obj_idx] #(Np, 3)
            grasp_labels = end_points['grasp_labels_list'][i][obj_idx] #(Np, V, A, D)
            grasp_offsets = end_points['grasp_offsets_list'][i][obj_idx] #(Np, V, A, D, 3)
            grasp_tolerance = end_points['grasp_tolerance_list'][i][obj_idx] #(Np, V, A, D)
            _, V, A, D = grasp_labels.size()
            num_grasp_points = grasp_points.size(0)	# 得到 Np
            # generate and transform template grasp views
            # 我们需要把物体mesh上的模板Grasp投射到当前场景的6D Pose下
            grasp_views = generate_grasp_views(V).to(pose.device) 
            # 使用单位球均匀采样方法，得到V个方向向量(V, 3)
            grasp_points_trans = transform_point_cloud(grasp_points, pose, '3x4')、
            # 3x4就是既要旋转又要平移，根据物体在场景中的6D Pose，把grasp的投影到场景中, (Np, 3)
            grasp_views_trans = transform_point_cloud(grasp_views, pose[:3,:3], '3x3')
            # 3x3就是只旋转，我们现在得到了V个场景中对齐的方向向量
            angles = torch.zeros(grasp_views.size(0), dtype=grasp_views.dtype, device=grasp_views.device)
            # 我们把单位向量转化为旋转矩阵，这个转换需要一个额外的自由度，也就in-plane rotation，这也是为什么我们传入了一个全零的angles的理由
            grasp_views_rot = batch_viewpoint_params_to_matrix(-grasp_views, angles) #(V, 3, 3)
            # 这个函数内部，无非就就是把方向向量和标准坐标轴的某一维度对齐，再根据数学算出一个基矢量，然后计算基变换矩阵
            grasp_views_rot_trans = torch.matmul(pose[:3,:3], grasp_views_rot) #(V, 3, 3)
            # 这个矩阵乘以物体的Pose矩阵，就得到了这个Grasp的Approach Vector在场景系下的绝对orn
            
            grasp_views_ = grasp_views.transpose(0, 1).contiguous().unsqueeze(0)	# (1, 3, V)
            grasp_views_trans_ = grasp_views_trans.transpose(0, 1).contiguous().unsqueeze(0)	#(1, 3, V)
       
            view_inds = knn(grasp_views_trans_, grasp_views_, k=1).squeeze() - 1
        	# 注意此处ref = grasp_views_trans_(物体坐标系的模板)
            # query = grasp_views_(物体坐标系)
            # 所以是要把物体坐标系下的方向向量分配到场景坐标系的模板方向向量上
            # 因为物体在场景中的6D Pose是任意的，所以导致在场景对齐后，V个模板方向向量也是任意的，此处的KNN就是把场景对齐后的任意方向的V个approaching vector，重新聚类回场景系下的模板(V, 3)，这样场景中所有Grasp的GT都会属于绝对场景坐标系下的V个approaching vector中。
            grasp_views_trans = torch.index_select(grasp_views_trans, 0, view_inds) #(V, 3)
            grasp_views_trans = grasp_views_trans.unsqueeze(0).expand(num_grasp_points, -1, -1)
            # (Np, V, 3)，重新计算Np个Grasp的新的方向向量
            grasp_views_rot_trans = torch.index_select(grasp_views_rot_trans, 0, view_inds) #(V, 3, 3)
            grasp_views_rot_trans = grasp_views_rot_trans.unsqueeze(0).expand(num_grasp_points, -1, -1, -1)
            # (Np, V, 3, 3)，重新计算Np个Grasp的新的orn
            grasp_labels = torch.index_select(grasp_labels, 1, view_inds) #(Np, V, A, D)
            grasp_offsets = torch.index_select(grasp_offsets, 1, view_inds) #(Np, V, A, D, 3)
            grasp_tolerance = torch.index_select(grasp_tolerance, 1, view_inds) #(Np, V, A, D)
            # 上面三行其实都是在dim=1上重排序号，使其重新满足V这一维度的顺序等于模板Approaching Vector的创建顺序
            
            # add to list
            grasp_points_merged.append(grasp_points_trans)
            grasp_views_merged.append(grasp_views_trans)
            grasp_views_rot_merged.append(grasp_views_rot_trans)
            grasp_labels_merged.append(grasp_labels)
            grasp_offsets_merged.append(grasp_offsets)
            grasp_tolerance_merged.append(grasp_tolerance)
		# Np' = sum(Np) for all object in the scene
        grasp_points_merged = torch.cat(grasp_points_merged, dim=0) #(Np', 3)
        grasp_views_merged = torch.cat(grasp_views_merged, dim=0) #(Np', V, 3)
        grasp_views_rot_merged = torch.cat(grasp_views_rot_merged, dim=0) #(Np', V, 3, 3)
        grasp_labels_merged = torch.cat(grasp_labels_merged, dim=0) #(Np', V, A, D)
        grasp_offsets_merged = torch.cat(grasp_offsets_merged, dim=0) #(Np', V, A, D, 3)
        grasp_tolerance_merged = torch.cat(grasp_tolerance_merged, dim=0) #(Np', V, A, D)

        # compute nearest neighbors
        seed_xyz_ = seed_xyz.transpose(0, 1).contiguous().unsqueeze(0) #(1, 3, Ns)
        grasp_points_merged_ = grasp_points_merged.transpose(0, 1).contiguous().unsqueeze(0) #(1, 3, Np')
        nn_inds = knn(grasp_points_merged_, seed_xyz_, k=1).squeeze() - 1 
        # 一般来说，Np'通常会有很多个，而Ns = M = 1024，所以此处就是在把每个seed分配到最近的一个grasp point中，这样每个seed就保留了一个最接近的grasp point，这样我们对于每个点有稠密的V * A * D个抓点的标注。

        grasp_points_merged = torch.index_select(grasp_points_merged, 0, nn_inds) # (Ns, 3)
        grasp_views_merged = torch.index_select(grasp_views_merged, 0, nn_inds) # (Ns, V, 3)
        grasp_views_rot_merged = torch.index_select(grasp_views_rot_merged, 0, nn_inds) #(Ns, V, 3, 3)
        grasp_labels_merged = torch.index_select(grasp_labels_merged, 0, nn_inds) # (Ns, V, A, D)
        grasp_offsets_merged = torch.index_select(grasp_offsets_merged, 0, nn_inds) # (Ns, V, A, D, 3)
        grasp_tolerance_merged = torch.index_select(grasp_tolerance_merged, 0, nn_inds) # (Ns, V, A, D)
		# 以上就是得到了当前这个Batch的稠密标注
        
        # add to batch
        batch_grasp_points.append(grasp_points_merged)
        batch_grasp_views.append(grasp_views_merged)
        batch_grasp_views_rot.append(grasp_views_rot_merged)
        batch_grasp_labels.append(grasp_labels_merged)
        batch_grasp_offsets.append(grasp_offsets_merged)
        batch_grasp_tolerance.append(grasp_tolerance_merged)
		
        
    batch_grasp_points = torch.stack(batch_grasp_points, 0) #(B, Ns, 3)
    batch_grasp_views = torch.stack(batch_grasp_views, 0) #(B, Ns, V, 3)
    batch_grasp_views_rot = torch.stack(batch_grasp_views_rot, 0) #(B, Ns, V, 3, 3)
    batch_grasp_labels = torch.stack(batch_grasp_labels, 0) #(B, Ns, V, A, D)
    batch_grasp_offsets = torch.stack(batch_grasp_offsets, 0) #(B, Ns, V, A, D, 3)
    batch_grasp_tolerance = torch.stack(batch_grasp_tolerance, 0) #(B, Ns, V, A, D)
	# 得到了所有batch的稠密标注
    
    # process labels
    batch_grasp_widths = batch_grasp_offsets[:,:,:,:,:,2]
    label_mask = (batch_grasp_labels > 0) & (batch_grasp_widths <= GRASP_MAX_WIDTH)
    u_max = batch_grasp_labels.max()
    # 此时的grasp_labels的物理意义为：达到一个稳定抓取所需要的最小摩擦系数
    batch_grasp_labels[label_mask] = torch.log(u_max / batch_grasp_labels[label_mask])
    # 我们让摩擦系数最大的Grasp的分数为0，因为所需要的摩擦系数越大，Grasp的可能性越低
    # 因为摩擦系数分为了0.1~1，所以torch.log里最大的情况就是10，也就是grasp_score的最大值是torch.log(10)
    batch_grasp_labels[~label_mask] = 0
    batch_grasp_view_scores, _ = batch_grasp_labels.view(batch_size, num_samples, V, A*D).max(dim=-1)

    end_points['batch_grasp_point'] = batch_grasp_points
    end_points['batch_grasp_view'] = batch_grasp_views
    end_points['batch_grasp_view_rot'] = batch_grasp_views_rot
    end_points['batch_grasp_label'] = batch_grasp_labels
    end_points['batch_grasp_offset'] = batch_grasp_offsets
    end_points['batch_grasp_tolerance'] = batch_grasp_tolerance
    end_points['batch_grasp_view_label'] = batch_grasp_view_scores.float()

    return end_points

GraspNetStage2的结构如下，它主要包含三个模块：

match_grasp_view_and_label
CloudCrop
并列的两个论文中画出来的网络：OperationNet和ToleranceNet

class GraspNetStage2(nn.Module):
    def __init__(self, num_angle=12, num_depth=4, cylinder_radius=0.05, hmin=-0.02, hmax_list=[0.01,0.02,0.03,0.04], is_training=True):
        super().__init__()
        self.num_angle = num_angle
        self.num_depth = num_depth
        self.is_training = is_training
        self.crop = CloudCrop(nsample=64, seed_feature_dim=3, cylinder_radius=cylinder_radius, hmin=hmin, hmax_list=hmax_list)
        self.operation = OperationNet(num_angle, num_depth)
        self.tolerance = ToleranceNet(num_angle, num_depth)
    
    def forward(self, end_points):
        pointcloud = end_points['input_xyz']
        if self.is_training:
            grasp_top_views_rot, _, _, _, end_points = match_grasp_view_and_label(end_points)
            seed_xyz = end_points['batch_grasp_point']
        else:
            grasp_top_views_rot = end_points['grasp_top_view_rot']
            seed_xyz = end_points['fp2_xyz']

        vp_features = self.crop(seed_xyz, pointcloud, grasp_top_views_rot)
        end_points = self.operation(vp_features, end_points)
        end_points = self.tolerance(vp_features, end_points)

        return end_points

def match_grasp_view_and_label(end_points):
    # 这个函数在做的事情就是，我们已经预测出来了V个approaching vector中最高分的index，那么我们就要从batch_grasp中把对应的Grasp的参数找出来作为当前点的GT往后传递
    top_view_inds = end_points['grasp_top_view_inds'] # (B, Ns)
    template_views_rot = end_points['batch_grasp_view_rot'] # (B, Ns, V, 3, 3)
    grasp_labels = end_points['batch_grasp_label'] # (B, Ns, V, A, D)
    grasp_offsets = end_points['batch_grasp_offset'] # (B, Ns, V, A, D, 3)
    grasp_tolerance = end_points['batch_grasp_tolerance'] # (B, Ns, V, A, D)

    B, Ns, V, A, D = grasp_labels.size()
    top_view_inds_ = top_view_inds.view(B, Ns, 1, 1, 1).expand(-1, -1, -1, 3, 3)
    # 此时top_view_inds_为(B, Ns, 1, 3, 3)，也就是在V这一维中把最高分的选出来
    top_template_views_rot = torch.gather(template_views_rot, 2, top_view_inds_).squeeze(2)
    top_view_inds_ = top_view_inds.view(B, Ns, 1, 1, 1).expand(-1, -1, -1, A, D)
    top_view_grasp_labels = torch.gather(grasp_labels, 2, top_view_inds_).squeeze(2)
    top_view_grasp_tolerance = torch.gather(grasp_tolerance, 2, top_view_inds_).squeeze(2)
    top_view_inds_ = top_view_inds.view(B, Ns, 1, 1, 1, 1).expand(-1, -1, -1, A, D, 3)
    top_view_grasp_offsets = torch.gather(grasp_offsets, 2, top_view_inds_).squeeze(2)
    # top_template_views_rot (B, Ns, 3, 3)
    # top_view_grasp_labels (B, Ns, A, D)
    # top_view_grasp_tolerance (B, Ns, A, D, 3)
    # top_view_grasp_offsets (B, Ns, A, D)

    end_points['batch_grasp_view_rot'] = top_template_views_rot
    end_points['batch_grasp_label'] = top_view_grasp_labels
    end_points['batch_grasp_offset'] = top_view_grasp_offsets
    end_points['batch_grasp_tolerance'] = top_view_grasp_tolerance
	# 从此以后，这些数据从原先的V个，只保留了分数最高的那一个了
    return top_template_views_rot, top_view_grasp_labels, top_view_grasp_offsets, top_view_grasp_tolerance, end_points

接下来CloudCrop模块：

class CloudCrop(nn.Module):
    """ Cylinder group and align for grasp configure estimation. Return a list of grouped points with different cropping depths.
        Input:
            nsample: [int]
                sample number in a group
            seed_feature_dim: [int]
                number of channels of grouped points
            cylinder_radius: [float]
                radius of the cylinder space
            hmin: [float]
                height of the bottom surface
            hmax_list: [list of float]
                list of heights of the upper surface
    """
    def __init__(self, nsample, seed_feature_dim, cylinder_radius=0.05, hmin=-0.02, hmax_list=[0.01,0.02,0.03,0.04]):
        super().__init__()
        self.nsample = nsample
        self.in_dim = seed_feature_dim
        self.cylinder_radius = cylinder_radius
        mlps = [self.in_dim, 64, 128, 256]
        
        self.groupers = []
        for hmax in hmax_list:
            self.groupers.append(CylinderQueryAndGroup(cylinder_radius, hmin, hmax, nsample, use_xyz=True))
        self.mlps = pt_utils.SharedMLP(mlps, bn=True)

    def forward(self, seed_xyz, pointcloud, vp_rot):
        """ Forward pass.
            Input:
                seed_xyz:   (B, Ns, 3), coordinates of seed points
                pointcloud: (B, Ns, 3), the points to be cropped
                vp_rot:     (B, Ns, 3, 3), rotation matrices generated from approach vectors
            Output:
                vp_features: (B, num_features, Ns,num_depth), features of grouped points in different depths
        """
        B, num_seed, _, _ = vp_rot.size()
        num_depth = len(self.groupers)
        grouped_features = []
        for grouper in self.groupers:	# 枚举D个grouper
            grouped_features.append(grouper(pointcloud, seed_xyz, vp_rot)) 
            # (B, 3, Ns, nsample)
        grouped_features = torch.stack(grouped_features, dim=3) 
        # (B, 3, Ns, D, nsample)
        grouped_features = grouped_features.view(B, -1, num_seed*num_depth, self.nsample) 
        # (B, 3, Ns * D, nsample)
        vp_features = self.mlps(grouped_features) 
        # (B, mlps[-1], Ns * D, nsample)
        vp_features = F.max_pool2d(vp_features, kernel_size=[1, vp_features.size(3)])
        # 圆筒聚类后，对聚类中的点做max_pool2d来聚合成一个cluster feature
        # (B, mlps[-1], Ns * D, 1)
        vp_features = vp_features.view(B, -1, num_seed, num_depth)
        # (B, mlps[-1], Ns, D)
		# 这个模组在做的事情就是把D维根据不同的深度提取出不同的特征，放到dim=1中。
        return vp_features

在此之中，用到了一个CylinderQueryAndGroup模块，我们进一步往下探索。

class CylinderQueryAndGroup(nn.Module):
    r"""
    Groups with a cylinder query of radius and height
    Parameters
    ---------
    radius : float32
        Radius of cylinder
    hmin, hmax: float32
        endpoints of cylinder height in x-rotation axis
    nsample : int32
        Maximum number of features to gather in the ball
    """

    def __init__(self, radius, hmin, hmax, nsample, use_xyz=True, ret_grouped_xyz=False, normalize_xyz=False, rotate_xyz=True, sample_uniformly=False, ret_unique_cnt=False):
        super(CylinderQueryAndGroup, self).__init__()
        self.radius, self.nsample, self.hmin, self.hmax, = radius, nsample, hmin, hmax
        self.use_xyz = use_xyz
        self.ret_grouped_xyz = ret_grouped_xyz
        self.normalize_xyz = normalize_xyz
        self.rotate_xyz = rotate_xyz
        self.sample_uniformly = sample_uniformly
        self.ret_unique_cnt = ret_unique_cnt
        if self.ret_unique_cnt:
            assert(self.sample_uniformly)

    def forward(self, xyz, new_xyz, rot, features=None):
        """
        Parameters
        ----------
        xyz : 点云坐标      (B, N, 3)
        new_xyz : 抓点中心  (B, npoint, 3)
        rot : 抓点的旋转矩阵 (B, npoint, 3, 3)
        features : torch.Tensor
            Descriptors of the features (B, C, N)
        Returns
        -------
        new_features : torch.Tensor
            (B, 3 + C, npoint, nsample) tensor
        """
        B, npoint, _ = new_xyz.size()
        idx = cylinder_query(self.radius, self.hmin, self.hmax, self.nsample, xyz, new_xyz, rot.view(B, npoint, 9))

        if self.sample_uniformly:
            unique_cnt = torch.zeros((idx.shape[0], idx.shape[1]))
            for i_batch in range(idx.shape[0]):
                for i_region in range(idx.shape[1]):
                    unique_ind = torch.unique(idx[i_batch, i_region, :])
                    num_unique = unique_ind.shape[0]
                    unique_cnt[i_batch, i_region] = num_unique
                    sample_ind = torch.randint(0, num_unique, (self.nsample - num_unique,),dtype=torch.long)
                    all_ind = torch.cat((unique_ind, unique_ind[sample_ind]))
                    idx[i_batch, i_region, :] = all_ind


        xyz_trans = xyz.transpose(1, 2).contiguous()
        grouped_xyz = grouping_operation(xyz_trans, idx)  # (B, 3, npoint, nsample)
        grouped_xyz -= new_xyz.transpose(1, 2).unsqueeze(-1)
        if self.normalize_xyz:
            grouped_xyz /= self.radius
        if self.rotate_xyz:
            grouped_xyz_ = grouped_xyz.permute(0, 2, 3, 1).contiguous() # (B, npoint, nsample, 3)
            grouped_xyz_ = torch.matmul(grouped_xyz_, rot)
            grouped_xyz = grouped_xyz_.permute(0, 3, 1, 2).contiguous()


        if features is not None:
            grouped_features = grouping_operation(features, idx)
            if self.use_xyz:
                new_features = torch.cat(
                    [grouped_xyz, grouped_features], dim=1
                )  # (B, C + 3, npoint, nsample)
            else:
                new_features = grouped_features
        else:
            assert (self.use_xyz), "Cannot have not features and not use xyz as a feature!"
            new_features = grouped_xyz

        ret = [new_features]
        if self.ret_grouped_xyz:
            ret.append(grouped_xyz)
        if self.ret_unique_cnt:
            ret.append(unique_cnt)
        if len(ret) == 1:
            return ret[0]
        else:
            return tuple(ret)

为了理解其中cylinder_query的作用，我们看到其CUDA实现：

//我们先尝试理解PointNet++中的ball_query的实现
__global__ void query_ball_point_kernel(int b, int n, int m, float radius,
                                        int nsample,
                                        const float *__restrict__ new_xyz,	//query
                                        const float *__restrict__ xyz,		//原先点
                                        int *__restrict__ idx) {
  int batch_index = blockIdx.x;
  xyz += batch_index * n * 3;
  new_xyz += batch_index * m * 3;
  idx += m * nsample * batch_index;

  int index = threadIdx.x;
  int stride = blockDim.x;

  float radius2 = radius * radius;
  for (int j = index; j < m; j += stride) {
    float new_x = new_xyz[j * 3 + 0];
    float new_y = new_xyz[j * 3 + 1];
    float new_z = new_xyz[j * 3 + 2];
    for (int k = 0, cnt = 0; k < n && cnt < nsample; ++k) {	//枚举原先的每个点
      float x = xyz[k * 3 + 0];
      float y = xyz[k * 3 + 1];
      float z = xyz[k * 3 + 2];
      float d2 = (new_x - x) * (new_x - x) + (new_y - y) * (new_y - y) +
                 (new_z - z) * (new_z - z);	//计算和Query的距离
      if (d2 < radius2) {
        if (cnt == 0) {
          for (int l = 0; l < nsample; ++l) {	// 小于ball半径的情况下，就加到idx中，并且第一次把idx中的值全部赋值为k
            idx[j * nsample + l] = k;
          }
        }
        idx[j * nsample + cnt] = k;
        ++cnt;
      }
    }
  }
}

__global__ void query_cylinder_point_kernel(int b, int n /*xyz.size(1)*/, int m/*new_xyz.size(1)*/, float radius, float hmin, float hmax,
                                        int nsample,
                                        const float *__restrict__ new_xyz,	//抓点中心
                                        const float *__restrict__ xyz,	//点云
                                        const float *__restrict__ rot,
                                        int *__restrict__ idx) {
  int batch_index = blockIdx.x;
  xyz += batch_index * n * 3;	// 访问到当前batch的点云数据偏移量，也就是xyz[batch*n*3]
  new_xyz += batch_index * m * 3;	// 访问到当前batch的抓点数据偏移量，也就是new_xyz[batch*m*3]
  rot += batch_index * m * 9;
  idx += m * nsample * batch_index;

  int index = threadIdx.x;	//当前Block中的线程ID
  int stride = blockDim.x;	//当前Grid中的Thread Block ID，注意调用这个kernel函数的时候传入的Grid和Block都是一维的

  float radius2 = radius * radius;
  for (int j = index; j < m; j += stride) {	//Query中的每个Grasp Center Point
    float new_x = new_xyz[j * 3 + 0];
    float new_y = new_xyz[j * 3 + 1];
    float new_z = new_xyz[j * 3 + 2];
    float r0 = rot[j * 9 + 0];
    float r1 = rot[j * 9 + 1];
    float r2 = rot[j * 9 + 2];
    float r3 = rot[j * 9 + 3];
    float r4 = rot[j * 9 + 4];
    float r5 = rot[j * 9 + 5];
    float r6 = rot[j * 9 + 6];
    float r7 = rot[j * 9 + 7];
    float r8 = rot[j * 9 + 8];
    for (int k = 0, cnt = 0; k < n && cnt < nsample; ++k) {	//枚举点云中的每个点
      float x = xyz[k * 3 + 0] - new_x;
      float y = xyz[k * 3 + 1] - new_y;
      float z = xyz[k * 3 + 2] - new_z;
      float x_rot = r0 * x + r3 * y + r6 * z;
      float y_rot = r1 * x + r4 * y + r7 * z;
      float z_rot = r2 * x + r5 * y + r8 * z;
      // 把裁剪采样过的点云变换到夹爪坐标系里
      float d2 = y_rot * y_rot + z_rot * z_rot;	//沿着圆柱切向量的半径计算
      if (d2 < radius2 && x_rot > hmin && x_rot < hmax) {	//沿着圆柱法向量的裁剪
        if (cnt == 0) {
          for (int l = 0; l < nsample; ++l) {
            idx[j * nsample + l] = k;
          }
        }
        idx[j * nsample + cnt] = k;
        ++cnt;
      }
    }
  }
}

所以上述CylinderQuery的意义就是在圆柱范围内对点做聚类。

接下来是OperationNet和ToleranceNet：

class OperationNet(nn.Module):
    """ Grasp configure estimation.
        Input:
            num_angle: [int]
                number of in-plane rotation angle classes
                the value of the i-th class --> i*PI/num_angle (i=0,...,num_angle-1)
            num_depth: [int]
                number of gripper depth classes
    """
    def __init__(self, num_angle, num_depth):
        # Output:
        # scores(num_angle)
        # angle class (num_angle)
        # width (num_angle)
        super().__init__()
        self.num_angle = num_angle
        self.num_depth = num_depth

        self.conv1 = nn.Conv1d(256, 128, 1)
        self.conv2 = nn.Conv1d(128, 128, 1)
        self.conv3 = nn.Conv1d(128, 3*num_angle, 1)
        self.bn1 = nn.BatchNorm1d(128)
        self.bn2 = nn.BatchNorm1d(128)

    def forward(self, vp_features, end_points):
        """ Forward pass.
            Input:
                vp_features:  (B, mlps[-1], Ns, D)    features of grouped points in different depths
                end_points: [dict]
            Output:
                end_points: [dict]
        """
        # 在这个函数中，我们已经确定了Ns和D的情况下，尝试把in-plane angle多分类出来。
        B, _, num_seed, num_depth = vp_features.size()
        vp_features = vp_features.view(B, -1, num_seed*num_depth)	
        # (B, mlps[-1], Ns * D)，在默认情况下为(B, 256, Ns * 4)
        vp_features = F.relu(self.bn1(self.conv1(vp_features)), inplace=True)
        # (B, 256, 256) => (B, 128, 256)
        vp_features = F.relu(self.bn2(self.conv2(vp_features)), inplace=True)
        # (B, 128, 256) => (B, 128, 256)
        vp_features = self.conv3(vp_features)
        # (B, 128, 256) => (B, 3 * A, 256) 也就是 (B, 3 * A, Ns * D)
        vp_features = vp_features.view(B, -1, num_seed, num_depth)
        # (B, 3 * A, Ns, D)

        # split prediction
        end_points['grasp_score_pred'] = vp_features[:, 0:self.num_angle]
        end_points['grasp_angle_cls_pred'] = vp_features[:, self.num_angle:2*self.num_angle]
        end_points['grasp_width_pred'] = vp_features[:, 2*self.num_angle:3*self.num_angle]
        # 拆成了三个(B, A, Ns, D)
        return end_points
    
class ToleranceNet(nn.Module):
    """ Grasp tolerance prediction.
    
        Input:
            num_angle: [int]
                number of in-plane rotation angle classes
                the value of the i-th class --> i*PI/num_angle (i=0,...,num_angle-1)
            num_depth: [int]
                number of gripper depth classes
    """
    def __init__(self, num_angle, num_depth):
        # Output:
        # tolerance (num_angle)
        super().__init__()
        self.conv1 = nn.Conv1d(256, 128, 1)
        self.conv2 = nn.Conv1d(128, 128, 1)
        self.conv3 = nn.Conv1d(128, num_angle, 1)
        self.bn1 = nn.BatchNorm1d(128)
        self.bn2 = nn.BatchNorm1d(128)

    def forward(self, vp_features, end_points):
        """ Forward pass.
            Input:
                vp_features: [torch.FloatTensor, (batch_size,num_seed,3)]
                    features of grouped points in different depths
                end_points: [dict]
            Output:
                end_points: [dict]
        """
        B, _, num_seed, num_depth = vp_features.size()
        vp_features = vp_features.view(B, -1, num_seed*num_depth)
        vp_features = F.relu(self.bn1(self.conv1(vp_features)), inplace=True)
        vp_features = F.relu(self.bn2(self.conv2(vp_features)), inplace=True)
        vp_features = self.conv3(vp_features)
        vp_features = vp_features.view(B, -1, num_seed, num_depth)
        end_points['grasp_tolerance_pred'] = vp_features
        # (B, num_angle, Ns, num_depth)
        return end_points

到这里为止，GraspNet的前向传播过程全部结束。我们接下来需要通过pred_decode对预测的end_points字典解码。

def pred_decode(end_points):
    batch_size = len(end_points['point_clouds'])
    grasp_preds = []
    for i in range(batch_size):
        ## load predictions
        objectness_score = end_points['objectness_score'][i].float()
        grasp_score = end_points['grasp_score_pred'][i].float()
        grasp_angle_class_score = end_points['grasp_angle_cls_pred'][i]
        grasp_width = 1.2 * end_points['grasp_width_pred'][i]
        grasp_width = torch.clamp(grasp_width, min=0, max=GRASP_MAX_WIDTH)
        grasp_tolerance = end_points['grasp_tolerance_pred'][i]
        # 以上四个都是网络预测的(A, Ns, D)的Grasp参数
        grasp_center = end_points['fp2_xyz'][i].float()					# (Ns, 3)
        approaching = -end_points['grasp_top_view_xyz'][i].float()		# (Ns, 3)

        ## slice preds by angle
        # grasp angle
        grasp_angle_class = torch.argmax(grasp_angle_class_score, 0)
        # 得到最高分数的in-plane rotation angle类，也就是(1, Ns, D)
        grasp_angle = grasp_angle_class.float() / 12 * np.pi
        # 转换成对应的in-plane rotation angle
        # grasp score & width & tolerance
        grasp_angle_class_ = grasp_angle_class.unsqueeze(0)
        # 这是 (Ns, D) 的indx
        grasp_score = torch.gather(grasp_score, 0, grasp_angle_class_).squeeze(0)
        grasp_width = torch.gather(grasp_width, 0, grasp_angle_class_).squeeze(0)
        grasp_tolerance = torch.gather(grasp_tolerance, 0, grasp_angle_class_).squeeze(0)
		# 对每个Ns和D得到了一个angle, (Ns, D)
        
        ## slice preds by score/depth
        # grasp depth
        grasp_depth_class = torch.argmax(grasp_score, 1, keepdims=True)
        # 在D这个维度，根据grasp_score来选出分数最高的深度
        # (Ns, 1)
        grasp_depth = (grasp_depth_class.float()+1) * 0.01
        # 得到(Ns, 1)的depth
        # grasp score & angle & width & tolerance
        grasp_score = torch.gather(grasp_score, 1, grasp_depth_class)
        grasp_angle = torch.gather(grasp_angle, 1, grasp_depth_class)
        grasp_width = torch.gather(grasp_width, 1, grasp_depth_class)
        grasp_tolerance = torch.gather(grasp_tolerance, 1, grasp_depth_class)
        # 以上四个都是(Ns， 1)，每个seed对应唯一一个分数最高的Grasp

        ## slice preds by objectness
        objectness_pred = torch.argmax(objectness_score, 0)
        # 对每个点的objectness做判断
        objectness_mask = (objectness_pred==1)
        # 根据objectness预测结果得到mask
        grasp_score = grasp_score[objectness_mask]
        grasp_width = grasp_width[objectness_mask]
        grasp_depth = grasp_depth[objectness_mask]
        approaching = approaching[objectness_mask]
        grasp_angle = grasp_angle[objectness_mask]
        grasp_center = grasp_center[objectness_mask]
        grasp_tolerance = grasp_tolerance[objectness_mask]
        # 如果objectness预测没有物体，那么Grasp的所有参数置为空
        grasp_score = grasp_score * grasp_tolerance / GRASP_MAX_TOLERANCE
        # 真正Grasp的分数，要考虑到grasp_tolerance的影响，这里就简单的加权乘了一下

        ## convert to rotation matrix
        Ns = grasp_angle.size(0)
        approaching_ = approaching.view(Ns, 3)
        grasp_angle_ = grasp_angle.view(Ns)
        rotation_matrix = batch_viewpoint_params_to_matrix(approaching_, grasp_angle_)
        rotation_matrix = rotation_matrix.view(Ns, 9)
        #根据我们的预测，把Ns个Grasp的approaching vector和grasp_angle_组合得到(Ns, 9)的旋转矩阵

        # merge preds
        grasp_height = 0.02 * torch.ones_like(grasp_score)
        # 此处ones_like就是创建Ns个0.02的元素，也就是我们认为Grasp_height恒为2cm
        obj_ids = -1 * torch.ones_like(grasp_score)
        grasp_preds.append(torch.cat([grasp_score, grasp_width, grasp_height, grasp_depth, rotation_matrix, grasp_center, obj_ids], axis=-1))
    return grasp_preds

loss.py

def get_loss(end_points):
    # 正如论文所提到的，最终的loss分为三部分
    objectness_loss, end_points = compute_objectness_loss(end_points)
    view_loss, end_points = compute_view_loss(end_points)
    grasp_loss, end_points = compute_grasp_loss(end_points)
    loss = objectness_loss + view_loss + 0.2 * grasp_loss
    end_points['loss/overall_loss'] = loss
    return loss, end_points

def compute_objectness_loss(end_points):
    criterion = nn.CrossEntropyLoss(reduction='mean')
    objectness_score = end_points['objectness_score']	# (B, Ns, 1)
    objectness_label = end_points['objectness_label']	# (B, M, 1)
    fp2_inds = end_points['fp2_inds'].long()
    objectness_label = torch.gather(objectness_label, 1, fp2_inds)	# (B, Ns, 1)
    loss = criterion(objectness_score, objectness_label)
    # 此处就是对Ns个点做二分类的交叉熵

    end_points['loss/stage1_objectness_loss'] = loss
    objectness_pred = torch.argmax(objectness_score, 1)
    end_points['stage1_objectness_acc'] = (objectness_pred == objectness_label.long()).float().mean()
	# 精度
    end_points['stage1_objectness_prec'] = (objectness_pred == objectness_label.long())[objectness_pred == 1].float().mean()
    # 精确率：预测正确的objectness个数占总的正类预测个数的比例
    end_points['stage1_objectness_recall'] = (objectness_pred == objectness_label.long())[objectness_label == 1].float().mean()
    # 召回率：被预测为正类占所有标注的个数

    return loss, end_points

def compute_view_loss(end_points):
    criterion = nn.MSELoss(reduction='none')
    view_score = end_points['view_score']				# (B, Ns, V)
    view_label = end_points['batch_grasp_view_label']	# (B, Ns, V, A, D)
    objectness_label = end_points['objectness_label']	# (B, M, 1)
    fp2_inds = end_points['fp2_inds'].long()			# (B, Ns)
    V = view_label.size(2)
    objectness_label = torch.gather(objectness_label, 1, fp2_inds)	# (B, Ns, 1)

    objectness_mask = (objectness_label > 0)
    objectness_mask = objectness_mask.unsqueeze(-1).repeat(1, 1, V)	# (B, Ns, V)
    pos_view_pred_mask = ((view_score >= THRESH_GOOD) & objectness_mask)
    # 满足这个点附近有物体，并且approaching vector的分数满足阈值

    loss = criterion(view_score, view_label)	#均方误差
    loss = loss[objectness_mask].mean()	# 只取有objectness的部分，其余的如论文所说，忽略掉

    end_points['loss/stage1_view_loss'] = loss
    end_points['stage1_pos_view_pred_count'] = pos_view_pred_mask.long().sum()

    return loss, end_points


def compute_grasp_loss(end_points, use_template_in_training=True):
    top_view_inds = end_points['grasp_top_view_inds'] 	# (B, Ns)
    vp_rot = end_points['grasp_top_view_rot'] 			# (B, Ns, view_factor, 3, 3)
    objectness_label = end_points['objectness_label']
    fp2_inds = end_points['fp2_inds'].long()
    objectness_mask = torch.gather(objectness_label, 1, fp2_inds).bool() # (B, Ns)

    # 从数据集中得到的，每个batch的Ns个点，都有A * D个Grasp
    batch_grasp_label = end_points['batch_grasp_label'] # (B, Ns, A, D)
    batch_grasp_offset = end_points['batch_grasp_offset'] # (B, Ns, A, D, 3)
    batch_grasp_tolerance = end_points['batch_grasp_tolerance'] # (B, Ns, A, D)
    B, Ns, A, D = batch_grasp_label.size()

    # 以下这段都是在从数据集中取出对应的GT
    top_view_grasp_angles = batch_grasp_offset[:, :, :, :, 0] #(B, Ns, A, D)
    top_view_grasp_depths = batch_grasp_offset[:, :, :, :, 1] #(B, Ns, A, D)
    top_view_grasp_widths = batch_grasp_offset[:, :, :, :, 2] #(B, Ns, A, D)
    # 对于每个深度，取出对应的angle-bin分数最高的inds，并且取出对应的GT
    target_labels_inds = torch.argmax(batch_grasp_label, dim=2, keepdim=True) # (B, Ns, 1, D)
    target_labels = torch.gather(batch_grasp_label, 2, target_labels_inds).squeeze(2) # (B, Ns, D)
    target_angles = torch.gather(top_view_grasp_angles, 2, target_labels_inds).squeeze(2) # (B, Ns, D)
    target_depths = torch.gather(top_view_grasp_depths, 2, target_labels_inds).squeeze(2) # (B, Ns, D)
    target_widths = torch.gather(top_view_grasp_widths, 2, target_labels_inds).squeeze(2) # (B, Ns, D)
    target_tolerance = torch.gather(batch_grasp_tolerance, 2, target_labels_inds).squeeze(2) # (B, Ns, D)

    graspable_mask = (target_labels > THRESH_BAD)
    objectness_mask = objectness_mask.unsqueeze(-1).expand_as(graspable_mask)
    loss_mask = (objectness_mask & graspable_mask).float()

    # 1. grasp score loss
    target_labels_inds_ = target_labels_inds.transpose(1, 2) # (B, 1, Ns, D)
    grasp_score = torch.gather(end_points['grasp_score_pred'], 1, target_labels_inds_).squeeze(1)
    # (B, Ns, D)
    grasp_score_loss = huber_loss(grasp_score - target_labels, delta=1.0)
    grasp_score_loss = torch.sum(grasp_score_loss * loss_mask) / (loss_mask.sum() + 1e-6)
    # 只考虑mask==1的情况
    end_points['loss/stage2_grasp_score_loss'] = grasp_score_loss

    # 2. inplane rotation cls loss
    target_angles_cls = target_labels_inds.squeeze(2) # (B, Ns, D)
    criterion_grasp_angle_class = nn.CrossEntropyLoss(reduction='none')
    grasp_angle_class_score = end_points['grasp_angle_cls_pred']
    grasp_angle_class_loss = criterion_grasp_angle_class(grasp_angle_class_score, target_angles_cls)
    grasp_angle_class_loss = torch.sum(grasp_angle_class_loss * loss_mask) / (loss_mask.sum() + 1e-6)
    # 多分类交叉熵
    end_points['loss/stage2_grasp_angle_class_loss'] = grasp_angle_class_loss
    grasp_angle_class_pred = torch.argmax(grasp_angle_class_score, 1)
    # 得到分数最高的in-place rotation
    end_points['stage2_grasp_angle_class_acc/0_degree'] = (grasp_angle_class_pred==target_angles_cls)[loss_mask.bool()].float().mean()
    acc_mask_15 = ((torch.abs(grasp_angle_class_pred-target_angles_cls) <= 1) | (torch.abs(grasp_angle_class_pred-target_angles_cls) >= A - 1))
    end_points['stage2_grasp_angle_class_acc/15_degree'] = acc_mask_15[loss_mask.bool()].float().mean()
    acc_mask_30 = ((torch.abs(grasp_angle_class_pred-target_angles_cls) <= 2) | (torch.abs(grasp_angle_class_pred-target_angles_cls) >= A - 2))
    end_points['stage2_grasp_angle_class_acc/30_degree'] = acc_mask_30[loss_mask.bool()].float().mean()
	# 因为一类相差15度，此处就根据分类的具体位置是否相差1个或2个来计算15度/30度的accuracy
    
    # 3. width reg loss
    grasp_width_pred = torch.gather(end_points['grasp_width_pred'], 1, target_labels_inds_).squeeze(1)
    #  (B, num_angle, Ns, D) => (B, Ns, D)
    grasp_width_loss = huber_loss((grasp_width_pred-target_widths)/GRASP_MAX_WIDTH, delta=1)
    grasp_width_loss = torch.sum(grasp_width_loss * loss_mask) / (loss_mask.sum() + 1e-6)
    # 对width做回归
    end_points['loss/stage2_grasp_width_loss'] = grasp_width_loss

    # 4. tolerance reg loss
    grasp_tolerance_pred = torch.gather(end_points['grasp_tolerance_pred'], 1, target_labels_inds_).squeeze(1)
    # (B, num_angle, Ns, D) => (B, Ns, D)
    grasp_tolerance_loss = huber_loss((grasp_tolerance_pred-target_tolerance)/GRASP_MAX_TOLERANCE, delta=1)
    grasp_tolerance_loss = torch.sum(grasp_tolerance_loss * loss_mask) / (loss_mask.sum() + 1e-6)
    end_points['loss/stage2_grasp_tolerance_loss'] = grasp_tolerance_loss

    grasp_loss = grasp_score_loss + grasp_angle_class_loss\
                + grasp_width_loss + grasp_tolerance_loss
    return grasp_loss, end_points

def huber_loss(error, delta=1.0):
    """
    Args:
        error: Torch tensor (d1,d2,...,dk)
    Returns:
        loss: Torch tensor (d1,d2,...,dk)
    x = error = pred - gt or dist(pred,gt)
    0.5 * |x|^2                 if |x|<=d
    0.5 * d^2 + d * (|x|-d)     if |x|>d
    Author: Charles R. Qi
    Ref: https://github.com/charlesq34/frustum-pointnets/blob/master/models/model_util.py
    """
    abs_error = torch.abs(error)
    quadratic = torch.clamp(abs_error, max=delta)
    linear = (abs_error - quadratic)
    loss = 0.5 * quadratic**2 + delta * linear
    return loss

附言

和浩树学长聊了一下，他很明确地指出这其实是一个多分类任务。并且因为论文中提到了使用离散的Bin来代替回归出一个向量的idea，我就继续请教了一下这样做的原因。因为这一点其实在VoteNet中尝试预测物体的6D pose的时候，也是有体现的。他认为回归在方法论上就不太对，因为正常的情况下Valid Approach Vector / Operations 应当是一个分布，而不是回归出来的一个值。多分类对于每个离散的bin预测出一个score，在我的理解下，就是用细粒度的均匀分布去拟合这个我们所希望预测的分布，这样训练效果就会好。听罢，感叹自己科研的路上还有很多路要走…因为各种各样的论文只会告诉你它的优点是什么，真正的硬伤都不会写在论文里。而要把握领域前沿动向，并且做出正确的选择，那就必须要在这种抽象层面上建构出自己稳定的理论体系，但是我觉得我自己还差了很多这一块的知识积累。真是道阻且长啊。好在，我在努力开发出很多未来可以使用的方法论、视野、论文积累、代码积累等，希望可以在这条路上继续努力！

(CVPR2020)GraspNet-1Billion

https://kami-code.com/2022/01/18/GraspNet-1Billion/

Author

Kami-code

Posted on

2022-01-18

Updated on

2022-02-13

Licensed under

#GraspNet CVPR

(CVPR2020)GraspNet-1Billion

数据集部分

数据收集

数据标定

数据集评估

Loss Function

Operation Network

Cylinder Region Transformation

Rotation和Width

Tolerance Network

Experiments

GT Evaluation

具体代码实现

数据集格式

DataLoader

graspnet.py

loss.py

附言

Author

Posted on

Updated on

Licensed under

Comments

Links

Recents

Archives

Tags

Subscribe for updates

follow.it