**Dataset Split:** Of our 190 scenes, we use 100 for training and 90 for testing. We further divide the test set into three categories: 30 scenes with seen objects, 30 with unseen but similar objects, and 30 with novel objects. We hope this split better evaluates the generalization ability of different methods.

**Evaluation Code:** The evaluation code is available on the GraspNet GitHub. You can also refer to the API documentation.

**Metrics:** We do not pre-compute ground-truth labels for the test set; instead, we adopt an online evaluation algorithm to evaluate grasp accuracy.

We first describe how we decide whether a single grasp pose is a true positive. Each predicted grasp pose $\mathbf{\hat{P}}_i$ is associated with its target object by checking the point cloud inside the gripper. Then, similar to the process of generating grasp annotations, we obtain a binary label for each grasp pose using the force-closure metric under a given friction coefficient $\mu$.

For cluttered scenes, grasp pose prediction algorithms are expected to predict multiple grasps. Since a grasp is usually executed right after prediction, the percentage of true positives matters most. Thus, we adopt $Precision@k$ as our evaluation metric, which measures the precision of the top-$k$ ranked grasps. $AP_\mu$ denotes the average of $Precision@k$ for $k$ ranging from 1 to 50, given friction $\mu$. Similar to MS-COCO, we report $AP_\mu$ at different values of $\mu$. Specifically, **AP** denotes the average of $AP_\mu$ for $\mu$ ranging from $0.2$ to $1.2$ with an interval of $\Delta\mu = 0.2$.
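The metric described above can be sketched as follows. This is a minimal illustration, not the official evaluation code: the function names are hypothetical, the binary labels are assumed to be already sorted by descending confidence, and we assume that when fewer than $k$ grasps are predicted the missing ones count as failures.

```python
import numpy as np

def precision_at_k(labels, k):
    """Fraction of true positives among the top-k ranked grasps.

    labels: binary sequence (1 = true positive), sorted by descending confidence.
    Assumption: if fewer than k grasps exist, the missing ones count as failures.
    """
    labels = np.asarray(labels, dtype=float)
    return float(labels[:k].sum() / k)

def ap_mu(labels, max_k=50):
    """Average of Precision@k for k = 1..max_k, for one friction value."""
    return float(np.mean([precision_at_k(labels, k) for k in range(1, max_k + 1)]))

def average_ap(labels_by_mu, max_k=50):
    """Average AP_mu over all friction values.

    labels_by_mu: dict mapping each friction value (e.g. 0.2, 0.4, ..., 1.2)
    to the binary labels obtained under that friction.
    """
    return float(np.mean([ap_mu(l, max_k) for l in labels_by_mu.values()]))
```

For instance, `average_ap({0.2: labels_02, 0.4: labels_04})` would average the two per-friction scores, where `labels_02` and `labels_04` are placeholder names for the label lists at each friction value.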

To prevent the evaluation from being dominated by similar grasp poses or by grasp poses from a single object, we run a pose-NMS before evaluation.

**Grasp Pose NMS:** For two grasps $\mathbf{G}_1$ and $\mathbf{G}_2$, we define the grasp pose distance $D(\mathbf{G}_1, \mathbf{G}_2)$ as a tuple:
\begin{equation}
D(\mathbf{G}_1, \mathbf{G}_2) = (d_t(\mathbf{G}_1, \mathbf{G}_2), d_{\alpha}(\mathbf{G}_1, \mathbf{G}_2)),
\end{equation}
where $d_t(\mathbf{G}_1, \mathbf{G}_2)$ and $d_{\alpha}(\mathbf{G}_1, \mathbf{G}_2)$ denote the translation distance and the rotation distance between the two grasps, respectively. Let a grasp pose $\mathbf{G}$ be represented by a translation vector $\mathbf{t}$ and a rotation matrix $\mathbf{R}$; then $d_t(\cdot)$ and $d_{\alpha}(\cdot)$ are defined as:
\begin{equation}
\begin{split}
d_t(\mathbf{G}_1, \mathbf{G}_2) &= ||\mathbf{t}_1 - \mathbf{t}_2||,\\
d_{\alpha}(\mathbf{G}_1, \mathbf{G}_2) &= \arccos\left(\frac{1}{2} \left(\mathrm{tr}(\mathbf{R}_1\cdot \mathbf{R}_2^\mathrm{T}) - 1\right)\right),
\end{split}
\end{equation}
where $\mathrm{tr}(\mathbf{M})$ denotes the trace of matrix $\mathbf{M}$.
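As an illustration, the two distances can be computed with NumPy as below. This is a sketch with hypothetical function names; the clipping step is our own addition to guard against floating-point values that fall slightly outside $[-1, 1]$.

```python
import numpy as np

def translation_distance(t1, t2):
    """Euclidean distance between two translation vectors."""
    return float(np.linalg.norm(np.asarray(t1, dtype=float) - np.asarray(t2, dtype=float)))

def rotation_distance(R1, R2):
    """Angle (in radians) of the relative rotation R1 @ R2^T."""
    R1 = np.asarray(R1, dtype=float)
    R2 = np.asarray(R2, dtype=float)
    cos_angle = 0.5 * (np.trace(R1 @ R2.T) - 1.0)
    # Clip to the valid arccos domain to absorb numerical round-off.
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
```

For example, the identity rotation and a 90-degree rotation about the z-axis give a relative rotation with trace 1, so the rotation distance is $\arccos(0) = \pi/2$.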
Since translation and rotation are not in the same metric space, we define the NMS threshold as a tuple as well. Let $TH = (th_d, th_{\alpha})$; we say $D(\mathbf{G}_1, \mathbf{G}_2) < TH$ if and only if
\begin{equation}
d_t(\mathbf{G}_1, \mathbf{G}_2) < th_d,\quad d_{\alpha}(\mathbf{G}_1, \mathbf{G}_2) < th_{\alpha}.
\end{equation}
Based on this tuple metric, two grasps are merged when their distance is lower than $TH$. Meanwhile, only the top $K$ grasps from each object, ranked by confidence score, are considered; the remaining grasps are omitted. In evaluation, we set $th_d = 3$ cm, $th_{\alpha} = 30$ degrees, and $K = 10$.
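A greedy version of this procedure might look like the sketch below. It is not the official implementation and rests on several assumptions: "merging" is interpreted as keeping the higher-scoring grasp and discarding the other, duplicates are checked against all kept grasps regardless of object, translations are in meters, and the function name and arguments are hypothetical.

```python
import numpy as np

def grasp_pose_nms(translations, rotations, scores, object_ids,
                   th_d=0.03, th_alpha=np.deg2rad(30.0), top_k=10):
    """Return indices of kept grasps, ordered by descending score.

    translations: (N, 3) array in meters (assumed unit)
    rotations:    (N, 3, 3) array of rotation matrices
    scores:       (N,) confidence scores
    object_ids:   length-N sequence giving the object each grasp targets
    """
    translations = np.asarray(translations, dtype=float)
    rotations = np.asarray(rotations, dtype=float)
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")

    kept = []
    kept_per_object = {}
    for i in order:
        obj = object_ids[i]
        # Keep at most top_k grasps per object.
        if kept_per_object.get(obj, 0) >= top_k:
            continue
        duplicate = False
        for j in kept:
            d_t = np.linalg.norm(translations[i] - translations[j])
            cos_a = 0.5 * (np.trace(rotations[i] @ rotations[j].T) - 1.0)
            d_a = np.arccos(np.clip(cos_a, -1.0, 1.0))
            # Tuple comparison: both components must be below their thresholds.
            if d_t < th_d and d_a < th_alpha:
                duplicate = True
                break
        if not duplicate:
            kept.append(int(i))
            kept_per_object[obj] = kept_per_object.get(obj, 0) + 1
    return kept
```

Note that a grasp is suppressed only when both the translation and the rotation distance fall below their thresholds, so two grasps at the same position but with very different orientations are both retained.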

Copyright © 2021 Machine Vision and Intelligence Group, Shanghai Jiao Tong University.