Getting all ordered sample data in a scene

Hi, I’m currently working on a 3D object tracking problem using temporal data. In my case, I use the n previous frames to predict the current frame. Until now I have only used keyframes for this, via the next field in the sample object, but I would also like to use non-keyframes. As far as I know, you have suggested using the get_boxes() method for that. I also have an idea for a script to annotate these non-keyframes. For that, I wonder: how can I iterate through all frames in a scene, i.e. including all non-keyframes? And does it make sense to annotate objects in non-keyframes by assigning each newly generated box the class of the closest ground-truth box in the keyframe? Thanks in advance!

Hi.

  • To get a lidar pointcloud, simply do nusc.get('sample_data', sample['data']['LIDAR_TOP']). Then you can repeatedly call prev or next to navigate to the non-keyframe lidar pointclouds.
  • get_boxes() can be used to get the interpolated annotations at non-keyframes (see the sketch after this list).
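
A minimal sketch of this loop (the version and dataroot below are placeholders for your local setup):

    from nuscenes.nuscenes import NuScenes

    nusc = NuScenes(version='v1.0-mini', dataroot='/data/sets/nuscenes', verbose=False)

    # Start from the first keyframe of a scene and grab its lidar sample_data.
    scene = nusc.scene[0]
    sample = nusc.get('sample', scene['first_sample_token'])
    sd = nusc.get('sample_data', sample['data']['LIDAR_TOP'])

    # Follow the 'next' pointers through every lidar frame of the scene,
    # keyframes and non-keyframes alike.
    while sd:
        # get_boxes() returns the annotations for this sample_data token,
        # interpolated between the neighbouring keyframes for non-keyframes.
        boxes = nusc.get_boxes(sd['token'])
        print(sd['timestamp'], sd['is_key_frame'], len(boxes))
        sd = nusc.get('sample_data', sd['next']) if sd['next'] else None

Note that get_boxes() returns the boxes in global coordinates, whereas get_sample_data() already transforms them into the sensor frame.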

Hi @holger-motional,
Thanks for your answer.

I’ve tweaked export.kitti.py so that it now traverses both keyframes and non-keyframes and converts them into KITTI-Tracking format. However, when I render the images after conversion, the boxes are not as accurate as in this video. I didn’t call get_boxes() directly, since it is also used by get_sample_data(). For the keyframes, the conversion to KITTI was always successful, but I can’t locate where my mistake is.

Here are two example renders after conversion (Scene-0061, Frames: 9, 11):


As you said, I traverse through non-keyframes by repeatedly calling next after doing nusc.get('sample_data', sample['data']['LIDAR_TOP']).

I just made the following changes:

        sd_record = self.nusc.get('sample_data', sample_record_lidar['token'])
        curr_sample_record = self.nusc.get('sample', sd_record['sample_token'])
        sample_annotation_tokens = curr_sample_record['anns']
        cam_front_token = sample_record_cam['token']
        lidar_token = sample_record_lidar['token']

and also this:

        for sample_annotation_token in sample_annotation_tokens:
            _, box_lidar_nusc, _ = self.nusc.get_sample_data(lidar_token, box_vis_level=BoxVisibility.ANY,
                                                                selected_anntokens=[sample_annotation_token])
            sample_annotation = self.nusc.get('sample_annotation', sample_annotation_token)


            # Get box in LIDAR frame.
            box_lidar_nusc = box_lidar_nusc[0]

            # # Truncated: Set all objects to 0 which means untruncated.
            # truncated = 0.0

            # # Occluded: Set all objects to full visibility as this information is not available in nuScenes.
            # occluded = 0

            obj = dict()

            # Add frame_id of the object
            obj["frame"] = frame_count


            # Convert nuScenes category to nuScenes detection challenge category.
            obj["detection_name"] = category_to_detection_name(sample_annotation['category_name'])

            # Skip categories that are not part of the nuScenes detection challenge.
            if obj["detection_name"] is None or obj["detection_name"] not in CLASS_MAP.keys():
                continue

            obj["detection_name"] = CLASS_MAP[obj["detection_name"]]

            # Convert from nuScenes to KITTI box format.
            obj["box_cam_kitti"] = KittiDB.box_nuscenes_to_kitti(
                box_lidar_nusc, Quaternion(matrix=velo_to_cam_rot), velo_to_cam_trans, r0_rect)

            # Project 3d box to 2d box in image, ignore box if it does not fall inside.
            bbox_2d = project_to_2d(obj["box_cam_kitti"], p_left_kitti, imsize[1], imsize[0])
            if bbox_2d is None:
                continue
            obj["bbox_2d"] = bbox_2d["bbox"]
            obj["truncated"] = bbox_2d["truncated"]

            # Set dummy score so we can use this file as result.
            obj["box_cam_kitti"].score = 0
            
            # Add track_id for this particular sample in the scene
            object_id_dict = dict()
            if scene["token"] in objects_of_scene:
                object_id_dict = objects_of_scene[scene["token"]]
            else:
                objects_of_scene[scene["token"]] = object_id_dict
            
            instance_token = sample_annotation["instance_token"]
            if instance_token in object_id_dict:
                obj["track_id"] =  object_id_dict[instance_token]
            else:
                # New object for this scene
                object_id_dict[instance_token] = object_id_counter
                obj["track_id"] =  object_id_counter
                object_id_counter += 1
            
            v = np.dot(obj["box_cam_kitti"].rotation_matrix, np.array([1, 0, 0]))
            rot_y = -np.arctan2(v[2], v[0])
            obj["alpha"] = -10.0
            obj["depth"] = np.linalg.norm(np.array(obj["box_cam_kitti"].center[:3]))
            objects.append(obj)

        objects = postprocessing(objects, imsize[1], imsize[0])

The rest of the code is practically identical to your implementation.
Again, thanks in advance for your help.

Hi. You didn’t show where you get the various transformations, e.g. velo_to_cam_rot, velo_to_cam_trans and r0_rect.

Hi @holger-motional, sorry for my late reply.

Below you can find the entire code. As I said, I adjusted your implementation to KITTI-Tracking format:

-----Update 03.07.2021-----

        # Find the frames of each scene
        objects_of_scene = dict()
        scenes = self.create_splits_tokens(self.split, self.nusc)
        for scene in scenes:
            object_id_counter = 0
            frame_count = 0
            path_var = "training"
            print(scene["name"], "Conversion started...")
            sample_record = self.nusc.get('sample', scene['first_sample_token'])
            sample_record_lidar = self.nusc.get('sample_data', sample_record['data']['LIDAR_TOP'])
            sample_record_cam = self.nusc.get('sample_data', sample_record['data']['CAM_FRONT'])
            object_id_counter = self.convert(frame_count, sample_record_lidar, sample_record_cam, scene, objects_of_scene, kitti_to_nu_lidar, kitti_to_nu_lidar_inv, tokens, imsize, object_id_counter, paths[path_var])

            print(scene["name"], frame_count)

            while not sample_record_lidar['next'] == "" and not sample_record_cam['next'] == "":
                # Get sample data.
                frame_count += 1
                print(scene["name"], frame_count)
                sample_record_lidar = self.nusc.get('sample_data', sample_record_lidar['next'])
                sample_record_cam = self.nusc.get('sample_data', sample_record_cam['next'])
                object_id_counter = self.convert(frame_count, sample_record_lidar, sample_record_cam, scene, objects_of_scene, kitti_to_nu_lidar, kitti_to_nu_lidar_inv, tokens, imsize, object_id_counter, paths[path_var])
            
            print(scene["name"], "Conversion finished.")


    def convert(self, frame_count, sample_record_lidar, sample_record_cam, scene, objects_of_scene, kitti_to_nu_lidar, kitti_to_nu_lidar_inv, tokens, imsize, object_id_counter, paths):
        sd_record = self.nusc.get('sample_data', sample_record_lidar['token'])
        curr_sample_record = self.nusc.get('sample', sd_record['sample_token'])
        sample_annotation_tokens = curr_sample_record['anns']
        cam_front_token = sample_record_cam['token']
        lidar_token = sample_record_lidar['token']
        sample_name = "%06d" % frame_count
        scene_name = "%04d" % int(scene['name'][-4:])
        
        
        # Retrieve sensor records.
        sd_record_cam = self.nusc.get('sample_data', cam_front_token)
        sd_record_lid = self.nusc.get('sample_data', lidar_token)
        cs_record_cam = self.nusc.get('calibrated_sensor', sd_record_cam['calibrated_sensor_token'])
        cs_record_lid = self.nusc.get('calibrated_sensor', sd_record_lid['calibrated_sensor_token'])

        # Combine transformations and convert to KITTI format.
        # Note: cam uses same conventions in KITTI and nuScenes.
        lid_to_ego = transform_matrix(cs_record_lid['translation'], Quaternion(cs_record_lid['rotation']),
                                    inverse=False)
        ego_to_cam = transform_matrix(cs_record_cam['translation'], Quaternion(cs_record_cam['rotation']),
                                    inverse=True)
        velo_to_cam = np.dot(ego_to_cam, lid_to_ego)

        # Convert from KITTI to nuScenes LIDAR coordinates, where we apply velo_to_cam.
        velo_to_cam_kitti = np.dot(velo_to_cam, kitti_to_nu_lidar.transformation_matrix)

        # Currently not used.
        imu_to_velo_kitti = np.zeros((3, 4))  # Dummy values.
        r0_rect = Quaternion(axis=[1, 0, 0], angle=0)  # Dummy values.

        # Projection matrix.
        p_left_kitti = np.zeros((3, 4))
        p_left_kitti[:3, :3] = cs_record_cam['camera_intrinsic']  # Cameras are always rectified.

        # Create KITTI style transforms.
        velo_to_cam_rot = velo_to_cam_kitti[:3, :3]
        velo_to_cam_trans = velo_to_cam_kitti[:3, 3]

        # Check that the rotation has the same format as in KITTI.
        assert (velo_to_cam_rot.round(0) == np.array([[0, -1, 0], [0, 0, -1], [1, 0, 0]])).all()
        assert (velo_to_cam_trans[1:3] < 0).all()

        # Retrieve the token from the lidar.
        # Note that this may be confusing as the filename of the camera will include the timestamp of the lidar,
        # not the camera.
        filename_cam_full = sd_record_cam['filename']
        filename_lid_full = sd_record_lid['filename']


        # Convert image (jpg to png).
        src_im_path = os.path.join(self.nusc.dataroot, filename_cam_full)
        scene_im_path = os.path.join(paths["image"],scene_name)
        if not os.path.exists(scene_im_path):
            os.makedirs(scene_im_path)
        dst_im_path = os.path.join(scene_im_path, sample_name + '.png')
        if not os.path.exists(dst_im_path):
            im = Image.open(src_im_path)
            im.save(dst_im_path, "PNG")

        # Convert lidar.
        # Note that we are only using a single sweep, instead of the commonly used n sweeps.
        src_lid_path = os.path.join(self.nusc.dataroot, filename_lid_full)
        scene_lid_path = os.path.join(paths["lidar"],scene_name)
        if not os.path.exists(scene_lid_path):
            os.makedirs(scene_lid_path)
        dst_lid_path = os.path.join(scene_lid_path, sample_name + '.bin')
        assert not dst_lid_path.endswith('.pcd.bin')
        pcl = LidarPointCloud.from_file(src_lid_path)
        #pcl, _ = LidarPointCloud.from_file_multisweep(self.nusc, sample_record, self.lidar_name, self.lidar_name, nsweeps=10)
        pcl.rotate(kitti_to_nu_lidar_inv.rotation_matrix)  # In KITTI lidar frame.
        with open(dst_lid_path, "w") as lid_file:
            pcl.points.T.tofile(lid_file)

        # Add to tokens.
        tokens.append(sample_record_lidar['token'])

        # Create calibration file.
        kitti_transforms = dict()
        kitti_transforms['P0'] = np.zeros((3, 4))  # Dummy values.
        kitti_transforms['P1'] = np.zeros((3, 4))  # Dummy values.
        kitti_transforms['P2'] = p_left_kitti  # Left camera transform.
        kitti_transforms['P3'] = np.zeros((3, 4))  # Dummy values.
        kitti_transforms['R0_rect'] = r0_rect.rotation_matrix  # Cameras are already rectified.
        kitti_transforms['Tr_velo_to_cam'] = np.hstack((velo_to_cam_rot, velo_to_cam_trans.reshape(3, 1)))
        kitti_transforms['Tr_imu_to_velo'] = imu_to_velo_kitti
        calib_path = os.path.join(paths["calib"], scene_name + '.txt')
        if not os.path.exists(calib_path):
            with open(calib_path, "w") as calib_file:
                for (key, val) in kitti_transforms.items():
                    val = val.flatten()
                    val_str = '%.12e' % val[0]
                    for v in val[1:]:
                        val_str += ' %.12e' % v
                    calib_file.write('%s: %s\n' % (key, val_str))

        # Write label file.
        label_path = os.path.join(paths["label"], scene_name + '.txt')
        # if os.path.exists(label_path):
        #     # print('Skipping existing file: %s' % label_path)
        #     continue
        # # else:
        # #     print('Writing file: %s' % label_path)

        objects = []
        box_lidar_nusc_list = self.nusc.get_boxes(sample_data_token=lidar_token)
        for box_lidar_nusc in box_lidar_nusc_list:


            # Get box in LIDAR frame.
            #box_lidar_nusc = box_lidar_nusc[0]


            # Retrieve sensor & pose records
            sd_record = self.nusc.get('sample_data', lidar_token)
            cs_record = self.nusc.get('calibrated_sensor', sd_record['calibrated_sensor_token'])
            sensor_record = self.nusc.get('sensor', cs_record['sensor_token'])
            pose_record = self.nusc.get('ego_pose', sd_record['ego_pose_token'])

            if sensor_record['modality'] == 'camera':
                cam_intrinsic = np.array(cs_record['camera_intrinsic'])
                imsize_tmp = (sd_record['width'], sd_record['height'])
            else:
                cam_intrinsic = None
                imsize_tmp = None
            # Move box to ego vehicle coord system.
            box_lidar_nusc.translate(-np.array(pose_record['translation']))
            box_lidar_nusc.rotate(Quaternion(pose_record['rotation']).inverse)

            #  Move box to sensor coord system.
            box_lidar_nusc.translate(-np.array(cs_record['translation']))
            box_lidar_nusc.rotate(Quaternion(cs_record['rotation']).inverse)

            if sensor_record['modality'] == 'camera' and not \
                box_in_image(box_lidar_nusc, cam_intrinsic, imsize_tmp, vis_level=BoxVisibility.NONE):
                continue

            # Truncated: Set all objects to 0 which means untruncated.
            truncated = 0.0

            # Occluded: Set all objects to full visibility as this information is not available in nuScenes.
            occluded = 0

            obj = dict()

            # Add frame_id of the object
            obj["frame"] = frame_count


            # Convert nuScenes category to nuScenes detection challenge category.
            obj["detection_name"] = category_to_detection_name(box_lidar_nusc.name)

            # Skip categories that are not part of the nuScenes detection challenge.
            if obj["detection_name"] is None or obj["detection_name"] not in CLASS_MAP.keys():
                continue

            obj["detection_name"] = CLASS_MAP[obj["detection_name"]]

            # Convert from nuScenes to KITTI box format.
            obj["box_cam_kitti"] = KittiDB.box_nuscenes_to_kitti(
                box_lidar_nusc, Quaternion(matrix=velo_to_cam_rot), velo_to_cam_trans, r0_rect)

            # Project 3d box to 2d box in image, ignore box if it does not fall inside.
            bbox_2d = project_to_2d(obj["box_cam_kitti"], p_left_kitti, imsize[1], imsize[0])
            if bbox_2d is None:
                continue
            obj["bbox_2d"] = bbox_2d["bbox"]
            obj["truncated"] = bbox_2d["truncated"]

            # Set dummy score so we can use this file as result.
            obj["box_cam_kitti"].score = 0
            
            # Add track_id for this particular sample in the scene
            object_id_dict = dict()
            if scene["token"] in objects_of_scene:
                object_id_dict = objects_of_scene[scene["token"]]
            else:
                objects_of_scene[scene["token"]] = object_id_dict
            
            # Map the box back to its instance token so the track id stays consistent
            # across keyframes (the box token is a sample_annotation token).
            instance_token = self.nusc.get('sample_annotation', box_lidar_nusc.token)['instance_token']
            if instance_token in object_id_dict:
                obj["track_id"] =  object_id_dict[instance_token]
            else:
                # New object for this scene
                object_id_dict[instance_token] = object_id_counter
                obj["track_id"] =  object_id_counter
                object_id_counter += 1
            
            v = np.dot(obj["box_cam_kitti"].rotation_matrix, np.array([1, 0, 0]))
            rot_y = -np.arctan2(v[2], v[0])
            obj["alpha"] = -10.0
            obj["depth"] = np.linalg.norm(np.array(obj["box_cam_kitti"].center[:3]))
            objects.append(obj)

        objects = postprocessing(objects, imsize[1], imsize[0])

        with open(label_path, "a") as label_file:
            for obj in objects:
                # Convert box to output string format.
                output = box_to_string(frame=obj["frame"], track_id = obj["track_id"], name=obj["detection_name"], box=obj["box_cam_kitti"], bbox_2d=obj["bbox_2d"],
                                            truncation=obj["truncated"], occlusion=obj["occluded"], alpha=obj["alpha"])
                label_file.write(output + '\n')
        
        return object_id_counter

Hi @holger-motional, I solved the problem by looking at your render_scene() function. The problem was caused by the fact that camera and lidar data are not synchronized: stepping through both next chains in lockstep pairs a lidar frame with a camera frame from a different point in time. By matching the frames via their timestamps instead, I could solve the problem. Thank you very much for your support!
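
For reference, a minimal sketch of one possible way to do this matching (an illustration, not necessarily the exact fix): for every lidar sample_data, advance the CAM_FRONT chain until it is closest in time.

    def pair_lidar_with_camera(nusc, scene):
        # Pair every LIDAR_TOP sample_data of a scene with the CAM_FRONT
        # sample_data that is closest in time, instead of stepping both
        # 'next' chains in lockstep.
        first_sample = nusc.get('sample', scene['first_sample_token'])
        lid_sd = nusc.get('sample_data', first_sample['data']['LIDAR_TOP'])
        cam_sd = nusc.get('sample_data', first_sample['data']['CAM_FRONT'])

        pairs = []
        while lid_sd:
            # Advance the camera chain while the next camera frame is closer
            # in time to the current lidar frame than the current camera frame.
            while cam_sd['next']:
                nxt = nusc.get('sample_data', cam_sd['next'])
                if abs(nxt['timestamp'] - lid_sd['timestamp']) < abs(cam_sd['timestamp'] - lid_sd['timestamp']):
                    cam_sd = nxt
                else:
                    break
            pairs.append((lid_sd['token'], cam_sd['token']))
            lid_sd = nusc.get('sample_data', lid_sd['next']) if lid_sd['next'] else None
        return pairs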


Great! Btw., just to be nitpicky: I would say that they are “synchronized”. The camera is triggered as the lidar sweeps through the center of its field of view. It’s just that there are 6 cameras for the full 360 degree rotation, so they have a (more or less) constant offset to the lidar, depending on their location.
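
To see this in numbers, here is a small sketch (assuming a local copy of the mini split; version and dataroot are placeholders) that measures the CAM_FRONT vs. LIDAR_TOP timestamp offset over the keyframes of one scene:

    import numpy as np
    from nuscenes.nuscenes import NuScenes

    nusc = NuScenes(version='v1.0-mini', dataroot='/data/sets/nuscenes', verbose=False)
    scene = nusc.scene[0]

    offsets = []
    sample = nusc.get('sample', scene['first_sample_token'])
    while True:
        cam = nusc.get('sample_data', sample['data']['CAM_FRONT'])
        lid = nusc.get('sample_data', sample['data']['LIDAR_TOP'])
        offsets.append((cam['timestamp'] - lid['timestamp']) / 1e6)  # microseconds -> seconds
        if not sample['next']:
            break
        sample = nusc.get('sample', sample['next'])

    print('CAM_FRONT - LIDAR_TOP offset: mean %+.4f s, std %.4f s' % (np.mean(offsets), np.std(offsets)))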
