On evaluation metrics

We would like to draw attention to the set of evaluation metrics currently used in the nuScenes Object Detection Task.

There is growing interest in the literature in task-aware metrics, based on the idea that not all objects are equally important to detect. Among the nuScenes metrics, only PKL goes in this direction, but with an important limitation: it does not differentiate between the impact of detection mistakes on the safety of the driving task and their impact on its reliability.

We would like to propose the metrics we introduced in a recent work for inclusion in the evaluation suite:

A. Ceccarelli and L. Montecchi, "Evaluating Object (Mis)Detection From a Safety and Reliability Perspective: Discussion and Measures", IEEE Access, vol. 11, pp. 44952-44963, May 2023.

We also provide a modified implementation of the nuscenes-devkit library that computes the metrics proposed in the paper, so including them among the metrics evaluated for the challenge, at least experimentally, should be straightforward.
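
To make the integration concrete, below is a minimal sketch of how a per-object, safety-aware metric could be attached to the standard nuscenes-devkit evaluation. Only `config_factory` and `DetectionEval` are the devkit's real entry points; the `safety_weight` function and the file paths are hypothetical placeholders for illustration, not the weighting actually defined in our paper.

```python
# Minimal sketch: attaching a safety-weighted recall to the standard
# nuscenes-devkit evaluation. DetectionEval and config_factory are the real
# devkit entry points; safety_weight() is a hypothetical placeholder.
from nuscenes import NuScenes
from nuscenes.eval.common.config import config_factory
from nuscenes.eval.detection.evaluate import DetectionEval


def safety_weight(box) -> float:
    """Hypothetical per-object weight: objects closer to the ego vehicle
    count more. The actual weighting in the paper differs."""
    ego_dist = (box.ego_translation[0] ** 2 + box.ego_translation[1] ** 2) ** 0.5
    return 1.0 / max(ego_dist, 1.0)


nusc = NuScenes(version='v1.0-trainval', dataroot='/data/nuscenes', verbose=False)
eval_ = DetectionEval(nusc,
                      config=config_factory('detection_cvpr_2019'),
                      result_path='results/detections.json',  # example path
                      eval_set='val',
                      output_dir='results/eval',
                      verbose=False)

# eval_.gt_boxes / eval_.pred_boxes are EvalBoxes, already filtered by the devkit.
# Weighted recall: missing a high-criticality object costs more than a distant one.
total, detected = 0.0, 0.0
for sample_token in eval_.gt_boxes.sample_tokens:
    preds = eval_.pred_boxes[sample_token]
    for gt in eval_.gt_boxes[sample_token]:
        w = safety_weight(gt)
        total += w
        # Match by class and 2D center distance < 2 m (one of the standard
        # nuScenes matching thresholds).
        if any(p.detection_name == gt.detection_name and
               ((p.translation[0] - gt.translation[0]) ** 2 +
                (p.translation[1] - gt.translation[1]) ** 2) ** 0.5 < 2.0
               for p in preds):
            detected += w
print(f'Safety-weighted recall: {detected / total:.3f}')
```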

There are also other approaches that go in this direction, but as far as we know they are either not implemented on nuScenes, or their implementation has not been released to the public. For example:

  • A. Bansal, et al., "Risk ranked recall: Collision safety metric for object detection systems in autonomous vehicles", Proc. 10th Medit. Conf. Embedded Comput. (MECO), pp. 1-4, Jun. 2021, which ranks each object into one of three categories (imminent collision, potential collision, no collision) based on its collision risk (a simplified sketch follows this list).

  • G. Volk, et al., "A comprehensive safety metric to evaluate perception in autonomous systems", Proc. IEEE 23rd Int. Conf. Intell. Transp. Syst. (ITSC), pp. 1-8, Sep. 2020, which combines scores measuring detection quality, collision potential, and the time needed to make the detection. This yields a safety score for a test scenario, on a five-class scale from insufficient to excellent (see the second sketch below).
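
For readers unfamiliar with the first approach, here is a heavily simplified, self-contained sketch of the idea behind risk ranked recall. The time-to-collision model and its thresholds are assumptions for illustration only; the paper derives collision risk from ego and object trajectories.

```python
# Simplified sketch of risk ranked recall (Bansal et al.): compute recall
# separately per collision-risk category. The naive time-to-collision model
# and the thresholds below are assumptions, not the paper's risk model.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TrackedObject:
    distance_m: float          # distance from the ego vehicle
    closing_speed_mps: float   # positive when object and ego approach each other
    detected: bool             # did the detector find this ground-truth object?


def risk_category(obj: TrackedObject) -> str:
    """Bucket an object by a naive time-to-collision estimate."""
    if obj.closing_speed_mps <= 0:
        return 'no_collision'
    ttc = obj.distance_m / obj.closing_speed_mps
    if ttc < 2.0:   # assumed threshold, not from the paper
        return 'imminent_collision'
    if ttc < 6.0:   # assumed threshold, not from the paper
        return 'potential_collision'
    return 'no_collision'


def risk_ranked_recall(objects: List[TrackedObject]) -> Dict[str, float]:
    """Recall computed separately for each risk category."""
    recall = {}
    for cat in ('imminent_collision', 'potential_collision', 'no_collision'):
        in_cat = [o for o in objects if risk_category(o) == cat]
        if in_cat:
            recall[cat] = sum(o.detected for o in in_cat) / len(in_cat)
    return recall


print(risk_ranked_recall([
    TrackedObject(8.0, 5.0, True),    # ttc 1.6 s -> imminent, detected
    TrackedObject(30.0, 6.0, False),  # ttc 5.0 s -> potential, missed
    TrackedObject(50.0, -1.0, True),  # moving away -> no collision
]))
```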

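And a similarly simplified sketch of the second approach's aggregation step: three normalized component scores are combined into a single safety score and mapped to one of five classes. The weights and the equally spaced bins are assumptions for illustration, not the paper's actual definitions.

```python
# Illustrative aggregation in the spirit of Volk et al.: combine normalized
# component scores (1.0 = best) into one of five safety classes.
# The weights and the equally spaced bins are assumptions, not from the paper.
SAFETY_CLASSES = ('insufficient', 'poor', 'adequate', 'good', 'excellent')


def scenario_safety_class(detection_quality: float,
                          collision_safety: float,
                          timeliness: float,
                          weights=(0.5, 0.3, 0.2)) -> str:
    """Map a weighted combination of three scores in [0, 1] to a class."""
    scores = (detection_quality, collision_safety, timeliness)
    assert all(0.0 <= s <= 1.0 for s in scores)
    combined = sum(w * s for w, s in zip(weights, scores))
    return SAFETY_CLASSES[min(int(combined * 5), 4)]


print(scenario_safety_class(0.7, 0.7, 0.7))  # combined score 0.70 -> 'good'
```
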
We look forward to your opinion on this matter!