Is there a reason why you don’t use the Hungarian algorithm to match predictions to ground truth labels?
If I understood your code correctly, you sort the predictions by their score first.
Then you take the prediction with the highest score and try to find a match by searching for the ground truth label with the shortest distance to this prediction.
If this distance is below a threshold you say “it’s a match” and remove the label so other predictions cannot be matched to it.
Then you continue with the prediction with the second highest score and so on.
More or less here in your code
Am I right?
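To make sure we are talking about the same procedure, here is a minimal sketch of the greedy matching as I understand it. It is not your actual code: the function name `greedy_match`, the 1-D positions, and the box names and scores are all made up for illustration, and I am assuming the closest label is searched only among the labels that are still unmatched.

```python
# Hypothetical sketch of score-sorted greedy matching (not the authors' code).
# Positions are 1-D coordinates in metres to keep the example small.

def greedy_match(predictions, ground_truths, threshold=2.0):
    """predictions: {name: (position, score)}, ground_truths: {name: position}."""
    unmatched = dict(ground_truths)  # labels still available for matching
    matches, false_positives = [], []

    # Visit predictions from highest to lowest score.
    ordered = sorted(predictions.items(), key=lambda kv: kv[1][1], reverse=True)
    for pred_name, (position, _score) in ordered:
        if unmatched:
            # Closest remaining ground truth label to this prediction.
            gt_name = min(unmatched, key=lambda g: abs(unmatched[g] - position))
            distance = abs(unmatched[gt_name] - position)
            if distance < threshold:
                matches.append((pred_name, gt_name, distance))
                del unmatched[gt_name]  # no other prediction may take this label
                continue
        false_positives.append(pred_name)
    return matches, false_positives


# Made-up numbers: two labels and three predictions with descending scores.
GROUND_TRUTHS = {"left": 0.0, "right": 2.0}
PREDICTIONS = {
    "violet": (0.9, 0.9),
    "blue": (0.1, 0.8),
    "green": (2.1, 0.7),
}

matches, false_positives = greedy_match(PREDICTIONS, GROUND_TRUTHS)
mean_error = sum(d for _, _, d in matches) / len(matches)
print(matches, false_positives, mean_error)
```

With these made-up numbers the highest-scoring prediction takes the label that a lower-scoring but much closer prediction would have fit better.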
Just to visualize this with an example:
In the image below, the task is to detect computer screens, and the ground truth labels are shown as red boxes.
The violet box, which has the highest score, would be matched to the left ground truth label, assuming that the distance is below the threshold (2 m according to your paper).
Then the blue box would be matched to the right ground truth label with the same assumption.
And the green box would be a False Positive.
The CLEAR metric, in contrast, uses the Hungarian method on the distances for matching.
This would result in a match between the blue prediction and the left ground truth label, a match between the green prediction and the right ground truth label, and a False Positive for the violet box.
The distance error would be much lower than the one computed with your matching algorithm.
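To illustrate what I mean by distance-based optimal matching, here is a small sketch. It does not implement the Hungarian method itself; it brute-forces all assignments, which should give the same optimal result on tiny inputs (in practice one would probably use something like `scipy.optimize.linear_sum_assignment`). The function name `optimal_match`, the 1-D positions, and the numbers are my own assumptions, and I am assuming the goal is as many valid matches as possible and, among those, the smallest total distance.

```python
from itertools import permutations

# Hypothetical brute-force stand-in for Hungarian matching on tiny inputs.
# Positions are 1-D coordinates in metres; scores are ignored on purpose.

def optimal_match(predictions, ground_truths, threshold=2.0):
    """predictions: {name: position}, ground_truths: {name: position}."""
    pred_names = list(predictions)
    gt_names = list(ground_truths)
    # None means "this ground truth label stays unmatched".
    candidates = pred_names + [None] * len(gt_names)

    best_key, best_pairs = None, []
    for assignment in permutations(candidates, len(gt_names)):
        pairs = [
            (p, g, abs(predictions[p] - ground_truths[g]))
            for p, g in zip(assignment, gt_names)
            if p is not None
        ]
        # Discard assignments containing a pair at or beyond the threshold.
        if any(d >= threshold for _, _, d in pairs):
            continue
        # Prefer more matches first, then the smaller total distance.
        key = (-len(pairs), sum(d for _, _, d in pairs))
        if best_key is None or key < best_key:
            best_key, best_pairs = key, pairs

    matched = {p for p, _, _ in best_pairs}
    false_positives = [p for p in pred_names if p not in matched]
    return best_pairs, false_positives


# Made-up numbers roughly mirroring the screen example.
GROUND_TRUTHS = {"left": 0.0, "right": 2.0}
PREDICTIONS = {"violet": 0.9, "blue": 0.1, "green": 2.1}

pairs, false_positives = optimal_match(PREDICTIONS, GROUND_TRUTHS)
mean_error = sum(d for _, _, d in pairs) / len(pairs)
print(pairs, false_positives, mean_error)
```

With these made-up positions, blue goes to the left label, green to the right label, and violet is left over as the False Positive, with a mean distance of roughly 0.1 m.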