The goal is to predict the localization of objects in an image via bounding boxes, together with the classes of the located objects. With a deep learning approach, we tackle this problem with two types of models, called one-stage and two-stage detectors. Even if they share a common goal, the two approaches are slightly different.

Common tricks used in object detection models

Bounding Box Regression

We define by p $=(p_{x},p_{y},p_{w},p_{h})$ the prediction of the bounding box regressor, where $p_x$ and $p_y$ are the center coordinates and $p_w$ and $p_h$ its width and height, and by g $=(g_{x},g_{y},g_{w},g_{h})$ the corresponding ground truth box coordinates. The regressor is configured to learn a scale-invariant transformation between the two centers and a log-scale transformation between the widths and heights. All the transformation functions take p as input.

| Predicted box $\hat{g}$ | Regression targets $t$ |
| ----------- | ----------- |
| $$\hat{g}_x = p_w d_x(\mathbf{p}) + p_x$$ | $$t_x = (g_x - p_x) / p_w$$ |
| $$\hat{g}_y = p_h d_y(\mathbf{p}) + p_y$$ | $$t_y = (g_y - p_y) / p_h$$ |
| $$\hat{g}_w = p_w \exp({d_w(\mathbf{p})})$$ | $$t_w = \log(g_w/p_w)$$ |
| $$\hat{g}_h = p_h \exp({d_h(\mathbf{p})})$$ | $$t_h = \log(g_h/p_h)$$ |
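
As a minimal sketch of these transformations (the function names and the use of NumPy are my own, not taken from any particular detector implementation):

```python
import numpy as np

def apply_deltas(p, d):
    """Reconstruct a box (cx, cy, w, h) from a prior box p and predicted deltas d."""
    px, py, pw, ph = p
    dx, dy, dw, dh = d
    gx = pw * dx + px          # shift the center, scaled by the prior width
    gy = ph * dy + py          # shift the center, scaled by the prior height
    gw = pw * np.exp(dw)       # log-scale transformation of the width
    gh = ph * np.exp(dh)       # log-scale transformation of the height
    return np.array([gx, gy, gw, gh])

def regression_targets(p, g):
    """Compute the targets (t_x, t_y, t_w, t_h) the regressor should learn."""
    px, py, pw, ph = p
    gx, gy, gw, gh = g
    return np.array([(gx - px) / pw,
                     (gy - py) / ph,
                     np.log(gw / pw),
                     np.log(gh / ph)])
```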

Intersection Over Union (IoU)

The IoU is an evaluation metric based on the degree of overlap between a predicted object (a bounding box in our case) and the actual ground truth bounding box (hand labeled).

$$IoU(\mathcal{A},\mathcal{B}) = \frac{\left|\mathcal{A} \cap \mathcal{B}\right|}{\left| \mathcal{A} \cup \mathcal{B}\right|}$$
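
A small sketch of this metric on corner-format boxes (x_min, y_min, x_max, y_max); the box format is an assumption, as the formula itself is format-agnostic:

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union = area_a + area_b - intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```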

Hard example mining

Detectors usually make many more predictions than there are objects in the image, so there are far more negative matches than positive matches. This creates a class imbalance that hurts training: the model spends most of its capacity learning the background rather than detecting objects. However, we still need negative samples so the model can learn what constitutes a bad prediction. We therefore pick only the hardest negatives and make sure the ratio between the picked negatives and the positives is at most 3:1. This process allows fast and stable training.
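
A sketch of this selection step, assuming we already have a per-anchor classification loss and a boolean mask of positive anchors (ranking negatives by their loss is one common way to define "hardest"; the array names are illustrative):

```python
import numpy as np

def hard_negative_mining(losses, positive_mask, neg_pos_ratio=3):
    """Keep all positives plus the hardest negatives, at most ratio * #positives.

    losses        : per-anchor classification loss (higher = harder example)
    positive_mask : boolean array marking anchors matched to an object
    Returns a boolean mask of anchors kept for the loss computation.
    """
    num_pos = int(positive_mask.sum())
    num_neg = neg_pos_ratio * num_pos
    neg_losses = np.where(positive_mask, -np.inf, losses)  # exclude positives
    hardest = np.argsort(-neg_losses)[:num_neg]            # top-loss negatives
    keep = positive_mask.copy()
    keep[hardest] = True
    return keep
```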

Non-maximum suppression (NMS) at inference

Detectors typically make duplicate detections for the same object. To tackle this problem, non-maximum suppression is applied to remove the duplicates with lower confidence. The algorithm is quite simple yet extremely effective, although it can require a lot of computation: we sort the predictions by confidence score and go down the list one by one; if a higher-ranked prediction has the same class and an IoU greater than 0.5 with the current prediction, the current prediction is removed from the list.
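
A plain-Python sketch of this greedy procedure, reusing the `iou` helper defined above (the way boxes, scores and classes are stored as parallel lists is an assumption):

```python
def non_max_suppression(boxes, scores, classes, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring boxes, drop same-class overlaps."""
    # Process predictions from most to least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        suppressed = any(
            classes[i] == classes[j] and iou(boxes[i], boxes[j]) > iou_threshold
            for j in keep
        )
        if not suppressed:
            keep.append(i)
    return keep  # indices of the predictions that survive
```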

Anchors

Because approximating the coordinates of a bounding box directly from raw pixels is suboptimal, owing to the high variance of pixel values, we introduce the concept of anchors. Anchors can be seen as plausible bounding box candidates generated from the input image; for each pixel we typically generate 9 anchors with the following technique. Assume the input image has a height of $h$ and a width of $w$. We generate anchor boxes with different shapes centered on every pixel of the image. Given a size s$\in$(0,1] and an aspect ratio r$>$0, the width and height of the anchor box are $ws\sqrt{r}$ and $hs/\sqrt{r}$, respectively. When the center position is given, an anchor box with known width and height is determined.
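
A sketch of how the anchor shapes are computed; the particular size and ratio values below are placeholders (the text only fixes the $ws\sqrt{r}$ and $hs/\sqrt{r}$ formulas), chosen so that 3 sizes and 3 ratios give the 9 anchors per pixel mentioned above:

```python
import numpy as np

def anchor_shapes(h, w, sizes=(0.75, 0.5, 0.25), ratios=(1.0, 2.0, 0.5)):
    """Width/height of the anchor boxes for an image of height h and width w.

    Each (size s, ratio r) pair gives an anchor of width w*s*sqrt(r)
    and height h*s/sqrt(r); these shapes are then centered on every pixel.
    """
    shapes = []
    for s in sizes:
        for r in ratios:
            shapes.append((w * s * np.sqrt(r), h * s / np.sqrt(r)))
    return shapes
```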

During training, anchor boxes need to be labeled with two targets: the object class and the offset of the ground truth bounding box relative to the anchor box. Assigning a category and an offset to each anchor box is what enables the prediction head to predict good bounding boxes and class targets. The assignment is performed using a similarity measure, the Jaccard index, also known as the IoU metric.

For a given object bounding box $\mathcal{B}_{i}$ and the set of anchor boxes $\mathcal{A}_{j}$, the matching anchor is the one whose index satisfies $\arg\max_{j} IoU(\mathcal{B}_{i}, \mathcal{A}_{j})$. During training, the anchors selected this way are known as positive anchors, while the anchors that are not selected are called negative anchors.
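
A minimal sketch of this assignment, again reusing the `iou` helper from the IoU section (real detectors usually also promote anchors above an IoU threshold, which the text does not cover, so it is left out here):

```python
import numpy as np

def assign_anchors(gt_boxes, anchors):
    """Match each ground-truth box to the anchor with the highest IoU.

    gt_boxes : list of ground-truth boxes (x_min, y_min, x_max, y_max)
    anchors  : list of anchor boxes in the same format
    Returns (positive, matched_gt): positive marks the selected anchors and
    matched_gt[j] is the index of the object assigned to anchor j (or -1).
    """
    positive = np.zeros(len(anchors), dtype=bool)
    matched_gt = np.full(len(anchors), -1)
    for i, gt in enumerate(gt_boxes):
        overlaps = [iou(gt, a) for a in anchors]   # IoU against every anchor
        j = int(np.argmax(overlaps))               # argmax_j IoU(B_i, A_j)
        positive[j] = True
        matched_gt[j] = i
    return positive, matched_gt  # anchors never selected remain negative
```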

Region of interest pooling (ROI pooling)

ROI pooling is a type of max pooling layer that converts the features inside a projected region of the image, of any given size h x w, into a small fixed window of size H x W. The input region is divided into an H x W grid of sub-windows, each of approximate size h/H x w/W, and max pooling is applied within each sub-window.
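
A minimal sketch of the operation on a single-channel feature region, assuming the region is at least as large as the output grid:

```python
import numpy as np

def roi_pool(features, out_h, out_w):
    """Max-pool an (h, w) feature region into a fixed (out_h, out_w) window."""
    h, w = features.shape
    # Split the region into out_h x out_w roughly equal sub-windows.
    row_edges = np.linspace(0, h, out_h + 1).astype(int)
    col_edges = np.linspace(0, w, out_w + 1).astype(int)
    out = np.zeros((out_h, out_w), dtype=features.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cell = features[row_edges[i]:row_edges[i + 1],
                            col_edges[j]:col_edges[j + 1]]
            out[i, j] = cell.max()   # max pooling within each sub-window
    return out
```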