.. _chapter_anchor:
Anchor Boxes
============
Object detection algorithms usually sample a large number of regions in
the input image, determine whether these regions contain objects of
interest, and adjust the edges of the regions so as to predict the
ground-truth bounding box of the target more accurately. Different
models may use different region sampling methods. Here, we introduce one
such method: it generates multiple bounding boxes with different sizes
and aspect ratios while centering on each pixel. These bounding boxes
are called anchor boxes. We will practice object detection based on
anchor boxes in the following sections.
First, import the packages or modules required for this section. Here,
we import the ``contrib`` package and change the print precision of
NumPy. Because printing an NDArray actually calls NumPy's print
function, the floating-point numbers in the NDArrays printed in this
section are more concise.
.. code:: python
%matplotlib inline
import d2l
from mxnet import contrib, gluon, image, nd
import numpy as np
np.set_printoptions(2)
Generate Multiple Anchor Boxes
------------------------------
Assume the input image has a height of :math:`h` and width of :math:`w`.
We generate anchor boxes with different shapes centered on each pixel of
the image. Assume the size is :math:`s\in (0,1]`, the aspect ratio is
:math:`r > 0`, and the width and height of the anchor box are
:math:`ws\sqrt{r}` and :math:`hs/\sqrt{r}`, respectively. When the
center position is given, an anchor box with known width and height is
determined.
Below we set a set of sizes :math:`s_1,\ldots, s_n` and a set of aspect
ratios :math:`r_1,\ldots, r_m`. If we use a combination of all sizes and
aspect ratios with each pixel as the center, the input image will have a
total of :math:`whnm` anchor boxes. Although these anchor boxes may
cover all ground-truth bounding boxes, the computational complexity is
often excessive. Therefore, we are usually only interested in the
combinations that contain :math:`s_1` or :math:`r_1`, that is:
.. math:: (s_1, r_1), (s_1, r_2), \ldots, (s_1, r_m), (s_2, r_1), (s_3, r_1), \ldots, (s_n, r_1).
Hence, the number of anchor boxes centered on the same pixel is
:math:`n+m-1`. For the entire input image, we will generate a total of
:math:`wh(n+m-1)` anchor boxes.
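To make the count concrete, the following sketch enumerates the
:math:`(s, r)` pairs kept for a single pixel, in the order of the formula
above. The helper name ``anchor_shape_pairs`` is hypothetical and is not
part of MXNet, and the image size is only an example.

```python
def anchor_shape_pairs(sizes, ratios):
    """Return the (size, ratio) pairs that contain sizes[0] or ratios[0].

    Hypothetical helper for illustration only: the first size is paired
    with every ratio, then each remaining size with the first ratio.
    """
    pairs = [(sizes[0], r) for r in ratios]
    pairs += [(s, ratios[0]) for s in sizes[1:]]
    return pairs


sizes, ratios = [0.75, 0.5, 0.25], [1, 2, 0.5]
pairs = anchor_shape_pairs(sizes, ratios)
print(pairs)

# n + m - 1 shapes per pixel, so w * h * (n + m - 1) boxes per image.
h, w = 561, 728  # example image height and width
total = h * w * len(pairs)
print(len(pairs), total)
```

With three sizes and three ratios, this gives five shapes per pixel
instead of the nine that all combinations would produce.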
The above method of generating anchor boxes has been implemented in the
``MultiBoxPrior`` function. We specify the input, a set of sizes, and a
set of aspect ratios, and the function returns all the generated anchor
boxes.
.. code:: python
img = image.imread('../img/catdog.jpg').asnumpy()
h, w = img.shape[0:2]
print(h, w)
X = nd.random.uniform(shape=(1, 3, h, w)) # Construct input data
Y = contrib.nd.MultiBoxPrior(X, sizes=[0.75, 0.5, 0.25], ratios=[1, 2, 0.5])
Y.shape
.. parsed-literal::
:class: output
561 728
.. parsed-literal::
:class: output
(1, 2042040, 4)
We can see that the shape of the returned anchor box variable ``Y`` is
(batch size, number of anchor boxes, 4). After reshaping ``Y`` to
(image height, image width, number of anchor boxes centered on the same
pixel, 4), we can obtain all the anchor boxes centered on a specified
pixel position. In the following example, we access the first anchor box
centered on (250, 250). It has four elements: the :math:`x, y`
coordinates of the upper-left corner and the :math:`x, y` coordinates of
the lower-right corner of the anchor box. The :math:`x` and :math:`y`
coordinates are divided by the width and height of the image,
respectively, so they are expressed relative to the image size, with 0
and 1 marking the image borders.
.. code:: python
boxes = Y.reshape((h, w, 5, 4))
boxes[250, 250, 0, :]
.. parsed-literal::
:class: output
[0.06 0.07 0.63 0.82]
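As a rough sanity check, the printed coordinates can be reproduced by
hand. The sketch below rests on two assumptions that are not stated in
the text and were chosen only because they match the printed values: the
center lies in the middle of the pixel (index + 0.5), and the normalized
width is scaled by :math:`h/w`, so that :math:`r = 1` gives a square in
pixel units. The function ``anchor_corners`` is hypothetical and is not a
description of the library's internals.

```python
import math


def anchor_corners(row, col, h, w, s, r):
    """Normalized (xmin, ymin, xmax, ymax) of one anchor box.

    Assumptions (not stated in the text): the center sits at the middle
    of the pixel (index + 0.5), and the box width is scaled by h / w so
    that r = 1 gives a square in pixel units.
    """
    cx, cy = (col + 0.5) / w, (row + 0.5) / h
    half_w = s * (h / w) * math.sqrt(r) / 2
    half_h = s / math.sqrt(r) / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


box = anchor_corners(250, 250, h=561, w=728, s=0.75, r=1)
print([round(v, 2) for v in box])
```

Under these assumptions, the rounded values agree with the output shown
above.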
In order to display all anchor boxes centered on one pixel in the
image, we first define the ``show_bboxes`` function to draw multiple
bounding boxes on the image.
.. code:: python
# Save to the d2l package.
def show_bboxes(axes, bboxes, labels=None, colors=None):
    """Show bounding boxes."""
    def _make_list(obj, default_values=None):
        if obj is None:
            obj = default_values
        elif not isinstance(obj, (list, tuple)):
            obj = [obj]
        return obj
    labels = _make_list(labels)
    colors = _make_list(colors, ['b', 'g', 'r', 'm', 'c'])
    for i, bbox in enumerate(bboxes):
        color = colors[i % len(colors)]
        rect = d2l.bbox_to_rect(bbox.asnumpy(), color)
        axes.add_patch(rect)
        if labels and len(labels) > i:
            text_color = 'k' if color == 'w' else 'w'
            axes.text(rect.xy[0], rect.xy[1], labels[i],
                      va='center', ha='center', fontsize=9, color=text_color,
                      bbox=dict(facecolor=color, lw=0))
As we just saw, the :math:`x` and :math:`y` coordinates in the variable
``boxes`` have been divided by the width and height of the image,
respectively. When drawing, we need to restore the original coordinate
values of the anchor boxes, so we define the variable ``bbox_scale``
to multiply them by. Now, we can draw all the anchor boxes centered
on (250, 250) in the image. As you can see, the blue anchor box with a
size of 0.75 and an aspect ratio of 1 covers the dog in the image well.
.. code:: python
d2l.set_figsize((3.5, 2.5))
bbox_scale = nd.array((w, h, w, h))
fig = d2l.plt.imshow(img)
show_bboxes(fig.axes, boxes[250, 250, :, :] * bbox_scale,
['s=0.75, r=1', 's=0.5, r=1', 's=0.25, r=1', 's=0.75, r=2',
's=0.75, r=0.5'])
.. figure:: output_anchor_0ce4f6_9_0.svg
Intersection over Union
-----------------------
We just mentioned that the anchor box covers the dog in the image well.
If the ground-truth bounding box of the target is known, how can “well”
here be quantified? An intuitive method is to measure the similarity
between anchor boxes and the ground-truth bounding box. We know that the
Jaccard index can measure the similarity between two sets. Given sets
:math:`\mathcal{A}` and :math:`\mathcal{B}`, their Jaccard index is the
size of their intersection divided by the size of their union:
.. math:: J(\mathcal{A},\mathcal{B}) = \frac{\left|\mathcal{A} \cap \mathcal{B}\right|}{\left| \mathcal{A} \cup \mathcal{B}\right|}.
In fact, we can consider the pixel area of a bounding box as a
collection of pixels. In this way, we can measure the similarity of two
bounding boxes by the Jaccard index of their pixel sets. When we
measure the similarity of two bounding boxes, we usually refer to the
Jaccard index as intersection over union (IoU), which is the ratio of
the intersecting area to the union area of the two bounding boxes, as
shown in Figure 9.2. The value of IoU ranges between 0 and 1: 0 means
that there are no overlapping pixels between the two bounding boxes,
while 1 indicates that the two bounding boxes coincide exactly.
.. figure:: ../img/iou.svg
IoU is the ratio of the intersecting area to the union area of two
bounding boxes.
For the remainder of this section, we will use IoU to measure the
similarity between anchor boxes and ground-truth bounding boxes, and
between different anchor boxes.
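The definition translates directly into a few lines of plain Python. This
is a minimal sketch that assumes boxes are given in corner format
:math:`(x_{min}, y_{min}, x_{max}, y_{max})`; the function names are
illustrative.

```python
def box_area(box):
    """Area of a box given as (xmin, ymin, xmax, ymax); 0 if it is empty."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(box_a, box_b):
    """Intersection over union of two corner-format boxes."""
    inter = box_area((max(box_a[0], box_b[0]), max(box_a[1], box_b[1]),
                      min(box_a[2], box_b[2]), min(box_a[3], box_b[3])))
    union = box_area(box_a) + box_area(box_b) - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # intersection 1, union 7
print(iou((0, 0, 1, 1), (0, 0, 1, 1)))  # identical boxes
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes
```

The first pair overlaps in a unit square, so its IoU is :math:`1/7`; the
other two pairs give the extreme values 1 and 0.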
Labeling Training Set Anchor Boxes
----------------------------------
In the training set, we consider each anchor box as a training example.
In order to train the object detection model, we need to mark two types
of labels for each anchor box: first, the category of the target
contained in the anchor box (category) and, second, the offset of the
ground-truth bounding box relative to the anchor box (offset). At
prediction time, we first generate multiple anchor boxes, predict the
category and offsets for each anchor box, adjust the anchor box
positions according to the predicted offsets to obtain prediction
bounding boxes, and finally select the prediction bounding boxes to
output.
We know that, in the object detection training set, each image is
labelled with the location of the ground-truth bounding box and the
category of the target contained. After the anchor boxes are generated,
we primarily label anchor boxes based on the location and category
information of the ground-truth bounding boxes similar to the anchor
boxes. So how do we assign ground-truth bounding boxes to anchor boxes
similar to them?
Assume the anchor boxes in the image are
:math:`A_1, A_2, \ldots, A_{n_a}` and the ground-truth bounding boxes
are :math:`B_1, B_2, \ldots, B_{n_b}` and :math:`n_a \geq n_b`. Define
matrix :math:`\boldsymbol{X} \in \mathbb{R}^{n_a \times n_b}`, where
element :math:`x_{ij}` in the :math:`i`\ th row and :math:`j`\ th column
is the IoU of the anchor box :math:`A_i` to the ground-truth bounding
box :math:`B_j`. First, we find the largest element in the matrix
:math:`\boldsymbol{X}` and record the row index and column index of the
element as :math:`i_1,j_1`. We assign the ground-truth bounding box
:math:`B_{j_1}` to the anchor box :math:`A_{i_1}`. Obviously, anchor box
:math:`A_{i_1}` and ground-truth bounding box :math:`B_{j_1}` have the
highest similarity among all the “anchor box - ground-truth bounding
box” pairings. Next, discard all elements in the :math:`i_1`\ th row and
the :math:`j_1`\ th column in the matrix :math:`\boldsymbol{X}`. Find
the largest remaining element in the matrix :math:`\boldsymbol{X}` and
record the row index and column index of the element as :math:`i_2,j_2`.
We assign ground-truth bounding box :math:`B_{j_2}` to anchor box
:math:`A_{i_2}` and then discard all elements in the :math:`i_2`\ th row
and the :math:`j_2`\ th column in the matrix :math:`\boldsymbol{X}`. At
this point, elements in two rows and two columns in the matrix
:math:`\boldsymbol{X}` have been discarded. We proceed until the
elements in all :math:`n_b` columns of the matrix :math:`\boldsymbol{X}`
have been discarded. At this point, we have assigned a ground-truth
bounding box to each of :math:`n_b` anchor boxes. Next, we traverse only
the remaining :math:`n_a - n_b` anchor boxes. Given anchor box :math:`A_i`,
find the bounding box :math:`B_j` with the largest IoU with :math:`A_i`
according to the :math:`i`\ th row of the matrix :math:`\boldsymbol{X}`,
and only assign ground-truth bounding box :math:`B_j` to anchor box
:math:`A_i` when the IoU is greater than the predetermined threshold.
As shown in Figure 9.3 (left), assuming that the maximum value in the
matrix :math:`\boldsymbol{X}` is :math:`x_{23}`, we will assign
ground-truth bounding box :math:`B_3` to anchor box :math:`A_2`. Then,
we discard all the elements in row 2 and column 3 of the matrix, find
the largest element :math:`x_{71}` of the remaining shaded area, and
assign ground-truth bounding box :math:`B_1` to anchor box :math:`A_7`.
Then, as shown in Figure 9.3 (middle), discard all the elements in row 7
and column 1 of the matrix, find the largest element :math:`x_{54}` of
the remaining shaded area, and assign ground-truth bounding box
:math:`B_4` to anchor box :math:`A_5`. Finally, as shown in Figure 9.3
(right), discard all the elements in row 5 and column 4 of the matrix,
find the largest element :math:`x_{92}` of the remaining shaded area,
and assign ground-truth bounding box :math:`B_2` to anchor box
:math:`A_9`. After that, we only need to traverse the remaining anchor
boxes, that is, all except :math:`A_2, A_5, A_7, A_9`, and determine
whether to assign a ground-truth bounding box to each of them according
to the threshold.
.. figure:: ../img/anchor-label.svg
Assign ground-truth bounding boxes to anchor boxes.
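The two-stage assignment described above can be sketched as follows. The
function ``assign_ground_truth`` is a hypothetical helper that works on a
precomputed IoU matrix, and the matrix values are made up for
illustration; the sketch is not the ``MultiBoxTarget`` implementation.

```python
def assign_ground_truth(iou_matrix, threshold=0.5):
    """Assign a ground-truth index (or -1 for none) to every anchor box.

    iou_matrix[i][j] is the IoU of anchor box i and ground-truth box j.
    """
    n_a, n_b = len(iou_matrix), len(iou_matrix[0])
    assigned = [-1] * n_a
    free_rows, free_cols = set(range(n_a)), set(range(n_b))

    # Stage 1: repeatedly take the largest remaining element, then
    # discard its row and column, until every column has been used.
    while free_rows and free_cols:
        i, j = max(((i, j) for i in free_rows for j in free_cols),
                   key=lambda ij: iou_matrix[ij[0]][ij[1]])
        assigned[i] = j
        free_rows.discard(i)
        free_cols.discard(j)

    # Stage 2: each remaining anchor box takes its best ground-truth box
    # only if the IoU is greater than the threshold.
    for i in free_rows:
        j = max(range(n_b), key=lambda j: iou_matrix[i][j])
        if iou_matrix[i][j] > threshold:
            assigned[i] = j
    return assigned


# Made-up IoU values: 5 anchor boxes (rows), 2 ground-truth boxes (columns).
X = [[0.05, 0.00],
     [0.14, 0.00],
     [0.00, 0.57],
     [0.00, 0.21],
     [0.00, 0.75]]
print(assign_ground_truth(X))
```

Note that the first stage assigns a box regardless of the threshold, so
every ground-truth bounding box ends up with at least one anchor box; the
threshold only applies in the second stage.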
Now we can label the categories and offsets of the anchor boxes. If an
anchor box :math:`A` is assigned ground-truth bounding box :math:`B`,
the category of the anchor box :math:`A` is set to the category of
:math:`B` and the offset of the anchor box :math:`A` is set according to
the relative position of the central coordinates of :math:`B` and
:math:`A` and the relative sizes of the two boxes. Because the positions
and sizes of various boxes in the data set may vary, these relative
positions and relative sizes usually require some special
transformations to make the offset distribution more uniform and easier
to fit. Assume the center coordinates of anchor box :math:`A` and its
assigned ground-truth bounding box :math:`B` are
:math:`(x_a, y_a), (x_b, y_b)`, the widths of :math:`A` and :math:`B`
are :math:`w_a, w_b`, and their heights are :math:`h_a, h_b`,
respectively. In this case, a common technique is to label the offset of
:math:`A` as
.. math::
\left( \frac{ \frac{x_b - x_a}{w_a} - \mu_x }{\sigma_x},
\frac{ \frac{y_b - y_a}{h_a} - \mu_y }{\sigma_y},
\frac{ \log \frac{w_b}{w_a} - \mu_w }{\sigma_w},
\frac{ \log \frac{h_b}{h_a} - \mu_h }{\sigma_h}\right),
where the default values of the constants are
:math:`\mu_x = \mu_y = \mu_w = \mu_h = 0`, :math:`\sigma_x=\sigma_y=0.1`,
and :math:`\sigma_w=\sigma_h=0.2`.
If an anchor box is not assigned a ground-truth bounding box, we only
need to set the category of the anchor box to background. Anchor boxes
whose category is background are often referred to as negative anchor
boxes, and the rest are referred to as positive anchor boxes.
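The offset formula can be written out in plain Python as a minimal
sketch. It assumes corner-format boxes and the default constants given
above; the names ``to_center`` and ``label_offset`` and the two boxes are
illustrative.

```python
import math


def to_center(box):
    """Convert (xmin, ymin, xmax, ymax) to (cx, cy, width, height)."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2,
            box[2] - box[0], box[3] - box[1])


def label_offset(anchor, gt, mu=(0, 0, 0, 0), sigma=(0.1, 0.1, 0.2, 0.2)):
    """Offset of ground-truth box gt relative to anchor, per the formula."""
    xa, ya, wa, ha = to_center(anchor)
    xb, yb, wb, hb = to_center(gt)
    raw = ((xb - xa) / wa, (yb - ya) / ha,
           math.log(wb / wa), math.log(hb / ha))
    return tuple((v - m) / s for v, m, s in zip(raw, mu, sigma))


anchor = (0.15, 0.2, 0.4, 0.4)   # illustrative anchor box
gt = (0.1, 0.08, 0.52, 0.92)     # illustrative ground-truth box
offset = label_offset(anchor, gt)
print([round(v, 2) for v in offset])
```

Dividing by small :math:`\sigma` values stretches the offsets, which are
typically small fractions, to a scale that is easier to fit.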
Below we demonstrate a detailed example. We define ground-truth bounding
boxes for the cat and dog in the image we loaded, where the first
element is the category (0 for dog, 1 for cat) and the remaining four
elements are the :math:`x, y` coordinates of the upper-left corner and
the :math:`x, y` coordinates of the lower-right corner (the value range
is between 0 and 1). Here, we construct five anchor boxes to be labeled,
given by the coordinates of their upper-left and lower-right corners and
denoted :math:`A_0, \ldots, A_4`, respectively (the index in the program
starts from 0). First, draw the positions of these anchor boxes and the
ground-truth bounding boxes in the image.
.. code:: python
ground_truth = nd.array([[0, 0.1, 0.08, 0.52, 0.92],
[1, 0.55, 0.2, 0.9, 0.88]])
anchors = nd.array([[0, 0.1, 0.2, 0.3], [0.15, 0.2, 0.4, 0.4],
[0.63, 0.05, 0.88, 0.98], [0.66, 0.45, 0.8, 0.8],
[0.57, 0.3, 0.92, 0.9]])
fig = d2l.plt.imshow(img)
show_bboxes(fig.axes, ground_truth[:, 1:] * bbox_scale, ['dog', 'cat'], 'k')
show_bboxes(fig.axes, anchors * bbox_scale, ['0', '1', '2', '3', '4']);
.. figure:: output_anchor_0ce4f6_11_0.svg
We can label categories and offsets for anchor boxes by using the
``MultiBoxTarget`` function in the ``contrib.nd`` module. This function
sets the background category to 0 and shifts the zero-based integer
index of each target category up by 1 (1 for dog and 2 for cat). We add
an example dimension to the anchor boxes and ground-truth bounding boxes
by using the ``expand_dims`` function, and construct placeholder
predicted results (all zeros here) with a shape of (batch size, number
of categories including background, number of anchor boxes).
.. code:: python
labels = contrib.nd.MultiBoxTarget(anchors.expand_dims(axis=0),
ground_truth.expand_dims(axis=0),
nd.zeros((1, 3, 5)))
There are three items in the returned result, all of which are in
NDArray format. The third item contains the categories labeled for the
anchor boxes.
.. code:: python
labels[2]
.. parsed-literal::
:class: output
[[0. 1. 2. 0. 2.]]
We analyze these labelled categories based on positions of anchor boxes
and ground-truth bounding boxes in the image. First, in all “anchor box
- ground-truth bounding box” pairs, the IoU of anchor box :math:`A_4` to
the ground-truth bounding box of the cat is the largest, so the category
of anchor box :math:`A_4` is labeled as cat. Without considering anchor
box :math:`A_4` or the ground-truth bounding box of the cat, in the
remaining “anchor box - ground-truth bounding box” pairs, the pair with
the largest IoU is anchor box :math:`A_1` and the ground-truth bounding
box of the dog, so the category of anchor box :math:`A_1` is labeled as
dog. Next, traverse the remaining three unlabeled anchor boxes. The
category of the ground-truth bounding box with the largest IoU with
anchor box :math:`A_0` is dog, but the IoU is smaller than the threshold
(the default is 0.5), so the category is labeled as background; the
category of the ground-truth bounding box with the largest IoU with
anchor box :math:`A_2` is cat and the IoU is greater than the threshold,
so the category is labeled as cat; the category of the ground-truth
bounding box with the largest IoU with anchor box :math:`A_3` is cat,
but the IoU is smaller than the threshold, so the category is labeled as
background.
The second item of the return value is a mask variable, with the shape
of (batch size, four times the number of anchor boxes). The elements in
the mask variable correspond one-to-one with the four offset values of
each anchor box. Because we don’t care about background detection,
offsets of the negative class should not affect the objective function.
Through element-wise multiplication, the zeros in the mask variable
filter out negative class offsets before the objective function is
calculated.
.. code:: python
labels[1]
.. parsed-literal::
:class: output
[[0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1.]]
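The relationship between the category labels and the mask can be
sketched in plain Python. The helper names are hypothetical, the
predicted and target offsets are made up, and an L1 loss is assumed
purely for illustration.

```python
def offset_mask(class_labels):
    """One mask entry per offset value: 1 for positive anchors, else 0."""
    return [1.0 if c > 0 else 0.0 for c in class_labels for _ in range(4)]


def masked_l1(pred, target, mask):
    """Sum of absolute offset errors over positive anchor boxes only."""
    return sum(m * abs(p - t) for p, t, m in zip(pred, target, mask))


class_labels = [0, 1, 2, 0, 2]        # 0 denotes background
mask = offset_mask(class_labels)
print(mask)

pred = [0.5] * 20                     # made-up predicted offsets
target = [0.0] * 20                   # made-up target offsets
print(masked_l1(pred, target, mask))  # only 12 of the 20 entries count
```

Each label is repeated four times, once per offset value, which
reproduces the pattern of zeros and ones printed above.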
The first item returned is the four offset values labeled for each
anchor box, with the offsets of negative class anchor boxes labeled as
0.
.. code:: python
labels[0]
.. parsed-literal::
:class: output
[[ 0.00e+00 0.00e+00 0.00e+00 0.00e+00 1.40e+00 1.00e+01 2.59e+00
7.18e+00 -1.20e+00 2.69e-01 1.68e+00 -1.57e+00 0.00e+00 0.00e+00
0.00e+00 0.00e+00 -5.71e-01 -1.00e+00 -8.94e-07 6.26e-01]]
Output Bounding Boxes for Prediction
------------------------------------
During the prediction phase, we first generate multiple anchor boxes
for the image and then predict categories and offsets for these anchor
boxes one by one. Then, we obtain prediction bounding boxes based on the
anchor boxes and their predicted offsets. When there are many anchor
boxes, many similar prediction bounding boxes may be output for the same
target. To simplify the results, we can remove similar prediction
bounding boxes. A commonly used method is called non-maximum suppression
(NMS).
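Obtaining a prediction bounding box from an anchor box and its predicted
offsets amounts to inverting the labeling formula. The sketch below
assumes the model predicts offsets in the same normalized form used for
labeling, with the same default constants; ``decode_box`` is a
hypothetical helper and not necessarily how ``MultiBoxDetection`` is
implemented.

```python
import math


def decode_box(anchor, offset, mu=(0, 0, 0, 0), sigma=(0.1, 0.1, 0.2, 0.2)):
    """Apply predicted offsets to a corner-format anchor box."""
    xa, ya = (anchor[0] + anchor[2]) / 2, (anchor[1] + anchor[3]) / 2
    wa, ha = anchor[2] - anchor[0], anchor[3] - anchor[1]
    # Undo the normalization: multiply by sigma, then add mu.
    dx, dy, dw, dh = (o * s + m for o, s, m in zip(offset, sigma, mu))
    x, y = xa + dx * wa, ya + dy * ha
    w, h = wa * math.exp(dw), ha * math.exp(dh)
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)


anchor = (0.15, 0.2, 0.4, 0.4)                 # illustrative anchor box
same = decode_box(anchor, (0, 0, 0, 0))        # zero offsets keep the anchor
shifted = decode_box(anchor, (1.4, 10.0, 0, 0))  # moves the center only
print(same)
print(shifted)
```

With all offsets equal to zero, the prediction bounding box is the anchor
box itself.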
Let us take a look at how NMS works. For a prediction bounding box
:math:`B`, the model calculates the predicted probability for each
category. If the largest predicted probability is :math:`p`, the
category corresponding to this probability is the predicted category of
:math:`B`. We also refer to :math:`p` as the confidence level of
prediction bounding box :math:`B`. On the same image, we sort the
prediction bounding boxes with predicted categories other than
background by confidence level from high to low, and obtain the list
:math:`L`. Select the prediction bounding box :math:`B_1` with the
highest confidence level from :math:`L` as a baseline and remove from
:math:`L` all non-baseline prediction bounding boxes whose IoU with
:math:`B_1` is greater than a certain threshold. The threshold here is a
preset hyper-parameter. At this point, :math:`L` retains the prediction
bounding box with the highest confidence level and no longer contains
the prediction bounding boxes similar to it. Next, select the prediction
bounding box :math:`B_2` with the second highest confidence level from
:math:`L` as a baseline, and remove from :math:`L` all non-baseline
prediction bounding boxes whose IoU with :math:`B_2` is greater than the
threshold. Repeat this process until all prediction bounding boxes in
:math:`L` have been used as a baseline. At this time, the IoU of any
pair of prediction bounding boxes in :math:`L` does not exceed the
threshold. Finally, output all prediction bounding boxes in the list
:math:`L`.
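The procedure can be sketched in plain Python as follows. This minimal
version ignores the predicted category when comparing boxes, as in the
description above, and assumes corner-format boxes; the function names
and the four boxes are illustrative, and the sketch is not the
``MultiBoxDetection`` implementation.

```python
def iou(a, b):
    """IoU of two (xmin, ymin, xmax, ymax) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, threshold=0.5):
    """Return indices of the boxes kept, highest confidence first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # current baseline
        keep.append(best)
        # Drop every remaining box that overlaps the baseline too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= threshold]
    return keep


boxes = [(0.1, 0.08, 0.52, 0.92), (0.08, 0.2, 0.56, 0.95),
         (0.15, 0.3, 0.62, 0.91), (0.55, 0.2, 0.9, 0.88)]
scores = [0.9, 0.8, 0.7, 0.9]
kept = nms(boxes, scores)
print(kept)
```

Here the second and third boxes overlap the first one heavily and are
removed, while the fourth box does not overlap it at all and is kept.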
Next, we will look at a detailed example. First, construct four anchor
boxes. For the sake of simplicity, we assume that the predicted offsets
are all 0, which means that the prediction bounding boxes are the anchor
boxes themselves. Then, we construct a predicted probability for each
category.
.. code:: python
anchors = nd.array([[0.1, 0.08, 0.52, 0.92], [0.08, 0.2, 0.56, 0.95],
[0.15, 0.3, 0.62, 0.91], [0.55, 0.2, 0.9, 0.88]])
offset_preds = nd.array([0] * anchors.size)
cls_probs = nd.array([[0] * 4, # Predicted probability for background
[0.9, 0.8, 0.7, 0.1], # Predicted probability for dog
[0.1, 0.2, 0.3, 0.9]]) # Predicted probability for cat
Print prediction bounding boxes and their confidence levels on the
image.
.. code:: python
fig = d2l.plt.imshow(img)
show_bboxes(fig.axes, anchors * bbox_scale,
['dog=0.9', 'dog=0.8', 'dog=0.7', 'cat=0.9'])
.. figure:: output_anchor_0ce4f6_23_0.svg
We use the ``MultiBoxDetection`` function of the ``contrib.nd`` module
to perform NMS and set the threshold to 0.5. We add an example
dimension to each NDArray input. We can see that the shape of the
returned result is (batch size, number of anchor boxes, 6). The 6
elements of each row represent the output information for the same
prediction bounding box. The first element is the predicted category
index, which starts from 0 (0 is dog, 1 is cat). The value -1 indicates
background or removal by NMS. The second element is the confidence level
of the prediction bounding box. The remaining four elements are the
:math:`x, y` coordinates of the upper-left corner and the
:math:`x, y` coordinates of the lower-right corner of the
prediction bounding box (the value range is between 0 and 1).
.. code:: python
output = contrib.ndarray.MultiBoxDetection(
cls_probs.expand_dims(axis=0), offset_preds.expand_dims(axis=0),
anchors.expand_dims(axis=0), nms_threshold=0.5)
output
.. parsed-literal::
:class: output
[[[ 0. 0.9 0.1 0.08 0.52 0.92]
[ 1. 0.9 0.55 0.2 0.9 0.88]
[-1. 0.8 0.08 0.2 0.56 0.95]
[-1. 0.7 0.15 0.3 0.62 0.91]]]
We remove the prediction bounding boxes of category -1 and visualize the
results retained by NMS.
.. code:: python
fig = d2l.plt.imshow(img)
for i in output[0].asnumpy():
    if i[0] == -1:
        continue
    label = ('dog=', 'cat=')[int(i[0])] + str(i[1])
    show_bboxes(fig.axes, [nd.array(i[2:]) * bbox_scale], label)
.. figure:: output_anchor_0ce4f6_27_0.svg
In practice, we can remove prediction bounding boxes with lower
confidence levels before performing NMS, thereby reducing the amount of
computation for NMS. We can also filter the output of NMS, for example,
by only retaining results with higher confidence levels as the final
output.
Summary
-------
- We generate multiple anchor boxes with different sizes and aspect
ratios, centered on each pixel.
- IoU, also called the Jaccard index, measures the similarity of two
bounding boxes. It is the ratio of the intersecting area to the union
area of two bounding boxes.
- In the training set, we mark two types of labels for each anchor box:
one is the category of the target contained in the anchor box and the
other is the offset of the ground-truth bounding box relative to the
anchor box.
- When predicting, we can use non-maximum suppression (NMS) to remove
similar prediction bounding boxes, thereby simplifying the results.
Exercises
---------
- Change the ``sizes`` and ``ratios`` values in
``contrib.nd.MultiBoxPrior`` and observe the changes to the generated
anchor boxes.
- Construct two bounding boxes with an IoU of 0.5, and observe how much
they overlap.
- Verify the output of offset ``labels[0]`` by marking the anchor box
offsets as defined in this section (the constant is the default
value).
- Modify the variable ``anchors`` in the “Labeling Training Set Anchor
Boxes” and “Output Bounding Boxes for Prediction” sections. How do
the results change?