Multiscale Object Detection
===========================
In :numref:`chapter_anchor`, we generated multiple anchor boxes
centered on each pixel of the input image. These anchor boxes are used
to sample different regions of the input image. However, if anchor
boxes are generated centered on every pixel of the image, we quickly
get too many anchor boxes to compute. For example, assume the input
image has a height of 561 pixels and a width of 728 pixels. If five
anchor boxes with different shapes are generated centered on each
pixel, over two million anchor boxes (:math:`561 \times 728 \times 5`)
need to be predicted and labeled on the image.
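As a quick sanity check, this count is simple arithmetic:

.. code:: python

    # Anchor boxes when five boxes are centered on every pixel
    561 * 728 * 5  # 2042040, i.e., over two million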
It is not difficult to reduce the number of anchor boxes. An easy way
is to uniformly sample a small portion of pixels from the input image
and generate anchor boxes centered on the sampled pixels. In addition,
we can generate anchor boxes of varied numbers and sizes on multiple
scales. Notice that smaller objects can appear in more positions on an
image than larger ones. Here is a simple example: objects with shapes
of :math:`1 \times 1`, :math:`1 \times 2`, and :math:`2 \times 2` have
4, 2, and 1 possible positions, respectively, on an image with the
shape :math:`2 \times 2`. Therefore, when using smaller anchor boxes to
detect smaller objects, we can sample more regions; when using larger
anchor boxes to detect larger objects, we can sample fewer regions.
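The counts in this toy example follow from a general formula: an object
of shape :math:`k \times l` fits at :math:`(n - k + 1)(m - l + 1)`
positions on an :math:`n \times m` image. A minimal sketch (the helper
``num_positions`` is ours, for illustration only):

.. code:: python

    def num_positions(obj_h, obj_w, img_h, img_w):
        # Placements of an obj_h x obj_w object on an img_h x img_w image
        return (img_h - obj_h + 1) * (img_w - obj_w + 1)

    # Reproduce the counts from the example on a 2 x 2 image
    [num_positions(1, 1, 2, 2),  # 1 x 1 object: 4 positions
     num_positions(1, 2, 2, 2),  # 1 x 2 object: 2 positions
     num_positions(2, 2, 2, 2)]  # 2 x 2 object: 1 position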
To demonstrate how to generate anchor boxes on multiple scales, let us
read an image first. Its height and width are 561 and 728 pixels,
respectively.
.. code:: python

    %matplotlib inline
    import d2l
    from mxnet import contrib, image, nd

    img = image.imread('../img/catdog.jpg')
    h, w = img.shape[0:2]
    h, w
.. parsed-literal::
    :class: output

    (561, 728)
In :numref:`chapter_conv_layer`, the 2D array output of a convolutional
neural network (CNN) is called a feature map. We can determine the
midpoints of anchor boxes uniformly sampled over any image by defining
the shape of the feature map.
The function ``display_anchors`` is defined below. We generate anchor
boxes ``anchors`` centered on each unit (pixel) of the feature map
``fmap``. Since the :math:`x` and :math:`y` axis coordinates in
``anchors`` have been divided by the width and height of the feature
map, values between 0 and 1 indicate relative positions within the
feature map. Since the midpoints of ``anchors`` span all units of the
feature map ``fmap``, those midpoints are uniformly distributed over
the relative spatial positions of any image. Specifically, when the
width and height of the feature map are set to ``fmap_w`` and
``fmap_h``, the function uniformly samples pixels in ``fmap_h`` rows
and ``fmap_w`` columns and uses them as midpoints to generate anchor
boxes with size ``s`` (we assume the length of the list ``s`` is 1) and
different aspect ratios (``ratios``).
.. code:: python

    def display_anchors(fmap_w, fmap_h, s):
        d2l.set_figsize((3.5, 2.5))
        # The values of the first two dimensions do not affect the output.
        # Feature maps use NCHW layout, so height precedes width
        fmap = nd.zeros((1, 10, fmap_h, fmap_w))
        anchors = contrib.nd.MultiBoxPrior(fmap, sizes=s, ratios=[1, 2, 0.5])
        # Scale the relative coordinates back to the image's pixel coordinates
        bbox_scale = nd.array((w, h, w, h))
        d2l.show_bboxes(d2l.plt.imshow(img.asnumpy()).axes,
                        anchors[0] * bbox_scale)
We will first focus on the detection of small objects. To make them
easier to distinguish in the display, the anchor boxes with different
midpoints here do not overlap. We assume the size of the anchor boxes
is 0.15 and the height and width of the feature map are both 4. We can
see that the midpoints of the anchor boxes in the 4 rows and 4 columns
are uniformly distributed on the image.
.. code:: python

    display_anchors(fmap_w=4, fmap_h=4, s=[0.15])
.. figure:: output_multiscale-object-detection_f4262f_5_0.svg
We are going to reduce the height and width of the feature map by half
and use a larger anchor box to detect larger objects. When the size is
set to 0.4, overlaps will occur between regions of some anchor boxes.
.. code:: python

    display_anchors(fmap_w=2, fmap_h=2, s=[0.4])
.. figure:: output_multiscale-object-detection_f4262f_7_0.svg
Finally, we are going to reduce the height and width of the feature map
by half and increase the anchor box size to 0.8. Now the midpoint of the
anchor box is the center of the image.
.. code:: python

    display_anchors(fmap_w=1, fmap_h=1, s=[0.8])
.. figure:: output_multiscale-object-detection_f4262f_9_0.svg
Since we have generated anchor boxes of different sizes on multiple
scales, we will use them to detect objects of various sizes at
different scales. Now we are going to introduce a method based on
convolutional neural networks (CNNs).

At a certain scale, suppose we generate :math:`h \times w` sets of
anchor boxes with different midpoints based on :math:`c_i` feature maps
with the shape :math:`h \times w`, and that each set contains :math:`a`
anchor boxes. For example, for the first scale of the experiment, we
generate 16 sets of anchor boxes with different midpoints based on 10
feature maps (the number of channels) with a shape of :math:`4 \times 4`,
and each set contains 3 anchor boxes (one size and three aspect ratios
yield :math:`1 + 3 - 1 = 3` boxes per midpoint). Next, each anchor box
is labeled with a category and offset based on the category and
position of the ground-truth bounding box. At the current scale, the
object detection model needs to predict the categories and offsets of
the :math:`h \times w` sets of anchor boxes with different midpoints
based on the input image.
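As a quick check of these numbers, note that ``MultiBoxPrior`` returns
all anchor boxes at a scale in a single array of shape (batch size,
number of anchor boxes, 4); for the first scale we therefore expect
:math:`4 \times 4 \times 3 = 48` boxes:

.. code:: python

    fmap = nd.zeros((1, 10, 4, 4))
    anchors = contrib.nd.MultiBoxPrior(fmap, sizes=[0.15], ratios=[1, 2, 0.5])
    anchors.shape  # (1, 48, 4): 16 midpoints with 3 anchor boxes each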
We assume that the :math:`c_i` feature maps are the intermediate output
of the CNN based on the input image. Since each feature map has
:math:`h \times w` different spatial positions, each spatial position
holds :math:`c_i` units, one per feature map. According to the
definition of receptive field in :numref:`chapter_conv_layer`, the
:math:`c_i` units at the same spatial position of the feature maps have
the same receptive field on the input image, so they represent the
information of the input image within that receptive field. Therefore,
we can transform the :math:`c_i` units at the same spatial position
into the categories and offsets of the :math:`a` anchor boxes generated
with that position as their midpoint. In essence, we use the
information of the input image in a certain receptive field to predict
the categories and offsets of the anchor boxes close to that field on
the input image.
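One common way to carry out this transformation is a small
convolutional layer whose output channels encode the predictions for
all :math:`a` anchor boxes at each position; the following section
builds a full model along these lines. Below is a minimal sketch, where
the number of classes and the layer names are our own illustrative
choices:

.. code:: python

    from mxnet.gluon import nn

    num_classes, a = 10, 3  # illustrative: 10 object classes, 3 anchors per midpoint
    # Category predictions: a * (num_classes + 1) channels; "+1" is for background
    cls_predictor = nn.Conv2D(a * (num_classes + 1), kernel_size=3, padding=1)
    # Offset predictions: 4 offsets per anchor box
    bbox_predictor = nn.Conv2D(a * 4, kernel_size=3, padding=1)
    cls_predictor.initialize()
    bbox_predictor.initialize()

    fmap = nd.zeros((1, 10, 4, 4))  # c_i = 10 feature maps of shape 4 x 4
    # Predictions keep the 4 x 4 spatial layout, aligned with the anchor midpoints
    cls_predictor(fmap).shape, bbox_predictor(fmap).shape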
When the feature maps of different layers have receptive fields of
different sizes on the input image, they are used to detect objects of
different sizes. For example, we can design a network so that units in
feature maps closer to the output layer have wider receptive fields,
allowing them to detect larger objects in the input image.
We will implement a multiscale object detection model in the following
section.
Summary
-------
- We can generate anchor boxes with different numbers and sizes on
  multiple scales to detect objects of different sizes.
- The shape of the feature map can be used to determine the midpoints
  of the anchor boxes that uniformly sample any image.
- We use the information of the input image in a certain receptive
  field to predict the categories and offsets of the anchor boxes
  close to that field on the image.
Exercises
---------
- Given an input image, assume :math:`1 \times c_i \times h \times w`
  to be the shape of the feature map, where :math:`c_i`, :math:`h`, and
  :math:`w` are the number of channels, height, and width of the
  feature maps. What methods can you think of to convert this variable
  into the anchor boxes' categories and offsets? What is the shape of
  the output?