GapFlyt: Active Vision Based Minimalist
Structure-Less Gap Detection For Quadrotor Flight
Nitin J. Sanket, Chahat Deep Singh, Kanishka Ganguly, Cornelia Fermüller, and Yiannis Aloimonos
Abstract—Although quadrotors, and aerial robots in general,
are inherently active agents, their perceptual capabilities in lit-
erature so far have been mostly passive in nature. Researchers
and practitioners today use traditional computer vision algorithms
with the aim of building a representation of general applicability: a
3-D reconstruction of the scene. Using this representation, planning
tasks are constructed and accomplished to allow the quadrotor to
demonstrate autonomous behavior. These methods are inefficient
as they are not task driven and such methodologies are not uti-
lized by flying insects and birds. Such agents have been solving the
problem of navigation and complex control for ages without the
need to build a 3-D map and are highly task driven. In this letter,
we propose this framework of bioinspired perceptual design for
quadrotors. We use this philosophy to design a minimalist senso-
rimotor framework for a quadrotor to fly through unknown gaps
without an explicit 3-D reconstruction of the scene using only a
monocular camera and onboard sensing. We successfully evaluate
and demonstrate the proposed approach in many real-world ex-
periments with different settings and window shapes, achieving a
success rate of 85% at 2.5 ms⁻¹ even with a minimum tolerance
of just 5 cm. To the best of our knowledge, this is the first letter that
addresses the problem of gap detection of an unknown shape and
location with a monocular camera and onboard sensing.
Index Terms—Active vision, gap detection, quadrotor, visual ser-
voing, deep learning in robotics and automation, optical flow, track-
ing, collision avoidance, computer vision for automation, aerial sys-
tems: perception and autonomy, visual tracking and visual-based
navigation.
I. INTRODUCTION AND PHILOSOPHY
THE quest to develop intelligent and autonomous quadrotors [1]–[3] has taken center stage in recent years
due to their usefulness in aiding safety and intuitive control.
To achieve any form of autonomy, quadrotors need to have
Manuscript received February 15, 2018; accepted May 19, 2018. Date of
publication June 4, 2018; date of current version June 15, 2018. This letter was
recommended for publication by Associate Editor V. Lippiello and Editor J.
Roberts upon evaluation of the reviewers’ comments. This work was supported
in part by the Brin Family Foundation, in part by the Northrop Grumman
Corporation, and in part by the National Science Foundation under Grants
SMA 1540917 and CNS 1544797, respectively. (Nitin J. Sanket and Chahat
Deep Singh contributed equally to this work.) (Corresponding author: Nitin J.
Sanket.)
The authors are with the University of Maryland Institute for Advanced Com-
puter Studies, College Park, MD 20742 USA (e-mail: [email protected]).
This letter has supplemental downloadable multimedia material available at
http://ieeexplore.ieee.org, provided by the authors. The Supplementary Mate-
rials include a video, a supplementary PDF, code, and hardware tutorial. This
material is 91.1 MB in size. The supplementary hardware tutorial, appendix,
code and video are also available at prg.cs.umd.edu/GapFlyt.html.
Digital Object Identifier 10.1109/LRA.2018.2843445
Fig. 1. Different parts of the pipeline: (a) Detection of the unknown gap using active vision and the TS²P algorithm (cyan highlight shows the path followed
for obtaining multiple images for detection), (b) Sequence of quadrotor pass-
ing through the unknown gap using visual servoing based control. (The blue
and green highlights represent the tracked foreground and background regions
respectively. All the images in this letter are best viewed in color.)
perceptual capabilities in order to sense the world and react
accordingly. A generic and fairly common solution to pro-
viding perceptual capabilities is to acquire a 3D model of its
environment. Creating such a model is very useful because of
its general applicability one could accomplish many tasks us-
ing the same model. The process of obtaining a 3D model of the
environment using onboard sensing and a myriad of different
sensors has gained momentum in the last few years [4]. Sensors
like the LIDAR, RGB-D and stereo camera cannot be used on a
small quadrotor due to their size, weight, and power limitations.
TABLE I: Minimalist Design of Autonomous Quadrotor (Drone) Behaviours
This constrains us to a monocular camera along with the already
present onboard inertial sensor (IMU) and many algorithms have
been developed based on the same principle [5], [6].
Instead of attempting to recover a 3D model of the scene, we
want to recover a “minimal” amount of information that is suffi-
cient to complete the task under consideration. We conceptualize
an autonomous quadrotor as a collection of processes that allow
the agent to perform a number of behaviors (or tasks) (Table I).
At the bottom of the hierarchy is the task of kinetic stabilization
(or egomotion). Next comes the ability for obstacle avoidance,
where the obstacles could be static or dynamic, followed by
ability to perform homing, i.e., to find a specific location in an
environment. Last in the hierarchy comes the ability to land (on
a static or a dynamic surface) and the ability to pursue, or escape
from, another agent. This hierarchy of competences, with each
competence needing the one before it, constitutes the sensori-
motor system of the quadrotor agent. These behaviors can be
accomplished without an explicit 3D reconstruction because of
the “active” nature of the quadrotor. The quadrotor can perform
maneuvers and control the image acquisition process, thus intro-
ducing new constraints that were not there before - this is called
“active” vision [7]–[9]. This methodology was inspired by the
fruit fly [10]. Prior research has shown that fruit flies, and other
insects [11], [12], do not perceive depth directly; instead, depth
is perceived through a balance of active and exploratory procedures. In this letter, we
focus on the second competence of obstacle avoidance. Specif-
ically, the question this letter deals with is: “Can a quadrotor
manage to go through an arbitrarily shaped gap without build-
ing an explicit 3D model of a scene, using only a monocular
camera?” We develop the visual mathematics of the solution,
evaluate and demonstrate the proposed approach in many real
experiments with different settings and window shapes.
Traditional computer vision based approaches such as sparse
or dense reconstruction [13]–[16] have been used to obtain a
3D structure of the scene over which sophisticated path plan-
ners have been used to plan collision free paths. Lately, deep-
learning driven approaches have taken center stage in solving
the problem of fast collision avoidance and safe planning on
quadrotors [17]. Most of these neural network approaches com-
pute Fast Fourier Transforms (FFTs) for large filter sizes [18].
Such approaches can be processed on a Field-Programmable
Gate Array (FPGA), rather than a Graphical Processing Unit
(GPU) to drastically improve the computation performance and
power efficiency [19], [20].
To our knowledge, this is the first letter which addresses
the problem of gap detection with a monocular camera and
onboard sensing. However, the problem of going through gaps
has fascinated researchers for many years and some of the
recent works which present algorithms for planning and control
can be found in [21], [22]. Some of the works which paved the way
to the bio-inspired approach used in this letter can be found in
[23]–[26].
Fig. 2. Components of the environment. On-set Image: Quadrotor view of the
scene.
The key contributions of this letter are given below:
• Active vision based structure-less minimalist gap detection algorithm, Temporally Stacked Spatial Parallax or TS²P (Fig. 1(a)).
• Visual servoing on a contour for the quadrotor to fly through unknown gaps (Fig. 1(b)).
A. Organization of the Paper
Section II presents the detection algorithm of a gap in an ar-
bitrary shaped window using Deep Learning based optical flow
and the role of active vision in such algorithms. Section III de-
scribes the algorithm used for tracking the gap/safe point and the
quadrotor control policy. Section IV illustrates the experimental
setup and provides error analyses and comparisons with tradi-
tional approaches. We finally conclude the paper in Section V
with parting thoughts on future work.
B. Problem Formulation and Proposed Solutions
A quadrotor is present in a static scene (Fig. 2), where the
absolute depth of each point as ‘seen’ from the camera can be
modelled as a univariate bimodal Gaussian distribution. The
location at which there is a maximum spatial depth discrepancy
between pixels (projected points) is defined as the Contour (C) of
the opening. In this letter, the words boundary, gap, window or
opening refer to the same entity and will be used interchange-
ably. Any pixel close to the mode corresponding to the lower
depth value is defined as the Foreground (F) and similarly that
corresponding to the higher depth value is defined as the Back-
ground (B). A simple way of solving the problem of finding
the gap for a near fronto-parallel pose is to find the depth dis-
crepancy which is a trivial clustering problem when the depth is
known. The depth can be known if multiple calibrated cameras
are used or the entire scene is reconstructed in 3D. These meth-
ods are computationally expensive [4]. In this letter, we propose
a ‘minimalist’ solution to find any arbitrary shaped gap for the
quadrotor to go through using the Temporally Stacked Spatial
Parallax (TS²P) algorithm. A control policy based on contour visual
servoing is used to fly through unknown gaps.
Fig. 3. Representation of co-ordinate frames.
II. GAP DETECTION USING TS²P
Before we explain the procedure for gap detection, we need
to formally define the notation used. $({}^aX_b, {}^aY_b, {}^aZ_b)$ denotes
the coordinate frame of $b$ represented in the reference of $a$. The
letters $I$, $C$, $B$ and $W$ are used as sub/superscripts to denote
quantities related to the Inertial Measurement Unit (IMU), Camera,
Body and World respectively. If a sub/superscript is omitted,
the quantity can be assumed to be in $W$. $C$ and $I$ are assumed
to be rigidly attached to each other with known intrinsics and
extrinsics. $B$ is defined to be aligned with $I$ (Fig. 3). A pinhole
camera model is used for the formation of the image. The world
point $\mathbf{X}$ gets projected onto the image plane point $\mathbf{x}$.
The camera captures images/frames at a certain frequency. Let
the frame at time $t_i$ be denoted as $F_i$ and called the $i$th frame.
Optical flow [27] between the $i$th and $j$th frames is denoted by
${}^j_i\dot{p}_x$, which is a sum of both translational $({}^j_i\dot{p}_{x,T})$ and rotational
$({}^j_i\dot{p}_{x,R})$ components, given by:
$$
{}^j_i\dot{p}_{x,T} = \frac{1}{Z_x}\begin{bmatrix} xV_z - V_x \\ yV_z - V_y \end{bmatrix};\qquad
{}^j_i\dot{p}_{x,R} = \begin{bmatrix} xy & -(1+x^2) & y \\ (1+y^2) & -xy & -x \end{bmatrix}\Omega
$$
where $\mathbf{x} = \begin{bmatrix} x & y \end{bmatrix}^T$ denotes the pixel coordinates in the image
plane, $Z_x$ is the depth of $\mathbf{X}$ (the world point corresponding to
pixel $\mathbf{x}$), and $V = \begin{bmatrix} V_x & V_y & V_z \end{bmatrix}^T$ and $\Omega$ are the linear and angular
velocities of $C$ in $W$ between $t_i$ and $t_j$ respectively.
The otherwise complex depth estimation problem can be trivialized
by moving in a certain way [7]. These “active” vision
principles dictate that we control the quadrotor’s movements so
as to make the interpretation of optical flow easier. Since the
quadrotor is a differentially flat system [28], the rotations about
$(X_B, Y_B, Z_B)$, i.e., roll, pitch and yaw, can be decoupled. As
an implication, the movements can be controlled in a way that
simplifies the depth ($Z_x$) estimation problem. The quadrotor is
controlled in such a way that $\Omega \approx 0$, $V_z \ll V_x$ and $V_z \ll V_y$;
then the optical flow can be modelled as:
$$
{}^j_i\dot{p}_x = -\frac{1}{{}^j_iZ_x}\begin{bmatrix} {}^j_iV_x & {}^j_iV_y \end{bmatrix}^T + \eta
$$
where $\eta$ is the approximation error. This shows that using the
aforementioned “active” control strategy, we obtain an implicit
3D structure of the environment in the optical flow. The inverse
depth in this “controlled” case manifests as a linear function of
the optical flow.
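To make this concrete, the following minimal Python sketch (an illustration under the stated assumptions, not the authors' implementation; the flow array, velocities and function name are hypothetical) shows how, once $\Omega \approx 0$ and $V_z \ll V_x, V_y$, the per-pixel flow magnitude becomes a direct proxy for inverse depth up to the known lateral speed.

import numpy as np

def inverse_depth_proxy(flow, v_x, v_y):
    """Relative inverse depth from optical flow under the 'active' constraints.

    flow : (H, W, 2) array of per-pixel flow (u, v) between two frames captured
           while the quadrotor translates laterally (Omega ~ 0, |V_z| << |V_x|, |V_y|).
    v_x, v_y : lateral camera velocities between the two frames.

    Returns an (H, W) array proportional to 1/Z at every pixel: larger values
    correspond to closer points (foreground), smaller values to background.
    """
    flow_mag = np.linalg.norm(flow, axis=2)      # ||p_dot|| ~ sqrt(Vx^2 + Vy^2) / Z
    lateral_speed = np.hypot(v_x, v_y) + 1e-9    # guard against division by zero
    return flow_mag / lateral_speed              # ~ 1/Z (up to the noise term eta)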
The optical flow equation can be written for both foreground
(F) and background (B) pixels independently. The magnitude
of optical flow for F is given by,
$$
\left\|{}^j_i\dot{p}_{x,F}\right\|_2 = \frac{1}{{}^j_iZ_{x,F}}\sqrt{{}^j_iV_x^2 + {}^j_iV_y^2} + \nu
$$
where $\nu \sim \mathcal{N}(0, \sigma)$ is assumed to be additive white Gaussian
noise that is independent of the scene structure and the amount
of camera movement between two frames $(V, \Omega)$. For such assumptions
to be valid, the optical flow algorithm needs to work
well for a wide range of camera movements in a variety of scene
structures. Using fast traditional optical flow formulations based
on methods like [29] or [30] voids such assumptions. This motivated
us to use deep-learning based flow algorithms, which excel
at this task while maintaining a reasonable speed when running
on a GPU. In this letter, FlowNet2 [31] is used to compute
optical flow unless stated otherwise.
A simple method to reduce noise is to compute the mean of
the flow magnitudes across a few frames. Let $\xi = \{F_j, \ldots, F_k\}$
be a set of $N$ frames from which the optical flow is computed
with respect to some reference frame $F_i$ in which the complete
gap is assumed to be visible. Here, $N = |\xi|$ is the cardinality of
the set $\xi$. The mean flow magnitude at $\mathbf{x}$ for F $\left(\left\|{}^\xi_i\dot{p}_{x,F}\right\|_2\right)$ and
B $\left(\left\|{}^\xi_i\dot{p}_{x,B}\right\|_2\right)$ is given by,
$$
\left\|{}^\xi_i\dot{p}_{x,F/B}\right\|_2 = \frac{1}{N\,Z_{x,F/B}}\sum_{r=j}^{k}{}^r_iV + \nu'
$$
where $\nu' \sim \mathcal{N}\left(0, N^{-0.5}\sigma\right)$ and $V = \sqrt{V_x^2 + V_y^2}$. Clearly, the noise
standard deviation varies inversely with $\sqrt{N}$.
Since $Z_{x,F} < Z_{x,B}$, we can say that $\left\|{}^\xi_i\dot{p}_{x,F}\right\|_2 > \left\|{}^\xi_i\dot{p}_{x,B}\right\|_2$.
Now, $\left\|{}^\xi_i\dot{p}_{x,F}\right\|_2 - \left\|{}^\xi_i\dot{p}_{x,B}\right\|_2 \geq \tau$ can be used as a criterion for
finding possible boundary regions. It was found experimentally
that using inverse flow differences, $\left\|{}^\xi_i\dot{p}_{x,B}\right\|_2^{-1} - \left\|{}^\xi_i\dot{p}_{x,F}\right\|_2^{-1} \geq \tau$,
gave better numerical stability and better noise performance
due to the scale compression by the inverse function. This is
inherently the spatial derivative of the inverse average (stacked) flow
magnitudes and can be written as $\Xi = \nabla \cdot \left\|{}^\xi_i\dot{p}_x\right\|_2^{-1}$, where
$\nabla = \begin{bmatrix} \partial/\partial x & \partial/\partial y \end{bmatrix}^T$. Note that this is the same as solving
the edge detection problem in computer vision and any kernel
or method like the Sobel operator or the Canny edge detection
algorithm can be used.
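As a concrete illustration of the TS²P detection stage, the sketch below (a minimal, hedged rendition under the assumptions of this section; the function name and parameters are hypothetical, and OpenCV's Farnebäck dense flow stands in for FlowNet2) stacks flow magnitudes over $N$ frames, forms Ξ from the inverse averaged magnitude, and extracts candidate gap boundaries with a Canny edge detector followed by morphological cleanup, mirroring the post-processing described in Section IV.

import cv2
import numpy as np

def ts2p_gap_edges(ref_frame, frames, canny_lo=50, canny_hi=150):
    """Temporally Stacked Spatial Parallax (illustrative sketch).

    ref_frame : grayscale reference image F_i in which the whole gap is visible.
    frames    : list of grayscale images {F_j, ..., F_k} taken while the quadrotor
                translates laterally (Omega ~ 0, small V_z).
    Returns (xi, edges): the Xi map and its detected edges (candidate contour C).
    """
    # Stack flow magnitudes with respect to the reference frame and average over N.
    stacked = np.zeros(ref_frame.shape[:2], dtype=np.float64)
    for frame in frames:
        # Stand-in for FlowNet2: any dense flow estimator suffices for the sketch.
        flow = cv2.calcOpticalFlowFarneback(ref_frame, frame, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        stacked += np.linalg.norm(flow, axis=2)
    mean_mag = stacked / max(len(frames), 1)

    # Xi: spatial derivative of the inverse averaged (stacked) flow magnitude.
    inv_mag = 1.0 / (mean_mag + 1e-6)
    gx = cv2.Sobel(inv_mag, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(inv_mag, cv2.CV_64F, 0, 1, ksize=3)
    xi = np.hypot(gx, gy)

    # Edge detection plus morphological closing to obtain a candidate contour.
    xi_u8 = cv2.normalize(xi, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    edges = cv2.Canny(xi_u8, canny_lo, canny_hi)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    edges = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel)
    return xi, edges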
III. HIGH SPEED GAP TRACKING FOR VISUAL SERVOING BASED CONTROL
This section presents a targeted solution for tracking a contour
using label sets propagated using Focus of Expansion
(FOE) constraints. A pixel at location $\mathbf{x}$ is associated with a
score $\chi(\mathbf{x}) \in [-1, 1]$ which denotes its score as foreground or
background. The foreground and background pixel locations
are defined by $\mathcal{F} = \{\mathbf{x} \mid \chi(\mathbf{x}) = +1\}$ and $\mathcal{B} = \{\mathbf{x} \mid \chi(\mathbf{x}) = -1\}$
respectively.
We define the opening $\mathcal{O}$ on the image plane as $\mathcal{O} = \{\mathbf{x} \mid \chi(\mathbf{x}) < 0\}$.
The pixel locations which cannot be perfectly
classified as foreground or background belong to the uncertainty
zone and are defined as $\mathcal{U} = \{\mathbf{x} \mid \chi(\mathbf{x}) \in (-1, +1)\}$. The contour
location is defined as $\mathcal{C} = \{\mathbf{x} \mid \chi(\mathbf{x}) = 0\}$. Fig. 4 gives a
visual representation of the different sets on a sample image.
The aim is to track the contour $\mathcal{C}$ across time. This problem is hard as contour
tracking relies on updating $\chi(\mathbf{x})\ \forall \mathbf{x} \in \mathcal{U}$ over time, which is
non-trivial and computationally expensive. To simplify the problem,
we use the dual formulation, which is to
track the pixel locations belonging to the set defined by
$\{\mathbf{x} \notin \mathcal{U}\} = \mathcal{F} \cup \mathcal{B}$. This enables us to track the contour indirectly
at high speeds, as corner tracking is comparatively faster
[32]. The trade-off in using the dual formulation is that we don’t
obtain the actual contour location across time, which might be
needed for aggressive maneuvers; this is not dealt with in the
scope of this letter.
Fig. 4. Label sets used in tracking. (blue: foreground region, green: back-
ground region, orange: uncertainty region, black line: contour, brighter part of
frame: active region and darker part of frame: inactive region.)
Algorithm 1: Label Set Propagation Using FOE Constraints.
Data: $\mathcal{C}^i_F$, $\mathcal{C}^i_B$, $F_i$, $F_j$, $\mathcal{F}_i$, $\mathcal{B}_i$
Result: $\mathcal{C}^j_F$, $\mathcal{C}^j_B$, $\mathcal{F}_j$, $\mathcal{B}_j$
1: $\mathcal{C}^j_F = \text{FeatureTracker}\left(\mathcal{C}^i_F, F_i, F_j\right)$;
2: $A = \begin{bmatrix} \mathcal{C}^i_{F,x} & 1 \end{bmatrix}$;
3: $B = \begin{bmatrix} \mathcal{C}^j_{F,x} - \mathcal{C}^i_{F,x} \end{bmatrix}$;
4: $\begin{bmatrix} \alpha & \beta \end{bmatrix}^T = A^\dagger B$;
5: $x_{0,F} = -\beta/\alpha$; $\; y_{0,F} = \mathcal{C}^j_{F,y} - \frac{1}{\alpha}\left(\mathcal{C}^j_{F,y} - \mathcal{C}^i_{F,y}\right)$;
6: $\mathcal{F}_j = \alpha \begin{bmatrix} \mathcal{F}_{i,x} - x_{0,F} \\ \mathcal{F}_{i,y} - y_{0,F} \end{bmatrix} + \begin{bmatrix} \mathcal{F}_{i,x} \\ \mathcal{F}_{i,y} \end{bmatrix}$;
7: Repeat steps 1 through 6 for $\mathcal{B}$ in parallel;
The label set propagation algorithm is described
in Algorithm 1. Here, $\mathcal{C}^i_F$ and $\mathcal{C}^i_B$ represent the sets of
corner/feature points for the foreground and background pixels
respectively in $F_i$. $A^\dagger$ denotes the pseudo-inverse of the
matrix $A$ and $\begin{bmatrix} x_{0,F} & y_{0,F} \end{bmatrix}^T$ is the FOE for the foreground.
Intuitively, we solve the linear equations of the horizontal flow
field to obtain the divergence/time-to-contact $\alpha$. The divergence
in the x-direction is also used to predict the y-coordinates.
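The sketch below illustrates one way the propagation step could be realized (an illustrative rendition, not the authors' code; OpenCV's pyramidal KLT tracker cv2.calcOpticalFlowPyrLK stands in for FeatureTracker, and all variable names are hypothetical): track the sparse corners, fit the one-dimensional affine model of the horizontal flow to recover the divergence α and the FOE, then warp the dense label set with the same model.

import cv2
import numpy as np

def propagate_labels_foe(corners_i, frame_i, frame_j, labels_i):
    """Propagate a label set (foreground or background pixels) between frames.

    corners_i : (M, 1, 2) float32 corner locations in frame_i (e.g. from
                cv2.goodFeaturesToTrack restricted to the label region).
    frame_i, frame_j : grayscale frames.
    labels_i  : (K, 2) float array of labelled pixel locations in frame_i.
    Returns (corners_j, labels_j).
    """
    # Step 1: KLT tracking of the sparse corners (stand-in for FeatureTracker).
    corners_j, status, _ = cv2.calcOpticalFlowPyrLK(frame_i, frame_j, corners_i, None)
    good = status.ravel() == 1
    ci = corners_i[good].reshape(-1, 2)
    cj = corners_j[good].reshape(-1, 2)

    # Steps 2-4: least-squares fit of the horizontal flow field dx = alpha*x + beta,
    # solved through the pseudo-inverse.
    A = np.column_stack([ci[:, 0], np.ones(len(ci))])
    b = cj[:, 0] - ci[:, 0]
    alpha, beta = np.linalg.pinv(A) @ b

    # Step 5: FOE is where the fitted flow vanishes; predict y analogously.
    x0 = -beta / alpha
    y0 = np.mean(cj[:, 1] - (cj[:, 1] - ci[:, 1]) / alpha)

    # Step 6: warp every labelled pixel with the same divergence/FOE model.
    foe = np.array([x0, y0])
    labels_j = alpha * (labels_i - foe) + labels_i
    return corners_j[good], labels_j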
A. Safe Point Computation and Tracking
We can assume that the real-world points corresponding to
the background $\mathcal{B} \cup \mathcal{U}_B$ are far enough away that we can approximate
them to lie on a plane. The foreground points under consideration,
$\mathcal{F} \cup \mathcal{U}_F$, occupy a small area around the contour which can
be assumed to be planar. Here, $\mathcal{U}_F \subset \mathcal{U}$ and $\mathcal{U}_B \subset \mathcal{U}$ are the
sets which actually belong to the foreground and background
respectively.
Now, the quadrotor can be represented as an ellipsoid with
semi-axes of lengths $a$, $b$ and $c$. As an implication of the above
assumptions, the projection of the quadrotor on the window
at any instant is an ellipse. Let us define the projection of the
quadrotor on the image as $\mathcal{Q}$. The largest $\mathcal{Q}$ can be written
in terms of the matrix equation of the ellipse centered at $[h, k]^T$,
defined as $Q(h, k, R_\theta) = 0$.
Here, $R_\theta$ is a two-dimensional rotation matrix and $\theta$ is the
angle the largest semi-axis of the ellipse makes with the $X_C$
axis. The projection of the quadrotor on the image is given by
$\mathcal{Q} = \{\mathbf{x} \mid Q(\mathbf{x}) \leq 0\}$. The safe region $\mathcal{S}$ can be computed as
$$
\mathcal{S} = \bigcap_\theta \left(\mathcal{O} \ominus \mathcal{Q}\right)
$$
where $\ominus$ denotes the Minkowski difference of sets. Now, we
define the ‘safest point’ $(\mathbf{x}_s)$ as the barycentric mean of $\mathcal{S}$.
Remark: The above optimization problem can only be solved
using convex optimization with a guaranteed global solution
when both $\mathcal{O}$ and $\mathcal{Q}$ are convex sets. A conservative solution
to the above problem is fitting the largest scaled version of $\mathcal{Q}$
inside $\mathcal{O}$ when $\mathcal{O}$ is a non-convex set and $\mathcal{Q}$ is a convex set.
Note that as $\mathcal{Q}$ is a chosen model, it can always be chosen to
be a convex set, i.e., the convex hull of the non-convex set. Also,
from the above remark, the ‘safest point’ $(\mathbf{x}_s)$ can be defined as
the center of the largest ellipse which can be fit inside $\mathcal{S}$ whilst
maintaining the eccentricity equal to that defined by $\mathcal{Q}$. The
optimization problem becomes,
$$
\operatorname*{argmax}_{a,\theta} \; |\mathcal{S} \cap \mathcal{Q}| \quad \text{s.t.} \quad \mathcal{Q} \subseteq \mathcal{S} \;\text{and}\; |\theta| \leq \theta_{\max}
$$
This problem can be solved using the procedure described in
[33]. However, a desirable property of the safe region is that it
should favor less-aggressive maneuvers. This can be modelled
as a regularization penalty in the above formulation,
$$
\operatorname*{argmax}_{a,\theta} \; |\mathcal{S} \cap \mathcal{Q}| + \lambda\theta \quad \text{s.t.} \quad \mathcal{Q} \subseteq \mathcal{S} \;\text{and}\; |\theta| \leq \theta_{\max}
$$
Solving the above optimization problem is computationally
intensive and not possible without prior knowledge of the
scale/depth. To obtain the minimalist solution, we assume
that the gap is large enough for the quadrotor to pass through
and replace the above optimization problem by an empirically
chosen approximation. A simple and efficient safe point computation
can be performed as the median of the convex set $\mathcal{O}$
and is given by $\mathbf{x}_s \triangleq \operatorname*{argmin}_{\mathbf{x}} \sum_{o \in \mathcal{O}} \|o - \mathbf{x}\|_2$.
Remark: If the above approximation is used when $\mathcal{O}$ is non-convex,
the amount of deviation from the ‘actual’ safe point is a
function of $|\text{Conv}(\mathcal{O})| / |\mathcal{O}|$. Here $\text{Conv}(\mathcal{O})$ is the convex hull of
$\mathcal{O}$.
Keen readers would note that the formulation for the safe
point is in terms of $\mathcal{O}$, which we wanted to avoid tracking
in the first place. Indeed this is true and a simple solution
takes care of this. Because we are propagating $\mathcal{F}$ and $\mathcal{B}$ with
FOE constraints, the cross-ratios of $\{\|o - f\|_2 \mid o \in \mathcal{O}, f \in \mathcal{F}\}$
and $\{\|o - b\|_2 \mid o \in \mathcal{O}, b \in \mathcal{B}\}$ are preserved. Here, $\mathcal{F}$ and $\mathcal{B}$
are computed from the detected opening $\mathcal{O}$ as follows: $\mathcal{F} =
\{(\mathcal{O} \oplus k_1) - (\mathcal{O} \oplus k_2)\}$ and $\mathcal{B} = \{\mathcal{O} \ominus k_3\}$, where each $k_i$ is a user-chosen
kernel (circular in this letter) and $\oplus$/$\ominus$ denote morphological
dilation/erosion.
The $\mathcal{F}$ and $\mathcal{B}$ are propagated from $F_i$ onwards, where the
detection was performed. The ‘safest point’ $(\mathbf{x}_s)$ is computed as
follows:
$$
\mathbf{x}_s = \begin{cases} \mathbf{x}_{s,F}, & |\mathcal{F}| \geq |\mathcal{B}| \\ \mathbf{x}_{s,B}, & \text{otherwise} \end{cases}
$$
where $\mathbf{x}_{s,F}$ and $\mathbf{x}_{s,B}$ are the safest points computed individually
for $\mathcal{F}$ and $\mathcal{B}$ by taking their medians respectively. Fig. 5 shows the
tracking of $\mathcal{F}$ and $\mathcal{B}$ across time and how the tracking algorithm
actively switches between $\mathbf{x}_{s,F}$ and $\mathbf{x}_{s,B}$ for the safest point
$\mathbf{x}_s$. When $|\mathcal{C}^i_F| \leq k_F$ or $|\mathcal{C}^i_B| \leq k_B$, the tracker/feature points are
reset in the current $\mathcal{F}$ and $\mathcal{B}$ sets respectively. Here, $k_F$ and $k_B$
are empirically chosen thresholds. For all experiments $k_F = 40$
and $k_B = 20$. Due to resetting and active switching, $\mathbf{x}_s$ can jump
around, making control hard; hence a simple Kalman filter
with a forward motion model is used to smooth out the value of
$\mathbf{x}_s$. From here on, safe point refers to the safest point $\mathbf{x}_s$.
Fig. 5. Tracking of $\mathcal{F}$ and $\mathcal{B}$ across frames. (a) shows tracking when $|\mathcal{C}^i_F| > k_F$ and $|\mathcal{C}^i_B| > k_B$. (b) When $|\mathcal{C}^i_B| \leq k_B$, the tracking for $\mathcal{B}$ will be reset. (c) When $|\mathcal{C}^i_F| \leq k_F$, the tracking for $\mathcal{F}$ will be reset. (d) shows tracking only with $\mathcal{B}$, when $\mathcal{F} = \emptyset$. (blue: $\mathcal{F}$, green: $\mathcal{B}$, yellow: $\mathcal{O}$, yellow dots: $\mathcal{C}^i_F$, red dots: $\mathcal{C}^i_B$, blue square: $\mathbf{x}_{s,F}$, red square: $\mathbf{x}_{s,B}$.)
B. Control Policy
We propose a control policy such that the quadrotor follows the tracked $\mathbf{x}_s$.
The quadrotor follows the dynamic model given in [34]. The
controller uses the traditional backstepping approach based on
[34] and contains two loops: an inner loop and an outer loop.
The inner loop controls attitude stability while the
outer loop is responsible for the quadrotor position.
It is important to note that frames are transformed from $C$ to $B$.
Since the quadrotor is differentially flat, the altitude $Z_B$ can
be controlled independently from $X_B$ and $Y_B$ [28]. The control
policy is to align the projection of the body center on the
image plane with $\mathbf{x}_s$. The difference between the two centers is
called the error $e$. The $x$ and $y$ components of the error $e$ can be
minimized by varying roll ($\phi$) and net thrust ($u_1$) respectively.
A simple Proportional-Integral-Derivative (PID) controller on $e$
is used. This control policy only deals with the alignment of $B$
to $\mathbf{x}_s$ and does not deal with moving forward ($Z_C$). To move
forward, the quadrotor pitch ($\theta$) needs to be controlled. The
rate of forward motion is controlled by the pitch angle $\theta_0$, which
is empirically chosen. The bigger the value of $\theta_0$, the faster the
quadrotor will fly towards the gap. It is important to note the implicit
assumption made in this letter that the gap is large enough
for the quadrotor to go through safely.
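The alignment loop described above can be illustrated with a short PID sketch (a hedged, self-contained illustration; the gains, names and velocity-command interface are assumptions and not the Bebop 2 driver's actual API).

class PID:
    """Minimal PID controller acting on a scalar error signal."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error, dt):
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def alignment_commands(x_s, body_center_px, dt, pid_roll, pid_thrust, pitch_0=0.1):
    """Map the pixel error between x_s and the projected body center to commands.

    The x image error drives the roll command and the y image error drives the
    net-thrust command, while a fixed, empirically chosen pitch angle pitch_0
    moves the quadrotor forward through the gap.
    """
    ex = x_s[0] - body_center_px[0]
    ey = x_s[1] - body_center_px[1]
    roll_cmd = pid_roll.step(ex, dt)      # minimizes the horizontal image error
    thrust_cmd = pid_thrust.step(ey, dt)  # minimizes the vertical image error
    return roll_cmd, pitch_0, thrust_cmd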
IV. EXPERIMENTS
A. Experimental Setup
The proposed framework was tested on a modified hobby
quadrotor, the Parrot Bebop 2, for its cost effectiveness and ease of
use. The Bebop 2 is equipped with a front facing camera, a 9-axis
IMU and a downward facing optical flow sensor coupled with
a sonar. The Parrot Bebop 2 allows only high level controls in
terms of body frame velocities using ROS. An NVIDIA Jetson
TX2 GPU is mounted on the Bebop 2 as shown in Fig. 6 and is
used to run all the perception and control algorithms onboard.
The TX2 and Bebop 2 communicate via a WiFi connection,
where the images are received at 30 Hz.
Fig. 6. The platform used for experiments. (1) The front facing camera, (2)
NVIDIA TX2 CPU+GPU, (3) Downward facing optical flow sensor (cam-
era+sonar) which is only used for position hold.
Fig. 7. First two rows: $(X_W, Y_W)$, $(Y_W, Z_W)$ and $(X_W, Z_W)$ Vicon estimates of the trajectory executed by the quadrotor in different stages (gray bar indicates the gap). The $(X_W, Z_W)$ plot shows the diagonal scanning trajectory (the lines don't coincide due to drift). Last row: Photo of the quadrotor during gap traversal. (cyan: detection stage, red: traversal stage.)
The overall weight of the flight setup is 680 g with the dimensions being 32.8 ×
38.2 × 12 cm.
All the experiments were prototyped on a PC running Ubuntu
16.04 with an Intel Core i7 6850K 3.6 GHz CPU, an NVIDIA
Titan-Xp GPU and 64 GB of RAM in MATLAB using the
Robotics Toolbox. The deep learning based optical flow runs
on Python with a TensorFlow back-end. All the final versions of
the software were ported to Python to run on the NVIDIA Jetson
TX2 running Linux for Tegra (L4T) 28.2. A step-by-step
tutorial on using the Bebop 2 as a research platform is available at
prg.cs.umd.edu/GapFlyt.html.
The environmental setup for the experiments consists of a
rigid scene which has two near-planar structures, one for the
foreground and the other for the background. As shown in
Fig. 2, let us denote the initial perpendicular distance between
the quadrotor body center and the foreground as ${}^0Z_F$ and the
background as ${}^0Z_B$. The near-planar structures are made of
foam-core with newspapers stuck on them to add texture to
the scene. The gap is located near the center of the foreground
and is of an arbitrary shape. For the detection of the window,
the quadrotor executes a fixed diagonal straight line trajectory
in the $X_W$-$Z_W$ plane as shown in Fig. 7 while taking a number
of images along its path. The number of images used for the detection
stage is a parameter denoted by $N$. Once the window is
detected, $\mathcal{F}$ and $\mathcal{B}$ are tracked across time in order for the quadrotor
to go through the gap using visual servoing based control.
Fig. 8. Sequence of images of quadrotor going through different shaped gaps. Top on-set: Ξ outputs, bottom on-set: quadrotor view.
Fig. 9. Top Row (left to right): Quadrotor view at ${}^0Z_F$ = 1.5, 2.6, 3 m respectively with ${}^0Z_B$ = 5.7 m. Bottom Row: Respective Ξ outputs for N = 4. Observe how the fidelity of Ξ reduces as ${}^0Z_F \to {}^0Z_B$, making the detection more noisy. (white boxes show the location of the gap in Figs. 9 to 13.)
Fig. 10. Comparison of different philosophies to gap detection. Top row (left to right): DSO, Stereo Depth, MonoDepth, TS²P. Bottom row shows the detected gap overlaid on the corresponding input image. (green: $\mathcal{G}\cap\mathcal{O}$, yellow: false negative $\mathcal{G}\cap\overline{\mathcal{O}}$, red: false positive $\overline{\mathcal{G}}\cap\mathcal{O}$.)
B. Experimental Results
The pipeline was evaluated on different variations of the
environmental setup. In the first experiment, we test our pipeline
on five different arbitrarily shaped gaps as shown in Fig. 8.
Unless otherwise stated, ${}^0Z_F \approx 2.6$ m and ${}^0Z_B \approx 5.7$ m. The
aim here is to find biases in the detection pipeline. The windows
were chosen to have a diversity in the geometrical sharpness
of the corners, convexity of the shape and the size. As stated
earlier, the only constraint imposed on the gaps is that they are
large enough to go through with the quadrotor pitch angle close
to zero and near-convex. The outputs of the TS²P algorithm for
different windows are shown in Fig. 8 with N = 4 along with
their inverse average (stacked) flow magnitudes Ξ, illustrating
that our detection algorithm is independent of the shape and size
of the opening. A Canny edge detector is run on Ξ followed by
morphological operations to obtain $\mathcal{C}$.
The second experiment is designed to test the noise sensitivity
of TS²P. The intuition is that the detection result becomes
noisier as ${}^0Z_F \to {}^0Z_B$.
Fig. 11. Left Column: Images used to compute Ξ. Middle Column (top to bottom): Ξ outputs for DIS Flow, SpyNet and FlowNet2. Right Column: Gap detection outputs. (green: $\mathcal{G}\cap\mathcal{O}$, yellow: false negative $\mathcal{G}\cap\overline{\mathcal{O}}$, red: false positive $\overline{\mathcal{G}}\cap\mathcal{O}$.)
Fig. 12. Top row (left to right): Quadrotor view at image sizes of 384 ×
576, 192 × 288, 96 × 144, 48 × 72, 32 × 48. Note all images are re-scaled to
384 × 576 for better viewing. Bottom row shows the respective Ξ outputs for
N =4.
The outputs for different ${}^0Z_F$ and ${}^0Z_B$ are
shown in Fig. 9 for N = 4. This is because the fidelity of Ξ decreases
and becomes more prone to noise. By increasing N, the noise
gets averaged out across frames, improving the fidelity of Ξ.
In the third experiment, we present detection outputs for different
values of N, image baselines and image sizes. The effect
of N has already been discussed previously. Having a very small
baseline effectively reduces the value of N and vice-versa.
The results from different sized images, as illustrated in
Fig. 12, show that the detection algorithm can work even on a
very small quadrotor which can only carry a very low-resolution
camera (as low as 32 × 48 pixels). Our algorithm can also handle
dynamic noise well, even though it is modelled as Gaussian
for the discussion. However, one can notice that the results improve
significantly with increase in N (Fig. 13), demonstrating
the advantage of TS²P.
Fig. 13. Top two rows show the input images. The third row shows the Ξ
outputs when only the first 2, 4 and all 8 images are used.
TABLE II: Comparison of Different Methods Used for Gap Detection
Gap detection using TS²P almost always results in an eroded
version of the true gap. This is good for safe maneuvers like
the one considered in this letter. However, aggressive flight algorithms
might suffer due to the conservative results. This can be
mitigated by tuning the image baselines, the value of N and the
trajectory of the quadrotor to obtain minimum-erosion results.
Tuning these parameters is easy when a prior about the scene is
known or when the flow algorithms are fast enough that one can actively
change the detection trajectory so as to maximize the coverage
of the part of the contour least ‘seen’. The dynamic choice of
these parameters comes into the scope of our future work.
In the last experiment, we present alternative approaches, including
state-of-the-art methods, which can be used to find the
gap. The methods can be subdivided into structure based approaches
and structureless approaches. The structure based approaches
are defined as the set of approaches in which a full 3D
reconstruction of the scene is computed, whereas structureless
approaches do not compute one. The structure based approaches presented
are DSO (Direct Sparse Odometry) [16], depth from hardware
stereo cameras [15], and Stereo SLAM (simultaneous localization
and mapping using stereo cameras and an IMU) [15]. The
data for the structure based approaches were collected using a Parrot
SLAMDunk [15]. The structureless approaches presented are
MonoDepth [35] (deep learning based monocular depth estimation)
and the proposed TS²P with different dense optical flow
algorithms, namely, FlowNet2 [31], SpyNet [36] and DIS [37].
Table II shows the comparison of the stated methods averaged over 150 trials.
Fig. 10 compares the results of DSO, stereo depth, MonoDepth
and our method (TS²P) with the ground truth. It can be
inferred that the MonoDepth results are extremely noisy (even
with different models), making it impossible to detect the gap,
as the images in our paper were never “seen” during training.
Note that we don’t retrain or finetune any of the deep learning
models in this letter. Retraining MonoDepth and other deep
learning based methods used in this letter on our dataset might
lead to better results. In contrast, the DSO and stereo depth results can
be used to detect the opening with some filtering. Stereo SLAM
and DSO are slow in the map building stage (taking about 6 s
and 12 s respectively).
TABLE III: Comparison of Different Methods Used for Tracking
Fig. 14. Quadrotor traversing an unknown window with a minimum tolerance
of just 5 cm. (red dashed line denotes C.)
However, once the map is built and the
algorithms are in the localization stage, the depth (or scaled
depth) is obtained at 20 Hz. The Stereo SLAM and Stereo
Depth were run on the SLAMDunk with an NVIDIA Jetson
TK1 processor, which is much slower than the NVIDIA Jetson
TX2 processor used for running DSO and the other methods.
Fig. 11 compares different optical flow methods used for
TS²P. Though SpyNet and DIS optical flow are faster, FlowNet2
outputs significantly better results at the edges, which is important
for obtaining a good gap detection; this can be observed
by looking at Ξ for each algorithm.
After the gap detection has been performed, $\mathcal{F}$ and $\mathcal{B}$ are
computed from the detected gap $\mathcal{C}$. Fig. 5 shows $\mathcal{F}$ and $\mathcal{B}$ being
propagated across time as the quadrotor is in pursuit of
going through the gap with the update of $\mathbf{x}_s$. A comparison of
tracking using different methods is given in Table III. Clearly,
KLT outperforms all other methods with a theoretical maximum
quadrotor speed of 8 ms⁻¹ in the given scene. The theoretical
maximum speed is calculated for a global shutter camera in
such a way that the motion parallax is constrained within one
pixel for the scene with ${}^0Z_F \approx 2.6$ m and ${}^0Z_B \approx 5.7$ m. The calculation
assumes that none of the tracking/matching methods
work when the motion blur is more than one pixel. However,
most of the methods can work well up to some pixels of motion
blur; this will in turn increase the theoretical maximum speed
by this factor. If a rolling shutter camera is used without rolling
shutter compensation, the theoretical maximum speed value has
to be divided by the factor of blur caused by the rolling shutter. We
achieved a practical maximum speed of 2.5 ms⁻¹ in our experiments.
We were limited to this speed due to the acceleration
constraints on the Bebop 2 and the rolling shutter camera.
We achieved a remarkable success rate of 85% over 150 trials
for different arbitrary shaped windows under a wide range of
conditions, which includes a window with a minimum tolerance
of just 5 cm (Fig. 14). Success is defined as the window detection
output $\mathcal{O}$ having at least 75% overlap with the ground truth and
traversal through the gap without collision. Failure cases also
include the optimization failures and/or feature tracking failures
for structure based approaches. For TS²P, we define the Detection
Rate (DR), Average False Negative (AFN) and Average False
Positive (AFP) as follows (AFN and AFP are computed only for
successful trials):
$$
DR = \frac{\sum_{k=1}^{\text{Num. Trials}} \lambda^k_D}{\text{Num. Trials}}; \quad \lambda^k_D = \left(\frac{|\mathcal{G}\cap\mathcal{O}|}{|\mathcal{G}|}\right)_k \geq 0.75
$$
$$
AFN = \frac{\sum_{k=1}^{\text{Num. Succ. Trials}} \lambda^k_N}{\text{Num. Succ. Trials}}; \quad \lambda^k_N = \left(\frac{|\mathcal{G}\cap\overline{\mathcal{O}}|}{|\mathcal{G}|}\right)_k
$$
$$
AFP = \frac{\sum_{k=1}^{\text{Num. Succ. Trials}} \lambda^k_P}{\text{Num. Succ. Trials}}; \quad \lambda^k_P = \left(\frac{|\overline{\mathcal{G}}\cap\mathcal{O}|}{|\mathcal{G}|}\right)_k
$$
where $\overline{A}$ is the negation (complement) of the set $A$.
V. CONCLUSION
We present a minimalist philosophy that mimics insect behaviour
to solve complex problems with minimal sensing, using
active movement to simplify the problem at hand. This philosophy
was used to develop a method to find an unknown gap and
fly through it using only a monocular camera and onboard sensing.
A comprehensive comparison and analysis is provided. To
our knowledge, this is the first letter which addresses the problem
of gap detection of an unknown shape and location with a
monocular camera and onboard sensing. As a parting thought,
IMU data can be coupled with the monocular camera to obtain the
scale of the window and plan aggressive maneuvers.
ACKNOWLEDGMENT
The authors would like to thank K. Zampogiannis for helpful
discussions and feedback.
REFERENCES
[1] T. Tomic et al., “Toward a fully autonomous UAV: Research platform for
indoor and outdoor urban search and rescue,” IEEE Robot. Autom. Mag.,
vol. 19, no. 3, pp. 46–56, Sep. 2012.
[2] T. Özaslan et al., “Inspection of penstocks and featureless tunnel-like
environments using micro UAVs,” in Field and Service Robotics. Berlin,
Germany: Springer, 2015, pp. 123–136.
[3] N. Michael et al., “Collaborative mapping of an earthquake-damaged
building via ground and aerial robots,” J. Field Robot., vol. 29, no. 5,
pp. 832–841, 2012.
[4] T. Taketomi et al., “Visual slam algorithms: A survey from 2010 to 2016,”
IPSJ Trans. Comput. Vis. Appl., vol. 9, no. 16, pp. 16–26, Jun. 2017.
[5] T. Qin and S. Shen., “Robust initialization of monocular visual-inertial
estimation on aerial robots,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots
Syst., Vancouver, BC, Canada, 2017, pp. 4225–4232.
[6] M. Bloesch et al., “Iterated extended Kalman filter based visual-inertial
odometry using direct photometric feedback,” Int. J. Robot. Res., vol. 36,
no. 10, pp. 1053–1072, 2017.
[7] J. Aloimonos et al., “Active vision,” Int. J. Comput. Vis., vol. 1, no. 4,
pp. 333–356, 1988.
[8] J. Bohg et al., “Interactive perception: Leveraging action in perception and
perception in action,” IEEE Trans. Robot., vol. 33, no. 6, pp. 1273–1291,
Dec. 2017.
[9] R. Bajcsy et al., “Revisiting active perception,” Auton. Robots, vol. 42,
pp. 177–196, 2018.
[10] T. S. Collett, “Insect vision: Controlling actions through optic flow,” Cur-
rent Biol., vol. 12, no. 18, pp. R615–R617, 2002.
[11] B. Mantel, T. A. Stoffregen, A. Campbell, and B. G. Bardy, “Exploratory
movement generates higher-order information that is sufficient for accu-
rate perception of scaled egocentric distance,” PloS One, vol. 10, no. 4,
2015, Art. no. e0120025.
[12] K. Kral and M. Poteser, “Motion parallax as a source of distance informa-
tion in locusts and mantids,” J. Insect Behavior, vol. 10, no. 1, pp. 145–163,
1997.
[13] GeoSLAM. 2018. [Online]. Available: https://geoslam.com/
[14] F. Endres, J. Hess, J. Strum, D. Cremers, and W. Burgard, “3-D mapping
with an RGB-D camera,” IEEE Trans. Robot., vol. 30, no. 1, pp. 177–187,
Feb. 2014.
[15] Parrot SLAMDunk. 2018. [Online]. Available: http://developer.parrot.com/docs/slamdunk/
[16] J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE
Trans. Pattern Anal. Mach. Intell., vol. 40, no. 3, pp. 611–625, Mar. 2017.
[17] N. Smolyanskiy et al., “Toward low-flying autonomous MAV trail navi-
gation using deep neural networks for environmental awareness,” in Proc.
IEEE/RSJ Int. Conf. Intell. Robots Syst., Vancouver, BC, Canada, 2017,
pp. 4241–4247.
[18] P. D’Alberto et al., “Generating FPGA-accelerated DFT libraries,”
in Proc.15th Annu. IEEE Symp. Field-Programmable Custom Comput.
Mach., 2007, pp. 173–184.
[19] E. Nurvitadhi et al., “Can FPGAs beat GPUs in accelerating next-
generation deep neural networks?” in Proc. ACM/SIGDA Int. Symp. Field-
Programmable Gate Arrays, 2017, pp. 5–14.
[20] S. Han et al., “ESE: Efficient speech recognition engine with sparse
LSTM on FPGA,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable
Gate Arrays, 2017, pp. 75–84.
[21] G. Loianno, C. Brunner, G. McGrath, and V. Kumar, “Estimation, control,
and planning for aggressive flight with a small quadrotor with a single
camera and IMU,” IEEE Robot. Autom. Lett., vol. 2, no. 2, pp. 404–411,
Apr. 2017.
[22] D. Falanga et al., “Aggressive quadrotor flight through narrow gaps with
onboard sensing and computing using active vision,” in Proc. IEEE Int.
Conf. Robot. Autom., 2017, pp. 5774–5781.
[23] N. Franceschini, J. M. Pichon, and C. Blanes, “From insect vision to
robot vision,” Philosoph. Trans. Roy. Soc. Lond. B, vol. 337, no. 1281,
pp. 283–294, 1992.
[24] M. V. Srinivasan et al., “Robot navigation inspired by principles of insect
vision,” Robot. Auton. Syst., vol. 26, no. 2/3, pp. 203–216, 1999.
[25] J. R. Serres and F. Ruffier, “Optic flow-based collision-free strategies:
From insects to robots,” Arthropod Struct. Develop., vol. 46, no. 5,
pp. 703–717, 2017.
[26] K. Y. W. Scheper et al., “Behavior trees for evolutionary robotics,” Artif.
Life, vol. 22, no. 1, pp. 23–48, 2016.
[27] Longuet-Higgins et al., “The interpretation of a moving retinal image,”
in Proc. Roy. Soc. Lond. B, 1980, vol. 208, pp. 385–397.
[28] K. Sreenath, N. Micheal, and V. Kumar, “Trajectory generation and control
of a quadrotor with a cable-suspended load-a differentially-flat hybrid
system,” in Proc. IEEE Int. Conf. Robot. Autom., 2013, pp. 4888–4895.
[29] B. D. Lucas and T. Kanade, “An iterative image registration technique
with an application to stereo vision,” in Proc. 7th Int. Joint Conf. Artif.
Intell., 1981, pp. 674–679.
[30] B. K. P. Horn and B. G. Schunck, “Determining optical flow,” Artif. Intell.,
vol. 17, no. 1-3, pp. 185–203, 1981.
[31] E. Ilg et al., “Flownet 2.0: Evolution of optical flow estimation with deep
networks,” in Proc IEEE Conf. Comput. Vis. Pattern Recognit., 2017,
vol. 2, pp. 2462–2470.
[32] C. Tomasi and T. Kanade, “Detection and tracking of point features,”
Tech. Rep. CMU-CS-91-132, Carnegie Mellon University, Apr. 1991.
[33] O. Hall-Holt et al., “Finding large sticks and potatoes in poly-
gons,” in Proc. 17th Annu. ACM-SIAM Symp. Discr. Algorithm, 2006,
pp. 474–483.
[34] D. Mellinger and V. Kumar, “Minimum snap trajectory generation and
control for quadrotors,” in Proc. IEEE Int. Conf. Robot. Autom., 2011,
pp. 2520–2525.
[35] C. Godard et al., “Unsupervised monocular depth estimation with left-
right consistency,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit.,
2017, pp. 6602–6611.
[36] A. Ranjan and M. J. Black, “Optical flow estimation using a spatial pyra-
mid network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017,
pp. 2720–2729.
[37] T. Kroeger et al., “Fast optical flow using dense inverse search,” in
Proc. Eur. Conf. Comput. Vis., 2016, pp. 471–488.
[38] J. W. Bian et al., “GMS: Grid-based motion statistics for fast, ultra-
robust feature correspondence,” in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit., 2017, pp. 2828–2837.
[39] E. Rosten, R. Porter, and T. Drummond, “Faster and better: A machine
learning approach to corner detection,” IEEE Trans. Pattern Anal. Mach.
Intell., vol. 32, no. 1, pp. 105–119, Jan. 2010.
[40] M. Björkman et al., “Detecting, segmenting and tracking unknown objects
using multi-label MRF inference,” Comput. Vis. Image Understanding,
vol. 118, pp. 111–127, 2014.