GapFlyt: Active Vision Based Minimalist
Structure-Less Gap Detection For Quadrotor Flight
Nitin J. Sanket, Chahat Deep Singh, Kanishka Ganguly, Cornelia Fermüller, and Yiannis Aloimonos
Abstract—Although quadrotors, and aerial robots in general,
are inherently active agents, their perceptual capabilities in lit-
erature so far have been mostly passive in nature. Researchers
and practitioners today use traditional computer vision algorithms
with the aim of building a representation of general applicability: a
3-D reconstruction of the scene. Using this representation, planning
tasks are constructed and accomplished to allow the quadrotor to
demonstrate autonomous behavior. These methods are inefficient
as they are not task driven and such methodologies are not uti-
lized by flying insects and birds. Such agents have been solving the
problem of navigation and complex control for ages without the
need to build a 3-D map and are highly task driven. In this letter,
we propose this framework of bioinspired perceptual design for
quadrotors. We use this philosophy to design a minimalist senso-
rimotor framework for a quadrotor to fly through unknown gaps
without an explicit 3-D reconstruction of the scene using only a
monocular camera and onboard sensing. We successfully evaluate
and demonstrate the proposed approach in many real-world ex-
periments with different settings and window shapes, achieving a
success rate of 85% at 2.5 ms⁻¹ even with a minimum tolerance
of just 5 cm. To the best of our knowledge, this is the first letter that
addresses the problem of gap detection of an unknown shape and
location with a monocular camera and onboard sensing.
Index Terms—Active vision, gap detection, quadrotor, visual ser-
voing, deep learning in robotics and automation, optical flow, track-
ing, collision avoidance, computer vision for automation, aerial sys-
tems: perception and autonomy, visual tracking and visual-based
navigation.
I. INTRODUCTION AND PHILOSOPHY
THE quest to develop intelligent and autonomous quadrotors [1]–[3] has taken center stage in recent years
due to their usefulness in aiding safety and intuitive control.
To achieve any form of autonomy, quadrotors need to have
Manuscript received February 15, 2018; accepted May 19, 2018. Date of
publication June 4, 2018; date of current version June 15, 2018. This letter was
recommended for publication by Associate Editor V. Lippiello and Editor J.
Roberts upon evaluation of the reviewers’ comments. This work was supported
in part by the Brin Family Foundation, in part by the Northrop Grumman
Corporation, and in part by the National Science Foundation under Grants
SMA 1540917 and CNS 1544797, respectively. (Nitin J. Sanket and Chahat
Deep Singh contributed equally to this work.) (Corresponding author: Nitin J.
Sanket.)
The authors are with the University of Maryland Institute for Advanced Com-
puter Studies, College Park, MD 20742 USA (e-mail: [email protected]).
This letter has supplemental downloadable multimedia material available at
http://ieeexplore.ieee.org, provided by the authors. The Supplementary Mate-
rials include a video, a supplementary PDF, code, and hardware tutorial. This
material is 91.1 MB in size. The supplementary hardware tutorial, appendix,
code and video are also available at prg.cs.umd.edu/GapFlyt.html.
Digital Object Identifier 10.1109/LRA.2018.2843445
Fig. 1. Different parts of the pipeline: (a) Detection of the unknown gap using active vision and the TS²P algorithm (cyan highlight shows the path followed
for obtaining multiple images for detection), (b) Sequence of quadrotor pass-
ing through the unknown gap using visual servoing based control. (The blue
and green highlights represent the tracked foreground and background regions
respectively. All the images in this letter are best viewed in color.)
perceptual capabilities in order to sense the world and react
accordingly. A generic and fairly common solution to pro-
viding perceptual capabilities is to acquire a 3D model of its
environment. Creating such a model is very useful because of
its general applicability one could accomplish many tasks us-
ing the same model. The process of obtaining a 3D model of the
environment using onboard sensing and a myriad of different
sensors has gained momentum in the last few years [4]. Sensors
like the LIDAR, RGB-D and stereo camera cannot be used on a
small quadrotor due to their size, weight, and power limitations.
TABLE I: Minimalist Design of Autonomous Quadrotor (Drone) Behaviours
This constrains us to a monocular camera along with the already
present onboard inertial sensor (IMU) and many algorithms have
been developed based on the same principle [5], [6].
Instead of attempting to recover a 3D model of the scene, we
want to recover a “minimal” amount of information that is suffi-
cient to complete the task under consideration. We conceptualize
an autonomous quadrotor as a collection of processes that allow
the agent to perform a number of behaviors (or tasks) (Table I).
At the bottom of the hierarchy is the task of kinetic stabilization
(or egomotion). Next comes the ability for obstacle avoidance,
where the obstacles could be static or dynamic, followed by
ability to perform homing, i.e., to find a specific location in an
environment. Last in the hierarchy comes the ability to land (on
a static or a dynamic surface) and the ability to pursue, or escape
from, another agent. This hierarchy of competences, with each
competence needing the one before it, constitutes the sensori-
motor system of the quadrotor agent. These behaviors can be
accomplished without an explicit 3D reconstruction because of
the “active” nature of the quadrotor. The quadrotor can perform
maneuvers and control the image acquisition process, thus intro-
ducing new constraints that were not there before - this is called
“active” vision [7]–[9]. This methodology was inspired by the
fruit fly [10]. Prior research has shown that fruit flies, and other
insects [11], [12], do not perceive depth directly; instead, depth
is perceived through a balance of active and exploratory procedures. In this letter, we
focus on the second competence of obstacle avoidance. Specif-
ically, the question this letter deals with is: “Can a quadrotor
manage to go through an arbitrarily shaped gap without build-
ing an explicit 3D model of a scene, using only a monocular
camera?” We develop the visual mathematics of the solution,
evaluate and demonstrate the proposed approach in many real
experiments with different settings and window shapes.
Traditional computer vision based approaches such as sparse
or dense reconstruction [13]–[16] have been used to obtain a
3D structure of the scene over which sophisticated path plan-
ners have been used to plan collision free paths. Lately, deep-
learning driven approaches have taken center stage in solving
the problem of fast collision avoidance and safe planning on
quadrotors [17]. Most of these neural network approaches com-
pute Fast Fourier Transforms (FFTs) for large filter sizes [18].
Such approaches can be processed on a Field-Programmable
Gate Array (FPGA), rather than a Graphical Processing Unit
(GPU) to drastically improve the computation performance and
power efficiency [19], [20].
To our knowledge, this is the first letter which addresses
the problem of gap detection with a monocular camera and
onboard sensing. However, the problem of going through gaps
has fascinated researchers for many years and some of the
recent works which present algorithms for planning and control
can be found in [21], [22]. Some of the works which paved the way
to the bio-inspired approach used in this letter can be found in
[23]–[26].
Fig. 2. Components of the environment. On-set Image: Quadrotor view of the
scene.
The key contributions of this letter are given below:
• Active vision based structure-less minimalist gap detection algorithm, Temporally Stacked Spatial Parallax or TS²P (Fig. 1(a)).
• Visual servoing on a contour for the quadrotor to fly through unknown gaps (Fig. 1(b)).
A. Organization of the Paper
Section II presents the detection algorithm of a gap in an ar-
bitrary shaped window using Deep Learning based optical flow
and the role of active vision in such algorithms. Section III de-
scribes the algorithm used for tracking the gap/safe point and the
quadrotor control policy. Section IV illustrates the experimental
setup and provides error analyses and comparisons with tradi-
tional approaches. We finally conclude the paper in Section V
with parting thoughts on future work.
B. Problem Formulation and Proposed Solutions
A quadrotor is present in a static scene (Fig. 2), where the
absolute depth of each point as ‘seen’ from the camera can be
modelled as a univariate bimodal Gaussian distribution. The
location at which there is a maximum spatial depth discrepancy
between pixels (projected points) is defined as the Contour (C) of
the opening. In this letter, the words boundary, gap, window or
opening refer to the same entity and will be used interchange-
ably. Any pixel close to the mode corresponding to the lower
depth value is defined as the Foreground (F) and similarly that
corresponding to the higher depth value is defined as the Back-
ground (B). A simple way of solving the problem of finding
the gap for a near fronto-parallel pose is to find the depth dis-
crepancy which is a trivial clustering problem when the depth is
known. The depth can be known if multiple calibrated cameras
are used or the entire scene is reconstructed in 3D. These meth-
ods are computationally expensive [4]. In this letter, we propose
a ‘minimalist’ solution to find any arbitrary shaped gap for the
quadrotor to go through using the Temporally Stacked Spatial
Parallax (TS²P) algorithm. A control policy based on contour visual
servoing is used to fly through unknown gaps.
Fig. 3. Representation of co-ordinate frames.
II. GAP DETECTION USING TS²P
Before we explain the procedure for gap detection, we need
to formally define the notation used. $({}^aX_b, {}^aY_b, {}^aZ_b)$ denotes
the coordinate frame of $b$ represented in the reference of $a$. The
letters $I$, $C$, $B$ and $W$ are used as sub/superscripts to denote
quantities related to the Inertial Measurement Unit (IMU), Camera,
Body and World respectively. If a sub/superscript is omitted,
the quantity can be assumed to be in $W$. $C$ and $I$ are assumed
to be rigidly attached to each other with known intrinsics and
extrinsics. $B$ is defined to be aligned with $I$ (Fig. 3). A pinhole
camera model is used for the formation of the image. The world
point $\mathbf{X}$ gets projected onto the image plane point $\mathbf{x}$.
The camera captures images/frames at a certain frequency. Let
the frame at time $t_i$ be denoted as $F_i$ and called the $i$th frame.
Optical flow [27] between the $i$th and $j$th frames is denoted by
${}^j_i\dot{p}_x$, which is a sum of both translational $({}^j_i\dot{p}_{x,T})$ and rotational
$({}^j_i\dot{p}_{x,R})$ components, given by:
$$
{}^j_i\dot{p}_{x,T} = \frac{1}{Z_x}\begin{bmatrix} xV_z - V_x \\ yV_z - V_y \end{bmatrix};\qquad
{}^j_i\dot{p}_{x,R} = \begin{bmatrix} xy & -(1+x^2) & y \\ (1+y^2) & -xy & -x \end{bmatrix}\Omega
$$
where $\mathbf{x} = \begin{bmatrix} x & y \end{bmatrix}^T$ denotes the pixel coordinates in the image
plane, $Z_x$ is the depth of $\mathbf{X}$ (the world point corresponding to
pixel $\mathbf{x}$), and $V = \begin{bmatrix} V_x & V_y & V_z \end{bmatrix}^T$ and $\Omega$ are the linear and angular
velocities of $C$ in $W$ between $t_i$ and $t_j$ respectively.
The otherwise complex depth estimation problem can be trivialized
by moving in a certain way [7]. These “active” vision
principles dictate that we control the quadrotor’s movements so
as to make the interpretation of optical flow easier. Since the
quadrotor is a differentially flat system [28], the rotations about
$(X_B, Y_B, Z_B)$, i.e., roll, pitch and yaw, can be decoupled. As
an implication, the movements can be controlled in a way that
simplifies the depth ($Z_x$) estimation problem. The quadrotor is
controlled in such a way that $\Omega \approx 0$, $V_z \ll V_x$ and $V_z \ll V_y$;
then the optical flow can be modelled as:
$$
{}^j_i\dot{p}_x = -\frac{1}{{}^j_iZ_x}\begin{bmatrix} {}^j_iV_x & {}^j_iV_y \end{bmatrix}^T + \eta
$$
where $\eta$ is the approximation error. This shows that using the
aforementioned “active” control strategy, we obtain an implicit
3D structure of the environment in the optical flow. The inverse
depth in this “controlled” case manifests as a linear function of
the optical flow.
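To make this concrete, the following minimal Python sketch (an illustration under the stated assumptions, not the authors' implementation; the flow array, velocities and function name are hypothetical) shows how, once $\Omega \approx 0$ and $V_z \ll V_x, V_y$, the per-pixel flow magnitude becomes a direct proxy for inverse depth up to the known lateral speed.

import numpy as np

def inverse_depth_proxy(flow, v_x, v_y):
    """Relative inverse depth from optical flow under the 'active' constraints.

    flow : (H, W, 2) array of per-pixel flow (u, v) between two frames captured
           while the quadrotor translates laterally (Omega ~ 0, |V_z| << |V_x|, |V_y|).
    v_x, v_y : lateral camera velocities between the two frames.

    Returns an (H, W) array proportional to 1/Z at every pixel: larger values
    correspond to closer points (foreground), smaller values to background.
    """
    flow_mag = np.linalg.norm(flow, axis=2)      # ||p_dot|| ~ sqrt(Vx^2 + Vy^2) / Z
    lateral_speed = np.hypot(v_x, v_y) + 1e-9    # guard against division by zero
    return flow_mag / lateral_speed              # ~ 1/Z (up to the noise term eta)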
The optical flow equation can be written for both foreground
(F) and background (B) pixels independently. The magnitude
of optical flow for F is given by,
$$
\left\|{}^j_i\dot{p}_{x,F}\right\|_2 = \frac{1}{{}^j_iZ_{x,F}}\sqrt{{}^j_iV_x^2 + {}^j_iV_y^2} + \nu
$$
where $\nu \sim \mathcal{N}(0, \sigma)$ is assumed to be additive white Gaussian
noise that is independent of the scene structure and the amount
of camera movement between two frames $(V, \Omega)$. For such assumptions
to be valid, the optical flow algorithm needs to work
well for a wide range of camera movements in a variety of scene
structures. Using fast traditional optical flow formulations based
on methods like [29] or [30] voids such assumptions. This motivated
us to use deep-learning based flow algorithms, which excel
at this task while maintaining a reasonable speed when running
on a GPU. In this letter, FlowNet2 [31] is used to compute
optical flow unless stated otherwise.
A simple method to reduce noise is to compute the mean of
the flow magnitudes across a few frames. Let $\xi = \{F_j, \ldots, F_k\}$
be a set of $N$ frames from which the optical flow is computed
with respect to some reference frame $F_i$ in which the complete
gap is assumed to be visible. Here, $N = |\xi|$ is the cardinality of
the set $\xi$. The mean flow magnitude at $\mathbf{x}$ for F $\left(\left\|{}^\xi_i\dot{p}_{x,F}\right\|_2\right)$ and
B $\left(\left\|{}^\xi_i\dot{p}_{x,B}\right\|_2\right)$ is given by,
$$
\left\|{}^\xi_i\dot{p}_{x,F/B}\right\|_2 = \frac{1}{N\,Z_{x,F/B}}\sum_{r=j}^{k}{}^r_iV + \nu'
$$
where $\nu' \sim \mathcal{N}\left(0, N^{-0.5}\sigma\right)$ and $V = \sqrt{V_x^2 + V_y^2}$. Clearly, the noise
standard deviation varies inversely with $\sqrt{N}$.
Since $Z_{x,F} < Z_{x,B}$, we can say that $\left\|{}^\xi_i\dot{p}_{x,F}\right\|_2 > \left\|{}^\xi_i\dot{p}_{x,B}\right\|_2$.
Now, $\left\|{}^\xi_i\dot{p}_{x,F}\right\|_2 - \left\|{}^\xi_i\dot{p}_{x,B}\right\|_2 \geq \tau$ can be used as a criterion for
finding possible boundary regions. It was found experimentally
that using inverse flow differences, $\left\|{}^\xi_i\dot{p}_{x,B}\right\|_2^{-1} - \left\|{}^\xi_i\dot{p}_{x,F}\right\|_2^{-1} \geq \tau$,
gave better numerical stability and better noise performance
due to the scale compression by the inverse function. This is
inherently the spatial derivative of the inverse average (stacked) flow
magnitudes and can be written as $\Xi = \nabla \cdot \left\|{}^\xi_i\dot{p}_x\right\|_2^{-1}$, where
$\nabla = \begin{bmatrix} \partial/\partial x & \partial/\partial y \end{bmatrix}^T$. Note that this is the same as solving
the edge detection problem in computer vision and any kernel
or method like the Sobel operator or the Canny edge detection
algorithm can be used.
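As a concrete illustration of the TS²P detection stage, the sketch below (a minimal, hedged rendition under the assumptions of this section; the function name and parameters are hypothetical, and OpenCV's Farnebäck dense flow stands in for FlowNet2) stacks flow magnitudes over $N$ frames, forms Ξ from the inverse averaged magnitude, and extracts candidate gap boundaries with a Canny edge detector followed by morphological cleanup, mirroring the post-processing described in Section IV.

import cv2
import numpy as np

def ts2p_gap_edges(ref_frame, frames, canny_lo=50, canny_hi=150):
    """Temporally Stacked Spatial Parallax (illustrative sketch).

    ref_frame : grayscale reference image F_i in which the whole gap is visible.
    frames    : list of grayscale images {F_j, ..., F_k} taken while the quadrotor
                translates laterally (Omega ~ 0, small V_z).
    Returns (xi, edges): the Xi map and its detected edges (candidate contour C).
    """
    # Stack flow magnitudes with respect to the reference frame and average over N.
    stacked = np.zeros(ref_frame.shape[:2], dtype=np.float64)
    for frame in frames:
        # Stand-in for FlowNet2: any dense flow estimator suffices for the sketch.
        flow = cv2.calcOpticalFlowFarneback(ref_frame, frame, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        stacked += np.linalg.norm(flow, axis=2)
    mean_mag = stacked / max(len(frames), 1)

    # Xi: spatial derivative of the inverse averaged (stacked) flow magnitude.
    inv_mag = 1.0 / (mean_mag + 1e-6)
    gx = cv2.Sobel(inv_mag, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(inv_mag, cv2.CV_64F, 0, 1, ksize=3)
    xi = np.hypot(gx, gy)

    # Edge detection plus morphological closing to obtain a candidate contour.
    xi_u8 = cv2.normalize(xi, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    edges = cv2.Canny(xi_u8, canny_lo, canny_hi)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    edges = cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel)
    return xi, edges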
III. HIGH SPEED GAP TRACKING FOR VISUAL SERVOING BASED CONTROL
This section presents a targeted solution for tracking a contour
using label sets propagated using Focus of Expansion
(FOE) constraints. A pixel at location $\mathbf{x}$ is associated with a
score $\chi(\mathbf{x}) \in [-1, 1]$ which denotes its score as foreground or
background. The foreground and background pixel locations
are defined by $\mathcal{F} = \{\mathbf{x} \mid \chi(\mathbf{x}) = +1\}$ and $\mathcal{B} = \{\mathbf{x} \mid \chi(\mathbf{x}) = -1\}$
respectively.
We define the opening $\mathcal{O}$ on the image plane as $\mathcal{O} = \{\mathbf{x} \mid \chi(\mathbf{x}) < 0\}$.
The pixel locations which cannot be perfectly
classified as foreground or background belong to the uncertainty
zone and are defined as $\mathcal{U} = \{\mathbf{x} \mid \chi(\mathbf{x}) \in (-1, +1)\}$. The contour
location is defined as $\mathcal{C} = \{\mathbf{x} \mid \chi(\mathbf{x}) = 0\}$. Fig. 4 gives a
visual representation of the different sets on a sample image.
The aim is to track the contour $\mathcal{C}$ across time. This problem is hard as contour
tracking relies on updating $\chi(\mathbf{x})\ \forall \mathbf{x} \in \mathcal{U}$ over time, which is
non-trivial and computationally expensive. To simplify the problem,
we use the dual formulation, which is to
track the pixel locations belonging to the set defined by
$\{\mathbf{x} \notin \mathcal{U}\} = \mathcal{F} \cup \mathcal{B}$. This enables us to track the contour indirectly
at high speeds, as corner tracking is comparatively faster
[32]. The trade-off in using the dual formulation is that we don’t
obtain the actual contour location across time, which might be
needed for aggressive maneuvers; this is not dealt with in the
scope of this letter.
Fig. 4. Label sets used in tracking. (blue: foreground region, green: back-
ground region, orange: uncertainty region, black line: contour, brighter part of
frame: active region and darker part of frame: inactive region.)
Algorithm 1: Label Set Propagation Using FOE Constraints.
Data: $\mathcal{C}^i_F$, $\mathcal{C}^i_B$, $F_i$, $F_j$, $\mathcal{F}_i$, $\mathcal{B}_i$
Result: $\mathcal{C}^j_F$, $\mathcal{C}^j_B$, $\mathcal{F}_j$, $\mathcal{B}_j$
1: $\mathcal{C}^j_F = \text{FeatureTracker}\left(\mathcal{C}^i_F, F_i, F_j\right)$;
2: $A = \begin{bmatrix} \mathcal{C}^i_{F,x} & 1 \end{bmatrix}$;
3: $B = \begin{bmatrix} \mathcal{C}^j_{F,x} - \mathcal{C}^i_{F,x} \end{bmatrix}$;
4: $\begin{bmatrix} \alpha & \beta \end{bmatrix}^T = A^\dagger B$;
5: $x_{0,F} = -\beta/\alpha$; $\; y_{0,F} = \mathcal{C}^j_{F,y} - \frac{1}{\alpha}\left(\mathcal{C}^j_{F,y} - \mathcal{C}^i_{F,y}\right)$;
6: $\mathcal{F}_j = \alpha \begin{bmatrix} \mathcal{F}_{i,x} - x_{0,F} \\ \mathcal{F}_{i,y} - y_{0,F} \end{bmatrix} + \begin{bmatrix} \mathcal{F}_{i,x} \\ \mathcal{F}_{i,y} \end{bmatrix}$;
7: Repeat steps 1 through 6 for $\mathcal{B}$ in parallel;
The label set propagation algorithm is described
in Algorithm 1. Here, $\mathcal{C}^i_F$ and $\mathcal{C}^i_B$ represent the sets of
corner/feature points for the foreground and background pixels
respectively in $F_i$. $A^\dagger$ denotes the pseudo-inverse of the
matrix $A$ and $\begin{bmatrix} x_{0,F} & y_{0,F} \end{bmatrix}^T$ is the FOE for the foreground.
Intuitively, we solve the linear equations of the horizontal flow
field to obtain the divergence/time-to-contact $\alpha$. The divergence
in the x-direction is also used to predict the y-coordinates.
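The sketch below illustrates one way the propagation step could be realized (an illustrative rendition, not the authors' code; OpenCV's pyramidal KLT tracker cv2.calcOpticalFlowPyrLK stands in for FeatureTracker, and all variable names are hypothetical): track the sparse corners, fit the one-dimensional affine model of the horizontal flow to recover the divergence α and the FOE, then warp the dense label set with the same model.

import cv2
import numpy as np

def propagate_labels_foe(corners_i, frame_i, frame_j, labels_i):
    """Propagate a label set (foreground or background pixels) between frames.

    corners_i : (M, 1, 2) float32 corner locations in frame_i (e.g. from
                cv2.goodFeaturesToTrack restricted to the label region).
    frame_i, frame_j : grayscale frames.
    labels_i  : (K, 2) float array of labelled pixel locations in frame_i.
    Returns (corners_j, labels_j).
    """
    # Step 1: KLT tracking of the sparse corners (stand-in for FeatureTracker).
    corners_j, status, _ = cv2.calcOpticalFlowPyrLK(frame_i, frame_j, corners_i, None)
    good = status.ravel() == 1
    ci = corners_i[good].reshape(-1, 2)
    cj = corners_j[good].reshape(-1, 2)

    # Steps 2-4: least-squares fit of the horizontal flow field dx = alpha*x + beta,
    # solved through the pseudo-inverse.
    A = np.column_stack([ci[:, 0], np.ones(len(ci))])
    b = cj[:, 0] - ci[:, 0]
    alpha, beta = np.linalg.pinv(A) @ b

    # Step 5: FOE is where the fitted flow vanishes; predict y analogously.
    x0 = -beta / alpha
    y0 = np.mean(cj[:, 1] - (cj[:, 1] - ci[:, 1]) / alpha)

    # Step 6: warp every labelled pixel with the same divergence/FOE model.
    foe = np.array([x0, y0])
    labels_j = alpha * (labels_i - foe) + labels_i
    return corners_j[good], labels_j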
A. Safe Point Computation and Tracking
We can assume that the real-world points corresponding to
the background $\mathcal{B} \cup \mathcal{U}_B$ are far enough away that we can approximate
them to lie on a plane. The foreground points under consideration,
$\mathcal{F} \cup \mathcal{U}_F$, occupy a small area around the contour which can
be assumed to be planar. Here, $\mathcal{U}_F \subset \mathcal{U}$ and $\mathcal{U}_B \subset \mathcal{U}$ are the
sets which actually belong to the foreground and background
respectively.
Now, the quadrotor can be represented as an ellipsoid with
semi-axes of lengths $a$, $b$ and $c$. As an implication of the above
assumptions, the projection of the quadrotor on the window
at any instant is an ellipse. Let us define the projection of the
quadrotor on the image as $\mathcal{Q}$. The largest $\mathcal{Q}$ can be written
in terms of the matrix equation of the ellipse centered at $[h, k]^T$,
defined as $Q(h, k, R_\theta) = 0$.
Here, $R_\theta$ is a two-dimensional rotation matrix and $\theta$ is the
angle the largest semi-axis of the ellipse makes with the $X_C$
axis. The projection of the quadrotor on the image is given by
$\mathcal{Q} = \{\mathbf{x} \mid Q(\mathbf{x}) \leq 0\}$. The safe region $\mathcal{S}$ can be computed as
$$
\mathcal{S} = \bigcap_\theta \left(\mathcal{O} \ominus \mathcal{Q}\right)
$$
where $\ominus$ denotes the Minkowski difference of sets. Now, we
define the ‘safest point’ $(\mathbf{x}_s)$ as the barycentric mean of $\mathcal{S}$.
Remark: The above optimization problem can only be solved
using convex optimization with a guaranteed global solution
when both $\mathcal{O}$ and $\mathcal{Q}$ are convex sets. A conservative solution
to the above problem is fitting the largest scaled version of $\mathcal{Q}$
inside $\mathcal{O}$ when $\mathcal{O}$ is a non-convex set and $\mathcal{Q}$ is a convex set.
Note that as $\mathcal{Q}$ is a chosen model, it can always be chosen to
be a convex set, i.e., the convex hull of the non-convex set. Also,
from the above remark, the ‘safest point’ $(\mathbf{x}_s)$ can be defined as
the center of the largest ellipse which can be fit inside $\mathcal{S}$ whilst
maintaining the eccentricity equal to that defined by $\mathcal{Q}$. The
optimization problem becomes,
$$
\operatorname*{argmax}_{a,\theta} \; |\mathcal{S} \cap \mathcal{Q}| \quad \text{s.t.} \quad \mathcal{Q} \subseteq \mathcal{S} \;\text{and}\; |\theta| \leq \theta_{\max}
$$
This problem can be solved using the procedure described in
[33]. However, a desirable property of the safe region is that it
should favor less-aggressive maneuvers. This can be modelled
as a regularization penalty in the above formulation,
$$
\operatorname*{argmax}_{a,\theta} \; |\mathcal{S} \cap \mathcal{Q}| + \lambda\theta \quad \text{s.t.} \quad \mathcal{Q} \subseteq \mathcal{S} \;\text{and}\; |\theta| \leq \theta_{\max}
$$
Solving the above optimization problem is computationally
intensive and not possible without prior knowledge of the
scale/depth. To obtain the minimalist solution, we assume
that the gap is large enough for the quadrotor to pass through
and replace the above optimization problem by an empirically
chosen approximation. A simple and efficient safe point computation
can be performed as the median of the convex set $\mathcal{O}$
and is given by $\mathbf{x}_s \triangleq \operatorname*{argmin}_{\mathbf{x}} \sum_{o \in \mathcal{O}} \|o - \mathbf{x}\|_2$.
Remark: If the above approximation is used when $\mathcal{O}$ is non-convex,
the amount of deviation from the ‘actual’ safe point is a
function of $|\text{Conv}(\mathcal{O})| / |\mathcal{O}|$. Here $\text{Conv}(\mathcal{O})$ is the convex hull of
$\mathcal{O}$.
Keen readers would note that the formulation for the safe
point is in terms of $\mathcal{O}$, which we wanted to avoid tracking
in the first place. Indeed this is true and a simple solution
takes care of this. Because we are propagating $\mathcal{F}$ and $\mathcal{B}$ with
FOE constraints, the cross-ratios of $\{\|o - f\|_2 \mid o \in \mathcal{O}, f \in \mathcal{F}\}$
and $\{\|o - b\|_2 \mid o \in \mathcal{O}, b \in \mathcal{B}\}$ are preserved. Here, $\mathcal{F}$ and $\mathcal{B}$
are computed from the detected opening $\mathcal{O}$ as follows: $\mathcal{F} =
\{(\mathcal{O} \oplus k_1) - (\mathcal{O} \oplus k_2)\}$ and $\mathcal{B} = \{\mathcal{O} \ominus k_3\}$, where each $k_i$ is a user-chosen
kernel (circular in this letter) and $\oplus$/$\ominus$ denote morphological
dilation/erosion.
The $\mathcal{F}$ and $\mathcal{B}$ are propagated from $F_i$ onwards, where the
detection was performed. The ‘safest point’ $(\mathbf{x}_s)$ is computed as
follows:
$$
\mathbf{x}_s = \begin{cases} \mathbf{x}_{s,F}, & |\mathcal{F}| \geq |\mathcal{B}| \\ \mathbf{x}_{s,B}, & \text{otherwise} \end{cases}
$$
where $\mathbf{x}_{s,F}$ and $\mathbf{x}_{s,B}$ are the safest points computed individually
for $\mathcal{F}$ and $\mathcal{B}$ by taking their medians respectively. Fig. 5 shows the
tracking of $\mathcal{F}$ and $\mathcal{B}$ across time and how the tracking algorithm
actively switches between $\mathbf{x}_{s,F}$ and $\mathbf{x}_{s,B}$ for the safest point
$\mathbf{x}_s$. When $|\mathcal{C}^i_F| \leq k_F$ or $|\mathcal{C}^i_B| \leq k_B$, the tracker/feature points are
reset in the current $\mathcal{F}$ and $\mathcal{B}$ sets respectively. Here, $k_F$ and $k_B$
are empirically chosen thresholds. For all experiments $k_F = 40$
and $k_B = 20$. Due to resetting and active switching, $\mathbf{x}_s$ can jump
around, making control hard; hence a simple Kalman filter
with a forward motion model is used to smooth out the value of
$\mathbf{x}_s$. From here on, safe point refers to the safest point $\mathbf{x}_s$.
Fig. 5. Tracking of $\mathcal{F}$ and $\mathcal{B}$ across frames. (a) shows tracking when $|\mathcal{C}^i_F| > k_F$ and $|\mathcal{C}^i_B| > k_B$. (b) When $|\mathcal{C}^i_B| \leq k_B$, the tracking for $\mathcal{B}$ will be reset. (c) When $|\mathcal{C}^i_F| \leq k_F$, the tracking for $\mathcal{F}$ will be reset. (d) shows tracking only with $\mathcal{B}$, when $\mathcal{F} = \emptyset$. (blue: $\mathcal{F}$, green: $\mathcal{B}$, yellow: $\mathcal{O}$, yellow dots: $\mathcal{C}^i_F$, red dots: $\mathcal{C}^i_B$, blue square: $\mathbf{x}_{s,F}$, red square: $\mathbf{x}_{s,B}$.)
B. Control Policy
We propose a control policy such that the quadrotor follows the tracked $\mathbf{x}_s$.
The quadrotor follows the dynamic model given in [34]. The
controller uses the traditional backstepping approach based on
[34] and contains two loops: an inner loop and an outer loop.
The inner loop controls attitude stability while the
outer loop is responsible for the quadrotor position.
It is important to note that frames are transformed from $C$ to $B$.
Since the quadrotor is differentially flat, the altitude $Z_B$ can
be controlled independently from $X_B$ and $Y_B$ [28]. The control
policy is to align the projection of the body center on the
image plane with $\mathbf{x}_s$. The difference between the two centers is
called the error $e$. The $x$ and $y$ components of the error $e$ can be
minimized by varying roll ($\phi$) and net thrust ($u_1$) respectively.
A simple Proportional-Integral-Derivative (PID) controller on $e$
is used. This control policy only deals with the alignment of $B$
to $\mathbf{x}_s$ and does not deal with moving forward ($Z_C$). To move
forward, the quadrotor pitch ($\theta$) needs to be controlled. The
rate of forward motion is controlled by the pitch angle $\theta_0$, which
is empirically chosen. The bigger the value of $\theta_0$, the faster the
quadrotor will fly towards the gap. It is important to note the implicit
assumption made in this letter that the gap is large enough
for the quadrotor to go through safely.
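The alignment loop described above can be illustrated with a short PID sketch (a hedged, self-contained illustration; the gains, names and velocity-command interface are assumptions and not the Bebop 2 driver's actual API).

class PID:
    """Minimal PID controller acting on a scalar error signal."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error, dt):
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def alignment_commands(x_s, body_center_px, dt, pid_roll, pid_thrust, pitch_0=0.1):
    """Map the pixel error between x_s and the projected body center to commands.

    The x image error drives the roll command and the y image error drives the
    net-thrust command, while a fixed, empirically chosen pitch angle pitch_0
    moves the quadrotor forward through the gap.
    """
    ex = x_s[0] - body_center_px[0]
    ey = x_s[1] - body_center_px[1]
    roll_cmd = pid_roll.step(ex, dt)      # minimizes the horizontal image error
    thrust_cmd = pid_thrust.step(ey, dt)  # minimizes the vertical image error
    return roll_cmd, pitch_0, thrust_cmd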
IV. EXPERIMENTS
A. Experimental Setup
The proposed framework was tested on a modified hobby
quadrotor, the Parrot Bebop 2, for its cost effectiveness and ease of
use. The Bebop 2 is equipped with a front facing camera, a 9-axis
IMU and a downward facing optical flow sensor coupled with
a sonar. The Parrot Bebop 2 allows only high level controls in
terms of body frame velocities using ROS. An NVIDIA Jetson
TX2 GPU is mounted on the Bebop 2 as shown in Fig. 6 and is
used to run all the perception and control algorithms onboard.
The TX2 and Bebop 2 communicate via a WiFi connection,
where the images are received at 30 Hz.
Fig. 6. The platform used for experiments. (1) The front facing camera, (2)
NVIDIA TX2 CPU+GPU, (3) Downward facing optical flow sensor (cam-
era+sonar) which is only used for position hold.
Fig. 7. First two rows: $(X_W, Y_W)$, $(Y_W, Z_W)$ and $(X_W, Z_W)$ Vicon estimates of the trajectory executed by the quadrotor in different stages (gray bar indicates the gap). The $(X_W, Z_W)$ plot shows the diagonal scanning trajectory (the lines don't coincide due to drift). Last row: Photo of the quadrotor during gap traversal. (cyan: detection stage, red: traversal stage.)
The overall weight of the flight setup is 680 g with the dimensions being 32.8 ×
38.2 × 12 cm.
All the experiments were prototyped on a PC running Ubuntu
16.04 with an Intel Core i7 6850K 3.6 GHz CPU, an NVIDIA
Titan-Xp GPU and 64 GB of RAM in MATLAB using the
Robotics Toolbox. The deep learning based optical flow runs
on Python with a TensorFlow back-end. All the final versions of
the software were ported to Python to run on the NVIDIA Jetson
TX2 running Linux for Tegra (L4T) 28.2. A step-by-step
tutorial on using the Bebop 2 as a research platform is available at
prg.cs.umd.edu/GapFlyt.html.
The environmental setup for the experiments consists of a
rigid scene which has two near-planar structures, one for the
foreground and the other for the background. As shown in
Fig. 2, let us denote the initial perpendicular distance between
the quadrotor body center and the foreground as ${}^0Z_F$ and the
background as ${}^0Z_B$. The near-planar structures are made of
foam-core with newspapers stuck on them to add texture to
the scene. The gap is located near the center of the foreground
and is of an arbitrary shape. For the detection of the window,
the quadrotor executes a fixed diagonal straight line trajectory
in the $X_W$-$Z_W$ plane as shown in Fig. 7 while taking a number
of images along its path. The number of images used for the detection
stage is a parameter denoted by $N$. Once the window is
detected, $\mathcal{F}$ and $\mathcal{B}$ are tracked across time in order for the quadrotor
to go through the gap using visual servoing based control.
Fig. 8. Sequence of images of quadrotor going through different shaped gaps. Top on-set: Ξ outputs, bottom on-set: quadrotor view.
Fig. 9. Top Row (left to right): Quadrotor view at ${}^0Z_F$ = 1.5, 2.6, 3 m respectively with ${}^0Z_B$ = 5.7 m. Bottom Row: Respective Ξ outputs for N = 4. Observe how the fidelity of Ξ reduces as ${}^0Z_F \to {}^0Z_B$, making the detection more noisy. (white boxes show the location of the gap in Figs. 9 to 13.)
Fig. 10. Comparison of different philosophies to gap detection. Top row (left to right): DSO, Stereo Depth, MonoDepth, TS²P. Bottom row shows the detected gap overlaid on the corresponding input image. (green: $\mathcal{G}\cap\mathcal{O}$, yellow: false negative $\mathcal{G}\cap\overline{\mathcal{O}}$, red: false positive $\overline{\mathcal{G}}\cap\mathcal{O}$.)
B. Experimental Results
The pipeline was evaluated on different variations of the
environmental setup. In the first experiment, we test our pipeline
on five different arbitrarily shaped gaps as shown in Fig. 8.
Unless otherwise stated, ${}^0Z_F \approx 2.6$ m and ${}^0Z_B \approx 5.7$ m. The
aim here is to find biases in the detection pipeline. The windows
were chosen to have a diversity in the geometrical sharpness
of the corners, convexity of the shape and the size. As stated
earlier, the only constraint imposed on the gaps is that they are
large enough to go through with the quadrotor pitch angle close
to zero and near-convex. The outputs of the TS²P algorithm for
different windows are shown in Fig. 8 with N = 4 along with
their inverse average (stacked) flow magnitudes Ξ, illustrating
that our detection algorithm is independent of the shape and size
of the opening. A Canny edge detector is run on Ξ followed by
morphological operations to obtain $\mathcal{C}$.
The second experiment is designed to test the noise sensitivity
of TS²P. The intuition is that the detection result becomes
noisier as ${}^0Z_F \to {}^0Z_B$.
Fig. 11. Left Column: Images used to compute Ξ. Middle Column (top to bottom): Ξ outputs for DIS Flow, SpyNet and FlowNet2. Right Column: Gap detection outputs. (green: $\mathcal{G}\cap\mathcal{O}$, yellow: false negative $\mathcal{G}\cap\overline{\mathcal{O}}$, red: false positive $\overline{\mathcal{G}}\cap\mathcal{O}$.)
Fig. 12. Top row (left to right): Quadrotor view at image sizes of 384 ×
576, 192 × 288, 96 × 144, 48 × 72, 32 × 48. Note all images are re-scaled to
384 × 576 for better viewing. Bottom row shows the respective Ξ outputs for
N =4.
The outputs for different ${}^0Z_F$ and ${}^0Z_B$ are
shown in Fig. 9 for N = 4. This is because the fidelity of Ξ decreases
and becomes more prone to noise. By increasing N, the noise
gets averaged out across frames, improving the fidelity of Ξ.
In the third experiment, we present detection outputs for different
values of N, image baselines and image sizes. The effect
of N has already been discussed previously. Having a very small
baseline effectively reduces the value of N and vice-versa.
The results from different sized images, as illustrated in
Fig. 12, show that the detection algorithm can work even on a
very small quadrotor which can only carry a very low-resolution
camera (as low as 32 × 48 pixels). Our algorithm can also handle
dynamic noise well, even though it is modelled as Gaussian
for the discussion. However, one can notice that the results improve
significantly with increase in N (Fig. 13), demonstrating
the advantage of TS²P.
Fig. 13. Top two rows show the input images. The third row shows the Ξ
outputs when only the first 2, 4 and all 8 images are used.
TABLE II: Comparison of Different Methods Used for Gap Detection
Gap detection using TS²P almost always results in an eroded
version of the true gap. This is good for safe maneuvers like
the one considered in this letter. However, aggressive flight algorithms
might suffer due to the conservative results. This can be
mitigated by tuning the image baselines, the value of N and the
trajectory of the quadrotor to obtain minimum-erosion results.
Tuning these parameters is easy when a prior about the scene is
known or when the flow algorithms are fast enough that one can actively
change the detection trajectory so as to maximize the coverage
of the part of the contour least ‘seen’. The dynamic choice of
these parameters comes into the scope of our future work.
In the last experiment, we present alternative approaches, including
state-of-the-art methods, which can be used to find the
gap. The methods can be subdivided into structure based approaches
and structureless approaches. The structure based approaches
are defined as the set of approaches in which a full 3D
reconstruction of the scene is computed, whereas structureless
approaches do not compute one. The structure based approaches presented
are DSO (Direct Sparse Odometry) [16], depth from hardware
stereo cameras [15], and Stereo SLAM (simultaneous localization
and mapping using stereo cameras and an IMU) [15]. The
data for the structure based approaches were collected using a Parrot
SLAMDunk [15]. The structureless approaches presented are
MonoDepth [35] (deep learning based monocular depth estimation)
and the proposed TS²P with different dense optical flow
algorithms, namely, FlowNet2 [31], SpyNet [36] and DIS [37].
Table II shows the comparison of the stated methods averaged over 150 trials.
Fig. 10 compares the results of DSO, stereo depth, MonoDepth
and our method (TS²P) with the ground truth. It can be
inferred that the MonoDepth results are extremely noisy (even
with different models), making it impossible to detect the gap,
as the images in our paper were never “seen” during training.
Note that we don’t retrain or finetune any of the deep learning
models in this letter. Retraining MonoDepth and other deep
learning based methods used in this letter on our dataset might
lead to better results. In contrast, the DSO and stereo depth results can
be used to detect the opening with some filtering. Stereo SLAM
and DSO are slow in the map building stage (taking about 6 s
and 12 s respectively).
TABLE III: Comparison of Different Methods Used for Tracking
Fig. 14. Quadrotor traversing an unknown window with a minimum tolerance
of just 5 cm. (red dashed line denotes C.)
However, once the map is built and the
algorithms are in the localization stage, the depth (or scaled
depth) is obtained at 20 Hz. The Stereo SLAM and Stereo
Depth were run on the SLAMDunk with an NVIDIA Jetson
TK1 processor, which is much slower than the NVIDIA Jetson
TX2 processor used for running DSO and the other methods.
Fig. 11 compares different optical flow methods used for
TS²P. Though SpyNet and DIS optical flow are faster, FlowNet2
outputs significantly better results at the edges, which is important
for obtaining a good gap detection; this can be observed
by looking at Ξ for each algorithm.
After the gap detection has been performed, $\mathcal{F}$ and $\mathcal{B}$ are
computed from the detected gap $\mathcal{C}$. Fig. 5 shows $\mathcal{F}$ and $\mathcal{B}$ being
propagated across time as the quadrotor is in pursuit of
going through the gap with the update of $\mathbf{x}_s$. A comparison of
tracking using different methods is given in Table III. Clearly,
KLT outperforms all other methods with a theoretical maximum
quadrotor speed of 8 ms⁻¹ in the given scene. The theoretical
maximum speed is calculated for a global shutter camera in
such a way that the motion parallax is constrained within one
pixel for the scene with ${}^0Z_F \approx 2.6$ m and ${}^0Z_B \approx 5.7$ m. The calculation
assumes that none of the tracking/matching methods
work when the motion blur is more than one pixel. However,
most of the methods can work well up to some pixels of motion
blur; this will in turn increase the theoretical maximum speed
by this factor. If a rolling shutter camera is used without rolling
shutter compensation, the theoretical maximum speed value has
to be divided by the factor of blur caused by the rolling shutter. We
achieved a practical maximum speed of 2.5 ms⁻¹ in our experiments.
We were limited to this speed due to the acceleration
constraints on the Bebop 2 and the rolling shutter camera.
We achieved a remarkable success rate of 85% over 150 trials
for different arbitrary shaped windows under a wide range of
conditions, which includes a window with a minimum tolerance
of just 5 cm (Fig. 14). Success is defined as the window detection
output $\mathcal{O}$ having at least 75% overlap with the ground truth and
traversal through the gap without collision. Failure cases also
include the optimization failures and/or feature tracking failures
for structure based approaches. For TS²P, we define the Detection
Rate (DR), Average False Negative (AFN) and Average False
Positive (AFP) as follows (AFN and AFP are computed only for
successful trials):
$$
DR = \frac{\sum_{k=1}^{\text{Num. Trials}} \lambda^k_D}{\text{Num. Trials}}; \quad \lambda^k_D = \left(\frac{|\mathcal{G}\cap\mathcal{O}|}{|\mathcal{G}|}\right)_k \geq 0.75
$$
$$
AFN = \frac{\sum_{k=1}^{\text{Num. Succ. Trials}} \lambda^k_N}{\text{Num. Succ. Trials}}; \quad \lambda^k_N = \left(\frac{|\mathcal{G}\cap\overline{\mathcal{O}}|}{|\mathcal{G}|}\right)_k
$$
$$
AFP = \frac{\sum_{k=1}^{\text{Num. Succ. Trials}} \lambda^k_P}{\text{Num. Succ. Trials}}; \quad \lambda^k_P = \left(\frac{|\overline{\mathcal{G}}\cap\mathcal{O}|}{|\mathcal{G}|}\right)_k
$$
where $\overline{A}$ is the negation (complement) of the set $A$.
V. CONCLUSION
We present a minimalist philosophy that mimics insect behaviour
to solve complex problems with minimal sensing, using
active movement to simplify the problem at hand. This philosophy
was used to develop a method to find an unknown gap and
fly through it using only a monocular camera and onboard sensing.
A comprehensive comparison and analysis is provided. To
our knowledge, this is the first letter which addresses the problem
of gap detection of an unknown shape and location with a
monocular camera and onboard sensing. As a parting thought,
IMU data can be coupled with the monocular camera to obtain the
scale of the window and plan aggressive maneuvers.
ACKNOWLEDGMENT
The authors would like to thank K. Zampogiannis for helpful
discussions and feedback.
REFERENCES
[1] T. Tomic et al., “Toward a fully autonomous UAV: Research platform for
indoor and outdoor urban search and rescue,” IEEE Robot. Autom. Mag.,
vol. 19, no. 3, pp. 46–56, Sep. 2012.
[2] T. Özaslan et al., “Inspection of penstocks and featureless tunnel-like
environments using micro UAVs,” in Field and Service Robotics. Berlin,
Germany: Springer, 2015, pp. 123–136.
[3] N. Michael et al., “Collaborative mapping of an earthquake-damaged
building via ground and aerial robots,” J. Field Robot., vol. 29, no. 5,
pp. 832–841, 2012.
[4] T. Taketomi et al., “Visual slam algorithms: A survey from 2010 to 2016,”
IPSJ Trans. Comput. Vis. Appl., vol. 9, no. 16, pp. 16–26, Jun. 2017.
[5] T. Qin and S. Shen., “Robust initialization of monocular visual-inertial
estimation on aerial robots,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots
Syst., Vancouver, BC, Canada, 2017, pp. 4225–4232.
[6] M. Bloesch et al., “Iterated extended Kalman filter based visual-inertial
odometry using direct photometric feedback,” Int. J. Robot. Res., vol. 36,
no. 10, pp. 1053–1072, 2017.
[7] J. Aloimonos et al., “Active vision,” Int. J. Comput. Vis., vol. 1, no. 4,
pp. 333–356, 1988.
[8] J. Bohg et al., “Interactive perception: Leveraging action in perception and
perception in action,” IEEE Trans. Robot., vol. 33, no. 6, pp. 1273–1291,
Dec. 2017.
[9] R. Bajcsy et al., “Revisiting active perception,” Auton. Robots, vol. 42,
pp. 177–196, 2018.
[10] T. S. Collett, “Insect vision: Controlling actions through optic flow,” Cur-
rent Biol., vol. 12, no. 18, pp. R615–R617, 2002.
[11] B. Mantel, T. A. Stoffregen, A. Campbell, and B. G. Bardy, “Exploratory
movement generates higher-order information that is sufficient for accu-
rate perception of scaled egocentric distance,” PloS One, vol. 10, no. 4,
2015, Art. no. e0120025.
[12] K. Kral and M. Poteser, “Motion parallax as a source of distance informa-
tion in locusts and mantids,” J. Insect Behavior, vol. 10, no. 1, pp. 145–163,
1997.
[13] GeoSLAM. 2018. [Online]. Available: https://geoslam.com/
[14] F. Endres, J. Hess, J. Strum, D. Cremers, and W. Burgard, “3-D mapping
with an RGB-D camera,” IEEE Trans. Robot., vol. 30, no. 1, pp. 177–187,
Feb. 2014.
[15] Parrot SLAMDunk. 2018. [Online]. Available: http://developer.parrot.com/docs/slamdunk/
[16] J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE
Trans. Pattern Anal. Mach. Intell., vol. 40, no. 3, pp. 611–625, Mar. 2017.
[17] N. Smolyanskiy et al., “Toward low-flying autonomous MAV trail navi-
gation using deep neural networks for environmental awareness,” in Proc.
IEEE/RSJ Int. Conf. Intell. Robots Syst., Vancouver, BC, Canada, 2017,
pp. 4241–4247.
[18] P. D’Alberto et al., “Generating FPGA-accelerated DFT libraries,”
in Proc.15th Annu. IEEE Symp. Field-Programmable Custom Comput.
Mach., 2007, pp. 173–184.
[19] E. Nurvitadhi et al., “Can FPGAs beat GPUs in accelerating next-
generation deep neural networks?” in Proc. ACM/SIGDA Int. Symp. Field-
Programmable Gate Arrays, 2017, pp. 5–14.
[20] S. Han et al., “ESE: Efficient speech recognition engine with sparse
LSTM on FPGA,” in Proc. ACM/SIGDA Int. Symp. Field-Programmable
Gate Arrays, 2017, pp. 75–84.
[21] G. Loianno, C. Brunner, G. McGrath, and V. Kumar, “Estimation, control,
and planning for aggressive flight with a small quadrotor with a single
camera and IMU,” IEEE Robot. Autom. Lett., vol. 2, no. 2, pp. 404–411,
Apr. 2017.
[22] D. Falanga et al., “Aggressive quadrotor flight through narrow gaps with
onboard sensing and computing using active vision,” in Proc. IEEE Int.
Conf. Robot. Autom., 2017, pp. 5774–5781.
[23] N. Franceschini, J. M. Pichon, and C. Blanes, “From insect vision to
robot vision,” Philosoph. Trans. Roy. Soc. Lond. B, vol. 337, no. 1281,
pp. 283–294, 1992.
[24] M. V. Srinivasan et al., “Robot navigation inspired by principles of insect
vision,” Robot. Auton. Syst., vol. 26, no. 2/3, pp. 203–216, 1999.
[25] J. R. Serres and F. Ruffier, “Optic flow-based collision-free strategies:
From insects to robots,” Arthropod Struct. Develop., vol. 46, no. 5,
pp. 703–717, 2017.
[26] K. Y. W. Scheper et al., “Behavior trees for evolutionary robotics,” Artif.
Life, vol. 22, no. 1, pp. 23–48, 2016.
[27] Longuet-Higgins et al., “The interpretation of a moving retinal image,”
in Proc. Roy. Soc. Lond. B, 1980, vol. 208, pp. 385–397.
[28] K. Sreenath, N. Micheal, and V. Kumar, “Trajectory generation and control
of a quadrotor with a cable-suspended load-a differentially-flat hybrid
system,” in Proc. IEEE Int. Conf. Robot. Autom., 2013, pp. 4888–4895.
[29] B. D. Lucas and T. Kanade, “An iterative image registration technique
with an application to stereo vision,” in Proc. 7th Int. Joint Conf. Artif.
Intell., 1981, pp. 674–679.
[30] B. K. P. Horn and B. G. Schunck, “Determining optical flow,” Artif. Intell.,
vol. 17, no. 1-3, pp. 185–203, 1981.
[31] E. Ilg et al., “Flownet 2.0: Evolution of optical flow estimation with deep
networks,” in Proc IEEE Conf. Comput. Vis. Pattern Recognit., 2017,
vol. 2, pp. 2462–2470.
[32] C. Tomasi and T. Kanade, “Detection and tracking of point features,”
Tech. Rep. CMU-CS-91-132, Carnegie Mellon University, Apr. 1991.
[33] O. Hall-Holt et al., “Finding large sticks and potatoes in poly-
gons,” in Proc. 17th Annu. ACM-SIAM Symp. Discr. Algorithm, 2006,
pp. 474–483.
[34] D. Mellinger and V. Kumar, “Minimum snap trajectory generation and
control for quadrotors,” in Proc. IEEE Int. Conf. Robot. Autom., 2011,
pp. 2520–2525.
[35] C. Godard et al., “Unsupervised monocular depth estimation with left-
right consistency,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit.,
2017, pp. 6602–6611.
[36] A. Ranjan and M. J. Black, “Optical flow estimation using a spatial pyra-
mid network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017,
pp. 2720–2729.
[37] T. Kroeger et al., “Fast optical flow using dense inverse search,” in
Proc. Eur. Conf. Comput. Vis., 2016, pp. 471–488.
[38] J. W. Bian et al., “GMS: Grid-based motion statistics for fast, ultra-
robust feature correspondence,” in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit., 2017, pp. 2828–2837.
[39] E. Rosten, R. Porter, and T. Drummond, “Faster and better: A machine
learning approach to corner detection,” IEEE Trans. Pattern Anal. Mach.
Intell., vol. 32, no. 1, pp. 105–119, Jan. 2010.
[40] M. Björkman et al., “Detecting, segmenting and tracking unknown objects
using multi-label MRF inference,” Comput. Vis. Image Understanding,
vol. 118, pp. 111–127, 2014.