Interpreting the Structure of Single
Images by Learning from Examples
Osian Haines
A dissertation submitted to the University of Bristol in accordance with the
requirements for the degree of Doctor of Philosophy in the Faculty of Engineering,
Visual Information Laboratory.
October 2013
52000 words
Abstract
One of the central problems in computer vision is the interpretation of the
content of a single image. A particularly interesting example of this is the
extraction of the underlying 3D structure apparent in an image, which is
especially challenging due to the ambiguity introduced by having no depth
information. Nevertheless, knowledge of the regular and predictable nature of
the 3D world imposes constraints upon images, which can be used to recover
basic structural information.
Our work is inspired by the human visual system, which appears to have
little difficulty in interpreting complex scenes from only a single viewpoint.
Humans are thought to rely heavily on learned prior knowledge for this. As
such we take a machine learning approach, to learn the relationship between
appearance and scene structure from training examples.
This thesis investigates this challenging area by focusing on the task of plane
detection, which is important since planes are a ubiquitous feature of human-
made environments. We develop a new plane detection method, which works
by learning from labelled training data, and can find planes and estimate
their orientation. This is done from a single image, without relying on explicit
geometric information, nor requiring depth.
This is achieved by first introducing a method to identify whether an individ-
ual image region is planar or not, and if so to estimate its orientation with
respect to the camera. This is done by describing the image region using
basic feature descriptors, and classifying against training data. This forms
the core of our plane detector, since by applying it repeatedly to overlapping
image regions we can estimate plane likelihood across the image, which is
used to segment it into individual planar and non-planar regions. We evalu-
ate both these algorithms against known ground truth, giving good results,
and compare to prior work.
We also demonstrate an application of this plane detection algorithm, show-
ing how it is useful for visual odometry (localisation of a camera in an un-
known environment). This is done by enhancing a planar visual odometry
system to detect planes from one frame, thus being able to quickly initialise
planes in appropriate locations, avoiding a search over the whole image. This
enables rapid extraction of structured maps while exploring, and may increase
accuracy over the baseline system.
Declaration
I declare that the work in this dissertation was carried out in accordance with the Reg-
ulations of the University of Bristol. The work is original, except where indicated by
special reference in the text, and no part of the dissertation has been submitted for any
other academic award.
Any views expressed in the dissertation are those of the author and in no way represent
those of the University of Bristol.
The dissertation has not been presented to any other University for examination either
in the United Kingdom or overseas.
SIGNED:
DATE:
Acknowledgements
I would like to begin by thanking my supervisor Andrew Calway, for all his guidance and
sage advice over the years, and making sure I got through the PhD. I would also like to
thank Neill Campbell and Nishan Canagarajah, for very useful and essential comments,
helping to keep me on the right track. My special thanks go to José Martínez-Carranza
and Sion Hannuna, who have given me so much help and support throughout the PhD – I
would not have been able to do this without them. Thanks also to my former supervisors
at Cardiff, Dave Marshall and Paul Rosin, for setting me on the research path in the
first place.
I owe a large amount of gratitude to Mam, Dad and Nia, for being supportive throughout
the PhD, always being there when needed, for their encouragement and understanding,
and not minding all the missed birthdays and events.
I am enormously grateful to my fantastic team of proofreaders, whose insightful com-
ments and invaluable suggestions made this thesis into what it is. They are, in no particu-
lar order: David Hanwell, Austin Gregg-Smith, José Martínez-Carranza, Oliver Moolan-
Feroze, Toby Perrett, Rob Frampton, John McGonigle, Sion Hannuna, Nia Haines, and
Jack Greenhalgh.
I would like to thank all of my friends in the lab, and in Bristol generally, who have
made the past five years so enjoyable, and without whose friendship and support the
PhD would have been a very different experience. This includes, but is not limited
to: Elena, Andrew, Dima, Adeline, Lizzy, Phil, Tom, John, Anthony, Alex, Tim, Kat,
Louise, Teesid, and of course all of the proofreaders already mentioned above.
Thanks also to the British Machine Vision Association, whose helping hand has shaped the
PhD in various ways, including the excellent computer vision summer school, experiences
at the BMVC conferences, and essential financial help for attending conferences toward
the end of the PhD.
Finally, thanks to Cathryn, for everything.
Publications
The work described in this thesis has been presented in the following publications:
1. Visual mapping using learned structural priors (Haines, Martínez-Carranza and
Calway, International Conference on Robotics and Automation 2013) [59]
2. Detecting planes and estimating their orientation from a single image (Haines and
Calway, British Machine Vision Conference 2012) [57]
3. Estimating planar structure in single images by learning from examples (Haines
and Calway, International Conference on Pattern Recognition Applications and
Methods 2012) [58]
4. Recognising planes in a single image (Haines and Calway, submitted to IEEE
Transactions on Pattern Analysis and Machine Intelligence)
For Cathryn
Contents

List of Figures
List of Tables

1 Introduction
  1.1 Perception of Single Images
  1.2 Motivation and Applications
  1.3 Human Vision
  1.4 On the Psychology of Perception
      1.4.1 Hypotheses and Illusions
      1.4.2 Human Vision Through Learning
  1.5 Machine Learning for Image Interpretation
  1.6 Directions
  1.7 Plane Detection
  1.8 Thesis Overview
  1.9 Contributions

2 Background
  2.1 Vanishing Points
  2.2 Shape from Texture
  2.3 Learning from Images
      2.3.1 Learning Depth Maps
      2.3.2 Geometric Classification
  2.4 Summary

3 Plane Recognition
  3.1 Overview
  3.2 Training Data
      3.2.1 Data Collection
      3.2.2 Reflection and Warping
  3.3 Image Representation
      3.3.1 Salient Points
      3.3.2 Features
      3.3.3 Bag of Words
      3.3.4 Topics
      3.3.5 Spatiograms
  3.4 Classification
      3.4.1 Relevance Vector Machines
  3.5 Summary

4 Plane Recognition Experiments
  4.1 Investigation of Parameters and Settings
      4.1.1 Vocabulary
      4.1.2 Saliency and Scale
      4.1.3 Feature Representation
      4.1.4 Kernels
      4.1.5 Synthetic Data
      4.1.6 Spatiogram Analysis
  4.2 Overall Evaluation
  4.3 Independent Results and Examples
  4.4 Comparison to Nearest Neighbour Classification
      4.4.1 Examples
      4.4.2 Random Comparison
  4.5 Summary of Findings
      4.5.1 Future Work
      4.5.2 Limitations

5 Plane Detection
  5.1 Introduction
      5.1.1 Objective
      5.1.2 Discussion of Alternatives
  5.2 Overview of the Method
  5.3 Image Representation
  5.4 Region Sweeping
  5.5 Ground Truth
  5.6 Training Data
      5.6.1 Region Sweeping for Training Data
  5.7 Local Plane Estimate
  5.8 Segmentation
      5.8.1 Segmentation Overview
      5.8.2 Graph Segmentation
      5.8.3 Markov Random Field Overview
      5.8.4 Plane/Non-plane Segmentation
      5.8.5 Orientation Segmentation
      5.8.6 Region Shape Verification
  5.9 Re-classification
      5.9.1 Training Data for Final Regions
  5.10 Summary

6 Plane Detection Experiments
  6.1 Experimental Setup
      6.1.1 Evaluation Measures
  6.2 Discussion of Parameters
      6.2.1 Region Size
      6.2.2 Kernel Bandwidth
  6.3 Evaluation on Independent Data
      6.3.1 Results and Examples
      6.3.2 Discussion of Failures
  6.4 Comparative Evaluation
      6.4.1 Description of HSL
      6.4.2 Repurposing for Plane Detection
      6.4.3 Results
      6.4.4 Discussion
  6.5 Conclusion
      6.5.1 Saliency
      6.5.2 Translation Invariance
      6.5.3 Future Work

7 Application to Visual Odometry
  7.1 Introduction
      7.1.1 Related Work
  7.2 Overview
  7.3 Visual Odometry System
      7.3.1 Unified Parameterisation
      7.3.2 Keyframes
      7.3.3 Undelayed Initialisation
      7.3.4 Robust Estimation
      7.3.5 Parameters
      7.3.6 Characteristics and Behaviour of IDPP
  7.4 Plane Detection for Visual Odometry
      7.4.1 Structural Priors
      7.4.2 Plane Initialisation
      7.4.3 Guided Plane Growing
      7.4.4 Time and Threading
      7.4.5 Persistent Plane Map
  7.5 Results
  7.6 Conclusion
      7.6.1 Fast Map Building
  7.7 Future Work
      7.7.1 Comparison to Point-Based Mapping
      7.7.2 Learning from Planes

8 Conclusion
  8.1 Contributions
  8.2 Discussion
  8.3 Future Directions
      8.3.1 Depth Estimation
      8.3.2 Boundaries
      8.3.3 Enhanced Visual Mapping
      8.3.4 Structure Mapping
  8.4 Final Summary

References
List of Figures

1.1 Projection from 3D to 2D — depth information is lost
1.2 Some configurations of 3D shapes are more likely than others
1.3 Examples of images
1.4 Images whose interpretation is harder to explain
1.5 The hollow face illusion
1.6 Example outputs of plane recognition
1.7 Examples of plane detection
2.1 Examples from Košecká and Zhang
2.2 Shape from texture result due to Gårding
2.3 Some objects are more likely at certain distances
2.4 Illustration of Saxena et al.'s method
2.5 Typical outputs from Saxena et al.
2.6 Illustration of the multiple-segmentation of Hoiem et al.
2.7 Example outputs of Hoiem et al.
3.1 How the ground truth orientation is manually specified
3.2 Examples of training data
3.3 Examples of warped training data
3.4 The quadrant-based gradient histogram
3.5 Words and topics in an image
3.6 Topic visualisation
3.7 Topic visualisation
3.8 Topic distributions in 2D
4.1 Performance as the size of the vocabulary is changed
4.2 Effect of different patch sizes for saliency detection
4.3 Comparison of different kernel functions
4.4 Adding synthetically generated training data
4.5 Region shape itself can suggest orientation
4.6 Distribution of orientation errors for cross-validation
4.7 Distribution of errors for testing on independent data
4.8 Example results on an independent dataset
4.9 Examples of correct detection of non-planes
4.10 Examples where the algorithm fails
4.11 Comparison of RVM and KNN
4.12 Example results with K-nearest neighbours
4.13 Example results with K-nearest neighbours
4.14 Example of failure cases with K-nearest neighbours
4.15 Comparison of KNN and random classification
5.1 Problems with appearance-based image segmentation
5.2 Examples of the four main steps of plane detection
5.3 Illustration of region sweeping
5.4 Examples of ground truth images for plane detection
5.5 Obtaining the local plane estimate from region sweeping
5.6 Thresholding on classification confidence
5.7 Examples of pointwise local plane estimates
5.8 Segmentation: planes and non-planes
5.9 Illustration of the kernel density estimate
5.10 Visualisation of the kernel density estimate of normals
5.11 Plane segmentation using estimated orientations
5.12 Colour coding for different orientations
6.1 Results when changing region size for sweeping
6.2 Changing the size of sweep regions for training data
6.3 Results for varying mean shift bandwidth
6.4 Distribution of orientation errors for evaluation
6.5 Hand-picked examples of plane detection on our independent test set
6.6 Some of the best examples of plane detection on our independent test set
6.7 Some typical examples of plane detection on our independent test set
6.8 Some of the worst examples of plane detection on our independent test set
6.9 Example of poor performance of plane detection
6.10 Hoiem et al.'s algorithm when used for plane detection
6.11 Comparison of our method to Hoiem et al.
6.12 Increased density local plane estimate
6.13 Colour coding for different orientations
6.14 Illustrating the re-location of a plane across the image
6.15 Using translation invariant versus absolute position spatiograms
7.1 Examples of plane detection masks
7.2 Comparing plane initialisation with IDPP and our method
7.3 Views of the 3D map for the Berkeley Square sequence
7.4 Views of the 3D map for the Denmark Street sequence
7.5 Examples of visual odometry with plane detection
7.6 Examples of plane detection
7.7 Trajectories overlaid on a map
7.8 Trajectories overlaid on a map
7.9 Frame-rate of our method, compared to the original
List of Tables

4.1 Comparison between gradient and colour features
4.2 Kernel functions used for the RVM
4.3 Evaluating spatiograms with circular regions
6.1 Comparison to Hoiem et al.
6.2 Comparison of absolute and zero-mean spatiograms
7.1 Comparison between IDPP and PDVO
CHAPTER 1
Introduction
A key problem in computer vision is the processing, perception and understanding of
individual images. This is an area which includes heavily researched tasks such as object
recognition, image segmentation, face detection and text recognition, but can also involve
trying to interpret the underlying structure of the scene depicted by an image. This is a
particularly interesting and challenging problem, and includes tasks such as identifying
the 3D relationships of objects, gauging depth without parallax, segmenting an image
into structural regions, and even creating 3D models of a scene. However, in part due to
the difficulty and ill-posed nature of such tasks, it is an area which – compared to object
recognition, for example – has not been so widely studied.
One reason why perceiving structure from one image is difficult is that unlike many
tasks in image processing, it must deal with the fact that depth is ambiguous. Short of
using laser scanners, structured light or time-of-flight sensors, an individual image will
not record absolute depth; and with no parallax information (larger apparent motion
in the image for closer objects, in a moving camera or an image pair), it is impossible
to distinguish even relative depths. Furthermore, when relying on the ambiguous and
potentially low resolution pixel information, it remains difficult even to tell which regions
belong to continuous surfaces – an important problem in image segmentation – making
extraction of structure or surface information challenging.
Nevertheless, recent work has shown that substantial progress can be made in perceiving
3D structure by exploiting various features of single images: they may be ambiguous, but
there is still sufficient information to begin perceiving the structures represented. This
involves making use of cues such as vanishing points and rectilinear structure, shading
and texture patterns, or relating appearance directly to structure.
Motivated by the initial success of such techniques, and driven by the potential benefits
that single-image perception will bring, this thesis focuses on investigating new methods
of interpreting the 3D structure of a single image. We believe this is a very interesting
task theoretically, since despite the considerable difficulties involved, some kinds of single
image structure perception do indeed seem to be possible. Recent developments in
this area also highlight that it is of great practical interest, with applications in 3D
reconstruction [66, 113], object recognition [65], wide-baseline matching [93], stereopsis
[111] and robot navigation [73, 92], amongst others.
Furthermore, this is an interesting topic because of its relationship to biological vision
systems, which as we discuss below seem to have little difficulty in interpreting complex
structure from single viewpoints. This is important, since it shows that a sufficiently
advanced algorithm can make good use of the very rich information available in an image;
and suggests that reliable and detailed perception from even a single image must be in
principle possible. The means by which biological systems are thought to do this even
hint at possible ways of solving the problem, and motivate striving for a more general
solution than existing methods. In particular, humans’ ability to take very limited
information, and by relating it to past visual experiences generate a seemingly complete
model of the real world, is something we believe we can learn much from, as we discuss
in depth below.
1.1 Perception of Single Images
In general, extracting the 3D structure from a 2D image is a very difficult problem,
because there is insufficient information recorded in an image. There is no way of re-
covering depth information directly from the image pixels, due to the nature of image
formation. Assuming a pin-hole camera model, all points along one ray from the cam-
era centre, toward some location in space, will be projected to the same image location
(Figure 1.1), which means that from the image there is no way to work backwards to
find where on that line the original point was.
Figure 1.1: This illustrates the ambiguity of projecting 3D points to a 2D
image. The circle in the image plane shows the projection of the red sphere, but
any one of the spheres along the ray through the camera centre will project to
the same location. We have shown the image plane in front of the camera centre
for visual clarity.
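To make the ambiguity concrete, the following is a minimal numerical sketch of the pin-hole model just described; the intrinsic matrix K is a made-up example, not a calibration used anywhere in this thesis. Every point along the ray through the camera centre projects to the same pixel, because perspective division discards the depth.

```python
import numpy as np

# Made-up pin-hole intrinsics (focal lengths and principal point); illustrative only.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

def project(X):
    """Project a 3D point in camera coordinates to pixel coordinates."""
    x = K @ X              # homogeneous image coordinates
    return x[:2] / x[2]    # perspective division: depth is lost here

X = np.array([1.0, 0.5, 4.0])        # a point four units in front of the camera
for scale in (0.5, 1.0, 3.0):        # any point along the same viewing ray...
    print(project(scale * X))        # ...lands on the same pixel (445.0, 302.5)
```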
Because of this there could potentially be an infinite number of different 3D scenes which
lead to the same 2D image. For example, a configuration of irregular quadrilaterals,
appropriately placed and shaded, may falsely give the appearance of a street scene or a
building façade, when viewed from a particular vantage point. At the extreme, one may
even be looking at a picture of a picture, and have no way of knowing that all depths
are in fact equal (this is resolved as soon as multi-view information becomes available).
However, while this argument seems to imply that any extraction of structure with-
out 3D information is not possible, it is evident that although any of a number of 3D
configurations are technically valid, some are much more likely than others. While we
cannot say for certain that we are looking at a building, as opposed to a contrived
collection of shapes which happen to give that appearance after projection, the former
interpretation is much more plausible. We illustrate this in Figure 1.2, where out of two
possible 3D configurations for a 2D image, one is much more realistic. Pictures, most
of the time, depict actual physical objects which obey physical laws, and tend to be
arranged in characteristic ways (especially if human-made).
This observation – that the real world is quite predictable and structured – is what
makes any kind of structure perception from a single image possible, by finding the most
plausible configuration of the world out of all possible alternatives [133], based on what
we can assume about the world in general. Thus the notion of prior knowledge, either
explicitly encoded or learned beforehand, becomes essential for making sense of otherwise
confusing image data.
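One way to make this notion of the 'most plausible configuration' concrete (our own Bayesian shorthand rather than a formula stated in this thesis) is as maximum a posteriori inference over candidate scene interpretations S given an image I:
\[
\hat{S} \;=\; \arg\max_{S} \, P(S \mid I) \;=\; \arg\max_{S} \, P(I \mid S)\, P(S),
\]
where the likelihood P(I | S) measures how well an interpretation explains the observed image, and the prior P(S) encodes precisely this learned or assumed knowledge that some configurations of the world are far more plausible than others.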
Figure 1.2: From even this very simple cartoon image of a house (a), it is
possible to recognise what it is, and from this imagine a likely 3D configuration
(b), despite the fact that, from a single image, the ambiguity of depth means
there are an infinite number of possible configurations, such as (c), which project
to an identical image. Such contrived collections of shapes are arguably much
less likely in the real world.
The different ways in which generic knowledge about the world is used leads to the various
computer vision approaches to detecting structure in single images. For example, parallel
world lines are imaged as lines which converge at a point in the image, and can be used
to recover the orientation of a plane; and the predictable deformation of texture due
to projection is enough to recover some of the viewing information. We go into more
detail about these possibilities in the following chapter, but for now it suffices to say
that, despite the difficulty of the problem, a number of attempts have been made toward
addressing it, with considerable success.
1.2 Motivation and Applications
The task of extracting structure from single images is interesting for several reasons.
As stated above, it is a difficult and ill-posed problem, since the information in one
image cannot resolve the issue of depths; and yet by exploiting low-level cues or learning
about characteristic structures, it is possible to recover some of this lost information.
It is also an interesting task to attempt since it has some biological relevance. Because it is something humans accomplish with little difficulty, it suggests ways of addressing the problem, and gives us a baseline against which to ultimately compare. An algorithm
attempting to emulate the psychology of vision may also shed some light on unknown
details of how this is achieved, if it is sufficiently physiologically accurate, although this
is not a route we intend to investigate.
From a practical point of view, the potential of being able to interpret single images would
have a number of interesting applications. For example, being able to perceive structure
in a single image would allow quick, approximate 3D reconstructions to be created, using
only one image as input [4, 66]. For some tasks, no understanding more sophisticated than knowledge of the structures themselves would be necessary — that is, the geometry alone is sufficient. For example, reconstructing rough 3D models when limited data are
available (historical images, for example, or more prosaically to visualise holiday photos),
or to extend the range and coverage of multi-view reconstructions [113].
Alternatively, a deeper understanding of the visual elements would make possible a va-
riety of interesting applications. For example, perception of the underlying structural
elements or their relationships would be useful in reconstructing 3D models where in-
sufficient data are available to see the intersections of all surfaces or retrieve the depth
of all pixels [113]. It would also be useful for providing context for other tasks, such as
object recognition, acting as a prior on likely object location in the image [65].
Knowledge of structure would also be very useful to tasks such as mapping and robot
navigation. Usually, when exploring a new and unknown environment, all that can be
sensed is the location of individual points from different positions, from which a 3D
point cloud can be derived. If however there was a way of gleaning knowledge of the
structure apparent in the image, this could be used to more quickly create a map of
the environment, and build a richer representation. Indeed, knowledge of higher level
structure (obtained by means other than single-image perception) has been very useful in
simplifying map representations and reducing the computational burden of maintaining the
map [47, 88]. We speculate that being able to more quickly derive higher-level structures
would bring further benefits, in terms of scene representation and faster initialisation of
features. This is something we come back to in Chapter 7.
1.3 Human Vision
As discussed above, interpreting general structure from single images is a significant
challenge, and one which is currently far from being solved. On the other hand, we
argue that this must in principle be possible, due to the obvious success of human (and
other animal) vision systems. To a human observer, it appears immediately obvious when
viewing a scene (and crucially, even an image of the scene, void of stereo or parallax
cues) what it is, with an interpretation of the underlying structure good enough to make
Figure 1.3: Examples of images which can be interpreted without difficulty by
a human, despite the complex shapes and clutter.
judgements about relative depths and spatial relationships [51]. Despite the complexity
of the task, this even appears subjectively to happen so fast as to be instantaneous [124],
and in reality cannot take much more than a fraction of a second.
The human vision system is remarkably adept at interpreting a wide range of types of
scene, and this does not appear to depend on calculation from any particular low-level
features, such as lines or gradients. While these might seem to be useful cues, humans
are quite capable of perceiving more complex scenes where such features are absent or
unreliable. Indeed, it is difficult to articulate precisely why one sees a scene as one does,
other than simply that it looks like what it is, or looks like something similar seen before.
This suggests that an important part of human vision may be the use of learned prior
experience to interpret new scenes, which is why complex scenes are so ‘obvious’ as to
what they contain.
For example, consider the image of Figure 1.3a. This is clearly a street scene, comprising
a ground plane with two opposing, parallel walls. The geometric structure is quite ap-
parent, in that there are a number of parallel lines (pointing toward a common vanishing
point near the centre of the image). It is plausible that this structure is what allows
the shape to be so easily perceived, and indeed this has been the foundation of many
geometry-based approaches to single-image perception [73, 93]. However, consider Fig-
ure 1.3b, which shows a similar configuration of streets and walls, and while it remains
obvious to a human what this is, there are considerably fewer converging lines, and the
walls show a more irregular appearance.
Figure 1.4: These images further highlight the powers of human perception.
The grassy ground scattered with leaves can easily be seen as sloping away from
the viewer; and the reflections in a river are seen to be of trees, despite the lack
of tree-like colour or texture.
A more extreme example is shown in Figure 1.3c, where the main structures are obscured
by balconies and other obtrusions, and people occlude the ground plane. This would be
difficult for any algorithm attempting to recover the scene structure based on explicit
geometric constructs, or for general image segmentation — and yet the human brain still
sees it with ease. This example in particular is suggestive that perception is a top-down
process of interpretation, rather than building up from low-level cues, and depends upon
a wealth of visual experience for it to make sense.
Finally, we demonstrate how humans can perceive structural content in images with
limited or distorted information. In Figure 1.4a, a grassy plane is shown, strewn with
fallen leaves. Despite the lack of any other objects for context or scaling, nor a uniform
size of leaves (and where cues from foreshortening of texture are quite weak), it is still
possible to see the relative orientation of the ground. Another interesting example where
the information available in the image itself is ambiguous is in Figure 1.4b, taken from
a picture of a riverside. One can see without much difficulty that this depicts trees
reflected in water. However, there is very little in the reflections corresponding to the
features generally indicative of trees. The branching structure is completely invisible,
the overall shape is not particularly tree-like, and even the colour is not a reliable cue.
As before, this gives the impression that these objects and structures are perceptible by
virtue of prior experience with similar scenes, and knowing that trees reflected in water
tend to have this kind of distorted shape. The alternative, of recovering shapes from the
image, accounting for the distortion produced by the uneven reflections, and matching
them to some kind of tree prototype, seems rather unlikely (similar effects are discussed
in [51]).
Of course, despite these abilities, a human cannot reliably estimate depths, nor
report the precise orientation of surfaces in a scene. While they may have the impression
that they can visualise a full model of the scene, this is rarely the case, and much of the
actual details of structures remain hidden. Humans tend to be good at perceiving the
general layout of a scene and its content, without being able to describe its constituent
parts in detail (the number of branches on a tree or the relative distances of separated
objects, for example). Then again, the fact that such details are not even necessary
for a human to understand the gross layout is very interesting, hinting that the ability
to describe a scene in such detail is a step too far. This is an important point for
an algorithm attempting to follow the success of human vision since it suggests that
attempting to fully recover the scene structure or depth may not be necessary for a
number of tasks, and a limited perception ability can be sufficient.
These insights into the powers of human vision are what inspire us to take on the
challenge of using computer vision techniques to extract structure from single images,
and suggest that a learning-based approach is a good way to address this. In the
next section we take a deeper look at the details of human perception, focusing on how
learning and prior knowledge are thought to play a crucial role.
1.4 On the Psychology of Perception
Having described how the impressive feats of human perception inspire us to build a
learning-based system, we now consider how humans are believed to achieve this.
The mechanics of human vision – the optics of the eye, the transmission of visual stimuli
to the brain, and so on – are of little interest here, being less germane to our discussion
than the interpretation of these signals once they arrive. In this section we give a brief
overview of current theories of human perception, focusing on how prior knowledge forms
a key part of the way humans so easily perceive the world.
An important figure in the psychology of human perception is James J. Gibson, whose
theory of how humans perceive pictures [50] focused on the idea that light rays emanating
from either a scene or a picture convey information about its contents, the interpretation
of which allows one to reconstruct what is being viewed. As such, vision can be considered
a process of picking up information about the world, and actively interpreting it, rather
than passive observation. This contrasted strongly with previous theories, which claimed
for example that the light rays emanating from a picture, if identical to those from a
real object, would give rise to the same perception.
One aspect of his theory is that what humans internally perceive are perceptual and
temporal invariants [49], encompassing all views of an object, not just the face currently
visible. A consequence is that in order to perceive an object as a whole there must be
some model in the mind of what it is likely to look like, since from any particular view,
the information about an object is incomplete, and must be supplemented by already
acquired internal knowledge to make sense. Thus, vision is a process of applying stored
knowledge about the world, in order to make sense of partial information. This certainly
implies some kind of learning is necessary for the process, even in cases of totally novel
objects.
Similarly Ernst Gombrich, a contemporary of Gibson, likened an image to a trace left
behind by some object or event, that must be interpreted to recover its subject [51]. He
described this as requiring a “well stocked mind”, again clearly implying that learning
is important for perception. It follows that no picture can be properly interpreted on its
own, as being a collection of related 3D surfaces, without prior knowledge of how such
structures generally relate to each other, and how they are typically depicted in images.
For example, seeing an image of a building front as anything other than a jumble of
shapes is not possible without knowing what such shapes tend to represent.
These theories suggest that even though humans have the impression they perceive a
real and detailed description of the world, it is based on incomplete and inaccurate
information (due to occlusion, distance, limited acuity and simply being unable to see
everything at once), and prior beliefs will fill in the mental gaps. This inference and
synthesis is automatic and unconscious, and as Gombrich said, it is almost impossible to
see with an ‘innocent eye’, meaning perceive only what the eyes are receiving, without
colouring with subjective interpretations.
1.4.1 Hypotheses and Illusions
These ideas were developed further by Richard Gregory, reacting to what he perceived as
Gibson’s ignorance of the role of ‘perceptual intelligence’. This was defined as knowledge
applied to perception, as opposed to ‘conceptual intelligence’ (the knowledge of specific
facts). In this paradigm, perception is analogous to hypothesis generation [53], so that
vision can be considered a process of generating perceptual hypotheses (explanations of
the world derived from incomplete data received from the eyes), and comparing them to
reality. Recognition of objects and scenes equates to the formation of working hypotheses
Figure 1.5: The hollow face illusion: even when this face mask is viewed from
behind (right) the human brain more easily interprets it as being convex. This
suggests that prior experience has a strong impact on perception (figure taken
from [54]).
about the world, when the information directly available is insufficient to give a complete
description.
This also explains why familiar, predictable objects are easier to see: the mind more
readily forms hypotheses to cover things about which it has more prior experience, and
requires more evidence to believe the validity of an unusual visual stimulus. One exam-
ple of this is the hollow face illusion [54], shown in Figure 1.5, where despite evidence
from motion and stereo cues, the hollow interior of a face mask is mistakenly perceived
as a convex shape. This happens because such a configuration agrees much better with
conventional experience — in everyday life, faces tend to be convex. Interestingly, the
illusion is much more pronounced when the face is the right way up, further supporting
the theory that it is its familiarity as a recognised object which leads to the misinterpre-
tation. This is a striking example of where the patterns and heuristics, learned in order
to quickly generate valid hypotheses, can sometimes lead one astray, when presented
with unusual data.
This is explored further by Gregory in his discussion of optical illusions [54], phenomena
which offer an interesting insight into human perception, and show that people are
easily fooled by unexpected stimuli. Not all optical illusions are of this type, for example
the Café wall illusion, which is due to neural signals becoming confused by adjacent
parallel lines [55]. But many are due to misapplication of general rules, presumably
learned from experience. One such example is the Ames window illusion [54], where the
strong expectation for a window shape to be rectangular – when it is actually a skewed
quadrilateral – makes it appear to change direction as it rotates.
1.4.2 Human Vision Through Learning
These examples and others strongly suggest that there is an important contribution to
human vision from learned prior information, and that one’s previous experience with
the world is used to make sense of it. Humans cannot see without projecting outward
what they expect to see, in order to form hypotheses about the world. This allows quick
interpretation of scenes from incomplete and ambiguous data, but also leads to formation
of incorrect beliefs when presented with unusual stimuli. In turn this suggests that a
successful approach to interpreting information from single images would also benefit
from attempting to learn from past experiences, enabling fast hypothesis generation
from incomplete data — even if in doing so we are not directly emulating the details of
human vision. Indeed, as Gregory states in the introduction to [54], taking advantage
of prior knowledge should be crucial to machine vision, and the failure to recognise this
may account for the lack of progress to date (as of 1997).
With these insights in mind, we next introduce possibilities for perceiving structure in
single images by learning from prior experience. It is important to note, however, that
we are not claiming that the above overview is a comprehensive or complete description
of human vision. Nor do we claim any further biological relevance for our algorithm.
We do not attempt to model ocular or neurological function, but resort to typical image
processing and machine learning techniques; we merely assert that the potential learning-
based paradigm of human vision is a starting point for thinking of solutions to the
problem. Given the success of human vision at arbitrarily difficult tasks, an approach
in a similar vein, based on using machine learning to learn from prior visual experience,
appears to offer great promise.
1.5 Machine Learning for Image Interpretation
Learning from past experience therefore appears to be a promising and biologically plau-
sible means of interpreting novel images. The use of machine learning is further motivated
by the difficulty of creating models manually for tasks this complex. As previous work
has shown (such as [73, 93, 107] as reviewed in Chapter 2), explicitly defining how visual
cues can be used to infer structure in general scenes is difficult. For example, shape
from texture methods make assumptions on the type of texture being viewed, and from
a statistical analysis of visible patterns attempt to directly calculate surface slant and
tilt [44, 133]. However, this means that they can work well only in a limited range of
scenarios.
Alternatively, when it is difficult to specify the details of the model, but we know what
form its inputs and outputs should take, it may be more straightforward to learn the
model. Indeed, this is a common approach to complex problems, where describing the
system fully is tedious or impossible. Almost all object recognition algorithms, for exam-
ple, are based on automatically learning what characterises certain objects [69], rather
than attempting to explicitly describe the features which can be used to identify and
distinguish them. We must proceed with caution, since this still supposes that the world
is well behaved, and we should accept that almost any algorithm (including sometimes
human vision, as discussed above) would be fooled by sufficiently complicated or con-
trived configurations of stimuli. Thus the desire to make use of heuristics and ever more
general assumptions is tempered by the need to deal with unusual situations.
We are also motivated by other recent works, which use machine learning to deduce
the 3D structure of images. This includes the work of Saxena et al. [113], who learn the
relationship between depth maps (gathered by a laser range sensor) and images, allowing
good pixel-wise depth maps to be estimated for new, previously unseen images. Another
example is Hoiem et al. [66], who use classification of image segments into geometric
classes, representing different types of structure and their orientation, to build simple
3D interpretations of images. Both these examples, which use quite different types of
scene structure, have very practical applications, suggesting that machine learning is a
realistic way to tackle the problem — and so next we investigate possible directions to
take, with regard to what kind of structure perception we intend to achieve.
1.6 Directions
There are several possible ways in which we might explore the perception of structure
in a single image. As stated above, one of the primary difficulties of single image inter-
pretation is that no depth information is available. Once depth can be observed, it is
then a fairly simple matter to create a 3D point cloud, from which other reconstruction
or segmentation algorithms can work [25, 32]. Therefore one of the most obvious ways
to approach single-image perception would be to try to recover the depth. Clearly this
cannot be done by geometric means, but as the work of Torralba and Oliva [127] and
Sudderth et al. [121] have shown, exploiting familiar objects, or configurations of scene
elements, can allow some rudimentary depth estimates to be recovered. This is related
to the discussion above, in that this relies on the knowledge that some relationships of
depths to image appearance are much more likely than others.
The most sophisticated algorithm to date for perceiving depth from a single image is by
Saxena et al. [113], where detailed depth maps can be extracted from images of general
outdoor scenes. This work has shown that very good depth estimates can be obtained in
the absence of any stereo or multi-view cues (and even used in conjunction with them for
better depth map generation). Since this work has succeeded in extracting rather good
depth maps, we leave the issue of depth detection, and consider other types of structure.
As we discussed in the context of human vision, it may not be necessary to know de-
tailed depth information in order to recover the shape, and this suggests that we could
attempt to recover the shape of general objects without needing to learn the relationship
with depth. Indeed, some progress has been made in this direction using shape from
shading and texture [41, 98, 122], in which the shape of even deformable objects can be
recovered from one image. However, this is a very difficult problem in general, due to
the complexities of shapes in the real world, and the fact that lighting and reflectance
introduce significant challenges — thus it may be difficult to extend to more general
situations.
A simplification of the above is to assume the world is composed of a range of simple
shapes – such as cubes and spheres – and attempt to recover them from images. This
is an interesting approach, since it relies on the fact that certain kinds of volumetric
primitive are likely to appear, and we can insert them into our model based on limited
data (i.e. it is not necessary to see all sides of a cuboid). This is the motivation for
early works such as Roberts’ ‘blocks world’ [109], in which primitive volumes are used to
approximate the structure of a scene; and a more recent development [56] based on first
estimating the scene layout [66] and fitting primitives using physical constraints. While
this approach limits the resulting model to only fairly simple volumetric primitives, many
human-made scenes do tend to consist of such shapes (especially at medium-sized scales),
and so a rough approximation to the scene is plausible.
A rather different approach would be to attempt to segment the image, so that segments
correspond to the continuous parts of the scene, and by assigning each a semantic label
via classification, to begin to perceive the overall structure. Following such a segmenta-
tion, we could potentially combine the information from the entire image into a coherent
whole, using random fields for example. This bears some similarity to the work of [66], in
which the segments are assigned to geometric classes, which are sufficient to recover the
overall scene layout. A related idea is to associate elements detected across the image to
create a coherent whole using a grammar-based interpretation [3, 74], in which the prior
information encoded in the grammatical rules enforces the overall structure.
The above possibilities offer potentially powerful ways to get structure from a single im-
age. For practical purposes, in order to make a start at extracting more basic structures,
we will consider a somewhat more restricted – though still very general – alternative,
based on two of the key ideas above. First, we might further simplify the volume primi-
tives concept to use a smaller class of more basic shapes, which should be easier to find
from a single image, at the same time as being able to represent a wider variety of types
of scene. The most basic primitive we could use in this way is the plane. Planar surfaces are
very common and so can compactly represent a diverse range of images. If we consider
human-made environments – arguably where the majority of computer vision algorithms
will see most use – these are often comprised of planar structures. The fact that planes
can compactly represent structure means they have seen much use in tasks such as 3D
reconstruction and robot navigation.
While shapes such as cubes and spheres may be distinctive due to their shape in an image,
the same cannot be said of planes. Planes project simply to a quadrilateral, assuming
there is no occlusion; and while quadrilateral shapes have been used to find planes [94]
this is limited to fairly simple environments with obvious line structure. It is here that the
second idea becomes relevant: that using classification of surfaces can inform us about
structure. Specifically, while the shape of planes may not be sufficiently informative, their
appearance is often distinctive, due to the way planar structures are constructed and
used. For example, building façades and walls have an easily recognisable appearance,
and this allows them to be quickly recognised by humans as being planar even if they
have no obvious outline. Moreover, the appearance of a plane is related to its orientation
with respect to the viewer. Its surface will appear different depending which way it is
facing, due to effects such as foreshortening and texture compression, which suggests we
should be able to exploit this in order to predict orientation.
For these reasons, planar structures appear to be a good place to begin in order to
extract structure from a single image. Therefore, to investigate single-image perception
in a feasible manner, we look at the potential of recovering planar structures in images
of urban scenes, by learning from their appearance and how this relates to structure.
1.7 Plane Detection
Planes are important structures in computer vision, partly because they are amongst
the simplest possible objects and are easy to represent geometrically. A 3D plane needs
only four parameters to completely specify it, usually expressed as a normal vector in
ℝ³ and a distance (plus a few more to describe spatial extent if appropriate). Because
of this, they are often easy to extract from data. Only three 3D points are required
to hypothesise a plane in a set of points [5], enabling planar structure recovery from
even reasonably sparse point clouds [46]. Alternatively, a minimum of only four point
correspondences between two images are required to create a homography [62], which
defines a planar relationship between two images, making plane detection possible from
pairs [42] or sequences [136] of images.
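As a concrete illustration of this compact parameterisation and of the three-point hypothesis mentioned above, the sketch below (our own minimal example, not code from this thesis) recovers a plane in the form n · x = d from three non-collinear points.

```python
import numpy as np

def plane_from_points(p1, p2, p3):
    """Hypothesise a plane n.x = d from three non-collinear 3D points."""
    n = np.cross(p2 - p1, p3 - p1)   # normal is orthogonal to both in-plane edges
    n = n / np.linalg.norm(n)        # unit normal: three of the four plane parameters
    d = float(np.dot(n, p1))         # any of the three points fixes the distance
    return n, d

# Example: three points lying on the plane y = 0 (e.g. a ground plane).
p1, p2, p3 = (np.array(p) for p in ([0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.0, 0.0, 2.0]))
n, d = plane_from_points(p1, p2, p3)
print(n, d)   # unit normal (0, -1, 0) up to sign, distance 0
```

Inside a RANSAC-style loop, hypotheses generated this way from random point triples can be scored against the rest of a sparse point cloud, which is essentially how planar structure is recovered in the multi-view settings cited above.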
Planes are also important because they are ubiquitous in urban environments, making
up a significant portion of both indoor and outdoor urban scenes. Their status as one
of the most basic geometric primitives makes them a defining feature of human-made
environments, and therefore an efficient and compact way of representing structure,
giving realistic and semantically meaningful models while remaining simple. As such,
they have been shown to be very useful in tasks including wide baseline matching [73],
object recognition [65], 3D mesh reconstruction [25], robot navigation [47] and augmented
reality [16].
To investigate the potential of detecting and using planes from single image information,
we have developed an algorithm capable of detecting planar structures in single images,
and estimating their orientation with respect to the camera. This uses general feature
descriptors and standard methods for image representation, combined with a classifier
and regressor to learn from a large set of training data. One of the key points about
our method is that it does not depend upon any particular geometric or textural feature
(such as vanishing points and image gradients), and so is not constrained to a particular
type of scene. Rather, it exploits the fact that appearance of an image is related to scene
structure, and that learning from relevant cues in a set of training images can be used
to interpret new images. We show how this can be used in a variety of environments,
and that it is useful for a visual odometry task.
The work can be divided into three main sections: developing the basic algorithm to
represent image regions, recognise planes and estimate orientation; using this for detect-
ing planes in an image, giving groupings of image points into planar and non-planar
regions; and an example application, showing how plane detection can provide useful
prior information for a plane-based visual mapping system.
1.8 Thesis Overview
After this introductory chapter concludes by summarising our main contributions, Chap-
ter 2 presents some background to this area and reviews related work, including details of
geometric, texture, and learning-based approaches to single image structure perception.
In Chapter 3 we introduce our method for plane recognition , which can identify planes
from non-planes and estimate their orientation, example results of which are shown in
Figure 1.6. This chapter describes how regions of an image are represented, using a
collection of feature descriptors and a visual bag of words model, enhanced with spatial
information. We gather and annotate a large training set of examples, and use this to
train a classifier, so that new image regions may be recognised as planar or not; then,
for those which are planar, their orientation can be estimated by training a regression
algorithm. The important point about this chapter is that it deals only with known,
pre-specified image regions. This algorithm cannot find planes in an image, but can
identify them assuming a region of interest has been defined.
Figure 1.6: Examples of our plane recognition algorithm, which for manually
delineated regions such as these, can identify whether they are planar or not,
and estimate their orientation with respect to the camera.
Chapter 4 thoroughly evaluates the plane recognition algorithm, investigating the effects
of feature representation, vocabulary size, use of synthetic data, and other design choices
and parameters, by running cross-validation on our training set. We also experiment with
an alternative classification technique, leading to useful insights into how and why the
algorithm works. We evaluate the algorithm against an independent test set of images,
showing it can generalise to a new environment.
The plane recognition algorithm is adapted for use as part of a full plane detection
algorithm in Chapter 5, which is able to find the planes themselves, in a previously unseen
image; and for these detected planes, estimate their orientation. Since the boundaries
between planes are unknown, the plane recognition algorithm is applied repeatedly across
the image, to find the most likely locations of planes. This allows planar regions to be
segmented from each other, as well as separating planes with different orientations. The
result is an algorithm capable of detecting multiple planes from a single image, each with
an orientation estimate, using no multi-view or depth information nor explicit geometric
features: this is the primary novel contribution of this thesis. Example results can be
seen in Figure 1.7.
Figure 1.7: Examples of our plane detection algorithm, in a variety of environ-
ments, showing how it can find planes from amongst non-planes, and estimate
their orientation.
We then evaluate this detection algorithm in Chapter 6, showing the results of exper-
iments to investigate the effect of parameters such as the size of regions used for the
recognition step, or sensitivity of plane segmentation to different orientations. Such ex-
periments are used to select the optimal parameters, before again testing our algorithm
on an independent, ground-truth-labelled test set, showing good results for plane detec-
tion in urban environments. We also compare our algorithm to similar work, showing
side-by-side comparison for plane detection. Our method compares favourably, showing
superior performance in some situations and higher accuracy overall.
Finally, Chapter 7 presents an example application of the plane detection algorithm.
Planar structures have been shown to be very useful in tasks such as simultaneous
localisation and mapping and visual odometry, for simplifying the map representation
and producing higher-level, easier to interpret scene representations [47, 89, 132]. We
show that our plane detector can be beneficial in a plane-based visual odometry task, by
giving prior information on the location and orientation of planes. This means planes may
be initialised quickly and accurately, before even the accumulation of parallax necessary
for multi-view methods. We show that this enables fast building of structured maps
of urban environments, and may improve the accuracy by reducing drift. We end with
Chapter 8, which concludes the thesis with a summary of the work and a discussion of
possible future directions.
1.9 Contributions
To summarise, these are the main contributions of this thesis:
  • We introduce a method for determining whether individual regions of an image are
    planar or not, according to their appearance (represented with a variety of basic
    descriptors), based on learning from training data.
  • We show that it is possible, for these planar regions, to estimate their 3D orientation
    with respect to the viewer, based only on their appearance in one image.
  • The plane recognition algorithm can be used as part of a plane detection algorithm,
    which can recover the location and extent of planar structure in one image.
  • This plane detection algorithm can estimate the orientation of (multiple) planes
    in an image — showing that there is an important link between appearance and
    structure, and that this is sufficient to recover the gross scene structures.
  • We show that both these methods work well in a variety of scenes, and investigate
    the effects of various parameters on the algorithms’ performance.
  • Our method is shown to perform similarly to a state-of-the-art method on our
    dataset.
  • We demonstrate an example application of this plane detection method, by applying
    it to monocular visual odometry, which as discussed above could benefit from
    being able to quickly see structure without requiring parallax.
    CHAPTER 2
    Background
    In this chapter we discuss related works on single image perception, focusing on those
    which consider the problem of detecting planar structure, or estimating surface orien-
    tations. These can be broadly divided into two main categories: firstly, those which
    directly use the geometric or textural properties of the image to infer structure. These
    may be further divided into those which calculate directly from visible geometric entities,
    and those that use a statistical approach based on image features. Secondly, there exist
    methods which use machine learning, to learn the relationship between appearance and
    scene structure.
    As we discussed in Chapter 1, even when only a single image is available it is usually
    possible to infer something about the 3D structure it represents, by considering what is
    most likely, given the information available. Various methods have been able to glean
    information about surface orientation, relative depth, curvature or material properties,
    for example, even though none of these are measurable from image pixels themselves.
    This is due to a number of cues present in images, especially those of human-made,
    regular environments, such as vanishing points, rectilinear structure, spectral properties,
    colours and textures, lines and edges, gradient and shading, and previously learned
    objects.
    The use of such cues amounts to deliberately using prior knowledge applied to the scene.
    Here prior knowledge means information that is assumed to be true in general, and can
    be used to extrapolate beyond what is directly observed. This can be contrasted with
    specific scene knowledge [2, 102], which can also help to recover structure when limited
    or incomplete observations are available. However, this is limited to working in specific
    locations.
    Over the following sections we review work on single image perception which makes
    use of a variety of prior knowledge, represented in different ways and corresponding to
    different types of assumption about the viewed scene. These are organised by the way
    they make assumptions about the image, and which properties of 3D scenes and 2D
    projections they exploit in order to achieve reconstruction or image understanding.
    2.1 Vanishing Points
    Vanishing points are defined as the points in the image where parallel lines appear to
    meet. They lie on the plane at infinity, a special construct of projective space that is
    invariant to translations of the camera, thus putting constraints on the geometry of the
    scene. Vanishing points are especially useful in urban, human-made scenes, in which
    pairs of parallel lines are ubiquitous, and this means that simple methods can often be
    used to recover 3D information. A detailed explanation of vanishing points and the plane
    at infinity can be found in chapters 3 and 6 of [62].
    A powerful demonstration of how vanishing points and vanishing lines can be used is
    by Criminisi et al. [26], who describe how the vanishing line of a plane and a vanishing
    point in a single image can be used to make measurements of relative lengths and areas.
    If just one absolute distance measurement is known, this can be used to calculate other
    distances, which is useful in measuring the height of people for forensic investigations,
    for example. As the authors state, such geometric configurations are readily available in
    structured scenes, though they do not discuss how such information could be extracted
    automatically.
    Another example of using vanishing points, where they are automatically detected, is the
work of Košecká and Zhang [73], in which the dominant planar surfaces of a scene are
    recovered, with the aim of using them to match between widely separated images. The
    method is based on the assumption that there are three primary mutually orthogonal
    directions present in the scene, i.e. this is a ‘Manhattan-like’ environment, having a
    ground plane and mutually perpendicular walls. Assuming that there are indeed three
    dominant orientations, the task is thus to find the three vanishing points of the image.
    This is done by using the intersections of detected line segments to vote for vanishing
    points in a Hough accumulator. However, due to noise and clutter (lines which do not
    correspond to any of the orientations), this is not straightforward. The solution is to use
    expectation maximisation (EM) to simultaneously estimate the location of the vanishing
    points in the image, and the probability of each line belonging to each vanishing point;
    the assignment of lines to vanishing points updates the vanishing points’ positions, which
    in turn alters the assignment, and this iterates until convergence. An example result,
    where viable line segments are coloured according to their assigned vanishing direction,
    is shown in Figure 2.1a.
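To illustrate the geometry underlying this voting step (a simplified sketch of the idea only, not a reimplementation of Košecká and Zhang's EM procedure), line segments can be converted to homogeneous line coordinates with cross products, and each pair of segments then contributes one candidate vanishing point:

```python
import numpy as np

def intersection(seg_a, seg_b):
    """Intersection of the infinite lines through two image segments, via homogeneous coordinates.

    Each segment is ((x1, y1), (x2, y2)); the line through two points is their cross
    product, and the intersection of two lines is again a cross product.
    """
    la = np.cross([*seg_a[0], 1.0], [*seg_a[1], 1.0])
    lb = np.cross([*seg_b[0], 1.0], [*seg_b[1], 1.0])
    p = np.cross(la, lb)
    return p / p[2] if abs(p[2]) > 1e-9 else p   # leave points at infinity unnormalised

# Two roughly parallel segments converge toward a vanishing point to the right.
v = intersection(((0, 100), (200, 110)), ((0, 200), (200, 180)))
print(v[:2])   # one candidate vanishing point; many such votes would fill an accumulator
```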
    To find actual rectangular surfaces from this, two pairs of lines, corresponding to two
    different vanishing points, are used to hypothesise a rectangle (a quadrilateral in the
    image). These are verified using the distribution of gradient orientations within the
    image region, which should contain two separate dominant orientations. An example
    of a set of 3D rectangles detected in one image is shown in Figure 2.1b. These can
    subsequently be used for wide-baseline matching to another image.
Figure 2.1: Illustration of the method of Košecká and Zhang [73], showing the
lines coloured by their assignment to the three orthogonal vanishing directions
(a), and a resulting set of planar rectangles found on an image (b). Images are
adapted from [73].
    This method has shown promising results, in both indoor and outdoor scenes, and the
    authors mention that it would be useful for robot navigation applications. Its main
    downside, however, is that it is limited to scenes with this kind of dominant rectangular
    structure. As such, planes which are perpendicular to the ground, but are oriented differ-
    ently from the rest of the planes, would not be easily detected, and may cause problems
    for the EM algorithm. Furthermore, the method relies on there being sufficiently many
    good lines which can be extracted and survive the clustering, which may not be the case
    when there is background clutter or spurious edges produced by textured scenes.
A related method is presented by Mičušík et al. [93], where line segments are used to
    directly infer rectangular shapes, which in turn inform the structure of the scene. It
    differs from [73] in that it avoids the high computational complexity of exhaustively
    considering all line pairs to hypothesise rectangles. Rectangle detection is treated as a
    labelling problem on the set of suitable lines, using a Markov random field. By using two
    orthogonal directions at a time, labels are assigned to each line segment to indicate the
    vanishing direction to which it points, and which part of a rectangle it is (for example
    the left or right side); from this, individual rectangles can be effortlessly extracted.
These methods highlight an important point about vanishing points and other
monocular cues: while they are very useful for interpreting a single image, they can also
    provide useful information when multiple images are present, by using single image cues
    to help in wide-baseline matching for example. Similar conclusions are drawn in work
    we review below [111], where a primarily single-image method is shown to be beneficial
    for stereo vision by exploiting complementary information.
    An alternative way of using vanishing points is to consider the effect that sets of con-
    verging parallel lines will have on the spectral properties of the image. For example,
    rather than detecting lines directly in the spatial domain, Ribeiro and Hancock [107]
    use the power spectrum of the image, where spectra which are strongly peaked describe
    a linear structure. From this, properties of the texture and the underlying surface can
    be recovered, such as its orientation. In their later work [108], spectral moments of the
    image’s Fourier transform are used to find vanishing lines via a Hough-like accumula-
    tor. Such methods are dependent on the texture being isotropic and homogeneous, such
    that any observed distortions are due to the effects of projection rather than the pat-
    terns themselves. These restrictions are quite limiting, and so only a subset of scenes –
    namely those with regular grid-like structures, such as brick walls and tiled roofs – are
    appropriate for use with this method.
    2.2 Shape from Texture
    An alternative to using vanishing points, which has received much attention in the liter-
    ature, is known as shape from texture (this is part of a loose collection of methods known
    collectively as ‘shape-from-X’, which includes recovery of 3D information from features
    such as shading, defocus and zoom [35, 75, 116]). Such an approach is appealing, since
    it makes use of quite different types of information from the rectangle-based methods
    above. In those, texture is usually an inconvenience, whereas many real environments
    have multiple textured surfaces, especially outdoors.
    Early work on shape from texture was motivated by the insights of Gibson into human
    vision [48], specifically that humans extract information about surface orientation from
    apparent gradients. However, this model was not shown to be reliable for real textures
    [133]; and due to making necessary assumptions about the homogeneity and isotropy
    of texture (conditions that, while unrealistic, allow surface orientation to be computed
    directly) methods developed based on these ideas fall short of being able to recover
    structure from real images.
    Work by Witkin [133] allows some of these assumptions to be relaxed in order to better
    portray real-world scenes. Rather than requiring the textures to be homogeneous or
    uniform, the only constraint is that the way in which textures appear non-uniform does
    not mimic perspective projection — which is in general true, though of course there
    will be exceptional cases. Witkin’s approach is based on the understanding that for one
    image there will be a potentially infinite number of plausible reconstructions due to the
    rules of projective geometry, and so the task is to find the ‘best’, or most likely, from
    amongst all possible alternatives. This goal can be expressed in a maximum-likelihood
    framework.
    The method works by representing texture as a distribution of short line segments,
    and assumes that for a general surface texture their orientations will be distributed
    uniformly. When projected, however, the orientations will change, aligning with the
    axis of tilt, so there is a direct relationship between the angular distribution of line
segments (which can be measured), and the slant and tilt of the surface. An iterative
    method is used to find the most likely surface orientation given the observed angles, in a
maximum likelihood approach. This is developed by Gårding [44], who proposes a fast
    and efficient one-step method to estimate the orientation directly from the observations,
    with comparable results. An example of applying the latter method is shown in Figure
    2.2, which illustrates the estimated orientation at the four image quadrants.
    However, shape from texture techniques such as these again face the problem that tex-
    tural cues can be ambiguous or misleading, and an incorrect slant and tilt would be
estimated for textures with persistent elongated features, for example rock fissures or fence-posts.
Figure 2.2: An example result from the shape from texture algorithm of
Gårding [44], applied to an image of an outdoor scene. Orientation is approxi-
mately recovered for the four quadrants of the image (there is no notion of plane
detection here, only slant/tilt estimation). The image was adapted from [44].
Moreover, while the accuracy of [44] was shown to be good for simulated
    textures of known orientation, no ground truth was available for evaluation on real im-
    ages. While the presented results are qualitatively good, it was not possible to accurately
    assess its applicability to real images.
    A further issue with shape from texture methods is that generally they deal only with
    estimating the orientation of given regions (as in the image above), or even of the whole
    image; the detection of appropriate planar surfaces, and their segmentation from non-
    planar regions, is not addressed, to our knowledge.
    2.3 Learning from Images
    The approaches discussed above use a well defined geometric model, relating features of
    appearance (vanishing points, texture, and so on) to 3D structure. They have proven
    to be successful in cases where the model is valid – where such structure is visible and
    the key assumptions hold – but are not applicable more generally. This is because it
    is very difficult to explicitly specify a general model to interpret 3D scenes, and so a
    natural alternative is to learn the model from known data. This leads us on to the next
    important class of methods, to which ours belongs: those which use techniques from
    machine learning to solve the problem of single-image perception.
    An interesting approach is presented in a set of papers by Torralba and Oliva, where
    general properties are extracted from images in order to describe various aspects of
    the scene [99, 100, 127, 128]. They introduce the concept of the ‘scene envelope’ [99],
    which combines various descriptions of images, such as whether they are natural or
    artificial locations, depict close or far structures, or are of indoor or outdoor scenes. By
    estimating where along each of these axes an image falls a qualitative description can
    be generated automatically (for example ‘Flat view of a man-made urban environment,
    vertically structured’ [99]). This is achieved by representing each image by its 2D Fourier
    transform, motivated by the tendency for the coarse spectral properties of the image to
    depend upon the type and configuration of the scene. An indoor scene, for example,
    will produce sharp vertical and horizontal features, which is evident when viewed in the
    frequency domain.
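This spectral signature is straightforward to compute; a short illustrative sketch (assuming only a greyscale image stored as a NumPy array, and not taken from the cited work) produces the centred log-magnitude spectrum that such methods operate on:

```python
import numpy as np

def log_magnitude_spectrum(grey):
    """Centred log-magnitude spectrum of a greyscale image (2D float array)."""
    spectrum = np.fft.fftshift(np.fft.fft2(grey))
    return np.log1p(np.abs(spectrum))

# A synthetic 'indoor-like' pattern of vertical stripes concentrates its energy
# along the horizontal frequency axis of the spectrum.
image = np.tile(np.sin(np.linspace(0, 40 * np.pi, 256)), (256, 1))
spectrum = log_magnitude_spectrum(image)
```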
    This work is further developed for scene depth estimation [127], where the overall depth
    of an image is predicted. This is based on an important observation, that while it is
possible for an object to be of any size and at any depth, there are certain characteristic
    sizes for objects, which can be taken advantage of (a concept illustrated by Figure 2.3).
    For example, buildings are much more likely to be large and distant than small and close
    (as in a toy model); conversely something that looks like a desktop is very unlikely to be
    scenery several kilometres away. Again, this relates to the key point that we desire the
    most likely interpretation. Like the methods mentioned above, this method is based on
    the frequency characteristics of the image. Image properties are represented using local
    and global wavelet energy, with which depth is estimated by learning a parametric model
    using expectation maximisation. While this is a very interesting approach to recovering
    depth, the main downside is that it gives only the overall depth of the image. There is
    no distinction between near or far objects within one image, so it cannot be used to infer
    any detailed scene structure.
    Figure 2.3: An illustration from Torralba and Oliva [127], making the point
    that some objects tend to only appear at characteristic scales, which means they
    can be used to recover approximate scene depth (image taken from [127]).
    2.3.1 Learning Depth Maps
    A more sophisticated approach by Saxena et al. [113] estimates whole-image depth maps
    for single images of general outdoor scenes. This is based on learning the relationship be-
    tween image features and depth, using training sets consisting of images with associated
    ground truth depth maps acquired using a custom built laser scanner unit. The cen-
    tral premise is that by encoding image elements using a range of descriptors at multiple
    scales, estimates of depth can be obtained.
In more detail, they first segment the image into superpixels (see Figure 2.4b), which are local
homogeneous clusters of pixels, for which a battery of features is computed (includ-
    ing local texture features and shape features). As well as using features from individual
    superpixels, they combine features from neighbouring superpixels in order to encode con-
    textual information, because the orientation or depth of individual locations is strongly
    influenced by that of their surroundings. In addition to these features, they extract
    features to represent edge information to estimate the location of occlusion boundaries.
    These features allow them to estimate both relative and absolute depth, as well as local
    orientation. The latter is useful in evaluating whether there should be a boundary
    between adjacent superpixels. This is interesting, since it implies they are exploiting
    planarity — indeed, they make the assumption that each superpixel can be considered
    locally planar. This is a reasonable assumption when they are small. They justify this
    by analogy with the meshes used in computer graphics, where complex shapes are built
    from a tessellation of triangles. They explicitly build such meshes using the depth and
    orientation estimates, an example of which is shown in Figure 2.4c.
    Figure 2.4: The depth map estimation algorithm of Saxena et al. [113] takes
    a single image (a) as input, and begins by segmenting to superpixels (b). By
    estimating the orientation and depth of each locally planar facet, they produce
    a 3D mesh representation (c). Images are taken from [113].
    Figure 2.5: This shows some typical outputs of depth maps (bottom row)
    estimated by Saxena et al. [113] from single images (top row), where yellow is
    the closest and cyan the farthest. These examples showcase the ability of the
    algorithm to recover the overall structure of the scene with good accuracy (images
    taken from Saxena et al. [113]).
    Following the success of human vision, which easily integrates information from multiple
    cues over the whole image, they attempt to relate information at one location to all other
    locations. This is done by formulating the depth estimation problem in a probabilistic
    model, using a Markov random field, with the aim of capturing connected, coplanar and
    collinear structures, and to combine all the information to get a consistent overall depth
    map. Example results of the algorithm are shown in Figure 2.5.
    This has a number of interesting applications, such as using the resulting depth map
    to create simple, partially realistic single-image reconstructions. These are sufficient to
create virtual fly-throughs, using the connected mesh of locally planar regions, which
can semi-realistically simulate viewing the scene in a photograph from new angles. Such
    reconstructions were shown to rival even those of Hoiem et al. [64] discussed below.
    Interestingly, while estimating the orientation of the planar facets is a necessary step of
    the algorithm, to relate the depth of adjacent segments, they do not actually produce a
    set of planes. Instead they focus on polygonal mesh representations, and while it might
    be possible to attempt to extract planar surfaces from such models, their results hint
    that it would not be trivial to do so.
    They also show that their depth estimates can be combined for creating multi-view re-
    constructions from images which are normally too widely separated for this to be possible
    (with too little overlap to allow reliable reconstruction using traditional structure-from-
    motion, for example). In [111] this algorithm is also shown to be beneficial for stereo
    vision, as another means of estimating depth. This is interesting since even when two
    images are available, from which depth can be calculated directly, a single-image depth
    estimation still provides valuable additional information, by exploiting complementary
    cues. A simplified version of the algorithm was even used to guide an autonomous vehicle
    over unknown terrain [92]. These last two illustrate the value of using perception from
    single images in multi-view scenarios, an idea we come back to in Chapter 7.
    2.3.2 Geometric Classification
    The work of Hoiem et al. [66] has the goal of interpreting the overall scene layout from a
    single image. This is motivated by the observation that in a great many images, almost
    all pixels correspond to either the ground, the sky, or some kind of upright surface, and
    so if these three classes can be distinguished a large portion of images can be described
    in terms of their rough geometry. This notion of ‘geometric classification’ is the core of
    their method, in which regions of the image are assigned to one of three main classes
    — namely support (ground), sky, and vertical, of which the latter is further subdivided
    into left, right, and forward facing planes, or otherwise porous or solid.
    While the method is not explicitly aimed at plane detection, it is an implicit part of
    understanding the general structure of scenes, as the image is being effectively partitioned
    into planar (ground, left, right, forward) and non-planar (sky, solid, porous) regions. It
    is important to note, however, that plane orientation is limited by quantisation into four
discrete classes: no finer resolution of surface orientation than ‘left-facing’, and so on, can be obtained.
    Classification is achieved using a large variety of features, including colour (summary
    statistics and histograms), filter bank responses to represent texture, and image location.
    Larger-scale features are also used, including line intersections, shape information (such
    as area of segments), and even vanishing point information. These cues are used in the
various stages of classification, using boosted decision trees, where the logistic regression
version of AdaBoost, with decision trees as the weak learners, is able to select the most informative
features from those available. This automatic selection from all possible features is
    one of the interesting properties of the method, in that it is not even necessary to tell
    it what to learn from — although this is at the expense of extracting many potentially
    redundant feature descriptors. The classifiers are trained on manually segmented training
    data, where segments have been labelled with their true class.
    The central problem is that in an unknown image, it is not known how to group the
    pixels, and so more complex features than simple local statistics cannot be extracted.
    Thus features such as vanishing points cannot be included until initial segments have
    been hypothesised, whose fidelity in turn depends upon such features. Their solution is
    to gradually build up support, from the level of pixels to superpixels (over-segmented
    image regions) to segments (groups of superpixels), which are combined in order to cre-
    ate a putative segmentation of the image. An initial estimate of structure (a particular
    grouping of superpixels) is used to extract features, which are then used in order to up-
    date the classifications, to create a better representation of the structure and a coherent
    labelling of all the superpixels.
The superpixels are found by using Felzenszwalb and Huttenlocher’s graph-based segmen-
    tation method [36] to separate the image into a number (usually around 500) of small,
    approximately homogeneous regions, from which local features can be extracted. These
    are the atomic elements, and are much more convenient to work with than pixels them-
    selves. Segments are formed by grouping these superpixels together, combining infor-
    mation probabilistically from multiple segmentations of different granularity (see Figure
    2.6). Since it is not feasible to try all possible segmentations of superpixels, they sample
    a smaller representative set.
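For reference, the segmentation method of [36] is available in standard libraries; a minimal sketch using scikit-image is given below (the image and parameter values are illustrative only, not those used by Hoiem et al.):

```python
import numpy as np
from skimage.data import astronaut
from skimage.segmentation import felzenszwalb

image = astronaut()                        # any RGB image as a NumPy array
# scale, sigma and min_size control superpixel granularity; smaller values of
# scale and min_size give more, smaller regions.
labels = felzenszwalb(image, scale=50, sigma=0.8, min_size=30)
print(labels.max() + 1, "superpixels")     # one integer label per pixel
```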
    Figure 2.6: For some image (a), this shows the superpixels extracted (b),
    and two segmentations at different granularities (c,d, with 15 and 80 segments
    respectively). Images adapted from [66].
    There are two main steps in evaluating whether a segment is good and should be retained.
    First, they use a pairwise-likelihood classification on pairs of adjacent superpixels, which
    after being trained on ground truth data is able to estimate the probability of them
    having the same label. This provides evidence as to whether the two should belong in
    the same eventual segment, or straddle a boundary. Next there is an estimate of segment
    homogeneity, which is computed using all the superpixels assigned to a segment, and is in
    turn used to estimate the likelihood of its label. To get the label likelihoods for individual
    superpixels, they marginalise over all sampled segments in which the superpixel lies.
    They show how these likelihoods can be combined in various ways, from a single max-
    margin estimate of labels, to more complex models involving a Markov random field or
    simulated annealing. The final result is a segmentation of the image into homogeneously
    labelled sets of superpixels, each with a classification to one of the three main labels,
    and a sub-classification into the vertical sub-classes where appropriate. Some examples
    from their results are shown in Figure 2.7.
Figure 2.7: Example outputs of Hoiem et al. [66], for given input images. The
    red, green and blue regions denote respectively vertical, ground and sky classes.
    The vertical segments are labelled with symbols, showing the orientation (arrows)
    or non-planar (circle and cross) subclass to which they are assigned. Images
    adapted from [66].
    The resulting coarse scene layout can help in a number of tasks. It can enable simple 3D
    reconstruction of a scene from one image [64], termed a ‘pop-up’ representation, based on
    folding the scene along what are estimated to be the ground-vertical intersections. Its use
    in 3D reconstruction is taken further by Gupta and Efros [56], who use the scene layout
    as the first step in a blocks-world-like approach to recovering volumetric descriptions of
    outdoor scenes.
    Scene layout estimation is also used as a cue for object recognition [65], because knowing
    the general layout of a scene gives a helpful indication of where various objects are
    most likely to appear, which saves time during detection. For example, detecting the
    location and orientation of a road in a street scene helps predict the location and scale of
    pedestrians (i.e. connected to the road and around two metres in height), thus discarding
    a large range of useless search locations. This is interesting as it shows how prior scene
    knowledge can be beneficial beyond reconstructing a scene model.
    However, this method has a few shortcomings. Firstly from an algorithmic standpoint,
    its sampling approach to finding the best segmentation is not deterministic, and so quite
    different 3D structures will be obtained for arbitrarily similar images. The iterative
    nature of the segmentation, in which the features are extracted from a structure estimate
    which is not yet optimal, might also be problematic, for example falling into local minima
    where the true segmentation cannot be found because the most appropriate cues cannot
    be exploited.
    In terms of scene understanding, there are a few more drawbacks. The method rests on
    the not unreasonable assumption that the camera is level (i.e. there is no roll), which
for most photographs is true, although this cannot be guaranteed in a robot navigation
    application, for example (indeed, such images were manually removed from their dataset
    before processing). Moreover, the horizon needs to be somewhere within the image,
    which again may be rather limiting. While the assumptions it makes are much less
    restrictive than those of, say, shape from texture, it is still limited in that the scene
    needs to be well represented by the discrete set of classes available.
    In terms of plane detection, the quantisation of orientation (left-facing, right-facing and
    so on) is the biggest downside, since it limits its ability to distinguish planes from one
    another if they are similar in orientation, and would give ambiguous results for planes
facing obliquely to the camera, as well as placing an obvious limit on the accuracy attainable for
any individual plane. Thus while the algorithm shows impressive performance in a range
    of images, extracting structure from arbitrarily placed cameras could be a problem.
    Nevertheless, the fact that a coarse representation gives a very good sense of scene
    structure, and is useful for the variety of tasks mentioned above, is reassuring, and paves
    the way for other learning-based single-image perception techniques.
    2.4 Summary
    This chapter has reviewed a range of methods for extracting the structure from single
    images, giving examples of methods using either direct calculation from image features or
    inference via machine learning. While these have been successful to an extent, a number
    of shortcomings remain. As we have discussed, those which rely on texture gradients
    or vanishing points are not applicable to general scenes where such structures are not
    present. Indeed, as Gregory [54] suggested, their shortcomings arise because they fail
    to take into account the contribution of learned prior knowledge to vision, and thus are
    unable to deal with novel or unexpected situations.
    It is encouraging therefore that several methods using machine learning have been devel-
oped; however, these also have a few problems. The work of Torralba and Oliva focuses
on global scene properties, a level of understanding too coarse for most inter-
    esting applications. Saxena et al. successfully show that depth maps can be extracted
    from single images (so long as comprehensive and accurate ground truth is available for
    training), then used to build rough 3D models. While this does reflect the underlying
    scene structure, it does not explicitly find higher-level structures with which the scene
    can be represented, nor determine what kind of structure the superpixels represent. We
    emphasise that this work has made significant progress in terms of interpreting single
    images, but must conclude that it does not wholly address the central issue with which
    this thesis is concerned — namely, the interpretation and understanding of scenes from
    one image. That is, they can estimate depth reliably, but do not distinguish different
    types of scene element, nor divide structures as to their identity (the mesh represents
    the whole scene, with no knowledge of planes or other objects). By contrast, the plane
    detection algorithm we present distinguishes planar and non-planar regions, and though
    it does not classify planes according to what they actually are, is a first step toward un-
    derstanding the different structures that make up the scene, potentially enabling more
    well-informed augmentation or interaction.
    Hoiem et al. on the other hand focus almost exclusively on classifying parts of the
    image into geometric classes, bridging the gap between semantic understanding and
    3D reconstruction, and combining techniques from both machine learning and single
    view metrology. However, because orientations are coarsely quantised it means that
    the recovered 3D models lack specificity, being unable to distinguish similarly oriented
    planes which fall into the same category; and any reconstruction is ultimately limited
    to the fidelity of the initial superpixel extraction. This limitation is not only a practical
    inconvenience, but suggests that the available prior knowledge contained in the training
    set could be exploited more thoroughly. Moreover, their requirements that the camera
    be roughly aligned with the ground plane, and the use of vanishing point information as
    a cue, suggest they are not making use of fully general information. In particular, they
    tend to need a fairly stable kind of environment, with a visible and horizontal ground
    plane at the base of the image. While this is a common type of scene, what we are
    aiming for is a more general treatment of prior information.
On the other hand, its ability to cope with cartoon images and even paintings shows it is
much more flexible than typical single-image methods, being able to extract very general
cues which are much more than merely the presence of lines or shapes. This method,
more than any other, shows the potential of machine learning methods to make sense
    of single images. In this thesis we attempt to further develop these ideas, to show how
    planar structures can be extracted from single images.
    CHAPTER 3
    Plane Recognition
    This chapter introduces our plane recognition algorithm, whose purpose is to determine
    whether an image region is planar or not, and estimate its 3D orientation. As it stands,
    this requires an appropriately delineated region of interest in the image to be given as
    input, and does not deal with the issue of finding or segmenting such regions from the
    whole image; however we stress that this technique, while limited, will form an essential
    component of what is to come.
    As we discussed in Chapter 2, many approaches to single image plane detection have
    used geometric or textural cues, for example vanishing points [73] or the characteristic
    deformation of textures [44]. We aim to go beyond these somewhat restrictive paradigms,
    and develop an approach which aims to be more general, and is applicable to a wider
    range of scenes. Our approach is inspired by the way humans appear to be able to
    easily comprehend new scenes, by exploiting prior visual experiences [50, 51, 54, 101].
    Therefore we take a machine learning approach, to learn the relationship between image
    appearance and 3D scene structure, using a collection of manually labelled examples.
    To be able to learn, our algorithm requires the collection and annotation of a large set
    of training data; representation of training and test data in an appropriate manner; and
    the training of classification and regression algorithms to predict class and orientation
    respectively for new regions. Over the course of this chapter we develop these concepts
    in detail.
    3.1 Overview
    The objective of our plane recognition algorithm is as follows: for a given, pre-segmented
    area of an image (generally referred to as the ‘image region’), to classify it as being planar
    or non-planar, and if it is deemed to be planar, to estimate its orientation with respect
    to the camera coordinate system. For now, we are assuming that an appropriate region
    of the image is given as input.
    The basic principle of our method is to learn the relationship between appearance and
    structure in a single image. Thus, the assumption upon which the whole method is
    founded is that there is a consistent relationship between how an image region looks
    and its 3D geometry. While this statement may appear trivial, it is not necessarily true:
    appearances can deceive, and the properties of surfaces may not be all they seem. Indeed,
    exploiting the simplest of relationships (again, like vanishing points) between appearance
    and structure in a direct manner is what can lead to the failure of existing methods, when
    such assumptions are violated. However, we believe that in general image appearance is
    a very useful cue to the identity and orientation of image regions, and we show that this
    is sufficient and generally reliable as a means of predicting planar structure.
    To do this, we gather a large set of training examples, manually annotated with their
    class (it is to be understood that whenever we refer to ‘class’ and ‘classification’, it
    is to the distinction of plane and non-plane, rather than material or object identity)
    and their 3D orientation (expressed relative to the camera, since we do not have any
global coordinate frame, and represented as a normalised vector in R^3, pointing toward
    the viewer), where appropriate. These examples are represented using general features,
    rather than task-specific entities such as vanishing points. Image regions are represented
    using a collection of gradient orientation and colour descriptors, calculated about salient
    points. These are combined into a bag of words representation, which is projected to a
    low dimensional space by using a variant of latent semantic analysis, before enhancement
    with spatial distribution information using ‘spatiograms’ [10]. With these data and the
    regions’ given target labels, we train a classifier and regressor (often referred to as simply
    ‘the classifiers’). Using the same representation for new, previously unseen, test regions,
    the classifiers are used to predict their class and orientation.
    In the following sections, we describe in detail each of these steps, with a discussion of
    the methods involved. A full evaluation of the algorithm, with an investigation of the
    various parameters and design choices, is presented in the next chapter.
    3.2 Training Data
    We gather a large set of example data, with which to train the classifiers. From the raw
    image data we manually choose the most relevant image regions, and mark them up with
    ground truth labels for class and orientation, then use these to synthetically generate
    more data with geometric transformations.
    3.2.1 Data Collection
    Using a standard webcam (Unibrain Fire-i) 1 connected to a laptop, we gathered video
    sequences from outdoor urban scenes. These are of size 320 × 240 pixels, and are corrected
    for radial distortion introduced by a wide-angle lens 2 . For development and validation
    of the plane recognition algorithm, we collected two datasets, one taken in an area
    surrounding the University of Bristol, for training and validating the algorithm; and
    a second retained as an independent test set, taken in a similar but separate urban
    location.
    To create our training set, we select a subset of video frames, which show typical or
    interesting planar and non-planar structures. In each, we manually mark up the region
    of interest, by specifying points that form its convex hull. This means that we are using
    training data corresponding to either purely planar or non-planar regions (there are no
    mixed regions). This is the case for both training and testing data.
    To train the classifiers (see Section 3.4), we need ground truth labels. Plane class is easy
    to assign, by labelling each region as to whether it is planar or not. Specifying the true
    orientation of planar regions is a little more complicated, since the actual orientation is of
    course not calculable from the image itself, so instead we use an interactive method based
on vanishing points, as illustrated in Figure 3.1 [26, 62].
1 www.unibrain.com/products/fire-i-digital-camera/
2 Using the Caltech calibration toolkit for Matlab, from www.vision.caltech.edu/bouguetj/calib_doc
Figure 3.1: Illustration of how the ground truth orientation is obtained, using
manually selected corners of a planar rectangle.
Four points corresponding to the
    corners of a rectangle lying on the plane in 3D are marked up by hand, and the pairs of
    opposing edges are extended until they meet to give vanishing points in two orthogonal
directions. These are denoted v_1 and v_2, and are expressed in homogeneous coordinates
(i.e. extended to 3D vectors by appending a 1). Joining these two points in the image
by a line gives the vanishing line, represented by the 3-vector l = v_1 × v_2. The
plane which passes through the vanishing line and the camera centre is parallel to the
scene plane described by the rectangle, and its normal can be obtained from n = K^T l,
where K is the 3 × 3 intrinsic camera calibration matrix [62] (a superscript ‘T’
denotes the matrix transpose). Examples of training data, both planar, showing the
    ‘true’ orientation, and non-planar, are shown in Figure 3.2.
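A minimal sketch of this ground-truth construction is given below (illustrative only; the corner ordering and calibration matrix values are assumptions made for the example, not those of our actual setup):

```python
import numpy as np

def plane_normal_from_rectangle(corners, K):
    """Ground-truth plane normal from the four image corners of a scene rectangle.

    corners: 4x2 array ordered around the rectangle (tl, tr, br, bl). Opposite
    edges meet at two vanishing points v1, v2; their join is the vanishing line
    l = v1 x v2, and the plane normal follows as n = K^T l.
    """
    hom = lambda p: np.array([p[0], p[1], 1.0])
    tl, tr, br, bl = [hom(c) for c in corners]
    v1 = np.cross(np.cross(tl, tr), np.cross(bl, br))   # 'horizontal' edges meet here
    v2 = np.cross(np.cross(tl, bl), np.cross(tr, br))   # 'vertical' edges meet here
    l = np.cross(v1, v2)                                 # vanishing line of the plane
    n = K.T @ l
    n = n / np.linalg.norm(n)
    return n          # the sign may need flipping so the normal points toward the camera

K = np.array([[300.0, 0.0, 160.0],
              [0.0, 300.0, 120.0],
              [0.0, 0.0, 1.0]])            # illustrative calibration for a 320x240 image
corners = np.array([[60, 60], [250, 80], [245, 190], [65, 200]])
print(plane_normal_from_rectangle(corners, K))
```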
    3.2.2 Reflection and Warping
    In order to increase the size of our training set, we synthetically generate new training
    examples. First, we reflect all the images about the vertical axis, since a reflection of an
    image can be considered equally physically valid. This immediately doubles the size of
    our training set, and also removes any bias for left or right facing regions.
    We also generate examples of planes with different orientations, by simulating the view
    as seen by a camera in different poses. This is done by considering the original image to
    be viewed by a camera at location [ I | 0 ], where I is the 3 × 3 identity matrix (no rotation)
    and 0 is the 3D zero vector (the origin); and then approximating the image seen from
    a camera at a new viewpoint [ R | t ] (rotation matrix R and translation vector t with
    respect to the original view) by deriving a planar homography relating the image of the
plane in both views.
Figure 3.2: Examples of our manually outlined and annotated training data,
showing examples of both planes (orange boundary, with orientation vectors)
and non-planes (cyan boundary).
The homography linking the two views is calculated by
H = R + \frac{t\,n^T}{d} \qquad (3.1)
    where n is the normal of the plane and d is the perpendicular distance to the plane
    (all defined up to scale, which means without loss of generality we set d = 1). We use
    this homography to warp the original image, to approximate the view from the new
    viewpoint. To generate the pose [ R | t ] of the hypothetical camera, we use rotations of
angle θ_γ about the three coordinate axes γ ∈ {x, y, z}, each represented by a rotation
matrix R_γ, the product of which gives us the final rotation matrix

R = R_x R_y R_z =
\begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\theta_x & -\sin\theta_x \\ 0 & \sin\theta_x & \cos\theta_x \end{pmatrix}
\begin{pmatrix} \cos\theta_y & 0 & \sin\theta_y \\ 0 & 1 & 0 \\ -\sin\theta_y & 0 & \cos\theta_y \end{pmatrix}
\begin{pmatrix} \cos\theta_z & -\sin\theta_z & 0 \\ \sin\theta_z & \cos\theta_z & 0 \\ 0 & 0 & 1 \end{pmatrix} \qquad (3.2)
We calculate the translation by t = −RD + D, where D is a unit vector in the direction
of the point on the plane around which to rotate (calculated as D = K^{-1} m, where m is
a 2D point on the plane, usually the centroid, expressed in homogeneous coordinates).
    After warping the image, the normal vector for this warped plane is Rn .
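The following sketch pulls these steps together (illustrative only, not our actual implementation; note that the homography above acts on normalised coordinates, so conjugating by K to warp pixel coordinates is an added detail not spelled out in the text):

```python
import numpy as np
import cv2

def rotation_xyz(tx, ty, tz):
    """R = Rx Ry Rz from rotation angles (radians) about the coordinate axes."""
    Rx = np.array([[1, 0, 0], [0, np.cos(tx), -np.sin(tx)], [0, np.sin(tx), np.cos(tx)]])
    Ry = np.array([[np.cos(ty), 0, np.sin(ty)], [0, 1, 0], [-np.sin(ty), 0, np.cos(ty)]])
    Rz = np.array([[np.cos(tz), -np.sin(tz), 0], [np.sin(tz), np.cos(tz), 0], [0, 0, 1]])
    return Rx @ Ry @ Rz

def warp_view(image, K, n, m, tx, ty):
    """Approximate the view of a plane (normal n, d = 1) from a rotated virtual camera.

    m is the 2D point (e.g. the region centroid) about which the camera rotates.
    """
    R = rotation_xyz(tx, ty, 0.0)
    D = np.linalg.inv(K) @ np.array([m[0], m[1], 1.0])
    D = D / np.linalg.norm(D)               # unit vector toward the chosen plane point
    t = D - R @ D                           # keeps that point fixed under the new pose
    H = R + np.outer(t, n)                  # equation (3.1) with d = 1
    G = K @ H @ np.linalg.inv(K)            # conjugate by K to act on pixel coordinates
    warped = cv2.warpPerspective(image, G, (image.shape[1], image.shape[0]))
    return warped, R @ n                    # warped image and the rotated plane normal

image = cv2.imread("frame.png")             # placeholder file name for a 320x240 frame
K = np.array([[300.0, 0.0, 160.0], [0.0, 300.0, 120.0], [0.0, 0.0, 1.0]])
n = np.array([0.0, 0.0, 1.0])               # fronto-parallel plane, with n . X = d
warped, n_new = warp_view(image, K, n, (160, 120), np.deg2rad(15), np.deg2rad(-30))
```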
To generate new [R|t] pairs we step through angles in x and y in increments of 15°, up
to ±30° in both directions. While we could easily step in finer increments, or also apply
    rotations about the z -axis (which amounts to rotating the image), this quickly makes
training sets too large to deal with.
Figure 3.3: Training data after warping using a homography to approximate
new viewpoints.
We also omit examples which are very distorted, by
checking whether the ratio of the eigenvalues of the resulting regions is within a certain
    range, to ensure that the regions are not stretched too much. Warping is also applied to
    the non-planar examples, but since these have no orientation specified, we use a normal
    pointing toward the camera, to generate the homography, under the assumption that the
    scene is approximately front-on. Warping these is not strictly necessary, but we do so to
    increase the quantity of non-planar examples available, and to ensure that the number
    of planar and non-planar examples remain comparable. Examples of warped images are
    shown in Figure 3.3.
    3.3 Image Representation
    We describe the visual information in the image regions of interest using local image
    descriptors. These represent the local gradient and colour properties, and are calculated
    for patches about a set of salient points. This means each region is described by a large
    and variable number of descriptors, using each individual point. In order to create more
    concise descriptions of whole regions, we employ the visual bag of words model, which
    represents the region according to a vocabulary; further reduction is achieved by discov-
    ering a set of underlying latent topics, compressing the bag of words information to a
    lower dimensional space. Finally we enhance the topic-based representation with spatial
    distribution information, an important step since the spatial configuration of various
    image features is relevant to identifying planes. This is done using spatial histograms,
    or ‘spatiograms’. Each step is explained in more detail in the following sections.
    3.3.1 Salient Points
    Only salient points are used, in order to reduce the amount of information we must deal
    with, and to focus only on those areas of the image which contain features relevant to our
    task. For example, it would be wasteful to represent image areas devoid of any texture
    as these contribute very little to the interpretation of the scene.
    Many alternatives exist for evaluating the saliency of points in an image. One popular
    choice is the FAST detector [110], which detects corner-like features. We experimented
    with this, and found it to produce not unreasonable results (see Section 4.1.2); however
    this detects features at only a single scale, ignoring the fact that image features tend
    to occur over a range of different scales [80]. Since generally there is no way to know a
    priori at which scale features will occur, it is necessary to analyse features at all possible
    scales [79].
    To achieve this we use the difference of Gaussians detector to detect salient points, which
    is well known as the first stage in computing the SIFT feature descriptor [81]. As well
    as a 2D location for interest points, this gives the scale at which the feature is detected.
    This is to be interpreted as the approximate size, in pixels, of the image content which
    leads the point to be considered salient.
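A minimal sketch of extracting such scale-annotated salient points with OpenCV's SIFT implementation, which uses the difference of Gaussians detector internally (the file name is a placeholder, and this is not necessarily the detector implementation used in this work):

```python
import cv2

grey = cv2.imread("region.png", cv2.IMREAD_GRAYSCALE)   # placeholder input image
sift = cv2.SIFT_create()
keypoints = sift.detect(grey, None)                      # difference-of-Gaussians keypoints

for kp in keypoints[:5]:
    # kp.pt is the 2D location; kp.size is the detected scale (patch size in pixels)
    print(kp.pt, kp.size)
```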
    3.3.2 Features
    Feature descriptors are created about each salient point in the image. The patch sizes
    used for building the descriptors come from the points’ scales, since the scale values
    represent the regions of the image around the points which are considered salient. The
    basic features we use to describe the image regions capture both texture and colour
    information, these being amongst the most important basic visual characteristics visible
    in images.
    Texture information is captured using histograms of local gradient orientation; such
    descriptors have been successfully used in a number of applications, including object
    recognition and pedestrian detection [28, 82]. However one important difference between
    our descriptors and something like SIFT is that we do not aim to be invariant. While
    robustness to various image transformations and deformations is beneficial for reliably
    detecting objects, this would actually be a disadvantage to us. We are aiming to actually
    recover orientation, rather than be invariant to its effects, and so removing its effects
    from the descriptor would be counter-productive.
To build the histograms, the local orientation at each pixel is obtained by convolving
the image with the mask [−1, 0, 1] in the x and y directions separately, to approximate
the first derivatives of the image. This gives the gradient values G_x and G_y, for the
horizontal and vertical directions respectively, which can be used to obtain the angle θ
and magnitude m of the local gradient orientation:

\theta = \tan^{-1}\!\left(\frac{G_y}{G_x}\right), \qquad m = \sqrt{G_x^2 + G_y^2}    (3.3)
    The gradient orientation and magnitude are calculated for each pixel within the image
    patch in question, and gradient histograms are built by summing the magnitudes over
orientations, quantised into 12 bins covering the range [0, π). Our descriptors are actually
    formed from four such histograms, one for each quadrant of the image patch, then
    concatenated into one 48D descriptor vector. This is done in order to incorporate some
    larger scale information, and we illustrate it in Figure 3.4. This construction is motivated
    by histograms of oriented gradient (HOG) features [28], but omits the normalisation over
    multiple block sizes, and has only four non-overlapping cells.
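A minimal sketch of this 48D descriptor is given below (Python with numpy assumed; function and variable names are ours, and folding the orientation into [0, π) is one plausible choice rather than the exact implementation).

import numpy as np

def quadrant_gradient_descriptor(patch, n_bins=12):
    # Approximate the derivatives with the [-1, 0, 1] mask in x and y.
    patch = patch.astype(np.float64)
    gx = np.zeros_like(patch)
    gy = np.zeros_like(patch)
    gx[:, 1:-1] = patch[:, 2:] - patch[:, :-2]
    gy[1:-1, :] = patch[2:, :] - patch[:-2, :]
    theta = np.arctan2(gy, gx) % np.pi            # orientation folded into [0, pi)
    mag = np.sqrt(gx ** 2 + gy ** 2)
    h, w = patch.shape
    quadrants = [(slice(0, h // 2), slice(0, w // 2)),
                 (slice(0, h // 2), slice(w // 2, w)),
                 (slice(h // 2, h), slice(0, w // 2)),
                 (slice(h // 2, h), slice(w // 2, w))]
    # One 12-bin, magnitude-weighted orientation histogram per quadrant, concatenated.
    hists = [np.histogram(theta[q], bins=n_bins, range=(0.0, np.pi), weights=mag[q])[0]
             for q in quadrants]
    return np.concatenate(hists)                  # 4 x 12 = 48 dimensions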
    The importance of colour information for geometric classification was demonstrated by
    Hoiem et al. [66], and is used here as it may disambiguate otherwise difficult examples,
    such as rough walls and foliage. To encode colour, we use RGB histograms, which are
formed by concatenating intensity histograms built from each of the three colour channels
of the image, each with 20 bins, to form a 60D descriptor.
Figure 3.4: An illustration showing how we create the quadrant-based gradient
histogram descriptor. An image patch is divided into quadrants, and a separate
orientation histogram created for each, which are concatenated.
    We use both types of feature together for plane classification; however, it is unlikely
    that colour information would be beneficial for estimating orientation of planar surfaces.
    Therefore, we use only gradient features for the orientation estimation step; the means
    by which we use separate combinations of features for the two tasks is described below.
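For completeness, the corresponding 60D colour descriptor can be sketched as follows (again numpy; names ours, and 8-bit channel values assumed).

import numpy as np

def rgb_descriptor(patch_rgb, n_bins=20):
    # A 20-bin intensity histogram per RGB channel, concatenated into 60 dimensions.
    hists = [np.histogram(patch_rgb[..., c], bins=n_bins, range=(0, 256))[0]
             for c in range(3)]
    return np.concatenate(hists).astype(np.float64)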
    3.3.3 Bag of Words
    Each image region has a pair of descriptors for every salient point, which may number
    in the tens or hundreds, making it a rather rich but inefficient means of description.
    Moreover, each region will have a different number of points, making comparison prob-
    lematic. This is addressed by accumulating the information for whole regions in an
    efficient manner using the bag of words model.
    The bag of words model was originally developed in the text retrieval literature, where
    documents are represented simply by relative counts of words occurring in them; this
somewhat naïve approach has achieved much success in tasks such as document classifi-
    cation or retrieval [77, 115]. When applied to computer vision tasks, the main difference
    that must be accounted for is that there is no immediately obvious analogue to words
    in images, and so a set of ‘visual words’ are created. This is done by clustering a large
    set of example feature vectors, in order to find a small number of well distributed points
    in feature space (the cluster centres). Feature vectors are then described according to
    their relationship to these points, by replacing the vectors by the ID of the word (clus-
    ter centre) to which they are closest. Thus, rather than a large collection of descriptor
    vectors, an image region can be represented simply as a histogram of word occurrences
    over this vocabulary. This has been shown to be an effective way of representing large
    amounts of visual information [134]. In what follows, we use ‘word’ (or ‘term’) to refer
    to such a cluster centre, and ‘document’ is synonymous with image.
    We create two separate codebooks (sets of clustered features), for gradient and colour.
    These are formed by clustering a large set of features harvested from training images
    using K-means, with K clusters. Clustering takes up to several minutes per codebook,
    but must only be done once, prior to training (codebooks can be re-used for different
    training sets, assuming there is not too much difference in the features used).
Word histograms are represented as K-dimensional vectors w, with elements denoted
w_k, for k = 1, …, K. Each histogram bin is w_k = |Λ_k|, where Λ_k is the set of points which
quantised to word k (i.e. w_k is the count of occurrences of word k in the document). To
reduce the impact of commonly occurring words (words that appear in all documents
are not very informative), we apply term frequency–inverse document frequency (tf-idf)
weighting as described in [84]. We denote word vectors after weighting by w′, and when
we refer to word histograms hereafter we mean those weighted by tf-idf, unless otherwise
stated.
    As discussed above, we use two different types of feature vector, of different dimension-
    ality. As such we need to create separate codebooks for each, for which the clustering
    and creation of histograms is independent. The result is that each training region has
    two histograms of word counts, for its gradient and colour features.
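The quantisation and weighting steps can be sketched as follows (numpy assumed; the tf-idf variant shown is the common log-idf form, which may differ in detail from [84], and the names are ours).

import numpy as np

def word_histogram(descriptors, codebook):
    # Assign each descriptor to its nearest cluster centre and count occurrences.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    return np.bincount(words, minlength=codebook.shape[0]).astype(np.float64)

def tfidf_weight(histogram, document_frequency, n_documents):
    # Down-weight words that occur in many of the training documents.
    tf = histogram / max(histogram.sum(), 1.0)
    idf = np.log(n_documents / np.maximum(document_frequency, 1.0))
    return tf * idf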
    3.3.4 Topics
    The bag of words model, as it stands, has a few problems. Firstly, there is no association
    between different words, so two words representing similar appearance would be deemed
    as dissimilar as any other pair of words. Secondly, as the vocabularies become large, the
    word histograms will become increasingly sparse, making comparison between regions
    unreliable as the words they contain are less likely to coincide.
The answer again comes from the text retrieval literature, where similar problems (word
synonymy and high dimensionality) are prevalent. The idea is to make use of an under-
lying latent space amongst the words in a corpus, which can be thought of as representing
‘topics’, each of which should roughly correspond to a single semantic concept.
    It is not realistic to expect each document to correspond to precisely one latent topic, and
    so a document is represented by a distribution over topics. Since the number of topics
    is generally much less than the number of words, this means topic analysis achieves
    dimensionality reduction. Documents are represented as a weighted sum over topics,
    instead of over words, and ideally synonyms are implicitly taken care of by being mapped
    to the same semantic concept.
A variety of methods exist for discovering latent topics, which all take as input a ‘term-
document’ matrix. This is a matrix W = [w′_1, w′_2, …, w′_M], where each column is the
weighted word vector for each of the M documents, so each row corresponds to a word k.
    3.3.4.1 Latent Semantic Analysis
The simplest of these methods is known as latent semantic analysis (LSA) [30], which
discovers latent topics by factorising the term-document matrix W, using the singular
value decomposition (SVD), into W = U D V^T. Here U is a K × K matrix, V is M × M,
and D is the diagonal matrix of singular values. The reduced dimensionality form is
obtained by truncating the SVD, retaining only the top T singular values (where T is
the desired number of topics), so that W ≈ U_t D_t V_t^T, where U_t is K × T and V_t is
M × T. The rows of the matrix V_t are the reduced dimensionality description – the topic
vectors t_m – for each document m.
    An advantage of LSA is that it is very simple to use, partly because topic vectors for
the image regions in the training set are extracted directly from V_t. Topic vectors
for new test image regions are calculated simply by projecting their word histograms
into the topic space, using t_j = D_t^{-1} U_t^T w′_j. However, LSA suffers from an important
problem, due to its use of the SVD: the topic weights (i.e. the elements of the vectors t_j)
    may be negative. This makes interpretation of the topic representation difficult (what
    does it mean for a document to have a negative amount of a certain topic?), but more
    importantly the presence of negative weights will become a problem when we come to
    take weighted means of points’ topics (see Section 3.3.5), where all topic weights would
    need to be non-negative.
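A small numpy sketch of LSA (ours, not the thesis code) makes the truncation and projection explicit, and shows where the problematic negative weights arise.

import numpy as np

def lsa_fit(W, n_topics):
    # W is the K x M term-document matrix; keep only the top T singular values.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_t = U[:, :n_topics]
    D_t = np.diag(s[:n_topics])
    topic_vectors = Vt[:n_topics, :].T     # one row per training document
    return U_t, D_t, topic_vectors

def lsa_project(U_t, D_t, w_new):
    # t_j = D_t^{-1} U_t^T w'_j; the result may contain negative topic weights.
    return np.linalg.inv(D_t) @ U_t.T @ w_new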
    A later development called probabilistic latent semantic analysis (pLSA) [63] does pre-
    serve the non-negativity of all components (because they are expressed as probabilities),
    but its use of expectation maximisation to both find the factorisation, and get the topic
    vectors for new data, is infeasibly slow for our purposes. Fortunately, a class of methods
    known as non-negative matrix factorisation has been shown, under some conditions, to
    be equivalent to pLSA [31, 45].
    3.3.4.2 Non-negative Matrix Factorisation
    As the name implies, non-negative matrix factorisation (NMF) [76] is a method for
    factorising a matrix (in this case, the term-document matrix W ) into two factor matrices,
with reduced dimensionality, where all the terms are positive or zero. The aim is to find
the best low-rank approximation W ≈ BT to the original matrix with non-negative
terms. If W is known to have only non-negative entries (as is the case here, since its
    entries are word counts) so will B and T . Here, T is T × M and will contain the topic
    vectors corresponding to the columns of W ; and B can be interpreted as the basis of
    the topic space (of size K × T , the number of words and topics respectively). There are
    no closed form solutions to this problem, but Lee and Seung [76] describe an iterative
    algorithm, which can be started from a random initialisation of the factors.
As with LSA, topic vectors for training regions are simply the columns of T. Unfor-
tunately, although we can re-arrange the above equation to obtain t_j = B^† w′_j (where
B^† is the Moore–Penrose pseudoinverse) to get a low-dimensional approximation for test
vectors w′_j, there is no non-negativity constraint on B^†. This means that the resulting
topic vectors t_j will also contain negative elements, and so we have the same problem as
with LSA.
The problem arises from using the pseudoinverse. It is generally calculated using the SVD,
which as we saw above does not maintain non-negativity. One way to ensure that the
pseudoinverse of B is non-negative is to make B orthogonal, since for a semi-orthogonal
(non-square) matrix the pseudoinverse is simply its transpose. This leads us to the methods of
    orthogonal non-negative matrix factorisation.
    3.3.4.3 Orthogonal Non-negative Matrix Factorisation
The objective of orthogonal non-negative matrix factorisation (ONMF) [18] is to factorise
as above, with the added constraint that B^T B = I. Left-multiplying W = BT by B^T
we get B^T W = B^T B T = T, and so it is now valid to project a word vector w′_j to
the topic space by t_j = B^T w′_j, such that the topic vector contains only non-negative
elements. In practice, these factors are found by a slight modification of the method of
NMF [135], and so the multiplicative update rules used for ONMF are:

B_{nt} \leftarrow B_{nt} \frac{(W T^T)_{nt}}{(B T W^T B)_{nt}}, \qquad
T_{tm} \leftarrow T_{tm} \frac{(B^T W)_{tm}}{(B^T B T)_{tm}}    (3.4)
    The result is a factorisation algorithm which directly gives the topic vectors for each
    training example used in creating the term-document matrix, and a fast linear method
    for finding the topic vector of any new test word histogram, simply by multiplying with
    the transpose of the topic basis matrix B .
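The updates of (3.4) are straightforward to implement; the sketch below (numpy, with a small constant to avoid division by zero, and with the initialisation and iteration count chosen arbitrarily) is illustrative rather than the exact implementation used here.

import numpy as np

def onmf(W, n_topics, n_iters=200, eps=1e-9, seed=0):
    # Factorise W (K x M) into B (K x T) and T (T x M) with non-negative entries
    # and approximately orthonormal columns of B.
    rng = np.random.default_rng(seed)
    K, M = W.shape
    B = rng.random((K, n_topics))
    T = rng.random((n_topics, M))
    for _ in range(n_iters):
        B *= (W @ T.T) / (B @ T @ W.T @ B + eps)   # B update of (3.4)
        T *= (B.T @ W) / (B.T @ B @ T + eps)       # T update of (3.4)
    return B, T

def topic_vector(B, w_new):
    # Projection of a new word histogram: t = B^T w', guaranteed non-negative.
    return B.T @ w_new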
Figure 3.5: For a given planar region, we show (a) the words in the image, each
with a unique colour, and the weighting of these points towards different topics,
namely (b) the points contributing to Topic 16 and (c) the points contributing to
Topic 5, corresponding to those visualised below.
    3.3.4.4 Topic Visualisation
So far, in considering the use of topic analysis, we have treated it simply as a means of
dimensionality reduction. However, just as topics should represent underlying semantic
concepts in text documents, they should also represent similarities across our visual words.
Indeed, it is partly in order to capture otherwise overlooked similarities between different
words that we have used topic analysis, and so it would be interesting to check whether this
is actually the case.
    First, consider the example plane region in Figure 3.5a, taken from our training set. In
    the left image, we have shown all the words, each with a unique colour – showing no
    particular structure. The two images to its right show only those points corresponding
    to certain topics (Topic 16 and Topic 5, from a gradient-only space of 20 topics), via the
words that the features quantise to, where the opacity of the points represents the extent
    to which the word contributes to that topic. It appears that words contributing to Topic
    16 tend to occur on the tops and bottoms of windows, while those for Topic 5 lie within
    the windows, suggesting that they may be picking out particular types of structure.
    We expand upon this, by more directly demonstrating what the topics represent. Vi-
    sualising words and topics can be rather difficult, since both exist in high dimensional
    spaces, and because topics represent a distribution over all the words, not simply a re-
duced selection of important words. Nevertheless, it should be the case that the patches
which quantise to the words contributing most strongly to a given topic will look
more similar to each other than patches drawn from other pairs of words.
Figure 3.6: A selection of patches quantising to the top three words for Topic
16 (Words 336, 84 and 4), demonstrating different kinds of horizontal edge.
Figure 3.7: Patches representing the top three words for Topic 5 (Words 376, 141
and 51), which appear to correspond to different kinds of grid pattern (or a
combination of vertical and horizontal edges; a few apparent outliers are shown
toward the right, although these retain a similar pattern of edges).
    For the image region above, we looked at the histogram of topics, and found that the
    two highest weighted were Topics 16 and 5 (those shown above). Rather than showing
these topics directly, we can find which words contribute most strongly to those topics.
    For Topic 16, the highest weighted words (using the word-topic weights from B ) were
    336, 84 and 4; and for Topic 5, the words with the highest weights were 376, 141 and
    51. Again, these numbers have little meaning on their own, being simply indices within
    the unordered set of 400 gradient words we used.
    We show example image patches for each of these words, for those two topics, in Figures
    3.6 and 3.7. These patches were extracted from the images originally used to create
    the codebooks (the patches are of different sizes due to the use of multi-scale salient
    point detection, but have been resized for display). If the clustering has been performed
    correctly there should be similarities between different examples of the same word; and
    according to the idea behind latent topic analysis, we should also find that the words
    assigned to the same topic are similar to each other.
    This does indeed appear to be the case. The three words shown for Topic 16 represent
    various types of horizontal features. Because we use gradient orientation histograms as
    features, the position of these horizontal edges need not remain constant, nor the direc-
    tion of intensity change. Similarly, Topic 5 appears to correspond to grid-like features,
    such as window panes and tiles. The fact that these groups of words have been placed
    together within topics suggests topic analysis is performing as desired, in that it is able
    to link words with similar conceptual features which would usually – in a bag of words
    model – be considered as different as any other pair of words. Not only does this grouping
    suggest the correct words have been associated together, but it agrees with the location
    of the topics as visualised in Figure 3.5, which are concentrated on horizontal window
    edges and grid-like window panes respectively.
    We acknowledge that these few examples do not conclusively show that this happens
    consistently, but emphasise that our method does not depend on it: it is sufficient that
    ONMF performs dimensionality reduction in the usual sense of the word. Nevertheless,
    it is reassuring to see such structure spontaneously emerge from ONMF, and for this to
    be reflected in the topics found in a typical image from our training set.
    3.3.4.5 Combining Features
    The above discussion assumes there is only one set of words and documents, and does not
    deal with our need to use different vocabularies for different tasks. Fortunately, ONMF
    makes it easy to combine the information from gradient and colour words. This is done
    by concatenating the two term-document matrices for the corpus, so that effectively
    each document has a word vector of length 2 K (assuming the two vocabularies have the
    same number of words). This concatenated matrix is the input to ONMF, which means
    the resulting topic space is over the joint distribution of gradient and colour words, so
    should encode correlations between the two types of visual word. Generally we double
    the number of topics, to ensure that using both together retains the details from either
    used individually.
    When regressing the orientation of planes, colour information is not needed. We run
    ONMF again to create a second topic space, using a term-document matrix built only
    from gradient words, and only using planar image regions. This means that there are two
    topic spaces, a larger one containing gradient and colour information from all regions,
    and a smaller one, of lower dimensionality, using only the gradient information from the
    planar regions.
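Assuming the onmf() sketch given earlier, the construction of these two topic spaces could look like the following (function name, argument layout and the planar-region mask are ours).

import numpy as np

def build_topic_spaces(W_gradient, W_colour, planar_columns, n_topics=20):
    # Joint gradient-and-colour space (doubled topic count) for classification,
    # and a gradient-only space built from planar regions for regression.
    W_joint = np.vstack([W_gradient, W_colour])
    B_class, T_class = onmf(W_joint, 2 * n_topics)
    B_regress, T_regress = onmf(W_gradient[:, planar_columns], n_topics)
    return (B_class, T_class), (B_regress, T_regress)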
    3.3.5 Spatiograms
    The topic histograms defined above represent image regions compactly; however, we
    found classification and regression accuracy to be somewhat disappointing using these
    alone (see our experiments in Section 4.1.1). A feature of the underlying bag of words
    model is that all spatial information is discarded. This also applies to the latent topic
    representation. Although this has not hampered performance in tasks such as object
recognition, for our application the relative spatial positions of features are likely to be
    important. For example, some basic visual primitives may imply a different orientation
    depending on their relative position.
    It is possible to include spatial information by representing pairwise co-occurrence of
    words [9], by tiling overlapping windows [131], or by using the constellation or star models
    [37, 38]. While the latter are effective, they are very computationally expensive, and scale
    poorly in terms of the number of points and categories. Instead, we accomplish this by
    using spatiograms. These are generalisations of histograms, able to include information
    about higher-order moments. These were introduced by Birchfield and Rangarajan [10]
    in order to improve the performance of histogram-based tracking, where regions with
    different appearance but similar colours are too easily confused.
    We use second-order spatiograms, which as well as the occurrence count of each bin, also
    encode the mean and covariance of points contributing to that bin. These represent the
    spatial distribution of topics (or words), and they replace the topic histograms above
    (not the gradient or colour histogram descriptors). While spatiograms have been useful
    for representing intensity, colour and terrain distribution information [10, 52, 83], to our
    knowledge they have not previously been used with a bag of words model.
We first describe spatiograms over words, in order to introduce the idea. A word spa-
tiogram s^{word}, over K words, is defined as a set of K triplets s_k^{word} = (h_k^{word}, μ_k^{word}, Σ_k^{word}),
where the histogram values h_k^{word} = w′_k are the elements of the weighted word histogram
as above, and the mean and (unbiased) covariance are defined as

\mu_k^{word} = \frac{1}{|\Lambda_k|} \sum_{i \in \Lambda_k} v_i, \qquad
\Sigma_k^{word} = \frac{1}{|\Lambda_k| - 1} \sum_{i \in \Lambda_k} v_i^k (v_i^k)^T    (3.5)

where v_i is the 2D coordinate of point i and v_i^k = v_i - \mu_k^{word}, and as above \Lambda_k is the
subset of points whose feature vectors quantise to word k.
    To represent 2D point positions within spatiograms, we normalise them with respect
    to the image region, leaving us with a set of points with zero mean, rather than being
    coordinates in the image. This gives us a translation invariant region descriptor. Without
    this normalisation, a similar patch appearing in different image locations would have
    a different spatiogram, and so would have a low similarity score, which may adversely
affect classification. The shift is achieved by replacing each v_i in the equations above with
v_i - \frac{1}{N}\sum_{j=1}^{N} v_j. Note that by shifting the means, the covariances (and histogram
values) are unaffected.
    The definition of word spatiograms is fairly straightforward, since every 2D point quan-
    tises to exactly one word. Topics are not so simple, since each word has a distribution
    over topics, and so each 2D point contributes some different amount to each topic. There-
    fore, rather than simply a sum over a subset of points, the mean and covariance will be
    a weighted mean and a weighted (unbiased) covariance over all the points, according to
    their contribution to each topic. It is because of this calculation of weighted means that
    the topic weights cannot be negative, as ensured by our use of ONMF, discussed above.
    Rather than a sum over individual points, we can instead express topic spatiograms as
    a sum over the words, since all points corresponding to the same word contribute the
same amount to their respective topics. The resulting topic spatiograms s^{topic}, defined
over T topics, consist of T triplets s_t^{topic} = (h_t^{topic}, μ_t^{topic}, Σ_t^{topic}). Here, the scalar elements
h_t^{topic} are from the region's topic vector as defined above, while the mean and covariance
are calculated using

\mu_t^{topic} = \frac{1}{\alpha_t} \sum_{k=1}^{K} \eta_{tk} \mu_k^{word}, \qquad
\Sigma_t^{topic} = \frac{\alpha_t}{\alpha_t^2 - \beta_t} \sum_{k=1}^{K} \frac{\eta_{tk}}{|\Lambda_k|} \sum_{i \in \Lambda_k} v_i^t (v_i^t)^T    (3.6)

where v_i^t = v_i - \mu_t^{topic}, \alpha_t = \sum_{k=1}^{K} \eta_{tk}, and \beta_t = \sum_{k=1}^{K} \eta_{tk}^2 / |\Lambda_k|. The weights \eta_{tk} are given by
\eta_{tk} = B_{tk} w′_k, and reflect both the importance of word k through w′_k and its contribution to
topic t via B_{tk} (elements from the topic basis matrix). As in Section 3.3.4.5, we maintain
    two spatiograms per (planar) image region, one for gradient and colour features, and one
    for just gradient features. We illustrate the data represented by spatiograms in Figure
    3.8, where we have shown an image region overlaid with some of the topics to which the
    image features contribute (see Figure 3.5 above), as well as the mean and covariance for
    the topics, which is encoded within the spatiogram.
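The computation of (3.6) can be sketched as follows (numpy; the data layout, the variable names and the assumption that the topic basis is stored as a K x T matrix are ours, not the exact implementation).

import numpy as np

def topic_spatiogram(points_per_word, topic_histogram, B, w_weighted, eps=1e-12):
    # points_per_word[k]: (|Lambda_k| x 2) array of normalised 2D positions for word k
    # topic_histogram:    the region's topic vector (the h_t values)
    # B, w_weighted:      topic basis (K x T) and tf-idf word histogram (length K)
    K, T = B.shape
    word_means = np.array([p.mean(axis=0) if len(p) else np.zeros(2)
                           for p in points_per_word])
    eta = B.T * w_weighted[None, :]                    # eta_tk = B_tk * w'_k
    triplets = []
    for t in range(T):
        alpha = eta[t].sum() + eps
        beta = sum(eta[t, k] ** 2 / max(len(points_per_word[k]), 1) for k in range(K))
        mu = (eta[t][:, None] * word_means).sum(axis=0) / alpha   # weighted mean
        cov = np.zeros((2, 2))
        for k, pts in enumerate(points_per_word):
            if len(pts) == 0:
                continue
            d = pts - mu
            cov += (eta[t, k] / len(pts)) * (d.T @ d)
        cov *= alpha / max(alpha ** 2 - beta, eps)     # unbiased weighted covariance
        triplets.append((topic_histogram[t], mu, cov))
    return triplets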
To use the spatiograms for classification and regression, we use a similarity measure
proposed by Ó Conaire et al. [97]. This uses the Bhattacharyya coefficient to compare
spatiogram bins, and a measure of the overlap of Gaussian distributions to compare
their spatial distribution.
Figure 3.8: An illustration of what is represented by topic spatiograms. The
points show the contribution of individual points to each topic (a,c), and the
spatiograms represent the distribution of these contributions (b,d), displayed
here as an ellipse showing the covariance, centred on the mean, for individual
topics.
For two spatiograms s_A and s_B of dimension D, this similarity
function is defined as
\rho(s_A, s_B) = \sum_{d=1}^{D} \sqrt{h_d^A h_d^B} \; 8\pi \, |\Sigma_d^A \Sigma_d^B|^{\frac{1}{4}} \, \mathcal{N}\!\left(\mu_d^A;\, \mu_d^B,\, 2(\Sigma_d^A + \Sigma_d^B)\right)    (3.7)
    where N ( x ; µ , Σ ) is a Gaussian with mean µ and covariance matrix Σ evaluated at x .
    Following [97], we use a diagonal version of the covariance matrices since it simplifies the
    calculation.
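A sketch of (3.7) with the diagonal covariance simplification is given below (numpy; the small regularising constant is our addition, and the representation of a spatiogram as a list of triplets is assumed from the earlier sketch).

import numpy as np

def spatiogram_similarity(spat_a, spat_b, eps=1e-12):
    # Each spatiogram is a list of (h, mu, Sigma) triplets of equal length D.
    rho = 0.0
    for (h_a, mu_a, cov_a), (h_b, mu_b, cov_b) in zip(spat_a, spat_b):
        S_a = np.diag(np.diag(cov_a)) + eps * np.eye(2)   # diagonal approximation
        S_b = np.diag(np.diag(cov_b)) + eps * np.eye(2)
        S = 2.0 * (S_a + S_b)
        d = mu_a - mu_b
        # Gaussian N(mu_a; mu_b, 2(S_a + S_b)) evaluated in 2D.
        gauss = np.exp(-0.5 * d @ np.linalg.inv(S) @ d) / (2.0 * np.pi * np.sqrt(np.linalg.det(S)))
        rho += np.sqrt(h_a * h_b) * 8.0 * np.pi * np.linalg.det(S_a @ S_b) ** 0.25 * gauss
    return rho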
    3.4 Classification
    After compactly representing the image regions with spatiograms, they are used to train
    a classifier. We use the relevance vector machine (RVM) [125], which is a sparse kernel
    method, conceptually similar to the more well-known support vector machine (SVM). We
    choose this classifier because of its sparse use of training data: once it has been trained,
    only a small subset of the data need to be retained (fewer even than the SVM), making
    classification very fast even for very large training sets. The RVM gives probabilistic
    outputs, representing the posterior probability of belonging to either class.
    3.4.1 Relevance Vector Machines
    The basic model of the RVM is very similar to standard linear regression, in which the
    output label y for some input vector x is modelled as a weighted linear combination of
M fixed (potentially non-linear) basis functions [11]:

y(x) = \sum_{i=1}^{M} \omega_i \phi_i(x) = \omega^T \phi(x)    (3.8)

where \omega = (\omega_1, \ldots, \omega_M)^T is the vector of weights, for some number M of basis functions
\phi(x) = (\phi_1(x), \ldots, \phi_M(x))^T. If we choose the basis functions such that they are given by
kernel functions, where there is one kernel function for each of the N training examples
(a similar structure to the SVM), (3.8) can be re-written:

y(x) = \sum_{n=1}^{N} \omega_n k(x, x_n) + b    (3.9)
    where x n are the training data, for n = 1 ,...,N , and b is a bias parameter. The kernel
    functions take two vectors as input and return some real number. This can be considered
    as a dot product in some higher dimensional space — indeed, under certain conditions, a
    kernel can be guaranteed to be equivalent to a dot product after some transformation of
    the data. This mapping to a higher dimensional space is what gives kernel methods their
    power, since the data may be more easily separable in another realm. Since the mapping
    need never be calculated explicitly (it is only ever used within the kernel function), this
    means the benefits of increased separability can be attained without the computational
    effort of working in higher dimensions (this is known as the ‘kernel trick’ [11]).
    Training the RVM is done via the kernel matrix K , where each element is the kernel
    function between two vectors in the training set, K ij = k ( x i , x j ). The matrix is sym-
    metric, which saves some time during computation, but calculating it is still quadratic
    in the number of data. This becomes a problem when using such kernel methods on
    very large datasets (the RVM training procedure is cubic, due to matrix inversions [11],
    although iteratively adding relevance vectors can speed it up somewhat [126]).
    While equation 3.9 is similar to standard linear regression, the important difference here
    is that rather than having a single shared hyperparameter over all the weights, the RVM
    introduces a separate Gaussian prior for each ω i controlled by a hyperparameter α i . Dur-
    ing training, many of the α i tend towards infinity, meaning the posterior probability of
    the associated weight is sharply peaked at zero — thus the corresponding basis functions
    are effectively removed from the model, leading to a significantly sparsified form.
    The remaining set of training examples – the ‘relevance vectors’ – are sufficient for
    prediction for new test data. These are analogous to the support vectors in the SVM,
    though generally far fewer in number; indeed, in our experiments we observed an over
    95% reduction in training data used, for both classification and regression. Only these
    data (and their associated weights) need to be stored in order to use the classifier to
predict the target value of a new test datum x′. Classification is done through a new
kernel matrix K′ (here being a single column), whose elements are calculated as K′_r =
k(x_r, x′) for the R relevance vectors indexed r = 1, …, R. The prediction is then calculated
simply by matrix multiplication:

y(x′) = \omega^T K′    (3.10)

where \omega is again the vector of weights. For regression problems, the target value is
simply y(x′); whereas for classification, this is transformed by a logistic sigmoid,

p(x′) = \sigma(y(x′)) = \sigma(\omega^T K′), \qquad \sigma(x) = \frac{1}{1 + \exp(-x)}    (3.11)

This maps the outputs to p(x′) ∈ (0, 1), which is interpreted as the probability of the
    test datum belonging to the positive class. Thresholding this at 0.5 (equal probability
    of either class) gives a binary classification. In our work we used the fast [126] ‘Sparse
    Bayes’ implementation of the RVM made available by the authors 3 .
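Prediction with a trained RVM reduces to (3.10) and (3.11); the following sketch (numpy, with the relevance vectors, learned weights and kernel function assumed already available, and the bias omitted as in (3.10)) illustrates this, and is not taken from the Sparse Bayes code.

import numpy as np

def rvm_predict(x_new, relevance_vectors, weights, kernel, classify=True):
    # Build the single-column kernel matrix K' and apply the learned weights.
    K_new = np.array([kernel(x_r, x_new) for x_r in relevance_vectors])
    y = weights @ K_new                    # y(x') = w^T K'
    if not classify:
        return y                           # regression: the target value directly
    p = 1.0 / (1.0 + np.exp(-y))           # logistic sigmoid of (3.11)
    return p, p > 0.5                      # class probability and binary decision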
    The above describes binary classification or single-variable regression in the standard
    RVM. In order to regress multi-dimensional data – in our case the three components of the
    normal vectors – we use the multi-variable RVM (MVRVM) developed by Thayananthan
    et al. [123]; the training procedure is rather different from the above, but we omit details
    here (see [11]). This regresses over each of the D output dimensions simultaneously,
    using the same set of relevance vectors for each. Regression is achieved as in (3.10),
    3 www.vectoranomaly.com/downloads/downloads.htm
but now the weights form an R × D matrix, with a column for each dimension of the
target variable, and the prediction y(x′) is the product of the transpose of this weight
matrix with K′. For this we adapted code available online⁴. For
both classification and regression, we can predict values for many data simultaneously if
necessary, simply by adding more columns to K′.
    All that remains is to specify the form of kernel function we use. When using word
    or topic histograms, we experimented with a variety of standard histogram similarity
    measures, such as the Bhattacharyya coefficient and cosine similarity; for spatiograms,
    we use various functions of equation 3.7. More details of the different functions can be
    found in our experiments in Section 4.1.4.
    3.5 Summary
    This concludes our description of the plane recognition algorithm, for distinguishing
    planes from non-planes in given image regions and estimating their orientation. To
    summarise: we represent data for regions by detecting salient points, which are described
    using gradient and colour features; these are gathered into a bag of words, reduced with
    topic analysis, and enhanced with spatial information using spatiograms. These data
    are used to train an RVM classifier and regressor. For a test region, once the spatiogram
    has been calculated, the RVMs are used to classify it and, if deemed planar, to regress
    its orientation.
    However, it is important to realise that, as successful as this algorithm may be, it is
    only able to estimate the planarity and orientation for a given region of the image. It
    is not able to detect planes in a whole image, since there is no way of knowing where
    the boundaries between regions are, and this is something we shall return to in Chapter
    5. Before that, in the next chapter we present a thorough evaluation of this algorithm,
    both in cross-validation, to investigate the effects of the details discussed above; and
    an evaluation on an independent test set, showing that it can generalise well to novel
    environments.
    4 mi.eng.cam.ac.uk/˜at315/MVRVM.htm
    CHAPTER 4
    Plane Recognition Experiments
    In this chapter, we present experimental results for our plane recognition algorithm. We
    show how the techniques described in the previous chapter affect recognition, for example
    how the inclusion of spatial distribution information makes classification and regression
    much more accurate, and the benefits of using larger amounts of training data. We
    describe experiments on an independent set of data, demonstrating that our algorithm
    works not only on our initial training and validation set, but also in a wider context;
    and compare to an alternative classifier.
    Our experiments on the plane recognition algorithm were conducted as follows: we began
    with a set of training data (individual, manually segmented regions, not whole images),
    which we represented using the steps discussed in Section 3.3. The resulting descriptions
    of the training regions were used to train the RVM classifiers. Only at this point were
    the test data introduced, so the creation of the latent space and the classifiers was
    independent of the test data.
    4.1 Investigation of Parameters and Settings
    For the first set of experiments, we used only our training set, collected from urban
    locations at the University of Bristol and the surrounding area. This consisted of 556
    image regions, each of which was labelled as described in Section 3.2; these were reflected
    to create our basic dataset of 1112 regions. We also warped these regions, as described
    in Section 3.2.2, to obtain a total of 7752 regions. The effect of using these extra regions
    is discussed later.
    All of these experiments used five-fold cross-validation on this training set — the train
    and test folds were kept independent, the only association between them being that the
    data came from the same physical locations, and that the features used to build the bag
    of words codebooks could have overlapped both train and test sets. We also ensured
    that warped versions of a region never appeared in the training set when the original
    region was in the test set (since this would be a potentially unfairly easy test). All
    of the results quoted below came from ten independent runs of cross-validation, from
    which we calculated the mean and standard deviation, which are used to draw error bars.
    The error bars on the graphs show one standard deviation either side of the mean, over
    all the runs of cross-validation (i.e. it is the standard deviation of the means from the
    multiple cross-validation runs, and not derived from the standard deviations of individual
    cross-validations).
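This protocol can be summarised by the following sketch (Python, with scikit-learn's StratifiedKFold assumed for the fold splitting; the classifier wrapper fit_predict_score is a placeholder, not part of the original implementation).

import numpy as np
from sklearn.model_selection import StratifiedKFold

def repeated_cv(X, y, fit_predict_score, n_runs=10, n_folds=5, seed=0):
    # Mean and standard deviation of the per-run mean scores, as used for error bars.
    run_means = []
    for run in range(n_runs):
        folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed + run)
        scores = [fit_predict_score(X[train], y[train], X[test], y[test])
                  for train, test in folds.split(X, y)]
        run_means.append(np.mean(scores))
    return np.mean(run_means), np.std(run_means)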
    4.1.1 Vocabulary
    The first experiment investigated the performance for different vocabulary sizes, for
    different basic representations of the regions. The vocabulary size is the number of
    clusters K used in the K-means algorithm, and by testing using different values for K
    we could directly see what effect this had on plane recognition (instead of choosing the
    best K according to the distortion measure [11], for example).
    This experiment also compared four different representations of the data. First, we used
    weighted word histograms as described in Section 3.3.3 — this representation does not
    use the latent topics, nor use spatial information. We compared this to the basic topic
    histogram representation, where the word vectors have been projected into the topic
    space to reduce their dimensionality (refer to Section 3.3.4); in these experiments we
    always used a latent space of 20 topics, but found performance to be robust to different
    values. Next we created spatiograms for both word and topic representations, using
    equations 3.5 and 3.6.
    The experiment was conducted as follows: for each different vocabulary size, we created
    a new codebook by clustering gradient features harvested from a set of around 100
    exemplary images. The same set of detected salient points and feature descriptors was
    used each time (since these are not affected by the vocabulary size); and for each repeat
    run of cross-validation the same vocabulary was used, to avoid the considerable time it
    takes to re-run the clustering. This was done for each of the four region descriptions,
    then the process was repeated for each of the vocabulary sizes. We used only gradient
    features for this to simplify the experiment — so the discussion of combining vocabularies
    (Section 3.3.4.5) is not relevant at this point. In this experiment (and those that follow),
    we used only the original marked-up regions, and their reflections – totalling 1112 regions
    – since warping was not yet confirmed to be useful, and to keep the experiment reasonably
    fast. The kernels used for the RVMs were polynomial sums (see Section 4.1.4) of the
    underlying similarity measure (Bhattacharyya for histograms and spatiogram similarity
    [97] for spatiograms).
Figure 4.1: Performance of plane classification (a) and orientation estimation (b) as
the size of the vocabulary was changed, comparing word histograms, topic histograms,
word spatiograms and topic spatiograms.
    The results are shown in Figure 4.1, comparing performance for plane classification (a)
    and orientation regression (b). It is clear that spatiograms outperformed histograms,
    over all vocabulary sizes, for both error measures. Furthermore, using topics also tended
    to increase performance compared to using words directly, especially as the vocabulary
    size increased. As one would expect, there was little benefit in using topics when the
    number of topics was approximately the same as the number of words, and when us-
    ing small vocabularies, performance of word spatiograms was as good as using topic
    spatiograms. However, since this relies on using a small vocabulary it would overly con-
    strain the method (and generally larger vocabularies, up to an extent, give improved
    results [68]); and best performance was only seen when using around 400 words, the
    difference being much more pronounced for orientation estimation. This experiment
    confirms the hypothesis advanced in Section 3.3 that using topic analysis plus spatial
    representation is the best way, of those studied, for representing region information.
    4.1.2 Saliency and Scale
    In Section 3.3.1 we discussed the choice of saliency detector, and our decision to use a
    multi-scale detector to make the best use of the information at multiple image scales.
    We tested this with an experiment, comparing the performance of the plane recognition
    algorithm when detecting points using either FAST [110] or the difference of Gaussians
    (DoG) detector [81]. FAST gives points’ locations only, and so we tried a selection of
    possible scales to create the descriptors. We did the same using DoG (i.e. ignoring the
    scale value and using fixed patch sizes), to directly compare the type of saliency used.
    Finally, we used the DoG scale information to choose the patch size, in order to create
    descriptors to cover the whole area deemed to be salient.
Figure 4.2: The effect of patch size (patch width in pixels) on orientation accuracy,
for different means of saliency detection (FAST, DoG position only, and DoG with scale).
    Our results are shown in Figure 4.2, for evaluation using the mean angular error of
    orientation regression (we found that different saliency detectors made no significant
    difference to classification accuracy). It can be seen that in general, DoG with fixed patch
    sizes out-performed FAST, although at some scales the difference was not significant.
    This suggests that the blob-like features detected by DoG might be more appropriate
    for our plane recognition task than the corner-like features found with FAST. The green
    line shows the performance when using the scale chosen by DoG (thus the patch size
    axis has no meaning, hence the straight line), which in most cases was clearly superior
    to both FAST and DoG with fixed patch sizes.
    We note that at a size of 15 pixels, a fixed patch size appeared to out-perform scale
    selection, and it may be worth investigating further since this would save computational
    effort. However, we do not feel this one result is enough yet to change our method, as
    it may be an artefact of these data. We conclude that this experiment broadly supports
    our reasons for using scale selection, rather than relying on any particular patch size;
    especially given that we would not know the best scale to use with a particular dataset.
    4.1.3 Feature Representation
    The next experiment compared performance when using different underlying feature
    representations, namely the gradient and colour features as described in Section 3.3.2.
    In these experiments, we used a separate vocabulary for gradient and colour descriptors,
    using K-means as before (having fixed the number of words, on the basis of the earlier
    experiment, at 400 words for the gradient vocabulary, and choosing 300 for colour).
    For either feature descriptor type used in isolation, the testing method was effectively
    the same; when using both together, we combined the two feature representations by
    concatenating their word histograms before running ONMF (words were re-numbered as
    appropriate, so that word i in the colour space became word K g + i in the concatenated
    space, where K g is the number of words in the gradient vocabulary). Since this used
    around twice as much feature information, we doubled the number of topics to 40 for the
    concatenated vocabularies, which we found to improve performance somewhat (simply
    doubling the number of topics for either feature type in isolation showed little difference,
    so any improvement will be due to the extra features).
These experiments were again conducted using ten runs of five-fold cross-validation, using
    topic spatiograms, after showing them to be superior in the experiment above. We used
    the non-warped set of 1112 regions, with the polynomial sum kernel (see below), and the
    vocabularies were fixed throughout.
    Table 4.1 shows the results, which rather interestingly indicate that using colour on its
    own gave superior performance to gradient information. This is somewhat surprising
    given the importance of lines and textural patterns in identifying planar structure [44],
    although as Hoiem et al. [66] discovered, colour is important for geometric classification.
    It is also important to remember that we use spatiograms to represent the distribution
    of colour words, not simply using the mean or histogram of regions’ colours directly.
    Even so, our hypothesis that using both types of feature together is superior is verified,
    since the concatenated gradient-and-colour descriptors performed better than either in
    isolation, suggesting that the two are representing complementary information.
                               Gradient      Colour        Gradient & Colour
Classification Accuracy (%)    86.5 (1.8)    92.5 (0.5)    93.9 (2.8)
Orientation Error (deg)        13.1 (0.2)    28.4 (0.3)    17.9 (0.7)
Table 4.1: Comparison of average classification accuracy and orientation error
when using gradient and colour features. Standard deviations are shown in
parentheses.
    On the other hand, as we expected, colour descriptors fared much worse when estimating
    orientation, and combining the two feature types offered no improvement. This stands
    to reason, since the colour of a region – beyond some weak shading information, or the
identity of the surface – should not give any indication of its 3D orientation, whereas
    texture will [107]. Adding colour information would only serve to confuse matters, and
    so the best approach is simply to use only gradient information for orientation regression.
    To summarise, the image representations we use, having verified this by experiment,
    are computed as follows: gradient and colour features are created for all salient points
    in all regions, which are used to create term-document matrices for the two vocabu-
    laries. These are used to create a combined 40D topic space, encapsulating gradient
    and colour information, and this forms the classification topic space. Then, a separate
    term-document matrix is built using only planar regions, using only gradient features, to
    create a second 20D gradient only topic space, to be used for regression. This means that
    each region will have two spatiograms, one of 40 dimensions and one of 20 dimensions,
    for classification and regression respectively.
Data         Name                        Function k(x_i, x_j) =
Histogram    Linear                      x_i^T x_j
Histogram    Euclidean                   ||x_i − x_j||
Histogram    Bhattacharyya               Σ_d √(x_id x_jd)
Histogram    Bhattacharyya Polynomial    Σ_{q=1}^{Q} ( Σ_d √(x_id x_jd) )^q
Spatiogram   Spatiogram                  ρ(s_i, s_j)  (see (3.7))
Spatiogram   Gaussian                    exp( −ρ(s_i, s_j)^p / 2σ² )
Spatiogram   Spatiogram Polynomial       Σ_{q=1}^{Q} ρ(s_i, s_j)^q
Spatiogram   Logistic                    1 / (1 + exp(−ρ(s_i, s_j)))
Spatiogram   Weighted Polynomial         Σ_{q=1}^{Q} w_q ρ(s_i, s_j)^q
Table 4.2: Description of the kernel functions used by the RVM, for histograms
and spatiograms.
    4.1.4 Kernels
    Next, we compared the performance of various RVM kernels, for both classification
    and orientation estimation (this test continued to use only gradient features for ease of
    interpreting the results). In this experiment, we compared kernels on both histograms
    and spatiograms.
    For histograms, we used various standard comparison functions, including Euclidean,
    cosine, and Bhattacharyya distances, as well as simply the dot product (linear kernel).
    For spatiograms, since they do not lie in a vector space, we could only use the spatiogram
    similarity measure, denoted ρ , from equation 3.7 [97], and functions of it. Variations we
    used include the original measure, the version with diagonalised covariance, a Gaussian
    radial basis function, and polynomial functions of the spatiogram similarity. The latter
    were chosen in order to increase the complexity of the higher dimensionality space and
    strive for better separability. These functions are described in Table 4.2.
    Figure 4.3 shows the results. It is clear that as above the spatiograms out-performed his-
    tograms in all cases, though the difference is less pronounced for classification compared
    to regression. Of the spatiogram kernels, the polynomial function showed superior per-
    formance (altering the weights for each degree seemed to make no difference). Therefore
    we chose the unweighted polynomial sum kernel for subsequent testing (and we set the
    maximum power to Q = 4).
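The chosen kernel is then simply the following (assuming the spatiogram_similarity sketch from Section 3.3.5; a sketch, not the exact code):

def polynomial_sum_kernel(spat_i, spat_j, Q=4):
    # Unweighted polynomial sum of the spatiogram similarity: sum over q of rho^q.
    rho = spatiogram_similarity(spat_i, spat_j)
    return sum(rho ** q for q in range(1, Q + 1))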
    It is also interesting to compare this polynomial sum kernel to the equivalent polynomial
    sum of the Bhattacharyya coefficient on histograms, which leads to substantially lower
Figure 4.3: Comparison of different kernel functions, for histogram (first five)
and spatiogram (the others) region descriptions. Spatiograms always out-perform
histograms, with the polynomial sum kernels proving to be the best.
    performance. This confirms that the superior performance is due to using spatiograms,
    rather than the polynomial function or Bhattacharyya comparison of histogram bins;
    while the polynomial function leads to increased performance compared to the regular
    spatiogram similarity kernel.
    4.1.5 Synthetic Data
    In Section 3.2.2 we described how, from the initial marked-up training examples, we can
    synthetically generate many more by reflecting and warping these, to approximate views
    from different locations. We conducted an experiment to verify that this is actually ben-
    eficial to the recognition algorithm (note that all the above experiments were using the
    marked-up and reflected regions, but not the warped). This was done by again running
    cross-validation, where the test set was kept fixed (comprising marked-up and reflected
    regions as before), but for the training set we added progressively more synthetically
    generated regions. The experiment started with a minimal set consisting of only the
    marked-up regions, then added the reflected regions, followed by increasing quantities
    of warped regions, and finally included warped and reflected regions. These were added
    such that the number of data was increased in almost equal amounts each time.
Figure 4.4: The effect of adding progressively more synthetically generated
training data (plotted against training set size) is to generally increase performance,
although the gains diminish as more is added. The best improvement is seen when
adding reflected regions (second data point).
    The results of this experiment, on both classification and orientation performance, are
    shown in Figure 4.4. As expected, adding more data was beneficial for both tasks.
    However, while performance tended to increase as more training data were used, the
    gains diminished until adding the final set of data made little difference. This could be
    because we were only adding permutations of the already existing data, from which no
    new information could be derived. It is also apparent that the biggest single increase was
    achieved after adding the first set of synthetic data (second data point) which consisted
    of the reflected regions. This seems reasonable, since of all the warped data this was the
    most realistic (minimal geometric distortion) and similar to the test regions.
    It seems that synthetically warping data was generally beneficial, and increased perfor-
    mance on the validation set — but as most of the benefit came from reflected images this
    calls into question how useful or relevant the synthetic warping actually was (especially
    given there were a very large number of warped regions). This experiment confirmed
    that using a larger amount of training data was an advantage, and that reflecting our
    initial set was very helpful, but it may be that including more marked-up data, rather
    than generating new synthetic regions, would be a better approach.
    4.1.6 Spatiogram Analysis
It may be thought that part of the success of spatiograms comes from the way the test
regions have been manually segmented, with their shape often being indicative of their
actual orientation – for example the height of an upright planar region in the image
will often diminish as it recedes into the distance. Even without considering the actual
features in a region, the shape alone can give an indication of its likely orientation, and
such cues are sure to exist when boundaries are outlined by a human. We illustrate
this effect in Figure 4.5.
Figure 4.5: Region shape, even without any visual features, is suggestive of the
orientation of manually segmented regions.
This is significant, as unlike histograms, spatiograms use the
    location of points, which implicitly encodes the region shape. This is a problem, since
    in ‘real’ images, such boundary information would not be available; worse, it could bias
    the classifier into an erroneous orientation simply due to the shape of the region.
    To investigate how much of an effect this has on our results, an experiment was carried
    out where all regions were reduced to being circular in shape (by finding the largest
    circle which can fit inside the region). We would expect this to reduce performance
    generally, since there was less information available to the classifier; however, we found
    that spatiograms still significantly outperformed histograms for orientation regression,
    as Table 4.3 shows. While circular regions gave worse performance, this was by a simi-
    lar amount for both representations, and adding spatial information to circular regions
    continued to boost accuracy. A similar pattern was seen for classification too, where the
    region shape should not be such a significant cue. To summarise it seems that the region
    shapes are not a particularly important consideration, confirming our earlier conclusions
    that spatiograms contribute significantly to the performance of the plane recognition
    algorithm.
                     Histograms    Spatiograms    Cut Hist.     Cut Spat.
Class. Acc. (%)      77.3 (1.0)    87.9 (1.0)     75.3 (1.0)    84.4 (0.9)
Orient. Err. (deg)   24.9 (0.2)    13.3 (0.1)     26.6 (0.2)    17.0 (0.3)
Table 4.3: Comparison of performance for histograms and spatiograms on regions
cut to be uniformly circular, compared to the original shaped regions. Spatiograms
are still beneficial, showing this is not only due to the regions' shapes (standard
deviation in parentheses).
    4.2 Overall Evaluation
    Finally, using the experiments above, we decided upon the following as the best settings
    to use for testing our algorithm:
  • Difference of Gaussians saliency detection, to detect location and scale of salient points.
  • Gradient and colour features combined for classification, but gradient only for regression.
  • A vocabulary of size 400 and 300 for gradient and colour vocabularies respectively.
  • Latent topic analysis, to reduce the dimensionality of words to 20-40 topics.
  • Spatiograms as opposed to histograms, to encode spatial distribution information.
  • Polynomial sum kernel of the spatiogram similarity measure within the RVM.
  • Augmented training set with reflected and warped regions.
Figure 4.6: Distribution of orientation errors (in degrees) for cross-validation, showing
that the majority of errors were below 15°.
    Using the above settings, we ran a final set of cross-validation runs on the full dataset,
    with reflected and warped regions, comprising 7752 regions. We used the full set for
    training but only the marked-up and reflected regions for testing. We observed a mean
classification accuracy of 95% (standard deviation σ = 0.49%) and a mean orientation error of 12.3° (σ = 0.16°), over the ten runs. To illustrate how the angular errors were distributed, we show a histogram of orientation errors in Figure 4.6. Although some are large, a significant number are under 15° (72%) and under 20° (84%). This is a very
encouraging result, as even with a mean as low as 12.3°, the errors are not normally
    distributed, with an obvious tendency towards lower orientation errors.
    Figure 4.7: Distribution of errors for testing on independent data.
    4.3 Independent Results and Examples
    While the above experiments were useful, they were not a good test of the method’s
    ability to generalise, since the training and test images, though never actually coinciding,
    were taken from the same physical locations, and so there would inevitably be a degree
    of overlap between them.
    To test the recognition algorithm properly, we used a second dataset of images, gathered
    from an independent urban location, ensuring the training and test sets were entirely
    separate. This set consisted of 690 image regions, of an equal number of planes and
    non-planes (we did not use any reflection or warping on this set), which were marked
    up with ground truth class and orientation as before. Again, we emphasise these were
    manually chosen regions of interest, not whole images. Ideally, the intention was to keep
    this dataset entirely separate from the process of training and tuning the algorithm, and
    to use it exactly once at the end, for the final validation. Due to the time-consuming
    nature of acquiring new training data, this was not quite the case, and some of the data
    would have been seen more than once, as follows. A subset of these data (538 regions)
    were used to test an earlier version of the algorithm as described in [58] (without colour
    features and using a different classifier). The dataset was expanded to ensure equal
    balance between the classes, but the fact remains that most of the data have been used
    once before. Furthermore, we needed to use this dataset once more to verify a claim
    made about the vocabulary size: in section 4.1.1 we justified using a larger number of
[Figure 4.8 panels and their orientation errors: (a) 0.9°, (b) 1.9°, (c) 2.0°, (d) 3.1°, (e) 3.1°, (f) 3.4°*, (g) 10.4°, (h) 10.6°, (i) 10.7°, (j) 11.0°, (k) 11.7°, (l) 11.7°, (m) 27.9°, (n) 30.2°, (o) 30.7°, (p) 31.8°, (q) 33.2°, (r) 45.4°]
    Figure 4.8: Example results, selected algorithmically as described in the text,
    showing typical performance of plane recognition on the independent dataset.
    The first six (a-f) are selected from the best of the results, the next six (g-l)
    are from the middle, and the final six (m-r) are from the worst, according to the
    orientation error. *Note that (f) was picked manually, because it is an important
    illustrative example.
    words partly by the fact that this should allow the algorithm to generalise better to
    new data. A test (not described here) was done to see if this was true for our test set
    (the result showed that using a small vocabulary without topic discovery was indeed
    detrimental to performance, more so than implied by the proximity of the two curves
    toward the left of Figure 4.1a). Other than these lapses, the independent dataset was
    unseen with respect to the process of developing and honing our method.
    The results we obtained for plane recognition were a mean classification accuracy of
91.6% and a mean orientation error of 14.5°. We also show in Figure 4.7 a plot of the
    orientation errors; this is to be compared to the results in Figure 4.6, and shows that
    here too the spread of orientation errors is good, with the majority of regions being given
    an accurate normal estimate. This suggests that the algorithm is capable of generalising
    well to new environments, and supports our principal hypothesis that by learning from a
    set of training images, it is possible to learn how appearance relates to 3D structure; and
    that this can be applied to new images with good accuracy. We have not compared this
    to other methods, due to the lack of appropriately similar work with which to compare
    (though a comparative evaluation of the full plane detector is presented in Chapter 6)
    but we believe that these results represent a good level of accuracy, given the difficulty of
    the task, and the fact that no geometric information about the orientation is available.
    Figures 4.8 to 4.10 show typical example results of the algorithm, on this independent
test set. To avoid bias in choosing the results to show, the images were chosen as
    follows. The correct classifications of planar regions (true positives) were sorted by their
    orientation error. We took the best ten percent, the ten percent surrounding the median
    value, and the worst ten percent, then chose randomly from those (six from each set)
    – these are shown in Figure 4.8. The only exception to this is Figure 4.8f which we
    picked manually from the best ten percent, because it is a useful illustrative example;
    all the others were chosen algorithmically. This method was chosen to illustrate good,
    typical, and failure cases of the algorithm, but does not reflect the actual distribution of
    errors (c.f. Figure 4.7) which of course has more good than bad results. We then chose
    randomly from the set of true negatives (i.e. correct identification of nonplanes), false
    Figure 4.9: Correct classification of non-planes, in various situations (selected
    randomly from the results).
    negatives, and false positives, as shown in the subsequent images, again in order to avoid
    biasing the results we display.
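The selection procedure just described can be summarised by the following sketch (the field name orientation_error is hypothetical): the true positives are ranked by orientation error and six examples are drawn at random from the best, middle, and worst ten percent.

import random

def select_examples(true_positives, n_per_band=6, seed=0):
    """Pick unbiased examples from the best/middle/worst 10% by orientation error."""
    rng = random.Random(seed)
    ranked = sorted(true_positives, key=lambda r: r["orientation_error"])
    k = max(1, len(ranked) // 10)          # size of a 10% band
    mid = len(ranked) // 2
    bands = {
        "best":   ranked[:k],
        "middle": ranked[mid - k // 2 : mid - k // 2 + k],
        "worst":  ranked[-k:],
    }
    return {name: rng.sample(band, min(n_per_band, len(band)))
            for name, band in bands.items()}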
    These results show the algorithm is able to work in a variety of different environments.
    This includes those with typical Manhattan-like structure, for example 4.8a — but cru-
    cially, also those with more irregular textures like Figure 4.8f. While the former may
    well be assigned good orientation estimates by typical vanishing-point algorithms [73],
    such techniques will not cope well with the more complicated images.
    We also show examples of successful non-plane detection in Figure 4.9. These are mostly
composed of foliage and vehicles, but we also observed the algorithm correctly classifying
    people and water features.
    Figure 4.10: Some cases where the algorithm fails, showing false negatives (a-c)
    and false positives (d-f) (these were chosen randomly from the results).
    It is also interesting to consider cases where the algorithm performs poorly, examples of
    which are shown in the lower third of Figure 4.8 and in Figure 4.10. The former shows
    regions correctly classified as planes, but with a large error in the orientation estimate
    (many of these are of ground planes dissimilar to the data used for training), while the
rectangular window on a plain wall in Figure 4.8p may not have sufficiently informative
    visual information. The first row of Figure 4.10 shows missed planes; we speculate that
    Figure 4.10c is misclassified due to the overlap of foliage into the region. The second
    row shows false detections of planes, where Figure 4.10d may be confused by the strong
    vertical lines, and Figure 4.10e has too little visual information to be much use. These
    examples are interesting in that they hint at some shortcomings of the algorithm; but we
    emphasise that such errors are not common, with the majority of regions being classified
    correctly.
    4.4 Comparison to Nearest Neighbour Classification
Relevance vector machines for classification and regression perform well; however, it is
    not straightforward to interpret the reasons why the RVMs behave as they do. That is,
    we cannot easily learn from them which aspects of the data they are exploiting, or indeed
    if they are functioning as we believe they are. We investigated this by using a K-nearest
    neighbour (KNN) classifier instead. This assigns a class using the modal class of the K
    nearest neighbours, and the orientation as the mean of its neighbours’ orientations. By
    looking at the regions the KNN deemed similar to each other, we can see why the given
    classifications and orientations were assigned. Ultimately this should give some insight
    into the means by which the RVM assigns labels.
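A minimal sketch of this K-nearest neighbour rule is given below, assuming a precomputed vector of similarities (for example the spatiogram similarity measure) between the test region and every training region; the variable names are illustrative only.

import numpy as np
from collections import Counter

def knn_predict(similarities, train_classes, train_normals, k=5):
    """Classify by the modal class of the k most similar training regions, and
    estimate orientation as the mean of the planar neighbours' normals."""
    neighbours = np.argsort(similarities)[-k:]          # indices of most similar regions
    classes = [train_classes[i] for i in neighbours]
    label = Counter(classes).most_common(1)[0][0]       # modal class of the neighbours
    normal = None
    if label == "plane":
        n = np.mean([train_normals[i] for i in neighbours
                     if train_classes[i] == "plane"], axis=0)
        normal = n / np.linalg.norm(n)                  # re-normalise the mean normal
    return label, normal, neighbours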
    The first step was to verify that the KNN and RVM gave similar results, otherwise it
    would be perverse to claim one gives insight into the other. This was done by running a
    cross-validation experiment with the KNN on varying amounts of training data. We used
    the recognition algorithm in the same way as above, except for the final classification
    step.
    Figure 4.11: Comparison of RVM and KNN, showing improved accuracy and
    better scalability to larger training sets (at the expense of a slow training phase
    for the RVM).
    The results, from ten runs of cross-validation, were a mean classification accuracy of
95.6% (σ = 0.49%) and an orientation error of 13.9° (σ = 0.16°), neither of which
    were substantially different from results with the RVM. As Figures 4.11a and 4.11b
    show, performance for both algorithms improved with more training data, and was fairly
    similar (although the RVM is generally better). Figure 4.11c compares classification
    time, showing that the KNN was much slower and scaled poorly to larger training sets,
    justifying our choice of the RVM. The main drawback of the RVM, on the other hand,
    is its training time (the KNN requires no training). Figure 4.11d compares the setup
    time (creation of training descriptors and training the classifiers if necessary) for both
    algorithms, where the time taken for the RVM increased dramatically with training set
    size.
    We also tested the KNN version of the algorithm on our independent test set, and again
    found similar performance: classification accuracy was 87.8%, while orientation error
increased to 18.3°.
    4.4.1 Examples
    In this section we show example results from the independent test set, with the ground
    truth and classification overlaid as before, accompanied by the nearest neighbours for
    each (using K = 5 neighbours). Obviously, this set is no longer completely independent
    or unseen, since it is the same as used above to test the method using the RVM; but no
    changes were made based on those results before using the KNN.
    As above, the examples we show here were not selected manually, but chosen randomly in
    order to give a fair representation of the algorithm. This was done as above, i.e. taking
    the best, middle, and worst ten percent of the results for true positive cases (Figure
    4.12), and selecting randomly from each set. Examples of true negatives (Figure 4.13),
    and false positives and negatives (Figure 4.14), are selected randomly.
    These images illustrate that classification was achieved by finding other image regions
    with similar structure. It is interesting to note that the neighbours found were not always
    perceptually similar, for example Figures 4.12b and 4.12d. This is important, since while
    an algorithm which matched only to visually similar regions would work to an extent, it
    would fail when presented with different environments. Figure 4.13c, for example, shows
    how non-planes can be correctly matched to similar non-planar regions in the training
    Figure 4.12: Examples of plane classification and orientation when using a
    K-nearest neighbour classifier, showing the input image overlaid with the classi-
    fication and orientation (left), and the five nearest neighbours from the training
    set. These show triplets of images selected randomly from the best (a-c), middle
    (d-f), and worst (g-i) examples, for correct plane classification.
    Figure 4.13: Examples of correct identification of non-planar regions, using a
    K-nearest neighbour classifier; these examples were chosen randomly from the
    results.
    set, but Figure 4.13b is also classified correctly, despite being visually different from its
    neighbours.
    It is interesting to note the role that the reflected and warped data play in classification
    – in many situations several versions of the same original image are found as neighbours
    (for example Figure 4.12e in particular). This stands to reason as they will be quite
    close in feature space. On the other hand, the tendency to match to multiple versions of
    the same image with different orientations can cause large errors, as in Figure 4.12g.
    It is also instructive to look at examples where the KNN classifier performed poorly, since
    now we can attempt to discover why. Figure 4.12i, for example, has a large orientation
    error. By looking at the matched images, we can see that this is because it has matched
    to vertical walls which share a similar pattern of lines tending toward a frontal vanishing
    point, but whose orientation is not actually very similar. Misclassification of a wall occurs
    in Figure 4.14a where a roughly textured wall was predominantly matched to foliage,
    resulting in an incorrect non-planar classification (interestingly, there is a wall behind
    the trees in the neighbouring training regions). Figure 4.14b was also wrongly classified,
    perhaps due to the similarity between the ventilation grille and a fence. These two
    examples are interesting since they highlight the fact that what is planar can sometimes
    be rather ambiguous. Indeed, Figure 4.14c shows the side of a car being classified as
    planar, which one could argue is actually correct.
    Figure 4.14: Examples of incorrect classification of regions, showing randomly
    chosen examples of false negatives (a,b) and false positives (c,d).
    4.4.2 Random Comparison
    A further useful property of the KNN was that we could confirm that the low aver-
    age orientation error we obtained was a true result, not an artefact of the data or test
    procedure. It is conceivable that the recognition algorithm was simply exploiting some
    property of the dataset, rather than actually using the features we extracted. For exam-
    ple, if all the orientations were actually very similar, any form of regression would return
    a low error. We refute this in Figure 4.15b, which shows the spread of orientation er-
    rors obtained (in cross-validation) when using randomly chosen neighbours in the KNN,
    instead of the spatiogram similarity. This means there was no image information being
    used at all. Compared to results obtained using the KNN classifier (shown in Figure
    4.15a), performance was clearly much worse. The histogram of results for the KNN also
    shows similar performance to the RVM (Figure 4.6 above).
    These experiments with the KNN were quite informative, since even an algorithm as
    simple as the KNN classifier can effectively make use of the visual information available
    in test data, in an intuitive and comprehensible way, to find structurally similar training
    examples. The RVM and KNN exhibit broadly similar performance, though they work
    by different mechanisms. It is reasonable, therefore, to consider the RVM as being a more
    efficient way of approximating the same goal, that of choosing the regions in feature space
    most appropriate for a given test datum [11]. The superior performance of the RVM at
    Figure 4.15: Comparison of the distribution of orientation errors (in cross-
    validation), for K-nearest neighbour regression (a) and randomly chosen ‘neigh-
    bours’ (b).
    a lower computational cost (during testing), and more efficient handling of large training
    sets, suggests it is a suitable choice of classifier.
    4.5 Summary of Findings
    In this chapter, we have shown experimental results for the plane recognition algorithm
    introduced in Chapter 3. We investigated the effects of various parameters and imple-
    mentation choices, and showed that the methods we chose to represent the image regions
    were effective for doing plane classification, and out-performed the simpler alternatives.
    We also showed that the algorithm generalises well to environments outside the training
    set, being able to recognise and orient planar regions in a variety of scenes.
    4.5.1 Future Work
    There are a number of further developments of this algorithm which would be interesting
    to consider, but fall outside the scope of this thesis. First, while we are using a saliency
    detector which identifies scale, and using it to select the patch size for descriptor creation,
    this is not fully exploiting scale information (rather, we are striving to be invariant to
    it). However, the change in size of similar elements across a surface is an important
    depth cue [43], and is something we might consider using — perhaps by augmenting
    the spatiograms to represent this as well as 2D position. On the other hand, whether
    the notion of image scale detected by the DoG detector has any relation to real-world
    scale or depth is uncertain. Alternatively, investigation of other types of saliency may
    be fruitful.
    We could also consider using different or additional feature descriptors. We have already
    shown how multiple types of descriptor, each with their own vocabulary, can be combined
    together using topics, and that these descriptors are suited to different tasks. We could
    further expand this by using entirely different types of feature for the two tasks, for
    example using sophisticated rotation and scale invariant descriptors for classification,
    following approaches to object recognition [33, 70], and shape or line based [6] features
    for orientation estimation.
    4.5.2 Limitations
    The system as described above has a number of limitations, which we briefly address
here. First of all, because it is based on a finite training set, it has a limited ability to deal
    with new and exceptional images. We have endeavoured to learn generic features from
    these, to be applicable in unfamiliar scenes, though this may break down for totally
    new types of environment. Also, while we have avoided relying on any particular visual
    characteristics, such as isotropic texture, the choices we have made are tailored to outdoor
    locations, and we doubt whether it would perform well indoors where textured planar
    surfaces are less prevalent.
    The most important and obvious limitation is that it requires a pre-segmented region of
    interest, both for training and testing, which means it requires human intervention to
    specify the location. This was sufficient to answer an important question, namely, given
    a region, can we identify it as a plane or not and estimate its orientation? However, the
    algorithm would be of limited use in most real-world tasks, and we cannot simply apply
    it to a whole image, unless we already know a plane is likely to fill the whole image.
    The next step, therefore, is to place the plane recognition into a broader framework in
    which it is useful for both finding and orienting planes. This is the focus of the following
    chapter, in which we show how it is possible, with a few modifications, to use it as part
    of a plane detection algorithm, which is able to find planes from anywhere within the
    image, making it possible to use directly on previously unseen images.
    CHAPTER 5
    Plane Detection
    This chapter introduces our novel plane detection method, which for a single image can
    detect multiple planes, and predict their individual orientations. It is important to note
    the difference between this and the plane recognition algorithm presented in Chapter 3,
    which required the location of potential planar regions to be known already, limiting its
    applicability in real-world scenarios.
    5.1 Introduction
    The recognition algorithm forms a core part of our plane detection method, where it is
    applied at multiple locations, in order to find the most likely locations of planes. This
    is sufficient to be able to segment planes from non-planes, and to separate planes from
    each other according to their orientation. This continues with our aim of developing
    a method to perceive structure in a manner inspired by human vision, since the plane
    detection method extends the machine learning approach introduced in the recognition
    method. Again this means we need a labelled training set of examples, though with
    some key differences.
    5.1.1 Objective
    First we more rigorously define our objective: in short, what we intend to achieve is the
    detection of planes in a single image. More precisely, we intend to group the salient points
    detected in an image into planar and non-planar regions, corresponding to locations of
    actual planar and non-planar structures in the scene. The planar regions should then
    be segmented into groups of points, having the same orientation, and corresponding
    correctly to the planar surfaces in the scene. Each group will have an accurate estimate
    of the 3D orientation with respect to the camera. This is to be done from a single image
    of a general outdoor urban scene, without knowledge such as camera pose, physical
    location or depth information, nor any other specific prior knowledge about the image.
    This cannot rely upon specific features such as texture distortion or vanishing points.
    While such methods have been successful in some situations (e.g. [13, 44, 73, 93]), they
    are not applicable to more general real-world scenes. Instead, by learning the relationship
    between image appearance and 3D structure, the intention is to roughly emulate how
    humans perceive the world in terms of learned prior experience (although the means by
    which we do this is not at all claimed to be biologically plausible). This task as we have
    described it has not, to the best of our knowledge, been attempted before.
    5.1.2 Discussion of Alternatives
    Given that we have developed an algorithm capable of recognising planes and estimating
    their orientation in a given region of an image (Chapter 3), a few possible methods
    present themselves for using this to detect planes. Briefly, the alternatives are to sub-
    divide the image, perhaps using standard image segmentation algorithms, to extract
    candidate regions; to find planar regions in agglomerations of super-pixels; and to search
    for the optimal grouping by growing, splitting and merging segmentations over the salient
    points.
    Amongst the simplest potential approaches is to provide pre-segmented regions on which
    the plane recogniser can work, using a standard image segmentation algorithm. However,
    image segmentation is in general a difficult and unsolved problem [36, 130], especially
    when dealing with more complicated distinctions than merely colour or texture, and it
    is unlikely that general algorithms would give a segmentation suitable for our purposes.
    To illustrate the problem, Figure 5.1 shows typical results of applying Felzenszwalb and
    Huttenlocher’s segmentation algorithm [36], with varying settings (controlling roughly
    Figure 5.1: This illustrates the problems with using appearance-based segmen-
    tation to find regions to which plane recognition may be applied. Here we have
    used Felzenszwalb and Huttenlocher’s algorithm [36], which uses colour informa-
    tion in a graph-cut framework. While the image is broadly broken down into
    regions corresponding to real structure, it is very difficult to find a granularity
    (decreasing left to right) which does not merge spatially separate surfaces, while
    not over-segmenting fine details.
    the number of segments) to some of our test images. The resulting segments are either
    too small to be used, or will be a merger of multiple planar regions, whose boundary is
    effectively invisible (for example the merging of walls and sky caused by an over-exposed
    image).
    An even more basic approach would be simply to tile the image with rectangular blocks,
    and run plane recognition on each, then join adjacent blocks with the same classifi-
    cation/orientation. This does not depend on any segmentation algorithm to find the
    boundaries, but will result in a very coarse, blocky segmentation. Choosing the right
    block size would be problematic, since with blocks too large we gain little information
    about the location or shape of surfaces, but too small and the recognition will perform
    too poorly to be of any use (c.f. Chapter 3 and experiment 6.2.1). However, one could
    allow the blocks to overlap, and while the overlapped sections are potentially ambiguous,
    this could allow us to use sufficiently large regions while avoiding the blockiness — this
    is something we come back to later.
    Rather than use fixed size blocks, we could deliberately over-segment the image into so-
    called ‘superpixels’, where reasonably small homogeneous regions are grouped together.
    This allows images to be dealt with much more efficiently than using the pixels them-
    selves. Individual superpixels would likely not be large enough to be classifiable on their
    own, but ideally they would be merged into larger regions which conform to planar or
    non-planar surfaces. Indeed, this is rather similar to Hoiem et al. [66], who use seg-
    ments formed of superpixels to segment the image into geometric classes. They use local
    features such as filter responses and mean colour to represent individual superpixels,
    which are very appropriate for grouping small regions (unlike our larger-scale classifica-
    tion). Even so, finding the optimal grouping is prohibitively expensive. In our case we
    would also have to ensure that there are always enough superpixels in a collection being
    classified at any time, constraining the algorithm in the alterations it can make to the
    segmentation.
    Another alternative is to initialise a number of non-overlapping regions, formed from
    adjacent sets of salient points rather than superpixels (an initial segmentation). Plane
    recognition would be applied to each, followed by iterative update of the regions’ bound-
    aries in order to search for the best possible segmentation. The regions could be merged
    and split as necessary and the optimal configuration found using simulated annealing
    [34], for example. Alternatively this could be implemented as region growing where a
    smaller number of regions are initialised centred at salient points, and are grown by
    adding nearby points if they increase the ability of the region to describe a plane; then
    split apart if they become too large or fail to be classified confidently. The problem is
    that while region growing has been successful for image segmentation [129], our situa-
    tion differs in that the decision as to whether a point should belong to a plane is not
    local — i.e. there is nothing about the point itself which indicates that it belongs to a
    nearby region. Rather, it is only after including the point within a region to be classi-
    fied that anything about its planar or non-plane status can be known. This is different
    from segmenting according to colour, for instance, which can be measured at individual
    locations.
    These methods would need some rule for evaluating the segmentation, or deciding
    whether regions should be merged or split, and unfortunately it is not clear how to
    define which cost function we should be minimising (be it the energy in simulated an-
    nealing or the region growing criterion). One candidate is the probability estimate given
    by the RVM classifier, where at each step, the aim would be to maximise the overall
    probability of all the classifications, treating the most confident configuration as the
    best. The problem is that we cannot rely on the probability given by the RVM, partly
    due to the interesting property of the RVM that the certainty of classification, while
    generally sensible for validation data, actually tends to increase as the test data move
    further away from the training data. This is an unfortunate consequence of enforcing
    sparsity in the model [106], and cannot be easily resolved without harming the efficiency
    of the algorithm. Indeed, in some early experiments using region growing, we found that
    classification certainty tended to increase with region size, so that the final result always
    comprised a single large region no matter what the underlying geometry. This particu-
    lar problem is specific to the RVM, although similar observations hold for classification
    algorithms more generally — any machine learning method would be constrained by its
    training data, and liable to make erroneous predictions outside this scope.
    Bearing the above discussion in mind, what we desire is a method that does not rely
    upon being able to classify or regress small regions (since our plane recogniser cannot
    do this); avoids the need for any prior segmentation or region extraction (which cannot
    be done reliably); does not rely on accurate probabilities from a classifier in its error
    measure (which are hard to guarantee); and will not require exploration or optimisation
    over a combinatorial search space. In the following, we present such a method, the
    crucial factor being a step we call region sweeping, and use this to drive a segmentation
    algorithm at the level of salient points.
    5.2 Overview of the Method
    This section gives a brief overview of the plane detection method, the details of which
    are elaborated upon in the following sections, and images representing each of the steps
    are shown in Figure 5.2. We begin by detecting salient points in the image as before,
    and assigning each a pair of words based on gradient and colour features. Next we use a
    process we call region sweeping, in which a region is centred at each salient point in turn,
    to which we apply the plane recognition algorithm. This gives plane classification and
    orientation estimation at a number of overlapping locations covering the whole image.
    We use these to derive the ‘local plane estimate’ which is an estimate at each individual
    salient point of its probability of belonging to a plane, and its orientation, as shown in
    Figure 5.2b. These estimates of planarity and orientation at each point are derived from
    all of the sweep regions in which the point lies.
    While this appears to be a good approximation of the underlying planar structure, it is
    not yet a plane detection, since we know nothing of individual planes’ locations. The
    next step, therefore, is to segment these points into individual regions, using the local
    plane estimates to judge which groups of points should belong together, as illustrated
    in Figure 5.2c. The output of the segmentation is a set of individual planar and non-
    planar segments, and the final step is to verify these, by applying the plane recognition
    (a)
    (b)
    (c)
    (d)
    Figure 5.2: Example steps from plane detection: from the input image (a), we
    sweep the plane recogniser over the image to obtain a pointwise estimate of plane
    probability and orientation (b). This is segmented into distinct regions (c), from
    which the final plane detections are derived (d).
    algorithm once more, on these planar segments. The result, as shown in Figure 5.2d, is a
    detection of the planar surfaces, comprised of groups of salient points, with an estimate
    of their orientation, derived from their spatial adjacency and compatibility in terms of
    planar characteristics.
    5.3 Image Representation
    The representation of images will closely follow the description in Section 3.3, except
    that now we need to consider the entire image, as well as multiple overlapping regions.
    Because regions overlap, feature vectors are shared between multiple regions.
    We begin by detecting a set of salient points V = { v 1 ,..., v n } over the whole image,
    using the difference of Gaussians (DoG) detector. For each point, we create a gradient
    and colour descriptor. Assuming that we have already built bag of words vocabularies,
    we quantise these to words, so that the image is represented by pairs of words (gradient
    and colour) assigned to salient points (for more details, refer back to Section 3.3). The
    further stages of representation – word/topic histograms and spatiograms – depend on
    having regions, not individual points, and so are not applied yet. Conveniently, this set
    of salient points and words can be used for any regions occurring in the image.
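As an illustration of this representation step (a sketch under the assumption that OpenCV's SIFT implementation stands in for our DoG detector and gradient descriptor, and that the vocabularies are available as k-means centroids), the salient points and their word pairs could be produced as follows; colour_descriptor is a hypothetical stand-in for the colour descriptor of Section 3.3.

import cv2
import numpy as np

def represent_image(image_bgr, gradient_vocab, colour_vocab, colour_descriptor):
    """Detect salient points and assign each a (gradient word, colour word) pair.

    gradient_vocab / colour_vocab: (K, D) arrays of k-means centroids.
    colour_descriptor: callable producing a colour descriptor for a keypoint
    (a hypothetical stand-in for the descriptor of Section 3.3).
    """
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()                      # DoG detector plus gradient descriptor
    keypoints, grad_desc = sift.detectAndCompute(gray, None)
    if grad_desc is None:                         # no salient points found
        return np.empty((0, 2)), []

    points, words = [], []
    for kp, g in zip(keypoints, grad_desc):
        c = colour_descriptor(image_bgr, kp)      # colour descriptor at this point
        g_word = int(np.argmin(np.linalg.norm(gradient_vocab - g, axis=1)))
        c_word = int(np.argmin(np.linalg.norm(colour_vocab - c, axis=1)))
        points.append(kp.pt)
        words.append((g_word, c_word))
    return np.array(points), words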
    5.4 Region Sweeping
    In order to find the most likely locations of planar structure, we apply a ‘region sweeping’
    stage, using the set V of salient points. Region sweeping creates a set of approximately
circular overlapping regions R, by using each salient point v_i in turn to create a 'sweep region' R_i ∈ R, using the point as the centroid and including all other points within a fixed radius κ. We define R_i as the set of all salient points within the radius from the centroid: R_i = { v_j : ||v_j − v_i|| < κ, j = 1, ..., n }. To speed up the process, we generally
    use every fourth salient point only, rather than every point, as a centroid. Points are
    processed in order from the top left corner, to ensure the subset we use is approximately
    evenly distributed.
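A minimal sketch of this sweep-region construction, using a KD-tree for the radius query (an implementation convenience rather than part of the method), is as follows.

import numpy as np
from scipy.spatial import cKDTree

def sweep_regions(points, kappa=100.0, stride=4):
    """Build overlapping sweep regions R_i = {v_j : ||v_j - v_i|| < kappa}.

    points: (n, 2) array of salient point positions, assumed ordered from the
    top-left of the image so that taking every `stride`-th point gives an
    approximately even spatial subset of centroids.
    """
    tree = cKDTree(points)
    regions = []
    for i in range(0, len(points), stride):       # every fourth point as a centroid
        members = tree.query_ball_point(points[i], r=kappa)
        regions.append(np.array(members))         # indices of the points in R_i
    return regions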
    Topic spatiograms are created for each sweep region (see Section 3.3.5). Using these, the
plane recognition algorithm can be applied, resulting in an estimate of the probability p(R_i) ∈ (0, 1) of belonging to the plane class, and an estimate of the orientation n(R_i) ∈ ℝ³ (normal vector), for each sweep region R_i in isolation. The result – before any further processing – can be seen in Figure 5.3, showing multiple overlapping regions R coloured
    according to their probability of being planar, with the orientation estimate shown for
    each planar region. These regions are classified and regressed using RVMs, the training
    data for which is described in the next section.
Figure 5.3: Input image (left) and the result of region sweeping (right): this
    shows the hull of each region, coloured according to its estimated probability of
    being a plane (red is planar, blue is non-planar, and shades of purple in between),
    and the regressed normal for plane regions (only a subset of the regions are shown
    for clarity).
    Note that the choice of region size is dictated by two competing factors. On one hand,
    larger regions will give better recognition performance, but at the expense of obtaining
    coarser-scale information from the image, blurring plane boundaries. Small regions would
    be able to give precise and localised information, except that accuracy falls as region size
    Figure 5.4: Examples of manually segmented ground truth data, used for train-
    ing the classifiers, showing planes (red) with their orientation and non-planes
    (blue).
    decreases. Fortunately, due to the segmentation method we will introduce soon, this does
    not mean our algorithm is incapable of resolving details smaller than the region size. We
    investigate the implications of region size in our experiments presented in Section 6.2.1.
    5.5 Ground Truth
    Before discussing the next step in the detection algorithm, it is necessary to explain the
    ground truth data. This is because it is crucial for training the recognition algorithm,
    as well as validation of the detection algorithm (see Chapter 6).
    Unlike in the previous chapters, these ground truth data will contain the location and
    orientation of all planes in the image, not just a region of interest. We begin with a set of
    images, selected from the training video sequences, and hand segment them into planar
and non-planar regions. We mark up the entire image, so that no areas are left ambiguous. Plane orientations are specified using the interactive vanishing line
    method as before (refer to Figure 3.1). Examples of such ground truth regions are shown
    in Figure 5.4.
    5.6 Training Data
    In Section 3.2, we described how training data were collected by manually selecting and
    annotating regions from frames from a video sequence. These regions are no longer
    suitable, because the manually selected region boundaries are not at all like the shapes
    obtained from region sweeping. As a consequence, classification performance suffers.
    Furthermore, we can no longer guarantee that all regions used for recognition will be
    purely planar or non-planar, since they are not hand-picked but are extracted from
    images about all salient points. Thus training regions which are cleanly segmented and
    correspond entirely to one class (or orientation) are not representative of the test data.
    5.6.1 Region Sweeping for Training Data
    To obtain training data more appropriate to the plane detection task, we gather it using
    the same method as we extract the sweep regions themselves (see above), but applied
    instead to ground truth images described in the previous section. When creating regions
    by grouping all salient points within a radius of the central salient point, we use only a
    small subset of salient points in each image as centre points, so that we do not get an
    unmanageably large quantity of data. Since these are ground truth labelled images, we
    use the ground truth to assign the class and orientation labels to these extracted regions.
    Inevitably some regions will lie over multiple ground truth segments, as would test data.
    This is dealt with by assigning to each salient point the class of the ground truth region
    in which it lies, then the training regions are labelled based on the modal class of their
salient points. The same is done for assigning orientation, except that the geometric median is used.
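To make the labelling rule concrete, the following sketch assigns a class and orientation to a sweep-derived training region from per-point ground truth, using a simple Weiszfeld iteration for the geometric median; this is illustrative and not necessarily the exact implementation.

import numpy as np
from collections import Counter

def geometric_median(vectors, iters=50, eps=1e-9):
    """Weiszfeld's algorithm: the point minimising summed distances to `vectors`."""
    x = np.mean(vectors, axis=0)
    for _ in range(iters):
        d = np.linalg.norm(vectors - x, axis=1)
        w = 1.0 / np.maximum(d, eps)
        x = np.average(vectors, axis=0, weights=w)
    return x

def label_region(point_classes, point_normals):
    """Label a region by the modal class of its points; if planar, take the
    geometric median of the planar points' ground-truth normals."""
    label = Counter(point_classes).most_common(1)[0][0]
    normal = None
    if label == "plane":
        planar = np.array([m for m, c in zip(point_normals, point_classes)
                           if c == "plane"])
        n = geometric_median(planar)
        normal = n / np.linalg.norm(n)
    return label, normal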
    We also investigated making use of the estimated probability of each region being a
    plane, which was calculated as the proportion of planar points in the region. The aim
    would be to regress such a probability estimate for test regions, but experiments showed
    no benefit in terms of the resulting accuracy, and so we use the method outlined above.
    The downside to this approach is that regions whose true class is ambiguous (having for
    example almost equal numbers of plane and non-plane points) will be forced into one
    class or the other. However, this will be the same during testing, and so we consider it
    sensible to leave these in the training set.
    This method has the advantage that it allows a very large amount of training data to be
    extracted with little effort, other than the initial marking up of ground truth. Indeed, this
    is far more than we can deal with if we do not sparsify the sweeping process considerably.
    As such it is not necessary to apply warping as well (Section 3.2.2), although we still
    consider it beneficial to reflect all the ground truth images before extraction begins.
    5.6.1.1 Evaluation of Training Data
    To verify that gathering an entirely new set of training regions is necessary, we col-
    lected a test set of planar and non-planar regions by applying region sweeping to an
    independent set of ground truth images (those which are later used to evaluate the full
    algorithm). This produced a new set of 3841 approximately circular regions. When using
    the plane recognition algorithm on these, using the original hand-segmented training set
    from Chapter 4, the results were a classification accuracy of only 65.8%, and a mean
orientation error of 22°, which would not be good enough for reliable plane segmentation
    and detection. However, running the same test using the new sweeping-derived training
    set described here increases classification accuracy to 84.6%, indicating that having an
    appropriate training set is indeed important. We note with some concern that the mean
orientation error decreased only marginally to 21°. Possibly this is because both training
    and testing data now include regions containing a mixture of different planes, making it
    more difficult to obtain good accuracy with respect to the ground truth.
    5.7 Local Plane Estimate
    After running region sweeping, as described in Section 5.4, and classifying these regions
    with classifiers trained on the data just described, we have a set of overlapping regions
    R covering the image. This gives us local estimates of what might be planar, but says
    nothing about boundaries. Points in the image lying inside multiple regions have am-
    biguous classification and orientations. We address this by considering the estimate given
    to each region R i containing that point as a vote for a particular class and orientation.
    Intuitively, a point where all the regions in which it lies are planar is very likely to be
    on a plane. Conversely a point where there is no consensus about its class is uncertain,
    and may well be on the boundary between regions. This observation is a crucial factor
    in finding the boundaries between planes and non-planes, and between different planes.
    More formally, we use the result of region sweeping to estimate the probability of a
    salient point belonging to a planar region, by sampling from all possible local regions it
    could potentially be a part of, before any segmentation has been performed. For points
    which are likely to be in planar regions, we also estimate their likely orientation, using
the normals of all the planar surfaces they could lie on. Each salient point v_i lies within multiple regions R_i ⊆ R, where R_i is defined as R_i = { R_k | v_i ∈ R_k }; that is, the subset
    Figure 5.5: Using the sweep regions (left), we obtain the local plane estimate,
which for each point, located at v_i, assigns a probability q_i of belonging to a plane, and an orientation estimate m_i (right). Probability is coloured from red
    to blue (for degree of plane to non-plane), and orientation is shown with a normal
    vector (only a subset of points are shown for clarity).
of regions R_k in which v_i appears. Each point v_i is given an estimate of its probability of being on a plane, denoted q_i, and of its normal vector m_i, calculated as follows:

    q_i = ζ({ p(R_k) | R_k ∈ R_i })
    m_i = ζ_G({ n(R_k) | R_k ∈ R_i, p(R_k) > 0.5 })        (5.1)

where ζ and ζ_G are functions calculating the median and the geometric median (the latter in ℝ³) respectively. Note that m_i is calculated using only regions whose probability of being a plane is higher than for a non-plane. To clarify, equation 5.1 describes how the plane probability and normal estimate for the point i come from the median of the regions in R_i. We use the median rather than the mean since it is a more robust measure of
    central tendency, to reduce the effect of outliers, which will inevitably occur when using
    a non-perfect classifier. Figure 5.5 illustrates how the sweep regions lead to a pointwise
    local plane estimate.
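Equation 5.1 can be realised directly as in the sketch below, which inverts the region membership to find, for each point, the regions containing it, and then takes the median probability and geometric median normal (using a routine such as the Weiszfeld iteration sketched earlier); the certainty thresholding described next is omitted for brevity, and the names are illustrative.

import numpy as np

def local_plane_estimate(n_points, regions, probs, normals, geometric_median):
    """Per-point plane probability q_i (median of p(R_k) over containing regions)
    and orientation m_i (geometric median of n(R_k) over planar containing regions)."""
    q = np.full(n_points, np.nan)
    m = [None] * n_points
    containing = [[] for _ in range(n_points)]
    for k, members in enumerate(regions):          # invert membership: regions per point
        for i in members:
            containing[i].append(k)
    for i, ks in enumerate(containing):
        if not ks:
            continue                               # point lies in no (retained) region
        q[i] = np.median([probs[k] for k in ks])
        planar = [normals[k] for k in ks if probs[k] > 0.5]
        if planar:
            m[i] = geometric_median(np.array(planar))
    return q, m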
    In order to improve the accuracy of the local plane estimate, we discard regions whose
    classification certainty is below a threshold. The classification certainty is defined as the
probability of belonging to the class it has been assigned, and thus is p(R_i) and 1 − p(R_i), for planar and non-planar regions respectively. This discarding of classified regions is
    justified by a cross-validation experiment (similar to those in Section 4.1). As shown
    in Figure 5.6, as the threshold is increased, to omit the less certain classifications, the
    mean classification accuracy increases. This is at the expense of decreasing the number
    of regions which can be used. It is this thresholding for which having a confidence
    Figure 5.6: As the threshold on classifier certainty increases, less confident
    regions are discarded, and so accuracy improves (a); however, the number of
    regions remaining drops (b).
    value for the classification, provided by the RVM, is very useful, and is an advantage
    over having a hard classification. The effect of this is to remove the mid-range from
    the spread of probabilities, after which the median is used to choose which of the sides
    (closer to 0 or 1) is larger. This is equivalent to a voting scheme based on the most
    confident classifications. However, our formulation can allow more flexibility if necessary,
    for example using a different robust estimator.
    Since there are usually a large number of overlapping regions available for each point,
    the removal of a few is generally not a problem. In some cases, of course, points are
left without any regions that they lie inside (R_i = ∅), and so these points are left out
    of subsequent calculations. Such points are in regions where the classifier cannot be
    confident about classification, and so should play no part in segmentation.
    The result is an estimate of planarity for each point, which we will refer to as the local
    plane estimate — examples are shown in Figure 5.7. We have now obtained a represen-
    tation of the image structure which did not require the imposition of any boundaries,
    and circumvented the problem of being unable to classify arbitrarily small regions. It is
    encouraging to note that at this stage, the underlying planar structure of the image is
    visible, although as expected it is less well defined at plane boundaries, where orientation
    appears to transition smoothly between surfaces.
    Figure 5.7: Examples of local plane estimates, for a variety of images; colour,
    from red to blue, shows the estimated probability of belonging to a plane, and
    the normal vector is the median of all planar regions in which the point lies.
    5.8 Segmentation
    Although the local plane estimate produced above is a convincing representation of the
    underlying structure of the scene – showing where planes are likely to be, and their ap-
    proximate orientation – this is not yet an actual plane detection. Firstly, plane bound-
    aries remain unknown, since although we know the estimates at each point, it is not
    known which points are connected to which other points on a common surface. Sec-
    ondly, each pointwise local plane estimate is calculated from all possible surfaces on
    which the point might lie. As such, this is not accurate, which is especially clear in Fig-
    ure 5.7c where points near plane-plane boundaries take the mean of the two adjoining
    faces.
    This section describes how the local plane estimate is used to segment points from each
other to discover the underlying planar structure of the scene, in terms of sets of connected,
    coplanar points. This consists of segmenting planes from non-planes, deciding how many
    planes there are and how they are oriented, then segmenting planes from each other.
    5.8.1 Segmentation Overview
    Our segmentation is performed in two separate steps. This is because we found it was
    not straightforward to select a single criterion (edge weight, energy function, set of clique
    potentials etc.) to satisfactorily express the desire to separate planes and non-planes,
    while at the same time separating points according to their orientation. Therefore, we
    first segment planar from non-planar regions, according to the probability of belonging
    to a plane given by the local plane estimate. Next, we consider only the resulting planar
    regions, and segment them into distinct planes, according to their orientation estimate,
    for which we need to determine how many planes there are. We do this by finding
    the modes in the distribution of normals observed in the local plane estimate, which
    represent the likely underlying planes. This is based on the quite reasonable assumption
    that there are a finite number of planar surfaces which we need to find.
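As one possible (and not necessarily the actual) realisation of this mode-finding step, a simple mean-shift over the unit normal vectors could be used, with the surviving modes taken as the candidate plane orientations; the bandwidth and merging tolerance below are hypothetical parameters.

import numpy as np

def normal_modes(normals, bandwidth=0.3, iters=30, merge_tol=0.1):
    """Crude mean-shift mode finding on unit normal vectors.

    Each normal is shifted towards the weighted mean of nearby normals; after
    convergence, nearby points are merged into a single mode, and the surviving
    modes are the candidate plane orientations."""
    pts = np.asarray(normals, dtype=float)
    shifted = pts.copy()
    for _ in range(iters):
        for i, x in enumerate(shifted):
            w = np.exp(-np.linalg.norm(pts - x, axis=1) ** 2 / (2 * bandwidth ** 2))
            x_new = (w[:, None] * pts).sum(axis=0) / w.sum()
            shifted[i] = x_new / np.linalg.norm(x_new)   # stay on the unit sphere
    modes = []
    for x in shifted:
        if all(np.linalg.norm(x - m) > merge_tol for m in modes):
            modes.append(x)
    return np.array(modes)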
    5.8.2 Graph Segmentation
    We formulate the segmentation as finding the best partition of a graph, using a Markov
    random field (MRF) framework. A MRF is chosen since it can well represent our task,
    which is for each salient point, given its observation, to find its best ‘label’ (for plane
    class and orientation), while taking into account the values of its neighbours. The values
    of neighbouring points are important, since as with many physical systems we can make
    the assumption of smoothness. This means we are assuming that points near each other
    will generally (discontinuities aside) have similar values [78]. Without the smoothness
    constraint, segmentation would amount simply to assigning each point to its nearest
    class label, which would be trivial, but would not respect the continuity of surfaces.
    Fortunately, optimisation of MRFs is a well-studied problem, and a number of efficient
    algorithms exist.
    First we build the graph to represent the 2D configuration of points in the image and their
    neighbourhoods. We do this with a Delaunay triangulation of the salient points to form
    the edges of the graph, using the efficient S-Hull implementation 1 [118]. We modify the
    standard triangulation, first by removing edges whose endpoints never appear in a sweep
    region together. This is because for two points which are never actually used together
    during the plane sweep recognition stage, there is no meaningful spatial link between
    them, so there should be no edge. To obtain a graph with only local connectivity, we
    impose a threshold on edge length (typically 50 pixels), to remove any undesired long-
    range effects.
    1 The code is available at www.s-hull.org/
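The graph construction just described could be sketched as follows, with scipy's Delaunay triangulation substituted for the S-Hull implementation cited above, and the two pruning rules applied to the resulting edges: endpoints must co-occur in at least one sweep region, and edges longer than a threshold (50 pixels in the text) are removed.

import numpy as np
from scipy.spatial import Delaunay

def build_graph(points, regions, max_edge_len=50.0):
    """Delaunay-based neighbourhood graph over salient points.

    Edges are kept only if (a) both endpoints appear together in at least one
    sweep region, and (b) the edge is shorter than `max_edge_len` pixels.
    """
    co_occurring = set()
    for members in regions:
        ms = [int(m) for m in members]
        for a in range(len(ms)):
            for b in range(a + 1, len(ms)):
                co_occurring.add((min(ms[a], ms[b]), max(ms[a], ms[b])))

    tri = Delaunay(points)
    edges = set()
    for simplex in tri.simplices:                 # each triangle contributes three edges
        for a in range(3):
            i, j = sorted((int(simplex[a]), int(simplex[(a + 1) % 3])))
            if (i, j) in co_occurring and np.linalg.norm(points[i] - points[j]) < max_edge_len:
                edges.add((i, j))
    return edges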
    5.8.3 Markov Random Field Overview
    A MRF expresses a joint probability distribution on an undirected graph, in which
    every node is conditionally independent of all other nodes, given its neighbours [78] (the
    Markov property). This is a useful property, since it means that the global properties
    of the graph can be specified by using only local interactions, if certain assumptions
    are made. In general the specification of the joint probability of the field would be
    intractable, were it not for a theorem due to Hammersley and Clifford [60] which states
    that a MRF is equivalent to a Gibbs random field, which is a random field whose global
    properties are described by a Gibbs distribution. This duality enables us to work with
    the MRF in an efficient manner and to find optimal values, in terms of maximising the
    probability.
If we define a MRF over a family of variables F = {F_1, ..., F_N}, each taking the value F_i = f_i, then f = {f_1, ..., f_N} denotes a particular realisation of F, where some f describes the labels assigned to each node, and is called a configuration of the field. In this context, the Gibbs distribution is expressed as

    P(f) = Z^(−1) e^(−U(f))        (5.2)
    where U ( f ) is the energy function, and for brevity we have omitted the temperature
    parameter from the exponent. Z is the partition function:
    Z = Σ_{f ∈ F} e^(−U(f))        (5.3)
    which is required to ensure the distribution normalises to 1. Since this must be evaluated
over all possible configurations f (the space F), evaluating (5.2) is generally intractable. However, since Z is the same for all f, it is not needed in order to find the optimal configuration of the field — i.e. it is sufficient that P(f) ∝ e^(−U(f)). The optimal configuration is the f which is most likely (given the observations and any priors), so the aim is to find the f = f* which maximises the probability P.
    The energy function U ( f ) sums contributions from each of the neighbourhood sets of the
    graph. Since the joint probability is inversely related to the energy function, minimising
the energy is equivalent to finding the MAP (maximum a posteriori) configuration f* of the field. Thus, finding the solution to a MAP-MRF is reduced to an optimisation problem on U(f). This is generally written as a sum over clique potentials, U(f) = Σ_{c∈C} V_c(f), where a clique is a set of nodes which are all connected to each other, and V_c is the clique potential for clique c (in the set of all possible cliques C). In our case, we deal with up to second order cliques – that is, single nodes and pairs of nodes – so the
    energy function can be expressed as:
U(f) = Σ_{{i} ∈ C_1} V_1(f_i) + Σ_{{i,i'} ∈ C_2} V_2(f_i, f_{i'}) = Σ_{i ∈ S} V_1(f_i) + Σ_{i ∈ S} Σ_{i' ∈ N_i} V_2(f_i, f_{i'})        (5.4)
    where V 1 and V 2 are the first and second order clique potentials respectively. The left
    hand equation expresses the energy as a sum over the two clique potentials, over the set of
    all first order C 1 (single nodes) and second order C 2 (pairs connected by an edge) cliques;
    the right hand side expresses this more naturally as a sum of potentials over nodes, and
    a sum over all the neighbourhoods for all the nodes; S is the set of all nodes in the graph
and N_i ⊂ S denotes all the neighbours of node i. The two clique potentials take into
    account respectively the dependence of the label on the observed value, at a single node,
    and the interaction between pairs of labels at adjacent nodes (this is how smoothness
is controlled). In summary, it is this energy U(f), calculated using the variables f_i of
the configuration f and the functions V_{1,2}, which is to be minimised in order to find
the best configuration f^* (and thus the optimal labels f_i^* for all the nodes).
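As an illustration of how equation 5.4 can be evaluated (a sketch with our own choice of data structures, not the thesis implementation; V_1 is given the node index so that it can depend on per-node observations later on), the pairwise term below is summed once per edge, which rescales the energy only by a constant factor and so leaves the minimising configuration unchanged.

    def total_energy(labels, edges, V1, V2):
        """Evaluate U(f) as in equation 5.4 for a labelling `labels` (node -> f_i).

        edges : iterable of (i, j) node pairs, one per second-order clique.
        V1    : callable (node, label) -> first order potential.
        V2    : callable (label, label) -> second order potential.
        """
        unary = sum(V1(i, f_i) for i, f_i in labels.items())
        pairwise = sum(V2(labels[i], labels[j]) for i, j in edges)
        return unary + pairwise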
    We use iterative conditional modes (ICM) to optimise the MRF, which is a simple but
    effective iterative algorithm developed by Besag [8]. The basic principle of ICM is to
    set every node in turn to its optimal value, given the current value of its neighbours,
    monotonically decreasing the total energy of the MRF. The updates will in turn alter
    the optimal value for already visited nodes, so the process is repeated until convergence
    (convergence to a (local) minimum can be proven, assuming sequential updates). ICM
    is generally quite fast to converge, and although it may not be the most efficient opti-
    misation algorithm, it is very simple to implement, and was found to be suitable for our
    task.
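A minimal sketch of this update scheme is given below (generic over the label set and potentials; the dictionary-based graph representation is our own choice for illustration): each node is set in turn to the label that minimises its local conditional energy, and full sweeps are repeated until no label changes.

    def icm(labels, neighbours, label_set, V1, V2, max_sweeps=20):
        """Iterative conditional modes: greedily minimise each node's local
        conditional energy given its neighbours' current labels.

        labels     : dict node -> initial label (updated in place).
        neighbours : dict node -> list of adjacent nodes (the sets N_i).
        """
        for _ in range(max_sweeps):
            changed = False
            for i in labels:
                def local_energy(f_i):
                    return V1(i, f_i) + sum(V2(f_i, labels[j]) for j in neighbours[i])
                best = min(label_set, key=local_energy)
                if best != labels[i]:
                    labels[i] = best
                    changed = True
            if not changed:        # no label changed in a full sweep: local minimum reached
                break
        return labels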
    5.8.4 Plane/Non-plane Segmentation
    The first step is to segment planes from non-planes. We do this using a MRF as follows:
let p represent a configuration of the field, where each node p_i ∈ {0, 1} represents the
class of point i (1 and 0 being plane and non-plane, respectively). We then seek the
optimal configuration p^*, defined as p^* = argmin_p U(p), where U(p) represents the
posterior energy of the MRF:

U(p) = Σ_{i ∈ S} V_1(p_i, q_i) + Σ_{i ∈ S} Σ_{j ∈ N_i} V_2(p_i, p_j)        (5.5)
    Here, q i is the observation at point i , which is the estimated probability of this point
    belonging to a plane, obtained as in equation 5.1. The set S contains all salient points
    in the image (assuming they have been assigned a probability). The functions V 1 and V 2
    are the single site and pair site clique potentials respectively, defined as
V_1(p_i, q_i) = (p_i - q_i)^2
V_2(p_i, p_j) = δ_{p_i ≠ p_j}        (5.6)

where δ_{p_i ≠ p_j} has value 0 iff p_i and p_j are equal, 1 otherwise. Here we express the function
V_1 with two arguments, since it depends not only on the current value of the node but
also its observed value. This function penalises deviation of the assigned value p_i at a point
    from its observed value q i , using a squared error, since we want the final configuration to
    correspond as closely as possible to the local plane estimates. We desire pairs of adjacent
    nodes to have the same class, to enforce smoothness where possible, so δ in function V 2
    returns a higher value (1) to penalise a difference in its arguments.
    Each p i is initialised to the value in { 0 , 1 } which is closest to q i (i.e. we threshold the
    observations to obtain a plane/non-plane class) and optimise using ICM. We generally
    find that this converges within a few iterations, since the initialisation is usually quite
    good. It tends to smooth the edges of irregular regions and removes small isolated
    segments. After this optimisation, each node is set to its most likely value (plane or not),
    given the local plane estimate and a smoothness constraint imposed by its neighbours, so
    that large neighbourhoods in the graph will correspond to the same class. The process
    is illustrated with examples in Figure 5.8. Finally, segments are extracted by finding the
    connected components corresponding to planes and non-planes, which now form distinct
    regions (Figure 5.8d).
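Putting these pieces together, the following sketch illustrates the whole plane/non-plane step under the same illustrative assumptions as the snippets above (probs holding the q_i values and neighbours the graph adjacency are hypothetical inputs): the potentials follow equation 5.6, initialisation thresholds the observations, ICM sweeps refine the labels, and connected components of constant label form the final segments.

    from collections import deque

    def segment_planes(probs, neighbours, max_sweeps=20):
        """Binary plane/non-plane MRF segmentation (illustrative sketch).

        probs      : dict node -> q_i, the estimated plane probability.
        neighbours : dict node -> list of adjacent nodes.
        """
        # Clique potentials of equation 5.6.
        V1 = lambda p_i, q_i: (p_i - q_i) ** 2          # squared deviation from observation
        V2 = lambda p_i, p_j: 0 if p_i == p_j else 1    # penalise differing neighbours

        # Initialise each p_i to the class closest to its observation q_i.
        labels = {i: int(q >= 0.5) for i, q in probs.items()}

        # ICM sweeps: set each node to the label minimising its local energy.
        for _ in range(max_sweeps):
            changed = False
            for i in labels:
                def local(p_i):
                    return V1(p_i, probs[i]) + sum(V2(p_i, labels[j]) for j in neighbours[i])
                best = min((0, 1), key=local)
                if best != labels[i]:
                    labels[i], changed = best, True
            if not changed:
                break

        # Extract segments as connected components of equal label.
        segments, seen = [], set()
        for start in labels:
            if start in seen:
                continue
            component, queue = [], deque([start])
            seen.add(start)
            while queue:
                i = queue.popleft()
                component.append(i)
                for j in neighbours[i]:
                    if j not in seen and labels[j] == labels[start]:
                        seen.add(j)
                        queue.append(j)
            segments.append((labels[start], component))
        return labels, segments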
    Figure 5.8: The process of segmenting planes from non-planes. Using the
    probabilities estimated at each point from region sweeping (a), we initialise a
    Markov random field (b); this is optimised using iterative conditional modes
    (c), resulting in a clean segmentation into smooth disjoint regions of planes and
    non-planes (red and blue respectively) (d).
    5.8.5 Orientation Segmentation
    Once planar regions have been separated from non-planar regions using the above MRF,
    we are left with regions supposedly consisting only of planes. However, except in very
    simple images, these will be made up of multiple planes, with different orientations. The
    next step is to separate these planes from each other, using the estimated orientation m i
    at each salient point.
    This is done by optimising a second MRF, with the goal of finding the points belonging
    to the same planar surfaces. This is defined on the subgraph of the above which con-
    tains only points in planar segments. This graph may consist of multiple independent
    connected components, but this makes no difference to the formulation of the MRF.
    In contrast to the simplicity of the first MRF, which was a two-class problem, the values
we have for plane normals are effectively continuously valued variables in R^3, where we
    do not know the values of the true planes. Some brief experimentation suggested that
    attempting to find an energy function which is able to not only segment the points,
    but also find the correct number of planes, was far from straightforward. Generally, a
    piecewise constant segmentation approach did enforce regions of constant orientation,
but typically produced far too many of them, forming banding effects (essentially
discretising the normals into overly fine graduations).
    We take an alternative approach, by using mean shift to find the modes of the kernel
    density estimate of the normals in the image. This is justified as follows: we assume
    that in the image, there are a finite number of planar surfaces, each with a possibly
    different orientation, to which all of the planar salient points belong. This implies that
    close to the observed normals, there are a certain number, as yet unknown, of actual
    orientations, from which all these observations are derived. The task is to find these
    underlying orientations, from the observed normals. Once we know the value of these,
    the problem reduces to a discrete, multi-class MRF optimisation, which is easily solved
    using ICM. The following sections introduce the necessary theory and explain how we
    use this to find the plane orientations.
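As a preview of how these stages fit together, the sketch below is illustrative only: the kernel density and mean shift machinery actually used is derived in the following sections, and scikit-learn's MeanShift is used here merely as a stand-in, with an illustrative bandwidth value.

    import numpy as np
    from sklearn.cluster import MeanShift

    def candidate_orientations(normals, bandwidth=0.3):
        """Find underlying plane orientations from noisy per-point normal estimates.

        normals   : (N, 3) array of unit normals m_i at the planar salient points.
        bandwidth : kernel bandwidth (an illustrative value, not the one used here).
        Returns the mode orientations (unit vectors) and a discrete label per point.
        """
        ms = MeanShift(bandwidth=bandwidth).fit(normals)
        modes = ms.cluster_centers_
        modes /= np.linalg.norm(modes, axis=1, keepdims=True)   # re-project onto the unit sphere
        return modes, ms.labels_

Each recovered mode then plays the role of one discrete label, so the orientation problem becomes a multi-class MRF that can again be optimised with ICM.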
    5.8.5.1 Orientation Distribution and the Kernel Density Estimate
    Kernel density estimation is a method to recover the density of the distribution of a
    collection of multivariate data. Conceptually this is similar to creating a histogram, in
    order to obtain a non-parametric approximation of a distribution. However, histograms
    are always limited by the quantisation into bins, introducing artefacts. Instead of as-
    signing each datum to a bin, the kernel density estimate (KDE) places a kernel at every
    datum, and sums the results to obtain an estimate of the density at any point in the
    space, as we illustrate in Figure 5.9 with the example of a 1D Gaussian kernel.
[Figure 5.9 plot: Gaussian kernels placed over 1D data values, with their sum forming the density estimate.]
    Figure 5.9: This illustrates the central principle behind the kernel density
    estimate. For 1D data (the black diamonds), we estimate the probability density
by placing a kernel over each point (here it is a Gaussian with σ = 0.5), drawn
    with dashed lines. The sum of all these kernels, shown by the thick red line, is
    the KDE. Clusters of nearby points’ kernels sum together to give large peaks in
    the density; outlying points give low probability maxima.
In general, the KDE function f̂ for multivariate data X = {x_1, ..., x_N}, evaluated at a
point y within the domain of X, is defined as [23]

f̂(y) = (1 / (N h^d)) Σ_{i=1}^{N} K((y - x_i) / h)        (5.7)
    where K is the kernel function, h is the bandwidth parameter, and d is the number of
    dimensions.
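A direct transcription of equation 5.7 into NumPy is given below, using a Gaussian kernel for K (anticipating equation 5.9); this is an illustrative sketch rather than the implementation used in this work.

    import numpy as np

    def kde(y, X, h):
        """Kernel density estimate f_hat(y) of equation 5.7 with a Gaussian kernel.

        y : (d,) point at which to evaluate the density.
        X : (N, d) data, e.g. the plane normals with d = 3.
        h : bandwidth parameter.
        """
        N, d = X.shape
        u = (y - X) / h                                   # (y - x_i) / h for every datum
        K = (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * np.sum(u * u, axis=1))
        return K.sum() / (N * h ** d)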
    Following the derivation of Comaniciu and Meer [23], the kernel function is expressed in
terms of a profile function k, so that K(x) = c k(‖x‖^2), where c is some constant. For a
Gaussian kernel, the profile function is

k(x) = exp(-x/2)        (5.8)
which leads to an isotropic multivariate Gaussian kernel:

K(x) = (1 / (2π)^{d/2}) exp(-(1/2)‖x‖^2)        (5.9)

where the constant c became (2π)^{-d/2}. Substituting the Gaussian kernel function (5.9)
    into equation 5.7 yields the function for directly calculating the KDE at any point: