System Calibration
Given the restriction of pseudo-orthographic projection, one can
verify (Proesmans et al. 1996) that the shape can be recovered up to
a scale, if the angle between the directions of projection and
viewing is known. In order to find this angle, the system is
initialized or 'calibrated'. For calibration it suffices to show two
planar patches and to give their subtended angle. In practice,
showing a box or corner of a room and specifying the angle is all it
takes to perform this calibration. The system will extract the
grid as explained below, looks for the two largest planar regions
within that grid and using the knowledge about their orientation
(typically 90 degrees) gives the angle between directions of viewing
and projection.
Extraction of the grid
Once the system is calibrated, the object to be modeled in 3D
can be shown. The system first extracts the grid lines in the image,
through a set of horizontally and vertically aligned line detectors.
The different edge segments are then linked to two sets of crossing
contours. The crossings are used to generate a complete grid
representation of the pattern. Each grid point or 'node' has at most
4 neighbors which are labeled N, E, S and W. Note that nodes
that are vertices of quadrangles are of special interest, since they
indicate how the square pattern is deformed under projection.
In order to determine the lines with sub-pixel precision, a
snake-like process is introduced. The problem is stated as the
minimization of the energy functional that relates both the
intensity along the grid connections as well as its
smoothness. This makes sure that the grid will move towards
the darker parts of the image and let it more closely follow the
pattern lines.
Correction of the extracted
grid
Ideally, we would now have a complete set
of all the line intersections or nodes, together with their
connections. Of course, the extracted grid will not be perfect
yet. Nodes or their connections may have gone undetected, and
similarly spurious nodes and connections may have been found.
Furthermore, in order to arrive at a 3D reconstruction the
connectivity of the lines has to be known. The important
difference with classical techniques lies in the fact that no
absolute correspondences between the lines in the grid and the lines
in the image are (nor can be) derived. Finding the correct
connectivity between lines is the crux of the matter (see Proesmans
et al. 1996). To that end, iterative procedures mend the initially
extracted grid. Two principles underlie this mending. First,
the grid should look square when viewed from the direction of the
projector. Secondly, the mending should lead to a consistent
numbering for the two sets of grid: whatever path is taken between
two nodes, the grid based city block distance should be the same.
Based on these principles, connections are broken and created.
These procedures allow the system to work its way around local
discontinuities in the grid, e.g. due to sudden changes in
depth. A limitation of the system as described is that the
grid patches should be connected somewhere in the image to be
reconstructed as a single surface.
Extraction of 3D shape
As soon as the grid has been extracted and the connectivity has
been established, the extraction of the 3D shape is quite
straightforward. The grid in the image plane can be modeled as
the cross-section of two sets of planes with the object's
surface. To each node a depth value can be assigned (see
Proesmans et al. 1996). Note that the reconstruction is up to
a scale factor. The knowledge of a single distance between two
points would fix the scale.
As mentioned, the ability to deal with discontinuities is
restricted to connected sets of data. If two or more separated
objects are present in the scene, they will be reconstructed
independently in the sense that their relative positions are not
extracted.
For many applications in visualization, this is
not necessarily a problem, because objects are either shown
separately or the scene is synthesized starting from individual
object models. Discontinuities by self-occlusion need not be a
problem, however. The system recovers from self-occlusions. Of
course, there are gaps in the reconstructions for those parts not
visible to the camera.
It is noteworthy that the system
yields a genuine surface description, not a mere cloud of points.
The connections found between the nodes are used in the description,
that consists of small bilinear patches (the little squares of the
pattern). Thus, problems with retrieving the correct topology are
largely eliminated.
Extraction of surface
texture
With the given setup, shape and texture
extraction could in principle be very simple by just alternating
between images with and without the grid. The former would yield
shape and the latter texture. Alignment would be perfect as the
same, fixed camera takes both images. However, the need for
switching the pattern on and off would complicate system operation.
For one thing, it would require additional provisions to block the
pattern. Moreover, if the alternations should follow each other
quickly and automatically, this would require the synchronization of
this blocking with the image acquisition by the camera. All this
much to the detriment of the simplicity gained by using nothing but
a simple camera and slide projector.
Therefore, a method
was developed (and patented as the rest of the system) that extracts
surface texture with the pattern on. It does away with the need to
alternate between shape and texture frames altogether. The
underlying principle literally is to read between the lines.
This might seem a weird approach at first, but actually the human
visual system solves a similar problem. The blood vessels on our
retina are positioned in front of the photoreceptors. As with
the blind spot, we are hardly aware of these obstacles because they
are filtered out.
NEXT
3D reconstruction from
uncalibrated image sequences
The second technique (Pollefeys et al. 1998; Koch et al. 1998)
starts from an uncalibrated image sequence of a scene. Using
nothing else than the images a textured 3D reconstruction of the
recorded scene is obtained in an automated way. The sequence
can be taken with a simple hand-held video- or photo camera.
The camera need not be calibrated and zoom and focus can be used
freely. The motion is unconstrained and we do not make use of
any reference points. In addition the method is just as easy
to use for small objects as for complete sites. Our method
therefore offers a lot of flexibility and is easy to use. We
have for example obtained a global reconstruction of the whole site
of Sagalassos (Turkey), but also of separate monuments.
These are shown on this site.
This technique could
therefore be an alternative for a wide variety of measurement
techniques used nowadays by archeologists, which range from drawings
and measurements done manually, over traditional photogrammetry to
the computation of a DTM (digital terrain map) from aerial
photographs.
Overview of the
method
In this paragraph a brief overview of 3D
acquisition from multiple views is given. Starting from the
images the knowledge about the scene and the cameras is gradually
increased. At first a partial camera calibration is obtained
together with a sparse reconstruction of the scene. This is
all done using robust algorithms to insure that the algorithms can
work on real images. The next step consists of upgrading the
calibration of the cameras. Imposing some constraints that are
in general valid for standard cameras does this. The next step
consists of determining the complete surface (i.e. not only a few
feature points). This is done through dense correspondence
algorithms. Finally a 3D-model is build which contains the
geometric (i.e. the shape) and photometric (i.e. the color and the
texture) information of the scene. These models can then be
used for visualization or measurement purposes.
In Figure
2 an overview of the method is
shown
. 
The acquisition of the image
sequence
The first step is to take the images
from which the 3D model of the scene will be generated. It is
important to take some restrictions of our method into consideration
at this point. Since the images are to be matched
automatically consecutive images of the sequence shouldn't be too
different. Consecutive viewpoints should be targeted at more
or less the same point and shouldn't be separated by more than 5 to
10 degrees. Illumination conditions are also important.
If shadows are present at the time of the image acquisition, these
will be found back in the final model. Additionally the
reconstruction of parts that are in the shadow will be more
complicate due to the restricted range of the intensity values in
these parts of the images. Therefore the ideal circumstance
for image acquisition is an overcast sky. Quite good results
can also be obtained in other circumstances.
The images
can be taken with different types of cameras. We have already
used video cameras, digital still cameras and standard photo
cameras. In this last case the images can either be scanned in
or put on photo-CD. Typically images of 700x500 were
used. When a higher resolution was available it was used to
get a better texture map for the final model.
Relating the images
A
first step in the reconstruction procedure is to relate the
different images to each other. Therefore it is necessary to
put points from different images into correspondence (i.e. finding
the points which originate from the same 3D point). When the
images are not too different the neighborhood of these points will
have the same appearance and they can therefore be matched through
normalized intensity correlation.
To avoid a combinatorial
explosion of the problem only a restricted amount of so-called
interest points will be considered at this stage. These are
extracted through a special filter which selects corner-like
features from the images (Harris, 1988). In fact these
features are often not real corners, but this filter tends to
extract the same points in consecutive images (i.e. points
corresponding to the same 3D points).
An important concept
to simplify the matching process between two images is the epipolar
geometry. This expresses the fact that a point which is
situated on a specific plane through the two camera centers is bound
to be projected on the intersection lines between that plane and the
respective image planes. These lines are therefore in
so-called epipolar correspondence (i.e. the corresponding point of
every point on an epipolar line in one image should be found back on
the corresponding epipolar line in the other image). All these
epipolar lines pass through the epipole. This point is the
intersection of the image plane with the line through the two camera
positions.
If this geometry is known the matching
complexity is reduced from a 2D (whole image) to a 1D search
(epipolar line). In uncalibrated circumstances the only way to
obtain this epipolar geometry is by computing it from a number of
matches. We are therefore confronted with a chicken and egg
problem. The approach that is followed first obtains a set of
initial matches, then computes some candidate epipolar geometry from
a minimal subset of matches. If this epipolar geometry
receives enough support from the rest of the matches, it is
accepted, else another subset is selected and the procedure is
repeated. Once some satisfactory epipolar geometry has been
obtained it can be used to generate more matches who in turn allow
refining the epipolar geometry (see Torr (1995)).
A first reconstruction
Once the epipolar geometry and a set of corresponding points are
known it is possible to generate a first reconstruction. This
reconstruction is only determined up to a projective
ambiguity. This means that besides not knowing the absolute
position, orientation and scale of the scene some additional
parameters are missing. In fact parallelism and orthogonality
can not be verified at this stage and metric properties like
relative distances and angles can not yet be measured.
This initial reconstruction is only based on the two first
images of a sequence and is therefore far from complete. For
additional images the epipolar geometry (with the previous image) is
determined as described earlier. Then the relative position
and orientation of the camera in the frame determined from the first
two views is determined in a similar way through the already
reconstructed points. At this point the reconstruction is
refined, extended and corrected. A technical description
of this approach can be found in Beardsley et al. (1996).
Obtaining metric
information
The projective reconstruction
obtained in the previous paragraph is clearly not satisfactory for
our purposes since both visualization and measurement applications
require at least a metric reconstruction. Luckily algorithms
have been developed which can upgrade projective reconstruction to
metric reconstructions. The approach presented here is based
on the work described in Pollefeys et al. (1998).
It is
always possible to factorize the cameras in intrinsic (e.g. focal
length) and extrinsic parameters (i.e. position and orientation of
the camera). In the case of a projective reconstruction the
obtained parameters do not have a physical interpretation. In
the metric case however they correspond to actual physical entities
that are subject to some constraints. Self-calibration
therefore consists of finding the projective transformation that
transforms the initial reconstruction into a reconstruction for
which all these constraints are satisfied.
The most
successful methods to achieve this are based on some abstract
geometric concepts. When a camera is moved through space, its
relative position to the plane at infinity and the absolute conic
does not change. This property allows identifying both
entities in projective space. These entities encode the metric
properties of space and are therefore equivalent to a metric
calibration.
Once the metric calibration is retrieved, the
initial reconstruction can be transformed to a metric
reconstruction. This reconstruction is now a true scale model
of the initial scene that was recorded. The absolute scale can
be determined through a single measurement in the scene.
Dense correspondences
At first, only the correspondences for specific feature points
were extracted. In this paragraph it will be discussed how the
correspondences can be determined for almost all the points of the
images. In this way a full surface model of the recorded scene
can be reconstructed. This is not only important for
visualization, but also for measurements since not all the
interesting points are always retrieved as feature points in our
sparse reconstruction.
The epipolar geometry greatly
simplifies the matching problem. As described previously
corresponding points have to be found on each other's epipolar
lines. The matching problem can therefore be solved line by
line.
To obtain efficient algorithms the images are first
warped so that corresponding image rows are also corresponding
epipolar lines. This procedure is called rectification.
The matching proceeds row by row over both images. Not
only the correlation score is taken into account, but a cost is also
included for discontinuities in the surface. Some other
constraints are also taken into account (e.g. uniqueness,
ordering). An optimal solution can then be obtained through
dynamic programming.
The result is a disparity map.
This is an image were the value associated with each pixel indicates
the position of the corresponding pixel in the other. Since at
this point the cameras are calibrated this can immediately be
translated to a dept map. This is an image that contains the
distance from the camera center to the scene along direction
corresponding every pixel. This is often called a 2.5D
representation.
To obtain a more accurate and more
complete depth-information more than two images should be
used. We have implemented an approach, which is described in
detail in Koch et al. (1998). At first the depth maps are
computed pair-wise for the whole sequence. Then, to compute
accurate depths for a specific image, the disparities are used to
create a correspondence chain that goes over multiple images.
The initial depth is refined by taking into account the other
correspondences of the chain using an extended Kalman filter.
The chain is interrupted once the back-projected ray falls outside
the 95% probability bounds.
An important advantage of this
approach is that one can go both backward and forward in the
sequence. This allows avoiding most of the occlusion problems
one typically encounters with stereo. In addition the
inaccuracy of small baseline and the difficulty of matching for wide
baseline stereo are both solved through this technique.
Additionally this multi-viewpoint stereo also allows improving
the quality of the texture. One 3D point is observed in many
images and the texture can therefor robustly be estimated.
Highlights and other artifacts can easily be removed.
Building the model
The 3D-model surface is constructed of triangular surface
patches with the vertices storing the surface geometry and the faces
holding the projected image color in texture maps. The texture
maps add very much to the visual appearance of the models and
augment missing surface detail.
The model building process
is at present restricted to partial models computed from single
viewpoint and work remains to be done to fuse different
viewpoints. Since all the views are registered into one metric
framework it is possible to fuse the depth estimate into one
consistent model surface.
NEXT
Applications
These techniques will not only greatly facilitate the acquisition
of 3D data for existing applications, but will also allow the
appearance of new applications. In this paragraph a few
possible applications of the presented techniques are
described.
3D Measurements
Once
a 3D surface model of an object is obtained, the model can be used
to obtain 3D measurements. Our models are up to scale and can
therefore immediately be used to obtain relative length
measurements. Carrying out a single length measurement in the
real scene allows getting absolute lengths from the model.
Measuring lengths on the 3D model is much simpler than in the real
scene. It is sufficient to click on two points on the model
surface and the computer immediately computes the distance between
the two points. When the model coordinates are aligned with
the world coordinates a simple click on the model surface results in
the absolute coordinates of the indicated point. It is clear
that these procedures result in an important gain of efficiency for
generating plans of the archeological sites.
Virtual Exhibitions
The models delivered by the different techniques presented in
this paper are well suited for virtual exhibitions. The models
have a high visual quality since they are immediately generated from
real images. Complex shapes like statues, masks or pottery can
accurately be modeled. An important advantage of virtual
exhibitions is that it allows placing the object back in their
natural environment. In this respect the method based on
multiple views offers the advantage that the environment can also be
'virtualized' without too much effort.
Virtual Reconstruction
The presented techniques do not only allow acquiring the
3D-model of the remains of a building, but also of the broken down
parts which are retrieved on the site. It becomes therefore
possible to carry out a 'virtual reconstruction'. The
different parts of a monument can be reassembled on the screen of a
computer.
This offers a flexible method to test
reconstruction hypothesis. Additional tools could even be
developed to simplify reconstruction even more. Broken parts
of pillars could for example automatically be matched using standard
registration software (see e.g. Chen and Medioni, 1991).
Annotations
Since
much evidence is destroyed during diggings it is very important for
archaeologists to annotate all the findings. Traditionally a
lot of pictures are taken, many plans and notes are made and the
stratigraphic structure of slices of the terrain is drawn.
An alternative would be to do a 3D recording of a location after
the removal of every stratigraphic layer. In this 3D
reconstruction specific models of pieces of interest could be
inserted at the exact location were they were discovered. The whole
could then be assembled to a complete annotation model were a slider
would allow to travel through time by adding or removing
stratigraphic layers at will.
NEXT
Conclusion
In this paper two techniques for 3D acquisition were
presented. They both heavily rely on state-of-the-art computer
vision and image processing algorithms. Compared to
traditional acquisition techniques the complexity of the acquisition
has been moved from hardware to software. Additionally,
geometric insights have been used to avoid complicate calibration
procedures and provide a maximal flexibility in acquisition.
These different factors make the presented techniques especially
suited for use in archaeology where often non-technical people have
to carry out 3D acquisition in difficult circumstances.
Additionally the presented techniques offer the possibility to
develop new applications or to improve existing approaches. A
few examples of these applications were given in this paper.
NEXT
Acknowledgement +
References
We would like to thank Prof. Dr. Marc Waelkens and his team for
making it possible for us to do experiments on the archeological
site of Sagalassos (Turkey). We would also like to thank Dr.
Andrew Zisserman and his team in Oxford for providing us with robust
projective reconstruction algorithms. The financial support of
the European ACTS project VANGUARD and of the Belgian IUAP project
IMechS are gratefully acknowledged. Marc Pollefeys and Marc
Proesmans both thank the IWT for financial support.
1. Beardsley, P., Torr., P. and Zisserman, A. 1996, ? from
Extended Image Sequences, Proceedings ECCV'96 (European
Conference on Computer Vision).
2. Chen, Y. and Medioni,
G. 1991. Object Modeling by Registration of Multiple Range Images,
Proceedings International Conference on Robotics and
Automation.
3. Harris, C. and Stephens, M. 1988. A
combined corner and edge detector, proceedings Fourth Alvey Vision
Conference, pp.147-151.
4. Koch, R., Pollefeys, M. and Van
Gool, L. 1998. Multi Viewpoint Stereo from Uncalibrated Video
Sequences. Proceedings European Conference on computer Vision,
Lecture Notes in Computer Science, Springer-Verlag.
5.
Koch, R., Pollefeys, M. and Van Gool, L. 1998. Automatic 3D
Model Acquisition from Uncalibrated Image Sequences, proceedings
CGI'98 (Computer Graphics International)
6. Pollefeys, M.,
Koch, R., Vergauwen, M. and Van Gool, L. 1998, Metric 3D Surface
Reconstruction from Uncalibrated Image Sequences, In proceedings
SMILE Workshop, Freiburg, Germany, Lecture Notes in Computer
Science, Springer-Verlag.
7. Pollefeys, M., Koch, R. and
Van Gool, L. 1998. Self-Calibration and Metric Reconstruction
in Spite of Varying and Unknown Internal Camera Parameters,
Proceedings ICCV'98 (International Conference on Computer Vision),
Bombay, 1998.
8. Proesmans, M., Van Gool, L. and
Oosterlinck, A. 1996. One-shot active range acquisition, Proceedings
ICPR'96 (International Conference on Pattern Recognition), Vienna,
Vol. C, pp.336-340.
9. Proesmans, M., and Van Gool, L.
1998. Reading between the lines - a method for extracting dynamic
3D with texture, Proceedings ICCV'98 (International Conference on
Computer Vision), Bombay, 1998.
10. Torr, P. 1995. Motion
Segmentation and Outlier Detection, Ph.D. Thesis, Dept. of
Engineering Science, University of Oxford.
BACK
TO THE START OF THE
ARTICLE