From VSMM98 Conference Proceedings
4th International Conference on Virtual Systems and Multimedia
Field Cinematography Techniques for Virtual Reality
Interval Research Corporation, 1801-C Page Mill
Road, Palo Alto CA 94304 USA
Abstract. "Virtual Reality" (VR)
is practically defined by the requirement of 3D computer models
and realtime control by the user. While such properties afford interactive
navigation and manipulation, the imagery is relatively simplistic
or cartoon-like. Cinema is the opposite, with little or no interactivity
but with "photo-realistic" imagery. Most of the efforts around VR
have originated from the computer culture. This paper describes
complementary efforts from the cinema culture, including techniques,
case studies, challenges, and social implications.
1. Virtual Reality and Cinema
1. "Be Now Here" in Creative Time's "Art in the Anchorage" exhibition,
New York, 1997.
(photo: T. Westenberger)
1.1 Aesthetics of Telepresence
Much of the historical work in representation
has concentrated on conveying a sense of place. We can trace one
strand relating to visual representation from landscape and mural
paintings, to the panoramas and cycloramas of the nineteenth century,
to the special-venue cinema formats of today such as Imax, Showscan,
and CircleVision. These formats exploit high spatial and temporal
resolution, wide-angle and surround fields of view, multitrack audio,
and 3D stereoscopy. Their goal is to convey a sense of "being there,"
that is, of telepresence.
The "there" in most cinema is an actual physical
place, since the nature of cameras is to record whatever is in front
of the lens (even if that is the contrived environment of a studio).
The aesthetics of cinema are biased toward representations of actual
places; imaginary places must be created with additional work (i.e.,
special effects). Furthermore, virtually all cinema is linear and
non-interactive; the technology of film makes it difficult to give
the audience any control.
Hence, the aesthetics of telepresence from the
point of view of cinema are geared to present high sensory realism
in images of physical rather than imaginary places, with little
or no interactivity.
1.2 Cinema Culture and Computer Culture
The aesthetics surrounding the computer culture have traditionally
taken a contrasting point of view: low sensory realism in images
of imaginary rather than physical places, with interactivity as
a key element.
People often associate computer graphics with simplistic or cartoonlike
imagery. Developers must build computer-graphics models from scratch,
using drawing and painting tools as well as libraries of primitive
shapes and textures. Techniques such as photographic texture mapping
help such models to approach photorealism, but the imagery still
falls far short of conventional camera-based images. Current work in
image-based rendering shows promise but also demonstrates the complex
and problematic challenge of making 3D computer models from images.
Also, since such computer models must be built up from nothing,
it is generally easier to make imaginary and fantasy places than
to model physical environments. Hence, the aesthetics of the computer
culture have tended more toward the imaginary and fantasy, whereas
the cinema culture, particularly the documentary cinema culture,
has tended toward the actual world.
A more subtle difference between the computer and cinema cultures
is that a person must spend many hours in front of a workstation
to make computer models, whereas filmmakers must interact directly
with environments and people.
Many people who work with computers think that interactivity is
a critical characteristic, for both navigation and manipulation
(e.g., when the user specifies, "move that chair to the right").
Sensory and physical realism is secondary. People who work in cinema
often have the opposite priorities.
2. Cinematic Techniques for Interaction and
Realness, or sensory realness, or
photorealism is relative to representation. In theory, we could
apply a photorealism Turing test to types of imagery: We would ask
viewers whether the representation is indistinguishable from its
subject. The problem is that virtually all current forms of visual
representation would fail. Even 3D Imax images are obviously still
only a movie, compared to reality. The human eye is extremely difficult
to fool. (The ear is much more fallible; we've all mistaken a voice
on the radio for the voice of a person present in the room.)
Consider the current range of photorealism for
dynamic visual representation: web- and MPEG- level video, broadcast-quality
video, theater-quality (e.g., 35mm) film, and special-venue (e.g.,
70mm and multiple-screen formats) film. Imax is advertised as having
10 times the resolution of standard 35mm film, and several special-venue
formats have twice the resolution of standard Imax. To make
an ultimate CAVE, with four walls, ceiling, and floor all in Imax-quality
stereo, would require 12 times Imax resolution, or 120 times 35mm
film. Such a level of photorealism is orders of magnitude greater
than low-end formats.
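The arithmetic behind these figures can be spelled out. The sketch below takes the advertised 10:1 ratio at face value and assumes that each of the six CAVE surfaces needs one Imax-quality image per eye; the variable names are illustrative only.

```python
# Relative resolution, in units of standard 35mm film (advertised figure).
IMAX_VS_35MM = 10          # "10 times the resolution of standard 35mm film"

surfaces = 6               # four walls, ceiling, and floor
eyes = 2                   # stereo needs one image per eye (assumption)

cave_vs_imax = surfaces * eyes
cave_vs_35mm = cave_vs_imax * IMAX_VS_35MM

print(cave_vs_imax, cave_vs_35mm)
```

Under these assumptions, the factor of 12 comes entirely from counting surfaces and eyes, so the per-surface quality is still that of a single Imax image.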
It is also noteworthy that the degree of perceived
realness is usually correlated with quality of content. When a presentation
is compelling, it seems real. Conversely, higher resolution does
not automatically make a presentation more convincing. The relationship
between photorealism of form and quality of content is complex.
Panoramas are generally regarded as wide-field
images; often, they represent a complete 360-degree field of view
(FOV). A panorama represents a single point of view and is by definition
two dimensional. Panoramas allow a viewer to look around (angular
movement, i.e., panning and tilting), but not to move around (lateral
movement, i.e., dollying and tracking).
We can make panoramic photographs using a single
lens (such as a fisheye lens, a lens paired with a convex mirror, or
a lens with a rotating slit mechanism). We can also tile together multiple
images, but if they are not all taken from a single point of view (i.e., the
nodal point of the camera), then distortion is inevitable.
Panoramic imagery offers limited navigational
interactivity. A viewer can pan and tilt through a panoramic scene,
but can neither move laterally nor manipulate the imagery.
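The look-around-only nature of a panorama can be illustrated with a small lookup. This is a minimal sketch that assumes an equirectangular image (360 degrees wide, 180 degrees tall), which is one common storage layout and not necessarily the one used in the work described here; the function name and conventions are hypothetical.

```python
def pan_tilt_to_pixel(pan_deg, tilt_deg, width, height):
    """Map a viewing direction to (column, row) in an equirectangular panorama.

    Assumes pan 0 is at the left edge of the image and tilt +90
    (straight up) is at the top row.
    """
    pan = pan_deg % 360.0                      # panning wraps around
    tilt = max(-90.0, min(90.0, tilt_deg))     # tilting is clamped at the poles
    col = int(pan * width / 360.0) % width
    row = min(height - 1, int((90.0 - tilt) * height / 180.0))
    return col, row


print(pan_tilt_to_pixel(370, 0, 3600, 1800))
```

The function takes two angles and nothing else: there is no parameter for the viewer's position, because a single panorama holds no information about any other point of view.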
2. Le Cinéorama, a 10-screen panoramic film
theater by Raoul Grimoin-Sanson, Paris, 1900.
(gravure: La Nature)
Moviemaps offer another kind of limited navigational
interactivity. They are filmed by stop-frame cameras that move along
a path and are triggered by distance traveled (typically measured
by a sensor attached to a wheel), rather than by time. Because the
frames are evenly spaced along the path, playback at a constant
frame rate yields a constant apparent speed, which is often impractical
or impossible to achieve during production with conventional
(time-triggered) movie cameras. The result is
the transfer of speed control from the producer to the viewer, who
controls the frame rate through an input device such as a joystick.
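A minimal sketch of distance triggering follows. It is an idealized model, not the actual camera controller: it assumes the carriage position is sampled finely enough that each exposure happens exactly at its trigger distance, and the function names and units (metres) are illustrative.

```python
def distance_triggered_positions(cart_positions, spacing):
    """Return the path positions at which a frame is exposed.

    cart_positions: distance travelled over time, non-decreasing.
    spacing: distance between exposures, in the same units.
    """
    frames = []
    next_trigger = 0.0
    for pos in cart_positions:
        # Fire once for every trigger distance the carriage has passed.
        while pos >= next_trigger:
            frames.append(next_trigger)
            next_trigger += spacing
    return frames


def apparent_speed(frame_rate, spacing):
    """Apparent travel speed during playback (distance units per second)."""
    return frame_rate * spacing


# The carriage moves unevenly, yet the frames are evenly spaced along the path.
uneven_motion = [0.0, 0.3, 0.4, 2.1, 2.2, 5.0]
print(distance_triggered_positions(uneven_motion, spacing=1.0))
print(apparent_speed(frame_rate=15, spacing=1.0))
```

Because frame spacing is fixed at recording time, apparent speed depends only on the playback frame rate, which is the quantity the viewer's input device controls.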
In addition to speed control, limited control
of direction is possible if registered turns are filmed at intersections.
With match-cutting between a straight sequence and a turn sequence,
the user can "turn" from one route to another. The developer must
be careful to minimize visual discontinuities, such as sun position
and object (e.g., cars and people) transience. The goal is to make
the cuts appear seamless.
Moviemaps are "look-up" media, where all possible
views are pre-recorded and accessed via computer. Hence, they offer
only limited navigability: you can view only images that have been
pre-recorded (i.e., you cannot leave the paths). Like panoramas,
moviemaps limit interaction to navigation; viewers cannot manipulate
the imagery.
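The "look-up" character of a moviemap, including turns at intersections, can be sketched as a table of pre-recorded sequences. Everything here (route names, frame IDs, the `play` function) is hypothetical and stands in for frames stored on a random-access medium.

```python
# Hypothetical moviemap: each sequence is a list of pre-recorded frame IDs.
SEQUENCES = {
    "main_st": ["m0", "m1", "m2", "m3"],
    "turn_main_to_oak": ["t0", "t1"],
    "oak_ave": ["o0", "o1", "o2"],
}
# Registered turns: (sequence, frame index) -> (turn sequence, next route).
TURNS = {("main_st", 2): ("turn_main_to_oak", "oak_ave")}


def play(route, start, steps, turn_requested=False):
    """Return the frames shown while stepping forward along a route."""
    shown, seq, i = [], route, start
    for _ in range(steps):
        shown.append(SEQUENCES[seq][i])
        if turn_requested and (seq, i) in TURNS:
            turn_seq, next_route = TURNS[(seq, i)]
            shown.extend(SEQUENCES[turn_seq])   # match-cut into the turn
            seq, i = next_route, 0
            turn_requested = False
        elif i + 1 < len(SEQUENCES[seq]):
            i += 1                              # stay on the recorded path
        else:
            break                               # end of sequence: nothing more to show
    return shown


print(play("main_st", 0, 10))
print(play("main_st", 0, 10, turn_requested=True))
```

Every frame the viewer can ever see is already in the table; a turn is available only where one was filmed and registered.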
2.4 Stereoscopy and Multiple Perspectives
Stereoscopy, the sense of 3D that we get when
we perceive a scene through both eyes, requires two unique points
of view, one for each eye. People make stereoscopic photographs
and cinema using two lenses (and often two cameras) typically separated
by the normal human interocular distance. Two separate images must
be recorded and kept synchronized from recording to playback to
give the viewer the sense of 3D.
In theory, stereoscopy is successful only if the
viewer's head is not allowed to move, because it represents
only two points of view. If head motion is allowed, every new perspective
encountered must be displayed. Although a system that can provide
all these views has been demonstrated in a limited way with pre-recorded
imagery, creating one is problematic because every possible
point of view cannot be filmed. By contrast, unlimited navigation is
possible with 3D computer models.
2.5 Orthoscopic Displays
Many VR displays combine three important sensory elements for
a maximum sense of presence: wide-angle FOV (for immersion), stereoscopy
(for 3D), and orthoscopy (for proper scale), often called wide-angle
ortho-stereo. These displays fall into two main groups:
special-venue film formats (such as Imax Solido, 3D Imax, and Showscan
3D) and VR displays (such as head-mounted displays [HMDs] and CAVEs).
The film formats provide ultra-high resolution and group viewing
but are not interactive, whereas the HMDs and CAVEs are lower resolution
but allow the possibility of multiple perspectives through head-tracking
and 3D computer models.
3. Case Studies - "See Banff" and "Be Now Here"
3.1 See Banff
3. Johnston Canyon trail near Banff.
(stereo pair from See Banff - view cross-eyed)
See Banff! is a unique stereoscopic moviemap that
grew out of an exploration of field recording for VR. We used
two stop-frame 16mm film cameras with wide-angle lenses, mounted
for stereoscopy on a "baby jogger" carriage, with an optical encoder
attached to one of the wheels to trigger the cameras at programmable
intervals.
The imagery was recorded entirely in the field,
outdoors in the Canadian Rocky Mountain region surrounding Banff,
Alberta. As is sometimes done in documentary film, we made no attempt
to control lighting or action: The goal was to record the
environment as it is.
In addition to recording the beauty of the landscape,
documenting the proliferation of tourists was an integral part of
the intention. As we worked in the field and interacted with both
local residents and tourists, it became apparent that there was
lively controversy surrounding tourism and growth; this dialog was
part of the experience of being in Banff. Aesthetically, conceptually,
and technically, having tourists appear in the foreground and the
landscape in the background added a strong sense of depth and presence.
Over 100 paths were recorded during a 6-week period.
4. "See Banff!" camera rig and kinetoscope playback
(photos: L. Psihoyos and M. Naimark)
The display system for See Banff! mimicked a 100-year-old
cinema viewing device: the kinetoscope. It reproduced the limitations
of the old stereoscopic viewing systems by using a stationary eye-hood
that prevented the viewer's free head motion, while providing
nearly orthoscopic optics. The display also conformed to the limitations
of the one-dimensional travel along the paths by providing only
a one-dimensional user input device: a crank on the side of the
system. The crank employed a force-feedback brake that would freeze
at the beginning and end of each sequence. The user selected the
sequence to view by manipulating a lever near the eye-hood. The
prerecorded material was stored on a single laserdisc using field-sequential
stereo and LCD shutter optics.
Hence, the See Banff! kinetoscope provided a broadcast-quality
video wide-angle ortho-stereo viewing experience with one-dimensional
navigational control for a single user. It could not provide unlimited
navigation or any form of manipulation of the imagery.
3.2 Be Now Here
5. Orlando Column in Dubrovnik, covered for protection
(stereo pair from Be Now Here - view cross-eyed)
Be Now Here (Welcome to the Neighborhood) is a
unique stereoscopic panorama. We used two full-motion 35mm film
cameras mounted for stereoscopy on a motorized tripod that rotated
at 1 revolution per minute (rpm). To enhance the sense of telepresence,
we ran the cameras at 60 frames per second (fps), rather than the
standard 24 fps, and employed wide-angle (60-degree horizontal FOV)
lenses.
The imagery was recorded in public plazas in the
four cities designated "In Danger" by the UNESCO World Heritage
Centre: Jerusalem, Dubrovnik (Croatia), Timbuktu (Mali), and Angkor
(Cambodia). The intention was to record these beautiful and
The production concept was simple: Find in one
public plaza in each city a single spot that best represents each
place, then film several panoramas from that spot during the course
of the day without moving the camera system. The multiple times
of day would be perfectly registered and allow seamless intercutting,
with only the lighting and transient objects changing.
Due to the potentially hazardous and controversial
nature of the project, production was as inexpensive, fast, and
quiet as possible, relying on prearranged local staff at each site
and help from UNESCO to cross borders with the 500 pounds of film
gear. Through a great deal of planning and collaboration, all four
sites were filmed in 1 month. ("Highlights" included a bomb scare
in Jerusalem resulting in evacuation of the entire plaza, a drive
through a strip of Bosnia in the middle of the night during wartime,
a negotiation with Tuareg camel drivers in Timbuktu about issues
of appropriation, and a bribe to get the hired driver out of jail
after he had gotten lost after dark and had been found by the Cambodian
military.) Miraculously, all the footage survived.
Working with local collaborators was a critical
element in ensuring the quality of imagery. The selection of the
sites was heavily informed by local knowledge. More important, filming
in the middle of public plazas is a conspicuous activity, and the
fact that local collaborators knew many of the people in the plaza
helped to make everyone feel comfortable. Local people, particularly
children, didn't appear self-conscious; they simply did what
they would normally do in such places.
6. "Be Now Here" camera rig and installation.
(photos: G. Tassé and C. Dohrmann)
The display system for Be Now Here employed a
large (12- by 16-foot) front-projection screen capable of maintaining
polarization, two video projectors driven by laserdisc players, four-channel
surround audio, and a simple input pedestal that allowed a user
to choose the location as well as the time of day. The input pedestal
was positioned at the orthoscopically correct point for a 60-degree
FOV of the screen.
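The orthoscopically correct viewing point follows from simple trigonometry: the screen must subtend the same horizontal angle at the viewer's eye that the lens covered at the camera. The sketch below assumes a flat screen and that the 16-foot dimension is its width; the function name is illustrative.

```python
import math


def orthoscopic_distance(screen_width, horizontal_fov_deg):
    """Distance at which a flat screen subtends the given horizontal FOV."""
    half_angle = math.radians(horizontal_fov_deg / 2.0)
    return (screen_width / 2.0) / math.tan(half_angle)


# Assumption: the screen is 16 feet wide (and 12 feet high).
distance_ft = orthoscopic_distance(16.0, 60.0)
print(round(distance_ft, 2))
```

Under that assumption, the viewing point lies a little under 14 feet from the screen, on its center line.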
We recreated the sense of camera rotation by rotating
the entire floor in sync with the imagery. A 16-foot diameter rotating
floor was used as the viewing platform, with the input pedestal
in the center. This space was totally dark except for the screen,
resulting in a strong visceral illusion: Viewers believed that the
screen was rotating around them, rather than that they themselves
were rotating. The effect is similar to the feeling of motion that
you get when you are sitting in a stationary train in the station
and an adjacent train begins to move.
After several public and private screenings, it
was apparent that the 1-rpm rotation of the floor was too fast for
some people to ignore. Tests suggested that, at 0.5 rpm, almost
nobody would feel dizzy, but the illusion of a rotating screen would
remain intact. The floor was slowed and the laserdiscs were remastered
at one-half the original speed (30 fps). An unintentional result
was that all motion in the images (of people, animals, and
vehicles) was now in "slow motion," making the representation
less real and more abstract, an effect that could be construed
as more arty and less techy. Nevertheless, most viewers reported
a more compelling immersive experience.
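The trade-off can be checked numerically. This sketch takes the stated figures at face value (60 fps capture, 1 rpm rotation, 60-degree FOV) and treats playback as a simple frame rate; the function and its field names are illustrative.

```python
def panorama_stats(rpm, playback_fps, capture_fps=60, fov_deg=60.0):
    """Derive rotation and playback figures for a rotating panorama."""
    seconds_per_rev = 60.0 / rpm
    frames_per_rev = playback_fps * seconds_per_rev
    return {
        "frames_per_rev": frames_per_rev,
        "degrees_per_frame": 360.0 / frames_per_rev,
        "seconds_to_cross_fov": fov_deg / (360.0 / seconds_per_rev),
        "motion_speed": playback_fps / capture_fps,  # 1.0 means real time
    }


original = panorama_stats(rpm=1.0, playback_fps=60)
slowed = panorama_stats(rpm=0.5, playback_fps=30)
print(original)
print(slowed)
```

Halving both the floor speed and the frame rate leaves the number of frames per revolution unchanged, so the imagery stays registered with the floor, but everything recorded in the scene plays at half speed.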
Be Now Here provided a twice-broadcast-quality
video wide-angle ortho-stereo viewing experience with limited navigational
control (discrete choice of place and time) for group viewing. Like
See Banff!, it could not provide unlimited navigation or any form
of manipulation of the imagery.
4. The Challenge of Converging Cinema and Computing
4.1 Dimensionalization: Making 2D into 3D
Dimensionalization is making a 3D model
from one or more 2D images. Image-based rendering may fulfill
a common dream in many VR circles: to wave a camera around an actual
place and to end up with a 3D computer model. But, as we are learning,
many obstacles prevent us from realizing this fantasy.
Perhaps the most difficult problem is how to resolve
occlusions, the "holes" that are left after we aggregate
all 2D images into a 3D model. Simply put: How do we fill in a blank
when we have no information? In many classes of imagery, occlusions
are inevitable, even with many 2D views, such as of forests, crowds,
street scenes, or almost any complex and unstructured environment.
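A one-dimensional toy example shows where such holes come from. It forward-warps a single scanline with known depths to a laterally shifted viewpoint; this is a generic depth-based reprojection chosen for illustration, not a method described in this paper, and all names and values are made up.

```python
def reproject_scanline(colors, depths, shift):
    """Forward-warp one scanline to a laterally shifted viewpoint.

    colors, depths: per-pixel values from the original view.
    shift: baseline times focal length, so that a pixel at depth z
           moves by round(shift / z) pixels.
    Returns the new scanline, with None where no source pixel landed.
    """
    width = len(colors)
    out = [None] * width
    out_depth = [float("inf")] * width
    for x, (c, z) in enumerate(zip(colors, depths)):
        nx = x + round(shift / z)          # nearer pixels move farther (parallax)
        if 0 <= nx < width and z < out_depth[nx]:
            out[nx], out_depth[nx] = c, z  # z-buffer: nearest surface wins
    return out


# A near object ("F", depth 1) in front of a far background ("b", depth 4).
colors = list("bbbFFbbbbb")
depths = [4, 4, 4, 1, 1, 4, 4, 4, 4, 4]
warped = reproject_scanline(colors, depths, shift=4)
print(warped)
```

The `None` entries in the middle of the result mark background that the near object hid in the original view. The source image holds no data for those pixels, so any fill is a guess.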
Another question is for what purposes 3D models
are necessary. Several different 2D panoramic formats currently
exist on the web, including QuickTime VR, PhotoBubbles, and IPIX.
Since panoramas represent a view from only one point in space, they
are relatively easy to record. As 2D databases, they require much
less storage than a comparable 3D database. But they allow only
angular, rather than lateral, navigation, and they afford no manipulation.
For some applications, panoramas alone may provide sufficient virtual
4.2 Segmentation - Making Non-Semantic "Models" into Semantic Models
Segmentation, adding higher-level
or semantic knowledge to an image, can be done by hand or
by computer. One particular class of 3D visual databases consists
of only points in space, and includes no semantic knowledge by the
system. This class includes light-fields and 3D images made with
depth-maps. Standard 3D computer models, built from primitives,
are semantic models: The system "knows" the contents of the database.
Semantic models are required for any kind of interactive manipulation
of the imagery. Light-field and depth-map 3D databases are nothing
but "clouds" of pixels (whether they are even "models" is debatable).
As such, they allow unlimited navigation but no manipulation.
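The distinction can be made concrete with two toy data structures. This is a minimal sketch with invented names and coordinates: the point cloud carries positions and colors only, while the semantic scene groups the same points into named objects that the system can address.

```python
from dataclasses import dataclass

# Non-semantic: a bare cloud of colored points; nothing says which
# points form a chair.
point_cloud = [
    (0.0, 0.0, 2.0, "brown"),
    (0.1, 0.0, 2.0, "brown"),
    (1.5, 0.0, 3.0, "grey"),
]


@dataclass
class SceneObject:
    name: str
    points: list  # (x, y, z, color) tuples belonging to this object

    def move(self, dx, dy=0.0, dz=0.0):
        """Translate every point of this object."""
        self.points = [(x + dx, y + dy, z + dz, c) for x, y, z, c in self.points]


# Semantic: the same points, segmented (here by hand) into named objects.
scene = {
    "chair": SceneObject("chair", point_cloud[:2]),
    "wall": SceneObject("wall", point_cloud[2:]),
}

scene["chair"].move(dx=0.5)   # "move that chair to the right"
print(scene["chair"].points)
```

With only `point_cloud`, the request to move the chair has no referent; the segmentation step is what supplies it.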
Like full 3D models, semantic models may not be
necessary for all applications. Real-world precedents exist for
viewers to enjoy navigation without manipulation (through nature
trails, ancient ruins, religious temples, and so on).
4.3 Automation or Human Intervention
Making 2D into 3D (dimensionalization) and making
non-semantic into semantic models (segmentation) are both possible
when there are humans in the loop. Some of the processes have been
automated and some will be automated. But it's not clear that all
the decisions required should be automated.
Much of the human labor today associated with
dimensionalization and segmentation is by default: The work is neither
desirable nor enjoyable, but automation doesn't exist. The Hollywood
special-effects community relies on such human intervention. Clearly,
such processes would be best automated.
There is, however, another class of decisions
for which human intervention is desirable, particularly regarding
segmentation and semantic modeling. In some cases, determining what
are the "most important" elements in a scene is a matter of human
expression and of art.
4.4 Immersive Virtual Environments
The two most prominent immersive virtual environments
today are created with HMDs or with CAVEs. Both techniques are problematic.
HMDs are relatively low resolution and encumbering; CAVEs require
several projectors and space. Because of head-tracking, both HMDs
and CAVEs are optimized for only one user, even though CAVEs can
comfortably accommodate several other viewers.
Before we can have high-quality, ubiquitous immersive
virtual environments, we need to overcome several technological
hurdles. For example, we need high display resolution and brightness,
no viewer encumbrance, and accurate head tracking.
4.5 Cameras of the Future
Although we may be able to do a limited amount
of VR conversion from pre-existing images, cameras used for most
VR applications today fall into two categories: (1) standard, mass-produced
cameras that have been modified, specially mounted, or instrumented;
or (2) extremely heavy, expensive contraptions, such as CyberScans,
motion-control rigs, or time-of-flight lasers.
No one has designed an inexpensive camera specifically
for VR applications. A golden opportunity exists here.
5. Work in the Real World
The state of the world is precariously uneven in terms of resources.
Although many may believe that computers will save the world, North
American scientists have access to almost 40% of the world's R&D
investment, while the entire continent of Africa has only 0.5%,
and fewer than 10% of the children of the world will have access
to computers and the Internet by the year 2000.
The state of the world is also unimaginably rich in terms of culture.
VR can be an important communications medium for world culture,
but only if those of us lucky enough to have access to the tools
are sensitive enough to work with and learn from local expertise.
If not, the loss will ultimately be ours.
References
 M. Naimark, Realness and Interactivity. In: B. Laurel
(ed.), The Art of Human Computer Interface Design. ISBN 0-201-51797-3.
Addison Wesley, Reading, MA, 1990, pp. 455-459.
 M. Naimark, Expo '92 Seville, Presence vol. 1
no. 3 (1992) 364-369.
 S. S. Fisher, Viewpoint Dependent Imaging: An Interactive
Stereoscopic Display, SPIE vol. 367 (1982) 41-45.
 E. M. Howlett, Wide-Angle Orthostereo, SPIE vol.
1256 (1990) 210-223.
 M. Naimark, A 3D Moviemap and a 3D Panorama, SPIE
vol. 3012 (1997) 297-305. (online at www.interval.com)
 M. Naimark, Field Recording Studies. In: M. A. Moser
and D. MacLeod (eds.), Immersed in Technology. ISBN 0-262-13314-8.
MIT Press, Cambridge, MA, 1996, pp. 299-302. (online at www.interval.com)
 M. Naimark, Trip Reports from the Be Now Here production.
 M. Naimark, What's Wrong with this Picture? Presence and
Abstraction in the Age of Cyberspace. In: Roy Ascott (ed.), Consciousness
Reframed: Art and Consciousness in the Post-biological Era. ISBN:
1 899274 03 0. University of Wales College, Newport, 1997. (online
 F. Mayor, Science and Power: A New Commitment for the 21st
Century, UNESCO Director General's address to the Association
for the Advancement of Science, Washington, D.C., 25 June 1998
 N. Negroponte, 2b1 Foundation Mission Statement (see