MIT Media Lab Perceptual Computing / Learning and Common Sense Technical Report, Dec.

The "Inverse Hollywood Problem":
From video to scripts and storyboards via causal analysis

Matthew Brand
The Media Lab, MIT
Ames Street, Cambridge, MA, USA
brand@media.mit.edu, www.media.mit.edu/~brand

Abstract

We address the problem of visually detecting causal events and fitting them together into a coherent story of the action witnessed by the camera. We show that this can be done by reasoning about the motions and collisions of surfaces, using high-level causal constraints derived from psychological studies of infant visual behavior. These constraints are naive forms of basic physical laws governing substantiality, contiguity, momentum, and acceleration. We describe two implementations. One system parses instructional videos, extracting plans of action and key frames suitable for storyboarding. Since learning will play a role in making such systems robust, we introduce a new framework for coupling hidden Markov models and demonstrate its use in a second system that segments stereo video into actions in near real-time. Rather than attempt accurate low-level vision, both systems use high-level causal analysis to integrate fast but sloppy pixel-based representations over time. The output is suitable for summary, indexing, and automated editing.

Appears in Proceedings of AAAI, Providence, RI. © AAAI. All rights reserved.

Introduction

A useful result from a vision system would be an answer to the question, "What is happening?" This is a question about causality: What are the events, and how do earlier ones cause or enable later ones? We are exploring the hypothesis that causal perception rests on inference about the motions and collisions of surfaces, and proceeds independently of processes such as recognition, reconstruction, and static segmentation. In this paper we present two computational models of this process, one heuristic, one probabilistic and trainable, that incorporate psychological models of causal event perception in infants. These systems use causal landmarks to segment video into actions, and higher-level causal constraints to ensure that actions are consistent over time. Each system takes a video sequence of manipulative action as input, and outputs a plan-of-action and selected frames showing key events, the "gist" of the video, useful for summary, indexing, reasoning, and automated editing. Gisting may be thought of as the "inverse Hollywood problem": begin with a movie, end with a script and storyboard.

Related vision work

Early approaches to action understanding emphasized reconstruction followed by analysis; lately attention is turning to applying causal constraints directly to motion traces. Kuniyoshi & Inoue and Ikeuchi & Suehiro described systems that recognize actions in assembly tasks with simple geometric objects, e.g., blocks. These systems were intended as front ends for robotic pick-and-place mimicry and emphasized scene geometry, taking somewhat ad hoc approaches to causality and action.

Presently there is a growing literature in gesture recognition from motions (Essa), with an emphasis on classification rather than interpretation of structured activity. Siskind & Morris blurs this distinction somewhat by using Markov models to classify short sequences of individual motions as throwing, dropping, lifting, and pushing gestures, given relative velocity profiles between an arm and an object. Mann, Jepson, & Siskind present a system that analyzes kinematic and dynamic relations between objects on a frame-by-frame basis. The program finds minimal systems of Newtonian equations that are consistent with each frame, but these are not necessarily consistent over time, nor do they mark causal events. All of these systems require both a priori knowledge of the scene (e.g., hand-segmentation of event boundaries or objects) and limited scenes (e.g., white/black backgrounds, specific camera views, and constraints on the shapes and colors of objects). In contrast, the methods described in this paper emphasize continuous action parsing, integration of information over time, constraints derived from psychological experiment, meaningful output, and general vision (e.g., the background may be cluttered and objects may be textured, irregular, and flexible).

Psychology of motion causality

Vision sciences traditionally take high-level vision to be concerned with static properties of objects, typically their identities, categories, and shapes. The relationships between these properties and visual features are correlational, leading to many proposals
for how brains and computers may compute optimal discriminators for various sets of images.

Arguably, causal dynamic properties of objects and scenes are more informative, more universal, and more easily computed. These properties (substantiality, solidity, contiguity, inertia, and conservation of momentum) are governed by simple physical laws at human scales and are thus consistent across most of visual experience. The fact that these properties are causal suggests that a small number of qualitative rules may provide satisfactory psychological and computational accounts of much of visual understanding.

Indeed, there is a growing body of psychological evidence showing that infants are fluent perceivers of lawful causality and violations thereof. Spelke and Van de Valle found that infants aged ... to ... months will detect a wide range of apparent violations of the causality of motion (Spelke & Van de Valle). They propose that three basic principles are active in motion understanding by late infancy:

The principle of contact equates physical connectedness with causal connectedness: No action at a distance; no contact without action.

The principle of cohesion equates object integrity with individuality: No splitting; no fusing. This guarantees that individuality boundaries remain stable over time unless a series of causal events combines two objects into one (e.g., via attachment) or splits one into two (e.g., via detachment).

... could extract key events from "how-to" videos of the sort that demonstrate procedures for assembling furniture, installing CD-ROMs, etc. The input is a video of an object being assembled or disassembled. The output is a script describing the actions of the "repairman" plus key frames that highlight important causal events.

From visual events to causal events

The gister reasons about changes in the integrity and motions of a single foreground blob, a connected map of image pixels that change due primarily to motion. The blob is obtained from a real-time vision system developed by Wren et al. Discontinuities in the blob's visual behavior signal changes of causality. For example, if the blob has a boundary discontinuity such as sudden swelling at one point, there is an apparent violation of the cohesion constraint, explicable via the contact constraint: An agent has attached an object and set it in motion, causing its pixels to join the blob. Cohesion is violated because the agent fuses with the object. Many visual discontinuity events have causal significance, including:

visual event     disrupted causality   explanatory causality
appearance       contact               animacy
disappearance    contact               animacy
inflation        cohesion              contact
deflation        cohesion              contact
flash            cohesion              contact
acceleration     contact               animacy
discontinuity    continuity
...
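
To make the mapping from visual events to causal interpretations concrete, the following is a minimal Python sketch, not the gister's actual implementation. It assumes the row pairing of the table above (the explanatory entry for the last row is not recoverable from the source and is left as `None`). The names `CAUSAL_TABLE`, `area_events`, and `explain`, the use of blob pixel counts as input, and the 25% `jump` threshold are all hypothetical choices made for illustration.

```python
from typing import List, NamedTuple, Optional, Tuple


class Interpretation(NamedTuple):
    disrupted: str               # causal principle apparently violated
    explanatory: Optional[str]   # principle that explains the violation


# Rows follow the table in the text. The explanatory entry for the
# last row is missing from the source, so it stays None (assumption).
CAUSAL_TABLE = {
    "appearance": Interpretation("contact", "animacy"),
    "disappearance": Interpretation("contact", "animacy"),
    "inflation": Interpretation("cohesion", "contact"),
    "deflation": Interpretation("cohesion", "contact"),
    "flash": Interpretation("cohesion", "contact"),
    "acceleration": Interpretation("contact", "animacy"),
    "discontinuity": Interpretation("continuity", None),
}


def area_events(areas: List[float], jump: float = 0.25) -> List[Tuple[int, str]]:
    """Label frames where the blob's pixel count changes abruptly.

    Hypothetical heuristic: a relative change larger than `jump`
    between consecutive frames counts as inflation or deflation.
    """
    events = []
    for t in range(1, len(areas)):
        prev, cur = areas[t - 1], areas[t]
        if prev == 0 and cur > 0:
            events.append((t, "appearance"))
        elif prev > 0 and cur == 0:
            events.append((t, "disappearance"))
        elif prev > 0:
            change = (cur - prev) / prev
            if change > jump:
                events.append((t, "inflation"))
            elif change < -jump:
                events.append((t, "deflation"))
    return events


def explain(event: str) -> str:
    """Render the causal reading of a visual event as a sentence."""
    interp = CAUSAL_TABLE[event]
    text = f"{event}: apparent violation of {interp.disrupted}"
    if interp.explanatory is not None:
        text += f", explicable via {interp.explanatory}"
    return text


if __name__ == "__main__":
    blob_areas = [0, 100, 104, 160, 158, 90, 0]  # made-up pixel counts
    for frame, event in area_events(blob_areas):
        print(frame, explain(event))
```

In this sketch a sudden swelling of the blob is labeled as inflation and read as a cohesion violation explained by contact, which mirrors the attachment example in the text. A real system would also need the blob's boundary and motion, not only its area.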