CS colloquium talk: Joint Language-Vision Inference in Machines and Humans

Date: Tuesday, December 12, 2017 - 3:30pm
Location: HFH 1132
Title: Joint Language-Vision Inference in Machines and Humans
Speaker: Jeffrey Mark Siskind
Host: William Wang

Jeffrey Mark Siskind
School of Electrical and Computer Engineering
Purdue University

I will present several frameworks for performing joint inference across
vision, language, and motor control.  The first is a unified cost function
relating video, sentences, and a lexicon.  Multidirectional inference over
this single cost supports video captioning (producing sentences from video
and a lexicon), video retrieval (searching for video given a sentential
query and a lexicon), and language acquisition (learning a lexicon from
sententially annotated video).
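
To make the one-cost, many-directions idea concrete, here is a minimal
Python sketch (not the speaker's actual model): a video is reduced to a
feature vector, a sentence to a bag of words, and the lexicon to a
word-to-feature-vector map, and the same cost drives captioning, retrieval,
and acquisition.  All names and data below are invented.

    # Hypothetical toy: one cost relating a video (a feature vector), a
    # sentence (a bag of words), and a lexicon (word -> feature vector).
    import numpy as np

    def cost(video_feat, sentence, lexicon):
        """Sum of squared distances between each word's meaning and the video."""
        return sum(np.sum((lexicon[w] - video_feat) ** 2)
                   for w in sentence.split())

    rng = np.random.default_rng(0)
    videos = {"v1": rng.normal(size=4), "v2": rng.normal(size=4)}
    lexicon = {"ball": videos["v1"] + 0.1, "rolls": videos["v1"] - 0.1,
               "dog": videos["v2"] + 0.1, "barks": videos["v2"] - 0.1}
    sentences = ["ball rolls", "dog barks"]

    # Captioning: best sentence for a video; retrieval: best video for a query.
    print(min(sentences, key=lambda s: cost(videos["v1"], s, lexicon)))
    print(min(videos, key=lambda v: cost(videos[v], "dog barks", lexicon)))

    # Acquisition: learn the lexicon from sententially annotated video; for
    # this quadratic cost, each word's optimal meaning is the mean of the
    # features of the videos it annotates.
    corpus = [("v1", "ball rolls"), ("v2", "dog barks")]
    learned = {}
    for vid, sent in corpus:
        for w in sent.split():
            learned.setdefault(w, []).append(videos[vid])
    learned = {w: np.mean(feats, axis=0) for w, feats in learned.items()}
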
The second is a unified cost function relating mobile robot navigation,
sentences, and a lexicon.  Multidirectional inference supports language
acquisition (learning a lexicon from sententially annotated navigational
paths), generation (producing sentential descriptions of mobile robot paths
driven under teleoperation), and comprehension (automatically driving a
mobile robot given a sentential description of a route plan).
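
The following hypothetical sketch plays the same game for navigation: turn
words are grounded as the signs of 2D cross products between successive
path segments, and one mismatch cost serves both comprehension (picking a
path for a sentence) and generation (picking a sentence for a path).  The
two-word lexicon and the waypoint paths are invented for illustration.

    # Hypothetical sketch: grounding route-plan sentences in robot paths.
    import numpy as np

    def turn_signs(path):
        """Signs of the turns along a waypoint path (+1 left, -1 right)."""
        v = np.diff(np.asarray(path, dtype=float), axis=0)   # segment vectors
        cross = v[:-1, 0] * v[1:, 1] - v[:-1, 1] * v[1:, 0]  # 2D cross products
        return np.sign(cross[np.abs(cross) > 1e-9])          # ignore straight runs

    WORD_SIGN = {"left": 1.0, "right": -1.0}                 # invented toy lexicon

    def cost(sentence, path):
        """Mismatch between a sentence's turn words and a path's turns."""
        words = [w for w in sentence.split() if w in WORD_SIGN]
        turns = turn_signs(path)
        if len(words) != len(turns):
            return float("inf")
        return sum(WORD_SIGN[w] != t for w, t in zip(words, turns))

    paths = {"left-then-right": [(0, 0), (1, 0), (1, 1), (2, 1)],
             "right-then-left": [(0, 0), (1, 0), (1, -1), (2, -1)]}

    # Comprehension: choose the path that best realizes the sentence.
    print(min(paths, key=lambda k: cost("turn left then turn right", paths[k])))
    # Generation: choose the sentence that best describes a driven path.
    candidates = ["turn left then turn right", "turn right then turn left"]
    print(min(candidates, key=lambda s: cost(s, paths["right-then-left"])))
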
The third uses sentential annotation to assist video object codiscovery.
Joint inference between video and language can discover objects, without
any pretrained object-detector models, from a small number of example
videos that have been annotated with sentential descriptions but no object
bounding boxes.
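
A minimal sketch of the codiscovery idea, under strong toy assumptions
(region proposals reduced to 3-D feature vectors, and captions that all
name the same object, which is what licenses grouping the videos): jointly
select one candidate region per video so that the selections are mutually
similar, using no pretrained detector.

    # Hypothetical sketch: the shared caption noun says "the same object
    # appears in all three videos"; joint inference then picks the one
    # region proposal per video that makes the selections agree.
    from itertools import product
    import numpy as np

    rng = np.random.default_rng(1)
    obj = rng.normal(size=3)                  # the shared object's appearance
    # Each "video" offers a few candidate region proposals (as features);
    # exactly one per video actually shows the captioned object.
    videos = [np.vstack([obj + 0.05 * rng.normal(size=3),
                         rng.normal(size=3),
                         rng.normal(size=3)])
              for _ in range(3)]

    def spread(regions):
        """Total pairwise squared distance among the selected regions."""
        return sum(np.sum((a - b) ** 2)
                   for i, a in enumerate(regions) for b in regions[i + 1:])

    # Joint inference: brute-force the choice of one region per video.
    best = min(product(*(range(len(v)) for v in videos)),
               key=lambda idx: spread([videos[i][j]
                                       for i, j in enumerate(idx)]))
    print(best)  # expected (0, 0, 0): the object is codiscovered in each video
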
Finally, I will present an investigation of how the human brain performs
joint inference between language and vision.  fMRI studies allow us to
train computer models to recover semantic content from brain scans.  We can
train models solely on subjects watching video and use them to recover
semantic content from brain scans of different subjects reading sentences;
we can similarly train models solely on subjects reading sentences and use
them to recover semantic content from brain scans of different subjects
watching video.  The ability to perform cross-modal and cross-subject
decoding, together with the significant overlap in the brain regions used
by the models, points to a common semantic representation employed by the
human brain across modality and subject.
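
As a hedged illustration of the decoding setup, the sketch below uses
synthetic data standing in for fMRI responses: a ridge regression trained
on video-watching trials decodes sentence-reading trials and is scored by
nearest-neighbor identification.  Real studies additionally require
cross-subject alignment, which is omitted here.

    # Hypothetical sketch with synthetic "brain responses": both modalities
    # share one semantics-to-voxel map W, mimicking a common representation.
    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(2)
    n_train, n_test, n_voxels, n_sem = 200, 50, 100, 10
    W = rng.normal(size=(n_sem, n_voxels))         # shared semantics-to-brain map

    sem_video = rng.normal(size=(n_train, n_sem))  # semantics of watched clips
    brain_video = sem_video @ W + 0.1 * rng.normal(size=(n_train, n_voxels))
    sem_read = rng.normal(size=(n_test, n_sem))    # semantics of read sentences
    brain_read = sem_read @ W + 0.1 * rng.normal(size=(n_test, n_voxels))

    decoder = Ridge(alpha=1.0).fit(brain_video, sem_video)  # train on video trials
    pred = decoder.predict(brain_read)                      # decode reading trials

    # Score by nearest-neighbor identification of the correct sentence semantics.
    d = ((pred[:, None, :] - sem_read[None, :, :]) ** 2).sum(-1)
    accuracy = (d.argmin(1) == np.arange(n_test)).mean()
    print(f"identification accuracy: {accuracy:.2f}")  # high when decoding transfers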

Jeffrey M. Siskind received the B.A. degree in computer science from the
Technion, Israel Institute of Technology, Haifa, in 1979, the S.M. degree in
computer science from the Massachusetts Institute of Technology (M.I.T.),
Cambridge, in 1989, and the Ph.D. degree in computer science from M.I.T. in
1992.  He did a postdoctoral fellowship at the University of Pennsylvania
Institute for Research in Cognitive Science from 1992 to 1993.  He was an
assistant professor at the University of Toronto Department of Computer
Science from 1993 to 1995, a senior lecturer at the Technion Department of
Electrical Engineering in 1996, a visiting assistant professor at the
University of Vermont Department of Computer Science and Electrical
Engineering from 1996 to 1997, and a research scientist at NEC Research
Institute, Inc. from 1997 to 2001.  He joined the Purdue University School of
Electrical and Computer Engineering in 2002, where he is currently an associate
professor.  His research interests include computer vision, robotics, artificial
intelligence, neuroscience, cognitive science, computational linguistics,
child language acquisition, automatic differentiation, and programming
languages and compilers.

Everyone welcome!