CSE Colloquium: Designing Video Models for Human Behavior Understanding
Zoom Information: Join from PC, Mac, Linux, iOS or Android: https://psu.zoom.us/j/94128390729?pwd=UWU0Sk9DeFNyQWJMMHplamFnQkdCQT09 Password: 435568
Or iPhone one-tap (US Toll): +13017158592,94128390729# or +13126266799,94128390729#
Or Telephone: Dial +1 301 715 8592, +1 312 626 6799, +1 646 876 9923, +1 253 215 8782, +1 346 248 7799, or +1 669 900 6833 (all US Toll)
Meeting ID: 941 2839 0729
Password: 435568
International numbers available: https://psu.zoom.us/u/aep2iTbjU
ABSTRACT: Many modern computer vision applications require extracting core attributes of human behavior such as attention, action, or intention. Extracting such behavioral attributes requires powerful video models that can reason about human behavior directly from raw video data. To design such models, we need to answer the following three questions: how do we (1) model videos, (2) learn from videos, and (3) use videos to predict human behavior?
In this talk, I will present a series of methods that address each of these questions. First, I will introduce TimeSformer, the first convolution-free architecture for video modeling built exclusively with self-attention. It achieves the best reported numbers on major action recognition benchmarks at 1/10th of the cost of state-of-the-art 3D CNNs. Next, I will present COBE, a new large-scale framework for learning contextualized object representations in settings involving human-object interactions. Our approach exploits automatically transcribed speech narrations from instructional YouTube videos and does not require manual annotations. Lastly, I will introduce a self-supervised learning approach for predicting a basketball player's future motion trajectory from an unlabeled collection of first-person basketball videos.
BIOGRAPHY: Gedas Bertasius is a postdoctoral researcher at Facebook AI working on computer vision and machine learning problems. His current research focuses on video understanding, first-person vision, and multi-modal deep learning. He received his Bachelor’s Degree in Computer Science from Dartmouth College and a Ph.D. in Computer Science from the University of Pennsylvania. His recent work was nominated for the CVPR 2020 best paper award.
Event Contact: Robert Collins