CSE Colloquia: Multimodal (Generative) LLMs: Unification, Efficiency, Interpretability
Abstract:
In this talk, I will present our journey of building large-scale multimodal pretrained (generative) models across various modalities (text, images, videos, audio, layouts, etc.) and of enhancing important aspects such as unification, efficiency, and interpretability. We will start by discussing early cross-modal vision-and-language pretraining models (LXMERT) and visually grounded text models with image/video knowledge distillation (Vokenization, VidLanKD). We will then present early unified models (VL-T5) that combine several multimodal tasks (such as visual QA, referring expression comprehension, visual entailment, visual commonsense reasoning, captioning, and multimodal translation) by treating all of these tasks as text generation. We will also look at recent unified models (with joint objectives and architectures) such as textless video-audio transformers (TVLT), vision-text-layout transformers for universal document processing (UDOP), composable any-to-any multimodal generation (CoDi), and consistent multi-scene video generation (VideoDirectorGPT). Next, we will look at further parameter/memory efficiency via adapter (VL-Adapter), ladder side-tuning (LST), sparse sampling (ClipBERT), and audio replacement (ECLIPSE) methods. I will conclude with interpretability and evaluation aspects of image generation models, based on fine-grained skill and bias evaluation (DALL-Eval) and on interpretable and controllable visual programs (VPGen+VPEval).
Bio:
Dr. Mohit Bansal is the John R. & Louise S. Parker Professor and the Director of the MURGe-Lab (UNC-NLP Group) in the Computer Science Department at UNC Chapel Hill. He received his PhD from UC Berkeley in 2013 and his BTech from IIT Kanpur in 2008. His research expertise is in natural language processing and multimodal machine learning, with a particular focus on multimodal generative models, grounded and embodied semantics, language generation and Q&A/dialogue, and interpretable and generalizable deep learning. He is a recipient of the IIT Kanpur Young Alumnus Award, DARPA Director's Fellowship, NSF CAREER Award, Google Focused Research Award, Microsoft Investigator Fellowship, Army Young Investigator Award (YIP), DARPA Young Faculty Award (YFA), and outstanding paper awards at ACL, CVPR, EACL, COLING, and CoNLL. He has been a keynote speaker at the AACL 2023 and INLG 2022 conferences. His service includes the ACL Executive Committee, the ACM Doctoral Dissertation Award Committee, CoNLL Program Co-Chair, ACL Americas Sponsorship Co-Chair, and Associate/Action Editor for the TACL, CL, IEEE/ACM TASLP, and CSL journals. Webpage: https://www.cs.unc.edu/~mbansal/
Event Contact: Timothy Zhu