Stanford MLSys Seminar Episode 63: Arjun Akula [Today, 1:35-2:30pm PT]

Karan Goel

Apr 28, 2022, 11:01:52 AM
to stanford-ml...@googlegroups.com
Hi everyone,

We're back with the sixty-third episode of the MLSys Seminar today from 1:35-2:30pm PT.

We'll be joined by Arjun Akula, who will talk about improving robustness in vision and language grounding. The format is a 30-minute talk followed by a 30-minute podcast-style discussion, where the live audience can ask questions.

Guest: Arjun Akula
Title: Improving Robustness and Interpretability in Vision and Language Grounding Models
Abstract: Deep neural networks have enabled significant progress on many multi-modal grounding problems, such as visual question answering (VQA) and referring expression recognition (REF), which have important applications in navigation, medical imaging, robotics, and accessibility. In the last few years we have seen huge improvements in how these models perform, with some reaching human-level performance on several datasets. However, we find that these models may be exploiting strong biases in these datasets, casting doubt on the actual progress. For example, as a human, do you focus on the same visual object when you hear the sentences “the bus in the middle of the crowd” and “the crowd that the bus is in the middle of”? Neural networks do. The exciting progress on understanding language in the context of an image is due not to the cleverness of the neural networks, but rather to shortcuts present in the evaluation datasets. In this talk, we show that state-of-the-art neural network approaches are easily fooled because they fail to overcome biases in their training datasets. We also show that recent self-supervised BERT-based multi-modal architectures (e.g., ViLBERT) are more robust than other neural architectures. We propose methods to improve the robustness (and generalization) of current models, and show that while data augmentation is one way to increase robustness, multi-task learning is probably a less tedious route. Finally, we describe a mechanism for producing scalable, nonstationary benchmarks (and out-of-distribution hard splits) for testing the generalization capabilities of existing grounding models.
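To make the bus/crowd example concrete, here is a minimal sketch of the kind of contrast-pair probe the abstract describes: feed a grounding model two expressions that refer to different objects in the same image, and flag the model if it grounds them to nearly the same region. The `predict_box` method, the `Box` type, and the 0.5 IoU threshold are all hypothetical stand-ins for illustration, not the speaker's actual code or API:

```python
# Sketch of a contrast-pair bias probe for a referring-expression (REF)
# grounding model. The model interface here is a hypothetical stand-in.
from dataclasses import dataclass

@dataclass
class Box:
    x: float  # top-left corner
    y: float
    w: float  # width and height
    h: float

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a.x, b.x), max(a.y, b.y)
    x2 = min(a.x + a.w, b.x + b.w)
    y2 = min(a.y + a.h, b.y + b.h)
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = a.w * a.h + b.w * b.h - inter
    return inter / union if union > 0 else 0.0

def contrast_pair_probe(model, image, expr_a: str, expr_b: str) -> bool:
    """Return True if the model grounds both expressions to (nearly) the
    same region -- a sign it is ignoring the linguistic structure and
    relying on dataset shortcuts instead."""
    box_a = model.predict_box(image, expr_a)  # hypothetical API
    box_b = model.predict_box(image, expr_b)
    return iou(box_a, box_b) > 0.5

# Usage: these two expressions refer to *different* objects, so a
# structure-sensitive model should make this return False.
# fooled = contrast_pair_probe(
#     model, image,
#     "the bus in the middle of the crowd",
#     "the crowd that the bus is in the middle of",
# )
```

Running a probe like this over many such pairs gives a simple robustness score that is independent of the standard accuracy metric, which is exactly why shortcut-driven models can look strong on the original benchmark yet fail here.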
Bio: Arjun Akula is a Research Scientist at Google AI in Mountain View. He received his PhD from UCLA, jointly advised by Prof. Song-Chun Zhu (UCLA) and Prof. Joyce Chai (UMich). His research interests are in computer vision, natural language processing, and deep learning, with a focus on multi-modal grounding. Specifically, he works on identifying biases in state-of-the-art datasets and models, and on improving the robustness of vision and language grounding models to out-of-distribution and adversarial inputs. He also works on making the underlying reasoning process of deep learning models more transparent and interpretable to human users. During his PhD, he interned at Amazon Alexa AI (Sunnyvale, CA), Google Research (Los Angeles, CA), Amazon AI (Palo Alto, CA), and Mila (Montreal). Prior to his PhD, he worked as a research software engineer at IBM Research AI (India) for 2.5 years. He completed his Bachelor's and Master's in Computer Science and Engineering at IIIT Hyderabad, India. He is an active member of the academic community, serving as a reviewer/program committee member for ACL, CVPR, ARR, EMNLP, ICCV, AAAI, ECCV, NeurIPS, and NAACL. Outside of work, he enjoys hiking, traveling, and playing table tennis. Here is a link to his personal website: https://www.arjunakula.com/

See you all there!

Best,
Karan

Karan Goel

Apr 28, 2022, 4:27:29 PM
to stanford-ml...@googlegroups.com
Reminder: this is in 10 minutes!