Abstract:
In this talk, I will show how good visual representations can be learned without manual annotations by simply leveraging the multimodal nature of videos. I will illustrate this by going through two of our recent results. First, we demonstrate that a text-video embedding trained on HowTo100M, a large uncurated dataset of narrated videos, leads to state-of-the-art results for text-to-video retrieval and action localization tasks [1]. Second, I will introduce our recent MultiModal Versatile (MMV) Networks [2] that learn state-of-the-art self-supervised representations by leveraging three modalities naturally present in videos: vision, audio, and language.
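For readers who want a concrete picture of how such a joint text-video embedding can be trained contrastively, the following is a minimal sketch only, not the exact MIL-NCE or MMV objectives from [1] and [2]; the array names, shapes, and temperature value are illustrative assumptions.

import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Project embeddings onto the unit sphere so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch of aligned (video, narration) pairs.
    # video_emb, text_emb: arrays of shape (batch, dim); row i of each comes
    # from the same clip, so the positives lie on the diagonal.
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temperature          # (batch, batch) similarity matrix
    labels = np.arange(len(logits))

    def cross_entropy(lg, lb):
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lb)), lb].mean()

    # Pull matched pairs together and push mismatched pairs apart,
    # in both retrieval directions (video-to-text and text-to-video).
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

# Example with random features standing in for video and narration embeddings.
rng = np.random.default_rng(0)
print(contrastive_loss(rng.normal(size=(8, 128)), rng.normal(size=(8, 128))))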
[1] Antoine Miech, Jean-Baptiste Alayrac, et al., "End-to-End Learning of Visual Representations from Uncurated Instructional Videos", CVPR 2020.
[2] Jean-Baptiste Alayrac et al., "Self-Supervised MultiModal Versatile Networks", NeurIPS 2020.
Short bio:
Jean-Baptiste Alayrac is a senior research scientist at DeepMind working in the Vision group led by Andrew Zisserman. He obtained a Ph.D. from the École Normale Supérieure in Paris in 2018, an MSc degree in Mathematics, Machine Learning, and Computer Vision from the École Normale Supérieure de Cachan in 2014, and graduated from the École Polytechnique in France in 2013. His research interests span video understanding, natural language processing, and machine learning. Most recently, he has been focusing on self-supervised learning from the multiple modalities present in large collections of videos.