Today’s computer vision technology is quite good at identifying animals, people, and natural or man-made objects in cluttered images and videos, but it relies on an enormous amount of manual annotation to learn the corresponding visual models. The vision systems of tomorrow will have to learn continuously from data with much weaker human supervision, both to adapt to new users of digital assistants or to new routes and driving conditions for autonomous cars, and to truly leverage the billions of images available on the Internet. This change of paradigm is necessary for the successful large-scale deployment of computer vision technology, and it is a central scientific challenge for our field.

