*Below is the final schedule.*
1:30 - 1:40 pm Introduction
1:40 - 2:00 pm Keynote
Learning to Segment Moving Objects, by Cordelia Schmid (INRIA)
2:00 - 2:30 pm Oral Session 1
Gaze Embeddings for Zero-Shot Image Classification
by Nour Karessli (Max Planck Institute for Informatics)
Towards Better Instance-level Recognition
by Georgia Gkioxari (Facebook AI Research)
2:30 - 2:50 pm Keynote
Interferences in Match Kernels, by Naila Murray (Naver Labs/NLE)
2:50 - 4:15 pm Poster Session and Coffee Break
4:15 - 4:35 pm Keynote
Computer Vision for the Blind, by Chieko Asakawa (IBM Research/CMU)
4:35 - 5:05 pm Oral Session 2
Dynamic Deep Neural Networks: Optimizing Accuracy-Efficiency Trade-offs by Selective Execution
by Lanlan Liu (University of Michigan)
Semi and Weakly Supervised Semantic Segmentation Using Generative Adversarial Network
by Nasim Souly (University of Central Florida)
5:05 - 5:35 pm Panel Session
Chieko Asakawa (IBM Research/CMU)
Andrea Frome (Clarifai)
Raia Hadsell (DeepMind)
Naila Murray (Naver Labs/NLE)
Cordelia Schmid (INRIA)
Helge Seetzen (TandemLaunch)
5:35 - 5:45 pm Closing Remarks
Keynote speakers will give technical talks about their research in computer vision.
Title: Computer Vision for the Blind
Abstract: Blind people have long dreamed of a machine that can recognize objects, people and the environment, such as goods in a shop, people around them, or obstacles in a corridor. For many years, such machines existed only in science fiction, but thanks to advances in deep learning and computer vision, they are now coming closer to reality, supplementing and augmenting the missing or weakened abilities of people with visual impairments. In this talk, I will outline a set of necessary technologies, demonstrate our efforts, and cast a vision for near-future deployments that will change people's lives.
Bio: Chieko Asakawa is a blind Japanese computer scientist, known for her work on accessibility at IBM Research – Tokyo. A Netscape browser plug-in which she developed, the IBM Home Page Reader, became the most widely used web-to-speech system available. She is the recipient of numerous industry and government awards. Asakawa was born with normal sight, but after she injured her optic nerve in a swimming accident at age 11, she began losing her sight, and by age 14 she was fully blind. She earned a bachelor's degree in English literature at Otemon Gakuin University in Osaka in 1982 and then began a two-year computer programming course for blind people at Nippon Lighthouse, using an Optacon to translate print to tactile sensation. Chieko joined IBM in 1985 after completing the course, and received a Ph.D. in Engineering from the University of Tokyo in 2004. She is a member of the Association for Computing Machinery (ACM), the Information Processing Society of Japan, and the IBM Academy of Technology. She was inducted into the Women in Technology International (WITI) Hall of Fame in 2003, and both within and outside of IBM she has been actively working to help women engineers pursue technical careers. Chieko was appointed an IBM Fellow in 2009, IBM's most prestigious technical honor. In 2013, the government of Japan awarded the Medal of Honor with Purple Ribbon to Chieko for her outstanding contributions to accessibility research, including the development of the voice browser for the visually impaired.
Title: Interferences in Match Kernels
Abstract: We consider the design of an image representation that embeds and aggregates a set of local descriptors into a single vector. Popular representations of this kind include the bag-of-visual-words, the Fisher vector and the VLAD. When two such image representations are compared with the dot-product, the image-to-image similarity can be interpreted as a match kernel. In match kernels, one has to deal with interference, i.e. with the fact that even if two descriptors are unrelated, their matching score may contribute to the overall similarity. We formalise this problem and propose two related solutions, both aimed at equalising the individual contributions of the local descriptors in the final representation. These methods modify the aggregation stage by including a set of per-descriptor weights. They differ by the objective function that is optimised to compute those weights. The first is a “democratisation” strategy that aims at equalising the relative importance of each descriptor in the set comparison metric. The second one involves equalising the match of a single descriptor to the aggregated vector. These concurrent methods give a substantial performance boost over standard aggregation methods, as demonstrated by our experiments on standard public image retrieval benchmarks.
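For readers unfamiliar with the terminology, here is a minimal sketch of the match-kernel view the abstract refers to; the notation ($\phi$ for the local embedding, $\lambda_x$ for the weights) is ours, not the paper's.

```latex
% An image is represented by a set X of local descriptors, each embedded
% by a map phi and aggregated by summation. The dot-product of two such
% representations expands into pairwise matching scores:
\[
  K(\mathcal{X},\mathcal{Y})
    = \Big\langle \sum_{x\in\mathcal{X}} \phi(x),
                  \sum_{y\in\mathcal{Y}} \phi(y) \Big\rangle
    = \sum_{x\in\mathcal{X}} \sum_{y\in\mathcal{Y}}
        \langle \phi(x), \phi(y) \rangle .
\]
% The cross terms for unrelated pairs (x, y) are the interferences. Both
% proposed methods instead aggregate with per-descriptor weights,
\[
  \psi(\mathcal{X}) = \sum_{x\in\mathcal{X}} \lambda_x\, \phi(x),
\]
% choosing the weights so that each descriptor contributes more equally
% to the final image-to-image similarity.
```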
Bio: Naila Murray graduated with a PhD in computer science from the Universitat Autònoma de Barcelona. She also holds a master’s degree in computer vision and artificial intelligence from the Universitat Autònoma de Barcelona, and a bachelor’s degree in electrical engineering from Princeton University. Her work in computer vision has involved research into biologically-inspired deep models of visual attention; fine-grained visual recognition, including participation in the winning team of the FGComp 2013 competition; and computational models for visual aesthetic analysis. Naila is a senior scientist and manager of the computer vision group at Naver Labs Europe (formerly Xerox Research Centre Europe). Currently, her research focuses on visual search in large databases, and human behaviour understanding, particularly for video action recognition.
Title: Learning to Segment Moving Objects
Abstract: This talk addresses the task of segmenting moving objects in unconstrained videos. We introduce a novel two-stream neural network with an explicit memory module to achieve this. The two streams of the network encode spatial and temporal features in a video sequence respectively, while the memory module captures the evolution of objects over time. The module to build a “visual memory” in video, i.e., a joint representation of all the video frames, is realized with a convolutional recurrent unit learned from a small number of training video sequences. Given video frames as input, our approach first assigns each pixel an object or background label obtained with an encoder-decoder network that takes optical flow as input and is trained on synthetic data. Next, a “visual memory” specific to the video is acquired automatically without any manually-annotated frames. The visual memory is implemented with convolutional gated recurrent units, which allow the network to propagate spatial information over time. We evaluate our method extensively on two benchmarks, the DAVIS and Freiburg-Berkeley motion segmentation datasets, and show state-of-the-art results.
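As context for the “visual memory” described above, below is a minimal sketch of a convolutional gated recurrent unit in PyTorch. It is an illustrative assumption, not the authors' implementation; the class name, channel counts and kernel size are made up.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """One step of a convolutional GRU over spatial feature maps (a sketch)."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2  # preserve spatial resolution
        # Update gate, reset gate and candidate state are all convolutions
        # over the concatenated input and hidden state.
        cat_channels = in_channels + hidden_channels
        self.update_gate = nn.Conv2d(cat_channels, hidden_channels,
                                     kernel_size, padding=padding)
        self.reset_gate = nn.Conv2d(cat_channels, hidden_channels,
                                    kernel_size, padding=padding)
        self.candidate = nn.Conv2d(cat_channels, hidden_channels,
                                   kernel_size, padding=padding)

    def forward(self, x, h):
        # x: (B, in_channels, H, W) per-frame features (e.g. appearance and flow)
        # h: (B, hidden_channels, H, W) memory carried across frames
        xh = torch.cat([x, h], dim=1)
        z = torch.sigmoid(self.update_gate(xh))  # how much memory to rewrite
        r = torch.sigmoid(self.reset_gate(xh))   # how much memory to consult
        h_new = torch.tanh(self.candidate(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_new           # updated memory, same layout
```

Because every gate is a convolution, the update is computed per spatial location, which is what lets the memory propagate spatial information over time instead of collapsing each frame into a single vector.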
Bio: Cordelia Schmid holds an M.S. degree in Computer Science from the University of Karlsruhe and a Doctorate, also in Computer Science, from the Institut National Polytechnique de Grenoble (INPG). Her doctoral thesis received the best thesis award from INPG in 1996. Dr. Schmid was a post-doctoral research assistant in the Robotics Research Group of Oxford University in 1996--1997. Since 1997 she has held a permanent research position at INRIA Grenoble Rhône-Alpes, where she is a research director and directs an INRIA team. Dr. Schmid is the author of over a hundred technical publications. She has been an Associate Editor for IEEE PAMI (2001--2005) and for IJCV (2004--2012), editor-in-chief for IJCV (2013--present), a program chair of IEEE CVPR 2005 and ECCV 2012, as well as a general chair of IEEE CVPR 2015 and ECCV 2020. In 2006, 2014 and 2016, she was awarded the Longuet-Higgins prize for fundamental contributions in computer vision that have withstood the test of time. She is a fellow of the IEEE. She was awarded an ERC advanced grant in 2013, the Humboldt research award in 2015 and the Inria & French Academy of Science Grand Prix in 2016. She was elected to the German National Academy of Sciences, Leopoldina, in 2017.
Panelists will answer questions and discuss increasing diversity in computer vision.
Feel free to ask your anonymous questions here.
Chieko Asakawa (IBM Research/CMU): see her bio under the keynote speakers above.
Dr. Andrea Frome earned her Ph.D. in Computer Science and Machine Learning in Jitendra Malik’s lab at UC Berkeley in 2007. Since then, her work in computer vision and machine learning has included leading the visual recognition team within Street View, which is especially known for its work blurring faces and license plates; developing DeViSE, which combines visual recognition with word embeddings, and applying an attention RNN to fine-grained classification as a member of the Google Brain team; and building systems at Hillary for America campaign headquarters for identity resolution and automatic reading of canvassing surveys to reduce data entry. In January 2017, she joined Clarifai as Director of Research. Her non-work pursuits include doing volunteer work for flippable.org, studying flying trapeze, and learning Argentine Tango.
Raia Hadsell, a senior research scientist at DeepMind, has worked on deep learning and robotics problems for over 10 years. Her early research developed the notion of manifold learning using Siamese networks, which has been used extensively for invariant feature learning. After completing a PhD with Yann LeCun, which featured a self-supervised deep learning vision system for a mobile robot, her research continued at Carnegie Mellon’s Robotics Institute and SRI International, and in early 2014 she joined DeepMind in London to study artificial general intelligence. Her current research focuses on the challenge of continual learning for AI agents and robots. While deep RL algorithms are capable of attaining superhuman performance on single tasks, they often cannot transfer that performance to additional tasks, especially if experienced sequentially. She has proposed neural approaches such as policy distillation, progressive nets, and elastic weight consolidation to solve the problem of catastrophic forgetting for agents and robots.
Naila Murray (Naver Labs/NLE): see her bio under the keynote speakers above.
Cordelia Schmid (INRIA): see her bio under the keynote speakers above.
Helge is an award-winning technologist, entrepreneur, and recognized global authority on technology transfer and display technologies. As General Partner of TandemLaunch, he works with inventors and entrepreneurs to build high-growth technology companies. His past successes include transforming raw university IP into fully commercialized LED TV technology, culminating in the sale of his last company, Brightside Technologies, to Dolby Laboratories after sealing partnerships with several of the largest consumer electronics companies in the world. Helge holds over 80 patents in the fields of display, camera and video technology.
Authors of a few accepted abstracts are invited to give oral presentations.
Presenter instructions: Each presentation should be a 12-minute talk followed by 3 minutes of Q&A.
Authors of all accepted abstracts (with or without travel grant) will present their work in a poster session.
Presenter instructions: All posters should be installed within the first 10 minutes of the afternoon poster session. The poster boards are located in the Kamehameha II room (where the main conference posters were). The poster stands are 8 feet wide by 4 feet high. Poster presenters may optionally use the CVPR17 poster template, which provides further details on how to prepare posters. Please note your poster number below to find your board.
By invitation only.
6:00 - 8:00 pm Dinner sponsored by NVIDIA
The dinner event is an opportunity to meet other female computer vision researchers. Poster presenters will be matched with senior computer vision researchers to share experience and career advice. Invitees will receive an e-mail and be asked to confirm attendance.
*Note that the dinner takes place the evening before the main workshop day.*
Dr. Andrea Frome (Clarifai): see her bio under the panelists above.
Shalini De Mello has been a Senior Research Scientist at NVIDIA Research since March 2013. Her research interests are in computer vision and machine learning for human-computer interaction and smart interfaces. Her work includes NVIDIA’s shipping products for hand gesture recognition, face detection and video stabilization, and GPU-optimized libraries for the development of computer vision applications on mobile platforms. She received doctoral and master’s degrees in Electrical and Computer Engineering from the University of Texas at Austin in 2008 and 2004, respectively.
Olga Russakovsky is an Assistant Professor of Computer Science at Princeton University. She completed her PhD in Computer Science at Stanford University in August 2015 and her postdoc at the Robotics Institute of Carnegie Mellon University in June 2017. Her research is in computer vision, closely integrated with machine learning and human-computer interaction. Her work has been featured in the New York Times and MIT Technology Review. She served as a Senior Program Committee member for WACV’16 (and will do so for CVPR’18), led the ImageNet Large Scale Visual Recognition Challenge effort for two years, was the Publicity and Press chair at CVPR’16, and organized multiple workshops and tutorials on large-scale recognition at the premier computer vision conferences ICCV’13, ECCV’14, CVPR’15, ICCV’15, CVPR’16, ECCV’16 and CVPR’17. In addition, she was the co-founder and director of the Stanford AI Laboratory’s outreach camp SAILORS (featured in Wired and published in SIGCSE’16), which educates high school girls about AI, and is a co-founder and board member of the AI4ALL foundation, dedicated to cultivating a diverse group of future AI leaders.