DBNet for natural language visual detection

This is the project page for the following paper:

Discriminative Bimodal Networks for Visual Localization and Detection with Natural Language Queries
Yuting Zhang, Luyao Yuan, Yijie Guo, Zhiyuan He, I-An Huang, Honglak Lee
Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
Spotlight presentation
[paper 5M (high-res 24M)] [arXiv] [data & development toolbox] [project (TensorFlow & Caffe Code)] [slides (spotlight video)] [poster]

Associating image regions with text queries has recently been explored as a new way to bridge visual and linguistic representations. A few pioneering approaches have been proposed based on recurrent neural language models trained generatively (e.g., generating captions), but they achieve somewhat limited localization accuracy. To better address natural-language-based visual entity localization, we propose a discriminative approach. We formulate a discriminative bimodal neural network (DBNet), which can be trained as a classifier with extensive use of negative samples. Our training objective encourages better localization on single images, incorporates text phrases in a broad range, and properly pairs image regions with text phrases into positive and negative examples. Experiments on the Visual Genome dataset demonstrate that the proposed DBNet significantly outperforms previous state-of-the-art methods, both for localization on single images and for detection on multiple images. We also establish an evaluation protocol for natural-language visual detection.
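The core idea of the discriminative objective described above, scoring (image region, text phrase) pairs as a binary classifier, where matched pairs are positives and mismatched pairings supply negatives, can be illustrated with a minimal sketch. This is not the paper's actual model or code: the bilinear score `pair_logits` and all variable names here are simplifications assumed for illustration (DBNet uses deep visual and linguistic pathways).

```python
import numpy as np

def pair_logits(region_feats, phrase_feats, W):
    # Hypothetical bilinear compatibility score for every region/phrase
    # pair; returns a (num_regions, num_phrases) matrix of logits.
    return region_feats @ W @ phrase_feats.T

def discriminative_loss(logits):
    # Region i matches phrase i (diagonal positives); every off-diagonal
    # pairing of a region with another region's phrase is a negative.
    labels = np.eye(logits.shape[0])
    probs = 1.0 / (1.0 + np.exp(-logits))     # sigmoid
    eps = 1e-9                                 # numerical safety
    bce = -(labels * np.log(probs + eps)
            + (1 - labels) * np.log(1 - probs + eps))
    return bce.mean()

rng = np.random.default_rng(0)
R = rng.normal(size=(4, 8))   # 4 toy region features
P = rng.normal(size=(4, 8))   # 4 matching toy phrase embeddings
W = rng.normal(size=(8, 8))   # bilinear weights
loss = discriminative_loss(pair_logits(R, P, W))
```

The point of the sketch is the supervision structure: a single image with several annotated phrases yields many negative pairs for free, which is what lets the network be trained discriminatively rather than generatively.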
@inproceedings{zhang2017dbnet,
  title={Discriminative Bimodal Networks for Visual Localization and Detection with Natural Language Queries},
  author={Yuting Zhang and Luyao Yuan and Yijie Guo and Zhiyuan He and I-{An} Huang and Honglak Lee},
  booktitle={Conference on Computer Vision and Pattern Recognition ({CVPR})},
  year={2017},
}


We release code and models in both MATLAB+Caffe (used to produce the results in our paper) and Python+TensorFlow. The code is available on our GitHub pages:

We also provide an independent development and evaluation toolbox for visual localization and detection with natural language queries:

Spotlight video