Yuting Zhang


Zhang, Yuting


Postdoctoral fellow, CSE, EECS, University of Michigan, Ann Arbor

I got my PhD at Zhejiang University in December 2015.



E-mail:  email-umich


I am a postdoctoral fellow in Honglak Lee's group in the EECS department, University of Michigan. I received my Ph.D. degree from Zhejiang University in December 2015. I was a Ph.D. student working on pattern recognition and computer vision under the supervision of Gang Pan and the co-supervision of Yueming Wang in the College of Computer Science & Technology, Zhejiang University. I received the B.E. degree in the same department and an honors degree from CKC Honor College in 2009. After that, I became an M.S. student, exempt from the entrance exam, and transferred directly to the Ph.D. program in the fall of 2011 without receiving the M.S. degree. In addition, I was a junior research assistant at the Advanced Digital Sciences Center (Singapore), University of Illinois at Urbana-Champaign during 2012, under the supervision of Kui (Chris) Jia. I had been visiting Honglak Lee's group, which I am currently affiliated with, since the fall of 2013.

  • Deep learning
  • Language and vision
  • Object detection
  • Large-scale computer vision
  • Face alignment & recognition
  • Efficient feature extraction (M.Sc.)
  • Acceleration-based biometrics (Bachelor - M.Sc.)

Discriminative Bimodal Networks for Visual Localization and Detection with Natural Language Queries
Yuting Zhang, Luyao Yuan, Yijie Guo, Zhiyuan He, I-An Huang, Honglak Lee
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
Spotlight presentation
[paper 5M (high-res 24M)] [arXiv] [code, model, data (COMING SOON)]

Associating image regions with text queries has been recently explored as a new way to bridge visual and linguistic representations. A few pioneering approaches have been proposed based on recurrent neural language models trained generatively (e.g., generating captions), but they achieve somewhat limited localization accuracy. To better address natural-language-based visual entity localization, we propose a discriminative approach. We formulate a discriminative bimodal neural network (DBNet), which can be trained by a classifier with extensive use of negative samples. Our training objective encourages better localization on single images, incorporates text phrases in a broad range, and properly pairs image regions with text phrases into positive and negative examples. Experiments on the Visual Genome dataset demonstrate that the proposed DBNet significantly outperforms previous state-of-the-art methods both for localization on single images and for detection on multiple images. We also establish an evaluation protocol for natural-language visual detection.
  author={Yuting Zhang and Luyao Yuan and Yijie Guo and Zhiyuan He and I-{An} Huang and Honglak Lee},
  booktitle={{IEEE} Conference on Computer Vision and Pattern Recognition ({CVPR})},
  title={Discriminative Bimodal Networks for Visual Localization and Detection with Natural Language Queries},

Augmenting Supervised Neural Networks with Unsupervised Objectives for Large-Scale Image Classification
Yuting Zhang, Kibok Lee, Honglak Lee
International Conference on Machine Learning (ICML), June 2016.
[paper (main, supp.)] [arXiv] [code & model] [slides] [poster] [more image reconstruction examples]

Unsupervised learning and supervised learning are key research topics in deep learning. However, as high-capacity supervised neural networks trained with a large amount of labels have achieved remarkable success in many computer vision tasks, the availability of large-scale labeled images reduced the significance of unsupervised learning. Inspired by the recent trend toward revisiting the importance of unsupervised learning, we investigate joint supervised and unsupervised learning in a large-scale setting by augmenting existing neural networks with decoding pathways for reconstruction. First, we demonstrate that the intermediate activations of pretrained large-scale classification networks preserve almost all the information of input images except a portion of local spatial details. Then, by end-to-end training of the entire augmented architecture with the reconstructive objective, we show improvement of the network performance for supervised tasks. We evaluate several variants of autoencoders, including the recently proposed “what-where” autoencoder that uses the encoder pooling switches, to study the importance of the architecture design. Taking the 16-layer VGGNet trained under the ImageNet ILSVRC 2012 protocol as a strong baseline for image classification, our methods improve the validation-set accuracy by a noticeable margin.
  author={Yuting Zhang and Kibok Lee and Honglak Lee},
  booktitle={International Conference on Machine Learning ({ICML})},
  title={Augmenting Supervised Neural Networks with Unsupervised Objectives for Large-Scale Image Classification},

Deep Visual Analogy-Making
Scott Reed, Yi Zhang, Yuting Zhang, Honglak Lee
Advances in Neural Information Processing Systems (NIPS), December 2015.
Oral presentation
[paper] [code] [data]

In addition to identifying the content within a single image, relating images and generating related images are critical tasks for image understanding. Recently, deep convolutional networks have yielded breakthroughs in predicting image labels, annotations and captions, but have only just begun to be used for generating high-quality images. In this paper we develop a novel deep network trained end-to-end to perform visual analogy making, which is the task of transforming a query image according to an example pair of related images. Solving this problem requires both accurately recognizing a visual relationship and generating a transformed query image accordingly. Inspired by recent advances in language modeling, we propose to solve visual analogies by learning to map images to a neural embedding in which analogical reasoning is simple, such as by vector subtraction and addition. In experiments, our model effectively models visual analogies on several datasets: 2D shapes, animated video game sprites, and 3D car models.
  author={Scott Reed and Yi Zhang and Yuting Zhang and Honglak Lee},
  booktitle={Advances in Neural Information Processing Systems ({NIPS})},
  title={Deep Visual Analogy-Making},
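
The abstract above mentions analogical reasoning "by vector subtraction and addition" in an embedding space. The following is a minimal sketch of that additive idea only, assuming the images have already been embedded; the two-dimensional vectors and the function names `analogy_target` and `nearest` are hypothetical illustrations, not the paper's trained network, which also decodes the result back into an image.

```python
from typing import List, Sequence


def analogy_target(a: Sequence[float], b: Sequence[float], c: Sequence[float]) -> List[float]:
    """Given embeddings for a : b :: c : ?, return c + (b - a)."""
    return [ci + (bi - ai) for ai, bi, ci in zip(a, b, c)]


def nearest(query: Sequence[float], candidates: Sequence[Sequence[float]]) -> int:
    """Return the index of the candidate closest to `query` (squared Euclidean distance)."""
    def sqdist(u: Sequence[float], v: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(u, v))
    return min(range(len(candidates)), key=lambda i: sqdist(query, candidates[i]))


# Hand-made toy embeddings (assumption): axis 0 encodes shape, axis 1 encodes rotation.
square_upright = [0.0, 0.0]
square_rotated = [0.0, 1.0]
circle_upright = [1.0, 0.0]

# "square_upright is to square_rotated as circle_upright is to ?"
target = analogy_target(square_upright, square_rotated, circle_upright)

# Pick the closest known embedding as the answer to the analogy.
candidates = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
best = nearest(target, candidates)
```

In this toy setting the transformation (rotation) is a fixed offset along one axis, so adding the offset to a new query lands on the rotated version of that query.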

Improving Object Detection with Deep Convolutional Networks via Bayesian Optimization and Structured Prediction
Yuting Zhang, Kihyuk Sohn, Ruben Villegas, Gang Pan, Honglak Lee
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015. doi: 10.1109/CVPR.2015.7298621
Oral presentation & 1st Winner of CV Community Top Paper Award: CVPR 2015 (OpenCV’s People’s Vote Winning Papers) [link]
[paper (main, supp.)] [arXiv] [project (code & model)] [slides 7M (high-res 45M)] [poster]

Object detection systems based on the deep convolutional neural network (CNN) have recently made groundbreaking advances on several object detection benchmarks. While the features learned by these high-capacity neural networks are discriminative for categorization, inaccurate localization is still a major source of error for detection. Building upon high-capacity CNN architectures, we address the localization problem by 1) using a search algorithm based on Bayesian optimization that sequentially proposes candidate regions for an object bounding box, and 2) training the CNN with a structured loss that explicitly penalizes the localization inaccuracy. In experiments, we demonstrate that each of the proposed methods improves the detection performance over the baseline method on the PASCAL VOC 2007 and 2012 datasets. Furthermore, the two methods are complementary and significantly outperform the previous state-of-the-art when combined.
  author={Yuting Zhang and Kihyuk Sohn and Ruben Villegas and Gang Pan and Honglak Lee},
  booktitle={{IEEE} Conference on Computer Vision and Pattern Recognition ({CVPR})},
  title={Improving Object Detection with Deep Convolutional Networks via {Bayesian} Optimization and Structured Prediction},

Single Sample Face Recognition via Learning Deep Supervised Autoencoders
Shenghua Gao, Yuting Zhang, Kui Jia, Jiwen Lu, Yingying Zhang
IEEE Transactions on Information Forensics and Security, vol. 10, no. 10, pp. 2108-2118, October 2015. doi: 10.1109/TIFS.2015.2446438
[paper]

This paper targets learning robust image representation for single training sample per person face recognition. Motivated by the success of deep learning in image representation, we propose a supervised auto-encoder, which is a new type of building block for deep architectures. Two features distinguish our supervised auto-encoder from the standard auto-encoder. First, we enforce the faces with variants to be mapped to the canonical face of the person, for example, a frontal face with neutral expression and normal illumination; second, we enforce features corresponding to the same person to be similar. As a result, our supervised auto-encoder extracts features that are robust to variances in illumination, expression, occlusion, and pose, and facilitates face recognition. We stack such supervised auto-encoders to get the deep architecture and use it for extracting features in image representation. Experimental results on the AR, Extended Yale B, CMU-PIE, and Multi-PIE datasets demonstrate that by coupling with the commonly used sparse representation based classification, our stacked supervised auto-encoders based face representation significantly outperforms the commonly used image representations in single sample per person face recognition, and it achieves higher recognition accuracy compared with other deep learning models, including the deep Lambertian network, in spite of much less training data and without any domain information. Moreover, the supervised auto-encoder can also be used for face verification, which further demonstrates its effectiveness for face representation.
  author={Shenghua Gao and Yuting Zhang and Kui Jia and Jiwen Lu and Yingying Zhang},
  title={Single Sample Face Recognition via Learning Deep Supervised Autoencoders},
  journal={{IEEE} Transactions on Information Forensics and Security},

Robust Face Recognition by Constrained Part-based Alignment
Yuting Zhang, Kui Jia, Yueming Wang, Gang Pan, Tsung-Han Chan, Yi Ma
ArXiv preprint, 2015.
[paper] [arXiv]

Developing a reliable and practical face recognition system is a long-standing goal in computer vision research. Existing literature suggests that pixel-wise face alignment is the key to achieving high-accuracy face recognition. By assuming a human face as piece-wise planar surfaces, where each surface corresponds to a facial part, we develop in this paper a Constrained Part-based Alignment (CPA) algorithm for face recognition across pose and/or expression. Our proposed algorithm is based on a trainable CPA model, which learns appearance evidence of individual parts and a tree-structured shape configuration among different parts. Given a probe face, CPA simultaneously aligns all its parts by fitting them to the appearance evidence with consideration of the constraint from the tree-structured shape configuration. This objective is formulated as a norm minimization problem regularized by graph likelihoods. CPA can be easily integrated with many existing classifiers to perform part-based face recognition. Extensive experiments on benchmark face datasets show that CPA outperforms or is on par with existing methods for robust face recognition across pose, expression, and/or illumination changes.
  author={Yuting Zhang and Kui Jia and Yueming Wang and Gang Pan and Tsung-Han Chan and Yi Ma},
  title={Robust Face Recognition by Constrained Part-based Alignment},
  journal={ArXiv preprint}

Accelerometer-based Gait Recognition by Sparse Representation of Signature Points with Clusters
Yuting Zhang, Gang Pan, Kui Jia, Minlong Lu, Yueming Wang, Zhaohui Wu
IEEE Transactions on Cybernetics, vol. 45, no. 9, pp. 1864-1875, September 2015. doi: 10.1109/TCYB.2014.2361287
[paper] [dataset] [code]

Gait, as a promising biometric for recognizing human identities, can be non-intrusively captured as a series of acceleration signals using wearable or portable smart devices. It can be used for access control. Most existing methods for accelerometer-based gait recognition require explicit step-cycle detection, suffering from cycle detection failures and inter-cycle phase misalignment. We propose a novel algorithm that avoids both of these problems. It makes use of a type of salient points termed Signature Points (SPs), and has three components: (1) a multi-scale SP extraction method, including the localization and SP descriptors; (2) a sparse representation scheme for encoding newly emerged SPs with known ones in terms of their descriptors, where the phase propinquity of the SPs in a cluster is leveraged to ensure the physical meaningfulness of the codes; and (3) a classifier for the sparse-code collections associated with the SPs of a series. Experimental results on our publicly available dataset of 175 subjects showed that our algorithm outperformed existing methods, even if the step cycles were perfectly detected for them. When the accelerometers at 5 different body locations were used together, it achieved a rank-1 accuracy of 95.8% for identification, and an equal error rate of 2.2% for verification.
  author={Yuting Zhang and Gang Pan and Kui Jia and Minlong Lu and Yueming Wang and Zhaohui Wu},
  title={Accelerometer-based Gait Recognition by Sparse Representation of Signature Points with Clusters},
  journal={IEEE Transactions on Cybernetics},

Learning to Disentangle Factors of Variation with Manifold Interaction
Scott Reed, Kihyuk Sohn, Yuting Zhang, Honglak Lee
International Conference on Machine Learning (ICML), 2014.
[paper] [code]

Many latent factors of variation interact to generate sensory data; for example, pose, morphology and expression in face images. In this work, we propose to learn manifold coordinates for the relevant factors of variation and to model their joint interaction. Many existing feature learning algorithms focus on a single task and extract features that are sensitive to the task-relevant factors and invariant to all others. However, models that just extract a single set of invariant features do not exploit the relationships among the latent factors. To address this, we propose a higher-order Boltzmann machine that incorporates multiplicative interactions among groups of hidden units that each learn to encode a distinct factor of variation. Furthermore, we propose correspondence-based training strategies that allow effective disentangling. Our model achieves state-of-the-art emotion recognition and face verification performance on the Toronto Face Database. We also demonstrate disentangled features learned on the CMU Multi-PIE dataset.
  author={Scott Reed and Kihyuk Sohn and Yuting Zhang and Honglak Lee},
  booktitle={International Conference on Machine Learning ({ICML})},
  title={Learning to Disentangle Factors of Variation with Manifold Interaction},

L1-Norm Latent SVM for Compact Features in Object Detection
Min Tan, Gang Pan, Yueming Wang, Yuting Zhang, Zhaohui Wu
Neurocomputing, vol. 139, pp. 56-64, 2014. doi: 10.1016/j.neucom.2013.09.054

The deformable part model is one of the most effective methods for object detection. However, it simultaneously computes the scores for a holistic filter and several part filters in a relatively high-dimensional feature space, which causes the problem of low computational efficiency. This paper proposes an approach to select compact and effective features by learning a sparse deformable part model using L1-norm latent SVM. A stochastic truncated sub-gradient descent method is presented to solve the L1-norm latent SVM problem. Convergence of the algorithm is proved. Extensive experiments are conducted on the INRIA and PASCAL VOC 2007 datasets. A highly compact feature in our method can reach the state-of-the-art performance. The feature dimensionality is reduced to 12% of the original one on the INRIA dataset and less than 30% in most categories of the PASCAL VOC 2007 dataset. Compared with the features used in L2-norm latent SVM, the average precisions (AP) have almost no drop using the reduced feature. With our method, the detection score computation is 3 times faster than that of the L2-norm latent SVM method. When the cascade strategy is applied, it can be further sped up by about an order of magnitude.
  author={Min Tan and Gang Pan and Yueming Wang and Yuting Zhang and Zhaohui Wu},
  title={L1-Norm Latent {SVM} for Compact Features in Object Detection},
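
To illustrate how a truncated sub-gradient step produces sparse weights, here is a minimal sketch for a plain linear SVM with an L1 penalty. It is an assumption-laden simplification: it has no latent variables and no deformable parts, and the names `soft_truncate` and `train_l1_linear_svm`, the step size, and the penalty weight are hypothetical choices rather than the paper's algorithm or settings.

```python
import random
from typing import List, Sequence


def soft_truncate(w: float, shrink: float) -> float:
    """Move w toward zero by `shrink`, stopping at zero (the L1 truncation step)."""
    if w > shrink:
        return w - shrink
    if w < -shrink:
        return w + shrink
    return 0.0


def train_l1_linear_svm(samples: Sequence[Sequence[float]], labels: Sequence[int],
                        lam: float = 0.01, lr: float = 0.1,
                        epochs: int = 50, seed: int = 0) -> List[float]:
    """Stochastic hinge-loss sub-gradient steps, each followed by L1 truncation."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = samples[i], labels[i]
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            if margin < 1.0:
                # Sub-gradient of the hinge loss is -y * x when the margin is violated.
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
            # Truncation: shrink every weight toward zero; small weights become exactly 0.
            w = [soft_truncate(wj, lr * lam) for wj in w]
    return w


# Toy data (assumption): only the first feature carries the label.
weights = train_l1_linear_svm([[1.0, 0.0], [-1.0, 0.0]], [1, -1])
```

Weights that the data never pushes away from zero stay exactly zero, which is what allows the corresponding feature dimensions to be dropped at detection time.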

Efficient Computation of Histograms on Densely Overlapped Polygonal Regions
Yuting Zhang, Yueming Wang, Gang Pan, Zhaohui Wu
Neurocomputing, vol. 118, pp. 141-149, 2013. doi: 10.1016/j.neucom.2013.02.027
[paper] [code]

This paper proposes a novel algorithm to efficiently compute the histograms in densely overlapped polygonal regions. An incremental scheme is used to reduce the computational complexity. By this scheme, only a few entries in an existing histogram need to be updated to obtain a new histogram. The updating procedure makes use of a few histograms attached to the polygon’s edges, which can be efficiently pre-computed in a similar incremental manner. Thus, the overall process can achieve higher computational efficiency. Further, we extend our method to efficiently evaluate objective functions on the histograms in polygonal regions. The experiments on natural images demonstrate the high efficiency of our method.
  author={Yuting Zhang and Yueming Wang and Gang Pan and Zhaohui Wu},
  title={Efficient Computation of Histograms on Densely Overlapped Polygonal Regions},
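
The incremental scheme in the abstract above updates only a few entries of an existing histogram to obtain the next one. A minimal sketch of that principle follows, using a one-dimensional sliding window as a stand-in; this is a deliberately simplified analogue, not the paper's polygonal-region algorithm with edge-attached histograms, and `sliding_histograms` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, Iterator, Sequence


def sliding_histograms(values: Sequence[int], width: int) -> Iterator[Dict[int, int]]:
    """Yield the histogram of every length-`width` window of `values`.

    Each histogram is derived from the previous one by removing the value that
    leaves the window and adding the value that enters it, instead of recounting.
    """
    if width <= 0 or width > len(values):
        return
    hist = Counter(values[:width])
    yield dict(hist)
    for i in range(width, len(values)):
        leaving = values[i - width]
        hist[leaving] -= 1
        if hist[leaving] == 0:
            del hist[leaving]  # keep the histogram free of empty bins
        hist[values[i]] += 1
        yield dict(hist)


histograms = list(sliding_histograms([1, 2, 2, 3], width=2))
```

Each step touches two bins regardless of the window width, whereas recounting would touch every element in the window.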

Removal of 3D Facial Expressions: a Learning-based Approach
Gang Pan, Song Han, Zhaohui Wu, Yuting Zhang
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2010.
[paper]

This paper focuses on the task of recovering the neutral 3D face of a person when given his/her 3D face model with facial expression. We propose a learning-based expression removal framework to tackle this task. Our basic idea is to model expression residue from samples, and then use the inferred expression residue from the input expressional face model to recover the neutral one. A two-step non-rigid alignment method is introduced to make all the face models topologically share a common structure. Then we construct two spaces, normal space and expression residue space, for modeling expression. Therefore, the expression removal problem can be formalized as the inference of expression residue from normal spaces. The neutral face model can be generated in a Poisson-based framework by the inferred expression residue. The experimental results on BU-3DFED database demonstrate the effectiveness of our approach.
  author={Gang Pan and Song Han and Zhaohui Wu and Yuting Zhang},
  booktitle={{IEEE} Conference on Computer Vision and Pattern Recognition ({CVPR})},
  title={Removal of {3D} Facial Expressions: a Learning-based Approach},

Accelerometer-based Gait Recognition via Voting by Signature Points
Gang Pan, Yuting Zhang, Zhaohui Wu
Electronics Letters, vol. 45, no. 22, pp. 1116-1118, October 2009. doi: 10.1049/el.2009.2301
PRC Patent: 200910153244.2
[paper] [related slides]

This letter presents a novel algorithm to recognize human identities via gait using body-worn accelerometers. It uses acceleration information to measure human gait dynamics. Acceleration-based gait recognition is a non-intrusive biometric measurement, which is insensitive to changes of lighting conditions and viewpoint. The proposed algorithm first extracts signature points from gait acceleration signals, and then identifies the gait pattern using a signature point-based voting scheme. Experiments with a data set of 30 subjects show that the proposed algorithm significantly outperforms other existing methods and achieves a high recognition rate of 96.7% in the case of five accelerometers.
  author={Gang Pan and Yuting Zhang and Zhaohui Wu},
  title={Accelerometer-based Gait Recognition via Voting by Signature Points},
  journal={Electronics Letters},
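
A signature point-based voting scheme like the one named in the abstract above could look roughly as follows. This is a sketch under stated assumptions: the letter's matching rule is not given here, so nearest-neighbor matching by squared Euclidean distance is assumed, and `identify_by_voting`, the subject labels, and the two-dimensional descriptors are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def identify_by_voting(probe_descriptors: Sequence[Sequence[float]],
                       gallery: Sequence[Tuple[str, Sequence[float]]]) -> str:
    """Each probe descriptor votes for the subject owning its nearest gallery descriptor.

    Assumes at least one probe descriptor and a non-empty gallery.
    """
    def sqdist(u: Sequence[float], v: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(u, v))

    votes: Counter = Counter()
    for p in probe_descriptors:
        subject, _ = min(gallery, key=lambda entry: sqdist(p, entry[1]))
        votes[subject] += 1
    # The subject with the most votes is returned as the identity.
    return votes.most_common(1)[0][0]


# Toy gallery (assumption): one descriptor per enrolled subject.
gallery: List[Tuple[str, List[float]]] = [("alice", [0.0, 0.0]), ("bob", [1.0, 1.0])]
probe = [[0.1, 0.0], [0.9, 1.0], [0.2, 0.1]]
identity = identify_by_voting(probe, gallery)
```

Because the decision aggregates many local matches, a few mismatched points do not change the outcome as long as most points vote for the correct subject.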

GPU-Accelerated Parallel Realistic 3D Facial Expression Synthesis
Song Han, Gang Pan, Junkang Fu, Yuting Zhang
Journal of Computer-Aided Design and Computer Graphics (Chinese), vol. 23, no. 5, pp. 747-755, May 2011.

  author={Song Han and Gang Pan and Junkang Fu and Yuting Zhang},
  title={{GPU}-Accelerated Parallel Realistic {3D} Facial Expression Synthesis},
  journal={Journal of Computer-Aided Design and Computer Graphics (Chinese)}
  • Object Detection Using Deep Neural Networks
    Presented at a2-dlearn2016 (official website)
    Ann Arbor, MI, USA, Nov 2016. 
  • Accelerometer-based gait recognition [slides]
    Presented at IWCST'11 (BUAA-Tsukuba-ZJU workshop, English website@BUAA, Japanese website@Tsukuba)
    Beijing, China, Oct 2011.
  • Conference Reviewer / PC member:
    • CVPR 2017
    • ICCV 2017
    • NIPS 2016, 2017
    • ICML 2016, 2017
    • IJCAI 2016, 2017
    • ICLR 2016, 2017
    • AISTATS 2017