Yuting Zhang - Rekognition & Video Analysis, Amazon Web Services

Yuting Zhang's Homepage

Yuting Zhang

Zhang, Yuting Ph.D.

Senior Applied Scientist

Computer Vision Science @ Amazon Web Services (AWS)

[Google Scholar] [LinkedIn]

––––––––––––––––––––––––––––––––––

Since August 2018, I have been an applied scientist in the Rekognition and Textract team at Amazon Web Services, where I have been developing machine-learning solutions to joint vision and language problems. Before that, I was a postdoctoral fellow (January 2016 – July 2018) and a visiting Ph.D. student (September 2013 – December 2015) working with Honglak Lee at the University of Michigan, Ann Arbor. I received my Ph.D. from Zhejiang University in December 2015, advised by Gang Pan. I was also working with Kui (Chris) Jia and Yi Ma in the Advanced Digital Sciences Center (Singapore), UIUC, in 2012.

DocTr: Document Transformer for Structured Information Extraction in Documents
Haofu Liao, Aruni Roychowdhury, Weijian Li, Ankan Bansal, Yuting Zhang, Zhuowen Tu, Ravi Kumar Satzoda, R. Manmatha, Vijay Mahadevan
International Conference on Computer Vision (ICCV), October 2023.
[paper] [arXiv]

We present a new formulation for structured information extraction (SIE) from visually rich documents. It aims to address the limitations of existing IOB tagging or graph-based formulations, which are either overly reliant on the correct ordering of input text or struggle with decoding a complex graph. Instead, motivated by anchor-based object detectors in vision, we represent an entity as an anchor word and a bounding box, and represent entity linking as the association between anchor words. This is more robust to text ordering, and maintains a compact graph for entity linking. The formulation motivates us to introduce 1) a DOCument TRansformer (DocTr) that aims at detecting and associating entity bounding boxes in visually rich documents, and 2) a simple pre-training strategy that helps learn entity detection in the context of language. Evaluations on three SIE benchmarks show the effectiveness of the proposed formulation, and the overall approach outperforms existing solutions.

@inproceedings{2023-iccv-doctr,
  author={Haofu Liao and Aruni Roychowdhury and Weijian Li and Ankan Bansal and Yuting Zhang and Zhuowen Tu and Ravi Kumar Satzoda and R. Manmatha and Vijay Mahadevan},
  booktitle={International Conference on Computer Vision ({ICCV})},
  title={{DocTr}: Document Transformer for Structured Information Extraction in Documents},
  year={2023},
  month={October},
  url={http://www.ytzhang.net/files/publications/2023-iccv-doctr.pdf},
  arxiv={2307.07929}
}

PolyFormer: Referring Image Segmentation as Sequential Polygon Generation
Jiang Liu, Hui Ding, Zhaowei Cai, Yuting Zhang, Vijay Mahadevan, Ravi Kumar Satzoda, R. Manmatha
Conference on Computer Vision and Pattern Recognition (CVPR), June 2023.
[paper] [arXiv]

In this work, instead of directly predicting the pixel-level segmentation masks, the problem of referring image segmentation is formulated as sequential polygon generation, and the predicted polygons can be later converted into segmentation masks. This is enabled by a new sequence-to-sequence framework, Polygon Transformer (PolyFormer), which takes a sequence of image patches and text query tokens as input, and outputs a sequence of polygon vertices autoregressively. For more accurate geometric localization, we propose a regression-based decoder, which predicts the precise floating-point coordinates directly, without any coordinate quantization error. In the experiments, PolyFormer outperforms the prior art by a clear margin, e.g., 5.40% and 4.52% absolute improvements on the challenging RefCOCO+ and RefCOCOg datasets. It also shows strong generalization ability when evaluated on the referring video segmentation task without fine-tuning, e.g., achieving competitive 61.5% J&F on the Ref-DAVIS17 dataset.
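
The conversion from predicted polygons to segmentation masks mentioned above can be sketched as follows. This is a minimal illustration, not the PolyFormer implementation: the function names `point_in_polygon` and `polygon_to_mask` are hypothetical, and the even-odd ray-casting rule on pixel centers is an assumed rasterization choice.

```python
def point_in_polygon(x, y, vertices):
    """Even-odd ray casting: count edge crossings of a ray going in +x."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can cross.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge meets that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def polygon_to_mask(vertices, height, width):
    """Rasterize a polygon by testing the center of every pixel."""
    return [
        [1 if point_in_polygon(c + 0.5, r + 0.5, vertices) else 0
         for c in range(width)]
        for r in range(height)
    ]


# Floating-point vertices (x, y), as a regression-based decoder would output.
mask = polygon_to_mask(
    [(1.0, 1.0), (4.0, 1.0), (4.0, 3.0), (1.0, 3.0)], height=4, width=5
)
for row in mask:
    print(row)
```

Testing pixel centers keeps the vertices in continuous coordinates until the last step, which matches the idea of avoiding coordinate quantization during prediction.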

@inproceedings{2023-cvpr-polyformer,
  author={Jiang Liu and Hui Ding and Zhaowei Cai and Yuting Zhang and Vijay Mahadevan and Ravi Kumar Satzoda and R. Manmatha},
  booktitle={Conference on Computer Vision and Pattern Recognition ({CVPR})},
  title={{PolyFormer}: Referring Image Segmentation as Sequential Polygon Generation},
  year={2023},
  month={June},
  url={http://www.ytzhang.net/files/publications/2023-cvpr-polyformer.pdf},
  arxiv={2302.07387}
}

Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries
Qi Dong, Zhuowen Tu, Haofu Liao, Yuting Zhang, Vijay Mahadevan, Stefano Soatto
International Conference on Computer Vision (ICCV), October 2021.
[paper (supp)] [arXiv]

Computer vision applications such as visual relationship detection and human object interaction can be formulated as a composite (structured) set detection problem in which both the parts (subject, object, and predicate) and the sum (triplet as a whole) are to be detected in a hierarchical fashion. In this paper, we present a new approach, denoted Part-and-Sum detection Transformer (PST), to perform end-to-end visual composite set detection. Different from existing Transformers in which queries are at a single level, we simultaneously model the joint part and sum hypotheses/interactions with composite queries and attention modules. We explicitly incorporate sum queries to enable better modeling of the part-and-sum relations that are absent in the standard Transformers. Our approach also uses novel tensor-based part queries and vector-based sum queries, and models their joint interaction. We report experiments on two vision tasks, visual relationship detection and human object interaction, and demonstrate that PST achieves state-of-the-art results among single-stage models, while nearly matching the results of custom-designed two-stage models.

@inproceedings{2021-iccv-PST,
  author={Qi Dong and Zhuowen Tu and Haofu Liao and Yuting Zhang and Vijay Mahadevan and Stefano Soatto},
  booktitle={International Conference on Computer Vision ({ICCV})},
  title={Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries},
  year={2021},
  month={October},
  url={http://www.ytzhang.net/files/publications/2021-iccv-PST.pdf},
  arxiv={2105.02170}
}

Humble Teachers Teach Better Students for Semi-Supervised Object Detection
Yihe Tang, Weifeng Chen, Yijun Luo, Yuting Zhang
Conference on Computer Vision and Pattern Recognition (CVPR), June 2021.
[paper] [arXiv]

We propose a semi-supervised approach for contemporary object detectors following the teacher-student dual model framework. Our method features 1) the exponential moving averaging strategy to update the teacher from the student online, 2) the use of plenty of region proposals and soft pseudo-labels as the student’s training targets, and 3) a lightweight detection-specific data ensemble for the teacher to generate more reliable pseudo-labels. Compared to the recent state of the art, STAC, which uses hard labels on sparsely selected hard pseudo samples, the teacher in our model exposes richer information to the student with soft labels on many proposals. Our model achieves COCO-style AP of 53.04% on VOC07 val set, 8.4% better than STAC, when using VOC12 as unlabeled data. On MSCOCO, it outperforms prior work when only a small percentage of data is taken as labeled. It also reaches 53.8% AP on MS-COCO test-dev with 3.1% gain over the fully supervised ResNet-152 Cascaded R-CNN, by tapping into unlabeled data of a similar size to the labeled data.
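
The exponential-moving-average teacher update named in point 1) can be sketched as below. This is an illustrative toy, not the paper's code: weights are plain floats in a dictionary, `ema_update` is a hypothetical helper, and the decay values are assumptions chosen for readability.

```python
def ema_update(teacher, student, decay=0.999):
    """Move each teacher weight toward the matching student weight.

    teacher, student: dicts mapping parameter names to floats.
    decay: fraction of the old teacher value that is kept (assumed value).
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


teacher = {"w": 0.0, "b": 1.0}
student = {"w": 1.0, "b": 1.0}

# A small decay is used here only so the effect is visible in three steps.
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.5)

print(teacher)
```

With a decay close to 1, the teacher changes slowly and behaves like a temporal ensemble of recent student weights, which is the usual motivation for this kind of update.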

@inproceedings{2021-cvpr-humble-teacher,
  author={Yihe Tang and Weifeng Chen and Yijun Luo and Yuting Zhang},
  booktitle={Conference on Computer Vision and Pattern Recognition ({CVPR})},
  title={Humble Teachers Teach Better Students for Semi-Supervised Object Detection},
  year={2021},
  month={June},
  url={http://www.ytzhang.net/files/publications/2021-cvpr-humble-teacher.pdf},
  arxiv={2106.10456}
}

Dynamic Grown Generative Adversarial Networks
Lanlan Liu, Yuting Zhang, Jia Deng, Stefano Soatto
AAAI Conference on Artificial Intelligence (AAAI), February 2021.
[paper] [arXiv]

Recent work introduced progressive network growing as a promising way to ease the training for large GANs, but the model design and architecture-growing strategy remain under-explored and need manual design for different image data. In this paper, we propose a method to dynamically grow a GAN during training, optimizing the network architecture and its parameters together with automation. The method embeds architecture search techniques as an interleaving step with gradient-based training to periodically seek the optimal architecture-growing strategy for the generator and discriminator. It enjoys the benefits of both eased training because of progressive growing and improved performance because of broader architecture design space. Experimental results demonstrate a new state of the art in image generation. Observations in the search procedure also provide constructive insights into the GAN model design such as generator-discriminator balance and convolutional layer choices.

@inproceedings{2021-aaai-dggan,
  author={Lanlan Liu and Yuting Zhang and Jia Deng and Stefano Soatto},
  booktitle={AAAI Conference on Artificial Intelligence ({AAAI})},
  title={Dynamic Grown Generative Adversarial Networks},
  year={2021},
  month={February},
  url={http://www.ytzhang.net/files/publications/2021-aaai-dggan.pdf},
  arxiv={2106.08505}
}

Visual Question Answering on Image Sets
Ankan Bansal, Yuting Zhang, Rama Chellappa
European Conference on Computer Vision (ECCV), August 2020.
[paper (main, appendices)] [arXiv] [project (ISVQA dataset and baseline code)]

We introduce the task of Image-Set Visual Question Answering (ISVQA), which generalizes the commonly studied single-image VQA problem to multi-image settings. Taking a natural language question and a set of images as input, it aims to answer the question based on the content of the images. The questions can be about objects and relationships in one or more images or about the entire scene depicted by the image set. To enable research in this new topic, we introduce two ISVQA datasets – indoor and outdoor scenes. They simulate the real-world scenarios of indoor image collections and multiple car-mounted cameras, respectively. The indoor-scene dataset contains 91,479 human-annotated questions for 48,138 image sets, and the outdoor-scene dataset has 49,617 questions for 12,746 image sets. We analyze the properties of the two datasets, including question-and-answer distributions, types of questions, biases in dataset, and question-image dependencies. We also build new baseline models to investigate new research challenges in ISVQA.

@inproceedings{2020-eccv-isvqa,
  author={Ankan Bansal and Yuting Zhang and Rama Chellappa},
  booktitle={European Conference on Computer Vision ({ECCV})},
  title={Visual Question Answering on Image Sets},
  year={2020},
  month={August},
  url={http://www.ytzhang.net/files/publications/2020-eccv-isvqa.pdf},
  arxiv={2008.11976}
}

Unsupervised Discovery of Object Landmarks as Structural Representations
Yuting Zhang, Yijie Guo, Yixin Jin, Yijun Luo, Zhiyuan He, Honglak Lee
Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
Oral presentation
[paper (main, appendices, supp-videos .tar.gz)] [arXiv] [project (code & results)] [poster] [slides] [oral presentation .mp4]

Deep neural networks can model images with rich latent representations, but they cannot naturally conceptualize structures of object categories in a human-perceptible way. This paper addresses the problem of learning object structures in an image modeling process without supervision. We propose an autoencoding formulation to discover landmarks as explicit structural representations. The encoding module outputs landmark coordinates, whose validity is ensured by constraints that reflect the necessary properties for landmarks. The decoding module takes the landmarks as a part of the learnable input representations in an end-to-end differentiable framework. Our discovered landmarks are semantically meaningful and more predictive of manually annotated landmarks than those discovered by previous methods. The coordinates of our landmarks are also complementary features to pretrained deep-neural-network representations in recognizing visual attributes. In addition, the proposed method naturally creates an unsupervised, perceptible interface to manipulate object shapes and decode images with controllable structures.

@inproceedings{2018-cvpr-lmdis-rep,
  author={Yuting Zhang and Yijie Guo and Yixin Jin and Yijun Luo and Zhiyuan He and Honglak Lee},
  booktitle={Conference on Computer Vision and Pattern Recognition ({CVPR})},
  title={Unsupervised Discovery of Object Landmarks as Structural Representations},
  year={2018},
  month={June},
  url={http://www.ytzhang.net/files/publications/2018-cvpr-lmdis-rep.pdf},
  arxiv={1804.04412}
}

Hierarchical Novelty Detection for Visual Object Recognition
Kibok Lee, Kimin Lee, Kyle Min, Yuting Zhang, Jinwoo Shin, Honglak Lee
Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[paper (supp)] [arXiv] [code]

Deep neural networks have achieved impressive success in large-scale visual object recognition tasks with a predefined set of classes. However, recognizing objects of novel classes unseen during training still remains challenging. The problem of detecting such novel classes has been addressed in the literature, but most prior works have focused on providing simple binary or regressive decisions, e.g., the output would be “known,” “novel,” or corresponding confidence intervals. In this paper, we study more informative novelty detection schemes based on a hierarchical classification framework. For an object of a novel class, we aim to find its closest super class in the hierarchical taxonomy of known classes. To this end, we propose two different approaches termed top-down and flatten methods, as well as their combination. The essential ingredients of our methods are confidence-calibrated classifiers, data relabeling, and the leave-one-out strategy for modeling novel classes under the hierarchical taxonomy. Furthermore, our method can generate a hierarchical embedding that leads to improved generalized zero-shot learning performance in combination with other commonly-used semantic embeddings.

@inproceedings{2018-cvpr-hierachical-novelty,
  author={Kibok Lee and Kimin Lee and Kyle Min and Yuting Zhang and Jinwoo Shin and Honglak Lee},
  booktitle={Conference on Computer Vision and Pattern Recognition ({CVPR})},
  title={Hierarchical Novelty Detection for Visual Object Recognition},
  year={2018},
  month={June},
  url={http://www.ytzhang.net/files/publications/2018-cvpr-hierachical-novelty.pdf},
  arxiv={1804.00722}
}

Discriminative Bimodal Networks for Visual Localization and Detection with Natural Language Queries
Yuting Zhang, Luyao Yuan, Yijie Guo, Zhiyuan He, I-An Huang, Honglak Lee
Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
Spotlight presentation
[paper 5M (high-res 24M)] [arXiv] [data & development toolbox] [project (TensorFlow & Caffe Code)] [slides (spotlight video)] [poster]

Associating image regions with text queries has been recently explored as a new way to bridge visual and linguistic representations. A few pioneering approaches have been proposed based on recurrent neural language models trained generatively (e.g., generating captions), but they achieve somewhat limited localization accuracy. To better address natural-language-based visual entity localization, we propose a discriminative approach. We formulate a discriminative bimodal neural network (DBNet), which can be trained by a classifier with extensive use of negative samples. Our training objective encourages better localization on single images, incorporates text phrases in a broad range, and properly pairs image regions with text phrases into positive and negative examples. Experiments on the Visual Genome dataset demonstrate the proposed DBNet significantly outperforms previous state-of-the-art methods both for localization on single images and for detection on multiple images. We also establish an evaluation protocol for natural-language visual detection.

@inproceedings{2017-cvpr-dbnet,
  author={Yuting Zhang and Luyao Yuan and Yijie Guo and Zhiyuan He and I-{An} Huang and Honglak Lee},
  booktitle={Conference on Computer Vision and Pattern Recognition ({CVPR})},
  title={Discriminative Bimodal Networks for Visual Localization and Detection with Natural Language Queries},
  year={2017},
  month={July},
  url={http://www.ytzhang.net/files/publications/2017-cvpr-dbnet.pdf},
  arxiv={1704.03944}
}

Towards Understanding the Invertibility of Convolutional Neural Networks
Anna C. Gilbert, Yi Zhang, Kibok Lee, Yuting Zhang, Honglak Lee
International Joint Conference on Artificial Intelligence (IJCAI), August 2017.
[paper] [arXiv]

Several recent works have empirically observed that Convolutional Neural Nets (CNNs) are (approximately) invertible. To understand this approximate invertibility phenomenon and how to leverage it more effectively, we focus on a theoretical explanation and develop a mathematical model of sparse signal recovery that is consistent with CNNs with random weights. We give an exact connection to a particular model of model-based compressive sensing (and its recovery algorithms) and random-weight CNNs. We show empirically that several learned networks are consistent with our mathematical analysis and then demonstrate that with such a simple theoretical framework, we can obtain reasonable reconstruction results on real images. We also discuss gaps between our model assumptions and the CNN trained for classification in practical scenarios.

@inproceedings{2017-ijcai-CNNmrip,
  author={Anna C. Gilbert and Yi Zhang and Kibok Lee and Yuting Zhang and Honglak Lee},
  booktitle={International Joint Conference on Artificial Intelligence ({IJCAI})},
  title={Towards Understanding the Invertibility of Convolutional Neural Networks},
  year={2017},
  month={August},
  url={http://www.ytzhang.net/files/publications/2017-ijcai-CNNmrip.pdf},
  arxiv={1705.08664}
}

Augmenting Supervised Neural Networks with Unsupervised Objectives for Large-Scale Image Classification
Yuting Zhang, Kibok Lee, Honglak Lee
International Conference on Machine Learning (ICML), June 2016.
[paper (main, supp.)] [arXiv] [code & model] [slides] [poster] [more image reconstruction examples]

Unsupervised learning and supervised learning are key research topics in deep learning. However, as high-capacity supervised neural networks trained with a large amount of labels have achieved remarkable success in many computer vision tasks, the availability of large-scale labeled images reduced the significance of unsupervised learning. Inspired by the recent trend toward revisiting the importance of unsupervised learning, we investigate joint supervised and unsupervised learning in a large-scale setting by augmenting existing neural networks with decoding pathways for reconstruction. First, we demonstrate that the intermediate activations of pretrained large-scale classification networks preserve almost all the information of input images except a portion of local spatial details. Then, by end-to-end training of the entire augmented architecture with the reconstructive objective, we show improvement of the network performance for supervised tasks. We evaluate several variants of autoencoders, including the recently proposed “what-where” autoencoder that uses the encoder pooling switches, to study the importance of the architecture design. Taking the 16-layer VGGNet trained under the ImageNet ILSVRC 2012 protocol as a strong baseline for image classification, our methods improve the validation-set accuracy by a noticeable margin.

@inproceedings{2016-icml-recon-dec,
  author={Yuting Zhang and Kibok Lee and Honglak Lee},
  booktitle={International Conference on Machine Learning ({ICML})},
  title={Augmenting Supervised Neural Networks with Unsupervised Objectives for Large-Scale Image Classification},
  year={2016},
  month={June},
  url={http://www.ytzhang.net/files/publications/2016-icml-recon-dec.pdf},
  arxiv={1606.06582},
  pages={612-621}
}

Deep Visual Analogy-Making
Scott Reed, Yi Zhang, Yuting Zhang, Honglak Lee
Advances in Neural Information Processing Systems (NIPS), December 2015.
Oral presentation
[paper] [code] [data]

In addition to identifying the content within a single image, relating images and generating related images are critical tasks for image understanding. Recently, deep convolutional networks have yielded breakthroughs in predicting image labels, annotations and captions, but have only just begun to be used for generating high-quality images. In this paper we develop a novel deep network trained end-to-end to perform visual analogy making, which is the task of transforming a query image according to an example pair of related images. Solving this problem requires both accurately recognizing a visual relationship and generating a transformed query image accordingly. Inspired by recent advances in language modeling, we propose to solve visual analogies by learning to map images to a neural embedding in which analogical reasoning is simple, such as by vector subtraction and addition. In experiments, our model effectively models visual analogies on several datasets: 2D shapes, animated video game sprites, and 3D car models.
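
The "vector subtraction and addition" step can be illustrated with a toy embedding table. This is only a sketch under strong assumptions: the embeddings below are hand-made 2-D vectors rather than learned ones, `analogy_target` and `nearest` are hypothetical helpers, and a nearest-neighbor lookup stands in for the decoder that would generate the transformed image.

```python
def analogy_target(a, b, c):
    """Solve a : b :: c : ? in embedding space as c + (b - a)."""
    return [ci + (bi - ai) for ai, bi, ci in zip(a, b, c)]


def nearest(query, candidates):
    """Return the name of the candidate embedding closest to the query."""
    def sq_dist(u, v):
        return sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return min(candidates, key=lambda name: sq_dist(query, candidates[name]))


# Toy embeddings: first axis encodes shape, second axis encodes size.
embeddings = {
    "square_small": [0.0, 0.0],
    "square_large": [0.0, 1.0],
    "circle_small": [1.0, 0.0],
    "circle_large": [1.0, 1.0],
}

target = analogy_target(
    embeddings["square_small"],
    embeddings["square_large"],
    embeddings["circle_small"],
)
answer = nearest(target, embeddings)
print(target, answer)
```

The example works because the "size" change is the same vector regardless of shape; learning an embedding with that property is the hard part that the network addresses.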

@inproceedings{2015-nips-analogy,
  author={Scott Reed and Yi Zhang and Yuting Zhang and Honglak Lee},
  booktitle={Advances in Neural Information Processing Systems ({NIPS})},
  title={Deep Visual Analogy-Making},
  year={2015},
  month={December},
  url={http://www.ytzhang.net/files/publications/2015-nips-analogy.pdf}
}

Improving Object Detection with Deep Convolutional Networks via Bayesian Optimization and Structured Prediction
Yuting Zhang, Kihyuk Sohn, Ruben Villegas, Gang Pan, Honglak Lee
Conference on Computer Vision and Pattern Recognition (CVPR), June 2015. doi: 10.1109/CVPR.2015.7298621
Oral presentation & 1st Winner of CV Community Top Paper Award: CVPR 2015 (OpenCV’s People’s Vote Winning Papers) [link]
[paper (main, supp.)] [arXiv] [project (code & model)] [slides 7M (high-res 45M)] [poster]

Object detection systems based on the deep convolutional neural network (CNN) have recently made groundbreaking advances on several object detection benchmarks. While the features learned by these high-capacity neural networks are discriminative for categorization, inaccurate localization is still a major source of error for detection. Building upon high-capacity CNN architectures, we address the localization problem by 1) using a search algorithm based on Bayesian optimization that sequentially proposes candidate regions for an object bounding box, and 2) training the CNN with a structured loss that explicitly penalizes the localization inaccuracy. In experiments, we demonstrate that each of the proposed methods improves the detection performance over the baseline method on PASCAL VOC 2007 and 2012 datasets. Furthermore, the two methods are complementary and significantly outperform the previous state-of-the-art when combined.

@inproceedings{2015-cvpr-det,
  author={Yuting Zhang and Kihyuk Sohn and Ruben Villegas and Gang Pan and Honglak Lee},
  booktitle={Conference on Computer Vision and Pattern Recognition ({CVPR})},
  title={Improving Object Detection with Deep Convolutional Networks via {Bayesian} Optimization and Structured Prediction},
  year={2015},
  month={June},
  url={http://www.ytzhang.net/files/publications/2015-cvpr-det.pdf},
  arxiv={1504.03293},
  pages={249-258},
  doi={10.1109/CVPR.2015.7298621}
}

Single Sample Face Recognition via Learning Deep Supervised Autoencoders
Shenghua Gao, Yuting Zhang, Kui Jia, Jiwen Lu, Yingying Zhang
IEEE Transactions on Information Forensics and Security, vol. 10, no. 10, pp. 2108-2118, October 2015. doi: 10.1109/TIFS.2015.2446438
[paper]

This paper targets learning robust image representation for single training sample per person face recognition. Motivated by the success of deep learning in image representation, we propose a supervised auto-encoder, which is a new type of building block for deep architectures. Two features distinguish our supervised auto-encoder from the standard auto-encoder. First, we enforce the faces with variants to be mapped to the canonical face of the person, for example, a frontal face with neutral expression and normal illumination. Second, we enforce features corresponding to the same person to be similar. As a result, our supervised auto-encoder extracts the features which are robust to variances in illumination, expression, occlusion, and pose, and facilitates the face recognition. We stack such supervised auto-encoders to get the deep architecture and use it for extracting features in image representation. Experimental results on the AR, Extended Yale B, CMU-PIE, and Multi-PIE datasets demonstrate that by coupling with the commonly used sparse representation based classification, our stacked supervised auto-encoders based face representation significantly outperforms the commonly used image representations in single sample per person face recognition, and it achieves higher recognition accuracy compared with other deep learning models, including the deep Lambertian network, in spite of much less training data and without any domain information. Moreover, supervised auto-encoder can also be used for face verification, which further demonstrates its effectiveness for face representation.

@article{2015-tifs-sup-ae,
  author={Shenghua Gao and Yuting Zhang and Kui Jia and Jiwen Lu and Yingying Zhang},
  title={Single Sample Face Recognition via Learning Deep Supervised Autoencoders},
  year={2015},
  month={October},
  url={http://www.ytzhang.net/files/publications/2015-tifs-sup-ae.pdf},
  pages={2108-2118},
  volume={10},
  number={10},
  doi={10.1109/TIFS.2015.2446438},
  journal={{IEEE} Transactions on Information Forensics and Security},
  issn={1556-6013}
}

Robust Face Recognition by Constrained Part-based Alignment
Yuting Zhang, Kui Jia, Yueming Wang, Gang Pan, Tsung-Han Chan, Yi Ma
ArXiv preprint, 2015.
[paper] [arXiv]

Developing a reliable and practical face recognition system is a long-standing goal in computer vision research. Existing literature suggests that pixel-wise face alignment is the key to achieving high-accuracy face recognition. By assuming a human face as piece-wise planar surfaces, where each surface corresponds to a facial part, we develop in this paper a Constrained Part-based Alignment (CPA) algorithm for face recognition across pose and/or expression. Our proposed algorithm is based on a trainable CPA model, which learns appearance evidence of individual parts and a tree-structured shape configuration among different parts. Given a probe face, CPA simultaneously aligns all its parts by fitting them to the appearance evidence with consideration of the constraint from the tree-structured shape configuration. This objective is formulated as a norm minimization problem regularized by graph likelihoods. CPA can be easily integrated with many existing classifiers to perform part-based face recognition. Extensive experiments on benchmark face datasets show that CPA outperforms or is on par with existing methods for robust face recognition across pose, expression, and/or illumination changes.

@article{2015-preprint-cpa,
  author={Yuting Zhang and Kui Jia and Yueming Wang and Gang Pan and Tsung-Han Chan and Yi Ma},
  title={Robust Face Recognition by Constrained Part-based Alignment},
  year={2015},
  url={http://www.ytzhang.net/files/publications/2015-preprint-cpa.pdf},
  arxiv={1501.04717},
  journal={ArXiv preprint}
}

Accelerometer-based Gait Recognition by Sparse Representation of Signature Points with Clusters
Yuting Zhang, Gang Pan, Kui Jia, Minlong Lu, Yueming Wang, Zhaohui Wu
IEEE Transactions on Cybernetics, vol. 45, no. 9, pp. 1864-1875, September 2015. doi: 10.1109/TCYB.2014.2361287
[paper] [dataset] [code]

Gait, as a promising biometric for recognizing human identities, can be non-intrusively captured as a series of acceleration signals using wearable or portable smart devices. It can be used for access control. Most existing methods on accelerometer-based gait recognition require explicit step-cycle detection, suffering from cycle detection failures and inter-cycle phase misalignment. We propose a novel algorithm that avoids both of these problems. It makes use of a type of salient points termed Signature Points (SPs), and has three components: (1) a multi-scale SP extraction method, including the localization and SP descriptors; (2) a sparse representation scheme for encoding newly emerged SPs with known ones in terms of their descriptors, where the phase propinquity of the SPs in a cluster is leveraged to ensure the physical meaningfulness of the codes; and (3) a classifier for the sparse-code collections associated with the SPs of a series. Experimental results on our publicly available dataset of 175 subjects showed that our algorithm outperformed existing methods, even if the step cycles were perfectly detected for them. When the accelerometers at 5 different body locations were used together, it achieved the rank-1 accuracy of 95.8% for identification, and the equal error rate of 2.2% for verification.

@article{2015-tcyb-gait,
  author={Yuting Zhang and Gang Pan and Kui Jia and Minlong Lu and Yueming Wang and Zhaohui Wu},
  title={Accelerometer-based Gait Recognition by Sparse Representation of Signature Points with Clusters},
  year={2015},
  month={September},
  url={http://www.ytzhang.net/files/publications/2015-tcyb-gait.pdf},
  pages={1864-1875},
  volume={45},
  number={9},
  doi={10.1109/TCYB.2014.2361287},
  journal={IEEE Transactions on Cybernetics},
  issn={2168-2267}
}

Learning to Disentangle Factors of Variation with Manifold Interaction
Scott Reed, Kihyuk Sohn, Yuting Zhang, Honglak Lee
International Conference on Machine Learning (ICML), 2014.
[paper] [code]

Many latent factors of variation interact to generate sensory data; for example, pose, morphology and expression in face images. In this work, we propose to learn manifold coordinates for the relevant factors of variation and to model their joint interaction. Many existing feature learning algorithms focus on a single task and extract features that are sensitive to the task-relevant factors and invariant to all others. However, models that just extract a single set of invariant features do not exploit the relationships among the latent factors. To address this, we propose a higher-order Boltzmann machine that incorporates multiplicative interactions among groups of hidden units that each learn to encode a distinct factor of variation. Furthermore, we propose correspondence-based training strategies that allow effective disentangling. Our model achieves state-of-the-art emotion recognition and face verification performance on the Toronto Face Database. We also demonstrate disentangled features learned on the CMU Multi-PIE dataset.

@inproceedings{2014-icml-disentangling,
  author={Scott Reed and Kihyuk Sohn and Yuting Zhang and Honglak Lee},
  booktitle={International Conference on Machine Learning ({ICML})},
  title={Learning to Disentangle Factors of Variation with Manifold Interaction},
  year={2014},
  url={http://www.ytzhang.net/files/publications/2014-icml-disentangling.pdf}
}

L1-Norm Latent SVM for Compact Features in Object Detection
Min Tan, Gang Pan, Yueming Wang, Yuting Zhang, Zhaohui Wu
Neurocomputing, vol. 139, pp. 56-64, 2014. doi: 10.1016/j.neucom.2013.09.054

The deformable part model is one of the most effective methods for object detection. However, it simultaneously computes the scores for a holistic filter and several part filters in a relatively high-dimensional feature space, which causes the problem of low computational efficiency. This paper proposes an approach to select compact and effective features by learning a sparse deformable part model using L1-norm latent SVM. A stochastic truncated sub-gradient descent method is presented to solve the L1-norm latent SVM problem. Convergence of the algorithm is proved. Extensive experiments are conducted on the INRIA and PASCAL VOC 2007 datasets. A highly compact feature in our method can reach the state-of-the-art performance. The feature dimensionality is reduced to 12% of the original one in the INRIA dataset and less than 30% in most categories of PASCAL VOC 2007 dataset. Compared with the features used in L2-norm latent SVM, the average precisions (AP) have almost no drop using the reduced feature. With our method, the detection score computation is 3 times faster than that of the L2-norm latent SVM method. When the cascade strategy is applied, it can be further sped up by about an order of magnitude.
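
To give a feel for how a truncated sub-gradient step produces sparse weights under an L1 penalty, here is a generic sketch. It is not the algorithm from the paper, whose exact update rule is not stated in this abstract: `truncated_step` is a hypothetical helper that applies a plain sub-gradient step followed by soft-threshold truncation, and the learning rate and penalty values are assumptions.

```python
def truncated_step(weights, grads, lr, l1):
    """One sub-gradient step on the loss, then shrink weights toward zero.

    weights, grads: lists of floats (loss sub-gradient only, no L1 term).
    lr: learning rate; l1: L1 penalty strength (both assumed values).
    Weights whose magnitude falls below lr * l1 are truncated to exactly 0.
    """
    shrink = lr * l1
    updated = []
    for w, g in zip(weights, grads):
        v = w - lr * g
        if v > shrink:
            updated.append(v - shrink)
        elif v < -shrink:
            updated.append(v + shrink)
        else:
            updated.append(0.0)
    return updated


w = truncated_step([0.5, -0.02, 0.0], [0.1, 0.0, -0.05], lr=0.1, l1=0.5)
print(w)
```

Setting small weights to exactly zero, rather than merely making them small, is what allows the corresponding feature dimensions to be dropped at detection time.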

@article{2014-neurocomp-l1lsvm,
  author={Min Tan and Gang Pan and Yueming Wang and Yuting Zhang and Zhaohui Wu},
  title={L1-Norm Latent {SVM} for Compact Features in Object Detection},
  year={2014},
  pages={56-64},
  volume={139},
  doi={10.1016/j.neucom.2013.09.054},
  journal={Neurocomputing},
  issn={0925-2312}
}

Efficient Computation of Histograms on Densely Overlapped Polygonal Regions
Yuting Zhang, Yueming Wang, Gang Pan, Zhaohui Wu
Neurocomputing, vol. 118, pp. 141-149, 2013. doi: 10.1016/j.neucom.2013.02.027
[paper] [code]

This paper proposes a novel algorithm to efficiently compute the histograms in densely overlapped polygonal regions. An incremental scheme is used to reduce the computational complexity. By this scheme, only a few entries in an existing histogram need to be updated to obtain a new histogram. The updating procedure makes use of a few histograms attached to the polygon’s edges, which can be efficiently pre-computed in a similar incremental manner. Thus, the overall process can achieve higher computational efficiency. Further, we extend our method to efficiently evaluate objective functions on the histograms in polygonal regions. The experiments on natural images demonstrate the high efficiency of our method.

@article{2013-neurocomp-polyhist,
  author={Yuting Zhang and Yueming Wang and Gang Pan and Zhaohui Wu},
  title={Efficient Computation of Histograms on Densely Overlapped Polygonal Regions},
  year={2013},
  url={http://www.ytzhang.net/files/publications/2013-neurocomp-polyhist.pdf},
  pages={141-149},
  volume={118},
  doi={10.1016/j.neucom.2013.02.027},
  journal={Neurocomputing},
  issn={0925-2312}
}

GPU-Accelerated Parallel Realistic 3D Facial Expression Synthesis
Song Han, Gang Pan, Junkang Fu, Yuting Zhang
Journal of Computer-Aided Design and Computer Graphics (Chinese), vol. 23, no. 5, pp. 747-755, May 2011.

@article{2011-jcadcg-gpu-face,
  author={Song Han and Gang Pan and Junkang Fu and Yuting Zhang},
  title={{GPU}-Accelerated Parallel Realistic {3D} Facial Expression Synthesis},
  year={2011},
  month={May},
  pages={747-755},
  volume={23},
  number={5},
  journal={Journal of Computer-Aided Design and Computer Graphics (Chinese)}
}

Removal of 3D Facial Expressions: a Learning-based Approach
Gang Pan, Song Han, Zhaohui Wu, Yuting Zhang
Conference on Computer Vision and Pattern Recognition (CVPR), June 2010.
[paper]

This paper focuses on the task of recovering the neutral 3D face of a person when given his/her 3D face model with facial expression. We propose a learning-based expression removal framework to tackle this task. Our basic idea is to model expression residue from samples, and then use the inferred expression residue from the input expressional face model to recover the neutral one. A two-step non-rigid alignment method is introduced to make all the face models topologically share a common structure. Then we construct two spaces, normal space and expression residue space, for modeling expression. Therefore, the expression removal problem can be formalized as the inference of expression residue from normal spaces. The neutral face model can be generated in a Poisson-based framework by the inferred expression residue. The experimental results on BU-3DFED database demonstrate the effectiveness of our approach.

@inproceedings{2010-cvpr-3dface,
  author={Gang Pan and Song Han and Zhaohui Wu and Yuting Zhang},
  booktitle={Conference on Computer Vision and Pattern Recognition ({CVPR})},
  title={Removal of {3D} Facial Expressions: a Learning-based Approach},
  year={2010},
  month={June},
  url={http://www.ytzhang.net/files/publications/2010-cvpr-3dface.pdf},
  pages={2614-2621}
}

Accelerometer-based Gait Recognition via Voting by Signature Points
Gang Pan, Yuting Zhang, Zhaohui Wu
Electronics Letters, vol. 45, no. 22, pp. 1116-1118, October 2009. doi: 10.1049/el.2009.2301
PRC Patent: 200910153244.2
[paper] [related slides]

This letter presents a novel algorithm to recognize human identities via gait by body-worn accelerometers. It uses acceleration information to measure human gait dynamics. Acceleration-based gait recognition is a non-intrusive biometric measurement, which is insensitive to changes of lighting conditions and viewpoint. The proposed algorithm first extracts signature points from gait acceleration signals, and then identifies the gait pattern using a signature point-based voting scheme. Experiments with a data set of 30 subjects show that the proposed algorithm significantly outperforms other existing methods and achieves a high recognition rate of 96.7% with five accelerometers.

@article{2009-el-gait,
  author={Gang Pan and Yuting Zhang and Zhaohui Wu},
  title={Accelerometer-based Gait Recognition via Voting by Signature Points},
  year={2009},
  month={October},
  url={http://www.ytzhang.net/files/publications/2009-el-gait.pdf},
  pages={1116-1118},
  volume={45},
  number={22},
  doi={10.1049/el.2009.2301},
  journal={Electronics Letters},
  issn={0013-5194}
}

Organizer of NeurIPS 2020 EXPO Demonstration – “AWS Computer Vision Science”
Organizer of CVPR 2020 Workshop on Text and Documents in the Deep Learning Era
Conference Reviewer / PC member:
- CVPR 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024
- ICCV 2017, 2019, 2021, 2023
- ECCV 2018, 2020, 2022
- NeurIPS 2016, 2017, 2018, 2019, 2020, 2021, 2022
- ICML 2016, 2017, 2018, 2019, 2021, 2022, 2023
- ICLR 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023
- IJCAI 2016, 2017, 2018 (distinguished PC), 2019, 2020
- AAAI 2018, 2020
- AISTATS 2017, 2018, 2019, 2020, 2021, 2022
Journal Reviewer:
- IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
- IEEE Transactions on Image Processing (TIP)
- Neural Computation

Guest Lectures: Artificial Neural Networks and Deep Learning
In EECS 592 (AI Foundations), University of Michigan, Nov 27 & Nov 29, Fall 2017.
Guest Lectures: Artificial Neural Networks and Deep Learning
In EECS 492 (Introduction to Artificial Intelligence), University of Michigan, Nov 30 & Dec 5, Fall 2017.
Invited Talk: Object Detection Using Deep Neural Networks
In a²-dlearn²⁰¹⁶ (official website), Ann Arbor, MI, USA, Nov 2016.
Invited Talk: Accelerometer-based gait recognition [slides]
In IWCST'11 (BUAA-Tsukuba-ZJU workshop), Beijing, China, Oct 2011.
