
PROJECTS

Human Activity Recognition

Trajectory features are a powerful new way to describe video data. By leveraging the spatio-temporal range and structure of trajectories, they improve activity recognition performance compared to systems based on fixed local spatio-temporal volumes. This chapter places them in context, compares a sparse, generative model of extended trajectories to a dense, discriminative model of local trajectories, and explores ways to extend the sparse system with new kinds of information.

 

Book chapter in Advanced Topics in Computer Vision (PDF)

Registration of human silhouettes in pairs of stereo thermal–visible videos
 

In this article, we investigate the accuracy of similarity measures for thermal–visible image registration of human silhouettes, including Mutual Information (MI), Sum of Squared Differences (SSD), Normalized Cross-Correlation (NCC), Histograms of Oriented Gradients (HOG), Local Self-Similarity (LSS), Scale-Invariant Feature Transform (SIFT), Speeded-Up Robust Features (SURF), Census, Fast Retina Keypoint (FREAK), and Binary Robust Independent Elementary Features (BRIEF). We tested the various similarity measures in dense stereo matching tasks over 25,000 windows to obtain statistically significant results. To do so, we created a new dataset in which one to five humans walk in a scene at various depth planes. Results show that although MI is a very strong performer, particularly for large regions of interest (ROIs), LSS gives better accuracy when ROIs are small or segmented into small fragments, thanks to its ability to capture shape. The other tested similarity measures did not give consistently accurate results.
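
For intuition, here is a minimal NumPy sketch (not the paper's code) of how window-based matching can be scored with three of the tested measures (SSD, NCC, and MI); the window sizes and candidate data are illustrative.

```python
import numpy as np

def ssd(a, b):
    """Sum of squared differences (lower = more similar)."""
    return np.sum((a.astype(np.float64) - b.astype(np.float64)) ** 2)

def ncc(a, b):
    """Normalized cross-correlation (higher = more similar)."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def mutual_information(a, b, bins=32):
    """Mutual information from a joint histogram (higher = more similar)."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of image a
    py = pxy.sum(axis=0, keepdims=True)   # marginal of image b
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

# Example: score a thermal window against visible candidates along a scanline.
rng = np.random.default_rng(0)
thermal_win = rng.integers(0, 256, (40, 20))
candidates = [rng.integers(0, 256, (40, 20)) for _ in range(5)]
best = max(range(len(candidates)),
           key=lambda i: mutual_information(thermal_win, candidates[i]))
print("best disparity candidate:", best)
```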

Infrared Physics & Technology, 2014 

(PDF, Abstract, Dataset)

CVIU, 2013 (PDF, Abstract)

PR, 2013 (PDF, Abstract)

CVPR, 2011 (PDF, Abstract)

 

Thermal–visible image registration, sensor fusion, and people tracking

In this work, we propose a new integrated framework that addresses the problems of thermal–visible video registration, sensor fusion, and people tracking for far-range videos. The video registration is based on RANSAC trajectory-to-trajectory matching, which estimates an affine transformation matrix that maximizes the overlap of thermal and visible foreground pixels. Sensor fusion uses the aligned images to compute sum-rule silhouettes and then constructs thermal–visible object models. Finally, multiple object tracking uses the blobs constructed during sensor fusion to output trajectories. Results demonstrate that our proposed framework obtains better results for both image registration and tracking than separate image registration and tracking methods.
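
A minimal sketch of the registration step, assuming point correspondences sampled along matched trajectories are already available; OpenCV's RANSAC-based affine estimator stands in for the paper's trajectory-to-trajectory matching, and all data here is synthetic.

```python
import numpy as np
import cv2

# Hypothetical matched trajectory points: row i in `thermal_pts` corresponds
# to row i in `visible_pts` (e.g., foreground centroids sampled along
# matched trajectories). Real inputs would come from the tracker.
rng = np.random.default_rng(1)
thermal_pts = rng.uniform(0, 320, (50, 2)).astype(np.float32)
A_true = np.array([[1.02, 0.01, 5.0],
                   [-0.01, 0.98, -3.0]], dtype=np.float32)
visible_pts = cv2.transform(thermal_pts[:, None, :], A_true)[:, 0, :]
visible_pts[:5] += rng.uniform(-30, 30, (5, 2)).astype(np.float32)  # outliers

# RANSAC-fitted 2x3 affine matrix mapping thermal -> visible coordinates.
A_est, inliers = cv2.estimateAffine2D(thermal_pts, visible_pts,
                                      method=cv2.RANSAC,
                                      ransacReprojThreshold=3.0)
print("estimated affine:\n", A_est)
print("inlier ratio:", inliers.mean())
```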

CVIU, 2012 (PDF, Abstract, Dataset)

CVPR, 2010 (PDF, Abstract)

IMAVIS, 2010 (PDF, Abstract)

Semantic segmentation of scenes

 

This paper proposes a semantic segmentation method for outdoor scenes captured by a surveillance camera. Our algorithm classifies each perceptually homogeneous region as one of the predefined classes learned from a collection of manually labelled images. The proposed approach combines two different types of information. First, color segmentation is performed to divide the scene into perceptually similar regions. The second step is then based on SIFT keypoints and uses a bag-of-words representation of the regions for classification. The prediction is done using a Naïve Bayesian network as a generative classifier. Compared to existing techniques, our method provides more compact representations of scene contents, and the segmentation result is more consistent with human perception due to the combination of color information with image keypoints. The experiments conducted on a publicly available dataset demonstrate the validity of the proposed method.
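
A rough scikit-learn sketch of the second step (bag of visual words plus a Naïve Bayes classifier); the descriptors, vocabulary size, and class count are illustrative stand-ins, and a multinomial Naïve Bayes replaces the paper's Bayesian network.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB

# Illustrative stand-ins: each region is a set of 128-D SIFT descriptors
# (here random); labels are the region classes from the labelled images.
rng = np.random.default_rng(2)
train_regions = [rng.normal(size=(rng.integers(20, 60), 128)) for _ in range(30)]
train_labels = rng.integers(0, 4, 30)  # e.g., road / grass / sky / building

# 1) Build a visual vocabulary by clustering all training descriptors.
vocab = KMeans(n_clusters=50, n_init=10, random_state=0)
vocab.fit(np.vstack(train_regions))

def bow_histogram(descriptors):
    """Bag-of-words histogram: count descriptors per visual word."""
    words = vocab.predict(descriptors)
    return np.bincount(words, minlength=vocab.n_clusters)

# 2) Train a (multinomial) Naive Bayes classifier on region histograms.
X = np.array([bow_histogram(r) for r in train_regions])
clf = MultinomialNB().fit(X, train_labels)

# 3) Classify a new color-segmented region from its SIFT descriptors.
new_region = rng.normal(size=(35, 128))
print("predicted class:", clf.predict([bow_histogram(new_region)])[0])
```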

Industry project with iWactLife Co.

ISIVC, 2016 (PDF)

Rat body temperature measurement and pose tracking
 

Some studies on epilepsy have shown that seizures may change the body temperature of a patient. Furthermore, other works have shown that kainic acid, a drug used to study seizures, modifies the body temperature of laboratory rats. Thus, thermographic cameras may have an important role in investigating seizures. In this paper, we present the methods we have developed to measure the temperature of a moving rat subject to seizures using a thermographic camera and image processing. To accurately measure the body temperature, a particle filter tracker has been developed and tested along with an experimental methodology. The obtained measurements are compared with ground truth. The methods are tested on a 2-hour video, and it is shown that our method achieves promising results.
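
A toy NumPy sketch of the particle filter idea (not the paper's tracker): particles follow a random-walk motion model and are weighted by thermal intensity, so they concentrate on the warm body. The frame, particle count, and motion noise are all illustrative.

```python
import numpy as np

def particle_filter_step(particles, weights, frame, motion_std=3.0):
    """One predict/weight/resample cycle of a minimal particle filter.

    particles: (N, 2) array of (row, col) hypotheses for the rat's position.
    frame: thermal image; warmer pixels have higher values.
    """
    n, (h, w) = len(particles), frame.shape
    # Predict: random-walk motion model.
    particles = particles + np.random.normal(0, motion_std, particles.shape)
    particles = np.clip(particles, 0, [h - 1, w - 1])
    # Weight: likelihood proportional to pixel intensity (warm body).
    idx = particles.astype(int)
    weights = frame[idx[:, 0], idx[:, 1]].astype(float) + 1e-9
    weights /= weights.sum()
    # Resample proportionally to the weights.
    keep = np.random.choice(n, n, p=weights)
    return particles[keep], np.full(n, 1.0 / n)

# Toy frame with a warm blob; the particle cloud converges on its location.
frame = np.zeros((120, 160))
frame[60:70, 80:95] = 35.0  # warm "rat" against a cooler background
particles = np.column_stack([np.random.uniform(0, 120, 500),
                             np.random.uniform(0, 160, 500)])
weights = np.full(500, 1 / 500)
for _ in range(10):
    particles, weights = particle_filter_step(particles, weights, frame)
print("estimated position:", particles.mean(axis=0).round(1))
```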

Research collaboration with St. Justine Children's Hospital

ISVC, 2008 (PDF, Abstract)

IPCV, 2010 (PDF)

MVAP, 2010 (PDF, Abstract)

 

Multiple human tracking

 

In this paper, we present a new multiple hypotheses tracking (MHT) approach. Our tracking method is suitable for online applications because it labels objects at every frame and estimates the best computed trajectories up to the current frame. In this work, we address the problems of object merging and splitting (occlusions) and object fragmentation. Object fragmentation resulting from imperfect background subtraction can easily be confused with objects splitting in a scene, especially in close-range surveillance applications. This problem is not addressed in most MHT methods. We propose an MHT framework that distinguishes fragmentation from splitting using their spatial and temporal characteristics, generating hypotheses only for splitting cases by using observations from later frames. This approach results in more accurate data association and a smaller hypothesis graph. Our tracking method is evaluated on various indoor videos.
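
The core intuition, that fragments stay close together or re-merge while genuinely splitting objects diverge, can be sketched as a simple rule over the blob centroids observed in the frames after the event; the threshold and window length below are illustrative, not the paper's.

```python
import numpy as np

def classify_split_event(track_a, track_b, min_separation=40.0):
    """Label a two-blob event as 'split' or 'fragmentation'.

    track_a, track_b: lists of (x, y) centroids observed over the frames
    *after* one blob became two. A real split shows blobs diverging over
    time, while fragmentation (broken background subtraction) keeps the
    pieces close together. Threshold and window length are illustrative.
    """
    d = [np.hypot(ax - bx, ay - by)
         for (ax, ay), (bx, by) in zip(track_a, track_b)]
    diverging = d[-1] > d[0]           # gap grows over the window
    separated = d[-1] > min_separation
    return "split" if (diverging and separated) else "fragmentation"

# Two people walking apart vs. one silhouette broken into two pieces.
people = ([(100 + 5 * t, 50) for t in range(10)],
          [(100 - 5 * t, 50) for t in range(10)])
fragments = ([(100 + (t % 2), 50) for t in range(10)],
             [(108, 52 + (t % 2)) for t in range(10)])
print(classify_split_event(*people))     # -> split
print(classify_split_event(*fragments))  # -> fragmentation
```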

CRV, 2009 (PDF, Abstract)

CRV, 2008 (PDF, Abstract)

Large Scale Movie Description Dataset for Video Annotation Research

In this project, we built a dataset for movie description, annotation, and search, using high-quality natural language phrases that describe the important visual elements for the visually impaired.

 

Using Descriptive Video Services to Create a Large Data Source for Video Annotation Research (PDF)

To get access to this dataset, click here.


Fine-Grained Action Classification and Highlighting Using LSTM With Attention

Inspired by recent advances in neural machine translation that jointly align and translate using encoder–decoder networks equipped with attention, we propose an attention-based LSTM model for human activity recognition. Our model jointly learns to classify actions and highlight the frames associated with the action by attending to salient visual information through a jointly learned soft-attention mechanism.
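
A minimal PyTorch sketch of this kind of model (layer sizes illustrative, not the paper's): an LSTM over frame features, a learned soft-attention score per time step, and a classifier over the attention-weighted summary; the attention weights double as frame highlights.

```python
import torch
import torch.nn as nn

class AttentionLSTMClassifier(nn.Module):
    """Frame features pass through an LSTM; a soft-attention layer scores
    every time step, and the attention-weighted sum of hidden states feeds
    the classifier. Reading out the weights "highlights" the frames most
    responsible for the decision."""
    def __init__(self, feat_dim=2048, hidden=512, n_classes=20):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)   # one relevance score per frame
        self.cls = nn.Linear(hidden, n_classes)

    def forward(self, frames):                      # (B, T, feat_dim)
        h, _ = self.lstm(frames)                    # (B, T, hidden)
        alpha = torch.softmax(self.attn(h), dim=1)  # (B, T, 1) frame weights
        context = (alpha * h).sum(dim=1)            # (B, hidden)
        return self.cls(context), alpha.squeeze(-1)

model = AttentionLSTMClassifier()
logits, frame_weights = model(torch.randn(2, 30, 2048))  # 30-frame clips
print(logits.shape, frame_weights.shape)  # (2, 20) and (2, 30)
```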

 

US Patent application, “Systems and Methods for Identifying Activities and/or Events in Media Contents Based on Object Data and Scene Data”

Action Classification and Highlighting in Videos (PDF)


Movie Annotation & Search With Natural Language

Learning a joint language-visual space has a number of very appealing applications, including (i) sentence-based annotation of images or video and (ii) retrieval of relevant visual content as specified by a natural language phrase. In this project, we train a language-visual space that, in addition, allows alignment between encoded parts of the phrase or sentence and the frames of the video. This approach has the added benefit of the language helping the model attend to relevant parts of the video, producing more accurate retrieval results as an outcome.
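
A compact PyTorch sketch of such a joint embedding trained with a margin ranking loss; the encoders and dimensions are illustrative simplifications, not the patented method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """A sentence encoder and a video encoder map both modalities into a
    shared space where matching pairs score higher (cosine similarity)
    than mismatched ones."""
    def __init__(self, word_dim=300, vid_dim=2048, joint_dim=512):
        super().__init__()
        self.txt = nn.GRU(word_dim, joint_dim, batch_first=True)
        self.vid = nn.Linear(vid_dim, joint_dim)

    def encode_text(self, words):          # (B, n_words, word_dim)
        _, h = self.txt(words)
        return F.normalize(h[-1], dim=-1)  # (B, joint_dim)

    def encode_video(self, clips):         # (B, T, vid_dim)
        return F.normalize(self.vid(clips.mean(dim=1)), dim=-1)

def ranking_loss(t, v, margin=0.2):
    """Push matched pairs above all mismatched pairs by `margin`."""
    sims = t @ v.t()                # (B, B) sentence-video similarities
    pos = sims.diag().unsqueeze(1)  # matched pairs on the diagonal
    cost = (margin + sims - pos).clamp(min=0)
    cost.fill_diagonal_(0)
    return cost.mean()

model = JointEmbedding()
t = model.encode_text(torch.randn(4, 12, 300))
v = model.encode_video(torch.randn(4, 20, 2048))
print("loss:", ranking_loss(t, v).item())
# Retrieval: rank all videos against a query sentence by t @ v.t().
```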

 

US Patent application, “Joint Heterogeneous Language-Vision Embeddings for Video Tagging and Search”

ECCV 2016 Workshop on Storytelling with Images and Videos, "Learning language-visual embedding for movie understanding with natural-language" (PDF)

Learning Knot Placement for Character Animation

Animators often craft motion by specifying a sparse set of values called "knots" for the animation variables. These knots are interpolated using a smooth spline curve to produce values for the animation variables at every frame of the animation. The placement of these knots is often semantically meaningful to the animators and allows the motion to be edited easily. Densely sampled animation variables, such as those resulting from motion capture (mocap), are not easily manipulated by artists. We present a neural network model that converts densely sampled motion data into semantically meaningful, sparse knot placements.

We use a deep learning model to learn knot placement from densely sampled motion values. Our model has three stacked components: an "Encoder", a feed-forward neural network that maps the skeleton body data into the space of the sequential modeling network; a "Sequential modeling" component, a Recurrent Neural Network (RNN) of the Long Short-Term Memory (LSTM) type, which models the sequence; and finally a "Decoder", another feed-forward neural network that decodes the RNN outputs into the final knot placement.
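
A minimal PyTorch sketch of this encoder / LSTM / decoder stack, emitting a per-frame "place a knot here" score; all layer sizes are illustrative, not those of the memo.

```python
import torch
import torch.nn as nn

class KnotPlacementNet(nn.Module):
    """Per-frame pose features are embedded by a feed-forward encoder,
    an LSTM models the sequence, and a decoder emits a knot-placement
    logit for every frame of the densely sampled motion."""
    def __init__(self, pose_dim=60, embed=128, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(pose_dim, embed), nn.ReLU())
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, 1)  # per-frame knot logit

    def forward(self, motion):               # (B, frames, pose_dim)
        x = self.encoder(motion)
        h, _ = self.lstm(x)
        return self.decoder(h).squeeze(-1)   # (B, frames)

net = KnotPlacementNet()
logits = net(torch.randn(1, 240, 60))        # 240 densely sampled frames
knot_frames = (torch.sigmoid(logits) > 0.5).nonzero()  # candidate knots
print(logits.shape)                          # torch.Size([1, 240])
```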

 

Pixar Technical Memo #17-01 

Describing Videos by Exploiting Temporal Structure

We propose an approach that successfully takes into account both the local and global temporal structure of videos to produce descriptions. First, our approach incorporates a spatio-temporal 3-D convolutional neural network (3-D CNN) representation of short temporal dynamics. The 3-D CNN representation is trained on video action recognition tasks so as to produce a representation that is tuned to human motion and behavior. Second, we propose a temporal attention mechanism that goes beyond local temporal modeling and learns to automatically select the most relevant temporal segments given the text-generating RNN. Our approach exceeds the current state of the art on both BLEU and METEOR metrics on the YouTube2Text dataset. We also present results on a new, larger and more challenging dataset of paired videos and natural language descriptions.
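
A bare-bones PyTorch sketch of the temporal attention step: at each word step, the decoder state scores every temporal segment feature, and the softmax-weighted sum of segment features conditions the next word (all dimensions illustrative).

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """At every word step the decoder state scores each temporal segment's
    (e.g., 3-D CNN) feature; a softmax-weighted sum of segment features
    becomes the context vector for generating the next word."""
    def __init__(self, feat_dim=500, dec_dim=512, attn_dim=256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim)
        self.w_dec = nn.Linear(dec_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, segments, dec_state):
        # segments: (B, n_segments, feat_dim); dec_state: (B, dec_dim)
        e = self.score(torch.tanh(self.w_feat(segments) +
                                  self.w_dec(dec_state).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)          # relevance per segment
        context = (alpha * segments).sum(dim=1)  # (B, feat_dim)
        return context, alpha.squeeze(-1)

attn = TemporalAttention()
context, alpha = attn(torch.randn(2, 26, 500), torch.randn(2, 512))
print(context.shape, alpha.shape)  # (2, 500) and (2, 26)
```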

Describing Videos by Exploiting Temporal Structure (PDF)
