Learning-based approach for vision problems
Learning-based techniques have seen more and more successful application in computer vision. "Learning for vision" is viewed as the next challenging frontier for computer vision. Technical challenges in applying learning-based methods in vision include picking the appropriate representation, model generalization and complexity. This dissertation investigated different vision problems together with the proposed learning algorithms for them. In particular, three vision problems are studied from low-level to high level: stereo, super-resolution and human detection. In the first part, we present a learning-based approach [73, 74]to address the visual correspondence problems when the stereo images have different intensity level. The algorithm first learns the matching behaviors of multiple local-window methods (called experts) using a simple histogram-based method. The learned behaviors are then integrated into a MAP-MRF depth estimation framework and the Metropolis-Hastings algorithm is used to find the MAP solution. Segmentation is also used to accelerate the computation and improve the performance. Qualitative and quantitative experimental results are presented, which demonstrate that, for stereo image pair having different intensity level, the proposed algorithm significantly outperforms the state-of-the-art methods. Using prior knowledge can significantly improve the performance of low-level image processing and vision problems. In the second part, we propose a learning-based approach [72, 71] for video super-resolution. The approach extends previous primal sketch image hallucination method via learning a scene-specific priors using examples. This is achieved by constructing training examples using the high resolution images captured by still camera and use that to increase the low resolution videos. As a result, information from cameras with different spatio-temporal resolutions is combined in our framework. In addition, we use conditional random field (CRF) to enforce smoothness constraint between adjacent super-resolved frames and the video super-resolution is posed as finding the high resolution video that maximize the conditional probability. Extensive experimental results demonstrate that our approach can produce high quality super-resolution videos. In the third part, we explore the problem of human detection and counting using supervised learning [70, 69]. We first propose a solution based on background subtraction and edge detection. A three layer neural network is trained with novel feature representation and used for online human counting. Since the neural network approach works by first segmenting the foreground region, it can not count the static people. To solve this problem, we further propose a detection based approach using Convolutional Neural Network (CNN). This approach applies the detector to every scale and position of the image and collect the total positive responses. Experimental results show that CNN works extremely well for videos where the resolution of human is very low.