
Computer Vision and Machine Learning

We use our hands to tie shoes, write, wash dishes, and play the piano: these tasks involve close and complex interaction with objects, and we can execute them even if a hand, arm, or finger is injured or not functioning properly. Our dexterity and sensory capabilities have developed through a long process of evolution; for tasks like these, however, sensing and dexterity must be coupled with efficient representations, planning, and execution.

Research in Computer Vision is directed towards technologies that support human visual perception and memory. Visual perception is central to our ability to interact with the surrounding world, and vision also plays a central role in forming memories of our daily life and of interactions with other people. The degradation of our ability to perceive and memorize visual stimuli is therefore a serious disadvantage, and the possibilities for compensating for this degradation have historically been very limited.

Developments in camera design, computing, and the understanding of how to build artificial visual systems that interpret and organize visual information are, however, changing this situation for the better. In the future, there is a definite possibility that artificial systems will be built that capture and process visual information with the same performance as the human visual system, and that organize the visual input into memories that can support degrading human perception and memory. When fully developed, such systems will serve as "cognitive prostheses", analogous to the way physical prostheses replace human body parts. Even today, systems can be built that aid in, e.g., reading and interpreting essential visual information in the environment.

RPL advances these systems towards fully autonomous visual information aids by:

  1. advancing the state of the art in automatic artificial visual processing, with special emphasis on visual data from systems that can be worn unobtrusively, and
  2. investigating methods for automatic visual memory selection from these systems, in order to compensate for failing human visual memory.

Our research in Computer Vision and Machine Learning concerns representation learning from different kinds of visual data.

One example concerns the utility of generic Deep Convolutional Network (ConvNet) visual representations. Given a ConvNet trained on a large-scale labeled data set, the activations of the feed-forward units at a certain layer can be used as a generic representation of a new input image for a target task, i.e. as a global image descriptor. We have been investigating different aspects of this common scenario in the context of transfer learning. Several factors affect transferability, including learning factors such as network design and the distribution of the training data, as well as post-learning factors such as the choice of layer in the trained ConvNet. By optimizing these factors, significant improvements can be achieved on various standard visual recognition tasks. We have also explored what information resides in such representations; interestingly, we found strong implicit spatial information, which was unexpected in a network trained for a classification problem. For results, see for example (Azizpour, Razavian, Sullivan, Maki and Carlsson, PAMI 2016).
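To make the scenario concrete, here is a minimal sketch of extracting such a generic descriptor. It is an illustration rather than the exact pipeline from the paper: it assumes a ResNet-50 pretrained on ImageNet from torchvision (the work above studied several network designs), and it takes the activations after the final pooling layer as the global image descriptor. The choice of layer is one of the post-learning factors discussed above.

```python
# Sketch: using an intermediate-layer activation of a pretrained ConvNet
# as a generic global image descriptor for a new target task.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# A ConvNet trained on a large-scale labeled dataset (ImageNet here;
# the network choice is an assumption for illustration).
net = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
net.eval()

# Keep everything up to the global average pool; drop the classification head.
feature_extractor = torch.nn.Sequential(*list(net.children())[:-1])

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def describe(image_path: str) -> torch.Tensor:
    """Return a fixed-length descriptor (2048-d for ResNet-50) for one image."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        f = feature_extractor(x)      # shape: (1, 2048, 1, 1)
    return f.flatten(1).squeeze(0)    # shape: (2048,)
```

The resulting descriptor can then feed a simple classifier, e.g. a linear SVM, trained on the labels of the target task.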

In another line of work (Zhang, Kjellström and Ek, ECCV 2016), we develop a factorized probabilistic latent variable representation that extends the LDA topic model. The structured representation leads to a model that marries benefits traditionally associated with a discriminative approach, such as feature selection, with those of a generative model, such as principled regularization and the ability to handle missing data. The factorization is provided by representing data in terms of aligned pairs of observations treated as different views. This makes it possible to separately model topics that exist in both views and topics that are unique to a single view. The structured consolidation allows for efficient and robust inference and yields a compact and efficient representation. Learning is performed in a Bayesian fashion by maximizing a rigorous lower bound on the log-likelihood.
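The following sketch illustrates only the generative factorization, not the paper's model or its Bayesian learning procedure: data arrives as aligned pairs of views, and the topics split into a set shared by both views plus sets private to each view. The vocabulary sizes, topic counts, and the 0.7/0.3 mixing weights are assumed values chosen for illustration.

```python
# Sketch: generating aligned two-view documents from shared and private topics.
import numpy as np

rng = np.random.default_rng(0)

V1, V2 = 500, 300          # vocabulary sizes of the two views (assumed)
K_shared, K_priv = 4, 2    # topics shared by both views vs. unique to one view

# Topic-word distributions: each shared topic has a version in each view's
# vocabulary; private topics exist in only one view.
phi1_shared = rng.dirichlet(np.ones(V1), K_shared)
phi2_shared = rng.dirichlet(np.ones(V2), K_shared)
phi1_priv = rng.dirichlet(np.ones(V1), K_priv)
phi2_priv = rng.dirichlet(np.ones(V2), K_priv)

def generate_pair(n_words=50):
    """Generate one aligned pair of documents (one per view)."""
    # Topic proportions: the shared part is common to both views,
    # the private parts are drawn independently per view.
    theta_shared = rng.dirichlet(np.ones(K_shared))
    theta1 = np.concatenate([0.7 * theta_shared, 0.3 * rng.dirichlet(np.ones(K_priv))])
    theta2 = np.concatenate([0.7 * theta_shared, 0.3 * rng.dirichlet(np.ones(K_priv))])

    phi1 = np.vstack([phi1_shared, phi1_priv])
    phi2 = np.vstack([phi2_shared, phi2_priv])

    # Standard LDA-style word generation within each view.
    z1 = rng.choice(len(theta1), size=n_words, p=theta1)
    z2 = rng.choice(len(theta2), size=n_words, p=theta2)
    w1 = np.array([rng.choice(V1, p=phi1[k]) for k in z1])
    w2 = np.array([rng.choice(V2, p=phi2[k]) for k in z2])
    return w1, w2

doc_view1, doc_view2 = generate_pair()
```

Inference in the actual model runs in the opposite direction: given the observed view pairs, the posterior over the shared and private topic assignments is approximated by maximizing a variational lower bound on the log-likelihood.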

Contact

For more information, please contact the involved faculty members: