Covariant and invariant deep networks

We study and develop deep networks that handle scaling transformations and other image transformations in a theoretically well-founded manner, preferably in terms of provable covariance and invariance properties.

Specifically, we study the ability of deep networks to generalize to previously unseen scales that are not spanned by the training data. For this purpose, we have developed two classes of scale-covariant and scale-invariant networks, based on either (i) applying multiple rescaled copies of continuous filters to the original input image or (ii) applying the same discrete deep network to multiple rescaled copies of the input image. In this way, the resulting provably scale-covariant and scale-invariant deep networks can handle objects of different sizes in the world and at different distances from the camera.

Gaussian derivative networks

According to the first approach, based on rescaling continuous models of image filters, we have developed scale-channel networks constructed by coupling parameterized linear combinations of Gaussian derivatives in cascade, with non-linear ReLU stages in between, followed by a final stage of max pooling over the different scale channels. Because the learned parameters in the linear combinations of Gaussian derivatives are shared between the scale channels, the raw scale channels are provably scale covariant. The output after max pooling over the scale channels is, in addition, provably scale invariant. Experimentally, we have demonstrated that the approach allows for scale generalization, with a good ability to classify image patterns at scales not present in the training data.

Lindeberg (2022) "Scale-covariant and scale-invariant Gaussian derivative networks", Journal of Mathematical Imaging and Vision, 64(3): 223-242.
Lindeberg (2021) "Scale-covariant and scale-invariant Gaussian derivative networks", Proc. SSVM 2021: Scale Space and Variational Methods in Computer Vision, Springer LNCS 12679: 3-14.
Perzanowski and Lindeberg (2025) "Scale generalisation properties of extended scale-covariant and scale-invariant Gaussian derivative networks on image datasets with spatial scaling variations", Journal of Mathematical Imaging and Vision, 67(3):29: 1-39.
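The scale covariance argument can be illustrated with a minimal 1D sketch. It is not the implementation from the cited papers. The kernels are simply sampled continuous Gaussian derivatives, which is only one of several possible discretizations. Every layer within a channel uses the same scale, and the weights in LAYER_WEIGHTS are hypothetical values standing in for learned parameters. All function names are illustrative. Because the weights are shared between channels and the derivatives are scale-normalized by sigma**order, rescaling the input by a factor of 2 should map scale channel c onto channel c + 1. The spatial maxima of those channels should therefore agree up to discretization effects.

```python
import math


def gaussian_derivative_kernel(sigma, order):
    """Sampled, scale-normalized Gaussian derivative kernel: sigma**order * d^order g / dx^order."""
    t = sigma * sigma
    radius = int(math.ceil(4.0 * sigma))
    kernel = []
    for x in range(-radius, radius + 1):
        g = math.exp(-x * x / (2.0 * t)) / math.sqrt(2.0 * math.pi * t)
        if order == 0:
            value = g
        elif order == 1:
            value = -x / t * g
        elif order == 2:
            value = (x * x / (t * t) - 1.0 / t) * g
        else:
            raise ValueError("this sketch only supports orders 0, 1 and 2")
        kernel.append(sigma ** order * value)
    return kernel


def convolve(signal, kernel):
    """Discrete convolution with zero padding; the output has the same length as the input."""
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j in range(-radius, radius + 1):
            if 0 <= i - j < n:
                acc += kernel[j + radius] * signal[i - j]
        out.append(acc)
    return out


def gaussian_derivative_layer(signal, sigma, weights):
    """ReLU of a weighted sum of scale-normalized Gaussian derivatives of orders 0, 1, 2."""
    kernels = [gaussian_derivative_kernel(sigma, order) for order in range(len(weights))]
    combined = [sum(w * k[idx] for w, k in zip(weights, kernels)) for idx in range(len(kernels[0]))]
    return [max(0.0, v) for v in convolve(signal, combined)]


# Hypothetical weights standing in for learned parameters, shared by all scale channels.
LAYER_WEIGHTS = [(0.1, 0.0, -1.0), (0.0, 0.0, -1.0)]


def scale_channel_maxima(signal, sigma0=2.0, ratio=2.0, num_channels=4):
    """Apply the same two-layer cascade at scales sigma0 * ratio**c; max-pool each channel over space."""
    maxima = []
    for c in range(num_channels):
        sigma = sigma0 * ratio ** c
        features = signal
        for weights in LAYER_WEIGHTS:
            features = gaussian_derivative_layer(features, sigma, weights)
        maxima.append(max(features))
    return maxima


def scale_invariant_output(signal):
    """Final max pooling over the scale channels."""
    return max(scale_channel_maxima(signal))


def blob(std, length=513):
    """Gaussian blob centred in a 1D signal; blob(2 * s) is blob(s) rescaled by a factor of 2."""
    center = length // 2
    return [math.exp(-((i - center) ** 2) / (2.0 * std * std)) for i in range(length)]


small = scale_channel_maxima(blob(3.0))
large = scale_channel_maxima(blob(6.0))
# Channel c on the small blob should match channel c + 1 on the large blob,
# up to discretization effects.
for c in range(3):
    print(c, round(small[c], 4), round(large[c + 1], 4))
```

Max pooling over the channels (scale_invariant_output) gives the same value for both inputs only if the overall maximum is attained in channels that both inputs share. With a finite set of channels, invariance therefore holds over the range of scales that the channels cover.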

Regarding the use of Gaussian derivative networks over a single scale channel, we have also shown that the receptive fields learned by the ConvNeXt architecture can be well modelled by idealized discrete scale-space filters. A notable result is that, for a reduced version of the ConvNeXt architecture, using a set of only 8 discrete scale-space filters leads to almost the same accuracy as using receptive fields trained from scratch on the ImageNet dataset:

Lindeberg, Babaiee and Kiasari (2025) "Modelling and analysis of the 8 filters from the 'master key filters hypothesis' for depthwise-separable deep networks in relation to idealized receptive fields based on scale-space theory", arXiv preprint arXiv:2509.12746.
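One standard family of idealized discrete scale-space filters combines two ingredients. The first is the discrete analogue of the Gaussian kernel, T(n; t) = exp(-t) I_n(t), where I_n is the modified Bessel function of integer order. The second is a set of central difference operators that serve as derivative approximations. The sketch below builds these kernels in plain Python. By construction, the smoothing kernel has unit sum and variance t, and it satisfies the semigroup property T(.; t1) * T(.; t2) = T(.; t1 + t2). The difference filters, in turn, sum to zero. The sketch makes no claim about how the 8 filters in the cited preprint are parameterized, and the function names are hypothetical.

```python
import math


def bessel_i(n, t, max_terms=200):
    """Modified Bessel function I_n(t) for integer n >= 0, summed from its power series."""
    term = (t / 2.0) ** n / math.factorial(n)
    total = term
    for k in range(max_terms):
        term *= (t / 2.0) ** 2 / ((k + 1) * (k + 1 + n))
        total += term
        if term <= 1e-17 * total:
            break
    return total


def discrete_gaussian(t, radius=None):
    """Discrete analogue of the Gaussian kernel, T(n; t) = exp(-t) * I_|n|(t), for -radius <= n <= radius."""
    if radius is None:
        radius = int(math.ceil(6.0 * math.sqrt(t))) + 10
    return [math.exp(-t) * bessel_i(abs(n), t) for n in range(-radius, radius + 1)]


def convolve_full(a, b):
    """Full discrete convolution of two finite kernels."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def first_difference(kernel):
    """Central difference (f[n+1] - f[n-1]) / 2; the result is one sample longer on each side."""
    return convolve_full(kernel, [0.5, 0.0, -0.5])


def second_difference(kernel):
    """Central second difference f[n+1] - 2 f[n] + f[n-1]; also one sample longer on each side."""
    return convolve_full(kernel, [1.0, -2.0, 1.0])


smoothing = discrete_gaussian(2.0)
filters = [smoothing, first_difference(smoothing), second_difference(smoothing)]
print([len(f) for f in filters])
print([round(sum(f), 12) for f in filters])
```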

Scale-invariant scale-channel networks

According to the second approach, we have developed a class of scale-channel networks based on applying the same discrete deep network to multiple rescaled copies of the input image, followed by max pooling or average pooling over the scale channels. For such networks, it is also possible to construct formal proofs of scale covariance and scale invariance properties. These foveated network architectures can handle scaling transformations between the training data and the test data over the range of scale factors for which there are supporting scale channels. In our experiments, we handle scaling factors up to 8:

Jansson and Lindeberg (2022) "Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales", Journal of Mathematical Imaging and Vision, 64(5): 506-536.
Jansson and Lindeberg (2021) "Exploring the ability of CNNs to generalise to previously unseen scales over wide scale ranges", Proc. International Conference on Pattern Recognition (ICPR 2020), pages 1181-1188, extended version in arXiv:2004.01536.
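The channel structure of such a network can be sketched in a few lines. In this toy 1D version, base_network is a hypothetical stand-in for a trained discrete network, consisting of a fixed filter, a ReLU and global max pooling. Rescaling is done by crude integer subsampling, chosen only because it makes the channel shift exact. A practical implementation would more likely rescale the image by interpolation with smoothing. If a pattern is enlarged by a factor of 2, its channel outputs are the original ones shifted by one channel.

```python
def base_network(signal):
    """Hypothetical stand-in for a trained discrete network:
    a fixed 3-tap filter, ReLU, and global max pooling over space."""
    kernel = (-1.0, 2.0, -1.0)
    n = len(signal)
    responses = []
    for i in range(n):
        left = signal[i - 1] if i > 0 else 0.0
        right = signal[i + 1] if i < n - 1 else 0.0
        responses.append(max(0.0, kernel[0] * left + kernel[1] * signal[i] + kernel[2] * right))
    return max(responses)


def rescale(signal, factor):
    """Crude rescaling by an integer factor: keep every factor-th sample."""
    return signal[::factor]


def scale_channel_outputs(signal, num_channels=4):
    """Apply the same base network to copies of the input rescaled by 1, 1/2, 1/4, ..."""
    return [base_network(rescale(signal, 2 ** k)) for k in range(num_channels)]


def scale_channel_network(signal, num_channels=4, pooling="max"):
    """Pool the channel outputs by max pooling or average pooling."""
    outputs = scale_channel_outputs(signal, num_channels)
    if pooling == "max":
        return max(outputs)
    return sum(outputs) / len(outputs)


pattern = [0.0, 1.0, 3.0, 1.0, 0.0, 2.0, 0.0, 0.0] * 4
upscaled = [pattern[i // 2] for i in range(2 * len(pattern))]  # same pattern, twice as large
print(scale_channel_outputs(pattern))
print(scale_channel_outputs(upscaled))  # the same values, shifted by one channel
```

With max pooling, the outputs for the two inputs coincide when the largest channel response is attained in channels that both inputs share. This corresponds to the range of scale factors covered by supporting scale channels.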

Characterization of invariance properties of spatial transformer networks

We have also performed an in-depth study of the ability of spatial transformer networks to support true invariance properties. First, we have shown that spatial transformers that apply purely spatial transformations to CNN feature maps do not support true invariance properties. Only spatial transformer networks that transform the input allow for true invariance properties. We have then performed a systematic study of how these properties affect classification performance. Specifically, we have investigated different spatial transformer network architectures that use more complex features for computing the image transformations that map the input data to a reference frame, and demonstrated that these new spatial transformer architectures lead to better experimental performance:

Finnveden, Jansson and Lindeberg (2021) "Understanding when spatial transformer networks do not support invariance, and what to do about it", Proc. International Conference on Pattern Recognition (ICPR 2020), pages 3427-3434, extended version in arXiv:2004.11678.
Jansson, Maydanskiy, Finnveden and Lindeberg (2020) "Inability of spatial transformations of CNN feature maps to support invariant recognition", arXiv preprint arXiv:2004.14716.
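A toy 1D example shows why transforming feature maps is not equivalent to transforming the input. It is only an illustration under simplifying assumptions, not the argument from the cited papers. A single 3-tap filter plays the role of the CNN, and scaling by a factor of 2 is nearest-neighbour upsampling, which subsampling inverts exactly. Undoing the scaling on the input reproduces the reference features. Undoing it on the feature map does not, because the fixed-size filter responds differently to the enlarged pattern. The function names are hypothetical.

```python
def conv3(signal, kernel=(-1, 2, -1)):
    """Same-size filtering with a symmetric 3-tap kernel and zero padding."""
    n = len(signal)
    out = []
    for i in range(n):
        left = signal[i - 1] if i > 0 else 0
        right = signal[i + 1] if i < n - 1 else 0
        out.append(kernel[0] * left + kernel[1] * signal[i] + kernel[2] * right)
    return out


def upscale(signal, factor):
    """Nearest-neighbour enlargement by an integer factor."""
    return [signal[i // factor] for i in range(factor * len(signal))]


def downscale(signal, factor):
    """Subsampling; exactly inverts upscale() for the same factor."""
    return signal[::factor]


reference = [0, 0, 0, 1, 0, 0, 0, 0]
observed = upscale(reference, 2)  # the same pattern, twice as large

features_ref = conv3(reference)
# Transformer acting on the input: undo the scaling first, then compute features.
input_aligned = conv3(downscale(observed, 2))
# Transformer acting on the feature map: compute features, then undo the scaling.
feature_aligned = downscale(conv3(observed), 2)

print(features_ref)     # [0, 0, -1, 2, -1, 0, 0, 0]
print(input_aligned)    # [0, 0, -1, 2, -1, 0, 0, 0]
print(feature_aligned)  # [0, 0, 0, 1, -1, 0, 0, 0]
```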

Scale-covariant biologically inspired hierarchical networks

In earlier studies, we have shown how it is generally possible to define provably scale-covariant hand-crafted networks by coupling scale-space operations in cascade. Specifically, we have studied a sub-class of such networks in more detail, also motivated by biological inspiration, by coupling models of complex cells in terms of quasi quadrature measures in cascade. Experimentally, we have evaluated these networks on the task of texture classification, with scaling transformations for scaling factors up to 4:

Lindeberg (2020) "Provably scale-covariant continuous hierarchical networks by coupling scale-normalized differential entities in cascade", Journal of Mathematical Imaging and Vision, 62(1): 120-148.
Lindeberg (2019) "Provably scale-covariant hierarchical networks by coupling quasi quadrature measures in cascade", Proc. SSVM 2019: Scale Space and Variational Methods in Computer Vision, Springer LNCS 11603: 328-340.
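Complex cells respond with little dependence on the local phase of the signal, and quasi quadrature measures are motivated by the same goal. For a sine wave, the Gaussian-smoothed derivatives have closed forms, so the phase dependence of a 1D measure Q = L_x^2 + C L_xx^2 can be checked exactly. This sketch uses derivatives normalized by powers of sigma and treats C as a free parameter. The cited papers may use a different scale normalization and a specific value of C, and the function names are hypothetical. For C = 1/(sigma omega)^2, Q becomes independent of the phase.

```python
import math


def smoothed_sine_derivatives(x, omega, sigma):
    """Closed-form scale-normalized derivatives of the Gaussian-smoothed signal sin(omega * x).

    Gaussian smoothing with variance sigma**2 multiplies a sine wave by exp(-omega**2 * sigma**2 / 2).
    """
    damping = math.exp(-omega ** 2 * sigma ** 2 / 2.0)
    l_x = sigma * omega * damping * math.cos(omega * x)           # sigma * dL/dx
    l_xx = -(sigma * omega) ** 2 * damping * math.sin(omega * x)  # sigma**2 * d2L/dx2
    return l_x, l_xx


def quasi_quadrature(x, omega, sigma, c):
    """Hypothetical 1D quasi quadrature measure Q = L_x^2 + C * L_xx^2 (normalized derivatives)."""
    l_x, l_xx = smoothed_sine_derivatives(x, omega, sigma)
    return l_x ** 2 + c * l_xx ** 2


def phase_spread(omega, sigma, c, samples=64):
    """Relative variation of Q over one period of the sine wave."""
    period = 2.0 * math.pi / omega
    values = [quasi_quadrature(k * period / samples, omega, sigma, c) for k in range(samples)]
    return (max(values) - min(values)) / max(values)


omega, sigma = 0.5, 2.0
matched_c = 1.0 / (sigma * omega) ** 2
print(phase_spread(omega, sigma, matched_c))      # close to 0: Q does not depend on the phase
print(phase_spread(omega, sigma, 4 * matched_c))  # about 0.75: strong phase dependence
```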

Covariant spatio-temporal receptive fields for spiking neural networks

As an extension of the above-mentioned provably scale-covariant deep networks for spatial image data, we have also proposed the use of covariant spatio-temporal receptive field models as a prior for spiking neural networks applied to video data. Specifically, the proposed network architectures extend the covariance properties under spatial scaling transformations to also comprise covariance properties under temporal scaling transformations, to handle spatio-temporal events that may occur either faster or slower relative to a reference view. Additionally, we have demonstrated that idealized receptive field models obtained from scale-space theory can be used as a prior to improve the training of spiking neural networks for event-based vision, which is otherwise known to be problematic.

Pedersen, Conradt and Lindeberg (2025) "Covariant spatio-temporal receptive fields for spiking neural networks", Nature Communications, 16:8231: 1-14.
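Temporal scale covariance can be illustrated with a time-causal smoothing primitive: a cascade of first-order integrators. Each stage corresponds to a truncated exponential kernel exp(-t/mu)/mu, which has the same form as a leaky integrator. This is a minimal sketch under that assumption and not the model from the cited paper. The pulse signals, time constants and function name are hypothetical. Suppose an event is played back twice as slowly and all time constants are doubled. The response at time 2t should then match the reference response at time t, up to discretization effects of order dt/mu.

```python
import math


def integrator_cascade(signal, dt, time_constants):
    """Cascade of first-order recursive (leaky-integrator) filters, one stage per time constant.

    Each stage approximates convolution with a truncated exponential kernel exp(-t/mu)/mu.
    """
    output = list(signal)
    for mu in time_constants:
        decay = math.exp(-dt / mu)
        state = 0.0
        smoothed = []
        for value in output:
            state = decay * state + (1.0 - decay) * value
            smoothed.append(state)
        output = smoothed
    return output


dt = 0.01
# Reference event: unit pulse on [5, 15) time units; the second event is the same pulse, twice as slow.
reference = [1.0 if 500 <= n < 1500 else 0.0 for n in range(10000)]
slowed = [1.0 if 1000 <= n < 3000 else 0.0 for n in range(20000)]

base_taus = [2.0, 4.0]
ref_response = integrator_cascade(reference, dt, base_taus)
slow_response = integrator_cascade(slowed, dt, [2.0 * mu for mu in base_taus])

# Temporal covariance: the slowed response at time 2t should match the reference response at time t.
max_error = max(abs(slow_response[2 * n] - ref_response[n]) for n in range(len(ref_response)))
print(max(ref_response), max_error)
```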

Tony Lindeberg,
Professor
tony@kth.se
+46 8 790 62 05
