Andrey Kuznetsov

2024

Text-to-image (T2I) diffusion models are popular for introducing image manipulation methods, such as editing, image fusion, inpainting, etc. At the same time, image-to-video (I2V) and text-to-video (T2V) models are also built on top of T2I models. We present Kandinsky 3, a novel T2I model based on latent diffusion, achieving a high level of quality and photorealism. The key feature of the new architecture is the simplicity and efficiency of its adaptation for many types of generation tasks. We extend the base T2I model for various applications and create a multifunctional generation system that includes text-guided inpainting/outpainting, image fusion, text-image fusion, image variations generation, I2V and T2V generation. We also present a distilled version of the T2I model, evaluating inference in 4 steps of the reverse process without reducing image quality and 3 times faster than the base model. We deployed a user-friendly demo system in which all the features can be tested in the public domain. Additionally, we released the source code and checkpoints for the Kandinsky 3 and extended models. Human evaluations show that Kandinsky 3 demonstrates one of the highest quality scores among open source generation systems.

pdf abs
The Shape of Learning: Anisotropy and Intrinsic Dimensions in Transformer-Based Models
Anton Razzhigaev | Matvey Mikhalchuk | Elizaveta Goncharova | Ivan Oseledets | Denis Dimitrov | Andrey Kuznetsov
Findings of the Association for Computational Linguistics: EACL 2024

In this study, we present an investigation into the anisotropy dynamics and intrinsic dimension of embeddings in transformer architectures, focusing on the dichotomy between encoders and decoders. Our findings reveal that the anisotropy profile in transformer decoders exhibits a distinct bell-shaped curve, with the highest anisotropy concentrations in the middle layers. This pattern diverges from the more uniformly distributed anisotropy observed in encoders. In addition, we found that the intrinsic dimension of embeddings increases in the initial phases of training, indicating an expansion into higher-dimensional space. This fact is then followed by a compression phase towards the end of training with dimensionality decrease, suggesting a refinement into more compact representations. Our results provide fresh insights to the understanding of encoders and decoders embedding properties.

We introduce OmniDialog — the first trimodal comprehensive benchmark grounded in a knowledge graph (Wikidata) to evaluate the generalization of Large Multimodal Models (LMMs) across three modalities. Our benchmark consists of more than 4,000 dialogues, each averaging 10 turns, all annotated and cross-validated by human experts. The dialogues in our dataset are designed to prevent shortcut learning by incorporating various formats and misleading or irrelevant multimodal cues. We also evaluate both multimodal and unimodal models to gain insights into how they process modality inputs introduced in the conversation.

This paper reveals a novel linear characteristic exclusive to transformer decoders, including models like GPT, LLaMA, OPT, BLOOM and others. We analyze embedding transformations between sequential layers, uncovering an almost perfect linear relationship (Procrustes similarity score of 0.99). However, linearity decreases when the residual component is removed, due to a consistently low transformer layer output norm. Our experiments show that pruning or linearly approximating some of the layers does not impact loss or model performance significantly. Moreover, we introduce a cosine-similarity-based regularization in our pretraining experiments on smaller models, aimed at reducing layer linearity. This regularization not only improves performance metrics on benchmarks like Tiny Stories and SuperGLUE but as well successfully decreases the linearity of the models. This study challenges the existing understanding of transformer architectures, suggesting that their operation may be more linear than previously assumed.

2023

Text-to-image generation is a significant domain in modern computer vision and achieved substantial improvements through the evolution of generative architectures. Among these, diffusion-based models demonstrated essential quality enhancements. These models generally split into two categories: pixel-level and latent-level approaches. We present Kandinsky – a novel exploration of latent diffusion architecture, combining the principles of image prior models with latent diffusion techniques. The image prior model, is trained separately to map CLIP text and image embeddings. Another distinct feature of the proposed model is the modified MoVQ implementation, which serves as the image autoencoder component. Overall the designed model contains 3.3B parameters. We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation and text-guided inpainting/outpainting. Additionally we released the source code and checkpoints for Kandinsky models. Experimental evaluations demonstrate FID score of 8.03 on the COCO-30K dataset, marking our model as the top open source performer in terms of measurable image generation quality.

2022

pdf abs
Pixel-Level BPE for Auto-Regressive Image Generation
Anton Razzhigaev | Anton Voronov | Andrey Kaznacheev | Andrey Kuznetsov | Denis Dimitrov | Alexander Panchenko
Proceedings of the First Workshop on Performance and Interpretability Evaluations of Multimodal, Multipurpose, Massive-Scale Models

Pixel-level autoregression with Transformer models (Image GPT or iGPT) is one of the recent approaches to image generation that has not received massive attention and elaboration due to quadratic complexity of attention as it imposes huge memory requirements and thus restricts the resolution of the generated images. In this paper, we propose to tackle this problem by adopting Byte-Pair-Encoding (BPE) originally proposed for text processing to the image domain to drastically reduce the length of the modeled sequence. The obtained results demonstrate that it is possible to decrease the amount of computation required to generate images pixel-by-pixel while preserving their quality and the expressiveness of the features extracted from the model. Our results show that there is room for improvement for iGPT-like models with more thorough research on the way to the optimal sequence encoding techniques for images.