JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation

Yiyang Ma 1,2, Xingchao Liu 1,†, Xiaokang Chen 1,†, Wen Liu 1,†, Chenyue Wu 1,3, Zhiyu Wu 1,2, Zizheng Pan 1, Zhenda Xie 1, Haowei Zhang 1, Xingkai Yu 1, Liang Zhao 1, Yisong Wang 1,4, Jiaying Liu 2, Chong Ruan 1,‡

1 DeepSeek-AI  2 Peking University  3 The University of Hong Kong  4 Tsinghua University
† Equal contribution  ‡ Corresponding author
Project Page: https://github.com/deepseek-ai/Janus

Abstract

We present JanusFlow , a powerful framework that unifies image understanding and generation in a single model. JanusFlow introduces a minimalist architecture that integrates autoregressive language models with rectified flow, a state-of-the-art method in generative modeling. Our key finding demonstrates that rectified flow can be straightforwardly trained within the large language model framework, eliminating the need for complex architectural modifications. To further improve the performance of our unified model, we adopt two key strategies: (i) decoupling the understanding and generation encoders, and (ii) aligning their representations during unified training. Extensive experiments show that JanusFlow achieves comparable or superior performance to specialized models in their respective domains, while significantly outperforming existing unified approaches across standard benchmarks. This work represents a step toward more efficient and versatile vision-language models.

1. Introduction

Large language models (LLMs) have demonstrated remarkable capabilities in learning diverse knowledge and generalizing to new scenarios [1, 7, 8, 68, 88]. Leveraging these capabilities, researchers have developed sophisticated models specialized in image comprehension [2, 15, 47, 49, 56, 57] and text-to-image generation [23, 71, 74, 77].

The field has recently shifted toward creating unified systems capable of handling both tasks simultaneously. One prominent direction involves utilizing pre-trained text-to-image models for high-quality generation while training LLMs to generate conditions for these models [19, 25-27, 84]. However, this approach introduces architectural complexity and potentially constrains the model's capabilities through maintaining separate LLM and generative components. Alternative approaches [85, 93, 95, 96, 103] propose training a single LLM for both tasks, typically incorporating either diffusion models [32, 80] or vector-quantized autoregressive models [22, 83].

Our approach builds upon recent breakthroughs in rectified flow models [3, 23, 55, 60, 61], which provide a simple framework for generative modeling while delivering exceptional empirical performance [23, 36, 45]. Building on these advances, we propose JanusFlow, a powerful unified multimodal model that seamlessly integrates rectified flow with LLM architecture. Following a minimalist design principle, our architecture requires only a lightweight encoder and decoder to adapt the LLM for rectified flow operations. To optimize JanusFlow's performance, we implement two key strategies: First, we maintain separate vision encoders for understanding and generation tasks, preventing task interference and thus enhancing comprehension capabilities. Second, we align the intermediate representations between generation and understanding modules during training, strengthening semantic coherence in the generation process.

JanusFlow achieves state-of-the-art performance in both multimodal comprehension and text-to-image generation compared to existing unified approaches, and even outperforms several specialized methods. Specifically, on the text-to-image generation benchmarks MJHQ FID-30k [48], GenEval [28], and DPG-Bench [34], JanusFlow achieves scores of 9.51, 0.63, and 80.09%, respectively, surpassing established text-to-image models including SDv1.5 [75] and SDXL [71]. On multimodal comprehension benchmarks, JanusFlow attains scores of 74.9, 70.5, and 60.3 on MMBench [62], SeedBench [46], and GQA [35], respectively, exceeding specialized models such as LLaVA-v1.5 [56] and Qwen-VL-Chat [4]. Notably, these results are achieved with a compact LLM architecture of only 1.3B parameters.

            2. Related Work

            Visual Generation with Flow-based Generative Models. Recent years have witnessed remarkable progress in visual generation through diffusion models [32, 80], leading to impressive models like [66, 71, 74-77]. Building on these advances, flow-based generative models [3, 55, 60] emerged as a simplified alternative framework. These approaches have recently enabled advanced visual generation models [23, 36] that achieve superior empirical performance with faster sampling. Our work demonstrates that rectified flow [59-61] can be effectively integrated into LLMs, creating unified models that excel in both understanding and generation tasks.

Figure 1 | Multimodal understanding and image generation with JanusFlow. (a) Benchmark performances: JanusFlow surpasses state-of-the-art unified multimodal models and several task-specific understanding models on visual understanding benchmarks. (b) Visual generation results: the model is also capable of generating high-quality images. The resolution of the images is 384 × 384.

Unified Models For Understanding and Generation. The development of multimodal large language models (MLLMs) has enabled effective integration of text and visual information. Building upon powerful LLMs [7, 88, 89], recent MLLMs [2, 15, 49, 56, 57, 63] have demonstrated exceptional multimodal understanding capabilities. Current research increasingly focuses on architectures that can simultaneously handle visual understanding and generation tasks. One approach extends MLLMs with pre-trained diffusion models [19, 25-27, 84, 97]. However, these systems essentially utilize diffusion models as external tools, where the MLLM generates conditions for image generation without possessing direct generative capabilities. This separation often results in suboptimal performance compared to standalone diffusion models [25, 84]. Another line of work [85, 93, 95, 96, 103] aims to train a single LLM for both tasks. Many of these methods employ vector-quantization [22, 83] to convert images into discrete tokens, enabling unified autoregressive processing [85, 93]. While straightforward to implement, these approaches are inherently limited by their image tokenization quality.

                      Our work focuses on developing unified models that combine autoregressive capabilities with flow/diffusion models, leveraging their proven effectiveness in visual generation. Compared to similar approaches [96, 103], JanusFlow offers three key advantages: (i) a simple yet effective generation process using rectified flow, (ii) enhanced performance through decoupled vision encoders that resolve inter-task conflicts, and (iii) improved generation quality through representation alignment regularization, enabled by our decoupled encoder design.

                      3. JanusFlow

                      In this section, we introduce the architecture of JanusFlow and our training strategies.

                      3.1. Background

                      Multimodal LLMs. Given a dataset D containing discrete token sequences, each of which can be formulated as 𝑥 = ( 𝑥 1, · · · , 𝑥 ℓ ) , large language models (LLMs) are trained to model the sequence distribution in an autoregressive manner,

\log P_{\theta_{LLM}}(x) = \sum_{i=0}^{\ell - 1} \log P_{\theta_{LLM}}(x_{i+1} \mid x_1, \ldots, x_i),    (1)

                      where 𝜃 𝐿𝐿𝑀 denotes the parameters of the LLM and ℓ is the sequence length. After being trained on large-scale datasets, LLMs exhibit the ability to generalize across various tasks and follow diverse instructions [1, 8, 68]. To extend these models to handle visual inputs, LLMs are augmented with vision encoders [2, 56, 57]. For instance, LLaVA [57] integrates an LLM with a pre-trained CLIP [73] image encoder via a projection layer, transforming the extracted image features into a joint embedding space that the LLM can process as word embeddings. By leveraging large-scale multimodal datasets and increasingly powerful LLMs, this architecture has facilitated the development of advanced multimodal models capable of addressing a wide range of vision-language tasks [4, 47, 56, 63].
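To make Eq. (1) concrete, the following PyTorch sketch (illustrative only; the tensor shapes and toy inputs are our assumptions, not the authors' code) computes the autoregressive log-likelihood of a token sequence from a model's logits by shifting predictions and targets by one position.

```python
import torch
import torch.nn.functional as F

def sequence_log_likelihood(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Compute sum_i log P(x_{i+1} | x_1..x_i) as in Eq. (1).

    logits: [batch, seq_len, vocab] -- model outputs at every position.
    tokens: [batch, seq_len]        -- token ids x_1..x_l.
    """
    # Position i predicts token i+1, so drop the last logit and the first token.
    pred_logits = logits[:, :-1, :]                 # [batch, seq_len-1, vocab]
    targets = tokens[:, 1:]                         # [batch, seq_len-1]
    log_probs = F.log_softmax(pred_logits, dim=-1)
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_log_probs.sum(dim=-1)              # [batch]

# Toy usage with random numbers standing in for a real LLM.
vocab, batch, seq_len = 100, 2, 8
logits = torch.randn(batch, seq_len, vocab)
tokens = torch.randint(0, vocab, (batch, seq_len))
print(sequence_log_likelihood(logits, tokens))
```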

                      Rectified Flow. For a dataset D consisting of continuous 𝑑 -dimensional data points 𝑥 = ( 𝑥 1, · · · , 𝑥 𝑑 ) drawn from an unknown data distribution 𝜋 1, rectified flow [55, 60] models the data distribution by learning an ordinary differential equation (ODE) defined over time 𝑡 ∈ [ 0, 1 ] :

\frac{\mathrm{d} z_t}{\mathrm{d} t} = v_{\theta_{NN}}(z_t, t), \quad z_0 \sim \pi_0,    (2)

Figure 2 | Architecture of the proposed JanusFlow. For visual understanding, the LLM performs autoregressive next-token prediction to generate responses. For image generation, the LLM generates images with rectified flow: starting from Gaussian noise at 𝑡 = 0, it iteratively updates 𝑧 𝑡 by predicting velocity vectors until reaching 𝑡 = 1. We omit the VAE encoder, the skip connection leveraged in generation, and the linear layer after 𝑓 𝑒𝑛𝑐 for simplicity.

                                where 𝜃 𝑁𝑁 represents the parameters of the velocity neural network and 𝜋 0 is a simple distribution, typically standard Gaussian noise N( 0, 𝐼 ) . The network is trained by minimizing the Euclidean distance between the neural velocity and the directions of linear paths connecting random points from 𝜋 0 and 𝜋 1,

\min_{\theta}\; \mathbb{E}_{t \sim P(t),\, z_0 \sim \pi_0,\, x \sim \pi_1}\left[ \left\| v_{\theta_{NN}}(z_t, t) - (x - z_0) \right\|^2 \right], \quad \text{where } z_t = t x + (1 - t) z_0.    (3)

Here, P(t) is a distribution over time t ∈ [0, 1]. When the network has sufficient capacity and the objective is perfectly minimized, the optimal velocity field v_{\theta^*_{NN}} maps the elementary distribution π_0 to the true data distribution π_1. More precisely, the distribution of z_1 = z_0 + \int_0^1 v_{\theta^*_{NN}}(z_t, t)\,\mathrm{d}t, with z_0 ∼ π_0, follows π_1. Despite its conceptual simplicity, rectified flow has shown superior performance in various generative modeling tasks, including text-to-image generation [23], audio generation [40] and biological structure generation [38].
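A minimal training step for the objective in Eq. (3), sketched in PyTorch under our own simplifying assumptions (a toy MLP velocity network and a uniform P(t) stand in for the full model and the actual time distribution used in the paper):

```python
import torch
import torch.nn as nn

dim = 16
velocity_net = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))
optimizer = torch.optim.Adam(velocity_net.parameters(), lr=1e-3)

def rectified_flow_step(x: torch.Tensor) -> torch.Tensor:
    """One training step of Eq. (3) on a batch x ~ pi_1 of shape [batch, dim]."""
    batch = x.shape[0]
    t = torch.rand(batch, 1)                 # t ~ P(t), here uniform on [0, 1]
    z0 = torch.randn_like(x)                 # z_0 ~ N(0, I)
    zt = t * x + (1.0 - t) * z0              # point on the straight path from z_0 to x
    target = x - z0                          # constant velocity of that straight path
    v = velocity_net(torch.cat([zt, t], dim=-1))
    loss = ((v - target) ** 2).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss

for _ in range(5):                           # toy data drawn from a shifted Gaussian
    rectified_flow_step(torch.randn(32, dim) + 3.0)
```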

                                3.2. A Unified Framework for Multimodal Understanding and Generation

                                JanusFlow presents a unified framework designed to address both vision understanding and image generation tasks. Next we outline how JanusFlow handles these two tasks within a single LLM architecture.

                                Multimodal Understanding. In multimodal understanding tasks, the LLM processes an input sequence consisting of interleaved text and image data. The text is tokenized into discrete tokens, each of which is transformed into an embedding of dimension 𝐷 𝑒𝑚𝑏 . For the images, an image encoder 𝑓 𝑒𝑛𝑐 encodes each image 𝑥 𝑖𝑚 into a feature map of shape 𝐻 𝑖𝑚 × 𝑊 𝑖𝑚 × 𝐷 𝑒𝑛𝑐 . This feature map is flattened and projected through a linear transformation layer into a sequence of embeddings with shape 𝐻 𝑖𝑚 𝑊 𝑖𝑚 × 𝐷 𝑒𝑚𝑏 . 𝐻 𝑖𝑚 and 𝑊 𝑖𝑚 are determined by the image encoder. The text and image embeddings are concatenated to form the input sequence to the LLM, which then autoregressively predicts the next tokens based on the input sequence of embeddings. According to common practice [85, 93, 96], we add special token |BOI| before the image and |EOI| after the image to help the model locate the image embeddings in the sequence.
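The sketch below illustrates this input construction for understanding tasks. The dimensions, special-token ids, and the placement of the image at the end of the prompt are hypothetical simplifications; only the overall flow (encode, flatten, project, wrap with |BOI|/|EOI|, concatenate with text embeddings) follows the description above.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; actual values depend on the SigLIP encoder and the LLM.
D_enc, D_emb, H_im, W_im, vocab = 1024, 2048, 24, 24, 32000

embed = nn.Embedding(vocab, D_emb)           # text token embeddings
proj = nn.Linear(D_enc, D_emb)               # linear layer after f_enc
BOI, EOI = 1, 2                              # placeholder ids for the special tokens

def build_understanding_sequence(text_ids: torch.Tensor,
                                 image_features: torch.Tensor) -> torch.Tensor:
    """Concatenate text embeddings with projected image features.

    text_ids:       [T] token ids of the prompt text.
    image_features: [H_im, W_im, D_enc] output of the understanding encoder f_enc.
    Returns:        [T + 2 + H_im*W_im, D_emb] input embeddings for the LLM.
    """
    img_tokens = proj(image_features.reshape(-1, D_enc))     # [H_im*W_im, D_emb]
    boi = embed(torch.tensor([BOI]))
    eoi = embed(torch.tensor([EOI]))
    txt = embed(text_ids)
    # |BOI| image embeddings |EOI| are spliced into the token sequence.
    return torch.cat([txt, boi, img_tokens, eoi], dim=0)

seq = build_understanding_sequence(torch.randint(3, vocab, (10,)),
                                   torch.randn(H_im, W_im, D_enc))
print(seq.shape)
```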

                                          Image Generation. For image generation, our LLM takes a text sequence 𝑥 𝑐𝑜𝑛 as condition and generates a corresponding image using rectified flow. To improve computational efficiency, generation occurs in the latent space using a pre-trained SDXL-VAE [71].

                                          The generation process begins by sampling Gaussian noise 𝑧 0 of shape 𝐻 𝑙𝑎𝑡𝑒𝑛𝑡 × 𝑊 𝑙𝑎𝑡𝑒𝑛𝑡 × 𝐷 𝑙𝑎𝑡𝑒𝑛𝑡 in the latent space, which is then processed by a generation encoder 𝑔 𝑒𝑛𝑐 into a sequence of embeddings 𝐻 𝑔𝑒𝑛 𝑊 𝑔𝑒𝑛 × 𝐷 𝑒𝑚𝑏 . This sequence is concatenated with a time embedding representing the current time step 𝑡 ( 𝑡 = 0 at the beginning), resulting in a sequence of length 𝐻 𝑔𝑒𝑛 𝑊 𝑔𝑒𝑛 + 1. Unlike previous approaches that employ various attention masking strategies [96, 103], we found that causal attention suffices, as our preliminary experiments showed no performance benefits from alternative masking schemes. The LLM's output corresponding to 𝑧 0 is transformed back into the latent space by a generation decoder 𝑔 𝑑𝑒𝑐 , producing a velocity vector of shape 𝐻 𝑙𝑎𝑡𝑒𝑛𝑡 × 𝑊 𝑙𝑎𝑡𝑒𝑛𝑡 × 𝐷 𝑙𝑎𝑡𝑒𝑛𝑡 . The state is updated by a standard Euler solver,

z_{t + \mathrm{d}t} = z_t + v(z_t, t)\, \mathrm{d}t,    (4)

                                          where d 𝑡 is a user-defined step size. We replace 𝑧 0 with 𝑧 d 𝑡 on the input and iterate the process until we get 𝑧 1, which is then decoded into the final image by the VAE decoder. To enhance generation quality, we employ classifier-free guidance (CFG) when computing the velocity:

v(z_t, t) = w\, v(z_t, t \mid x^{con}) + (1 - w)\, v(z_t, t \mid \varnothing),    (5)

where 𝑣 ( 𝑧 𝑡 , 𝑡 | ∅) denotes the velocity inferred without text conditioning and 𝑤 ⩾ 1 controls the magnitude of CFG. Empirically, increasing 𝑤 yields higher semantic alignment [23, 61, 71, 75]. Analogous to multimodal understanding, we prepend the special token |BOI| to indicate the start of image generation in the sequence.
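Putting Eqs. (4) and (5) together, inference reduces to a short Euler loop with classifier-free guidance. The sketch below is schematic: `velocity_fn` stands in for the full pipeline (generation encoder, LLM, generation decoder), and the dummy velocity field, prompt, and latent shape are placeholders.

```python
import torch

def sample_with_cfg(velocity_fn, cond, shape, num_steps: int = 30, w: float = 2.0):
    """Euler integration of Eqs. (4)-(5).

    velocity_fn(z_t, t, cond) -> velocity with the same shape as z_t;
    passing cond=None stands for the unconditional branch v(z_t, t | null).
    The step size is dt = 1 / num_steps and w is the CFG factor.
    """
    z = torch.randn(shape)                       # z_0 ~ N(0, I) in the latent space
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), i * dt)
        v_cond = velocity_fn(z, t, cond)
        v_uncond = velocity_fn(z, t, None)
        v = w * v_cond + (1.0 - w) * v_uncond    # classifier-free guidance, Eq. (5)
        z = z + v * dt                           # Euler update, Eq. (4)
    return z                                     # z_1, to be decoded by the VAE decoder

# Toy velocity field standing in for the LLM + generation decoder.
dummy_v = lambda z, t, cond: -z if cond is None else torch.ones_like(z) - z
latents = sample_with_cfg(dummy_v, cond="a corgi nebula", shape=(1, 4, 48, 48))
print(latents.shape)
```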

                                          Decoupling Encoders for the Two Tasks. Previous approaches that unify autoregressive generation and diffusion models within a joint LLM training framework [96, 103] employ identical encoders ( 𝑓 𝑒𝑛𝑐 and 𝑔 𝑒𝑛𝑐 ) for both understanding and generation tasks. For instance, Zhou et al. [103] performs both tasks in the same VAE latent space using a shared U-Net or linear encoder, while Xie et al. [96] leverages MAGVIT-v2 [98] to encode image patches into discrete tokens for both tasks.

However, recent work on unified autoregressive models has shown this shared encoder design to be suboptimal [93], particularly in models that generate images through autoregression on vector-quantized tokens. Drawing from these insights, JanusFlow adopts a decoupled encoder design. Specifically, we employ a pre-trained SigLIP-Large-Patch/16 [102] model as 𝑓 𝑒𝑛𝑐 to extract semantic continuous features for multimodal understanding, while using separate ConvNeXt blocks [92] initialized from scratch as 𝑔 𝑒𝑛𝑐 and 𝑔 𝑑𝑒𝑐 for generation, chosen for their effectiveness. Following established practices [5, 14, 90], we incorporate a long skip connection between 𝑔 𝑒𝑛𝑐 and 𝑔 𝑑𝑒𝑐 . Our controlled experiments in Sec. 4.5 demonstrate that this decoupled encoder design significantly improves the performance of our unified model. The complete architecture of JanusFlow is illustrated in Fig. 2.

                                          3.3. Training Schemes

                                          As illustrated in Fig. 3, we train our model in three sequential stages, detailed below.

                                          Stage 1: Adaptation of Randomly Initialized Components. In the first stage, we focus on training only the randomly initialized components: the linear layers, generation encoder, and

                                                    Figure 3 | Three training stages of JanusFlow. The trainable modules are marked with flame and the frozen modules are marked with snowflakes.

                                                    generation decoder. This stage serves to adapt these new modules to work effectively with the pre-trained LLM and SigLIP encoder, essentially functioning as an initialization phase for the newly introduced components.

                                                    Stage 2: Unified Pre-Training. Following the adaptation stage, we train the entire model except for the visual encoder, consistent with previous approaches [57, 63]. The training incorporates three data types: multimodal understanding, image generation, and text-only data. We initially allocate a higher proportion of multimodal understanding data to establish the model's understanding capabilities. Subsequently, we increase the ratio of image generation data to accommodate the convergence requirements of diffusion-based models [18, 70].

Stage 3: Supervised Fine-Tuning (SFT). In the final stage, we fine-tune the pre-trained model using instruction tuning data, which comprises dialogues, task-specific conversations, and high-quality text-conditioned image generation examples. During this stage, we also unfreeze the SigLIP encoder parameters [63, 87, 93]. This fine-tuning process enables the model to effectively respond to user instructions for both multimodal understanding and image generation tasks.
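As a rough illustration of the three-stage schedule, the helper below toggles `requires_grad` per stage; the module names (`und_enc`, `und_proj`, `gen_enc`, `gen_dec`) are hypothetical and only mirror the components described above.

```python
import torch.nn as nn

def set_trainable_modules(model: nn.Module, stage: int) -> None:
    """Freeze/unfreeze parameters per training stage (a sketch; module names are assumed).

    Stage 1: only the randomly initialized adaptors (linear layers, g_enc, g_dec).
    Stage 2: everything except the SigLIP understanding encoder f_enc.
    Stage 3: all parameters, including f_enc.
    """
    stage1_new = ("und_proj", "gen_enc", "gen_dec")   # hypothetical attribute names
    for name, param in model.named_parameters():
        if stage == 1:
            param.requires_grad = name.startswith(stage1_new)
        elif stage == 2:
            param.requires_grad = not name.startswith("und_enc")
        else:
            param.requires_grad = True

# Toy model with the assumed sub-modules.
model = nn.Module()
model.und_enc = nn.Linear(8, 8); model.und_proj = nn.Linear(8, 8)
model.llm = nn.Linear(8, 8); model.gen_enc = nn.Linear(8, 8); model.gen_dec = nn.Linear(8, 8)
set_trainable_modules(model, stage=1)
print([n for n, p in model.named_parameters() if p.requires_grad])
```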

                                                    3.4. Training Objective

Training JanusFlow involves two types of data, multimodal understanding data and image generation data. Both types of data contain two parts: 'condition' and 'response'. 'Condition' refers to the prompting of the tasks (e.g., text prompts in the task of generation and images in the task of understanding), while 'response' refers to the corresponding responses of the two tasks. The data can be formatted as 𝑥 = ( 𝑥 𝑐𝑜𝑛 , 𝑥 𝑟𝑒𝑠 ) , where the superscript 𝑐𝑜𝑛 denotes 'condition' and 𝑟𝑒𝑠 denotes 'response'. We denote the length of the whole sequence 𝑥 as ℓ , the length of 𝑥 𝑐𝑜𝑛 as ℓ 𝑐𝑜𝑛 and the length of 𝑥 𝑟𝑒𝑠 as ℓ 𝑟𝑒𝑠 . We use 𝜃 to represent the collection of all the trainable parameters in JanusFlow, including the LLM, 𝑓 𝑒𝑛𝑐 , 𝑔 𝑒𝑛𝑐 , 𝑔 𝑑𝑒𝑐 and the linear transformation layers.

Autoregression Objective. For multimodal understanding tasks, 𝑥 𝑟𝑒𝑠 contains only text tokens. JanusFlow is trained using the maximum likelihood principle,

\mathcal{L}_{AR}(\theta) = -\mathbb{E}_{x \sim \mathcal{D}_{und}}\left[ \sum_{i = \ell_{con}}^{\ell - 1} \log P_{\theta}(x_{i+1} \mid x_1, \ldots, x_i) \right],    (6)

                                                              where the expectation is taken over all ( 𝑥 𝑐𝑜𝑛 , 𝑥 𝑟𝑒𝑠 ) pairs in our multimodal understanding dataset D 𝑢𝑛𝑑 , computing loss only over tokens in 𝑥 𝑟𝑒𝑠 .
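In practice, Eq. (6) is an ordinary next-token cross-entropy with the condition positions masked out. A possible implementation (shapes and the masking convention are our assumptions):

```python
import torch
import torch.nn.functional as F

def understanding_loss(logits: torch.Tensor, tokens: torch.Tensor, cond_len: int) -> torch.Tensor:
    """Eq. (6): next-token cross-entropy computed only over the response tokens.

    logits:   [seq_len, vocab] LLM outputs for one sequence x = (x_con, x_res).
    tokens:   [seq_len]        token ids of the full sequence.
    cond_len: length of the condition x_con; those positions are excluded from the loss.
    """
    pred = logits[:-1]                      # position i predicts token i+1
    target = tokens[1:].clone()
    target[: cond_len - 1] = -100           # mask out condition tokens from the loss
    return F.cross_entropy(pred, target, ignore_index=-100)

loss = understanding_loss(torch.randn(12, 100), torch.randint(0, 100, (12,)), cond_len=5)
print(loss)
```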

                                                              Rectified Flow Objective. For image generation tasks, 𝑥 𝑐𝑜𝑛 consists of text tokens and 𝑥 𝑟𝑒𝑠 is the corresponding image. JanusFlow is trained with the rectified flow objective,

\mathcal{L}_{RF}(\theta) = \mathbb{E}_{x \sim \mathcal{D}_{gen},\, t \sim P(t),\, z_0 \sim \mathcal{N}(0, I)}\left[ \left\| v_{\theta}(z_t, t \mid x^{con}) - (x^{res} - z_0) \right\|^2 \right],    (7)

where z_t = t x^{res} + (1 - t) z_0. Following Stable Diffusion 3 [23], we set the time distribution P(t) to the logit-normal distribution. To enable CFG inference, we randomly drop 10% of the text prompts in training.
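A sketch of how Eq. (7) with a logit-normal P(t) and 10% prompt dropout might look in code; the stand-in velocity function, the dropout mechanism via `cond=None`, the logit-normal parameters, and the mean reduction are illustrative assumptions rather than the released implementation.

```python
import torch

def sample_logit_normal_t(batch: int, mean: float = 0.0, std: float = 1.0) -> torch.Tensor:
    """Logit-normal time distribution P(t): t = sigmoid(u), u ~ N(mean, std^2)."""
    return torch.sigmoid(mean + std * torch.randn(batch))

def generation_loss(velocity_fn, x_res: torch.Tensor, cond, drop_prob: float = 0.1):
    """Eq. (7) for one image latent x_res of shape [C, H, W] (illustrative sketch)."""
    t = sample_logit_normal_t(1)
    z0 = torch.randn_like(x_res)
    zt = t * x_res + (1.0 - t) * z0
    # Randomly drop the text prompt so the model also learns v(z_t, t | null) for CFG.
    if torch.rand(()) < drop_prob:
        cond = None
    v = velocity_fn(zt, t, cond)
    return ((v - (x_res - z0)) ** 2).mean()     # squared error averaged over latent elements

dummy_v = lambda z, t, cond: torch.zeros_like(z)
print(generation_loss(dummy_v, torch.randn(4, 48, 48), cond="a mountain village"))
```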

                                                              Representation Alignment Regularization. Recent work [99] has shown that aligning intermediate representations between diffusion transformers and semantic vision encoders enhances diffusion model generalization. Our decoupled vision encoder design enables efficient implementation of this alignment as a regularization term. Specifically, for generation tasks, we align features from the understanding encoder 𝑓 𝑒𝑛𝑐 with the LLM's intermediate features,

\mathcal{L}_{REPA}(\theta, \varphi) = -\mathbb{E}_{x \sim \mathcal{D}_{gen}}\left[ \mathrm{sim}\left( \mathrm{stop\_grad}(f_{enc}(x^{res})),\; h_{\varphi}(q_{\theta}(z_t)) \right) \right],    (8)

                                                              where 𝑞 𝜃 ( 𝑧 𝑡 ) denotes an intermediate LLM representation given input 𝑧 𝑡 , and ℎ 𝜑 is a small trainable MLP that projects 𝑞 𝜃 ( 𝑧 𝑡 ) to dimension 𝐷 𝑒𝑛𝑐 . The function sim (· , ·) computes the mean of element-wise cosine similarity between embeddings. Before computing the loss, we reshape ℎ 𝜑 ( 𝑞 𝜃 ( 𝑧 𝑡 )) to 𝐻 𝑔𝑒𝑛 × 𝑊 𝑔𝑒𝑛 × 𝐷 𝑒𝑛𝑐 . To simplify the implementation, we intentionally adjust the configuration of 𝑔 𝑒𝑛𝑐 and 𝑔 𝑑𝑒𝑐 to ensure 𝐻 𝑔𝑒𝑛 = 𝐻 𝑖𝑚 and 𝑊 𝑔𝑒𝑛 = 𝑊 𝑖𝑚 . The gradient of L 𝑅𝐸𝑃𝐴 is not back-propagated through the understanding encoder. This alignment loss helps the LLM's internal feature space (given noisy input 𝑧 𝑡 ) align with the understanding encoder's semantic feature space, thereby improving generation quality when producing images from new random noise and text conditions during inference.
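The alignment term in Eq. (8) can be sketched as follows; the feature dimensions, the exact form of the MLP h_φ, and the flattened layout of the image positions are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D_llm, D_enc, H, W = 2048, 1024, 24, 24
h_phi = nn.Sequential(nn.Linear(D_llm, D_llm), nn.SiLU(),
                      nn.Linear(D_llm, D_llm), nn.SiLU(),
                      nn.Linear(D_llm, D_enc))          # small trainable three-layer MLP

def repa_loss(q_theta: torch.Tensor, f_enc_feat: torch.Tensor) -> torch.Tensor:
    """Eq. (8): negative mean cosine similarity between projected LLM features and
    (stop-gradient) understanding-encoder features.

    q_theta:    [H*W, D_llm]  intermediate LLM features at the image positions of z_t.
    f_enc_feat: [H, W, D_enc] features of the clean image x_res from f_enc.
    """
    target = f_enc_feat.detach().reshape(-1, D_enc)     # stop_grad(f_enc(x_res))
    pred = h_phi(q_theta)                               # project q_theta(z_t) to D_enc
    cos = F.cosine_similarity(pred, target, dim=-1)     # per-position cosine similarity
    return -cos.mean()

loss = repa_loss(torch.randn(H * W, D_llm), torch.randn(H, W, D_enc))
print(loss)
```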

                                                              Summary. All three objectives are applied across all training stages. Multimodal understanding tasks use L 𝐴𝑅 , while image generation tasks employ the combined loss L 𝑅𝐹 +L 𝑅𝐸𝑃𝐴 . Detailed experimental settings are provided in Sec. 4.1.

                                                                        4. Experiments

                                                                        We conduct extensive experiments to evaluate the capabilities of JanusFlow in both multimodal understanding and generation tasks. First, we describe our experimental setup and implementation details. Then, we present results on standard benchmarks for multimodal understanding and image generation. Finally, we perform ablation studies to validate our key design choices.

                                                                        4.1. Experiment Setup and Implementation Details

Our framework builds upon an enhanced version of DeepSeek-LLM (1.3B) [7, 63]. The LLM consists of 24 transformer blocks and supports a sequence length of 4,096. In our model, both understanding and generation use images at a resolution of 384 × 384.

                                                                        For multimodal understanding, we leverage SigLIP-Large-Patch/16 [102] as 𝑓 𝑒𝑛𝑐 . For image generation, we utilize the pre-trained SDXL-VAE [71] for its latent space. The generation encoder 𝑔 𝑒𝑛𝑐 comprises a 2 × 2 patchify layer followed by two ConvNeXt [92] blocks and a linear layer. The generation decoder 𝑔 𝑑𝑒𝑐 combines two ConvNeXt blocks, a pixel-shuffle layer to upsample the feature map, and a linear layer. Our SigLIP encoder contains ∼ 300M parameters. 𝑔 𝑒𝑛𝑐 and 𝑔 𝑑𝑒𝑐 are light-weight modules, containing ∼ 70M parameters in total. Table 1 details the hyperparameters for each training stage. In the alignment regularization, we use the LLM features after the 6th block as 𝑞 𝜃 ( 𝑧 𝑡 ) and a three-layer MLP as ℎ 𝜑 . We employ an exponential moving average (EMA) with a ratio of 0.99 to ensure training stability.
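For intuition about the shapes involved, the sketch below mirrors the described g_enc / g_dec structure (2 × 2 patchify, two blocks, and a linear projection on the way in; blocks, pixel-shuffle upsampling, and a linear projection on the way out). The channel widths, the simplified residual block standing in for a real ConvNeXt block, the omitted long skip connection, and the assumed LLM width are all illustrative, not the actual configuration.

```python
import torch
import torch.nn as nn

D_latent, D_emb = 4, 2048            # SDXL-VAE latent channels; assumed LLM hidden width

class SimpleBlock(nn.Module):
    """Simplified residual conv block standing in for a ConvNeXt block."""
    def __init__(self, ch: int):
        super().__init__()
        self.dw = nn.Conv2d(ch, ch, kernel_size=7, padding=3, groups=ch)
        self.pw = nn.Sequential(nn.Conv2d(ch, 4 * ch, 1), nn.GELU(), nn.Conv2d(4 * ch, ch, 1))
    def forward(self, x):
        return x + self.pw(self.dw(x))

class GenEncoder(nn.Module):
    """Latent z_t -> sequence of LLM embeddings: 2x2 patchify, blocks, linear."""
    def __init__(self, width: int = 256):
        super().__init__()
        self.patchify = nn.Conv2d(D_latent, width, kernel_size=2, stride=2)
        self.blocks = nn.Sequential(SimpleBlock(width), SimpleBlock(width))
        self.proj = nn.Linear(width, D_emb)
    def forward(self, z):                                # z: [B, D_latent, H, W]
        x = self.blocks(self.patchify(z))                # [B, width, H/2, W/2]
        return self.proj(x.flatten(2).transpose(1, 2))   # [B, (H/2)*(W/2), D_emb]

class GenDecoder(nn.Module):
    """LLM outputs -> velocity in latent space: blocks, pixel-shuffle upsample, linear."""
    def __init__(self, width: int = 256, h: int = 24, w: int = 24):
        super().__init__()
        self.width, self.h, self.w = width, h, w
        self.proj_in = nn.Linear(D_emb, width)
        self.blocks = nn.Sequential(SimpleBlock(width), SimpleBlock(width))
        self.up = nn.PixelShuffle(2)                     # width -> width/4 channels, 2x spatial
        self.proj_out = nn.Conv2d(width // 4, D_latent, 1)
    def forward(self, tokens):                           # tokens: [B, h*w, D_emb]
        x = self.proj_in(tokens).transpose(1, 2).reshape(-1, self.width, self.h, self.w)
        return self.proj_out(self.up(self.blocks(x)))    # [B, D_latent, 2h, 2w]

z = torch.randn(1, D_latent, 48, 48)                     # a 384x384 image gives a 48x48 VAE latent
tokens = GenEncoder()(z)
velocity = GenDecoder()(tokens)
print(tokens.shape, velocity.shape)                      # [1, 576, 2048], [1, 4, 48, 48]
```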

For data preprocessing, we handle understanding and generation data differently. For understanding tasks, we preserve all image information by resizing the long side to the target size and padding the image to a square. For generation tasks, we resize the short side to the target size and apply random square cropping to avoid padding artifacts. During training, multiple sequences are packed into a single sequence of length 4,096 for training efficiency. Our implementation is based on the HAI-LLM platform [31] using PyTorch [72]. Training was conducted on NVIDIA A100 GPUs, with each model requiring ~1,600 A100 GPU days.
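A simple PIL-based sketch of the two preprocessing paths described above (the padding color and interpolation details are assumptions):

```python
import random
from PIL import Image

def preprocess_understanding(img: Image.Image, size: int = 384) -> Image.Image:
    """Keep all image content: resize the long side to `size`, then pad to a square."""
    w, h = img.size
    scale = size / max(w, h)
    img = img.resize((max(1, round(w * scale)), max(1, round(h * scale))))
    canvas = Image.new("RGB", (size, size), (127, 127, 127))   # padding color is an assumption
    canvas.paste(img, ((size - img.width) // 2, (size - img.height) // 2))
    return canvas

def preprocess_generation(img: Image.Image, size: int = 384) -> Image.Image:
    """Avoid padding artifacts: resize the short side to `size`, then random square crop."""
    w, h = img.size
    scale = size / min(w, h)
    img = img.resize((max(size, round(w * scale)), max(size, round(h * scale))))
    left = random.randint(0, img.width - size)
    top = random.randint(0, img.height - size)
    return img.crop((left, top, left + size, top + size))

print(preprocess_understanding(Image.new("RGB", (640, 360))).size)   # (384, 384)
print(preprocess_generation(Image.new("RGB", (640, 360))).size)      # (384, 384)
```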

                                                                        4.2. Training Data Settings

                                                                        We follow Janus [93] to construct the training data. The data configuration for each training stage is listed below.

Data for Stage 1 and Stage 2. The first two stages of our framework use three types of data: multimodal understanding data, image generation data and text-only data.

• 1. Multimodal Understanding Data. This type of data contains several sub-categories: (a) Image caption data. We incorporate caption datasets from [20, 41, 50, 51, 53, 79] and generate additional captions for images from [16, 43] using open-source multimodal understanding models. The data follows template formats, e.g., 'Generate the caption of this picture.'. (b) Charts and tables. We directly adopt the chart and table data from the training data of DeepSeek-VL [63]. (c) Task data. ShareGPT4V [11] data is utilized to facilitate basic question-answering capabilities during pre-training, structured as ' '. (d) Interleaved text-image data. This sub-category is sourced from [42, 81].
• 2. Image Generation Data. Our image generation dataset combines high-quality images from [16, 21, 41, 43, 67, 69, 79, 82] and 2 million in-house samples. We enhance them with machine-generated captions using multimodal understanding models. We filter the images in [16, 79] by aspect ratio and aesthetic score, retaining approximately 20% of the original datasets. 25% of the data contains single-sentence captions; this kind of data helps the model handle short prompts. All the data points are formatted as ' '.
• 3. Text-Only Data. We directly use the text corpus of DeepSeek-LLM [7].

                                                                                  Data for Stage 3. The SFT stage also uses three types of data:

                                                                                  • 1. Multimodal Instruction Data. We leverage the instruction tuning datasets from [29, 33, 35, 47, 64, 78].
                                                                                  • 2. Image Generation Data. We reformat the high-quality text-image pairs from [16, 79, 82] into an instruction format: ' User: Assistant: '.
                                                                                  • 3. Text-Only Data. We directly incorporate the text-only data from [47].

                                                                                  4.3. Evaluation Settings

                                                                                  Image Generation. We evaluate the generated images using both visual quality and semantic accuracy metrics. For visual quality assessment, we employ the Fréchet Inception Distance [30] (FID) metric and compute FID between 30,000 generated images and their corresponding reference images from the MJHQ dataset [48]. The FID computation follows the implementation from GigaGAN [39]. To evaluate semantic accuracy, we utilize two specialized frameworks: GenEval [28] and DPG-Bench [34]. These frameworks are designed to assess whether the generated images accurately contain the objects and relationships specified in the input prompts, providing a broad evaluation of the generation capabilities.

                                                                                            Multimodal Understanding. We evaluate JanusFlow's multimodal understanding abilities across a diverse set of vision-language benchmarks for general understanding capabilities, including POPE [52], MME [24], MMBench [62], SEEDBench [46], VQAv2 [29], GQA [35], MMVet [100], and MMMU [101].

                                                                                            4.4. Quantitative Results

Image Generation Performances. We report performance on GenEval, DPG-Bench and MJHQ FID-30k. In Tab. 2, we give comparisons on GenEval, including the scores of all the sub-tasks and the overall score. JanusFlow achieves an overall score of 0.63, surpassing previous unified frameworks and several generation-specific models, including SDXL [71] and DALL-E 2 [74]. In Tab. 3, we show results on DPG-Bench and the corresponding comparisons. Note that all the methods in Tab. 3 are generation-specific models except ours. The results on GenEval and DPG-Bench demonstrate the instruction-following ability of our model. We give the comparisons on MJHQ FID-30k in Tab. 4. The images sampled to calculate FID are generated with a CFG factor 𝑤 = 2 and 30 sampling steps; we sweep the CFG factor and the sampling steps and provide the results in the appendix. Our method achieves the best performance among all the models with a 1.3B LLM. The results show that rectified flow improves the quality of generated images over autoregressive models such as Janus [93].

Multimodal Understanding Performances. In Tab. 5, we compare our method with other approaches, including understanding-specific models and unified understanding-and-generation models. Our model achieves the best performance among all models with a similar number of parameters and even surpasses multiple larger understanding-specific methods. These results demonstrate that our method harmonizes the autoregressive LLM and rectified flow, achieving strong performance in both understanding and generation.

                                                                                                      4.5. Ablation Studies

We conduct comprehensive ablation studies to validate the effectiveness of our key design choices. For computational efficiency, all ablation experiments are performed on 256 × 256 resolution images. All models are trained on our unified pre-training dataset for 50,000 iterations, except for the understanding-only and generation-only variants, which are trained for proportionally fewer iterations based on their respective data ratios in the pre-training phase. The quantitative results of these ablation studies are presented in Tab. 6.

Impact of Representation Alignment. The comparison between Exp. A and F demonstrates the significant benefits of incorporating representation alignment regularization [99] during training. Specifically, models trained with representation alignment show notably lower FID scores on the MJHQ dataset and higher CLIP scores, indicating simultaneous improvements in both image quality and semantic alignment. Importantly, our architecture differs from the networks previously examined in [99] (e.g., [65, 70]) due to our incorporation of an LLM and an additional skip connection between 𝑔 𝑒𝑛𝑐 and 𝑔 𝑑𝑒𝑐 . The effectiveness of representation alignment in our modified architecture suggests its broad applicability and generalization capability across different network structures.


                                                                                                                Figure 4 | Image generation results of JanusFlow. Our model can generate high-quality images that are semantically consistent with text prompts.
                                                                                                                A corgi's head depicted as an explosion of a nebula, with vibrant cosmic colors like deep purples, blues, and pinks swirling around. The corgi's fur blends seamlessly into the nebula, with stars and galaxies forming the texture of its fur. Bright bursts of light emanate from its eyes, and faint constellations can be seen in the background, giving the image a surreal, otherworldly feel.

                                                                                                                Beautiful surreal symbolism the mesmerizing vision of a Cleopatra Queen of Egypt, mesmerizing brown eyes, black hair and ethereal features, radiating celestial aura, super high definition, true lifelike color, perfect exposure, razor sharp focus, golden ratio, soft reflections, bokeh effect, fine art photography, cinematic compositing, authentic, professional.

                                                                                                                A lone figure in dark robes ascends worn stone steps toward a glowing light in an ancient temple entrance. Ornate arches, lush greenery, and intricate carvings adorn the scene, evoking a mystical, high-fantasy atmosphere reminiscent of works by artists like Randy Vargas, with cinematic lighting and epic storytelling.

Impact of Decoupling Visual Encoders. The comparison among Exp. B, C, and F demonstrates the advantages of using separate visual encoders for understanding and generation tasks, as well as the efficacy of powerful pre-trained visual encoders in multimodal understanding. In Exp. B, following a design similar to Transfusion [103], we implement shared ConvNeXt blocks in the SDXL-VAE latent space for both understanding and generation encoders. Exp. C employs separate encoders with identical architectures and initialization parameters, but trained independently. The performance differences between these configurations validate the necessity of decoupled visual encoders in improving our unified model's capabilities. Moreover, the superior results in Exp. C and F highlight the benefits of leveraging pre-trained semantic visual encoders for multimodal understanding tasks.

Fair Comparison with Understanding / Generation-Only Models. To establish meaningful benchmarks, we evaluate task-specific models trained under identical conditions, using the same pre-training dataset, infrastructure, and hyperparameters. Exp. D and E represent these specialized models, trained with data volumes matching the unified models in Tab. 6. The minimal performance gap between Exp. F and these task-specific baselines demonstrates that our unified framework successfully integrates understanding and generation capabilities without significant compromise in either task's performance.

                                                                                                                          Figure 5 | Visual Understanding with JanusFlow. Our model effectively handles various visual understanding tasks, such as question answering, plot interpretation and object counting.
                                                                                                                          User: What are the kinds of fruits in this picture?
JanusFlow (Ours): The fruits in the picture are banana, strawberry, mango, persimmon, blueberry, and lime.

User: Describe this image.

JanusFlow (Ours): Kid's Favourite Fruits is a bar graph. The x-axis shows the fruits. The y-axis shows the number of kids who like the fruit. The orange is the most popular fruit. The banana is the least popular fruit.

User: How many penguins in the picture?

JanusFlow (Ours): There are 11 penguins in the picture.


                                                                                                                          4.6. Qualitative Results

We present qualitative evaluations of our method for both image generation and understanding tasks. Fig. 1(b) and Fig. 4 showcase the image generation capabilities of JanusFlow. These results demonstrate both the high visual quality of our generated images and our framework's ability to faithfully follow diverse instructions. For multimodal understanding, Fig. 5 presents example conversations that show our model's understanding capabilities across various scenarios. These interactions demonstrate the model's ability to understand and reason about visual content in natural language dialogues. Additional qualitative examples showcasing the versatility and effectiveness of JanusFlow are provided in the appendix.

                                                                                                                          5. Conclusion

                                                                                                                          We present JanusFlow, a unified framework that successfully harmonizes autoregressive and rectified flow models for multimodal understanding and generation tasks. Our extensive experiments demonstrate that this unification achieves comparable performance to task-specific models. The successful integration of these fundamentally different model architectures not only addresses current challenges in multimodal learning but also opens new possibilities for future research in training unified models.

                                                                                                                                    References

                                                                                                                                                                                                          Appendix

                                                                                                                                                                                                          A. Performance Analysis of 256 Resolution Model

We trained our model at two resolutions: 256 × 256 and 384 × 384. The main paper presents results from the 384 × 384 model as our primary results. Here, we provide a comprehensive evaluation of the 256 × 256 model's performance. The visual understanding performances are presented in Tab. 1. The generation capabilities are evaluated using GenEval [28], DPG-Bench [34], and MJHQ FID-30k [48], with results shown in Tab. 2 and 3.

As expected, the 256 × 256 model shows slightly lower performance compared to the 384 × 384 model on visual understanding metrics due to its reduced resolution. Interestingly, however, the 256 × 256 model outperforms its higher-resolution counterpart on GenEval and DPG-Bench, benchmarks specifically designed to evaluate instruction-following capabilities and semantic accuracy. This superior performance on semantic tasks can be attributed to the model's better control over lower-resolution images, where reduced visual complexity allows for more precise semantic manipulation.

                                                                                                                                                                                                          B. Analysis of CFG Factor and Sampling Steps

                                                                                                                                                                                                          We investigate the impact of two key generation parameters: the Classifier-Free Guidance (CFG) factor and the number of sampling steps. While our main results use 𝑤 = 2 for CFG and 30 sampling steps to calculate FID, here we present a comprehensive analysis of these hyperparameters. Fig. 1(a) shows the effect of varying CFG factors while maintaining 30 sampling steps. The results reveal an optimal CFG value for FID scores, while CLIP [73] similarity continues to improve with increasing CFG values, consistent with findings from previous work [71]. Fig. 1(b) demonstrates the impact of different sampling steps while maintaining a CFG factor of 2. The number of sampling steps shows relatively minor influence on performance. Our choice of 30 steps in the main paper represents a balance between generation quality and computational efficiency.

Figure 1 | Results of varying CFG factors and numbers of sampling steps. (a) Results of varying CFG factors; the number of sampling steps is set to 30. (b) Results of varying numbers of sampling steps; the CFG factor is set to 2.

                                                                                                                                                                                                                    C. Additional Qualitative Results

                                                                                                                                                                                                                    User: Who is this in the figure?

                                                                                                                                                                                                                    JanusFlow: George W. Bush.

                                                                                                                                                                                                                    User: Explain why this image is funny.

                                                                                                                                                                                                                    JanusFlow: This image is likely to be a humorous or artistic rendition of the famous painting "The Mona Lisa" with a dog instead of a human. The dog's expression and pose are similar to the iconic pose of the original painting, which adds a playful and unexpected twist to the artwork.

                                                                                                                                                                                                                    User: What are the texts in this passage?

                                                                                                                                                                                                                    JanusFlow: To increase the accessibility of this powerful model class and at the same time reduce its significant resource consumption, a method is needed that reduces the computational complexity for both training and sampling. Reducing the computational demands of DMs without impairing their performance is, therefore, key to enhance their accessibility.

                                                                                                                                                                                                                    Figure 2 | More multimodal understanding cases.

                                                                                                                                                                                                                    Additional qualitative examples for both understanding and generation tasks are presented in Fig. 2 and Fig. 3, respectively. The understanding examples demonstrate JanusFlow's diverse capabilities, including code generation, person identification, character recognition, and visual reasoning. For image generation, our model exhibits strong performance in both visual quality and semantic alignment with input prompts.

                                                                                                                                                                                                                              Figure 3 | More text-to-image generation results.
                                                                                                                                                                                                                              Massive cathedral church, battle between Heaven and hell, church on fire, 8k hyper real ultra sharp renaissance by Francisco Goya.

                                                                                                                                                                                                                              A handsome 24-year-old boy in the middle with sky color background wearing eye glasses, it's super detailed with anime style.

                                                                                                                                                                                                                              Happy dreamy owl monster sitting on a tree branch, colorful glittering particles, forest background, detailed feathers.
                                                                                                                                                                                                                              A man wearing Fedora hat with mafia style, realistic photography, intricate details, magical lighting, vibrant background, complex textures, rich colors, realistic style, front-facing view.

                                                                                                                                                                                                                              A vivid depiction of the Northern Lights dancing above the snow-covered mountains in Iceland, casting a mesmerizing glow across the sky.

A dark, high-contrast render of a psychedelic Tree of Life glowing brilliantly, illuminating swirling dust particles in a mystical, cavernous setting.

The image features a mushroom growing on grassy ground amidst fallen leaves. Their caps are light brownish-white with visible gills underneath; the stems appear dark and sturdy. In the background, there's an out-of-focus scene that includes greenery and possibly some structures or trees shrouded by mist or fog, giving it a serene yet slightly eerie atmosphere. This photograph employs shallow depth of field to emphasize the mushrooms while blurring the surroundings for artistic effect.

                                                                                                                                                                                                                              The image captures a vast ocean view at either sunrise or sunset, with soft pink hues near the horizon blending into darker clouds above. Waves crash against rugged black rocks on the right, where water flows down onto smaller stones below. In the foreground, dry grass contrasts with the smooth sea surface. The scene feels tranquil but also reveals the raw power of nature through the interaction between the dynamic waves and the solid land.

                                                                                                                                                                                                                              A serene Chinese ink painting depicts a tranquil mountain village. Simple homes nestle at the foot of misty peaks, while a gentle river winds through the village. Bamboo and pine trees dot the landscape. The minimalist brushstrokes reflect a harmonious relationship between nature and human life, capturing the peaceful essence of the scene with elegant simplicity.

                                                                                                                                                                                                                                        Figure 4 | The FID and CLIP similarity during the first 50,000 iterations.

                                                                                                                                                                                                                                        D. Details of REPA Ablation

We provide the FID and CLIP similarity over the first 50,000 training iterations of the pre-training stage in Fig. 4, with and without representation alignment regularization. The gap between the two models demonstrates the benefits of using representation alignment regularization.
