Figure 1 | Multimodal understanding and visual generation results from our Janus-Pro. For multimodal understanding, we average the accuracy of POPE, MME-Perception, GQA, and MMMU. The scores of MME-Perception are divided by 20 to scale to [0, 100]. For visual generation, we evaluate the performance on two instruction-following benchmarks, GenEval and DPG-Bench. Overall, Janus-Pro outperforms the previous state-of-the-art unified multimodal models as well as some task-specific models. Best viewed on screen.

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan
DeepSeek-AI

Project Page: https://github.com/deepseek-ai/Janus

Abstract

In this work, we introduce Janus-Pro , an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.

1. Introduction

(a) Average performance on four multimodal understanding benchmarks. (b) Performance on instruction-following benchmarks for text-to-image generation.

Figure 2 | Comparison of text-to-image generation between Janus-Pro and its predecessor, Janus. Janus-Pro delivers more stable outputs for short prompts, with improved visual quality, richer details, and the ability to generate simple text. The image resolution is 384 × 384. Best viewed on screen.
      Recent advancements in unified multimodal understanding and generation models have demonstrated significant progress [30, 40, 45, 46, 48, 50, 54, 55]. These approaches have been proven to enhance the instruction-following capabilities in visual generation tasks while reducing model redundancy. Most of these methods utilize the same visual encoder to process inputs for both multimodal understanding and generation tasks. Since the representations required for these two tasks differ, this often results in suboptimal performance in multimodal understanding. To address this issue, Janus [46] proposes decoupling visual encoding, which alleviates the conflict between multimodal understanding and generation tasks, achieving excellent performance in both tasks.

As a pioneering model, Janus is validated at the 1B parameter scale. However, due to the limited amount of training data and the relatively small model capacity, it exhibits certain shortcomings, such as suboptimal performance on image generation from short prompts and unstable text-to-image generation quality. In this paper, we introduce Janus-Pro, an enhanced version of Janus that incorporates improvements across three dimensions: training strategies, data, and model size. The Janus-Pro series includes two model sizes: 1B and 7B, demonstrating the scalability of the visual encoding decoupling method.

We evaluate Janus-Pro on multiple benchmarks, and the results reveal its superior multimodal understanding capabilities and significantly improved text-to-image instruction-following performance. Specifically, Janus-Pro-7B achieves a score of 79.2 on the multimodal understanding benchmark MMBench [29], surpassing state-of-the-art unified multimodal models such as Janus [46] (69.4), TokenFlow [34] (68.9), and MetaMorph [42] (75.2). Additionally, on the text-to-image instruction-following leaderboard GenEval [14], Janus-Pro-7B scores 0.80, outperforming Janus [46] (0.61), DALL-E 3 (0.67), and Stable Diffusion 3 Medium [11] (0.74).


          Figure 3 | Architecture of our Janus-Pro. We decouple visual encoding for multimodal understanding and visual generation. 'Und. Encoder' and 'Gen. Encoder' are abbreviations for 'Understanding Encoder' and 'Generation Encoder', respectively. Best viewed on screen.

          2. Method

          2.1. Architecture

The architecture of Janus-Pro is shown in Figure 3 and is the same as that of Janus [46]. The core design principle of the overall architecture is to decouple visual encoding for multimodal understanding and generation. We apply independent encoding methods to convert the raw inputs into features, which are then processed by a unified autoregressive transformer. For multimodal understanding, we use the SigLIP [53] encoder to extract high-dimensional semantic features from images. These features are flattened from a 2-D grid into a 1-D sequence, and an understanding adaptor is used to map these image features into the input space of the LLM. For visual generation tasks, we use the VQ tokenizer from [38] to convert images into discrete IDs. After the ID sequence is flattened into 1-D, we use a generation adaptor to map the codebook embeddings corresponding to each ID into the input space of the LLM. We then concatenate these feature sequences to form a multimodal feature sequence, which is subsequently fed into the LLM for processing. Apart from the built-in prediction head in the LLM, we also utilize a randomly initialized prediction head for image predictions in the visual generation task. The entire model adheres to an autoregressive framework.
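To make the decoupled design above concrete, the following is a minimal PyTorch-style sketch of the two input paths. It assumes a HuggingFace-style LLM interface that accepts `inputs_embeds`; the module and attribute names (`und_adaptor`, `gen_adaptor`, `image_head`) and the dimensions are illustrative placeholders, not the released implementation.

```python
import torch
import torch.nn as nn

class JanusProSketch(nn.Module):
    """Illustrative sketch of the decoupled-encoding design; not the released code."""

    def __init__(self, llm, siglip_encoder, vq_tokenizer,
                 llm_dim=2048, siglip_dim=1024, code_dim=8, codebook_size=16384):
        super().__init__()
        self.llm = llm                      # unified autoregressive transformer
        self.und_encoder = siglip_encoder   # SigLIP encoder for understanding inputs
        self.vq = vq_tokenizer              # VQ tokenizer (16x downsampling, 16,384 codes)
        # Two-layer MLP adaptors mapping each modality into the LLM input space.
        self.und_adaptor = nn.Sequential(
            nn.Linear(siglip_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
        self.gen_adaptor = nn.Sequential(
            nn.Linear(code_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
        # Randomly initialized head predicting image-token IDs for visual generation.
        self.image_head = nn.Linear(llm_dim, codebook_size)

    def embed_understanding(self, image):
        feats = self.und_encoder(image)      # (B, H'*W', siglip_dim) semantic features
        return self.und_adaptor(feats)       # projected into the LLM input space

    def embed_generation(self, image):
        ids = self.vq.encode(image)          # (B, N) discrete image-token IDs
        codes = self.vq.codebook(ids)        # (B, N, code_dim) codebook embeddings
        return self.gen_adaptor(codes), ids

    def forward(self, text_embeds, image, task="understanding"):
        if task == "understanding":
            seq = torch.cat([self.embed_understanding(image), text_embeds], dim=1)
            return self.llm(inputs_embeds=seq).logits          # built-in text head
        # Visual generation: predict the image-token IDs autoregressively.
        img_embeds, target_ids = self.embed_generation(image)
        seq = torch.cat([text_embeds, img_embeds], dim=1)
        hidden = self.llm(inputs_embeds=seq, output_hidden_states=True).hidden_states[-1]
        logits = self.image_head(hidden[:, -target_ids.shape[1]:])
        return logits, target_ids
```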

          2.2. Optimized Training Strategy

The previous version of Janus employs a three-stage training process. Stage I focuses on training the adaptors and the image head. Stage II handles unified pretraining, during which all components except the understanding encoder and the generation encoder have their parameters updated. Stage III is supervised fine-tuning, building upon Stage II by further unlocking the parameters of the understanding encoder during training. This training strategy has certain issues. In Stage II, Janus divides the training for text-to-image capabilities into two parts following PixArt [4]. The first part trains on ImageNet [9] data, using image category names as prompts for text-to-image generation, with the goal of modeling pixel dependence. The second part trains on normal text-to-image data. During implementation, 66.67% of the text-to-image training steps in Stage II are allocated to the first part. However, through further experimentation, we find that this strategy is suboptimal and leads to significant computational inefficiency.

              To address this issue, we make two modifications.

• Longer Training in Stage I: We increase the training steps in Stage I, allowing sufficient training on the ImageNet dataset. Our findings reveal that even with the LLM parameters fixed, the model could effectively model pixel dependence and generate reasonable images based on category names.
• Focused Training in Stage II: In Stage II, we drop the ImageNet data and directly utilize normal text-to-image data to train the model to generate images based on dense descriptions. This redesigned approach enables Stage II to utilize the text-to-image data more efficiently, resulting in improved training efficiency and overall performance (a stage-by-stage trainability sketch follows this list).
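The stage-wise trainability implied by the three-stage process and the two modifications above can be summarized in a short sketch. This is a minimal illustration assuming attribute names such as `und_adaptor` and `image_head` from the architecture sketch earlier; it is not the actual HAI-LLM training code.

```python
def set_stage_trainability(model, stage: int):
    """Freeze/unfreeze components per training stage (illustrative; names are hypothetical)."""
    # Start from everything frozen, then unlock components per stage.
    for p in model.parameters():
        p.requires_grad = False

    if stage == 1:
        # Stage I: train only the adaptors and the image (generation) head.
        # With the longer Stage I, pixel dependence is learned on ImageNet
        # even though the LLM parameters stay fixed.
        unlock = [model.und_adaptor, model.gen_adaptor, model.image_head]
    elif stage == 2:
        # Stage II: unified pretraining on normal text-to-image data only
        # (ImageNet dropped); everything except the two encoders is updated.
        unlock = [model.und_adaptor, model.gen_adaptor, model.image_head, model.llm]
    else:
        # Stage III: supervised fine-tuning; additionally unlock the
        # understanding encoder while the generation encoder stays frozen.
        unlock = [model.und_adaptor, model.gen_adaptor, model.image_head,
                  model.llm, model.und_encoder]

    for module in unlock:
        for p in module.parameters():
            p.requires_grad = True
```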

We also adjust the data ratio in the Stage III supervised fine-tuning process across different types of datasets, changing the proportion of multimodal data, pure text data, and text-to-image data from 7:3:10 to 5:1:4. By slightly reducing the proportion of text-to-image data, we observe that this adjustment allows us to maintain strong visual generation capabilities while achieving improved multimodal understanding performance.
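As a worked example of what the 7:3:10 → 5:1:4 change means in practice, the sketch below shows one simple way to realize such a ratio when mixing data sources within a training step. The source names are ours for illustration; the real pipeline may mix data differently.

```python
import random

# Stage III data mix: multimodal understanding : pure text : text-to-image.
# Janus used 7:3:10; Janus-Pro uses 5:1:4, slightly reducing the
# text-to-image share in favour of multimodal understanding data.
STAGE3_RATIO = {"multimodal": 5, "text": 1, "text_to_image": 4}

def sample_batch_sources(batch_size, ratio=STAGE3_RATIO, rng=random):
    """Pick a data source for each sample in a mixed batch according to the ratio."""
    sources, weights = zip(*ratio.items())
    return rng.choices(sources, weights=weights, k=batch_size)

# Example: in a batch of 20, roughly 10 multimodal, 2 pure-text, and 8 T2I samples.
print(sample_batch_sources(20))
```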

              2.3. Data Scaling

              We scale up the training data used for Janus in both multimodal understanding and visual generation aspects.

• Multimodal Understanding. For the Stage II pretraining data, we refer to DeepSeek-VL2 [49] and add approximately 90 million samples. These include image caption datasets (e.g., YFCC [31]), as well as data for table, chart, and document understanding (e.g., Docmatix [20]). For the Stage III supervised fine-tuning data, we also incorporate additional datasets from DeepSeek-VL2, such as meme understanding, Chinese conversational data, and datasets aimed at enhancing dialogue experiences. These additions significantly expand the model's capabilities, enriching its ability to handle diverse tasks while improving the overall conversational experience.
• Visual Generation. We observe that the real-world data used in the previous version of Janus lacks quality and contains significant noise, which often leads to instability in text-to-image generation, resulting in aesthetically poor outputs. In Janus-Pro, we incorporate approximately 72 million samples of synthetic aesthetic data, bringing the ratio of real to synthetic data to 1:1 during the unified pretraining stage. The prompts for these synthetic data samples are publicly available, such as those in [43]. Experiments demonstrate that the model converges faster when trained on synthetic data, and the resulting text-to-image outputs are not only more stable but also exhibit significantly improved aesthetic quality (a simple 1:1 mixing sketch follows this list).
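For the 1:1 real-to-synthetic ratio mentioned above, here is a minimal sketch of one way to build such a mixed text-to-image sample list. The function and the up-sampling strategy are ours for illustration, not the actual data pipeline.

```python
import random

def mix_real_and_synthetic(real, synthetic, seed=0):
    """Build a 1:1 mixed text-to-image sample list by alternating the two pools.

    The shorter pool is up-sampled with replacement so the training mix
    sees real and synthetic aesthetic data in equal proportion (illustrative only).
    """
    rng = random.Random(seed)
    n = max(len(real), len(synthetic))
    real_n = real if len(real) == n else rng.choices(real, k=n)
    syn_n = synthetic if len(synthetic) == n else rng.choices(synthetic, k=n)
    mixed = [s for pair in zip(real_n, syn_n) for s in pair]
    rng.shuffle(mixed)
    return mixed

# Example: 3 real and 2 synthetic captions end up 1:1 after up-sampling.
print(mix_real_and_synthetic(["r1", "r2", "r3"], ["s1", "s2"]))
```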

              2.4. Model Scaling

The previous version of Janus validates the effectiveness of visual encoding decoupling using a 1.5B LLM. In Janus-Pro, we scale the model up to 7B, with the hyperparameters of both the 1.5B and 7B LLMs detailed in Table 1. We observe that when utilizing a larger-scale LLM, the convergence speed of the losses for both multimodal understanding and visual generation improves significantly compared to the smaller model. This finding further validates the strong scalability of this approach.

                  3. Experiments

                  3.1. Implementation Details

In our experiments, we utilize DeepSeek-LLM (1.5B and 7B) [3] with a maximum supported sequence length of 4096 as the base language model. For the vision encoder used in understanding tasks, we select SigLIP-Large-Patch16-384 [53]. The generation encoder has a codebook of size 16,384 and downsamples images by a factor of 16. Both the understanding adaptor and the generation adaptor are two-layer MLPs. The detailed hyperparameters for each stage are provided in Table 2. Please note that for Stage II, we employ an early stopping strategy, halting at 270K steps. All images are resized to 384 × 384 pixels. For multimodal understanding data, we resize the long side of the image and pad the short side with the background color (RGB: 127, 127, 127) to reach 384. For visual generation data, the short side is resized to 384, and the long side is cropped to 384. We use sequence packing during training to improve training efficiency. We mix all data types according to the specified ratios in a single training step. Our Janus-Pro is trained and evaluated using HAI-LLM [15], a lightweight and efficient distributed training framework built on top of PyTorch. The whole training process took about 9/14 days on a cluster of 16/32 nodes for the 1.5B/7B model, respectively, with each node equipped with 8 NVIDIA A100 (40GB) GPUs.
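The resizing rules just described can be expressed in a short sketch using Pillow. The function names are ours, and the centered padding and cropping are assumptions, since the paper does not specify alignment.

```python
from PIL import Image

TARGET = 384
PAD_COLOR = (127, 127, 127)  # background gray used to pad understanding images

def preprocess_understanding(img: Image.Image) -> Image.Image:
    """Resize the long side to 384 and pad the short side with gray to 384x384."""
    w, h = img.size
    scale = TARGET / max(w, h)
    img = img.resize((max(1, round(w * scale)), max(1, round(h * scale))), Image.BICUBIC)
    canvas = Image.new("RGB", (TARGET, TARGET), PAD_COLOR)
    # Centered placement is an assumption; the paper only specifies padding to 384.
    canvas.paste(img, ((TARGET - img.width) // 2, (TARGET - img.height) // 2))
    return canvas

def preprocess_generation(img: Image.Image) -> Image.Image:
    """Resize the short side to 384 and crop the long side to 384x384."""
    w, h = img.size
    scale = TARGET / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BICUBIC)
    left = (img.width - TARGET) // 2   # center crop (assumed)
    top = (img.height - TARGET) // 2
    return img.crop((left, top, left + TARGET, top + TARGET))
```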

                  3.2. Evaluation Setup

Multimodal Understanding. To assess multimodal understanding capabilities, we evaluate our model on widely recognized image-based vision-language benchmarks, which include GQA [17], POPE [23], MME [12], SEED [21], MMB [29], MM-Vet [51], and MMMU [52].

                      Visual Generation. For evaluating visual generation capabilities, we use GenEval [14] and DPG-Bench [16]. GenEval is a challenging benchmark for text-to-image generation, designed to reflect the comprehensive generative abilities of visual generation models by offering a detailed instance-level analysis of their compositional capabilities. DPG-Bench (Dense Prompt Graph Benchmark) is a comprehensive dataset consisting of 1065 lengthy, dense prompts, designed to assess the intricate semantic alignment capabilities of text-to-image models.

                      3.3. Comparison with State-of-the-arts

Multimodal Understanding Performance. We compare the proposed method with state-of-the-art unified models and understanding-only models in Table 3. Janus-Pro achieves the overall best results. This can be attributed to decoupling the visual encoding for multimodal understanding and generation, mitigating the conflict between these two tasks. When compared to models with significantly larger sizes, Janus-Pro remains highly competitive. For instance, Janus-Pro-7B outperforms TokenFlow-XL (13B) on all benchmarks except GQA.

Visual Generation Performance. We report visual generation performance on GenEval and DPG-Bench. As shown in Table 4, our Janus-Pro-7B obtains 80% overall accuracy on GenEval, which outperforms all the other unified or generation-only methods, e.g., Transfusion [55] (63%), SD3-Medium (74%), and DALL-E 3 (67%). This demonstrates that our approach has better instruction-following capabilities. As shown in Table 5, Janus-Pro achieves a score of 84.19 on DPG-Bench, surpassing all other methods. This demonstrates that Janus-Pro excels in following dense instructions for text-to-image generation.

                          3.4. Qualitative Results

We present results on multimodal understanding in Figure 4. Janus-Pro exhibits impressive comprehension abilities when handling inputs from various contexts, showcasing its powerful capabilities. We also present some text-to-image generation results in the lower part of Figure 4. The images generated by Janus-Pro-7B are highly realistic, and despite having a resolution of only 384 × 384, they still contain a lot of detail. For imaginative and creative scenes, Janus-Pro-7B accurately captures the semantic information from the prompts, producing well-reasoned and coherent images.

Figure 4 | Qualitative results of multimodal understanding and visual generation capability. The model is Janus-Pro-7B and the image output resolution of visual generation is 384 × 384. Best viewed on screen. (Panels cover image description, landmark recognition, general knowledge, text recognition, and text-to-image generation.)

                                  4. Conclusion

This paper introduces improvements to Janus from three aspects: training strategy, data, and model size. These enhancements have led to significant advancements in both multimodal understanding and text-to-image instruction-following capabilities. However, Janus-Pro still has certain limitations. In terms of multimodal understanding, the input resolution is limited to 384 × 384, which affects its performance in fine-grained tasks such as OCR. For text-to-image generation, the low resolution, combined with reconstruction losses introduced by the vision tokenizer, results in images that, while rich in semantic content, still lack fine details. For example, small facial regions occupying limited image space may appear under-detailed. Increasing the image resolution could mitigate these issues.

