2024 "AI Image Generation" Project Collection
With just a simple prompt, a little patience, and a burst of
imagination, one can transform ideas into vivid, lifelike artwork.

Image generation, as an AIGC application driven by AI large models, is disrupting traditional content creation and art design, allowing everyone to become a "master painter."
With just a simple prompt, a little patience, and a series of wild ideas, one can transform them into vivid and lifelike artworks. As 2024 draws to a close, the field of "AI image generation" has seen numerous outstanding research achievements, greatly enriching the ecosystem of image content creation. These come from leading tech giants, university labs, and individual developers, with some research already being open-sourced.
In this summary article, we focus on sharing "research-oriented" AI image generation projects. We have selected 18 out of 100 projects to share with you, listed in chronological order of release (see the full-version link at the end).

1. InstantID: Zero-shot, high-fidelity image generation in seconds

In the field of personalized image synthesis, methods like Textual Inversion, DreamBooth, and LoRA have made significant progress. However, the practical application of these methods is limited by high storage requirements, lengthy fine-tuning processes, and the need for multiple reference images. In contrast, existing ID embedding-based methods, while requiring only a single forward inference, also face challenges: either they require extensive fine-tuning of numerous model parameters, are incompatible with community-pretrained models, or fail to maintain high facial realism. To address these limitations, the research teams from InstantX and Xiaohongshu proposed a diffusion model-based solution, InstantID. Its plug-and-play module cleverly handles image personalization across various styles using just a single facial image while ensuring high fidelity. To achieve this, the researchers designed IdentityNet, which combines strong semantic and weak spatial conditions to merge facial and landmark images with text prompts to guide image generation.

InstantID demonstrates excellent performance and efficiency, making it highly valuable for practical applications that require maintaining identity authenticity. Moreover, InstantID can serve as an adaptable plugin, seamlessly integrating with popular pre-trained text-to-image diffusion models, such as SD 1.5 and SDXL.
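The two conditioning paths can be sketched at the module level. The code below is a minimal illustration under our own assumptions (the module name, dimensions, and the single extra attention branch are hypothetical, not the InstantID implementation): the face-recognition embedding is injected through a decoupled cross-attention branch alongside the text prompt, while the landmark image would feed a ControlNet-style spatial branch (the paper's IdentityNet).

```python
# Minimal sketch (not the official InstantID code) of the two conditioning paths.
# All module and parameter names here are illustrative assumptions.
import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    """Adds an ID-image attention branch on top of the usual text cross-attention."""
    def __init__(self, dim: int, id_dim: int, heads: int = 8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.id_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.id_proj = nn.Linear(id_dim, dim)   # project face-encoder output into UNet width
        self.scale = 1.0                        # strength of the identity branch

    def forward(self, hidden, text_tokens, id_embed):
        # hidden:      (B, N, dim)   UNet spatial tokens
        # text_tokens: (B, T, dim)   prompt embeddings
        # id_embed:    (B, 1, id_dim) face-recognition embedding of the reference image
        txt, _ = self.text_attn(hidden, text_tokens, text_tokens)
        idt, _ = self.id_attn(hidden, self.id_proj(id_embed), self.id_proj(id_embed))
        return hidden + txt + self.scale * idt

# The landmark image would additionally go to a ControlNet-like branch (IdentityNet)
# that adds residual features to the UNet blocks; that part is omitted here.
block = DecoupledCrossAttention(dim=320, id_dim=512)
out = block(torch.randn(2, 64, 320), torch.randn(2, 77, 320), torch.randn(2, 1, 512))
print(out.shape)  # torch.Size([2, 64, 320])
```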
Paper link: https://arxiv.org/abs/2401.07519
Project link: https://instantid.github.io/

2. PhotoMaker: Efficient Personalized Portrait Photo Customization

In recent years, text-to-image technology has made significant progress in synthesizing realistic human photos based on given text prompts. However, existing personalized generation methods fail to meet the requirements of high efficiency, maintaining identity (ID) fidelity, and flexible text controllability simultaneously.
To address this, research teams from Nankai University, Tencent, and the University of Tokyo proposed PhotoMaker, an efficient personalized text-to-image generation method. PhotoMaker encodes any number of input ID images into a stacked ID embedding to retain identity information. As a unified ID representation, this embedding not only comprehensively encapsulates the features of the same input ID but can also accommodate features of different IDs for subsequent integration, opening up possibilities for more interesting and practically valuable applications.

To support the training of PhotoMaker, the researchers proposed an ID-oriented data construction pipeline to collect training data. Trained on the dataset built with this pipeline, PhotoMaker demonstrates better ID preservation than test-time fine-tuning methods, along with significant speed improvements, high-quality generation results, strong generalization ability, and a wide range of applications.
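The stacked ID embedding can be pictured with a small module. The sketch below is an illustrative assumption (the dimensions, the linear fusion, and the module name are placeholders rather than the PhotoMaker code): embeddings of several reference photos of the same person are stacked and fused with the class-word token of the prompt.

```python
# Minimal sketch (assumption, not the official PhotoMaker code) of a stacked ID embedding:
# several reference images of one identity are encoded, stacked along the token axis,
# and fused with the class-word token ("man", "woman") before entering the diffusion model.
import torch
import torch.nn as nn

class StackedIDEmbedder(nn.Module):
    def __init__(self, img_dim: int = 1024, txt_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(img_dim, txt_dim)       # map image features into the text space
        self.fuse = nn.Linear(txt_dim * 2, txt_dim)   # merge class-word token with ID tokens

    def forward(self, id_feats: torch.Tensor, class_token: torch.Tensor) -> torch.Tensor:
        # id_feats:    (B, N, img_dim)  features of N reference images of one identity
        # class_token: (B, txt_dim)     embedding of the class word in the prompt
        id_tokens = self.proj(id_feats)                                   # (B, N, txt_dim)
        cls = class_token.unsqueeze(1).expand(-1, id_tokens.size(1), -1)  # broadcast class word
        stacked = self.fuse(torch.cat([id_tokens, cls], dim=-1))          # (B, N, txt_dim)
        return stacked  # replaces the single class-word token in the prompt sequence

embedder = StackedIDEmbedder()
out = embedder(torch.randn(2, 4, 1024), torch.randn(2, 768))
print(out.shape)  # torch.Size([2, 4, 768])
```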
Paper link: https://arxiv.org/abs/2312.04461
Project link: https://photo-maker.github.io/

3. ConsiStory: Consistent Text-to-Image Generation without Extra Training

Text-to-image models allow users to guide the image-generation process through natural language, elevating creative flexibility to a new level. However, it remains challenging to consistently depict the same subject across different prompts with these models. Existing methods either fine-tune the model to teach it new words for user-provided subjects or add image conditions to the model; both approaches require lengthy per-subject optimization or large-scale pre-training, and they still struggle to align generated images with text prompts and to depict multiple subjects.
To this end, a research team from NVIDIA and Tel Aviv University, along with their collaborators, proposed ConsiStory, a training-free method that achieves consistent subject generation by sharing the internal activations of the pre-trained model.

The research team introduced subject-driven shared attention blocks and correspondence-based feature injection to promote subject consistency across images, and also developed strategies to encourage layout diversity while maintaining that consistency. Compared against a series of baselines, ConsiStory achieved state-of-the-art (SOTA) subject consistency and text alignment without any optimization steps. ConsiStory scales naturally to multi-subject scenarios and even enables training-free personalization of common objects.
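The subject-driven shared attention can be sketched directly. The function below is a toy illustration under our own assumptions (single head, no projections, a random subject mask), not NVIDIA's implementation: each image's queries may also attend to the other images' keys and values, but only at positions covered by the subject mask.

```python
# Minimal sketch of subject-driven shared self-attention across a batch of images.
import torch
import torch.nn.functional as F

def shared_self_attention(q, k, v, subject_mask):
    # q, k, v:      (B, N, d)  per-image self-attention projections
    # subject_mask: (B, N)     True where the subject appears in each image
    B, N, d = q.shape
    k_shared = k.reshape(1, B * N, d).expand(B, -1, -1)    # every image sees all keys
    v_shared = v.reshape(1, B * N, d).expand(B, -1, -1)
    # allow attention to a foreign token only if it lies on the subject
    allow = subject_mask.reshape(1, B * N).expand(B, -1).clone()
    owner = torch.arange(B).repeat_interleave(N)            # which image each token belongs to
    for b in range(B):
        allow[b, owner == b] = True                         # always attend within your own image
    attn = (q @ k_shared.transpose(1, 2)) / d ** 0.5        # (B, N, B*N)
    attn = attn.masked_fill(~allow.unsqueeze(1), float("-inf"))
    return F.softmax(attn, dim=-1) @ v_shared               # (B, N, d)

out = shared_self_attention(torch.randn(2, 16, 64), torch.randn(2, 16, 64),
                            torch.randn(2, 16, 64), torch.rand(2, 16) > 0.5)
print(out.shape)  # torch.Size([2, 16, 64])
```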
Paper link: https://arxiv.org/abs/2402.03286
Project link: https://research.nvidia.com/labs/par/consistory/

4. FlashFace: Customize personalized photos easily and quickly with just one picture

A research team from the University of Hong Kong, Alibaba, and Ant Group has introduced a practical tool, FlashFace. Users can easily and instantly personalize their photos just by providing one or several reference face images and text prompts.

FlashFace differs from existing human photo customization methods in offering higher identity fidelity and better instruction-following, thanks to two subtle designs. First, it encodes the face identity as a series of feature maps rather than as an image token, as in previous techniques. This lets the model retain more details of the reference face, such as scars, tattoos, and face shape. Second, during text-to-image generation, FlashFace introduces a separation-then-integration strategy to balance text and image guidance, alleviating the conflict between the reference face and the text prompt (for example, personalizing an adult into a "child" or an "elderly person").
Numerous experiments have demonstrated the effectiveness of FlashFace in various applications, including portrait personalization, face swapping under language prompts, and turning virtual characters into realistic people.
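The separation-then-integration idea can be pictured at the guidance level. Below is a minimal, hypothetical sketch (not the released FlashFace code; the function name and weights are illustrative assumptions) in which the text condition and the reference-face condition each produce their own guidance term, blended with separate strengths so that neither overwhelms the other.

```python
# Minimal sketch (an assumption about the idea, not the FlashFace implementation)
# of a "separate then integrate" guidance step.
import torch

def separated_guidance(eps_uncond, eps_text, eps_ref, w_text=7.5, w_ref=2.0):
    # eps_*: noise predictions from the same UNet under different conditions
    #   eps_uncond - no condition, eps_text - prompt only, eps_ref - reference face only
    guided_text = eps_text - eps_uncond
    guided_ref = eps_ref - eps_uncond
    return eps_uncond + w_text * guided_text + w_ref * guided_ref

e = torch.randn(1, 4, 64, 64)
print(separated_guidance(e, e + 0.1, e - 0.1).shape)  # torch.Size([1, 4, 64, 64])
```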
Paper link: https://arxiv.org/abs/2403.17008
Project link: https://jshilong.github.io/flashface-page/

5. Huawei proposes PixArt-Σ: Directly generating 4K-resolution images

A research team from Huawei Noah's Ark Lab, Dalian University of Technology, the University of Hong Kong, and the Hong Kong University of Science and Technology has proposed a Diffusion Transformer (DiT) named PixArt-Σ that can directly generate 4K-resolution images. Compared with its predecessor PixArt-α, PixArt-Σ has made significant progress, with a notable improvement in image fidelity and better alignment with text prompts.

One of the main features of PixArt-Σ is its training efficiency. Building on the basic pre-training of PixArt-α, it evolves from a "weak" baseline to a "strong" model by incorporating higher-quality data, a process the authors call "weak-to-strong training." The progress of PixArt-Σ is reflected in two aspects. First, high-quality training data: PixArt-Σ integrates higher-quality image data along with more accurate and detailed image descriptions. Second, efficient token compression: the research team proposed a new attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating the generation of ultra-high-resolution images.
Thanks to these improvements, PixArt-Σ achieves excellent image quality and prompt adherence. Meanwhile, its model size (0.6B parameters) is significantly smaller than existing text-to-image diffusion models such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). In addition, PixArt-Σ can generate 4K images, supporting the production of high-resolution posters and wallpapers and effectively facilitating high-quality visual content creation in industries such as film and gaming.
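The key-value compression can be sketched in a few lines. The following is a minimal, hypothetical illustration (not the PixArt-Σ source; the module name, the depthwise-convolution pooling, and the stride of 2 are our assumptions) of downsampling the key/value tokens before attention so that attention cost drops sharply at high resolutions.

```python
# Minimal sketch (assumption, not the PixArt-Σ code) of key-value token compression.
import torch
import torch.nn as nn

class KVCompressedAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8, kv_stride: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # depthwise strided conv used here as a simple token-grid downsampler
        self.kv_pool = nn.Conv2d(dim, dim, kernel_size=kv_stride, stride=kv_stride, groups=dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, h*w, dim) token sequence arranged on an h x w grid
        B, N, C = x.shape
        grid = x.transpose(1, 2).reshape(B, C, h, w)
        kv = self.kv_pool(grid).flatten(2).transpose(1, 2)   # (B, h*w / stride^2, dim)
        out, _ = self.attn(x, kv, kv)                        # queries keep full resolution
        return out

layer = KVCompressedAttention(dim=256)
y = layer(torch.randn(2, 32 * 32, 256), h=32, w=32)
print(y.shape)  # torch.Size([2, 1024, 256])
```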
Paper link: https://arxiv.org/abs/2403.04692
Project link: https://pixart-alpha.github.io/PixArt-sigma-project/

6. CogView3: Achieving More Precise and Faster "Text-to-Image" Generation through Relay Diffusion

The latest advances in text-to-image systems have been driven primarily by diffusion models. However, single-stage text-to-image diffusion models still face challenges in computational efficiency and image detail refinement. To address this, a research team from Tsinghua University and Zhipu AI proposed CogView3, an innovative cascaded framework that enhances the performance of text-to-image diffusion.
It is reported that CogView3 is the first model to implement relay diffusion in the field of text-to-image generation. It works by first creating low-resolution images and then applying relay-based super-resolution. This approach not only produces competitive text-to-image outputs but also significantly reduces training and inference costs.

Experimental results show that in human evaluations, CogView3 outperforms SDXL, the current state-of-the-art open-source text-to-image diffusion model, by 77.0%, while requiring only about half of SDXL's inference time. The distilled variant of CogView3 performs comparably to SDXL with only one-tenth of its inference time.
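The relay step can be illustrated with a toy sampler. The sketch below conveys the general idea rather than THUDM's implementation (the linear noise schedule, the step counts, and the stand-in denoiser are purely illustrative): the base model's low-resolution output is upsampled, partially re-noised to an intermediate point of the schedule, and a super-resolution denoiser runs only the remaining steps.

```python
# Minimal sketch (assumption, not THUDM's code) of relay-style super-resolution.
import torch
import torch.nn.functional as F

def relay_super_resolution(low_res, sr_denoiser, num_steps=1000, relay_step=300):
    # low_res: (B, 3, H, W) output of the base text-to-image stage
    up = F.interpolate(low_res, scale_factor=2, mode="bilinear", align_corners=False)
    # re-noise the upsampled image to an intermediate point of the schedule
    alpha = 1.0 - relay_step / num_steps                     # toy linear schedule
    x = alpha ** 0.5 * up + (1 - alpha) ** 0.5 * torch.randn_like(up)
    # denoise only the remaining relay_step steps instead of the full schedule
    for t in range(relay_step, 0, -1):
        pred_clean = sr_denoiser(x, t)                       # model predicts the clean image
        a_prev = 1.0 - (t - 1) / num_steps
        x = a_prev ** 0.5 * pred_clean + (1 - a_prev) ** 0.5 * torch.randn_like(x)
    return x

toy_denoiser = lambda x, t: x * 0.99                         # placeholder for the SR diffusion model
out = relay_super_resolution(torch.rand(1, 3, 256, 256), toy_denoiser, relay_step=5)
print(out.shape)  # torch.Size([1, 3, 512, 512])
```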
Paper link: https://arxiv.org/abs/2403.05121
Github link: https://github.com/THUDM/CogView3

7. SPRIGHT: Improving Spatial Consistency in Text-to-Image Models

One of the major shortcomings of current text-to-image (T2I) models is their inability to consistently generate images that faithfully reflect the spatial relationships specified in text prompts. A research team from Arizona State University, Intel Labs, and their collaborators conducted a comprehensive study of this limitation. They also developed SPRIGHT, a large, spatially focused dataset built by re-captioning existing vision datasets, along with training methods that achieve state-of-the-art (SOTA) performance.

In addition, they found that training on images containing a large number of objects significantly improves spatial consistency. Notably, by fine-tuning on fewer than 500 images, they achieved state-of-the-art (SOTA) performance on T2I-CompBench, with a spatial score of 0.2133.
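The data-selection step behind that result can be sketched simply. The snippet below is our assumption about what such a filter might look like (not code from the SPRIGHT release; the thresholds and the COCO-style annotation format are illustrative): it keeps only the few hundred most object-dense images for fine-tuning.

```python
# Minimal sketch (assumption) of selecting a small set of object-dense images.
from typing import Dict, List

def select_object_dense_images(annotations: Dict[str, List[dict]],
                               min_objects: int = 10,
                               max_images: int = 500) -> List[str]:
    # annotations: image_id -> list of object annotations (COCO-style, grouped by image)
    dense = [(img_id, len(objs)) for img_id, objs in annotations.items() if len(objs) >= min_objects]
    dense.sort(key=lambda pair: pair[1], reverse=True)       # densest images first
    return [img_id for img_id, _ in dense[:max_images]]

toy = {"a": [{}] * 12, "b": [{}] * 3, "c": [{}] * 25}
print(select_object_dense_images(toy))  # ['c', 'a']
```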
Paper link: https://arxiv.org/abs/2404.01197
Project link: https://spright-t2i.github.io/

8. RLCM: Fine-tuning Consistency Models via Reinforcement Learning

Reinforcement learning (RL) improves guided image generation with diffusion models by directly optimizing rewards for image quality, aesthetics, and instruction following. However, the resulting generation policy inherits the iterative sampling process of diffusion models, leading to slow generation. To overcome this limitation, consistency models learn a new class of generative models that map noise directly to data, yielding a model that can generate images in a single sampling iteration.
In this work, to optimize text-to-image generation models for task-specific rewards while achieving fast training and inference, a research team from Cornell University proposed RLCM, a framework for fine-tuning consistency models via RL that formulates the consistency model's iterative inference process as an RL procedure. RLCM improves on RL-fine-tuned diffusion models in text-to-image generation and trades computational cost for sample quality at inference time.

Experiments show that RLCM can adapt text-to-image consistency models to objectives that are difficult to express through prompts (such as image compressibility) as well as objectives derived from human feedback (such as aesthetic quality). Compared with RL-fine-tuned diffusion models, RLCM trains significantly faster, improves generation quality as measured by the reward objectives, and speeds up inference, generating high-quality images in as few as two inference steps.
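A single update of such a pipeline can be sketched as a REINFORCE-style step over the consistency model's short sampling chain. Everything below is a toy illustration under our own assumptions (the Gaussian action noise, the two-step rollout, and the stand-in reward are not from the RLCM repository).

```python
# Minimal sketch (assumption, not the RLCM code) of RL fine-tuning over a short sampling chain.
import torch

def rlcm_update(consistency_model, reward_fn, optimizer, batch=4, steps=2, sigma=0.1):
    x = torch.randn(batch, 4, 64, 64)                      # start from noise in latent space
    log_probs = []
    for _ in range(steps):
        mean = consistency_model(x)                        # model proposes a denoised sample
        dist = torch.distributions.Normal(mean, sigma)     # stochastic "action" around it
        x = dist.sample()
        log_probs.append(dist.log_prob(x).sum(dim=(1, 2, 3)))
    reward = reward_fn(x)                                  # e.g. aesthetic score, compressibility
    advantage = reward - reward.mean()                     # simple baseline
    loss = -(advantage.detach() * torch.stack(log_probs, dim=0).sum(dim=0)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

toy_model = torch.nn.Conv2d(4, 4, 3, padding=1)            # stand-in for the consistency model
opt = torch.optim.Adam(toy_model.parameters(), lr=1e-4)
print(rlcm_update(toy_model, lambda img: -img.abs().mean(dim=(1, 2, 3)), opt))
```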
Paper link: https://arxiv.org/abs/2404.03673
Project link: https://rlcm.owenoertell.com/

9. Tsinghua University and Meta Propose a New Method for Customized Text-to-Image Generation: MultiBooth

A research team from Tsinghua University and Meta has proposed a new and efficient technique for multi-concept customization in text-to-image generation, called MultiBooth. Despite significant progress in customized generation methods, especially with the rapid development of diffusion models, existing methods still struggle with multi-concept scenarios due to low concept fidelity and high inference costs.

To address these issues, MultiBooth divides the multi-concept generation process into two stages: a single-concept learning stage and a multi-concept integration stage. In the single-concept learning stage, they adopt a multi-modal image encoder and an efficient concept encoding technique to learn a concise and discriminative representation for each concept. In the multi-concept integration stage, they use bounding boxes to define the generation region of each concept in the cross-attention map. This approach enables single concepts to be created within their specified regions, facilitating the composition of multi-concept images.
This strategy not only improves concept fidelity but also reduces additional inference costs. In both qualitative and quantitative evaluations, MultiBooth outperforms various baselines, demonstrating superior performance and computational efficiency.
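The bounding-box control of cross-attention can be pictured with a small function. The sketch below is a hypothetical illustration rather than the MultiBooth code (the box format, the per-concept token shapes, and the masking scheme are our assumptions): each concept's tokens only influence the spatial positions inside that concept's box.

```python
# Minimal sketch (assumption) of bounding-box-guided cross-attention for composing concepts.
import torch
import torch.nn.functional as F

def region_masked_cross_attention(q, concept_tokens, boxes, h, w):
    # q:              (B, h*w, d)   spatial queries from the UNet
    # concept_tokens: (B, C, T, d)  T tokens for each of C concepts
    # boxes:          list of C (x0, y0, x1, y1) boxes in [0, 1] coordinates
    B, N, d = q.shape
    out = torch.zeros_like(q)
    for c, (x0, y0, x1, y1) in enumerate(boxes):
        mask = torch.zeros(h, w, dtype=torch.bool)
        mask[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = True
        mask = mask.flatten()                                  # (h*w,)
        k = v = concept_tokens[:, c]                           # (B, T, d)
        attn = F.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)
        out[:, mask] += (attn @ v)[:, mask]                    # write only inside the box
    return out

y = region_masked_cross_attention(torch.randn(1, 16 * 16, 64),
                                  torch.randn(1, 2, 4, 64),
                                  [(0.0, 0.0, 0.5, 1.0), (0.5, 0.0, 1.0, 1.0)],
                                  h=16, w=16)
print(y.shape)  # torch.Size([1, 256, 64])
```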
Paper link: https://arxiv.org/abs/2404.14239
Project link: https://multibooth.github.io/

10. Teams from Tsinghua University and Zhipu AI Launch the Infinite Super-Resolution Model Inf-DiT

In recent years, diffusion models have demonstrated excellent performance in image generation. However, because memory grows quadratically when generating ultra-high-resolution images (such as 4096×4096), the resolution of generated images is often limited to 1024×1024.
In this work, a research team from Tsinghua University and Zhipu AI proposed a unidirectional block attention mechanism that can adaptively adjust memory overhead during inference and handle global dependencies. Building on this module, they adopted the DiT architecture for upsampling and developed Inf-DiT, an infinite super-resolution model capable of upsampling images of various shapes and resolutions.

Comprehensive experiments show that the model achieves state-of-the-art (SOTA) performance in both machine and human evaluations for generating ultra-high-resolution images. Compared with the commonly used UNet architecture, it saves more than five times the memory when generating 4096×4096 images.
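The unidirectional block attention idea can be sketched as follows. This is a toy illustration under our own assumptions (the block size, which neighbors are visible, and the absence of multi-head projections are all simplifications, not the Inf-DiT code): each block attends only to itself and to the already-generated blocks to its left and above, so memory stays bounded regardless of image size.

```python
# Minimal sketch (assumption) of unidirectional block attention over an image token grid.
import torch
import torch.nn.functional as F

def unidirectional_block_attention(x, grid_h, grid_w, block):
    # x: (grid_h*block, grid_w*block, d) tokens laid out on the image grid
    d = x.size(-1)
    blocks = x.unfold(0, block, block).unfold(1, block, block)   # (grid_h, grid_w, d, block, block)
    blocks = blocks.permute(0, 1, 3, 4, 2).reshape(grid_h, grid_w, block * block, d)
    out = torch.zeros_like(blocks)
    for i in range(grid_h):
        for j in range(grid_w):
            ctx = [blocks[i, j]]                                  # always attend within the block
            if j > 0: ctx.append(blocks[i, j - 1])                # left neighbor (already generated)
            if i > 0: ctx.append(blocks[i - 1, j])                # top neighbor (already generated)
            kv = torch.cat(ctx, dim=0)                            # at most 3 blocks of keys/values
            attn = F.softmax(blocks[i, j] @ kv.T / d ** 0.5, dim=-1)
            out[i, j] = attn @ kv
    return out.reshape(grid_h, grid_w, block, block, d)

y = unidirectional_block_attention(torch.randn(8, 8, 32), grid_h=2, grid_w=2, block=4)
print(y.shape)  # torch.Size([2, 2, 4, 4, 32])
```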
Paper link: https://arxiv.org/abs/2405.04312
Github link: https://github.com/THUDM/Inf-DiT

11. A New Work by Kaiming He: Autoregressive Image Generation without Vector Quantization

The traditional view holds that autoregressive models for image generation are usually accompanied by vector-quantized tokens.
The team led by Kaiming He at the MIT Computer Science and Artificial Intelligence Laboratory (MIT CSAIL), along with collaborators from Google DeepMind and Tsinghua University, has found that while a discrete-valued space is helpful for representing categorical distributions, it is not a necessary condition for autoregressive modeling.
In this work, they proposed using a diffusion process to model the probability distribution of each token, enabling the application of autoregressive models in a continuous-valued space. Instead of using a categorical cross-entropy loss, they defined a diffusion loss function to model the probability of each token. This approach eliminates the need for discrete-valued tokenizers. They evaluated its effectiveness in various scenarios, including standard autoregressive models and generalized masked autoregressive (MAR) variants. By removing vector quantization, the image generator they proposed not only enjoys the speed advantage of sequence modeling but also achieves excellent results.

They hope that this work will promote the application of autoregressive generation techniques in other continuous-valued fields and applications.
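The diffusion loss can be sketched concretely. The snippet below is a minimal illustration under our own assumptions (the token and conditioning dimensions, the linear noise schedule, and the small MLP denoiser are all placeholders, not the released mar code): for each continuous token, the backbone's output vector conditions a tiny denoiser, and the MSE on its noise prediction replaces categorical cross-entropy.

```python
# Minimal sketch (assumption) of a per-token diffusion loss replacing cross-entropy.
import torch
import torch.nn as nn

class DiffusionLoss(nn.Module):
    def __init__(self, token_dim: int = 16, cond_dim: int = 256, steps: int = 1000):
        super().__init__()
        self.steps = steps
        self.denoiser = nn.Sequential(
            nn.Linear(token_dim + cond_dim + 1, 512), nn.SiLU(),
            nn.Linear(512, token_dim),
        )

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # x: (B, token_dim) continuous-valued token (no vector quantization)
        # z: (B, cond_dim)  conditioning vector produced by the AR / MAR backbone
        t = torch.randint(1, self.steps, (x.size(0), 1), device=x.device).float() / self.steps
        alpha = 1.0 - t                                      # toy linear noise schedule
        noise = torch.randn_like(x)
        x_t = alpha.sqrt() * x + (1 - alpha).sqrt() * noise
        pred = self.denoiser(torch.cat([x_t, z, t], dim=-1))
        return ((pred - noise) ** 2).mean()                  # MSE on the predicted noise

loss_fn = DiffusionLoss()
print(loss_fn(torch.randn(8, 16), torch.randn(8, 256)).item())
```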
Paper link: https://arxiv.org/abs/2406.11838
Github link: https://github.com/LTH14/mar

12. Xiaohongshu Launches StoryMaker: Achieving Overall Feature Consistency in "Text-to-Image" Generation

Tuning-free personalized image generation methods have achieved great success in maintaining facial consistency. However, in scenes with multiple characters, the lack of overall consistency hinders the ability of these methods to create coherent narratives. In this work, the Xiaohongshu team introduced a personalized solution, StoryMaker, which maintains not only the consistency of facial features but also that of clothing, hairstyles, and body figures, thus facilitating the creation of stories through a series of images.
Paper link: https://arxiv.org/abs/2409.12576
Github link: https://github.com/RedAIGC/StoryMaker

13. NVIDIA's Text-to-Image Framework Sana: Efficiently Generating High-Resolution Images

The research team at NVIDIA and its collaborators have proposed a text-to-image framework, Sana, which can efficiently generate images at resolutions up to 4096×4096. Sana can synthesize high-resolution, high-quality images extremely quickly on a laptop GPU, with strong alignment between the text and the generated images.

The core designs include: (1) Deep compression autoencoder: unlike traditional autoencoders that compress images only 8-fold, the autoencoder they trained compresses images 32-fold, effectively reducing the number of latent tokens. (2) Linear DiT: they replaced all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: they replaced T5 with a modern, small decoder-only large language model (LLM) as the text encoder and designed complex human instructions with in-context learning to enhance image-text alignment. (4) Efficient training and sampling: they proposed Flow-DPM-Solver to reduce the number of sampling steps and accelerated convergence through efficient caption labeling and selection.
Compared with modern giant diffusion models such as Flux-12B, Sana-0.6B is highly competitive: it is 20 times smaller and over 100 times faster in measured throughput. In addition, Sana-0.6B can be deployed on a 16GB laptop GPU and takes less than 1 second to generate a 1024×1024 image. Sana enables content creation at a low cost.
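The linear-attention replacement can be written compactly. The function below is a generic linear-attention sketch with a ReLU feature map (our assumption for illustration; Sana's exact formulation and any additional mixing layers are not reproduced here): by regrouping (QK^T)V as Q(K^T V), the cost grows linearly rather than quadratically with the number of tokens, which matters at 4096×4096.

```python
# Minimal sketch (assumption, not the Sana code) of linear attention with a ReLU feature map.
import torch

def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (B, heads, N, d)
    q = torch.relu(q)
    k = torch.relu(k)
    kv = torch.einsum("bhnd,bhne->bhde", k, v)            # K^T V: independent of token count N
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)   # normalization term
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)  # normalized Q (K^T V)

out = linear_attention(torch.randn(1, 8, 4096, 32),
                       torch.randn(1, 8, 4096, 32),
                       torch.randn(1, 8, 4096, 32))
print(out.shape)  # torch.Size([1, 8, 4096, 32])
```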
Paper link: https://arxiv.org/abs/2410.10629
Project link: https://nvlabs.github.io/Sana/

14. Kandinsky 3: A New Type of Text-to-Image Diffusion Model

Text-to-image (T2I) diffusion models are widely used as the basis for image processing methods such as editing, image fusion, and inpainting; image-to-video (I2V) and text-to-video (T2V) models are also built on top of T2I models. A research team from SberAI and its collaborators have introduced Kandinsky 3, a new T2I model based on latent diffusion that features high quality and realism.

The main features of the new architecture are its simplicity and efficiency, which allow it to adapt to many types of generation tasks. They extended the base T2I model to a wide range of applications and created a versatile generation system that includes text-guided inpainting/outpainting, image fusion, text-image fusion, image variation generation, and I2V and T2V generation. They also proposed a refined version of the T2I model that performs inference in just 4 steps of the reverse process without sacrificing image quality, making it 3 times faster than the base model. They have deployed a user-friendly demo system in which all of these functions can be tested publicly.
In addition, they have released the source code and checkpoints of Kandinsky 3 and the extended models. The results of human evaluation show that Kandinsky 3 is one of the open-source generation systems with the highest quality scores.
Paper link: https://arxiv.org/abs/2410.21061
Github link: https://github.com/ai-forever/Kandinsky-3

15. FlipSketch: Transforming Static Drawings into Text-Guided Sketch Animations

Sketch animation provides a powerful medium for visual storytelling, encompassing everything from simple flip-book doodles to professional studio productions. Traditional animation production requires a team of skilled artists to draw keyframes and in-betweens. Existing attempts at automation still demand a significant amount of artistic work through precise motion path or keyframe specifications.
In this work, the SketchX team from the University of Surrey introduced the FlipSketch system, which lets you rediscover the charm of flip-book animation: all you need to do is draw your ideas and describe how you want them to move.

This approach leverages motion priors from text-to-video diffusion models and adapts them to generate sketch animations through three key innovations: (1) fine-tuning for sketch-style frame generation, (2) a reference-frame mechanism that maintains the visual integrity of the input sketch through noise refinement, and (3) dual-attention synthesis that achieves smooth motion without losing visual consistency. Unlike constrained vector animations, their raster-based framework supports dynamic sketch transformations, capturing the free-form expressiveness of traditional animation.
Paper link: https://arxiv.org/abs/2411.10818
Project link: https://hmrishavbandy.github.io/flipsketch-web/

16. OneDiffusion: One Diffusion to Generate Them All

In this work, a research team from AI2 and the University of California, Irvine introduced OneDiffusion, a general-purpose, large-scale diffusion model that seamlessly supports bidirectional image synthesis and understanding across diverse tasks. It can generate images conditioned on inputs such as text, depth, pose, layout, and semantic maps, and it can also handle tasks in the reverse direction, such as image deblurring, upscaling, depth estimation, and segmentation.

In addition, OneDiffusion can generate multi-view images, perform camera pose estimation, and use sequential image inputs for instant personalization. During training, the model treats all tasks as frame sequences with different noise scales, allowing any frame to serve as a conditioning image at inference time. This unified training framework requires no dedicated architecture, supports scalable multi-task training, and adapts to any resolution, enhancing both generality and scalability.
Experimental results show that, despite a relatively small training dataset, this framework is highly competitive in generation and prediction tasks such as text-to-image, multi-view generation, identity preservation, depth estimation, and camera pose estimation.
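The "frame sequences with different noise scales" idea can be sketched in a few lines. The snippet below is an illustrative assumption rather than the OneDiffusion code (the frame layout, the linear noise mixing, and the example noise levels are placeholders): condition frames are kept nearly clean while the target frame is fully noised.

```python
# Minimal sketch (assumption) of per-frame noise levels over a task framed as a "video".
import torch

def noise_frames(frames: torch.Tensor, noise_levels: torch.Tensor) -> torch.Tensor:
    # frames:       (B, F, C, H, W)  one sequence whose frames are the task's inputs/outputs
    # noise_levels: (F,) in [0, 1];  0 = keep clean (condition), 1 = pure noise (target)
    sigma = noise_levels.view(1, -1, 1, 1, 1)
    return (1 - sigma) * frames + sigma * torch.randn_like(frames)

frames = torch.rand(2, 3, 3, 64, 64)                  # [reference image, depth map, target]
noisy = noise_frames(frames, torch.tensor([0.0, 0.0, 1.0]))
print(noisy.shape)  # torch.Size([2, 3, 3, 64, 64])
```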
Paper link: https://arxiv.org/abs/2411.16318
Github link: https://github.com/lehduong/OneDiffusion

17. A Peking University Team Proposes DiffSensei, a Framework for "Customized Comic Generation"

Story visualization, the task of creating visual narratives from textual descriptions, has advanced alongside text-to-image generation models. However, these models often lack effective control over character appearance and interactions, especially in multi-character scenarios.
To address these limitations, a research team from Peking University and their collaborators proposed a new task, customized comic generation, and introduced DiffSensei, an innovative framework specifically designed for generating comics with dynamic multi-character control. DiffSensei integrates a diffusion-based image generator with a multi-modal large language model (MLLM) that serves as a text-compatible identity adapter. Their approach employs masked cross-attention to seamlessly integrate character features, enabling precise layout control without directly transferring pixels. Additionally, the MLLM-based adapter adjusts character features to align with the text cues of each panel, flexibly adapting the characters' expressions, poses, and actions.

They also introduced MangaZero, a large-scale dataset specifically tailored for this task. It contains 43,264 comic pages and 427,147 annotated panels, supporting the visualization of various character interactions and actions across consecutive frames. Extensive experiments have demonstrated that DiffSensei outperforms existing models. By enabling text-adaptable character customization, it marks a significant advancement in comic-generation technology.
Paper link: https://arxiv.org/abs/2412.07589
Project link: https://jianzongwu.github.io/projects/diffsensei/

18. SnapGen: Tiny, Fast, High-Resolution "Text-to-Image"

Existing text-to-image (T2I) diffusion models face several limitations, including large model size, slow runtime, and low-quality generation on mobile devices. A research team from Snap and its collaborators set out to address all of these challenges by developing an extremely small and fast T2I model that can generate high-resolution, high-quality images on mobile platforms.

To achieve this goal, they proposed several techniques. First, they systematically examined network architecture design choices to reduce model parameters and latency while ensuring high-quality generation. Second, to further improve generation quality, they adopted cross-architecture knowledge distillation from a larger model, using a multi-level approach to guide the training of their model from scratch. Third, they achieved generation in just a few steps by combining adversarial guidance with knowledge distillation.
Their model, SnapGen, takes only 1.4 seconds to generate a 1024×1024 px image on mobile devices. On ImageNet-1K, the model generates 256×256 px images with only 372M parameters, achieving an FID of 2.06. On T2I benchmarks (GenEval and DPG-Bench), their model, with just 379M parameters, outperforms large models with billions of parameters at a significantly smaller scale (for example, 7 times smaller than SDXL and 14 times smaller than IF-XL).
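The multi-level, cross-architecture distillation can be sketched as a combined loss. The snippet below is an illustrative assumption (not the SnapGen training code; the projection heads, the feature weighting, and the tensor shapes are placeholders): the student matches the teacher's noise prediction while lightweight projections also align a few intermediate feature maps across differing architectures.

```python
# Minimal sketch (assumption) of multi-level knowledge distillation across architectures.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, student_feats, teacher_feats, projs, w_feat=0.5):
    # student_out / teacher_out: (B, C, H, W) predicted noise from each model
    # *_feats: lists of intermediate feature maps; projs maps student widths to teacher widths
    loss = F.mse_loss(student_out, teacher_out)
    for s, t, proj in zip(student_feats, teacher_feats, projs):
        s = proj(s)                                                   # align channel width
        s = F.interpolate(s, size=t.shape[-2:], mode="bilinear", align_corners=False)
        loss = loss + w_feat * F.mse_loss(s, t)                       # feature-level matching
    return loss

projs = nn.ModuleList([nn.Conv2d(64, 128, 1)])
loss = distillation_loss(torch.randn(2, 4, 32, 32), torch.randn(2, 4, 32, 32),
                         [torch.randn(2, 64, 16, 16)], [torch.randn(2, 128, 8, 8)], projs)
print(loss.item())
```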
Paper link: https://arxiv.org/abs/2412.09619
Project link: https://snap-research.github.io/snapgen/
Full version: https://oosdj1g7qa.feishu.cn/docx/YXfLdKN6po7UPsxC3Eycoboondg