---
license: apache-2.0
language: en
library_name: transformers
tags:
- d2f
- diffusion-llm
- text-generation
- dream
- lora
base_model: apple/DiffuCoder-7B-Instruct
model_name: D2F_DiffuCoder_Instruct_7B_Lora
---
| # D2F LoRA adapter for DiffuCoder-7B-Instruct |
|
|
| This repository contains the **LoRA adapter** for the `apple/DiffuCoder-7B-Instruct` model, trained using the **Discrete Diffusion Forcing (D2F)** method. |
|
|
This adapter enables the `DiffuCoder-7B-Instruct` diffusion LLM (dLLM) to reach inference speeds significantly faster than both the original model and leading autoregressive (AR) models such as LLaMA3, while maintaining comparable output quality.
|
|
| The D2F method and its results are detailed in the paper: **[D2F: Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing](https://arxiv.org/abs/2508.09192)**. |
|
|
| - **Official Code:** [D2F GitHub Repository](https://github.com/zhijie-group/Discrete-Diffusion-Forcing) |
| - **Demo Space:** [D2F-LLaDA-Instruct-8B](https://huggingface.co/spaces/zhijie3/D2F-LLaDA-Instruct-8B) |
- **Used in:** [LoPA](https://github.com/zhijie-group/LoPA)
|
|
| ## Method: Discrete Diffusion Forcing (D2F) |
|
|
Diffusion LLMs (dLLMs) have long promised ultra-fast parallel decoding, but two main bottlenecks have historically held that promise back:
1. **KV-cache incompatibility:** Their bidirectional attention mechanism prevented use of the key-value (KV) cache, a critical optimization in AR models.
2. **Strict inter-block dependency:** Previous block-based generation schemes required each block to be fully generated before the next could start, preventing true parallelism.
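The KV-cache point can be illustrated with a toy single-head attention (a minimal NumPy sketch, not the D2F code): under a causal mask, appending a token leaves every earlier position's output unchanged, so its keys and values can be cached; under bidirectional attention, appending a token changes all earlier outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # feature dimension of the toy example

def attention(x, causal):
    # Toy single-head self-attention (Q/K/V projections omitted for brevity).
    scores = x @ x.T / np.sqrt(d)
    if causal:
        scores = np.where(np.tril(np.ones(scores.shape, dtype=bool)), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

x3 = rng.normal(size=(3, d))                   # a 3-token prefix
x4 = np.vstack([x3, rng.normal(size=(1, d))])  # the same prefix plus one new token

# Causal: the prefix outputs are unchanged, so cached keys/values stay valid.
print(np.allclose(attention(x3, True), attention(x4, True)[:3]))    # True
# Bidirectional: appending a token alters every earlier output -> no KV cache.
print(np.allclose(attention(x3, False), attention(x4, False)[:3]))  # False
```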
|
|
| **D2F** solves these issues with a novel hybrid approach: |
|
|
| 1. **Hybrid Architecture:** D2F reframes text generation as a block-autoregressive process. |
| * **Within a block:** Attention remains **bidirectional** to capture rich local context. |
| * **Between blocks:** Attention is made **causal**, allowing the model to be fully compatible with the standard **KV Cache**. |
|
|
| 2. **Pipelined Parallel Decoding:** D2F uses an efficient training and inference strategy. |
| * **Training:** It uses *Asymmetric Distillation*, where a D2F student model learns to mimic a powerful bidirectional teacher model, efficiently transferring its capabilities to the fast, cache-friendly architecture. |
    * **Inference:** It enables a dynamic **pipelined parallel decoder**: new text blocks enter the pipeline while their predecessors are still only partially complete, creating an asynchronous workflow that maximizes GPU utilization and dramatically boosts throughput.
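The hybrid attention pattern described above can be sketched as a mask (an illustrative NumPy sketch; the block size and mask layout here are assumptions for demonstration, not the official implementation):

```python
import numpy as np

def block_causal_mask(seq_len, block_size):
    """True = attention allowed: bidirectional inside a block, causal across blocks."""
    block_id = np.arange(seq_len) // block_size
    # Position i may attend to position j iff j's block is not in i's future.
    return block_id[:, None] >= block_id[None, :]

mask = block_causal_mask(seq_len=6, block_size=2)
print(mask.astype(int))
# [[1 1 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 1 1]
#  [1 1 1 1 1 1]]
```

Because attention across blocks is strictly causal, the keys and values of finished blocks never change and can therefore be cached, just as in an AR model.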
|
|
|
|
| ## How to Use |
|
|
| ⚠️ **Important:** This is a LoRA adapter and requires the official D2F codebase for inference. |
|
|
| For detailed instructions and code, please refer to the official GitHub repository: |
|
|
| ➡️ **https://github.com/zhijie-group/Discrete-Diffusion-Forcing** ⬅️ |