stereoplegic's Collections: Quantization
LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models (arXiv: 2310.08659)
QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models (arXiv: 2309.14717)
Norm Tweaking: High-performance Low-bit Quantization of Large Language Models (arXiv: 2309.02784)
ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers (arXiv: 2309.16119)
OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models (arXiv: 2308.13137)
FPTQ: Fine-grained Post-Training Quantization for Large Language Models (arXiv: 2308.15987)
QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models (arXiv: 2310.16795)
LLM-FP4: 4-Bit Floating-Point Quantized Transformers (arXiv: 2310.16836)
Microscaling Data Formats for Deep Learning (arXiv: 2310.10537)
DeepliteRT: Computer Vision at the Edge (arXiv: 2309.10878)
Efficient Post-training Quantization with FP8 Formats (arXiv: 2309.14592)
NUPES: Non-Uniform Post-Training Quantization via Power Exponent Search (arXiv: 2308.05600)
BitNet: Scaling 1-bit Transformers for Large Language Models (arXiv: 2310.11453)
Understanding the Impact of Post-Training Quantization on Large Language Models (arXiv: 2309.05210)
FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs (arXiv: 2308.09723)
Softmax Bias Correction for Quantized Generative Models (arXiv: 2309.01729)
Training and inference of large language models using 8-bit floating point (arXiv: 2309.17224)
TEQ: Trainable Equivalent Transformation for Quantization of LLMs (arXiv: 2310.10944)
QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models (arXiv: 2310.08041)
Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs (arXiv: 2309.05516)
PB-LLM: Partially Binarized Large Language Models (arXiv: 2310.00034)
Towards End-to-end 4-Bit Inference on Generative Large Language Models (arXiv: 2310.09259)
Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt (arXiv: 2305.11186)
MEMORY-VQ: Compression for Tractable Internet-Scale Memory (arXiv: 2308.14903)
FP8-LM: Training FP8 Large Language Models (arXiv: 2310.18313)
Atom: Low-bit Quantization for Efficient and Accurate LLM Serving (arXiv: 2310.19102)
QLoRA: Efficient Finetuning of Quantized LLMs (arXiv: 2305.14314)
A Survey on Model Compression for Large Language Models (arXiv: 2308.07633)
REx: Data-Free Residual Quantization Error Expansion (arXiv: 2203.14645)
Data-Free Quantization with Accurate Activation Clipping and Adaptive Batch Normalization (arXiv: 2204.04215)
LLM-QAT: Data-Free Quantization Aware Training for Large Language Models (arXiv: 2305.17888)
Token-Scaled Logit Distillation for Ternary Weight Generative Language Models (arXiv: 2308.06744)
Understanding and Improving Knowledge Distillation for Quantization-Aware Training of Large Transformer Encoders (arXiv: 2211.11014)
Quantized Feature Distillation for Network Quantization (arXiv: 2307.10638)
Model compression via distillation and quantization (arXiv: 1802.05668)
Adaptive Precision Training (AdaPT): A dynamic fixed point quantized training approach for DNNs (arXiv: 2107.13490)
Feature Affinity Assisted Knowledge Distillation and Quantization of Deep Neural Networks on Label-Free Data (arXiv: 2302.10899)
Compressing LLMs: The Truth is Rarely Pure and Never Simple (arXiv: 2310.01382)
Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing (arXiv: 2306.12929)
Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling (arXiv: 2304.09145)
LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression (arXiv: 2309.14021)
Prune Once for All: Sparse Pre-Trained Language Models (arXiv: 2111.05754)
eDKM: An Efficient and Accurate Train-time Weight Clustering for Large Language Models (arXiv: 2309.00964)
Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (arXiv: 2310.02410)
SlimFit: Memory-Efficient Fine-Tuning of Transformer-based Models Using Training Dynamics (arXiv: 2305.18513)
NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers (arXiv: 2211.16056)
LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning (arXiv: 2311.12023)
Blockwise Compression of Transformer-based Models without Retraining (arXiv: 2304.01483)
Towards Fine-tuning Pre-trained Language Models with Integer Forward and Backward Propagation (arXiv: 2209.09815)
Learning Low-Rank Representations for Model Compression (arXiv: 2211.11397)
Ada-QPacknet -- adaptive pruning with bit width reduction as an efficient continual learning method without forgetting (arXiv: 2308.07939)
Efficient Storage of Fine-Tuned Models via Low-Rank Approximation of Weight Residuals (arXiv: 2305.18425)
BitDelta: Your Fine-Tune May Only Be Worth One Bit (arXiv: 2402.10193)