leonardlin's Collections: speed (updated)
LLM in a flash: Efficient Large Language Model Inference with Limited Memory
Paper • 2312.11514 • Published • 262

PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
Paper • 2312.12456 • Published • 45

Accelerating LLM Inference with Staged Speculative Decoding
Paper • 2308.04623 • Published • 26

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
Paper • 2208.07339 • Published • 5

Efficient Memory Management for Large Language Model Serving with PagedAttention
Paper • 2309.06180 • Published • 47

Efficient LLM inference solution on Intel GPU
Paper • 2401.05391 • Published • 11

DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference
Paper • 2401.08671 • Published • 15

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Paper • 2401.10774 • Published • 60

Zero Bubble Pipeline Parallelism
Paper • 2401.10241 • Published • 25

Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy
Paper • 2312.12728 • Published

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
Paper • 2401.15077 • Published • 20

SliceGPT: Compress Large Language Models by Deleting Rows and Columns
Paper • 2401.15024 • Published • 73

Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models
Paper • 2311.03687 • Published

Hydragen: High-Throughput LLM Inference with Shared Prefixes
Paper • 2402.05099 • Published • 20

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Paper • 2402.02750 • Published • 4

Paper • 2402.04925 • Published • 4

FAST: Factorizable Attention for Speeding up Transformers
Paper • 2402.07901 • Published • 3

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
Paper • 2401.18079 • Published • 8

CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models
Paper • 2404.08763 • Published • 2

Layer-Condensed KV Cache for Efficient Inference of Large Language Models
Paper • 2405.10637 • Published • 22

Distributed Speculative Inference of Large Language Models
Paper • 2405.14105 • Published • 18

EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
Paper • 2406.16858 • Published • 1

PowerInfer-2: Fast Large Language Model Inference on a Smartphone
Paper • 2406.06282 • Published • 39