Inference
• Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference (arXiv:2403.09636)
• TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding (arXiv:2404.11912)
• Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache (arXiv:2401.02669)
• LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding (arXiv:2404.16710)
• Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting (arXiv:2404.18911)
• Octopus v4: Graph of language models (arXiv:2404.19296)
• Better & Faster Large Language Models via Multi-token Prediction (arXiv:2404.19737)
• Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi-layered Thoughts (arXiv:2405.19893)
• Reducing Transformer Key-Value Cache Size with Cross-Layer Attention (arXiv:2405.12981)
• LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs (arXiv:2406.15319)
• Writing in the Margins: Better Inference Pattern for Long Context Retrieval (arXiv:2408.14906)