TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters Paper • 2410.23168 • Published Oct 30, 2024 • 24 • 5
nGPT: Normalized Transformer with Representation Learning on the Hypersphere Paper • 2410.01131 • Published Oct 1, 2024 • 10 • 1
Stuffed Mamba: State Collapse and State Capacity of RNN-Based Long-Context Modeling Paper • 2410.07145 • Published Oct 9, 2024 • 2 • 3
Round and Round We Go! What makes Rotary Positional Encodings useful? Paper • 2410.06205 • Published Oct 8, 2024 • 2 • 1
TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices Paper • 2410.00531 • Published Oct 1, 2024 • 33 • 8
Aria: An Open Multimodal Native Mixture-of-Experts Model Paper • 2410.05993 • Published Oct 8, 2024 • 111 • 7
The Mamba in the Llama: Distilling and Accelerating Hybrid Models Paper • 2408.15237 • Published Aug 27, 2024 • 42 • 6
KTO: Model Alignment as Prospect Theoretic Optimization Paper • 2402.01306 • Published Feb 2, 2024 • 21 • 3
Planning In Natural Language Improves LLM Search For Code Generation Paper • 2409.03733 • Published Sep 5, 2024 • 1
OLMoE: Open Mixture-of-Experts Language Models Paper • 2409.02060 • Published Sep 3, 2024 • 80 • 4
FocusLLM: Scaling LLM's Context by Parallel Decoding Paper • 2408.11745 • Published Aug 21, 2024 • 25 • 3
Jamba-1.5: Hybrid Transformer-Mamba Models at Scale Paper • 2408.12570 • Published Aug 22, 2024 • 32 • 3
LLM Pruning and Distillation in Practice: The Minitron Approach Paper • 2408.11796 • Published Aug 21, 2024 • 58 • 4