My notification

nithin12342's Collections

updated 8 days ago

OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation

Paper • 2601.15369 • Published Jan 21 • 21
Stable-DiffCoder: Pushing the Frontier of Code Diffusion Large Language Model

Paper • 2601.15892 • Published Jan 22 • 53
Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

Paper • 2601.16208 • Published Jan 22 • 55
NAACL: Noise-AwAre Verbal Confidence Calibration for LLMs in RAG Systems

Paper • 2601.11004 • Published Jan 16 • 30
Behavior Knowledge Merge in Reinforced Agentic Models

Paper • 2601.13572 • Published Jan 20 • 27
microsoft/VibeVoice-ASR

Automatic Speech Recognition • 9B • Updated Jan 27 • 731k • 1.05k
zai-org/GLM-4.7-Flash

Text Generation • 31B • Updated Jan 29 • 683k • 1.71k
LongCat-Flash-Thinking-2601 Technical Report

Paper • 2601.16725 • Published Jan 23 • 180
iFSQ: Improving FSQ for Image Generation with 1 Line of Code

Paper • 2601.17124 • Published Jan 23 • 33
Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs

Paper • 2601.17058 • Published Jan 22 • 190
Less is More: Optimizing Function Calling for LLM Execution on Edge Devices

Paper • 2411.15399 • Published Nov 23, 2024 • 1
nvidia/personaplex-7b-v1

Audio-to-Audio • Updated Mar 2 • 541k • 2.46k
Qwen/Qwen3-ASR-0.6B

Automatic Speech Recognition • 0.9B • Updated Jan 30 • 451k • 279
Qwen3-ASR Technical Report

Paper • 2601.21337 • Published Jan 29 • 36
Post-LayerNorm Is Back: Stable, ExpressivE, and Deep

Paper • 2601.19895 • Published Jan 27 • 27
DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation

Paper • 2601.22153 • Published Jan 29 • 74
Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models

Paper • 2601.20354 • Published Jan 28 • 112
Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation

Paper • 2601.21406 • Published Jan 29 • 6
Revisiting Parameter Server in LLM Post-Training

Paper • 2601.19362 • Published Jan 27 • 8
ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation

Paper • 2601.21420 • Published Jan 29 • 42
SERA: Soft-Verified Efficient Repository Agents

Paper • 2601.20789 • Published Jan 28 • 13
moonshotai/Kimi-K2.5

Image-Text-to-Text • 1.1T • Updated Feb 27 • 5.23M • 2.76k
DINO-SAE: DINO Spherical Autoencoder for High-Fidelity Image Reconstruction and Generation

Paper • 2601.22904 • Published Jan 30 • 15
Phr00t/LTX2-Rapid-Merges

Image-Text-to-Video • Updated Feb 12 • 352
ReGuLaR: Variational Latent Reasoning Guided by Rendered Chain-of-Thought

Paper • 2601.23184 • Published Jan 30 • 36
FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space

Paper • 2602.02092 • Published Feb 2 • 18
PixelGen: Pixel Diffusion Beats Latent Diffusion with Perceptual Loss

Paper • 2602.02493 • Published Feb 2 • 46
TTCS: Test-Time Curriculum Synthesis for Self-Evolving

Paper • 2601.22628 • Published Jan 30 • 35
RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System

Paper • 2602.02488 • Published Feb 2 • 36
Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

Paper • 2602.02185 • Published Feb 2 • 118
Latent Chain-of-Thought as Planning: Decoupling Reasoning from Verbalization

Paper • 2601.21358 • Published Jan 29 • 7
Balancing Understanding and Generation in Discrete Diffusion Models

Paper • 2602.01362 • Published Feb 1 • 17
3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation

Paper • 2602.03796 • Published Feb 3 • 64
CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

Paper • 2602.01785 • Published Feb 2 • 96
LIVE: Long-horizon Interactive Video World Modeling

Paper • 2602.03747 • Published Feb 3 • 13
Qwen/Qwen3-Coder-Next

Text Generation • 80B • Updated Feb 3 • 632k • 1.32k
Semantic Routing: Exploring Multi-Layer LLM Feature Weighting for Diffusion Transformers

Paper • 2602.03510 • Published Feb 3 • 27
RISE-Video: Can Video Generators Decode Implicit World Rules?

Paper • 2602.05986 • Published Feb 5 • 27
FASA: Frequency-aware Sparse Attention

Paper • 2602.03152 • Published Feb 3 • 154
DFlash: Block Diffusion for Flash Speculative Decoding

Paper • 2602.06036 • Published Feb 5 • 55
GEBench: Benchmarking Image Generation Models as GUI Environments

Paper • 2602.09007 • Published Feb 9 • 39
When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

Paper • 2602.08236 • Published Feb 9 • 9
AgentCPM-Report: Interleaving Drafting and Deepening for Open-Ended Deep Research

Paper • 2602.06540 • Published Feb 6 • 21
Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models

Paper • 2602.04649 • Published Feb 4 • 12
OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

Paper • 2602.05400 • Published Feb 5 • 352
AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders

Paper • 2602.05027 • Published Feb 4 • 63
Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math

Paper • 2602.06291 • Published Feb 6 • 24
Towards Autonomous Mathematics Research

Paper • 2602.10177 • Published Feb 10 • 36
Free(): Learning to Forget in Malloc-Only Reasoning Models

Paper • 2602.08030 • Published Feb 8 • 6
Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion

Paper • 2602.07775 • Published Feb 8 • 8
TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions

Paper • 2602.08711 • Published Feb 9 • 28
Qute: Towards Quantum-Native Database

Paper • 2602.14699 • Published Feb 16 • 14
Qwen/Qwen3.5-397B-A17B

Image-Text-to-Text • 403B • Updated Mar 15 • 632k • 1.47k
DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning

Paper • 2602.16742 • Published Feb 18 • 12
tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction

Paper • 2602.20160 • Published Feb 23 • 10
From Perception to Action: An Interactive Benchmark for Vision Reasoning

Paper • 2602.21015 • Published Feb 24 • 23
SkyReels-V4: Multi-modal Video-Audio Generation, Inpainting and Editing model

Paper • 2602.21818 • Published Feb 25 • 55
CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

Paper • 2602.24286 • Published Feb 27 • 98
From Scale to Speed: Adaptive Test-Time Scaling for Image Editing

Paper • 2603.00141 • Published Feb 24 • 138
RubricBench: Aligning Model-Generated Rubrics with Human Standards

Paper • 2603.01562 • Published Mar 2 • 63
SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale

Paper • 2602.23866 • Published Feb 27 • 88
Qwen3-Coder-Next Technical Report

Paper • 2603.00729 • Published Feb 28 • 64
CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video

Paper • 2603.04291 • Published Mar 4 • 14
Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling

Paper • 2603.04791 • Published Mar 5 • 20
InfinityStory: Unlimited Video Generation with World Consistency and Character-Aware Shot Transitions

Paper • 2603.03646 • Published Mar 4 • 8
Utonia: Toward One Encoder for All Point Clouds

Paper • 2603.03283 • Published Mar 3 • 185
DreamWorld: Unified World Modeling in Video Generation

Paper • 2603.00466 • Published Feb 28 • 16
On-Policy Self-Distillation for Reasoning Compression

Paper • 2603.05433 • Published Mar 5 • 8
fal/virtual-tryoff-lora

Image-to-Image • Updated Mar 6 • 1.1k • 32
Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

Paper • 2603.06569 • Published Mar 6 • 119
PixARMesh: Autoregressive Mesh-Native Single-View Scene Reconstruction

Paper • 2603.05888 • Published Mar 6 • 2
Scale Space Diffusion

Paper • 2603.08709 • Published Mar 9 • 16
InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing

Paper • 2603.09877 • Published Mar 10 • 48
MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants

Paper • 2603.09652 • Published Mar 10 • 15
Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training

Paper • 2603.12255 • Published Mar 12 • 91
One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers

Paper • 2603.12245 • Published Mar 12 • 18
How Far Can Unsupervised RLVR Scale LLM Training?

Paper • 2603.08660 • Published Mar 9 • 59
WeEdit: A Dataset, Benchmark and Glyph-Guided Framework for Text-centric Image Editing

Paper • 2603.11593 • Published Mar 12 • 25
Video-Based Reward Modeling for Computer-Use Agents

Paper • 2603.10178 • Published Mar 10 • 43
CodePercept: Code-Grounded Visual STEM Perception for MLLMs

Paper • 2603.10757 • Published Mar 11 • 14
Mixture-of-Depths Attention

Paper • 2603.15619 • Published Mar 16 • 80
Multimodal OCR: Parse Anything from Documents

Paper • 2603.13032 • Published Mar 13 • 43
Towards a Neural Debugger for Python

Paper • 2603.09951 • Published Mar 10 • 6
Demystifing Video Reasoning

Paper • 2603.16870 • Published Mar 17 • 369
Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Paper • 2603.12180 • Published Mar 12 • 65
Qianfan-OCR: A Unified End-to-End Model for Document Intelligence

Paper • 2603.13398 • Published Mar 11 • 153
WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation

Paper • 2603.15132 • Published Mar 16 • 35
Attention Residuals

Paper • 2603.15031 • Published Mar 16 • 182
Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens

Paper • 2603.19232 • Published Mar 19 • 33
FASTER: Rethinking Real-Time Flow VLAs

Paper • 2603.19199 • Published Mar 19 • 58
SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

Paper • 2603.16859 • Published Mar 17 • 248
AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

Paper • 2603.14465 • Published Mar 15 • 23
LoST: Level of Semantics Tokenization for 3D Shapes

Paper • 2603.17995 • Published Mar 18 • 31
SparkVSR: Interactive Video Super-Resolution via Sparse Keyframe Propagation

Paper • 2603.16864 • Published Mar 17 • 17
Nemotron 3 Nano WebGPU

Running • Featured • 77 • A compact reasoning-capable model running in your browser.
3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model

Paper • 2603.18524 • Published Mar 19 • 58
Recursive Language Models Meet Uncertainty: The Surprising Effectiveness of Self-Reflective Program Search for Long Context

Paper • 2603.15653 • Published Mar 7 • 12
Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

Paper • 2603.18004 • Published Mar 18 • 13
SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Paper • 2603.19228 • Published Mar 19 • 68
Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

Paper • 2603.18002 • Published Mar 18 • 13
Matryoshka Gaussian Splatting

Paper • 2603.19234 • Published Mar 19 • 11
Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training

Paper • 2603.16139 • Published Mar 17 • 32
Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer

Paper • 2603.19227 • Published Mar 19 • 42
Beyond Single Tokens: Distilling Discrete Diffusion Models via Discrete MMD

Paper • 2603.20155 • Published Mar 20 • 10
Hyperagents

Paper • 2603.19461 • Published Mar 19 • 50
FlowScene: Style-Consistent Indoor Scene Generation with Multimodal Graph Rectified Flow

Paper • 2603.19598 • Published Mar 20 • 32
VideoDetective: Clue Hunting via both Extrinsic Query and Intrinsic Relevance for Long Video Understanding

Paper • 2603.22285 • Published 29 days ago • 49
mSFT: Addressing Dataset Mixtures Overfiting Heterogeneously in Multi-task SFT

Paper • 2603.21606 • Published 30 days ago • 39
MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding

Paper • 2603.22458 • Published 29 days ago • 135
The Universal Normal Embedding

Paper • 2603.21786 • Published 30 days ago • 15
DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models

Paper • 2603.23499 • Published 28 days ago • 51
Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

Paper • 2603.24472 • Published 27 days ago • 54
SIMART: Decomposing Monolithic Meshes into Sim-ready Articulated Assets via MLLM

Paper • 2603.23386 • Published 28 days ago • 40
FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol

Paper • 2603.24943 • Published 27 days ago • 12
RealMaster: Lifting Rendered Scenes into Photorealistic Video

Paper • 2603.23462 • Published 28 days ago • 33
Natural-Language Agent Harnesses

Paper • 2603.25723 • Published 26 days ago • 25
On Token's Dilemma: Dynamic MoE with Drift-Aware Token Assignment for Continual Learning of Large Vision Language Models

Paper • 2603.27481 • Published 24 days ago • 35
DreamLite: A Lightweight On-Device Unified Model for Image Generation and Editing

Paper • 2603.28713 • Published 22 days ago • 20
Gen-Searcher: Reinforcing Agentic Search for Image Generation

Paper • 2603.28767 • Published 22 days ago • 57
MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation

Paper • 2603.29029 • Published 22 days ago • 13
UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving

Paper • 2604.02190 • Published 20 days ago • 29
Self-Execution Simulation Improves Coding Models

Paper • 2604.03253 • Published Mar 11 • 34
Jackrong/Gemopus-4-26B-A4B-it-GGUF

Text Generation • 25B • Updated 11 days ago • 40.1k • 91
Large Language Models Align with the Human Brain during Creative Thinking

Paper • 2604.03480 • Published 19 days ago • 6
