Supercool! You can now easily train a JEPA world model (15M params) end-to-end on a single GPU, with planning done in under 1s 🤯.
- trained with a classic prediction loss + SIGReg.
- plans purely in raw pixels.
- beats SOTA DINO-WM and PLDM.
- single hyper-parameter, no heuristics.
- fully open-sourced!!
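The training recipe above (prediction loss + SIGReg) can be sketched in a few lines. This is a toy numpy illustration, not the actual model: the linear `encode`/`predict` maps, the dimensions, and the simplified mean/variance regularizer standing in for SIGReg are all my assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the real networks (assumption: plain linear maps,
# not the actual 15M-param model from the post).
D_OBS, D_EMB, D_ACT = 32, 8, 2
W_enc = rng.normal(size=(D_OBS, D_EMB)) / np.sqrt(D_OBS)          # shared encoder
W_pred = rng.normal(size=(D_EMB + D_ACT, D_EMB)) / np.sqrt(D_EMB)  # predictor

def encode(obs):
    return obs @ W_enc

def predict(z, action):
    return np.concatenate([z, action], axis=-1) @ W_pred

obs_t = rng.normal(size=(4, D_OBS))     # batch of current observations
obs_next = rng.normal(size=(4, D_OBS))  # next observations
action = rng.normal(size=(4, D_ACT))

z_t, z_next = encode(obs_t), encode(obs_next)
z_hat = predict(z_t, action)

# JEPA prediction loss: predicted next embedding vs. actual next embedding.
pred_loss = np.mean((z_hat - z_next) ** 2)

# Hypothetical stand-in for SIGReg: a statistical penalty that keeps the
# embedding distribution well-behaved (here: zero mean, unit variance).
reg = np.mean(z_t.mean(axis=0) ** 2) + np.mean((z_t.var(axis=0) - 1.0) ** 2)

loss = pred_loss + reg
```

Planning then happens by searching over `action` to minimize the predicted distance to a goal embedding — entirely in embedding space, with raw pixels only ever passing through the encoder.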
The Kimi team dropped a major improvement to the transformer architecture, and it quietly targets one of the most taken-for-granted components: residual connections.
Since their introduction nearly a decade ago, transformers have relied on residual connections that simply add every previous layer's output with equal weight. It works, but it's also kind of… dumb.
Kimi’s new paper, “Attention Residuals (AttnRes)”, replaces that with something much more intelligent:
→ instead of blindly summing past layers,
→ it learns which layers matter,
→ and dynamically weights contributions across depth.
So attention is no longer just over tokens… it’s now also over layers (depth). This effectively turns depth into a dynamic memory system. Phenomenal!
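The idea of "attention over depth" can be sketched as a learned, softmax-weighted sum over all previous layers' outputs in place of the usual single-term residual. This is my own minimal numpy illustration of that concept, not the paper's implementation: the per-layer `scores`, the `tanh` stand-in for the transformer block, and all dimensions are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
D, L = 16, 4  # hidden size, depth (toy values)

x = rng.normal(size=(D,))
layer_outputs = [x]  # the growing "memory" of all layer outputs

# Hypothetical learned mixing scores: at layer l, one score per earlier layer.
scores = [rng.normal(size=(l + 1,)) for l in range(L)]

for l in range(L):
    # Standard residual would be: h = layer_outputs[-1] + f(layer_outputs[-1]).
    # Here the residual stream is instead a learned, softmax-weighted sum
    # over ALL previous layers' outputs ("attention over depth").
    w = softmax(scores[l])
    mixed = sum(wi * hi for wi, hi in zip(w, layer_outputs))
    f_out = np.tanh(mixed)  # toy stand-in for the attention/MLP block
    layer_outputs.append(mixed + f_out)

final_hidden = layer_outputs[-1]
```

Note how the classic residual is recovered as the special case where the softmax puts all its weight on the most recent layer — the learned version just relaxes that hard-coded choice.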
Incredible work!! They claim this is the year of recursive language models (I hope so).

As models get bigger and better, managing their context windows to fit longer prompts has been a standing engineering problem. They propose an inference technique that lets the model externally crunch a long prompt down into snippets it can recursively call itself on, instead of feeding the entire prompt into the transformer directly.

This could make models cheaper and more efficient, but I doubt big tech will adopt it, since they profit more from the current approach (bigger model = longer context window = more expensive model). Once again, such work came from the academia/OSS community, cuz I doubt big tech would have shared these findings lol. They probably have much better inference methods that we may never know of haha.

Paper: https://arxiv.org/pdf/2512.24601
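The recursive idea can be sketched as a divide-and-conquer loop over the prompt. This is a schematic illustration under my own assumptions, not the paper's algorithm: `call_model` is a hypothetical stand-in for a real LLM call (here it just truncates text to simulate summarization), and the split/limit strategy is invented for the example.

```python
# Sketch: instead of feeding one huge prompt to the model, split it into
# snippets, crunch each down with a (recursive) model call, then answer
# over the combined snippets.

def call_model(prompt: str, max_len: int = 200) -> str:
    # Hypothetical stand-in for an LLM call: truncation simulates
    # "summarize this text into a shorter snippet".
    return prompt[:max_len]

def recursive_answer(prompt: str, limit: int = 400) -> str:
    # Base case: the prompt already fits the (toy) context window.
    if len(prompt) <= limit:
        return call_model(prompt)
    # Recursive case: crunch each half down to a snippet, then combine.
    mid = len(prompt) // 2
    left = recursive_answer(prompt[:mid], limit)
    right = recursive_answer(prompt[mid:], limit)
    return call_model(left + right, max_len=limit)

long_prompt = "x" * 5000      # far larger than the toy 400-char "window"
out = recursive_answer(long_prompt)
```

The key property: no single `call_model` invocation ever sees more than the context limit, no matter how long the original prompt is — that's where the cost savings would come from.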