Thank you for sharing this work. I think there is a mistake in how rms_factor is computed: in the original paper, the 0.2 factor is outside the square root, i.e. 0.2 * sqrt(max(A, B)) for an A x B weight matrix (see Eq. 4 of their paper: https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf).
The erroneous line is linked here:
https://github.com/KingNish24/blog-implementations/blob/302ce8625dc5b178bf302f892a0d72b4f95830e3/muon%20vs%20muonclip%20vs%20muon%2Badam/optimizers.py#L190
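
To make the difference concrete, here is a minimal sketch. It assumes the linked line applies 0.2 inside the square root; the function names below are hypothetical and are not taken from the repo or the paper.

```python
import math


def rms_factor_paper(rows: int, cols: int, scale: float = 0.2) -> float:
    """Scale factor as in Eq. 4 of the Moonlight paper: 0.2 * sqrt(max(A, B))."""
    return scale * math.sqrt(max(rows, cols))


def rms_factor_inside_sqrt(rows: int, cols: int, scale: float = 0.2) -> float:
    """Assumed erroneous variant: 0.2 placed inside the square root."""
    return math.sqrt(scale * max(rows, cols))


if __name__ == "__main__":
    rows, cols = 1024, 4096  # example weight shape, chosen only for illustration
    correct = rms_factor_paper(rows, cols)
    wrong = rms_factor_inside_sqrt(rows, cols)
    print(f"paper form:       {correct:.4f}")
    print(f"inside-sqrt form: {wrong:.4f}")
    print(f"ratio:            {wrong / correct:.4f}")
```

If that assumption about the linked line holds, the update is too large by a constant factor of sqrt(5) (about 2.24) regardless of the matrix shape, so moving 0.2 outside the square root should be a one-line fix.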