Hey guys, first of all thanks for the awesome work!
I've implemented muP in the llm.c project (see here). The coord checks look flat / correct (I went up to 15 steps and they're still flat!), but I'm not getting any performance improvement from using muP.
Could this be due to the smaller scale? We're testing it on 1.5B LLMs. Should we expect different behavior at ~7B?
I wrote up a mini document on what I've done to support muP in llm.c here, under mup.md.
Am I missing something here?
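
For reference, when I say the coord checks are "flat" I mean roughly the following: the mean absolute activation per layer stays about the same across widths at every training step. Below is a minimal sketch of that criterion. The function name `coord_check_is_flat`, the tolerance, and all the numbers are made up for illustration; none of this is taken from llm.c or the mup package.

```python
def coord_check_is_flat(coord_sizes, tol=2.0):
    """Return True if activation sizes stay comparable across widths.

    coord_sizes maps a model width to a list holding the mean |activation|
    of one layer at each training step. The check passes when, at every
    step, the largest and smallest values across widths differ by at most
    a factor of `tol` (an arbitrary threshold chosen for this sketch).
    """
    widths = sorted(coord_sizes)
    # Only compare the steps that were recorded for every width.
    n_steps = min(len(coord_sizes[w]) for w in widths)
    for step in range(n_steps):
        values = [coord_sizes[w][step] for w in widths]
        if min(values) <= 0 or max(values) / min(values) > tol:
            return False
    return True


# Illustrative (made-up) numbers: width -> mean |activation| per step.
flat = {
    256: [1.00, 1.02, 1.05],
    512: [0.98, 1.01, 1.04],
    1024: [1.01, 1.03, 1.06],
}
blowing_up = {
    256: [1.0, 1.5, 2.0],
    512: [1.0, 3.0, 6.0],
    1024: [1.0, 6.0, 18.0],
}

print(coord_check_is_flat(flat))
print(coord_check_is_flat(blowing_up))
```

In my runs the curves look like the first case, not the second, which is why I believe the parametrization itself is wired up correctly.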