Description
Hi, thank you for maintaining this great repo! We are currently exploring how muP interacts with our unit scaling method, and whether there is a scheme that satisfies both at once.
I have tried to recreate the RHS of your Figure 1 using examples/Transformer/main.py to serve as our baseline. Whilst my results look sensible (a nice stable optimal learning rate across varying widths, and the expected tick shape), I have been unable to choose hyperparameters that exactly recreate your plot. In particular, my training losses are higher (e.g. width 128 reaches a minimum training loss of 5.2, whereas yours reaches ~4.75) and my optimal learning rate is slightly different.
I am using the default arguments from main.py except where they are contradicted by the paper's description of Fig. 1.
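For reference, this is roughly the shape of the sweep producing my plot. It is only a minimal sketch: `run_training` is a placeholder standing in for an actual call to main.py, and the width and learning-rate grids shown are my own choices rather than values taken from the paper.

```python
import math
import matplotlib.pyplot as plt

def run_training(width: int, lr: float) -> float:
    """Placeholder: train examples/Transformer/main.py at one (width, lr)
    setting and return the minimum training loss observed over the run."""
    raise NotImplementedError("replace with a call to main.py; see the launch sketch after the table")

widths = [128, 256, 512, 1024]                       # my sweep, not taken from the paper
learning_rates = [2.0 ** e for e in range(-14, -5)]  # log2-spaced grid, my own choice

for width in widths:
    losses = [run_training(width, lr) for lr in learning_rates]
    plt.plot([math.log2(lr) for lr in learning_rates], losses, label=f"width {width}")

plt.xlabel("log2(learning rate)")
plt.ylabel("minimum training loss")
plt.legend()
plt.show()
```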
Can you point me to a description of the training parameters you used for Fig. 1, or highlight which of the below might be incorrect? (A sketch of how I'm launching each run follows the table.)
| Parameter | Value | Reason |
|---|---|---|
| ffn_ratio | 4 | Section 3, p. 5 |
| epochs | 5 | Section 3, p. 5 |
| optimizer | 'muadam' | as per Fig. 1 caption |
| norm | postnorm | as per Fig. 18 caption |
| base width | 128 | used by the other transformer experiments in the paper |
| output_mult | 1 | default |
| nlayers | 2 | default |
| nhead | 2 | default |
| batch_size | 20 | default |
| bptt | 35 | default |
| dropout | 0.2 | default |
| etc. | ... | default |
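For concreteness, here is roughly how each individual run is launched (a sketch only: the flag names simply mirror the parameter names in the table above, so they may not match main.py's actual argparse names, and I have omitted the flag that sets the model width because I'm not certain of its exact name):

```python
import subprocess

# One run at a single learning-rate point, with the hyperparameters from the
# table above. Flag names are assumed to mirror the table's parameter names
# and may not match main.py's real argparse names.
lr = 2.0 ** -10  # one point on the learning-rate grid

subprocess.run(
    [
        "python", "examples/Transformer/main.py",
        "--optimizer", "muadam",
        "--epochs", "5",
        "--ffn_ratio", "4",
        "--nlayers", "2",
        "--nhead", "2",
        "--batch_size", "20",
        "--bptt", "35",
        "--dropout", "0.2",
        "--output_mult", "1",
        "--lr", str(lr),
    ],
    check=True,
)
```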
Thanks very much. My plot is already quite close to yours, but we would like to be confident that our results are directly comparable, and would therefore like to be able to recreate your figure exactly for the baseline.