
About Learning rate decay #64

@afcruzs

Description

Hello, I have a small question about the muP proxy model sweeps. For the proxy models mentioned in Appendix F.4 (GPT-3), did you decay the learning rate fully over the proxy budget of 4B or 16B tokens? Or did you set the decay schedule to the "real" number of tokens to be used for the target model (effectively decaying very little during the proxy sweeps)?

It'd be interesting to know what you did in the experiments in Appendix 4.3 (GPT-3) and, more generally, whether this has any effect on transferability (perhaps you have some empirical or theoretical insights). Recommendations would be very welcome :)
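
To make the question concrete, here is a minimal sketch of the two options I mean, assuming a cosine decay schedule; the token counts, batch size, and base learning rate below are made up purely for illustration and are not taken from the paper:

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr to min_lr over total_steps."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Hypothetical numbers, only to illustrate the two schedule choices.
tokens_per_step = 0.5e6    # batch size in tokens (assumed)
proxy_tokens = 4e9         # proxy sweep trains on ~4B tokens
target_tokens = 300e9      # target model would train on ~300B tokens
base_lr = 2e-4

proxy_steps = int(proxy_tokens / tokens_per_step)
target_steps = int(target_tokens / tokens_per_step)

step = proxy_steps  # at the end of the proxy run
# Option A: decay fully over the proxy horizon -> LR reaches min_lr by the end.
lr_full_decay = cosine_lr(step, proxy_steps, base_lr)
# Option B: decay over the target horizon -> LR has barely moved by the end.
lr_target_decay = cosine_lr(step, target_steps, base_lr)

print(f"Option A (decay to proxy tokens):  final LR = {lr_full_decay:.2e}")
print(f"Option B (decay to target tokens): final LR = {lr_target_decay:.2e}")
```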
