MuP for RNNs #77

@norikazu99

Hello,
Your paper seems to cover linear layers, convolutions, and Transformers, but not RNNs. Was that just to reduce the number of experiments, or is there a more fundamental reason behind the choice? If it was only to reduce the number of experiments, how should h0 be handled? Would you recommend zeroing out h0, or does it need to be initialized using mup.init.normal?

Thank you.
