Alternative method of submitting jobs to DF Runner #756

remylouisew · 2024-10-17T22:06:52Z

Running a Dataflow job via the ‘axlearn gcp vm start’ command is neither necessary nor intuitive for non-Apple users. Additionally, if you are already running your commands from a VM (e.g., from a remote desktop), this process does not work. I recognize that there are still scenarios where you would want to launch your Dataflow jobs from a VM, so rather than replacing that ability, I am adding an alternative.

In order to submit jobs to the Dataflow runner without using ‘axlearn gcp vm start’, changes to the quoting behavior of dataflow.py are necessary. Unfortunately, there’s not an obviously elegant way to provide two versions of dataflow.py, so if you can think of a better option, please let me know.

Here is what I’ve done: dataflow.py remains unchanged, and I’m adding dataflow.alt.py. I’ve also added instructions for replacing the original module with dataflow.alt.py when the user wants to submit jobs to the Dataflow runner without creating a VM.

Additional note: PR #711 changes the quoting behavior so that Dataflow jobs can be submitted without ‘axlearn gcp vm start’. However, that fix does not work for commands that include parameters requiring quotes, e.g., --dataflow_service_options='worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver'
The dataflow.alt.py module DOES work with these kinds of parameters.
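To illustrate why a value like the one above needs quoting, here is a small sketch using only Python's standard `shlex` module. It does not reproduce what dataflow.py or dataflow.alt.py actually do: `my_pipeline` is a placeholder module name, and `shlex` with `punctuation_chars=True` is only an approximation of how a POSIX shell splits a command line.

```python
import shlex

# The flag from the PR description. Its value contains ';', which a POSIX
# shell treats as a command separator unless the value is quoted.
FLAG = (
    "--dataflow_service_options="
    "worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver"
)
# Hypothetical user command; "my_pipeline" is a placeholder module name.
ARGV = ["python3", "-m", "my_pipeline", FLAG]


def shell_tokens(command: str) -> list[str]:
    """Approximate how a POSIX shell splits a command line into words.

    punctuation_chars=True makes ';' (and other shell control characters)
    separate tokens, as they would be in sh/bash.
    """
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    return list(lexer)


# Naive: join argv with spaces and hand the string to a shell.
naive_cmd = " ".join(ARGV)
# Safer: quote each argument so the shell sees it as a single word.
quoted_cmd = shlex.join(ARGV)

print(shell_tokens(naive_cmd))
print(shell_tokens(quoted_cmd))
```

With the naive string, the flag is cut at each `;`, so the shell would run `count:1` and `install-nvidia-driver` as separate commands. The `shlex.join` version splits back into exactly the original argv.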

markblee

Thanks @remylouisew -- IIUC, the main issue is with flag escaping. Aside from finding a more generic fix for flag parsing, maybe we can add support for loading from flagfiles, which would avoid the duplicate code and manual renaming step. (It seems that flag escaping is painful for users anyway.) WDYT?

remylouisew · 2024-10-18T19:41:14Z

@markblee The flag escaping has indeed been painful! I am not familiar with the process of loading from flagfiles, could you elaborate on how it would be implemented?

markblee · 2024-10-24T22:13:05Z

> @markblee The flag escaping has indeed been painful! I am not familiar with the process of loading from flagfiles, could you elaborate on how it would be implemented?

absl has builtin support for flagfiles: https://abseil.io/docs/python/guides/flags#a-note-about---flagfile
So we could either accept a flagfile directly at the top-level ‘axlearn gcp dataflow’ command, read it, and forward its flags to the user command; or let the user script consume flagfiles directly. Let me know if further clarification would help.
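A minimal sketch of the first option (the top-level command reads a flagfile and forwards its flags to the user command), written with the Python standard library only. `read_flagfile` and `build_user_command` are hypothetical helpers, not axlearn APIs; the file format is modeled loosely on absl's flagfile format from the link above, and a real implementation could rely on absl's built-in `--flagfile` support instead. `my_pipeline` is a placeholder.

```python
import shlex
import tempfile
from pathlib import Path


def read_flagfile(path: str) -> list[str]:
    """Return the flags in `path`, one per line (hypothetical helper).

    Loosely modeled on absl flagfiles: blank lines and lines starting with
    '#' or '//' are skipped, and every other line is taken verbatim as one
    argument, so no shell quoting is needed inside the file. Unlike absl,
    nested --flagfile directives are NOT expanded here.
    """
    flags = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("//"):
            continue
        flags.append(line)
    return flags


def build_user_command(base_argv: list[str], flagfile: str) -> str:
    """Append the flagfile's flags to the user command, quoting each argument once."""
    return shlex.join(base_argv + read_flagfile(flagfile))


# Example: a flagfile holding the hard-to-quote option from the PR description.
with tempfile.TemporaryDirectory() as tmp:
    flagfile = Path(tmp) / "dataflow.flags"
    flagfile.write_text(
        "# GPU worker options; no shell quoting needed in this file.\n"
        "--dataflow_service_options=worker_accelerator=type:nvidia-tesla-t4;"
        "count:1;install-nvidia-driver\n"
    )
    command = build_user_command(["python3", "-m", "my_pipeline"], str(flagfile))

print(command)
```

Because each flag sits on its own line, the user never writes quotes by hand; quoting happens exactly once, in `shlex.join`, and only matters if the forwarded command has to pass through a shell.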

ruomingp

Will defer to @markblee for approval.

Ethanlm · 2025-07-23T22:21:31Z

Closing this PR due to inactivity. Please re-open or file a new PR if this is still important.

adding alternative method of submitting dataflow jobs without needing to use 'axlearn gcp vm start'

f0533f3

remylouisew requested review from markblee and ruomingp as code owners October 17, 2024 22:06

dataflow.atl.py

57cae3e

markblee reviewed Oct 18, 2024

ruomingp reviewed Jan 6, 2025

changlan requested a review from a team as a code owner July 23, 2025 21:50

Ethanlm closed this Jul 23, 2025

4 participants
