@kmontemayor2-sc
Collaborator

@kmontemayor2-sc kmontemayor2-sc commented Jan 9, 2026

Scope of work done

We do this so we can infer the job type downstream, e.g. for is_inference 1.

I thought about making this a CLI flag that we inject, similar to use_cuda 2. But I strongly feel that we should not be injecting CLI flags, and that the CLI flags should only be controlled by users.

If I'm a user, I expect there to be all sorts of stuff in the environment variables, but I'd expect the CLI flags to be mine (and I don't think we should be using the ones we have anyways...)

We could still inject the CLI flag, but I think it's going to cause more pain down the road, and it's going to be unsafe to add to the colocated (e.g. non-graphstore) jobs.

Another note: I think we should use the "job type" rather than is_inference, as we may have other "job" types in the future (e.g. some GLT preprocessor step to compute PEs or similar).
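For illustration, here is a minimal sketch of how a downstream helper might derive the job type from an environment variable and keep is_inference as a thin wrapper on top of it. The variable name `EXAMPLE_JOB_TYPE`, the `JobType` enum, and both helper functions are hypothetical; this thread does not state the names the PR actually uses.

```python
import enum
import os
from typing import Mapping, Optional

# Hypothetical variable name, chosen only for this sketch.
JOB_TYPE_ENV_VAR = "EXAMPLE_JOB_TYPE"


class JobType(enum.Enum):
    """Job types the launcher may signal; more members can be added later."""

    TRAINING = "training"
    INFERENCE = "inference"


def get_job_type(env: Optional[Mapping[str, str]] = None) -> Optional[JobType]:
    """Return the job type found in `env`, or None when the variable is unset."""
    env = os.environ if env is None else env
    raw = env.get(JOB_TYPE_ENV_VAR)
    if raw is None:
        return None
    # Raises ValueError for values that are not a known job type.
    return JobType(raw.strip().lower())


def is_inference(env: Optional[Mapping[str, str]] = None) -> bool:
    """Convenience wrapper for callers that only care about inference."""
    return get_job_type(env) is JobType.INFERENCE
```

Using an enum rather than a boolean means a future job type (such as a preprocessor step) is one new member, not a second flag.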

Where is the documentation for this feature?: N/A

Did you add automated tests or write a test plan?

Updated Changelog.md? NO

Ready for code review?: NO

@kmontemayor2-sc
Collaborator Author

/unit_test_py

@kmontemayor2-sc
Collaborator Author

/integration_test

@kmontemayor2-sc
Collaborator Author

/e2e_test

@github-actions
Contributor

github-actions bot commented Jan 9, 2026

GiGL Automation

@ 23:21:40UTC : 🔄 Python Unit Test started.

@ 24:36:24UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Jan 9, 2026

GiGL Automation

@ 23:21:45UTC : 🔄 Integration Test started.

@ 24:30:51UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Jan 9, 2026

GiGL Automation

@ 23:21:49UTC : 🔄 E2E Test started.

@ 24:36:14UTC : ✅ Workflow completed successfully.

Collaborator

@svij-sc svij-sc left a comment

I am not a fan of introducing env vars as it may limit flexibility to move to other backends for train/infer. Specifically we want to support K8s training/inference sometime later this year.

Btw, how does this affect if a user is doing local training/inference?
Will we have to set this var?
If it doesn't affect local training I am fine with this change.

@kmontemayor2-sc
Collaborator Author

> I am not a fan of introducing env vars as it may limit flexibility to move to other backends for train/infer. Specifically we want to support K8s training/inference sometime later this year.

I think that it's pretty easy to migrate the env vars to k8s/etc. If we're not able to set env vars at all, then I think using that as a backend would be a non-starter.

Regardless, in the migration to k8s we're going to need to migrate the injected CLI flags (among other things like labels 1).
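As a rough sketch of the launcher side of this argument: the job type travels in the child process environment while the user's argv is passed through untouched, so any backend that can set environment variables can carry it. The variable name `EXAMPLE_JOB_TYPE` and the functions `build_child_env` and `launch` are hypothetical and do not come from the PR.

```python
import os
import subprocess
from typing import Dict, List, Mapping, Optional

# Hypothetical variable name, chosen only for this sketch.
JOB_TYPE_ENV_VAR = "EXAMPLE_JOB_TYPE"


def build_child_env(
    job_type: str, base_env: Optional[Mapping[str, str]] = None
) -> Dict[str, str]:
    """Copy the parent environment and add the job type to the copy."""
    env = dict(os.environ if base_env is None else base_env)
    env[JOB_TYPE_ENV_VAR] = job_type
    return env


def launch(user_argv: List[str], job_type: str) -> subprocess.CompletedProcess:
    """Run the user's command as given; only the environment is modified."""
    return subprocess.run(
        user_argv,
        env=build_child_env(job_type),
        capture_output=True,
        text=True,
        check=True,
    )
```

The same key/value pair would map onto whatever mechanism another backend offers for setting container environment variables, whereas injected CLI flags have to be merged into a command line the user owns.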

> Btw, how does this affect if a user is doing local training/inference?

What do you mean by this? AFAIK people aren't running glt_trainer locally, they directly run the training loops? If they are running the loops they already need to set a lot of env vars like RANK, WORLD_SIZE etc.
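To make the point about existing env vars concrete, here is a small sketch of the setup a local run of a distributed training loop typically needs. RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are the variables PyTorch's `env://` initialization reads; the helper `set_local_dist_env` and its defaults are assumptions for illustration, not part of GiGL.

```python
from typing import Dict, MutableMapping


def set_local_dist_env(
    env: MutableMapping[str, str],
    rank: int = 0,
    world_size: int = 1,
    master_addr: str = "localhost",
    master_port: int = 29500,
) -> Dict[str, str]:
    """Fill in single-machine defaults without overriding values already set."""
    defaults = {
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
    }
    for name, value in defaults.items():
        # setdefault keeps anything the user or a launcher already exported.
        env.setdefault(name, value)
    return {name: env[name] for name in defaults}
```

A caller would pass `os.environ` before initializing the process group; one more variable for the job type would be set the same way.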

@svij-sc
Collaborator

svij-sc commented Jan 13, 2026

> What do you mean by this? AFAIK people aren't running glt_trainer locally, they directly run the training loops? If they are running the loops they already need to set a lot of env vars like RANK, WORLD_SIZE etc.

Yes, but those are expected for any dist training.
Ideally we have a local launcher that takes care of it for us too, but we are not there yet. Ideally we make use of something like torchrun, but I think that's not possible right now: https://docs.pytorch.org/docs/stable/elastic/run.html

Anyways, I am just concerned but this is not blocking.
I do agree it's nice to have a utility to see what component you are in.

Collaborator

@mkolodner-sc mkolodner-sc left a comment

This LGTM, thanks Kyle!

@kmontemayor2-sc
Collaborator Author

> Yes, but those are expected for any dist training.

Is your concern that we may now need to require users to provide more env vars? I guess that's fair, but given that we build datasets differently based on the component 1, I'm not sure what the other approach here is. We need to signal somehow, either via CLI flag or env var, and I feel that env var is less intrusive.

FWIW I don't expect we'll require users to set this all the time, but I do think it'll be useful for graphstore mode (which is sort of weird to run locally anyways...)

@kmontemayor2-sc kmontemayor2-sc added this pull request to the merge queue Jan 13, 2026
@kmontemayor2-sc kmontemayor2-sc removed this pull request from the merge queue due to a manual request Jan 13, 2026