Fix: Resolve grafting bugs #6228
base: master
Conversation
Force-pushed from 0738b0e to 103da40
Signed-off-by: Maksim Dimitrov <dimitrov.maksim@gmail.com>
Force-pushed from 103da40 to 8926e4b
```diff
     column: &str,
 ) -> Result<Vec<i64>, StoreError> {
-    const QUERY: &str = "select histogram_bounds::text::int8[] bounds \
+    const QUERY: &str = "select coalesce(histogram_bounds::text::int8[], '{}'::int8[]) as bounds \
```
woops .. nice!
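For context on why the `coalesce` matters: `histogram_bounds` in `pg_stats` is NULL whenever Postgres has no histogram for the column (for example, the table was never analyzed or has too few distinct values), and a NULL row cannot be deserialized into a `Vec<i64>`. Below is a rough, self-contained sketch of the fixed query, not the actual graph-node function; the `pg_stats` filter and the diesel wiring are assumptions for illustration.

```rust
use diesel::pg::PgConnection;
use diesel::prelude::*;
use diesel::sql_query;
use diesel::sql_types::{Array, BigInt, Text};

#[derive(QueryableByName)]
struct Bounds {
    #[diesel(sql_type = Array<BigInt>)]
    bounds: Vec<i64>,
}

// Hypothetical standalone version of the query: with the coalesce, a missing
// histogram comes back as an empty array instead of a NULL that fails to
// deserialize into Vec<i64>.
fn histogram_bounds(
    conn: &mut PgConnection,
    namespace: &str,
    table: &str,
    column: &str,
) -> diesel::QueryResult<Vec<i64>> {
    const QUERY: &str = "select coalesce(histogram_bounds::text::int8[], '{}'::int8[]) as bounds \
                           from pg_stats \
                          where schemaname = $1 and tablename = $2 and attname = $3";
    let row: Bounds = sql_query(QUERY)
        .bind::<Text, _>(namespace)
        .bind::<Text, _>(table)
        .bind::<Text, _>(column)
        .get_result(conn)?;
    Ok(row.bounds)
}
```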
```diff
 self.primary_conn()
     .await?
-    .record_active_copy(graft_base.site.as_ref(), site.as_ref())
+    .record_active_copy(graft_base_layout.site.as_ref(), site.as_ref())
```
Shouldn't this also happen in the transaction above? Otherwise, can't we end up in a situation where we set up everything except recording the active copy if graph-node gets killed at the wrong moment?
Good catch, I was only focusing on the incompatible schemas case
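To make the window concrete, here is a minimal sketch (hypothetical helper names, not the actual store API), assuming both writes go through the same primary connection:

```rust
use diesel::pg::PgConnection;
use diesel::result::Error as DieselError;
use diesel::Connection;

// Hypothetical stand-ins for the real store operations.
fn set_up_deployment(_conn: &mut PgConnection) -> Result<(), DieselError> { Ok(()) }
fn record_active_copy(_conn: &mut PgConnection) -> Result<(), DieselError> { Ok(()) }

fn create_graft(conn: &mut PgConnection) -> Result<(), DieselError> {
    conn.transaction(|conn| set_up_deployment(conn))?;
    // If graph-node is killed right here, the deployment is set up but the
    // active copy was never recorded. Moving record_active_copy into the
    // transaction above removes that window.
    conn.transaction(|conn| record_active_copy(conn))
}
```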
```rust
let base_layout = self.layout(graft_base).await?;
let entities_with_causality_region =
    deployment.manifest.entities_with_causality_region.clone();
let catalog = Catalog::for_tests(
```
This is wrong - `for_tests` doesn't actually check the database for its capabilities (and the method should be marked as `#[cfg(debug_assertions)]`). Instead, this needs to use `Catalog::for_creation` where the connection is a connection to the shard in which we will create the subgraph. It's best to create the catalog object outside of this transaction so that we don't hold multiple db connections at once.
> `for_tests` doesn't actually check the database for its capabilities

For this specific case the catalog is needed just to create the layout object, so the `can_copy_from` helpers can be used to compare the src and dst schemas. Later in the code, when creating the deployment relational schema, we construct the catalog with the proper checks. Although I agree that using a `for_tests` function does not look good.

> Instead, this needs to use `Catalog::for_creation` where the connection is a connection to the shard

At this point the only option I see to get a shard connection is using `deployment_store.get_replica_conn(ReplicaId::Main)`, because the `pool` field is private.

> It's best to create the catalog object outside of this transaction so that we don't hold multiple db connections at once.

Catalog creation requires the site to already be created, which happens in the transaction, and the point of adding the transaction is to revert the creation of the site if `can_copy_from` fails (which is the cause of the original issue), so I'm not sure holding a primary and a shard connection at once can be avoided. Unless, instead of using a transaction, we just execute `drop_site` if the `can_copy_from` check fails. But I may be missing something.

I wanted to make this fix with minimal changes to the current workflow, but maybe I should rethink the whole workflow instead of just patching it.
Resolves #6220, resolves #6221
Issue: #6220
Currently, the `can_copy_from` validation is only performed when a deployment is considered new, that is, when no existing `deployment_schemas` record exists for the given Qm hash. In the code here, the layout is assigned if the deployment is new. If a `deployment_schemas` record already exists, the code assumes the deployment was previously created and copied successfully, and therefore sets `graft_base_layout` to `None`.

However, if `can_copy_from` later fails (see here), the transaction rolls back all changes except the record in the `deployment_schemas` table. This leaves the deployment in a partially created state, even though the copy was never attempted.

When the same deployment is attempted again, it is no longer treated as new. As a result, `graft_base_layout` is set to `None` and the `can_copy_from` check is skipped. The deployment then proceeds as if it were valid; the copy process starts but fails again, ultimately leaving the deployment stuck in a "never synced" state.
What this PR changes:

- Move the `can_copy_from` check and the site allocation into a transaction so a failing check fully rolls back the allocation (see the sketch after this list).
- Remove the duplicate/late compatibility check from `deployment_store.rs`.
- Use `Catalog` + `Layout::new(...)` to construct the layout and run `layout.can_copy_from(&base_layout)` before committing.
- Minor cleanups (string formatting improvements, unused variables removed).
- Removed the `can_copy_from` check from `copy_deployment` because it passed the source layout as the graft base and then compared it against the destination layout, which seems to be a redundant check since the two layouts are the same.
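Roughly, the intended flow is sketched below. This is a simplified, self-contained illustration rather than the actual graph-node code: `allocate_site`, `build_dst_layout`, and the `Layout` type here are hypothetical stand-ins, and the real implementation constructs the layout via `Catalog` + `Layout::new(...)`.

```rust
use diesel::pg::PgConnection;
use diesel::result::Error as DieselError;
use diesel::Connection;

// Hypothetical stand-ins for the real store methods and layout type.
struct Layout;
impl Layout {
    fn can_copy_from(&self, _base: &Layout) -> Result<(), String> { Ok(()) }
}
fn allocate_site(_conn: &mut PgConnection) -> Result<(), DieselError> { Ok(()) }
fn build_dst_layout(_conn: &mut PgConnection) -> Result<Layout, DieselError> { Ok(Layout) }

/// Allocate the site and validate the graft inside one transaction: if the
/// compatibility check fails, the site allocation is rolled back, so a later
/// re-deploy is treated as brand new instead of half-created.
fn create_grafted_deployment(
    conn: &mut PgConnection,
    base_layout: &Layout,
) -> Result<(), DieselError> {
    conn.transaction(|conn| {
        allocate_site(conn)?;
        let dst_layout = build_dst_layout(conn)?;
        if dst_layout.can_copy_from(base_layout).is_err() {
            // Returning an error aborts the transaction and undoes the allocation.
            return Err(DieselError::RollbackTransaction);
        }
        Ok(())
    })
}
```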
Result

Now every re-deploy of a failed deployment should behave as a brand-new deployment, and the `if !exists` check should behave correctly.

Note: This approach may result in gaps in `deployment_schemas_id_seq`, since Postgres sequences do not roll back with the transaction.