Fix: Resolve grafting bugs #6228
base: master
Conversation
Force-pushed from 0738b0e to 103da40
Signed-off-by: Maksim Dimitrov <dimitrov.maksim@gmail.com>
Force-pushed from 103da40 to 8926e4b
```diff
     column: &str,
 ) -> Result<Vec<i64>, StoreError> {
-    const QUERY: &str = "select histogram_bounds::text::int8[] bounds \
+    const QUERY: &str = "select coalesce(histogram_bounds::text::int8[], '{}'::int8[]) as bounds \
```
woops .. nice!
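For context on why the `coalesce` matters: `histogram_bounds` in `pg_stats` is NULL whenever Postgres has no histogram for the column (for example, the table was never analyzed or has too few distinct values), and a NULL row cannot be deserialized into a `Vec<i64>`. Below is a rough, self-contained sketch of the fixed query, not the actual graph-node function; the `pg_stats` filter and the diesel wiring are assumptions for illustration.

```rust
use diesel::pg::PgConnection;
use diesel::prelude::*;
use diesel::sql_query;
use diesel::sql_types::{Array, BigInt, Text};

#[derive(QueryableByName)]
struct Bounds {
    #[diesel(sql_type = Array<BigInt>)]
    bounds: Vec<i64>,
}

// Hypothetical standalone version of the query: with the coalesce, a missing
// histogram comes back as an empty array instead of a NULL that fails to
// deserialize into Vec<i64>.
fn histogram_bounds(
    conn: &mut PgConnection,
    namespace: &str,
    table: &str,
    column: &str,
) -> diesel::QueryResult<Vec<i64>> {
    const QUERY: &str = "select coalesce(histogram_bounds::text::int8[], '{}'::int8[]) as bounds \
                           from pg_stats \
                          where schemaname = $1 and tablename = $2 and attname = $3";
    let row: Bounds = sql_query(QUERY)
        .bind::<Text, _>(namespace)
        .bind::<Text, _>(table)
        .bind::<Text, _>(column)
        .get_result(conn)?;
    Ok(row.bounds)
}
```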
```diff
 self.primary_conn()
     .await?
-    .record_active_copy(graft_base.site.as_ref(), site.as_ref())
+    .record_active_copy(graft_base_layout.site.as_ref(), site.as_ref())
```
Shouldn't this also happen in the transaction above? Otherwise, can't we end up in a situation where we set up everything except recording the active copy if graph-node gets killed at the wrong moment?
Good catch, I was only focusing on the incompatible schemas case
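To make the window concrete, here is a minimal sketch (hypothetical helper names, not the actual store API), assuming both writes go through the same primary connection:

```rust
use diesel::pg::PgConnection;
use diesel::result::Error as DieselError;
use diesel::Connection;

// Hypothetical stand-ins for the real store operations.
fn set_up_deployment(_conn: &mut PgConnection) -> Result<(), DieselError> { Ok(()) }
fn record_active_copy(_conn: &mut PgConnection) -> Result<(), DieselError> { Ok(()) }

fn create_graft(conn: &mut PgConnection) -> Result<(), DieselError> {
    conn.transaction(|conn| set_up_deployment(conn))?;
    // If graph-node is killed right here, the deployment is set up but the
    // active copy was never recorded. Moving record_active_copy into the
    // transaction above removes that window.
    conn.transaction(|conn| record_active_copy(conn))
}
```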
```rust
let base_layout = self.layout(graft_base).await?;
let entities_with_causality_region =
    deployment.manifest.entities_with_causality_region.clone();
let catalog = Catalog::for_tests(
```
This is wrong - `for_tests` doesn't actually check the database for its capabilities (and the method should be marked as `#[cfg(debug_assertions)]`). Instead, this needs to use `Catalog::for_creation` where the connection is a connection to the shard in which we will create the subgraph. It's best to create the catalog object outside of this transaction so that we don't hold multiple db connections at once.
> `for_tests` doesn't actually check the database for its capabilities

For this specific case the catalog is needed just to create the layout object, so the `can_copy_from` helpers can be used to compare the src and dst schemas. Later in the code, when creating the deployment relational schema, we construct the catalog with the proper checks. Although I agree that using a `for_tests` function does not look good.

> Instead, this needs to use `Catalog::for_creation` where the connection is a connection to the shard

At this point the only option I see to get a shard connection is using `deployment_store.get_replica_conn(ReplicaId::Main)`, because the `pool` field is private.

> It's best to create the catalog object outside of this transaction so that we don't hold multiple db connections at once.

Catalog creation requires the site to already be created, which happens in the transaction, and the point of adding the transaction is to revert the creation of the site if `can_copy_from` fails (which is the cause of the original issue), so I'm not sure holding a primary and a shard connection at once can be avoided. Unless, instead of using a transaction, we just execute `drop_site` if the `can_copy_from` check fails. But I may be missing something.

I wanted to make this fix with minimal changes to the current workflow, but maybe I should rethink the whole workflow instead of just patching it.
Resolves #6220, resolves #6221
Issue: #6220
Currently, the `can_copy_from` validation is only performed when a deployment is considered new, that is, when no existing `deployment_schemas` record exists for the given Qm hash. In the code here, the layout is assigned if the deployment is new. If a `deployment_schemas` record already exists, the code assumes the deployment was previously created and copied successfully, and therefore sets `graft_base_layout` to `None`.

However, if `can_copy_from` later fails (see here), the transaction rolls back all changes except the record in the `deployment_schemas` table. This leaves the deployment in a partially created state, even though the copy was never attempted.

When the same deployment is attempted again, it is no longer treated as new. As a result, `graft_base_layout` is set to `None` and the `can_copy_from` check is skipped. The deployment then proceeds as if it were valid; the copy process starts but fails again, ultimately leaving the deployment stuck in a "never synced" state.
What this PR changes:

- Move the `can_copy_from` check and the site allocation into a transaction so a failing check fully rolls back the allocation (see the sketch after this list).
- Remove the duplicate/late compatibility check from `deployment_store.rs`.
- Use `Catalog` + `Layout::new(...)` to construct the layout and run `layout.can_copy_from(&base_layout)` before committing.
- Minor cleanups (string formatting improvements, unused variables removed).
- Removed the `can_copy_from` check from `copy_deployment` because it passed the source layout as the graft base and then compared it against the destination layout, which seems to be a redundant check since the two layouts are the same.
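Roughly, the intended flow is sketched below. This is a simplified, self-contained illustration rather than the actual graph-node code: `allocate_site`, `build_dst_layout`, and the `Layout` type here are hypothetical stand-ins, and the real implementation constructs the layout via `Catalog` + `Layout::new(...)`.

```rust
use diesel::pg::PgConnection;
use diesel::result::Error as DieselError;
use diesel::Connection;

// Hypothetical stand-ins for the real store methods and layout type.
struct Layout;
impl Layout {
    fn can_copy_from(&self, _base: &Layout) -> Result<(), String> { Ok(()) }
}
fn allocate_site(_conn: &mut PgConnection) -> Result<(), DieselError> { Ok(()) }
fn build_dst_layout(_conn: &mut PgConnection) -> Result<Layout, DieselError> { Ok(Layout) }

/// Allocate the site and validate the graft inside one transaction: if the
/// compatibility check fails, the site allocation is rolled back, so a later
/// re-deploy is treated as brand new instead of half-created.
fn create_grafted_deployment(
    conn: &mut PgConnection,
    base_layout: &Layout,
) -> Result<(), DieselError> {
    conn.transaction(|conn| {
        allocate_site(conn)?;
        let dst_layout = build_dst_layout(conn)?;
        if dst_layout.can_copy_from(base_layout).is_err() {
            // Returning an error aborts the transaction and undoes the allocation.
            return Err(DieselError::RollbackTransaction);
        }
        Ok(())
    })
}
```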
Result

Now every re-deploy of a failed deployment should behave as a brand-new deployment, and the `if !exists` check should behave correctly.

Note: This approach may result in gaps in `deployment_schemas_id_seq`, since Postgres sequences do not roll back with the transaction.