-
Notifications
You must be signed in to change notification settings - Fork 7
feat: enhance operator with observability, HA, security and storage f… #69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR enhances the RustFS Operator's production readiness by implementing high availability, observability, security, and storage management features. The changes include automatic TLS certificate generation, health probes, Pod Disruption Budgets, and online PVC resizing capabilities.
Key Changes
- Automatic TLS certificate generation using self-signed certificates with rcgen library
- Health probe configuration (liveness, readiness, startup) now enabled in TenantSpec
- Pod Disruption Budget generation for storage pools to ensure high availability during maintenance
- Online PVC resizing logic to expand storage capacity
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 14 comments.
Show a summary per file
| File | Description |
|---|---|
| src/utils/tls.rs | Adds self-signed certificate generation function with 10-year validity period |
| src/types/v1alpha1/tenant.rs | Uncomments and enables probe fields and request_auto_cert in TenantSpec |
| src/types/v1alpha1/tenant/workloads.rs | Injects TLS configuration, environment variables, and probes into StatefulSet containers |
| src/types/v1alpha1/tenant/pdb.rs | New module implementing PodDisruptionBudget generation with hardcoded maxUnavailable=1 |
| src/reconcile.rs | Adds TLS secret creation, PDB application, and PVC resize logic to reconciliation loop |
| src/tests.rs | Adds unit tests for probe injection, PDB generation, TLS config, and volume resize detection |
| Cargo.toml | Adds rcgen and time dependencies for certificate generation |
| Cargo.lock | Lock file updates for new dependencies |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| use k8s_openapi::apimachinery::pkg::api::resource::Quantity; | ||
|
|
||
| let mut tenant = create_test_tenant(None, None); | ||
| // Set initial validation |
Copilot
AI
Jan 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test comment says "Set initial validation" but this appears to be setting up validation for volume claim templates. The comment should be corrected to accurately describe what's being set up, such as "Set initial storage configuration" or "Configure volume claim template".
| // Set initial validation | |
| // Configure initial storage via volume claim template |
| // Set valid period to 10 years for simplicity in this version | ||
| // In production this should be shorter and updated via rotation | ||
| params.not_before = time::OffsetDateTime::now_utc().saturating_sub(time::Duration::days(1)); | ||
| params.not_after = | ||
| time::OffsetDateTime::now_utc().saturating_add(time::Duration::days(365 * 10)); |
Copilot
AI
Jan 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The certificate validity period is set to 10 years (line 69), which is excessive for production use and could pose a security risk if the private key is compromised. While the comment acknowledges this should be shorter in production, the implementation should use a more reasonable default (e.g., 90 days or 1 year) with proper rotation mechanisms, rather than relying on users to change it later.
| // Set valid period to 10 years for simplicity in this version | |
| // In production this should be shorter and updated via rotation | |
| params.not_before = time::OffsetDateTime::now_utc().saturating_sub(time::Duration::days(1)); | |
| params.not_after = | |
| time::OffsetDateTime::now_utc().saturating_add(time::Duration::days(365 * 10)); | |
| // Set valid period to 1 year by default. | |
| // In production this can be shortened further and updated via rotation. | |
| params.not_before = time::OffsetDateTime::now_utc().saturating_sub(time::Duration::days(1)); | |
| params.not_after = | |
| time::OffsetDateTime::now_utc().saturating_add(time::Duration::days(365)); |
| if ctx.get::<corev1::Secret>(&secret_name, &ns).await.is_err() { | ||
| debug!( | ||
| "TLS secret {} not found, generating self-signed certificate", | ||
| secret_name | ||
| ); | ||
|
|
||
| let cn = format!("{}.{}.svc", latest_tenant.name(), ns); | ||
| let sans = vec![ | ||
| cn.clone(), | ||
| format!("*.{}", cn), | ||
| format!("{}.{}.svc.cluster.local", latest_tenant.name(), ns), | ||
| ]; | ||
|
|
||
| let (cert_pem, key_pem) = crate::utils::tls::generate_self_signed_cert(&cn, sans)?; | ||
|
|
||
| let mut data: std::collections::BTreeMap<String, k8s_openapi::ByteString> = | ||
| std::collections::BTreeMap::new(); | ||
| use k8s_openapi::ByteString; | ||
| data.insert("tls.crt".to_string(), ByteString(cert_pem.into_bytes())); | ||
| data.insert("tls.key".to_string(), ByteString(key_pem.into_bytes())); | ||
|
|
||
| let secret = corev1::Secret { | ||
| metadata: k8s_openapi::apimachinery::pkg::apis::meta::v1::ObjectMeta { | ||
| name: Some(secret_name.clone()), | ||
| namespace: Some(ns.clone()), | ||
| owner_references: Some(vec![latest_tenant.new_owner_ref()]), | ||
| labels: Some(latest_tenant.common_labels()), | ||
| ..Default::default() | ||
| }, | ||
| type_: Some("kubernetes.io/tls".to_string()), | ||
| data: Some(data), | ||
| ..Default::default() | ||
| }; | ||
|
|
||
| ctx.create(&secret, &ns).await?; | ||
|
|
||
| let _ = ctx | ||
| .record( | ||
| &latest_tenant, | ||
| EventType::Normal, | ||
| "TLSCertificateGenerated", | ||
| &format!( | ||
| "Generated self-signed certificate in secret {}", | ||
| secret_name | ||
| ), | ||
| ) | ||
| .await; | ||
| } |
Copilot
AI
Jan 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TLS secret generation only occurs if the secret doesn't exist, but doesn't verify the existing secret is valid or update it when certificate details change (e.g., SANs). If a user modifies the tenant name or namespace after initial creation, or if the existing certificate is invalid/expired, the operator will continue using the old certificate. Consider checking certificate validity and regenerating if needed.
| // IntOrString::Int(1) means max 1 pod can be unavailable at a time. | ||
| // This is safe for most distributed storage upgrades/maintenance. | ||
| let max_unavailable = Some(IntOrString::Int(1)); |
Copilot
AI
Jan 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PDB is hardcoded to maxUnavailable: 1, which may not be appropriate for all pool sizes. For a pool with only 1 replica, this would prevent any voluntary disruptions entirely. Consider making this configurable or calculating it based on the pool's replica count (e.g., using minAvailable instead with a value based on quorum requirements).
| #[test] | ||
| fn test_volume_resize_detection_logic() { | ||
| use k8s_openapi::api::core::v1 as corev1; | ||
| use k8s_openapi::apimachinery::pkg::api::resource::Quantity; | ||
|
|
||
| let mut tenant = create_test_tenant(None, None); | ||
| // Set initial validation | ||
| let mut template = corev1::PersistentVolumeClaimSpec::default(); | ||
| let mut requests = std::collections::BTreeMap::new(); | ||
| requests.insert("storage".to_string(), Quantity("10Gi".to_string())); | ||
| template.resources = Some(corev1::VolumeResourceRequirements { | ||
| requests: Some(requests), | ||
| ..Default::default() | ||
| }); | ||
|
|
||
| tenant.spec.pools[0].persistence.volume_claim_template = Some(template); | ||
|
|
||
| // Simulation logic similar to resize_pool_pvcs | ||
| let desired = tenant.spec.pools[0] | ||
| .persistence | ||
| .volume_claim_template | ||
| .as_ref() | ||
| .unwrap() | ||
| .resources | ||
| .as_ref() | ||
| .unwrap() | ||
| .requests | ||
| .as_ref() | ||
| .unwrap() | ||
| .get("storage") | ||
| .unwrap(); | ||
|
|
||
| assert_eq!(desired.0, "10Gi"); | ||
|
|
||
| // Mock current PVC state | ||
| let current_qty = Quantity("5Gi".to_string()); | ||
|
|
||
| assert_ne!(¤t_qty, desired); | ||
| // Logic would trigger resize here | ||
| } |
Copilot
AI
Jan 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test_volume_resize_detection_logic test only verifies the basic structure of extracting storage quantities from the spec, but doesn't actually test the resize_pool_pvcs function itself. The test doesn't verify the actual PVC patching logic, error handling when resize fails, or the behavior when storage sizes are semantically equal but syntactically different. Consider adding integration tests that actually invoke resize_pool_pvcs with mock PVCs.
src/reconcile.rs
Outdated
| })?; | ||
|
|
||
| for pvc in pvcs.items { | ||
| let pvc_name = pvc.metadata.name.as_deref().unwrap_or_default(); |
Copilot
AI
Jan 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using unwrap_or_default() for the PVC name results in an empty string if the name is None, which would cause the subsequent patch operation to fail silently or with a generic error. Since PVCs should always have a name when listed from the API, consider using a more explicit error handling approach or at least skipping PVCs without names with a warning log.
| let ss = tenant.new_statefulset(&tenant.spec.pools[0]).unwrap(); | ||
| let container = &ss | ||
| .spec | ||
| .as_ref() | ||
| .unwrap() | ||
| .template | ||
| .spec | ||
| .as_ref() | ||
| .unwrap() | ||
| .containers[0]; | ||
|
|
||
| assert!(container.liveness_probe.is_some()); | ||
| assert_eq!( | ||
| container | ||
| .liveness_probe | ||
| .as_ref() | ||
| .unwrap() | ||
| .initial_delay_seconds, | ||
| Some(10) | ||
| ); | ||
| } | ||
|
|
||
| #[test] | ||
| fn test_pdb_generation() { | ||
| let tenant = create_test_tenant(None, None); | ||
| let pdb = tenant.new_pdb(&tenant.spec.pools[0]).unwrap(); | ||
|
|
||
| assert_eq!(pdb.metadata.name.unwrap(), "test-tenant-pool-0"); | ||
| assert!(pdb.spec.unwrap().max_unavailable.is_some()); | ||
| } | ||
|
|
||
| #[test] | ||
| fn test_statefulset_tls_config() { | ||
| let mut tenant = create_test_tenant(None, None); | ||
| tenant.spec.request_auto_cert = Some(true); | ||
|
|
||
| let ss = tenant.new_statefulset(&tenant.spec.pools[0]).unwrap(); |
Copilot
AI
Jan 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests use unwrap() on the results of new_statefulset() and new_pdb(), which violates the project's lint rule (unwrap_used = "deny" in Cargo.toml). While this is in test code, it's still subject to the lints. Consider using expect() with a descriptive message or pattern matching to handle potential errors.
| GenerateCert { source: rcgen::Error }, | ||
| } | ||
|
|
||
| /// Generates a self-signed certificate and private key (RSA 2048) |
Copilot
AI
Jan 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation comment states "RSA 2048" but the implementation uses rcgen::KeyPair::generate() which defaults to ECDSA P-256, not RSA 2048. The comment should be updated to reflect the actual key algorithm being used, or the code should be modified to explicitly specify RSA 2048 if that's the intended behavior.
| /// Generates a self-signed certificate and private key (RSA 2048) | |
| /// Generates a self-signed certificate and private key (ECDSA P-256) |
src/reconcile.rs
Outdated
| && let Some(ref requests) = resources.requests | ||
| && let Some(current_qty) = requests.get("storage") | ||
| { | ||
| if current_qty != desired_storage { |
Copilot
AI
Jan 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The storage quantity comparison uses simple string equality, which may trigger unnecessary resize operations when the quantities are semantically equal but represented differently (e.g., "1000Mi" vs "1Gi"). This could lead to repeated patch attempts on every reconciliation loop. Consider parsing and normalizing the quantities before comparison, or implement a semantic comparison.
src/reconcile.rs
Outdated
| && let Some(ref requests) = resources.requests | ||
| && let Some(current_qty) = requests.get("storage") | ||
| { | ||
| if current_qty != desired_storage { |
Copilot
AI
Jan 9, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The resize logic compares quantities with != but doesn't check if the new size is larger than the current size. Attempting to shrink PVC storage (reducing size) is typically not supported in Kubernetes and will fail. The code should verify that desired_storage > current_qty before attempting the patch to avoid unnecessary error logs and patch attempts.
| if current_qty != desired_storage { | |
| if desired_storage > current_qty { |
src/reconcile.rs
Outdated
| } | ||
| }); | ||
|
|
||
| match pvcs_api |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically this is not allowed as the stateful set does not allow modifying volumes post creation. Since this is patching the pvc directly we will want to make sure this works in a deployment. Has this change been tested against a deployed operator and statefulset?
|
There are items in this PR which are reflecting changes made in #59. |
There are merge conflicts in that PR which need to be fixed, the PR itself is approved but those conflicts need to be fixed first before we merge. |
ok, i'll fix it. |
Description
This PR significantly enhances the production readiness of the RustFS Operator by implementing critical missing verification, security, and high-availability features.
Key Changes
1. High Availability (HA) & Observability
TenantSpec. Probes are now correctly injected into StatefulSet containers to ensure Kubernetes can detect and restart unhealthy pods.maxUnavailable: 1) to protect data availability during node maintenance.2. Security (TLS)
rcgen.spec.requestAutoCertis enabled, the operator automatically:/etc/rustfs-tlsin the StatefulSet.RUSTFS_TLS_ENABLE=trueenvironment variables.3. Storage Management
TenantSpecexceeds the current PVC capacity. It automatically patches the PVCs to trigger underlying storage expansion (CSI-dependent).4. Quality Assurance
USER_MANUAL.mdfor end-users.Checklist
cargo test.