Skip to content

Conversation

@loverustfs
Copy link
Collaborator

Description

This PR significantly enhances the production readiness of the RustFS Operator by implementing critical missing verification, security, and high-availability features.

Key Changes

1. High Availability (HA) & Observability

  • Liveness & Readiness Probes: Uncovered and implemented probe configuration in TenantSpec. Probes are now correctly injected into StatefulSet containers to ensure Kubernetes can detect and restart unhealthy pods.
  • Pod Disruption Budget (PDB): Automatically generates PDB resources for each storage pool (defaulting to maxUnavailable: 1) to protect data availability during node maintenance.

2. Security (TLS)

  • Automatic TLS Certificate Generation: implemented a self-signed certificate authority using rcgen.
  • Secret Management: When spec.requestAutoCert is enabled, the operator automatically:
    • Generates a valid x509 certificate for the tenant service.
    • Creates a K8s Secret containing the certificate and key.
    • Mounts the Secret to /etc/rustfs-tls in the StatefulSet.
    • Injects RUSTFS_TLS_ENABLE=true environment variables.

3. Storage Management

  • Online PVC Resizing: Added logic to the reconcile loop to detect when the requested storage in TenantSpec exceeds the current PVC capacity. It automatically patches the PVCs to trigger underlying storage expansion (CSI-dependent).

4. Quality Assurance

  • Added unit tests for:
    • Probe injection logic.
    • PDB generation.
    • TLS certificate generation and validity checks.
    • Volume resize detection mechanism.
  • Created USER_MANUAL.md for end-users.

Checklist

  • Code compiles and passes cargo test.
  • New features are covered by unit tests.
  • CRD changes match the new logic (Probe fields enabled).

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enhances the RustFS Operator's production readiness by implementing high availability, observability, security, and storage management features. The changes include automatic TLS certificate generation, health probes, Pod Disruption Budgets, and online PVC resizing capabilities.

Key Changes

  • Automatic TLS certificate generation using self-signed certificates with rcgen library
  • Health probe configuration (liveness, readiness, startup) now enabled in TenantSpec
  • Pod Disruption Budget generation for storage pools to ensure high availability during maintenance
  • Online PVC resizing logic to expand storage capacity

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 14 comments.

Show a summary per file
File Description
src/utils/tls.rs Adds self-signed certificate generation function with 10-year validity period
src/types/v1alpha1/tenant.rs Uncomments and enables probe fields and request_auto_cert in TenantSpec
src/types/v1alpha1/tenant/workloads.rs Injects TLS configuration, environment variables, and probes into StatefulSet containers
src/types/v1alpha1/tenant/pdb.rs New module implementing PodDisruptionBudget generation with hardcoded maxUnavailable=1
src/reconcile.rs Adds TLS secret creation, PDB application, and PVC resize logic to reconciliation loop
src/tests.rs Adds unit tests for probe injection, PDB generation, TLS config, and volume resize detection
Cargo.toml Adds rcgen and time dependencies for certificate generation
Cargo.lock Lock file updates for new dependencies

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

use k8s_openapi::apimachinery::pkg::api::resource::Quantity;

let mut tenant = create_test_tenant(None, None);
// Set initial validation
Copy link

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test comment says "Set initial validation" but this appears to be setting up validation for volume claim templates. The comment should be corrected to accurately describe what's being set up, such as "Set initial storage configuration" or "Configure volume claim template".

Suggested change
// Set initial validation
// Configure initial storage via volume claim template

Copilot uses AI. Check for mistakes.
Comment on lines +65 to +69
// Set valid period to 10 years for simplicity in this version
// In production this should be shorter and updated via rotation
params.not_before = time::OffsetDateTime::now_utc().saturating_sub(time::Duration::days(1));
params.not_after =
time::OffsetDateTime::now_utc().saturating_add(time::Duration::days(365 * 10));
Copy link

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The certificate validity period is set to 10 years (line 69), which is excessive for production use and could pose a security risk if the private key is compromised. While the comment acknowledges this should be shorter in production, the implementation should use a more reasonable default (e.g., 90 days or 1 year) with proper rotation mechanisms, rather than relying on users to change it later.

Suggested change
// Set valid period to 10 years for simplicity in this version
// In production this should be shorter and updated via rotation
params.not_before = time::OffsetDateTime::now_utc().saturating_sub(time::Duration::days(1));
params.not_after =
time::OffsetDateTime::now_utc().saturating_add(time::Duration::days(365 * 10));
// Set valid period to 1 year by default.
// In production this can be shortened further and updated via rotation.
params.not_before = time::OffsetDateTime::now_utc().saturating_sub(time::Duration::days(1));
params.not_after =
time::OffsetDateTime::now_utc().saturating_add(time::Duration::days(365));

Copilot uses AI. Check for mistakes.
Comment on lines +114 to +161
if ctx.get::<corev1::Secret>(&secret_name, &ns).await.is_err() {
debug!(
"TLS secret {} not found, generating self-signed certificate",
secret_name
);

let cn = format!("{}.{}.svc", latest_tenant.name(), ns);
let sans = vec![
cn.clone(),
format!("*.{}", cn),
format!("{}.{}.svc.cluster.local", latest_tenant.name(), ns),
];

let (cert_pem, key_pem) = crate::utils::tls::generate_self_signed_cert(&cn, sans)?;

let mut data: std::collections::BTreeMap<String, k8s_openapi::ByteString> =
std::collections::BTreeMap::new();
use k8s_openapi::ByteString;
data.insert("tls.crt".to_string(), ByteString(cert_pem.into_bytes()));
data.insert("tls.key".to_string(), ByteString(key_pem.into_bytes()));

let secret = corev1::Secret {
metadata: k8s_openapi::apimachinery::pkg::apis::meta::v1::ObjectMeta {
name: Some(secret_name.clone()),
namespace: Some(ns.clone()),
owner_references: Some(vec![latest_tenant.new_owner_ref()]),
labels: Some(latest_tenant.common_labels()),
..Default::default()
},
type_: Some("kubernetes.io/tls".to_string()),
data: Some(data),
..Default::default()
};

ctx.create(&secret, &ns).await?;

let _ = ctx
.record(
&latest_tenant,
EventType::Normal,
"TLSCertificateGenerated",
&format!(
"Generated self-signed certificate in secret {}",
secret_name
),
)
.await;
}
Copy link

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TLS secret generation only occurs if the secret doesn't exist, but doesn't verify the existing secret is valid or update it when certificate details change (e.g., SANs). If a user modifies the tenant name or namespace after initial creation, or if the existing certificate is invalid/expired, the operator will continue using the old certificate. Consider checking certificate validity and regenerating if needed.

Copilot uses AI. Check for mistakes.
Comment on lines +17 to +19
// IntOrString::Int(1) means max 1 pod can be unavailable at a time.
// This is safe for most distributed storage upgrades/maintenance.
let max_unavailable = Some(IntOrString::Int(1));
Copy link

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PDB is hardcoded to maxUnavailable: 1, which may not be appropriate for all pool sizes. For a pool with only 1 replica, this would prevent any voluntary disruptions entirely. Consider making this configurable or calculating it based on the pool's replica count (e.g., using minAvailable instead with a value based on quorum requirements).

Copilot uses AI. Check for mistakes.
Comment on lines +148 to +187
#[test]
fn test_volume_resize_detection_logic() {
use k8s_openapi::api::core::v1 as corev1;
use k8s_openapi::apimachinery::pkg::api::resource::Quantity;

let mut tenant = create_test_tenant(None, None);
// Set initial validation
let mut template = corev1::PersistentVolumeClaimSpec::default();
let mut requests = std::collections::BTreeMap::new();
requests.insert("storage".to_string(), Quantity("10Gi".to_string()));
template.resources = Some(corev1::VolumeResourceRequirements {
requests: Some(requests),
..Default::default()
});

tenant.spec.pools[0].persistence.volume_claim_template = Some(template);

// Simulation logic similar to resize_pool_pvcs
let desired = tenant.spec.pools[0]
.persistence
.volume_claim_template
.as_ref()
.unwrap()
.resources
.as_ref()
.unwrap()
.requests
.as_ref()
.unwrap()
.get("storage")
.unwrap();

assert_eq!(desired.0, "10Gi");

// Mock current PVC state
let current_qty = Quantity("5Gi".to_string());

assert_ne!(&current_qty, desired);
// Logic would trigger resize here
}
Copy link

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test_volume_resize_detection_logic test only verifies the basic structure of extracting storage quantities from the spec, but doesn't actually test the resize_pool_pvcs function itself. The test doesn't verify the actual PVC patching logic, error handling when resize fails, or the behavior when storage sizes are semantically equal but syntactically different. Consider adding integration tests that actually invoke resize_pool_pvcs with mock PVCs.

Copilot uses AI. Check for mistakes.
src/reconcile.rs Outdated
})?;

for pvc in pvcs.items {
let pvc_name = pvc.metadata.name.as_deref().unwrap_or_default();
Copy link

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using unwrap_or_default() for the PVC name results in an empty string if the name is None, which would cause the subsequent patch operation to fail silently or with a generic error. Since PVCs should always have a name when listed from the API, consider using a more explicit error handling approach or at least skipping PVCs without names with a warning log.

Copilot uses AI. Check for mistakes.
Comment on lines +65 to +101
let ss = tenant.new_statefulset(&tenant.spec.pools[0]).unwrap();
let container = &ss
.spec
.as_ref()
.unwrap()
.template
.spec
.as_ref()
.unwrap()
.containers[0];

assert!(container.liveness_probe.is_some());
assert_eq!(
container
.liveness_probe
.as_ref()
.unwrap()
.initial_delay_seconds,
Some(10)
);
}

#[test]
fn test_pdb_generation() {
let tenant = create_test_tenant(None, None);
let pdb = tenant.new_pdb(&tenant.spec.pools[0]).unwrap();

assert_eq!(pdb.metadata.name.unwrap(), "test-tenant-pool-0");
assert!(pdb.spec.unwrap().max_unavailable.is_some());
}

#[test]
fn test_statefulset_tls_config() {
let mut tenant = create_test_tenant(None, None);
tenant.spec.request_auto_cert = Some(true);

let ss = tenant.new_statefulset(&tenant.spec.pools[0]).unwrap();
Copy link

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests use unwrap() on the results of new_statefulset() and new_pdb(), which violates the project's lint rule (unwrap_used = "deny" in Cargo.toml). While this is in test code, it's still subject to the lints. Consider using expect() with a descriptive message or pattern matching to handle potential errors.

Copilot uses AI. Check for mistakes.
GenerateCert { source: rcgen::Error },
}

/// Generates a self-signed certificate and private key (RSA 2048)
Copy link

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation comment states "RSA 2048" but the implementation uses rcgen::KeyPair::generate() which defaults to ECDSA P-256, not RSA 2048. The comment should be updated to reflect the actual key algorithm being used, or the code should be modified to explicitly specify RSA 2048 if that's the intended behavior.

Suggested change
/// Generates a self-signed certificate and private key (RSA 2048)
/// Generates a self-signed certificate and private key (ECDSA P-256)

Copilot uses AI. Check for mistakes.
src/reconcile.rs Outdated
&& let Some(ref requests) = resources.requests
&& let Some(current_qty) = requests.get("storage")
{
if current_qty != desired_storage {
Copy link

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The storage quantity comparison uses simple string equality, which may trigger unnecessary resize operations when the quantities are semantically equal but represented differently (e.g., "1000Mi" vs "1Gi"). This could lead to repeated patch attempts on every reconciliation loop. Consider parsing and normalizing the quantities before comparison, or implement a semantic comparison.

Copilot uses AI. Check for mistakes.
src/reconcile.rs Outdated
&& let Some(ref requests) = resources.requests
&& let Some(current_qty) = requests.get("storage")
{
if current_qty != desired_storage {
Copy link

Copilot AI Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The resize logic compares quantities with != but doesn't check if the new size is larger than the current size. Attempting to shrink PVC storage (reducing size) is typically not supported in Kubernetes and will fail. The code should verify that desired_storage > current_qty before attempting the patch to avoid unnecessary error logs and patch attempts.

Suggested change
if current_qty != desired_storage {
if desired_storage > current_qty {

Copilot uses AI. Check for mistakes.
src/reconcile.rs Outdated
}
});

match pvcs_api
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically this is not allowed as the stateful set does not allow modifying volumes post creation. Since this is patching the pvc directly we will want to make sure this works in a deployment. Has this change been tested against a deployed operator and statefulset?

@shahab96
Copy link
Collaborator

shahab96 commented Jan 9, 2026

There are items in this PR which are reflecting changes made in #59.
We should first complete and merge that, then update this branch to reflect new changes before we proceed with this PR.

@loverustfs
Copy link
Collaborator Author

There are items in this PR which are reflecting changes made in #59. We should first complete and merge that, then update this branch to reflect new changes before we proceed with this PR.

The merger of #59 was not approved.

@shahab96
Copy link
Collaborator

shahab96 commented Jan 9, 2026

There are items in this PR which are reflecting changes made in #59. We should first complete and merge that, then update this branch to reflect new changes before we proceed with this PR.

The merger of #59 was not approved.

There are merge conflicts in that PR which need to be fixed, the PR itself is approved but those conflicts need to be fixed first before we merge.

@loverustfs
Copy link
Collaborator Author

There are items in this PR which are reflecting changes made in #59. We should first complete and merge that, then update this branch to reflect new changes before we proceed with this PR.

The merger of #59 was not approved.

There are merge conflicts in that PR which need to be fixed, the PR itself is approved but those conflicts need to be fixed first before we merge.

ok, i'll fix it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants