@cfsmp3 (Contributor) commented Jan 11, 2026

Summary

Fixes a bug where tests could get stuck forever in the "Preparation" phase when GCP VM creation failed (e.g., due to quota limits) but a gcp_instance database record was still created.

Problem

Root Cause Investigation (Test #7768 - CCExtractor PR #2014)

While investigating why Linux tests weren't running for CCExtractor/ccextractor#2014, I discovered test #7768 had been stuck in "Preparation" for 12+ hours.

Investigation steps:

  1. Database state check:

    SELECT * FROM gcp_instance WHERE test_id = 7768;
    -- Result: linux-7768 created at 2026-01-10 23:23:07
    -- timestamp_prep_finished = NULL
    
    SELECT COUNT(*) FROM test_progress WHERE test_id = 7768;
    -- Result: 0 (no progress entries)

  2. GCP instance check:

    gcloud compute instances list --project ccextractor-sampleplatform
    # linux-7768 NOT listed - VM doesn't exist!

  3. GCP operation history:

    gcloud compute operations list --filter='targetLink:linux-7768'
    # operation-1768087398971... HTTP_STATUS=403
    
    gcloud compute operations describe operation-1768087398971...
    error:
      errors:
      - code: QUOTA_EXCEEDED
        message: "Quota 'IN_USE_ADDRESSES' exceeded. Limit: 8.0 in region us-central1."
    httpErrorStatusCode: 403
    httpErrorMessage: FORBIDDEN

  4. The bug: The platform created the gcp_instance record BEFORE verifying the GCP operation completed successfully. When the operation failed asynchronously with QUOTA_EXCEEDED, the record remained in the database. The cron job saw this record and assumed the test was running, so it never retried. The test was stuck forever with no way to recover.
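
The two database checks from step 1 can be combined into one query that flags candidate stuck tests. This is a minimal sketch under stated assumptions: only the column names `test_id`, `name`, and `timestamp_prep_finished` come from the investigation above. The table definitions, the demo rows other than `linux-7768`, and the function names `find_stuck_candidates` and `build_demo_db` are hypothetical, and SQLite stands in for the platform's real database.

```python
import sqlite3


def find_stuck_candidates(conn):
    """Return (test_id, name) for instances that never finished preparation
    and whose test has no progress entries at all."""
    query = """
        SELECT g.test_id, g.name
        FROM gcp_instance AS g
        LEFT JOIN test_progress AS p ON p.test_id = g.test_id
        WHERE g.timestamp_prep_finished IS NULL
        GROUP BY g.test_id, g.name
        HAVING COUNT(p.test_id) = 0
        ORDER BY g.test_id
    """
    return conn.execute(query).fetchall()


def build_demo_db():
    """Create an in-memory database with a hypothetical, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gcp_instance (
            name TEXT PRIMARY KEY,
            test_id INTEGER NOT NULL,
            timestamp_prep_finished TEXT
        );
        CREATE TABLE test_progress (
            id INTEGER PRIMARY KEY,
            test_id INTEGER NOT NULL
        );
        -- Stuck case: record exists, preparation never finished, no progress
        INSERT INTO gcp_instance VALUES ('linux-7768', 7768, NULL);
        -- Healthy case (made-up demo row): preparation finished, has progress
        INSERT INTO gcp_instance VALUES ('linux-7700', 7700, '2026-01-10 20:00:00');
        INSERT INTO test_progress (test_id) VALUES (7700);
        """
    )
    return conn
```

A flagged row is only a candidate, because a test that is legitimately still preparing matches the same query. The `gcloud` checks in steps 2 and 3 are still needed to confirm that the VM does not exist.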

Solution

After calling create_instance(), wait for the GCP operation to complete (with a 60-second timeout) before creating the gcp_instance record:

| Operation result | Action |
| --- | --- |
| Success (status: DONE, no error) | Create record, test proceeds normally |
| Failure (QUOTA_EXCEEDED, etc.) | Mark test as failed, NO record created |
| Timeout (still running after 60s) | Create record optimistically (slow VM startup) |

The 60-second verification timeout is sufficient to catch quota errors (which fail within seconds) while not blocking too long for legitimate slow VM creations.
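
The three outcomes can be sketched as a small classification function. This is an illustration rather than the code in `mod_ci/controllers.py`: it assumes the verification step returns the last polled operation as a dict shaped like the `gcloud compute operations describe` output above (a `status` field and an optional `error.errors` list). The names `Outcome` and `classify_operation` are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "create record, test proceeds"
    FAILURE = "mark test failed, no record"
    TIMEOUT = "create record optimistically"


def classify_operation(operation):
    """Map an operation result dict to one of the three outcomes.

    Assumed input shape: {"status": ..., "error": {"errors": [...]}},
    where the "error" key is absent on success.
    """
    errors = (operation.get("error") or {}).get("errors") or []
    if errors:
        # Any reported error (QUOTA_EXCEEDED, etc.) counts as a failure.
        return Outcome.FAILURE
    if operation.get("status") == "DONE":
        return Outcome.SUCCESS
    # Not DONE and no error: the verification window closed first.
    return Outcome.TIMEOUT
```

Checking for errors before checking `status` matters in this sketch, because a failed operation can also report `DONE`.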

Changes

mod_ci/controllers.py

  • Added GCP_VM_CREATE_VERIFY_TIMEOUT = 60 constant
  • Modified start_test() to call wait_for_operation() with this timeout
  • Only creates gcp_instance record after confirming success or timeout
  • Logs full GCP error response for debugging failed operations
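
The bounded wait can be pictured as a polling loop with a deadline. The PR does not show the signature of `wait_for_operation()`, so the helper below is a hypothetical stand-in named `poll_operation`. It takes a caller-supplied `fetch_operation` callable instead of a real GCP client, and the `interval`, `clock`, and `sleep` parameters are assumptions added to make the sketch testable.

```python
import time

GCP_VM_CREATE_VERIFY_TIMEOUT = 60  # seconds, the value added in this PR


def poll_operation(fetch_operation, timeout=GCP_VM_CREATE_VERIFY_TIMEOUT,
                   interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Poll until the operation reports DONE or the timeout elapses.

    Returns the last operation dict seen, so the caller can tell
    success, failure, and timeout apart from its contents.
    """
    deadline = clock() + timeout
    while True:
        operation = fetch_operation()
        if operation.get("status") == "DONE":
            return operation
        if clock() >= deadline:
            # Still PENDING/RUNNING: give up waiting, but report what we saw.
            return operation
        sleep(interval)
```

Using a monotonic clock keeps the deadline unaffected by system clock adjustments. Injecting `clock` and `sleep` lets a test simulate 60 seconds without actually waiting.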

tests/test_ci/test_controllers.py

  • Updated existing test_start_test to mock wait_for_operation
  • Added TestVMCreationVerification class with 3 comprehensive tests:
    • test_start_test_quota_exceeded_no_db_record - Verifies the exact scenario from test #7768
    • test_start_test_vm_verified_creates_db_record - Verifies successful VMs create records
    • test_start_test_operation_timeout_creates_db_record - Verifies slow VMs still get records
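
The quota-exceeded test can be approximated with `unittest.mock`. The sketch below does not reproduce the real test or the real `start_test()`. `start_test_sketch` is a hypothetical, dependency-injected stand-in, and all four callables passed to it are mocks invented for illustration.

```python
from unittest import mock


def start_test_sketch(create_instance, wait_for_operation, add_record, mark_failed):
    """Hypothetical stand-in for start_test(): verify first, record second."""
    operation = create_instance()
    result = wait_for_operation(operation)
    if (result.get("error") or {}).get("errors"):
        # Creation failed: fail the test and leave no database record behind.
        mark_failed(result)
        return False
    # Success or timeout: the record is created in both cases.
    add_record()
    return True


def run_quota_exceeded_scenario():
    """Simulate the test #7768 failure and return the mocks for inspection."""
    quota_error = {
        "status": "DONE",
        "error": {"errors": [{"code": "QUOTA_EXCEEDED"}]},
        "httpErrorStatusCode": 403,
    }
    create_instance = mock.Mock(return_value={"status": "RUNNING"})
    wait_for_operation = mock.Mock(return_value=quota_error)
    add_record = mock.Mock()
    mark_failed = mock.Mock()
    started = start_test_sketch(create_instance, wait_for_operation,
                                add_record, mark_failed)
    return started, add_record, mark_failed
```

The key assertion in such a test is that `add_record` was never called. A stale record is exactly what kept the cron job from retrying test #7768.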

Test plan

  • CI passes on all Python versions (3.10, 3.12, 3.13, 3.14)
  • New tests cover the quota exceeded scenario
  • Existing tests still pass with the new mocking

Manual verification completed

I manually fixed test #7768 by deleting the stale gcp_instance record:

DELETE FROM gcp_instance WHERE name = 'linux-7768';

The test was then picked up by the next cron run and completed successfully.

🤖 Generated with Claude Code

## Problem

Tests could get stuck forever in "Preparation" phase when GCP VM
creation failed (e.g., due to quota limits) but a gcp_instance
database record was still created.

### Root Cause Investigation (Test #7768)

While investigating why tests weren't running for CCExtractor PR #2014,
I discovered test #7768 had been stuck in "Preparation" for 12+ hours:

1. **Database state**: gcp_instance record existed for `linux-7768`,
   created at 2026-01-10 23:23:07, but `timestamp_prep_finished` was NULL
   and there were zero test_progress entries.

2. **GCP state**: The VM `linux-7768` did NOT exist in GCP.

3. **GCP operation history**: The VM creation operation returned HTTP 403:
   ```
   error:
     errors:
     - code: QUOTA_EXCEEDED
       message: "Quota 'IN_USE_ADDRESSES' exceeded. Limit: 8.0 in region us-central1."
   httpErrorStatusCode: 403
   ```

4. **The bug**: The platform created the gcp_instance record BEFORE
   verifying the GCP operation completed successfully. When the operation
   failed asynchronously with QUOTA_EXCEEDED, the record remained in the
   database. The cron job saw this record and assumed the test was
   running, so it never retried. The test was stuck forever.

## Solution

After calling `create_instance()`, wait for the GCP operation to complete
(with a 60-second timeout) before creating the gcp_instance record:

- If operation completes successfully → create record, test proceeds
- If operation fails (QUOTA_EXCEEDED, etc.) → mark test failed, NO record
- If operation times out → create record optimistically (slow VM creation)

The 60-second verification timeout is sufficient to catch quota errors
(which fail within seconds) while not blocking too long for legitimate
slow VM creations.

## Changes

- `mod_ci/controllers.py`:
  - Added `GCP_VM_CREATE_VERIFY_TIMEOUT = 60` constant
  - Modified `start_test()` to wait for operation verification
  - Only creates gcp_instance record after confirming success or timeout

- `tests/test_ci/test_controllers.py`:
  - Updated existing `test_start_test` to mock `wait_for_operation`
  - Added `TestVMCreationVerification` class with 3 new tests:
    - `test_start_test_quota_exceeded_no_db_record`: Verifies QUOTA_EXCEEDED
      prevents record creation (the exact scenario from test #7768)
    - `test_start_test_vm_verified_creates_db_record`: Verifies successful
      VM creation creates record
    - `test_start_test_operation_timeout_creates_db_record`: Verifies
      timeout still creates record (for slow VMs)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
cfsmp3 and others added 2 commits January 11, 2026 09:28
Addresses SonarCloud quality gate failure for duplicated lines.
Extracted common mock setup into _setup_start_test_mocks() helper method.
@canihavesomecoffee canihavesomecoffee merged commit 603f340 into master Jan 11, 2026
6 checks passed