Technical Advisory 151050

Publication date: August 4, 2025

Description

In all versions of CockroachDB v24.1+, on rare occasions when the object storage providers used for backups experience intermittent issues that cause write failures, backup jobs may report success without having written all data to object storage. Additionally, affected backups may appear to restore successfully even though the restored data is incomplete.

In the majority of cases, affected backups can be identified by log entries emitted during the creation of the backup. It is theoretically possible for a backup to be affected without emitting the log line below, but we believe this undetectable case is exceedingly rare due to the error handling and logging logic. Instructions for identifying affected backups are provided in the Mitigation section below.

While this issue is expected to occur rarely, we recommend immediately upgrading to the patch release that contains the fix for your major version and using only backups created after the upgrade. Only a new full backup, and subsequent incremental backups, taken with the upgraded version will ensure full recoverability when restoring.

Statement

This issue is resolved in CockroachDB by PR #151058, which ensures that errors encountered during every buffer flush, including those caused by intermittent object storage provider errors, are correctly reported to the backup job.

The fix has been applied to the following versions of CockroachDB: v24.1.22, v24.3.17, v25.1.10, v25.2.4, and the testing version of v25.3, v25.3.0-rc.2.

This issue is tracked publicly by #151050.

Mitigation

Users of CockroachDB v24.1+ should upgrade to at least v24.1.22, v24.3.17, v25.1.10, v25.2.4, or the testing version of v25.3, v25.3.0-rc.2, immediately and initiate new full backups as soon as possible to reduce the risk of an incomplete recovery when restoring. Only a new full backup and subsequent incremental backups taken with the upgraded version will ensure complete recovery when restoring.
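Whether a node already runs one of the fixed releases listed above can be checked mechanically. The following is a minimal Bash sketch, assuming you pass in a version string such as the Build Tag reported by `cockroach version` (for example `v24.1.21`). The function name `classify_version` and its output labels are hypothetical; only the fixed patch numbers come from this advisory. Release lines the advisory does not name (such as v24.2) are reported as `unlisted`, and for those the release notes, not this script, are the authority.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a CockroachDB version string against the
# fixed patch releases listed in this advisory.
# Prints one of: fixed, affected, not-affected, unlisted.
classify_version() {
  local v="${1#v}"                  # "v25.3.0-rc.2" -> "25.3.0-rc.2"
  local core="${v%%-*}"             # "25.3.0"
  local pre=""                      # pre-release suffix, e.g. "rc.2"
  if [[ "$v" == *-* ]]; then pre="${v#*-}"; fi

  local major minor patch
  IFS=. read -r major minor patch <<< "$core"

  # Releases before v24.1 are outside the scope of this advisory.
  if (( major < 24 || (major == 24 && minor < 1) )); then
    echo "not-affected"; return 0
  fi

  local fixed_patch
  case "$major.$minor" in
    24.1) fixed_patch=22 ;;
    24.3) fixed_patch=17 ;;
    25.1) fixed_patch=10 ;;
    25.2) fixed_patch=4 ;;
    25.3)
      # Fixed in the testing release v25.3.0-rc.2. Treat later v25.3
      # patches, the v25.3.0 final release, and rc.2 or later as fixed.
      if (( patch > 0 )) || [[ -z "$pre" ]]; then
        echo "fixed"
      elif [[ "$pre" =~ ^rc\.([0-9]+)$ ]] && (( ${BASH_REMATCH[1]} >= 2 )); then
        echo "fixed"
      else
        echo "affected"             # earlier alpha, beta, or rc builds
      fi
      return 0 ;;
    *)
      # No fixed release is named in this advisory for this line.
      echo "unlisted"; return 0 ;;
  esac

  if (( patch >= fixed_patch )); then echo "fixed"; else echo "affected"; fi
}
```

As a usage example, assuming the `cockroach version` output contains a line of the form `Build Tag: v24.1.21`, you could run `classify_version "$(cockroach version | awk '/Build Tag/ {print $3}')"` on each node. The patch number comparison deliberately ignores pre-release builds of patch releases other than v25.3.0.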

This technical advisory applies only to the rare case in which intermittent object storage failures occurred during a BACKUP job, persisted long enough to exhaust the built-in retry policies, and then resolved shortly afterward. If object storage writes fail consistently, for example because access credentials are incorrect or the network configuration prevents all access, then metadata file writes also fail and the backup is correctly reported as failed.

In the intermittent failure scenario:

  • One or more row data file writes may silently fail due to an unreported upload error that is persistent enough to exhaust the object storage client retry policy.
  • If that error condition is then resolved, metadata files may then write successfully, causing the job to incorrectly report success.
  • RESTORE operations from such backups can appear to succeed but result in partial, inconsistent data.
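To make the failure sequence above concrete, here is a toy Bash model. Everything in it (the `upload`, `backup_buggy`, and `backup_fixed` functions and the `FAIL_FILES` variable) is hypothetical and does not reflect CockroachDB's actual backup code; it only mirrors the described pattern, in which a data file error is dropped and the later metadata write alone decides the reported outcome.

```shell
#!/usr/bin/env bash
# Toy model only: these functions are hypothetical and do not reflect
# CockroachDB's implementation.

# upload NAME: pretend to write one file to object storage.
# Any file named in FAIL_FILES fails even after retries are exhausted.
upload() {
  [[ " ${FAIL_FILES:-} " != *" $1 "* ]]
}

# Buggy pattern: a data file upload error is swallowed, so only the
# metadata write determines the job's final status.
backup_buggy() {
  local f
  for f in "$@"; do
    upload "$f" || true             # error is silently dropped here
  done
  upload "metadata"                 # succeeds once storage has recovered
}

# Fixed pattern: any data file error fails the job before metadata
# is written, so the failure is reported.
backup_fixed() {
  local f
  for f in "$@"; do
    upload "$f" || return 1         # error is reported to the job
  done
  upload "metadata"
}

# Example: data2.sst hits a persistent upload error.
# FAIL_FILES="data2.sst"
# backup_buggy data1.sst data2.sst && echo "buggy: reported success"
# backup_fixed data1.sst data2.sst || echo "fixed: reported failure"
```

With `FAIL_FILES="data2.sst"`, `backup_buggy` exits 0 even though one data file was never written, which corresponds to the incorrect success report, while `backup_fixed` exits 1.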

In the vast majority of cases, affected backups can be identified by log lines emitted while the backup was being created. To check, search the CockroachDB logs on each node for lines matching:

grep "failed to flush SST sink" {cockroach_log_files} | grep -v "cannot call Finish on a closed writer" | grep -v "context canceled"

If such lines are found, extract the associated backup job ID and verify whether the job ultimately succeeded using:

grep "stepping through state succeeded" {cockroach_log_files} | grep {associated_backup_job_id}

Backups identified by this search are likely affected by this issue. If the search finds no matches, a corrupt backup is highly unlikely, although it remains theoretically possible for a backup to have been affected without being detectable.
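The two log searches in this section can be wrapped in a small Bash sketch that scans every file under a node's log directory. The function names `find_flush_errors` and `job_succeeded` are hypothetical, and the directory argument is an assumption: point it at wherever each node writes its logs. Extracting the job ID from a matching line is left to the operator, because this advisory does not specify the log line format.

```shell
#!/usr/bin/env bash
# Hypothetical wrappers around the advisory's two grep searches.

# Step 1: print "failed to flush SST sink" lines from every file under
# LOG_DIR, excluding the two benign messages the advisory filters out.
# Output is prefixed with the file name so you can tell where it came from.
find_flush_errors() {
  local log_dir="$1"
  grep -r "failed to flush SST sink" "$log_dir" \
    | grep -v "cannot call Finish on a closed writer" \
    | grep -v "context canceled"
}

# Step 2: exit 0 if any log line shows JOB_ID reaching the succeeded state.
# -h drops file names so digits in them cannot match the ID; -w avoids
# matching a job ID that is only a prefix of a longer one (123 vs 1234).
job_succeeded() {
  local log_dir="$1" job_id="$2"
  grep -rh "stepping through state succeeded" "$log_dir" \
    | grep -w -- "$job_id" > /dev/null
}

# Example usage (the path and job ID are placeholders):
# find_flush_errors /path/to/node/logs
# job_succeeded /path/to/node/logs 1234567 && echo "job reported success"
```

The final `grep` writes to `/dev/null` instead of using `-q` so that it reads all of its input; with `-q` an early exit can make the upstream `grep` fail with SIGPIPE, which `set -o pipefail` would report as an error.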

Additionally, because the RESTORE operation simply restores the backup as written, it will succeed even if the backup is affected. As a result, this error cannot be detected during the restore process. Customers who observe impacts following a restore from a backup created on an affected version of CockroachDB should contact our support team.

The best course of action is to upgrade CockroachDB versions and take a full backup immediately.

We are continuing to evaluate longer-term safeguards to prevent recurrence and improve detection.

Impact

On rare occasions, when intermittent object storage provider issues cause backup object storage write failures, backup jobs may report completing successfully without backing up all data to object storage. Additionally, affected backups may appear to restore successfully, despite restored data being incomplete.

This impacts all versions of CockroachDB v24.1+.

Reach out to our support team if more information or assistance is needed.