Opting into UnsafeFailForward upgrades
Warning
This feature is not recommended in a majority of cases as enabling UnsafeFailForward upgrades no longer guarantees sane upgrade paths, possibly causing unrecoverable failures resulting in data loss. Only use this feature if you:
- Know every operator installed in the namespace.
- Have deep knowledge regarding the upgrade paths for each operator in the namespace.
- Have control over the contents of the catalogs providing the operators in the namespace.
- Do not want to manually upgrade your failed operators.
What “UnsafeFailForward” upgrades add
A failed installation/upgrade is typically caused by one of two scenarios:
- A Failed CSV: The CSV is in the FAILED phase.
- A Failed InstallPlan: Usually occurs because a resource listed in the InstallPlan fails to be created or updated. An InstallPlan may fail independently of its CSV and may fail to create the CSV.
By opting into “UnsafeFailForward” upgrades, OLM will allow you to recover from failed installations and upgrades by:
- Allowing CSVs to move from the FAILED phase to the REPLACING phase.
- Allowing OLM to calculate new InstallPlans for a set of installables if:
- The InstallPlan referenced by a Subscription is in the FAILED phase.
- The CatalogSource has been updated to include a new upgrade for one or more CSVs in the namespace.
Using UnsafeFailForward Upgrades
Since resolution is namespace scoped, the toggle for allowing “UnsafeFailForward” upgrades is namespace scoped as well. Accordingly, the namespace scoped resource, OperatorGroup, has the upgradeStrategy
field.
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
name: foo
namespace: bar
spec:
# Possible values include "Default" or "TechPreviewUnsafeFailForward".
upgradeStrategy: TechPreviewUnsafeFailForward
With the upgradeStrategy type set to TechPreviewUnsafeFailForward
, OLM will allow operators to “UnsafeFailForward” in adherence with the principles discussed below. If the upgradeStrategy is unset or set to Default
OLM will exhibit existing behavior.
Understanding “UnsafeFailForward” upgrades
Before using UnsafeFailForward upgrades, it is important to understand how OLM and the Resolver calculate what needs to be installed. When determining what to install, OLM provides the Resolver with the set of CSVs in a namespace. The Resolver treats these CSVs differently depending on whether or not they are claimed by a Subscription, where the Subscription lists the CSV’s name in its .status.currentCSV
field.
- If a CSV is claimed by a Subscription, the Resolver will allow it to be upgraded by a bundle that replaces it.
- If a CSV is not claimed by a Subscription, it is assumed that the user installed the CSV themselves, that OLM should not upgrade the CSV, and that it must appear in the next set of installables.
OLM’s Resolver is deterministic, meaning that the set of installables will not change for a given set of arguments if no new upgrades have been declared in the existing CatalogSources. When an upgrade fails due to a CSV entering the FAILED phase, the CSV being replaced still exists and is not referenced by a Subscription. Today, OLM would send both Operator v1
and Operator v2
to the Resolver which will normally be unsatisfiable because Operator v1
is marked as required and Operator v2
cannot be upgraded further since it provides the same APIs as Operator v1
. In order to support “failing forward”, OLM needs to omit CSVs in the REPLACING phase from the set of arguments sent to the Resolver when “UnsafeFailForward” upgrades are enabled.
When an InstallPlan failed
Let’s review what steps must be taken to recover from a failed InstallPlan today:
Operator v1
is being upgraded toOperator v2
.- The InstallPlan for
Operator v2
fails. Operator v3
is added to the catalog and is defined as the upgrade forOperator v1
.- The user deletes the InstallPlan created for
Operator v2
. - A new InstallPlan is generated for
Operator v3
and the upgrade succeeds.
With “UnsafeFailForward” upgrades, OLM allows new InstallPlans to be generated if:
- The failed InstallPlan is referenced by a Subscription in the namespace.
- The CatalogSource has been updated to include a new upgrade for the set of arguments.
In practice, the fourth step from the previous workflow would be removed, meaning that the you simply need to update the CatalogSource and wait for the cluster to move past the failed install. With catalog polling enabled, you can even skip updating the CatalogSource directly by pushing a new catalog image to the same tag.
BEST PRACTICE NOTE: If a bundle is known to fail, it should be skipped in the upgrade graph using the skips or skipRange feature.
To further understand, let’s visualize this.
When one or more CSVs have failed
Let’s review what steps must be taken to recover from a failed CSV today:
Operator v1
is being upgraded toOperator v2
.- The CSV for
Operator v2
enters the FAILED phase. Operator v3
is added to the catalog which replaces or skipsOperator v2
.- The Resolver cannot upgrade
Operator v2
while includingOperator v1
in the solution set, upgrade is blocked. - User manually deletes the existing CSVs, a new InstallPlan is generated and approved.
Operator v3
is installed and the upgrade succeeds.
To further understand, let’s visualize this.
When you opt into “UnsafeFailForward” upgrades, OLM:
- Does not include CSVs in the REPLACING phase in the set of arguments sent to the Resolver if the final CSV in the upgrade chain is in the FAILED phase and all other CSVs are in the REPLACING phase.
- Allows CSVs to move from the FAILED phase to the REPLACING phase.
These changes allow OLM to recover from any number of failed CSVs in an upgrade path, replacing the latest FAILED CSV with the next upgrade.