I’ve recently been fortunate enough to be part of a team looking into skip level upgrades ahead of the start of the Queens development cycle for OpenStack. What follows is an introduction to the concept of skip level upgrades and an overview of our initial PoC work in this area. Future posts will cover our plans for enabling skip level upgrades within TripleO, and possible work with the wider community to enable them within other deployment tools.
Introduction
Skip level upgrades are, as the name suggests, upgrades that move an environment from release N to N+X in a single step, where X is greater than 1 and, for skip level upgrades, is typically 3. For example, in the context of OpenStack, N to N+3 can refer to an upgrade from the Newton release of OpenStack to the Queens release, skipping Ocata and Pike:
Newton        Ocata         Pike        Queens
+-----+      +-----+      +-----+      +-----+
|     |      | N+1 |      | N+2 |      |     |
|  N  | ----------------------------> | N+3 |
|     |      |     |      |     |      |     |
+-----+      +-----+      +-----+      +-----+
There are existing alternatives for skipping a number of releases during an upgrade. For example, parallel cloud migration is a commonly cited approach, where an additional environment is stood up alongside the original and workloads are then migrated to the new environment:
            Newton
           +-----+
           |     |
env#1      |  N  |
           |     |
           +-----+
               \
                \        Queens
                 \      +-----+
                  \     |     |
env#2              `--> | N+3 |
                        |     |
                        +-----+
The requirement for this type of upgrade is driven by users looking to standardise on a given release (typically an LTS release), whilst retaining the ability to skip forward when that release hits EOL. This removes the need to keep up with the major release cycle, which in the case of OpenStack continues to be every 6 months.
It is worth highlighting that the topic of skip level upgrades is not new to the OpenStack community; there have been attempts to provide skip level upgrade functionality before now, typically within the various deployment projects. For example, openstack-ansible’s leap-upgrades project attempted to move environments from Juno/Kilo to Newton.
More recently, the topic of skip level upgrades was discussed at the OpenStack Forum in Boston in May. An RFC thread was also posted to the development mailing list, however no formal actions came out of either discussion. I’m looking to restart this discussion at the upcoming PTG in Denver; more on that later.
Requirements
Now that we understand what skip level upgrades actually are, it’s time to set out some basic requirements for the state of the environment during the upgrade. At the start of this process our team sat down and drafted the following:
- The control plane is inaccessible for the duration of the upgrade
- The upgrade must complete successfully or roll back within 4 hours
- The data plane and workloads must remain available for the duration of the upgrade
Proof of concept
With the requirements set out, our first real task was to prove that this was even possible with an OpenStack environment. Given the releases available at the time, we began by manually upgrading an existing Mitaka-based RHOSP 9 environment running on RHEL 7.3 to our recently released Ocata-based RHOSP 11 running on RHEL 7.4.
Mitaka        Newton        Ocata
+-----+      +-----+      +-----+
|     |      | N+1 |      |     |
|  N  | ----------------> | N+2 |
|     |      |     |      |     |
+-----+      +-----+      +-----+
RHOSP9       RHOSP10      RHOSP11
RHEL73                    RHEL74
We were aware that whilst the goal of skip level upgrades is to give the impression of a single jump between releases, in practice this isn’t possible with OpenStack. Upgrades of OpenStack components are verified by the community across N to N+1 jumps, so whilst we wanted to skip ahead to Ocata we knew we would also have to upgrade through Newton to get there.
The following outlines, at a very high level, the steps we followed during the PoC to upgrade the environment from Mitaka to Ocata:
- Rolling minor update of the underlying OS
- Disable control plane and compute services
- Upgrade a single controller to N+1 and then N+2
  - Update packages
  - Introduce new services as required (nova-placement for example)
  - Update service configuration files
  - Run DB syncs, migrations etc
  - Repeat for N+2
- Upgrade remaining controllers directly to N+2
  - Update packages
  - Introduce new services as required
  - Update service configuration files
- Upgrade remaining hosts to N+2
  - Update packages
  - Update service configuration files
- Enable control plane and compute services
- Verify workload availability during upgrade
- Validate the post upgrade environment
Let’s take a look at each of these steps below in more detail.
Rolling minor update of the underlying OS
This initial rolling minor update moved hosts from RHEL 7.3 to RHEL 7.4, whilst also pulling in OVS from our RHOSP 11 repos in a bid to limit the number of reboots required in the environment. In practice, operators could perform this minor update well ahead of any skip level upgrade, reducing its impact on the overall time required for the upgrade itself.
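To make this concrete, here’s a minimal Ansible sketch of what such a rolling update could look like. The overcloud group name is an assumption, and the repository changes needed to pull in the newer OVS are elided:

# Hypothetical playbook: walk the environment one host at a time, so the
# data plane never loses more than a single member during the update.
- hosts: overcloud
  serial: 1
  become: true
  tasks:
    - name: Apply the minor OS update (RHEL 7.3 -> 7.4)
      yum:
        name: '*'
        state: latest

    - name: Reboot to pick up the new kernel and OVS
      reboot:
        reboot_timeout: 600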
Disable control plane and compute services
As listed above under requirements, a full control plane outage is accounted for during the skip level upgrade. Note that this does not include the infrastructure services providing the database, messaging queues etc. Compute services are also stopped at this time, but this should not have any impact on the running workloads and data plane.
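As an illustration, the service stop might look something like the sketch below; the service lists are trimmed and the group names (controllers, computes) are assumptions:

# Hypothetical tasks: stop the OpenStack control plane services, leaving
# the infrastructure services (galera, rabbitmq etc) running throughout.
- hosts: controllers
  become: true
  tasks:
    - name: Stop control plane services
      service:
        name: "{{ item }}"
        state: stopped
      loop:
        - openstack-nova-api
        - openstack-glance-api
        - neutron-server

# Stopping nova-compute only halts the agent; the libvirt guests it
# manages (the actual workloads) continue to run untouched.
- hosts: computes
  become: true
  tasks:
    - name: Stop the compute agent
      service:
        name: openstack-nova-compute
        state: stopped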
Upgrade a single controller to N+1 and then N+2
The main work of upgrading between releases is carried out on a single controller. Packages are updated, new services such as nova-placement are deployed as required, configuration files updated and DB migrations completed. This process is repeated on this host until we reach the target release.
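Sketched as Ansible, a single N to N+1 hop could look roughly like the following; the repository names match the RHOSP 9 to RHOSP 10 jump, and the list of migrations shown is illustrative rather than exhaustive. The same pattern is then re-run with the N+2 repositories to reach the target release:

# Hypothetical sketch of one N -> N+1 pass on the first controller.
- hosts: controller-0
  become: true
  tasks:
    - name: Switch from the N (RHOSP 9) to the N+1 (RHOSP 10) repositories
      command: >
        subscription-manager repos
        --disable=rhel-7-server-openstack-9-rpms
        --enable=rhel-7-server-openstack-10-rpms

    - name: Pull in the N+1 package set
      yum:
        name: '*'
        state: latest

    - name: Run the DB syncs and migrations
      command: "{{ item }}"
      loop:
        - keystone-manage db_sync
        - glance-manage db_sync
        - nova-manage api_db sync
        - nova-manage db sync
        - neutron-db-manage upgrade heads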
Upgrade remaining controllers directly to N+2
Once the single controller has been upgraded to our target release, we then skip any remaining controllers straight ahead to that release, updating packages, introducing new services and updating configuration files on these controllers as required.
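The key difference from the first controller is that there is no intermediate hop and no migrations to run, as the database is already at the target level. A sketch, again with assumed group and repository names:

# Hypothetical sketch: the remaining controllers jump straight from the
# N (RHOSP 9) to the N+2 (RHOSP 11) repositories in a single package move.
- hosts: controllers:!controller-0
  become: true
  tasks:
    - name: Switch directly to the N+2 repositories
      command: >
        subscription-manager repos
        --disable=rhel-7-server-openstack-9-rpms
        --enable=rhel-7-server-openstack-11-rpms

    - name: Update packages straight to the target release
      yum:
        name: '*'
        state: latest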
Upgrade remaining hosts to N+2
This is then repeated for any remaining hosts, such as computes, object storage hosts, etc. Again, this should not interrupt running workloads or the data plane.
Enable control plane and compute services
Once all hosts are updated to the target release, the control plane and compute services are restarted.
Verify workload availability during upgrade
During our PoC we ran multiple instances across various L2 and L3 networks, using Ansible to first launch and then later collect the results of asynchronous jobs (ping, ssh etc) that had been running between these instances during the upgrade.
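In outline, this looked something like the sketch below; the instances group and the peer_ip variable are assumptions, and the upgrade plays would run between the two plays shown:

# Hypothetical sketch: fire off a long-running ping before the upgrade,
# then collect it afterwards to show the data plane stayed up throughout.
- hosts: instances
  tasks:
    - name: Ping a peer instance for up to 4 hours (14400 packets at 1/s)
      command: ping -c 14400 {{ peer_ip }}
      async: 14500
      poll: 0
      register: ping_job

# ... the upgrade plays run here ...

- hosts: instances
  tasks:
    - name: Wait for and collect the ping results post upgrade
      async_status:
        jid: "{{ ping_job.ansible_job_id }}"
      register: ping_result
      until: ping_result.finished
      retries: 240
      delay: 60
      # Inspect ping_result.stdout afterwards for packet loss.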
Validate the post upgrade environment
Finally, Tempest was used to validate the end state of the environment post upgrade.
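For reference, a minimal sketch of such a Tempest run; generating a valid tempest.conf for the environment is elided here, and the undercloud host, stack user and workspace path are assumptions:

# Hypothetical sketch: smoke test the upgraded cloud with Tempest.
- hosts: undercloud
  tasks:
    - name: Create a Tempest workspace
      command: tempest init /home/stack/tempest-validate
      args:
        creates: /home/stack/tempest-validate/etc

    - name: Run the Tempest smoke suite against the upgraded environment
      command: tempest run --smoke
      args:
        chdir: /home/stack/tempest-validate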
After many iterations, the eventual introduction of Ansible to automate all the things, and much cursing at the amount of time it took to reconfigure the environment after each run, it was agreed that the team would move on to look into a possible implementation of the above in TripleO.
PTG
Thanks for making it this far! Before I end this post I wanted to highlight that I’ve recently agreed to lead the skip level upgrades room at the upcoming Denver PTG. I’ll be posting more details on this shortly to the OpenStack development mailing list, but wanted to take this opportunity to encourage anyone interested in this topic to attend and discuss possible ways we can make skip level upgrades a reality across the various deployment tools within the community.