I’ve been fortunate enough to be part of a team looking into skip level
upgrades recently ahead of the start of the Queens development cycle for
OpenStack. What follows is an introduction to the concept of skip level
upgrades and an overview of our initial PoC work in this area. Future posts
will also cover our plans for enabling skip level upgrades within TripleO and
possible work with the wider community to enable this within other deployment
tools.
Introduction
Skip level upgrades are, as the name suggests, upgrades that move an
environment from release N to N+X in a single step, where X is greater than 1
and, for skip level upgrades, is typically 3. For example, in the context of
OpenStack, N to N+3 can refer to an upgrade from the Newton release of
OpenStack to the Queens release, skipping Ocata and Pike:
Newton         Ocata         Pike          Queens
+-----+       +-----+       +-----+       +-----+
|     |       | N+1 |       | N+2 |       |     |
|  N  | --------------------------------> | N+3 |
|     |       |     |       |     |       |     |
+-----+       +-----+       +-----+       +-----+
There are existing alternative methods available for skipping a number of
releases during an upgrade. For example, parallel cloud migration is a commonly
cited alternative. This is where an additional environment is stood up
alongside the original, with workloads migrated to the new environment:
           Newton
           +-----+
           |     |
env#1      |  N  |
           |     |
           +-----+
    ------------------------------------
          \                 Queens
           \                +-----+
            \               |     |
env#2        `------------> | N+3 |
                            |     |
                            +-----+
The requirement for this type of upgrade is driven by users looking to
standardise on a given release (typically LTS), whilst retaining the ability to
skip forward when the release hits EOL. This negates the need to keep up with
the major release cycle, which in the case of OpenStack continues to be every 6
months.
It is worth highlighting that the topic of skip level upgrades is not new to
the OpenStack community; there have been attempts to provide skip level
upgrade functionality before now, typically within the various deployment
projects. For example, openstack-ansible’s leap-upgrades project attempted to
move environments between Juno/Kilo and Newton. More recently, skip level
upgrades were discussed at the OpenStack Forum in Boston in May, and an RFC
thread was posted to the development mailing list, however no formal actions
came of either discussion. I’m looking to restart this discussion at the next
PTG in Denver; more on that later.
Requirements
Now that we understand what skip level upgrades actually are, it’s time to set
out some basic requirements for the state of the environment during the
upgrade. At the start of this process our team sat down and drafted the
following:
- The control plane is inaccessible for the duration of the upgrade
- The upgrade must complete successfully or roll back within 4 hours
- The data plane and workloads must remain available for the duration of the upgrade
Proof of concept
With the requirements set out, our first real task was to prove that this was
even possible with an OpenStack environment. Given the releases available at
the time, we began by manually upgrading an existing Mitaka based RHOSP 9
environment running on RHEL 7.3 to our recently released Ocata based RHOSP 11
running on RHEL 7.4.
Mitaka        Newton         Ocata
+-----+       +-----+       +-----+
|     |       | N+1 |       |     |
|  N  | ------------------> | N+2 |
|     |       |     |       |     |
+-----+       +-----+       +-----+
RHOSP9        RHOSP10       RHOSP11
RHEL73                      RHEL74
We were aware that whilst the goal of skip level upgrades is to give the
impression of a single jump between releases, in practice this isn’t possible
with OpenStack. Upgrades of OpenStack components are verified by the community
across N to N+1 jumps, so whilst we wanted to skip ahead to Ocata we knew we
would also have to upgrade through Newton to get there.
The following outlines, at a very high level, the steps we followed during the
PoC to upgrade the environment from Mitaka to Ocata:
- Rolling minor update of the underlying OS
- Disable control plane and compute services
- Upgrade a single controller to N+1 and then N+2
  - Update packages
  - Introduce new services as required (nova-placement for example)
  - Update service configuration files
  - Run DB syncs, migrations etc
  - Repeat for N+2
- Upgrade remaining controllers directly to N+2
  - Update packages
  - Introduce new services as required
  - Update service configuration files
- Upgrade remaining hosts to N+2
  - Update packages
  - Update service configuration files
- Enable control plane and compute services
- Verify workload availability during upgrade
- Validate the post upgrade environment
Let’s take a look at each of these steps below in more detail.
Rolling minor update of the underlying OS
This initial rolling minor update moved hosts from RHEL 7.3 to RHEL 7.4 whilst
also pulling in OVS from our RHOSP11 repos in a bid to limit the number of
reboots required in the environment. In practice operators could perform this
minor update well ahead of any skip level upgrade, reducing any impact on the
overall time required for the upgrade itself.
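As a rough sketch of the idea, driven with Ansible as we eventually did for the
rest of the PoC (the overcloud host group, package selection and reboot
handling below are assumptions, not the exact plays we ran):

    # Rolling minor OS update, one host at a time, with the RHOSP 11 repos
    # already enabled so the newer OVS is pulled in during the same pass.
    - hosts: overcloud
      serial: 1
      become: true
      tasks:
        - name: Update the host to the latest RHEL 7.4 packages
          yum:
            name: '*'
            state: latest

        - name: Pull in the newer openvswitch from the RHOSP 11 repos
          yum:
            name: openvswitch
            state: latest

        - name: Reboot the host to pick up the new kernel and OVS
          shell: sleep 2 && shutdown -r now "minor update reboot"
          async: 1
          poll: 0

        - name: Wait for the host to come back before moving to the next one
          wait_for_connection:
            delay: 30
            timeout: 600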
Disable control plane and compute services
As listed above under requirements, a full control plane outage is accounted
for during the skip level upgrade. Note that this does not include the
infrastructure services providing the database, messaging queues etc. Compute
services are also stopped at this time but should not have any impact on the
running workloads and data plane.
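To give a feel for what this involves, the shutdown amounts to stopping the
OpenStack services with something along the following lines (an illustrative
sketch only; the exact unit names depend on the services deployed on each
role):

    # Stop the OpenStack control plane, leaving galera, rabbitmq and the other
    # infrastructure services running. Compute agents are stopped too, which
    # does not affect instances already running on the hypervisors.
    - hosts: controllers
      become: true
      tasks:
        - name: Stop the control plane services
          service:
            name: "{{ item }}"
            state: stopped
          with_items:
            - openstack-nova-api
            - openstack-nova-scheduler
            - openstack-nova-conductor
            - openstack-glance-api
            - neutron-server

    - hosts: computes
      become: true
      tasks:
        - name: Stop the compute agent
          service:
            name: openstack-nova-compute
            state: stopped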
Upgrade a single controller to N+1 and then N+2
The main work of upgrading between releases is carried out on a single
controller. Packages are updated, new services such as nova-placement are
deployed as required, configuration files updated and DB migrations completed.
This process is repeated on this host until we reach the target release.
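As a sketch of what one of those hops looks like on the first controller (the
repository names, the list of db sync commands and the host naming here are
illustrative and depend on which services are deployed):

    # First hop: Mitaka (RHOSP 9) to Newton (RHOSP 10) on a single controller.
    # The same pattern is then repeated against the Ocata (RHOSP 11) repos.
    - hosts: controller-0
      become: true
      tasks:
        - name: Switch this host to the Newton repositories
          command: >
            subscription-manager repos
            --disable=rhel-7-server-openstack-9-rpms
            --enable=rhel-7-server-openstack-10-rpms

        - name: Update the OpenStack packages to the new release
          yum:
            name: '*'
            state: latest

        - name: Apply keystone database migrations
          command: keystone-manage db_sync

        - name: Apply nova database migrations
          command: nova-manage db sync

        - name: Apply neutron database migrations
          command: neutron-db-manage upgrade heads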
Upgrade remaining controllers directly to N+2
Once the single controller has been upgraded to our target release we then
skip any remaining controllers straight ahead to that release, updating
packages, introducing new services and updating configuration files on these
controllers as required.
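Since the database migrations have already been applied from the first
controller, the remaining controllers only need the package and configuration
changes and can move straight to the target repos; roughly (again an
illustrative sketch):

    # Remaining controllers skip Newton entirely and go straight to Ocata.
    - hosts: controllers:!controller-0
      become: true
      tasks:
        - name: Switch directly to the Ocata repositories
          command: >
            subscription-manager repos
            --disable=rhel-7-server-openstack-9-rpms
            --enable=rhel-7-server-openstack-11-rpms

        - name: Update the OpenStack packages to the target release
          yum:
            name: '*'
            state: latest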
Upgrade remaining hosts to N+2
This is then repeated for any remaining hosts, such as computes, object storage
hosts etc. Again this should not interrupt running workloads or the data plane.
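On the computes the interesting part is the configuration update, since
Ocata’s nova-compute needs to know about the newly introduced placement
service. A sketch of that piece (the endpoint, credentials and values below
are placeholders, not our actual configuration):

    # After the package update, point nova-compute at the placement service.
    - hosts: computes
      become: true
      tasks:
        - name: Update the compute packages to the target release
          yum:
            name: '*'
            state: latest

        - name: Configure the [placement] section of nova.conf
          ini_file:
            dest: /etc/nova/nova.conf
            section: placement
            option: "{{ item.option }}"
            value: "{{ item.value }}"
          with_items:
            - { option: auth_type, value: password }
            - { option: auth_url, value: "http://192.168.24.10:35357/v3" }
            - { option: project_name, value: service }
            - { option: username, value: placement }
            - { option: password, value: "PLACEMENT_PASSWORD" }
            - { option: os_region_name, value: regionOne }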
Enable control plane and compute services
Once all hosts are updated to the target release the control and compute
services are restarted.
Verify workload availability during upgrade
During our PoC we ran multiple instances across various L2 and L3 networks,
using Ansible to first launch and then later collect the results of
asynchronous jobs (ping, ssh etc) that had been running between these
instances during the upgrade.
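The mechanics of this are essentially Ansible’s async/poll pattern: kick off a
long-running job before the upgrade starts and collect it afterwards. A
trimmed-down sketch (the instance group, target address and timings are made
up for illustration):

    # Before the upgrade: start a long-running ping from each test instance
    # towards a neighbour on another network, without waiting for it.
    - hosts: test-instances
      tasks:
        - name: Start a four hour ping towards a neighbouring instance
          command: ping -c 14400 -i 1 192.168.100.10
          async: 14500
          poll: 0
          register: ping_job

        - name: Remember the async job id for later collection
          set_fact:
            ping_jid: "{{ ping_job.ansible_job_id }}"

    # ... the upgrade itself happens here ...

    # After the upgrade: collect the results and inspect them for packet loss.
    - hosts: test-instances
      tasks:
        - name: Collect the results of the ping job
          async_status:
            jid: "{{ ping_jid }}"
          register: ping_result
          until: ping_result.finished
          retries: 60
          delay: 60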
Validate the post upgrade environment
Finally, Tempest was used to validate
the end state of the environment post upgrade.
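For completeness, that validation is essentially a Tempest run against the
upgraded cloud, along the lines of the following (the workspace path and test
selection are arbitrary):

    # Run the Tempest smoke tests against the upgraded environment.
    - hosts: undercloud
      tasks:
        - name: Run a smoke-level Tempest pass
          command: tempest run --regex '\[.*\bsmoke\b.*\]' --concurrency 4
          args:
            chdir: /home/stack/tempest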
After many iterations, the eventual introduction of Ansible to automate all
the things, and much cursing at the amount of time it took to reconfigure the
environment after each run, it was agreed that the team would move on to look
into a possible implementation of the above in TripleO.
PTG
Thanks for making it this far! Before I end this post I wanted to highlight
that I’ve recently agreed to lead the skip level upgrades
room at the
upcoming Denver PTG. I’ll be posting more
details on this shortly to the OpenStack development mailing list but wanted to
take this opportunity to encourage anyone interested in this topic to attend
and discuss possible ways we can make skip level upgrades a possibility across
the various deployment tools within the community.