OpenStack - Skip level upgrades - PTG


A short reminder that I’ll be chairing the skip-level upgrades room at next week’s OpenStack PTG in Denver. So far ~15 of you have shown interest in this track on the etherpad so I’m looking forward to some useful discussions over the two days. For now we still have available slots so if you do have suggestions please feel free to add them directly on the pad!

At present the agenda for the room (Durango, Atrium level) looks like this:


  • 09:00 - 10:00 - #####
  • 10:00 - 10:30 - Retrospective of what was discussed in Boston, outcomes, etc.
  • 10:30 - 11:00 - Have operator requirements changed since Boston?
  • 11:00 - 14:00 - #####
  • 14:00 - 16:00 - What efforts (if any) are underway to enable skip level upgrades within the community?
  • 16:00 - 18:00 - #####


  • 09:00 - 10:30 - #####
  • 10:30 - 11:00 - NFV considerations
  • 11:00 - 11:30 - API versions control
  • 11:30 - 14:00 - #####
  • 14:00 - 16:00 - How can we collaborate and share tools for skip level upgrades within the community?
  • 16:00 - 18:00 - Should we think about a different way of releasing?


Later in the week I will also be participating in the TripleO track, with a session on Thursday to discuss my WIP skip-level upgrade spec. I’ll be working on this during the week leading up to this session so feel free to review this ahead of time or just grab me in the hallway for a chat if this is something that interests you!


OpenStack - Skip level upgrades - Introduction

I’ve been fortunate enough to be part of a team looking into skip level upgrades recently ahead of the start of the Queens development cycle for OpenStack. What follows is an introduction to the concept of skip level upgrades and an overview of our initial PoC work in this area. Future posts will also cover our plans for enabling skip level upgrades within TripleO and possible work with the wider community to enable this within other deployment tools.


Skip level upgrades are, as the name suggests, upgrades that move an environment from release N to N+X in a single step, where X is greater than 1 and is typically 3. For example, in the context of OpenStack, N to N+3 can refer to an upgrade from the Newton release of OpenStack to the Queens release, skipping Ocata and Pike:

Newton    Ocata     Pike       Queens
+-----+   +-----+   +-----+    +-----+
|     |   | N+1 |   | N+2 |    |     |
|  N  | ---------------------> | N+3 |
|     |   |     |   |     |    |     |
+-----+   +-----+   +-----+    +-----+

There are existing alternative methods available for skipping a number of releases during an upgrade. For example, parallel cloud migration is a commonly cited alternative. This is where an additional environment is stood up alongside the original, with workloads migrated to the new environment:

        |     |
env#1   |  N  |
        |     |
           \       Queens
            \      +-----+
             \     |     |
env#2         `->  | N+3 |
                   |     |

The requirement for this type of upgrade is driven by users looking to standardise on a given release (typically LTS), whilst retaining the ability to skip forward when the release hits EOL. This removes the need to keep up with the major release cycle, which in the case of OpenStack continues to be every 6 months.

It is worth highlighting that the topic of skip level upgrades is not new to the OpenStack community. There have been attempts to provide skip level upgrade functionality before now, typically within the various deployment projects. One example is openstack-ansible’s leap-upgrades project, which attempted to move environments between Juno/Kilo and Newton.

More recently the topic of skip level upgrades was discussed at the OpenStack Forum in Boston in May. An RFC thread was also posted to the development mailing list, but no formal actions came of either discussion. I’m looking to restart this discussion at the next PTG in Denver, more on that later.


Now that we understand what skip level upgrades actually are, it’s time to set out some basic requirements for the state of the environment during the upgrade. At the start of this process our team sat down and drafted the following:

  • The control plane is inaccessible for the duration of the upgrade
  • The upgrade must complete successfully or roll back within 4 hours
  • The data plane and workloads must remain available for the duration of the upgrade

Proof of concept

With the requirements set out, our first real task was to prove that this was even possible with an OpenStack environment. Given the releases available at the time, we began by manually upgrading an existing Mitaka based RHOSP 9 environment running on RHEL 7.3 to our recently released Ocata based RHOSP 11 release running on RHEL 7.4.

Mitaka    Newton    Ocata
+-----+   +-----+   +-----+
|     |   | N+1 |   |     |
|  N  | ----------> | N+2 |
|     |   |     |   |     |
+-----+   +-----+   +-----+
RHEL73              RHEL74

We were aware that whilst the goal of skip level upgrades is to give the impression of a single jump between releases, in practice this isn’t possible with OpenStack. Upgrades of OpenStack components are verified by the community across N to N+1 jumps, so whilst we wanted to skip ahead to Ocata we knew we would also have to upgrade through Newton to get there.
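
To make the point about stepping through every intermediate release concrete, here is a minimal Python sketch. It is only an illustration and not part of our PoC tooling: the `RELEASES` list and the `upgrade_path` helper are hypothetical names, and the list only covers the releases mentioned in this post.

```python
# Release names in order; only covers the releases mentioned in this post.
RELEASES = ["Juno", "Kilo", "Liberty", "Mitaka", "Newton", "Ocata", "Pike", "Queens"]


def upgrade_path(current, target):
    """Return every release that must be stepped through, ending at target."""
    start = RELEASES.index(current)
    end = RELEASES.index(target)
    if end <= start:
        raise ValueError(f"{target} is not newer than {current}")
    # Each N to N+1 jump still has to happen, even if the operator
    # only ever sees the environment at the final release.
    return RELEASES[start + 1 : end + 1]


print(upgrade_path("Mitaka", "Ocata"))   # ['Newton', 'Ocata']
print(upgrade_path("Newton", "Queens"))  # ['Ocata', 'Pike', 'Queens']
```

So the Mitaka to Ocata PoC is really two upgrades carried out back to back, and a Newton to Queens upgrade would be three.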

The following outlines, at a very high level, the steps we followed during the PoC to upgrade the environment from Mitaka to Ocata:

  • Rolling minor update of the underlying OS
  • Disable control plane and compute services
  • Upgrade a single controller to N+1 and then N+2
    • Update packages
    • Introduce new services as required (nova-placement for example)
    • Update service configuration files
    • Run DB syncs, migrations etc
    • Repeat for N+2
  • Upgrade remaining controllers directly to N+2
    • Update packages
    • Introduce new services as required
    • Update service configuration files
  • Upgrade remaining hosts to N+2
    • Update packages
    • Update service configuration files
  • Enable control plane and compute services
  • Verify workload availability during upgrade
  • Validate the post upgrade environment
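
The ordering of the steps above can be sketched as a small Python function. This is a rough illustration under stated assumptions and not the automation we actually used: `build_plan`, the host names and the action strings are all hypothetical, and nothing is executed against real hosts.

```python
def build_plan(path, controllers, other_hosts):
    """Return the ordered list of (host, action) steps for the upgrade.

    path        -- releases to step through, e.g. ["Newton", "Ocata"]
    controllers -- controller hostnames; the first one runs the DB work
    other_hosts -- computes, object storage hosts etc
    """
    target = path[-1]
    first, rest = controllers[0], controllers[1:]
    plan = [("all", "rolling minor OS update"),
            ("all", "disable control plane and compute services")]
    # The first controller steps through every release, running DB syncs.
    for release in path:
        plan += [(first, f"update packages to {release}"),
                 (first, f"introduce new services for {release}"),
                 (first, f"update configuration for {release}"),
                 (first, f"run DB syncs and migrations for {release}")]
    # Remaining controllers jump straight to the target release.
    for host in rest:
        plan += [(host, f"update packages to {target}"),
                 (host, f"introduce new services for {target}"),
                 (host, f"update configuration for {target}")]
    # Computes and other hosts also jump straight to the target release.
    for host in other_hosts:
        plan += [(host, f"update packages to {target}"),
                 (host, f"update configuration for {target}")]
    plan += [("all", "enable control plane and compute services"),
             ("all", "collect workload availability results"),
             ("all", "validate the post upgrade environment")]
    return plan


for host, action in build_plan(["Newton", "Ocata"], ["ctrl0", "ctrl1"], ["cmp0"]):
    print(f"{host:6} {action}")
```

The point to notice is that only the first controller ever sees the intermediate release; every other host moves directly to the target.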

Let’s take a look at each of these steps below in more detail.

Rolling minor update of the underlying OS

This initial rolling minor update moved hosts from RHEL 7.3 to RHEL 7.4 whilst also pulling in OVS from our RHOSP 11 repos in a bid to limit the number of reboots required in the environment. In practice operators could perform this minor update well ahead of any skip level upgrade, reducing any impact on the overall time required for the upgrade itself.

Disable control plane and compute services

As listed above under requirements, a full control plane outage is accounted for during the skip level upgrade. Note that this does not include the infrastructure services providing the database, messaging queues etc. Compute services are also stopped at this time but should not have any impact on the running workloads and data plane.

Upgrade a single controller to N+1 and then N+2

The main work of upgrading between releases is carried out on a single controller. Packages are updated, new services such as nova-placement are deployed as required, configuration files updated and DB migrations completed. This process is repeated on this host until we reach the target release.

Upgrade remaining controllers directly to N+2

Once the single controller has been upgraded to our target release we then skip any remaining controllers ahead to this target release, updating packages, introducing new services and updating configuration files on these controllers as required.

Upgrade remaining hosts to N+2

This is then repeated for any remaining hosts, such as computes, object storage hosts etc. Again this should not interrupt running workloads or the data plane.

Enable control plane and compute services

Once all hosts are updated to the target release the control plane and compute services are restarted.

Verify workload availability during upgrade

During our PoC we ran multiple instances across various L2 and L3 networks, using Ansible to first launch and then later collect the results of asynchronous jobs (ping, ssh etc) that had been running between these instances during the upgrade.
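
As a hedged illustration of how results like these could be summarised, the Python sketch below finds the longest connectivity gap in a series of probe results. The sample format (a timestamp in seconds plus a success flag) and the `longest_outage` helper are assumptions made for this example; in the PoC the jobs were launched and collected with Ansible.

```python
def longest_outage(samples):
    """Return the longest run of consecutive failures, in seconds.

    samples -- list of (timestamp_seconds, succeeded) tuples sorted by time.
    An outage is measured from the first failed probe to the next success
    (or to the last sample if connectivity never returns).
    """
    longest = 0
    outage_start = None
    last_ts = None
    for ts, ok in samples:
        if not ok and outage_start is None:
            outage_start = ts  # first failure of a new outage
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None
        last_ts = ts
    if outage_start is not None:
        # Still failing when the samples end.
        longest = max(longest, last_ts - outage_start)
    return longest


samples = [(0, True), (1, True), (2, False), (3, False), (4, True), (5, True)]
print(longest_outage(samples))  # 2
```

A result of 0 for every pair of instances would mean the data plane requirement above was met for the whole run.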

Validate the post upgrade environment

Finally Tempest was used to validate the end state of the environment post upgrade.

After many iterations, the eventual introduction of Ansible to automate all the things, and much cursing at the amount of time needed to reconfigure the environment after each run, it was agreed that the team would move on to look into a possible implementation of the above in TripleO.


Thanks for making it this far! Before I end this post I wanted to highlight that I’ve recently agreed to lead the skip level upgrades room at the upcoming Denver PTG. I’ll be posting more details on this shortly to the OpenStack development mailing list but wanted to take this opportunity to encourage anyone interested in this topic to attend and discuss possible ways we can make skip level upgrades a possibility across the various deployment tools within the community.


16 bridges charity walk

As I alluded to in my opening post, Katie and I are raising money for Cardiomyopathy UK over the coming months, starting with a walk through London in September. But why, you might ask, are we doing this?

Well, the charity held a conference that we both attended shortly after I was hospitalised in August 2016. Thankfully that episode has since been deemed to simply be atrial fibrillation that my ICD mistook for ventricular fibrillation, a very serious and potentially life-threatening issue.

At the time we were both struggling to deal with the reality of my long suspected, but never fully confirmed, condition ARVC presenting itself. The Cardiomyopathy UK National Conference in November of last year was brilliant and eye-opening, and helped greatly during that time. Now that things aren’t so bleak we wanted to give something back and this walk is the first step (ha!) in achieving that.

We’ve just been given the 25km route that I’ve included below:

We’ve also started training with a few slow jaunts around Hereford:

Finally, if you would like to donate, it would be very much appreciated. Your money will help people like us, going through scary and confusing times, to access information, support and advice, as well as contributing to training for medical professionals. We are gratefully accepting donations via our page, with our current progress towards our goal shown below:


Hey, it has been a while..

So it has been a while since I published anything on my original WordPress based blog, so long in fact that I’ve decided to do away with the old and move onto something new, shiny and statically generated. All previously written and frankly rather embarrassing content has been deleted and lost forever.

This new blog, for anyone that is interested, is now hosted on GitLab pages and statically generated using Hugo. I hopefully have some kind of flair in the mail for switching away from WordPress to these new shiny things, I could always use more flair.

Anyway, moving forward this blog is going to capture some of my current work around OpenStack, charity events for Cardiomyopathy UK, hobby projects such as automating my home with Home Assistant, and ramblings about the software industry, while also documenting my numerous spelling and grammatical failures forever.

That’s enough text for this test post, on with the actual content!
