This is a brief progress report from the Upgrades squad for the fast-forward
upgrades (FFU) feature in TripleO, introducing N to Q upgrades.
tl;dr Good initial progress, missed M2 goal of nv CI jobs, pushing on to M3.
Overview
For anyone unfamiliar with the concept of fast-forward upgrades the following
sentence from the spec gives a brief high level introduction:
> Fast-forward upgrades are upgrades that move an environment from release `N`
> to `N+X` in a single step, where `X` is greater than `1` and for fast-forward
> upgrades is typically `3`.
The spec itself obviously goes into more detail and I’d recommend anyone
wanting to know more about our approach for FFU in TripleO to start there:
Note that the spec is being updated at present by the following change,
introducing more details on the FFU task layout, ordering, dependency on
the on-going major upgrade rework in Q, canary compute validation etc:
The original goal for Queens M2 was to have one or more non-voting FFU jobs
deployed somewhere able to run through the basic undercloud and overcloud
upgrade workflows, exercising as many compute service dependencies as we could
up to and including Nova. Unfortunately while Sofer has made some great
progress with this we do not have any running FFU jobs at present:
FWIW getting these initial changes merged would help avoid the current change
storm every time this series is rebased to pick up upgrade or deploy related
bug fixes.
Also note that the demos currently use the raw Ansible playbook stack outputs
to run through the FFU tasks, upgrade tasks and deploy tasks. This is by no
means the final UX; python-tripleoclient and workflow work still needs to be
completed ahead of M3.
M3 Goals
The squad will be focusing on the following goals for M3:
Non-voting RDO CI jobs defined and running
FFU THT changes tested by the above jobs and merged
Finally, a quick note to highlight that this report marks the end of my own
personal involvement with the FFU feature in TripleO. I’m not going far,
returning to work on Nova and happy to make time to talk about and review FFU
related changes etc. The members of the upgrade squad taking this forward and
your main points of contact for FFU in TripleO will be:
Sofer (chem)
Lukas (social)
Marios (marios)
My thanks again to Sofer, Lukas, Marios, the rest of the upgrade squad and
wider TripleO community for your guidance and patience when putting up with my
constant inane questioning regarding FFU over the past few months!
Update 04/12/17: The initial deployment documented in this demo no longer
works due to the removal of a number of plan migration steps that have now
been promoted into the Queens repos. We are currently looking into ways to
reintroduce these for use in master UC / Newton OC FFU development
deployments. Until then, anyone attempting to run through this demo should
start with a Newton OC and UC before upgrading the UC to master.
This is another TripleO fast-forward upgrade demo post, this time focusing on a
basic stack of Keystone, Glance, Cinder, Neutron and Nova. At present there are
several workarounds still required to allow the upgrade to complete; please see
the workaround sections for more details.
Again with this demo we are not caching containers locally. The following
command will create a docker_registry.yaml file referencing the RDO registry
for use during the final deployment of the overcloud to Queens:
Finally, as we are using a customised controller role the following services
need to be added to the overcloud_services.yml file on the undercloud node
under ControllerServices:
At present we are waiting for a promotion of tripleo-common that includes
various bugfixes when updating the overcloud stack, generating outputs etc. For
the time being we can simply install directly from master to work around these
issues.
$ ssh -F $WD/ssh.config.ansible undercloud
$ git clone https://github.com/openstack/tripleo-common.git ; cd tripleo-common
$ sudo python setup.py install ; cd ~
OC - Update heat-agents
As documented in my previous demo post we need to remove any legacy hieradata
from all overcloud hosts prior to updating the heat stack:
With the workarounds in place we can now update the stack using the updated
version of tripleo-heat-templates on the undercloud. Once again we need to use
the original deploy command with a number of additional environment files
included:
Once the stack has been updated we can download the config with the following
command:
$ . stackrc
$ openstack overcloud config download
The TripleO configuration has been successfully generated into: /home/stack/tripleo-Oalkee-config
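The generated directory name changes between runs, so it can be handy to capture it rather than copy it by hand. The following is a minimal sketch that assumes the success message is written to stdout and keeps the exact format shown above; `config_dir_from_output` is a hypothetical helper name, not part of python-tripleoclient.

```shell
# config_dir_from_output: read the output of
# `openstack overcloud config download` on stdin and print the directory
# named in the success message. Hypothetical helper; assumes the message
# format shown in this post.
config_dir_from_output() {
    sed -n 's/^The TripleO configuration has been successfully generated into: //p'
}

# Example (not run here):
#   CONFIG_DIR=$(openstack overcloud config download | config_dir_from_output)
```

The captured path can then be reused when pointing ansible-playbook at the generated playbooks.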
UC - FFU and Upgrade plays
Before running through any of the generated playbooks I personally like to add
the profile_tasks callback to the callback_whitelist for Ansible within
/etc/ansible/ansible.cfg. This provides timestamps during the playbook run
and a summary of the slowest tasks at the end.
# enable callback plugins, they can output to stdout but cannot be 'stdout' type.
callback_whitelist = profile_tasks
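A quick way to confirm the callback is really enabled before starting a long playbook run is sketched below. The `profile_tasks_enabled` function name is an assumption made for this example; it only looks for an uncommented callback_whitelist line that mentions profile_tasks.

```shell
# profile_tasks_enabled CFG: succeed if CFG contains an uncommented
# callback_whitelist line that includes profile_tasks.
# Hypothetical helper for illustration only.
profile_tasks_enabled() {
    grep -Eq '^[[:space:]]*callback_whitelist[[:space:]]*=.*profile_tasks' "$1"
}

# Example (not run here):
#   profile_tasks_enabled /etc/ansible/ansible.cfg && echo "profiling enabled"
```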
We first run the fast_forward_upgrade_playbook to complete the upgrade to Pike:
We then run the upgrade_steps_playbook to start the upgrade to Queens:
$ . stackrc
$ ansible-playbook -i /usr/bin/tripleo-ansible-inventory \
/home/stack/tripleo-Oalkee-config/upgrade_steps_playbook.yaml
[..]
PLAY RECAP *****************************************************************************************************************************
192.168.24.11 : ok=57 changed=45 unreachable=0 failed=0
192.168.24.16 : ok=165 changed=146 unreachable=0 failed=0
Friday 01 December 2017 20:51:55 +0000 (0:00:00.038) 0:10:47.865 *******
===============================================================================
Update all packages ----------------------------------------------------------------------------------------------------------- 263.71s
Update all packages ----------------------------------------------------------------------------------------------------------- 256.79s
Install docker packages on upgrade if missing ---------------------------------------------------------------------------------- 13.77s
Upgrade os-net-config ----------------------------------------------------------------------------------------------------------- 5.71s
Upgrade os-net-config ----------------------------------------------------------------------------------------------------------- 5.12s
Gathering Facts ----------------------------------------------------------------------------------------------------------------- 3.36s
Install docker packages on upgrade if missing ----------------------------------------------------------------------------------- 3.14s
Stop and disable mysql service -------------------------------------------------------------------------------------------------- 1.97s
Check for os-net-config upgrade ------------------------------------------------------------------------------------------------- 1.66s
Check for os-net-config upgrade ------------------------------------------------------------------------------------------------- 1.57s
Stop keepalived service --------------------------------------------------------------------------------------------------------- 1.48s
Stop and disable rabbitmq service ----------------------------------------------------------------------------------------------- 1.47s
take new os-net-config parameters into account now ------------------------------------------------------------------------------ 1.31s
take new os-net-config parameters into account now ------------------------------------------------------------------------------ 1.08s
Check if openstack-ceilometer-compute is deployed ------------------------------------------------------------------------------- 0.70s
Check if iscsid service is deployed --------------------------------------------------------------------------------------------- 0.67s
Start keepalived service -------------------------------------------------------------------------------------------------------- 0.48s
Check for nova placement running under apache ----------------------------------------------------------------------------------- 0.46s
Stop and disable mongodb service on upgrade ------------------------------------------------------------------------------------- 0.45s
remove old cinder cron jobs ----------------------------------------------------------------------------------------------------- 0.45s
$ ansible-playbook -i /usr/bin/tripleo-ansible-inventory \
/home/stack/tripleo-Oalkee-config/deploy_steps_playbook.yaml
[..]
PLAY RECAP *****************************************************************************************************************************
192.168.24.11 : ok=48 changed=11 unreachable=0 failed=0
192.168.24.16 : ok=76 changed=10 unreachable=0 failed=0
localhost : ok=1 changed=0 unreachable=0 failed=0
Friday 01 December 2017 21:04:58 +0000 (0:00:00.041) 0:10:24.723 *******
===============================================================================
Run docker-puppet tasks (generate config) ------------------------------------------------------------------------------------- 186.65s
Run docker-puppet tasks (bootstrap tasks) ------------------------------------------------------------------------------------- 101.10s
Start containers for step 3 ---------------------------------------------------------------------------------------------------- 98.61s
Start containers for step 4 ---------------------------------------------------------------------------------------------------- 41.37s
Run puppet host configuration for step 1 --------------------------------------------------------------------------------------- 32.53s
Start containers for step 1 ---------------------------------------------------------------------------------------------------- 25.76s
Run puppet host configuration for step 5 --------------------------------------------------------------------------------------- 17.91s
Run puppet host configuration for step 4 --------------------------------------------------------------------------------------- 14.47s
Run puppet host configuration for step 3 --------------------------------------------------------------------------------------- 13.41s
Run docker-puppet tasks (bootstrap tasks) -------------------------------------------------------------------------------------- 10.39s
Run puppet host configuration for step 2 --------------------------------------------------------------------------------------- 10.37s
Start containers for step 5 ---------------------------------------------------------------------------------------------------- 10.12s
Run docker-puppet tasks (bootstrap tasks) --------------------------------------------------------------------------------------- 9.78s
Start containers for step 2 ----------------------------------------------------------------------------------------------------- 6.32s
Gathering Facts ----------------------------------------------------------------------------------------------------------------- 4.37s
Gathering Facts ----------------------------------------------------------------------------------------------------------------- 3.46s
Write the config_step hieradata ------------------------------------------------------------------------------------------------- 1.80s
create libvirt persistent data directories -------------------------------------------------------------------------------------- 1.21s
Write the config_step hieradata ------------------------------------------------------------------------------------------------- 1.03s
Check if /var/lib/docker-puppet/docker-puppet-tasks4.json exists ---------------------------------------------------------------- 1.00s
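The three playbook runs above could be chained so that a failure in one stops the rest. This is only a sketch: the `ffu_playbooks` and `run_ffu` names are hypothetical, and the fast_forward_upgrade_playbook.yaml file name is assumed by analogy with the other two playbooks.

```shell
# ffu_playbooks DIR: print the generated playbooks in the order they are
# run in this demo. The first file name is an assumption.
ffu_playbooks() {
    for name in fast_forward_upgrade upgrade_steps deploy_steps; do
        echo "$1/${name}_playbook.yaml"
    done
}

# run_ffu DIR: run each playbook in turn, stopping at the first failure.
# Relies on word splitting, so DIR must not contain whitespace.
run_ffu() {
    for playbook in $(ffu_playbooks "$1"); do
        ansible-playbook -i /usr/bin/tripleo-ansible-inventory "$playbook" || return 1
    done
}

# Example (not run here):
#   run_ffu /home/stack/tripleo-Oalkee-config
```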
Verification
I’ll revisit this in the coming days and add a more complete set of tasks to
verify the end environment but for now we can run a simple boot from volume
instance (as Swift, the default store for Glance, was not installed):
$ cinder create 1
$ cinder set-bootable 46d278f7-31fc-4e45-b5df-eb8220800b1a true
$ nova flavor-create 1 1 512 1 1
$ nova boot --boot-volume 46d278f7-31fc-4e45-b5df-eb8220800b1a --flavor 1 test
[..]
$ nova list
+--------------------------------------+------+--------+------------+-------------+-------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+------+--------+------------+-------------+-------------------+
| 05821616-1239-4ca9-8baa-6b0ca4ea3a6b | test | ACTIVE | - | Running | priv=192.168.0.16 |
+--------------------------------------+------+--------+------------+-------------+-------------------+
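Rather than re-running the list command by hand until the instance goes ACTIVE, a small polling helper can do the waiting. The sketch below is generic shell; `wait_for_status` is a hypothetical name and the attempt count and delay in the example are arbitrary.

```shell
# wait_for_status EXPECTED ATTEMPTS DELAY COMMAND...: run COMMAND until its
# output equals EXPECTED, sleeping DELAY seconds after each unsuccessful try.
# Returns 0 on a match and 1 once ATTEMPTS tries have been used up.
wait_for_status() {
    expected="$1"; attempts="$2"; delay="$3"
    shift 3
    i=0
    while [ "$i" -lt "$attempts" ]; do
        status=$("$@" 2>/dev/null) || status=""
        [ "$status" = "$expected" ] && return 0
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Example (not run here), polling the instance booted above:
#   wait_for_status ACTIVE 30 10 openstack server show test -f value -c status
```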
We can also see the various containerised services running on the overcloud:
$ ssh -F $WD/ssh.config.ansible overcloud-controller-0
$ sudo docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d80d6f072604 trunk.registry.rdoproject.org/master/centos-binary-glance-api:tripleo-ci-testing "kolla_start" 13 minutes ago Up 12 minutes (healthy) glance_api
61fbf47241ce trunk.registry.rdoproject.org/master/centos-binary-nova-api:tripleo-ci-testing "kolla_start" 13 minutes ago Up 13 minutes nova_metadata
9defdb5efe0f trunk.registry.rdoproject.org/master/centos-binary-nova-api:tripleo-ci-testing "kolla_start" 13 minutes ago Up 13 minutes (healthy) nova_api
874716d99a44 trunk.registry.rdoproject.org/master/centos-binary-nova-novncproxy:tripleo-ci-testing "kolla_start" 13 minutes ago Up 13 minutes (healthy) nova_vnc_proxy
21ca0fd8d8ec trunk.registry.rdoproject.org/master/centos-binary-neutron-server:tripleo-ci-testing "kolla_start" 13 minutes ago Up 13 minutes neutron_api
e0eed85b860a trunk.registry.rdoproject.org/master/centos-binary-cinder-volume:tripleo-ci-testing "kolla_start" 13 minutes ago Up 13 minutes (healthy) cinder_volume
0882e08ac198 trunk.registry.rdoproject.org/master/centos-binary-nova-consoleauth:tripleo-ci-testing "kolla_start" 13 minutes ago Up 13 minutes (healthy) nova_consoleauth
e3ebc4b066c9 trunk.registry.rdoproject.org/master/centos-binary-nova-api:tripleo-ci-testing "kolla_start" 13 minutes ago Up 13 minutes nova_api_cron
c7d05a04a8a3 trunk.registry.rdoproject.org/master/centos-binary-cinder-api:tripleo-ci-testing "kolla_start" 13 minutes ago Up 13 minutes cinder_api_cron
2f3c1e244997 trunk.registry.rdoproject.org/master/centos-binary-neutron-openvswitch-agent:tripleo-ci-testing "kolla_start" 13 minutes ago Up 13 minutes (healthy) neutron_ovs_agent
bfeb120bf77a trunk.registry.rdoproject.org/master/centos-binary-neutron-metadata-agent:tripleo-ci-testing "kolla_start" 13 minutes ago Up 13 minutes (healthy) neutron_metadata_agent
43b2c09aecf8 trunk.registry.rdoproject.org/master/centos-binary-nova-scheduler:tripleo-ci-testing "kolla_start" 13 minutes ago Up 13 minutes (healthy) nova_scheduler
a7a3024b63f6 trunk.registry.rdoproject.org/master/centos-binary-neutron-dhcp-agent:tripleo-ci-testing "kolla_start" 13 minutes ago Up 13 minutes (healthy) neutron_dhcp
3df990a68046 trunk.registry.rdoproject.org/master/centos-binary-cinder-scheduler:tripleo-ci-testing "kolla_start" 13 minutes ago Up 13 minutes (healthy) cinder_scheduler
94461ba833aa trunk.registry.rdoproject.org/master/centos-binary-neutron-l3-agent:tripleo-ci-testing "kolla_start" 13 minutes ago Up 13 minutes (healthy) neutron_l3_agent
4bee34f9fce2 trunk.registry.rdoproject.org/master/centos-binary-cinder-api:tripleo-ci-testing "kolla_start" 13 minutes ago Up 13 minutes cinder_api
e8bec9348fe3 trunk.registry.rdoproject.org/master/centos-binary-nova-conductor:tripleo-ci-testing "kolla_start" 13 minutes ago Up 13 minutes (healthy) nova_conductor
22db40c25881 trunk.registry.rdoproject.org/master/centos-binary-keystone:tripleo-ci-testing "/bin/bash -c '/usr/l" 15 minutes ago Up 15 minutes keystone_cron
26769acaaf5e trunk.registry.rdoproject.org/master/centos-binary-keystone:tripleo-ci-testing "kolla_start" 16 minutes ago Up 16 minutes (healthy) keystone
99037a5e5c36 trunk.registry.rdoproject.org/master/centos-binary-iscsid:tripleo-ci-testing "kolla_start" 16 minutes ago Up 16 minutes iscsid
9f4aae72c201 trunk.registry.rdoproject.org/master/centos-binary-nova-placement-api:tripleo-ci-testing "kolla_start" 16 minutes ago Up 16 minutes nova_placement
311302abc297 trunk.registry.rdoproject.org/master/centos-binary-horizon:tripleo-ci-testing "kolla_start" 16 minutes ago Up 16 minutes horizon
d465e4f5b7e6 trunk.registry.rdoproject.org/master/centos-binary-mariadb:tripleo-ci-testing "kolla_start" 17 minutes ago Up 17 minutes (unhealthy) mysql
b9e062f1d857 trunk.registry.rdoproject.org/master/centos-binary-rabbitmq:tripleo-ci-testing "kolla_start" 18 minutes ago Up 18 minutes (healthy) rabbitmq
a57f053afc03 trunk.registry.rdoproject.org/master/centos-binary-memcached:tripleo-ci-testing "/bin/bash -c 'source" 18 minutes ago Up 18 minutes memcached
baeb6d1087e6 trunk.registry.rdoproject.org/master/centos-binary-redis:tripleo-ci-testing "kolla_start" 18 minutes ago Up 18 minutes redis
faafa1bf2d2e trunk.registry.rdoproject.org/master/centos-binary-haproxy:tripleo-ci-testing "kolla_start" 18 minutes ago Up 18 minutes haproxy
$ exit
$ ssh -F $WD/ssh.config.ansible overcloud-novacompute-0
$ sudo docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0363d7008e87 trunk.registry.rdoproject.org/master/centos-binary-neutron-openvswitch-agent:tripleo-ci-testing "kolla_start" 12 minutes ago Up 12 minutes (healthy) neutron_ovs_agent
c1ff23ee9f16 trunk.registry.rdoproject.org/master/centos-binary-cron:tripleo-ci-testing "kolla_start" 12 minutes ago Up 12 minutes logrotate_crond
d81d8207ec9a trunk.registry.rdoproject.org/master/centos-binary-nova-compute:tripleo-ci-testing "kolla_start" 12 minutes ago Up 12 minutes nova_migration_target
abd9b79e2af8 trunk.registry.rdoproject.org/master/centos-binary-ceilometer-compute:tripleo-ci-testing "kolla_start" 12 minutes ago Up 12 minutes ceilometer_agent_compute
aa581489ac9a trunk.registry.rdoproject.org/master/centos-binary-nova-compute:tripleo-ci-testing "kolla_start" 12 minutes ago Up 12 minutes (healthy) nova_compute
d4ade28175f0 trunk.registry.rdoproject.org/master/centos-binary-iscsid:tripleo-ci-testing "kolla_start" 14 minutes ago Up 14 minutes iscsid
ae4652853098 trunk.registry.rdoproject.org/master/centos-binary-nova-libvirt:tripleo-ci-testing "kolla_start" 14 minutes ago Up 14 minutes nova_libvirt
aac8fea2d496 trunk.registry.rdoproject.org/master/centos-binary-nova-libvirt:tripleo-ci-testing "kolla_start" 14 minutes ago Up 14 minutes nova_virtlogd
Conclusion
So in conclusion this demo takes a simple multi-host OpenStack deployment of
Keystone, Glance, Cinder, Neutron and Nova from baremetal Newton to
containerised Queens in ~26 minutes. There are many things still to resolve and
validate with FFU but for now, ahead of M2 this is a pretty good start.
This post will introduce a very rough demo of the new TripleO Fast-forward
Upgrades (FFU) feature, warts and all, using an overcloud with only Keystone
deployed.
This should prove to be a useful starting point for anyone interested in this
feature and could even be an approach used for future per-service FFU CI jobs.
Environment
I’m currently using the tripleo-quickstart project to deploy virtualised test
environments. For this demo I’m using the following command line to create the
demo environment:
Once deployed you should find the 10.0.3 Newton version of Keystone deployed on
overcloud-controller-0:
$ ssh -F $WD/ssh.config.ansible overcloud-controller-0
[..]
$ rpm -qi openstack-keystone
Name : openstack-keystone
Epoch : 1
Version : 10.0.3
Release : 0.20170726120406.bd49c3e.el7.centos
Architecture: noarch
Install Date: Fri 10 Nov 2017 04:24:46 AM UTC
Group : Unspecified
Size : 175014
License : ASL 2.0
Signature : (none)
Source RPM : openstack-keystone-10.0.3-0.20170726120406.bd49c3e.el7.centos.src.rpm
Build Date : Wed 26 Jul 2017 12:07:53 PM UTC
Build Host : n30.pufty.ci.centos.org
Relocations : (not relocatable)
URL : http://keystone.openstack.org/
Summary : OpenStack Identity Service
Description :
Keystone is a Python implementation of the OpenStack
(http://www.openstack.org) identity service API.
Before starting the upgrade I recommend that snapshots of the undercloud and
overcloud-controller-0 libvirt domains are taken on the virthost:
$ ssh -F $WD/ssh.config.ansible virthost
$ for domain in $(virsh list | grep running | awk '{print $2 }'); do virsh snapshot-create-as ${domain} ${domain}_start ; done
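The same loop can be written without parsing the table output, since `virsh list --name --state-running` prints bare domain names. This is a sketch rather than a tested replacement; `snapshot_name` and `snapshot_running_domains` are hypothetical helper names that keep the `${domain}_start` naming used above.

```shell
# snapshot_name DOMAIN: name used for the pre-upgrade snapshot of DOMAIN.
snapshot_name() {
    echo "${1}_start"
}

# snapshot_running_domains: snapshot every running libvirt domain,
# stopping at the first failure. Word splitting drops the blank line that
# virsh prints at the end of its output.
snapshot_running_domains() {
    for domain in $(virsh list --name --state-running); do
        virsh snapshot-create-as "$domain" "$(snapshot_name "$domain")" || return 1
    done
}

# Example (not run here, on the virthost):
#   snapshot_running_domains
```

Should the upgrade go wrong, `virsh snapshot-revert ${domain} ${domain}_start` should return a domain to this starting point.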
UC - docker_registry.yaml
As with a normal container based deployment on >=Pike we will need a Docker
registry file mapping each service to a container image. The following command
will create this file, pointing to the official RDO registry:
Note that this will result in the container images being pulled from the remote
RDO registry during the upgrade. We can pre-cache these images on the
undercloud to speed the process up. However as we are only using a single host
and minimal number of services in this demo I have chosen to skip this for now.
UC - tripleo-heat-templates
FFU itself is controlled by an Ansible playbook using tasks that are contained
within the tripleo-heat-templates (THT) project. The following gerrit topic
lists all of the current FFU changes up for review:
We also need the following noop-deploy-steps.yaml environment file that allows
us to use openstack overcloud deploy to update the stack outputs of the
overcloud without forcing an actual redeploy of any resources:
Finally, as we have deployed a custom set of services for the Controller role
we now have to ensure that the Docker service is added to the role prior to our
upgrade:
An older os-apply-config hiera hook and any legacy hiera data needs to be
removed from the overcloud prior to our upgrade. The following ML post has
more details on this workaround:
For the time being this isn’t part of the upgrade playbook and so we need to
run the following commands that will update the heat-agents on the host to
their Ocata versions and remove the legacy data:
At present there is a packaging issue when upgrading the openstack-ceilometer
packages directly from Newton to Queens. As these packages are installed by
default in the Newton overcloud-full image used to deploy the environment but
not used in our demo we can simply remove them for the time being:
$ sudo yum remove -y 'openstack-ceilometer*'
UC - Update stack outputs
We can now use the openstack overcloud deploy command to update the overcloud
stack and generate the new stack outputs, including the FFU playbook. To do
this we simply add the previously created docker_registry.yaml,
environments/docker.yaml and environments/noop-deploy-steps.yaml environment
files to the original command used to deploy the environment.
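As a rough illustration of what gets appended, the sketch below builds the three extra `-e` arguments. The `ffu_env_args` name is hypothetical, and the file locations are assumptions: docker.yaml and noop-deploy-steps.yaml are taken to live under the environments directory of the templates checkout, in the order listed above.

```shell
# ffu_env_args THT_DIR REGISTRY_FILE: print the extra -e arguments to add to
# the original deploy command. Paths are assumptions made for this example.
ffu_env_args() {
    echo "-e $2 -e $1/environments/docker.yaml -e $1/environments/noop-deploy-steps.yaml"
}

# Example (not run here); <original arguments> stands for whatever was used
# to deploy the environment in the first place:
#   openstack overcloud deploy --templates "$THT" <original arguments> \
#       $(ffu_env_args "$THT" "$HOME/docker_registry.yaml")
```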
Now that the stack outputs have been updated we can download the overcloud
config containing the FFU playbook onto the undercloud:
$ openstack overcloud config download
There is a known issue with the generated upgrade tasks at the moment where the
ordering of conditionals causes Ansible to fail. To work around this, simply
edit the following Ansible tasks within the Controller/upgrade_tasks.yaml file
to ensure the step conditional is always checked first:
- block:
  - name: Upgrade os-net-config
    yum: name=os-net-config state=latest
  - changed_when: os_net_config_upgrade.rc == 2
    command: os-net-config --no-activate -c /etc/os-net-config/config.json -v --detailed-exit-codes
    failed_when: os_net_config_upgrade.rc not in [0,2]
    name: take new os-net-config parameters into account now
    register: os_net_config_upgrade
  tags: step3
  when:
  - step|int == 3
  - not os_net_config_need_upgrade.stdout and os_net_config_has_config.rc == 0
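To find the tasks that still need this edit, something like the following could be used to list every `when:` block whose first condition is not the step check. It is a sketch with hypothetical function names, and it assumes each `when:` sits on its own line with its conditions written as a YAML list on the following lines.

```shell
# first_when_conditions FILE: print the first list item under every `when:`
# key, prefixed with its line number.
first_when_conditions() {
    awk '
        /^[ \t]*when:[ \t]*$/ { pending = 1; next }
        pending && /^[ \t]*- / {
            sub(/^[ \t]*- /, "")
            print NR ": " $0
        }
        { pending = 0 }
    ' "$1"
}

# misordered_when FILE: keep only the entries whose first condition is not
# a step check.
misordered_when() {
    first_when_conditions "$1" | grep -v '^[0-9]*: step|int =='
}

# Example (not run here):
#   misordered_when Controller/upgrade_tasks.yaml
```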
UC - Run playbook
With the config present on the undercloud we can finally start the FFU upgrade
using the following command line:
Once the FFU upgrade is complete we can verify that Keystone is functional in
the overcloud with a few simple commands:
$ ssh -F $WD/ssh.config.ansible undercloud
$ . overcloudrc
$ openstack endpoint list
+----------------------------------+-----------+--------------+--------------+---------+-----------+----------------------------+
| ID | Region | Service Name | Service Type | Enabled | Interface | URL |
+----------------------------------+-----------+--------------+--------------+---------+-----------+----------------------------+
| 15fd404ff8c14971b4251b81624edab8 | regionOne | keystone | identity | True | admin | http://192.168.24.10:35357 |
| 2e513f5fdfc140ec916b081b47a2b8f7 | regionOne | keystone | identity | True | internal | http://172.16.2.12:5000 |
| 96980f0f9ac44c718c038ef54af814bc | regionOne | keystone | identity | True | public | http://10.0.0.8:5000 |
+----------------------------------+-----------+--------------+--------------+---------+-----------+----------------------------+
$ openstack service list
+----------------------------------+------------+----------+
| ID | Name | Type |
+----------------------------------+------------+----------+
| 3fc546421e9048f39b2b847b13fa8ea5 | keystone | identity |
| 7f819190dc6f44d8b995021277b24d67 | ceilometer | metering |
+----------------------------------+------------+----------+
We can also log into the overcloud-controller-0 host and verify that the
relevant containers are running:
$ ssh -F $WD/ssh.config.ansible overcloud-controller-0
$ sudo docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4f40f0cf98aa 192.168.24.1:8787/master/centos-binary-keystone:tripleo-ci-testing "/bin/bash -c '/usr/l" About a minute ago Up About a minute keystone_cron
0b9d5cc17f5d 192.168.24.1:8787/master/centos-binary-keystone:tripleo-ci-testing "kolla_start" About a minute ago Up About a minute (healthy) keystone
db967d899aaf 192.168.24.1:8787/master/centos-binary-mariadb:tripleo-ci-testing "kolla_start" About a minute ago Up About a minute (unhealthy) mysql
1f0b9aa72ec7 192.168.24.1:8787/master/centos-binary-rabbitmq:tripleo-ci-testing "kolla_start" 2 minutes ago Restarting (1) 29 seconds ago rabbitmq
8e689f5bac22 192.168.24.1:8787/master/centos-binary-haproxy:tripleo-ci-testing "kolla_start" 2 minutes ago Up 2 minutes haproxy
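With more containers the STATUS column gets hard to scan by eye. The sketch below filters for containers that report as unhealthy or are stuck restarting; `problem_containers` is a hypothetical helper, and the `|` separator is simply a choice made for this example.

```shell
# problem_containers: read "NAME|STATUS" lines on stdin and print the names
# of containers that are unhealthy or restarting.
problem_containers() {
    awk -F '|' '$2 ~ /\(unhealthy\)|^Restarting/ { print $1 }'
}

# Example (not run here):
#   sudo docker ps --format '{{.Names}}|{{.Status}}' | problem_containers
```

Against the output above this would flag mysql and rabbitmq.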
As I said at the start this is a very rough demo that we can hopefully clean up
and iterate on quickly over the coming weeks. The current goal is to have
another working demo available by M2 that covers all of the required services
to upgrade the computes so we can also start verification of the data plane
during the upgrade.
I know, I know, another post about our 16 bridges walk. We just received our
certificate from Cardiomyopathy UK showing that we raised a grand total of
£635.00. My thanks again to everyone who donated!
This post will be a living document where I will detail how TripleO developers
can initially provision and iterate quickly while working on service upgrade
tasks for the new fast-forward upgrade feature in TripleO Queens.
Initial environment
This section details how to configure the initial environment with specific
undercloud (UC) and overcloud (OC) versions and layouts using
tripleo-quickstart.
Newton UC & OC
This basic combination is required for end-to-end testing of fast-forward upgrades:
$ bash quickstart.sh -R newton $VIRTHOST
Note however that the following changes are required so that vbmc is used by
the undercloud instead of pxe_ssh (removed in Pike):
My thanks again to everyone who attended and contributed to the
skip-level upgrades track over the first two days of last week’s PTG.
I’ve included a short summary of our discussions below with a list of
agreed actions for Queens at the end.
During our first session we briefly discussed the history of the
skip-level upgrades effort within the community and the various
misunderstandings that have arisen from previous conversations around
this topic at past events.
We agreed that at present the only way to perform upgrades between N and
N+>=2 releases of OpenStack was to upgrade linearly through each major
release, without skipping between the starting and target release of the
upgrade.
This is contrary to previous discussions on the topic where it had been
suggested that releases could be skipped if DB migrations for these
releases were applied in bulk later in the process. As projects within
the community currently offer no such support for this it was agreed to
continue to use the supported N to N+1 upgrade jumps, albeit in a
minimal, offline way.
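To make the linear path concrete, the following minimal sketch lists the N to N+1 hops between two releases. The release list and the `ffu_hops` helper are illustrative assumptions rather than part of any project tooling.

```shell
# Ordered release names; only the releases relevant to this example.
RELEASES="newton ocata pike queens"

# ffu_hops FROM TO: print each N -> N+1 hop between FROM and TO, one per
# line. Hypothetical helper for illustration only.
ffu_hops() {
    from="$1"; to="$2"; prev=""; started=0
    for rel in $RELEASES; do
        if [ "$started" -eq 1 ]; then
            echo "$prev -> $rel"
        fi
        if [ "$rel" = "$from" ]; then started=1; fi
        if [ "$rel" = "$to" ]; then break; fi
        prev="$rel"
    done
}
```

Running `ffu_hops newton queens` prints three hops (newton -> ocata, ocata -> pike, pike -> queens), none of which is skipped.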
The name skip-level upgrades has had an obvious role to play in the
confusion here and as such the renaming of this effort was discussed at
length. Various suggestions are listed on the pad but for the time being
I’m going to stick with the basic fast-forward upgrades name (FFOU,
OFF, BOFF, FFUD etc were all close behind). This removes any notion of
releases being skipped and should hopefully avoid any further confusion
in the future.
Support by the projects for offline upgrades was then discussed, with a recent
Ironic issue highlighted as an example where projects have required services to
run before the upgrade could be considered complete. The additional requirement
of ensuring both workloads and the data plane remain active during the upgrade
was also then discussed. It was agreed that both the supports-upgrades and
supports-accessible-upgrades tags should be updated to reflect these
requirements for fast-forward upgrades.
Given the above it was agreed that this new definition of what
fast-forward upgrades are and the best practices associated with them
should be clearly documented somewhere. Various operators in the room
highlighted that they would like to see a high level document outline
the steps required to achieve this, hopefully written by someone with
past experience of running this type of upgrade.
I failed to capture the names of the individuals who were interested in
helping out here. If you would like to be involved please feel free to add
your name to the actions either at the end of this mail or at the bottom of
the pad.
In the afternoon we reviewed the current efforts within the community to
implement fast-forward upgrades, covering TripleO, Charms (Juju) and
openstack-ansible. While this was insightful to many in the room there
didn’t appear to be any obvious areas of collaboration outside of
sharing best practice and defining the high level flow of a fast-forward
upgrade.
Tuesday
Tuesday started with a discussion around NFV considerations with
fast-forward upgrades. These ranged from the previously mentioned need
for the data plane to remain active during the upgrade to the restricted
nature of upgrades in NFV environments in terms of time and number of
reboots.
It was highlighted that there are some serious as yet unresolved bugs in
Nova regarding the live migration of instances using SR-IOV devices.
This currently makes the moving of workloads either prior to or during
the upgrade particularly difficult.
Rollbacks were also discussed and the need for any best practice
documentation around fast-forward upgrades to include steps to allow the
recovery of environments if things fail was also highlighted.
We then revisited an idea from the first day of finding or creating a
SIG for this effort to call home. It was highlighted that there was a
suggestion in the packaging room to create a Deployment / Lifecycle SIG.
After speaking with a few individuals later in the week I’ve taken the
action to reach out on the openstack-sigs mailing list for further
input.
Finally, during a brief discussion on ways we could collaborate and share
tooling for fast-forward upgrades, a new tool to migrate configuration files
from N to N+>=2 releases was introduced. While interesting, it was seen as a
more generic utility that could also be used for N to N+1 upgrades. AFAIK the
authors joined the Oslo room shortly after this session ended to gain more
feedback from that team.
I have yet to look into the formal process around making changes to
these tags but I will aim to make a start ASAP.
Find an Ops lead for the documentation effort
I failed to take down the names of some of the operators who were
talking this through at the time. If they or anyone else is still
interested in helping here please let me know!
Find or create a relevant SIG for this effort
As discussed above this could be as part of the lifecycle SIG or an
independent upgrades SIG. Expect a separate mail to the SIG list
regarding this shortly.
Identify a room chair for Sydney
Unfortunately I will not be present in Sydney to lead a similar
session. If anyone is interested in helping please feel free to respond
here or reach out to me directly!
My thanks again to everyone who attended the track, I had a blast
leading the room and hope that the attendees found both the track and
some of the outcomes listed above useful.
I’m finally back from a work trip to the US and wanted to share that we
completed the 15 bridges (rather than 16, as the Golden Jubilee Bridge(s)
remain closed) in just over 5 hours, 10 days ago!
Again, our thanks to everyone who donated, it’s going to a wonderful charity
and will hopefully make a difference to the lives of people living with
cardiomyopathy!
I’ve already started looking into similar walks we could take part in next
year, with an obvious candidate of the Wye Valley Challenge being most likely
at the moment. The full 100km version from Chepstow to Hereford might
prove slightly too much for a novice like me however there are shorter, 45km
versions ending in Hereford.
A short reminder that I’ll be chairing the skip-level upgrades room at next
week’s OpenStack PTG in Denver. So far ~15 of you have shown interest in this
track on the etherpad so I’m looking forward to some useful discussions over
the two days. For now we still have available slots so if you do have
suggestions please feel free to add them directly on the pad!
10:00 - 10:30 - Retrospective of what was discussed in Boston, outcomes, etc.
10:30 - 11:00 - Have operator requirements changed since Boston?
11:00 - 14:00 - #####
14:00 - 16:00 - What efforts (if any) are underway to enable skip level upgrades within the community?
16:00 - 18:00 - #####
Tuesday
09:00 - 10:30 - #####
10:30 - 11:00 - NFV considerations
11:00 - 11:30 - API versions control
11:30 - 14:00 - #####
14:00 - 16:00 - How can we collaborate and share tools for skip level upgrades within the community?
16:00 - 18:00 - Should we think about a different way of releasing?
TripleO
Later in the week I will also be participating in the TripleO track, with a
session on Thursday to discuss my WIP skip-level upgrade spec. I’ll be working
on this during the week leading up to this session so feel free to review this
ahead of time or just grab me in the hallway for a chat if this is something
that interests you!
I’ve been fortunate enough to be part of a team looking into skip level
upgrades recently ahead of the start of the Queens development cycle for
OpenStack. What follows is an introduction to the concept of skip level
upgrades and an overview of our initial PoC work in this area. Future posts
will also cover our plans for enabling skip level upgrades within TripleO and
possible work with the wider community to enable this within other deployment
tools.
Introduction
Skip level upgrades are, as the name suggests, upgrades that move an
environment from release N to N+X in a single step, where X is greater than 1
and for skip level upgrades is typically 3. For example, in the context of
OpenStack, N to N+3 can refer to an upgrade from the Newton release of
OpenStack to the Queens release, skipping Ocata and Pike:
There are existing alternative methods available for skipping a number of
releases during an upgrade. For example, parallel cloud migration is a commonly
cited alternative. This is where an additional environment is stood up
alongside the original, with workloads migrated to the new environment:
The requirement for this type of upgrade is driven by users looking to
standardise on a given release (typically LTS), whilst retaining the ability to
skip forward when the release hits EOL. This negates the need to keep up with
the major release cycle that in the case of OpenStack continues to be every 6
months.
It is worth highlighting that the topic of skip level upgrades is not new to
the OpenStack community. There have been previous attempts to provide skip
level upgrade functionality, typically within the various deployment projects,
for example openstack-ansible’s leap-upgrades project that attempted to move
environments between Juno/Kilo and Newton.
More recently the topic of skip level upgrades was discussed at the OpenStack
Forum in Boston in May. An RFC thread was also posted to the development
mailing list, however no formal actions came of either discussion. I’m looking
to restart this discussion at the next PTG in Denver, more on that later.
Requirements
Now that we understand what skip level upgrades actually are, it’s time to set
out some basic requirements for the state of the environment during the
upgrade. At the start of this process our team sat down and drafted the
following:
The control plane is inaccessible for the duration of the upgrade.
The upgrade must complete successfully or roll back within 4 hours.
The data plane and workloads must remain available for the duration of the upgrade.
Proof of concept
With the requirements set out, our first real task was to prove that this was
even possible with an OpenStack environment. Given the releases available at
the time, we began by manually upgrading an existing Mitaka based RHOSP 9
environment running on RHEL 7.3 to our recently released Ocata based RHOSP 11
release running on RHEL 7.4.
We were aware that whilst the goal of skip level upgrades is to give the
impression of a single jump between releases, in practice this isn’t possible
with OpenStack. Upgrades of OpenStack components are verified by the community
across N to N+1 jumps, so whilst we wanted to skip ahead to Ocata we knew
we would also have to upgrade through Newton to get there.
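As a rough illustration of this point, the sketch below expands a single "skip" into the individual N to N+1 hops that still have to be performed. The `RELEASES` list and the `upgrade_hops` helper are hypothetical names used only for this example; they are not part of any OpenStack tooling.

```python
# Upstream OpenStack release names in order, from Mitaka to Queens.
RELEASES = ["mitaka", "newton", "ocata", "pike", "queens"]


def upgrade_hops(source, target, releases=RELEASES):
    """Return the (from, to) pairs of N to N+1 upgrades between two releases."""
    start = releases.index(source)
    end = releases.index(target)
    if end <= start:
        raise ValueError(f"{target} is not newer than {source}")
    path = releases[start:end + 1]
    # Pair each release with its successor: [(N, N+1), (N+1, N+2), ...]
    return list(zip(path, path[1:]))


print(upgrade_hops("mitaka", "ocata"))
# [('mitaka', 'newton'), ('newton', 'ocata')]
```

For the PoC this gives two hops, Mitaka to Newton and then Newton to Ocata, even though the operator only sees a single jump.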
The following outlines, at a very high level, the steps we followed during the
PoC to upgrade the environment from Mitaka to Ocata:
Rolling minor update of the underlying OS
Disable control plane and compute services
Upgrade a single controller to N+1 and then N+2
  Update packages
  Introduce new services as required (nova-placement for example)
  Update service configuration files
  Run DB syncs, migrations etc
  Repeat for N+2
Upgrade remaining controllers directly to N+2
  Update packages
  Introduce new services as required
  Update service configuration files
Upgrade remaining hosts to N+2
  Update packages
  Update service configuration files
Enable control plane and compute services
Verify workload availability during upgrade
Validate the post upgrade environment
Let’s take a look at each of these steps below in more detail.
Rolling minor update of the underlying OS
This initial rolling minor update moved hosts from RHEL 7.3 to RHEL 7.4 whilst
also pulling in OVS from our RHOSP11 repos in a bid to limit the number of
reboots required in the environment. In practice operators could perform this
minor update well ahead of any skip level upgrade, reducing any impact on the
overall time required for the upgrade itself.
Disable control plane and compute services
As listed above under requirements, a full control plane outage is accounted
for during the skip level upgrade. Note that this does not include the
infrastructure services providing the database, messaging queues etc. Compute
services are also stopped at this time but should not have any impact on the
running workloads and data plane.
Upgrade a single controller to N+1 and then N+2
The main work of upgrading between releases is carried out on a single
controller. Packages are updated, new services such as nova-placement are
deployed as required, configuration files updated and DB migrations completed.
This process is repeated on this host until we reach the target release.
Upgrade remaining controllers directly to N+2
Once the single controller has been upgraded to our target release we then skip
any remaining controllers ahead to this target release, updating packages,
introducing new services and updating configuration files on these controllers
as required.
Upgrade remaining hosts to N+2
This is then repeated for any remaining hosts, such as computes, object storage
hosts etc. Again this should not interrupt running workloads or the data plane.
Enable control plane and compute services
Once all hosts are updated to the target release the control and compute
services are restarted.
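The ordering of the upgrade steps described above can be sketched as a simple plan builder. This is only an illustration of the sequencing, not the automation we used: `build_plan`, the host names and the action strings are all hypothetical, and the OS minor update and the verification steps are left out. Following the step list, DB syncs are only planned for the first controller.

```python
def build_plan(controllers, other_hosts, hops):
    """Return an ordered list of (host, action) steps.

    hops lists the intermediate and target releases, e.g. ["N+1", "N+2"].
    """
    if not controllers or not hops:
        raise ValueError("need at least one controller and one release hop")
    first, remaining = controllers[0], controllers[1:]
    target = hops[-1]
    all_hosts = controllers + other_hosts

    # Control plane and compute services are stopped everywhere first.
    plan = [(host, "stop control plane and compute services") for host in all_hosts]

    # The first controller walks through every release in turn.
    for release in hops:
        plan.append((first, f"update packages to {release}"))
        plan.append((first, f"introduce new services for {release}"))
        plan.append((first, f"update configuration for {release}"))
        plan.append((first, f"run DB syncs and migrations for {release}"))

    # Remaining controllers skip straight to the target release.
    for host in remaining:
        plan.append((host, f"update packages to {target}"))
        plan.append((host, f"introduce new services for {target}"))
        plan.append((host, f"update configuration for {target}"))

    # Computes, object storage hosts etc. also go straight to the target.
    for host in other_hosts:
        plan.append((host, f"update packages to {target}"))
        plan.append((host, f"update configuration for {target}"))

    # Services are only started again once every host is on the target release.
    plan += [(host, "start control plane and compute services") for host in all_hosts]
    return plan


for host, action in build_plan(["ctrl0", "ctrl1"], ["compute0"], ["N+1", "N+2"]):
    print(f"{host}: {action}")
```

Only `ctrl0` ever runs N+1 packages in this plan; every other host moves directly to N+2.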
Verify workload availability during upgrade
During our PoC we ran multiple instances across various L2 and L3 networks,
using Ansible to first launch and then later collect the results of
asynchronous jobs (ping, ssh etc) that had been running between these
instances during the upgrade.
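To give a feel for how such results might be assessed, here is a minimal sketch that finds the longest interruption in a series of probe results. The `(timestamp, succeeded)` sample format and the `longest_outage` helper are assumptions made for this example; they do not reflect the actual output of the Ansible jobs used in the PoC.

```python
def longest_outage(samples):
    """Return the longest run of failed probes, in seconds.

    samples is a list of (timestamp_seconds, succeeded) tuples sorted by time.
    An outage runs from the first failed probe to the next successful one,
    or to the last sample if the probe never recovers.
    """
    longest = 0.0
    outage_start = None
    last_ts = None
    for ts, ok in samples:
        if not ok and outage_start is None:
            outage_start = ts  # first failure of a new outage
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None  # probe recovered
        last_ts = ts
    if outage_start is not None:
        # Still failing at the end of the run.
        longest = max(longest, last_ts - outage_start)
    return longest


# Hypothetical ping results taken once a second between two instances.
samples = [(0.0, True), (1.0, True), (2.0, False), (3.0, False), (4.0, True)]
assert longest_outage(samples) == 2.0
```

A result of zero across all probes would indicate that the data plane stayed available for the whole upgrade, which is what the requirements above call for.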
Validate the post upgrade environment
Finally Tempest was used to validate the end state of the environment post
upgrade.
After many iterations, the eventual introduction of Ansible to automate all the
things, and much cursing at the amount of time needed to reconfigure the
environment after each run, it was agreed that the team would move on to look
into the possible implementation of the above in TripleO.
PTG
Thanks for making it this far! Before I end this post I wanted to highlight
that I’ve recently agreed to lead the skip level upgrades room at the upcoming
Denver PTG. I’ll be posting more details on this shortly to the OpenStack
development mailing list but wanted to take this opportunity to encourage
anyone interested in this topic to attend and discuss possible ways we can make
skip level upgrades a possibility across the various deployment tools within
the community.
As I alluded to in my opening post Katie and I are raising money for
Cardiomyopathy UK over the coming months,
starting with a walk through London in September. But why, you might ask, are
we doing this?
Well, the charity held a conference that we both attended shortly after I was
hospitalised in August 2016. Thankfully that episode has since been deemed to
simply be Atrial fibrillation that my ICD mistook for Ventricular fibrillation,
a very serious and potentially life threatening issue.
At the time we were both struggling to deal with the reality of my long
suspected, but never fully confirmed condition, ARVC, presenting itself.
The Cardiomyopathy UK National Conference in November of last year was
brilliant, eye opening and helped greatly during that time. Now that things
aren’t so bleak we wanted to give something back and this walk is the first
step (ha!) in achieving that.
We’ve just been given the 25km route that I’ve included below:
We’ve also started training with a few slow jaunts around Hereford:
Finally, if you would like to donate, it would be very much appreciated. Your money will help people like us, going through scary and confusing times, to access information, support and advice as well as contribute to training for medical professionals. We are gratefully accepting donations via our justgiving.com page with our current progress towards our goal shown below :