Friday, October 6, 2017

Cinder volume stuck in detaching state

Since our migration to Newton, some volumes cannot be detached from their servers:
[root@controller ~]# openstack volume list --all
+--------------------------------------+--------------+-----------+------+---------------------------------------------------------------+
| ID                                   | Display Name | Status    | Size | Attached to                                                   |
+--------------------------------------+--------------+-----------+------+---------------------------------------------------------------+

...
| 60304d1e-aa57-11e7-9c40-b3ff0b0a5974 | V_NAME       | detaching |  100 | Attached to 23b19384-aa57-11e7-88a7-03b3c5fe3969 on /dev/vdg  |
...
The error displayed in the hypervisor's log is not really obvious:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 133, in _process_incoming
    res = self.dispatcher.dispatch(message)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 150, in dispatch
    return self._do_dispatch(endpoint, method, ctxt, args)
  File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 121, in _do_dispatch
    result = func(ctxt, **new_args)
  File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 75, in wrapped
    function_name, call_dict, binary)
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 66, in wrapped
    return f(self, context, *args, **kw)
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 216, in decorated_function
    kwargs['instance'], e, sys.exc_info())
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 204, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 4856, in detach_volume
    attachment_id=attachment_id)
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 4786, in _detach_volume
    connection_info = jsonutils.loads(bdm.connection_info)
  File "/usr/lib/python2.7/site-packages/oslo_serialization/jsonutils.py", line 241, in loads
    return json.loads(encodeutils.safe_decode(s, encoding), **kwargs)
  File "/usr/lib/python2.7/site-packages/oslo_utils/encodeutils.py", line 39, in safe_decode
    raise TypeError("%s can't be decoded" % type(text))
TypeError: <type 'NoneType'> can't be decoded



After replaying the Python code executed by this function, we found that the following query is executed when Nova retrieves the volume connection information:
SELECT block_device_mapping.created_at AS block_device_mapping_created_at,
  block_device_mapping.updated_at AS block_device_mapping_updated_at,
  block_device_mapping.deleted_at AS block_device_mapping_deleted_at,
  block_device_mapping.deleted AS block_device_mapping_deleted,
  block_device_mapping.id AS block_device_mapping_id,
  block_device_mapping.instance_uuid AS block_device_mapping_instance_uuid,
  block_device_mapping.source_type AS block_device_mapping_source_type,
  block_device_mapping.destination_type AS block_device_mapping_destination_type,
  block_device_mapping.guest_format AS block_device_mapping_guest_format,
  block_device_mapping.device_type AS block_device_mapping_device_type,
  block_device_mapping.disk_bus AS block_device_mapping_disk_bus,
  block_device_mapping.boot_index AS block_device_mapping_boot_index,
  block_device_mapping.device_name AS block_device_mapping_device_name,
  block_device_mapping.delete_on_termination AS block_device_mapping_delete_on_termination,
  block_device_mapping.snapshot_id AS block_device_mapping_snapshot_id,
  block_device_mapping.volume_id AS block_device_mapping_volume_id,
  block_device_mapping.volume_size AS block_device_mapping_volume_size,
  block_device_mapping.image_id AS block_device_mapping_image_id,
  block_device_mapping.no_device AS block_device_mapping_no_device,
  block_device_mapping.connection_info AS block_device_mapping_connection_info,
  block_device_mapping.tag AS block_device_mapping_tag
  FROM block_device_mapping WHERE block_device_mapping.deleted = 0
  AND block_device_mapping.volume_id = '60304d1e-aa57-11e7-9c40-b3ff0b0a5974'
  AND block_device_mapping.instance_uuid = '23b19384-aa57-11e7-88a7-03b3c5fe3969'
  LIMIT 1 OFFSET 0


The problem is that, in our case, several rows match this query once the 'LIMIT 1 OFFSET 0' clause is removed:
SELECT device_name, connection_info, block_device_mapping.updated_at
  FROM block_device_mapping
  WHERE block_device_mapping.deleted = 0
  AND block_device_mapping.volume_id = '60304d1e-aa57-11e7-9c40-b3ff0b0a5974'
  AND block_device_mapping.instance_uuid = '23b19384-aa57-11e7-88a7-03b3c5fe3969'
...
| /dev/vdd    | NULL |
| /dev/vdf    | NULL |
| /dev/vdg    | {"driver_volume_type": "iscsi", "connector": {"platform": "x86_64", "host": "node2.example.com", "do_local_attach": false, "ip": "192.168.1.32", "os_type": "linux2", "multipath": false, "initiator": "iqn.1994-05.com.redhat:b0aced88ee0"}, "serial": "60304d1e-aa57-11e7-9c40-b3ff0b0a5974", "data": {"access_mode": "rw", "target_discovered": false, "encrypted": false, "qos_specs": null, "target_iqn": "iqn.2010-10.org.openstack:volume-60304d1e-aa57-11e7-9c40-b3ff0b0a5974", "target_portal": "192.168.1.20:3260", "volume_id": "60304d1e-aa57-11e7-9c40-b3ff0b0a5974", "target_lun": 0, "device_path": "/dev/disk/by-path/ip-192.168.1.20:3260-iscsi-iqn.2010-10.org.openstack:volume-60304d1e-aa57-11e7-9c40-b3ff0b0a5974-lun-0", "auth_password": "MiDgLw0xY6gmARjL", "auth_username": "8uQrIvTyAu5XvPonVWo5", "auth_method": "CHAP"}} |
3 rows in set (0,00 sec)


With 'LIMIT 1 OFFSET 0', only the first row is returned, and its connection_info is NULL, which explains the decoding error. To correct the issue, we have to remove the broken entries:
DELETE FROM block_device_mapping
  WHERE block_device_mapping.volume_id = '60304d1e-aa57-11e7-9c40-b3ff0b0a5974'
  AND block_device_mapping.instance_uuid = '23b19384-aa57-11e7-88a7-03b3c5fe3969'
  AND block_device_mapping.deleted = 0
  AND block_device_mapping.connection_info IS NULL
Query OK, 2 rows affected (0,01 sec)
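
Re-running the SELECT at this point should return a single row, the /dev/vdg one with a valid connection_info:
SELECT device_name, connection_info
  FROM block_device_mapping
  WHERE block_device_mapping.volume_id = '60304d1e-aa57-11e7-9c40-b3ff0b0a5974'
  AND block_device_mapping.instance_uuid = '23b19384-aa57-11e7-88a7-03b3c5fe3969'
  AND block_device_mapping.deleted = 0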


Once the query has completed, reset the state of the volume using the cinder reset-state command, as shown below. The volume can then be successfully detached.
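
For example, with the volume and server IDs used above (a sketch; pick the target state that matches your situation, here we reset the volume to in-use and retry the detach):
[root@controller ~]# cinder reset-state --state in-use 60304d1e-aa57-11e7-9c40-b3ff0b0a5974
[root@controller ~]# openstack server remove volume 23b19384-aa57-11e7-88a7-03b3c5fe3969 60304d1e-aa57-11e7-9c40-b3ff0b0a5974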

Thursday, October 5, 2017

Migration from Mitaka to Newton

This document details a simple procedure to upgrade an OpenStack Cloud from Mitaka to Newton. A short downtime of 2 hours is required to perform the upgrade and test the services.

To upgrade to Newton, the following steps are performed:
  1. Stop the daemons of the configuration management tool (Puppet, Chef, Quattor, ...) to ensure that they will not interfere with the upgrade procedure. We are using Quattor at IPHC. Two daemons need to be stopped:
    [root@controller ~]# service ncm-cdipsd stop
    [root@controller ~]# service cdp-listend stop
  2. Stop the OpenStack services and ensure with the systemctl command that they are actually stopped:
    [root@controller ~]# for service in nova neutron cinder glance; do \
        service openstack-${service} stop; \
    done
    [root@controller ~]# service httpd stop

    Note: we have created startup scripts, like /etc/init.d/openstack-nova, that manage all related daemons.
  3. We took advantage of this upgrade to perform some database cleanup:
    • Back up the databases using mysqldump
    • [root@controller ~]# keystone-manage token_flush
    • On our test infrastructure, we were not able to update the keystone database. This issue was caused by a UTF-8 charset problem. To fix it, we had to correctly set the charset (utf8/utf8_general_ci) of each table and database using a small script (a sketch is shown after this list). This step is probably not required if your installation of OpenStack is younger than Juno.
  4. Replace the RDO Mitaka repo by the Newton repo and update the RPMs:
    [root@controller ~]# cat /etc/yum.repos.d/newton.repo
    [x86_64]
    name=OpenStack Newton Repository
    baseurl=http://mirror.centos.org/centos/7/cloud/x86_64/openstack-newton/
    enabled=1
    skip_if_unavailable=0
    gpgcheck=1
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-SIG-Cloud
    priority=98
    [root@controller ~]# rm /etc/yum.repos.d/mitaka.repo
  5. Install the configuration files for the new version. We are using our configuration management tool in manual mode:
    [root@controller ~]# ccm-fetch
    [root@controller ~]# ncm-ncd --config filecopy

    [root@controller ~]# ncm-ncd --config mysql
  6. Once all OpenStack components are configured, each database needs to be updated to the current schema:
    1. Keystone
      [root@controller ~]# su -s /bin/sh -c "keystone-manage db_sync" keystone
    2. Glance
      [root@controller ~]# su -s /bin/sh -c "glance-manage db_sync" glance
    3. Cinder
      [root@controller ~]# su -s /bin/sh -c "cinder-manage db sync" cinder
      With cinder, we hit the following issue:
      ERROR oslo_service.service ServiceTooOld: One of the services is in Liberty version. We do not provide backward compatibility with Liberty now, you need to upgrade to Mitaka first.
      If you check the enabled services, you can see that a stale cinder-backup service (last updated in 2014) is still registered:

      [root@controller ~]# cinder-manage service list
      Binary           Host                                 Zone             Status     State Updated At           RPC Version  Object Version  Cluster
      cinder-scheduler controller                           nova             enabled    XXX   2017-07-27 17:38:25  3.0          1.11
      cinder-volume    controller                           nova             enabled    XXX   2017-07-27 17:38:25  3.0          1.11
      cinder-backup    controller                           nova             enabled    XXX   2014-07-15 05:49:40  None         None
      The solution is to remove the outdated service:

      [root@controller ~]# cinder-manage service remove cinder-backup controller
    4. Neutron
      [root@controller ~]# su -s /bin/sh -c "neutron-db-manage upgrade heads" neutron
    5. Nova
      First, if your configuration management tool does not create databases, manually create the nova_api database and grant the nova user access to it (see the SQL sketch after this list).
      [root@controller ~]# su -s /bin/sh -c "nova-manage api_db sync" nova
      [root@controller ~]# su -s /bin/sh -c "nova-manage db sync" nova
    6. Heat
      [root@controller ~]# su -s /bin/sh -c "heat-manage db_sync" heat
    7. Magnum
      [root@controller ~]# su -s /bin/sh -c "magnum-db-manage upgrade heads" magnum
  7. The /etc/keystone/credential-keys/ directory has to be created and must be owned by the keystone user (one way to do it is shown after this list).
  8. Keystone was really slow after the upgrade. To return to a normal state, make sure that Keystone is using memcache and that your tokens get flushed regularly (we are using a cron file).
  9. The upgrade is now complete: restart the OpenStack services, as well as the configuration management tool daemon(s). Look at the OpenStack log files for any errors and test your services using your favorite probe platform.
  10. After this update, the creation of new flavors failed with the following error:
    not all flavors have been migrated to the API database
    This is caused by a known bug. To resolve this issue, we have run:
    nova-manage db online_data_migrations
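
A note on step 3: the original charset-fix script is not reproduced here, but a minimal sketch of what it does could look like the following, assuming the MySQL credentials are available in /root/.my.cnf (test it on a copy of the database first):
#!/bin/sh
# Sketch: convert the keystone database and every one of its tables
# to utf8/utf8_general_ci. Adjust DB to the database you need to fix.
DB=keystone
mysql -e "ALTER DATABASE ${DB} CHARACTER SET utf8 COLLATE utf8_general_ci;"
for TABLE in $(mysql -N -B -e "SHOW TABLES;" "${DB}"); do
    mysql -e "ALTER TABLE ${DB}.${TABLE} CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;"
done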
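
For the nova_api database of step 6.5, the following statements are a sketch of the manual creation (NOVA_DBPASS is a placeholder for your actual nova database password):
[root@controller ~]# mysql -e "CREATE DATABASE nova_api;"
[root@controller ~]# mysql -e "GRANT ALL PRIVILEGES ON nova_api.* TO 'nova'@'localhost' IDENTIFIED BY 'NOVA_DBPASS';"
[root@controller ~]# mysql -e "GRANT ALL PRIVILEGES ON nova_api.* TO 'nova'@'%' IDENTIFIED BY 'NOVA_DBPASS';"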
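
For the credential-keys directory of step 7, keystone-manage can create it with the right ownership and generate the keys:
[root@controller ~]# keystone-manage credential_setup --keystone-user keystone --keystone-group keystone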

Friday, January 6, 2017

Migration from Liberty to Mitaka at IPHC

This document details a simple procedure to upgrade OpenStack from Liberty to Mitaka. A short downtime of 2 hours is required to perform the upgrade and test the services.

To upgrade to Mitaka, the following steps are performed:
  1. Stop the daemons of the configuration management tool (Puppet, Chef, Quattor, ...) to ensure that they will not interfere with the upgrade procedure. We are using Quattor at IPHC. Two daemons need to be stopped:
    [root@controller ~]# service ncm-cdipsd stop
    [root@controller ~]# service cdp-listend stop
  2. Stop the OpenStack services and ensure with the systemctl command that they are actually stopped:
    [root@controller ~]# for service in nova neutron cinder glance; do \
        service openstack-${service} stop; \
    done
    [root@controller ~]# service httpd stop

    Note: we have created startup scripts, like /etc/init.d/openstack-nova, that manage all related daemons.
  3. We took advantage of this upgrade to perform some database cleanup:
    • Back up the databases using mysqldump
    • [root@controller ~]# keystone-manage token_flush
    • On our test infrastructure, we were not able to update the keystone database. This issue was caused by a UTF-8 charset problem. To fix it, we had to correctly set the charset (utf8/utf8_general_ci) of each table and database using a small script (a sketch is shown at the end of the Newton upgrade post above). This step is probably not required if your installation of OpenStack is younger than Juno.
  4. Replace the RDO Liberty repo by the Mitaka repo and update the RPMs:
    [root@controller ~]# cat /etc/yum.repos.d/mitaka.repo
    [x86_64]
    name=OpenStack Mitaka Repository
    baseurl=http://mirror.centos.org/centos/7/cloud/x86_64/openstack-mitaka/
    enabled=1
    skip_if_unavailable=0
    gpgcheck=1
    gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-SIG-Cloud
    priority=98
    [root@controller ~]# rm /etc/yum.repos.d/liberty.repo
  5. Install the configuration files for the new version. We are using our configuration management tool in manual mode:
    [root@controller ~]# ccm-fetch
    [root@controller ~]# ncm-ncd --config filecopy

    [root@controller ~]# ncm-ncd --config mysql
  6. Once all OpenStack components are configured, each database needs to be updated to the current schema:
    1. Keystone
      [root@controller ~]# su -s /bin/sh -c "keystone-manage db_sync" keystone
    2. Glance
      [root@controller ~]# su -s /bin/sh -c "glance-manage db_sync" glance
    3. Cinder
      [root@controller ~]# su -s /bin/sh -c "cinder-manage db sync" cinder
    4. Neutron
      [root@controller ~]# su -s /bin/sh -c "neutron-db-manage upgrade heads" neutron
    5. Nova
      First, if your configuration management tool does not create databases, manually create the nova_api database and grant the nova user access to it (the SQL sketch at the end of the Newton upgrade post above applies here as well).
      [root@controller ~]# su -s /bin/sh -c "nova-manage api_db sync" nova
      [root@controller ~]# su -s /bin/sh -c "nova-manage db sync" nova
  7. The upgrade is now complete: restart the OpenStack services, as well as the configuration management tool daemon(s). Look at the OpenStack log files for any errors and test your services with Tempest, for example as shown below.
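
Assuming Tempest is already configured for your cloud, a smoke run is a quick sanity check (the tempest run subcommand is available in recent Tempest releases):
[root@controller ~]# tempest run --smoke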

Wednesday, May 11, 2016

Restore iSCSI configuration for Cinder / Nova

In a few cases (e.g. after a cinder-volume crash), some Cinder volumes cannot be accessed by a VM (I/O errors) but are still displayed as attached by the cinder and nova CLIs. Looking at the hypervisor's log, you may see:
May 11 13:26:45 cloudhyp1 iscsid: conn 0 login rejected: target error (03/01)
May 11 13:26:45 cloudhyp1 iscsid: conn 0 login rejected: initiator failed authorization with target
May 11 13:26:45 cloudhyp1 iscsid: conn 0 login rejected: initiator failed authorization with target


On the cinder-volume host, check the configuration of iSCSI target:
[root@controller ~]# targetcli ls
o- / ......................................................................................................................... [...]
  o- backstores .............................................................................................................. [...]
  | o- block .................................................................................................. [Storage Objects: 1]
  | | o- iqn.2010-10.org.openstack:volume-6e95e5b6-83e1-4958-a5e1-ba5afc94559e  [/dev/cinder-volumes/volume-6e95e5b6-83e1-4958-a5e1-ba5afc94559e (20.0GiB) write-thru activated]
  | o- fileio ................................................................................................. [Storage Objects: 0]
  | o- pscsi .................................................................................................. [Storage Objects: 0]
  | o- ramdisk ................................................................................................ [Storage Objects: 0]
  o- iscsi ............................................................................................................ [Targets: 7]
  | o- iqn.2010-10.org.openstack:volume-6e95e5b6-83e1-4958-a5e1-ba5afc94559e ............................................. [TPGs: 1]
  | | o- tpg1 .......................................................................................... [no-gen-acls, auth per-acl]
  | |   o- acls .......................................................................................................... [ACLs: 0]
  | |   o- luns .......................................................................................................... [LUNs: 1]
  | |   | o- lun0  [block/iqn.2010-10.org.openstack:volume-6e95e5b6-83e1-4958-a5e1-ba5afc94559e (/dev/cinder-volumes/volume-6e95e5b6-83e1-4958-a5e1-ba5afc94559e)]
  | |   o- portals .................................................................................................... [Portals: 1]
  | |     o- 192.168.1.1:3260 ................................................................................................. [OK]
  o- loopback ......................................................................................................... [Targets: 0]


In that case, cloudhyp1 cannot connect to the target because no ACLs are defined ([ACLs: 0]).

You have to set up the ACL manually:

[root@controller ~]# mysql -u cinder -p -e "select provider_auth from volumes where id='6e95e5b6-83e1-4958-a5e1-ba5afc94559e'" cinder
Enter password:
+------------------------------------------------+
| provider_auth                                  |
+------------------------------------------------+
| CHAP xjrFIwOQ66ktkxjrFIwO vr2twXxoDww7wvr2twXx |
+------------------------------------------------+


The first entry is the username and the second one the password. You can check that you have the same values on the hypervisor (1, 2):
[root@cloudhyp1 ~]# grep node.session.auth /var/lib/iscsi/nodes/iqn.2010-10.org.openstack:volume-6e95e5b6-83e1-4958-a5e1-ba5afc94559e/192.168.1.1,3260,1/default
node.session.auth.authmethod = CHAP
node.session.auth.username = xjrFIwOQ66ktkxjrFIwO
node.session.auth.password = vr2twXxoDww7wvr2twXx


On the hypervisor, you also need to get the initiator ID (3):

[root@cloudhyp1 ~]# cat /etc/iscsi/initiatorname.iscsi
InitiatorName=iqn.1994-05.com.redhat:1abc12d345e6
 
To update the ACL, first save the targetcli configuration:
[root@controller ~]# targetctl save
[root@controller ~]# cp /etc/target/saveconfig.json /etc/target/saveconfig.old

Replace:
          "node_acls": [] 

by the following block, for the right volume (iqn.2010-10.org.openstack:volume-6e95e5b6-83e1-4958-a5e1-ba5afc94559e in our case):
          "node_acls": [
            {
              "attributes": {
                "dataout_timeout": 3,
                "dataout_timeout_retries": 5,
                "default_erl": 0,
                "nopin_response_timeout": 30,
                "nopin_timeout": 15,
                "random_datain_pdu_offsets": 0,
                "random_datain_seq_offsets": 0,
                "random_r2t_offsets": 0
              },
              "chap_password": "
vr2twXxoDww7wvr2twXx",
              "chap_userid": "
xjrFIwOQ66ktkxjrFIwO",
              "mapped_luns": [
                {
                  "index": 0,
                  "tpg_lun": 0,
                  "write_protect": false
                }
              ],
              "node_wwn": "iqn.1994-05.com.redhat:1abc12d345e6"
            }

          ]

You have to replace chap_userid, chap_password and node_wwn with the values obtained in steps 1, 2 and 3 respectively.

Then check and load the configuration:
[root@controller ~]# cat /etc/target/saveconfig.json | json_verify
JSON is valid
[root@controller ~]# targetctl restore

You can connect again to the iSCSI target from the hypervisor:
[root@cloudhyp1 ~]# iscsiadm -m node -T iqn.2010-10.org.openstack:volume-6e95e5b6-83e1-4958-a5e1-ba5afc94559e -l
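
To verify that the session is established, list the active iSCSI sessions; the target IQN above should appear in the output:
[root@cloudhyp1 ~]# iscsiadm -m session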