We worked hard to improve SURFconext's availability over the past year, and our efforts paid off. This blog post describes the measures we have taken.
SURFconext experienced rapid growth during the first four years of its existence. At the start of 2013, the platform handled around 100,000 logins per week; over the past year, it reached more than two million logins per week, with peaks of around 900 logins per minute. In 2013, those 100,000 weekly logins were distributed across more than 150 service providers; the platform now connects more than 600. Because of this growth, and because our institutions increasingly rely on cloud services, the uptime of the platform that facilitates these logins is extremely important. In early 2016, we therefore started work to further improve reliability and availability. This improvement in platform reliability takes place at three levels:
1) Refactoring of Engine Block, the heart of our software
Engine Block is the software responsible for handling the SAML logins. As the code was already a few years old, we started rewriting parts of it in early 2016. This eventually led to Version 5.0, which was put into service in August. Besides optimising the code, we made specific modifications to increase reliability: they allowed us to reduce the database size and the required number of writes, and to phase out LDAP as a backend (discussed in more detail below). All of these modifications were also applied to OpenConext, the open source software on which SURFconext is based. We also paid extra attention to automated testing, so that the software is automatically checked for correct operation in many test scenarios. Whenever we update or add a function, we immediately add automated tests for it, and Travis then runs the whole suite automatically. Given the vast number of configurations for both IdPs and SPs, this is an absolutely vital step.
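To give an idea, a Travis configuration for a PHP project like Engine Block might look roughly like the sketch below; the actual file in the OpenConext repository is more elaborate, and the PHP version and commands shown here are assumptions for illustration.

```yaml
# Hypothetical .travis.yml sketch; the real OpenConext configuration differs.
language: php
php:
  - "5.6"
services:
  - mysql
install:
  - composer install --no-interaction
script:
  # Run the full automated test suite on every push and pull request,
  # covering the many different IdP and SP configuration scenarios.
  - vendor/bin/phpunit
```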
2) Geographical separation
The SURFconext servers were originally hosted in the Nikhef and Vancis data centres in Amsterdam. Last year, we added a second site in Utrecht. Logically, Amsterdam and Utrecht operate as two independent SURFconext entities: if one location fails, the other can take over automatically. To facilitate this, we migrated the database to a Galera Cluster for MySQL. All connections between the two locations are encrypted, so that no data can be intercepted in transit.
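As an illustration, the layout of such a two-site cluster could be captured in an Ansible inventory along the following lines; all hostnames and the group structure are invented for this sketch.

```yaml
# Hypothetical Ansible inventory sketch; hostnames are invented.
all:
  children:
    dbcluster:
      children:
        amsterdam:
          hosts:
            db-ams-01:
            db-ams-02:
        utrecht:
          hosts:
            db-utr-01:
            db-utr-02:
            db-utr-03:
      vars:
        # Galera replicates writes synchronously between the nodes, so
        # either location can keep serving logins if the other one fails.
        wsrep_cluster_name: surfconext
```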
3) Automation of deployments and management with Ansible
Ansible is a tool for automating server management. The configuration is stored in YAML files, a format that can be read by both computers and humans. This type of automation makes the roll-out of software and infrastructure updates much more predictable: we can thoroughly test configuration changes and software upgrades, and make sure everything runs smoothly before putting them into production during a maintenance window. At the moment, four load balancers, twelve application servers and five database servers automatically receive new software and configuration this way.
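A minimal playbook sketch gives an idea of what this looks like in practice; the role and group names here are invented, not taken from our actual repository.

```yaml
# Hypothetical playbook sketch; role and group names are invented.
- name: Roll out load balancer configuration
  hosts: loadbalancers
  become: true
  serial: 1  # update one load balancer at a time to avoid downtime
  roles:
    - haproxy
  post_tasks:
    - name: Validate the rendered HAProxy configuration
      command: haproxy -c -f /etc/haproxy/haproxy.cfg
      changed_when: false
```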
Making adjustments to a platform that operates 24/7 is comparable to replacing an aircraft engine in mid-flight. One of the first changes we made was to replace all servers with new (virtual) servers running CentOS 7, fully installed and configured with Ansible. This upgrade also included putting new load balancers with HAProxy into production (August) and migrating to the database cluster (November). The new servers and new software were brought into operation alongside the old environment, so that we could always revert to the old situation. We implemented all of these changes without incurring any downtime.
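One way to picture this parallel operation is the load balancer configuration. Expressed as Ansible variables (a sketch with invented names and documentation addresses, not our actual configuration), the old servers can stay defined as backups while the new ones take the traffic:

```yaml
# Hypothetical HAProxy backend variables; names and addresses are invented.
haproxy_backends:
  - name: engineblock
    servers:
      # The new CentOS 7 application servers take the production traffic...
      - { name: app-new-01, address: 192.0.2.10, weight: 100 }
      - { name: app-new-02, address: 192.0.2.11, weight: 100 }
      # ...while an old server stays defined as backup, so reverting is
      # a single variable change followed by a redeploy.
      - { name: app-old-01, address: 192.0.2.20, backup: true }
```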
The final part of the project was the phase-out of LDAP as a backend. Distributing a high-availability LDAP server across multiple locations is a complex undertaking, and on the current platform it is sufficient to store the data (e.g. consent and system-generated user identifiers) in a database. This reduces the complexity of the entire platform and increases reliability. It took three attempts to phase out this backend successfully: an unexpected outage of the LDAP server and a software update that prevented some of our institutions from logging in correctly both led to downtime outside the maintenance window. The unexpected nature of that outage confirmed that phasing out this backend was the right decision. In the end, the LDAP backend was successfully deactivated on 5 January. The Utrecht site currently operates as a standby location; it will become fully active in the next few weeks, after which each location can take over from the other in the event of a calamity.
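To give an idea of how little is needed once LDAP is gone: the consent data fits naturally in a relational table. A hypothetical Ansible migration task could look like the sketch below; the database, table and column names are invented and do not describe our actual schema.

```yaml
# Hypothetical migration task; database, table and column names are invented.
- name: Create the consent table in the database backend
  community.mysql.mysql_query:
    login_db: engineblock
    query: >
      CREATE TABLE IF NOT EXISTS consent (
        user_id    VARCHAR(255) NOT NULL,
        service_id VARCHAR(255) NOT NULL,
        given_at   TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,
        PRIMARY KEY (user_id, service_id)
      )
```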
If you have any questions, please do not hesitate to contact Bart Geesink.