Shift — Safe and Easy Database Migrations
A service that helps you run online schema migrations for MySQL databases.
Written by Michael Finch.
Shift is an application that makes it easy to run online schema migrations for MySQL databases. We use it at Square to run thousands of databases migrations every month. Today we are happy to be open sourcing the project, which can be found on GitHub at https://github.com/square/shift.
Two years ago almost every database at Square lived on shared hosts, meaning a single server with a single MySQL instance could have anywhere from 2 to 60 schemata on it. For the most part this actually worked pretty well for us, particularly since we didn’t yet have the tooling in place to manage thousands of database hosts. The main drawback to this setup, though, is the obvious lack of isolation — any one service behaving badly could disrupt — and even take down — tens of other services. Combine that with a general lack of visibility into which service was causing performance problems for a database host and you start to get an idea of the issues we were dealing with back then. Suffice it to say, shared databases served us well for a while but it was time to develop a more mature infrastructure.
We decided to use containerization (via LXC) to provide isolation between databases, but there was an important thing we had to keep in mind when doing this — with the number of services in our ecosystem growing every week (well into the hundreds by that point) it was imperative that we made sure our small operations team could support the move away from shared hosts. Containerizing all of our databases would turn tens of shared hosts into thousands of containers, and we needed to be able to handle the transition seamlessly. So, we set out with a secondary — and actually more important — goal of making databases at Square easy for developers to manage on their own. Shift is a service that evolved from that goal.
The process for migrating database schema at Square used to be rather cumbersome. It looked something like this:
- Developer files a ticket explaining the change they want to make
- Operations team writes up a procedure to do the alter with pt-online-schema-change
- Someone approves the procedure
- Operations team manually runs the procedure
Or, worse yet, a developer might accidentally run a migration live during a deploy, potentially causing downtime for their service. Needless to say this wasn’t a refined process. Now, all database migrations go through Shift.
Within Shift, developers can submit migrations that they need to have run on their databases (altering, dropping, or creating a table). After submission, a dry-run is executed to make sure that the migration is valid, and some basic stats about the table being altered are collected and presented in the UI. With these stats in mind, another user must decide whether or not to approve the migration, optionally leaving feedback in the form of a native comment. After receiving approval, the migration can then be run — all with just the click of a button. Shift has simple hooks you can plug into that provide granular control over which users have access to perform each action for a given database.
Some additional features that we built into the service are:
The ability to pause and resume migrations. To do this we created a patch for the tried-and-true pt-online-schema-change (pt-osc), which you can find here, that keeps track of the state of the migration in file. As long as triggers are preserved, this state file allows us to recover from a pt-osc that is killed without losing any progress. This is useful because we give users the ability to pause a migration if their database is under heavy load (and if load gets too bad, they can entirely cancel it which will remove the triggers as well), but it is also helpful when pt-osc dies for any number of reasons, like losing it’s connection to the database.
Support for sharded databases. When creating a migration request, Shift allows you to multi-select databases and hosts that a DDL statement needs to be run on, ultimately creating a meta request (a group of migrations that share the same DDL). All migrations in a meta request can be approved, started, canceled, etc., at the same time, meaning you can run a migration on 100 databases as easily as you can run one on a single database. For safety reasons only one migration is allowed to run on a database at a time (configurable), but migrations can be enqueued to run automatically when this lock is freed up.
Shift removes the error-prone and manual work from altering database schema, and it gives invaluable time back to operations teams. As of writing this, Shift has had no problem running hundreds of migrations a day or running migrations that take weeks to complete. You can read more about it’s features and components, as well as watch a demo video at https://github.com/square/shift.
Shift works pretty well with the current workload we have at Square. There are some features we plan to build into it in the near future to make it more robust, one example being to make it easier to integrate with developer workflows (ex: direct integration with ActiveRecord). On the long term roadmap we may implement our own version of pt-osc to support things like “WHERE” clauses in an alter (for pruning data), as well as explore the possibilities of running migrations on slaves and then promoting them to master.
Please follow the installation guide here. michaelfinch (Michael Finch) *michaelfinch has 2 repositories available. Follow their code on GitHub.*github.com