What’s Blocking My Migration?

At Axial, as at many other companies, we have software written in the olden days still in active use. It does what it’s supposed to do, so replacing it is not the highest priority. However, sometimes these older scripts keep database transactions open longer than necessary, and those transactions can hold locks.

This poses a problem when we’re running Alembic migrations: modifying a table requires a lock not just on that table but, depending on the operation, on other tables as well. (For example, adding a foreign key requires a lock on the destination table to make sure the foreign key constraint is enforced until the transaction is committed.) Sometimes a developer runs a migration and it seems to be taking a long time; usually that’s because it’s waiting for a lock. Fortunately, it’s really easy to check on the status of queries and locks in PostgreSQL.
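
Concretely, a migration along these lines (the table, column, and constraint names here are made up for illustration) can’t finish until it acquires locks on both tables:

from alembic import op

def upgrade():
    # Adding this constraint needs locks on both 'deal' and 'company'
    # until the migration's transaction commits.
    op.create_foreign_key('fk_deal_company', 'deal', 'company',
                          ['company_id'], ['id'])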

There are three system views we care about: pg_class contains data about tables and table-like constructs, pg_locks contains the locking information (with foreign keys to pg_class), and pg_stat_activity contains information on every connection and what it’s doing. By combining data from these three views (technically pg_class is a system catalog, not a view, but whatever) we can determine which connections are holding locks on which tables, and on which tables a given connection is awaiting locks. We have a script called whats_blocking_my_migration that queries these views.

Rather than explain the columns on these views (which you can see very easily in the docs anyway), I’ll show you the queries we execute.

The first thing we do is get information on Alembic’s connection to the DB. In env.py, we set the “application_name” with the connect_args argument to create_engine (a sketch follows the query below); this way we can identify the connection Alembic is using. Once we’ve done that, we can use application_name to check the status of the Alembic connection. If there are no rows, or if none of the connection rows are actually waiting, the script prints that info and exits.

SELECT pid, state, waiting FROM pg_stat_activity WHERE application_name = 'alembic';
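
For reference, the application_name gets set in env.py along these lines (a minimal sketch, not our exact configuration):

from alembic import context
from sqlalchemy import create_engine

config = context.config
engine = create_engine(
    config.get_main_option('sqlalchemy.url'),
    # psycopg2 passes application_name through to the server, so the
    # connection shows up under that name in pg_stat_activity.
    connect_args={'application_name': 'alembic'},
)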

However, if an Alembic connection is “waiting”, that means something’s holding a lock that it needs. To find out what, we use the pid (which is the pid of the Postgres process handling the connection) to query pg_locks. Our script actually does a join against pg_class to obtain the table name (and make sure the relation is kind ‘r’, meaning an actual table), but the table name can also be obtained directly from the relation by casting it from oid to regclass.

SELECT relation, mode FROM pg_locks WHERE locktype = 'relation' AND granted IS FALSE AND pid IN alembic_pids;

This gets us the list of tables that the Alembic connection needs to lock, but there’s one more step we can take. pg_locks will also tell us the pids of the connections that are holding locks on these tables, and we can join those pids back against pg_stat_activity to obtain the address and port number of the TCP connection holding each lock. (Unfortunately, most of our connections don’t yet populate application_name, so we have to provide this info, which the user can cross-reference with lsof -i to figure out which service is holding the lock.) The query looks a bit like this:

SELECT client_addr, client_port FROM pg_stat_activity JOIN pg_locks ON pg_stat_activity.pid = pg_locks.pid WHERE pg_locks.granted IS TRUE AND pg_locks.relation IN locked_relations;
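
Stitched together, the heart of whats_blocking_my_migration looks roughly like this (a sketch assuming psycopg2; the real script has more error handling and nicer output):

import psycopg2

def whats_blocking_my_migration(dsn):
    conn = psycopg2.connect(dsn)
    cur = conn.cursor()

    # 1. Find Alembic's connections and see whether any are waiting on a lock.
    cur.execute("SELECT pid, state, waiting FROM pg_stat_activity "
                "WHERE application_name = 'alembic'")
    alembic_pids = tuple(pid for pid, state, waiting in cur.fetchall() if waiting)
    if not alembic_pids:
        print("No Alembic connection is waiting on a lock.")
        return

    # 2. Which relations does Alembic need but not yet hold?
    cur.execute("SELECT relation, relation::regclass::text FROM pg_locks "
                "WHERE locktype = 'relation' AND granted IS FALSE "
                "AND pid IN %s", (alembic_pids,))
    rows = cur.fetchall()
    locked_relations = tuple(oid for oid, name in rows)
    print("Waiting for locks on: " + ", ".join(name for oid, name in rows))

    # 3. Who is holding locks on those relations?
    cur.execute("SELECT client_addr, client_port FROM pg_stat_activity "
                "JOIN pg_locks ON pg_stat_activity.pid = pg_locks.pid "
                "WHERE pg_locks.granted IS TRUE "
                "AND pg_locks.relation IN %s", (locked_relations,))
    for addr, port in cur.fetchall():
        print("Lock held by connection from {}:{}".format(addr, port))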

There’s no real secret sauce here; it’s just some basic queries against the Postgres system views, but this makes the lives of our developers a little bit easier.

Migrating the Axial stack to PostgreSQL

Axial has used MySQL for most of its life. We had a variety of complaints about it, ranging from poor query planning to silently accepting (and discarding) invalid data, yet the effort involved in migrating to PostgreSQL always seemed too great compared to the benefits. Then we started using Redshift for data warehousing, and since Redshift is accessed through the standard Postgres client libraries, we had yet another reason to migrate to Postgres. It finally happened.

Migrating the data

Preparing for the migration took one engineer (me) about three weeks. The first task was to actually move all the data, converting data types where necessary. pgloader was a fantastic tool for this, but there were still some hiccups. During this process, I discovered that despite our foreign key constraints (and yes, we were using InnoDB) there were a handful of rows with foreign keys pointing to rows that did not exist. We decided the least bad course of action was to create new dummy rows and point those foreign keys at them. We also had a few rows containing text that was not valid UTF-8, even though the column character set was UTF-8. For these there was nothing to do except strip out the bytes that weren’t valid UTF-8.
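
The cleanup for the bad text was nothing fancy; conceptually it amounted to this (a sketch of the idea, not the exact script we ran):

def strip_invalid_utf8(raw):
    # Decode, silently dropping any byte sequences that aren't valid
    # UTF-8, then re-encode the result as clean UTF-8.
    return raw.decode('utf-8', errors='ignore').encode('utf-8')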

Once our data was sane again, it was time to migrate it. pgloader knows about MySQL’s types, including common hacks (since MySQL has no boolean type, people typically use a tinyint(1) instead). However, some of our boolean columns were wrongly created as wider integers, which pgloader doesn’t automatically convert. Fortunately, pgloader makes it very easy to override its type adaptation on a column-by-column basis, and if you need conversion rules it doesn’t have built in, you can write additional rules in Common Lisp. (No beard necessary!)

Here’s our configuration file:

LOAD DATABASE
FROM mysql://user:pass@127.0.0.1/production
INTO postgresql://user:pass@127.0.0.1/primary_db

WITH include no drop, create tables, create indexes, downcase identifiers, foreign keys, reset sequences, batch rows = 1000

SET maintenance_work_mem to '128MB', work_mem to '12MB'

CAST column ams_msg.migrate_message using nullify-column,
column transaction_profile.control using empty-string-to-null,
column event.is_staff to boolean drop typemod keep default keep not null using tinyint-to-boolean,
column event.is_masquerading to boolean drop typemod keep default keep not null using tinyint-to-boolean,
column preferences_userpreferences.email_buyer_received_opportunity_report_daily to boolean drop typemod keep default keep not null using tinyint-to-boolean,
column preferences_userpreferences.show_import_marketing to boolean drop typemod keep default keep not null using tinyint-to-boolean,
column preferences_userpreferences.show_find_and_send_marketing to boolean drop typemod keep default keep not null using tinyint-to-boolean

EXCLUDING TABLE NAMES MATCHING ~/ams_/, ~/cms_/, 'jogging_log', 'RedisOppEmails_tbl', ~/announcements_announcement/, 'bak_opportunity_expiraiton_date', ~/company_sellerprofile/, ~/djcelery_/, 'numbers_tbl', ~/network_updates_statusupdate/, 'opp_invite_redis_dump_tbl', 'site_announcements_siteannouncement', 'south_migrationhistory', ~/tmp_m/, ~/waffle_/, 'inbound_email_inboundemail', 'audit_log_auditlog', ~/analytics_/

AFTER LOAD DO
$$ UPDATE alembic_version SET version_num = '4e97c60f45e0'; $$;

We start off by telling pgloader to create the tables and indexes, lowercase all identifiers and keep foreign keys, and reset the sequences for primary keys to the next value they should have after copying over the data. We’re also batching rows; pgloader requires a bit of tuning to keep it from running out of memory (compiling it with Clozure Common Lisp instead of Steel Bank Common Lisp definitely helps with the memory usage). Most of the custom column casts are booleans, but one of them uses a very simple custom cast that turns all data into nulls. We did this because this was a column we wanted to keep in MySQL in case the migration went poorly, but that we knew we would not need in Postgres. We also exclude certain tables matching either regular expressions or strings, and finish by setting our alembic migration version number. This ran smoothly and quickly.

Migrating the code

Our codebase used the Django ORM and SQLAlchemy for most of its SQL access, but we still had some bad habits. Some queries used 1 and 0 with boolean fields instead of Python’s True and False values. Other queries were written in Django templates that were rendered and executed by a custom system called tplsql; these often used MySQL-specific functions and index hints. We also needed to start using timezones on our DateTime data, since Postgres supports (and indeed expects) timezones. These were simple to fix and easy to find by exercising the app against the Postgres database.
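
Most of those fixes were one-liners. For illustration (the Deal model, deal instance, and session here are hypothetical stand-ins, assuming SQLAlchemy):

from datetime import datetime, timezone

# MySQL let us compare boolean columns against integer literals;
# Postgres rejects "boolean = integer".
active_deals = session.query(Deal).filter(Deal.is_active.is_(True))  # not == 1

# Naive datetimes become timezone-aware values for
# TIMESTAMP WITH TIME ZONE columns.
deal.closed_at = datetime.now(timezone.utc)  # not datetime.utcnow()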

A couple of other issues were more troublesome. Some of our code used an upsert idiom that worked a bit like this:

try:
    create_a_thing()
except IntegrityError:
    update_the_thing()

This doesn’t work in Postgres because an integrity error means the transaction is now in an invalid state and you can’t continue using it. The simplest approach is to wrap the existing idiom in savepoints, so that’s what we did. A better idiom would use Postgres’s support for CTEs, but given everything else we were changing, we decided to be conservative.
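
With SQLAlchemy, wrapping the idiom in a savepoint is only a couple of extra lines; roughly (a sketch, assuming a session inside a surrounding transaction):

from sqlalchemy.exc import IntegrityError

try:
    # begin_nested() issues a SAVEPOINT; on IntegrityError only the
    # savepoint is rolled back, so the outer transaction stays usable.
    with session.begin_nested():
        create_a_thing()
except IntegrityError:
    update_the_thing()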

We also had queries (usually using the tplsql system) that misused GROUP BY. If you tell a SQL engine to group by some column, every other column you select needs to appear in the GROUP BY clause or be wrapped in an aggregate function; otherwise the engine doesn’t know which value to return for each group, right? Well, MySQL will just pick one, but PostgreSQL complains. So we had to restructure those queries to work properly with GROUP BY, in the process making sure we were actually querying for the data we wanted.
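
In ORM terms, the difference is between something like the first query below, which MySQL tolerated, and the second, which both databases accept (hypothetical Deal model and session, assuming SQLAlchemy):

from sqlalchemy import func

# Rejected by Postgres: title is neither grouped nor aggregated, so
# there's no single right value to return per group.
#   session.query(Deal.company_id, Deal.title).group_by(Deal.company_id)

# Accepted everywhere: every selected column is grouped or aggregated.
latest_per_company = (session.query(Deal.company_id,
                                    func.max(Deal.created_at))
                      .group_by(Deal.company_id))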

Going live

After the migration had been extensively tested, we had to do the release. This would be a multi-hour downtime, since we needed to make sure no new data was going into MySQL while we were moving it to PostgreSQL. So we ordered a pizza and waited till Friday night, whereupon the migration failed two hours into the process. It turned out the postgres user in our production environment didn’t have all the permissions that it had had in test, and some of those permissions were important. So once we fixed the permission problems, we tried again on Sunday. This time it went off without a hitch and everyone came in on Monday to a faster, more reliable Axial.

Preventing errant git pushes with a pre-push hook

At Axial, we use a centralized repository workflow; everyone works off the main git remote, making branches and pull requests as needed. Broken code is allowed on topic branches, but the latest code on master is expected to always work.

However, since master is just an ordinary git branch and the git command-line interface has a lot of different ways to do the same thing, we’ve had a few incidents where a developer intends to push a topic branch to the remote, creating a new branch there, but instead pushes the topic branch to master. Luckily, git 1.8.2 added support for a pre-push hook.

Git hooks are small scripts that plug into various points of git’s operation. For example, a pre-commit hook can check that your code is properly formatted, or a post-receive hook can kick off a build. In our case, pre-push runs before any push and has the ability to abort it.

The pre-push hook is an executable script named .git/hooks/pre-push. It is called with two command-line arguments (the name and URL of the remote) and is given, on standard input, one line per ref being pushed: the local ref and SHA-1 followed by the corresponding remote ref and SHA-1, if any. See the description in githooks(5) for full details.

The first part of the hook assembles data about the push. We obtain the current branch using git symbolic-ref HEAD and check for a force push by inspecting the command-line arguments of the parent process (the git push in question). We also split out the branch names from the refs in the commits when possible. (A git push origin HEAD unfortunately won’t have the branch name associated with it, which is why we get the current branch.)

import os
import shutil
import subprocess
import tempfile
from collections import namedtuple

# 'parser' (used for error exits) and PROTECTED are defined elsewhere
# in the full script.

Push = namedtuple('Push', ['commits', 'remote_name', 'remote_url',
                           'current_branch', 'removing_remote', 'forcing'])
Commit = namedtuple('Commit', ['local_ref', 'local_sha1', 'remote_ref', 'remote_sha1',
                               'local_branch', 'remote_branch'])

def assemble_push(args, lines):
    commits = []
    for line in lines:
        split = line.split()
        if len(split) != 4:
            parser.exit(status=1,
                        message="Could not parse commit from '{}'\n".format(line))
        # local_branch
        local_branch = split[0].split('/')[-1] if '/' in split[0] else None
        split.append(local_branch)
        # remote_branch
        remote_branch = split[2].split('/')[-1] if '/' in split[2] else None
        split.append(remote_branch)
        commits.append(Commit(*split))
    current_ref = subprocess.check_output(['git', 'symbolic-ref', 'HEAD']).rstrip()
    current_branch = current_ref.split('/')[-1]
    pid = os.getppid()
    push_command = subprocess.check_output(['ps', '-ocommand=', '-p', str(pid)])
    forcing = ('--force' in push_command or '-f' in push_command)
    removing_remote = set()
    for commit in commits:
        if commit.local_ref == "(delete)":
            removing_remote.add(commit.remote_branch)
    return Push(commits=commits,
                remote_name=args.remote_name, remote_url=args.remote_url,
                current_branch=current_branch, removing_remote=removing_remote,
                forcing=forcing)

Now that we’ve assembled all the info, we can check the push for things we want to prohibit.

Push(commits=[Commit(local_ref='refs/heads/topicbranch',
                     local_sha1='6eadeac2dade6347e87c0d24fd455feffa7069f0',
                     remote_ref='refs/heads/master',
                     remote_sha1='f1d2d2f924e986ac86fdf7b36c94bcdf32beec15',
                     local_branch='topicbranch',
                     remote_branch='master')],
     remote_name='origin',
     remote_url='git@github.com:axialmarket/repo.git',
     current_branch='topicbranch',
     removing_remote=set([]),
     forcing=False)

In this push, you can see that I’m attempting to push topicbranch on the local repo to master on the remote; this is the first category of mistake we want to prevent.

def check_unmerged(push):
    for commit in push.commits:
        compare = commit.local_branch
        if commit.local_ref == 'HEAD':
            compare = push.current_branch
        if commit.remote_branch in PROTECTED and \
                compare != commit.remote_branch:
            msg = ("You cannot push the local branch '{}' to the protected "
                   "remote branch '{}'\n".format(
                       compare, commit.remote_branch))
            parser.exit(1, message=msg)

PROTECTED is a set of branch names, currently master and qa. commit.local_branch will be None if the local ref doesn’t have a branch name. We special-case HEAD, where we instead use the current branch; otherwise we fail closed, to be on the safe side. This function prohibits accidentally pushing a topic branch directly to master or qa. We have a similar check for deleting a protected branch, based on the removing_remote property on the push object.
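
That deletion check needs only a few lines on top of removing_remote; a sketch of what it could look like:

def check_delete(push):
    for branch in push.removing_remote:
        if branch in PROTECTED:
            msg = ("You cannot delete the protected remote branch "
                   "'{}'\n".format(branch))
            parser.exit(1, message=msg)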

We use alembic for database migrations. When two people create migrations in their topic branches and merge them together, this creates a branched database history and alembic will refuse to apply migrations until you resolve the branch. We have some scripts to warn you when you get an alembic branch after a merge, but people have sometimes forgotten to fix that and pushed the branch to qa or even master. So here’s a check for that! It’s a bit more complicated than the last one.

def check_alembic_branch(push):
    for commit in push.commits:
        if commit.remote_branch not in PROTECTED:
            continue
        awd = tempfile.mkdtemp(prefix='alembic.')
        treeish = "{}:share/migrations".format(commit.local_sha1)
        tar = subprocess.Popen(['git', 'archive', '--format=tar', treeish],
                               stdout=subprocess.PIPE)
        extract = subprocess.Popen(['tar', 'xf', '-'], stdin=tar.stdout, cwd=awd)
        tar.stdout.close()  # Allows extract to receive a SIGPIPE if tar exits.
        extract.communicate()
        if extract.returncode != 0:
            parser.exit(1, message="unable to check for alembic branches\n")
        branches = subprocess.Popen(['alembic', 'branches'], cwd=awd,
                                    stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        output = branches.communicate()
        if branches.returncode != 0:
            parser.exit(1, message="unable to check for alembic branches\n")
        # cleanup awd! otherwise we pollute /tmp
        shutil.rmtree(awd)
        if len(output[0]) > 0:
            msg = ("Alembic migration conflict!\n{0}\n"
                   "Fix this first!\n")
            parser.exit(1, message=msg.format(output[0]))

The good news is alembic has a command to let you know if a branch exists. The bad news is alembic expects to be able to work on an actual file tree on disk. The working tree will often correspond to the tree being pushed, but we can’t guarantee that will be the case. So, we make a temporary directory and use git archive to extract the migrations directory for the appropriate refs. This is actually pretty fast.

This addresses our immediate pain points while being easy to extend in the future; there are a lot more checks we can add following these examples, such as preventing force-pushes to master or ensuring Jenkins has successfully built the topic branch first.
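
A force-push guard, for instance, can reuse the same building blocks (a sketch along the lines of the checks above, not necessarily what we run as-is):

def check_force_push(push):
    if not push.forcing:
        return
    for commit in push.commits:
        if commit.remote_branch in PROTECTED:
            msg = ("You cannot force-push to the protected remote "
                   "branch '{}'\n".format(commit.remote_branch))
            parser.exit(1, message=msg)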

[image via git-scm.com]