Login regression for users with non-ASCII names

On 2020-08-13, we deployed an update that prevented users whose full names contain non-ASCII characters (which is of course very common) from logging into Launchpad. We heard about this serious regression from users on 2020-08-17, and rolled out a fix on 2020-08-18. We’re sorry about this; it falls short of the standards of inclusion and quality that we set for ourselves. This post explains what happened, the technical details of why it happened, and the steps we’ve taken to avoid it happening again.

Launchpad still runs on Python 2. This is a problem, and we’ve been gradually chipping away at it for the last couple of years. With about three-quarters of a million lines of Python code in the main tree and over 200 dependencies, it’s a big job – but we’re well underway!

Some of those dependencies have been difficult problems in their own right. The one at issue here was python-openid, which we use as part of our login workflow, but which hasn’t been actively maintained for over ten years. Fortunately, in this case we didn’t have to port it ourselves, because there were already a couple of forks that added Python 3 support while preserving more or less the same interface. We chose python-openid2 because it had done a good job of maintaining both Python 2 and 3 support in the same codebase, which we needed in order to arrange a practical transition, and because it was itself well-maintained. We worked with upstream to fix a couple of issues discovered by the Launchpad test suite that blocked us from migrating to it (notably PR #41, although that was fixed as PR #43 instead), and switched Launchpad over once python-openid2 3.2 was released. So far, so good.

One of the major reasons for much of the disruption in the Python 3 transition was to provide a clean separation between the concept of a sequence of bytes and a text string, which was often a problem for code that needed to handle Unicode: it’s all too common in Python 2 to have code that works on the ASCII domain (which can be represented either as str or unicode) but that fails on Unicode strings outside that subset. Launchpad is less prone to that than many Python 2 applications because the ORM we use (Storm) has always been relatively strict about the boundary between bytes and text; nevertheless, having a stricter data model here is a good thing for us in the long term. It might seem ironic that we ran into exactly such a bug as part of porting to Python 3; but then, we aren’t using the new interpreter yet.

Launchpad uses the OpenID Simple Registration Extension in its login workflow. It specifically requests the user’s full name from Canonical’s OpenID provider (login.ubuntu.com, which we generally call “SSO”): this means that if the user has an SSO account but not yet a Launchpad account, we can create a Launchpad account for them without them needing to enter their name again. That full name is encoded as a UTF-8 string, which in turn is URL-encoded using the usual %xx mechanism. This means that if, say, your name is Gráinne Ní Mháille, it will show up in the OpenID response’s query string as openid.sreg.fullname=Gr%C3%A1inne+N%C3%AD+Mh%C3%A1ille.
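To make the encoding concrete, here is a small sketch using Python 3’s standard urllib.parse module. It is not Launchpad or python-openid2 code; it only round-trips the example name through UTF-8 and %xx encoding to show what travels in the query string.

```python
from urllib.parse import parse_qs, quote_plus

full_name = "Gráinne Ní Mháille"

# quote_plus encodes text as UTF-8 by default, percent-encodes each
# non-ASCII byte, and turns spaces into "+".
encoded = quote_plus(full_name)
query_string = "openid.sreg.fullname=" + encoded

# parse_qs reverses both steps, again assuming UTF-8.
decoded = parse_qs(query_string)["openid.sreg.fullname"][0]

print(query_string)
print(decoded == full_name)
```

Each accented character becomes two %xx escapes because it occupies two bytes in UTF-8.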

python-openid2 uses its openid.urinorm module to normalise parts of the response, decoding and re-encoding it to make sure comparisons work as expected; this is built on top of the URL handling code in Python’s standard library. Now, unlike Python 3, Python 2’s urlencode has undocumented restrictions on values in the query argument: if the doseq argument is False (the default), then it converts values using str(v), while if it’s True then it converts Unicode values using v.encode("ASCII", "replace") (potentially losing information!). In this case, doseq is False, and the input given to it is always text (unicode on Python 2): this works fine if the input is within the ASCII subset, but if it’s not:

>>> urlencode({u'openid.sreg.fullname': u'Gráinne Ní Mháille'})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/urllib.py", line 1350, in urlencode
    v = quote_plus(str(v))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 2: ordinal not in range(128)

The fix is that on Python 2 one must always pass values to urlencode as bytes rather than text:

>>> urlencode({u'openid.sreg.fullname': u'Gráinne Ní Mháille'.encode('UTF-8')})
'openid.sreg.fullname=Gr%C3%A1inne+N%C3%AD+Mh%C3%A1ille'

We’ve sent PR #47 to python-openid2 to implement this. We’ve also made a temporary local fork of python-openid2 containing this patch and deployed it to Launchpad production.
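The general shape of such a fix can be sketched as a small wrapper that encodes any text to UTF-8 before handing it to urlencode. This is only an illustration under our own assumptions: urlencode_utf8 is a hypothetical name, not the code in PR #47 or anything in python-openid2, and it is written so that it should behave the same on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

text_type = type(u"")  # unicode on Python 2, str on Python 3


def urlencode_utf8(params):
    """Hypothetical helper: URL-encode a dict, encoding text as UTF-8 first."""

    def to_bytes(value):
        # Only text needs encoding; bytes and other types pass through.
        return value.encode("utf-8") if isinstance(value, text_type) else value

    # Sort the items so the output order is deterministic.
    pairs = [(to_bytes(k), to_bytes(v)) for k, v in sorted(params.items())]
    return urlencode(pairs)


print(urlencode_utf8({u"openid.sreg.fullname": u"Gráinne Ní Mháille"}))
```

Because urlencode only ever sees bytes here, the implicit str(v) conversion on Python 2 has nothing left to encode and cannot raise UnicodeEncodeError.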

One thing to be clear about here: though the root cause was a bug in python-openid2, it’s our responsibility to make sure it works correctly when integrated into Launchpad.

We missed this bug because of a gap in testing: although we did test the full login workflow, we only did so with a test user whose full name was entirely ASCII. We’ve closed this gap now, so we’ll catch it if a dependency regresses in future.
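A regression test along these lines might look like the following sketch. It is not Launchpad’s actual test, which exercises the whole login workflow; it only checks, using Python 3’s standard library, that a handful of illustrative names (our own choice for this example) survive a query-string round trip.

```python
import unittest
from urllib.parse import parse_qs, urlencode

# Illustrative sample names: one pure ASCII, the rest outside that subset.
SAMPLE_NAMES = [
    "Test User",
    "Gráinne Ní Mháille",
    "Ελένη Παπαδοπούλου",
    "山田 太郎",
]


class FullNameRoundTripTest(unittest.TestCase):
    def test_names_survive_query_string(self):
        for name in SAMPLE_NAMES:
            with self.subTest(name=name):
                query = urlencode({"openid.sreg.fullname": name})
                # The encoded form must be plain ASCII on the wire;
                # this raises UnicodeEncodeError if it is not.
                query.encode("ascii")
                decoded = parse_qs(query)["openid.sreg.fullname"][0]
                self.assertEqual(name, decoded)
```

Saved in a test module, this can be run with python -m unittest. The point is simply that at least one test user should have a name outside the ASCII subset.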

One Response to “Login regression for users with non-ASCII names”

  1. Francisco Jiménez Cabrera Says:

    Nice job!
