[PATCH 1 of 3 RFC] mercurial: implement a source transforming module loader on Python 3

Gregory Szorc gregory.szorc at gmail.com
Mon May 16 15:50:44 UTC 2016



> On May 16, 2016, at 08:43, Simon King <simon at simonking.org.uk> wrote:
> 
> I don't think that's supposed to happen, is it? Python should
> automatically invalidate .pyc files based on a magic number that
> changes when the format changes:
> 
> https://hg.python.org/cpython/file/2.7/Python/import.c#l31

The problem is we're inserting code between file reading and code generation - code that Python's importers don't know about. Python assumes file content gets converted into code in a deterministic manner that is defined by the Python distribution itself. We're outside of that, so we need to provide our own cache validation checking.

> 
>> On Mon, May 16, 2016 at 4:31 PM, timeless <timeless at gmail.com> wrote:
>> Fwiw, We already need some cache invalidation. Switching between Python 2.6
>> and 2.7 results in really bad outcomes. :)
>> 
>>> On May 16, 2016 12:03 AM, "Gregory Szorc" <gregory.szorc at gmail.com> wrote:
>>> 
>>> # HG changeset patch
>>> # User Gregory Szorc <gregory.szorc at gmail.com>
>>> # Date 1463370916 25200
>>> #      Sun May 15 20:55:16 2016 -0700
>>> # Node ID 7c5d1f8db9618f511f40bc4089145310671ca57b
>>> # Parent  f8b87a779c87586aa043bcd6030369715edfc9c1
>>> mercurial: implement a source transforming module loader on Python 3
>>> 
>>> The most painful part of ensuring Python code runs on both Python 2
>>> and 3 is string encoding. Making this difficult is that string
>>> literals in Python 2 are bytes and string literals in Python 3 are
>>> unicode. So, to ensure consistent types are used, you have to
>>> use "from __future__ import unicode_literals" and/or prefix literals
>>> with their type (e.g. b'foo' or u'foo').
>>> 
>>> Nearly every string in Mercurial is bytes. So, to use the same source
>>> code on both Python 2 and 3 would require prefixing nearly every
>>> string literal with "b" to make it a byte literal. This is ugly and
>>> not something mpm is willing to do.
>>> 
>>> This patch implements a custom module loader on Python 3 that performs
>>> source transformation to convert string literals (unicode in Python 3)
>>> to byte literals. In effect, it changes Python 3's string literals to
>>> behave like Python 2's.
>>> 
>>> The module loader is only used on mercurial.* and hgext.* modules.
>>> 
>>> The loader works by tokenizing the loaded source and replacing
>>> "string" tokens if necessary. The modified token stream is
>>> untokenized back to source and loaded like normal. This does add some
>>> overhead. However, this all occurs before caching. So .pyc files should
>>> cache the version with byte literals.
>>> 
>>> This patch isn't suitable for checkin. There are a few deficiencies,
>>> including that changes to the loader won't result in the cache
>>> being invalidated. As part of testing this, I've had to manually
>>> blow away __pycache__ directories. We'll likely need to hack up
>>> cache checking as well so caching is invalidated when
>>> mercurial/__init__.py changes. This is going to be ugly.
>>> 
>>> diff --git a/mercurial/__init__.py b/mercurial/__init__.py
>>> --- a/mercurial/__init__.py
>>> +++ b/mercurial/__init__.py
>>> @@ -139,14 +139,89 @@ class hgimporter(object):
>>>             if not modinfo:
>>>                 raise ImportError('could not find mercurial module %s' %
>>>                                   name)
>>> 
>>>         mod = imp.load_module(name, *modinfo)
>>>         sys.modules[name] = mod
>>>         return mod
>>> 
>>> +if sys.version_info[0] >= 3:
>>> +    from . import pure
>>> +    import importlib
>>> +    import io
>>> +    import token
>>> +    import tokenize
>>> +
>>> +    class hgpathentryfinder(importlib.abc.PathEntryFinder):
>>> +        """A sys.meta_path finder."""
>>> +        def find_spec(self, fullname, path, target=None):
>>> +            # Our custom loader rewrites source code and Python code
>>> +            # that doesn't belong to Mercurial doesn't expect this.
>>> +            if not fullname.startswith(('mercurial.', 'hgext.')):
>>> +                return None
>>> +
>>> +            # This assumes Python 3 doesn't support loading C modules.
>>> +            if fullname in _dualmodules:
>>> +                stem = fullname.split('.')[-1]
>>> +                fullname = 'mercurial.pure.%s' % stem
>>> +                target = pure
>>> +                assert len(path) == 1
>>> +                path = [os.path.join(path[0], 'pure')]
>>> +
>>> +            # Try to find the module using other registered finders.
>>> +            spec = None
>>> +            for finder in sys.meta_path:
>>> +                if finder == self:
>>> +                    continue
>>> +
>>> +                spec = finder.find_spec(fullname, path, target=target)
>>> +                if spec:
>>> +                    break
>>> +
>>> +            if not spec:
>>> +                return None
>>> +
>>> +            if fullname.startswith('mercurial.pure.'):
>>> +                spec.name = spec.name.replace('.pure.', '.')
>>> +
>>> +            # TODO need to support loaders from alternate specs, like zip
>>> +            # loaders.
>>> +            spec.loader = hgloader(spec.name, spec.origin)
>>> +            return spec
>>> +
>>> +    def replacetoken(t):
>>> +        if t.type == token.STRING:
>>> +            s = t.string
>>> +
>>> +            # If a docstring, keep it as a string literal.
>>> +            if s[0:3] in ("'''", '"""'):
>>> +                return t
>>> +
>>> +            if s[0] not in ("'", '"'):
>>> +                return t
>>> +
>>> +            # String literal. Prefix to make a b'' string.
>>> +            return tokenize.TokenInfo(t.type, 'b%s' % s, t.start, t.end,
>>> t.line)
>>> +
>>> +        return t
>>> +
>>> +    class hgloader(importlib.machinery.SourceFileLoader):
>>> +        """Custom module loader that transforms source code.
>>> +
>>> +        When the source code is converted to code, we first transform
>>> +        string literals to byte literals using the tokenize API.
>>> +        """
>>> +        def source_to_code(self, data, path):
>>> +            buf = io.BytesIO(data)
>>> +            tokens = tokenize.tokenize(buf.readline)
>>> +            data = tokenize.untokenize(replacetoken(t) for t in tokens)
>>> +            return super(hgloader, self).source_to_code(data, path)
>>> +
>>> # We automagically register our custom importer as a side-effect of
>>> loading.
>>> # This is necessary to ensure that any entry points are able to import
>>> # mercurial.* modules without having to perform this registration
>>> themselves.
>>> -if not any(isinstance(x, hgimporter) for x in sys.meta_path):
>>> -    # meta_path is used before any implicit finders and before sys.path.
>>> -    sys.meta_path.insert(0, hgimporter())
>>> +if sys.version_info[0] >= 3:
>>> +    sys.meta_path.insert(0, hgpathentryfinder())
>>> +else:
>>> +    if not any(isinstance(x, hgimporter) for x in sys.meta_path):
>>> +        # meta_path is used before any implicit finders and before
>>> sys.path.
>>> +        sys.meta_path.insert(0, hgimporter())
>>> _______________________________________________
>>> Mercurial-devel mailing list
>>> Mercurial-devel at mercurial-scm.org
>>> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>> 
>> 
>> _______________________________________________
>> Mercurial-devel mailing list
>> Mercurial-devel at mercurial-scm.org
>> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>> 



More information about the Mercurial-devel mailing list