[PATCH 5 of 6 V2] match: avoid translating glob to matcher multiple times for large sets
Martin von Zweigbergk
martinvonz at google.com
Wed Nov 28 18:30:01 UTC 2018
On Wed, Nov 28, 2018 at 7:41 AM Boris FELD <boris.feld at octobus.net> wrote:
>
> On 24/11/2018 00:51, Martin von Zweigbergk via Mercurial-devel wrote:
>
>
>
> On Fri, Nov 23, 2018 at 9:20 AM Boris FELD <boris.feld at octobus.net> wrote:
>
>>
>> On 23/11/2018 10:24, Yuya Nishihara wrote:
>> > On Fri, 23 Nov 2018 18:00:36 +0900, Yuya Nishihara wrote:
>> >> On Fri, 23 Nov 2018 00:00:36 -0800, Martin von Zweigbergk via Mercurial-devel wrote:
>> >>> On Thu, Nov 22, 2018 at 11:44 PM Martin von Zweigbergk <
>> >>> martinvonz at google.com> wrote:
>> >>>> On Thu, Nov 22, 2018 at 2:26 PM Boris Feld <boris.feld at octobus.net> wrote:
>> >>>>
>> >>>>> # HG changeset patch
>> >>>>> # User Boris Feld <boris.feld at octobus.net>
>> >>>>> # Date 1542916922 -3600
>> >>>>> # Thu Nov 22 21:02:02 2018 +0100
>> >>>>> # Node ID 018578f3ab597d5ea573107e7310470de76a3907
>> >>>>> # Parent 4628c3cf1fc1052ca25296c8c1a42c4502b59dc9
>> >>>>> # EXP-Topic perf-ignore-2
>> >>>>> # Available At https://bitbucket.org/octobus/mercurial-devel/
>> >>>>> # hg pull https://bitbucket.org/octobus/mercurial-devel/ -r 018578f3ab59
>> >>>>> match: avoid translating glob to matcher multiple times for large sets
>> >>>>>
>> >>>>> For hgignore with many globs, the resulting regexp might not fit
>> >>>>> under the 20K length limit, so the patterns need to be broken up
>> >>>>> into smaller pieces.
>> >>>>>
>> >>>> Did you see 0f6a1bdf89fb (match: handle large regexes, 2007-08-19)
>> >>>> and 59a9dc9562e2 (ignore: split up huge patterns, 2008-02-11)? It
>> >>>> might be worth trying to figure out what Python versions those
>> >>>> commits are talking about. Maybe we've dropped support for those
>> >>>> versions and we can simplify this code.
>> >>>>
>> >>> Oh, and what made me do the archaeology there was that you seem to
>> >>> have lost the handling of OverflowError from the regex engine. As I
>> >>> said above, I suspect that's fine because we no longer support some
>> >>> very old Python versions (but please try to figure out what version
>> >>> that refers to). Still, if we decide to drop that OverflowError
>> >>> handling, I'd prefer to see that in an explicit commit early in this
>> >>> series.
>> To me, 0f6a1bdf89fb (catching error from engine) is superseded by
>> 59a9dc9562e2 (cannot trust the engine, preemptively raise our own error).
>>
>
> Yes, perhaps (if it was only expressions longer than 20k that raised
> OverflowError). My point was that if that was the case, we should rewrite
> to avoid using an internal exception for flow control, i.e. change from
> this:
>
> try:
>     regex = ... # create regex
>     if len(regex) > MAX_RE_SIZE:
>         raise OverflowError
>     return regex, _rematcher(regex)
> except OverflowError:
>     # break up into smaller pieces
>
> to this:
>
> regex = ... # create regex
> if len(regex) < MAX_RE_SIZE:
>     return regex, _rematcher(regex)
> # break up into smaller pieces
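>
> A minimal runnable sketch of that second shape (the helper names here
> are hypothetical stand-ins, not the actual match.py code):
>
> import re
>
> MAX_RE_SIZE = 20000
>
> def _rematcher(regex):
>     # plays the role of mercurial.match._rematcher
>     return re.compile(regex).match
>
> def _makeregex(regexps):
>     # hypothetical helper: join already-translated pattern pieces
>     return '(?:%s)' % '|'.join(regexps)
>
> def buildmatcher(regexps):
>     regex = _makeregex(regexps)
>     if len(regex) < MAX_RE_SIZE:
>         return regex, _rematcher(regex)
>     # otherwise, break regexps up into smaller groups, as the series does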
>
> I don't disagree with the point. However, I do not know why it is relevant
> to the current discussion. The series in review is no longer using an
> exception for flow control. What am I missing?
>
My request was to make the removal of support for OverflowError explicit.
As I said above, it seems it was just lost in this patch without a mention.
I've sent D5309 to show what I mean. Your patch 6/6 would go well right
after that. You could rebase your series on top, or maybe I'm just being
nitpicky and I should drop the patch.
>
>
>
>>
>> So I feel like it is fine to just rely on the size limit.
>> >> Perhaps it's been fixed since 2.7.4. The regexp code width was extended
>> >> from 16-bit to 32-bit (or Py_UCS4) integers. That should be large
>> >> enough to handle practical patterns.
>> >>
>> >> https://bugs.python.org/issue1160
>>
>> Thanks for digging this out. It looks like we may be able to drop this
>> limit altogether. However, I would like to make it a change distinct
>> from this series.
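>>
>> For the record, a quick check on a modern interpreter suggests patterns
>> far beyond the old cutoff compile fine (the sizes below are arbitrary):
>>
>> import re
>> # one alternation well past the 20K limit
>> pattern = '(?:%s)' % '|'.join('dir%d/.*' % i for i in range(5000))
>> print(len(pattern))                  # ~50K characters
>> rx = re.compile(pattern)             # no OverflowError on 2.7.4+ / 3.x
>> print(bool(rx.match('dir42/foo')))   # True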
>>
>> The current code is very problematic for some people (to the point where
>> the majority of `hg status` time is spent in that function). I would
>> like to get fast code for the same semantics first, then look into
>> changing the semantics.
>>
>
> Is your concern that you might regress the performance of something by
> changing how large the groups are? Or that it would be more work?
>
>
> My concern is that we are currently delaying a simple code change yielding
> a huge improvement because there is the option of another, larger and
> riskier one.
>
> I'm not dismissing your findings about lifting the limit and avoiding the
> regexp joining. They look very promising, and we are planning to explore
> them. However, they are a different change.
>
> The current series is preserving all the assumptions we have been using
> for the past 10 years, implementing faster code for the same result. This
> is very low risk and solves the most immediate problem of a user group we
> support.
>
> Dropping the limit is a different matter. It is probably harmless to do
> so, but Mercurial runs on a variety of systems and regexp engines. We'll
> have to look at all the cases to make sure our new assumptions are valid.
> The same goes for stopping the grouping: we'll have to check the behavior
> on various real-life use cases to make sure it is valid.
>
I suspect it will be best (depending on the regex engine) either to not do
any grouping, or to put all patterns in one group. I'd like someone to look
into how re2 behaves here.
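
For the stdlib engine, a rough synthetic comparison of the two extremes
might look like this (the pattern set is made up; in practice it would be
the translated hgignore globs):

import re
import timeit

patterns = ['dir%d/.*' % i for i in range(10000)]

# extreme 1: one compiled matcher per pattern
permatchers = [re.compile(p).match for p in patterns]
def one_per_pattern(s):
    return any(m(s) for m in permatchers)

# extreme 2: all patterns in a single alternation group
bigmatch = re.compile('(?:%s)' % '|'.join(patterns)).match

for fn in (one_per_pattern, bigmatch):
    # worst case for extreme 1: the matching pattern comes last
    print(timeit.timeit(lambda: fn('dir9999/foo'), number=100))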
> We are happy to explore these two ideas ourselves,
>
Thanks!
> but I would prefer to see the faster grouping go in first. This is a good
> intermediate step that improves the current situation.
>
> In addition, we won't have direct access to the real-life repository for
> about two months, so it will be hard for us to check the timing of a
> significantly new approach on real-life data.
>
>
> I tried creating a regex for *every* pattern and that actually seemed
> faster (to my surprise), both when creating the matcher and when evaluating
> it. I tried it on the mozilla-unified repo both with 1k files and with 10k
> files in the hgignores. I used the following patch on top of your series.
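>
> (For concreteness, a synthetic setup along these lines reproduces it; the
> pattern shapes here are made up, not the exact ones I used:
>
> import random
> import string
>
> def genignore(n, path='.hgignore'):
>     # write n random glob patterns, one per line
>     with open(path, 'w') as f:
>         f.write('syntax: glob\n')
>         for _ in range(n):
>             name = ''.join(random.choice(string.ascii_lowercase)
>                            for _ in range(12))
>             f.write('**/%s*\n' % name)
>
> genignore(10000)
>
> and then timing `hg status` with and without the patch.)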
>
> Could you share more details on the patterns you generated, the operations
> you performed, and how you gathered the timings?
>
Have to run to a meeting. Will get back to you later.
>
> diff --git a/mercurial/match.py b/mercurial/match.py
> --- a/mercurial/match.py
> +++ b/mercurial/match.py
> @@ -1184,51 +1184,15 @@ def _buildmatch(kindpats, globsuffix, li
>      else:
>          return regex, lambda f: any(mf(f) for mf in matchfuncs)
>
> -MAX_RE_SIZE = 20000
> -_BASE_SIZE = len('(?:)') - 1
> -
> -def _joinregexes(regexps):
> -    """gather multiple regular expressions into a single one"""
> -    return '(?:%s)' % '|'.join(regexps)
> -
>  def _buildregexmatch(kindpats, globsuffix):
>      """Build a match function from a list of kinds and kindpats,
>      return regexp string and a matcher function.
> -
> -    Test too large input
> -    >>> _buildregexmatch([
> -    ...     ('relglob', '?' * MAX_RE_SIZE, '')
> -    ... ], '$')
> -    Traceback (most recent call last):
> -    ...
> -    Abort: matcher pattern is too long (20009 bytes)
>      """
>      try:
> -        allgroups = []
>          regexps = [_regex(k, p, globsuffix) for (k, p, s) in kindpats]
> -        fullregexp = _joinregexes(regexps)
> -
> -        startidx = 0
> -        groupsize = _BASE_SIZE
> -        for idx, r in enumerate(regexps):
> -            piecesize = len(r)
> -            if (piecesize + 4) > MAX_RE_SIZE:
> -                msg = _("matcher pattern is too long (%d bytes)") % piecesize
> -                raise error.Abort(msg)
> -            elif (groupsize + 1 + piecesize) > MAX_RE_SIZE:
> -                group = regexps[startidx:idx]
> -                allgroups.append(_joinregexes(group))
> -                startidx = idx
> -                groupsize = _BASE_SIZE
> -            groupsize += piecesize + 1
> -
> -        if startidx == 0:
> -            func = _rematcher(fullregexp)
> -        else:
> -            group = regexps[startidx:]
> -            allgroups.append(_joinregexes(group))
> -            allmatchers = [_rematcher(g) for g in allgroups]
> -            func = lambda s: any(m(s) for m in allmatchers)
> +        fullregexp = '(?:%s)' % '|'.join(regexps)
> +        allmatchers = [_rematcher(r) for r in regexps]
> +        func = lambda s: any(m(s) for m in allmatchers)
>          return fullregexp, func
>      except re.error:
>          for k, p, s in kindpats:
>
>
>
>
>>
>> > That said, combining more chunks of regex patterns might lead to
>> > another funny problem.
>> >
>> > % python -c 'import re; re.compile("(a)" * 100)'
>> > Traceback (most recent call last):
>> >   File "<string>", line 1, in <module>
>> >   File "/usr/lib/python2.7/re.py", line 194, in compile
>> >     return _compile(pattern, flags)
>> >   File "/usr/lib/python2.7/re.py", line 249, in _compile
>> >     p = sre_compile.compile(pattern, flags)
>> >   File "/usr/lib/python2.7/sre_compile.py", line 583, in compile
>> >     "sorry, but this version only supports 100 named groups"
>> > AssertionError: sorry, but this version only supports 100 named groups
>> >
>> > It's unrelated to the OverflowError issue, but splitting patterns could
>> > help avoid the 100-named-group problem.
>>
>> By chance, my current gigantic use case does not involve named groups.
>>
>> Catching AssertionError will be fun. I wish there were some clean API
>> to expose and check engine limitations.
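>>
>> Lacking that, a feature probe at import time is about the best we can
>> do. A rough sketch (not something this series does; the probe sizes are
>> illustrative):
>>
>> import re
>>
>> def _enginesupports(pattern):
>>     # ask the engine directly instead of trusting version checks: old
>>     # engines raise OverflowError for huge patterns and AssertionError
>>     # for more than 100 (named) groups
>>     try:
>>         re.compile(pattern)
>>         return True
>>     except (OverflowError, AssertionError, re.error):
>>         return False
>>
>> HUGE_PATTERN_OK = _enginesupports('x' * 30000)
>> MANY_GROUPS_OK = _enginesupports('(a)' * 200)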
>>