[Updated] D11280: fix: reduce number of tool executions

hooper (Danny Hooper) phabricator at mercurial-scm.org
Thu Sep 2 22:07:50 UTC 2021


hooper edited the summary of this revision.
hooper marked 2 inline comments as done.
hooper updated this revision to Diff 30179.

REPOSITORY
  rHG Mercurial

CHANGES SINCE LAST UPDATE
  https://phab.mercurial-scm.org/D11280?vs=29889&id=30179

BRANCH
  stable

CHANGES SINCE LAST ACTION
  https://phab.mercurial-scm.org/D11280/new/

REVISION DETAIL
  https://phab.mercurial-scm.org/D11280

AFFECTED FILES
  hgext/fix.py
  tests/test-fix.t

CHANGE DETAILS

diff --git a/tests/test-fix.t b/tests/test-fix.t
--- a/tests/test-fix.t
+++ b/tests/test-fix.t
@@ -1758,7 +1758,56 @@
   $ cat $LOGFILE | sort | uniq -c
         4 bar.log
         4 baz.log
-        4 foo.log
-        4 qux.log
+        3 foo.log
+        2 qux.log
 
   $ cd ..
+
+For tools that support line ranges, it's wrong to blindly re-use fixed file
+content for the same file revision if it appears twice with different baserevs,
+because the changed line ranges could differ. Since computing line ranges is
+inherently ambiguous, this is a usability issue rather than a correctness
+issue. It might be simpler if baserevs were computed on a per-file basis,
+making this situation impossible to construct.
+
+In the following example, we construct two subgraphs with the same file
+revisions, and fix different sub-subgraphs to get different baserevs and
+different changed line ranges. The key precondition is that revisions 1 and 4
+have the same file revision, and the key result is that their successors don't
+have the same file content, because we want to fix different areas of that same
+file revision's content.
+
+  $ hg init differentlineranges
+  $ cd differentlineranges
+
+  $ printf "a\nb\n" > file.changed
+  $ hg commit -Aqm "0 ab"
+  $ printf "a\nx\n" > file.changed
+  $ hg commit -Aqm "1 ax"
+  $ hg remove file.changed
+  $ hg commit -Aqm "2 removed"
+  $ hg revert file.changed -r 0
+  $ hg commit -Aqm "3 ab (reverted)"
+  $ hg revert file.changed -r 1
+  $ hg commit -Aqm "4 ax (reverted)"
+
+  $ hg manifest --debug --template "{hash}\n" -r 0; \
+  > hg manifest --debug --template "{hash}\n" -r 3
+  418f692145676128d2fb518b027ddbac624be76e
+  418f692145676128d2fb518b027ddbac624be76e
+  $ hg manifest --debug --template "{hash}\n" -r 1; \
+  > hg manifest --debug --template "{hash}\n" -r 4
+  09b8b3ce5a507caaa282f7262679e6d04091426c
+  09b8b3ce5a507caaa282f7262679e6d04091426c
+
+  $ hg fix --working-dir -r 1+3+4
+  3 new orphan changesets
+
+  $ hg cat file.changed -r "successors(1)" --hidden
+  a
+  X
+  $ hg cat file.changed -r "successors(4)" --hidden
+  A
+  X
+
+  $ cd ..
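
The situation the new test exercises can be seen in isolation: the same target
content produces different changed line ranges depending on the base it is
diffed against. A minimal standalone sketch, using difflib as a rough stand-in
for the extension's own line-range computation (the helper name is
illustrative, not part of the patch):

  import difflib

  def changedranges(base, target):
      # 1-based (start, end) ranges of target lines that differ from base,
      # approximating what a :linerange tool would be asked to fix.
      ranges = []
      matcher = difflib.SequenceMatcher(None, base, target)
      for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
          if tag != 'equal':
              ranges.append((j1 + 1, max(j1 + 1, j2)))
      return ranges

  target = ['a', 'x']  # the file revision shared by changesets 1 and 4
  print(changedranges(['a', 'b'], target))  # base "ab": [(2, 2)]
  print(changedranges([], target))          # empty base: [(1, 2)]

Because the ranges differ, one fixed blob cannot be re-used for both
changesets, which is why work items are keyed on baserevs as well as the file
revision.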
diff --git a/hgext/fix.py b/hgext/fix.py
--- a/hgext/fix.py
+++ b/hgext/fix.py
@@ -283,20 +283,29 @@
         # There are no data dependencies between the workers fixing each file
         # revision, so we can use all available parallelism.
         def getfixes(items):
-            for rev, path in items:
-                ctx = repo[rev]
+            for srcrev, path, dstrevs in items:
+                ctx = repo[srcrev]
                 olddata = ctx[path].data()
                 metadata, newdata = fixfile(
-                    ui, repo, opts, fixers, ctx, path, basepaths, basectxs[rev]
+                    ui,
+                    repo,
+                    opts,
+                    fixers,
+                    ctx,
+                    path,
+                    basepaths,
+                    basectxs[srcrev],
                 )
-                # Don't waste memory/time passing unchanged content back, but
-                # produce one result per item either way.
-                yield (
-                    rev,
-                    path,
-                    metadata,
-                    newdata if newdata != olddata else None,
-                )
+                # We ungroup the work items now, because the code that consumes
+                # these results has to handle each dstrev separately, and in
+                # topological order. Since a result may be held until its
+                # changeset's turn comes, passing around references to
+                # "newdata" instead of copying it matters: otherwise we would
+                # keep more copies of file content in memory at a time than if
+                # we hadn't bothered to group/deduplicate the work items.
+                data = newdata if newdata != olddata else None
+                for dstrev in dstrevs:
+                    yield (dstrev, path, metadata, data)
 
         results = worker.worker(
             ui, 1.0, getfixes, tuple(), workqueue, threadsafe=False
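
The fan-out at the end of getfixes() can be pictured on its own; this sketch
(with made-up inputs) shows that every yielded result holds a reference to the
same bytes object rather than a copy:

  def fanout(srcrev, path, dstrevs, olddata, newdata):
      # Don't waste memory/time passing unchanged content back, but produce
      # one result per dstrev either way, all sharing one "data" reference.
      data = newdata if newdata != olddata else None
      for dstrev in dstrevs:
          yield (dstrev, path, data)

  results = list(fanout(1, b'foo/bar.txt', (1, 2, 3), b'old', b'new'))
  assert all(r[2] is results[0][2] for r in results)  # shared, not copied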
@@ -376,23 +385,32 @@
 
 
 def getworkqueue(ui, repo, pats, opts, revstofix, basectxs):
-    """Constructs the list of files to be fixed at specific revisions
+    """Constructs a list of files to fix and which revisions each fix applies to
 
-    It is up to the caller how to consume the work items, and the only
-    dependence between them is that replacement revisions must be committed in
-    topological order. Each work item represents a file in the working copy or
-    in some revision that should be fixed and written back to the working copy
-    or into a replacement revision.
+    To avoid duplicating work, there is usually only one work item for each file
+    revision that might need to be fixed. There can be multiple work items per
+    file revision if the same file needs to be fixed in multiple changesets with
+    different baserevs. Each work item also contains a list of changesets where
+    the file's data should be replaced with the fixed data. The work items for
+    earlier changesets come earlier in the work queue, to improve pipelining by
+    allowing the first changeset to be replaced while fixes are still being
+    computed for later changesets.
 
-    Work items for the same revision are grouped together, so that a worker
-    pool starting with the first N items in parallel is likely to finish the
-    first revision's work before other revisions. This can allow us to write
-    the result to disk and reduce memory footprint. At time of writing, the
-    partition strategy in worker.py seems favorable to this. We also sort the
-    items by ascending revision number to match the order in which we commit
-    the fixes later.
+    Also returned is a map from changesets to the count of work items that might
+    affect each changeset. This is used later to count when all of a changeset's
+    work items have been finished, without having to inspect the remaining work
+    queue in each worker subprocess.
+
+    The example work item (1, "foo/bar.txt", (1, 2, 3)) means that the data of
+    bar.txt should be read from revision 1, then fixed, and written back to
+    revisions 1, 2 and 3. Revision 1 is called the "srcrev" and the list of
+    revisions is called the "dstrevs". In practice the srcrev is always one of
+    the dstrevs, and we make that choice when constructing the work item so that
+    the choice can't be made inconsistently later on. The dstrevs should all
+    have the same file revision for the given path, so the choice of srcrev is
+    arbitrary. The wdirrev can be a dstrev and a srcrev.
     """
-    workqueue = []
+    dstrevmap = collections.defaultdict(list)
     numitems = collections.defaultdict(int)
     maxfilesize = ui.configbytes(b'fix', b'maxfilesize')
     for rev in sorted(revstofix):
@@ -410,8 +428,19 @@
                     % (util.bytecount(maxfilesize), path)
                 )
                 continue
-            workqueue.append((rev, path))
+            baserevs = tuple(ctx.rev() for ctx in basectxs[rev])
+            dstrevmap[(fctx.filerev(), baserevs, path)].append(rev)
             numitems[rev] += 1
+    workqueue = [
+        (dstrevs[0], path, dstrevs)
+        for (filerev, baserevs, path), dstrevs in dstrevmap.items()
+    ]
+    # Move work items for earlier changesets to the front of the queue, so we
+    # might be able to replace those changesets (in topological order) while
+    # we're still processing later work items. There are some situations where
+    # this doesn't help much, but some situations where it lets us buffer O(1)
+    # files instead of O(n) files.
+    workqueue.sort(key=lambda item: min(item[2]))
     return workqueue, numitems
 
 
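The grouping in getworkqueue() above can be exercised with plain tuples
standing in for repo objects; here filerev stands in for fctx.filerev(), and
all names are illustrative:

  import collections

  def groupworkitems(items):
      # items: iterable of (rev, path, filerev, baserevs) for files needing
      # fixes; returns (workqueue, numitems) shaped like getworkqueue()'s.
      dstrevmap = collections.defaultdict(list)
      numitems = collections.defaultdict(int)
      for rev, path, filerev, baserevs in sorted(items):
          dstrevmap[(filerev, baserevs, path)].append(rev)
          numitems[rev] += 1
      workqueue = [
          (dstrevs[0], path, dstrevs)
          for (_filerev, _baserevs, path), dstrevs in dstrevmap.items()
      ]
      # Earlier changesets first, so they can be replaced while later
      # changesets' fixes are still being computed.
      workqueue.sort(key=lambda item: min(item[2]))
      return workqueue, numitems

  # Three changesets sharing one file revision and the same baserevs yield a
  # single work item: the file is fixed once and written back three times.
  wq, n = groupworkitems([(r, 'f.txt', 'abc123', (0,)) for r in (1, 2, 3)])
  assert wq == [(1, 'f.txt', [1, 2, 3])]
  assert n == {1: 1, 2: 1, 3: 1}
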
@@ -516,9 +545,9 @@
         return {}
 
     basepaths = {}
-    for rev, path in workqueue:
-        fixctx = repo[rev]
-        for basectx in basectxs[rev]:
+    for srcrev, path, _dstrevs in workqueue:
+        fixctx = repo[srcrev]
+        for basectx in basectxs[srcrev]:
             basepath = copies.pathcopies(basectx, fixctx).get(path, path)
             if basepath in basectx:
                 basepaths[(basectx.rev(), fixctx.rev(), path)] = basepath
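
The rename handling in that loop reduces to a small lookup. In this sketch the
copy map plays the role of copies.pathcopies(basectx, fixctx) and the set
plays the role of membership in basectx; the names are illustrative:

  def resolvebasepath(basefiles, path, pathcopies):
      # pathcopies maps destination path -> source path; fall back to the
      # unrenamed path, and report None if the base lacks the file entirely.
      basepath = pathcopies.get(path, path)
      return basepath if basepath in basefiles else None

  # 'new.py' was renamed from 'old.py' since the base revision, so its line
  # ranges must be computed against the old name's content:
  assert resolvebasepath({'old.py'}, 'new.py', {'new.py': 'old.py'}) == 'old.py'
  assert resolvebasepath({'old.py'}, 'added.py', {}) is None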
@@ -641,10 +670,10 @@
     toprefetch = set()
 
     # Prefetch the files that will be fixed.
-    for rev, path in workqueue:
-        if rev == wdirrev:
+    for srcrev, path, _dstrevs in workqueue:
+        if srcrev == wdirrev:
             continue
-        toprefetch.add((rev, path))
+        toprefetch.add((srcrev, path))
 
     # Prefetch the base contents for lineranges().
     for (baserev, fixrev, path), basepath in basepaths.items():

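Elsewhere in the extension (not shown in this diff), the numitems map from
getworkqueue() lets the consumer detect when a changeset's last result has
arrived. A rough sketch of that countdown, with a hypothetical replacerev
callback and the ordering concerns of the real code elided:

  import collections

  def consume(results, numitems, replacerev):
      # results: iterable of (rev, path, metadata, data) tuples as yielded
      # by getfixes(); replacerev: callback committing one fixed changeset.
      pending = dict(numitems)
      filedata = collections.defaultdict(dict)
      for rev, path, _metadata, data in results:
          if data is not None:
              filedata[rev][path] = data
          pending[rev] -= 1
          if not pending[rev]:
              # The last work item for this changeset is done; it can now be
              # replaced (the real code also enforces topological order).
              replacerev(rev, filedata.pop(rev, {}))

  consume([(1, b'f.txt', {}, b'new')], {1: 1},
          lambda rev, data: print(rev, sorted(data)))
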


To: hooper, #hg-reviewers
Cc: Alphare, mercurial-patches