Heuristic rename detection (was: Suggestions from a newbie)
Matt Mackall
mpm at selenic.com
Wed Jun 1 06:06:51 UTC 2005
On Tue, May 31, 2005 at 10:34:09PM -0400, Kevin Smith wrote:
> Matt Mackall wrote:
> >On Tue, May 31, 2005 at 07:07:07PM -0400, Kevin Smith wrote:
> >
> >>5. Will it be possible for mercurial to have the same kind of heuristic
> >>rename/copy detection that's being built into git? It's too early to
> >>tell whether it will really work well, but it seems very cool.
> >
> >How does it work?
>
> I'm no expert, but here's my understanding:
>
> Let's take say we started with tree A and modified it to become tree B,
> which is represented by changeset C. For each new file in B, we run a
> diff against each file in A that was changed or removed. The result of
> the diff is a "similarity score", based on the size of the diff relative
> to the sizes of the two files. If the score exceeds a threshold (which
> might be passed in by the user), we declare that a rename or copy occurred.
Eep. That's nasty. I mean, it might work reasonably well but it's
O(N^2) in the number of diffs.
I used to do similar cross-correlation stuff in image processing and
came up with tricks for sparsing the comparison matrix, but those
don't seem to apply too well here.
> An extension that is not needed to detect renames, but helps for
> detecting copies would be to compare each new file in B against *every*
> file in A. Obviously that's much more expensive.
That's pretty outrageous when you have 17345 files. Making something
like that automatic is just asking for trouble.
--
Mathematics is the supreme nostalgia of our time.
More information about the Mercurial
mailing list