[PATCH] auto rename: best matches and speed improvement UPDATE4
Herbert Griebel
herbertg at gmx.at
Sat Aug 16 09:24:24 UTC 2008
Matt Mackall wrote:
> On Sat, 2008-08-16 at 02:16 +0200, Herbert Griebel wrote:
>> Matt Mackall wrote:
>>> Thanks for looking into this, Herbert.
>>>
>>> First off, this will want a test case.
>> ok, this will take me some time, since I have no Linux currently.
>>
>>> Second, I get the impression that the time complexity here is going from
>>> O(n) to O(n**2), is that right?
>> The complexity is the same, only all matches are now checked against each other.
>
> If I do 1000 renames, I end up doing 1000**2 comparisons, right? This
> was probably already the case with the original algorithm, but it's
> still a worry.
Yes, it will always be O(n**2); you have to compare each file with all the
other files. The only thing you can do is make the comparison of the
files more efficient. For example:
- if the file sizes differ by more than the similarity threshold
  allows, don't even read the files.
- take file pathnames into account:
  - a *.cpp file will never be matched to a *.bmp file
  - it is unlikely that a binary file will be matched to an ASCII file
- compare CRCs: maybe you get one from the repo for free; for the
  working-copy file you have to calculate it. There may be something
  like very efficient "CRC similarity" measures. This will need some
  searching on the web... one thing I did find: rsync on UNIX has a
  very efficient comparison algorithm using "rolling" checksums:
  http://www.itworld.com/unix-shuffle-file-systems-rsynch-nlsunix-080116?page=0%2C1
- finally, speed up the comparison itself, perhaps something in C. Maybe
  I can find some very efficient algorithms on the web, block based, etc.:
  http://portal.acm.org/citation.cfm?doid=359460.359467
- ordering the files might also help: small files first, large files
  last, since large files empty the disk cache.
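The first point in the list above can be sketched as a cheap pre-filter.
This is only an illustration, not Mercurial's code; it assumes a
similarity score normalized by the larger file size, under which two
files can score at most min(size) / max(size):

```python
def sizes_compatible(len_a, len_b, threshold):
    """Cheap pre-filter: assuming similarity is normalized by the
    larger size, two files can be at most min/max similar, so pairs
    whose sizes already rule out the threshold need not be read or
    diffed at all."""
    if len_a == 0 and len_b == 0:
        return True  # two empty files are trivially identical
    if len_a == 0 or len_b == 0:
        return threshold <= 0
    return min(len_a, len_b) / max(len_a, len_b) >= threshold

# a 100-byte vs 95-byte pair may still reach 75% similarity,
# but a 100-byte vs 40-byte pair cannot:
print(sizes_compatible(100, 95, 0.75))
print(sizes_compatible(100, 40, 0.75))
```

The check costs two stat() calls per pair instead of two full reads
plus a diff, which is where the win comes from on large trees.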
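The "rolling" checksum idea mentioned above can be sketched in a few
lines. This is a minimal illustration of rsync's weak checksum, not
rsync's or Mercurial's actual code: the checksum of every fixed-size
window is obtained in O(1) per slide instead of O(window):

```python
def weak_checksums(data, window):
    """rsync-style weak rolling checksum over all windows of `data`.
    s1 is the byte sum of the window, s2 a position-weighted sum;
    both are updated in O(1) when the window slides by one byte."""
    n = len(data)
    if n < window:
        return []
    s1 = sum(data[:window]) % 65536
    s2 = sum((window - i) * data[i] for i in range(window)) % 65536
    sums = [(s2 << 16) | s1]
    for i in range(n - window):
        out, inb = data[i], data[i + window]
        s1 = (s1 - out + inb) % 65536          # drop old byte, add new
        s2 = (s2 - window * out + s1) % 65536  # reweight in O(1)
        sums.append((s2 << 16) | s1)
    return sums
```

Block matching then only falls back to a strong hash when the cheap
weak checksum of a window collides with a known block.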
>> For this there is an extra loop at the end of findrenames, which is fast.
>> Speed has improved because the loop order is exchanged. There are also
>> no more multiple occurrences in the list.
>>
>> What is really lacking is the quality of the diff. I have files with almost no similarity
>> which get a score of almost 50%. I will take a look at that.
>
> I think this is simply taking the size of a delta as the measure of
> similarity. And that effectively doesn't count deletions. So if the
> delta says "delete whole file, replace with smaller one", the ratio of
> delta to original file might not be too bad. Of course, this should
> actually score 0%.
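Matt's point can be illustrated with a hedged sketch (not Mercurial's
actual scoring code): if similarity is measured as matched content over
the size of the *larger* file, deletions count against the score, and
replacing a file with a small unrelated one scores near 0% rather than
near 50%:

```python
import difflib

def similarity(a, b):
    """Fraction of the larger input covered by matching blocks.
    Normalizing by max(len) makes deletions count: a delta of
    'delete everything, insert something small' cannot score well."""
    if not a and not b:
        return 1.0
    matcher = difflib.SequenceMatcher(None, a, b)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(a), len(b))

# a large file replaced by a small unrelated one scores ~0%:
print(similarity(b"x" * 1000, b"completely different"))
```

A delta-size-based score normalized by only one of the two files would
miss exactly this case, which is the 50%-for-unrelated-files symptom
described above.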
>
I have problems using the -X option of addremove with multiple files,
or with globbing. Are there any examples? I cannot get it to work. How
do I exclude two files?
More information about the Mercurial-devel mailing list