[PATCH] auto rename: best matches and speed improvements UPDATE3 - Matts + Petr's findings added

Herbert Griebel herbertg at gmx.at
Thu Oct 2 23:40:55 UTC 2008


Petr Kodl wrote:
>     >
>     > Uh oh. Why did we grow more C code?
> 
>     Histogram and the Levenshtein distance can only be done
>     in C efficiently. Histograms are used to pre-compare
>     the files and get an upper-bound for the score. The
>     Levenshtein distance is needed for the name matching
>     when moving identical files (we discussed this a couple
>     of weeks ago).
> 
> 
> The levenshtein and reverse string search do not make that much
> difference - they only operate on short filename strings, but histogram
> related functions have to be in C - I tried to reimplement the C
> functions in Python just for the fun of it - the speed degrades too much
> - several X - for just 632 renames and mostly source code files.
> 
> attached is the haddremove.py where you can switch between C and native
> python version - just change the if 1: on top - I do not think Python
> can do it much more efficiently than this
> 
> pk
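
For readers following the thread, here is a minimal sketch of the kind of
Levenshtein distance discussed above for comparing short filename strings.
It is the textbook dynamic-programming version written for current Python,
not the code from the patch or from the attached haddremove.py; the function
name `levenshtein` is an assumption chosen for illustration.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b.

    Insertions, deletions and substitutions each cost 1.
    Hypothetical helper for illustration, not the patch's implementation.
    """
    # Keep the shorter string in the inner loop so the row stays small.
    if len(a) < len(b):
        a, b = b, a
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]
```

The cost is proportional to the product of the two string lengths, which
stays small when the inputs are filenames.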

I have also already implemented it in Python, and got a 100% speed loss
(= it takes twice the time) due to the histogram. String similarity is
not so bad in the test repos I use.
I also think the complexity of the C code is low: it is just byte
operations on strings, which would not look any different in Python
except for the language interface. But this is Matt's decision, and I
am fine with that.
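
To illustrate the histogram idea mentioned in the quoted text (pre-compare
files to get an upper bound for the score), here is a rough Python sketch.
It assumes, purely for illustration, a similarity score of the form
2 * matched_bytes / (len(a) + len(b)); the patch's actual score and function
names are not shown in this message, so `score_upper_bound` is hypothetical.

```python
from collections import Counter


def score_upper_bound(a, b):
    """Upper bound on a similarity score for two byte strings.

    Assumption: score = 2 * matched_bytes / (len(a) + len(b)).
    No alignment can match more occurrences of a byte value than the
    smaller of its two counts, so the histogram overlap bounds the score.
    """
    if not a and not b:
        return 1.0  # two empty files are treated as identical
    ha, hb = Counter(a), Counter(b)  # per-byte-value histograms
    # Counter intersection keeps min(count_a, count_b) for each byte value.
    common = sum((ha & hb).values())
    return 2.0 * common / (len(a) + len(b))
```

A pair of files whose bound is already below the required similarity
threshold can be skipped without running the expensive comparison. Building
the histogram touches every byte of every file, which is the kind of loop
that is slow in pure Python.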


