[PATCH] auto rename: best matches and speed improvements UPDATE3 - Matts + Petr's findings added
Herbert Griebel
herbertg at gmx.at
Thu Oct 2 23:40:55 UTC 2008
Petr Kodl wrote:
> >
> > Uh oh. Why did we grow more C code?
>
> The histogram and the Levenshtein distance can only be
> computed efficiently in C. Histograms are used to pre-compare
> the files and get an upper bound for the score. The
> Levenshtein distance is needed for the name matching
> when moving identical files (we discussed this a couple
> of weeks ago).
>
>
> The Levenshtein and reverse string search do not make that much
> difference, since they only operate on short filename strings, but the
> histogram-related functions have to be in C. I tried to reimplement the C
> functions in Python just for the fun of it, and the speed degrades too
> much: several times slower, for just 632 renames of mostly source code files.
>
> Attached is haddremove.py, where you can switch between the C and the
> native Python version by changing the "if 1:" at the top. I do not think
> Python can do it much more efficiently than this.
>
> pk
I have also already implemented it in Python, and it took twice as
long (a 100% slowdown) because of the histogram. String similarity is
not so bad in the test repos I use.
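For readers who have not seen the patch, here is a minimal Python sketch of the kind of name matching meant by "string similarity". It is not the code from the patch: the names levenshtein and closest_name are made up for this illustration, and the patch may compare names differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def closest_name(removed: str, added: list[str]) -> str:
    """Hypothetical helper: pick the added name closest to a removed name."""
    return min(added, key=lambda name: levenshtein(removed, name))
```

Filenames are short, so the quadratic cost of the two nested loops stays small, which fits the observation that this part matters little for the total run time.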
I also think the complexity of the C code is low: it is just byte
operations on strings, which would not look any different in Python
apart from the language interface. But this is Matt's decision, and I
am fine with that.
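To make the histogram pre-comparison concrete, here is a rough Python sketch. It rests on assumptions that are not taken from the patch: the score is assumed to be 2 * matching bytes / (len(a) + len(b)), and the names byte_histogram and score_upper_bound are invented for this example. Any alignment of two files pairs bytes of equal value one-to-one, so for each byte value it can match at most the smaller of the two counts. Summing those minima gives a cheap upper bound on the score.

```python
from collections import Counter


def byte_histogram(data: bytes) -> Counter:
    """Count how often each byte value (0..255) occurs in data."""
    return Counter(data)


def score_upper_bound(a: bytes, b: bytes) -> float:
    """Upper bound for the assumed similarity score of a and b.

    Assumption: score = 2 * matching_bytes / (len(a) + len(b)).
    The number of matching bytes cannot exceed the histogram overlap.
    """
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # two empty files are treated as identical
    ha, hb = byte_histogram(a), byte_histogram(b)
    # A missing key in a Counter reads as 0, so unmatched bytes add nothing.
    overlap = sum(min(count, hb[byte]) for byte, count in ha.items())
    return 2.0 * overlap / total
```

A candidate pair whose bound is already below the best score found so far can be skipped without running the expensive comparison. The bound ignores byte order, so b"ab" and b"ba" get a bound of 1.0 even though their real score may be lower.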