speed up relink script

Brendan Cully brendan at kublai.com
Mon Mar 19 21:23:56 UTC 2007


If you want to speed it up, you might try searching from the back to
the front (differences should show up faster that way), or perhaps
forking off md5sum for the candidate lists and comparing by that
(possibly hand-checking matches for md5 collisions). I can't convince
myself that it's safe to assume that a match in the last chunk is
sufficient.
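
A rough sketch of both ideas, for illustration only. It is written in Python 3 rather than the Python 2 of the script and has not been tried against hg-relink. The names same_from_back and md5_of are made up here, and hashlib stands in for forking off md5sum. The backwards comparison still checks every chunk; it only changes the order, so a difference near the end of the file is found sooner.

```python
import hashlib
import os

CHUNKLEN = 65536  # the 64K chunk size mentioned in the thread


def same_from_back(path_a, path_b, chunklen=CHUNKLEN):
    """Compare two files chunk by chunk, starting from the end.

    Hypothetical helper: every chunk is compared, so a match in the
    last chunk alone is never treated as sufficient.
    """
    size = os.path.getsize(path_a)
    if size != os.path.getsize(path_b):
        return False
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        end = size
        while end > 0:
            start = max(0, end - chunklen)
            fa.seek(start)
            fb.seek(start)
            if fa.read(end - start) != fb.read(end - start):
                return False  # first difference found, stop early
            end = start
    return True


def md5_of(path, chunklen=CHUNKLEN):
    """Return the MD5 hex digest of a file, read in chunks.

    Hypothetical helper: uses hashlib in-process instead of md5sum.
    """
    digest = hashlib.md5()
    with open(path, 'rb') as fp:
        for chunk in iter(lambda: fp.read(chunklen), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

If the digests of two candidates match, same_from_back (or any full byte comparison) could still be run afterwards as the check against collisions.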

You probably don't need to pass around file sizes - if the source and
destination don't have the same size, you won't be comparing them
anyway.
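
As a sketch of what that could look like (Python 3, with a hypothetical prune_by_size whose signature is not the one in contrib/hg-relink), the size check filters the candidates and only the names are handed on:

```python
import os


def prune_by_size(candidates, src, dst):
    """Keep only the names whose source and destination sizes match.

    Hypothetical helper: 'candidates' is assumed to be a list of
    paths relative to both src and dst.
    """
    targets = []
    for fn in candidates:
        st = os.stat(os.path.join(src, fn))
        try:
            ts = os.stat(os.path.join(dst, fn))
        except OSError:
            continue  # no counterpart in the destination
        if st.st_size != ts.st_size:
            continue  # different sizes, never compared
        targets.append(fn)  # the name alone is passed along
    return targets
```

If the byte count is still wanted for the final report, it could be read again with os.path.getsize at relink time.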

On Monday, 19 March 2007 at 16:09, TK Soh wrote:
> On 3/19/07, TK Soh <teekaysoh at gmail.com> wrote:
> >On 3/19/07, Alexis S. L. Carvalho <alexis at cecm.usp.br> wrote:
> >> Maybe I'm missing something obvious, but doesn't this seek beyond EOF,
> >> making sfp.read(CHUNKLEN) return an empty string, which means the loop
> >> doesn't get executed and you unconditionally relink stuff, possibly
> >> losing data?

> >I think you are right. But somehow it didn't seem to cause any obvious
> >error. Hm, I will look into it. Thanks for the input.

> Looks like file.seek() allows seeking beyond EOF, though in any case
> my last patch was totally broken. So, how about the patch below?

> BTW, I wonder what the odds are for two corresponding *.[id] files to
> have the same size, but contain different data, with CHUNKLEN of 64K.
> A smaller value of CHUNKLEN would improve the performance, obviously.

> +++ b/contrib/hg-relink Mon Mar 19 16:08:39 2007 -0500
> @@ -58,7 +58,7 @@ def prune(candidates, dst):
>             raise Exception('Source and destination are on different devices')
>         if st.st_size != ts.st_size:
>             continue
> -        targets.append((fn, ts.st_size))
> +        targets.append((fn, st, ts))

>     return targets

> @@ -67,11 +67,13 @@ def relink(src, dst, files):
>     relinked = 0
>     savedbytes = 0

> -    for f, sz in files:
> +    for f, st, ts in files:
>         source = os.path.join(src, f)
>         tgt = os.path.join(dst, f)
>         sfp = file(source)
>         dfp = file(tgt)
> +        sfp.seek(-min(st.st_size, CHUNKLEN), 2)
> +        dfp.seek(-min(ts.st_size, CHUNKLEN), 2)
>         sin = sfp.read(CHUNKLEN)
>         while sin:
>             din = dfp.read(CHUNKLEN)
> @@ -89,7 +91,7 @@ def relink(src, dst, files):
>                 raise
>             print 'Relinked %s' % f
>             relinked += 1
> -            savedbytes += sz
> +            savedbytes += ts.st_size
>             os.remove(tgt + '.bak')
>         except OSError, inst:
>             print '%s: %s' % (tgt, str(inst))