Slow push of large file over HTTP

Michael Tjørnemark mtj at pfa.dk
Mon Apr 30 09:32:53 UTC 2012


>>Michael Tjørnemark <mtj at pfa.dk> writes:
>>
>>> I have a repository with a single changeset which adds a single 60 MB 
>>> file (zip-file). Pushing this repo over HTTP is much, much slower 
>>> than other commands on the repository, including a similar pull - is 
>>> this to be expected? I have recreated the problem on other machines 
>>> and files as well, so it seems to be a general problem with pushing a
>>> large(ish) file.
>>>
>>I'm unsure why it then stops for 4 minutes -- I would not expect that.
>>
>>Versioning zip files is... unusual :) Every revision of the zip file will take up a lot of new space since it can't be delta compressed much against the previous version. So after 10 edits to the file, you could end up with a repo with maybe 400 MB of history for that single file.
>>
>
>Yes, this is not really what I am trying to do, it was just a simple way to
>recreate the core problem with pushing large files over HTTP. The real use case
>that started my investigation was initializing a new repository with around
>6000 files, 150 MB total and of varying size - the largest 33 MB, and about
>15 of them > 1 MB - which took more than 15 minutes to push (on a slow machine).
>
>I have now recreated the problem with 5 XML files of 16 MB each (a more
>reasonable example than a single zip), and pulling over HTTP takes 8 secs
>while pushing takes around 1 minute - so, as in the original example, almost
>a 1:10 ratio of pull to push times. A test of a repository with 4500 small
>files (7.5 MB total) takes around 9 secs for both pull and push, so it seems
>to be a problem with large files.
>
>We can live with the current performance since it only has to be done once
>and we are used to slow ClearCase performance, so I don't need a definitive
>answer. I just think that something is wrong when push is that much slower
>than pull - I would expect the times to be comparable, so 10 times slower
>just seems weird. It should be easy to recreate the problem anywhere (I have
>only tried it on Windows though) by serving an empty repository, pushing
>a single changeset with one or more large files added, and comparing that
>time with a pull from the same repository.

I've actually found the problem now. I couldn't let it go because I really felt that
something was wrong, so I've spent most of the weekend reading and debugging
the Mercurial source code. I've never done any Python before in my life, but the
"hackable mercurial" package was a great way to get started.

Anyway, the problem seems to be that the server side saves the incoming bundle to
a temporary file, and reading and decompressing that file happens in a lot of very,
very small chunks: the file is iterated with "for chunk in f:", which splits the
compressed data on newline bytes, so most chunks are only a few hundred bytes. The
zlib decompressor works incrementally, decompressing as much as possible of each
chunk and saving the rest for the next call, so the same data may be handled many
times when the chunks are small. This also explains very well why the CPU was the
bottleneck, since decompression is CPU-intensive.
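To convince myself that the chunk size really was the culprit, I wrote a quick
standalone test (this is not Mercurial code - the data and sizes are just made up)
that counts how many times zlib's decompress() gets called when the compressed
stream is iterated line by line, as "for chunk in f:" does on a binary file,
versus read in 64 KB blocks:

import io
import os
import zlib

# stand-in for a compressed bundle payload - incompressible random data
compressed = zlib.compress(os.urandom(4 * 1024 * 1024))

def count_decompress_calls(chunks):
    # feed each chunk to an incremental decompressor and count the calls
    zd = zlib.decompressobj()
    calls = 0
    for chunk in chunks:
        zd.decompress(chunk)
        calls += 1
    return calls

def block_reads(f, size=64 * 1024):
    # read fixed-size blocks instead of "lines"
    while True:
        chunk = f.read(size)
        if not chunk:
            break
        yield chunk

# iterating a binary file object splits on newline bytes -> tiny chunks
print("line-sized chunks: %d" % count_decompress_calls(io.BytesIO(compressed)))
print("64 KB chunks: %d" % count_decompress_calls(block_reads(io.BytesIO(compressed))))

The line-by-line version ends up calling decompress() thousands of times on tiny
inputs, while the block version only needs a few dozen calls for the same data.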

I have created a one-line fix that decompresses larger chunks, and pushing is now
as fast as I hoped it would be. What do I do from here - how do I submit the change
for inclusion in the actual code? And can anybody tell if my change might have any
unwanted side effects or break anything, since I don't really know what I am doing?

# HG changeset patch
# User Michael Tjørnemark <michael at tjornemark.dk>
# Date 1335725930 -7200
# Branch stable
# Node ID 9b229327b6210802598643a426e7644de3148ab8
# Parent  be786c5ac0a852cab965d9e541611f882bdb0bb8
changegroup: decompress GZ algorithm in larger chunks for better performance

diff -r be786c5ac0a8 -r 9b229327b621 mercurial/changegroup.py
--- a/mercurial/changegroup.py	Sat Apr 28 16:38:07 2012 -0500
+++ b/mercurial/changegroup.py	Sun Apr 29 20:58:50 2012 +0200
@@ -118,7 +118,7 @@
     elif alg == 'GZ':
         def generator(f):
             zd = zlib.decompressobj()
-            for chunk in f:
+            for chunk in util.filechunkiter(f):
                 yield zd.decompress(chunk)
     elif alg == 'BZ':
         def generator(f):

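For what it's worth, as far as I can tell util.filechunkiter just reads the file in
fixed-size blocks - roughly like the sketch below (the default block size is a guess
on my part, so check mercurial/util.py for the real thing):

# rough sketch of mercurial.util.filechunkiter - the real function also
# supports an optional read limit, and the default size here is a guess
def filechunkiter(f, size=65536):
    while True:
        chunk = f.read(size)
        if not chunk:
            break
        yield chunk
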
Michael Tjørnemark

