hg and binary repository
Matt Mackall
mpm at selenic.com
Fri Nov 30 21:45:35 UTC 2007
On Fri, Nov 30, 2007 at 06:00:24PM -0300, Martin Marques wrote:
> Matt Mackall wrote:
> >On Fri, Nov 30, 2007 at 08:06:08AM -0300, Martin Marques wrote:
> >>I'm curious about one thing: How scalable is hg in tracking a binary
> >>repository, for example a repository with word documents?
> >
> >Delta compression on such files isn't optimal, but it's decent.
> >
> >But provided you can comfortably fit the document in memory, checking
> >in the 1000th revision should take about the same amount of time as
> >the 1st or 2nd. And for files of some small numbers of megabytes, that
> >should be seconds.
>
> I'm more interested in how .hg/ behaves as commits go by. Does it grow
> too fast, and at what rate (linear, polynomial, exponential, ...)?
We attempt both delta compression and zip compression on each
revision. So total size should be no greater than the sum of zipped
versions of each revision, but usually -much- smaller.
Here's a quick test:
I happen to have a Word doc some lawyer sent me. 9 pages of legalese
and tables and whatnot. Hack it up in Open Office a bit and save two
versions. Let's check in those two versions back to back 500 times to
simulate 1000 deltas.
$ ls -ln
total 456
-rw------- 1 1000 1000 150016 Nov 30 15:31 stip.doc
-rw-r--r-- 1 1000 1000 149504 Nov 30 15:26 stip2.doc
-rw-r--r-- 1 1000 1000 150016 Nov 30 15:28 stip3.doc
$ for i in `seq 500`; do cp stip2.doc stip.doc; hg ci -m "a $i"; \
    cp stip3.doc stip.doc; hg ci -m "b $i"; echo $i; done
1
2
3
...
500
Elapsed time: 2m17s, or 7.3 revisions per second. If I checked in
another 1000 revisions, the time would stay about the same. Other
systems would be going very slowly at this point.
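For anyone who wants to redo the arithmetic behind that rate, here is a minimal sketch. It assumes the loop produced exactly 1000 commits and that 2m17s is the wall-clock time for the whole loop; the variable names are just for illustration.

```shell
# Throughput of the test loop: 500 iterations, two commits each.
revisions=1000
# 2m17s of elapsed time, expressed in seconds.
elapsed=$(( 2 * 60 + 17 ))
# Shell arithmetic is integer-only, so let awk do the division.
rate=$(awk -v n="$revisions" -v s="$elapsed" 'BEGIN { printf "%.1f", n / s }')
echo "$elapsed seconds, $rate revisions per second"
```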
$ du -sh .hg
15M .hg
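As a rough cross-check of that figure against the raw input, the sketch below totals the bytes that were checked in and divides by the store size. It assumes the "15M" from du means roughly 15 MiB (du rounds, so the true size could be somewhat smaller) and that each of the two files was committed 500 times.

```shell
# Sizes of the two alternating versions, taken from the ls listing.
size_a=149504   # stip2.doc
size_b=150016   # stip3.doc
commits_each=500
# Total uncompressed bytes across all 1000 checked-in revisions.
raw_bytes=$(( commits_each * (size_a + size_b) ))
# Assumption: du's rounded "15M" is treated as exactly 15 MiB.
store_bytes=$(( 15 * 1024 * 1024 ))
ratio=$(awk -v r="$raw_bytes" -v s="$store_bytes" 'BEGIN { printf "%.1f", r / s }')
echo "raw: $raw_bytes bytes, store: $store_bytes bytes, ratio about $ratio:1"
```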
That looks like a 10:1 compression ratio. Let's take a peek at the internals:
$ hg debugindex .hg/store/data/stip.doc.i
   rev    offset  length   base linkrev nodeid       p1           p2
     0         0   18029      0       0 fe9f539226b1 000000000000 000000000000
     1     18029   14277      0       1 b12ed9c3b669 fe9f539226b1 000000000000
     2     32306   14251      0       2 f5f789e91f9d b12ed9c3b669 000000000000
     3     46557   14277      0       3 f3b80878411a f5f789e91f9d 000000000000
     4     60834   14251      0       4 2c110d5eae72 f3b80878411a 000000000000
     5     75085   14277      0       5 292b77fb9f00 2c110d5eae72 000000000000
     6     89362   14251      0       6 e53b36e1d311 292b77fb9f00 000000000000
     7    103613   14277      0       7 2e17f8e95e46 e53b36e1d311 000000000000
     8    117890   14251      0       8 d95bf4afba18 2e17f8e95e46 000000000000
     9    132141   14277      0       9 611142c9516c d95bf4afba18 000000000000
The first revision compressed down to 18k, with each delta compressing
down to 14k. This suggests the internals of the doc file are fairly
unstable, so we're not getting much advantage out of our deltas here.
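To tie this back to the growth question: if the pattern in the first ten index entries holds for the whole run (an assumption, since only revisions 0 through 9 are shown), the data for this file grows by a roughly constant amount per commit, i.e. linearly. A quick projection under that assumption:

```shell
# Compressed lengths from the debugindex output.
first=18029      # revision 0, stored in full
odd_delta=14277  # revisions 1, 3, 5, ...
even_delta=14251 # revisions 2, 4, 6, ...
revs=1000
# Revisions 1..999 hold 500 odd-numbered and 499 even-numbered deltas.
total=$(( first + (revs / 2) * odd_delta + (revs / 2 - 1) * even_delta ))
mib=$(awk -v t="$total" 'BEGIN { printf "%.1f", t / (1024 * 1024) }')
echo "projected file data: $total bytes (about $mib MiB)"
```

That lands a little under the 15M reported by du, which also counts the changelog, the manifest and index overhead.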
--
Mathematics is the supreme nostalgia of our time.