Are revlog diffs ALWAYS calculated as "text"?

Jesus Cea jcea at jcea.es
Mon May 12 22:19:54 UTC 2014


I just started using "hg-zipdoc" to help me manage a Mercurial
repository of OpenDocument files.

Those files are ZIP files, and ZIPDOC should do a great job storing only
the differences between the documents, not the (huge) differences
between the ZIP files.

But I am seeing little benefit from it.

From some experiments, it looks like the DELTA algorithm in Mercurial 3.0
(I don't know about previous versions) calculates the DIFF by doing
something like this:

1. Take both files and split them into LINEFEED-delimited lines.

2. Do the DIFF of those lines.

This is OK for text-based files with line-oriented content, but doesn't
work very well when files are binary, even when those files are quite
structured and changes are small.
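
The two steps above can be sketched in a few lines of Python. This is only
a model of the behaviour described here, not Mercurial's actual code: the
helper names (split_lf, line_delta_size) are hypothetical, lines are assumed
to be delimited by "\n" only, and difflib stands in for whatever matching
algorithm Mercurial really uses.

```python
import difflib


def split_lf(data: bytes) -> list:
    """Split on LINEFEED only, keeping the delimiter on each line."""
    parts = data.split(b"\n")
    lines = [p + b"\n" for p in parts[:-1]]
    if parts[-1]:
        lines.append(parts[-1])  # last line without a trailing LINEFEED
    return lines


def line_delta_size(old: bytes, new: bytes) -> int:
    """Bytes of new data a line-based delta would have to store.

    Assumption: whole lines are the smallest unit that can be replaced.
    """
    old_lines, new_lines = split_lf(old), split_lf(new)
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    size = 0
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            size += sum(len(line) for line in new_lines[j1:j2])
    return size


if __name__ == "__main__":
    # "Binary" data with no LINEFEED at all: the whole file is one line.
    blob = bytes(range(1, 10)) * 1000
    print(line_delta_size(blob, blob + b"X"))

    # Line-oriented data: only the new last line has to be stored.
    text = b"".join(b"line %d\n" % i for i in range(1000))
    print(line_delta_size(text, text + b"X"))
```

In this model, appending one byte to a file without LINEFEEDs costs the
whole file again, while the same change to line-oriented text costs one byte.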

To reproduce:

1. Create a Mercurial repository:

    hg init test; cd test

2. Create a 200 KB random file with no LINEFEEDs in it:

    LC_ALL=C tr -d '\n' < /dev/urandom | head -c 204800 > z

3. Add the file to the repository:

    hg add z
    hg commit -m "test"

4. Verify revlog file size:

    ls -la .hg/store/data/z.d

5. Add A SINGLE character to the random file (no trailing LINEFEED):

    printf 'X' >> z
    hg commit -m "test 2"

6. Verify the new revlog file size:

    ls -la .hg/store/data/z.d

7. Repeat steps 5-6.

The revlog grows by about 200 KB on every commit, even though each commit
only adds a single character at the end of the file.

I guess this line-based diff is a speed optimization, and a natural fit
because "hg diff" shows line diffs, but for binary files it is a missed
opportunity.

I wonder if my analysis is correct, and whether the Mercurial team would
be open to a different DIFF algorithm when the file is identified as
binary. Compatible at the revlog level, of course.
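
As a rough illustration of what a byte-oriented alternative could look like,
here is a minimal Python sketch that trims the common prefix and suffix at
byte level and keeps only the middle. It is not a proposal for the actual
implementation: byte_delta and apply_delta are hypothetical names, and the
(start, end, replacement) hunk shape is an assumption about what a
revlog-compatible delta would need.

```python
def byte_delta(old: bytes, new: bytes):
    """Return (start, end, replacement): replace old[start:end] by replacement.

    Only the common prefix and suffix are trimmed, at byte granularity.
    """
    limit = min(len(old), len(new))
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    suffix = 0
    # The suffix must not overlap the prefix already matched.
    while suffix < limit - prefix and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1
    return prefix, len(old) - suffix, new[prefix:len(new) - suffix]


def apply_delta(old: bytes, delta) -> bytes:
    """Rebuild the new content from the old content and one hunk."""
    start, end, replacement = delta
    return old[:start] + replacement + old[end:]


if __name__ == "__main__":
    blob = bytes(range(1, 10)) * 1000  # no LINEFEED anywhere
    delta = byte_delta(blob, blob + b"X")
    print(len(delta[2]))  # bytes of new data carried by the delta
    assert apply_delta(blob, delta) == blob + b"X"
```

For the append-one-character experiment above, this delta carries a single
byte instead of the whole file. A real binary diff would also need to handle
several scattered changes, which this sketch does not attempt.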

I also wonder what Mac OS users experience, since the end-of-line there
is "\r", not "\n". Or did they migrate to "\n" in Mac OS X?
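
The end-of-line concern can be checked with a small Python sketch. It
assumes, as described earlier in this message, that lines are split on "\n"
only; count_lf_lines is a hypothetical helper, not a Mercurial function.

```python
def count_lf_lines(data: bytes) -> int:
    """Number of lines seen by a splitter that only knows LINEFEED."""
    if not data:
        return 0
    count = len(data.split(b"\n"))
    if data.endswith(b"\n"):
        count -= 1  # no empty line after the final LINEFEED
    return count


if __name__ == "__main__":
    records = [b"one", b"two", b"three"]
    unix_text = b"\n".join(records) + b"\n"
    dos_text = b"\r\n".join(records) + b"\r\n"
    old_mac_text = b"\r".join(records) + b"\r"
    for name, data in (("\\n", unix_text), ("\\r\\n", dos_text), ("\\r", old_mac_text)):
        print(name, count_lf_lines(data))
```

Under that assumption, a text file that uses only "\r" as its end-of-line
looks like one single long line, just like the binary file in the experiment.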

-- 
Jesús Cea Avión                         _/_/      _/_/_/        _/_/_/
jcea at jcea.es - http://www.jcea.es/     _/_/    _/_/  _/_/    _/_/  _/_/
Twitter: @jcea                        _/_/    _/_/          _/_/_/_/_/
jabber / xmpp:jcea at jabber.org  _/_/  _/_/    _/_/          _/_/  _/_/
"Things are not so easy"      _/_/  _/_/    _/_/  _/_/    _/_/  _/_/
"My name is Dump, Core Dump"   _/_/_/        _/_/_/      _/_/  _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz
