Are revlog diffs ALWAYS calculated as "text"?
Jesus Cea
jcea at jcea.es
Mon May 12 22:19:54 UTC 2014
I just started using "hg-zipdoc" to help me manage a Mercurial
repository of OpenDocument files.
Those files are ZIP files, and ZIPDOC should do a great job storing only
the differences between the documents, not the (huge) differences
between the ZIP files.
But I am seeing little benefit from it.
After some experiments, it looks like the DELTA algorithm in Mercurial 3.0
(I don't know about previous versions) calculates the DIFF by doing
something like this:
1. Take both files and split them into LINEFEED-delimited lines.
2. Do the DIFF of those lines.
This is OK for text-based files with line-oriented content, but doesn't
work very well when files are binary, even when those files are quite
structured and changes are small.
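A minimal Python sketch of the line-based behaviour I am guessing at. This is not Mercurial's code: `split_lines` and `line_delta_size` are hypothetical helpers, and `difflib` stands in for whatever matcher Mercurial really uses. It only shows why a file with no LINEFEEDs behaves as one huge line.

```python
import difflib
import os


def split_lines(data):
    """Split on LINEFEED only, keeping the delimiter on each line."""
    parts = data.split(b"\n")
    lines = [part + b"\n" for part in parts[:-1]]
    if parts[-1]:
        lines.append(parts[-1])  # last line without a trailing LINEFEED
    return lines


def line_delta_size(old, new):
    """Bytes of new content a line-based delta would have to store.

    Assumption: a changed line is stored whole, never as a partial line.
    """
    old_lines = split_lines(old)
    new_lines = split_lines(new)
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    size = 0
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            size += sum(len(line) for line in new_lines[j1:j2])
    return size


# A 200 KB random blob with no LINEFEEDs is a single "line".
binary = os.urandom(200 * 1024).replace(b"\n", b"")
print(line_delta_size(binary, binary + b"X"))  # the whole blob plus one byte

# The same one-byte append to line-oriented text costs one byte.
text = b"".join(b"line %d\n" % i for i in range(2000))
print(line_delta_size(text, text + b"X"))  # prints 1
```

Under these assumptions the cost of a change is the size of every line it touches, so a binary file with few or no LINEFEEDs pays close to its full size for each edit.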
To reproduce:
1. Create a Mercurial repository:
   hg init test; cd test
2. Create a 200 KB random file with no LINEFEEDs in it:
   LC_ALL=C tr -d '\n' < /dev/urandom | head -c 204800 > z
3. Add the file to the repository:
   hg add z
   hg commit -m "test"
4. Check the revlog file size:
   ls -la .hg/store/data/z.d
5. Add A SINGLE character to the random file:
   printf 'X' >> z
   hg commit -m "test 2"
6. Check the new revlog file size:
   ls -la .hg/store/data/z.d
7. Repeat steps 5 and 6.
The revlog grows by about 200 KB on each commit, even when you simply add a
single character at the end of the file.
I guess this line diff is a speed optimization, and is there because
"hg diff" shows line diffs, but when the file is binary it carries an
opportunity cost.
I wonder whether my analysis is correct, and whether the Mercurial team is
open to a different DIFF algorithm when the file is identified as binary.
Compatible at the revlog level, of course.
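To make the request concrete, here is a rough Python sketch of one possible byte-oriented delta. It just trims the common prefix and suffix of the two versions. `byte_delta` and `apply_delta` are hypothetical names, and this is an illustration of the idea, not a proposal for the actual algorithm or anything Mercurial implements.

```python
import os


def byte_delta(old, new):
    """Return (start, end, replacement): old[start:end] becomes replacement."""
    limit = min(len(old), len(new))
    # Length of the common prefix.
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    # Length of the common suffix, never overlapping the prefix.
    suffix = 0
    while suffix < limit - prefix and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1
    return prefix, len(old) - suffix, new[prefix:len(new) - suffix]


def apply_delta(old, delta):
    """Rebuild the new version from the old one plus the delta."""
    start, end, replacement = delta
    return old[:start] + replacement + old[end:]


# Same experiment as in the steps above: 200 KB, no LINEFEEDs, append one byte.
blob = os.urandom(200 * 1024).replace(b"\n", b"")
delta = byte_delta(blob, blob + b"X")
print(len(delta[2]))                           # prints 1
print(apply_delta(blob, delta) == blob + b"X")  # prints True
```

A single byte range only helps when the change is localized, as in the append test. Scattered edits would need several ranges, and I am assuming the revlog format could carry them.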
I wonder what Mac OS users experience, since the end-of-line there is "\r",
not "\n". Or did they migrate to "\n" in Mac OS X?
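A tiny Python illustration of that worry, assuming lines are delimited by "\n" only, as guessed earlier. The sample text is made up.

```python
# Text using "\r" as the only end-of-line marker (sample data, made up).
cr_text = b"first line\rsecond line\rthird line\r"
lf_text = cr_text.replace(b"\r", b"\n")

# Count the LINEFEED delimiters a "\n"-only splitter would find.
print(cr_text.count(b"\n"))  # prints 0: the whole file is one "line"
print(lf_text.count(b"\n"))  # prints 3
```

If the guess holds, a "\r"-delimited text file would be handled like the binary file in the experiment.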
--
Jesús Cea Avión
jcea at jcea.es - http://www.jcea.es/
Twitter: @jcea
jabber / xmpp:jcea at jabber.org
"Things are not so easy"
"My name is Dump, Core Dump"
"Love is placing your happiness in the happiness of another" - Leibniz