Are revlog diff calculated as "text" ALWAYS?
Kastner Masilko, Friedrich
kastner-masilko at at.festo.com
Tue May 13 08:22:30 UTC 2014
> From: mercurial-bounces at selenic.com [mailto:mercurial-bounces at selenic.com] On Behalf Of Jesus Cea
>
> Doing some experiments, looks like the DELTA algorithm in Mercurial 3.0
> (I don't know about previous versions) are calculating DIFF doing
> something like this:
>
> 1. Take both files and split them as LINEFEED delimited lines.
>
> 2. Do the DIFF of those lines.
>
> This is OK for text-based files with line-oriented content, but doesn't
> work very well when files are binary, even when those files are quite
> structured and changes are small.
>
> To reproduce:
>
> 1. Create a mercurial repository
>
> hg init test; cd test
>
> 2. Create a 200Kb random file, with no "LINEFEED"s on it:
>
> cat /dev/urandom | tr -d '\n' | dd of=z bs=1024 count=200
>
> 3. Add the file to the repository:
>
> hg add z
> hg commit -m "test"
>
> 4. Verify revlog file size:
>
> ls -la .hg/store/data/z.d
>
> 5. Add A SINGLE character to the random file:
>
> echo "X" >> z
> hg commit -m "test 2"
>
> 6. Verify the new revlog file size:
>
> ls -la .hg/store/data/z.d
>
> 7. Repeat steps 5-6.
>
> Revlog will add 200Kbytes in each commit even when you simply add a
> single character at the end of the file.
I just tried that with 2.7.1, and indeed the first single-character append triggers a second full snapshot. Repeated appends, however, do not do that for me:
$ hg vers
Mercurial Distributed SCM (version 2.7.1)
(see http://mercurial.selenic.com for more information)
Copyright (C) 2005-2013 Matt Mackall and others
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ hg init testdiff
$ cd testdiff
$ cat /dev/urandom | tr -d '\n' | dd of=z bs=1024 count=200
200+0 records in
200+0 records out
204800 bytes (205 kB) copied, 0.0329855 s, 6.2 MB/s
$ ls
z
$ ls -la
total 216
drwxr-xr-x 3 4096 2014-05-13 09:59 .
drwxrwxrwx 14 4096 2014-05-13 09:58 ..
drwxr-xr-x 3 4096 2014-05-13 09:58 .hg
-rw-r--r-- 1 204800 2014-05-13 09:59 z
$ hg add z
$ hg commit -m "test"
$ ls -la .hg/store/data/z.d
-rw-r--r-- 1 204801 2014-05-13 09:59 .hg/store/data/z.d
$ echo "X" >> z
$ hg commit -m "test 2"
$ ls -la .hg/store/data/z.d
-rw-r--r-- 1 409604 2014-05-13 10:00 .hg/store/data/z.d
$ echo "X" >> z
$ hg commit -m "test 3"
$ ls -la .hg/store/data/z.d
-rw-r--r-- 1 409618 2014-05-13 10:00 .hg/store/data/z.d
$ echo "X" >> z
$ hg commit -m "test 4"
$ ls -la .hg/store/data/z.d
-rw-r--r-- 1 409632 2014-05-13 10:01 .hg/store/data/z.d
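To make the growth per commit explicit, the sizes reported by `ls` above can be differenced. This is only a minimal sketch; the function name `revlog_growth` is made up for illustration and is not part of Mercurial:

```shell
# Print the growth in bytes between consecutive revlog sizes.
# revlog_growth is a hypothetical helper, not a Mercurial command.
revlog_growth() {
    prev=$1
    shift
    for size in "$@"; do
        echo $((size - prev))
        prev=$size
    done
}

# Sizes of .hg/store/data/z.d after each commit in the transcript above
revlog_growth 204801 409604 409618 409632
```

This prints 204803, 14 and 14: the first append costs roughly a full copy of the file, each later append only 14 bytes.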
Anything else would have stunned me, given that Mercurial's diff is a binary diff and not a text one. I also doubt that my many binary-file commits would compress that well without a decent binary diff in the background.
The initial double snapshot is interesting, though. I guess it has less to do with the diffing than with the revlog format, which has rules for when to store a full snapshot and when to store a delta. I don't want to go into details, but one thing seems clear from my test: 2.7.1 is not using a text-based diff for revlog creation.
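As a rough illustration of what such a snapshot-or-delta rule can look like, here is a sketch under an assumption that this thread does not confirm: that a full snapshot is stored once the bytes needed to rebuild a revision from its chain base exceed twice the size of the full text. The helper name `choose_storage` and the example numbers are hypothetical:

```shell
# Hypothetical helper: decide between a full snapshot and a delta.
# Assumed rule (not confirmed in this thread): store a snapshot when
# the bytes from the chain base through the new delta exceed twice
# the size of the full new text.
choose_storage() {
    dist=$1      # bytes from chain base through the new delta
    textlen=$2   # size of the full new text
    if [ "$dist" -gt $((textlen * 2)) ]; then
        echo snapshot
    else
        echo delta
    fi
}

# Illustrative values only, not measured from the repository above
choose_storage 500000 204802
choose_storage 1000 204802
```

Under that assumption, a delta about as large as the file itself can tip a revision over the threshold, while a tiny delta stays well below it.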
regards,
Fritz
Development Software Systems
Festo Gesellschaft m.b.H.
Linzer Strasse 227
Austria - 1140 Wien
Firmenbuch Wien
FN 38435y
UID: ATU14650108
Tel: +43(1)91075-198
Fax:
www.festo.at