Are revlog diff calculated as "text" ALWAYS?

Jesus Cea jcea at jcea.es
Wed May 14 14:04:13 UTC 2014


On 14/05/14 10:24, Kastner Masilko, Friedrich wrote:
>> From: mercurial-bounces at selenic.com
>> [mailto:mercurial-bounces at selenic.com] On Behalf Of Jesus Cea
>> 
>> You don't want fixed size chunks (unless your source files are
>> database files). Any insertion in the middle of the file will
>> change all blocks after it.
>> 
>> You want to find a suitable "linefeed" mark that doesn't change a
>> lot when you do small changes to the file. For instance, for XML
>> files, the "line ending" could be the character ">".
[...]
> 
> It is clear that my proposal is _not_ the generic solution to all
> those problems resulting from the used line-based algorithm. But it
> would ease the described situation somewhat, without cutting too much
> into performance or usage. Let's see if a patch gets some positive
> reviews...

The problem with your suggestion, Friedrich is that your proposal is
only useful for block oriented content like databases. For my problem at
hand, ODT/DOCX documents, it would be not useful at all. Completely
ineffective.

Since current code already scans the document hunting for "\n" maybe it
could keep counters for other "frequent" markers like ">", "\r", space,
etc, and choose a "fragment marker" a little bit more cleverly.

Another option would be to abandon current approach (line oriented) and
embrace a completely different (but revlog compatible) option like
xdelta (I have no particular love for it, it is just a suggestion). If
the revlog compatibility can be ensured, I guess the only problem would
be performance difference, if any.

Current code split the input files in lines, compute a hash of each one
and then use that to compare both files and calculate the DIFF.

Another approach would be to use a rolling hash
<https://en.wikipedia.org/wiki/Rolling_hash> based on Rabin Fingerprint
<https://en.wikipedia.org/wiki/Rabin_fingerprint> to break the input
files in arbitrary statistically calculable block sizes (lets say, 32
bytes) and do the diff on them. It would be an hybrid between your
aligned fixed-size block proposal and flexible size/alignment of a
rolling hash. This option auto-align blocks when you do inserts/deletes,
"automágically".

Details available under demand, if there is interest in pursuing this :).

This could be suboptimal for text files, where lines are a natural block
unit, but it could work too. If you can switch between both approaches
according to text/binary detection.

-- 
Jesús Cea Avión                         _/_/      _/_/_/        _/_/_/
jcea at jcea.es - http://www.jcea.es/     _/_/    _/_/  _/_/    _/_/  _/_/
Twitter: @jcea                        _/_/    _/_/          _/_/_/_/_/
jabber / xmpp:jcea at jabber.org  _/_/  _/_/    _/_/          _/_/  _/_/
"Things are not so easy"      _/_/  _/_/    _/_/  _/_/    _/_/  _/_/
"My name is Dump, Core Dump"   _/_/_/        _/_/_/      _/_/  _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 538 bytes
Desc: OpenPGP digital signature
URL: <http://lists.mercurial-scm.org/pipermail/mercurial/attachments/20140514/4b6153be/attachment.asc>


More information about the Mercurial mailing list