Are revlog diff calculated as "text" ALWAYS?
Jesus Cea
jcea at jcea.es
Wed May 14 16:07:29 UTC 2014
On 14/05/14 16:38, Kastner Masilko, Friedrich wrote:
>> From: mercurial-bounces at selenic.com
>> [mailto:mercurial-bounces at selenic.com] On Behalf Of Jesus Cea
>>
>> The problem with your suggestion, Friedrich is that your proposal
>> is only useful for block oriented content like databases. For my
>> problem at hand, ODT/DOCX documents, it would be not useful at all.
>> Completely ineffective.
>
> That's not true. In the _example_ you gave, the suggested
> block-delimiting to 4k reduces the delta from 200k to max. 4k. The
> same would be true in similar circumstances for deserialized XMLs,
> i.e. if you have XMLs that grow, or get insertions in the middle of
> the file, the resulting delta would not be as big as before, but
> reduced. Sure it would not change anything if the insertion is at the
> beginning of the file, but as I said: it is no generic solution, just
> an enhancement of the current algorithm without too much risk. Every
> win - even if only skipping the first chunk - would be better than
> nothing, right?
Yes, you solve the example I provided to show the shortcomings of the
current DIFF implementation. But I can provide a new example that
breaks your proposal: instead of appending an extra byte at the end of
the random file, insert it AT THE BEGINNING of the file. Plof,
dead :-).
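To make the failure mode concrete, here is a minimal sketch (Python, with the 4k fixed blocks and random-file example from earlier in the thread; the data is hypothetical): appending a byte leaves every full block intact, while prepending one shifts every boundary so essentially nothing matches.

```python
import os

BLOCK = 4096

def fixed_blocks(data, size=BLOCK):
    # Split data into consecutive fixed-size blocks.
    return [data[i:i + size] for i in range(0, len(data), size)]

original = os.urandom(256 * BLOCK)      # 1 MiB of random data
appended = original + b"X"              # extra byte at the END
prepended = b"X" + original             # extra byte at the BEGINNING

old = set(fixed_blocks(original))
# Appending: all 256 original blocks still line up with their old selves.
append_hits = sum(b in old for b in fixed_blocks(appended))
# Prepending: every boundary shifts by one byte; no block survives.
prepend_hits = sum(b in old for b in fixed_blocks(prepended))

print(append_hits, prepend_hits)   # 256 matching blocks vs. 0
```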
> I really doubt that a general xdelta system will be as fast as the
> current bdiff implementation. I suspect (from the code I skimmed of
> that xdelta project) that it will be an order of magnitude slower.
xdelta generates standard VCDIFF files, does compression, tries to be
clever in the generated opcodes and, in general, does a lot of work
trying to minimize the DIFF size. I don't propose using xdelta as is. I
am proposing taking ideas from it (they took them from rsync).
A rolling hash implementation is quite efficient too, and enough for our
needs. Moreover, there are other costs to consider: for instance, zlib
compression of the result, or the flush to disk. Those are already quite
costly, and would help absorb the incremental performance hit, if any.
Anyway, we could settle the argument by investing some time in doing
actual measurements.
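As an illustration of why rolling is cheap (a sketch with an arbitrary polynomial hash, not Mercurial code and not the specific function that would be chosen): sliding the window forward by one byte is a constant-time update, yet yields the same value as recomputing the hash from scratch.

```python
BASE = 257                 # arbitrary base for a polynomial rolling hash
MOD = 2**31 - 1
WINDOW = 32

def hash_window(window):
    # Direct O(WINDOW) computation of the polynomial hash.
    h = 0
    for b in window:
        h = (h * BASE + b) % MOD
    return h

def roll(h, out_byte, in_byte, pow_out):
    # O(1) update: drop the outgoing byte, append the incoming one.
    return ((h - out_byte * pow_out) * BASE + in_byte) % MOD

data = bytes(range(256)) * 4
pow_out = pow(BASE, WINDOW - 1, MOD)

h = hash_window(data[:WINDOW])
for i in range(1, len(data) - WINDOW + 1):
    h = roll(h, data[i - 1], data[i + WINDOW - 1], pow_out)
    assert h == hash_window(data[i:i + WINDOW])  # rolled == recomputed
```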
> In a similar manner, every heuristic approach (counting markers,
> "guessing" binary status) will induce "magic", as you already wrote.
> Thus it increases the risk of breakage deep down in the very backbones
> of the whole system. Such a patch would also have a low chance of
> finding its way into production deployments.
If the generated bytestream is compatible with the current revlog
format, the only problem I can see is the implementation of "hg diff"
and related commands. And Matt already said that decoupling the DIFF
calculation from the "hg diff" representation *COULD* be done, if
necessary.
> I know it is a very pragmatic approach, but have you tried
> block-delimiting in your case already? My quick tests on a simple
> Word document tell me that it reduces the store size by ~60% without
> much change regarding commit time. If this patch goes through due to
> being minimally invasive, would that not be appreciated?
Are you saying that you took a big Word document, added a phrase *AT THE
BEGINNING* of it, and your algorithm produced a useful result? I don't
understand how that can be possible using the fixed block partition
algorithm you described.
Friedrich, this is not a personal attack. What I am saying is that if we
are open to changing the bdiff algorithm (keeping revlog compatibility),
we can do far better, and that your proposed algorithm is easy to
implement but is not actually a real improvement. Please, take my
random file and insert a new byte AT THE BEGINNING, as an example.
Please, try this option: instead of looking for "\n" (current
implementation) or choosing a fixed size block (your proposal), use
something like <https://github.com/joeltucci/rabin-fingerprint-c> to
choose the block boundaries. Choosing a fast non-cryptographic hash
algorithm is enough; we don't need crypto-resistance. Rsync uses a mere
adler32 <https://en.wikipedia.org/wiki/Adler-32>.
Just do a rolling CRC-32 calculation (efficient!) over the file and
choose boundaries where "rolling CRC-32 & X == 0", with X = 2^n - 1 for
a suitable value of n. With n=5, blocks will have a size of around 32
bytes, comparable with the current line-oriented DIFF, I guess.
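Putting the pieces together, a sketch of that boundary selection (illustrative only: the polynomial hash stands in for a real rolling CRC-32, and the WINDOW and n values are arbitrary choices of mine): because cut points depend only on local content, a byte inserted at the beginning should only disturb the first chunk or two, and the rest resynchronize.

```python
import os

BASE = 257            # stand-in polynomial hash; a real one would roll CRC-32
MOD = 2**32
WINDOW = 16           # bytes covered by the rolling window (arbitrary)
N = 5
MASK = (1 << N) - 1   # X = 2^n - 1, expected chunks of around 2^n bytes

def chunks(data):
    # Cut wherever the rolling hash of the last WINDOW bytes, masked
    # with X, is zero (and the chunk is at least WINDOW bytes long).
    pow_out = pow(BASE, WINDOW - 1, MOD)
    out, start, h = [], 0, 0
    for i, b in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * pow_out) % MOD
        h = (h * BASE + b) % MOD
        if i + 1 - start >= WINDOW and (h & MASK) == 0:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out

original = os.urandom(4096)
shifted = b"X" + original            # byte inserted AT THE BEGINNING
old = set(chunks(original))
survivors = sum(c in old for c in chunks(shifted))
# Unlike fixed blocks, most chunks should survive the one-byte shift.
print(survivors, "of", len(chunks(shifted)), "chunks unchanged")
```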
--
Jesús Cea Avión _/_/ _/_/_/ _/_/_/
jcea at jcea.es - http://www.jcea.es/ _/_/ _/_/ _/_/ _/_/ _/_/
Twitter: @jcea _/_/ _/_/ _/_/_/_/_/
jabber / xmpp:jcea at jabber.org _/_/ _/_/ _/_/ _/_/ _/_/
"Things are not so easy" _/_/ _/_/ _/_/ _/_/ _/_/ _/_/
"My name is Dump, Core Dump" _/_/_/ _/_/_/ _/_/ _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz