[PATCH 1 of 1] bdiff.c: implemented block-delimiting to better deal with long "lines"
Matt Mackall
mpm at selenic.com
Tue May 27 21:36:45 UTC 2014
On Wed, 2014-05-14 at 23:54 +0200, Friedrich Kastner-Masilko wrote:
> # HG changeset patch
> # User Friedrich Kastner-Masilko <kastner-masilko at at.festo.com>
> # Date 1400101077 -7200
> # Wed May 14 22:57:57 2014 +0200
> # Node ID 1b57d1650cd2d5aa8c6cc103c344ecbd8fbabc39
> # Parent 1ae3cd6f836c3c96ee3e9a872c8e966750910c2d
> bdiff.c: implemented block-delimiting to better deal with long "lines"
>
> Recent XML-based file formats often resemble human-readable text
> without a single line-break. This mostly comes from serializing
> binary data into XML without formatting the content for
> viewing. Storing such files with the current revlog implementation
> is inefficient because the bdiff algorithm is line-based. Since bdiff
> creates chunks based on the line-break mark, the whole file content
> is treated as one chunk, thus creating a delta as big as (or even
> bigger than) the file itself.
>
> This patch introduces block-limiting of lines. Any line longer than
> 4k is split into 4k blocks, giving the algorithm a chance to create
> smaller deltas, especially if the changes are at the end of the
> file. For growing content where the header of the file never
> changes, this patch increases storage efficiency. However, with
> changes at the beginning of the file, block-limiting does not change
> the results compared to the original algorithm. The same is true for
> standard usage with text files: because these usually contain lines
> shorter than 4k characters, the patch never kicks in.
So this looks fine as far as it goes, but I think we should try to go
one step further: getting somewhat repeatable block boundaries even if
we insert in the middle.

For instance, we could have a hierarchy of block break rules:

- break on newline
- break on [?_%] if > 1k # some other rare characters?
- break if > 4k
This might give the delta engine some opportunities to realign.
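
A rough sketch of how such a hierarchy could behave, written in Python
rather than the C of bdiff.c for brevity. The function name
`split_blocks`, its parameters, and the choice to cap a block at exactly
4096 bytes are illustrative assumptions, not part of the patch:

```python
def split_blocks(data: bytes, soft_limit: int = 1024, hard_limit: int = 4096,
                 soft_breaks: bytes = b"?_%") -> list[bytes]:
    """Split data into blocks using a hierarchy of break rules.

    Hypothetical helper for illustration only; bdiff.c does not expose this.
    """
    blocks = []
    start = 0
    for i, byte in enumerate(data):
        length = i - start + 1  # size of the current block, including byte i
        if (byte == 0x0A                                        # rule 1: newline
                or (length > soft_limit and byte in soft_breaks)  # rule 2: rare char past 1k
                or length >= hard_limit):                       # rule 3: hard cap at 4k
            blocks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        blocks.append(data[start:])  # trailing block without a terminator
    return blocks


# One long "line" with a rare character every 1500 bytes, then the same
# content with three bytes inserted at the front.
old = b"".join(bytes([c]) * 1500 + b"?" for c in b"abcd")
new = b"ZZZ" + old

shared_hierarchy = set(split_blocks(old)) & set(split_blocks(new))
shared_fixed = (set(split_blocks(old, soft_breaks=b""))
                & set(split_blocks(new, soft_breaks=b"")))
```

With the soft break rule enabled, the three blocks after the insertion
come out identical in `old` and `new`, so `shared_hierarchy` holds three
blocks. With only the fixed 4k cap, every boundary shifts by three bytes
and `shared_fixed` is empty. Whether real XML payloads contain enough
such characters to realign is an open question.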
--
Mathematics is the supreme nostalgia of our time.