[PATCH 1 of 1] bdiff.c: implemented block-delimiting to better deal with long "lines"

Pierre-Yves David pierre-yves.david at ens-lyon.org
Thu Aug 21 17:07:47 UTC 2014



On 05/28/2014 02:29 AM, Kastner Masilko, Friedrich wrote:
>> From: Matt Mackall [mailto:mpm at selenic.com]
>>
>> So this looks fine as far as it goes, but I think we should try to go
>> one further: getting somewhat repeatable block boundaries even if we
>> insert in the middle.
>>
>> For instance, we could have a hierarchy of block break rules:
>>
>> - break on newline
>> - break on [?_%] if > 1k  # some other rare characters?
>> - break if > 4k
>>
>> This might give the delta engine some opportunities to realign.
>
> I think that a fixed heuristic (rare characters) will not do much in terms of improvement for the described use-case of serialized XML files. That said, I've already experimented with a statistical approach:
>
> 1. During the first N bytes of the line, count every possible byte (only costs a bit of memory and one additional indirect addressing + incrementing). This will give a localized distribution of characters within the "line".
> 2. If no break occurred during N bytes, use one of the characters with a distribution closest to a general newline distribution in text-files (estimated at 1.5%) as break-marker _in addition_ to newline.
> 3. As soon as the first newline occurs, break and repeat from start.
>
> This can be improved by means of continuously keeping the distribution up-to-date while running up to the newline, but this is IMHO too costly to do (calculating best marker according to distribution) on every byte. A fixed point of strategy-change is faster to do, especially with N chosen on a 2^x position.
>
> So far I've got the best results with N=256 on various formats, but I worry about how this would influence visualization of diffs for the standard use-case. Your "rare characters" proposal is worth a try, though.

Any news on this front? The general idea for this patch with interesting 
and I would be happy to see the topic come to a conclusion.

-- 
Pierre-Yves David



More information about the Mercurial-devel mailing list