[PATCH 1 of 1] bdiff.c: implemented block-delimiting to better deal with long "lines"
Kastner Masilko, Friedrich
kastner-masilko at at.festo.com
Wed May 28 09:29:06 UTC 2014
> From: Matt Mackall [mailto:mpm at selenic.com]
>
> So this looks fine as far as it goes, but I think we should try to go
> one further: getting somewhat repeatable block boundaries even if we
> insert in the middle.
>
> For instance, we could have a hierarchy of block break rules:
>
> - break on newline
> - break on [?_%] if > 1k # some other rare characters?
> - break if > 4k
>
> This might give the delta engine some opportunities to realign.
I think that a fixed heuristic (rare characters) will not do much in terms of improvement for the described use-case of serialized XML files. That said, I've already experimented with a statistical approach:
1. During the first N bytes of the line, count every possible byte (only costs a bit of memory and one additional indirect addressing + incrementing). This will give a localized distribution of characters within the "line".
2. If no break occurred during N bytes, use one of the characters with a distribution closest to a general newline distribution in text-files (estimated at 1.5%) as break-marker _in addition_ to newline.
3. As soon as the first newline occurs, break and repeat from start.
This can be improved by means of continuously keeping the distribution up-to-date while running up to the newline, but this is IMHO too costly to do (calculating best marker according to distribution) on every byte. A fixed point of strategy-change is faster to do, especially with N chosen on a 2^x position.
So far I've got the best results with N=256 on various formats, but I worry about how this would influence visualization of diffs for the standard use-case. Your "rare characters" proposal is worth a try, though.
regards,
Fritz
Development Software Systems
Festo Gesellschaft m.b.H.
Linzer Strasse 227
Austria - 1140 Wien
Firmenbuch Wien
FN 38435y
UID: ATU14650108
Tel: +43(1)91075-198
Fax:
www.festo.at
Der Inhalt dieser E-Mail und moeglicher Anhaenge sind ausschliesslich fuer den bezeichneten Adressaten bestimmt.
Jede Form der Kenntnisnahme, Veroeffentlichung, Vervielfaeltigung oder Weitergabe des Inhalts dieser E-Mail und
moeglicher Anhaenge durch unberechtigte Dritte ist unzulaessig. Wir bitten Sie, sich mit dem Absender der E-Mail in
Verbindung zu setzen, falls Sie nicht der Adressat dieser E-Mail sind sowie das Material von Ihrem Computer zu loeschen.
This e-mail and any attachments are confidential and intended solely for the addressee. The perusal, publication, copying or
dissemination of the contents of this e-mail by unauthorised third parties is prohibited. If you are not the intended recipient of this
e-mail, please delete it and immediately notify the sender.
More information about the Mercurial-devel
mailing list