Are revlog diff calculated as "text" ALWAYS?

Kastner Masilko, Friedrich kastner-masilko at at.festo.com
Wed May 14 20:46:36 UTC 2014


> From: mercurial-bounces at selenic.com [mailto:mercurial-bounces at selenic.com] On Behalf Of Jesus Cea
> 
> Yes, you solve the example I provided to show the shortcomings of
> current DIFF implementation. I can provide you with a new example that
> breaks your proposal. For example: instead of appending an extra byte
> at the end of the random file, do it AT THE BEGINNING of the file.
> Plof, dead :-).

Yes, of course. I already acknowledged that. That's why I wrote that it is not a general solution. However, chances are high that you are not always changing something at the start of a file. Especially with documents, the natural workflow is to append data.
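
To make the append-versus-prepend difference concrete, here is a rough sketch. It is not bdiff and not the patch under discussion: it cuts a file into fixed 4k blocks (the worst case, where no line break is ever found) and counts how many blocks of the new revision already exist in the old one. The names `fixed_blocks` and `reused_fraction` are made up for this illustration, and "blocks reused" is only a crude stand-in for what a real delta would save.

```python
import random
from typing import List

BLOCK = 4096  # forced break size from the proposal ("4k")

def fixed_blocks(data: bytes, size: int = BLOCK) -> List[bytes]:
    """Cut data into consecutive blocks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def reused_fraction(old: bytes, new: bytes, size: int = BLOCK) -> float:
    """Fraction of blocks in `new` that also occur as a block in `old`."""
    old_blocks = set(fixed_blocks(old, size))
    new_blocks = fixed_blocks(new, size)
    return sum(b in old_blocks for b in new_blocks) / len(new_blocks)

# Hypothetical "random file": 16 blocks of pseudo-random bytes.
rng = random.Random(0)  # fixed seed so runs are repeatable
base = bytes(rng.getrandbits(8) for _ in range(16 * BLOCK))

appended = reused_fraction(base, base + b"!")   # one byte added at the end
prepended = reused_fraction(base, b"!" + base)  # one byte added at the start

print(f"append:  {appended:.2f} of blocks reused")
print(f"prepend: {prepended:.2f} of blocks reused")
```

Appending leaves every existing block boundary where it was, so only the new tail block is unknown. Prepending shifts every boundary by one byte, so no block lines up with the old revision any more.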

> Anyway, we could settle the argument just investing some time and doing
> actual measurements.

Please do so. I'm sure a comparison between xdelta (or one of the other methods you mentioned) and the current bdiff implementation would be valuable for further discussion.
 
> If the generated bytestream is compatible with current revlog format,
> the only problem I can see is "hg diff"-and-related implementation. And
> Matt already said that decoupling DIFF calculation from "hg diff"
> representation *COULD* be done, if necessary.

That something could be done does not mean that it is easy to do, or even desirable. However, if you deliver an appropriate refactoring of every aspect in this regard, I'm sure nobody will object.
 
> Are you saying that you took a big Word document, added a phrase *AT
> THE BEGINNING* of it and your algorithm produced a useful result? I
> don't understand how that can be possible, using the fixed block
> partition algorithm you described.

Perhaps I did not make myself clear enough. I do not propose replacing the current line-break delimiting with fixed-size delimiting. I proposed (and have already tested) a mixture: if there is no line break within 4k, just break the chunk there.
What I tried was a ca. 200 kB Word document (docx) at rev 0, with changes at the beginning in rev 1, changes in the middle in rev 2, and changes (actually additions) at the end in rev 3. I put it through four times: vanilla HG, HG with my patch, zipdoc alone, and finally zipdoc together with my patch. The first and second runs gave exactly the same result (as expected), but the last one was significantly smaller than the third, both in total size and in the size change at each step. The uncompressed docx size (i.e. what zipdoc produces) was around 1.2 MB, BTW. Perhaps you can give me an example of your actual data (in similarly manipulated versions) so that I can run it through for detailed numbers?
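
For illustration only, the mixed delimiting could look roughly like the sketch below. This is not the actual patch (bdiff is implemented in C inside Mercurial); the function name `split_chunks` and the exact treatment of the 4k limit are assumptions made for the example.

```python
from typing import List

def split_chunks(data: bytes, max_len: int = 4096) -> List[bytes]:
    """Split after each line break, but force a break once max_len bytes
    have passed without one."""
    chunks = []
    start = 0
    n = len(data)
    while start < n:
        # Search for a line break only within the next max_len bytes.
        limit = min(start + max_len, n)
        nl = data.find(b"\n", start, limit)
        # Keep the newline with its chunk; otherwise cut at the limit.
        end = nl + 1 if nl != -1 else limit
        chunks.append(data[start:end])
        start = end
    return chunks

# Ordinary text is split exactly as before, line by line.
print(split_chunks(b"first line\nsecond line\n"))

# A long run without any line break is cut into 4k pieces.
print([len(c) for c in split_chunks(b"x" * 10000)])
```

For files whose lines are all shorter than 4k the result is identical to plain line splitting; the forced break only matters for long runs without a newline.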

> Friedrich, this is not a personal attack. What I am saying is that if
> we are open to change the bdiff algorithm (keeping revlog
> compatibility), we can do far better, and that your proposed algorithm
> is easy to implement but it is not actually a real improvement. Please,
> take my random file and insert a new byte AT THE BEGINNING as an
> example.

Hey, we are not on the Linux kernel or Git mailing list, now are we? I didn't take this as a personal attack, just as I don't think you are something like Felipe Contreras in disguise ;).

I understand your point about being "open to change bdiff". I also understand that my proposal is not a general solution; you don't have to convince me of that. And I understand that there are clever algorithms out there that produce a fast diff for certain formats.

The thing is this: I don't believe that a radical change of the diffing system will get accepted. For that to happen, you will have to show that performance does not decrease (too much) and that backwards compatibility is not broken. If you can come up with such a patch series, all the better! Please have a go at it.

regards,
Fritz







