Thoughts on diff extensions

Christopher Li hg at chrisli.org
Sun Jun 19 19:12:41 UTC 2005


On Sun, Jun 19, 2005 at 02:26:45PM -0700, Matt Mackall wrote:
> On Sun, Jun 19, 2005 at 01:29:32PM -0400, Christopher Li wrote:
> > It is the cost of how to group the chunks, that is different from the
> > cost of how many bytes it inserted. You insert more and you have to
> > deleted more.
> 
> No, the issue is deleted bytes DON'T SHOW UP IN THE OUTPUT. 

Yes I know that, but my point exactly is that is an irrelevant issue.
Why? because the inserted bytes will show up output in the output and it keep
a constant delta relationship with the deleted bytes.

It is purely a distribute/grouping of the bytes matter the different in
the output.  Let's say you delete 5 new line in 5 different place, no insert
there does it not cost you any thing in the output? You still need 5 hunks to describe it.

But if delete 5 new line is one place, you just need 1 hunks. That is what
I am try to said, it is the grouping of the bytes matters, not deleted vs insert.

>  
> You're missing my point by about five miles. The point is the
> optimality of the GNU diff algorithm is irrelevant. 
> 
> - it doesn't generate minimal diffs by default
> - if it did, it would be slow
> - those diffs are not really minimal anyway

Define which minimal.  It is minimal changed bytes. Not your minimal
output one way delta size.

> and b", not "prepend ac to ab". And I'm pretty sure that fewer diff
> chunks will generally result in a better merge.

Why? Bigger chunk subject to conflict more easily. Let's take one extreme
example, if you take the whole file and do byte base diffs. It will have a
lot more chucks and it will able to merge on the very common case that you
modify a different place of the same line. The bigger chunk line base diff
will have conflict right there.

> 
> Here's what I care about here, again:
> 
> 1. must be fast
> 2. must be about the same compression factor
> 3. must be reasonably simple
> 4. must not choke on binary
> 5. must be capable of storing just changed lines
> 

> My code fails on 1. It's the same algorithm as GNU diff, except
> without the non-minimal shortcut. And it's less than 300 lines of
> relatively clean code. I suspect it's going to be easier to add the
> shortcut case to mine than to beat diffutils into acceptable shape.

The GNU diff is complex is because it has a lot of this short cut
and tunable stuff build in. Like discarding start and ending lines.
If you look close a lot of code is already dead, I #if 0 a lot of it. I just
did not deleted it completely.

By the time you add the short cut and all the speed up the code it will
add up complexity and size.

Do what you feel like.

Chris




More information about the Mercurial mailing list