filelog binary diff
Matt Mackall
mpm at selenic.com
Mon May 16 23:11:58 UTC 2005
On Mon, May 16, 2005 at 02:29:14PM -0400, Christopher Li wrote:
> On Mon, May 16, 2005 at 12:39:02PM -0700, Matt Mackall wrote:
> > On Mon, May 16, 2005 at 10:19:28AM -0400, Christopher Li wrote:
> > > On Mon, May 16, 2005 at 09:05:20AM -0700, Matt Mackall wrote:
> >
> > Not sure if mmap is a win here. Doing a memcpy of all of the kernel
> > source is well under a second, so saving the copy is not a big deal on
> > something like a full tree checkout. On the other hand, handing off
> > large I/Os to the kernel with read() can result in substantially
> > better I/O patterns than doing a dependent read on each page in an
> > mmap segment. And bad I/O can be _really_ bad.
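[Editorial sketch, not Mercurial code: the two access patterns contrasted above. Names are hypothetical; both functions return the same bytes, and the difference under discussion is purely in the I/O pattern the kernel sees.]

```python
# One large read() presents the kernel with a single sequential stream it
# can schedule as big I/Os; touching an mmap'd region instead pulls pages
# in one dependent fault at a time.
import mmap

def read_whole(path):
    # a single read(): one big sequential request
    with open(path, "rb") as f:
        return f.read()

def mmap_whole(path):
    # page faults fetch data on demand as the mapping is touched
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return bytes(m)
```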
>
> I guess we just have to try it out to find out, don't we? Read
> from the index first to find the base and the diff buckets that follow,
> then read them in linearly. In the linear case it should behave exactly like
> what you have currently (from the file system's point of view). In the branch
> case, your scheme reads sequentially but has a bigger diff bucket; my
> suggestion reads sparsely but has smaller diffs. The parts it skips are
> the interleaved parts that your scheme has to read.
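[Editorial sketch of the sparse-read pattern proposed above. The index format is assumed, not the real revlog layout: imagine an index that yields (offset, length) spans for the base and the relevant deltas, letting the reader seek past interleaved data belonging to other branches.]

```python
# Hypothetical sparse read: fetch only the spans an index says we need,
# skipping the interleaved data from other branches in between.
def read_spans(path, spans):
    # spans: (offset, length) pairs for the base and relevant deltas
    chunks = []
    with open(path, "rb") as f:
        for offset, length in spans:
            f.seek(offset)               # skip unneeded interleaved data
            chunks.append(f.read(length))
    return b"".join(chunks)
```

Whether this beats one contiguous read depends on how the skipped gaps interact with disk seek costs, which is exactly the point in dispute.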
>
> It is not just any bad I/O. It is still a linear read with some parts
> skipped, in exchange for having less data to read. Why do you think it
> will be much worse than what we have currently?
The ancestor can be an unbounded distance away in the log. It might
not be much worse, but it's also unlikely to be much better. There's
a lot more room to get worse than to get better, and it's no longer
as simple.
The current revlog scheme aims for O(1) seeks and I/O and computation
proportional to file size. Compression is secondary. Doing ancestor
chaining rather than the linear approximation I'm currently using
potentially gives up that O(1) property.
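[Editorial sketch of the linear scheme being defended, using a toy in-memory log in place of the real index and data files (names hypothetical): each revision is either a full snapshot or a delta against the immediately preceding entry, so rebuilding any revision is one backward scan to the last snapshot plus one forward, contiguous pass.]

```python
# Toy model of the linear revlog scheme: an entry is ("full", text) or
# ("delta", patch_function) against the previous entry.
def reconstruct(entries, rev):
    base = rev
    while entries[base][0] != "full":    # scan back to the last snapshot
        base -= 1
    text = entries[base][1]
    for i in range(base + 1, rev + 1):   # apply deltas in one linear pass
        text = entries[i][1](text)
    return text
```

Because the base and its deltas sit adjacently on disk, this costs one seek plus one sequential read: the O(1)-seeks property mentioned above. Ancestor chaining would instead follow parent pointers to entries an unbounded distance back in the log.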
> Again, I am not saying this is a performance win; it can win in some
> cases and lose in others. I am just saying it should be in the same
> performance range if we do it carefully. It saves more disk space and has
> less to transfer over the network because the diffs are smaller.
>
>
> > > some projects like the Linux kernel, right? The kernel has tons of
> > > branching going on. I agree that the normal file logs of a kernel
> > > source tree might not have a lot of open branches. The biggest
> > > victims of this design are the changelog and manifest file, which
> > > are touched by every single checkin and have to cope with the
> > > complicated branch relations.
> >
> > Making a branch is cheap (it's one of the fastest operations, in
> > fact). hg branch does cp -al internally and COWs files as necessary.
> > You can have dozens of branches before it begins to be a problem.
>
> Sure, but that doesn't help solve the "the Linux kernel has complicated
> branch relations" problem. The manifest file is still going to suffer
> from the interleaved-diff problem, no?
Not much, no. The compression factor on the manifest is already
enormous. The average manifest size for the kernel is about 1M, while
the whole manifest log with >28000 revisions is only 11M. So we're at
about 2500:1 compression on the manifest. As interleaving is mostly
going to happen with merges, that shouldn't affect the average
compression very much.
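[The arithmetic behind that figure, using the rough numbers quoted above:]

```python
# Rough check of the quoted compression figure: >28000 revisions of a
# ~1 MB average manifest stored in an 11 MB log.
revisions = 28000          # approximate, from the message above
avg_manifest_mb = 1.0
log_mb = 11.0
ratio = revisions * avg_manifest_mb / log_mb
print(int(round(ratio)))   # roughly 2500:1
```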
> > Importantly, throwing away a branch is also very cheap. Pull from
> > upstream, make a branch, do some work, verify it, pull it into a
> > submission tree, and throw away your branch.
>
> But you cannot throw away the branches that become part of the Linux
> kernel history, so branching and merging in the revlog is unavoidable.
It will tend to show up in the mergelog like this:
AAAAAAAABBBBBBBBMAAAAAAAAAAAAAAABBBBBBBBBM
The size of the A->B diff won't be very big compared to the manifest
itself.
--
Mathematics is the supreme nostalgia of our time.