Speeding up Mercurial on NFS
Matt Mackall
mpm at selenic.com
Thu Jan 13 18:35:08 UTC 2011
On Wed, 2011-01-12 at 13:40 +0100, Martin Geisler wrote:
> Matt Mackall <mpm at selenic.com> writes:
>
> > On Tue, 2011-01-11 at 12:11 +0100, Martin Geisler wrote:
> >
> >> I agree with you that a single disk should max out like you
> >> describe... but the above numbers are for a normal 7200 RPM 1 TB SATA
> >> disk and a quad core i7 930.
> >
> > Then you're getting lucky with disk layout and read-ahead. It's very
> > easy for an 'aged' directory tree to take much longer than 5 seconds
> > to walk on spinning media.
>
> Yeah, the directory is not aged in any way, it's just a clone of
> OpenOffice that I made at some point.
>
> >> > By comparison, threaded NFS lookups is all about saturating the
> >> > pipe because the (cached on server) lookup is much faster than the
> >> > request round trip.
> >> >
> >> > How many files are you walking here?
> >>
> >> There are 70k files -- it is the working copy of OpenOffice,
> >> changeset 67e476e04669. You sometimes talk about walking a repo with
> >> 207k files; is that a public repo?
> >
> > Might have been Netbeans?
>
> I just cloned it and they "only" have 98k files and 186k changesets.
Hmm, no idea then.
> Oh, I just looked at the graph in the Gnome System Monitor and saw that
> the spikes went no further than ~50% or so. It shows 8 curves, one for
> each "virtual" core.
Yeah, I don't think that's actually meaningful. With hyperthreading you
can never get both hardware threads on a core to 100%, so there's no way
to tell how close to saturation you are. Saturation might come when all
threads show 50%, 40%, or 60%, depending on the workload.
> Okay, as a start I have timings for a cache-hot local walk here:
>
> threads   pywalker    walker
>       1    565 ms     259 ms
>       2   1330 ms     204 ms
>       4   1707 ms     440 ms
>       8   1834 ms     636 ms
>      16   1947 ms     739 ms
>      32   1969 ms     765 ms
Huh. This hits a performance wall before you reach the number of cores,
and the scaling is already bad at 2 threads. That wall is probably due to
cacheline bouncing as the various lookups touch shared state in the
dcache. In the not-yet-released kernel, dcache lookup is now
'store-free', so this bouncing should go away and yield vastly better
numbers for high N.
But it means you're probably at the limit of what you can do with your
testing on a single system: if you've got 8 threads trying to fill the
loopback 'pipe', the kernel NFS server is probably going to process them
on the same core as the submitting thread (or equally, a random core)
with the same cache ping-pong effects.
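
(For reference, here's a minimal sketch of what such a test harness
might look like. This is an illustration only, not the actual
pywalker/walker code from this thread: a plain Python 3 thread pool
doing lstat() over a pre-collected file list, so the timed part is just
the stat traffic.)

import os
import sys
import time
from concurrent.futures import ThreadPoolExecutor

def collect(root):
    # One single-threaded pass to gather every file path in the tree.
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return paths

def timed_walk(paths, nthreads):
    # Split the paths across nthreads workers and time the lstat() calls.
    def stat_chunk(chunk):
        for p in chunk:
            try:
                os.lstat(p)
            except OSError:
                pass
    chunks = [paths[i::nthreads] for i in range(nthreads)]
    start = time.time()
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        list(pool.map(stat_chunk, chunks))
    return (time.time() - start) * 1000.0

if __name__ == '__main__':
    paths = collect(sys.argv[1])
    for n in (1, 2, 4, 8, 16, 32):
        print('%2d threads: %6.0f ms' % (n, timed_walk(paths, n)))
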
If we subtract the local walk speed from the NFS walk speed, we end up
with something like:
threads   local (ms)   nfs (ms)   diff (ms)
      1          259       1931        1672
      2          204       1164         960
      4          440        818         378
      8          636        833         197
     16          739        991         252
Here 'diff' is effectively the overhead of going over NFS. If we could
combine the best-case NFS communication overhead of 197ms with the
best-case walk result of 204ms, we'd be down at 401ms.
That'd be more like the result you could achieve with a well-tuned
multithreaded client saturating a dedicated NFS server.
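
(Spelled out, that estimate is just column subtraction plus the two best
cases; a quick sketch with the numbers from the tables above:)

# Times in ms, copied from the tables above.
local = {1: 259, 2: 204, 4: 440, 8: 636, 16: 739}
nfs = {1: 1931, 2: 1164, 4: 818, 8: 833, 16: 991}

# Per-thread-count NFS overhead: {1: 1672, 2: 960, 4: 378, 8: 197, 16: 252}
diff = {n: nfs[n] - local[n] for n in local}

# Best-case overhead plus best-case local walk: 197 + 204 = 401 ms
estimate = min(diff.values()) + min(local.values())
print(diff, estimate)
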
--
Mathematics is the supreme nostalgia of our time.