Corrupted repositories on NFS

Matt Mackall mpm at selenic.com
Mon Nov 29 17:08:25 UTC 2010


On Fri, 2010-11-26 at 16:01 +1100, Jesper Noehr wrote:
> Hi list,
> 
> I've managed to corrupt a repository on an NFS mount by accessing it
> from several clients at once. This shouldn't be possible, should it?
> 
> Mercurial version 1.7.1, python 2.6.
> 
> I wrote this script: http://paste.pocoo.org/show/296128/
> 
> Please excuse the quality, it was hacked up quickly.
> 
> Anyway, after running this for a while, the repository becomes
> corrupted. It leaves an abandoned transaction in the repository, and
> you need to run "hg recover" to get it back into a working state.
> 
> I think I managed to trace down at least some of the reason why this happens.
> 
> In http://bitbucket.org/mirror/mercurial-crew/src/tip/mercurial/lock.py#cl-78,
> it tries to make a lock, and if it fails due to the lock already being
> there, it will call self.testlock(). self.testlock() can naturally
> assume that the file exists (cause the OS just said so!), but over
> NFS, it can happen that the file will no longer exist inside
> 'testlock()'. We saw that happen, at least.
> 
> I modified http://bitbucket.org/mirror/mercurial-crew/src/tip/mercurial/util.py#cl-593
> (util.readlock) to return a dummy-string in case os.readlink raised
> errno.ENOENT, triggering Mercurial's error.LockHeld, which seems to
> have fixed that race condition.

Hmm. It might be better to simply do a try: except: around the testlock
call in trylock. But it looks to me like we'll fail safe if we hit this?

> Secondly, unlinking on NFS is not atomic. The recommended way to go
> about it is to 1. rename the file (which is atomic), and 2. unlink it.
> Then you get the same guarantees you can get from a normal filesystem.
> I've modified mercurial to rename, then unlink, in cases where it
> deals with lockfiles. That fixes the other race.

Not sure why this should matter. Locking and unlocking are not
symmetric. If other clients take a while to notice a lock has appeared,
you have a problem. But if they take a while to see a lock has
disappeared, then you shouldn't have issues. Did you test this change
separately?

On the other hand, if clients see the lock disappear before they see
updates to the changelog, then yes, scribbling can occur. Perhaps the
rename is enforcing a 'sequence point'.
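
For reference, the rename-then-unlink release described in the quoted message might look like the sketch below. This is an assumption about the shape of the change, not the actual patch: release_lock and the ".released-" suffix are hypothetical names picked for illustration.

```python
import os
import tempfile
import uuid


def release_lock(path):
    """Release a lock file by renaming it out of the way, then unlinking it.

    The rename moves the lock to a unique name in the same directory, so
    the well-known lock name disappears in a single step; the unlink then
    only touches a name that no other client is looking for.
    """
    doomed = "%s.released-%s" % (path, uuid.uuid4().hex)
    os.rename(path, doomed)  # raises OSError if the lock is already gone
    os.unlink(doomed)
```

Whether the extra rename really closes a race, or merely forces the client to flush state to the server before the lock goes away, is exactly the open question here, so it would be worth testing this change in isolation.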

> Abort: working directory has unknown parent '8e16e3e8db02'!
> 
> ... however, they don't seem to corrupt anything.

This is clients seeing .hg/dirstate updates before the corresponding
changelog updates. What's your test? Serial commits?

-- 
Mathematics is the supreme nostalgia of our time.

More information about the Mercurial-devel mailing list