[PATCH alternative-version] pathencode: for long paths, strip first 5 chars, not first dir
Adrian Buehlmann
adrian at cadifra.com
Fri May 8 21:13:47 UTC 2015
On 2015-05-08 21:59, Martin von Zweigbergk wrote:
> # HG changeset patch
> # User Martin von Zweigbergk <martinvonz at google.com>
> # Date 1430953094 25200
> # Wed May 06 15:58:14 2015 -0700
> # Node ID baa1fb9f49617b0d554c6c47af9838922baf9a35
> # Parent 8179af513aebf96c4902ba3e5e3cf710d49501e4
> pathencode: for long paths, strip first 5 chars, not first dir
>
> When encoding long paths, the pure Python code strips the first
> directory from the path, while the native code currently strips the
> first 5 characters. This discrepancy has not been a problem so far,
> since we have not stored anything in directories other than
> data/. However, we will soon be storing submanifest revlogs in
> metadata/, so the discrepancy will have to go [1]. Since file
> collisions are avoided by the hashing alone (which is done on the full
> unencoded path), it doesn't really matter whether we drop the first
> dir, the first 5 characters, or special-case non-data/. To avoid
> touching the C code, let's always strip the first 5 characters.
>
> [1] Or maybe elsewhere, but the discrepancy is ugly either way.
>
> diff -r 8179af513aeb -r baa1fb9f4961 mercurial/store.py
> --- a/mercurial/store.py Thu May 07 16:43:58 2015 -0700
> +++ b/mercurial/store.py Wed May 06 15:58:14 2015 -0700
> @@ -187,7 +187,7 @@
>
> def _hashencode(path, dotencode):
> digest = _sha(path).hexdigest()
> - le = lowerencode(path).split('/')[1:]
> + le = lowerencode(path[5:]).split('/')
> parts = _auxencode(le, dotencode)
> basename = parts[-1]
> _root, ext = os.path.splitext(basename)
> diff -r 8179af513aeb -r baa1fb9f4961 tests/test-hybridencode.py
> --- a/tests/test-hybridencode.py Thu May 07 16:43:58 2015 -0700
> +++ b/tests/test-hybridencode.py Wed May 06 15:58:14 2015 -0700
> @@ -460,3 +460,9 @@
> 'VWXYZ-1234567890-xxxxxxxxx-xxxxxxxxx-xxxxxxxx-xxxx'
> 'xxxxx-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwww'
> 'wwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww.i')
> +
> +print "paths outside data/ can be encoded"
> +show('metadata/dir/00manifest.i')
> +show('metadata/12345678/12345678/12345678/12345678/12345678/'
> + '12345678/12345678/12345678/12345678/12345678/12345678/'
> + '12345678/12345678/00manifest.i')
> diff -r 8179af513aeb -r baa1fb9f4961 tests/test-hybridencode.py.out
> --- a/tests/test-hybridencode.py.out Thu May 07 16:43:58 2015 -0700
> +++ b/tests/test-hybridencode.py.out Wed May 06 15:58:14 2015 -0700
> @@ -491,3 +491,10 @@
> A = 'data/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345/-xxxxxxxxx-xxxxxxxxx-xxxxxxxxx-123456789-12.3456789-12345-ABCDEFGHIJKLMNOPRSTUVWXYZ-abcdefghjiklmnopqrstuvwxyz-ABCDEFGHIJKLMNOPRSTUVWXYZ-1234567890-xxxxxxxxx-xxxxxxxxx-xxxxxxxx-xxxxxxxxx-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww-wwwwwwwww.i'
> B = 'dh/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345/-xxxxx93352aa50377751d9e5ebdf52da1e6e69a6887a6.i'
>
> +paths outside data/ can be encoded
> +A = 'metadata/dir/00manifest.i'
> +B = 'metadata/dir/00manifest.i'
> +
> +A = 'metadata/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/12345678/00manifest.i'
> +B = 'dh/ata/12345678/12345678/12345678/12345678/12345678/12345678/12345678/00manife0a4da1f89aa2aa9eb0896eb451288419049781b4.i'
> +
The current layout (using the current, 'fncache', repo format) for
regular revlogs under .hg/store/ is:
data/ for non-path-hashed revlogs
dh/ for path-hashed-revlogs
You - as I understand it - want to separately store submanifests under a
new directory named 'metadata', under 'store' (leaving aside for now
that I don't like the name 'metadata', because of its length).
Don't you then also need a separate directory for paths of submanifests
that trigger the path-hash-encoding?
If you store path-hashed submanifest revlogs under 'dh', together with
non-submanifest revlogs, won't that lead to potential path collisions
inside dh?
The trailing 'ata' of 'metadata' doesn't look like a correct
discriminator to me, as that string may derive from the working tree as
well.
More information about the Mercurial-devel
mailing list