User metadata support
Guenther Brunthaler
spam_me_not_please_dont at gmx.nospam.net
Sun Nov 19 19:54:51 UTC 2006
Hi all,
I'm rather new to Mercurial, but I have used a lot of different SCMs before.
Mercurial looks in many aspects like the "right thing" to me.
I used to check out Monotone some time ago, but it had just too many
shortcomings to be useful - most of which Mercurial managed to avoid.
Actually, there is only one single big thing left (except for symlink
support) which would make everyone happy: User metadata support.
With user metadata, I mean something like the "properties" of Subversion.
In essence, it's just a version-controlled key/value list associated
with each file.
You might think nobody actually needs such a thing?
Let's illustrate a few cases where such metadata would be highly useful:
* Additional permission bits. Currently, Mercurial supports the
executable bit right out of the box. Fine. But what if more permission
bits should be associated with a file, such as the sticky bit, a
read-only flag on creation, or a special POSIX ACL? If hooks for
checkin/checkout had access to file metadata, the hooks could set the
appropriate bits on checkout as required, and without a need to
integrate such features into the Mercurial core.
* Actually, even the executable bit would no longer need to be supported
directly by the Mercurial core: hooks could take over that job too,
provided they have access to metadata items such as a property named
"hg:executable".
* Line-ending conversion. While I agree that line-ending conversions
should normally be performed based on heuristics because users tend to
forget about setting special properties, there are exceptions. What if a
file with extension .txt is a texture in some project subdirectory
rather than a text file like in the rest of the project? If the
autodetection heuristics for binary files fail, we'll be screwed as
soon as line-ending conversion is attempted on that file. Using a
property such as "hg:eol-style" set to "binary" would let a hook script
override autodetection in such cases.
* Character set conversion. What if a single directory contains text
files in different character set encodings? Just think of text files on
a Windows machine which shall also be edited on a UTF-8 Linux
workstation: On the Windows side, most files will be using the "ANSI"
character set (in fact Windows-1252, because of the euro symbol), but
some files intended to be used by the Console are instead represented
using the "OEM" character set (IBM CP 437 or CP 850). Using the same
conversion for all text files cannot work in this case. And they all
share the same filename extension. It is necessary to override the
conversions on a per-file basis. Metadata properties would allow the
hook to also take care of this.
* Stream metadata. Machines like the Apple Macintosh can use different
streams in a file, the so-called "data fork" and "resource fork". Think
of it as a kind of sub-file. Same for NTFS which also supports streams.
Each stream of a file has the potential to contain data a user would
like to be subject to version control. The "normal file contents" are
just the contents of the default stream of each file. But the best is
yet to come:
* Symlinks could be implemented using hooks having access to stream
metadata! In this case, a symlink would be treated as a stream with a
reserved name of a file which has no default stream (and thus no normal
file data) at all. That means that no file will be created on
checkout. But the checkout hooks (which will be run for all streams, not
just for file data contents) can check the stream type and create a symlink.
* Any kind of additional information to be attached to files/streams,
such as MIME types etc. The hooks can make use of this information if
required.
* Directory attributes! Directories can also have metadata streams,
storing version-controlled metadata about the directories, such as
"hg:ignore".
* Tracking even empty directories and renaming or moving of directories.
If we assign an (empty) dummy stream such as "hg:dir" to each directory,
we can deduce the existence of a directory from the mere existence of
that stream. Which means we won't need things like dummy ".keep" files any more.
You see, metadata support would in fact be exceptionally useful for
everyone, and would even make implementation of some things easier.
For instance, you could forget about the executable bit or symlinks in
the core Mercurial project, and delegate such problems to the hook scripts.
Of course, metadata support should be implemented in a way that requires
the least changes to the existing implementation.
First, what is actually needed:
* Streams are versioned, not files. It's pretty much the same
from the perspective of a revlog, but we have to add an additional
level below what is currently the leaf level.
For instance, instead of having a
.hg/data/somedir/somefile.d
we then could have a
.hg/data/somedir/somefile/hg_data.d
which means: This represents a stream with name "hg:data" of the
version-controlled object "somedir/somefile". We say "object" here rather than
"file", because that object could be a symlink as well - depending on
which stream properties it has.
In this case, it is a file, because it has a "hg:data" property (which
contains the actual file contents).
But it could have additional properties as well:
.hg/data/somedir/somefile/hg_executable.d
could indicate the fact that file "somedir/somefile" has an additional
property "hg:executable", which means its executable bit should be set
on checkout.
"hg:executable" is also an example of a "switch"-style property: Its
mere existence indicates something; the actual revlog contents will
typically be an empty file because all that matters is whether this
property is there or not.
You can also see: Streams and Properties are pretty much the same in
this model - from the viewpoint of the revlog they are just more files
to be version-controlled.
It is only the *interpretation* as streams/properties which makes them
special.
Another example: In order to save a symlink instead of a file using the
same name as above, we could use a property-revlog like this:
.hg/data/somedir/somefile2/hg_symlink.d
which represents a stream with name "hg:symlink", and the contents of
the current revision of that revlog contain the symlink target.
And now how to store a "hg:ignore" property for directory
"somedir/somesubdir":
.hg/data/somedir/somesubdir/hg_ignore.d
So, the most important thing to be changed for implementing that feature
is to add an additional subdirectory level at the leaves of the
version-controlled directory tree that is omitted when checking out a
revision, but available to the hooks.
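To make this a bit more concrete, here is a minimal Python sketch of what a checkout hook might do with the streams of a single object. It is only an illustration under assumptions: the function name materialize and the dictionary of streams passed to it are hypothetical and not an existing Mercurial hook API, and the code assumes a POSIX system.

```python
import os
import stat


def materialize(path, streams):
    """Create the working-copy entry for one versioned object.

    `streams` is a hypothetical mapping of stream name -> bytes contents,
    as a checkout hook might receive it.
    """
    if "hg:symlink" in streams:
        # No default stream: the object is a symlink, not a regular file.
        os.symlink(os.fsdecode(streams["hg:symlink"]), path)
        return
    if "hg:data" not in streams:
        # Nothing to write, e.g. an object carrying only directory metadata.
        return
    with open(path, "wb") as f:
        f.write(streams["hg:data"])
    if "hg:executable" in streams:
        # Switch-style property: only its existence matters.
        mode = os.stat(path).st_mode
        os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```

Further properties (read-only, sticky bit, ACLs) would just be more branches in such a hook, which is the whole point: the core does not need to know about them.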
In the output of "hg manifest" the streams could be displayed using the
-v option, while the "hg:data" stream should be suppressed from output
in the normal case (because it is the default).
For instance,
$ hg manifest
hexstuff... somedir/somefile
hexstuff... somedir/somefile2 [hg:symlink]
hexstuff... somedir/somesubdir [hg:ignore]
$ hg manifest -v
hexstuff... somedir/somefile [hg:data]
hexstuff... somedir/somefile2 [hg:symlink]
hexstuff... somedir/somesubdir [hg:ignore]
Of course, streams could be displayed in some other way as well;
it's just an example.
The internal manifest format would need to be updated as well:
$ hg debugdata .hg/00manifest.d 0
somedir/somefile/hg_data<hexstuff>
somedir/somefile2/hg_symlink<hexstuff>
somedir/somesubdir/hg_ignore<hexstuff>
So, actually it does NOT need to be changed, but rather includes the
*uninterpreted* contents of the .hg/data directory, including the leaf
revlogs which always represent streams.
To be more precise: *All* the revlogs now represent streams! Because the
actual files or directories or symlinks will be represented by
subdirectories now which contain the stream revlogs. And whether such a
directory will be interpreted as the name of a version-controlled file,
directory, symlink, fifo, device file or something different depends
solely on which stream revlogs exist in that directory.
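As a rough sketch of that interpretation step, the following Python groups internal manifest entries by object and derives the object kind from its streams. The function names are made up for illustration, and the rule "neither hg:symlink nor hg:data means directory" is my assumption for this example, not a fixed part of the proposal. Names using the "b32" mapping are not handled here.

```python
from collections import defaultdict


def group_streams(manifest_paths):
    """Map each object path to the set of stream names stored below it."""
    objects = defaultdict(set)
    for path in manifest_paths:
        # The last path component is the stream revlog, the rest the object.
        obj, _, stream_file = path.rpartition("/")
        # Undo the colon -> underscore mapping (plain names only).
        objects[obj].add(stream_file.replace("_", ":"))
    return dict(objects)


def classify(streams):
    """Decide what kind of object a set of streams represents."""
    if "hg:symlink" in streams:
        return "symlink"
    if "hg:data" in streams:
        return "file"
    # Assumption: anything without data or symlink target is a directory.
    return "directory"


entries = [
    "somedir/somefile/hg_data",
    "somedir/somefile2/hg_symlink",
    "somedir/somesubdir/hg_ignore",
]
for obj, streams in sorted(group_streams(entries).items()):
    print(obj, classify(streams), sorted(streams))
```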
And to make the best of properties, there should be a means of
*inheriting* them from the parent directory, possibly overriding them in
nested subdirectories. But that's worth its own thread I think.
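Just to hint at what I have in mind, here is one possible lookup rule in Python: the nearest definition wins, searching from the object itself up to the repository root. The function name and the in-memory dictionary are hypothetical; how inheritance should really work is open.

```python
import posixpath


def effective_property(props, path, name):
    """Return the value of `name` for `path`, inherited from parents.

    `props` maps an object path ("" is the repository root) to a
    dictionary of property name -> value.
    """
    current = path
    while True:
        value = props.get(current, {}).get(name)
        if value is not None:
            return value
        if current == "":
            return None  # reached the root without finding the property
        current = posixpath.dirname(current)


props = {
    "": {"hg:eol-style": "native"},
    "textures": {"hg:eol-style": "binary"},
}
print(effective_property(props, "textures/wall.txt", "hg:eol-style"))
print(effective_property(props, "docs/readme.txt", "hg:eol-style"))
```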
Regarding stream names, it might be wise to enforce a naming policy to
avoid name clashes with user-defined properties.
A suggestion of such a policy:
* All property/stream names optionally start with a namespace prefix,
followed by a colon, and then an identifier. For instance, in "hg:data",
"hg" is the name of the namespace, and "data" is the namespace-relative
name of the stream.
* Namespace "hg" is reserved for Mercurial's "well-known" or specially
interpreted streams. For instance, while "hg:executable" could be a
user-defined property as well which is only of interest for user-defined
hooks, "hg:data" is clearly of essential interest for the internal
checkout and checkin functions of Mercurial.
* Namespace "urn" is reserved for property names which conform to the
URN syntax, i.e. globally unique and *persistent* identifiers.
(Persistence is also the big difference between a URL and a URN. URLs
cannot truly be considered to be persistent: Domains come into existence
and go away all the time.) For instance, there is a "urn:uuid" scheme
which allows creating URNs based on UUIDs for those who like this. But
numerous other schemes exist as well.
* Namespace "rdn" is reserved for the "reversed domain name" identifiers
which are so popular in Java (or Monotone). This specifies properties
such as "rdn:com.sun.java/bigproject/specialstream". However, as stated
in the previous paragraph, DNS names might not be the best choice to
guarantee uniqueness of a name - at least not over time. URNs, in
contrast, will do (if an appropriate URN scheme is chosen, such as
"urn:uuid").
* All other names with or without a namespace prefix are free to be used
by users in any way they like.
However, there is a problem here: The "urn" and "rdn" namespaces allow
slashes, colons and other characters that are better avoided in
filenames, especially under Windows.
So I suggest a simple name mapping strategy:
* We add a pseudo-namespace "b32" which encodes whatever follows it in
BASE-32 encoding.
* That pseudo-namespace will only be recognized at the beginning of a
stream name and will encode whatever follows it in BASE-32.
* Colon characters are mapped into underscores.
Here are some examples of stream names and the mapped revlog file names
which will represent them:
"hg:data" -> "hg_data.d"
"hg:symlink" -> "hg_symlink.d"
"plain" -> "plain.d"
"usernamespace:whatever" -> "usernamespace_whatever.d"
"funny_name" -> "b32_<base32stuff>.d"
"urn:uuid:11223344-5566-3353-aabbccddeeff" -> "b32_<base32stuff>.d"
In those examples, <base32stuff> is a placeholder for the BASE-32
encoding of the string on the left side.
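The mapping could be sketched in Python roughly as follows. Several details are my assumptions for the sake of a runnable example and are not fixed by the suggestion above: a name is stored directly only if it is "[namespace:]identifier" made of lowercase letters and digits, everything else (including names in a literal "b32" namespace) is encoded, and the BASE-32 text is written in lower case without the "=" padding.

```python
import base64
import re

# Assumption: only "[namespace:]identifier" in lowercase letters and digits
# is safe to store as-is; everything else is BASE-32 encoded.
_SIMPLE = re.compile(r"(?:([a-z0-9]+):)?[a-z0-9]+")


def stream_to_filename(name):
    """Map a stream name to the name of its revlog data file."""
    m = _SIMPLE.fullmatch(name)
    if m and m.group(1) != "b32":
        return name.replace(":", "_") + ".d"
    encoded = base64.b32encode(name.encode("utf-8")).decode("ascii")
    return "b32_" + encoded.rstrip("=").lower() + ".d"


def filename_to_stream(filename):
    """Inverse of stream_to_filename()."""
    stem = filename[:-len(".d")]
    if stem.startswith("b32_"):
        payload = stem[len("b32_"):].upper()
        payload += "=" * (-len(payload) % 8)  # restore stripped padding
        return base64.b32decode(payload).decode("utf-8")
    return stem.replace("_", ":")


for name in ("hg:data", "plain", "funny_name",
             "urn:uuid:11223344-5566-3353-aabbccddeeff"):
    print(name, "->", stream_to_filename(name))
```

Note that "funny_name" has to be encoded because its underscore would otherwise be read back as a colon.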
Why BASE-32 instead of BASE-64, one might ask?
Because BASE-32 does not use both upper- and lower-case characters in
the encoding it generates, which eliminates problems with filesystems
that do not preserve letter case in file or directory names. (See the
RFC about BASE-32 encoding for more details.)
Anyway, all the above is a mere suggestion to show how streams could be
implemented; I'll be happy to keep it open to discussion.
But I would really be happy to see properties supported by Mercurial
some day, which will also be the day I convert my SVK repositories into
Mercurial!
Currently I cannot use Mercurial because I have lots of symlinks under
version control in SVK.
I am using SVK because it is still the best distributed SCM I have
encountered so far: It has (nearly) all the features of Subversion, but
adds fully disconnected (off-line) operation.
SVK has also some disadvantages. The biggest disadvantage of SVK is its
lack of concise documentation and its largely opaque operation.
It's very obscure. *And* written in Perl. ;-)
Mercurial clearly excels here: All basic data structures (i. e. the
revlog) are well defined and the interconnections between the components
of the data structures (revlog, nodeid, manifest, etc.) are nicely
explained. This is how it should be.
In SVK I do not even fully understand the options and operation modes of
its 3 merge commands... especially in disconnected or mirrored operation.
However, it works.
Somehow.
But I would really prefer Mercurial - if only it could support
properties like symlinks and character conversion attributes.
Greetings from Vienna,
Guenther