Proposed new "big file" extension
Greg Ward
greg-hg at gerg.ca
Mon Oct 5 19:45:34 UTC 2009
On Sun, Oct 4, 2009 at 6:27 PM, Chad Dombrova <chadrik at gmail.com> wrote:
>> So part of my secret covert goal with bfiles is to turn 30 MB mistakes
>> into 40 byte mistakes (the "standin" file containing the revision
>> hash). Once you no longer need the 30 MB file in every checkout, you
>> delete the 40 byte standin. The only price you pay is carrying around
>> the history of that 40 byte standin file forever. Big deal.
>
> I agree that pruning history is a requirement. If I were to implement
> my idea, we would have to add a command like Perforce's obliterate,
> which would replace a snapshot with a stub file. This obliteration
> would be fairly safe, because we would be storing snapshots and not
> deltas.
With bfiles, "obliterate" is possible: just go into the remote store
and start deleting revisions. But it means that your repository will
no longer pass "hg bfverify --all". If that matters to you, don't
delete old revisions. If you don't care about passing bfverify,
delete at will. I have no intention of implementing an obliterate
command, though.
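To make that trade-off concrete, here is a sketch of the kind of consistency check bfverify performs: every standin hash ever recorded must have a matching file in the store, so manually deleted revisions show up as missing. (The function name and input format are made up for illustration; this is not the extension's actual code.)

```python
import os

def verify_store(store_dir, standin_hashes):
    """Return (filename, hash) pairs whose revision file is missing
    from the store. `standin_hashes` is a hypothetical mapping of
    big-file name -> list of 40-char hashes recorded in history.
    Obliterated revisions appear in the returned list."""
    missing = []
    for filename, hashes in standin_hashes.items():
        for h in hashes:
            if not os.path.exists(os.path.join(store_dir, filename, h)):
                missing.append((filename, h))
    return missing
```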
Hmmm: I think I just figured out where your
store-big-files-in-a-revlog idea belongs: in the remote store. That
is, I've implemented bfiles so that the central store is a tree of
file revisions:
store/bigfile1/0e50052b66247471d322d4f85fc3a3fc89519a15
store/bigfile1/a7dba1a5cc6e146da083ef96ac293df2c51cd4d8
store/bigfile2/347009b9dbd731f0ab93705d410e97b0d1c382ea
...
That's very simple and easy to manage. (In particular, manual
obliteration of unwanted revisions is just an "rm" command.) But if
the big files in question are amenable to deltas and/or compression,
it's rather wasteful. It might be possible to pack 'em into revlogs
and save some space. But that would *not* be a classic Mercurial
filelog; it would be a special type of revlog dedicated to server-side
storage of big files. And it would make obliteration harder and
slower.
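For reference, the flat layout shown above reduces to a one-line mapping from file contents to store path, assuming the 40-character name is the SHA-1 of the contents (consistent with the 40-byte standins mentioned earlier; the function name is made up):

```python
import hashlib
import os

def store_path(store_dir, filename, data):
    """Map a big file's contents to its revision path in the store.
    Assumes the hash is SHA-1 of the raw contents, which matches the
    40-byte standin files described above."""
    h = hashlib.sha1(data).hexdigest()
    return os.path.join(store_dir, filename, h)
```

With this layout, obliteration really is just removing the file that `store_path` points at.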
A simpler space optimization might be to gzip files in the remote store.
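A rough sketch of that optimization: compress each store revision in place, leaving it to the transfer side (e.g. bfget) to decompress transparently. (Function name is illustrative only.)

```python
import gzip
import os
import shutil

def gzip_store_file(path):
    """Compress one store revision in place (path -> path + '.gz')
    and remove the uncompressed original. A sketch of the simple
    space optimization suggested above, not actual bfiles code."""
    with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)
    return path + ".gz"
```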
> One of my requirements is transparency. For example, once a file is
> tracked by big files, why not have push call bfput in the background,
> or why not have status show big files by default? It seems that many
> of the big files commands have existing analogues:
>
> push - bfput
> pull/update - bfget
> commit - bfrefresh
>
> IIUC, in your current design, the user must be aware each time a big
> file changes, and manually bfrefresh it.
Correct. That's the base layer of bfiles. I plan to implement
another layer that integrates tightly with core Mercurial commands.
That layer will be optional for two reasons:
* sometimes you need to interact with bfiles directly (e.g.
troubleshooting, or if you only want to download certain big files) --
I want to make sure the base layer works well on its own
* tight integration will modify Mercurial in rather evil ways (e.g.
commit will modify files, update will require network access)
I want people to try out bfiles in its current (non-integrated) form
before I go to the trouble of implementing tight integration.
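To make the integration idea concrete, here is a hypothetical core of an automatic bfrefresh, the kind of thing a tight integration layer could run before commit: recompute the big file's hash and rewrite its 40-byte standin only if it changed. (All names here are made up; this is a sketch, not the extension's code.)

```python
import hashlib

def refresh_standin(bigfile_path, standin_path):
    """Recompute the big file's SHA-1 and rewrite the standin file
    if it is stale. Returns True if the standin was updated. A tight
    integration layer could call this from a commit wrapper, which is
    exactly the 'commit will modify files' evil mentioned above."""
    with open(bigfile_path, "rb") as f:
        new_hash = hashlib.sha1(f.read()).hexdigest()
    try:
        with open(standin_path) as f:
            old_hash = f.read().strip()
    except FileNotFoundError:
        old_hash = None
    if new_hash != old_hash:
        with open(standin_path, "w") as f:
            f.write(new_hash)
        return True
    return False
```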
>> Hmmm. It sounds to me like you are thinking of having your "remote
>> store" and all clones on the same machine. To me, that's a minor use
>> case that's not worth optimizing.
>
> Well, for us, this is the use case. I am looking into mercurial and
> git to revision control a lot of data, at least half of which is large
> binary assets, and all of which will be revision controlled on a
> large, centralized server. The repos, each containing gigabytes worth
> of data, will be cloned hundreds of times to provide different users/
> teams access to read-only working copies of each other's work, so it is
> essential that there is no redundancy in either the stores or the
> working copies.
Interesting. bfiles as it currently stands will waste huge amounts of
disk space in your case. But it should be possible to nudge it in the
right direction: 1) implement a local cache, 2) use hard links between
the local cache and the working copies *with* read-only files. (I
think that's a sensible compromise between saving disk space,
convenience, and preventing accidental corruption.) You could save
even more space by hard linking from the not-so-remote store to the
local cache, but I'd be very leery of doing that. Too much chance to
corrupt your store.
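A sketch of steps 1 and 2 above, with made-up function names: hard-link the working-copy big file to the local cache, then drop write permission so an accidental in-place edit cannot silently corrupt the shared inode.

```python
import os
import stat

def link_from_cache(cache_path, wc_path):
    """Populate a working-copy big file as a read-only hard link to
    the local cache. Both names share one inode, so the data is stored
    once; chmod removes write permission from that inode, which is the
    'read-only files' compromise described above."""
    if os.path.exists(wc_path):
        os.remove(wc_path)
    os.link(cache_path, wc_path)
    # 0o444: readable by everyone, writable by no one.
    os.chmod(wc_path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
```

Note that because hard links share one inode, the cache copy also becomes read-only; a real implementation would break the link (copy, then replace) before letting the user edit the file.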
BTW, do you need locking of big files? I gather that's one of
Subversion's advantages over Mercurial in this area. Not sure how to
implement it offhand, but I'd be happy to shoot ideas back and forth.
Anyways, I think bfiles has a very good chance of working for you with
a little more work. I look forward to your patches to implement
caching and read-only hard links. ;-)
Greg
More information about the Mercurial mailing list