git-annex vs largefiles (was: Re: Mercurial popularity is stagnant)

Michael McNeil Forbes michael.forbes at gmail.com
Tue Jul 1 23:05:11 UTC 2014


On Jul 1, 2014, at 3:54 PM, Augie Fackler <raf at durin42.com> wrote:
> On Jul 1, 2014, at 5:31 PM, Michael McNeil Forbes <michael.forbes at gmail.com> wrote:
>> The big advantage of git-annex is the files can be distributed: you do not need
>> to have all of the files in a single location.  It keeps track of where the
>> various files are.  Thus, you can check out the source repository on your laptop
>> and then get only a few of the data files associated with a run for analysis.
>> The killer feature is that git-annex keeps track of where the various files are
>> located, making sure that you keep at least n reliable copies somewhere, but
>> not demanding that all of the files be kept in any one specific place.
> 
> I'm unfamiliar with git-annex, so please bear with me. Doesn't there have to be a place where all the data files are present? That is, don't you necessarily have to have some central place where all the files exist for reliability purposes?

No.  This is what is so interesting: git-annex keeps track of where the files are, so you can have some files on your laptop, some on a USB drive, some on Amazon, Google Drive, etc.  You tell git-annex how reliable each of these storage locations is and how many reliable copies you need, and it takes care of making sure those constraints are met.  If there are plenty of copies of a given file elsewhere, you can "git annex drop" that file, and it will be removed from your local computer, freeing disk space.  If there are not enough copies of that file in the other repositories, git-annex will refuse to drop it and tell you that this is an important copy.
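
To make the drop rule concrete, here is a toy model in Python.  It is only a sketch of the idea described above, not how git-annex is actually implemented: the Repo class, the boolean "trusted" flag, the try_drop function, and the file and repository names are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    """A storage location (hypothetical model, not a git-annex data structure)."""
    name: str
    trusted: bool = True  # assumption: a simple yes/no notion of "reliable"
    files: set = field(default_factory=set)


def reliable_copies(filename, repos):
    """Return the trusted repos that currently hold filename."""
    return [r for r in repos if r.trusted and filename in r.files]


def try_drop(filename, local, repos, numcopies):
    """Remove filename from local only if at least numcopies reliable
    copies would still exist in the other repos.  Returns True on success."""
    if filename not in local.files:
        return False
    others = [r for r in reliable_copies(filename, repos) if r is not local]
    if len(others) < numcopies:
        return False  # this is an important copy, so refuse to drop it
    local.files.discard(filename)
    return True


# Example: two data files spread over three locations.
laptop = Repo("laptop", files={"run1.dat", "run2.dat"})
usb = Repo("usb", files={"run1.dat"})
scratch = Repo("scratch", trusted=False, files={"run1.dat", "run2.dat"})
repos = [laptop, usb, scratch]

dropped_run1 = try_drop("run1.dat", laptop, repos, numcopies=1)  # usb has a reliable copy
dropped_run2 = try_drop("run2.dat", laptop, repos, numcopies=1)  # only the unreliable scratch copy
```

In this model the first drop succeeds because the USB drive holds a reliable copy, while the second is refused because the only other copy sits on storage marked unreliable.  Note that no single location ever needs to hold every file.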

Only the small git repo with the metadata about where the files are located is shared everywhere: the actual files can be distributed as you like/need.

> It sounds like git-annex adds some limited sparse checkout capabilities on top of largefiles. You may be interested in the work Durham at Facebook has been doing: https://bitbucket.org/DurhamG/hg/src/de9df54e2667d80f78b15e7b754033b550b6f1a8/hgext/sparse.py

Yes, but it is integrated into the system from the ground up so there is no need for a complete central repository to exist anywhere.

Michael.




