Mercurial and very large files...

Marcin Kasperski Marcin.Kasperski at softax.com.pl
Tue Mar 27 11:32:15 UTC 2007


Thank you for your comments. It seems the restriction runs
rather deep.

I will probably work around it by splitting the large files into
smaller parts (in my case this is possible). Nevertheless, I think
addressing the issue at some point in the future could make
sense: it looks like no revision-control tool handles large
files well at the moment. And, leaving my case aside, there are
tasks like sound or video editing...

The rest of this email is just loose discussion; feel free to
ignore it.

> a) diff algorithms can only work efficiently when contents of
> both revisions fit in memory
> (...)
> There's not much that can be done about (a) aside from falling
> back to non-delta storage for large files.

I am probably a bit naive, but large files do not necessarily
mean large changes. Even a stupid heuristic like skipping the
common prefix of the two files and using the normal algorithm
for the rest could do fairly well in many cases. More generally,
I believe that, assuming a limited change size (which is true in
most cases), one could think about an implementation suited to
this usage scenario (if we are not able to find common parts
within, say, 10MB windows, then the new change can be treated as
a total replacement).
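
To make the prefix-skipping heuristic concrete, here is a minimal
sketch in Python. It is only an illustration under my own
assumptions: the names common_prefix_len and
split_at_common_prefix and the 1 MiB block size are hypothetical,
and none of this is Mercurial code. It streams both files block by
block, so only the differing tails ever need to fit in memory.

```python
BLOCK = 1 << 20  # read 1 MiB at a time (arbitrary, assumed value)


def common_prefix_len(path_a, path_b, block=BLOCK):
    """Count the leading bytes shared by two files, streaming both."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(block)
            b = fb.read(block)
            if a == b:
                if not a:  # both files ended together: identical
                    return offset
                offset += len(a)
                continue
            # The blocks differ: locate the first mismatching byte.
            n = min(len(a), len(b))
            for i in range(n):
                if a[i] != b[i]:
                    return offset + i
            return offset + n  # one file is a prefix of the other


def split_at_common_prefix(path_a, path_b, block=BLOCK):
    """Return (prefix_len, tail_a, tail_b).

    Only the tails are loaded into memory; they are what would be
    handed to the normal in-memory diff algorithm.
    """
    n = common_prefix_len(path_a, path_b, block)
    tails = []
    for path in (path_a, path_b):
        with open(path, "rb") as f:
            f.seek(n)
            tails.append(f.read())
    return n, tails[0], tails[1]
```

This helps only when the change sits near the end of the file (an
append, for example); a change near the beginning leaves tails
that are still as large as the files themselves.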

> And meanwhile, everything else assumes files can be read into
> memory in a single chunk, because the delta storage already
> requires it.

Another stupid idea: internally, for the sake of the algorithms,
treat the large file as many small files (for instance bytes
1-50.000.000, bytes 50.000.001-100.000.000, etc). Just silently
split it before adding/committing and join the parts when
updating.

Then changes like appending something or a fixed-size
replacement will be handled perfectly. Changes like
inserting/deleting something will result in similar-size changes
in many files (if I add 3 bytes at position 3437, this algorithm
will add/remove 3 bytes in every successive file), but
nevertheless it would be significantly better than failing
altogether.
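
The behaviour described above can be checked with a small Python
sketch. Again this is only an illustration: iter_chunks and
changed_chunks are hypothetical names, not Mercurial functions,
and the chunk size is a parameter so that the effect can be tried
on small inputs instead of 50.000.000-byte parts.

```python
import io
from itertools import zip_longest


def iter_chunks(fileobj, size):
    """Yield successive fixed-size pieces of a binary file object."""
    while True:
        chunk = fileobj.read(size)
        if not chunk:
            return
        yield chunk


def changed_chunks(old_file, new_file, size):
    """Indices of the chunks that differ between two versions.

    A chunk present in only one version counts as changed.
    """
    pairs = zip_longest(iter_chunks(old_file, size),
                        iter_chunks(new_file, size))
    return [i for i, (a, b) in enumerate(pairs) if a != b]


# Small demonstration: 10000 bytes split into 1000-byte chunks.
old = bytes(i % 256 for i in range(10000))
appended = old + b"tail"
replaced = old[:3437] + b"\xff\xff\xff" + old[3440:]
inserted = old[:3437] + b"abc" + old[3437:]

for name, new in (("append", appended),
                  ("replace", replaced),
                  ("insert", inserted)):
    print(name, changed_chunks(io.BytesIO(old), io.BytesIO(new), 1000))
```

Joining is just concatenating the chunks in order. In the
demonstration the append and the fixed-size replacement each touch
a single chunk, while the insertion touches the chunk containing
position 3437 and every chunk after it.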

Best regards


