Exploring a new data structure for the dirstate
Siddharth Agarwal
sid0 at fb.com
Fri Jan 4 05:10:39 UTC 2013
When I was looking at why Mercurial's add() function was slow, I found
that much of the time (~60%) was spent trying to figure out all the
directories currently tracked in the dirstate. In the past, I was also
able to identify several other suboptimal operations, such as a sort
operation in walk() [1] that takes around 0.2 seconds for ~180k file
entries.
From inspecting dirstate.py, the operations that seem to be important are:
(1) membership testing in the dirstate
(2) sorted traversal of the dirstate
(3) testing that a directory has at least one non-removed file in the
dirstate
(4) iterating over all the files in a directory
These requirements, particularly (3) and (4), indicate that perhaps maps
based on prefix trees (aka tries) [2] would be more appropriate than a
simple dict. I've been testing out a particular variant called crit-bit
trees [3], using a modified version of a C implementation [4] and the
Python/C API.
The results are highly encouraging. So far I've mostly tested loading
the dirstate, and for the same ~180k entries as above loading the
dirstate into a dict with parsers.c:parse_dirstate takes 0.10 seconds.
Reading in a dirstate generated by a dictionary into a crit-bit tree map
takes 0.25 seconds. That sounds pretty bad, but it turns out that
sorting the dirstate helps a lot. Reading a sorted dirstate into a
crit-bit tree map only takes 0.11 seconds. For this relatively trivial
hit, we should be able to get roughly O(1) operations 3 and 4 [5].
The great thing about merely sorting the dirstate is that since the
order of entries in the dirstate file isn't defined, this is both
forward and backward compatible. The moment a dirstate gets modified by
an hg using critbit tree maps, it will be written out in sorted order,
and so any parse-time perf penalties will vanish.
The code is currently not really in a presentable state, but I'm hoping
to get a public repo up very soon. I'm going to pursue this further and
see how far I can push this.
[1] http://selenic.com/repo/hg/file/2c1276825e93/mercurial/dirstate.py#l706
[2] http://en.wikipedia.org/wiki/Trie
[3] http://cr.yp.to/critbit.html
[4] https://github.com/jgehring/critbit89
[5] Technically it's O(length of string), but so is a hash table (since
you need to read the entire string to calculate its hash) and no one
says those are anything but O(1) :)
More information about the Mercurial-devel
mailing list