IBLT-based discovery
Joerg Sonnenberger
joerg at bec.de
Tue Feb 22 23:57:00 UTC 2022
Hello all,
TL;DR: A prototype covering incoming/pull can be found at the end, after
the hyphen line.
This mail is about a prototype for a discovery mechanism based on
Invertible Bloom Lookup Tables (IBLT). An IBLT is a generalized form of
a Bloom Filter and can be used to efficiently compute set differences.
The only requirement is that the set items are canonical and have the
same size, e.g. a cryptographic hash over a record can be used as
identifier. It is efficient in the following senses:
(1) The data to be exchanged between two parties scales linearly with
the size of the difference (i.e. O(|A\B| + |B\A|) given sets A and B).
(2) The number of round trips is expected to be O(1), normally one for
the estimation step and a second for the actual exchange. Depending on
the details, both steps can be part of one round trip in opposite
directions.
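To make the data structure concrete, here is a minimal, self-contained
IBLT sketch in Python. This is not the prototype's code; the cell layout
(count, XOR of keys, XOR of key checksums), the partitioned subtables and
the SHA-256-derived hash functions are all assumptions for illustration.
Decoding works by repeatedly "peeling" pure cells, i.e. cells that hold
exactly one key:

```python
import hashlib

K = 3  # hash functions per key (a random 3-graph)

def _indices(key: bytes, m: int):
    # one cell per subtable; partitioning guarantees K distinct cells
    h = hashlib.sha256(key).digest()
    return [i * m + int.from_bytes(h[8 * i:8 * i + 8], "big") % m
            for i in range(K)]

def _check(key: bytes) -> int:
    # per-key checksum used to recognize cells holding exactly one key
    return int.from_bytes(hashlib.sha256(b"chk" + key).digest()[:8], "big")

class IBLT:
    def __init__(self, m: int):
        self.m = m                   # cells per subtable
        self.count = [0] * (K * m)
        self.keysum = [0] * (K * m)  # XOR of 8-byte keys, as integers
        self.chksum = [0] * (K * m)

    def _toggle(self, key: bytes, delta: int):
        k = int.from_bytes(key, "big")
        for i in _indices(key, self.m):
            self.count[i] += delta
            self.keysum[i] ^= k
            self.chksum[i] ^= _check(key)

    def insert(self, key: bytes):
        self._toggle(key, +1)

    def subtract(self, other: "IBLT") -> "IBLT":
        # cell-wise difference; shared keys cancel, so the result
        # encodes only the symmetric difference of the two sets
        out = IBLT(self.m)
        out.count = [a - b for a, b in zip(self.count, other.count)]
        out.keysum = [a ^ b for a, b in zip(self.keysum, other.keysum)]
        out.chksum = [a ^ b for a, b in zip(self.chksum, other.chksum)]
        return out

    def decode(self):
        # peel pure cells (count == +/-1, checksum matches) until stuck
        mine, theirs = set(), set()
        progress = True
        while progress:
            progress = False
            for i in range(K * self.m):
                if self.count[i] in (1, -1):
                    key = self.keysum[i].to_bytes(8, "big")
                    if self.chksum[i] == _check(key):
                        (mine if self.count[i] == 1 else theirs).add(key)
                        self._toggle(key, -self.count[i])
                        progress = True
        return mine, theirs
```

Two parties each insert their keys into an IBLT of the same size and
hash choices; one side subtracts the received table from its own and
decodes to learn both halves of the difference, regardless of how large
the sets themselves are.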
For the changeset discovery, this approach is interesting because it
works independently of the structure of the commit graph as long as they
are similar enough. This is different from the currently used sampling
protocol that scales based on the total and disjoint number of DAG
heads. This property also makes it attractive for exchanging additional
data attached to commits like changeset signatures or provenance records.
The expected O(1) derives from the probabilistic nature of the
algorithm, e.g. from properties of random 3-graphs. The same principles
are found in modern construction mechanisms for perfect hash tables and
error-correcting codes. The success rate increases with size, and
practical simulations of the hypergraphs show that it works almost
always, even for moderate sizes in the thousand-entries range.
Based on the natural and justified choices of (1) deterministically
keyed hash functions, (2) fixed size categories with exponential growth
and (3) retries with the next larger size as the normal handling for
failure cases, the data structure can be pre-computed and easily updated
incrementally. Without trying for optimal encodings, the space
requirement would be O(d * log d), with d being the maximum allowed
difference size, and the update cost O(log d) when adding a new key.
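The size-category scheme can be sketched as follows. The base size and
power-of-two growth are assumed tuning constants, not the prototype's
actual values; the point is that a new key touches every precomputed
class, giving O(log d) updates, and a decode failure simply retries with
the next larger class:

```python
import math

BASE = 64  # smallest size class; an assumed tuning constant

def size_class(estimated_diff: int) -> int:
    """Smallest power-of-two size class expected to hold the difference.

    On decode failure, the caller retries with the next larger class."""
    k = max(0, math.ceil(math.log2(max(1, estimated_diff) / BASE)))
    return BASE << k

def size_classes(max_diff: int):
    """All precomputed classes up to the maximum allowed difference.

    Adding a key means inserting it into each of these tables, i.e.
    O(log d) table updates per new key."""
    sizes = []
    s = BASE
    while s <= size_class(max_diff):
        sizes.append(s)
        s *= 2
    return sizes
```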
-------
A working prototype as alternative to the current set discovery algorithm
can be found in https://www.netbsd.org/~joerg/iblt-exchange.diff
There are a couple of hacks and short cuts in the code, but it works
well enough to illustrate the point. It is written to be minimally
intrusive. For example, there is no cache (in)validation or incremental
update implemented.
From a protocol perspective, it is also much more inefficient than
necessary. It reuses the current getbundle wire command. For this, it
needs to compute (or at least approximate a superset of) the common heads
and the remote sets. This is done by using both the node id of the commit
and its parent nodes as key, so it triples the key size and therefore
the exchanged data. A better but more intrusive approach would replace
the parameters of getbundle directly with the IBLT.
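The key construction described above might look roughly like this. This
is a hypothetical sketch, not the prototype's actual encoding: the idea
is that concatenating the node id with both parent ids lets the receiver
recover DAG edges from the decoded difference, at three times the
per-key cost:

```python
NULLID = b"\x00" * 20  # Mercurial's null revision id

def discovery_key(node: bytes, p1: bytes, p2: bytes) -> bytes:
    # hypothetical layout: 60 bytes instead of 20, so the decoded
    # difference also carries the parent links needed to approximate
    # the common heads
    assert len(node) == len(p1) == len(p2) == 20
    return node + p1 + p2
```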
Test case is my local pkgsrc repository with ~776k commits and a clone
without my local changesets. Incoming shows 71 changes. Test uses ssh
over localhost. All values are best of three runs.
                    New protocol    Old protocol
Run time                   2.40s           2.77s
Data to server             2,784          24,338
Data from server      19,218,995      19,181,398
For the no-change case of incoming against the same repository:
                    New protocol    Old protocol
Run time                   0.62s           0.62s
Data to server               544           9,965
Data from server          38,796          11,028
For reference, the estimator step uses 35,308 bytes and results in an
IBLT of 181 entries for the first case and 32 entries for the second
case, with sizes of 11,958 bytes and 2,124 bytes respectively.
Joerg
More information about the Mercurial-devel
mailing list