[PATCH 2 of 5] adjustlinkrev: use C implementation to test ancestor if possible

Pierre-Yves David pierre-yves.david at ens-lyon.org
Thu Nov 10 17:22:32 UTC 2016



On 11/10/2016 04:34 PM, Jun Wu wrote:
> Excerpts from Pierre-Yves David's message of 2016-11-10 15:02:26 +0000:
>> Plan page does not needs to be very complex,
>>
>> So you current plan is to have this "(path, fnode) → [linkrev]" mapping
>> stored into a sqlite database and update it on the fly when needed. What
>> were your rational to pick this over a more complete cache of
>> (changerev, path, filenode) exchanged over the wire ?
>>
>> Out of curiosity, do you have speedup number for your current experiment ?
>
> The experiment is still ongoing so I don't have data or clear plans yet.
> Things can change - for example, I may want to replace sqlite with gdbm.
> (So I'm still not motivated to write such a subject-to-change plan page)
>
>
> On "on demand":
>
> The motivation of the linkrev cache is to speed up building fastannotate
> cache, which essentially goes through every file in the repo and does a
> massive adjustlinkrevs. So it's better to have an offline command like
> "debugbuildlinkrevcache" to build the cache, in a more efficient way. I'm
> less concerned about the on-demand use-case - Ideally I want to make sure
> the linkrev database is *always* up-to-date, i.e. it contains all possible
> candidate linkrevs for every file node so changelog.d walk becomes
> impossible and we can have another optimization: if there is only one
> candidate linkrevs, just use it, skip the changelog.i ancestor test.
>
> That said, "on demand" could be an optional feature, it's not hard to
> implement after all.

Now I'm confused I assumed your database was on demand but is actually 
command triggered, so how is your cache updated as the repo move forward?
(I admit I did not read your paste  in full depth and did not saw 
details about that in the initial doc)

> On "(changerev, path, filenode)":
>
> Note that "filenode" is redundant because "(changerev, path)" can decide
> the filenode.

Yes, but with a manifest look up so having it on the cache is more cachy ;-)

> I believe a traditional database for "(changerev, path) -> linkrev" is not
> the way to go as it could be extremely inefficient on space usage with an
> offline command to build it. (think about storing manifests un-delta-ed)

The "path" part does not necessarly have to be in the database, if we 
have a per-filelog-cache alongside the filelog, that would not be taking 
extra space.

In addition, we can certain have cache that only contains entry for 
changeset touching a file. given the pattern of filelog traversal this 
is probably enough to boost adjust linkrev massively,

> I didn't say the current approach I'm taking is the unique / best way to
> address the adjustlinkrev perf problem. There is possibility that we can
> have O(1)-like adjustlinkrev with cost of space usage like summing up all .i
> files.
>
> That approach is to make linkrev unique. Since we cannot BC the hash
> function, we can store new file nodes which takes the commit hash as a salt
> (and new manifests) in parallel. If we use tree-manifest, the space usage
> will be somehow feasible. I didn't go this way because it's more complex
> (may require invent some new formats) and depends on tree manifest, which is
> still "experimental" internally.

I really don't think having to slat all hash because of some internal 
implementation details is a good way to go, salting all filenode would 
have serious impact, especially when commit rewrite comes into play.

> On "exchange":
>
> The current data ((path, fnode) -> [linkrev]) map could also be exchanged -
> but there is no immediate plan for that right now.
>
>>> But that does not speed up changelog.d walk. And I doubt that the use of
>>> _ancestrycontext could lead to incorrect results in some cases if they are
>>> the same object for the chain.
>>
>> The changelog.d walk is about looking for modified file, right ?
>
> Yes. The changelog.d walk happens if none of the linkrev candidates is an
> ancestor of "srcrev". It's basically what git does all the time (one of the
> key reasons that git does not scale currently imo).

Goes git have a "file touched" field too? or do they need to look at the 
manifest/tree too?

-- 
Pierre-Yves David



More information about the Mercurial-devel mailing list