Integration of Mercurial into LXR

Simon King simon at simonking.org.uk
Fri Dec 14 11:02:04 UTC 2012


[CC-ing the mercurial users list]

On Fri, Dec 14, 2012 at 10:16 AM, andre-littoz <page74010-sf at yahoo.fr> wrote:
> Hello,
>
> First, thanks for your very quick answer. I also apologise for poor
> readability (it looks like Yahoo! Mail has undergone some configuration
> change and I experience bad UI behaviour) and wrong mailing destination.
>
> I tested Simon's suggestion and adapted it a bit (see the attached file).
> However, this does not change performance: I'm still 10 times slower than
> Git or Subversion. From what I understand of the Python code, the full tree
> is retrieved from ctx.manifest() and then exhaustively explored through an
> "iterator" mf.iteritems(). I think this is not fundamentally different from
> the command 'hg locate ...' and explains why performance does not improve.
>

I think that is correct. "hg locate" does use mercurial's
pattern-matching machinery, but I doubt that adds much overhead:

  http://selenic.com/hg/file/34a1a639d835/mercurial/commands.py#l3974

How big is the repository that you are testing this on (number of
revisions and number of files)? In mercurial, the manifest is a single
revlog, so "exploring the whole repository" isn't as expensive as it
sounds. I don't know enough about Git or Subversion's storage models
to suggest why they are so much faster.

> Paul suggested attacking the problem the other way, through change sets.
> I've not yet tried this approach, however I fear that some files might not
> pop up: LXR aims at showing a project snapshot at a given stage in
> development (namely at the time of a tag). This means this state contains
> the modified files for this tag plus the unmodified files. I want the
> latter to show up too (with their change set number so that they can be
> retrieved).

If you are happy to build your own index of the repository, this
should be (relatively) easy. For each revision in the repository, save
the manifest and associated file sizes to a file named after the
revision id. Then when you want to display a revision, you only need
to look at the file you generated for that revision.
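
A rough sketch of such an index, in Python, might look like the following.
The function names, the JSON format and the one-file-per-revision layout are
all assumptions made for illustration. The entries themselves would have to
come from Mercurial (for example from repo[rev] and ctx.manifest()); that
part is not shown, so that the sketch does not depend on any particular
Mercurial version.

```python
import json
import os


def write_snapshot(index_dir, rev_id, entries):
    """Save one revision's {filename: size} mapping to <index_dir>/<rev_id>.json.

    'entries' is assumed to have been collected from the manifest of the
    revision beforehand; how it is collected is outside this sketch.
    """
    os.makedirs(index_dir, exist_ok=True)
    path = os.path.join(index_dir, rev_id + ".json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(entries, fh, sort_keys=True)
    return path


def read_snapshot(index_dir, rev_id):
    """Load the mapping previously saved for rev_id."""
    path = os.path.join(index_dir, rev_id + ".json")
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

The indexing stage would call write_snapshot() once per revision (or once
per tag, if only tagged states are browsed), and the web-serving stage would
only ever call read_snapshot(), without touching the repository at all.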

>
> With a Python extension, I stumbled into the trustworthiness issue. LXR is a
> two-stage process:
> 1- project indexing
> It is done with a standard script by a user in his own environment.
> 2- project browsing and display
> Decorated HTML pages are served by a web server, e.g. Apache.
> Note: directory page display takes a handful of seconds whereas it is nearly
> instantaneous with the other storage backends.
>

Does this handful of seconds include calculating the file size? I can
imagine that being slow, but just listing the filenames should be
fairly quick.

> Considering these two contexts, I placed the [Extensions] configuration
> directives into .hg/hgrc so that they can be accessed both by the script and
> Apache. For quick-and-dirty tests, I chown the .hg directory but this is not
> a solution for public release. LXR may be deployed in environments where the
> LXR advertiser has no administrator privileges and can neither modify
> /etc/mercurial nor chown to an arbitrary user/group. I have not fully
> understood the explanations on the Wiki and I don't think my double access
> issue can be easily solved:
> - public access (read-only mode) to hg repository through web server,
> - user access (also in read-only mode) from LXR site maintainer user to
> cross reference project source.
> (Note: I understand that trust is related to the nature of the extension,
> not to its intended use; it might contain devastating code.)
>

You should be able to solve this by enabling the extension in each
user's ~/.hgrc file, rather than in the repository. If one of the
users doesn't have a home directory (e.g. the "apache" user), you can
set the HGRCPATH environment variable to point to an alternative
location.
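
If LXR calls hg from a script, the variable can be set for that one child
process only. Below is a minimal Python sketch; the helper names and the
path /srv/lxr/hgrc are hypothetical. As far as I know, HGRCPATH replaces the
default search path for the system-wide and per-user configuration files,
while the repository's own .hg/hgrc is still read.

```python
import os
import subprocess


def hg_env(hgrc_path, base_env=None):
    """Return a copy of the environment with HGRCPATH pointing at hgrc_path."""
    env = dict(os.environ if base_env is None else base_env)
    env["HGRCPATH"] = hgrc_path
    return env


def run_hg(args, repo_dir, hgrc_path):
    """Run 'hg <args>' inside repo_dir, using hgrc_path as configuration.

    Requires an 'hg' executable on PATH; raises CalledProcessError when
    the command fails. Returns the command's standard output as bytes.
    """
    completed = subprocess.run(
        ["hg"] + list(args),
        cwd=repo_dir,
        env=hg_env(hgrc_path),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return completed.stdout
```

For example, run_hg(["locate"], "/path/to/repo", "/srv/lxr/hgrc") would list
the tracked files with whatever extensions that hgrc file enables, for the
indexing user and the web server user alike.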


> This trust issue makes me prefer a solution without extension. Basically,
> the problem is to limit 'hg locate' to one-level deep retrieval with
> directories.
>
> Does ctx.manifest() have any defined structure? I mean some architecture,
> knowledge of which could be exploited to limit traversal. Of course, this
> structure would have to be guaranteed invariant across Mercurial upgrades.
>

The manifest is the "index" of the repository. For each revision in
the repository, it stores a mapping of filename to file version. It is
stored in revlog format, which is an efficient way of storing multiple
versions of a file. The best description of mercurial's storage is at
http://mercurial.selenic.com/wiki/Presentations?action=AttachFile&do=view&target=ols-mercurial-paper.pdf.
However, I very much doubt that you will be able to improve on the
performance of "hg locate".
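
Once the flat list of filenames is in memory, reducing it to a one-level
directory listing is straightforward. Here is a minimal Python sketch; the
function name is made up, and the input is assumed to be the '/'-separated
paths that the manifest or "hg locate" reports. It still walks every path,
so it does not avoid reading the full manifest.

```python
def list_directory(paths, directory=""):
    """Return (subdirs, files) found directly under 'directory'.

    'paths' is an iterable of repository-relative file paths using '/'
    as separator; 'directory' is "" for the repository root.
    """
    prefix = directory.strip("/")
    if prefix:
        prefix += "/"
    subdirs, files = set(), set()
    for path in paths:
        if not path.startswith(prefix):
            continue
        rest = path[len(prefix):]
        head, sep, _ = rest.partition("/")
        if sep:
            # Something lies below 'head', so it is a subdirectory.
            subdirs.add(head)
        else:
            files.add(head)
    return sorted(subdirs), sorted(files)
```

Given the paths README, src/main.c and src/lib/util.c, the root listing
consists of the subdirectory src and the file README, and the listing for
"src" consists of the subdirectory lib and the file main.c.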

> I also noted the repo[rev] array. Is it sorted by filename (either as a
> traditional array or an associative array)? Bisection algorithms could then
> be used, though they do not solve the one-level-deep traversal problem. Is
> it smaller than ctx.manifest()?

repo[rev] returns a "changectx" object which represents the state of
the repository at a given revision. ctx.manifest() then gives you the
manifest (i.e. list of filenames and file revisions) in that revision
of the repository. I don't think bisection will help you here.

>
> Best regards,
> André Littoz
> LXR

Regards,

Simon


