[PATCH 1 of 3] pull: add --subrepos flag
Angel Ezquerra
angel.ezquerra at gmail.com
Thu Feb 28 17:02:00 UTC 2013
On Sun, Feb 24, 2013 at 9:20 AM, Matt Harbison <matt_harbison at yahoo.com> wrote:
> Angel Ezquerra wrote:
>>
>> On Thu, Feb 21, 2013 at 5:55 AM, Matt Harbison<matt_harbison at yahoo.com>
>> wrote:
>>>
>>> Angel Ezquerra wrote:
>>>>
>>>> On Wed, Feb 20, 2013 at 6:57 AM, Matt
>>>> Harbison<matt_harbison at yahoo.com> wrote:
>>>>>
>>>>> On Sun, 17 Feb 2013 13:19:16 +0100, Angel Ezquerra wrote:
>>>>>
>>>>>> # HG changeset patch # User Angel
>>>>>> Ezquerra<angel.ezquerra at gmail.com> # Date 1360519226 -3600 # Node
>>>>>> ID abbd26cca35280fb8f784b3f2c02eef71696c47b # Parent
>>>>>> 55b9b294b7544a6a144f627f71f4b770907d5a98 pull: add --subrepos
>>>>>> flag
>>>>>>
>>>>>> The purpose of this new flag is to ensure that you are able to
>>>>>> update to any incoming revision without requiring any network
>>>>>> access. The idea is to make sure that the repository is
>>>>>> self-contained after doing hg pull --subrepos, as long as it
>>>>>> already was self-contained before the pull).
>>>>>>
>>>>>> When the --subrepos flag is enabled, pull will also pull (or
>>>>>> clone) all subrepos that are present on the current revision and
>>>>>> those that are referenced by any of the incoming revisions.
>>>>>
>>>>> I haven't gotten a chance to really play with this yet, so I'm
>>>>> going more off the comments here- I apologize if these answers
>>>>> should be obvious, but I'm not familiar enough with some of the
>>>>> code.
>>>>>
>>>>> - Is there an easy way to tell if the repo is/was self contained?
>>>>> (Maybe incoming -S?)
>>>>
>>>>
>>>> No there is not. I don't think incoming -S would do the trick since
>>>> that would just tell you if there are _new_ incoming revisions on
>>>> some of the _current_ subrepos. A repo is "self-contained" if it is
>>>> possible to update to any of its revisions withing requiring a pull
>>>> of one or more of its subrepos.
>>>>
>>>> I don't know of any existing mercurial command that would be able to
>>>> give you that information.
>>>>
>>>>> - Is the 'self-contained' bit to limit overhead on each pull, or is
>>>>> there another reason this can't ensure the result is self
>>>>> contained? 'Push' and 'outgoing -S' recognize (almost) everything
>>>>> going in the other direction, so it might be nice to have the same
>>>>> capability with a form of pull. (I may have found a push bug that
>>>>> I haven't gotten back to yet.)
>>>>
>>>>
>>>> I'm not sure I understand what you mean.
>>>
>>>
>>> Consider this (contrived) case:
>>>
>>> 1) push a repo and subrepo to remote (remote is now self contained)
>>> 2) strip or rollback the remote subrepo
>>> 3) a top level local repo push will repopulate the remote subrepo
>>>
>>> Now reverse it:
>>>
>>> 1) pull a repo and subrepo from remote (local is self contained)
>>> 2) strip or rollback the local subrepo
>>> 3) nothing is incoming top level, so the subrepo isn't repopulated (if
>>> you've updated to a working dir without that subrepo).
>>>
>>> I'm not sure if there's a less contrived case, or if this matters too
>>> much. I guess I was just wondering aloud about the symmetry between
>>> push and pull -S (this is certainly much better than it was).
>>
>>
>> I don't think this is a very big deal because "pull --subrepos" also
>> performs a pull on all the repositories that are found on the current
>> revision. I think this would fix your contrived case.
>
>>
>>
>> The only corner case that would not be properly handled by this would
>> be the case in which pull --subrepos cloned a _new_ subrepo that is
>> not found on the current revision (i.e. that is introduced on one of
>> the cloned revisions) and then you striped some revisions from that
>> "future subrepo". In that case would would still be able to get the
>> missing revisions as you would do today, i.e. by updating to the
>> offending revision. You could also use the onsub extension (which is
>> shipped with TortoiseHg).
>>
>> BTW, the more I use the onsub extension the more I am convinced that
>> it should be distributed with mercurial, and even that the onsub
>> command itself should become a built-in command.
>
>
> I'd be fine with that. I haven't used it, but from what I've read, it seems
> like a useful crutch to do things that aren't currently possible because
> they aren't well defined, like tagging all repos or opening a branch. But
> ideally there wouldn't be a need for it with commands that accept a -S. The
> principle of least surprise says those commands can handle subrepos already.
The cool thing about the onsub command is that it lets you operate on
subrepos without operating on the parent repo, which often is what you
want. Using the -S option on mercurial commands does not work the same
because it (usually) will also perform an action on the parent repo.
>> That being said, if we did want to also handle this very small corner
>> case there are at least a couple of possible solutions:
>>
>> 1. make the --subrepos flag take a revset that would tell it which
>> revisions to look for subrepos to pull.
>
>
> Why not use the -r option? But really, I think in this case 'pull -S' ==
> 'pull -S -r all()', so I'm not sure if that helps.
I think it could help a little. We must just make sure that 'pull -S
-r .' does what I would expect, which is to just pull all current
subrepos.
>> 2. make --subrepos also look for subrepos on the descendants of the
>> current revision.
>>
>> #1 would be nice _if_ we could also skip the revset to get the
>> behavior that the current patch proposes. However I don't think that
>> is possible to have a flag that can optionally take a parameter.
>
> Agreed. A special case parameter isn't nice.
I guess if this really became an issue we could think about an
--all-subrepos options, as in large files, but I don't really think it
will be necessary. Also perhaps using pull -r 0:tip would work. I'll
have to check.
>> #2... maybe that is too much overhead for such a small corner case?
>
> Not sure.. That's why I was asking if there's large repos to benchmark on
> :-). The other thing I'm wondering is, since we seem to have various caches,
> maybe caching subrepo info would speed things up? I have no idea what form
> this would take, or even how the existing caches are invalidated, detect
> corruption, etc. The construct I've got in my push patches is:
>
> f = repo.wjoin('.hgsubstate')
> revs = repo.revs('(adds(%s) or modifies(%s)) and ::(%ln)', f, f, revs)
>
> I'm not sure if this helps cut down overhead- you do have to traverse the
> whole repo, but you don't have to process the .hgsubstate file for revisions
> where the subrepos don't change. (Though you referenced having the latest
> path for the repo, and that can change without affecting .hgsubstate, so
> this may not be useful anyway.)
I think the code could easily check if the parent repo refers to a
revision that is missing on the subrepo and then only pull those
subrepos that were missing revisions (or new subrepos). However this
would make it hard to just pull all subrepos, which you need to do
when you want to move a subrepo forward (i.e. select a newer revision
on a subrepo).
My ideal pull behavior would be for pull without --subrepos to
actually perform this check and pull the minimum amount of revisions
needed to make the current parent repo revision consistent. Adding
--subrepos would pull all current and new subrepos (as this patch
proposes), regardless of whether they have changed.
That being said, I am not proposing (yet) to change the current pull
(without --subrepos) behavior. I first want to add the --subrepos flag
and make it do something useful.
>> Additionally, I wonder if it would make sense to change the current
>> pull behavior a little, by checking if any of the subrepos that are
>> present on the _current_ parent revision point to a revision that does
>> not exist, and if that is the case fulling from them _up to that
>> revision_? I think the parent pull time is an excellent time to do
>> this kind of stuff since you already know that you have an internet
>> connection...
>
>
> Can you show a short test case or pseudo test case of this? I'm not sure
> what you're suggesting here. Wouldn't the existing pull bring in all of the
> changes to the tip of each subrepo now? Why back off on that without a -r
> option?
Umm, what I had in mind was the following scenario:
1. You pull a repo that has some subrepos.
2. You update to tip, which pulls all subrepos.
3. You enter into one of the subrepos
4. You strip some revisions on that subrepo.
How do you get those missing revisions back?
However I've been thinking about this and it is not really necessary.
You can just do "hg update --clean -r ." and that will pull from all
subrepos that are missing revisions, which is what I wanted.
So, never mind... Just forget about that :-)
>>> I realize you can't do such a thing without crawling most of the
>>> history. Are there large public repositories that use subrepos? I'm
>>> wondering what the performance hit would be. (It's easy for me to think
>>> something is a good idea when I only have small repos and wouldn't
>>> notice the hit.)
>>
>>
>> We have repositories with thousands of files and more than 20
>> subrepos. I don't know if you would consider that big...
>
>
> Not sure. I'm assuming that the number of commits is the governing factor
> when walking the history? But that's almost certainly going to be larger
> than the toy repos I make, or even the real ones I use at work (still less
> than 1000 in the subrepo, and probably less than 200 on the parent). Just
> looking for something to run benchmarks on.
I think fully pulling all subrepos when you explicitly ask to by using
the --subrepos flag would be acceptable in any case.
>>> FWIW, largefiles works the same way- if you don't clone or pull with
>>> --all-largefiles, there's no single command to go back and get the files
>>> for all revisions that are not incoming. That leaves the user wondering
>>> if they really can disconnect from the central repo.
>>
>>
>> That is true. It would be nice to be able to do that. However there is
>> a difference between largefiles and subrepos. Largefiles tries to only
>> get the files that you need, when you need them _by design_. Subrepos
>> is not meant (IMHO) to do that (or at least it is not necessary for
>> subrepos to fulfill its main purpose, which is to track "submodules").
>
>
> I agree with you about subrepos, but the --all-largefiles flag is a new(ish)
> feature that lets you override that original largefiles design, so you can
> disconnect from the central repo. (So a very similar motivation for this.)
> I was going to put in a command to download missing largefiles, but never
> got around to it. But I don't think a new command would be an option for
> subrepos if we don't get this right, because it really is just a pull. If
> we can change the pull command in the future if there's a need to handle
> this, then it may not be that big of a deal now. And I suspect the only
> hinderance to adding it later will be if there's a big performance drop.
If this became a problem we could think about adding some other flag
in the future, or perhaps integrating the onsub extension into core?
>>>> I don't think you (we?) must give too much importance to this
>>>> "self-contained" concept. It is just a way for me to explain the
>>>> purpose of the patch, and specially to explain why we must look for
>>>> subrepos on all the new incoming revisions, and why we cannot just
>>>> limit ourselves to pulling the subrepos on the current revisions
>>>> (short answer: because new subrepos may appear on the new, incoming
>>>> revisions).
>>>>
>>>> My patch explicitly says that hg pull -S will only make your subrepo
>>>> self-contained if it was already self-contained before. This is in
>>>> order to avoid having to look for subrepos on all the repo history,
>>>> rather than just looking for subrepos on the incoming revision (and
>>>> the current one).
>>>>
>>>>> - The full subrepo gets pulled, even revs not committed to the
>>>>> parent? I think that's a good thing, because regularly get burned
>>>>> when I 'pull -u' the tree to another machine and then go to apply
>>>>> the rest of a patch queue to the subrepo.
>>>>
>>>>
>>>> Yes. It is perhaps not optimal but I think it is simpler. In
>>>> addition if different parent repo revisions point to different
>>>> revisions on a subrepo there is no way for us to tell which of those
>>>> subrepo revisions is the one that is closes to tip, or which ones
>>>> are ancestors of the other ones, etc. As a result we would need to
>>>> perform as many pulls on a given repo as the number of different
>>>> revisions of that subrepo that were referenced on the parent repo.
>>>> That is complex and slow, so it is much simpler and possibly faster
>>>> (in some cases at least) to just pull all revisions from each
>>>> subrepo.
>>>
>>>
>>> OK, I misread the code- I thought each subrepo was getting a pull at
>>> each revision, which I figured would be slow. I attached a test patch
>>> below- there's nothing special about it, but it helped me with my pull
>>> and outgoing changes (some comments probably still reflect this), so I
>>> changed that to pull and incoming to test your patch.
>>>
>>> - I think I see a double pull of a subrepo (search for "hg pull -S -r
>>> 3").
>>
>>
>> You are right. There is a problem with the current version of the
>> patch. It will clone the same repo multiple times if the same subrepo
>> refers to different revisions. I will resend with a fix.
>
>>
>>
>> BTW, where do you think the tests for this new feature should go?
>> Should it go into one of the test-pull*.t files or in one of the
>> test-subrepo*.t files? Do you think the test that you propose in that
>> patch would be enough to test this new feature? Maybe it could be made
>> a bit simpler?
>>
>
> I assumed it would go in test-subrepo*t, so anyone hacking on subrepos
> doesn't have to find it in an obscure test, but I can see the same argument
> from someone working on pull(). Probably what tips it to subrepo*t though
> is that there is some existing subrepo infrastructure that doesn't exist in
> test-pull*t.
Yes, I think you are right. It should probably go into test-subrepo-recursion.t.
> I'm not sure that it's comprehensive, and there's probably some duplication.
> I put it together to e.g. test 'push -r 2' then 'push -r 4', and make sure
> the grandchild repo code path got exercised. Since you aren't doing
> anything with -r, there may be duplicate pull tests. The incoming tests
> probably aren't relevant either, though I think they should agree on what
> they are doing eventually.
>
> Those tests also didn't exercise the subrepo pull path (the dests were
> empty, so clone was used). I marked various potential issues in there with
> 'XXX'.
> One thing I did specifically was to skew the rev values between the parent
> and child, so parent rev 1 doesn't lock in subrepo rev 1. This makes it
> easier to test with the -r option (which I realize you aren't doing now, but
> might be a nice to have in the future).
I think I will take some inspiration from your test and write a new
one specific to my proposed patch.
>>> - I wonder if the code in this patch can be leveraged to make incoming
>>> print all of the stuff 'pull -S' will grab in the future.
>>
>>
>> In theory that would be certainly possible. At least we could run the
>> same "look for subrepos to pull" code and show the list of subrepos
>> that would be pulled and then run incoming on them. The main problem
>> is that incoming already has a --subrepos flag, which does something
>> slightly different... I'm not sure that could be changed...
>
>
> I think it just recurses into the subrepo and does an incoming on it. If it
> does something else, it's too subtle for me to have noticed.
>
> I've probably stated bits and pieces of this over the course of a week or
> so, but just to explicitly put it all together, I think in an ideal world we
> would have new pull/incoming, and push/outgoing agree with the same options,
> and each group is the reverse of the other. It may be fantasy, but:
>
> incoming pull operation
> <no arg> <no arg> current behavior, parent repo only
> xxx [1] -u current behavior, pull subrepo on update
> -S -S (this patch) pull parent + all subrepos to
> tip
> -S -r <rev> -S -r <rev> (future) pull parent to rev, revs of
> subrepos
> used up to that rev
I think your -S -r <rev> idea is worth thinking about. This could mean
that if <rev> where revs that you already had in your parent repo you
would still pull from the corresponding subrepos, right?
> [1] It doesn't seem sensible to add -u to incoming for this, so leaving no
> equivalent form is probably best.
>
> outgoing push operation
> <no arg> [2] <no arg> current, parent only and all respectively
> -S <no arg> [3] current, push parent + all subrepos to tip
> -S -r <rev> -r <rev> (future) push parent to rev, revs of
> subrepos
> used up to that rev
The problem with this is that if a subrepo changes more than once in a
<rev> range there is not cheap way to only send those revisions (and
their descendants) AFAIK. What I mean is that if subrepo S changes
from revision A to B and then from B to C on two different parent
revisions, you cannot just do push -r C on the subrepo, because B and
C may be on different branches and not have an ancestor-descendant
relationship. In that case you only have two choices. Either you push
twice ("push -r A" then "push -r C" or you push all).
> [2] It seems buggy to me that the no arg forms disagree, since outgoing's
> help says "These are the changesets that would be pushed if a push was
> requested." But that ship has probably already sailed.
Unfortunately I think that is true.
> [3] There is no form of push that takes -S
Since I have another patch series (waiting for review!) that makes
push only push "cleanstore" subrepos, it may make sense to add a "push
-S" flag to force pushing all subrepos regardless of the cleanstore
state.
> So the last two lines in each table align in functionality and (roughly) in
> arguments.
>
>
>
>>>>> I'll try to experiment with this some in the next few days. I ran
>>>>> into issues with what I'm working on (push, outgoing) with deeply
>>>>> nested subrepos, and also when a parent locks in an earlier subrepo
>>>>> version. I wonder if deeply nested subrepos will be a problem here
>>>>> since hgsubrepo.pull() doesn't walk its subrepos and pull them.
>>>>
>>>>
>>>> I must confess that I have not tried that too much. We should
>>>> definitely do this recursively. That being said I hope to get some
>>>> feedback on the current version that I sent to the list first.
>>>
>>>
>>> Sorry, I got crossed up on that too. hgsubrepo.pull() ends up calling
>>> _repo.pull(), so it does recurse.
>>
>>
>> You are right. That's nice! :-)
>>
>>> The test below indicates that clone won't
>>> recurse- it reminds that an update is needed. (Maybe clone needs a -S
>>> too
>>> as part of these changes? If you aren't walking the history, I
>>> don't see a way around that because nothing is incoming after a clone,
>>> so you won't see subrepos of a cloned subrepo.)
>>
>>
>> I guess that would make sense in another patch.
>>
>>> The other thing worth a test is largefiles- the --all-largefiles
>>> option isn't passed to subrepos, so as it stands, 'pull -S' won't let
>>> you really disconnect, because largefiles in subrepos won't be cached.
>>> That
>>> can be fixed later.
>>
>>
>> I agree. We do not really use largefiles (yet) so we have not come
>> across this (yet) but I'm sure this is something we would like to
>> improve as well.
>>
>> One last thing, did you have time to have a look at the code of the
>> patch series itself? I plan to send an slighly changed version of the
>> patch series that will only change the first patch to fix the issue
>> you found (the fix is minimal so the code should be almost the same).
>> Do you have any comments?
>
>
> I did, but I'm not as well versed in this area of code- that's why I've kept
> the conversation fairly high level. I was able to follow your commit
> message and it looks like the code does what it says. Just curious- is
> specifying None as the revision the right thing to do in pull()? AFAICT,
> None gives you the working directory... so does that change the results
> based on what the subrepo is currently updated to?
I use '.' on the latest version of the patch.
Anyway, I think I will resend the series with an added test. We can
think about the other possible improvements later. I really hope I can
get this in before the next release!
Cheers,
Angel
More information about the Mercurial-devel
mailing list