Solaris 11.4 hosted repository, TortoiseHG clone attempt consumes all resources
Scott Newman - NOAA Affiliate
scott.newman at noaa.gov
Thu Jul 23 16:53:08 UTC 2020
On Mon, Jul 20, 2020 at 6:28 PM Pierre-Yves David
<pierre-yves.david at ens-lyon.org> wrote:
>
>
>
> On 6/23/20 11:15 PM, Scott Newman - NOAA Affiliate wrote:
> >>>>>>>>>>> Good morning everyone!
> >>>>>>>>>>>
> >>>>>>>>>>> We are currently using Mercurial 5.2.2 hosted on Solaris 11.3 and
> >>>>>>>>>>> accessed
> >>>>>>>>>>> by contributors via TortoiseHG 5.0.2 from their Windows Desktops.
> >>>>>>>>>>> We are
> >>>>>>>>>>> in the process of migrating applications to new hosts running
> >>>>>>>>>>> Solaris
> >>>>>>>>>>> 11.4.
> >>>>>>>>>>
> >>>>>>>>>> As far as I understand, you use the same versions (Mercurial 5.2.2
> >>>>>>>>>> on
> >>>>>>>>>> server TortoiseHG 5.0.2 on client) and the same python (probably
> >>>>>>>>>> 2.7
> >>>>>>>>>> something?) The only software version difference is Solaris 11.3 vs
> >>>>>>>>>> Solaris 11.4, right ?
> >>>>>>>>>
> >>>>>>>>> Pierre-Yves, so nice to hear from you! Correct. Python 2.7.18
> >>>>>>>>> (tried
> >>>>>>>>> some others with the same result). I have an update that when we
> >>>>>>>>> tried going back to THG 3.4 the clone worked as expected, but that
> >>>>>>>>> doesn't seem like a good long-term solution, particularly since we
> >>>>>>>>> will lose the ability to export-archive that was introduced
> >>>>>>>>> somewhere
> >>>>>>>>> around version 4.5, if you recall.
> >>>>>>>>
> >>>>>>>> That is very interesting. We are talking about using THG 3.4 on the
> >>>>>>>> client, right? While still using Mercurial 5.2.2 on the server, right?
> >>>>>>>
> >>>>>>> Correct. It is so interesting that the client can have such an impact
> >>>>>>> on the server!
> >>>>>>>
> >>>>>>>>
> >>>>>>>> If so, this means that using a new protocol feature introduced
> >>>>>>>> between 3.4 and 5.2 reveals the issue.
> >>>>>>>>
> >>>>>>>> Can you confirm this? And if so, can you try to find the exact
> >>>>>>>> Mercurial version on the client side that triggers this issue?
> >>>>>>>
> >>>>>>> I am scheduled to work on this with another resource tomorrow at 15:00
> >>>>>>> EST and will update this thread. We have confirmed that the problem
> >>>>>>> exists in THG4.5.0, so it will be somewhere in between 3.4 and 4.5.0.
> >>>>>>>
> >>>>>>>>
> >>>>>>>> However, the export-archive thingy is something you run server side,
> >>>>>>>> don't you?
> >>>>>>>>
> >>>>>>>
> >>>>>>> We perform this task on the client side now with the archive function
> >>>>>>> and have abandoned the customization in favor of the built-in archive
> >>>>>>> functionality added
> >>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> When trying to clone a copy of the repository hosted on
> >>>>>>>>>>> Solaris
> >>>>>>>>>>> 11.4 the clone runs very slowly and the process consumes most of
> >>>>>>>>>>> the
> >>>>>>>>>>> memory (64GB) on the host, starts generating "-bash: fork:
> >>>>>>>>>>> Resource
> >>>>>>>>>>> temporarily unavailable" errors for users on the box after about 2
> >>>>>>>>>>> minutes, and the clone process fails with a " Server Unexpectedly
> >>>>>>>>>>> closed
> >>>>>>>>>>> connection" message.
> >>>>>>>>>>
> >>>>>>>>>> So, the server hosting the repository is crumbling while cloning,
> >>>>>>>>>> right? How are you cloning? ssh or http?
> >>>>>>>>>
> >>>>>>>>> Cloning via ssh.
> >>>>>>>>
> >>>>>>>> Great, can you add:
> >>>>>>>>
> >>>>>>>> [ui]
> >>>>>>>> debug=yes
> >>>>>>>>
> >>>>>>>> in the hgrc of the remote repository and run a clone? This will give
> >>>>>>>> you tons of remote output that might help to understand what is
> >>>>>>>> going on when the memory explodes.
> >>>>>>>
> >>>>>>> Here is the result on the client BEFORE adding debug:
> >>>>>>> % hg clone --verbose ssh://<username>@<hostname>//<dirname>/<reponame>
> >>>>>>> "C:\Repos\test"
> >>>>>>> requesting all changes
> >>>>>>> adding changesets
> >>>>>>> adding manifests
> >>>>>>> adding file changes ### Processes 123/5396 files, takes 10-15
> >>>>>>> minutes, fails here
> >>>>>>> transaction abort!
> >>>>>>> rollback completed
> >>>>>>> abort: stream ended unexpectedly (got 20593 bytes, expected 32768)
> >>>>>>> [command returned code 255 Mon Jun 22 15:31:39 2020]
> >>>>>>>
> >>>>>>> When I add the debug entry it stalls at:
> >>>>>>> % hg clone --verbose ssh://<username>@<hostname>//<dirname>/<reponame>
> >>>>>>> "C:\Repos\test"
> >>>>>>> requesting all changes ### stalls here
> >>>>>>>
> >>>>>>>>
> >>>>>>>>>>> The same process on Solaris 11.3 has a negligible
> >>>>>>>>>>> impact on resources and finishes in about 10 minutes.
> >>>>>>>>>>>
> >>>>>>>>>>> I have spent several days with the Network and Systems
> >>>>>>>>>>> Administrators
> >>>>>>>>>>> trying to resolve this issue without success. We tried many
> >>>>>>>>>>> things,
> >>>>>>>>>>> including adjusting resource configurations, rebuilding Mercurial
> >>>>>>>>>>> and
> >>>>>>>>>>> Python, using Mercurial and Python from the working server, using
> >>>>>>>>>>> the
> >>>>>>>>>>> pre-built package from Oracle (v4.9.1),
> >>>>>>>>>>
> >>>>>>>>>> How did you transfer the repository between the two servers?
> >>>>>>>>>
> >>>>>>>>> I used hg clone (via ssh) between the servers without issue.
> >>>>>>>>
> >>>>>>>> This clone might have upgraded the repository to a newer format and
> >>>>>>>> run into an unknown issue affecting your repository. What does `hg
> >>>>>>>> debugformat` say on the older server?
> >>>>>>>
> >>>>>>> On older server:
> >>>>>>> format-variant repo
> >>>>>>> fncache: yes
> >>>>>>> dotencode: yes
> >>>>>>> generaldelta: yes
> >>>>>>> sparserevlog: no
> >>>>>>> sidedata: no
> >>>>>>> copies-sdc: no
> >>>>>>> plain-cl-delta: no
> >>>>>>> compression: zlib
> >>>>>>> compression-level: default
> >>>>>>
> >>>>>> Okay, so the most notable difference is `sparserevlog`. You might be
> >>>>>> encountering some unknown pathological case. Can you try making a new
> >>>>>> server clone using `--config format.sparse-revlog=no` during the clone?
> >>>>>>
> >>>>>
> >>>>> I created a new server clone using:
> >>>>> hg clone --config format.sparse-revlog=no --noupdate
> >>>>> ssh://<username>@<hostname>/<SRCreponame> <TARGETreponame>
> >>>>> When I tried to clone with THG 5.0.2 via the UI I saw the same behavior.
> >>>>> When I performed the clone via the console using:
> >>>>> hg clone --config format.sparse-revlog=no --verbose
> >>>>> ssh://<username>@<hostname>/<SRCreponame> "<TARGETreponame>"
> >>>>> I saw the same behavior.
> >>>>
> >>>> You are cloning from the Solaris 11.3 machine into the Solaris 11.4
> >>>> machine, right? Can you double check the `hg debugformat` of the
> >>>> resulting clone?
> >>>>
> >>>
> >>> Correct, I cloned to 11.4 machine from 11.3 machine, then tried to
> >>> clone to Windows machine from 11.4 machine using THG 5.0.2. Here are
> >>> the hg debugformat results:
> >>> format-variant repo
> >>> fncache: yes
> >>> dotencode: yes
> >>> generaldelta: yes
> >>> sparserevlog: no
> >>> sidedata: no
> >>> copies-sdc: no
> >>> plain-cl-delta: yes
> >>> compression: zlib
> >>> compression-level: default
> >>>
> >>> Scott
> >>
> >> Okay, so this is not the source of the issue.
> >> What happens if you copy the repository from one server to the other (no
> >> clone, just `scp -r`)?
> >>
> >
> > I created a copy of the repo from the 11.3 machine to the 11.4 machine
> > and tried to clone using THG5.0.2 and saw the same bad behavior.
> >
> > Further to the request above to find the exact THG version the issue
> > started with: We did not have the issue with THG4.3.1, but did have
> > it in THG 4.4.1.
>
> Okay, so the problem appears with something that changed between 4.3.1 and
> 4.4.1 on the client side, yet it impacts the server side during pull.
Right!
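For anyone repeating the version hunt, the narrowing above amounts to a
bisection over client releases. Below is a minimal Python sketch, assuming
the failure stays present in every release after the one that introduced
it. The names `first_bad` and `observed` are illustrative only, and the
table holds just the results reported in this thread; in practice each
check would be a manual clone attempt with that THG version.

```python
from typing import Callable, Sequence


def first_bad(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest version for which is_bad() is true.

    Assumes versions are sorted oldest to newest, the first entry is known
    good, the last entry is known bad, and the failure never disappears
    once it has been introduced.
    """
    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid
        else:
            lo = mid
    return versions[hi]


# Clone results reported in this thread (True means the clone failed).
# Releases that were never tested are deliberately left out.
observed = {"3.4": False, "4.3.1": False, "4.4.1": True, "4.5.0": True, "5.0.2": True}
candidates = list(observed)

print(first_bad(candidates, lambda v: observed[v]))  # prints 4.4.1
```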
>
> Protocol-wise, very little changed between 4.3 and 4.4 (only the head-based
> phase computation), so this is a bit puzzling. However, there was a lot of
> churn around the changegroup (which serialises changes to the repository and
> files over the wire) in that period; maybe something faulty slipped in
> there? Still, it is odd that it would affect the server side.
>
> Can you do the test with copying the directory manually between the two
> machines (instead of cloning) to see if the problem still appears?
Was able to successfully SCP the entire directory (using Putty pscp)
with no issues.
>
> Once this test is done, I fear there will be little that can be done
> without further inspection of the repository itself.
>
OK, we will see how best to accomplish that. Do you have access to
Solaris 11.4 host to see if we can reproduce?
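
In case it helps with that inspection, here is a minimal Python sketch of
how the two `hg debugformat` listings quoted earlier in this thread could
be compared mechanically. It assumes the output is pasted in as plain text
with one "name: value" pair per line; `parse_debugformat` and
`diff_formats` are illustrative helper names, not part of Mercurial.

```python
def parse_debugformat(text: str) -> dict:
    """Map each format variant to its value, skipping the header row."""
    variants = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:  # the "format-variant repo" header has no colon
            variants[name.strip()] = value.strip()
    return variants


def diff_formats(old: dict, new: dict) -> dict:
    """Return {variant: (old_value, new_value)} for every differing variant."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Listings copied from this thread.
OLD_SERVER = """format-variant repo
fncache: yes
dotencode: yes
generaldelta: yes
sparserevlog: no
sidedata: no
copies-sdc: no
plain-cl-delta: no
compression: zlib
compression-level: default"""

NEW_CLONE = """format-variant repo
fncache: yes
dotencode: yes
generaldelta: yes
sparserevlog: no
sidedata: no
copies-sdc: no
plain-cl-delta: yes
compression: zlib
compression-level: default"""

print(diff_formats(parse_debugformat(OLD_SERVER), parse_debugformat(NEW_CLONE)))
# prints {'plain-cl-delta': ('no', 'yes')}
```

The only variant that differs between the two listings is
`plain-cl-delta`. Whether that difference matters for this problem has not
been established in this thread.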
Scott