[isabelle-dev] Repository Trouble
Alexander Krauss
krauss at in.tum.de
Tue Dec 25 11:38:43 CET 2012
On 12/20/2012 06:22 PM, Makarius wrote:
>> I just had a long phone call with Franz Huber, the local system admin
>> person. All the macbroy20..29 and lxlabbroyX machines involved here
>> use the same OpenSuse 12.2 with that hg 2.4. So just empirically that
>> looks like the problem -- breakdowns started approx. at the time of
>> update of several of these machines.
I made a few experiments to reduce the room for wild speculations and myths:
(Bottom line: if everybody uses lxbroy10 for pushes, we are safe.)
SETUP
=====
The attached crude shell script pushes single changesets to a repository
in a loop. As source, I used a clone of the Isabelle repository. As
destination, I used an empty repository in the same format as our
central repository. I ran the script concurrently on various
combinations of machines and with different versions of Mercurial.
This test turns out to be fairly selective: Either the repository gets
corrupted in the first 1,000 pushes, or it stays intact for 20,000
pushes, where I stopped the test.
My assumption is that the corruption seen here is the same one that we
have had in production use, since the error message is the same.
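
The attached script is not reproduced here. As a rough illustration of the idea only, a push loop of this kind might look like the following Python sketch. The names `push_commands` and `push_loop` and the injectable `run` parameter are hypothetical and are not taken from the attachment. The sketch assumes `hg` is on the PATH and relies on `hg push` returning 0 on success and 1 when there is nothing to push, so that any other exit code is treated as a failure.

```python
import subprocess


def push_commands(revisions, dest):
    """Build one `hg push` command line per changeset, in order."""
    return [["hg", "push", "--rev", str(rev), dest] for rev in revisions]


def push_loop(source_repo, dest, revisions, run=subprocess.run):
    """Push changesets one at a time from source_repo to dest.

    Returns the number of pushes completed before the first failure.
    Exit code 1 ("nothing to push") is tolerated, because a concurrent
    pusher may already have delivered the same changeset.
    """
    done = 0
    for cmd in push_commands(revisions, dest):
        result = run(cmd, cwd=source_repo)
        if result.returncode not in (0, 1):
            break  # hg aborted; the destination may be corrupted
        done += 1
    return done
```

Running two such loops at once from two hosts against the same destination would correspond to the "concurrent" setting. The `run` parameter exists only so that the loop can be exercised without a real repository.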
RESULTS
=======
| No | host A     | hg A   | host B     | hg B   | Conc. | NFS | ->  |
|----+------------+--------+------------+--------+-------+-----+-----|
|  1 | lxlabbroy5 | 2.4    | lxbroy7    | 2.4    | Yes   | Yes | BAD |
|  2 | lxlabbroy5 | 2.4    | -          | -      | No    | Yes | OK  |
|  3 | lxlabbroy6 | 2.2.2* | lxbroy7    | 2.1.1* | Yes   | Yes | BAD |
|  4 | lxlabbroy7 | 2.4    | lxlabbroy7 | 2.4    | Yes   | Yes | OK  |
|  5 | lxlabbroy6 | 2.2.2* | lxlabbroy6 | 2.2.2* | Yes   | No  | OK  |
|  6 | lxlabbroy8 | 2.2.2* | lxlabbroy9 | 2.2.2* | Yes   | Yes | BAD |
|  7 | lxlabbroy8 | 2.2.2* | lxlabbroy9 | 2.2.2* | No    | Yes | BAD |
|  8 | lxbroy7    | 2.1.1* | lxbroy8    | 2.1.1* | No    | Yes | OK  |
|  9 | lxbroy6    | 2.1.1* | lxbroy9    | 2.1.1* | Yes   | Yes | OK  |
| 10 | lxlabbroy7 | 2.1.1  | lxlabbroy8 | 2.1.1  | Yes   | Yes | BAD |
| 11 | macbroy20  | 2.2.2* | macbroy21  | 2.2.2* | Yes   | Yes | BAD |
| 12 | lxbroy6    | 2.4    | lxbroy9    | 2.4    | Yes   | Yes | OK  |
| 13 | macbroy20  | 2.1.1  | macbroy21  | 2.1.1  | Yes   | Yes | BAD |
*) Version from the system's installation. Otherwise, Mercurial was
   compiled from source.
Conc.:
  Yes: Different processes (on the two hosts) push concurrently.
  No:  Only one process, but pushing via ssh through two different
       hosts. (Here, I used a slightly different script.)
       Exception: run #2, which involves a single host only.
NFS:
  Does the destination repository live on NFS?
-> :
  OK:  Can do 20,000 pushes without seeing a corruption.
  BAD: Repository corruption before 1,000 pushes.
INTERPRETATION
==============
- There is no correlation with the Mercurial version in use. Breakages
occur with older and newer versions alike, and the same version is OK in
other circumstances.
- The error only occurs when the repository is accessed from different
hosts. The access does not need to be concurrent (which excludes a
problem with Mercurial's locking mechanisms). This is also similar to
the situation we had in production use, where concurrent pushes are
fairly unlikely.
- At least one of the hosts involved must be lxlabbroy* or macbroy*, the
openSUSE machines. The Gentoo servers (lxbroy*) are not affected.
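
The observations above can be checked mechanically against the results table. The following Python sketch transcribes the 13 runs and tests two things: that every Mercurial version appears among both the BAD and the OK runs, and that the rule "two different hosts, at least one of them lxlabbroy* or macbroy*" predicts every outcome. The names `Run`, `RUNS`, `is_opensuse`, `predicted_bad` and `versions_by_outcome` are made up for this illustration, and the distinction between system-installed and self-compiled Mercurial (the `*` marker) is dropped.

```python
from collections import namedtuple

Run = namedtuple("Run", "no host_a hg_a host_b hg_b concurrent nfs bad")

# Transcribed from the results table; bad=True means "BAD".
RUNS = [
    Run(1, "lxlabbroy5", "2.4", "lxbroy7", "2.4", True, True, True),
    Run(2, "lxlabbroy5", "2.4", None, None, False, True, False),
    Run(3, "lxlabbroy6", "2.2.2", "lxbroy7", "2.1.1", True, True, True),
    Run(4, "lxlabbroy7", "2.4", "lxlabbroy7", "2.4", True, True, False),
    Run(5, "lxlabbroy6", "2.2.2", "lxlabbroy6", "2.2.2", True, False, False),
    Run(6, "lxlabbroy8", "2.2.2", "lxlabbroy9", "2.2.2", True, True, True),
    Run(7, "lxlabbroy8", "2.2.2", "lxlabbroy9", "2.2.2", False, True, True),
    Run(8, "lxbroy7", "2.1.1", "lxbroy8", "2.1.1", False, True, False),
    Run(9, "lxbroy6", "2.1.1", "lxbroy9", "2.1.1", True, True, False),
    Run(10, "lxlabbroy7", "2.1.1", "lxlabbroy8", "2.1.1", True, True, True),
    Run(11, "macbroy20", "2.2.2", "macbroy21", "2.2.2", True, True, True),
    Run(12, "lxbroy6", "2.4", "lxbroy9", "2.4", True, True, False),
    Run(13, "macbroy20", "2.1.1", "macbroy21", "2.1.1", True, True, True),
]


def is_opensuse(host):
    """lxlabbroy* and macbroy* are the openSUSE machines."""
    return host.startswith(("lxlabbroy", "macbroy"))


def predicted_bad(run):
    """Rule from the interpretation: two distinct hosts, one on openSUSE.

    NFS is not part of the rule: the only run off NFS (#5) also uses a
    single host, so the table cannot separate the two factors.
    """
    hosts = {h for h in (run.host_a, run.host_b) if h is not None}
    return len(hosts) > 1 and any(is_opensuse(h) for h in hosts)


def versions_by_outcome(runs):
    """Collect the Mercurial versions seen in BAD and in OK runs."""
    seen = {True: set(), False: set()}
    for r in runs:
        for version in (r.hg_a, r.hg_b):
            if version is not None:
                seen[r.bad].add(version)
    return seen
```

With this data, `predicted_bad(r) == r.bad` holds for all 13 runs, and `versions_by_outcome(RUNS)` yields the same three versions (2.1.1, 2.2.2, 2.4) for both outcomes.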
I would say that this points to the SUSE NFS client driver as the source
of the problem. If we use lxbroy10 exclusively for pushes, we should be
safe until the issue is fixed.
Alex
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: push_loop
URL: <https://mailman46.in.tum.de/pipermail/isabelle-dev/attachments/20121225/6922f6fb/attachment-0002.ksh>