[isabelle-dev] Repository Trouble

Tue Dec 25 11:38:43 CET 2012

On 12/20/2012 06:22 PM, Makarius wrote:
>> I just had a long phone call with Franz Huber, the local system admin
>> person. All the macbroy20..29 and lxlabbroyX machines involved here
>> use the same OpenSuse 12.2 with that hg 2.4. So just empirically that
>> looks like the problem -- breakdowns started approx. at the time of
>> update of several of these machines.

I made a few experiments to reduce the room for wild speculations and myths:

(Bottom line: if everybody uses lxbroy10 for pushes, we are safe.)

SETUP
=====

The attached crude shell script pushes single changesets to a repository
in a loop. As source, I used a clone of the Isabelle repository. As
destination, I used an empty repository in the same format as our 
central repository. I ran the script concurrently on various 
combinations of machines and with different versions of Mercurial.
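
For readers without the attachment, the loop described above can be
sketched roughly as follows. This is not the attached shell script but a
minimal Python stand-in. It assumes that the destination is a filesystem
path visible on the machine running it (e.g. an NFS mount), and that
corruption shows up either as an aborting push or as a failing
`hg verify`. The function name and its parameters are hypothetical.

```python
import subprocess


def push_loop(src, dest, n_revs, run=subprocess.run):
    """Push revisions 0..n_revs-1 of src to dest, one changeset per push.

    Returns the revision at which a problem was first detected, or None
    if all pushes completed without a detected corruption.
    """
    for rev in range(n_revs):
        # Local revision numbers are topologically ordered, so pushing them
        # in increasing order transfers exactly one new changeset each time.
        push = run(["hg", "push", "-R", src, "-r", str(rev), dest])
        # hg push documents exit code 0 (success) and 1 (nothing to push);
        # any other code is treated as a failure here (an assumption).
        if push.returncode not in (0, 1):
            return rev
        # Assumption: a corrupted destination makes `hg verify` exit non-zero.
        check = run(["hg", "verify", "-R", dest])
        if check.returncode != 0:
            return rev
    return None
```

A run of the kind described would then be something like
`push_loop("/path/to/isabelle-clone", "/nfs/path/to/empty-repo", 20000)`,
started on each participating host; both paths are placeholders. Verifying
after every push is slow on a growing repository, so a real run would
probably verify less often.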

This test turns out to be fairly selective: either the repository gets
corrupted within the first 1,000 pushes, or it stays intact for 20,000
pushes, at which point I stopped the test.

My assumption is that the corruption seen here is the same one that we
have had in production use. The error message is the same.

RESULTS
=======

| No | host A     | hg A   | host B     | hg B   | Conc. | NFS | ->  |
|----+------------+--------+------------+--------+-------+-----+-----|
|  1 | lxlabbroy5 | 2.4    | lxbroy7    | 2.4    | Yes   | Yes | BAD |
|  2 | lxlabbroy5 | 2.4    | -          | -      | No    | Yes | OK  |
|  3 | lxlabbroy6 | 2.2.2* | lxbroy7    | 2.1.1* | Yes   | Yes | BAD |
|  4 | lxlabbroy7 | 2.4    | lxlabbroy7 | 2.4    | Yes   | Yes | OK  |
|  5 | lxlabbroy6 | 2.2.2* | lxlabbroy6 | 2.2.2* | Yes   | No  | OK  |
|  6 | lxlabbroy8 | 2.2.2* | lxlabbroy9 | 2.2.2* | Yes   | Yes | BAD |
|  7 | lxlabbroy8 | 2.2.2* | lxlabbroy9 | 2.2.2* | No    | Yes | BAD |
|  8 | lxbroy7    | 2.1.1* | lxbroy8    | 2.1.1* | No    | Yes | OK  |
|  9 | lxbroy6    | 2.1.1* | lxbroy9    | 2.1.1* | Yes   | Yes | OK  |
| 10 | lxlabbroy7 | 2.1.1  | lxlabbroy8 | 2.1.1  | Yes   | Yes | BAD |
| 11 | macbroy20  | 2.2.2* | macbroy21  | 2.2.2* | Yes   | Yes | BAD |
| 12 | lxbroy6    | 2.4    | lxbroy9    | 2.4    | Yes   | Yes | OK  |
| 13 | macbroy20  | 2.1.1  | macbroy21  | 2.1.1  | Yes   | Yes | BAD |

*) Version from the system's installation. Otherwise, Mercurial was
compiled from source.

Conc.:
  Yes: Different processes (on the two hosts) push concurrently
  No: Only one process, but pushing via ssh through two different hosts
  (here I used a slightly different script). Exception: run #2, which
  involved only a single host.

NFS:
  Does the destination repository live on NFS?

-> :
  OK: 20,000 pushes completed without any corruption.
  BAD: Repository corruption before 1,000 pushes.

INTERPRETATION
==============

- There is no correlation with the Mercurial version in use. Breakages 
occur with older and newer versions alike, and the same version is OK in 
other circumstances.

- The error only occurs when the repository is accessed from different 
hosts. The access does not need to be concurrent (which excludes a 
problem with Mercurial's locking mechanisms). This is also similar to 
the situation we had in production use, where concurrent pushes are 
fairly unlikely.

- At least one of the hosts involved must be lxlabbroy* or macbroy*, the
openSUSE machines. The Gentoo servers are not affected.

I would say that this points to the SUSE NFS client driver as the source 
of the problem. If we use lxbroy10 exclusively for pushes, we should be 
safe until the issue is fixed.
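
The reading above can be re-checked mechanically against the results
table. The sketch below copies the 13 runs from the table and tests one
rule: a run is predicted BAD when the repository is on NFS, two distinct
hosts are involved, and at least one of them is an lxlabbroy* or macbroy*
machine. The function names are hypothetical. Note that run 5 is the only
one without NFS and it also uses a single host, so the table alone does
not isolate the NFS factor.

```python
# (run, host A, hg A, host B, hg B, concurrent, nfs, outcome), from the table
RUNS = [
    (1, "lxlabbroy5", "2.4", "lxbroy7", "2.4", True, True, "BAD"),
    (2, "lxlabbroy5", "2.4", None, None, False, True, "OK"),
    (3, "lxlabbroy6", "2.2.2", "lxbroy7", "2.1.1", True, True, "BAD"),
    (4, "lxlabbroy7", "2.4", "lxlabbroy7", "2.4", True, True, "OK"),
    (5, "lxlabbroy6", "2.2.2", "lxlabbroy6", "2.2.2", True, False, "OK"),
    (6, "lxlabbroy8", "2.2.2", "lxlabbroy9", "2.2.2", True, True, "BAD"),
    (7, "lxlabbroy8", "2.2.2", "lxlabbroy9", "2.2.2", False, True, "BAD"),
    (8, "lxbroy7", "2.1.1", "lxbroy8", "2.1.1", False, True, "OK"),
    (9, "lxbroy6", "2.1.1", "lxbroy9", "2.1.1", True, True, "OK"),
    (10, "lxlabbroy7", "2.1.1", "lxlabbroy8", "2.1.1", True, True, "BAD"),
    (11, "macbroy20", "2.2.2", "macbroy21", "2.2.2", True, True, "BAD"),
    (12, "lxbroy6", "2.4", "lxbroy9", "2.4", True, True, "OK"),
    (13, "macbroy20", "2.1.1", "macbroy21", "2.1.1", True, True, "BAD"),
]


def is_suse(host):
    """lxlabbroy* and macbroy* are the openSUSE machines, per the text."""
    return host.startswith(("lxlabbroy", "macbroy"))


def predicted_bad(host_a, host_b, nfs):
    """Rule under test: NFS, two distinct hosts, at least one openSUSE."""
    hosts = {h for h in (host_a, host_b) if h is not None}
    return bool(nfs and len(hosts) > 1 and any(is_suse(h) for h in hosts))


# Runs where the rule disagrees with the observed outcome.
mismatches = [r[0] for r in RUNS
              if predicted_bad(r[1], r[3], r[6]) != (r[7] == "BAD")]

# Mercurial versions seen in BAD runs and in OK runs.
versions_bad = {v for r in RUNS if r[7] == "BAD" for v in (r[2], r[4]) if v}
versions_ok = {v for r in RUNS if r[7] == "OK" for v in (r[2], r[4]) if v}

print("runs not explained by the rule:", mismatches)
print("versions in BAD runs:", sorted(versions_bad))
print("versions in OK runs: ", sorted(versions_ok))
```

The rule accounts for all 13 runs, and every Mercurial version tested
appears in both BAD and OK runs. The concurrency column is not needed by
the rule at all.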

Alex

Attachment: push_loop
URL: <https://mailman46.in.tum.de/pipermail/isabelle-dev/attachments/20121225/6922f6fb/attachment-0002.ksh>