[isabelle-dev] Nontermination in HOL-Corec_Examples on Epyc

Mon Jun 25 15:50:16 CEST 2018

On 22/06/18 16:33, Lars Hupel wrote:
>> But there are still (spurious) issues with some AFP sessions. With high
>> contention (a total of 64 worker threads at the same time, e.g. with
>> threads=2 and -j 32), some sessions will run into suspicious timeouts:
>>
>> Timing Median_Of_Medians_Selection (2 threads, 25.039s elapsed time,
>> 44.468s cpu time, 2.451s GC time, factor 1.78)
>> *** Timeout
>> Median_Of_Medians_Selection FAILED
> 
> Here are the offending sessions:
> 
> Dict_Construction
> Hidden_Markov_Models
> Median_Of_Medians_Selection
> Mason_Stothers
> 
> They fail spuriously (except for the last one, which fails reproducibly)
> when running with a total of 64 or even 128 worker threads:
> 
> ./isabelle/bin/isabelle build -cbv -a -d afp-devel/thys/ -X slow -o
> threads=2 -j 32
> 
> ./isabelle/bin/isabelle build -cbv -a -d afp-devel/thys/ -X slow -o
> threads=4 -j 32
> 
> None of this happens when using a total of just 32 worker threads.
> 
> Makarius, David, this probably requires another round of analysis.
> 
> Isabelle/b6e48841d0a5
> AFP/00e13b87d199

I think it works properly now, see Isabelle/f0f83ce0badd + AFP/f7e9efc65d82.

The failing sessions above had very small timeout, fitting to a high-end
laptop of the authors, but unfitting to a huge server with many other
things running. E.g. see now

changeset:   9452:6ab2cc180aae
user:        wenzelm
date:        Sun Jun 24 11:25:19 2018 +0200
files:       thys/AODV/ROOT thys/Automatic_Refinement/ROOT
thys/Buffons_Needle/ROOT thys/Category3/ROOT thys/Dict_Construction/ROOT
thys/Error_Function/ROOT thys/Hidden_Markov_Models/ROOT
thys/LLL_Factorization/ROOT
description:
more uniform timeout: discretization for 300 (5min);

I usually measure the CPU time, multiply it by 2, and take the ceiling
towards the next multiple of 300. In principle this maintenance of
options could be automated.

Also note that option -b causes a lot of file-system traffic, which I
suspect does not work so well with many parallel processes. (This is an
old-fashioned HD, not an SSD.)

There were other file-system oriented crashes due to the new
Export.export in batch build, but I have made this conditional and
disabled by default:

changeset:   68491:f0f83ce0badd
tag:         tip
user:        wenzelm
date:        Sun Jun 24 22:13:23 2018 +0200
files:       etc/options src/Pure/Thy/thy_info.ML src/Pure/Tools/dump.scala
description:
disable export_document by default (presently unused and for
demo/testing purposes): avoid spurious IO exception in highly parallel
environment;

Here is an earlier change that improves the overall scheduling of
long-running sessions:

changeset:   68486:6984a55f3cba
user:        wenzelm
date:        Sat Jun 23 14:23:53 2018 +0200
files:       src/Pure/Tools/build.scala
description:
clarified queue ordering: take session descendants into account, notably
for "slow" AFP sessions;

All this together gives fairly good robustness and performance as follows:

  ML_PLATFORM="x86_64-linux"
  ML_HOME="/home/isabelle-temp/makarius/x86_64-linux"
  ML_SYSTEM="polyml-5.7.1"
  ML_OPTIONS="--minheap 3G --maxheap 30G"

isabelle build -j8 -o threads=8 -a -X slow -d afp-devel/thys -N -f
0:58:59 elapsed time, 28:11:51 cpu time, factor 28.68
0:59:18 elapsed time, 28:14:49 cpu time, factor 28.58

isabelle build -j8 -o threads=8 -a -X very_slow -d afp-devel/thys -N -f
Finished AODV (1:17:55 elapsed time, 8:01:27 cpu time, factor 6.18)
1:23:40 elapsed time, 46:19:04 cpu time, factor 33.21

The "/home/isabelle-temp/makarius/x86_64-linux" above refers to a recent
repository version of Poly/ML built on Ubuntu 18.04 LTS: that was
essentially an accident, and should have been official polyml-5.7.1-6.

It would be also interesting to test this with x86-linux, but it
requires the Ubuntu package lib32stdc++6.

Note that the above 8 * 8 arrangement follows the structure of the
hardware (according to numactl -H): it is a cluster of 8-core nodes,
with a memory access penalty of factor 1.6 or 3.2 for threads sitting on
a distant node. The isabelle build option -N avoids that.

Here are some variations that burn more CPU cycles without really
improving the performance:

isabelle build -j16 -o threads=6 -a -X slow -d afp-devel/thys -N -f
0:58:00 elapsed time, 33:37:20 cpu time, factor 34.77

isabelle build -j32 -o threads=6 -a -X slow -d afp-devel/thys -N -b -f
1:08:28 elapsed time, 48:51:32 cpu time, factor 42.81

isabelle build -j64 -o threads=1 -a -X slow -d afp-devel/thys
2:05:31 elapsed time, 28:49:42 cpu time, factor 13.78
1:59:28 elapsed time, 28:56:00 cpu time, factor 14.53

The last one demonstrates where we would be without the great
multithreading provided by Poly/ML.

Summary: this is a quite good machine, which manages to push the
standard Isabelle + AFP quasi-interactive test below 1h, but the ideal
of 45min (the "Paris commuter constant") is not met yet. It probably
requires smarter scheduling in isabelle build to achieve that.

The Isabelle + AFP slow test is quite good in that configuration, and
the "slow" tag is hardly justified on that machine: the overall test
time converges towards the longest-running session.

	Makarius