[isabelle-dev] Poly/ML x86_64_32 available for testing

David Matthews dm at prolingua.co.uk
Wed Jan 23 12:05:26 CET 2019


On 22/01/2019 23:01, Makarius wrote:
> On 22/01/2019 23:08, Fabian Immler wrote:
>> On 1/22/2019 4:00 PM, Fabian Immler wrote:
>>> On 1/22/2019 2:28 PM, Makarius wrote:
>>>> On 22/01/2019 20:05, Fabian Immler wrote:
>>>>> These numbers look quite extreme:
>>>>> The slowdown in polyml-5.7.1-8 is about a factor of 2-3 (and not 30, as
>>>>> seems to be the case with polyml-test-0a6ebca445fc).
>>>>>
>>>>> The following is isabelle/ab5a8a2519b0 afp/f49f0b0ead38 on lxcisa0:
>>>>>
>>>>> ML_PLATFORM="x86-linux"
>>>>> ML_OPTIONS="--minheap 2000 --maxheap 4000"
>>>>> Finished HOL-ODE-Numerics (0:25:27 elapsed time, 1:07:49 cpu time,
>>>>> factor 2.66)
>>>>> Finished HOL-ODE-ARCH-COMP (0:18:01 elapsed time, 0:26:19 cpu time,
>>>>> factor 1.46)
>>> (I am testing HOL-ODE-ARCH-COMP with
>>> polyml-test-0a6ebca445fc/x86_64_32-linux right now, but that does take
>>> some time...)
>> HOL-ODE-ARCH-COMP looked fine (2-3 times slower). But I realized that
>> this was the case with your computations, too. Unlike Lorenz_C0:
>>
>>> x86_64_32-linux --minheap 1500
>>> Finished Lorenz_C0 (3:46:03 elapsed time, 28:01:18 cpu time, factor 7.44)
>>> x86_64-linux --minheap 3000
>>> Finished Lorenz_C0 (0:06:52 elapsed time, 0:51:44 cpu time, factor 7.52)
>> Lorenz_C0 had the most significant slowdown, and it has the largest number
>> of parallel computations, so I thought this might be related.
>>
>> And indeed, if you try the theory at the end:
>> Par_List.map (with 6 cores, on Windows) is 12 times slower on _64_32
>> whereas the sequential evaluation is only 2 times slower.
>>
>> All of this reminds me of the discussion we had in November 2017:
>> https://mailmanbroy.informatik.tu-muenchen.de/pipermail/isabelle-dev/2017-November/thread.html#7643
> 
> Back in Nov-2017, I made the following workaround that is still active:
> http://isabelle.in.tum.de/repos/isabelle/rev/5da20135f560
> 
> Looking at the profiles of the included Float_Test.thy, I now see a lot
> of IntInf.divMod, but it is qualitatively similar to the former
> IntInf.pow situation.
> 
> Maybe David can revisit both of these operations, so that we can get
> rid of the workarounds.

I've had a look at this under Windows and the problem seems to be with 
calls to IntInf.divMod from generated code, not from IntInf.pow.  The 
underlying RTS call used by IntInf.quotRem has changed in the 32-in-64 
version.  It previously returned the pair of values by passing in a 
stack location and having the RTS update this.  That isn't possible in 
the 32-in-64 version, so it now allocates a pair on the heap.  For 
simplicity this is now used for the native code versions as well.  From 
what I can tell, nearly all the threads are waiting for mutexes, and I 
suspect that the extra heap allocation in IntInf.quotRem is causing 
some of the problem.  Waiting for a contended mutex can cost a 
significant amount of processor time, both in busy-waiting and in 
kernel calls.
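
To make that concrete, something along the following lines shows the 
kind of behaviour Fabian describes for Par_List.map.  This is only a 
rough sketch, not the code from Float_Test.thy; "burn" and the 
constants are made up for illustration.  Each task does little apart 
from IntInf.divMod, so in the 32-in-64 version every call allocates a 
result pair on the heap, and the parallel run hits the allocator and 
the mutexes much harder than the sequential one:

  (* hypothetical divMod-heavy task; in 32-in-64 every IntInf.divMod
     call allocates a (quotient, remainder) pair on the heap *)
  fun burn (n: IntInf.int) acc =
    if n <= 0 then acc
    else
      let val (q, r) = IntInf.divMod (acc + n * n, n)
      in burn (n - 1) (q + r) end;

  val args = List.tabulate (400, fn i => IntInf.fromInt (i + 50000));

  val _ = List.map (fn n => burn n 1) args;      (* sequential *)
  val _ = Par_List.map (fn n => burn n 1) args;  (* Isabelle/ML parallel map *)

Timing the last two lines separately ought to reproduce the asymmetry 
Fabian reported (sequential only moderately slower, Par_List.map much 
slower), if the allocation really is the culprit.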

I'm not sure what to suggest except not to use concurrency here.  There 
doesn't seem to be a way to avoid allocating a pair in IntInf.quotRem.
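
The only call-site alternative I can think of is not to ask for the 
pair at all: a caller that needs both values could call IntInf.div and 
IntInf.mod separately, which avoids building the tuple but does the 
division work twice.  Purely a hypothetical sketch, untested, and it 
may well be slower overall:

  (* hypothetical rewrite of an IntInf.divMod call site: two division
     calls instead of one, but no result pair is allocated *)
  fun step (n: IntInf.int, d: IntInf.int) =
    let
      val q = IntInf.div (n, d)
      val r = IntInf.mod (n, d)
    in q + r end;   (* placeholder for whatever the caller does with q and r *)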

By the way, I updated IntInf.pow with
https://github.com/polyml/polyml/commit/c2a296122426f5a6205cf121218e3f38415d2b06
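
For anyone who doesn't want to read the diff: an IntInf.pow of this 
kind is normally just square-and-multiply over the exponent bits.  The 
following is a textbook sketch of that scheme, not the code in the 
commit, and it handles non-negative exponents only:

  (* square-and-multiply exponentiation; O(log n) multiplications *)
  fun pow (_, 0) = (1 : IntInf.int)
    | pow (base: IntInf.int, n) =
        let val half = pow (base * base, n div 2)
        in if n mod 2 = 0 then half else base * half end;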

David


