[isabelle-dev] New Testing Infrastructure -- status report

Mon May 30 09:54:12 CEST 2011

Hi all,

In the past weeks, there has been some progress with our new testing
infrastructure, which I would like to summarize here. Please give
feedback, ask questions, and discuss.

Really fast status for "isabelle makeall"
=========================================

Thanks to a new fat server with 24 cores and 130G of RAM, we can now
run "isabelle makeall all" in 17-18 minutes. This is done on all
changesets as they come in, so we always know quickly when the tip of
the repository breaks.

The reports with logs etc. can all be browsed at

   http://isabelle.in.tum.de/reports/Isabelle/

This is a fully functional repository view, annotated with test
results.

The first column of flashlights represents the status of
Isabelle_makeall.

Developer-initiated tests
=========================

Apart from changesets from the official repository, the system also
considers changesets from a special "testboard" repository. All
developers can push their untested changes here, to get some automatic
feedback. Try it out, it is really convenient.

To use testboard, it is convenient to add the following to your
~/.hgrc

   [paths]
   testboard = 
ssh://USERNAME@macbroy20.informatik.tu-muenchen.de//home/isabelle-repository/repos/testboard

Then, local changes can be pushed to testboard via

   hg push -f testboard

and will usually be considered fairly quickly by the scheduler.
(Note: -f is required, since the testboard repository will usually
have multiple heads. Never use -f when pushing to the main development
repository!)

The reports from testboard can be viewed at

   http://isabelle.in.tum.de/testboard/Isabelle

Note that testboard should be considered write-only. You should never
pull from it unless you want to get everybody's unfinished stuff!

SML/NJ testing
==============

Building the HOL image takes ca. 3-4 hours, and a full makeall takes
ca. 24 hours. This has been prohibitive for pre-push testing with
SML/NJ, so people generally have fixed issues after getting a failure.
The new setup supports the same style of working but gives quicker
feedback. Both the HOL image and makeall are run continuously, so
there is much quicker feedback.

The third column in the reports view signals the status of SML/NJ
builds.  Note that the SML_makeall test still excludes a small number
of sessions which are too big to fit into memory.

AFP
===

Currently, there is a reduced build of the AFP, which is run
continuously, but it still excludes Flyspeck_Tame and
JinjaThreads. When Gerwin is here, we'll try to make this build
equivalent to the official AFP builds.

The results are signalled in the second column in the reports view.

Data collection
===============

Performance data is collected in every run. When Makarius is available
again, we must decide on a set of sessions to be run regularly in a
reproducible environment (i.e., without other jobs or interactive
activity on the same machine), to get less fluctuation in the
measurements.

Judgement Day benchmarks are now run continuously on two lxlabbroy
machines. This is still somewhat preliminary, and I must coordinate
with Jasmin and Sascha what data should best be collected and how.

Mutabelle benchmarks are run continuously on one lxlabbroy machine in
pretty much the same way.

The most important weakness at the moment is the lack of good plotting
facilities for this sort of data. A student (Michael Kerscher) is
currently working on that, with promising first results.

Limitations
===========

- Missing email notification

   Here is some room for discussion: What would be a good notification
   scheme? Since tests are now run continuously, sending an email after
   each run is not really an option. Should we simply go for a daily
   summary? More sophisticated options are possible (e.g., notify the
   author of a broken changeset when all its parents are working), but
   these require some thought.

- Scheduling strategies could be better sometimes
- Plotting (see above)
- ...

Alex