[isabelle-dev] Script characters in words
Makarius
makarius at sketis.net
Wed Mar 6 14:51:16 CET 2024
On 06/03/2024 13:19, Lawrence Paulson wrote:
> I have noticed something unusual, which is true both on Isabelle2023 and in the current version: most script letters seem to be regarded as word characters (as we would expect), but not all of them.
>
> To see this, it is enough to type
>
> thm 𝒜_def ℬ_def 𝒞_def 𝒟_def ℰ_def ℱ_def ℛ_def 𝒮_def
>
> and double click on the letter f. It's pretty hard to explain why A should be treated differently from B here, or R from S.
You have rediscovered a long-standing problem of "Unicode". It started out as
a universal and uniform 16-bit character sets many decades ago (UCS-2), but is
now a conglomerate of many add-ons and adhoc complexity, in order to
approximate text representation for every language on the planet (including
mathematics).
The problem here is that the script "𝒜" was introduced much later than "ℬ":
thus it requires 2 16-bit characters instead of just 1. Many Java (or
JavaScript) programs will usually get it wrong, because they work with type
Char ("character") instead of Int (codepoint).
This complexity (and failure) of Unicode is avoided in Isabelle by using a
more stable and uniform notion of "Isabelle symbols". The Isabelle/jEdit
front-end uses "poor-man's rendering of Isabelle symbols in Unicode". Thus the
underlying jEdit text editor may get things wrong occasionally: it would be
better to work natively with Isabelle symbols everywhere.
Here are some snippets to illustrate the situation:
(*Isabelle/ML symbols: good*)
ML ‹Symbol.explode "𝒜_def ℬ_def" |> map (fn s => (s, Symbol.is_letter s))›
(*Isabelle/Scala symbols or UTF-16 codepoints: good*)
Symbol.explode("𝒜_def ℬ_def").map(s => (s, Symbol.is_letter(s)))
(*Java UTF-16 codepoints: good*)
Codepoint.iterator("𝒜_def ℬ_def").toList.map(c => (c, Character.isLetter(c)))
(*Java UCS-2 characters: bad*)
"𝒜_def ℬ_def".toList.map(c => (c, Character.isLetter(c)))
The first one is for the Isabelle/jEdit document buffer, within a formal
theory context. The other ones are for the Console/Scala REPL: here you may
notice that the Java GUI component cannot display the "𝒜" anymore, after it
got chopped up into 2 characters.
I am already used to post tickets to the average Java project, in order to
make such fine-points more correct. But for jEdit we have a certain status-quo
that won't move much.
Makarius
More information about the isabelle-dev
mailing list