Bytes are not characters


I wish to start by stressing that what I’m about to note is not a bug. That being said, uLisp has a slight problem. It assumes bytes and characters are the same thing, but lives in a world where that is decidedly not true. Witness:

20479> #\æ
Error: unknown character
20479> (length "æ")
2
20479> (char-code (char "æ" 0))
195
20479> (char-code (char "æ" 1))
166

Most terminals by now default to UTF-8, not a simple 8-bit encoding. (This doesn’t only impact us non-English folks: I can repeat the exercise using £ as an example - the codes are 194 and 163.) For the purposes uLisp is intended for and suited to, this is absolutely irrelevant, and there’s really no reason to change the behaviour seen here.
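To make the two-byte behaviour concrete, here is the exercise repeated outside uLisp (in Python, purely as a stand-in, since the point is about UTF-8 itself rather than about uLisp):

```python
# UTF-8 encodes any character outside ASCII as two or more bytes.
for ch in "æ£":
    print(ch, list(ch.encode("utf-8")))
# æ becomes the two bytes 195, 166; £ becomes 194, 163.
```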

But it should be specified, documented and detected. Specifically, uLisp should be defined as an 8-bit-clean ASCII system, with the #\ character notation tied to a) ASCII printable characters, b) the defined control-character names, and c) some representation of raw byte values (I would suggest something on the pattern of #\xFF). If someone tries to use it to escape a high-bit-set value, a distinct error can be given - “character not in ASCII” or so - to prevent confusion. And one thing in particular should be avoided:

20479> (char "æ" 0)
#\�
20479> (char "æ" 1)
#\�


uLisp tried to send a character that in reality is outside of its character set to the terminal. The terminal, being set to receive UTF-8, saw an invalid byte and replaced it with the replacement character, U+FFFD. Sure, if I configure the terminal to ISO 8859-1, this stops being an issue - but that’s not the common case, and uLisp itself doesn’t ever interpret the high byte at all.
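The terminal-side substitution described above can be reproduced directly (again in Python, purely as an illustration of the decoding rules, not of anything uLisp does):

```python
# The byte 195 on its own is an incomplete UTF-8 sequence, so a UTF-8
# decoder substitutes the replacement character U+FFFD for it...
raw = bytes([195])
print(raw.decode("utf-8", errors="replace"))

# ...whereas under a simple 8-bit encoding such as ISO 8859-1
# the very same byte is a perfectly valid character.
print(raw.decode("latin-1"))
```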

As far as I can tell, the character notation is the only thing that needs to be changed - it’s the only place where 8-bit values can show up as part of uLisp’s syntax, rather than as a user-input string.


Thanks - good suggestions. I’m trying to think how this would be implemented in a consistent way.

Would uLisp allow you to create a string containing non-ASCII characters? If so, what would it show if you print such a string? If not, what would it do if you enter a string such as:


uLisp doesn’t directly allow you to convert characters to strings, but you can do this:

> (with-output-to-string (str) (write-byte 195 str) (write-byte 166 str))

What should happen in this case?
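For what it’s worth, the two bytes written above are exactly the UTF-8 encoding of æ, so a UTF-8 terminal reassembles the resulting two-byte string into a single glyph. A quick check (in Python, as a stand-in, since the behaviour in question is really the terminal’s):

```python
import io

# Mimic writing the raw bytes 195 and 166 into a string buffer.
buf = io.BytesIO()
buf.write(bytes([195, 166]))

# A terminal set to UTF-8 decodes the pair as one character.
print(buf.getvalue().decode("utf-8"))  # æ
```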


I think both cases are fine as is. Actually, in theory it would also be perfectly alright to let #\ be followed by a single high-bit-set byte - I was really thinking about the ‘common case’ of UTF-8, where any non-ASCII character is multiple bytes; currently that triggers an attempt to read a control-code name, which produces the slightly odd ‘unknown character’ message.

In particular, I think it’s better to treat strings as octet vectors than it is to place limits on them. Who knows, you might be sending that weird string to some I²C or SPI gadget rather than a terminal.


So what do you think should change in uLisp? Is it just the representation of non-ASCII characters, with something like:

> (code-char 195)


That and documentation, yes. The escape notation might be better done differently, but that counts as a ‘character name’ according to the CL spec, and is thus implementation-dependent anyway. We should probably avoid the U+ syntax, because by going this route we’re explicitly saying these don’t represent Unicode codepoints.

Having a specific error message for “you’re trying to use a multi-byte character we don’t support” for #\ would be nice, but on second thought we’d prefer not to overload the Uno again.

I think I had code implementing this at some point, even. I’ll look into it.


To keep things as simple as possible I propose to express non-ASCII characters as a three-digit decimal number; for example:

> (code-char 129)
#\129

I assume numbers greater than 255 should give an error; for example:

> (code-char 257)

For symmetry the reader should be able to read back in a character consisting of three decimal digits:

> (char-code #\129)
129

Does that seem OK?
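The proposed round trip can be sketched like this (in Python for illustration; print_char and read_char are hypothetical names, not uLisp functions, and the named control characters such as #\Newline are ignored for brevity):

```python
def print_char(code):
    """Render a character code in the proposed notation."""
    if 32 <= code < 127:
        return "#\\" + chr(code)          # printable ASCII as itself
    if 0 <= code <= 255:
        return "#\\%03d" % code           # three decimal digits, e.g. #\129
    raise ValueError("not an 8-bit character code")  # e.g. 257

def read_char(text):
    """Read a #\... form back into its character code."""
    body = text[2:]
    if len(body) == 3 and body.isdigit():
        return int(body)                  # #\129 -> 129
    return ord(body)                      # #\A -> 65

print(print_char(129))       # #\129
print(read_char("#\\129"))   # 129
```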


This is ideal, because it would allow uLisp to communicate with more complex devices - it could still handle such text without issue; it just couldn’t display it (which may be perfectly irrelevant depending on the application).

EDIT: It would also be interesting to show ALL non-displayable characters this way when the line editor is not enabled. That would essentially allow uLisp to be used to detect ‘weird’ characters on any terminal - for example, to show exactly what is signalled when backspace is pressed.


Regarding “creeping featuritis” (depending on the point of view), an interesting addition would be functions to turn characters into numbers and numbers into characters - that is, some way to make #\129 given the number 129 and vice versa. But of course the user could handle this with a simple lookup table, given that 256 numbers/characters is not a great many.


These are the functions code-char and char-code, respectively. They are from Common Lisp.

One way to think about this is that I’m saying bytes with the high bit set should be treated like control characters. You can have them in strings, and you can print them, but uLisp takes measures not to output them accidentally.
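Those ‘measures’ might look something like this sketch (Python for illustration; safe_repr is a hypothetical helper, not a uLisp function):

```python
def safe_repr(octets):
    """Render a string of raw bytes without ever sending a control or
    high-bit-set byte to the terminal: printable ASCII passes through,
    everything else is escaped as a three-digit decimal number."""
    out = []
    for b in octets:
        if 32 <= b < 127:
            out.append(chr(b))
        else:
            out.append("\\%03d" % b)
    return "".join(out)

print(safe_repr(b"a\xc3\xa6"))  # a\195\166
```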

Actually, in theory that’s what happens now. uLisp echoes all bytes received that aren’t a newline. The receiving terminal, however, doesn’t display them. Some can be configured to do that, though.


Thank you for the explanation. So using the functions you mention, one can already implement the character-code display…


This is now incorporated into uLisp 3.5, and I’ve documented it here:

Language reference - #\



Huh. On porting that solution to my WIP build system I noticed that you remembered to escape ␡. Good catch; the description I gave failed to account for that little oddity of ASCII.

For reference: Code point 127, all seven bits set, isn’t strictly speaking a control character. It’s technically supposed to be ignored, but Unix and its most popular terminals used it instead of backspace, so that can’t be relied upon. Everything I said was about code point 128 and above…


Yes, I pondered that. On LispWorks it is shown as Rubout:

> (code-char 127)
#\Rubout

Doing that would have taken several more bytes, so I just included it in the range shown as three decimal digits.


One little thing, which doesn’t matter much, but still…

{2}20354> #\256

This might get confusing if we ever add ‘proper’ UTF-8 support to the 32-bit platforms and have a need to refer to code points larger than eight bits. Don’t know if it’s worth it, though.


There isn’t currently any error checking on the three-digit numeric characters – there wasn’t room, at least on the Arduino Uno. I think this comes under the heading of “reserved for future use”.


I figured it was that; just wanted to make a note of it for future reference.