I wish to start by stressing that what I’m about to note is not a bug. That being said, uLisp has a slight problem. It assumes bytes and characters are the same thing, but lives in a world where that is decidedly not true. Witness:
20479> #\æ
Error: unknown character
20479> (length "æ")
2
20479> (char-code (char "æ" 0))
195
20479> (char-code (char "æ" 1))
166
20479>
Most terminals now default to UTF-8 rather than a simple 8-bit encoding. (This doesn't only affect us non-English folks: I can repeat the exercise using £ as the example - the codes are 194 and 163.) For the purposes uLisp is intended for and suited to, this is absolutely irrelevant, and there's really no reason to change the behaviour seen here.
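The byte values aren't arbitrary, by the way: a two-byte UTF-8 sequence keeps five payload bits in the lead byte and six in the continuation byte, and reassembling them is plain arithmetic (logand and the #b notation are, as far as I know, in the standard uLisp builds):

(+ (* 64 (logand 195 #b11111))   ; the five payload bits of the lead byte 11000011
   (logand 166 #b111111))        ; the six payload bits of the continuation byte 10100110

That evaluates to 230, i.e. #xE6, the code point of æ.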
But it should be specified, documented and detected. Specifically, uLisp should be defined as an 8-bit clean ASCII system, with the #\ character notation tied to a.) ASCII printable characters, b.) defined control character names and c.) some representation of raw byte values (I would suggest something on the pattern of #\xFF). If someone puts a literal high-bit-set byte after #\, a distinct error can be given - “character not in ASCII” or so - to prevent confusion; a rough sketch of that rule is at the end of this post. And one thing in particular should be avoided:
20479> (char "æ" 0)
#\�
20479> (char "æ" 1)
#\�
20479>
uLisp sent the terminal a byte that is, in reality, outside its character set. The terminal, set to receive UTF-8, saw an invalid byte and replaced it with the replacement character, U+FFFD. Sure, if I configure the terminal for ISO 8859-1 this stops being an issue - but that isn't the common case, and uLisp itself never interprets the high-bit-set bytes at all.
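To make the proposal concrete, here's a rough sketch, in uLisp itself, of the printer behaviour I'd expect. The names print-char-token and hex-digit are mine and nothing like this exists in uLisp today; the real change would of course be in the C printer, so treat this purely as a specification:

(defun hex-digit (n)
  ; map 0-15 to the characters 0-9 / A-F
  (code-char (if (< n 10) (+ n (char-code #\0)) (+ n (- (char-code #\A) 10)))))

(defun print-char-token (code)
  ; print the proposed #\ form of a byte value 0-255
  (princ "#") (princ (code-char 92))                         ; the leading #\
  (cond
   ((= code 10) (princ "Newline"))
   ((= code 13) (princ "Return"))
   ((= code 32) (princ "Space"))
   ((and (> code 32) (< code 127)) (princ (code-char code))) ; printable ASCII
   (t (princ "x")                                            ; raw byte: #\xNN
      (princ (hex-digit (ash code -4)))
      (princ (hex-digit (logand code 15))))))

With that, (print-char-token 195) gives #\xC3 instead of a raw byte the terminal can't decode, while (print-char-token 97) still gives #\a.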
As far as I can tell, the character notation is the only thing that needs to be changed - it’s the only place where 8-bit values can show up as part of uLisp’s syntax, rather than as a user-input string.
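And for completeness, the reader-side rule from above, sketched as Lisp. Again, char-token-code and hex-value are my names and the real reader lives in C, so this is only meant to pin down the behaviour; returning nil stands in for the “character not in ASCII” error:

(defun hex-value (c)
  ; value of an upper-case hex digit character (no validation, for brevity)
  (let ((n (char-code c)))
    (if (< n 58) (- n 48) (- n 55))))

(defun char-token-code (tok)
  ; tok is the text following #\ ; return the byte value it denotes, or nil
  (cond
   ((= (length tok) 1)                                  ; a.) printable ASCII only
    (let ((n (char-code (char tok 0))))
      (if (and (>= n 32) (< n 127)) n nil)))
   ((string= tok "Newline") 10)                         ; b.) named control characters
   ((string= tok "Return") 13)
   ((string= tok "Tab") 9)
   ((and (= (length tok) 3)                             ; c.) raw bytes as #\xNN
         (= (char-code (char tok 0)) (char-code #\x)))
    (+ (* 16 (hex-value (char tok 1))) (hex-value (char tok 2))))
   (t nil)))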