I was initially planning to have everything just use char values, but it turned out that I had to change more places to guard against overflow than to simply let the full width bleed all the way through.
Perhaps. The problem is that UTF-8 was designed to make Unicode transparent to 8-bit clean transports like UART. Most Unix terminals have started using UTF-8 as the default encoding. Right now, uLisp looks like it handles Unicode strings:
3071> (let ((str "🙂")) (format t str))
🙂
nil
Aside from the extent to which it confuses the cursor position, I mean. However:
3071> (length "æði")
5
3071> (char "æ" 0)
#\U+c3
3071> (char "æ" 1)
#\U+a6
æ is U+E6, and is in ISO 8859-1. I'd have to explicitly configure the terminal to use that encoding, though. Instead, UTF-8 makes it look like it just works, unless I specifically go looking. The highest code point that gets encoded as a single byte by the terminal is #x7F. If it gets sent a lone byte higher than that, it's probably going to replace it with an error character. The code I added sends an #\U+ escape for everything that's not ASCII, so that issue is avoided. (I might point out that £ is U+A3, if you want to see what happens.)
Text is not 8-bit nowadays.