Fix for buildstring()!


#1

I'm having a problem with UTF-8 strings, because the function buildstring() fills the high bits of the chars_t value with 1s for character values greater than 0x7F. The problem affects the 8/16-bit firmware.

I fixed the problem as follows (the changes were shown in bold):
void buildstring (chars_t ch, object **tail) {
  ch &= 0xFF;
  object *cell;
  if (cdr(*tail) == NULL) {
    cell = myalloc(); cdr(*tail) = cell;
  } else if (((*tail)->chars & 0xFF) == 0) {
    (*tail)->chars |= ch; return;
  } else {
    cell = myalloc(); car(*tail) = cell;
  }
  car(cell) = NULL; cell->chars = ch<<8; *tail = cell;
}


#2

Thanks for reporting this.

Can you give a minimal example that shows the problem? What function are you calling that uses buildstring()? Does this only affect the AVR version of uLisp?

Regards, David


#3

Oh! It's a basic, elementary example.

It shows up if I enter (defvar x "test"), or simply "test".

To debug it, I added the following code to buildstring():

void buildstring (char ch, object **tail) {

(*tail)->chars |= ch;
Serial.print(" ");
Serial.print((int32_t)ch, HEX);
Serial.print(" ");
Serial.println((int32_t)(*tail)->chars, HEX);
return;

car(cell) = NULL; cell->chars = ch<<8; *tail = cell;
Serial.print(" ");
Serial.print((int32_t)ch, HEX);
Serial.print(" ");
Serial.println((int32_t)(*tail)->chars, HEX);
}

The result:
1343> (defvar x "t 74 7400
e 65 7465
s 73 7300
t 74 7374
")
x

As you can see, the characters are packed into a 2-byte word. That's fine!

But if I enter text in UTF-8, I get this result:
"� FFFFFFD1 D100

‚ FFFFFF82 FF82

� FFFFFFD0 D000

µ FFFFFFB5 FFB5

� FFFFFFD1 D100

… because a char greater than 0x7F is sign-extended: the upper byte is filled with FF when the char is widened to a word!
So the instruction (*tail)->chars |= ch; does not work correctly.

To fix this problem I changed the type of ch from char to chars_t in the function header:
void buildstring ( chars_t ch , object **tail) {
and then cleared the upper byte:
ch &= 0xFF;

The problem occurs only on AVR!

On ESP, the char type is not filled with 1s on the left when it is widened:
3927> "⸮ D1 D1000000
⸮ 82 D1820000
⸮ D0 D182D000
⸮ B5 D182D0B5
⸮ D1 D1000000
⸮ 81 D1810000

However, I don't know how it behaves on other platforms, so my construction is universal:
void buildstring ( chars_t ch , object **tail) {
ch &= 0xFF;

}


#4

Now I have created a unified buildstring() for the AVR and ESP platforms:

void buildstring (chars_t ch, object **tail) {
  ch = ch << 8*(sizeof(chars_t)-1);
  object *cell;
  if (cdr(*tail) == NULL) {
    cell = myalloc(); cdr(*tail) = cell;
  } else {
    if (((*tail)->chars & 0xFF)==0) {
      chars_t mask = -1;
      while ((*tail)->chars & mask) {
        mask = mask >> 8;
        ch = ch >> 8;
      }
      (*tail)->chars = (*tail)->chars | ch; 
      return;
    }
    cell = myalloc(); car(*tail) = cell;
  }
  car(cell) = NULL; cell->chars = ch; *tail = cell;  
}

#5

Thanks! Can you give a test case that gives the wrong answer with the original version of buildstring(), but the correct answer with your version?


#6

OK! Original version:
uLisp 4.8a
1336> "test"
"test"

1336> "тест"
"⸮⸮⸮⸮⸮⸮⸮⸮"

My version:
uLisp 4.8a
1336> "test"
"test"

1336> "тест"
"тест"


#7

You can view this sample correctly only if you have Cyrillic fonts!


#8

Yes, I can see what you mean now! Thanks,
David


#9

My buildstring(), after optimization with AI, is clearly better:

void buildstring(chars_t ch, object **tail) {
    ch <<= (8u * (sizeof(chars_t) - 1u));

    if (cdr(*tail) == NULL || ((*tail)->chars & (chars_t)0xFF)) {
        object *cell = myalloc();
        if ((*tail)->chars & (chars_t)0xFF) car(*tail) = cell; else cdr(*tail) = cell;
        car(cell) = NULL; cell->chars = ch; *tail = cell;
        return;
    }

    chars_t mask = ~(chars_t)0;
    while ((*tail)->chars & mask) { mask >>= 8; ch >>= 8; }
    (*tail)->chars |= ch;
}