02/15/19, 07:34 AM   #6
merlight
AddOn Author
Join Date: Jul 2014
Posts: 671
Originally Posted by sirinsidiator
I admit I may not have been completely correct about everything I wrote, but the point still stands that it is not a bug, but just wrong assumptions being made.
Yes, but the wrong assumption is on the side of ESOLua's string matching implementation -- namely, that the input string is LATIN-1 encoded. The game works with UTF-8. To me that qualifies as a bug.

Originally Posted by sirinsidiator
Since the pattern classes do not support unicode, one would need to use the appropriate replacements in order to get the expected output:
Lua Code:
  local inStr = "1à1";
  for outStr in inStr:gmatch("[^\t-\r ]+") do
      d(outStr)
  end
There are different ways of not supporting Unicode. The C library does not support Unicode, yet it's safe to use with UTF-8 strings, because it doesn't make assumptions about bytes beyond the ASCII 0-127 range. ESOLua string matching deliberately assumes an 8-bit encoding, which hinders its usability with UTF-8 input. All the special character classes like %s, %w, %u are useless there: they match individual bytes inside multi-byte UTF-8 characters whenever those bytes happen to fall into the class in LATIN-1. So "à" (bytes 0xC3 0xA0 in UTF-8) matches "%u%s": in LATIN-1, 0xC3 is "Ã" (an uppercase letter) and 0xA0 is the no-break space.
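
To make the byte-level view concrete, here is a minimal sketch that dumps the UTF-8 bytes of "à" and tests them against explicit ASCII ranges instead of %u and %s. It uses plain print so it runs in a stock Lua interpreter; in-game you would use d() instead. The variable names are only for illustration.

```lua
-- "à" (U+00E0) as UTF-8, written with decimal escapes so the
-- encoding of the source file itself does not matter.
local s = "\195\160"

-- Collect every byte of the string as a hex literal.
local parts = {}
for i = 1, #s do
    parts[#parts + 1] = string.format("0x%02X", s:byte(i))
end
local hex = table.concat(parts, " ")
print(hex) -- 0xC3 0xA0

-- Explicit ASCII ranges never match bytes >= 0x80, so neither
-- byte of "à" is taken for an uppercase letter or whitespace.
local firstIsAsciiUpper = s:match("^[A-Z]") ~= nil
local secondIsAsciiSpace = s:match("^.[\t-\r ]") ~= nil
print(firstIsAsciiUpper, secondIsAsciiSpace) -- false   false
```

Both checks come out false regardless of how the interpreter classifies bytes above 127, which is the behaviour one would expect from %u and %s on UTF-8 input.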

Originally Posted by sirinsidiator
I am not sure if it would be a good idea to change the string library so it supports unicode, but doesn't follow the Lua documentation on the web anymore. Maybe they should instead add luautf8? That way we'd have a unicode enabled string library.
ESOLua's string.lower and string.upper have been replaced by UTF-8-aware implementations. It makes zero sense that these functions assume UTF-8, while string.match et al. assume LATIN-1 encoding. I'm not even asking for full Unicode support (although luautf8 would be nice). A good start would be if all functions in the string module made sane assumptions: if a function can handle UTF-8, go for it; otherwise stick with ASCII, i.e. don't assume LATIN-1 (or any other encoding that is not a superset of ASCII compatible with UTF-8).
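
Until something like that happens, a possible workaround is to spell out ASCII-only sets and use them in place of the % classes. The following is just a sketch: the ASCII table and the splitAsciiWhitespace helper are hypothetical names, not part of ESOLua or stock Lua.

```lua
-- Hypothetical ASCII-only stand-ins for %s, %u, %l, %d and %w.
-- Each entry is meant to be placed inside a [...] set.
local ASCII = {
    s = "\t-\r ",      -- tab, LF, VT, FF, CR, space
    u = "A-Z",
    l = "a-z",
    d = "0-9",
    w = "0-9A-Za-z",
}

-- Hypothetical helper: split on ASCII whitespace only, leaving
-- multi-byte UTF-8 characters intact.
local function splitAsciiWhitespace(input)
    local tokens = {}
    for token in input:gmatch("[^" .. ASCII.s .. "]+") do
        tokens[#tokens + 1] = token
    end
    return tokens
end

-- "1à1 foo<TAB>bar", with "à" written as its two UTF-8 bytes.
local tokens = splitAsciiWhitespace("1\195\1601 foo\tbar")
print(#tokens) -- 3

-- Capitalized ASCII words; the bytes of "à" are skipped.
local words = {}
local pattern = "[" .. ASCII.u .. "][" .. ASCII.l .. "]*"
for word in ("Hello \195\160 World"):gmatch(pattern) do
    words[#words + 1] = word
end
print(table.concat(words, ",")) -- Hello,World
```

Because the sets only name bytes in the 0-127 range, the result does not depend on which 8-bit encoding the matcher assumes.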