View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0020431 | mantisbt | db schema | public | 2015-12-28 05:22 | 2022-07-03 19:58 |
Reporter | dregad | Assigned To | dregad | ||
Priority | normal | Severity | minor | Reproducibility | N/A |
Status | assigned | Resolution | open | ||
Target Version | 2.26.0 | ||||
Summary | 0020431: Use utf8mb4 charset for new MySQL installations | ||||
Description | We currently create the database with 'utf8' charset and 'general_ci' collation. In MySQL, utf8 charset uses up to 3 bytes, which means some characters can't be stored properly. Since MySQL 5.5.3 the 'utf8mb4' charset is available, and does not have this limitation. We should use utf8mb4 for new installation. In addition, we default to 'general' collation which sometimes yield incorrect sort order for special (multibyte) characters. See MySQL documentation [1] for examples and more explanation. [1] http://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html | ||||
Tags | schema | ||||
related to | 0008017 | closed | dregad | Increase the size of the username field |
parent of | 0020465 | closed | dregad | Reduce size of username and email fields to allow utf8mb4 charset |
has duplicate | 0025691 | closed | atrol | UTF8_mb4 allows unicode emoji's without own grafics |
related to | 0021101 | closed | dregad | Issues with emoji's are truncated before getting saved |
related to | 0021841 | closed | dregad | Minimum requirements for 2.x releases |
related to | 0023549 | closed | atrol | Entering Emojis in comments with a user mention crashes with an error |
Switching to utf8mb4 charset may not be as simple as I thought initially... MySQL limits the maximum index key size depending on the engine used:
Considering 4 bytes per characters, the maximum length of a character field used as index key is:
Schema update step 196 updated the user table's username field to 255 chars in length (see 0008017). Converting the table to utf8mb4 or trying to create it from scratch with that charset triggers the following error 1071: Specified key was too long; max key length is 767 bytes We have the following options:
Option 3 is probably acceptable, since it seems highly unlikely that we would have email addresses longer than 191 chars to begin with, and even less to have 2 of them differ only on or after the 192nd char. Nevertheless there is still the technical possiblity that this situation will occur. I would recommend number 2 as the safest option, and assuming we don't really need 255 chars for the username. I don't think we do; the original requirement from 0008017 was to allow storing email addresses as user identifier; 255 was set arbitrarily. [1] http://dev.mysql.com/doc/refman/5.7/en/myisam-storage-engine.html |
|
Reminder sent to: atrol, vboctor Your opinion on 0020431:0052209 would be appreciated. |
|
Looking at the code under admin/check/ it seems that our current minimum requirement for MySQL is 5.0.8 and hence if we are planning to use utf8mb4, then we need to up such requirement to 5.5.3. Not sure how common is 5.5.3. Though if we end up waiting, then we could make the move directly to 5.7.7 where it wouldn't be necessary to reduce the field size. I don't have a problem with reducing the max size for username. I assume based on the above, this is the only offending field. |
|
To wait for 5.7.7 is no option. Ubuntu LTS versions that come with 5 years of security updates [1] Red Hat Enterprise Linux is quite another story [2] So requiring 5.5.3 would rule out Red Hat. It seems we should stay with utf8 at the moment. [1] http://distrowatch.com/table.php?distribution=ubuntu |
|
Thanks to both of you for your comments. Actually my intention was not to enforce utf8mb4 across the board, but rather to detect the mysql version at install time, and define the charset accordingly (utf8 if < 5.5.3, utf8mb4 otherwise), allowing instances running recent software to benefit from better unicode support.. We can keep MySQL 5.0.8 as minimal support, even though that's effectively end-of-life since 2013. With regards to MariaDB, at least until version 5.5 it's supposed to be a "drop-in replacement" for MySQL, so (in theory) RHEL 7.2 should be just fine and use utf8mb4. Not sure how things will evolve with MariaDB 10.x / MySQL 5.7 though, but that's another topic. |
|
Pull request https://github.com/mantisbt/mantisbt/pull/699 |
|
@dregad How about the option is using a prefix index and updating the API(s) that looks up by username to potentially handle more than one match? That is assuming the DBMS won't filter these independent of the index anyway. Which I think it should. |
|
The problem is not with filtering, it is about ensuring the key's uniqueness, which can't be guaranteed with a prefix index. Updating the API would not resolve this issue, and sounds like overengineering for an issue that is anyway quite unlikely to occur (have you ever heard of a 191-char long e-mail address ?) |
|
Interesting note on Drupal's approach to handle index size limitation on utf8mb4 fields https://www.drupal.org/node/1314214 |
|
In order to (greatly) simplify the implementation of utf8mb4 charset support, including upgrade steps and future maintenance, I would like to propose that we increase the minimum version requirement for MySQL to 5.5.3 in 2.0.x. 5.5.3 was released in March 2010. It was actually a milestone release, the General availability is 5.5.8 (released 2010-12-03), but I don't see the need to require a higher version than we actually need. As pointed out by atrol, this will prevent some distros from running our software, but I think that's an acceptable trade-off. For the record, Drupal followed a similar approach, as documented in the link I referenced in my previous post. |
|
I would prefer this instead of implementing workarounds to fix 0021101. |
|
I agree, but I also believe that we do need a solution for 1.x as well, since depending on MySQL settings, use of any 4-byte char will either result in
The workaround I propose in 0021101 / PR https://github.com/mantisbt/mantisbt/pull/797 is fairly simple, and prevents an error which is more and more likely to occur as people using Mantis on smartphones are used to inserting emojis. Of course will need to be reverted once utf8mb4 support has been implemented. |
|
This blog post suggests a different and interesting workaround compared to the pull requests referred to in the previous post: https://roartindon.blogspot.de/2015/04/hacking-utf16-to-work-around-mysqls.html |
|
Changing target version since we'll have MySQL 5.5.3 as minimum requirement there. |
|
Did you consider 0020431:0052214 ? |
|
There is some generic info about UTF8mb4 here: https://www.everipedia.com/UTF8/ |
|
@travm1 what is your point ? |
|
cool that mantis will to that out of the box utf8mb4, without manuell database stuff with this quite cool mysql, mssql, postgre tool https://www.heidisql.com/ :-) now mantis is able to process unicode... stuff like this is possible https://unicode.org/emoji/charts/full-emoji-list.html |
|
stuff like http://www.sonderzeichen.de/Emoticons-Alphabet.html this html charakter also works with 😁 for example ... � ��� i'll do a ticket, for supporting input with smileys... |
|
is the bugstracker mysql database really under UTF8mb4 correctly ? in my previous ticket are the correct unicode smileys, normaly they are working... we all see not the smiley but � |
|
So my mantis mysql database is very old, initially created many many years ago. It still uses |
|
@rogueresearch, if you are going to migrate your database from |
|
MantisBT: master-1.3.x 805ef0cb 2016-06-18 12:42 Details Diff |
New database API function db_mysql_fix_utf8() This new function replaces 4-byte UTF-8 chars by Unicode U+FFFD character for MySQL databases. This is a temporary workaround to avoid data getting truncated on MySQL databases using native utf8 encoding which only supports 3 bytes chars, until we're able to support utf8mb4 charset (see issue 0020431). Fixes 0021101 |
Affected Issues 0020431, 0021101 |
|
mod - core/database_api.php | Diff File |