View Issue Details

IDProjectCategoryView StatusLast Update
0020431mantisbtdb schemapublic2017-10-16 08:35
ReporterdregadAssigned Todregad 
PrioritynormalSeverityminorReproducibilityN/A
Status assignedResolutionopen 
Product Version 
Target Version2.8.0Fixed in Version 
Summary0020431: Use utf8mb4 charset for new MySQL installations
Description

We currently create the database with 'utf8' charset and 'general_ci' collation.

In MySQL, utf8 charset uses up to 3 bytes, which means some characters can't be stored properly. Since MySQL 5.5.3 the 'utf8mb4' charset is available, and does not have this limitation.

We should use utf8mb4 for new installation.

In addition, we default to 'general' collation which sometimes yield incorrect sort order for special (multibyte) characters. See MySQL documentation [1] for examples and more explanation.

[1] http://dev.mysql.com/doc/refman/5.7/en/charset-unicode-sets.html

TagsNo tags attached.

Relationships

related to 0008017 closeddregad Increase the size of the username field 
parent of 0020465 closeddregad Reduce size of username and email fields to allow utf8mb4 charset 
related to 0021101 closeddregad Issues with emoji's are truncated before getting saved 
related to 0021841 closeddregad Minimum requirements for 2.x releases 

Activities

dregad

dregad

2015-12-30 18:21

developer   ~0052209

Switching to utf8mb4 charset may not be as simple as I thought initially...

MySQL limits the maximum index key size depending on the engine used:

  • MyISAM: 1000 bytes [1]
  • InnoDB: 767 bytes [2] in MySQL < 5.7.7 (unless 'innodb_large_prefix' option is enabled), 3072 bytes in 5.7.7 and above (or when innodb_large_prefix == ON)

Considering 4 bytes per characters, the maximum length of a character field used as index key is:

  • MyISAM: 1000 / 4 = 250 chars
  • InnoDB: 767 / 4 = 191 chars

Schema update step 196 updated the user table's username field to 255 chars in length (see 0008017). Converting the table to utf8mb4 or trying to create it from scratch with that charset triggers the following error

1071: Specified key was too long; max key length is 767 bytes

We have the following options:

  1. forget about utf8mb4, keep using 3-byte utf8 (eventually, someone will face issues as they try to store 4-byte unicode chars, e.g. emoji or some CJK characters)
  2. reduce size of username column from 255 to 191 chars
  3. change the idx_user_username index to become a 'prefix' index, i.e. to only cover the first 191 chars of the username field

Option 3 is probably acceptable, since it seems highly unlikely that we would have email addresses longer than 191 chars to begin with, and even less to have 2 of them differ only on or after the 192nd char. Nevertheless there is still the technical possiblity that this situation will occur.

I would recommend number 2 as the safest option, and assuming we don't really need 255 chars for the username. I don't think we do; the original requirement from 0008017 was to allow storing email addresses as user identifier; 255 was set arbitrarily.

[1] http://dev.mysql.com/doc/refman/5.7/en/myisam-storage-engine.html
[2] http://dev.mysql.com/doc/refman/5.7/en/innodb-restrictions.html

dregad

dregad

2015-12-30 18:23

developer   ~0052210

Reminder sent to: atrol, vboctor

Your opinion on 0020431:0052209 would be appreciated.

vboctor

vboctor

2015-12-31 05:25

manager   ~0052211

Looking at the code under admin/check/ it seems that our current minimum requirement for MySQL is 5.0.8 and hence if we are planning to use utf8mb4, then we need to up such requirement to 5.5.3. Not sure how common is 5.5.3. Though if we end up waiting, then we could make the move directly to 5.7.7 where it wouldn't be necessary to reduce the field size.

I don't have a problem with reducing the max size for username. I assume based on the above, this is the only offending field.

atrol

atrol

2015-12-31 12:25

developer   ~0052214

To wait for 5.7.7 is no option.
Even the next version of LTS Ubuntu comes with an older version.

Ubuntu LTS versions that come with 5 years of security updates [1]
12.04 5.5.22
14.04 5.5.35
16.04 5.6.27

Red Hat Enterprise Linux is quite another story [2]
RHEL-5.1 5.0.95
RHEL-6.7 5.1.73
RHEL-7.2 no longer MySQL but MariaDB 5.5.44

So requiring 5.5.3 would rule out Red Hat.

It seems we should stay with utf8 at the moment.

[1] http://distrowatch.com/table.php?distribution=ubuntu
[2] http://distrowatch.com/table.php?distribution=redhat

dregad

dregad

2015-12-31 18:21

developer   ~0052215

Thanks to both of you for your comments.

Actually my intention was not to enforce utf8mb4 across the board, but rather to detect the mysql version at install time, and define the charset accordingly (utf8 if < 5.5.3, utf8mb4 otherwise), allowing instances running recent software to benefit from better unicode support..

We can keep MySQL 5.0.8 as minimal support, even though that's effectively end-of-life since 2013.

With regards to MariaDB, at least until version 5.5 it's supposed to be a "drop-in replacement" for MySQL, so (in theory) RHEL 7.2 should be just fine and use utf8mb4.

Not sure how things will evolve with MariaDB 10.x / MySQL 5.7 though, but that's another topic.

dregad

dregad

2016-01-01 19:25

developer   ~0052230

Pull request https://github.com/mantisbt/mantisbt/pull/699

vboctor

vboctor

2016-01-03 20:43

manager   ~0052239

Last edited: 2016-01-03 20:46

View 2 revisions

@dregad How about the option is using a prefix index and updating the API(s) that looks up by username to potentially handle more than one match? That is assuming the DBMS won't filter these independent of the index anyway. Which I think it should.

dregad

dregad

2016-01-04 10:39

developer   ~0052240

The problem is not with filtering, it is about ensuring the key's uniqueness, which can't be guaranteed with a prefix index.

Updating the API would not resolve this issue, and sounds like overengineering for an issue that is anyway quite unlikely to occur (have you ever heard of a 191-char long e-mail address ?)

dregad

dregad

2016-06-13 06:17

developer   ~0053358

Interesting note on Drupal's approach to handle index size limitation on utf8mb4 fields https://www.drupal.org/node/1314214

dregad

dregad

2016-06-18 16:32

developer   ~0053413

In order to (greatly) simplify the implementation of utf8mb4 charset support, including upgrade steps and future maintenance, I would like to propose that we increase the minimum version requirement for MySQL to 5.5.3 in 2.0.x.

5.5.3 was released in March 2010. It was actually a milestone release, the General availability is 5.5.8 (released 2010-12-03), but I don't see the need to require a higher version than we actually need.

As pointed out by atrol, this will prevent some distros from running our software, but I think that's an acceptable trade-off.

For the record, Drupal followed a similar approach, as documented in the link I referenced in my previous post.

atrol

atrol

2016-06-18 17:12

developer   ~0053416

we increase the minimum version requirement for MySQL to 5.5.3 in 2.0.x.

I would prefer this instead of implementing workarounds to fix 0021101.
Maybe target for 2.1.x if someone think it's to early to enforce 5.5.3.

dregad

dregad

2016-06-18 17:35

developer   ~0053418

I would prefer this instead of implementing workarounds to fix 0021101.

I agree, but I also believe that we do need a solution for 1.x as well, since depending on MySQL settings, use of any 4-byte char will either result in

  • the offending char and everything after it to be silently truncated
  • a DB error to occur, preventing data from being saved to the DB

The workaround I propose in 0021101 / PR https://github.com/mantisbt/mantisbt/pull/797 is fairly simple, and prevents an error which is more and more likely to occur as people using Mantis on smartphones are used to inserting emojis.

Of course will need to be reverted once utf8mb4 support has been implemented.

j_schultz

j_schultz

2016-11-02 10:37

reporter   ~0054374

This blog post suggests a different and interesting workaround compared to the pull requests referred to in the previous post: https://roartindon.blogspot.de/2015/04/hacking-utf16-to-work-around-mysqls.html
Rather than replacing everything with <?>, surrogate pairs are used. It requires some unicode decoding code so maybe it's not worth the effort, but still an interesting idea.

dregad

dregad

2016-11-24 09:01

developer   ~0054574

Changing target version since we'll have MySQL 5.5.3 as minimum requirement there.

atrol

atrol

2016-11-24 09:47

developer   ~0054577

since we'll have MySQL 5.5.3 as minimum requirement there

Did you consider 0020431:0052214 ?
Added also note to 0021841:0054575

travm1

travm1

2017-04-07 00:09

reporter   ~0056429

There is some generic info about UTF8mb4 here: https://www.everipedia.com/UTF8/

dregad

dregad

2017-04-07 03:21

developer   ~0056430

There is some generic info about UTF8mb4 here: https://www.everipedia.com/UTF8/

@travm1 what is your point ?

Related Changesets

MantisBT: master-1.3.x 805ef0cb

2016-06-18 16:42:22

dregad

Details Diff
New database API function db_mysql_fix_utf8()

This new function replaces 4-byte UTF-8 chars by Unicode U+FFFD
character for MySQL databases.

This is a temporary workaround to avoid data getting truncated on MySQL
databases using native utf8 encoding which only supports 3 bytes chars,
until we're able to support utf8mb4 charset (see issue 0020431).

Fixes 0021101
mod - core/database_api.php Diff File

Issue History

Date Modified Username Field Change
2015-12-28 05:22 dregad New Issue
2015-12-30 18:21 dregad Note Added: 0052209
2015-12-30 18:21 dregad Relationship added related to 0008017
2015-12-30 18:23 dregad Note Added: 0052210
2015-12-31 05:25 vboctor Note Added: 0052211
2015-12-31 12:25 atrol Note Added: 0052214
2015-12-31 18:21 dregad Note Added: 0052215
2015-12-31 18:27 dregad Relationship added parent of 0020465
2015-12-31 18:28 dregad Assigned To => dregad
2015-12-31 18:28 dregad Status new => assigned
2015-12-31 18:28 dregad Category installation => db schema
2015-12-31 18:28 dregad Target Version => 1.3.0-rc.2
2016-01-01 19:25 dregad Note Added: 0052230
2016-01-03 20:43 vboctor Note Added: 0052239
2016-01-03 20:46 vboctor Note Edited: 0052239 View Revisions
2016-01-04 10:39 dregad Note Added: 0052240
2016-06-12 02:37 atrol Target Version 1.3.0-rc.2 => 1.3.0
2016-06-13 06:10 dregad Relationship added related to 0021101
2016-06-13 06:17 dregad Note Added: 0053358
2016-06-18 16:32 dregad Note Added: 0053413
2016-06-18 17:12 atrol Note Added: 0053416
2016-06-18 17:35 dregad Note Added: 0053418
2016-07-03 05:23 dregad Changeset attached => MantisBT master-1.3.x 805ef0cb
2016-07-10 07:57 atroladmin Target Version 1.3.0 => 1.3.1
2016-08-28 10:37 atrol Target Version 1.3.1 => 1.3.2
2016-10-02 19:36 atrol Target Version 1.3.2 => 1.3.3
2016-10-30 23:23 vboctor Target Version 1.3.3 => 1.3.4
2016-11-02 06:09 dregad Relationship added related to 0021841
2016-11-02 10:37 j_schultz Note Added: 0054374
2016-11-24 09:01 dregad Target Version 1.3.4 => 2.0.0-rc.2
2016-11-24 09:01 dregad Note Added: 0054574
2016-11-24 09:47 atrol Note Added: 0054577
2016-11-27 08:20 dregad Target Version 2.0.0-rc.2 => 2.0.0
2016-12-30 15:56 vboctor Target Version 2.0.0 => 2.0.1
2017-02-01 22:49 vboctor Target Version 2.0.1 => 2.2.0
2017-02-26 21:19 vboctor Target Version 2.2.0 => 2.3.0
2017-04-01 00:20 vboctor Target Version 2.3.0 => 2.4.0
2017-04-07 00:09 travm1 Note Added: 0056429
2017-04-07 03:21 dregad Note Added: 0056430
2017-04-30 14:53 vboctoradmin Target Version 2.4.0 => 2.5.0
2017-06-04 16:19 atrol Target Version 2.5.0 => 2.6.0
2017-09-03 18:49 vboctor Target Version 2.6.0 => 2.7.0
2017-10-08 23:55 vboctor Target Version 2.7.0 => 2.8.0