There is a Unicode mode in JavaScript regular expressions
Unicode is such an interesting topic, and it feels like there are new things to discover every day. Today was one of these days. I was reading a blog post and came across the u flag. I hadn't seen this regular expression flag before, and I found myself reading Axel's chapter in "Exploring ES6" on that topic.
So what's this u flag?
In JavaScript, we've got the "problem" that strings are represented in UTF-16, which means that not every character can be represented with a single code unit. This behavior leads to surprising length properties for certain strings, and it becomes tricky when you deal with surrogate pairs.
In short: a surrogate pair is two UTF-16 code units that together represent a single character (one Unicode code point).
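To make the code unit vs. code point distinction concrete, here is a minimal sketch that inspects one character outside the Basic Multilingual Plane. The variable names are only illustrative; the string methods are standard JavaScript.

```javascript
// U+1F42A (dromedary camel) does not fit into a single UTF-16 code unit.
const camelChar = '\u{1F42A}';

// length counts UTF-16 code units, not characters.
const codeUnitCount = camelChar.length; // 2

// The two halves of the surrogate pair, as hex strings.
const highSurrogate = camelChar.charCodeAt(0).toString(16); // 'd83d'
const lowSurrogate = camelChar.charCodeAt(1).toString(16); // 'dc2a'

// codePointAt reads the whole pair as one code point.
const codePoint = camelChar.codePointAt(0).toString(16); // '1f42a'

// The string iterator walks code points, so spreading yields one element.
const codePointCount = [...camelChar].length; // 1
```

Methods that predate ES6 (length, charCodeAt) see two code units, while the ES6 additions (codePointAt, string iteration) see a single code point.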
If you want to learn more about Unicode or Regular Expressions in JavaScript, have a look at these two talks:
Should the period (.) in a regular expression match a character that needs two code units, then? This is where the u flag comes into play.
Let's have a look at an example:
const emoji = '\u{1F60A}'; // "smiling face with smiling eyes" / "😊"
emoji.length // 2 -> it's a surrogate pair
/^.$/.test(emoji) // false
/^.$/u.test(emoji) // true
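Another way to see the difference is to count how many times the period matches. The following is a small sketch with illustrative variable names; it assumes nothing beyond standard regular expression behavior.

```javascript
const smiley = '\u{1F60A}';

// Without the u flag, the period matches each code unit separately.
const matchesWithoutFlag = smiley.match(/./g).length; // 2

// With the u flag, the period matches the whole code point once.
const matchesWithFlag = smiley.match(/./gu).length; // 1

// Consequently, "exactly two characters" only holds without the flag.
const twoDots = /^..$/.test(smiley); // true
const twoDotsUnicode = /^..$/u.test(smiley); // false
```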
Unicode mode (the u flag, as in /.../u) also enables code point escape sequences such as \u{1F42A} in regular expressions, which helps when dealing with surrogate pairs.
const camel = '\u{1F42A}'; // "🐪"
/\u{1F42A}/.test(camel); // false -> without the u flag, \u{...} is not a code point escape
/\uD83D\uDC2A/.test(camel); // true -> matches the two surrogate code units
/\u{1F42A}/u.test(camel); // true
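The same applies once a character outside the Basic Multilingual Plane appears inside a character class or in front of a quantifier. Here is a short sketch, with illustrative variable names, of how the two modes differ:

```javascript
const dromedary = '\u{1F42A}';

// Without the u flag, the class holds two separate code units and
// matches only one of them at a time.
const classWithoutFlag = /^[\uD83D\uDC2A]$/.test(dromedary); // false

// With the u flag, the class holds a single code point.
const classWithFlag = /^[\u{1F42A}]$/u.test(dromedary); // true

// Without the u flag, {2} applies only to the low surrogate \uDC2A.
const repeatWithoutFlag = /^\uD83D\uDC2A{2}$/.test(dromedary + dromedary); // false

// With the u flag, {2} applies to the whole code point.
const repeatWithFlag = /^\u{1F42A}{2}$/u.test(dromedary + dromedary); // true

// The unicode property reports whether the flag is set.
const flagIsSet = /./u.unicode; // true
```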
Unicode mode makes it easier to deal with Unicode in regular expressions. Read Axel's book chapter or Mathias Bynens' article on the topic if you want to learn more. Have fun!