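Unicode character properties
Since PHP 4.4.0 and 5.1.0, three additional escape sequences to match generic character types are available when UTF-8 mode is selected. They are: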
- \p{xx}: a character with the xx property
- \P{xx}: a character without the xx property
- \X: an extended Unicode sequence
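The property names represented by xx above are limited to the Unicode general category properties. Each character has exactly one such property, specified by a two-letter abbreviation. For compatibility with Perl, negation can be specified by placing a circumflex (^) after the opening curly bracket. For example, \p{^Lu} is the same as \P{Lu}.
If only one letter is specified with \p or \P, it includes all the properties that start with that letter. In this case the curly brackets in the escape sequence are optional; these two examples have the same effect: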
\p{L}
\pL
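The forms described above can be tried out with a short sketch like the following. It assumes PHP 7 or later (for the "\u{...}" string escapes) and a PCRE build with Unicode property support; the variable names are only illustrative.

```php
<?php
// Subjects are written with \u{...} escapes: U+00C9 is "É", U+00E9 is "é".
// Every pattern uses the "u" modifier so the subject is read as UTF-8.
$upper        = preg_match('/^\p{Lu}$/u', "\u{00C9}");          // has property Lu
$notUpper     = preg_match('/^\P{Lu}$/u', "\u{00E9}");          // lacks property Lu
$negatedBrace = preg_match('/^\p{^Lu}$/u', "\u{00E9}");         // same as \P{Lu}
$braces       = preg_match('/^\p{L}+$/u', "\u{00C9}\u{00E9}");  // any letter
$noBraces     = preg_match('/^\pL+$/u', "\u{00C9}\u{00E9}");    // short for \p{L}

var_dump($upper, $notUpper, $negatedBrace, $braces, $noBraces);
```

Each call should report a match, since \p{^Lu} and \P{Lu} are interchangeable and \pL is shorthand for \p{L}.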
| Property | Matches | Notes |
|---|---|---|
| C | Other | |
| Cc | Control | |
| Cf | Format | |
| Cn | Unassigned | |
| Co | Private use | |
| Cs | Surrogate | |
| L | Letter | Includes the following properties: Ll, Lm, Lo, Lt and Lu. |
| Ll | Lower case letter | |
| Lm | Modifier letter | |
| Lo | Other letter | |
| Lt | Title case letter | |
| Lu | Upper case letter | |
| M | Mark | |
| Mc | Spacing mark | |
| Me | Enclosing mark | |
| Mn | Non-spacing mark | |
| N | Number | |
| Nd | Decimal number | |
| Nl | Letter number | |
| No | Other number | |
| P | Punctuation | |
| Pc | Connector punctuation | |
| Pd | Dash punctuation | |
| Pe | Close punctuation | |
| Pf | Final punctuation | |
| Pi | Initial punctuation | |
| Po | Other punctuation | |
| Ps | Open punctuation | |
| S | Symbol | |
| Sc | Currency symbol | |
| Sk | Modifier symbol | |
| Sm | Mathematical symbol | |
| So | Other symbol | |
| Z | Separator | |
| Zl | Line separator | |
| Zp | Paragraph separator | |
| Zs | Space separator | |
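As a rough illustration of how the two-letter properties in the table relate to their one-letter parents, the sketch below checks a handful of sample characters against both. The sample characters and the $samples/$results names are chosen here for illustration only; PHP 7 or later is assumed.

```php
<?php
// Hypothetical sample characters, one per property of interest.
$samples = [
    'Sc' => '$',         // currency symbol
    'Sm' => '+',         // mathematical symbol
    'Pd' => '-',         // dash punctuation (hyphen-minus)
    'Nd' => "\u{0663}",  // ARABIC-INDIC DIGIT THREE
    'Zs' => "\u{00A0}",  // NO-BREAK SPACE
    'Lo' => "\u{4E2D}",  // a CJK ideograph
];

$results = [];
foreach ($samples as $property => $char) {
    $results[$property] = [
        // The two-letter property itself, e.g. \p{Sc}.
        preg_match('/^\p{' . $property . '}$/u', $char),
        // Its one-letter parent, e.g. \p{S}.
        preg_match('/^\p{' . $property[0] . '}$/u', $char),
    ];
}
print_r($results);
```

Every entry should hold two matches, because a character with property Sc also matches S, one with Nd also matches N, and so on.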
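Extended properties such as InMusicalSymbols are not supported by PCRE.
Specifying case-insensitive (caseless) matching does not affect these escape sequences. For example, \p{Lu} always matches only upper case letters.
Sets of Unicode characters are defined as belonging to certain scripts. A character from one of these sets can be matched using a script name. For example: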
- \p{Greek}
- \P{Han}
Characters that are not part of an identified script are lumped together as Common. The current list of scripts is:
| Arabic | Armenian | Avestan | Balinese | Bamum | |
| Batak | Bengali | Bopomofo | Brahmi | Braille | |
| Buginese | Buhid | Canadian_Aboriginal | Carian | Chakma | |
| Cham | Cherokee | Common | Coptic | Cuneiform | |
| Cypriot | Cyrillic | Deseret | Devanagari | Egyptian_Hieroglyphs | |
| Ethiopic | Georgian | Glagolitic | Gothic | Greek | |
| Gujarati | Gurmukhi | Han | Hangul | Hanunoo | |
| Hebrew | Hiragana | Imperial_Aramaic | Inherited | Inscriptional_Pahlavi | |
| Inscriptional_Parthian | Javanese | Kaithi | Kannada | Katakana | |
| Kayah_Li | Kharoshthi | Khmer | Lao | Latin | |
| Lepcha | Limbu | Linear_B | Lisu | Lycian | |
| Lydian | Malayalam | Mandaic | Meetei_Mayek | Meroitic_Cursive | |
| Meroitic_Hieroglyphs | Miao | Mongolian | Myanmar | New_Tai_Lue | |
| Nko | Ogham | Old_Italic | Old_Persian | Old_South_Arabian | |
| Old_Turkic | Ol_Chiki | Oriya | Osmanya | Phags_Pa | |
| Phoenician | Rejang | Runic | Samaritan | Saurashtra | |
| Sharada | Shavian | Sinhala | Sora_Sompeng | Sundanese | |
| Syloti_Nagri | Syriac | Tagalog | Tagbanwa | Tai_Le | |
| Tai_Tham | Tai_Viet | Takri | Tamil | Telugu | |
| Thaana | Thai | Tibetan | Tifinagh | Ugaritic | |
| Vai | Yi |
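Script names from the list above can be used in the same way as the general category properties. The following is a minimal sketch, assuming PHP 7 or later and a sample string made up for this example:

```php
<?php
// "PHP αβγ 漢字", written with escapes to keep the source ASCII-only.
$text = "PHP \u{03B1}\u{03B2}\u{03B3} \u{6F22}\u{5B57}";

preg_match_all('/\p{Greek}+/u', $text, $greek);
preg_match_all('/\p{Han}+/u', $text, $han);
preg_match_all('/\p{Latin}+/u', $text, $latin);

// The two spaces do not belong to a specific script, so they count as Common.
$commonCount = preg_match_all('/\p{Common}/u', $text);

print_r([$greek[0], $han[0], $latin[0], $commonCount]);
```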
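The \X escape matches any number of Unicode characters that form an extended Unicode sequence. \X is equivalent to (?>\PM\pM*)
That is, it matches a character without the "mark" property, followed by zero or more characters with the "mark" property, and treats the sequence as an atomic group (see below). Characters with the "mark" property are typically accents that affect the preceding character.
Matching characters by Unicode property is not fast, because PCRE has to search a structure that contains data for over fifteen thousand characters. That is why the traditional escape sequences such as \d and \w do not use Unicode properties in PCRE.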
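To make the difference between a single code point and an extended sequence concrete, here is a small sketch. It assumes PHP 7 or later; the sample string (a base letter followed by a combining accent) is chosen for illustration.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT, then a plain "a".
$text = "e\u{0301}a";

// A dot in UTF-8 mode matches one code point at a time.
$codePoints = preg_match_all('/./u', $text);

// \X keeps the base character and its following marks together.
$sequences = preg_match_all('/\X/u', $text, $m);

var_dump($codePoints, $sequences, $m[0][0] === "e\u{0301}");
```

The string holds three code points but only two extended sequences, because the accent is grouped with the "e" it modifies.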
To select UTF-8 mode for the additional escape sequences (\p{xx}, \P{xx}, and \X), use the "u" modifier (see http://php.net/manual/en/reference.pcre.pattern.modifiers.php).
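A quick sketch of what the "u" modifier changes, using a German sharp S as the sample character. It assumes PHP 7 or later and a UTF-8 encoded subject; the variable names are illustrative.

```php
<?php
$sharpS = "\u{00DF}"; // "ß", stored in UTF-8 as the two bytes 0xC3 0x9F

// Without "u" the subject is treated as two one-byte characters,
// and byte 0x9F falls in the range of control characters.
$withoutU = preg_match('/\p{Cc}/', $sharpS);

// With "u" the subject is one code point, U+00DF, a lower case letter.
$withU   = preg_match('/\p{Cc}/u', $sharpS);
$isLower = preg_match('/^\p{Ll}$/u', $sharpS);

var_dump($withoutU, $withU, $isLower);
```

Without the modifier the property test is applied to individual bytes, which is why a letter can appear to be a control character.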
I wondered why a German sharp S (ß) was marked as a control character by \p{Cc}, and it took me a while to properly read the first sentence: "Since 5.1.0, three additional escape sequences to match generic character types are available when UTF-8 mode is selected." :-$ and then to find out how to do so.
My country, Vietnam, has its own alphabet: http://en.wikipedia.org/wiki/Vietnamese_alphabet I hope PHP will support Vietnamese better.
An excellent article explaining all these properties can be found here: http://www.regular-expressions.info/unicode.html
For those who wonder: 'letter_titlecase' applies to digraphs/trigraphs, where capitalization involves only the first letter. For example, there are three codepoints for the "LJ" digraph in Unicode:
- uppercase "LJ": U+01C7
- titlecase "Lj": U+01C8
- lowercase "lj": U+01C9
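The three codepoints can be checked against the Lu, Lt and Ll properties with a sketch like this one (PHP 7 or later assumed; the array and variable names are illustrative):

```php
<?php
$digraphs = [
    'Lu' => "\u{01C7}", // upper case "LJ"
    'Lt' => "\u{01C8}", // title case "Lj"
    'Ll' => "\u{01C9}", // lower case "lj"
];

$matches = [];
foreach ($digraphs as $property => $char) {
    // Each codepoint should match only its own case property.
    $matches[$property] = preg_match('/^\p{' . $property . '}$/u', $char);
}

// The title case form is neither Lu nor Ll.
$titleIsUpper = preg_match('/\p{Lu}/u', $digraphs['Lt']);
$titleIsLower = preg_match('/\p{Ll}/u', $digraphs['Lt']);

var_dump($matches, $titleIsUpper, $titleIsLower);
```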
If you are working with older environments, first check whether the installed version of PCRE supports the Unicode directives described above:
<?php
// Need to check PCRE version because some environments are
// running older versions of the PCRE library
// (run in *nix environment `pcretest -C`)
$allowInternational = false;
if (defined('PCRE_VERSION')) {
    if (intval(PCRE_VERSION) >= 7) { // constant available since PHP 5.2.4
        $allowInternational = true;
    }
}
?>
Now you can use a fallback regex (e.g. "/[a-z]/i") when the PCRE library version is too old or not available.
These properties are usually only available if PCRE is compiled with "--enable-unicode-properties".
If you want to match any word but also want to provide a fallback, you can do something like this:
<?php
// preg_match_all() returns false on error, e.g. when \p{L} is not supported
if (@preg_match_all('/\p{L}+/u', $str, $arr) === false) {
    // fallback goes here
    // for example just '/\w+/u' for a less accurate match
}
?>
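One way to package the fallback idea is a helper that probes for Unicode property support once and then picks a pattern. This is only a sketch: extract_words() is a hypothetical name, not a PHP function, and PHP 7 or later is assumed.

```php
<?php
/**
 * Hypothetical helper: returns the words found in $str, using \p{L} when
 * the PCRE build supports it and falling back to \w otherwise.
 */
function extract_words(string $str): array
{
    // Probe once; preg_match() returns false if the pattern cannot be compiled.
    static $hasUnicodeProperties = null;
    if ($hasUnicodeProperties === null) {
        $hasUnicodeProperties = @preg_match('/\p{L}/u', 'a') === 1;
    }

    $pattern = $hasUnicodeProperties ? '/\p{L}+/u' : '/\w+/';
    if (preg_match_all($pattern, $str, $found) === false) {
        return []; // matching failed, e.g. $str is not valid UTF-8
    }
    return $found[0];
}

// "naïve café", written with escapes to keep the source ASCII-only.
print_r(extract_words("na\u{00EF}ve caf\u{00E9}"));
```

Probing with a real pattern avoids guessing from version numbers, at the cost of one suppressed warning on builds without property support.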
Not made clear in the top of page explanation, but these escaped character classes can be included within square brackets to make a broader character class. For example:
<?php preg_match('/[\p{N}\p{L}]+/u', $data); ?>
Will match any combination of letters and numbers.
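A slightly fuller sketch of the same character class, run over a made-up sample string in UTF-8 mode (PHP 7 or later assumed):

```php
<?php
// "PHP8 héllo, 日本語 42!", written with escapes to keep the source ASCII-only.
$data = "PHP8 h\u{00E9}llo, \u{65E5}\u{672C}\u{8A9E} 42!";

// Collect every run of letters and numbers; punctuation and spaces split the runs.
preg_match_all('/[\p{N}\p{L}]+/u', $data, $runs);

print_r($runs[0]);
```

This should yield four runs: "PHP8", "héllo", "日本語" and "42".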