Unicode character properties

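Since PHP 4.4.0 and 5.1.0, three additional escape sequences to match generic character types are available when UTF-8 mode is selected. They are:

\p{xx}: a character with the xx property
\P{xx}: a character without the xx property
\X: an extended Unicode sequence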

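The property names represented by xx above are limited to the Unicode general category properties. Each character has exactly one such property, specified by a two-letter abbreviation. For compatibility with Perl, negation can be specified by including a circumflex (^) right after the opening curly bracket. For example, \p{^Lu} is the same as \P{Lu}.

If only one letter is specified with \p or \P, it includes all the properties that start with that letter. In this case, the curly brackets in the escape sequence are optional; the following two forms have the same effect: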

\p{L}
\pL
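
As a minimal sketch of the escapes described above (the sample string and variable names are illustrative assumptions, not part of the manual), the following compares \p{Lu}, its two negated spellings, and the single-letter shorthand \pL:

```php
<?php
// Illustrative sample string (an assumption, not from the manual).
$text = 'Hello World';

// \p{Lu}: characters that have the "upper case letter" property.
preg_match_all('/\p{Lu}/u', $text, $upper);

// \P{Lu} and \p{^Lu} both match characters WITHOUT that property.
preg_match_all('/\P{Lu}/u', $text, $notUpperA);
preg_match_all('/\p{^Lu}/u', $text, $notUpperB);

// \pL is shorthand for \p{L}: any letter (Ll, Lm, Lo, Lt or Lu).
preg_match_all('/\pL+/u', $text, $words);

print_r($upper[0]);                         // the upper case letters H and W
var_dump($notUpperA[0] === $notUpperB[0]);  // the two negated forms agree
print_r($words[0]);                         // the letter runs Hello and World
```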

**Supported Unicode properties**
Property	Matches	Notes
C	Other
Cc	Control
Cf	Format
Cn	Unassigned
Co	Private use
Cs	Surrogate
L	Letter	Includes the following properties: Ll, Lm, Lo, Lt and Lu.
Ll	Lower case letter
Lm	Modifier letter
Lo	Other letter
Lt	Title case letter
Lu	Upper case letter
M	Mark
Mc	Spacing mark
Me	Enclosing mark
Mn	Non-spacing mark
N	Number
Nd	Decimal number
Nl	Letter number
No	Other number
P	Punctuation
Pc	Connector punctuation
Pd	Dash punctuation
Pe	Close punctuation
Pf	Final punctuation
Pi	Initial punctuation
Po	Other punctuation
Ps	Open punctuation
S	Symbol
Sc	Currency symbol
Sk	Modifier symbol
Sm	Mathematical symbol
So	Other symbol
Z	Separator
Zl	Line separator
Zp	Paragraph separator
Zs	Space separator
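
To make the table more concrete, here is a small sketch that pulls different general categories out of one string. The sample string is an assumption chosen for illustration; the category assignments it relies on (digits are Nd, "$" is Sc, and ":", "," and "%" are Po) come from the Unicode character database.

```php
<?php
// Illustrative sample string (an assumption, not from the manual).
$sample = 'Price: $42, tax 7%';

// Nd: decimal digits, grouped into runs.
preg_match_all('/\p{Nd}+/u', $sample, $digits);

// Sc: currency symbols.
preg_match_all('/\p{Sc}/u', $sample, $currency);

// P: any punctuation (Pc, Pd, Pe, Pf, Pi, Po or Ps).
// Note that "$" is a symbol (Sc), not punctuation.
preg_match_all('/\p{P}/u', $sample, $punct);

print_r($digits[0]);
print_r($currency[0]);
print_r($punct[0]);
```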

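Extended properties such as InMusicalSymbols are not supported by PCRE.

Specifying caseless matching does not affect these escape sequences. For example, \p{Lu} always matches only upper case letters.

Sets of Unicode characters are defined as belonging to certain scripts. A character from one of these sets can be matched using a script name. For example: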

\p{Greek}
\P{Han}

Characters that are not part of an identified script are lumped together as Common. The current list of scripts is:

**Supported scripts**
Arabic	Armenian	Avestan	Balinese	Bamum
Batak	Bengali	Bopomofo	Brahmi	Braille
Buginese	Buhid	Canadian_Aboriginal	Carian	Chakma
Cham	Cherokee	Common	Coptic	Cuneiform
Cypriot	Cyrillic	Deseret	Devanagari	Egyptian_Hieroglyphs
Ethiopic	Georgian	Glagolitic	Gothic	Greek
Gujarati	Gurmukhi	Han	Hangul	Hanunoo
Hebrew	Hiragana	Imperial_Aramaic	Inherited	Inscriptional_Pahlavi
Inscriptional_Parthian	Javanese	Kaithi	Kannada	Katakana
Kayah_Li	Kharoshthi	Khmer	Lao	Latin
Lepcha	Limbu	Linear_B	Lisu	Lycian
Lydian	Malayalam	Mandaic	Meetei_Mayek	Meroitic_Cursive
Meroitic_Hieroglyphs	Miao	Mongolian	Myanmar	New_Tai_Lue
Nko	Ogham	Old_Italic	Old_Persian	Old_South_Arabian
Old_Turkic	Ol_Chiki	Oriya	Osmanya	Phags_Pa
Phoenician	Rejang	Runic	Samaritan	Saurashtra
Sharada	Shavian	Sinhala	Sora_Sompeng	Sundanese
Syloti_Nagri	Syriac	Tagalog	Tagbanwa	Tai_Le
Tai_Tham	Tai_Viet	Takri	Tamil	Telugu
Thaana	Thai	Tibetan	Tifinagh	Ugaritic
Vai	Yi
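
A short sketch of script matching, assuming the source file is saved as UTF-8 (the mixed-script sample string is an illustrative assumption). It also shows that the spaces between the words fall under Common:

```php
<?php
// Illustrative sample with Latin, Greek and Han text, separated by spaces.
// Assumes this file is saved as UTF-8.
$mixed = 'abc αβγ 漢字';

preg_match_all('/\p{Latin}+/u', $mixed, $latin);
preg_match_all('/\p{Greek}+/u', $mixed, $greek);
preg_match_all('/\p{Han}+/u', $mixed, $han);

// Characters that belong to no specific script, such as the space, are Common.
preg_match_all('/\p{Common}/u', $mixed, $common);

print_r($latin[0]);
print_r($greek[0]);
print_r($han[0]);
var_dump(count($common[0]));
```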

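The \X escape matches any number of Unicode characters that form an extended Unicode sequence. \X is equivalent to (?>\PM\pM*)

That is, it matches a character without the "mark" property, followed by zero or more characters with the "mark" property, and treats the sequence as an atomic group (see below). Characters with the "mark" property are typically accents that affect the preceding character.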
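
The difference between matching single code points and matching with \X can be sketched as follows. The sample string is an assumption: an "e" followed by U+0301 (combining acute accent), then a plain "a", written with the PHP 7+ "\u{...}" escape syntax.

```php
<?php
// "e" + U+0301 COMBINING ACUTE ACCENT, then "a" (three code points).
$s = "e\u{0301}a";

// "." in UTF-8 mode matches one code point at a time.
preg_match_all('/./u', $s, $codePoints);

// \X keeps a base character together with the marks that follow it.
preg_match_all('/\X/u', $s, $sequences);

var_dump(count($codePoints[0]));  // number of code points
var_dump(count($sequences[0]));   // number of base-plus-marks sequences
```

This is the usual reason to reach for \X: counting or splitting user-visible characters rather than code points.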

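Matching characters by Unicode property is not fast, because PCRE has to search a structure that contains data for over fifteen thousand characters. That is why the traditional escape sequences such as \d and \w do not use Unicode properties in PCRE.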

To select UTF-8 mode for the additional escape sequences (\p{xx}, \P{xx}, and \X), use the "u" modifier (see http://php.net/manual/en/reference.pcre.pattern.modifiers.php).
I wondered why a German sharp S (ß) was matched as a control character by \p{Cc}. It took me a while to properly read the first sentence: "Since 5.1.0, three additional escape sequences to match generic character types are available when UTF-8 mode is selected." :-$ and then to find out how to select that mode.
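
The effect described in this note can be reproduced with a sketch like the one below. The variable names are illustrative assumptions. Without the "u" modifier the subject is examined byte by byte, and the second UTF-8 byte of ß (0x9F) is looked up as U+009F, which is a control character.

```php
<?php
// U+00DF (sharp S), written as a PHP 7+ code point escape.
// In UTF-8 it is encoded as the two bytes 0xC3 0x9F.
$sharpS = "\u{00DF}";

// Without "u": each byte is treated as a separate character,
// so the byte 0x9F is classified on its own as a control character.
$withoutU = preg_match('/\p{Cc}/', $sharpS);

// With "u": the two bytes are decoded as the single code point U+00DF,
// which is a lower case letter (Ll), not a control character.
$withU   = preg_match('/\p{Cc}/u', $sharpS);
$isLower = preg_match('/^\p{Ll}$/u', $sharpS);

var_dump($withoutU, $withU, $isLower);
```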

My country, Vietnam, has its own alphabet:
http://en.wikipedia.org/wiki/Vietnamese_alphabet
I hope PHP will improve its support for Vietnamese.

An excellent article explaining all these properties can be found here: http://www.regular-expressions.info/unicode.html

For those who wonder: 'letter_titlecase' applies to digraphs/trigraphs, where capitalization involves only the first letter. 
For example, there are three codepoints for the "LJ" digraph in Unicode: 
 (*) uppercase "LJ": U+01C7 
 (*) titlecase "Lj": U+01C8 
 (*) lowercase "lj": U+01C9
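
The three code points listed above can be checked against the Lu, Lt and Ll properties with a small sketch (the variable names are illustrative assumptions; the "\u{...}" escapes require PHP 7 or later):

```php
<?php
// Each pattern anchors on a single character and tests one category.
$isUpper = preg_match('/^\p{Lu}$/u', "\u{01C7}"); // "LJ" digraph
$isTitle = preg_match('/^\p{Lt}$/u', "\u{01C8}"); // "Lj" digraph
$isLower = preg_match('/^\p{Ll}$/u', "\u{01C9}"); // "lj" digraph

// The title case form is neither Lu nor Ll.
$titleIsUpper = preg_match('/^\p{Lu}$/u', "\u{01C8}");

var_dump($isUpper, $isTitle, $isLower, $titleIsUpper);
```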

If you are working with older environments, first check whether the installed version of PCRE supports the Unicode escape sequences described above:
<?php
// Need to check PCRE version because some environments are
// running older versions of the PCRE library
// (run in *nix environment `pcretest -C`)
$allowInternational = false;
if (defined('PCRE_VERSION')) {
  if (intval(PCRE_VERSION) >= 7) { // constant available since PHP 5.2.4
    $allowInternational = true;
  }
}
?>
Now you can do a fallback regex (e.g. use "/[a-z]/i"), when the PCRE library version is too old or not available.
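
One possible way to wire the check into a pattern choice is sketched below. The variable $pattern and both patterns are illustrative assumptions, not part of the note above; the ASCII fallback is deliberately less accurate.

```php
<?php
// Same version test as above, condensed into one expression.
$allowInternational = defined('PCRE_VERSION') && intval(PCRE_VERSION) >= 7;

// Use the Unicode-aware pattern when available, else an ASCII-only fallback.
$pattern = $allowInternational ? '/^\p{L}+$/u' : '/^[a-z]+$/i';

// Plain ASCII input is accepted by either pattern.
var_dump(preg_match($pattern, 'hello'));
```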

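These properties are usually only available if PCRE is compiled with "--enable-unicode-properties".
If you want to match any word but also provide a fallback, you can do something like this:
<?php
// preg_match_all() returns false on failure, e.g. when \p is not supported;
// the @ suppresses the warning raised in that case.
if (@preg_match_all('/\p{L}+/u', $str, $arr) === false) {
  // fallback goes here
  // for example just '/\w+/u' for a less accurate match
  preg_match_all('/\w+/u', $str, $arr);
}
?>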

Not made clear in the explanation at the top of the page: these escaped character classes can be included within square brackets to build a broader character class. For example:
<?php preg_match('/[\p{N}\p{L}]+/u', $data); ?>
This will match any combination of letters and numbers. The "u" modifier is added here so that multibyte UTF-8 input is handled correctly.
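
A slightly fuller sketch of the same idea, assuming a UTF-8 source file and an illustrative sample string, collects every run of letters and numbers:

```php
<?php
// Illustrative sample string (an assumption). Assumes this file is UTF-8.
$data = 'abc123, déjà vu 42';

// A bracketed class combining two properties: any number or any letter.
preg_match_all('/[\p{N}\p{L}]+/u', $data, $runs);

print_r($runs[0]);
```

Punctuation and spaces fall outside both properties, so they act as separators between the matched runs.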