Java Regular Expressions (Groups and Capturing)

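A regular expression, specified as a string, must first be compiled into an instance of the Pattern class. The resulting pattern can then be used to create a Matcher object that can match arbitrary character sequences against the regular expression. All of the state involved in performing a match resides in the matcher, so many matchers can share the same Pattern instance.
A typical invocation sequence is: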

```{lang}
Pattern p = Pattern.compile("a*b");
Matcher m = p.matcher("aaaaaaab");
boolean b = m.matches();
```

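The Pattern class also defines a static matches method as a convenience for cases where a regular expression is used only once. This method compiles an expression and matches an input string against it in a single call. The statement boolean b = Pattern.matches("a*b", "aaaaaaab") is equivalent to the three statements above, but for repeated matching it is less efficient because it does not allow the compiled Pattern instance to be reused.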
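The difference between the two styles can be sketched as follows. The class name `PatternReuseDemo`, the constant `A_STAR_B`, and the helper `countMatches` are illustrative names, not part of the JDK; only `Pattern.compile`, `Pattern.matcher`, `Matcher.matches`, and `Pattern.matches` come from `java.util.regex`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class PatternReuseDemo {
    // Compiled once; every call below reuses this instance.
    private static final Pattern A_STAR_B = Pattern.compile("a*b");

    // Counts the inputs that match "a*b" in their entirety.
    static int countMatches(List<String> inputs) {
        int count = 0;
        for (String input : inputs) {
            if (A_STAR_B.matcher(input).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList("aaaaaaab", "b", "ab", "ba", "aaa");
        System.out.println(countMatches(inputs)); // 3

        // One-off convenience form: the expression is compiled on every call.
        System.out.println(Pattern.matches("a*b", "aaaaaaab")); // true
    }
}
```

Holding the compiled pattern in a `static final` field is one common way to avoid recompiling it inside a loop.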

Instances of the Pattern class are immutable and are safe for use by multiple concurrent threads. Instances of the Matcher class, however, are not thread-safe.
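A minimal sketch of what this means in practice: one shared Pattern, and a fresh Matcher created inside each task. The class `SharedPatternDemo`, the method `matchConcurrently`, the `\d+` pattern, and the pool size of 4 are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Pattern;

class SharedPatternDemo {
    // Immutable, so it can be shared freely between threads.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Returns, in input order, whether each input consists only of digits.
    static List<Boolean> matchConcurrently(List<String> inputs) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Boolean>> futures = new ArrayList<>();
            for (final String input : inputs) {
                // Each task creates its own Matcher; a Matcher is never shared.
                Callable<Boolean> task = () -> DIGITS.matcher(input).matches();
                futures.add(pool.submit(task));
            }
            List<Boolean> results = new ArrayList<>();
            for (Future<Boolean> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(matchConcurrently(Arrays.asList("123", "12a", "7")));
        // [true, false, true]
    }
}
```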

## # Summary of regular-expression constructs

Characters

```markdown
Construct : Matches
x     : The character x
\\    : The backslash character
\0n   : The character with octal value 0n (0 <= n <= 7)
\0nn  : The character with octal value 0nn (0 <= n <= 7)
\0mnn : The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
\xhh  : The character with hexadecimal value 0xhh
\uhhhh: The character with hexadecimal value 0xhhhh
\x{h...h} : The character with hexadecimal value 0xh...h (Character.MIN_CODE_POINT  <= 0xh...h <=  Character.MAX_CODE_POINT)
\t    : The tab character ('\u0009')
\n    : The newline (line feed) character ('\u000A')
\r    : The carriage-return character ('\u000D')
\f    : The form-feed character ('\u000C')
\a    : The alert (bell) character ('\u0007')
\e    : The escape character ('\u001B')
\cx   : The control character corresponding to x
```
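A few of these escapes can be tried out with a small sketch. The class `EscapeDemo` and the helper `fullMatch` are illustrative names. Note that every regex backslash has to be doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // True if the whole input matches the regular expression.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Hexadecimal, four-digit Unicode, and octal escapes for 'A', 'B', 'C'.
        System.out.println(fullMatch("\\x41\\u0042\\0103", "ABC")); // true

        // The tab escape, written in the regex rather than in the Java literal.
        System.out.println(fullMatch("a\\tb", "a\tb")); // true

        // The escape character has code point 27.
        System.out.println(fullMatch("\\e", String.valueOf((char) 27))); // true

        // A supplementary code point written with the braces form.
        String emoji = new String(Character.toChars(0x1F600));
        System.out.println(fullMatch("\\x{1F600}", emoji)); // true

        // A doubled backslash in the regex matches one literal backslash.
        System.out.println(fullMatch("\\\\", "\\")); // true
    }
}
```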

Character classes

```{lang}
[abc]
a, b, or c (simple class)  
[^abc]
Any character except a, b, or c (negation)  
[a-zA-Z]
a through z or A through Z, inclusive (range)   
[a-d[m-p]]
a through d, or m through p: [a-dm-p] (union)  
[a-z&&[def]]
d, e, or f (intersection)
[a-z&&[^bc]]
a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]
a through z, and not m through p: [a-lq-z] (subtraction)
```
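Union, intersection, and subtraction are easiest to see by deleting every character a class matches and looking at what is left. This is only a sketch: the class `CharClassDemo`, the helper `strip`, and the sample string are made up for illustration.

```java
class CharClassDemo {
    // Removes every character matched by the given character class.
    static String strip(String input, String charClass) {
        return input.replaceAll(charClass, "");
    }

    public static void main(String[] args) {
        String sample = "abcdefmnopqz";

        // Union: a-d or m-p are removed.
        System.out.println(strip(sample, "[a-d[m-p]]"));    // efqz

        // Intersection: only d, e, f are removed.
        System.out.println(strip(sample, "[a-z&&[def]]"));  // abcmnopqz

        // Subtraction: everything in a-z except b and c is removed.
        System.out.println(strip(sample, "[a-z&&[^bc]]"));  // bc

        // Subtraction: everything in a-z except m-p is removed.
        System.out.println(strip(sample, "[a-z&&[^m-p]]")); // mnop

        // Negation: everything except a, b, c is removed.
        System.out.println(strip("abc-XYZ", "[^abc]"));     // abc
    }
}
```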

Predefined character classes

```{lang}
.
Any character (may or may not match line terminators)
\d
A digit: [0-9]
\D
A non-digit: [^0-9]
\h
A horizontal whitespace character: [ \t\xA0\u1680\u180e\u2000-\u200a\u202f\u205f\u3000]
\H
A non-horizontal whitespace character: [^\h]
\s
A whitespace character: [ \t\n\x0B\f\r]
\S
A non-whitespace character: [^\s]
\v
A vertical whitespace character: [\n\x0B\f\r\x85\u2028\u2029]
\V
A non-vertical whitespace character: [^\v]
\w
A word character: [a-zA-Z_0-9]
\W
A non-word character: [^\w]
```
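The sketch below exercises `\d`, `\s`, `\W`, and the dot. The class `PredefinedClassDemo` and its helper methods are hypothetical names. The last helper shows the "may or may not" in the description of the dot: by default it does not match a line terminator, but with the `Pattern.DOTALL` flag it does.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PredefinedClassDemo {
    // Checks only the shape dddd-dd-dd, not whether the date is valid.
    static boolean isIsoDateShape(String s) {
        return s.matches("\\d{4}-\\d{2}-\\d{2}");
    }

    // Splits on runs of whitespace (spaces, tabs, newlines, ...).
    static String[] splitOnWhitespace(String s) {
        return s.trim().split("\\s+");
    }

    // Drops every character that is not a word character.
    static String keepWordChars(String s) {
        return s.replaceAll("\\W", "");
    }

    // Tests a single-dot pattern with or without the DOTALL flag.
    static boolean dotMatches(String input, boolean dotAll) {
        int flags = dotAll ? Pattern.DOTALL : 0;
        return Pattern.compile(".", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIsoDateShape("2024-01-15"));                      // true
        System.out.println(Arrays.toString(splitOnWhitespace("a b\tc\nd")));   // [a, b, c, d]
        System.out.println(keepWordChars("hello_world42!"));                   // hello_world42
        System.out.println(dotMatches("\n", false));                           // false
        System.out.println(dotMatches("\n", true));                            // true
    }
}
```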

Greedy quantifiers (match as much as possible)

```{lang}
X?
X, once or not at all
X*
X, zero or more times
X+
X, one or more times
X{n}
X, exactly n times
X{n,}
X, at least n times
X{n,m}
X, at least n but not more than m times
```

For example: with the input string aabaaab and the regular expression a.*b, the match is aabaaab.

Reluctant quantifiers (always match as little as possible)

```{lang}
X??
X, once or not at all
X*?
X, zero or more times
X+?
X, one or more times
X{n}?
X, exactly n times
X{n,}?
X, at least n times
X{n,m}?
X, at least n but not more than m times
```

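For example: with the input string aabaaab and the regular expression a.*?b, the first match is aab.
As the result shows, a reluctant quantifier stops consuming input as soon as the rest of the pattern can match. This loosely resembles the short-circuit operator &&, which does not evaluate its right operand once the left operand is false.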
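The two examples can be reproduced with a short sketch that collects every match `find()` reports. The class `GreedyVsReluctantDemo` and the helper `allMatches` are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctantDemo {
    // Collects every non-overlapping match found in the input, left to right.
    static List<String> allMatches(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Greedy: .* runs to the end, then backs off to the last 'b'.
        System.out.println(allMatches("a.*b", "aabaaab"));  // [aabaaab]

        // Reluctant: .*? grows one character at a time until 'b' can match.
        System.out.println(allMatches("a.*?b", "aabaaab")); // [aab, aaab]
    }
}
```

The reluctant pattern yields two matches because, after stopping at the first `b`, the search resumes from the following character.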

Possessive quantifiers

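> **Explanation**
> Possessive quantifiers do not backtrack: they match as much as possible and never give characters back, so the attempt fails immediately if the rest of the pattern cannot match. This can make them more efficient than greedy quantifiers. [Possessive Quantifiers](https://www.regular-expressions.info/possessive.html)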

```{lang}
X?+
X, once or not at all
X*+
X, zero or more times
X++
X, one or more times
X{n}+
X, exactly n times
X{n,}+
X, at least n times
X{n,m}+
X, at least n but not more than m times
```
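A sketch of the practical consequence, reusing the input string from the greedy and reluctant examples. The class `PossessiveDemo` and the helper `found` are hypothetical names.

```java
import java.util.regex.Pattern;

class PossessiveDemo {
    // True if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Greedy: .* gives back the final 'b', so the match succeeds.
        System.out.println(found("a.*b", "aabaaab"));  // true

        // Possessive: .*+ keeps everything up to the end, so 'b' can never match.
        System.out.println(found("a.*+b", "aabaaab")); // false

        // Possessive quantifiers still succeed when nothing has to be given back.
        System.out.println(found("a++b", "aaab"));     // true
    }
}
```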

Logical operators

```{lang}
XY
X followed by Y
X|Y
Either X or Y
(X)
X, as a capturing group
```

Back references

```{lang}
\n
Whatever the nth capturing group matched
\k<name>
Whatever the named-capturing group "name" matched
```
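Both forms of back reference can be tried with the sketch below, which looks for doubled characters and repeated words. The class `BackReferenceDemo`, the helper `firstMatch`, and the group name `ch` are assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackReferenceDemo {
    // Returns the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Numbered back reference: a word character followed by itself.
        System.out.println(firstMatch("(\\w)\\1", "hello"));                 // ll

        // Named back reference: same idea, referring to the group by name.
        System.out.println(firstMatch("(?<ch>\\w)\\k<ch>", "bookkeeper"));   // oo

        // A whole word repeated after whitespace.
        System.out.println(firstMatch("\\b(\\w+)\\s+\\1\\b", "this is is a test")); // is is

        // No doubled character, so there is no match.
        System.out.println(firstMatch("(\\w)\\1", "abc"));                   // null
    }
}
```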

Quotation

```{lang}
\
Nothing, but quotes the following character
\Q
Nothing, but quotes all characters until \E
\E
Nothing, but ends quoting started by \Q
```
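Quoting matters whenever the text to match contains metacharacters. The following sketch compares an unquoted pattern, the `\Q...\E` form, and `Pattern.quote`, which wraps a string in that form. The class `QuotingDemo` and the helper `fullMatch` are illustrative names.

```java
import java.util.regex.Pattern;

class QuotingDemo {
    // True if the whole input matches the regular expression.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Unquoted, '+' is a quantifier, so the literal text does not match...
        System.out.println(fullMatch("1+1=2", "1+1=2"));                // false
        // ...but a run of '1' characters does.
        System.out.println(fullMatch("1+1=2", "11=2"));                 // true

        // Quoted with \Q...\E, every character is taken literally.
        System.out.println(fullMatch("\\Q1+1=2\\E", "1+1=2"));          // true

        // Pattern.quote produces an equivalent literal pattern.
        System.out.println(fullMatch(Pattern.quote("1+1=2"), "1+1=2")); // true

        // A single backslash quotes only the next character.
        System.out.println(fullMatch("a\\.b", "a.b"));                  // true
        System.out.println(fullMatch("a\\.b", "axb"));                  // false
    }
}
```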

Special constructs (named-capturing and non-capturing)

```{lang}
(?<name>X)
X, as a named-capturing group
(?:X)
X, as a non-capturing group
(?idmsuxU-idmsuxU) 
Nothing, but turns match flags i d m s u x U on - off
(?idmsux-idmsux:X)  
X, as a non-capturing group with the given flags i d m s u x on - off
(?=X)
X, via zero-width positive lookahead
(?!X)
X, via zero-width negative lookahead
(?<=X)
X, via zero-width positive lookbehind
(?<!X)
X, via zero-width negative lookbehind
(?>X)
X, as an independent, non-capturing group
```
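Several of these constructs are sketched below with made-up inputs. The class `SpecialConstructDemo` and its helpers are hypothetical. The calls used are `Matcher.group(String)`, `Matcher.groupCount()`, and `Matcher.find()`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SpecialConstructDemo {
    // Returns the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Returns what the named group captured in the first match, or null.
    static String namedGroup(String regex, String input, String name) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(name) : null;
    }

    // Number of capturing groups in the pattern (group 0 is not counted).
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        // Named-capturing groups.
        System.out.println(namedGroup("(?<year>\\d{4})-(?<month>\\d{2})", "2024-01", "month")); // 01

        // A non-capturing group does not receive a group number.
        System.out.println(groupCount("(?:ab)+(c)"));                    // 1

        // Inline flag: case-insensitive matching.
        System.out.println(firstMatch("(?i)hello", "Say HeLLo"));        // HeLLo

        // Lookahead and lookbehind assert context without consuming it.
        System.out.println(firstMatch("\\d+(?= USD)", "price: 100 USD")); // 100
        System.out.println(firstMatch("(?<=\\$)\\d+", "cost $42"));       // 42
        System.out.println(firstMatch("foo(?!bar)\\w+", "foobar foobaz")); // foobaz

        // An independent group keeps what it matched and does not backtrack into it.
        System.out.println(firstMatch("(?:a+)ab", "aaab"));               // aaab
        System.out.println(firstMatch("(?>a+)ab", "aaab"));               // null
    }
}
```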

## # Groups and capturing

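Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:
1    ((A)(B(C)))
2    (A)
3    (B(C))
4    (C)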

> **WARNING**
> Group zero always stands for the entire expression.
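The numbering can be checked empirically with a small sketch that lists every group after a successful match. The class `GroupNumberingDemo` and the helper `groups` are illustrative names. The returned list is indexed by group number, with index 0 holding the whole match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberingDemo {
    // Returns group 0 through group N for a full match, or an empty list.
    static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> result = new ArrayList<>();
        if (m.matches()) {
            // groupCount() excludes group 0, hence the inclusive upper bound.
            for (int i = 0; i <= m.groupCount(); i++) {
                result.add(m.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(groups("((A)(B(C)))", "ABC"));
        // [ABC, ABC, A, BC, C]
    }
}
```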

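Capturing groups are so named because, during a match, each subsequence of the input sequence that matches such a group is saved. The captured subsequence can be used later in the expression through a back reference, and can also be retrieved from the matcher once the match operation is complete.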

The group method retrieves a captured group, as in the following example:

```{lang}
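String s = "abbbbaabbbbaaa";
String regex = "a(.*)b";
Matcher m = Pattern.compile(regex).matcher(s);
// A match must be attempted before group() is called;
// otherwise group() throws IllegalStateException.
if (m.find()) {
    String captured = m.group(1); // bbbbaabbb
}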
```
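When the input contains several matches, `find()` can be called in a loop and the groups read after each successful call. The sketch below assumes a made-up `key=number` input format. The class `KeyValueDemo` and the method `parse` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeyValueDemo {
    // Group 1 captures the key, group 2 the numeric value.
    private static final Pattern PAIR = Pattern.compile("(\\w+)=(\\d+)");

    // Extracts every key=number pair, preserving the order of appearance.
    static Map<String, Integer> parse(String input) {
        Map<String, Integer> result = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            result.put(m.group(1), Integer.parseInt(m.group(2)));
        }
        return result;
    }

    public static void main(String[] args) {
        // "c=x" is skipped because its value is not numeric.
        System.out.println(parse("a=1, bb=22; c=x")); // {a=1, bb=22}
    }
}
```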