1062 字

多段文本分段

计算机 - 正则表达式; 正则表达式;

参考资料http://www.geany.org/manual/gtk/glib/glib-regex-syntax.html

说明,只是大概思想,如果真想对一篇很长的文本分段,直接这么做肯定不行。

表达式:<h1>(.(?!<h1>))+\r\n,注意环境为.NET 2.0,设置匹配的 Options 为SingleLine

文本

<h1>第一百八十一章 正文</h1>
<p>asdf</p>
<p>asdf</p>
<p>asdf</p>
<h1>第一百八十三章 正文</h1>
<p>asdf</p>
<p>asdf</p>
<p>asdf</p>
<h1>第一百八十二章 正文</h1>
<p>asdf</p>
<p>asdf</p> 

结果

(<h1>第一百八十一章 正文</h1>
<p>asdf</p>
<p>asdf</p>
<p>asdf</p>
) (>)

------- NEXT MATCH -------
(<h1>第一百八十三章 正文</h1>
<p>asdf</p>
<p>asdf</p>
<p>asdf</p>
) (>)

------- NEXT MATCH -------
(<h1>第一百八十二章 正文</h1>
<p>asdf</p>
<p>asdf</p>
) (>)

------- NEXT MATCH -------

参考表达式

string strReg=@"((?<!深圳市).(?!深圳市))+";


参考资料

http://www.geany.org/manual/gtk/glib/glib-regex-syntax.html

An assertion is a test on the characters following or preceding the current matching point that does not actually consume any characters. The simple assertions coded as \b, \B, \A, \Z, \z, ^ and $ are described above. More complicated assertions are coded

as subpatterns. There are two kinds: those that look ahead of the current position in the subject string, and those that look behind it.

An assertion subpattern is matched in the normal way, except that it does not cause the current matching position to be changed. Lookahead assertions start with (?= for positive assertions and (?! for negative assertions. For example,

\w+(?=;)

matches a word followed by a semicolon, but does not include the semicolon in the match, and

foo(?!bar)

matches any occurrence of “foo” that is not followed by “bar”. Note that the apparently similar pattern

(?!foo)bar

does not find an occurrence of “bar” that is preceded by something other than “foo”; it finds any occurrence of “bar” whatsoever, because the assertion (?!foo) is always TRUE when the next three characters are “bar”. A lookbehind assertion is needed to

achieve this effect. Lookbehind assertions start with (?<= for positive assertions and (?<! for negative assertions. For example,

(?<!foo)bar

does find an occurrence of “bar” that is not preceded by “foo”. The contents of a lookbehind assertion are restricted such that all the strings it matches must have a fixed length. However, if there are several alternatives, they do not all have to have

the same fixed length. Thus

(?<=bullock|donkey)

is permitted, but

(?<!dogs?|cats?)

causes an error at compile time. Branches that match different length strings are permitted only at the top level of a lookbehind assertion. This is an extension compared with Perl 5.005, which requires all branches to match the same length of string. An

assertion such as

(?<=ab(c|de))

is not permitted, because its single top-level branch can match two different lengths, but it is acceptable if rewritten to use two top-level branches:

(?<=abc|abde)

The implementation of lookbehind assertions is, for each alternative, to temporarily move the current position back by the fixed width and then try to match. If there are insufficient characters before the current position, the match is deemed to fail.

Lookbehinds in conjunction with once-only subpatterns can be particularly useful for matching at the ends of strings; an example is given at the end of the section on once-only subpatterns.

Several assertions (of any sort) may occur in succession. For example,

(?<=\d{3})(?<!999)foo

matches “foo” preceded by three digits that are not “999”. Notice that each of the assertions is applied independently at the same point in the subject string. First there is a check that the previous three characters are all digits, then there is a check

that the same three characters are not “999”. This pattern does not match “foo” preceded by six characters, the first of which are digits and the last three of which are not “999”. For example, it doesn’t match “123abcfoo”. A pattern to do that is

(?<=\d{3}...)(?<!999)foo

This time the first assertion looks at the preceding six characters, checking that the first three are digits, and then the second assertion checks that the preceding three characters are not “999”.

Assertions can be nested in any combination. For example,

(?<=(?<!foo)bar)baz

matches an occurrence of “baz” that is preceded by “bar” which in turn is not preceded by “foo”, while

(?<=\d{3}(?!999)...)foo

is another pattern which matches “foo” preceded by three digits and any three characters that are not “999”.

Assertion subpatterns are not capturing subpatterns, and may not be repeated, because it makes no sense to assert the same thing several times. If any kind of assertion contains capturing subpatterns within it, these are counted for the purposes of numbering the capturing subpatterns in the whole pattern. However, substring capturing is carried out only for positive assertions, because it does not make sense for negative assertions.