Python Regular Expressions: Master Text Processing in 30 Seconds

I. Overview

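1. Definition

A regular expression is a compact notation for describing the structure and pattern of text. It is widely used to match, search, replace, and extract strings.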

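2. Characteristics

Complex syntax: the many symbols and flexible rules make patterns hard to read.
Powerful: patterns give precise control over string content and suit a wide range of text-processing tasks.
Cross-language support: Python, JavaScript, Java, and other mainstream languages all support regular expressions.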

II. Using Regular Expressions in Python

1. Importing the module

import re

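2. Common functions

`re.match(pattern, string, flags=0)`

Matches from the beginning of the string. Returns a Match object on success and None otherwise.

pattern: the regular expression to match.
string: the target string.
flags: optional flags (for example, ignore case).

⚠️ Note:

A failed match returns None, not False.
match() only matches at the beginning of the string.

Example: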

res = re.match("hello", "hello python")
if res:
    print(res.group())  # Output: hello
else:
    print("No match found")
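The `flags` argument is easiest to see with a case-insensitive match. The following is a minimal sketch using `re.IGNORECASE`, a standard flag of the `re` module; the sample string is made up for illustration.

```python
import re

text = "Hello Python"  # sample input, chosen for illustration

# Without flags the match is case-sensitive, so "hello" does not match "Hello".
strict = re.match("hello", text)

# re.IGNORECASE (alias re.I) lets the same pattern match regardless of case.
relaxed = re.match("hello", text, flags=re.IGNORECASE)

print(strict)           # None
print(relaxed.group())  # Hello
```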

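III. Basic Syntax and Metacharacters

1. Matching a single character

Metacharacter	Meaning
`.`	Matches any single character (except `\n`)
`\d`	Matches any digit (`[0-9]`; for str patterns, other Unicode decimal digits as well)
`\D`	Matches any non-digit
`\s`	Matches a whitespace character (space, newline, tab, etc.)
`\S`	Matches a non-whitespace character
`\w`	Matches a word character (letters, digits, underscore `_`; for str patterns this includes Unicode letters such as Chinese characters)
`\W`	Matches a non-word character
`{}`	Specifies how many times the preceding item repeats
`[]`	Matches any one of the characters listed inside the brackets

Example: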

# 1. Use . to match any single character
text = "hello python"

res = re.match("..", text)  # matches 'he'
print(res.group())

# 2. Match one of the characters listed in []
res = re.match("[he]", text)  # matches only 'h': one character, taken from the start
print(res.group())

res = re.match("[he][he]", text)  # matches 'he'
print(res.group())
# 3. Match the digits 0-9
res = re.match("[0123456789]", "2312")  # matches '2'
print(res.group())
res = re.match("[0-9]", "2312")  # matches '2'
print(res.group())
res = re.match("[0-35-9]", "2312")  # any digit except 4; matches '2'
print(res.group())
# 4. Match any letter
res = re.match("[a-zA-Z]", "ABCD")
print(res.group())
# Match a digit with \d
res = re.match(r"\d", "9823456")
print(res.group())
# Match 4 consecutive digits starting with 9
res = re.match(r"9\d{3}", "9823456")
print(res.group())

# \D matches non-digits
res = re.match(r"\D{4}", "ABCDEFGHIJKLMNO")
print(res.group())

# \s matches whitespace
res = re.match(r"\s", " hello python")
print(res.group())

# \S matches non-whitespace
res = re.match(r"\S", "hello python")
print(res.group())

# \w matches word characters: A-Z, a-z, 0-9, _, and Chinese characters
res = re.match(r"\w", "hello python")
print(res.group())
res = re.match(r"\w", "你好 python")
print(res.group())
res = re.match(r"\w", ".你好 python")  # no match: '.' is not a word character
if res is not None:
    print("True")
    print(res.group())
else:
    print("False")
# \W matches non-word characters
res = re.match(r"\W", "   .你好 python")  # matches the leading space
if res is not None:
    print("True")
    print(repr(res.group()))
else:
    print("False")
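Whether `\w` matches Chinese characters depends on the matching mode. For str patterns Python is Unicode-aware by default, and the standard `re.ASCII` flag restricts `\w` to `[a-zA-Z0-9_]`. A small sketch of the difference, reusing the sample string from above:

```python
import re

text = "你好 python"

# Default for str patterns: \w is Unicode-aware and accepts Chinese characters.
unicode_default = re.match(r"\w", text)

# With re.ASCII, \w only accepts [a-zA-Z0-9_], so the match fails here.
ascii_only = re.match(r"\w", text, flags=re.ASCII)

print(unicode_default.group())  # 你
print(ascii_only)               # None
```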

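2. Controlling repetition (quantifiers)

Quantifier	Meaning
`*`	Matches the preceding item 0 or more times
`+`	Matches the preceding item 1 or more times
`?`	Matches the preceding item 0 or 1 time
`{m}`	Matches the preceding item exactly m times
`{m,n}`	Matches the preceding item from m to n times (inclusive)

A custom safe-match helper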

def safe_match(pattern, text):
    """Print the result of re.match() instead of raising when nothing matches."""
    res = re.match(pattern, text)
    if res is None:
        print(f"No match: [{pattern}]")
    else:
        print(f"Matched: [{pattern}] ==> '{res.group()}'")
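Example:

safe_match(r'\w+', " hello python")  # no match: + needs at least one word character, but the string starts with a space

safe_match(r'\w*', "hello python")  # matches 'hello'; * repeats \w itself 0 or more times, not the text that \w matched

safe_match(r'\w*', "    hello python")  # succeeds with an empty match, because 0 repetitions are allowed

safe_match(r'\w?', " hello python")  # succeeds with an empty match, because 0 or 1 repetition is allowed
safe_match(r'\w{3}', "hello python")  # matches 'hel': exactly 3 word characters

safe_match(r'\w{7}', "hello python")  # no match: there are not 7 consecutive word characters
safe_match(r'\w{4,7}', "hello python")  # matches 'hello': between 4 and 7 word characters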
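The difference between "matched nothing" and "did not match" is easy to miss with `*` and `?`. The sketch below inspects the Match objects directly; `group()` and `span()` are standard Match methods, and the input string is the one used above.

```python
import re

text = " hello python"  # note the leading space

star = re.match(r"\w*", text)      # zero or more: succeeds with an empty match
optional = re.match(r"\w?", text)  # zero or one: also an empty match
plus = re.match(r"\w+", text)      # one or more: fails at the leading space

print(repr(star.group()), star.span())  # '' (0, 0)
print(repr(optional.group()))           # ''
print(plus)                             # None
```

An empty match is still a Match object, so a test such as `if res is not None` treats it as a success even though no characters were consumed.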

IV. Controlling Match Position

Symbol	Meaning
`^`	Matches the start of the string
`$`	Matches the end of the string

Example:

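# ^ matches the start of the string; inside [] it negates the character class
safe_match('^hell', "hello python")  # starts with 'hell'
# [^...] matches any character that is not listed
safe_match("[^py]", "hello python")  # matches 'h'
# $ matches the end of the string, but remember that match() starts at the beginning
# To match a string that ends with n:
safe_match('.*[n]$', "hello python")  # matches 'hello python'
# To match a string that does not end with n:
safe_match('.*[^n]$', "hello python")  # no match: the string ends with n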
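Because `match()` is pinned to the start, the patterns above need a leading `.*` before `$` can be reached. As a hedged alternative, `re.search()` (described in the advanced usage section) scans the whole string, so the anchors can be used on their own:

```python
import re

text = "hello python"

# search() scans the whole string, so "$" works without a leading ".*".
ends_with_n = re.search(r"n$", text)

# "^" still pins the pattern to the start, so this behaves like match().
starts_with_py = re.search(r"^py", text)

print(ends_with_n.group(), ends_with_n.span())  # n (11, 12)
print(starts_with_py)                           # None
```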

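V. Groups and Backreferences

Symbol	Meaning
`|`	Matches either the left or the right expression; the left is tried first, and the right only if the left fails
`(ab)`	Treats the characters in the parentheses as a group
`\num`	Matches the same text that group number num matched
`(?P<name>...)`	Gives the group a name so that it can be referenced later
`(?P=name)`	Matches the same text that the group named name matched

Example: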

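# Match either of two expressions
safe_match("abc|ABC", "abc")
safe_match("abc|ABC", "ABC")

# (ab) treats the characters in the parentheses as a group
safe_match(r"\w*@(163|qq|wechat)\.com", "stitchcool@163.com")

# \num matches the text captured by group num; commonly used when matching tags
# Note: groups are numbered from 1 in the order of their opening parentheses (outer before inner)
safe_match(r"<\w*>\w*</\w*>", "<html>login</html>")
# That is cumbersome and does not tie the closing tag to the opening one, so reference the captured group instead

safe_match("<(\\w*)>\\w*</\\1>", "<html>login</html>")
# A raw string (r prefix) avoids the doubled backslashes
safe_match(r"<(\w*)>\w*</\1>", "<html>login</html>")
safe_match(r"<(\w*)><(\w*)>\w*</\2></\1>", "<html><body>login</body></html>")

# Named groups (the names 标签1 and 标签2 mean "tag 1" and "tag 2")
safe_match(r"<(?P<标签1>\w*)><(?P<标签2>\w*)>\w*</(?P=标签2)></(?P=标签1)>", "<html><body>login</body></html>")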
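# Match URLs: the prefix is usually www and the suffix .com, .cn, etc.
li = ["www.baidu.com", "www.python.org", "http.taobao.cn", "http\iaidu\com"]

for i in li:
    safe_match(r"www(.)\w*\1(com|cn|org)", i)

# \i and \c are not recognised escape sequences, so Python keeps them as written (newer versions emit a warning)
print(li[3])

# r"..." is a raw string: backslashes are not processed as escapes and the text is stored exactly as written.
# "http\niaidu\com" has no r prefix, so \n becomes a newline; \c is not a valid escape and is kept as-is.
# r"http\niaidu\com" is a raw string, so it is stored as the literal characters http\niaidu\com.
str111 = "http\niaidu\com"
print(str111)
str111 = r"http\niaidu\com"
print(str111)
safe_match(r"\w", str111)  # matches 'h'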
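The examples above only print the whole match. Once a pattern has groups, the captured pieces can be read back individually. The sketch below uses the standard Match methods `group()` and `groupdict()`; the group names `user` and `domain` are hypothetical and chosen only for this illustration.

```python
import re

# Hypothetical group names "user" and "domain", chosen for illustration.
pattern = r"(?P<user>\w+)@(?P<domain>163|qq|wechat)\.com"
res = re.match(pattern, "stitchcool@163.com")

whole = res.group()           # the entire match (same as group(0))
first = res.group(1)          # first group, by number
domain = res.group("domain")  # second group, by name
parts = res.groupdict()       # all named groups as a dict

print(whole)   # stitchcool@163.com
print(first)   # stitchcool
print(domain)  # 163
print(parts)   # {'user': 'stitchcool', 'domain': '163'}
```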

VI. Advanced Usage

1. `re.search()`

Scans the whole string and returns the first successful match.

def safe_search(pattern, text):
    """Print the result of re.search() instead of raising when nothing matches."""
    res = re.search(pattern, text)
    if res is None:
        print("search failed: no match")
    else:
        print("search succeeded:", res.group())

safe_search(r"\d", "python123")  # matches '1'
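To make the contrast with `re.match()` concrete, the sketch below runs the same pattern through both functions on the same input. The pattern `\d+` is an illustrative choice and not part of the original example.

```python
import re

text = "python123"

at_start = re.match(r"\d+", text)   # None: the string starts with a letter
anywhere = re.search(r"\d+", text)  # finds the first run of digits

print(at_start)          # None
print(anywhere.group())  # 123
print(anywhere.span())   # (6, 9)
```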

2. `re.findall()`

Returns a list of all matches; if nothing matches, it returns an empty list instead of raising an error.

print(re.findall(r"\d", "1py2345thon"))  # ['1', '2', '3', '4', '5']
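The shape of the returned list depends on whether the pattern contains groups. A short sketch with made-up sample data:

```python
import re

text = "a=1, b=22, c=333"  # made-up sample data

# No groups: each item is the full match.
numbers = re.findall(r"\d+", text)

# Two groups: each item is a (key, value) tuple of the captured texts.
pairs = re.findall(r"(\w)=(\d+)", text)

# No match at all: an empty list, not None.
nothing = re.findall(r"\d+", "python")

print(numbers)  # ['1', '22', '333']
print(pairs)    # [('a', '1'), ('b', '22'), ('c', '333')]
print(nothing)  # []
```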

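3. `re.sub()`

Replaces the matched text.

# re.sub(pattern, repl, string, count=0)
# pattern: the regular expression describing the text to replace
# repl: the replacement text
# string: the string to operate on
# count: maximum number of replacements; the default 0 replaces every match

res = re.sub(r"\d", "X", "[1,2,3,4,5]")
print(res)  # Output: [X,X,X,X,X]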
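Beyond a fixed replacement string, `re.sub()` also accepts a `count` limit, backreferences in the replacement, and a callable. The sketch below shows all three on the same input; the specific replacements are illustrative choices.

```python
import re

text = "[1,2,3,4,5]"

# count=2 stops after the first two replacements.
first_two = re.sub(r"\d", "X", text, count=2)

# A backreference in the replacement reuses the captured text.
wrapped = re.sub(r"(\d)", r"(\1)", text)

# The replacement can also be a function that receives each Match object.
doubled = re.sub(r"\d", lambda m: str(int(m.group()) * 2), text)

print(first_two)  # [X,X,3,4,5]
print(wrapped)    # [(1),(2),(3),(4),(5)]
print(doubled)    # [2,4,6,8,10]
```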

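4. `re.split()`

Splits a string by a regular expression.

# re.split(pattern, string, maxsplit=0)
# pattern: the regular expression whose matches act as separators
# string: the string to split
# maxsplit: maximum number of splits; the default 0 means no limit

res = re.split(r"\d", "Y1Y2Y3Y4Y5]", maxsplit=2)
print(res)  # ['Y', 'Y', 'Y3Y4Y5]']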
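`re.split()` is most useful when the separator is not a single fixed string. The sketch below, with made-up input, splits on several separator characters at once and then shows that a capturing group in the pattern keeps the separators in the result.

```python
import re

line = "apple, banana;cherry  date"  # made-up sample data

# Split on a comma, a semicolon, or whitespace, plus any spaces that follow.
fields = re.split(r"[,;\s]\s*", line)

# Wrapping the separator in a group keeps the separators in the output.
with_seps = re.split(r"(\d)", "Y1Y2Y3")

print(fields)     # ['apple', 'banana', 'cherry', 'date']
print(with_seps)  # ['Y', '1', 'Y', '2', 'Y', '3', '']
```

The trailing empty string appears because the input ends with a separator.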

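VII. Greedy vs. Non-Greedy Matching

Matching is greedy by default: a quantifier matches as much text as possible.
Adding ? after a quantifier makes it non-greedy: it matches as little text as possible.

Example: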

text = "<div>hello</div><div>world</div>"
safe_match(r"<div>(.*)</div>", text)  # greedy: matches the whole string
safe_match(r"<div>(.*?)</div>", text)  # non-greedy: matches only the first div
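The effect is even clearer when the captured group is collected with `re.findall()`. A small sketch using the same input string:

```python
import re

text = "<div>hello</div><div>world</div>"

# Greedy: .* runs to the last </div>, so there is a single long capture.
greedy = re.findall(r"<div>(.*)</div>", text)

# Non-greedy: .*? stops at the nearest </div>, so each div is captured separately.
lazy = re.findall(r"<div>(.*?)</div>", text)

print(greedy)  # ['hello</div><div>world']
print(lazy)    # ['hello', 'world']
```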

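VIII. Raw Strings

In Python, r"" defines a raw string, which avoids escaping problems.

Notation	Meaning
`"\\\\"`	In a normal string this is two backslashes; in memory it is `\\`, which is exactly what the regex needs to match a single `\`
`r"\\"`	In a raw string Python leaves the two backslashes untouched, so the regex engine again receives `\\` and matches a single `\`; recommended

To match one backslash \ with a regular expression written as a normal string, you need "\\\\":

The first layer of escaping is done by the Python string parser, which turns "\\\\" into \\
The second layer is done by the regex engine, which turns \\ into one literal \
With a raw string r"", writing r"\\" is enough, which is clearer and safer.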
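The two layers can be checked directly by comparing the two spellings of the pattern. This is a minimal sketch; the variable names are illustrative.

```python
import re

normal = "\\\\"  # normal string: four typed backslashes become two characters
raw = r"\\"      # raw string: two typed backslashes stay two characters

print(len(normal), len(raw))  # 2 2
print(normal == raw)          # True

# Either spelling is the regex \\ , which matches one literal backslash.
res = re.match(raw, "\\game")
print(len(res.group()))       # 1
```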

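How backslashes are displayed:

In the string "E:\\pyCode\\pytest\\pythonProject1", the Python interpreter treats each \\ as a single backslash character.
In memory the string is E:\pyCode\pytest\pythonProject1, with one \ at each position, and print() shows it exactly that way.

The doubled form E:\\pyCode\\pytest\\pythonProject1 appears only when Python displays the string's repr(), for example when you evaluate the string directly in the interactive console instead of calling print().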
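The difference between the two displays can be reproduced without the interactive console by comparing `str()` and `repr()` of the same path. A short sketch:

```python
path = "E:\\pyCode\\pytest\\pythonProject1"

shown = str(path)    # what print() writes
echoed = repr(path)  # what the interactive console echoes

print(shown)             # E:\pyCode\pytest\pythonProject1
print(echoed)            # 'E:\\pyCode\\pytest\\pythonProject1'
print(path.count("\\"))  # 3
```

The string holds only three backslash characters; the doubling exists solely in the `repr()` form.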

Example:

safe_match(r"\\", r"\game")  # matches '\'
safe_match("\\\\", "\\game")  # matches '\'