Python Regular Expressions

2025/5/16 16:10:50

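Python provides regular-expression pattern matching through the re module, a core tool for text processing. This guide walks through the syntax systematically and then applies it to practical cases:

I. Basic Regex Syntax

1. Metacharacter Quick Reference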

Symbol	Meaning	Example	Matches
`.`	Any character except a newline	`r"a.c"`	"abc" (but not "a\nc")
`^`	Start-of-line anchor	`r"^Python"`	"Python…"
`$`	End-of-line anchor	`r"\.py$"`	"file.py"
`\d`	Digit	`r"\d{3}-\d{4}"`	"010-1234"
`\D`	Non-digit	`r"\D+@example\.com"`	"user@example.com"
`\w`	Word character (letter, digit, or _)	`r"\w+@\w+\.\w+"`	"alice@test.com"
`\s`	Whitespace character	`r"hello\s+world"`	"hello world"
`*`	Zero or more times	`r"ab*c"`	"ac", "abc", "abbc"
`+`	One or more times	`r"ab+c"`	"abc", "abbc"
`?`	Zero or one time	`r"https?://"`	"http://", "https://"
`{}`	Exact count or range	`r"\d{3,5}"`	"123", "45678"
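As a quick check of the table, the following minimal sketch applies three of the patterns to a small list of strings. The `samples` list is made up purely for illustration.

```python
import re

# Hypothetical sample data, chosen only to exercise the metacharacters above.
samples = ["file.py", "file.pyc", "010-1234", "hello   world"]

# `$` anchors the end of the string, so "file.pyc" is rejected.
py_files = [s for s in samples if re.search(r"\.py$", s)]

# `\d{3}-\d{4}` needs exactly three digits, a dash, then four digits.
phones = [s for s in samples if re.fullmatch(r"\d{3}-\d{4}", s)]

# `\s+` absorbs any run of whitespace between the two words.
spaced = [s for s in samples if re.fullmatch(r"hello\s+world", s)]

print(py_files)  # ['file.py']
print(phones)    # ['010-1234']
print(spaced)    # ['hello   world']
```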

2. Special Constructs

import re

# Grouping and capturing
match = re.search(r"(\d{3})-(\d{4})", "010-1234")
print(match.group(1))  # "010" (first group)
print(match.groups())   # ("010", "1234")

# Non-capturing group
re.search(r"(?:\d{3}-){2}\d{4}", "010-123-4567")  # matches "010-123-4567"; the repeated group is not captured

# Named groups
re.search(r"(?P<area>\d{3})-(?P<num>\d{4})", "010-1234").groupdict()  # {'area': '010', 'num': '1234'}
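The sketch below shows a few other ways to read captured groups, plus a backreference, which re-matches whatever an earlier group captured. The input strings are invented for the example.

```python
import re

# Named groups can be read by name, by index, or via subscript.
m = re.search(r"(?P<area>\d{3})-(?P<num>\d{4})", "Tel: 010-1234")
area = m.group("area")   # same value as m.group(1)
num = m["num"]           # Match objects support subscript access (Python 3.6+)
span = m.span("area")    # (start, end) offsets of the named group

# \1 refers back to group 1; here it finds doubled words.
doubled = re.findall(r"\b(\w+) \1\b", "this is is a test test")

print(area, num, span)  # 010 1234 (5, 8)
print(doubled)          # ['is', 'test']
```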

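II. Core Functions in Detail

1. Matching and Searching

import re

# Match the entire string
re.fullmatch(r"\d{3}-\d{4}", "010-1234")  # the whole string must match

# Search for the first match
re.search(r"\b\w+@\w+\.\w+\b", "Contact: alice@test.com").group()  # "alice@test.com"

# Find all matches
re.findall(r"\d+", "Order 123, amount 456 yuan")  # ['123', '456']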
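To make the difference between the matching functions concrete, here is a small sketch that runs one pattern through `re.match`, `re.search`, and `re.fullmatch`. The sample text is made up; the point is that each call returns `None` on failure, so the result should be checked before calling `.group()`.

```python
import re

text = "id 42 and 7"
pattern = r"\d+"

at_start = re.match(pattern, text)       # None: the text does not start with a digit
anywhere = re.search(pattern, text)      # first match anywhere in the text
whole = re.fullmatch(pattern, text)      # None: the whole string is not digits
whole_ok = re.fullmatch(pattern, "42")   # matches the entire string

# Guard against None before reading the match.
first = anywhere.group() if anywhere else None
print(first)  # 42
```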

2. Substitution

# Simple replacement
re.sub(r"\bPython\b", "Java", "Python is great. Pythonic code.")  # "Java is great. Pythonic code."

# Replacement with a function (computed per match)
def hex_replace(match):
    return hex(int(match.group()))

re.sub(r"\d+", hex_replace, "RGB(255,0,128)")  # "RGB(0xff,0x0,0x80)"
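Beyond a plain string or a function, the replacement can refer to captured groups, and `re.sub` accepts a `count` limit. The following sketch, using invented inputs, shows `\g<name>` references, `count`, and `re.subn`, which also reports how many replacements were made.

```python
import re

# Reorder a date with named groups; \g<name> refers to a named group.
iso = re.sub(
    r"(?P<m>\d{2})/(?P<d>\d{2})/(?P<y>\d{4})",
    r"\g<y>-\g<m>-\g<d>",
    "Due 05/11/2025",
)

# count limits how many replacements are made.
first_only = re.sub(r"\d+", "#", "a1 b2 c3", count=1)

# subn() returns the new string together with the number of replacements.
masked, n = re.subn(r"\d+", "#", "a1 b2 c3")

print(iso)         # Due 2025-05-11
print(first_only)  # a# b2 c3
print(masked, n)   # a# b# c# 3
```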

3. Splitting Strings

re.split(r"[,;\s]+", "apple, banana; cherry  date")  # ['apple', 'banana', 'cherry', 'date']
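Two options of `re.split` are easy to overlook: `maxsplit` stops after a given number of splits, and a capturing group in the pattern keeps the separators in the result. A minimal sketch with made-up inputs:

```python
import re

line = "key = value = more"

# maxsplit=1 splits only at the first separator.
key, value = re.split(r"\s*=\s*", line, maxsplit=1)

# A capturing group in the pattern keeps the separators in the output.
tokens = re.split(r"([+-])", "1+2-3")

print(key, "|", value)  # key | value = more
print(tokens)           # ['1', '+', '2', '-', '3']
```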

III. Advanced Pattern Techniques

1. Greedy vs. Non-Greedy Matching

re.findall(r"<(.*)>", "<a>text</a><b>more</b>")  # greedy: ['a>text</a><b>more</b']
re.findall(r"<(.*?)>", "<a>text</a><b>more</b>") # non-greedy: ['a', '/a', 'b', '/b']
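A negated character class is a common alternative to the non-greedy quantifier, and a stricter class can restrict the match to opening tags only. The sketch below compares the three on the same input; it illustrates the quantifier behavior and is not a recommendation to parse real HTML with regex.

```python
import re

s = "<a>text</a><b>more</b>"

lazy = re.findall(r"<(.*?)>", s)        # stop at the first ">"
negated = re.findall(r"<([^>]*)>", s)   # same result, without lazy matching
opening = re.findall(r"<([a-z]+)>", s)  # letters only, so closing tags are skipped

print(lazy)     # ['a', '/a', 'b', '/b']
print(negated)  # ['a', '/a', 'b', '/b']
print(opening)  # ['a', 'b']
```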

2. Boundary Control

# Word boundary
re.findall(r"\bcat\b", "The cat sat on the mat.")  # ['cat']

# Multiline mode
re.findall(r"^Python", "Java\nPython\nC++", re.MULTILINE)  # ['Python']
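The effect of `re.MULTILINE` is easiest to see by running the same anchored pattern with and without the flag. This sketch uses an invented three-line string:

```python
import re

text = "Java\nPython 3\nC++"

# Without the flag, ^ matches only at the very start of the string.
default_mode = re.findall(r"^\w+", text)

# With re.MULTILINE, ^ and $ also match at every line boundary.
per_line = re.findall(r"^\w+", text, re.MULTILINE)
line_ends = re.findall(r"\S+$", text, re.MULTILINE)

print(default_mode)  # ['Java']
print(per_line)      # ['Java', 'Python', 'C']
print(line_ends)     # ['Java', '3', 'C++']
```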

3. Lookahead Assertions

# Positive lookahead
re.findall(r"\b\w+(?=ing\b)", "Reading writing coding")  # ['Read', 'writ', 'cod']

# Negative lookahead (placed before \w+ so the whole word is checked first)
re.findall(r"\b(?!\w*ing\b)\w+", "Play played playing")  # ['Play', 'played']
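Because lookaheads consume no characters, several of them can be stacked at the same position to require independent conditions. The sketch below checks a hypothetical password policy (at least 8 characters, one digit, one uppercase letter); the policy and the name `is_strong` are assumptions made for the example.

```python
import re

# Hypothetical policy: at least 8 characters, one digit, one uppercase letter.
# Each lookahead scans from the start without consuming input.
PASSWORD = re.compile(r"(?=.*\d)(?=.*[A-Z]).{8,}")

def is_strong(pw: str) -> bool:
    """Return True if pw satisfies every lookahead and the length rule."""
    return PASSWORD.fullmatch(pw) is not None

print(is_strong("Passw0rd"))   # True
print(is_strong("password1"))  # False (no uppercase letter)
print(is_strong("Ab1"))        # False (too short)
```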

IV. Practical Examples

1. Data Validation

# Email validation
EMAIL_REGEX = r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"
re.fullmatch(EMAIL_REGEX, "user@example.com")  # returns a Match (valid)

# URL validation
URL_REGEX = r"https?://(?:www\.)?[^\s/$.?#].[^\s]*"
re.fullmatch(URL_REGEX, "https://www.test.com/path?query=1")  # returns a Match (valid)
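`re.fullmatch` returns a Match object or `None`, so validation code usually wraps it in a boolean helper. A minimal sketch, reusing the email pattern from above (without the redundant anchors) and a hypothetical helper name `is_email`:

```python
import re

# Same character classes as EMAIL_REGEX above; fullmatch makes ^ and $ unnecessary.
EMAIL_REGEX = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"

def is_email(value: str) -> bool:
    """Return True if the whole string fits the simplified email pattern."""
    return re.fullmatch(EMAIL_REGEX, value) is not None

results = {s: is_email(s) for s in ["user@example.com", "user@example", "a b@c.com"]}
print(results)
# {'user@example.com': True, 'user@example': False, 'a b@c.com': False}
```

This pattern is a pragmatic filter, not a full implementation of the email address grammar; it will reject some valid addresses and accept some invalid ones.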

2. Text Extraction

# Extract HTML tag content ([^>]* allows attributes inside the opening tag)
html = "<div class='content'>Hello</div><p>World</p>"
re.findall(r"<([a-z]+)[^>]*>(.*?)</\1>", html, re.DOTALL)  # [('div', 'Hello'), ('p', 'World')]

# Parse a log timestamp
log = "2025-05-11 14:30:00 [ERROR] Connection failed"
re.search(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}", log).group()  # "2025-05-11 14:30:00"
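The timestamp extraction can be extended to parse the whole log line into fields. The sketch below assumes the layout "timestamp [LEVEL] message" seen in the sample line; the names `LOG_LINE` and `parse_line` are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: "YYYY-MM-DD HH:MM:SS [LEVEL] message".
LOG_LINE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<level>[A-Z]+)\] "
    r"(?P<message>.*)"
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not fit the layout."""
    m = LOG_LINE.fullmatch(line)
    if m is None:
        return None
    record = m.groupdict()
    record["ts"] = datetime.strptime(record["ts"], "%Y-%m-%d %H:%M:%S")
    return record

record = parse_line("2025-05-11 14:30:00 [ERROR] Connection failed")
print(record["level"], "|", record["message"])  # ERROR | Connection failed
```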

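3. Data Cleaning

# Collapse extra whitespace
text = "Hello   World  "
text.strip()  # simple approach: trims only the ends -> "Hello   World"
re.sub(r"\s+", " ", text).strip()  # regex also collapses inner runs -> "Hello World"

# Mask sensitive information (the pattern expects 11 consecutive digits)
phone = "13812345678"
re.sub(r"(\d{3})\d{4}(\d{4})", r"\1****\2", phone)  # "138****5678"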
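The masking pattern above only matches 11 consecutive digits. If numbers may also be written with separators, one option is to capture an optional separator and reuse it with a backreference. This is a sketch under an assumed format (3-4-4 digits, optionally separated by "-" or a space); `PHONE` and `mask_phone` are hypothetical names.

```python
import re

# Group 2 captures the optional separator; \2 requires the same separator again.
PHONE = re.compile(r"\b(\d{3})([- ]?)\d{4}\2(\d{4})\b")

def mask_phone(text: str) -> str:
    """Replace the middle four digits of each phone number with asterisks."""
    return PHONE.sub(r"\1\2****\2\3", text)

print(mask_phone("13812345678"))             # 138****5678
print(mask_phone("Call 138-1234-5678 now"))  # Call 138-****-5678 now
```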

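V. Performance Optimization

1. Precompile Regex Objects

# Compile once when a pattern is used frequently. The re module already caches
# recently used patterns, so the gain is mainly the skipped cache lookup per call.
email_pattern = re.compile(EMAIL_REGEX)
email_pattern.fullmatch("user@test.com")  # skips the pattern lookup that re.fullmatch performs on each call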
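The actual benefit of precompiling depends on the Python version and workload, so it is worth measuring rather than assuming. A minimal timing sketch with `timeit` follows; no expected numbers are shown because they vary by machine.

```python
import re
import timeit

EMAIL_REGEX = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
email_pattern = re.compile(EMAIL_REGEX)

# Time the module-level function (pattern looked up in re's cache on each call).
t_module = timeit.timeit(
    lambda: re.fullmatch(EMAIL_REGEX, "user@test.com"), number=20_000
)

# Time the precompiled pattern object.
t_compiled = timeit.timeit(
    lambda: email_pattern.fullmatch("user@test.com"), number=20_000
)

print(f"re.fullmatch: {t_module:.4f}s, compiled: {t_compiled:.4f}s")
```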

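2. Avoid Catastrophic Backtracking

# Dangerous pattern (nested quantifiers can cause exponential backtracking)
re.search(r"^(a+)+$", "a" * 20 + "b")  # slows down sharply as the run of "a" grows

# Safer pattern using an atomic group (requires Python 3.11+)
re.search(r"^(?>a+)+$", "a" * 20 + "b")  # fails fast and returns None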
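On Python versions without atomic groups, the usual fix is to rewrite the pattern so that quantifiers are not nested. The sketch below times the nested pattern against an equivalent flat one on inputs that cannot match; the names `risky`, `safe`, and `time_search` are hypothetical, and the input sizes are kept small so the script finishes quickly.

```python
import re
import time

risky = re.compile(r"^(a+)+$")  # nested quantifiers
safe = re.compile(r"^a+$")      # accepts the same strings without nesting

def time_search(pattern, text):
    """Return (match result, elapsed seconds) for one search call."""
    start = time.perf_counter()
    result = pattern.search(text)
    return result, time.perf_counter() - start

# The trailing "b" forces a failure, which triggers the backtracking.
for n in (10, 14, 18):
    text = "a" * n + "b"
    _, t_risky = time_search(risky, text)
    _, t_safe = time_search(safe, text)
    print(f"n={n}: risky {t_risky:.5f}s, safe {t_safe:.5f}s")
```

The time for the nested pattern should grow much faster than for the flat one as n increases.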

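3. Choosing the Matching Function

re.search() vs re.match(): the latter only matches at the start of the string
re.finditer(): returns an iterator of match objects, which saves memory on large texts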
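Unlike `re.findall`, which builds a full list of strings, `re.finditer` yields Match objects one at a time, so positions are available too. A small sketch with an invented input:

```python
import re

text = "order 123, amount 456"

# Each item is a Match object, produced lazily as the loop advances.
matches = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\d+", text)]

print(matches)  # [('123', 6, 9), ('456', 18, 21)]
```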

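VI. Recommended Debugging Tools

Online testers:
- Regex101 (real-time visualization of the matching process)
- RegExr (built-in library of common patterns)

Debugging in Python:

# Print debugging information
pattern = re.compile(r"(\d+)-(\w+)")
print(pattern.pattern)      # the regex source string
print(pattern.flags)        # flags as an integer bitmask
print(pattern.groups)       # number of capturing groups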
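Two further standard-library aids can make a pattern easier to inspect: the `re.VERBOSE` flag, which allows whitespace and comments inside the pattern, and the `groupindex` attribute, which maps group names to numbers. The sketch below rewrites the pattern above with named groups; the group names are chosen for illustration.

```python
import re

pattern = re.compile(
    r"""
    (?P<id>\d+)      # numeric id
    -                # literal separator
    (?P<name>\w+)    # name part
    """,
    re.VERBOSE,
)

print(dict(pattern.groupindex))  # {'id': 1, 'name': 2}

m = pattern.search("item 42-widget")
print(m.groupdict(), m.span())   # {'id': '42', 'name': 'widget'} (5, 14)
```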

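VII. Common Pitfalls

Escaping special characters:

# Wrong: unescaped parentheses are treated as a capturing group
re.search(r"(123)", "test(123)test")  # matches only "123", not the literal parentheses
# Right: escape the metacharacters
re.search(r"\(123\)", "test(123)test")  # matches "(123)"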
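When the literal text comes from a variable, `re.escape` adds the backslashes automatically. A minimal sketch, with `user_input` standing in for any externally supplied string:

```python
import re

user_input = "(123)"  # hypothetical literal text to search for

# re.escape() backslash-escapes every regex metacharacter in the string.
pattern = re.escape(user_input)
found = re.search(pattern, "test(123)test")

print(pattern)        # \(123\)
print(found.group())  # (123)
```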

Encoding issues:

# The UNICODE flag makes \w match non-ASCII characters
re.search(r"^\w+$", "中文", re.UNICODE)  # already the default for str patterns in Python 3
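Since Unicode matching is the default for str patterns, the flag that more often matters in practice is `re.ASCII`, which restricts `\w`, `\d`, and `\s` to ASCII. The following sketch contrasts the two modes; the second input mixes fullwidth and ASCII digits.

```python
import re

# Default (Unicode) mode: \w matches CJK characters.
unicode_match = re.fullmatch(r"\w+", "中文")

# re.ASCII limits \w to [a-zA-Z0-9_], so the same text no longer matches.
ascii_match = re.fullmatch(r"\w+", "中文", re.ASCII)

# \d likewise matches any Unicode decimal digit unless re.ASCII is set.
digits = re.findall(r"\d+", "１２３ 456")
digits_ascii = re.findall(r"\d+", "１２３ 456", re.ASCII)

print(unicode_match is not None, ascii_match is None)  # True True
print(digits, digits_ascii)  # ['１２３', '456'] ['456']
```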

Greedy matching trap:

# Wrong: greedy matching captures across tags
re.findall(r"<(.*)>", "<a>1</a><b>2</b>")  # ['a>1</a><b>2</b']
# Right: use non-greedy mode (same input string as above)
re.findall(r"<(.*?)>", ...)  # ['a', '/a', 'b', '/b']

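These techniques cover most everyday regex tasks. For complex scenarios (such as mixed-language text), consider the third-party regex library, which supports advanced features such as Unicode properties and fuzzy matching.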
