Troubleshooting and Resolving Ceph PG unfound/lost Objects
- Background
- Symptoms
- Investigation
- Lessons learned
- Reference commands
- Conclusion
Background
A Ceph cluster went into HEALTH_ERR, reporting unfound objects in a PG that repair could not fix automatically.
Symptoms
- ceph health detail showed:
    HEALTH_ERR 4/213107278 objects unfound (0.000%); Possible data damage: 1 pg recovery_unfound; Degraded data redundancy: 36/2130898991 objects degraded (0.000%), 1 pg degraded
    OBJECT_UNFOUND 4/213107278 objects unfound (0.000%)
        pg 2.f06 has 4 unfound objects
    PG_DAMAGED Possible data damage: 1 pg recovery_unfound
        pg 2.f06 is active+recovery_unfound+degraded+repair, acting [520,454,563,300,70,59,243,166,422,333], 4 unfound
    PG_DEGRADED Degraded data redundancy: 36/2130898991 objects degraded (0.000%), 1 pg degraded
        pg 2.f06 is active+recovery_unfound+degraded+repair, acting [520,454,563,300,70,59,243,166,422,333], 4 unfound
- The repair log showed:
    repair 4 missing, 0 inconsistent objects
    repair 36 errors, 36 fixed
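The health output above names both the affected PG and its acting set. As a minimal sketch, the two helpers below pull those values out of the text. The function names unfound_pgs and acting_osds are made up for illustration, and the patterns assume the line layout seen in this incident, which may differ between Ceph releases.

```shell
# Hypothetical helpers; both read `ceph health detail` text on stdin.

# Print "<pgid> <count>" for every PG reported with unfound objects.
unfound_pgs() {
  grep -oE 'pg [0-9]+\.[0-9a-f]+ has [0-9]+ unfound objects' \
    | awk '{print $2, $4}' \
    | sort -u
}

# Print the OSD ids of the first acting set found, one per line.
acting_osds() {
  grep -oE 'acting \[[0-9,]+\]' \
    | head -n 1 \
    | sed -E 's/^acting \[//; s/\]$//' \
    | tr ',' '\n'
}
```

On a live cluster these would be fed with `ceph health detail | unfound_pgs` and `ceph health detail | acting_osds`; the resulting OSD ids can then be looked up in `ceph osd tree` to confirm each one is up.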
Investigation
- Confirm OSD status
  - All OSDs in the acting set were up, with no process or hardware faults.
- Analyze the repair/scrub logs
  - Repair fixed 36 errors, but 4 objects could not be located on any OSD holding the PG (unfound).
- Try mark_unfound_lost revert
  - It failed with: mode must be 'delete' for ec pool. An erasure-coded (EC) pool accepts only delete.
- Finally, run ceph pg 2.f06 mark_unfound_lost delete
  - The command returned: pg has 4 objects unfound and apparently lost marking
- Health recovers
  - Shortly afterwards, ceph health detail reported HEALTH_OK and the PG returned to a normal state.
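The revert attempt failed only because pool 2 is erasure-coded, which can be checked before choosing a mode. The sketch below relies on two things: the pool id is the part of the PG id before the dot, and `ceph osd pool ls detail` prints one line per pool that includes the word erasure or replicated. The helper name lost_mode_for_pg is hypothetical, and the exact line layout is an assumption worth verifying on your release.

```shell
# Hypothetical helper: suggest a mark_unfound_lost mode for a PG.
# stdin: text of `ceph osd pool ls detail` (assumed one "pool <id> ..." line per pool).
# $1:    PG id such as 2.f06; the pool id is everything before the dot.
lost_mode_for_pg() {
  pool_id=${1%%.*}
  line=$(grep -E "^pool ${pool_id} " || true)
  if [ -z "$line" ]; then
    echo "pool ${pool_id} not found" >&2
    return 1
  fi
  case "$line" in
    # EC pools reject revert, as the error in this incident showed.
    *" erasure "*) echo delete ;;
    # Replicated pools accept either mode; print the less destructive one.
    *) echo revert ;;
  esac
}
```

For example, `ceph osd pool ls detail | lost_mode_for_pg 2.f06` would be expected to print delete for the pool in this incident. The helper only suggests a mode; whether to mark the objects lost at all remains a manual decision.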
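Lessons learned
- Unfound means Ceph knows the object exists but cannot locate a usable copy on any OSD, so it cannot be recovered automatically.
- An EC pool can only discard unfound objects with delete; revert is not available.
- repair only reconciles inconsistencies between copies that still exist; it cannot recreate an object that has no surviving copy.
- After the objects are marked lost, cluster health recovers, but those objects are gone for good, so the application owners need to assess the impact.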
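Because marking objects lost is irreversible, it can help to save the list of unfound objects first so the affected data can be identified afterwards. The sketch below wraps `ceph pg <pgid> list_unfound` (older releases name the subcommand list_missing). The helper name save_unfound_list and the default file name are assumptions for illustration.

```shell
# Hypothetical helper: save the unfound-object listing of a PG to a file
# before the objects are marked lost.
# $1: PG id, $2: optional output path (default: unfound-<pgid>.json)
save_unfound_list() {
  pgid=$1
  out=${2:-"unfound-${pgid}.json"}
  # Abort if the ceph command itself fails.
  ceph pg "$pgid" list_unfound > "$out" || return 1
  # Abort if nothing was written, so an empty file is not mistaken for a record.
  if [ ! -s "$out" ]; then
    echo "empty listing for ${pgid}" >&2
    return 1
  fi
  echo "$out"
}
```

Running `save_unfound_list 2.f06` before `ceph pg 2.f06 mark_unfound_lost delete` leaves a record that can be handed to the application owners for impact assessment.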
Reference commands
# Show health status and details
ceph health detail
ceph status
# Mark unfound objects as lost (EC pools accept only delete)
ceph pg <pgid> mark_unfound_lost delete
# Inspect PG state
ceph pg <pgid> query
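Conclusion
When a Ceph PG reports unfound/lost objects, investigate calmly. Once you have confirmed that the objects cannot be recovered, mark them lost decisively to restore overall cluster health. Back up important data regularly to guard against unrecoverable loss in extreme cases.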