KingbaseES Cluster Operations Case: After a Primary/Standby Failure, the Primary Works Normally but the Standby Cannot Be Brought Up

news2026/3/17 18:35:37

Case: after a primary/standby failure, the primary works normally but the standby cannot be brought up

The cluster status shows that node2 is running but is not attached to its upstream node1:

    [kingbase@localhost bin]$ repmgr cluster show
    [WARNING] node node2 not found in sys_stat_replication
     ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag    | Connection string
    ----+-------+---------+-----------+----------+----------+----------+----------+------------+-------------------
     1  | node1 | primary | * running |          | default  | 100      | 1        |            | host=192.168.158.26 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
     2  | node2 | standby |   running | ! node1  | default  | 100      | 1        | 0 bytes    | host=192.168.158.27 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000

    [WARNING] following issues were detected
      - node node2 (ID: 2) is not attached to its upstream node node1 (ID: 1)

I. Start the database service on the primary and the standby

    [kingbase@localhost bin]$ ./sys_ctl -D /home/kingbase/cluster/kingbase/data start
    [kingbase@localhost bin]$ ./sys_ctl -D /home/kingbase/cluster/kingbase/data start

Do not start the cluster at this point. The information in the database metadata tables does not match the current state, so start the databases manually first.

II. Register the service to the cluster

1. Register the primary to the cluster (this step can be skipped if the primary is healthy)

    [kingbase@localhost bin]$ ./repmgr primary register -F
    [kingbase@localhost bin]$ ./repmgr cluster show
    [WARNING] node node2 not found in sys_stat_replication
     ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag    | Connection string
    ----+-------+---------+-----------+----------+----------+----------+----------+------------+-------------------
     1  | node1 | primary | * running |          | default  | 100      | 1        |            | host=192.168.158.26 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
     2  | node2 | standby |   running | ! node1  | default  | 100      | 1        | 1888 bytes | host=192.168.158.27 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000

    [WARNING] following issues were detected
      - node node2 (ID: 2) is not attached to its upstream node node1 (ID: 1)

2. Check the cluster node status

    [kingbase@localhost bin]$ ./repmgr cluster show
    [WARNING] node node2 not found in sys_stat_replication
     ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag    | Connection string
    ----+-------+---------+-----------+----------+----------+----------+----------+------------+-------------------
     1  | node1 | primary | * running |          | default  | 100      | 1        |            | host=192.168.158.26 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
     2  | node2 | standby |   running | ! node1  | default  | 100      | 1        | 1888 bytes | host=192.168.158.27 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000

    [WARNING] following issues were detected
      - node node2 (ID: 2) is not attached to its upstream node node1 (ID: 1)

III. Register the standby to the cluster

The standby cannot be registered to the cluster directly at this point, because its metadata has not been updated yet. Stop the database service on the standby first.

1. Stop the database service

    [kingbase@localhost bin]$ ./sys_ctl -D /home/kingbase/cluster/kingbase/data stop
    waiting for server to shut down ...... done
    server stopped

2. Register the standby to the cluster

    [kingbase@localhost bin]$ ./repmgr standby register -h 192.168.158.26 -U esrep -d esrep -F
    [INFO] connecting to local node node2 (ID: 2)
    [INFO] connecting to primary database
    [WARNING] unable to connect to remote host via ES
    [ERROR] unable to connect via ES to host
    [NOTICE] failed to update nodes_info file on primary node.
    [INFO] standby registration complete
    [NOTICE] standby node node2 (ID: 2) successfully registered

3. Rejoin the standby node to the cluster

    [kingbase@localhost bin]$ ./repmgr node rejoin -h 192.168.158.26 -U esrep -d esrep
    [NOTICE] rejoin target is node node1 (ID: 1)
    [INFO] timelines are same, this server is not ahead
    [DETAIL] local node lsn is 0/1F0000A0, rejoin target lsn is 0/1F000CF0
    [INFO] creating replication slot as user esrep
    [NOTICE] setting node 2's upstream to node 1
    [WARNING] unable to ping host=192.168.158.27 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
    [DETAIL] KCIping() returned KCIPING_NO_RESPONSE
    [NOTICE] begin to start server at 2025-12-12 15:57:28.157322
    [NOTICE] starting server using /home/kingbase/cluster/kingbase/bin/sys_ctl -w -t 90 -D /home/kingbase/cluster/kingbase/data -l /home/kingbase/cluster/kingbase/bin/logfile start
    [NOTICE] start server finish at 2025-12-12 15:57:28.370792
    [NOTICE] NODE REJOIN successful
    [DETAIL] node 2 is now attached to node 1

The ping warning is expected here: the standby database was stopped in step 1, and the rejoin operation starts it again.

4. On the primary, check the cluster status and the streaming replication status

(1) Check the cluster node status

    [kingbase@localhost bin]$ repmgr cluster show
     ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag    | Connection string
    ----+-------+---------+-----------+----------+----------+----------+----------+------------+-------------------
     1  | node1 | primary | * running |          | default  | 100      | 1        |            | host=192.168.158.26 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
     2  | node2 | standby |   running | node1    | default  | 100      | 1        | 0 bytes    | host=192.168.158.27 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000

(2) Check the streaming replication status

    [kingbase@localhost bin]$ ksql test system
    Type "help" for help.

    test=#
    test=#
    test=# select * from sys_stat_replication;
     pid   | usesysid | usename | application_name | client_addr    | client_hostname | client_port | backend_start                 | backend_xmin | state     | sent_lsn   | write_lsn  | flush_lsn  | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state | reply_time
    -------+----------+---------+------------------+----------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------+-----------+------------+---------------+------------+-------------------------------
     52974 | 16385    | esrep   | node2            | 192.168.158.27 |                 | 19282       | 2025-12-12 15:57:28.292730+08 |              | streaming | 0/1F000EE0 | 0/1F000EE0 | 0/1F000EE0 | 0/1F000EE0 |           |           |            | 1             | quorum     | 2025-12-12 15:59:04.680703+08
    (1 row)

5. Start the repmgrd service on the primary and the standby

    [kingbase@localhost bin]$ ./repmgrd -d
    [2025-12-12 15:59:27] [NOTICE] redirecting logging output to /home/kingbase/cluster/kingbase/log/hamgr.log
    [kingbase@localhost bin]$ ./repmgrd -d
    [2025-12-12 15:59:41] [NOTICE] redirecting logging output to /home/kingbase/cluster/kingbase/log/hamgr.log

IV. Restart the cluster service to verify

1. Restart the cluster with sys_monitor.sh

    [kingbase@localhost bin]$ sys_monitor.sh restart
    2025-12-12 15:59:59 Ready to stop all DB ...
    2025-12-12 16:00:04 begin to stop repmgrd on [192.168.158.26].
    2025-12-12 16:00:05 repmgrd on [192.168.158.26] stop success.
    2025-12-12 16:00:05 begin to stop repmgrd on [192.168.158.27].
    2025-12-12 16:00:06 repmgrd on [192.168.158.27] stop success.
    2025-12-12 16:00:06 begin to stop DB on [192.168.158.27].
    waiting for server to shut down...... done
    server stopped
    2025-12-12 16:00:08 DB on [192.168.158.27] stop success.
    2025-12-12 16:00:08 begin to stop DB on [192.168.158.26].
    waiting for server to shut down....... done
    server stopped
    2025-12-12 16:00:11 DB on [192.168.158.26] stop success.
    2025-12-12 16:00:12 Done.
    2025-12-12 16:00:12 Ready to start all DB ...
    2025-12-12 16:00:12 begin to start DB on [192.168.158.26].
    waiting for server to start.... done
    server started
    2025-12-12 16:00:13 execute to start DB on [192.168.158.26] success, connect to check it.
    2025-12-12 16:00:14 DB on [192.168.158.26] start success.
    2025-12-12 16:00:14 Try to ping trusted_servers on host 192.168.158.26 ...
    2025-12-12 16:00:16 Try to ping trusted_servers on host 192.168.158.27 ...
    2025-12-12 16:00:19 begin to start DB on [192.168.158.27].
    waiting for server to start.... done
    server started
    2025-12-12 16:00:19 execute to start DB on [192.168.158.27] success, connect to check it.
    2025-12-12 16:00:21 DB on [192.168.158.27] start success.
     ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag    | Connection string
    ----+-------+---------+-----------+----------+----------+----------+----------+------------+-------------------
     1  | node1 | primary | * running |          | default  | 100      | 1        |            | host=192.168.158.26 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
     2  | node2 | standby |   running | node1    | default  | 100      | 1        | 0 bytes    | host=192.168.158.27 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
    2025-12-12 16:00:21 The primary DB is started.
    2025-12-12 16:00:21 begin to start repmgrd on [192.168.158.26].
    [2025-12-12 16:00:21] [NOTICE] using provided configuration file /home/kingbase/cluster/kingbase/bin/../etc/repmgr.conf
    [2025-12-12 16:00:21] [NOTICE] redirecting logging output to /home/kingbase/cluster/kingbase/log/hamgr.log
    2025-12-12 16:00:23 repmgrd on [192.168.158.26] start success.
    2025-12-12 16:00:23 begin to start repmgrd on [192.168.158.27].
    [2025-12-12 16:00:23] [NOTICE] using provided configuration file /home/kingbase/cluster/kingbase/bin/../etc/repmgr.conf
    [2025-12-12 16:00:23] [NOTICE] redirecting logging output to /home/kingbase/cluster/kingbase/log/hamgr.log
    2025-12-12 16:00:25 repmgrd on [192.168.158.27] start success.
     ID | Name  | Role    | Status    | Upstream | repmgrd | PID   | Paused? | Upstream last seen
    ----+-------+---------+-----------+----------+---------+-------+---------+--------------------
     1  | node1 | primary | * running |          | running | 53964 | no      | n/a
     2  | node2 | standby |   running | node1    | running | 36692 | no      | 1 second(s) ago
    [2025-12-12 16:00:27] [NOTICE] redirecting logging output to /home/kingbase/cluster/kingbase/log/kbha.log
    [2025-12-12 16:00:31] [NOTICE] redirecting logging output to /home/kingbase/cluster/kingbase/log/kbha.log
    2025-12-12 16:00:32 Done.

2. Check the cluster node status

    [kingbase@localhost bin]$ repmgr cluster show
     ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag    | Connection string
    ----+-------+---------+-----------+----------+----------+----------+----------+------------+-------------------
     1  | node1 | primary | * running |          | default  | 100      | 1        |            | host=192.168.158.26 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
     2  | node2 | standby |   running | node1    | default  | 100      | 1        | 0 bytes    | host=192.168.158.27 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000

3. Check the streaming replication status

    [kingbase@localhost bin]$ repmgr cluster show
     ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | LSN_Lag    | Connection string
    ----+-------+---------+-----------+----------+----------+----------+----------+------------+-------------------
     1  | node1 | primary | * running |          | default  | 100      | 1        |            | host=192.168.158.26 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000
     2  | node2 | standby |   running | node1    | default  | 100      | 1        | 0 bytes    | host=192.168.158.27 user=esrep dbname=esrep port=54321 connect_timeout=10 keepalives=1 keepalives_idle=2 keepalives_interval=2 keepalives_count=3 tcp_user_timeout=9000

    [kingbase@localhost bin]$ ksql test system
    Type "help" for help.

    test=#
    test=#
    test=# select * from sys_stat_replication;
     pid   | usesysid | usename | application_name | client_addr    | client_hostname | client_port | backend_start                 | backend_xmin | state     | sent_lsn   | write_lsn  | flush_lsn  | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state | reply_time
    -------+----------+---------+------------------+----------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------+-----------+------------+---------------+------------+-------------------------------
     53846 | 16385    | esrep   | node2            | 192.168.158.27 |                 | 19344       | 2025-12-12 16:00:19.732036+08 |              | streaming | 0/20000518 | 0/20000518 | 0/20000518 | 0/20000518 |           |           |            | 1             | quorum     | 2025-12-12 16:23:21.782814+08
    (1 row)
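The symptom this case starts from is the "!" marker in the Upstream column of repmgr cluster show. As a minimal sketch of how that check could be scripted, the helper below reads the table from standard input and prints the name of every node whose Upstream value begins with "!". The function name detached_nodes is hypothetical and not part of KingbaseES or repmgr, and the parsing assumes the pipe-separated column layout shown in this article.

```shell
# Hypothetical helper (not a KingbaseES/repmgr command): print the names of
# nodes that `repmgr cluster show` marks as detached from their upstream.
# Assumes the pipe-separated layout shown above, where column 2 is the node
# name and column 5 is the Upstream value.
detached_nodes() {
  awk -F'|' '
    # Only table rows have at least 5 pipe-separated fields; a detached
    # node has "!" as the first non-blank character of the Upstream field.
    NF >= 5 && $5 ~ /^[ \t]*!/ {
      gsub(/[ \t]/, "", $2)   # strip padding around the node name
      print $2
    }
  '
}
```

It could be used as `./repmgr cluster show | detached_nodes`. Empty output would suggest that no standby is flagged as detached, although the [WARNING] lines and sys_stat_replication are still worth checking by hand.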

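The standby-side steps of this case (stop the database, force-register the standby, rejoin it) can be collected into a small script. This is only a sketch under stated assumptions: the variable names DRY_RUN, PRIMARY_HOST, REP_USER, REP_DB and DATA_DIR and the functions run and rejoin_standby are hypothetical, the defaults are the values used in this article, and the script is meant to be run from the bin directory on the standby. By default it only prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch of the standby-side recovery steps from this case.
# All variable and function names here are hypothetical; the commands and
# default values are the ones used in the article.
DRY_RUN="${DRY_RUN:-1}"                       # 1 = print commands only
PRIMARY_HOST="${PRIMARY_HOST:-192.168.158.26}"
REP_USER="${REP_USER:-esrep}"
REP_DB="${REP_DB:-esrep}"
DATA_DIR="${DATA_DIR:-/home/kingbase/cluster/kingbase/data}"

# Print the command in dry-run mode; otherwise execute it and abort the
# script on the first failing step.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@" || { echo "step failed: $*" >&2; exit 1; }
  fi
}

rejoin_standby() {
  # Step 1: stop the standby database so it can be re-registered.
  run ./sys_ctl -D "$DATA_DIR" stop
  # Step 2: force-register the standby against the primary.
  run ./repmgr standby register -h "$PRIMARY_HOST" -U "$REP_USER" -d "$REP_DB" -F
  # Step 3: rejoin the standby; in this case the rejoin also restarted it.
  run ./repmgr node rejoin -h "$PRIMARY_HOST" -U "$REP_USER" -d "$REP_DB"
}
```

Calling rejoin_standby with the default DRY_RUN=1 prints the three commands for review; setting DRY_RUN=0 would execute them. Checking the result with repmgr cluster show on the primary afterwards, as in the article, remains a manual step.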