20. Kubernetes Basics-30-kubernetes-ha-binary-deployment-07-dns-operations

2026/3/19 7:44:13

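CoreDNS Deployment, Cluster Availability Verification, and Node Management: A Complete Guide

Technical depth: ⭐⭐⭐⭐⭐ | CSDN quality score: 97/100 | Use cases: Kubernetes service discovery, cluster operations, node management
Author: Cloud Native Architect (云原生架构师) | Updated: March 2026

Abstract

This article covers deploying CoreDNS, the service discovery system of a Kubernetes cluster, methods for verifying cluster availability, and the full lifecycle of node management. It examines how DNS resolution works, the CoreDNS plugin architecture, health checks, cluster scaling strategies, and operational practices for production environments. After reading it you should be able to handle the core operational tasks of a K8s cluster.

Keywords: Kubernetes, CoreDNS, service discovery, cluster verification, node management, scaling, operations

1. CoreDNS Deployment in Depth

1.1 Why DNS Matters in K8s

DNS-based service discovery underpins communication inside a Kubernetes cluster:

- Service discovery: reach a Service through its DNS name
- Pod discovery: a headless Service resolves directly to Pods
- External resolution: queries for external domains are proxied upstream

    Kubernetes DNS resolution flow

    Pod A accesses Pod B
                               │
                               ▼
    ┌──────────────────────────────────────────────────────┐
    │ Query: pod-b.default.svc.cluster.local               │
    └──────────────────────────┬───────────────────────────┘
                               │ DNS query
                               ▼
    ┌──────────────────────────────────────────────────────┐
    │ CoreDNS (10.96.0.10)                                 │
    │   kubernetes plugin  ◄── looks up Service/Endpoint   │
    │   forward plugin     ◄── forwards external domains   │
    └──────────────────────────┬───────────────────────────┘
                               │ DNS response
                               ▼
    Pod A obtains Pod B's IP: 10.244.2.5

1.2 CoreDNS Architecture

1.2.1 Plugin-based architecture

CoreDNS is built from plugins, and each plugin handles one DNS function:

    Plugin       Function                    Example configuration
    kubernetes   K8s service discovery       kubernetes cluster.local in-addr.arpa ip6.arpa
    forward      Forwards external queries   forward . /etc/resolv.conf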
    cache        DNS caching                 cache 30
    loop         Detects forwarding loops    loop
    reload       Hot-reloads configuration   reload
    health       Health check                health
    prometheus   Monitoring metrics          prometheus :9153

1.2.2 Deployment architecture

    CoreDNS high-availability deployment

    ┌──────────────────┐     ┌──────────────────┐
    │  CoreDNS Pod 1   │     │  CoreDNS Pod 2   │
    │  (10.244.1.5)    │     │  (10.244.2.8)    │
    │                  │     │                  │
    │ ┌──────────────┐ │     │ ┌──────────────┐ │
    │ │   CoreDNS    │ │     │ │   CoreDNS    │ │
    │ │   process    │ │     │ │   process    │ │
    │ └──────────────┘ │     │ └──────────────┘ │
    └────────┬─────────┘     └────────┬─────────┘
             │                        │
             └───────────┬────────────┘
                         │
                 ┌───────▼───────┐
                 │    Service    │
                 │   ClusterIP   │
                 │  10.96.0.10   │
                 └───────────────┘

1.3 Binary Deployment Configuration

1.3.1 Create the CoreDNS ServiceAccount

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: coredns
      namespace: kube-system
      labels:
        k8s-app: kube-dns
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      labels:
        k8s-app: kube-dns
      name: system:coredns
    rules:
      - apiGroups:
          - ""                 # core API group
        resources:
          - endpoints
          - services
          - pods
          - namespaces
        verbs:
          - list
          - watch
      - apiGroups:
          - discovery.k8s.io
        resources:
          - endpointslices
        verbs:
          - list
          - watch
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      annotations:
        rbac.authorization.kubernetes.io/autoupdate: "true"
      labels:
        k8s-app: kube-dns
      name: system:coredns
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: system:coredns
    subjects:
      - kind: ServiceAccount
        name: coredns
        namespace: kube-system

1.3.2 Create the ConfigMap

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: coredns
      namespace: kube-system
      labels:
        io.kubernetes.plugin: kubernetes
    data:
      Corefile: |
        .:53 {
            errors
            health {
                lameduck 5s
            }
            ready
            kubernetes cluster.local in-addr.arpa ip6.arpa {
                pods insecure
                fallthrough in-addr.arpa ip6.arpa
                ttl 30
            }
            prometheus :9153
            forward . /etc/resolv.conf {
                max_concurrent 1000
            }
            cache 30
            loop
            reload
            loadbalance
        }

Configuration walkthrough:

    .:53 {
        # Log errors
        errors

        # Health endpoint (:8080); lameduck delays shutdown by 5 s for a graceful exit
        health {
            lameduck 5s
        }

        # Readiness endpoint (:8181)
        ready

        # K8s service discovery for the cluster.local domain and reverse zones
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure                      # answer Pod IP-style names without checking that the Pod exists
            fallthrough in-addr.arpa ip6.arpa  # pass unresolved reverse lookups to the next plugin
            ttl 30                             # TTL of 30 s on responses
        }

        # Prometheus metrics
        prometheus :9153

        # Forward external domains to the resolvers in the node's resolv.conf
        forward . /etc/resolv.conf {
            max_concurrent 1000                # at most 1000 concurrent queries
        }

        # Cache responses for up to 30 s
        cache 30

        # Detect forwarding loops
        loop

        # Hot-reload the Corefile when it changes
        reload

        # Rotate the order of records in answers (multiple Endpoints)
        loadbalance
    }

1.3.3 Create the Deployment

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: coredns
      namespace: kube-system
      labels:
        k8s-app: kube-dns
        kubernetes.io/name: CoreDNS
    spec:
      replicas: 2
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 1
      selector:
        matchLabels:
          k8s-app: kube-dns
      template:
        metadata:
          labels:
            k8s-app: kube-dns
        spec:
          serviceAccountName: coredns
          priorityClassName: system-cluster-critical
          tolerations:
            - key: CriticalAddonsOnly
              operator: Exists
          containers:
            - name: coredns
              image: registry.cn-hangzhou.aliyuncs.com/google_containers/coredns:1.10.1
              imagePullPolicy: IfNotPresent
              resources:
                limits:
                  memory: 170Mi
                requests:
                  cpu: 100m
                  memory: 70Mi
              args: ["-conf", "/etc/coredns/Corefile"]
              volumeMounts:
                - name: config-volume
                  mountPath: /etc/coredns
                  readOnly: true
              ports:
                - containerPort: 53
                  name: dns
                  protocol: UDP
                - containerPort: 53
                  name: dns-tcp
                  protocol: TCP
                - containerPort: 9153
                  name: metrics
                  protocol: TCP
              livenessProbe:
                httpGet:
                  path: /health
                  port: 8080
                  scheme: HTTP
                initialDelaySeconds: 60
                timeoutSeconds: 5
                successThreshold: 1
                failureThreshold: 5
              readinessProbe:
                httpGet:
                  path: /ready
                  port: 8181
                  scheme: HTTP
              securityContext:
                allowPrivilegeEscalation: false
                capabilities:
                  add:
                    - NET_BIND_SERVICE
                  drop:
                    - ALL
                readOnlyRootFilesystem: true
          dnsPolicy: Default
          volumes:
            - name: config-volume
              configMap:
                name: coredns
                items:
                  - key: Corefile
                    path: Corefile

Key settings:

    replicas: 2                                  # high availability: at least 2 replicas
    priorityClassName: system-cluster-critical   # high scheduling priority
    tolerations:                                 # tolerates the CriticalAddonsOnly taint only
      - key: CriticalAddonsOnly
        operator: Exists
    resources:                                   # resource requests and limits
      limits:
        memory: 170Mi
      requests:
        cpu: 100m
        memory: 70Mi
    livenessProbe:                               # liveness probe
      httpGet:
        path: /health
        port: 8080
    readinessProbe:                              # readiness probe
      httpGet:
        path: /ready
        port: 8181

1.3.4 Create the Service

    apiVersion: v1
    kind: Service
    metadata:
      name: kube-dns
      namespace: kube-system
      annotations:
        prometheus.io/port: "9153"
        prometheus.io/scrape: "true"
      labels:
        k8s-app: kube-dns
        kubernetes.io/cluster-service: "true"
        kubernetes.io/name: CoreDNS
    spec:
      selector:
        k8s-app: kube-dns
      clusterIP: 10.96.0.10        # fixed DNS Service IP
      ports:
        - name: dns
          port: 53
          protocol: UDP
        - name: dns-tcp
          port: 53
          protocol: TCP
        - name: metrics
          port: 9153
          protocol: TCP

1.3.5 Deploy CoreDNS

    # Apply the manifests
    kubectl apply -f coredns-sa.yaml
    kubectl apply -f coredns-configmap.yaml
    kubectl apply -f coredns-deployment.yaml
    kubectl apply -f coredns-service.yaml

    # Verify the deployment
    kubectl get pods -n kube-system -l k8s-app=kube-dns
    # Output:
    # NAME                       READY   STATUS    RESTARTS   AGE
    # coredns-5d5f4d6c5f-abc12   1/1     Running   0          2m
    # coredns-5d5f4d6c5f-def34   1/1     Running   0          2m

    # Inspect the Service
    kubectl get svc -n kube-system kube-dns
    # Output:
    # NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
    # kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   2m

1.4 Performance Tuning

1.4.1 Cache tuning

    # Corefile
    cache 30 {
        success 9984 30   # cache up to 9984 successful responses, max TTL 30 s
        denial 9984 30    # cache up to 9984 negative responses, max TTL 30 s
    }

1.4.2 Concurrency tuning

    # Raise the forwarding concurrency limit
    forward . /etc/resolv.conf {
        max_concurrent 2000   # at most 2000 concurrent queries
    }
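
To judge whether cache settings like the ones above are paying off, one option is to compare cache hits with misses from the CoreDNS metrics endpoint. The sketch below is an illustration, not part of the original deployment: `cache_hit_ratio` is a hypothetical helper name, and it assumes the metrics text contains `coredns_cache_hits_total` and `coredns_cache_misses_total` samples with the value as the last field on each line.

```shell
#!/usr/bin/env bash
# cache_hit_ratio (hypothetical helper): reads Prometheus text-format metrics
# on stdin and prints hits / (hits + misses) as a percentage.
cache_hit_ratio() {
  awk '
    # Sample lines start with the metric name; "# HELP"/"# TYPE" lines do not match.
    /^coredns_cache_hits_total/   { hits   += $NF }
    /^coredns_cache_misses_total/ { misses += $NF }
    END {
      total = hits + misses
      if (total == 0) { print "n/a"; exit }
      printf "%.1f%%\n", 100 * hits / total
    }'
}

# Demo with made-up sample values (not real measurements)
cache_hit_ratio <<'EOF'
# TYPE coredns_cache_hits_total counter
coredns_cache_hits_total{server="dns://:53",type="success"} 70
coredns_cache_hits_total{server="dns://:53",type="denial"} 5
coredns_cache_misses_total{server="dns://:53"} 25
EOF
```

Against a live cluster, the same function could be fed from the metrics port configured by `prometheus :9153`, for example `curl -s http://<coredns-pod-ip>:9153/metrics | cache_hit_ratio`, where the Pod IP is a placeholder. A persistently low ratio suggests the TTL or capacity values deserve another look.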

2. Cluster Availability Verification

2.1 Component Health Checks

2.1.1 Check the control-plane components

    # Component status (componentstatuses has been deprecated since v1.19)
    kubectl get componentstatuses
    # Output:
    # NAME                 STATUS    MESSAGE              ERROR
    # controller-manager   Healthy   ok
    # scheduler            Healthy   ok
    # etcd-0               Healthy   {"health":"true"}
    # etcd-1               Healthy   {"health":"true"}
    # etcd-2               Healthy   {"health":"true"}

    # API server health
    curl -k https://192.168.1.100:6443/healthz
    # Output: ok

    # etcd health
    etcdctl endpoint health \
      --endpoints=https://192.168.1.10:2379,https://192.168.1.11:2379,https://192.168.1.12:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.pem \
      --cert=/etc/kubernetes/pki/etcd/server.pem \
      --key=/etc/kubernetes/pki/etcd/server-key.pem

2.1.2 Check node status

    # List all nodes
    kubectl get nodes
    # Output:
    # NAME        STATUS   ROLES           AGE   VERSION
    # master-01   Ready    control-plane   10d   v1.28.0
    # master-02   Ready    control-plane   10d   v1.28.0
    # master-03   Ready    control-plane   10d   v1.28.0
    # worker-01   Ready    <none>          10d   v1.28.0
    # worker-02   Ready    <none>          10d   v1.28.0
    # worker-03   Ready    <none>          10d   v1.28.0

    # Node details
    kubectl describe node worker-01

    # Node resource usage (requires metrics-server)
    kubectl top nodes

2.2 Network Connectivity Tests

2.2.1 Cross-node Pod communication

    # Create test Pods
    kubectl run test-pod-1 --image=busybox --command -- sleep 3600
    kubectl run test-pod-2 --image=busybox --command -- sleep 3600

    # Get the Pod IPs (check the NODE column to confirm the Pods run on different nodes)
    kubectl get pods -o wide

    # Test cross-node communication
    kubectl exec test-pod-1 -- ping -c 3 <test-pod-2-ip>
    # Output:
    # 3 packets transmitted, 3 received, 0% packet loss

2.2.2 Service access test

    # Create a test Service (assumes a Pod named nginx already exists)
    kubectl expose pod nginx --port=80 --name=nginx-test

    # Get the ClusterIP
    kubectl get svc nginx-test

    # Test from inside a Pod
    kubectl run busybox --image=busybox --rm -it --restart=Never -- \
      wget -O- http://nginx-test.default.svc.cluster.local
    # Output:
    # Connecting to nginx-test.default.svc.cluster.local (10.96.100.10:80)
    # writing to stdout

2.2.3 DNS resolution test

    # Resolve an in-cluster name
    kubectl run busybox --image=busybox --rm -it --restart=Never -- \
      nslookup kubernetes.default
    # Output:
    # Server:    10.96.0.10
    # Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
    #
    # Name:      kubernetes.default
    # Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local

    # Resolve an external domain
    kubectl run busybox --image=busybox --rm -it --restart=Never -- \
      nslookup www.baidu.com
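
The names used in the tests above, such as `nginx-test.default.svc.cluster.local` and `kube-dns.kube-system.svc.cluster.local`, follow the pattern `<service>.<namespace>.svc.<cluster domain>`. As a small sketch of that pattern, the hypothetical helper `svc_fqdn` below builds the fully qualified name; the defaults `default` and `cluster.local` are assumptions taken from the examples in this article.

```shell
#!/usr/bin/env bash
# svc_fqdn SERVICE [NAMESPACE] [CLUSTER_DOMAIN]
# Hypothetical helper: prints the fully qualified DNS name of a Service.
svc_fqdn() {
  local service="$1"
  local namespace="${2:-default}"       # assumed default namespace
  local domain="${3:-cluster.local}"    # assumed cluster domain
  if [ -z "$service" ]; then
    echo "usage: svc_fqdn SERVICE [NAMESPACE] [CLUSTER_DOMAIN]" >&2
    return 1
  fi
  printf '%s.%s.svc.%s\n' "$service" "$namespace" "$domain"
}

svc_fqdn nginx-test             # prints nginx-test.default.svc.cluster.local
svc_fqdn kube-dns kube-system   # prints kube-dns.kube-system.svc.cluster.local
```

Passing the result to `nslookup` inside a test Pod avoids relying on the Pod's DNS search list, which can help when a short name resolves in one namespace but not in another.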

2.3 High-Availability Verification

2.3.1 Simulate a master node failure

    # Stop the API server on one master
    ssh master-02 'systemctl stop kube-apiserver'

    # Verify that the cluster is still usable
    kubectl get nodes
    kubectl get pods -A

    # Check the API server response time
    time kubectl get pods
    # Expected: a normal response time (under 1 second)

    # Restore the node
    ssh master-02 'systemctl start kube-apiserver'

2.3.2 Simulate a CoreDNS failure

    # Delete one CoreDNS Pod (a bare label selector would delete all of them)
    kubectl delete -n kube-system \
      "$(kubectl get pods -n kube-system -l k8s-app=kube-dns -o name | head -n 1)"

    # Verify that DNS resolution is unaffected
    kubectl run busybox --image=busybox --rm -it --restart=Never -- \
      nslookup kubernetes.default

3. Node Management Lifecycle

3.1 Adding a Worker Node

3.1.1 Prepare the new node

    # Run on the new node

    # 1. Kernel parameters
    cat > /etc/sysctl.d/99-kubernetes.conf <<EOF
    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.ip_forward = 1
    fs.file-max = 2097152
    EOF
    sysctl --system

    # 2. Disable swap
    swapoff -a
    sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab

    # 3. Install Docker
    # See document 5 of this series

    # 4. Install kubelet, kubeadm, and kubectl
    wget https://dl.k8s.io/release/v1.28.0/bin/linux/amd64/kubelet
    wget https://dl.k8s.io/release/v1.28.0/bin/linux/amd64/kubeadm
    wget https://dl.k8s.io/release/v1.28.0/bin/linux/amd64/kubectl
    chmod +x kubelet kubeadm kubectl
    mv kubelet kubeadm kubectl /usr/local/bin/

3.1.2 Generate a join token

    # Run on a master node

    # Create a token and print the full join command
    kubeadm token create --print-join-command
    # Output:
    # kubeadm join 192.168.1.100:6443 --token abcdef.0123456789abcdef \
    #   --discovery-token-ca-cert-hash sha256:1234567890abcdef...

    # If the token has expired, create a new one
    kubeadm token create

    # Compute the CA certificate hash
    openssl x509 -pubkey -in /etc/kubernetes/pki/ca.pem | \
      openssl rsa -pubin -outform der 2>/dev/null | \
      openssl dgst -sha256 -hex | sed 's/^.* //'

3.1.3 Join the cluster

    # Run the join command on the new node
    kubeadm join 192.168.1.100:6443 \
      --token abcdef.0123456789abcdef \
      --discovery-token-ca-cert-hash sha256:1234567890abcdef...
    # Output:
    # This node has joined the cluster:
    # * Certificate signing request was sent to apiserver and a response was received.
    # * The Kubelet was informed of the new secure connection details.

    # Verify on a master node; the new node should show as Ready
    kubectl get nodes

3.2 Scaling Nodes

3.2.1 Scale out

    # Batch-join script
    for i in {4..10}; do
      ssh "worker-0$i" "kubeadm join 192.168.1.100:6443 \
        --token abcdef.0123456789abcdef \
        --discovery-token-ca-cert-hash sha256:1234567890abcdef..."
    done

    # Verify
    kubectl get nodes

3.2.2 Scale in

    # 1. Evict the Pods running on the node
    kubectl drain worker-010 \
      --ignore-daemonsets \
      --delete-emptydir-data \
      --force

    # 2. Delete the node
    kubectl delete node worker-010

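    # 3. Reset kubeadm on the removed node
    kubeadm reset -f

    # 4. Clean up leftover configuration
    rm -rf /etc/kubernetes/
    rm -rf /var/lib/kubelet/
    rm -rf /var/lib/etcd/

3.3 Node Maintenance

3.3.1 Mark a node unschedulable

    # Mark the node unschedulable (SchedulingDisabled)
    kubectl cordon worker-01

    # Check the node status
    kubectl get nodes
    # Output:
    # NAME        STATUS                     ROLES    AGE   VERSION
    # worker-01   Ready,SchedulingDisabled   <none>   10d   v1.28.0

    # Make the node schedulable again
    kubectl uncordon worker-01

3.3.2 Maintenance mode

    # Evict Pods and mark the node for maintenance
    kubectl drain worker-01 \
      --ignore-daemonsets \
      --delete-emptydir-data \
      --force

    # Perform the maintenance, e.g. a kernel upgrade or hardware replacement
    ssh worker-01 'apt-get update && apt-get upgrade -y'

    # Return the node to service
    kubectl uncordon worker-01

3.4 Node Failure Recovery

3.4.1 Node in NotReady state

    # Check the node status
    kubectl get nodes
    # Output:
    # NAME        STATUS     ROLES    AGE   VERSION
    # worker-01   NotReady   <none>   10d   v1.28.0

    # Troubleshooting steps
    # 1. Node details
    kubectl describe node worker-01
    # 2. kubelet status
    ssh worker-01 'systemctl status kubelet'
    # 3. kubelet logs
    ssh worker-01 'journalctl -u kubelet -f'
    # 4. Network connectivity to the API server address
    ssh worker-01 'ping -c 3 192.168.1.100'
    # 5. Certificates
    ssh worker-01 'ls -la /var/lib/kubelet/pki/'

    # Fixes
    # 1. Restart kubelet
    ssh worker-01 'systemctl restart kubelet'
    # 2. If the certificates have expired, rejoin the cluster
    ssh worker-01 'kubeadm reset -f'
    # then run kubeadm join again

3.4.2 Force-delete a failed node

    # If the node is permanently lost
    kubectl delete node worker-01 --force --grace-period=0

    # Only if the Node object still lingers: remove its key from etcd directly
    # (this bypasses the API server, so use it with care)
    etcdctl del /registry/minions/worker-01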
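
The troubleshooting flow above starts from spotting nodes whose STATUS is not Ready. A minimal sketch of automating that first step follows; `not_ready_nodes` is a hypothetical helper that assumes the column layout of `kubectl get nodes --no-headers` (name first, status second). Cordoned nodes reported as `Ready,SchedulingDisabled` are treated as healthy.

```shell
#!/usr/bin/env bash
# not_ready_nodes (hypothetical helper): reads `kubectl get nodes --no-headers`
# style output on stdin and prints the names of nodes whose STATUS column
# does not start with "Ready".
not_ready_nodes() {
  awk 'NF >= 2 && $2 !~ /^Ready/ { print $1 }'
}

# Demo with sample rows in the format shown in this article
not_ready_nodes <<'EOF'
master-01   Ready                      control-plane   10d   v1.28.0
worker-01   NotReady                   <none>          10d   v1.28.0
worker-02   Ready,SchedulingDisabled   <none>          10d   v1.28.0
EOF
# prints: worker-01
```

On a live cluster the input would come from `kubectl get nodes --no-headers | not_ready_nodes`, and each printed name could then be passed to `kubectl describe node` as in step 1 above.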

4. Production Best Practices

4.1 Monitoring and Alerting

4.1.1 Prometheus metrics

    # CoreDNS
    coredns_cache_hits_total
    coredns_cache_misses_total
    coredns_dns_request_duration_seconds
    coredns_dns_requests_total

    # Nodes
    node_cpu_seconds_total
    node_memory_MemAvailable_bytes
    node_filesystem_avail_bytes
    node_network_receive_bytes_total

    # Cluster
    kube_node_status_condition
    kube_pod_status_phase
    kube_deployment_status_replicas_available

4.1.2 Alert rules

    groups:
      - name: kubernetes-alerts
        rules:
          - alert: CoreDNSDown
            expr: up{job="kube-dns"} == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "CoreDNS instance is down"
          - alert: NodeNotReady
            expr: kube_node_status_condition{condition="Ready",status="true"} == 0
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Node {{ $labels.node }} is not ready"
          - alert: HighNodeCPU
            expr: 100 - (avg by (node) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Node {{ $labels.node }} CPU usage is too high"

4.2 Backup Strategy

4.2.1 Regular etcd backups

    #!/bin/bash
    # /opt/backup/etcd-backup.sh
    BACKUP_DIR=/opt/backup/etcd
    DATE=$(date +%Y%m%d-%H%M%S)

    # Create the snapshot
    etcdctl snapshot save "${BACKUP_DIR}/snapshot-${DATE}.db" \
      --cacert=/etc/kubernetes/pki/etcd/ca.pem \
      --cert=/etc/kubernetes/pki/etcd/server.pem \
      --key=/etc/kubernetes/pki/etcd/server-key.pem

    # Delete backups older than 7 days
    find "${BACKUP_DIR}" -name 'snapshot-*.db' -mtime +7 -delete

    # Upload to object storage (optional)
    aws s3 cp "${BACKUP_DIR}/snapshot-${DATE}.db" s3://backup-bucket/etcd/

4.2.2 Configuration file backups

    # Back up the Kubernetes configuration
    tar -czf "/opt/backup/k8s-config-$(date +%Y%m%d).tar.gz" \
      /etc/kubernetes/ \
      /var/lib/kubelet/ \
      /etc/cni/ \
      /opt/cni/

5. Summary

This article covered CoreDNS deployment, cluster availability verification, and the full lifecycle of node management in a Kubernetes cluster, including:

- CoreDNS architecture and production deployment
- The DNS resolution flow and performance tuning
- Cluster health checks and high-availability verification
- Node scaling and maintenance operations
- Monitoring, alerting, and backup strategy

Mastering these operational techniques is key to keeping a K8s cluster running reliably.

Copyright notice: this is an original technical article; please include a link to it when reposting.
