A record of handling a Ceph pg unfound problem
While checking the Ceph cluster today, I found a PG with an unfound object, hence this post.
1. Check the cluster status
- [root@k8snode001 ~]# ceph health detail
 - HEALTH_ERR 1/973013 objects unfound (0.000%); 17 scrub errors; Possible data damage: 1 pg recovery_unfound, 8 pgs inconsistent, 1 pg repair; Degraded data redundancy: 1/2919039 objects degraded (0.000%), 1 pg degraded
 - OBJECT_UNFOUND 1/973013 objects unfound (0.000%)
 - pg 2.2b has 1 unfound objects
 - OSD_SCRUB_ERRORS 17 scrub errors
 - PG_DAMAGED Possible data damage: 1 pg recovery_unfound, 8 pgs inconsistent, 1 pg repair
 - pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound
 - pg 2.44 is active+clean+inconsistent, acting [14,8,21]
 - pg 2.73 is active+clean+inconsistent, acting [25,14,8]
 - pg 2.80 is active+clean+scrubbing+deep+inconsistent+repair, acting [4,8,14]
 - pg 2.83 is active+clean+inconsistent, acting [14,13,6]
 - pg 2.ae is active+clean+inconsistent, acting [14,3,2]
 - pg 2.c4 is active+clean+inconsistent, acting [8,21,14]
 - pg 2.da is active+clean+inconsistent, acting [23,14,15]
 - pg 2.fa is active+clean+inconsistent, acting [14,23,25]
 - PG_DEGRADED Degraded data redundancy: 1/2919039 objects degraded (0.000%), 1 pg degraded
 - pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound
 
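Besides the one unfound object, the health output also reports 17 scrub errors and 8 PGs in active+clean+inconsistent. As a side note (these commands were not part of the original session), the objects behind those scrub errors can be listed per PG with rados, e.g. for pg 2.44:
- # list the objects that the last scrub flagged as inconsistent in pg 2.44
- rados list-inconsistent-obj 2.44 --format=json-pretty

The rest of this post focuses on the unfound object in pg 2.2b.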
From the output we can see: pg 2.2b is active+recovery_unfound+degraded, acting [14,22,4], 1 unfound.
Now let's take a closer look at pg 2.2b and check its detailed information.
- [root@k8snode001 ~]# ceph pg dump_json pools |grep 2.2b
 - dumped all
 - 2.2b 2487 1 1 0 1 9533198403 3048 3048 active+recovery_unfound+degraded 2020-07-23 08:56:07.669903 10373'5448370 10373:7312614 [14,22,4] 14 [14,22,4] 14 10371'5437258 2020-07-23 08:56:06.637012 10371'5437258 2020-07-23 08:56:06.637012 0
 
From the stats we can see that this PG currently has one unfound object.
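To see exactly which object is unfound and which peers have already been probed for it, the PG's unfound list can be dumped (shown here as a suggestion; on older Ceph releases the same data is exposed as list_missing):
- # show the unfound object(s) in pg 2.2b and the OSDs already queried for them
- ceph pg 2.2b list_unfound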
2. Check the pg map
- [root@k8snode001 ~]# ceph pg map 2.2b
 - osdmap e10373 pg 2.2b (2.2b) -> up [14,22,4] acting [14,22,4]
 
From the pg map we can see that pg 2.2b is mapped to OSDs [14,22,4].
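If you also need to know which hosts those OSDs live on (for example to check the disks or the OSD logs), Ceph can report the location of an OSD by id; an illustrative check, not part of the original run:
- # print the host and CRUSH location of osd.14, the primary of pg 2.2b
- ceph osd find 14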
3. Check the pool status
- [root@k8snode001 ~]# ceph osd pool stats k8s-1
 - pool k8s-1 id 2
 - 1/1955664 objects degraded (0.000%)
 - 1/651888 objects unfound (0.000%)
 - client io 271 KiB/s wr, 0 op/s rd, 52 op/s wr
 - [root@k8snode001 ~]# ceph osd pool ls detail|grep k8s-1
 - pool 2 'k8s-1' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 88 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
 
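Note that the pool runs with size 3 but min_size 1, so I/O is still accepted when only a single replica is available, which can make unfound objects more likely after OSD failures. As a general hardening step (not something done in this incident), min_size can be checked and raised:
- # check and, if appropriate, raise the minimum number of replicas required for I/O
- ceph osd pool get k8s-1 min_size
- ceph osd pool set k8s-1 min_size 2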
4. Try to recover the lost object in pg 2.2b
- [root@k8snode001 ~]# ceph pg repair 2.2b
 
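Note that ceph pg repair only asks the primary OSD to schedule a repair and returns immediately; the actual work runs in the background. To watch whether the PG's state changes, something like the following can be used (illustrative, not from the original session):
- # follow cluster events while the repair/recovery runs
- ceph -w
- # or re-check the PG's status periodically
- ceph health detail | grep 2.2b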
If the repair keeps failing, you can query the stuck PG for detailed information, focusing mainly on recovery_state. The command is as follows:
- [root@k8snode001 ~]# ceph pg 2.2b query
 - {
 - "......
 - "recovery_state": [
 - {
 - "name": "Started/Primary/Active",
 - "enter_time": "2020-07-21 14:17:05.855923",
 - "might_have_unfound": [],
 - "recovery_progress": {
 - "backfill_targets": [],
 - "waiting_on_backfill": [],
 - "last_backfill_started": "MIN",
 - "backfill_info": {
 - "begin": "MIN",
 - "end": "MIN",
 - "objects": []
 - },
 - "peer_backfill_info": [],
 - "backfills_in_flight": [],
 - "recovering": [],
 - "pg_backend": {
 - "pull_from_peer": [],
 - "pushing": []
 - }
 - },
 - "scrub": {
 - "scrubber.epoch_start": "10370",
 - "scrubber.active": false,
 - "scrubber.state": "INACTIVE",
 - "scrubber.start": "MIN",
 - "scrubber.end": "MIN",
 - "scrubber.max_end": "MIN",
 - "scrubber.subset_last_update": "0'0",
 - "scrubber.deep": false,
 - "scrubber.waiting_on_whom": []
 - }
 - },
 - {
 - "name": "Started",
 - "enter_time": "2020-07-21 14:17:04.814061"
 - }
 - ],
 - "agent_state": {}
 - }
 
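In this output might_have_unfound is empty, meaning the primary has no remaining peers it could still query for the missing object; if it listed entries with a status such as "osd is down", bringing those OSDs back up might let recovery locate the object. A quick way to check whether any OSDs are currently down (an illustrative check, not from the original session):
- # look for down OSDs that might still hold a copy of the object
- ceph osd stat
- ceph osd tree | grep -i down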
If repair cannot fix it, there are two options: revert the unfound object to an older version, or delete it outright. revert rolls the object back to its previous version (or forgets about it entirely if it was a newly created object), while delete forgets about the unfound object completely.
5. Solutions
- Revert to the previous version
 - [root@k8snode001 ~]# ceph pg 2.2b mark_unfound_lost revert
 - Delete outright
 - [root@k8snode001 ~]# ceph pg 2.2b mark_unfound_lost delete
 
6. Verification
In my case I went with delete. The cluster then recovered the PG; after waiting a while and checking again, the PG state had changed to active+clean.
- [root@k8snode001 ~]# ceph pg 2.2b query
 - {
 - "state": "active+clean",
 - "snap_trimq": "[]",
 - "snap_trimq_len": 0,
 - "epoch": 11069,
 - "up": [
 - 12,
 - 22,
 - 4
 - ],
 
Check the cluster status again:
- [root@k8snode001 ~]# ceph health detail
 - HEALTH_OK
 
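For completeness, you can also confirm that all PGs are back to active+clean and that the earlier scrub errors are gone, for example:
- # summary of PG states and overall cluster health
- ceph pg stat
- ceph -s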