Linkerd Canary Deployments and A/B Testing
This guide shows you how to use Linkerd and Flagger to automate canary deployments and A/B testing.
[Diagram: Flagger Linkerd Traffic Split]
Prerequisites
Flagger requires a Kubernetes cluster v1.16 or newer and Linkerd 2.10 or newer.
Install Linkerd and Prometheus (part of Linkerd Viz):
```bash
linkerd install | kubectl apply -f -
linkerd viz install | kubectl apply -f -
```
Install Flagger in the linkerd namespace:
```bash
kubectl apply -k github.com/fluxcd/flagger//kustomize/linkerd
```
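Before moving on, you may want to confirm that the Flagger controller came up. A minimal sanity check, assuming the kustomize overlay installs the controller as a deployment named flagger:

```bash
# Wait until the Flagger controller reports a successful rollout
kubectl -n linkerd rollout status deployment/flagger
```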
Bootstrap
Flagger takes a Kubernetes deployment and optionally a horizontal pod autoscaler (HPA), then creates a series of objects (Kubernetes deployments, ClusterIP services and SMI traffic splits). These objects expose the application inside the mesh and drive the canary analysis and promotion.
Create a test namespace and enable Linkerd proxy injection:
```bash
kubectl create ns test
kubectl annotate namespace test linkerd.io/inject=enabled
```
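To double-check that the annotation landed (a quick verification step, not part of the original setup):

```bash
# Should print "enabled" if proxy injection is turned on for the namespace
kubectl get namespace test -o jsonpath='{.metadata.annotations.linkerd\.io/inject}'
```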
Install the load testing service to generate traffic during the canary analysis:
```bash
kubectl apply -k https://github.com/fluxcd/flagger//kustomize/tester?ref=main
```
Create a deployment and a horizontal pod autoscaler:
```bash
kubectl apply -k https://github.com/fluxcd/flagger//kustomize/podinfo?ref=main
```
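Optionally, wait for podinfo to become ready before creating the canary (a small convenience step, not in the original guide):

```bash
# Blocks until the podinfo deployment finishes rolling out
kubectl -n test rollout status deployment/podinfo
```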
Create a canary custom resource for the podinfo deployment:
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  # HPA reference (optional)
  autoscalerRef:
    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    name: podinfo
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rolled back (default 600s)
  progressDeadlineSeconds: 60
  service:
    # ClusterIP port number
    port: 9898
    # container port number or name (optional)
    targetPort: 9898
  analysis:
    # schedule interval (default 60s)
    interval: 30s
    # max number of failed metric checks before rollback
    threshold: 5
    # max traffic percentage routed to canary
    # percentage (0-100)
    maxWeight: 50
    # canary increment step
    # percentage (0-100)
    stepWeight: 5
    # Linkerd Prometheus checks
    metrics:
      - name: request-success-rate
        # minimum req success rate (non 5xx responses)
        # percentage (0-100)
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        # maximum req duration P99
        # milliseconds
        thresholdRange:
          max: 500
        interval: 30s
    # testing (optional)
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://podinfo-canary.test:9898/token | grep token"
      - name: load-test
        type: rollout
        url: http://flagger-loadtester.test/
        metadata:
          cmd: "hey -z 2m -q 10 -c 2 http://podinfo-canary.test:9898/"
```
Save the above resource as podinfo-canary.yaml and then apply it:
```bash
kubectl apply -f ./podinfo-canary.yaml
```
When the canary analysis starts, Flagger will call the pre-rollout webhooks before routing traffic to the canary. The canary analysis runs for five minutes while validating the HTTP metrics and rollout hooks every half a minute (with interval: 30s, stepWeight: 5 and maxWeight: 50, the weight advances in ten half-minute steps).
After a couple of seconds Flagger will create the canary objects:
```text
# applied
deployment.apps/podinfo
horizontalpodautoscaler.autoscaling/podinfo
ingresses.extensions/podinfo
canary.flagger.app/podinfo

# generated
deployment.apps/podinfo-primary
horizontalpodautoscaler.autoscaling/podinfo-primary
service/podinfo
service/podinfo-canary
service/podinfo-primary
trafficsplits.split.smi-spec.io/podinfo
```
After the bootstrap, the podinfo deployment will be scaled to zero and the traffic to podinfo.test will be routed to the primary pods. During the canary analysis, the podinfo-canary.test address can be used to target the canary pods directly.
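You can watch Flagger shifting weights during an analysis by inspecting the generated SMI TrafficSplit, which is named after the canary (as listed above):

```bash
# The backends section shows the current primary/canary weight distribution
kubectl -n test get trafficsplit podinfo -o yaml
```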
Automated canary promotion
Flagger implements a control loop that gradually shifts traffic to the canary while measuring key performance indicators like HTTP request success rate, average request duration and pod health. Based on the analysis of these KPIs, the canary is promoted or aborted, and the analysis result is published to Slack.
[Diagram: Flagger canary stages]
Trigger a canary deployment by updating the container image:
```bash
kubectl -n test set image deployment/podinfo \
  podinfod=stefanprodan/podinfo:3.1.1
```
Flagger detects that the deployment revision changed and starts a new rollout:
```bash
kubectl -n test describe canary/podinfo
```

```text
Status:
  Canary Weight:  0
  Failed Checks:  0
  Phase:          Succeeded
Events:
  New revision detected! Scaling up podinfo.test
  Waiting for podinfo.test rollout to finish: 0 of 1 updated replicas are available
  Pre-rollout check acceptance-test passed
  Advance podinfo.test canary weight 5
  Advance podinfo.test canary weight 10
  Advance podinfo.test canary weight 15
  Advance podinfo.test canary weight 20
  Advance podinfo.test canary weight 25
  Waiting for podinfo.test rollout to finish: 1 of 2 updated replicas are available
  Advance podinfo.test canary weight 30
  Advance podinfo.test canary weight 35
  Advance podinfo.test canary weight 40
  Advance podinfo.test canary weight 45
  Advance podinfo.test canary weight 50
  Copying podinfo.test template spec to podinfo-primary.test
  Waiting for podinfo-primary.test rollout to finish: 1 of 2 updated replicas are available
  Promotion completed! Scaling down podinfo.test
```
Note that if you apply new changes to the deployment during the canary analysis, Flagger will restart the analysis.
A canary deployment is triggered by changes in any of the following objects:
- Deployment PodSpec (container image, command, ports, env, resources, etc.)
- ConfigMaps mounted as volumes or mapped to environment variables (see the sketch after this list)
- Secrets mounted as volumes or mapped to environment variables
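For example, editing a tracked ConfigMap alone is enough to kick off a new analysis. A hypothetical sketch, assuming podinfo mounted a ConfigMap named podinfo-config (no such ConfigMap ships with this guide's manifests):

```bash
# Changing a ConfigMap that the deployment references triggers a new canary run.
# "podinfo-config" is a hypothetical name used purely for illustration.
kubectl -n test create configmap podinfo-config \
  --from-literal=color=blue \
  --dry-run=client -o yaml | kubectl apply -f -
```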
You can monitor all canaries with:
```bash
watch kubectl get canaries --all-namespaces
```

```text
NAMESPACE   NAME       STATUS        WEIGHT   LASTTRANSITIONTIME
test        podinfo    Progressing   15       2019-06-30T14:05:07Z
prod        frontend   Succeeded     0        2019-06-30T16:15:07Z
prod        backend    Failed        0        2019-06-30T17:05:07Z
```
Automated rollback
During the canary analysis you can generate HTTP 500 errors and high latency to test if Flagger pauses and rolls back the faulted version.
Trigger another canary deployment:
```bash
kubectl -n test set image deployment/podinfo \
  podinfod=stefanprodan/podinfo:3.1.2
```
Exec into the load tester pod with:
```bash
kubectl -n test exec -it flagger-loadtester-xx-xx sh
```
Generate HTTP 500 errors:
```bash
watch -n 1 curl http://podinfo-canary.test:9898/status/500
```
Generate latency:
```bash
watch -n 1 curl http://podinfo-canary.test:9898/delay/1
```
When the number of failed checks reaches the canary analysis threshold, the traffic is routed back to the primary, the canary is scaled to zero and the rollout is marked as failed.
```bash
kubectl -n test describe canary/podinfo
```

```text
Status:
  Canary Weight:  0
  Failed Checks:  10
  Phase:          Failed
Events:
  Starting canary analysis for podinfo.test
  Pre-rollout check acceptance-test passed
  Advance podinfo.test canary weight 5
  Advance podinfo.test canary weight 10
  Advance podinfo.test canary weight 15
  Halt podinfo.test advancement success rate 69.17% < 99%
  Halt podinfo.test advancement success rate 61.39% < 99%
  Halt podinfo.test advancement success rate 55.06% < 99%
  Halt podinfo.test advancement request duration 1.20s > 0.5s
  Halt podinfo.test advancement request duration 1.45s > 0.5s
  Rolling back podinfo.test failed checks threshold reached 5
  Canary failed! Scaling down podinfo.test
```
Custom metrics
The canary analysis can be extended with Prometheus queries.
Let's define a check for not found errors. Edit the canary analysis and add the following metric:
```yaml
  analysis:
    metrics:
      - name: "404s percentage"
        threshold: 3
        query: |
          100 - sum(
              rate(
                  response_total{
                      namespace="test",
                      deployment="podinfo",
                      status_code!="404",
                      direction="inbound"
                  }[1m]
              )
          )
          /
          sum(
              rate(
                  response_total{
                      namespace="test",
                      deployment="podinfo",
                      direction="inbound"
                  }[1m]
              )
          )
          * 100
```
The above configuration validates the canary version by checking whether the HTTP 404 req/sec percentage stays below 3% of the total traffic. If the 404s rate reaches the 3% threshold, the analysis is aborted and the canary is marked as failed.
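Before wiring a query into the analysis, you can sanity-check it against the Linkerd Prometheus directly. A sketch, assuming the Linkerd Viz Prometheus is reachable as svc/prometheus in the linkerd-viz namespace:

```bash
# Port-forward the Linkerd Viz Prometheus and run the denominator of the
# 404s-percentage query to confirm it returns data
kubectl -n linkerd-viz port-forward svc/prometheus 9090 &
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(response_total{namespace="test", deployment="podinfo", direction="inbound"}[1m]))'
```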
Trigger a canary deployment by updating the container image:
```bash
kubectl -n test set image deployment/podinfo \
  podinfod=stefanprodan/podinfo:3.1.3
```
Generate 404s:
```bash
watch -n 1 curl http://podinfo-canary:9898/status/404
```
Watch the Flagger logs:
```bash
kubectl -n linkerd logs deployment/flagger -f | jq .msg
```

```text
Starting canary deployment for podinfo.test
Pre-rollout check acceptance-test passed
Advance podinfo.test canary weight 5
Halt podinfo.test advancement 404s percentage 6.20 > 3
Halt podinfo.test advancement 404s percentage 6.45 > 3
Halt podinfo.test advancement 404s percentage 7.22 > 3
Halt podinfo.test advancement 404s percentage 6.50 > 3
Halt podinfo.test advancement 404s percentage 6.34 > 3
Rolling back podinfo.test failed checks threshold reached 5
Canary failed! Scaling down podinfo.test
```
If you have Slack configured, Flagger will send a notification with the reason why the canary failed.
Linkerd Ingress
There are two ingress controllers that are compatible with both Flagger and Linkerd: NGINX and Gloo.
Install NGINX:
```bash
helm upgrade -i nginx-ingress stable/nginx-ingress \
  --namespace ingress-nginx
```
Create an ingress definition for podinfo that rewrites the incoming header to the internal service name (required by Linkerd):
```yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: podinfo
  namespace: test
  labels:
    app: podinfo
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/configuration-snippet: |
      proxy_set_header l5d-dst-override $service_name.$namespace.svc.cluster.local:9898;
      proxy_hide_header l5d-remote-ip;
      proxy_hide_header l5d-server-id;
spec:
  rules:
    - host: app.example.com
      http:
        paths:
          - backend:
              serviceName: podinfo
              servicePort: 9898
```
When using an ingress controller, the Linkerd traffic split does not apply to incoming traffic because NGINX runs outside of the mesh. In order to run a canary analysis for a frontend app, Flagger creates a shadow ingress and sets the NGINX-specific annotations.
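If you want to see the object Flagger generates, list the ingresses in the test namespace; the shadow ingress should appear next to the one you created (its exact name is Flagger's choice, typically the canary name suffixed with -canary):

```bash
kubectl -n test get ingress
```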
A/B Testing
Besides weighted routing, Flagger can be configured to route traffic to the canary based on HTTP match conditions. In an A/B testing scenario you'll be using HTTP headers or cookies to target a specific segment of your users. This is particularly useful for frontend applications that require session affinity.
[Diagram: Flagger Linkerd Ingress]
Edit the podinfo canary analysis: set the provider to nginx, add the ingress reference, remove the max/step weights, and add the match conditions and iterations:
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  # ingress reference
  provider: nginx
  ingressRef:
    apiVersion: extensions/v1beta1
    kind: Ingress
    name: podinfo
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  autoscalerRef:
    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    name: podinfo
  service:
    # container port
    port: 9898
  analysis:
    interval: 1m
    threshold: 10
    iterations: 10
    match:
      # curl -H 'X-Canary: always' http://app.example.com
      - headers:
          x-canary:
            exact: "always"
      # curl -b 'canary=always' http://app.example.com
      - headers:
          cookie:
            exact: "canary"
    # Linkerd Prometheus checks
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 30s
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://podinfo-canary:9898/token | grep token"
      - name: load-test
        type: rollout
        url: http://flagger-loadtester.test/
        metadata:
          cmd: "hey -z 2m -q 10 -c 2 -H 'Cookie: canary=always' http://app.example.com"
```
The above configuration will run an analysis for ten minutes, targeting users that have a canary cookie set to always or that call the service with the X-Canary: always header.
Note that the load test now targets the external address and uses the canary cookie.
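You can verify the routing by hand with the same match conditions (assuming app.example.com resolves to your NGINX ingress controller):

```bash
# Requests carrying the header or cookie should be served by the canary;
# plain requests should keep hitting the primary
curl -H 'X-Canary: always' http://app.example.com
curl -b 'canary=always' http://app.example.com
curl http://app.example.com
```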
Trigger a canary deployment by updating the container image:
```bash
kubectl -n test set image deployment/podinfo \
  podinfod=stefanprodan/podinfo:3.1.4
```
Flagger detects that the deployment revision changed and starts the A/B test:
```bash
kubectl -n test describe canary/podinfo
```

```text
Events:
  Starting canary deployment for podinfo.test
  Pre-rollout check acceptance-test passed
  Advance podinfo.test canary iteration 1/10
  Advance podinfo.test canary iteration 2/10
  Advance podinfo.test canary iteration 3/10
  Advance podinfo.test canary iteration 4/10
  Advance podinfo.test canary iteration 5/10
  Advance podinfo.test canary iteration 6/10
  Advance podinfo.test canary iteration 7/10
  Advance podinfo.test canary iteration 8/10
  Advance podinfo.test canary iteration 9/10
  Advance podinfo.test canary iteration 10/10
  Copying podinfo.test template spec to podinfo-primary.test
  Waiting for podinfo-primary.test rollout to finish: 1 of 2 updated replicas are available
  Promotion completed! Scaling down podinfo.test
```