DevOps - Highway Strategy

Photo by Denys Nevozhai on Unsplash

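Most of my development experience over the past ten years has been in startups, on teams with no more than 20 engineers in total. I only recently joined a large organization where DevOps alone has over a hundred people, plus more than a hundred AWS accounts and regions.

Once I joined, I found all kinds of solutions and architectures. Several teams were trying to build one big automation framework to standardize CI/CD, EKS, and AWS across the whole company.

I also want everyone to use unified standards, processes, and tools: it narrows gaps in information and skills and cuts training and development costs. In the end, though, to fit the big standard I gave up many features I was used to and accepted a pile of tradeoffs, and both development and deployment became harder for me.

I started to reflect: does a company need only one standard? If you add everyone's requirements into it, does it become the perfect tool?

I think the opposite is true, so here I propose the Highway Strategy: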

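  1. Not every car needs to take the highway for every trip; some people just want to drive to a restaurant two blocks away.
  2. Country roads are still necessary.
  3. Use the traffic on the country roads to find out where there are enough cars and users to justify building a highway.
  4. A highway should be built and connected one segment at a time.
  5. Checking tire pressure and coolant on the highway? Sounds odd, right? Some things you should check yourself before you set out.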

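The next time your DevOps team or another team wants to unify the architecture and merge your code or CI/CD, consider the Highway Strategy: does this highway actually get you to your destination faster?

In the end it comes back to being user-oriented. Does the highway you build actually have users, or will it take two years after you finish before anyone uses it? That doesn't sound very agile, or very DevOps, does it?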

GitHub Actions - Can't use gh-cli on a self-hosted runner in GitHub Enterprise

Here are the steps to get gh-cli working on a self-hosted runner:

  1. Install gh on the runner
  2. Use a PAT with the scopes below to operate on your org/repo
    1. repo
    2. workflow
    3. admin:org
  3. Run gh auth login with that PAT
  4. Now you can use gh-cli on your self-hosted runner

Make sure you:

  • Add the PAT as the secret MY_PAT in your org/repo settings
  • Replace git.mycompany.com in --hostname with your own GitHub Enterprise domain

Example workflow:

name: Weekly Report

on:
  workflow_dispatch:

  schedule:
    - cron: '0 4 * * 4' # 04:00 UTC = 12:00 UTC+8 on Thursday (day 4); use 5 for Friday

jobs:
  report:
    runs-on: [self-hosted]
    steps:
      - uses: devops/gh_actions_checkout@v3
      ## TODO: remove this after runner image installed it.
      - name: Install gh-cli
        run: |
          curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg | sudo dd of=/usr/share/keyrings/githubcli-archive-keyring.gpg \
          && sudo chmod go+r /usr/share/keyrings/githubcli-archive-keyring.gpg \
          && echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" | sudo tee /etc/apt/sources.list.d/github-cli.list > /dev/null \
          && sudo apt update \
          && sudo apt install gh -y

      - name: gh auth
        run: |
          echo "${{ secrets.MY_PAT }}" | gh auth login --with-token --hostname git.mycompany.com --git-protocol https

      - name: Search - Logging
        run: |
          gh issue list
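
GitHub Actions evaluates schedule cron expressions in UTC, so it is easy to get the local hour or weekday wrong. The following is a minimal sketch, not part of the workflow itself, that converts the hour and day-of-week fields of a cron entry to a fixed-offset local time. The helper name utc_cron_to_local is hypothetical, and it assumes whole-hour offsets and a single numeric weekday.

```shell
#!/bin/sh
# utc_cron_to_local UTC_HOUR UTC_DOW OFFSET_HOURS
# Hypothetical helper: prints "<local hour> <local day-of-week>".
# Cron day-of-week: 0 = Sunday ... 6 = Saturday.
utc_cron_to_local() {
  utc_hour=$1
  utc_dow=$2
  offset=$3                      # whole hours, e.g. 8 for UTC+8
  total=$((utc_hour + offset))
  # Shift the weekday when the offset crosses midnight in either direction.
  if [ "$total" -ge 24 ]; then
    day_shift=1
  elif [ "$total" -lt 0 ]; then
    day_shift=-1
  else
    day_shift=0
  fi
  local_hour=$(( (total + 24) % 24 ))
  local_dow=$(( (utc_dow + day_shift + 7) % 7 ))
  echo "$local_hour $local_dow"
}

# The schedule '0 4 * * 4': 04:00 UTC on day 4, viewed from UTC+8.
utc_cron_to_local 4 4 8
```

For '0 4 * * 4' this prints 12 4, that is 12:00 on Thursday in UTC+8. To run the report at noon on Friday, the last cron field would need to be 5.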

Here are the errors you will see if something is set wrong or a step is missing.

Not authenticated (gh auth login was skipped, or the token or --hostname is wrong):

HTTP 401: This endpoint requires you to be authenticated. (https://git.mycompany.com/api/graphql)
Try authenticating with: gh auth login
Error: Process completed with exit code 1.

Not inside a checked-out repository (the checkout step is missing):

could not determine base repo: fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
/usr/bin/git: exit status 128
Error: Process completed with exit code 1.

gh is not installed on the runner:

/runner/_work/_temp/4e6ad363-9482-4dc1-8edd-a3ff80577dae.sh: line 1: gh: command not found
Error: Process completed with exit code 127.

The token in GH_TOKEN lacks permission for the request (for example, the workflow's default token is used instead of a PAT with the scopes listed above):

The value of the GH_TOKEN environment variable is being used for authentication.
error fetching organization projects: Message: Resource not accessible by integration, Locations: [{Line:1 Column:92}]
Error: Process completed with exit code 1.
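
Most of these failures can be caught early with a small preflight step. The sketch below is one possible way to do it, not something from the original workflow: require_cmd and preflight are hypothetical helper names, and GH_HOST defaults to the same placeholder domain used above. It does not check token scopes, so the "Resource not accessible by integration" case still needs the right PAT.

```shell
#!/bin/sh
# Hostname of the GitHub Enterprise server (placeholder, same as above).
GH_HOST="${GH_HOST:-git.mycompany.com}"

# require_cmd NAME: succeed only if NAME is available on PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "missing command: $1" >&2
    return 1
  }
}

# preflight: stop with a readable message before gh fails in a later step.
preflight() {
  require_cmd git || return 1
  require_cmd gh || return 1          # otherwise: "gh: command not found"
  git rev-parse --is-inside-work-tree >/dev/null 2>&1 || {
    echo "not inside a git checkout; run the checkout step first" >&2
    return 1                          # otherwise: "could not determine base repo"
  }
  gh auth status --hostname "$GH_HOST" >/dev/null 2>&1 || {
    echo "gh is not authenticated for $GH_HOST" >&2
    return 1                          # otherwise: "HTTP 401"
  }
}
```

Calling preflight || exit 1 in a run step right after the gh auth step should surface the cause in one line instead of a failure further down the job.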

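Kubernetes - ExternalSecret is not authorized to perform: secretsmanager:GetSecretValue

After creating an EKS cluster from the AWS Console, I installed external-secrets using Helm. I was pretty sure my IAM role and policy were correct, but I still couldn't read from Secrets Manager. It turns out the cluster needs an IdP (Identity Provider), that is, an IAM OIDC identity provider for the cluster. Without it, the pod falls back to the node role, as the error shows: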

ERROR, User: arn:aws:sts::xxx:assumed-role/test-cluster-NodeRole/i-0b5ab8adzzzf27a1b is not authorized to perform: secretsmanager:GetSecretValue on resource: sm_rammus because no identity-based policy allows the secretsmanager:GetSecretValue action
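
One way to confirm the missing identity provider is to compare the cluster's OIDC issuer with the providers registered in IAM. This is a minimal sketch under a few assumptions: the AWS CLI is installed and has credentials, and the cluster name test-cluster is only inferred from the role name in the error. The helper names oidc_id_from_issuer and check_oidc_provider are hypothetical.

```shell
#!/bin/sh
# oidc_id_from_issuer URL: the provider ID is the last path segment of the
# cluster's OIDC issuer URL.
oidc_id_from_issuer() {
  echo "${1##*/}"
}

# check_oidc_provider CLUSTER: report whether IAM already has an OIDC
# provider for the cluster. Needs the AWS CLI; defined here but not called.
check_oidc_provider() {
  issuer=$(aws eks describe-cluster --name "$1" \
    --query "cluster.identity.oidc.issuer" --output text) || return 1
  id=$(oidc_id_from_issuer "$issuer")
  if aws iam list-open-id-connect-providers --output text | grep -q "$id"; then
    echo "OIDC provider found for $1"
  else
    echo "no OIDC provider for $1; pods fall back to the node role"
    return 1
  fi
}
```

Running check_oidc_provider test-cluster should report whether the provider exists. If it does not, it can be created in the IAM console or, if you use eksctl, with eksctl utils associate-iam-oidc-provider --cluster test-cluster --approve.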


Kubernetes - Increase metrics-server resources (cpu/memory)

On GKE, if metrics-server crashes because it runs out of resources, you can fix it by changing the NannyConfiguration and then deleting the metrics-server Deployment.

> kubectl top pod
W0309 18:40:25.910477 53595 top_pod.go:274] Metrics not available for pod default/xxxx, age: 536h37m25.91045s
error: Metrics not available for pod default/xxxx, age: 536h37m25.91045s

Environment:

  • Kubernetes: v1.17.15-gke.800

Example

> kubectl apply -f metrics-server-config.yaml

# metrics-server-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    addonmanager.kubernetes.io/mode: EnsureExists
    kubernetes.io/cluster-service: "true"
  name: metrics-server-config
  namespace: kube-system
data:
  NannyConfiguration: |-
    apiVersion: nannyconfig/v1alpha1
    kind: NannyConfiguration
    baseCPU: 200m
    cpuPerNode: 2m
    baseMemory: 150Mi
    memoryPerNode: 4Mi

> kubectl delete deployment -n kube-system metrics-server-v0.3.6
deployment.apps "metrics-server-v0.3.6" deleted

It takes 3-5 minutes for kube-controller-manager to pick up the change and recreate the Deployment with the new configuration.
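
To get a feel for what the NannyConfiguration values mean, the sketch below estimates the resulting resources, assuming the nanny sizes metrics-server as base plus per-node amount times the number of nodes. The helper name nanny_total and the node count of 50 are made up for illustration.

```shell
#!/bin/sh
# nanny_total BASE PER_NODE NODES: hypothetical helper that computes
# base + per-node * nodes, using plain integers (units added by the caller).
nanny_total() {
  echo $(( $1 + $2 * $3 ))
}

nodes=50   # assumed cluster size, for illustration only

# Values from the ConfigMap above: baseCPU 200m, cpuPerNode 2m,
# baseMemory 150Mi, memoryPerNode 4Mi.
echo "cpu: $(nanny_total 200 2 "$nodes")m"
echo "memory: $(nanny_total 150 4 "$nodes")Mi"
```

With 50 nodes this prints cpu: 300m and memory: 350Mi. Once the Deployment has been recreated, kubectl top pod should return metrics again.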
