Steve Yegge's Candid Rant About Google and Platforms

November 12, 2019, 17:24

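In 2011 Steve Yegge, a former Google and former Amazon engineer, published a blog post titled "Stevey's Google Platforms Rant" (a Japanese translation is also available if you are interested). The post is largely a gripe about the cloud platforms of the day, but along the way it offers interesting insight into why Amazon pushed toward SOA (what we would now call microservices) and what problems it ran into.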

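In 2011 neither Kubernetes (2014) nor service meshes existed yet, but the problems Amazon faced back then have a lot in common with those of companies moving to microservices today, so I have tried to sort them out in my own way.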

What follows includes my own subjective views, so I recommend reading the original post or the Japanese translation first.

=======================================================

One day, Amazon CEO Jeff Bezos handed down the following directive:

His Big Mandate went something along these lines:

1. All teams will henceforth expose their data and functionality through service interfaces.
2. Teams must communicate with each other through these interfaces.
3. There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team's data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.
4. It doesn't matter what technology they use. HTTP, Corba, Pubsub, custom protocols -- doesn't matter. Bezos doesn't care.
5. All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.
6. Anyone who doesn't do this will be fired.
7. Thank you; have a nice day!

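In substance, the mandate says to split the system into services and turn them into an API platform through which teams can interoperate, but the tone was extremely harsh.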

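To enforce this policy, Rick Dalzell, a former Army Ranger who had previously been at Walmart, stepped in and spread fear among the reluctant teams. An architectural overhaul of this scale carries large monetary and time costs and draws heavy pushback from the people doing the work, so wouldn't it have been hard to pull off without a top-down push?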

Over the next several years Amazon's engineers reworked the architecture of their services, and Steve's post lists the problems they ran into along the way.

1. pager escalation gets way harder, because a ticket might bounce through 20 service calls before the real owner is identified. If each bounce goes through a team with a 15-minute response time, it can be hours before the right team finally finds out, unless you build a lot of scaffolding and metrics and reporting.

2. every single one of your peer teams suddenly becomes a potential DOS attacker. Nobody can make any real forward progress until very serious quotas and throttling are put in place in every single service.

3. monitoring and QA are the same thing. You'd never think so until you try doing a big SOA. But when your service says "oh yes, I'm fine", it may well be the case that the only thing still functioning in the server is the little component that knows how to say "I'm fine, roger roger, over and out" in a cheery droid voice. In order to tell whether the service is actually responding, you have to make individual calls. The problem continues recursively until your monitoring is doing comprehensive semantics checking of your entire range of services and data, at which point it's indistinguishable from automated QA. So they're a continuum.

4. if you have hundreds of services, and your code MUST communicate with other groups' code via these services, then you won't be able to find any of them without a service-discovery mechanism. And you can't have that without a service registration mechanism, which itself is another service. So Amazon has a universal service registry where you can find out reflectively (programmatically) about every service, what its APIs are, and also whether it is currently up, and where.

5. debugging problems with someone else's code gets a LOT harder, and is basically impossible unless there is a universal standard way to run every service in a debuggable sandbox.

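(1) becomes easier to handle once logging, monitoring, and distributed tracing are standardized across services. At my current workplace we are in the middle of rolling out Datadog. Once that is done, we will be able to get a bird's-eye view of service-to-service traffic and quickly check performance bottlenecks and the steps immediately before and after an error.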
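
To make the tracing idea concrete, here is a minimal sketch of trace ID propagation in plain Python. It is not how Datadog or any particular tracer works internally. The header name `x-trace-id` and the functions `ensure_trace_id`, `call_service`, and `handle_request` are all hypothetical names made up for illustration (real systems typically use a standard header such as W3C `traceparent`). The point is only that every hop reuses the ID it received, so a failure deep in the call chain can be tied back to the original request.

```python
import uuid

TRACE_HEADER = "x-trace-id"  # hypothetical header name, chosen for this sketch


def ensure_trace_id(headers):
    """Return a copy of headers that carries a trace ID.

    An incoming ID is reused; a new one is minted only at the edge.
    """
    out = dict(headers)
    if TRACE_HEADER not in out:
        out[TRACE_HEADER] = uuid.uuid4().hex
    return out


def call_service(name, headers, log):
    """Stand-in for a network call: record (trace_id, service) for this hop."""
    headers = ensure_trace_id(headers)
    log.append((headers[TRACE_HEADER], name))
    return headers  # the same headers are forwarded to the next hop


def handle_request(chain, incoming_headers=None):
    """Walk a request through a chain of services and return the hop log."""
    log = []
    headers = dict(incoming_headers or {})
    for name in chain:
        headers = call_service(name, headers, log)
    return log


if __name__ == "__main__":
    for trace_id, service in handle_request(["frontend", "cart", "pricing"]):
        print(trace_id, service)
```

With a shared ID in every log line, finding the owning team is a search for one trace ID rather than a ticket bouncing from team to team.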

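(2) and (4) are exactly the problems that today's service meshes aim to solve. Publicly available solutions now cover service discovery for locating services, as well as proxies and API management products that handle API rate limiting, retries, tracing, and so on.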
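
As an illustration of the "quotas and throttling" part of (2), the following is a minimal token-bucket sketch with one bucket per calling team. It is only one common way to rate-limit and is not taken from any specific service mesh or from Amazon's implementation. The class names `TokenBucket` and `Throttle`, and the `rate`, `capacity`, and `clock` parameters, are assumptions made for this example.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the behavior can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill according to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class Throttle:
    """Per-caller quota: each calling team gets its own bucket."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def allow(self, caller):
        bucket = self.buckets.get(caller)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[caller] = bucket
        return bucket.allow()


if __name__ == "__main__":
    throttle = Throttle(rate=5.0, capacity=10)
    accepted = sum(throttle.allow("team-search") for _ in range(25))
    print("accepted in one burst:", accepted)
```

Keeping a separate bucket per caller means one misbehaving peer team exhausts only its own quota, not the capacity everyone else relies on.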

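(3) is addressed by defining SLAs/SLOs/SLIs for each service and monitoring against them. Since logs, monitoring data, and traces are usually analyzed together, it is easier to consolidate them in one place.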
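
The "monitoring and QA are the same thing" observation can be sketched as a synthetic probe: instead of asking a service whether it is fine, make real calls and check the answers, then turn the pass rate into an SLI and compare it with an SLO target. The sketch below is illustrative only. The functions `run_probes`, `sli`, and `meets_slo`, the toy services, and the 0.99 target are all assumptions, not part of any monitoring product.

```python
def run_probes(service, probes):
    """Call `service` with each (args, expected) pair.

    A probe passes only if the call succeeds AND returns the expected value.
    """
    outcomes = []
    for args, expected in probes:
        try:
            outcomes.append(service(*args) == expected)
        except Exception:
            outcomes.append(False)
    return outcomes


def sli(outcomes):
    """Fraction of probes that passed (a simple correctness SLI)."""
    if not outcomes:
        raise ValueError("no probe outcomes to evaluate")
    return sum(outcomes) / len(outcomes)


def meets_slo(outcomes, target):
    """True if the measured SLI reaches the SLO target."""
    return sli(outcomes) >= target


# Two toy "services" computing an order total in cents (hypothetical).
def healthy_total(unit_cents, qty):
    return unit_cents * qty


def broken_total(unit_cents, qty):
    if qty != 1:
        raise RuntimeError("backend unavailable")
    return unit_cents


PROBES = [((100, 1), 100), ((100, 3), 300), ((250, 2), 500), ((0, 5), 0)]

if __name__ == "__main__":
    for svc in (healthy_total, broken_total):
        outcomes = run_probes(svc, PROBES)
        print(svc.__name__, sli(outcomes), meets_slo(outcomes, 0.99))
```

The broken service here would still answer a trivial request correctly, which is why the probes need to cover realistic calls rather than a single ping.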

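(5) becomes dramatically easier with containerization and a sandbox environment that is easy to spin up. In my case, I am working on a mechanism that uses Cloud Run to deploy verification environments with little effort. (Here is the GitHub link; feel free to try it.)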

=======================================================

As you have seen, I think many of the problems Amazon faced back then also apply to companies today. What do you think?

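Companies today can use containers, service meshes, and the excellent solutions around them, so at first glance it may seem there is no longer any need to feel one's way forward. The organizational and cultural challenges, however, are the same now as they were then. To those in management: why not talk with your engineers about how you will work together going forward?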
