Metropolis, a cluster operating system

Note: Metropolis is currently in heavy development. This documentation is written to reflect our goals, not necessarily the current state of the product. You are welcome to give Metropolis a try, but we cannot recommend running it anywhere near production workloads.

Welcome to the Metropolis Handbook, the primary documentation resource for Metropolis. Metropolis is a cluster operating system, meaning its goal is to run on a fleet of machines (be they physical or virtual) and pool their resources into a unified API for operations and developer teams.

Metropolis stands on the shoulders of giants, and takes the best of battle-tested software like the Linux kernel and Kubernetes to build a cohesive, stable, reliable and secure platform.

What makes Metropolis unique

  1. A self-contained operating system: Metropolis is a full software stack, including the Linux kernel, userspace code, Kubernetes distribution and cluster management system. In contrast to traditional cluster administration, there are no puzzles to put together from a dozen vendors. The entire stack is tested as a single deployable unit.
  2. Eliminates state: Metropolis nodes don't have a traditional read-write filesystem; all of their state is contained on a separate partition with clear per-component ownership of data. All node configuration is managed declaratively on a per-node basis, and all cluster operations are performed through a gRPC API.
  3. No shell, no one-off hacks, no configuration drift: Metropolis nodes do not run SSH nor depend on low-level system administration tools for day-to-day operations, even debugging.
  4. Opinionated on production readiness: Metropolis does not attempt to support every possible software configuration, instead focusing on scenarios that make for a high-quality production experience.
  5. Robust: Metropolis builds upon proven technology and does not take risks. Cluster consensus is maintained using the Raft protocol, user and node communication uses well-defined gRPC services, while system services are limited in complexity and purpose-built for Metropolis.
  6. Secure at rest: Metropolis nodes by default encrypt their data partitions and check the integrity of running code, providing tamper resistance and preventing data exfiltration even if an attacker can access a node's disk drives.
  7. Self-locking: Metropolis can be configured to use TPM hardware attestation, in which case cluster membership is limited to nodes that are running trusted versions of the software on trusted hardware.
  8. Not magic: Metropolis clusters are complex, distributed systems. Managing any distributed system like Metropolis requires some knowledge of the core concepts and components involved, and Metropolis does not attempt to hide that complexity away. Limited internal abstractions and well-documented source code let anyone easily troubleshoot deeper issues.

Kubernetes on Metropolis

While we aim to make Metropolis run various kinds of workloads in the future, Metropolis strongly focuses on using Kubernetes as an application platform for users. Workloads can be scheduled on Metropolis using standard Kubernetes tools like kubecfg, Tanka or even Helm.

In comparison to other Kubernetes distributions, Metropolis does not attempt to simplify Kubernetes by providing extra wrappers or shortcuts for users. Instead, we believe that users should understand the Kubernetes production model and aim to be proficient in its API, as high-level wrappers paradoxically tend to introduce complexity rather than remove it.

How to use this Handbook

This handbook is the canonical documentation for Metropolis. It aims to document all aspects of Metropolis, from a quick demo, through production deployment to architecture internals.

Note: This section is critical to understanding the Handbook's structure and must be read by anyone looking to use Metropolis. At the bottom of this page you will find information about which sections to read next, depending on how you want to use Metropolis.

Who is this book for?

Throughout this book, we will use the following terminology regarding groups of people who interact with Metropolis. Note that these names do NOT imply that these groups are disjoint. Instead, think of them as different hats people can wear when using Metropolis.

Metropolis does not enforce these roles explicitly, but is designed and engineered in a way that makes the most sense for this kind of organizational structure.

Operators

Operators are responsible for managing Metropolis clusters - for example, bringing nodes into the cluster and decommissioning old nodes, monitoring resource usage and performing capacity planning, ensuring that the cluster is healthy, and handling cluster-level outages.

Operators are usually part of an organization-wide 'platform' team which acts as an internal 'as-a-Service' provider, offering services to Users (like a workload scheduling system, a database service, a storage service...). They might manage some workloads running on Metropolis themselves, too, usually parts of the platform provided to Users (like running organization-wide database clusters on Metropolis).

Operators mostly act as system administrators, but are expected to also be able to use Metropolis APIs from a programming or scripting language of their choice to automate their work and make the most of Metropolis. Metropolis provides a set of management tools that allow managing clusters from the command line, but these are only thin wrappers around the underlying API, which should be the main way to think about Metropolis.

Users

Users run workloads on Metropolis clusters, via the Kubernetes API. They might know that a cluster runs Metropolis, but this is generally not something they need to worry about - instead, they should be aware of the abstractions which Kubernetes provides. Some limited amount of interaction with Metropolis-specific APIs might exist for purposes of authentication or accounting, but it would usually also be hidden away by Operators as part of some organizational integration code.

Critically, however, Metropolis does not provide Users with some 'friendly' higher-level API or tooling which duplicates the functionality of the Metropolis API used by Operators. Instead, the Kubernetes API is provided to both Users and Operators in the same fashion, so that any internal tooling built on top of it can be shared between Users and Operators.

In addition, Metropolis makes no attempt to hide that both it and Kubernetes are distributed systems; applications running on top of a cluster need to be engineered to handle such an environment.

In most organizations, Users will be part of product teams, both developing and operating the organization's product or service.

Developers

Developers work on the Metropolis codebase. Metropolis is an open source project that welcomes external contributions and has a fully open design process. However, any changes introduced must be carefully reviewed and tested - not only external contributions, but also contributions from Monogon employees.

Metropolis comes with high quality developer tooling to work on the codebase - all tests, including full cluster tests, can be run without any special software straight from a Monogon repository checkout.

People who wish to build Metropolis from source (for security, to reproduce official artifacts, or to apply internal organization patches) are also expected to fall into this category. In the future, purpose-specific documentation might be built for software packagers or people who wish to ensure Metropolis builds are reproducible, but that is not the case yet.

Which sections should be read, and in what order?

If you just want to try out Metropolis, head over to 2. Demo Cluster and come back here later.

If you're considering deploying Metropolis, you must read 1. Metropolis in your Organization. It lays out some ground concepts of how Metropolis will fit into your organization, what it's good at, and what it's not. It's aimed at future Operators who wish to better understand the relationship between them and Users, but should also be read by organization management teams that will oversee future Operator and User teams/roles.

Operators must read the following chapters:

  • 3. Cluster Architecture, which describes how Metropolis is designed. The information contained therein is crucial to properly plan, deploy and manage a Metropolis cluster. Individual sections of the chapter are marked when their information is optional for some kinds of deployments; these parts can be skipped and read as needed later.
  • 4. Production Deployment, which describes the standard procedures used to manage Metropolis clusters, as well as troubleshooting procedures.

Users do not need to read any Metropolis-specific documentation to use Metropolis, and instead should rely on information provided by cluster Operators and the upstream Kubernetes documentation. However, we encourage Users to skim through 3. Cluster Architecture if they are interested in knowing more about the internal Metropolis architecture.

Developers are generally expected to start out as Operators and thus have read all relevant documentation for Operators already. In addition to that, they are provided with information on how to develop Metropolis in 5. Developing Metropolis, which gives an introduction to the Metropolis codebase and how to get started writing code.

Metropolis in your Organization

Note: In this chapter, 'developers' means product developers, i.e. Metropolis Users, not Metropolis Developers. Whenever you see User, Operator or Developer, think of Metropolis roles. However, whenever you see developer, think of product development teams acting as Users.

As outlined in How to use this Handbook, Metropolis has at its core the concept of separate Users and Operators of a Metropolis cluster.

This split might, at first glance, seem antithetical to the spirit of 'DevOps'. However, this distinction doesn't exist to take away operational tasks from software developers (Users), but to let Metropolis scale to large organizations where developers cannot be expected to be responsible for operations all the way from physical hardware (or a public cloud) up to their product. We believe product teams should be able to focus on the operational aspects specific to their product, and not have to deal with low-level concerns like cluster-level backups, monitoring and security.

This chapter aims to explain the reasoning behind this split, and to tie it into how Metropolis expects to be managed in different kinds of organizations.

Platform Teams

Metropolis allows large organizations to build internal Platform Teams. These exist to bring a 'PaaS'-style experience to multiple internal product development teams. Metropolis neatly fits into this scenario by exposing only a standard Kubernetes API to these development teams (acting as Metropolis Users), while also exposing a powerful but proprietary API for the platform team (Metropolis Operators) that concerns only operational work. The two APIs are separate but do not overlap in functionality.

In the following example, the Platform Team are Metropolis Operators, while Product Teams A and B are Metropolis Users. The Platform Team runs two multi-tenant Metropolis clusters, both of which can be used by any Product Team for any purpose.

.----------------.      Manages     .--------------------------.
| Product Team A | ---------------> | Product A                |
'----------------'       (k8s)      '--------------------------'
                                         |            |
.----------------.      Manages     .--------------------------.
| Product Team B | ---------------> | Product B                |
'----------------'       (k8s)      '--------------------------'
                                         | |          | |
                                         | |          | | Runs on
                                         V V          V V
.----------------.      Manages     .-----------.  .-----------.
| Platform Team  | ---------------> | Cluster X |  | Cluster Y |
|                | -----------------|           |->|           |
'----------------' (Metropolis API) '-----------'  '-----------'

At large scales, Product Teams benefit from Metropolis by using a product that does not require them to be aware of implementation details below the Kubernetes API layer, and can focus on day-to-day operations of core products. They do not need to coordinate with other Product Teams on sharing the underlying resources, nor do they need to take care of managing or scaling the cluster. Platform Teams likewise benefit from Metropolis having been designed for use in a multi-tenant fashion where product teams can share a cluster safely.

Smaller Organizations

While the above Platform Team system works great for larger organizations, smaller organizations usually do not benefit from having distinct teams of dozens of people responsible just for clusters and other organization-wide resources.

In these cases, there is nothing which prevents the lone backend developer at a company from acting both as a Metropolis Operator and User, managing Metropolis clusters as well as the actual workloads running on them:

.--------------.       Manages      .---------.
| Backend Team | -----------------> | Product |
|              |        (k8s)       |         |
|              | --.                '---------'
'--------------'   |                     | Runs on
                   |                     V
                   |   Manages      .---------.
                   '--------------> | Cluster |
                  (Metropolis API)  '---------'

As the organization grows, Metropolis will keep gently guiding the Backend Team's workflows (by way of the User/Operator role separation) towards not mixing these two roles together. From the beginning, the Product can be deployed using only the Kubernetes API, without needing to touch Metropolis-specific APIs. As new products and projects are developed, they can continue to use the existing Metropolis infrastructure without the overhead of each team managing its own production environment from the ground up.

Organizational anti-patterns

Monogon believes that organizational issues cannot simply be fixed by applying technical solutions. Thus, Metropolis explicitly avoids supporting use cases that stem from heavy internal siloization of organizations, or from the broken incentives of a sysadmin-style platform team. We believe that Metropolis can be used as a catalyst to build better teams and workflows, but it is not by itself a fix for organizational problems.

We would like to refer you to the following sources for more information on these organizational patterns and anti-patterns.

  1. Infra teams: good, bad or none at all, which describes the typical emerging ways organizations deal with infrastructure work. Metropolis leans heavily towards a “Company A” environment.
  2. The SRE Book, which describes Google's “implementation” of DevOps. While the processes described work best for extremely large companies, a significant amount of high-level observations and judgements can be pertinent to even the smallest organizations.
  3. The SRE Workbook chapter “How SRE Relates to DevOps”, which describes an organizational approach to development and operation teams in which Metropolis works best.

The demo cluster

This chapter demonstrates the installation process and operation of a Metropolis cluster.

Prerequisites

Hardware

An x86_64/amd64 Linux host machine with glibc on which you can run metroctl. This will later be expanded to cover many more platforms, but for our first release this is the only supported platform.

And either:

  • KVM support on your host machine and a hypervisor capable of running OVMF with TPM2 (like libvirt/virt-manager)
  • A physical x86_64/amd64 machine (ideally at least 3 for reboot persistence) with UEFI boot, a TPM2, and a USB thumb drive (>=1G).

Software

metroctl

First, you'll need metroctl, the command line utility for working with Metropolis clusters. You can get it from GitHub Releases (https://github.com/monogon-dev/monogon/releases) with

curl -L -o metroctl https://github.com/monogon-dev/monogon/releases/download/metropolis-v0.1/metroctl
chmod +x metroctl

Optionally you can move the file to a location in PATH, like /usr/local/bin or ~/bin/.

The installation bundle

To install Metropolis, you'll need a bundle. A bundle contains all the resources needed to install or update a Metropolis node. You can get a prebuilt bundle from GitHub Releases with

curl -L -o bundle.zip https://github.com/monogon-dev/monogon/releases/download/metropolis-v0.1/bundle.zip

Installation

The bootstrap node

Let's generate the installer image that we'll use to install the first node of the upcoming cluster. To do that, use the metroctl tool in the following way:

metroctl install genusb bootstrap-node-installer.img --bootstrap --bundle=<installation-bundle-path>

If you're going to install from a USB stick or other types of removable storage, supply metroctl with a device path:

metroctl install genusb /dev/sdx --bootstrap --bundle=<installation-bundle-path>

Since a new GPT will need to be generated for the target device, the image file cannot simply be copied onto it. Caution: make sure you're using the correct path, as metroctl will overwrite data on the target device.

Metropolis does not support installation from optical media.

The installer will be paired with your cluster owner's credentials, which metroctl will save to your XDG config directory. Please note that the resulting installer can be used only to set up the initial node.

If all goes well, this will leave you with the following output:

2022/07/07 10:29:44 Generating installer image (this can take a while, see issues/92).

Use the installer image to provision the first node. The image will contain an EFI-bootable payload.

Caution: the installer will install Metropolis onto the first suitable persistent storage it finds as soon as it boots. The installation process is non-interactive in this version of the OS. If you're going to install on physical hardware, make sure you have backed up all your data from the machine you'll be running it on.

If you'll be using a virtual machine, it is advised to pick a smaller storage size, e.g. 4G. Upon first boot, Metropolis will need to zero its data partition, which can take a long time.

The installer will produce the following output, which will be both sent over the serial interface and displayed on your screen, if available:

Installing to /dev/vdx...

Afterwards, the machine will restart, and the installation medium will need to be removed. At this point you should be left with a working bootstrap node.

Taking ownership of the new cluster

After the first node is set up and running, you can take ownership of the upcoming cluster:

metroctl takeownership <bootstrap-node-address>

This should result in the following output being displayed:

2022/07/07 10:42:07 Successfully retrieved owner credentials! You now own this cluster. Setting up kubeconfig now...
2022/07/07 10:42:07 Success! kubeconfig is set up. You can now run kubectl --context=metropolis ... to access the Kubernetes cluster.

If this doesn't work on the first try, give the bootstrap node more time. Depending on the available storage size, setting up its data partition might take longer, in which case your connection attempts will be refused.

Additional nodes

Additional nodes can be provisioned with the non-bootstrap installer image, which can also be generated with metroctl. This time, note the lack of the --bootstrap flag.

metroctl --endpoints <bootstrap-node-address> install genusb second-node-installer.img --bundle=<installation-bundle-path>

Complete the installation process with one or more nodes.

For the new nodes to join the cluster, you'll need to approve them first. Calling metroctl approve with no parameters will list nodes pending approval.

metroctl --endpoints <bootstrap-node-address> approve

You should get a list of node IDs which would look similar to:

metropolis-7eec2053798faab726bb9fd9e9444ec9

If no nodes have registered with the cluster yet, metroctl will produce the following output:

There are no nodes pending approval at this time.

To approve a node, use its node ID as a parameter.

metroctl --endpoints <bootstrap-node-address> approve <above-node-id-goes-here>

If the node was approved as a result, metroctl will say:

Approved node <node-id>

Using the cluster

At this point you can start exploring Metropolis. Try playing with kubectl, or take a look at the Cluster API chapter of this handbook.

The cluster state should be reflected by kubectl output:

kubectl --context=metropolis get nodes

NAME                                          STATUS   ROLES    AGE   VERSION
metropolis-4fb5a2aa4eec34080bea02ac8020028d   Ready    <none>   98m   v1.24.0+mngn
metropolis-7eec2053798faab726bb9fd9e9444ec9   Ready    <none>   86m   v1.24.0+mngn

If you need to install kubectl, try this chapter of the official Kubernetes Documentation.

Caveats

This is a preview version of Metropolis, and there are a couple of things to be aware of.

Cold start

The cluster recovery flow is still unimplemented. This means that a cold cluster, in which all member nodes have been shut down, will not start up again. This will be solved in an upcoming release.

Launch locally

Single node test cluster

Introduction

Each Metropolis deployment (Cluster) is fully self-contained and independent from other Clusters.

A Cluster is made up of Nodes. Nodes are machines (be they physical or virtual) running an instance of Metropolis. A Node can be part of only one Cluster.

             Cluster
 .-----------._.'._.------------.
 '                              '
 .--------. .--------. .--------.
 | Node A | | Node B | | Node C |
 '--------' '--------' '--------'

Nodes

Each Node runs the Linux kernel and Metropolis userspace. The userspace consists of Metropolis code on a signed read-only partition, and of persistent user data on an encrypted read-write partition. The signed read-only filesystem (the System filesystem) is verified by the Linux kernel, which in turn is signed and verified by a Node's firmware (EFI) via Secure Boot.

          
.--------------------.         .--------------------.
| Platform Firmware  |-------->| Secure Boot Image  |
|       (EFI)        | Checks  |--------------------|        
|--------------------|         |  .--------------.  |        .-------------------.
|      PK/KEK        |         |  | Linux Kernel |---------->| System FS (erofs) |
| Signature Database |         |  |--------------|  | Checks |-------------------|
'--------------------'         |  |  System FS   |  |        |    Node Code      |
                               |  |  Signature   |  |        '-------------------'
                               |  |  (dm-verity) |  |        
                               |  '--------------'  |        
                               '--------------------'

When booting, a Node needs to become part of a cluster (by either Bootstrapping a new one, Registering into an existing one for the first time, or Joining after reboot) to gather all the key material needed to mount the encrypted data partition. One part of the key is stored on the EFI System Partition encrypted by the TPM (sealed), and will only decrypt correctly if the Node's Secure Boot settings have not been tampered with. The other part of the key is stored by the Cluster, enforcing active communication (and possibly hardware attestation) with the Cluster before a Node can boot.

.-------------------.  Measures Secure Boot settings
| Platform Firmware |<----------.
'-------------------'           |
         | Checks               |
         v                      |
.-------------------.           |
| Secure Boot Image |           |
'-------------------'           |
         | Checks           .-------.
         v                  |  TPM  |
.-------------------.       '-------'
|     System FS     |           |
'-------------------'           | Seals/Unseals
         | Mounts               v
         |           .---------------------.        .------------------------.
         | .---------|   Node Unlock Key   |        |    Running Cluster     |
         |/          '---------------------'        |------------------------|
         | .----------------------------------------| Cluster Encryption Key |
         |/                                         |       (per node)       |
         |                                          '------------------------'
         v
.---------------------------.
| Data Partition (xfs/LUKS) |
'---------------------------'
 

The Node boot, disk setup and security model are described in more detail in the Node chapter.
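
To illustrate the split-key idea described above (one key half sealed locally by the TPM, the other provided by the Cluster), here is a rough, hypothetical sketch. The use of HKDF, the key sizes and the names shown are assumptions for illustration only and do not reflect the actual Metropolis key derivation scheme.

// Hedged illustration of the split-key concept: neither half alone is enough
// to derive the data partition key. This is NOT the actual Metropolis key
// derivation scheme, just a sketch of the idea using HKDF.
package main

import (
	"crypto/sha256"
	"fmt"
	"io"

	"golang.org/x/crypto/hkdf"
)

// deriveDataKey combines the locally sealed Node Unlock Key with the key half
// provided by the Cluster after a successful Register/Join.
func deriveDataKey(nodeUnlockKey, clusterKey []byte) ([]byte, error) {
	kdf := hkdf.New(sha256.New, append(nodeUnlockKey, clusterKey...), nil, []byte("data-partition"))
	key := make([]byte, 32)
	if _, err := io.ReadFull(kdf, key); err != nil {
		return nil, err
	}
	return key, nil
}

func main() {
	local := []byte("example-half-sealed-by-tpm")     // placeholder value
	remote := []byte("example-half-from-the-cluster") // placeholder value
	key, _ := deriveDataKey(local, remote)
	fmt.Printf("derived %d-byte data partition key\n", len(key))
}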

Each Node has the same minimal userland, implemented in Go. However, this userland is unlike a usual GNU/Linux distribution, or most Linux-based operating systems for that matter. Metropolis does not have an LSB-compliant filesystem root (no /bin, /etc...) and does not run a standard init system or syslog daemon. Instead, all process management is performed within a supervision tree (where supervised processes are called Runnables), and logging is performed within that supervision tree as well.

The supervision tree and log tree have some strict properties that are unlike those of a traditional Unix-like init system. Most importantly, any time a runnable restarts due to some unhandled error (or when it explicitly exits), all subordinate runnables will also be restarted.

In a more practical example, when working with Metropolis, you will see log messages like the following:

root.enrolment                   I0228 13:30:45.996960 cluster_bootstrap.go:48] Bootstrapping: mounting new storage...
root.network.interfaces          I0228 13:30:45.997359 main.go:252] Starting network interface management
root.time                        R 2022-02-28T13:30:45Z chronyd version 4.1-monogon starting (NTP RTC SCFILTER ASYNCDNS)
root.network.interfaces.dhcp     I0228 13:30:46.006082 dhcpc.go:632] DISCOVERING => REQUESTING
root.network.interfaces.dhcp     I0228 13:30:46.008871 dhcpc.go:632] REQUESTING => BOUND

The first column represents a runnable's Distinguished Name. It shows, for example, that the DISCOVERING => REQUESTING log line was emitted by a supervision tree runnable named dhcp, which was spawned by another runnable named interfaces, which in turn was spawned by a runnable named network, which in turn was started by the root of the Metropolis Node code.
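
To make the runnable model more concrete, below is a minimal, hypothetical sketch of nested runnables. The Runnable type and the run helper are stand-ins invented for this example and are not the actual Metropolis supervisor API; they only illustrate how a parent runnable ('network') spawns a child ('dhcp') and how an error would propagate into a restart.

// Hypothetical sketch of nested runnables; the Runnable type and run()
// helper are invented stand-ins, not the actual Metropolis supervisor API.
package main

import (
	"context"
	"errors"
	"log"
)

// Runnable is a long-running, supervised function. When it returns an error,
// the supervisor restarts it together with all of its subordinate runnables.
type Runnable func(ctx context.Context) error

// run stands in for the supervisor primitive that spawns a named child runnable.
func run(ctx context.Context, name string, r Runnable) error {
	log.Printf("%s: starting", name)
	return r(ctx)
}

// network is a parent runnable; restarting it also restarts network.dhcp.
func network(ctx context.Context) error {
	return run(ctx, "network.dhcp", dhcp)
}

// dhcp is a child runnable; returning an error signals the supervisor to
// restart it (and, by the restart property above, its subordinates).
func dhcp(ctx context.Context) error {
	log.Printf("network.dhcp: DISCOVERING => REQUESTING")
	return errors.New("lease lost")
}

func main() {
	if err := run(context.Background(), "network", network); err != nil {
		log.Printf("network failed: %v", err)
	}
}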

The Node runnable axioms, supervision tree and log tree are described in more detail in the Node Runnables and Logging chapter.

Node roles and control plane

Each Node has a set of Roles assigned to it. These roles include, for example, running the cluster control plane or running Kubernetes workloads. At runtime, Nodes continuously retrieve the set of roles assigned to them by the cluster and maintain the services required to fulfill those roles. For example, if a node has a 'Kubernetes worker' role, it will attempt to run the Kubelet service, amongst others.


   .-----------------------.
   | Cluster Control Plane |
   |-----------------------|
   |  Node Configuration   |
   |    & Node Status      |
   '-----------------------'
 Assigned   |      ^ Status
    roles   v      | updates
         .------------.
         |   Node A   |
         |------------|
         |            |
         |  Kubelet   |
         |            |
         '------------'
     

Nodes which have the 'control plane' role run core cluster services which other nodes depend on. These services make up a multi-node consensus which manages cluster configuration and management state. This effectively makes the cluster self-managed and self-contained. That is, the control plane block pictured above is in fact running on nodes in the same way as the Kubelet.


.---------------. .---------------. .---------------.
|    Node A     | |     Node B    | |    Node C     |
|---------------| |---------------| |---------------|
| Control Plane | | Control Plane | |       Kubelet |
| ^             | | ^     Kubelet | |               |
'-|-------------' '-|-------------' '---------------'
  |       |         |       |                  |
  '-------+---------+-------+------------------'
           Assigned roles & Status updates

The control plane services are described in more detail in the Cluster chapter.

The Control Plane services serve requests from Nodes (like the aforementioned retrieval of roles) and Users/Operators (like management requests) over gRPC, via an API named Cluster API.

Identity & Authentication

When Nodes or Users/Operators contact the Cluster API, they need to prove their identity to the Node handling the request. In addition, nodes handling these requests need to prove their identity to the client. This is performed by providing both sides of the connection with TLS certificates, with some early communication (when certificates are not yet available) being performed over self-signed certificates to prove ownership of a key.

The TLS Public Key Infrastructure (CA and certificates) is fully self-managed by the Cluster Control Plane, and Users or Operators never have access to the underlying private keys of nodes or the CA. These keys are also stored encrypted within the Node's data partition, so are only available to nodes that have successfully become part of the Cluster. This model is explained and documented further in the Identity and Authentication chapter.
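
For illustration, an Operator's automation might establish an authenticated connection to the Cluster API roughly as in the sketch below. The file names, endpoint address and port are placeholders invented for this example; in practice metroctl manages the owner credentials for you.

// Hedged sketch: dialing the Cluster API over gRPC with a client TLS
// certificate. File names, the endpoint address and the CA handling shown
// here are illustrative assumptions, not the exact Metropolis setup.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"os"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

func main() {
	// Client certificate and key, e.g. owner credentials obtained via Escrow.
	cert, err := tls.LoadX509KeyPair("owner.pem", "owner-key.pem")
	if err != nil {
		log.Fatal(err)
	}
	// Cluster CA certificate, used to verify the node serving the API.
	caPEM, err := os.ReadFile("cluster-ca.pem")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	creds := credentials.NewTLS(&tls.Config{
		Certificates: []tls.Certificate{cert},
		RootCAs:      pool,
	})
	// Placeholder endpoint; use a node address and the Cluster API port.
	conn, err := grpc.Dial("node.cluster.example:4242", grpc.WithTransportCredentials(creds))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// conn can now be used to construct Cluster API service clients.
}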

In the future, we plan to implement TPM-based Hardware Attestation as part of a Node's early connections to a Cluster, allowing full cross-node verification, and optionally also for connections from a User/Operator to a Cluster.

Node

Node Runnables and Logging

Node Storage

Cluster

Cluster API

This chapter goes through all of the services and data types exposed by the Cluster API that are used in day-to-day operations. In most cases, it's easier to access this functionality with the metroctl tool. However, the API can still be used directly whenever a custom implementation is needed, e.g. when writing automation. The rule of thumb is that metroctl should be used for human interaction, while direct API access should be used for machine-to-machine communication.

Services

Most cluster services require authenticated access (in the form of a TLS keypair and certificate), which can be obtained using the cluster owner's credentials generated during the cluster bootstrap phase. The Escrow service is the single exception to this rule.

Escrow

Escrow is extensively documented in its proto file. Its current use case is exchanging the Initial Owner Credentials (generated before the installation of the first node) for long-term access credentials that are then used to perform further API accesses.

Management

The Management service is the main cluster manager-facing management API. Management tasks include:

  • Addition and removal of cluster nodes
  • Querying of cluster and node status
  • Node configuration
  • Node and cluster bootstrapping

This chapter describes Management service calls together with their expected usage scenarios. For a more technical description, see: management.proto

GetRegisterTicket

GetRegisterTicket is used in cluster nodes' Register Flow. An up-to-date register ticket has to be retrieved ahead of time by a cluster manager and included in the Node Parameters supplied to a new node for it to be able to register and join the cluster. The ticket is used to protect the API surface from potential denial-of-service attacks by limiting the number of entities that can start the Register Flow against the cluster. It can be regenerated at any time in case it leaks.
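
For automation, retrieving a register ticket over the Cluster API might look roughly like the following sketch, given an already-authenticated gRPC connection (see Identity & Authentication). The generated package import path, client constructor and field names are assumptions; consult management.proto for the actual definitions.

// Hedged sketch: fetching a register ticket to embed into a new node's
// Node Parameters. Import path and generated names are assumptions.
package clusterauto

import (
	"context"

	"google.golang.org/grpc"

	apb "source.monogon.dev/metropolis/proto/api" // assumed import path
)

func fetchRegisterTicket(ctx context.Context, conn *grpc.ClientConn) ([]byte, error) {
	mgmt := apb.NewManagementClient(conn)
	res, err := mgmt.GetRegisterTicket(ctx, &apb.GetRegisterTicketRequest{})
	if err != nil {
		return nil, err
	}
	// The opaque ticket is then included in the Node Parameters supplied to
	// the node being provisioned.
	return res.Ticket, nil
}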

GetClusterInfo

GetClusterInfo returns summary information about the cluster, currently made up of the Cluster Directory containing node network addresses bundled with node public keys, which can be used to uniquely identify particular nodes.

GetNodes

GetNodes retrieves detailed information about cluster member nodes, such as their active roles and health. The call's output can be limited to nodes of interest with a CEL expression.

The filter expressions operate on Node protobuf messages. Here are a couple of examples, followed by a short usage sketch:

  • node.state == NODE_STATE_UP
  • has(node.roles.kubernetes_worker)
  • node.time_since_heartbeat < duration('6s')
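
Below is a hedged sketch of calling GetNodes from Go with one of the filter expressions above. The generated package path, the field names, and the assumption that GetNodes is a server-streaming call are illustrative; management.proto is authoritative.

// Hedged sketch: listing Kubernetes worker nodes via GetNodes with a CEL
// filter. Package path, field names, and the streaming call shape are
// illustrative assumptions; see management.proto for the real definitions.
package clusterauto

import (
	"context"
	"io"
	"log"

	"google.golang.org/grpc"

	apb "source.monogon.dev/metropolis/proto/api" // assumed import path
)

func listWorkers(ctx context.Context, conn *grpc.ClientConn) error {
	mgmt := apb.NewManagementClient(conn)
	stream, err := mgmt.GetNodes(ctx, &apb.GetNodesRequest{
		Filter: "has(node.roles.kubernetes_worker)",
	})
	if err != nil {
		return err
	}
	for {
		node, err := stream.Recv()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		log.Printf("kubernetes worker: %s", node.Id)
	}
}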

ApproveNode

ApproveNode is part of the Register Flow. It's called to admit new nodes into the cluster. For the call to succeed, the target node must have already registered into the cluster using a valid Register Ticket.

UpdateNodeRoles

UpdateNodeRoles updates a node's roles within the cluster. Currently, role separation is not implemented. However, in upcoming releases it will become possible to specialize nodes as cluster consensus members and/or Kubernetes worker nodes.

Node debug

The purpose of this service is to ease the development process of Metropolis. It's disabled in non-debug builds.

See also: debug.proto

Authorization

Even though the RPC authorization policy is currently permissive, meaning that cluster managers can access all available calls, this will change in favor of a fine-grained approach. See: authorization.proto.

Configuration datatypes

These datatypes, while not used as part of any remote procedure call, serve as a medium carrying cluster configuration.

See also: configuration.proto.

Node Parameters

Node Parameters are prepared for newly installed nodes as part of the installation flow. Depending on their contents, the node will either bootstrap a new Cluster, or join an existing one.

Identity and Authentication

Cluster Policy

Hardware Attestation

Production Deployment

Stability and Releases

Hardware Requirements

Organizational Requirements

The metroctl tool

Configuration and Deployment

API and Daily Management

Giving access to Users

Metrics and Monitoring

Troubleshooting and runbooks

Developing Metropolis

Checking out and Building

A Bazel crash course

Codebase Structure

Running Tests

Design Process