16 things you didn't know about Kube APIs and CRDs

2021-06-18

#15 will shock you!

If you're familiar with kubernetes, you're likely familiar with the kubernetes API and the concept of controllers: reconciliation loops that look at the data stored in the api and try to make the state of the cluster match that declared API state.

This is a pretty powerful model that has proven its value over time - but despite the seeming simplicity of the core ideas, there are plenty of details that may be surprising once you scratch the surface.

1. Not all kubernetes apis have controllers

The kubernetes control plane is normally divided into two sets of responsibilities: the API and controllers.

Most distributions of kubernetes run the core apis and core controllers in separate pods:

$ k -n kube-system get pods
NAME                                         READY   STATUS    RESTARTS   AGE
kube-apiserver-kind-control-plane            1/1     Running   0          23m
kube-controller-manager-kind-control-plane   1/1     Running   0          23m

The kube-apiserver is largely responsible for storage of api data into the backing store, which is typically etcd.

kube-controller-manager runs a series of reconciliation loops over the contents of those apis. Over time, those controllers ensure the desired state is reached (or not) and report status back via the api.

And typically this is how kubernetes APIs work: create desired state via the API, desired state is stored, controller works asynchronously to make the cluster state match the desired state.

For example: the deployment api allows you to create a deployment object that defines pods. The deployment controller (running in controller-manager) creates replicasets in response to the deployment object and updates the status.
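To make the flow concrete, here is a toy model of it in Python. Nothing here talks to a real cluster: the `reconcile` function, the dict shapes, and the pod naming are all made up for illustration, and the ReplicaSet layer that the real deployment controller goes through is collapsed away.

```python
# Toy model of "desired state is stored, a controller catches up later".
# All names and data shapes are hypothetical; no real cluster is involved.

def reconcile(deployment, pods):
    """Make the pod list match the desired replica count, then report status."""
    name = deployment["name"]
    want = deployment["spec"]["replicas"]
    owned = [p for p in pods if p["owner"] == name]
    # Create any missing pods.
    for i in range(len(owned), want):
        pods.append({"name": f"{name}-{i}", "owner": name})
    # Delete any surplus pods.
    for p in owned[want:]:
        pods.remove(p)
    # Report the observed state back on the object.
    deployment["status"] = {"replicas": sum(p["owner"] == name for p in pods)}


# The "API" stores the object as soon as it is created...
deployment = {"name": "web", "spec": {"replicas": 3}}
pods = []
status_before = deployment.get("status")  # no controller has run yet

# ...and the controller fills in the rest some time later.
reconcile(deployment, pods)
print(status_before, deployment["status"], [p["name"] for p in pods])
```

Until `reconcile` runs, the object exists but has no status and no pods, which is the same shape of behavior you see when controller-manager is stopped.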

This is especially apparent if you stop controller manager entirely and try to create an object like a deployment:

Without controller-manager running, the deployment object (which can still be created!) has no status, and no pods are created.

But there are actually a small set of apis that aren't directly managed by a controller in this way. SubjectAccessReview is one of them:

SubjectAccessReview responds with a status even without controller-manager.

There's another api that you're probably familiar with that is also available without controller-manager:

Pods work without controller-manager as well! In this case, the kubelet is the controller for the pods.

Most of this so far is probably not too surprising - but it's good background material for some of the other topics.

2. Group, Version, and Resource identify APIs.
Group, Version, and Kind identify objects.

Most APIs in kubernetes deal with objects.

Objects are associated with a group, version, kind, and resource:

  • The kind is the familiar name you see at the top of a kube manifest. kind: Deployment, kind: Role, etc.
  • Each object is accessible via an API endpoint. Every kind is associated with some resource, which is the name used to access the object via the API. For most objects, the resource is just a variant of the kind name (e.g. Kind: Pod is accessible via the resource type pods). Some APIs may expose the same Kind under different resource names, but it is rare.
  • Each kind has a set of versions, which may have different schemas.
  • Each kind is organized into a group.

The apiVersion is a combination of the group and version for a specific object.

Looking at a single object:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
- apiGroups: [""] 
  resources: ["pods"]
  verbs: ["get", "watch", "list"]

This indicates the group is rbac.authorization.k8s.io, the version is v1 and the kind is Role.
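As a small sketch of that split (the helper names are mine, not from any client library): the apiVersion is either GROUP/VERSION or, for the core group whose name is the empty string, just VERSION. That empty string is the same one you see in apiGroups: [""] in the Role above.

```python
def split_api_version(api_version):
    """Split an apiVersion string into (group, version).

    The core group has an empty name, so its apiVersion is only the
    version, e.g. "v1".
    """
    group, _, version = api_version.rpartition("/")
    return group, version


def group_version_kind(manifest):
    """Return (group, version, kind) for a manifest loaded as a dict."""
    group, version = split_api_version(manifest["apiVersion"])
    return group, version, manifest["kind"]


role = {"apiVersion": "rbac.authorization.k8s.io/v1", "kind": "Role"}
pod = {"apiVersion": "v1", "kind": "Pod"}

print(group_version_kind(role))  # ('rbac.authorization.k8s.io', 'v1', 'Role')
print(group_version_kind(pod))   # ('', 'v1', 'Pod')
```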

There's nothing in the object that tells you what the resource is - the kube apiserver maintains a mapping of resource types <-> kinds, as do kube clients like kubectl and client-go. This is called the "REST Mapping" in these client projects and related docs.

Once you know the resource, there are simple rules for building URLs, e.g.:

  • /apis/GROUP/VERSION/RESOURCETYPE/NAME to get a cluster-scoped object
  • /apis/GROUP/VERSION/namespaces/NAMESPACE/RESOURCETYPE/NAME to access a namespace-scoped object.

There are more details in the api-concepts documentation.
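Those URL rules are simple enough to sketch as a function. `resource_path` is a hypothetical helper, not something from kubectl or client-go. One detail the rules above don't mention, and which I'm adding here, is that the core group (empty group name) is served under /api/VERSION rather than /apis/GROUP/VERSION.

```python
def resource_path(group, version, resource, name=None, namespace=None):
    """Build the API path for a resource type or a single object."""
    # The core group (empty name) lives under /api, all others under /apis/GROUP.
    root = f"/apis/{group}/{version}" if group else f"/api/{version}"
    parts = [root]
    if namespace:
        parts += ["namespaces", namespace]
    parts.append(resource)
    if name:
        parts.append(name)
    return "/".join(parts)


# A namespace-scoped object, a cluster-scoped object, and a core-group list.
print(resource_path("rbac.authorization.k8s.io", "v1", "roles", "pod-reader", "default"))
print(resource_path("rbac.authorization.k8s.io", "v1", "clusterroles", "view"))
print(resource_path("", "v1", "pods", namespace="default"))
```

Note that the function takes the resource type (roles), not the kind (Role): getting from one to the other is exactly what the REST mapping is for.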

3. Objects are accessible at all api versions

Any object of a given Group and Kind can be accessed at all versions supported by the apiserver.

I can create an ingress object at v1beta1 and then get the same object at v1. Those versions have completely different schemas, but the API will convert it to the version I request. Or, I can create an ingress object at v1 and then get it at v1beta1.

If you don't request a specific version, kubectl will request the object at the api's preferredVersion, which may be different from what you created.

You can find out what the preferred version is with kubectl get --raw /apis/GROUP. For example, to find the supported and preferred versions for ingress:

$ kubectl get --raw /apis/networking.k8s.io
{
  "kind": "APIGroup",
  "apiVersion": "v1",
  "name": "networking.k8s.io",
  "versions": [
    {
      "groupVersion": "networking.k8s.io/v1",
      "version": "v1"
    },
    {
      "groupVersion": "networking.k8s.io/v1beta1",
      "version": "v1beta1"
    }
  ],
  "preferredVersion": {
    "groupVersion": "networking.k8s.io/v1",
    "version": "v1"
  }
}
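If you want to pull those two pieces of information out programmatically, a minimal sketch looks like this. It assumes you already have the raw JSON as a string (for example, captured from the command above); the function name is made up.

```python
import json


def versions_from_api_group(raw):
    """Return (served versions, preferred version) from an APIGroup document."""
    doc = json.loads(raw)
    served = [v["version"] for v in doc["versions"]]
    return served, doc["preferredVersion"]["version"]


# Trimmed copy of the response shown above.
raw = """
{
  "kind": "APIGroup",
  "name": "networking.k8s.io",
  "versions": [
    {"groupVersion": "networking.k8s.io/v1", "version": "v1"},
    {"groupVersion": "networking.k8s.io/v1beta1", "version": "v1beta1"}
  ],
  "preferredVersion": {"groupVersion": "networking.k8s.io/v1", "version": "v1"}
}
"""

served, preferred = versions_from_api_group(raw)
print(served, preferred)  # ['v1', 'v1beta1'] v1
```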

4. The version of the object stored in etcd may differ from the submitted version and the preferred version

For every group and resource, the kube-apiserver knows the:

  • served versions - the list of versions that are available in the api
  • decodable versions - the list of versions the apiserver knows how to decode from storage. This may be different from served versions.
  • storage or encodable version - the version the apiserver will convert to before storing in etcd
  • preferred version - the version kubectl will request if not otherwise specified.

If you submit an object to the apiserver and it doesn't match the storage version, the apiserver will convert it to the storage version before storing it in etcd. (To be more precise, it first converts it to an internal version before converting it to the storage version for etcd, but this is an implementation detail.)

There isn't a user-facing way to determine the storage version for an API at runtime (at least not yet).

The storage version is usually the preferred version, but this can be overridden.

For example, in kube 1.20, the preferred version of ingress is v1, but the storage version is v1beta1. This can be seen by looking at the data the apiserver stores in etcd directly:

5. Changing an object's stored version requires writing with a newer apiserver

Since the storage version is hardcoded in kube-apiserver, the only way to have an object's storage version updated is to update to a newer kube-apiserver with a different storage version configured for the object's API.

Once the apiserver is on a newer version with a new storage version, the object must be overwritten - it is not automatically converted (more on automatic conversion later).

This example:

  • Creates an ingress at v1 in kube 1.20 - which stores it at v1beta1
  • Updates the apiserver to 1.21
  • Checks that the data in etcd is still at v1beta1
  • Overwrites the ingress with the new apiserver to see it stored at v1
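The same four steps can be walked through with a toy in-memory model. `ToyAPIServer` is entirely hypothetical: it ignores conversion of the payload itself and only tracks which version each object was last written at.

```python
class ToyAPIServer:
    """In-memory stand-in for an apiserver plus etcd (hypothetical)."""

    def __init__(self, storage_version):
        self.storage_version = storage_version
        self.etcd = {}  # name -> (stored version, payload)

    def write(self, name, submitted_version, payload):
        # Whatever version the client submits, the object is persisted
        # at the server's current storage version.
        self.etcd[name] = (self.storage_version, payload)

    def stored_version(self, name):
        return self.etcd[name][0]

    def upgrade(self, new_storage_version):
        # An upgrade changes what future writes use; it does not touch etcd.
        self.storage_version = new_storage_version


server = ToyAPIServer(storage_version="v1beta1")  # like ingress in kube 1.20
server.write("my-ingress", "v1", {"host": "example.com"})
after_create = server.stored_version("my-ingress")

server.upgrade("v1")                              # like ingress in kube 1.21
after_upgrade = server.stored_version("my-ingress")

server.write("my-ingress", "v1", {"host": "example.com"})
after_rewrite = server.stored_version("my-ingress")

print(after_create, after_upgrade, after_rewrite)  # v1beta1 v1beta1 v1
```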

6. API versions may have ratcheting validation (but are still round-trippable)

The apiserver may choose to ratchet validation between versions - something that may have been valid under an old api version may not be valid under the new api version.

Alone, this is probably not surprising. What might be surprising is the way ratcheting validation can interact with version round-tripping: just because you can get an object from the apiserver doesn't mean that you can create that exact same object.

For example, the CustomResourceDefinition API employs ratcheting validation between v1beta1 and v1 to enforce structural schemas under v1.

This means that:

  • You can create a v1beta1 CRD with a non-structural schema
  • You can get that CRD from the v1 CRD api (versions are round-trippable)
  • If you delete the CRD, and recreate it with the v1 CRD that you just got from the API, it will fail to create
  • You can update that CRD at v1 after it's been created at v1beta1 (the validation is ratcheting)
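Here is a rough sketch of what "ratcheting" means in code. This is not the apiserver's implementation: `is_structural` is a stand-in check (real structural schemas have more rules than "every node declares a type"), and `validate` is a hypothetical function that only models the create-versus-update distinction.

```python
def is_structural(schema):
    """Stand-in check: require every schema node to declare a "type"."""
    if "type" not in schema:
        return False
    return all(is_structural(s) for s in schema.get("properties", {}).values())


def validate(new, old=None, version="v1"):
    """Return True if the write is accepted. old is None on create."""
    if version == "v1beta1":
        return True  # the older version does not enforce the rule
    if is_structural(new["schema"]):
        return True
    # Ratchet: tolerate a violation that already exists, never a new one.
    return old is not None and not is_structural(old["schema"])


legacy = {"schema": {"properties": {"spec": {}}}}
good = {"schema": {"type": "object", "properties": {"spec": {"type": "object"}}}}

print(validate(legacy, version="v1beta1"))          # create at v1beta1
print(validate(legacy, version="v1"))               # re-create at v1
print(validate(legacy, old=legacy, version="v1"))   # update at v1
print(validate(legacy, old=good, version="v1"))     # regress a valid object
```

The first and third calls are accepted and the second and fourth are rejected, which matches the list above: the object can be read and updated at v1, but not created from scratch there.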

7. The StorageVersion API exists

In 1.20+, there is a StorageVersion API that can provide information about objects stored in etcd, but it must be explicitly enabled.

This is the config needed to enable the API in kind:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  StorageVersionAPI: true
  APIServerIdentity: true
runtimeConfig:
  "internal.apiserver.k8s.io/v1alpha1": "true"

Once enabled, for each group/resource, the API provides:

  • the current set of versions that can be decoded
  • the version that is used for encoding - i.e. the storage version, used to encode the object to store in etcd.

The API provides this information for every instance of the apiserver.

This API is not currently user-facing (hence the internal prefix). It is used to make decisions around storage versions during an apiserver upgrade, where there may be multiple apiservers running at multiple versions. The enhancement has more detail, see also the API documentation.

Here's an example for the deployment API:

apiVersion: internal.apiserver.k8s.io/v1alpha1
kind: StorageVersion
metadata:
  creationTimestamp: "2021-06-04T17:16:57Z"
  name: apps.deployments
  resourceVersion: "52"
  uid: 0b80b0f3-72b0-4af6-adfd-e93eb4b4c29f
spec: {}
status:
  commonEncodingVersion: apps/v1
  conditions:
  - lastTransitionTime: "2021-06-04T17:16:57Z"
    message: Common encoding version set
    reason: CommonEncodingVersionSet
    status: "True"
    type: AllEncodingVersionsEqual
  storageVersions:
  - apiServerID: kube-apiserver-803c62b1-340f-4055-93ca-44aba8a35574
    decodableVersions:
    - apps/v1
    - apps/v1beta2
    - apps/v1beta1
    encodingVersion: apps/v1

In this case, since there is only one apiserver, we can be confident that the deployment will be stored at apps/v1.
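With more than one apiserver you would have to compare the entries. A minimal sketch of that check, assuming the object has been loaded into a dict with the same shape as the YAML above (the function name is hypothetical, and the apiServerID values below are shortened placeholders):

```python
def common_encoding_version(storage_version):
    """Return the encoding version all apiservers agree on, or None."""
    entries = storage_version["status"]["storageVersions"]
    encodings = {entry["encodingVersion"] for entry in entries}
    return encodings.pop() if len(encodings) == 1 else None


single = {"status": {"storageVersions": [
    {"apiServerID": "a", "encodingVersion": "apps/v1"},
]}}
# Mid-upgrade: two apiservers that would write different versions.
mixed = {"status": {"storageVersions": [
    {"apiServerID": "a", "encodingVersion": "apps/v1"},
    {"apiServerID": "b", "encodingVersion": "apps/v1beta2"},
]}}

print(common_encoding_version(single))  # apps/v1
print(common_encoding_version(mixed))   # None
```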

8. Install the kube-storage-version-migrator to migrate storage versions automatically

In a previous section, we saw that an object needs to be rewritten with a new apiserver in order for the storage version to change.

By default in kubernetes, nothing performs this operation. The kube-storage-version-migrator is an optional component that will automate the get-and-put workflow over all objects in the cluster.
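The get-and-put workflow is easy to picture with a toy store. This is a sketch of the idea only, not of how kube-storage-version-migrator is built: the `migrate` function, the dict standing in for etcd, and the `rename_field` conversion are all hypothetical.

```python
def migrate(etcd, storage_version, convert):
    """Rewrite every object so it ends up stored at storage_version.

    etcd:    dict of name -> (stored version, payload)
    convert: callable (payload, from_version, to_version) -> payload
    Returns the names whose stored version changed.
    """
    changed = []
    for name, (version, payload) in list(etcd.items()):
        # "get" decodes the object from whatever version it is stored at;
        # "put" writes it back, re-encoded at the current storage version.
        etcd[name] = (storage_version, convert(payload, version, storage_version))
        if version != storage_version:
            changed.append(name)
    return changed


def rename_field(payload, from_version, to_version):
    # Hypothetical conversion: v1beta1 calls a field "serviceName", v1 calls it "service".
    if from_version == "v1beta1" and to_version == "v1":
        payload = dict(payload)
        payload["service"] = payload.pop("serviceName")
    return payload


etcd = {
    "old": ("v1beta1", {"serviceName": "web"}),
    "new": ("v1", {"service": "web"}),
}
changed = migrate(etcd, "v1", rename_field)
print(changed, etcd["old"])
```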

The kube-storage-version-migrator is enabled by default in OpenShift, but it does not run automatically (it must be triggered manually).

9. CRDs define new APIs (not just objects)

When you create a CRD to define a new object in the api, you are defining all of the same things that the apiserver defines for core apis:

  • the group, kind, and resourcetype for a new type of object
  • the served versions
  • the storage version
  • the decodable versions
  • schemas for all decodable versions
  • whether the objects are namespace or cluster scoped
  • new url endpoints in the API

The new APIs that are generated will follow the kubernetes api conventions: that means that they are roundtrippable and there can only be a single storage version.

10. All APIs are cluster scoped, even scope: Namespaced CRDs

CRDs have a scope field.

This determines the scope of objects that you can create; it does not scope the availability of the api itself.

If scope is set to Cluster, then the API routes look like:

/apis/GROUP/VERSION/RESOURCETYPE/NAME

If instead, scope is set to Namespaced, API routes look like:

/apis/GROUP/VERSION/namespaces/NAMESPACE/RESOURCETYPE/NAME

And namespaced resources across all namespaces can be queried with:

/apis/GROUP/VERSION/RESOURCETYPE

In both the Cluster and Namespaced scoped APIs, the Group and Version have been claimed for the entire cluster. There is no notion of an API that is only available in a single namespace.

11. A CRD's storage version determines how new objects are stored.

Just like kube-apiserver picks the version to use when storing in etcd, as a CRD developer you must pick the version to store in etcd as well.

The storage: true flag on a CRD version indicates how objects will be stored going forward. It does not affect existing stored objects.
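As a quick sketch of reading that flag back out of a CRD, assuming the CRD has been loaded into a dict and that the flag is spelled storage: true as in the apiextensions.k8s.io/v1 schema. The function name is made up; the rule it checks, that exactly one version carries the flag, is what the CRD API requires.

```python
def storage_version(crd):
    """Return the name of the version flagged storage: true in a CRD dict."""
    flagged = [v["name"] for v in crd["spec"]["versions"] if v.get("storage")]
    if len(flagged) != 1:
        raise ValueError(f"expected exactly one storage version, found {flagged}")
    return flagged[0]


crd = {"spec": {"versions": [
    {"name": "v1beta1", "served": True, "storage": False},
    {"name": "v1", "served": True, "storage": True},
]}}
print(storage_version(crd))  # v1
```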

12. CRD's storedVersions lists every version that has been a stored version (not what is actually in etcd)

The status block of a crd has a storedVersions field:

status:
  acceptedNames:
    kind: CronTab
    listKind: CronTabList
    plural: crontabs
    shortNames:
    - ct
    singular: crontab
  conditions:
  - lastTransitionTime: "2021-06-16T14:47:48Z"
    message: no conflicts found
    reason: NoConflicts
    status: "True"
    type: NamesAccepted
  - lastTransitionTime: "2021-06-16T14:47:48Z"
    message: the initial names have been accepted
    reason: InitialNamesAccepted
    status: "True"
    type: Established
  storedVersions:
  - v1beta1
  - v1

This field indicates every version that has been a stored version (i.e. has had storage: true set in the spec) and has no relationship to what stored versions exist in etcd.
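A toy model of that bookkeeping, with hypothetical names throughout: the list only ever grows when the storage version changes, and rewriting the objects in etcd does nothing to it.

```python
def set_storage_version(crd_status, new_version):
    """Record a storage-version change: append if unseen, never remove."""
    if new_version not in crd_status["storedVersions"]:
        crd_status["storedVersions"].append(new_version)


status = {"storedVersions": []}
etcd_versions = set()  # versions actually present in storage (toy stand-in)

set_storage_version(status, "v1beta1")
etcd_versions.add("v1beta1")      # an object is written at v1beta1

set_storage_version(status, "v1")
etcd_versions = {"v1"}            # every object has since been rewritten at v1

print(status["storedVersions"], etcd_versions)  # ['v1beta1', 'v1'] {'v1'}
```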

13. You can't remove a version from a CRD until it has been removed from the status

.status.storedVersions, being a record of what versions have previously been set as the storage version for the api, indicates that there might still be data in etcd stored under those versions. You don't want to remove a decodable version until there is nothing left in storage that might need to be decoded.

For this reason, it's not possible to entirely remove a version from a CRD if that version is listed as a storedVersion.

Note that served can be set to false for any version - a stored version that is no longer served can still be fetched under newer apiversions.
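The rule can be sketched as a check over a CRD loaded as a dict. `removable_versions` is a hypothetical helper, and it only models the storedVersions constraint described here, not any other validation the apiserver performs.

```python
def removable_versions(crd):
    """List versions in the spec that are absent from status.storedVersions."""
    stored = set(crd["status"]["storedVersions"])
    return [v["name"] for v in crd["spec"]["versions"] if v["name"] not in stored]


crd = {
    "spec": {"versions": [
        {"name": "v1alpha1", "served": False, "storage": False},
        {"name": "v1beta1", "served": False, "storage": False},
        {"name": "v1", "served": True, "storage": True},
    ]},
    "status": {"storedVersions": ["v1beta1", "v1"]},
}
# v1beta1 is no longer served, but it is still listed as stored, so it must stay.
print(removable_versions(crd))  # ['v1alpha1']
```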

14. Stored versions must be manually removed from the CRD's status

Just like non-CRD-defined kube apis, objects need to be updated to the new storage version via a write. The kube-storage-version-migrator can automate this for CRs as well.

However, once that migration is complete, it's a manual process to remove unused storage versions from a CRD's .status.storedVersions.

kubectl has no direct support for editing status. In this example, we remove the version with curl.
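As a rough sketch of what such a request carries, the snippet below builds the path and body for a JSON merge patch against the CRD's status subresource, without sending anything. The helper name and the crontabs.stable.example.com CRD name are hypothetical. I'm assuming the request is sent as a PATCH with Content-Type: application/merge-patch+json, in which case the list in the body replaces the existing list wholesale.

```python
import json


def stored_versions_patch(crd_name, stored_versions, drop):
    """Return (path, body) for a merge patch that drops one stored version."""
    path = (
        "/apis/apiextensions.k8s.io/v1/customresourcedefinitions/"
        f"{crd_name}/status"
    )
    remaining = [v for v in stored_versions if v != drop]
    # A merge patch replaces lists entirely, so send the full remaining list.
    body = json.dumps({"status": {"storedVersions": remaining}})
    return path, body


path, body = stored_versions_patch(
    "crontabs.stable.example.com", ["v1beta1", "v1"], drop="v1beta1"
)
print(path)
print(body)
```

Only do this once you're sure nothing in etcd is still stored at the version being dropped; the apiserver takes your word for it.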

15. It's not safe to tighten a schema between versions

Some kube apis may have ratcheting validation. But generally this validation tightening does not occur in the API schema - tightening validation can cause clients to have incorrect assumptions about data.

This is a scenario that can happen:

  • a field in v1 of an API has a tighter schema than v1beta1
  • the storage version is v1 for the API
  • an object is created at v1beta1 that doesn't comply with the tightened v1 schema
  • the object is accepted, because it was created at v1beta1, and the storage version is v1, so it is stored as a v1 object
  • the object in etcd is an "invalid" v1 object.

Similar situations arise if you update the schema of a single version to be tighter, after objects have already been created.

Any client that uses the schema to form expectations about api responses may not do the right thing if validation is tightened between versions.
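The scenario is easy to reproduce in a toy model. Everything below is hypothetical (the `replicas` field, the limit of 10, the dict standing in for etcd); it assumes, as in the scenario above, that writes are validated against the submitted version's schema and reads are not validated at all.

```python
# Per-version validators for a made-up "replicas" field: v1 is tighter.
SCHEMAS = {
    "v1beta1": lambda obj: isinstance(obj["replicas"], int),
    "v1": lambda obj: isinstance(obj["replicas"], int) and obj["replicas"] <= 10,
}
STORAGE_VERSION = "v1"
etcd = {}  # name -> (stored version, payload)


def create(name, version, obj):
    """Validate against the submitted version, then store at the storage version."""
    if not SCHEMAS[version](obj):
        raise ValueError(f"{name} is invalid at {version}")
    etcd[name] = (STORAGE_VERSION, obj)


def get(name):
    """Reads are served from storage without re-validating."""
    return etcd[name][1]


create("big", "v1beta1", {"replicas": 50})    # accepted by the looser schema
stored_version, _ = etcd["big"]
valid_as_v1 = SCHEMAS["v1"](get("big"))
print(stored_version, valid_as_v1)            # v1 False

try:
    create("big-again", "v1", {"replicas": 50})
    rejected_at_v1 = False
except ValueError:
    rejected_at_v1 = True
print(rejected_at_v1)                         # True
```

A client that trusts the v1 schema would never expect to read back the object that `get("big")` returns.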

16. kube-storage-version-migrator will fail for tightened schemas

kube-storage-version-migrator does a get/update for each object. If the schema has been tightened (or ratcheting validation has not been implemented to only apply to create), then it will fail.

Appendix: kind, yvim, auger, etcdctl

In an attempt to be succinct, the demos make use of several tools.

kind is used to spin up clusters easily locally.

To inspect the contents of etcd with a kind cluster, first configure it to expose the etcd port:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraPortMappings:
  - containerPort: 2379
    hostPort: 2379

Then configure etcdctl to be able to talk to the kind etcd:

docker cp kind-control-plane:/etc/kubernetes/pki/etcd/ca.crt ca.crt
docker cp kind-control-plane:/etc/kubernetes/pki/etcd/peer.crt peer.crt
docker cp kind-control-plane:/etc/kubernetes/pki/etcd/peer.key peer.key
export ETCDCTL_CACERT=./ca.crt 
export ETCDCTL_CERT=./peer.crt 
export ETCDCTL_KEY=./peer.key 

# confirm it works
etcdctl get /registry  --prefix=true

I occasionally use yvim, which is just an alias to open vim with the assumption that the file is yaml (useful for piping from kubectl):

alias yvim='nvim -c "doautocmd Filetype yaml" -R -'

I also sometimes use auger to decode the protobuf-encoded objects stored in etcd. Auger is not usable out of the box; it needs to be built with references to a specific version of kube so that it has the correct object definitions.

Once it's built correctly, you can pipe directly from etcdctl:

etcdctl get /registry/ingress/default/name-virtual-host-ingress --print-value-only | auger decode