#15 will shock you!
If you're familiar with kubernetes, you're likely familiar with the kubernetes API and the concept of controllers: reconciliation loops that look at the data stored in the api and try to make the state of the cluster match that declared API state.
This is a pretty powerful model that has proven its value over time - but despite the seeming simplicity of the core ideas, there are plenty of details that may be surprising once you scratch the surface.
1. Not all kubernetes apis have controllers
The kubernetes control plane is normally divided into two sets of responsibilities: the API and controllers.
Most distributions of kubernetes run the core apis and core controllers in separate pods:
$ k -n kube-system get pods
NAME READY STATUS RESTARTS AGE
kube-apiserver-kind-control-plane 1/1 Running 0 23m
kube-controller-manager-kind-control-plane 1/1 Running 0 23m
kube-apiserver is largely responsible for storage of api data into the backing store, which is typically etcd.
kube-controller-manager runs a series of reconciliation loops over the contents of those apis. Over time, those controllers ensure the desired state is reached (or not) and report back status via the api.
And typically this is how kubernetes APIs work: create desired state via the API, desired state is stored, controller works asynchronously to make the cluster state match the desired state.
For example: the deployment api allows you to create a deployment object that defines pods. The deployment controller (running in controller-manager) creates replicasets in response to the deployment object and updates the status.
This is especially apparent if you stop controller manager entirely and try to create an object like a deployment:
Without controller-manager running, the deployment object (which can still be created!) has no status, and no pods are created.
But there is actually a small set of apis that aren't directly managed by a controller in this way.
SubjectAccessReview is one of them:
SubjectAccessReview responds with a status even without controller-manager.
There's another api that you're probably familiar with that is also available without controller-manager:
Pods work without controller-manager as well! In this case, the kubelet is the controller for the pods.
Most of this so far is probably not too surprising - but it's good background material for some of the other topics.
2. Group, Version, and Resource identify APIs. Group, Version, and Kind identify objects.
Most APIs in kubernetes deal with objects.
Objects are associated with a group, version, kind, and resource:
- The kind is the familiar name you see at the top of a kube manifest: kind: Role, etc.
- Each object is accessible via an API endpoint. Every kind is associated with some resource, which is the name used to access the object via the API. For most objects, the resource is just a variant of the kind name (e.g. Kind: Pod is accessible via the resource type pods). Some APIs may expose the same Kind under different resource names, but it is rare.
- Each kind has a set of versions, which may have different schemas.
- Each kind is organized into a group.
The apiVersion is a combination of the group and version for a specific object.
Looking at a single object:
- apiGroups: [""]
verbs: ["get", "watch", "list"]
This indicates the group is rbac.authorization.k8s.io and the version is v1; the kind comes from the manifest's kind field.
There's nothing in the object that tells you what the resource is - the kube apiserver maintains a mapping of resource types <-> kinds, as do kube clients like client-go. This is called the "REST Mapping" in these client projects and related docs.
Once you know the resource, there are simple rules for building URLs, e.g.:
- /apis/GROUP/VERSION/RESOURCETYPE/NAME to get a cluster-scoped object
- /apis/GROUP/VERSION/namespaces/NAMESPACE/RESOURCETYPE/NAME to access a namespace-scoped object
There are more details in the api-concepts documentation.
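As a rough illustration of the URL rules above, here is a minimal Python sketch. The function names and the pod-reader object name are made up for the example; the one rule added beyond the text is that the legacy core group (apiVersion: v1, empty group name) is served under /api instead of /apis/GROUP.

```python
def split_api_version(api_version):
    """Split a manifest's apiVersion into (group, version)."""
    # "rbac.authorization.k8s.io/v1" -> ("rbac.authorization.k8s.io", "v1")
    # "v1" (core group)              -> ("", "v1")
    group, _, version = api_version.rpartition("/")
    return group, version


def object_url(group, version, resource, name, namespace=None):
    """Build the REST path for one object from group, version, and resource."""
    # The core group has an empty name and lives under /api/VERSION.
    prefix = f"/api/{version}" if group == "" else f"/apis/{group}/{version}"
    if namespace is not None:
        return f"{prefix}/namespaces/{namespace}/{resource}/{name}"
    return f"{prefix}/{resource}/{name}"


group, version = split_api_version("rbac.authorization.k8s.io/v1")
print(object_url(group, version, "roles", "pod-reader", namespace="default"))
```

Note that the caller still has to supply the resource name (roles); that is exactly the piece the REST mapping provides.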
3. Objects are accessible at all api versions
Any object of a given Group and Kind can be accessed at all versions supported by the apiserver.
I can create an ingress object at v1beta1 and then get the same object at v1. Those versions have completely different schemas, but the API will convert it to the version I request. Or, I can create an ingress object at v1 and then get it at v1beta1.
If you don't request a specific version, kubectl will request the object at the api's preferredVersion, which may be different from what you created.
You can find out what the preferred version is with kubectl get --raw /apis/GROUP. For example, to find the supported and preferred versions for ingress:
$ kubectl get --raw /apis/networking.k8s.io
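The response to that request is an APIGroup document. The sketch below shows one way to pull the served and preferred versions out of it in Python. The sample body is illustrative and trimmed to the fields used here (it is not output captured from a real cluster), and summarize_api_group is a name invented for this example.

```python
import json

# Illustrative APIGroup body, trimmed to the fields this sketch reads.
SAMPLE_API_GROUP = """
{
  "kind": "APIGroup",
  "apiVersion": "v1",
  "name": "networking.k8s.io",
  "versions": [
    {"groupVersion": "networking.k8s.io/v1", "version": "v1"},
    {"groupVersion": "networking.k8s.io/v1beta1", "version": "v1beta1"}
  ],
  "preferredVersion": {"groupVersion": "networking.k8s.io/v1", "version": "v1"}
}
"""


def summarize_api_group(raw):
    """Return (served versions, preferred version) from an APIGroup JSON body."""
    group = json.loads(raw)
    served = [entry["version"] for entry in group["versions"]]
    preferred = group["preferredVersion"]["version"]
    return served, preferred


served, preferred = summarize_api_group(SAMPLE_API_GROUP)
print("served:", served, "preferred:", preferred)
```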
4. The version of the object stored in etcd may differ from the submitted version and the preferred version
For every group and resource, the kube-apiserver knows the:
- served versions - the list of versions that are available in the api
- decodeable versions - the list of versions the apiserver knows how to decode from storage. This may be different from served versions.
- storage or encodeable version - the version the apiserver will convert to before storing in etcd
- preferred version - kubectl will request with this version if not otherwise specified.
If you submit an object to the apiserver and it doesn't match the storage version, the apiserver will convert it to the storage version before storing it in etcd. (To be more precise, it first converts it to an internal version, before converting it to the storage version for etcd, but this is an implementation detail.)
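To make the convert-via-internal idea concrete, here is a toy Python model. It is only a sketch: ToyStore and the conversion helpers are invented for illustration, it handles nothing but an ingress backend with a numeric port, and the real apiserver conversion machinery is far more involved.

```python
def to_internal(version, backend):
    """Convert a versioned ingress backend into a version-neutral internal form."""
    if version == "v1beta1":
        return {"service_name": backend["serviceName"],
                "service_port": backend["servicePort"]}
    if version == "v1":
        service = backend["service"]
        return {"service_name": service["name"],
                "service_port": service["port"]["number"]}
    raise ValueError(f"unknown version: {version}")


def from_internal(version, internal):
    """Convert the internal form back out to a specific version's schema."""
    if version == "v1beta1":
        return {"serviceName": internal["service_name"],
                "servicePort": internal["service_port"]}
    if version == "v1":
        return {"service": {"name": internal["service_name"],
                            "port": {"number": internal["service_port"]}}}
    raise ValueError(f"unknown version: {version}")


def convert(backend, from_version, to_version):
    # Every conversion goes through the internal form, never version to version.
    return from_internal(to_version, to_internal(from_version, backend))


class ToyStore:
    """Stands in for apiserver + etcd: always persists at one storage version."""

    def __init__(self, storage_version):
        self.storage_version = storage_version
        self.data = {}  # name -> (stored version, stored backend)

    def put(self, name, version, backend):
        stored = convert(backend, version, self.storage_version)
        self.data[name] = (self.storage_version, stored)

    def get(self, name, version):
        stored_version, backend = self.data[name]
        return convert(backend, stored_version, version)


store = ToyStore(storage_version="v1beta1")
store.put("web", "v1", {"service": {"name": "web", "port": {"number": 80}}})
print(store.data["web"])       # what is persisted
print(store.get("web", "v1"))  # what a v1 client reads back
```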
There isn't a user-facing way to determine the storage version for an API at runtime (at least not yet).
The storage version is usually the preferred version, but this can be overridden.
For example, in kube 1.20, the preferred version of ingress is v1, but the storage version is v1beta1. This can be seen by looking at the data the apiserver stores in etcd directly:
5. Changing an object's stored version requires writing with a newer apiserver
Since the storage version is hardcoded in kube-apiserver, the only way to have an object's storage version updated is to update to a newer kube-apiserver with a different storage version configured for the object's API.
Once the apiserver is on a newer version with a new storage version, the object must be overwritten - it is not automatically converted (more on automatic conversion later).
- Creates an ingress at v1 in kube 1.20 - which stores it at v1beta1
- Updates the apiserver to 1.21
- Checks that the data in etcd is still at v1beta1
- Overwrites the ingress with the new apiserver to see it stored at v1
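The same sequence can be mimicked with a very small Python model. This is a sketch only: ToyApiServer is an invented stand-in, a plain dict plays the role of etcd, and it tracks nothing except which version each object was last written at.

```python
class ToyApiServer:
    """Writes objects to a shared dict, tagging each with this server's storage version."""

    def __init__(self, etcd, storage_version):
        self.etcd = etcd
        self.storage_version = storage_version

    def write(self, name):
        # Every create or update re-encodes the object at the storage version.
        self.etcd[name] = self.storage_version


etcd = {}

# Old apiserver: storage version is v1beta1.
old_server = ToyApiServer(etcd, storage_version="v1beta1")
old_server.write("my-ingress")

# "Upgrade": a new apiserver with a different storage version, same etcd.
new_server = ToyApiServer(etcd, storage_version="v1")
before_rewrite = etcd["my-ingress"]  # nothing has rewritten the object yet

# Only an explicit write through the new apiserver changes the stored version.
new_server.write("my-ingress")
after_rewrite = etcd["my-ingress"]

print(before_rewrite, "->", after_rewrite)
```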
6. API versions may have ratcheting validation (but are still round-trippable)
The apiserver may choose to ratchet validation between versions - something that may have been valid under an old api version may not be valid under the new api version.
Alone, this is probably not surprising. What might be surprising is the way ratcheting validation can interact with version round-tripping: just because you can get an object from the apiserver doesn't mean that you can create that exact same object.
For example, the CustomResourceDefinition API employs ratcheting validation between v1beta1 and v1 to enforce structural schemas under v1.
This means that:
- You can create a v1beta1 CRD with a non-structural schema
- You can get that CRD from the v1 CRD api (versions are round-trippable)
- If you delete the CRD, and recreate it with the v1 CRD that you just got from the API, it will fail to create
- You can update that CRD at v1 after it's been created at v1beta1 (the validation is ratcheting)
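The create-versus-update asymmetry in that list can be sketched in a few lines of Python. This is a toy model, not the apiserver's actual validation code: is_structural is a deliberately simplistic stand-in for the real structural-schema rules, and the function names are invented.

```python
def is_structural(schema):
    # Stand-in check: the real structural-schema rules are much more involved.
    return schema.get("type") == "object"


def validate_create(new_schema):
    """On create, the tightened rule always applies."""
    return is_structural(new_schema)


def validate_update(old_schema, new_schema):
    """On update, only reject a violation that the stored object did not already have."""
    return is_structural(new_schema) or not is_structural(old_schema)


non_structural = {}                # as originally created under the looser version
structural = {"type": "object"}

print(validate_create(non_structural))                  # recreate from a get
print(validate_update(non_structural, non_structural))  # update of a grandfathered object
print(validate_update(structural, non_structural))      # regressing a valid object
```

The ratchet only turns one way: an object that was already invalid can keep being updated, but a valid object cannot be made invalid.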
7. The StorageVersion API exists
In 1.20+, there is a StorageVersion API that can provide information about objects stored in etcd, but it must be explicitly enabled.
This is the config needed to enable the API in kind:
Once enabled, for each group/resource, the API provides:
- the current set of versions that can be decoded
- the version that is used for encoding - i.e. the storage version, used to encode the object to store in etcd.
The API provides this information for every instance of the apiserver.
This API is not currently user-facing (hence the internal prefix). It is used to make decisions around storage versions during an apiserver upgrade, where there may be multiple apiservers running at multiple versions. The enhancement has more detail, see also the API documentation.
Here's an example for the deployment API:
- lastTransitionTime: "2021-06-04T17:16:57Z"
message: Common encoding version set
- apiServerID: kube-apiserver-803c62b1-340f-4055-93ca-44aba8a35574
In this case, since there is only one apiserver, we can be confident that the deployment will be stored at
8. Install the kube-storage-version-migrator to migrate storage versions automatically
In a previous section, we saw that an object needs to be rewritten with a new apiserver in order for the storage version to change.
By default in kubernetes, nothing performs this operation. The kube-storage-version-migrator is an optional component that will automate the get-and-put workflow over all objects in the cluster.
The kube-storage-version-migrator is enabled by default in OpenShift, but it does not run automatically (it must be triggered manually).
9. CRDs define new APIs (not just objects)
When you create a CRD to define a new object in the api, you are defining all of the same things that the apiserver defines for core apis:
- the group, kind, and resourcetype for a new type of object
- the served versions
- the storage version
- the decodeable versions
- schemas for all decodable versions
- whether the objects are namespace or cluster scoped
- new url endpoints in the API
The new APIs that are generated will follow the kubernetes api conventions: that means that they are roundtrippable and there can only be a single storage version.
10. All APIs are cluster scoped, even scope: Namespaced CRDs
CRDs have a scope field. This determines the scope of objects that you can create; it does not scope the availability of the api itself.
If scope is set to Cluster, then the API routes look like:
If scope is set to Namespaced, API routes look like:
And namespaced resources across all namespaces can be queried with:
In both the Cluster and Namespaced scoped APIs, the Group and Version have been claimed for the entire cluster. There is no notion of an API that is only available in a single namespace.
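As a hedged illustration of the route shapes for the two scopes, the sketch below builds them for a hypothetical CRD (group example.com, version v1, plural widgets; none of these come from a real cluster). NAME and NAMESPACE are left as placeholders, and crd_routes is a name invented for this example.

```python
def crd_routes(group, version, plural, scope):
    """Return the route shapes served for a CRD of the given scope."""
    base = f"/apis/{group}/{version}"
    if scope == "Cluster":
        return {
            "list": f"{base}/{plural}",
            "object": f"{base}/{plural}/NAME",
        }
    if scope == "Namespaced":
        return {
            # Listing across all namespaces uses the same path a cluster-scoped list would.
            "list_all_namespaces": f"{base}/{plural}",
            "list": f"{base}/namespaces/NAMESPACE/{plural}",
            "object": f"{base}/namespaces/NAMESPACE/{plural}/NAME",
        }
    raise ValueError(f"unknown scope: {scope}")


cluster = crd_routes("example.com", "v1", "widgets", "Cluster")
namespaced = crd_routes("example.com", "v1", "widgets", "Namespaced")
for route in list(cluster.values()) + list(namespaced.values()):
    print(route)
```

Every route starts with the same /apis/example.com/v1 prefix regardless of scope, which is the sense in which the API itself is cluster-wide.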
11. A CRD's stored version determines how new objects are stored.
Just like kube-apiserver picks the version to use when storing in etcd, as a CRD developer you must pick the version to store in etcd as well.
The storage: true flag on a CRD version indicates how objects will be stored going forward. It does not affect existing stored objects.
12. A CRD's storedVersions lists every version that has been a stored version (not what is actually in etcd)
The status block of a CRD has a storedVersions field:
- lastTransitionTime: "2021-06-16T14:47:48Z"
message: no conflicts found
- lastTransitionTime: "2021-06-16T14:47:48Z"
message: the initial names have been accepted
This field indicates every version that has been a stored version (i.e. has had storage: true set in the spec) and has no relationship to what stored versions exist in etcd.
13. You can't remove a version from a CRD until it has been removed from the status
.status.storedVersions, being a record of what versions have previously been set as the stored version for the api, indicates that there might still be data in etcd stored under those versions. You don't want to remove a decodeable version until there is nothing left in storage that might need to be decoded.
For this reason, it's not possible to entirely remove a version from a CRD if that version is listed as a stored version in the status.
served can be set to false for any version - a stored version that is no longer served can still be fetched under newer api versions.
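The rule can be expressed as a small check over a CRD's spec and status. This is a sketch of the rule as described above, not the apiserver's validation code; the widgets CRD below is hypothetical and trimmed to the fields the check reads.

```python
def removable_versions(crd):
    """Versions in spec that are not recorded in status.storedVersions."""
    stored = set(crd.get("status", {}).get("storedVersions", []))
    return [v["name"] for v in crd["spec"]["versions"] if v["name"] not in stored]


# Hypothetical CRD, trimmed to the relevant fields.
crd = {
    "spec": {
        "versions": [
            {"name": "v1beta1", "served": False, "storage": False},
            {"name": "v1", "served": True, "storage": True},
        ]
    },
    "status": {"storedVersions": ["v1beta1", "v1"]},
}

# v1beta1 is no longer served or used for storage, but it is still recorded.
print(removable_versions(crd))

# Once v1beta1 has been dropped from the status, it becomes removable.
crd["status"]["storedVersions"] = ["v1"]
print(removable_versions(crd))
```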
14. Stored versions must be manually removed from the CRD's status
Just like non-CRD-defined kube apis, objects need to be updated to the new storage version via a write. The kube-storage-version-migrator can automate this for CRs as well.
However, once that migration is complete, it's a manual process to remove unused storage versions from a CRD's status.
kubectl has no direct support for editing status. In this example, we remove the version with
15. It's not safe to tighten a schema between versions
Some kube apis may have ratcheting validation. But generally this validation tightening does not occur in the API schema - tightening validation can cause clients to have incorrect assumptions about data.
This is a scenario that can happen:
- a field in v1 of an API has a tighter schema than in v1beta1 of the same API
- an object is created at v1beta1 that doesn't comply with the tightened v1 schema
- the object is accepted, because it was created at v1beta1, and the storage version is v1, so it is stored as a v1 object
- the object in etcd is an "invalid" v1 object
Similar situations arise if you update the schema of a single version to be tighter, after objects have already been created.
Any client that is using the schema to have expectations about api responses may not do the right thing if validation is tightened between versions.
kube-storage-version-migrator will fail for tightened schemas
kube-storage-version-migrator does a get and update for each object. If the schema has been tightened (or ratcheting validation has not been implemented to only apply to create), then it will fail.
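A toy version of that failure mode, in Python: the migrate function below imitates a get-and-update pass, and the update re-runs validation against the current (tightened) schema. Everything here is an assumption made for illustration, including the size field, its limit of 10, and the object names; it is not how the real migrator is implemented.

```python
def tightened_schema(obj):
    # Hypothetical tightened rule: size must now be at most 10.
    return obj["size"] <= 10


def migrate(objects, validate):
    """Get each object and write it back unchanged; the write re-runs validation."""
    migrated, failed = [], []
    for name, obj in objects.items():
        if validate(obj):
            migrated.append(name)
        else:
            # The no-op update is rejected, so this object keeps its old stored version.
            failed.append(name)
    return migrated, failed


# Both objects were accepted before the schema was tightened.
stored = {"small": {"size": 3}, "large": {"size": 50}}

migrated, failed = migrate(stored, tightened_schema)
print("migrated:", migrated, "failed:", failed)
```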
In an attempt to be succinct, the demos make use of several tools.
kind is used to spin up clusters easily locally.
To inspect the contents of etcd with a kind cluster, first configure it to expose the etcd port:
- role: control-plane
- containerPort: 2379
Then configure etcdctl to be able to talk to the
docker cp kind-control-plane:/etc/kubernetes/pki/etcd/ca.crt ca.crt
docker cp kind-control-plane:/etc/kubernetes/pki/etcd/peer.crt peer.crt
docker cp kind-control-plane:/etc/kubernetes/pki/etcd/peer.key peer.key
# confirm it works
etcdctl get /registry --prefix=true
I occasionally use yvim, which is just an alias to open vim with the assumption that the file is yaml (useful for piping from kubectl):
alias yvim='nvim -c "doautocmd Filetype yaml" -R -'
I also sometimes use auger to decode the protobuf-encoded objects stored in etcd. Auger is not usable out of the box; it needs to be built with references to a specific version of kube so that it has the correct object definitions.
Once it's built correctly, you can pipe directly from etcdctl:
etcdctl get /registry/ingress/default/name-virtual-host-ingress --print-value-only | auger decode