#15 will shock you!
If you're familiar with kubernetes, you're likely familiar with the kubernetes API and the concept of controllers: reconciliation loops that look at the data stored in the api and try to make the state of the cluster match that declared API state.
This is a pretty powerful model that has proven its value over time - but despite the seeming simplicity of the core ideas, there are plenty of details that may be surprising once you scratch the surface.
1. Not all kubernetes apis have controllers
The kubernetes control plane is normally divided into two sets of responsibilities: the API and controllers.
Most distributions of kubernetes run the core apis and core controllers in separate pods:
$ k -n kube-system get pods
NAME                                         READY   STATUS    RESTARTS   AGE
kube-apiserver-kind-control-plane            1/1     Running   0          23m
kube-controller-manager-kind-control-plane   1/1     Running   0          23m
The kube-apiserver is largely responsible for storing api data in the backing store, which is typically etcd.
kube-controller-manager runs a series of reconciliation loops over the contents of those apis. Over time, those controllers work to reach the desired state (or fail to) and report status back via the api.
And typically this is how kubernetes APIs work: you create desired state via the API, the desired state is stored, and a controller works asynchronously to make the cluster state match the desired state.
For example: the deployment api allows you to create a deployment object that defines pods. The deployment controller (running in controller-manager) creates replicasets in response to the deployment object and updates the status.
This is especially apparent if you stop controller manager entirely and try to create an object like a deployment:
Without controller-manager running, the deployment object (which can still be created!) has no status, and no pods are created.
But there is actually a small set of apis that aren't directly managed by a controller in this way. SubjectAccessReview is one of them:
SubjectAccessReview responds with a status even without controller-manager.
There's another api that you're probably familiar with that is also available without controller-manager:
Pods work without controller-manager as well! In this case, the kubelet is the controller for the pods.
Most of this so far is probably not too surprising - but it's good background material for some of the other topics.
2. Group, Version, and Resource identify APIs. Group, Version, and Kind identify objects.
Most APIs in kubernetes deal with objects.
Objects are associated with a group, version, kind, and resource:
- The kind is the familiar name you see at the top of a kube manifest: kind: Deployment, kind: Role, etc.
- Each object is accessible via an API endpoint. Every kind is associated with some resource, which is the name used to access the object via the API. For most objects, the resource is just a variant of the kind name (e.g. Kind: Pod is accessible via the resource type pods). Some APIs may expose the same Kind under different resource names, but it is rare.
- Each kind has a set of versions, which may have different schemas.
- Each kind is organized into a group.
The apiVersion is a combination of the group and version for a specific object.
Looking at a single object:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list"]
This indicates the group is rbac.authorization.k8s.io, the version is v1, and the kind is Role.
There's nothing in the object that tells you what the resource is - the kube apiserver maintains a mapping of resource types <-> kinds, as do kube clients like kubectl and client-go. This is called the "REST Mapping" in these client projects and related docs.
Once you know the resource, there are simple rules for building URLs, e.g.:
- /apis/GROUP/VERSION/RESOURCE/NAME to get a cluster-scoped object
- /apis/GROUP/VERSION/namespaces/NAMESPACE/RESOURCETYPE/NAME to access a namespace-scoped object.
There are more details in the api-concepts documentation.
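As a rough illustration of those URL rules, here is a minimal Python sketch. The function names are mine, not part of any kube client. It also handles one detail not covered above: the legacy core group (the empty group in apiVersion: v1) is served under /api rather than /apis. The resource name still has to come from a REST mapping; here it is simply passed in.

```python
def split_api_version(api_version):
    """Split an apiVersion into (group, version); the core group is ""."""
    group, _, version = api_version.rpartition("/")
    return group, version


def resource_path(api_version, resource, name, namespace=None):
    """Build the URL path for a single object (hypothetical helper)."""
    group, version = split_api_version(api_version)
    # Named groups live under /apis/GROUP/VERSION, the core group under /api/VERSION.
    prefix = f"/apis/{group}/{version}" if group else f"/api/{version}"
    if namespace is not None:
        prefix += f"/namespaces/{namespace}"
    return f"{prefix}/{resource}/{name}"


# The Role from the manifest above: kind Role maps to the resource "roles".
print(resource_path("rbac.authorization.k8s.io/v1", "roles", "pod-reader", "default"))
# A cluster-scoped object has no namespace segment.
print(resource_path("rbac.authorization.k8s.io/v1", "clusterroles", "admin"))
# "web-0" is a made-up pod name; pods are in the core group.
print(resource_path("v1", "pods", "web-0", "default"))
```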
3. Objects are accessible at all api versions
Any object of a given Group and Kind can be accessed at all versions supported by the apiserver.
I can create an ingress object at v1beta1 and then get the same object at v1. Those versions have completely different schemas, but the API will convert it to the version I request. Or, I can create an ingress object at v1 and then get it at v1beta1.
If you don't request a specific version, kubectl will request the object at the api's preferredVersion, which may be different from what you created.
You can find out what the preferred version is with kubectl get --raw /apis/GROUP. For example, to find the supported and preferred versions for ingress:
$ kubectl get --raw /apis/networking.k8s.io
{
  "kind": "APIGroup",
  "apiVersion": "v1",
  "name": "networking.k8s.io",
  "versions": [
    {
      "groupVersion": "networking.k8s.io/v1",
      "version": "v1"
    },
    {
      "groupVersion": "networking.k8s.io/v1beta1",
      "version": "v1beta1"
    }
  ],
  "preferredVersion": {
    "groupVersion": "networking.k8s.io/v1",
    "version": "v1"
  }
}
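If you want to pull the served and preferred versions out of that response programmatically, something like the following Python sketch would do it. The function name is made up, and the input is a trimmed copy of the response above rather than a live call to the cluster.

```python
import json

# Trimmed copy of the APIGroup response shown above.
RAW = """
{
  "kind": "APIGroup",
  "name": "networking.k8s.io",
  "versions": [
    {"groupVersion": "networking.k8s.io/v1", "version": "v1"},
    {"groupVersion": "networking.k8s.io/v1beta1", "version": "v1beta1"}
  ],
  "preferredVersion": {"groupVersion": "networking.k8s.io/v1", "version": "v1"}
}
"""


def summarize_api_group(raw):
    """Return (served versions, preferred version) from an APIGroup document."""
    group = json.loads(raw)
    served = [v["version"] for v in group["versions"]]
    preferred = group["preferredVersion"]["version"]
    return served, preferred


served, preferred = summarize_api_group(RAW)
print("served:", served)
print("preferred:", preferred)
```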
4. The version of the object stored in etcd may differ from the submitted version and the preferred version
For every group and resource, the kube-apiserver knows the:
- served versions - the list of versions that are available in the api
- decodable versions - the list of versions the apiserver knows how to decode from storage. This may be different from served versions.
- storage or encodable version - the version the apiserver will convert to before storing in etcd
- preferred version - kubectl will request with this version if not otherwise specified.
If you submit an object to the apiserver and it doesn't match the storage version, the apiserver will convert it to the storage version before storing it in etcd. (To be more precise, it first converts it to an internal version, before converting it to the storage version for etcd, but this is an implementation detail.)
There isn't a user-facing way to determine the storage version for an API at runtime (at least not yet).
The storage version is usually the preferred version, but this can be overridden.
For example, in kube 1.20, the preferred version of ingress is v1, but the storage version is v1beta1. This can be seen by looking at the data the apiserver stores in etcd directly:
5. Changing an object's stored version requires writing with a newer apiserver
Since the storage version is hardcoded in kube-apiserver, the only way to have an object's storage version updated is to update to a newer kube-apiserver with a different storage version configured for the object's API.
Once the apiserver is on a newer version with a new storage version, the object must be overwritten - it is not automatically converted (more on automatic conversion later).
This example:
- Creates an ingress at v1 in kube 1.20 - which stores it at v1beta1
- Updates the apiserver to 1.21
- Checks that the data in etcd is still at v1beta1
- Overwrites the ingress with the new apiserver to see it stored at v1
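The steps above can be mimicked with a toy model in Python. This is only a sketch of the behaviour, not how kube-apiserver is implemented: the ToyAPIServer class, the convert function, and the heavily simplified ingress schemas (bare v1/v1beta1 strings instead of full apiVersions) are all invented for illustration.

```python
def convert(obj, version):
    """Convert a toy ingress between simplified v1beta1 and v1 shapes."""
    if obj["apiVersion"] == version:
        return dict(obj)
    if version == "v1":
        b = obj["backend"]
        backend = {"service": {"name": b["serviceName"], "port": {"number": b["servicePort"]}}}
    else:
        s = obj["backend"]["service"]
        backend = {"serviceName": s["name"], "servicePort": s["port"]["number"]}
    return {"apiVersion": version, "name": obj["name"], "backend": backend}


class ToyAPIServer:
    def __init__(self, storage_version, etcd):
        self.storage_version = storage_version  # fixed per apiserver release
        self.etcd = etcd                        # a dict standing in for etcd

    def put(self, obj):
        # Writes are always converted to the storage version first.
        self.etcd[obj["name"]] = convert(obj, self.storage_version)

    def get(self, name, version):
        # Reads are converted to whatever version the client asked for.
        return convert(self.etcd[name], version)


etcd = {}
old = ToyAPIServer("v1beta1", etcd)  # stands in for kube 1.20
old.put({"apiVersion": "v1", "name": "demo",
         "backend": {"service": {"name": "web", "port": {"number": 80}}}})
print(etcd["demo"]["apiVersion"])    # created at v1, stored at v1beta1

new = ToyAPIServer("v1", etcd)       # stands in for kube 1.21
print(etcd["demo"]["apiVersion"])    # the upgrade alone changes nothing in etcd
new.put(new.get("demo", "v1"))       # get-and-put through the new apiserver
print(etcd["demo"]["apiVersion"])    # now stored at v1
```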
6. API versions may have ratcheting validation (but are still round-trippable)
The apiserver may choose to ratchet validation between versions - something that may have been valid under an old api version may not be valid under the new api version.
Alone, this is probably not surprising. What might be surprising is the way ratcheting validation can interact with version round-tripping: just because you can get an object from the apiserver doesn't mean that you can create that exact same object.
For example, the CustomResourceDefinition API employs ratcheting validation between v1beta1 and v1 to enforce structural schemas under v1.
This means that:
- You can create a v1beta1 CRD with a non-structural schema
- You can get that CRD from the v1 CRD api (versions are round-trippable)
- If you delete the CRD, and recreate it with the v1 CRD that you just got from the API, it will fail to create
- You can update that CRD at v1 after it's been created at v1beta1 (the validation is ratcheting)
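To make the ratchet concrete, here is a small Python sketch of the decision described in the list above. It is a simplification: is_structural is a made-up stand-in for the real structural-schema check, and the real apiserver logic has more cases than this.

```python
def is_structural(crd):
    """Hypothetical stand-in for the real structural-schema check."""
    return crd.get("schema", {}).get("type") == "object"


def validate(new, old=None):
    """Return an error string, or None if the write is accepted."""
    if is_structural(new):
        return None
    if new["apiVersion"] == "v1beta1":
        return None  # the old version never required structural schemas
    if old is not None and not is_structural(old):
        return None  # ratchet: the existing object was already non-structural
    return "v1 requires a structural schema"


legacy = {"apiVersion": "v1beta1", "schema": {}}  # non-structural
as_v1 = dict(legacy, apiVersion="v1")             # same object, read back at v1

print(validate(legacy))             # create at v1beta1 is accepted
print(validate(as_v1))              # create at v1 is rejected
print(validate(as_v1, old=legacy))  # update at v1 is accepted
```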
7. The StorageVersion API exists
In 1.20+, there is a StorageVersion API that can provide information about objects stored in etcd, but it must be explicitly enabled.
This is the config needed to enable the API in kind:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  StorageVersionAPI: true
  APIServerIdentity: true
runtimeConfig:
  "internal.apiserver.k8s.io/v1alpha1": "true"
Once enabled, for each group/resource, the API provides:
- the current set of versions that can be decoded
- the version that is used for encoding - i.e. the storage version, used to encode the object to store in etcd.
The API provides this information for every instance of the apiserver.
This API is not currently user-facing (hence the internal prefix). It is used to make decisions around storage versions during an apiserver upgrade, where there may be multiple apiservers running at multiple versions. The enhancement has more detail, see also the API documentation.
Here's an example for the deployment API:
apiVersion: internal.apiserver.k8s.io/v1alpha1
kind: StorageVersion
metadata:
  creationTimestamp: "2021-06-04T17:16:57Z"
  name: apps.deployments
  resourceVersion: "52"
  uid: 0b80b0f3-72b0-4af6-adfd-e93eb4b4c29f
spec: {}
status:
  commonEncodingVersion: apps/v1
  conditions:
  - lastTransitionTime: "2021-06-04T17:16:57Z"
    message: Common encoding version set
    reason: CommonEncodingVersionSet
    status: "True"
    type: AllEncodingVersionsEqual
  storageVersions:
  - apiServerID: kube-apiserver-803c62b1-340f-4055-93ca-44aba8a35574
    decodableVersions:
    - apps/v1
    - apps/v1beta2
    - apps/v1beta1
    encodingVersion: apps/v1
In this case, since there is only one apiserver, we can be confident that the deployment will be stored at apps/v1.
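The relationship between the per-apiserver entries and commonEncodingVersion can be sketched in a few lines of Python. This is my reading of the status fields above, not the apiserver's code: the common version is only meaningful when every apiserver reports the same encodingVersion. The apiServerID values below are invented.

```python
def common_encoding_version(storage_versions):
    """Return the encoding version shared by every apiserver, or None if they disagree."""
    encodings = {sv["encodingVersion"] for sv in storage_versions}
    return encodings.pop() if len(encodings) == 1 else None


# One apiserver, as in the example above.
single = [{"apiServerID": "apiserver-a", "encodingVersion": "apps/v1"}]

# Mid-upgrade: two apiservers that encode at different versions.
mixed = [
    {"apiServerID": "apiserver-a", "encodingVersion": "apps/v1beta2"},
    {"apiServerID": "apiserver-b", "encodingVersion": "apps/v1"},
]

print(common_encoding_version(single))
print(common_encoding_version(mixed))
```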
8. Install the kube-storage-version-migrator to migrate storage versions automatically
In a previous section, we saw that an object needs to be rewritten with a new apiserver in order for the storage version to change.
By default in kubernetes, nothing performs this operation. The kube-storage-version-migrator is an optional component that will automate the get-and-put workflow over all objects in the cluster.
The kube-storage-version-migrator is enabled by default in OpenShift, but it does not run automatically (it must be triggered manually).
9. CRDs define new APIs (not just objects)
When you create a CRD to define a new object in the api, you are defining all of the same things that the apiserver defines for core apis:
- the group, kind, and resourcetype for a new type of object
- the served versions
- the storage version
- the decodable versions
- schemas for all decodable versions
- whether the objects are namespace or cluster scoped
- new url endpoints in the API
The new APIs that are generated will follow the kubernetes api conventions: that means that they are roundtrippable and there can only be a single storage version.
10. All APIs are cluster scoped, even scope: Namespaced CRDs
CRDs have a scope field.
This determines the scope of objects that you can create; it does not scope the availability of the api itself.
If scope is set to Cluster, then the API routes look like:
/apis/GROUP/VERSION/RESOURCE/NAME
If instead scope is set to Namespaced, API routes look like:
/apis/GROUP/VERSION/namespaces/NAMESPACE/RESOURCETYPE/NAME
And namespaced resources across all namespaces can be queried with:
/apis/GROUP/VERSION/RESOURCETYPE
In both the Cluster and Namespaced scoped APIs, the Group and Version have been claimed for the entire cluster. There is no notion of an API that is only available in a single namespace.
11. A CRD's storage version determines how new objects are stored.
Just like kube-apiserver picks the version to use when storing in etcd, as a CRD developer you must pick the version to store in etcd as well.
The storage: true flag on a CRD version indicates how objects will be stored going forward. It does not affect existing stored objects.
12. CRD's storedVersions lists every version that has been a stored version (not what is actually in etcd)
The status block of a crd has a storedVersions field:
status:
  acceptedNames:
    kind: CronTab
    listKind: CronTabList
    plural: crontabs
    shortNames:
    - ct
    singular: crontab
  conditions:
  - lastTransitionTime: "2021-06-16T14:47:48Z"
    message: no conflicts found
    reason: NoConflicts
    status: "True"
    type: NamesAccepted
  - lastTransitionTime: "2021-06-16T14:47:48Z"
    message: the initial names have been accepted
    reason: InitialNamesAccepted
    status: "True"
    type: Established
  storedVersions:
  - v1beta1
  - v1
This field indicates every version that has been a stored version (i.e. has had storage: true set in the spec) and has no relationship to what stored versions exist in etcd.
13. You can't remove a version from a CRD until it has been removed from the status
.status.storedVersions, being a record of what versions have previously been set as the stored version for the api, indicates that there might still be data in etcd stored under those versions. You don't want to remove a decodable version until there is nothing left in storage that might need to be decoded.
For this reason, it's not possible to entirely remove a version from a CRD if that version is listed as a storedVersion.
Note that served can be set to false for any version - a stored version that is no longer served can still be fetched under newer apiversions.
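The bookkeeping from the last two sections can be modelled with a short Python sketch. The two helper functions are invented for illustration and only mimic the rules as described here: switching the storage version appends to status.storedVersions, and a version cannot be dropped from the spec while it is still listed there.

```python
def set_storage_version(crd, version):
    """Mark `version` as the storage version and record it in status.storedVersions."""
    for v in crd["spec"]["versions"]:
        v["storage"] = v["name"] == version
    if version not in crd["status"]["storedVersions"]:
        crd["status"]["storedVersions"].append(version)


def remove_version(crd, version):
    """Drop `version` from the spec, refusing while it is still a recorded stored version."""
    if version in crd["status"]["storedVersions"]:
        raise ValueError(f"{version} is still listed in status.storedVersions")
    crd["spec"]["versions"] = [v for v in crd["spec"]["versions"] if v["name"] != version]


crd = {
    "spec": {"versions": [
        {"name": "v1beta1", "served": True, "storage": True},
        {"name": "v1", "served": True, "storage": False},
    ]},
    "status": {"storedVersions": ["v1beta1"]},
}

set_storage_version(crd, "v1")
print(crd["status"]["storedVersions"])  # both versions are now recorded

try:
    remove_version(crd, "v1beta1")
except ValueError as err:
    print("rejected:", err)

# Only after the status has been cleaned up (by hand) does removal go through.
crd["status"]["storedVersions"] = ["v1"]
remove_version(crd, "v1beta1")
print([v["name"] for v in crd["spec"]["versions"]])
```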
14. Stored versions must be manually removed from the CRD's status
Just like non-CRD-defined kube apis, objects need to be updated to the new storage version via a write. The kube-storage-version-migrator can automate this for CRs as well.
However, once that migration is complete, it's a manual process to remove unused storage versions from a CRD's .status.storedVersions.
kubectl has no direct support for editing status. In this example, we remove the version with curl.
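As a rough idea of what such a request could contain, here is a Python sketch that only builds the path and body. It makes no network call and is not the demo itself. It assumes a JSON merge patch against the CRD's status subresource; the helper name and the example CRD name (crontabs.stable.example.com) are hypothetical.

```python
import json


def stored_versions_patch(crd, keep):
    """Return (path, merge-patch body) that trims status.storedVersions to `keep`."""
    name = crd["metadata"]["name"]
    path = f"/apis/apiextensions.k8s.io/v1/customresourcedefinitions/{name}/status"
    remaining = [v for v in crd["status"]["storedVersions"] if v in keep]
    # A merge patch replaces lists wholesale, so send the full remaining list.
    return path, json.dumps({"status": {"storedVersions": remaining}})


crd = {
    "metadata": {"name": "crontabs.stable.example.com"},  # hypothetical CRD name
    "status": {"storedVersions": ["v1beta1", "v1"]},
}

path, body = stored_versions_patch(crd, keep={"v1"})
print(path)
print(body)
```

The body would be sent as a PATCH with a Content-Type of application/merge-patch+json, and only once every object has actually been rewritten at the new storage version.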
15. It's not safe to tighten a schema between versions
Some kube apis may have ratcheting validation. But generally this validation tightening does not occur in the API schema - tightening validation can cause clients to have incorrect assumptions about data.
This is a scenario that can happen:
- a field in v1 of an API has a tighter schema than v1beta1
- the storage version is v1 for the API
- an object is created at v1beta1 that doesn't comply with the tightened v1 schema
- the object is accepted, because it was created at v1beta1, and the storage version is v1, so it is stored as a v1 object
- the object in etcd is an "invalid" v1 object.
Similar situations arise if you update the schema of a single version to be tighter, after objects have already been created.
Any client that relies on the schema to set expectations about api responses may not do the right thing if validation is tightened between versions.
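Here is that scenario as a toy Python model. Everything in it is invented for illustration: the replicas field, the two validators, and the create function. The two versions share a shape and differ only in how strict the validation is.

```python
def valid_v1beta1(obj):
    return isinstance(obj["replicas"], int)


def valid_v1(obj):
    # The tightened schema: v1 additionally requires replicas >= 1.
    return isinstance(obj["replicas"], int) and obj["replicas"] >= 1


VALIDATORS = {"v1beta1": valid_v1beta1, "v1": valid_v1}
STORAGE_VERSION = "v1"


def create(etcd, name, obj):
    # Validation runs against the schema of the version that was submitted...
    if not VALIDATORS[obj["apiVersion"]](obj):
        raise ValueError("rejected by schema validation")
    # ...and the object is then converted to the storage version before it is stored.
    etcd[name] = dict(obj, apiVersion=STORAGE_VERSION)


etcd = {}
create(etcd, "demo", {"apiVersion": "v1beta1", "replicas": 0})

stored = etcd["demo"]
print(stored["apiVersion"])  # stored as a v1 object
print(valid_v1(stored))      # but it does not satisfy the v1 schema
```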
16. kube-storage-version-migrator will fail for tightened schemas
kube-storage-version-migrator does a get/update for each object. If the schema has been tightened (or ratcheting validation has not been implemented to only apply to create), then it will fail.
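A stripped-down Python sketch of that get/update loop shows why. This is not the migrator's actual code: the migrate function and the replicas rule are made up, and a plain dict stands in for the stored objects.

```python
def migrate(etcd, validate):
    """Get and update every object; return the names whose update was rejected."""
    failed = []
    for name in sorted(etcd):
        obj = dict(etcd[name])  # get
        if validate(obj):       # the update goes through validation again
            etcd[name] = obj    # rewritten by the current apiserver
        else:
            failed.append(name)
    return failed


def tightened(obj):
    # Hypothetical tightened rule: replicas must now be at least 1.
    return obj["replicas"] >= 1


etcd = {
    "ok": {"replicas": 2},
    "legacy": {"replicas": 0},  # was valid when it was created
}

print(migrate(etcd, tightened))
```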
Appendix: kind, yvim, auger, etcdctl
In an attempt to be succinct, the demos make use of several tools.
kind is used to spin up clusters easily locally.
To inspect the contents of etcd with a kind cluster, first configure it to expose the etcd port:
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraPortMappings:
  - containerPort: 2379
    hostPort: 2379
Then configure etcdctl to be able to talk to the kind etcd:
docker cp kind-control-plane:/etc/kubernetes/pki/etcd/ca.crt ca.crt
docker cp kind-control-plane:/etc/kubernetes/pki/etcd/peer.crt peer.crt
docker cp kind-control-plane:/etc/kubernetes/pki/etcd/peer.key peer.key
export ETCDCTL_CACERT=./ca.crt
export ETCDCTL_CERT=./peer.crt
export ETCDCTL_KEY=./peer.key
# confirm it works
etcdctl get /registry --prefix=true
I occasionally use yvim, which is just an alias to open vim with the assumption that the file is yaml (useful for piping from kubectl):
alias yvim='nvim -c "doautocmd Filetype yaml" -R -'
I also sometimes use auger to decode the protobuf-encoded objects stored in etcd. Auger is not usable out of the box; instead, it needs to be built with references to a specific version of kube so that it has the correct object definitions.
Once it's built correctly, you can pipe directly from etcdctl:
etcdctl get /registry/ingress/default/name-virtual-host-ingress --print-value-only | auger decode