AIStore is designed to run natively on Kubernetes. This folder contains the AIS Operator for managing the resources and lifecycle of AIS clusters on Kubernetes. The project extends the native Kubernetes API to deploy, scale, upgrade, decommission, and otherwise automate management of all aspects of the AIStore lifecycle.
WARNING: The AIS K8s Operator is currently undergoing active development. Please see the compatibility docs for info on upgrades and deprecations.
See our guide for deploying AIStore on Kubernetes.
To deploy the operator, only a K8s cluster, kubectl, and a certificate provider (see below) are required.
Check out our prerequisites doc for production deployment requirements.
AIS operator employs admission webhooks to enforce the validity of the managed AIS cluster.
AIS operator runs a webhook server with TLS enabled, responsible for validating each AIS cluster resource being created or updated.
Operator-SDK recommends using cert-manager for provisioning the certificates required by the webhook server.
However, any solution which can provide certificates to the AIS operator pod should work.
The operator loads the webhook certificates from the path given by the webhook-cert-path arg, which defaults to /tmp/k8s-webhook-server/serving-certs/.
For quick deployment, the deploy command provides an option to deploy a basic version of cert-manager.
However, for more advanced deployments it's recommended to follow cert-manager documentation.
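When certificates come from something other than cert-manager, it can help to confirm that the expected files are in place before the manager starts. The sketch below checks a directory for a non-empty certificate and key. The function name webhook_certs_present is hypothetical, and the file names tls.crt and tls.key are an assumption based on the controller-runtime webhook server defaults; adjust them if your setup differs.

```shell
# Hypothetical helper: succeed only if the directory holds a non-empty
# serving certificate and key. The file names tls.crt and tls.key are
# assumed (controller-runtime defaults), not taken from the operator docs.
webhook_certs_present() {
  local cert_dir="${1:-/tmp/k8s-webhook-server/serving-certs}"
  [ -s "${cert_dir}/tls.crt" ] && [ -s "${cert_dir}/tls.key" ]
}
```

For example, webhook_certs_present /tmp/k8s-webhook-server/serving-certs returns a zero exit status only when both files exist and are not empty.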
The operator communicates with the deployed AIS clusters over the AIS API.
By default, if the AIS cluster is using HTTPS the operator will not verify the certificate (OPERATOR_SKIP_VERIFY_CRT=True).
To enable certificate verification for AIS cluster connections, set this environment variable to False. With Helm chart values:
controllerManager:
  manager:
    env:
      operatorSkipVerifyCrt: "False" # Enable certificate verification
For kustomize deployments, modify config/overlays/default/manager_env_patch.yaml.
If your AIS cluster uses an untrusted CA (not in the system trust store), you need to provide the CA certificate. Configure this using Helm chart values:
controllerManager:
  manager:
    aisCAConfigmapName: my-ca-bundle # Name of ConfigMap with CA certificates
The ConfigMap should contain .crt or .pem files with your CA certificates. The operator will automatically mount it to /etc/ais/ca.
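One way to create such a ConfigMap from a local CA file is sketched below. The helper name create_ca_configmap and the key name ca.crt are hypothetical, and the default namespace ais-operator-system is an assumption based on the default deployment shown in this README; the ConfigMap needs to live in whichever namespace the operator runs in.

```shell
# Hypothetical helper: create a ConfigMap holding a CA certificate.
# Arguments: ConfigMap name, path to the CA file, optional namespace
# (assumed to default to ais-operator-system).
create_ca_configmap() {
  local name="$1" ca_file="$2" namespace="${3:-ais-operator-system}"
  kubectl create configmap "$name" \
    --namespace "$namespace" \
    --from-file="ca.crt=${ca_file}"
}
```

For example, create_ca_configmap my-ca-bundle ./ca.crt creates a ConfigMap that matches the aisCAConfigmapName value used above.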
For kustomize-based deployments, you can apply a patch to override the ConfigMap name. See manager_ca_configmap_patch.yaml for reference and the example below:
# config/overlays/custom/manager_ca_patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: controller-manager
  namespace: system
spec:
  template:
    spec:
      containers:
        - name: manager
          volumeMounts:
            - name: ais-ca
              mountPath: /etc/ais/ca
              readOnly: true
      volumes:
        - name: ais-ca
          configMap:
            name: my-ca-bundle # Your CA bundle ConfigMap name
            optional: true
When using auth services with HTTPS, TLS certificate verification is enabled automatically. By default, the operator uses the system CA trust store.
If your auth service uses an untrusted CA (not in the system trust store), you need to provide the CA certificate using Helm chart values:
controllerManager:
  manager:
    authCAConfigmapName: my-auth-ca-bundle # Name of ConfigMap with auth service CA certificates
The ConfigMap should contain .crt or .pem files with your CA certificates. The operator will automatically mount it to /etc/ssl/certs/auth-ca and use it for auth service connections.
For AIStore CRs: When using auth with HTTPS URLs, the operator automatically uses the CA bundle configured via authCAConfigmapName. You typically don't need to set spec.auth.tls.caCertPath in the CR unless you have a custom certificate mount:
spec:
  auth:
    serviceURL: https://my-auth-service:52001
    # tls.caCertPath is optional - operator uses its configured CA bundle by default
Note: TLS configurations are cached for 6 hours by default to avoid repeated disk I/O. If you update an existing ConfigMap, changes propagate to running pods within ~60 seconds (kubelet sync), and the operator will use the new certificates after the cache TTL expires. The cache TTL can be adjusted via an environment variable:
env:
  - name: OPERATOR_AUTH_TLS_CACHE_TTL
    value: "1h" # Adjust for frequent certificate rotations
To enable mutual TLS (mTLS) between the operator and an AIS cluster, first create a certificate with usage: client auth defined (see cert-manager docs).
You can mount this into the pod with a tool such as the Vault agent, or you can create a secret operator-tls in the operator namespace.
This secret will be mounted by default at /etc/operator/tls.
The operator will use the client certificate at this location when communicating with AIS clusters.
To configure the location of the operator's client cert, pass the ais-client-cert-path arg when running the manager.
If the ais-client-cert-per-cluster arg is provided, the operator will load the value from ais-client-cert-path and append the values from each cluster's namespace and name when loading certificates.
For example, /etc/operator/tls/aisNamespace/aisCluster.
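The per-cluster layout can be illustrated with a small path helper. This is only a sketch of the directory convention described above, not the operator's actual code; the function name client_cert_dir is hypothetical.

```shell
# Hypothetical helper: build the per-cluster client certificate directory
# from the base path, the cluster's namespace, and the cluster's name.
client_cert_dir() {
  local base="${1%/}" namespace="$2" cluster="$3"  # drop one trailing slash from base
  printf '%s/%s/%s\n' "$base" "$namespace" "$cluster"
}
```

Calling client_cert_dir /etc/operator/tls aisNamespace aisCluster prints /etc/operator/tls/aisNamespace/aisCluster, which is where a client certificate for that cluster would need to be mounted.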
See the linked certificates diagram for a visualization of the TLS options.
First, install the AIS CRD:
make install
Then run the deployment with make deploy.
This will apply the default kustomization configuration.
$ IMG=aistorage/ais-operator:latest make deploy
$ kubectl get pods -n ais-operator-system
NAME READY STATUS RESTARTS AGE
ais-operator-controller-manager-64c8c86f7b-8g8pj 1/1 Running 0 18s
Note: If you are testing on minikube with multiple mounts, each mount defined in the AIS spec must have the same label.
$ kubectl create namespace ais
$ kubectl apply -f config/samples/ais_v1beta1_aistore.yaml -n ais
$ kubectl get pods -n ais
NAME READY STATUS RESTARTS AGE
aistore-sample-proxy-0 1/1 Running 0 2m8s
aistore-sample-target-0 1/1 Running 0 2m21s
The operator can optionally deploy an admin client pre-configured with the cluster endpoint. The default image includes the AIStore CLI and Python SDK, as well as aisloader.
spec:
  adminClient: {}
For TLS-enabled clusters, provide a CA bundle via ConfigMap:
spec:
  adminClient:
    caConfigMap:
      name: my-ca-bundle
For clusters with AuthN enabled via spec.auth.usernamePassword, the admin client is pre-configured with the AuthN service URL and credentials:
| Environment Variable | Source |
|---|---|
| AIS_AUTHN_URL | spec.auth.serviceURL (default: http://ais-authn.ais:52001) |
| AIS_AUTHN_USERNAME | SU-NAME key from spec.auth.usernamePassword.secretName |
| AIS_AUTHN_PASSWORD | SU-PASS key from spec.auth.usernamePassword.secretName |
Log in from the client pod:
ais auth login "$AIS_AUTHN_USERNAME" -p "$AIS_AUTHN_PASSWORD"
Open a shell in the client:
$ kubectl exec -it -n ais deploy/aistore-sample-client -- /bin/bash
Or, run a one-off command:
$ kubectl exec -it -n ais deploy/aistore-sample-client -- ais show cluster
This section discusses access to AIStore by external clients, that is, clients outside the Kubernetes cluster.
By default, each AIS pod will deploy with a HostPort configuration, allowing any client with access to the host to communicate with the pod directly over the specified port.
The AIStore custom resource also contains the enableExternalLB setting, which will instruct the operator to create K8s LoadBalancer services for the pods.
External access relies on the K8s capability to assign an external IP (or hostname) to these LoadBalancer services.
Setting up external IPs
- Bare-Metal On-Premises Deployments: For these setups, we recommend using MetalLB, a popular solution for on-premises Kubernetes environments.
- Cloud-Based Deployments: If your AIStore is running in a cloud environment, you can utilize standard HTTP load balancer services provided by the cloud provider.
enableExternalLB example
Update your AIS spec as follows:
# config/samples/ais_v1beta1_aistore.yaml
apiVersion: ais.nvidia.com/v1beta1
kind: AIStore
metadata:
  name: aistore-sample
spec:
  ...
  enableExternalLB: true
  # enableExternalLB: false
NOTE: Currently, external access can be enabled only for new AIS clusters. Updating the enableExternalLB spec for an existing cluster is not yet supported.
Another important consideration is the number of external IPs.
To deploy an AIS cluster of N storage nodes, the K8s cluster will have to assign external IPs to (N + 1) LoadBalancer services: one for each storage target plus one more for all the AIS proxies (aka AIS gateways) in that same cluster.
Failing that requirement will lead to a failure to deploy AIStore.
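The N + 1 rule can be checked with simple arithmetic before deploying. The sketch below is illustrative only; the function names required_lb_services and has_enough_external_ips are hypothetical, and the number of available external IPs is something you would supply from your own load balancer configuration (for example, the size of a MetalLB address pool).

```shell
# Hypothetical helper: LoadBalancer services needed for a cluster with
# the given number of storage targets (one per target, plus one shared
# by all proxies).
required_lb_services() {
  local targets="$1"
  echo "$(( targets + 1 ))"
}

# Hypothetical helper: succeed if the available external IPs cover the need.
has_enough_external_ips() {
  local targets="$1" available="$2"
  [ "$available" -ge "$(required_lb_services "$targets")" ]
}
```

For a cluster with 3 targets, required_lb_services 3 prints 4, so has_enough_external_ips 3 3 fails while has_enough_external_ips 3 4 succeeds.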
External access can be tested locally on minikube using the following command:
$ minikube tunnel
For more information and details on minikube tunneling, please see this link.
In a development/testing K8s setup, the mountpaths attached to storage target pods may either be block devices (no disks) or share a disk.
This will result in target pod errors such as "has no disks" or "filesystem sharing is not allowed".
To deploy an AIStore cluster in such K8s environments, set a shared label for each mountpath as follows:
# config/samples/ais_v1beta1_aistore.yaml
apiVersion: ais.nvidia.com/v1beta1
kind: AIStore
metadata:
  name: aistore-sample
spec:
  targetSpec:
    mounts:
      - path: "/ais1"
        size: 10Gi
        label: "disk1"
      - path: "/ais2"
        size: 10Gi
        label: "disk1"
  ...
The above spec will tell AIS to allow both mounts to share a single disk as long as the label is the same. If the label does not exist as an actual disk, the target pod will accept it and run in diskless mode without disk statistics.
AIS Operator supports deploying AIStore with distributed tracing enabled. The instructions below demonstrate how to enable distributed tracing and export traces to Lightstep.
- Create a Lightstep Freemium Account: Sign up for a Lightstep account if you haven't already (Lightstep Sign-Up).
- Obtain an Access Token: Follow the instructions to generate an access token (Lightstep Access Token Guide).
kubectl create ns ais
kubectl create secret generic -n ais lightstep-token --from-literal=token=<YOUR-LIGHTSTEP-TOKEN>
kubectl apply -f config/samples/ais_v1beta1_aistore_tracing.yaml
After a successful deployment, traces will be available in the Lightstep dashboard.
While Lightstep is used in the example for simplicity, AIStore supports exporting traces to any OpenTelemetry (OTEL)-compatible tracing solution.
Refer to the AIStore distributed-tracing doc for more details.
By default, AIS operator does not allow more than one AIS target per K8s node.
In other words, if the AIS custom resource spec has a size greater than the number of K8s nodes, the additional target pods will remain pending until a new K8s node is added.
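As a rough illustration of this rule, the number of target pods left pending is the amount by which the requested size exceeds the node count. The sketch below assumes every K8s node is eligible to run a target; the function name pending_targets is hypothetical.

```shell
# Hypothetical helper: estimate how many target pods stay pending when
# at most one target may run per node. Assumes all nodes are schedulable
# for AIS targets.
pending_targets() {
  local size="$1" nodes="$2"
  if [ "$size" -gt "$nodes" ]; then
    echo "$(( size - nodes ))"
  else
    echo 0
  fi
}
```

For example, pending_targets 4 3 prints 1: with a size of 4 on a three-node cluster, one target pod waits for a fourth node.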
However, this constraint can be relaxed for local testing using the disablePodAntiAffinity property as follows:
# config/samples/ais_v1beta1_aistore.yaml
apiVersion: ais.nvidia.com/v1beta1
kind: AIStore
metadata:
  name: aistore-sample
spec:
  size: 4 # > number of K8s nodes
  disablePodAntiAffinity: true
  ...
AIS operator supports configuring cloud providers for buckets. To enable the config for these providers, you need to create a secret with the corresponding credential file.
Helm deployments include a chart for generating these secrets based on local config and credentials. See the Helm AIS README for instructions.
For ansible deployments, see the ais_aws_config and ais_gcp_config playbooks and the associated README.
You can also create the secrets manually:
kubectl create secret -n ais-operator-system generic aws-creds \
  --from-file=config=$HOME/.aws/config \
  --from-file=credentials=$HOME/.aws/credentials
kubectl create secret -n ais-operator-system generic gcp-creds \
  --from-file=gcp.json=<path-to-gcp-credential-file>.json
Once the secrets are created, update the AIS config yaml to reference the secrets:
# config/samples/ais_v1beta1_aistore.yaml
apiVersion: ais.nvidia.com/v1beta1
kind: AIStore
metadata:
  name: aistore-sample
spec:
  gcpSecretName: "gcp-creds" # corresponding secret name just created for GCP credential
  awsSecretName: "aws-creds" # corresponding secret name just created for AWS credential
  ...
For GCP configs, the environment variable for the location may be provided through the targetSpec.Env section.
By default, this will be /var/gcp/gcp.json.
As of writing, the operator will always mount the provided secret to /var/gcp, so for a secret with data.gcp.json the resulting file location in the pod will be /var/gcp/gcp.json.
This is the default value for the GOOGLE_APPLICATION_CREDENTIALS environment variable in the container.
You may want to enhance the security of your AIStore deployment by enabling HTTPS.
Important: Before proceeding, please ensure that you have cert-manager (or equivalent) installed.
To deploy with HTTPS, the AIS spec must define the spec.ConfigToUpdate.net.http section, for example:
net:
  http:
    server_crt: "/var/certs/tls.crt"
    server_key: "/var/certs/tls.key"
    use_https: true
    skip_verify: true # if you are using self-signed certs without trust
    client_ca_tls: "/var/certs/ca.crt"
    client_auth_tls: 0
Note: This will be included in the spec by default when enabling HTTPS and using our Helm charts or playbooks.
If you are using a secret mount to access your certificate, define it with spec.tlsSecretName.
The operator will automatically mount the contents of the secret at the location /var/certs.
We provide automation for creating this secret for both Helm and Ansible Playbooks.
Helm: See HTTPS Deployment docs section
Playbooks: See generate_https_cert.yml and associated templates.
With cert-manager csi-driver installed, you can get signed certificates directly from your Issuer.
The sample configuration below contains definitions for RBAC and an Issuer for use with Vault.
kubectl apply -f config/samples/ais_v1beta1_aistore_tls_certmanager_csi.yaml
Testing Considerations:
- For tests that use the AIStore Command Line Interface (CLI), configure the CLI to bypass certificate verification by running the following command, which allows unverified connections to the AIStore cluster:
$ ais config cli set cluster.skip_verify_crt true
- When using curl to interact with your AIStore cluster over HTTPS, use the -k flag to skip certificate validation. For example:
curl -k https://your-ais-cluster-url
- If you prefer not to skip certificate validation, you can export the self-signed certificate for use with curl. Here's how to export the certificate:
kubectl get secret tls-certs -n ais-operator-system -o jsonpath='{.data.tls\.crt}' | base64 --decode > tls.crt
You can now use the exported tls.crt as a parameter when using curl, like this:
curl --cacert tls.crt https://your-ais-cluster-url
By following these steps, you can deploy AIStore in a Kubernetes environment with HTTPS support, leveraging a self-signed certificate provided by cert-manager.
AIS Operator leverages operator-sdk, which provides high-level APIs, scaffolding, and code generation utilities that simplify operator development.
operator/api/v1beta1 contains Go definitions for the Custom Resource Definitions (CRDs) and Webhooks used by the operator.
Any modification to these type definitions requires updating the auto-generated code (operator/api/v1beta1/zz_generated.deepcopy.go) and the YAML manifests used for deploying operator-related K8s resources to the cluster.
We use the following commands to achieve this:
# Update the auto-generated code
make generate
# Update the YAML base manifests in `config/base`
make manifests
# Apply kustomize to the base manifests to generate an installer manifest
make build-installer
# Apply kustomize, then helmify, then our custom templating
# This updates the helm chart in operator/helm
make build-installer-helm
For building and pushing the operator container images, use the following commands.
Currently, docker and podman are explicitly supported.
To use a different tool, we expect it to be aliased to one of these commands and compatible with their arguments.
# Define an image to build
export IMG=<REPOSITORY>/<IMAGE_TAG>
# Build and push the image:
# For the current platform, with docker
make docker-build docker-push
# For the current platform, with podman
make podman-build podman-push
# For $TARGET_PLATFORMS (default linux/amd64, linux/arm64), with docker buildx
make docker-buildx-push
# For $TARGET_PLATFORMS (default linux/amd64, linux/arm64), with podman
make podman-build-multiarch podman-push
For comprehensive testing documentation including unit tests, E2E tests, cluster setup, and test configuration options, see tests/README.md.