Deploying a Query Cluster on Kubernetes
This topic explains how to install and configure a Databend query cluster on Kubernetes with MinIO as the storage backend.
Before You Begin
- Make sure your cluster has enough resources for the installation (at least 4 CPUs, 4 GB RAM, and 50 GB of disk space).
- Make sure you have a Kubernetes cluster up and running. For more information, see k3d or minikube.
- Note that Databend cluster mode only works with shared storage (AWS S3 or S3-compatible storage such as MinIO).
Deploy a Sample Databend Cluster with MinIO
Step 1. Install MinIO
This configuration is for demonstration ONLY. Never use it in production. Refer to https://docs.min.io/docs/deploy-minio-on-kubernetes.html for more information on TLS and high-availability configurations for production.
We will bootstrap a MinIO server on Kubernetes with the following configurations:
STORAGE_TYPE=s3
STORAGE_S3_BUCKET=sample-storage
STORAGE_S3_REGION=us-east-1
STORAGE_S3_ENDPOINT_URL=http://minio.minio.svc.cluster.local:9000
STORAGE_S3_ACCESS_KEY_ID=minio
STORAGE_S3_SECRET_ACCESS_KEY=minio123
The following configuration applies to the target Kubernetes cluster. It creates a bucket named sample-storage with 10Gi of storage space:
kubectl create namespace minio --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -f https://raw.githubusercontent.com/datafuselabs/databend/main/scripts/kubernetes/minio-sample.yaml -n minio
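Before moving on, you can run a quick sanity check that MinIO is up. The service name minio is taken from the endpoint URL above; adjust it if you customize the manifest:
# wait for the MinIO pods to reach the Running state
kubectl -n minio get pods
# confirm the service backing http://minio.minio.svc.cluster.local:9000 exists
kubectl -n minio get svc minio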
Step 2. Deploy a Standalone Databend Meta-Service Layer
The following configuration creates a standalone Databend meta-service in the databend-system namespace:
kubectl create namespace databend-system --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -f https://raw.githubusercontent.com/datafuselabs/databend/main/scripts/kubernetes/meta-standalone.yaml -n databend-system
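You can confirm the meta-service is up before deploying the query layer (the exact resource names are defined in meta-standalone.yaml and may differ if you modify it):
# the meta-service pod should reach the Running state
kubectl -n databend-system get pods,svc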
Step 3. Deploy a Databend Query Cluster
The following configuration creates a Databend query cluster in the tenant1 namespace. Each pod under the deployment has 900m vCPU and 900Mi memory:
kubectl create namespace tenant1 --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -f https://raw.githubusercontent.com/datafuselabs/databend/main/scripts/kubernetes/query-cluster.yaml -n tenant1
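You can check that the query pods come up (the deployment is named query, as used by the scaling commands below):
# all replicas of the query deployment should become available
kubectl -n tenant1 get deployment query
kubectl -n tenant1 get pods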
To scale the query cluster up or down, use the following commands:
# scale the query cluster down to 0 nodes
kubectl scale -n tenant1 deployment query --replicas=0
# scale the query cluster up to 3 nodes
kubectl scale -n tenant1 deployment query --replicas=3
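After scaling, you can wait for the rollout to settle before reconnecting, for example:
# block until the desired number of query replicas is available
kubectl -n tenant1 rollout status deployment/query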
Check the Cluster Information
Make sure that port 3308 on localhost is available.
nohup kubectl port-forward -n tenant1 svc/query-service 3308:3307 &
mysql -h127.0.0.1 -uroot -P3308
SELECT * FROM system.clusters;
+----------------------+------------+------+
| name                 | host       | port |
+----------------------+------------+------+
| dIUkzbOaqJEPudb0A7j4 | 172.17.0.6 | 9191 |
| NzfBm4KIQGEHe0sxAWa3 | 172.17.0.7 | 9191 |
| w3MuQR8aTHKHC1OLj5a6 | 172.17.0.5 | 9191 |
+----------------------+------------+------+
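If you prefer a one-off check over an interactive session, the same query can be issued directly through the forwarded port:
mysql -h127.0.0.1 -uroot -P3308 -e "SELECT * FROM system.clusters;"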
Step 4. Distributed Query
EXPLAIN SELECT max(number), sum(number) FROM numbers_mt(10000000000) GROUP BY number % 3, number % 4, number % 5 LIMIT 10;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| explain |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Limit: 10 |
| RedistributeStage[expr: 0] |
| Projection: max(number):UInt64, sum(number):UInt64 |
| AggregatorFinal: groupBy=[[(number % 3), (number % 4), (number % 5)]], aggr=[[max(number), sum(number)]] |
| RedistributeStage[expr: sipHash(_group_by_key)] |
| AggregatorPartial: groupBy=[[(number % 3), (number % 4), (number % 5)]], aggr=[[max(number), sum(number)]] |
| Expression: (number % 3):UInt8, (number % 4):UInt8, (number % 5):UInt8, number:UInt64 (Before GroupBy) |
| ReadDataSource: scan schema: [number:UInt64], statistics: [read_rows: 10000000000, read_bytes: 80000000000, partitions_scanned: 1000001, partitions_total: 1000001], push_downs: [projections: [0]] |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
The distributed query now works, and the cluster transfers data efficiently through flight_api_address.
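In this sample deployment, the host values shown in system.clusters correspond to the query pod IPs, so you can cross-check which pods take part in the query:
# the IP column should match the host column in system.clusters
kubectl -n tenant1 get pods -o wide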
Upload Data to the Cluster
CREATE TABLE t1(i INT, j INT);
INSERT INTO t1 SELECT number, number + 300 FROM numbers(10000000);
SELECT count(*) FROM t1;
+----------+
| count()  |
+----------+
| 10000000 |
+----------+
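When you are done experimenting, you can drop the sample table:
DROP TABLE t1;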
Install Databend Cluster with Helm Chart
We support installing a Databend cluster with our official Helm charts.
Install Meta Service
Install a standalone Databend meta-service. Please follow the documentation for further configuration options (for example, high availability).
helm repo add databend https://charts.databend.rs
helm install my-release databend/databend-meta --namespace databend --create-namespace
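To verify the release (my-release and the databend namespace are taken from the command above):
helm status my-release -n databend
kubectl -n databend get pods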
Install Query Service
The following commands deploy a Databend query cluster with 3 nodes and register it with the meta-service:
helm repo add databend https://charts.databend.rs
helm install query databend/databend-query --namespace databend --create-namespace \
--set config.meta.address=my-release-databend-meta.databend.svc.cluster.local:9191 \
--set replicaCount=3
Please follow the documentation for further configuration options (for example, object storage secrets).
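As a sketch, the same overrides can also be kept in a values file instead of repeated --set flags (object storage settings are omitted here; see the chart documentation for those):
# values.yaml -- equivalent to the --set flags above
config:
  meta:
    address: my-release-databend-meta.databend.svc.cluster.local:9191
replicaCount: 3
Then install or upgrade the release from the file:
helm upgrade --install query databend/databend-query --namespace databend --create-namespace -f values.yaml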