Production Guide
As you take Cortex from development to production, here are a few pointers that might be useful.
Use images from a colocated ECR
Configure your cluster and APIs to use images from ECR in the same region as your cluster to accelerate scale-ups, reduce ingress costs, and remove the dependency on Cortex's public quay.io registry.
You can find instructions for mirroring Cortex images here.
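As a minimal sketch, you can point Cortex's system images at your mirrored ECR repositories in your cluster configuration file (the account ID and region below are placeholders, and the exact set of image_* fields varies by Cortex version; see the mirroring instructions for the full list):

```yaml
# cluster.yaml
# pull Cortex's system images from ECR repos in the cluster's region
image_operator: 123456789012.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/operator:<cortex version>
image_manager: 123456789012.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/manager:<cortex version>
# ...and so on for the remaining image_* fields
```

The same applies to your own API images: reference them by their ECR URI in each container's image field.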
Handling Cortex updates/upgrades
Use a Route 53 hosted zone as a proxy in front of your Cortex cluster. Every new Cortex cluster provisions a new API load balancer with a unique endpoint. Using a Route 53 hosted zone configured with a subdomain will expose your Cortex cluster's API endpoint as a static endpoint (e.g. cortex.your-company.com). You will be able to upgrade Cortex versions without downtime, and you will avoid the need to update your client code every time you migrate to a new cluster. You can find instructions for setting up a custom domain with a Route 53 hosted zone here, and instructions for updating/upgrading your cluster here.
Production cluster configuration
Securing your cluster
The following configuration will improve security by preventing your cluster's nodes from being publicly accessible.
You can make your load balancer private to prevent your APIs from being publicly accessed. In order to access your APIs, you will need to set up VPC peering between the Cortex cluster's VPC and the VPC containing the consumers of the Cortex APIs. See the VPC peering guide for more details.
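As a minimal sketch, assuming the api_load_balancer_scheme field from the cluster configuration docs:

```yaml
# cluster.yaml
# provision the API load balancer as private (the default is internet-facing)
api_load_balancer_scheme: internal
```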
You can also restrict access to your load balancers by IP address:
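A minimal sketch, assuming the api_load_balancer_cidr_white_list field from the cluster configuration docs (the CIDR range is illustrative):

```yaml
# cluster.yaml
# only accept traffic originating from this CIDR range
api_load_balancer_cidr_white_list: [10.0.0.0/16]
```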
These two fields are also available for the operator load balancer. Keep in mind that if you make the operator load balancer private, you'll need to configure VPC peering to use the cortex CLI or Python client.
See here for more information about the load balancers.
Workload load-balancing
Depending on your application's requirements, you might have different needs from the cluster's API load balancer. By default, the API load balancer is a Network Load Balancer (NLB). In some situations, a Classic Load Balancer (ELB) may be preferred; it can be selected in your cluster config by setting api_load_balancer_type: elb. This selection can only be made before creating your cluster.
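For example:

```yaml
# cluster.yaml
# use a Classic load balancer for the APIs instead of the default NLB
api_load_balancer_type: elb
```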
Ensure node provisioning
You can take advantage of the cost savings of spot instances and the reliability of on-demand instances by utilizing the priority field in node groups. You can deploy two node groups, one that is spot and another that is on-demand. Set the priority of the spot node group to be higher than the priority of the on-demand node group. This encourages the cluster-autoscaler to try to spin up instances from the spot node group first. If there are no more spot instances available, the on-demand node group will be used instead.
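A minimal sketch of such a pair of node groups (the names, instance type, sizes, and priority values are illustrative; see the cluster configuration docs for the full node group schema):

```yaml
# cluster.yaml
node_groups:
  - name: ng-spot # higher priority: the autoscaler tries this group first
    instance_type: m5.large
    min_instances: 0
    max_instances: 10
    spot: true
    priority: 100
  - name: ng-on-demand # fallback when spot capacity is unavailable
    instance_type: m5.large
    min_instances: 0
    max_instances: 10
    spot: false
    priority: 1
```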
Considerations for large clusters
If you plan on scaling your Cortex cluster past 300 nodes or 300 pods, it is recommended to set prometheus_instance_type to an instance type with more memory (the default is t3.medium, which has 4 GiB).
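For example (the replacement instance type is an illustrative choice):

```yaml
# cluster.yaml
# t3.xlarge has 16 GiB of memory, vs 4 GiB for the default t3.medium
prometheus_instance_type: t3.xlarge
```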
API Spec
Container design
Configure your health checks to be as accurate as possible to prevent requests from being routed to pods that aren't ready to handle traffic.
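As a sketch, assuming your container serves a health endpoint at /healthz on port 8080 (the path, port, and timings are illustrative; check the API configuration docs for the exact readiness probe schema in your Cortex version):

```yaml
# api.yaml (inside a container definition)
readiness_probe:
  http_get:
    path: /healthz
    port: 8080
  initial_delay_seconds: 5 # give the server time to start
  period_seconds: 5        # probe every 5 seconds
  failure_threshold: 3     # mark unready after 3 consecutive failures
```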
Pods section
Make sure that max_concurrency is set to match the concurrency supported by your container.
Tune max_queue_length to lower values if you would like to redistribute requests to newer pods more aggressively as your API scales up, rather than allowing requests to linger in queues. This means that clients consuming your APIs should implement retry logic with a delay (such as exponential backoff).
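For example, in the pod section of an API spec (the values are illustrative; match them to your container):

```yaml
# api.yaml
pod:
  max_concurrency: 4   # the number of requests your container can handle at once
  max_queue_length: 16 # keep small so excess requests fail fast and are retried elsewhere
```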
Compute section
Make sure to specify all of the relevant compute resources (especially cpu and memory) to ensure that your pods aren't starved for resources.
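For example (the values are illustrative):

```yaml
# api.yaml (inside a container definition)
compute:
  cpu: 1    # CPU units requested for the container
  mem: 2Gi  # memory requested for the container
  gpu: 1    # include only if your workload needs a GPU
```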
Autoscaling
Revisit the autoscaling docs for Realtime APIs and/or Async APIs to effectively handle production traffic by tuning the scaling rate, sensitivity, and over-provisioning.
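As a sketch of the kinds of fields to tune in a Realtime API (field names per the autoscaling docs; the values are illustrative):

```yaml
# api.yaml
autoscaling:
  min_replicas: 2                    # keep a baseline over-provisioned for spikes
  max_replicas: 50
  target_in_flight: 4                # in-flight requests per replica before scaling up
  window: 60s                        # how much recent traffic to average (sensitivity)
  downscale_stabilization_period: 5m # scale down slowly to avoid thrashing
  upscale_stabilization_period: 1m   # scale up quickly under load
```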