Production Guide
As you take Cortex from development to production, here are a few pointers that might be useful.
Configure your cluster and APIs to use images from ECR in the same region as your cluster to accelerate scale-ups, reduce ingress costs, and remove the dependency on Cortex's public quay.io registry.
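If you maintain your own copies of the images in ECR, your cluster configuration and API specs can reference them instead. The snippet below is a minimal sketch: the account ID, region, and repository names are placeholders, and the exact set of image_* fields depends on your Cortex version (see the cluster configuration reference):

# cluster.yaml (partial)
region: us-west-2
image_operator: 123456789012.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/operator:<cortex_version>
image_manager: 123456789012.dkr.ecr.us-west-2.amazonaws.com/cortexlabs/manager:<cortex_version>
# ...repeat for the remaining image_* fields

# in each API spec, point the container at an image in the same region
image: 123456789012.dkr.ecr.us-west-2.amazonaws.com/my-org/my-api:latest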
Use a Route 53 hosted zone as a proxy in front of your Cortex cluster. Every new Cortex cluster provisions a new API load balancer with a unique endpoint. Using a Route 53 hosted zone configured with a subdomain will expose your Cortex cluster's API endpoint as a static endpoint (e.g. cortex.your-company.com). You will be able to upgrade Cortex versions without downtime, and you will avoid the need to update your client code every time you migrate to a new cluster. You can find instructions for setting up a custom domain with a Route 53 hosted zone here, and instructions for updating/upgrading your cluster here.

The following configuration will improve security by preventing your cluster's nodes from being publicly accessible:
subnet_visibility: private
nat_gateway: single # use "highly_available" for large clusters making requests to services outside of the cluster
You can make your load balancer private to prevent your APIs from being publicly accessed. In order to access your APIs, you will need to set up VPC peering between the Cortex cluster's VPC and the VPC containing the consumers of the Cortex APIs. See the VPC peering guide for more details.
api_load_balancer_scheme: internal
You can also restrict access to your load balancers by IP address. The field accepts a list of CIDR ranges; the default, 0.0.0.0/0, allows traffic from all IPs:
api_load_balancer_cidr_white_list: [0.0.0.0/0]
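For example, to allow access only from a couple of specific ranges (the CIDRs below are placeholders):

api_load_balancer_cidr_white_list: [192.0.2.0/24, 198.51.100.27/32]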
These two fields are also available for the operator load balancer. Keep in mind that if you make the operator load balancer private, you'll need to configure VPC peering to use the cortex CLI or Python client.

operator_load_balancer_scheme: internal
operator_load_balancer_cidr_white_list: [0.0.0.0/0]
Depending on your application's requirements, you might have different needs from the cluster's API load balancer. By default, the API load balancer is a Network Load Balancer (NLB). In some situations, a Classic Load Balancer (ELB) may be preferred, and it can be selected in your cluster config. This selection can only be made before creating your cluster.

api_load_balancer_type: elb

You can take advantage of the cost savings of spot instances and the reliability of on-demand instances by utilizing the priority field in node groups. You can deploy two node groups: one spot and one on-demand. Set the priority of the spot node group higher than the priority of the on-demand node group. This encourages the cluster autoscaler to try to spin up instances from the spot node group first; if there are no more spot instances available, the on-demand node group will be used instead. For example:

node_groups:
  - name: gpu-spot
    instance_type: g4dn.xlarge
    min_instances: 0
    max_instances: 5
    spot: true
    priority: 100
  - name: gpu-on-demand
    instance_type: g4dn.xlarge
    min_instances: 0
    max_instances: 5
    priority: 1
If you plan on scaling your Cortex cluster past 300 nodes or 300 pods, it is recommended to set prometheus_instance_type to an instance type with more memory (the default is t3.medium, which has 4 GiB of memory).
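For example (t3.xlarge is just an illustrative choice; it has 16 GiB of memory):

prometheus_instance_type: t3.xlarge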
Configure your health checks to be as accurate as possible to prevent requests from being routed to pods that aren't ready to handle traffic.
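For example, a container in your API spec can expose an HTTP readiness probe. This is a minimal sketch assuming the pod/containers API spec format of recent Cortex versions and a hypothetical /healthz endpoint served on port 8080 by your container; check the API configuration reference for the exact probe fields:

pod:
  containers:
    - name: api
      image: <your-image>
      readiness_probe:
        http_get:
          path: /healthz
          port: 8080
        initial_delay_seconds: 5
        period_seconds: 5
        failure_threshold: 2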
Make sure that max_concurrency is set to match the concurrency supported by your container.

Tune max_queue_length to lower values if you would like to more aggressively redistribute requests to newer pods as your API scales up, rather than allowing requests to linger in queues. This means that the clients consuming your APIs should implement retry logic with a delay (such as exponential backoff).

Make sure to specify all of the relevant compute resources (especially cpu and memory) to ensure that your pods aren't starved for resources.
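Putting these pieces together, the relevant parts of a Realtime API spec might look something like this (a sketch assuming the pod/containers spec format of recent Cortex versions; the values are illustrative, not recommendations):

- name: my-api
  kind: RealtimeAPI
  pod:
    port: 8080
    max_concurrency: 4     # match the number of requests your container can actually process at once
    max_queue_length: 8    # a short per-replica queue; excess requests are rejected and should be retried by clients
    containers:
      - name: api
        image: <your-image>
        compute:
          cpu: 1
          mem: 2Gi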
Revisit the autoscaling docs for Realtime APIs and/or Async APIs to effectively handle production traffic by tuning the scaling rate, sensitivity, and over-provisioning.
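For reference, those knobs live in the autoscaling section of each API spec. A sketch of some of the fields you may want to tune is below (field names follow recent Cortex versions; verify them against the autoscaling docs, and note that setting target_in_flight below max_concurrency is how replicas are over-provisioned):

autoscaling:
  min_replicas: 1
  max_replicas: 20
  target_in_flight: 3               # scaling sensitivity / over-provisioning
  window: 60s
  upscale_stabilization_period: 1m
  downscale_stabilization_period: 5m
  max_upscale_factor: 1.5
  max_downscale_factor: 0.75
  upscale_tolerance: 0.05
  downscale_tolerance: 0.05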