Caching

Multi-model caching allows each replica to serve more models than can fit in its memory by keeping only a specified number of models in memory (and on disk) at a time. When the in-memory model limit is reached, the least recently accessed model is evicted from the cache. This is useful when you have many models and only some of them are accessed frequently while the rest are rarely used, or when you are running on smaller instances to control costs.

The model cache is a two-layer cache, configured by the following parameters in the predictor.models configuration:

  • cache_size sets the number of models to keep in memory

  • disk_cache_size sets the number of models to keep on disk (must be greater than or equal to cache_size)

Both of these fields must be specified, in addition to either the dir or paths field, which specifies the model paths (see the models documentation). Multi-model caching is only supported if predictor.processes_per_replica is set to 1 (the default value).
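
For example, a cortex.yaml for a Realtime API using the cache might look like the following minimal sketch (the API name and S3 bucket path are placeholders):

```yaml
# cortex.yaml (illustrative; the api name and bucket path are placeholders)
- name: my-api
  kind: RealtimeAPI
  predictor:
    type: python
    path: predictor.py
    processes_per_replica: 1  # multi-model caching requires the default value of 1
    models:
      dir: s3://my-bucket/models/  # or use `paths` to list models individually
      cache_size: 5        # keep at most 5 models in memory
      disk_cache_size: 10  # keep at most 10 models on disk (must be >= cache_size)
```

With this configuration, up to 5 models are kept loaded in memory and up to 10 are kept on the replica's disk; requesting a model beyond those limits evicts the least recently accessed one.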

Out of memory errors

Cortex runs a background process every 10 seconds that counts the number of models in memory and on disk, and evicts the least recently used models if the counts exceed cache_size or disk_cache_size. If many new models are requested between executions of this process, there may temporarily be more models in memory and/or on disk than the configured cache_size or disk_cache_size limits, which can lead to out of memory errors.
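The sketch below illustrates the count-and-evict behavior described above; it is not Cortex's implementation, and all names in it are hypothetical:

```python
# Illustrative sketch of the periodic count-and-evict behavior described above.
# This is NOT Cortex's actual implementation; all names here are hypothetical.
import time
from collections import OrderedDict

CACHE_SIZE = 5        # corresponds to predictor.models.cache_size
DISK_CACHE_SIZE = 10  # corresponds to predictor.models.disk_cache_size

# OrderedDicts kept in order from least to most recently accessed
in_memory_models = OrderedDict()  # model name -> loaded model
on_disk_models = OrderedDict()    # model name -> local path

def evict_lru():
    # evict least recently used models until the counts are within the limits
    while len(in_memory_models) > CACHE_SIZE:
        in_memory_models.popitem(last=False)  # drop the LRU model from memory
    while len(on_disk_models) > DISK_CACHE_SIZE:
        on_disk_models.popitem(last=False)    # remove the LRU model from disk

def background_loop():
    # the check runs roughly every 10 seconds, so the limits can be exceeded
    # briefly if many new models are requested between runs
    while True:
        evict_lru()
        time.sleep(10)
```

Because the check is periodic rather than enforced on every request, cache_size and disk_cache_size are soft bounds rather than hard guarantees.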
