Caching
Multi-model caching allows each replica to serve more models than would fit into its memory by keeping only a specified number of models in memory (and on disk) at a time. When the in-memory model limit is reached, the least recently accessed model is evicted from the cache. This can be useful when you have many models and only some of them are frequently accessed while a larger portion are rarely used, or when running on smaller instances to control costs.
The model cache is a two-layer cache, configured by the following parameters in the handler.models configuration:

- cache_size sets the number of models to keep in memory
- disk_cache_size sets the number of models to keep on disk (must be greater than or equal to cache_size)

Both of these fields must be specified, in addition to either the dir or paths field (which specifies the model paths, see models for documentation); a sample configuration is shown below. Multi-model caching is only supported if handler.processes_per_replica is set to 1 (the default value).
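As an illustrative sketch (not a complete API spec), a configuration with caching enabled might look like the following. The API name, handler path, bucket path, and cache sizes are placeholders; only the handler.models fields and processes_per_replica discussed above come from this page.

```yaml
- name: my-api            # placeholder API name
  kind: RealtimeAPI
  handler:
    type: python
    path: handler.py      # placeholder handler path
    models:
      dir: s3://my-bucket/models/   # placeholder bucket; paths can be used instead of dir
      cache_size: 10                # keep up to 10 models in memory
      disk_cache_size: 20           # keep up to 20 models on disk (>= cache_size)
    processes_per_replica: 1        # caching requires the default value of 1
```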
Out of memory errors
Cortex runs a background process every 10 seconds that counts the number of models in memory and on disk, and evicts the least recently used models if the count exceeds cache_size / disk_cache_size. If many new models are requested between executions of this process, there may be more models in memory and/or on disk than the configured cache_size or disk_cache_size limits, which could lead to out-of-memory errors.
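To make the caveat concrete, the sketch below shows a periodic LRU eviction loop under these assumptions. It is illustrative only: the data structures and function names are hypothetical and do not reflect Cortex's actual implementation, but it shows why counts can exceed the limits between runs of the background process.

```python
import time
from collections import OrderedDict

# Hypothetical tracking structure: models ordered from least to most
# recently accessed (an LRU ordering), keyed by model name.
ModelCache = "OrderedDict[str, object]"


def evict_lru(models: ModelCache, limit: int) -> None:
    """Evict least recently used entries until the count is within the limit."""
    while len(models) > limit:
        models.popitem(last=False)  # drop the least recently used model


def background_eviction_loop(memory_models: ModelCache,
                             disk_models: ModelCache,
                             cache_size: int,
                             disk_cache_size: int) -> None:
    # Runs periodically (every 10 seconds in Cortex). Any models loaded
    # between iterations can push the counts above the configured limits
    # until the next pass evicts them.
    while True:
        evict_lru(memory_models, cache_size)
        evict_lru(disk_models, disk_cache_size)
        time.sleep(10)
```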