Models
Model directory format
Whenever a model path is specified in an API configuration file, it should be a path to an S3 prefix which contains your exported model. Directories may include a single model, or multiple folders each with a single model (note that a "single model" need not be a single file; there can be multiple files for a single model). When multiple folders are used, the folder names must be integer values, and will be interpreted as the model version. Model versions can be any integer, but are typically integer timestamps. It is always assumed that the highest version number is the latest version of your model.
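The version rule above can be sketched in a few lines of Python. This is only an illustration of the rule, not Cortex's implementation: `latest_version` is a hypothetical helper, the folder names are made up, and it assumes non-negative integer folder names.

```python
from typing import Iterable, Optional


def latest_version(folder_names: Iterable[str]) -> Optional[int]:
    """Return the highest integer folder name, or None if there is none.

    Hypothetical helper: versioned folders are named with integers, and
    the highest integer is treated as the latest version.
    """
    versions = [
        int(name) for name in folder_names if name.isascii() and name.isdigit()
    ]
    return max(versions) if versions else None


# Illustrative folder names (integer timestamps), not taken from a real bucket.
print(latest_version(["1217194597", "2434389194"]))  # 2434389194

# A directory without integer-named folders holds a single, unversioned model.
print(latest_version(["model.onnx"]))  # None
```

Note that the comparison is numeric, so a folder named 10 is newer than a folder named 2.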
Each predictor type expects a different model format:
Python
For the Python predictor, any model structure is accepted. Here is an example:
or for a versioned model:
TensorFlow
For the TensorFlow predictor, the model path must be a SavedModel export:
or for a versioned model:
Inferentia
When Inferentia models are used, the directory structure is slightly different:
or for a versioned model:
ONNX
For the ONNX predictor, the model path must contain a single *.onnx file:
or for a versioned model:
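The single-file rule for ONNX model directories can be checked with a short Python sketch. `has_single_onnx_file` is a hypothetical helper and the file names are made up; for a versioned model, the same check would apply to each version folder.

```python
from fnmatch import fnmatchcase
from typing import List


def has_single_onnx_file(file_names: List[str]) -> bool:
    """Return True if exactly one file name matches *.onnx."""
    matches = [name for name in file_names if fnmatchcase(name, "*.onnx")]
    return len(matches) == 1


# Illustrative file listings for one model directory (or one version folder).
print(has_single_onnx_file(["model.onnx"]))                # True
print(has_single_onnx_file(["a.onnx", "b.onnx"]))          # False
print(has_single_onnx_file(["saved_model.pb"]))            # False
```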
Single model
The most common pattern is to serve a single model per API. The path to the model is specified in the path field in the predictor.models configuration. For example:
For the Python predictor type, the models field is named multi_model_reloading instead. You can also omit the multi_model_reloading section entirely, since you can download and load the model in your predictor's __init__() function. However, you must use the path field to take advantage of live model reloading.
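The naming difference between predictor types can be illustrated by building the predictor section as a Python dict that mirrors the structure of the configuration file. This is a sketch under assumptions: `predictor_config` is a hypothetical helper, the `type` key and the S3 path are illustrative, and only the `models`, `multi_model_reloading`, and `path` names come from the text above.

```python
from typing import Any, Dict


def predictor_config(predictor_type: str, model_path: str) -> Dict[str, Any]:
    """Build a predictor section that points at a single model.

    The model path lives under "multi_model_reloading" for the Python
    predictor and under "models" for the other predictor types.
    """
    if predictor_type == "python":
        models_key = "multi_model_reloading"
    else:
        models_key = "models"
    # The "type" key is an assumption made for this illustration.
    return {"type": predictor_type, models_key: {"path": model_path}}


# Hypothetical S3 prefix, used only for illustration.
path = "s3://my-bucket/models/text-generator/"
print(predictor_config("onnx", path))
print(predictor_config("python", path))
```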
Multiple models
It is possible to serve multiple models from a single API. The paths to the models are specified in the API configuration, via either the models.paths or the models.dir field in the predictor configuration. For example:
or:
For the Python predictor type, the models field is named multi_model_reloading instead. You can also omit the multi_model_reloading section entirely, since you can download and load the models in your predictor's __init__() function. However, you must use the models field to take advantage of live model reloading or multi-model caching.
When using the models.paths field, each path must be a valid model directory (see above for valid model directory structures).
When using the models.dir field, the directory provided may contain multiple subdirectories, each of which is a valid model directory. For example:
In this case, there are two models in the directory, one of which is named "text-generator", and the other is named "sentiment-analyzer".
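How model names follow from subdirectory names can be sketched in Python. `model_names` is a hypothetical helper, and the bucket, prefix, and file names are made up; only the two model names come from the text above.

```python
from typing import Iterable, List


def model_names(dir_prefix: str, keys: Iterable[str]) -> List[str]:
    """List the first-level subdirectory names found under dir_prefix.

    Each subdirectory is treated as one model, named after the subdirectory.
    """
    prefix = dir_prefix.rstrip("/") + "/"
    names = set()
    for key in keys:
        if key.startswith(prefix):
            rest = key[len(prefix):]
            # Only keys nested inside a subdirectory belong to a model.
            if "/" in rest:
                names.add(rest.split("/", 1)[0])
    return sorted(names)


# Hypothetical object keys under a models.dir prefix.
keys = [
    "s3://my-bucket/models/text-generator/model.onnx",
    "s3://my-bucket/models/sentiment-analyzer/1217194597/model.onnx",
    "s3://my-bucket/models/sentiment-analyzer/2434389194/model.onnx",
]
print(model_names("s3://my-bucket/models/", keys))
# ['sentiment-analyzer', 'text-generator']
```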
Live model reloading
Live model reloading is a mechanism that periodically checks for updated models in the model path(s) provided in predictor.models. It is automatically enabled for all predictor types, including the Python predictor type (as long as model paths are specified via multi_model_reloading in the predictor configuration).
The following is a list of events that will trigger the API to update its model(s):
A new model is added to the model directory.
A model is removed from the model directory.
A model changes its directory structure.
A file in the model directory is updated in-place.
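The events listed above can be pictured as the difference between two snapshots of the model directory. The following Python sketch is not how Cortex detects changes; it assumes a hypothetical snapshot format that maps "model-name/relative/file/path" to a content marker such as an ETag or a last-modified timestamp.

```python
from typing import Dict, List

# Hypothetical snapshot format: "model-name/relative/path" -> content marker.
Snapshot = Dict[str, str]


def _model_of(key: str) -> str:
    """Return the model name, i.e. the first path component of the key."""
    return key.split("/", 1)[0]


def diff_models(old: Snapshot, new: Snapshot) -> Dict[str, List[str]]:
    """Classify models as added, removed, or changed between two snapshots."""
    old_models = {_model_of(k) for k in old}
    new_models = {_model_of(k) for k in new}
    changed = []
    for model in sorted(old_models & new_models):
        old_files = {k: v for k, v in old.items() if _model_of(k) == model}
        new_files = {k: v for k, v in new.items() if _model_of(k) == model}
        # Differing keys mean the directory structure changed; differing
        # markers mean a file was updated in place.
        if old_files != new_files:
            changed.append(model)
    return {
        "added": sorted(new_models - old_models),
        "removed": sorted(old_models - new_models),
        "changed": changed,
    }


old = {"text-generator/model.onnx": "v1", "sentiment-analyzer/model.onnx": "v1"}
new = {"text-generator/model.onnx": "v2", "translator/model.onnx": "v1"}
result = diff_models(old, new)
print(result)
```

Any non-empty entry in the result corresponds to one of the events that would trigger a reload.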
Usage varies based on the predictor type:
Python
To use live model reloading with the Python predictor, the model path(s) must be specified in the API's predictor configuration, via the multi_model_reloading field (the Python predictor's name for the models field). When models are specified in this manner, your PythonPredictor class must implement the load_model() function, and models can be retrieved by using the get_model() method of the python_client that's passed into your predictor's constructor.
The load_model() function that you implement in your PythonPredictor can return anything that you need to make a prediction. There is one caveat: whatever the return value is, it must be unloadable from memory via the del keyword. The following frameworks have been tested to work:
PyTorch (CPU & GPU)
ONNX (CPU & GPU)
Sklearn/MLFlow (CPU)
Numpy (CPU)
Pandas (CPU)
Caffe (not tested, but should work on CPU & GPU)
Python data structures containing these types are also supported (e.g. lists and dicts).
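One way to read the del caveat is that the returned object must be freed once the last reference to it is dropped. The sketch below demonstrates that behavior on CPython with a weak reference; `DummyModel` is a hypothetical stand-in for whatever load_model() returns.

```python
import gc
import weakref


class DummyModel:
    """Hypothetical stand-in for the object returned by load_model()."""

    def __init__(self, weights):
        self.weights = weights


model = DummyModel(weights=[0.1, 0.2, 0.3])
tracker = weakref.ref(model)  # observes the object without keeping it alive

del model     # drop the only strong reference
gc.collect()  # also collect any reference cycles

print(tracker() is None)  # True once the object has been freed
```

An object that registers itself in a module-level cache or another long-lived structure would still be referenced after del, so it would not be freed this way.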
The load_model() function takes a single argument, which is a path (on disk) to the model to be loaded. Your load_model() function is called behind the scenes by Cortex when you call the python_client's get_model() method. Cortex is responsible for downloading your model from S3 onto the local disk before calling load_model() with the local path. Whatever load_model() returns will be the exact return value of python_client.get_model(). Here is the schema for python_client.get_model():
Here's an example:
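A minimal, self-contained sketch of this flow follows. Several parts are assumptions made so the example can run on its own: `StubPythonClient` is a hypothetical stand-in for the client that Cortex passes in, its `get_model(model_name, model_version)` signature and the `config` constructor argument are assumed rather than taken from this page, the local path is assumed to be a directory, and the `model.pkl` file name is made up.

```python
import os
import pickle
import tempfile
from typing import Any, Optional


class PythonPredictor:
    def __init__(self, config, python_client):
        # Keep the client so predict() can fetch the model on each request.
        self.client = python_client

    def load_model(self, model_path: str) -> Any:
        # model_path is a local path; the model.pkl file name is an assumption.
        with open(os.path.join(model_path, "model.pkl"), "rb") as f:
            return pickle.load(f)

    def predict(self, payload):
        # get_model() returns exactly what load_model() returned.
        model = self.client.get_model()
        return model[payload["key"]]


class StubPythonClient:
    """Hypothetical stand-in for the python_client that Cortex provides."""

    def __init__(self, local_model_dir: str):
        self.local_model_dir = local_model_dir
        self.predictor = None

    def get_model(
        self, model_name: Optional[str] = None, model_version: str = "latest"
    ) -> Any:
        # A real client would download from S3 and honor the name and version;
        # this stub ignores both and loads from a local directory.
        return self.predictor.load_model(self.local_model_dir)


with tempfile.TemporaryDirectory() as model_dir:
    # The "model" here is a plain dict, pickled to disk for illustration.
    with open(os.path.join(model_dir, "model.pkl"), "wb") as f:
        pickle.dump({"a": 1, "b": 2}, f)
    client = StubPythonClient(model_dir)
    predictor = PythonPredictor(config={}, python_client=client)
    client.predictor = predictor
    result = predictor.predict({"key": "b"})

print(result)  # 2
```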
When multiple models are being served in an API, python_client.get_model() can accept a model name:
python_client.get_model() can also accept a model version if a version other than the highest is desired:
TensorFlow
When using the TensorFlow predictor, inference is performed by using the predict() method of the tensorflow_client that's passed to the predictor's constructor:
For example:
When multiple models are being served in an API, tensorflow_client.predict() can accept a model name:
tensorflow_client.predict() can also accept a model version if a version other than the highest is desired:
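The three call forms described in this section (default model, named model, named model with an explicit version) can be sketched as follows. `StubTensorFlowClient` is a hypothetical stand-in for the real client, and the parameter names `model_input`, `model_name`, and `model_version`, the `"latest"` default, and the `config` constructor argument are assumptions rather than details confirmed by this page.

```python
from typing import Any, Dict, Optional


class StubTensorFlowClient:
    """Hypothetical stand-in for the tensorflow_client that Cortex provides."""

    def predict(
        self,
        model_input: Dict[str, Any],
        model_name: Optional[str] = None,
        model_version: str = "latest",
    ) -> Dict[str, Any]:
        # A real client would run inference; this stub reports what it was asked.
        return {
            "input": model_input,
            "model_name": model_name,
            "model_version": model_version,
        }


class TensorFlowPredictor:
    def __init__(self, tensorflow_client, config):
        self.client = tensorflow_client

    def predict(self, payload):
        # Optional routing fields in the payload are an illustrative convention.
        model_name = payload.get("model_name")
        model_version = payload.get("model_version", "latest")
        return self.client.predict(payload["input"], model_name, model_version)


predictor = TensorFlowPredictor(StubTensorFlowClient(), config={})

# Single-model API: no name or version needed.
print(predictor.predict({"input": {"text": "hello"}}))

# Multi-model API: pick a model by name, optionally pinning a version.
print(predictor.predict({"input": {"text": "hello"}, "model_name": "text-generator"}))
print(
    predictor.predict(
        {
            "input": {"text": "hello"},
            "model_name": "text-generator",
            "model_version": "1217194597",
        }
    )
)
```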
Note: when using Inferentia models with the TensorFlow predictor, live model reloading is only supported if predictor.processes_per_replica is set to 1 (the default value).
ONNX
When using the ONNX predictor, inference is performed by using the predict() method of the onnx_client that's passed to the predictor's constructor:
For example:
When multiple models are being served in an API, onnx_client.predict() can accept a model name:
onnx_client.predict() can also accept a model version if a version other than the highest is desired:
You can also retrieve information about the model by calling the onnx_client's get_model() method (it supports model name and model version arguments, like its predict() method). This can be useful for retrieving the model's input/output signatures. For example, self.client.get_model() might look like this: