When a Cog Docker image is run, it serves an HTTP API for making predictions. For more information, take a look at the documentation for deploying models.
First, build your model:
```shell
cog build -t my-model
```
Then, start the Docker container:
```shell
# If your model uses a CPU:
docker run -d --name my-model -p 5001:5000 my-model

# If your model uses a GPU:
docker run -d --name my-model -p 5001:5000 --gpus all my-model

# If you're on an M1 Mac:
docker run -d --name my-model -p 5001:5000 --platform=linux/amd64 my-model
```
The server is now running locally on port 5001.
To view the OpenAPI schema, open localhost:5001/openapi.json in your browser or fetch it with cURL:
```shell
curl http://localhost:5001/openapi.json
```
To stop the server, run:
```shell
docker kill my-model
```
## `GET /openapi.json`

Returns the OpenAPI specification of the API, which is derived from the input and output types specified in your model's `Predictor` and `Training` objects.
## `POST /predictions`

Make a single prediction. The request body should be a JSON object with the following fields:

- `input`: a JSON object with the same keys as the arguments to the `predict()` function. Any `File` or `Path` inputs are passed as URLs.
- `output_file_prefix`: a base URL to upload output files to.
The response is a JSON object with the following fields:
- `status`: either `succeeded` or `failed`.
- `output`: the return value of the `predict()` function.
- `error`: if `status` is `failed`, the error message.
For example:
```http
POST /predictions HTTP/1.1
Content-Type: application/json; charset=utf-8

{
    "input": {
        "image": "https://example.com/image.jpg",
        "text": "Hello world!"
    }
}
```
Responds with:
```http
HTTP/1.1 200 OK
Content-Type: application/json

{
    "status": "succeeded",
    "output": "data:image/png;base64,..."
}
```
Or, with curl:
```shell
curl -X POST \
    -H "Content-Type: application/json" \
    -d '{"input": {"image": "https://example.com/image.jpg", "text": "Hello world!"}}' \
    http://localhost:5001/predictions
```
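The same request can be made from a script. Here's a minimal sketch using Python and the third-party `requests` library; the `image` and `text` inputs are just the example inputs from above, and what your model actually accepts depends on its `predict()` signature:

```python
import base64

import requests

# Make a synchronous prediction against the local server.
resp = requests.post(
    "http://localhost:5001/predictions",
    json={
        "input": {
            "image": "https://example.com/image.jpg",
            "text": "Hello world!",
        }
    },
)
resp.raise_for_status()
prediction = resp.json()

if prediction["status"] == "succeeded":
    output = prediction["output"]
    # File outputs come back as data URIs, e.g. "data:image/png;base64,..."
    if isinstance(output, str) and output.startswith("data:"):
        encoded = output.split(",", 1)[1]
        with open("output.png", "wb") as f:
            f.write(base64.b64decode(encoded))
else:
    print("Prediction failed:", prediction.get("error"))
```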
### Asynchronous predictions

Make a single prediction without waiting for the prediction to complete.

Callers can specify an HTTP header of `Prefer: respond-async` when calling the `POST /predictions` endpoint. If provided, the request will return immediately after starting the prediction, with an HTTP `202 Accepted` status and a prediction object in status `processing`.
```http
POST /predictions HTTP/1.1
Content-Type: application/json; charset=utf-8
Prefer: respond-async

{
    "input": {"prompt": "A picture of an onion with sunglasses"}
}
```
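The same request from Python, again sketched with `requests` (the prompt is illustrative):

```python
import requests

# Start a prediction without waiting for it to complete.
resp = requests.post(
    "http://localhost:5001/predictions",
    headers={"Prefer": "respond-async"},
    json={"input": {"prompt": "A picture of an onion with sunglasses"}},
)

# The server returns as soon as the prediction has started.
assert resp.status_code == 202
print(resp.json()["status"])  # "processing"
```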
The only supported mechanism for receiving updates on the status of predictions started asynchronously is via webhooks. There is as yet no support for polling for prediction status.
Note 1: While this allows clients to create predictions "asynchronously," Cog can only run one prediction at a time, and it is currently the caller's responsibility to make sure that earlier predictions are complete before new ones are created.
Note 2: Predictions created asynchronously use a different mechanism for file upload than those created using the synchronous API. You must specify an `--upload-url` when running the Cog server process. All uploads will be `PUT` using the provided `--upload-url` as a prefix, in much the same way that `output_file_prefix` works. There is currently no single upload mechanism that works the same way for both synchronous and asynchronous prediction creation. This will be addressed in a future version of Cog.
### Webhooks

Clients can (and should, if a prediction is created asynchronously) provide a `webhook` parameter at the top level of the prediction request, e.g.
```http
POST /predictions HTTP/1.1
Content-Type: application/json; charset=utf-8
Prefer: respond-async

{
    "input": {"prompt": "A picture of an onion with sunglasses"},
    "webhook": "https://example.com/webhook/prediction"
}
```
Cog will make requests to the supplied URL, with the current state of the prediction object in the request body. Requests are made when specific events occur during the prediction, namely:

- `start`: immediately on prediction start
- `output`: each time a prediction generates an output (note that predictions can generate multiple outputs)
- `logs`: each time log output is generated by a prediction
- `completed`: when the prediction reaches a terminal state (succeeded/canceled/failed)
Requests for event types `output` and `logs` will be sent at most once every 500ms. This interval is currently not configurable. Requests for event types `start` and `completed` will be sent immediately.
By default, Cog will send requests for all event types. Clients can change which events trigger webhook requests by specifying `webhook_events_filter` in the prediction request. For example, if you only wanted requests to be sent at the start and end of the prediction, you would provide:
```http
POST /predictions HTTP/1.1
Content-Type: application/json; charset=utf-8
Prefer: respond-async

{
    "input": {"prompt": "A picture of an onion with sunglasses"},
    "webhook": "https://example.com/webhook/prediction",
    "webhook_events_filter": ["start", "completed"]
}
```
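On the receiving end, a webhook handler just needs to accept POSTed JSON. Below is a minimal sketch using only the Python standard library; the port and path are arbitrary and would need to match the `webhook` URL you supply:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Cog POSTs the current state of the prediction object as JSON.
        length = int(self.headers["Content-Length"])
        prediction = json.loads(self.rfile.read(length))
        print(prediction["status"], prediction.get("output"))
        self.send_response(200)
        self.end_headers()

HTTPServer(("", 8080), WebhookHandler).serve_forever()
```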
## `PUT /predictions/<prediction_id>`

Make a single prediction.

This is the idempotent version of the `POST /predictions` endpoint. If you call it multiple times with the same ID (for example, because of a network interruption) and the prediction is still running, the request will not create further predictions but will wait for the original prediction to complete.
Note: It is currently the caller's responsibility to ensure that the supplied prediction ID is unique. We recommend you use base32-encoded UUID4s (stripped of any padding characters) to ensure forward compatibility: these will be 26 ASCII characters long.
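For instance, an ID of that form can be generated in Python like so (a sketch; `requests` and the server URL carry over from the examples above):

```python
import base64
import uuid

import requests

# Base32-encode a UUID4 and strip the "=" padding: 26 ASCII characters.
prediction_id = base64.b32encode(uuid.uuid4().bytes).decode().rstrip("=")

resp = requests.put(
    f"http://localhost:5001/predictions/{prediction_id}",
    json={"input": {"prompt": "A picture of an onion with sunglasses"}},
)
print(prediction_id, resp.json()["status"])
```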
### Asynchronous predictions

Make a single prediction without waiting for the prediction to complete.

Callers can specify an HTTP header of `Prefer: respond-async` when calling the `PUT /predictions/<prediction_id>` endpoint. If provided, the request will return immediately after starting the prediction, with an HTTP `202 Accepted` status and a prediction object in status `processing`.
This is the idempotent version of the asynchronous `POST /predictions` endpoint. If you call it multiple times with the same ID (for example, because of a network interruption) and the prediction is still running, the request will not create further predictions. The caller will receive a `202 Accepted` response with the initial state of the prediction.
Note 1: As above, it is the caller's responsibility to ensure that the supplied prediction ID is unique.
Note 2: As noted earlier, Cog can only run one prediction at a time, and it is the caller's responsibility to make sure that earlier predictions are complete before new ones (with new IDs) are created.
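Because the endpoint is idempotent, a caller can safely retry the same request after a connection failure. A sketch of that pattern:

```python
import requests

def start_prediction(prediction_id: str, model_input: dict, retries: int = 3) -> dict:
    """Start a prediction, retrying with the same ID if the connection drops."""
    for attempt in range(retries):
        try:
            resp = requests.put(
                f"http://localhost:5001/predictions/{prediction_id}",
                headers={"Prefer": "respond-async"},
                json={"input": model_input},
            )
            resp.raise_for_status()
            # A retry with the same ID does not start a second prediction;
            # the server responds 202 Accepted with the prediction's state.
            return resp.json()
        except requests.ConnectionError:
            if attempt == retries - 1:
                raise
```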
## `POST /predictions/<prediction_id>/cancel`

While an asynchronous prediction is running, clients can cancel it by making a request to `POST /predictions/<prediction_id>/cancel`. The prediction `id` must have been supplied when creating the prediction. Predictions created without a supplied `id` field will not be cancelable.
For example, if a prediction is created with:

```http
POST /predictions HTTP/1.1
Content-Type: application/json; charset=utf-8
Prefer: respond-async

{
    "id": "abcd1234",
    "input": {"prompt": "A picture of an onion with sunglasses"}
}
```
it can be canceled with:

```http
POST /predictions/abcd1234/cancel HTTP/1.1
```
Use of the cancelation API to cancel predictions started "synchronously" (i.e. without the `Prefer: respond-async` header) is currently not supported. This may change in a future release of Cog.
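Putting the pieces together, a client that creates a prediction with a known ID can cancel it later. A sketch with `requests`, reusing the example ID from above:

```python
import requests

base = "http://localhost:5001/predictions"

# Start a prediction asynchronously with a caller-supplied ID...
requests.post(
    base,
    headers={"Prefer": "respond-async"},
    json={
        "id": "abcd1234",
        "input": {"prompt": "A picture of an onion with sunglasses"},
    },
)

# ...then cancel it while it is still running.
resp = requests.post(f"{base}/abcd1234/cancel")
print(resp.status_code)
```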