A common language to talk to our algorithms
In ARISE we create and host many different AI algorithms for recognising species from images and sounds. How do we send requests to these algorithms, and how can YOU make sense of the results they send back?
In the ARISE Digital Species Identification team we have defined an API: a standardised way for algorithms to describe what they have localised and identified in media items. It is flexible enough to cover images, sounds, video, and more; it can represent general classification results as well as detections in "bounding boxes", and detections linked across multiple images. We use it to represent the outputs of our own algorithms for biodiversity classification on images, insect image detection, and bird sound classification. We are now making our API specification public and open, so that everyone can use it if they wish. Programmers can use it to convert our AI results into user-friendly results pages, connect them with other data, and so on. We also encourage developers of AI algorithms to use it to represent their outputs, so that our algorithms can "talk to each other" and be integrated easily.
You can find the API schema here on GitLab, as well as documentation, examples, and a validator. Technical documentation of the API can be found here: ARISE Machine prediction API - technical documentation.
Different strokes
The goal of our API definition is to cover a large number of use cases for different media (images, groups of images, video, sound, radar, etc.) and different tasks (classification, detection, segmentation, tracking, etc.). Another goal is to provide high-level compatibility with existing standards such as Camtrap DP, COCO, and Audiovisual Core, while leaving room for more complex future use cases. The format also supports concepts such as Linked Open Data, Machine Learning Provenance, and API versioning.
The basic idea of the format is that regions (geometric areas) are described in the input media. These regions are grouped into region groups. One or more predictions are given per region group.
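To make the three concepts concrete, here is a minimal sketch in Python. The class and field names (Region, RegionGroup, Prediction, box, region_ids and so on) are our own illustration and are not taken from the published schema, and the species names and probabilities are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Region:
    """A geometric area in one media item (hypothetical field names)."""
    id: str
    media_id: str
    # (x_min, y_min, x_max, y_max); None stands for the whole media item.
    box: Optional[Tuple[float, float, float, float]] = None


@dataclass
class Prediction:
    """One prediction attached to a region group."""
    task: str
    labels: List[str]
    probabilities: List[float]


@dataclass
class RegionGroup:
    """A set of regions that share one or more predictions."""
    id: str
    region_ids: List[str]
    predictions: List[Prediction] = field(default_factory=list)


# One region, grouped on its own, with a single (made-up) prediction.
region = Region(id="r1", media_id="m1")
group = RegionGroup(
    id="g1",
    region_ids=[region.id],
    predictions=[
        Prediction("species", ["Parus major", "Cyanistes caeruleus"], [0.9, 0.1])
    ],
)
```

Predictions hang off the group rather than off individual regions, which is what lets the same structure describe a whole image, a crop, or an individual followed across several images.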
The input to the API can be a single media item, a sequence (ordered), or a set (unordered) of media items. Region groups can combine regions within one media item as well as regions from different media items. The format supports multi-task predictions: for example, a single region group can carry a multi-class output, a multi-label output, and a predicted scalar.
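The difference between the three kinds of output can be sketched as follows. The keys ("task", "type", "classes", "value"), the task names, and all numbers are assumptions made for this example, not the real schema.

```python
import math

# Three made-up predictions for the same region group.
predictions = [
    # Multi-class: exactly one class applies, so probabilities sum to 1.
    {"task": "species", "type": "multiclass",
     "classes": {"Bombus terrestris": 0.7, "Apis mellifera": 0.3}},
    # Multi-label: each label is scored independently.
    {"task": "attributes", "type": "multilabel",
     "classes": {"in flight": 0.8, "on flower": 0.6}},
    # Scalar: a single number.
    {"task": "body_length_mm", "type": "scalar", "value": 14.2},
]


def is_consistent(pred):
    """Check the basic numeric constraint for each prediction type."""
    if pred["type"] == "multiclass":
        return math.isclose(sum(pred["classes"].values()), 1.0)
    if pred["type"] == "multilabel":
        return all(0.0 <= p <= 1.0 for p in pred["classes"].values())
    if pred["type"] == "scalar":
        return isinstance(pred["value"], (int, float))
    return False


checks = [is_consistent(p) for p in predictions]
```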
The simplest use-case is where the API predicts one class on a single media item. In this case there is one region corresponding to the full image, one region group consisting of the single region, and the region group has a single multi-class prediction.
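As a rough sketch, such a result could look like the structure below once loaded into Python. The key names and values are hypothetical; the published schema defines the actual names.

```python
import json

# Hypothetical result for one image with one multi-class prediction.
result = {
    "media": [{"id": "m1", "filename": "photo.jpg"}],
    # box = None stands for "the full image" in this sketch.
    "regions": [{"id": "r1", "media_id": "m1", "box": None}],
    "region_groups": [{"id": "g1", "region_ids": ["r1"]}],
    "predictions": [
        {
            "region_group_id": "g1",
            "type": "multiclass",
            "classes": {"Parus major": 0.92, "Cyanistes caeruleus": 0.08},
        }
    ],
}


def top_class(prediction):
    """Return the (name, probability) pair with the highest probability."""
    return max(prediction["classes"].items(), key=lambda item: item[1])


best_name, best_prob = top_class(result["predictions"][0])
print(json.dumps(result, indent=2))
```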
The Naturalis Nature Identification API (NIA), which supports, among others, ObsIdentify and the Norwegian species identifier ArtsOrakelet, provides a more complex use-case. The goal of NIA is to provide species predictions for people photographing flora and fauna in the field with their smartphones. Here the user can input more than one image, in which case the system provides a single prediction that integrates the predictions from the individual images. This setup assumes that each picture contains one individual, and that it is the same individual in every image. In this case we have multiple input media, each with one region corresponding to the full image. The multiple regions are part of one region group.
An upcoming upgrade of NIA will contain an auto-zoom functionality which tries to automatically localise the organism in the image, so that the algorithm can zoom in and get more detail for the prediction. It is still assumed that the input images contain one and the same individual. In this case we have multiple input media, each with two regions (the full original image, and an automatic crop of the organism). All the regions from all images form one region group. Some models in the API predict the “morph” (appearance; e.g. caterpillar, adult) of the organism in addition to the species. Two types of predictions are then made for this single region group: (1) a species prediction and (2) a morph prediction, as in the image below.
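The auto-zoom layout can be sketched like this: two regions per image, all in one region group, with two predictions on that group. The function name, key names, box coordinates, and probabilities are illustrative assumptions only.

```python
def auto_zoom_result(crops, species, morph):
    """Build a hypothetical result for several images of one individual.

    crops maps a media id to a crop box (x_min, y_min, x_max, y_max).
    """
    regions = []
    for media_id, crop_box in crops.items():
        # Region 1: the full original image (box = None in this sketch).
        regions.append({"id": f"{media_id}-full", "media_id": media_id, "box": None})
        # Region 2: the automatic crop around the organism.
        regions.append({"id": f"{media_id}-crop", "media_id": media_id,
                        "box": list(crop_box)})
    # Every region from every image goes into the same region group.
    group = {"id": "g1", "region_ids": [r["id"] for r in regions]}
    predictions = [
        {"region_group_id": "g1", "task": "species", "classes": species},
        {"region_group_id": "g1", "task": "morph", "classes": morph},
    ]
    return {"regions": regions, "region_groups": [group], "predictions": predictions}


result = auto_zoom_result(
    crops={"m1": (40, 60, 200, 260), "m2": (10, 30, 180, 220)},
    species={"Papilio machaon": 0.95, "Iphiclides podalirius": 0.05},
    morph={"caterpillar": 0.9, "adult": 0.1},
)
```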
But wait, there's more!
In the upcoming 2023 model for Europe we will take the localization of organisms in NIA one step further. In this use-case we will drop the assumption of one individual per image. In the case of multiple input media we still assume that at least one of the individuals is shared across all images. A localization algorithm will detect all individuals in a single image; these can be, for example, a bee and a flower, multiple birds of possibly different species, or multiple fungi. A matching algorithm then determines which individuals are the same across input media. Predictions (species and morph) are made per individual. In this case we have multiple input media, each with one or more regions (but no longer the full image), and the regions are grouped into region groups that correspond to individuals.
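The grouping step can be sketched as below. We assume, purely for illustration, that the matching algorithm has already tagged every detection with an "individual" label; the key names and boxes are made up, and the matching itself is not shown.

```python
from collections import defaultdict

# Hypothetical detections from two images, already matched across images.
detections = [
    {"region_id": "r1", "media_id": "m1", "box": [10, 20, 110, 140], "individual": "bee"},
    {"region_id": "r2", "media_id": "m1", "box": [0, 100, 300, 400], "individual": "flower"},
    {"region_id": "r3", "media_id": "m2", "box": [35, 15, 150, 160], "individual": "bee"},
]


def group_by_individual(detections):
    """Turn matched detections into one region group per individual."""
    members = defaultdict(list)
    for det in detections:
        members[det["individual"]].append(det["region_id"])
    # Groups are numbered in order of first appearance.
    return [
        {"id": f"g{i}", "region_ids": ids}
        for i, ids in enumerate(members.values(), start=1)
    ]


region_groups = group_by_individual(detections)
```

Here the bee appears in both images and gets a group with two regions, while the flower is only seen once and gets a group of its own.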
The next use-case comes from the DIOPSIS automatic insect monitoring project. You can read more about how we use them in the Amsterdamse Waterleidingduinen here. In this project there are sequences (ordered) of images taken at a regular interval (e.g. every 10 seconds). Each image shows a screen that holds anywhere from zero to hundreds of insects. Insects are first localized in single images, and then matched across images if they are likely to be the same individual. For each individual we perform several predictions: taxon (species), estimated biomass, and body length.
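A consumer of such results might want one row per individual insect. The sketch below assumes hypothetical key names (frame, taxon, biomass_mg, body_length_mm) and made-up values; it is not the DIOPSIS pipeline itself.

```python
# Ordered sequence of images; a region's "frame" is an index into this list.
frames = ["img_0001.jpg", "img_0002.jpg", "img_0003.jpg"]

regions = [
    {"id": "r1", "frame": 0},
    {"id": "r2", "frame": 1},
    {"id": "r3", "frame": 1},
]

# One region group per individual, with mixed class and scalar predictions.
region_groups = [
    {"id": "g1", "region_ids": ["r1", "r2"],
     "predictions": {"taxon": "Diptera", "biomass_mg": 1.8, "body_length_mm": 6.5}},
    {"id": "g2", "region_ids": ["r3"],
     "predictions": {"taxon": "Coleoptera", "biomass_mg": 12.0, "body_length_mm": 9.1}},
]


def summarise(region_groups, regions):
    """Return one row per individual with its frame range and predictions."""
    frame_of = {r["id"]: r["frame"] for r in regions}
    rows = []
    for group in region_groups:
        seen = sorted(frame_of[rid] for rid in group["region_ids"])
        rows.append({
            "individual": group["id"],
            "first_frame": frames[seen[0]],
            "last_frame": frames[seen[-1]],
            **group["predictions"],
        })
    return rows


rows = summarise(region_groups, regions)
```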