Tutorials

Data Management Tutorial

Tutorials for data management

Connect Cloud Storage

Set up an integration with GCS/S3/Azure

Connect Cloud Storage

If your data is already managed and organized on a cloud storage service such as GCS/S3/Azure, you may want to use it with Dataloop directly, rather than uploading the binaries and creating duplicates.

Cloud Storage Integration

Access & Permissions - Creating an integration with GCS/S3/Azure cloud storage requires adding a key/secret with the following permissions:

  • List (Mandatory) - allows Dataloop to list all of the items in the storage.

  • Get (Mandatory) - gets the items and performs pre-processing functionalities like thumbnails, item info, etc.

  • Put / Write (Mandatory) - lets you upload your items directly to the external storage from the Dataloop platform.

  • Delete - lets you delete your items directly from the external storage using the Dataloop platform.

Create Integration With GCS
Creating an integration with GCS requires a JSON file with your GCS configuration.
import dtlpy as dl
import json
if dl.token_expired():
    dl.login()
organization = dl.organizations.get(organization_name='my-org')
with open(r"C:\gcsfile.json", 'r') as f:
    gcs_json = json.load(f)
gcs_to_string = json.dumps(gcs_json)
organization.integrations.create(name='gcsintegration',
                                 integrations_type=dl.ExternalStorage.GCS,
                                 options={'key': '',
                                          'secret': '',
                                          'content': gcs_to_string})
Create Integration With S3
import dtlpy as dl
if dl.token_expired():
    dl.login()
organization = dl.organizations.get(organization_name='my-org')
organization.integrations.create(name='S3integration', integrations_type=dl.ExternalStorage.S3,
                                 options={'key': "my_key", 'secret': "my_secret"})
Create Integration With Azure
import dtlpy as dl
if dl.token_expired():
    dl.login()
organization = dl.organizations.get(organization_name='my-org')
organization.integrations.create(name='azureintegration',
                                 integrations_type=dl.ExternalStorage.AZUREBLOB,
                                 options={'key': 'my_key',
                                          'secret': 'my_secret',
                                          'clientId': 'my_clientId',
                                          'tenantId': 'my_tenantId'})
Storage Driver

Once you have an integration, you can set up a driver, which adds a specific bucket (and optionally with a specific path/folder) as a storage resource.

Create a Driver Using the SDK
# param name: the driver name
# param driver_type: ExternalStorage.S3, ExternalStorage.GCS , ExternalStorage.AZUREBLOB
# param integration_id: the integration id
# param bucket_name: the external bucket name
# param project_id: the project id
# param allow_external_delete: true to allow deleting files from the external storage when items are deleted in Dataloop
# param region: relevant only for s3 - the bucket region
# param storage_class: relevant only for s3
# param path: Optional. By default, path is the root folder. Path is case sensitive.
# return: driver object
import dtlpy as dl
driver = dl.drivers.create(name='driver_name', driver_type=dl.ExternalStorage.S3, integration_id='integration_id',
                           bucket_name='bucket_name', project_id='project_id',
                           allow_external_delete=True,
                           region='eu-west-1', storage_class="", path="")

Manage Datasets

Create and manage Datasets and connect them with your cloud storage

Manage Datasets

Datasets are buckets in the Dataloop system that hold a collection of data items of any type, regardless of where they are stored (Dataloop storage or external cloud storage).

Create Dataset

You can create datasets within a project. There is no limit to the number of datasets a project can have, which works well with data versioning, where datasets can be cloned and merged.

dataset = project.datasets.create(dataset_name='my-dataset-name')
Create Dataset With Cloud Storage Driver

If you’ve created an integration and driver to your cloud storage, you can create a dataset connected to that driver. A single integration (for example: S3) can have multiple drivers (per bucket or even per folder), so you need to specify that.

project = dl.projects.get(project_name='my-project-name')
# Get your drivers list
project.drivers.list().print()
# Create a dataset from a driver name. You can also create by the driver ID.
dataset = project.datasets.create(driver='my_driver_name', dataset_name="my_dataset_name")
Retrieve Datasets

You can read all datasets that exist in a project, and then access the datasets by their ID (or name).

datasets = project.datasets.list()
dataset = project.datasets.get(dataset_id='my-dataset-id')
Create Directory

A dataset can have multiple directories, allowing you to manage files by context, such as upload time, working batch, source, etc.

dataset.items.make_dir(directory="/directory/name")
Hard-copy a Folder to Another Dataset

You can clone a folder into a new dataset, but if you want to actually move a folder of files that are stored in the Dataloop system to another dataset, you'll need to download the files and upload them again to the destination dataset.

copy_annotations = True
flat_copy = False  # if true, it copies all dir files and sub dir files to the destination folder without sub directories
source_folder = '/source_folder'
destination_folder = '/destination_folder'
source_project_name = 'source_project_name'
source_dataset_name = 'source_dataset_name'
destination_project_name = 'destination_project_name'
destination_dataset_name = 'destination_dataset_name'
# Get source project dataset
project = dl.projects.get(project_name=source_project_name)
dataset_from = project.datasets.get(dataset_name=source_dataset_name)
source_folder = source_folder.rstrip('/')
# Filter to get all files of a specific folder
filters = dl.Filters()
filters.add(field='filename', values=source_folder + '/**')  # Get all items in folder (recursive)
pages = dataset_from.items.list(filters=filters)
# Get destination project and dataset
project = dl.projects.get(project_name=destination_project_name)
dataset_to = project.datasets.get(dataset_name=destination_dataset_name)
# Go over all items and copy them from src to dst
for page in pages:
    for item in page:
        # Download item (without save to disk)
        buffer = item.download(save_locally=False)
        # Give the item's name to the buffer
        if flat_copy:
            buffer.name = item.name
        else:
            buffer.name = item.filename[len(source_folder) + 1:]
        # Upload item
        print("Going to add {} to {} dir".format(buffer.name, destination_folder))
        new_item = dataset_to.items.upload(local_path=buffer, remote_path=destination_folder)
        if not isinstance(new_item, dl.Item):
            print('The file {} could not be uploaded to {}'.format(buffer.name, destination_folder))
            continue
        print("{} has been uploaded".format(new_item.filename))
        if copy_annotations:
            new_item.annotations.upload(item.annotations.list())

Data Versioning

How to manage versions

Data Versioning

Dataloop's powerful data versioning provides you with unique tools for data management - clone, merge, and slice & dice your files to create multiple versions for various applications. Sample use cases include:

  • Golden training set management

  • Reproducibility (dataset training snapshot)

  • Experimentation (creating subsets of different kinds)

  • Task/Assignment management

  • Data version "snapshot" - use the versioning feature to save data (items, annotations, metadata) before any major process. For example, a snapshot can serve as a roll-back mechanism to the original dataset in case of any error, without losing the data.

Clone Datasets

Cloning a dataset creates a new dataset with the same files as the original. Files are actually a reference to the original binary and not a new copy of the original, so your cloud data remains safe and protected. When cloning a dataset, you can add a destination dataset, remote file path, and more…

dataset = project.datasets.get(dataset_id='my-dataset-id')
dataset.clone(clone_name='clone-name',
              filters=None,
              with_items_annotations=True,
              with_metadata=True,
              with_task_annotations_status=True)
Merge Datasets

Dataset merging outcome depends on how similar or different the datasets are.

  • Cloned Datasets - items, annotations, and metadata will be merged. This means that you will see annotations from different datasets on the same item.

  • Different datasets (not clones) with similar recipes - items will be summed up, which will cause duplication of similar items.

  • Datasets with different recipes - Datasets with different default recipes cannot be merged. Use the ‘Switch recipe’ option on dataset level (3-dots action button) to match recipes between datasets and be able to merge them.

dataset_ids = ["dataset-1-id", "dataset-2-id"]
project_ids = ["dataset-1-project-id", "dataset-2-project-id"]
dataset_merge = dl.datasets.merge(merge_name="my_dataset-merge",
                                  project_ids=project_ids,
                                  dataset_ids=dataset_ids,
                                  with_items_annotations=True,
                                  with_metadata=False,
                                  with_task_annotations_status=False)

Upload and Manage Data and Metadata

Upload data items and metadata

Upload & Manage Data & Metadata

Upload specific files

When you have specific files you want to upload, you can upload them all into a dataset using this script:

import dtlpy as dl
if dl.token_expired():
    dl.login()
project = dl.projects.get(project_name='project_name')
dataset = project.datasets.get(dataset_name='dataset_name')
dataset.items.upload(local_path=[r'C:/home/project/images/John Morris.jpg',
                                 r'C:/home/project/images/John Benton.jpg',
                                 r'C:/home/project/images/Liu Jinli.jpg'],
                     remote_path='/folder_name')  # Remote path is optional, images will go to the main directory by default
Upload all files in a folder

If you want to upload all files from a folder, you can do that by just specifying the folder name:

import dtlpy as dl
if dl.token_expired():
    dl.login()
project = dl.projects.get(project_name='project_name')
dataset = project.datasets.get(dataset_name='dataset_name')
dataset.items.upload(local_path=r'C:/home/project/images',
                     remote_path='/folder_name')  # Remote path is optional, images will go to the main directory by default
Upload Items and Annotations Metadata

You can upload items as a table using a pandas DataFrame, which lets you attach information to each item (annotations, and metadata such as confidence, filename, etc.).

import pandas
import dtlpy as dl
dataset = dl.datasets.get(dataset_id='id')  # Get dataset
to_upload = list()
# First item and info attached:
to_upload.append({'local_path': r"E:\TypesExamples\000000000064.jpg",  # Item file path
                  'local_annotations_path': r"E:\TypesExamples\000000000776.json",  # Annotations file path
                  'remote_path': "/first",  # Dataset folder to upload the item to
                  'remote_name': 'f.jpg',  # Item remote name in the dataset
                  'item_metadata': {'user': {'dummy': 'fir'}}})  # Added user metadata
# Second item and info attached:
to_upload.append({'local_path': r"E:\TypesExamples\000000000776.jpg",  # Item file path
                  'local_annotations_path': r"E:\TypesExamples\000000000776.json",  # Annotations file path
                  'remote_path': "/second",  # Dataset folder to upload the item to
                  'remote_name': 's.jpg',  # Item remote name in the dataset
                  'item_metadata': {'user': {'dummy': 'sec'}}})  # Added user metadata
df = pandas.DataFrame(to_upload)  # Make data into table
items = dataset.items.upload(local_path=df,
                             overwrite=True)  # Upload table to platform

Upload and Manage Annotations

Upload annotations into data items

Upload & Manage Annotations

import dtlpy as dl
item = dl.items.get(item_id="")
annotation = item.annotations.get(annotation_id="")
annotation.metadata["user"] = True
annotation.update()
Upload User Metadata

To upload annotations from JSON and include the user metadata, add the parameter local_annotations_path to the dataset.items.upload function, like so:

project = dl.projects.get(project_name='project_name')
dataset = project.datasets.get(dataset_name='dataset_name')
dataset.items.upload(local_path=r'<items path>',
                     local_annotations_path=r'<annotation json file path>',
                     item_metadata=dl.ExportMetadata.FROM_JSON,
                     overwrite=True)
Convert Annotations To COCO Format
converter = dl.Converter()
converter.upload_local_dataset(
    from_format=dl.AnnotationFormat.COCO,
    dataset=dataset,
    local_items_path=r'C:/path/to/items',
    # Please make sure the names of the items are the same as written in the COCO JSON file
    local_annotations_path=r'C:/path/to/annotations/file/coco.json'
)
Upload Entire Directory and their Corresponding Dataloop JSON Annotations
# Local path to the items folder
# If you wish to upload items with your directory tree use : r'C:/home/project/images_folder'
local_items_path = r'C:/home/project/images_folder/*'
# Local path to the corresponding annotations - make sure the file names fit
local_annotations_path = r'C:/home/project/annotations_folder'
dataset.items.upload(local_path=local_items_path,
                     local_annotations_path=local_annotations_path)
Upload Annotations To Video Item

Uploading annotations to video items needs to account for annotations spanning multiple frames and for toggling visibility (occlusion). In this example, we will use the following CSV file, which contains a single 'person' box annotation that begins on frame 20, disappears on frame 41, reappears on frame 51, and ends on frame 90.

Video_annotations_example.CSV
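
If the example CSV file is not at hand, the short sketch below builds an equivalent file in code. The column names match the ones read by the loop further down; the frame numbers follow the description above, while the box coordinates are arbitrary, illustrative values.

import pandas as pd
# Build rows for a single 'person' box that spans frames 20-90 and is hidden (occluded) on frames 41-50
rows = [{'frame': frame,
         'annotation id': 1,  # one object id for the whole track
         'label': 'person',
         'top': 100, 'left': 150, 'bottom': 300, 'right': 250,  # illustrative coordinates
         'visible': not (41 <= frame <= 50)}
        for frame in range(20, 91)]
pd.DataFrame(rows).to_csv(r'C:/file.csv', index=False)  # same path used in the snippet below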

import pandas as pd
# Read CSV file
df = pd.read_csv(r'C:/file.csv')
# Get item
item = dataset.items.get(item_id='my_item_id')
builder = item.annotations.builder()
# Read line by line from the csv file
for i_row, row in df.iterrows():
    # Create box annotation from csv rows and add it to a builder
    builder.add(annotation_definition=dl.Box(top=row['top'],
                                             left=row['left'],
                                             bottom=row['bottom'],
                                             right=row['right'],
                                             label=row['label']),
                object_visible=row['visible'],  # Toggle annotation visibility (occlusion) from the 'visible' column
                object_id=row['annotation id'],  # Id that groups boxes of the same object across frames
                frame_num=row['frame'])
# Upload all created annotations
item.annotations.upload(annotations=builder)

Show Annotations Over Image

After uploading items and annotations with their metadata, you might want to see some of them and perform visual validation.

To see only the annotations, use the annotation type show option.

# Use the show function for all annotation types
box = dl.Box()
# Must provide all inputs
box.show(image='',
         thickness='',
         with_text='',
         height='',
         width='',
         annotation_format='',
         color='')

To see the item itself with all annotations, use the Annotations option.

# Must input an image or height and width
annotation.show(image='',
                height='', width='',
                annotation_format='dl.ViewAnnotationOptions.*',
                thickness='',
                with_text='')

Download Data, Annotations & Metadata

The item ID for a specific file can be found in the platform UI - click BROWSE for a dataset, click the selected file, and the file information will be displayed in the right-side panel. The item ID is shown there and can be copied in a single click.
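
You can also fetch an item by its ID with the SDK and inspect it programmatically. A minimal sketch (the ID below is a placeholder copied from the UI):

import dtlpy as dl
item = dl.items.get(item_id='my-item-id')  # placeholder item ID copied from the UI
print(item.id, item.filename)  # basic item details
item.open_in_web()  # open the item in the platform UI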

Download Items and Annotations

Download dataset items and annotations to a folder on your computer, in two separate subfolders. See all annotation options here.

dataset.download(local_path=r'C:/home/project/images',  # The default value is ".dataloop" folder
                 annotation_options=dl.ViewAnnotationOptions.JSON)
Multiple Annotation Options

See all annotation options here.

dataset.download(local_path=r'C:/home/project/images',  # The default value is ".dataloop" folder
                 annotation_options=[dl.ViewAnnotationOptions.MASK,
                                     dl.ViewAnnotationOptions.JSON,
                                     dl.ViewAnnotationOptions.INSTANCE])
Filter by Item and/or Annotation
  • Items filter - download filtered items based on multiple parameters, like their directory. You can also download items based on different filters. Learn all about item filters here.

  • Annotation filter - download filtered annotations based on multiple parameters, like their label. You can also download item annotations based on different filters; learn all about annotation filters here. This example will download items and JSONs from the '/dog_name' folder with the label 'dog'.

# Filter items from the '/dog_name' directory
item_filters = dl.Filters(resource='items', field='dir', values='/dog_name')
# Filter items with dog annotations
annotation_filters = dl.Filters(resource=dl.FiltersResource.ANNOTATION, field='label', values='dog')
dataset.download(local_path=r'C:/home/project/images',  # The default value is ".dataloop" folder
                 filters=item_filters,
                 annotation_filters=annotation_filters,
                 annotation_options=dl.ViewAnnotationOptions.JSON)
Filter by Annotations
  • Annotation filter - download filtered annotations based on multiple parameters, like their label. You can also download item annotations based on different filters; learn all about annotation filters here.

item = dataset.items.get(item_id="item_id")  # Get item from dataset to be able to view the dataset colors on Mask
# Filter items with dog annotations
annotation_filters = dl.Filters(resource='annotations', field='label', values='dog')
item.download(local_path=r'C:/home/project/images',  # the default value is ".dataloop" folder
              annotation_filters=annotation_filters,
              annotation_options=dl.ViewAnnotationOptions.JSON)
Download Annotations in COCO Format
  • Items filter - download filtered items based on multiple parameters, like their directory. You can also download items based on different filters; learn all about item filters here.

  • Annotation filter - download filtered annotations based on multiple parameters, like their label. You can also download item annotations based on different filters; learn all about annotation filters here.

This example will download annotations in COCO format for items from the '/dog_name' folder with the label 'dog'.

# Filter items from the '/dog_name' directory
item_filters = dl.Filters(resource='items', field='dir', values='/dog_name')
# Filter items with dog annotations
annotation_filters = dl.Filters(resource='annotations', field='label', values='dog')
converter = dl.Converter()
converter.convert_dataset(dataset=dataset,
                          to_format='coco',
                          local_path=r'C:/home/coco_annotations',
                          filters=item_filters,
                          annotation_filters=annotation_filters)

FaaS Tutorial

Tutorials for FaaS

FaaS Interactive Tutorial – Using Python & Dataloop SDK

FaaS Interactive Tutorial

FaaS Interactive Tutorial – Using Python & Dataloop SDK

Concept

Dataloop Function-as-a-Service (FaaS) is a compute service that automatically runs your code based on time patterns or in response to trigger events.

You can use Dataloop FaaS to extend other Dataloop services with custom logic. Altogether, FaaS serves as a super flexible unit that provides you with increased capabilities in the Dataloop platform and allows achieving any need while automating processes.

With Dataloop FaaS, you simply upload your code and create your functions. Following that, you can define a time interval or specify a resource event for triggering the function. When a trigger event occurs, the FaaS platform launches and manages the compute resources, and executes the function.

You can configure the compute settings according to your preferences (machine types, concurrency, timeout, etc.) or use the default settings.
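
For example, the runtime can be set when deploying a service from a package (as shown later in this tutorial). A minimal sketch, assuming an existing package; the pod type and concurrency values here are only illustrative:

service = package.services.deploy(service_name='my-service',
                                   runtime=dl.KubernetesRuntime(pod_type=dl.InstanceCatalog.REGULAR_S,  # machine type
                                                                concurrency=10))  # parallel executions per replica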

Use Cases

Pre annotation processing: Resize, video assembler, video disassembler

Post annotation processing: Augmentation, crop box-annotations, auto-parenting

ML models: Auto-detection

QA models: Auto QA, consensus model, majority vote model

Introduction

Getting started with FaaS.

Introduction

This tutorial will help you get started with FaaS.

  1. Prerequisites

  2. Basic use case: Single function

  • Deploy a function as a service

  • Execute the service manually and view the output

  3. Advanced use case: Multiple functions

  • Deploy several functions as a package

  • Deploy a service of the package

  • Set trigger events to the functions

  • Execute the functions and view the output and logs

First, log in to the platform by running the following Python code in the terminal or your IDE:

import dtlpy as dl
if dl.token_expired():
    dl.login()

Your browser will open a login screen, allowing you to enter your credentials or log in with Google. Once the “Login Successful” tab appears, you can close it.

This tutorial requires a project. You can create a new project, or alternatively use an existing one:

# Create a new project
project = dl.projects.create(project_name='project-sdk-tutorial')
# Use an existing project
project = dl.projects.get(project_name='project_name')

Let’s create a dataset to work with and upload a sample item to it:

dataset = project.datasets.create(dataset_name='dataset-sdk-tutorial')
item = dataset.items.upload(
    local_path=['https://raw.githubusercontent.com/dataloop-ai/tiny_coco/master/images/train2017/000000184321.jpg'],
    remote_path='/folder_name')

Run Your First Function

Create and run your first FaaS in the Dataloop platform

Basic Use Case: Single Function

Create and Deploy a Sample Function

Below is an image-manipulation function in Python that converts an RGB image to grayscale. The function receives a single item, which can later be used as a trigger to invoke the function:

def rgb2gray(item: dl.Item):
    """
    Function to convert RGB image to GRAY
    Will also add a modality to the original item
    :param item: dl.Item to convert
    :return: None
    """
    import numpy as np
    import cv2
    buffer = item.download(save_locally=False)
    bgr = cv2.imdecode(np.frombuffer(buffer.read(), np.uint8), -1)
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    gray_item = item.dataset.items.upload(local_path=gray,
                                          remote_path='/gray' + item.dir,
                                          remote_name=item.filename)
    # add modality
    item.modalities.create(name='gray',
                           ref=gray_item.id)
    item.update(system_metadata=True)

You can now deploy the function as a service using the Dataloop SDK. Once the service is ready, you can execute the available function on any input:

service = project.services.deploy(func=rgb2gray,
                                  service_name='grayscale-item-service')
Execute the function

An execution means running a function on a service with specific inputs (arguments). The execution's input is passed to the function it runs.

Now that the service is up, it can be executed manually (on-demand) or automatically, based on a set trigger (time/event). As part of this tutorial, we will demonstrate how to manually run the “RGB to Gray” function.
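
A minimal sketch of such a manual execution, assuming the service and item created above (keyword names may vary slightly between dtlpy versions):

execution = service.execute(function_name='rgb2gray',
                            item_id=item.id,
                            project_id=project.id)
# Optionally block until the execution finishes
execution = execution.wait()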

To see the item we uploaded, run the following code:

item.open_in_web()

Multiple Functions

Create a Package with multiple functions and modules

Advanced Use Case: Multiple Functions

Create and Deploy a Package of Several Functions

First, log in to the Dataloop platform:

import dtlpy as dl
if dl.token_expired():
    dl.login()

Let’s define the project and dataset you will work with in this tutorial. To create a new project and dataset:

project = dl.projects.create(project_name='project-sdk-tutorial')
project.datasets.create(dataset_name='dataset-sdk-tutorial')

To use an existing project and dataset:

project = dl.projects.get(project_name='project-sdk-tutorial')
dataset = project.datasets.get(dataset_name='dataset-sdk-tutorial')
Write your code

The following code consists of two image-manipulation methods:

  • RGB to grayscale over an image

  • CLAHE Histogram Equalization over an image - Contrast Limited Adaptive Histogram Equalization (CLAHE) to equalize images

To proceed with this tutorial, copy the following code and save it as a main.py file.

import dtlpy as dl
import cv2
import numpy as np
class ImageProcess(dl.BaseServiceRunner):
    @staticmethod
    def rgb2gray(item: dl.Item):
        """
        Function to convert RGB image to GRAY
        Will also add a modality to the original item
        :param item: dl.Item to convert
        :return: None
        """
        buffer = item.download(save_locally=False)
        bgr = cv2.imdecode(np.frombuffer(buffer.read(), np.uint8), -1)
        gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
        gray_item = item.dataset.items.upload(local_path=gray,
                                              remote_path='/gray' + item.dir,
                                              remote_name=item.filename)
        # add modality
        item.modalities.create(name='gray',
                               ref=gray_item.id)
        item.update(system_metadata=True)
    @staticmethod
    def clahe_equalization(item: dl.Item):
        """
        Function to perform histogram equalization (CLAHE)
        Will add a modality to the original item
        Based on opencv https://docs.opencv.org/4.x/d5/daf/tutorial_py_histogram_equalization.html
        :param item: dl.Item to convert
        :return: None
        """
        buffer = item.download(save_locally=False)
        bgr = cv2.imdecode(np.frombuffer(buffer.read(), np.uint8), -1)
        # create a CLAHE object (Arguments are optional).
        lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
        lab_planes = list(cv2.split(lab))  # convert to a list so the L channel can be replaced
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
        lab_planes[0] = clahe.apply(lab_planes[0])
        lab = cv2.merge(lab_planes)
        bgr_equalized = cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)
        bgr_equalized_item = item.dataset.items.upload(local_path=bgr_equalized,
                                                       remote_path='/equ' + item.dir,
                                                       remote_name=item.filename)
        # add modality
        item.modalities.create(name='equ',
                               ref=bgr_equalized_item.id)
        item.update(system_metadata=True)
Define the module

Multiple functions may be defined in a single package under a “module” entity. This way you will be able to use a single codebase for various services.

Here, we will create a module containing the two functions we discussed. The main.py file you saved is defined as the module entry point. Later, you will specify its directory path.

modules = [dl.PackageModule(name='image-processing-module',
                            entry_point='main.py',
                            class_name='ImageProcess',
                            functions=[dl.PackageFunction(name='rgb2gray',
                                                          description='Converting RGB to gray',
                                                          inputs=[dl.FunctionIO(type=dl.PackageInputType.ITEM,
                                                                                name='item')]),
                                       dl.PackageFunction(name='clahe_equalization',
                                                          description='CLAHE histogram equalization',
                                                          inputs=[dl.FunctionIO(type=dl.PackageInputType.ITEM,
                                                                                name='item')])
                                       ])]
Push the package

When you deployed the service in the previous tutorial (“Single Function”), a module and a package were automatically generated.

Now we will explicitly create the module and push it as a package to the Dataloop FaaS library (application hub). To do so, specify the source path (src_path) of the directory containing the main.py file you saved, and then run the following code:

src_path = 'functions/opencv_functions'
project = dl.projects.get(project_name='project-sdk-tutorial')
package = project.packages.push(package_name='image-processing',
                                modules=modules,
                                src_path=src_path)
Deploy a service

Now that the package is ready, it can be deployed to the Dataloop platform as a service. To create a service from a package, you need to define which module the service will serve. Notice that a service can only contain a single module. All the module functions will be automatically added to the service.

Multiple services can be deployed from a single package. Each service can get its own configuration: a different module and settings (computing resources, triggers, UI slots, etc.).

In our example, there is only one module in the package. Let’s deploy the service:

service = package.services.deploy(service_name='image-processing',
                                  runtime=dl.KubernetesRuntime(concurrency=32),
                                  module_name='image-processing-module')
Trigger the service

Once the service is up, we can configure a trigger to automatically run the service functions. When you bind a trigger to a function, that function will execute when the trigger fires. The trigger is defined by a given time pattern or by an event in the Dataloop system.

An event-based trigger is defined by a combination of a resource and an action. A resource can be any entity in our system (item, dataset, annotation, etc.), and the associated action defines the change in the resource that fires the trigger (create, update, delete). You can only have one resource per trigger.

The resource object that triggered the function will be passed as the function’s parameter (input).

Let’s set a trigger in the event a new item is created:

filters = dl.Filters()
filters.add(field='datasetId', values=dataset.id)
trigger = service.triggers.create(name='image-processing2',
                                  function_name='clahe_equalization',
                                  execution_mode=dl.TriggerExecutionMode.ONCE,
                                  resource=dl.TriggerResource.ITEM,
                                  actions=dl.TriggerAction.CREATED,
                                  filters=filters)

In the defined filters we specified a dataset. Once a new item is uploaded (created) in this dataset, the CLAHE function will be executed for this item. You can also add filters to specify the item type (image, video, JSON, directory, etc.) or a certain format (jpeg, jpg, WebM, etc.).
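
For instance, a sketch of a filter that restricts the trigger to JPEG images in the dataset (the mimetype field sits under the item's system metadata):

filters = dl.Filters()
filters.add(field='datasetId', values=dataset.id)
# Only fire the trigger for JPEG images
filters.add(field='metadata.system.mimetype', values='image/jpeg')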

A separate trigger must be set for each function in your service. Now we will define a trigger for the second function in the module, rgb2gray. Each time an item is updated, the rgb2gray function will be invoked:

trigger = service.triggers.create(name='image-processing-rgb',
                                  function_name='rgb2gray',
                                  execution_mode=dl.TriggerExecutionMode.ALWAYS,
                                  resource=dl.TriggerResource.ITEM,
                                  actions=dl.TriggerAction.UPDATED,
                                  filters=filters)

To trigger the function only once (only on the first item update), set TriggerExecutionMode.ONCE instead of TriggerExecutionMode.ALWAYS.

Execute the function

Now we can upload (“create”) an image to our dataset to trigger the service. The function clahe_equalization will be invoked:

item = dataset.items.upload(
    local_path=['https://raw.githubusercontent.com/dataloop-ai/tiny_coco/master/images/train2017/000000463730.jpg'])

Review the function’s logs

You can review the execution log history to check that your execution succeeded:

service.log()

The transformed image will be saved in your dataset. Once you see in the log that the execution succeeded, you may open the item to see its transformation:

item.open_in_web()
Pause the service

We recommend pausing the service you created for this tutorial so it will not be triggered:

service.pause()

Congratulations! You have successfully created, deployed, and tested Dataloop functions!

Model Management

Tutorials for creating and managing models and snapshots

Introduction

Getting started with Model Management.

Model Management

Introduction

Dataloop's Model Management provides machine learning engineers with the ability to manage their research and production processes.

We want to introduce Dataloop entities to create, manage, view, compare, restore, and deploy training sessions.

Our Model Management separates the model code, the weights and configuration, and the data.

In offline mode, there is no need for any code integration with Dataloop - just create Model and Snapshot entities, and you can start managing your work on the platform and create reproducible training:

  • same configurations and dataset to reproduce the training

  • view project/org models and snapshots in the platform

  • view training metrics and results

  • compare experiments

NOTE: In offline mode, functions from the codebase can be used in FaaS and pipelines only as custom functions; the user must create a FaaS and expose those functions in whatever way they'd like.

Online mode: In online mode, you can easily train and deploy your models anywhere on the platform. All you need to do is create a Model Adapter class and expose some functions to build an API between Dataloop and your model. After that, you can easily add model blocks to pipelines, add UI slots in the studio, enable one-button training, and more.

TODO: add more documentation in the Adapter function and maybe some example

Model and Snapshot entities
Model

The Model entity is essentially the algorithm - the architecture of the model, e.g. Yolov5, Inception, SVM, etc.

  • In online mode, it should contain the Model Adapter to create a Dataloop API

TODO: add the module attributes

Snapshot

Using the Model (architecture), the Dataset and Ontology (data and labels), and a configuration (a dictionary), we can create a Snapshot of a training process. The Snapshot contains the weights and any other artifacts needed to load the trained model.

A snapshot can be used as a parent of another snapshot - to start from that point (fine-tuning and transfer learning).

Buckets and Codebase
  1. local

  2. item

  3. git

  4. GCS
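
As a rough sketch, using only constructors that appear later in this tutorial, the model code can live in an Item or Git codebase and the snapshot weights in an Item bucket:

import dtlpy as dl
project = dl.projects.get('MyProject')
# Item codebase - pack a local directory and store it as items on the platform
item_codebase = project.codebases.pack(directory='/path/to/codebase')
# Git codebase - point to a repository and a tag/branch
git_codebase = dl.GitCodebase(git_url='github.com/mygit', git_tag='v25.6.93')
# Item bucket - stores snapshot artifacts (e.g. weights) as items
bucket = dl.buckets.create(dl.BucketType.ITEM)
bucket.upload('/path/to/weights')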

The Model Adapter

The Model Adapter is a Python class that creates a single API between the Dataloop platform and your model. It covers:

  1. Train

  2. Predict

  3. load/save model weights

  4. annotation conversion if needed

We enable two modes of work. In offline mode, everything is local: you don't have to upload any model code or weights to the platform, so the platform integration is minimal. For example, you cannot use the Model Management components in a pipeline, cannot easily create a button interface for your model's inference, and so on. In online mode - once you build an adapter - our platform can interact with your model and trained snapshots, so you can connect buttons and slots inside the platform to create, train, run inference, etc., and connect the model and any trained snapshot to the UI or add them to a pipeline.

Create a Model and Snapshot

Create a Model with a Dataloop Model Adapter

Create Your own Model and Snapshot

We will create a dummy model adapter in order to build our Model and Snapshot entities. NOTE: This is an example of a torch model adapter. It will NOT run as-is; for working examples, please refer to our models on GitHub.

The following class inherits from dl.BaseModelAdapter, which has all the Dataloop methods for interacting with the Model and Snapshot. There are four model-related methods that the creator must implement for the adapter to expose the API to Dataloop.

import dtlpy as dl
import torch
import os
class SimpleModelAdapter(dl.BaseModelAdapter):
    def load(self, local_path, **kwargs):
        print('loading a model')
        self.model = torch.load(os.path.join(local_path, 'model.pth'))
    def save(self, local_path, **kwargs):
        print('saving a model to {}'.format(local_path))
        torch.save(self.model, os.path.join(local_path, 'model.pth'))
    def train(self, data_path, output_path, **kwargs):
        print('running a training session')
    def predict(self, batch, **kwargs):
        print('predicting batch of size: {}'.format(len(batch)))
        preds = self.model(batch)
        return preds

Now we can create our Model entity with an Item codebase.

project = dl.projects.get('MyProject')
codebase: dl.ItemCodebase = project.codebases.pack(directory='/path/to/codebase')
model = project.models.create(model_name='first-git-model',
                              description='Example from model creation tutorial',
                              output_type=dl.AnnotationType.CLASSIFICATION,
                              tags=['torch', 'inception', 'classification'],
                              codebase=codebase,
                              entry_point='dataloop_adapter.py',
                              )

To create a Model with Git code, simply change the codebase to a Git one:

project = dl.projects.get('MyProject')
codebase: dl.GitCodebase = dl.GitCodebase(git_url='github.com/mygit', git_tag='v25.6.93')
model = project.models.create(model_name='first-model',
                              description='Example from model creation tutorial',
                              output_type=dl.AnnotationType.CLASSIFICATION,
                              tags=['torch', 'inception', 'classification'],
                              codebase=codebase,
                              entry_point='dataloop_adapter.py',
                              )

Creating a snapshot (here with an Item bucket holding the weights):

bucket = dl.buckets.create(dl.BucketType.ITEM)
bucket.upload('/path/to/weights')
snapshot = model.snapshots.create(snapshot_name='tutorial-snapshot',
                                  description='first snapshot we uploaded',
                                  tags=['pretrained', 'tutorial'],
                                  dataset_id=None,
                                  configuration={'weights_filename': 'model.pth'
                                                 },
                                  project_id=model.project.id,
                                  bucket=bucket,
                                  labels=['car', 'fish', 'pizza']
                                  )

Building the model adapter and calling one of the adapter's methods:

adapter = model.build()
adapter.load_from_snapshot(snapshot=snapshot)
adapter.train()

Using Dataloop’s Dataset Generator

Use the SDK and the Dataset Tools to iterate, augment and serve the data to your model

Dataloop Dataloader

A dl.Dataset image and annotation generator for training and for item visualization.

We can visualize the data with augmentations for debugging and exploration. After that, we will use the Data Generator as an input to the training functions.

from dtlpy.utilities import DatasetGenerator
import dtlpy as dl
dataset = dl.datasets.get(dataset_id='611b86e647fe2f865323007a')
dataloader = DatasetGenerator(data_path='train',
                              dataset_entity=dataset,
                              annotation_type=dl.AnnotationType.BOX)
Object Detection Examples

We can visualize a random item from the dataset:

for i in range(5):
    dataloader.visualize()

Or get the same item using its index:

for i in range(5):
    dataloader.visualize(10)

Adding augmentations using the imgaug library:

from imgaug import augmenters as iaa
import numpy as np
augmentation = iaa.Sequential([
    iaa.Resize({"height": 256, "width": 256}),
    # iaa.Superpixels(p_replace=(0, 0.5), n_segments=(10, 50)),
    iaa.flip.Fliplr(p=0.5),
    iaa.flip.Flipud(p=0.5),
    iaa.GaussianBlur(sigma=(0.0, 0.8)),
])
tfs = [
    augmentation,
    np.copy,
    # transforms.ToTensor()
]
dataloader = DatasetGenerator(data_path='train',
                              dataset_entity=dataset,
                              annotation_type=dl.AnnotationType.BOX,
                              transforms=tfs)
dataloader.visualize()
dataloader.visualize(10)

All of the Data Generator options (from the function docstring):

:param dataset_entity: dl.Dataset entity
:param annotation_type: dl.AnnotationType - type of annotation to load from the annotated dataset
:param filters: dl.Filters - filtering entity to filter the dataset items
:param data_path: Path to Dataloop annotations (root to "item" and "json")
:param overwrite:
:param label_to_id_map: dict - {label_string: id} dictionary
:param transforms: Optional transform to be applied on a sample. list or torchvision.Transform
:param num_workers:
:param shuffle: Whether to shuffle the data (default: True). If set to False, sorts the data in alphanumeric order.
:param seed: Optional random seed for shuffling and transformations
:param to_categorical: convert label id to categorical format
:param class_balancing: if True - perform random over-sampling with class ids as the target to balance training data
:param return_originals: bool - if True, return ALSO images and annotations before transformations (for debug)
:param ignore_empty: bool - if True, the generator will NOT collect items without annotations

The output of a single element is a dictionary holding all the relevant information. The keys for the DataGen above are: ['image_filepath', 'item_id', 'box', 'class', 'labels', 'annotation_filepath', 'image', 'annotations', 'orig_image', 'orig_annotations']

print(list(dataloader[0].keys()))
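
For example, a generator that loads only items from a given folder, over-samples to balance classes, and skips unannotated items could be built like this (a sketch based on the parameters listed above):

filters = dl.Filters(field='dir', values='/train')
balanced_dataloader = DatasetGenerator(data_path='train',
                                       dataset_entity=dataset,
                                       filters=filters,
                                       annotation_type=dl.AnnotationType.BOX,
                                       shuffle=True,
                                       class_balancing=True,
                                       ignore_empty=True)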

We'll add the flag to also return the original items, to better understand what the augmentations look like. Let's set the flag and plot:

import matplotlib.pyplot as plt
dataloader = DatasetGenerator(data_path='train',
                              dataset_entity=dataset,
                              annotation_type=dl.AnnotationType.BOX,
                              return_originals=True,
                              shuffle=False,
                              transforms=tfs)
fig, ax = plt.subplots(2, 2)
for i in range(2):
    item_element = dataloader[np.random.randint(len(dataloader))]
    ax[i, 0].imshow(item_element['image'])
    ax[i, 0].set_title('After Augmentations')
    ax[i, 1].imshow(item_element['orig_image'])
    ax[i, 1].set_title('Before Augmentations')
Segmentation Examples

First, we'll load a semantic segmentation dataset and view some images and the output structure:

dataset = dl.datasets.get(dataset_id='6197985a104eb81cb728e4ac')
dataloader = DatasetGenerator(data_path='semantic',
                              dataset_entity=dataset,
                              transforms=tfs,
                              return_originals=True,
                              annotation_type=dl.AnnotationType.SEGMENTATION)
for i in range(5):
    dataloader.visualize()

Visualize original vs augmented image and annotations mask:

fig, ax = plt.subplots(2, 4)
for i in range(2):
    item_element = dataloader[np.random.randint(len(dataloader))]
    ax[i, 0].imshow(item_element['orig_image'])
    ax[i, 0].set_title('Original Image')
    ax[i, 1].imshow(item_element['orig_annotations'])
    ax[i, 1].set_title('Original Annotations')
    ax[i, 2].imshow(item_element['image'])
    ax[i, 2].set_title('Augmented Image')
    ax[i, 3].imshow(item_element['annotations'])
    ax[i, 3].set_title('Augmented Annotations')

Convert to a 3D one-hot encoding to visualize the binary mask per label. We will plot only the first 8 labels (there might be more on the item):

item_element = dataloader[np.random.randint(len(dataloader))]
annotations = item_element['annotations']
unique_labels = np.unique(annotations)
one_hot_annotations = np.arange(len(dataloader.id_to_label_map)) == annotations[..., None]
print('unique label indices in the item: {}'.format(unique_labels))
print('unique labels in the item: {}'.format([dataloader.id_to_label_map[i] for i in unique_labels]))
plt.figure()
plt.imshow(item_element['image'])
fig = plt.figure()
for i_label_ind, label_ind in enumerate(unique_labels[:8]):
    ax = fig.add_subplot(2, 4, i_label_ind + 1)
    ax.imshow(one_hot_annotations[:, :, label_ind])
    ax.set_title(dataloader.id_to_label_map[label_ind])