9. Dataset Binding with Google Cloud Storage¶
We will create a GCP Cloud Function that continuously syncs a GCS bucket with a Dataloop dataset.
If you want to catch events from the GCS bucket and update the Dataloop dataset, you need to set up a Cloud Function. The function will catch the GCS bucket events and reflect them in the Dataloop platform.
If you are familiar with GCP Cloud Functions, you can simply use our integration functions below.
We assume you already have a GCP account. If you don't, follow the GCP docs and create one.
9.1. Create the Cloud Function¶
Create a GCS bucket
NOTE: This bucket should be used as the external storage for the Dataloop dataset.
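If the bucket is not yet connected to Dataloop, a minimal sketch of creating a GCS storage driver and an external dataset with the dtlpy SDK is shown below; the project name, integration ID, and bucket name are placeholders you would replace with your own values:
import dtlpy as dl
dl.login()
project = dl.projects.get(project_name='project name')  # placeholder project name
# the integration ID must reference an existing GCS integration in the project (placeholder)
driver = project.drivers.create(name='gcs-driver',
                                driver_type=dl.ExternalStorage.GCS,
                                integration_id='<integration-id>',
                                bucket_name='<bucket-name>')
# create a dataset backed by the external bucket; its ID goes into DATASET_ID later
dataset = project.datasets.create(dataset_name='gcs-dataset', driver=driver)
print(dataset.id)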
Go to Cloud Functions and click Create Function to create a new function
Basic
Environment -> 1st gen
Choose a function name
Choose the function region
Trigger
Trigger type -> Cloud Storage
Event type -> On (finalizing/creating) file in the selected bucket
Bucket -> choose the bucket you would like to auto-sync with Dataloop
Click Save
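For reference, a Cloud Storage finalize trigger passes the object's metadata to the function as a dict. A minimal sketch of such a payload (values are illustrative; the integration functions below only read the name field):
event = {
    'bucket': 'my-bucket',                      # illustrative bucket name
    'name': 'folder/image.jpg',                 # object path inside the bucket
    'contentType': 'image/jpeg',
    'timeCreated': '2023-01-01T00:00:00.000Z',
}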
Runtime, build, connections and security settings
Choose the Runtime tab
Runtime environment variables -> Add variable
Add the 3 environment variables: DATASET_ID, DTLPY_USERNAME and DTLPY_PASSWORD
To populate the values for DTLPY_USERNAME and DTLPY_PASSWORD, you'll need to create a Dataloop bot on your Dataloop project using the following code:
import dtlpy as dl
dl.login()
project = dl.projects.get(project_name='project name')
# create a bot (service account) for machine-to-machine authentication
bot = project.bots.create(name='serviceAccount', return_credentials=True)
print('username: ', bot.id)
print('password: ', bot.password)
The output maps to the environment variables as follows:
username -> DTLPY_USERNAME
password -> DTLPY_PASSWORD
After adding all the environment variables, click Next.
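Optionally, you can sanity-check the bot credentials locally before pasting them into the Cloud Function. This is a minimal sketch using the same machine-to-machine login the functions below rely on:
import dtlpy as dl
# 'bot' is the object returned by project.bots.create above
dl.login_m2m(email=bot.id, password=bot.password)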
Code
Runtime -> Python 3.7
Entry point: your function name from the code snippet (default: create_gcs)
In the requirements.txt file, add dtlpy
Copy the following code to the main.py file:
import os
os.environ["DATALOOP_PATH"] = "/tmp"
import dtlpy as dl

dataset_id = os.environ.get('DATASET_ID')
dtlpy_username = os.environ.get('DTLPY_USERNAME')
dtlpy_password = os.environ.get('DTLPY_PASSWORD')


def create_gcs(event, context):
    """Triggered by a change to a Cloud Storage bucket.
    Args:
        event (dict): Event payload.
        context (google.cloud.functions.Context): Metadata for the event.
    """
    file = event
    dl.login_m2m(email=dtlpy_username, password=dtlpy_password)
    dataset = dl.datasets.get(dataset_id=dataset_id,
                              fetch=False  # to avoid GET the dataset each time
                              )
    # strip the storage driver's root path so the item lands at the correct remote path
    driver_path = dl.drivers.get(driver_id=dataset.driver).path
    remote_path = None
    if driver_path == '/':
        driver_path = None
    if driver_path is not None and driver_path not in file['name']:
        # the object is outside the driver's path; ignore the event
        return
    if driver_path:
        remote_path = file['name'].replace(driver_path, '')
    # the 'external://' prefix links the existing bucket object instead of re-uploading it
    file_name = 'external://' + file['name']
    dataset.items.upload(local_path=file_name, remote_path=remote_path)
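Before deploying, you can smoke-test the function locally with a fake event. This sketch assumes the three environment variables are set in your shell and uses a placeholder object path:
# simulate a finalize event for a single object (placeholder path)
fake_event = {'name': 'folder/image.jpg'}
create_gcs(fake_event, None)  # the context argument is not used by the function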
Click Deploy
Add another function for delete actions
Repeat the process: create another function for the delete event, with the following code and the same settings, except:
Trigger -> Event type -> On (deleting) file in the selected bucket
Entry point: your function name from the code snippet (default: delete_gcs)
import os
os.environ["DATALOOP_PATH"] = "/tmp"
import dtlpy as dl

dataset_id = os.environ.get('DATASET_ID')
dtlpy_username = os.environ.get('DTLPY_USERNAME')
dtlpy_password = os.environ.get('DTLPY_PASSWORD')


def delete_gcs(event, context):
    """Triggered by a change to a Cloud Storage bucket.
    Args:
        event (dict): Event payload.
        context (google.cloud.functions.Context): Metadata for the event.
    """
    file = event
    dl.login_m2m(email=dtlpy_username, password=dtlpy_password)
    dataset = dl.datasets.get(dataset_id=dataset_id,
                              fetch=False  # to avoid GET the dataset each time
                              )
    driver_path = dl.drivers.get(driver_id=dataset.driver).path
    if driver_path == '/':
        driver_path = None
    if driver_path is not None and driver_path not in file['name']:
        # the object is outside the driver's path; ignore the event
        return
    if driver_path:
        remote_path = file['name'].replace(driver_path, '')
    else:
        remote_path = file['name']
    # remove the corresponding item from the Dataloop dataset
    dataset.items.delete(filename=remote_path)
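The delete function can be smoke-tested the same way, using a placeholder path for an object that already exists as an item in the dataset:
# simulate a delete event for a single object (placeholder path)
fake_event = {'name': 'folder/image.jpg'}
delete_gcs(fake_event, None)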
9.2. You’re good to go!¶