Issues This KB Resolves
- Pulling data from an external system without workload identity mapping
- Accessing Kaggle datasets from a Kubeflow as a Service (KFaaS) deployment
Summary
Often, during a pipeline step or from a notebook, we need to access an external API. On GCP this can normally be done with workload identity. If we are unable to use our service account tokens directly, we can instead query the API using stored credentials. This is not fully secure: ideally you should not store any credentials on a cluster, but if you do need to store credentials as a secret and use the token for a project without strict security requirements, this is the best approach.
How does this work?
This process works with both the vanilla pipeline and Kale ways of adding a secret to a pipeline step. It creates a secret, mounts that secret on any pod carrying the kaggle-secret: "true" label, uses the mounted secret to pull the specified dataset from the Kaggle API, and saves the data locally on the volume at a specified path.
The Process
1. Create the secret for our credentials so we can mount them to our pods
kubectl create secret generic kaggle-secret --from-literal=KAGGLE_USERNAME=<username> --from-literal=KAGGLE_KEY=<api_token>
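To confirm the secret was created, assuming it lives in your profile namespace (replace <namespace> with yours), you can run:
kubectl get secret kaggle-secret -n <namespace>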
2. Create a PodDefault resource to mount the secret to any pod with a specific label (in our case, kaggle-secret: "true")
apiVersion: "kubeflow.org/v1alpha1"
kind: PodDefault
metadata:
  name: kaggle-access
spec:
  selector:
    matchLabels:
      kaggle-secret: "true"
  desc: "kaggle-access"
  volumeMounts:
  - name: secret-volume
    mountPath: /secret/kaggle
  volumes:
  - name: secret-volume
    secret:
      secretName: kaggle-secret
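Assuming you saved the manifest above as kaggle-access.yaml (the filename is just an example), apply it to your profile namespace:
kubectl apply -f kaggle-access.yaml -n <namespace>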
3. Leverage our download_kaggle_dataset function without passing credentials directly, since they are already mounted at the expected path.
def download_kaggle_dataset(data_set: str, path: str):
    import os
    import glob

    # Read the credentials that the PodDefault mounted into the pod
    with open('/secret/kaggle/KAGGLE_KEY', 'r') as file:
        kaggle_key = file.read().rstrip()
    with open('/secret/kaggle/KAGGLE_USERNAME', 'r') as file:
        kaggle_user = file.read().rstrip()
    os.environ['KAGGLE_USERNAME'] = kaggle_user
    os.environ['KAGGLE_KEY'] = kaggle_key

    # The kaggle package authenticates on import, so import it only after
    # the credentials are in the environment
    import kaggle

    # Download and unzip the dataset into the requested path
    os.chdir(os.environ.get('HOME'))
    os.makedirs(path, exist_ok=True)
    os.chdir(path)
    os.system("kaggle datasets download -d " + data_set + " --unzip")

    # Rename the first CSV in the download to data.csv for easier calling
    csv = glob.glob('*.csv')[0]
    print(csv)
    os.rename(csv, "data.csv")
4. Last but not least, we need to make sure our pipeline step has the right label. Easy enough.
get_data = (
    download_kaggle_dataset_op("l3llff/-dark-souls-3-weapon", "/mnt/data")
    .add_pvolumes({"/mnt": vop.volume})
    .add_pod_label("kaggle-secret", "true")
)
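Note that vop above refers to a volume the KB does not define. A minimal sketch of how such a volume could be created with the KFP v1 DSL (the step name, PVC name, and size are placeholders):
import kfp.dsl as dsl

# Create a PersistentVolumeClaim that pipeline steps can share via /mnt
vop = dsl.VolumeOp(
    name="create-data-volume",
    resource_name="kaggle-data-pvc",
    size="1Gi",
    modes=dsl.VOLUME_MODE_RWO,
)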
If you are using Kale and running Kubeflow 1.5 or later (at the time of this writing, KFaaS runs 1.4.3), you can add the label via a DeployConfig parameter: navigate to the configuration type you want to add and expand that group, then for each section you want to configure, click the + Add button and fill in the form.
If you are on 1.4.3 or earlier, deploy the PodDefault resource and then create a notebook with access to the Kaggle API. From that notebook, run the Kaggle function manually and skip the pipeline step; the data will be downloaded to the notebook's volume and remain available across your pipeline.
def download_kaggle_dataset(data_set: str, path: str):
    import os
    import glob

    # Read the credentials that the PodDefault mounted into the pod
    with open('/secret/kaggle/KAGGLE_KEY', 'r') as file:
        kaggle_key = file.read().rstrip()
    with open('/secret/kaggle/KAGGLE_USERNAME', 'r') as file:
        kaggle_user = file.read().rstrip()
    os.environ['KAGGLE_USERNAME'] = kaggle_user
    os.environ['KAGGLE_KEY'] = kaggle_key

    # The kaggle package authenticates on import, so import it only after
    # the credentials are in the environment
    import kaggle

    # Download and unzip the dataset into the requested path
    os.chdir(os.environ.get('HOME'))
    os.makedirs(path, exist_ok=True)
    os.chdir(path)
    os.system("kaggle datasets download -d " + data_set + " --unzip")

    # Rename the first CSV in the download to data.csv for easier calling
    csv = glob.glob('*.csv')[0]
    print(csv)
    os.rename(csv, "data.csv")

download_kaggle_dataset("l3llff/-dark-souls-3-weapon", "./")