The Arrikto Way
The overall goal when accessing an external data source (i.e. a data lake) is reproducibility for both one-off and recurring jobs. Inevitably we will need access to an external source, because you do not want to store 30 PB of data on your local Rok. What you DO want to store are your datasets for pipelines. Currently, we (Arrikto) see data lakes as a dumping ground for raw data without purpose.

During the exploration stage, a data scientist will query this data to determine which features are important or which gaps are present in the data itself. They can choose to bring the data locally and run some exploratory steps within their notebook (often with something like Pandas). That means the notebook service account (default-editor) will need access to the IAM role for the external service.

Eventually a data scientist will want to gather insights from the data and train their model. They will define their pipeline in the notebook, kick off the pipeline, and generate a model. They COULD choose to use the data they have been exploring by bringing it locally and saving it in a format supported by their pipeline job (such as a CSV). That means the data will be passed to the pipeline, and the data scientist can return to the EXACT SAME notebook to iterate on the data.

If they want to pull data automatically, the service account our pipeline pods leverage (pipeline-runner) will need to be mapped to an IAM role with access to the external data lake service. Then the data can be pulled into the pipeline run as a pipeline step. The data pulled during the run WILL NOT be present on the original notebook that started the pipeline. The data WILL be saved in a snapshot that holds that data locally, if you choose a snapshot URL after the data ingestion step. The user can then create a NEW notebook from that snapshot with the data. We have now enabled data lineage!
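The CSV hand-off described above can be sketched as a pair of pipeline steps. A minimal sketch: the step names, file path, and stand-in data below are illustrative placeholders, not part of any Arrikto API.

```python
import pandas as pd

def ingest_step(output_path: str) -> None:
    # In a real pipeline this step would query the external data lake
    # (e.g. BigQuery) using the pipeline-runner service account's IAM role.
    # A tiny stand-in DataFrame keeps the sketch runnable locally.
    df = pd.DataFrame({"year": [2007, 2008], "birth_count": [100, 200]})
    df.to_csv(output_path, index=False)

def train_step(input_path: str) -> pd.DataFrame:
    # Downstream steps consume the CSV artifact, not the data lake itself,
    # so a later run can reproduce this step from a snapshot alone.
    return pd.read_csv(input_path)

ingest_step("lake_extract.csv")
df = train_step("lake_extract.csv")
print(df.shape)  # (2, 2)
```

Because the downstream step only reads the CSV artifact, any notebook or run restored from a snapshot containing that file can repeat the step without touching the data lake.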
They could opt out of that data ingestion step and use the same data from the previous run they are exploring on the new notebook, to completely reproduce and iterate on an IDENTICAL pipeline run. We allow you to automatically create reproducible runs using an external data source. If something seems off after a specific run, you CAN (and probably should) revisit the exact dataset from a previous run without a dependency on the external data lake. The future state is a dataset registry, but for now you can share a notebook containing the data from the data lake across organizations or departments without needing to grant access to that external data lake. You can also snapshot a volume with the pre-pulled data and share the dataset that way.
Accessing BigQuery on GCP for Data Lineage and Exploration
1. Create a GCP service account:
gcloud iam service-accounts create GSA_NAME \
    --project=GSA_PROJECT
2. Grant the service account the required role on the project:

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member "serviceAccount:GSA_NAME@GSA_PROJECT.iam.gserviceaccount.com" \
    --role "ROLE_NAME"
3. Allow the default-editor Kubernetes service account to impersonate the GCP service account:

gcloud iam service-accounts add-iam-policy-binding GSA_NAME@GSA_PROJECT.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/default-editor]"
4. Annotate the default-editor Kubernetes service account with the GCP service account:

kubectl annotate serviceaccount default-editor \
    --namespace NAMESPACE \
    iam.gke.io/gcp-service-account=GSA_NAME@GSA_PROJECT.iam.gserviceaccount.com
5. Deploy a test pod, setting serviceAccountName to the annotated Kubernetes service account (default-editor in the steps above):

apiVersion: v1
kind: Pod
metadata:
  name: workload-identity-test
  namespace: NAMESPACE
spec:
  containers:
  - image: google/cloud-sdk:slim
    name: workload-identity-test
    command: ["sleep", "infinity"]
  serviceAccountName: KSA_NAME
  nodeSelector:
    iam.gke.io/gke-metadata-server-enabled: "true"
6. From inside the test pod, confirm the metadata server returns the expected GCP service account:

curl -H "Metadata-Flavor: Google" http://169.254.169.254/computeMetadata/v1/instance/service-accounts/
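The same identity can also be checked from Python via Application Default Credentials. A sketch, assuming google-auth is available (it is installed alongside the BigQuery client library); the function name is a hypothetical placeholder:

```python
def adc_project():
    # With Workload Identity configured, google.auth resolves credentials
    # through the GKE metadata server to the mapped GCP service account.
    # Imported inside the function so the sketch loads even where
    # google-auth is not installed.
    import google.auth
    credentials, project = google.auth.default()
    return project
```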
Now, for BigQuery, we need to run some additional steps from the notebook:
pip install --upgrade 'google-cloud-bigquery[bqstorage,pandas]'

%load_ext google.cloud.bigquery

%%bigquery
SELECT source_year AS year, COUNT(is_male) AS birth_count
FROM `bigquery-public-data.samples.natality`
GROUP BY year
ORDER BY year DESC
LIMIT 15
You should see a table built from the publicly available BigQuery natality data.
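The same query can be run without the cell magic, using the BigQuery client library directly, which is the form you would use inside a pipeline step. A sketch, assuming credentials are picked up via Workload Identity; the function name is a hypothetical placeholder:

```python
NATALITY_SQL = """
SELECT source_year AS year, COUNT(is_male) AS birth_count
FROM `bigquery-public-data.samples.natality`
GROUP BY year
ORDER BY year DESC
LIMIT 15
"""

def query_natality():
    # Credentials come from the environment: the default-editor (notebook)
    # or pipeline-runner (pipeline) service account via Workload Identity.
    # Imported inside the function so the sketch loads without the library.
    from google.cloud import bigquery
    client = bigquery.Client()
    # to_dataframe() requires the [pandas] extra installed above.
    return client.query(NATALITY_SQL).to_dataframe()
```

The resulting pandas DataFrame can then be saved as a CSV artifact for downstream pipeline steps, as described earlier.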