What is a Data Lake?
A Data Lake is a data source that contains cleaned, curated, and validated data that serves as the single source of truth for all data processing. Multiple users, systems, and data processes will pull from this Data Lake. Writing to the Data Lake by end users is often restricted to ensure data integrity.
How Does the Data Lake Fit into Kubeflow MLOps?
Enterprises may have many Data Scientists or other MLOps practitioners who all need to access the single source of truth for their experimentation in Jupyter Notebooks and ultimately in Kubeflow Pipelines. Enterprises maintain this Data Lake setup so that users do not need to concern themselves with data preparation or integrity. Having all data loaded into a Data Lake, as close to the developer and the execution process as possible, improves time to production. This also reduces overhead in the system: users do not need to continually load their own data, because it is all preloaded and the system is warm and waiting.
How Are Environments Compartmentalized in Kubeflow?
Namespaces are the Kubernetes concept that Kubeflow uses to compartmentalize the work of either a single individual or multiple individuals. Data Scientists working in Kubeflow are assigned to either a private namespace or a shared namespace. Namespaces host the workspace volume (where the code and libraries are stored) and the data volumes (where the data is stored) for the user(s). Namespaces also host the Rok Snapshots that are created during Notebook snapshotting or during the execution of Kubeflow Pipelines.
What Are the Differences Between Volumes and Snapshots?
Snapshots are immutable, versioned, and not in the critical I/O path. They can be shared within a namespace by default and across namespaces by using Rok Registry. Snapshots contain both workspace and data volumes. In contrast, volumes (also referred to as K8s Volumes or Persistent Volumes) are mutable and not versioned, so changes are possible and history can be lost. These volumes are either created empty from scratch or - and this is important - cloned from an immutable Snapshot. Volumes can be RWO (ReadWriteOnce) or RWX (ReadWriteMany). Volumes should never cross namespaces on their own; this is a core Kubernetes security tenet. Volumes should only cross namespaces as part of a Snapshot that is being shared with Rok Registry.
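As a rough illustration of the two access modes, the sketch below builds a PersistentVolumeClaim manifest as a plain Python dict. The helper name `pvc_manifest`, the claim name `data-lake`, and the `10Gi` size are assumptions made for this example; only the manifest fields themselves come from the Kubernetes PVC schema.

```python
# Kubernetes spells the access modes out in full inside the PVC spec.
ACCESS_MODES = {"RWO": "ReadWriteOnce", "RWX": "ReadWriteMany"}


def pvc_manifest(name, namespace, mode, size="10Gi"):
    """Build a PersistentVolumeClaim manifest as a plain dict."""
    if mode not in ACCESS_MODES:
        raise ValueError(f"unsupported access mode: {mode!r}")
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        # A PVC is namespaced: it lives in exactly one namespace.
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "accessModes": [ACCESS_MODES[mode]],
            "resources": {"requests": {"storage": size}},
        },
    }


# Hypothetical names: an empty RWX data volume in a private namespace.
data_lake = pvc_manifest("data-lake", "kubeflow-joe", "RWX")
```

The dict can be dumped to YAML or passed to a Kubernetes client. An RWX claim like this one may be mounted by several pods at once, while an RWO claim is limited to a single node.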
How to Facilitate a Data Lake Snapshot in a Private Namespace?
For a single user in a private namespace, this can be solved by using an RWX volume, which can be attached to multiple pods in the cluster. Because the user is the only person in the namespace, the RWX volume can be attached to each Notebook Server that the user creates and to any of the pipeline steps they create.
How to Facilitate a Data Lake Snapshot in a Shared Namespace?
For multiple users in a shared namespace, this can be solved by using an RWX volume, which can be attached to multiple pods in the cluster. Because the users are all in the same namespace, the RWX volume can be attached to each Notebook Server that the users create. However, the preferred approach is to take a snapshot of the desired volume and clone the snapshot for use in Jupyter Notebooks. This way the source volume is treated as a data release checkpoint and is not subject to inadvertent global modifications.
How to Facilitate a Data Lake Snapshot Across Namespaces?
For single or multiple users, this can be solved by using Rok Registry to share the Rok Snapshot between namespaces and then cloning the volume in the destination namespace. This is the only way to do this that abides by Kubernetes security tenets and ensures that there is a versioned history of the data.
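The snapshot-then-clone approach can be sketched with the standard Kubernetes `dataSource` field on a PersistentVolumeClaim. This assumes the snapshot is exposed as a CSI `VolumeSnapshot` object in the same namespace; Rok may use its own mechanism for cloning, so treat the manifest as an illustration of the idea rather than as Rok's interface. The helper name, claim names, snapshot name, and namespace are hypothetical.

```python
def clone_pvc_from_snapshot(clone_name, snapshot_name, namespace, size="10Gi"):
    """Build a PVC manifest whose contents are cloned from a snapshot."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": clone_name, "namespace": namespace},
        "spec": {
            # Each user gets a private, writable copy of the data.
            "accessModes": ["ReadWriteOnce"],
            # Assumption: the snapshot is a CSI VolumeSnapshot in this namespace.
            "dataSource": {
                "apiGroup": "snapshot.storage.k8s.io",
                "kind": "VolumeSnapshot",
                "name": snapshot_name,
            },
            "resources": {"requests": {"storage": size}},
        },
    }


# Hypothetical: one clone per user, all from the same release checkpoint.
clones = [
    clone_pvc_from_snapshot(f"data-lake-{user}", "data-lake-release-1", "kubeflow-team")
    for user in ("joe", "taylor")
]
```

Every clone starts from the same immutable snapshot, so a change made in one user's volume never reaches the source volume or another user's copy.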
An example flow is below:
- User “joe” in namespace “kubeflow-joe” takes a snapshot of a volume and stores the snapshot in Rok bucket “bucket-joe”.
- User “joe” publishes the bucket (“bucket-joe”) the snapshot lives in.
- User “taylor” in namespace “kubeflow-taylor” creates a Rok bucket “bucket-from-joe” and subscribes it to bucket-joe.
- User “taylor” in namespace “kubeflow-taylor” is able to see inside bucket-from-joe the snapshot that user “joe” created in “bucket-joe” which lives in namespace “kubeflow-joe”.
- User “taylor” creates a notebook in namespace “kubeflow-taylor” and attaches a PVC that is a clone of the aforementioned snapshot (the one that joe originally created in namespace “kubeflow-joe”).
With this approach the cloned volume is ephemeral and gets cleaned up at the end of usage, the history is captured in the Rok Snapshot, and the migration history is captured in Rok Registry. The result is a complete audit trail and a reproducible environment.
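The publish, subscribe, and clone flow above can be modeled with a small in-memory sketch. This is a toy model and not the Rok or Rok Registry API: the `Snapshot` and `Bucket` classes, their methods, and the snapshot and file names are all hypothetical and exist only to show why the snapshot stays unchanged while the clone is free to change.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Immutable record of a volume's contents (toy model)."""
    name: str
    files: Tuple[str, ...]


@dataclass
class Bucket:
    """Toy stand-in for a bucket; not the Rok API."""
    name: str
    namespace: str
    published: bool = False
    snapshots: List[Snapshot] = field(default_factory=list)
    upstream: Optional["Bucket"] = None

    def subscribe_to(self, other: "Bucket") -> None:
        # Only a published bucket may be seen from another namespace.
        if not other.published:
            raise PermissionError(f"{other.name} is not published")
        self.upstream = other

    def visible(self) -> List[Snapshot]:
        # A subscriber sees the publisher's snapshots plus its own.
        inherited = self.upstream.snapshots if self.upstream else []
        return list(inherited) + self.snapshots


def clone_volume(snapshot: Snapshot) -> List[str]:
    """Return a fresh mutable volume; the snapshot itself never changes."""
    return list(snapshot.files)


# joe snapshots a volume into bucket-joe and publishes the bucket.
bucket_joe = Bucket("bucket-joe", "kubeflow-joe")
snap = Snapshot("data-lake-v1", ("train.csv", "test.csv"))
bucket_joe.snapshots.append(snap)
bucket_joe.published = True

# taylor subscribes bucket-from-joe and can now see joe's snapshot.
bucket_from_joe = Bucket("bucket-from-joe", "kubeflow-taylor")
bucket_from_joe.subscribe_to(bucket_joe)

# taylor clones the snapshot into a volume for a notebook.
volume = clone_volume(bucket_from_joe.visible()[0])
volume.append("scratch.csv")  # the change stays in the clone
```

In this model the only thing that crosses the namespace boundary is the immutable snapshot; the writable volume is always created inside the destination namespace.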
How to Facilitate External Data Lake Access?
With Workload Identity on GKE or IAM Roles for Service Accounts (IRSA) on AWS you can grant service-specific access: only the Kubernetes service accounts you choose are allowed to access your data lake. Since we are bringing the data in locally, we map the service account of the namespace (default-editor) to an IAM role within the cloud. This allows you to access the external data lake for data exploration. Once the data is ready to go (features selected and data frames created) you can bring the data in locally, create a local Rok Snapshot to version everything, and kick off your pipeline. This way, your notebook is a hermetically sealed pipeline definition. Kale and Rok use Kubeflow components and immutable snapshots to make the entire workflow (data and all) shareable and subscribable. Any data scientist can then reproduce and iterate on your pipeline without needing access to your internal data lake. Your work becomes far more accessible to peer reviewers, and your entire data science stack is versioned.
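The mapping between the namespace's service account and a cloud identity is expressed as an annotation on the ServiceAccount object. The sketch below builds that manifest as a Python dict. The annotation keys are the ones used by IRSA on EKS and by Workload Identity on GKE; the helper name, the account ID, and the role name are placeholders invented for this example.

```python
# Annotation keys that bind a Kubernetes service account to a cloud identity.
IDENTITY_ANNOTATIONS = {
    "aws": "eks.amazonaws.com/role-arn",      # IRSA on EKS
    "gcp": "iam.gke.io/gcp-service-account",  # Workload Identity on GKE
}


def service_account_manifest(namespace, provider, identity, name="default-editor"):
    """Build a ServiceAccount manifest annotated with a cloud identity."""
    if provider not in IDENTITY_ANNOTATIONS:
        raise ValueError(f"unknown provider: {provider!r}")
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {IDENTITY_ANNOTATIONS[provider]: identity},
        },
    }


# Placeholder ARN: substitute the IAM role that may read your data lake.
sa = service_account_manifest(
    "kubeflow-joe", "aws", "arn:aws:iam::111122223333:role/data-lake-reader"
)
```

This covers only the Kubernetes side of the mapping. The IAM role (or Google service account) must also be configured in the cloud to trust this specific service account before pods in the namespace can read from the data lake.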