Created: 2024-04-27 Sat 09:25
This is the usual overview of what the LSST is going to be doing, and how much of it there is.
This is the LSST Science Platform Interactive Notebook Component. Basically, it's a way of letting scientists quickly iterate through hypotheses looking for the ones interesting enough to burn a lot of resources investigating.
My talk at JupyterCon 2018 (slides) is not a bad overview, if I do say so myself.
You're free to argue with me about this.
If you do, you're wrong.
Abstraction that lets you care about the application software rather than the lower layers.
We're using Docker.
Kubernetes abstractions (e.g. the Service) are designed so that we get load balancing and, in some cases, HA without working very hard for it. The Deployment manages container lifecycles, so the right number of each component is always running. And we don't have to manage the miserable Docker container-port-to-host-port mapping ourselves.
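As a sketch of what those two abstractions buy you (names and image are placeholders, not our actual manifests): a Deployment keeps N replicas alive, and a Service load-balances across them with no manual port mapping.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hub
spec:
  replicas: 2                      # k8s keeps exactly this many running
  selector:
    matchLabels:
      app: hub
  template:
    metadata:
      labels:
        app: hub
    spec:
      containers:
        - name: hub
          image: example/hub:latest   # placeholder image
          ports:
            - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: hub
spec:
  selector:
    app: hub                       # load-balances across matching pods
  ports:
    - port: 80
      targetPort: 8000
```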
This is where Kubernetes is magnificent.
You either already do provide a managed Kubernetes service or you're going to have to. The longer you wait the more it will hurt.
Again, you can argue with me. Again, you're wrong.
Kustomize, Terraform, Helm, or roll your own. Each of the first three has its advantages.
Why write your own spawner? I haven't heard a convincing reason.
No sense in starting, several years from Science First Light, with something that's already being supplanted.
You can still get the Classic Notebook view from it, if you have users with notebooks that rely on things JupyterLab doesn't have extensions for. Encourage them to write those extensions or at least open issues.
It's full of corner cases and harder than it looks.
Are you really such a special snowflake that "users are members of groups, and groups map to capabilities" won't work for you?
Wide support, good JupyterHub support, easy to add new providers.
This is our configuration.
Custom header checking/injection in an Nginx ingress, with a diversion through the OAuth2 flow, followed by passing around a JWT.
Our ingress annotations and header validation and parsing.
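Illustrative only, not our actual annotations: the general shape of nginx-ingress external-auth annotations for this kind of OAuth2 diversion. The URLs and header names here are placeholders.

```yaml
metadata:
  annotations:
    # Every request is checked against this endpoint first.
    nginx.ingress.kubernetes.io/auth-url: "https://auth.example.org/oauth2/auth"
    # Unauthenticated users get diverted into the OAuth2 flow here.
    nginx.ingress.kubernetes.io/auth-signin: "https://auth.example.org/oauth2/start?rd=$request_uri"
    # Headers (e.g. the JWT) passed along to the backend after auth.
    nginx.ingress.kubernetes.io/auth-response-headers: "Authorization,X-Auth-Request-Email"
```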
Note that Node.js has a default maximum header size of 8KB.
CILogon+NCSA IDP supports association of identities, which is a nice feature. See if your OAuth2 provider can do it.
For instance, I'm usually signed into GitHub within ten minutes of logging on somewhere.
A group is really a mapping to a set of capabilities.
Any reasonable authentication provider should be able to also do multiple group memberships for an identity.
What a user is allowed to do is the union of the capabilities of each of their groups.
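The groups-to-capabilities model is a few lines of code. A minimal sketch; the group names and capability strings are invented for illustration:

```python
# Map each group to the set of capabilities it grants.
GROUP_CAPABILITIES = {
    "users": {"exec_notebook"},
    "dask": {"exec_notebook", "spawn_dask"},
    "admin": {"exec_notebook", "spawn_dask", "admin_hub"},
}


def capabilities_for(groups):
    """What a user may do: the union of the capabilities of their groups."""
    caps = set()
    for group in groups:
        caps |= GROUP_CAPABILITIES.get(group, set())
    return caps
```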
Once you start constructing complex user environments, it's easy to leak.
Namespace teardown removes all namespaced resources; in our experience, everything but PVs.
If you have a complex set of analysis tools, your images may be very large. Ours are 16GB now.
This can take a very long time to pull.
Run something to continually pull some set of versions of your standard images. Couple this with a CI system, and by the time people show up in the morning, the new image is already pulled.
Cuts startup time from 10 minutes to 15 seconds for us.
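One common prepull pattern (a sketch of the general idea, not the LSST prepuller itself): a DaemonSet whose pod uses the large image, so every node pulls and caches it, and whose container then just sleeps. Image name is a placeholder.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prepuller
spec:
  selector:
    matchLabels:
      app: prepuller
  template:
    metadata:
      labels:
        app: prepuller
    spec:
      containers:
        - name: prepull
          image: example/lab:weekly        # placeholder for your big image
          command: ["sleep", "infinity"]   # the pull is the point; do nothing
```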
Don't take a base JupyterLab and add your software to it if your software is large.
Instead, add JupyterLab to your software.
Say, a handful of columns across a couple billion rows. (Gaia DR2, "l" and "b" columns only)
LSST DR11 final catalog size: 15PB.
By the end of the survey, much of what we would now use a batch environment for will be reasonable in an interactive, Dask-like framework. Even 15PB of catalog data?
Use the same container with a different environmental flag set to say "be a Dask worker, not a JupyterLab server."
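A minimal sketch of that dispatch, assuming an invented flag name (`DASK_WORKER`) and placeholder commands; the real entrypoint would exec whichever command this returns:

```python
import os


def choose_command(environ=os.environ):
    """Return the command to run: Dask worker or JupyterLab server,
    selected by a single environment flag set at container start."""
    if environ.get("DASK_WORKER") == "1":
        return ["dask-worker", environ.get("DASK_SCHEDULER", "")]
    return ["jupyter-labhub"]
```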
In our environment, both Jupyter machinery and Dask machinery are small compared to our analysis software.
We populate a Dask worker YAML document at each login that does the right thing. Modify it at your own risk; you're still subject to quotas.
We anticipate very few users will ever need this level of control.
Some attention to partitioning is still required.
But in the same namespace, so quotas are still easy.
It's not that scary.
This is an example for JupyterHub.
This is a JupyterHub minimal configuration wrapper that loads the (sorted) contents of a configuration directory.
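The wrapper can be very small. A sketch, assuming a `config.d`-style directory (the path and calling convention here are illustrative, not our exact code):

```python
import os


def load_config_dir(config_dir, namespace):
    """Exec every *.py file in config_dir in sorted order, all sharing one
    namespace -- JupyterHub would pass something like {'c': c}."""
    for name in sorted(os.listdir(config_dir)):
        if name.endswith(".py"):
            path = os.path.join(config_dir, name)
            with open(path) as f:
                exec(compile(f.read(), path, "exec"), namespace)


# In jupyterhub_config.py, one might then call:
# load_config_dir("/etc/jupyterhub/config.d", {"c": c})
```

Sorting the filenames gives you a predictable load order (10-foo.py before 20-bar.py), so later files can override earlier ones.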
This is one of the files it loads.
Make your ConfigMaps generic.
Put the specifics in a templated environment, or in Secrets for sensitive data.
You can use groups to control what's displayed.
Pass information into the user container and do user setup as a semiprivileged user with tightly controlled sudo.
Then start the JupyterLab server as the user, in the user's home directory.
Do not give any sudo privileges to the user.
Set up gid/groupname mappings, uid/username, and parse in the shell on the far end…
This is what we've been doing, and we've found we need to…
Here is how we do our initial Dask container template setup.
This gets silly fast. Instead, try:
Define ConfigMaps (which are namespaced) at spawn time and map them into the user's Lab container as read-only files.
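In pod-spec terms, that looks roughly like the following (a sketch; the ConfigMap name and mount path are placeholders):

```yaml
# Fragment of a user Lab pod spec: a per-user ConfigMap, created at
# spawn time, mounted read-only into the container.
volumes:
  - name: user-settings
    configMap:
      name: user-settings-<username>    # placeholder; created by the spawner
containers:
  - name: lab
    volumeMounts:
      - name: user-settings
        mountPath: /etc/user-settings   # placeholder path
        readOnly: true
```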
You just need a consistent and persistent way to assign uids/gids.
Your LDAP system should already do this. GitHub has unique 32-bit identifiers for users and groups. Google will require you to map its 64-bit IDs down to 32-bit ones.
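If your provider doesn't hand you usable 32-bit IDs, one simple scheme is a persistent first-seen-order mapping. A sketch only, with an invented file format and an arbitrary starting uid, not how LSST does it:

```python
import json
import os


def uid_for(identity_id, mapping_file="uidmap.json", first_uid=60000):
    """Assign a consistent, persistent uid to each provider identity by
    recording first-seen order in a JSON file."""
    mapping = {}
    if os.path.exists(mapping_file):
        with open(mapping_file) as f:
            mapping = json.load(f)
    key = str(identity_id)
    if key not in mapping:
        mapping[key] = first_uid + len(mapping)   # next free uid
        with open(mapping_file, "w") as f:
            json.dump(mapping, f)
    return mapping[key]
```

In production you'd want real locking around the file (or a database), but the point is just that the assignment must never change between logins.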
Works, ubiquitous, but…
"Get out of jail free."
This talk (source): https://github.com/lsst-sqre/Jupyter-PCW-2019.git.
The Notebook Aspect of the LSST Science Platform: https://github.com/lsst-sqre/nublado.git (example sources).
LSST JupyterHub Utilities: https://github.com/lsst-sqre/jupyterhubutils (prepuller and reposcanner).
LSST JupyterLab Utilities: https://github.com/lsst-sqre/jupyterlabutils (Dask cluster proxy for use in Kubernetes).
This talk: https://athornton.github.io/Jupyter-PCW-2019.
Adam Thornton <athornton@lsst.org>