Add Notebook-scoped packages for command submits or notebook job run #1321
Conversation
Signed-off-by: Federico Manuel Gomez Peter <federico.gomez@payclip.com>
tejassp-db
left a comment
Have you taken this into account? The Databricks docs state: "Starting with Databricks Runtime 13.0 %pip commands do not automatically restart the Python process. If you install a new package or update an existing package, you may need to use dbutils.library.restartPython() to see the new packages."
- https://docs.databricks.com/aws/en/libraries/notebooks-python-libraries#manage-libraries-with-pip-commands
- https://docs.databricks.com/aws/en/libraries/restart-python-process
Also, have you considered the recommendation on driver node sizes for notebook-scoped libraries?
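For reference, the pattern those docs describe is an install immediately followed by a restart in the notebook (the package name here is just a placeholder):

%pip install some-package
dbutils.library.restartPython()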
sd-db
left a comment
Thanks for the PR. I took a look at the code changes (didn't review the tests yet) and added comments.
logger.debug("Submitting Python model using the Command API.")

# Prepare code with notebook-scoped package installation if needed
code_to_execute = self._prepare_code_with_packages(compiled_code)
We should ensure we only call this path if notebook_scoped_libraries is set.
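A minimal sketch of that guard, using the names already in this diff:

if parsed_model.config.notebook_scoped_libraries:
    code_to_execute = self._prepare_code_with_packages(compiled_code)
else:
    code_to_execute = compiled_code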
    parsed_model.config.packages if parsed_model.config.notebook_scoped_libraries else []
)

def _prepare_code_with_packages(self, compiled_code: str) -> str:
Instead of duplicating this, it would be better to extract it into a reusable function. Maybe something like:
def prepare_code_with_packages(compiled_code: str, packages: list[str], separator: str = "\n\n") -> str:
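One possible body for that helper, folding in the restartPython() suggestion below; this is a sketch of the suggested refactor, not the final implementation:

def prepare_code_with_packages(compiled_code: str, packages: list[str], separator: str = "\n\n") -> str:
    # No packages configured: return the model code unchanged
    if not packages:
        return compiled_code
    # Build the notebook-scoped install, then restart Python so the
    # newly installed packages are visible to the model code
    pip_install_cmd = "%pip install " + " ".join(packages)
    restart_cmd = "dbutils.library.restartPython()"
    return separator.join([pip_install_cmd, restart_cmd, compiled_code])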
# Only add packages to cluster-level libraries if not using notebook-scoped
if not notebook_scoped_libraries:
    for package in packages:
        if index_url:
Let us look to add support for index_url in notebook-scoped packages as well. Since this is supported today, it would be confusing if a model silently broke when a user who has set index_url switches to notebook-scoped packages.
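A sketch of how the install string could carry the index URL through pip's --index-url flag; the helper name and its wiring into the submitter are hypothetical:

def build_pip_install_cmd(packages: list[str], index_url: str | None = None) -> str:
    # Mirror the cluster-level library behavior by passing the custom index to pip
    parts = ["%pip install", "-q"]
    if index_url:
        parts.append(f"--index-url {index_url}")
    parts.extend(packages)
    return " ".join(parts)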
logger.debug(f"Adding notebook-scoped package installation: {pip_install_cmd}")

# Prepend the pip install command to the compiled code
return f"{pip_install_cmd}\n\n# COMMAND ----------\n{compiled_code}"
As recommended by @tejassp-db, let us add dbutils.library.restartPython() after the pip_install_cmd
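A minimal sketch of how that could look in the return statement, keeping the existing cell separator; the exact cell layout may need tweaking:

return f"{pip_install_cmd}\ndbutils.library.restartPython()\n\n# COMMAND ----------\n{compiled_code}"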
    return compiled_code

# Build the %pip install command for notebook-scoped packages
pip_install_cmd = "%pip install " + " ".join(self.packages)
Maybe we can try %pip install -q {packages} to reduce noise in the notebook output.
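For illustration, with hypothetical package names the generated first cell would then read:

%pip install -q pandas requests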
I didn't take into account the recommendation to restart the Python process, thanks for that. I will refactor this draft PR to make it more production-ready (avoiding duplicated code) and address it. As for how large the driver node should be, this solution is agnostic of cluster size; it depends on the team that previously defined the cluster (whether an all-purpose cluster, a job cluster, or anything else).
Resolves #1320
Description
Adds a feature to install packages in a notebook-scoped environment for the PythonCommandSubmitter and PythonNotebookUploader classes.
Execution Test
I made the changes and tested them in our environment, verifying that the compiled code now has the package installation prepended.
Checklist
I have updated CHANGELOG.md and added information about my change to the "dbt-databricks next" section.