feat: Add new input_file_name UDF for file-backed scans#22978
Conversation
input_file_name UDF for file-backed scans
|
@alamb you mentioned in a previous review your dislike of all the helpers that keep accumulating in |
|
|
||
| Returns the path of the input file that produced the current row. | ||
|
|
||
| Note: file paths/URIs may be sensitive metadata depending on your environment. |
There was a problem hiding this comment.
that is an interesting point -- can we add a note to the upgrading guide (and the release notes) and explain how to disable the function for people for whom it will be a security problem?
Maybe the explanation is just "register a function with the same filename" 🤔 Or should we add a config flag as a follow on PR to avoid registering these 🤔
There was a problem hiding this comment.
I suggest consolidating the the parquet tests in datafusion/sqllogictest/test_files/input_file_name.slt so the tests for the same function live together in the same file
There was a problem hiding this comment.
Will do. Do you think there's value in having tests for Parquet that do both in this file?
|
|
||
| Note: file paths/URIs may be sensitive metadata depending on your environment. | ||
|
|
||
| This function is intended to be rewritten at file-scan time (when the file is |
There was a problem hiding this comment.
It occurs to me that we might want to add some documentation (as a follow on PR) to the TableProvider about functions that might need special handling (e.g. input_file_name, input_row_number, get_field)
Otherwise people implementing table providers might not know it would be helpful
|
|
||
| This function is intended to be rewritten at file-scan time (when the file is | ||
| known). If the input file is not known (for example, if this function is | ||
| evaluated outside a file scan, or was not pushed down into one), direct evaluation returns an error. |
There was a problem hiding this comment.
I wonder if we should be returning an error or try be more permissive and return null/empty string for this case 🤔
For reference, Spark seems to return an empty string:
>>> spark.sql("select input_file_name()").show()
+-----------------+
|input_file_name()|
+-----------------+
| |
+-----------------+There was a problem hiding this comment.
file_row_index() also errors, I think its nicer because it makes it harder to misuse, I can imagine in some complex query a call might be misplaced, with results becoming unexpected.
There was a problem hiding this comment.
I think @AdamGS 's implementation of file_row_index() did return NULL rather than an error (and I suggested returning an error).
I don't have a strong preference either way, as long as we have test coverage for the behavior
Which issue does this PR close?
Rationale for this change
Adds useful metadata functions to DataFusion. This PR builds on @ethan-tyler's #20071.
What changes are included in this PR?
input_file_name()UDF, which reports a UTF8 return_type. Likefile_row_index(), it errors when evaluated out of context.FileOpener(Both inParquetOpenerandProjectionOpener), making it a literal with the file's path.rewrite_input_file_name_in_projectionindatafusion-physical-expr-adapter.Are these changes tested?
file_row_index().Are there any user-facing changes?
datafusion-physical-expr-adapter-rewrite_input_file_name_in_projection.AI was used in this PR, mostly when helping to come up with test cases.