Add a JSON reader option to ignore type conflicts#7276
Conversation
```rust
for p in pos {
    if !matches!(tape.get(*p), TapeElement::Null) {
        return Err(tape.error(*p, "null"));
    }
}
```
NOTE: Indentation-only change
|
Have you run the benchmarks for this? |
Not yet... but https://github.com/apache/arrow-rs/blob/main/CONTRIBUTING.md makes it look very easy. Will do so and report back. |
|
@tustvold is
I think this one is probably what @tustvold is referring to: https://github.com/apache/arrow-rs/blob/a75da00eed762f8ab201c6cb4388921ad9b67e7e/arrow/benches/json_reader.rs#L30-L29 so something like `cargo bench --bench json_reader`
|
Hmm, the benchmark results are not stable from run to run. Even benchmarking the main branch against itself gives a random, changing set of regressions and improvements. I tried on two very different computers: a 2021 MacBook Pro (Apple M1 Max) and a 2019 Lenovo T490s (Intel Core i5-8365U). Different absolute numbers, same large jitter. Is there some trick for getting stable numbers? I tried increasing the measurement interval to 10s, but it didn't solve the problem.
I am not super familiar with the benchmarks. Maybe you could use a non-laptop machine (laptops sometimes vary based on thermal throttling, etc.)?
|
Finally got some reasonably stable benchmark results using an EC2 m6i.8xlarge instance, rustc 1.85.0. It uncovered one issue with the
I don't know why my changes should have caused a speedup, but at least there's no slowdown.

Benchmarking commands used:

```shell
# 5 runs against upstream main branch
git checkout 82c2d5f4c
for i in $(seq 5); do cargo bench --bench json_reader -- --save-baseline main$i; done

# 5 runs against this PR
git switch json-ignore-type-conflicts-option
for i in $(seq 5); do cargo bench --bench json_reader -- --save-baseline feature$i; done

# compare the run results
for i in $(seq 5); do cargo bench --bench json_reader -- --load-baseline feature$i --baseline main$i; done
```

(see https://bheisler.github.io/criterion.rs/book/user_guide/command_line_options.html#baselines)
|
@tustvold -- does the above work? Or are there other benchmarks to double check? |
|
bump? |
|
Hey @tustvold, have you had a chance to look at this? :) It would be very useful for our use-case as well 🤞 |
|
@scovich and @Blizzara -- would this PR be superseded by that other PR? BTW I am pretty sure this use case is what @mwylde described in his blog from Arroyo last year: https://www.arroyo.dev/blog/fast-arrow-json-decoding
Potentially? But I'd be very interested to see how one could actually achieve the same result with that other PR in practice. I suspect it would require a pretty annoying exercise to replicate each decoder type, just to inject the null handling. Maybe each custom decoder could be a thin wrapper around the default one... except I don't think any of the decoders are public today (see e.g. https://docs.rs/arrow-json/55.0.0/arrow_json/all.html)? |
@cht42 can confirm from our side, but I believe we'd need (or like) both; this PR handles type conflicts for most types, while the other (#7442) allows us to handle the specific case for strings. Using #7442 for all conflicts may be possible but likely requires us to have more custom code (as we then need to override all decoders rather than just one, I think). |
This and similar PRs for customizing JSON parsing have stalled on the architectural/API question of whether we should publicly widen the JSON parser interface with more configs, rely on variant for fancy tricks, or just expose the tape decoder and let people do whatever they want. I don't have a good answer to that, though now that we actually have variant support, I suspect that going through variant would only address some of the motivations for customizing a JSON parser.
|
@scovich Thanks for providing that added context. Given I've got some outstanding PRs that also expand the capabilities of the JSON parser, and thus the available config options, it would be good to tie it together into one conversation. Where is that discussion being held? |
|
Sorry -- I have been overwhelmed for a few weeks on DataFusion releases and other things. I hope to get back to this at some point. |
alamb left a comment
Thank you @scovich
I just went through this PR carefully (and had codex help me). I think it looks good.
Some things I think we should do before merging:
- Update the comments to explain the user facing behavior in more detail (I left some suggestions)
- Resolve the merge conflicts
arrow-json/src/reader/mod.rs
Outdated
```rust
/// Sets whether the decoder should produce NULL instead of returning an error if it encounters
/// a type conflict on a nullable column (effectively treating it as a non-existent column).
///
/// NOTE: The inferred NULL on type conflict will still produce errors for non-nullable columns,
/// the same as any other NULL or missing value.
```
Can we please define what a "type conflict" means more specifically? Perhaps with an example.

I think it means something like:

The JSON decoder encounters a value that cannot be parsed into the specified column type. For example, if the type is declared to be a nullable [DataType::Int32] but the reader encounters a string value "foo":

- If `with_ignore_type_conflicts` is set to false (the default), the reader will return an error.
- If `with_ignore_type_conflicts` is set to true, the reader will fill in a `NULL` value for that array element.
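For concreteness, here is a hedged usage sketch of the proposed option (it only compiles against this PR's branch; the column name `v` and the input rows are invented for illustration):

```rust
use std::io::Cursor;
use std::sync::Arc;
use arrow_json::ReaderBuilder;
use arrow_schema::{DataType, Field, Schema};

// Nullable Int32 column; the second row holds a string, i.e. a type conflict.
let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int32, true)]));
let input = "{\"v\": 1}\n{\"v\": \"foo\"}\n{\"v\": 3}\n";

let mut reader = ReaderBuilder::new(schema)
    .with_ignore_type_conflicts(true) // the option proposed by this PR
    .build(Cursor::new(input))
    .unwrap();

// With the option enabled, the conflicting row decodes as NULL instead of
// failing the whole batch; with the default (false), next() returns an error.
let batch = reader.next().unwrap().unwrap();
```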
```rust
#[test]
fn test_type_conflict_non_nullable() {
    let fields = [
        Field::new("bool", DataType::Boolean, false),
```
|
The initial benchmarks show some slowdown: #7276 (comment). I am rerunning to see if it is reproducible.
|
The benchmarks appear to show 10-15% slowdown for some queries
|
Need to look into potential performance regression
|
@alamb -- I tweaked the inner loop logic in a way that seems to help considerably on my laptop. Can you re-spin the benchmarks? |
|
New benchmarks look much better to me (no regressions). However, the tests are now failing 🤔
🎉
🤦 Found and fixed an unfaithful bit of the refactoring. I don't think it would impact performance, just match-arm wrangling.
|
run benchmark json_reader |
|
🤖 Arrow criterion benchmark running (GKE). Comparing json-ignore-type-conflicts-option (0e0d257) to 3b61796 (merge-base).
|
🤖 Arrow criterion benchmark completed (GKE), comparing json-ignore-type-conflicts-option (0e0d257) to 3b61796 (merge-base).
|
👌 |
|
This PR looks like it has some conflicts but otherwise is ready to go |
|
Thank you @scovich |
Which issue does this PR close?
Rationale for this change
JSON data is notoriously non-homogeneous, but the JSON parser today is very strict: it requires a concrete schema, and parsing fails if any field of any row encounters a type conflict. In such cases, it can be preferable for incompatible fields to parse as NULL instead of producing a hard error.
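The rationale above can be illustrated with a standalone toy sketch (plain Rust, not the actual arrow-json decoder; the function name is made up): a value that fails to parse as the target type yields `None` (NULL) rather than aborting the whole batch.

```rust
// Standalone sketch of "NULL on type conflict", not the actual arrow-json code.
// A value that fails to parse as the target type becomes None (NULL) instead of
// failing the whole batch with an error.
fn parse_i32_lenient(raw: &str) -> Option<i32> {
    raw.trim().parse::<i32>().ok()
}

fn main() {
    // The middle "row" is a JSON string where an Int32 was expected.
    let rows = ["1", "\"foo\"", "3"];
    let parsed: Vec<Option<i32>> = rows.iter().map(|r| parse_i32_lenient(r)).collect();
    assert_eq!(parsed, vec![Some(1), None, Some(3)]);
    println!("{:?}", parsed);
}
```

The strict behavior would instead return `Err` on the first conflicting value, discarding the two good rows along with it.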
What changes are included in this PR?
- Adds a new method, `arrow_json::reader::ReaderBuilder::with_ignore_type_conflicts`, which overrides the default behavior of erroring on a type conflict and returns NULL values instead.
- Plumbs that flag through to all ten decoders so they honor it: Null, Boolean, Primitive, Decimal, Timestamp, String, StringView, List, Map, Struct.
- Adds both positive and negative unit tests for each decoder type, to ensure the plumbing worked.
Are there any user-facing changes?
New API method, see above.