-
-
Notifications
You must be signed in to change notification settings - Fork 66
Escaped regex characters in variable_dims() #1120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Thanks @VisruthSK! @yizhang-yiz can you check if this works for you? |
|
I do wonder if anything else will break downstream if we have parameter names with parenthesis in them. We don't have tests for all the post-processing functions with non-standard parameter names, but it might be ok. An alternative could maybe be to convert these names to names with brackets instead of trying to allow names with parenthesis? But let's see if the solution in this PR works before going down that path. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #1120 +/- ##
==========================================
+ Coverage 87.15% 87.18% +0.02%
==========================================
Files 14 14
Lines 5973 5984 +11
==========================================
+ Hits 5206 5217 +11
Misses 767 767 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
After playing around a bit, this alone won't solve the issue in #1116. Unfortunately there are other issues caused by variable names with parentheses. I'm thinking it's best to just not support names with parentheses, especially since they can be converted before trying to read the CSV files and then everything should work. |
|
Should swapping parens with branckets be added to |
|
If changing But if other things also need to be changed (I'm not sure) then I would lean towards making the user do the conversion ahead of time and not changing so many things in CmdStanR. CmdStanR was designed to handle CSV files written by CmdStan. If other CSV files can be converted to be in that format then I'm happy for them to be used with CmdStanR, but I would be hesitant to make the package more complicated in order to accommodate the rare edge case of CSV files using other conventions. Does that seem reasonable? Some additional background info: CmdStan actually writes the variable names using periods ( |
|
Yeah definitely, especially as this is somewhat rare user behaviour I don't see the point of adding more complexity to cmdstanr. Do you happen to have an example/test I could run against? I got a test from an LLM but I don't know enough about cmdstanr to know if it makes sense and/or is testing the right things. I just pushed changes where this test passes. (but not changing unrepair so the names are permantly swapped to using brackets, basically the same as the user doing so manually) test_that("as_cmdstan_fit handles variable names with parentheses", {
csv_file <- tempfile(fileext = ".csv")
writeLines(c(
"# stan_version_major = 2",
"# stan_version_minor = 33",
"# stan_version_patch = 0",
"# model = norm_model",
"# method = sample (Default)",
"# sample",
"# num_samples = 2",
"# num_warmup = 0",
"# save_warmup = 0",
"# thin = 1",
"# random",
"# seed = 123",
"# algorithm = hmc",
"# metric = diag_e",
"# stepsize = 1",
"# id = 1",
"THETA4,SIGMA(1,1)",
"2.00000E+00,2.00000E+00",
"2.00000E+00,2.00000E+00"
), con = csv_file)
expect_no_error({
fit <- as_cmdstan_fit(csv_file, check_diagnostics = FALSE, format = "draws_matrix")
})
draws <- fit$draws()
vars <- posterior::variables(draws)
expect_equal(posterior::ndraws(draws), 2L)
expect_true(any(grepl("THETA4", vars)))
expect_true(any(grepl("SIGMA", vars)))
}) |
|
Thanks. If you can get that test to pass then you're probably on the right track, but we should test this with parameters that have more than one element. I just modified the variable names in a CSV written by CmdStan to include Sigma(1,1), Sigma(1,2), Sigma(2,1), Sigma(2,2): logistic-202512051745-1-989cfc.csv I wasn't sure if the names with parentheses needed to be in quotes (probably?). CmdStan doesn't write the names in quotes, but maybe with the parentheses they need to be. I guess you can modify the names in the CSV file in various ways to test it. I wouldn't spend too much time on this since it's an edge case, but if there's a simple fix we can do it. |
|
Thanks for working on this! I agree this is an edge case and there’s workaround (bracket for example) on my side. Originally I was just curious why it didn’t work as fread does. Please feel free to close #1116. |
|
Just swapped the metadata parsing to use fread instead of a string split on commas which was messing up the variable names. I think this should bring behaviour more inline with fread. I think this PR is ready for review, predicated on the tests passing. |
This PR escapes regex metacharacters in variable names when computing dimensions in
variable_dims(), so names likeOMEGA(1,1)don't error. This PR also swaps variables which have parentheses to brackets inrepair_variable_names(). Notably, these don't get recovered inunrepair_variable_names(), so users will have to use the bracketed name in further calls.read_csv_metadata()now relies ondata.table::fread()to parse the CSV header and generate the correctly quoted variable names.Fixes #1116.
Copyright and Licensing
Visruth Srimath Kandali
By submitting this pull request, the copyright holder is agreeing to
license the submitted work under the following licenses: