Skip to content

Conversation

@VisruthSK
Copy link
Member

@VisruthSK VisruthSK commented Dec 3, 2025

This PR escapes regex metacharacters in variable names when computing dimensions in variable_dims(), so names like OMEGA(1,1) don't error. This PR also swaps variables which have parentheses to brackets in repair_variable_names(). Notably, these don't get recovered in unrepair_variable_names(), so users will have to use the bracketed name in further calls. read_csv_metadata() now relies on data.table::fread() to parse the CSV header and generate the correctly quoted variable names.

Fixes #1116.

Copyright and Licensing

Visruth Srimath Kandali

By submitting this pull request, the copyright holder is agreeing to
license the submitted work under the following licenses:

@jgabry
Copy link
Member

jgabry commented Dec 3, 2025

Thanks @VisruthSK!

@yizhang-yiz can you check if this works for you?

@jgabry
Copy link
Member

jgabry commented Dec 3, 2025

I do wonder if anything else will break downstream if we have parameter names with parenthesis in them. We don't have tests for all the post-processing functions with non-standard parameter names, but it might be ok.

An alternative could maybe be to convert these names to names with brackets instead of trying to allow names with parenthesis? But let's see if the solution in this PR works before going down that path.

@codecov-commenter
Copy link

codecov-commenter commented Dec 3, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.18%. Comparing base (1d58078) to head (5527cdb).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1120      +/-   ##
==========================================
+ Coverage   87.15%   87.18%   +0.02%     
==========================================
  Files          14       14              
  Lines        5973     5984      +11     
==========================================
+ Hits         5206     5217      +11     
  Misses        767      767              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jgabry jgabry linked an issue Dec 5, 2025 that may be closed by this pull request
@jgabry
Copy link
Member

jgabry commented Dec 5, 2025

After playing around a bit, this alone won't solve the issue in #1116. Unfortunately there are other issues caused by variable names with parentheses. I'm thinking it's best to just not support names with parentheses, especially since they can be converted before trying to read the CSV files and then everything should work.

@VisruthSK
Copy link
Member Author

VisruthSK commented Dec 5, 2025

Should swapping parens with branckets be added to repair_variable_names() and unrepair_variable_names()? Or maybe just to repair to normalize output to always be brackets?

@VisruthSK VisruthSK marked this pull request as draft December 5, 2025 23:50
@jgabry
Copy link
Member

jgabry commented Dec 6, 2025

If changing repair_variable_names() and unrepair_variable_names() is all that's needed to get this working and the change is simple, I'm OK with adding it.

But if other things also need to be changed (I'm not sure) then I would lean towards making the user do the conversion ahead of time and not changing so many things in CmdStanR. CmdStanR was designed to handle CSV files written by CmdStan. If other CSV files can be converted to be in that format then I'm happy for them to be used with CmdStanR, but I would be hesitant to make the package more complicated in order to accommodate the rare edge case of CSV files using other conventions. Does that seem reasonable?

Some additional background info: CmdStan actually writes the variable names using periods (Sigma.1.1) and we convert them to brackets (Sigma[1,1]) after reading the CSVs.

@VisruthSK
Copy link
Member Author

VisruthSK commented Dec 6, 2025

Yeah definitely, especially as this is somewhat rare user behaviour I don't see the point of adding more complexity to cmdstanr. Do you happen to have an example/test I could run against? I got a test from an LLM but I don't know enough about cmdstanr to know if it makes sense and/or is testing the right things.

I just pushed changes where this test passes. (but not changing unrepair so the names are permantly swapped to using brackets, basically the same as the user doing so manually)

test_that("as_cmdstan_fit handles variable names with parentheses", {
  csv_file <- tempfile(fileext = ".csv")
  writeLines(c(
    "# stan_version_major = 2",
    "# stan_version_minor = 33",
    "# stan_version_patch = 0",
    "# model = norm_model",
    "# method = sample (Default)",
    "#   sample",
    "#     num_samples = 2",
    "#     num_warmup = 0",
    "#     save_warmup = 0",
    "#     thin = 1",
    "#   random",
    "#     seed = 123",
    "#   algorithm = hmc",
    "#     metric = diag_e",
    "#     stepsize = 1",
    "# id = 1",
    "THETA4,SIGMA(1,1)",
    "2.00000E+00,2.00000E+00",
    "2.00000E+00,2.00000E+00"
  ), con = csv_file)

  expect_no_error({
    fit <- as_cmdstan_fit(csv_file, check_diagnostics = FALSE, format = "draws_matrix")
  })

  draws <- fit$draws()
  vars  <- posterior::variables(draws)

  expect_equal(posterior::ndraws(draws), 2L)
  expect_true(any(grepl("THETA4", vars)))
  expect_true(any(grepl("SIGMA", vars)))
})

@jgabry
Copy link
Member

jgabry commented Dec 6, 2025

Thanks. If you can get that test to pass then you're probably on the right track, but we should test this with parameters that have more than one element. I just modified the variable names in a CSV written by CmdStan to include Sigma(1,1), Sigma(1,2), Sigma(2,1), Sigma(2,2):

logistic-202512051745-1-989cfc.csv

I wasn't sure if the names with parentheses needed to be in quotes (probably?). CmdStan doesn't write the names in quotes, but maybe with the parentheses they need to be. I guess you can modify the names in the CSV file in various ways to test it.
When I try this file I get an error where it thinks there are duplicate parameter names (there aren't):

Error: Duplicate variable names are not allowed in draws objects.
The following variable names are duplicates:
{'"Sigma[1]', '1"', '"Sigma[2]', '2"'}

I wouldn't spend too much time on this since it's an edge case, but if there's a simple fix we can do it.

@yizhang-yiz
Copy link

yizhang-yiz commented Dec 6, 2025

Thanks for working on this! I agree this is an edge case and there’s workaround (bracket for example) on my side. Originally I was just curious why it didn’t work as fread does. Please feel free to close #1116.

@VisruthSK VisruthSK marked this pull request as ready for review December 6, 2025 04:05
@VisruthSK
Copy link
Member Author

VisruthSK commented Dec 6, 2025

Just swapped the metadata parsing to use fread instead of a string split on commas which was messing up the variable names. I think this should bring behaviour more inline with fread. I think this PR is ready for review, predicated on the tests passing.

@VisruthSK VisruthSK requested a review from jgabry December 9, 2025 15:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

read csv with variable names that contain parentheses

5 participants