
avoid HDF5.name when possible #238

Merged
matthijscox merged 4 commits into master from avoid-HDF5-name
Apr 7, 2026

@matthijscox
Member

Addresses: #237
@foreverallama this may interest you.

I want a more performant HDF5.name variant, but in the meantime I can at least speed up reading .mat files that have no subsystem information.

using MAT, LinearAlgebra

function powerlaw_fit(x, y)
    lx = log10.(x)
    ly = log10.(y)

    A = hcat(ones(length(lx)), lx)
    c = A \ ly

    logC, α = c
    yfit = 10.0^logC .* x .^ α
    return logC, α, yfit
end

function nested_dict()
    Dict{String, Any}(
        "a" => Dict{String, Any}("b" => 1),
    )
end

sizes = logrange(10,1000,10)
timings = Float64[]
timings2 = Float64[]
file_sizes = Float64[]
filename = "matfile.mat"
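for N in sizes
    file = matopen(filename, "w")
    write(file, "arr", [nested_dict() for _ in 1:N])
    # first read: subsystem is still empty (class_id_counter == 0)
    t = @elapsed read(file, "arr")
    push!(timings, t)
    file.subsystem.class_id_counter = 1 # force use of HDF5.name
    # second read: same data, now with HDF5.name in use
    t2 = @elapsed read(file, "arr")
    push!(timings2, t2)
    close(file)
    # file size in bytes, converted to KB when plotting
    push!(file_sizes, filesize(filename))
end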

logC1, a1, fit1 = powerlaw_fit(file_sizes, timings)
logC2, a2, fit2 = powerlaw_fit(file_sizes, timings2)

using Makie, GLMakie
fig = Figure(size=(700, 300))
ax = Axis(fig[1, 1], xlabel="Size of file (KB)", ylabel="Time (s)")
scatter!(ax, file_sizes/1e3, timings, color=:black, label="no")
scatter!(ax, file_sizes/1e3, timings2, color=:red, label="yes")
lines!(ax, file_sizes/1e3, fit1, color=:black, linestyle=:dash)
lines!(ax, file_sizes/1e3, fit2, color=:red, linestyle=:dash)
Legend(fig[1, 3], ax, "HDF5.name usage")
ax = Axis(fig[1, 2], xlabel="Size of file (KB)", ylabel="Time (s)", xscale=log10, yscale=log10)
scatter!(ax, file_sizes/1e3, timings, color=:black)
scatter!(ax, file_sizes/1e3, timings2, color=:red)
lines!(ax, file_sizes/1e3, fit1, color=:black, linestyle=:dash)
text!(ax, 0.4*file_sizes[end]/1e3, fit1[end], text="x^$(round(a1, sigdigits=2))", color=:black)
lines!(ax, file_sizes/1e3, fit2, color=:red, linestyle=:dash)
text!(ax, 0.4*file_sizes[end]/1e3, 0.7*fit2[end], text="x^$(round(a2, sigdigits=2))", color=:red)
fig
save("timings.png", fig)
timings

As you can see, using HDF5.name makes the read time scale quadratically with file size. With this PR we go back to roughly linear scaling by avoiding HDF5.name when possible.
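As a sanity check on how the exponents in the plot are obtained, here is a minimal sketch of the same log-log least-squares fit applied to synthetic data with a known exponent. The helper name `fit_exponent` and the sample data are assumptions made for illustration; they are not part of the benchmark script or of MAT.jl.

```julia
using LinearAlgebra

# Hypothetical helper: fits log10(y) = logC + α * log10(x) by least squares,
# mirroring the approach of powerlaw_fit in the benchmark, and returns only α.
function fit_exponent(x, y)
    lx = log10.(x)
    ly = log10.(y)
    A = hcat(ones(length(lx)), lx)   # design matrix: intercept and slope columns
    logC, α = A \ ly                 # least-squares solve
    return α
end

# Synthetic data (not measurements): one exactly quadratic, one exactly linear
x = [1.0, 10.0, 100.0, 1000.0]
α_quadratic = fit_exponent(x, 3.0 .* x .^ 2)
α_linear = fit_exponent(x, 0.5 .* x)

println("quadratic data, fitted exponent: ", round(α_quadratic, digits=3))
println("linear data, fitted exponent: ", round(α_linear, digits=3))
```

An exponent near 2 on the log-log panel corresponds to quadratic scaling, and an exponent near 1 to linear scaling.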

end

function Base.isempty(subsys::Subsystem)
    return subsys.class_id_counter == 0
Member Author

Got some failing tests, because I don't know when I can consider the subsystem fully empty/missing.

Member Author

@matthijscox Apr 3, 2026

No wait, the problem is that m_read is called during subsystem initialization, which then checks isempty(subsys): https://github.com/JuliaIO/MAT.jl/blob/master/src/MAT_HDF5.jl#L143-L147

fid.subsystem.table_type = table
fid.subsystem.convert_opaque = convert_opaque
subsys_data = m_read(fid.plain[subsys_refs], fid.subsystem)
MAT_subsys.load_subsys!(fid.subsystem, subsys_data, endian_indicator)

I think this is also the only place we have to check for #subsystem# names? If so, we can avoid HDF5.name checking anywhere else in the .mat file.

@matthijscox
Member Author

I think I solved the problem entirely now, by only checking HDF5.name once. I don't know if this is correct? Is there always one, and only one, #subsystem# HDF5 group in the entire .mat file? Are they never nested somehow? Our tests pass at least!

@foreverallama
Contributor

Sorry I haven't been able to take a look at this yet, but yes, there is only one #subsystem# group in a MAT-file. It is written as a struct but requires special handling. That's why the HDF5.name check was included: to delegate to the special handling, or else load the group as a normal struct.

Does the proposed fix resolve the performance issue?

@matthijscox
Member Author

Yes, the current proposal fixes the performance issue by only calling HDF5.name once. If there's truly only one subsystem group and we know which one (the one used in the subsystem loading), then I could even avoid the call entirely, though it might be good to check it once just in case.
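The idea of checking the name only once can be sketched with plain Julia stand-ins. Everything below (`FakeGroup`, `slow_name`, `read_checking_names`, `read_plain`, `load_subsystem`) is hypothetical and exists only to count lookups; it is not the MAT.jl implementation and does not use HDF5.jl. It assumes the name lookup is the expensive step and that the subsystem group is loaded from a single, known place.

```julia
# Hypothetical stand-in for an HDF5 group; not a MAT.jl or HDF5.jl type.
struct FakeGroup
    path::String
    children::Vector{FakeGroup}
end

const NAME_CALLS = Ref(0)

# Stand-in for a costly lookup such as HDF5.name; counts how often it runs.
function slow_name(g::FakeGroup)
    NAME_CALLS[] += 1
    return g.path
end

# Pattern being avoided: every visited group pays for a name lookup.
function read_checking_names(g::FakeGroup)
    slow_name(g) == "/#subsystem#" && return 0   # would delegate to special handling
    return 1 + sum(read_checking_names, g.children; init=0)
end

# Pattern after the change: ordinary groups are read without any name lookup.
read_plain(g::FakeGroup) = 1 + sum(read_plain, g.children; init=0)

# The name is verified once, at the single place where the subsystem is loaded.
function load_subsystem(g::FakeGroup)
    slow_name(g) == "/#subsystem#" || error("Invalid subsystem group name")
    return read_plain(g)
end

data = FakeGroup("/arr", [FakeGroup("/arr/$i", FakeGroup[]) for i in 1:1000])
subsys = FakeGroup("/#subsystem#", FakeGroup[])

NAME_CALLS[] = 0
read_checking_names(data)
calls_before = NAME_CALLS[]

NAME_CALLS[] = 0
load_subsystem(subsys)
read_plain(data)
calls_after = NAME_CALLS[]

println("name lookups: ", calls_before, " before, ", calls_after, " after")
```

In this toy model the number of lookups no longer grows with the number of groups in the file, which is the property the benchmark in the PR description is measuring.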

@foreverallama
Contributor

Yeah it's a neat solution, as there's only one #subsystem# group in the whole MAT-file. Should work!

@matthijscox merged commit 7bc1449 into master Apr 7, 2026
15 checks passed
@matthijscox deleted the avoid-HDF5-name branch April 7, 2026 08:16
@foreverallama
Contributor

I actually think we don't need this check either:
HDF5.name(subsys_group) == "/#subsystem#" || error("Invalid subsystem group name").

The previous line would error out if something is corrupted anyways:
subsys_group::HDF5.Group = fid.plain[subsys_refs]
where subsys_refs = "#subsystem#"

@matthijscox
Member Author

Too late, it's merged and registered :)
But alright, we can remove this in some other PR if we remember
