Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parser: read optional PROJECTCOMPATVERSION Record. #21

Merged
merged 1 commit into from
Dec 19, 2024

Conversation

JanMarvin
Copy link
Contributor

Hi @tim-weis ,

I'm the author of an R package to create and modify Office Open XML files and have written various not so safe C++ parsers, including one for the XLSB format. Yesterday I tried to look into the vbaProject.bin file and came across your ovba project. I tried to embed it in an R package, but had no luck. Given that I have no prior knowledge of either rust development or rust development in R, this didn't surprise me. Still I tried to toy around with it and this toying around results in this PR. But as stated, until yesterday I hadn't written a single line of rust code. Therefore ...

The PR added the CompatVersionRecord which is an optional field that is defined as "A 32-bit number that identifies the Office Model version used by a VBA project." This record is optional and according to the OBVA documentation not written by VBA 5.0. Previously, if this record was found, the parser would simply panic. (There is another optional field ConstantsRecord, but this is at the end of the PROJECTINFORMATION record, therefore it didn't bother me).

The second minor change replaces the backslash to slash. This allows using ovba on Mac (and probably Linux). I ran the code on an M1 Mac and the two files I used for testing are created with MS365 on Mac.

A file which has this record can be found here, not sure anymore when or why this was created. Well this file loads now, and even works in my R rust package.

PS: I looked into your ovba-cli file, but this uses some Windows exclusive APIs and I couldn't convince ChatGPT to provide me with a drop in replacement for this.

PPS: Because I initially didn't know what caused the panic I updated the cfb crate to 0.10, hoping that this might solve some hidden issue. Since I'm not aware how the dependency updates work, I left this out of the PR.

@tim-weis
Copy link
Owner

Cheers, Jan!

I had a look at the issue. From what I gather the (optional) PROJECTCOMPATVERSION record was first described in version 11 of the [MS-OVBA] specification. This parser was implemented back when version 9.1 was current (and hasn't been substantially updated since). This has turned into an issue that needs to be addressed, so thanks a lot for bringing this to my attention. This is helpful, and much appreciated!

First, though, let's make sure I fully understand. When you say that "the parser would simply panic", is it the parser implementation that does, or the (presumed) final call to unwrap or expect in the calling code? I'd expected that the parser wouldn't panic and return a Result::Err in the presence of an unexpected PROJECTCOMPATVERSION record instead. Could you verify which one it is? (A Result in Rust is similar to std::expected in C++23.)

While you noted that you updated cfb to the latest version, did that change anything? I'd expect it didn't. Can you confirm this as well?


The ovba-cli tool is in a rough state. It used to be platform-agnostic up to and including this commit. The Windows-specific parts are concerned with transcoding between character encodings (code-page, per Information::code_page, to UTF-8). That change merely supported dumping VBA source code using a standard character encoding (UTF-8). It's otherwise just a tool for my personal use that I made public due to being referenced from this repo. It's not meant to be production-ready, or a point of reference.


While I don't fully understand your use case, you might also want to look into the calamine crate and see if that better suits your needs. It takes a more pragmatic approach to deserializing spreadsheet documents.

@JanMarvin
Copy link
Contributor Author

Hi,

thanks for the quick response!

When I said panicked, I was referring to the output I'm seeing without my fix (this is with the backslash -> slash correction):

> library(ovbars)
> fl <- "https://github.com/JanMarvin/openxlsx-data/raw/refs/heads/main/gh_issue_416.xlsm"
> wb <- openxlsx2::wb_load(fl)
> vba <- wb$vbaProject
> code <- ovbars::ovbar_out(name = vba)
thread '<unnamed>' panicked at /Users/janmarvingarbuszus/Source/ovbars/src/.cargo/registry/src/index.crates.io-6f17d22bba15001f/extendr-api-0.7.1/src/robj/into_robj.rs:71:13:
called `Result::unwrap()` on an `Err` value: Parser
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at src/lib.rs:12:1:
explicit panic
Error in ovbars::ovbar_out(name = vba) : 
  User function panicked: ovbar_out

I assume that the issue is with the R side of things due to called 'Result::unwrap()' on an 'Err' value: Parser

Updating cfb did not change a thing. I have reverted it for my package implementation and it still works fine. Like I've said. Initially I wasn't sure why the parser wasn't working and assumed that updating the cfb crate could help.

Regarding ovba-cli I already assumed that much, it was just another straw that could have potentially solved my issue. Given that the issue was due to something in ovba, it wouldn't have helped much either way.

Thanks, I will have a look at calamine, maybe one day I want to extract from the xls OLE streams, but that's a task for another raining month. For this project I only wanted to be able to read the macro code: https://github.com/JanMarvin/ovbars

@tim-weis
Copy link
Owner

Sweet, thanks, that's helpful!

From what I understand, the parser reports an error when presented with an (unexpected) optional PROJECTCOMPATVERSION record, that's subsequently converted to a panic! via the extendr-api crate. This aligns with my expectations. Thanks for verifying.

Updating the cfb crate to a later version without exhibiting a change in observable behavior is reassuring as well.

While looking into this I realized that the final PROJECTCONSTANTS record is also optional (and has been per v9.1), but isn't honored by the implementation. That's a bug I'd also like to address here.

I'm putting together test cases to verify that either record is accepted, when present, without failing when either isn't.


That said, I'd like to address the path separator issue separately. I don't fully understand it or know how best to address it quite yet.

The cfb crate uses the Path type in its public interface. This type is designed to work with the local file system and file system APIs. CFB's make this tricky: They are essentially virtual file systems wrapped into files, allowing them to travel across systems with different local file system semantics. I'll have to look into this some more.

I'll get back to you once the testing is in place.

@JanMarvin
Copy link
Contributor Author

Perfect, thanks! As said there is also the optional PROJECTINFORMATION record which I skipped for now, but obviously this could become an issue if it is available. Let me know if I should look into this further (thankfully the documentation is just ~120 pages and not a whooping 1600 as for XLSB.

@tim-weis tim-weis changed the base branch from develop to feature/parse_opt_records December 19, 2024 14:23
@tim-weis
Copy link
Owner

That didn't turn out as expected...

I'd implemented (and pushed) tests to the feature/parse_opt_records branch and changed the base branch for this PR, hoping that it would be moved to the tip of this branch. Apparently, that did not happen.

I'm not intimately familiar with Git or GitHub, but could you try to rebase and (force) push your PR?

Just to make sure we aren't talking past each other: This branch is all about addressing the issues revolving around acknowledging the optional PROJECTCOMPATVERSION and PROJECTCONSTANTS records in the PROJECTINFORMATION.

@JanMarvin
Copy link
Contributor Author

Sure thing, will push once I’m home

@JanMarvin
Copy link
Contributor Author

@tim-weis should be fine now. I removed the backslash change and adjusted the commit message accordingly. My motivation for this was that this backslash is available only in Microsoft Windows. (A reference that it still works is my tiny R package, embedding the patched ovba sources, which runs on Windows, Mac and Linux). Let me know if this works for you.

This record is optional and according to the OBVA documentation not
written by VBA 5.0. Previously, if this record was found, the parser
would simply panic.
@tim-weis tim-weis merged commit d2ea066 into tim-weis:feature/parse_opt_records Dec 19, 2024
@JanMarvin
Copy link
Contributor Author

Sorry, this should fix the formatting warning. I had it adjusted, to look similar to the other code parts, but apparently that wasn't a good idea 😬

@JanMarvin JanMarvin deleted the compat branch December 19, 2024 17:07
@tim-weis
Copy link
Owner

Thanks, I was about to ask whether you had a rustfmt.toml lingering somewhere that caused the formatting test failure. The last push fixes things up.

I'll publish a new release as soon as I can. Will be looking into the path separator issue, too.

The effort you put into this is much appreciated. Thanks a lot for that!

@JanMarvin
Copy link
Contributor Author

Thanks! No need rush from my side (I helped myself out) :)
Enjoy the holidays!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants