Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Publish as modular JAR(s) to Maven Central #14

Open
anthonyvdotbe opened this issue Apr 22, 2020 · 23 comments · May be fixed by #43
Open

Publish as modular JAR(s) to Maven Central #14

anthonyvdotbe opened this issue Apr 22, 2020 · 23 comments · May be fixed by #43

Comments

@anthonyvdotbe
Copy link
Contributor

Motivation:

I'm willing to help by providing PRs such as:

  • adapt the folder layout to Maven conventions
  • split the project into multiple Maven modules as needed (since it's a lot easier if there's a 1-on-1 mapping between Maven and Java modules)
  • make the Maven build work with a JDK 11+ release. In fact, such a JDK would be required in order to be able to compile the module-info.java file, so I'd like to upgrade the compiler level to 11 as well (the code can stay on the 1.5 syntactic level. I'm not planning to make any code changes, apart from package declarations and import statements)
  • remove jchardet and XOM support: jchardet is an unmaintained library, and I don't see how XOM will be able to modularize itself any time soon, given that it relies on the xml-apis:xml-apis Maven artifact, and thus has a "split packages" issue

However, I have a few questions:

  • first and foremost: if I provide PRs such as the above, will they be reviewed & accepted, and will a new release be published once modularization is done?
  • there's a lot of files that are apparently unmaintained. Why aren't these deleted? It would help a lot if any legacy files would be deleted before starting
  • there's 3 packages: encodings, htmlparser, saxtree. I understand why encodings is a separate package outside of htmlparser, but why is this also the case for saxtree? It's relied upon by the htmlparser package, and I don't readily see why someone using this library would use its classes directly
  • following from my previous question: would it be acceptable to move saxtree into a package under the htmlparser package? Or is this package used directly by users of this library, and would changing the package name be a problem w.r.t. backward compatibility?

@hsivonen @sideshowbarker what are your thoughts on this?

@sideshowbarker
Copy link
Member

As far as Maven goes, this project so far isn’t internally relying on Maven for anything. Maven and the Central Repository aren’t the sole way or even the primary way that the HTML parser releases has been distributed. And Maven isn’t the primary build tool used for building releases.

Given that, making any disruptive changes to the sources primarily just to get things to work more easily with Maven seems like a “tail wagging the dog” kind of thing; in particular, the following:

  • adapt the folder layout to Maven conventions
  • split the project into multiple Maven modules as needed

Personally speaking, I don’t use Maven at all. For the HTML checker and its dependencies, the only reason I publish Maven packages is that others have asked for them.

So I’m not super enthusiastic about the prospect of changes being made to the HTML parser sources if those changes would end up meaning that I’d then need to put significant time into making fixes myself to the HTML checker to align it with the HTML parser changes.

However, all that said, if @hsivonen sees benefits in what you’re proposing, then I would not be opposed — and I’d be willing to commit time to reviewing and testing patches.

@anthonyvdotbe
Copy link
Contributor Author

Thanks for your feedback. Just for the record, my sole goal is to have modular JAR(s) published to Maven Central. I'm open to any suggestions on how to achieve that. What I've described, is just 1 way to do so, but one that I'm willing to help out with. And I could keep the packages unchanged, such that there shouldn't be any code changes at all, except for XOM/jchardet removal.

@carlosame
Copy link

@anthonyvdotbe:

adapt the folder layout to Maven conventions

As mentioned by @sideshowbarker, changing the source layout just to adapt to Maven defaults does not look reasonable. My projects do not follow Maven defaults, instead I change the POM to override the defaults.

upgrade the compiler level to 11 as well (the code can stay on the 1.5 syntactic level

I assume that you mean a multi-release jar, but recent JDKs cannot target 1.5 (1.7 is the minimum AFAIR). So the question is whether the users of this project require 1.5 binary compatibility, or which one would be the minimum acceptable. Otherwise the idea of adding a module-info looks good.

@sideshowbarker:

Maven isn’t the primary build tool used for building releases.

No matter which tool is used to build a release, setting an automatic module name would be useful to those of us that have builds for recent JDKs; my pull request does that for Maven.

Otherwise, people using JDKs version 9 or later are going to get similar warnings to the following one that I get when I build my css4j-agent module with Maven:

[WARNING] ******************************************************************************************************************************************
[WARNING] * Required filename-based automodules detected: [htmlparser-1.4.jar]. Please don't publish this project to a public artifact repository! *
[WARNING] ******************************************************************************************************************************************

Having a full JPMS modularization like @anthonyvdotbe proposes would be good, but my pull request is a simpler solution that is available just now.

More information about module names:

Automatic-Module-Name: Calling all Java Library Maintainers

@sideshowbarker
Copy link
Member

it's been since 2012 since the last release

Yeah, going for 8 years without a new release is a long time, so I agree it’s time we should strongly consider trying to get a new release out.

Just for the record, my sole goal is to have modular JAR(s) published to Maven Central.

I also agree that given the shift in Java overall to a module system, it makes sense to strongly consider moving to distribution of a modular jar.

And as far as Central, in practice even for non-Maven users, it now seems to be the single primary common place to get jars from — so I also agree it makes sense to provide distributions there.

Having a full JPMS modularization like @anthonyvdotbe proposes would be good, but my pull request is a simpler solution that is available just now.

I personally have no objection to first just merging that small change from #13, if @hsivonen is supportive. But even if we were to only do that, we’d still need to do a new release, right?

If we’re gonna release a new jar at all, I guess it makes sense to strongly consider doing the work to ensure it’s got the whole set of current-best-practice characteristics for jar releases that are expected by users nowadays.

But I can’t commit to support for moving forward with anything until @hsivonen weighs in.

@carlosame
Copy link

If we’re gonna release a new jar at all, I guess it makes sense to strongly consider doing the work to ensure it’s got the whole set of current-best-practice characteristics for jar releases

Other projects are releasing a jar with an automatic module first, and then another with full modularization. But any issues about split packages etc. should be resolved before that. Also, my PR assumes that there will be a single module.

I am unaware of any split-package issue inside this project (yes the dependencies do have them), and the single-module approach seems correct unless one wants to start making deep changes to the project just now.

But of course if one and only one release is going to be produced, by all means let it have full JPMS.

@carlosame
Copy link

@anthonyvdotbe I forgot to mention something about this:

remove jchardet and XOM support

There are tricks to use dependencies with split packages (I have a similar problem in my css4j-dom4j module), so XOM could possibly be kept. Yes, I know that's less than perfect.

And jchardet could be kept as a filename-generated module name (much like this project's htmlparser.jar is currently). Again, yes I know...

@hsivonen
Copy link
Member

I'm completely OK with publishing a new release. I should have done it many times over the past years. It's been a "next week" item for way too long. 8 years indeed is a long time. 😬

I also want to get rid of jchardet and the ICU detectors. Neither represents present-day browser-compatible behavior. Ideally, these would be replaced with chardetng, but adding JNI dependencies isn't really in the Java culture, and I haven't seriously investigating compiling it into Java byte code. (asmble might be an avenue.)

I'd like to better understand the motivation for removing XOM support. I gather that Maven ships both source packages and binary packages. The parser is designed such that it requires a XOM jar to be present at build time, but once you have the binary jar, if you don't want to use XOM, you don't need to have XOM present. How big a problem is this with Maven?

It also seems bad to make breaking changes (shuffle around the Java package naming structure) for the sake of Maven. saxtree is useful on its own and it's possible to run htmlparser in a mode that doesn't use it. However, the htmlparser API is such that not using saxtree is a run-time decision, so saxtree needs to be a hard dependency for htmlparser in order not to break API compatibility.

So I'd be OK with saxtree being spun off into a distinct Maven artifact, but I'm not happy about the idea of moving it in the package naming structure just to please Maven.

encodings is a project that has started but isn't in use and is unlikely to be finished anytime soon. It probably makes sense to throw it away and use Peter Occil's implementation.

[WARNING] * Required filename-based automodules detected: [htmlparser-1.4.jar]. Please don't publish this project to a public artifact repository! *

Do I understand correctly that some policy change prevents next step being just bumping the version number is pom.xml and publishing the same way as 8 years ago?

@carlosame
Copy link

I'd like to better understand the motivation for removing XOM support.

I believe that he wants to remove XOM because it is not modularized.

shuffle around the Java package naming structure) for the sake of Maven

Neither Maven nor JPMS require that (in the specific case of this project). The motivation to move saxtree is unclear to me.

encodings is a project that has started but isn't in use and is unlikely to be finished anytime soon.

Looks like a good candidate for a package that is not to be JPMS-exported (if not removed finally).

Do I understand correctly that some policy change prevents next step being just bumping the version number is pom.xml and publishing the same way as 8 years ago?

They were considering to automatically enforce the rejection, but do not think that they finally did it. In any case, this is only going to affect the Maven deployment if your build is a JPMS-modularized one.

The warning is mostly due to concerns about a possible 'DLL hell' if explicit module names and filename-based names are mixed together. That's why people are encouraged to declare a name ASAP.

@anthonyvdotbe
Copy link
Contributor Author

I also want to get rid of jchardet and the ICU detectors. Neither represents present-day browser-compatible behavior. Ideally, these would be replaced with chardetng, but adding JNI dependencies isn't really in the Java culture, and I haven't seriously investigating compiling it into Java byte code. (asmble might be an avenue.)

asmble looks interesting indeed. Another alternative might be to run it on GraalVM. There's also project Panama, but that's WIP and focused on C for now. Either way, just to set expectations, supporting chardetng is out of my league :)

I'd like to better understand the motivation for removing XOM support. I gather that Maven ships both source packages and binary packages. The parser is designed such that it requires a XOM jar to be present at build time, but once you have the binary jar, if you don't want to use XOM, you don't need to have XOM present. How big a problem is this with Maven?

tl;dr I'll leave XOM support untouched, since it doesn't pose a problem to neither Maven, nor Java modules.

(The motivation was: we should start with a clean slate and drop support for anything that isn't itself modularized (or is likely to be anytime soon). And modularizing XOM will be hard, because it depends, both directly & indirectly, on xml-apis:xml-apis (the version that is currently depended on by htmlparser, 1.1, uses its predecessor, xerces:xmlParserAPIs). The problem with these artifacts, is that they package APIs which are part of Java SE's java.xml module. And it's not possible to use 2 modules that define the same package (at least not in the same ModuleLayer).)

It also seems bad to make breaking changes (shuffle around the Java package naming structure) for the sake of Maven. saxtree is useful on its own and it's possible to run htmlparser in a mode that doesn't use it. However, the htmlparser API is such that not using saxtree is a run-time decision, so saxtree needs to be a hard dependency for htmlparser in order not to break API compatibility.

No problem, the package structure will be retained.

(The motivation for asking about moving it into the htmlparser package was: "I don't readily see how this can be useful on its own, and moving it inside htmlparser would allow to limit ourselves to a single module")

So I'd be OK with saxtree being spun off into a distinct Maven artifact, but I'm not happy about the idea of moving it in the package naming structure just to please Maven.

encodings is a project that has started but isn't in use and is unlikely to be finished anytime soon. It probably makes sense to throw it away and use Peter Occil's implementation.

Given the above, I'm proposing to remove the encodings package, and split off saxtree in its own Maven module. So we would have 2 Maven modules, each mapping to their own Java module, namely nu.validator.htmlparser, and nu.validator.saxtree. Package names will remain unchanged. @hsivonen Are you OK with this?

[WARNING] * Required filename-based automodules detected: [htmlparser-1.4.jar]. Please don't publish this project to a public artifact repository! *

Do I understand correctly that some policy change prevents next step being just bumping the version number is pom.xml and publishing the same way as 8 years ago?

I'm not sure I understand your question, but I can explain what the warning means: JARs which aren't modularized (i.e. don't contain a module-info.class file), can be used as modules nonetheless (in order to ease migration, such JARs are called automatic modules). As an intermediary step, library maintainers can add an entry in the manifest (as in #13), which asserts "once this library is modularized, its module name will be as given". Now, when an umodularized JAR is put on the module path, Java checks that entry to determine its module name. If it doesn't have it, Java determines the module name from the JARs filename.
The problem is: if I want to modularize my library, foo, which depends on library bar, where bar isn't modularized yet, I can write:

module foo {
    requires bar;
}

but as long as bar doesn't at least have the manifest entry to assert its future module name, this is risky and I shouldn't publish my library (since the module name is derived from a JARs filename and may still change in the future either way). And that's what the warning says: you shouldn't publish a library whose module descriptor requires an artifact whose module name is filename-based.

@carlosame
Copy link

Something that has not been mentioned: you are already declaring a module name, just OSGi instead of JPMS:

Bundle-SymbolicName: nu.validator.htmlparser

So you are already shipping those packages under that name (not sure if you have real OSGi users though).

@carlosame
Copy link

moving it inside htmlparser would allow to limit ourselves to a single module
[...]
split off saxtree in its own Maven module. So we would have 2 Maven modules, each mapping to their own Java module

There is no need to have two modules, you can have the two packages under the same module.

@anthonyvdotbe
Copy link
Contributor Author

moving it inside htmlparser would allow to limit ourselves to a single module
[...]
split off saxtree in its own Maven module. So we would have 2 Maven modules, each mapping to their own Java module

There is no need to have two modules, you can have the two packages under the same module.

Yes, but from the following:

  • saxtree is useful on its own, and should thus be in a Java module of its own
  • not having a 1-on-1 mapping between Maven modules and Java modules unnecessarily complicates the Maven build

it follows that saxtree should be split off into its own Maven module

@carlosame
Copy link

Modules are intended to group related packages, and saxtree is closely related to the other package(s). If your modification proceeds, users would have to change their setup for the additional artifact, as opposed to just bumping the version number in the Maven POM or whatever they were using.

Not to talk about potential additional work for those in the validator-nu branch, as they would need to adapt to the changes in master if they want to stay somehow in sync.

Let's see what @hsivonen thinks though.

@sideshowbarker
Copy link
Member

Not to talk about potential additional work for those in the validator-nu branch, as they would need to adapt to the changes in master if they want to stay somehow in sync.

I’ve come around to realizing that the best way to avoid any problems with keeping the validator-nu branch in sync is to just end-of-life that branch, as proposed in #17.

I also want to retract something related that I wrote here earlier:

So I’m not super enthusiastic about the prospect of changes being made to the HTML parser sources if those changes would end up meaning that I’d then need to put significant time into making fixes myself to the HTML checker to align it with the HTML parser changes.

I realize now it wasn’t really relevant to mention that here — because for one thing, the mechanism the HTML checker uses to build the integrated parser is that it directly walks the src tree here to collect the pathnames for the source files, and then essentially just calls javac on those. So in hindsight I can’t really imagine what kind of changes to the sources would require me to make any significant changee to the HTML checker’s build of the integrated parser.

Also, any choice of build tools we end up deciding on within the parser repo itself isn’t gonna affect the checker build of the parser. I would just be keeping the checker build of the parser the same anyway, regardless.

So please ignore anything I wrote here previously about needing to consider the HTML checker usage of the parser when making decisions about how we want to do things here going forward.

Whatever we decide, it should just be based on trying to determine what’s best for the wider community of users for the parser, and current best practices.

@anthonyvdotbe
Copy link
Contributor Author

anthonyvdotbe commented Apr 25, 2020

To summarize, here's the open questions I'd like confirmation on. Is it ok to...:

  • split the htmlparser and saxtree packages into separate Maven modules?

    • arguments were already given here, but regardless of Java modules, this would allow to distribute saxtree as a separate artifact. And since saxtree is useful on its own, this is a nice enhancement by itself
  • remove support for RPM packages (the current pom.xml allows to build an RPM package)?

    • I'm not sure if this was ever actually used to distribute htmlparser to official RPM repositories
    • distribution maintainers typically package and maintain such packages themselves
    • looking at https://pkgs.org/search/?q=htmlparser, the RPM package has been removed from every major RPM-based distribution by now
    • given that the RPM package is installed in a system-wide directory, and this would be a backward-incompatible release (some packages would no longer be accessible due to modularization), this seems risky either way
  • remove support for OSGi bundles?

    • OSGi is working on native support for Java modules (see https://github.com/apache/felix-atomos), so this support will soon be unnecessary anyway
    • the OSGi bundle and Java module should match (in terms of exported API etc.), which would be a backward-incompatible change (since currently the bundle exports every package except for the impl package, but the Java module wouldn't export all those packages as well) and a maintenance burden (changes in module-info.java should be matched with changes to the bundle's configuration in pom.xml)
    • it would simplify the required changes, and the Maven build in general
  • set the compilation version to Java SE 11? I.e. the published artifacts would only run on Java SE 11+

I also filed elharo/xom#142 w.r.t. XOM support: to be able to publish modular JARs, we'd need a XOM release which has at least Automatic-Module-Name set, but such release is currently not available yet (technically this isn't true: we could manually change the name of the XOM jar to match the intended module name when publishing, but that's just messy, imho).

@carlosame
Copy link

Java SE 8 is pretty much EOL by now

There are about 60% of developers still on Java 8, according to latest surveys, and OpenJDK 8 is supported until May 2026. Figures are between 3 and 7% for Java 7 usage.

My project produces multi-release JARs valid for Java 8 and 11 (7 and 11 for 1.0), and a dependency that only runs on 11 would be useless.

@anthonyvdotbe
Copy link
Contributor Author

My project produces multi-release JARs valid for Java 8 and 11 (7 and 11 for 1.0), and a dependency that only runs on 11 would be useless.

Good point, I scrapped the question: the sources themselves will still be compiled for Java SE 8.

@carlosame
Copy link

the sources themselves will still be compiled for Java SE 8.

Not sure that I follow you here: do you mean that the source level will be 8, or that you compile targeting Java 8?

The former would be irrelevant to the JAR, and the latter rules out module-info and, thus, modularization.

@anthonyvdotbe
Copy link
Contributor Author

With "the sources themselves", I meant: "everything but module-info.java". So: module-info will be compiled with --release 11, and everything else will be compiled with --release 8

@carlosame
Copy link

With "the sources themselves", I meant: "everything but module-info.java". So: module-info will be compiled with --release 11, and everything else will be compiled with --release 8

module-info.java is a source as well, so we are talking about a multi-release jar now.

@anthonyvdotbe
Copy link
Contributor Author

Strictly speaking it won't be a multi-release JAR, but even it it were: I don't see what point you're trying to make? The JAR will work as a modular JAR on Java SE 11+, and as a plain old JAR on Java SE 8.

@carlosame
Copy link

I don't see what point you're trying to make?

My point is to try to clarify what you are proposing, because I suggested to target different releases in my first response to your post (where you talked about keeping source level 1.5), and your last summary still involved a Java 11-only thing.

Saying "the sources themselves will still be compiled for Java SE 8" in the context of a modular project is something that needs clarification.

An then, I wonder why Java 8 and not 7.

@anthonyvdotbe
Copy link
Contributor Author

where you talked about keeping source level 1.5

I never talked about keeping source level 1.5. I said:

[...] upgrade the compiler level to 11 as well (the code can stay on the 1.5 syntactic level. I'm not planning to make any code changes, apart from package declarations and import statements)

which I thought was sufficiently clear to mean:

I'm planning to use --release 11 to compile, but I won't make code changes that require to compile with a --source value that is higher than what's currently used by the project, which, looking at the POM, is 1.5 (apart from introducing module-info.java files, obviously)

@anthonyvdotbe anthonyvdotbe linked a pull request Aug 17, 2020 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants