This repository has been archived by the owner on Nov 10, 2017. It is now read-only.

NIO to allow GC during JNI and access to special memory regions #58

Open
almson opened this issue Dec 30, 2013 · 22 comments

Comments

@almson

almson commented Dec 30, 2013

It would be nice if netlib-java exposed NIO interfaces to its routines.

Direct ByteBuffers (and FloatBuffer, DoubleBuffer, etc) have the following neat features:

  • zero-copy between Java and native code, giving 100% performance parity with C.
  • as fast as primitive arrays
  • can memory-map files (Should be possible to do calculations directly on huge matrices residing on SSDs)
  • compatibility with code that already uses them (I've been using jcuda and jcublas with direct ByteBuffers, and have to incur double copy overhead to use an array-based JNI API)
  • simple to expose
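
As a minimal sketch of the buffer features listed above (not netlib-java code; the class and method names are hypothetical), the following allocates an off-heap `DoubleBuffer` and maps a file as a `DoubleBuffer` using only standard `java.nio` calls. The byte order is set to the platform's native order because a fresh `ByteBuffer` is big-endian by default, which native code would generally not expect.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class DirectBufferSketch {
    /** Off-heap storage for n doubles, in the platform's native byte order. */
    static DoubleBuffer allocateDirect(int n) {
        return ByteBuffer.allocateDirect(n * Double.BYTES)
                .order(ByteOrder.nativeOrder())
                .asDoubleBuffer();
    }

    /** Maps a file as n doubles; the file is grown to the mapped size if needed. */
    static DoubleBuffer mapFile(Path file, int n) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            // The mapping stays valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_WRITE, 0, (long) n * Double.BYTES)
                    .order(ByteOrder.nativeOrder())
                    .asDoubleBuffer();
        }
    }

    /** Writes 0..n-1 into a temporary mapped file and returns the sum read back. */
    static double roundTrip(int n) {
        try {
            Path tmp = Files.createTempFile("matrix", ".bin");
            tmp.toFile().deleteOnExit();
            DoubleBuffer m = mapFile(tmp, n);
            for (int i = 0; i < n; i++) m.put(i, i);
            double sum = 0;
            for (int i = 0; i < n; i++) sum += m.get(i);
            return sum;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Whether computing directly on a mapped file is fast in practice depends on the access pattern and the page cache; the sketch only shows that the API makes it possible.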
@fommil
Owner

fommil commented Dec 31, 2013

we already have zero copy between Java and native code through the critical array primitive sections, so I don't see what advantage this could bring.

In addition, if you have huge files on disc, it is highly likely that they will be sparse (and hence not appropriate for blas routines anyway) and the memory pipelining of such a setup would offset any potential benefits from native implementations.

I'm closing this ticket because I don't intend to work on this: it would result in a dramatic change to the API (effectively making the entire current API legacy). I really don't believe this is simple.

However, I'd love to see a feasibility study that shows that NIO buffers are faster than arrays with critical sections. Would you be prepared to do this? I would honestly be shocked if you could show this, as all major OSes already have zero copy. It sounds like jcuda and jcublas should be exposing an array-based API.

BTW, you can potentially swap the netlib implementation to use CUBLAS as the backend if you like, see MultiBLAS, no need to use "jcublas" directly (which, incidentally, I've never heard of).

My performance tests (see the Performance section of the README) showed that GPU BLAS backends are not quite there yet as drop-in replacements, as the memory transfer costs outweigh the benefits of the faster computation, so it is necessary to rewrite programs to make use of the GPU memory addressing system. Perhaps this is what jcublas does: a wrapper over the C API.

BTW, BLAS/LAPACK are Fortran APIs, not C ;-) The C APIs (which simply delegate to the Fortran routines) are called CBLAS and LAPACKE, respectively.

@fommil fommil closed this as completed Dec 31, 2013
@almson
Author

almson commented Dec 31, 2013

Using GetPrimitiveArrayCritical can cause slowdowns and stalls in multi-threaded code: mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2007-December/000074.html

I suppose it isn't a big deal, but ByteBuffers are often preferred when working with native code.

JCublas is a wrapper of the C api. I manage memory myself. ByteBuffers are used to expose native memory regions, including those that have special properties to the CUDA runtime (such as those that are pinned in physical memory and are much faster to copy to/from the GPU). JCublas also can take plain arrays (this is all managed through the neat Pointer class that the C functions take).

I think overloads that take FloatBuffer would be a simple addition.

@fommil
Owner

fommil commented Dec 31, 2013

hmm, that's all pretty old thinking for the critical regions. I've never seen any problems - does one use of a critical region in one piece of JNI code really stop GC for everything? For all implementations of the GC? And for all memory regions? If that is true, it's a pretty epic performance bottleneck in the JVM.

In any case, netlib-java is best used in a manner such that the native code is only being called by one (or a very small number of) Java threads at a time. The reason for this is because native implementations will tend to use all your CPU cores anyway, so trying to do any multi-threading organisation on top is just going to burden the OS's context switcher. I created MultiBLAS to see if multiple processor architectures could be used to get additional throughput (e.g. if the CPU is busy, use the GPU, even though it's not really much faster) but I've been distracted by other projects.
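
One way to follow the "one calling thread" advice above is to funnel every BLAS call through a single-threaded executor. This is a sketch under assumptions: the class name, the thread name and the pure-Java `dot` stand-in are hypothetical, and a real program would call the netlib-java routine where `dot` is called here.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class SerializedBlasCalls {
    // One worker thread: callers queue up instead of competing with a
    // native BLAS that may already be using every core internally.
    private static final ExecutorService BLAS_THREAD =
            Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "blas-caller");
                t.setDaemon(true); // do not keep the JVM alive
                return t;
            });

    /** Stand-in for a native call such as ddot; plain Java so the sketch runs anywhere. */
    static double dot(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += x[i] * y[i];
        return s;
    }

    /** Submits the call to the single BLAS thread and waits for the result. */
    static double dotSerialized(double[] x, double[] y) {
        Callable<Double> task = () -> dot(x, y);
        Future<Double> f = BLAS_THREAD.submit(task);
        try {
            return f.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }
}
```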

I'd be happy to advise on a pull request with a ByteBuffer alternative API. NIO is something I considered at the outset (many years ago) but I concluded that the user base for such an API would be too small to warrant any work on it, the additional cost to PermGen would be high, and it would only be beneficial to users of the native backends (it would necessarily slow down the fallback F2J implementation).

I can certainly see the benefit of using ByteBuffers if CUDA is maintaining its own special memory regions. That's something we can't map as a critical region, since it is created in the native code. I wasn't aware that CUDA created any special memory zones in the CPU memory space... I thought it was all GPU stuff which one can't access via NIO.

The reality is that "CUBLAS" is not actually BLAS. It's a BLAS-like thing, which requires special memory management. I'm going to keep an eye on this whole space over the next few years and my hope is that the memory transfer costs reduce to the point where special memory regions are no longer needed.

BTW, If you start work on a Pull Request, you'll see that it is definitely not a simple addition ;-) You'll need to appreciate the amount of code that is auto generated, and the process required to push native binaries, to fully appreciate the magnitude of what you're asking.

@almson
Author

almson commented Jan 2, 2014

You're probably right that the demand for this isn't big enough to bother. I've put in copies, and they're not a problem.

NVIDIA (ie CUDA) probably will never overcome the PCIe bandwidth bottleneck. But AMD and Intel might. Their integrated GPUs, though not the famed teraflop beasts, connect to system memory and should work better for this kind of "drop-in" GPGPU. (Hence AMD referring to their chips as "APUs")

To make PCIe transfer faster, CUDA optionally uses pinned and/or write-combined memory. These let the DMA engines do their work without worrying about virtual memory and cpu caches, respectively. I don't believe it helps much for small transfers, though.

I'm surprised you haven't had much exposure to GPUs. If one wants to do fast linear algebra, a GPU is the only way to go. Using them from Java is also very pleasant.

@almson
Author

almson commented Jan 2, 2014

Any compacting GC (CMS isn't compacting, which is why I use G1) can't run while JNI code has entered a critical section in order to access JVM memory. That's why the function has "critical" in its name and why it is advised to be used only for short operations. But I agree that when BLAS is keeping all CPUs busy, this doesn't usually matter (although if I/O code stops, it might).

@fommil
Owner

fommil commented Jan 3, 2014

have you got a definitive reference on critical sections blocking different types of GC? I'd really like to know the fuller story as I've heard several . (I suppose it wouldn't be too hard to write a small JNI function that simply blocks for 5 minutes, then do some work and observe the whole thing in VisualVM, but I'd like to hear the official story.)

I have certainly used GPUs, that's what my other project is for, but I didn't realise that CUDA had a special type of memory mapping between CPU and GPU memory spaces. What is the native function call that you're making to create the memory region?

@fommil
Owner

fommil commented Jan 14, 2014

I'm reopening this just because it is still of interest, but I don't have any time or plans to work on this personally, so Pull Requests (or funding!) are realistically the only way it's going to happen :-(

@fommil fommil reopened this Jan 14, 2014
@almson
Author

almson commented Jan 15, 2014

The CUDA function is simply cudaHostAlloc. See:
http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/group__CUDART__MEMORY_g15a3871f15f8c38f5b7190946845758c.html

GetPrimitiveArrayCritical seems to cause hangs when using G1 GC and Java 8.
In JCuda it is a complete deadlock, while with Netlib it's a temporary one
where all threads but one stop working. Concurrent Mark Sweep GC doesn't
have this problem on Java 8. I'll open up an issue as soon as I figure out
how to use Visual Studio to get the stack trace in the JDK.


@dlwh

dlwh commented Aug 5, 2015

(obviously this is low priority and you have probably less bandwidth than I do, but I am somewhat interested in support for NIO Buffers. It's come up a few times in issues and questions for Breeze.)

@fommil
Owner

fommil commented Aug 5, 2015

If you know anyone willing to fund this for a few months, please let me know.

@fommil
Owner

fommil commented Aug 5, 2015

BTW most people who think they need nio buffers don't really understand what netlib-java is doing and don't actually need it. But I can see it would be an improvement and there are rare cases that would benefit.

@bkgood

bkgood commented Nov 30, 2015

I've got a fork over at https://github.com/bkgood/netlib-java that dirtily adds NIO buffer support. Would you be interested in merging it in if I polished it up?

It's got some extraneous stuff I want to remove; the F2J implementation fails on direct buffers and the JNI implementations fail on array-backed buffers. It currently fulfills my needs (multithreading done in the BLAS implementation wasn't cutting it so I'm doing it myself in-JVM; long story) but I'd love to get a cleaned-up version into a proper release if you'd be interested in such a version.

On fixing the latter two issues I brought up- I agree that letting the F2J implementation handle non-array-backed buffers will slow it down considerably, but given the incredibly poor performance this implementation currently exhibits, I'm not sure I see the extra copying work leading to much of a regression. If performance is a goal of the Java implementation, F2J seems to really miss the mark - I, at one point, reimplemented dgemv in Java in such a way that I had it running in the same order of magnitude as the OpenBLAS implementation because I was able to write Java in such a way that Hotspot was able to vectorize it. F2J's bytecode doesn't seem to be written (well, generated) in such a way. I am concerned about the necessary allocations, though. Any ideas here? Like I mentioned earlier, I currently just throw an exception which might be better than implicitly allocating a potentially huge array.
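
For illustration only, here is what a plain-Java dgemv can look like when the inner loop walks one contiguous column with unit stride, which is the loop shape auto-vectorizers generally handle best. This is not the implementation described above: the class name and the reduced signature (column-major, no transpose, no strides or offsets) are assumptions, and whether HotSpot actually vectorizes it depends on the JVM version and flags.

```java
final class PlainJavaDgemv {
    /**
     * y = alpha * A * x + beta * y for a column-major m-by-n matrix A
     * stored in a with leading dimension m (no transpose, unit strides).
     */
    static void dgemv(int m, int n, double alpha, double[] a,
                      double[] x, double beta, double[] y) {
        // Scale y first so the column updates below can accumulate into it.
        for (int i = 0; i < m; i++) {
            y[i] *= beta;
        }
        for (int j = 0; j < n; j++) {
            final double t = alpha * x[j];
            final int col = j * m; // index of the first element of column j
            // Unit-stride inner loop over one contiguous column.
            for (int i = 0; i < m; i++) {
                y[i] += t * a[col + i];
            }
        }
    }
}
```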

To fix the final issue (JNI implementation fails on array-backed buffers), it seems to make sense to just get the backing array from the buffer and call GetPrimitiveArrayCritical on it. To my knowledge there isn't a sort of nio buffer that has neither a 'peer' nor a backing array but this can also be guarded against.
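
The dispatch described above can be sketched on the Java side as follows. All names are hypothetical and nothing here is taken from the fork; the direct-buffer branch reads through the buffer in Java where a JNI backend would use the buffer's native address. Note that buffers with neither an accessible backing array nor a native address do exist in the JDK: a read-only heap buffer reports `hasArray() == false`, as does a `DoubleBuffer` view over a heap `ByteBuffer`, so the guard is worth keeping.

```java
import java.nio.DoubleBuffer;

final class BufferDispatch {
    /** Sum of absolute values (like BLAS dasum) over the buffer's remaining elements. */
    static double dasum(DoubleBuffer x) {
        if (x.hasArray()) {
            // Heap buffer: reuse the array-based path on the backing array.
            return dasumArray(x.remaining(), x.array(), x.arrayOffset() + x.position());
        }
        if (x.isDirect()) {
            // A JNI backend would call GetDirectBufferAddress at this point;
            // this sketch stays in Java and reads through the buffer instead.
            return dasumBuffer(x);
        }
        // Neither an accessible array nor native memory, e.g. a read-only heap buffer.
        throw new IllegalArgumentException("buffer is neither array-backed nor direct");
    }

    private static double dasumArray(int n, double[] x, int offset) {
        double s = 0;
        for (int i = 0; i < n; i++) s += Math.abs(x[offset + i]);
        return s;
    }

    private static double dasumBuffer(DoubleBuffer x) {
        double s = 0;
        // Absolute gets leave the buffer's position untouched.
        for (int i = x.position(); i < x.limit(); i++) s += Math.abs(x.get(i));
        return s;
    }
}
```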

FWIW there are a couple of nice uses I see for supporting NIO buffers here but I've been unable to see gains in TTSP (time to safepoint, which ideally avoiding the pinning of GetPrimitiveArrayCritical would reduce, but I've yet to see evidence of any gains). The biggest gain has been in memory locality as all the hardware backing my current project is commodity multi-socket x64 boxes under NUMA constraints, achieved with the help of some topology-aware allocation routines.

@fommil
Owner

fommil commented Nov 30, 2015

@bkgood hi, thanks for looking at this! From skimming your branch, it does look a bit hacky for me to be able to merge into the main code branch: this would effectively create two extra API variants for everything that we currently have? (correct me if I'm wrong)

If this work is going to be done, I'd like it to be done from the ground up and for F2J to support the same API. That approach would involve going the #76 direction. Realistically, I'd be looking for funding for a 6 month period to do this.

@bkgood

bkgood commented Nov 30, 2015

> From skimming your branch, it does look a bit hacky for me to be able to merge into the main code branch

Agreed, that's why I wanted to ask for your thoughts before doing a big code drop-style pull request ;)

> this would effectively create two extra API variants for everything that we currently have?

I was planning to delete the POINTER_AS_LONG variant as part of my cleanup (performance improvement over NIO_BUFFER was appreciable but not what I'd consider significant), so it would only be the NIO_BUFFER variant plus the original array-based one.

> If this work is going to be done, I'd like it to be done from the ground up and for F2J to support the same API. That approach would involve going the #76 direction. Realistically, I'd be looking for funding for a 6 month period to do this.

Unfortunately my patches came from an immediate need and since they fulfill my needs (the F2J implementation is very much useless in my project, although I see its utility elsewhere) I can't really sell 6 months' funding and time.

One could presumably modify the F2J class files such that we duplicate all the methods (and associated bytecode), splice out the relevant Xaload instructions and splice in calls to XBuffer.get and replace the relevant bits of the method signatures. I'm less than excited about such a proposition for perhaps obvious reasons :) but I agree that fully supporting F2J is necessary for the goals of netlib-java. This would entail significantly less work IMO than a new Fortran-to-Java transpiler (and I would be willing to give it a go), although has a hacky nature of its own.
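
At source level, the bytecode splice proposed above amounts to the following transformation, shown here on a daxpy-like routine. This is a hand-written sketch with hypothetical names and signatures, not F2J output. In the array form each read compiles to a `daload`; in the buffer form it becomes a call to the absolute `DoubleBuffer.get(int)`. Stores need the same treatment, with `dastore` becoming `DoubleBuffer.put(int, double)`.

```java
import java.nio.DoubleBuffer;

final class LoadSwapSketch {
    /** Array form: y[yOff + i] += a * x[xOff + i], using array loads and stores. */
    static void daxpy(int n, double a, double[] x, int xOff, double[] y, int yOff) {
        for (int i = 0; i < n; i++) {
            y[yOff + i] += a * x[xOff + i];
        }
    }

    /** Buffer form: the same loop with loads replaced by get and stores by put. */
    static void daxpy(int n, double a, DoubleBuffer x, int xOff, DoubleBuffer y, int yOff) {
        for (int i = 0; i < n; i++) {
            y.put(yOff + i, y.get(yOff + i) + a * x.get(xOff + i));
        }
    }
}
```

Because the absolute accessors ignore the buffer's position, the offset parameters carry over unchanged from the array signature.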

If you're interested in a patch that solves the issue in a way similar to what I've got now, I'd very much like to work to get something mergeable :) I'm more than willing to do the work to get it there.

edit: blah, POINTER_TO_LONG should have been POINTER_AS_LONG

@fommil
Owner

fommil commented Nov 30, 2015

I suppose I could be convinced by something that supports F2J in a second API. But actually cutting a belli is extremely complex. Perhaps it's easier now that docker exists, but OS X was always the problem.

@fommil
Owner

fommil commented Nov 30, 2015

/belli/release/ (no idea what autocorrect was doing there)

@bkgood

bkgood commented Nov 30, 2015

A second API as in a second interface and set of implementations? Or additional methods to the existing class, along with support for copying in from DirectBuffers? (IOW an extension of what I'm doing now) My memory is a bit fuzzy but I imagine I could get this to generate a separate set of classes.

FWIW I've been developing on OSX and have my SNAPSHOT releases building there so I could lend a hand as needed. I imagine you'd want to do the building since you're signing but I can at least share my experiences (homebrew in particular made it fairly easy to get the fortran compiler I needed, and then it was just a matter of fixing a few paths and commenting out the non-osx stuff from the various pom.xmls).

@fommil
Owner

fommil commented Nov 30, 2015

I think a second API would be best otherwise we'll be hitting classfile limits on the existing interfaces/classes.

@bkgood

bkgood commented Dec 4, 2015

Sorry, your reply got buried in a gmail thread somehow :/

Looking at the classes generated by my fork, it seems we're quite far away from classfile limits on methods/class (https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-4.html#jvms-4.11 says ~65k)

17:25 $ grep 'public native' native_system/java/target/netlib-native/com/github/fommil/netlib/NativeSystemARPACK.java |wc -l
      48
17:28 $ grep 'public native' native_system/java/target/netlib-native/com/github/fommil/netlib/NativeSystemLAPACK.java |wc -l
    2352
17:29 $ grep 'public native' native_system/java/target/netlib-native/com/github/fommil/netlib/NativeSystemBLAS.java |wc -l
     404

It seems ok from a classfile perspective (and these numbers actually include the methods taking a pointer in a long which I don't want to merge, so they're actually higher than they ought to be).
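
The same count can be taken from a loaded class by reflection instead of grepping generated sources. A small sketch with a hypothetical class name; the limit constant follows from the 16-bit `methods_count` field described in the JVMS section linked above.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

final class NativeMethodCount {
    /** Per-class method limit implied by the 16-bit methods_count field (JVMS 4.11). */
    static final int MAX_METHODS = 65535;

    /** Number of methods declared directly on c that carry the native modifier. */
    static int countNative(Class<?> c) {
        int n = 0;
        for (Method m : c.getDeclaredMethods()) {
            if (Modifier.isNative(m.getModifiers())) n++;
        }
        return n;
    }
}
```

Calling `NativeMethodCount.countNative(...)` on a generated class such as NativeSystemLAPACK (once it is on the classpath) would give the figure to compare against `MAX_METHODS`.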

@fommil
Owner

fommil commented Dec 4, 2015

Well, there is that limit, but I'm also thinking about the burden on anyone who uses this library.

@eiennohito

eiennohito commented Jul 1, 2016

JNI stubs can be similar to what lz4-java does for array/BB support (https://github.com/jpountz/lz4-java/blob/master/src/jni/net_jpountz_lz4_LZ4JNI.c#L94)

@almson
Author

almson commented Jul 1, 2016

@bkgood It's exciting to hear that you were able to get the JVM to vectorize your methods. I've been following the autovectorization developments, but in my experience only loops that index the array starting from 0 get vectorized. Were you able to get it working with non-0 offsets? Or is your dgemv limited in that sense?
