Java On GPUs And FPGAs

There was a time when running a program on an array of processors meant that you worked in some high-powered lab somewhere. Now your computer probably has plenty of processors hiding in its GPU, and if you have an FPGA, you have everything you need to make something custom. The idea behind TornadoVM is to modify OpenJDK and GraalVM to support running some Java code on parallel architectures supported by OpenCL. The system can utilize multi-core CPUs, GPUs (NVIDIA and AMD), Intel integrated GPUs, and Intel FPGAs.

If you want to try your hand at accelerated Java, there are some Docker containers to get you started fast. There are also quite a few examples, such as a computer vision application.

There are some easier examples, such as this one that uses an FPGA. You can see the use of the @Parallel annotation inside for loops and some basic task management. If you prefer, you can start with the simple hello world example.
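To give a feel for the kind of loop that is a candidate for this treatment, here is a minimal sketch in plain Java. It does not use TornadoVM at all: a parallel stream stands in for the accelerated loop, and the class and method names are invented for illustration. The point is only that each iteration touches its own index, so the iterations can run in any order or all at once.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class ParallelLoopSketch {

    // Element-wise vector addition. Iteration i reads a[i] and b[i] and
    // writes c[i] only, so no iteration depends on any other.
    public static void vectorAdd(float[] a, float[] b, float[] c) {
        // Assumption: in TornadoVM this would be an ordinary for loop with
        // its index marked @Parallel. A parallel stream is used here as a
        // stand-in so the sketch runs on a stock JDK.
        IntStream.range(0, c.length)
                 .parallel()
                 .forEach(i -> c[i] = a[i] + b[i]);
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f};
        float[] b = {10f, 20f, 30f, 40f};
        float[] c = new float[a.length];
        vectorAdd(a, b, c);
        System.out.println(Arrays.toString(c));
    }
}
```

A loop that carried a running total from one iteration to the next would not qualify without being restructured, which is why the examples tend to look like simple array arithmetic.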

There are several articles and papers about TornadoVM, some of which are behind paywalls. However, we enjoyed this article, which has a good blend of theory and practice.

Java isn’t always the first choice for high-performance computing and we have to wonder how this would benchmark against someone using OpenCL in a more traditional language. On the other hand, if you know Java this might be a great way to get started with parallel processing.

We talked about CUDA, a competing technology, a while back, and many of the concepts are the same. OpenCL will even run on the Raspberry Pi.

48 thoughts on “Java On GPUs And FPGAs”

  1. Java would be awesome on a GPU. Java runs fast on as little as a quad core CPU, with as little as 16GB of ram, and just a few hundred GB of helper libraries.

    Imagine what 4096 cores would do!

    Pity GPUs have a paltry 8GB of memory or so, hello world could compile and run in just a few minutes!

    1. You know, we should be glad the designers of Java weren’t around during the early years of the PC; they would have drowned it under the weight of their Jabba the Hutt programming language.

      1. Yep, it quickly drops my input, quickly crashes, quickly eats the memory, etc.

        Apps that run quick on android are native c/c++ apps.

        everything else, is the “training wheels” of languages, aka java.

        FFS, BASIC and FreeBASIC, run 10X faster.

        Android itself is just a native C/C++ wrapper for java crap.

    2. I’ve written some large-ish Java programs and it doesn’t HAVE to be a memory hog. Java only becomes obese once you start using nested dependencies, just like any other **cough** python **cough** language.

  2. Funny you should mention that: Back in the mid-90’s I was selling the Sun Microsystems Netra line of workstations, right at the time Java was being developed. A couple of my associates used to work for Cray and Silicon Graphics, and these guys were geniuses at optimizing Cray supercomputers. One of their associates (a mathematician from Russia) figured out that the GPU on the Netra was accessible, and that we could increase the speed of doing floating-point arithmetic by a factor of 10 by utilizing the GPU.

    Sun was dumbfounded by their discovery, and although we gave demos all over the country, I never heard of Sun promoting or publicizing that capability. (Another overlooked opportunity by Sun…) However, one of the big energy companies was taking over every computer on its campus after work hours and doing radiation simulations using thousands of workstations under a parallel processing scheme. When we showed them, they bought a few Netras from us and proved that the system would work well for them. (Then they cut us out as the sales company and made a deal directly with Sun to buy hundreds of workstations. Needless to say, I felt cheated by both Sun and the company…)

    Anyway, that was my first experience using the GPU as an adjunct, and I have followed the progress for years, even making a number of video and ANN apps.

    1. Which Netra and Which GPU?

      I imagine this was either Netra i20 or Netra i 1/140 150 170?

      The GPU could have been a TGX (which could do accelerated wireframe), a ZX (which could do accelerated shaded graphics, with no textures except via software), or possibly an AG10E late in the game, as it is basically a Glint300sx + DSP + Imagine 128. It also could have been a Creator3D?

      Just curious as I have most of this hardware on hand. The Netra lineup was just the regular SparcStation and Ultra Lineup bundled with server software.

      1. I almost forgot the SX on the SS20 and SS10SX is basically a vector engine used for graphics that you just push data to from the CPU… quite possible that is what was used.

        NetBSD has at least partial support for running accelerated X11 on there as the instruction set has some documentation floating around I think.

  3. I might be wrong, but isn’t Java known to be memory heavy? The GTX 1050 mentioned in the example has 640 CUDA cores and 4 GB of RAM, which works out to 6.25 MB of RAM per core. Of course that’s not how it is used in practice, but it still keeps us from using most of Java’s functionality on a GPU. In every example, even the code blocks marked with “@Parallel” were pretty much plain C-style for-loops with nothing in particular hinting that they are Java code.
    .
    Another thing to note is that the tutorial specifically mentioned that, with respect to sequential CPU performance, we get 27X from the multicore CPU, 62X from HD 630 graphics and 81X from the GTX 1050.
    Another way of saying the same thing is that the GTX 1050 delivers 3X the performance of the multithreaded CPU and 30% more performance than the integrated HD Graphics 630. Like, how????
    If the GTX 1050 gives a 30% boost w.r.t. the HD 630, then there was no point in including it with the laptop mentioned. Also, the i7 7700HQ’s FP32 performance is 16 GFLOPS whereas the GTX 1050’s FP32 performance is 1860 GFLOPS, a ratio of 116:1. Sure, GFLOPS might not be a valid comparison, but in my case, a GEGL grayscale transformation gives a ~17.5X boost against a multicore CPU implementation (not single-core non-vector CPU) for an A10 9600p CPU and Radeon 340mx GPU.
    .
    Lastly, no reference against C/C++ implementation was given. Every example in the official git repository had an easy to read CUDA or OpenCL equivalent.
    .
    Like I mentioned in the beginning, I might be wrong, since I am not an expert and might be covering the wrong aspects.
    .
    This is with reference to article https://jaxenter.com/tornado-vm-java-162460.html
    One of its authors is a lead developer of TornadoVM.

    1. A GPU is designed to handle SIMD (Single Instruction Multiple Data), so its raw FPU performance would not be usable for general computation the way a CPU’s is. It is like asking why AVX (Advanced Vector Extensions) performance isn’t usable on non-vectored code.

      https://stackoverflow.com/questions/23447817/how-slow-is-comparison-and-branching-on-gpu
      >GPUs compute multiple work items (typ. 16 or 32) in lock-step in “warps” or “wavefronts” and if different work items take different paths they all take all paths but gate writes based on which path they are on (using predicate flags). So if your work items always (or mostly) branch the same way, you’re good. If they don’t, the penalty can rob performance.

      >On GPUs, don’t access global memory multiple times (as GPU memory management and caching work not exactly like a CPU). Instead, cache the global memory elements into thread’s private variables / shared memory as much as possible.

      I don’t code in Java nor GPU languages.

      1. Well….. Of course, comparing raw floating point performance is not a good idea. But still…… a 3 times boost from CPU to GPU for a mere grey scale transformation is too little. Grey scale image transformation is something that can easily be SIMD optimized.
        .
        They advertise that, by using their tool to implement Grey Scale Transformation in Java,
        GTX 1050 is 82X faster than scalar single core CPU
        integrated HD 630 is 62X faster than scalar single core CPU
        Multi core vector CPU is 27X faster than scalar single core CPU
        .
        Another way to interpret that is,
        GTX 1050 is 3X faster than vector multi core CPU
        GTX 1050 is 1.3X faster than integrated HD 630
        .
        The task here is decoloring image, which can easily be vectorized. I would expect GTX 1050 to be at least 15X if not 30X faster (or 100X if initialization and memory transfer time was insignificant w.r.t calculation time).
        .
        If CPU vs GPU is unfair, GTX 1050 is only 30% faster than integrated HD 630. Isn’t it supposed to be 4 times faster? Performance isn’t scaling well with more GPU power. Maybe for similar reasons, performance numbers were not given for GPUs like GTX 1080 ti or RTX 2080 super.
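The "another way to interpret that" figures in this comment come from dividing one quoted speedup by another, which works because all three are measured against the same scalar single-core baseline. A minimal sketch of that arithmetic, using the figures exactly as quoted in the comment (the class and method names are invented for illustration):

```java
import java.util.Locale;

public class SpeedupRatios {

    // If A is sA times faster than a baseline and B is sB times faster than
    // the same baseline, then A is sA / sB times faster than B.
    public static double relativeSpeedup(double sA, double sB) {
        return sA / sB;
    }

    public static void main(String[] args) {
        // Speedups over the scalar single-core CPU, as quoted in the comment.
        double gtx1050 = 82.0;
        double hd630 = 62.0;
        double multicoreCpu = 27.0;

        System.out.printf(Locale.ROOT, "GTX 1050 vs multicore CPU: %.2fx%n",
                relativeSpeedup(gtx1050, multicoreCpu));
        System.out.printf(Locale.ROOT, "GTX 1050 vs HD 630:        %.2fx%n",
                relativeSpeedup(gtx1050, hd630));
        System.out.printf(Locale.ROOT, "HD 630 vs multicore CPU:   %.2fx%n",
                relativeSpeedup(hd630, multicoreCpu));
    }
}
```

The division only holds when the baseline is identical for every figure, so it says nothing about how much of each run was spent on data transfer rather than computation.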

      2. Java is interpreted. The “compiler” simply translates it into byte code, which is interpreted by the byte-code interpreter, which eventually, slowly, turns it into very UN-optimized machine code, when it “feels like it”.

        the “fuzzing” and other static analysis, is done by tools written in C/C++, not in tools written in java. Fuzzing and static analysis is also available for C/C++ programs.

        Granted, the “damage” a poor programmer can do is very much limited in java, since there are several levels of abstraction protecting the hardware from poor programmers. It’s like riding a tricycle, wearing bubble wrap, and having everything around you bubble wrapped and isolated.

        The garbage collection is aptly named, as it’s garbage. Unless you really are a poor programmer, there is no reason why you need java to do it for you, when it “feels like it”. Dropped user input is simply unacceptable.

        Only the most trivial of code runs faster in java, than in C, even “hello world” in java takes 1.1s to execute, where the C takes 0.01s. Java is only 10X slower!

        Then again, this was run on a ryzen-7, 2700, with 32Gb of RAM, everything installed on an SSD, and only 8C/16T which is probably “entry level” for java these days.

        Strings translations and unicode has long since been solved.

        great for the candy crash generation, but for those of us that need IEEE-754, and other things that java simply doesn’t do, not so much.

        1. Short run benchmarks are always going to be affected by having startup overhead for one platform versus another where it is preloaded as part of the OS.

          The anti-Java arguments of the 90s are long gone – in part because the use case of consumer desktop applications died massively as the language shuffled into the massive niche of business applications, where startup time is irrelevant, JIT is good enough, and the frameworks and platforms are more important than the language. They’ve even made some efforts to fix the verbosity problem. GC research didn’t stop with Java 1 either, so modern GC is far better.

          Decades of C application memory leaks and security issues in the core libraries clearly show that most C programmers are ‘poor’ as well – or maybe bugs are to be expected from even the best programmers?

    2. Memory heavy is relative.

      Running a possibly widely parallel but simple microservice (say, message processing, but could be rest/web, etc) application written in Java on top of an enterprise stack (Spring, say) can run fine in 500MB, or could be 2GB (usually VMs these days). The more clients/messages you have to process in parallel, the more memory you need. But yes, there is a JVM/JIT memory overhead you cannot ignore in certain use cases.

      Remove the frameworks, remove/limit the parallelism, remove the caches, and so on, and you can trim it down a lot.

      I feel this comment thread does rehash all the arguments of the 90s, when memory capacity was an issue.

  4. This was in May or June 1995. I can’t find my notes so I can’t tell you exactly what the other guys had, but I know that my device was an i 150. I used it almost exclusively for teaching Java programming, and (of course) self-education.

  5. I think this is pretty much the goal of Intel’s ‘open standard’ oneAPI initiative – https://software.intel.com/en-us/oneapi. Not Java, but portability of C (Visual C++) and Fortran (still important in HPC) across CPU, GPU, FPGA, NPU… It is relevant to code support for the Aurora Exascale platform (coherent). If nothing else it enables code portability between GPU vendors.

  6. Java can run fast when optimized for particular workloads. It also has next-level dependency management when building with maven or gradle. On a project with more than four developers, the dependency management alone can save man-weeks of time.

    That said, the native UI stuff for java is not very responsive, it’s slow to startup and memory intensive, and you’ll have perpetual devops issues with java security updates breaking stuff. Also, you’re going to be miserable if you have to deal with java keystores.

    My guess is that this is more intended for defense stuff.

    1. Or, one can just compile and link with C/C++, and avoid the “many man-weeks” of gradle, maven, raven, craven, drable, brabble, frabble, frooble, drooble, doodle, poodle, moodle and other nonsense dependencies.

      1. Yeah, instead you can try to manage incompatible Microsoft runtime libraries and incompatibilities between Microsoft C, gcc and clang. And let’s not forget that make is not the same program on every platform so you need to be extra careful with your makefiles. Oh and good luck debugging your C++ program on the latest macOS with the locked-down debugger.

        And we can just conveniently forget that there are zero humans who are capable of writing C++ code that is not riddled with buffer overrun problems.

        1. The same buffer overrun would also affect java, since there are “zero humans” capable of writing code that is not riddled with buffer overrun problems.

          Java interpreters are written in C/C++. Therefore, they are also vulnerable.

          For maximum speed and safety, we can write the java interpreter in java, and run that on another java interpreter written in java, running on a java cpu, written in java.

          When it boots in a few months, think of the speed and security!

          1. The java compiler and runtime libraries have been peer reviewed and fuzzed and subjected to massive static analysis many times by large corporations with millions of dollars to spend on the task. There is no way that you can test your code to the same level.

            java is not interpreted, it is a compiler, it generates machine-specific object code and executes it directly. In most cases the generated code has identical performance to the equivalent C++ code.

            The algorithms used by java’s collection classes are far superior to those in the C++ standard library so your code will run faster in java than C++.

            It’s much easier to reach a worldwide market for your product if it’s in java. Handling unicode is a massive nightmare in C++, and adding support for multiple languages is not supported by the language or the runtime so you have to do it all yourself. By contrast java has extensive support for unicode and for localization of strings.

          2. @N
            > The algorithms used by java’s collection classes are far superior to those in the C++ standard library so your code will run faster in java than C++.

            That is if you stick to or even need standard collection classes in the first place.

            I’m certain Starduino on Arduboy would not run faster if written in Java, or run at all for that matter.

            The strength of C++ is that you can implement those “superior” algorithms right on the standard containers if you need to.

            Every benchmark I’ve seen “proving” that Java is “superior” was blindly 1:1 translated to C++ without using any of C++’s strength and flexibility.

            > And we can just conveniently forget that there are zero humans who are capable of writing C++ code that is not riddled with buffer overrun problems.

            Unless Java stuff is entirely written by perfect inter-dimensional demigods you’re just shifting the problem.

            Morons in C++ will have their code riddled with buffer overruns.
            Morons in Java will have their code riddled with SQL injections and other unvalidated inputs.

            It’s a disaster regardless.

    2. yet another person utterly unaware of the enormous java presence on the server, where server processes run continuously and startup time is a non-issue.

      “you’ll have perpetual devops issues with java security updates breaking stuff.”

      All software needs to be updated; java is no different from any other language in that respect.

      1. There is a large java presence on servers, what else is going to run the crapplet serving websites. trust me, I can wait for the crap to load, as my computer does real work, and some poor server smokes under the weight of all that cruft.

        1. Nobody cares about your problems with applets, the technology is deprecated, deader than Elvis. These web sites you talk about, have you actually visited them in the past 10 years?

  7. Java, write once, debug everywhere.

    Java, write once, maintain 10 different JREs, and 100Gb of helper library versions.

    Java, write once, run nowhere.

    Java, now “somewhat” IEEE-754 compliant.

    Java, because your time is worthless.

    Java, because we don’t debug, let the user do it!

    Java, because programming is hard, let someone else do the hard stuff for you!

    1. What is your problem? My customers and I work extensively in java with great productivity and success. Many of the web applications that you use every day have extensive back-end java components; it is proven successful. Customers are abandoning their dreadful ODBC-based C++ back ends, instead opting for clean, simple JDBC-based back ends in java. Have you ever worked with ODBC? What a nightmare, but you gotta do it if you want a database in your C++ application.

      We spend zero time tracking down buffer overruns and we don’t have to hunt down third-party libraries for things like XML and LDAP and TLS because they are rolled into java.

      We have no issue with running on high-performance platforms like AIX that don’t have an extensive set of third-party libraries included with the distribution, and we don’t have to worry about keeping up with security updates on third-party libraries that we would have to manage ourselves on these platforms.

      The portability problems that you cite with IEEE-754 will also be present in your cross-platform C++ application.

      1. +1 – One compile of a (large) codebase runs fine across AIX, windows, and linux servers. And even if a C++ hello world runs a bit faster, so what? Unless you’re serving amazon.com, sometimes savings on development, testing, and upgrade time outweigh shaving 1 ms off of a request. Even some old java-based sites we have kicking around can easily handle 100+ concurrent user sessions, on a single-core VM with a couple gig of ram, and have very snappy response time and minimal CPU. But feel free to carry on with the nonsense of 16GB SSD machines…

      2. – Oh, and +10 on ODBC. I can’t believe the complication / pain in the arse it is to get third-party software set up on occasion to talk to databases we use JDBC with – a simple/small/lightweight driver jar with no install. ODBC with 32 vs 64 bit datasources in windows, needing to install windows database client drivers to even get the ODBC setup up, user vs system connections, etc. Silliness.

  8. Benchmarks:

    time ./java HelloWorld — ” Compiled ” java class

    Hello World

    real 0m0.049s
    user 0m0.047s
    sys 0m0.035s

    time ./HelloWorld — gcc HelloWorld.cpp -O0

    Hello World

    real 0m0.001s
    user 0m0.001s
    sys 0m0.000s

    49X slower, not 10X slow, but 49X slower, for “Hello World”.

    On a 2700X, 32Gb , SSD, 8C/16T machine.
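The `time` figures above are wall-clock for the whole process, so the Java number includes starting the JVM as well as printing the string. A rough sketch of one way to separate the two from inside the program, assuming a stock JDK with the `java.management` module available (the class and method names are invented for illustration, and the uptime reading is only an approximation of start-up cost):

```java
import java.lang.management.ManagementFactory;

public class StartupSketch {

    // Milliseconds since the JVM started, as reported by the runtime itself.
    public static long jvmUptimeMillis() {
        return ManagementFactory.getRuntimeMXBean().getUptime();
    }

    public static void main(String[] args) {
        long before = System.nanoTime();
        System.out.println("Hello World");
        long printNanos = System.nanoTime() - before;

        // Uptime covers JVM initialisation plus everything done so far;
        // printNanos covers only the println call itself.
        System.out.println("JVM uptime at exit (ms): " + jvmUptimeMillis());
        System.out.println("Time in println (us):    " + printNanos / 1_000);
    }
}
```

Comparing the two readings shows how much of a short run is start-up rather than the work being measured, which matters far less for a process that stays up for hours.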

    1. So much Java bashing here!

      This is more tribalism than in a god damn soccer riot!

      Java is a REALLY nice performance and syntax mix
      between interpreted hipster garbage like Python and
      memory and thread unsafe debugging hell holes like C++.

      Take a look at GraalVM, its performance is crazy!

      Languages have their domains and specialities, get over it!

      1. Yeah, syntax mix is truth. Why can’t java be consistent in its syntax use?
        I regret wasting precious university time on Java when i could have used that to get better at C, C++, python, pascal or other actually not stupid languages. Now that knowledge is useless because i get rashes just thinking about using Java.

        1. What is inconsistent about the lang? You REALLY think it’s less consistent than python or c++?!? Java barely has any syntax. It avoids fancy stuff like operator overloading. You don’t like that? Well then go have fun code golfing with your script kiddy friends. But don’t expect to ever write well maintainable large scale systems.

          Also, java compiles to bytecode alongside sooo many other jvm languages like Kotlin, Scala, Groovy, Jython, JRuby, Clojure….
          You can use ANY java library with these langs. You want to bash all of them too?
