Wednesday, April 27, 2011

Java Concurrency and Performance Testing

Any moderately complex distributed application, or any application that benefits from parallelization, needs threads in order to fully utilize the multicore, multiprocessor hardware that is now in use virtually everywhere.

One part of your design that should be done in "parallel" is the development of a concurrency and performance testing framework.


Out of order execution
Warming the cache (pre-load all entities into memory prior to multithreaded testing)
Warming the HotSpot JVM (run certain routines long enough for the optimizing compiler to kick in and convert hot bytecode sections to machine code)
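
The JVM warm-up point above can be sketched as follows. This is only a minimal illustration: the class name, the work() routine and the run counts are hypothetical stand-ins, and the number of warm-up calls needed before the optimizing compiler kicks in depends on the JVM and its settings.

```java
public class WarmupTiming {
    // Hypothetical stand-in for the routine under test.
    static long work(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i % 7;
        }
        return sum;
    }

    // Returns the average time per call in nanoseconds, measured only after the warm-up calls.
    static long averageNanosAfterWarmup(int warmupRuns, int measuredRuns) {
        long sink = 0;
        // Warm-up phase: give the JIT compiler a chance to compile work() before timing starts.
        for (int i = 0; i < warmupRuns; i++) {
            sink += work(10_000);
        }
        // Measured phase.
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            sink += work(10_000);
        }
        long elapsed = System.nanoTime() - start;
        // Never true for these inputs; using the result discourages the JIT from discarding work().
        if (sink == Long.MIN_VALUE) {
            System.out.println(sink);
        }
        return elapsed / measuredRuns;
    }

    public static void main(String[] args) {
        System.out.println("avg ns/call after warm-up: " + averageNanosAfterWarmup(20_000, 1_000));
    }
}
```

The same idea applies to the cache: load the entities once before the timed section so that the first measured iteration does not pay the loading cost.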

There are several design patterns that can be used to set up a thread pool for testing:

Option 1: Runnable Thread

    // Requires java.util.ArrayList and java.util.List.
    // assertEquals is assumed to come from the enclosing JUnit test class.

    /**
     * This private function is the implementation that multi-threaded tests run.
     * The function will create the specified number of threads, start them and wait for the run() methods to finish.
     * @param numberOfThreads - the number of concurrent threads that will run this test
     * @param iterations - the number of iterations for each thread in its run method
     */
    private void threadSafetyPrivate(int numberOfThreads, long iterations) {
        List<Thread> threadList = new ArrayList<Thread>();
        for(int i=0; i<numberOfThreads; i++) {
            Thread aThread = new Thread(new InnerRunnable(iterations));
            threadList.add(aThread);
            aThread.start();
        }

        // Wait for numberOfThreads threads to complete before ending test
        for(Thread aThread : threadList) {
            try {
                aThread.join();
            } catch (InterruptedException ie_Ignored) { } // The InterruptedException can be ignored during the test
        }
    }

    // Inner class implements Runnable instead of extending Thread directly
    class InnerRunnable implements Runnable {
        private long iterations;

        public InnerRunnable(long iterations) {
            this.iterations = iterations;
        }

        public void run() {
            // The following counter will keep track of any failures and secondary exceptions due to thread contention.
            long exceptions = 0;
            // We loop an arbitrary number of iterations inside each thread
            while(iterations-- > 0) {
                try {
                    // some business logic code
                    try {
                        /*
                         *  We can fine tune the contention error rate by adding wait times.
                         *  If we do a short Thread.yield() we will get close to 100% failure with 2-4 threads
                         *  If we add 1 or more ms then the error rate drops to around 50%
                         *  and reaches near 100% with more than 512 threads
                         */
                        Thread.sleep(0,500); // use instead of Thread.yield(); so we get a little less vigorous thread contention
                    } catch (InterruptedException safeToIgnore) { }
                } catch (Exception e) {
                    exceptions++; // count every failure raised by the business logic
                }
            }
            // A thread safe implementation should generate no errors in conversion and no internal exceptions
            assertEquals(0, exceptions);
        }
    }
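
One caveat with asserting inside run(): an assertion that fails on a worker thread does not normally fail the test on the calling thread. The sketch below is one possible variation, not the approach used in this post. It assumes a java.util.concurrent fixed thread pool, collects the failure count in a shared AtomicLong, and returns it to the caller, which can then assert on it. The class name, method name and the 60 second timeout are all hypothetical choices.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class PooledContentionTest {
    // Runs task 'iterations' times on each of 'numberOfThreads' pool threads and
    // returns how many runs threw a RuntimeException.
    static long runConcurrently(int numberOfThreads, long iterations, Runnable task) {
        AtomicLong exceptions = new AtomicLong();
        CountDownLatch startGate = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(numberOfThreads);
        for (int i = 0; i < numberOfThreads; i++) {
            pool.execute(() -> {
                try {
                    startGate.await(); // hold every worker until all are ready, to maximize contention
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (long n = 0; n < iterations; n++) {
                    try {
                        task.run();
                    } catch (RuntimeException e) {
                        exceptions.incrementAndGet();
                    }
                }
            });
        }
        startGate.countDown(); // release all workers at once
        pool.shutdown();
        try {
            if (!pool.awaitTermination(60, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return exceptions.get();
    }

    public static void main(String[] args) {
        AtomicLong counter = new AtomicLong();
        long failures = runConcurrently(4, 1000, counter::incrementAndGet);
        System.out.println("failures=" + failures + " count=" + counter.get());
    }
}
```

In a test method the caller would then assert that the returned failure count is 0, so the failure is reported on the thread the test framework is watching.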


Monday, November 1, 2010

Custom Multiprocessor Parallel Processing Hardware Experimentation

You can follow progress of the KiloCore prototype on the blog

  . In early 2008 I became increasingly interested in hardware-based parallelism.  The majority of the design and defect issues we are seeing in enterprise software tools development involve concurrent processes and how to fully utilize the available cores in existing and future systems.  Parallelization of software is a huge problem, so I decided to approach it from the additional view of "bottom-up" engineering - by designing and writing simple "architecture-aware" systems to start.
    A "base-case" architecture would initially do away with microprocessors altogether and instead be a purely parallel machine with very simple cores - like a 1-bit symbolic ALU consisting of a few logic gates.  We could realize this custom machine using standard (since the 70's) TTL HSCMOS chips or even a single FPGA.
    The parallelizable algorithm of choice is cellular automata for a symbolic machine or the Mandelbrot set for a floating point machine - we will implement the 256 function/1-dimensional/2-state Wolfram CA in hardware for our first base case parallel machine.  The PCAM implements a boolean function of 3 variables (the left, center and right cells on a cell array) - there are 2^(2^n) such functions, which is 2^8 = 256 for n = 3.  Each function is therefore identified by an 8-bit byte, which fits very nicely in the 74 TTL logic family, as a single 151/251 data selector can be used as the core of each ALU.

  . As we increase the number of cores past the kilocore range, the actual power of each processor becomes less important - it is the connection topology (hypercube or rectilinear network) and the parallel problem space that reign.
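
The 3-input Wolfram CA function described above can be sketched in software as follows. The 8-bit rule number plays the role of the eight data inputs of a data selector, and the (left, center, right) cells act as the three select lines. This is only an illustrative sketch: the class and method names are hypothetical, and the wrap-around boundary is an assumption that the hardware design may not share.

```java
public class ElementaryCA {
    // One 1-bit cell: the three neighbourhood bits select one of the 8 bits of the rule byte.
    static int nextCell(int rule, int left, int center, int right) {
        int select = (left << 2) | (center << 1) | right; // value 0..7
        return (rule >> select) & 1;
    }

    // Computes the next generation for the whole array (assumed wrap-around at the edges).
    static int[] step(int rule, int[] cells) {
        int n = cells.length;
        int[] next = new int[n];
        for (int i = 0; i < n; i++) {
            int left = cells[(i + n - 1) % n];
            int right = cells[(i + 1) % n];
            next[i] = nextCell(rule, left, cells[i], right);
        }
        return next;
    }

    // Renders a row using '1' for live cells and a space for dead cells.
    static String render(int[] cells) {
        StringBuilder sb = new StringBuilder("[");
        for (int c : cells) {
            sb.append(c == 1 ? '1' : ' ');
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        int[] cells = new int[31];
        cells[15] = 1; // single live cell in the middle
        for (int gen = 0; gen < 8; gen++) {
            System.out.println(render(cells));
            cells = step(90, cells); // rule 90: each cell becomes left XOR right
        }
    }
}
```

Every cell computes its next state independently of the others within a generation, which is what makes the algorithm a natural fit for one simple core per cell.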

Should our multiprocessor grid use the Parallax Propeller 8-core microcontroller with 2KB/core @ 20 MIPS/core, or the recently introduced XMOS G4 4-core/32 thread chip with 64KB/core @ 400 MIPS/core? Or should we just start learning Verilog/VHDL and start using our FPGA dev kits from Xilinx, Altera and Actel?

We like to work with breadboards (the software from is very useful) - we will be sticking to DIP packaging for now - the Parallax Propeller 40 pin DIP.
Above is an early design that is currently deprecated.
We will not use a proper hypercube network yet, like the XMOS 16 chip/64 core prototype, because it would use 1-4 cores of each 8 core Propeller chip for inter-chip and inter-cog routing - leaving just four (4) cores for SIMD processing, like the following...

Therefore we will be designing a simpler 2-dimensional asynchronous hardwired network so we can use all 8 cores on each chip as follows.

Here is our 8 chip / 64 core prototype to start with:

And now we start the process of getting an 80 propeller chip / 640 core multiprocessor working. Why 640 cores? Until I start designing modular 2-8 chip boards for a near production prototype, my workspace does not fit more than 40 breadboards on a whiteboard.
We will be encountering issues like power/voltage loading, parasitic capacitance, TTL-HSCMOS fanout and placing at least 4000 wires. Not to mention the hardest part of the process - the software.

It may be better to use the surface mount propeller boards now available from Parallax - here is a 16 board - 128 processor prototype.

Here is part of our 80 chip / 640 core propeller multiprocessor prototype in progress - see you in a couple of months.

80 Parallax Propeller 8-core microcontroller based parallel processing computer in progress...

Simulation in Software using Java JEE:
   Instead of using VHDL/Verilog we will simulate our SIMD devices in software using Java as the computing substrate along with JPA to persist our model and simulation runs.

Performance Results:
Without JPA persistence (in memory Entity creation/traversal only)
[  11   111  1            1111  1] iter: 65536 time: 12319 ns
Total time: 2.699895754 sec @ 24273.52978458738 iter/sec
With JPA persistence (Derby on the same server)
[  11   111  1            1111  1] iter: 65536 time: 13232403 ns
Total time: 967.985705124 sec @ 67.703479145495 iter/sec
From these results we are able to subtract the object instantiation overhead (the in-memory run) from the test and concentrate on persistence times.
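
The arithmetic behind the totals above can be reproduced with a few lines of Java. The iteration count and total times are taken from the results listed here; the class and method names are hypothetical. Subtracting the in-memory total from the persisted total puts the persistence cost at roughly 14.7 ms per iteration, with the persisted run about 358 times slower overall.

```java
public class PersistenceOverhead {
    // Iterations completed per second.
    static double throughput(long iterations, double totalSeconds) {
        return iterations / totalSeconds;
    }

    // Extra milliseconds per iteration that the persisted run costs over the in-memory run.
    static double overheadMillisPerIteration(long iterations, double persistedSeconds, double inMemorySeconds) {
        return (persistedSeconds - inMemorySeconds) / iterations * 1000.0;
    }

    public static void main(String[] args) {
        long iterations = 65536;
        double inMemorySeconds = 2.699895754;    // without JPA persistence
        double persistedSeconds = 967.985705124; // with JPA persistence (Derby on the same server)

        System.out.println("in-memory iter/sec: " + throughput(iterations, inMemorySeconds));
        System.out.println("persisted iter/sec: " + throughput(iterations, persistedSeconds));
        System.out.println("persistence overhead ms/iter: "
                + overheadMillisPerIteration(iterations, persistedSeconds, inMemorySeconds));
        System.out.println("slowdown factor: " + persistedSeconds / inMemorySeconds);
    }
}
```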