Monday, November 1, 2010

Custom Multiprocessor Parallel Processing Hardware Experimentation

You can follow progress of the KiloCore prototype on the blog

Custom Multiprocessor Parallel Processing Hardware Experimentation:

  . In early 2008 I became increasingly interested in hardware-based parallelism.  The majority of the design and defect issues we are seeing in enterprise software tools development involve concurrent processes and how to fully utilize available cores in existing and future systems.  Parallelization of software is a huge problem, so I decided to approach it from the additional view of "bottom-up" engineering: by designing and writing simple "architecture-aware" systems to start.
    A "base-case" architecture would initially do away with microprocessors altogether and use a purely parallel machine with very simple cores, such as a 1-bit symbolic ALU consisting of a few logic gates.  We could realize this custom machine using standard (since the 70's) TTL/HSCMOS chips or even a single FPGA.
    The parallelizable algorithm of choice is cellular automata for a symbolic machine or the Mandelbrot set for a floating point machine. We will implement the 256-function, 1-dimensional, 2-state Wolfram CA in hardware for our first base-case parallel machine.  The PCAM implements a boolean function of 3 variables (the left, center and right cells on a cell array). There are 2^(2^n) such functions, or 256 for n = 3, so each one is identified by an 8-bit rule number.  This 8-bit byte fits very nicely in the 74 TTL logic family, as a single 151/251 data selector can be used as the core of each ALU.
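    As a rough software illustration of the data-selector idea, the sketch below treats the 8-bit rule number as the eight data inputs of an 8-to-1 selector and the three neighbor bits as its select address. The class and method names (RuleCell, next, step) are hypothetical, and treating cells beyond the edges of the array as 0 is an assumption, not something the hardware design specifies.

```java
// Hypothetical sketch: one 1-bit "ALU" modeled as an 8-to-1 data selector.
final class RuleCell {
    /** Returns the next state (0 or 1) of a cell under a Wolfram rule number (0..255). */
    static int next(int rule, int left, int center, int right) {
        // The three neighbor bits form the select address (0..7).
        int select = (left << 2) | (center << 1) | right;
        // The rule byte stands in for the eight data inputs; pick bit number `select`.
        return (rule >> select) & 1;
    }

    /** Advances a whole row by one generation. Assumption: cells beyond the edges read as 0. */
    static int[] step(int rule, int[] cells) {
        int[] out = new int[cells.length];
        for (int i = 0; i < cells.length; i++) {
            int left = (i > 0) ? cells[i - 1] : 0;
            int right = (i < cells.length - 1) ? cells[i + 1] : 0;
            out[i] = next(rule, left, cells[i], right);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] row = new int[31];
        row[15] = 1; // start from a single live cell in the middle
        for (int gen = 0; gen < 8; gen++) {
            StringBuilder sb = new StringBuilder();
            for (int c : row) {
                sb.append(c == 1 ? '1' : ' ');
            }
            System.out.println("[" + sb + "]");
            row = step(110, row); // rule 110 chosen only as an example
        }
    }
}
```

    Every cell runs the same function with no shared state inside a generation, which is why one selector chip per cell is enough for the combinational part of each core.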

  . As we increase the number of cores past the kilocore range, the actual power of each processor becomes less important: it is the connection topology (hypercube or rectilinear network) and the parallel problem space that reign.
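To make the topology point concrete, here is a small sketch comparing hop counts in a hypercube and in a non-wrapping 2-dimensional grid. The node numbering (binary addresses for the hypercube, row-by-row indices for the grid) and all names are assumptions made for illustration; they do not describe the wiring of any board in this post.

```java
// Hypothetical sketch: node numbering and method names are assumptions for illustration.
final class TopologySketch {
    /** Neighbors of a hypercube node: flip each of the `dimensions` address bits in turn. */
    static int[] hypercubeNeighbors(int node, int dimensions) {
        int[] neighbors = new int[dimensions];
        for (int bit = 0; bit < dimensions; bit++) {
            neighbors[bit] = node ^ (1 << bit);
        }
        return neighbors;
    }

    /** Minimum hops between two hypercube nodes: the number of address bits that differ. */
    static int hypercubeHops(int a, int b) {
        return Integer.bitCount(a ^ b);
    }

    /** Minimum hops between two nodes of a non-wrapping 2D grid numbered row by row. */
    static int gridHops(int a, int b, int width) {
        int rowDistance = Math.abs(a / width - b / width);
        int colDistance = Math.abs(a % width - b % width);
        return rowDistance + colDistance;
    }

    public static void main(String[] args) {
        // 64 nodes: a 6-dimensional hypercube versus an 8 x 8 grid, corner to opposite corner.
        System.out.println("hypercube hops 0 -> 63: " + hypercubeHops(0, 63));
        System.out.println("grid hops 0 -> 63: " + gridHops(0, 63, 8));
    }
}
```

For 64 nodes the worst case is 6 hops in the hypercube against 14 in the 8 x 8 grid, at the cost of 6 links per node instead of at most 4.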

Should our multiprocessor grid use the Parallax Propeller 8-core microcontroller with 2 KB/core @ 20 MIPS/core, or the recently introduced XMOS G4 4-core/32-thread processor with 64 KB/core @ 400 MIPS/core? Or should we just start learning Verilog/VHDL and start using our FPGA dev kits from Xilinx, Altera and Actel?

We like to work with breadboards (the software from is very useful), so we will be sticking to DIP packaging for now: the Parallax Propeller 40-pin DIP.
Above is an early design that is currently deprecated.
We will not yet use a proper hypercube network like the XMOS 16-chip/64-core prototype, because it would use 1-4 cores of each 8-core Propeller chip for inter-chip and inter-cog routing, leaving just four (4) cores for SIMD processing, like the following...

Therefore we will be designing a simpler 2-dimensional asynchronous hardwired network so we can use all 8 cores on each chip, as follows.

Here is our 8 chip / 64 core prototype to start with:

And now we start the process of getting an 80 Propeller chip / 640 core multiprocessor working. Why 640 cores? Until I start designing modular 2-8 chip boards for a near-production prototype, my workspace does not fit more than 40 breadboards on a whiteboard.
We will be encountering issues like power/voltage loading, parasitic capacitance, TTL-HSCMOS fanout and placing at least 4000 wires. Not to mention the hardest part of the process - the software.

It may be better to use the surface-mount Propeller boards now available from Parallax. Here is a 16-board, 128-processor prototype.

Here is part of our 80-Propeller, 640-core multiprocessor prototype in progress. See you in a couple of months.


80 Parallax Propeller 8-core microcontroller based parallel processing computer in progress...

Simulation in Software using Java JEE:
   Instead of using VHDL/Verilog we will simulate our SIMD devices in software using Java as the computing substrate along with JPA to persist our model and simulation runs.

Performance Results:
Without JPA persistence (in memory Entity creation/traversal only)
[  11   111  1            1111  1] iter: 65536 time: 12319 ns
Total time: 2.699895754 sec @ 24273.52978458738 iter/sec
With JPA persistence (Derby on the same server)
[  11   111  1            1111  1] iter: 65536 time: 13232403 ns
Total time: 967.985705124 sec @ 67.703479145495 iter/sec
Comparing the two runs lets us remove the object instantiation overhead from the test and concentrate on persistence times.
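
The arithmetic behind these figures can be checked with a few lines of Java. This is only a sketch that reuses the iteration count and total times reported above; the class and method names are hypothetical and it is not the benchmark code itself.

```java
// Hypothetical helper: recomputes throughput from the totals reported above.
final class ThroughputCheck {
    /** Iterations completed per second of wall-clock time. */
    static double iterationsPerSecond(long iterations, double totalSeconds) {
        return iterations / totalSeconds;
    }

    /** How many times longer the slow run took than the fast run. */
    static double slowdown(double slowSeconds, double fastSeconds) {
        return slowSeconds / fastSeconds;
    }

    public static void main(String[] args) {
        long iterations = 65536;
        double inMemorySeconds = 2.699895754;    // total time without JPA persistence
        double persistedSeconds = 967.985705124; // total time with JPA persistence (Derby)

        System.out.println("in memory: " + iterationsPerSecond(iterations, inMemorySeconds) + " iter/sec");
        System.out.println("with JPA:  " + iterationsPerSecond(iterations, persistedSeconds) + " iter/sec");
        System.out.println("slowdown:  " + slowdown(persistedSeconds, inMemorySeconds) + "x");
    }
}
```

The ratio of the two totals comes out between 358 and 359, so nearly all of the time in the persisted run is spent in persistence rather than in entity creation and traversal.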