### **Evolutionary Computation With GPUs**

### Pierre Collet

SONIC Stochastic Optimisation and Nature Inspired Computing

## BFO – LSIIT

Université De Strasbourg pierre.collet@unistra.fr

> Copyright is held by the author/owner(s). GECCO'12 Companion, July 7–11, 2012, Philadelphia, PA, USA. ACM 978-1-4503-1178-6/12/07.

Simon Harding

IDSIA

SUPSI, USI

Lugano, Switzerland

### Organisation of the tutorial • General scope (P. Collet) > GPGPU Super-computers Clusters of GPGPU machines • GPU Programming 101 (S. Harding) Programming model > CUDA 101 > Alternatives to Cuda ◆ EASEA platform (P. Collet) > Benchmarks and real problems Pierre Collet Simon Harding 2 LSiiT CNIS





IDSI



### Future computers will be massively parallel

- In Nov 2007, the fastest machine was 500 Tflops (212,992 PowerPC 440 cores clocked at 700MHz).
- In Oct 2010, Tianhe-1a is capable of 2.5 Pflops. It is composed of 112 computer cabinets, 12 storage cabinets, 6 communications cabinets, and 8 I/O cabinets, for a total of 3,584 blades, containing 14,336 2.93GHz CPUs and 3 million 575MHz GPU cores.
- March 2011: Oak Ridge National Lab launches TITAN, a 20 Pflop machine based on NVIDIA GPU cards.
- Exaflop machines are predicted to appear by 2020, and Zetaflop machines by 2030.

Question : how do you parallelize programs to efficiently use such machines ?



### EC and massive parallelism

- Evolutionary Algorithms are *generic* optimization methods that are intrinsically parallel.
- They can exploit *ANY* kind of parallelism (SIMD, SPMD, MIMD, ...)
- EA parallelism scales well (possibly supralinear speedup !)
- Perfect paradigm for Peta and Exascale computing, provided one can handle 4 levels of parallelism:
  - > Massive parallelism on one GPU card
  - > Parallelizing over several GPU cards
  - > Parallelizing over several GPU machines
  - > Parallelizing over several clusters of GPU machines







### Coming up

- We will primarily look at CUDA
- But will also mention some alternatives:
  - > Shaders
  - > Products from GASS, Tidepowrd and Microsoft
  - ▹ OpenCL













### Programming model

- Each thread can see the local memory of that processor core
- Threads within a block see the same shared memory

ĹŜiiT

Simon Harding

IDSIA

• Between blocks, there is the global memory

CNrs

Pierre Collet

### Programming model

- Each thread can calculate an ID using predefined variables available in the kernel:
  - > gridDim : Dimensions of the grid in blocks
  - blockDim : Dimensions of the block in threads
  - blockIdx : Block index within the grid
  - threadIdx : Thread index within the block
- Dimensions are defined when the kernels are launched.



### **CUDA 101**

- Let's write a program to add two vectors of numbers together
- E.g. [0,1,2,3,4] + [5,6,7,8,9] = [5,7,9,11,13]
- ♦ Do this in parallel
  - > Each thread will deal with one index in the vectors







### **CUDA 101**







### **CUDA 101**

// Pointers to the device side data
int \*dev\_a, \*dev\_b, \*dev\_c;

cudaMalloc( (void\*\*)&dev\_a, DataSize ); cudaMalloc( (void\*\*)&dev\_b, DataSize ); cudaMalloc( (void\*\*)&dev\_c, DataSize );

cudaMemcpy(dev\_a, a, size, cudaMemcpyHostToDevice); cudaMemcpy(dev\_b, b, size, cudaMemcpyHostToDevice);



### **CUDA 101**



### **CUDA 101**





# <section-header><section-header><section-header><section-header><text><text><text><text><text><text><text>





Simon Harding

IDSI

## Another CUDA example

\_global\_\_\_ void dot( int \*a, int \*b, int \*c ) \_shared\_\_ int temp[THREADS\_PER\_BLOCK]; int index = threadIdx.x + blockIdx.x \* blockDim.x; temp[threadIdx.x] = a[index] \* b[index]; \_\_syncthreads(); if( 0 == threadIdx.x ) { int sum = 0; for( int i = 0; i < THREADS\_PER\_BLOCK; i++ ) sum += temp[i]; atomicAdd( c , sum ); Pierre Collet Simon Harding UNIVERSITÉ DE STRASBOURG ĹŜiiT 1 CNIS IDSI/







### CUDA

- ◆ This tutorial only provides the real basics.
- ♦ Lots more to learn
  - > Especially if you want highly optimized CUDA
- Lots of resources on line:
  - > Google is your friend here.





# According to wikipedia There are MANY options:



## Quick note!

- ♦ Nvidia also supply BLAS and FFT libraries
- These are also wrapped in other languages
  - > And in fact can be very easy to use from other languages.

### import numpy from pycublas import CUBLASMatrix

- A = CUBLASMatrix( numpy.mat([[1,2,3],[4,5,6]],numpy.float32) )
- B = CUBLASMatrix( numpy.mat([[2,3],[4,5],[6,7]],numpy.float32)

C = A\*B print C.np\_mat()

(code from Wikipedia)



# pyCuda mod = comp.SourceModule(""" global \_ void multiply them(float \*dest, float \*a, float \*b) { const int i = threadIdx.x; dest[i] = a[i] \* b[i]; } """) multiply\_them = mod.get\_function("multiply\_them") a = numpy.random.randn(400).astype(numpy.float32) b = numpy.random.randn(400).astype(numpy.float32) b = numpy.zeros\_like(a) multiply\_them(drv.Out(dest), drv.ln(a), drv.ln(b), block=(400,1,1)) print dest-a\*b

### pyCuda



# Attp://www.hoopoe-gloud.com/Solutions/CUDA.NET/ Image: Output of the state of the s



### Don't like the look of CUDA?

• What else is there that has been used in the EA community and elsewhere?

# Pierre Collet Simon Harding







### RapidMind

- ♦ Obslete
- Company bought up by Intel
- Seems to have moved to new product : Array Building Blocks
- <u>http://software.intel.com/en-us/articles/intel-array-building-blocks/</u>
- See "A SIMD Interpreter For Genetic Programming on Graphics Cards" by W. Langdon and W. Banzhaf



### Shader programming

- As the name suggests, this is really about graphics
- But it is possible to abuse for computations
- See "Linear Genetic Programming GPGPU on Microsoft's Xbox 360" by G. Wilson and W. Banzhaf for an example using HLSL

| Pierre Collet            |      |       |   | Simon Harding |
|--------------------------|------|-------|---|---------------|
| UNIVERSITÉ DE STRASBOURG | CITS | ĹŜiiT | B | IDSIA         |

### MS Accelerator

- ♦ Research platform
- <u>http://research.microsoft.com/en-us/projects/accelerator/</u>
- .net library for vector maths







### **GPU.NET**

- Commercial product from TidePowerd
- <u>http://www.tidepowerd.com/</u>
- Converts compiled .net code to device code
- See new paper tomorrow in the CIGPU session : "Implementing Cartesian Genetic Programming Classifiers on Graphics Processing Units using GPU.NET" by S. Harding and W. Banzhaf



### GPU.Net



### GPU.Net



### OpenCL

- http://www.khronos.org/opencl/
- Open standard for GPU and parallel programming
  - > Multivendor
  - > Diverse hardware (CPU, FPGA, GPU etc)
- See competition entry "CUDA and OpenCL-based asynchronous PSO" by Y. Nashed et al.



### OpenCL

- Still relatively unexplored by the community
- Wrappers exist for access from other languages
- Other tech uses it. See WebCL



### OpenCL \_kernel void vector\_add\_gpu (\_\_global const float\* src\_a, \_\_global const float\* src\_b, \_\_\_\_\_global float\* res, const int num) /\* get\_global\_id(0) returns the ID of the thread in execution. As many threads are launched at the same time, executing the same kernel, each one will receive a different ID, and consequently perform a different computation.\*/ const int idx = get global id(0); Now each work-item asks itself: "is my ID inside the vector's range?" If the answer is YES, the work-item performs the corresponding computation\*/ if (idx < num)</pre> res[idx] = src a[idx] + src b[idx]; Pierre Collet Simon Harding CNIS ĹŜijŢ http://opencl\_codepley

# What does the future hold? • Who knows? ♦ Hardware: Intel multiple cores, AMD's APUs • Software: TPL, New updates to CUDA, OpenCL Pierre Collet Simon Harding ĹŜijŢ CINIS

### Different kinds and levels of parallelism

- ◆ Low-level massive parallelism:
  - > GeForce GTX 580 contains 512 cores running in an SPMD model :
    - One single program (one context).
    - Cores grouped by 32, running in SIMD mode, but different bundles can run on different parts of the code.
  - > Several cards in one machine : MIMD model.
- + High-level parallelism:
  - > Several machines together (cluster of machines).
  - Several clusters together.



### EASEA: an evolutionary platform for massive parallel machines

- First version of EASEA (EAsy Specification of Evolutionary Algorithms) presented ad EA'99
- ◆ 2000-2003: EASEA is the programming language of the DREAM (Distributed Resource Evolutionary Algorithm Machine).
- ♦ 2008- EASEA is an evolutionary platform to automatically parallelize evolutionary algorithms on 4 levels of parallelism:
  - parallelization on one GPU card,
  - parallelization on several GPU cards,
  - parallelization on several heterogeneous machines
  - parallelization on several heterogeneous clusters of GPU machines.
- 2012 EASEA will run on the French Grid

### Pierre Collet





### 3 cards in the same PC = x3 speedup !



### EASEA : a language that can handle GPGPUs

- EASEA stands for Easy Specification of Evolutionary Algorithm.
- In order to code an evolutionary algorithm in EASEA, the user only needs to write (in C) the application-specific code, i.e. :
  - > Initialisation function (how to initialise the genome)
  - Evaluation function (dependent on the problem)
  - > Crossover operator (how to recombine 2 genomes)
  - > Mutation operator (how to mutate the genome).

\*EASEA is available on Sourceforge: http://sourceforge.net/projects/easea/ Pierre Collet Simon Harding 63



### Weierstrass initialisation and evaluation code











## Island parallelism



### High-level massive parallelism (island model)













### Island model on Weierstrass h=0.35 1000 dim





### Combining LL and HL massive parallelism

### • Zeolite hunt:

- Zeolites are porous crystalline structures with many applications in industry made of oxygen atoms around a silicon or aluminium atom.
- A replication of the unit structure allows to obtain the pores.



Zeolites are used for filtering, catalysis, energy storing, medicine, absorbing liquids, odours (cat litter), ...





GPU Islands speedup on zeolite problem

1, 5, 10 and 20 GPU machines
 Zeolite Hurt
 270



### EC is ready for EC !

- EC is a *generic* massively parallel optimization method that can exploit peta and exascale computing.
- Not rocket science anymore: EASEA regularly runs on 20 machines with 256 cores = 5000 cores on many different problems (cf. Zeolite problem)
- Developments: new algorithms must be coined for HUGE populations (100K to 1M individuals on one island).
- New practices must be developed (increasing mutation, dealing with heterogeneous machines, ALPS-like algorithms, ...)
- Current problem: finding large enough problems to get such machines to heat up.



# <section-header><section-header><section-header><list-item><list-item><list-item><list-item><list-item><list-item><list-item><list-item>



