Fast, Parallel Implementation of Particle Filter on GPU

(2)

Introduction

 Image processing, navigation, financial mathematics, etc.:

 Non-linear, non-Gaussian state-space

 Analytic solution: not possible

 Kalman filter suboptimal

 Particle filter

 Bootstrap filter: resampling

 Standard PF: high running time

(3)

Review – 1.

1) Parallel implementation:

 share ratio 25%

 Quality degradation

 CPU RNG

 Data transfer!

 Textures (2D)

(4)

Review – 2.

2) Resampling

 Spreading-narrowing

 N x P_i

 Stratified random: {10, 20, 50, 100, 200, 500}

 Metropolis resampler

 Pair-wise operations

– B iterations for each particle p

– Uniform choice of the other particle

– Weight-ratio-based refresh

 Parallelizable
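
A minimal Python sketch of the Metropolis resampler outlined above, as one possible reading of the bullets rather than the reviewed implementation. The function name `metropolis_resample` and the acceptance rule written as `u * w_k <= w_j` are assumptions made for illustration. Each particle runs B pair-wise comparisons against a uniformly chosen partner, so no global sum of weights is needed, which is what makes the scheme parallelizable.

```python
import random

def metropolis_resample(weights, B, rng):
    """Return one ancestor index per particle after B Metropolis iterations.

    Illustrative sketch: names and the exact acceptance rule are assumptions.
    """
    n = len(weights)
    ancestors = []
    for i in range(n):
        k = i  # every chain starts at its own particle
        for _ in range(B):
            j = rng.randrange(n)  # uniform choice of a candidate particle
            u = rng.random()
            # accept j with probability min(1, w_j / w_k), written without division
            if u * weights[k] <= weights[j]:
                k = j
        ancestors.append(k)
    return ancestors
```

Only weight ratios are used, so the weights do not have to be normalized first.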

(5)

Background – 1. HMM & PF

 Hidden Markov Model

 Particle filter:

 N particles / state

 System dynamics → fitness & weight

 "Weak" particles are filtered out

(6)

Background – 2. PF algorithm

 Steps / system-state:

1. Fitness calculation: density function of the hidden sequence noise, evaluated for actual observation ↔ observation based on the model

2. Resampling: normalized weights, uniform random draw

Estimation for x_t ← average of chosen particles

3. Iteration:

 based on model

 Init for state t+1
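
The three steps can be sketched in Python as follows. This is a hedged illustration, not the presented implementation: `pf_step`, `likelihood` and `propagate` are hypothetical names, and the model is passed in as two user-supplied functions because the slide does not fix one.

```python
import random

def pf_step(particles, y, likelihood, propagate, rng):
    """One bootstrap-filter step; returns (estimate for x_t, particles for t+1)."""
    n = len(particles)
    # 1. Fitness: weight of each particle given the actual observation y
    weights = [likelihood(y, x) for x in particles]
    total = sum(weights)
    if total > 0.0:
        weights = [w / total for w in weights]  # normalized weights
    else:
        weights = [1.0 / n] * n  # degenerate case: fall back to equal weights
    # 2. Resampling: uniform random draw against the normalized weights
    chosen = rng.choices(particles, weights=weights, k=n)
    estimate = sum(chosen) / n  # estimation for x_t
    # 3. Iteration: move the chosen particles with the model (init for t+1)
    next_particles = [propagate(x, rng) for x in chosen]
    return estimate, next_particles
```

Steps 1 and 3 touch each particle independently; only the draw in step 2 needs the weights of all particles.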

(7)

Background – 3. CPF

 Step 1 and 3: parallelizable

 Idea for resampling: CNN architecture

 Information exchange

 Parallelizable

• Runtime: decrease

• error: decrease

(8)

HW details – GPU

 Considerations:

 Memory: access time ↔ information share

 Warp conflict avoidance: 256 threads/block

(9)

Random number generation – 1.

Our principle: GPU shared memory

 NVIDIA Mersenne Twister

 Inappropriate distribution for low numbers

 Data file, seeds: 4096 different values

 LFSR

 Modified NVIDIA MT

 Parameters and masks: defined as published in original article

 Seed for recursion array based on: time, thread ID
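
For reference, a small Python sketch of an LFSR generator. The slides do not give the register width or taps, so the 16-bit Fibonacci register with taps 16, 14, 13, 11 (a well-known maximal-length choice) is an assumption, and `seed_for_thread` is a made-up mixing function that only illustrates the "time, thread ID" seeding idea.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one step."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

def lfsr16_uniform(state):
    """Return (next_state, u) with u in (0, 1)."""
    state = lfsr16_step(state)
    return state, state / 65536.0

def seed_for_thread(time_value, thread_id):
    """Hypothetical per-thread seed; the all-zero state must be avoided."""
    seed = (time_value ^ (thread_id * 0x9E37)) & 0xFFFF
    return seed if seed != 0 else 1
```

A nonzero start state cycles through all 65535 nonzero 16-bit values before repeating, so the zero state is the only one that has to be excluded when seeding.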

(10)

Random number generation – 2.

Compare results:

[Figure: two histograms, each with 60 bins on [0,1] for 1000 random numbers. Panels: MATLAB vs. NVIDIA SDK Mersenne Twister; MATLAB vs. modified NVIDIA SDK MT.]
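
A comparison of this kind (60 bins on [0,1], 1000 numbers) can be reproduced with a few lines of Python. This is only a sketch: Python's built-in `random` module (itself a Mersenne Twister) stands in for the generators compared on the slide, and the chi-square statistic is an added convenience, not something the slide reports.

```python
import random

def histogram(samples, bins=60):
    """Count samples from [0, 1] into equally wide bins."""
    counts = [0] * bins
    for u in samples:
        idx = min(int(u * bins), bins - 1)  # u == 1.0 goes into the last bin
        counts[idx] += 1
    return counts

def chi_square_uniform(counts):
    """Chi-square distance of the bin counts from a flat histogram."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

rng = random.Random(2024)
counts = histogram([rng.random() for _ in range(1000)])
stat = chi_square_uniform(counts)  # smaller means closer to uniform
```

A generator with an inappropriate distribution for low numbers would show up as over- or under-filled bins near 0 and a larger statistic.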

(11)

Technical details: CPF on GPU

 Restructure

 2D → 1D → ring

 Speed + information

 Shared memory

 Overlapping arrays
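
A hedged Python sketch of resampling on a ring with overlapping neighbourhoods. It assumes that each particle i resamples only from a window of r consecutive particles around it (its neighbourhood N_i), wrapping around at the ends; the function names and the exact window placement are illustrative assumptions, not the presented kernel.

```python
import random

def ring_neighbourhood(i, n, r):
    """Indices of the r particles around i on a ring of n particles."""
    half = r // 2
    return [(i + d) % n for d in range(-half, r - half)]

def local_resample(particles, weights, r, rng):
    """Each particle draws its replacement from its own ring neighbourhood."""
    n = len(particles)
    out = []
    for i in range(n):
        idx = ring_neighbourhood(i, n, r)
        total = sum(weights[j] for j in idx)  # local norming sum on N_i
        if total == 0.0:
            out.append(particles[i])  # no information nearby: keep the particle
            continue
        u = rng.random() * total
        acc = 0.0
        chosen = idx[-1]
        for j in idx:
            acc += weights[j]
            if u < acc:
                chosen = j
                break
        out.append(particles[chosen])
    return out
```

Because neighbouring windows overlap, a strong particle can spread along the ring over several time steps even though each draw only looks at r particles.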

(12)

System overview

1. Allocate memory in global memory; initialize particles; t = 1

2. While t < T:

– Main kernel

– Summing kernel

– Averaging kernel

– t += 1

Data: estimation array and particle array in global memory

(13)

Main kernel

1. Copy Y(t) to shared memory

2. Copy x_particles to shared memory

3. Calculate L_shared using Y(t) and x_shared(i)

4. W_shared: norming sums for each x_shared(i) on N_i

5. Reseed block

6. Generate a random number for each particle

7. Parallel resampling for each particle

8. Write resampled particles to global memory array

9. Iterate with random noise on x_shared

10. Copy particles from x_shared to x_particles

Memory use: the loads are global → shared, the write-backs are shared → global, and the remaining steps work in shared memory.

(14)

Measurements – 1.

 Benchmark model

 Widely used, non-linear and continuous state-space, linear tools not applicable

 24 (N,r) configurations, 1000 trajectories

 N: 512, 1024, 2048, 4096, 8192, 16384

 r = 32, 64, 128, 256

 Mean square error: estimation ↔ hidden sequence

 Time
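
The slide does not spell out the benchmark model. The Python sketch below assumes the non-linear growth model commonly used in the particle filter literature, with process noise variance 10 and observation noise variance 1; both the model and these values are assumptions for illustration. It generates one hidden trajectory with its observations and computes the mean square error between an estimate and the hidden sequence.

```python
import math
import random

def simulate(T, rng, q_var=10.0, r_var=1.0):
    """Simulate T steps of the assumed benchmark model; returns (hidden, observed)."""
    x = rng.gauss(0.0, math.sqrt(q_var))  # assumed initial state distribution
    xs, ys = [], []
    for t in range(1, T + 1):
        x = (0.5 * x + 25.0 * x / (1.0 + x * x)
             + 8.0 * math.cos(1.2 * t) + rng.gauss(0.0, math.sqrt(q_var)))
        y = x * x / 20.0 + rng.gauss(0.0, math.sqrt(r_var))
        xs.append(x)
        ys.append(y)
    return xs, ys

def mean_square_error(estimates, hidden):
    """Mean square error: estimation vs. hidden sequence."""
    return sum((e - h) ** 2 for e, h in zip(estimates, hidden)) / len(hidden)
```

Averaging this error over many simulated trajectories gives one number per (N, r) configuration.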

(15)

Measurements – results 1.

 LFSR RMSE: sigmoid type decrease

 MT:

 RMSE for small numbers better

 No significant improvement: quality of MT could still be improved

 Time: N_i sequential; multiprocessor schedule

(16)

Measurements – results 2.

CPF:

 Slight error difference, but a speedup

 GPU: 4096 to 16K particles: 150 to 282 ms

 CPF: 4096 particles: 441 ms

(17)

Thank you for your kind attention!

Submitted to:

EURASIP Journal on Advances in Signal Processing

Questions?
