# An Analytical Framework for High-Speed Hardware Particle Swarm Optimization

Issam Damaj<sup>a,∗</sup>, Mohammed El-Shafei<sup>b</sup>, Mohammed El-Abd<sup>c</sup>, Mehmet Emin Aydin<sup>d</sup>

<sup>a</sup>Electrical and Computer Engineering Department, American University of Kuwait, Salmiya, Kuwait, idamaj@auk.edu.kw

 $b$  Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada, m lshafe@encs.concordia.ca

 $c$ Electrical and Computer Engineering Department, American University of Kuwait, Salmiya, Kuwait, melabd@auk.edu.kw

<sup>d</sup>Computer Science and Creative Technologies Department, University of the West of England, Bristol, United Kingdom, Mehmet.Aydin@uwe.ac.uk

# Abstract

Engineering optimization techniques are computationally intensive and can challenge implementations on tightly-constrained embedded systems. Particle Swarm Optimization (PSO) is a well-known bio-inspired algorithm that is adopted in various applications, such as, transportation, robotics, energy, etc. In this paper, a high-speed PSO hardware processor is developed with focus on outperforming similar state-of-the-art implementations. In addition, the investigation comprises the development of an analytical framework that captures wide characteristics of optimization algorithm implementations, in hardware and software, using key simple and combined heterogeneous indicators. The framework proposes a combined Optimization Fitness Indicator that can classify the performance of *PSO* implementations when targeting different evaluation functions. The two targeted processing systems are a Field Programmable Gate Arrays for hardware implementations and a high-end multi-core computer for software implementations. The investigation confirms the successful development of a PSO processor with appealing performance characteristics that outperforms recently presented implementations. The proposed hardware implementation attains

Preprint submitted to Journal of ...  $December\ 6, 2019$ 

<sup>∗</sup>Corresponding Author

23300 improvement ratio of execution times with an elliptic evaluation function. In addition, an speedup of 1777 times is achieved with a Shifted Schwefels function. Indeed, the developed framework successfully classifies PSO implementations according to multiple and heterogeneous properties for a variety of benchmark functions.

Keywords: Particle Swarm Optimization, Hardware, Software, Performance Indicators, Analysis, Gate Arrays

## 1. Introduction

Optimization is an important concept in the engineering domain [\[1,](#page-36-0) [2,](#page-36-1) [3,](#page-36-2) [4\]](#page-36-3). Indeed, all engineering applications involve some sort of optimization in order to realize the final product. Optimization involves reducing cost, power consump-

<sup>5</sup> tion, time delay or increases the yield, profit, quality of solution, etc. As engineering problems became more challenging; with properties such as large size, discontinuity, non-differentiability, non-linearity, multi-objectiveness, and mixed variable types; it becomes essential to develop new optimization paradigms that can reach acceptable solutions in less time.

- <sup>10</sup> Meta-heuristics is a substantial and considerably important optimization paradigm that can be used to tackle nowadays engineering problems. A major class of meta-heuristics is Population-based algorithms, which update a population of solutions over a number of iterations until some stopping condition is satisfied. Population-based algorithms are categorized based on the inspiration
- <sup>15</sup> behind their population update scheme. These categories include Evolutionary Algorithms and Swarm Intelligence algorithms. Due to the heavy computational workload constituting of optimization with population-based algorithms, computational speed matters for solving many dynamic and real-time problems with Evolutionary Algorithms and Swarm Intelligence algorithms. For exam-
- <sup>20</sup> ple, radio resource scheduling for recent generations of wireless communication networks requires a decision to be made within nano-seconds level, which is not viable with conventional computing infrastructures [\[5,](#page-36-4) [6\]](#page-36-5). Further research has

been conducted over the past two decades in order to speed up the execution of such algorithms, whether by manipulating parameters, developing multiple/dis-<sup>25</sup> tributed versions or implementing them on specific hardware.

In this paper, we investigate the implementation of one Swarm Intelligence algorithm, namely Particle Swarm Optimization (PSO) [\[7,](#page-36-6) [8\]](#page-37-0), on Field Programmable Gate Arrays ( $FPGAs$ ) [\[9,](#page-37-1) [10,](#page-37-2) [11,](#page-37-3) [12\]](#page-37-4). The investigation benefits from the advancements in  $FPGA$  capabilities and aims at developing high-speed hard-

- <sup>30</sup> ware cores. Such a hardware implementation enables embedding optimization algorithms in application to assist, or completely replace, a central processing component. The investigation is driven by the need for appealing performance characteristics, and the reliable operation within real-time applications, that hardware implementations can provide. Furthermore, the investigation com-
- <sup>35</sup> prises the development of a statistical analysis framework that captures wide characteristics of optimization computations. The proposed framework statistically combines the mathematical properties of optimization algorithms and their evaluation function complexities, besides, their implementations in HW and SW. To that end, the developed PSO implementations are employed to
- <sup>40</sup> validate the proposed framework and verify its effectiveness in application. Indeed, limited work in the literature is found to analyze the performance of PSO implementations based on heterogeneous properties as in the proposed analysis framework. The general contribution of the paper is summarized as follows; a detailed discussion on the motivation, contributions, and objectives is presented
- <sup>45</sup> in Section [2:](#page-3-0)
	- Develop efficient hardware cores for the PSO algorithm on FPGAs.
	- Develop an analytical framework that evaluates the fitness of optimization implementations based on heterogeneous performance indicators.

The rest of the paper is organized so that Section [2](#page-3-0) present the motivation, <sup>50</sup> research questions, and the paper contribution. In Section [3,](#page-5-0) a survey of the literature is presented. In Section [4,](#page-9-0) the processor design and implementation are presented. Section [5](#page-14-0) introduces the statistical analysis model. A thorough performance evaluation is presented in Section [6.](#page-20-0) Section [7](#page-33-0) concludes the paper and sets the ground for future work.

## <span id="page-3-0"></span><sup>55</sup> 2. Research Objectives

Challenges to optimization algorithms, including PSO, comprise performance characteristics (time, speed, efficiency, and complexity), storage requirements, reliability and accuracy, and first and foremost the dealing with the intrinsic sequential behavior of the algorithm. As related to PSO, the following research <sup>60</sup> opportunities are highlighted:

- Identify and investigate the performance aspects of  $SW$  implementations of PSO while targeting high-performance multi-core processors.
- Develop efficient hardware cores for PSO under FPGAs.
- Identify and investigate the performance aspects of the  $HW$  implementa-<sup>65</sup> tions.
	- Identify a set of performance indicators that aids the evaluation of  $HW$ and SW PSO implementations with different benchmark evaluation functions.
- Develop combined performance indicators for *PSO* that capture the qual-<sup>70</sup> itative and quantitative characteristics, and enables the classification of different implementations based-on combined HW, SW, and mathematical properties.
	- Identify the type of problems that can be efficiently optimized using *PSO* based on heterogeneous HW, SW, and algorithmic properties.
- <sup>75</sup> The proposed investigation has several research objectives. The investigation aims at developing a high-speed hardware core for PSO on FPGAs. To this end, the focus is on outperforming similar state-of-the-art implementations reported in the literature under the same implementation environment, algorithm specifications, and the targeted set of Benchmark Evaluation Functions (BEFs).
- <sup>80</sup> The hardware development includes the identification of effective implementation specifics and best practices. In addition, the investigation comprises the development of a statistical analysis model that captures wide characteristics of PSO implementations in HW and SW. Accordingly, the investigation includes

the development of high-speed SW implementations on high-end multi-core

- <sup>85</sup> processors. The analysis model comprises simple and combined performance indicators including a main indicator called the Optimization Fitness Indicator (OFI). The developed indicators analyze PSO implementations in terms of performance, hardware size, throughput, success rate, and combined forms. The aim of the OFI is to classify the performance of PSO implementations
- <sup>90</sup> when targeting different *BEFs*. Therefore, the type of problems that can be best solved using PSO can be identified. The evaluation confirms the successful implementation of an FPGA core for the PSO with accelerated processing throughput for different BEFs. The developed combined indicators successfully classified implementations using a variety of benchmark functions according to
- <sup>95</sup> desired properties.

In relation to the similar work presented in Section [3,](#page-5-0) the proposed developments enable the following comparisons:

- Compare the performance of partitioned versus nonpartitioned FPGA implementations. The comparison is done between the proposed all-in-<sup>100</sup> hardware FPGA core with the partitioned implementation in [\[13\]](#page-37-5). Given that the same implementation environment, algorithm specifications, and the targeted set of BEFs are adopted in our investigation.
- Compare the performance of the proposed sequential core with parallel hardware versions of the *PSO*. The comparison includes the parallel-*PSO* <sup>105</sup> version in [\[14\]](#page-37-6) at a swarm population size of 8 particles. Besides, presenting a wider analysis than that presented in [\[14\]](#page-37-6) to include populations of 16 and 32 particles. In addition, the comparison includes the multi-swarm parallel hardware implementation in [\[15\]](#page-37-7).
- Present classifications of *PSO* implementations based on heterogeneous <sup>110</sup> performance characteristics. The classifications enable straightforward identifications of the type of problems that can be best solved using a PSO implementation. Such straightforward identifications are proposed to replace tedious, intuitive, and multifaceted evaluations based on several single indicators—as usually adopted in the literature [\[16,](#page-38-0) [17,](#page-38-1) [18,](#page-38-2) [19,](#page-38-3) [20,](#page-38-4) <sup>115</sup> [15,](#page-37-7) [14,](#page-37-6) [20,](#page-38-4) [21,](#page-38-5) [22,](#page-39-0) [13\]](#page-37-5).

## <span id="page-5-0"></span>3. Related Work

The emergence of high-performance FPGAs enabled their use in computationallyintensive applications, such as optimization. Many recent investigations are identified in the literature to target implementing the PSO algorithm on FP-

<sup>120</sup> GAs. In this section, state-of-the-art related work is identified and presented, while closely-related work is thoroughly addressed in Section [6.2.](#page-23-0) In presenting similar work, we focus on a variety of implementation-specific aspects, such as, target devices, the used metrics to analyze the attained performance, and the main achievement of each investigation. Table [1](#page-8-0) accompanies the discussion and <sup>125</sup> provides a summary of findings.

Several attempts to achieve high processing speeds are presented in the literature  $[19, 14, 15, 20, 23]$  $[19, 14, 15, 20, 23]$  $[19, 14, 15, 20, 23]$  $[19, 14, 15, 20, 23]$  $[19, 14, 15, 20, 23]$ . Calazan et al.  $[14]$  propose a HW implementation of a parallel version of the PSO to run on a Xilinx Virtex-6 as a co-processor. The approach aims to improve the execution of PSO so that the performance

- <sup>130</sup> can be optimized for solving benchmark problems. The results demonstrate a significant speedup achieved through solving benchmark functions. In addition, Tewlode et al. [\[23,](#page-39-1) [20,](#page-38-4) [15\]](#page-37-7) develop a direct HW implementation of PSO to solve the problem of emission source localization within the context of environment monitoring in [\[23\]](#page-39-1), and extended with two more numerical benchmark functions
- <sup>135</sup> in [\[20\]](#page-38-4) to achieve significant speedups. The speedup is studied with respect to execution time as well as the number of cycles, where the level of achievement is clearly indicated. The authors published further details of the study in [\[15\]](#page-37-7), where parallel HW implementations under Xilinx Spartan 3E FPGA is compared to MicroBlaze-based SW implementations.
- <sup>140</sup> Improving accuracy using hardware implementations is the target of different investigations. Karakuzu et al. [\[17\]](#page-38-1) present an implementation of neuro-fuzzy system trained with *PSO* to improve the learning and prediction performances. The implementation provides efficiency with avoiding some steps through the process in comparison to using look-up tables (LUTs). The proposed approach <sup>145</sup> and implementation, using Xilinx Virtex-5, is tested with two scenario-based

benchmark system identifications and a realistic number-plate recognition. The implemented neuro-fuzzy system model was trained with swarm intelligence algorithms including ABC and ordinary PSO alongside the proposed PSO approach. The proposed HW implementation achieves the expected level of solu-

- $150$  tion quality using significantly less HW resources. Furthermore, Li et al. [\[22\]](#page-39-0) introduce a  $HW/SW$  co-design approach for *PSO* based-on *SOPC* technique and pipeline design, where the algorithm is partitioned between SW and HW to possibly improve the performance when solving various numerical benchmark functions. The algorithm is partitioned so that the fitness evaluation is devel-
- 155 oped in SW, while particle updating is implemented in  $HW$  and referred to as Particle Updating Accelerator. The proposed implementation is created to run the SW part on a Nios II CPU, while the HW part to run on a Cyclone II FPGA. The results demonstrate that a speedup of 20 times is achieved over a SW implementation.
- 

<sup>160</sup> A variety of investigations, on performance evaluation of PSO implementations, is presented in the literature. Common investigations include comparing the performance of Genetic Algorithms and PSO, optimizing HW /SW partitioned implementations, and integrating hardware PSO in engineering applications. Ben Ameur and Sakly [\[16\]](#page-38-0) present parallel HW implementations for

- <sup>165</sup> PSO and Genetic Algorithms. The implementations are formulated with finite state machines and target a Xilinx Spartan-3 FPGA. The implementations are tested with a number of benchmark functions to compare PSO with Genetic Algorithms; the reported results are in favor of the PSO algorithm. In addi-tion, Abdelhalim et al. [\[24\]](#page-39-2) present constrained and unconstrained  $HW/SW$
- $170$  partitioning problem with SW-based PSO implementations. Optimizing the intended system is done using various heuristic algorithms including Genetic Algorithms and PSO. The results and comparative analysis suggest that the proposed PSO, so-called Re-exited PSO, help achieving near-optimum results. The Re-exited PSO algorithm is used in [\[18\]](#page-38-2) to solve partitioning problems
- $175$  for the *JPEG* encoding system introduced by [\[25\]](#page-39-3). The developed embedded system is implemented under a *Cyclone FPGA* as the co-processor of a main

Nios-II CPU. The results are compared with findings of [\[25\]](#page-39-3) to demonstrate the success of the proposed approach. The work presented in [\[26\]](#page-39-4) proposes different approaches to implement PSO on FPGAs. To this end, the investigation tar-

- 180 gets an FPGA and an ARM processor in a Xilinx Zynq-7000 system on chip. The analysis includes testing three benchmark functions. Moreover, Ettouil et al. present a PSO implementation that targets an extended set of evaluation functions [\[27\]](#page-39-5); the investigation includes thorough analyses of hardware area utilization and the attained clock period. Applications of hardware PSO include
- <sup>185</sup> training neural networks [\[28\]](#page-39-6), implementing Multiple-Input Multiple-Output (MIMO) detection systems [\[29\]](#page-40-0), to name but a few.

|                                    | Main Achievement | including hardware speedup per<br>Improving multiple indicators<br>HW/SW implementation | Small-sized hardware area     | Improved accuracy                 | High processing speed | High processing speed         | High processing speed              | Optimizing performance per<br>HW/SW partitions                             | Improved accuracy                         | High processing speed                           | High processing speed                           | Optimizing result quality per<br>HW/SW Partition                        |
|------------------------------------|------------------|-----------------------------------------------------------------------------------------|-------------------------------|-----------------------------------|-----------------------|-------------------------------|------------------------------------|----------------------------------------------------------------------------|-------------------------------------------|-------------------------------------------------|-------------------------------------------------|-------------------------------------------------------------------------|
| Table 1: Summary of related works. | Metrics          | Hardware speedup; area utilization;<br>execution time; period                           | Area                          | Precision; accuracy of prediction | Execution time        | Execution time; speedup; area | No. of iterations; execution time  | communication time between HW<br>Area; delay; power consumption;<br>and SW | Execution time; and no. of<br>evaluations | No. iterations; execution time; clock<br>cycles | No. Iterations, execution time; clock<br>cycles | captures result quality; delay; power<br>A developed cost equation that |
|                                    | Hardware         | FPGA ZedBoard<br>ARM and Xilinx                                                         | Xilinx Spartran-3             | Xilinx Virtex-5                   | Xilinx Virtex-5       | Xilinx Virtex-6               | Xilinx Spartran-3                  | Altera Cyclone                                                             | Altera Cyclone                            | Xilinx Spartran-3                               | Xilinx Spartran-3                               | N/A                                                                     |
|                                    | Implementation   | PSO; HW and SW                                                                          | PSO and Genetic<br>Algorithms | Neuro-Fuzzy system<br>with $PSO$  | PSO                   | Parallel PSO                  | Parallel PSO                       | Re-exited PSO                                                              | other components<br>Core of $PSO$ and     | PSO <sub></sub>                                 | PSO <sub></sub>                                 | PSO: HW/SW<br>partitioning                                              |
|                                    | Year             | 2018                                                                                    | 2017                          | 2016                              | 2014                  | 2014                          | 2012                               | 2011                                                                       | 2011                                      | 2009                                            | 2008                                            | 2007                                                                    |
|                                    | Ref.             | $[26]$                                                                                  | $\frac{16}{1}$                | $[17]$                            | $\overline{19}$       |                               | $\begin{bmatrix} 15 \end{bmatrix}$ | $[18]$                                                                     | $[22]$                                    | [20]                                            | $[23]$                                          | $[24]$                                                                  |

<span id="page-8-0"></span>Table 1: Summary of related works.

## <span id="page-9-0"></span>4. Hardware Design

An informal and systematic approach is adopted to develop hardware cores for the PSO algorithm [\[1,](#page-36-0) [30\]](#page-40-1). The approach is informal in the sense that it <sup>190</sup> does not rely on engineering formal methods [\[31\]](#page-40-2). In addition, the approach is systematic in the sense that its procedure can be reused to develop similar hardware solutions, however, the method is not yet automatic and does not include any code generations, compilations, or rapid prototyping of hardware circuits. Furthermore, the methodology is unified in the sense that it uses <sup>195</sup> common software engineering techniques, such as flowcharts and state machines, to model the algorithm; accordingly, HW and SW designs are derived and implemented.

The development steps of the  $HW$  and  $SW$  implementations of the  $PSO$  are as follows:

- <sup>200</sup> 1. Depict the algorithm using flowcharts.
	- 2. Develop the software version.
	- 3. Design the processor Datapath by identifying, allocating, and binding hardware resources at the register-transfer level [\[32\]](#page-40-3).
- 
- 4. Develop the Finite State Machine (FSM ) of the control unit based on the <sup>205</sup> flowchart.
	- 5. Describe the developed hardware using a description language and synthesize the implementation for FPGAs.

The approach can be used to design partitioned hardware and software implementations of the PSO algorithm. However, the current use aims at devel- $_{210}$  oping independent hardware implementation for FPGAs. The adopted PSO is described in Algorithm [1](#page-10-0) in pseudocode.

Figure [1](#page-11-0) depicts the abstract behavior of the PSO algorithm. The main actions in the algorithm are (i) generate random numbers, (ii) initialize particles positions and velocities, (iii) evaluate a benchmark function to calculate the

<sup>215</sup> fitness value of a particle, (iv) update the particle's record with the best fitness value attained yet, and (v) update the swarm's global record with the best fitness value reached yet. At this point, update the particle's velocity and position with

Algorithm 1 The adopted PSO algorithm

<span id="page-10-0"></span>1: Initialize particles 2: do 3: **for**  $i = 1$  to Number of Particles **do**<br>4: **if**  $f(x_i) \leq f(p_{best_i})$  **then**<br>5:  $p_{best_i} = x_i$ 4: if  $f(x_i) \leq f(p_{best_i})$  then 5:  $p_{best_i} = x_i$ <br>6: **if**  $f(x_i) \leq f(gbest)$  then 7:  $gbest = x_i$ <br>8: end if 9: end if  $\frac{10}{2}$ 10: **for**  $j_{ij} = 1$  to Number of Dimensions **do**<br>11:  $v_{ij}^{i+1} = wv_{ij}^{t} + c_1r_1(p_{best_{ij}}^i - x_{ij}^t) + c_2r_2(g_{best_{j}} - x_{ij}^t)$  (1) 12: if  $v_{ij}^{t+1} > v_{max}$  then 13:  $v_{ij}^{t+1} = v_{max}$ 14:  $\begin{array}{cc} \n\text{and if} \\
15: \n\end{array}$ 15: **if**  $v_{ij}^{t+1} < v_{min}$  then<br>
16:  $v_{ij}^{t+1} = v_{min}$ 16:  $v$ 17: **end if**<br>  $x_{i,j}^{t+1} = x_{i,j}^t + v_{i,j}^{t+1}$  (2)  $18:$ 19: if  $x_{ij}^{t+1} > x_{max}$  then 20:  $x_{ij}^{t+1} = x_{max}$ 21:  $v_{ij}^t = 0$ <br>
22: end if<br>
23: if  $x_{ij}^{t+1} < x_{min}$  then 24:  $x_{ij}^{t+1} = x_{min}$ 25:  $v_{ij}^t = 0$ <br>
26: end if  $27:$  end for  $28:$  end for end for 29: while maximum iterations or minimum error is not reached 30: 31: where, 32:  $c_1$  and  $c_2$  are the acceleration constants. 33:  $r_1$  and  $r_2$  are random numbers between 0 and 1.<br>34:  $x_i(t)$  is the position of the  $i^{th}$  particle at iteration t. 35:  $v_i(t)$  is the velocity of the i<sup>th</sup> particle at iteration t. 36:  $w$  is the inertia weight factor. 37:  $g_{best}$  is the best position among all particles.<br>38:  $p_{best_i}$  is the  $i_{th}$  particle best position. 39: Equation (1) is the velocity update of the  $i^{th}$  particle. 40: Equation (2) is for position update of the  $i^{th}$  particle.

11



<span id="page-11-0"></span>Figure 1: An abstract flowchart for the PSO algorithm



<span id="page-12-0"></span>Figure 2: The developed datapath for the PSO algorithm

newly calculated values plays the main role. The procedure is repeated until a target number of iterations or a minimum error value is reached.

220 Based on to the *PSO* algorithm description, the datapath development includes the allocation of several computational hardware resources (See Figure [2\)](#page-12-0). The main allocated hardware units are a Fitness unit,  $p_{best}$  Update unit,  $g_{best}$ Update unit, Velocity Update unit, Position Update unit, Random Number Generator  $(RNG)$ , and three delay registers for the particle position x. Sam-<sup>225</sup> ple internal organizations of PSO computational units are shown in Figures [3](#page-13-0) and [4.](#page-13-1) In Figure [3,](#page-13-0) the hardware structure is for the Velocity Update Unit,

- while Figure [4](#page-13-1) presents a Fitness Unit depicting the Rosenbrock function. In both figures, the computational components run in parallel. Moreover, Figure [2](#page-12-0) includes standard components, such as, registers and memory elements;
- <sup>230</sup> the units Position, Global, and Particle Updates are implemented using simple selection statements, variable assignments, or include a single operation. In the presented datapath, the Random Number Generator core is imported from the work presented in [\[33\]](#page-40-4).

The behavior of the processor control unit is described in the FSM shown <sup>235</sup> in Figure [5.](#page-14-1) States  $S_0$  through  $S_6$  are responsible for the position and velocity



<span id="page-13-0"></span>Figure 3: The hardware structure of the Velocity Update Unit



<span id="page-13-1"></span>Figure 4: The hardware structure of the Fitness Unit depicting a Rosenbrock function.



<span id="page-14-1"></span>Figure 5: The developed FSM of the PSO Algorithm and its correspondence to the flowchart steps.

initialization of particles, preparation for the random number generation, and the forwarding of the position values through the three delay registers. State  $S_7$ is the main execution procedure that includes benchmark function evaluation, best positions updates, and velocity and position updates. In addition, state  $S_7$ <sup>240</sup> repeats unit the stoppage criterion is satisfied.

# <span id="page-14-0"></span>5. Analytical Model Development

To present the proposed performance analysis model, we adopt the Generic Benchmark Model (GBM) of Damaj et al. [\[34\]](#page-40-5). The GBM comprises six elements that define the Goal, Inputs, Activities, Output, Outcomes, and the

- <sup>245</sup> desired Performance profile of the performance analysis framework. The model captures the relationships among the resources, implementation, mathematical formulation, and the obtained results. The Goal defines the aim of the analysis framework. Moreover, the Input identifies the algorithms under study, implementation environments, reference algorithm, performance metrics, etc. Fur-
- <sup>250</sup> thermore, Activities present the algorithm implementations and the obtained results. The Output is the formulation of the key indicators and development of their rubrics—if needed. The Outcomes are the formulations of the statistical assessment as combinations of the Output. In addition, the Performance is the application of the developed assessment framework to profile and classify <sup>255</sup> algorithms according to the obtained results.

# 5.1. Goal

Analyze the performance of a PSO implementations in HW and SW, and enable its comparison to similar work in the literature.

## 5.2. Input

<sup>260</sup> The input identifies the targeted algorithms, computing systems, and the performance metrics. The analysis model targets a set of benchmark evaluation functions of different complexities and characteristics (See Tables [2](#page-16-0) and [3\)](#page-16-1). Moreover, the targeted HW development board is the DE2-70 by Altera. The board has a Cyclone II FPGA with a total of 68,416 Logic Elements (LEs) <sub>265</sub> and a maximum frequency of 300 MHz. SW implementations are done on Dell

Precision T7500 with its dual quad-core Xeon processors and 24 GB of RAM.

The identified performance metrics of the PSO are classified into general algorithmic profile  $(GAP)$ , hardware profile  $(HWP)$ , and software profile  $(SWP)$ . The general algorithmic profile includes the complexity of the benchmark eval-

 $_{270}$  uation functions. The HWP includes the number of benchmark evaluations, resource utilization, maximum frequency, throughput, etc. Moreover, the SWP comprises the number of benchmark evaluations, throughput, execution time, etc.



<span id="page-16-0"></span>

<span id="page-16-1"></span>Table 3: Properties of the target Benchmark Evaluation Functions in terms of separability, scalability, and whether the function is unimodal or multimodal.

| <b>BEF</b> Index | Function                                  | Separable    | Scalable     | Unimodal |
|------------------|-------------------------------------------|--------------|--------------|----------|
| F1               | B2                                        | $_{\rm Yes}$ | No           | No       |
| F2               | <b>Branin</b>                             | No           | No           | No       |
| F3               | Goldstein-Price                           | No           | No           | No       |
| F4               | Rosenbrock                                | No           | Yes          | No       |
| F5               | Zakharov                                  | Yes          | $_{\rm Yes}$ | Yes      |
| F6               | Sphere                                    | Yes          | $_{\rm Yes}$ | Yes      |
| F7               | Hartmann                                  | No.          | No           | No       |
| F8               | Variably-dimensioned                      | Yes          | Yes          | Yes      |
| F9               | Shifted Sphere                            | Yes          | Yes          | Yes      |
| F10              | Shifted Rosenbrock                        | Nο           | Yes          | Yes      |
| F11              | Shifted Schwefel 1.2                      | No           | Yes          | Yes      |
| F12              | Shifted Rastrigin                         | Yes          | Yes          | No       |
| F <sub>13</sub>  | Shifted Rotated High Conditioned Elliptic | No           | Yes          | Yes      |

#### 5.3. Activities

<sub>275</sub> The activities include hardware implementations under *VHDL*. The Software tools used for hardware implementation and profiling are *Quartus* and ModelSim. Software implementations are done under MATLAB.

#### 5.4. Output

The outputs of the analysis framework are three sets of indicators that cor- $_{280}$  respond to the proposed GAP, HWP, and SWP. The main KI of the GAP is the Benchmark Evaluation Function Complexity (BEFC ), which is defined as follows:

• Benchmark Evaluation Function Complexity (BEFC): an asymptotic complexity analysis using the Big-O, small- $\omega$ , and  $\Theta$  notations.

<sup>285</sup> To analyze the complexity of the evaluation functions, we study their asymptotic behavior. The asymptotic behavior classifies algorithms according to their rate of growth with respect to the increase in input size. The following standard complexity analysis classification is adopted from [\[35,](#page-41-0) [34\]](#page-40-5):

- 
- $O(f(n))$ : The rate of growth of an algorithm is asymptotically no worse 290 than the function  $f(n)$  but can be equal to it.
	- $\omega(f(n))$ : The rate of growth of an algorithm is asymptotically no better than the function  $f(n)$ .
	- $\Theta(f(n))$ : The rate of growth of an algorithm is asymptotically equal to the function  $f(n)$ .
- $295$  Here, *n* is the size of input.

To facilitate the assessment of the studied ciphers, we adopt the rubric from [\[34\]](#page-40-5) as shown in Table [4.](#page-18-0) In preparation for the statistical formulation, we map this qualitative properties onto quantities. For every point in the scale, we map

<sup>300</sup> it onto a fixed number. Hence, each point in the scale is mapped onto the values 20%, 40%, 60%, 80%, and 100% [\[34\]](#page-40-5).

The hardware profile includes the following indicators:

<span id="page-18-0"></span>

| General    |             |                       | Scale       |                    |               |
|------------|-------------|-----------------------|-------------|--------------------|---------------|
| Indicator  | Logarithmic | Logarithmic           | Linear      | Almost             | Higher than   |
|            | Low         | High                  |             | Quadratic          | Quadratic     |
| Complexity | O(logn)     | but<br>$\omega(logn)$ | $\Theta(n)$ | $O(n^2)$ but worse | $\omega(n^2)$ |
| Analysis   |             | than<br>better        |             | than Linear        |               |
|            |             | Linear                |             |                    |               |

Table 4: The rubric of the Complexity Analysis indicator

- Number of Benchmark Function Evaluations (NBFE): the total number of calls of the benchmark function; equals the number of iterations <sup>305</sup> times the total number of benchmark function calls per iteration. In this investigation, we report the average NBFE required for optimizing the benchmark functions.
	- Execution Time  $(ET)$ : the time between the start and the completion of a task.
- **Throughput**  $(TH)$ : the total amount of work done in a given time. In this investigation, we calculate  $TH$  as the *NBFE* divided by  $ET$ ; the results are represented in Kilo BFE per Seconds (KBFEps).
- Logic Elements  $(LE)$ : the number of combinational logic elements required to implement an algorithm in hardware. The number of  $LEs$  is an <sup>315</sup> indicator of the size of hardware in Altera devices.
	- Logic Register  $(LR)$ : the total number of logic registers in the design.

The Software profile includes three indicators, namely, NBFE, ET, and TH.

## 5.5. Outcomes

- <sup>320</sup> The Outcomes element is the formulation of Combined Measurement Indicators CMIs as function of KIs. Four CMIs are developed to analyze the performance of the PSO implementation, namely, Success Rate (SR), Performance Rate  $(PR)$ , Success Rate Density  $(SRD)$ , and the Optimization Fitness Indicator (*OFI*). The definitions of the  $SR$ ,  $PR$ , and  $SRD$  are
- <sup>325</sup> as follows:
	- SR: The percentage number of runs that successfully converges to the minimum which is below the specified error divided by the total number of runs.
	- PR: The average *NBFE* divided by the percentage *SR*.

# **• SRD:** The number of LEs divided by SR. SRD captures the size of hardware used per 1% SR.

The OFI is the main CMI calculation in the presented statistical analysis model. A higher OFI is achieved through a higher throughput, a lower execution time, with less resource utilization, and at a higher performance rate; while <sup>335</sup> targeting the evaluation function with a higher complexity. The combination of indicators is done using the Geometric Mean of KI ratios. The generic equation of CMIs from [\[34\]](#page-40-5) is as follows:

 $CMI = \sqrt[n]{ratio_1 \times ratio_2 \times ...ratio_n}$ 

340

Where 
$$
ratio_i = \frac{KI_{i,j}}{KI_{i,j}^{ref}}
$$

 $KI_{i,j}$  is the i<sup>th</sup> KI of the j<sup>th</sup> profile,  $i \in \{1..n\}$  and  $j \in \{1..2\},\$ 

and  $KI^{ref}_{i,j}$  is the reference measurement of the indicator  $KI_{i,j}$ 

To calculate a CMI, the Geometric Mean is used as it is able to measure the central tendency of data values that are obtained from ratios. The attraction for using the Geometric Mean is that its ratio is equal to the Geometric <sup>350</sup> Mean of performance ratios; which implies that when comparing two different implementations' performance, the choice of the reference implementation is irrelevant [\[34,](#page-40-5) [36\]](#page-41-1). In the current investigation, the reference measurements are considered as an evaluation function that attains an average performance as compared to the targeted BEFs. In other words, the reference measurement is <sub>355</sub> calculated as the arithmetic average of results achieved by the targeted *BEFs*.

The *OFI* enables the classification of *PSO* algorithm according to its fitness in application. The  $OFI$  is either directly or inversely proportional to the indicators. The master OFI formula, using the developed indicators, is shown in Equation [1.](#page-20-1) The indicators that are common to the Software  $(sw)$  and Hardware <span id="page-20-1"></span> $360$  (hw) profiles are labeled with the profile name.

$$
OFI = \sqrt[9]{GAP \cdot HWP \cdot SWP} \tag{1}
$$

where,

$$
GAP = \frac{BEFC}{BEFC_{ref}}\tag{2}
$$

$$
HWP = \frac{ET_{hw,ref}}{ET_{hw}} \cdot \frac{TH_{hw}}{TH_{hw,ref}} \cdot \frac{LE_{ref}}{LE} \cdot \frac{LR_{ref}}{LR} \cdot \frac{SR_{hw}}{SR_{hw,ref}} \tag{3}
$$

$$
SWP = \frac{ET_{sw,ref}}{ET_{sw}} \cdot \frac{TH_{sw}}{TH_{sw,ref}} \cdot \frac{SR_{sw}}{SR_{sw,ref}} \tag{4}
$$

<span id="page-20-2"></span>Besides the main OFI, two additional CMIs are proposed in Equations [5](#page-20-2) and [6](#page-20-3) to separately capture the optimization fitness of  $HW$  and  $SW$ :

$$
OFI_{hw} = \sqrt[6]{GAP \cdot HWP} \tag{5}
$$

<span id="page-20-3"></span>and,

$$
OFI_{sw} = \sqrt[4]{GAP \cdot SWP} \tag{6}
$$

## <sup>365</sup> 5.6. Performance

The analysis based on the *OFI Output* and *Outcomes* provides measurements for all KIs and enables the calculation of the defined CMIs. The results enable classifying the targeted evaluations. The six elements of the OFI are summarized in Figure [6.](#page-21-0)

## <span id="page-20-0"></span>370 6. Analysis and Evaluation

The analysis of the developed PSO implementations are produced to serve for several evaluation purposes. Important implementation aspects are pre-sented in Section [6.1.](#page-22-0) In addition, a set of thirteen BEFs, namely F1 through

#### Figure 6: The six elements diagram of the PSO performance analysis model

<span id="page-21-0"></span>

Table 5: PSO execution parameters.

<span id="page-22-1"></span>

| Parameter                    | <b>Values</b>        |
|------------------------------|----------------------|
| Particle Coding              | Binary               |
| No. of Bits of Variables     | 8                    |
| Population Size              | 8, 16, 32            |
| No. of Independent Runs      | 100                  |
| Maximal Function Evaluations | $10,000$ for $F1-F8$ |
|                              | 200,000 for F9-F13   |
| $c_1.r_1$                    | $0 - 2$              |
| $c_2.r_2$                    | $0 - 2$              |
| Inertia Weight $w$           | 0.25                 |
| Termination Error Threshold  | $< 10^{-4}$          |

 $F13$ , is targeted to perfectly match the test-cases of the work presented in [\[13\]](#page-37-5) 375 and enable comparisons with similar work. To that end, the HW analysis is presented in Section [6.2](#page-23-0) with comparisons to similar findings from closely-related work in the literature. A second set of six  $BEFs$ ; namely  $F1$ ,  $F3$ ,  $F4$ ,  $F5$ ,  $F6$ , and  $FS$ ; is targeted to compare the HW and SW implementations and evaluate the effectiveness of the proposed CMIs (See Sections [6.3\)](#page-29-0). The analysis is 380 done using the developed KIs and CMIs. The values of the execution parameters are shown in Table [5.](#page-22-1) In addition, Section [6.4](#page-32-0) presents a thorough general evaluation of achievements of the research objectives.

### <span id="page-22-0"></span>6.1. Implementation Aspects

A variety of tools is used to develop, validate, and analyze the hardware  $385$  implementations. The targeted FPGA is Cyclone II by Altera. The tools used for hardware syntehsis and analysis are Quartus and ModelSim. The hardware implementations are done under *VHDL*. The adopted *VHDL* style mixes structural and behavioral implementations. The units Fitness Module,  $p_{best}$ Update,  $g_{best}$  Update, Velocity Update, Position Update, RNG, and the three 390 delay registers are implemented as separate VHDL components. The RNG is implemented using neighborhood-of-four cellular automata for FPGAs [\[33\]](#page-40-4). In the proposed hardware implementation, the fixed-point package from [\[37\]](#page-41-2) is used to express floating-point numbers; the implementation varies in widths among the different computational entities. The *ieee\_proposed.fixed\_pkg.ALL* 

395 and ieee-proposed.math\_utility\_pkg.ALL VHDL libraries are employed, where the highest utilized width is of 64 bits and the highest adopted precision is with a fraction part of 9 bits. The adopted fixed-point representation provides a set of efficient operations that can replace an intrinsically complex floating-point alternative at a compromised but adequate accuracy per context.

## <span id="page-23-0"></span><sup>400</sup> 6.2. Hardware Performance Analysis

#### 6.2.1. Partitioned versus Nonpartitioned Hardware Implementations

Li et al. in [\[13\]](#page-37-5) adopt a  $HW/SW$  co-design approach to improve the execution performance of PSO for embedded applications. The investigation targets the DE2-70 board from Altera with its Cyclone II FPGA for HW implementa-

- <sup>405</sup> tions, and *Nios II* processor for SW implementations. The main features of the presented approach comprise partitioned HW and SW implementations of the PSO. The main proposed system components are a partitioned HW and SW evaluation, and a  $HW$  particle updating accelerator. The  $HW$  and  $SW$  subsystems communicate through the board interfaces and use an on-board SDRAM.
- $410$  Furthermore, the features include the design of a HW RNG. Experimental results demonstrate that the proposed HW design attains adequate efficiency and accuracy. The reported results, of the partitioned implementations, are compared with SW implementations.

In our investigation, exactly the same development board, FPGA, BEFs, <sup>415</sup> and the execution parameters of [\[13\]](#page-37-5) are targeted—to enable a sound comparison. However, we provide a fully-autonomous HW implementation on FPGAs and adopt a different RNG algorithm (See Section [4\)](#page-9-0). Tables [6](#page-24-0) and [7](#page-25-0) present the results of the developed  $HW$  implementation. Here, thirteen  $BEFs$  are targeted to enable the comparison with the work reported in [\[13\]](#page-37-5). Table [6](#page-24-0) shows

<sup>420</sup> that the shortest  $ET_{hw}$  of 2.31 msec is achieved when targeting  $F_4$ , and the highest  $TH_{hw}$  of 1777.48  $KBFEps$  is attained when targeting F11. The smallest attained hardware area is for targeting  $F<sub>4</sub>$  with 6055 LEs and 74 LRs. On the other hand, the longest  $ET_{hw}$  is taken by  $F13$  with a maximum value of 302.99 msec for a population size of 32. Moreover, the lowest  $TH_{hw}$  is attained

 $425$  by targeting F5 with a population size of 32. The largest HW areas are required by F13. The highest  $PR_{hw}$  of 3561.84 is reached when targeting F11; here, the smallest is attained by  $F2$  with a value of 1.42. To that end, the function that reached the highest SRD is F11 with a maximum of 1418.19 at a population size of 8.

<sup>430</sup> In this investigation, adopting the same *FPGA* board and execution parameters as in [\[13\]](#page-37-5) supports the conclusion that the achieved performance gains are due to the differences in the proposed HW system architectures. With no doubt, the communication cost between the partitioned  $HW$  and  $SW$  subsystems affects the overall performance as reported in [\[13\]](#page-37-5). In addition, the <sup>435</sup> communication cost increases with the increase in *NBFE*; such an increase in communication further degrades the performance of the developed system. A replacement all- $HW$  system is proposed by the authors that targets  $F9$ . The all-HW achieves an  $ET$  of 13.4673 ms with a performance improvement ratio of

<span id="page-24-0"></span>

12115 over their partitioned implementation and 4.29 over our implementation.



440 In Tables [8](#page-27-0) and [9,](#page-29-1) the  $SR_{hw}$ ,  $PR_{hw}$ , and  $SRD$  are shown. Targeting the evaluation functions  $F2$ ,  $F9$ ,  $F12$ , and  $F13$  enables the achievement of the best  $SR$  value of 100%; while F7, F8, and F11 achieve the smallest values with a

|                |                        | LE per Population Size |                         | LR per Population Size |      |      |
|----------------|------------------------|------------------------|-------------------------|------------------------|------|------|
| <b>BEF</b>     | 8                      | 16                     | 32                      | 8                      | 16   | 32   |
| F1             | $10340_{15.1\%}$       | $11534_{16.9\%}$       | $12770_{18.7\%}$        | 239                    | 278  | 309  |
| F <sub>2</sub> | $9753_{14.3\%}$        | $10882_{15.9\%}$       | 12048 <sub>17.6</sub> % | 217                    | 254  | 282  |
| F3             | 9948 <sub>14.5</sub> % | 1109816.2%             | $12288_{18\%}$          | 224                    | 262  | 291  |
| F4             | $6055_{8.9\%}$         | $6774_{9.9\%}$         | $7500_{11.0\%}$         | 74                     | 95   | 105  |
| F5             | $10943_{16.1\%}$       | $12203_{17.8\%}$       | $13512_{19.7\%}$        | 262                    | 304  | 338  |
| F6             | $12486_{18.3\%}$       | $13917_{20.3\%}$       | 15409 <sub>22.5</sub> % | 322                    | 370  | 412  |
| F7             | $14171_{20.7\%}$       | $15789_{23.1\%}$       | 17482 <sub>25.6%</sub>  | 387                    | 443  | 493  |
| F8             | $15756_{23.1\%}$       | $17549_{25.7\%}$       | 19432 <sub>28.4</sub> % | 448                    | 510  | 569  |
| F9             | $42295_{61.8\%}$       | 47022 $_{68.7\%}$      | $52075_{76.1\%}$        | 1472                   | 1647 | 1838 |
| F10            | 4560866.7%             | $50702_{74.1\%}$       | $56150_{81.1\%}$        | 1599                   | 1788 | 1996 |
| F11            | 4396464.3%             | 4887671.4%             | 5412779.1%              | 1536                   | 1718 | 1918 |
| F12            | $47056_{68.8\%}$       | 52310 <sub>76.5%</sub> | $57931_{84.7\%}$        | 1655                   | 1850 | 2065 |
| <b>F13</b>     | $48725_{71.2\%}$       | 5416479.2%             | 59984 <sub>87.7%</sub>  | 1719                   | 1921 | 2145 |

<span id="page-25-0"></span>Table 7: Hardware Profile: LE with Percent Hardware Utilization and LR. The Percent Hardware Utilization is calculated as the ratio of measured LEs divided by 68416, which is the total number of LEs available in the target FPGA. BEFs F1 through F13 are included to enable the comparison with the work presented in [\[13\]](#page-37-5)

minimum of  $24\%$  for F8 with a population size of 8. Indeed, the results show that the best  $SR_{hw}$  and  $PR_{hw}$  are achieved, in most cases, at a population size 445 of 32. Furthermore, the evaluation function  $F\ddot{A}$  achieves the lowest  $SRD$  score; this reflects the smallest hardware area utilization per percent success among all populations.

The results achieved by the proposed hardware implementation show significant improvements, in terms of the measured indicators, over the work reported

450 in [\[13\]](#page-37-5). BEFs F9 through F13 achieve the best  $ET_{hw}$  improvement ratios that reached 23300, 11233, and 5478 times better than the work reported in [\[13\]](#page-37-5) for F13 (See Figure [7\)](#page-26-0). In terms of  $TH_{hw}$ , functions  $F_4$  and F9 through F13 significantly outperform the implementation in [\[13\]](#page-37-5) with a maximum speedup of 1777 for F11 (See Figure [8\)](#page-26-1). The best  $SR$  improvement ratio is achieved when 455 targeting  $F11$  with a maximum of 195 (See Figure [9\)](#page-28-0).

# 6.2.2. Sequential versus Parallel Hardware Implementations

In [\[14\]](#page-37-6), Calazan et al. present a parallel co-processor for *PSO* on *FPGAs*. In the proposed implementation, all-particles computations are executed simultaneously until finding the  $g_{best}$ . To well-synchronize the computations, and



<span id="page-26-0"></span>Figure 7: ETs performance improvement ratio of the achieved execution time over the results reported in [\[13\]](#page-37-5).

<span id="page-26-1"></span>Figure 8: TH speedup ratio of the achieved throughput over the results reported in [\[13\]](#page-37-5).



|                |     | $SR_{hw}\%$ per Population Size |     |         | $PR_{hw}$ per Population Size |         |
|----------------|-----|---------------------------------|-----|---------|-------------------------------|---------|
| BEF            | 8   | 16                              | 32  | 8       | 16                            | 32      |
| F1             | 84  | 98                              | 96  | 1.48    | 1.73                          | 2.82    |
| F <sub>2</sub> | 90  | 91                              | 100 | 1.42    | 3.02                          | 3.18    |
| F3             | 77  | 89                              | 95  | 2.57    | 3.13                          | 3.24    |
| F <sub>4</sub> | 56  | 83                              | 95  | 19.7    | 13.46                         | 10.05   |
| F5             | 92  | 95                              | 89  | 1.53    | 1.77                          | 3.60    |
| F6             | 76  | 93                              | 90  | 3.08    | 3.25                          | 5.07    |
| $_{\rm F7}$    | 45  | 77                              | 87  | 5.31    | 4.42                          | 6.05    |
| F8             | 24  | 81                              | 93  | 38      | 15.64                         | 18.29   |
| F9             | 100 | 100                             | 100 | 526.41  | 542.20                        | 558.47  |
| F10            | 62  | 69                              | 83  | 539.47  | 499.28                        | 427.52  |
| F11            | 31  | 45                              | 59  | 3561.84 | 2527.33                       | 1985.46 |
| F12            | 100 | 100                             | 100 | 350.49  | 361.00                        | 371.83  |
| F13            | 100 | 100                             | 100 | 271.29  | 279.43                        | 287.81  |

<span id="page-27-0"></span>Table 8: Hardware CMIs:  $SR_{hw}$  and  $PR_{hw}$ ; BEFs F1 through F13 are included to enable the comparison with the work presented in [\[13\]](#page-37-5).

- $460$  prevent racing in the values of  $g_{best}$ , the velocity and position computations can only start once  $g_{best}$  is identified among all particles. The execution results are obtained under *Xilinx MicroBlaze* and a high-end *Virtex 6 FPGA*. In addition, the execution parameters are comparable to the ones adopted in this investigation; with the exception to that the authors did not specify the termi-
- <sup>465</sup> nation error threshold of the stoppage criterion. For a population size of 8, the reported ETs attain performance ratios that are between 2.4 and 67.27 times better than those achieved by the hardware implementation proposed in this investigation. However, the proposed implementation outperforms the reported ET under the Xilinx MicroBlaze; the performance improvement is between 3 to
- 470 85 times (See Figure [10\)](#page-28-1). Virtex 6 is a high-end FPGA by Xilinx, while Cyclone II is presented as a low-end FPGA by Altera.

In [\[23\]](#page-39-1), Tewlode et al. present  $HW$  architectures that significantly accelerates execution performance of PSO over SW implementations. The proposed HW implementation targets a Xilinx Spartan 3E FPGA, while SW implemen-

475 tations are done using a Xilinx MicroBlaze and a Freescale MC9S12DP256B microcontroller. The authors also present a multi-swarm parallel HW implementation. The parallel implementation achieves a maximum speedup of 3.89 for  $F_4$  and 6.96 for  $F_9$  over the sequential HW implementation—with 30 parti-



<span id="page-28-0"></span>Figure 9: SR improvement ratio of the achieved success rates over the results reported in [\[13\]](#page-37-5).

<span id="page-28-1"></span>Figure 10: Improvement ratios of the achieved ET in HW as compared with the results reported in [\[14\]](#page-37-6). The results achieved by the Virtex 6 FPGA are represented in the negative range as they outperform the *Cyclone II* implementation proposed in this investigation.



|            | <b>SRD</b> per Population Size |         |        |  |  |  |  |  |
|------------|--------------------------------|---------|--------|--|--|--|--|--|
| <b>BEF</b> | 8                              | 16      | 32     |  |  |  |  |  |
| F1         | 123.10                         | 117.69  | 133.02 |  |  |  |  |  |
| F2         | 108.37                         | 119.58  | 120.48 |  |  |  |  |  |
| F3         | 129.19                         | 124.70  | 129.35 |  |  |  |  |  |
| F4         | 108.13                         | 81.61   | 78.95  |  |  |  |  |  |
| F5         | 118.95                         | 128.45  | 151.82 |  |  |  |  |  |
| F6         | 164.29                         | 149.65  | 171.21 |  |  |  |  |  |
| F7         | 314.91                         | 205.05  | 200.94 |  |  |  |  |  |
| F8         | 656.50                         | 216.65  | 208.95 |  |  |  |  |  |
| F9         | 422.95                         | 470.22  | 520.75 |  |  |  |  |  |
| F10        | 735.61                         | 734.81  | 676.51 |  |  |  |  |  |
| F11        | 1418.19                        | 1086.13 | 917.41 |  |  |  |  |  |
| F12        | 470.56                         | 523.10  | 579.31 |  |  |  |  |  |
| F13        | 487.25                         | 541.64  | 599.84 |  |  |  |  |  |

<span id="page-29-1"></span>Table 9: Hardware CMIs: SRD; the BEFs F1 through F13 are included to enable the comparison with the work presented in [\[13\]](#page-37-5).

cles divided among 5 sub-swarms. However, the HW implementation presented 480 in this paper outperforms the sequential core in [\[23\]](#page-39-1) by 30.34 times for  $F\ddot{4}$  and 1.59 times for F9.

### <span id="page-29-0"></span>6.3. Analysis of Combined Indicators

To enable the calculation of OFI as a main combined indicator, Tables [10](#page-30-0) and [11](#page-30-1) present the implementation results of the general algorithmic and SW 485 profiles. Table [10](#page-30-0) presents the the complexity of the targeted BEFs that are needed for the OFI calculations and comparisons. Table [11](#page-30-1) presents the KI re-sults for the SW implementation. Moreover, Table [12](#page-32-1) shows the measurements of the developed CMIs for the SW implementations. The calculated results of  $OFI_{hw}$ ,  $OFI_{sw}$ , and  $OFI$  are shown in Figures [11,](#page-30-2) [12,](#page-31-0) and [13.](#page-31-1) The results 490 confirm that  $F\mathcal{L}$  attains the best  $OFI_{hw}$ ,  $OFI_{sw}$ , and  $OFI$  ranking with values

of 3.04, 3.15, and 3.04—for the three different population sizes of 8, 16, and 13. Accordingly, the proposed PSO HW and SW implementations can best perform when targeting the types of problems that are similar to F4.

Closely-related work in the literature, including [\[16,](#page-38-0) [17,](#page-38-1) [18,](#page-38-2) [19,](#page-38-3) [20,](#page-38-4) [15,](#page-37-7) [14,](#page-37-6)  $495$  [20,](#page-38-4) [21,](#page-38-5) [22,](#page-39-0) 13, relies on an almost identical patterns of simple KIs to analyze and evaluate their implementations. The adopted analysis patterns can success-

<span id="page-30-0"></span>

| <b>BEF</b>     | <b>BEFC</b> | Mapped BEFC |
|----------------|-------------|-------------|
| F1             | АQ          | 0.8         |
| F3             | AQ.         | 0.8         |
| F <sub>4</sub> | НQ          | 1           |
| F5             | HQ          | 1           |
| F6             | AQ          | 0.8         |
| F8             |             | 0.8         |

Table 10: General Algorithmic Profile

Table 11: Software Profile: ET and TH.

<span id="page-30-1"></span>

|                |        | $ET(msec)$ per Population Size |       | $TH_{(KBFEps)}$ per Population Size |        |        |  |
|----------------|--------|--------------------------------|-------|-------------------------------------|--------|--------|--|
| <b>BEF</b>     | 8      | 16                             | 32    | 8                                   | 16     | 32     |  |
| F1             | 2.879  | 2.915                          | 2.651 | 94.98                               | 145.92 | 270.75 |  |
| F3             | 2.377  | 2.285                          | 1.96  | 85.07                               | 142.60 | 265.31 |  |
| F <sub>4</sub> | 23.11  | 11.438                         | 6.301 | 97.54                               | 198.62 | 358.70 |  |
| F5             | 2.402  | 2.033                          | 1.845 | 60.88                               | 115.14 | 210.73 |  |
| ${\bf F6}$     | 2.752  | 2.09                           | 1.909 | 66.02                               | 129.15 | 241.38 |  |
| $_{\rm F8}$    | 106.86 | 8.015                          | 6.199 | 8.35                                | 144.75 | 253.87 |  |

Figure 11: The  $OFI_{hw}$  classification

<span id="page-30-2"></span>

<span id="page-31-0"></span>

Figure 12: The  $OFI_{sw}$  classification

Figure 13: The  $OFI$  classification

<span id="page-31-1"></span>

<span id="page-32-1"></span>

|             |     |     | $SR_{sw}\%$ per Population Size | $PR_{sw}$ per Population Size |       |       |  |
|-------------|-----|-----|---------------------------------|-------------------------------|-------|-------|--|
| <b>BEF</b>  | 8   | 16  | 32                              | 8                             | 16    | 32    |  |
| F1          | 94  | 100 | 100                             | 2.91                          | 4.25  | 7.18  |  |
| F3          | 87  | 99  | 100                             | 2.32                          | 3.29  | 5.20  |  |
| F4          | 71  | 100 | 100                             | 31.75                         | 22.72 | 22.60 |  |
| F5          | 100 | 100 | 100                             | 1.46                          | 2.34  | 3.89  |  |
| F6          | 100 | 100 | 100                             | 1.82                          | 2.70  | 4.61  |  |
| $_{\rm F8}$ | 60  | 100 | 100                             | 14.87                         | 11.60 | 15.74 |  |

Table 12: Software CMIs:  $SR_{sw}$  and  $PR_{sw}$ .

fully classify implementations with-respect-to a single or a limited set of KIs. For example, the investigations in [\[14,](#page-37-6) [20,](#page-38-4) [13\]](#page-37-5) successfully rank implementations per ET. However, combined indications and evaluative conclusions can hardly <sup>500</sup> be made due to the absence of formal representations of CMIs. For instance, it is not straightforward to identify the evaluation function that can best perform in terms of a compromise that includes  $ET$ , TH,  $SR$ , etc. Our investigation confirms that classifications of implementations based on heterogeneous properties is limitedly addressed in the literature. With no doubt, the identification of the <sup>505</sup> type of problem that can be effectively performed per algorithm implementation is an important evaluation. Such identifications can soundly identify suitable

## <span id="page-32-0"></span>6.4. General Evaluation

implementation options per application.

The current investigation can be evaluated as related to the developed hard-<sup>510</sup> ware processor, and analysis framework development and application. The developed processor fully maps the PSO algorithm onto an FPGA. As compared with similar work, the proposed processor attains higher-speeds than similar implementations in the literature. In addition, the results show that, when using partitioned HW and SW implementations on FPGA boards, communication  $515$  between the  $FPGA$  and other on-board components can significantly reduce the processing throughput.

The developed framework is unique in combining algorithmic, hardware, and software characteristics to provide unified performance evaluation criteria and useful performance indicators. The investigation proposes the creation of uni-

- <sup>520</sup> fied indicators that can capture specific qualities in terms of a wide range of heterogeneous key performance indicators, such as the OFI. The OFI serves as a master CMI while an indicator like the BEFC is developed with focus on evaluation function complexity. Indeed, the framework is scalable and upgradeable without changing the statistical computation or the structure of the
- <sup>525</sup> measurement. For instance, an additional profile can be incorporated into the calculations of the OFI to include the performance characteristics of Graphics Processing Units.

At the application level, the developed framework can be used to examine qualities of importance and interest to optimization specialists. For example,

- <sub>530</sub> the developed *OFI* successfully identifies the type of evaluation function that can be best solved by PSO and achieve the highest overall performance in terms of the identified KIs. The OFI enables classifying the PSO algorithm performance as targeting a heterogeneous set of evaluation functions. In addition, the framework produces a rich and comprehensive set of reference KIs. KIs, such
- $535$  as ET, TH, LE, and LR, are independent of the context of application and thus highly reusable. Other KIs, such as NBFE, SR, PR, and SRD are specific to optimization algorithms including the PSO.

The proposed framework aims at capturing the HW and SW properties. The current investigation doesn't include partitioned  $HW$  and  $SW$  implemen-

<sup>540</sup> tations. However, partitioned implementations can be analyzed, based on the proposed framework, by enabling measurements of KIs under the HW and SW subsystems. The proposed partitioned KI measurements can capture subsystems' characteristics. In addition, well-defined CMIs can classify different partitioning strategies per optimization target, such as, area, speed, power con-<sup>545</sup> sumption, etc.

# <span id="page-33-0"></span>7. Conclusion

Optimization is a key approach in engineering that enables effective solutions. PSO is a current and widely used heuristic; it is well-known for its effectiveness in application. In this paper, an on-chip PSO implementation is

<sub>550</sub> developed and mapped onto an FPGA. The developed processor significantly outperform the partitioned HW and SW implementations of [\[13\]](#page-37-5) that target the same development board, FPGA, BEFs, and the execution parameters. Our proposed HW implementation achieves an  $ET_{hw}$  improvement ratio of 23300 for F13, a  $TH_{hw}$  speedup of 1777 for F11, and the best SR improvement ra-

- $555$  tio of 195 for F11 over the results reported in [\[13\]](#page-37-5). This paper includes the development of a statistical framework that enables thorough analysis and evaluation of optimization algorithms, such as PSO. The proposed indicators include NBFE, BEFC, ET, TH, LE, LR, SR, PR, SRD,  $OFI_{hw}$ ,  $OFI_{sw}$ , and  $OFI$ . The analysis of results confirms that, when targeting  $F<sub>4</sub>$ , PSO achieves the highest
- <sub>560</sub> performance characteristics with the highest *OFI* value of 3.15. The presented framework enjoys being reusable in the wider optimization context. Future work includes the development of pipelined, parallel, and multi-swarm PSO processors. The statistical framework can be expanded to capture partitioned implementations and to consider additional processing systems, such as Graph-
- <sup>565</sup> ics Processing Units. Furthermore, the statistical framework can be applied to other metaheuristic algorithms for a wider study. The proposed framework is applicable outside the context of optimization [\[34,](#page-40-5) [38\]](#page-41-3).

# Appendix



### <span id="page-36-0"></span>References

- <sup>570</sup> [1] S. J. Kasbah, I. W. Damaj, R. A. Haraty, [Multigrid solvers in reconfigurable](http://dx.doi.org/10.1016/j.cam.2006.12.031) [hardware,](http://dx.doi.org/10.1016/j.cam.2006.12.031) J. Comput. Appl. Math. 213 (1) (2008) 79–94. [doi:10.1016/j.](http://dx.doi.org/10.1016/j.cam.2006.12.031) [cam.2006.12.031](http://dx.doi.org/10.1016/j.cam.2006.12.031). URL <http://dx.doi.org/10.1016/j.cam.2006.12.031>
	- [2] S. J. Kasbah, I. W. Damaj, The jacobi method in reconfigurable hardware.,

<span id="page-36-1"></span><sup>575</sup> in: World Congress on Engineering, 2007, pp. 823–827.

<span id="page-36-2"></span>[3] J.-S. Chou, A.-D. Pham, Nature-inspired metaheuristic optimization in least squares support vector regression for obtaining bridge scour information, Information Sciences 399 (2017) 64–80.

<span id="page-36-3"></span>[4] M. El-Abd, [Performance assessment of foraging algorithms vs. evo-](http://www.sciencedirect.com/science/article/pii/S0020025511004555)<sup>580</sup> [lutionary algorithms,](http://www.sciencedirect.com/science/article/pii/S0020025511004555) Information Sciences 182 (1) (2012) 243 – 263, nature-Inspired Collective Intelligence in Theory and Practice. [doi:https://doi.org/10.1016/j.ins.2011.09.005](http://dx.doi.org/https://doi.org/10.1016/j.ins.2011.09.005). URL [http://www.sciencedirect.com/science/article/pii/](http://www.sciencedirect.com/science/article/pii/S0020025511004555) [S0020025511004555](http://www.sciencedirect.com/science/article/pii/S0020025511004555)

<span id="page-36-4"></span><sup>585</sup> [5] M. E. Aydin, R. Kwan, J. Wu, [Multiuser scheduling on the lte downlink](http://www.sciencedirect.com/science/article/pii/S1874490712000134) [with meta-heuristic approaches,](http://www.sciencedirect.com/science/article/pii/S1874490712000134) Physical Communication 9 (2013) 257 – 265. [doi:https://doi.org/10.1016/j.phycom.2012.01.004](http://dx.doi.org/https://doi.org/10.1016/j.phycom.2012.01.004). URL [http://www.sciencedirect.com/science/article/pii/](http://www.sciencedirect.com/science/article/pii/S1874490712000134) [S1874490712000134](http://www.sciencedirect.com/science/article/pii/S1874490712000134)

<span id="page-36-5"></span><sup>590</sup> [6] M. E. Aydin, R. Kwan, C. Leung, C. Maple, J. Zhang, [A hy](http://www.sciencedirect.com/science/article/pii/S1568494611004911)[brid swarm intelligence algorithm for multiuser scheduling in](http://www.sciencedirect.com/science/article/pii/S1568494611004911) [hsdpa,](http://www.sciencedirect.com/science/article/pii/S1568494611004911) Applied Soft Computing 13 (5) (2013) 2990 – 2996. [doi:https://doi.org/10.1016/j.asoc.2011.12.007](http://dx.doi.org/https://doi.org/10.1016/j.asoc.2011.12.007). URL [http://www.sciencedirect.com/science/article/pii/](http://www.sciencedirect.com/science/article/pii/S1568494611004911)

- <span id="page-36-6"></span><sup>595</sup> [S1568494611004911](http://www.sciencedirect.com/science/article/pii/S1568494611004911)
	- [7] M. Clerc, Particle swarm optimization, Vol. 93, John Wiley & Sons, 2010.
- <span id="page-37-0"></span>[8] J. L. Awange, B. Paláncz, R. H. Lewis, L. Völgyesi, Particle swarm optimization, in: Mathematical Geosciences, Springer, 2018, pp. 167–184.
- <span id="page-37-2"></span><span id="page-37-1"></span>[9] I. Damaj, J. Hawkins, A. Abdallah, Mapping high-level algorithms onto <sup>600</sup> massively parallel reconfigurable hardware, in: IEEE International Conference of Computer Systems and Applications, 2003, pp. 14–22.
	- [10] S. M. Trimberger, Three ages of FPGAs: A retrospective on the first thirty years of fpga technology, Proceedings of the IEEE 103 (3) (2015) 318–331.
	- [11] J. de Fine Licht, M. Blott, T. Hoefler, Designing scalable FPGA architec-

<span id="page-37-3"></span><sup>605</sup> tures using high-level synthesis, in: Proceedings of the 23rd ACM SIG-PLAN Symposium on Principles and Practice of Parallel Programming, ACM, 2018, pp. 403–404.

- <span id="page-37-5"></span><span id="page-37-4"></span>[12] X. Zou, L. Wang, Y. Tang, Y. Liu, S. Zhan, F. Tao, Parallel design of intelligent optimization algorithm based on FPGA, The International Journal <sup>610</sup> of Advanced Manufacturing Technology (2018) 1–14.
	- [\[](http://www.sciencedirect.com/science/article/pii/S002002551000335X)13] S.-A. Li, C.-C. Hsu, C.-C. Wong, C.-J. Yu, [Hardware/software co-design](http://www.sciencedirect.com/science/article/pii/S002002551000335X) [for particle swarm optimization algorithm,](http://www.sciencedirect.com/science/article/pii/S002002551000335X) Information Sciences 181 (20) (2011) 4582 – 4596, special Issue on Interpretable Fuzzy Systems. [doi:https://doi.org/10.1016/j.ins.2010.07.017](http://dx.doi.org/https://doi.org/10.1016/j.ins.2010.07.017).
- <span id="page-37-6"></span><sup>615</sup> URL [http://www.sciencedirect.com/science/article/pii/](http://www.sciencedirect.com/science/article/pii/S002002551000335X) [S002002551000335X](http://www.sciencedirect.com/science/article/pii/S002002551000335X)
	- [\[](http://www.sciencedirect.com/science/article/pii/S1568494613000033)14] R. M. Calazan, N. Nedjah, L. M. Mourelle, [A hardware accelerator for](http://www.sciencedirect.com/science/article/pii/S1568494613000033) [particle swarm optimization,](http://www.sciencedirect.com/science/article/pii/S1568494613000033) Applied Soft Computing 14 (2014) 347 – 356. [doi:https://doi.org/10.1016/j.asoc.2012.12.034](http://dx.doi.org/https://doi.org/10.1016/j.asoc.2012.12.034).
- <span id="page-37-7"></span><sup>620</sup> URL [http://www.sciencedirect.com/science/article/pii/](http://www.sciencedirect.com/science/article/pii/S1568494613000033) [S1568494613000033](http://www.sciencedirect.com/science/article/pii/S1568494613000033)
	- [\[](http://www.sciencedirect.com/science/article/pii/S014193311200018X)15] G. S. Tewolde, D. M. Hanna, R. E. Haskell, [A modular and ef](http://www.sciencedirect.com/science/article/pii/S014193311200018X)[ficient hardware architecture for particle swarm optimization algo](http://www.sciencedirect.com/science/article/pii/S014193311200018X)[rithm,](http://www.sciencedirect.com/science/article/pii/S014193311200018X) Microprocessors and Microsystems 36 (4) (2012) 289 – 302.
- <sup>625</sup> [doi:https://doi.org/10.1016/j.micpro.2012.02.001](http://dx.doi.org/https://doi.org/10.1016/j.micpro.2012.02.001). URL [http://www.sciencedirect.com/science/article/pii/](http://www.sciencedirect.com/science/article/pii/S014193311200018X) [S014193311200018X](http://www.sciencedirect.com/science/article/pii/S014193311200018X)
- <span id="page-38-0"></span>[\[](http://dx.doi.org/10.14569/IJACSA.2016.070809)16] M. S. Ben Auemur, A. Sakly, [Fpga implementation of parallel particle](http://dx.doi.org/10.14569/IJACSA.2016.070809) [swarm optimization algorithm and compared with genetic algorithm,](http://dx.doi.org/10.14569/IJACSA.2016.070809) In-<sup>630</sup> ternational Journal of Advanced Computer Science and Applications 7 (8). [doi:10.14569/IJACSA.2016.070809](http://dx.doi.org/10.14569/IJACSA.2016.070809). URL <http://dx.doi.org/10.14569/IJACSA.2016.070809>
	- [\[](http://dx.doi.org/10.1016/j.neunet.2016.02.004)17] C. Karakuzu, F. Karakaya, M. A. Cavuşlu, [Fpga implementation of neuro](http://dx.doi.org/10.1016/j.neunet.2016.02.004)[fuzzy system with improved pso learning,](http://dx.doi.org/10.1016/j.neunet.2016.02.004) Neural Netw. 79 (C) (2016) 128–
- <span id="page-38-2"></span><span id="page-38-1"></span><sup>635</sup> 140. [doi:10.1016/j.neunet.2016.02.004](http://dx.doi.org/10.1016/j.neunet.2016.02.004). URL <http://dx.doi.org/10.1016/j.neunet.2016.02.004>
	- [\[](https://doi.org/10.1007/s10617-010-9068-9)18] M. B. Abdelhalim, S. E.-D. Habib, [An integrated high-level hardware/soft](https://doi.org/10.1007/s10617-010-9068-9)[ware partitioning methodology,](https://doi.org/10.1007/s10617-010-9068-9) Design Automation for Embedded Systems 15 (1) (2011) 19–50. [doi:10.1007/s10617-010-9068-9](http://dx.doi.org/10.1007/s10617-010-9068-9).
- <span id="page-38-3"></span><sup>640</sup> URL <https://doi.org/10.1007/s10617-010-9068-9>
	- [\[](https://doi.org/10.1007/978-3-319-03110-1_3)19] N. Nedjah, L. de Macedo Mourelle, [A Reconfigurable Hardware for Particle](https://doi.org/10.1007/978-3-319-03110-1_3) [Swarm Optimization,](https://doi.org/10.1007/978-3-319-03110-1_3) Springer International Publishing, Cham, 2014, pp. 29–42. [doi:10.1007/978-3-319-03110-1\\_3](http://dx.doi.org/10.1007/978-3-319-03110-1_3). URL [https://doi.org/10.1007/978-3-319-03110-1\\_3](https://doi.org/10.1007/978-3-319-03110-1_3)
- <span id="page-38-4"></span><sup>645</sup> [20] G. S. Tewolde, D. M. Hanna, R. E. Haskell, Accelerating the performance of particle swarm optimization for embedded applications, in: 2009 IEEE Congress on Evolutionary Computation, 2009, pp. 2294–2300. [doi:10.](http://dx.doi.org/10.1109/CEC.2009.4983226) [1109/CEC.2009.4983226](http://dx.doi.org/10.1109/CEC.2009.4983226).
- <span id="page-38-5"></span>[\[](https://doi.org/10.1007/s11390-017-1714-2)21] X.-H. Yan, F.-Z. He, Y.-L. Chen, [A novel hardware/software partitioning](https://doi.org/10.1007/s11390-017-1714-2) <sup>650</sup> [method based on position disturbed particle swarm optimization with in](https://doi.org/10.1007/s11390-017-1714-2)[vasive weed optimization,](https://doi.org/10.1007/s11390-017-1714-2) Journal of Computer Science and Technology 32 (2) (2017) 340–355. [doi:10.1007/s11390-017-1714-2](http://dx.doi.org/10.1007/s11390-017-1714-2). URL <https://doi.org/10.1007/s11390-017-1714-2>

<span id="page-39-0"></span>[22] S.-A. Li, C.-C. Wong, C.-J. Yu, C.-C. Hsu, Hardware/software co-design

- <sup>655</sup> for particle swarm optimization algorithm, in: 2010 IEEE International Conference on Systems, Man and Cybernetics, 2010, pp. 3762–3767. [doi:](http://dx.doi.org/10.1109/ICSMC.2010.5641826) [10.1109/ICSMC.2010.5641826](http://dx.doi.org/10.1109/ICSMC.2010.5641826).
	- [23] G. S. Tewolde, D. M. Hanna, R. E. Haskell, Hardware pso for sensor network applications, in: 2008 IEEE Swarm Intelligence Symposium, 2008,
- <span id="page-39-2"></span>
- <span id="page-39-1"></span><sup>660</sup> pp. 1–8. [doi:10.1109/SIS.2008.4668308](http://dx.doi.org/10.1109/SIS.2008.4668308).
	- [24] M. B. Abdelhalim, A. E. Salama, S. E.-D. Habib, Constrained and unconstrained hardware-software partitioning using particle swarm optimization technique, in: A. Rettberg, M. C. Zanella, R. Dömer, A. Gerstlauer, F. J. Rammig (Eds.), Embedded System Design: Topics, Techniques and Trends,
- <sup>665</sup> Springer US, Boston, MA, 2007, pp. 207–220.
- <span id="page-39-3"></span>[25] T.-Y. Lee, Y.-H. Fan, Y.-M. Cheng, C.-C. Tsai, R.-S. Hsiao, Enhancement of hardware-software partition for embedded multiprocessor fpga systems, in: Intelligent Information Hiding and Multimedia Signal Processing, 2007. IIHMSP 2007. Third International Conference on, Vol. 1, IEEE, 2007, pp. <sup>670</sup> 19–22.
	- [26] M. Ettouil, H. Smei, A. Jemai, Particle swarm optimization on fpga, in: 2018 30th International Conference on Microelectronics (ICM), IEEE, 2018, pp. 32–35.
- <span id="page-39-5"></span><span id="page-39-4"></span>[27] M. Ettouil, H. Smei, A. Jemai, M. Ghazel, Codesign of an iot using a <sup>675</sup> metaheuristic ip, in: 2018 International Conference on Internet of Things, Embedded Systems and Communications (IINTEC), IEEE, 2018, pp. 153– 157.
- <span id="page-39-6"></span>[28] T. L. Dang, Y. Hoshino, Hardware/software co-design for a neural network trained by particle swarm optimization algorithm, Neural Processing <sup>680</sup> Letters 49 (2) (2019) 481–505.
- 

- <span id="page-40-0"></span>[29] A. Trimeche, A. Sakly, A. Mtibaa, Implementation of pso algorithm for mimo detection system in fpga, International Journal of Electronics 105 (1) (2018) 42–57.
- <span id="page-40-1"></span>[\[](http://www.sciencedirect.com/science/article/pii/S1319157817301611)30] I. Damaj, M. Imdoukh, R. Zantout, [Parallel hardware for](http://www.sciencedirect.com/science/article/pii/S1319157817301611) <sup>685</sup> [faster morphological analysis,](http://www.sciencedirect.com/science/article/pii/S1319157817301611) Journal of King Saud University - Computer and Information Sciences 30 (4) (2018) 531 – 546. [doi:https://doi.org/10.1016/j.jksuci.2017.07.003](http://dx.doi.org/https://doi.org/10.1016/j.jksuci.2017.07.003). URL [http://www.sciencedirect.com/science/article/pii/](http://www.sciencedirect.com/science/article/pii/S1319157817301611) [S1319157817301611](http://www.sciencedirect.com/science/article/pii/S1319157817301611)
- <span id="page-40-3"></span><span id="page-40-2"></span><sup>690</sup> [31] I. W. Damaj, Parallel algorithms development for programmable logic devices, Advances in Engineering Software 37 (9) (2006) 561–582.
	- [\[](http://arxiv.org/abs/https://onlinelibrary.wiley.com/doi/pdf/10.1002/9780470050118.ecse177)32] I. Damaj, [High-Level Synthesis,](https://onlinelibrary.wiley.com/doi/abs/10.1002/9780470050118.ecse177) Wiley, 2008, pp. 1–10. [arXiv:https://](http://arxiv.org/abs/https://onlinelibrary.wiley.com/doi/pdf/10.1002/9780470050118.ecse177) [onlinelibrary.wiley.com/doi/pdf/10.1002/9780470050118.ecse177](http://arxiv.org/abs/https://onlinelibrary.wiley.com/doi/pdf/10.1002/9780470050118.ecse177), [doi:10.1002/9780470050118.ecse177](http://dx.doi.org/10.1002/9780470050118.ecse177).
- <sup>695</sup> URL [https://onlinelibrary.wiley.com/doi/abs/10.1002/](https://onlinelibrary.wiley.com/doi/abs/10.1002/9780470050118.ecse177) [9780470050118.ecse177](https://onlinelibrary.wiley.com/doi/abs/10.1002/9780470050118.ecse177)
- <span id="page-40-4"></span>[\[](http://doi.acm.org/10.1145/503048.503064)33] B. Shackleford, M. Tanaka, R. J. Carter, G. Snider, [Fpga implementation](http://doi.acm.org/10.1145/503048.503064) [of neighborhood-of-four cellular automata random number generators,](http://doi.acm.org/10.1145/503048.503064) in: Proceedings of the 2002 ACM/SIGDA Tenth International Symposium on <sup>700</sup> Field-programmable Gate Arrays, FPGA '02, ACM, New York, NY, USA, 2002, pp. 106–112. [doi:10.1145/503048.503064](http://dx.doi.org/10.1145/503048.503064). URL <http://doi.acm.org/10.1145/503048.503064>
- <span id="page-40-5"></span>[\[](http://www.sciencedirect.com/science/article/pii/S0045790617315653)34] I. Damaj, S. Kasbah, [An analysis framework for hardware](http://www.sciencedirect.com/science/article/pii/S0045790617315653) [and software implementations with applications from cryptogra-](http://www.sciencedirect.com/science/article/pii/S0045790617315653)<sup>705</sup> [phy,](http://www.sciencedirect.com/science/article/pii/S0045790617315653) Computers & Electrical Engineering 69 (2018) 572 – 584. [doi:https://doi.org/10.1016/j.compeleceng.2017.06.008](http://dx.doi.org/https://doi.org/10.1016/j.compeleceng.2017.06.008). URL [http://www.sciencedirect.com/science/article/pii/](http://www.sciencedirect.com/science/article/pii/S0045790617315653) [S0045790617315653](http://www.sciencedirect.com/science/article/pii/S0045790617315653)
- <span id="page-41-1"></span><span id="page-41-0"></span>[35] T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein, et al., Introduction <sup>710</sup> to algorithms, Vol. 2, MIT press Cambridge, 2001.
	- [36] J. L. Hennessy, D. A. Patterson, Computer architecture: a quantitative approach, Elsevier, 2011.
	- [37] D. Bishop, [Vhdl support library,](https://github.com/FPHDL/fphdl) online; accessed 26 October 2019 (2008). URL <https://github.com/FPHDL/fphdl>
- <span id="page-41-3"></span><span id="page-41-2"></span><sup>715</sup> [38] I. W. Damaj, A. M. El Hajj, H. T. Mouftah, An analytical framework for effective joint scheduling over tdd-based mobile networks, IEEE Access 7 (2019) 144214–144229. [doi:10.1109/ACCESS.2019.2945849](http://dx.doi.org/10.1109/ACCESS.2019.2945849).