Thursday 22 November 2012

Data Allocation Using Genetic Algorithm for MPSoC Systems with Hybrid Scratch-Pad Memory-paper presentation


Data Allocation Using Genetic Algorithm for MPSoC Systems with Hybrid Scratch-Pad Memory

Zhi Chen, Meikang Qiu,JianweiNiu2, Zhonghai Lu, Yongxin Zhu
Dept. of Electrical and Computer Engineering, University of Kentucky, Lexington, KY 40506, USA
School of Computer Science and Engineering, Beihang University, Beijing100191, China
Dept. of Electronic Systems, KTH Royal Institute of Technology, Stockholm, Sweden
School of Microelectronics, Shanghai Jiao Tong University, Shanghai 200240, China

Abstract

Hybrid on-chip memory architecture has gathered muchattention recently. However, if not well managed and optimized, the memory access will be a major energy and time consumer for this architecture. This paper presents a novel hybrid Scratch-Pad Memory (SPM) comprising SRAM and Magnetic RAM(MRAM) for MultiProcessor System-on-Chip(MPSoC) architectures. To reduce memory access costs and write activities on MRAMs, we propose a novel adaptive genetic algorithm to efficiently allocate data to each memory unit of the hybrid SPM. Experimental results show that the proposed genetic algorithm can reduce write operations to MRAMs by 28.33% and energy consumption by 15.32% on average.

1 Introduction

The high cost of memory access, involving latency and energy consumption, becomes a major limiting-factor for the advancement of multiprocessor system-on-chip(MP-SoC) architectures [7]. For example, caches consume up to 43% of the overall power in the ARM920T processor [5].
Scratch Pad Memory(SPM), a software-managed on-chip memory, has been widely employed by key manufactur-ers, due to its power efficiency and high predictability. Itis demonstrated that a SPM consumes 34% smaller chip area and 40% lower energy consumption than a cache mem-ory [2]. For advantages in power, area, and predictability, SPM is a promising technique to replace the existing hard-ware cache.

Traditionally, SPM is based on SRAM technology which is competitive in the speed of memory access but con-sumes large amounts of leakage power. Magnetic RAM(MRAM) has attractive properties in density, access speed, and standby leakage [3]. However, MRAM suffers longwrite latency and high write energy consumption, whichhinders the utilization of them as a pure SPM. In this paper, we propose a hybrid SPM architecture integrating both SRAM and MRAM as the on-chip memory to pro-mote memory access performance and alleviate the daunting “memory wall” problem for MPSoC architectures Stacked 3D integration technique can vertically stackand integrate various technologies and functional compo-nents on a die. This technique has facilitated the imple-mentation of the 3D stacked hybridcache architectures in
[9, 11]. Based on the hybrid cache architecture, we pro-pose a hybrid SPM architecture consisting of SRAM and MRAM, which manages data at the software level.
Prior works have investigated the hybrid cache by usingSRAM and MRAM and proven that the hybrid architecturecan save significant volumes of power [9, 11]. In [11], Wuet al. studied inter- and intra- cache level hybrid cache ar-chitectures, including the hardware implementation to sup-port intra cache data movement and reduce energy con-sumption. However, these mechanisms are primarily pro-posed for hardware caches while unsuitable for software-controlled SPMs. In the context of the data allocation in
MPSoC systems with SPM, Avissar et al. [1] proposed a static method to allocate data on SPMs for scalar and ar-ray variables. While Padan et al. [6] explored partitioning strategies for exploiting the potential of on-chip SPMs, theydidn’t account for hybrid SPMs architectures.

Since frequent writes to MRAMs will aggravate the performance, power consumption, and lifetime issues of
MRAMs, it is critical to design appropriate data allocation mechanisms to reduce the undesirable high cost of writes on MRAMs. Targeting the advantages of different mem-ory units of the hybrid SPM architecture, we propose a novel adaptive genetic algorithm,Adaptive Genetic Algo-rithm for Data Allocation(AGADA), to allocate data to dif-ferent memory modules. The objective of the proposed al-gorithm is to minimize the power consumption of memory systems. To achieve this goal, the rationale of this algorithm
is allocating the most frequently written data to SRAM and the most frequently read data to MRAM. In addition, our AGADA algorithm will adaptively perform crossover ge-netic operations to generate fitter offsprings (solutions).

Our work includes three major contributions: (1) propos-ing a novel hybrid SPM architecture consisting of SRAM and MRAM to obtain both benefits of these two kinds of memory; (2) using a genetic algorithm for data allocation,based on the proposed hybrid SPM; (3) adaptively perform crossover operations to satisfy the memory size constraint.
The rest of the paper is organized as follows. Section 2
introduces the proposed models. Section 3 presents our de-tailed algorithms. The experimental results and conclusions are given in Section 4 and Section 5, respectively.
62
2 System Model
A. Hardware model.Fig.1(a) exhibits a MPSoC architec-ture with hybrid SPMs. Each core is tightly coupled with alocal SPM that consists of a SRAM and a MRAM. Theprocessor cores access the off-chip DRAM main memorythrough a shared bus, and communicate witheach other via
an interconnect bus. Accessing data stored in SPMs of other processors is called remote access. Generally, accessing thelocal SPM is faster and dissipates less energy than fetch-
ing data from remote SPMs. Off-chip main memory access consumes the most energy.

In addition, when a core accesses a memory unit, it may encountercache miss. In this case, we need to obtain datain another memory unit holding this data. Inevitably, thedata transfer between two remote memories will incur muchhigher overhead. According to the mechanism used in [4],we assume energy consumption for memory getting data from remote memory is the sum of the cost of the remote
access to and the local write to A.
B. Chromosome Model.In the data allocation problem,all data items are randomly allocated to memory units of each core and main memory to form a chromosome, with the constraint of the size of each memory unit. Hence, a chromosome represents an allocation scheme. To perform genetic operations more conveniently, we construct all chromosomes in the list structure, as shown in Fig. 1(b). In this figure, we allocate 5 data to a target system with one processor owning a hybrid SPM. Every element of the list is defined as a data item and a memory unit pair: (d,MT) to indicate that data d is assigned to memory unit MT.In each chromosome, 0, 1, and 2 represent SRAM, MRAM, and off-chip DRAM, respectively. For example, the gene
(A, 0) represents data A is allocated to SRAM.

3 Algorithms
This section focuses on the discussion of the proposed genetic algorithm which consists of three major steps: initialization, evaluation, and genetic operations.The objective function of the target problem is described
as: given the number of local reads Nlr, local writes Nlw,remote reads Nrr, remote writes Nrw, the cost of local read Clr, local write Clw, remote read Crr,remote write Crw, and the cost of data transfer between different processors Cmove, we formulate the total cost of memory access Cm for a specific data as Equation (1).

3.1 Initialization

The initialpopulation is built randomly. To accelerate the search procedure for the solution space, the initial population will include the allocation generated by the greedyalgorithm proposed in [10].

3.2 Fitness Function

The main objective of the data allocation problem is to minimize the overall memory access cost and the number of write operations to MRAMs of a schedule. We define the fitness function of a chromosome as Equation (2).
where M represents the maximal total cost has been ob-served currently.TOTAL_COST (i)is the total memory access cost of chromosome i. It is the sum of the memory access cost of each gene (data) in this chromosome, which is cal-culated by Equation (3).

where𝑁is the number of data and 𝐶𝑀(𝑗)is the memory access cost of data𝑗that is defined as Equation (1).
Generally, genetic operations include selection,crossover, and mutation. We describe each of them as
follows.
A. Selection. In this paper, we will employ the wheel selection. Based on this method, the higher fitness value an individual has, the more area of it will be assigned on the wheel, and thus the more possible it will be selected when the biased roulette wheel is spun.

3.3 GA Operations

B. Crossover. The crossover operations are conducted according to a probability referred to as crossover rate (PC). We randomly select pairs of chromosomes as parents to generate offsprings. We devise a noveladaptive cy-cle crossover strategyto perform crossover operations for the data allocation problem. This method is modified from thecycle crossoverproposed in [8].
In the cycle crossover approach, the child can efficiently inherit the characteristics from both parents. However, for the data allocation problem, this approach may generate in-valid alleles, due to the size constraint of each memory unit.
An example of such scenario is exhibited in Fig. 2(a), where Parent 1 and Parent 2 indicate the parent chromosomes, and Child is generated by these two chromosomes. In this ex-ample, for simplicity, we assume the target system has one processor core with a hybrid SPM, and each on-chip mem-ory can provide space for only 2 data. 0, 1, and 2 represent SRAM, MRAM, and off-chip main memory, respectively.



As shown in the child chromosome, allocating data𝐷to the MRAM will exceed the maximum capacity of it.
We propose anadaptivecycle crossover strategyto over-come the limitation of the cycle crossover method for our data allocation problem and yield valid data allocations.
The key of our approach is to use a variable to indicate the
currently available space for each memory unit. Before al-locating a data to a specific memory, we will first check if there is enough room for the data. If so, the data will be directly allocated to the memory. Otherwise, we will adap-tively examine the memory units of the neighboring proces-sor cores and find a space for it. However, if all on-chip memory units, including SRAMs and MRAMs, are full, the
data will be assigned to the off-chip main memory. An ex-ample of the adaptive cycle crossover operations is exhib-ited in Fig. 2(b). In this figure, the gene (D,0′)indicatesadaptive adjustment result of data allocation to the MRAM
C. Mutation.The mutation operation is archived by ran-domly flipping bits of a chromosome. It is happened in a certain probability calledmutation rate(PM). Since the gene in this paper is defined as a data item and a memory unit pair, the mutation operation can be performed by swap-ping the position of either the data or the memory units of selected genes. Fig. 2(c) shows an example of mutation to exchange the memory positions of two genes.
The whole procedure of our AGADA algorithm works
as follows. First, we randomly generate an initial popula-tion according to the given population size. After initialization
 the fitness value of each individual will be evaluated in light of Equation (3). Then, we iteratively apply the above genetic operations to generate a new population and deter-mine the currently best solution. Finally, the fitness of each individual will be evaluated for the new population. Repeat these procedures until coming to the termination conditions.The details of the AGADA algorithm is described in Algo-rithm 3.1.

4 Experiments

We evaluate our algorithm on a set of benchmarks from PARSEC and run them in M5. To verify the effective-ness of the AGADA algorithm, we implemented it and the greedy algorithm as stand-alone programs that use the memory trace as input. The memory access latencies and
energy consumption are obtained from a modified versionof CACTI by using 65nm technology node.

Our target system is a dual-core in-order MPSoC with a 2MB SRAM and an 8MB MRAM as the hybrid SPM. The baseline configuration is a dual-core MPSoC with a 4MB pure SRAM as the SPM. The specifications of the baseline and the hybrid architecture are given in Table 1.
Generally, the density of MRAM is as 4 times as that of SRAM. Therefore, these two configurations occupy the similar chip areas. We then integrate all these parameters into our custom simulator. 10 applications are selected form PARSEC for simulations: black scholes, bodytrack,canneal, dedup, streamcluster, facesim, fluidanimate, x264,
swaptions, and ferret.
Without loss of generality, the following parameter spec-ifications are set for the simulations of our AGADA algorithm: 1) Population size: 500; 2) Crossover type: adaptivecycle crossover; 3) Crossover rate: 0.85; 4) Mutation rate 0.1; 5) Selection method: roulette wheel; 6) Maximum gen-eration: 2000.
We compare our AGADA algorithm to the greedy algo-rithm with respect to the number of writes to MRAMs, dy-namic energy consumption, and memory access latencies based on the set of benchmarks. Fig. 3 shows that com-pared to the greedy algorithm, our AGADA strategy can re-duce the average number of write operations to MRAMs by 28.33% on average. The major reason is that the greedy
algorithm does not distinguish SRAM from MRAM and only selects the frequently accessed data to SPMs. How-ever, the AGADA algorithm is conscious of write activities to MRAMs, which allocates most frequently read data to MRAMs and most frequently written data to SRAMs. By
reducing the number of writes to MRAMs, the AGADA al-gorithm can efficiently extend endurance of MRAMs. Fig. 4 shows that benefitting from the decrease of write oper-ations to MRAMs, the AGADA algorithm will incur much less dynamic energy consumption than that of the greedy al-gorithm. On average, the AGADA algorithm can reduce dy-namic energy consumption by 15.32% in our experiments.

5 Conclusions and Future Work

This work presented a hybrid SPM architecture consist-ing of SRAM and MRAM to solve the memory wall issue for MPSoC architectures. Based on the hybrid SPM archi-tecture, we proposed a novel data allocation strategy using an adaptive genetic algorithm to reduce the number of write
activities on MRAM. Experimental results showed that our genetic algorithm can contribute to a significant reduction in dynamic power consumption of memory systems and the number of writes to MRAMs compared to a greedy algo-rithm. We plan to investigate the effectiveness of the pro-posed hybrid architecture and the accuracy of the genetic algorithm in the future work.

Acknowledgements

This work was supported in part by the University of
Kentucky Start Up Fund and NSFC 61071061; NSFC
61170296, Fund of Aeronautics Science 20091951020,
New Century Excellent Talents in University 291184; Na-tional High Technology R & D Program of China (863) 2009AA012201.

References

[1] O. Avissar, R. Barua, and D. Stewart. An optimal mem-ory allocation scheme for scratch-pad based embedded sys-tems. ACM Transactions on Embedded Computing Systems
(TECS), 1(1):6–26, 2002.
[2] R. Banakar, S. Steinke, B. S. Lee, M. Balakrishnan, and
R. Marwedel. Scratchpad memory: design alternative
for cache on-chip memory in embedded systems. In
CODES+ISSS ’02, pages 73–78, 2002.
[3] X. Guo, E. Ipek, and T. Soyata. Resistive computation:
Avoiding the power wall with low-leakage, STT-MRAM
based computing. InISCA ’10, pages 371–382, 2010.
[4] Y. Guo, Q. Zhuge, J. Hu, M. Qiu, and E. H.-M. Sha. Optimal
data allocation for scratch-pad memory on embedded multi-core systems. InICPP 2011, pages 464–471, 2011.
[5] J. Montanaro, R. T. Witek, K. Anne, A. J. Black, and E. M.
Cooper. A 160-mhz, 32-b, 0.5-w cmos risc microprocessor.
Digital Technical Journal, 9:49–62, 1997.
[6] P. R. Panda, N. D. Dutt, and A. Nicolau. On-chip vs. off-chip memory: the data partitioning problem in embedded
processor-based systems.ACM Transactions on Design Au-tomation of Electronic Systems, 5(3):682–704, 2000.
[7] M. Qiu and E. H.-M. Sha. Cost minimization while satis-fying hard/soft timing constraints for heterogeneous embed-ded systems. ACM Transactions on Design Automation of
Electronic Systems (TODAES), 14(2):1–30, 2009. (Best Pa-per Award).
[8] K. Shahookar and P. Mazumder. A genetic approach to stan-dard cell placement using meta-genetic parameter optimiza-tion.IEEE Trans. Computer-Aided Design of Integrated Cir-cuits and Systems, 9(5):500–511, 1990.
[9] G. Sun, X. Dong, Y. Xie, J. Li, and Y. Chen. A novel archi-tecture of the 3D stacked MRAM L2 cache for CMPs. In
ISCA ’09, pages 239–249, 2009.
[10] S. Udayakumaran and R. Barua. Compiler-decided dy-namic memory allocation for scratch-pad based embedded
systems. InCASES ’03, pages 276–286, 2003.
[11] X. Wu, J. Li, L. Zhang, E. Speight, R. Rajamony, and Y. Xie.
Hybrid cache architecture with disparate memory technolo-gies. InHPCA ’09, pages 239–249.





1 comment:

  1. I really appreciate your professional approach. These are pieces of very useful information that will be of great use for me in future.

    ReplyDelete