CellStream

CellStream is an application for the Cell Broadband Engine that moves data quickly and efficiently from external storage into main memory, through the SPEs, and back out to a storage device. It also supports a drop-in kernel on the SPEs, so some work can be done on the data as it streams through them.

It has been used as a test application for comparing programming models on multi-core processors with explicitly managed memory hierarchies (such as the Cell BE)[1]. In that paper, it was designed to compare how efficiently each programming model streamed data to the SPEs. Since the handwritten version of CellStream performs at close to peak speeds, it was a good indicator of the overhead introduced by each programming model.

Hardware Description

Figure 1: Cell BE Block Diagram

Figure 1 shows a simple block diagram of the Cell Broadband Engine. The Cell processor is a heterogeneous multi-core processor[2]. The Power Processor Element (PPE) acts like a traditional PowerPC processor. It is a dual-threaded processor, so two execution contexts can execute simultaneously. This is shown by the two thread boxes in the diagram. Since CellStream uses multiple PPE threads, these boxes are used to show when each thread is executing. Each Synergistic Processor Element (SPE) is a specialized accelerator core with 256KB of local storage space. The programmer is responsible for moving data from Main Memory to the SPE's local store so that the SPE can compute on that data.

The Element Interconnect Bus is a very fast (204.8 GB/s) bus that connects the components inside the chip. While this bus is extremely fast, the bottleneck quickly becomes how fast each component can communicate with the bus. For example, Main Memory's interface controller is only capable of 25.6 GB/s transfer speeds, which is 12.5% of the bus speed.

Various external components are connected via a FlexIO interface to the Cell processor. These external components include network and disk I/O.

CellStream Description

On the PPE side of CellStream, there are three threads (implemented with pthreads) that run concurrently on three different buffers.

Figure 2: Reading thread reads the first buffer into Main Memory.
Figure 3: First buffer is offloaded to the SPEs while the Reading Thread reads in the second buffer from disk.
Figure 4: Writing Thread writes first buffer back to disk. SPEs are working on the second buffer and the Reading Thread is reading the third buffer from disk.

The Reading Thread is responsible for moving data from the external source into Main Memory (Figure 2). The Offloading Thread is responsible for dividing up work among the SPEs, starting their execution, and waiting for them to finish before handing the buffer to the last thread (Figure 3). Since the Offloading Thread does very little work on the PPE itself, it is not shown in the figures. The Writing Thread is responsible for moving the now-processed data in Main Memory to an external destination (Figure 4).

The threads rotate which buffer each is responsible for by using pthread wait functions. As soon as the Reading thread fills a buffer, it signals the other threads to wake up and check who has permission to use the modified buffer. Since the Reading thread marks the buffer as ready to be offloaded, the Offloading thread sees this and starts work. The same process happens between the Offloading thread and the Writing thread, and again between the Writing thread and the Reading thread once a buffer has been written back to external storage. This version of CellStream uses files on disk as the source of input and the destination of output, although the Reading and Writing threads can be modified to use other kinds of storage (such as the network).
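The hand-off described above can be sketched with one mutex and one condition variable. This is a minimal illustration, not the CellStream source: the names (struct pipeline, acquire, release, run_pipeline), the buffer size, and the number of rounds are all assumptions, and simple byte operations stand in for disk I/O and the SPE kernel.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define NUM_BUFFERS 3
#define BUF_SIZE    64   /* hypothetical buffer size */
#define NUM_ROUNDS  6    /* hypothetical number of buffers streamed */

/* Which stage currently has permission to touch a buffer. */
enum owner { OWNER_READER, OWNER_OFFLOADER, OWNER_WRITER };

struct pipeline {
    pthread_mutex_t lock;
    pthread_cond_t  changed;               /* broadcast on every hand-off */
    enum owner      owner[NUM_BUFFERS];
    unsigned char   data[NUM_BUFFERS][BUF_SIZE];
    unsigned long   checksum;              /* touched only by the writer */
};

struct stage_arg { struct pipeline *p; enum owner me; };

/* Block until buffer idx belongs to stage me. */
static void acquire(struct pipeline *p, int idx, enum owner me)
{
    pthread_mutex_lock(&p->lock);
    while (p->owner[idx] != me)
        pthread_cond_wait(&p->changed, &p->lock);
    pthread_mutex_unlock(&p->lock);
}

/* Hand buffer idx to the next stage and wake every waiting thread. */
static void release(struct pipeline *p, int idx, enum owner next)
{
    pthread_mutex_lock(&p->lock);
    p->owner[idx] = next;
    pthread_cond_broadcast(&p->changed);
    pthread_mutex_unlock(&p->lock);
}

static void *stage(void *argp)
{
    struct stage_arg *a = argp;
    struct pipeline *p = a->p;
    for (int round = 0; round < NUM_ROUNDS; round++) {
        int idx = round % NUM_BUFFERS;     /* rotate through the buffers */
        acquire(p, idx, a->me);
        switch (a->me) {
        case OWNER_READER:                 /* stand-in for reading from disk */
            memset(p->data[idx], round + 1, BUF_SIZE);
            release(p, idx, OWNER_OFFLOADER);
            break;
        case OWNER_OFFLOADER:              /* stand-in for the SPE kernel */
            for (int i = 0; i < BUF_SIZE; i++)
                p->data[idx][i] *= 2;
            release(p, idx, OWNER_WRITER);
            break;
        case OWNER_WRITER:                 /* stand-in for writing to disk */
            for (int i = 0; i < BUF_SIZE; i++)
                p->checksum += p->data[idx][i];
            release(p, idx, OWNER_READER);
            break;
        }
    }
    return NULL;
}

/* Run the three stages to completion and return the writer's checksum. */
unsigned long run_pipeline(void)
{
    struct pipeline p;
    pthread_t tid[3];
    struct stage_arg args[3];

    pthread_mutex_init(&p.lock, NULL);
    pthread_cond_init(&p.changed, NULL);
    p.checksum = 0;
    for (int i = 0; i < NUM_BUFFERS; i++)
        p.owner[i] = OWNER_READER;         /* every buffer starts empty */
    for (int i = 0; i < 3; i++) {
        args[i].p = &p;
        args[i].me = (enum owner)i;
        int rc = pthread_create(&tid[i], NULL, stage, &args[i]);
        assert(rc == 0);
    }
    for (int i = 0; i < 3; i++)
        pthread_join(tid[i], NULL);
    pthread_cond_destroy(&p.changed);
    pthread_mutex_destroy(&p.lock);
    return p.checksum;
}
```

Because every hand-off broadcasts on the same condition variable, each waiting thread wakes, rechecks the owner field of the buffer it wants, and goes back to sleep if the buffer is not yet its own. Older toolchains may need the -pthread flag when compiling.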

The main concern of the SPE code is to move data in and out of the local store as quickly and efficiently as possible. It uses a double-buffering technique with DMA fencing. The fenced calls tell the SPE that, as soon as the chunk in one local buffer has been committed to main memory, it should start refilling that buffer with a new chunk. These calls can all be issued at once, and they will be processed in the order in which they were issued.
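The control flow of double buffering can be illustrated with a portable sketch. It is not the CellStream SPE code: dma_get and dma_put are hypothetical stand-ins implemented with memcpy so the example runs on any machine, the kernel and CHUNK size are made up, and the copies here complete synchronously. On a real SPE the transfers would be asynchronous MFC commands from spu_mfcio.h (for example mfc_put followed by a fenced mfc_getf on the same tag), which is what lets the fetch of the next chunk overlap with computation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 128   /* bytes per transfer; hypothetical size */

/* Stand-ins for DMA between main memory (ea) and the local store (ls).
 * Plain copies here; asynchronous MFC commands on a real SPE. */
static void dma_get(void *ls, const unsigned char *ea, size_t n)
{
    memcpy(ls, ea, n);
}

static void dma_put(unsigned char *ea, const void *ls, size_t n)
{
    memcpy(ea, ls, n);
}

/* Hypothetical drop-in kernel: add one to every byte. */
static void kernel(unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] += 1;
}

/* Stream total bytes (a multiple of CHUNK) from src to dst through two
 * local buffers, requesting chunk i+1 before chunk i is processed. */
void stream_double_buffered(unsigned char *dst, const unsigned char *src,
                            size_t total)
{
    unsigned char local[2][CHUNK];
    size_t nchunks = total / CHUNK;

    assert(total % CHUNK == 0);
    if (nchunks == 0)
        return;

    dma_get(local[0], src, CHUNK);              /* prime the first buffer */
    for (size_t i = 0; i < nchunks; i++) {
        int cur = (int)(i & 1);
        int nxt = cur ^ 1;

        /* Buffer nxt was written out in the previous iteration. With real
         * DMA, a fenced get ordered behind that put makes the reuse safe. */
        if (i + 1 < nchunks)
            dma_get(local[nxt], src + (i + 1) * CHUNK, CHUNK);

        kernel(local[cur], CHUNK);
        dma_put(dst + i * CHUNK, local[cur], CHUNK);
    }
}
```

The two buffers alternate roles on every iteration: while one holds the chunk being processed, the other is being refilled. The ordering of the put and the following get on the same buffer is the part that the fenced DMA calls enforce in hardware.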

Performance

While the DMA benchmark that comes bundled with the Cell SDK does a good job of measuring the maximum communication capabilities of the hardware[3], it does so in a manner that does not maintain data coherency. CellStream was designed to get as close to the maximum bandwidth as possible while operating on data that remains coherent and uncorrupted as it passes through the Cell BE.

The first experiment was done to verify that CellStream exhibited the same communication behavior as the DMA benchmark[3], which varied the size of each DMA transfer to find the ideal DMA transfer size. The number of SPEs used was scaled from 1 to 6 (not 8, since a PlayStation 3 only has 6 available SPEs). Each buffer was split according to how many SPEs would be working on it, so each SPE had an equal share of work.

Figure 5: CellStream Average SPE Transfer Speed on a PlayStation 3 with workloads split on 16-byte boundaries.
Figure 6: CellStream Average SPE Transfer Speed on a PlayStation 3 with workloads split on cache lines.

The results shown in Figure 5 were very surprising. For 1, 2, and 4 SPEs, the results were as expected. However, the results for 3, 5, and 6 SPEs were very low. Further investigation revealed that in those cases the SPE transfer speeds were very unbalanced.

The first SPE would always perform the fastest, but others would generally run at about half its speed. This behavior was consistent when different SPEs worked on different parts of the buffer: the SPE that started work at the beginning of the buffer always had transfer speeds almost twice those of some of the other SPEs. The buffer was split so that each SPE's starting memory address was 16-byte aligned, which is a requirement of a DMA transfer, but this meant some SPEs were not starting on a cache line, which falls on a 128-byte boundary. Whenever the 16MB buffer could be split evenly so that an SPE's starting work address was on a 128-byte boundary, that SPE would perform as expected.

In Figure 6, the performance is much closer to the results of the DMA benchmark[3] mentioned earlier. Splitting the buffer so each SPE's first memory access was on a 128-byte memory address boundary clearly gives a better average speed.
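One way to compute such a split is sketched below. It is an illustration under stated assumptions rather than the partitioning code CellStream uses: struct span and split_on_cache_lines are hypothetical names, and the policy of giving any sub-line tail to the last worker is an arbitrary choice. For comparison, if a 16MB buffer were divided three ways and each share rounded down to a multiple of 16 bytes, the second SPE would start at byte 5,592,400, which is 80 bytes past a cache-line boundary.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 128   /* cache-line size in bytes, as stated in the text */

struct span {
    size_t offset;       /* byte offset of the share within the buffer */
    size_t length;       /* size of the share in bytes */
};

/* Give worker i of n a contiguous share of total bytes whose start is a
 * multiple of CACHE_LINE. Whole cache lines are dealt out as evenly as
 * possible; the last worker also takes any tail shorter than a line. */
struct span split_on_cache_lines(size_t total, unsigned n, unsigned i)
{
    assert(n > 0 && i < n);

    size_t lines = total / CACHE_LINE;   /* full cache lines available */
    size_t base  = lines / n;            /* lines every worker receives */
    size_t extra = lines % n;            /* first extra workers get one more */
    size_t first = (size_t)i * base + (i < extra ? i : extra);
    size_t count = base + (i < extra ? 1 : 0);

    struct span s;
    s.offset = first * CACHE_LINE;
    s.length = count * CACHE_LINE;
    if (i == n - 1)
        s.length += total % CACHE_LINE;
    return s;
}
```

With this scheme the shares differ by at most one cache line (plus the tail on the last worker), so the work stays nearly equal while every starting offset remains 128-byte aligned, provided the buffer itself starts on a 128-byte boundary.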

We now see what we'd expect from the hardware. A single SPE is limited by how fast that SPE alone can access memory. With 2 SPEs there is a sizable jump in average bandwidth inside of the chip. With additional SPEs we see the bottleneck of the memory interface present itself.

Future Work

While CellStream is able to stream data between the SPEs and Main Memory very quickly, it doesn't currently support SPE-to-SPE transfers. This was included in an earlier version (and the code is available in the git repository), but it was dropped as the application was optimized for fast SPE-to-Main-Memory transfers.

It also doesn't support different input/output sizes for the SPEs. If the SPE computational work involves reducing the data (such as a compression algorithm), excess padding will be left in the output since the designated buffer chunk for each SPE wasn't reduced.

Source Code

The source code is available via two methods. The first is the compressed archive below. The second is via GitHub. It is released under the GPL, so please follow its guidelines if you do anything with the code.

References

  1. S. Schneider, J. Yeom, B. Rose, J. Linford, A. Sandu, D. Nikolopoulos. A Comparison of Programming Models for Multiprocessors with Explicitly Managed Memory Hierarchies. Proc. of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), Raleigh, NC, February 2009.
  2. T. Chen, R. Raghavan, J. N. Dale, and E. Iwata. Cell Broadband Engine and Its First Implementation – A Performance View. IBM Journal of Research and Development, 51(5):559-572, Sept. 2007.
  3. D. Jimenez-Gonzalez, X. Martorell, and A. Ramirez. Performance Analysis of Cell Broadband Engine for High Memory Bandwidth Applications. IEEE International Symposium on Performance Analysis of Systems & Software, pages 210-219, April 2007.