The Coprocessor Architecture: An Embedded Systems Architecture for Rapid Prototyping

By Noah Madinger, Colorado Electronic Product Design (CEPD)

Although best known for its throughput and digital processing performance, the coprocessor architecture also offers the embedded systems designer opportunities to apply project management strategies that improve both development costs and time to market. Focusing specifically on the combination of a discrete microcontroller (MCU) and a discrete field programmable gate array (FPGA), this article shows how this architecture lends itself to an efficient and iterative design process. Drawing on research sources, empirical results, and case studies, the advantages of this architecture are explored and exemplary applications are offered. Upon completion of this article, the embedded systems designer will have a better understanding of when and how to implement this versatile hardware architecture.

Introduction

The embedded systems designer stands at a juncture of design constraints, performance expectations, and budget and schedule concerns. In fact, even the contradictions among the buzzwords of modern project management underscore the precarious nature of this role: "fail fast", "be agile", "future proof", and "be disruptive". The acrobatics involved in trying to meet all of these expectations can be heartbreaking, and yet these expectations continue to be reinforced in the marketplace. What is needed is a design approach that allows an evolutionary, iterative process, and as with most embedded systems, it starts with the hardware architecture.

A hardware architecture known for combining the strengths of microcontroller unit (MCU) and field-programmable gate array (FPGA) technologies, coprocessor architecture can offer the embedded system designer a process capable of satisfying even the most demanding requirements, while allowing the flexibility to face known and unknown challenges. By providing hardware that can adapt iteratively, the designer can demonstrate progress, reach critical milestones, and get the most out of the rapid prototyping process.

Within this process are key project milestones, each adding its own value to the development effort. Throughout this article, they will be referred to as: the digital signal processing with the microcontroller milestone, the system management with the microcontroller milestone, and the product implementation milestone.

In conclusion, it will be shown that a flexible hardware architecture can be more suitable for modern embedded system design than a more rigid approach, and that this flexibility can lead to improvements in both project cost and time to market. Arguments, examples, and case studies will be used to defend this position. By examining the value each milestone delivers through the design flexibility this architecture provides, it becomes clear that an adaptive hardware architecture is a powerful engine to drive embedded system design.

Exploring the strengths of the coprocessor architecture: design flexibility and high-performance processing

A common application for FPGA designs is to interface directly with a high-speed analog-to-digital converter (ADC). The signal is digitized, read into the FPGA, and digital signal processing (DSP) algorithms are applied to it; the FPGA then makes decisions based on the results.

This application will serve as an example throughout this article. Figure 1 illustrates a generic coprocessor architecture, in which the MCU and the FPGA are connected via the MCU's external memory interface. The FPGA is treated as if it were a piece of external static random access memory (SRAM). Signals return from the FPGA to the MCU and serve as hardware interrupt lines and status indicators. This allows the FPGA to indicate critical states to the MCU, such as an ADC conversion being ready, a fault having occurred, or another noteworthy event.

Figure 1: Generic diagram of the coprocessor architecture (MCU + FPGA). (Image source: CEPD)
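Because the FPGA sits in the MCU's external memory space, firmware sees it as ordinary memory: reads and writes to a mapped address window become register accesses. The sketch below illustrates the idea with a hypothetical register map (names, offsets, and bit assignments are illustrative, not from the article); a small array stands in for the mapped window so the sketch runs on a PC.

```cpp
#include <cstdint>

// Stand-in for the FPGA's register window on the MCU's external memory
// interface. On real hardware this would be a pointer to the physical
// address of that window, declared volatile for the same reason.
static volatile uint16_t fpga_mem[4] = {0, 0, 0, 0};

// Hypothetical register map (illustrative only).
enum FpgaReg : int {
    REG_STATUS  = 0,  // bit 0: ADC conversion ready, bit 1: fault
    REG_CONTROL = 1,  // bit 0: start capture
    REG_DATA_LO = 2,  // captured sample, low word
    REG_DATA_HI = 3   // captured sample, high word
};

// To the MCU these are ordinary memory accesses, because the FPGA
// occupies part of the external SRAM address space.
uint16_t fpga_read(int reg)              { return fpga_mem[reg]; }
void     fpga_write(int reg, uint16_t v) { fpga_mem[reg] = v; }

bool adc_sample_ready() { return (fpga_read(REG_STATUS) & 0x1) != 0; }
```

In a real design the interrupt lines from Figure 1 would let the FPGA announce these status changes asynchronously, with polling of `REG_STATUS` as a fallback.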

The strengths of the coprocessor approach are probably best seen in the results for each of the milestones mentioned. Value is assessed not only by listing the accomplishments of a task or phase, but also by evaluating the enablement that these accomplishments allow. The answers to the following questions help to assess the overall value of the results of a milestone:

  • Can other team members proceed more quickly now that dependencies and bottlenecks have been removed from the project?
  • How do milestone achievements enable other parallel execution pathways?

The digital signal processing with the microcontroller milestone

Figure 2: Architecture – digital signal processing with the microcontroller. (Image source: CEPD)

The first stage of development enabled by this hardware architecture brings the MCU to the fore. Other things being equal, MCU and executable software development requires fewer resources and time than FPGA and HDL development. Thus, by starting product development with the MCU as the main processor, algorithms can be more quickly implemented, tested, and validated. This allows algorithmic and logical errors to be discovered early in the design process, and also allows important parts of the signal chain to be tested and validated.

The FPGA's role in this initial milestone is to serve as a high-speed data-collection interface. Its task is to reliably pipe the data from the high-speed ADC, notify the MCU that data is available, and present this data on the MCU's external memory interface. Although this role does not include HDL-based DSP processes or other algorithm implementations, it is nonetheless critical.

The development of the FPGA that is carried out in this phase lays the foundations for the final success of the product, both in its development and in its launch to the market. By focusing on the low-level interface, adequate time can be spent testing these essential operations. Only once the FPGA reliably and securely performs this interface role can this milestone be safely completed.
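The MCU side of this data-collection handshake might look like the following sketch. The flag, buffer, and function names are all hypothetical; in a real system the flag would be set by a hardware interrupt service routine wired to the FPGA's data-ready line, and the buffer would be the FPGA's memory window.

```cpp
#include <cstdint>
#include <cstddef>

static volatile bool     data_ready    = false; // set by the FPGA's IRQ line
static volatile uint16_t fpga_fifo[64] = {};    // FPGA-exposed sample block

// Copy one block of samples out of the FPGA window, then clear the flag
// to acknowledge the transfer. Returns the number of samples drained.
size_t drain_samples(uint16_t* dst, size_t n) {
    if (!data_ready) return 0;          // nothing to collect yet
    for (size_t i = 0; i < n; ++i) dst[i] = fpga_fifo[i];
    data_ready = false;                 // acknowledge the FPGA
    return n;
}
```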

The main results of this initial milestone include the following benefits:

  1. The complete signal path - all amplifications, attenuations and conversions - will have been tested and validated.
  2. Project development time and effort will have been reduced by initially implementing the algorithms in software (C/C++); this is of considerable value to management and other interested parties, who must see the feasibility of this project before approving future design phases.
  3. Lessons learned from implementing the algorithms in C/C++ will be directly transferable to HDL implementations through the use of high-level synthesis tools, e.g., Xilinx Vivado HLS.
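As a concrete instance of point 2, here is the kind of algorithm that might first be implemented and validated in C++ on the MCU before any HDL work begins: a simple moving-average filter (an illustrative stand-in, not an algorithm from the article), written with fixed-width integer types so it later translates cleanly to HDL.

```cpp
#include <cstdint>
#include <cstddef>

// N-tap moving average over 16-bit samples. The 32-bit accumulator
// avoids overflow; the window grows until it reaches `taps` samples.
void moving_average(const int16_t* in, int16_t* out, size_t len, size_t taps) {
    int32_t acc = 0;
    for (size_t i = 0; i < len; ++i) {
        acc += in[i];
        if (i >= taps) acc -= in[i - taps];       // drop the oldest sample
        size_t window = (i + 1 < taps) ? i + 1 : taps;
        out[i] = static_cast<int16_t>(acc / static_cast<int32_t>(window));
    }
}
```

Functional bugs in a filter like this are far cheaper to find here, in software, than after an HDL port.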

The system management with the microcontroller milestone

Figure 3: Architecture – system management with the microcontroller. (Image source: CEPD)

The second stage of development offered by this coprocessor approach is defined by moving DSP processes and algorithm implementations from the MCU to the FPGA. The FPGA is still responsible for the high-speed ADC interface; however, by taking over these other functions as well, the FPGA's speed and parallelism are fully exploited. Furthermore, unlike with the MCU, multiple instances of the DSP processes and algorithm pipelines can be implemented and run simultaneously.

Building on the lessons learned from the MCU implementation, the designer carries this confidence into the next milestone. Tools such as the aforementioned Vivado HLS from Xilinx provide a functional translation of executable C/C++ code into synthesizable HDL. Timing constraints, process parameters, and other user preferences still need to be defined and implemented; however, the core functionality is preserved and translated onto the FPGA fabric.

For this milestone, the role of the MCU is that of system manager. The MCU monitors, updates, and reports on the status and control registers within the FPGA. In addition, the MCU manages the user interface (UI). This UI could take the form of a web server accessed via an Ethernet or Wi-Fi connection, or it could be an industrial touchscreen giving users access at the point of use. The key to the MCU's new, more refined role is this: freed from processing-intensive tasks, both the MCU and the FPGA are now used for the tasks for which they are best suited.
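The system-manager role can be sketched as a simple poll of the FPGA's status registers, with the result handed to the UI layer. The register layout and bit meanings below are assumptions for illustration; plain variables stand in for the memory-mapped registers so the sketch runs anywhere.

```cpp
#include <cstdint>

// Snapshot of system state for the UI layer (web server or touchscreen).
struct SystemStatus {
    bool     pipeline_running;
    bool     fault;
    uint16_t frames_processed;
};

// Stand-ins for FPGA registers mapped into the MCU's memory space.
static volatile uint16_t fpga_status = 0;  // bit 0: running, bit 1: fault
static volatile uint16_t fpga_frames = 0;  // completed frame counter

// The manager loop calls this periodically and forwards the result
// to whatever UI the product exposes.
SystemStatus poll_fpga() {
    uint16_t s = fpga_status;
    return SystemStatus{ (s & 0x1) != 0, (s & 0x2) != 0, fpga_frames };
}
```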

The main results of this milestone include these benefits:

  1. The FPGA provides fast, parallel execution of DSP processes and algorithm implementations. The MCU provides a streamlined, agile user interface and manages product processes.
  2. Having first been developed and validated in the MCU, algorithmic risks have been mitigated, and these mitigations carry over to synthesizable HDL. Tools like Vivado HLS facilitate this translation process. Additionally, FPGA-specific risks can be mitigated using built-in simulation tools, such as those in the Vivado design suite.
  3. Stakeholders are not exposed to significant risk by moving processes to the FPGA. Instead, they get to see and enjoy the benefits of FPGA speed and parallelism. Measurable performance improvements are realized, and attention can now turn to preparing the design for manufacturing.

The product implementation milestone

With the computation-intensive processing in the FPGA and the MCU managing the system and the user interface, the product is ready for deployment. This article does not advocate skipping Alpha and Beta releases; rather, the emphasis of this milestone is on the capabilities that the coprocessor architecture brings to product implementation.

Both the MCU and FPGA are field upgradeable devices. Several advances have been made to make FPGA upgrades as accessible as software upgrades. Furthermore, since the FPGA is within the addressable memory space of the MCU, the MCU can serve as an access point for the entire system: receiving updates to itself as well as to the FPGA. Updates can be conditionally scheduled, distributed, and customized for each end user. Lastly, user and use case registries can be maintained and associated with specific build implementations. From these data sets, performance can continue to be refined and improved even after the product is in the field.

Perhaps nowhere are the strengths of this system-wide upgradeability more apparent than in space applications. Once the product is launched, maintenance and updates must be done remotely. This can be as simple as changing logic conditions, or as complicated as updating a communications modulation scheme. The programmability offered by FPGA technologies and the coprocessor architecture can accommodate this full range of capabilities, while offering radiation-resistant component options.

The last key point of this milestone is progressive cost reduction. Cost reductions, BOM changes, and other optimizations can also occur at this stage. During field deployments, it may be discovered that the product can work just as well with a less expensive MCU or a less capable FPGA. Thanks to the coprocessor architecture, designers are not forced to use components whose capabilities exceed the needs of their application. Also, if a component becomes unavailable, the architecture allows new components to be integrated into the design. This is not the case with a single-chip architecture, a system-on-a-chip (SoC), or a high-performance DSP or MCU trying to handle all of the product's processing. The coprocessor architecture is a good mix of power and flexibility that gives the designer more choice and freedom, both in the development phases and at market release.

Supporting research and related case studies

Example of satellite communications

In short, the value of a coprocessor is to offload tasks from the primary processing unit onto hardware, where their execution can be accelerated and streamlined. The advantage of this design choice is a net increase in computational speed and capability and, as argued in this article, a reduction in development time and cost. Perhaps one of the most attractive areas for these benefits is space communications systems.

In their paper, FPGA based hardware as coprocessor, G. Prasad and N. Vasantha detail how data processing within an FPGA meets the computational needs of satellite communications systems without the high non-recurring engineering (NRE) costs of application-specific integrated circuits (ASICs) or the application-specific limitations of a hard-architecture processor. As described for the digital signal processing with the microcontroller milestone, their design begins with the application processor performing the most computationally intensive algorithms. From this starting point, they identify the sections of software that consume the most central processing unit (CPU) clock cycles and migrate those sections to an HDL implementation. The graphical representation is very similar to the one presented so far; however, they chose to represent the application program as its own independent block, since it can run on either the host (processor) or the FPGA-based hardware.

Figure 4: Application program, host processor, and FPGA-based hardware, as used in the satellite communications example.

By using a Peripheral Component Interconnect (PCI) interface and direct memory access (DMA) from the host processor, peripheral performance is greatly increased. This is seen above all in the improvement of the derandomization process. When this process was performed in the host processor's software, it was a clear bottleneck in the system's real-time response. However, when it was moved to the FPGA, the following advantages were observed:

  • Derandomization runs in real time without causing bottlenecks
  • The computational overhead of the host processor was significantly reduced, allowing it to better perform its intended logging function
  • The overall performance of the entire system was improved

All this was achieved without the costs associated with an ASIC and enjoying the flexibility of programmable logic [5]. Satellite communications present considerable challenges, and this approach can verifiably meet these requirements, while still providing design flexibility.

In-car infotainment example

Car entertainment systems are a hallmark for the most discerning consumer. Unlike most automotive electronics, these devices are highly visible and are expected to offer exceptional response time and performance. However, designers are often torn between today's design needs and the flexibility that future features will require. In this example, wireless communications and signal processing implementation needs will be used to highlight the strengths of the coprocessor hardware architecture.

One of the most widely used automotive entertainment system architectures was published by the Delphi Delco Electronics Systems corporation. This architecture used an SH-4 MCU with a complementary ASIC, Hitachi's HD64404 Amanda peripheral. This architecture satisfied more than 75% of the automotive market's basic entertainment functions; however, it lacked the capability to address wireless communications and video processing applications. By including an FPGA in this existing architecture, more flexibility and capability can be added to this existing design approach.

Figure 5: First infotainment FPGA coprocessor architecture example.

The architecture of Figure 5 is suitable for both video processing and wireless communications management. By moving the DSP functionality to the FPGA, the Amanda processor can take on a system-management role and is freed up to implement a wireless communications stack. Since both the Amanda and the FPGA have access to external memory, data can be quickly exchanged between the processors and the system components.

Figure 6: Second infotainment FPGA coprocessor architecture example.

The second infotainment architecture, in Figure 6, highlights the FPGA's ability both to handle high-speed incoming analog data and to perform the compression and encoding required for video applications. In fact, all of this functionality can be moved into the FPGA and, through parallel processing, handled in real time.

By embedding an FPGA within an existing hardware architecture, you can combine the proven performance of existing hardware with flexibility and future-proofing. Even within existing systems, the coprocessor architecture offers designers options not otherwise available [6].

Advantages of Rapid Prototyping

At its heart, the rapid prototyping process strives to cover a considerable amount of the product development area by running tasks in parallel, quickly identifying bugs and design issues, and validating data and signal paths, especially those within a project's critical path. However, for this process to truly produce agile and efficient results, there must be sufficient expertise in the required project areas.

Traditionally, this means that there must be a hardware engineer, an embedded software or DSP engineer, and an HDL engineer. There are now many interdisciplinary professionals who can play multiple roles, but coordinating these efforts continues to be a significant workload.

In their article, An FPGA based rapid prototyping platform for wavelet coprocessors, the authors argue that a coprocessor architecture allows a single DSP engineer to perform all of these functions efficiently and effectively. For this study, the team began by designing and simulating the desired DSP functionality in MATLAB's Simulink tool. This served two main functions: 1) it verified the desired performance through simulation, and 2) it served as a baseline against which future design options could be compared and referenced.

After simulation, critical functionalities were identified and divided into different cores: soft-core components and processors that can be synthesized on an FPGA. The most important step in this work was defining the interfaces between these cores and components and comparing their data-exchange performance against the desired, simulated performance. This design process aligns closely with the Xilinx design flow for embedded systems and is summarized in Figure 7 below.

Figure 7: Implementation design flow.

By dividing the system into synthesizable cores, the DSP engineer can focus on the most critical aspects of the signal processing chain. The engineer does not need to be a hardware or HDL expert to modify, route, or implement different soft-core processors or components within the FPGA. As long as the designer knows the interfaces and data formats, they have full control over the signal paths and can fine-tune the system's performance.

Empirical results: the case of the discrete cosine transform

The empirical results not only confirmed the flexibility that the coprocessor architecture offers the embedded system designer, but also showed the performance enhancement options available with modern FPGA tools. Enhancements such as those listed below may not be available, or may have less impact, for other hardware architectures. The discrete cosine transform (DCT) was selected because it is a computationally intensive algorithm widely used in digital signal processing for pattern recognition and filtering [8]; its progression from a C-based to an HDL-based implementation was at the heart of these findings. The empirical findings are based on a laboratory exercise completed by the author and his collaborators to obtain Xilinx Alliance Partner certification for 2020 – 2021.

For this, the following tools and devices were used:

  • Vivado HLS v2019
  • Target device for evaluation and simulation: xczu7ev-ffvc1156-2-e

Starting with the C-based implementation, the DCT algorithm accepts two 16-bit arrays: array "a" is the input to the DCT, and array "b" is its output. The data width (DW) is therefore defined as 16, and the number of elements within the arrays (N) is 1024/DW, i.e., 64. Finally, the size of the DCT matrix (DCT_SIZE) is set to 8, meaning an 8 x 8 matrix is used.
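In code, those parameters might be declared as follows. This is a sketch consistent with the description above and with the structure of the standard Xilinx HLS DCT lab; the exact lab source may differ.

```cpp
#include <cstdint>

constexpr int DW       = 16;        // data width of each element, in bits
constexpr int N        = 1024 / DW; // 64 elements per array
constexpr int DCT_SIZE = 8;         // processed as 8 x 8 blocks

using dct_data_t = int16_t;         // 16-bit signed sample type

// Shape of the top-level function submitted to Vivado HLS:
//   void dct(dct_data_t a[N], dct_data_t b[N]);   // a: input, b: output
```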

Following the premise of this article, the C-based implementation allows the designer to quickly develop and validate the algorithm's functionality. Although runtime is an important consideration, this validation places more weight on functionality than on execution time. This weighting is acceptable because the final implementation of this algorithm will be on an FPGA, where hardware acceleration, loop unrolling, and other techniques are readily available.

Figure 8: Xilinx Vivado HLS design flow.

Once the DCT code is set up as a project inside the Vivado HLS tool, the next step is to synthesize the design for FPGA implementation. It is in this step that some of the most impactful benefits of moving an algorithm's execution from an MCU to an FPGA become evident; for reference, this step is equivalent to the system management with the microcontroller milestone discussed above.

Modern FPGA tools enable a set of optimizations and enhancements that vastly increase the performance of complex algorithms. Before analyzing the results, there are some important terms to keep in mind:

  • Latency – The number of clock cycles required to execute all iterations of the loop [10]
  • Interval – The number of clock cycles before the next iteration of a loop starts processing data [11]
  • BRAM – Block Random Access Memory
  • DSP48E – Digital Signal Processing Slice for the UltraScale Architecture
  • FF – Flipflop
  • LUT – Look-up Table
  • URAM – UltraRAM (dense, single-transistor-cell on-chip memory in UltraScale+ devices)
                                   Latency           Interval
                                   min      max      min      max
Default (solution 1)               2935     2935     2935     2935
Pipelined inner loop (solution 2)  1723     1723     1723     1723
Pipelined outer loop (solution 3)  843      843      843      843
Array partition (solution 4)       477      477      477      477
Dataflow (solution 5)              476      476      343      343
Inline (solution 6)                463      463      98       98

Table 1: Results of the optimization of the execution of the FPGA algorithm (latency and interval).

                                   BRAM_18K  DSP48E  FF      LUT     URAM
Default (solution 1)               5         1       246     964     0
Pipelined inner loop (solution 2)  5         1       223     1211    0
Pipelined outer loop (solution 3)  5         8       516     1356    0
Array partition (solution 4)       3         8       862     1879    0
Dataflow (solution 5)              3         8       868     1654    0
Inline (solution 6)                3         16      1086    1462    0

Table 2: Results of the optimization of the execution of the FPGA algorithm (use of resources).

Predetermined

The default optimization settings come from the unaltered result of the translation of the C-based algorithm to synthesisable HDL. There are no optimizations enabled, and this can be used as a performance benchmark to better understand the other optimizations.

Pipelined inner loop

The PIPELINE directive instructs Vivado HLS to pipeline the inner loops, so that new data can begin to be processed while existing data is still in the pipeline. New data does not have to wait for existing operations to complete before processing begins.
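In the C++ source, the directive is typically embedded as a pragma placed inside the loop it applies to. The function below is an illustrative stand-in for the DCT's nested loops, not the lab code; an ordinary compiler ignores the pragma, so the function remains testable in software while Vivado HLS honors it during synthesis.

```cpp
#include <cstdint>

// Scale every element of an 8 x 8 block. The pragma asks HLS to pipeline
// the inner loop with an initiation interval of one, so a new element
// enters the datapath each cycle while earlier ones are still in flight.
void scale_block(int16_t buf[8][8], int16_t gain) {
    for (int r = 0; r < 8; ++r) {
        for (int c = 0; c < 8; ++c) {
#pragma HLS PIPELINE II=1
            buf[r][c] = static_cast<int16_t>(buf[r][c] * gain);
        }
    }
}
```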

Pipelined outer loop

Applying the PIPELINE directive to the outer loop pipelines the outer-loop operations as well; as a result, the inner-loop operations now occur concurrently. Both latency and interval are roughly cut in half by applying this directive to the outer loop.

Array partition

The ARRAY_PARTITION directive splits arrays into their individual elements, flattening memory accesses so that elements can be read concurrently from registers rather than through the limited ports of a block RAM. This consumes more flip-flop resources, but once again the execution time of the algorithm is nearly cut in half.
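In source form, the directive appears as a pragma naming the array to partition. The function below is illustrative, not from the lab code; partitioning `local_a` completely lets all eight elements be read in the same cycle during synthesis, while the software behavior is unchanged.

```cpp
#include <cstdint>

// Eight-element dot product. Copying the input into a completely
// partitioned local array removes the block-RAM port bottleneck,
// so the multiply-accumulate loop can be fully parallelized by HLS.
int32_t dot8(const int16_t a[8], const int16_t b[8]) {
    int16_t local_a[8];
#pragma HLS ARRAY_PARTITION variable=local_a complete
    for (int i = 0; i < 8; ++i) local_a[i] = a[i];

    int32_t acc = 0;
    for (int i = 0; i < 8; ++i) acc += local_a[i] * b[i];
    return acc;
}
```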

Dataflow

The DATAFLOW directive allows the designer to specify a target number of clock cycles between each of the input reads, letting top-level loops and functions overlap their execution. This directive is only supported at the top level of a function, so only loops and functions exposed at that level benefit from it.

Inline

The INLINE directive flattens all loops, both inner and outer. Row and column processing can now run simultaneously. The number of clock cycles required is kept to a minimum, even though this consumes more FPGA resources.

Conclusion

The coprocessor hardware architecture provides the embedded systems designer with a high-performance platform that maintains design flexibility throughout product development and release. By first validating algorithms in C or C++, processes, data and signal paths, and critical functionality can be verified in a relatively short time. By then moving the processor-intensive algorithms to the coprocessor FPGA, the designer can enjoy the benefits of hardware acceleration and a more modular design.

In case parts become obsolete or optimizations are required, the architecture itself can allow for these changes. New MCUs and FPGAs can be incorporated into the design, while the interfaces can remain relatively intact. Additionally, since both the MCU and FPGA can be upgraded in the field, user-specific changes and optimizations can be applied in the field and remotely.

Finally, this architecture combines the development speed and availability of an MCU with the performance and expandability of an FPGA. With optimizations and performance enhancements available at every step of development, the coprocessor architecture can meet the needs of the most demanding requirements for both current and future designs.

Source: https://www.digikey.com.mx/es/articles/the-co-processor-architecture-an-embedded-system-architecture-for-rapid-prototyping