The ZipCPU by Gisselquist Technology. The ZipCPU blog, featuring how-to discussions of FPGA and soft-core CPU design. This site is focused on Verilog solutions, using exclusively OpenSource IP products for FPGA design. Particular focus areas include topics often left out of more mainstream FPGA design courses, such as how to debug an FPGA design. https://zipcpu.com/ Wed, 17 Dec 2025 10:44:03 -0500

Device Clock Generation

<p>After building a <a href="/about/zipcpu.html">CPU</a>, <a href="https://github.com/ZipCPU/wb2axip">utilities for handling bus interconnects</a>, several DMAs, and memory controllers, I often find my time focused on building interfaces between designs and external peripherals. This seems to be where most of the business has landed for me. Often, these peripherals require a clock output, coming from the design, and so I’d like to spend some time describing how to generate such a “device” clock.</p> <table align="center" style="float: right; padding: 25px"><caption>Fig 1. A Basic SOC with Peripherals</caption><tr><td><img src="/img/devclk/soc.svg" width="320" /></td></tr></table> <p>There are actually two topics that need to be discussed when working with modern high speed peripheral design. One of them is <em>generating</em> the clock to be sent to the peripheral, such as Fig. 1 above illustrates. The second one involves <em>processing</em> a clock returned from the peripheral, as shown in Fig. 2 below. Such return-clock processing is a key component of high speed designs such as DDR memories, eMMC, HyperRAM, or even NAND flash protocols. This second topic is one we shall need to come back to at a later date.</p> <table align="center" style="float: left"><caption>Fig 2.
Data returned with a clock</caption><tr><td><img src="/img/devclk/bidir-clk.svg" width="320" /></td></tr></table> <p>Today, I’d like to discuss how to go about <em>generating</em> a clock to control device interaction.</p> <p>I first came across this problem when building a <a href="/blog/2019/03/27/qflexpress.html">NOR flash controller</a>, based first on a <a href="/blog/2018/08/16/spiflash.html">SPI interface</a> and later on a <a href="/blog/2019/03/27/qflexpress.html">Quad SPI interface</a>. <a href="https://github.com/ZipCPU/qspiflash">My controller</a> was designed for FPGAs, and so the clock could be built at a single, fixed frequency. This design had the added complication that the clock needed to be paused from time to time. Specifically, the clock needed to be turned off when nothing was going on. Likewise, the clock needed to be turned off for one cycle after dropping (i.e. activating) the chip select pin, and for a couple cycles after the transaction was complete but before raising (deactivating) the chip select.</p> <p>I had to deal with a similar problem when controlling a HyperRAM, but … <a href="https://github.com/ZipCPU/wbhyperram">that design</a> failed when I wasn’t (yet) prepared to handle the return clock properly. I did say this deserved an article in its own right, did I not? Processing data on a return clock properly can be a challenge.</p> <p>I then built <a href="https://www.arasan.com/product/xspi-psram-master/">a similar design for ASIC platforms</a>. Unlike the FPGA, the final clock speed wouldn’t be known until run time. It might be that the design would start at a slower clock speed, only to later speed up to the full rate. Unlike an FPGA, which can be fixed later, there’s really no room for failure in <a href="/blog/2017/10/13/fpga-v-asic.html">ASIC work</a>. At least with an FPGA, if my board didn’t support a particular frequency, I could just rebuild the design for the clock frequency it did support.
This doesn’t work, though, for an ASIC–since it tends to be cost-prohibitive to rebuild the design at a later time when you decide to connect it to a slower part than the one you designed it for.</p> <p>The next design I worked with was a <a href="https://www.arasan.com/product/onfi-4-2-controller-phy/">NAND flash design</a>. NAND flash can be a challenge, since the protocol requires you to start at a slow frequency, and only after you bring up the connection are you allowed to change to a faster frequency. <a href="https://www.arasan.com/product/onfi-4-2-controller-phy/">This particular design</a> was built for ASIC environments, and so it depended upon an analog component to generate all the clocks I needed. This worked great, up until someone wanted to purchase the design to run on an FPGA, then another, and another, and so on.</p> <table align="center" style="float: left; padding: 25px"><caption>Fig 3. Single Data Rate (SDR) vs Dual Data Rate (DDR)</caption><tr><th>SDR</th></tr><tr><td><img src="/img/devclk/sdr.svg" width="320" /></td></tr><tr><th>DDR</th></tr><tr><td><img src="/img/devclk/ddr.svg" width="320" /></td></tr></table> <p>Just to add another twist to the problem, many protocols require data transitions on both edges of the clock, a protocol often known as “Dual Data Rate” (DDR). Unlike the other designs above, these often require a clock that is offset 90 degrees from the data–so that each clock transition takes place in the middle of each data valid window, rather than on the edges of the window. This sort of “offset” clock is necessary to guarantee setup and hold times within the slave peripheral. An example of the clock and data relationship required by DDR, as opposed to a traditional “single data rate” (SDR) clock, is shown in Fig.
3.</p> <p>By the time I got to my <a href="https://github.com/ZipCPU/sdspi">SDIO/eMMC controller</a>, I think I finally had the clock division problem handled. An <a href="https://github.com/ZipCPU/sdspi">SDIO controller</a> needs to bring up the SD card at 400kHz, and then, depending upon the card, the PCB, and the controller, the speed may be raised to 25MHz, 50MHz, 100MHz, or even 200MHz. The clock may also be stopped whenever there’s nothing to send or receive, or when the SOC can’t load or unload the data to the controller. For example, you might ask an SD card to read and thus produce many blocks of data, then read the first two of these blocks into your internal buffers only to find that the CPU is slow in draining those buffers. In that case, you would need to stop the interface clock before the external card tries to send you a third block of data that would have nowhere to go.</p> <p>Other devices require user programmable device clock controllers, such as:</p> <ul> <li> <p><a href="https://github.com/ZipCPU/videozip/tree/master/rtl/ethernet">10M/100M/1Gb Ethernet controllers</a></p> <p>While each of these speeds might use a single clock, building a truly trimode controller requires some extra work.</p> </li> <li> <p><a href="/zipcpu2025/05/28/memtest.html">(DDR) SDRAM controllers</a></p> <p>SDRAM controllers from an FPGA standpoint tend to be simple: just produce a clock. However, you can turn the clock off for better power performance. Yes, there are rules … but we won’t get into those here today.</p> </li> <li> <p>I2S</p> <p><a href="/blog/2019/06/28/genclk.html">We discussed generating an I2S clock at a totally arbitrary frequency</a> some time ago.</p> </li> <li> <p><a href="/blog/2021/11/15/ultimate-i2c.html">I2C</a></p> <p>In general, I2C is too slow to be the focus of this article. There is an I3C protocol that is built on top of I2C.
The techniques we discuss today might work well for I3C masters, but I’m not nearly as familiar with those.</p> </li> <li> <p><a href="https://github.com/ZipCPU/wbspi">SPI – not just NOR flash</a></p> <p>While SPI <em>slaves</em> have a device clock as well, handling these clocks is fundamentally different from what I’m describing today. My focus today will be on <em>generating</em> clock signals for the purpose of controlling external devices–such as an SPI master might need to do.</p> </li> </ul> <p>Specifically, today I want to look at generating a clock with one or more of the following characteristics:</p> <ul> <li> <p><strong>Output Signal:</strong> We’re talking about interface clocks–those generated by the “master” of the interface. These are <em>digital</em> signals, output from an FPGA (or ASIC) device.</p> <p>The output may be accomplished via a component like an <a href="/blog/2020/08/22/oddr.html">ODDR</a> or an OSERDES, with or without an additional analog delay following.</p> </li> <li> <p><strong>Discontinuous:</strong> The clock may be discontinuous. Many protocols (<a href="/blog/2019/03/27/qflexpress.html">flash</a>, <a href="https://github.com/ZipCPU/sdspi">SDIO/eMMC</a>, etc.) allow, or even require, the clock to be stopped, or otherwise only toggled when there’s something to send or receive. As mentioned above, stopping the clock may also be useful for pausing a transmission in progress before a source buffer runs dry, or an incoming buffer overflows.</p> </li> <li> <p><strong>Dynamic Frequency:</strong> Often, the outgoing clock needs to change frequency during operation as part of the protocol. For example, the SDIO protocol needs to start at 400kHz, and then increase to 25MHz (or more).
Therefore, a good clock generator will need to be able to generate multiple clock frequencies naturally, as the protocol requires.</p> </li> <li> <p><strong>Minimum pulse width:</strong> Switching between frequencies must obey one hard rule: clock glitches are never allowed. Too-short clock pulses cannot be permitted; the clock’s high and low durations must always be at least a half period of the fastest allowable clock.</p> </li> <li> <p><strong>90 Degree Offset for DDR Signaling:</strong> As shown in Fig 3, many modern protocols require both positive and negative edge signaling (DDR). This halves the required clock frequency, reducing the bandwidth that must be carried over the PCB for the same data rate. However, the clock signal required to support such DDR signaling often needs to be delayed 90 degrees from the data, so that it transitions in the middle of the data valid period.</p> </li> <li> <p><strong>Faster than the controller’s clock:</strong> Just to make matters worse, in <a href="https://github.com/ZipCPU/sdspi">my eMMC design</a>, I needed to generate a 200MHz DDR device clock from a 100MHz system clock.</p> </li> </ul> <p>All this is to say that our goal today will be to create a divided clock using digital, rather than analog, logic. (Yes, I can hear my analog engineering friends jump in here with the comment that “Everything is analog!” God bless you, my friends.)</p> <h2 id="the-problem">The Problem</h2> <p>The first approach I often see to this problem is the straightforward integer clock division approach.
Generally, it looks something like the following:</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"><span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">src_clk</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="n">reset</span><span class="p">)</span> <span class="n">counter</span> <span class="o">&lt;=</span> <span class="mi">0</span><span class="p">;</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">active_clock</span><span class="p">)</span> <span class="n">counter</span> <span class="o">&lt;=</span> <span class="mi">0</span><span class="p">;</span> <span class="k">else</span> <span class="c1">// if (active_clock)</span> <span class="n">counter</span> <span class="o">&lt;=</span> <span class="n">counter</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span> <span class="k">assign</span> <span class="n">dev_clk</span> <span class="o">=</span> <span class="p">(</span><span class="n">high_speed</span><span class="p">)</span> <span class="o">?</span> <span class="p">(</span><span class="n">src_clk</span> <span class="o">&amp;&amp;</span> <span class="n">active_clock</span><span class="p">)</span> <span class="o">:</span> <span class="n">counter</span><span class="p">[</span><span class="n">user_selected_bit</span><span class="p">];</span></code></pre></figure> <p>In this case, <code class="language-plaintext highlighter-rouge">active_clock</code> controls whether or not the clock is stepping, and <code class="language-plaintext highlighter-rouge">user_selected_bit</code> selects the level of clock division we are interested in.
As for the <code class="language-plaintext highlighter-rouge">src_clk</code>, that can be either the system clock or whatever clock is required to generate the fastest clock frequency the protocol requires.</p> <p>Note that we’ve done nothing to guarantee this clock won’t glitch between speed selections, nor have we guaranteed any minimum pulse width when switching between clock rates. We’ll come back to these requirements later, albeit with a different (better) implementation.</p> <p>The user logic required to use this clock looks very simple at first:</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"><span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">dev_clk</span> <span class="kt">or</span> <span class="kt">posedge</span> <span class="n">reset</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="n">reset</span><span class="p">)</span> <span class="k">begin</span> <span class="c1">// Reset logic</span> <span class="k">end</span> <span class="k">else</span> <span class="k">begin</span> <span class="n">pedge_data</span> <span class="o">&lt;=</span> <span class="c1">// Logic controlling any flops based on the dev_clk</span> <span class="k">end</span></code></pre></figure> <p>When a protocol requires data on both edges of the clock, getting the data right for the second edge of the clock is also important. But, how shall we output data on the negative edge of a clock we’ve just created out of thin air?
We’ll need to transition on the negative edge to do this.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"><span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">negedge</span> <span class="n">dev_clk</span> <span class="kt">or</span> <span class="kt">posedge</span> <span class="n">reset</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="n">reset</span><span class="p">)</span> <span class="k">begin</span> <span class="c1">// Reset logic</span> <span class="k">end</span> <span class="k">else</span> <span class="k">begin</span> <span class="n">nedge_data</span> <span class="o">&lt;=</span> <span class="c1">// Logic controlling the negative clock's data</span> <span class="k">end</span> <span class="k">assign</span> <span class="n">output_data</span> <span class="o">=</span> <span class="p">(</span><span class="n">dev_clk</span> <span class="o">||</span> <span class="o">!</span><span class="n">ddr_mode</span><span class="p">)</span> <span class="o">?</span> <span class="n">pedge_data</span> <span class="o">:</span> <span class="n">nedge_data</span><span class="p">;</span></code></pre></figure> <p>This approach leaves us with two problems. The first is that we’re using our clock as a logic signal when we assign <code class="language-plaintext highlighter-rouge">dev_clk</code> to possibly be the same as our source clock. The second problem is that we are transitioning user logic on this clock. Worse, though, we’re now transitioning our user logic on both edges of the clock. This violates <a href="/blog/2017/08/21/rules-for-newbies.html"><em>the rules</em></a> of good digital logic design.</p> <p>These aren’t necessarily issues when building ASIC designs. However, in FPGA design, this clock will need to get onto the clocking network’s backbone somehow, and that’s not automatic.
Worse, this new clock is <em>not</em> the same as the original <code class="language-plaintext highlighter-rouge">src_clk</code>–even when they are at the same frequency. There will always be a delay between the two clocks–a delay that may not be captured by pre-synthesis simulation, and so it can be a dangerous delay the engineer isn’t expecting when building this logic.</p> <p>This leads to two commercial ASIC design challenges. First, when designing an ASIC IP, you want to be able to test as much of the IP on an FPGA as possible. Non FPGA compatible logic needs to be moved to the periphery of the design and carefully controlled. Second, from a business point of view, it helps to be able to sell the ASIC design to FPGA customers in addition to ASIC customers. So, even though you <em>can</em> do something like this on an ASIC, that doesn’t mean you <em>should</em>.</p> <p>There are other problems.</p> <ul> <li> <p><a href="/blog/2017/10/20/cdc.html">Clock domain crossings (CDCs)</a></p> <p>Since the <code class="language-plaintext highlighter-rouge">src_clk</code> and <code class="language-plaintext highlighter-rouge">dev_clk</code> are now two separate and distinct clock domains, you’ll need to properly manage every <a href="/blog/2017/10/20/cdc.html">clock domain crossing</a> between these two clock domains. This can create additional delays through what otherwise might be high speed logic.</p> <p>Likewise, the positive and negative edges of the same clock are also (technically) separate clock domains. Moving between them is “possible, but not recommended.”</p> </li> <li> <p>Gating</p> <p>You may have noticed we haven’t properly gated our clock above. Sure, we used an <code class="language-plaintext highlighter-rouge">active_clock</code> signal to provide gating, but this signal does not guarantee the maximum frequency of the output clock. 
This, however, is a minor problem that most engineers reading this blog could easily fix with a little bit of additional logic.</p> </li> </ul> <p>Two problems in particular, though, become deal breakers when it comes to this type of design. The first is that DDR interfaces often require a clock delayed by 90 degrees from the data, as shown in Fig. 3 above. The simple approach will not generate such a 90 degree delay. While one might use an analog delay element, such as a Xilinx ODELAY element, to delay the clock signal by an appropriate amount, this will only work for high speed clocks and not for clocks less than 50MHz or so. The second problem is, what do you do when you need a device clock that’s faster than your <code class="language-plaintext highlighter-rouge">src_clk</code>, as I did in my <a href="https://github.com/ZipCPU/sdspi">SDIO/eMMC controller</a> design?</p> <p>As a result, we really need another approach.</p> <h2 id="the-solution">The Solution</h2> <p>The basic solution is to return to <a href="/blog/2017/08/21/rules-for-newbies.html">the rules</a>, and so avoid transitions on the device clock edge entirely. Instead, we’ll continue to transition on our source clock and then use either an <a href="/blog/2020/08/22/oddr.html">ODDR</a> or an OSERDES to generate the final outgoing clock. In the meantime, we’ll treat the newly generated device clock as a traditional logic signal–rather than a “clock” within our design. That is, we’ll let it be and remain <em>logic</em>.</p> <p>Let’s start by looking at Fig. 3 above, and dividing the clock period into sections, as shown in Fig. 4 below.</p> <table align="center" style="float: none"><caption>Fig 4. Dividing the clock period</caption><tr><td><img src="/img/devclk/ddrbyfour.svg" width="480" /></td></tr></table> <p>Nominally, we’d want at least two sections per clock–one for each piece of data in a DDR transmission.
Sadly, this isn’t enough, since the clock might need to be offset by 90 degrees. Hence, we’ll need to break each clock period into four logically distinct time periods. We can label these time periods 3:0, with the left-most (most significant) period being 3 and the right-most (least significant) being 0.</p> <p>From here, we can generate what I’m going to call a <em>wide</em> clock, four bits at a time. This wide clock will then be output via a 4:1 OSERDES–if it is to keep pace with the source clock within our design. At its fastest speed, this clock will be either <code class="language-plaintext highlighter-rouge">0011</code> (where the MSB ‘0’ is transmitted “first”), or <code class="language-plaintext highlighter-rouge">0110</code> if a 90 degree offset clock is required for DDR transmissions (as shown in Fig. 4). At its next slowest speed, the clock would be <code class="language-plaintext highlighter-rouge">0000</code> followed by <code class="language-plaintext highlighter-rouge">1111</code>, or <code class="language-plaintext highlighter-rouge">0011</code> followed by <code class="language-plaintext highlighter-rouge">1100</code>. Further clock divisions will use wide clocks of <code class="language-plaintext highlighter-rouge">0000</code> or <code class="language-plaintext highlighter-rouge">1111</code>.</p> <p>If you wish to use an <a href="/blog/2020/08/22/oddr.html">ODDR</a> instead of a 4:1 OSERDES, you can still use this approach, save that you would be generating two wide clock bits at a time instead of four. The fastest clock would be a repeating <code class="language-plaintext highlighter-rouge">01</code>, but this fastest clock would be unable to handle the 90 degree offsets of a DDR signal.
The next fastest would be either <code class="language-plaintext highlighter-rouge">00</code> followed by <code class="language-plaintext highlighter-rouge">11</code>, or the 90 degree offset version of the same at <code class="language-plaintext highlighter-rouge">01</code> followed by <code class="language-plaintext highlighter-rouge">10</code>.</p> <p>If you want a clock running at twice your system frequency, you could use an eight-bit wide clock signal, designed to feed an 8:1 SERDES. Your fastest clock would become <code class="language-plaintext highlighter-rouge">00110011</code> (non-DDR) or <code class="language-plaintext highlighter-rouge">01100110</code> when working with DDR signals.</p> <p>That’s the first step–the wide clock.</p> <p>The second step is to generate, together with the wide clock signal, two other signals. The first signal, let’s call it <code class="language-plaintext highlighter-rouge">new_edge</code>, will indicate that a new clock cycle is beginning. The second, which I shall call <code class="language-plaintext highlighter-rouge">half_edge</code>, will indicate that the second half of a clock cycle is beginning. Both of these signals are also shown in Fig. 4 above, each indicating the portion of the clock cycle it represents.</p> <p>All three of these <em>logic</em> signals can now be generated by a “clock generator” module.</p> <p>If necessary, this clock can be stopped either at the clock generator, or gated further down the signal pipeline by simply zeroing out the wide clock.</p> <p>Let’s pause for a moment to illustrate what a “clock” like this might look like.</p> <p>We’ll start with the highest speed clock, running at the source clock rate. This clock will have a wide clock of <code class="language-plaintext highlighter-rouge">0011</code>, and new data on every clock edge.</p> <table align="center" style="float: right; padding: 25px"><caption>Fig 5.
Highest speed SDR</caption><tr><td><a href="/img/devclk/h3.svg"><img src="/img/devclk/h3.svg" width="480" /></a></td></tr></table> <p>Fig. 5 shows all of these key signals. First, you can see the system clock, which we called <code class="language-plaintext highlighter-rouge">src_clk</code> above, from which everything is generated. Next, you can see the IO clock we create, followed by the <code class="language-plaintext highlighter-rouge">wide_clock</code> used to create it. This is followed by the <code class="language-plaintext highlighter-rouge">new_edge</code> control signal. This clock might be the clock we would use for a data signal transitioning once per clock (SDR). To illustrate, I’ve also drawn what a couple of periods of this data signal might look like.</p> <p>Were this interface to run in DDR mode, sending one word of data on each edge of the clock, then the <code class="language-plaintext highlighter-rouge">wide_clock</code> would need to be (repeatedly) set to <code class="language-plaintext highlighter-rouge">0110</code>, as shown in Fig. 6 below.</p> <table align="center" style="float: left"><caption>Fig 6. Highest speed DDR</caption><tr><td><a href="/img/devclk/h6.svg"><img src="/img/devclk/h6.svg" width="480" /></a></td></tr></table> <p>There are a couple of key differences between Fig. 6 and Fig. 5 above. The first, and perhaps most obvious, is that the data in Fig. 6 are output at two words per system clock cycle. This is often desirable, in that twice the data rate may now be achieved. The second difference is that the IO clock is now offset 90 degrees from the data, instead of 180 degrees. This is often necessary to guarantee that there is a clock transition in the middle of the data valid period.
To make this happen, the <code class="language-plaintext highlighter-rouge">wide_clock</code> is now set to <code class="language-plaintext highlighter-rouge">0110</code> in each clock period.</p> <p>Using these clock signals, we can also pause the clock–as shown in Fig. 7 below.</p> <table align="center" style="float: right; padding: 25px"><caption>Fig 7. Pausing the clock</caption><tr><td><a href="/img/devclk/h6-pause.svg"><img src="/img/devclk/h6-pause.svg" width="480" /></a></td></tr></table> <p>Note that the key signals, such as <code class="language-plaintext highlighter-rouge">new_edge</code> and <code class="language-plaintext highlighter-rouge">half_edge</code>, must also stop when the clock pauses (stops). Because there is no clock signal, the data output signals become don’t-cares. (For power reasons, I could see holding the output at its previous value for short periods of time, <code class="language-plaintext highlighter-rouge">D2</code> in this case, but that’s another discussion.)</p> <p>This same signaling approach also works when dividing the clock speed by two. Fig. 8 shows an example SDR signal with a clock speed set to half the system clock speed.</p> <table align="center" style="float: left"><caption>Fig 8. SDR at half the system clock rate</caption><tr><td><a href="/img/devclk/h0f.svg"><img src="/img/devclk/h0f.svg" width="480" /></a></td></tr></table> <p>Fig. 9 shows the same thing, but this time for a DDR signal with the clock at half the system clock speed.</p> <table align="center" style="float: right; padding: 25px"><caption>Fig 9.
DDR at half the system clock rate</caption><tr><td><a href="/img/devclk/h3c.svg"><img src="/img/devclk/h3c.svg" width="480" /></a></td></tr></table> <p>Before leaving this example, note how easy it was to change frequencies in this representation: we just adjusted the <code class="language-plaintext highlighter-rouge">wide_clock</code>, and the <code class="language-plaintext highlighter-rouge">new_edge</code> and <code class="language-plaintext highlighter-rouge">half_edge</code> positions changed to match.</p> <p>We can drop the clock frequency again to a quarter of the system clock speed, as shown in Fig. 10.</p> <table align="center" style="float: left"><caption>Fig 10. SDR at a quarter of the system clock rate</caption><tr><td><a href="/img/devclk/h00ff.svg"><img src="/img/devclk/h00ff.svg" width="480" /></a></td></tr></table> <p>We can also offset this clock by 90 degrees, as shown in Fig. 11.</p> <table align="center" style="float: right; padding: 25px"><caption>Fig 11. DDR at a quarter of the system clock rate</caption><tr><td><a href="/img/devclk/h0ff0.svg"><img src="/img/devclk/h0ff0.svg" width="480" /></a></td></tr></table> <p>When using this type of “wide” clock, user logic becomes simpler as well. This “simplified” user logic is easily illustrated with an example. For this example, let’s suppose we wished to control 8 data wires using this type of divided clock signaling. Let’s also assume, for the purposes of this illustration, that the source data arrive via an AXI stream interface with signals <code class="language-plaintext highlighter-rouge">S_VALID</code> and <code class="language-plaintext highlighter-rouge">S_DATA[15:0]</code>, and a ready signal given by <code class="language-plaintext highlighter-rouge">S_READY</code>.</p> <p>We’ll start with the <code class="language-plaintext highlighter-rouge">wide_clock</code>, <code class="language-plaintext highlighter-rouge">new_edge</code>, and <code class="language-plaintext highlighter-rouge">half_edge</code> signals from the clock generator.
Note that, as we propagate these signals through our pipeline (below), we won’t send the <code class="language-plaintext highlighter-rouge">wide_clock</code> straight to the output pad, but instead we’ll use it alongside our data processing pipeline. This way, if the pipeline must stall (and it might need to), the pipeline can also stall the outgoing clock at the same time.</p> <p>Hence, we’ll create a one clock delayed version of this <code class="language-plaintext highlighter-rouge">wide_clock</code> that we can call <code class="language-plaintext highlighter-rouge">outgoing_clock</code>. Further, a second signal, <code class="language-plaintext highlighter-rouge">active_clock</code>, can be used to keep track of whether or not we’ve committed to the current clock cycle.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"><span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">src_clk</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="n">i_reset</span><span class="p">)</span> <span class="k">begin</span> <span class="n">outgoing_clock</span> <span class="o">&lt;=</span> <span class="mh">4'h0</span><span class="p">;</span> <span class="n">active_clock</span> <span class="o">&lt;=</span> <span class="mb">1'b0</span><span class="p">;</span> <span class="k">end</span> <span class="k">else</span> <span class="k">if</span> <span class="p">((</span><span class="n">S_VALID</span> <span class="o">&amp;&amp;</span> <span class="n">S_READY</span><span class="p">)</span> <span class="o">||</span> <span class="p">(</span><span class="n">new_edge</span> <span class="o">&amp;&amp;</span> <span class="n">second_edge</span><span class="p">))</span> <span class="k">begin</span> <span class="c1">// We commit to this clock if either</span> <span class="c1">// 1.
We have new data and we are ready to consume this new data, *OR*</span> <span class="c1">// 2. We're in SDR (not DDR) mode, and we've already committed</span> <span class="c1">// to a byte of data that we haven't (yet) sent.</span> <span class="c1">// In both cases, we need to start a clock period.</span> <span class="c1">//</span> <span class="c1">// Note that S_READY implies new_edge</span> <span class="c1">//</span> <span class="n">outgoing_clock</span> <span class="o">&lt;=</span> <span class="n">wide_clock</span><span class="p">;</span> <span class="c1">// The "active_clock" signal is used to let us know that we've committed</span> <span class="c1">// to this clock cycle. From now until the next new_edge, we must</span> <span class="c1">// forward the wide_clock signal to the output.</span> <span class="n">active_clock</span> <span class="o">&lt;=</span> <span class="mi">1</span><span class="p">;</span> <span class="k">end</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">new_edge</span><span class="p">)</span> <span class="k">begin</span> <span class="c1">// The clock generator is creating an edge that ... we're not prepared</span> <span class="c1">// for or ready to handle. 
There's just no data available, so ...</span> <span class="c1">// let's stop the clock.</span> <span class="n">outgoing_clock</span> <span class="o">&lt;=</span> <span class="mh">4'h0</span><span class="p">;</span> <span class="c1">// In this case, we're not forwarding the clock, nor will we until</span> <span class="c1">// the next clock period.</span> <span class="n">active_clock</span> <span class="o">&lt;=</span> <span class="mb">1'b0</span><span class="p">;</span> <span class="k">end</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">active_clock</span><span class="p">)</span> <span class="c1">// If we've already committed to this clock cycle, then we'll need to</span> <span class="c1">// continue it to its completion.</span> <span class="n">outgoing_clock</span> <span class="o">&lt;=</span> <span class="n">wide_clock</span><span class="p">;</span></code></pre></figure> <p>Before we can get to the data, we need another key signal as well. This is the <code class="language-plaintext highlighter-rouge">second_edge</code> signal that we used above. Here’s why: our data is going to arrive, 16b at a time via AXI stream. If we are in DDR mode, then we’ll consume 8b on each edge of this clock–and possibly all 16b at once. However, if we are only in SDR mode, then we’ll need to consume the second 8b on the next clock edge.
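</p>

<p>As an aside, the commitment rules above lend themselves to a couple of simple formal properties. What follows is only a sketch, reusing the signal names from the block above, but it captures the two guarantees we care about: once committed, the outgoing clock must run the cycle to completion, and an uncommitted clock must stay low.</p>

<figure class="highlight"><pre><code class="language-verilog" data-lang="verilog">`ifdef	FORMAL
	// Once committed, the outgoing clock must track wide_clock
	// until the next new_edge gives us a chance to stall.
	always @(posedge src_clk)
	if (!i_reset &amp;&amp; !$past(i_reset)
			&amp;&amp; $past(active_clock &amp;&amp; !new_edge))
		assert(outgoing_clock == $past(wide_clock));

	// If we haven't committed to a clock cycle, the outgoing
	// clock must be held low.
	always @(*)
	if (!active_clock)
		assert(outgoing_clock == 4'h0);
`endif</code></pre></figure>

<p>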
Hence, we’re going to need a signal that I’m calling, <code class="language-plaintext highlighter-rouge">second_edge</code>, to tell us that we have 8b remaining of the 16b committed to us that didn’t get sent on the last clock tick.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"><span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">src_clk</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="n">reset</span> <span class="o">&amp;&amp;</span> <span class="n">i_care_about_resets</span><span class="p">)</span> <span class="n">second_edge</span> <span class="o">&lt;=</span> <span class="mi">0</span><span class="p">;</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">S_VALID</span> <span class="o">&amp;&amp;</span> <span class="n">S_READY</span><span class="p">)</span> <span class="c1">// In SDR, we just accepted 16b and output 8b.</span> <span class="c1">// We need another new_edge to send the remaining 8b.</span> <span class="c1">// Note that S_READY implies new_edge</span> <span class="c1">//</span> <span class="c1">// Also note that we only use this signal in SDR modes</span> <span class="n">second_edge</span> <span class="o">&lt;=</span> <span class="o">!</span><span class="n">ddrmode</span><span class="p">;</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">new_edge</span><span class="p">)</span> <span class="c1">// On any (other) new_edge, we can clear this signal</span> <span class="n">second_edge</span> <span class="o">&lt;=</span> <span class="mi">0</span><span class="p">;</span></code></pre></figure> <p>That leads us to the <code class="language-plaintext highlighter-rouge">outgoing_data</code>. 
This is a 16 bit data signal, consisting of 8b, <code class="language-plaintext highlighter-rouge">outgoing_data[15:8]</code>, which will be output on the first half of the clock, and another 8b, <code class="language-plaintext highlighter-rouge">outgoing_data[7:0]</code>, which will be output on the second half of the clock. A third signal, <code class="language-plaintext highlighter-rouge">next_byte</code>, will be used for keeping track of the second byte of data in the case where we don’t output both bytes in the same clock period.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"><span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">src_clk</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="n">reset</span> <span class="o">&amp;&amp;</span> <span class="n">i_care_about_resets</span><span class="p">)</span> <span class="k">begin</span> <span class="n">outgoing_data</span> <span class="o">&lt;=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">next_byte</span> <span class="o">&lt;=</span> <span class="mi">0</span><span class="p">;</span> <span class="k">end</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">S_VALID</span> <span class="o">&amp;&amp;</span> <span class="n">S_READY</span><span class="p">)</span> <span class="k">begin</span> <span class="c1">// new_edge is implied by S_READY</span> <span class="k">if</span> <span class="p">(</span><span class="n">ddrmode</span> <span class="o">&amp;&amp;</span> <span class="n">half_edge</span><span class="p">)</span> <span class="k">begin</span> <span class="c1">// Set data for both halves of the clock</span> <span class="c1">// The first half in the MSBs</span> <span class="n">outgoing_data</span><span class="p">[</span><span class="mi">15</span><span class="o">:</span><span 
class="mi">8</span><span class="p">]</span> <span class="o">&lt;=</span> <span class="n">S_DATA</span><span class="p">[</span><span class="mi">15</span><span class="o">:</span> <span class="mi">8</span><span class="p">];</span> <span class="c1">// The second half in the LSBs</span> <span class="n">outgoing_data</span><span class="p">[</span> <span class="mi">7</span><span class="o">:</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;=</span> <span class="n">S_DATA</span><span class="p">[</span> <span class="mi">7</span><span class="o">:</span> <span class="mi">0</span><span class="p">];</span> <span class="k">end</span> <span class="k">else</span> <span class="k">begin</span> <span class="c1">// Set only the first half of the data, but set it to be</span> <span class="c1">// output twice. We'll need to come back later for the second</span> <span class="c1">// outgoing byte.</span> <span class="n">outgoing_data</span> <span class="o">&lt;=</span> <span class="o">{</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="o">{</span><span class="n">S_DATA</span><span class="p">[</span><span class="mi">15</span><span class="o">:</span><span class="mi">8</span><span class="p">]</span><span class="o">}}</span><span class="p">;</span> <span class="k">end</span> <span class="c1">// Keep track of that second byte, so we can come back to it later.</span> <span class="n">next_byte</span> <span class="o">&lt;=</span> <span class="n">S_DATA</span><span class="p">[</span><span class="mi">7</span><span class="o">:</span><span class="mi">0</span><span class="p">];</span> <span class="k">end</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">new_edge</span> <span class="o">||</span><span class="p">(</span><span class="n">ddrmode</span> <span class="o">&amp;&amp;</span> <span class="n">half_edge</span><span class="p">))</span> <span class="k">begin</span> <span
class="n">outgoing_data</span> <span class="o">&lt;=</span> <span class="o">{</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="o">{</span><span class="n">next_byte</span><span class="o">}}</span><span class="p">;</span> <span class="k">end</span></code></pre></figure> <p>The final signal we need to define is the <code class="language-plaintext highlighter-rouge">S_READY</code> signal. In this example, we can accept new data on any new clock edge, <em>unless</em> we have 8b remaining from the last clock edge that have yet to be output.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"><span class="k">assign</span> <span class="n">S_READY</span> <span class="o">=</span> <span class="n">new_edge</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">second_edge</span><span class="p">;</span></code></pre></figure> <p>This approach provides us with a couple big advantages to our user logic over what we had before.</p> <p>First and foremost, <a href="/blog/2017/08/21/rules-for-newbies.html">all of our user logic now takes place on the same <code class="language-plaintext highlighter-rouge">src_clk</code></a>. We didn’t need any <a href="/blog/2017/10/20/cdc.html">CDCs</a>. AXI slave data, generated externally on this <code class="language-plaintext highlighter-rouge">src_clk</code> can now be used within our design on the same clock it was generated on.</p> <p>Second, did you notice how we were able to <a href="/blog/2021/10/26/clk-gate.html">simply gate the clock</a> when there was no data available? If not, go back up and look again at the <code class="language-plaintext highlighter-rouge">active_clock</code> signal.</p> <p>Third, unlike the previous approach, we’ve now guaranteed that this clock signal won’t glitch. That is, assuming the outgoing OSERDES won’t generate glitches from our glitchless data signals. 
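</p>

<p>That stability claim can also be written down directly. As a sketch (same signal names as above, and the same caveat about trusting the OSERDES), the outgoing data may only ever change on a clock period or, in DDR mode, half-period boundary:</p>

<figure class="highlight"><pre><code class="language-verilog" data-lang="verilog">`ifdef	FORMAL
	// outgoing_data may only change coincident with new_edge, or
	// with half_edge when in DDR mode. At all other times, the
	// OSERDES input must hold still.
	always @(posedge src_clk)
	if (!i_reset &amp;&amp; !$past(i_reset)
			&amp;&amp; !$past(new_edge)
			&amp;&amp; !$past(ddrmode &amp;&amp; half_edge))
		assert($stable(outgoing_data));
`endif</code></pre></figure>

<p>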
The previous clock generator, on the other hand, could well have had glitches between the clock and the data enabling it.</p> <p>Also look at how easy it was to do pipelined processing. The clock was generated prior to our pipeline, and simply propagated through the pipeline. Although this pipeline only contains a single clock cycle, we could’ve easily extended the pipeline for multiple clock cycles if necessary by simply passing the <code class="language-plaintext highlighter-rouge">wide_clock</code>, <code class="language-plaintext highlighter-rouge">new_edge</code>, and <code class="language-plaintext highlighter-rouge">half_edge</code> signals through the pipeline–adjusting them if and where necessary along the way.</p> <p>As a result of this approach, all IO pins can now be driven using a 4:1 OSERDES. (You could also use <a href="/blog/2020/08/22/oddr.html">ODDR</a>s for the data, if you trusted them to have the same timing relationship as the OSERDES.)</p> <p>What about frequency changes, or adjusting between the unshifted clock and the clock shifted by 90 degrees? What about when the clock is off, and needs to be turned on? All of these challenges and more now reside within the clock generator.</p> <h2 id="the-clock-generator">The Clock Generator</h2> <p>For discussion purposes, let’s take a look at the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">clock generator</a> I used for <a href="https://github.com/ZipCPU/sdspi">my SDIO/eMMC controller</a>. As mentioned above, this <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">clock generator</a> has the particular requirement of being able to generate two outgoing clock periods per system clock cycle, but otherwise it’s a fairly straightforward example of the discussion above.</p> <p>From a configuration standpoint, there are a couple of options.
For example, I wasn’t certain that I’d always have an 8:1 SERDES available to me, nor do all digital environments necessarily offer 2:1 <a href="/blog/2020/08/22/oddr.html">ODDR</a> components. Therefore, we allow those to be adjusted. Second, I want to know the maximum number of bits required in my clock divider.</p> <p>Still, these configuration parameters are fairly straightforward.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"><span class="k">module</span> <span class="n">sdckgen</span> <span class="p">#(</span> <span class="c1">// OPT_SERDES is required for generating an 8:1 output.</span> <span class="k">parameter</span> <span class="p">[</span><span class="mi">0</span><span class="o">:</span><span class="mi">0</span><span class="p">]</span> <span class="n">OPT_SERDES</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="c1">// If no 8:1 SERDES are available, we can still create a clock</span> <span class="c1">// using a 2:1 ODDR via OPT_DDR</span> <span class="k">parameter</span> <span class="p">[</span><span class="mi">0</span><span class="o">:</span><span class="mi">0</span><span class="p">]</span> <span class="n">OPT_DDR</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="c1">// To hit 100kHz from a 100MHz system clock, we'll need to</span> <span class="c1">// divide our 100MHz clock by 4, and then by another 250.</span> <span class="c1">// Hence, we'll need Lg(256)-2 bits. (The first three speed</span> <span class="c1">// options are special)</span> <span class="k">localparam</span> <span class="n">LGMAXDIV</span> <span class="o">=</span> <span class="mi">8</span> <span class="p">)</span> <span class="p">(</span></code></pre></figure> <p>The <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">clock generator</a> is primarily controlled via three signals. 
The first tells us whether we want our clock offset by 90 degrees for DDR outputs or not. The second controls the speed of the outgoing clock. The final signal tells us we can shut the clock down.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="kt">input</span> <span class="kt">wire</span> <span class="n">i_cfg_clk90</span><span class="p">,</span> <span class="kt">input</span> <span class="kt">wire</span> <span class="p">[</span><span class="n">LGMAXDIV</span><span class="o">-</span><span class="mi">1</span><span class="o">:</span><span class="mi">0</span><span class="p">]</span> <span class="n">i_cfg_ckspd</span><span class="p">,</span> <span class="kt">input</span> <span class="kt">wire</span> <span class="n">i_cfg_shutdown</span><span class="p">,</span></code></pre></figure> <p>When shut down, the wide clock output will be fixed at zero, as will both the <code class="language-plaintext highlighter-rouge">new_edge</code> and <code class="language-plaintext highlighter-rouge">half_edge</code> control signals.</p> <p>The shutdown signal is actually really useful at slow clock speeds. Sure, you could shut the clock down, as we did above, by just not forwarding it through the pipeline. On the other hand, once the clock has been shut down, you’d like to be able to restart it on a dime. The shutdown control signal to our <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">clock generator</a> allows us to do that. Once set, the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">clock generator</a> takes the remainder of a clock cycle to shut down, and then stays ready to restart the clock at a moment’s notice.</p> <p>The outputs from this module are just about what you would expect. You have the three signals we’ve already discussed.
In this case, <code class="language-plaintext highlighter-rouge">o_ckstb</code> is the <code class="language-plaintext highlighter-rouge">new_edge</code> signal we’ve mentioned, <code class="language-plaintext highlighter-rouge">o_hlfck</code> is the <code class="language-plaintext highlighter-rouge">half_edge</code> signal, and <code class="language-plaintext highlighter-rouge">o_ckwide</code> is the <code class="language-plaintext highlighter-rouge">wide_clock</code> signal.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="c1">//</span> <span class="kt">output</span> <span class="kt">reg</span> <span class="n">o_ckstb</span><span class="p">,</span> <span class="c1">// new_edge</span> <span class="kt">output</span> <span class="kt">reg</span> <span class="n">o_hlfck</span><span class="p">,</span> <span class="c1">// half_edge</span> <span class="kt">output</span> <span class="kt">reg</span> <span class="p">[</span><span class="mi">7</span><span class="o">:</span><span class="mi">0</span><span class="p">]</span> <span class="n">o_ckwide</span><span class="p">,</span> <span class="c1">// wide_clock</span> <span class="kt">output</span> <span class="kt">wire</span> <span class="n">o_clk90</span><span class="p">,</span> <span class="kt">output</span> <span class="kt">reg</span> <span class="p">[</span><span class="n">LGMAXDIV</span><span class="o">-</span><span class="mi">1</span><span class="o">:</span><span class="mi">0</span><span class="p">]</span> <span class="n">o_ckspd</span> <span class="p">);</span></code></pre></figure> <p>The two new signals are <code class="language-plaintext highlighter-rouge">o_clk90</code> and <code class="language-plaintext highlighter-rouge">o_ckspd</code>.
These are feedback signals returned to the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdaxil.v">control module</a>, used to tell us when any frequency shift or phase shift operations are complete.</p> <p>These feedback signals solve an issue I was having in my <a href="https://github.com/ZipCPU/sdspi">eMMC controller</a>, where the clock would be at some crazy low frequency (100kHz or so), and I’d want to speed it up. Just setting the new clock speed wasn’t enough, since it might take a thousand clocks to finish a single cycle at the 100kHz clock speed. However, <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/sw/emmcdrv.c#L1591-L1593">by checking these return signals via the register set, the software driver could then tell if any clock frequency change had fully taken effect</a> before going on to the next operation.</p> <p>The next logic block is part of a two process finite state machine. The first process, shown below, is the combinatorial process. The second will be the clocked logic.</p> <p>Personally, I’m not a big fan of two process state machines. I’m just not. They often seem to me to be adding extra work and complexity. However, two process state machines allow me to reference logic results even before the full logic path is complete. They also allow me the ability to describe more complicated logic than the simple single process state machine, so a two process state machine it is.</p> <p>In this case, we are going to generate the next signal for the strobe, <code class="language-plaintext highlighter-rouge">nxt_stb</code>, the clock, <code class="language-plaintext highlighter-rouge">nxt_clk</code>, and the counter, <code class="language-plaintext highlighter-rouge">nxt_counter</code>.</p> <p>Of these signals, <code class="language-plaintext highlighter-rouge">nxt_clk</code> is the simplest to explain. This signal indicates that we’re about to start a new clock cycle.
In many ways, this is the combinatorial version of what is to become the <code class="language-plaintext highlighter-rouge">new_edge</code> once latched.</p> <p>Clock cycles themselves come in four phases, just like the four bits of the wide clock we discussed before. You can think of these phases as the <code class="language-plaintext highlighter-rouge">0110</code> of the fastest clock before. The first bit, 0, is the first phase of the clock. Our <code class="language-plaintext highlighter-rouge">new_edge</code> bit, <code class="language-plaintext highlighter-rouge">o_ckstb</code>, will only ever be true on this phase. The second bit, 1, is where the clock rises. The third bit, 1 again, is the only phase where the <code class="language-plaintext highlighter-rouge">half_edge</code>, <code class="language-plaintext highlighter-rouge">o_hlfck</code>, will be set. Finally, the clock will return to zero in the last phase. If the clock is ever idle, it will idle in this first phase prior to delivering a <code class="language-plaintext highlighter-rouge">new_edge</code> signal.</p> <p>This background will help explain how I’ve divided up the counter. There are <code class="language-plaintext highlighter-rouge">NCTR</code> bits to the counter. Of those bits, the top two control the phase bits we just described, whereas the others are the clock divider. The <code class="language-plaintext highlighter-rouge">nxt_stb</code> signal, mentioned above and below, is simply a signal that these top two phase-control bits are about to change.</p> <p>With that as background, let’s take a look at how this works.</p> <p>In general, the first step of any combinatorial block is to set all the values that will be determined within the block. 
This is a good practice to get into to avoid accidentally generating any latches.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">begin</span> <span class="n">nxt_stb</span> <span class="o">=</span> <span class="mb">1'b0</span><span class="p">;</span> <span class="n">nxt_clk</span> <span class="o">=</span> <span class="mb">1'b0</span><span class="p">;</span> <span class="n">nxt_counter</span> <span class="o">=</span> <span class="n">counter</span><span class="p">;</span></code></pre></figure> <p>From here, we subtract one from the bottom (non-phase) bits of our counter on every cycle. When these bits are zero, subtracting one will cause the counter to overflow and set our <code class="language-plaintext highlighter-rouge">nxt_stb</code> signal, so we can know when to adjust the phase bits.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="o">{</span> <span class="n">nxt_stb</span><span class="p">,</span> <span class="n">nxt_counter</span><span class="p">[</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">3</span><span class="o">:</span><span class="mi">0</span><span class="p">]</span> <span class="o">}</span> <span class="o">=</span> <span class="n">counter</span><span class="p">[</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">3</span><span class="o">:</span><span class="mi">0</span><span class="p">]</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span> <span class="k">if</span> <span class="p">(</span><span class="n">nxt_stb</span><span class="p">)</span> <span class="k">begin</span> <span class="c1">// Advance the top two bits</span> <span class="o">{</span> <span class="n">nxt_clk</span><span class="p">,</span> <span class="n">nxt_counter</span><span 
class="p">[</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">1</span><span class="o">:</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span> <span class="o">}</span> <span class="o">=</span> <span class="n">nxt_counter</span><span class="p">[</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">1</span><span class="o">:</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span> <span class="o">+</span><span class="mi">1</span><span class="p">;</span></code></pre></figure> <p>If our clock speed is set to 0 (wide clock of either <code class="language-plaintext highlighter-rouge">01100110</code> or <code class="language-plaintext highlighter-rouge">00110011</code>) or 1 (wide clock of <code class="language-plaintext highlighter-rouge">00111100</code> or <code class="language-plaintext highlighter-rouge">00001111</code>), then we are always generating a new clock cycle. 
In this case, we’ll hold the counter at zero and (roughly) ignore the phase.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">if</span> <span class="p">((</span><span class="n">OPT_DDR</span> <span class="o">||</span> <span class="n">OPT_SERDES</span><span class="p">)</span> <span class="o">&amp;&amp;</span> <span class="n">ckspd</span> <span class="o">&lt;=</span> <span class="mi">1</span><span class="p">)</span> <span class="k">begin</span> <span class="n">nxt_clk</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">nxt_counter</span><span class="p">[</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">3</span><span class="o">:</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span></code></pre></figure> <p>Likewise, if the clock speed is equal to two, the wide clock will either alternate between <code class="language-plaintext highlighter-rouge">0000_0000</code> and <code class="language-plaintext highlighter-rouge">1111_1111</code>, or <code class="language-plaintext highlighter-rouge">0000_1111</code> and <code class="language-plaintext highlighter-rouge">1111_0000</code>, and so our phase will alternate, but otherwise everything else can be kept to zero.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">end</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">ckspd</span> <span class="o">&lt;=</span> <span class="mi">2</span><span class="p">)</span> <span class="k">begin</span> <span class="n">nxt_clk</span> <span class="o">=</span> <span class="n">counter</span><span class="p">[</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">1</span><span class="p">];</span> <span class="n">nxt_counter</span><span class="p">[</span><span 
class="n">NCTR</span><span class="o">-</span><span class="mi">3</span><span class="o">:</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span></code></pre></figure> <p>Finally, in the more general case, we’ll just set the bottom bits to count down from <code class="language-plaintext highlighter-rouge">ckspd-3</code> to zero. Yes, this is “just” a counter, but the maximum value is offset by three for the three special speeds we just discussed above.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">end</span> <span class="k">else</span> <span class="n">nxt_counter</span><span class="p">[</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">3</span><span class="o">:</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">ckspd</span><span class="o">-</span><span class="mi">3</span><span class="p">;</span> <span class="k">end</span></code></pre></figure> <p>You may have noticed that we’ve only adjusted the bottom bits of this counter–the bits that count down. We’ve done nothing to update the phase bits at the top of this “counter”, so let’s handle those next. (Spoiler alert: these MSBs don’t act like counter bits in this implementation.)</p> <p>Of course, for the highest frequencies, the counter will never change. 
It sits at zero, with a permanent next phase of 3.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">if</span> <span class="p">(</span><span class="n">nxt_clk</span><span class="p">)</span> <span class="k">begin</span> <span class="k">if</span> <span class="p">((</span><span class="n">OPT_DDR</span> <span class="o">||</span> <span class="n">OPT_SERDES</span><span class="p">)</span> <span class="o">&amp;&amp;</span> <span class="n">new_ckspd</span> <span class="o">&lt;=</span> <span class="mi">1</span><span class="p">)</span> <span class="n">nxt_counter</span> <span class="o">=</span> <span class="o">{</span><span class="mb">2'b11</span><span class="p">,</span> <span class="o">{</span><span class="p">(</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">2</span><span class="p">)</span><span class="o">{</span><span class="mb">1'b0</span><span class="o">}}</span> <span class="o">}</span><span class="p">;</span></code></pre></figure> <p>When the speed setting is 2, we allow the top two bits to toggle back and forth. 
If <code class="language-plaintext highlighter-rouge">nxt_clk</code> is set, we need only reset these bits.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">new_ckspd</span> <span class="o">&lt;=</span> <span class="mi">2</span><span class="p">)</span> <span class="n">nxt_counter</span> <span class="o">=</span> <span class="o">{</span> <span class="mb">2'b01</span><span class="p">,</span> <span class="o">{</span><span class="p">(</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">2</span><span class="p">)</span><span class="o">{</span><span class="mb">1'b0</span><span class="o">}}</span> <span class="o">}</span><span class="p">;</span></code></pre></figure> <p>Finally, for the general case, we return the phase to zero and reset the clock.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">else</span> <span class="k">begin</span> <span class="n">nxt_counter</span><span class="p">[</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">1</span><span class="o">:</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="k">end</span> <span class="k">end</span> <span class="k">end</span></code></pre></figure> <p>This is only the first half of this “two process” FSM. The second half, with respect to the counter, is just about as simple.
Perhaps it is even more so, given that we’ve done all of the hard work above.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">i_clk</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="n">i_reset</span><span class="p">)</span> <span class="k">begin</span> <span class="k">if</span> <span class="p">(</span><span class="n">OPT_SERDES</span><span class="p">)</span> <span class="n">counter</span> <span class="o">&lt;=</span> <span class="mi">0</span><span class="p">;</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">OPT_DDR</span><span class="p">)</span> <span class="n">counter</span> <span class="o">&lt;=</span> <span class="o">{</span> <span class="mb">2'b11</span><span class="p">,</span> <span class="o">{</span><span class="p">(</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">2</span><span class="p">)</span><span class="o">{</span><span class="mb">1'b0</span><span class="o">}}</span> <span class="o">}</span><span class="p">;</span> <span class="k">else</span> <span class="n">counter</span> <span class="o">&lt;=</span> <span class="o">{</span> <span class="mb">2'b01</span><span class="p">,</span> <span class="o">{</span><span class="p">(</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">2</span><span class="p">)</span><span class="o">{</span><span class="mb">1'b0</span><span class="o">}}</span> <span class="o">}</span><span class="p">;</span> <span class="k">end</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">nxt_clk</span> <span class="o">&amp;&amp;</span> <span class="n">i_cfg_shutdown</span><span class="p">)</span> <span class="n">counter</span> <span class="o">&lt;=</span> <span class="o">{</span> <span 
class="mb">2'b11</span><span class="p">,</span> <span class="o">{</span><span class="p">(</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">2</span><span class="p">)</span><span class="o">{</span><span class="mb">1'b0</span><span class="o">}}</span> <span class="o">}</span><span class="p">;</span> <span class="k">else</span> <span class="n">counter</span> <span class="o">&lt;=</span> <span class="n">nxt_counter</span><span class="p">;</span></code></pre></figure> <p>The big thing to notice here is the <code class="language-plaintext highlighter-rouge">nxt_clk &amp;&amp; i_cfg_shutdown</code>. Remember, if the user ever asserts <code class="language-plaintext highlighter-rouge">i_cfg_shutdown</code>, we need to wait for the clock cycle to complete before shutting it down. Hence, we wait for the <code class="language-plaintext highlighter-rouge">nxt_clk</code> signal before acting. Then, once set, we leave the <code class="language-plaintext highlighter-rouge">counter</code> in a state where it will perpetually set <code class="language-plaintext highlighter-rouge">nxt_clk</code>. This way, the moment <code class="language-plaintext highlighter-rouge">i_cfg_shutdown</code> is released, we’ll be back to generating a clock again.</p> <p>To explain this a bit better, imagine the clock generator is producing an output clock from ten periods of the source/system clock: five system clocks of <code class="language-plaintext highlighter-rouge">0000_0000</code>, followed by five more clocks of <code class="language-plaintext highlighter-rouge">1111_1111</code>. Imagine again that we’ve had several periods of these 10 clock cycles before the user asserts the clock shutdown signal. We then wait another 10 cycles for the clock to fully shut down. Now, if the user drops the shutdown signal after a further 3 cycles, we could either wait another 7 cycles (to complete the 10), or start immediately.
Here, we try to arrange to start a stopped clock immediately without violating any of our clocking rules.</p> <p>The next signal, <code class="language-plaintext highlighter-rouge">clk90</code>, controls whether or not we’re generating a clock offset from <code class="language-plaintext highlighter-rouge">new_edge</code>, <code class="language-plaintext highlighter-rouge">o_ckstb</code>, by 90 degrees.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">i_clk</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="n">i_reset</span><span class="p">)</span> <span class="n">clk90</span> <span class="o">&lt;=</span> <span class="mi">0</span><span class="p">;</span> <span class="k">else</span> <span class="n">clk90</span> <span class="o">&lt;=</span> <span class="n">w_clk90</span><span class="p">;</span> <span class="k">assign</span> <span class="n">o_clk90</span> <span class="o">=</span> <span class="n">clk90</span><span class="p">;</span></code></pre></figure> <p>This logic isn’t very interesting yet, since we’ve basically split a two-process FSM. It will become more so when we get to <code class="language-plaintext highlighter-rouge">w_clk90</code>, and the first process of the FSM, below. The key is, this logic must determine what the current 90 degree offset setting is. Hence, when you look at the outgoing wide clock, this signal must match it.</p> <p>How about the clock speed? 
In this case, we go through some error checking.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">initial</span> <span class="n">ckspd</span> <span class="o">=</span> <span class="p">(</span><span class="n">OPT_SERDES</span><span class="p">)</span> <span class="o">?</span> <span class="mi">8'd0</span> <span class="o">:</span> <span class="p">(</span><span class="n">OPT_DDR</span><span class="p">)</span> <span class="o">?</span> <span class="mi">8'd1</span> <span class="o">:</span> <span class="mi">8'd2</span><span class="p">;</span> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">i_clk</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="n">i_reset</span><span class="p">)</span> <span class="n">ckspd</span> <span class="o">&lt;=</span> <span class="p">(</span><span class="n">OPT_SERDES</span><span class="p">)</span> <span class="o">?</span> <span class="mi">8'd0</span> <span class="o">:</span> <span class="p">(</span><span class="n">OPT_DDR</span><span class="p">)</span> <span class="o">?</span> <span class="mi">8'd1</span> <span class="o">:</span> <span class="mi">8'd2</span><span class="p">;</span> <span class="k">else</span> <span class="n">ckspd</span> <span class="o">&lt;=</span> <span class="n">w_ckspd</span><span class="p">;</span> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="n">OPT_SERDES</span><span class="p">)</span> <span class="n">new_ckspd</span> <span class="o">=</span> <span class="n">i_cfg_ckspd</span><span class="p">;</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">OPT_DDR</span> <span class="o">&amp;&amp;</span> <span class="n">i_cfg_ckspd</span> <span class="o">&lt;=</span> <span 
class="mi">1</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">i_cfg_clk90</span><span class="p">)</span> <span class="n">new_ckspd</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">i_cfg_ckspd</span> <span class="o">&lt;=</span> <span class="mi">2</span> <span class="o">&amp;&amp;</span> <span class="p">(</span><span class="n">OPT_DDR</span> <span class="o">||</span> <span class="o">!</span><span class="n">i_cfg_clk90</span><span class="p">))</span> <span class="n">new_ckspd</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">i_cfg_ckspd</span> <span class="o">&lt;=</span> <span class="mi">3</span><span class="p">)</span> <span class="n">new_ckspd</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span> <span class="k">else</span> <span class="n">new_ckspd</span> <span class="o">=</span> <span class="n">i_cfg_ckspd</span><span class="p">;</span> <span class="k">assign</span> <span class="n">w_clk90</span> <span class="o">=</span> <span class="p">(</span><span class="n">nxt_clk</span><span class="p">)</span> <span class="o">?</span> <span class="n">i_cfg_clk90</span> <span class="o">:</span> <span class="n">clk90</span><span class="p">;</span> <span class="k">assign</span> <span class="n">w_ckspd</span> <span class="o">=</span> <span class="p">(</span><span class="n">nxt_clk</span><span class="p">)</span> <span class="o">?</span> <span class="n">new_ckspd</span> <span class="o">:</span> <span class="n">ckspd</span><span class="p">;</span></code></pre></figure> <p>The error checking is here to guarantee that a clock speed of 0 is only used when <code class="language-plaintext highlighter-rouge">OPT_SERDES</code> is set. 
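If it helps to see these checks all in one place, here is the same priority logic as a small Python function, a sketch that simply mirrors the combinational <code class="language-plaintext highlighter-rouge">new_ckspd</code> block above:

```python
def new_ckspd(cfg_ckspd, cfg_clk90, opt_serdes, opt_ddr):
    """Clamp a requested clock speed to what the hardware supports.

    Mirrors the new_ckspd priority logic: speed 0 requires OPT_SERDES,
    speed 1 requires at least OPT_DDR (and no 90-degree offset without
    OPT_SERDES), and speed 2 with a 90-degree offset requires OPT_DDR.
    """
    if opt_serdes:
        return cfg_ckspd                  # full SERDES: honor any request
    if opt_ddr and cfg_ckspd <= 1 and not cfg_clk90:
        return 1                          # ODDR can produce 0000_1111
    if cfg_ckspd <= 2 and (opt_ddr or not cfg_clk90):
        return 2                          # alternate 00/FF (or 0F/F0 w/ DDR)
    if cfg_ckspd <= 3:
        return 3
    return cfg_ckspd                      # slow enough: honor as given

# A plain-logic build (no SERDES, no DDR) asked for speed 0 gets clamped:
print(new_ckspd(0, False, opt_serdes=False, opt_ddr=False))  # -> 2
```

For example, requesting speed 0 (the 200MHz rate) from a plain-logic build gets clamped to 2, the fastest rate a simple registered output can toggle.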
Likewise, a clock speed of 1 may be used in <a href="/blog/2020/08/22/oddr.html">ODDR</a> mode (wide clock of <code class="language-plaintext highlighter-rouge">00001111</code>), but not when the <code class="language-plaintext highlighter-rouge">clk90</code> configuration is set (calling for a wide clock of <code class="language-plaintext highlighter-rouge">0011_1100</code>, which is too complex for an <a href="/blog/2020/08/22/oddr.html">ODDR</a> output module to produce). This continues for a clock speed of two, which is fine for a non-offset clock (wide clock of <code class="language-plaintext highlighter-rouge">0000_0000</code> followed by <code class="language-plaintext highlighter-rouge">1111_1111</code>), but not for an offset clock (wide clock of <code class="language-plaintext highlighter-rouge">0000_1111</code> followed by <code class="language-plaintext highlighter-rouge">1111_0000</code>) unless the <code class="language-plaintext highlighter-rouge">OPT_DDR</code> option is set.</p> <p>Finally, the two values <code class="language-plaintext highlighter-rouge">w_clk90</code> and <code class="language-plaintext highlighter-rouge">w_ckspd</code> are used to tell us what values our registered logic should use when generating a clock. 
As such, they are either the registered values, or (when we’re about to start a new cycle) the new values.</p> <p>With all this as background, we can now dig into the core of this logic–generating the three key signals we will be outputting.</p> <p>On reset, these signals will simply be set to indicate a clock of the fastest rate, ready to go, but otherwise one that is idle (<code class="language-plaintext highlighter-rouge">o_ckwide=0</code>).</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">initial</span> <span class="n">o_ckstb</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="k">initial</span> <span class="n">o_hlfck</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="k">initial</span> <span class="n">o_ckwide</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">i_clk</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="n">i_reset</span><span class="p">)</span> <span class="k">begin</span> <span class="n">o_ckstb</span> <span class="o">&lt;=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">o_hlfck</span> <span class="o">&lt;=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">o_ckwide</span> <span class="o">&lt;=</span> <span class="mi">0</span><span class="p">;</span></code></pre></figure> <p>Next, if we want to shut down the clock, we can only do so on <code class="language-plaintext highlighter-rouge">nxt_clk</code>. 
When shut down, the wide clock will be zero and the new edge signals will all be suppressed.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">end</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">nxt_clk</span> <span class="o">&amp;&amp;</span> <span class="n">i_cfg_shutdown</span><span class="p">)</span> <span class="k">begin</span> <span class="n">o_ckstb</span> <span class="o">&lt;=</span> <span class="mb">1'b0</span><span class="p">;</span> <span class="n">o_hlfck</span> <span class="o">&lt;=</span> <span class="mb">1'b0</span><span class="p">;</span> <span class="n">o_ckwide</span> <span class="o">&lt;=</span> <span class="mh">8'h0</span><span class="p">;</span></code></pre></figure> <p>As mentioned above, the key here is that the clock can suddenly start if the <code class="language-plaintext highlighter-rouge">i_cfg_shutdown</code> signal is released. Using this logic, it does not need to remain phase coherent with whatever phase the clock had prior to being shut down.</p> <p>Moving on to our highest speed clock, we simply set that according to the 90 degree clock configuration. 
In general, this speed will only ever generate one of two values: <code class="language-plaintext highlighter-rouge">01100110</code> or <code class="language-plaintext highlighter-rouge">00110011</code>.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">end</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">OPT_SERDES</span> <span class="o">&amp;&amp;</span> <span class="n">w_ckspd</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="k">begin</span> <span class="n">o_ckstb</span> <span class="o">&lt;=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">o_hlfck</span> <span class="o">&lt;=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">o_ckwide</span> <span class="o">&lt;=</span> <span class="p">(</span><span class="n">i_cfg_clk90</span><span class="p">)</span> <span class="o">?</span> <span class="mh">8'h66</span> <span class="o">:</span> <span class="mh">8'h33</span><span class="p">;</span></code></pre></figure> <p>When running from a 100MHz system (<code class="language-plaintext highlighter-rouge">src_clk</code>) clock, this plus the OSERDES will generate a 200MHz clock signal to the external device.</p> <p>One might argue that the <code class="language-plaintext highlighter-rouge">OPT_SERDES</code> here is really redundant. There should be enough logic elsewhere to keep <code class="language-plaintext highlighter-rouge">w_ckspd</code> at a non-zero value if <code class="language-plaintext highlighter-rouge">OPT_SERDES</code> is not set. Why use it?</p> <p>It’s here specifically to provide a strong hint to the synthesis tool regarding logic that can be cleaned up if <code class="language-plaintext highlighter-rouge">OPT_SERDES</code> is not set. 
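You can convince yourself of the 200MHz claim by just counting rising edges in the serialized bit stream. Here is a quick Python check: the MSB-first bit order is an assumption here (a stand-in for an 8:1 OSERDES), though for these patterns the edge count comes out the same either way.

```python
def rising_edges(words, last=0):
    """Count 0->1 transitions in a stream of 8-bit wide-clock words,
    serialized MSB first, as an 8:1 output serializer might."""
    count = 0
    for word in words:
        for i in range(7, -1, -1):
            bit = (word >> i) & 1
            if bit and not last:
                count += 1
            last = bit
    return count

# Five system clocks of 8'h33 carry ten full output-clock periods:
print(rising_edges([0x33] * 5))   # -> 10 (two clocks per system clock)
print(rising_edges([0x0f] * 5))   # -> 5  (one clock per system clock)
```

Two rising edges per 8-bit word means two output clock periods per system clock: 200MHz from a 100MHz `src_clk`, versus one period per word (100MHz) for the `0000_1111` pattern.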
This block is complicated enough as it is; the extra hint should help the synthesis tool prune away any logic that can never be reached.</p> <p>The problem with putting this value here, and generating a clock module based upon parameters such as <code class="language-plaintext highlighter-rouge">OPT_SERDES</code> and <code class="language-plaintext highlighter-rouge">OPT_DDR</code>, is that I now need to formally verify the IP under several conditions before I can know if it works. This applies to simulation as well: it is no longer sufficient to run the simulation tool once; it must now be run many times under different conditions. As an engineer, I need to be aware of costs like this whenever I parameterize logic this way.</p> <p>In this case, I wanted to support multiple types of FPGAs (and/or ASICs), and so this was the logic I chose.</p> <p>Our next speed, <code class="language-plaintext highlighter-rouge">ckspd=1</code>, has almost the same logic. As before, <code class="language-plaintext highlighter-rouge">o_ckstb</code> and <code class="language-plaintext highlighter-rouge">o_hlfck</code> are both set continually in this mode. 
In this case, our wide clock output will either be <code class="language-plaintext highlighter-rouge">0011_1100</code> or <code class="language-plaintext highlighter-rouge">0000_1111</code> depending on whether or not we need a 90 degree offset clock for DDR.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">end</span> <span class="k">else</span> <span class="k">if</span> <span class="p">((</span><span class="n">OPT_SERDES</span> <span class="o">||</span> <span class="n">OPT_DDR</span><span class="p">)</span> <span class="o">&amp;&amp;</span> <span class="n">w_ckspd</span> <span class="o">&lt;=</span> <span class="mi">1</span><span class="p">)</span> <span class="k">begin</span> <span class="n">o_ckstb</span> <span class="o">&lt;=</span> <span class="mb">1'b1</span><span class="p">;</span> <span class="n">o_hlfck</span> <span class="o">&lt;=</span> <span class="mb">1'b1</span><span class="p">;</span> <span class="n">o_ckwide</span> <span class="o">&lt;=</span> <span class="p">(</span><span class="n">OPT_SERDES</span> <span class="o">&amp;&amp;</span> <span class="n">w_clk90</span><span class="p">)</span> <span class="o">?</span> <span class="mh">8'h3c</span> <span class="o">:</span> <span class="mh">8'h0f</span><span class="p">;</span></code></pre></figure> <p>When running from a 100MHz system (<code class="language-plaintext highlighter-rouge">src_clk</code>) clock, this generates a 100MHz clock as well.</p> <p>You may note that there’s no real two-cycle output signal. The signaling, with <code class="language-plaintext highlighter-rouge">o_ckstb</code> and <code class="language-plaintext highlighter-rouge">o_hlfck</code>, allows us to describe a new clock together with or separate from the second half of that clock period, but offers nothing for describing two clock cycles in the same source clock period. 
This is just a limitation in our chosen signaling.</p> <p>The solution to this problem is specific to the <a href="https://github.com/ZipCPU/sdspi">eMMC controller</a> that we’ve drawn <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">our example</a> from. In this case, I look at both the DDR setting and the clock speed before generating any transmit data. From this, I determine if I should be sending one byte, two bytes, or four bytes of data per clock. The actual logic is more complex, due to the fact that the eMMC interface may run in 1b, 4b, or 8b modes, but that’s the story of <a href="https://github.com/ZipCPU/sdspi/blob/master/rtl/sdtxframe.v">another piece of logic, found outside of the clock controller</a>.</p> <p>As with clock speeds of either 0 (200MHz) or 1 (100MHz), the clock speed of 2 (50MHz) is also handled specially. This is the speed that alternates between two outputs, generating either <code class="language-plaintext highlighter-rouge">00001111</code> followed by <code class="language-plaintext highlighter-rouge">11110000</code> in the offset mode (<code class="language-plaintext highlighter-rouge">o_clk90=1</code>), or simply <code class="language-plaintext highlighter-rouge">00000000</code> followed by <code class="language-plaintext highlighter-rouge">11111111</code> in the normal mode.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">end</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">w_ckspd</span> <span class="o">==</span> <span class="mi">2</span><span class="p">)</span> <span class="k">begin</span> <span class="o">{</span> <span class="n">o_ckstb</span><span class="p">,</span> <span class="n">o_hlfck</span> <span class="o">}</span> <span class="o">&lt;=</span> <span class="p">(</span><span class="o">!</span><span class="n">nxt_counter</span><span class="p">[</span><span 
class="n">NCTR</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span> <span class="o">?</span> <span class="mb">2'b10</span> <span class="o">:</span> <span class="mb">2'b01</span><span class="p">;</span> <span class="k">if</span> <span class="p">(</span><span class="n">w_clk90</span> <span class="o">&amp;&amp;</span> <span class="p">(</span><span class="n">OPT_SERDES</span> <span class="o">||</span> <span class="n">OPT_DDR</span><span class="p">))</span> <span class="n">o_ckwide</span> <span class="o">&lt;=</span> <span class="p">(</span><span class="o">!</span><span class="n">nxt_counter</span><span class="p">[</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span> <span class="o">?</span> <span class="mh">8'h0f</span> <span class="o">:</span> <span class="mh">8'hf0</span><span class="p">;</span> <span class="k">else</span> <span class="n">o_ckwide</span> <span class="o">&lt;=</span> <span class="p">(</span><span class="o">!</span><span class="n">nxt_counter</span><span class="p">[</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span> <span class="o">?</span> <span class="mh">8'h00</span> <span class="o">:</span> <span class="mh">8'hff</span><span class="p">;</span></code></pre></figure> <p>When running from a 100MHz system clock (<code class="language-plaintext highlighter-rouge">src_clk</code> above), this generates a 50MHz output clock signal. This might be the “fastest” speed you would normally think of for an integer clock “divider”. As you can see, though, we’ve already generated outgoing 200MHz and 100MHz clocks above.</p> <p>This brings us to the general case–a divided clock running at less than half our source clock rate. 
Here, we’ve already done all of the hard work for <code class="language-plaintext highlighter-rouge">nxt_clk</code>, so the outgoing next edge signal <code class="language-plaintext highlighter-rouge">o_ckstb</code> is done.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">end</span> <span class="k">else</span> <span class="k">begin</span> <span class="n">o_ckstb</span> <span class="o">&lt;=</span> <span class="n">nxt_clk</span><span class="p">;</span></code></pre></figure> <p>The half edge signal is determined by the counter. The lower bits must be zero, indicating a new phase, and the top two bits indicate the new phase will be the third of four–so just entering halfway.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="n">o_hlfck</span> <span class="o">&lt;=</span> <span class="p">(</span><span class="n">counter</span> <span class="o">==</span> <span class="o">{</span><span class="mb">2'b01</span><span class="p">,</span> <span class="o">{</span><span class="p">(</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">2</span><span class="p">)</span><span class="o">{</span><span class="mb">1'b0</span><span class="o">}}</span> <span class="o">}</span><span class="p">);</span></code></pre></figure> <p>The wide clock is determined by the top two phase bits of the next counter. 
It’s either equal to the most significant bit, when there’s no clock offset, or the exclusive OR of the top two bits when there is.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">if</span> <span class="p">(</span><span class="n">w_clk90</span><span class="p">)</span> <span class="n">o_ckwide</span> <span class="o">&lt;=</span> <span class="o">{</span><span class="p">(</span><span class="mi">8</span><span class="p">)</span><span class="o">{</span><span class="n">nxt_counter</span><span class="p">[</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">^</span> <span class="n">nxt_counter</span><span class="p">[</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span><span class="o">}}</span><span class="p">;</span> <span class="k">else</span> <span class="n">o_ckwide</span> <span class="o">&lt;=</span> <span class="o">{</span><span class="p">(</span><span class="mi">8</span><span class="p">)</span><span class="o">{</span><span class="n">nxt_counter</span><span class="p">[</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">}}</span><span class="p">;</span> <span class="k">end</span></code></pre></figure> <p>This leaves us with only one final signal: the current clock speed. In this case, all the work has been done above, and nothing more need be done with it.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">i_clk</span><span class="p">)</span> <span class="n">o_ckspd</span> <span class="o">&lt;=</span> <span class="n">w_ckspd</span><span class="p">;</span></code></pre></figure> <p>That’s the basic idea. 
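Before summarizing, the general-case wide-clock selection can be boiled down to a few lines of Python. This is just a model of the <code class="language-plaintext highlighter-rouge">if (w_clk90)</code> block above, using a hypothetical 16-bit counter:

```python
def ckwide(nxt_counter, clk90, nbits=16):
    """Wide-clock byte for the general (slow) case: replicate the
    counter MSB, or the XOR of the top two bits when a 90-degree
    offset is requested (mirroring the Verilog above)."""
    msb  = (nxt_counter >> (nbits - 1)) & 1
    msb2 = (nxt_counter >> (nbits - 2)) & 1
    bit = (msb ^ msb2) if clk90 else msb
    return 0xFF if bit else 0x00

# Walk the four phases (top two counter bits 00, 01, 10, 11):
quarter = 1 << 14
print([ckwide(p * quarter, clk90=False) for p in range(4)])  # [0, 0, 255, 255]
print([ckwide(p * quarter, clk90=True)  for p in range(4)])  # [0, 255, 255, 0]
```

Note how the two outputs differ by one quarter of the output period, i.e. by 90 degrees, with the offset clock rising one phase earlier.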
In summary:</p> <ul> <li> <p>There are four phases to the outgoing clock, either <code class="language-plaintext highlighter-rouge">0011</code> or <code class="language-plaintext highlighter-rouge">0110</code>.</p> </li> <li> <p>A counter generally helps us know when to transition from one phase to the next.</p> </li> <li> <p>High speeds get special attention.</p> </li> <li> <p>Data changes on the outgoing next edge signal, <code class="language-plaintext highlighter-rouge">o_ckstb</code>.</p> <p>In DDR modes, data can also change on the outgoing <code class="language-plaintext highlighter-rouge">o_hlfck</code> signal.</p> </li> </ul> <p>Key features of this approach include:</p> <ul> <li> <p>There’s no need for any <a href="/blog/2017/10/20/cdc.html">clock domain crossings</a> in the outgoing data path. All outgoing signals are handled in the source clock domain.</p> </li> <li> <p>The clock may be gated at will, and (re)started quickly if necessary.</p> </li> <li> <p>Frequency changes are controlled, and will take place between clock periods.</p> </li> <li> <p>Although the clock is generated in logic, it doesn’t trigger any logic. That is, nowhere in the design will anything in the outgoing logic path depend upon either <code class="language-plaintext highlighter-rouge">@(posedge dev_clk)</code> or <code class="language-plaintext highlighter-rouge">@(negedge dev_clk)</code>. 
Instead, all of the logic is triggered off of the <code class="language-plaintext highlighter-rouge">o_ckstb</code> or <code class="language-plaintext highlighter-rouge">o_hlfck</code> signals while still running on the same <code class="language-plaintext highlighter-rouge">src_clk</code> we started from.</p> </li> </ul> <p>But … does it work?</p> <h2 id="simulation-testing">Simulation testing</h2> <p>Just to get this clock generator off the ground, I built a <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/verilog/tb_sdckgen.v">quick simulation test bench</a>. You can <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/verilog/tb_sdckgen.v">find it here</a>, and we’ll walk through it quickly.</p> <p>The first step was pretty boilerplate. I simply started a VCD trace, placed the design into reset, and generated a 100MHz clock.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">initial</span> <span class="k">begin</span> <span class="p">$</span><span class="nb">dumpfile</span><span class="p">(</span><span class="s">"tb_sdckgen.vcd"</span><span class="p">);</span> <span class="p">$</span><span class="nb">dumpvars</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="n">tb_sdckgen</span><span class="p">);</span> <span class="n">reset</span> <span class="o">=</span> <span class="mb">1'b1</span><span class="p">;</span> <span class="n">clk</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="k">forever</span> <span class="p">#</span><span class="mi">5</span> <span class="n">clk</span> <span class="o">=</span> <span class="o">!</span><span class="n">clk</span><span class="p">;</span> <span class="k">end</span></code></pre></figure> <p>For the second step, I wanted to place the design in a variety of configurations to see how it would work in each. 
I chose to leave it in each configuration for five clock cycles before moving to the next. I then defined a simple task, <code class="language-plaintext highlighter-rouge">capture_beats</code>, that I could call to wait out five cycles of a given clock setting before moving on.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">task</span> <span class="n">capture_beats</span><span class="p">;</span> <span class="k">begin</span> <span class="kt">repeat</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span> <span class="k">begin</span> <span class="k">wait</span><span class="p">(</span><span class="n">w_ckstb</span><span class="p">);</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">clk</span><span class="p">);</span> <span class="k">end</span> <span class="k">end</span> <span class="k">endtask</span></code></pre></figure> <p>The last step, then, was to walk through one clock setting after another to see what would happen.</p> <p>I started by taking the design out of reset, and configuring the inputs for a (rough) 100kHz clock.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">initial</span> <span class="k">begin</span> <span class="o">{</span> <span class="n">cfg_shutdown</span><span class="p">,</span> <span class="n">cfg_clk90</span><span class="p">,</span> <span class="n">cfg_ckspd</span> <span class="o">}</span> <span class="o">=</span> <span class="mh">10'h0fc</span><span class="p">;</span> <span class="kt">repeat</span> <span class="p">(</span><span class="mi">5</span><span class="p">)</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">clk</span><span class="p">)</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">clk</span><span class="p">)</span> <span class="n">reset</span> 
<span class="o">&lt;=</span> <span class="mi">0</span><span class="p">;</span> <span class="c1">// 100kHz (10us)</span> <span class="n">capture_beats</span><span class="p">;</span></code></pre></figure> <p>You can pretty well read the comments below to see the configurations I checked.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="c1">// 200 kHz (5us)</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">clk</span><span class="p">)</span> <span class="o">{</span> <span class="n">cfg_shutdown</span><span class="p">,</span> <span class="n">cfg_clk90</span><span class="p">,</span> <span class="n">cfg_ckspd</span> <span class="o">}</span> <span class="o">=</span> <span class="mh">10'h07f</span><span class="p">;</span> <span class="n">capture_beats</span><span class="p">;</span> <span class="c1">// 400 kHz (2.52us)</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">clk</span><span class="p">)</span> <span class="o">{</span> <span class="n">cfg_shutdown</span><span class="p">,</span> <span class="n">cfg_clk90</span><span class="p">,</span> <span class="n">cfg_ckspd</span> <span class="o">}</span> <span class="o">=</span> <span class="mh">10'h041</span><span class="p">;</span> <span class="n">capture_beats</span><span class="p">;</span> <span class="c1">// 1MHz (1us)</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">clk</span><span class="p">)</span> <span class="o">{</span> <span class="n">cfg_shutdown</span><span class="p">,</span> <span class="n">cfg_clk90</span><span class="p">,</span> <span class="n">cfg_ckspd</span> <span class="o">}</span> <span class="o">=</span> <span class="mh">10'h01b</span><span class="p">;</span> <span class="n">capture_beats</span><span class="p">;</span> <span class="c1">// 5MHz (200ns)</span> <span class="o">@</span><span 
class="p">(</span><span class="kt">posedge</span> <span class="n">clk</span><span class="p">)</span> <span class="o">{</span> <span class="n">cfg_shutdown</span><span class="p">,</span> <span class="n">cfg_clk90</span><span class="p">,</span> <span class="n">cfg_ckspd</span> <span class="o">}</span> <span class="o">=</span> <span class="mh">10'h007</span><span class="p">;</span> <span class="n">capture_beats</span><span class="p">;</span> <span class="c1">// 12MHz (80ns)</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">clk</span><span class="p">)</span> <span class="o">{</span> <span class="n">cfg_shutdown</span><span class="p">,</span> <span class="n">cfg_clk90</span><span class="p">,</span> <span class="n">cfg_ckspd</span> <span class="o">}</span> <span class="o">=</span> <span class="mh">10'h004</span><span class="p">;</span> <span class="n">capture_beats</span><span class="p">;</span> <span class="c1">// 25MHz (40ns)</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">clk</span><span class="p">)</span> <span class="o">{</span> <span class="n">cfg_shutdown</span><span class="p">,</span> <span class="n">cfg_clk90</span><span class="p">,</span> <span class="n">cfg_ckspd</span> <span class="o">}</span> <span class="o">=</span> <span class="mh">10'h003</span><span class="p">;</span> <span class="n">capture_beats</span><span class="p">;</span> <span class="c1">// 50MHz (20ns)</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">clk</span><span class="p">)</span> <span class="o">{</span> <span class="n">cfg_shutdown</span><span class="p">,</span> <span class="n">cfg_clk90</span><span class="p">,</span> <span class="n">cfg_ckspd</span> <span class="o">}</span> <span class="o">=</span> <span class="mh">10'h002</span><span class="p">;</span> <span class="n">capture_beats</span><span class="p">;</span> <span 
class="c1">// 100MHz</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">clk</span><span class="p">)</span> <span class="o">{</span> <span class="n">cfg_shutdown</span><span class="p">,</span> <span class="n">cfg_clk90</span><span class="p">,</span> <span class="n">cfg_ckspd</span> <span class="o">}</span> <span class="o">=</span> <span class="mh">10'h001</span><span class="p">;</span> <span class="n">capture_beats</span><span class="p">;</span> <span class="c1">// 200MHz</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">clk</span><span class="p">)</span> <span class="o">{</span> <span class="n">cfg_shutdown</span><span class="p">,</span> <span class="n">cfg_clk90</span><span class="p">,</span> <span class="n">cfg_ckspd</span> <span class="o">}</span> <span class="o">=</span> <span class="mh">10'h000</span><span class="p">;</span> <span class="n">capture_beats</span><span class="p">;</span> <span class="c1">// 25MHz, CLK90</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">clk</span><span class="p">)</span> <span class="o">{</span> <span class="n">cfg_shutdown</span><span class="p">,</span> <span class="n">cfg_clk90</span><span class="p">,</span> <span class="n">cfg_ckspd</span> <span class="o">}</span> <span class="o">=</span> <span class="mh">10'h103</span><span class="p">;</span> <span class="n">capture_beats</span><span class="p">;</span> <span class="c1">// 25MHz, CLK90</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">clk</span><span class="p">)</span> <span class="o">{</span> <span class="n">cfg_shutdown</span><span class="p">,</span> <span class="n">cfg_clk90</span><span class="p">,</span> <span class="n">cfg_ckspd</span> <span class="o">}</span> <span class="o">=</span> <span class="mh">10'h102</span><span class="p">;</span> <span 
class="n">capture_beats</span><span class="p">;</span> <span class="c1">// 100MHz, CLK90</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">clk</span><span class="p">)</span> <span class="o">{</span> <span class="n">cfg_shutdown</span><span class="p">,</span> <span class="n">cfg_clk90</span><span class="p">,</span> <span class="n">cfg_ckspd</span> <span class="o">}</span> <span class="o">=</span> <span class="mh">10'h101</span><span class="p">;</span> <span class="n">capture_beats</span><span class="p">;</span> <span class="c1">// 200MHz, CLK90</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">clk</span><span class="p">)</span> <span class="o">{</span> <span class="n">cfg_shutdown</span><span class="p">,</span> <span class="n">cfg_clk90</span><span class="p">,</span> <span class="n">cfg_ckspd</span> <span class="o">}</span> <span class="o">=</span> <span class="mh">10'h100</span><span class="p">;</span> <span class="n">capture_beats</span><span class="p">;</span> <span class="p">$</span><span class="nb">finish</span><span class="p">;</span> <span class="k">end</span></code></pre></figure> <p>These are basically all of the configurations I wanted to use the design with. Using the generated trace, I can visually see all of the signals within this design working as intended. Further, unlike the formal verification we’ll discuss next, I can actually see <em>many</em> clocks of this design. This allows me to verify, for example, that the 100kHz, 200kHz, and 400kHz clock divisions work as designed.</p> <p>Sadly, this test is woefully inadequate for any real or professional purpose.</p> <p>The biggest problem with <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/verilog/tb_sdckgen.v">this simple test bench script</a> is that it’s not self checking. 
I can run it, but the only way to know if the design did the right thing or not is to pull up a viewer and check the <a href="/blog/2017/07/31/vcd.html">VCD file</a>. Sure, this might get me off the ground, but it is <em>horrible</em> for maintenance. How should I know, for example, if a small and otherwise minor change breaks things?</p> <p>The second problem with <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/verilog/tb_sdckgen.v">this test bench</a> is that it does nothing to try out unreasonable input signals. How shall I know, for example, that this design will never go faster than the fastest allowed frequency? That is, it should only ever be able to go as fast as the current speed, or the newly commanded speed.</p> <p>Perhaps some of you may remember my comments on Twitter about getting excited to try this new design as a whole (not just the clock generator) on an FPGA, only to be mildly (not) surprised that it didn’t work, since the formal proofs weren’t yet finished? (I couldn’t find those comments when I looked today …) Yeah, there’s always a surprise you aren’t expecting when you work with real hardware.</p> <p>So, while <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/verilog/tb_sdckgen.v">this</a> looks nice, and while the resulting traces look really pretty, <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/verilog/tb_sdckgen.v">this test bench</a> is highly insufficient.</p> <p>Let’s move on to something more substantial.</p> <h2 id="formal-properties">Formal Properties</h2> <p>I like to think of <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">this clock module</a> as a basic clock divider. It’s not much more than a glorified counter, together with a 4-state phase machine.
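</p>

<p>To make this “glorified counter” idea concrete, here is a rough, hypothetical sketch of it. Let me be clear: this is <em>not</em> the actual <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">sdckgen</a> logic, and every name and width below has been made up for illustration.</p>

<figure class="highlight"><pre><code class="language-verilog" data-lang="verilog">// Hypothetical sketch of a counter-based clock divider.  This is NOT
// the actual sdckgen code: all names and widths here are made up.
module ckdiv_sketch (
		input	wire		i_clk, i_reset,
		// Phase increment, added to the counter on every clock
		input	wire	[13:0]	i_increment,
		output	wire		o_new_edge,	// Output period begins
		output	wire		o_half_edge	// Output period midpoint
	);

	// The top two bits of the counter form the 4-state phase machine:
	// 2'b00 and 2'b01 cover the first half of the output clock period,
	// 2'b10 and 2'b11 cover the second half.  The lower bits just carry
	// fractional bookkeeping between system clock edges.
	reg	[15:0]	counter;
	wire	[16:0]	next_counter;

	assign	next_counter = { 1'b0, counter } + { 3'h0, i_increment };

	initial	counter = 0;
	always @(posedge i_clk)
	if (i_reset)
		counter &lt;= 0;
	else
		counter &lt;= next_counter[15:0];

	// A new output period begins whenever the counter wraps around ...
	assign	o_new_edge  = next_counter[16];
	// ... and the half-period strobe fires on entering the second half
	assign	o_half_edge = !counter[15] &amp;&amp; next_counter[15];
endmodule</code></pre></figure>

<p>The real design is more involved than this, but the counter-plus-phase structure is the point here.</p>

<p>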
Yeah, sure, you can run through all 4 states in one clock cycle, but it’s still not really all that much more. Formally verifying <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">this clock generator</a> should therefore be pretty simple.</p> <p>One of the big keys to this proof is <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v">the interface property set</a>.</p> <p><a href="https://zipcpu.com/formal/2020/06/12/four-keys.html">I’ve discussed interface properties before</a>. The idea is born from the fact that one component, such as <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">this clock generator</a>, is going to generate signals that another component, in this case <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdtxframe.v">the transmit data generator</a>, will use. Further, these two proofs will be independent of each other. Hence, anything the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdtxframe.v">transmitter’s</a> proof needs to assume should then be asserted in the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">clock generator</a> and vice versa. That’s the purpose of the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v">property set</a>. The <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v">property set</a> also greatly simplifies the assertions found within the design itself.</p> <p>Still, let’s look over the design assertions for now.
We’ll come back to the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v">property set</a> in the next section.</p> <p>We’ll start with the <code class="language-plaintext highlighter-rouge">f_en</code> signal.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">initial</span> <span class="n">f_en</span> <span class="o">=</span> <span class="mb">1'b1</span><span class="p">;</span> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">i_clk</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="n">i_reset</span><span class="p">)</span> <span class="n">f_en</span> <span class="o">&lt;=</span> <span class="mb">1'b1</span><span class="p">;</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">nxt_clk</span><span class="p">)</span> <span class="n">f_en</span> <span class="o">&lt;=</span> <span class="o">!</span><span class="n">i_cfg_shutdown</span><span class="p">;</span></code></pre></figure> <p>This just captures whether the clock should be shut down during the current cycle or not. It’s that simple.</p> <p>Many engineers just starting out with formal verification struggle to see past the assertions and the assumptions within the language to realize they can still use regular verilog when generating formal properties. In this case, <code class="language-plaintext highlighter-rouge">f_en</code> is nothing more than a register which we are going to use in our formal proof. Nothing prevents you from doing this. 
Indeed, you are more than able to write <a href="https://zipcpu.com/formal/2019/02/21/txuart.html">more complicated state machines</a> when generating formal properties as well.</p> <p>Just make sure that your new logic doesn’t use the same expressions as the logic you are verifying, or you might convince yourself something works when it doesn’t. When teaching, I like to explain it this way: the best way to verify that <code class="language-plaintext highlighter-rouge">A</code> divided by <code class="language-plaintext highlighter-rouge">B</code> is <code class="language-plaintext highlighter-rouge">C</code> is to multiply <code class="language-plaintext highlighter-rouge">C</code> and <code class="language-plaintext highlighter-rouge">B</code> together. If the result of the multiply is <code class="language-plaintext highlighter-rouge">A</code>, then you’ve verified your result. Why does this work? Because you use different logic paths in your brain for division than you do for multiplication. Hence, if you make a mistake in dividing, you aren’t likely to make the same mistake when multiplying.</p> <p>The same is true of formal methods. You can use logic in formal methods, just like you do in your design; you just don’t want to use the same logic, lest your mind falsely convinces you it’s right when it isn’t.
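</p>

<p>As an illustration, here’s what such a cross-check might look like for a divider. To be clear, this is only a sketch: none of these signal names exist in the clock generator, and the divider producing them is assumed.</p>

<figure class="highlight"><pre><code class="language-verilog" data-lang="verilog">// Hypothetical example: none of these signal names come from the
// design above.  Suppose some design computes o_quotient and
// o_remainder from i_numerator / i_denominator.  Rather than
// re-building the division within the formal logic (and possibly
// repeating the same mistake twice), check the result using the
// inverse operation instead:
always @(*)
if (o_valid &amp;&amp; i_denominator != 0)
begin
	assert(o_quotient * i_denominator + o_remainder == i_numerator);
	assert(o_remainder &lt; i_denominator);
end</code></pre></figure>

<p>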
This is sort of like having one witness to a murder called onto the stand twice under the same name.</p> <p>Anyway, let’s move on.</p> <p>The next step is to instantiate a copy of <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v">the clock interface properties</a>.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="n">fclk</span> <span class="p">#(</span> <span class="p">.</span><span class="n">OPT_SERDES</span><span class="p">(</span><span class="n">OPT_SERDES</span><span class="p">),</span> <span class="p">.</span><span class="n">OPT_DDR</span><span class="p">(</span><span class="n">OPT_DDR</span><span class="p">)</span> <span class="p">)</span> <span class="n">u_ckprop</span> <span class="p">(</span> <span class="p">.</span><span class="n">i_clk</span><span class="p">(</span><span class="n">i_clk</span><span class="p">),</span> <span class="p">.</span><span class="n">i_reset</span><span class="p">(</span><span class="n">i_reset</span><span class="p">),</span> <span class="c1">//</span> <span class="p">.</span><span class="n">i_en</span><span class="p">(</span><span class="n">f_en</span><span class="p">),</span> <span class="p">.</span><span class="n">i_ckspd</span><span class="p">(</span><span class="n">o_ckspd</span><span class="p">),</span> <span class="p">.</span><span class="n">i_clk90</span><span class="p">(</span><span class="n">clk90</span><span class="p">),</span> <span class="c1">//</span> <span class="p">.</span><span class="n">i_ckstb</span><span class="p">(</span><span class="n">o_ckstb</span><span class="p">),</span> <span class="p">.</span><span class="n">i_hlfck</span><span class="p">(</span><span class="n">o_hlfck</span><span class="p">),</span> <span class="p">.</span><span class="n">i_ckwide</span><span class="p">(</span><span class="n">o_ckwide</span><span class="p">),</span> <span class="c1">//</span> <span class="p">.</span><span 
class="n">f_pending_reset</span><span class="p">(</span><span class="n">f_pending_reset</span><span class="p">),</span> <span class="p">.</span><span class="n">f_pending_half</span><span class="p">(</span><span class="n">f_pending_half</span><span class="p">)</span> <span class="p">);</span></code></pre></figure> <p>See how simple that was?</p> <p>In addition to the assertions within <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v">this property set</a>, <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v">the property set</a> provides two output signals that we can use to connect the state of our design to the internal state of <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v">the property set</a>. These signals are:</p> <ul> <li> <p><code class="language-plaintext highlighter-rouge">f_pending_reset</code></p> <p>This otherwise annoying signal is required so we can handle the clock anomalies between reset and the first clock strobe. This signal is set on a reset, and released once the clock gets started.</p> </li> <li> <p><code class="language-plaintext highlighter-rouge">f_pending_half</code></p> <p>This signal is simpler. It simply means that we’ve seen the <code class="language-plaintext highlighter-rouge">new_edge</code> (<code class="language-plaintext highlighter-rouge">o_ckstb</code>) and not the <code class="language-plaintext highlighter-rouge">half_edge</code>, herein called <code class="language-plaintext highlighter-rouge">o_hlfck</code>.
If <code class="language-plaintext highlighter-rouge">f_pending_half</code> is true, then the clock must generate <code class="language-plaintext highlighter-rouge">o_hlfck</code> before it can generate <code class="language-plaintext highlighter-rouge">o_ckstb</code>.</p> </li> </ul> <p>With these signals, we can express things like this:</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">i_reset</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">o_hlfck</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">o_ckstb</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">f_pending_reset</span><span class="p">)</span> <span class="k">assert</span><span class="p">(</span><span class="n">f_pending_half</span> <span class="o">==</span> <span class="p">(</span><span class="n">counter</span><span class="p">[</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">1</span><span class="o">:</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span> <span class="o">&lt;</span> <span class="mb">2'b10</span><span class="p">));</span></code></pre></figure> <p>This helps us through long periods of time with neither <code class="language-plaintext highlighter-rouge">o_hlfck</code> nor <code class="language-plaintext highlighter-rouge">o_ckstb</code>.
During this time, <code class="language-plaintext highlighter-rouge">f_pending_half</code> should be equivalent to the top two bits of our counter being either <code class="language-plaintext highlighter-rouge">2'b00</code> or <code class="language-plaintext highlighter-rouge">2'b01</code>.</p> <p>Let’s look at some other assertions.</p> <p>For example, if we shut the clock down, then we shouldn’t get any more new edges, <code class="language-plaintext highlighter-rouge">o_ckstb</code>:</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">i_clk</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="n">f_past_valid</span><span class="p">)</span> <span class="k">begin</span> <span class="k">if</span> <span class="p">($</span><span class="nb">past</span><span class="p">(</span><span class="o">!</span><span class="n">i_reset</span> <span class="o">&amp;&amp;</span> <span class="n">i_cfg_shutdown</span><span class="p">))</span> <span class="k">begin</span> <span class="k">assert</span><span class="p">(</span><span class="o">!</span><span class="n">o_ckstb</span><span class="p">);</span> <span class="k">end</span></code></pre></figure> <p>Now we can look at some of the specific options. For example, the clock speed should only be zero (200MHz) if <code class="language-plaintext highlighter-rouge">OPT_SERDES</code> is set. 
While set to zero, either <code class="language-plaintext highlighter-rouge">o_ckstb</code> should be set on every clock cycle or we should’ve received a clock shutdown request.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">if</span> <span class="p">(</span><span class="n">ckspd</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="k">begin</span> <span class="k">assert</span><span class="p">(</span><span class="n">OPT_SERDES</span><span class="p">);</span> <span class="k">assert</span><span class="p">(</span><span class="n">o_ckstb</span> <span class="o">||</span> <span class="p">$</span><span class="nb">past</span><span class="p">(</span><span class="n">i_cfg_shutdown</span><span class="p">));</span> <span class="k">assert</span><span class="p">(</span><span class="n">counter</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">||</span><span class="n">counter</span> <span class="o">==</span> <span class="o">{</span><span class="mb">2'b11</span><span class="p">,</span><span class="o">{</span><span class="p">(</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">2</span><span class="p">)</span><span class="o">{</span><span class="mb">1'b0</span><span class="o">}}</span> <span class="o">}</span><span class="p">);</span> <span class="k">end</span></code></pre></figure> <p>Likewise, we should only ever be in a clock speed of 1 (100MHz) if either <code class="language-plaintext highlighter-rouge">OPT_SERDES</code> or <code class="language-plaintext highlighter-rouge">OPT_DDR</code> are set. 
Further, if <code class="language-plaintext highlighter-rouge">OPT_SERDES</code> is not set, we shouldn’t ever be implementing a 90 degree clock offset.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">if</span> <span class="p">(</span><span class="n">ckspd</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span> <span class="k">begin</span> <span class="k">assert</span><span class="p">(</span><span class="n">OPT_SERDES</span> <span class="o">||</span> <span class="n">OPT_DDR</span><span class="p">);</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">OPT_SERDES</span><span class="p">)</span> <span class="k">begin</span> <span class="k">assert</span><span class="p">(</span><span class="o">!</span><span class="n">clk90</span><span class="p">);</span> <span class="k">end</span> <span class="k">assert</span><span class="p">(</span><span class="n">counter</span> <span class="o">==</span> <span class="o">{</span><span class="mb">2'b11</span><span class="p">,</span><span class="o">{</span><span class="p">(</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">2</span><span class="p">)</span><span class="o">{</span><span class="mb">1'b0</span><span class="o">}}</span> <span class="o">}</span><span class="p">);</span> <span class="k">end</span></code></pre></figure> <p>A clock speed of two (50MHz) is available to all configurations. 
In this case, the bottom bits–the non-phase description bits–must always be zero.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">if</span> <span class="p">(</span><span class="n">ckspd</span> <span class="o">==</span> <span class="mi">2</span><span class="p">)</span> <span class="k">assert</span><span class="p">(</span><span class="n">counter</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">counter</span> <span class="o">==</span> <span class="o">{</span><span class="mb">2'b01</span><span class="p">,</span><span class="o">{</span><span class="p">(</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">2</span><span class="p">)</span><span class="o">{</span><span class="mb">1'b0</span><span class="o">}}</span> <span class="o">}</span> <span class="o">||</span> <span class="n">counter</span> <span class="o">==</span> <span class="o">{</span><span class="mb">2'b10</span><span class="p">,</span><span class="o">{</span><span class="p">(</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">2</span><span class="p">)</span><span class="o">{</span><span class="mb">1'b0</span><span class="o">}}</span> <span class="o">}</span> <span class="o">||</span> <span class="n">counter</span> <span class="o">==</span> <span class="o">{</span><span class="mb">2'b11</span><span class="p">,</span><span class="o">{</span><span class="p">(</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">2</span><span class="p">)</span><span class="o">{</span><span class="mb">1'b0</span><span class="o">}}</span> <span class="o">}</span><span class="p">);</span></code></pre></figure> <p>Finally, in all other clock speeds, all we insist is that the lower bits of the counter be no greater than the clock speed minus three.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">if</span> <span
class="p">(</span><span class="n">ckspd</span> <span class="o">&gt;=</span> <span class="mi">3</span><span class="p">)</span> <span class="k">assert</span><span class="p">(</span><span class="n">counter</span><span class="p">[</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">3</span><span class="o">:</span><span class="mi">0</span><span class="p">]</span> <span class="o">&lt;=</span> <span class="p">(</span><span class="n">ckspd</span><span class="o">-</span><span class="mi">3</span><span class="p">));</span> <span class="k">end</span></code></pre></figure> <p>There are only two ways both <code class="language-plaintext highlighter-rouge">o_ckstb</code> and <code class="language-plaintext highlighter-rouge">o_hlfck</code> can be true at once. The first is if the speed indicates either 200MHz or 100MHz. The second is if the clock is stopped, and so the wide clock output is zero and a new clock is expected on the next clock cycle.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">i_reset</span> <span class="o">&amp;&amp;</span> <span class="n">o_ckstb</span> <span class="o">&amp;&amp;</span> <span class="n">o_hlfck</span><span class="p">)</span> <span class="k">assert</span><span class="p">(</span><span class="n">ckspd</span> <span class="o">&lt;=</span> <span class="mi">1</span> <span class="o">||</span> <span class="p">(</span><span class="n">o_ckwide</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">&amp;&amp;</span> <span class="n">nxt_clk</span><span class="p">));</span></code></pre></figure> <p>The difficult part of these assertions is that these aren’t enough to limit the output of the <a 
href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">clock generator</a>. Just to make certain the outputs are properly limited, I enumerate each together with the conditions they may be produced.</p> <p>We’ll start with a zero output. This can come from either a stopped clock, or one of two slow clock situations.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">i_reset</span><span class="p">)</span> <span class="k">case</span><span class="p">(</span><span class="n">o_ckwide</span><span class="p">)</span> <span class="mh">8'h00</span><span class="o">:</span> <span class="k">if</span> <span class="p">(</span><span class="n">nxt_clk</span><span class="p">)</span> <span class="k">begin</span> <span class="c1">// A stopped clock</span> <span class="k">assert</span><span class="p">(</span><span class="n">counter</span> <span class="o">==</span> <span class="o">{</span><span class="mb">2'b11</span><span class="p">,</span><span class="o">{</span><span class="p">(</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">2</span><span class="p">)</span><span class="o">{</span><span class="mb">1'b0</span><span class="o">}}</span> <span class="o">}</span> <span class="o">||</span> <span class="n">ckspd</span> <span class="o">==</span> <span class="mi">0</span><span class="p">);</span> <span class="k">end</span> <span class="k">else</span> <span class="k">if</span><span class="p">(</span><span class="o">!</span><span class="n">clk90</span><span class="p">)</span> <span class="k">begin</span> <span class="c1">// In slow situations with no offset</span> <span class="k">assert</span><span class="p">(</span><span class="n">counter</span><span class="p">[</span><span 
class="n">NCTR</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">==</span> <span class="mb">1'b0</span><span class="p">);</span> <span class="k">end</span> <span class="k">else</span> <span class="k">if</span><span class="p">(</span><span class="n">clk90</span><span class="p">)</span> <span class="k">begin</span> <span class="c1">// In slow (DDR) situations with a 90 degree clock offset</span> <span class="k">assert</span><span class="p">(</span><span class="n">counter</span><span class="p">[</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">1</span><span class="o">:</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span> <span class="o">==</span> <span class="mb">2'b00</span> <span class="o">||</span><span class="n">counter</span><span class="p">[</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">1</span><span class="o">:</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span> <span class="o">==</span> <span class="mb">2'b11</span><span class="p">);</span> <span class="k">end</span></code></pre></figure> <p>An output of <code class="language-plaintext highlighter-rouge">8'h0f</code> means we’re either in speed one with no clock offset and both clock edges active, or we’re in the first half of speed two.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="mh">8'h0f</span><span class="o">:</span> <span class="k">assert</span><span class="p">((</span><span class="o">!</span><span class="n">clk90</span> <span class="o">&amp;&amp;</span> <span class="n">ckspd</span> <span class="o">==</span> <span class="mi">1</span> <span class="o">&amp;&amp;</span> <span class="n">o_ckstb</span> <span class="o">&amp;&amp;</span> <span class="n">o_hlfck</span><span class="p">)</span> <span class="o">||</span><span class="p">(</span><span 
class="n">clk90</span> <span class="o">&amp;&amp;</span> <span class="n">ckspd</span> <span class="o">==</span> <span class="mi">2</span> <span class="o">&amp;&amp;</span> <span class="n">o_ckstb</span><span class="p">));</span></code></pre></figure> <p>An output of <code class="language-plaintext highlighter-rouge">8'hf0</code> can only mean we’re in the second half of speed two.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="mh">8'hf0</span><span class="o">:</span> <span class="k">assert</span><span class="p">(</span><span class="n">clk90</span> <span class="o">&amp;&amp;</span> <span class="n">ckspd</span> <span class="o">==</span> <span class="mi">2</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">o_ckstb</span> <span class="o">&amp;&amp;</span> <span class="n">o_hlfck</span><span class="p">);</span></code></pre></figure> <p>An output of <code class="language-plaintext highlighter-rouge">8'hff</code> is common at slow speeds, but also completely determined by the two top phase bits of the counter.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="mh">8'hff</span><span class="o">:</span> <span class="k">if</span><span class="p">(</span><span class="o">!</span><span class="n">clk90</span><span class="p">)</span> <span class="k">assert</span><span class="p">(</span><span class="n">counter</span><span class="p">[</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">==</span> <span class="mb">1'b1</span><span class="p">);</span> <span class="k">else</span> <span class="k">assert</span><span class="p">(</span><span class="n">counter</span><span class="p">[</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">1</span><span class="o">:</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span> <span
class="o">==</span> <span class="mb">2'b01</span> <span class="o">||</span> <span class="n">counter</span><span class="p">[</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">1</span><span class="o">:</span><span class="n">NCTR</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span> <span class="o">==</span> <span class="mb">2'b10</span><span class="p">);</span></code></pre></figure> <p>The last several outputs are very specific to their settings. <code class="language-plaintext highlighter-rouge">8'h3c</code> is only possible in a speed of 1 with a 90 degree clock offset.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="mh">8'h3c</span><span class="o">:</span> <span class="k">assert</span><span class="p">(</span> <span class="n">clk90</span> <span class="o">&amp;&amp;</span> <span class="n">ckspd</span> <span class="o">==</span> <span class="mi">1</span> <span class="o">&amp;&amp;</span> <span class="n">o_ckstb</span> <span class="o">&amp;&amp;</span> <span class="n">o_hlfck</span><span class="p">);</span></code></pre></figure> <p>That leaves the two possible double-clock outputs. 
First, the double clock with no 90 degree offset.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="mh">8'h33</span><span class="o">:</span> <span class="k">assert</span><span class="p">(</span><span class="o">!</span><span class="n">clk90</span> <span class="o">&amp;&amp;</span> <span class="n">ckspd</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">&amp;&amp;</span> <span class="n">o_ckstb</span> <span class="o">&amp;&amp;</span> <span class="n">o_hlfck</span><span class="p">);</span></code></pre></figure> <p>The last possibility is the double clock with the 90 degree offset.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="mh">8'h66</span><span class="o">:</span> <span class="k">assert</span><span class="p">(</span> <span class="n">clk90</span> <span class="o">&amp;&amp;</span> <span class="n">ckspd</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">&amp;&amp;</span> <span class="n">o_ckstb</span> <span class="o">&amp;&amp;</span> <span class="n">o_hlfck</span><span class="p">);</span></code></pre></figure> <p>Everything else is specifically disallowed.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">default</span><span class="o">:</span> <span class="k">assert</span><span class="p">(</span><span class="mi">0</span><span class="p">);</span> <span class="k">endcase</span></code></pre></figure> <h2 id="interface-file">Interface File</h2> <p>While I might like to leave things there, a full proof of this <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">clock generator</a> requires we go over the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v">formal interface file</a>.</p> <p>Remember, the purpose of the formal interface file is to separate two proofs. 
In this case, we want to both formally verify the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">clock generator</a>, as well as the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdtxframe.v">transmitter data generator</a> that will use the results of the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">clock generator</a>. Further, unlike the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">clock generator</a>, the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdtxframe.v">transmitter data generator</a> doesn’t really care if the signals to and from the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">clock generator</a> are realistic. It only cares that they follow whatever rules it requires–things like either 1) both <code class="language-plaintext highlighter-rouge">new_edge &amp;&amp; half_edge</code> at the same time, or 2) an alternating <code class="language-plaintext highlighter-rouge">new_edge</code> with the <code class="language-plaintext highlighter-rouge">half_edge</code>, and so forth.</p> <p>You can find this <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v">formal interface file</a> among the other files associated with the formal proofs for this design. Although it is written in Verilog, it’s not really something that could or would be synthesized. 
For this reason I keep it in the <code class="language-plaintext highlighter-rouge">bench/formal</code> subdirectory of the project, rather than the <code class="language-plaintext highlighter-rouge">rtl/</code> subdirectory.</p> <p>Starting at the top, our <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v">property set</a> must operate in at least three configurations: 1) an environment where the <code class="language-plaintext highlighter-rouge">wide_clock</code> commands an 8:1 OSERDES, 2) an environment where it commands an <a href="/blog/2020/08/22/oddr.html">ODDR</a> instead, or 3) a simpler environment where neither option is available to us.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"><span class="k">module</span> <span class="n">fclk</span> <span class="p">#(</span> <span class="k">parameter</span> <span class="p">[</span><span class="mi">0</span><span class="o">:</span><span class="mi">0</span><span class="p">]</span> <span class="n">OPT_SERDES</span> <span class="o">=</span> <span class="mb">1'b0</span><span class="p">,</span> <span class="n">OPT_DDR</span> <span class="o">=</span> <span class="mb">1'b0</span> <span class="p">)</span> <span class="p">(</span></code></pre></figure> <p>Yes, we’ll need to run at least <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/sdckgen.sby#L2-L4">3 formal proofs</a>, one for each option, to make sure we’ve truly captured all three. This, however, is just the price of doing business with configurable logic.</p> <p>Our <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v">formal properties</a> will need the same inputs as the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">clock generator</a>.
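As an aside, the three configurations just enumerated, together with the fastest clock each can support (assuming the 100MHz system clock used throughout this article), can be summarized in a quick sketch. The helper function and names below are my own illustration, not anything found in the repository:

```python
# The three hardware configurations the proof must cover, expressed
# as (OPT_SERDES, OPT_DDR) parameter pairs -- an illustrative sketch,
# not code from the sdspi repository.
CONFIGS = {
    "serdes": (True, False),    # 8:1 OSERDES available
    "ddr":    (False, True),    # ODDR only
    "basic":  (False, False),   # neither option available
}

def fastest_ckspd(opt_serdes, opt_ddr):
    """Smallest supported clock divider setting, assuming a 100MHz
    system clock: 0 -> 200MHz, 1 -> 100MHz, 2 -> 50MHz."""
    if opt_serdes:
        return 0    # 200MHz requires the full 8:1 OSERDES
    if opt_ddr:
        return 1    # an ODDR can produce one full output cycle per clock
    return 2        # plain logic tops out at 50MHz

assert [fastest_ckspd(*CONFIGS[k]) for k in ("serdes", "ddr", "basic")] \
        == [0, 1, 2]
```

One formal proof is run per entry in this table, which is why the SymbiYosys script defines multiple tasks.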
The outputs of the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">clock generator</a> also need to be listed as inputs to this <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v">property set</a>. While the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v">formal property set</a> will primarily consist of assertions and assumptions, it will also produce two outputs–as discussed above. These are necessary for making sure the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v">formal property set</a>’s state is consistent with the internal state of the design.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="kt">input</span> <span class="kt">wire</span> <span class="n">i_clk</span><span class="p">,</span> <span class="n">i_reset</span><span class="p">,</span> <span class="c1">//</span> <span class="kt">input</span> <span class="kt">wire</span> <span class="n">i_en</span><span class="p">,</span> <span class="kt">input</span> <span class="kt">wire</span> <span class="p">[</span><span class="mi">7</span><span class="o">:</span><span class="mi">0</span><span class="p">]</span> <span class="n">i_ckspd</span><span class="p">,</span> <span class="kt">input</span> <span class="kt">wire</span> <span class="n">i_clk90</span><span class="p">,</span> <span class="c1">//</span> <span class="kt">input</span> <span class="kt">wire</span> <span class="n">i_ckstb</span><span class="p">,</span> <span class="n">i_hlfck</span><span class="p">,</span> <span class="kt">input</span> <span class="kt">wire</span> <span class="p">[</span><span class="mi">7</span><span class="o">:</span><span class="mi">0</span><span class="p">]</span> <span class="n">i_ckwide</span><span class="p">,</span> <span class="c1">//</span> <span 
class="kt">output</span> <span class="kt">reg</span> <span class="n">f_pending_reset</span><span class="p">,</span> <span class="kt">output</span> <span class="kt">reg</span> <span class="n">f_pending_half</span> <span class="p">);</span></code></pre></figure> <p>Some of you may recall the <a href="/formal/2018/12/18/skynet.html">challenges I’ve struggled through when trying to verify two co-dependent components</a>. My original approach was to <a href="/formal/2018/04/23/invariant.html">swap assumptions and assertions</a> between the two components. This <a href="/formal/2018/12/18/skynet.html">didn’t work</a>, primarily because it was possible for the resulting <em>assumptions</em> to render one or more assertions to be irrelevant or vacuous. In that example, the logic of a design acted as an assumption as well.</p> <p>In our case, we’re going to disconnect the two designs that will use this property set entirely. The <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">clock generator (the master)</a> will make assertions that the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdtxframe.v">transmitter data generator</a> will later assume, and vice versa. To make this work, we’ll have the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/sdckgen.sby">SymbiYosys script</a> for the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">clock generator</a> define a <code class="language-plaintext highlighter-rouge">CKGEN</code> macro. 
This will then tell us whether this property set is being used as part of the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">clock generator</a>’s proof, or the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdtxframe.v">transmitter data generator</a>’s. If a part of the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">clock generator</a>’s proof, we’ll make assertions about our outputs. If a part of the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdtxframe.v">transmitter data generator</a>’s proof, those “outputs” will now be inputs of the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdtxframe.v">transmitter data generator</a>, and so we should be making assumptions about them instead. To do this, we’ll create a macro, <code class="language-plaintext highlighter-rouge">SLAVE_ASSUME</code>, that can be used to describe properties of these outputs with either <code class="language-plaintext highlighter-rouge">assert</code> or <code class="language-plaintext highlighter-rouge">assume</code> statements.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"><span class="cp">`ifdef</span> <span class="n">CKGEN</span> <span class="cp">`define</span> SLAVE_ASSUME assert <span class="err">//</span> Clock generator proof<span class="cp"> `else</span> <span class="cp">`define</span> SLAVE_ASSUME assume <span class="err">//</span> Transmit data generator proof<span class="cp"> `endif</span></code></pre></figure> <p>The next step is boilerplate: create an <code class="language-plaintext highlighter-rouge">f_past_valid</code> register to let us know if we can use the <code class="language-plaintext highlighter-rouge">$past()</code> function or not.
(Remember, <code class="language-plaintext highlighter-rouge">$past()</code>’s value is invalid on the first clock of any proof.)</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="kt">reg</span> <span class="n">f_past_tick</span><span class="p">,</span> <span class="n">f_past_valid</span><span class="p">;</span> <span class="kt">reg</span> <span class="n">last_reset</span><span class="p">,</span> <span class="n">last_en</span><span class="p">,</span> <span class="n">last_pending</span><span class="p">;</span> <span class="kt">reg</span> <span class="p">[</span><span class="mi">7</span><span class="o">:</span><span class="mi">0</span><span class="p">]</span> <span class="n">last_ckspd</span><span class="p">;</span> <span class="k">initial</span> <span class="n">f_past_valid</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">i_clk</span><span class="p">)</span> <span class="n">f_past_valid</span> <span class="o">&lt;=</span> <span class="mi">1</span><span class="p">;</span></code></pre></figure> <p>Likewise, <code class="language-plaintext highlighter-rouge">f_pending_reset</code> will be true between the <code class="language-plaintext highlighter-rouge">i_reset</code> signal and the first clock edge.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">initial</span> <span class="n">f_pending_reset</span> <span class="o">=</span> <span class="mb">1'b0</span><span class="p">;</span> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">i_clk</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="n">i_reset</span><span class="p">)</span> <span class="n">f_pending_reset</span> <span class="o">&lt;=</span>
<span class="mb">1'b1</span><span class="p">;</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">i_ckstb</span> <span class="o">||</span> <span class="n">i_hlfck</span><span class="p">)</span> <span class="n">f_pending_reset</span> <span class="o">&lt;=</span> <span class="mb">1'b0</span><span class="p">;</span></code></pre></figure> <p>Our second output, <code class="language-plaintext highlighter-rouge">f_pending_half</code>, is true from the top of the clock to the second half of the clock, but <em>only</em> if the top of the clock didn’t include the <code class="language-plaintext highlighter-rouge">half_edge</code> signal (called <code class="language-plaintext highlighter-rouge">i_hlfck</code> herein).</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">initial</span> <span class="n">f_pending_half</span> <span class="o">=</span> <span class="mb">1'b0</span><span class="p">;</span> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">i_clk</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="n">i_reset</span><span class="p">)</span> <span class="n">f_pending_half</span> <span class="o">&lt;=</span> <span class="mb">1'b0</span><span class="p">;</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">i_ckstb</span><span class="p">)</span> <span class="n">f_pending_half</span> <span class="o">&lt;=</span> <span class="o">!</span><span class="n">i_hlfck</span><span class="p">;</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">i_hlfck</span><span class="p">)</span> <span class="n">f_pending_half</span> <span class="o">&lt;=</span> <span class="mb">1'b0</span><span class="p">;</span></code></pre></figure> <p>A third signal, <code class="language-plaintext 
f_past_tick">
highlighter-rouge">f_past_tick</code>, will allow us to reason about whether or not we just passed an edge. We’ll get to this one in a bit.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">initial</span> <span class="n">f_past_tick</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">i_clk</span><span class="p">)</span> <span class="n">f_past_tick</span> <span class="o">&lt;=</span> <span class="n">i_ckstb</span> <span class="o">||</span> <span class="n">i_hlfck</span><span class="p">;</span></code></pre></figure> <p>Now that we have these two signals, we can state with certainty that we can’t start a new clock cycle while waiting for the second half of a clock cycle. Likewise, if we are in the second half of a clock cycle, we shouldn’t see the half edge again unless we’re starting a new (and high speed) clock.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">i_clk</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">i_reset</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">f_pending_reset</span><span class="p">)</span> <span class="k">begin</span> <span class="k">if</span> <span class="p">(</span><span class="n">f_pending_half</span><span class="p">)</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="o">!</span><span class="n">i_ckstb</span><span class="p">);</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">i_hlfck</span><span class="p">)</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span
class="n">i_ckstb</span><span class="p">);</span> <span class="k">end</span></code></pre></figure> <p>With this as background, we can now make assertions about our various clock speeds, and the outputs that should be produced in each. Note that in this <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v">formal property set</a>, the <code class="language-plaintext highlighter-rouge">i_ckspd</code> input reflects our <em>current</em> clock speed, and not just the <em>requested</em> clock speed that we worked with in the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">clock generator</a>. Hence, it is an <em>output</em> of the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">clock generator</a>, and no longer the requested clock speed.</p> <p>Let’s start with the highest speed (200MHz) clock output.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">i_clk</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">i_reset</span><span class="p">)</span> <span class="k">case</span><span class="p">(</span><span class="n">i_ckspd</span><span class="p">)</span> <span class="mi">0</span><span class="o">:</span> <span class="k">begin</span> <span class="c1">// We can only run in this speed if OPT_SERDES is set.</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="n">OPT_SERDES</span><span class="p">);</span> <span class="c1">// This speed has no pending half cycles.
All clock cycles</span> <span class="c1">// are complete in one cycle.</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="n">f_pending_reset</span> <span class="o">||</span> <span class="o">!</span><span class="n">f_pending_half</span><span class="p">);</span> <span class="k">if</span> <span class="p">(</span><span class="n">i_ckwide</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="k">begin</span> <span class="c1">// Clock is either *off*/inactive, or we're still coming</span> <span class="c1">// out of a reset.</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="n">f_pending_reset</span> <span class="o">||</span> <span class="p">(</span><span class="o">!</span><span class="n">i_ckstb</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">i_hlfck</span><span class="p">));</span> <span class="k">end</span> <span class="k">else</span> <span class="k">begin</span> <span class="c1">// Clock is active, both edges are active in a clock</span> <span class="c1">// tick</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="n">i_ckstb</span> <span class="o">&amp;&amp;</span> <span class="n">i_hlfck</span><span class="p">);</span> <span class="k">end</span></code></pre></figure> <p>The <code class="language-plaintext highlighter-rouge">wide_clock</code> output, herein called <code class="language-plaintext highlighter-rouge">i_ckwide</code>, can only have one of two values when active at this speed.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">if</span> <span class="p">(</span><span class="n">i_clk90</span><span class="p">)</span> <span class="k">begin</span> <span class="c1">// In the case of a 90 degree offset clock, if the</span> <span class="c1">// clock is active, it must be 0110_0110</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span 
class="n">i_ckwide</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">i_ckwide</span> <span class="o">==</span> <span class="mh">8'h66</span><span class="p">);</span> <span class="k">end</span> <span class="k">else</span> <span class="k">begin</span> <span class="c1">// Otherwise, if the clock is active, it must be</span> <span class="c1">// 0011_0011</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="n">i_ckwide</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">i_ckwide</span> <span class="o">==</span> <span class="mh">8'h33</span><span class="p">);</span> <span class="k">end</span> <span class="k">end</span></code></pre></figure> <p>Those are just the rules for 200MHz (assuming a 100MHz system clock).</p> <p>Now let’s drop down a speed, and look at the 100MHz clock. In this mode, the new edge and half edge signals must also be present on the same clock. Likewise, there’s no allowable means to have a pending second half–the first and second half must always show up on the same clock cycle.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="mi">1</span><span class="o">:</span> <span class="k">begin</span> <span class="k">if</span> <span class="p">(</span><span class="n">i_ckwide</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="k">begin</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="n">f_pending_reset</span> <span class="o">||</span> <span class="p">(</span><span class="o">!</span><span class="n">i_ckstb</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">i_hlfck</span><span class="p">));</span> <span class="k">end</span> <span class="k">else</span> <span class="k">begin</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="n">i_ckstb</span> <span 
class="o">&amp;&amp;</span> <span class="n">i_hlfck</span><span class="p">);</span> <span class="k">end</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">f_pending_reset</span><span class="p">)</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="o">!</span><span class="n">f_pending_half</span><span class="p">);</span></code></pre></figure> <p>At 100MHz, the outgoing wide clock can only be <code class="language-plaintext highlighter-rouge">0011_1100</code> (90 degree offset) or <code class="language-plaintext highlighter-rouge">0000_1111</code>. The former requires <code class="language-plaintext highlighter-rouge">OPT_SERDES</code>, the latter may also be possible in <code class="language-plaintext highlighter-rouge">OPT_DDR</code> mode–since the first four bits all match one another, as do the last four.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">if</span> <span class="p">(</span><span class="n">i_clk90</span><span class="p">)</span> <span class="k">begin</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="n">i_ckwide</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">i_ckwide</span> <span class="o">==</span> <span class="mh">8'h3c</span><span class="p">);</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="n">OPT_SERDES</span><span class="p">);</span> <span class="k">end</span> <span class="k">else</span> <span class="k">begin</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="n">i_ckwide</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">i_ckwide</span> <span class="o">==</span> <span class="mh">8'h0f</span><span class="p">);</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="n">OPT_SERDES</span> <span class="o">||</span> <span
class="n">OPT_DDR</span><span class="p">);</span> <span class="k">end</span> <span class="k">end</span></code></pre></figure> <p>Our last special clock speed is 50MHz. For this case, we break our properties into two parts: the 90 degree offset, and the normal (SDR) case.</p> <p>For the 90 degree offset clock, the clock must either be <code class="language-plaintext highlighter-rouge">0000_1111</code> if we’re not waiting on the next half clock cycle, or <code class="language-plaintext highlighter-rouge">1111_0000</code> if we are. Likewise, either the new or half edge signal must be true on every cycle. The only exception is if/when the clock is stopped. Further, this output will require either <code class="language-plaintext highlighter-rouge">OPT_SERDES</code> or <code class="language-plaintext highlighter-rouge">OPT_DDR</code>.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="mi">2</span><span class="o">:</span> <span class="k">begin</span> <span class="k">if</span> <span class="p">(</span><span class="n">i_clk90</span><span class="p">)</span> <span class="k">begin</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="n">i_ckwide</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">i_ckwide</span> <span class="o">==</span> <span class="mh">8'h0f</span> <span class="o">||</span> <span class="n">i_ckwide</span> <span class="o">==</span> <span class="mh">8'hf0</span><span class="p">);</span> <span class="k">if</span> <span class="p">(</span><span class="n">i_en</span><span class="p">)</span> <span class="k">begin</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="n">i_ckwide</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">);</span> <span class="k">end</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="n">OPT_SERDES</span> <span class="o">||</span> <span
class="n">OPT_DDR</span><span class="p">);</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">f_pending_reset</span> <span class="o">&amp;&amp;</span> <span class="n">f_pending_half</span><span class="p">)</span> <span class="k">begin</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="n">i_ckwide</span> <span class="o">==</span> <span class="mh">8'hf0</span><span class="p">);</span> <span class="k">end</span> <span class="k">if</span> <span class="p">(</span><span class="n">i_ckwide</span> <span class="o">==</span> <span class="mh">8'h00</span><span class="p">)</span> <span class="k">begin</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="o">!</span><span class="n">i_ckstb</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">i_hlfck</span><span class="p">);</span> <span class="k">end</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">i_ckwide</span> <span class="o">==</span> <span class="mh">8'h0f</span><span class="p">)</span> <span class="k">begin</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="n">i_ckstb</span><span class="p">);</span> <span class="k">end</span> <span class="k">else</span> <span class="k">begin</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="n">i_hlfck</span><span class="p">);</span> <span class="k">end</span></code></pre></figure> <p>The normal offset is simpler. This doesn’t require <code class="language-plaintext highlighter-rouge">OPT_SERDES</code> or <code class="language-plaintext highlighter-rouge">OPT_DDR</code>. The wide clock can either be <code class="language-plaintext highlighter-rouge">0000_0000</code> or <code class="language-plaintext highlighter-rouge">1111_1111</code>. 
Further, if ever the clock output is <code class="language-plaintext highlighter-rouge">1111_1111</code>, then we must be on the second half edge.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">end</span> <span class="k">else</span> <span class="k">begin</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="n">i_ckwide</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">i_ckwide</span> <span class="o">==</span> <span class="mh">8'hff</span><span class="p">);</span> <span class="k">if</span> <span class="p">(</span><span class="n">i_ckwide</span> <span class="o">==</span> <span class="mh">8'hff</span><span class="p">)</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="n">i_hlfck</span><span class="p">);</span> <span class="k">end</span> <span class="k">end</span></code></pre></figure> <p>This brings us to the default clock–the very slow clock generated by integer division (i.e. the counter). 
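Before looking at its assertions, here is a small Python model of how such a divided clock behaves. The divider framing below is my own hypothetical simplification (the real sdckgen.v counter differs in its details), but it illustrates the strobe and wide-clock relationship the default case pins down:

```python
# Hypothetical model of the counter-divided slow clock: low for
# `div` system clocks, then high for `div` system clocks.  The
# ckstb/hlfck/ckwide names mirror the Verilog, but this is an
# illustration only, not the actual sdckgen.v counter.
def slow_clock(div, cycles):
    """Yield (ckstb, hlfck, ckwide) for each system clock tick."""
    out = []
    for t in range(cycles):
        phase = t % (2 * div)
        ckstb = (phase == 0)       # new clock cycle begins (low half)
        hlfck = (phase == div)     # second (high) half begins
        ckwide = 0x00 if phase < div else 0xFF
        out.append((ckstb, hlfck, ckwide))
    return out

trace = slow_clock(4, 16)
# The new-edge strobe coincides with the wide clock dropping to 0x00,
# the half-edge strobe with it rising to 0xFF -- never both at once.
assert all(not (s and h) for s, h, _ in trace)
assert all(w == 0x00 for s, h, w in trace if s)
assert all(w == 0xFF for s, h, w in trace if h)
```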
As before, the wide clock can either be <code class="language-plaintext highlighter-rouge">0000_0000</code> or <code class="language-plaintext highlighter-rouge">1111_1111</code> and hence needs no special hardware such as either <code class="language-plaintext highlighter-rouge">OPT_SERDES</code> or <code class="language-plaintext highlighter-rouge">OPT_DDR</code>.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">default</span><span class="o">:</span> <span class="k">begin</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="n">i_ckwide</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">i_ckwide</span> <span class="o">==</span> <span class="mh">8'hff</span><span class="p">);</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">f_pending_reset</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">i_clk90</span> <span class="o">&amp;&amp;</span> <span class="n">last_en</span> <span class="o">&amp;&amp;</span> <span class="n">i_en</span><span class="p">)</span> <span class="k">begin</span> <span class="k">if</span> <span class="p">(</span><span class="n">i_ckstb</span><span class="p">)</span> <span class="k">begin</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="n">i_ckwide</span> <span class="o">==</span> <span class="mh">8'h00</span><span class="p">);</span> <span class="k">end</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">i_hlfck</span><span class="p">)</span> <span class="k">begin</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="n">i_ckwide</span> <span class="o">==</span> <span class="mh">8'hff</span><span class="p">);</span> <span class="k">end</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span 
class="n">f_pending_half</span><span class="p">)</span> <span class="k">begin</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="n">i_ckwide</span> <span class="o">==</span> <span class="mh">8'h00</span><span class="p">);</span> <span class="k">end</span> <span class="k">else</span> <span class="c1">// if (!f_pending_half)</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="n">i_ckwide</span> <span class="o">==</span> <span class="mh">8'hff</span><span class="p">);</span> <span class="k">end</span> <span class="k">end</span> <span class="k">endcase</span></code></pre></figure> <p>Just as a quick sanity check, if we have no special hardware, then the new and half edges can never both be true on the same cycle.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">i_clk</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">OPT_SERDES</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">OPT_DDR</span><span class="p">)</span> <span class="k">assert</span><span class="p">(</span><span class="o">!</span><span class="n">i_ckstb</span> <span class="o">||</span> <span class="o">!</span><span class="n">i_hlfck</span><span class="p">);</span></code></pre></figure> <p>Let’s come back and double-check the high speed cases. These are the only cases where both new and half edge may be allowed at the same time.
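This strobe discipline can be summarized in a quick Python sketch. This is an illustrative model of the rules just described, using my own names, not anything taken from fclk.v:

```python
# Illustrative model of the per-speed strobe rules: at full rate
# (ckspd <= 1) the new-edge and half-edge strobes arrive together
# (or the clock is idle); at any divided rate they must alternate.
def legal_strobes(ckspd, new_edge, half_edge, pending_half):
    if ckspd <= 1:
        # Both strobes together, or a stopped clock -- never just one.
        return new_edge == half_edge
    if new_edge and half_edge:
        return False                   # never both at a divided rate
    if pending_half and new_edge:
        return False                   # can't restart mid-cycle
    return True

assert legal_strobes(0, True, True, False)      # full-rate tick
assert legal_strobes(2, True, False, False)     # divided: new edge...
assert legal_strobes(2, False, True, True)      # ...then half edge
assert not legal_strobes(2, True, True, False)  # never both when divided
```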
In all other cases, one or both signals should be zero.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">i_clk</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="n">f_past_valid</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">last_reset</span> <span class="o">&amp;&amp;</span> <span class="p">(</span><span class="n">last_en</span> <span class="o">||</span> <span class="n">i_ckstb</span> <span class="o">||</span> <span class="n">i_hlfck</span><span class="p">))</span> <span class="k">begin</span> <span class="k">case</span><span class="p">(</span><span class="n">i_ckspd</span><span class="p">)</span> <span class="mi">0</span><span class="o">:</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="o">!</span><span class="n">i_en</span> <span class="o">||</span> <span class="p">(</span><span class="n">i_ckstb</span> <span class="o">&amp;&amp;</span> <span class="n">i_hlfck</span><span class="p">));</span> <span class="mi">1</span><span class="o">:</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="o">!</span><span class="n">i_en</span> <span class="o">||</span> <span class="p">(</span><span class="n">i_ckstb</span> <span class="o">&amp;&amp;</span> <span class="n">i_hlfck</span><span class="p">));</span> <span class="nl">default:</span> <span class="cp">`SLAVE_ASSUME</span><span class="p">(</span><span class="o">!</span><span class="n">i_ckstb</span> <span class="o">||</span> <span class="o">!</span><span class="n">i_hlfck</span><span class="p">);</span> <span class="k">endcase</span> <span class="k">end</span></code></pre></figure> <p>Feel free to check the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v">property 
set</a> out yourself. While there are a couple more properties to it, these are the most significant.</p> <h2 id="coverage-checking">Coverage Checking</h2> <p>Any good verification set should include not just a simulation, not just formal induction-based proofs, but also a set of coverage checks. These are critical to making sure you haven’t (accidentally) assumed away some key component of the device’s operation. Were that to happen, then the formal proof would be irrelevant–even if it did pass.</p> <p>Hence, we add some cover properties here to the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">clock generator</a>.</p> <p>The first step is just to check if the clock is active, and if so, what mode it is active in.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="kt">reg</span> <span class="n">cvr_active</span><span class="p">,</span> <span class="n">cvr_clk90</span><span class="p">;</span> <span class="kt">reg</span> <span class="p">[</span><span class="mi">7</span><span class="o">:</span><span class="mi">0</span><span class="p">]</span> <span class="n">cvr_spd</span><span class="p">,</span> <span class="n">cvr_count</span><span class="p">;</span> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">i_clk</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">cvr_active</span><span class="p">)</span> <span class="k">begin</span> <span class="n">cvr_spd</span> <span class="o">&lt;=</span> <span class="n">i_cfg_ckspd</span><span class="p">;</span> <span class="n">cvr_clk90</span> <span class="o">&lt;=</span> <span class="n">i_cfg_clk90</span><span class="p">;</span> <span class="k">end</span> <span class="k">initial</span> <span class="n">cvr_active</span> <span class="o">=</span> <span class="mi">0</span><span
class="p">;</span> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">i_clk</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="n">i_reset</span><span class="p">)</span> <span class="n">cvr_active</span> <span class="o">&lt;=</span> <span class="mb">1'b0</span><span class="p">;</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">cvr_spd</span> <span class="o">!=</span> <span class="n">o_ckspd</span> <span class="o">||</span> <span class="n">cvr_spd</span> <span class="o">!=</span> <span class="n">i_cfg_ckspd</span> <span class="o">||</span> <span class="o">!</span><span class="n">f_en</span> <span class="o">||</span> <span class="n">cvr_clk90</span> <span class="o">!=</span> <span class="n">i_cfg_clk90</span> <span class="o">||</span> <span class="n">cvr_clk90</span> <span class="o">!=</span> <span class="n">clk90</span><span class="p">)</span> <span class="c1">// We want to prove what our clock output can do over</span> <span class="c1">// time, not so much what happens when/if it changes.</span> <span class="n">cvr_active</span> <span class="o">&lt;=</span> <span class="mi">0</span><span class="p">;</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">o_ckstb</span><span class="p">)</span> <span class="n">cvr_active</span> <span class="o">&lt;=</span> <span class="mi">1</span><span class="p">;</span></code></pre></figure> <p>If the clock is active, we can then start counting every new edge that takes place while active.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">i_clk</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="n">i_reset</span> <span 
class="o">||</span> <span class="o">!</span><span class="n">cvr_active</span><span class="p">)</span> <span class="n">cvr_count</span> <span class="o">&lt;=</span> <span class="mb">8'b0</span><span class="p">;</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">o_ckstb</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="p">(</span><span class="o">&amp;</span><span class="n">cvr_count</span><span class="p">))</span> <span class="c1">// Don't allow the counter to overflow, but otherwise</span> <span class="c1">// count the beginnings of each clock cycle.</span> <span class="n">cvr_count</span> <span class="o">&lt;=</span> <span class="n">cvr_count</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span></code></pre></figure> <p>With that as background, we can start looking at traces! Let’s get cover traces for a variety of potential frequencies.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">i_clk</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">i_reset</span><span class="p">)</span> <span class="k">begin</span> <span class="k">cover</span><span class="p">(</span><span class="n">cvr_spd</span> <span class="o">==</span> <span class="mi">2</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">clk90</span> <span class="o">&amp;&amp;</span> <span class="n">cvr_count</span> <span class="o">&gt;</span> <span class="mi">2</span><span class="p">);</span> <span class="c1">// 50MHz</span> <span class="k">cover</span><span class="p">(</span><span class="n">cvr_spd</span> <span class="o">==</span> <span class="mi">3</span> <span class="o">&amp;&amp;</span> <span class="n">clk90</span> <span class="o">&amp;&amp;</span> 
<span class="n">cvr_count</span> <span class="o">&gt;</span> <span class="mi">2</span><span class="p">);</span> <span class="c1">// 25MHz</span> <span class="k">cover</span><span class="p">(</span><span class="n">cvr_spd</span> <span class="o">==</span> <span class="mi">3</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">clk90</span> <span class="o">&amp;&amp;</span> <span class="n">cvr_count</span> <span class="o">&gt;</span> <span class="mi">2</span><span class="p">);</span> <span class="k">cover</span><span class="p">(</span><span class="n">cvr_spd</span> <span class="o">==</span> <span class="mi">4</span> <span class="o">&amp;&amp;</span> <span class="n">clk90</span> <span class="o">&amp;&amp;</span> <span class="n">cvr_count</span> <span class="o">&gt;</span> <span class="mi">2</span><span class="p">);</span> <span class="c1">// 12MHz</span> <span class="k">cover</span><span class="p">(</span><span class="n">cvr_spd</span> <span class="o">==</span> <span class="mi">4</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">clk90</span> <span class="o">&amp;&amp;</span> <span class="n">cvr_count</span> <span class="o">&gt;</span> <span class="mi">2</span><span class="p">);</span> <span class="k">cover</span><span class="p">(</span><span class="n">cvr_spd</span> <span class="o">==</span> <span class="mi">5</span> <span class="o">&amp;&amp;</span> <span class="n">clk90</span> <span class="o">&amp;&amp;</span> <span class="n">cvr_count</span> <span class="o">&gt;</span> <span class="mi">2</span><span class="p">);</span> <span class="c1">// 8MHz</span> <span class="k">cover</span><span class="p">(</span><span class="n">cvr_spd</span> <span class="o">==</span> <span class="mi">5</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">clk90</span> <span class="o">&amp;&amp;</span> <span class="n">cvr_count</span> <span class="o">&gt;</span> <span class="mi">2</span><span 
class="p">);</span> <span class="k">cover</span><span class="p">(</span><span class="n">cvr_spd</span> <span class="o">==</span> <span class="mi">6</span> <span class="o">&amp;&amp;</span> <span class="n">clk90</span> <span class="o">&amp;&amp;</span> <span class="n">cvr_count</span> <span class="o">&gt;</span> <span class="mi">2</span><span class="p">);</span> <span class="c1">// 6MHz</span> <span class="k">cover</span><span class="p">(</span><span class="n">cvr_spd</span> <span class="o">==</span> <span class="mi">6</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">clk90</span> <span class="o">&amp;&amp;</span> <span class="n">cvr_count</span> <span class="o">&gt;</span> <span class="mi">2</span><span class="p">);</span> <span class="k">end</span></code></pre></figure> <p>We’ll have to handle covering the high speed options a bit differently. In this case, we <em>only</em> want to check speeds requiring <code class="language-plaintext highlighter-rouge">OPT_SERDES</code> if <code class="language-plaintext highlighter-rouge">OPT_SERDES</code> is actually set. We can’t use an <code class="language-plaintext highlighter-rouge">if</code> for this, lest the formal tool decide we failed the cover check. Hence, we’ll use a generate statement, so that the cover statements requiring <code class="language-plaintext highlighter-rouge">OPT_SERDES</code> are <em>only</em> generated if <code class="language-plaintext highlighter-rouge">OPT_SERDES</code> is true.
Now we can check for 200MHz, 100MHz, and 50MHz.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">generate</span> <span class="k">if</span> <span class="p">(</span><span class="n">OPT_SERDES</span><span class="p">)</span> <span class="k">begin</span> <span class="o">:</span> <span class="n">CVR_SERDES</span> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">i_clk</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">i_reset</span><span class="p">)</span> <span class="k">begin</span> <span class="k">cover</span><span class="p">(</span><span class="n">cvr_spd</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">&amp;&amp;</span> <span class="n">clk90</span> <span class="o">&amp;&amp;</span> <span class="n">cvr_count</span> <span class="o">&gt;</span> <span class="mi">5</span><span class="p">);</span> <span class="k">cover</span><span class="p">(</span><span class="n">cvr_spd</span> <span class="o">==</span> <span class="mi">1</span> <span class="o">&amp;&amp;</span> <span class="n">clk90</span> <span class="o">&amp;&amp;</span> <span class="n">cvr_count</span> <span class="o">&gt;</span> <span class="mi">5</span><span class="p">);</span> <span class="k">cover</span><span class="p">(</span><span class="n">cvr_spd</span> <span class="o">==</span> <span class="mi">1</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">clk90</span> <span class="o">&amp;&amp;</span> <span class="n">cvr_count</span> <span class="o">&gt;</span> <span class="mi">5</span><span class="p">);</span> <span class="k">cover</span><span class="p">(</span><span class="n">cvr_spd</span> <span class="o">==</span> <span class="mi">2</span> <span class="o">&amp;&amp;</span> <span class="n">clk90</span> <span class="o">&amp;&amp;</span> <span 
class="n">cvr_count</span> <span class="o">&gt;</span> <span class="mi">5</span><span class="p">);</span> <span class="k">cover</span><span class="p">(</span><span class="n">cvr_spd</span> <span class="o">==</span> <span class="mi">2</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">clk90</span> <span class="o">&amp;&amp;</span> <span class="n">cvr_count</span> <span class="o">&gt;</span> <span class="mi">5</span><span class="p">);</span> <span class="k">end</span></code></pre></figure> <p>We can apply the same logic to <code class="language-plaintext highlighter-rouge">OPT_DDR</code>, but we’ll have fewer clock options to check. In this case, it’s only the 100MHz and 50MHz options.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">end</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">OPT_DDR</span><span class="p">)</span> <span class="k">begin</span> <span class="o">:</span> <span class="n">CVR_DDR</span> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">i_clk</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">i_reset</span><span class="p">)</span> <span class="k">begin</span> <span class="k">cover</span><span class="p">(</span><span class="n">cvr_spd</span> <span class="o">==</span> <span class="mi">1</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">clk90</span> <span class="o">&amp;&amp;</span> <span class="n">cvr_count</span> <span class="o">&gt;</span> <span class="mi">5</span><span class="p">);</span> <span class="k">cover</span><span class="p">(</span><span class="n">cvr_spd</span> <span class="o">==</span> <span class="mi">2</span> <span class="o">&amp;&amp;</span> <span class="n">clk90</span> <span class="o">&amp;&amp;</span> <span 
class="n">cvr_count</span> <span class="o">&gt;</span> <span class="mi">5</span><span class="p">);</span> <span class="k">cover</span><span class="p">(</span><span class="n">cvr_spd</span> <span class="o">==</span> <span class="mi">2</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">clk90</span> <span class="o">&amp;&amp;</span> <span class="n">cvr_count</span> <span class="o">&gt;</span> <span class="mi">5</span><span class="p">);</span> <span class="k">end</span> <span class="k">end</span> <span class="k">endgenerate</span></code></pre></figure> <p>By the time you get to this point, you should have a strong confidence that <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">this device clock generator</a> actually does what it needs to. I certainly do, and it hasn’t failed me (that I recall) since going through this exercise. Yes, other parts of this design have had problems, particularly the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdfrontend.v">front end</a>, but the <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">clock generator</a> has been quite reliable.</p> <h2 id="conclusions">Conclusions</h2> <p>This is now my go-to approach whenever I need to generate a device clock:</p> <ul> <li> <p>Generate the “clock” in logic.</p> </li> <li> <p>Generate the “clock” wide, so it can be output via either OSERDES or <a href="/blog/2020/08/22/oddr.html">ODDR</a>.</p> </li> <li> <p>Maintain all logic transitions on the original source clock.</p> </li> <li> <p>Use logical signals like you would enables to handle data transitions.</p> </li> </ul> <p>What did this gain us? 
We received several advantages from this approach:</p> <ul> <li> <p>A glitchless outgoing clock</p> </li> <li> <p>An outgoing clock that can …</p> <ul> <li> <p>change frequency upon command,</p> </li> <li> <p>turn on and off as necessary,</p> </li> <li> <p>stop, and yet restart on a dime, and</p> </li> <li> <p>switch between being data aligned and offset by 90 degrees.</p> </li> </ul> </li> </ul> <p>This is everything we would want of an outgoing clock, with none of the challenges associated with breaking <a href="/blog/2017/08/21/rules-for-newbies.html"><em>the rules</em></a>. Indeed, this approach works nicely in both FPGA and ASIC contexts, as I’ve now used it quite successfully in both for multiple projects. No, I don’t use the same <a href="https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v">clock generator</a> for all my projects, but that’s for both requirements (the 200MHz clock is unique) and <a href="/blog/2020/01/13/reuse.html">legal reasons</a>.</p> <p>This leaves us with the topic of the “return clock”, which we’ll need to come back to and discuss on another day.</p> <hr /><p><em>The wind goeth toward the south, and turneth about unto the north; it whirleth about continually, and the wind returneth again according to his circuits. (Eccl 1:6)</em> Wed, 17 Dec 2025 00:00:00 -0500 https://zipcpu.com/blog/2025/12/17/devclk.html https://zipcpu.com/blog/2025/12/17/devclk.html blog Quiz #24: Is there an AXI bug here? <!-- answer: "2022/11/01/fv-answer22.html" --> <p>This quiz is brought to you courtesy of <a href="/formal/2019/05/13/axifull.html">Xilinx’s AXI slave template</a>.</p> <p>Thankfully, they’ve since (sort-of) fixed this bug since <a href="/formal/2019/05/13/axifull.html">I wrote that</a>. I say “sort-of” because … the bug just got pushed around. 
It’s still broken, just not in the same way.</p> Fri, 20 Jun 2025 00:00:00 -0400 https://zipcpu.com/quiz/2025/06/20/quiz24.html https://zipcpu.com/quiz/2025/06/20/quiz24.html quiz Comparing the Xilinx MIG with an open source DDR3 controller <p>Last year, I had the wonderful opportunity of mentoring Angelo as he built an open source <a href="https://github.com/AngeloJacobo/UberDDR3">DDR3 SDRAM controller</a>.</p> <p>Today, I have the opportunity to compare <a href="https://github.com/AngeloJacobo/UberDDR3">this controller</a> with <a href="https://docs.amd.com/v/u/en-US/ug086">AMD (Xilinx)’s Memory Interface Generator (MIG) solution</a> to the same problem. Let’s take a look to see which one is faster, better, and/or cheaper.</p> <h2 id="design-differences">Design differences</h2> <p>Before diving into the comparison, it’s worth understanding a bit about DDR3–both how it works, and how that impacts its performance. From there, I’d like to briefly discuss some of the major design differences between <a href="https://docs.amd.com/v/u/en-US/ug086">Xilinx’s MIG</a> and the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a>.</p> <p>We’ll start with the requirements of an SDRAM controller in general.</p> <h3 id="sdram-in-general">SDRAM in general</h3> <p>SDRAM stands for <a href="https://en.wikipedia.org/wiki/Synchronous_dynamic_random-access_memory">Synchronous Dynamic Random Access Memory</a>. “Synchronous” in this context simply means the interface requires a clock, and that all interactions are synchronized to that clock. “Random Access” means that you should be able to access the memory in any order you wish. The key word in this acronym, though, is the “D” for Dynamic.</p> <p>“Dynamic” RAM is made from capacitors, rather than flip flops. Why? Capacitors can be made much smaller than flip flops. They also use much less energy than flip flops. When the capacitor is charged, the “bit” of memory it represents contains a “1”.
When it isn’t, the bit is a zero. There’s just one critical problem: Capacitors lose their charge over time. This means that every capacitor in memory must be read and recharged periodically or it will lose its contents. The memory controller is responsible for making sure this happens by issuing “refresh” commands to the memory.</p> <p>That’s only the first challenge. Let’s now go back to that “synchronous” part.</p> <p>The original (non-DDR) SDRAM standard had a single clock to it. The controller would generate that clock and send it to the memory to control all interactions.</p> <p>This was soon not fast enough. Why not send memory values on both edges of the clock, instead of just one? You might then push twice as much data across the interface for the same I/O bandwidth. Sadly, as you increase the speed, pretty soon the data from the memory doesn’t come back synchronous to the clock you send. Both the traces on your circuit board and the time to complete the operation within the memory chip will delay the return signals so much that the returned data no longer arrives in time to be sampled at the source by the source’s clock before the next clock edge. Worse, these variabilities are somewhat unpredictable. Therefore, memories were modified so that they return a clock together with the data–keeping the data synchronous with the clock it is traveling with.</p> <p>Sampling data on a returned clock can be a challenge for an FPGA. Worse, the returned clock is discontinuous: it is only active when the memory has data to return. This will haunt us later, so we’ll come back to it in a moment.</p> <p>For now, let’s go back to the “dynamic” part of an SDRAM.</p> <p>SDRAMs are organized into banks, with each bank of memory being organized into rows of capacitors. To read from an SDRAM, a “row” of data from a particular memory bank must first be “activated.” That is, it needs to be copied from its row of capacitors into a row of flip flops.
From here, “columns” within this row can be read or written as desired. However, only one row of memory per bank can be active at any given time. Therefore, in order to access a second row of memory, the row in use must first be copied back to its capacitors. This is called “precharging” the row. Only then can the desired row of memory be copied to the active row of flip-flops for access.</p> <p>I mentioned SDRAMs are organized in “banks”. Each of these banks can be controlled independently. They each have their own row of active flip-flops. With few exceptions, such as the “precharge all rows” command, or the “refresh” cycle command, most of the commands given to the memory will be bank specific.</p> <p>Hence, to read a byte of memory, the controller must first identify which bank the byte of memory belongs to, and from there it must identify which row is to be read. The controller must then check which row is currently in the flip-flop buffer for that bank (i.e. which row is active). If a different row is active, that row must first be precharged. If no row is active, or alternatively once a formerly active row is precharged, the controller may then activate the desired row. Only once the desired row is active can the controller issue a command to actually read the desired byte from the row. Oh, and … all of this is contingent on not needing to refresh the memory. If a refresh interrupt takes place, you have to precharge all banks, refresh the memory, and then start over.</p> <p>Well, almost. There’s another important detail: Because of the high speeds we are talking about, the memory will return data in bursts of eight bytes. Hence, you can’t read just a single byte. The minimum read quantity is eight bytes in a single “byte lane”.</p> <p>What if eight bytes at a time isn’t enough throughput for you? Well, you could strap multiple memory chips together in parallel. In this case, every command issued by the controller would be sent to all of the memory chips.
All of them would activate rows together, all of them would refresh their memory together, and all of them could read eight bytes at a time. Each of these chips, then, would control a single “byte lane”. In our case today, we’ll be using a memory having eight “byte lanes”.</p> <p>So, when it comes to the performance of a memory controller, what do we want to know? We want to know how long it will take us from when the controller receives a read (or write) request until the data can be returned from the memory chip. This includes waiting for any (potential) refresh cycles, waiting for prior active rows to be precharged, new rows to be activated, and the data to finally be returned. The data path is complex enough that we’ll need to be looking at these times statistically.</p> <p>Specifically, we’re going to model transaction time as some amount of per-transaction latency, followed by a per-amount throughput.</p> <table align="center" style="float: none"><tr><td><img src="/img/migbench/eqn-transaction-time.png" alt="" width="524" /></td></tr></table> <p>Our goal will be to determine these two unknown quantities: <em>latency</em> and <em>throughput</em>. If we do our job well, these two numbers will then help us predict answers to such questions as: how long will a particular algorithm take, and how much memory bandwidth is available to an application.</p> <h3 id="mig">MIG</h3> <p>Let’s now discuss <a href="https://docs.amd.com/v/u/en-US/ug086">AMD (Xilinx)’s DDR3 memory controller</a>. This is the controller generated by their “Memory Interface Generator” and affectionately known simply as the “MIG” or “MIG controller”.</p> <p><a href="https://docs.amd.com/v/u/en-US/ug086">AMD (Xilinx)’s MIG controller</a> is now many years old. Judging by their change log, it was first released in 2013. Other than configuration adjustments, it has not been significantly modified since 2016. This is considered one of their more “stable” IPs.
It gets a lot of use by a wide variety of users, and I’ve certainly used it on a large number of projects.</p> <p>Examining the source code of the MIG reveals that it is built in two parts. This can be seen from Fig. 1 below, which shows how the MIG fits in the context of the entire test stack we’ll be using today.</p> <table align="center" style="float: none"><caption>Fig 1. Memory pipeline</caption><tr><td><img src="/img/migbench/mig-chain.svg" width="720" /></td></tr></table> <p>The first part of the MIG processes AXI transaction requests into its internal “native” interface. AXI, however, is a complex protocol. This translation is not instantaneous, and therefore takes a clock (or two) to accomplish. Many FPGA designers have discovered they can often improve upon the performance of the MIG by skipping this AXI translation layer and using the “native” interface instead. I have not done so personally, since I haven’t found sufficient documentation of this “native” interface to satisfy my needs–but perhaps I just need to look harder at what’s there.</p> <p>One key feature of an AXI interface is that it permits a certain amount of transaction reordering. For example, a memory controller might prioritize two interactions to the same bank of memory, such that the interaction using the currently active row might go first. Whether or not Xilinx’s MIG does this I cannot say. For today’s test measurements, we’ll only be using one channel–whether read or write, and we’ll only be using a single AXI ID. As a result, all requests must complete in order, and there will be no opportunity for the MIG to reorder any requests.</p> <p>DDR3 speeds also tend to be much faster than the FPGA logic the controller must support. For this reason, Xilinx’s DDR3 controller runs at either 1/2 or 1/4 the speed of the interface. This means that, on any given FPGA clock cycle, either two or four commands may be issued to the DDR3 device.
For this test, we’ll be running at 1/4 speed, so four commands may be issued per system clock cycle.</p> <p>The biggest problem Xilinx needed to solve with their controller was how to sample return data. Remember, the data returned by the memory contains a discontinuous clock. Worse, the discontinuous clock transitions when the data transitions. This means that the controller must (typically) delay the return clock by a quarter cycle, and only then clock the data on the edge. But … how do you know how far a quarter cycle delay is in order to generate the correct sample time for each byte lane?</p> <p>Xilinx solved this problem by using a set of IO primitives that they’ve never fully documented. These include PHASERs and IO FIFOs. Using these IO primitives, they can lock a PLL to the returned data clock, and then use that PLL to control the sample time of the return data. This clock is then used to control a special purpose asynchronous FIFO. From here, the data is returned to its environment.</p> <p>One unusual detail I’ve seen from the MIG is that it will often stall my read requests for a single cycle at a time in a periodic fashion. Such stalls are much too short for any refresh cycles. They are also more frequent than the (more extended) refresh cycles. This leads me to believe that Xilinx’s IO PLL primitive has an additional requirement, which is that in order to maintain lock, the MIG must periodically read from the DDR3 SDRAM. Hence, the MIG must not only take the memory offline periodically to keep the capacitors refreshed, it must also read from the memory to keep this IO PLL locked. Worse, it cannot read from the device at the same time it does this station keeping. As with the AXI to native conversion, this PLL station keeping requirement negatively impacts the MIG’s performance.</p> <p>Before leaving this point, let me underscore that these “special purpose” IO elements were never fully documented.
This adds to the challenge of building an open source controller, since the open source engineer must either reverse engineer these undocumented hardware components or build their data sampler in some other fashion.</p> <p>Some time ago, I tried building <a href="https://github.com/ZipCPU/wb2axip/blob/master/rtl/demofull.v">a block-RAM based memory peripheral capable of handling AXI exclusive access requests</a>. While trying to verify that the <a href="/about/zipcpu.html">ZipCPU</a> could generate exclusive access requests and that it would do so properly, I looked into whether or not the MIG would support them. Much to my surprise, the MIG has <em>no exclusive access capability</em>. I’ve since been told that this isn’t a big deal, since you only need exclusive access when more than one CPU is running on the same bus and the MicroBlaze CPU was never certified for multi–core operation, but I do still find this significant.</p> <p>Finally, the MIG controller tries to maximize parallelism with various “bank machines”. These “bank machines” appear to be complex structures, allocated dynamically upon request. Each bank machine is responsible for handling when and if a row for a given memory bank must be activated, read, written, or precharged. While most memories physically have eight banks, Xilinx’s MIG permits a user to have fewer bank machines. Hence, the first step in responding to a user request is to <em>allocate</em> a bank machine to the request. According to Xilinx, “The [MIG] controller implements an aggressive precharge policy.” As a result, once the request is complete, the controller will precharge the bank if no further requests are pending. 
The unfortunate consequence of this decision is that subsequent accesses to the same memory will need to first activate the row again before it can be used.</p> <h3 id="uberddr3">UberDDR3</h3> <p>This leads us to the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a>.</p> <p>The <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> is an open source (GPLv3) DDR3 controller. It was not built with AMD (Xilinx) funding or help. As such, it uses no special purpose IO features. Instead, it uses basic ISERDES/OSERDES and IDELAY/ODELAY primitives. As a result, there are no <code class="language-plaintext highlighter-rouge">PHASER_IN</code>s, <code class="language-plaintext highlighter-rouge">PHASER_OUT</code>s, <code class="language-plaintext highlighter-rouge">IN_FIFO</code>s, <code class="language-plaintext highlighter-rouge">OUT_FIFO</code>s, or <code class="language-plaintext highlighter-rouge">BUFIO</code>s.</p> <p>This leads to the question of how to deal with the return clock sampling from the DDR3 device. In the case of the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a>, we made the assumption that the DQS toggling would always come back after a fixed amount of time from the clock containing the request. A small calibration state machine is used to determine this delay time and then to find the center of the “eye”. Once done, <code class="language-plaintext highlighter-rouge">IDELAY</code> elements, coupled with a shift register, are then used to get the sample point.</p> <p>Fig. 2 illustrates this process.</p> <table align="center" style="float: right"><caption>Fig 2. Incoming data sampling</caption><tr><td><img src="/img/migbench/idelay.svg" width="360" /></td></tr></table> <p>It is possible that this method will lose calibration over time. Indeed, even the MIG wants to use the XADC to watch for temperature changes to know if it needs to adjust its calibration.
Rather than require the XADC, the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> supports a user input to send it back into calibration mode. Practically, I haven’t needed to do this, but this may also be because my test durations weren’t long enough.</p> <p>Another difference between the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> and the MIG is that the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> only has one interface: <a href="/zipcpu/2017/11/07/wb-formal.html">Wishbone B4 (Pipelined)</a>. This interface is robust enough to replace the need for the MIG’s non-standard “native” interface. Further, because <a href="/zipcpu/2017/11/07/wb-formal.html">Wishbone</a> has only a single channel for both reads and writes, the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> maintains a strict ordering of all transactions. There’s no opportunity for reordering accesses, and no associated complexity involved with it either.</p> <p>This will make our testing a touch more difficult, however, because we’ll be issuing <a href="/zipcpu/2017/11/07/wb-formal.html">Wishbone</a> requests–native to the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> but not the MIG. A <a href="/blog/2020/03/23/wbm2axisp.html">simple bridge</a>, costing a single clock cycle, will convert from <a href="/zipcpu/2017/11/07/wb-formal.html">Wishbone</a> to AXI prior to the MIG. We’ll need to account for this when we get to testing.</p> <p>The <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> also differs in how it handles memory banks. Rather than using an “aggressive” precharging strategy, it uses a lazy one. Rows are only precharged (returned back to the capacitors) when 1) the row has been active too long, or 2) when it is time to do a refresh, and so all active rows on all banks must be precharged.
This works great under the assumption that the next access is most likely to be in the vicinity of the last one.</p> <p>A second difference in how the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> handles memory banks is that, unlike the MIG, the bank address is drawn from the bits between the row and column address, as shown in Fig. 3.</p> <table align="center" style="float: none"><caption>Fig 3. Bank addressing</caption><tr><td><img src="/img/migbench/bank-addressing.svg" width="360" /></td></tr></table> <p>Although the MIG has an <em>option</em> to do this, it isn’t clear that the MIG takes any advantage of this arrangement. The <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a>, on the other hand, was designed to take explicit advantage of this arrangement. Specifically, the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> assumes most accesses will be sequential through memory. Hence, when it gets a request for a memory access that is most of the way through the column space of a given row, it then activates the next row on the next bank. This takes place independently of any user requests, and therefore anticipates a future user request which may (or may not) take place.</p> <p>Xilinx’s documentation reveals very little about their REFRESH strategy. The <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a>’s REFRESH strategy is very simple: every so many clocks (827 in this case) the memory is taken offline for a REFRESH cycle. This cycle lasts some number of clocks (46 for this test setup), and then places the memory back online for further accesses.</p> <p>This refresh timing is one of those things that makes working with SDRAM in general so difficult: it can be very hard to predict when the memory will be offline for a refresh, and so predicting performance can be a challenge. 
I know I have personally suffered from testing against an approximation of SDRAM memory, one that has neither REFRESH nor PLL station keeping cycles, only to switch later to such a memory and get hit with a stall or delayed ACK at a time when I’m not expecting it. <a href="/blog/2018/08/04/sim-mismatch.html">Logic that worked perfectly in my (less-than-matched) simulation would then fail in hardware</a>. This can also be a big challenge for security applications that require a fixed (and known) access time to memory lest they leak information across security domains.</p> <h2 id="the-test-setup">The test setup</h2> <p>Before diving into test results, allow me to introduce the test setup.</p> <table align="center" style="float: left; padding: 25px"><caption>Fig 4. An Enclustra Mercury+ KX2 carrier board mounted on an ST1 baseboard</caption><tr><td><img src="/img/migbench/enclustra.png" width="320" /></td></tr></table> <p>I’ll be running my memory tests using my <a href="https://github.com/ZipCPU/kimos">Kimos project</a>. <a href="https://github.com/ZipCPU/kimos">This project</a> uses an <a href="https://www.enclustra.com/en/products/fpga-modules/mercury-kx2/">Enclustra Mercury+ KX2 carrier board</a> containing a 2GB DDR3 memory and a Kintex-7 160T mounted on an <a href="https://www.enclustra.com/en/products/base-boards/mercury-st1/">Enclustra Mercury+ ST1 baseboard</a>.</p> <table align="center" style="float: right"><caption>Fig 5. Test setup</caption><tr><td><img src="/img/migbench/testsetup.svg" width="360" /></td></tr></table> <p>Fig. 5 shows the relevant components of the memory chain used by this <a href="https://github.com/ZipCPU/kimos">Kimos project</a> together with three test points for observation. The project contains a <a href="/about/zipcpu.html">ZipCPU</a>. (Of course!) That <a href="/about/zipcpu.html">ZipCPU</a> has both instruction and data interfaces to memory. Each interface contains a 4kB cache. 
The <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.html">instruction cache</a> in particular is large enough to hold all of the instructions for each of the code loops required by our benchmark, and so it becomes transparent to the test. This is not true of the <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.html">data cache</a>. The benchmarks I have chosen today are specifically designed to force <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.html">data cache</a> misses, and then to watch how the controller responds. In the <a href="/about/zipcpu.html">ZipCPU</a>, those two interfaces are then merged together via a <a href="/zipcpu/2017/11/07/wb-formal.html">Wishbone</a> <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/ex/wbdblpriarb.v">arbiter</a>, and then merged, via a <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/ex/wbpripriarb.v">second arbiter</a>, with the DMA’s <a href="/zipcpu/2017/11/07/wb-formal.html">Wishbone</a> requests. The result is that the <a href="/about/zipcpu.html">ZipCPU</a> has only a single bus interface.</p> <p>Bus requests from the <a href="/about/zipcpu.html">ZipCPU</a>, to include the ZipDMA, are generated at a width designed to match the bus. The interface to the Enclustra’s SDRAM naturally maps to 512 bits, so requests are generated (and recovered) at a 512-bit bus width.</p> <p>Once requests leave the <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/zipsystem.v">ZipSystem</a>, they enter a <a href="/blog/2019/07/17/crossbar.html">crossbar</a> <a href="https://github.com/ZipCPU/wb2axip/blob/master/rtl/wbxbar.v">Wishbone interconnect</a>. This <a href="/blog/2019/07/17/crossbar.html">interconnect</a> allows the <a href="/about/zipcpu.html">ZipCPU</a> to interact with <a href="/blog/2019/03/27/qflexpress.html">flash memory</a>, block RAM memory, and the DDR3 SDRAM memory. 
An additional port also allows interaction with a control bus operating at 32bits. Other peripheral DMAs can also master the bus through this <a href="/blog/2019/07/17/crossbar.html">crossbar</a>, to include the <a href="https://github.com/ZipCPU/sdspi">SD card controller</a>, an <a href="https://github.com/ZipCPU/wbi2c">I2C controller</a>, <a href="https://github.com/ZipCPU/kimos/blob/master/rtl/wbi2c/wbi2cdma.v">an I2C DMA</a>, and an <a href="/blog/2017/06/05/wb-bridge-overview.html">external debugging bus</a>. Other than loading program memory via <a href="/blog/2017/06/05/wb-bridge-overview.html">the debugging bus</a> to begin the test, these other bus masters will be idle during our testing.</p> <p>After leaving the <a href="/blog/2019/07/17/crossbar.html">crossbar</a>, the <a href="/zipcpu/2017/11/07/wb-formal.html">Wishbone</a> request goes in one of two directions. It can either go to a <a href="/blog/2020/03/23/wbm2axisp.html">Wishbone to AXI converter</a> and then to the MIG, or it can go straight to the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a>. (Only one of these controllers will ever be part of the design at a given time.)</p> <p>A legitimate question is whether or not the <a href="https://github.com/ZipCPU/wb2axip/blob/master/rtl/wbm2axisp.v">Wishbone to AXI converter</a> will impact this test, or to what extent it will impact it. From a timing standpoint, <a href="https://github.com/ZipCPU/wb2axip/blob/master/rtl/wbm2axisp.v">this converter</a> costs one clock cycle from the <a href="/zipcpu/2017/11/07/wb-formal.html">Wishbone</a> strobe to the AXI AxVALID signal. This will add one clock of latency to any MIG request. We’ll have to adjust any results we calculate by this one clock cycle. The <a href="https://github.com/ZipCPU/wb2axip/blob/master/rtl/wbm2axisp.v">converter</a> also requires 625 logic elements (LUTs).</p> <p>What about AXI? 
The <a href="/blog/2020/03/23/wbm2axisp.html">converter</a> doesn’t produce full AXI. All requests, coming out of the converter, are for burst lengths of <code class="language-plaintext highlighter-rouge">AxLEN=0</code> (i.e. one beat), a constant <code class="language-plaintext highlighter-rouge">AxID</code> of one bit, an <code class="language-plaintext highlighter-rouge">AxSIZE</code> matching the full 512-bit bus width, <code class="language-plaintext highlighter-rouge">AxCACHE=4'd3</code>, and so forth.</p> <ul> <li> <p>This will impact area.</p> <p>A good synthesizer should be able to recognize these constants and reduce the MIG’s logic accordingly. (<a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3</a> is already <a href="/zipcpu/2017/11/07/wb-formal.html">Wishbone</a> based, so this won’t change anything.)</p> </li> <li> <p>What about AXI bursts?</p> <p>Frankly, bursts tend to slow down AXI traffic, rather than speed it up. As <a href="/blog/2019/05/29/demoaxi.html">we’ve already discovered on this blog</a>, the first thing an AXI slave needs to do with a burst request is to unwind the burst. This takes extra logic, and often costs a clock cycle (or two). As a result, Xilinx’s block RAM controller (not the MIG) loses an extra clock on any burst request. The MIG, on the other hand, doesn’t seem affected by burst requests (or lack thereof)–although they may contribute a clock or two to latency.</p> </li> <li> <p>What about AXI pipelining?</p> <p>Both AXI and the <a href="/zipcpu/2017/11/07/wb-formal.html">pipelined Wishbone</a> specification I use are <em>pipelined</em> bus implementations. This means that multiple requests may be in flight at a time. I don’t foresee any differences, therefore, between the two controllers due to AXI’s pipelined nature.</p> <p>Had we been using Wishbone <em>Classic</em>, then our memory performance would’ve taken a significant hit. 
(This is <a href="https://github.com/ZipCPU/zipcpu/blob/master/doc/orconf.pdf">one of the reasons why I <em>don’t</em> use Wishbone Classic</a>.)</p> </li> <li> <p>What about Read/Write reordering?</p> <p>The MIG may be able to reorder requests to its advantage. In our test, we will only ever give it a single burst of read or write requests (all with <code class="language-plaintext highlighter-rouge">AxLEN=0</code>), and we will wait for all responses to come back from the controller before switching directions. It is possible that the MIG might have a speed advantage over the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3</a> controller in a direction-swapping environment. If so, then today’s test is not likely to reveal those differences.</p> </li> </ul> <p>Now that you know something about the various test setups, let’s look at some benchmarks.</p> <h2 id="the-lutsize-benchmark">The LUT/Size benchmark</h2> <p>When I first started out working with FPGAs, I remember my sorrow at seeing how many of my precious <a href="http://store.digilentinc.com/arty-artix-7-fpga-development-board-for-makers-and-hobbyists">Arty</a>’s LUTs were used by Xilinx’s MIG controller. At the time, I was struggling for funds, and didn’t really have the kind of cash required to purchase a <em>big</em> FPGA with lots of area. An Artix 35T was (roughly) all I could afford, and the MIG used a large percentage of its area.</p> <p>Since area is proportional to dollars, let’s take a look at how much area each of the controllers uses in today’s test.</p> <p>On a Kintex-7 160T, mounted on an <a href="https://www.enclustra.com/en/products/fpga-modules/mercury-kx2/">Enclustra Mercury+ KX2 carrier board</a>, the MIG controller uses 24,833 LUTs out of 101,400 LUTs. This is a full 24.5% of the FPGA’s total logic resources. Fig. 
6 shows a Vivado-generated hierarchy diagram, showing how much of the design this component requires.</p> <table align="center" style="float: left; padding: 25px"><caption>Fig 6. Area usage hierarchy with the MIG</caption><tr><td><a href="/img/migbench/mig-usage.png"><img src="/img/migbench/mig-usage.png" width="500" /></a></td></tr></table> <p>The diagram reveals a lot about where the area goes. Thankfully, the MIG uses only a quarter of it. The majority of the area used in this design is used by the components that have to touch the 512bit bus. These include the <a href="/blog/2019/07/17/crossbar.html">crossbar</a>, the CPU’s DMA, the <a href="https://github.com/ZipCPU/sdspi">SDIO controller</a>’s DMA, the various Ethernet bus components, and so on. The most obvious conclusion is that, if you want memory bandwidth, you will have to pay for it. This should come as no surprise to those who have worked in digital design for some time.</p> <p>On the same board, the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> uses 13,105 LUTs, or 12.9% of the chip’s total logic resources. A similar hierarchy diagram of the design containing the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> can be found in Fig. 7.</p> <table align="center" style="float: right; padding: 25px"><caption>Fig 7. Area usage hierarchy with the UberDDR3 Controller</caption><tr><td><a href="/img/migbench/uber-usage.png" width="500"><img src="/img/migbench/uber-usage.png" width="500" /></a></td></tr></table> <p>To be fair, the Xilinx controller must also decode AXI–a rather complex protocol. However, <a href="https://github.com/ZipCPU/wb2axip/blob/master/rtl/axim2wbsp.v">AXI may be converted to Wishbone</a> for only 1,762 LUTs, suggesting this conversion alone isn’t sufficient to explain the difference in logic cost. 
Further, the <a href="https://github.com/ZipCPU/wb2axip/blob/master/rtl/wbm2axisp.v">Wishbone to AXI converter</a> used to feed the MIG uses only a restricted subset of the AXI protocol. As a result, it’s reasonable to believe that the synthesizer’s number, 24,833 LUTs, is smaller than what a more complex AXI handler might require.</p> <p>On size alone, therefore, the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> comes out as the clear winner.</p> <p>That makes the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> cheaper. What about faster?</p> <!-- ## Time to First Access Enough background, let's start testing. From power up to first access, how long does it take? `LDI _sdram, R0` `LDI _rtc, R1` `LW (R0),R2` `LW (R1),R3` `HALT` --> <h2 id="the-raw-dma-bench-mark">The raw DMA benchmark</h2> <p>We’ve <a href="/blog/2021/08/14/axiperf.html">previously discussed bus benchmarking for AXI</a>. In <a href="/blog/2021/08/14/axiperf.html">that article</a>, we identified every type of clock cycle associated with an AXI transaction, and then counted how often each type of cycle took place. Since <a href="/blog/2021/08/14/axiperf.html">that article</a>, I’ve built <a href="https://github.com/ZipCPU/wb2axip/blob/master/rtl/wbperf.v">something very similar for Wishbone</a>. In hindsight, however, all of these measures tend to be way too complicated. What I really want is the ability to summarize transactions simply in terms of 1) latency, and 2) throughput. Therefore, I’ve chosen to model all DDR3 transaction times by the equation:</p> <table align="center" style="float: none"><tr><td><img src="/img/migbench/eqn-transaction-time.png" alt="" width="524" /></td></tr></table> <p>In this model, “Latency” is the time from the first request to the first response, and “Throughput” is the rate at which beats are returned, in beats per clock cycle. 
Calculating these coefficients requires a basic linear fit, and hence DMA transfers with a varying number of beats–but we’ll get to that in a moment.</p> <p>The biggest challenge here is that the CPU can very much get in the way of these measures, so we’ll begin our measurements using the DMA alone, where accesses are quite simple.</p> <p>Here’s how the test will work: The CPU will first program the <a href="https://github.com/ZipCPU/wb2axip/blob/master/rtl/wbperf.v">Wishbone bus measurement peripheral</a>. It will then program the DMA to do a memory copy, from DDR3 SDRAM to DDR3 SDRAM. The <a href="/about/zipcpu.html">ZipCPU</a>’s DMA will break this copy into parts: It will first read <code class="language-plaintext highlighter-rouge">N</code> words into a buffer, and then (as a second step) write those <code class="language-plaintext highlighter-rouge">N</code> words somewhere else in memory. During this operation, the CPU will not interact with the DDR3 memory at all–to keep from corrupting any potential measures. Instead, it will run all instructions from an on-board block RAM. Once the operation completes, the CPU will issue a stop collection command to the <a href="https://github.com/ZipCPU/wb2axip/blob/master/rtl/wbperf.v">Wishbone bus measurement peripheral</a>. From there, the CPU can read back 1) how many requests were made, and 2) how many clock cycles it took to either read or write each block. From the DMA configuration, we’ll know how many blocks were read and/or written. From this, we can create a simple regression to get the latency and throughput numbers we are looking for.</p> <p>To see how this might work, let’s start with what a DMA trace might nominally look like. Ideally, we’d want to see something like Fig. 8.</p> <table align="center" style="float: none"><caption>Fig 8. 
Ideal DMA</caption><tr><td><a href="/img/migbench/ideal-dma.svg"><img src="/img/migbench/ideal-dma.svg" width="720" /></a></td></tr></table> <p>In this “ideal” DMA, the DMA maintains two buffers. If either of the two buffers is empty, it issues a read command. Once the buffer fills, it issues a write command. Fig. 8 shows these read and write requests in the “DMA-STB” line, with “DMA-WE” (write-enable) showing which direction the requests are headed. These requests then go through a <a href="/blog/2019/07/17/crossbar.html">crossbar</a> and hit the DDR3 controller as “SDRAM-STB” and “SDRAM-WE”. (This simplified picture assumes no stalls, but we’ll get to those.) The SDRAM controller might turn around write requests immediately, as soon as they are committed into its queue, whereas read requests will take somewhat longer until REFRESH cycles, bank precharging and activation cycles are complete and the data finally returned. Then, as soon as a full block of read data is returned, the DMA can immediately turn around and request to write that data. Once a full block of write data has been sent, the DMA then has the ability to reuse that buffer for the next block of read data.</p> <p>AXI promises to be able to use memory in this fashion, and indeed my <a href="https://github.com/ZipCPU/wb2axip/blob/master/rtl/axidma.v">AXI DMA</a> attempts to do exactly that.</p> <p>When interacting with a real memory, things aren’t quite so simple. Requests will get delayed (I didn’t draw the stall signal in Fig. 8), responses have delays, etc. Further, there is a delay associated with turning the memory bus around from read to write or back again. Still, this is as simple as we can make a bus transaction look.</p> <p>In <a href="/zipcpu/2017/11/07/wb-formal.html">Wishbone</a>, unlike AXI, requests get grouped using the cycle line (<code class="language-plaintext highlighter-rouge">CYC</code>). 
You can see a notional <a href="/zipcpu/2017/11/07/wb-formal.html">Wishbone</a> DMA cycle in Fig. 9.</p> <table align="center" style="float: none"><caption>Fig 9. Wishbone DMA</caption><tr><td><img src="/img/migbench/cpb-dma.svg" width="720" /></td></tr></table> <p>Unlike the AXI promise, the <a href="/about/zipcpu.html">ZipCPU</a>’s <a href="/zipcpu/2017/11/07/wb-formal.html">Wishbone</a> implementation uses only a single buffer, and it doesn’t switch bus direction mid-bus cycle.</p> <p>Let’s look at this cycle line for a moment, though. This is a “feature” not found in AXI. The originating master raises this cycle line on the first request, and drops it after the last acknowledgment. The <a href="/blog/2019/07/17/crossbar.html">crossbar</a> uses this signal to know when it can drop arbitration for a given master, to allow a second master to use the same memory. The cycle line can also be used to tell downstream slaves that the originating master is no longer interested in any acknowledgments from its prior requests–effectively acting as a “bus abort” signal. This makes <a href="/zipcpu/2017/11/07/wb-formal.html">Wishbone</a> more robust than AXI in the presence of hardware failures, but it can also make <a href="/zipcpu/2017/11/07/wb-formal.html">Wishbone</a> slower than AXI because bursts from different masters cannot be interleaved while the master owning the bus holds its cycle line high.</p> <p>Arguably, this <a href="/zipcpu/2017/11/07/wb-formal.html">Wishbone</a> DMA approach will limit our ability to fully test the MIG controller. As a result, we may need to come back to this controller and test again at a later time using an AXI DMA interface alone to see to what extent that might impact our results.</p> <p>To make our math easier, we’ll add one more requirement: Our transactions will either be to read or write 16, 8, 4, or 2 beats at a time. 
On a 512bit bus, this corresponds to reading or writing 1024, 512, 256, or 128 bytes at a time–with 1024 bytes being the size of the <a href="/about/zipcpu.html">ZipCPU</a>’s DMA buffer, and therefore the maximum transfer size available.</p> <p>With all that said, it’s now time to look at some measurement data.</p> <p>First up is the MIG DDR3 controller. Fig. 10 shows a trace of the DMA waveform when transferring 16 beats of data at a time.</p> <table align="center" style="float: none"><caption>Fig 10. MIG DMA</caption><tr><td><img src="/img/migbench/mig-cpb.png" width="720" /></td></tr></table> <p>This image shows two <a href="/zipcpu/2017/11/07/wb-formal.html">Wishbone</a> bus interfaces. The top one is the view the DMA has of the bus. The bottom interface is the view coming out of the <a href="/blog/2019/07/17/crossbar.html">crossbar</a> and going into the memory controller.</p> <p>In this image, it takes roughly 79 clock cycles to go from the beginning of one read request, through a write request, to the beginning of the next read request–as measured between the two vertical markers.</p> <p>Some things to notice include:</p> <ol> <li>It takes 4 clock cycles for the request to go from the DMA through the <a href="/blog/2019/07/17/crossbar.html">crossbar</a> to the controller.</li> <li>While not shown here, it takes one more clock cycle following <code class="language-plaintext highlighter-rouge">sdram_stb &amp;&amp; !sdram_stall</code> for the conversion to AXI.</li> <li>Curiously, the SDRAM STALL line is not universally low during a burst of requests. In this picture, it often rises for a cycle at a time. I have conjectured above that this is due to the MIG’s need for PLL station keeping.</li> <li>During writes, it takes 3 clocks to go from request to acknowledgment.</li> <li>During reads, it can take 26 clocks from request to acknowledgment–or more.</li> <li>Once the MIG starts acknowledging (returning) requests, the ACK line can still drop mid-response. 
(This has cost me no end of heartache!)</li> </ol> <p>If we repeat the same measurement with the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a>, we get the trace shown in Fig. 11.</p> <table align="center" style="float: none"><caption>Fig 11. Uber DMA</caption><tr><td><img src="/img/migbench/uber2-cpb.png" width="720" /></td></tr></table> <p>In this case, the full 1024-byte transfer cycle now takes 66 clock cycles instead of 79.</p> <ul> <li> <p>It takes 11 cycles from read request to read acknowledgment.</p> </li> <li> <p>It takes 7 cycles from write request to acknowledgment.</p> </li> <li> <p>Unlike the MIG, there’s no periodic loss of acknowledgment. In general, once the acknowledgments start, they continue. This won’t be universally true, but the difference is still significant.</p> </li> </ul> <p>Of course, one transaction never tells the whole story; a full transaction count is required. However, when we look at all transactions, we find on average:</p> <table align="center" style="float: none"><tr><td><img src="/img/migbench/membench.png" width="800" /></td></tr></table> <p>These are the clearest performance numbers we will get to compare these two controllers. When writing to memory, the MIG is clearly faster. This is likely due to its ability to turn a request around before acting upon it. (Don’t forget, one of these clocks of latency is due to the <a href="/zipcpu/2017/11/07/wb-formal.html">Wishbone</a> to AXI <a href="/blog/2020/03/23/wbm2axisp.html">conversion</a>, so the MIG is one clock faster than shown in this chart!) Given that the MIG can turn a request around in 1.8 cycles, it must be doing so before examining any of the details of the request!</p> <p>When reading from memory, the MIG is clearly slower–and that by a massive amount. One clock of this is due to the <a href="/blog/2020/03/23/wbm2axisp.html">Wishbone to AXI conversion</a>. Another clock (or two) is likely due to the AXI to native conversion. 
The MIG must also arbitrate between reads and writes, and must (likely) always activate a row before it can be used. All of this costs time. As a result of these losses, and of others not accounted for here, the MIG is clearly <em>much</em> slower than the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a>.</p> <h2 id="under-load">Under Load</h2> <p>Now that we’ve seen how the DDR3 controller(s) act in isolation, driven only by a DMA, let’s turn our attention to how they act in response to a CPU–the <a href="/about/zipcpu.html">ZipCPU</a> in this case. (Of course!) For our test configuration, the <a href="/about/zipcpu.html">ZipCPU</a> will have both <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.html">data</a> and <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.html">instruction</a> caches. Because of this, our test memory loads will need to be extensive–to break through the cache–or else the cache will get in the way of any decent measurement.</p> <h3 id="how-the-cache-works">How the Cache Works</h3> <p>Let’s discuss the <a href="/about/zipcpu.html">ZipCPU</a>’s <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.html">data cache</a> for a moment, because it will become important when we try to understand how fast the CPU can operate in various memory environments.</p> <ul> <li> <p>First, the <a href="/about/zipcpu.html">ZipCPU</a> has only one interface to the bus. This interface is shared by both the <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v">instruction</a> and <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.html">data</a> caches. 
However, the <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v">instruction</a> cache is (generally) big enough to fit most of our program, so it shouldn’t impact the test much.</p> <p>The one place where we’ll see the <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v">instruction</a> cache impact our test is whenever the <a href="/about/zipcpu.html">ZipCPU</a> needs to cross between cache lines. As currently built, this will cost a one clock delay to look up whether or not the next cache line is in the instruction cache. Other than that, we’re not likely to see any impacts from the <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v">instruction</a> cache.</p> </li> <li> <p>The <a href="/about/zipcpu.html">ZipCPU</a>’s <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v">data cache</a> is a write through cache. Any attempt to write to memory will go directly to the bus and so to memory. Along the way, the memory in the <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v">cache</a> will be updated–but only if the memory to be written is also currently kept in the cache.</p> </li> <li> <p>The <a href="/about/zipcpu.html">ZipCPU</a> will not wait for a write response from memory before going on to its next instruction. Yes, it will wait if the next instruction is a read instruction, but in all other cases the next instruction is allowed to go forward as necessary.</p> <p>One (unfortunate) consequence of this choice is that any bus error will likely stop the CPU a couple of instructions <em>after</em> the fault, potentially confusing any engineer trying to understand which instruction, which register, and which memory address was associated with the fault. 
Such faults are often called <em>asynchronous</em> or <em>imprecise</em> bus faults.</p> </li> <li> <p>When issuing multiple write operations in a row, the <a href="/about/zipcpu.html">ZipCPU</a> will not wait for prior operations to complete. Two of our test cases will exploit this to issue three write (or read) requests in a row. In these tests, the CPU will write either three 32b words or three 8b bytes on consecutive instructions and hence clock cycles.</p> <p>I tend to call these <em>pipelined writes</em>, and I consider them to be some of the better features of the <a href="/about/zipcpu.html">ZipCPU</a>.</p> </li> <li> <p>All read operations first take a clock cycle to check the <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v">cache</a>. As a result, the minimum read time is two cycles: one to read from the <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v">cache</a> and check for validity, and a second cycle to shift the 512b bus value and return the 8, 16, or 32b result.</p> </li> <li> <p>As with the write operations, read operations can also be issued back to back. Back-to-back read operations will have a latency of two clocks, but a 100% throughput–assuming they both read from the same <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v">cache</a> line. If not, there will be an additional clock cycle lost to look up whether or not the requested cache line exists within the cache.</p> </li> <li> <p>Both <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v">instruction</a> and <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v">data cache</a> sizes have been set to 4kB each. Both caches will use a line size of eight bus words (512 Bytes). Neither cache uses <a href="/zipcpu/2025/03/29/pfwrap.html">wrap addressing</a> (although this test will help demonstrate that they should …). 
Instead, all cache reads will start from the top of the cache line, and the CPU will stall until the entire cache line is completely read before continuing.</p> </li> </ul> <p>To help understand how this <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v">data cache</a> works, let’s examine three operations. The first is a read cache miss, as shown in Fig. 12.</p> <table align="center" style="float: none"><caption>Fig 12. ZipCPU Read Data Cache Miss</caption><tr><td><a href="/img/migbench/dcache-miss.png" width="720"><img src="/img/migbench/dcache-miss.png" width="720" /></a></td></tr></table> <p>In this case, a load word (LW) instruction flows through the <a href="/about/zipcpu.html">ZipCPU</a>’s <a href="/zipcpu/2017/08/23/cpu-pipeline.html">pipeline from prefetch (PF), to decode (DCD), to the read operand (OP) stage</a>. It then leaves the read operand (OP) stage headed for the <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v">data cache</a>. The <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v">data cache</a> requires a couple of clocks–as dictated by the block RAM it’s built from–to determine that the request is not in the cache. Once this has been determined, the <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v">data cache</a> initiates a bus request to read a single cache line (8 bus words) from memory. Both cycle and strobe lines are raised. The strobe line stays active until eight cycles of <code class="language-plaintext highlighter-rouge">stb &amp;&amp; !stall</code> have taken place (stall is not shown here, but assumed low). Once eight requests have been made, the CPU waits for the last of the eight acknowledgments. Once the read is complete, and not before, the cache line is declared valid and the CPU can read from it to complete its instruction. 
This costs another four cycles before the LW instruction can be retired.</p> <p>While this cache line remains in our cache, further requests to read from memory will take only two or three clocks: two clocks if the request is for the same cache line as the prior access, or three clocks otherwise, as shown in Fig. 13.</p> <table align="center" style="float: right; padding: 25px"><caption>Fig 13. ZipCPU Data Cache Hit</caption><tr><td><img src="/img/migbench/dcache-hit.png" width="280" /></td></tr></table> <p>Finally, on any write request, the request will go straight to the bus as shown in Fig. 14.</p> <table align="center" style="float: left; padding: 25px"><caption>Fig 14. ZipCPU Write to Memory (through the Data Cache)</caption><tr><td><img src="/img/migbench/dcache-write.png" width="360" /></td></tr></table> <p>The CPU may then go on to other instructions, but the pipeline will necessarily stall if it ever needs to interact with memory prior to this write operation completing (unless it’s a set of consecutive writes …).</p> <h3 id="sequential-lrs-word-access">Sequential LRS Word Access</h3> <p>Our first CPU-based test is that of sequential word access. Specifically, we’ll work our way through memory, and write a pseudorandom value to every word in memory–one word at a time.
We’ll then come back through memory and read and verify that all of the memory values were written as desired.</p> <p>From C, the write loop is simple enough:</p> <figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="cp">#define STEP(F,T) asm volatile("LSR 1,%0\n\tXOR.C %1,%0" : "+r"(F) : "r"(T)) </span> <span class="c1">// ...</span> <span class="k">while</span><span class="p">(</span><span class="n">mptr</span> <span class="o">&lt;</span> <span class="n">end</span><span class="p">)</span> <span class="p">{</span> <span class="n">STEP</span><span class="p">(</span><span class="n">fill</span><span class="p">,</span> <span class="n">TAPS</span><span class="p">);</span> <span class="c1">// fill = (fill&amp;1)?((fill&gt;&gt;1)^TAPS):(fill&gt;&gt;1);</span> <span class="o">*</span><span class="n">mptr</span><span class="o">++</span> <span class="o">=</span> <span class="n">fill</span><span class="p">;</span> <span class="p">}</span></code></pre></figure> <p>The “STEP” macro exploits the fact that the <a href="/about/zipcpu.html">ZipCPU</a>’s LSR (logical shift right) instruction shifts the least significant bit into the carry flag, so that a linear feedback shift register (LFSR) may be stepped with only two instructions. The second instruction is a conditionally executed exclusive OR operation, only executed if the carry flag was set–indicating that a one was shifted out of the register.</p> <p>This simple loop then compiles into the following <a href="/about/zipcpu.html">ZipCPU</a> <a href="/zipcpu/2018/01/01/zipcpu-isa.html">assembly</a>:</p> <figure class="highlight"><pre><code class="language-text" data-lang="text">loop: LSR $1,R2 ; STEP(fill, TAPS) XOR.C R3,R2 SW R2,(R1) ; *mptr = fill | ADD $4,R1 ; mptr++ CMP R6,R1 ; if (mptr &lt; end) BC loop ; go to top of loop</code></pre></figure> <p>Basically, we step the LFSR by shifting right by one. If the bit shifted over the edge was a one, we exclusive OR the register with our taps.
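Since the inline assembly above is ZipCPU-specific, it may help to see the same step written in portable C. This is only a sketch of the commented-out C fallback in the listing; the taps value used below is an arbitrary illustration, not the value the benchmark used.

```c
#include <stdint.h>

/* One step of the LFSR that STEP implements: shift right by one, and
 * XOR in the taps whenever a one bit falls off the bottom.  The taps
 * value passed in is purely illustrative. */
static uint32_t lfsr_step(uint32_t fill, uint32_t taps)
{
	return (fill & 1) ? ((fill >> 1) ^ taps) : (fill >> 1);
}
```

On the ZipCPU, the `(fill & 1)` test costs nothing extra: LSR drops that bit into the carry flag, and XOR.C executes conditionally on it.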
(<code class="language-plaintext highlighter-rouge">XOR.C</code> only performs the exclusive OR if the carry bit is set.) We then store this word (<code class="language-plaintext highlighter-rouge">SW</code> = store word) into our memory address (<code class="language-plaintext highlighter-rouge">R1</code>), increment the address by adding four to it, and then compare the result with a pointer to the end of our memory region. If we are still less than the end of memory, we go back and loop again.</p> <p>Inside the CPU’s pipeline, this loop might look like Fig. 15.</p> <table align="center" style="float: none"><caption>Fig 15. Simple write pipeline</caption><tr><td><a href="/img/migbench/cp1.svg"><img src="/img/migbench/cp1.svg" width="720" /></a></td></tr></table> <p>Let’s work our way through the details of this diagram.</p> <ul> <li> <p>There are <a href="/zipcpu/2017/08/23/cpu-pipeline.html">four pipeline stages: prefetch (PF), decode (DCD), read operands (OP), and write-back (WB)</a></p> </li> <li> <p>The <a href="/about/zipcpu.html">ZipCPU</a> allows some pairs of instructions to be packed together. In this case, I’ve used the vertical bar to indicate instruction pairing. Hence the <code class="language-plaintext highlighter-rouge">S|A</code> instruction coming from the prefetch is one of these combined instructions. The instruction decoder turns this into two instructions, forcing the prefetch to stall for a cycle until the second instruction can advance.</p> </li> <li> <p>In general and when things are working well, all instructions take one clock cycle. Common exceptions to this rule are made for memory, divide, and multiply instructions. For this exercise, only memory operations will take longer.</p> </li> <li> <p>The store word instruction must stall and wait if the memory unit is busy. For the example in Fig.
15, I’ve chosen to begin the example with a busy memory, so you can see what this might look like.</p> </li> <li> <p>Once the store word request has been issued to the memory controller, a bus request starts and the CPU continues with its next instruction.</p> </li> <li> <p>The bus request must go through the <a href="/blog/2019/07/17/crossbar.html">crossbar</a> to get to the SDRAM. As shown here, this takes three cycles.</p> </li> <li> <p>The memory then accepts the request, and acknowledges it.</p> <p>In the case of the MIG, this request is acknowledged almost immediately. The <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> takes several more clock cycles before acknowledging this request.</p> </li> <li> <p>It takes another clock for this acknowledgment to return back through the <a href="/blog/2019/07/17/crossbar.html">crossbar</a> to the CPU.</p> </li> <li> <p>By this time, the CPU has already gone ahead without waiting for the bus return. However, once returned, the CPU can accept a new memory instruction request.</p> </li> <li> <p>When the <a href="/about/zipcpu.html">ZipCPU</a> hits the branch instruction (<code class="language-plaintext highlighter-rouge">BC</code> = Branch if carry is set), the CPU must clear its pipeline to take the branch. This forces the pipeline to be flushed. The colorless instructions in Fig. 15 are voided, and so never executed. The jump flag is sent to the prefetch and so the CPU must wait for the next instruction to be valid. (No, the <a href="/about/zipcpu.html">ZipCPU</a> does not have any branch prediction logic. A branch predictor might have saved us from these stalls.) If, as shown here, the branch remains in the same instruction cache line, a new instruction may be returned immediately. 
Otherwise it may take another cycle to complete the cache lookup for an arbitrary cache line.</p> </li> </ul> <p>If you look closely, you’ll notice that the performance of this tight loop is heavily dependent upon the memory performance. If the memory write cannot complete by the time the next write needs to take place, the CPU must then stall and wait.</p> <p>Using our two test points, we can see how the two controllers handle this test. Of the two, the MIG controller is clearly the faster, although the speed difference is (in this case) irrelevant.</p> <table align="center" style="float: none"><caption>Fig 16. MIG Write pipeline</caption><tr><td><img src="/img/migbench/mig-cp1.png" width="720" /></td></tr></table> <p>Indeed, as we’ve discussed, the MIG’s return comes back so fast that it is clear the MIG has not completed sending this request to the DDR3. Instead, it has simply committed the request to its queue and returned its acknowledgment. This acknowledgment also comes back fast enough that the CPU memory controller is idle for two cycles per loop. As a result, the memory write time is faster than the loop, and the loop time (10 clock cycles, from marker to marker) is dominated by the time to execute each of the instructions.</p> <p>Let’s now look at the trace from the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> shown in Fig. 17.</p> <table align="center" style="float: none"><caption>Fig 17. Uber Write pipeline</caption><tr><td><img src="/img/migbench/uber2-cp1.png" width="720" /></td></tr></table> <p>The big thing to notice here is that the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> takes one more clock cycle to return a busy status.
Although this is slower than the MIG, it isn’t enough to slow down the CPU, so the loop continues to take 10 cycles per loop.</p> <p>If you dig just a bit deeper, you’ll find that every 22us or so, the MIG takes longer to acknowledge a write request.</p> <table align="center" style="float: none"><caption>Fig 18. MIG Write pipeline with stall</caption><tr><td><img src="/img/migbench/mig-cp1-stall.png" width="720" /></td></tr></table> <p>In this case, the loop requires 22 clock cycles to complete.</p> <p>In a similar fashion, every 827 clocks (8.27 us), the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> does a memory refresh. During this time, the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> will also take longer to acknowledge a write request.</p> <table align="center" style="float: none"><caption>Fig 19. Uber Write pipeline with stall</caption><tr><td><img src="/img/migbench/uber2-cp1-stall.png" width="720" /></td></tr></table> <p>In this case, it takes the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> 57 clocks to complete a single loop.</p> <p>Let’s now turn our attention to the read half of this test, where we go back through memory in roughly the same fashion to verify the memory writes completed as desired. In particular, we’ll want to look at cache misses. 
Such misses don’t happen often, but they are the only time the design interacts with its memory.</p> <p>From C, our read loop is similarly simple:</p> <figure class="highlight"><pre><code class="language-c" data-lang="c"><span class="cp">#define FAIL asm("TRAP") </span> <span class="c1">// ...</span> <span class="k">while</span><span class="p">(</span><span class="n">mptr</span> <span class="o">&lt;</span> <span class="n">end</span><span class="p">)</span> <span class="p">{</span> <span class="n">STEP</span><span class="p">(</span><span class="n">fill</span><span class="p">,</span> <span class="n">TAPS</span><span class="p">);</span> <span class="k">if</span> <span class="p">(</span><span class="o">*</span><span class="n">mptr</span> <span class="o">!=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="n">fill</span><span class="p">)</span> <span class="n">FAIL</span><span class="p">;</span> <span class="n">mptr</span><span class="o">++</span><span class="p">;</span> <span class="p">}</span></code></pre></figure> <p>The big difference here is that, if the memory ever fails to match the pseudorandom sequence, we’ll issue a TRAP instruction which will cause the CPU to halt. This forces a branch into the middle of our loop.</p> <figure class="highlight"><pre><code class="language-text" data-lang="text">loop: LSR $1,R0 ; STEP(fill, TAPS) XOR.C R2,R0 LW (R1),R3 ; *mptr | CMP R0,R3 BZ no_trap ; if (*mptr == (int)fill) ... skip TRAP ; break into supervisor mode--never happens no_trap: ADD $4,R1 ; mptr++ | CMP R6,R1 ; if (mptr &lt; end) BC loop ; loop some more</code></pre></figure> <p>Inside the CPU’s pipeline, this loop might look like Fig. 20.
Read pipeline</caption><tr><td><a href="/img/migbench/cp2.svg"><img src="/img/migbench/cp2.svg" width="720" /></a></td></tr></table> <p>This figure shows two times through the loop–one with a cache miss, and one where the data fits entirely within the cache. In this case, the time through the loop upon a cache miss is entirely dependent upon how long the memory controller takes to read. <em>EVERY</em> clock cycle associated with reading from memory (on a cache miss) costs us.</p> <p>Fig. 21 shows a trace captured from the MIG during this operation.</p> <table align="center" style="float: none"><caption>Fig 21. MIG Data read, cache miss</caption><tr><td><img src="/img/migbench/mig-cp2.png" width="720" /></td></tr></table> <p>Here we can see that it takes 35 cycles to read from memory on a cache miss. These 35 cycles directly impact the time it takes to complete our loop.</p> <p>Since the memory is being read into the data cache, we are reading eight 512-bit words at a time, which we will then process 32 bits per loop. Hence, one might expect a cache miss once every 128 loops.</p> <p>Accepting that it takes us 17 clocks to execute this loop without a cache miss, we can calculate the loop time with cache misses as:</p> <table align="center" style="float: none"><tr><td><img src="/img/migbench/eqn-cp2-readloop.png" alt="" width="524" /></td></tr></table> <p>In this case, the probability of a cache miss is once every 128 times through. The other latency is 4 clocks for the <a href="/blog/2019/07/17/crossbar.html">crossbar</a>, and another 5 clocks in the <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v">cache controller</a>. Hence, our loop time for a 35 cycle read, one every 128 times, is about 17.5 cycles. This is pretty close to the measured time of 17.35 cycles.</p> <p>How about the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a>? Fig.
22 shows us an example waveform.</p> <table align="center" style="float: none"><caption>Fig 22. UberDDR3 Data read, cache miss</caption><tr><td><img src="/img/migbench/uber2-cp2.png" width="720" /></td></tr></table> <p>In this case, it takes 17 clock cycles to access the DDR3 SDRAM. From this one might expect 17.07 clocks per loop. In reality, we only get about 17.23, likely due to the times when our reads land on REFRESH cycles, as shown in Fig. 23 below, where the read takes 27 clocks instead of 17.</p> <table align="center" style="float: none"><caption>Fig 23. UberDDR3 Data read, cache miss, colliding with a REFRESH cycle</caption><tr><td><img src="/img/migbench/uber2-cp2-refresh.png" width="720" /></td></tr></table> <p>Our conclusion? In this test case, the differences between the MIG and <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a>s are nearly irrelevant. The MIG is faster for singleton writes, but we aren’t writing often enough to notice. The <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> is much faster when reading, but the cache helps to hide the difference.</p> <h3 id="sequential-lrs-triplet-word-access">Sequential LRS Triplet Word Access</h3> <p>Let’s try a different test. In this case, let’s write three words at a time, per loop, and then read them back again. As before, we’ll move sequentially through memory from one end to the next. Our goal will be to exploit the <a href="/about/zipcpu.html">ZipCPU</a>’s pipelined memory access capability, to see to what extent that might make a difference.</p> <p>Why are we writing three values to memory? For a couple reasons. First, it can be a challenge to find enough spare registers to write much more. Technically we might be able to write eight at a time, but we still need to keep track of the various pointers and so forth for the rest of the function we’re using. Second, three is an odd prime number. 
This will force us to have memory steps that cross cache lines, making for some unusual accesses.</p> <p>Here’s the C code for writing three pseudorandom words to memory.</p> <figure class="highlight"><pre><code class="language-c" data-lang="c"> <span class="k">while</span><span class="p">(</span><span class="n">mptr</span><span class="o">+</span><span class="mi">3</span> <span class="o">&lt;</span> <span class="n">end</span><span class="p">)</span> <span class="p">{</span> <span class="k">register</span> <span class="kt">unsigned</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">;</span> <span class="n">STEP</span><span class="p">(</span><span class="n">fill</span><span class="p">,</span> <span class="n">TAPS</span><span class="p">);</span> <span class="n">a</span> <span class="o">=</span> <span class="n">fill</span><span class="p">;</span> <span class="n">STEP</span><span class="p">(</span><span class="n">fill</span><span class="p">,</span> <span class="n">TAPS</span><span class="p">);</span> <span class="n">b</span> <span class="o">=</span> <span class="n">fill</span><span class="p">;</span> <span class="n">STEP</span><span class="p">(</span><span class="n">fill</span><span class="p">,</span> <span class="n">TAPS</span><span class="p">);</span> <span class="n">c</span> <span class="o">=</span> <span class="n">fill</span><span class="p">;</span> <span class="n">mptr</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">a</span><span class="p">;</span> <span class="n">mptr</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">b</span><span class="p">;</span> <span class="n">mptr</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">=</span> <span class="n">c</span><span class="p">;</span> <span 
class="n">mptr</span> <span class="o">+=</span> <span class="mi">3</span><span class="p">;</span> <span class="p">}</span></code></pre></figure> <p>As before, we’re using the <code class="language-plaintext highlighter-rouge">STEP</code> macro (defined above) to step a linear feedback shift register, used as a pseudorandom number generator, and then writing these pseudorandom numbers to memory. As before, the <em>pseudo</em> in <em>pseudorandom</em> will be very important when we try to verify that our memory was written correctly as intended.</p> <p>GCC converts this C into the following assembly. (Note, I’ve renamed the Loop labels and added comments, etc., to help keep this readable.)</p> <figure class="highlight"><pre><code class="language-text" data-lang="text">loop: MOV R3,R2 ; STEP(fill, TAPS); a = fill; LSR $1,R2 XOR.C R8,R2 MOV R2,R4 ; STEP(fill, TAPS); b = fill; LSR $1,R4 XOR.C R8,R4 MOV R4,R3 ; STEP(fill, TAPS); c = fill; LSR $1,R3 XOR.C R8,R3 SW R2,$-12(R0) ; mptr[0] = a; SW R4,$-8(R0) ; mptr[1] = b; SW R3,$-4(R0) ; mptr[2] = c; | ADD $12,R0 ; mptr += 3; CMP R6,R0 ; if (mptr+3 &lt; end) BC loop</code></pre></figure> <p>Even though we’re operating on three words at a time, the loop remains quite similar. <code class="language-plaintext highlighter-rouge">LSR/XOR.C</code> steps the LRS. Once we have three values, we use three consecutive <code class="language-plaintext highlighter-rouge">SW</code> (store word) instructions to write these values to memory. We then adjust our pointer, compare, and loop if we’re not done yet.</p> <p>Fig. 24 shows what the CPU pipeline might look like for this loop.</p> <table align="center" style="float: none"><caption>Fig 24. Triplet Write pipeline</caption><tr><td><a href="/img/migbench/cp3.svg"><img src="/img/migbench/cp3.svg" width="720" /></a></td></tr></table> <p>Unlike our first test, we’re now crossing between instruction cache lines. 
This means that there’s a dead cycle between the <code class="language-plaintext highlighter-rouge">LSR</code> and <code class="language-plaintext highlighter-rouge">XOR</code> instructions, and another one following the <code class="language-plaintext highlighter-rouge">BC</code> (branch if carry) loop instruction before the prefetch is able to return the first instruction.</p> <p>Unlike the last test, our memory operation takes three consecutive cycles.</p> <p>Here’s a trace showing this write from the perspective of the MIG controller.</p> <table align="center" style="float: none"><caption>Fig 25. Triplet writes using the MIG</caption><tr><td><img src="/img/migbench/mig-cp3.png" width="720" /></td></tr></table> <p>In this case, it takes 6 clocks (as shown) for the MIG to acknowledge all three writes. You’ll also note that the <a href="/blog/2019/07/17/crossbar.html">crossbar</a> stalls the requests, but that you don’t see any evidence of that at the SDRAM controller. This is simply because it takes the <a href="/blog/2019/07/17/crossbar.html">crossbar</a> a clock to arbitrate, and it has a two-stage pipeline buffer before arbitration is required. As a result, the third request through this <a href="/blog/2019/07/17/crossbar.html">crossbar</a> routinely stalls. Put together, this entire loop requires 21 cycles from one request to the next.</p> <p>Now let’s look at a trace from the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a>.</p> <table align="center" style="float: none"><caption>Fig 26. Triplet writes using the Uber3 controller</caption><tr><td><img src="/img/migbench/uber2-cp3.png" width="720" /></td></tr></table> <p>In this case, it takes 8 clocks for 3 writes. The <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> is two clocks slower than the MIG.
However, it still takes 21 cycles from one request to the next, suggesting that we are still managing to hide the memory access cost by running other instructions in the loop. Indeed, if you dig just a touch deeper, you’ll see that the CPU has 9 spare clock cycles. Hence, this write could take as long as 17 cycles before it would impact the loop time.</p> <p>Let’s now turn our attention to reading these values back. As before, we’re going to read three values, and then step and compare against our three pseudorandom values.</p> <figure class="highlight"><pre><code class="language-c" data-lang="c"> <span class="k">while</span><span class="p">(</span><span class="n">mptr</span><span class="o">+</span><span class="mi">3</span> <span class="o">&lt;</span> <span class="n">end</span><span class="p">)</span> <span class="p">{</span> <span class="k">register</span> <span class="kt">unsigned</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">;</span> <span class="n">a</span> <span class="o">=</span> <span class="n">mptr</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span> <span class="n">b</span> <span class="o">=</span> <span class="n">mptr</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span> <span class="n">c</span> <span class="o">=</span> <span class="n">mptr</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span> <span class="n">STEP</span><span class="p">(</span><span class="n">fill</span><span class="p">,</span> <span class="n">TAPS</span><span class="p">);</span> <span class="k">if</span> <span class="p">(</span><span class="n">a</span> <span class="o">!=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="n">fill</span><span class="p">)</span> <span class="p">{</span> <span class="n">FAIL</span><span class="p">;</span> <span 
class="k">break</span><span class="p">;</span> <span class="p">}</span> <span class="n">STEP</span><span class="p">(</span><span class="n">fill</span><span class="p">,</span> <span class="n">TAPS</span><span class="p">);</span> <span class="k">if</span> <span class="p">(</span><span class="n">b</span> <span class="o">!=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="n">fill</span><span class="p">)</span> <span class="p">{</span> <span class="n">FAIL</span><span class="p">;</span> <span class="k">break</span><span class="p">;</span> <span class="p">}</span> <span class="n">STEP</span><span class="p">(</span><span class="n">fill</span><span class="p">,</span> <span class="n">TAPS</span><span class="p">);</span> <span class="k">if</span> <span class="p">(</span><span class="n">c</span> <span class="o">!=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="n">fill</span><span class="p">)</span> <span class="p">{</span> <span class="n">FAIL</span><span class="p">;</span> <span class="k">break</span><span class="p">;</span> <span class="p">}</span> <span class="n">mptr</span><span class="o">+=</span><span class="mi">3</span><span class="p">;</span> <span class="p">}</span></code></pre></figure> <p>Curiously, GCC broke our three requests up into a set of two, followed by a separate third request. 
This will break the <a href="/about/zipcpu.html">ZipCPU</a>’s pipelined memory access into two accesses, although this is still within what “acceptable” assembly might look like.</p> <figure class="highlight"><pre><code class="language-text" data-lang="text">loop: ADD $12,R2 ; mptr += 3 | CMP R6,R2 ; while(mptr+3 &lt; end) BNC end_of_loop LW -8(R2),R4 ; b = mptr[1] LW -4(R2),R0 ; c = mptr[2] LSR $1,R1 ; STEP(fill, TAPS); XOR.C R3,R1 LW -12(R2),R11 ; a = mptr[0] CMP R1,R11 ; if (a != (int)fill) BNZ trap LSR $1,R1 ; STEP(fill, TAPS); XOR.C R3,R1 CMP R1,R4 ; if (b != (int)fill) BNZ trap LSR $1,R1 ; STEP(fill, TAPS); XOR.C R3,R1 CMP R1,R0 ; if (c == (int)fill) BZ loop ; go back and loop again trap:</code></pre></figure> <p>One lesson learned is that the if statements should include not only the TRAP/FAIL instruction, but also a break statement. If you include the break, then GCC will place the TRAP outside of the loop, and so we’ll no longer have to worry about multiple pipeline-clearing branches per loop–only the single stall when we branch back around the loop will remain. If you don’t, then the CPU will have to deal with multiple pipeline stalls every pass through.</p> <p>From a pipeline standpoint, the pipeline will look like Fig. 27.</p> <table align="center" style="float: none"><caption>Fig 27. Triplet read pipeline</caption><tr><td><a href="/img/migbench/cp4.svg"><img src="/img/migbench/cp4.svg" width="720" /></a></td></tr></table> <p>In this figure, we show two passes through the loop. The first pass shows a complete cache miss and subsequent memory access, whereas the second one can exploit the fact that the data is in the cache.</p> <p>As before, in the case of a cache miss, the loop time will be dominated by the memory read time. Any delay in memory reading will slow our loop down directly and immediately, but only once per cache miss.
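This cost argument can be put into a quick back-of-the-envelope model. The following is only a sketch: it assumes a fixed base loop time, a fixed per-miss penalty, and one miss every `period` passes, with the example numbers drawn loosely from the single-word read discussion earlier.

```c
/* Average clocks per loop pass, if a cache miss costing `penalty`
 * extra clocks strikes once every `period` passes.  All of the inputs
 * here are illustrative, not measured constants. */
static double avg_loop_clocks(double base, double penalty, double period)
{
	return base + penalty / period;
}

/* e.g. avg_loop_clocks(17.0, 44.0, 128.0) is roughly 17.34 */
```

The model only says that rare, expensive events amortize down to fractions of a clock per pass, which is why the controllers' raw latencies matter so little in these tests.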
The difference here is that our probability of a cache miss has now gone from one in 128 to three in 128.</p> <p>On a good day, the MIG’s access time looks like Fig. 28 below.</p> <table align="center" style="float: none"><caption>Fig 28. Triplet word access, data cache miss, MIG controller</caption><tr><td><img src="/img/migbench/mig-cp4.png" width="720" /></td></tr></table> <p>In this case, it costs us 35 clocks to read from the SDRAM in the case of a cache miss, and 24 clocks with no miss. Were this always the case, we might expect 25 clocks per loop. Instead, we see an average of 27 clocks per loop, suggesting that the refresh and other cycles are slowing us down further.</p> <p>Likewise, a cache miss when using the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> looks like Fig. 29.</p> <table align="center" style="float: none"><caption>Fig 29. Triplet word access, data cache miss, UberDDR3 controller</caption><tr><td><img src="/img/migbench/uber2-cp4.png" width="720" /></td></tr></table> <p>In this case, it typically costs 17 clocks on a cache miss. On a rare occasion, the read might hit a REFRESH cycle, where it might cost 36 clocks or so. Hence we might expect 24.6 cycles through the loop, which is very close to the 24.7 cycles measured.</p> <h3 id="sequential-lrs-triplet-character-access">Sequential LRS Triplet Character Access</h3> <p>The third CPU test I designed is a repeat of the last one, save that the CPU made <em>character</em> (i.e. 8-bit octet) accesses instead of 32-bit <em>word</em> accesses.</p> <p>In hindsight, this test isn’t very revealing. The statistics are roughly the same as the triplet word access: memory accesses to a given row aren’t faster (or slower) when accessing 8 bits at a time instead of 32. Instead, three 8-bit accesses take just as much time as three 32-bit accesses.
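For reference, here is a host-side sketch of what the byte-wise write loop does. The function name and `BTAPS` value are hypothetical stand-ins, not the benchmark source; the LFSR step is inlined from the C fallback of the STEP macro shown earlier.

```c
#include <stdint.h>

#define BTAPS 0x80000057u	/* illustrative taps, not the benchmark's */

/* Hypothetical byte-wise triplet write: three pseudorandom 8-bit
 * stores per pass, mirroring the 32-bit triplet loop above. */
static void fill_bytes3(unsigned char *cptr, unsigned char *cend,
		uint32_t fill)
{
	while (cptr + 3 <= cend) {
		fill = (fill & 1) ? ((fill >> 1) ^ BTAPS) : (fill >> 1);
		cptr[0] = (unsigned char)fill;
		fill = (fill & 1) ? ((fill >> 1) ^ BTAPS) : (fill >> 1);
		cptr[1] = (unsigned char)fill;
		fill = (fill & 1) ? ((fill >> 1) ^ BTAPS) : (fill >> 1);
		cptr[2] = (unsigned char)fill;
		cptr += 3;
	}
}
```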
The only real difference here is that the probability of a read cache miss is now 3 in 512 (three byte accesses per 512-byte cache line), rather than the previous 3 in 128.</p> <h3 id="random-word-access">Random word access</h3> <p>A more interesting test is the random word access test. In this case, we’re going to generate both (pseudo)random data and a (pseudo)random address. We’ll then store our random data at the random address, and only stop once the random address sequence repeats.</p> <p>I’m expecting a couple of differences here. First, I’m expecting that almost all of the data cache accesses will go directly to memory. There should be no (or at least very few) cache hits. Second, I’m going to expect that almost all of the memory requests should require loading a new row. In this case, the MIG controller should have a bit of an advantage, since it will automatically precharge a row as soon as it recognizes it’s not being used.</p> <p>Writing to memory from C will look simple enough:</p> <figure class="highlight"><pre><code class="language-c" data-lang="c"> <span class="n">afill</span> <span class="o">=</span> <span class="n">initial_afill</span><span class="p">;</span> <span class="k">do</span> <span class="p">{</span> <span class="n">STEP</span><span class="p">(</span><span class="n">afill</span><span class="p">,</span> <span class="n">TAPS</span><span class="p">);</span> <span class="n">STEP</span><span class="p">(</span><span class="n">dfill</span><span class="p">,</span> <span class="n">TAPS</span><span class="p">);</span> <span class="k">if</span> <span class="p">((</span><span class="n">afill</span><span class="o">&amp;</span><span class="p">(</span><span class="o">~</span><span class="n">amsk</span><span class="p">))</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="n">mptr</span><span class="p">[</span><span class="n">afill</span><span class="o">&amp;</span><span class="n">amsk</span><span class="p">]</span> <span class="o">=</span>
<span class="n">dfill</span><span class="p">;</span> <span class="p">}</span> <span class="k">while</span><span class="p">(</span><span class="n">afill</span> <span class="o">!=</span> <span class="n">initial_afill</span><span class="p">);</span></code></pre></figure> <p>GCC then turns this into the following assembly.</p> <figure class="highlight"><pre><code class="language-text" data-lang="text">loop: LSR $1,R1 ; STEP(afill, TAPS) XOR.C R8,R1 LSR $1,R2 ; STEP(dfill, TAPS) XOR.C R8,R2 MOV R1,R3 ; if (afill &amp; (~amsk)) == 0 | AND R12,R3 BNZ checkloop ; Calculate the memory address MOV R11,R3 | AND R1,R3 LSL $2,R3 MOV R5,R9 | ADD R3,R9 SW R2,(R9) ; mptr[afill &amp; amsk] = dfill checkloop: CMP R1,R4 BNZ loop</code></pre></figure> <p>There’s a couple of issues here in this test. First, we have a mid-loop branch that we will sometimes take, and sometimes not. Second, we now have to calculate an address. This requires multiplying the pseudorandom values by four (<code class="language-plaintext highlighter-rouge">LSL 2,R3</code>), and adding it to the base memory address.</p> <p>I’ve drawn out a notional pipeline for what this might look like in Fig. 30.</p> <table align="center" style="float: none"><caption>Fig 30. Random write access</caption><tr><td><a href="/img/migbench/cp7.svg"><img src="/img/migbench/cp7.svg" width="720" /></a></td></tr></table> <p>Notice that this notional pipeline includes a stall for crossing instruction cache line boundaries between the <code class="language-plaintext highlighter-rouge">XOR</code> and <code class="language-plaintext highlighter-rouge">LSR</code> instructions.</p> <p>From the MIG’s standpoint, a typical random write capture looks like Fig. 31 below.</p> <table align="center" style="float: none"><caption>Fig 31. Random write access, MIG controller</caption><tr><td><img src="/img/migbench/mig-cp7.png" width="720" /></td></tr></table> <p>As before, this is a 4 clock access. 
The MIG is simply returning its results before actually performing the write.</p> <p>A similar trace, drawn from the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a>, can be seen in Fig. 32.</p> <table align="center" style="float: none"><caption>Fig 32. Random write access, UberDDR3 controller</caption><tr><td><img src="/img/migbench/uber2-cp7.png" width="720" /></td></tr></table> <p>In this case, it takes 8 clocks to access memory and perform the write.</p> <p>However, neither write time is sufficient to significantly impact our time through the loop. Instead, it’s the rare REFRESH cycles that impact the write, but again these impacts are only fractions of a clock per loop. Still, that means that the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> takes seven tenths of a cycle longer per loop than the MIG controller.</p> <p>Reads, on the other hand, are more interesting. Why? Because read instructions must wait for their result before executing the next instruction, and the cache will have a negative effect if we’re always suffering from cache misses.</p> <p>Here’s the C code for a read.
Note that we now have two branches, mid loop.</p> <figure class="highlight"><pre><code class="language-c" data-lang="c"> <span class="k">do</span> <span class="p">{</span> <span class="n">STEP</span><span class="p">(</span><span class="n">afill</span><span class="p">,</span> <span class="n">TAPS</span><span class="p">);</span> <span class="n">STEP</span><span class="p">(</span><span class="n">dfill</span><span class="p">,</span> <span class="n">TAPS</span><span class="p">);</span> <span class="k">if</span> <span class="p">((</span><span class="n">afill</span> <span class="o">&amp;</span> <span class="p">(</span><span class="o">~</span><span class="n">amsk</span><span class="p">))</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span> <span class="k">if</span> <span class="p">(</span><span class="n">mptr</span><span class="p">[</span><span class="n">afill</span><span class="o">&amp;</span><span class="n">amsk</span><span class="p">]</span> <span class="o">!=</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="n">dfill</span><span class="p">)</span> <span class="p">{</span> <span class="n">FAIL</span><span class="p">;</span> <span class="k">break</span><span class="p">;</span> <span class="p">}</span> <span class="p">}</span> <span class="p">}</span> <span class="k">while</span><span class="p">(</span><span class="n">afill</span> <span class="o">!=</span> <span class="n">initial_afill</span><span class="p">);</span></code></pre></figure> <p>GCC produces the following assembly for us.</p> <figure class="highlight"><pre><code class="language-text" data-lang="text">loop: LSR $1,R2 ; STEP(afill, TAPS) XOR.C R4,R2 LSR $1,R0 ; STEP(dfill, TAPS) XOR.C R4,R0 MOV R2,R3 | AND R12,R3 BNZ skip_data_check MOV R11,R3 ; Calculate afill &amp; amsk | AND R2,R3 LSL $2,R3 ; Turn this into an address offset MOV R5,R1 | ADD R3,R1 ; ... 
and add that to mptr LW (R1),R3 ; Read mptr[afill&amp;amsk] | CMP R0,R3 ; Compare with dfill, the expected data BNZ trap ; Jump to the FAIL/break if nonzero skip_data_check: LW 12(SP),R1 ; Load (from the stack) the initial address | CMP R2,R1 ; Check our loop condition BNZ loop // ... trap:</code></pre></figure> <p>There are a couple of things to note here. First, there’s not one but <em>two</em> memory operations. Why? GCC couldn’t find enough registers to hold all of our values, and so it spilled the initial address onto the stack. Nominally, this wouldn’t be an issue. However, it becomes an issue when you have a data cache <em>collision</em>, where both the stack and the SDRAM memory require access to the same cache line. These cases then require two cache lookups per loop. One lookup will be of the SDRAM, the other (<code class="language-plaintext highlighter-rouge">LW 12(SP),R1</code>) of the block RAM where the stack is being kept. (A 2-way or higher data cache may well have mitigated this effect, allowing the stack to stay in the cache longer.)</p> <p>Second, notice how we now have a <code class="language-plaintext highlighter-rouge">BNZ</code> (branch if not zero, or if not equal). This is what we get for adding the <code class="language-plaintext highlighter-rouge">break</code> statement to the failure path of our loop–letting GCC know that this if condition isn’t really part of our loop. As a result, we only have one branch–and that only if our pseudorandom address goes out of bounds.</p> <p>This leaves us with a pipeline looking like Fig. 33.</p> <table align="center" style="float: none"><caption>Fig 33. Random read access pipeline</caption><tr><td><a href="/img/migbench/cp8.svg"><img src="/img/migbench/cp8.svg" width="720" /></a></td></tr></table> <p>A capture of these random reads, when using the MIG controller, looks like Fig. 34 below.</p> <table align="center" style="float: none"><caption>Fig 34.
Random read access, MIG controller</caption><tr><td><img src="/img/migbench/mig-cp8.png" width="720" /></td></tr></table> <p>As before, we’re looking at 35 clocks to read 8 words. Nominally, we might argue this to be a latency of 27 cycles plus overhead, but … it’s not. One cycle, after the MIG starts returning data, is empty. This means we have a latency of 26 cycles, and a single clock loss of throughput on every transaction.</p> <p>Judging from the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> trace in Fig. 35, the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> doesn’t have this problem.</p> <table align="center" style="float: none"><caption>Fig 35. Random read access, UberDDR3 controller</caption><tr><td><img src="/img/migbench/uber2-cp8.png" width="720" /></td></tr></table> <p>Instead, it takes 17 clocks to access 8 words, and there are no unexpected losses in the return.</p> <p>As a result, the MIG controller requires 72 clocks per loop, whereas the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> requires 55 clocks per loop.</p> <p>My conclusion from this test is that the MIG remains faster when writing, but the difference is fairly irrelevant because the CPU continues executing instructions concurrently. In the case of reads, on the other hand, the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> is much faster. This is the conclusion one might expect, given that the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> has much less latency than the MIG.</p> <h3 id="memcpy"><a href="https://en.cppreference.com/w/cpp/string/byte/memcpy">MEMCPY</a></h3> <p>Let’s now leave our contrived tests, and look at some C library functions.
For reference, the <a href="/about/zipcpu.html">ZipCPU</a> uses the <a href="https://sourceware.org/newlib/">NewLib</a> C library.</p> <p>Our first test will be a <a href="https://en.cppreference.com/w/cpp/string/byte/memcpy">memcpy()</a> test. Specifically, we’ll copy the first half of our memory to the second half. This will maximize the size of the memory copied.</p> <p>In addition, our <a href="https://en.cppreference.com/w/cpp/string/byte/memcpy">memcpy()</a> requests will be <em>aligned</em>. This will allow the library routine to use 32b word copies instead of byte copies. It’s faster and simpler, but there is some required magic taking place in the library to get to this point.</p> <p>Our test choice also has an unexpected consequence. Specifically, the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3</a>’s sequential memory optimizations will all break at the bank level, since we’ll be reading from one bank, and writing to another address <em>on the same bank</em>. This will force the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> to precharge a row and activate another on every bank read access. (It’s not quite every access, since we do have the data cache.)</p> <p>With a little digging, the relevant loop within the <a href="https://en.cppreference.com/w/cpp/string/byte/memcpy">memcpy()</a> compiles into the following <a href="/zipcpu/2018/01/01/zipcpu-isa.html">assembly</a>:</p> <figure class="highlight"><pre><code class="language-text" data-lang="text">loop: LW (R5),R8 ; Load two words from memory LW 4(R5),R9 SW R8,(R4) ; Store them SW R9,$4(R4) LW 8(R5),R8 ; Load the next two words LW 12(R5),R9 SW R8,$8(R4) ; Store those as well SW R9,$12(R4) LW 16(R5),R8 ; Load a third set of words LW 20(R5),R9 SW R8,$16(R4) ; Store the third set SW R9,$20(R4) ADD $32,R5 | ADD $32,R4 LW -8(R5),R8 ; Load a final 4th set of words LW -4(R5),R9 SW R8,$-8(R4) ; ... 
and store them to memory SW R9,$-4(R4) | CMP R4,R6 BNZ loop</code></pre></figure> <p>Note that all of the memory accesses are for two sequential words at a time. This is due to the fact that both GCC and <a href="https://en.cppreference.com/w/cpp/string/byte/memcpy">memcpy()</a> believe the <a href="/about/zipcpu.html">ZipCPU</a> has native 64-bit instructions. It doesn’t, but this is still a decent optimization.</p> <p>Second, note that GCC and <a href="https://sourceware.org/newlib/">NewLib</a> have succeeded in unrolling this loop, so that four 64b words are read and written per loop. (I’m not sure which of GCC or <a href="https://sourceware.org/newlib/">NewLib</a> is responsible for this optimization, but it shouldn’t be too hard to look it up.)</p> <p>Third, note that the load-word instructions cannot start until the store-word instructions prior complete. This is to keep the CPU from hitting certain memory access collisions.</p> <!-- Can we predict how long this will take? The load word instructions will miss the cache once every sixteen times through this loop, costing `LATENCY+2/THROUGHPUT` clock cycles loss per miss, and three cycles per hit. The first and second store word instruction pairs will cost `LATENCY+2/THROUGHPUT` each, since they cannot run concurrently with the memory loads. However, the third pair will require two fewer clocks, and the fourth will require six fewer clocks (5 for the branch) because they can run concurrently. MIG: (1/16)(27.7+9+8/0.96)+1+(15/16)(4) + 4(5 + 2.8 + 2/0.9)-6 = 56.7 clocks/loop // 35 cycle access // 99,125, when not in the cache // 55 cycles in cache Uber2: (1/16)(10.8+9+8/0.9)+1+(15/16)(4) + 4(5 + 8.2 + 2/0.92)-6 = 77.1 clock/loop // 57 cycles? ACTUAL-MIG: 0x00ed081c clocks / ACTUAL-Uber: 0x0110e8e9 clocks / --> <p>Fig. 36 shows an example of how the MIG deals with this memory copy.</p> <table align="center" style="float: none"><caption>Fig 36. 
MEMCPY, MIG Controller</caption><tr><td><img src="/img/migbench/mig-cp9.png" width="720" /></td></tr></table> <p>Highlighted in the trace is the 35 cycle read.</p> <p>However, you’ll also note that this trace is primarily dominated by write requests. This is due to the fact that the <a href="/about/zipcpu.html">ZipCPU</a> has a <em>write-through</em> cache, so all writes go to the data bus–two words at a time. Because of the latency difference we’ve seen, these writes can complete in 5 cycles total, or 14 cycles from one write to the next.</p> <p>Remember, the read requests cannot be issued until the write requests complete. Hence, for any pair of <code class="language-plaintext highlighter-rouge">SW</code> (store word) instructions followed by <code class="language-plaintext highlighter-rouge">LW</code> (load word) instructions, the <code class="language-plaintext highlighter-rouge">LW</code> instructions must wait for the <code class="language-plaintext highlighter-rouge">SW</code> instructions to complete. This write latency directly determines that wait time, and hence the 14 cycles from one write to the next.</p> <p>Also shown in Fig. 36 is a write when the SDRAM was busy. These happen periodically, when the MIG takes the SDRAM offline–most likely to refresh some of its capacitors. These cycles, while rare, tend to cost 71 clock cycles to write two words.</p> <p>In the end, it took 55 cycles to read and write 8 words (32 bytes) when the read data was in the cache, or 87 cycles otherwise.</p> <p>Fig. 37, on the other hand, shows a trace of the same test, only this time using the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a>.</p> <table align="center" style="float: none"><caption>Fig 37. MEMCPY, UberDDR3 Controller</caption><tr><td><img src="/img/migbench/uber2-cp9.png" width="720" /></td></tr></table> <p>As before, reads are faster.
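These read times fit a simple model: a cache-line fill costs the controller's initial latency, plus one clock per returned beat, plus any bubbles in the return. The split between the terms below is my own decomposition of the traces above, not a directly measured quantity:

```c
#include <assert.h>

#define LINE_BEATS 8	/* 8-word (8-beat) cache lines, as in the article */

/* Rough cache-line fill model: initial latency, one clock per returned
 * beat, plus any empty (bubble) cycles in the return stream. */
static int fill_clocks(int latency, int beats, int bubbles)
{
	return latency + beats + bubbles;
}
```

With the 26-cycle latency and single bubble noted earlier, an 8-beat MIG fill comes to the 35 clocks seen in the trace; assuming a 9-cycle latency and no bubbles reproduces the UberDDR3 controller's 17 clocks.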
The <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> can fill a cache line in 17 cycles, vs 35 for the MIG controller.</p> <p>However, what kills the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> in this test is its write performance. Because of the higher latency requirement of the write controller, it typically takes 7 cycles for a two word write to complete. This pushes the two word time from 14 cycles to 16 cycles. As a result, the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> is 15% <em>slower</em> than the MIG in this test.</p> <h3 id="memcmp"><a href="https://en.cppreference.com/w/cpp/string/byte/memcmp">MEMCMP</a></h3> <p>Our final benchmark will be a memory comparison, using <a href="https://en.cppreference.com/w/cpp/string/byte/memcmp">memcmp()</a>. Since we just copied the lower half of our memory to the upper half using <a href="https://en.cppreference.com/w/cpp/string/byte/memcpy">memcpy()</a> in our last test, we’re now set up for a second test where we verify that the memory was properly copied.</p> <p>Our C code is very simple.</p> <figure class="highlight"><pre><code class="language-c" data-lang="c"> <span class="k">if</span> <span class="p">(</span><span class="mi">0</span> <span class="o">!=</span> <span class="n">memcmp</span><span class="p">(</span><span class="n">mem</span><span class="o">+</span><span class="n">lnw</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="n">mem</span><span class="p">,</span> <span class="n">lnw</span><span class="o">/</span><span class="mi">2</span><span class="p">))</span> <span class="n">FAIL</span><span class="p">;</span></code></pre></figure> <p>Everything taking place, however, lands within the <a href="https://en.cppreference.com/w/cpp/string/byte/memcmp">memcmp()</a> library call.</p> <p>Internally, we spend our time operating on the following loop over and over again:</p> <figure 
class="highlight"><pre><code class="language-text" data-lang="text">loop: LW (R1),R4 ; Read two words from the left hand side LW 4(R1),R5 LW (R2),R6 ; Read two words from the right hand side LW 4(R2),R7 CMP R6,R4 ; Compare left and right hand words CMP.Z R7,R5 BNZ found_difference ADD $8,R1 | ADD $8,R2 ; Increment PTRs ADD $-8,R3 | CMP $8,R3 ; End-of-Loop chk BNC loop</code></pre></figure> <p>As with <a href="https://en.cppreference.com/w/cpp/string/byte/memcpy">memcpy()</a>, the library is trying to exploit the 64b values that the <a href="/about/zipcpu.html">ZipCPU</a> supports–albeit not natively. Hence, each 64b read is turned into two adjacent reads, and the comparison is likewise turned into a pair of comparisons, where the second comparison is only performed if the first one found its operands equal. On any difference, <a href="https://en.cppreference.com/w/cpp/string/byte/memcmp">memcmp()</a> breaks out of the loop and traps. In our test, however, there are no differences, and so the CPU stays within the loop until it finishes.</p> <p>Also, like the <a href="https://en.cppreference.com/w/cpp/string/byte/memcpy">memcpy()</a> test, jumping across a large power of two divide will likely break the bank machine optimizations used by the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a>.</p> <p>Enough predictions; let’s see some results.</p> <p>Fig. 38 shows an example loop through the MIG Controller.</p> <table align="center" style="float: none"><caption>Fig 38. MEMCMP, MIG Controller</caption><tr><td><img src="/img/migbench/mig-cpa.png" width="720" /></td></tr></table> <p>One loop, measured between the markers, takes 106 clocks.</p> <p>Much to my surprise, when I dug into this test I discovered that <em>every</em> memory access resulted in a cache miss. The reason is simple: the two memory regions are separated by a power of two larger than the cache itself.
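To see why such a separation guarantees collisions in a direct-mapped cache, consider how the cache index is computed from the address. The geometry below (32-byte lines, 256 lines) is illustrative only; the ZipCPU's actual cache parameters aren't given here:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative direct-mapped geometry: 32-byte lines, 256 lines (8kB). */
#define OFFSET_BITS 5	/* 32 bytes per line */
#define INDEX_BITS  8	/* 256 lines */

/* The cache index is just a mid-range slice of the address bits. */
static unsigned cache_index(uint32_t addr)
{
	return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1u);
}
```

Any two addresses separated by a power of two at or above the total cache size (here, 2^13 bytes) have identical index bits, so each side's read evicts the other side's line on every pass.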
This means that the two pieces of memory, the “left hand” and “right hand” sides, share the same cache index bits, and so they compete for the same cache line. (A 2-way cache would have mitigated this problem, but the <a href="/about/zipcpu.html">ZipCPU</a> currently has only one-way caches.)</p> <p>Fig. 39 shows the comparable loop when using the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a>.</p> <table align="center" style="float: none"><caption>Fig 39. MEMCMP, UberDDR3 Controller</caption><tr><td><img src="/img/migbench/uber2-cpa.png" width="720" /></td></tr></table> <p>In this case, the <a href="https://en.cppreference.com/w/cpp/string/byte/memcmp">memcmp()</a> uses only 74 clocks per loop–much less than the 106 used by the MIG.</p> <p>Something else to note is that if you zoom out from the trace in Fig. 38, you can see the MIG’s refresh cycles. Specifically, every 51.8us, there’s a noticeable hiccup in the reads, as shown in Fig. 40.</p> <table align="center" style="float: none"><caption>Fig 40. MEMCMP, MIG Controller Refresh timing</caption><tr><td><img src="/img/migbench/mig-cpa-refresh.png" width="720" /></td></tr></table> <p>The same refresh cycles are just as easy to see, if not easier, in the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a>’s trace if you zoom out, as shown in Fig. 41.</p> <table align="center" style="float: none"><caption>Fig 41.
MEMCMP, UberDDR3 Controller Refresh timing</caption><tr><td><img src="/img/migbench/uber2-cpa-refresh.png" width="720" /></td></tr></table> <p>This might explain why the MIG gets 96% throughput, whereas the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> only gets roughly 90% throughput: the MIG doesn’t refresh nearly as often.</p> <p>Still, when you put these numbers together, overall the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> is 30% <em>faster</em> than the MIG when running the MEMCMP test.</p> <h2 id="conclusions">Conclusions</h2> <p>So what conclusion can we draw? Is the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> faster, better, and cheaper than the MIG controller? Or would it make more sense to stick with the MIG?</p> <p>As with almost all engineering, the answer is: it depends.</p> <ol> <li> <p>The <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> is clearly <em>cheaper</em> than the MIG controller, since it uses 48% less area.</p> </li> <li> <p>Reading is much faster when using the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a>, primarily due to its lower latency of 10.8 clocks versus the MIG’s 27.7 clocks (on average). This lower latency is only partially explained by the MIG’s need to process and decompose AXI bursts. It’s not clear what causes the rest of the latency, or why it ends up so slow.</p> <p>At the same time, this read performance improvement can often be hidden by a good cache implementation. This only works, though, when accessing memory from a CPU.
Other types of memory access, such as DMA reads or video framebuffer reads, won’t likely have the luxury of hiding the memory performance, since they tend to read large consecutive areas of memory at once, rather than accessing random memory locations.</p> </li> <li> <p>Writing is faster when using the MIG, primarily due to the fact that it acknowledges any write request (nearly) immediately.</p> <p>This should be an easy issue to fix.</p> </li> <li> <p>The <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> might increase its throughput to match the MIG, were it to use a different refresh schedule.</p> <p>I would certainly recommend Angelo look into this.</p> </li> <li> <p>I really need to implement <a href="/zipcpu/2025/03/29/pfwrap.html">WRAP addressing</a> for my <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v">data cache</a>. I might’ve done so for this article, once I realized how valuable it would be, but that would’ve meant re-collecting all of the data samples I had and re-drawing all of the pipeline diagrams. Instead, I’ll just push this article out first and then take another look at it.</p> </li> <li> <p>The <a href="https://en.cppreference.com/w/cpp/string/byte/memcmp">memcmp()</a> test also makes a strong argument for having at least a 2-way cache implementation.</p> </li> </ol> <p>Given that the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a> is still somewhat new, I think we can all expect more and better things from it as it matures.</p> <hr /><p><em>For he looketh to the ends of the earth, and seeth under the whole heaven; to make the weight for the winds; and he weigheth the waters by measure. Job 28:24-25</em> Wed, 28 May 2025 00:00:00 -0400 https://zipcpu.com/zipcpu/2025/05/28/memtest.html https://zipcpu.com/zipcpu/2025/05/28/memtest.html zipcpu Wrap addressing <p>Welcome to the <em>ZipCPU</em> blog.
I started it years ago after building my own <a href="/about/zipcpu.html">soft core CPU, the ZipCPU</a>, and dedicated this blog to helping individuals stay out of <a href="/fpga-hell.html">FPGA Hell</a>. I then transitioned from working on the <a href="/about/zipcpu.html">ZipCPU</a> to building <a href="https://github.com/ZipCPU/wb2axip">bus components that might be used by every project–crossbars, bridges, DMAs</a> and such. Since that time, my time has been spent primarily not on the CPU, but rather on its peripherals. This last year, for example, has seen work on several memory controllers, including both <a href="https://www.arasan.com/product/xspi-nor-ip/">NOR</a> and <a href="https://www.arasan.com/products/nand-flash/">NAND</a> flash controllers, an <a href="https://github.com/ZipCPU/sdspi">SD Card(SDIO)/eMMC controller</a>, and (now) <a href="https://github.com/ZipCPU/wbsata">a SATA controller</a>. I’ve also had the opportunity to work on <a href="https://github.com/ZipCPU/eth10g">high speed networking</a>, video, and even SONAR applications. All of this work is made easier by having both my own <a href="/about/zipcpu.html">soft-core CPU</a>, together with <a href="https://github.com/ZipCPU/wb2axip">bus interconnect components</a>, that <a href="/zipcpu/2019/02/04/debugging-that-cpu.html">I’m not afraid to dig into to debug if necessary</a>.</p> <p>With all of these distractions, it’s nice every now and then to come back to the <a href="/about/zipcpu.html">ZipCPU</a>.</p> <p>One of my current projects requires that I benchmark AMD(Xilinx)’s DDR3 SDRAM MIG controller against the open source <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 controller</a>. The performance differences are dramatic and significant. My current (draft) article discussing these results works through a series of CPU and DMA based tests.
For each test, the article describes first the C code for the test, then the assembly for the critical section, then a diagram of the CPU’s pipeline–reconstructed from simulation traces, and then finally traces showing the differences between the two controllers.</p> <p>All of that led me to this trace from the data cache, shown in Fig. 1 below.</p> <table align="center" style="float: none"><caption>Fig 1. ZipCPU Data Cache Miss</caption><tr><td><a href="/img/migbench/cp2.svg"><img src="/img/migbench/cp2.svg" width="720" /></a></td></tr></table> <p>For quick reference, the top line is the clock. The <code class="language-plaintext highlighter-rouge">JMP</code> line beneath it is the signal from the <a href="/about/zipcpu.html">CPU</a>’s core to the <a href="/zipcpu/2017/11/18/wb-prefetch.html">instruction fetch</a> that the <a href="/about/zipcpu.html">CPU</a> needs to branch. The <a href="/zipcpu/2017/08/23/cpu-pipeline.html"><code class="language-plaintext highlighter-rouge">PF</code> line</a> shows the output of the prefetch (cache), and whether an instruction is available for the <a href="/about/zipcpu.html">CPU</a> to consume and if so which one. The <a href="/zipcpu/2017/08/23/cpu-pipeline.html"><code class="language-plaintext highlighter-rouge">DCD</code> line shows the output of the instruction decoder</a>. <code class="language-plaintext highlighter-rouge">OP</code> is the output of the <a href="/zipcpu/2017/08/23/cpu-pipeline.html">read operands pipeline stage</a>, and <a href="/zipcpu/2017/08/23/cpu-pipeline.html"><code class="language-plaintext highlighter-rouge">WB</code> is the writeback stage</a>. The <code class="language-plaintext highlighter-rouge">CYC</code>, <code class="language-plaintext highlighter-rouge">STB</code>, and <code class="language-plaintext highlighter-rouge">ACK</code> lines are a subset of the <a href="/zipcpu/2017/11/07/wb-formal.html">Wishbone bus signaling</a> used to communicate with memory. 
First there’s the <code class="language-plaintext highlighter-rouge">Zip-*</code> version of these signals, showing them coming out of the <a href="/about/zipcpu.html">CPU</a>, and then the <code class="language-plaintext highlighter-rouge">SDRAM-*</code> signals coming from the <a href="/blog/2019/07/17/crossbar.html">crossbar</a> showing these signals actually going to the memory controller itself.</p> <p>At issue is how long it takes the <a href="/about/zipcpu.html">CPU</a> to respond to a cache miss. Notice how it takes the <a href="/about/zipcpu.html">CPU</a> 3 clock cycles from receiving an <a href="/zipcpu/2018/01/01/zipcpu-isa.html">LW (load word) instruction</a> from the read operands stage until the <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v">data cache</a> initiates a bus request, another 3 cycles before the request can make it to the SDRAM controller, one cycle to return, and another 5 cycles from the completion of that request before the <a href="/about/zipcpu.html">CPU</a> can continue. That’s 11 clock cycles on every <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v">data cache</a> miss above and beyond the cost of the memory access itself.</p> <p>Ouch.</p> <p>When it comes to raw performance, every cycle counts. Can we do better?</p> <p>Yes, we can. Let’s talk about <em>wrap</em> addressing today.</p> <p>That said, I’d like to focus this article on saving a couple clock cycles in the <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v"><em>instruction</em> cache</a> rather than the <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v"><em>data</em> cache</a> shown in my example. Why? For the simple practical reason that the <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v"><em>instruction</em> cache</a> has been easier to update and get working–although I have yet to post the updates.
My <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v">data cache</a> upgrades to date remain a (broken) work in progress. Both, however, can be motivated by the diagram in Fig. 1 above.</p> <h2 id="wrap-addressing">Wrap Addressing</h2> <p>What might we do to improve the performance of the trace in Fig. 1?</p> <p>The first thing we might do is speed up how long it takes to recognize that a particular value is not in the cache. There’s only so much that can be done here, however, since the cache tag memory is <em>clocked</em>. As a result, it will always take a clock cycle to look up the cache tag for any new request, and another clock cycle to know it’s not the right tag, and then a third clock cycle to activate the bus.</p> <p>The <a href="/blog/2019/07/17/crossbar.html">crossbar</a> is separate from the <a href="/about/zipcpu.html">CPU</a>, and its timing is dominated by the need for a clock rate that matches the <a href="/about/zipcpu.html">CPU</a>.</p> <p>The <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 memory controller</a> is a separate product from the <a href="/about/zipcpu.html">CPU</a>, so its performance is independent from the <a href="/about/zipcpu.html">CPU</a> itself.</p> <p>How about the return? Once a value has been returned from memory to the cache, it then takes another clock cycle to shift the value into place for the CPU, so there’s not much to be done there … or is there?</p> <p>There are two optimizations that can be made on this return path. The first is that we can take the value directly from the bus and return it to the <a href="/about/zipcpu.html">CPU</a>–rather than waiting for the value to first be written to and then read back from the cache’s memory. The second optimization is <em>wrap</em> addressing. We’ll discuss both of these optimizations today.</p> <p>First, though, let me introduce the concept of a <em>cache line</em>. 
A <em>cache line</em> is the minimum amount of memory that can be read into the cache at a time. The cache itself is composed of many of these cache lines. Upon a cache miss, the cache controller will always go and read a whole cache line.</p> <p>A long discussion can be had regarding how big a cache line can or should be. For me, I tend to follow the results published by <a href="https://www.amazon.com/Computer-Architecture-Quantitative-Approach-Kaufmann/dp/0443154066/">Hennessy and Patterson</a>, and keep my cache lines (roughly) 8 words in length. For simplicity, the <a href="/about/zipcpu.html">ZipCPU</a>’s caches are all one-way caches, but, yes, significant performance can be gained by upgrading to two- or even four-way caches–but that’s a story for another day.</p> <p>Now that you know what a cache line is, notice how the cache miss in Fig. 1 results in reading an entire cache line. As we’ll discuss in the memory performance benchmarking article (still to be finished), memory performance can be quantified by latency and throughput. Caches can get an advantage over <a href="/zipcpu/2021/09/30/axiops.html">single-beat read or write instructions</a> by reading more than one beat at a time, and so increasing the line size improves efficiency. One problem with increasing the line size, however, is that 1) it increases the amount of time the bus is busy handling any request (remember, all requests are for a full cache line), and 2) it increases the risk that you spend a lot of time handling requests for instructions or data you’ll never use or need.</p> <p>Now we can discuss wrap addressing. Wrap addressing is a means of reading the cache line out of order. Without wrap addressing, we might read the words in the cache line in order from 0-7. With wrap addressing, the cache will specifically read the requested item from the cache line first, then continue to the end of the line, then go back and get what was missing from the beginning.
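The read-out order just described amounts to a modular counter over the beats of the line. This is a sketch of the addressing pattern only, not the ZipCPU's actual RTL:

```c
#include <assert.h>

#define LINE_WORDS 8	/* 8-word cache lines, per the article */

/* Fill 'order' with the beat sequence for a wrap burst: start at the
 * word that missed, run to the end of the line, then wrap back to the
 * start of the line. */
static void wrap_order(unsigned miss_word, unsigned order[LINE_WORDS])
{
	for (unsigned k = 0; k < LINE_WORDS; k++)
		order[k] = (miss_word + k) % LINE_WORDS;
}
```

A miss on word 5 thus reads the line in the order 5, 6, 7, 0, 1, 2, 3, 4, so the word the CPU is actually waiting on arrives on the very first beat.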
This way, as soon as the word that caused the cache miss in the first place has been read, the <a href="/about/zipcpu.html">CPU</a> can be unblocked and continue whatever it needs to do next while the cache controller finishes its read of the cache line. The big difference is that with wrap addressing the cache line is read in more of a priority fashion. “Wrap addressing” is just the name given to this style of out-of-order addressing.</p> <p>That’s what it is. Let’s now look at its impact.</p> <h2 id="wrap-addressing-with-the-zipcpus-instruction-cache">Wrap Addressing with the ZipCPU’s Instruction Cache</h2> <p>Some years ago, I added wrap addressing to the <a href="/about/zipcpu.html">ZipCPU</a>’s <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/axiicache.v">AXI instruction</a> and <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/axidcache.v">data caches</a>. Up until that time, I had poo-poo’d the benefit that might be had by using it. The <a href="/about/zipcpu.html">ZipCPU</a> was designed to be a “simple” and “low-logic” CPU, and wrap addressing would just complicate things–or so I judged. Then I tried it. At the time, I just needed <em>something</em> that used wrap addressing–the AXI bus functional model I had been given just wasn’t up to the task, but the <a href="/about/zipcpu.html">ZipCPU</a> could issue wrap addressing requests quite nicely. In the process, I was surprised at how much faster the <a href="/about/zipcpu.html">ZipCPU</a> ran when the caches used wrap addressing.</p> <p>That experiment died, however, once the need was over. The big reason for it dying was simply that I don’t use AXI often. Sure, the <a href="/about/zipcpu.html">ZipCPU</a> has AXI memory controllers, but they only fit the CPU so well. The AXI bus is little endian, and the <a href="/about/zipcpu.html">ZipCPU</a> is big endian, so the two aren’t a natural fit. There’s plenty of pain at the seams.
Further, adding wrap addressing to my <a href="/zipcpu/2017/11/07/wb-formal.html">Wishbone</a> memory controllers was simply work that wasn’t being paid for. No, it doesn’t help that the <a href="/zipcpu/2017/11/07/wb-formal.html">Wishbone</a> bus doesn’t really offer burst or wrap support, but I think you’ll find that issue to be irrelevant to today’s discussion.</p> <p>As a result, <a href="/zipcpu/2017/11/07/wb-formal.html">Wishbone</a> wrap addressing for the <a href="/about/zipcpu.html">ZipCPU</a> languished until I was recently motivated by examining the MIG and <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3</a> memory controller benchmark results. Indeed, I found myself a touch embarrassed at the performance the <a href="/about/zipcpu.html">CPU</a> was delivering.</p> <p>For illustration, let’s look at the first several instructions of a basic <a href="/about/zipcpu.html">ZipCPU</a> test program I use. We’ll break it into two portions, beginning with the first several instructions.</p> <figure class="highlight"><pre><code class="language-asm" data-lang="asm">; Clear all registers
; The "|" separates two instructions, both of which are
; packed into a single instruction word.
4000000: 86 00 8e 00	CLR R0 | CLR R1
4000004: 96 00 9e 00	CLR R2 | CLR R3
4000008: a6 00 ae 00	CLR R4 | CLR R5
400000c: b6 00 be 00	CLR R6 | CLR R7
4000010: c6 00 ce 00	CLR R8 | CLR R9
4000014: d6 00 de 00	CLR R10 | CLR R11
4000018: 66 00 00 00	CLR R12
; Set up the initial stack pointer
400001c: 6a 00 00 10	LDI 0x08000000,SP	; Top of stack
4000020: 6a 40 00 00
; Guarantee we are in supervisor mode, and trap into supervisor
; mode if not
4000024: 76 00 00 00	TRAP
; Provide a set of initial values for all of the user registers
4000028: 7b 47 c0 1e	MOV $120+PC,uPC
400002c: 03 44 00 00	MOV R0,uR0
4000030: 0b 44 00 00	MOV R0,uR1
4000034: 13 44 00 00	MOV R0,uR2
4000038: 1b 44 00 00	MOV R0,uR3
400003c: 23 44 00 00	MOV R0,uR4</code></pre></figure> <p>These get us to the end of the first cache line, and now to the beginning of the second. Take note that there have been no jumps or branches in this assembly; it’s just straightforward walking from one instruction to the next through the test program. (Yes, we’ll get to branches soon enough.)</p> <p>The instructions then continue loading the user register set with default values.</p> <figure class="highlight"><pre><code class="language-asm" data-lang="asm">4000040: 2b 44 00 00	MOV R0,uR5
4000044: 33 44 00 00	MOV R0,uR6
4000048: 3b 44 00 00	MOV R0,uR7
400004c: 43 44 00 00	MOV R0,uR8
4000050: 4b 44 00 00	MOV R0,uR9
4000054: 53 44 00 00	MOV R0,uR10
4000058: 5b 44 00 00	MOV R0,uR11
400005c: 63 44 00 00	MOV R0,uR12
4000060: 6b 44 00 00	MOV R0,uSP
4000064: 73 44 00 00	MOV R0,uCC
; Finally, we call the bootloader function to load software into RAM
; from flash if necessary (it isn't in this case), and to zero any
; uninitialized global values
4000068: 03 43 c0 02	LJSR @0x040000b4	// Bootloader
400006c: 7c 87 c0 00
4000070: 04 00 00 b4
; Software continues, but the next section is outside the scope
; of today's discussion.
; ....</code></pre></figure> <p>These end with a jump to subroutine instruction, followed by the beginning of the “_bootloader” subroutine below.</p> <p>In this case, the cache line starts at address 0x04000080. However, we don’t start executing there in our example. Instead, we start executing partway through the cache line at the beginning of the bootloader subroutine.</p> <figure class="highlight"><pre><code class="language-asm" data-lang="asm">040000b4 &lt;_bootloader&gt;:
; Our first step is to create a stack frame.  For this, we
; subtract from the stack pointer, and then store any
; registers we might clobber onto the stack.  As before,
; the "|" separates two instructions, both of which are
; packed into a single instruction word.
40000b4: e8 10 ad 00	SUB $16,SP | SW R5,(SP)
40000b8: b5 04 bd 08	SW R6,$4(SP) | SW R7,$8(SP)
40000bc: 44 c7 40 0c	SW R8,$12(SP)
40000c0: 0a 00 00 00	LDI 0x00000004,R1
40000c4: 0a 40 00 04
40000c8: 0c 00 00 04	CMP $4,R1
40000cc: 78 88 01 0c	BZ @0x040001dc
40000d0: 0a 00 00 00	LDI 0x00000004,R1
40000d4: 0a 40 00 04
40000d8: 0c 00 00 04	CMP $4,R1
40000dc: 32 08 00 20	LDI.Z 0x04000000,R6
; ....</code></pre></figure> <p>Together, these two sets of instructions make an awesome example to see how wrap addressing would work from an instruction fetch perspective.</p> <p>One of the things I like about this example is the fact that the test starts with many sequential instructions and no jumps (branches). This will help provide us a baseline of how things work–before jumps start making things complicated.</p> <p>For today’s discussion, our cache line size is 8 words, each having 64 bits. The <a href="/about/zipcpu.html">ZipCPU</a>’s nominal instruction size is 32 bits. Therefore, each cache line will nominally contain 16 instructions. Our first cache line, however, contains many clear (CLR) instructions (really load-immediate 0 into register …), and two of these instructions can be packed into a single 32b word.
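As a quick check on that line-capacity arithmetic (a throwaway sketch, nothing ZipCPU-specific):

```python
LINE_WORDS = 8        # cache line length, in bus words
BUS_WORD_BITS = 64    # width of each bus word in today's example
INSN_BITS = 32        # the ZipCPU's nominal instruction size

# 8 words * 64 bits / 32 bits per instruction = 16 instructions per line
insns_per_line = (LINE_WORDS * BUS_WORD_BITS) // INSN_BITS
print(insns_per_line)  # 16
```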
This is shown above using the “|” characters. Fig. 2 shows how the <a href="/zipcpu/2017/08/23/cpu-pipeline.html">CPU pipeline</a> works through these initial instructions–without wrap addressing.</p> <table align="center" style="float: none"><caption>Fig 2. Starting the cache, without wrap addressing</caption><tr><td><a href="/img/pfwrap/pf-startup.svg"><img src="/img/pfwrap/pf-startup.svg" width="720" /></a></td></tr></table> <p>Following the CPU reset, the <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v">cache</a> starts with the JUMP flag set. Following a jump, it takes us 4 clock cycles to determine that the new address is not in the <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v">cache</a> and, therefore, to start a bus cycle.</p> <p>This bus cycle is painful. When using the MIG, it requires roughly 35 cycles (on a good day) to read all eight words. When using the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3</a> controller, it requires roughly 18 cycles. Since the <a href="/about/zipcpu.html">ZipCPU</a> can nominally execute one instruction per cycle, this is a painful wait.</p> <p>Once the bus cycle completes, we take another two cycles to present the instruction from the cache line that we just read to the <a href="/about/zipcpu.html">CPU</a>. The decoder then takes two clock cycles with this instruction, since it contains two instructions packed into a single word, and so forth.
From here on out, instructions are passed to the <a href="/about/zipcpu.html">CPU</a> at one instruction word per clock cycle–unless the <a href="/about/zipcpu.html">CPU</a> needs to take more clock cycles with them–as is the case with the <a href="/zipcpu/2018/01/01/zipcpu-isa.html">compressed instruction</a>.</p> <p>Some instructions, such as the <a href="/zipcpu/2018/01/01/zipcpu-isa.html">load immediate instruction</a>, are actually two separate instructions–a bit reverse instruction to load the high order bits, and a load-immediate-low. Other than that, things stay straightforward until the end of the cache line. Once we get to the end, it takes us another 4 cycles to determine the next instruction is not in the cache, and so a new cycle begins again.</p> <p>Now that we know how things work normally, we have our first chance for an improvement: what if we started feeding instructions to the <a href="/about/zipcpu.html">CPU</a> <em>before</em> all of the instructions had been read from memory and returned across the bus? What if we fed the next instruction to the <a href="/about/zipcpu.html">CPU</a> as soon as it was available?</p> <p>In that case, we might see a trace similar to Fig. 3 below.</p> <table align="center" style="float: none"><caption>Fig 3. Feeding instructions straight from the bus returns</caption><tr><td><a href="/img/pfwrap/wrap-startup.svg"><img src="/img/pfwrap/wrap-startup.svg" width="720" /></a></td></tr></table> <p>We can now overlap our instruction read time with our instruction issue, saving ourselves a full 10 cycles!</p> <p>Let’s follow this further. What would happen in the case of a jump/branch? Without any modifications to <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v">our instruction cache</a> (i.e. before <em>wrap</em> addressing), the JSR initiates a jump at the end of Fig. 4 below.</p> <table align="center" style="float: none"><caption>Fig 4.
A Jump Instruction</caption><tr><td><a href="/img/pfwrap/pf-jsr.svg"><img src="/img/pfwrap/pf-jsr.svg" width="720" /></a></td></tr></table> <p>This trace is a touch more eventful. For example, it includes a move to the <a href="/zipcpu/2018/01/01/zipcpu-isa.html">CC register</a>. On the <a href="/about/zipcpu.html">ZipCPU</a>, this register contains more than just the condition codes. It also contains the <a href="/zipcpu/2018/01/01/zipcpu-isa.html">user vs supervisor mode control</a>. This creates a pipeline hazard, and so instructions need to be stalled throughout the pipeline until this instruction has had a chance to write back–clearing the hazard.</p> <p>The <a href="/about/zipcpu.html">ZipCPU</a>’s <a href="/zipcpu/2018/01/01/zipcpu-isa.html">JSR instruction</a> follows, requiring three instruction words. The first instruction word moves the program counter plus two into R0. This will now contain the return address for the subroutine. On other architectures, such an instruction is often called a “Link Register” instruction, but on the <a href="/about/zipcpu.html">ZipCPU</a> this is simply the first of the three-word <a href="/zipcpu/2018/01/01/zipcpu-isa.html">JSR instruction</a>. The second instruction loads a new value into the program counter. Technically, this is a <a href="/zipcpu/2018/01/01/zipcpu-isa.html"><code class="language-plaintext highlighter-rouge">LW (PC),PC</code> instruction–loading the value of memory, as found at the program counter, into the program counter</a>. Practically, it just allows us to place a 32b destination address into the instruction stream. Once the address is passed to the decoder, the decoder recognizes the unconditional jump and sets a flag for the <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v">instruction cache</a> that it now wants a new instruction out of order.
The <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v">instruction cache</a> now takes four clock cycles to determine this new value is not in the cache, and our cycle repeats.</p> <p>As before, we can compress this a touch by serving our instructions to the <a href="/about/zipcpu.html">CPU</a> immediately as they are read from the bus–instead of waiting for the entire cache line to be read first. You can see how this optimization might speed things up in Fig. 5.</p> <table align="center" style="float: none"><caption>Fig 5. JSR instruction, post optimization</caption><tr><td><a href="/img/pfwrap/wrap-jsr.svg"><img src="/img/pfwrap/wrap-jsr.svg" width="720" /></a></td></tr></table> <p>That’s how the first of our two optimizations works.</p> <p>Following the jump, without WRAP addressing, the pipeline would look like Fig. 6.</p> <table align="center" style="float: none"><caption>Fig 6. JSR Landing, no optimization</caption><tr><td><a href="/img/pfwrap/pf-land.svg"><img src="/img/pfwrap/pf-land.svg" width="720" /></a></td></tr></table> <p>To see what’s happening here, notice that we just jumped to address 0x040000b4. Given our cache line size of eight words, with each word being 64 bits, this cache line starts at address 0x04000080. If we just returned the value from the bus as soon as it was available, we’d have to read six bus words before we get to the one we’re interested in–as shown in Fig. 6.</p> <figure class="highlight"><pre><code class="language-text" data-lang="text">4000080: ; Word 0: I don't care about these instructions.  I'm jumping
4000084: ;   to address 0x040000b4.  I just have to read
4000088: ; Word 1: these excess instructions because I'm operating on an
400008c: ;   entire cache line.
4000090: ; Word 2:
4000094: ;
4000098: ; Word 3: Still haven't gotten to anything I care about ...
400009c: ;
40000a0: ; Word 4:
40000a4: ;
40000a8: ; Word 5:
40000ac: ;
40000b0: ; Word 6: This is the first half of the word I do care about
40000b4: ;   THIS IS THE FIRST INSN OF INTEREST!
40000b8: ; Word 7:
40000bc: ;</code></pre></figure> <p>Why not, instead, request the address we are interested in first? Instead of starting with word 0, and reading until word 6, we might instead start with word 6, read word 7, and then finish by reading the first part of the cache line (words 0-5) while the <a href="/about/zipcpu.html">CPU</a> takes our instruction and gets (potentially) busy doing useful things.</p> <p>Fig. 7 shows how this wrap addressing might look.</p> <table align="center" style="float: none"><caption>Fig 7. Instruction cache miss using WRAP addressing</caption><tr><td><a href="/img/pfwrap/wrap-land.svg"><img src="/img/pfwrap/wrap-land.svg" width="720" /></a></td></tr></table> <p>Here, we request the last two instruction words, words 6 and 7, of the cache line, and then instruction words 0-5. Word 6 contains two instructions, but we’re only interested in the second of those two. That one is a compressed instruction, packing two instructions into 32 bits. Word 7 then contains another three instructions–one packed instruction word and one normal one.</p> <p>The trace gets a touch more interesting, though, given that the second instruction wants to <em>store</em> a word into memory. The <a href="/about/zipcpu.html">ZipCPU</a>, however, has only one bus interface–an interface that needs to be shared between instruction and data bus accesses. This means that the data access, i.e. the store word instruction, must wait until the <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v">instruction cache</a>’s bus cycle completes.</p> <h2 id="conclusions">Conclusions</h2> <p>The next step in this article should really be an analysis section, quantifying (if only artificially) the additional performance achieved by using wrap addressing over what I had been using.
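Lacking that analysis, a crude back-of-the-envelope model can still hint at the size of the win. The Python sketch below is purely illustrative: the constants are loose stand-ins, not measurements, and "classic" vs "stream" vs "wrap" correspond to waiting for the whole line, forwarding in-order beats as they arrive, and wrap addressing respectively.

```python
def cycles_to_missed_word(miss_word, line_words=8, detect=4,
                          bus_latency=10, present=2, mode="classic"):
    """Toy model: cycles until the CPU sees the word that missed.

    detect      - cycles to recognize the miss and start a bus cycle
    bus_latency - cycles before the first beat returns (illustrative)
    present     - cycles to present the fetched word to the CPU
    """
    if mode == "classic":       # wait for the entire line, in order
        beats = line_words
    elif mode == "stream":      # in-order beats, forwarded as they arrive
        beats = miss_word + 1
    elif mode == "wrap":        # missed word comes back on the first beat
        beats = 1
    else:
        raise ValueError(mode)
    return detect + bus_latency + beats + present

# A miss on word 6 (as in the jump to 0x040000b4 above):
print(cycles_to_missed_word(6, mode="classic"))  # 24
print(cycles_to_missed_word(6, mode="stream"))   # 23
print(cycles_to_missed_word(6, mode="wrap"))     # 17
```

Note how, for a miss late in the line, streaming alone buys almost nothing; it's the wrap ordering that moves the wanted word to the front.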
This should then be compared against some actual performance measure. Sadly, that’s one part of caches that I haven’t managed to get right–the performance analysis. Even worse, the lack of a solid ability to analyze this improvement has kept me from writing an article introducing the <a href="/about/zipcpu.html">ZipCPU</a>’s <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v">instruction cache</a> in the first place. Perhaps I’ll manage to come back to this later–although it’s held me back for a couple of years now.</p> <p>Since I haven’t presented the <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v">instruction cache</a> in the first place, it doesn’t really make sense to write an article presenting the <em>modifications</em> required to introduce wrap addressing. That said, it was easier to do than I was expecting.</p> <table align="center" style="float: right; padding: 25px"><caption>Fig 8. Is formal worth it?</caption><tr><td><img src="/img/pfwrap/formal-value.svg" width="420" /></td></tr></table> <p>I suppose “easier” is a relative term. I upgraded both <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v">instruction</a> and <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v">data</a> caches quickly–perhaps even in an afternoon. Then, when everything failed in simulation, I reverted the <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v">data cache</a> updates to focus on the <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v">instruction</a> cache updates. Those updates are now complete, as is their formal proof, so I expect I’ll push them soon. All in all, the work took me a couple of days, spread over a month or so, with (as expected) the verification part taking the longest.</p> <p>No, the updates aren’t (yet) posted. Why not?
Because this update lies behind the <a href="/about/zipcpu.html">ZipCPU</a>’s AXI DMA upgrade, and … that one still has bugs to be worked out. What bugs? Well, after posting the DMA initially, I then decided I wanted to change how the DMA handled unaligned FIXED addressing. My typical answer to unaligned FIXED addressing is to declare it disallowed in the user manual, but for some reason I thought I might support it. The new/changed requirements then made it so that nothing worked, and so I have some updates left to do there before formal proofs and simulations pass again.</p> <p>So my next steps are to 1) repeat this work with the <a href="https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v">data cache</a>, and 2) finish working with the <a href="/about/zipcpu.html">ZipCPU</a>’s DMA, so that 3) I can post another upgrade to the <a href="/about/zipcpu.html">ZipCPU</a>’s repository. In the meantime, I’ll probably post my DDR3 controller memory performance benchmarks before these updates hit the <a href="/about/zipcpu.html">ZipCPU</a> <a href="https://github.com/ZipCPU/zipcpu">official repository</a>.</p> <p>For now, let me point out that the WRAP addressing performance is significantly better, and the logic cost associated with it is (surprisingly) rather minimal. How much better? Well, that answer will have to wait until I can do a better job quantifying cache performance …</p> <hr /><p><em>So the last shall be first, and the first last: for many be called, but few chosen. -- Matt 20:16</em> Sat, 29 Mar 2025 00:00:00 -0400 https://zipcpu.com/zipcpu/2025/03/29/pfwrap.html Your problem is not AXI <p>The following was a request for help from my inbox. It illustrates a common problem students have. Indeed, the problem is common enough that <a href="/fpga-hell.html">this blog was dedicated</a> to its solution.
Let me repeat the question here for reference:</p> <blockquote> <p>I’ve read some of your articles and old comments on forums in trying to get something resembling Xilinx’ AXI4 Peripheral to work with my current project in VIVADO for my FPGA. My main problem is that whenever I so much as add a customizable AXI to my block design and connect it to my AXI peripheral, generate a bitstream (with no failures), then build a platform using it in VITIS (with no failures), my AXI GPIO connections which should not be connected to the recently added customizable AXI, do not operate at all (LEDs act as if tied to 0, although I’m sending all 1s). I tried a solution I found online talking about incorrect “Makefile”s but to no avail. I have also tried just adding some of your files <a href="https://github.com/ZipCPU/wb2axip">you provided on github</a> instead of the Xilinx’ broken IP including “<a href="https://github.com/ZipCPU/wb2axip/blob/master/rtl/demoaxi.v">demoaxi.v</a>” and “<a href="https://github.com/ZipCPU/wb2axip/blob/master/rtl/easyaxil.v">easyaxi.v</a>” [sp]. The “<a href="https://github.com/ZipCPU/wb2axip/blob/master/rtl/demoaxi.v">demoaxi.v</a>” has the exact same problem as Xilinx’ AXI, just adding it to the block design and connecting it to my AXI peripheral causes the GPIO not connect somehow. Your “<a href="https://github.com/ZipCPU/wb2axip/blob/master/rtl/easyaxil.v">easyaxi.v</a>” [sp] does not cause this issue right away, however adding an output and assigning it with the slave register “r0” then results in the same issue. I am at a loss for what to do. I’m not very familiar with the specifics of how AXI works, even after re-reading some of your articles multiple times (I’m still a student with very little experience), so I can’t be certain why I am running into this issue. My guess at what is happening is that adding an AXI block with a certain characteristic somehow causes the addresses for my GPIO and other connections to “bug out”. 
But I have no idea why adding this kind of AXI block does this (or something else that causes my issue). I’m reaching out because I … might as well do something other than making small changes to my design and waiting for 30+ minutes in between tests to see if something breaks or doesn’t break my GPIO. Do you have any idea what might be causing my issue or how to fix it?</p> <p>Thanks,</p> <p>(Student)</p> </blockquote> <p>(Links have been added …)</p> <p>Let’s start with the easy question:</p> <blockquote> <p>Do you have any idea what might be causing my issue or how to fix it?</p> </blockquote> <p>No. Without looking at the design, the schematic, or digging into the design files, I can’t really comment on something like this. Debugging hardware designs is hard work, it takes time, and it takes a lot of attention to detail. Without the details, I won’t be able to find the bug.</p> <p>That said, let’s back up and address the root problem, and it’s not AXI.</p> <p>Yes, I said that right: This student’s problem is not AXI.</p> <p>If anything, AXI is just the symptom. If you don’t deal with the actual problem, you will not succeed in this field.</p> <h2 id="iterative-debugging">Iterative Debugging</h2> <p>The fundamental problem is the method of debugging. The problem is that the design doesn’t work, and this student doesn’t know how to figure out why not. This was why I created my blog in the first place–to address this type of problem.</p> <table align="center" style="float: right"><caption>Fig 1. This is not how to do debugging</caption><tr><td><img src="/img/not-axi/broken-process.svg" width="320" /></td></tr></table> <p>Here’s what I am hearing from the description: I tried A. It didn’t work. I don’t know why not. So I tried B. That didn’t work either. I still don’t know why not. Let me try asking an expert to see if he knows. It’s as though the student expects me to be able, from these symptoms alone, to figure out what’s wrong.</p> <p>That’s not how this works. 
Indeed, this debugging process will lead you straight to <a href="/fpga-hell.html">FPGA Hell</a>.</p> <p>As an illustration, and for a fun story, consider the problem I’ve been working on for the past couple of weeks. I’m trying to get the FPGA processing working for <a href="https://www.youtube.com/watch?v=vSB9BcLcUhM">this video project (fun promo video link)</a>.</p> <p>I got stuck for about two weeks at the point where I commanded the algorithm to start and it didn’t do anything. Now what?</p> <table align="center" style="padding: 25px; float: left"><caption>Fig 2. Voodoo computing defined</caption><tr><td><img src="/img/sdrxframe/voodoo.svg" width="320" /></td></tr></table> <p>One approach to this problem would be to just change things, with no understanding of what’s going on. I like to call this “Voodoo Computing”. Sadly, it’s a common method of debugging that just … doesn’t work.</p> <p>I use this definition because … it’s just so true. Even I often find myself doing “voodoo computing” at times, and somehow expecting things to suddenly fix themselves. The reality is, that’s not how engineering works.</p> <p>Engineering works by breaking a problem down into smaller problems, and then breaking those problems into smaller ones still. In this student’s case, he has a problem where his AXI slave doesn’t work. Let’s break that down by asking a question: Is it your design that’s failing, or the Vivado-created “rest-of-the-system” that’s failing? Draw a line. Measure. Which one is it?</p> <table align="center" style="float: right"><caption>Fig 3. Iterative Debugging</caption><tr><td><img src="/img/not-axi/iterative-debugging.svg" width="320" /></td></tr></table> <p>Well, how would you know? You know by adding a test point of some type. “Look” inside the system. Look at what’s going on. Look for any internal evidence of a bug. For example, this student wants to write to his component and to see a pin change. Perfect.
Now trigger a capture on any writes to this component, and see if you can watch that pin change from within the capture and on the board. Does the component actually get written to? Do the <code class="language-plaintext highlighter-rouge">AWVALID</code>, <code class="language-plaintext highlighter-rouge">AWREADY</code>, <code class="language-plaintext highlighter-rouge">WVALID</code>, <code class="language-plaintext highlighter-rouge">WREADY</code>, <code class="language-plaintext highlighter-rouge">BVALID</code>, and <code class="language-plaintext highlighter-rouge">BREADY</code> signals toggle appropriately? How about <code class="language-plaintext highlighter-rouge">WDATA</code> and <code class="language-plaintext highlighter-rouge">WSTRB</code>? What of <code class="language-plaintext highlighter-rouge">AWADDR</code>? (You might need to reduce this to a single bit: <code class="language-plaintext highlighter-rouge">mydbg = (AWADDR == mydevices_register);</code>) If all these are getting set appropriately, then the problem is in your design. Voila! You’ve just narrowed down the issue.</p> <p>Let’s illustrate this idea. You have a design that doesn’t work. You need to figure out where the bug lies. So we first break this design into three parts. I’ll call them 1) the AXI IP, 2) the LED output, and 3) the rest of the design.</p> <table align="center" style="float: none"><caption>Fig 4. Breaking down the problem</caption><tr><td><img src="/img/not-axi/decomposition.svg" width="560" /></td></tr></table> <p>I would suggest two test points–although these can probably be merged into the same “scope” (ILA). The first one would be between the AXI IP and the rest of the design. This test point should look at all the AXI signals. The second one should look at the LED output from your design.</p> <p>Yes, I can hear you say, but of course the problem is within my AXI IP! Ahm, no, you don’t get it. 
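To make that "draw a line and measure" step concrete, here's a deliberately simplified Python sketch of the kind of check you'd run on a captured ILA trace of an AXI-lite write: did the address, data, and response channels each complete a VALID/READY handshake? (The trace format and helper here are hypothetical, for illustration; a real setup would trigger and capture this in hardware.)

```python
def axi_write_completed(trace):
    """Scan a captured trace (a list of per-cycle signal dicts) and
    report whether a complete AXI-lite write took place: the AW, W,
    and B channels must each see at least one VALID && READY cycle.
    """
    def handshakes(valid, ready):
        # Count cycles where both VALID and READY were high together
        return sum(1 for cyc in trace if cyc[valid] and cyc[ready])

    aw = handshakes("AWVALID", "AWREADY")
    w = handshakes("WVALID", "WREADY")
    b = handshakes("BVALID", "BREADY")
    return aw > 0 and w > 0 and b > 0

# A write whose response never came back (BVALID stuck low) fails:
trace = [
    {"AWVALID": 1, "AWREADY": 1, "WVALID": 1, "WREADY": 1,
     "BVALID": 0, "BREADY": 1},
    {"AWVALID": 0, "AWREADY": 1, "WVALID": 0, "WREADY": 1,
     "BVALID": 0, "BREADY": 1},
]
print(axi_write_completed(trace))  # False
```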
Earlier this year, I shipped a design to a well-paying customer, and they came back and complained that my design wasn’t properly acknowledging write transactions. As I recall, either BID or BVALID was getting corrupted, or some such. What should I say as a professional engineer to a comment like that? Do I tell the customer, gosh, I don’t know, that’s never happened to me before? Do I tell him, not at all, my stuff works? Or do I make random changes for him to try to see if these would fix his problem? Frankly, none of these answers would be acceptable. Instead, I asked if he could provide a trace or other evidence of the problem that we could inspect together–much like I illustrated above in Fig. 4. When he did so, I was able to clearly point out that my design was working–it was just Vivado’s IP integrator that hadn’t properly connected it to the AXI bus. Yes, these things happen. You, as the engineer, need to narrow down where the bug is, and getting a “trace” of what is going on is one clear way to do this.</p> <table align="center" style="padding: 25px; float: left"><caption>Fig 5. Yes, it's hard. Get over it.</caption><tr><td><img src="/img/not-axi/encouragement.svg" width="320" /></td></tr></table> <p>This problem is often both iterative and time-consuming. Yes, it’s hard. As my Ph.D. advisor used to say, “Take an Aspirin. Get over it.” It’s a fact of life. This field isn’t easy. That’s why it pays well. Personally, that’s also why I find it so rewarding to work in this field.
I enjoy the excitement of getting something working!</p> <p>If we go back to the <a href="https://www.youtube.com/watch?v=vSB9BcLcUhM">video processing example I mentioned earlier</a>, I eventually found several bugs in my Verilog IP.</p> <ol> <li> <p>A bus arbiter was broken, and so the arbiter would get locked up following any bus error.</p> <p>(Yes, this was <a href="https://github.com/ZipCPU/eth10g/blob/master/rtl/wbmarbiter.v">my own arbiter</a>, and one I had borrowed from <a href="https://github.com/ZipCPU/eth10g">another project</a>. It had no problems in <a href="https://github.com/ZipCPU/eth10g">that other project</a>.)</p> </li> <li> <p>Every time the video chain got reset, the memory address got written to zero–and so the design tried accessing a NULL memory pointer. This was then the source of the bus error the arbiter was struggling with.</p> </li> <li> <p>The CPU was faulting since the video controller was writing video data to CPU instruction memory.</p> <p>I traced this to using the wrong linker description file. Sure, a simplified block-RAM-only description is great for initial bringup testing, but there’s no way a 1080p image frame will fit in block RAM in addition to the C library.</p> </li> <li> <p>A key video component was dropping pixels any time Xilinx’s MIG had a hiccup on the last return beat.</p> <p>This was a bit more insidious than it sounds. The component in question was the video frame buffer. This component reads video data from memory and generates an outgoing video stream. A broken signaling flag caused the frame buffer to drop the bus transaction while one word was still outstanding. This left the memory request and memory recovery FSMs off by one (more) beat.</p> <p>If you’ve ever stared at traces from Xilinx’s MIG, you’ll notice that it generates a lot of hiccups. Not only does it need to take the memory off line periodically for refreshes, but it also needs to take it off line more often for return clock phase tracking.
This means that the ready wire, in this case <code class="language-plaintext highlighter-rouge">ARREADY</code>, will have a lot of hiccups to it, and consequently the <code class="language-plaintext highlighter-rouge">RVALID</code> (and <code class="language-plaintext highlighter-rouge">BVALID</code>) acknowledgments will have similar hiccups as well.</p> <p>What happens, as it did in my case, when your design is sensitive to such a hiccup at one particular clock cycle in your operation but not others? The design might pass a simulation check, but still fail in hardware.</p> <p>Fig. 6 shows the basic trace of what was going on.</p> </li> </ol> <table align="center" style="float: none"><caption>Fig 6. The missing ACK</caption><tr><td><img src="/img/not-axi/hlast-bug-annotated.png" width="760" /></td></tr></table> <p>Notice what I just did there? I created a test point within the design, looked at signals from within that test point, captured a trace of what was going on, and hence was able to identify the problem. No, this wasn’t the first test point–it took a couple to get to this point. Still, this is an example of debugging a design within hardware.</p> <p>The story of this video development goes on.</p> <table align="center" style="float: right"><caption>Fig 7. The 3-board Stack</caption><tr><td><img src="/img/not-axi/stacked-woled.jpg" width="320" /></td></tr></table> <p>At this point, though, I’ve now moved from one board to three. On the one hand, that’s a success story. I only moved on once the single board was working. On the other hand, the three boards aren’t talking to each other (yet). I think I’ve now narrowed the problem down to a <a href="https://x.com/zipcpu/status/1853895732266516793">complex electrical interaction between the two boards</a>.</p> <p>How did I do that? The key was to be able to capture a trace of what was going on from within the system. Sound familiar?
First, I captured a trace indicating that the I2C master on the middle board was attempting to contact the I2C slave on the bottom board and … the bottom board wasn’t acknowledging. Then I captured a trace from the bottom board showing that the I2C pins weren’t even getting toggled. Indeed, I eventually got to the point where I was toggling the I2C pins by hand using the on-board switches–and even then the boards weren’t showing a connection between them.</p> <p>Generate a test. Test. Narrow down the problem. Continue.</p> <h2 id="enumerating-debug-methods">Enumerating Debug Methods</h2> <p>In many ways, debugging can be thought of as a feedback loop–much like <a href="https://en.wikipedia.org/wiki/John_Boyd_(military_strategist)">Col Boyd</a>’s <a href="https://en.wikipedia.org/wiki/OODA_loop">OODA loop</a>.</p> <table align="center" style="float: none"><caption>Fig 8. Debugging Feedback Loop</caption><tr><td><img src="/img/not-axi/feedback-loop.svg" width="560" /></td></tr></table> <p>The faster you can go through this loop, the faster you can find bugs, the better your design will be.</p> <p>Given this loop, let’s now go back and enumerate the basic methods for debugging a hardware design.</p> <ol> <li> <p><strong>Desk checking</strong>. This is the type of debugging where you stare at your design, and hopefully just happen to see whatever the bug was. Yes, I do this a lot. Yes, after a decade or two of doing design it does get easier to find bugs this way. After a while, you start to see patterns and learn to look for them. No, I’m still not very successful using this approach–and I’ve been doing digital design for a living for many years.</p> <p>In the case of this student’s design, I’m sure he’d stared at his design quite a bit and wasn’t seeing anything. Yeah. I get that. I’ve been there too.</p> <p>Build time required for desk checking? None.</p> <p>Test time? This doesn’t involve testing, so none.</p> <p>Analysis time? Well, it depends.
Usually I give up before spending too much time doing this.</p> </li> <li> <p><strong>Lint</strong>, sometimes called “Static Design Analysis”. This type of debugging takes place any time you use a tool to examine your design.</p> <p>I personally like to use <code class="language-plaintext highlighter-rouge">verilator -Wall -cc mydesign.v</code>. Using Verilator, I can get my design to have <em>zero</em> lint errors. Since this process tends to be so quick and easy, I rarely discuss bugs found this way. They’re just found and fixed so quickly that there’s no story to tell.</p> <p>Vivado also produces a list of lint errors (warnings) every time it synthesizes my design. The list tends to be long and filled with false alarms. Every once in a long while I’ll examine this list for bugs. Sometimes I’ll even find one or two.</p> <p>From the student’s email above, I gather he believed his design was good enough from this standpoint. Still, it’s a place worth looking when things take unexpected turns.</p> <p>Build time? None.</p> <p>Test time? Almost instantaneous when using Verilator.</p> <p>Analysis time? Typically very fast.</p> </li> <li> <p><strong>Formal methods</strong>. Formal methods involve first <em>assuming</em> things about your inputs, and then making <em>assertions</em> about how the design is supposed to work. A solver can then be used to logically <em>prove</em> that if your assumptions hold, then your assertions will as well. If the solver fails, it will provide you with a very short trace illustrating what might happen.</p> <p>You can read about <a href="/blog/2017/10/19/formal-intro.html">my own first experience with formal methods here</a>, although that’s no longer where I’d suggest you start. 
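</p> <p>To give a flavor of what this looks like in practice, consider the following sketch. This is a hypothetical example, not taken from any of the designs above: the solver is free to pick any sequence of <code class="language-plaintext highlighter-rouge">i_inc</code> inputs, and the assertion claims the counter can never leave its intended range. If the solver can ever violate that claim, it will hand you a short trace showing how.</p> <pre><code class="language-verilog">// Hypothetical example of formal properties placed within a design
module counter(i_clk, i_reset, i_inc, o_count);
	input	wire		i_clk, i_reset, i_inc;
	output	reg	[3:0]	o_count;

	initial	o_count = 0;
	always @(posedge i_clk)
	if (i_reset)
		o_count &lt;= 0;
	else if (i_inc &amp;&amp; o_count &lt; 4'd10)
		o_count &lt;= o_count + 1;

`ifdef	FORMAL
	// Assert a property of the design's state.  The solver will
	// try every possible input sequence in its attempt to break it.
	always @(*)
		assert(o_count &lt;= 4'd10);
`endif
endmodule
</code></pre> <p>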
Were I to recommend a starting place, it would probably be <a href="/tutorial/">my Verilog design tutorial</a>.</p> <p>Many of the bugs I mentioned in the <a href="https://www.youtube.com/watch?v=vSB9BcLcUhM">video design I’m working with</a> <em>should’ve</em> been found via formal methods. However, some of the key components didn’t get formally verified. (Yes, that’s on me. This was supposed to be a <em>prototype</em>…) The <a href="https://github.com/ZipCPU/eth10g/blob/master/rtl/wbmarbiter.v">arbiter</a>, however, had gone through a formal verification process. Sadly, at one point I had placed an assumption into the design that there would never be any bus errors. What do you know? That kept it from finding bus errors! Likewise, the <a href="https://x.com/zipcpu/status/1852735323161207089">frame buffer’s proof never passed induction</a>, so it never completed a full bus request to see what would happen if the two got out of sync. The excuses go on. I’m now working on formally verifying these components.</p> <p>In the case of the student above, he mentions using some formally verified designs, but says nothing about whether or not he formally verified the LED output of those designs.</p> <p>Build time? For formal methods, this typically refers to how long it takes to translate the design into a formal language of some type–such as SMT. When using Yosys, the time it takes to do this is usually so quick I don’t notice it.</p> <p>Test time? <a href="/formal/2019/08/03/proof-duration.html">We measured formal proof solver time some time ago</a>. Bottom line, 87% of the time a formal proof will take less than two minutes, and only 5% of the time will it ever take longer than ten minutes.</p> <p>Analysis time? This tends to only take a minute or two. One of the good things about formal proofs is that the solver will lead you directly to the error.</p> </li> <li> <p><strong>Simulation</strong>.</p> <p>Simulation is a very important debugging tool. 
It’s one of the easiest ways to find bugs. In general, if a design doesn’t work in simulation, then it will never work in hardware.</p> <p>However, simulation depends upon <em>models</em> of all of the components in question–both those written in Verilog and those only available via data sheet, from which Verilog (or other) models need to be written, and can thus only be approximations. As a result, there are often gaps between how the models work and what happens in reality.</p> <p>A second reality of simulation is that it’s not complete. There will always be cases that don’t get simulated. A good engineer will work to limit the number of these cases, but it’s very hard to eliminate them entirely. For example:</p> <ul> <li>Not simulating jumping to the last instruction in a cache line left me with <a href="/zipcpu/2017/12/28/ugliest-bug.html">quite a confusing mix of symptoms</a>.</li> <li>Not simulating bus errors led to missing a bus lockup in the arbiter above.</li> <li>Not simulating ACK dropping at the last beat in a series of requests led to the frame buffer perpetually resynchronizing.</li> <li>Not simulating stalls and multiple outstanding requests led Xilinx to believe their AXI demo worked.</li> </ul> <p>Considering the <a href="https://www.youtube.com/watch?v=vSB9BcLcUhM">video processing example</a> I’ve been discussing, I’ll be the first (and proudest) to declare that all of the video algorithms worked nicely in simulation. Yes, they worked in simulation–they just didn’t work in hardware. Why? My simulation didn’t include the MIG or the DDR3 SDRAM. Instead, I had <em>approximated</em> their performance with a basic block RAM implementation. This usually works for me, since I like to formally verify everything–only I didn’t formally verify everything this time. The result was some bugs that slipped through the cracks, and so among other things my simulation never fully exercised the design. 
My simulation also didn’t include the CPU, nor did it accurately have the same type and amount of memory as the final design had. These were all problems with my simulation that kept me from catching some of these last bugs.</p> <p>While simulation is the “easiest” type of debugging, it does tend to be slow and resource (i.e. memory and disk) intensive. Traces from my video tests are often 200GB or larger. Indeed, this is one of the reasons why the simulation doesn’t include the MIG DDR3 SDRAM controller, the CPU, the <a href="/blog/2019/03/27/qflexpress.html">flash</a>, <a href="/zipcpu/2018/07/13/memories.html">block RAM</a>, or the <a href="/blog/2019/07/17/crossbar.html">Wishbone crossbar</a>.</p> <p>I would be very curious to know if the student who wrote me had fully simulated his design–from ARM software to LED.</p> <p>Build time? When using Verilator, I’ve seen this take up to a minute or two for a large and complex design, although I rarely notice it.</p> <p>Test time? The video simulations I’ve been running take about an hour or so when using Verilator. A full ZipCPU test suite can take two hours using Verilator, or about a week when using Icarus Verilog.</p> <p>Test time gets annoying when using Vivado, since it doesn’t automatically capture every signal from within the design as Verilator will. I understand there’s a setting to make this happen, but … I haven’t found it yet.</p> <p>Analysis time? This tends to be longer than with formal methods, since I typically find myself tracing bugs through simulations of very large and complex designs, and it takes a while to trace back from the evidence of the bug to the actual bug itself. The worst examples of simulation analysis I’ve had to do were of <a href="https://www.arasan.com/products/nand-flash/">NAND flash simulations</a>, where you don’t realize you have a problem until you read results from the flash. 
Then you need to first find the evidence of the problem in the trace (expected value doesn’t match actual value), then trace it from the AXI bus to the flash read bus, across multiple flash transactions to the critical one that actually programmed the block in question, back across the flash bus to the host IP, and then potentially back further to the AXI transaction that provided the information in the first place. While doable, this can be quite painful.</p> </li> </ol> <table align="center" style="float: none"><caption>Fig 9. Tracing from cause to effect can require a lot of investigation</caption><tr><td><img src="/img/not-axi/longsim.svg" width="760" /></td></tr></table> <ol start="5"> <li> <p><strong>Debug in hardware</strong>. Getting to hardware is painful–it requires building a complete design, handling timing exceptions, and a typically long synthesis process. Once you get there, tests can typically be run very fast. However, such tests are often unrevealing. Trying something else on hardware often requires a design change, rebuild, and … a substantial stall in your process which will slow you down. In the case of this student, he measured this stall time at 30min.</p> <p>This <em>stall</em> time while things are rebuilding can make hardware debugging slow and expensive. Why is it expensive? Because time is expensive. I charge by the hour. I can do that. I’m not a student. Students, on the other hand, are often overloaded for time. They have other projects to do, and one class (or lab) consuming a majority of their time will quickly become a serious problem on the road to graduation.</p> <p>Knowing what’s wrong when things fail in hardware is … difficult–else I wouldn’t be writing this note.</p> <p>However, it’s a skill you need to have if you are going to work in this field. How can you do it? You can use LEDs. You can use your UART. If you are on an ARM-based FPGA, you can often use printf. 
You can use a companion CPU (PC), or even an on-board CPU (ARM or softcore). You can use the ILA, or you can build your own (that’s me). In all cases, you need to be able to extract the key information regarding the “bug” (whatever it might be) from the design. That key information needs to point you to the bug. Is it in Vivado-generated IP? Is it in the Verilog? If it’s in your Verilog, where is it? You need to be able to bisect your design repeatedly to figure this out.</p> <p>In the case of <a href="https://www.youtube.com/watch?v=vSB9BcLcUhM">the video project I’m working on</a>, this is (currently) where I’m at in my development.</p> <p>In the case of the student above, I’d love to know whether <code class="language-plaintext highlighter-rouge">assign led=1;</code> would work, if the LED control wire was mapped to the correct pin, or if the LED’s control was inverted. Without more information, I might never know.</p> <p>Build time? That is, how long does it take to turn the design Verilog into a bit file? Typically I deal with build times of roughly 12-15 minutes. The student above was dealing with a 30min build time. I’ve heard horror stories of Vivado even taking as long as a day for particularly large designs, but never had to deal with delays that long myself.</p> <p>Test time? Most hardware tests take longer to set up than to perform, so I’ll note this as “almost instantaneous.” Certainly my video tests tended to be very quick.</p> <p>Analysis time? “What just happened?” seems to be a common refrain in hardware testing. Sure, you just ran a test, but … what really happened in it? This is the problem with testing in hardware. It can take a lot of work to get to the “success” or “failure” measure. In the video processing case, video processing takes place a pixel at a time at over 80M pixels per second, but the final “success” (once I got there) was watching the effects of the video processing as applied to a 4-minute video. 
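</p> <p>As for the LED approach mentioned above, it can be as simple as the following sketch. The signal names here are hypothetical; the point is to first rule out the pin constraint and polarity, and only then route internal state outward, stretched long enough to be visible:</p> <pre><code class="language-verilog">// Step one (hypothetical): bypass the design entirely.  If this LED
// never lights, suspect the pin constraint or an active-low LED,
// rather than your logic.
assign	o_led = 1'b1;

// Step two: route an internal event of interest to a second LED,
// stretched so that a single-cycle event stays visible to the eye.
reg	[23:0]	stretch;
initial	stretch = 0;
always @(posedge i_clk)
if (bus_error_seen)	// Some hypothetical single-cycle event
	stretch &lt;= 24'hff_ffff;
else if (stretch != 0)
	stretch &lt;= stretch - 1;

assign	o_dbg_led = (stretch != 0);
</code></pre> <p>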
Indeed, I was so excited (once I got there), that I called everyone from my family to come and watch.</p> </li> </ol> <p>While I’d love to say one debugging method is better than another, the reality is that they each have their strengths and weaknesses. Formal methods, for example, don’t often work on medium to large designs. Lint tends to miss things. You get the picture. Still, you need to be familiar with every technique, to have them in your tool belt for when something doesn’t work.</p> <h2 id="conclusion">Conclusion</h2> <p>Again, the bottom line is that you need to know how to debug a design to succeed in this field. This is a prerequisite for anything that might follow–such as building an AXI slave. Perhaps a <a href="https://zipcpu.com/zipcpu/2019/02/04/debugging-that-cpu.html">fun story</a> might help illustrate my points.</p> <p>You might also find the <a href="https://zipcpu.com/blog/2017/06/02/design-process.html">first article I wrote on this hardware debugging topic</a> to be valuable.</p> <p>Or how about <a href="https://zipcpu.com/blog/2017/06/10/lost-college-student.html">the response from a student who then commented on that article, after struggling with these same issues</a>?</p> <p>In all of this, the hard reality remains:</p> <ol> <li> <p>Hardware debugging is hard.</p> </li> <li> <p>There is a methodology to it. I might even use the word “methodical”, but that would be redundant.</p> </li> <li> <p>You will need to learn that methodology to debug your design.</p> </li> <li> <p>Once you understand the methodology of hardware debugging, you can then debug any design–to include any AXI design.</p> </li> </ol> <p>Hardware design isn’t for everybody. Not everyone will make it through their learning process–be it college or self taught. Yes, there are <a href="https://reddit.com/r/FPGA">design communities</a> that would love to help and encourage you. 
On the bright side, hard work pays well in any field.</p> <hr /><p><em>Seest thou a man diligent in his business? He shall stand before kings; he shall not stand before mean men. (Prov 22:29)</em> Wed, 06 Nov 2024 00:00:00 -0500 https://zipcpu.com/blog/2024/11/06/not-axi.html https://zipcpu.com/blog/2024/11/06/not-axi.html blog My Personal Journey in Verification <p>This week, I’ve been testing a CI/CD pipeline. This has been my opportunity to shake the screws and kick the tires on what should become a new verification product shortly.</p> <p>I thought that a good design to check might be my <a href="https://github.com/ZipCPU/sdsdpi">SDIO project</a>. It has roughly all the pieces in place, and so makes sense for an automated testing pipeline.</p> <p>This weekend, the CI project engineer shared with me:</p> <blockquote> <p>It’s literally the first time I get to know a good hardware project needs such many verifications and testings! There’s even a real SD card simulation model and RW test…</p> </blockquote> <p>After reminiscing about this for a bit, I thought it might be worth taking a moment to tell how I got here.</p> <h2 id="verification-the-goal">Verification: The Goal</h2> <p>Perhaps the best way to explain the “goal” of verification is by way of an old “war story”–as we used to call them.</p> <p>At one time, I was involved with a DOD unit whose whole goal and purpose was to build quick reaction hardware capabilities for the warfighter. We bragged about our ability to respond to a call on a Friday night with a new product shipped out on a C-130 before the weekend was over.</p> <p>Anyone who has done engineering for a while will easily recognize that this sort of concept violates all the good principles of engineering. There’s no time for a requirements review. There’s no time for prototyping–or perhaps there is, to the extent that it’s always the <em>prototype</em> that heads out the door to the warfighter as if it were a <em>product</em>. 
There’s no time to build a complete test suite, to verify the new capability against all things that could go wrong. However, we’d often get only one chance to do this right.</p> <p>Now, how do you accomplish quality engineering in that kind of environment?</p> <p>The key to making this sort of shop work lay in the “warehouse”, and what sort of capabilities we might have “lying on the shelf” as we called it. Hence, we’d spend our time polishing prior capabilities, as well as anticipating new requirements. We’d then spend our time building, verifying, and testing these capabilities against phantom requirements, in the hopes that they’d be close to what we’d need to build should a real requirement arise. We’d then place these concept designs in the “warehouse”, and show them off to anyone who came to visit wondering what it was that our team was able to accomplish. Then, when a new requirement arose, we’d go into this “warehouse” and find whatever capability was closest to what the customer required and modify it to fit the mission requirement.</p> <p>That was how we achieved success.</p> <table align="center" style="float: right"><tr><td><img src="/img/vlog-wait/rule-of-gold.svg" width="320" /></td></tr></table> <p>The same applies in digital logic design. You want to have a good set of tried, trusted, and true components in your “library” so that whenever a new customer comes along, you can leverage these components quickly to meet his needs. This is why I’ve often said that well written, well tested, well verified design components are gold in this business. Such components allow you to go from zero to product in short order. 
Indeed, the more well-tested components you have that you can <a href="/blog/2020/01/13/reuse.html">reuse</a>, the faster you’ll get to market with any new need, and the cheaper it will be to get there.</p> <p>That’s therefore the ultimate goal: a library of <a href="/blog/2020/01/13/reuse.html">reusable</a> components that can be quickly composed into new products for customers.</p> <p>As I’ve tried to achieve this objective over the years, my approach to component verification has changed, or rather grown, many times over.</p> <h2 id="hardware-verification">Hardware Verification</h2> <p>When I first started learning FPGA design, I understood nothing about simulation. Rather than learning how to do simulation properly, I instead learned quickly how to test my designs in hardware. Most of these designs were DSP-based. (My background was DSP, so this made sense …) Hence, the following approach tended to work for me:</p> <ul> <li> <p>I created access points in the hardware that allowed me to read and write registers at key locations within the design.</p> </li> <li> <p>One of these “registers” I could write to controlled the inputs to my DSP pipeline.</p> </li> <li> <p>Another register, when written to, would cause the design to “step” the entire DSP pipeline as if a new sample had just arrived from the A/D.</p> </li> <li> <p>A set of registers within the design then allowed me to read the state of the entire pipeline, so I could do debugging.</p> </li> </ul> <p>This worked great for “stepping” through designs. When I moved to processing real-time information, such as the A/D results from the antenna connected to the design, I built an internal logic analyzer to catch and capture key signals along the way.</p> <p>I called this “Hardware in the loop testing”.</p> <p>Management thought I was a genius.</p> <p>This approach worked … for a while. Then I started realizing how painful it was. 
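</p> <p>In rough terms, the stepping interface described above might look like the following. This is a hedged reconstruction, not the original design; the register addresses and the pipeline stage logic are made up for illustration:</p> <pre><code class="language-verilog">// A bus write to ADR_INPUT loads the next "sample"; a write to
// ADR_STEP advances the DSP pipeline by one clock, as though a new
// sample had just arrived from the A/D.
always @(posedge i_clk)
if (i_wr &amp;&amp; i_addr == ADR_INPUT)
	sample_in &lt;= i_data;

wire	pipe_ce;
assign	pipe_ce = (i_wr &amp;&amp; i_addr == ADR_STEP);

always @(posedge i_clk)
if (pipe_ce)
begin
	// Hypothetical pipeline stages: these only advance on command
	stage_one &lt;= sample_in + coeff;
	stage_two &lt;= stage_one + stage_two;
end

// Reads expose the internal pipeline state for debugging
assign	o_rddata = (i_addr == ADR_STAGE1) ? stage_one : stage_two;
</code></pre> <p>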
I think the transition came when I was trying to debug <a href="/2018/10/02/fft.html">my FFT</a> by writing test vectors to an Arty A7 circuit board via UART, and reading the results back to display them on my screen. Even with the hardware in the loop, hitting all the test vectors was painfully slow.</p> <p>Eventually, I had to search for a new and better solution. This was just too slow. Later on, I would start to realize that this solution didn’t catch enough bugs–but I’ll get to that in a bit.</p> <h2 id="happy-path-simulation-testing">Happy Path Simulation Testing</h2> <p><a href="https://en.wikipedia.org/wiki/Happy_path">“Happy path” testing</a> is a reference to simply testing working paths through a project’s environment. To use an aviation analogy, a <a href="https://en.wikipedia.org/wiki/Happy_path">“happy path” test</a> might make sure the ground avoidance radar never alerted when you weren’t close to the ground. It doesn’t make certain that the radar necessarily does the right thing when you are close to the ground.</p> <p>So, let’s talk about my next project: the <a href="/about/zipcpu.html">ZipCPU</a>.</p> <p>Verification of the <a href="/about/zipcpu.html">CPU</a> began with an <a href="https://github.com/ZipCPU/zipcpu/blob/a20b6064ea794d66fdeb2e00929287d7f2dc9ac6/bench/asm/simtest.s">assembly program</a> the <a href="/about/zipcpu.html">ZipCPU</a> would run. The <a href="https://github.com/ZipCPU/zipcpu/blob/a20b6064ea794d66fdeb2e00929287d7f2dc9ac6/bench/asm/simtest.s">program</a> was designed to test all the instructions of the <a href="/about/zipcpu.html">CPU</a> with sufficient fidelity to know when/if the <a href="/about/zipcpu.html">CPU</a> worked.</p> <p>The test had one of two outcomes. If the program halted, then the test was considered a success. If it detected an error, the <a href="/about/zipcpu.html">CPU</a> would execute a <code class="language-plaintext highlighter-rouge">BUSY</code> instruction (i.e. 
jump to current address) and then perpetually loop. My test harness could then detect this condition and end with a failing exit code.</p> <p>When the <a href="/about/zipcpu.html">ZipCPU</a> acquired a software tool chain (GCC+Binutils) and C-library support, this <a href="https://github.com/ZipCPU/zipcpu/blob/a20b6064ea794d66fdeb2e00929287d7f2dc9ac6/bench/asm/simtest.s">assembly program</a> was abandoned and replaced with a <a href="https://github.com/ZipCPU/zipcpu/blob/a20b6064ea794d66fdeb2e00929287d7f2dc9ac6/sim/zipsw/cputest.c">similar program in C</a>. While I still use <a href="https://github.com/ZipCPU/zipcpu/blob/a20b6064ea794d66fdeb2e00929287d7f2dc9ac6/sim/zipsw/cputest.c">this program</a>, it’s no longer the core of the <a href="/about/zipcpu.html">ZipCPU</a>’s verification suite. Instead, I tend to use it to shake out any bugs in any new environment the <a href="/about/zipcpu.html">ZipCPU</a> might be placed into.</p> <p>This approach failed horribly, however, when I tried integrating an <a href="https://github.com/ZipCPU/zipcpu/blob/a20b6064ea794d66fdeb2e00929287d7f2dc9ac6/rtl/core/pfcache.v">instruction cache</a> into the <a href="/about/zipcpu.html">ZipCPU</a>. I built the <a href="https://github.com/ZipCPU/zipcpu/blob/a20b6064ea794d66fdeb2e00929287d7f2dc9ac6/rtl/core/pfcache.v">instruction cache</a>. I tested the <a href="https://github.com/ZipCPU/zipcpu/blob/a20b6064ea794d66fdeb2e00929287d7f2dc9ac6/rtl/core/pfcache.v">instruction cache</a> in isolation. I tested the <a href="https://github.com/ZipCPU/zipcpu/blob/a20b6064ea794d66fdeb2e00929287d7f2dc9ac6/rtl/core/pfcache.v">cache</a> as part of the <a href="/about/zipcpu.html">CPU</a>. I convinced myself that it worked. 
Then I placed my “working” design onto hardware and <a href="/zipcpu/2017/12/28/ugliest-bug.html">all hell broke loose</a>.</p> <p>This was certainly not “the way.”</p> <h2 id="formal-verification">Formal Verification</h2> <p>I was then asked to <a href="/blog/2017/10/19/formal-intro.html">review a new, open source, formal verification tool called SymbiYosys</a>. The tool handed my cocky attitude back to me, and took my pride down a couple steps. In particular, I found a bunch of bugs in a FIFO I had used for years. The bugs had never shown up in hardware testing (that I had noticed at least), and certainly hadn’t shown up in any of my <a href="https://en.wikipedia.org/wiki/Happy_path">“Happy path” testing</a>. This left me wondering, how many other bugs did I have in my designs that I didn’t know about?</p> <p>I then started <a href="/blog/2018/01/22/formal-progress.html">working through my previous projects, formally verifying all my prior work</a>. In every case, I found more bugs. By the time I got to the <a href="/about/zipcpu.html">ZipCPU</a>–<a href="/blog/2018/04/02/formal-cpu-bugs.html">I found a myriad of bugs in what I thought was a “working”</a> <a href="/about/zipcpu.html">CPU</a>.</p> <p>I’d like to say that the quality of my IP went up at this point. I was certainly finding a lot of bugs I’d never found before by using formal methods. I now knew, for example, how to guarantee I’d never have any more of those cache bugs I’d had before.</p> <p>So, while it is likely that my IP quality was going up, the unfortunate reality was that I was still finding bugs in my “formally verified” IP–although not nearly as many.</p> <p>A <a href="/formal/2020/06/12/four-keys.html">couple of improvements</a> helped me move forward here.</p> <ol> <li> <p>Bidirectional formal property sets</p> <p>The biggest danger in formal verification is that you might <code class="language-plaintext highlighter-rouge">assume()</code> something that isn’t true. 
The first way to limit this is to make sure you never <code class="language-plaintext highlighter-rouge">assume()</code> a property within the design, but rather you only <code class="language-plaintext highlighter-rouge">assume()</code> properties of inputs–never outputs, and never local registers.</p> <p>But how do you know when you’ve assumed too much? This can be a challenge.</p> <p>One of the best ways I’ve found to do this is to create a bidirectional property set. A bus master, for example, would make assumptions about how the slave would respond. A similar property set for the bus slave would make assumptions about what the master would do. Further, the slave would turn the master’s assumptions into verifiable assertions–guaranteeing that the master’s assumptions were valid. If you can use the same property set in this manner for both master and slave, save that you swap assumptions and assertions, then you can verify each in isolation while assuming only those things that can be verified elsewhere.</p> <p>Creating such property sets for both AXI-Lite and AXI led me to find many bugs in Xilinx IP. This alone suggested that I was on the “right path”.</p> </li> <li> <p>Cover checking</p> <p>I also learned to use <a href="/formal/2018/07/14/dev-cycle.html">formal coverage checking</a>, in addition to straight assertion-based verification. Cover checks weren’t the end-all, but they could be useful in some key situations. For example, a quick cover check might help you discover that you had gotten the reset polarity wrong, and so all of your formal assertions were passing because your design was assumed to be held in reset. (This has happened to me more than once. 
Most recently, the <a href="/blog/2024/06/13/kimos.html">cost was a couple of months’ delay</a> on what should’ve otherwise been a straightforward hardware bring-up–but that wasn’t really a <em>formal</em> verification issue.)</p> <p>For a while, I also <a href="/formal/2018/07/14/dev-cycle.html">used cover checking to quickly discover (with minimal work) how a design component might work within a larger environment</a>. I’ve since switched to simulation checking (with assertions enabled) for my most recent examples of this type of work, but I do still find it valuable.</p> </li> <li> <p><a href="/blog/2018/03/10/induction-exercise.html">Induction</a></p> <p><a href="/blog/2018/03/10/induction-exercise.html">Induction</a> isn’t really a “new” thing I learned along the way, but it is worth mentioning specially. As I learned formal verification, I learned to use <a href="/blog/2018/03/10/induction-exercise.html">induction</a> right from the start and so I’ve tended to use <a href="/blog/2018/03/10/induction-exercise.html">induction</a> in every proof I’ve ever done. It’s just become my normal practice from day one.</p> <p><a href="/blog/2018/03/10/induction-exercise.html">Induction</a>, however, takes a lot of work. Sometimes it takes so much work I wonder if there’s really any value in it. Then I tend to find some key bug or other–perhaps a buffer overflow or something–some bug I’d have never found without <a href="/blog/2018/03/10/induction-exercise.html">induction</a>. That alone keeps me running <a href="/blog/2018/03/10/induction-exercise.html">induction</a> every time I can. Even better, once the <a href="/blog/2018/03/10/induction-exercise.html">induction</a> proof is complete, you can often <a href="/formal/2019/08/03/proof-duration.html">trim the entire formal proof from 15-20 minutes down to less than a single minute</a>.</p> </li> <li> <p>Contract checking</p> <p>My initial formal proofs were haphazard. 
I’d throw assertions at the wall and see what I could find. Yes, I found bugs. However, I never really had the confidence that I was “proving” a design worked. That is, not until I learned of the idea of a “formal contract”. The “formal contract” simply describes the essence of how a component works.</p> <p>For example, in a memory system, the formal contract might have the solver track a single value of memory. When written to, the value should change. When read, the value should be returned. If this contract holds for all such memory addresses, then the memory acts (as you would expect) … like a <em>memory</em>.</p> </li> <li> <p>Parameter checks</p> <p>For a while, I was maintaining <a href="https://github.com/ZipCPU/zbasic">“ZBasic”–a basic ZipCPU distribution</a>. This was where I did all my simulation-based testing of the <a href="/about/zipcpu.html">ZipCPU</a>. The problem was, this approach didn’t work. Sure, I’d test the <a href="/about/zipcpu.html">CPU</a> in one configuration, get it to work, and then put it down believing the “<a href="/about/zipcpu.html">CPU</a>” worked. Some time later, I’d try the <a href="/about/zipcpu.html">CPU</a> in a different configuration–such as pipelined vs non-pipelined, and … it would fail in whatever mode it had not been tested in. The problem with the <a href="https://github.com/ZipCPU/zbasic">ZBasic approach</a> is that it tended to only check one mode–leaving all of the others unchecked.</p> <p>This led me to adjust the proofs of the <a href="/about/zipcpu.html">ZipCPU</a> so that the <a href="/about/zipcpu.html">CPU</a> would at least be formally verified with as many parameter configurations as I could to make sure it would work in all environments.</p> </li> </ol> <p>I’ve written more about <a href="/formal/2020/06/12/four-keys.html">these parts of a proof some time ago</a>, and I still stand by them today.</p> <p>Yes, formal verification is hard work. 
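</p> <p>To illustrate the contract idea above, here is a hedged sketch of what such properties can look like for a simple memory with a one-cycle read latency. The signal names are hypothetical, and the <code class="language-plaintext highlighter-rouge">(* anyconst *)</code> attribute lets the solver pick one arbitrary but fixed address: prove the contract for that address, and you’ve proven it for all of them.</p> <pre><code class="language-verilog">`ifdef	FORMAL
	reg	f_past_valid;
	initial	f_past_valid = 1'b0;
	always @(posedge i_clk)
		f_past_valid &lt;= 1'b1;

	// Let the solver choose one address, then shadow every write to it
	(* anyconst *)	wire	[AW-1:0]	f_addr;
	reg	[DW-1:0]	f_data;

	always @(posedge i_clk)
	if (i_wr &amp;&amp; i_addr == f_addr)
		f_data &lt;= i_wrdata;

	// The contract: a read from that address must return the value
	// most recently written to it (assuming a one-cycle read latency,
	// and that a read during a write returns the old value)
	always @(posedge i_clk)
	if (f_past_valid &amp;&amp; $past(i_rd &amp;&amp; i_addr == f_addr))
		assert(o_rddata == $past(f_data));
`endif
</code></pre> <p>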
However, a well verified design is highly valuable–on the shelf, waiting for that new customer requirement to come in.</p> <p>The problem with all this formal verification work lies in its (well known) Achilles heel. Because formal verification includes an exhaustive combinatorial search for bugs across all potential design inputs and states, it can be computationally expensive. Yeah, it can take a while. To reduce this expense, it’s important to limit the scope of what is verified. As a result, I tend to verify design <em>components</em> rather than entire designs. This leaves open the possibility of a failure in the logic used to connect all these smaller, verified components together.</p> <h2 id="autofpga-and-better-crossbars">AutoFPGA and Better Crossbars</h2> <p>Sure enough, the next class of bugs I had to deal with were integration bugs.</p> <p>I had to deal with several. Common bugs included:</p> <ol> <li> <p>Using unnamed ports, and connecting module ports to the wrong signals.</p> <p>At one point, I decided the <a href="/zipcpu/2017/11/07/wb-formal.html">Wishbone</a> “stall” port should come before the <a href="/zipcpu/2017/11/07/wb-formal.html">Wishbone</a> acknowledgment port. Now, how many designs had to change to accommodate that?</p> </li> <li> <p>I had a bunch of problems with my <a href="/blog/2017/06/22/simple-wb-interconnect.html">initial interconnect design</a> methodology. Initially, I used the slave’s <a href="/zipcpu/2017/11/07/wb-formal.html">Wishbone</a> strobe signal as an address decoding signal. I then had a bug where the address would move off of the slave of interest, and the acknowledgment was never returned. 
The result of that bug was that the design hung any time I tried to read the entirety of <a href="/blog/2019/03/27/qflexpress.html">flash memory</a>.</p> <p>Think about how much simulation time and effort I had to go through to simulate reading an <em>entire</em> <a href="/blog/2019/03/27/qflexpress.html">flash memory</a>–just to find this bug at the end of it. Yes, it was painful.</p> </li> </ol> <p>Basically, when connecting otherwise “verified” modules together by hand, I had problems where the result didn’t work reliably.</p> <p>The first and most obvious solution to something like this is to use a linting tool, such as <code class="language-plaintext highlighter-rouge">verilator -Wall</code>. <a href="https://www.veripool.org/verilator/">Verilator</a> can find things like unconnected pins and such. That’s a help, but I had been doing that from early on.</p> <p>My eventual solution was twofold. First, I redesigned my <a href="/blog/2019/07/17/crossbar.html">bus interconnect</a> from the top to the bottom. You can find the new and redesigned <a href="/blog/2019/07/17/crossbar.html">interconnect</a> components in my <a href="https://github.com/ZipCPU/wb2axip">wb2axip repository</a>. Once these components were verified, I then had a proper guarantee: all masters would get acknowledgments (or errors) in response to all of the requests they made of any slave. Errors would no longer be lost. Attempts to interact with a non-existent slave would (properly) return bus errors.</p> <p>To deal with problems where signals were connected incorrectly, I built a tool I call <a href="/zipcpu/2017/10/05/autofpga-intro.html">AutoFPGA</a> to connect components into designs. A special tag given to the tool would immediately connect all bus signals to a bus component–whether it be a slave or master, whether it be connected to a <a href="/zipcpu/2017/11/07/wb-formal.html">Wishbone</a>, <a href="/formal/2018/12/28/axilite.html">AXI-Lite</a>, or <a href="/formal/2019/05/13/axifull">AXI</a> bus. 
This required that my slaves follow one of two conventions. Either all the bus ports had to follow a basic port ordering convention, or they needed to follow a bus naming convention. Ideally, a slave should follow both. Further, after finding even more port connection bugs, I’m slowly moving towards the practice of naming all of my port connections.</p> <p>This works great for composing designs of bus components. Almost all of my designs now use this approach, and only a few (mostly test bench) designs remain where I connect bus components by hand.</p> <h2 id="mcy">MCY</h2> <p>At one time along the way, I was asked to review <a href="https://github.com/YosysHQ/mcy">MCY: Mutation Coverage with Yosys</a>. My review back to the team was … mixed.</p> <p><a href="https://github.com/YosysHQ/mcy">MCY</a> works by intentionally breaking your design. Such changes to the design are called “mutations”. The goal is to determine whether or not the mutated (broken) design will trigger a test failure. In this fashion, the test suite can be evaluated. A “good” test suite will be able to find any mutation. Hence, <a href="https://github.com/YosysHQ/mcy">MCY</a> allows you to measure how good your test suite is in the first place.</p> <p>Upon request, I tried <a href="https://github.com/YosysHQ/mcy">MCY</a> with the <a href="/about/zipcpu.html">ZipCPU</a>. This turned into a bigger challenge than I had expected. Sure, <a href="https://github.com/YosysHQ/mcy">MCY</a> works with <a href="https://github.com/steveicarus/iverilog">Icarus Verilog</a>, <a href="https://www.veripool.org/verilator/">Verilator</a>, and even (perhaps) some other (not so open) simulators as well. However, when I ran a design under <a href="https://github.com/YosysHQ/mcy">MCY</a>, my simulations tended to find only (roughly) 70% of the mutations. The formal proofs, however, could find 95-98% of them.</p> <p>That’s good, right?</p> <p>Well, not quite.
The problem is that I tend to place all of my formal logic in the same file as the component that would be mutated. In order to keep the mutation engine from mutating the formal properties, I had to move the formal properties out of the file to be mutated and into a separate file. Further, I then had to access the values that were to be assumed or asserted from outside the file under test, using something often known as “dot notation”. While (System)Verilog allows such descriptions natively, there weren’t any open source tools that allowed such external formal property descriptions. (Commercial tools allowed this, just not the open source <a href="https://github.com/YosysHQ/sby">SymbiYosys</a>.) This left me stuck with several unpleasant choices:</p> <ol> <li>I could remove the ability of the <a href="/about/zipcpu.html">ZipCPU</a> (or whatever design) to be formally verified with Open Source tools,</li> <li>I could give up on using <a href="/blog/2018/03/10/induction-exercise.html">induction</a>,</li> <li>I could use <a href="https://github.com/YosysHQ/mcy">MCY</a> with simulation only, or</li> <li>I could choose to not use <a href="https://github.com/YosysHQ/mcy">MCY</a> at all.</li> </ol> <p>This is why I don’t use <a href="https://github.com/YosysHQ/mcy">MCY</a>. It may be a “good” tool, but it’s just not for me.</p> <p>What I did learn, however, was that my <a href="/about/zipcpu.html">ZipCPU</a> test suite was checking the <a href="/about/zipcpu.html">CPU</a>’s functionality nicely–just not the debugging port. Indeed, none of my tests checked the debugging port to the <a href="/about/zipcpu.html">CPU</a> at all. As a result, none of the (simulation-based) mutations of the debugging port were ever caught.</p> <p>Lesson learned? My test suite still wasn’t good enough.
Sure, the <a href="/about/zipcpu.html">CPU</a> might “work” today, but how would I know some change in the future wouldn’t break it?</p> <p>I needed a better way of knowing whether or not my test suite was good enough.</p> <h2 id="coverage-checking">Coverage Checking</h2> <p>Sometime during this process I discovered <a href="https://en.wikipedia.org/wiki/Code_coverage">coverage checking</a>. <a href="https://en.wikipedia.org/wiki/Code_coverage">Coverage checking</a> is a process of automatically watching over all of your simulation-based tests to see which lines get executed and which do not. Depending on the tool, coverage checks can also tell whether particular signals are ever flipped or adjusted during simulation. A good coverage check, therefore, can provide some indication of whether or not all control paths within a design have been exercised, and whether or not all signals have been toggled.</p> <p>Coverage metrics are actually kind of nice in this regard.</p> <p>Sadly, coverage checking isn’t as good as mutation coverage, but … it’s better than nothing.</p> <p>Consider a classic coverage failure: many of my simulations check for AXI <a href="https://en.wikipedia.org/wiki/Back_pressure">backpressure</a>. Such <a href="https://en.wikipedia.org/wiki/Back_pressure">backpressure</a> is generated when either <code class="language-plaintext highlighter-rouge">BVALID &amp;&amp; !BREADY</code> or <code class="language-plaintext highlighter-rouge">RVALID &amp;&amp; !RREADY</code> holds. If your design is to follow the AXI specification, it should be able to handle <a href="https://en.wikipedia.org/wiki/Back_pressure">backpressure</a> properly. That is, if you hold <code class="language-plaintext highlighter-rouge">!BREADY</code> long enough, it should be possible to force <code class="language-plaintext highlighter-rouge">!AWREADY</code> and <code class="language-plaintext highlighter-rouge">!WREADY</code>.
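</p>

<p>To make the mechanism concrete, here’s a deliberately minimal sketch (not any particular IP of mine) of the write side of an AXI-Lite slave that allows one outstanding response at a time. Hold <code class="language-plaintext highlighter-rouge">BREADY</code> low while a response is pending, and <code class="language-plaintext highlighter-rouge">AWREADY</code> and <code class="language-plaintext highlighter-rouge">WREADY</code> drop as a consequence:</p>

```verilog
// Sketch only: how B-channel backpressure propagates backwards to the
// AW/W channels in a single-outstanding-write AXI-Lite slave.
module	axilbp (
		input	wire	S_AXI_ACLK, S_AXI_ARESETN,
		input	wire	S_AXI_AWVALID,
		output	wire	S_AXI_AWREADY,
		input	wire	S_AXI_WVALID,
		output	wire	S_AXI_WREADY,
		output	reg	S_AXI_BVALID,
		input	wire	S_AXI_BREADY
	);

	// Accept a new write only if no response is pending, or if the
	// pending response is being accepted this same cycle.  If the
	// master holds BREADY low, this forces AWREADY and WREADY low.
	assign	S_AXI_AWREADY = S_AXI_AWVALID && S_AXI_WVALID
				&& (!S_AXI_BVALID || S_AXI_BREADY);
	assign	S_AXI_WREADY = S_AXI_AWREADY;

	initial	S_AXI_BVALID = 1'b0;
	always @(posedge S_AXI_ACLK)
	if (!S_AXI_ARESETN)
		S_AXI_BVALID <= 1'b0;
	else if (S_AXI_AWREADY)
		S_AXI_BVALID <= 1'b1;	// A write was accepted: respond
	else if (S_AXI_BREADY)
		S_AXI_BVALID <= 1'b0;	// The response was accepted
endmodule
```

<p>A simulation that never drops <code class="language-plaintext highlighter-rouge">BREADY</code> will never exercise that <code class="language-plaintext highlighter-rouge">!S_AXI_BVALID || S_AXI_BREADY</code> term–which is exactly the kind of coverage hole under discussion.</p>

<p>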
Likewise, it should be possible to hold <code class="language-plaintext highlighter-rouge">RREADY</code> low long enough that <code class="language-plaintext highlighter-rouge">ARREADY</code> gets held low. A well verified, bug-free design should be able to deal with these conditions.</p> <p>However, a “good” design should never create any significant <a href="https://en.wikipedia.org/wiki/Back_pressure">backpressure</a>. Hence, if you build a simulation environment from “good” working components, you aren’t likely to see much <a href="https://en.wikipedia.org/wiki/Back_pressure">backpressure</a>. How then should a component’s <a href="https://en.wikipedia.org/wiki/Back_pressure">backpressure</a> capability be tested?</p> <p>My current solution here is to test <a href="https://en.wikipedia.org/wiki/Back_pressure">backpressure</a> via formal methods, with the unfortunate consequence that some conditions will never get tested under simulation. The result is that I’ll never get to 100% coverage with this approach.</p> <p>A second problem with coverage regards the unused signals. For example, AXI-Lite has two signals, <code class="language-plaintext highlighter-rouge">AWPROT</code> and <code class="language-plaintext highlighter-rouge">ARPROT</code>, that are rarely used by any of my designs. However, they are official AXI-Lite (and AXI) signals. As a result, <a href="/zipcpu/2017/10/05/autofpga-intro.html">AutoFPGA</a> will always try to connect them to an AXI-Lite (or AXI) port, yet none of my designs use these. This leads to another set of exceptions that needs to be made when measuring coverage.</p> <p>So, coverage metrics aren’t perfect. Still, they can help me find what parts of the design are (and are not) being tested well. This can then help feed into better (and more complete) test design.</p> <p>That’s the good news. Now let’s talk about some of the not so good parts.</p> <p>When learning formal verification, I spent some time formally verifying Xilinx IP. 
After finding several bugs, I spoke to a Xilinx executive regarding how they verified their IP. Did they use formal methods? No. Did they use their own AXI Verification IP? No. Yet, they were very proud of how well they had verified their IP. Specifically, this executive bragged about how good their coverage metrics were, and the number of test points checked for each IP.</p> <p>Hmm.</p> <p>So, let me get this straight: Xilinx IP gets good coverage metrics, and hits a large number of test points, yet still has bugs within it that I can find via formal methods?</p> <p>Okay, so … how severe are these bugs? In one case, the bugs would totally break the AXI bus and bring the system containing the IP down to a screeching halt–if the bug were ever tripped. For example, if the system requested both a read burst and a write burst at the same time, one particular slave might complete the read burst using the length of the write burst–or vice versa. (It’s been a while, so I’d have to look up the details to be exact regarding them.) In another case dealing with a network controller, it was possible to receive a network packet, capture that packet correctly, and then return a corrupted packet simply because the <a href="/blog/2021/08/28/axi-rules.html">AXI bus handshakes</a> weren’t properly implemented. To this day, nearly five years later, these bugs have not been fixed.</p> <p>Put simply, if it is possible for an IP to lock up your system completely, then that IP shouldn’t be trusted until the bug is fixed.</p> <p>How then did Xilinx manage to convince themselves that their IP was high quality? By “good” coverage metrics.</p> <p>Lesson learned? <a href="https://en.wikipedia.org/wiki/Code_coverage">Coverage checking</a> is a good thing, and it can reveal holes in any simulation-based verification suite. It’s just not good enough on its own to find all of what you are missing.</p> <p>My conclusion?
Formal verification, followed by a simulation test suite that evaluates coverage statistics is something to pay attention to, but not the end all be-all. One tool isn’t enough. Many tools are required.</p> <h2 id="self-checking-testbenches">Self-Checking Testbenches</h2> <p>I then got involved with ASIC design.</p> <p><a href="/blog/2017/10/13/fpga-v-asic.html">ASIC design differs from FPGA design in a couple of ways</a>. Chief among them is the fact that the ASIC design must work the first time. There’s little to no room for error.</p> <table align="center" style="float: right"><caption>Fig 1. A typical verification environment</caption><tr><td><img src="/img/vjourney/verilogtb.svg" width="320" /></td></tr></table> <p>When working with my first ASIC design, I was introduced to a more formalized simulation flow. Let me explain it this way, looking at Fig. 1. Designs tend to have two interfaces: a bus interface, together with a device I/O interface. A test script can then be used to drive some form of bus functional model, which will then control the design under test via its bus interface. A device model would then mimic the device the design was intended to talk to. When done well, the test script would evaluate the values returned by the design–after interacting with the device, and declare “success” or “failure”.</p> <p>Here’s the key to this setup: I can run many different tests from this starting point by simply changing the test script and nothing else.</p> <p>For example, let’s imagine an external memory controller. A “good” memory controller should be able to accept any bus request, convert it into I/O wires to interact with the external memory, and then return a response from the memory. Hence, it should be possible to first write to the external memory and then (later) read from the same external memory. Whatever is then read should match what was written previously. 
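</p>

<p>As a sketch, a self-checking script for that minimal contract might look like the following. (The <code class="language-plaintext highlighter-rouge">wb_write</code> and <code class="language-plaintext highlighter-rouge">wb_read</code> tasks are hypothetical bus functional model calls, standing in for whatever BFM a given environment provides.)</p>

```verilog
// Sketch of a self-checking test: write, read back, compare, and
// produce an unambiguous PASS/FAIL result.
task	test_memory_contract;
	reg	[31:0]	readback;
begin
	wb_write(32'h0000_0100, 32'hdead_beef);	// Write via the bus model
	wb_read(32'h0000_0100, readback);	// ... then read it back

	if (readback !== 32'hdead_beef)
	begin
		$display("FAIL: read %08x, expected deadbeef", readback);
		$finish;
	end

	$display("PASS");
end endtask
```

<p>The point isn’t the (trivial) logic; it’s that the result requires no human staring at a waveform, so the same check can run unattended across every configuration of the test bench.</p>

<p>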
This is the minimum test case–measuring the “contract” with the memory.</p> <p>Other test cases might evaluate this contract across all of the modes the memory supports. Still other cases might attempt to trigger all of the faults the design is supposed to be able to handle. The only difference between these many test cases would then be their test scripts. Again, you can measure whether or not the test cases are sufficient using coverage measures.</p> <p>The key here is that all of the test cases must produce either a “pass” or “fail” result. That is, they must be self-checking. Now, using self-checking test cases, I can verify (via simulation) something like the <a href="/about/zipcpu.html">ZipCPU</a> across all of its instructions, in SMP and single CPU environments, using the DMA (or not), and so forth. Indeed, the <a href="/about/zipcpu.html">ZipCPU</a>’s test environment takes this approach one step further, by changing not just the test script (in this case a <a href="/about/zipcpu.html">ZipCPU</a> software program) but also the configuration of the test environment. This allows me to make sure the <a href="/about/zipcpu.html">ZipCPU</a> will continue to work in 32b, 64b, or even wider bus environments in a single test suite.</p> <p>Yes, this was a problem I was having before I adopted this methodology: I’d test the <a href="/about/zipcpu.html">ZipCPU</a> with a 32b bus, and then deploy the <a href="/about/zipcpu.html">ZipCPU</a> to a board whose memory was 64b wide or wider. The <a href="https://github.com/ZipCPU/kimos">Kimos project</a>, for example, has a 512b bus. Now that I run test cases on multiple bus widths, I have the confidence that I can easily adjust the <a href="/about/zipcpu.html">ZipCPU</a> from one bus width to another.</p> <p>This is as far as I’ve now come in my verification journey. I now use formal tests, simulation tests, coverage checking, and a self-checking test suite on new design components. Is this perfect?
No, but at least it’s more rigorous and repeatable than where I started from.</p> <h2 id="next-steps-softwarehardware-interaction">Next Steps: Software/Hardware interaction</h2> <p>The testing regimen discussed above continues to have a very large and significant hole: I can’t test software drivers very well.</p> <p>Consider as an example my <a href="https://github.com/ZipCPU/sdspi">SD card controller</a>. The <a href="https://github.com/ZipCPU/sdspi">repository</a> actually contains three controllers: <a href="https://github.com/ZipCPU/sdspi/blob/master/rtl/sdspi.v">one for interacting with SD cards via their SPI interface</a>, <a href="https://github.com/ZipCPU/sdspi/blob/master/rtl/sdio_top.v">one via the SDIO interface</a>, and a third for use with eMMC cards (<a href="https://github.com/ZipCPU/sdspi/blob/master/rtl/sdio_top.v">using the SDIO interface</a>). The <a href="https://github.com/ZipCPU/sdspi">repository</a> contains formal proofs for all leaf modules, and two types of SD card models–a <a href="https://github.com/ZipCPU/sdspi/blob/master/bench/cpp/sdspi.cpp">C++ model for SPI</a> and all-Verilog models for <a href="https://github.com/ZipCPU/sdspi/blob/master/bench/verilog/mdl_sdio.v">SDIO</a> and <a href="https://github.com/ZipCPU/sdspi/blob/master/bench/verilog/mdl_emmc.v">eMMC</a>.</p> <p>This controller IP also contains a set of <a href="https://github.com/ZipCPU/sdspi/tree/master/sw">software drivers</a> for use when working with SD cards. Ideally, these drivers would be tested alongside the <a href="https://github.com/ZipCPU/sdspi">SD card controller(s)</a>, so that the two could be verified together.</p> <p>Recently, for example, I added a <a href="https://github.com/ZipCPU/sdspi/blob/master/rtl/sddma.v">DMA capability</a> to the <a href="/zipcpu/2017/11/07/wb-formal.html">Wishbone</a> version of <a href="https://github.com/ZipCPU/sdspi/blob/master/rtl/sdio.v">the SDIO (and eMMC) controller(s)</a>.
This (new) <a href="https://github.com/ZipCPU/sdspi/blob/master/rtl/sddma.v">DMA capability</a> then necessitated quite a few changes to the <a href="https://github.com/ZipCPU/sdspi/tree/master/sw">control software</a>, so that the software could take advantage of it. With no tests, how well do you think <a href="https://github.com/ZipCPU/sdspi/blob/master/sw/sdiodrv.c">this software</a> worked when I first tested it in hardware?</p> <p>It didn’t.</p> <p>So, for now, the <a href="https://github.com/ZipCPU/sdspi/tree/master/sw">software directory</a> simply holds the software I will copy to other designs and test in actual hardware.</p> <table align="center" style="float: right"><caption>Fig 2. Software driven test bench</caption><tr><td><img src="/img/cpusim/softwaretb.svg" width="320" /></td></tr></table> <p>The problem is, testing the <a href="https://github.com/ZipCPU/sdspi/tree/master/sw">software directory</a> requires many design components beyond just the <a href="https://github.com/ZipCPU/sdspi">SD card controllers</a> that would be under test. It requires memory, a console port, a CPU, and the CPU’s tool chain–all in addition to the <a href="https://github.com/ZipCPU/sdspi">design</a> under test. These extra components aren’t a part of the <a href="https://github.com/ZipCPU/sdspi">SD controller repository</a>, nor perhaps should they be. How then should these <a href="https://github.com/ZipCPU/sdspi/tree/master/sw">software drivers</a> be tested?</p> <p>Necessity breeds invention, so I’m sure I’ll eventually solve this problem. This is simply as far as I’ve gotten so far.</p> <h2 id="automated-testing">Automated testing</h2> <p>At any rate, I submitted this <a href="https://github.com/ZipCPU/sdspi">repository</a> to an automated continuous integration facility that the team I was working with was evaluating.
The utility leans heavily on the existence of a variety of <code class="language-plaintext highlighter-rouge">make test</code> capabilities within the <a href="https://github.com/ZipCPU/sdspi">repository</a>, and so the <a href="https://github.com/ZipCPU/sdspi">SD Card repository</a> was a good fit for testing. Along the way, I needed some help from the test facility engineer to get <a href="https://github.com/YosysHQ/sby">SymbiYosys</a>, <a href="https://github.com/steveicarus/iverilog">IVerilog</a> and <a href="https://www.veripool.org/verilator/">Verilator</a> capabilities installed. His response?</p> <blockquote> <p>It’s literally the first time I get to know a good hardware project needs such many verifications and testings! There’s even a real SD card simulation model and RW test…</p> </blockquote> <p>Yeah. Actually, there’s three SD card models–as discussed above. It’s been a long road to get to this point, and I’ve certainly learned a lot along the way.</p> <hr /><p><em>Watch therefore: for ye know not what hour your Lord doth come. (Matt 24:42)</em> Sat, 06 Jul 2024 00:00:00 -0400 https://zipcpu.com/formal/2024/07/06/verifjourney.html https://zipcpu.com/formal/2024/07/06/verifjourney.html formal Debugging video from across the ocean <p>I’ve come across two approaches to video synchronization. The first, used by a lot of the Xilinx IP I’ve come across, is to hold the video pipeline in reset until everything is ready and then release the resets (in the right and proper order) to get the design started. If something goes wrong, however, there’s no room for recovery. The second approach is the approach I like to use, which is to <a href="/video/2022/03/14/axis-video.html">build video components that are inherently “stable”</a>: 1) if they ever lose synchronization, they will naturally work their way back into synchronization, and 2) once synchronized they will not get out of sync.</p> <p>At least that’s the goal. 
It’s a great goal, too–when it works.</p> <p>Today’s story is about what happens when a “robust” video display isn’t.</p> <h2 id="system-overview">System Overview</h2> <p>Let’s start at the top level: I’m working on building a SONAR device.</p> <p>This device will be placed in the water, and it will sample acoustic data. All of the electronics will be contained within a pressure chamber, with the only interface to the outside world being a single cable providing both Ethernet and power.</p> <p>Here’s the picture I used to capture this idea when <a href="/blog/2022/08/24/protocol-design.html">we discussed the network protocols that would be required to debug this device</a>.</p> <table align="center" style="float: none"><caption>Fig 1. Controlling an Underwater FPGA</caption><tr><td><img src="/img/netbus/sysdesign.svg" alt="" width="780" /></td></tr></table> <p>This “wet” device will then connect to a “dry” device (kept on land, via Ethernet) where the sampled data can then be read, stored and processed.</p> <p>Now into today’s detail: while my customer has provided no requirement for real-time processing, there’s arguably a need for it during development testing. Even if the final delivery doesn’t require real-time processing, the lab work leading up to that delivery does. That is, I’d like some real-time displays I can read at a glance, so that a look or two at my lab setup tells me things are working.</p> <p>So, what do we have available to us to get us closer?</p> <h2 id="display-architecture">Display Architecture</h2> <p>Some time ago, I built several RTL “display” modules to use for this lab-testing purpose. In general, these modules take an <a href="/blog/2022/02/23/axis-abort.html">AXI stream of incoming data</a>, and they produce an <a href="/video/2022/03/14/axis-video.html">AXI video stream for display</a>.
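</p>

<p>All of these modules share the same rough shape: a sample stream in, a pixel stream out. As an illustrative port list only (the names and widths below are invented for this sketch, not taken from any one of the modules):</p>

```verilog
// The common shape of these display modules: an AXI stream of samples
// in, an AXI video stream of pixels out.  Details are illustrative.
module	viddisplay (
		input	wire		i_clk, i_reset,
		// Incoming sample data (e.g. acoustic samples)
		input	wire		S_AXIS_TVALID,
		output	wire		S_AXIS_TREADY,
		input	wire	[15:0]	S_AXIS_TDATA,
		// Outgoing AXI video stream
		output	wire		M_VID_TVALID,
		input	wire		M_VID_TREADY,
		output	wire	[23:0]	M_VID_TDATA,	// One pixel
		output	wire		M_VID_TLAST,	// End of line
		output	wire		M_VID_TUSER	// Start of frame
	);
	// ... display generation logic would go here ...
endmodule
```

<p>The <code class="language-plaintext highlighter-rouge">TLAST</code>/<code class="language-plaintext highlighter-rouge">TUSER</code> markers follow the common AXI-Stream video convention of using them for end-of-line and start-of-frame respectively.</p>

<p>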
At present, there are only five of these graphics display modules:</p> <ul> <li> <p><a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_histogram.v">A histogram display</a></p> <p><a href="/dsp/2019/12/21/histogram.html">Histograms are exceptionally useful for diagnosing any A/D collection issues</a>, so having a live <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_histogram.v">histogram display</a> to provide insight into the sampled data distribution just makes sense.</p> <p>However, <a href="/dsp/2019/12/21/histogram.html">histogram</a> <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_histogram.v">displays</a> need a tremendous dynamic range. How do you handle that in hardware? Yeah, that was part of the challenge when building this display. It involved figuring out how to build multiplies and divides without doing either multiplication or division. A fun project, though.</p> </li> <li> <p><a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_trace.v">A trace module</a></p> <p>By “trace”, I mean something to show the time series, such as a plot of voltage against time. My big challenge with this display so far has been the reality that the SONAR A/D chips can produce more data than the eye can quickly process.</p> <p>Now that we’ve been through a test or two with the hardware, I have a better idea of what would be valuable here. As a result, I’m likely going to take the absolute value of voltages across a significant fraction of a second, and then use that approach to display a couple of seconds’ worth of data on the screen.
Thankfully, my <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_trace.v">trace display module</a> is quite flexible, and should be able to display anything you give to it by way of an AXI Stream input.</p> </li> <li> <p><a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_waterfall.v">A falling raster</a></p> <p>The very first time my wife came to a family day at the office, way back in the 1995-96 time frame or so, the office had a display set up with a microphone and a sliding spectral raster. I was in awe! You could speak, and see what your voice “looked” like spectrally over time. You could hit the table, whistle, bark, whatever, and every sound you made would look different.</p> <p>I’ve since <a href="https://github.com/ZipCPU/fftdemo">built this kind of capability</a> many times over, and even <a href="/dsp/2020/11/21/spectrogram.html">studied the best ways to do it from a mathematical standpoint</a>.</p> <p>In the SONAR world, you’ll find this sort of thing really helps you visualize what’s going on in your data streams–what sounds are your sensors picking up, what frequencies are they at, etc. A good raster will let you “see” motors in the water–all very valuable.</p> </li> <li> <p>A spectrogram, via the same <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_trace.v">trace module</a></p> <p>This primarily involves plotting the absolute values of the data coming out of an <a href="/dsp/2018/10/02/fft.html">FFT</a>, applied to the incoming data. 
Thankfully, the <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_trace.v">trace module</a> is robust enough to handle this kind of input as well.</p> </li> <li> <p><a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_split.v">A split screen display</a>, which can place both an <a href="/dsp/2018/10/02/fft.html">FFT</a> <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_trace.v">trace</a> and a falling raster on the same screen.</p> </li> </ul> <p>We’ll come back to the split screen display in a bit. In general, however, the processing components used within it look (roughly) like Fig. 2 below.</p> <table align="center" style="float: none"><caption>Fig 2. Split display video processing pipeline</caption><tr><td><img src="/img/qoi-debug/split-pipeline.svg" alt="" width="780" /></td></tr></table> <p>Making this happen required some other behind-the-scenes components as well, to include:</p> <ul> <li> <p><a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_empty.v">An empty video generator</a>–to generate an <a href="/video/2022/03/14/axis-video.html">AXI video stream</a> from scratch. The video out of this device is a constant color (typically black). This then forms a “canvas” (via the <a href="/video/2022/03/14/axis-video.html">AXI video stream protocol</a>) that other things can be overlaid on top of.</p> <p>This generator leaves <code class="language-plaintext highlighter-rouge">TVALID</code> high, for reasons we’ve <a href="/video/2022/03/14/axis-video.html">discussed before</a>, and that we’ll get to again in a moment.</p> </li> <li> <p><a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_mux.v">A video multiplexer</a>–to select between one of the various “displays”, and send only one to the outgoing video display.</p> <p>One of the things newcomers to the hardware world often don’t realize is that the hardware used for one display typically can’t be reused when you switch display types.
This is sort of like an ALU–the CPU will include support for ADD, OR, XOR, and AND instructions, even if only one of the results is selected on each clock cycle. The same is true here. Each of the various displays listed above is built in hardware, occupies a separate area of the FPGA (whether used or not), and so something is needed to select between the various outputs to choose which we’d like.</p> <p>It did take some thought to figure out how to maintain video synchronization while multiplexing multiple video streams together.</p> </li> <li> <p><a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v">A video overlay module</a>–to merge two displays together, creating a result that looks like it has multiple independent “windows” all displaying real time data.</p> </li> </ul> <p>I wrote these modules years ago. They’ve all worked beautifully–in simulation. So far, these have only been designed to be engineering displays, and not necessarily great finished products. Their biggest design problem? None of them display any units. Still, they promise a valuable debugging capability–provided they work.</p> <p>Herein lies the rub. Although these display modules have worked nicely in simulation, and although many have been formally verified, for some reason I’ve had trouble with these modules when placed into actual hardware.</p> <p>Debugging this video chain is the topic of today’s discussion.</p> <h2 id="axi-video-rules">AXI Video Rules</h2> <p>For some more background, each of these modules produces an AXI video stream. In general, these components would take data input, and produce a video stream as output–much like Fig. 3 below.</p> <table align="center" style="float: right"><caption>Fig 3. General AXI Stream Video component</caption><tr><td><img src="/img/qoi-debug/gendisplay.svg" alt="" width="420" /></td></tr></table> <p>In this figure, acoustic data arrives on the left, and video data comes out on the right.
Both use AXI streams.</p> <p>The <a href="/video/2022/03/14/axis-video.html">AXI stream protocol, however, isn’t necessarily a good fit for video processing</a>. You really have to be aware of who drives the pixel clock, and where the blanking intervals in your design are handled.</p> <ul> <li> <p>Sink</p> <p>If video comes into your device, the pixel clock is driven by that video source. The source will also determine when blanking intervals need to take place and how long they should be. This will be controlled via the video’s <code class="language-plaintext highlighter-rouge">VALID</code> signal.</p> </li> <li> <p>Source</p> <p>Otherwise, if you are not consuming incoming video but producing video out, then the pixel clock and blanking intervals will be driven by the video controller. This will be controlled by the display controller’s <code class="language-plaintext highlighter-rouge">READY</code> signal.</p> </li> </ul> <p>In our case, these intermediate display modules also need to be aware that there’s often <em>no</em> buffering for the input. If you drop the <code class="language-plaintext highlighter-rouge">SRC_READY</code> line, data will be lost. Acoustic sensor data is coming at the design whether you are ready for it or not. Likewise, the <a href="/blog/2022/02/23/axis-abort.html">video output data needs to get to the display module, and there’s no room in the HDMI standard for <code class="language-plaintext highlighter-rouge">VALID</code> dropping when a pixel needs to be produced</a>.</p> <p>Put simply, there are two constraints on these controllers: 1) the source can’t handle <code class="language-plaintext highlighter-rouge">VALID &amp;&amp; !READY</code>, and 2) the display controller at the end of the video processing chain can’t handle <code class="language-plaintext highlighter-rouge">READY &amp;&amp; !VALID</code>.
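</p>

<p>The <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_empty.v">empty video generator</a> mentioned earlier is the degenerate example of respecting that second constraint. A minimal sketch of the idea (not the actual module) might be:</p>

```verilog
// Sketch: a video "source" that can never present READY && !VALID to
// the downstream display, because its VALID is constant high.
// Frame/line markers (TLAST/TUSER) are omitted for brevity.
module	vidempty #(
		parameter [23:0] BACKGROUND = 24'h00_0000 // Black canvas
	) (
		output	wire		M_VID_TVALID,
		input	wire		M_VID_TREADY,	// Throttles the stream
		output	wire	[23:0]	M_VID_TDATA
	);

	// The display controller at the end of the chain rate-limits this
	// stream using TREADY alone; the source itself never stalls it.
	assign	M_VID_TVALID = 1'b1;
	assign	M_VID_TDATA  = BACKGROUND;
endmodule
```

<p>Everything downstream can then overlay real content onto this always-valid canvas without ever starving the display controller of pixels.</p>

<p>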
Any IP in the middle needs to do what it can to avoid these conditions.</p> <p>This leads to some self-imposed criteria that I’ve “added” to the AXI stream protocol. Here are my extra rules for processing AXI video stream data:</p> <ol> <li> <p>All video processing components should keep READY high.</p> <p>Specifically, nothing <em>within</em> the module should ever drop the ready signal. Only the downstream display driver should ever drop READY by more than a cycle or two between lines. This drop in READY then needs to propagate all the way through any video processing chain.</p> <p>My <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_mux.v">video multiplexer</a> module is an example of an exception to this rule: It drops READY on all of the video streams that aren’t currently active. By waiting until the end of a frame before adjusting/swapping which source is active, it can keep all sources synchronized with the output. This component will fail, however, if one of those incoming streams is a true video source.</p> </li> <li> <p>Keep VALID high as much as possible.</p> <p>Only an upstream video source, such as a camera, should ever drop VALID by more than a cycle or two between lines. As with READY, this drop in VALID should then propagate through the video processing chain.</p> <p>In my case, there’s no such camera in this design, and so I’m never starting from a live video source. However, for reuse purposes in case I ever wish to merge any of these components with a live feed, I try to keep VALID high as much as possible.</p> </li> <li> <p>Expect the environment to do something crazy. Deal with it. If your algorithm depends on the image size, and that size changes, deal with it.</p> <p>For example, if you are doing an overlay, and the overlay position changes, you’ll need to move it.
If a video being overlaid isn’t VALID by the time it’s needed, then you’ll have to disable the overlay operation, wait for the overlay video source to get to the end of its frame, and then force it to wait until the time required for its first pixel comes around again.</p> </li> <li> <p>If your algorithm has a memory dependency, then there is always the possibility that the memory cannot keep up with the video’s requirements. Prepare for this. Expect it. Plan for it. Know how to deal with it.</p> <p>For example, if you are reading from a frame buffer in memory to generate a video image, and the memory doesn’t respond in time, then, again, you have to deal with it. Your algorithm should do something “smart”, fail gracefully, and then be able to resynchronize later. Perhaps something else, such as a disk-drive DMA, was using memory and kept the frame buffer from meeting its real-time requirements. Perhaps it will be gone later. Deal with it, and recover.</p> </li> </ol> <p>In my case, I was building a falling raster. I had two real-time requirements.</p> <p>First, data comes from the SONAR device at some incoming rate. There’s no room to slow it down. You either handle it in time, or you don’t. In my case, SONAR data is slow, so this isn’t really an issue.</p> <table align="center" style="padding: 25px; float: left"><caption>Fig 4. AXI Stream Video "Rules"</caption><tr><td><img src="/img/qoi-debug/vidrules.svg" alt="" width="420" /></td></tr></table> <p>This data then goes through an <a href="/dsp/2018/10/02/fft.html">FFT</a>, and possibly a logarithm or an averager, before coming to the first half of the raster. This component then writes data to memory, one <a href="/dsp/2018/10/02/fft.html">FFT</a> line at a time. (See Fig. 2 above.) If the memory is too slow here, data may be catastrophically dropped. This is bad, but rare.</p> <p>Second, the waterfall display data must be produced at a known rate.
VALID must be held high as much as possible so that the downstream display driver at the end of the processing chain can rate-limit the pipeline as necessary. That means the waterfall must be read from memory as often as the downstream display driver needs it. If the memory can’t keep up, the display goes on without it. You can’t allow these to get out of sync, but if they do, they have to be able to resynchronize automatically.</p> <p>Those are my rules for AXI video. I’ve also summarized them in Fig. 4.</p> <h2 id="debugging-challenge">Debugging Challenge</h2> <p>Now let’s return to my SONAR project, where one of the big challenges was that the SONAR device wasn’t on my desktop. It’s being developed on the other side of the Atlantic from where I’m at. It has no JTAG connection to Vivado. There’s no ILA, although my <a href="https://github.com/ZipCPU/wbscope">Wishbone scope</a> works fine. The bottom line here, though, is that I can’t just glance at the device (like I’d like) to see if the display is working.</p> <p>I’ve therefore spent countless hours using both formal methods and video simulations to verify that each of these display components works. Each of these displays has passed a lint check, a formal check, and a simulation check. Therefore, they should all be working … right?</p> <p>Except that when I tried to deploy these “working” modules to the hardware … they didn’t work.</p> <p>The classic example of “not working” was the split screen spectrum/waterfall display. This screen was supposed to display the current spectrum of the input data on top, with a waterfall synchronized to the same data falling down beneath it. It’s a nice effect–when it works. However, we had problems where the two would get out of sync.
Either the waterfall would show energy in locations separate from the spectral energy, or the waterfall could be seen “jumping” horizontally across the screen–just like the old TVs would do when they lost sync.</p> <p>This never happened in any of my simulations. Never. Not even once.</p> <p>Sadly, my integrated SONAR simulation environment isn’t perfect. It has some challenges. Of course, there’s the obvious challenge that my simulation isn’t connected to “real” data. Instead, I tend to drive it with various sine waves. These tend to be good for testing. I suppose I could fix this somewhat by replaying collected data, but that’s only on my “To-Do” list for now. Then there’s the challenge that <a href="https://github.com/ZipCPU/zbasic/blob/e7b39a56ee515d1cabe8427f30c7add0592bfab1/sim/verilated/memsim.cpp">my memory simulation model</a> doesn’t typically match Xilinx’s MIG DDR3 performance. (No, I’m not simulating the entire DDR3 memory–although perhaps I should.) Finally, I can only simulate about 5-15 frames of video data. It just doesn’t take very long before the <a href="/blog/2017/07/31/vcd.html">VCD trace file</a> exceeds 100GB, and then <a href="https://gtkwave.sourceforge.net/">my tools</a> struggle.</p> <p>Bottom line: <a href="/blog/2018/08/04/sim-mismatch.html">works in simulation, fails hard in hardware</a>.</p> <p>Now, how to figure this one out?</p> <h2 id="first-step-formal-verification">First Step: Formal verification</h2> <p>I know I said everything was formally verified. That wasn’t quite true initially. Initially, the <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v">overlay module</a> wasn’t formally verified.</p> <p>In general, I like to develop with formal methods as my guide. Barring that, if I ever run into problems, then formal verification is my first approach to debugging. I find problems faster when using the formal tools. They tend to condense the debugging process considerably.
Further, the formal tools aren’t constrained by the requirement that the simulation environment needs to make sense. As a result, I tend to check my designs against a much richer environment when checking them formally than I would via simulation.</p> <p>In this case, I was tied up with other problems, so I had someone else do the formal verification for me. He was somewhat new to formal verification, and this particular module was quite the challenge–there were just so many cases that had to be considered:</p> <p>We can start with the typical design, where the overlaid image lands nicely within the main image window.</p> <table align="center" style="float: right"><caption>Fig 5. Overlaid window in video</caption><tr><td><img src="/img/qoi-debug/midoverlay.svg" alt="" width="420" /></td></tr></table> <p>This is what I typically think of when I set up an overlay of some type.</p> <p>This isn’t as simple as it sounds, though, since the IP needs to know when the overlay window has finished its line, and then it shouldn’t start the next overlay line until the main window reaches the overlay window’s left edge on that line.</p> <p>What happens, though, when the overlay window scrolls off to one side and wraps back onto the main window?</p> <table align="center" style="padding: 25px; float: left"><caption>Fig 6. Clipping the Overlaid video</caption><tr><td><img src="/img/qoi-debug/overlay.svg" alt="" width="420" /></td></tr></table> <p>It might scroll off the bottom as well.</p> <p>In both cases, the overlay video should be clipped. This is not something my simulation environment ever really checked, but it is something we had no end of challenges with when checking via formal tools.</p> <p>These clipped examples are okay. There’s nothing wrong with them–they just never look right with only a couple clock cycles of trace.</p> <table align="center" style="float: right"><caption>Fig 7.
Overlay not ready</caption><tr><td><img src="/img/qoi-debug/overlay-block.svg" alt="" width="420" /></td></tr></table> <p>There’s also the question of what happens when the overlay window isn’t ready when the main window is, as illustrated in Fig. 7 on the right.</p> <p>Remember our video rules. Together, these rules require that VALID and READY be propagated through the module–but never dropped internal to the module. That means there’s no time to wait. If the overlay data isn’t ready when its turn comes, we can’t wait–the hardware display would lose sync. The overlay has to be ready, or the image will be corrupted.</p> <p>So, how to deal with situations like this?</p> <p>Yeah.</p> <p>Yes, my helper learned a lot during this process. Eventually, we got to the point of pictorially drawing out what was going on each time the formal engine presented us with another verification failure, just so we could follow along. Yes, our drawings started looking like Fig. 5 or 6 above.</p> <p>Yes, formal verification is where I turn when things don’t work. Typically there’s some hardware path I’m not expecting, and formal tends to find all such paths to make sure the logic considers them properly.</p> <p>In this case, it wasn’t enough. Even though I formally verified all of these components, the displays still weren’t working. Unfortunately, in order to know this, I had to ask an engineer in a European time zone to connect a monitor and … he told me it wasn’t working. Sure, he was more helpful than that: he provided me pictures of the failures. (They were nasty. These were ugly-looking failures.)
Unfortunately, these told me nothing of what needed to be adjusted, and it was also costly in terms of team effort–I would need to arrange for his availability and (potentially) his cost, all for something that wasn’t (yet) a customer requirement.</p> <p>I needed a better approach.</p> <p>What I needed was a way to “see” what was going on, without being there. I needed a digital method of screen capture.</p> <p>Building something like this, however, is quite the challenge: the waterfall displays all use my memory bandwidth–potentially even a significant amount of it. Debugging meant that I was going to need a means of capturing the screen headed to the display that wouldn’t (significantly) impact my memory bandwidth–otherwise my test infrastructure (i.e. any debugging screen capture) would impact what I was trying to test. That might lead to chasing down phantom bugs, or believing things were still broken even after they’d been fixed.</p> <p>This left me at an impasse for some time–knowing there were bugs in the video, but unable to do anything about them.</p> <h2 id="enter-qoi-compression">Enter QOI Compression</h2> <p>Some time ago, I remember reading about <a href="https://qoiformat.org">QOI compression</a>. It captured my attention as a fun underdog story.</p> <p>Yes, I’d implemented my own <a href="https://en.wikipedia.org/wiki/GIF">GIF</a> compression/decompression in times past. This was back when I was still focused on software, and thus before I started doing any hardware design. I’d even looked up how to compress images with <a href="https://en.wikipedia.org/wiki/PNG">PNG</a> and how <a href="https://en.wikipedia.org/wiki/Bzip2">BZip2</a> could compress files. Frankly, over the course of 30 years working in this industry, compression is kind of hard to avoid.
That said, none of these compression methods is really suitable for FPGA work.</p> <p><a href="https://qoiformat.org">QOI</a> is different.</p> <p><a href="https://qoiformat.org">QOI</a> is <em>much</em> simpler than <a href="https://en.wikipedia.org/wiki/GIF">GIF</a>, <a href="https://en.wikipedia.org/wiki/PNG">PNG</a>, or <a href="https://en.wikipedia.org/wiki/Bzip2">BZip2</a>. <em>Much</em> simpler. It’s so simple, it can be implemented in hardware without too many challenges. It’s so simple, it can be implemented in 700 Xilinx 6-LUTs. Not only that, it claims better performance than <a href="https://en.wikipedia.org/wiki/PNG">PNG</a> across <a href="https://qoiformat.org/benchmark/">many (not all) benchmarks</a>.</p> <p>Yeah, now I’m interested.</p> <p>With a little bit of work, I was able to implement a <a href="https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_compress.v">QOI compression module</a>. A <a href="https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_encoder.v">small wrapper</a> could encode and attach a small “file” header and trailer onto the compressed stream. This could then be followed by a <a href="https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_recorder.v">QOI image capture module</a> which I could then use to capture a series of subsequent video frames.</p> <p>This led to a debugging plan that was starting to take shape. You can see how this plan would work in Fig. 8 below.</p> <table align="center" style="float: none"><caption>Fig 8. Video debug plan using QOI compression</caption><tr><td><img src="/img/qoi-debug/qoiplan.svg" alt="" width="780" /></td></tr></table> <p>If all went well, video data would be siphoned off from between the video multiplexer and the display driver generating the HDMI output. This video would be (nominally) at around (<code class="language-plaintext highlighter-rouge">800*600*3*60</code>) 82MB/s. 
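</p>

<p>To give a feel for why <a href="https://qoiformat.org">QOI</a> maps so nicely onto hardware, here’s a toy Python model of its core chunk types. I’ve kept only the RUN, INDEX, DIFF, and RGB chunks–the real format also has a LUMA chunk, alpha handling, and the file header and trailer–so treat this as my own sketch of the idea, not the RTL:</p>

```python
# Toy model of QOI's core chunk types (RUN, INDEX, DIFF, RGB only).
# My own sketch, for illustration -- NOT the RTL from the qoiimg repo.
# Raw 800x600 @ 60Hz RGB is 800*600*3*60 = 86.4e6 bytes/s (~82 MiB/s).

def qoi_hash(px):
    r, g, b, a = px
    return (r * 3 + g * 5 + b * 7 + a * 11) % 64   # QOI's index hash

def encode(pixels):        # pixels: list of (r, g, b); alpha fixed at 255
    out, index = [], [(0, 0, 0, 0)] * 64
    prev, run = (0, 0, 0, 255), 0
    for (r, g, b) in pixels:
        px = (r, g, b, 255)
        if px == prev:                     # QOI_OP_RUN: repeat previous pixel
            run += 1
            if run == 62:                  # a run chunk holds at most 62 pixels
                out.append(0xC0 | (run - 1)); run = 0
            continue
        if run:
            out.append(0xC0 | (run - 1)); run = 0
        h = qoi_hash(px)
        if index[h] == px:                 # QOI_OP_INDEX: recall a recent color
            out.append(h)
        else:
            index[h] = px
            dr = (r - prev[0] + 2) & 0xFF  # biased, wrapping channel deltas
            dg = (g - prev[1] + 2) & 0xFF
            db = (b - prev[2] + 2) & 0xFF
            if dr < 4 and dg < 4 and db < 4:          # QOI_OP_DIFF: -2..+1
                out.append(0x40 | (dr << 4) | (dg << 2) | db)
            else:                                     # QOI_OP_RGB: literal
                out += [0xFE, r, g, b]
        prev = px
    if run:
        out.append(0xC0 | (run - 1))
    return bytes(out)

def decode(data, npixels):
    out, index = [], [(0, 0, 0, 0)] * 64
    px, i = (0, 0, 0, 255), 0
    while len(out) < npixels:
        b = data[i]; i += 1
        if b == 0xFE:                              # RGB
            px = (data[i], data[i + 1], data[i + 2], 255); i += 3
        elif b >> 6 == 0:                          # INDEX
            px = index[b]
        elif b >> 6 == 1:                          # DIFF
            px = ((px[0] + ((b >> 4) & 3) - 2) & 0xFF,
                  (px[1] + ((b >> 2) & 3) - 2) & 0xFF,
                  (px[2] + (b & 3) - 2) & 0xFF, 255)
        else:                                      # RUN
            out += [px[:3]] * ((b & 0x3F) + 1)
            index[qoi_hash(px)] = px
            continue
        index[qoi_hash(px)] = px
        out.append(px[:3])
    return out

# Round trip a synthetic "scan line": runs compress dramatically.
line = [(0, 0, 0)] * 100 + [(200, 50, 50), (201, 51, 49)] * 3 + [(0, 0, 0)] * 20
data = encode(line)
assert decode(data, len(line)) == line
assert len(data) < 3 * len(line)       # far smaller than the raw 378 bytes
```

<p>Every decision in the encoder needs nothing more than the previous pixel, a 64-entry table, and a few small adders and comparators–which is why it fits in so few LUTs.</p>

<p>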
If the compression works well, the data rate should drop to about 1MB/s–but we’ll see.</p> <p>Of course, as with anything, nothing works out of the box. Worse, if you are going to rely on something for “test”, it really needs to be better than the device under test. If not, you’ll never know which item is the cause of an observation: the device under test, or the test infrastructure used to measure it.</p> <p>Therefore, I set up a basic simulation test on my desktop. I’d run the SONAR simulation, visually inspect the HDMI output, and capture three frames of data. I’d then <a href="https://github.com/phoboslab/qoi">convert</a> these three frames of data to <a href="https://en.wikipedia.org/wiki/PNG">PNG</a>s. If the resulting <a href="https://en.wikipedia.org/wiki/PNG">PNG</a>s visually matched the simulated HDMI output, then I had a <del>strong</del> confidence the <a href="https://qoiformat.org">QOI</a> <a href="https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_compress.v">compression</a>, <a href="https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_encoder.v">encoder</a>, and <a href="https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_recorder.v">recorder</a> were working.</p> <p>Note that I had to cross out the word “strong” there. Unless and until an IP can be tested through <em>every</em> logic path, you really don’t have any “strong” confidence something is working. Still, it was enough to get me off the ground.</p> <p>The challenge here is that tracing the design through simulation while it records three images can generate a 120GB+ <a href="/blog/2017/07/31/vcd.html">VCD file</a>, and testing this way took longer than building the hardware design, loading it, and capturing images from hardware. As a result, I often found myself debugging both the <a href="https://github.com/ZipCPU/qoiimg">QOI processing system</a> and the (buggy) video processing system jointly, <a href="/blog/2017/06/02/design-process.html">in hardware, at the same time</a>.
No, it’s not ideal, but it did work.</p> <h2 id="the-first-bug-never-getting-back-in-sync">The First Bug: Never getting back in sync</h2> <p>I started my debugging with the default display, a <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_split.v">split screen</a> <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_trace.v">spectrogram</a> and <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_waterfall.v">waterfall</a>. Using my newfound capability, I quickly captured an image that looked something like the figure below.</p> <table align="center" style="float: none"><caption>Fig 9. First QOI capture -- no waterfall</caption><tr><td><img src="/img/qoi-debug/20240527-qoi-before.png" alt="" width="800" /></td></tr></table> <p>This figure shows what <em>should</em> be a <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_split.v">split screen</a> <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_trace.v">spectrogram</a> and <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_waterfall.v">waterfall</a> display. The <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_trace.v">spectrum</a> on top appears about right; however, the <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_waterfall.v">waterfall</a> that’s supposed to occupy the bottom half of the display is completely absent.</p> <p>Well, the good news is that I could at least capture a bug.</p> <p>The next step was to walk this bug backwards through the design. In this case, we’re walking backwards through Fig. 2 above, and the first component to look at is the <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v">overlay module</a>. It is possible for the <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v">overlay module</a> to lose synchronization.
This typically means either that the overlay isn’t ready when the primary display is ready for it, or that the overlay is still displaying some (other) portion of its video. Once out of sync, you can no longer merge the two displays. The two streams then need to be resynchronized. That is, the <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v">overlay</a> module would need to wait for the end of the secondary image (the image to be overlaid on top of the primary), and then it would need to stall the secondary image until the primary display was ready for it again.</p> <p>However, the <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v">overlay module</a> wasn’t losing synchronization.</p> <p>No?</p> <p>This was a complete surprise to me. This was where I was expecting the bug, and where most of my debugging efforts had been (blindly) focused up until this point.</p> <p>Okay, so … let’s move back one more step. (See Fig. 2.)</p> <p>It is possible for the <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_r.v">video waterfall reader</a> to get out of sync between its two clocks. Specifically, one portion of the <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_r.v">reader</a> reads data, one line at a time, from the bus and stuffs it first into a synchronous FIFO, and then into an <a href="/blog/2018/07/06/afifo.html">asynchronous one</a>. This half operates at whatever speed the bus is at, and that’s defined by the memory’s speed. The second half of the <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_r.v">reader</a> takes this data from the <a href="/blog/2018/07/06/afifo.html">asynchronous FIFO</a> and attempts to create an AXI stream video output from it–this time at the pixel clock rate. Because we are not allowed to stall this video output to wait for memory, it is possible for the two to get out of sync.
In this case, the <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_r.v">reader</a> (pixel clock domain) is supposed to wait for an end of frame indication from the memory <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_r.v">reader</a> (bus clock domain, via the <a href="/blog/2018/07/06/afifo.html">asynchronous FIFO</a>), and then it is to stall the memory <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_r.v">reader</a> (by not reading from the <a href="/blog/2018/07/06/afifo.html">asynchronous FIFO</a>) until it receives an end of video frame indication from its own video reconstruction logic.</p> <p>A quick check revealed that yes, these two were getting out of sync.</p> <p>Here’s how the “out-of-sync” detection was taking place:</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">initial</span> <span class="n">px_lost_sync</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">pix_clk</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="n">pix_reset</span><span class="p">)</span> <span class="n">px_lost_sync</span> <span class="o">&lt;=</span> <span class="mi">0</span><span class="p">;</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">M_VID_TVALID</span> <span class="o">&amp;&amp;</span> <span class="n">M_VID_TREADY</span> <span class="o">&amp;&amp;</span> <span class="n">M_VID_HLAST</span><span class="p">)</span> <span class="k">begin</span> <span class="c1">// Check when sending the last pixel of a line. On this last</span> <span class="c1">// pixel, the data read from memory (px_hlast) must also</span> <span class="c1">// indicate that it is the last pixel in a line. 
Further,</span> <span class="c1">// if this is also the last line in a frame, then both the</span> <span class="c1">// memory indicator of the last line in a frame (px_vlast)</span> <span class="c1">// and the outgoing video indicator (M_VID_VLAST) must match.</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">px_hlast</span> <span class="o">||</span> <span class="p">(</span><span class="n">M_VID_VLAST</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">px_vlast</span><span class="p">))</span> <span class="n">px_lost_sync</span> <span class="o">&lt;=</span> <span class="mb">1'b1</span><span class="p">;</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">px_lost_sync</span> <span class="o">&amp;&amp;</span> <span class="n">M_VID_VLAST</span> <span class="o">&amp;&amp;</span> <span class="n">px_vlast</span><span class="p">)</span> <span class="c1">// We can resynchronize once both memory and</span> <span class="c1">// outgoing video streams have both reached the end of</span> <span class="c1">// a frame.</span> <span class="n">px_lost_sync</span> <span class="o">&lt;=</span> <span class="mb">1'b0</span><span class="p">;</span> <span class="k">end</span></code></pre></figure> <p>Following any reset, the entire design should be synchronized. That’s the easy part.</p> <p>Next, if the output of the <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v">overlay module</a> (that’s the <code class="language-plaintext highlighter-rouge">M_VID_*</code> prefix values) is ready to produce the last pixel of a line, then we check if the FIFO signals line up. In our example, we have two sets of synchronization signals. First, there are the <code class="language-plaintext highlighter-rouge">M_VID_HLAST</code> and <code class="language-plaintext highlighter-rouge">M_VID_VLAST</code> signals. 
These are generated blindly based upon the frame size. These indicate the last pixel in a line (<code class="language-plaintext highlighter-rouge">M_VID_HLAST</code>) and the end of a frame (<code class="language-plaintext highlighter-rouge">M_VID_VLAST</code>) respectively–from the perspective of the video stream. Two other signals, <code class="language-plaintext highlighter-rouge">px_hlast</code> and <code class="language-plaintext highlighter-rouge">px_vlast</code>, come through the <a href="/blog/2018/07/06/afifo.html">asynchronous FIFO</a>. These are used to indicate the last bus word in a line and the end of a frame from the perspective of the data found within the <a href="/blog/2018/07/06/afifo.html">asynchronous FIFO</a> containing the samples read from memory–one bus word (not one pixel) at a time. If these two ever get out of sync, then perhaps memory hasn’t kept up with the display or perhaps something else has gone wrong.</p> <p>So, to determine if we’ve lost sync, we check for it on the last pixel of any line. That is, when <code class="language-plaintext highlighter-rouge">M_VID_HLAST</code> is true to indicate the last pixel in a line, then <code class="language-plaintext highlighter-rouge">px_hlast</code> should also be true–both should be synchronized. Likewise, when <code class="language-plaintext highlighter-rouge">M_VID_VLAST</code> (last line of frame) is true, then <code class="language-plaintext highlighter-rouge">px_vlast</code> should also be true–or the two have come out of sync.</p> <p>Because I’m also doing 128b bus word to 8b pixel conversions here, the two signals don’t directly correspond. That is, <code class="language-plaintext highlighter-rouge">px_hlast</code> might be true (last bus word of a line), even though <code class="language-plaintext highlighter-rouge">M_VID_HLAST</code> isn’t true yet (last pixel of a line).
Hence, I only check these values if <code class="language-plaintext highlighter-rouge">M_VID_HLAST</code> is true–on the last <em>pixel</em> of the line.</p> <p>That’s how we know if we’re out of sync. But … how do we get synchronized again?</p> <p>For this, the plan is to read from the memory reader as fast as possible until the end of the frame. Once we get to the end of the frame, we’ll stop reading from memory and wait for the video (pixel clock) to get to the end of the frame. Once both are synchronized at the end of a frame, the plan is to then release both together and we’ll be synchronized again.</p> <p>At least, that’s how this is <em>supposed</em> to work.</p> <p>The key (broken) signal was the signal to read from the <a href="/blog/2018/07/06/afifo.html">asynchronous FIFO</a>. This signal, called <code class="language-plaintext highlighter-rouge">afifo_read</code>, is shown below.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">begin</span> <span class="n">afifo_read</span> <span class="o">=</span> <span class="p">(</span><span class="n">px_count</span> <span class="o">&lt;=</span> <span class="n">PW</span> <span class="o">||</span> <span class="o">!</span><span class="n">px_valid</span> <span class="o">||</span> <span class="p">(</span><span class="n">M_VID_TVALID</span> <span class="o">&amp;&amp;</span> <span class="n">M_VID_TREADY</span> <span class="o">&amp;&amp;</span> <span class="n">M_VID_HLAST</span><span class="p">));</span> <span class="k">if</span> <span class="p">(</span><span class="n">M_VID_TVALID</span> <span class="o">&amp;&amp;</span> <span class="o">!</span><span class="n">M_VID_TREADY</span><span class="p">)</span> <span class="n">afifo_read</span> <span class="o">=</span> <span class="mb">1'b0</span><span class="p">;</span> <span class="c1">// Always read 
if we are out of sync</span> <span class="k">if</span> <span class="p">(</span><span class="n">px_lost_sync</span> <span class="o">&amp;&amp;</span> <span class="p">(</span><span class="o">!</span><span class="n">px_hlast</span> <span class="o">||</span> <span class="o">!</span><span class="n">px_vlast</span><span class="p">))</span> <span class="n">afifo_read</span> <span class="o">=</span> <span class="mb">1'b1</span><span class="p">;</span> <span class="k">end</span></code></pre></figure> <p>Basically, we want to read from the <a href="/blog/2018/07/06/afifo.html">asynchronous FIFO</a> any time we don’t have a full pixel’s width left in our bus-width-to-pixel gearbox, any time we don’t have a valid buffer, or any time we reach the end of the line–where we would flush the gearbox’s buffer. The exception to this is if the outgoing AXI stream is stalled. This is how the <a href="/blog/2018/07/06/afifo.html">FIFO</a> read signal is supposed to work normally. There’s one further exception, and that is if the two are out of sync. In that case, we will always read from the <a href="/blog/2018/07/06/afifo.html">FIFO</a> until the last pixel in a line on the last line of the frame.</p> <p>This all sounds good. It looked good on a desk check too. I passed over this many times, reading it, convincing myself that this was right.</p> <p>The problem is, this was the logic that was broken.</p> <p>If you look closely, you might notice that this logic would never allow us to get back in sync.
Once we lose synchronization, we’ll read until the end of the frame and then stop, only to read again when any of the original criteria are true–the ones assuming synchronization.</p> <p>Yeah, that’s not right.</p> <p>This also explains why all my hardware traces showed the waterfall never resynchronizing with the outgoing video stream.</p> <p>One missing condition fixes this.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">begin</span> <span class="c1">// ...</span> <span class="k">if</span> <span class="p">(</span><span class="n">px_lost_sync</span> <span class="o">&amp;&amp;</span> <span class="n">px_hlast</span> <span class="o">&amp;&amp;</span> <span class="n">px_vlast</span> <span class="o">&amp;&amp;</span> <span class="p">(</span><span class="o">!</span><span class="n">M_VID_HLAST</span> <span class="o">||</span> <span class="o">!</span><span class="n">M_VID_VLAST</span><span class="p">))</span> <span class="n">afifo_read</span> <span class="o">=</span> <span class="mb">1'b0</span><span class="p">;</span> <span class="k">end</span></code></pre></figure> <p>This last condition states that, if we are out of sync and the data from memory has reached the end of its frame, then we need to wait until the outgoing video reaches the end of its frame as well. Only then can we read again.</p> <p>Once I fixed this, things got better.</p> <table align="center" style="float: none"><caption>Fig 10. QOI capture, showing an attempted waterfall display</caption><tr><td><img src="/img/qoi-debug/20240528-qoi-promising.png" alt="" width="800" /></td></tr></table> <p>I could now get through a significant fraction of a frame before losing synchronization for the rest of it.
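</p>

<p>To see the failure mode outside of the RTL, here’s a little Python model of the <code class="language-plaintext highlighter-rouge">afifo_read</code> decision above. The signal names follow the Verilog; the scenario values are mine:</p>

```python
# Behavioral model of the afifo_read logic above, showing why the original
# version never resynchronizes.  Signal names follow the Verilog; the
# scenario below is mine.

PW = 8   # bits of one pixel: refill the gearbox when less than this remains

def afifo_read(s, fixed):
    # Normal operation: refill the gearbox, fill an empty buffer, or
    # flush the gearbox at the end of a line ...
    read = (s["px_count"] <= PW or not s["px_valid"]
            or (s["TVALID"] and s["TREADY"] and s["M_HLAST"]))
    # ... but never advance while the outgoing stream is stalled.
    if s["TVALID"] and not s["TREADY"]:
        read = False
    # Out of sync: drain the FIFO until memory reaches its end of frame.
    if s["lost_sync"] and (not s["px_hlast"] or not s["px_vlast"]):
        read = True
    # The missing condition: once the memory side sits at its end of
    # frame, HOLD until the outgoing video reaches its end of frame too.
    if (fixed and s["lost_sync"] and s["px_hlast"] and s["px_vlast"]
            and not (s["M_HLAST"] and s["M_VLAST"])):
        read = False
    return read

# Out of sync, memory parked at its frame boundary, video mid-frame, and
# the gearbox has just run dry -- one of the "original criteria" is true:
state = dict(px_count=0, px_valid=False, TVALID=True, TREADY=True,
             M_HLAST=False, M_VLAST=False,
             lost_sync=True, px_hlast=True, px_vlast=True)

assert afifo_read(state, fixed=False)       # bug: reads right past the frame end
assert not afifo_read(state, fixed=True)    # fix: holds for the video frame end
```

<p>One state is enough to show the bug: memory parked at its end of frame, video mid-frame, gearbox empty. The original logic reads anyway; the fixed logic holds.</p>

<p>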
In other words, I had found and fixed the reason the design wasn’t recovering–just not whatever caused it to lose sync in the first place.</p> <p>The waterfall background is also supposed to be <em>black</em>, not <em>blue</em>–so I needed to dig into that as well. (That turned out to be a bug in the <a href="https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_compress.v">QOI compression module</a>. I could just about guess this bug by watching how the official decoder worked.)</p> <p>So, back I went to the <a href="https://github.com/ZipCPU/wbscope">Wishbone scope</a>, this time <a href="/blog/2017/06/08/simple-scope.html">triggering the scope</a> on a loss-of-sync event. I needed to find out why this design lost sync in the first place.</p> <h2 id="the-second-bug-how-did-we-lose-sync-in-the-first-place">The Second Bug: How did we lose sync in the first place?</h2> <p>Years ago, I wrote <a href="/blog/2018/11/29/llvga.html">an article that argued that good and correct video handling was all captured by a pair of counters</a>. You needed one counter for the horizontal pixel, and another for the vertical pixel. Once these got to the raw width and height of the image, the counters would be reset and start over.</p> <p>When dealing with memory, things are a touch different–at least for this design.</p> <p>As hinted above, the bus portion of the <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_r.v">waterfall reader</a> works off of <em>bus words</em>, not pixels. It reads one line at a time from the bus, reading as many bus words as are necessary to make up a line. In the case of this system, a bus word on the <a href="https://store.digilentinc.com/nexys-video-artix-7-fpga-trainer-board-for-multimedia-applications">Nexys Video board</a> is 128 bits long–the natural width of the DDR3 SDRAM memory.
(Our <a href="https://www.enclustra.com/en/products/fpga-modules/mercury-kx2/">next hardware platform</a> will increase this to 512 bits.) Likewise, the waterfall pixel size is only 8 bits–since it has no color, and a <a href="https://github.com/ZipCPU/vgasim/blob/48de0f29c1cb91fabb0ef4d0cba4829c4a43651c/rtl/gfx/vid_clrmap.v">false color</a> will be provided later. Hence, to read an 800-pixel line, the bus master must read 50 bus words (<code class="language-plaintext highlighter-rouge">800*8/128</code>). The last word will then be marked as the last in the line, possibly also the last in the frame, and the result will be stuffed into the <a href="/blog/2018/07/06/afifo.html">asynchronous FIFO</a>. Once the last word in a line is requested of the bus, the bus master needs to increment its line pointer address to the next line.</p> <p>However, there’s a problem with bus mastering: the logic that makes <em>requests</em> of a bus has to take place many clocks before the logic that <em>receives</em> the bus responses. The exact delay is not really that important, but it typically ends up around 30 clock cycles or so. That means this design needs two sets of X and Y counters: one when making requests, to know when a full line (or frame) has been requested and that it is time to advance to the next line (or frame), and a second set to keep track of when the line (or frame) ends with respect to the values <em>returned</em> from the bus.
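</p>

<p>A quick Python sketch puts numbers to this. The bus and image sizes are the ones from this design; the in-flight model is mine, purely for illustration:</p>

```python
# Quick numbers for the request/return bookkeeping described above.  The
# bus and image sizes are this design's; the latency model is mine.

BUS_BITS, PIXEL_BITS, LINE_PIXELS = 128, 8, 800

words_per_line = LINE_PIXELS * PIXEL_BITS // BUS_BITS
assert words_per_line == 50    # 50 bus reads fetch one 800-pixel line

def issued_when_line_returns(total_words, latency=30):
    """Issue one request per clock, ack each one `latency` clocks later.
    Return how many words had been *requested* by the time the last word
    of the first line was *returned*."""
    issued = acked = 0
    in_flight = []
    while True:
        if issued < total_words:
            in_flight.append(latency)
            issued += 1
        in_flight = [t - 1 for t in in_flight]
        if in_flight and in_flight[0] == 0:
            in_flight.pop(0)
            acked += 1
            if acked == words_per_line:
                return issued

# By the time line 1's final ACK arrives, the request side is already 29
# words into line 2 -- hence the two independent sets of X/Y counters.
assert issued_when_line_returns(2 * words_per_line) == 79
```

<p>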
This second set controls the end of line and frame markers that go into the synchronous and then <a href="/blog/2018/07/06/afifo.html">asynchronous FIFO</a>.</p> <p>Let’s walk through this logic to see if I can clarify it at all.</p> <ol> <li> <p>First, there’s both a synchronous FIFO and an <a href="/blog/2018/07/06/afifo.html">asynchronous one</a>–since it can be a challenge to know the <em>fill</em> of the <a href="/blog/2018/07/06/afifo.html">asynchronous FIFO</a>.</p> </li> <li> <p>Once the <em>synchronous</em> FIFO is at least half empty, the reader begins a bus transaction. For a <a href="/zipcpu/2017/11/07/wb-formal.html">Wishbone bus</a>, this means both <code class="language-plaintext highlighter-rouge">CYC</code> and <code class="language-plaintext highlighter-rouge">STB</code> need to be raised.</p> </li> <li> <p>For every <code class="language-plaintext highlighter-rouge">STB &amp;&amp; !STALL</code>, a request is made of the bus. At this time, we also subtract one from a counter keeping track of the number of available (i.e. uncommitted) entries in the synchronous FIFO.</p> </li> <li> <p>Likewise, for every <code class="language-plaintext highlighter-rouge">STB &amp;&amp; !STALL</code>, the IP increments the requested memory address.</p> <p>Once you get to the end of the line, set the next address to the last line start address <em>minus</em> one line of memory. Remember, we are creating a <em>falling</em> raster, where we go from most recent <a href="/dsp/2018/10/02/fft.html">FFT</a> data to oldest <a href="/dsp/2018/10/02/fft.html">FFT</a> data.
Hence we read <em>backwards</em> through memory, one line at a time.</p> <p>Once we get to the beginning of our assigned memory area, we wrap back to the end of our assigned memory area minus one line.</p> <p>Once we get to the end of the <em>frame</em>, we need to reset the address to the last line the <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_w.v">writer</a> has just completed.</p> </li> <li> <p>On every <code class="language-plaintext highlighter-rouge">ACK</code>, the returned data gets stored into the synchronous FIFO. With each result stored in the FIFO, we also add an indication of whether this return was associated with the end of a line or the end of a frame.</p> </li> <li> <p>Once the <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_r.v">reader</a> gets to the end of the line, we restart the (horizontal) <del>pixel</del> bus word counter and increment the line counter. When it gets to the end of the frame, we reset the line counter as well.</p> <p>Just to make sure that these two sets of counters (request and return) remain synchronized, the return counters are set to equal the request counters any time the bus is idle.</p> </li> <li> <p>The IP then continues making requests until there would be no more room in the FIFO for the returned data. At this point, <code class="language-plaintext highlighter-rouge">STB</code> gets dropped and we wait for the last request to be returned.</p> </li> <li> <p>Once all requests have been returned, drop <code class="language-plaintext highlighter-rouge">CYC</code> and wait again.</p> <p>The rule of the bus is also the rule of the boarding house bathroom: do your business, and get out of there. Once you are done with any bus transactions, it’s therefore important to get off the bus.
Even if we could (now) make more requests, we’ll get off the bus and wait for the FIFO to become less than half full again–that way other (potential) bus masters can have a chance to access memory.</p> </li> </ol> <p>And … right there is the foundation for this bug.</p> <p>The actual bug was how I determined whether or not the last request was being returned. Let’s look at that logic for a moment, shall we? Here’s what it looked like (when broken): (Watch for what clears <code class="language-plaintext highlighter-rouge">o_wb_cyc</code> …)</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">initial</span> <span class="o">{</span> <span class="n">o_wb_cyc</span><span class="p">,</span> <span class="n">o_wb_stb</span> <span class="o">}</span> <span class="o">=</span> <span class="mb">2'b00</span><span class="p">;</span> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">i_clk</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="n">wb_reset</span><span class="p">)</span> <span class="c1">// Halt any requests on reset</span> <span class="o">{</span> <span class="n">o_wb_cyc</span><span class="p">,</span> <span class="n">o_wb_stb</span> <span class="o">}</span> <span class="o">&lt;=</span> <span class="mb">2'b00</span><span class="p">;</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">o_wb_cyc</span><span class="p">)</span> <span class="k">begin</span> <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">o_wb_stb</span> <span class="o">||</span> <span class="o">!</span><span class="n">i_wb_stall</span><span class="p">)</span> <span class="c1">// Drop the strobe signal on the last request. 
Never</span> <span class="c1">// raise it again during this cycle.</span> <span class="n">o_wb_stb</span> <span class="o">&lt;=</span> <span class="o">!</span><span class="n">last_request</span><span class="p">;</span> <span class="k">if</span> <span class="p">(</span><span class="n">i_wb_ack</span> <span class="o">&amp;&amp;</span> <span class="p">(</span><span class="o">!</span><span class="n">o_wb_stb</span> <span class="o">||</span> <span class="o">!</span><span class="n">i_wb_stall</span><span class="p">)</span> <span class="o">&amp;&amp;</span> <span class="n">last_request</span> <span class="o">&amp;&amp;</span> <span class="n">last_ack</span><span class="p">)</span> <span class="c1">// Drop ACK once the last return has been received.</span> <span class="n">o_wb_cyc</span> <span class="o">&lt;=</span> <span class="mb">1'b0</span><span class="p">;</span> <span class="k">end</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">fifo_fill</span><span class="p">[</span><span class="n">LGFIFO</span><span class="o">:</span><span class="n">LGBURST</span><span class="p">]</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="c1">// Start requests when the FIFO has less than a burst's size</span> <span class="c1">// within it.</span> <span class="o">{</span> <span class="n">o_wb_cyc</span><span class="p">,</span> <span class="n">o_wb_stb</span> <span class="o">}</span> <span class="o">&lt;=</span> <span class="mb">2'b11</span><span class="p">;</span> <span class="k">always</span> <span class="o">@</span><span class="p">(</span><span class="kt">posedge</span> <span class="n">i_clk</span><span class="p">)</span> <span class="k">if</span> <span class="p">(</span><span class="n">i_reset</span> <span class="o">||</span> <span class="o">!</span><span class="n">o_wb_cyc</span> <span class="o">||</span> <span class="n">i_wb_err</span><span class="p">)</span> <span 
class="n">last_ack</span> <span class="o">&lt;=</span> <span class="mi">0</span><span class="p">;</span> <span class="k">else</span> <span class="n">last_ack</span> <span class="o">&lt;=</span> <span class="p">(</span><span class="n">wb_outstanding</span> <span class="o">+</span> <span class="p">(</span><span class="n">o_wb_stb</span> <span class="o">?</span> <span class="mi">1</span><span class="o">:</span><span class="mi">0</span><span class="p">)</span> <span class="o">&lt;=</span> <span class="mi">2</span> <span class="o">+</span><span class="p">(</span><span class="n">i_wb_ack</span> <span class="o">?</span> <span class="mi">1</span><span class="o">:</span><span class="mi">0</span><span class="p">));</span></code></pre></figure> <p>Look specifically at the <code class="language-plaintext highlighter-rouge">last_ack</code> signal.</p> <p>Depending upon the pipeline, this signal can be off by one clock cycle.</p> <p>This was the bug. Because the <code class="language-plaintext highlighter-rouge">last_ack</code> signal, indicating that there’s only one more acknowledgement left, compared the number of outstanding requests against <code class="language-plaintext highlighter-rouge">2</code> plus the current acknowledgment, and because the signal was <em>registered</em>, <code class="language-plaintext highlighter-rouge">last_ack</code> might be set if there were two requests outstanding <em>and</em> nothing was returned on the current cycle.</p> <p>Since all requests would’ve been made by this time, the X and Y <del>pixel</del> bus word counters for the <em>request</em> would reflect that we’d just requested a line of data. The <em>return</em> counters, on the other hand, would be off by one if <code class="language-plaintext highlighter-rouge">CYC</code> ever dropped a cycle early. These return counters would then get reset to equal the <em>request</em> counters any time <code class="language-plaintext highlighter-rouge">CYC</code> was zero. 
Hence, dropping the bus line one cycle early would result in a line of pixels (well, bus words representing pixels …) going into the FIFO that didn’t have enough pixels within it–or perhaps the LAST signal might be missing entirely. Whatever the case, it didn’t line up.</p> <p>This particular design was formally verified. Shouldn’t this bug have shown up in a formal test? Sadly, no. <a href="/zipcpu/2017/11/07/wb-formal.html">It’s <em>legal</em> to drop <code class="language-plaintext highlighter-rouge">CYC</code> early</a>, so there’s no protocol violation there. Further, my acknowledgment counter was off by one in such a way that the formal properties allowed it. If I added an assertion that <code class="language-plaintext highlighter-rouge">CYC</code> would never be dropped early (which I did once I discovered this bug), the design would then immediately (and appropriately) fail.</p> <p>There’s one more surprise to this story though. Why didn’t this bug show up in simulation?</p> <p>Ahh, now there’s a very interesting lesson to be learned.</p> <h2 id="reality-why-didnt-the-bugs-show-up-in-simulation">Reality: Why didn’t the bug(s) show up in simulation?</h2> <p>Why didn’t the bug show up earlier? Because of Xilinx’s DDR3 SDRAM controller, commonly known as “The MIG”.</p> <p>I don’t normally simulate DDR3 memories. A DDR3 SDRAM memory controller requires a lot of hardware specific components, components that aren’t necessarily easy to simulate, and it also requires a DDR3 SDRAM simulation model.
I tend to simplify all of this and just simulate my designs with an <a href="https://github.com/ZipCPU/zbasic/blob/e7b39a56ee515d1cabe8427f30c7add0592bfab1/sim/verilated/memsim.cpp">alternate SDRAM model–a model that looks and acts “about” right, but one that isn’t exact</a>.</p> <p>It was the difference between <a href="https://github.com/ZipCPU/zbasic/blob/e7b39a56ee515d1cabe8427f30c7add0592bfab1/sim/verilated/memsim.cpp">my simulation model</a>, which wouldn’t trigger any of the bugs, and the reality of Xilinx’s MIG that ended up exposing the bug.</p> <p>Fig. 11, for example, shows what the <a href="https://github.com/ZipCPU/wbscope">Wishbone scope</a> returned when documenting the <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_r.v">waterfall reader</a>’s transactions with the MIG.</p> <table align="center" style="float: none"><caption>Fig 11. The Waterfall reader's view of Wishbone bus handshaking when accessing memory</caption><tr><td><img src="/img/qoi-debug/20240605-migfail.png" alt="" width="800" /></td></tr></table> <p>Focus your attention first on the stall (<code class="language-plaintext highlighter-rouge">i_stall</code>) line, and then on the acknowledgment (<code class="language-plaintext highlighter-rouge">i_ack</code>) line.</p> <p>First, stall is high immediately as part of the beginning of the transaction. This is to be expected. With the exception of filling a minimal buffer, any bus master requesting transactions of the bus is going to need to wait for <a href="/blog/2019/07/17/crossbar.html">arbitration</a>. This only takes a clock or two. Once <a href="/blog/2019/07/17/crossbar.html">arbitration</a> is received, the <a href="/blog/2019/07/17/crossbar.html">interconnect</a> won’t stall the design again during this bus cycle.</p> <p>Yet the stall line does get raised again after that–several times even. These stalls are all due to the MIG.</p> <p>Let’s back up a touch.</p> <p>There are a lot of rules to SDRAM interaction.
Most SDRAMs are configured in memory <em>banks</em>. Banks are read and written in <em>rows</em>. The data in each row is stored in a set of capacitors. This allows for maximum data packing in minimal area (cost). However, you can’t read directly from a row of capacitors. To read from the memory, that row first needs to be copied to a row of fast memory. This is called “activating” the row. Once a row is activated, it can be read from or written to. Once you are done with one row, it must be “precharged” (i.e. put back) before a different row can be activated. All of this takes time. If the row you want isn’t activated, you’ll need to switch rows. That will cause a stall as the old row needs to be precharged and the new row activated. Hence, when making a long string of read or a long string of write requests, you’ll suffer from a stall every time you cross rows.</p> <p>Xilinx’s MIG has another rule. Because of how their architecture uses an IO trained PLL (Xilinx calls this a “phaser”), the MIG needs to regularly read from memory to keep this PLL trained. During this time the memory must also stall. (Why the MIG can’t train on <em>my</em> memory reads, but needs its own–I don’t know.) These stalls are very periodic, and if you dig a bit you can find this taking place within their controller.</p> <p>Then the part of the trace showing a long stalled section reflects the reality that, every now and again, the memory needs to be taken entirely off line for a period of time so that the capacitors can be recharged. This requires a longer time period, as highlighted in Fig. 12 below.</p> <table align="center" style="float: none"><caption>Fig 12. SDRAM refresh cycles force long stalls</caption><tr><td><img src="/img/qoi-debug/20240605-migrefresh.png" alt="" width="800" /></td></tr></table> <p>Once it’s time for a refresh cycle like this, several steps need to take place in the memory controller–in this case the MIG. First, any active rows need to be precharged.
Then, the memory is refreshed. Finally, you’ll need to re-activate the row you need. This takes time as well–as shown in Fig. 12.</p> <p>That’s part one–the stall signal. <a href="https://github.com/ZipCPU/zbasic/blob/e7b39a56ee515d1cabe8427f30c7add0592bfab1/sim/verilated/memsim.cpp">My over-simplified SDRAM memory model</a> doesn’t simulate any of these practical memory realities.</p> <p>Part two is the acknowledgments. From these traces, you can see that there’s about a 30 cycle latency (300ns) from the first request to the first acknowledgment. However, unlike my <a href="https://github.com/ZipCPU/zbasic/blob/e7b39a56ee515d1cabe8427f30c7add0592bfab1/sim/verilated/memsim.cpp">over-simplified memory model</a>, the acknowledgments also come back broken due to the stalls. This makes sense. If every request takes 30 cycles, and some get stalled, then it only makes sense that the stalled requests would get acknowledged later than the ones that didn’t get stalled.</p> <p>Put together, this is why my waterfall display worked in simulation, but not in hardware.</p> <h2 id="conclusion">Conclusion</h2> <p>Wow, that was a long story!</p> <p>Yeah. It was long from my perspective too. Although the “bugs” amounted to only 2-5 lines of Verilog, it took a lot of work to find those bugs.</p> <p>Here are some key takeaways to consider:</p> <ol> <li> <p>All of this was predicated on a <a href="/blog/2018/08/04/sim-mismatch.html">simulation vs hardware mismatch</a>.</p> <p>Because the SDRAM simulation did not match the SDRAM reality, cycle for cycle, a key hardware reality was missed in testing.</p> </li> <li> <p>This should’ve been caught via formal methods.</p> <p>From now on, I’m going to have to make certain I check that <code class="language-plaintext highlighter-rouge">CYC</code> is only ever dropped following either a reset, an error, or the last acknowledgment.
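</p> <p>Here’s a hedged sketch of what such a check might look like, assuming an <code class="language-plaintext highlighter-rouge">f_outstanding</code> counter of requests made but not yet acknowledged. The names here are illustrative, not taken from the actual property file:</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog">// Illustrative formal property: CYC may only fall following a reset,
// a bus error, or the final acknowledgment
always @(posedge i_clk)
if (f_past_valid &amp;&amp; $past(o_wb_cyc) &amp;&amp; !o_wb_cyc
		&amp;&amp; !$past(i_reset || i_wb_err))
	assert($past(i_wb_ack &amp;&amp; !o_wb_stb &amp;&amp; f_outstanding == 1));</code></pre></figure> <p>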
There should be zero requests outstanding when <code class="language-plaintext highlighter-rouge">CYC</code> is dropped.</p> </li> <li> <p>Why wasn’t the pixel resynchronization bug caught via formal?</p> <p>Because … FIFOs. It can be a challenge to formally verify a design containing a FIFO. Rather than deal with this properly, I allowed the two halves of the design to be somewhat independent–and so the formal tool never really examined whether or not the design could (or would) properly recover from a lost sync.</p> </li> <li> <p>Did formally verifying the <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v">overlay module</a> help?</p> <p>Yes. When we went through it, we found bugs in it. Once the <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v">overlay module</a> was formally verified, the result stopped <em>jumping</em>. Instead, the overlay might just note a problem and stop showing the overlaid image. Even better, now that the <a href="https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v">overlay module</a> has been properly verified, I haven’t had any more instances of the top and bottom pictures getting out of sync with each other.</p> </li> <li> <p>What about that blue field?</p> <p>Yes, the waterfall background should be black when no signal is present. The blue field turned out to be caused by a bug in the <a href="https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_compress.v">QOI compression module</a>.</p> <p>Once fixed, the captured image looked like Fig. 13 below.</p> </li> </ol> <table align="center" style="float: none"><caption>Fig 13. The captured waterfall image, once the QOI compression bug was fixed</caption><tr><td><img src="/img/qoi-debug/20240528-qoi-working.png" alt="" width="800" /></td></tr></table> <p>This was easily found and fixed.
(It had to do with a race condition on the pixel index when writing to the compression table, if I recall correctly …)</p> <ol start="6"> <li> <p>How about that <a href="https://github.com/ZipCPU/qoiimg">QOI module</a>?</p> <p>The thing worked like a champ! I love the simplicity of the <a href="https://qoiformat.org">QOI</a> encoding, enough so that I’m likely to use it again and again!</p> <p>Okay, perhaps I’m overselling this. It wasn’t perfect at first. This is, in many ways, to be expected–this was the first time it was ever used. However, it was small and cheap, and worked well enough to get the job done.</p> <p>Some time later, I managed to formally verify the <a href="https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_compress.v">compression</a> engine, and I found another bug or two that had been missed in my hardware testing.</p> <p>That’s <a href="https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_compress.v">compression</a>.</p> <p>Decompression? That’s another story. I think I’ve convinced myself that I can do decompression in hardware, but the <a href="https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_decompress.v">algorithm</a> (while cheap) isn’t really straightforward any more. At issue is the reality that it will take several clock cycles (i.e. pipeline stages) to determine the table index for storing colors into, yet the very next pixel might be dependent upon the result of reading from the table. Scheduling the pipeline isn’t straightforward. (Worse, I have simulation test cases showing that the <a href="https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_decompress.v">decompression logic I have</a> doesn’t work yet.)</p> </li> <li> <p>Are the displays ready for prime time?</p> <p>I’d love to say so, but they don’t have labeled axes. They really need labeled axes to be proper <em>professional</em> displays.
Perhaps a <a href="https://qoiformat.org">QOI</a> <a href="https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_decompress.v">decompression algorithm</a> can take labeled image data from memory and overlay it onto the display as well. However, to do this I’m going to have to redesign how I handle scaling, otherwise the labels won’t match the image.</p> <p>Worse, <a href="https://x.com/Dg3Yev/status/1797779997190443498">[DG3YEV Tobias] recently put my waterfall display to shame</a>. My basic displays are much too simple. So, it looks like I might need to up my game.</p> </li> </ol> <p>I should point out, in passing, that the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3 SDRAM controller</a> doesn’t have nearly as many stall cycles as Xilinx’s MIG. It doesn’t use the (undocumented) hardware phasers, so it doesn’t have to take the memory offline periodically. Further, it can schedule the row precharge and activation cycles so as to avoid bus stalls (when accessing memory sequentially). As such, it operates about 10% faster than the MIG. It even achieves a lower latency. These details, however, really belong in an article of their own.</p> <p>I suppose the bottom line question is whether or not these displays are ready for our next testing session. The answer is a solid No. Not yet. I still need to do some more testing with them. However, these displays are a lot closer now than they’ve been for the last two years.</p> <hr /><p><em>Seest thou a man diligent in his business? he shall stand before kings; he shall not stand before mean men.
(Prov 22:29)</em> Sat, 22 Jun 2024 00:00:00 -0400 https://zipcpu.com/video/2024/06/22/vidbug.html video Bringing up Kimos <p>Ever had one of those problems where you were stuck for weeks?</p> <p>It’s not supposed to happen, but … it does.</p> <p>Let me tell you about the Kimos story so far.</p> <h2 id="what-is-kimos">What is Kimos?</h2> <p><a href="https://github.com/ZipCPU/kimos">Kimos is the name of one of the current open source projects</a> I’m working on. The project is officially named the “Kintex-7 Memory controller, Open Source toolchain”, but the team shortened that to “KiMOS” and I’ve gotten to the point where I just call it “Kimos” (pronounced KEE-mos). The goals of the project are twofold.</p> <ol> <li> <p>Test an <a href="https://github.com/AngeloJacobo/uberDDR3">Open Source DDR3 SDRAM memory controller</a>.</p> <p>This includes both performance testing, and performance comparisons against Xilinx’s MIG controller.</p> <p>Just as a note, <a href="https://github.com/AngeloJacobo/uberDDR3">Angelo’s controller</a> has a couple of differences from Xilinx’s controller. One of them is a simpler “native” interface: Wishbone, with an option for one (or more) auxiliary wire(s). The auxiliary wire(s) are designed to simplify <a href="https://github.com/ZipCPU/wb2axip/blob/master/rtl/axim2wbsp.v">wrapping this controller with a full AXI interface</a>. Another difference is the fact that <a href="https://github.com/AngeloJacobo/uberDDR3">Angelo’s controller</a> is built using documented Xilinx IO capabilities only–rather than the <code class="language-plaintext highlighter-rouge">PHY_CONTROL</code> and <code class="language-plaintext highlighter-rouge">PHASER*</code> constructs that Xilinx used and chose not to document.</p> <p>My hypothesis is that these differences, together with some internal structural differences that I encouraged Angelo to make, will make his a faster memory controller.
This test will tell.</p> </li> <li> <p>Once the memory controller works, our goal is to test Kimos using an entirely open source tool flow.</p> <p><em>This open source tool flow would replace Vivado.</em></p> </li> </ol> <p>The project hardware itself is built by <a href="https://www.enclustra.com">Enclustra</a>. It consists of two boards: a <a href="https://www.enclustra.com/en/products/base-boards/mercury-st1/">Mercury+ ST1 baseboard</a>, and an associated <a href="https://www.enclustra.com/en/products/fpga-modules/mercury-kx2/">KX2 daughterboard</a>. Together, these boards provide some nice hardware capability in one place:</p> <ol> <li> <p>There’s a large DDR3 SDRAM memory, with a 64b data width. Ultimately, this means we should be able to transfer 512b per FPGA clock. In the case of this project, that’ll be 512b for every 10ns (i.e. a 100MHz FPGA system clock)–even though the memory itself can be clocked faster.</p> </li> <li> <p>The board also has two Gb Ethernet interfaces, although I only have plans for one of them.</p> <p>Each interface (naturally) includes an <a href="https://en.wikipedia.org/wiki/Management_Data_Input/Output">MDIO management interface</a>. Although I might be tempted to take this interface for granted, it shouldn’t be. It was via the <a href="https://en.wikipedia.org/wiki/Management_Data_Input/Output">MDIO interface</a> that I was able to tell which of the two hardware interfaces corresponded to ETH0 on the schematic and which was ETH1.</p> </li> <li> <p>There’s an SD card slot on the board, so I’ve already started using it to test my <a href="https://github.com/ZipCPU/sdspi">SDIO controller</a> and its new DMA capability.
Once tested, the <a href="https://github.com/ZipCPU/sdspi/tree/dev">dev branch (containing the DMA)</a> will have been “tested” and “hardware proven”, and so I’ll then be able to merge it into the <a href="https://github.com/ZipCPU/sdspi/tree/master">master branch</a>.</p> </li> <li> <p>I’m likely to use the FMC interface to test a <a href="https://github.com/ZipCPU/wbsata">new SATA controller</a> I’m working on. A nice <a href="https://www.fpgadrive.com/">FPGA Drive daughter board</a> from <a href="https://www.opsero.com">Opsero Electronic Design, Inc.,</a> will help to make this happen.</p> <p>Do note, though, that <a href="https://github.com/ZipCPU/wbsata">this controller</a>, although posted, is most certainly broken and broken badly at present–it’s just not far enough along in its development to have any reliability to it. The plan is to first build a SATA Verilog model, get the controller running in simulation, and then to get it running on this Enclustra hardware. It’s just got a long way to go at present. The good news is that the project is funded, so if you are interested in it, come back and check in on it later–after I’ve had the chance to prove (and therefore fix) it.</p> </li> <li> <p>The device also has some I2C interfaces, which I might investigate for testing my <a href="/blog/2021/11/15/ultimate-i2c.html">ultimate I2C controller</a> on.
The main I2C bus has three chips connected to it: an <a href="https://media.digikey.com/pdf/Data%20Sheets/Silicon%20Laboratories%20PDFs/Si5338.pdf">Si5338 clock controller</a> (which isn’t needed for any of my applications), an encrypted hash chip (with … poor documentation–not recommended), and a <a href="https://www.renesas.com/us/en/document/dst/isl12020m-datasheet">real time clock</a>.</p> </li> <li> <p>The design also has some of the more standard interfaces that everything relies on, to include <a href="/blog/2019/03/27/qflexpress.html">Flash</a> and <a href="/formal/2019/02/21/txuart.html">UART</a>–both of which I have controllers for already.</p> </li> <li> <p>Although the <a href="https://www.enclustra.com/en/products/base-boards/mercury-st1/">baseboard</a> has HDMI capabilities, Enclustra never connected the HDMI on the <a href="https://www.enclustra.com/en/products/base-boards/mercury-st1/">baseboard</a> to the <a href="https://www.enclustra.com/en/products/fpga-modules/mercury-kx2/">KX2 daughterboard</a>. Hence, if I want video, I’ll need to use the DisplayPort hardware–something I haven’t done before, but … it does have potential (just not funding).</p> <p>This is a shame, because I have a bunch of live HDMI displays that I’d love to port to this project that … just aren’t likely to happen.</p> </li> </ol> <p>Eventually, my plan is to port my SONAR work to this hardware–but that remains a far off vision at this point.</p> <p>The project is currently a work in progress, so I have not gotten to the point of completing either of the open source objectives. (Since I initially drafted this, <a href="https://github.com/AngeloJacobo/uberDDR3">Angelo’s controller</a> has now been ported, and appears to be working–its performance just hasn’t been measured yet.)</p> <p>I have, however, completed a first milestone: getting the design working with Xilinx’s MIG controller.
For a task that should’ve taken no longer than a couple of days, this portion of the task has taken a month and a half–leaving me stuck in <a href="/fpga-hell.html">FPGA Hell</a> for most of this time.</p> <p>Now that I have Xilinx’s MIG working, I’d like to share a brief description of what went wrong, and why this took so long. Perhaps others may learn from my failures as well.</p> <h2 id="the-challenges-with-board-bringup">The challenges with board bringup</h2> <p>The initial steps in board bringup went quickly: I could get the LEDs and serial port up and running with no problems. From there I could <a href="https://github.com/ZipCPU/kimos/blob/master/sw/board/cputest.c">test</a> the <a href="/about/zipcpu.html">ZipCPU</a> (running out of block RAM), and things looked good. At this point, a year or so ago, I put the board on the shelf to come back to it later when I had more time and motivation (i.e. funding).</p> <p>I wasn’t worried about the next steps. I already had controllers for the main hardware components necessary to move forward. I had <a href="https://github.com/ZipCPU/kimos/blob/master/rtl/migsdram.v">a controller that would work nicely with Xilinx’s MIG</a>, <a href="https://github.com/ZipCPU/kimos/blob/master/rtl/net/enetstream.v">another that would handle the Gb Ethernet</a>, <a href="https://github.com/ZipCPU/kimos/blob/master/rtl/qflexpress.v">a flash controller</a>, and so on. These were all proven controllers, so it was just a matter of integrating them and making sure things worked (again) as expected.</p> <p>Once the <a href="https://github.com/ZipCPU/kimos">Kimos project</a> kicked off, with the goals listed above, I added these components to the project and immediately had problems.</p> <h3 id="the-done-led">The DONE LED</h3> <p>The first problem was that the “DONE” LED wouldn’t light. Or, rather, it would light just fine until I tried to include Xilinx’s MIG controller. 
Once I included Xilinx’s MIG controller in the design, the LED would no longer light.</p> <p>Now … how do you fix that one? I mean, where do you even start?</p> <table align="center" style="float: right"><tr><td><img src="/img/kimos/one-bug.svg" width="420" /></td></tr></table> <p>I started by reducing the design as much as possible. I removed components from the design, and adjusted which components were in the design and which were not. With a bit of work, I was able to prove–as mentioned above–that the design would work as long as Xilinx’s MIG (DDR3 SDRAM) controller was not a part of the design. The moment I added Xilinx’s MIG, the design stopped working.</p> <p>Ouch. What would cause that? Is there a short circuit on the board somewhere? Did I mess up the XDC file? The MIG configuration?</p> <p>With some help from other engineers, we traced the first problem to the open source FPGA loader I was using: <a href="https://github.com/trabucayre/openFPGALoader">openFPGALoader</a>. As it turns out, this <a href="https://github.com/trabucayre/openFPGALoader/issues/229">loader struggles to load large/complex designs at high JTAG frequencies</a>. However, if you drop the frequency down from 4MHz to <a href="https://github.com/ZipCPU/kimos/blob/804559a307bd58e34527ebb3f791ad414fe71803/Makefile#L283">3.75MHz, the loader will “just” work</a> and the DONE LED will get lit.</p> <p>The problem goes a bit deeper, and highlights a problem I’ve had personally as well: since the developer of the <a href="https://github.com/trabucayre/openFPGALoader">openFPGALoader</a> component can’t replicate the problem with the hardware he has, he can’t really test fixes. Hence, although a valid fix has been proposed, the developer is uncertain of it. Still, without help, I wouldn’t have made it this far.</p> <p>Sadly, even with the DONE LED now lit, my design still didn’t work. Worse, I no longer trusted the <a href="https://github.com/trabucayre/openFPGALoader">FPGA loader</a>.
This left me always looking over my shoulder for another loading option.</p> <p>For example, I tried programming the design into <a href="https://github.com/ZipCPU/kimos/blob/master/rtl/qflexpress.v">flash</a> and then using my <a href="https://github.com/ZipCPU/kimos/rtl/wbicapetwo.v">internal configuration access port (ICAPE) controller</a> to load the design from <a href="https://github.com/ZipCPU/kimos/blob/master/rtl/qflexpress.v">flash</a>. This didn’t work, until I first took the <a href="https://github.com/ZipCPU/kimos/blob/master/rtl/qflexpress.v">flash</a> out of eXecute in Place (XiP) mode. (Would I have known that, if I hadn’t been the one to build the <a href="https://github.com/ZipCPU/kimos/blob/master/rtl/qflexpress.v">flash controller</a> and put it into XiP mode in the first place? I’m not sure.) However, if I first told the <a href="https://github.com/ZipCPU/kimos/blob/master/rtl/qflexpress.v">flash</a> to leave XiP mode, I could then specify a warm-boot address to my <a href="https://github.com/ZipCPU/kimos/rtl/wbicapetwo.v">ICAPE</a> controller, followed by an IPROG command, which could then load any design that … didn’t include Xilinx’s MIG DDR3 SDRAM controller.</p> <p>At this point, I had proved that my problem was no longer the <a href="https://github.com/trabucayre/openFPGALoader">openFPGALoader</a>. That was the good news. The bad news was that the design still wasn’t working whenever I included the MIG.</p> <h3 id="jtaguart-not-working">JTAG/UART not working</h3> <p>If the design loads, the place I want to go next is to get an internal logic analyzer up and running. Here, I have two options:</p> <ol> <li> <p>Xilinx’s ILA requires a JTAG connection.</p> <p>Without a Xilinx compatible JTAG connector, I can’t use Xilinx’s ILA.</p> <p>At one point I purchased a USB based JTAG controller. 
I … just didn’t manage to purchase the right one, and so the pins never fit.</p> </li> <li> <p>I typically do my <a href="/blog/2017/06/28/dbgbus-goal.html">debugging over UART</a>, using a <a href="https://github.com/ZipCPU/wbscope">Wishbone scope</a>–something we’ve <a href="/blog/2017/07/08/getting-started-with-wbscope.html">already discussed on the blog</a>. Using this method I can quickly find and debug problems.</p> <p>However, with this particular design, any time I added the MIG SDRAM controller to the design my <a href="/blog/2017/06/28/dbgbus-goal.html">UART debugging port</a> would stop working–together with the rest of the design. That left me with no UART, and no JTAG. Indeed, I could ping the board via the Gb Ethernet right up until I added the MIG.</p> </li> </ol> <p>Something was seriously wrong. This is definitely <em>not</em> <a href="https://english.stackexchange.com/questions/25897/origin-of-the-phrase-now-were-cooking-with">“cooking with gas”</a>.</p> <p>So how then do you debug something like this? LEDs!</p> <h3 id="leds-not-working">LEDs not working</h3> <p>Debugging by LED is slow. It can take 10+ minutes to make a change to a design, and each LED will only (at best) give you one bit of output. So the feedback isn’t that great. Still, LEDs are an important part of debugging early design configuration issues. In this case, the <a href="https://www.enclustra.com/en/products/fpga-modules/mercury-kx2/">Enclustra KX2 daughterboard</a> has four LEDs on it, and the <a href="https://www.enclustra.com/en/products/base-boards/mercury-st1/">Mercury+ ST1 baseboard</a> has another four LEDs. Perhaps they could be used to debug the next steps?</p> <table align="center" style="float: right"><tr><td><img src="/img/kimos/side-by-side.svg" width="420" /></td></tr></table> <p>Normally, I build my designs with a <a href="/blog/2017/05/20-knight-rider.html">“Knight Rider” themed LED display</a>. This helps me know that my FPGA design has loaded properly. 
There are two parts to this display. First, there’s an “active” LED that moves from one end of the LED string to the other and then back again. This “active” LED is ON with full brightness–whatever that means for an individual design. Then, once the “active” LED moves on to the next LED in the string, a PWM (actually <a href="/dsp/2017/09/04/pwm-reinvention.html">PDM</a>) signal is used to “dim” the LED in a decaying fashion. Of course, <a href="/zipcpu/2019/02/09/cpu-blinky.html">the CPU can easily override this display</a> as necessary.</p> <p>My problem was that, even though the “DONE” LED would (now) light up when loading a design containing the MIG, these user LEDs were not doing anything.</p> <p>Curiously, if I overrode the LEDs at the top level of the design, I could make them turn either on or off. I just couldn’t get my internal design to control these LEDs properly. (I call this an “override” method because the <a href="https://github.com/ZipCPU/kimos/blob/804559a307bd58e34527ebb3f791ad414fe71803/rtl/toplevel.v#L474-L477">top level</a> of my design is generated by <a href="/zipcpu/2017/10/05/autofpga-intro.html">AutoFPGA</a>, and I wasn’t going so far as to adjust the <a href="https://github.com/ZipCPU/kimos/blob/804559a307bd58e34527ebb3f791ad414fe71803/autodata/spio.txt#L64-L67">original source</a>s describing how these LEDs should ultimately operate.) Still, using this top-level override method, I was able to discover that I could see LEDs 4-7 from my desk chair, that these were how I had wired up the LEDs on the baseboard (a year earlier), and that LEDs 6 and 7 had an opposite polarity from all of the other LEDs on the board.</p> <p>All useful, it just didn’t help.</p> <p>At one point, I noticed that the LEDs were configured to use the IO standard <code class="language-plaintext highlighter-rouge">SSTL15</code> instead of the normal <code class="language-plaintext highlighter-rouge">LVCMOS15</code> standard I normally use. 
Once I switched from <code class="language-plaintext highlighter-rouge">SSTL15</code> to <code class="language-plaintext highlighter-rouge">LVCMOS15</code>, my <a href="/blog/2017/05/20-knight-rider.html">knight-rider display</a> worked.</p> <p>Unfortunately, neither the serial port nor the Ethernet port worked. Both of these continued to work if the MIG controller wasn’t included in the design, just not if the MIG controller was included.</p> <h3 id="voodoo-engineering">Voodoo Engineering</h3> <p>I like to define Voodoo engineering as “Changing what isn’t broken, in an attempt to fix what is.” Not knowing what else to try, I spent a lot of time doing Voodoo engineering just trying to get the design working.</p> <ol> <li> <p>With the help of a hardware friend and his lab, we examined all of the power rails. Could it be that the design was losing power during the startup sequence, and so not starting properly even though the “DONE” LED was lighting up?</p> <p>No.</p> <p>After a lot of work with various probes, all we discovered was that the design used about 50% more power when the MIG was included. Did this mean there was a short circuit somewhere?</p> <p>Curiously, it was the FPGA that got warmer, not the DDR3 SDRAM.</p> <p>I left this debug session convinced I needed to look for a bug in my XDC file somewhere.</p> </li> <li> <p>I spent a lot of time comparing the schematic to the XDC file. I discovered some rather important things:</p> <ul> <li> <p>Some banks required internal voltage references. These were not declared in any of the reference designs.</p> </li> <li> <p>Two banks needed DCI cascade support, but the reference design only had one bank using it.</p> </li> <li> <p>The design required a voltage select pin that I wasn’t setting. 
This pin needed to be set to high impedance.</p> </li> <li> <p>I had the DDR3 CKE IO mapped to the wrong pin.</p> </li> </ul> </li> <li> <p>The <a href="https://www.enclustra.com/en/products/base-boards/mercury-st1/">Enclustra ST1 baseboard</a> can support multiple IO voltages. These need to be configured via a set of user jumpers, and the constraints regarding how these IO voltages are to be set are … complex. Eventually, I set banks A and B to 1.8V and bank C to 1.2V.</p> <p>Sadly, nothing but the LEDs were using banks B and C, so … none of these changes helped.</p> </li> </ol> <p>I suppose I should be careful here: I was probably fixing actual bugs during these investigations. However, none of the bugs I fixed actually helped move me forward. Fixing these bugs didn’t get the <a href="/formal/2019/02/21/txuart.html">UART</a>+SDRAM working, nor did it get the network interface working whenever the SDRAM was included. Both of these interfaces worked without the SDRAM as part of the design; they just didn’t work once the MIG SDRAM controller was connected to the design.</p> <p>Was there some short circuit connection between SDRAM pins and something on the <a href="/formal/2019/02/21/txuart.html">UART</a> or network IO banks? There shouldn’t have been; both of these peripherals were on IO banks separate from the DDR3 SDRAM.</p> <h3 id="reference-design">Reference design</h3> <p>At this point, I needed to use the reference design to make certain the hardware still worked. I’d had weeks of problems where the DONE pin wasn’t going high. Did this mean I’d short circuited or otherwise damaged the board? The design was using a lot more power when configured to use the SDRAM. Did this mean there was a short circuit damaging the board? Was there a manufacturing defect?</p> <p>Normally, this is where you’d use a reference design. Indeed, this was <a href="https://www.enclustra.com">Enclustra</a>’s recommendation to me. 
Normally this would be a good recommendation. They recommended I use their reference design, prove that the hardware worked, and then slowly migrate that design to my needs. My problem with this approach was that their reference design wasn’t written in RTL. It was written in TCL with a Verilog wrapper. Worse, their TCL Ethernet implementation depended upon an Ethernet controller from Xilinx that … required a license. Not only that, <a href="https://www.enclustra.com">Enclustra</a> did not provide any master XDC file(s). (They did provide schematics and a .PRJ file with many of the IOs declared within it.) Still, how do you “slowly migrate” TCL to RTL? That left me with just their MIG PRJ file to reference and … I still had a bug.</p> <p>There were a couple of differences between <a href="https://github.com/ZipCPU/kimos/blob/804559a307bd58e34527ebb3f791ad414fe71803/doc/mig.prj">my MIG PRJ configuration file</a> and their reference MIG configuration. My <a href="https://github.com/ZipCPU/kimos/blob/804559a307bd58e34527ebb3f791ad414fe71803/doc/mig.prj">MIG PRJ configuration file</a> used a 100MHz user clock, and hence a 400MHz DDR3 clock, whereas their reference file used an 800MHz DDR3 clock. (My design wouldn’t close timing at 200MHz, so I was backing away to 100MHz.) Could this be the difference?</p> <p>Upon request, one of my teammates built a LiteX design for this board. (It took him less than 2hrs. I’d been stuck for weeks! How’d he get it going so fast? Dare I mention I was jealous?) This LiteX design had no problems with the DDR3 SDRAM–although it doesn’t use Xilinx’s MIG. I even had him configure this LiteX demo for the 400MHz DDR3 clock, and … there were no problems.</p> <p>Given that the LiteX design “just worked”, I knew the hardware on my board still worked. 
I just didn’t know what I was doing wrong.</p> <h3 id="the-final-bug-the-reset-polarity">The final bug: the reset polarity</h3> <p>One difference between the MIG driven design and the non-MIG design (i.e. my design without a DDR3 SDRAM controller) is that the MIG controller wants to deliver both the system clock and the system reset to the rest of the design. Any failure to get either a system clock or a system reset from the MIG controller could break the whole design.</p> <p>So, I went back to the top level LEDs again. I re-examined the logic, and made sure LED[7] would blink if the MIG was held in reset, and LED[6] would blink if the clocks didn’t lock. This led me to two problems. The first problem was based upon where I had my board set up: I couldn’t see LED[7] from my desk with a casual glance. I had to make sure I leaned forward in my desk chair to see it. (Yes, this cost me a couple of debug cycles before I realized I couldn’t see all of the LEDs without leaning forward.) Once I could see it, however, I discovered the system reset wire was being held high.</p> <p>Well, that would be a problem.</p> <p>Normally, when I use the MIG controller, I use an active high reset. This time, in order to weed out all of the possible bugs, I’d been trying to make my MIG configuration as close as possible to the example/reference configuration I’d been given. That meant I set the design up to use an active-low reset–like the reference design. I had assumed that, if the MIG were given an active-low reset, it would produce an active-low user reset for the design.</p> <p>Apparently, I was wrong. Indeed, after searching out the Xilinx user guide, I can confirm I was definitely wrong. The synchronous user reset was active high.</p> <p>Once I switched to an active high reset things started working. My serial port now worked. I could now read from memory over the UART interface, and “ping” the network interface of the device. Even better, my debugging interface now worked. 
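</p> <p>In Verilog terms, the trap looks something like the sketch below. (The <code class="language-plaintext highlighter-rouge">ui_clk</code> and <code class="language-plaintext highlighter-rouge">ui_clk_sync_rst</code> names come from Xilinx’s MIG user interface; the instance name and surrounding signals are illustrative only, and the port list is heavily abbreviated.)</p> <pre><code class="language-verilog">// Even if the MIG's sys_rst *input* is configured to be active low,
// its ui_clk_sync_rst *output* is active HIGH.
wire	ui_clk, ui_clk_sync_rst;

mig_ddr3 u_mig (
	// ...
	.sys_rst(!i_reset),	// Active-low reset input, per the MIG config
	.ui_clk(ui_clk),	// The MIG sources the design's system clock,
	.ui_clk_sync_rst(ui_clk_sync_rst)	// ... and its reset
	// ...
);

// WRONG (my assumption): treating the user reset as active low
// assign	sys_reset = !ui_clk_sync_rst;

// RIGHT: the synchronous user reset is active high--use it as-is
assign	sys_reset = ui_clk_sync_rst;
</code></pre> <p>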
That meant I could use my <a href="https://github.com/ZipCPU/wbscope">Wishbone scope</a> again.</p> <p>I was now <a href="https://english.stackexchange.com/questions/25897/origin-of-the-phrase-now-were-cooking-with">“cooking with gas”</a>.</p> <h3 id="cleaning-up">Cleaning up</h3> <p>From here on out, things went quickly. Sure, there were more bugs, but these were easily found, identified, and thus fixed quickly.</p> <ul> <li> <p>While the design came up and I could (now) read from memory, I couldn’t write to memory without hanging up the design. <a href="https://github.com/ZipCPU/kimos/blob/96a24e5756a9e9a363d3d47c7962303afb2f65bd/rtl/migsdram.v#L354">After tracing it, this bug turned out to be a simple copy error</a>. It was part of some logic I was getting ready to test which would’ve run the MIG at 200MHz, and the design at 100MHz–just in case that was the issue.</p> <p>This bug was found by adding a <a href="https://github.com/ZipCPU/wbscope">Wishbone scope</a> to the design, and then seeing the MIG accept a request that never got acknowledged.</p> <p>Yeah, that’d lock a bus up real quick.</p> <p>I should point out that, because I use Wishbone and because Wishbone has the ability to <em>abort</em> an ongoing transaction, I was able to rescue my connection to the board, and therefore my connection to the bus, even after this fault. No, I couldn’t rescue my connection to the SDRAM without a full reset, but I could still talk to the board and hence I could still use my <a href="https://github.com/ZipCPU/wbscope">Wishbone scope</a> to debug the problem. Had this been an AXI bus, I would not have had this capability without using some form of <a href="/formal/2020/05/16/firewall.html">protocol firewall</a>.</p> </li> <li> <p>Other bugs were found in the <a href="https://github.com/ZipCPU/kimos/blob/804559a307bd58e34527ebb3f791ad414fe71803/sw/host/nexbus.cpp">network software</a>. 
This was fairly new software, never used before, so finding bugs here was not really all that surprising.</p> <p>At least with these bugs, I could use my <a href="https://github.com/ZipCPU/kimos/blob/804559a307bd58e34527ebb3f791ad414fe71803/sw/host/nexbus.cpp">network software</a> together with my Verilator-based simulation environment. Indeed, my <a href="https://github.com/ZipCPU/kimos/blob/master/sim/netsim.cpp">C++ network model</a> allows me to send UDP packets to the simulated design, and receive back whatever the design would return.</p> <p>Like I said, by this point I was <a href="https://english.stackexchange.com/questions/25897/origin-of-the-phrase-now-were-cooking-with">“cooking with gas”</a>. It took about two days (out of 45) to get this portion up and running.</p> </li> </ul> <p>The one bug that was a bit surprising was due to a network access test that sent the host software into an infinite loop. During this infinite loop, the software would keep writing to a debug dump, which I was hoping to later use to debug any issues. The surprise came from the fact that I wasn’t expecting this issue, so I had let the test run while I stepped away for some family time. (Supper and a movie with the kids may have been involved here …) When I discovered the bug, the debug dump file had grown to over 270GB! Still, fixing this bug was pretty routine, and there’s not a lot to share other than it was just another bug.</p> <h2 id="lessons-learned">Lessons learned</h2> <p>There are a lot of lessons to be learned here, some of which I brought upon myself.</p> <ol> <li> <p>All RTL</p> <p>I like all RTL designs. I prefer all RTL designs. I can debug an all RTL design. I can adjust an all RTL design. I can version control an all RTL design.</p> <p>I can’t do this with a TCL design that references opaque components that may get upgraded or updated any time I turn around. Worse, I can’t fix an opaque component–and Xilinx isn’t known for fixing the bugs in their designs. 
As an example, the following bug has lived in Xilinx’s Ethernet-Lite controller for years:</p> </li> </ol> <table align="center" style="float: center"><tr><td><img src="/img/xilinx-axi-ethernetlite/2022.1-rvalid.png" width="749" /></td></tr></table> <p>I reported this in 2019. This is only one of several bugs I found. The logic above is as of Vivado 2022.1. In this snapshot, you can see how they commented out the originally broken code. As a result, the current design now looks like they tried to fix it and … it’s still broken on its face. (i.e. RVALID shouldn’t be adjusted or dropped unless RREADY is known to be true …)</p> <p>Or what about RDATA?</p> <table align="center" style="float: center"><tr><td><img src="/img/xilinx-axi-ethernetlite/2022.1-check.png" width="749" /></td></tr></table> <p>This also violates the first principles of <a href="/blog/2021/08/28/axi-rules.html">AXI handshaking</a>. Notice that <code class="language-plaintext highlighter-rouge">RDATA</code> might not get set if <code class="language-plaintext highlighter-rouge">!RVALID &amp;&amp; !RREADY</code>–hence the first <code class="language-plaintext highlighter-rouge">RDATA</code> value read from this device might be in error.</p> <p>Yeah, … no. I’m not switching to Xilinx IP any time soon if I can avoid it. At least with my own IP I can fix any problems–once I find them.</p> <p>For all of these reasons, I would want an all HDL reference design from any vendor I purchase hardware from. At least in this case, you can now find an <a href="https://github.com/ZipCPU/kimos">all-Verilog reference design for the ST1+KX2 boards in my Kimos project</a>–to include a working (and now open source) <a href="https://github.com/AngeloJacobo/uberDDR3">DDR3 SDRAM controller</a>.</p> <ol start="2"> <li> <p>Simulation.</p> <p>Perhaps my biggest problem was that I didn’t have an all-Verilog simulation environment set up for this design from the top level on down. 
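</p> <p>To make the idea concrete, a top-level bench might look something like the following sketch: instantiate the <em>actual</em> toplevel, drive only the external pins, and check that the design ever shows signs of life. (The pin names here are hypothetical, not the actual Kimos port list.)</p> <pre><code class="language-verilog">`timescale 1ns/1ps
module	tb_toplevel;
	reg		clk, reset_n;
	wire	[7:0]	led;

	initial	clk = 1'b0;
	always	#5 clk = !clk;	// 100MHz board clock

	// Instantiate the real toplevel--not something just below it
	toplevel dut(.i_clk(clk), .i_reset_n(reset_n), .o_led(led));

	initial begin
		reset_n = 1'b0;
		repeat (16) @(posedge clk);
		reset_n = 1'b1;	// Release the external (board) reset

		// Wait long enough for any internal reset to clear, then
		// insist the design woke up.  A reset-polarity bug fails
		// here, in simulation, rather than on the bench.
		repeat (100000) @(posedge clk);
		if (led === 8'h00)
			$display("FAIL: design never left reset");
		else
			$display("PASS");
		$finish;
	end
endmodule
</code></pre> <p>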
Such an environment should’ve found this reset bug at the top level of the design immediately. Instead, what I have is a joint Verilog/C++ environment designed to debug the design from just below the top level using Verilator. This kept me from finding and identifying the reset bug–something that could have (and perhaps should have) been found in simulation.</p> <p>In the end, after finding the reset bug, I did break down and found a <a href="https://github.com/AngeloJacobo/UberDDR3/blob/main/testbench/ddr3.sv">Micron model of a DDR3 memory</a>. This was enough to debug some issues associated with getting the <a href="https://github.com/ZipCPU/wbscope">Wishbone scope</a> working inside the memory controller, although it’s not really a permanent solution.</p> </li> </ol> <table align="center" style="float: left; padding: 25px"><tr><td><img src="/img/kimos/open-sim.svg" width="320" /></td></tr></table> <p>Still, this is a big enough problem that I’ve been shopping around the idea of an open source all-Verilog simulation environment–something faster than Iverilog, with more capability. If you are interested in working on building such a capability, let me know.</p> <ol start="3"> <li> <p>Finger pointing</p> <p>As is always the case, I tend to point the finger everywhere else when I can’t find a bug. This seems to be a common trait among engineers. For the longest time I was convinced that my design was creating a short circuit on the board. As is typically the case, I often have to come back to reality once I do find the bugs.</p> <p>I guess the bottom line here is that I have more than enough humble pie to share. Feel free to join me.</p> </li> </ol> <p>Since writing this, the project has moved forward quite significantly. The design now appears to work with both the MIG and with the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3</a> controller–although I made some more beginner mistakes in the clock setup while getting that controller up and running. 
Still, it’s up and running now, so my next task will be running some performance metrics to see which controller runs faster/better/cheaper. (Hint: the <a href="https://github.com/AngeloJacobo/UberDDR3">UberDDR3</a> controller uses about 30% less logic, so there’s at least one difference right off the bat.)</p> <p>Stay tuned, and I’ll keep you posted regarding how the two controllers compare against each other.</p> <hr /><p><em>For I am not ashamed of the gospel of Christ: for it is the power of God unto salvation to every one that believeth; to the Jew first, and also to the Greek. (Romans 1:16)</em> Thu, 13 Jun 2024 00:00:00 -0400 https://zipcpu.com/blog/2024/06/13/kimos.html Chasing resets <p>A true story.</p> <table align="center" style="float: right; padding: 20px"><tr><td><img src="/img/chasing-resets/cost-estimate.svg" width="240" /></td></tr></table> <p>Some years ago, given a customer’s honest need and request, I proposed a change to a client’s <a href="/blog/2021/03/06/asic-lsns.html">ASIC</a> IP. Specifically, I wanted to add <a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check">CRC checking</a>, based upon a <a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check">CRC</a> kept in an <a href="https://en.wikipedia.org/wiki/Out-of-band">out-of-band memory region</a>, to verify the ability to properly read memory regions error free. I said the change shouldn’t take more than about two weeks, and I’d clean up some other problems I was aware of in the meantime. This change solved an urgent problem, so the client agreed to it.</p> <p>By the time I was done, my 80 hr proposal had turned into 270+ hrs of work.</p> <h2 id="build-it-well">Build it well</h2> <p>I’d like to start my discussion of what went wrong with a list of good practices to follow.</p> <table align="center" style="float: left; padding: 20px;"><caption>Fig 1. 
Basic test bench components</caption><tr><td><img src="/img/chasing-resets/verilogtb.svg" width="320" /></td></tr></table> <p>Just as background, a general test bench follows the format shown in Fig. 1, on the left. The “test bench” itself is composed of a series of scripts. These scripts then interact with a common test bench “library”, which then makes requests of an AXI bus via a “bus functional model”. The project itself required only minor changes to the device under test.</p> <p>With that vocabulary under our belt, here are some of the good practices I would expect to find in a well built design.</p> <ol> <li> <p>Avoid <a href="https://en.wikipedia.org/wiki/Magic_number_(programming)">magic numbers</a>.</p> <p>Yes, I harp on <a href="https://en.wikipedia.org/wiki/Magic_number_(programming)">magic numbers</a> a lot. There’s a reason for it. While it wasn’t hard at all to make the requested changes, I had to come back later and spend more than two weeks chasing down <a href="https://en.wikipedia.org/wiki/Magic_number_(programming)">magic numbers</a> buried in the test bench.</p> <p>Specifically, I wanted to add a hardware capability to calculate and store a <a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check">CRC</a> in an <a href="https://en.wikipedia.org/wiki/Out-of-band">out of band</a> area on a storage device, and then to check those values again when reading the data back. <a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check">CRC</a>s can be calculated and checked quickly and efficiently in hardware–especially if the data is already moving. 
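</p> <p>As an aside, the streaming CRC update itself takes only a few lines of Verilog. The sketch below is illustrative only: the polynomial, bit ordering, and initial value must all match whatever CRC convention the storage format actually specifies.</p> <pre><code class="language-verilog">// Update a CRC as data streams past, one byte per clock cycle
module	crcstream #(
	parameter [31:0]	POLY = 32'h04c11db7	// Illustrative CRC-32 polynomial
) (
	input	wire		i_clk, i_reset,
	input	wire		i_valid,	// True when a data byte is moving
	input	wire	[7:0]	i_data,
	output	reg	[31:0]	o_crc
);
	integer		k;
	reg	[31:0]	next_crc;

	// Eight polynomial steps per byte, MSB first
	always @(*)
	begin
		next_crc = o_crc;
		for(k=0; k&lt;8; k=k+1)
			next_crc = (next_crc[31] ^ i_data[7-k])
				? ({ next_crc[30:0], 1'b0 } ^ POLY)
				: { next_crc[30:0], 1'b0 };
	end

	always @(posedge i_clk)
	if (i_reset)
		o_crc &lt;= 32'hffff_ffff;
	else if (i_valid)
		o_crc &lt;= next_crc;
endmodule
</code></pre> <p>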
Unfortunately, the test bench had hard-coded locations where everything was supposed to land in the hardware, and as a result all of these locations needed updating in order to add room for the <a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check">CRC</a>.</p> <p>I spent quite a bit of time chasing down all of these <a href="https://en.wikipedia.org/wiki/Magic_number_(programming)">magic numbers</a>.</p> <p>This applies to register address names as well–but we’ll come back to these in a moment.</p> </li> <li> <p>The <a href="https://en.wikipedia.org/wiki/Rule_of_three_(computer_programming)">“Rule of three”: If you have to write the same thing three times, refactor it</a>.</p> <p>If the <a href="https://en.wikipedia.org/wiki/Magic_number_(programming)">magic numbers</a> were confined to one or two places, that would be one thing. Unfortunately, they were found throughout the test library, copied from place to place to place. Every one of those copies then needed individual attention, just to answer the question of whether the “copied” number was truly a copy that could safely be modified or removed.</p> </li> <li> <p>Name your register addresses. It makes moving them easier.</p> <p>In this case, four versions of this IP earlier, someone had removed a control register from the IP. The address was then reallocated for another purpose. No one noticed the test scripts were still accessing the old register until I came along and tried to assign names to all of the registers within the IP. I then asked, where is the XYZ register? It’s not at this address …</p> <p>I hate coming across situations like this. “Fixing” such situations always risks making a change (which needs to be made) that then might break something else later. 
(Yes, that happens too …)</p> </li> <li> <p>There’s a benefit to naming even one-bit <a href="https://en.wikipedia.org/wiki/Magic_number_(programming)">magic numbers</a>.</p> <p>Not to get sidetracked, but in another design there was a one-bit number to indicate data direction. Throughout the logic, you’d find expressions like: <code class="language-plaintext highlighter-rouge">if (direction)</code>, or <code class="language-plaintext highlighter-rouge">if (!direction)</code>. While you might think this was okay, the designer wrote the design for the wrong sense.</p> <p>I then came along and wanted to “fix” things.</p> <p>Not knowing how deep the corruption lay, or whether or not I was getting the direction mapping right in the first place, I changed all of these expressions to <code class="language-plaintext highlighter-rouge">if (direction == DIR_SOURCE)</code> or <code class="language-plaintext highlighter-rouge">if (direction == DIR_SINK)</code>. This way, if necessary, I could come back later and change <code class="language-plaintext highlighter-rouge">DIR_SOURCE</code> and <code class="language-plaintext highlighter-rouge">DIR_SINK</code> at one location (okay, one per file …) and then trust that everything would change consistently throughout the design.</p> <p>I got things “mostly” right on my first pass. The place where I struggled was in the test bench, where things were named backwards. Why? Because if the design was the <em>source</em>, the test bench needed to be the <em>sink</em>.</p> </li> <li> <p>That reset delay.</p> <p>This is really what I want to discuss today. How long should a design be held in reset before being released?</p> <p>My personal answer? No longer than it needs to be. Xilinx asks for a 16 clock period AXI reset. Most designs don’t need this. Indeed, most digital designs can reset themselves in a single clock period, although some require two.</p> <p>Some designs do legitimately need a long reset. 
I’ve come across this often where an analog tracking circuit needs to start and lock before the digital logic should start working with the results of that circuit. This makes sense; I can understand it, and I’ve built this sort of thing before when the hardware requires it. SDRAMs often require long resets as well, on the order of 200us.</p> <p>In the case of today’s example and lesson-learned story, the test bench for the digital portion of the design was using a 1,000 clock reset. That is, the test bench held the design in reset for 1,000 clock cycles. Why? That’s a good question. Nothing in the IP required such a long reset. So, I changed it to 3 cycles. Three cycles was still overkill–one cycle should’ve been sufficient, but simulation time can be expensive. Why waste simulation time if you don’t need to?</p> </li> </ol> <table align="center" style="float: right; padding: 20px"><tr><td><img src="/img/chasing-resets/initial-turn-in.svg" width="240" /></td></tr></table> <p>After changing to a 3 cycle reset, the design worked fine and passed its test cases. I turned my work in, and counted the project done. All my work had been completed in (roughly) the 80 hours I had projected. Nice.</p> <p>(Okay, my notes say my initial turn in took closer to 120hrs, but I’m going to tell the story and pretend my cost estimate was 80hrs. I can eat a 40hr overrun on an 80hr contract if I have to–especially if it’s an overrun in what I had proposed to do.)</p> <ol start="6"> <li> <p>Constants should be constant. Parameters are there for that purpose.</p> <p>If a design has a startup constant, something it depends upon, then that constant should be set on <em>startup</em>–before the first clock tick is over, and not later.</p> </li> </ol> <table align="center" style="float: left"><tr><td><img src="/img/chasing-resets/parameters.svg" width="420" /></td></tr></table> <p>Some engineers like to specify fixed design parameters via input ports rather than parameters. 
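</p> <p>For illustration, the two styles look like this in Verilog. (The module and its names are made up; the point is only that, if the port style is used, the value should be tied to a constant so it is stable from time zero.)</p> <pre><code class="language-verilog">module	widget #(
	// Preferred: a parameter, constant from elaboration onward
	parameter [0:0]	OPT_MODE = 1'b0
) (
	input	wire	i_clk,
	// ASIC-style alternative: a configuration pin, expected to be
	// hardwired to either power or ground
	input	wire	i_cfg_mode
	// ...
);
	// ...
endmodule

// When instantiating (from some enclosing module), the configuration
// pin gets tied to a constant--not driven from a register that only
// settles a clock or two after startup
widget #(.OPT_MODE(1'b1))
	u_widget(.i_clk(i_clk), .i_cfg_mode(1'b1));
</code></pre> <p>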
While there are good reasons for doing this–especially in <a href="/blog/2021/03/06/asic-lsns.html">ASIC</a> designs–those fixed constants should be set before the first clock cycle. If they are supposed to be equivalent to wires that are hardwired to either power or ground, then they should act like it.</p> <p>Personally, I think this purpose is better served by parameters rather than hardwired constants, but I can understand a need to build an <a href="/blog/2021/03/06/asic-lsns.html">ASIC</a> that can then be reconfigured in the field via hard switches. For example, consider how switches can be used to adjust the FPGA wires controlling the boot source. In other words, there is a time for configuring a design via input wires. Just … make those values constants from startup for simulation purposes.</p> <ol start="7"> <li> <p>Calculated values should be <em>calculated</em>, not set in fixed macros.</p> <p>This particular design depended upon a set of macros, and one test configuration required one set of macros whereas another test configuration might require another set of macros.</p> <p>These macros contained all kinds of computed constants. For instance, if the design had 512 byte ECC blocks, then the block boundaries were things like bytes 0-511, 512-1023, 1024-1535, etc–all captured in macros used by the test bench, and all dependent on the device’s page size. Further constants captured things like where the ECC would be located in a page, or how many ECC bytes were used for the given ECC size–which was also a macro.</p> <p>These constants got even worse when it came time to test the ECC. In this case, there were macros specifying where to place the errors. 
So, for example, the test bench for a 4-bit ECC might generate one error in bytes 0-63, one in bytes 64-127, and macros existed defining these ranges all the way up to the (macro-defined) size of the page, which could be 2kB, 4kB, 8kB, etc.</p> <p>Sadly, the test script would only run a set of 30 test cases for <em>one</em> set of macros. The design then needed to be reconfigured if you wanted to run it with another set of macros. Specifically, every time you needed to change which ECC option you were testing, or which device model you wished to test against, you needed to switch macro sets. In all, there were over 50 sets of macros, and each macro set contained between 40 and 150 macros the design required in order to operate. Worse, many of those macros were externally calculated. Running all tests required starting and restarting the test driver, one macro set at a time.</p> <p>Here was the problem: What happens when a macro set configures the IP to run in one fashion, and you need to reconfigure your operations mid-sim-runtime to another macro set? More specifically, what happens when you need to boot with one ECC option (defined as a macro), and then switch to another? In this case, the macro set determined how memory was laid out, and the customer wanted to change the memory layout in the middle of a test run. (He then couldn’t figure out why this was a problem for us …)</p> <p>Lesson learned? When some configuration points are dependent upon others, use functions and calculate them within the IP. That way, if you switch things around later–or even at runtime–those test-library functions can still capture all the necessary dependencies.</p> <p>Second lesson learned? IP should be configured via <em>parameters</em>, not macros, and those parameters should all be able to be scripted by the test driver.
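</p>

<p>As a sketch of the alternative–with parameter and module names of my own invention, not the customer’s–all of those derived constants can be calculated from one or two parameters:</p>

<figure class="highlight"><pre><code class="language-verilog" data-lang="verilog">module ecc_layout #(
		// The only free choices, both scriptable by a test driver
		parameter	PAGE_SIZE = 4096,	// Bytes per page
		parameter	ECC_BLKSZ = 512		// Bytes per ECC block
	) ( /* ... */ );

	// Everything below is *derived*, never hand entered, so the
	// layout can never fall out of sync with the page size
	localparam	NBLOCKS = PAGE_SIZE / ECC_BLKSZ;
	localparam	LGBLKSZ = $clog2(ECC_BLKSZ);

	// ECC block k then spans bytes (k*ECC_BLKSZ) through
	// ((k+1)*ECC_BLKSZ)-1, for any page size
endmodule</code></pre></figure>

<p>The same trick applies to a test library: Verilog functions of those same parameters can replace whole files of externally calculated macros.</p>

<p>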
Perhaps you may recall how I discussed handling this in an article on an upgrade to the <a href="/zipcpu/2022/07/04/zipsim.html">ZipCPU’s test infrastructure</a> some time back.</p> </li> <li> <p>If requirements are in flux, the IP can’t be delivered.</p> <p>This should be a simple given, a basic no-brainer–it’s engineering 101. If you don’t know what you want built, you shouldn’t hire someone to build it until you have solid requirements. If you want to change things mid-task, any rework that will be required is going to be charged against your bottom line.</p> <p>In this case, the end customer of this IP discovered how I was intending to meet their requirement by adding a <a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check">CRC</a>. They then wanted things done in a different manner. Specifically, they wanted the <a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check">CRC</a>s stored somewhere else. Of course, this didn’t take place until after I’d already proposed a fixed-price contract based upon 80 hours of work, and accomplished most of that work. Sure, I can support some changes–if the changes are minor. For example, I initially built a 32b <a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check">CRC</a> capability and they then wanted a 16b <a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check">CRC</a> capability. I figured that’d be a cheap change–since the design was (now) well parameterized, only two parameters needed to change. In this case, however, their simple desire to switch <a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check">CRC</a> sizes from 32b to 16b now doubled the time spent in verification–since we now needed to run the verification test suite twice–once for a 32b <a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check">CRC</a> and once again for the 16b <a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check">CRC</a> they wanted.
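</p>

<p>To see why the RTL side of that width change was cheap, consider a hedged sketch–a generic bit-serial CRC, not the actual IP–in which the width and polynomial reduce to two parameters:</p>

<figure class="highlight"><pre><code class="language-verilog" data-lang="verilog">module crc_ser #(
		parameter		CW = 32,	// CRC width: 32b or 16b
		parameter [CW-1:0]	POLY = 32'h04c11db7
	) (
		input	wire		i_clk, i_reset, i_ce,
		input	wire		i_bit,	// Serial data input
		output	reg [CW-1:0]	o_crc
	);

	always @(posedge i_clk)
	if (i_reset)
		o_crc &lt;= {(CW){1'b1}};
	else if (i_ce)
		// One MSB-first LFSR step per data bit
		o_crc &lt;= { o_crc[CW-2:0], 1'b0 }
			^ ((o_crc[CW-1] ^ i_bit) ? POLY : {(CW){1'b0}});
endmodule</code></pre></figure>

<p>Dropping to 16 bits is then just <code class="language-plaintext highlighter-rouge">CW=16</code> with <code class="language-plaintext highlighter-rouge">POLY=16'h1021</code> (the CCITT polynomial)–yet every such configuration still needs its own pass through the verification suite.</p>

<p>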
Their other change request, moving the <a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check">CRC</a> storage elsewhere, was major enough that it couldn’t be done without starting the entire update over from scratch.</p> </li> </ol> <table align="center" style="float: right; padding: 20px"><tr><td><img src="/img/chasing-resets/crcsz.svg" width="360" /></td></tr></table> <p>Change is normal. Customers don’t always know what they want. I get that. The problem here was that as long as requirements were in flux I wasn’t going to deliver any capability. Let’s agree on what we’re going to deliver first, then I’ll deliver that.</p> <p>Then the customer started asking why it was taking so long to deliver the promised changes, when we could deliver the IP, that they had a hard RTL freeze deadline, and … Yes, this became quite contradictory: 1) They wanted me to make a change that would force me to start my work all over from scratch, but at the same time 2) wanted all of my changes delivered immediately to meet their hard deadline.</p> <p>You can’t make this stuff up.</p> <ol start="9"> <li> <p>If a design can fail, then a simulation test case should exist that can trigger that failure.</p> <p>This is especially true of <a href="/blog/2021/03/06/asic-lsns.html">ASIC</a> designs, and a lesson I keep needing to learn the hard way. In my case, I knew that I could properly calculate and detect <a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check">CRC</a> errors.
I had formally proven that.</p> <p>However, because I didn’t (initially) generate a simulation test to verify what would happen on a <a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check">CRC</a> failure, no one noticed how complicated the register handling for these <a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check">CRC</a> failures had become.</p> </li> <li> <p>Test bench drivers should mirror software</p> <p>At some point in time, someone’s going to need to build control software. They’ll start with the test bench driver. The closer that test bench driver looks to real software, the easier their task will be.</p> </li> </ol> <h2 id="so-what-happened">So what happened?</h2> <p>Okay, ready for the story?</p> <p>Here’s what happened: I made my changes inside my promised two weeks. I merged and delivered the changes the customer had requested. Everything worked.</p> <p>Life was good.</p> <table align="center" style="float: right; padding: 20px"><caption>Fig 2. Everything fell apart when merging</caption><tr><td><img src="/img/chasing-resets/merge-failures.svg" width="320" /></td></tr></table> <p>Then my client said, oops, we’re sorry, you made the changes to the wrong version of the IP. The end customer had asked us to make a simple change to allow the software to read a sector from non-volatile memory to boot from on startup. Here’s the correct version to change.</p> <p>The changes appeared minor, so I merged my changes and re-submitted. This time, many of the tests failed.</p> <p>What went wrong?</p> <table align="center" style="float: left; padding: 25px"><caption>Fig 3. I now use watchdog timers in my test benches</caption><tr><td><img src="/img/chasing-resets/watchdogs.svg" width="320" /></td></tr></table> <p>The first problem was the reset. Remember how I removed that 1,000 clock reset because it wasn’t needed?
One of the test cases waited 100 clock cycles, and then called a startup task to set the “constant” input values that were only sampled during reset. One of those values determined whether or not the new bootloader capability would run on startup. The test bench would then wait on the signal that the bootloader had completed its task. However, with a 3 cycle reset, the boot-on-startup constant was never set before the end of the reset period, so the bootloader never started and the test bench then hung waiting for the bootloader to complete. (Waiting on a non-existent bootloader wasn’t a part of the design I started with.)</p> <p>It didn’t help that the test script (in file #1) called a task (in file #2), that set a value (in file #3), that was checked elsewhere (in file #4), that was … In other words, there was so much indirection on this reset between where it was set and its ultimate consequence that it took quite a bit of time to sort through. No, it didn’t help that I hadn’t written this IP, nor its test bench, nor its test scripts, nor its test libraries in the first place.</p> <p>Unfortunately, that was only the first problem.</p> <p>The second problem was due to an implied requirement that, if your test bench reads from memory on bootup, there must be an initial set of valid data in memory for it to boot from–especially if you are checking for valid <a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check">CRC</a>s and failing a test if any <a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check">CRC</a> failed. This requirement didn’t exist in either branch, but became an implied requirement once the boot-up and <a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check">CRC</a> branches were merged together. We hadn’t foreseen that one either.</p> <p>A third problem came from how fault detection was handled.
In the case of a fault, an <a href="https://en.wikipedia.org/wiki/Interrupt">interrupt</a> would be generated. The test bench would wait for that <a href="https://en.wikipedia.org/wiki/Interrupt">interrupt</a>, read the <a href="https://en.wikipedia.org/wiki/Interrupt">interrupt</a> register from the IP, and then handle each active <a href="https://en.wikipedia.org/wiki/Interrupt">interrupt</a> as appropriate.</p> <p>In order to properly handle a <a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check">CRC</a> failure, I needed to adjust how <a href="https://en.wikipedia.org/wiki/Interrupt">interrupt</a>s were handled in the test library. That’s fair. Let’s look at that logic for a moment.</p> <p><a href="https://en.wikipedia.org/wiki/Interrupt">Interrupt</a>s were handled in the test library within a Verilog task. The relevant portion of this task read something like:</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">do</span> <span class="k">begin</span> <span class="k">wait</span><span class="p">(</span><span class="n">interrupt</span><span class="p">);</span> <span class="n">read</span><span class="p">(</span><span class="n">interrupt_register</span><span class="p">);</span> <span class="k">if</span> <span class="p">(</span><span class="n">interrupt_register</span> <span class="o">==</span> <span class="mh">8'h01</span><span class="p">)</span> <span class="k">begin</span> <span class="c1">// Handle interrupt #1</span> <span class="k">end</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">interrupt_register</span> <span class="o">==</span> <span class="mh">8'h02</span><span class="p">)</span> <span class="k">begin</span> <span class="c1">// Handle interrupt #2</span> <span class="k">end</span> <span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">interrupt_register</span> <span class="o">==</span> <span 
class="mh">8'h03</span><span class="p">)</span> <span class="k">begin</span> <span class="c1">// Handle interrupts #1 and #2</span> <span class="k">end</span> <span class="c1">// ...</span> <span class="k">end</span> <span class="k">while</span><span class="p">(</span><span class="n">task_not_done</span><span class="p">);</span></code></pre></figure> <p>This was a hidden violation of the <a href="https://en.wikipedia.org/wiki/Rule_of_three_(computer_programming)">rule of three</a>, since you’d find the same <a href="https://en.wikipedia.org/wiki/Interrupt">interrupt</a> handler for <a href="https://en.wikipedia.org/wiki/Interrupt">interrupt</a> #1 following a check for the <a href="https://en.wikipedia.org/wiki/Interrupt">interrupt</a> register equalling <code class="language-plaintext highlighter-rouge">8'h01</code>, <code class="language-plaintext highlighter-rouge">8'h03</code>, <code class="language-plaintext highlighter-rouge">8'h05</code>, <code class="language-plaintext highlighter-rouge">8'h07</code>, etc.</p> <p>Worse, the <a href="https://en.wikipedia.org/wiki/Interrupt">interrupt</a> handlers didn’t just handle <a href="https://en.wikipedia.org/wiki/Interrupt">interrupt</a>s. They would also issue commands, reset the interrupt register, use delays, etc., so that handling <a href="https://en.wikipedia.org/wiki/Interrupt">interrupt</a> #1 wasn’t the same between a reading of <code class="language-plaintext highlighter-rouge">8'h01</code> and <code class="language-plaintext highlighter-rouge">8'h05</code>.</p> <p>My solution was to spend about two days refactoring this, so that every <a href="https://en.wikipedia.org/wiki/Interrupt">interrupt</a> would be given its own independent handler properly. 
The result looked something like the logic below.</p> <figure class="highlight"><pre><code class="language-verilog" data-lang="verilog"> <span class="k">do</span> <span class="k">begin</span> <span class="k">wait</span><span class="p">(</span><span class="n">interrupt</span><span class="p">);</span> <span class="n">read</span><span class="p">(</span><span class="n">interrupt_register</span><span class="p">);</span> <span class="k">if</span> <span class="p">(</span><span class="n">interrupt_register</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="k">begin</span> <span class="c1">// Handle interrupt #1</span> <span class="k">end</span> <span class="k">if</span> <span class="p">(</span><span class="n">interrupt_register</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="k">begin</span> <span class="c1">// Handle interrupt #2</span> <span class="k">end</span> <span class="k">if</span> <span class="p">(</span><span class="n">interrupt_register</span><span class="p">[</span><span class="mi">2</span><span class="p">])</span> <span class="k">begin</span> <span class="c1">// Handle interrupt #3</span> <span class="k">end</span> <span class="c1">// ...</span> <span class="n">clear_interrupts</span><span class="p">;</span> <span class="c1">// and adjust the mask if necessary</span> <span class="k">end</span> <span class="k">while</span><span class="p">(</span><span class="n">task_not_done</span><span class="p">);</span></code></pre></figure> <p>Among other things, I removed all of the register accesses from the <a href="https://en.wikipedia.org/wiki/Interrupt">interrupt</a> “handling” routines, capturing their needs instead in some registers so the accesses could all happen at the end. As a result, <em>nothing</em> took simulation time during these handlers and things truly could be merged properly.</p> <p>I was proud of this update. 
The portion of the test library handling <a href="https://en.wikipedia.org/wiki/Interrupt">interrupt</a>s now “made sense”.</p> <p>So, I sent the design off to the test team again, only to have it come back to me a couple days later. It had failed another test case. Where? In a second <em>copy</em> of the same broken <a href="https://en.wikipedia.org/wiki/Interrupt">interrupt</a> handler that I had just refactored.</p> <p>While I might argue that the <a href="https://en.wikipedia.org/wiki/Rule_of_three_(computer_programming)">rule of three</a> should’ve applied to this second copy, you could also argue that it didn’t simply because it was a <em>second</em> copy of the same <a href="https://en.wikipedia.org/wiki/Interrupt">interrupt</a> handler and not a <em>third</em>.</p> <p>I could go on.</p> <p>As I mentioned in the beginning, a basic 80 hour task became a 270+ hour task. Further, the task went from being <em>on time</em> to late very suddenly. Yes, this was how I spent my Thanksgiving weekend that year.</p> <h2 id="conclusion">Conclusion</h2> <p>A good design plus test bench <em>should</em> be easy to adjust and modify.</p> <p>Building a poor design, a poor test bench, or (worse) both constitutes taking out a loan from your future self. This is often called “<a href="https://en.wikipedia.org/wiki/Technical_debt">technical debt</a>.” If this is a prototype you are willing to throw away later, then perhaps this is okay. If not, then you will end up paying that loan back later, with interest, at a time when you least expect it–and it will cost you more than you want to pay.</p> <p>What about formal methods? Certainly formal methods might have helped, no?</p> <table align="center" style="float: left; padding: 25px"><tr><td><img src="/img/vlog-wait/rule-of-gold.svg" width="320" /></td></tr></table> <p>I suppose so. Indeed, all of my updates were formally verified.
Better yet, everything that was formally verified worked right the first time. What about the stuff that failed? None of it had ever seen a formal tool. Test bench scripts, libraries, and device models, for example, tend not to be formally verified. Further, why would you formally verify a “working” design that you were handed? Unless, of course, it was never truly “working” in the first place.</p> <p>Remember, well-verified, well-tested RTL designs are gold in this business. Build them well, and you can sell or re-use them for years to come.</p> <hr /><p><em>For yet a little while, and he that shall come will come, and will not tarry. (Heb 10:37)</em> Mon, 01 Apr 2024 00:00:00 -0400 https://zipcpu.com/blog/2024/04/01/chasing-resets.html 2023, Year in review <p>It should come as no surprise that a blog with <a href="https://zipcpu.com/blog/2017/08/01/advertising.html">no advertisements</a> has never paid my bills–at least not directly. I blog for fun, and to some extent for <a href="https://en.wikipedia.org/wiki/Rubber_duck_debugging">rubber duck debugging</a>. As I learn new concepts, I enjoy sharing them here. Going through the rigor of writing about a topic also helps make sure I understand it.</p> <p>Why are there <a href="https://zipcpu.com/blog/2017/08/01/advertising.html">no advertisements</a>? For two reasons. First, because I’m not doing this to make money. Second, because I want more control over any advertising from this site than most advertisers want to provide. Perhaps some day the site will be supported by advertising. Until then, the web site works fine without advertisements.</p> <p>So how then does the blog fit into my business model?
Simply because the blog helps me find customers via those who read articles here and write to me.</p> <h2 id="business-projects">Business Projects</h2> <p>So, if the blog doesn’t pay my bills, then what does?</p> <table align="center" style="float: right"><caption>2023 Projects</caption><tr><td><img src="/img/2023-review/2023-funding.svg" width="320" /></td></tr></table> <p>Well, six projects have paid the bills this year. Three of these have been ASIC projects, to include the <a href="https://www.arasan.com/product/xspi-psram-master/">PSRAM/NOR flash controller</a>, and an <a href="https://www.arasan.com/product/onfi-4-2-controller-phy/">ONFI NAND flash controller</a>. Three other projects this year have been FPGA projects, to include an <a href="/blog/2023/11/25/eth10g.html">open source 10Gb Ethernet switch</a> and a SONAR front end based upon my <a href="https://github.com/ZipCPU/videozip">VideoZip</a> design–after including several very significant upgrades, such as handling ARP and ICMP requests in hardware. That’s four of the six projects from this year. Once the other two projects become a bit more marketable, I may mention them here as well.</p> <p>Since I’ve already discussed the <a href="/blog/2023/11/25/eth10g.html">10Gb Ethernet design</a>, let me take a moment and discuss the <strong><a href="https://github.com/ZipCPU/wbi2c">I2C controller</a></strong> within it. The <a href="https://github.com/ZipCPU/wbi2c">I2C controller</a> was originally designed to support the SONAR project. Perhaps you may remember the <a href="/blog/2021/11/15/ultimate-i2c.html">initial article, outlining the design goals for this controller</a>. Thankfully, it’s met all of these goals and more–but we’ll get to that in a moment. As part of the SONAR project, its purpose was to sample various non-acoustic telemetry data: temperature, power supply voltage, current usage, humidity within the enclosure, and more. All of these needed to be sampled at regular intervals. 
At first glance, <a href="https://www.reddit.com/r/FPGA/comments/13ti5zx/when_do_you_solve_a_problem_in_software_instead/">this sounds like a software task</a>–that is until you start adding real-time requirements to it such as the need to shut down the SONAR transmitter if it starts overheating, or using so much power that the FPGA itself will brown out shortly. So, the <a href="https://github.com/ZipCPU/wbi2c">I2C controller</a> was designed to generate (AXI stream) data packets automatically, without CPU intervention, which could then be forwarded … somewhere.</p> <table align="center" style="float: left; padding: 25px"><caption>An example I2C-driven OLED output</caption><tr><td><img src="/img/2023-review/ssdlogo-demo.jpg" width="260" /></td></tr></table> <p><a href="https://github.com/ZipCPU/wbi2c">This design</a> was then incorporated into the <a href="/blog/2023/11/25/eth10g.html">10Gb Ethernet design</a>. There it provided the team the ability to 1) read the DDR3 memory stick configuration–useful for making sure the <a href="https://github.com/AngeloJacobo/DDR3_Controller">DDR3 controller</a> was properly configured, 2) read the SFP+ configuration–and discover that we were using 1GbE SFP+ connectors initially instead of 10GbE connectors (Oops!), 3) read the <a href="https://en.wikipedia.org/wiki/Extended_Display_Identification_Data">Extended Display Identification Data (EDID)</a> from the downstream <a href="https://en.wikipedia.org/wiki/HDMI">HDMI</a> monitor, 4) configure and verify the <a href="https://www.skyworksinc.com/-/media/Skyworks/SL/documents/public/data-sheets/Si5324.pdf">Si5324</a>’s register settings, 5) draw a logo onto a <a href="https://www.amazon.com/Teyleten-Robot-Display-SSD1306-Raspberry/dp/B08ZY4YBHL/">small OLED display</a>, all in addition to 6) actively monitoring hardware temperature.</p> <p>Supporting these additional tasks required two fundamental changes to the <a href="/blog/2021/11/15/ultimate-i2c.html">initial vision for this 
I2C controller</a>. First, I needed an <a href="https://github.com/ZipCPU/eth10g/blob/master/rtl/wbi2c/wbi2cdma.v">I2C DMA</a>, to quietly transfer results read from the device to memory. Only once I had <a href="https://github.com/ZipCPU/eth10g/blob/master/rtl/wbi2c/wbi2cdma.v">this DMA</a> could the CPU then inspect and/or report on the results. (It was probably one of the easiest DMAs I’ve written, since <a href="/blog/2021/11/15/ultimate-i2c.html">I2C is a rather slow protocol</a>.) Second, each packet needed a designated <em>destination</em> channel, so the design could know where to forward the results. This was useful for knowing whether the I2C information should be forwarded to <a href="https://github.com/ZipCPU/eth10g/blob/master/rtl/wbi2c/wbi2cdma.v">the DMA</a>, to be stored in memory, or to the <a href="https://en.wikipedia.org/wiki/HDMI">HDMI</a> slave controller, to forward the downstream monitor’s <a href="https://en.wikipedia.org/wiki/Extended_Display_Identification_Data">EDID</a> upstream. The fact that <a href="https://github.com/ZipCPU/wbi2c">this controller</a>, designed for a completely separate project, in a completely different domain (i.e. SONAR), ended up working so well in a <a href="/blog/2023/11/25/eth10g.html">10Gb Ethernet design</a> project is a testament to a well-designed interface.</p> <p>The year has also included some internally funded projects. These include a new <a href="https://github.com/ZipCPU/sdspi">SDIO/eMMC controller</a>, a (to-be-posted) upgrade to <a href="/blog/2017/06/05/wb-bridge-overview.html">my standard debugging bus</a>, and a <a href="/zipcpu/2023/05/29/zipcpu-3p0.html">ZipCPU upgrade</a>. Allow me to take a moment to discuss these three (internally funded) projects in a bit more detail.</p> <p>The <strong><a href="https://github.com/ZipCPU/sdspi">SDIO/eMMC controller</a></strong> is new.
By using all four data lanes and a higher clock rate, this upgrade offers a minimum 8x transfer rate performance improvement over my prior SPI-only version. That’s kind of exciting. Even better, the IP has been tested on both an SD card and an <a href="http://www.skyhighmemory.com/download/eMMC_4GB_SML_PKG_S40FC004_002_01112.pdf">eMMC chip</a> as part of the <a href="/blog/2023/11/25/eth10g.html">KlusterLab (i.e. 10Gb Ethernet board)</a> design. The IP, <a href="https://github.com/ZipCPU/sdspi/tree/master/sw">plus software</a>, is so awesome I’m likely to add it to any future designs I have with SD cards or eMMC chips.</p> <table align="center" style="float: none"><caption>The difference between SPI and SDIO: Speed</caption><tr><td><img src="/img/2023-review/sdiovspi.svg" width="640" /></td></tr></table> <p>That’s just the beginning, too. Just because <a href="https://github.com/ZipCPU/sdspi">this new SDIO controller</a> works on hardware doesn’t mean it works in all modes. Since its original posting, I’ve added verification to support all the modes our hardware doesn’t (yet) support. I’ve also started adding eMMC BOOT mode support, and I expect I’ll be (eventually) adding DMA support to this IP as well. My goal is also to make sure I can support multiple sector read or write commands–something the SPI-only version couldn’t support, and something that’s supposed to be supported in this new version but isn’t tested (yet). (Remember, <a href="/zipcpu/2022/07/04/zipsim.html">if it’s not tested it doesn’t work</a>.) In other words, despite declaring this IP as “working”, it remains under very active development.</p> <table align="center" style="float: right"><caption>I will use Slave/Master Terms where appropriate</caption><tr><td><img src="/img/2023-review/slave.svg" width="480" /></td></tr></table> <!-- (COMSONICS, SoundWire) --> <p>Then there’s the upgrade to the <strong><a href="https://github.com/ZipCPU/dbgbus">debugging bus</a></strong>.
This has been in the works now for quite a while. My current/best debugging bus implementation uses six printable characters to transmit a control code (read request, write data, or new address) plus 32 bits of data. At six data bits per 8-bit character transmitted, this meant six characters would need to be sent (minimum) in order to send either a 32-bit address or 32-bit data word, leading to a 36b internal word. It also required <code class="language-plaintext highlighter-rouge">10*6</code> baud periods (10 baud periods times six characters) for every uncompressed 32b of data transferred, for a best-case efficiency of 53%.</p> <table align="center" style="float: none"><caption>The debugging bus multiplexes console and bus channels</caption><tr><td><img src="/img/2023-review/dbgbus.svg" width="640" /></td></tr></table> <p>Since then, I’ve slowly been working on an upgrade to this protocol that will use five (not necessarily printable) characters to transmit 32 bits of data plus a control code. This upgrade should achieve an overall 64% worst-case (i.e. uncompressed) efficiency, for a speed improvement of about 16% over the prior controller in worst-case conditions (50 baud periods per 32-bit word, down from 60). The upgrade comes with some synchronization challenges, but currently passes all of its simulation checks–so at this point it’s ready for hardware testing. My only problem is … this upgrade isn’t paid for. Inserting it into one of my business projects is likely to increase the cost of that project–both in terms of integration time as well as verification while chasing down any new bugs introduced by this new implementation–at least until the upgraded bus is verified. This has kept this debugging bus upgrade at a lower priority than the other paying projects. Well, that and the fact that I only expect a 16% improvement over the prior implementation.
As a result, the upgrade isn’t likely to pay for itself for a long time.</p> <table align="center" style="float: none"><caption>Moving from 6 characters to 5 characters to send 32bits</caption><tr><td><img src="/img/2023-review/exbus.svg" width="640" /></td></tr></table> <p>Finally, let’s discuss the <a href="/zipcpu/2023/05/29/zipcpu-3p0.html">ZipCPU’s big upgrades</a>. As with the other upgrades, these were also internally funded. However, the <a href="/about/zipcpu.html">ZipCPU</a> has now formed a backdrop to a majority of my projects. Indeed, it’s <a href="/zipcpu/2021/07/23/cpusim.html">helped me verify ASIC IP in both simulation</a> and FPGA contexts. One upgrade in particular will keep on giving, and that is the <a href="/about/zipcpu.html">ZipCPU</a>’s <a href="https://github.com/ZipCPU/zipcpu/tree/master/rtl/zipdma">new DMA controller</a>. I’ve already managed to integrate it into a <a href="https://github.com/ZipCPU/wbsata">(work in progress) SATA controller</a>, and I’m likely to retarget this DMA engine (plus a small state machine) to meet the DMA needs of my new <a href="https://github.com/ZipCPU/sdspi">SDIO/eMMC controller</a>. Indeed, it is so versatile that I’m likely to use this controller across a lot of projects. Better yet, at this rate, I’m likely to build an AXI version of this new DMA supporting all of these features as well. It’s just that good.</p> <table align="center" style="float: left; padding: 25px"><caption>All labour is profitable, whether or not it's paid for</caption><tr><td><img src="/img/tweets/bible/all-labour.svg" width="320" /></td></tr></table> <p>As for dollars? Well, let’s put it this way: the year is now over, and I’m still in business. Not only that, but I’ve also managed to keep two kids in college this year. More specifically, I expect my third child to graduate from college this year. 
(Five to go …) So, I’ve been hanging in there, and I thank my God that my bills have been paid.</p> <h2 id="articles">Articles</h2> <p>2023 has been a slower year for articles than past years. Much of this is because my time has been so well spent on other paying projects. That’s left less time for blogging. (No, it doesn’t help that my family has fallen in love with football, and that my major blogging times have been spent watching my son’s high school games, Air Force Academy Falcons football, the Kansas City Chiefs, Miami Dolphins, Philadelphia Eagles, and my own home team–the disappointing Minnesota Vikings.) Still, I have managed to push out seven new articles this year. Let’s look at each, and see how easily each can be found using DuckDuckGo.</p> <table align="center" style="float: right"><tr><td><img src="/img/2023-review/vikings.svg" width="320" alt="What does a Vikings fan do after watching the Vikings win the super bowl? He turns off the play-station 4." /></td></tr></table> <ul> <li> <p><a href="/blog/2023/02/13/eccdbg.html">Debugging the hard stuff</a></p> <p>This article discusses some of the challenges I went through when debugging modifications I made to a working ECC algorithm. ECC, of course, is one of those “hard” problems to debug since the intermediate data tends to look meaningless when viewed.</p> <p><strong>DuckDuckGo Ranking:</strong> A search for “FPGA ECC Debugging” brings up the <a href="">ZipCPU home page</a> as result #111.</p> <p>That’s kind of disappointing. Let’s try a search using Google. Google finds <a href="/blog/2023/02/13/eccdbg.html">the correct page</a> immediately as its #1 result. At first I thought the difference was because Google knew I was interested in <a href="">ZipCPU</a> results. Then I asked my daughter to repeat my test on her phone in private mode. (She has no interest in FPGA anything, so this would be a first for her.)
Her Google ranking came up identical, so maybe I can trust this Google ranking.</p> </li> <li> <p><a href="/zipcpu/2023/03/13/swic.html">What is a SwiC</a>?</p> </li> </ul> <table align="center" style="float: left; padding: 25px"><tr><td><img src="/img/swic/barecpu.svg" width="320" /></td></tr></table> <p>The <a href="/about/zipcpu.html">ZipCPU</a> was originally designed to be a System within a Chip, or a SwiC as I called it. This article discusses what a SwiC is, and tries to answer the question of whether or not a SwiC makes sense, or equivalently whether or not the <a href="/about/zipcpu.html">ZipCPU</a> made for a good SwiC in the first place. In many ways, this article was a review of whether or not the <a href="/about/zipcpu.html">ZipCPU</a>’s design goals were appropriate, and whether or not they’ve been met.</p> <p><strong>DuckDuckGo Ranking:</strong> Searches on SwiC return all kinds of irrelevant results, and searches on “System within a Chip” return all kinds of results for “Systems on a Chip”. If you cheat and search for “ZipCPU SwiC”, you get the <a href="">ZipCPU</a> web site as the #1 page.</p> <ul> <li> <p><a href="/blog/2023/04/08/vpktfifo.html">What is a Virtual Packet FIFO</a>?</p> <p>A virtual FIFO is a first-in, first-out data structure built in hardware, but using <em>external</em> memory–such as a DDR3 SDRAM–for its memory. 
A virtual packet FIFO is a virtual FIFO that preserves packet boundaries and guarantees complete packets, in spite of any back pressure that might otherwise cause the FIFO to fill or overflow.</p> </li> </ul> <table align="center" style="float: right"><tr><td><img src="/img/vfifo/pktvfifo.svg" width="320" /></td></tr></table> <p><a href="/blog/2023/04/08/vpktfifo.html">This article</a> goes over the why’s and how’s of a virtual packet FIFO: why you might need it, how to use it, and how it works.</p> <p>Since writing this article, I’ve now built and tested a <a href="https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfifo.v">Wishbone-based virtual packet FIFO as part of the 10Gb Ethernet project</a>. Conclusion? First, verifying the FIFO is a pain. Second, I might be able to tune its memory usage with some better buffering. But, overall, the FIFO itself works quite nicely in all kinds of environments.</p> <p><strong>DuckDuckGo Ranking:</strong> The <a href="">ZipCPU blog</a> comes up as the #2 ranking on DuckDuckGo following a search for “Virtual Packet FIFO”. The <a href="https://www.reddit.com/r/ZipCPU">ZipCPU reddit page</a> comes up as the #7 ranking. The page itself? Not listed. However, both of the prior pages point to this article, so I’m going to give this a DuckDuckGo ranking of #2. Sadly, most of DuckDuckGo’s other results are completely irrelevant to a Virtual Packet FIFO. In general, they’re about Virtual FIFOs–not Virtual <em>Packet</em> FIFOs. As before, though, Google gets the right article as its #1 search result.</p> <ul> <li> <p><a href="/zipcpu/2023/05/29/zipcpu-3p0.html">Introducing the ZipCPU 3.0</a></p> <p>After years of updates, <a href="/about/zipcpu.html">ZipCPU</a> 3.0 is here! This means that the <a href="/about/zipcpu.html">ZipCPU</a> now has support for multiple bus structures, wide bus widths, clock stopping, and a brand new DMA.
<a href="/zipcpu/2023/05/29/zipcpu-3p0.html">The article</a> announces this new release, and discusses the importance of each of these major upgrades.</p> <p><strong>DuckDuckGo Ranking:</strong> A search for “ZipCPU” on DuckDuckGo yields <a href="">ZipCPU.com</a> as the #1 search result. That’s good enough for me.</p> </li> <li> <p><a href="/blog/2023/06/28/sdiopkt.html">Using a Verilog task to simulate a packet generator for an SDIO controller</a></p> <p>I haven’t written a lot about either Verilog test benches or how to build them, so this is a bit of a new topic for me. Specifically, the question involved was how to make your test bench generate properly synchronous stimuli. No, the correct answer is <em>NOT</em> to generate your stimulus on the negative edge of the clock.</p> <p><strong>DuckDuckGo Ranking:</strong> A search for “SDIO Verilog Tasks” on DuckDuckGo yields the <a href="https://github.com/ZipCPU/sdspi">SDIO repository</a> as the #31 search result. (A Google search for “SDIO Verilog” returns the correct article at #3.)</p> </li> <li> <p><a href="/formal/2023/07/18/sdrxframe.html">SDIO RX: Bugs found with formal methods</a></p> <p>If you’ve read my blog often enough, you’ll know that I’m known for formally verifying my designs. In the case of the new <a href="https://github.com/ZipCPU/sdspi">SDIO/eMMC controller</a>, I had it “working” on hardware before either the formal verification or the full simulation model were complete. This leaves open the question: how many bugs were missed by my hardware and (partial) simulation testing?</p> <p>The article also spends a lot of time discussing “why” proper verification, whether formal or simulation based, is so important.</p> <p><strong>DuckDuckGo Ranking:</strong> A search for “SDIO formal verification” turns up the <a href="">ZipCPU blog</a> as result #69. Adding “verilog” to the search terms returns the blog as result #46.
As before, Google returns the right article as the #1 search result after searching only for “SDIO formal”.</p> </li> <li> <p><a href="/blog/2023/11/25/eth10g.html">An Overview of a 10Gb Ethernet Switch</a></p> <p>As I mentioned above, one of the big projects of mine this year was a <a href="https://github.com/ZipCPU/eth10g">10Gb Ethernet switch</a>. This article goes over the basics of the switch, and how the various data paths within the design move data around.</p> <p><strong>DuckDuckGo Ranking:</strong> A search for “10Gb Ethernet Switch FPGA” turns up the <a href="https://github.com/ZipCPU/eth10g">Ethernet design</a> as the #16 result, and a search on “10Gb Ethernet Switch Verilog” returns the same github result as the #1 result. Curiously, the <a href="https://github.com/ZipCPU/eth10g/blob/master/bench/rtl/tbenet.v">10Gb Ethernet test bench model</a> from the same repository comes up as the #2 result.</p> <p>For all those who like to spam my email account, my conclusions from these numbers are simple: 1) the <a href="">ZipCPU blog</a> holds its own just fine in Google’s rankings, and 2) DuckDuckGo’s search engine needs work. <a href="/blog/2022/11/12/honesty.html">If you want to sell me web-based services and don’t know this</a>, I’ll assume you haven’t done your homework and leave your email in my spam box.</p> <h2 id="upcoming-projects">Upcoming Projects</h2> <p>So, what’s next for 2024? Here are some of the things I know of. Some of these are paid for; others still need funding.</p> <table align="center" style="float:none"><caption>2024 Projects</caption><tr><td><img src="/img/2023-review/2024-funding.svg" width="640" /></td></tr></table> <p>Still, this is a good list to start from:</p> <ul> <li> <p>One of my ASIC projects is in the middle of a massive speed upgrade. This is not a clock upgrade, nor an upgrade to the fastest supported frequency, but rather an upgrade that adjusts the internal state machine.
I’m anticipating an additional speed-up of between 8x and 256x as a result of this upgrade.</p> <p>Status? <strong>Funded.</strong></p> </li> <li> <p>My brand new <a href="https://github.com/ZipCPU/sdspi">SDIO/eMMC controller</a> has neither eMMC boot support, nor DMA support. Boot support might allow me to boot the <a href="/about/zipcpu.html">ZipCPU</a> directly from an eMMC card, whereas DMA support would allow the <a href="/about/zipcpu.html">ZipCPU</a> to read lots of data from the card without CPU interaction. Both may be on the near-term horizon, although neither upgrade is funded.</p> </li> </ul> <table align="center" style="float: left; padding: 25px"><caption>Laptop projects have additional requirements</caption><tr><td><img src="/img/2023-review/laptop.svg" width="320" /></td></tr></table> <p>Status? Not funded. On the other hand, this project fits quite nicely on my laptop for those days when I have the opportunity to take my son to his basketball practice … (He’s a 6’4” high school freshman who is new to the sport as of this year …)</p> <ul> <li> <p><a href="https://github.com/ZipCPU/AutoFPGA">AutoFPGA</a> is now, and has been for some time, a backbone of any of my designs. I use it for everything. It makes adding and removing IP components easy. One of its key capabilities is <a href="/zipcpu/2019/09/03/address-assignment.html">address assignment (and adjustment)</a>. Sadly, it’s worked so well that it now needs some maintenance. Specifically, I’d like to upgrade it so that it can handle partially fixed addressing, such as when some addresses are given and fixed while others are allowed to change from one design to the next. This is only a precursor, though, to supporting 2GB memories, where the memory address range would overlap one of the ZipSystem’s fixed address ranges.</p> <p>Status? A <strong>funded</strong> (SONAR) project requires these upgrades.
Unlike my current SONAR project, built around <a href="https://store.digilentinc.com/nexys-video-artix-7-fpga-trainer-board-for-multimedia-applications/">Digilent’s Nexys Video board</a>, this one will be built around <a href="https://www.enclustra.com/en/products/fpga-modules/mercury-kx2/">Enclustra’s Mercury KX2</a>, and so either <a href="https://github.com/ZipCPU/AutoFPGA">AutoFPGA</a> gets upgraded or I can’t use the full memory range.</p> </li> <li> <p>The <a href="/about/zipcpu.html">ZipCPU</a>’s GCC backend urgently needs a fix. Specifically, it has a problem with <a href="https://en.wikipedia.org/wiki/Tail_call">tail (sibling) calls</a> that jump to register addresses. This problem was revealed when testing the <a href="https://github.com/ZipCPU/sdspi">SDIO/eMMC software drivers</a>, and needs a proper fix before I can make any more progress on upgrading the <a href="https://zipcpu.com/zipcpu/2021/03/18/zipos.html">ZipOS</a>.</p> <p>Did I mention working on the <a href="https://zipcpu.com/zipcpu/2021/03/18/zipos.html">ZipOS</a>? Indeed. Realistically, further work on the <a href="https://github.com/ZipCPU/sdspi">SDIO/eMMC software</a> really wants a proper OS of some type, so … this may be an upcoming task.</p> <p>Status? This project isn’t likely to get any funding, but other projects are likely to require this fix.</p> </li> <li> <p>As another potential project, an old friend is looking into building a “see-in-the-dark” capability–kind of like a “better” version of night-vision goggles. He’s currently arranging for funding, and after all of my video work I might finally find a customer for it. Yes, his work will require some secret sauce processing–but it’s all quite doable, and could fit nicely into this year’s upcoming work.</p> <p>Status?
If this moves forward, it will be <strong>funded</strong>.</p> </li> <li> <p>I’d also like to continue my work on a <a href="https://github.com/ZipCPU/wbsata">Wishbone controlled SATA controller</a> this year. I started working on this controller under the assumption that it would be required by my SONAR project, and so funded. Now it no longer looks like it will be funded under this vehicle. Still, the controller is now written, even though the verification work is far from complete. Specifically, I’ll need to work on my <a href="https://github.com/ZipCPU/wbsata">SATA (Verilog) Verification IP</a> until it’s sufficient to tell me whether or not I have the Xilinx GTX transceivers modeled correctly. Once I get that far, I can start testing both against actual hardware (on my desk) and against <a href="/blog/2017/06/21/looking-at-verilator.html">Verilator</a> models.</p> <p>Status? Funding has been applied for. Sadly, it’s not likely to be enough to pay for my hours, but perhaps I can have a junior engineer work on this. Still, whether or not the funding comes through remains to be determined.</p> </li> <li> <p>Did I mention that the new debugging bus upgrades are on my list to be tested? Who knows, I may test their AXI counterparts first, or I may test the UDP version first, or … Only the Good Lord knows how this task will move forward.</p> <p>Status? Not funded at all.</p> </li> <li> <p>I am looking into getting some funding for a second version of an Ethernet based memory controller. The SONAR project required a <a href="https://zipcpu.com/blog/2022/08/24/protocol-design.html">first version of this controller</a>, and it smokes <a href="/blog/2017/06/05/wb-bridge-overview.html">my serial port based debugging controller</a>. A second version of this controller, designed for resource-constrained FPGAs, designed for speed, designed for throughput from the ground up … could easily become a highly desired product.</p> <p>We’ll see.</p> <p>Status?
Sounds fun, but not (yet) funded.</p> </li> <li> <p>Finally, I have an outstanding task to test an open source memory controller, using open source synthesis and place-and-route tools, for both Artix-7 and Kintex-7 devices. I’ll let you know how that works out.</p> </li> </ul> <p>Since these are business predictions about the future, I am required by the Good Lord to add that these are subject to whether or not I live and the Lord wills. (See <a href="https://www.blueletterbible.org/kjv/jam/4/13-15">James 4:13-15</a> for an explanation.)</p> <p>As always, let me know if you are interested in any of these projects, and especially let me know if you are interested in funding one or more of them. Either way, the upcoming year looks like it will be quite busy, and it’s only January.</p> <p>“My cup runneth over (<a href="https://blueletterbible.org/kjv/psa/23/5">Ps 23:5</a>)”, and so I shall also pray that God grants you the many blessings He has given me.</p> <hr /><p><em>Let every thing that hath breath praise the LORD. Praise ye the LORD. (Ps 150:6)</em></p> Sat, 20 Jan 2024 00:00:00 -0500 https://zipcpu.com/blog/2024/01/20/2023-in-review.html