<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>The ZipCPU by Gisselquist Technology</title>
    <description>The ZipCPU blog, featuring how-to discussions of FPGA and soft-core CPU design.  This site will be focused on Verilog solutions, using exclusively OpenSource IP products for FPGA design.  Particular focus areas include topics often left out of more mainstream FPGA design courses such as how to debug an FPGA design.
</description>
    <link>https://zipcpu.com/</link>
    <atom:link href="https://zipcpu.com/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Wed, 17 Dec 2025 10:44:03 -0500</pubDate>
    <lastBuildDate>Wed, 17 Dec 2025 10:44:03 -0500</lastBuildDate>
    <generator>Jekyll v4.2.0</generator>
    <image>
      <url>https://zipcpu.com/img/gt-rss.png</url>
      <title></title>
      <link></link>
    </image>
    
      <item>
        <title>Device Clock Generation</title>
        <description>&lt;p&gt;After building a &lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;, &lt;a href=&quot;https://github.com/ZipCPU/wb2axip&quot;&gt;utilities for
handling bus interconnects&lt;/a&gt;, several DMAs
and memory controllers, I often find my time focused on building interfaces
between designs and external peripherals.  This seems to be where most of the
business has landed for me.  Often, these peripherals require a clock output,
coming from the design, and so I’d like to spend some time describing how to
generate such a “device” clock.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right; padding: 25px&quot;&gt;&lt;caption&gt;Fig 1.  A Basic SOC with Peripherals&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/devclk/soc.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;There are actually two topics that need to be discussed when working with modern
high speed peripheral design.  One of them is &lt;em&gt;generating&lt;/em&gt; the clock to be sent
to the peripheral, such as Fig. 1 above illustrates.  The second one involves
&lt;em&gt;processing&lt;/em&gt; a clock returned from the peripheral, as shown in Fig. 2 below.
This is a key component of high speed designs such as DDR memories, eMMC,
HyperRAM, or even NAND flash protocols.  This second topic is one we shall
need to come back to at a later date.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left&quot;&gt;&lt;caption&gt;Fig 2.  Data returned with a clock&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/devclk/bidir-clk.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Today, I’d like to discuss how to go about &lt;em&gt;generating&lt;/em&gt; a clock to control
device interaction.&lt;/p&gt;

&lt;p&gt;I first came across this problem when building a
&lt;a href=&quot;/blog/2019/03/27/qflexpress.html&quot;&gt;NOR flash controller&lt;/a&gt;,
based first on an &lt;a href=&quot;/blog/2018/08/16/spiflash.html&quot;&gt;SPI
interface&lt;/a&gt; and later a
&lt;a href=&quot;/blog/2019/03/27/qflexpress.html&quot;&gt;Quad SPI interface&lt;/a&gt;.
&lt;a href=&quot;https://github.com/ZipCPU/qspiflash&quot;&gt;My controller&lt;/a&gt; was designed for FPGAs,
and so the clock could be built with a single frequency.
This design had the added complication that the clock needed to be paused from
time to time.  Specifically, the clock needed to be turned off when nothing
was going on.  Likewise, the clock needed to be turned off for one cycle after
dropping (i.e. activating) the chip select pin, and for a couple cycles after
the transaction was complete but before raising (deactivating) the chip select.&lt;/p&gt;

&lt;p&gt;I had to deal with a similar problem when controlling a HyperRAM, but …
&lt;a href=&quot;https://github.com/ZipCPU/wbhyperram&quot;&gt;that design&lt;/a&gt; failed when I wasn’t (yet)
prepared to handle the return clock properly.  I did say this deserved an
article in its own right, did I not?  Processing data on a return clock properly
can be a challenge.&lt;/p&gt;

&lt;p&gt;I then built &lt;a href=&quot;https://www.arasan.com/product/xspi-psram-master/&quot;&gt;a similar design for ASIC
platforms&lt;/a&gt;.  Unlike the
FPGA, the final clock speed wouldn’t be known until run time.  It might be
that the design started at a slower clock speed, only to later speed up to
the full rate at run time.  Unlike an FPGA which can be fixed later, there’s
really no room for failure in &lt;a href=&quot;/blog/2017/10/13/fpga-v-asic.html&quot;&gt;ASIC
work&lt;/a&gt;.  At least
with an FPGA, if my board didn’t support a particular frequency, I could just
rebuild the design for the clock frequency it did support.  This doesn’t work,
though, for an ASIC–since it tends to be cost prohibitive to rebuild the
design at a later time when you decide to connect it to a slower part than
the one you designed it for.&lt;/p&gt;

&lt;p&gt;The next design I worked with was a &lt;a href=&quot;https://www.arasan.com/product/onfi-4-2-controller-phy/&quot;&gt;NAND flash
design&lt;/a&gt;.  NAND flash
can be a challenge, since the protocol requires you to start at a slow
frequency and only after you bring up the connection are you allowed to change
to a faster frequency.  &lt;a href=&quot;https://www.arasan.com/product/onfi-4-2-controller-phy/&quot;&gt;This particular
design&lt;/a&gt; was built for
ASIC environments, and so it depended upon an analog component generating all
the clocks I needed.  This worked great, up until someone wanted to purchase
the design to work on an FPGA, then another wanted it to work on an FPGA, and
another and so on.&lt;/p&gt;

&lt;!-- TWO TRACES: SDR timing vs DDR timing --&gt;
&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;caption&gt;Fig 3. Single Data Rate (SDR) vs Double Data Rate (DDR)&lt;/caption&gt;&lt;tr&gt;&lt;th&gt;SDR&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/devclk/sdr.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;th&gt;DDR&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/devclk/ddr.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Just to add another twist to the problem, many protocols require data
transitions on both edges of the clock, a technique often known as
“Double Data Rate” (DDR).  Unlike the other designs above, these often require a
clock that is 90 degrees offset from the data–so that each clock transition
takes place in the middle of each data valid window, rather than on the edges
of the window.  This sort of “offset” clock is necessary to guarantee setup and
hold times within the slave peripheral.  An example of the clock and data
relationship required by DDR as opposed to a traditional “single data rate”
(SDR) clock is shown in Fig. 3.&lt;/p&gt;

&lt;p&gt;By the time I got to my &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO/eMMC controller&lt;/a&gt;,
I think I finally had the clock division problem handled.  An
&lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO controller&lt;/a&gt; needs to bring up the SD card
at 400kHz, and then depending upon the card, the PCB, and the controller, the
speed may then be raised to 25MHz, 50MHz, 100MHz, or even 200MHz.  The clock
may also be stopped whenever either there’s nothing to send or receive, or
when the SOC can’t load or unload the data to the controller.  For example, you
might ask an SD card to read and thus produce many blocks of data, then read
the first two of these blocks into your internal buffers only to find that the
CPU is slow in draining those buffers.  In that case, you would need to stop
the interface clock before the external card tries to send you a third block
of data that would have nowhere to go.&lt;/p&gt;

&lt;p&gt;Other devices require user programmable device clock controllers, such as:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/videozip/tree/master/rtl/ethernet&quot;&gt;10M/100M/1Gb Ethernet controllers&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;While each of these speeds might use a single clock, building a truly
trimode controller requires some extra work.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;/zipcpu2025/05/28/memtest.html&quot;&gt;(DDR) SDRAM controllers&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;SDRAM controllers from an FPGA standpoint tend to be simple: just produce a
clock.  However, you can turn the clock off for better power performance.
Yes, there are rules … but we won’t get into those here today.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;I2S&lt;/p&gt;

    &lt;p&gt;&lt;a href=&quot;/blog/2019/06/28/genclk.html&quot;&gt;We discussed generating an I2S clock at a totally arbitrary
frequency&lt;/a&gt; some time ago.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;/blog/2021/11/15/ultimate-i2c.html&quot;&gt;I2C&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;In general, I2C is too slow to be the focus of this article.  There is
an I3C protocol that is built on top of I2C.  The techniques we discuss today
might work well for I3C masters, but I’m not nearly as familiar with those.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/wbspi&quot;&gt;SPI – not just NOR flash&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;While SPI &lt;em&gt;slaves&lt;/em&gt; have a device clock as well, handling these clocks is
fundamentally different from what I’m describing today.  My focus today
will be on &lt;em&gt;generating&lt;/em&gt; clock signals for the purpose of controlling
external devices–such as an SPI master might need to do.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Specifically, today I want to look at and discuss generating a clock with one
or more of the following characteristics:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Output Signal:&lt;/strong&gt; We’re talking about interface clocks–those generated by
the “master” of the interface.  These are &lt;em&gt;digital&lt;/em&gt; signals, output from
either an FPGA or an ASIC device.&lt;/p&gt;

    &lt;p&gt;The output may be accomplished via a component like an
&lt;a href=&quot;/blog/2020/08/22/oddr.html&quot;&gt;ODDR&lt;/a&gt; or an OSERDES,
with or without an additional analog delay following.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Discontinuous:&lt;/strong&gt; The clock may be discontinuous.  Many protocols
(&lt;a href=&quot;/blog/2019/03/27/qflexpress.html&quot;&gt;flash&lt;/a&gt;,
&lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO/eMMC&lt;/a&gt;, etc.) allow, or even require,
the clock to be stopped, or otherwise only toggled when there’s something to
send or receive.  As mentioned above, stopping the clock may also be useful
for pausing a transmission in progress before a source buffer runs dry, or an
incoming buffer overflows.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Dynamic Frequency:&lt;/strong&gt; Often, the outgoing clock needs to change frequency
during operation as part of the protocol.  For example, the SDIO protocol
needs to start at 400kHz, and then increase to 25MHz (or more).  Therefore,
a good clock generator will need to be able to naturally generate multiple
clock frequencies as the protocol requires.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Minimum pulse width:&lt;/strong&gt; Switching between frequencies must be done by rule:
clock glitches must be fully disallowed and guaranteed against.  Too-short
clock pulses cannot be allowed.  Clock high and low durations must always be
at least a half period of the fastest allowable clock.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;90 Degree Offset for DDR Signaling:&lt;/strong&gt; As shown in Fig 3, many modern
protocols require both positive and negative edge signaling (DDR).  This
drops the required clock frequency by 2x, reducing the bandwidth that must
be carried over the PCB for the same data rate.  However, the clock signal
required to support such DDR signaling often needs to be delayed 90 degrees
from the data, so that it transitions in the middle of the data valid period.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Faster than the controller’s clock:&lt;/strong&gt; Just to make matters worse, in &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;my
eMMC design&lt;/a&gt;, I needed to generate a 200MHz
DDR device clock from a 100MHz system clock.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All this is to say that our goal today will be to create a divided clock using
digital, rather than analog, logic.  (Yes, I can hear my analog engineering
friends jump in here with the comment that “Everything is analog!”  God bless
you, my friends.)&lt;/p&gt;

&lt;h2 id=&quot;the-problem&quot;&gt;The Problem&lt;/h2&gt;

&lt;p&gt;The first approach I often see to this problem is the straightforward
integer clock division approach.  Generally, it looks something like the
following:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;src_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;active_clock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// if (active_clock)&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;assign&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;dev_clk&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;high_speed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;src_clk&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;active_clock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user_selected_bit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;In this case, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;active_clock&lt;/code&gt; controls whether or not the clock is stepping,
and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;user_selected_bit&lt;/code&gt; controls which level of clock division we are
interested in.  As for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src_clk&lt;/code&gt;, that can be either the system clock or
alternatively whatever is required to generate the fastest clock frequency
required by the protocol.&lt;/p&gt;

&lt;p&gt;Note that we’ve done nothing to guarantee this clock won’t glitch between
speed selections, nor can we necessarily guarantee a minimum pulse width when
switching between two clock rates.  We’ll come back to these requirements later,
albeit with a different (better) implementation.&lt;/p&gt;

&lt;p&gt;The user logic required to use this clock looks very simple at first:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dev_clk&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;or&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;// Reset logic&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;pedge_data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// Logic controlling any flops based on the dev_clk&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;When a protocol requires data on both edges of the clock, getting the data
right for the second edge of the clock is also important.  But, how shall we
output data on the negative edge of a clock we’ve just created out of thin
air?  We’ll need to transition on the negative edge to do this.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;negedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dev_clk&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;or&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;// Reset logic&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;nedge_data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// Logic controlling the negative clock's data&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;assign&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;output_data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dev_clk&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ddr_mode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pedge_data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nedge_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This approach leaves us with two problems.  The first is that we’re using our
clock as a logic signal when we assign &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dev_clk&lt;/code&gt; to possibly be the same as
our source clock.  The second problem is that we are transitioning user logic
on this clock.  Worse, though, we’re now transitioning our user logic on both
edges of the clock.  This violates &lt;a href=&quot;/blog/2017/08/21/rules-for-newbies.html&quot;&gt;&lt;em&gt;the
rules&lt;/em&gt;&lt;/a&gt; of good
digital logic design.&lt;/p&gt;

&lt;p&gt;These aren’t necessarily issues when building ASIC designs.  However, in FPGA
design, this clock will need to get onto the clocking network’s backbone
somehow, and that’s not automatic.  Worse, this new clock is &lt;em&gt;not&lt;/em&gt; the same
as the original &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src_clk&lt;/code&gt;–even when they are at the same frequency.  There
will always be a delay between the two clocks–a delay that may not be
captured by pre-synthesis simulation, and so it can be a dangerous delay the
engineer isn’t expecting when building this logic.&lt;/p&gt;

&lt;p&gt;This leads to two commercial ASIC design challenges.  First, when designing an
ASIC IP, you want to be able to test as much of the IP on an FPGA as possible.
Non-FPGA-compatible logic needs to be moved to the periphery of the design and
carefully controlled.  Second, from a business point of view, it helps to be
able to sell the ASIC design to FPGA customers in addition to ASIC customers.
So, even though you &lt;em&gt;can&lt;/em&gt; do something like this on an ASIC, that doesn’t mean
you &lt;em&gt;should&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;There are other problems.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;/blog/2017/10/20/cdc.html&quot;&gt;Clock domain crossings (CDCs)&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;Since the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src_clk&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dev_clk&lt;/code&gt; are now two separate and distinct clock
domains, you’ll need to properly manage every &lt;a href=&quot;/blog/2017/10/20/cdc.html&quot;&gt;clock domain
crossing&lt;/a&gt; between these two
clock domains.  This can create additional delays through what otherwise
might be high speed logic.&lt;/p&gt;

    &lt;p&gt;Likewise, the positive and negative edges of the same clock are also
(technically) separate clock domains.  Moving between them is “possible, but
not recommended.”&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Gating&lt;/p&gt;

    &lt;p&gt;You may have noticed we haven’t properly gated our clock above.  Sure, we
used an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;active_clock&lt;/code&gt; signal to provide gating, but this signal does not
guarantee the maximum frequency of the output clock.  This, however, is a
minor problem that most engineers reading this blog would be able to easily
fix with a little bit of additional logic.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Two problems in particular, though, become deal breakers when it comes to this
type of design.  The first is that DDR interfaces often require a clock delayed
by 90 degrees from the data, as shown in Fig. 3 above.  The simple approach
will not generate such a 90 degree delay.  While one might use an analog delay
element, such as a Xilinx ODELAY element, to delay the clock signal by an
appropriate amount, this will only work for high speed clocks and not for
clocks less than 50MHz or so.  The second problem is, what do you do when you
need a device clock that’s faster than your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src_clk&lt;/code&gt;, like I did in my
&lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO/eMMC controller&lt;/a&gt; design?&lt;/p&gt;

&lt;p&gt;As a result, we really need another approach.&lt;/p&gt;

&lt;h2 id=&quot;the-solution&quot;&gt;The Solution&lt;/h2&gt;

&lt;p&gt;The basic solution is to return to &lt;a href=&quot;/blog/2017/08/21/rules-for-newbies.html&quot;&gt;the
rules&lt;/a&gt;, and so
avoid transitions on the device clock edge entirely.  Instead, we’ll continue
to transition on our source clock and then use either an
&lt;a href=&quot;/blog/2020/08/22/oddr.html&quot;&gt;ODDR&lt;/a&gt; or an OSERDES to generate
the final outgoing clock.  In the meantime, we’ll treat the newly generated
device clock as a traditional logic signal–rather than a “clock” within our
design.  That is, we’ll let it be and remain &lt;em&gt;logic&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Let’s start by looking at Fig. 3 above, and dividing the clock period into
sections, as shown in Fig. 4 below.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 4. Dividing the clock period&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/devclk/ddrbyfour.svg&quot; width=&quot;480&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Nominally, we’d want at least two sections per clock–one for each piece of
data in a DDR transmission.  Sadly, this isn’t enough, since the clock might
need to be offset by 90 degrees.  Hence, we’ll need to break each clock
period into four logically distinct time periods.  We can label these time
periods 3:0, with the left-most (most-significant) being 3, down to the right-most
(least-significant) being 0.&lt;/p&gt;

&lt;p&gt;From here, we can generate what I’m going to call a &lt;em&gt;wide&lt;/em&gt; clock, four bits at
a time.  This wide clock will then be output via a 4:1 OSERDES–if it is to keep
pace with the source clock within our design.  At its
fastest speed, this clock will be either &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0011&lt;/code&gt; (where the MSB ‘0’ is
transmitted “first”), or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0110&lt;/code&gt; if a 90 degree offset clock is required for
DDR transmissions (as shown in Fig. 4).  At its next slowest speed, the clock
would be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0000&lt;/code&gt; followed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1111&lt;/code&gt;, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0011&lt;/code&gt; followed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1100&lt;/code&gt;.  Further
clock divisions will use wide clocks of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0000&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1111&lt;/code&gt;.&lt;/p&gt;
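
&lt;p&gt;To make these patterns concrete, here is a minimal sketch of the pattern
selection for the two fastest speeds.  The module name and the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;half_rate&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ddr_mode&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;phase&lt;/code&gt; inputs are names I’ve
made up for illustration, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;phase&lt;/code&gt; is assumed to come from a one-bit
counter that toggles on every source clock while running at half rate.  The
sketch only selects a pattern: by itself it does nothing to protect against
short pulses should the settings change in the middle of a clock period.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;// Illustrative sketch only: all names here are hypothetical
module wideclk_pattern_sketch (
	input	wire		half_rate,	// 0: src_clk rate, 1: src_clk/2
	input	wire		ddr_mode,	// 1: offset the clock by 90 degrees
	input	wire		phase,		// 0: first src_clk cycle of a
						//    half-rate period, 1: second
	output	reg	[3:0]	wide_clk	// MSB is transmitted first
);
	always @(*)
	case({ half_rate, ddr_mode, phase })
	3'b000, 3'b001: wide_clk = 4'b0011;	// Full rate, no offset
	3'b010, 3'b011: wide_clk = 4'b0110;	// Full rate, 90 degree offset
	3'b100: wide_clk = 4'b0000;	// Half rate, no offset, first cycle
	3'b101: wide_clk = 4'b1111;	// Half rate, no offset, second cycle
	3'b110: wide_clk = 4'b0011;	// Half rate, offset, first cycle
	3'b111: wide_clk = 4'b1100;	// Half rate, offset, second cycle
	endcase
endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;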

&lt;p&gt;If you wish to use an &lt;a href=&quot;/blog/2020/08/22/oddr.html&quot;&gt;ODDR&lt;/a&gt;
instead of a 4:1 OSERDES, you can still use this approach, save that you
would be generating 2 wide clock bits at a time instead of four.  The fastest
clock would be a repeating &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;01&lt;/code&gt;, but this fastest clock would be unable to
handle the 90 degree offsets of a DDR signal.  The next fastest would be
either &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;00&lt;/code&gt; followed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;11&lt;/code&gt;, or the 90 degree offset version of the same at
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;01&lt;/code&gt; followed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;10&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you want a clock running at twice your system frequency, you could use
an eight-bit wide clock signal, designed to feed an 8:1 SERDES.  Your fastest
clock would become &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;00110011&lt;/code&gt; (non–DDR) or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;01100110&lt;/code&gt; when working with DDR
signals.&lt;/p&gt;
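
&lt;p&gt;One way to picture the eight-bit version is as a widening of the four-bit
pattern.  The sketch below is only an illustration under that assumption, with
hypothetical names throughout: repeating the four-bit pattern twice yields two
device clock periods per source clock, while doubling each bit yields one.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;// Illustrative sketch only: all names here are hypothetical
module wide8_sketch (
	input	wire	[3:0]	wide4,		// One clock period, MSB first
	input	wire		double_rate,	// 1: two periods per src_clk
	output	wire	[7:0]	wide8		// Would feed an 8:1 serializer
);
	// double_rate: 0011 becomes 00110011, and 0110 becomes 01100110
	// otherwise:   0011 becomes 00001111, and 0110 becomes 00111100
	assign	wide8 = (double_rate) ? { wide4, wide4 }
		: { {2{wide4[3]}}, {2{wide4[2]}},
			{2{wide4[1]}}, {2{wide4[0]}} };
endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;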

&lt;p&gt;That’s the first step–the wide clock.&lt;/p&gt;

&lt;p&gt;The second step is to generate, together with the wide clock signal, two
other signals.  The first signal, let’s call this &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;new_edge&lt;/code&gt;, will indicate
that a new clock cycle is beginning.  The second, which I shall call the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;half_edge&lt;/code&gt;, will indicate that the second half of a clock cycle is beginning.
Both of these signals are also shown in Fig. 4 above, each indicating the
portion of the clock cycle they represent.&lt;/p&gt;

&lt;p&gt;All three of these &lt;em&gt;logic&lt;/em&gt; signals can be now generated by a “clock generator”
module.&lt;/p&gt;

&lt;p&gt;If necessary, this clock can be stopped either at the clock generator, or
gated further down the signal pipeline by simply zeroing out the wide clock.&lt;/p&gt;
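
&lt;p&gt;A minimal sketch of such a clock generator might look like the following.
This is not the generator from any of my controllers.  The module name, the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;run&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;half_rate&lt;/code&gt; inputs, and the internal &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;phase&lt;/code&gt; register are all
assumptions made for illustration.  For brevity, it handles only the two
fastest speeds and only the non-offset patterns.  I’ve also assumed that both
strobes are set together when a whole clock period fits within a single source
clock cycle.  The speed and stop requests are only examined between clock
periods, so a period that has started is always finished.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;// Illustrative sketch only: all names here are hypothetical
module devclk_gen_sketch (
	input	wire		src_clk, reset,
	input	wire		run,		// 1: run the clock, 0: stop it
	input	wire		half_rate,	// 0: src_clk rate, 1: src_clk/2
	output	reg	[3:0]	wide_clk,	// MSB is transmitted first
	output	reg		new_edge,	// A new clock period begins
	output	reg		half_edge	// Its second half begins
);
	reg	phase;	// 1 when the second half of a period is still owed

	always @(posedge src_clk)
	if (reset)
	begin
		phase     &amp;lt;= 1'b0;
		wide_clk  &amp;lt;= 4'b0000;
		new_edge  &amp;lt;= 1'b0;
		half_edge &amp;lt;= 1'b0;
	end else if (phase)
	begin
		// Second half of a half-rate period, regardless of run
		phase     &amp;lt;= 1'b0;
		wide_clk  &amp;lt;= 4'b1111;
		new_edge  &amp;lt;= 1'b0;
		half_edge &amp;lt;= 1'b1;
	end else if (!run)
	begin
		// Stopped: the clock idles low, and the strobes stop too
		wide_clk  &amp;lt;= 4'b0000;
		new_edge  &amp;lt;= 1'b0;
		half_edge &amp;lt;= 1'b0;
	end else if (half_rate)
	begin
		// First half of a half-rate period
		phase     &amp;lt;= 1'b1;
		wide_clk  &amp;lt;= 4'b0000;
		new_edge  &amp;lt;= 1'b1;
		half_edge &amp;lt;= 1'b0;
	end else begin
		// Full rate: the whole period fits in one src_clk cycle
		wide_clk  &amp;lt;= 4'b0011;
		new_edge  &amp;lt;= 1'b1;
		half_edge &amp;lt;= 1'b1;
	end
endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;With these two speeds, the shortest high or low pulse this sketch can
produce is two of the four sections, or half a period of the fastest clock.&lt;/p&gt;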

&lt;p&gt;Let’s pause for a moment to illustrate what a “clock” like this might look
like.&lt;/p&gt;

&lt;p&gt;We’ll start with the highest speed clock, running at the source clock rate.
This clock will have a wide clock of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0011&lt;/code&gt;, and new data on every clock edge.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right; padding: 25px&quot;&gt;&lt;caption&gt;Fig 5. Highest speed SDR&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;/img/devclk/h3.svg&quot;&gt;&lt;img src=&quot;/img/devclk/h3.svg&quot; width=&quot;480&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Fig. 5 shows all of these key signals.  First, you can see the system clock,
which we called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src_clk&lt;/code&gt; above, that everything is generated off of.  Next, you
can see the IO clock we create, followed by the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wide_clock&lt;/code&gt; used to create
it.  This is followed by the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;new_edge&lt;/code&gt; control signal.  This clock might be
the clock we would use for a data signal transitioning once per clock (SDR).
Therefore, I’ve also shown what a couple of periods of this
data signal might look like.&lt;/p&gt;
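
&lt;p&gt;The user logic driving such an SDR data signal no longer needs to be clocked
by the device clock at all.  Here is a rough sketch of what that might look
like for a simple transmit shift register.  The module, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;load&lt;/code&gt; and
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tx_byte&lt;/code&gt; inputs, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tx_wide_clk&lt;/code&gt; output are hypothetical names, and
I’m assuming the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;new_edge&lt;/code&gt; strobe arrives in the same cycle as the wide clock
word that starts the period.  Since the data register adds one cycle of delay,
the sketch registers the wide clock as well to keep the two aligned.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;// Illustrative sketch only: all names here are hypothetical
module sdr_tx_sketch (
	input	wire		src_clk, reset,
	input	wire		new_edge,	// A new clock period begins
	input	wire	[3:0]	wide_clk,	// From the clock generator
	input	wire		load,		// Only examined on new_edge
	input	wire	[7:0]	tx_byte,	// Sent MSB first
	output	reg		tx_data,	// Held for a whole clock period
	output	reg	[3:0]	tx_wide_clk	// Wide clock, aligned to tx_data
);
	reg	[6:0]	sreg;

	// Everything is clocked by src_clk.  The strobe is only an enable.
	always @(posedge src_clk)
	if (reset)
		{ tx_data, sreg } &amp;lt;= 8'h00;
	else if (new_edge)
		{ tx_data, sreg } &amp;lt;= (load) ? tx_byte : { sreg, 1'b0 };

	// Delay the wide clock by the same one cycle as the data
	always @(posedge src_clk)
	if (reset)
		tx_wide_clk &amp;lt;= 4'b0000;
	else
		tx_wide_clk &amp;lt;= wide_clk;
endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;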

&lt;p&gt;Were this interface to run in DDR mode, sending one word of data on each edge
of the clock, then the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wide_clock&lt;/code&gt; would need to be (repeatedly) set to
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0110&lt;/code&gt;, as shown in Fig. 6 below.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left&quot;&gt;&lt;caption&gt;Fig 6. Highest speed DDR&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;/img/devclk/h6.svg&quot;&gt;&lt;img src=&quot;/img/devclk/h6.svg&quot; width=&quot;480&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;There are a couple key differences between Fig. 6 and Fig. 5 above.  The first,
and perhaps most obvious, is that the data in Fig. 6 are output at two words
per system clock cycle.  This is often desirable, in that twice the data rate
may now be achieved.  The second difference is that the IO clock is now offset
90 degrees from the data, instead of 180 degrees.  This is often necessary to
guarantee that there is a clock transition in the middle of the data valid
period.  To make this happen, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wide_clock&lt;/code&gt; is now set to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0110&lt;/code&gt; in each
clock period.&lt;/p&gt;
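
&lt;p&gt;As a minimal sketch of this full speed case, the choice between the two
patterns might be captured as below.  The module name, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fullspeed_wide_clock&lt;/code&gt;,
is hypothetical, and bits are assumed to leave the serializer MSB first.  The
SDR pattern, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0011&lt;/code&gt;, is also an assumption, chosen to place the rising clock
edge half way through each data word.  Only the DDR pattern, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0110&lt;/code&gt;, is given
in the text above.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;// Illustrative sketch only: not taken from any particular design
module	fullspeed_wide_clock (
		input	wire		ddrmode,
		output	wire	[3:0]	wide_clock
	);

	// ASSUMPTION: bits are serialized MSB first
	// SDR: 0011 (assumed), the rising edge lands mid-way through the word
	// DDR: 0110, one clock edge lands in the middle of each of two words
	assign	wide_clock = (ddrmode) ? 4'b0110 : 4'b0011;
endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Nothing else needs to change to move the clock edge relative to the data:
the four bits handed to the serializer on each system clock carry the whole
phase relationship.&lt;/p&gt;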

&lt;p&gt;Using these clock signals, we can also pause the clock–as shown in Fig. 7
below.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right; padding: 25px&quot;&gt;&lt;caption&gt;Fig 7. Pausing the clock&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;/img/devclk/h6-pause.svg&quot;&gt;&lt;img src=&quot;/img/devclk/h6-pause.svg&quot; width=&quot;480&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Note that the key signals, such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;new_edge&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;half_edge&lt;/code&gt;, must also stop
when the clock pauses (stops).  Because there is no clock signal, the data
output signals become don’t care.  (For power reasons, I could see holding the
output at its previous value for short periods of time, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;D2&lt;/code&gt; in this case,
but that’s another discussion.)&lt;/p&gt;

&lt;p&gt;This same signaling approach also works when dividing the clock speed by two.
Fig. 8 shows an example SDR signal with a clock speed set to half the system
clock speed.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left&quot;&gt;&lt;caption&gt;Fig 8. SDR at half the system clock rate&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;/img/devclk/h0f.svg&quot;&gt;&lt;img src=&quot;/img/devclk/h0f.svg&quot; width=&quot;480&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Fig. 9 shows the same thing, but this time for a DDR signal with the clock
at half the system clock speed.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right; padding: 25px&quot;&gt;&lt;caption&gt;Fig 9. DDR at half the system clock rate&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;/img/devclk/h3c.svg&quot;&gt;&lt;img src=&quot;/img/devclk/h3c.svg&quot; width=&quot;480&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Before leaving this example, note how easy it was to change frequencies in
this representation: we just adjusted the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wide_clock&lt;/code&gt;, and then the new and
half clock positions changed to match.&lt;/p&gt;
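
&lt;p&gt;To make that concrete, here is a rough sketch of what a divide-by-two
generator could look like.  It is not taken from any existing clock generator.
The module name, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;halfspeed_ckgen&lt;/code&gt;, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;phase&lt;/code&gt; register are hypothetical.
The sketch further assumes that bits are serialized MSB first, that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;new_edge&lt;/code&gt;
marks the system clock cycle that starts an IO clock period, and that
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;half_edge&lt;/code&gt; marks the cycle containing the middle of that period.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;// Illustrative sketch only: a divide-by-two wide clock generator
module	halfspeed_ckgen (
		input	wire		src_clk, i_reset,
		input	wire		ddrmode,
		output	wire	[3:0]	wide_clock,
		output	wire		new_edge, half_edge
	);

	// phase: 0 on the first system clock of each IO clock period,
	// 1 on the second
	reg	phase;

	initial	phase = 1'b0;
	always @(posedge src_clk)
	if (i_reset)
		phase &amp;lt;= 1'b0;
	else
		phase &amp;lt;= !phase;

	// ASSUMPTION: new_edge marks the cycle starting an IO clock period,
	// half_edge marks the cycle containing that period's midpoint
	assign	new_edge  = !phase;
	assign	half_edge =  phase;

	// ASSUMPTION: bits are serialized MSB first
	//   SDR: 0000 then 1111 (8'h0f across two system clocks)
	//   DDR: 0011 then 1100 (8'h3c, the same clock delayed by 90 degrees)
	assign	wide_clock = (ddrmode)
			? (phase ? 4'b1100 : 4'b0011)
			: (phase ? 4'b1111 : 4'b0000);
endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The only things that differ between the SDR and DDR cases are the two
constants handed to the serializer.  The edge markers follow from the same
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;phase&lt;/code&gt; register either way.&lt;/p&gt;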

&lt;p&gt;We can drop the clock frequency again to a quarter of the system clock speed,
as shown in Fig. 10.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left&quot;&gt;&lt;caption&gt;Fig 10. SDR at a quarter of the system clock rate&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;/img/devclk/h00ff.svg&quot;&gt;&lt;img src=&quot;/img/devclk/h00ff.svg&quot; width=&quot;480&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;We can also offset this clock by 90 degrees, as shown in Fig. 11.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right; padding: 25px&quot;&gt;&lt;caption&gt;Fig 11. DDR at a quarter of the system clock rate&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;/img/devclk/h0ff0.svg&quot;&gt;&lt;img src=&quot;/img/devclk/h0ff0.svg&quot; width=&quot;480&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;When using this type of “wide” clock, user logic becomes simplified as well.
This “simplified” user logic is easily illustrated with an example.  For this
example, let’s suppose we wished to control 8 data wires using this type of
divided clock signaling.  Let’s also assume, for the purposes of this
illustration, that the source arrives via an AXI stream interface with signals
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S_VALID&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S_DATA[15:0]&lt;/code&gt;, and a ready signal given by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S_READY&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We’ll start with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wide_clock&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;new_edge&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;half_edge&lt;/code&gt; signals from
the clock generator.  Note that, as we propagate these signals through our
pipeline (below), we won’t send the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wide_clock&lt;/code&gt; straight to the output pad,
but instead we’ll use it alongside our data processing pipeline.  This way,
if the pipeline must stall (and it might need to), the pipeline can also stall
the outgoing clock at the same time.&lt;/p&gt;

&lt;p&gt;Hence, we’ll create a one clock delayed version of this &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wide_clock&lt;/code&gt; that
we can call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;outgoing_clock&lt;/code&gt;.  Further, a second signal, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;active_clock&lt;/code&gt;,
can be used to keep track of whether or not we’ve committed to the current
clock cycle.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;src_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;outgoing_clock&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;4'h0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;active_clock&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;S_VALID&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S_READY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new_edge&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;second_edge&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;// We commit to this clock if either&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;// 1. We have new data and we are ready to consume this new data, *OR*&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;// 2. We're in SDR (not DDR) mode, and we've already committed&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;//	to a byte of data that we haven't (yet) sent.&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;// In both cases, we need to start a clock period.&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;//&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;// Note that S_READY implies new_edge&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;//&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;outgoing_clock&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wide_clock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

	&lt;span class=&quot;c1&quot;&gt;// The &quot;active_clock&quot; signal is used to let us know that we've committed&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;// to this clock cycle.  From now until the next new_edge, we must&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;// forward the wide_clock signal to the output.&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;active_clock&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new_edge&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;// The clock generator is creating an edge that ... we're not prepared&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;// for or ready to handle.  There's just no data available, so ...&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;// let's stop the clock.&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;outgoing_clock&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;4'h0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

	&lt;span class=&quot;c1&quot;&gt;// In this case, we're not forwarding the clock, nor will we until&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;// the next clock period.&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;active_clock&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;active_clock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;// If we've already committed to this clock cycle, then we'll need to&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;// continue it to its completion.&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;outgoing_clock&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wide_clock&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Before we can get to the data, we need another key signal as well.  This is
the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;second_edge&lt;/code&gt; signal that we used above.  Here’s why: our data is going to
arrive, 16b at a time via AXI stream.  If we are in DDR mode, then we’ll
consume 8b on each edge of this clock–and possibly all 16b at once.  However,
if we are only in SDR mode, then we’ll need to consume the second 8b on the
next clock edge.  Hence, we’re going to need a signal, which I’m calling
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;second_edge&lt;/code&gt;, to tell us that we have 8b remaining of the 16b committed to us
that didn’t get sent on the last clock tick.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;src_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_care_about_resets&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;second_edge&lt;/span&gt;   &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;S_VALID&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S_READY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;// In SDR, we just accepted 16b and output 8b.&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;// We need another new_edge to send the remaining 8b.&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;// Note that S_READY implies new_edge&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;//&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;// Also note that we only use this signal in SDR modes&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;second_edge&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ddrmode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new_edge&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;// On any (other) new_edge, we can clear this signal&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;second_edge&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;That leads us to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;outgoing_data&lt;/code&gt;.  This is a 16 bit data signal, consisting
of 8b, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;outgoing_data[15:8]&lt;/code&gt;, which will be output on the first half of the
clock, and another 8b, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;outgoing_data[7:0]&lt;/code&gt;, which will be output on the second
half of the clock.  A third signal, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;next_byte&lt;/code&gt;, will be used for keeping track
of the second byte of data in the case where we don’t output both bytes in the
same clock period.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;src_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_care_about_resets&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;outgoing_data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;next_byte&lt;/span&gt;   &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;S_VALID&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S_READY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
	&lt;span class=&quot;c1&quot;&gt;// new_edge is implied by S_READY&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ddrmode&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;half_edge&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// Set data for both halves of the clock&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;//    The first half in the MSBs&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;outgoing_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;15&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S_DATA&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;15&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;//    The second half in the LSBs&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;outgoing_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S_DATA&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;

	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// Set only the first half of the data, but set it to be&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// output twice.  We'll need to come back later for the second&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// outgoing byte.&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;outgoing_data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;S_DATA&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;15&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;}}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

	&lt;span class=&quot;c1&quot;&gt;// Keep track of that second byte, so we can come back to it later.&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;next_byte&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;S_DATA&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new_edge&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ddrmode&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;half_edge&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
	&lt;span class=&quot;n&quot;&gt;outgoing_data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;next_byte&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;}}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The final signal we need to define is the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S_READY&lt;/code&gt; signal.  In this example,
we can accept new data on any new clock edge, &lt;em&gt;unless&lt;/em&gt; we have 8b remaining
from the last clock edge that have yet to be output.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;&lt;span class=&quot;k&quot;&gt;assign&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;S_READY&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;new_edge&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;second_edge&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This approach provides us with several big advantages for our user logic over
what we had before.&lt;/p&gt;

&lt;p&gt;First and foremost, &lt;a href=&quot;/blog/2017/08/21/rules-for-newbies.html&quot;&gt;all of our user logic now takes place on the same
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src_clk&lt;/code&gt;&lt;/a&gt;.
We didn’t need any &lt;a href=&quot;/blog/2017/10/20/cdc.html&quot;&gt;CDCs&lt;/a&gt;.
AXI slave data, generated externally on this &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src_clk&lt;/code&gt;, can now be used within
our design on the same clock it was generated on.&lt;/p&gt;

&lt;p&gt;Second, did you notice how we were able to &lt;a href=&quot;/blog/2021/10/26/clk-gate.html&quot;&gt;simply gate the
clock&lt;/a&gt; when there was no
data available?  If not, go back up and look again at the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;active_clock&lt;/code&gt; signal.&lt;/p&gt;

&lt;p&gt;Third, unlike the previous approach, we’ve now guaranteed that this clock
signal won’t glitch.  That is, assuming the outgoing OSERDES won’t generate
glitches from our glitchless data signals.  The previous clock generator,
on the other hand, could well have had glitches between the clock and the
data enabling it.&lt;/p&gt;

&lt;p&gt;Also look at how easy it was to do pipelined processing.  The clock was
generated prior to our pipeline, and simply propagated through the pipeline.
Although this pipeline only contains a single clock cycle, we could’ve easily
extended the pipeline for multiple clock cycles if necessary by simply passing
the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wide_clock&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;new_edge&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;half_edge&lt;/code&gt; signals through the
pipeline–adjusting them if and where necessary along the way.&lt;/p&gt;

&lt;p&gt;As a result of this example, all IO pins can now be driven using a 4:1
OSERDES.  (You could also use
&lt;a href=&quot;/blog/2020/08/22/oddr.html&quot;&gt;ODDR&lt;/a&gt;s for the data, if you
trusted them to have the same timing relationship as the OSERDES.)&lt;/p&gt;
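
&lt;p&gt;As a rough sketch of what this looks like, the 16b &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;outgoing_data&lt;/code&gt; word might
be expanded into one 4-bit serializer word per data pin as shown below.  The
module name, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data_to_serdes_words&lt;/code&gt;, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pin_words&lt;/code&gt; packing are
hypothetical, and bits are assumed to be serialized MSB first.  The vendor
specific OSERDES primitive itself is deliberately left out.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;// Illustrative sketch only: build the 4-bit word for each of 8 data pins
module	data_to_serdes_words (
		input	wire	[15:0]	outgoing_data,
		// Hypothetical packing: pin k uses pin_words[4*k +: 4]
		output	wire	[31:0]	pin_words
	);

	genvar	gk;

	generate for(gk=0; gk &amp;lt; 8; gk=gk+1)
	begin : GEN_PIN_WORD
		// ASSUMPTION: the MSB of each 4-bit word is sent first.
		// The first two slots carry this pin's bit from the first
		// byte, outgoing_data[15:8], and the last two slots carry
		// its bit from the second byte, outgoing_data[7:0].
		assign	pin_words[4*gk +: 4] = {
				{(2){outgoing_data[8+gk]}},
				{(2){outgoing_data[gk]}} };
	end endgenerate
endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The clock pin needs no such expansion, since &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;outgoing_clock&lt;/code&gt; is already a
4-bit word that can be handed to its serializer as is.&lt;/p&gt;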

&lt;p&gt;What about frequency changes, or adjusting between the unshifted clock and
the clock shifted by 90 degrees?  What about when the clock is off, and needs
to be turned on?  All of these challenges and more now reside within the clock
generator.&lt;/p&gt;

&lt;h2 id=&quot;the-clock-generator&quot;&gt;The Clock Generator&lt;/h2&gt;

&lt;p&gt;For discussion purposes, let’s take a look at the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;clock generator&lt;/a&gt;
I used for &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;my SDIO/eMMC controller&lt;/a&gt;.  As
mentioned above, this
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;clock generator&lt;/a&gt;
has the particular requirement of being able to generate two outgoing clock
periods per system clock cycle, but otherwise it’s a fairly straightforward
example of the discussion above.&lt;/p&gt;

&lt;p&gt;There are a couple of configuration options.
For example, I wasn’t certain that I’d always have an 8:1 SERDES available
to me, nor do all digital environments necessarily offer 2:1
&lt;a href=&quot;/blog/2020/08/22/oddr.html&quot;&gt;ODDR&lt;/a&gt;
components.  Therefore, we allow those to be adjusted.  Second, I want to know
the maximum number of bits required in my clock divider.&lt;/p&gt;

&lt;p&gt;Still, these configuration parameters are fairly straightforward.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;&lt;span class=&quot;k&quot;&gt;module&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;sdckgen&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;#(&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// OPT_SERDES is required for generating an 8:1 output.&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;parameter&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;OPT_SERDES&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;

		&lt;span class=&quot;c1&quot;&gt;// If no 8:1 SERDES are available, we can still create a clock&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// using a 2:1 ODDR via OPT_DDR&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;parameter&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;OPT_DDR&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;

		&lt;span class=&quot;c1&quot;&gt;// To hit 100kHz from a 100MHz system clock, we'll need to&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// divide our 100MHz clock by 4, and then by another 250.&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// Hence, we'll need Lg(256)-2 bits.  (The first three speed&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// options are special)&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;localparam&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;LGMAXDIV&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;
	&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;clock generator&lt;/a&gt;
is primarily controlled via three signals.  The first tells us whether we want
our clock offset by 90 degrees for DDR outputs or not.  The second controls
the speed of the outgoing clock.  The final signal tells us we can shut the
clock down.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;		&lt;span class=&quot;kt&quot;&gt;input&lt;/span&gt;	&lt;span class=&quot;kt&quot;&gt;wire&lt;/span&gt;			&lt;span class=&quot;n&quot;&gt;i_cfg_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
		&lt;span class=&quot;kt&quot;&gt;input&lt;/span&gt;	&lt;span class=&quot;kt&quot;&gt;wire&lt;/span&gt;	&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LGMAXDIV&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;i_cfg_ckspd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
		&lt;span class=&quot;kt&quot;&gt;input&lt;/span&gt;	&lt;span class=&quot;kt&quot;&gt;wire&lt;/span&gt;			&lt;span class=&quot;n&quot;&gt;i_cfg_shutdown&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;When shut down, the wide clock output will be fixed at zero, as will both the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;new_edge&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;half_edge&lt;/code&gt; control signals.&lt;/p&gt;

&lt;p&gt;The shutdown signal is actually really useful at slow clock speeds.  Sure, you
could shut the clock down, as we did above, by just not forwarding it through
the pipeline.  On the other hand, once the clock has been shut down, you’d like
to be able to restart it on a dime.  The shutdown control signal to our
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;clock generator&lt;/a&gt;
allows us to do that.  Once set, the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;clock generator&lt;/a&gt;
takes the remainder of a clock cycle to shut down, and then stays ready to
restart the clock at a moment’s notice.&lt;/p&gt;

&lt;p&gt;The outputs from this module are just about what you would expect.  You
have the three signals we’ve already discussed.  In this case, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_ckstb&lt;/code&gt;
is the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;new_edge&lt;/code&gt; signal we’ve mentioned, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_hlfck&lt;/code&gt; is the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;half_edge&lt;/code&gt; signal, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_ckwide&lt;/code&gt; is the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wide_clock&lt;/code&gt; signal.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;		&lt;span class=&quot;c1&quot;&gt;//&lt;/span&gt;
		&lt;span class=&quot;kt&quot;&gt;output&lt;/span&gt;	&lt;span class=&quot;kt&quot;&gt;reg&lt;/span&gt;			&lt;span class=&quot;n&quot;&gt;o_ckstb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;	&lt;span class=&quot;c1&quot;&gt;// new_edge&lt;/span&gt;
		&lt;span class=&quot;kt&quot;&gt;output&lt;/span&gt;	&lt;span class=&quot;kt&quot;&gt;reg&lt;/span&gt;			&lt;span class=&quot;n&quot;&gt;o_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;	&lt;span class=&quot;c1&quot;&gt;// half_edge&lt;/span&gt;
		&lt;span class=&quot;kt&quot;&gt;output&lt;/span&gt;	&lt;span class=&quot;kt&quot;&gt;reg&lt;/span&gt;	&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;		&lt;span class=&quot;n&quot;&gt;o_ckwide&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;	&lt;span class=&quot;c1&quot;&gt;// wide_clock&lt;/span&gt;
		&lt;span class=&quot;kt&quot;&gt;output&lt;/span&gt;	&lt;span class=&quot;kt&quot;&gt;wire&lt;/span&gt;			&lt;span class=&quot;n&quot;&gt;o_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
		&lt;span class=&quot;kt&quot;&gt;output&lt;/span&gt;	&lt;span class=&quot;kt&quot;&gt;reg&lt;/span&gt;	&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LGMAXDIV&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;o_ckspd&lt;/span&gt;
	&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The two new signals are &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_clk90&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_ckspd&lt;/code&gt;.  These are feedback signals
returned to the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdaxil.v&quot;&gt;control module&lt;/a&gt;,
used to tell us when any frequency shift or phase shift operations are complete.&lt;/p&gt;

&lt;p&gt;These feedback signals solve an issue I was having in my &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;eMMC
controller&lt;/a&gt;, where the clock would
be at some crazy low frequency (100kHz or so), and I’d want to speed it up.
Just setting the new clock speed wasn’t enough, since it might take a thousand
clocks to finish a single cycle at the 100kHz clock speed.  However, &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/sw/emmcdrv.c#L1591-L1593&quot;&gt;by
checking these return signals via the register set, the software driver
could then tell if any clock frequency change had fully taken
effect&lt;/a&gt;
before going on to any next operation.&lt;/p&gt;
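&lt;p&gt;As a rough sketch of the same idea in hardware, the comparison the driver performs might look something like the following.  The module name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ckspd_settled&lt;/code&gt;, its port names, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_settled&lt;/code&gt; flag are all hypothetical and not part of the repository; the actual check is done in software through the register set.  Note also that the clock generator may replace an unsupported request with a supported one, in which case a direct comparison like this one would never match for that request.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;// Hypothetical helper for illustration only: reports when the feedback
// signals from the clock generator match what was requested.
module ckspd_settled #(
		parameter	LGMAXDIV = 8	// Assumed width, for illustration
	) (
		input	wire	[LGMAXDIV-1:0]	i_cfg_ckspd,	// Requested speed
		input	wire			i_cfg_clk90,	// Requested 90 degree offset
		input	wire	[LGMAXDIV-1:0]	i_ckspd,	// o_ckspd feedback
		input	wire			i_clk90,	// o_clk90 feedback
		output	wire			o_settled
	);

	// The change has taken effect once both feedback values match
	assign	o_settled = (i_ckspd == i_cfg_ckspd)
				&amp;amp;&amp;amp; (i_clk90 == i_cfg_clk90);
endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;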

&lt;p&gt;The next logic block is part of a two process finite state machine.  The first
process, shown below, is the combinatorial process.  The second will be
the clocked logic.&lt;/p&gt;

&lt;p&gt;Personally, I’m not a big fan of two process state machines.  I’m just not.
They often seem to me to be adding extra work and complexity.  However,
two process state machines allow me to reference logic results even before
the full logic path is complete.  They also give me the ability to describe
more complicated logic than the simple single process state machine, so
a two process state machine it is.&lt;/p&gt;

&lt;p&gt;In this case, we are going to generate the next signal for the strobe,
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nxt_stb&lt;/code&gt;, the clock, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nxt_clk&lt;/code&gt;, and the counter, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nxt_counter&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Of these signals, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nxt_clk&lt;/code&gt; is the simplest to explain.  This signal indicates
that we’re about to start a new clock cycle.  In many ways, this is the
combinatorial version of what is to become the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;new_edge&lt;/code&gt; once latched.&lt;/p&gt;

&lt;p&gt;Clock cycles themselves come in four phases, just like the four bits of the
wide clock we discussed before.  You can think of these phases as the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0110&lt;/code&gt;
of the fastest clock before.  The first bit, 0, is the first phase of the
clock.  Our &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;new_edge&lt;/code&gt; bit, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_ckstb&lt;/code&gt;, will only ever be true on this phase.
The second bit, 1, is where the clock rises.  The third bit, 1 again, is
the only phase where the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;half_edge&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_hlfck&lt;/code&gt;, will be set.  Finally, the
clock will return to zero in the last phase.  If the clock is ever idle,
it will idle in this first phase prior to delivering a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;new_edge&lt;/code&gt; signal.&lt;/p&gt;

&lt;p&gt;This background will help explain how I’ve divided up the counter.  There are
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NCTR&lt;/code&gt; bits to the counter.  Of those bits, the top two control the phase
bits we just described, whereas the others are the clock divider.  The
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nxt_stb&lt;/code&gt; signal, mentioned above and below, is simply a signal that these top
two phase-control bits are about to change.&lt;/p&gt;
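&lt;p&gt;To make this split concrete, here’s a minimal sketch that does nothing more than pull the two fields out of the counter.  The module name, the port names, and the default &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NCTR&lt;/code&gt; value of 10 are assumptions made for illustration only.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;// Illustration only: names and the default NCTR value are assumptions
module counter_fields #(
		parameter	NCTR = 10
	) (
		input	wire	[NCTR-1:0]	i_counter,
		output	wire	[1:0]		o_phase,	// Top two bits: clock phase
		output	wire	[NCTR-3:0]	o_divider	// Remaining bits: clock divider
	);

	assign	o_phase   = i_counter[NCTR-1:NCTR-2];
	assign	o_divider = i_counter[NCTR-3:0];
endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;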

&lt;p&gt;With that as background, let’s take a look at how this works.&lt;/p&gt;

&lt;p&gt;In general, the first step of any combinatorial block is to set all the
values that will be determined within the block.  This is a good practice
to get into to avoid accidentally generating any latches.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;nxt_stb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;nxt_clk&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;nxt_counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;From here, we subtract one from the bottom (non-phase) bits of our counter
on every cycle.  When these bits are zero, subtracting one will cause the
counter to underflow (borrow) and set our &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nxt_stb&lt;/code&gt; signal, so we know when to
adjust the phase bits.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;		&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nxt_stb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nxt_counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nxt_stb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// Advance the top two bits&lt;/span&gt;
			&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nxt_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nxt_counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
						&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nxt_counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;If our clock speed is set to 0 (wide clock of either &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;01100110&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;00110011&lt;/code&gt;)
or 1 (wide clock of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;00111100&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;00001111&lt;/code&gt;), then we are always generating
a new clock cycle.  In this case, we’ll hold the counter at zero and (roughly)
ignore the phase.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;			&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_DDR&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OPT_SERDES&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
				&lt;span class=&quot;n&quot;&gt;nxt_clk&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
				&lt;span class=&quot;n&quot;&gt;nxt_counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Likewise, if the clock speed is equal to two, the wide clock will either
alternate between &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0000_0000&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1111_1111&lt;/code&gt;, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0000_1111&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1111_0000&lt;/code&gt;,
and so our phase will alternate, but otherwise everything else can be kept
to zero.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;			&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
				&lt;span class=&quot;n&quot;&gt;nxt_clk&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
				&lt;span class=&quot;n&quot;&gt;nxt_counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Finally, in the more general case, we’ll just set the bottom bits to count
down from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ckspd-3&lt;/code&gt; to zero.  Yes, this is “just” a counter, but the maximum
value is offset by three for the three special speeds we just discussed above.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;			&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;
				&lt;span class=&quot;n&quot;&gt;nxt_counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ckspd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;You may have noticed that we’ve only adjusted the bottom bits of this
counter–the bits that count down.  We’ve done nothing to update the phase
bits at the top of this “counter”, so let’s handle those next.  (Spoiler alert:
these MSBs don’t act like counter bits in this implementation.)&lt;/p&gt;

&lt;p&gt;Of course, for the highest frequencies, the counter will never change.  It
sits at zero, with a permanent next phase of 3.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nxt_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_DDR&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OPT_SERDES&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;new_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
				&lt;span class=&quot;n&quot;&gt;nxt_counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mb&quot;&gt;2'b11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;}}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;When the speed setting is 2, we allow the top two bits to toggle back and
forth.  If &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nxt_clk&lt;/code&gt; is set, we need to reset these bits only.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;			&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
				&lt;span class=&quot;n&quot;&gt;nxt_counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;2'b01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;}}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Finally, for the general case, we return the phase to zero and reset the
clock.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;			&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
				&lt;span class=&quot;n&quot;&gt;nxt_counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This is only the first half of this “two process” FSM.  The second half,
with respect to the counter, is just about as simple.  Perhaps it is even more
so, given that we’ve done all of the hard work above.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_SERDES&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_DDR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;2'b11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;}}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;2'b01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;}}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nxt_clk&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_cfg_shutdown&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;2'b11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;}}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nxt_counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The big thing to notice here is the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nxt_clk &amp;amp;&amp;amp; i_cfg_shutdown&lt;/code&gt;.  Remember, if
the user ever asserts &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_cfg_shutdown&lt;/code&gt;, we need to wait for the clock cycle to
complete before shutting it down.  Hence, we wait for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nxt_clk&lt;/code&gt; signal
before acting.  Then, once set, we leave the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;counter&lt;/code&gt; in a state where it
will perpetually set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nxt_clk&lt;/code&gt;.  This way, the moment &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_cfg_shutdown&lt;/code&gt; is
released, we’ll be back to generating a clock again.&lt;/p&gt;

&lt;p&gt;To explain this a bit better, imagine the clock generator is producing
an output clock from ten periods of the source/system clock: five system clocks
of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0000_0000&lt;/code&gt;, followed by five more clocks of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1111_1111&lt;/code&gt;.  Imagine
again that we’ve had several periods of these 10 clock cycles before the
user asserts the clock shutdown signal.  We then wait another 10 cycles for the
clock to fully shut down.  Now, if the user drops the shutdown signal after a
further 3 cycles, we could either wait another 7 cycles (to complete the 10),
or start immediately.  Here, we try to arrange to start a stopped clock
immediately without violating any of our clocking rules.&lt;/p&gt;

&lt;p&gt;The next signal, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;clk90&lt;/code&gt;, controls whether we’re generating a
clock offset from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;new_edge&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_ckstb&lt;/code&gt;, by 90 degrees.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

	&lt;span class=&quot;k&quot;&gt;assign&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;o_clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This logic isn’t very interesting yet, since we’ve basically split a two
process FSM.  It will become more so when we get to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;w_clk90&lt;/code&gt;, and the first
process of the FSM, below.  The key is, this logic must determine what the
current 90 degree offset setting is.  Hence, when you look at the outgoing
wide clock, this signal must match it.&lt;/p&gt;

&lt;p&gt;How about the clock speed?  In this case, we go through some error checking.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;initial&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_SERDES&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8'd0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_DDR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8'd1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8'd2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_SERDES&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8'd0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_DDR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8'd1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8'd2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w_ckspd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_SERDES&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;new_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_cfg_ckspd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_DDR&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_cfg_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_cfg_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;new_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_cfg_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_DDR&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_cfg_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;new_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_cfg_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;new_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;new_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_cfg_ckspd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

	&lt;span class=&quot;k&quot;&gt;assign&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;w_clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nxt_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_cfg_clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;assign&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;w_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nxt_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;new_ckspd&lt;/span&gt;   &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ckspd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The error checking is here to guarantee that a clock speed of 0 is only used
when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_SERDES&lt;/code&gt; is set.  Likewise, a clock speed of 1 may be used in
&lt;a href=&quot;/blog/2020/08/22/oddr.html&quot;&gt;ODDR&lt;/a&gt;
mode (wide clock of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;00001111&lt;/code&gt;), but not when the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;clk90&lt;/code&gt; configuration
is set (calling for a wide clock of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0011_1100&lt;/code&gt; which is too complex for an
&lt;a href=&quot;/blog/2020/08/22/oddr.html&quot;&gt;ODDR&lt;/a&gt; output module to produce).
The same reasoning applies to a clock speed of two, which is fine for a non-offset clock
(wide clock of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0000_0000&lt;/code&gt; followed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1111_1111&lt;/code&gt;), but not for an offset
clock (wide clock of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0000_1111&lt;/code&gt; followed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1111_0000&lt;/code&gt;) unless the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_DDR&lt;/code&gt;
option is set.&lt;/p&gt;

&lt;p&gt;Finally, the two values &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;w_clk90&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;w_ckspd&lt;/code&gt; are used to tell us what
values our registered logic should use when generating a clock.  As such,
they are either the registered values, or (when we’re about to start a new
cycle) the new values.&lt;/p&gt;

&lt;p&gt;With all this as background, we can now dig into the core of this
logic–generating the three key signals we will be outputting.&lt;/p&gt;

&lt;p&gt;On reset, these signals will simply be set to indicate a clock of the
fastest rate, ready to go, but otherwise one that is idle (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_ckwide=0&lt;/code&gt;).&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;initial&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;o_ckstb&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;initial&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;o_hlfck&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;initial&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;o_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;o_ckstb&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;o_hlfck&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;o_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Next, if we want to shut down the clock, we can only do so on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nxt_clk&lt;/code&gt;.
When shut down, the wide clock will be zero and the new edge signals will
all be suppressed.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nxt_clk&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_cfg_shutdown&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;o_ckstb&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;o_hlfck&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;o_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'h0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;As mentioned above, the key here is that the clock can suddenly start if
the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_cfg_shutdown&lt;/code&gt; signal is released.  Using this logic, it does not need
to remain phase coherent with whatever phase the clock had prior to being
shut down.&lt;/p&gt;

&lt;p&gt;Moving on to our highest speed clock, we simply set that according to
the 90 degree clock configuration.  In general, this speed will only
ever generate one of two values: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;01100110&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;00110011&lt;/code&gt;.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_SERDES&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;o_ckstb&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;o_hlfck&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;o_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_cfg_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'h66&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'h33&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;When running from a 100MHz system (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src_clk&lt;/code&gt;) clock, this plus the OSERDES
will generate a 200MHz clock signal to the external device.&lt;/p&gt;

&lt;p&gt;One might argue that the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_SERDES&lt;/code&gt; here is really redundant.  There should
be enough logic elsewhere to keep &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;w_ckspd&lt;/code&gt; at a non-zero value if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_SERDES&lt;/code&gt;
is not set.  Why use it?&lt;/p&gt;

&lt;p&gt;It’s here specifically to provide a strong hint to the synthesis tool
regarding logic that can be cleaned up if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_SERDES&lt;/code&gt; is not set.  This block
is complicated enough as it is, so adding it in should simplify our logic.&lt;/p&gt;

&lt;p&gt;The problem with putting this value here, and generating a clock module based
upon parameters such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_SERDES&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_DDR&lt;/code&gt;, is that I now need to
formally verify the IP under several conditions before I can know if it works.
This applies to simulation as well.  It is now no longer sufficient to run
the simulation tool once when you do something like this.  It must now be run
many times under different conditions.  As an engineer, I need to be aware
of costs like this whenever I invoke logic like this.&lt;/p&gt;

&lt;p&gt;In this case, I wanted to support multiple types of FPGAs (and/or ASICs), and
so this was the logic I chose.&lt;/p&gt;

&lt;p&gt;Our next speed, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ckspd=1&lt;/code&gt;, has almost the same logic.  As before, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_ckstb&lt;/code&gt;
and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_hlfck&lt;/code&gt; are both set continually in this mode.  In this case, our wide
clock output will either be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0011_1100&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0000_1111&lt;/code&gt; depending on whether
or not we need a 90 degree offset clock for DDR.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_SERDES&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OPT_DDR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;o_ckstb&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;o_hlfck&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;o_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_SERDES&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'h3c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'h0f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;When running from a 100MHz system (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src_clk&lt;/code&gt;) clock, this generates a 100MHz
clock as well.&lt;/p&gt;

&lt;p&gt;You may note that there’s no real two-cycle output signal.  The signaling,
with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_ckstb&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_hlfck&lt;/code&gt;, allows us to describe a new clock together
with or separate from the second half of that clock period, but offers nothing
for describing two clock cycles in the same source clock period.  This is
just a limitation in our chosen signaling.&lt;/p&gt;

&lt;p&gt;The solution to this problem is specific to the &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;eMMC
controller&lt;/a&gt; that we’ve drawn &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;our
example&lt;/a&gt;
from.  In this case, I look at both the DDR setting and the
clock speed before generating any transmit data.  From this, I determine if
I should be sending one byte, two bytes, or four bytes of data per clock.
The actual logic is more complex, due to the fact that the eMMC interface
may run in 1b, 4b, or 8b modes, but that’s the story of &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdtxframe.v&quot;&gt;another piece of logic,
found outside of the clock controller&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As with clock speeds of either 0 (200MHz) or 1 (100MHz), the clock speed of 2
(50MHz) is also handled specially.  This is the speed that alternates between
two outputs, generating either &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;00001111&lt;/code&gt; followed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;11110000&lt;/code&gt; in the offset
mode (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_clk90=1&lt;/code&gt;), or simply &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;00000000&lt;/code&gt; followed by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;11111111&lt;/code&gt; in the normal
mode.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o_ckstb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o_hlfck&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nxt_counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;2'b10&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;2'b01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w_clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_SERDES&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OPT_DDR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;o_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nxt_counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'h0f&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'hf0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;o_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nxt_counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'h00&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'hff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;When running from a 100MHz system clock (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src_clk&lt;/code&gt; above), this generates
a 50MHz output clock signal.  This might be the “fastest” speed you would
normally think of for an integer clock “divider”.  As you can see, though,
we’ve already generated outgoing 200MHz and 100MHz clocks above.&lt;/p&gt;
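
&lt;p&gt;One way to convince yourself of these three rates is to count the rising
edges in the wide clock words.  What follows is only a sketch: the names
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tb_count_edges&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count_rising&lt;/code&gt; are hypothetical and not part of the
controller, and it assumes the most significant bit of the wide clock is sent
first.  With a 100MHz source clock, two rising edges per source clock cycle
correspond to 200MHz, one to 100MHz, and one per pair of source clock cycles
to 50MHz.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;module tb_count_edges;
	// Hypothetical helper, for illustration only.  Counts the rising
	// edges within one 8-bit wide clock word, assuming the MSB is sent
	// first.  &quot;last&quot; is the final bit of the previous word.
	function automatic [3:0] count_rising(input [7:0] word, input last);
		integer	k;
		reg	prev;
	begin
		count_rising = 0;
		prev = last;
		for(k=7; k &amp;gt;= 0; k=k-1)
		begin
			if (!prev &amp;amp;&amp;amp; word[k])
				count_rising = count_rising + 1;
			prev = word[k];
		end
	end endfunction

	initial begin
		// Speed 0: two rising edges in every source clock cycle
		$display(&quot;8'h33: %0d&quot;, count_rising(8'h33, 1'b1));
		// Speed 1: one rising edge in every source clock cycle
		$display(&quot;8'h0f: %0d&quot;, count_rising(8'h0f, 1'b1));
		// Speed 2: one rising edge across two source clock cycles
		$display(&quot;8'h00: %0d&quot;, count_rising(8'h00, 1'b1));
		$display(&quot;8'hff: %0d&quot;, count_rising(8'hff, 1'b0));
		$finish;
	end
endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The edge count doesn’t depend on the offset: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;8'h66&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;8'h3c&lt;/code&gt; carry the
same number of rising edges as their non-offset counterparts, just shifted
by a quarter of a period.&lt;/p&gt;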

&lt;p&gt;This brings us to the general case–a divided clock running at less than half
our source clock rate.  Here, we’ve already done all of the hard work for
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nxt_clk&lt;/code&gt;, so the outgoing next edge signal &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_ckstb&lt;/code&gt; is done.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;o_ckstb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nxt_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The half edge signal is determined by the counter.  The lower bits must be zero,
indicating a new phase, and the top two bits indicate the new phase will be
the third of four–so just entering halfway.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;		&lt;span class=&quot;n&quot;&gt;o_hlfck&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mb&quot;&gt;2'b01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;}}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The wide clock is determined by the top two phase bits of the next counter.
It’s either equal to the most significant bit, when there’s no clock offset,
or the exclusive OR of the top two bits when there is.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;o_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nxt_counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
						&lt;span class=&quot;o&quot;&gt;^&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nxt_counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;}}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;o_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nxt_counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;}}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This leaves us with only one final signal: the current clock speed.  In this
case, all the work has been done above, and nothing more need be done with it.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;o_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w_ckspd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;That’s the basic idea.  In summary:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;There are four phases to the outgoing clock, either &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0011&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0110&lt;/code&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A counter generally helps us know when to transition from one phase to the
next.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;High speeds get special attention.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Data changes on the outgoing next edge signal, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_ckstb&lt;/code&gt;.&lt;/p&gt;

    &lt;p&gt;In DDR modes, data can also change on the outgoing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_hlfck&lt;/code&gt; signal.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key features of this approach include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;There’s no need for any &lt;a href=&quot;/blog/2017/10/20/cdc.html&quot;&gt;clock domain
crossings&lt;/a&gt; in the outgoing data
path.  All outgoing signals are handled in the source clock domain.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The clock may be gated at will, and (re)started quickly if necessary.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Frequency changes are controlled, and will take place between clock periods.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Although the clock is generated in logic, it doesn’t trigger any logic.
That is, nowhere in the design will anything in the outgoing logic path
depend upon either &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@(posedge dev_clk)&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;@(negedge dev_clk)&lt;/code&gt;.  Instead,
all of the logic is triggered off of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_ckstb&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_hlfck&lt;/code&gt; signals
while still running on the same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src_clk&lt;/code&gt; we started from.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
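
&lt;p&gt;To illustrate that last point, here’s a minimal sketch of what downstream
logic might look like.  It is not taken from the eMMC controller: the module
name, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_load&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_data&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_bit&lt;/code&gt; are all assumptions made for the example,
and it only covers the slower, single data rate speeds where at most one bit
goes out per source clock cycle.  The point is simply that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_ckstb&lt;/code&gt; gets
used as an enable, while the only clock in the sensitivity list remains the
source clock.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;// Hypothetical transmit shift register, for illustration only
module tx_shift_sketch (
		input	wire		i_clk, i_reset,
		// Wired to the clock generator's o_ckstb output
		input	wire		i_ckstb,
		// Assumed control inputs: load a new byte to send
		input	wire		i_load,
		input	wire	[7:0]	i_data,
		output	wire		o_bit
	);

	reg	[7:0]	sreg;

	initial	sreg = 8'hff;
	always @(posedge i_clk)
	if (i_reset)
		sreg &amp;lt;= 8'hff;
	else if (i_ckstb)
	begin
		// Advance once per device clock period, without ever
		// using the device clock as a clock
		if (i_load)
			sreg &amp;lt;= i_data;
		else
			sreg &amp;lt;= { sreg[6:0], 1'b1 };
	end

	assign	o_bit = sreg[7];
endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Because everything lives in one clock domain, a block like this can be
simulated and formally verified without ever modeling the device clock as a
separate clock.&lt;/p&gt;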

&lt;p&gt;But … does it work?&lt;/p&gt;

&lt;h2 id=&quot;simulation-testing&quot;&gt;Simulation testing&lt;/h2&gt;

&lt;p&gt;Just to get this clock generator off the ground, I built a &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/verilog/tb_sdckgen.v&quot;&gt;quick simulation
test bench&lt;/a&gt;.  You can
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/verilog/tb_sdckgen.v&quot;&gt;find it here&lt;/a&gt;, and we’ll walk through it quickly.&lt;/p&gt;

&lt;p&gt;The first step was pretty boilerplate.  I simply started a VCD trace, placed
the design into reset, and generated a 100MHz clock.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;initial&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;p&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;dumpfile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;tb_sdckgen.vcd&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;p&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;dumpvars&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tb_sdckgen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;reset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;forever&lt;/span&gt;
			&lt;span class=&quot;p&quot;&gt;#&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;For the second step, I wanted to place the design in a variety of
configurations to see how it would work in each.  I chose to leave it in each
configuration for five clock cycles before moving to the next.  I then defined
a simple task, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;capture_beats&lt;/code&gt;, that I could call to wait out five cycles of
a given clock setting before moving on.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;task&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;capture_beats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;kt&quot;&gt;repeat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;wait&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;w_ckstb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
			&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;endtask&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The last step, then, was to walk through one clock setting after another
to see what would happen.&lt;/p&gt;

&lt;p&gt;I started by taking the design out of reset, and configuring the inputs for
a (rough) 100kHz clock.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;initial&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_shutdown&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;10'h0fc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;kt&quot;&gt;repeat&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;reset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

		&lt;span class=&quot;c1&quot;&gt;// 100kHz (10us)&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;capture_beats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;You can pretty well read the comments below to see the configurations I checked.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;		&lt;span class=&quot;c1&quot;&gt;// 200 kHz (5us)&lt;/span&gt;
		&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_shutdown&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;10'h07f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;capture_beats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

		&lt;span class=&quot;c1&quot;&gt;// 400 kHz (2.52us)&lt;/span&gt;
		&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_shutdown&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;10'h041&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;capture_beats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

		&lt;span class=&quot;c1&quot;&gt;//   1MHz (1us)&lt;/span&gt;
		&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_shutdown&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;10'h01b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;capture_beats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

		&lt;span class=&quot;c1&quot;&gt;//   5MHz (200ns)&lt;/span&gt;
		&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_shutdown&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;10'h007&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;capture_beats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

		&lt;span class=&quot;c1&quot;&gt;//  12MHz (80ns)&lt;/span&gt;
		&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_shutdown&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;10'h004&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;capture_beats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

		&lt;span class=&quot;c1&quot;&gt;//  25MHz (40ns)&lt;/span&gt;
		&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_shutdown&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;10'h003&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;capture_beats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

		&lt;span class=&quot;c1&quot;&gt;//  50MHz (20ns)&lt;/span&gt;
		&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_shutdown&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;10'h002&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;capture_beats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

		&lt;span class=&quot;c1&quot;&gt;// 100MHz&lt;/span&gt;
		&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_shutdown&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;10'h001&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;capture_beats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

		&lt;span class=&quot;c1&quot;&gt;// 200MHz&lt;/span&gt;
		&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_shutdown&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;10'h000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;capture_beats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;


		&lt;span class=&quot;c1&quot;&gt;//  25MHz, CLK90&lt;/span&gt;
		&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_shutdown&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;10'h103&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;capture_beats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

		&lt;span class=&quot;c1&quot;&gt;//  50MHz, CLK90&lt;/span&gt;
		&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_shutdown&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;10'h102&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;capture_beats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

		&lt;span class=&quot;c1&quot;&gt;// 100MHz, CLK90&lt;/span&gt;
		&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_shutdown&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;10'h101&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;capture_beats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

		&lt;span class=&quot;c1&quot;&gt;// 200MHz, CLK90&lt;/span&gt;
		&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_shutdown&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cfg_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;10'h100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;capture_beats&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

		&lt;span class=&quot;p&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;finish&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
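
&lt;p&gt;Reading the comments against the hexadecimal settings, the low eight bits (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cfg_ckspd&lt;/code&gt;) appear to follow a simple pattern.  The sketch below captures that pattern as a function.  It is inferred only from the comments in this test bench, it assumes the 10ns source clock used here, and the module and function names are hypothetical.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	module ckspd_period;
		// Device clock period in ns, as inferred from the comments
		// above.  Assumes a 100MHz (10ns) source clock.
		function integer period_ns(input integer ckspd);
		begin
			case(ckspd)
			0: period_ns = 5;	// 200MHz
			1: period_ns = 10;	// 100MHz
			2: period_ns = 20;	//  50MHz
			// Slower speeds: four source clocks per step
			default: period_ns = 40 * (ckspd - 2);
			endcase
		end endfunction

		initial begin
			$display(&quot;cfg_ckspd=8'hfc: %0d ns&quot;, period_ns(252));
			$display(&quot;cfg_ckspd=8'h41: %0d ns&quot;, period_ns(65));
			$display(&quot;cfg_ckspd=8'h03: %0d ns&quot;, period_ns(3));
		end
	endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Under that reading, a setting of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;8'h41&lt;/code&gt; works out to 2520ns, which would explain why the 400kHz comment reads 2.52us rather than 2.5us.&lt;/p&gt;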

&lt;p&gt;These are basically all of the configurations I wanted to use the design with.
Using the generated trace, I can visually see all of the signals within this
design working as intended.  Further, unlike the formal verification we’ll
discuss next, I can actually see &lt;em&gt;many&lt;/em&gt; clocks of this design.  This allows
me to verify, for example, that the 100kHz, 200kHz, and 400kHz clock divisions
work as designed.&lt;/p&gt;

&lt;p&gt;Sadly, this test is woefully inadequate for any real or professional purpose.&lt;/p&gt;

&lt;p&gt;The biggest problem with &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/verilog/tb_sdckgen.v&quot;&gt;this simple test bench
script&lt;/a&gt;
is that it’s not self checking.  I can run it, but the only way to know if the
design did the right thing or not is to pull up a viewer and check the
&lt;a href=&quot;/blog/2017/07/31/vcd.html&quot;&gt;VCD file&lt;/a&gt;.
Sure, this might get me off the ground, but it is &lt;em&gt;horrible&lt;/em&gt; for maintenance.
How should I know, for example, if a small and otherwise minor change breaks
things?&lt;/p&gt;
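
&lt;p&gt;As a rough sketch of what a self-checking addition might look like, the monitor below counts source clocks between strobes and stops the simulation on a mismatch.  This is not part of the linked test bench.  The module name, its ports, and the assumption of exactly one single-cycle strobe per device clock period are all mine, so it would only apply to the divided (slower) clock settings.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	// Hypothetical monitor, not part of tb_sdckgen.v
	module stb_period_check (
			input	wire		clk, reset, ckstb,
			// Expected number of source clocks per strobe
			input	wire	[31:0]	expected
		);

		// Clocks since the last strobe.  Zero means none seen yet.
		integer	count;

		initial	count = 0;
		always @(posedge clk)
		if (reset)
			count &amp;lt;= 0;
		else if (ckstb)
		begin
			// Nothing to measure on the very first strobe
			if (count != 0 &amp;amp;&amp;amp; count != expected)
			begin
				$display(&quot;FAIL: %0d clocks between strobes&quot;, count);
				$finish;
			end
			count &amp;lt;= 1;
		end else if (count != 0)
			count &amp;lt;= count + 1;
	endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;A monitor like this would need to be held in reset across each configuration change, since the first period following a change may still run at the old speed.&lt;/p&gt;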

&lt;p&gt;The second problem with &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/verilog/tb_sdckgen.v&quot;&gt;this test
bench&lt;/a&gt;
is that it does nothing to try out unreasonable input signals.  How shall I
know, for example, that this design will never go faster than the fastest
allowed frequency?  That is, it should only ever be able to go as fast as the
current speed, or the newly commanded speed.&lt;/p&gt;

&lt;p&gt;Perhaps some of you may remember my comments on Twitter about getting excited
to try this new design as a whole (not just the clock generator) on an FPGA,
only to be mildly (not) surprised that it didn’t work before all the formal
proofs were finished?  (I couldn’t find them when I looked today …)  Yeah,
there’s always a surprise you aren’t expecting that takes place when you work
with real hardware.&lt;/p&gt;

&lt;p&gt;So, while &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/verilog/tb_sdckgen.v&quot;&gt;this&lt;/a&gt;
looks nice, and while the resulting traces look really pretty,
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/verilog/tb_sdckgen.v&quot;&gt;this test bench&lt;/a&gt;
is highly insufficient.&lt;/p&gt;

&lt;p&gt;Let’s move on to something more substantial.&lt;/p&gt;

&lt;h2 id=&quot;formal-properties&quot;&gt;Formal Properties&lt;/h2&gt;

&lt;p&gt;I like to think of &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;this clock
module&lt;/a&gt;
as a basic clock divider.  It’s not much more than a glorified counter,
together with a 4-state phase machine.  Yeah, sure, you can run through all 4
states in one clock cycle, but it’s still not really all that much more.
Formally verifying &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;this clock
generator&lt;/a&gt;
should therefore be pretty simple.&lt;/p&gt;

&lt;p&gt;One of the big keys to this proof is &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v&quot;&gt;the interface property
set&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://zipcpu.com/formal/2020/06/12/four-keys.html&quot;&gt;I’ve discussed interface properties
before&lt;/a&gt;.  The idea is born
from the fact that one component, such as &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;this clock generator&lt;/a&gt;,
is going to generate signals that another component, in this case &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdtxframe.v&quot;&gt;the transmit
data generator&lt;/a&gt;,
will use.  Further, these two proofs will be independent of each other.  Hence,
anything the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdtxframe.v&quot;&gt;transmitter’s&lt;/a&gt;
proof needs to assume should then be asserted in the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;clock generator&lt;/a&gt;
and vice versa.  That’s the purpose of the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v&quot;&gt;property set&lt;/a&gt;.
The &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v&quot;&gt;property set&lt;/a&gt;
also greatly simplifies the assertions found within the design itself.&lt;/p&gt;
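
&lt;p&gt;To make the assume-here, assert-there idea concrete, here is a minimal sketch of a shared property module.  It is not the contents of the linked property set.  The rule it states is a placeholder, and the module name and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;F_OPT_ASSUME&lt;/code&gt; parameter are hypothetical.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	// Sketch only: a placeholder rule, not taken from fclk.v
	module fstb_rule #(
			// 1: constrain the inputs (the consumer's proof)
			// 0: check the producer (the generator's proof)
			parameter [0:0]	F_OPT_ASSUME = 1'b0
		) (
			input	wire	i_en, i_ckstb
		);

	`ifdef	FORMAL
		generate if (F_OPT_ASSUME)
		begin : GEN_ASSUME
			always @(*)
			if (!i_en)
				assume(!i_ckstb);
		end else begin : GEN_ASSERT
			always @(*)
			if (!i_en)
				assert(!i_ckstb);
		end endgenerate
	`endif
	endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The same rule is written once.  Whichever component drives the signals instantiates it to check the rule, and whichever component consumes them instantiates it to rely on the rule.&lt;/p&gt;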

&lt;p&gt;Still, let’s look over the design assertions for now.  We’ll come back to
the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v&quot;&gt;property set&lt;/a&gt; in the next section.&lt;/p&gt;

&lt;p&gt;We’ll start with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f_en&lt;/code&gt; signal.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;initial&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;f_en&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;f_en&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nxt_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;f_en&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_cfg_shutdown&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This just captures whether the clock should be shut down during the current
cycle or not.  It’s that simple.&lt;/p&gt;

&lt;p&gt;Many engineers just starting out with formal verification struggle to see
past the assertions and the assumptions within the language to realize they
can still use regular Verilog when generating formal properties.  In this
case, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f_en&lt;/code&gt; is nothing more than a register which we are going to use in our
formal proof.  Nothing prevents you from doing this.  Indeed, you are more
than able to write &lt;a href=&quot;https://zipcpu.com/formal/2019/02/21/txuart.html&quot;&gt;more complicated state
machines&lt;/a&gt;
when generating formal properties as well.&lt;/p&gt;

&lt;p&gt;Just make sure that your new logic doesn’t use the same expressions as the
logic you are verifying, or you might convince yourself something works when
it doesn’t.  When teaching, I like to explain it this way: the best way to
verify that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt; divided by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B&lt;/code&gt; is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C&lt;/code&gt; is to multiply &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;B&lt;/code&gt; together.
If the result of the multiply is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt;, then you’ve verified your result.  Why
does this work?  Because you use different logic paths in your brain for
division than you do for multiplication.  Hence, if you make a mistake in
dividing, you aren’t likely to make the same mistake when multiplying.&lt;/p&gt;

&lt;p&gt;The same is true of formal methods.  You can use logic in formal methods, just
like you do in your design.  You just don’t want to use the same logic, lest
your mind falsely convince you it’s right when it isn’t.  This is sort of
like having one witness to a murder called onto the stand twice under the
same name.&lt;/p&gt;
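
&lt;p&gt;Here is a small sketch of that principle, using a made-up down counter rather than anything from this design.  The checking logic counts &lt;em&gt;up&lt;/em&gt; from zero instead of repeating the design’s subtraction, and the assertion ties the two together.  All of the names are hypothetical.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	// Illustration only: a down counter checked by an up counter
	module downcount #(
			parameter [3:0]	START = 4'd9
		) (
			input	wire		i_clk, i_reset,
			output	reg	[3:0]	o_count
		);

		initial	o_count = START;
		always @(posedge i_clk)
		if (i_reset || o_count == 0)
			o_count &amp;lt;= START;
		else
			o_count &amp;lt;= o_count - 1;

	`ifdef	FORMAL
		// Shadow logic: clocks elapsed since the last reload
		reg	[3:0]	f_elapsed;

		initial	f_elapsed = 0;
		always @(posedge i_clk)
		if (i_reset || f_elapsed == START)
			f_elapsed &amp;lt;= 0;
		else
			f_elapsed &amp;lt;= f_elapsed + 1;

		// The two counters must always sum to the reload value
		always @(*)
			assert(o_count + f_elapsed == START);
	`endif
	endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Had the shadow register simply copied the design’s own decrement, the assertion would pass even if that decrement were wrong.&lt;/p&gt;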

&lt;p&gt;Anyway, let’s move on.&lt;/p&gt;

&lt;p&gt;The next step is to instantiate a copy of &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v&quot;&gt;the clock interface
properties&lt;/a&gt;.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;n&quot;&gt;fclk&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;#(&lt;/span&gt;
		&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_SERDES&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_SERDES&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
		&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_DDR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_DDR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;u_ckprop&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
		&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;//&lt;/span&gt;
		&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_en&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f_en&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
		&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckspd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o_ckspd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
		&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;//&lt;/span&gt;
		&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckstb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o_ckstb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
		&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
		&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o_ckwide&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;//&lt;/span&gt;
		&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f_pending_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f_pending_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
		&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f_pending_half&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f_pending_half&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;See how simple that was?&lt;/p&gt;

&lt;p&gt;In addition to the assertions within &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v&quot;&gt;this property
set&lt;/a&gt;,
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v&quot;&gt;the property set&lt;/a&gt;
provides two output signals that we can use to connect the state of our
design to the internal state of &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v&quot;&gt;the property
set&lt;/a&gt;.
These signals are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f_pending_reset&lt;/code&gt;&lt;/p&gt;

    &lt;p&gt;This otherwise annoying signal is required for us to be able to handle
the clock anomalies between reset and the first clock strobe.  This signal is
set on a reset, and released once the clock gets started.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f_pending_half&lt;/code&gt;&lt;/p&gt;

    &lt;p&gt;This signal is simpler.  It simply means that we’ve seen the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;new_edge&lt;/code&gt;
(&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_ckstb&lt;/code&gt;) but not yet the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;half_edge&lt;/code&gt;, herein called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_hlfck&lt;/code&gt;.  If
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f_pending_half&lt;/code&gt; is true, then the clock must generate &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_hlfck&lt;/code&gt; before it
can generate &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_ckstb&lt;/code&gt;.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With these signals, we can express things like this:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_reset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o_hlfck&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o_ckstb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f_pending_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f_pending_half&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;2'b10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This helps us through long periods of time with neither &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_hlfck&lt;/code&gt; nor &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_ckstb&lt;/code&gt;.
During this time, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f_pending_half&lt;/code&gt; should be equivalent to the top two bits
of our counter being either &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2'b00&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2'b01&lt;/code&gt;.&lt;/p&gt;
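
&lt;p&gt;If the comparison against &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2'b10&lt;/code&gt; looks odd, note that it amounts to
a test of the counter’s top bit: the top two bits are below &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2'b10&lt;/code&gt;
exactly when the most significant bit is clear.  Here’s a minimal standalone
sketch of that equivalence.  The module name and the default value of
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NCTR&lt;/code&gt; are placeholders chosen for illustration, and aren’t taken from
the repository.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;module	f_tophalf_sketch #(
		// Hypothetical counter width, chosen only for illustration
		parameter	NCTR = 10
	) (
		input	wire	[NCTR-1:0]	counter
	);

	wire	below_half, msb_clear;

	// The comparison used in the assertion above
	assign	below_half = (counter[NCTR-1:NCTR-2] &amp;lt; 2'b10);
	// The same condition, written as a test of the top bit
	assign	msb_clear  = !counter[NCTR-1];

`ifdef	FORMAL
	// Holds for every counter value: 2'b00 and 2'b01 are the only
	// two-bit values below 2'b10
	always @(*)
		assert(below_half == msb_clear);
`endif
endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;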

&lt;p&gt;Let’s look at some other assertions.&lt;/p&gt;

&lt;p&gt;For example, if we shut the clock down, then we shouldn’t get any more new
edges, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_ckstb&lt;/code&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f_past_valid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;($&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;past&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_reset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_cfg_shutdown&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o_ckstb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Now we can look at some of the specific options.  For example, the clock
speed should only be zero (200MHz) if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_SERDES&lt;/code&gt; is set.  While set to zero,
either &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_ckstb&lt;/code&gt; should be set on every clock cycle or we should’ve received
a clock shutdown request.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_SERDES&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o_ckstb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;$&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;past&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_cfg_shutdown&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
				&lt;span class=&quot;o&quot;&gt;||&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mb&quot;&gt;2'b11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;}}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Likewise, we should only ever be in a clock speed of 1 (100MHz) if either
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_SERDES&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_DDR&lt;/code&gt; is set.  Further, if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_SERDES&lt;/code&gt; is not set, we
shouldn’t ever be implementing a 90 degree clock offset.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_SERDES&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OPT_DDR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_SERDES&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
				&lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mb&quot;&gt;2'b11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;}}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;A clock speed of two (50MHz) is available to all configurations.  In this case,
the bottom bits–the non-phase description bits–must always be zero.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
				&lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mb&quot;&gt;2'b01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;}}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
				&lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mb&quot;&gt;2'b10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;}}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
				&lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mb&quot;&gt;2'b11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;}}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Finally, in all other clock speeds, all we insist is that the lower bits of
the counter be no greater than the clock speed minus three.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ckspd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;There are only two ways both &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_ckstb&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_hlfck&lt;/code&gt; can be true at once.
The first is if the speed indicates either 200MHz or 100MHz.  The second is
if the clock is stopped, and so the wide clock output is zero and a new
clock is expected on the next clock cycle.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_reset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o_ckstb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nxt_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
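
&lt;p&gt;An assertion like this one only says what must hold &lt;em&gt;if&lt;/em&gt; both strobes are
high.  It doesn’t show that either case can actually happen.  One way to check
might be a pair of cover statements, sketched below.  This is not part of the
repository: the module name is hypothetical, the eight-bit width of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ckspd&lt;/code&gt;
is an assumption, and the covers only mean something once these ports are
wired to the real signals inside the clock generator.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;module	f_both_edges_cover_sketch (
		input	wire		i_clk, i_reset,
		// Signals borrowed from the clock generator.  The width
		// of ckspd is assumed here.
		input	wire	[7:0]	ckspd, o_ckwide,
		input	wire		o_ckstb, o_hlfck, nxt_clk
	);

`ifdef	FORMAL
	always @(posedge i_clk)
	if (!i_reset &amp;amp;&amp;amp; o_ckstb &amp;amp;&amp;amp; o_hlfck)
	begin
		// Case 1: one of the two fastest speeds
		cover(ckspd &amp;lt;= 1);
		// Case 2: a stopped clock that is about to restart
		cover(ckspd &amp;gt; 1 &amp;amp;&amp;amp; o_ckwide == 0 &amp;amp;&amp;amp; nxt_clk);
	end
`endif
endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;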

&lt;p&gt;The difficulty is that these assertions alone aren’t enough to
constrain the output of the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;clock
generator&lt;/a&gt;.
To make certain the outputs are properly limited, I enumerate each possible
output together with the conditions under which it may be produced.&lt;/p&gt;

&lt;p&gt;We’ll start with a zero output.  This can come from either a stopped clock,
or one of two slow clock situations.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;case&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o_ckwide&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;mh&quot;&gt;8'h00&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nxt_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// A stopped clock&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mb&quot;&gt;2'b11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;}}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt;
					&lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// In slow situations with no offset&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// In slow (DDR) situations with a 90 degree clock offset&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;2'b00&lt;/span&gt;
				&lt;span class=&quot;o&quot;&gt;||&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;2'b11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;An output of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;8'h0f&lt;/code&gt; means we’re either in speed one with no clock offset
and both clock edges active, or we’re in the first half of speed two with a
90 degree clock offset.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;mh&quot;&gt;8'h0f&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o_ckstb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;o&quot;&gt;||&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o_ckstb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;An output of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;8'hf0&lt;/code&gt; can only mean we’re in the second half of speed two with a 90 degree clock offset.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;mh&quot;&gt;8'hf0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o_ckstb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;An output of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;8'hff&lt;/code&gt; is common at slow speeds, but also completely determined
by the top two phase bits of the counter.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;mh&quot;&gt;8'hff&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;2'b01&lt;/span&gt;
				&lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NCTR&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;2'b10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The last several outputs are very specific to their settings.  &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;8'h3c&lt;/code&gt; is
only possible in a speed of 1 with a 90 degree clock offset.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;mh&quot;&gt;8'h3c&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o_ckstb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;That leaves the two possible double-clock outputs.  First, the double clock
with no 90 degree offset.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;mh&quot;&gt;8'h33&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o_ckstb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The last possibility is the double clock with the 90 degree offset.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;mh&quot;&gt;8'h66&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o_ckstb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Everything else is specifically disallowed.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;default&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;endcase&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
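
&lt;p&gt;Put together, only seven patterns survive the case statement.  As a rough
sketch, suppose some other property set only wanted to know whether a given
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_ckwide&lt;/code&gt; value was one of those seven, without the per-value conditions.
That membership test might look like the following.  The module and port names
here are hypothetical, and this is no substitute for the case statement above.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;module	f_ckwide_legal_sketch (
		input	wire	[7:0]	i_ckwide,
		output	reg		o_legal
	);

	// The seven output patterns enumerated by the case statement
	always @(*)
	case(i_ckwide)
	8'h00, 8'h0f, 8'hf0, 8'hff,
	8'h3c, 8'h33, 8'h66:	o_legal = 1'b1;
	// Everything else is disallowed
	default:		o_legal = 1'b0;
	endcase
endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;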

&lt;h2 id=&quot;interface-file&quot;&gt;Interface File&lt;/h2&gt;

&lt;p&gt;While I might like to leave things there, a full proof of this
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;clock generator&lt;/a&gt;
requires we go over the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v&quot;&gt;formal interface
file&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Remember, the purpose of the formal interface file is to separate two proofs.
In this case, we want to both formally verify the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;clock generator&lt;/a&gt;,
as well as the 
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdtxframe.v&quot;&gt;transmitter data generator&lt;/a&gt;
that will use the results of the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;clock generator&lt;/a&gt;.
Further, unlike the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;clock generator&lt;/a&gt;,
the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdtxframe.v&quot;&gt;transmitter data generator&lt;/a&gt;
doesn’t really care if the signals to and from the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;clock generator&lt;/a&gt; are realistic.  It only cares that
they follow whatever rules it requires–things like either
1) both &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;new_edge &amp;amp;&amp;amp; half_edge&lt;/code&gt; at the same time, or 2) an alternating
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;new_edge&lt;/code&gt; with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;half_edge&lt;/code&gt;, and so forth.&lt;/p&gt;

&lt;p&gt;You can find this &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v&quot;&gt;formal interface
file&lt;/a&gt;
among the other files associated with the formal proofs for this design.
Although it is written in Verilog, it’s not really something that could or
would be synthesized.  For this reason I keep it in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bench/formal&lt;/code&gt;
subdirectory of the project, rather than the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rtl/&lt;/code&gt; subdirectory.&lt;/p&gt;

&lt;p&gt;Starting at the top, our 
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v&quot;&gt;property set&lt;/a&gt;
must operate in at least three configurations: 1) in an environment where the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wide_clock&lt;/code&gt; commands an 8:1 OSERDES, 2) an environment where it commands an
&lt;a href=&quot;/blog/2020/08/22/oddr.html&quot;&gt;ODDR&lt;/a&gt; instead, or 3) a simpler
environment where neither option is available to us.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;&lt;span class=&quot;k&quot;&gt;module&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;fclk&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;#(&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;parameter&lt;/span&gt;	&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;OPT_SERDES&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
					&lt;span class=&quot;n&quot;&gt;OPT_DDR&lt;/span&gt;    &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;
	&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Yes, we’ll need to run at least
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/sdckgen.sby#L2-L4&quot;&gt;3 formal proofs&lt;/a&gt;,
one for each option, to make sure we’ve truly captured each option.  This,
however, is just the price of doing business with configurable logic.&lt;/p&gt;

&lt;p&gt;Our &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v&quot;&gt;formal properties&lt;/a&gt;
will need the same inputs as the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;clock generator&lt;/a&gt;.
The outputs of the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;clock generator&lt;/a&gt;
also need to be listed as inputs to this &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v&quot;&gt;property set&lt;/a&gt;.
While the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v&quot;&gt;formal property set&lt;/a&gt;
will primarily consist of assertions and assumptions, it will also produce
two outputs–as discussed above.  These are necessary for making sure the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v&quot;&gt;formal property set&lt;/a&gt;’s
state is consistent with the internal state of the design.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;		&lt;span class=&quot;kt&quot;&gt;input&lt;/span&gt;	&lt;span class=&quot;kt&quot;&gt;wire&lt;/span&gt;		&lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;//&lt;/span&gt;
		&lt;span class=&quot;kt&quot;&gt;input&lt;/span&gt;	&lt;span class=&quot;kt&quot;&gt;wire&lt;/span&gt;		&lt;span class=&quot;n&quot;&gt;i_en&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
		&lt;span class=&quot;kt&quot;&gt;input&lt;/span&gt;	&lt;span class=&quot;kt&quot;&gt;wire&lt;/span&gt;	&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;i_ckspd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
		&lt;span class=&quot;kt&quot;&gt;input&lt;/span&gt;	&lt;span class=&quot;kt&quot;&gt;wire&lt;/span&gt;		&lt;span class=&quot;n&quot;&gt;i_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;//&lt;/span&gt;
		&lt;span class=&quot;kt&quot;&gt;input&lt;/span&gt;	&lt;span class=&quot;kt&quot;&gt;wire&lt;/span&gt;		&lt;span class=&quot;n&quot;&gt;i_ckstb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
		&lt;span class=&quot;kt&quot;&gt;input&lt;/span&gt;	&lt;span class=&quot;kt&quot;&gt;wire&lt;/span&gt;	&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;//&lt;/span&gt;
		&lt;span class=&quot;kt&quot;&gt;output&lt;/span&gt;	&lt;span class=&quot;kt&quot;&gt;reg&lt;/span&gt;		&lt;span class=&quot;n&quot;&gt;f_pending_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
		&lt;span class=&quot;kt&quot;&gt;output&lt;/span&gt;	&lt;span class=&quot;kt&quot;&gt;reg&lt;/span&gt;		&lt;span class=&quot;n&quot;&gt;f_pending_half&lt;/span&gt;
	&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
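
&lt;p&gt;As a rough sketch of how these ports get used, the clock generator might
instantiate the property set from within its formal section as shown here.
The instance name, the placeholder for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_en&lt;/code&gt;, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_ckwide&lt;/code&gt; name, and the
assumption that the generator carries matching &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_SERDES&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_DDR&lt;/code&gt;
parameters are all illustrative guesses rather than quotes from the actual
design.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;`ifdef	FORMAL
	// Hypothetical wires to capture the property set's two outputs
	wire	f_pending_reset, f_pending_half;

	fclk #(
		// Assumes the generator has parameters of the same names
		.OPT_SERDES(OPT_SERDES), .OPT_DDR(OPT_DDR)
	) u_ckprop (
		.i_clk(i_clk), .i_reset(i_reset),
		//
		.i_en(1'b1),		// Placeholder: clock always enabled
		.i_ckspd(ckspd),	// The *current* speed, not the request
		.i_clk90(clk90),
		//
		.i_ckstb(o_ckstb), .i_hlfck(o_hlfck),
		.i_ckwide(o_ckwide),	// Assumed name for the wide clock
		//
		.f_pending_reset(f_pending_reset),
		.f_pending_half(f_pending_half)
	);
`endif&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Everything the generator produces arrives here as an &lt;em&gt;input&lt;/em&gt;, which is what
lets the same file serve both proofs.&lt;/p&gt;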

&lt;p&gt;Some of you may recall the &lt;a href=&quot;/formal/2018/12/18/skynet.html&quot;&gt;challenges I’ve struggled through when trying to
verify two co-dependent components&lt;/a&gt;.
My original approach was to &lt;a href=&quot;/formal/2018/04/23/invariant.html&quot;&gt;swap assumptions and
assertions&lt;/a&gt; between the
two components.  This &lt;a href=&quot;/formal/2018/12/18/skynet.html&quot;&gt;didn’t
work&lt;/a&gt;,
primarily because it was possible for the resulting &lt;em&gt;assumptions&lt;/em&gt; to render
one or more assertions irrelevant or vacuous.  In that example, the logic
of a design acted as an assumption as well.&lt;/p&gt;

&lt;p&gt;In our case, we’re going to disconnect the two designs that will use this
property set entirely.  The
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;clock generator (the master)&lt;/a&gt;
will make assertions that the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdtxframe.v&quot;&gt;transmitter data generator&lt;/a&gt; will later assume, and vice versa.
To make this work, we’ll have the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/sdckgen.sby&quot;&gt;SymbiYosys
script&lt;/a&gt;
for the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;clock generator&lt;/a&gt;
define a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CKGEN&lt;/code&gt; macro.  This will then tell us whether this property set is
being used as part of the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;clock generator&lt;/a&gt;’s proof, or the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdtxframe.v&quot;&gt;transmitter data generator&lt;/a&gt;’s.
If a part of the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;clock generator&lt;/a&gt;’s
proof, we’ll make assertions about our outputs.  If a part of the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdtxframe.v&quot;&gt;transmitter data generator&lt;/a&gt;’s
proof, those “outputs” will now be inputs of the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdtxframe.v&quot;&gt;transmitter data
generator&lt;/a&gt;,
and so we should be making assumptions about them instead.  To do this, we’ll
create a macro, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SLAVE_ASSUME&lt;/code&gt;, that can be used to describe properties of
these outputs with either &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;assert&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;assume&lt;/code&gt; statements.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;&lt;span class=&quot;cp&quot;&gt;`ifdef&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;CKGEN&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;`define&lt;/span&gt;	SLAVE_ASSUME	assert	&lt;span class=&quot;err&quot;&gt;//&lt;/span&gt; Clock generator proof&lt;span class=&quot;cp&quot;&gt;
`else&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;`define&lt;/span&gt;	SLAVE_ASSUME	assume	&lt;span class=&quot;err&quot;&gt;//&lt;/span&gt; Transmit data generator proof&lt;span class=&quot;cp&quot;&gt;
`endif&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The next step is boilerplate: create an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f_past_valid&lt;/code&gt; register to let us
know if we can use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$past()&lt;/code&gt; function or not.  (Remember, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$past()&lt;/code&gt;’s value
is invalid on the first clock of any proof.)&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;kt&quot;&gt;reg&lt;/span&gt;		&lt;span class=&quot;n&quot;&gt;f_past_tick&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f_past_valid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;kt&quot;&gt;reg&lt;/span&gt;		&lt;span class=&quot;n&quot;&gt;last_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;last_en&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;last_pending&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;kt&quot;&gt;reg&lt;/span&gt;	&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;last_ckspd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

	&lt;span class=&quot;k&quot;&gt;initial&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;f_past_valid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;f_past_valid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Our first output, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f_pending_reset&lt;/code&gt;, will be true between the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_reset&lt;/code&gt; signal and the
first clock edge.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;initial&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;f_pending_reset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;f_pending_reset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckstb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;f_pending_reset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
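
&lt;p&gt;To see why &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f_past_valid&lt;/code&gt; matters, consider the following toy module.
It is not part of the property file: the module name and the assertion are made
up for illustration, and only the two registers are copied from above.  The
assertion checks that a reset on the prior clock leaves &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f_pending_reset&lt;/code&gt;
set.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;module	past_valid_demo (	// Hypothetical name, for illustration only
		input	wire	i_clk, i_reset,
		input	wire	i_ckstb, i_hlfck
	);

	reg	f_past_valid, f_pending_reset;

	// True on every clock except the first one of the proof
	initial	f_past_valid = 0;
	always @(posedge i_clk)
		f_past_valid &amp;lt;= 1;

	// Set by a reset, cleared by the first clock edge after it
	initial	f_pending_reset = 1'b0;
	always @(posedge i_clk)
	if (i_reset)
		f_pending_reset &amp;lt;= 1'b1;
	else if (i_ckstb || i_hlfck)
		f_pending_reset &amp;lt;= 1'b0;

`ifdef	FORMAL
	// Only look backwards in time once there is a past to look at
	always @(posedge i_clk)
	if (f_past_valid &amp;amp;&amp;amp; $past(i_reset))
		assert(f_pending_reset);
`endif
endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Drop the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f_past_valid&lt;/code&gt; term, and the solver would be free to pick any value
for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$past(i_reset)&lt;/code&gt; on the first step, at which point the assertion could
fail even though the logic itself is fine.&lt;/p&gt;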

&lt;p&gt;Our second output, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f_pending_half&lt;/code&gt;, is true from the top of the clock to
the second half of the clock, but &lt;em&gt;only&lt;/em&gt; if the top of the clock didn’t
include the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;half_edge&lt;/code&gt; signal (called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_hlfck&lt;/code&gt; herein).&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;initial&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;f_pending_half&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;f_pending_half&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckstb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;f_pending_half&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;f_pending_half&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;A third signal, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f_past_tick&lt;/code&gt;, will allow us to reason about whether or not
we just passed an edge.  We’ll get to this one in a bit.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;initial&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;f_past_tick&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;f_past_tick&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_ckstb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

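&lt;p&gt;With &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f_pending_reset&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f_pending_half&lt;/code&gt; in hand, we can state with certainty that
we can’t start a new clock cycle while waiting for the second half of a clock
cycle.  Likewise, if we are in the second half of a clock cycle, we shouldn’t see
the half edge again unless we’re starting a new (and high speed) clock.&lt;/p&gt;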

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_reset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f_pending_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f_pending_half&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckstb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckstb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
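
&lt;p&gt;Given the earlier trouble with assumptions rendering a proof vacuous, one
possible safeguard is a handful of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cover&lt;/code&gt; statements placed alongside these
rules, inside the same module.  The following is only a sketch, and not
something taken from the property file.  Which of these can be reached will
depend upon the configuration and the clock speed being checked.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	// Hypothetical reachability checks, not from the original file
	always @(posedge i_clk)
	if (!i_reset &amp;amp;&amp;amp; !f_pending_reset)
	begin
		// A slower clock: new edge now, half edge still to come
		cover(i_ckstb &amp;amp;&amp;amp; !i_hlfck);

		// The half edge we have been waiting on arrives
		cover(f_pending_half &amp;amp;&amp;amp; i_hlfck);

		// A faster clock: both edges within the same cycle
		cover(i_ckstb &amp;amp;&amp;amp; i_hlfck);
	end&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;If a cover pass can’t produce a trace for a case that the configuration ought
to support, that’s a hint that some property has over constrained the design.&lt;/p&gt;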

&lt;p&gt;With this as background, we can now make assertions about our various
clock speeds, and the outputs that should be produced in each.  Note that in
this &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v&quot;&gt;formal property set&lt;/a&gt;,
the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_ckspd&lt;/code&gt; input reflects our &lt;em&gt;current&lt;/em&gt; clock speed, and not just the
&lt;em&gt;requested&lt;/em&gt; clock speed that we worked with in the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;clock
generator&lt;/a&gt;.
Hence, it is an &lt;em&gt;output&lt;/em&gt; of the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;clock
generator&lt;/a&gt;,
and no longer the requested clock speed.&lt;/p&gt;

&lt;p&gt;Let’s start with the highest speed (200MHz) clock output.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;case&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckspd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// We can only run in this speed if OPT_SERDES is set.&lt;/span&gt;
		&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_SERDES&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

		&lt;span class=&quot;c1&quot;&gt;// This speed has no pending half cycles.  All clock cycles&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// are complete in one cycle.&lt;/span&gt;
		&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f_pending_reset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f_pending_half&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// Clock is either *off*/inactive, or we're still coming&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// out of a reset.&lt;/span&gt;
			&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f_pending_reset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckstb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// Clock is active, both edges are active in a clock&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// tick&lt;/span&gt;
			&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckstb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wide_clock&lt;/code&gt; output, herein called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_ckwide&lt;/code&gt;, can only have one of two
values when active at this speed.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// In the case of a 90 degree offset clock, if the&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// clock is active, it must be 0110_0110&lt;/span&gt;
			&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'h66&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// Otherwise, if the clock is active, it must be&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// 0011_0011&lt;/span&gt;
			&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'h33&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Those are just the rules for 200MHz (assuming a 100MHz system clock).&lt;/p&gt;

&lt;p&gt;Now let’s drop down a speed, and look at the 100MHz clock.  In this mode,
the new edge and half edge signals must also be present on the same clock.
Likewise, there’s no allowable means to have a pending second half–the
first and second half must always show up on the same clock cycle.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f_pending_reset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckstb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckstb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f_pending_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f_pending_half&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;At 100MHz, a running wide clock can only be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0011_1100&lt;/code&gt; (90 degree offset),
or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0000_1111&lt;/code&gt;.  The former requires &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_SERDES&lt;/code&gt;.  The latter is also
possible in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_DDR&lt;/code&gt; mode, since the first four bits are all identical, as are
the last four.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'h3c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
			&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_SERDES&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'h0f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
			&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_SERDES&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OPT_DDR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Our last special clock speed is 50MHz.  For this case, we break our properties
into two parts: the 90 degree offset, and the normal (SDR) case.&lt;/p&gt;

&lt;p&gt;For the 90 degree offset clock, the clock must either be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0000_1111&lt;/code&gt; if
we’re not waiting on the next half clock cycle, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1111_0000&lt;/code&gt; if we are.
Likewise, either the new or half edge signal must be true on every cycle.
The only exception is when the clock is stopped.  Further, this
output will require either &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_SERDES&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_DDR&lt;/code&gt;.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'h0f&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'hf0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_en&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
				&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
			&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_SERDES&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OPT_DDR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f_pending_reset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f_pending_half&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
				&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'hf0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'h00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
				&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckstb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'h0f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
				&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckstb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
				&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The normal offset is simpler.  This doesn’t require &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_SERDES&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_DDR&lt;/code&gt;.
The wide clock can either be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0000_0000&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1111_1111&lt;/code&gt;.  Further, if ever
the clock output is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1111_1111&lt;/code&gt;, then we must be on the second half edge.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'hff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'hff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
				&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This brings us to the default clock, the very slow clock generated by
integer division (i.e. the counter).  As before, the wide clock can either
be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0000_0000&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1111_1111&lt;/code&gt;, and hence needs no special hardware:
neither &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_SERDES&lt;/code&gt; nor &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_DDR&lt;/code&gt; is required.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;default&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'hff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f_pending_reset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;last_en&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_en&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
				&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckstb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
				&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
					&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'h00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
				&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
				&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
					&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'hff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
				&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f_pending_half&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
				&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
					&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'h00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
				&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// if (!f_pending_half)&lt;/span&gt;
					&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckwide&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'hff&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;endcase&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Just as a quick sanity check, if we have no special hardware, then both
new and half edges can never be true on the same cycle.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_SERDES&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_DDR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckstb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Let’s come back and double check the high speed cases.  These are the only
cases where both new and half edge may be allowed at the same time.  In all
other cases, one or both signals should be zero.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f_past_valid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;last_reset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;last_en&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_ckstb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;case&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckspd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_en&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckstb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
		&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_en&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckstb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
		&lt;span class=&quot;nl&quot;&gt;default:&lt;/span&gt;
			&lt;span class=&quot;cp&quot;&gt;`SLAVE_ASSUME&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_ckstb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_hlfck&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;endcase&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Feel free to check the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/bench/formal/fclk.v&quot;&gt;property set&lt;/a&gt;
out yourself.  While there are a couple more properties to it, these
are the most significant.&lt;/p&gt;

&lt;h2 id=&quot;coverage-checking&quot;&gt;Coverage Checking&lt;/h2&gt;

&lt;p&gt;Any good verification set should include not just a simulation, not just
formal induction based proofs, but also a set of coverage checks.
These are critical to making sure you haven’t (accidentally) assumed away
some key component of the device’s operation.  Were that to happen, then
the formal proof would be irrelevant, even if it did pass.&lt;/p&gt;
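
&lt;p&gt;To see how this can go wrong, consider the following minimal sketch.  It’s a
hypothetical module, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vacuous_demo&lt;/code&gt;, with made-up signal names.  It is not
taken from the clock generator or its property set.  An overly strong
assumption holds &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_en&lt;/code&gt; low forever, so the assertion beneath it should pass
trivially, and only the cover check would reveal that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;r_busy&lt;/code&gt; can never be set.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;// Hypothetical example: not part of sdckgen.v or fclk.v
module vacuous_demo(input wire i_clk, input wire i_en);
	reg	r_busy;

	// r_busy is simply i_en, delayed by one clock
	initial	r_busy = 1'b0;
	always @(posedge i_clk)
		r_busy &amp;lt;= i_en;

`ifdef	FORMAL
	// An over-eager assumption: the design is never enabled
	always @(*)
		assume(!i_en);

	// This property is wrong (r_busy lags i_en by a cycle), yet it
	// passes, since the assumption keeps r_busy from ever being set
	always @(*)
	if (r_busy)
		assert(i_en);

	// The cover check exposes the problem: no trace can reach it
	always @(posedge i_clk)
		cover(r_busy);
`endif
endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Remove the assumption, and both results should change: the cover statement
becomes reachable, and the assertion fails as it should.&lt;/p&gt;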

&lt;p&gt;Hence, we add some cover properties here to the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;clock generator&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The first step is just to check if the clock is active, and if so, what mode
it is active in.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;kt&quot;&gt;reg&lt;/span&gt;		&lt;span class=&quot;n&quot;&gt;cvr_active&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cvr_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;kt&quot;&gt;reg&lt;/span&gt;	&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;cvr_spd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cvr_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cvr_active&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;cvr_spd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_cfg_ckspd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;cvr_clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_cfg_clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

	&lt;span class=&quot;k&quot;&gt;initial&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;cvr_active&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;cvr_active&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cvr_spd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cvr_spd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_cfg_ckspd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f_en&lt;/span&gt;
			&lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cvr_clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_cfg_clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cvr_clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// We want to prove what our clock output can do over&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// time, not so much what happens when/if it changes.&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;cvr_active&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o_ckstb&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;cvr_active&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Once the clock is active, we can start counting every new edge that takes
place for as long as it remains active.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_reset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cvr_active&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;cvr_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;8'b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o_ckstb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cvr_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// Don't allow the counter to overflow, but otherwise&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// count the beginnings of each clock cycle.&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;cvr_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cvr_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;With that as background, we can start looking at traces!  Let’s get
cover traces for a variety of potential frequencies.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;cover&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cvr_spd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cvr_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;	&lt;span class=&quot;c1&quot;&gt;// 50MHz&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;cover&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cvr_spd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cvr_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;	&lt;span class=&quot;c1&quot;&gt;// 25MHz&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;cover&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cvr_spd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cvr_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;cover&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cvr_spd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cvr_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;	&lt;span class=&quot;c1&quot;&gt;// 12MHz&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;cover&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cvr_spd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cvr_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;cover&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cvr_spd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cvr_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;	&lt;span class=&quot;c1&quot;&gt;//  8MHz&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;cover&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cvr_spd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cvr_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;cover&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cvr_spd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cvr_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;//  6MHz&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;cover&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cvr_spd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cvr_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;We’ll have to handle covering the high speed options a bit differently.  In
this case, we &lt;em&gt;only&lt;/em&gt; want to check speeds requiring &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_SERDES&lt;/code&gt; if
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_SERDES&lt;/code&gt; is actually set.  We can’t use an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;if&lt;/code&gt; for this, lest the
formal tool decide we failed the cover check.  Hence, we’ll use a generate
statement, so that the cover statements requiring &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_SERDES&lt;/code&gt; are &lt;em&gt;only&lt;/em&gt;
generated if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_SERDES&lt;/code&gt; is true.  Now we can check for 200MHz, 100MHz, and
50MHz.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;generate&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_SERDES&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CVR_SERDES&lt;/span&gt;

		&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;cover&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cvr_spd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cvr_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;cover&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cvr_spd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cvr_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;cover&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cvr_spd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cvr_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;cover&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cvr_spd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cvr_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;cover&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cvr_spd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cvr_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;We can apply the same logic to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OPT_DDR&lt;/code&gt;, but we’ll have fewer clock options
to check.  In this case, it’s only the 100MHz and 50MHz options.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPT_DDR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CVR_DDR&lt;/span&gt;

		&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;cover&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cvr_spd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cvr_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;cover&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cvr_spd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt;  &lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cvr_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;cover&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cvr_spd&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clk90&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cvr_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;endgenerate&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;By the time you get to this point, you should have a strong confidence that
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;this device clock generator&lt;/a&gt;
actually does what it needs to.  I certainly do, and it hasn’t failed me (that
I recall) since going through this exercise.  Yes, other parts of this design
have had problems, particularly the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdfrontend.v&quot;&gt;front end&lt;/a&gt;, but the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;clock generator&lt;/a&gt;
has been quite reliable.&lt;/p&gt;

&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;This is now my go-to approach whenever I need to generate a device clock:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Generate the “clock” in logic.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Generate the “clock” wide, so it can be output via either OSERDES or
&lt;a href=&quot;/blog/2020/08/22/oddr.html&quot;&gt;ODDR&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Maintain all logic transitions on the original source clock.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Use logical signals like you would enables to handle data transitions.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What did this gain us?  We received several advantages from this approach:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;A glitchless outgoing clock&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;An outgoing clock that can …&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;
        &lt;p&gt;change frequency upon command,&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;turn on and off as necessary,&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;stop, and yet restart on a dime, and&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;switch between being data aligned and offset by 90 degrees.&lt;/p&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is everything we would want of an outgoing clock, with none of the
challenges associated with breaking &lt;a href=&quot;/blog/2017/08/21/rules-for-newbies.html&quot;&gt;&lt;em&gt;the
rules&lt;/em&gt;&lt;/a&gt;.  Indeed,
this approach works nicely in both FPGA and ASIC contexts, as I’ve now used it
quite successfully in both for multiple projects.  No, I don’t use the same
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/a1d912367ce71389ef25ced4b83d34d23b05b391/rtl/sdckgen.v&quot;&gt;clock generator&lt;/a&gt; for all my projects, but that’s for both
requirements (the 200MHz clock is unique) and &lt;a href=&quot;/blog/2020/01/13/reuse.html&quot;&gt;legal
reasons&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This leaves us with the topic of the “return clock”, which we’ll need to come
back to and discuss on another day.&lt;/p&gt;
&lt;hr /&gt;&lt;p&gt;&lt;em&gt;The wind goeth toward the south, and turneth about unto the north; it whirleth about continually, and the wind returneth again according to his circuits. (Eccl 1:6)&lt;/em&gt;</description>
        <pubDate>Wed, 17 Dec 2025 00:00:00 -0500</pubDate>
        <link>https://zipcpu.com/blog/2025/12/17/devclk.html</link>
        <guid isPermaLink="true">https://zipcpu.com/blog/2025/12/17/devclk.html</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>Quiz #24: Is there an AXI bug here?</title>
        <description>&lt;!-- answer: &quot;2022/11/01/fv-answer22.html&quot; --&gt;

&lt;p&gt;This quiz is brought to you courtesy of &lt;a href=&quot;/formal/2019/05/13/axifull.html&quot;&gt;Xilinx’s AXI
slave template&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thankfully, they’ve (sort-of) fixed this bug since &lt;a href=&quot;/formal/2019/05/13/axifull.html&quot;&gt;I wrote
that&lt;/a&gt;.  I say “sort-of”
because …  the bug just got pushed around.  It’s still broken, just not
in the same way.&lt;/p&gt;
</description>
        <pubDate>Fri, 20 Jun 2025 00:00:00 -0400</pubDate>
        <link>https://zipcpu.com/quiz/2025/06/20/quiz24.html</link>
        <guid isPermaLink="true">https://zipcpu.com/quiz/2025/06/20/quiz24.html</guid>
        
        
        <category>quiz</category>
        
      </item>
    
      <item>
        <title>Comparing the Xilinx MIG with an open source DDR3 controller</title>
        <description>&lt;p&gt;Last year, I had the wonderful opportunity of mentoring Angelo as he built an
open source &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;DDR3 SDRAM controller&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Today, I have the opportunity to compare &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;this
controller&lt;/a&gt; with &lt;a href=&quot;https://docs.amd.com/v/u/en-US/ug086&quot;&gt;AMD (Xilinx)’s
Memory Interface Generator (MIG) solution&lt;/a&gt;
to the same problem.  Let’s take a look to see which one is faster, better,
and/or cheaper.&lt;/p&gt;

&lt;h2 id=&quot;design-differences&quot;&gt;Design differences&lt;/h2&gt;

&lt;p&gt;Before diving into the comparison, it’s worth understanding a bit about
DDR3–both how it works, and how that impacts its performance.  From there,
I’d like to briefly discuss some of the major design differences between
&lt;a href=&quot;https://docs.amd.com/v/u/en-US/ug086&quot;&gt;Xilinx’s MIG&lt;/a&gt;
and the &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We’ll start with the requirements of an SDRAM controller in general.&lt;/p&gt;

&lt;h3 id=&quot;sdram-in-general&quot;&gt;SDRAM in general&lt;/h3&gt;

&lt;p&gt;SDRAM stands for &lt;a href=&quot;https://en.wikipedia.org/wiki/Synchronous_dynamic_random-access_memory&quot;&gt;Synchronous Dynamic Random Access
Memory&lt;/a&gt;.
“Synchronous” in this context simply means the interface requires a clock, and
that all interactions are synchronized to that clock.  “Random Access” means
that you should be able to access the memory in any order you wish.  The key
word in this acronym, though, is the “D” for Dynamic.&lt;/p&gt;

&lt;p&gt;“Dynamic” RAM is made from capacitors, rather than flip flops.  Why?
Capacitors can be made much smaller than flip flops.  They also use
much less energy than flip flops.  When the capacitor is charged, the “bit”
of memory it represents contains a “1”.  When it isn’t, the bit is a zero.
There’s just one critical problem: Capacitors lose their charge over time.
This means that every capacitor in memory must be read and recharged
periodically or it will lose its contents.  The memory controller is
responsible for making sure this happens by issuing “refresh” commands to the
memory.&lt;/p&gt;

&lt;p&gt;That’s only the first challenge.  Let’s now go back to that “synchronous” part.&lt;/p&gt;

&lt;p&gt;The original (non-DDR) SDRAM standard had a single clock to it.  The controller
would generate that clock and send it to the memory to control all interactions.&lt;/p&gt;

&lt;p&gt;This was soon not fast enough.  Why not send memory values on both edges of
the clock, instead of just one?  You might then push twice as much data across
the interface for the same I/O bandwidth.  Sadly, as you increase the speed,
pretty soon the data from the memory doesn’t come back synchronous to the
clock you send.  Both the traces on your circuit board, as well as the time
to complete the operation within the memory chip will delay the return signals
so much that the returned data no longer arrives in time to be sampled at the
source by the source’s clock before the next clock edge.  Worse, these
variabilities are somewhat unpredictable.  Therefore, memories were modified
so that they return a clock together with the data–keeping the data
synchronous with the clock it is traveling with.&lt;/p&gt;

&lt;p&gt;Sampling data on a returned clock can be a challenge for an FPGA.  Worse, the
returned clock is discontinuous: it is only active when the memory has data
to return.  This will haunt us later, so we’ll come back to it in a moment.&lt;/p&gt;

&lt;p&gt;For now, let’s go back to the “dynamic” part of an SDRAM.&lt;/p&gt;

&lt;p&gt;SDRAMs are organized into banks, with each bank of memory being organized into
rows of capacitors.  To read from an SDRAM, a “row” of data from a particular
memory bank must first be “activated.”  That is, it needs to be copied from
its row of capacitors into a row of flip flops.  From here, “columns” within
this row can be read or written as desired.  However, only one row of memory
per bank can be active at any given time.  Therefore, in order to access a
second row of memory, the row in use must first be copied back to its
capacitors.  This is called “precharging” the row.  Only then can the desired
row of memory be copied to the active row of flip-flops for access.&lt;/p&gt;

&lt;p&gt;I mentioned SDRAMs are organized in “banks”.
Each of these banks can be controlled independently.
They each have their own row of active flip-flops.
With few exceptions, such as the “precharge all rows” command, or the “refresh”
cycle command, most of the commands given to the memory will be bank specific.&lt;/p&gt;

&lt;p&gt;Hence, to read a byte of memory, the controller must first identify which bank
the byte of memory belongs to, and from there it must identify which row
is to be read.  The controller must then check which row is currently in the
flip-flop buffer for that bank (i.e. which row is active).  If a different
row is active, that row must first be precharged.  If no row is active, or
alternatively once a formerly active row is precharged, the controller may
then activate the desired row.  Only once the desired row is active can the
controller issue a command to actually read the desired byte from the row.
Oh, and … all of this is contingent on not needing to refresh the memory.
If a refresh interrupt takes place, you have to precharge all banks, refresh
the memory, and then start over.&lt;/p&gt;
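
&lt;p&gt;As a rough illustration of the decision sequence just described, here is a minimal sketch of the bookkeeping needed to pick the next command.  It is not taken from either controller discussed in this article: the module name, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CMD_*&lt;/code&gt; encodings, and all of the port names are hypothetical.  It also ignores every DDR3 timing constraint (tRCD, tRP, tRFC, and so on) that a real controller must honor, and it assumes the caller holds its request (or refresh) inputs steady until the final command is issued.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;// Hypothetical sketch only: chooses the next command for one read request.
// No DDR3 timing rules are modeled here.
module	bank_next_cmd #(
		parameter	BANKS   = 8,	// Banks in the device
		parameter	LGBANKS = 3,	// log_2(BANKS)
		parameter	ROWBITS = 14	// Row address width (assumed)
	) (
		input	wire			i_clk, i_reset,
		input	wire			i_request,	// A read is wanted
		input	wire			i_refresh,	// A refresh is due
		input	wire	[LGBANKS-1:0]	i_bank,
		input	wire	[ROWBITS-1:0]	i_row,
		output	reg	[2:0]		o_cmd
	);

	localparam [2:0]	CMD_NOOP          = 3&#39;d0,
				CMD_PRECHARGE     = 3&#39;d1,
				CMD_ACTIVATE      = 3&#39;d2,
				CMD_READ          = 3&#39;d3,
				CMD_PRECHARGE_ALL = 3&#39;d4,
				CMD_REFRESH       = 3&#39;d5;

	// One &quot;row is open&quot; flag, and one open row address, per bank
	reg	[BANKS-1:0]	row_open;
	reg	[ROWBITS-1:0]	open_row [0:BANKS-1];

	// Pick the command.  Refresh takes priority over any request.
	always @(*)
	if (i_refresh)
		// All banks must be precharged before refreshing
		o_cmd = (row_open != 0) ? CMD_PRECHARGE_ALL : CMD_REFRESH;
	else if (!i_request)
		o_cmd = CMD_NOOP;
	else if (!row_open[i_bank])
		o_cmd = CMD_ACTIVATE;	// No row active in this bank
	else if (open_row[i_bank] != i_row)
		o_cmd = CMD_PRECHARGE;	// The wrong row is active
	else
		o_cmd = CMD_READ;	// The desired row is active

	// Track which row (if any) is open in each bank
	always @(posedge i_clk)
	if (i_reset)
		row_open &lt;= 0;
	else case(o_cmd)
	CMD_PRECHARGE:		row_open[i_bank] &lt;= 1&#39;b0;
	CMD_PRECHARGE_ALL:	row_open &lt;= 0;
	CMD_ACTIVATE: begin
		row_open[i_bank] &lt;= 1&#39;b1;
		open_row[i_bank] &lt;= i_row;
		end
	default: begin end
	endcase
endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;In this sketch, a read to a bank holding the wrong row costs three commands (precharge, activate, read), while a read to the already active row costs only one.  That difference is a large part of why access latency has to be treated statistically.&lt;/p&gt;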

&lt;p&gt;Well, almost.  There’s another important detail: Because of the high speeds
we are talking about, the memory will return data in bursts of eight bytes.
Hence, you can’t read just a single byte.  The minimum read quantity is eight
bytes in a single “byte lane”.&lt;/p&gt;

&lt;p&gt;What if eight bytes at a time isn’t enough throughput for you?  Well, you could
strap multiple memory chips together in parallel.  In this case, every command
issued by the controller would be sent to all of the memory chips.  All of them
would activate rows together, all of them would refresh their memory together,
and all of them could read eight bytes at a time.  Each of these chips, then,
would control a single “byte lane”.  In our case today, we’ll be using a memory
having eight “byte lanes”.&lt;/p&gt;

&lt;p&gt;So, when it comes to the performance of a memory controller, what do we want
to know?  We want to know how long it will take us from when the controller
receives a read (or write) request until the data can be returned from the
memory chip.  This includes waiting for any (potential) refresh cycles,
waiting for prior active rows to be precharged, new rows to be activated,
and the data to finally be returned.  The data path is complex enough that
we’ll need to be looking at these times statistically.&lt;/p&gt;

&lt;p&gt;Specifically, we’re going to model transaction time as some amount of
per-transaction latency, followed by a per-amount throughput.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/eqn-transaction-time.png&quot; alt=&quot;&quot; width=&quot;524&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Our goal will be to determine these two unknown quantities: &lt;em&gt;latency&lt;/em&gt; and
&lt;em&gt;throughput&lt;/em&gt;.  If we do our job well, these two numbers will then help us
predict answers to such questions as: how long will a particular algorithm
take, and how much memory bandwidth is available to an application.&lt;/p&gt;

&lt;h3 id=&quot;mig&quot;&gt;MIG&lt;/h3&gt;

&lt;p&gt;Let’s now discuss some of &lt;a href=&quot;https://docs.amd.com/v/u/en-US/ug086&quot;&gt;AMD (Xilinx)’s DDR3 memory
controller&lt;/a&gt;.  This is the
controller generated by their “Memory Interface Generator” and affectionately
known simply as the “MIG” or “MIG controller”.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://docs.amd.com/v/u/en-US/ug086&quot;&gt;AMD (Xilinx)’s MIG controller&lt;/a&gt; is now
many years old.  Judging by their change log, it was first released in 2013.
Other than configuration adjustments, it has not been significantly modified
since 2016.  This is considered one of their more “stable” IPs.  It gets a
lot of use by a wide variety of users, and I’ve certainly used it on a large
number of projects.&lt;/p&gt;

&lt;p&gt;Examining the source code of the MIG reveals that it is built in two parts.
This can be seen from Fig. 1 below, which shows how the MIG fits in the context
of the entire test stack we’ll be using today.&lt;/p&gt;
&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 1. Memory pipeline&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/mig-chain.svg&quot; width=&quot;720&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;The first part of the MIG processes AXI transaction requests into its internal
“native” interface.  AXI, however, is a complex protocol.  This translation
is not instantaneous, and therefore takes a clock (or two) to accomplish.
Many FPGA designers have discovered they can often improve upon the performance
of the MIG by skipping this AXI translation layer and using the
“native” interface instead.  I have not done so personally, since I haven’t
found sufficient documentation of this “native” interface to satisfy my
needs–but perhaps I just need to look harder at what’s there.&lt;/p&gt;

&lt;p&gt;One key feature of an AXI interface is that it permits a certain amount of
transaction reordering.  For example, a memory controller might prioritize two
interactions to the same bank of memory, such that the interaction using
the currently active row might go first.  Whether or not Xilinx’s MIG does
this I cannot say.  For today’s test measurements, we’ll only be using one
channel–whether read or write, and we’ll only be using a single AXI ID.
As a result, all requests must complete in order, and there will be no
opportunity for the MIG to reorder any requests.&lt;/p&gt;

&lt;p&gt;DDR3 speeds also tend to be much faster than the FPGA logic the controller must
support.  For this reason, Xilinx’s DDR3 controller runs at either 1/2 or 1/4
the speed of the interface.  This means that, on any given FPGA clock cycle,
either two or four commands may be issued to the DDR3 device.  For this test,
we’ll be running at 1/4 speed, so four commands may be issued per system clock
cycle.&lt;/p&gt;

&lt;p&gt;The biggest problem Xilinx needed to solve with their controller was how to
sample return data.  Remember, the data returned by the memory contains a
discontinuous clock.  Worse, the discontinuous clock transitions when the
data transitions.  This means that the controller must (typically) delay the
return clock by a quarter cycle, and only then clock the data on the edge.
But … how do you know how far a quarter cycle delay is in order to generate
the correct sample time for each byte lane?&lt;/p&gt;

&lt;p&gt;Xilinx solved this problem by using a set of IO primitives that they’ve never
fully documented.  These include PHASERs and IO FIFOs.  Using these IO
primitives, they can lock a PLL to the returned data clock, and then use
that PLL to control the sample time of the return data.  This clock is
then used to control a special purpose asynchronous FIFO.  From here,
the data is returned to its environment.&lt;/p&gt;

&lt;p&gt;One unusual detail I’ve seen from the MIG is that it will often stall
my read requests for a single cycle at a time in a periodic fashion.  Such
stalls are much too short for any refresh cycles.  They are also more frequent
than the (more extended) refresh cycles.  This leads me to believe that
Xilinx’s IO PLL primitive has an additional requirement, which is that in order
to maintain lock, the MIG must periodically read from the DDR3 SDRAM.  Hence,
the MIG must not only take the memory offline periodically to keep the
capacitors refreshed, it must also read from the memory to keep this IO PLL
locked.  Worse, it cannot read from the device at the same time it does this
station keeping.  As with the AXI to native conversion, this PLL station
keeping requirement negatively impacts the MIG’s performance.&lt;/p&gt;

&lt;p&gt;Before leaving this point, let me underscore that these “special purpose”
IO elements were never fully documented.  This adds to the challenge of building
an open source controller, since the open source engineer must either
reverse engineer these undocumented hardware components or build their
data sampler in some other fashion.&lt;/p&gt;

&lt;p&gt;Some time ago, I tried building &lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/master/rtl/demofull.v&quot;&gt;a block-RAM based memory peripheral capable
of handling AXI exclusive access
requests&lt;/a&gt;.
While trying to verify that the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
could generate exclusive access requests and that it would do so properly, I
looked into whether or not the MIG would support them.  Much
to my surprise, the MIG has &lt;em&gt;no exclusive access capability&lt;/em&gt;.  I’ve since been
told that this isn’t a big deal, since you only need exclusive access when
more than one CPU is running on the same bus and the MicroBlaze CPU was never
certified for multi-core operation, but I do still find this significant.&lt;/p&gt;

&lt;p&gt;Finally, the MIG controller tries to maximize parallelism with various “bank
machines”.  These “bank machines” appear to be complex structures, allocated
dynamically upon request.  Each bank machine is responsible for handling when
and if a row for a given memory bank must be activated, read, written, or
precharged.  While most memories physically have eight banks, Xilinx’s MIG
permits a user to have fewer bank machines.  Hence, the first step in
responding to a user request is to &lt;em&gt;allocate&lt;/em&gt; a bank machine to the request.
According to Xilinx, “The [MIG] controller implements an aggressive precharge
policy.”  As a result, once the request is complete, the controller will
precharge the bank if no further requests are pending.  The unfortunate
consequence of this decision is that subsequent accesses to the same memory
will need to first activate the row again before it can be used.&lt;/p&gt;

&lt;h3 id=&quot;uberddr3&quot;&gt;UberDDR3&lt;/h3&gt;

&lt;p&gt;This leads us to the &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3
controller&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt; is an open
source (GPLv3) DDR3 controller.  It was not built with AMD (Xilinx) funding or
help.  As such, it uses no special purpose IO features.  Instead, it uses basic
ISERDES/OSERDES and IDELAY/ODELAY primitives.  As a result, there are no
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PHASER_IN&lt;/code&gt;s, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PHASER_OUT&lt;/code&gt;s, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IN_FIFO&lt;/code&gt;s, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OUT_FIFO&lt;/code&gt;s, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BUFIO&lt;/code&gt;s.&lt;/p&gt;

&lt;p&gt;This leads to the question of how to deal with the return clock sampling from
the DDR3 device.  In the case of the &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3
controller&lt;/a&gt;, we made the assumption
that the DQS toggling would always come back after a fixed amount of time
from the clock containing the request.  A small calibration state machine
is used to determine this delay time and then to find the center of the “eye”.
Once done, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IDELAY&lt;/code&gt; elements, coupled with a shift register, are then used
to get the sample point.&lt;/p&gt;

&lt;p&gt;Fig. 2 shows a reference to this process.&lt;/p&gt;
&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 2. Incoming data sampling&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/idelay.svg&quot; width=&quot;360&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
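
&lt;p&gt;As a rough illustration of what “finding the center of the eye” means, here’s a small Python sketch.  It takes a sweep of delay taps, each marked by whether or not it sampled a known pattern correctly, and picks the middle of the longest passing run.  The function name, the 32 tap sweep, and the pass/fail list are all assumptions made for illustration.  This is not the UberDDR3 controller’s actual calibration logic.&lt;/p&gt;

```python
def eye_center(passes):
    """Return the middle tap of the longest run of passing taps.

    passes[i] is True when (hypothetical) delay tap i sampled the
    training pattern correctly.  Returns None if no tap passed.
    """
    runs = []      # (length, start) for each run of passing taps
    start = None
    # A trailing False closes any run that reaches the last tap
    for tap, ok in enumerate(list(passes) + [False]):
        if ok and start is None:
            start = tap
        elif not ok and start is not None:
            runs.append((tap - start, start))
            start = None
    if not runs:
        return None
    length, start = max(runs, key=lambda r: r[0])
    return start + (length - 1) // 2


# Hypothetical sweep of 32 taps, where taps 5 through 13 sample correctly
sweep = [tap in range(5, 14) for tap in range(32)]
print(eye_center(sweep))    # prints 9
```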

&lt;p&gt;It is possible that this method will lose calibration over time.  Indeed,
even the MIG wants to use the XADC to watch for temperature changes to know
if it needs to adjust its calibration.  Rather than require the XADC, the
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt; supports a
user input to send it back into calibration mode.  Practically, I haven’t
needed to do this, but this may also be because my test durations weren’t long
enough.&lt;/p&gt;

&lt;p&gt;Another difference between the
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt; and the MIG
is that the
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt; only has one
interface: &lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone B4
(Pipelined)&lt;/a&gt;.
This interface is robust enough to replace
the need for the MIG’s non-standard “native” interface.  Further, because
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
has only a single channel for both reads and writes, the
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt; maintains a
strict ordering on all transactions.  There’s no opportunity for reordering accesses,
and no associated complexity involved with it either.&lt;/p&gt;

&lt;p&gt;This will make our testing a touch more difficult, however, because we’ll be
issuing &lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
requests–native to the
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt; but not the MIG.
A &lt;a href=&quot;/blog/2020/03/23/wbm2axisp.html&quot;&gt;simple bridge&lt;/a&gt;, costing
a single clock cycle, will convert from
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt; to AXI prior
to the MIG.  We’ll need to account for this when we get to testing.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt; also differs
in how it handles memory banks.  Rather than using an “aggressive” precharging
strategy, it uses a lazy one.  Rows are only precharged (returned back to the
capacitors) when 1) the row has been active too long, or 2) when it is time to
do a refresh, and so all active rows on all banks must be precharged.  This
works great under the assumption that the next access is most likely to be in
the vicinity of the last one.&lt;/p&gt;

&lt;p&gt;A second difference in how the &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3
controller&lt;/a&gt; handles memory banks
is that, unlike the MIG, the bank address is drawn from the bits between the
row and column address, as shown in Fig. 3.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 3. Bank addressing&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/bank-addressing.svg&quot; width=&quot;360&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Although the MIG has an &lt;em&gt;option&lt;/em&gt; to do this, it isn’t
clear that the MIG takes any advantage of this arrangement.  The &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3
controller&lt;/a&gt;, on the other hand, was
designed to take explicit advantage of this arrangement.  Specifically, the
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt; assumes most
accesses will be sequential through memory.  Hence, when it gets a request for
a memory access that is most of the way through the column space of a given
row, it then activates the next row on the next bank.  This takes place
independent of any user requests, and therefore anticipates a future user
request which may (or may not) take place.&lt;/p&gt;
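
&lt;p&gt;To make this arrangement concrete, here’s a Python sketch of a row/bank/column address split together with the “open the next one early” decision.  The ten column bits, the three quarter threshold, and both function names are assumptions chosen for illustration, and the sketch ignores byte offsets and burst alignment.  It is not taken from the UberDDR3 sources.&lt;/p&gt;

```python
COL_BITS = 10     # assumed number of column address bits
BANK_BITS = 3     # eight banks
COLS = 2 ** COL_BITS
BANKS = 2 ** BANK_BITS


def decode(addr):
    """Split a linear address laid out as {row, bank, column}."""
    rest, col = divmod(addr, COLS)
    row, bank = divmod(rest, BANKS)
    return row, bank, col


def anticipated(addr, threshold=0.75):
    """Return the (row, bank) a sequential stream would reach next.

    Returns None when the access is not yet far enough through the
    column space for this sketch to activate anything early.
    """
    row, bank, col = decode(addr)
    if col >= threshold * COLS:
        # First address past the end of the current column space
        next_row, next_bank, _ = decode(addr - col + COLS)
        return next_row, next_bank
    return None
```

&lt;p&gt;With the bank bits sitting just above the column bits, walking off the end of one row lands on the next bank, which can be activated while the current bank is still busy.  Only after the last bank does the row number itself increment.&lt;/p&gt;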

&lt;p&gt;Xilinx’s documentation reveals very little about their REFRESH strategy.
The &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt;’s REFRESH
strategy is very simple: every so many clocks (827 in this case) the memory
is taken off line for a REFRESH cycle.  This cycle lasts some number of clocks
(46 for this test setup), and then places the memory back on line for further
accesses.&lt;/p&gt;
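
&lt;p&gt;Those two numbers are enough for a back of the envelope estimate of what REFRESH costs.  The Python sketch below assumes the 827 clocks are counted from the start of one refresh to the start of the next, which is my reading of the numbers rather than something I’ve confirmed.  The function name is made up for illustration.&lt;/p&gt;

```python
REFRESH_PERIOD = 827    # clocks between refresh cycles (from the text)
REFRESH_LENGTH = 46     # clocks the memory is off line per refresh


def refresh_overhead(period=REFRESH_PERIOD, length=REFRESH_LENGTH):
    """Fraction of clocks lost to refresh.

    Assumes `period` runs from the start of one refresh to the
    start of the next, so that `length` is part of `period`.
    """
    return length / period


print(f"{100 * refresh_overhead():.1f}% of clocks spent in refresh")
```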

&lt;p&gt;This refresh timing is one of those things that makes working with SDRAM in
general so difficult: it can be very hard to predict when the memory will be
offline for a refresh, and so predicting performance can be a challenge.
I know I have personally suffered from testing against an approximation of
SDRAM memory, one that has neither REFRESH nor PLL station keeping cycles,
only to suffer later when I switch to a real memory and then get hit with a
stall or delayed ACK at a time when I’m not expecting it.  &lt;a href=&quot;/blog/2018/08/04/sim-mismatch.html&quot;&gt;Logic that worked
perfectly in my (less-than matched) simulation would then fail in
hardware&lt;/a&gt;.  This
can also be a big challenge for security applications that require a
fixed (and known) access time to memory lest they leak information across
security domains.&lt;/p&gt;

&lt;h2 id=&quot;the-test-setup&quot;&gt;The test setup&lt;/h2&gt;

&lt;p&gt;Before diving into test results, allow me to introduce the test setup.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;caption&gt;Fig 4. An Enclustra Mercury+ KX2 carrier board mounted on an ST1 baseboard&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/enclustra.png&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;I’ll be running my memory tests using my &lt;a href=&quot;https://github.com/ZipCPU/kimos&quot;&gt;Kimos
project&lt;/a&gt;.  &lt;a href=&quot;https://github.com/ZipCPU/kimos&quot;&gt;This
project&lt;/a&gt; uses an &lt;a href=&quot;https://www.enclustra.com/en/products/fpga-modules/mercury-kx2/&quot;&gt;Enclustra Mercury+ KX2
carrier board&lt;/a&gt;
containing a 2GB DDR3 memory and a Kintex-7 160T mounted on an &lt;a href=&quot;https://www.enclustra.com/en/products/base-boards/mercury-st1/&quot;&gt;Enclustra
Mercury+ ST1
baseboard&lt;/a&gt;.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 5. Test setup&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/testsetup.svg&quot; width=&quot;360&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Fig. 5 shows the relevant components of the memory chain used by this
&lt;a href=&quot;https://github.com/ZipCPU/kimos&quot;&gt;Kimos project&lt;/a&gt; together with three test
points for observation.  The project contains a
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;.  (Of course!) That
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; has both instruction and data
interfaces to memory.  Each interface contains a 4kB cache.  The &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.html&quot;&gt;instruction
cache&lt;/a&gt;
in particular is large enough to hold all of the instructions required for
each of the code loops required by our bench, and so it becomes transparent
to the test.  This is not true of the
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.html&quot;&gt;data cache&lt;/a&gt;.
The benchmarks I have chosen today are specifically designed to force
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.html&quot;&gt;data cache&lt;/a&gt;
misses, and then to watch how the controller responds.  In the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;, those two interfaces are then
merged together via a
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/ex/wbdblpriarb.v&quot;&gt;arbiter&lt;/a&gt;,
and again merged with a &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/ex/wbpripriarb.v&quot;&gt;second
arbiter&lt;/a&gt;
with the DMA’s &lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
requests.  The result is that the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; has only a single bus interface.&lt;/p&gt;

&lt;p&gt;Bus requests from the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;,
to include the ZipDMA, are generated at a width
designed to match the bus.  The interface to the Enclustra’s SDRAM naturally
maps to 512 bits, so requests are generated (and recovered) at a bus width of
512 bits.&lt;/p&gt;

&lt;p&gt;Once requests leave the
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/zipsystem.v&quot;&gt;ZipSystem&lt;/a&gt;, they
enter a &lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;crossbar&lt;/a&gt; &lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/master/rtl/wbxbar.v&quot;&gt;Wishbone
interconnect&lt;/a&gt;.
This &lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;interconnect&lt;/a&gt; allows the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
to interact with &lt;a href=&quot;/blog/2019/03/27/qflexpress.html&quot;&gt;flash
memory&lt;/a&gt;, block RAM
memory, and the DDR3 SDRAM memory.  An additional port also allows interaction
with a control bus operating at 32 bits.  Other peripheral DMAs can also
master the bus through this
&lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;crossbar&lt;/a&gt;, to include the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SD card controller&lt;/a&gt;,
an &lt;a href=&quot;https://github.com/ZipCPU/wbi2c&quot;&gt;I2C controller&lt;/a&gt;, &lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/master/rtl/wbi2c/wbi2cdma.v&quot;&gt;an I2C
DMA&lt;/a&gt;, and an
&lt;a href=&quot;/blog/2017/06/05/wb-bridge-overview.html&quot;&gt;external debugging
bus&lt;/a&gt;.  Other than
loading program memory via &lt;a href=&quot;/blog/2017/06/05/wb-bridge-overview.html&quot;&gt;the debugging
bus&lt;/a&gt;
to begin the test, these other bus masters will be idle during our testing.&lt;/p&gt;

&lt;p&gt;After leaving the &lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;crossbar&lt;/a&gt;,
the &lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
request goes in one of two directions.  It can either go to a &lt;a href=&quot;/blog/2020/03/23/wbm2axisp.html&quot;&gt;Wishbone to AXI
converter&lt;/a&gt;
and then to the MIG, or it can go straight to the
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt;.  (Only one
of these controllers will ever be part of the design at a given time.)&lt;/p&gt;

&lt;p&gt;A legitimate question is whether or not the &lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/master/rtl/wbm2axisp.v&quot;&gt;Wishbone to AXI
converter&lt;/a&gt;
will impact this test, or to what extent it will impact it.  From a timing
standpoint, &lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/master/rtl/wbm2axisp.v&quot;&gt;this
converter&lt;/a&gt;
costs one clock cycle from the
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
strobe to the AXI AxVALID signal.
This will add one clock of latency to any MIG request.  We’ll have to adjust
any results we calculate by this one clock cycle.  The
&lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/master/rtl/wbm2axisp.v&quot;&gt;converter&lt;/a&gt;
also requires 625 logic elements (LUTs).&lt;/p&gt;

&lt;p&gt;What about AXI?  The
&lt;a href=&quot;/blog/2020/03/23/wbm2axisp.html&quot;&gt;converter&lt;/a&gt;
doesn’t produce full AXI.  All requests, coming out of the converter, are
for burst lengths of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AxLEN=0&lt;/code&gt; (i.e. one beat), a constant &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AxID&lt;/code&gt; of one bit,
an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AxSIZE&lt;/code&gt; corresponding to 512 bits (64 bytes per beat), &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AxCACHE=4'd3&lt;/code&gt;, and so forth.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;This will impact area.&lt;/p&gt;

    &lt;p&gt;A good synthesizer should be able to recognize these constants to reduce
both the logic area and logic cost of the MIG.
(&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3&lt;/a&gt; is already
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
based, so this won’t change anything.)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;What about AXI bursts?&lt;/p&gt;

    &lt;p&gt;Frankly, bursts tend to slow down AXI traffic, rather than speed it up.
As &lt;a href=&quot;/blog/2019/05/29/demoaxi.html&quot;&gt;we’ve already discovered on this
blog&lt;/a&gt;, the first thing an
AXI slave needs to do with a burst request is to unwind the burst.  This
takes extra logic, and often costs a clock cycle (or two).  As a result,
Xilinx’s block RAM controller (not the MIG) suffers an extra clock lost on
any burst request.  The MIG, on the other hand, doesn’t seem affected by
burst requests (or lack thereof)–although they may contribute a clock or
two to latency.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;What about AXI pipelining?&lt;/p&gt;

    &lt;p&gt;Both AXI and the &lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;pipelined
Wishbone&lt;/a&gt;
specification I use are &lt;em&gt;pipelined&lt;/em&gt; bus implementations.  This means that
multiple requests may be in flight at a time.  I don’t foresee any
differences, therefore, between the two controllers due to AXI’s pipelined
nature.&lt;/p&gt;

    &lt;p&gt;Had we been using Wishbone &lt;em&gt;Classic&lt;/em&gt;, then our memory performance would’ve
taken a significant hit.  (This is &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/doc/orconf.pdf&quot;&gt;one of the reasons why I &lt;em&gt;don’t&lt;/em&gt; use
Wishbone
Classic&lt;/a&gt;.)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;What about Read/Write reordering?&lt;/p&gt;

    &lt;p&gt;The MIG may be able to reorder requests to its advantage.  In our test, we
will only ever give it a single burst of read or write requests (all with
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AxLEN=0&lt;/code&gt;), and we will wait for all responses to come back from the
controller before switching directions.  It is possible that the MIG might
have a speed advantage over the
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3&lt;/a&gt; controller in a
direction swapping environment.  If so, then today’s test is not likely
to reveal those differences.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that you know something about the various test setups, let’s look at some
benchmarks.&lt;/p&gt;

&lt;h2 id=&quot;the-lutsize-benchmark&quot;&gt;The LUT/Size benchmark&lt;/h2&gt;

&lt;p&gt;When I first started out working with FPGAs, I remember my sorrow at seeing
how much of my precious &lt;a href=&quot;http://store.digilentinc.com/arty-artix-7-fpga-development-board-for-makers-and-hobbyists&quot;&gt;Arty&lt;/a&gt;’s
LUTs were used by Xilinx’s MIG controller.
At the time, I was struggling for funds, and didn’t really have the kind of
cash required to purchase a &lt;em&gt;big&lt;/em&gt; FPGA with lots of area.  An Artix 35T was
(roughly) all I could afford, and the MIG used a large percentage of its area.&lt;/p&gt;

&lt;p&gt;Since area is proportional to dollars, let’s take a look at how much area
each of the controllers uses in today’s test.&lt;/p&gt;

&lt;p&gt;On a Kintex-7 160T, mounted on an &lt;a href=&quot;https://www.enclustra.com/en/products/fpga-modules/mercury-kx2/&quot;&gt;Enclustra Mercury+ KX2 carrier
board&lt;/a&gt;,
the MIG controller uses 24,833 LUTs out of 101,400 LUTs.  This is a full
24.5% of the FPGA’s total logic resources.  Fig. 6 shows a Vivado generated
hierarchy diagram, showing how much of the design this component requires.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;caption&gt;Fig 6. Area usage hierarchy with the MIG&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;/img/migbench/mig-usage.png&quot;&gt;&lt;img src=&quot;/img/migbench/mig-usage.png&quot; width=&quot;500&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;The diagram reveals a lot about area.  Thankfully, the MIG only uses a quarter
of it.  The majority of the area used in this design is used by the components
that have to touch the 512-bit bus.  These include the
&lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;crossbar&lt;/a&gt;, the CPU’s DMA,
the &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO controller&lt;/a&gt;’s DMA, the various
Ethernet bus components, and so on.  The most obvious conclusion is that, if
you want memory bandwidth, you will have to pay for it.  This should come as
no surprise to those who have worked in digital design for some time.&lt;/p&gt;

&lt;p&gt;On the same board, the &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3
controller&lt;/a&gt; uses 13,105
LUTs, or 12.9% of the chip’s total logic resources.  A similar hierarchy
diagram of the design containing the &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3
controller&lt;/a&gt; can be found in Fig. 7.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right; padding: 25px&quot;&gt;&lt;caption&gt;Fig 7. Area usage hierarchy with the UberDDR3 Controller&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;/img/migbench/uber-usage.png&quot; width=&quot;500&quot;&gt;&lt;img src=&quot;/img/migbench/uber-usage.png&quot; width=&quot;500&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;To be fair, the Xilinx controller must also decode AXI–a rather complex
protocol.  However, &lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/master/rtl/axim2wbsp.v&quot;&gt;AXI may be converted to
Wishbone&lt;/a&gt; for
only 1,762 LUTs, suggesting this conversion alone isn’t sufficient to explain
the difference in logic cost.  Further, the &lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/master/rtl/wbm2axisp.v&quot;&gt;Wishbone to AXI
converter&lt;/a&gt;
used to feed the MIG uses only a restricted subset of the AXI protocol.  As
a result, it’s reasonable to believe that the synthesizer’s number, 24,833 LUTs,
is smaller than what a more complex AXI handler might require.&lt;/p&gt;

&lt;p&gt;On size alone, therefore, the &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3
controller&lt;/a&gt; comes out as the clear
winner.&lt;/p&gt;
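
&lt;p&gt;For anyone wanting to check the arithmetic, the percentages above can be reproduced with a few lines of Python.  The LUT counts are the ones quoted in this section.  The variable and function names are mine.&lt;/p&gt;

```python
TOTAL_LUTS = 101_400    # Kintex-7 160T, as quoted above
MIG_LUTS = 24_833
UBER_LUTS = 13_105


def percent(used, total=TOTAL_LUTS):
    """Percentage of the device's LUTs consumed by `used` LUTs."""
    return 100.0 * used / total


for name, luts in (("MIG", MIG_LUTS), ("UberDDR3", UBER_LUTS)):
    print(f"{name:9s} {luts:6d} LUTs  {percent(luts):5.1f}%")
print(f"MIG / UberDDR3 area ratio: {MIG_LUTS / UBER_LUTS:.2f}")
```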

&lt;p&gt;That makes the &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3
controller&lt;/a&gt; cheaper.  What about
faster?&lt;/p&gt;

&lt;!--
## Time to First Access

Enough background, let's start testing.

From power up to first access, how long does it take?

`LDI _sdram, R0`
`LDI _rtc, R1`
`LW (R0),R2`
`LW (R1),R3`
`HALT`
--&gt;

&lt;h2 id=&quot;the-raw-dma-bench-mark&quot;&gt;The raw DMA bench mark&lt;/h2&gt;

&lt;p&gt;We’ve &lt;a href=&quot;/blog/2021/08/14/axiperf.html&quot;&gt;previously discussed bus benchmarking for
AXI&lt;/a&gt;.  In &lt;a href=&quot;/blog/2021/08/14/axiperf.html&quot;&gt;that
article&lt;/a&gt;,
we identified every type of clock cycle associated with an AXI transaction,
and then counted how often each type of cycle took place.  Since &lt;a href=&quot;/blog/2021/08/14/axiperf.html&quot;&gt;that
article&lt;/a&gt;, I’ve built
&lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/master/rtl/wbperf.v&quot;&gt;something very similar for
Wishbone&lt;/a&gt;.
In hindsight, however, all of these measures tend to be way too complicated.
What I really want is the ability to summarize transactions simply in terms of
1) latency, and 2) throughput.  Therefore, I’ve chosen to model all DDR3
transaction times by the equation:&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/eqn-transaction-time.png&quot; alt=&quot;&quot; width=&quot;524&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;In this model, “Latency” is the time from the first request to the first
response, and “Throughput” is the fraction of time you can get one beat
returned per clock cycle.  Calculating these coefficients requires a basic
linear fit, and hence measurements of DMA transfers across a varying number
of beats–but we’ll get to that in a moment.&lt;/p&gt;

&lt;p&gt;The biggest challenge here is that the CPU can very much get in the way of
these measures, so we’ll begin our measurements using the DMA alone where
accesses are quite simple.&lt;/p&gt;

&lt;p&gt;Here’s how the test will work: The CPU will first program the
&lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/master/rtl/wbperf.v&quot;&gt;Wishbone bus measurement
peripheral&lt;/a&gt;.
It will then program the DMA to do a memory copy, from DDR3 SDRAM to DDR3
SDRAM.  The &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s DMA will break
this copy into parts: It will first read &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N&lt;/code&gt; words into a buffer, and then
(as a second step) write those &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N&lt;/code&gt; words to somewhere else on the memory.
During this operation, the CPU will not interact with the DDR3 memory at
all–to keep from corrupting any potential measures.  Instead, it will run all
instructions from an on-board block RAM.  Once the operation completes, the
CPU will issue a stop collection command to the &lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/master/rtl/wbperf.v&quot;&gt;Wishbone bus measurement
peripheral&lt;/a&gt;.
From there, the CPU can read back 1) how many requests were made, and 2) how
many clock cycles it took to either read or write each block.  From the DMA
configuration, we’ll know how many blocks were read and/or written.  From this,
we can create a simple regression to get the latency and throughput numbers we
are looking for.&lt;/p&gt;
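
&lt;p&gt;The regression itself can be sketched in a few lines of Python.  This sketch assumes the model has the form “clocks = latency + beats / throughput”, which is my plain text reading of the equation above, and it uses an ordinary least squares fit.  The sample numbers are synthetic, made up only to exercise the code.  They are not measurements from either controller.&lt;/p&gt;

```python
def fit_latency_throughput(beats, cycles):
    """Least-squares fit of cycles = latency + beats / throughput.

    beats[i] is the number of beats in transfer i, and cycles[i] is
    the number of clocks that transfer took.
    """
    n = len(beats)
    mean_b = sum(beats) / n
    mean_c = sum(cycles) / n
    sxx = sum((b - mean_b) ** 2 for b in beats)
    sxy = sum((b - mean_b) * (c - mean_c) for b, c in zip(beats, cycles))
    slope = sxy / sxx                   # clocks per beat
    latency = mean_c - slope * mean_b   # intercept, in clocks
    return latency, 1.0 / slope


# Synthetic data for illustration only, NOT measured results.
# Built from a latency of 10 clocks and two clocks per beat.
beats = [2, 4, 8, 16]
cycles = [14, 18, 26, 42]
latency, throughput = fit_latency_throughput(beats, cycles)
print(latency, throughput)
```

&lt;p&gt;Four transfer sizes give four points on the line, which is plenty for a two parameter fit.  In practice each point would be an average over many transfers of that size.&lt;/p&gt;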

&lt;p&gt;To see how this might work, let’s start with what a DMA trace might nominally
look like.  Ideally, we’d want to see something like Fig. 8.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 8. Ideal DMA&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;/img/migbench/ideal-dma.svg&quot;&gt;&lt;img src=&quot;/img/migbench/ideal-dma.svg&quot; width=&quot;720&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;In this “ideal” DMA, the DMA maintains two buffers.  If either of the two
buffers is empty, it issues a read command.  Once the buffer fills, it issues
a write command.  Fig. 8 shows these read and write requests in the “DMA-STB”
line, with “DMA-WE” (write-enable) showing which direction the requests are
going.  These requests then go through a
&lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;crossbar&lt;/a&gt;
and hit the DDR3 controller as
“SDRAM-STB” and “SDRAM-WE”.  (This simplified picture assumes no stalls, but
we’ll get to those.)  The SDRAM controller might turn around write requests
immediately, as soon as they are committed into its queue, whereas read
requests will take somewhat longer, until REFRESH cycles, bank precharging,
and activation cycles are complete and the data is finally returned.  Then,
as soon as a full block of read data is returned, the DMA can immediately turn
around and request to write that data.  Once a full block of write data has
been sent, the DMA then has the ability to reuse that buffer for the next block
of read data.&lt;/p&gt;

&lt;p&gt;AXI promises to be able to use memory in this fashion, and indeed my
&lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/master/rtl/axidma.v&quot;&gt;AXI DMA&lt;/a&gt;
attempts to do exactly that.&lt;/p&gt;

&lt;p&gt;When interacting with a real memory, things aren’t quite so simple.  Requests
will get delayed (I didn’t draw the stall signal in Fig. 8), responses have
delays, etc.  Further, there is a delay associated with turning the memory
bus around from read to write or back again.  Still, this is as simple as
we can make a bus transaction look.&lt;/p&gt;

&lt;p&gt;In &lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;,
unlike AXI, requests get grouped using the cycle line (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CYC&lt;/code&gt;).
You can see a notional
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt; DMA cycle in
Fig. 9.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 9. Wishbone DMA&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/cpb-dma.svg&quot; width=&quot;720&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Unlike the AXI promise, the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt; implementation
uses only a single buffer, and it doesn’t switch bus direction mid-bus cycle.&lt;/p&gt;

&lt;p&gt;Let’s look at this cycle line for a moment though.  This is a “feature” not
found in AXI.  The originating master raises this cycle line on the first
request, and drops it after the last acknowledgment.  The
&lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;crossbar&lt;/a&gt;
uses this signal
to know when it can drop arbitration for a given master, to allow a second
master to use the same memory.  The cycle line can also be used to tell
down stream slaves that the originating master is no longer interested in any
acknowledgments from its prior requests–effectively acting as a “bus abort”
signal.  This makes
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
more robust than AXI in the presence of hardware
failures, but it can also make
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
slower than AXI because bursts from
different masters cannot be interleaved while the master owning the bus holds
its cycle line high.&lt;/p&gt;

&lt;p&gt;Arguably, this
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
DMA approach will limit our ability to fully test
the MIG controller.  As a result, we may need to come back to this controller
and test again at a later time using an AXI DMA interface alone to see to what
extent that might impact our results.&lt;/p&gt;

&lt;p&gt;To make our math easier, we’ll add one more requirement: Our transactions will
either be to read or write 16, 8, 4, or 2 beats at a time.  On a 512-bit bus,
this corresponds to reading or writing 1024, 512, 256, or 128 bytes at a
time–with 1024 bytes being the size of the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s
DMA buffer, and therefore the maximum transfer size available.&lt;/p&gt;

&lt;p&gt;With all that said, it’s now time to look at some measurement data.&lt;/p&gt;

&lt;p&gt;First up is the MIG DDR3 controller.  Fig. 10 shows a trace of the
DMA waveform when transferring 16 beats of data at a time.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 10. MIG DMA&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/mig-cpb.png&quot; width=&quot;720&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;This image shows two
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
bus interfaces.  The top one is the view the
DMA has of the bus.  The bottom interface is the view coming out of the
&lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;crossbar&lt;/a&gt; and going into
the memory controller.&lt;/p&gt;

&lt;p&gt;In this image, it takes roughly 79 clock cycles to go from the beginning of
one read request, through a write request, to the beginning of the next read
request–as measured between the two vertical markers.&lt;/p&gt;

&lt;p&gt;Some things to notice include:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;It takes 4 clock cycles for the request to go from the DMA through the
&lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;crossbar&lt;/a&gt;
to the controller.&lt;/li&gt;
  &lt;li&gt;While not shown here, it takes one more clock cycle following
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sdram_stb &amp;amp;&amp;amp; !sdram_stall&lt;/code&gt; for the conversion to AXI.&lt;/li&gt;
  &lt;li&gt;Curiously, the SDRAM STALL line is not universally low during a burst
of requests.  In this picture, it often rises for a cycle at a time.  I have
conjectured above that this is due to the MIG’s need for PLL station keeping.&lt;/li&gt;
  &lt;li&gt;During writes, it takes 3 clocks to go from request to acknowledgment.&lt;/li&gt;
  &lt;li&gt;During reads, it can take 26 clocks from request to acknowledgment–or more.&lt;/li&gt;
  &lt;li&gt;Once the MIG starts acknowledging (returning) requests, the ACK line can
still drop mid response.  (This has cost me no end of heartache!)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If we repeat the same measurement with the &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3
controller&lt;/a&gt;, we get the trace shown
in Fig. 11.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 11. Uber DMA&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/uber2-cpb.png&quot; width=&quot;720&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;In this case, the full 1024-byte transfer cycle now takes 66 clock cycles
instead of 79.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;It takes 11 cycles from read request to read acknowledgment.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;It takes 7 cycles from write request to acknowledgment.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Unlike the MIG, there’s no periodic loss of acknowledgment.  In general,
once the acknowledgments start, they continue.  This won’t be universally
true, but the difference is still significant.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
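
&lt;p&gt;To put those two traces in more familiar units, the Python sketch below converts each full copy cycle (one 16 beat read followed by one 16 beat write) into bytes copied per clock.  The cycle counts are read from the two traces above, so this describes a single transaction from each controller and nothing more.  The names are mine.&lt;/p&gt;

```python
BYTES_PER_BEAT = 512 // 8     # one beat on the 512-bit bus
BEATS = 16                    # largest DMA block size
BLOCK_BYTES = BEATS * BYTES_PER_BEAT


def copy_rate(cycles, nbytes=BLOCK_BYTES):
    """Bytes copied (read, then written) per clock cycle."""
    return nbytes / cycles


for name, cycles in (("MIG", 79), ("UberDDR3", 66)):
    print(f"{name:9s} {cycles} clocks  {copy_rate(cycles):.2f} bytes/clock")
```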

&lt;p&gt;Of course, one transaction never tells the whole story; a full transaction
count is required.  When we look at all transactions, we find on
average:&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/membench.png&quot; width=&quot;800&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;These are the clearest performance numbers we will get to compare these two
controllers.  When writing to memory, the MIG is clearly faster.  This is
likely due to its ability to turn a request around before acting upon it.
(Don’t forget, one of these clocks of latency is due to the
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt; to AXI
&lt;a href=&quot;/blog/2020/03/23/wbm2axisp.html&quot;&gt;conversion&lt;/a&gt;, so the
MIG is one clock faster than shown in this chart!)  Given that the MIG can
turn a request around in 1.8 cycles, it must be doing so before examining
any of the details of the request!&lt;/p&gt;

&lt;p&gt;When reading from memory, the MIG is clearly slower–and that by a massive
amount.  One clock of this is due to the &lt;a href=&quot;/blog/2020/03/23/wbm2axisp.html&quot;&gt;Wishbone to AXI
conversion&lt;/a&gt;.  Another
clock (or two) is likely due to the AXI to native conversion.  The MIG must
also arbitrate between reads and writes, and must (likely) always activate
a row before it can be used.  All of this costs time.  As a result of these
losses and more that aren’t explained by these, the MIG is clearly &lt;em&gt;much&lt;/em&gt;
slower than the &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3
controller&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;under-load&quot;&gt;Under Load&lt;/h2&gt;

&lt;p&gt;Now that we’ve seen how the DDR3 controller(s) respond to a DMA in isolation,
let’s turn our attention to how they act in response to a CPU–the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; in this case.  (Of course!)
For our test configuration, the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; will have
both &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v&quot;&gt;data&lt;/a&gt;
and &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v&quot;&gt;instruction&lt;/a&gt;
caches.  Because of this, our test memory loads will need to be
extensive–to break through the cache–or else the cache will
get in the way of any decent measurement.&lt;/p&gt;

&lt;h3 id=&quot;how-the-cache-works&quot;&gt;How the Cache Works&lt;/h3&gt;

&lt;p&gt;Let’s discuss the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v&quot;&gt;data cache&lt;/a&gt;
for a moment, because it will become important when we try to understand how
fast the CPU can operate in various memory environments.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;First, the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; has only one
interface to the bus.  This interface is shared by both the
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v&quot;&gt;instruction&lt;/a&gt;
and &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v&quot;&gt;data&lt;/a&gt;
caches.  However, the
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v&quot;&gt;instruction&lt;/a&gt;
cache is (generally) big enough to fit most of our program, so it shouldn’t
impact the test much.&lt;/p&gt;

    &lt;p&gt;The one place where we’ll see the
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v&quot;&gt;instruction&lt;/a&gt;
cache impact our test is whenever the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; needs to cross
between cache lines.  As currently built, this will cost a one clock delay
to look up whether or not the next cache line is in the instruction cache.
Other than that, we’re not likely to see any impacts from the
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v&quot;&gt;instruction&lt;/a&gt;
cache.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v&quot;&gt;data cache&lt;/a&gt;
is a write through cache.  Any attempt to write to memory will go directly
to the bus and so to memory.  Along the way, the memory in the
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v&quot;&gt;cache&lt;/a&gt;
will be updated–but only if the memory to be written is also currently
kept in the cache.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; will not wait for a
write response from memory before going on to its next instruction.  Yes, it
will wait if the next instruction is a read instruction, but in all other
cases the next instruction is allowed to go forward as necessary.&lt;/p&gt;

    &lt;p&gt;One (unfortunate) consequence of this choice is that any bus error will
likely stop the CPU a couple of instructions &lt;em&gt;after&lt;/em&gt; the fault, potentially
confusing any engineer trying to understand which instruction, which register,
and which memory address was associated with the fault.  Such
faults are often called &lt;em&gt;asynchronous&lt;/em&gt; or &lt;em&gt;imprecise&lt;/em&gt; bus faults.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;When issuing multiple consecutive write operations, the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; will not wait for prior
operations to complete.  Two of our test cases will exploit this to issue
three write (or read) requests in a row.  In these tests, the CPU will
write either three 32b words or three 8b bytes on consecutive instructions
and hence clock cycles.&lt;/p&gt;

    &lt;p&gt;I tend to call these &lt;em&gt;pipelined writes&lt;/em&gt;, and I consider them to be some of
the better features of the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;All read operations first take a clock cycle to check the
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v&quot;&gt;cache&lt;/a&gt;.  As a
result, the minimum read time is two cycles: one to read from the
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v&quot;&gt;cache&lt;/a&gt;
and check for validity, and a second cycle to shift the 512b bus value
and return the 8, 16, or 32b result.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;As with the write operations, read operations can also be issued back to back.
Back to back read operations will have a latency of two clocks, but a
100% throughput–assuming they both read from the same
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v&quot;&gt;cache&lt;/a&gt; line.
If not, there will be an additional clock cycle lost to look up whether or
not the requested cache line validly exists within the cache.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Both
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v&quot;&gt;instruction&lt;/a&gt;
and &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v&quot;&gt;data
cache&lt;/a&gt;
sizes have been set to 4kB each.  Both caches will use a line size of eight
bus words (512 Bytes).  Neither cache uses &lt;a href=&quot;/zipcpu/2025/03/29/pfwrap.html&quot;&gt;wrap
addressing&lt;/a&gt; (although this
test will help demonstrate that they should …).  Instead, all cache
reads will start from the top of the cache line, and the CPU will stall
until the entire cache line is completely read before continuing.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To help to understand how this &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v&quot;&gt;data
cache&lt;/a&gt; works,
let’s examine three operations.  The first is a read cache miss, as shown in
Fig. 12.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 12. ZipCPU Read Data Cache Miss&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;/img/migbench/dcache-miss.png&quot; width=&quot;720&quot;&gt;&lt;img src=&quot;/img/migbench/dcache-miss.png&quot; width=&quot;720&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;In this case, a load word (LW) instruction flows through the 
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s &lt;a href=&quot;/zipcpu/2017/08/23/cpu-pipeline.html&quot;&gt;pipeline from prefetch (PF),
to decode (DCD), to the read operand (OP)
stage&lt;/a&gt;.  It then
leaves the read operand (OP) stage headed for the &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v&quot;&gt;data
cache&lt;/a&gt;.  The
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v&quot;&gt;data cache&lt;/a&gt;
requires a couple of clocks–as dictated by the block RAM it’s built from–to
determine
that the request is not in the cache.  Once this has been determined, the &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v&quot;&gt;data
cache&lt;/a&gt; initiates
a bus request to read a single cache line (8 bus words) from memory.  Both
cycle and strobe lines are raised.  The strobe line stays active until eight
cycles of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;stb &amp;amp;&amp;amp; !stall&lt;/code&gt; (stall is not shown here, but assumed low).  Once
eight requests have been made, the CPU waits for the last of the eight
acknowledgments.  Once the read is complete, and not before, the cache line
is declared valid and the CPU can read from it to complete its instruction.
This costs another four cycles before the LW instruction can be retired.&lt;/p&gt;

&lt;p&gt;While this cache line remains in our cache, further requests to read from
memory will take only two or three clocks: two clocks if the request
is for the same cache line as the prior access, or three clocks otherwise,
as shown in Fig. 13.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right; padding: 25px&quot;&gt;&lt;caption&gt;Fig 13. ZipCPU Data Cache Hit&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/dcache-hit.png&quot; width=&quot;280&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Finally, on any write request, the request will go straight to the bus as
shown in Fig. 14.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;caption&gt;Fig 14. ZipCPU Write to Memory (through the Data Cache)&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/dcache-write.png&quot; width=&quot;360&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;The CPU may then go on to other instructions, but the pipeline will
necessarily stall if it ever needs to interact with memory prior to this
write operation completing (unless it’s a set of consecutive writes …).&lt;/p&gt;

&lt;h3 id=&quot;sequential-lrs-word-access&quot;&gt;Sequential LRS Word Access&lt;/h3&gt;

&lt;p&gt;Our first CPU-based test is that of sequential word access.  Specifically,
we’ll work our way through memory, and write a pseudo random value to every
word in memory–one word at a time.  We’ll then come back through memory
and read and verify that all of the memory values were written as desired.&lt;/p&gt;

&lt;p&gt;From C, the write loop is simple enough:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-c&quot; data-lang=&quot;c&quot;&gt;&lt;span class=&quot;cp&quot;&gt;#define	STEP(F,T)  asm volatile(&quot;LSR 1,%0\n\tXOR.C %1,%0&quot; : &quot;+r&quot;(F) : &quot;r&quot;(T))
&lt;/span&gt;	&lt;span class=&quot;c1&quot;&gt;// ...&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;while&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mptr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;STEP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TAPS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// fill = (fill&amp;amp;1)?((fill&amp;gt;&amp;gt;1)^TAPS):(fill&amp;gt;&amp;gt;1);&lt;/span&gt;
		&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mptr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The “STEP” macro exploits the fact that the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s LSR (logical shift right)
instruction shifts the least significant bit into the carry flag, so that a
linear feedback shift register (LFSR) may be stepped with only two instructions.
The second instruction is a conditionally executed exclusive OR operation,
only executed if the carry flag was set–indicating that a one was shifted out
of the register.&lt;/p&gt;
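
&lt;p&gt;For readers without a ZipCPU toolchain at hand, the same step can be sketched
in portable C.  This is only an illustration of the commented-out C expression
in the loop above, not the code used for these measurements: the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lfsr_step()&lt;/code&gt; helper is hypothetical, and the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TAPS&lt;/code&gt; value is a placeholder, since the taps actually used
aren’t given here.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-c&quot; data-lang=&quot;c&quot;&gt;#include &amp;lt;stdint.h&amp;gt;
#include &amp;lt;stdio.h&amp;gt;

/* Placeholder taps for illustration only; not taken from the test code. */
#define TAPS 0x80000057u

/* One LFSR step, written out in portable C (hypothetical helper). */
static uint32_t lfsr_step(uint32_t fill)
{
	uint32_t carry = fill &amp;amp; 1u;	/* the bit LSR shifts into the carry flag */

	fill &amp;gt;&amp;gt;= 1;			/* LSR 1,fill */
	if (carry)			/* XOR.C TAPS,fill */
		fill ^= TAPS;
	return fill;
}

int main(void)
{
	uint32_t fill = 1u;	/* any nonzero seed will do */

	for (int i = 0; i &amp;lt; 4; i++) {
		fill = lfsr_step(fill);
		printf(&quot;%08x\n&quot;, (unsigned)fill);
	}
	return 0;
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The two instruction assembly version does the same work, but folds the test
of the shifted-out bit into the carry flag rather than keeping it in a
separate variable.&lt;/p&gt;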

&lt;p&gt;This simple loop then compiles into the following
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
&lt;a href=&quot;/zipcpu/2018/01/01/zipcpu-isa.html&quot;&gt;assembly&lt;/a&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;loop:
	LSR        $1,R2	; STEP(fill, TAPS)
	XOR.C      R3,R2
	SW         R2,(R1)	; *mptr = fill
	| ADD        $4,R1	;  mptr++
	CMP        R6,R1	; if (mptr &amp;lt; end)
	BC         loop		;	go to top of loop&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Basically, we step the LFSR by shifting right by one.  If the bit shifted
over the edge was a one, we exclusive OR the register with our taps.  (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XOR.C&lt;/code&gt;
only performs the exclusive OR if the carry bit is set.)
We then store this word (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SW&lt;/code&gt;= store word) into our memory address (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;R1&lt;/code&gt;),
increment the address by adding four to it, and then compare the result with
a pointer to the end of our memory region.  If we are still less than the
end of memory, we go back and loop again.&lt;/p&gt;

&lt;p&gt;Inside the CPU’s pipeline, this loop might look like Fig. 15.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 15. Simple write pipeline&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;/img/migbench/cp1.svg&quot;&gt;&lt;img src=&quot;/img/migbench/cp1.svg&quot; width=&quot;720&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Let’s work our way through the details of this diagram.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;There are &lt;a href=&quot;/zipcpu/2017/08/23/cpu-pipeline.html&quot;&gt;four pipeline stages: prefetch (PF), decode (DCD), read operands
(OP), and write-back
(WB)&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; allows some pairs of
instructions to be packed together.  In this case, I’ve used the vertical
bar to indicate instruction pairing.  Hence the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S|A&lt;/code&gt; instruction coming
from the prefetch is one of these combined instructions.  The instruction
decoder turns this into two instructions, forcing the prefetch to stall
for a cycle until the second instruction can advance.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;In general and when things are working well, all instructions take one clock
cycle.  Common exceptions to this rule are made for memory, divide, and
multiply instructions.  For this exercise, only memory operations will take
longer.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The store word instruction must stall and wait if the memory unit is busy.
For the example in Fig. 15, I’ve chosen to begin the example with a busy
memory, so you can see what this might look like.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Once the store word request has been issued to the memory controller, a bus
request starts and the CPU continues with its next instruction.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The bus request must go through the
&lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;crossbar&lt;/a&gt;
to get to the SDRAM.  As shown here, this takes three cycles.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The memory then accepts the request, and acknowledges it.&lt;/p&gt;

    &lt;p&gt;In the case of the MIG, this request is acknowledged almost immediately.
The &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt; takes
several more clock cycles before acknowledging this request.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;It takes another clock for this acknowledgment to return back through the
&lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;crossbar&lt;/a&gt; to the CPU.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;By this time, the CPU has already gone ahead without waiting for the bus
return.  However, once the acknowledgment has returned, the CPU can accept a
new memory instruction request.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;When the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; hits the branch
instruction (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BC&lt;/code&gt; = Branch if carry is set), the CPU must clear its pipeline
to take the branch.  This forces the pipeline to be flushed.  The colorless
instructions in Fig. 15 are voided, and so never executed.  The jump flag is
sent to the prefetch and so the CPU must wait for the next instruction to be
valid.  (No, the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; does not
have any branch prediction logic.  A branch predictor might have saved us
from these stalls.)  If, as shown here, the branch remains in the same
instruction cache line, a new instruction may be returned immediately.
Otherwise it may take another cycle to complete the cache lookup for an
arbitrary cache line.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you look closely, you’ll notice that the performance of this tight loop
is heavily dependent upon the memory performance.  If the memory write cannot
complete by the time the next write needs to take place, the CPU must then stall
and wait.&lt;/p&gt;

&lt;p&gt;Using our two test points, we can see how the two controllers handle this test.
Of the two, the MIG controller is clearly the faster, although the speed
difference is (in this case) irrelevant.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 16. MIG Write pipeline&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/mig-cp1.png&quot; width=&quot;720&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Indeed, as we’ve discussed, the MIG’s return comes back so fast that it is
clear the MIG has not completed sending this request to the DDR3.  Instead,
it’s just committed the request to its queue, and then returns its
acknowledgment.  This acknowledgment also comes back fast enough that the
CPU memory controller is idle for two cycles per loop.  As a result, the
memory write time is faster than the loop, and the loop time (10 clock cycles,
from marker to marker) is dominated by the time to execute each of the
instructions.&lt;/p&gt;

&lt;p&gt;Let’s now look at the trace from the &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3
controller&lt;/a&gt; shown in Fig. 17.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 17. Uber Write pipeline&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/uber2-cp1.png&quot; width=&quot;720&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;The big thing to notice here is that the &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3
controller&lt;/a&gt; takes one more clock
cycle to return a busy status.  Although this is slower than the MIG, it isn’t
enough to slow down the CPU, so each pass through the loop still takes 10 cycles.&lt;/p&gt;

&lt;p&gt;If you dig just a bit deeper, you’ll find that every 22us or so, the MIG
takes longer to acknowledge a write request.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 18. MIG Write pipeline with stall&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/mig-cp1-stall.png&quot; width=&quot;720&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;In this case, the loop requires 22 clock cycles to complete.&lt;/p&gt;

&lt;p&gt;In a similar fashion, every 827 clocks (8.27 us), the &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3
controller&lt;/a&gt; does a memory refresh.
During this time, the &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3
controller&lt;/a&gt; will also take longer to
acknowledge a write request.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 19. Uber Write pipeline with stall&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/uber2-cp1-stall.png&quot; width=&quot;720&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;In this case, it takes the &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3
controller&lt;/a&gt; 57 clocks to complete a
single loop.&lt;/p&gt;

&lt;p&gt;Let’s now turn our attention to the read half of this test, where we go back
through memory in roughly the same fashion to verify the memory writes
completed as desired.  In particular, we’ll want to look at cache misses.
Such misses don’t happen often, but they are the only time the design
interacts with its memory.&lt;/p&gt;

&lt;p&gt;From C, our read loop is similarly simple:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-c&quot; data-lang=&quot;c&quot;&gt;&lt;span class=&quot;cp&quot;&gt;#define	FAIL		asm(&quot;TRAP&quot;)
&lt;/span&gt;	&lt;span class=&quot;c1&quot;&gt;// ...&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;while&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mptr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;STEP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TAPS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mptr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;FAIL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;mptr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The big difference here is that, if the memory ever fails to match the
pseudorandom sequence, we’ll issue a TRAP instruction which will cause the
CPU to halt.  This forces a branch into the middle of our loop.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;loop:
	LSR        $1,R0	; STEP(fill, TAPS)
	XOR.C      R2,R0
	LW         (R1),R3	; *mptr
	| CMP      R0,R3
	BZ         no_trap	; if (*mptr == (int)fill) ... skip
	TRAP			;   break into supervisor mode--never happens
no_trap:
	ADD        $4,R1	; mptr++
	| CMP      R6,R1	; if (mptr &amp;lt; end)
	BC         loop		;   loop some more&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Inside the CPU’s pipeline, this loop might look like Fig. 20.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 20. Read pipeline&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;/img/migbench/cp2.svg&quot;&gt;&lt;img src=&quot;/img/migbench/cp2.svg&quot; width=&quot;720&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;This figure shows two times through the loop–one with a cache miss, and
one where the data fits entirely within the cache.  In this case, the time
through the loop upon a cache miss is entirely dependent upon how long the
memory controller takes to read.  &lt;em&gt;EVERY&lt;/em&gt; clock cycle associated with reading
from memory (on a cache miss) costs us.&lt;/p&gt;

&lt;p&gt;Fig. 21 shows a trace captured from the MIG during this operation.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 21. MIG Data read, cache miss&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/mig-cp2.png&quot; width=&quot;720&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Here we can see that it takes 35 cycles to read from memory on a cache miss.
These 35 cycles directly impact the time it takes to complete our loop.&lt;/p&gt;

&lt;p&gt;Since the memory is being read into the data cache, we are reading eight 512 bit
words at a time, which we will then process 32 bits per loop.  Hence, one might
expect a cache miss one of every 128 loops.&lt;/p&gt;
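
&lt;p&gt;The arithmetic behind that figure can be checked with a few lines of C.
This is a sketch using only the cache parameters quoted earlier (512 bit bus
words, eight bus words per line, 4kB of cache, and 32 bit accesses).  The
function names are made up for illustration, and the estimate assumes strictly
sequential access with exactly one miss per cache line.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-c&quot; data-lang=&quot;c&quot;&gt;#include &amp;lt;stdio.h&amp;gt;

/* Bytes in one cache line, given the bus width and bus words per line. */
static unsigned line_bytes(unsigned bus_bits, unsigned line_words)
{
	return line_words * (bus_bits / 8u);
}

/* Number of sequential accesses, each access_bits wide, held by one line. */
static unsigned accesses_per_line(unsigned bus_bits, unsigned line_words,
				  unsigned access_bits)
{
	return (line_words * bus_bits) / access_bits;
}

int main(void)
{
	const unsigned bus_bits = 512u, line_words = 8u;
	const unsigned cache_bytes = 4096u, access_bits = 32u;
	unsigned lb = line_bytes(bus_bits, line_words);

	printf(&quot;line size:      %u bytes\n&quot;, lb);
	printf(&quot;lines in cache: %u\n&quot;, cache_bytes / lb);
	printf(&quot;loops per miss: %u\n&quot;,
		accesses_per_line(bus_bits, line_words, access_bits));
	return 0;
}&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;With these numbers, the sketch reports 512 byte lines, eight lines in the
cache, and 128 loops per miss, matching the one in 128 estimate.&lt;/p&gt;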

&lt;p&gt;Accepting that it takes us 17 clocks to execute this loop without a cache
miss, we can calculate the loop time with cache misses as:&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/eqn-cp2-readloop.png&quot; alt=&quot;&quot; width=&quot;524&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;In this case, a cache miss occurs once every 128 times through the loop.
The other latency is 4 clocks for the
&lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;crossbar&lt;/a&gt;,
and another 5 clocks in the &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v&quot;&gt;cache
controller&lt;/a&gt;.
Hence, our loop time for a 35 cycle read, one every 128
times, is about 17.5 cycles.  This is pretty close to the measured time of
17.35 cycles.&lt;/p&gt;

&lt;p&gt;How about the &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt;?
Fig. 22 shows us an example waveform.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 22. UberDDR3 Data read, cache miss&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/uber2-cp2.png&quot; width=&quot;720&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;In this case, it takes 17 clock cycles to access the DDR3 SDRAM.  From this
one might expect 17.07 clocks per loop.  In reality, we measure about 17.23,
likely due to the times when our reads land on REFRESH cycles, as shown in
Fig. 23 below, where the read takes 27 clocks instead of 17.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 23. UberDDR3 Data read, cache miss, colliding with a REFRESH cycle&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/uber2-cp2-refresh.png&quot; width=&quot;720&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Our conclusion?  In this test case, the differences between the MIG and
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt;s
are nearly irrelevant.  The MIG is faster for singleton writes, but we aren’t
writing often enough to notice.  The &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3
controller&lt;/a&gt;
is much faster when reading, but the cache helps to hide the difference.&lt;/p&gt;

&lt;h3 id=&quot;sequential-lrs-triplet-word-access&quot;&gt;Sequential LRS Triplet Word Access&lt;/h3&gt;

&lt;p&gt;Let’s try a different test.  In this case, let’s write three words at a time,
per loop, and then read them back again.  As before, we’ll move sequentially
through memory from one end to the next.  Our goal will be to exploit the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s pipelined memory access
capability, to see to what extent that might make a difference.&lt;/p&gt;

&lt;p&gt;Why are we writing three values to memory?  For a couple of reasons.  First,
it can be a challenge to find enough spare registers to write much more.
Technically we might be able to write eight at a time, but we still need to
keep track of the various pointers and so forth for the rest of the function
we’re using.  Second, three is an odd prime number.  This will force us to
have memory steps that cross cache lines, making for some unusual accesses.&lt;/p&gt;

&lt;p&gt;Here’s the C code for writing three pseudorandom words to memory.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-c&quot; data-lang=&quot;c&quot;&gt;	&lt;span class=&quot;k&quot;&gt;while&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mptr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;register&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

		&lt;span class=&quot;n&quot;&gt;STEP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TAPS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;STEP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TAPS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;STEP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TAPS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

		&lt;span class=&quot;n&quot;&gt;mptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;mptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;mptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

		&lt;span class=&quot;n&quot;&gt;mptr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;As before, we’re using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;STEP&lt;/code&gt; macro (defined above) to step a linear
feedback shift register, used as a pseudorandom number generator, and then
writing these pseudorandom numbers to memory.  As before, the &lt;em&gt;pseudo&lt;/em&gt; in
&lt;em&gt;pseudorandom&lt;/em&gt; will be very important when we try to verify that our memory
was written as intended.&lt;/p&gt;

&lt;p&gt;GCC converts this C into the following assembly.  (Note, I’ve renamed the
Loop labels and added comments, etc., to help keep this readable.)&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;loop:
	MOV        R3,R2	; STEP(fill, TAPS); a = fill;
	LSR        $1,R2
	XOR.C      R8,R2
	MOV        R2,R4	; STEP(fill, TAPS); b = fill;
	LSR        $1,R4
	XOR.C      R8,R4
	MOV        R4,R3	; STEP(fill, TAPS); c = fill;
	LSR        $1,R3
	XOR.C      R8,R3
	SW         R2,$-12(R0)	; mptr[0] = a;
	SW         R4,$-8(R0)	; mptr[1] = b;
	SW         R3,$-4(R0)	; mptr[2] = c;
	| ADD      $12,R0	; mptr += 3;
	CMP        R6,R0	; if (mptr+3 &amp;lt; end)
	BC         loop&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Even though we’re operating on three words at a time, the loop remains quite
similar.  &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LSR/XOR.C&lt;/code&gt; steps the LRS.  Once we have three values, we use
three consecutive &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SW&lt;/code&gt; (store word) instructions to write these values to
memory.  We then adjust our pointer, compare, and loop if we’re not done yet.&lt;/p&gt;
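
&lt;p&gt;For readers who skipped ahead, here’s a minimal C sketch of what that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LSR/XOR.C&lt;/code&gt; pair appears to compute.  It is not the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;STEP&lt;/code&gt; macro itself, which is defined earlier.  The function names are hypothetical, and I’m assuming a Galois-style shift-right register in which the taps are only applied when the bit shifted out is a one.&lt;/p&gt;

```c
/* Hypothetical stand-in for the STEP macro defined earlier in the article.
 * Assumes a Galois-style, shift-right LFSR, inferred from LSR/XOR.C. */
static unsigned lfsr_step(unsigned fill, unsigned taps)
{
	unsigned carry = fill % 2u;	/* The bit that LSR shifts out */

	fill = fill >> 1;		/* LSR $1,Rx */
	if (carry != 0u)
		fill = fill ^ taps;	/* XOR.C taps,Rx: only if carry set */
	return fill;
}

/* Store three consecutive pseudorandom words, returning the new fill */
static unsigned write_triplet(unsigned *mptr, unsigned fill, unsigned taps)
{
	unsigned a, b, c;

	fill = lfsr_step(fill, taps);	a = fill;
	fill = lfsr_step(fill, taps);	b = fill;
	fill = lfsr_step(fill, taps);	c = fill;

	mptr[0] = a;
	mptr[1] = b;
	mptr[2] = c;
	return fill;
}
```

&lt;p&gt;Because the sequence depends only on the starting fill and the taps, the read loop can regenerate exactly the same three values and compare them against memory.&lt;/p&gt;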

&lt;p&gt;Fig. 24 shows what the CPU pipeline might look like for this loop.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 24. Triplet Write pipeline&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;/img/migbench/cp3.svg&quot;&gt;&lt;img src=&quot;/img/migbench/cp3.svg&quot; width=&quot;720&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Unlike our first test, we’re now crossing between instruction cache lines.
This means that there’s a dead cycle between the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LSR&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XOR&lt;/code&gt; instructions,
and another one following the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BC&lt;/code&gt; (branch if carry) loop instruction before
the prefetch is able to return the first instruction.&lt;/p&gt;

&lt;p&gt;Unlike the last test, our memory operation takes three consecutive cycles.&lt;/p&gt;

&lt;p&gt;Here’s a trace showing this write from the perspective of the MIG controller.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 25. Triplet writes using the MIG&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/mig-cp3.png&quot; width=&quot;720&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;In this case, it takes 6 clocks (as shown) for the MIG to acknowledge all
three writes.  You’ll also note that the
&lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;crossbar&lt;/a&gt; stalls the
requests, but that you don’t see any evidence of that at the SDRAM controller.
This is simply due to the fact that it takes the
&lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;crossbar&lt;/a&gt; a clock to
arbitrate, and it has a two pipeline stage buffer before arbitration is
required.  As a result, the third request through this
&lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;crossbar&lt;/a&gt; routinely stalls.
Put together, this entire loop requires 21 cycles from one request to the next.&lt;/p&gt;

&lt;p&gt;Now let’s look at a trace from the 
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt;.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 26. Triplet writes using the UberDDR3 controller&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/uber2-cp3.png&quot; width=&quot;720&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;In this case, it takes 8 clocks for 3 writes.  The 
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt; is two clocks
slower than the MIG.  However, it still takes 21 cycles from one request to the
next, suggesting that we are still managing to hide the memory access cost
by running other instructions in the loop.  Indeed, if you dig just a touch
deeper, you’ll see that the CPU has 9 spare clock cycles.  Hence, this write
could take as long as 17 cycles before it would impact the loop time.&lt;/p&gt;

&lt;p&gt;Let’s now turn our attention to reading these values back.  As before, we’re
going to read three values, and then step and compare against our three
pseudorandom values.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-c&quot; data-lang=&quot;c&quot;&gt;	&lt;span class=&quot;k&quot;&gt;while&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mptr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;register&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;unsigned&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;b&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

		&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;];&lt;/span&gt;

		&lt;span class=&quot;n&quot;&gt;STEP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TAPS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;FAIL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;break&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

		&lt;span class=&quot;n&quot;&gt;STEP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TAPS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;b&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;FAIL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;break&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

		&lt;span class=&quot;n&quot;&gt;STEP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TAPS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;FAIL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;break&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

		&lt;span class=&quot;n&quot;&gt;mptr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Curiously, GCC broke our three requests up into a set of two, followed by a
separate third request.  This will break the 
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s pipelined memory access
into two accesses, although this is still within what “acceptable” assembly
might look like.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;loop:
	ADD        $12,R2       ; mptr += 3
	| CMP      R6,R2	; while(mptr+3 &amp;lt; end)
	BNC        end_of_loop
	LW         -8(R2),R4	; b = mptr[1]
	LW         -4(R2),R0	; c = mptr[2]
	LSR        $1,R1	; STEP(fill, TAPS);
	XOR.C      R3,R1
	LW         -12(R2),R11	; a = mptr[0]
	CMP        R1,R11	; if (a != (int)fill)
	BNZ        trap
	LSR        $1,R1	; STEP(fill, TAPS);
	XOR.C      R3,R1
	CMP        R1,R4	; if (b != (int)fill)
	BNZ        trap
	LSR        $1,R1	; STEP(fill, TAPS);
	XOR.C      R3,R1
	CMP        R1,R0	; if (c == (int)fill)
	BZ         loop		;	go back and loop again
trap:&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;One lesson learned is that the if statements should include not only the
TRAP/FAIL instruction, but also a break statement.  If you include the break,
then GCC will place the TRAP outside of the loop, and we’ll have only one
pipeline stall each time we go around the loop.  If you don’t, then the CPU
will have to deal with multiple branches clearing the pipeline on every pass.&lt;/p&gt;

&lt;p&gt;From a pipeline standpoint, the pipeline will look like Fig. 27.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 27. Triplet read pipeline&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;/img/migbench/cp4.svg&quot;&gt;&lt;img src=&quot;/img/migbench/cp4.svg&quot; width=&quot;720&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;In this figure, we show two passes through the loop.  The first pass shows
a complete cache miss and subsequent memory access, whereas the second one
can exploit the fact that the data is in the cache.&lt;/p&gt;

&lt;p&gt;As before, in the case of a cache miss, the loop time will be dominated by
the memory read time.  Any delay in memory reading will slow our loop down
directly and immediately, but only once per cache miss.  The difference here
is that our probability of a cache miss has now gone from one in 128 to
three in 128.&lt;/p&gt;

&lt;p&gt;On a good day, the MIG’s access time looks like Fig. 28 below.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 28. Triplet word access, data cache miss, MIG controller&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/mig-cp4.png&quot; width=&quot;720&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;In this case, it costs us 35 clocks to read from the SDRAM in the case of a
cache miss, and 24 clocks with no miss.  Were this always the case, we might
expect 25 clocks per loop.  Instead, we see an average of 27 clocks per loop,
suggesting that the refresh and other cycles are slowing us down further.&lt;/p&gt;

&lt;p&gt;Likewise, a cache miss when using the
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt; looks like
Fig. 29.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 29. Triplet word access, data cache miss, UberDDR3 controller&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/uber2-cp4.png&quot; width=&quot;720&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;In this case, it typically costs 17 clocks on a cache miss.  On a rare occasion,
the read might hit a REFRESH cycle, where it might cost 36 clocks or so.
Hence we might expect 24.6 cycles through the loop, which is very close to the
24.7 cycles measured.&lt;/p&gt;

&lt;h3 id=&quot;sequential-lrs-triplet-character-access&quot;&gt;Sequential LRS Triplet Character Access&lt;/h3&gt;

&lt;p&gt;The third CPU test I designed is a repeat of the last one, save that the
CPU made &lt;em&gt;character&lt;/em&gt; (i.e. 8-bit octet) accesses instead of 32-bit &lt;em&gt;word&lt;/em&gt;
accesses.&lt;/p&gt;

&lt;p&gt;In hindsight, this test isn’t very revealing.  The statistics are roughly the
same as the triplet word access: memory accesses to a given row aren’t faster
(or slower) when accessing 8 bits at a time instead of 32.  Instead, three
8-bit accesses take just as much time as three 32-bit accesses.  The only real
difference here is that the probability of a read cache miss is now 3 bytes in
a 512-byte cache line, rather than the previous 3 in 128.&lt;/p&gt;

&lt;h3 id=&quot;random-word-access&quot;&gt;Random word access&lt;/h3&gt;

&lt;p&gt;A more interesting test is the random word access test.  In this case,
we’re going to generate both (pseudo)random data and a (pseudo)random address.
We’ll then store our random data at the random address, and only stop once
the random address sequence repeats.&lt;/p&gt;

&lt;p&gt;I’m expecting a couple of differences here.  First, I’m expecting that almost all
of the data cache accesses will go directly to memory.  There should be no
(or at least very few) cache hits.  Second, I’m going to expect that almost
all of the memory requests should require loading a new row.  In this case,
the MIG controller should have a bit of an advantage, since it will
automatically precharge a row as soon as it recognizes it’s not being used.&lt;/p&gt;

&lt;p&gt;Writing to memory from C will look simple enough:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-c&quot; data-lang=&quot;c&quot;&gt;	&lt;span class=&quot;n&quot;&gt;afill&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;initial_afill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;do&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;STEP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;afill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TAPS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;STEP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dfill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TAPS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;afill&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;~&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;amsk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;mptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;afill&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;amsk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dfill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;while&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;afill&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;initial_afill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;GCC then turns this into the following assembly.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;loop:
	LSR        $1,R1	; STEP(afill, TAPS)
	XOR.C      R8,R1
	LSR        $1,R2	; STEP(dfill, TAPS)
	XOR.C      R8,R2
	MOV        R1,R3        ; if (afill &amp;amp; (~amsk)) == 0
	| AND      R12,R3
	BNZ        checkloop
	; Calculate the memory address
	MOV        R11,R3        | AND        R1,R3
	LSL        $2,R3
	MOV        R5,R9         | ADD        R3,R9
	SW         R2,(R9)	; mptr[afill &amp;amp; amsk] = dfill
checkloop:
	CMP        R1,R4
	BNZ        loop&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;There’s a couple of issues here in this test.  First, we have a mid-loop
branch that we will sometimes take, and sometimes not.  Second, we now have
to calculate an address.  This requires multiplying the pseudorandom
value by four (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LSL $2,R3&lt;/code&gt;), and adding the result to the base memory address.&lt;/p&gt;

&lt;p&gt;I’ve drawn out a notional pipeline for what this might look like in Fig. 30.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 30. Random write access&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;/img/migbench/cp7.svg&quot;&gt;&lt;img src=&quot;/img/migbench/cp7.svg&quot; width=&quot;720&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Notice that this notional pipeline includes a stall for crossing instruction
cache line boundaries between the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;XOR&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LSR&lt;/code&gt; instructions.&lt;/p&gt;

&lt;p&gt;From the MIG’s standpoint, a typical random write capture looks like Fig. 31
below.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 31. Random write access, MIG controller&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/mig-cp7.png&quot; width=&quot;720&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;As before, this is a 4 clock access.  The MIG is simply returning its results
before actually performing the write.&lt;/p&gt;

&lt;p&gt;A similar trace, drawn from the &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt; can be seen in Fig. 32.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 32. Random write access, UberDDR3 controller&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/uber2-cp7.png&quot; width=&quot;720&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;In this case, it takes 8 clocks to access memory and perform the write.&lt;/p&gt;

&lt;p&gt;However, neither write time is sufficient to significantly impact our time
through the loop.  Instead, it’s the rare REFRESH cycles that impact the write,
but again these impacts are only fractions of a clock per loop.  Still, that
means that the
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt; takes seven
tenths of a cycle longer per loop than the MIG controller.&lt;/p&gt;

&lt;p&gt;Reads, on the other hand, are more interesting.  Why?  Because read instructions
must wait for their result before executing the next instruction, and the
cache will have a negative effect if we’re always suffering from cache misses.&lt;/p&gt;

&lt;p&gt;Here’s the C code for a read.  Note that we now have two branches, mid loop.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-c&quot; data-lang=&quot;c&quot;&gt;	&lt;span class=&quot;k&quot;&gt;do&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;STEP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;afill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TAPS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;STEP&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dfill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TAPS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;afill&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;~&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;amsk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
			&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;afill&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;amsk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dfill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
				&lt;span class=&quot;n&quot;&gt;FAIL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
				&lt;span class=&quot;k&quot;&gt;break&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
			&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
		&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
	&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;while&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;afill&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;initial_afill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;GCC produces the following assembly for us.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;loop:
	LSR        $1,R2	; STEP(afill, TAPS)
	XOR.C      R4,R2
	LSR        $1,R0	; STEP(dfill, TAPS)
	XOR.C      R4,R0
	MOV        R2,R3
	| AND      R12,R3
	BNZ        skip_data_check
	MOV        R11,R3	; Calculate afill &amp;amp; amsk
	| AND      R2,R3
	LSL        $2,R3	; Turn this into an address offset
	MOV        R5,R1
	| ADD      R3,R1	; ... and add that to mptr
	LW         (R1),R3	; Read mptr[afill&amp;amp;amsk]
	| CMP      R0,R3	; Compare with dfill, the expected data
	BNZ        trap		; Jump to the FAIL/break if nonzero
skip_data_check:
	LW         12(SP),R1	; Load (from the stack) the initial address
	| CMP      R2,R1	; Check our loop condition
	BNZ        loop
	; ...
trap:&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;There’s a couple of things to note here.  First, there’s not one but &lt;em&gt;two&lt;/em&gt;
memory operations here.  Why?  GCC couldn’t find enough registers to hold
all of our values, and so it spilled the initial address onto the stack.
Nominally, this wouldn’t be an issue.  However, it becomes an issue when
you have a data cache &lt;em&gt;collision&lt;/em&gt;, where both the stack and the SDRAM memory
require access to the same cache line.  These cases then require two cache
lookups per loop.  One lookup will be of SDRAM, the other (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LW 12(SP),R1&lt;/code&gt;)
of block RAM where the stack is being kept.  (A 2-way or higher data cache
may well have mitigated this effect, allowing the stack to stay in the cache
longer.)&lt;/p&gt;
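
&lt;p&gt;To make the collision concrete, here’s a minimal sketch of how a direct-mapped (1-way) cache picks a line.  I’m assuming a direct-mapped cache because of the remark about 2-way caches above.  The function names are hypothetical, and the line size and line count used with them are placeholders rather than the ZipCPU’s actual data cache configuration.  Two addresses collide when they belong to different lines of memory that map to the same cache index.&lt;/p&gt;

```c
/* Hypothetical direct-mapped cache geometry: index = line number mod lines */
static unsigned cache_index(unsigned addr, unsigned line_bytes,
		unsigned num_lines)
{
	return (addr / line_bytes) % num_lines;
}

/* Returns 1 if two addresses would evict each other's cache line */
static int cache_collides(unsigned a, unsigned b, unsigned line_bytes,
		unsigned num_lines)
{
	if (a / line_bytes == b / line_bytes)
		return 0;	/* Same memory line: both can hit together */
	return cache_index(a, line_bytes, num_lines)
		== cache_index(b, line_bytes, num_lines);
}
```

&lt;p&gt;With 32-byte lines and 128 lines (again, made-up numbers), a stack word and an SDRAM word exactly 4096 bytes apart would share an index, and so each access would evict the other’s line.&lt;/p&gt;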

&lt;p&gt;Second, notice how we now have a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BNZ&lt;/code&gt; (branch if not zero, or if not equal).
This is what we get for adding the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;break&lt;/code&gt; statement to the failure path of
the loop–letting GCC know that this if condition isn’t really part of our
loop.  As a result, we only have one branch–and that only if our pseudorandom
address goes out of bounds.&lt;/p&gt;

&lt;p&gt;This leaves us with a pipeline looking like Fig. 33.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 33. Random read access pipeline&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;/img/migbench/cp8.svg&quot;&gt;&lt;img src=&quot;/img/migbench/cp8.svg&quot; width=&quot;720&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;A capture of these random reads, when using the MIG controller, looks like
Fig. 34 below.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 34. Random read access, MIG controller&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/mig-cp8.png&quot; width=&quot;720&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;As before, we’re looking at 35 clocks to read 8 words.  Nominally, we might
argue this to be a latency of 27 cycles plus overhead, but … it’s not.
One cycle, after the MIG starts returning data, is empty.  This means we have a
latency of 26 cycles, and a single clock loss of throughput on every
transaction.&lt;/p&gt;

&lt;p&gt;Judging from the &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt;
trace in Fig. 35, the &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3
controller&lt;/a&gt; doesn’t have this problem.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 35. Random read access, UberDDR3 controller&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/uber2-cp8.png&quot; width=&quot;720&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Instead, it takes 17 clocks to access 8 words, and there are no unexpected losses
in the return.&lt;/p&gt;

&lt;p&gt;As a result, the MIG controller requires 72 clocks per loop, whereas the
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt; requires
55 clocks per loop.&lt;/p&gt;
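
&lt;p&gt;The latency figures quoted for these two read traces come from simple subtraction.  As a rough sketch, assuming one word is returned per clock once data starts flowing (the function name and the idle-clock parameter are hypothetical, for illustration only):&lt;/p&gt;

```c
/* Clocks spent waiting, given the total access time, the number of words
 * returned (assumed one per clock), and any idle clocks mid-return */
static int read_latency(int total_clocks, int words, int idle_clocks)
{
	return total_clocks - words - idle_clocks;
}
```

&lt;p&gt;Here, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;read_latency(35, 8, 1)&lt;/code&gt; gives the MIG’s 26 cycles.  Applying the same arithmetic to the UberDDR3 trace, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;read_latency(17, 8, 0)&lt;/code&gt; would suggest about 9 cycles, though that number is inferred from the 17 clock total rather than read off the trace.&lt;/p&gt;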

&lt;p&gt;My conclusion from this test is that the MIG remains faster when writing, but
the difference is fairly irrelevant because the CPU continues executing
instructions concurrently.  In the case of reads, on the other hand, the
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt; is much
faster.  This is the conclusion one might expect given that the &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3
controller&lt;/a&gt; has much less
latency than the MIG.&lt;/p&gt;

&lt;h3 id=&quot;memcpy&quot;&gt;&lt;a href=&quot;https://en.cppreference.com/w/cpp/string/byte/memcpy&quot;&gt;MEMCPY&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Let’s now leave our contrived tests, and look at some C library functions.
For reference, the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; uses the
&lt;a href=&quot;https://sourceware.org/newlib/&quot;&gt;NewLib&lt;/a&gt; C library.&lt;/p&gt;

&lt;p&gt;Our first test will be a
&lt;a href=&quot;https://en.cppreference.com/w/cpp/string/byte/memcpy&quot;&gt;memcpy()&lt;/a&gt; test.
Specifically, we’ll copy the first half of our memory to the second half.
This will maximize the size of the memory copied.&lt;/p&gt;

&lt;p&gt;In addition, our
&lt;a href=&quot;https://en.cppreference.com/w/cpp/string/byte/memcpy&quot;&gt;memcpy()&lt;/a&gt; requests
will be &lt;em&gt;aligned&lt;/em&gt;.  This will allow the library routine to use 32b word
copies instead of byte copies.  It’s faster and simpler, but there is
some required magic taking place in the library to get to this point.&lt;/p&gt;

&lt;p&gt;Our test choice also has an unexpected consequence.  Specifically, the
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3&lt;/a&gt;’s sequential memory
optimizations will all break at the bank level, since we’ll be reading from
one address, and writing to another address &lt;em&gt;on the same bank&lt;/em&gt;.  This will force
the &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt; to
precharge a row and activate another on every bank read access.  (It’s not
quite every access, since we do have the data cache.)&lt;/p&gt;

&lt;p&gt;With a little digging, the relevant loop within the
&lt;a href=&quot;https://en.cppreference.com/w/cpp/string/byte/memcpy&quot;&gt;memcpy()&lt;/a&gt; compiles into
the following &lt;a href=&quot;/zipcpu/2018/01/01/zipcpu-isa.html&quot;&gt;assembly&lt;/a&gt;:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;loop:
	LW         (R5),R8	; Load two words from memory
	LW         4(R5),R9
	SW         R8,(R4)	; Store them
	SW         R9,$4(R4)
	LW         8(R5),R8	; Load the next two words
	LW         12(R5),R9
	SW         R8,$8(R4)	; Store those as well
	SW         R9,$12(R4)
	LW         16(R5),R8	; Load a third set of words
	LW         20(R5),R9
	SW         R8,$16(R4)	; Store the third set
	SW         R9,$20(R4)
	ADD        $32,R5        | ADD        $32,R4
	LW         -8(R5),R8	; Load a final 4th set of words
	LW         -4(R5),R9
	SW         R8,$-8(R4)	; ... and store them to memory
	SW         R9,$-4(R4)    | CMP        R4,R6
	BNZ        loop&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Note that all of the memory accesses are for two sequential words at a time.
This is due to the fact that both GCC and
&lt;a href=&quot;https://en.cppreference.com/w/cpp/string/byte/memcpy&quot;&gt;memcpy()&lt;/a&gt; believe the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; has native 64-bit instructions.
It doesn’t, but this is still a decent optimization.&lt;/p&gt;

&lt;p&gt;Second, note that GCC and
&lt;a href=&quot;https://sourceware.org/newlib/&quot;&gt;NewLib&lt;/a&gt; have succeeded in unrolling this loop,
so that four 64b words are read and written per loop.  (I’m not sure which of
GCC or &lt;a href=&quot;https://sourceware.org/newlib/&quot;&gt;NewLib&lt;/a&gt; is responsible for this
optimization, but it shouldn’t be too hard to look it up.)&lt;/p&gt;

&lt;p&gt;Third, note that the load-word instructions cannot start until the store-word
instructions prior complete.  This is to keep the CPU from hitting certain
memory access collisions.&lt;/p&gt;
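
&lt;p&gt;For readers who prefer C to assembly, the unrolled loop above behaves roughly like the sketch below.  This is my own illustration of the loop’s structure, not NewLib’s source: the function name is hypothetical, it assumes 32b &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unsigned&lt;/code&gt; words, a length that is a multiple of eight words, and buffers that don’t overlap.&lt;/p&gt;

```c
/* Hypothetical sketch of the unrolled copy loop: eight 32b words
 * (four 64b values) are copied on every pass.
 * Assumes nwords is a multiple of eight and the buffers do not overlap. */
void copy_unrolled(unsigned *dst, const unsigned *src, unsigned long nwords)
{
	while (nwords != 0) {
		dst[0] = src[0];	/* first 64b value, as two words */
		dst[1] = src[1];
		dst[2] = src[2];	/* second 64b value */
		dst[3] = src[3];
		dst[4] = src[4];	/* third 64b value */
		dst[5] = src[5];
		dst[6] = src[6];	/* fourth 64b value */
		dst[7] = src[7];
		src += 8;
		dst += 8;
		nwords -= 8;
	}
}
```

&lt;p&gt;Each pair of assignments corresponds to one load pair followed by one store pair in the assembly, which is why reads and writes alternate on the bus.&lt;/p&gt;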

&lt;!--
Can we predict how long this will take?

The load word instructions will miss the cache once every sixteen times through
this loop, costing `LATENCY+2/THROUGHPUT` clock cycles loss per miss, and three
cycles per hit.  The first and second store word instruction pairs will cost
`LATENCY+2/THROUGHPUT` each, since they cannot run concurrently with the memory
loads.  However, the third pair will require two fewer clocks, and the fourth
will require six fewer clocks (5 for the branch) because they can run
concurrently.

MIG: (1/16)(27.7+9+8/0.96)+1+(15/16)(4) + 4(5 + 2.8 + 2/0.9)-6
	= 56.7 clocks/loop
	// 35 cycle access
	// 99,125, when not in the cache
	// 55 cycles in cache
Uber2: (1/16)(10.8+9+8/0.9)+1+(15/16)(4) + 4(5 + 8.2 + 2/0.92)-6
	= 77.1 clock/loop
	// 57 cycles?
ACTUAL-MIG:  0x00ed081c clocks / 
ACTUAL-Uber: 0x0110e8e9 clocks / 
--&gt;

&lt;p&gt;Fig. 36 shows an example of how the MIG deals with this memory copy.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 36. MEMCPY, MIG Controller&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/mig-cp9.png&quot; width=&quot;720&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Highlighted in the trace is the 35 cycle read.&lt;/p&gt;

&lt;p&gt;However, you’ll also note that this trace is primarily dominated by write
requests.  This is due to the fact that the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
has a &lt;em&gt;write-through&lt;/em&gt; cache, so all writes go to the data bus–two words
at a time.  Because of the latency difference we’ve seen, these writes
can complete in 5 cycles total, or 14 cycles from one write to the next.&lt;/p&gt;

&lt;p&gt;Remember, the read requests cannot be issued until the write requests
have completed.  Hence, for any pair of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SW&lt;/code&gt; (store word) instructions followed
by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LW&lt;/code&gt; (load word) instructions, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LW&lt;/code&gt; instructions must wait for the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SW&lt;/code&gt; instructions to complete.  This write latency directly impacts that
wait time.  Hence, it takes 14 cycles from one write to the next.&lt;/p&gt;

&lt;p&gt;Also shown in Fig. 36 is a write when the SDRAM was busy.  These happen
periodically, when the MIG takes the SDRAM offline–most likely to refresh
some of its capacitors.  These cycles, while rare, tend to cost 71 clock
cycles to write two words.&lt;/p&gt;

&lt;p&gt;In the end, it took 55 cycles to read and write 8 words (32 bytes) when the
read data was in the cache, or 87 cycles otherwise.&lt;/p&gt;
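
&lt;p&gt;To put those two numbers on a common scale, here’s a small sketch that converts a loop cost into bytes moved per thousand clock cycles.  The helper name is hypothetical, and the only inputs are the 32 bytes and the 55 and 87 cycle counts quoted above.&lt;/p&gt;

```c
/* Bytes moved per 1000 clock cycles, rounded down.
 * A hypothetical helper, for illustration only. */
unsigned bytes_per_kilocycle(unsigned bytes, unsigned cycles)
{
	return (bytes * 1000u) / cycles;
}
```

&lt;p&gt;Calling this with 32 bytes and 55 cycles gives 581 bytes per thousand cycles, while 32 bytes and 87 cycles gives 367.&lt;/p&gt;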

&lt;p&gt;Fig. 37, on the other hand, shows a trace of the same test, only this time using
the &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt;.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 37. MEMCPY, UberDDR3 Controller&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/uber2-cp9.png&quot; width=&quot;720&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;As before, reads are faster.
The &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt;
can fill a cache line in 17 cycles, vs 35 for the MIG controller.&lt;/p&gt;

&lt;p&gt;However, what kills the
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt; in this test
is its write performance.  Because of the higher latency requirement of the
write controller, it typically takes 7 cycles for a two word write to complete.
This pushes the two word time from 14 cycles to 16 cycles.  As a result,
the &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt;
is 15% &lt;em&gt;slower&lt;/em&gt; than the MIG in this test.&lt;/p&gt;

&lt;h3 id=&quot;memcmp&quot;&gt;&lt;a href=&quot;https://en.cppreference.com/w/cpp/string/byte/memcmp&quot;&gt;MEMCMP&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Our final benchmark will be a memory comparison, using
&lt;a href=&quot;https://en.cppreference.com/w/cpp/string/byte/memcmp&quot;&gt;memcmp()&lt;/a&gt;.  Since we
just copied the lower half of our memory to the upper half using
&lt;a href=&quot;https://en.cppreference.com/w/cpp/string/byte/memcpy&quot;&gt;memcpy()&lt;/a&gt; in our last
test, we’re now set up for a second test where we verify that the memory
was properly copied.&lt;/p&gt;

&lt;p&gt;Our C code is very simple.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-c&quot; data-lang=&quot;c&quot;&gt;	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;memcmp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mem&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lnw&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mem&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lnw&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;/&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;FAIL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Everything taking place, however, lands within the
&lt;a href=&quot;https://en.cppreference.com/w/cpp/string/byte/memcmp&quot;&gt;memcmp()&lt;/a&gt; library call.&lt;/p&gt;

&lt;p&gt;Internally, we spend our time operating on the following loop over and over
again:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt;loop:
	LW         (R1),R4	; Read two words from the left hand side
	LW         4(R1),R5
	LW         (R2),R6	; Read two words from the right hand side
	LW         4(R2),R7
	CMP        R6,R4	; Compare left and right hand words
	CMP.Z      R7,R5
	BNZ        found_difference
	ADD        $8,R1         | ADD        $8,R2	; Increment PTRs
	ADD        $-8,R3        | CMP        $8,R3	; End-of-Loop chk
	BNC        loop&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;As with
&lt;a href=&quot;https://en.cppreference.com/w/cpp/string/byte/memcpy&quot;&gt;memcpy()&lt;/a&gt;, the
library is trying to exploit the 64b values that the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; supports–albeit not natively.
Hence, each 64b read is turned into two adjacent reads, and the comparison
is likewise turned into a pair of comparisons, where the second comparison
is only accomplished if the first comparison is zero.  On any difference,
&lt;a href=&quot;https://en.cppreference.com/w/cpp/string/byte/memcmp&quot;&gt;memcmp()&lt;/a&gt; breaks out
of the loop and traps.  Things are working well, however, so there are
no differences, and so the CPU stays within the loop until it finishes.&lt;/p&gt;
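
&lt;p&gt;In C terms, the loop above acts something like the following sketch.  It is not NewLib’s implementation: the function name is hypothetical, it assumes 32b &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unsigned&lt;/code&gt; words and a byte count that is a multiple of eight, and it only reports &lt;em&gt;whether&lt;/em&gt; the buffers differ rather than the signed result a real memcmp() returns.&lt;/p&gt;

```c
/* Hypothetical sketch of the comparison loop.  Each pass compares one
 * 64b value as two 32b words.  The second word is only examined when
 * the first pair matched, just like the conditional compare above.
 * Returns 1 on the first difference found, 0 if the buffers match. */
int words_differ(const unsigned *lhs, const unsigned *rhs,
		unsigned long nbytes)
{
	while (nbytes >= 8) {
		if (lhs[0] != rhs[0] || lhs[1] != rhs[1])
			return 1;	/* found_difference */
		lhs += 2;
		rhs += 2;
		nbytes -= 8;
	}
	return 0;
}
```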

&lt;p&gt;Also, like the
&lt;a href=&quot;https://en.cppreference.com/w/cpp/string/byte/memcpy&quot;&gt;memcpy()&lt;/a&gt; test, jumping
across a large power of two divide will likely break the bank machine
optimizations used by the
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Enough predictions, let’s see some results.&lt;/p&gt;

&lt;p&gt;Fig. 38 shows an example loop through the MIG Controller.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 38. MEMCMP, MIG Controller&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/mig-cpa.png&quot; width=&quot;720&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;One loop, measured between the markers, takes 106 clocks.&lt;/p&gt;

&lt;p&gt;Much to my surprise, when I dug into this test I discovered that &lt;em&gt;every&lt;/em&gt; memory
access resulted in a cache miss.  The reason is simple: the two memories
are separated by a power of two amount, yet greater than the cache line
size.  This means that the two pieces of memory, the “left hand” and “right
hand” sides, both use the same cache tags.  Therefore, they are both competing
for the same cache line.  (A 2-way cache may have mitigated this reality, but
the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; currently has only one-way
caches.)&lt;/p&gt;
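
&lt;p&gt;To see why the two buffers collide, here’s a minimal sketch of how a one-way (direct-mapped) cache might choose a line.  The geometry (32-byte lines, 128 lines) and the function names are assumptions picked for illustration, not the ZipCPU’s actual configuration.&lt;/p&gt;

```c
/* Hypothetical cache geometry, for illustration only */
enum { LINE_BYTES = 32, CACHE_LINES = 128 };

/* Which cache line would this byte address land in? */
unsigned cache_index(unsigned long addr)
{
	return (unsigned)((addr / LINE_BYTES) % CACHE_LINES);
}

/* The tag distinguishes addresses that share a cache line */
unsigned long cache_tag(unsigned long addr)
{
	return addr / ((unsigned long)LINE_BYTES * CACHE_LINES);
}

/* Nonzero if the two addresses compete for the same line, so that
 * loading one evicts the other */
int lines_conflict(unsigned long a, unsigned long b)
{
	if (cache_index(a) != cache_index(b))
		return 0;
	return cache_tag(a) != cache_tag(b);
}
```

&lt;p&gt;Under these assumptions, any two addresses separated by a multiple of the cache size (here 4096 bytes) share an index but not a tag.  A power of two separation at least as large as the cache is always such a multiple, so every alternating left/right read evicts the line the other side just loaded.&lt;/p&gt;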

&lt;p&gt;Fig. 39 shows the comparable loop when using the
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt;.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 39. MEMCMP, UberDDR3 Controller&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/uber2-cpa.png&quot; width=&quot;720&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;In this case, the
&lt;a href=&quot;https://en.cppreference.com/w/cpp/string/byte/memcmp&quot;&gt;memcmp()&lt;/a&gt;
uses only 74 clocks per loop–much less than the 106 used by the MIG.&lt;/p&gt;

&lt;p&gt;Something else to note is that if you zoom out from the trace in Fig. 38, you
can see the MIG’s refresh cycles.  Specifically, every 51.8us, there’s a
noticeable hiccup in the reads, as shown in Fig. 40.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 40. MEMCMP, MIG Controller Refresh timing&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/mig-cpa-refresh.png&quot; width=&quot;720&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;The same refresh cycles are just as easy to see, if not easier, in the
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt;’s trace
if you zoom out, as shown in Fig. 41.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 41. MEMCMP, UberDDR3 Controller Refresh timing&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/migbench/uber2-cpa-refresh.png&quot; width=&quot;720&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;This might explain why the MIG gets 96% throughput, whereas the
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt; only gets
roughly 90% throughput: the MIG doesn’t refresh nearly as often.&lt;/p&gt;

&lt;p&gt;Still, when you put these numbers together, overall the
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt;
is 30% &lt;em&gt;faster&lt;/em&gt; than the MIG when running the MEMCMP test.&lt;/p&gt;

&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;So what conclusion can we draw?  Is the
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt;
faster, better, and cheaper than the MIG controller?  Or would it make more
sense to stick with the MIG?&lt;/p&gt;

&lt;p&gt;As with almost all engineering, the answer is: it depends.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;The &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt; is
clearly &lt;em&gt;cheaper&lt;/em&gt; than the MIG controller, since it uses 48% less area.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Reading is much faster when using the
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt;,
primarily due to its lower latency of 10.8 clocks versus the MIG’s 27.7 clocks
(on average).  This lower latency is only partially explained by the
MIG’s need to process and decompose AXI bursts.  It’s not clear what causes the
rest of the latency, or why it ends up so slow.&lt;/p&gt;

    &lt;p&gt;At the same time, this read performance improvement can often be hidden by
a good cache implementation.  This only works, though, when accessing
memory from a CPU.  Other types of memory access, such as DMA reading or
video framebuffer reading won’t likely have the luxury of hiding the
memory performance, since they tend to read large consecutive areas of
memory at once, rather than accessing random memory locations.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Writing is faster when using the MIG, primarily due to the fact that it
acknowledges any write request (nearly) immediately.&lt;/p&gt;

    &lt;p&gt;This should be an easy issue to fix.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt; might
increase its throughput to match the MIG, were it to use a different
refresh schedule.&lt;/p&gt;

    &lt;p&gt;I would certainly recommend Angelo look into this.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;I really need to implement &lt;a href=&quot;/zipcpu/2025/03/29/pfwrap.html&quot;&gt;WRAP
addressing&lt;/a&gt; for my &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v&quot;&gt;data
cache&lt;/a&gt;.
I might’ve done so for this article, once I realized how valuable it would
be, but then I realized I’d need to go and re-collect all of the data
samples I had, and re-draw all of the pipeline diagrams.  Instead, I’ll
just push this article out first and then take another look at it.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The &lt;a href=&quot;https://en.cppreference.com/w/cpp/string/byte/memcmp&quot;&gt;memcmp()&lt;/a&gt; test
also makes a strong argument for having at least a 2-way cache
implementation.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Given that the &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 controller&lt;/a&gt;
is still somewhat new, I think we can all expect more and better things from
it as it matures.&lt;/p&gt;
&lt;hr /&gt;&lt;p&gt;&lt;em&gt;For he looketh to the ends of the earth, and seeth under the whole heaven; to make the weight for the winds; and he weigheth the waters by measure.  Job 28:24-25&lt;/em&gt;</description>
        <pubDate>Wed, 28 May 2025 00:00:00 -0400</pubDate>
        <link>https://zipcpu.com/zipcpu/2025/05/28/memtest.html</link>
        <guid isPermaLink="true">https://zipcpu.com/zipcpu/2025/05/28/memtest.html</guid>
        
        
        <category>zipcpu</category>
        
      </item>
    
      <item>
        <title>Wrap addressing</title>
        <description>&lt;p&gt;Welcome to the &lt;em&gt;ZipCPU&lt;/em&gt; blog.  I started it years ago after building my own
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;soft core CPU, the ZipCPU&lt;/a&gt;, and
dedicated this blog to helping individuals stay out of &lt;a href=&quot;/fpga-hell.html&quot;&gt;FPGA
Hell&lt;/a&gt;.  I then transitioned from working on
the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;,
to building &lt;a href=&quot;https://github.com/ZipCPU/wb2axip&quot;&gt;bus components that might be used by every project–crossbars,
bridges, DMAs&lt;/a&gt; and such.  Since that time,
my time is primarily spent not on the CPU, but rather its peripherals.  This
last year, for example, has seen work on several memory controllers, to
include both &lt;a href=&quot;https://www.arasan.com/product/xspi-nor-ip/&quot;&gt;NOR&lt;/a&gt; and
&lt;a href=&quot;https://www.arasan.com/products/nand-flash/&quot;&gt;NAND&lt;/a&gt; flash controllers, an
&lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SD Card(SDIO)/eMMC controller&lt;/a&gt;, and (now)
&lt;a href=&quot;https://github.com/ZipCPU/wbsata&quot;&gt;a SATA controller&lt;/a&gt;.  I’ve also had the
opportunity to work on &lt;a href=&quot;https://github.com/ZipCPU/eth10g&quot;&gt;high speed
networking&lt;/a&gt;, video, and even SONAR
applications.  All of this work is made easier by having both my own &lt;a href=&quot;/about/zipcpu.html&quot;&gt;soft-core
CPU&lt;/a&gt;, together with &lt;a href=&quot;https://github.com/ZipCPU/wb2axip&quot;&gt;bus interconnect
components&lt;/a&gt;, that &lt;a href=&quot;/zipcpu/2019/02/04/debugging-that-cpu.html&quot;&gt;I’m not afraid to dig
into to debug if
necessary&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With all of these distractions, it’s nice every now and then to come back to
the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;One of my current projects requires that I benchmark AMD(Xilinx)’s
DDR3 SDRAM MIG controller against the open source &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3
controller&lt;/a&gt;.  The performance
differences are dramatic and very significant.  My current (draft) article
discussing these results works through a series of CPU and DMA based tests.
For each test, the article describes first the C code for the test, then the
assembly for the critical section, then a diagram of the CPU’s
pipeline–reconstructed from simulation traces, and then finally traces
showing the differences between the two controllers.&lt;/p&gt;

&lt;p&gt;All of that led me to this trace from the data cache, shown in Fig. 1 below.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 1. ZipCPU Data Cache Miss&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;/img/migbench/cp2.svg&quot;&gt;&lt;img src=&quot;/img/migbench/cp2.svg&quot; width=&quot;720&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;For quick reference, the top line is the clock.  The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JMP&lt;/code&gt; line beneath it is
the signal from the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;’s core to the
&lt;a href=&quot;/zipcpu/2017/11/18/wb-prefetch.html&quot;&gt;instruction fetch&lt;/a&gt; that
the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;
needs to branch.  The &lt;a href=&quot;/zipcpu/2017/08/23/cpu-pipeline.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PF&lt;/code&gt;
line&lt;/a&gt; shows the output
of the prefetch (cache), and whether an instruction is available for the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt; to consume and if so which one.
The &lt;a href=&quot;/zipcpu/2017/08/23/cpu-pipeline.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DCD&lt;/code&gt; line shows the output of the instruction
decoder&lt;/a&gt;.
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OP&lt;/code&gt; is the output of the &lt;a href=&quot;/zipcpu/2017/08/23/cpu-pipeline.html&quot;&gt;read operands pipeline
stage&lt;/a&gt;, and &lt;a href=&quot;/zipcpu/2017/08/23/cpu-pipeline.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WB&lt;/code&gt; is
the writeback stage&lt;/a&gt;.
The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CYC&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;STB&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ACK&lt;/code&gt; lines are a subset of the &lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone
bus signaling&lt;/a&gt;
used to communicate with memory.  First there’s the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Zip-*&lt;/code&gt; version of these
signals, showing them coming out of the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;, and then the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SDRAM-*&lt;/code&gt; signals
coming from the &lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;crossbar&lt;/a&gt;
showing these signals actually going to the memory controller itself.&lt;/p&gt;

&lt;p&gt;At issue is how long it takes the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;
to respond to a cache miss.  Notice how it takes the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt; 3 clock cycles from receiving an
&lt;a href=&quot;/zipcpu/2018/01/01/zipcpu-isa.html&quot;&gt;LW (load word)
instruction&lt;/a&gt; from the
read operands stage until when the
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v&quot;&gt;data cache&lt;/a&gt;
initiates a bus request, another 3 cycles before the request can make it to
the SDRAM controller, one cycle to return, and another 5 cycles from the
completion of that request before the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt; can continue.  That’s 11 clock
cycles on every &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v&quot;&gt;data
cache&lt;/a&gt;
miss above and beyond the cost of the memory access itself.&lt;/p&gt;

&lt;p&gt;Ouch.&lt;/p&gt;

&lt;p&gt;When it comes to raw performance, every cycle counts.  Can we do better?&lt;/p&gt;

&lt;p&gt;Yes, we can.  Let’s talk about &lt;em&gt;wrap&lt;/em&gt; addressing today.&lt;/p&gt;

&lt;p&gt;That said, I’d like to focus this article on saving a couple clock cycles in
the &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v&quot;&gt;&lt;em&gt;instruction&lt;/em&gt;
cache&lt;/a&gt; rather
than the &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v&quot;&gt;&lt;em&gt;data&lt;/em&gt;
cache&lt;/a&gt; shown
in my example.  Why?  For the simple practical reason that the &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v&quot;&gt;&lt;em&gt;instruction&lt;/em&gt;
cache&lt;/a&gt;
has been easier to update and get working–although I have yet to post the
updates.  My &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v&quot;&gt;data
cache&lt;/a&gt;
upgrades to date remain a (broken) work in progress.  Both, however, can be
motivated by the diagram in Fig. 1 above.&lt;/p&gt;

&lt;h2 id=&quot;wrap-addressing&quot;&gt;Wrap Addressing&lt;/h2&gt;

&lt;p&gt;What might we do to improve the performance of the trace in Fig. 1?&lt;/p&gt;

&lt;p&gt;The first thing we might do is speed up how long it takes to recognize that
a particular value is not in the cache.  There’s only so much that can be
done here, however, since the cache tag memory is &lt;em&gt;clocked&lt;/em&gt;.  As a result, it
will always take a clock cycle to look up the cache tag for any new request,
and another clock cycle to know it’s not the right tag, and then a third
clock cycle to activate the bus.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;crossbar&lt;/a&gt; is separate
from the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;, and its timing is
dominated by the need for a clock rate that matches the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 memory controller&lt;/a&gt;
is a separate product from the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;,
so its performance is independent from the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt; itself.&lt;/p&gt;

&lt;p&gt;How about the return?  Once a value has been returned from memory to the
cache, it then takes another clock cycle to shift the value into place for
the CPU, so there’s not much to be done there … or is there?&lt;/p&gt;

&lt;p&gt;There are two optimizations that can be made on this return path.  The first
is that we can take the value directly from the bus and return it to the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;–rather than waiting for the
value to first be written to and then read back from the cache’s memory.
The second optimization is &lt;em&gt;wrap&lt;/em&gt; addressing.  We’ll discuss both of these
optimizations today.&lt;/p&gt;

&lt;p&gt;First, though, let me introduce the concept of a &lt;em&gt;cache line&lt;/em&gt;.  A &lt;em&gt;cache line&lt;/em&gt;
is the minimum amount of memory that can be read into the cache at a time.
The cache itself is composed of many of these cache lines.  Upon a cache miss,
the cache controller will always go and read a whole cache line.&lt;/p&gt;

&lt;p&gt;A long discussion can be had regarding how big a cache line can or should be.
For me, I tend to follow the results published by &lt;a href=&quot;https://www.amazon.com/Computer-Architecture-Quantitative-Approach-Kaufmann/dp/0443154066/&quot;&gt;Hennessy and
Patterson&lt;/a&gt;,
and keep my cache lines (roughly) 8 words in length.  For simplicity, the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s
caches are all one-way caches, but, yes, a significant performance improvement
can be gained by upgrading to two or even four-way caches–but that’s a story
for another day.&lt;/p&gt;

&lt;p&gt;Now that you know what a cache line is, notice how the cache miss in Fig. 1
results in reading an entire cache line.  As we’ll discuss in the memory
performance benchmarking article (still to be finished), memory performance
can be quantified by latency and throughput.  Caches can get an advantage
over &lt;a href=&quot;/zipcpu/2021/09/30/axiops.html&quot;&gt;single-beat read or write
instructions&lt;/a&gt; by
reading more than one beat at a time, and so increasing the line size improves
efficiency.  Increasing the line size has two costs, however:
1) it increases the amount of time the bus is busy handling any request
(remember all requests are for a full cache line), and 2) it increases the
risk that you spend a lot of time handling requests for instructions or data
you’ll never use or need.&lt;/p&gt;

&lt;p&gt;Now we can discuss wrap addressing.  Wrap addressing is a means of reading
the cache line out of order.  Without wrap addressing, we might read the
words in the cache line in order from 0-7.  With wrap addressing, the cache
will specifically read the requested item from the cache line first, then
finish to the end of the line, then go back and get what was missing from
the beginning.  This way, as soon as the word that caused the cache miss in
the first place has been read, the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;
can be unblocked and continue whatever it needs to do next while the
cache controller finishes its read of the cache line.  The big difference is
that with wrap addressing the cache line is read in more of a priority fashion.
“Wrap addressing” is just the name given to this style of out of order
addressing.&lt;/p&gt;
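
&lt;p&gt;As a quick illustration of that ordering, here’s a minimal sketch.  It assumes an eight word cache line, and the function names are hypothetical; this models only the order in which words are requested, not the cache’s actual logic.&lt;/p&gt;

```c
enum { LINE_WORDS = 8 };	/* assumed cache line length, in words */

/* Fill order[] with the word indices of one cache line in wrap order,
 * starting from the word that caused the miss. */
void wrap_order(unsigned miss_word, unsigned order[LINE_WORDS])
{
	unsigned i;

	for (i = 0; i != LINE_WORDS; i++)
		order[i] = (miss_word + i) % LINE_WORDS;
}

/* Number of bus beats that pass before the missed word arrives */
unsigned beats_until_miss_word(unsigned miss_word, int use_wrap)
{
	return use_wrap ? 1 : (miss_word % LINE_WORDS) + 1;
}
```

&lt;p&gt;A miss on word 5, for example, reads the line in the order 5, 6, 7, 0, 1, 2, 3, 4.  Read in order from 0, the same miss would have waited six beats for its word instead of one.&lt;/p&gt;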

&lt;p&gt;That’s what it is.  Let’s now look at its impact.&lt;/p&gt;

&lt;h2 id=&quot;wrap-addressing-with-the-zipcpus-instruction-cache&quot;&gt;Wrap Addressing with the ZipCPU’s Instruction Cache&lt;/h2&gt;

&lt;p&gt;Some years ago, I added wrap addressing to the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/axiicache.v&quot;&gt;AXI
instruction&lt;/a&gt;
and &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/axidcache.v&quot;&gt;data
caches&lt;/a&gt;.  Up
until that time, I had poo-poo’d the benefit that might be had by using it.
The &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; was designed to be a “simple”
and “low-logic” CPU, and wrap addressing would just complicate things–or so
I judged.  Then I tried it.  At the time, I just needed &lt;em&gt;something&lt;/em&gt; that used
wrap addressing–the AXI bus functional model I had been given just wasn’t up to
the task, but the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; could issue
wrap addressing requests quite nicely.  In the process, I was surprised at how
much faster the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; ran when the
caches used wrap addressing.&lt;/p&gt;

&lt;p&gt;That experiment died, however, once the need was over.  The big reason for it
dying was simply that I don’t use AXI often.  Sure, the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; has AXI memory controllers, but
they only fit the CPU so well.  The AXI bus is little endian, and the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; is big endian, so the two aren’t
a natural fit.  There’s plenty of pain at the seams.  Further, adding wrap
addressing to my
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
memory controllers was simply work that wasn’t being paid for.  No, it doesn’t
help that the &lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
bus doesn’t really offer burst or wrap support, but I think you’ll find that
issue to be irrelevant to today’s discussion.&lt;/p&gt;

&lt;p&gt;As a result, &lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
wrap addressing for the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
languished until I was recently motivated by examining the MIG and
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3&lt;/a&gt; memory controller
benchmark results.  Indeed, I found myself a touch embarrassed at the performance
the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt; was delivering.&lt;/p&gt;

&lt;p&gt;For illustration, let’s look at the first several instructions of a basic
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; test program I use.  We’ll
break it into two portions.  Here are the first several instructions.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-asm&quot; data-lang=&quot;asm&quot;&gt;	; Clear all registers
	;   The &quot;|&quot; separates two instructions, both of which are
	;   packed into a single instruction word.
 4000000:	86 00 8e 00 	CLR        R0            | CLR        R1
 4000004:	96 00 9e 00 	CLR        R2            | CLR        R3
 4000008:	a6 00 ae 00 	CLR        R4            | CLR        R5
 400000c:	b6 00 be 00 	CLR        R6            | CLR        R7
 4000010:	c6 00 ce 00 	CLR        R8            | CLR        R9
 4000014:	d6 00 de 00 	CLR        R10           | CLR        R11
 4000018:	66 00 00 00 	CLR        R12
	; Set up the initial stack pointer
 400001c:	6a 00 00 10 	LDI        0x08000000,SP	; Top of stack
 4000020:	6a 40 00 00 
	; Guarantee we are in supervisor mode, and trap into supervisor
	; mode if not
 4000024:	76 00 00 00 	TRAP
	; Provide a set of initial values for all of the user registers 
 4000028:	7b 47 c0 1e 	MOV        $120+PC,uPC
 400002c:	03 44 00 00 	MOV        R0,uR0
 4000030:	0b 44 00 00 	MOV        R0,uR1
 4000034:	13 44 00 00 	MOV        R0,uR2
 4000038:	1b 44 00 00 	MOV        R0,uR3
 400003c:	23 44 00 00 	MOV        R0,uR4&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;These get us to the end of the first cache line and the beginning of the
second.  Take note that there have been no jumps or branches in this
assembly.  It’s just straightforward walking from one instruction to the
next through the test program.  (Yes, we’ll get to branches soon enough.)&lt;/p&gt;

&lt;p&gt;The instructions then continue loading the user register set with default
values.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-asm&quot; data-lang=&quot;asm&quot;&gt; 4000040:	2b 44 00 00 	MOV        R0,uR5
 4000044:	33 44 00 00 	MOV        R0,uR6
 4000048:	3b 44 00 00 	MOV        R0,uR7
 400004c:	43 44 00 00 	MOV        R0,uR8
 4000050:	4b 44 00 00 	MOV        R0,uR9
 4000054:	53 44 00 00 	MOV        R0,uR10
 4000058:	5b 44 00 00 	MOV        R0,uR11
 400005c:	63 44 00 00 	MOV        R0,uR12
 4000060:	6b 44 00 00 	MOV        R0,uSP
 4000064:	73 44 00 00 	MOV        R0,uCC
	; Finally, we call the bootloader function to load software into RAM
	; from flash if necessary (it isn't in this case), and to zero any
	; uninitialized global values
 4000068:	03 43 c0 02 	LJSR       @0x040000b4    // Bootloader
 400006c:	7c 87 c0 00 
 4000070:	04 00 00 b4 
	; Software continues, but the next section is outside the scope
	; of today's discussion.
	; ....&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;These end with a jump to subroutine instruction, followed by the beginning
of the “_bootloader” subroutine below.&lt;/p&gt;

&lt;p&gt;In this case, the cache line starts at address 0x04000080.  However, we don’t
start executing there in our example.  Instead, we start executing
partway through the cache line at the beginning of the bootloader
subroutine.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-asm&quot; data-lang=&quot;asm&quot;&gt;040000b4 &amp;lt;_bootloader&amp;gt;:
	; Our first step is to create a stack frame.  For this, we
	; subtract from the stack pointer, and then store any
	; registers we might clobber onto the stack.  As before,
	; the &quot;|&quot; separates two instructions, both of which are
	; packed into a single instruction word.
 40000b4:	e8 10 ad 00 	SUB        $16,SP        | SW         R5,(SP)
 40000b8:	b5 04 bd 08 	SW         R6,$4(SP)     | SW         R7,$8(SP)
 40000bc:	44 c7 40 0c 	SW         R8,$12(SP)
 40000c0:	0a 00 00 00 	LDI        0x00000004,R1
 40000c4:	0a 40 00 04 
 40000c8:	0c 00 00 04 	CMP        $4,R1
 40000cc:	78 88 01 0c 	BZ         @0x040001dc
 40000d0:	0a 00 00 00 	LDI        0x00000004,R1
 40000d4:	0a 40 00 04 
 40000d8:	0c 00 00 04 	CMP        $4,R1
 40000dc:	32 08 00 20 	LDI.Z      0x04000000,R6
	; ....&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Together, these two sets of instructions make an awesome example to see how
wrap addressing would work from an instruction fetch perspective.&lt;/p&gt;

&lt;p&gt;One of the things I like about this example is the fact that the test starts
with many sequential instructions and no jumps (branches).  This will help
provide us a baseline of how things work–before jumps start making things
complicated.&lt;/p&gt;

&lt;p&gt;For today’s discussion, our cache line size is 8 words of 64 bits each.
The &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s nominal instruction
size is 32 bits.  Therefore, each cache line will nominally contain 16
instructions.  Our first cache line, however, contains many clear (CLR)
instructions (really load-immediate 0 into register …), and two of these
instructions can be packed into a single 32-bit word.  This is shown above using
the “|” characters.  Fig. 2 shows how the &lt;a href=&quot;/zipcpu/2017/08/23/cpu-pipeline.html&quot;&gt;CPU
pipeline&lt;/a&gt;
works through these initial instructions–without wrap addressing.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 2. Starting the cache, without wrap addressing&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;/img/pfwrap/pf-startup.svg&quot;&gt;&lt;img src=&quot;/img/pfwrap/pf-startup.svg&quot; width=&quot;720&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Following the CPU reset, the
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v&quot;&gt;cache&lt;/a&gt;
starts with the JUMP flag set.  Following a jump, it takes us 4 clock cycles
to determine that the new address is not in the
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v&quot;&gt;cache&lt;/a&gt;
and to therefore start a bus cycle.&lt;/p&gt;

&lt;p&gt;This bus cycle is painful.  When using the MIG, it requires roughly
35 cycles (on a good day) to read all eight words.  When using the
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3&lt;/a&gt; controller,
it requires roughly 18 cycles.  Since the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
can nominally execute one instruction per cycle, this is a painful wait.&lt;/p&gt;

&lt;p&gt;Once the bus cycle completes, we take another two cycles to present the
instruction from the cache line that we just read to the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;.  The decoder
then takes two clock cycles with this instruction, since it contains two
instructions packed into a single word, and so forth.  From here on out,
instructions are passed to the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt; at one instruction word per clock
cycle–unless the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;
needs to take more clock cycles with them–as is the case of the &lt;a href=&quot;/zipcpu2018/01/01/zipcpu-isa.html&quot;&gt;compressed
instruction&lt;/a&gt;.&lt;/p&gt;
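
&lt;p&gt;As a rough tally of the numbers above, here is a minimal Python sketch of
the cost of a cache miss without any optimization: four cycles to detect the
miss, the bus read itself, and two more cycles to present the first
instruction.  The names are illustrative only, and the constants are the
approximate figures quoted in the text rather than measurements.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;# Rough miss penalty, in clock cycles, without any optimization.
# All numbers are the approximate figures quoted in the text.
MISS_DETECT_CYCLES = 4   # cycles to discover the address isn't cached
PRESENT_CYCLES = 2       # cycles from end of bus read to first instruction

def miss_penalty(bus_read_cycles):
    '''Cycles from a jump until the first instruction reaches the CPU.'''
    return MISS_DETECT_CYCLES + bus_read_cycles + PRESENT_CYCLES

for name, bus_cycles in (('MIG', 35), ('UberDDR3', 18)):
    print(name, miss_penalty(bus_cycles), 'cycles')&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Under these assumptions, that’s roughly 41 cycles with the MIG and 24 with
the UberDDR3 controller before the first instruction of a new cache line
arrives.&lt;/p&gt;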

&lt;p&gt;Some instructions, such as the &lt;a href=&quot;/zipcpu2018/01/01/zipcpu-isa.html&quot;&gt;load immediate
instruction&lt;/a&gt;, are actually
two separate instructions–a bit reverse instruction to load the high order
bits, followed by a load immediate lo instruction to load the low order bits.
Other than that, things stay straightforward until the end of the cache line.
Once we get to the end, it takes us another 4 cycles to determine the next
instruction is not in the cache, and so a new bus cycle begins.&lt;/p&gt;

&lt;p&gt;Now that we know how things work normally, we have our first chance for an
improvement: what if we started feeding instructions to the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt; &lt;em&gt;before&lt;/em&gt;
all of the instructions had been read from memory and returned across the bus?
What if we fed the next instruction to the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;
as soon as it was available?&lt;/p&gt;

&lt;p&gt;In that case, we might see a trace similar to Fig. 3 below.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 3. Feeding instructions straight from the bus returns&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;/img/pfwrap/wrap-startup.svg&quot;&gt;&lt;img src=&quot;/img/pfwrap/wrap-startup.svg&quot; width=&quot;720&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;We can now overlap our instruction read time with our instruction issue,
saving ourselves a full 10 cycles!&lt;/p&gt;

&lt;p&gt;Let’s follow this further.  What would happen in the case of a jump/branch?
Without any modifications to &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v&quot;&gt;our instruction
cache&lt;/a&gt;
(i.e. before &lt;em&gt;wrap&lt;/em&gt; addressing), the JSR initiates a jump at the end of
Fig. 4 below.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 4. A Jump Instruction&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;/img/pfwrap/pf-jsr.svg&quot;&gt;&lt;img src=&quot;/img/pfwrap/pf-jsr.svg&quot; width=&quot;720&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;This trace is a touch more eventful.  For example, it includes a move to the
&lt;a href=&quot;/zipcpu2018/01/01/zipcpu-isa.html&quot;&gt;CC register&lt;/a&gt;.
On the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;, this register
contains more than just the condition codes.  It also contains the
&lt;a href=&quot;/zipcpu2018/01/01/zipcpu-isa.html&quot;&gt;user vs supervisor mode
control&lt;/a&gt;.  This creates a
pipeline hazard, and so instructions need to be stalled throughout the pipeline
until this instruction has had a chance to write back–clearing the hazard.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s
&lt;a href=&quot;/zipcpu2018/01/01/zipcpu-isa.html&quot;&gt;JSR instruction&lt;/a&gt;
follows, requiring three instruction words.  The first instruction word moves
the program counter plus two into R0.  This will now contain the return address
for the subroutine.  On other architectures, such an instruction is often
called a “Link Register” instruction, but on the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
this is simply the first of the three word
&lt;a href=&quot;/zipcpu2018/01/01/zipcpu-isa.html&quot;&gt;JSR instruction&lt;/a&gt;.
The second instruction loads a new value into the program counter.
Technically, this is a &lt;a href=&quot;/zipcpu2018/01/01/zipcpu-isa.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LW (PC),PC&lt;/code&gt; instruction–loading the
value of memory, as found at the program counter, into the program
counter&lt;/a&gt;.
Practically, it just allows us to place a 32b destination address into the
instruction stream.  Once the address is passed to the decoder, the decoder
recognizes the unconditional jump and sets a flag for the &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v&quot;&gt;instruction
cache&lt;/a&gt;
that it now wants a new instruction out of order.  The &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v&quot;&gt;instruction
cache&lt;/a&gt;
now takes four clock cycles to determine this new value is not in the cache,
and our cycle repeats.&lt;/p&gt;

&lt;p&gt;As before, we can compress this a touch by serving our instructions to the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;
immediately as they are read from the bus–instead of waiting for the
entire cache line to be read first.  You can see how this optimization might
speed things up in Fig. 5.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 5. JSR instruction, post optimization&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;/img/pfwrap/wrap-jsr.svg&quot;&gt;&lt;img src=&quot;/img/pfwrap/wrap-jsr.svg&quot; width=&quot;720&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;That’s how the first of our two optimizations works.&lt;/p&gt;

&lt;p&gt;Following the jump, without WRAP addressing, the pipeline would look like
Fig. 6.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 6. JSR Landing, no optimization&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;/img/pfwrap/pf-land.svg&quot;&gt;&lt;img src=&quot;/img/pfwrap/pf-land.svg&quot; width=&quot;720&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;To see what’s happening here, notice that we just jumped to address
0x040000b4.  Given our cache line size of
eight words, with each word being 64 bits, this cache line starts at address
0x04000080.  If we just returned the value from the bus as soon as it was
available, we’d have to read six bus words before we get to the one we’re
interested in–as shown in Fig. 6.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-text&quot; data-lang=&quot;text&quot;&gt; 4000080:	; Word 0: I don't care about these instructions.  I'm jumping
 4000084:	;	to address 0x040000b4.  I just have to read
 4000088:	; Word 1: these excess instructions because I'm operating on an
 400008c:	;	entire cache line.
 4000090:	; Word 2:
 4000094:
 4000098:	; Word 3: Still haven't gotten to anything I care about ...
 400009c:
 40000a0:	; Word 4:
 40000a4:
 40000a8:	; Word 5:
 40000ac:
 40000b0:	; Word 6: This is the first half of the word I do care about
 40000b4:	;	THIS IS THE FIRST INSN OF INTEREST!
 40000b8:	; Word 7:
 40000bc:	;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
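
&lt;p&gt;The arithmetic behind this listing can be sketched in a few lines of
Python.  This assumes byte addressing, as in the disassembly above, and the
cache geometry given earlier: eight bus words of 64 bits (eight bytes) each.
The function name is hypothetical.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;WORD_BYTES = 8                         # one 64-bit bus word
LINE_WORDS = 8                         # bus words per cache line
LINE_BYTES = WORD_BYTES * LINE_WORDS   # 64 bytes per cache line

def locate(addr):
    '''Return (line base address, bus word index, byte offset in word).'''
    offset = addr % LINE_BYTES
    return addr - offset, offset // WORD_BYTES, offset % WORD_BYTES

base, word, byte = locate(0x040000b4)
print(format(base, '#010x'), word, byte)&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;For 0x040000b4, this yields a line base of 0x04000080, bus word 6, and a
byte offset of 4.  That is, the instruction of interest sits in the second
32-bit half of word 6.&lt;/p&gt;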

&lt;p&gt;Why not, instead, request the address we are interested in first?  Instead
of starting with word 0, and reading until word 6, we might instead start with
word 6, read word 7, and then finish by reading the first part of the cache
line (words 0-5) while the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;
takes our instruction and gets (potentially) busy doing useful things.&lt;/p&gt;
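
&lt;p&gt;Here is a minimal Python sketch of that request order, assuming the
geometry given earlier of eight 64-bit words per cache line.  The function
name is illustrative, and the sketch only models the order of the addresses,
not the bus protocol itself.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;WORD_BYTES = 8    # one 64-bit bus word
LINE_WORDS = 8    # bus words per cache line

def wrap_order(addr):
    '''Bus word addresses of one cache line, requested word first.'''
    line_bytes = WORD_BYTES * LINE_WORDS
    base = addr - (addr % line_bytes)
    first = (addr % line_bytes) // WORD_BYTES
    # Start at the requested word, then wrap around to the line's start
    return [base + ((first + k) % LINE_WORDS) * WORD_BYTES
            for k in range(LINE_WORDS)]

for a in wrap_order(0x040000b4):
    print(format(a, '#010x'))&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;For a jump to 0x040000b4, the list begins with 0x040000b0 (word 6) and
0x040000b8 (word 7), and only then returns to 0x04000080 for words 0-5.&lt;/p&gt;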

&lt;p&gt;Fig. 7 shows how this wrap addressing might look.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 7. Instruction cache miss using WRAP addressing&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;a href=&quot;/img/pfwrap/wrap-land.svg&quot;&gt;&lt;img src=&quot;/img/pfwrap/wrap-land.svg&quot; width=&quot;720&quot; /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Here, we request the last two bus words, words 6 and 7, of the cache
line, and then bus words 0-5.  Word 6 contains two instruction words, but
we’re only interested in the second of those two.  That one is a compressed
instruction word, packing two instructions into 32 bits.  Word 7 then contains
another three instructions–one packed instruction word and one normal one.&lt;/p&gt;

&lt;p&gt;The trace gets a touch more interesting, though, given that the second
instruction wants to &lt;em&gt;store&lt;/em&gt; a word into memory.  The
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;,
however, has only one bus interface–an interface that needs to be shared
between instruction and data bus accesses.  This means that the data access,
i.e. the store word instruction, must wait until the &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v&quot;&gt;instruction
cache&lt;/a&gt;’s bus
cycle completes.&lt;/p&gt;

&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;The next step in this article should really be an analysis section that
artificially quantifies the additional performance achieved by using wrap
addressing over what I had been using.  This should then be compared against
some actual performance measure.  Sadly, that’s one part of caches
that I haven’t managed to get right–the performance analysis.  Even worse,
the lack of a solid ability to analyze this improvement has kept me from
writing an article introducing the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v&quot;&gt;instruction
cache&lt;/a&gt;
in the first place.  Perhaps I’ll manage to come back to this later–although
it’s held me back for a couple of years now.&lt;/p&gt;

&lt;p&gt;Since I haven’t presented the &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v&quot;&gt;instruction
cache&lt;/a&gt; in
the first place, it doesn’t really make sense to write an article presenting
the &lt;em&gt;modifications&lt;/em&gt; required to introduce wrap addressing.  That said, it was
easier to do than I was expecting.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right; padding: 25px&quot;&gt;&lt;caption&gt;Fig 8. Is formal worth it?&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/pfwrap/formal-value.svg&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;I suppose “easier” is a relative term.  I upgraded both
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v&quot;&gt;instruction&lt;/a&gt;
and &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v&quot;&gt;data&lt;/a&gt;
caches quickly–perhaps even in an afternoon.  Then, when everything
failed in simulation, I reverted the
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v&quot;&gt;data cache&lt;/a&gt;
updates to focus on the
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/pfcache.v&quot;&gt;instruction&lt;/a&gt;
cache updates.  Those updates are now complete, as is their formal proof,
so I expect I’ll push them soon.  All in all, the work took me a couple of
days to do spread over a month or so, with (as expected) the verification
part taking the longest.&lt;/p&gt;

&lt;p&gt;No, the updates aren’t (yet) posted.  Why not?  Because this update lies
behind the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s AXI DMA upgrade,
and … that one still has bugs to be worked out in it.  What bugs?  Well,
after posting the DMA initially, I then decided I wanted to change how the
DMA handled unaligned FIXed addressing.  My typical answer to unaligned FIX
addressing is to declare it disallowed in the user manual, but for some reason
I thought I might support it.  The new/changed requirements then made it so
that nothing worked, and so I have some updates left to do there before
formal proofs and simulations pass again.&lt;/p&gt;

&lt;p&gt;So my next steps are to 1) repeat this work with the
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/master/rtl/core/dcache.v&quot;&gt;data cache&lt;/a&gt;,
and 2) finish working with the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s DMA, so that 3) I can post
another upgrade to the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s
repository.  In the meantime, I’ll probably post my DDR3 controller memory
performance benchmarks before these updates hit the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu&quot;&gt;official repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For now, let me point out that the WRAP addressing performance is significantly
better, and the logic cost associated with it is (surprisingly) rather minimal.
How much better?  Well, that answer will have to wait until I can do a better
job quantifying cache performance …&lt;/p&gt;
&lt;hr /&gt;&lt;p&gt;&lt;em&gt;So the last shall be first, and the first last: for many be called, but few chosen.  -- Matt 20:16&lt;/em&gt;</description>
        <pubDate>Sat, 29 Mar 2025 00:00:00 -0400</pubDate>
        <link>https://zipcpu.com/zipcpu/2025/03/29/pfwrap.html</link>
        <guid isPermaLink="true">https://zipcpu.com/zipcpu/2025/03/29/pfwrap.html</guid>
        
        
        <category>zipcpu</category>
        
      </item>
    
      <item>
        <title>Your problem is not AXI</title>
        <description>&lt;p&gt;The following was a request for help from my inbox.  It illustrates a common
problem students have.  Indeed, the problem is common enough that &lt;a href=&quot;/fpga-hell.html&quot;&gt;this blog
was dedicated&lt;/a&gt; to its solution.  Let me
repeat the question here for reference:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;I’ve read some of your articles and old comments on forums in trying to
get something resembling Xilinx’ AXI4 Peripheral to work with my current
project in VIVADO for my FPGA. My main problem is that whenever I so much as
add a customizable AXI to my block design and connect it to my AXI
peripheral, generate a bitstream (with no failures), then build a platform
using it in VITIS (with no failures), my AXI GPIO connections which should
not be connected to the recently added customizable AXI, do not operate at
all (LEDs act as if tied to 0, although I’m sending all 1s). I tried a
solution I found online talking about incorrect “Makefile”s but to no avail.
I have also tried just adding some of your files &lt;a href=&quot;https://github.com/ZipCPU/wb2axip&quot;&gt;you provided on
github&lt;/a&gt; instead of the Xilinx’ broken IP
including
“&lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/master/rtl/demoaxi.v&quot;&gt;demoaxi.v&lt;/a&gt;” and
“&lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/master/rtl/easyaxil.v&quot;&gt;easyaxi.v&lt;/a&gt;”
[sp]. The
“&lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/master/rtl/demoaxi.v&quot;&gt;demoaxi.v&lt;/a&gt;”
has the exact same problem as Xilinx’ AXI, just adding it to the
block design and connecting it to my AXI peripheral causes the GPIO not
connect somehow. Your
“&lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/master/rtl/easyaxil.v&quot;&gt;easyaxi.v&lt;/a&gt;”
[sp] does not cause this issue right away,
however adding an output and assigning it with the slave register “r0” then
results in the same issue. I am at a loss for what to do. I’m not very
familiar with the specifics of how AXI works, even after re-reading some of
your articles multiple times (I’m still a student with very little
experience), so I can’t be certain why I am running into this issue. My
guess at what is happening is that adding an AXI block with a certain
characteristic somehow causes the addresses for my GPIO and other connections
to “bug out”.  But I have no idea why adding this kind of AXI block does
this (or something else that causes my issue). I’m reaching out because I
… might as well do something other
than making small changes to my design and waiting for 30+ minutes in between
tests to see if something breaks or doesn’t break my GPIO. Do you have any
idea what might be causing my issue or how to fix it?&lt;/p&gt;

  &lt;p&gt;Thanks,&lt;/p&gt;

  &lt;p&gt;(Student)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;(Links have been added …)&lt;/p&gt;

&lt;p&gt;Let’s start with the easy question:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Do you have any idea what might be causing my issue or how to fix it?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;No.  Without looking at the design, the schematic, or digging into the design
files, I can’t really comment on something like this.  Debugging hardware
designs is hard work, it takes time, and it takes a lot of attention to detail.
Without the details, I won’t be able to find the bug.&lt;/p&gt;

&lt;p&gt;That said, let’s back up and address the root problem, and it’s not AXI.&lt;/p&gt;

&lt;p&gt;Yes, I said that right: This student’s problem is not AXI.&lt;/p&gt;

&lt;p&gt;If anything, AXI is just the symptom.  If you don’t deal with the actual
problem, you will not succeed in this field.&lt;/p&gt;

&lt;h2 id=&quot;iterative-debugging&quot;&gt;Iterative Debugging&lt;/h2&gt;

&lt;p&gt;The fundamental problem is the method of debugging.  The problem is that the
design doesn’t work, and this student doesn’t know how to figure out why not.
This was why I created my blog in the first place–to address this type of
problem.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 1. This is not how to do debugging&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/not-axi/broken-process.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Here’s what I am hearing from the description: I tried A.  It didn’t work.
I don’t know why not.  So I tried B.  That didn’t work either.  I still don’t
know why not.  Let me try asking an expert to see if he knows.  It’s as though
the student expects me to be able, from these symptoms alone, to figure
out what’s wrong.&lt;/p&gt;

&lt;p&gt;That’s not how this works.  Indeed, this debugging process will lead you
straight to &lt;a href=&quot;/fpga-hell.html&quot;&gt;FPGA Hell&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As an illustration, and for a fun story, consider the problem I’ve been working
on for the past couple weeks.  I’m trying to get the FPGA processing working
for &lt;a href=&quot;https://www.youtube.com/watch?v=vSB9BcLcUhM&quot;&gt;this video project (fun promo video
link)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I got stuck for about two weeks at the point where I commanded the algorithm
to start and it didn’t do anything.  Now what?&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;padding: 25px; float: left&quot;&gt;&lt;caption&gt;Fig 2. Voodoo computing defined&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/sdrxframe/voodoo.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;One approach to this problem would be to just change things, with no
understanding of what’s going on.  I like to call this “Voodoo Computing”.
Sadly, it’s a common method of debugging that just … doesn’t work.&lt;/p&gt;

&lt;p&gt;I use this definition because … it’s just so true.  Even I find myself
doing “voodoo computing” at times, somehow expecting things to suddenly
fix themselves.  The reality is, that’s not how engineering works.&lt;/p&gt;

&lt;p&gt;Engineering works by breaking a problem down into smaller problems, and then
breaking those problems into smaller ones at that.  In this student’s case,
he has a problem where his AXI slave doesn’t work.  Let’s break that down by
asking a question: Is it your design that’s failing, or the Vivado created
“rest-of-the-system” that’s failing?  Draw a line.  Measure.  Which one is it?&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 3. Iterative Debugging&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/not-axi/iterative-debugging.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Well, how would you know?  You know by adding a test point of some type.
“Look” inside the system.  Look at what’s going on.  Look for any internal
evidence of a bug.  For example, this student wants to write to his component
and to see a pin change.  Perfect.  Now trigger a capture on any writes to this
component, and see if you can watch that pin change from within the capture
and on the board.  Does the component actually get written to?  Do the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AWVALID&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AWREADY&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WVALID&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WREADY&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BVALID&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BREADY&lt;/code&gt; signals toggle
appropriately?  How about &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WDATA&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WSTRB&lt;/code&gt;?  What of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AWADDR&lt;/code&gt;?  (You might
need to reduce this to a
single bit: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mydbg = (AWADDR == mydevices_register);&lt;/code&gt;)  If all these are
getting set appropriately, then the problem is in your design.  Voila!  You’ve
just narrowed down the issue.&lt;/p&gt;
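
&lt;p&gt;Once you have such a capture, checking it can be mechanical.  The Python
sketch below assumes the trace has been exported as one record per clock
cycle, each a dictionary of signal values.  That export format, the function
names, and the sample data are all hypothetical.  The only AXI fact it relies
on is that a beat transfers on any cycle where both VALID and READY are
high.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;def count_beats(trace, valid, ready):
    '''Count cycles where both handshake signals are high.'''
    return sum(1 for s in trace if s[valid] and s[ready])

def write_summary(trace):
    '''Tally completed AW, W, and B beats in a captured trace.'''
    return {
        'aw': count_beats(trace, 'AWVALID', 'AWREADY'),
        'w': count_beats(trace, 'WVALID', 'WREADY'),
        'b': count_beats(trace, 'BVALID', 'BREADY'),
    }

# A made-up four-cycle capture of a single write
trace = [
    dict(AWVALID=1, AWREADY=0, WVALID=1, WREADY=0, BVALID=0, BREADY=1),
    dict(AWVALID=1, AWREADY=1, WVALID=1, WREADY=1, BVALID=0, BREADY=1),
    dict(AWVALID=0, AWREADY=0, WVALID=0, WREADY=0, BVALID=1, BREADY=1),
    dict(AWVALID=0, AWREADY=0, WVALID=0, WREADY=0, BVALID=0, BREADY=1),
]
print(write_summary(trace))&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;For a single-beat write, one address beat, one data beat, and one response
beat should each show up.  A zero in any of the three tells you which side of
the line deserves the next test point.&lt;/p&gt;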

&lt;p&gt;Let’s illustrate this idea.  You have a design that doesn’t work.  You need
to figure out where the bug lies.  So we first break this design into three
parts.  I’ll call them 1) the AXI IP, 2) the LED output, and 3) the rest of the
design.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 4. Breaking down the problem&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/not-axi/decomposition.svg&quot; width=&quot;560&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;I would suggest two test points–although these can probably be merged into
the same “scope” (ILA).  The first one would be between the AXI IP and the
rest of the design.  This test point should look at all the AXI signals.
The second one should look at the LED output from your design.&lt;/p&gt;

&lt;p&gt;Yes, I can hear you say, but of course the problem is within my AXI IP!  Ahm,
no, you don’t get it.  Earlier this year, I shipped a design to a well paying
customer, and they came back and complained that my design wasn’t properly
acknowledging write transactions.  As I recall, either BID or BVALID were
getting corrupted or some such.  What should I say as a professional engineer
to a comment like that?  Do I tell the customer, gosh, I don’t know, that’s
never happened to me before?  Do I tell him, not at all, my stuff works?  Or
do I make random changes for him to try to see if these would fix his problem?
Frankly, none of these answers would be acceptable.  Instead, I asked if he
could provide a trace or other evidence of the problem that we could inspect
together–much like I illustrated above in Fig. 4.  When he did so, I was able
to clearly point out that my design was working–it was just Vivado’s IP
integrator that hadn’t properly connected it to the AXI bus.  Yes, these
things happen.  You, as the engineer, need to narrow down where the bug is
and getting a “trace” of what is going on is one clear way to do this.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;padding: 25px; float: left&quot;&gt;&lt;caption&gt;Fig 5. Yes, it's hard.  Get over it.&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/not-axi/encouragement.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;This problem is often both iterative and time consuming.  Yes, it’s hard.
As my Ph.D. advisor used to say, “Take an Aspirin.  Get over it.”  It’s a
fact of life.  This field isn’t easy.  That’s why it pays well.  Personally,
that’s also why I find it so rewarding to work in this field.  I enjoy the
excitement of getting something working!&lt;/p&gt;

&lt;p&gt;If we go back to the &lt;a href=&quot;https://www.youtube.com/watch?v=vSB9BcLcUhM&quot;&gt;video processing example I mentioned
earlier&lt;/a&gt;, I eventually found
several bugs in my Verilog IP.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;A bus arbiter was broken, and so the arbiter would get locked up following
any bus error.&lt;/p&gt;

    &lt;p&gt;(Yes, this was &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/wbmarbiter.v&quot;&gt;my own
arbiter&lt;/a&gt;, and
one I had borrowed from &lt;a href=&quot;https://github.com/ZipCPU/eth10g&quot;&gt;another
project&lt;/a&gt;.  It had no problems in
&lt;a href=&quot;https://github.com/ZipCPU/eth10g&quot;&gt;that other project&lt;/a&gt;.)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Every time the video chain got reset, the memory address got written to
zero–and so the design tried accessing a NULL memory pointer.  This was then
the source of the bus error the arbiter was struggling with.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The CPU was faulting since the video controller was writing video data to
CPU instruction memory.&lt;/p&gt;

    &lt;p&gt;I traced this to using the wrong linker description file.  Sure, a
simplified block RAM only description is great for initial bringup testing,
but there’s no way a 1080p image frame will fit in block RAM in addition
to the C library.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A key video component was dropping pixels any time Xilinx’s MIG had a
hiccup on the last return beat.&lt;/p&gt;

    &lt;p&gt;This was a bit more insidious than it sounds.  The component in question
was the video frame buffer.  This component reads video data from memory
and generates an outgoing video stream.  A broken signaling flag caused the
frame buffer to drop the bus transaction while one word was still
outstanding.  This left the memory request and memory recovery FSMs off by
one (more) beat.&lt;/p&gt;

    &lt;p&gt;If you’ve ever stared at traces from Xilinx’s MIG, you’ll notice that it
generates a lot of hiccups.  Not only does it need to take the memory off
line periodically for refreshes, but it also needs to take it off line more
often for return clock phase tracking.  This means that the ready wire,
in this case &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ARREADY&lt;/code&gt;, will have a lot of hiccups to it, and
consequently so will the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RVALID&lt;/code&gt; (and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BVALID&lt;/code&gt;) acknowledgments.&lt;/p&gt;

    &lt;p&gt;What happens, as it did in my case, when your design is sensitive to such
a hiccup at one particular clock cycle in your operation but not others?
The design might pass a simulation check, but still fail in hardware.&lt;/p&gt;

    &lt;p&gt;Fig 6 shows the basic trace of what was going on.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 6. The missing ACK&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/not-axi/hlast-bug-annotated.png&quot; width=&quot;760&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Notice what I just did there?  I created a test point within the design, looked
at signals from within that test point, captured a trace of what was going on,
and hence was able to identify the problem.  No, this wasn’t the first test
point–it took a couple to get to this point.  Still, this is an example of
debugging a design within hardware.&lt;/p&gt;

&lt;p&gt;The story of this video development goes on.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 7. The 3-board Stack&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/not-axi/stacked-woled.jpg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;At this point, though, I’ve now moved from one board to three.  On the one
hand, that’s a success story.  I only moved on once the single board was
working.  On the other hand, the three boards aren’t talking to each other
(yet).  I think I’ve now narrowed the problem down to a &lt;a href=&quot;https://x.com/zipcpu/status/1853895732266516793&quot;&gt;complex electrical
interaction between the two
boards&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;How did I do that?  The key was to be able to capture a trace of what was
going on from within the system.  Sound familiar?  First, I captured a trace
indicating that the I2C master on the middle board was attempting to contact
the I2C slave on the bottom board and … the bottom board wasn’t
acknowledging.  Then I captured a trace from the bottom board showing that
the I2C pins weren’t even getting toggled.  Indeed, I eventually got to the
point where I was toggling the I2C pins by hand using the on board
switches–and even then the boards weren’t showing a connection between
them.&lt;/p&gt;

&lt;p&gt;Generate a test.  Test.  Narrow down the problem.  Continue.&lt;/p&gt;

&lt;h2 id=&quot;enumerating-debug-methods&quot;&gt;Enumerating Debug Methods&lt;/h2&gt;

&lt;p&gt;In many ways, debugging can be thought of as a feedback loop–much like
&lt;a href=&quot;https://en.wikipedia.org/wiki/John_Boyd_(military_strategist)&quot;&gt;Col Boyd&lt;/a&gt;’s
&lt;a href=&quot;https://en.wikipedia.org/wiki/OODA_loop&quot;&gt;OODA loop&lt;/a&gt;.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 8. Debugging Feedback Loop&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/not-axi/feedback-loop.svg&quot; width=&quot;560&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;The faster you can go through this loop, the faster you can find bugs, and the
better your design will be.&lt;/p&gt;

&lt;p&gt;Given this loop, let’s now go back and enumerate the basic methods for
debugging a hardware design.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Desk checking&lt;/strong&gt;.  This is the type of debugging where you stare at your
design, and hopefully just happen to see whatever the bug was.  Yes, I do
this a lot.  Yes, after a decade or two of doing design it does get easier
to find bugs this way.  After a while, you start to see patterns and learn
to look for them.  No, I’m still not very successful using this
approach–and I’ve been doing digital design for a living for many years.&lt;/p&gt;

    &lt;p&gt;In the case of this student’s design, I’m sure he’d stared at his design
quite a bit and wasn’t seeing anything.  Yeah.  I get that.  I’ve been there
too.&lt;/p&gt;

    &lt;p&gt;Build time required for desk checking?  None.&lt;/p&gt;

    &lt;p&gt;Test time?   This doesn’t involve testing, so none.&lt;/p&gt;

    &lt;p&gt;Analysis time?  Well, it depends.  Usually I give up before spending too
much time doing this.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Lint&lt;/strong&gt;, sometimes called “Static Design Analysis”.  This type of
debugging takes place any time you use a tool to examine your design.&lt;/p&gt;

    &lt;p&gt;I personally like to use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;verilator -Wall -cc mydesign.v&lt;/code&gt;.  Using Verilator,
I can get my design to have &lt;em&gt;zero&lt;/em&gt; lint errors.  Since this process tends
to be so quick and easy, I rarely discuss bugs found this way.  They’re just
found and fixed so quickly that there’s no story to tell.&lt;/p&gt;

    &lt;p&gt;Vivado also produces a list of lint errors (warnings) every time it
synthesizes my design.  The list tends to be long and filled with false
alarms.  Every once in a long while I’ll examine this list for bugs.
Sometimes I’ll even find one or two.&lt;/p&gt;

    &lt;p&gt;From the student’s email above, I gather he believed his design was good
enough from this standpoint.  Still, it’s a place worth looking when things
take unexpected turns.&lt;/p&gt;

    &lt;p&gt;Build time?  None.&lt;/p&gt;

    &lt;p&gt;Test time?   Almost instantaneous when using Verilator.&lt;/p&gt;

    &lt;p&gt;Analysis time?  Typically very fast.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Formal methods&lt;/strong&gt;.  Formal methods involve first &lt;em&gt;assuming&lt;/em&gt; things about
your inputs, and then making &lt;em&gt;assertions&lt;/em&gt; about how the design is supposed
to work.  A solver can then be used to logically &lt;em&gt;prove&lt;/em&gt; that if your
assumptions hold, then your assertions will as well.  If the solver fails,
it will provide you with a very short trace illustrating what might happen.&lt;/p&gt;

    &lt;p&gt;You can read about &lt;a href=&quot;/blog/2017/10/19/formal-intro.html&quot;&gt;my own first experience with formal methods
here&lt;/a&gt;, although that’s
no longer where I’d suggest you start.  Were I to recommend a starting
place, it would probably be &lt;a href=&quot;/tutorial/&quot;&gt;my Verilog design
tutorial&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;Many of the bugs I mentioned in the &lt;a href=&quot;https://www.youtube.com/watch?v=vSB9BcLcUhM&quot;&gt;video design I’m working
with&lt;/a&gt; &lt;em&gt;should’ve&lt;/em&gt; been found
via formal methods.  However, some of the key components didn’t get
formally verified.  (Yes, that’s on me.  This was supposed to be a
&lt;em&gt;prototype&lt;/em&gt;…)  The
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/wbmarbiter.v&quot;&gt;arbiter&lt;/a&gt;,
however, had gone through a formal verification process.  Sadly, at one point
I had placed an assumption into the design that there would never be any bus
errors.  What do you know?  That kept it from finding bus errors!
Likewise, the &lt;a href=&quot;https://x.com/zipcpu/status/1852735323161207089&quot;&gt;frame buffer’s proof never passed
induction&lt;/a&gt;, so it
never completed a full bus request to see what would happen if the two got
out of sync.  The excuses go on.  I’m now working on formally verifying
these components.&lt;/p&gt;

    &lt;p&gt;In the case of the student above, he mentions using some formally verified
designs, but says nothing about whether or not he formally verified the LED
output of those designs.&lt;/p&gt;

    &lt;p&gt;Build time?  For formal methods, this typically refers to how long it
takes to translate the design into a formal language of some type–such as
SMT.  When using Yosys, the time it takes to do this is usually so quick I
don’t notice it.&lt;/p&gt;

    &lt;p&gt;Test time?   &lt;a href=&quot;/formal/2019/08/03/proof-duration.html&quot;&gt;We measured formal proof solver time some time
ago&lt;/a&gt;.  Bottom
line, 87% of the time a formal proof will take less than two minutes, and
only 5% of the time will it ever take longer than ten minutes.&lt;/p&gt;

    &lt;p&gt;Analysis time?  This tends to only take a minute or two.  One of the
good things about formal proofs is that the solver will lead you directly
to the error.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Simulation&lt;/strong&gt;.&lt;/p&gt;

    &lt;p&gt;Simulation is a very important debugging tool.  It’s one of the easiest
ways to find bugs.  In general, if a design doesn’t work in simulation,
then it will never work in hardware.&lt;/p&gt;

    &lt;p&gt;However, simulation depends upon &lt;em&gt;models&lt;/em&gt; of all of the components in
question–both those written in Verilog and those only available via
data sheet, from which Verilog (or other) models need to be written
and thus only approximated.  As a result, there are often gaps between how
the models work and what happens in reality.&lt;/p&gt;

    &lt;p&gt;A second reality of simulation is that it’s not complete.  There will always
be cases that don’t get simulated.  A good engineer will work to limit the
number of these cases, but it’s very hard to eliminate them entirely.
For example:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;Not simulating jumping to the last instruction in a cache line left me with &lt;a href=&quot;/zipcpu/2017/12/28/ugliest-bug.html&quot;&gt;quite a confusing mix of symptoms&lt;/a&gt;.&lt;/li&gt;
      &lt;li&gt;Not simulating bus errors led to missing a bus lockup in the arbiter above.&lt;/li&gt;
      &lt;li&gt;Not simulating ACK dropping at the last beat in a series of requests led to the frame buffer perpetually resynchronizing.&lt;/li&gt;
      &lt;li&gt;Not simulating stalls and multiple outstanding requests led Xilinx to believe their AXI demo worked.&lt;/li&gt;
    &lt;/ul&gt;

    &lt;p&gt;Considering the &lt;a href=&quot;https://www.youtube.com/watch?v=vSB9BcLcUhM&quot;&gt;video processing
example&lt;/a&gt; I’ve been discussing,
I’ll be the first (and proudest) to
declare that all of the video algorithms worked nicely in simulation.
Yes, they worked in simulation–they just didn’t work in hardware.
Why?  My simulation didn’t include the MIG or the DDR3 SDRAM.  Instead, I
had &lt;em&gt;approximated&lt;/em&gt; their performance with a basic block RAM implementation.
This usually works for me, since I like to formally verify everything–only
I didn’t formally verify everything this time.  The result was some bugs
that slipped through the cracks, and so among other things my simulation
never fully exercised the design.  My simulation also didn’t include the
CPU, nor did it accurately have the same type and amount of memory as the
final design had.  These were all problems with my simulation, that kept me
from catching some of these last bugs.&lt;/p&gt;

    &lt;p&gt;While simulation is the “easiest” type of debugging, it does tend to be slow
and resource (i.e. memory and disk) intensive.  Traces from my video tests
are often 200GB or larger.  Indeed, this is one of the reasons why the
simulation doesn’t include either the MIG DDR3 SDRAM controller, the CPU,
the &lt;a href=&quot;/blog/2019/03/27/qflexpress.html&quot;&gt;flash&lt;/a&gt;,
&lt;a href=&quot;/zipcpu/2018/07/13/memories.html&quot;&gt;block RAM&lt;/a&gt;, or the
&lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;Wishbone crossbar&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;I would be very curious to know if the student who wrote me had fully
simulated his design–from ARM software to LED.&lt;/p&gt;

    &lt;p&gt;Build time?  When using Verilator, I’ve seen this take up to a minute or
two for a large and complex design, although I rarely notice it.&lt;/p&gt;

    &lt;p&gt;Test time?   The video simulations I’ve been running take about an hour or
so when using Verilator.  A full ZipCPU test suite can take two hours using
Verilator, or about a week when using Icarus Verilog.&lt;/p&gt;

    &lt;p&gt;Test time gets annoying when using Vivado, since it doesn’t automatically
capture every signal from within the design as Verilator will.  I
understand there’s a setting to make this happen, but … I haven’t found
it yet.&lt;/p&gt;

    &lt;p&gt;Analysis time?  This tends to be longer than formal methods, since I
typically find myself tracing bugs through simulations of very large and
complex designs, and it takes a while to trace back from the evidence of the
bug to the actual bug itself.  The worst examples of simulation analysis
I’ve had to do were of &lt;a href=&quot;https://www.arasan.com/products/nand-flash/&quot;&gt;NAND flash
simulations&lt;/a&gt;, where you don’t
realize you have a problem until you read results from the flash.  Then you
need to first find the evidence of the problem in the trace (expected
value doesn’t match actual value), then trace it from the AXI bus to the
flash read bus, across multiple flash transactions to the critical one
that actually programmed the block in question, back across the flash bus
to the host IP, and then potentially back further to the AXI transaction
that provided the information in the first place.  While doable, this can
be quite painful.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;table align=&quot;center&quot; style=&quot;float: center&quot;&gt;&lt;caption&gt;Fig 9. Tracing from cause to effect can require a lot of investigation&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/not-axi/longsim.svg&quot; width=&quot;760&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
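
&lt;p&gt;As a rough illustration of the assume/assert split described under formal
methods above, here is a minimal sketch of a countdown timer.  The module name
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;downcount&lt;/code&gt; and all of its ports are hypothetical, invented for this
example rather than taken from any of the designs discussed here.  It also
assumes a SymbiYosys-style flow in which the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FORMAL&lt;/code&gt; macro is defined
during verification.  The only assumption constrains an input, and the
assertion states what the design must then guarantee.&lt;/p&gt;

&lt;div class=&quot;language-verilog highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// Hypothetical example: count down i_len clocks after i_start
module downcount (
	input	wire		i_clk,
	input	wire		i_start,
	input	wire	[3:0]	i_len,
	output	reg		o_busy
);
	reg	[3:0]	counter;

	initial	counter = 0;
	initial	o_busy  = 1'b0;
	always @(posedge i_clk)
	if (!o_busy)
	begin
		if (i_start)
		begin
			// Trusts the caller never to ask for zero clocks
			counter &amp;lt;= i_len;
			o_busy  &amp;lt;= 1'b1;
		end
	end else begin
		counter &amp;lt;= counter - 1'b1;
		o_busy  &amp;lt;= (counter != 1);
	end

`ifdef	FORMAL
	// Assumption: constrains an input only
	always @(*)
	if (i_start)
		assume(i_len != 0);

	// Assertion: we are busy if and only if there is time left
	always @(*)
		assert(o_busy == (counter != 0));
`endif
endmodule
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Remove the assumption, and a solver should be able to produce a short trace
in which &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_start&lt;/code&gt; arrives with a zero &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_len&lt;/code&gt;, leaving &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_busy&lt;/code&gt; set
with nothing left to count.  The same mechanism cuts both ways: an assumption
that is stronger than reality, such as one that there will never be any bus
errors, will hide exactly the bugs you most want to find.&lt;/p&gt;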

&lt;ol start=&quot;5&quot;&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Debug in hardware&lt;/strong&gt;.  Getting to hardware is painful–it requires building
a complete design, handling timing exceptions, and a typically long
synthesis process.  Once you get there, tests can typically be run very
fast.  However, such tests are often unrevealing.  Trying something else
on hardware often requires a design change, rebuild, and … a substantial
stall in your process which will slow you down.  In the case of this student,
he measured this stall time at 30min.&lt;/p&gt;

    &lt;p&gt;This &lt;em&gt;stall&lt;/em&gt; time while things are rebuilding can make hardware debugging
slow and expensive.  Why is it expensive?  Because time is expensive.  I
charge by the hour.  I can do that.  I’m not a student.  Students on the
other hand are often overloaded for time.  They have other projects to do,
and one class (or lab) consuming a majority of their time will quickly
become a serious problem on the road to graduation.&lt;/p&gt;

    &lt;p&gt;Knowing what’s wrong when things fail in
hardware is … difficult–else I wouldn’t be writing this note.&lt;/p&gt;

    &lt;p&gt;However, it’s a skill you need to have if you are going to work in this
field.  How can you do it?  You can use LEDs.  You can use your UART.  If
you are on an ARM based FPGA, you can often use printf.  You can use a
companion CPU (PC), or even an on-board CPU (ARM or softcore).  You can
use the ILA, or you can build your own (that’s me).  In all cases, you
need to be able to extract the key information regarding the “bug” (whatever
it might be) from the design.  That key information needs to point you to
the bug.  Is it in Vivado generated IP?  Is it in the Verilog?  If it’s in
your Verilog, where is it?  You need to be able to bisect your design
repeatedly to figure this out.&lt;/p&gt;

    &lt;p&gt;In the case of &lt;a href=&quot;https://www.youtube.com/watch?v=vSB9BcLcUhM&quot;&gt;the video project I’m working
on&lt;/a&gt;, this is (currently) where
I’m at in my development.&lt;/p&gt;

    &lt;p&gt;In the case of the student above, I’d love to know whether &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;assign led=1;&lt;/code&gt;
would work, if the LED control wire was mapped to the correct pin, or
if the LED’s control was inverted.  Without more information, I might
never know.&lt;/p&gt;

    &lt;p&gt;Build time?  That is, how long does it take to turn the design Verilog
into a bit file?  Typically I deal with build times of roughly 12-15 minutes.
The student above was dealing with a 30min build time.  I’ve heard horror
stories of Vivado even taking as long as a day for particularly large
designs, but never had to deal with delays that long myself.&lt;/p&gt;

    &lt;p&gt;Test time?   Most hardware tests take longer to set up than to perform, so
I’ll note this as “almost instantaneous.”  Certainly my video tests tended
to be very quick.&lt;/p&gt;

    &lt;p&gt;Analysis time?  “What just happened?” seems to be a common refrain in
hardware testing.  Sure, you just ran a test, but … what really happened
in it?  This is the problem with testing in hardware.  It can take a lot
of work to get to the “success” or “failure” measure.  In the video
processing case, video processing takes place one pixel at a time at over
80M pixels per second, but the final “success” (once I got there) was
watching the effects of the video processing as applied to a 4 minute video.
Indeed, I was so excited (once I got there), that I called everyone from
my family to come and watch.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
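
&lt;p&gt;The LED questions raised above (does the pin mapping work, and is the LED
active high or active low?) can be answered with about the smallest hardware
test there is.  What follows is only a sketch: the module, port, and parameter
names are hypothetical, and the counter width assumes a board clock somewhere
near 100MHz so that the blink is slow enough to see.&lt;/p&gt;

&lt;div class=&quot;language-verilog highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// Hypothetical sanity check: blink one LED from a free-running counter
module ledcheck #(
	// Set to 1 if the board lights the LED when the pin is low
	parameter [0:0]	OPT_ACTIVE_LOW = 1'b0,
	// 27 bits at roughly 100MHz gives a period of a bit over a second
	parameter	CTR_BITS = 27
) (
	input	wire	i_clk,
	output	wire	o_led
);
	reg	[CTR_BITS-1:0]	counter;
	wire			led_on;

	initial	counter = 0;
	always @(posedge i_clk)
		counter &amp;lt;= counter + 1'b1;

	// The top counter bit spends half its time high, half low
	assign	led_on = counter[CTR_BITS-1];

	// Invert at the pin only, so the logic above stays active high
	assign	o_led = led_on ^ OPT_ACTIVE_LOW;
endmodule
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A blinking LED rules out the pin mapping and the clock in a single build.
An LED that stays dark or stays lit points at the constraints file or the
clock before it points at any bus logic, which is one bisection step done.&lt;/p&gt;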

&lt;p&gt;While I’d love to say one debugging method is better than another, the reality
is that they each have their strengths and weaknesses.  Formal methods, for
example, don’t often work on medium to large designs.  Lint tends to miss
things.  You get the picture.  Still, you need to be familiar with
every technique, to have them in your tool belt for when something doesn’t
work.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Again, the bottom line is that you need to know how to debug a design
to succeed in this field.  This is a prerequisite for anything that might
follow–such as building an AXI slave.  Perhaps a &lt;a href=&quot;https://zipcpu.com/zipcpu/2019/02/04/debugging-that-cpu.html&quot;&gt;fun
story&lt;/a&gt; might
help illustrate my points.&lt;/p&gt;

&lt;p&gt;You might also find the &lt;a href=&quot;https://zipcpu.com/blog/2017/06/02/design-process.html&quot;&gt;first article I wrote on this hardware debugging
topic&lt;/a&gt; to be valuable.&lt;/p&gt;

&lt;p&gt;Or how about &lt;a href=&quot;https://zipcpu.com/blog/2017/06/10/lost-college-student.html&quot;&gt;the response from a student who then commented on that article,
after struggling with these same
issues&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;In all of this, the hard reality remains:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Hardware debugging is hard.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;There is a methodology to it.  I might even use the word “methodical”,
but that would be redundant.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;You will need to learn that methodology to debug your design.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Once you understand the methodology of hardware debugging, you can then
debug any design–to include any AXI design.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Hardware design isn’t for everybody.  Not everyone will make it through
their learning process–be it college or self taught.  Yes, there are
&lt;a href=&quot;https://reddit.com/r/FPGA&quot;&gt;design communities&lt;/a&gt; that would love to help
and encourage you.  On the bright side, hard work pays well in any field.&lt;/p&gt;
&lt;hr /&gt;&lt;p&gt;&lt;em&gt;Seest thou a man diligent in his business?  He shall stand before kings; he shall not stand before mean men. (Prov 22:29)&lt;/em&gt;</description>
        <pubDate>Wed, 06 Nov 2024 00:00:00 -0500</pubDate>
        <link>https://zipcpu.com/blog/2024/11/06/not-axi.html</link>
        <guid isPermaLink="true">https://zipcpu.com/blog/2024/11/06/not-axi.html</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>My Personal Journey in Verification</title>
        <description>&lt;p&gt;This week, I’ve been testing a CI/CD pipeline.  This has been my opportunity
to shake the screws and kick the tires on what should become a new verification
product shortly.&lt;/p&gt;

&lt;p&gt;I thought that a good design to check might be my
&lt;a href=&quot;https://github.com/ZipCPU/sdsdpi&quot;&gt;SDIO project&lt;/a&gt;.  It has roughly all the
pieces in place, and so makes sense for an automated testing pipeline.&lt;/p&gt;

&lt;p&gt;This weekend, the CI project engineer shared with me:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;It’s literally the first time I get to know a good hardware project needs
such many verifications and testings!  There’s even a real SD card
simulation model and RW test…&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After reminiscing about this for a bit, I thought it might be worth taking a
moment to tell how I got here.&lt;/p&gt;

&lt;h2 id=&quot;verification-the-goal&quot;&gt;Verification: The Goal&lt;/h2&gt;

&lt;p&gt;Perhaps the best way to explain the “goal” of verification is by way of an
old “war story”–as we used to call them.&lt;/p&gt;

&lt;p&gt;At one time, I was involved with a DOD unit whose whole goal and purpose was
to build quick reaction hardware capabilities for the warfighter.  We bragged
about our ability to respond to a call on a Friday night with a new product
shipped out on a C-130 before the weekend was over.&lt;/p&gt;

&lt;p&gt;Anyone who has done engineering for a while will easily recognize that this
sort of concept violates all the good principles of engineering.  There’s no
time for a requirements review.  There’s no time for prototyping–or perhaps
there is, to the extent that it’s always the &lt;em&gt;prototype&lt;/em&gt; that heads out the
door to the warfighter as if it were a &lt;em&gt;product&lt;/em&gt;.  There’s no time to build a
complete test suite, to verify the new capability against all things that could
go wrong.  However, we’d often get only one chance to do this right.&lt;/p&gt;

&lt;p&gt;Now, how do you accomplish quality engineering in that kind of environment?&lt;/p&gt;

&lt;p&gt;The key to making this sort of shop work lay in the “warehouse”, and what
sort of capabilities we might have “lying on the shelf” as we called it.
Hence, we’d spend our time polishing prior capabilities, as well as
anticipating new requirements.  We’d then spend our time building, verifying,
and testing these capabilities against phantom requirements, in the hopes that
they’d be close to what we’d need to build should a real requirement arise.
We’d then place these concept designs in the “warehouse”, and show them off
to anyone who came to visit wondering what it was that our team was able to
accomplish.  Then, when a new requirement arose, we’d go into this “warehouse”
and find whatever capability was closest to what the customer required and
modify it to fit the mission requirement.&lt;/p&gt;

&lt;p&gt;That was how we achieved success.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/vlog-wait/rule-of-gold.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;The same applies in digital logic design.  You want to have a good set of
tried, trusted, and true components in your “library” so that whenever a new
customer comes along, you can leverage these components quickly to meet his
needs.  This is why I’ve often said that well written, well tested, well
verified design components are gold in this business.  Such components allow
you to go from zero to product in short order.  Indeed, the more well-tested
components you have that you can
&lt;a href=&quot;/blog/2020/01/13/reuse.html&quot;&gt;reuse&lt;/a&gt;, the faster you’ll be
to market with any new need, and the less it will cost you to get there.&lt;/p&gt;

&lt;p&gt;That’s therefore the ultimate goal: a library of
&lt;a href=&quot;/blog/2020/01/13/reuse.html&quot;&gt;reusable&lt;/a&gt;
components that can be quickly composed into new products for customers.&lt;/p&gt;

&lt;p&gt;As I’ve tried to achieve this objective over the years, my approach to
component verification has changed, or rather grown, many times over.&lt;/p&gt;

&lt;h2 id=&quot;hardware-verification&quot;&gt;Hardware Verification&lt;/h2&gt;

&lt;p&gt;When I first started learning FPGA design, I understood nothing about
simulation.  Rather than learning how to do simulation properly, I instead
learned quickly how to test my designs in hardware.  Most of these designs
were DSP based.  (My background was DSP, so this made sense …)  Hence,
the following approach tended to work for me:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;I created access points in the hardware that allowed me to read and write
registers at key locations within the design.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;One of these “registers” I could write to controlled the inputs to my DSP
pipeline.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Another register, when written to, would cause the design to “step” the
entire DSP pipeline as if a new sample had just arrived from the A/D.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A set of registers within the design then allowed me to read the state of
the entire pipeline, so I could do debugging.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
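
&lt;p&gt;A minimal sketch of what such access points might look like follows.  None
of this is the original code: the module name, the register addresses, the
simple write port, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_debug_mode&lt;/code&gt; switch are all assumptions made for
illustration.  The idea is only that the pipeline’s clock enable comes either
from the A/D or from a write to a “step” register.&lt;/p&gt;

&lt;div class=&quot;language-verilog highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// Hypothetical sketch: step a DSP pipeline by register writes
module stepctl (
	input	wire		i_clk,
	// Hypothetical debug write port
	input	wire		i_wr,
	input	wire	[1:0]	i_addr,
	input	wire	[15:0]	i_data,
	// Normal sample path from the A/D
	input	wire		i_adc_valid,
	input	wire	[15:0]	i_adc_sample,
	// Selects stepping over real-time operation
	input	wire		i_debug_mode,
	// Clock enable and sample driving the DSP pipeline
	output	reg		o_ce,
	output	reg	[15:0]	o_sample
);
	localparam [1:0]	ADR_SAMPLE = 2'b00,
				ADR_STEP   = 2'b01;

	reg	[15:0]	dbg_sample;

	// Register #1: holds the next sample to feed to the pipeline
	initial	dbg_sample = 0;
	always @(posedge i_clk)
	if (i_wr &amp;amp;&amp;amp; (i_addr == ADR_SAMPLE))
		dbg_sample &amp;lt;= i_data;

	// Register #2: any write here steps the pipeline exactly once
	initial	o_ce = 1'b0;
	always @(posedge i_clk)
	if (i_debug_mode)
		o_ce &amp;lt;= i_wr &amp;amp;&amp;amp; (i_addr == ADR_STEP);
	else
		o_ce &amp;lt;= i_adc_valid;

	initial	o_sample = 0;
	always @(posedge i_clk)
	if (i_debug_mode)
		o_sample &amp;lt;= dbg_sample;
	else
		o_sample &amp;lt;= i_adc_sample;
endmodule
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Every pipeline stage would then advance only when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_ce&lt;/code&gt; is high, so the
read-back registers see a stable pipeline state between steps.&lt;/p&gt;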

&lt;p&gt;This worked great for “stepping” through designs.  When I moved to processing
real-time information, such as the A/D results from the antenna connected to
the design, I built an internal logic analyzer to catch and capture key
signals along the way.&lt;/p&gt;

&lt;p&gt;I called this “Hardware in the loop testing”.&lt;/p&gt;

&lt;p&gt;Management thought I was a genius.&lt;/p&gt;

&lt;p&gt;This approach worked … for a while.  Then I started realizing how painful it
was.  I think the transition came when I was trying to debug
&lt;a href=&quot;/2018/10/02/fft.html&quot;&gt;my FFT&lt;/a&gt; by writing test vectors to
an Arty A7 circuit board via UART, and reading the results back to display
them on my screen. Even with the hardware in the loop, hitting all the test
vectors was painfully slow.&lt;/p&gt;

&lt;p&gt;Eventually, I had to search for a new and better solution.  This was just too
slow.  Later on, I would start to realize that this solution didn’t catch
enough bugs–but I’ll get to that in a bit.&lt;/p&gt;

&lt;h2 id=&quot;happy-path-simulation-testing&quot;&gt;Happy Path Simulation Testing&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Happy_path&quot;&gt;“Happy path” testing&lt;/a&gt;
is a reference to simply testing working paths
through a project’s environment.  To use an aviation analogy, a &lt;a href=&quot;https://en.wikipedia.org/wiki/Happy_path&quot;&gt;“happy path”
test&lt;/a&gt;
might make sure the ground avoidance radar never alerted when you
weren’t close to the ground.  It doesn’t make certain that the radar
necessarily does the right thing when you are close to the ground.&lt;/p&gt;

&lt;p&gt;So, let’s talk about my next project: the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Verification of the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;
began with an &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/a20b6064ea794d66fdeb2e00929287d7f2dc9ac6/bench/asm/simtest.s&quot;&gt;assembly
program&lt;/a&gt;
the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; would run.  The
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/a20b6064ea794d66fdeb2e00929287d7f2dc9ac6/bench/asm/simtest.s&quot;&gt;program&lt;/a&gt;
was designed to test all the instructions of the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;
with sufficient fidelity to know when/if the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt; worked.&lt;/p&gt;

&lt;p&gt;The test had one of two outcomes.  If the program halted, then the test was
considered a success.  If it detected an error, the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt; would execute a
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BUSY&lt;/code&gt; instruction (i.e. jump to current address) and then perpetually loop.
My test harness could then detect this condition and end with a failing exit
code.&lt;/p&gt;
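
&lt;p&gt;One way a harness might recognize these two outcomes is sketched below.
This is not the actual test harness, and the signal names are hypothetical:
it assumes the CPU exposes a strobe when an instruction completes, the address
of that instruction, and a flag indicating the CPU has halted.  A jump to the
current address then shows up as the same address completing twice in a
row.&lt;/p&gt;

&lt;div class=&quot;language-verilog highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// Hypothetical monitor: pass on halt, fail on a jump-to-self
module loopdetect (
	input	wire		i_clk, i_reset,
	input	wire		i_retire, // assumed: an instruction completed
	input	wire	[31:0]	i_pc,     // assumed: that instruction's address
	input	wire		i_halted, // assumed: the CPU has halted
	output	reg		o_pass, o_fail
);
	reg		last_valid;
	reg	[31:0]	last_pc;

	initial	{ o_pass, o_fail, last_valid } = 3'b000;
	always @(posedge i_clk)
	if (i_reset)
		{ o_pass, o_fail, last_valid } &amp;lt;= 3'b000;
	else begin
		if (i_retire)
		begin
			// Same address twice in a row: the CPU is spinning
			if (last_valid &amp;amp;&amp;amp; (i_pc == last_pc))
				o_fail &amp;lt;= 1'b1;
			last_valid &amp;lt;= 1'b1;
			last_pc    &amp;lt;= i_pc;
		end

		// A halt only counts as success if nothing failed first
		if (i_halted &amp;amp;&amp;amp; !o_fail)
			o_pass &amp;lt;= 1'b1;
	end
endmodule
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The surrounding simulation script would then stop on either flag and turn
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_fail&lt;/code&gt; into a failing exit code.&lt;/p&gt;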

&lt;p&gt;When the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; acquired a software
tool chain (GCC+Binutils) and C-library support, this &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/a20b6064ea794d66fdeb2e00929287d7f2dc9ac6/bench/asm/simtest.s&quot;&gt;assembly
program&lt;/a&gt;
was abandoned and replaced with a &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/a20b6064ea794d66fdeb2e00929287d7f2dc9ac6/sim/zipsw/cputest.c&quot;&gt;similar program in
C&lt;/a&gt;.
While I still use &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/a20b6064ea794d66fdeb2e00929287d7f2dc9ac6/sim/zipsw/cputest.c&quot;&gt;this
program&lt;/a&gt;,
it’s no longer the core of the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s
verification suite.  Instead, I tend to use it to shake out any bugs in any
new environment the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; might be
placed into.&lt;/p&gt;

&lt;p&gt;This approach failed horribly, however, when I tried integrating an &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/a20b6064ea794d66fdeb2e00929287d7f2dc9ac6/rtl/core/pfcache.v&quot;&gt;instruction
cache&lt;/a&gt;
into the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;.  I built the
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/a20b6064ea794d66fdeb2e00929287d7f2dc9ac6/rtl/core/pfcache.v&quot;&gt;instruction
cache&lt;/a&gt;.
I tested the &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/a20b6064ea794d66fdeb2e00929287d7f2dc9ac6/rtl/core/pfcache.v&quot;&gt;instruction
cache&lt;/a&gt;
in isolation.  I tested the
&lt;a href=&quot;https://github.com/ZipCPU/zipcpu/blob/a20b6064ea794d66fdeb2e00929287d7f2dc9ac6/rtl/core/pfcache.v&quot;&gt;cache&lt;/a&gt;
as part of the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;.  I convinced myself that it worked.
Then I placed my “working” design onto hardware and &lt;a href=&quot;/zipcpu/2017/12/28/ugliest-bug.html&quot;&gt;all
hell broke loose&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This was certainly not “the way.”&lt;/p&gt;

&lt;h2 id=&quot;formal-verification&quot;&gt;Formal Verification&lt;/h2&gt;

&lt;p&gt;I was then asked to &lt;a href=&quot;/blog/2017/10/19/formal-intro.html&quot;&gt;review a new, open source, formal verification tool called
SymbiYosys&lt;/a&gt;.  The tool
handed my cocky attitude back to me, and took my pride down a couple steps.  In
particular, I found a bunch of bugs in a FIFO I had used for years.  The bugs
had never shown up in hardware testing (that I had noticed at least), and
certainly hadn’t shown up in any of my &lt;a href=&quot;https://en.wikipedia.org/wiki/Happy_path&quot;&gt;“Happy path”
testing&lt;/a&gt;.  This left me wondering,
how many other bugs did I have in my designs that I didn’t know about?&lt;/p&gt;

&lt;p&gt;I then started &lt;a href=&quot;/blog/2018/01/22/formal-progress.html&quot;&gt;working through my previous projects, formally verifying all my
prior work&lt;/a&gt;.  In every
case, I found more bugs.  By the time I got to the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;, &lt;a href=&quot;/blog/2018/04/02/formal-cpu-bugs.html&quot;&gt;I found a myriad of bugs
in what I thought was a “working”&lt;/a&gt;
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I’d like to say that the quality of my IP went up at this point.  I was
certainly finding a lot of bugs I’d never found before by using formal methods.
I now knew, for example, how to guarantee I’d never have any more of those
cache bugs I’d had before.&lt;/p&gt;

&lt;p&gt;So, while it is likely that my IP quality was going up, the unfortunate
reality was that I was still finding bugs in my “formally verified”
IP–although not nearly as many.&lt;/p&gt;

&lt;p&gt;A &lt;a href=&quot;/formal/2020/06/12/four-keys.html&quot;&gt;couple of improvements&lt;/a&gt;
helped me move forward here.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Bidirectional formal property sets&lt;/p&gt;

    &lt;p&gt;The biggest danger in formal verification is that you might &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;assume()&lt;/code&gt;
something that isn’t true.  The first way to limit this is to make
sure you never &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;assume()&lt;/code&gt; a property within the design, but rather you
only &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;assume()&lt;/code&gt; properties of inputs–never outputs, and never local
registers.&lt;/p&gt;

    &lt;p&gt;But how do you know when you’ve assumed too much?  This can be a challenge.&lt;/p&gt;

    &lt;p&gt;One of the best ways I’ve found to do this is to create a bidirectional
property set.  A bus master, for example, would make assumptions about
how the slave would respond.  A similar property set for the bus slave
would make assumptions about what the master would do.  Further, the slave
would turn the master’s assumptions into verifiable assertions–guaranteeing
that the master’s assumptions were valid.  If you can use the same property
set in this manner for both master and slave, save that you swap
assumptions and assertions, then you can verify each in isolation while
only assuming those things that are verified elsewhere.&lt;/p&gt;

    &lt;p&gt;Creating such property sets for both AXI-Lite and AXI led me to find
many bugs in Xilinx IP.  This alone suggested that I was on the “right path”.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Cover checking&lt;/p&gt;

    &lt;p&gt;I also learned to use &lt;a href=&quot;/formal/2018/07/14/dev-cycle.html&quot;&gt;formal coverage
checking&lt;/a&gt;, in
addition to straight assertion
based verification.  Cover checks weren’t the end all, but they could
be useful in some key situations.  For example, a quick cover check might
help you discover that you had gotten the reset polarity wrong, and so
all of your formal assertions were passing because your design was assumed
to be held in reset.  (This has happened to me more than once.  Most
recently, the &lt;a href=&quot;/blog/2024/06/13/kimos.html&quot;&gt;cost was a couple of months
delay&lt;/a&gt; on what should’ve
otherwise been a straightforward hardware bringup–but that wasn’t really
a &lt;em&gt;formal&lt;/em&gt; verification issue.)&lt;/p&gt;

    &lt;p&gt;For a while, I also &lt;a href=&quot;/formal/2018/07/14/dev-cycle.html&quot;&gt;used cover checking to quickly discover (with minimal
work) how a design component might work within a larger
environment&lt;/a&gt;.  I’ve
since switched to simulation checking (with assertions enabled) for my
most recent examples of this type of work, but I do still find it valuable.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;/blog/2018/03/10/induction-exercise.html&quot;&gt;Induction&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;&lt;a href=&quot;/blog/2018/03/10/induction-exercise.html&quot;&gt;Induction&lt;/a&gt; isn’t
really a “new” thing I learned along the way, but it is worth mentioning
specially.  As I learned formal verification, I learned to use
&lt;a href=&quot;/blog/2018/03/10/induction-exercise.html&quot;&gt;induction&lt;/a&gt;
right from the start and so I’ve tended to use
&lt;a href=&quot;/blog/2018/03/10/induction-exercise.html&quot;&gt;induction&lt;/a&gt;
in every proof I’ve ever done.  It’s just become my normal practice from day
one.&lt;/p&gt;

    &lt;p&gt;&lt;a href=&quot;/blog/2018/03/10/induction-exercise.html&quot;&gt;Induction&lt;/a&gt;,
however, takes a lot of work.  Sometimes it takes so much work I wonder
if there’s really any value in it.  Then I tend to find some key bug or
other–perhaps a buffer overflow or something–some bug I’d have never found
without
&lt;a href=&quot;/blog/2018/03/10/induction-exercise.html&quot;&gt;induction&lt;/a&gt;.
That alone keeps me running
&lt;a href=&quot;/blog/2018/03/10/induction-exercise.html&quot;&gt;induction&lt;/a&gt;
every time I can.  Even better, once the
&lt;a href=&quot;/blog/2018/03/10/induction-exercise.html&quot;&gt;induction&lt;/a&gt;
proof is complete, you can often &lt;a href=&quot;/formal/2019/08/03/proof-duration.html&quot;&gt;trim the entire formal proof from
15-20 minutes down to less than a single
minute&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Contract checking&lt;/p&gt;

    &lt;p&gt;My initial formal proofs were haphazard.  I’d throw assertions at the wall
and see what I could find.  Yes, I found bugs.  However, I never really had
the confidence that I was “proving” a design worked.  That is, not until I
learned of the idea of a “formal contract”.  The “formal contract” simply
describes the essence of how a component works.&lt;/p&gt;

    &lt;p&gt;For example, in a memory system, the formal contract might have the solver
track a single value of memory.  When written to, the value should change.
When read, the value should be returned.  If this contract holds for all such
memory addresses, then the memory acts (as you would expect) … like a
&lt;em&gt;memory&lt;/em&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Parameter checks&lt;/p&gt;

    &lt;p&gt;For a while, I was maintaining &lt;a href=&quot;https://github.com/ZipCPU/zbasic&quot;&gt;“ZBasic”–a basic ZipCPU
distribution&lt;/a&gt;.  This was where I did all
my simulation based testing of the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;.  The problem was, this
approach didn’t work.  Sure, I’d test the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt; in one configuration, get it
to work, and then put it down believing the
“&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;” worked.  Some time later,
I’d try the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt; in a different
configuration–such as pipelined vs non-pipelined, and … it
would fail in whatever mode it had not been tested in.  The problem with the
&lt;a href=&quot;https://github.com/ZipCPU/zbasic&quot;&gt;ZBasic approach&lt;/a&gt; is that it tended to only
check one mode–leaving all of the others unchecked.&lt;/p&gt;

    &lt;p&gt;This led me to adjust the proofs of the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; so that the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt; would at least be formally
verified with as many parameter configurations as I could to make sure it
would work in all environments.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
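
&lt;p&gt;As a rough illustration of the “formal contract” idea from the list above,
here is a minimal sketch of a memory contract.  The module, its port names, and
the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f_&lt;/code&gt; prefixed registers are all made up for this example, and the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(* anyconst *)&lt;/code&gt; attribute assumes a Yosys based flow such as
&lt;a href=&quot;https://github.com/YosysHQ/sby&quot;&gt;SymbiYosys&lt;/a&gt;.&lt;/p&gt;

&lt;div class=&quot;language-verilog highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;module memsketch #(
        parameter AW = 4, DW = 8
    ) (
        input   wire            i_clk,
        input   wire            i_we,
        input   wire [AW-1:0]   i_addr,
        input   wire [DW-1:0]   i_data,
        output  reg  [DW-1:0]   o_data
    );

    // A simple block RAM: one write port, one registered read port
    reg [DW-1:0]    mem [0:(1&amp;lt;&amp;lt;AW)-1];

    always @(posedge i_clk)
    if (i_we)
        mem[i_addr] &amp;lt;= i_data;

    always @(posedge i_clk)
        o_data &amp;lt;= mem[i_addr];

`ifdef  FORMAL
    // The solver picks one arbitrary (but constant) address to track
    (* anyconst *)  reg [AW-1:0]    f_addr;
    reg [DW-1:0]    f_data;
    reg             f_valid, f_past_valid;

    initial f_past_valid = 1'b0;
    always @(posedge i_clk)
        f_past_valid &amp;lt;= 1'b1;

    // When written to, the tracked value should change
    initial f_valid = 1'b0;
    always @(posedge i_clk)
    if (i_we &amp;amp;&amp;amp; i_addr == f_addr)
    begin
        f_data  &amp;lt;= i_data;
        f_valid &amp;lt;= 1'b1;
    end

    // Once written, the memory must always agree with the tracked value
    always @(*)
    if (f_valid)
        assert(mem[f_addr] == f_data);

    // When read, the tracked value should be returned
    always @(posedge i_clk)
    if (f_past_valid &amp;amp;&amp;amp; $past(f_valid) &amp;amp;&amp;amp; $past(i_addr) == f_addr)
        assert(o_data == $past(f_data));
`endif
endmodule
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Since the solver is free to choose any value for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f_addr&lt;/code&gt;, a proof that
holds for this one address holds for every address.  Note also that this
sketch contains no &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;assume()&lt;/code&gt; statements at all, since the tracked value is
only checked once it has been written.&lt;/p&gt;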

&lt;p&gt;I’ve written more about &lt;a href=&quot;/formal/2020/06/12/four-keys.html&quot;&gt;these parts of a proof some time
ago&lt;/a&gt;, and I still stand
by them today.&lt;/p&gt;

&lt;p&gt;Yes, formal verification is hard work.  However, a well verified design is
highly valuable–on the shelf, waiting for that new customer requirement to
come in.&lt;/p&gt;

&lt;p&gt;The problem with all this formal verification work lies in its (well known)
Achilles heel.  Because formal verification includes an exhaustive
combinatorial search for bugs across all potential design inputs and states,
it can be computationally expensive.  Yeah, it can take a while.  To reduce
this expense, it’s important to limit the scope of what is verified.  As a
result, I tend to verify design &lt;em&gt;components&lt;/em&gt; rather than entire designs.  This
leaves open the possibility of a failure in the logic used to connect all
these smaller, verified components together.&lt;/p&gt;

&lt;h2 id=&quot;autofpga-and-better-crossbars&quot;&gt;AutoFPGA and Better Crossbars&lt;/h2&gt;

&lt;p&gt;Sure enough, the next class of bugs I had to deal with were integration bugs.&lt;/p&gt;

&lt;p&gt;I had to deal with several.  Common bugs included:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Using unnamed ports, and connecting module ports to the wrong signals.&lt;/p&gt;

    &lt;p&gt;At one point, I decided the
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
“stall” port should come before the
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
acknowledgment port.  Now, how many designs had to change to accommodate
that?&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;I had a bunch of problems with my &lt;a href=&quot;/blog/2017/06/22/simple-wb-interconnect.html&quot;&gt;initial interconnect
design&lt;/a&gt;
methodology.  Initially, I used the slave’s
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
strobe signal as an address decoding signal.  I then had a bug where the
address would move off of the slave of interest, and the acknowledgment
was never returned.  The result of that bug was that the design hung any
time I tried to read the entirety of &lt;a href=&quot;/blog/2019/03/27/qflexpress.html&quot;&gt;flash
memory&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;Think about how much simulation time and effort I had to go through to
simulate reading an &lt;em&gt;entire&lt;/em&gt; &lt;a href=&quot;/blog/2019/03/27/qflexpress.html&quot;&gt;flash
memory&lt;/a&gt;–just to find
this bug at the end of it.  Yes, it was painful.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Basically, when connecting otherwise “verified” modules together by hand,
I had problems where the result wasn’t reliably working.&lt;/p&gt;

&lt;p&gt;The first and most obvious solution to something like this is to use a linting
tool, such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;verilator -Wall&lt;/code&gt;. 
&lt;a href=&quot;https://www.veripool.org/verilator/&quot;&gt;Verilator&lt;/a&gt; can find things like
unconnected pins and such.  That’s a help, but I had been doing that from
early on.&lt;/p&gt;

&lt;p&gt;My eventual solution was twofold.  First, I redesigned my &lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;bus
interconnect&lt;/a&gt; from the
top to the bottom.  You can find the new and redesigned
&lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;interconnect&lt;/a&gt; components
in my &lt;a href=&quot;https://github.com/ZipCPU/wb2axip&quot;&gt;wb2axip repository&lt;/a&gt;.  Once these
components were verified, I then had a proper guarantee: all masters would get
acknowledgments (or errors) from all slave requests they made.  Errors would
no longer be lost.  Attempts to interact with a non-existent slave would
(properly) return bus errors.&lt;/p&gt;

&lt;p&gt;To deal with problems where signals were connected incorrectly, I built a tool
I call &lt;a href=&quot;/zipcpu/2017/10/05/autofpga-intro.html&quot;&gt;AutoFPGA&lt;/a&gt; to
connect components into designs.  A special tag given to the tool would
immediately connect all bus signals to a bus component–whether it be a slave
or master, whether it be connected to a
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;,
&lt;a href=&quot;/formal/2018/12/28/axilite.html&quot;&gt;AXI-Lite&lt;/a&gt;, or
&lt;a href=&quot;/formal/2019/05/13/axifull&quot;&gt;AXI&lt;/a&gt; bus.  This required that my
slaves followed one of two conventions.  Either all the bus ports had to
follow a basic port ordering convention, or they needed to follow a bus
naming convention.  Ideally, a slave should follow both.  Further, after
finding even more port connection bugs, I’m slowly moving towards the practice
of naming all of my port connections.&lt;/p&gt;

&lt;p&gt;This works great for composing designs of bus components.  Almost all of my
designs now use this approach, and only a few (mostly test bench) designs
remain where I connect bus components by hand.&lt;/p&gt;
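
&lt;p&gt;To illustrate the port naming practice mentioned above, consider the
following sketch.  Both modules and all of their port names are hypothetical,
and the slave is only a stand-in that acknowledges every request.  The point
is the difference between the commented out positional instantiation and the
named one.&lt;/p&gt;

&lt;div class=&quot;language-verilog highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// A hypothetical Wishbone slave, with stall listed before ack
module wbslave_sketch (
        input   wire    i_clk,
        input   wire    i_wb_stb,
        output  wire    o_wb_stall,
        output  reg     o_wb_ack
    );

    // This sketch never stalls, and acknowledges each request one clock later
    assign  o_wb_stall = 1'b0;

    initial o_wb_ack = 1'b0;
    always @(posedge i_clk)
        o_wb_ack &amp;lt;= i_wb_stb &amp;amp;&amp;amp; !o_wb_stall;
endmodule

module top_sketch (
        input   wire    i_clk,
        input   wire    i_stb,
        output  wire    o_stall,
        output  wire    o_ack
    );

    // Positional (unnamed) connections depend on the port order.  Written
    // for an older ordering with ack before stall, the line below would
    // still compile, but with stall and ack silently swapped:
    //
    //   wbslave_sketch u_slave (i_clk, i_stb, o_ack, o_stall);

    // Named connections keep working no matter how the ports are ordered
    wbslave_sketch u_slave (
        .i_clk(i_clk),
        .i_wb_stb(i_stb),
        .o_wb_stall(o_stall),
        .o_wb_ack(o_ack)
    );
endmodule
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Because both swapped signals are a single bit wide, a linting tool has
nothing to complain about in the positional version.&lt;/p&gt;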

&lt;h2 id=&quot;mcy&quot;&gt;MCY&lt;/h2&gt;

&lt;p&gt;At one point along the way, I was asked to review &lt;a href=&quot;https://github.com/YosysHQ/mcy&quot;&gt;MCY: Mutation Coverage with
Yosys&lt;/a&gt;.  My review back to the team was …
mixed.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/YosysHQ/mcy&quot;&gt;MCY&lt;/a&gt;
works by intentionally breaking your design.  Such changes to the design are
called “mutations”.  The goal is to determine whether or not the mutated
(broken) design will trigger a test failure.  In this fashion, the test suite
can be evaluated.  A “good” test suite will be able to find any mutation.
Hence, &lt;a href=&quot;https://github.com/YosysHQ/mcy&quot;&gt;MCY&lt;/a&gt;
allows you to measure how good your test suite is in the first place.&lt;/p&gt;

&lt;p&gt;Upon request, I tried &lt;a href=&quot;https://github.com/YosysHQ/mcy&quot;&gt;MCY&lt;/a&gt; with the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;.  This turned into a bigger
challenge than I had expected.  Sure, &lt;a href=&quot;https://github.com/YosysHQ/mcy&quot;&gt;MCY&lt;/a&gt;
works with &lt;a href=&quot;https://github.com/steveicarus/iverilog&quot;&gt;Icarus Verilog&lt;/a&gt;,
&lt;a href=&quot;https://www.veripool.org/verilator/&quot;&gt;Verilator&lt;/a&gt;, and even (perhaps) some other
(not so open) simulators as well.  However, when I ran a design under
&lt;a href=&quot;https://github.com/YosysHQ/mcy&quot;&gt;MCY&lt;/a&gt;, my simulations tended to find only
roughly 70% of the mutations.  The formal proofs, however, could find 95-98% of
them.&lt;/p&gt;

&lt;p&gt;That’s good, right?&lt;/p&gt;

&lt;p&gt;Well, not quite.  The problem is that I tend to place all of my formal
logic in the same file as the component that would be mutated.  In order to
keep the mutation engine from mutating the formal properties, I had to remove
the formal properties from the file to be mutated into a separate file.
Further, I then had to reach into the file under test from the outside to access
the values that were to be assumed or asserted, using something often known as “dot notation”.
While (System)Verilog allows such descriptions natively, there weren’t any open
source tools that allowed such external formal property descriptions.
(Commercial tools allowed this, just not the open source
&lt;a href=&quot;https://github.com/YosysHQ/sby&quot;&gt;SymbiYosys&lt;/a&gt;.) This left me stuck with a couple
of unpleasant choices:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;I could remove the ability of the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
(or whatever design) to be formally verified with Open Source tools,&lt;/li&gt;
  &lt;li&gt;I could give up on using
&lt;a href=&quot;/blog/2018/03/10/induction-exercise.html&quot;&gt;induction&lt;/a&gt;,&lt;/li&gt;
  &lt;li&gt;I could use &lt;a href=&quot;https://github.com/YosysHQ/mcy&quot;&gt;MCY&lt;/a&gt; with simulation only, or&lt;/li&gt;
  &lt;li&gt;I could choose to not use &lt;a href=&quot;https://github.com/YosysHQ/mcy&quot;&gt;MCY&lt;/a&gt; at all.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is why I don’t use &lt;a href=&quot;https://github.com/YosysHQ/mcy&quot;&gt;MCY&lt;/a&gt;.  It may be a
“good” tool, but it’s just not for me.&lt;/p&gt;

&lt;p&gt;What I did learn, however, was that my
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; test suite was checking the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;’s functionality nicely–just not
the debugging port.  Indeed, none of my tests checked the debugging port to the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt;
at all.  As a result, none of the (simulation-based) mutations of the
debugging port were ever caught.&lt;/p&gt;

&lt;p&gt;Lesson learned?  My test suite still wasn’t good enough.  Sure, the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;CPU&lt;/a&gt; might
“work” today, but how would I know some change in the future wouldn’t break it?&lt;/p&gt;

&lt;p&gt;I needed a better way of knowing whether or not my test suite was good enough.&lt;/p&gt;

&lt;h2 id=&quot;coverage-checking&quot;&gt;Coverage Checking&lt;/h2&gt;

&lt;p&gt;Sometime during this process I discovered
&lt;a href=&quot;https://en.wikipedia.org/wiki/Code_coverage&quot;&gt;coverage checking&lt;/a&gt;.
&lt;a href=&quot;https://en.wikipedia.org/wiki/Code_coverage&quot;&gt;Coverage checking&lt;/a&gt;
is a process of automatically watching over all of your simulation based tests
to see which lines get executed and which do not.  Depending on the tool,
coverage checks can also tell whether particular signals are ever flipped or
adjusted during simulation.  A good coverage check, therefore, can provide
some level of indication of whether or not all control paths within a design
have been exercised, and whether or not all signals have been toggled.&lt;/p&gt;

&lt;p&gt;Coverage metrics are actually kind of nice in this regard.&lt;/p&gt;

&lt;p&gt;Sadly, coverage checking isn’t as good as mutation coverage, but … it’s
better than nothing.&lt;/p&gt;

&lt;p&gt;Consider a classic coverage failure: many of my simulations check for
AXI &lt;a href=&quot;https://en.wikipedia.org/wiki/Back_pressure&quot;&gt;backpressure&lt;/a&gt;.  Such
&lt;a href=&quot;https://en.wikipedia.org/wiki/Back_pressure&quot;&gt;backpressure&lt;/a&gt; is generated when
either &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BVALID &amp;amp;&amp;amp; !BREADY&lt;/code&gt;, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RVALID &amp;amp;&amp;amp; !RREADY&lt;/code&gt;.  If your design is to
follow the AXI specification, it should be able to handle
&lt;a href=&quot;https://en.wikipedia.org/wiki/Back_pressure&quot;&gt;backpressure&lt;/a&gt;
properly.  That is, if you hold &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;!BREADY&lt;/code&gt; long enough, it should be possible
to force &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;!AWREADY&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;!WREADY&lt;/code&gt;.  Likewise, it should be possible to hold
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RREADY&lt;/code&gt; low long enough that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ARREADY&lt;/code&gt; gets held low.  A well verified,
bug-free design should be able to deal with these conditions.&lt;/p&gt;

&lt;p&gt;However, a “good” design should never create any significant
&lt;a href=&quot;https://en.wikipedia.org/wiki/Back_pressure&quot;&gt;backpressure&lt;/a&gt;.
Hence, if you build a simulation environment from “good” working components,
you aren’t likely to see much
&lt;a href=&quot;https://en.wikipedia.org/wiki/Back_pressure&quot;&gt;backpressure&lt;/a&gt;.  How then should a
component’s &lt;a href=&quot;https://en.wikipedia.org/wiki/Back_pressure&quot;&gt;backpressure&lt;/a&gt;
capability be tested?&lt;/p&gt;

&lt;p&gt;My current solution here is to test
&lt;a href=&quot;https://en.wikipedia.org/wiki/Back_pressure&quot;&gt;backpressure&lt;/a&gt;
via formal methods, with the unfortunate consequence that some conditions
will never get tested under simulation.  The result is that I’ll never get
to 100% coverage with this approach.&lt;/p&gt;
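
&lt;p&gt;As a sketch of what testing backpressure formally might look like, consider
the toy component below.  It is not an AXI slave.  It has only a single
request channel and a single response channel, and its name and port names are
invented for this example.  The assertion checks that a stalled response
stalls new requests, and the cover statement asks the solver to demonstrate
that this condition can actually be reached.&lt;/p&gt;

&lt;div class=&quot;language-verilog highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;module bpsketch (
        input   wire    i_clk, i_reset,
        // Request channel (think AWVALID / AWREADY)
        input   wire    i_awvalid,
        output  wire    o_awready,
        // Response channel (think BVALID / BREADY)
        output  reg     o_bvalid,
        input   wire    i_bready
    );

    // Accept a new request only if the response slot is empty or draining
    assign  o_awready = !o_bvalid || i_bready;

    initial o_bvalid = 1'b0;
    always @(posedge i_clk)
    if (i_reset)
        o_bvalid &amp;lt;= 1'b0;
    else if (i_awvalid &amp;amp;&amp;amp; o_awready)
        o_bvalid &amp;lt;= 1'b1;
    else if (i_bready)
        o_bvalid &amp;lt;= 1'b0;

`ifdef  FORMAL
    // Backpressure: a stalled response must stall any new request
    always @(*)
    if (o_bvalid &amp;amp;&amp;amp; !i_bready)
        assert(!o_awready);

    // Ask the solver to show a request being held off by backpressure
    always @(posedge i_clk)
    if (!i_reset)
        cover(o_bvalid &amp;amp;&amp;amp; !i_bready &amp;amp;&amp;amp; i_awvalid &amp;amp;&amp;amp; !o_awready);
`endif
endmodule
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The formal solver is free to hold &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_bready&lt;/code&gt; low for as long as it likes,
which is exactly the condition a simulation built from well behaved components
rarely generates.&lt;/p&gt;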

&lt;p&gt;A second problem with coverage regards the unused signals.  For example,
AXI-Lite has two signals, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AWPROT&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ARPROT&lt;/code&gt;, that are rarely used by
any of my designs.  However, they are official AXI-Lite (and AXI) signals.
As a result,
&lt;a href=&quot;/zipcpu/2017/10/05/autofpga-intro.html&quot;&gt;AutoFPGA&lt;/a&gt;
will always try to connect them to an AXI-Lite (or AXI) port, yet none of my
designs use these.  This leads to another set of exceptions that needs to be
made when measuring coverage.&lt;/p&gt;

&lt;p&gt;So, coverage metrics aren’t perfect.  Still, they can help me find
what parts of the design are (and are not) being tested well.  This can then
help feed into better (and more complete) test design.&lt;/p&gt;

&lt;p&gt;That’s the good news.  Now let’s talk about some of the not so good parts.&lt;/p&gt;

&lt;p&gt;When learning formal verification, I spent some time formally verifying
Xilinx IP.  After finding several bugs, I spoke to a Xilinx executive
regarding how they verified their IP.  Did they use formal methods?  No.
Did they use their own AXI Verification IP?  No.  Yet, they were very proud of
how well they had verified their IP.  Specifically, their executive bragged
about how good their coverage metrics were, and the number of test points
checked for each IP.&lt;/p&gt;

&lt;p&gt;Hmm.&lt;/p&gt;

&lt;p&gt;So, let me get this straight: Xilinx IP gets good coverage metrics, and hits
a large number of test points, yet still has bugs within it that I can find
via formal methods?&lt;/p&gt;

&lt;p&gt;Okay, so … how severe are these bugs?  In one case, the bugs would totally
break the AXI bus and bring the system containing the IP down to a screeching
halt–if the bug were ever tripped.  For example, if the system requested both
a read burst and a write burst at the same time, one particular slave might
accomplish the read burst with the length of the write burst–or vice versa.
(It’s been a while, so I’d have to look up the details to be exact regarding
them.)  In another case dealing with a network controller, it was possible
to receive a network packet, capture that packet correctly, and then return
a corrupted packet simply because the &lt;a href=&quot;/blog/2021/08/28/axi-rules.html&quot;&gt;AXI bus
handshakes&lt;/a&gt; weren’t properly
implemented.  To this day these bugs have not been fixed, and it’s nearly five
years later.&lt;/p&gt;

&lt;p&gt;Put simply, if it is possible for an IP to lock up your system completely,
then that IP shouldn’t be trusted until the bug is fixed.&lt;/p&gt;

&lt;p&gt;How then did Xilinx manage to convince themselves that their IP was high
quality?  By “good” coverage metrics.&lt;/p&gt;

&lt;p&gt;Lesson learned?  &lt;a href=&quot;https://en.wikipedia.org/wiki/Code_coverage&quot;&gt;Coverage
checking&lt;/a&gt; is a good thing, and it
can reveal holes in any simulation-based verification suite.  It’s just not
good enough on its own to find all of what you are missing.&lt;/p&gt;

&lt;p&gt;My conclusion?  Formal verification, followed by a simulation test suite that
evaluates coverage statistics, is something to pay attention to, but not the
end-all be-all.  One tool isn’t enough.  Many tools are required.&lt;/p&gt;

&lt;h2 id=&quot;self-checking-testbenches&quot;&gt;Self-Checking Testbenches&lt;/h2&gt;

&lt;p&gt;I then got involved with ASIC design.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;/blog/2017/10/13/fpga-v-asic.html&quot;&gt;ASIC design differs from FPGA design in a couple of
ways&lt;/a&gt;.  Chief among them
is the fact that the ASIC design must work the first time.  There’s little to
no room for error.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 1. A typical verification environment&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/vjourney/verilogtb.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;When working with my first ASIC design, I was introduced to a more formalized
simulation flow.  Let me explain it this way, looking at Fig. 1.  Designs
tend to have two interfaces: a bus interface, together with a device I/O
interface.  A test script can then be used to drive some form of bus functional
model, which will then control the design under test via its bus interface.  A
device model would then mimic the device the design was intended to talk to.
When done well, the test script would evaluate the values returned by the
design–after interacting with the device, and declare “success” or “failure”.&lt;/p&gt;

&lt;p&gt;Here’s the key to this setup: I can run many different tests from this starting
point by simply changing the test script and nothing else.&lt;/p&gt;

&lt;p&gt;For example, let’s imagine an external memory controller.  A “good” memory
controller should be able to accept any bus request, convert it into
I/O wires to interact with the external memory, and then return a response from
the memory.  Hence, it should be possible to first write to the external memory
and then (later) read from the same external memory.  Whatever is then read
should match what was written previously.  This is the minimum test
case–measuring the “contract” with the memory.&lt;/p&gt;
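
&lt;p&gt;A minimal sketch of such a self-checking test script might look like the
following.  Everything here is hypothetical.  In particular, the two bus tasks
write directly into an array that merely stands in for the bus functional
model, the memory controller, and the external memory model of a real test
bench.&lt;/p&gt;

&lt;div class=&quot;language-verilog highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;module tb_selfcheck;
    // Stand-in for the controller under test plus its device model
    reg [31:0]  model_mem [0:15];
    reg [31:0]  readback;
    integer     k, errors;

    // Hypothetical bus functional model tasks
    task bus_write(input [3:0] addr, input [31:0] data);
        begin
            model_mem[addr] = data;
        end
    endtask

    task bus_read(input [3:0] addr, output [31:0] data);
        begin
            data = model_mem[addr];
        end
    endtask

    // The test script: write everything, read it back, then judge
    initial begin
        errors = 0;

        for (k = 0; k &amp;lt; 16; k = k + 1)
            bus_write(k[3:0], 32'hdead0000 + k);

        for (k = 0; k &amp;lt; 16; k = k + 1)
        begin
            bus_read(k[3:0], readback);
            if (readback !== (32'hdead0000 + k))
            begin
                $display(&quot;ERROR: address %0d returned %08x&quot;, k, readback);
                errors = errors + 1;
            end
        end

        // Every test case must end with an unambiguous verdict
        if (errors == 0)
            $display(&quot;PASS&quot;);
        else
            $display(&quot;FAIL&quot;);
        $finish;
    end
endmodule
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In a real environment, only the contents of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;initial&lt;/code&gt; block would
change from one test case to the next.&lt;/p&gt;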

&lt;p&gt;Other test cases might evaluate this contract across all of the modes the
memory supports.  Still other cases might attempt to trigger all of the faults
the design is supposed to be able to handle.  The only difference between these
many test cases would then be their test scripts.  Again, you can measure
whether or not the test cases are sufficient using coverage measures.&lt;/p&gt;

&lt;p&gt;The key here is that all of the test cases must produce either a “pass” or
“fail” result.  That is, they must be self-checking.  Now, using self checking
test cases, I can verify (via simulation) something like the 
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; across all of its instructions,
in SMP and single CPU environments, using the DMA (or not), and so forth.
Indeed, the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s test environment
takes this approach one step farther, by not just changing the test script
(in this case a &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; software program)
but also the configuration of the test environment as well.  This allows me
to make sure the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; will continue
to work in 32b, 64b, or even wider bus environments in a single test suite.&lt;/p&gt;

&lt;p&gt;Yes, this was a problem I was having before I adopted this methodology: I’d
test the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; with a 32b bus, and then
deploy the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; to a board whose
memory was 64b wide or wider.  The &lt;a href=&quot;https://github.com/ZipCPU/kimos&quot;&gt;Kimos
project&lt;/a&gt;, for example, has a 512b bus.  Now
that I run test cases on multiple bus widths, I have the confidence that I
can easily adjust the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; from one
bus width to another.&lt;/p&gt;

&lt;p&gt;This is as far as I’ve now come in my verification journey.  I now use
formal tests, simulation tests, coverage checking, and a self-checking test
suite on new design components.  Is this perfect?  No, but at least it’s more
rigorous and repeatable than where I started from.&lt;/p&gt;

&lt;h2 id=&quot;next-steps-softwarehardware-interaction&quot;&gt;Next Steps: Software/Hardware interaction&lt;/h2&gt;

&lt;p&gt;The testing regimen discussed above continues to have a very large and
significant hole: I can’t test software drivers very well.&lt;/p&gt;

&lt;p&gt;Consider as an example my &lt;a href=&quot;https://github.com/ZipCPU/sdsdpi&quot;&gt;SD card
controller&lt;/a&gt;.  The 
&lt;a href=&quot;https://github.com/ZipCPU/sdsdpi&quot;&gt;repository&lt;/a&gt; actually contains three
controllers: &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdspi.v&quot;&gt;one for interacting with SD cards via their SPI
interface&lt;/a&gt;, &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdio_top.v&quot;&gt;one via
the SDIO interface&lt;/a&gt;,
and a third for use with eMMC cards (&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdio_top.v&quot;&gt;using the SDIO
interface&lt;/a&gt;).
The &lt;a href=&quot;https://github.com/ZipCPU/sdsdpi&quot;&gt;repository&lt;/a&gt; contains formal proofs
for all leaf modules, and two types of SD card models–a &lt;a href=&quot;https://github.com/ZipCPU/blob/master/bench/cpp/sdspi.cpp&quot;&gt;C++ model for
SPI&lt;/a&gt; and all Verilog
models for
&lt;a href=&quot;https://github.com/ZipCPU/blob/master/sdspi/bench/verilog/mdl_sdio.v&quot;&gt;SDIO&lt;/a&gt; and
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/bench/verilog/mdl_emmc.v&quot;&gt;eMMC&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This controller IP also contains a set of &lt;a href=&quot;https://github.com/ZipCPU/sdspi/tree/master/sw&quot;&gt;software
drivers&lt;/a&gt; for use when working
with SD cards.  Ideally, these drivers should be tested together with the
&lt;a href=&quot;https://github.com/ZipCPU/sdsdpi&quot;&gt;SD card controller(s)&lt;/a&gt;, so they could be
verified together.&lt;/p&gt;

&lt;p&gt;Recently, for example, I added a &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sddma.v&quot;&gt;DMA
capability&lt;/a&gt; to the
&lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone&lt;/a&gt;
version of &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sdio.v&quot;&gt;the SDIO (and eMMC)
controller(s)&lt;/a&gt;.  This
(new) &lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/rtl/sddma.v&quot;&gt;DMA
capability&lt;/a&gt;
then necessitated quite a few changes to the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/tree/master/sw&quot;&gt;control software&lt;/a&gt;, so that it
could take advantage of it.  With no tests, how well do you think
&lt;a href=&quot;https://github.com/ZipCPU/sdspi/blob/master/sw/sdiodrv.c&quot;&gt;this software&lt;/a&gt;
worked when I first tested it in hardware?&lt;/p&gt;

&lt;p&gt;It didn’t.&lt;/p&gt;

&lt;p&gt;So, for now, the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/tree/master/sw&quot;&gt;software
directory&lt;/a&gt; simply holds the
software I will copy to other designs and test in actual hardware.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 2. Software driven test bench&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/cpusim/softwaretb.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;The problem is, testing the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/tree/master/sw&quot;&gt;software
directory&lt;/a&gt; requires many
design components beyond just the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SD card controllers&lt;/a&gt; that would be under test.
It requires memory, a console port, a CPU, and the CPU’s tool chain–all in
addition to the &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;design&lt;/a&gt; under test.
These extra components aren’t a part of the &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SD controller
repository&lt;/a&gt;, nor perhaps should they be.  How
then should these &lt;a href=&quot;https://github.com/ZipCPU/sdspi/tree/master/sw&quot;&gt;software
drivers&lt;/a&gt; be tested?&lt;/p&gt;

&lt;p&gt;Necessity breeds invention, so I’m sure I’ll eventually solve this problem.
This is just as far as I’ve gotten so far.&lt;/p&gt;

&lt;h2 id=&quot;automated-testing&quot;&gt;Automated testing&lt;/h2&gt;

&lt;p&gt;At any rate, I submitted this
&lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;repository&lt;/a&gt; to an automated continuous
integration facility the team I was working with was testing.  The utility
leans heavily on the existence of a variety of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;make test&lt;/code&gt; capabilities within
the &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;repository&lt;/a&gt;, and so the
&lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SD Card repository&lt;/a&gt; was a good fit for
testing.  Along the way, I needed some help from the test facility engineer to
get &lt;a href=&quot;https://github.com/YosysHQ/sby&quot;&gt;SymbiYosys&lt;/a&gt;,
&lt;a href=&quot;https://github.com/steveicarus/iverilog&quot;&gt;IVerilog&lt;/a&gt; and
&lt;a href=&quot;https://www.veripool.org/verilator/&quot;&gt;Verilator&lt;/a&gt; capabilities installed.  His
response?&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;It’s literally the first time I get to know a good hardware project needs
such many verifications and testings!  There’s even a real SD card
simulation model and RW test…&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Yeah.  Actually, there’s three SD card models–as discussed above.  It’s been
a long road to get to this point, and I’ve certainly learned a lot along the
way.&lt;/p&gt;
&lt;hr /&gt;&lt;p&gt;&lt;em&gt;Watch therefore: for ye know not what hour your Lord doth come. (Matt 24:42)&lt;/em&gt;</description>
        <pubDate>Sat, 06 Jul 2024 00:00:00 -0400</pubDate>
        <link>https://zipcpu.com/formal/2024/07/06/verifjourney.html</link>
        <guid isPermaLink="true">https://zipcpu.com/formal/2024/07/06/verifjourney.html</guid>
        
        
        <category>formal</category>
        
      </item>
    
      <item>
        <title>Debugging video from across the ocean</title>
        <description>&lt;p&gt;I’ve come across two approaches to video synchronization.  The first, used by
a lot of the Xilinx IP I’ve come across, is to hold the video pipeline in
reset until everything is ready and then release the resets (in the right and
proper order) to get the design started.  If something goes wrong, however,
there’s no room for recovery.  The second approach is the approach I like to
use, which is to &lt;a href=&quot;/video/2022/03/14/axis-video.html&quot;&gt;build video components that are inherently
“stable”&lt;/a&gt;: 1) if they
ever lose synchronization, they will naturally work their way back into
synchronization, and 2) once synchronized they will not get out of sync.&lt;/p&gt;

&lt;p&gt;At least that’s the goal.  It’s a great goal, too–when it works.&lt;/p&gt;

&lt;p&gt;Today’s story is about what happens when a “robust” video display isn’t.&lt;/p&gt;

&lt;h2 id=&quot;system-overview&quot;&gt;System Overview&lt;/h2&gt;

&lt;p&gt;Let’s start at the top level: I’m working on building a SONAR device.&lt;/p&gt;

&lt;p&gt;This device will be placed in the water, and it will sample acoustic data.
All of the electronics will be contained within a pressure chamber, with
the only interface to the outside world being a single cable providing both
Ethernet and power.&lt;/p&gt;

&lt;p&gt;Here’s the picture I used to capture this idea when &lt;a href=&quot;/blog/2022/08/24/protocol-design.html&quot;&gt;we discussed the network
protocols that would be required to debug this
device&lt;/a&gt;.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 1. Controlling an Underwater FPGA&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/netbus/sysdesign.svg&quot; alt=&quot;&quot; width=&quot;780&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;This “wet” device will then connect to a “dry” device (kept on land, via
Ethernet) where the sampled data can then be read, stored and processed.&lt;/p&gt;

&lt;p&gt;Now into today’s detail: while my customer has provided no requirement for
real-time processing, there’s arguably a need for it in the lab during the
development testing leading up to the final delivery.  That is, I’d like to be
able to just glance at my lab setup and know that things are working.  For
this reason, I’d like some real-time displays that I can read at a glance.&lt;/p&gt;

&lt;p&gt;So, what do we have available to us to get us closer?&lt;/p&gt;

&lt;h2 id=&quot;display-architecture&quot;&gt;Display Architecture&lt;/h2&gt;

&lt;p&gt;Some time ago, I built several RTL “display” modules to use for this
lab-testing purpose.  In general, these modules take an &lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;AXI stream of incoming
data&lt;/a&gt;,
and they produce an &lt;a href=&quot;/video/2022/03/14/axis-video.html&quot;&gt;AXI video stream for
display&lt;/a&gt;.  At present,
there are only five of these graphics display modules:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_histogram.v&quot;&gt;A histogram display&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;&lt;a href=&quot;/dsp/2019/12/21/histogram.html&quot;&gt;Histograms are exceptionally useful for diagnosing any A/D collection
issues&lt;/a&gt;, so having a live
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_histogram.v&quot;&gt;histogram display&lt;/a&gt;
to provide insight into the sampled data distribution just makes sense.&lt;/p&gt;

    &lt;p&gt;However, &lt;a href=&quot;/dsp/2019/12/21/histogram.html&quot;&gt;histogram&lt;/a&gt;
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_histogram.v&quot;&gt;displays&lt;/a&gt;
need a tremendous dynamic range.  How do you handle that in hardware?  Yeah,
that was part of the challenge when building this display.  It involved
figuring out how to build multiplies and divides without doing either
multiplication or division.  A fun project, though.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_trace.v&quot;&gt;A trace module&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;By “trace”, I mean something to show the time series, such as a plot of
voltage against time.  My big challenge with this display so far has been
the reality that the SONAR A/D chips can produce more data than the eye can
quickly process.&lt;/p&gt;

    &lt;p&gt;Now that we’ve been through a test or two with the hardware, I have a better
idea of what would be valuable here.  As a result, I’m likely going to take
the absolute value of voltages across a significant fraction of a second,
and then use that approach to display a couple of seconds worth of data on
the screen.  Thankfully, my &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_trace.v&quot;&gt;trace display
module&lt;/a&gt; is
quite flexible, and should be able to display anything you give to it by way
of an AXI Stream input.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_waterfall.v&quot;&gt;A falling raster&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;The very first time my wife came to a family day at the office, way back
in the 1995-96 time frame or so, the office had a display set up with a
microphone and a sliding spectral raster.  I was in awe!  You could speak,
and see what your voice “looked” like spectrally over time.  You could hit
the table, whistle, bark, whatever, and every sound you made would look
different.&lt;/p&gt;

    &lt;p&gt;I’ve since &lt;a href=&quot;https://github.com/ZipCPU/fftdemo&quot;&gt;built this kind of capability&lt;/a&gt;
many times over, and even &lt;a href=&quot;/dsp/2020/11/21/spectrogram.html&quot;&gt;studied the best ways to do it from a
mathematical standpoint&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;In the SONAR world, you’ll find this sort of thing really helps you visualize
what’s going on in your data streams–what sounds are your sensors picking
up, what frequencies are they at, etc.  A good raster will let you “see”
motors in the water–all very valuable.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A spectrogram, via the same &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_trace.v&quot;&gt;trace
module&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;This primarily involves plotting the absolute values of the data coming out
of an &lt;a href=&quot;/dsp/2018/10/02/fft.html&quot;&gt;FFT&lt;/a&gt;,
applied to the incoming data.  Thankfully, the &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_trace.v&quot;&gt;trace
module&lt;/a&gt;
is robust enough to handle this kind of input as well.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_split.v&quot;&gt;A split screen display&lt;/a&gt;,
that can place both an &lt;a href=&quot;/dsp/2018/10/02/fft.html&quot;&gt;FFT&lt;/a&gt;
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_trace.v&quot;&gt;trace&lt;/a&gt;
and a falling raster on the same screen.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’ll come back to the split screen display in a bit.  In general, however,
the processing components used within it look (roughly) like Fig.  2 below.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 2. Split display video processing pipeline&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/qoi-debug/split-pipeline.svg&quot; alt=&quot;&quot; width=&quot;780&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Making this happen required some other behind the scenes components as well,
to include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_empty.v&quot;&gt;An empty video generator&lt;/a&gt;–to
generate an &lt;a href=&quot;/video/2022/03/14/axis-video.html&quot;&gt;AXI video
stream&lt;/a&gt; from scratch.
The video out of this device is a constant color (typically black).  This
then forms a “canvas” (via the &lt;a href=&quot;/video/2022/03/14/axis-video.html&quot;&gt;AXI video
stream protocol&lt;/a&gt;)
that other things can be overlaid on top of.&lt;/p&gt;

    &lt;p&gt;This generator leaves &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TVALID&lt;/code&gt; high, for reasons we’ve
&lt;a href=&quot;/video/2022/03/14/axis-video.html&quot;&gt;discussed before&lt;/a&gt;,
and that we’ll get to again in a moment.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_mux.v&quot;&gt;A video multiplexer&lt;/a&gt;–to
select between one of the various “displays”, and send only one to the
outgoing video display.&lt;/p&gt;

    &lt;p&gt;One of the things newcomers to the hardware world often don’t realize is that
the hardware used for a display can often not be reused when you switch
display types.  This is sort of like an ALU–the CPU will include support
for ADD, OR, XOR, and AND instructions, even if only one of the results is
selected on each clock cycle.  The same is true here.  Each of the various
displays listed
above is built in hardware, occupies a separate area of the FPGA (whether used
or not), and so something is needed to select between the various outputs to
choose which we’d like.&lt;/p&gt;

    &lt;p&gt;It did take some thought to figure out how to maintain video
synchronization while multiplexing multiple video streams together.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v&quot;&gt;A video overlay module&lt;/a&gt;–to merge two displays together, creating a result that
looks like it has multiple independent “windows” all displaying real time
data.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wrote these modules years ago.  They’ve all worked beautifully–in simulation.
So far, these have only been designed to be engineering displays, and not
necessarily great finished products.  Their biggest design problem?  None of
them display any units.  Still, they promise a valuable debugging
capability–provided they work.&lt;/p&gt;

&lt;p&gt;Herein lies the rub.  Although these display modules have worked nicely in
simulation, and although many have been formally verified, for some reason
I’ve had troubles with these modules when placed into actual hardware.&lt;/p&gt;

&lt;p&gt;Debugging this video chain is the topic of today’s discussion.&lt;/p&gt;

&lt;h2 id=&quot;axi-video-rules&quot;&gt;AXI Video Rules&lt;/h2&gt;

&lt;p&gt;For some more background, each of these modules produces an AXI video stream.
In general, these components would take data input, and produce a video
stream as output–much like Fig. 3 below.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 3. General AXI Stream Video component&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/qoi-debug/gendisplay.svg&quot; alt=&quot;&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;In this figure, acoustic data arrives on the left, and video data comes out on
the right.  Both use AXI streams.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;/video/2022/03/14/axis-video.html&quot;&gt;AXI stream protocol, however, isn’t necessarily a good fit for video
processing&lt;/a&gt;.
You really have to be aware of who drives the pixel clock,
and where the blanking intervals in your design are handled.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Sink&lt;/p&gt;

    &lt;p&gt;If video comes into your device, the pixel clock is driven by that video
 source.  The source will also determine when blanking intervals need to
 take place and how long they should be.  This will be controlled via the
 video’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VALID&lt;/code&gt; signal.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Source&lt;/p&gt;

    &lt;p&gt;Otherwise, if you are not consuming incoming video but producing video out,
 then the pixel clock and blanking intervals will be driven by the video
 controller.  This will be controlled by the display controller’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;READY&lt;/code&gt;
 signal.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In our case, these intermediate display modules also need to be aware that
there’s often &lt;em&gt;no&lt;/em&gt; buffering for the input.  If you drop the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SRC_READY&lt;/code&gt; line,
data will be lost.  Acoustic sensor data is coming at the design whether you
are ready for it or not.  Likewise, the &lt;a href=&quot;/blog/2022/02/23/axis-abort.html&quot;&gt;video output data needs to get to the
display module, and there’s no room in the HDMI standard for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VALID&lt;/code&gt; dropping
when a pixel needs to be
produced&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Put simply, there are two constraints to these controllers: 1) the source can’t
handle &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VALID &amp;amp;&amp;amp; !READY&lt;/code&gt;, and 2) the display controller at the end of the video
processing chain can’t handle &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;READY &amp;amp;&amp;amp; !VALID&lt;/code&gt;.  Any IP in the middle needs
to do what it can to avoid these conditions.&lt;/p&gt;
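
&lt;p&gt;As a rough illustration of these two constraints, here is a minimal Python
sketch (not part of the actual design) that scans a recorded list of per-clock
handshake samples and counts the cycles on which each constraint would be
violated.  The function names and the sample trace are hypothetical, chosen
only for illustration.&lt;/p&gt;

```python
# Hypothetical helpers: each sample is a (valid, ready) pair of booleans
# captured on one clock cycle of an AXI stream interface.

def count_source_overruns(samples):
    """Cycles with VALID and not READY: an unbuffered source loses data."""
    return sum(1 for valid, ready in samples if valid and not ready)

def count_display_underruns(samples):
    """Cycles with READY and not VALID: the display has no pixel to show."""
    return sum(1 for valid, ready in samples if ready and not valid)

# A made-up five-cycle trace, for illustration only
trace = [(True, True), (True, False), (False, True), (True, True), (False, False)]
print(count_source_overruns(trace), count_display_underruns(trace))
```

&lt;p&gt;The first count only matters at the acoustic input, and the second only
at the display controller.  An IP in the middle of the chain sees both
interfaces, which is why it has to guard against both conditions.&lt;/p&gt;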

&lt;p&gt;This leads to some self-imposed criteria, that I’ve “added” to the AXI stream
protocol.  Here are my extra rules for processing AXI video stream data:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;All video processing components should keep READY high.&lt;/p&gt;

    &lt;p&gt;Specifically, nothing &lt;em&gt;within&lt;/em&gt;
the module should ever drop the ready signal.  Only the downstream display
driver should ever drop READY by more than a cycle or two between lines.
This drop in READY then needs to propagate all the way through any
video processing chain.&lt;/p&gt;

    &lt;p&gt;My &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_mux.v&quot;&gt;video multiplexer&lt;/a&gt;
module is an example of an exception to this rule: It drops READY on all
of the video streams that aren’t currently active.  By waiting until the
end of a frame before adjusting/swapping which source is active, it can keep
all sources synchronized with the output.  This component will fail,
however, if one of those incoming streams is a true video source.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Keep VALID high as much as possible.&lt;/p&gt;

    &lt;p&gt;Only an upstream video source, such as a camera, should ever drop VALID by
more than a cycle or two between lines.  As with READY, this drop in VALID
should then propagate through the video processing chain.&lt;/p&gt;

    &lt;p&gt;In my case, there’s no such camera in this design, and so I’m never starting
from a live video source.  However, for reuse purposes in case I ever wish
to merge any of these components with a live feed, I try to keep VALID high
as much as possible.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Expect the environment to do something crazy.  Deal with it.  If your
algorithm depends on the image size, and that size changes, deal with it.&lt;/p&gt;

    &lt;p&gt;For example, if you are doing an overlay, and the overlay position changes,
you’ll need to move it.  If a video being overlaid isn’t VALID by the time
it’s needed, then you’ll have to disable the overlay operation and wait for
the overlay video source to get to the end of its frame before stalling it,
and then forcing it to wait until the time required for its first pixel
comes around again.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If your algorithm has a memory dependency, then there is always the
possibility that the memory cannot keep up with the video’s requirements.
Prepare for this.  Expect it.  Plan for it.  Know how to deal with it.&lt;/p&gt;

    &lt;p&gt;For example, if you are reading memory from a frame buffer to generate a
video image, and the memory doesn’t respond in time then, again, you have
to deal with it.  Your algorithm should do something “smart”, fail
gracefully, and then be able to resynchronize again later.  Perhaps
something else, such as a disk-drive DMA, was using memory and kept the
frame buffer from meeting its real-time requirements.  Perhaps it will be
gone later.  Deal with it, and recover.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In my case, I was building a falling raster.  I had two real-time requirements.&lt;/p&gt;

&lt;p&gt;First, data comes from the SONAR device at some incoming rate.  There’s no
room to slow it down.  You either handle it in time, or you don’t. In my case,
SONAR data is slow, so this isn’t really an issue.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;padding: 25px; float: left&quot;&gt;&lt;caption&gt;Fig 4. AXI Stream Video &quot;Rules&quot;&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/qoi-debug/vidrules.svg&quot; alt=&quot;&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;This data then goes through an
&lt;a href=&quot;/dsp/2018/10/02/fft.html&quot;&gt;FFT&lt;/a&gt;,
and possibly a logarithm or an averager,
before coming to the first half of the raster.  This component then writes
data to memory, one &lt;a href=&quot;/dsp/2018/10/02/fft.html&quot;&gt;FFT&lt;/a&gt;
line at a time.  (See Fig. 2 above.)  If the memory is
too slow here, data may be catastrophically dropped.  This is bad, but rare.&lt;/p&gt;

&lt;p&gt;Second, the waterfall display data must be produced at a known rate.  VALID
must be held high as much as possible so that the downstream display driver
at the end of the processing chain can rate limit the pipeline as necessary.
That means the waterfall must be read from memory as often as the downstream
display driver needs it.  If the memory can’t keep up, the display goes on.
You can’t allow these to get out of sync, but if they do they have to be able
to resynchronize automatically.&lt;/p&gt;

&lt;p&gt;Those are my rules for AXI video.  I’ve also summarized them in Fig. 4.&lt;/p&gt;

&lt;h2 id=&quot;debugging-challenge&quot;&gt;Debugging Challenge&lt;/h2&gt;

&lt;p&gt;Now let’s return to my SONAR project, where one of the big challenges was that
the SONAR device wasn’t on my desktop.  It’s being developed on the other side
of the Atlantic from where I’m at.  It has no JTAG connection to Vivado.
There’s no ILA, although my &lt;a href=&quot;https://github.com/ZipCPU/wbscope&quot;&gt;Wishbone scope&lt;/a&gt;
works fine.  The bottom line here, though, is that I can’t just glance at the
device (like I’d like) to see if the display is working.&lt;/p&gt;

&lt;p&gt;I’ve therefore spent countless hours using both formal methods and video
simulations to verify that each of these display components work.  Each of
these displays has passed a lint check, a formal check, and a simulation check.
Therefore, they should all be working … right?&lt;/p&gt;

&lt;p&gt;Except that when I tried to deploy these “working” modules to the
hardware … they didn’t work.&lt;/p&gt;

&lt;p&gt;The classic example of “not working” was the split screen spectrum/waterfall
display.  This screen was supposed to display the current spectrum of the
input data on top, with a waterfall synchronized to the same data falling down
beneath it.  It’s a nice effect–when it works.  However, we had problems
where the two would get out of sync.  1) The waterfall would show energy in
locations separate from the spectral energy, 2) the waterfall could be seen
“jumping” horizontally across the screen–just like the old TVs would do when
they lost sync.&lt;/p&gt;

&lt;p&gt;This never happened in any of my simulations.  Never.  Not even once.&lt;/p&gt;

&lt;p&gt;Sadly, my integrated SONAR simulation environment isn’t perfect.  It has
some challenges.  Of course, there’s the obvious challenge that my simulation
isn’t connected to “real” data.  Instead, I tend to drive it with various sine
waves.  These tend to be good for testing.  I suppose I could fix this somewhat
by replaying collected data, but that’s only on my “To-Do” list for now.  Then
there’s the challenge that &lt;a href=&quot;https://github.com/ZipCPU/zbasic/blob/e7b39a56ee515d1cabe8427f30c7add0592bfab1/sim/verilated/memsim.cpp&quot;&gt;my memory simulation
model&lt;/a&gt;
doesn’t typically match Xilinx’s MIG DDR3 performance.  (No, I’m not simulating
the entire DDR3 memory–although perhaps I should.) Finally, I can only
simulate about 5-15 frames of video data.  It just doesn’t take very long
before the &lt;a href=&quot;/blog/2017/07/31/vcd.html&quot;&gt;VCD trace file&lt;/a&gt;
exceeds 100GB, and then &lt;a href=&quot;https://gtkwave.sourceforge.net/&quot;&gt;my
tools&lt;/a&gt; struggle.&lt;/p&gt;

&lt;p&gt;Bottom line: &lt;a href=&quot;/blog/2018/08/04/sim-mismatch.html&quot;&gt;works in simulation, fails hard in
hardware&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now, how to figure this one out?&lt;/p&gt;

&lt;h2 id=&quot;first-step-formal-verification&quot;&gt;First Step: Formal verification&lt;/h2&gt;

&lt;p&gt;I know I said everything was formally verified.  That wasn’t quite true
initially.  Initially, the &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v&quot;&gt;overlay
module&lt;/a&gt;
wasn’t formally verified.&lt;/p&gt;

&lt;p&gt;In general, I like to develop with formal methods as my guide.  Barring that,
if I ever run into problems then formal verification is my first approach to
debugging.  I find that I can find problems faster when using the formal tools.
It tends to condense debugging very quickly.  Further, the formal tools aren’t
constrained by the requirement that the simulation environment needs to make
sense.  As a result, I tend to check my designs against a much richer
environment when checking them formally than I would via simulation.&lt;/p&gt;

&lt;p&gt;In this case, I was tied up with other problems, so I had someone else do the
formal verification for me.  He was somewhat new to formal verification, and
this particular module was quite the challenge–there are just so many cases
that had to be considered:&lt;/p&gt;

&lt;p&gt;We can start with the typical design, where the overlaid image lands nicely
within the main image window.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 5. Overlaid window in video&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/qoi-debug/midoverlay.svg&quot; alt=&quot;&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;This is what I typically think of when I set up an overlay of some type.&lt;/p&gt;

&lt;p&gt;This isn’t as simple as it sounds, though, since the IP needs to know that
the overlay window has finished its line, and so it shouldn’t start on the
next line until the main window gets to the left corner of the overlay
window for the next line.&lt;/p&gt;

&lt;p&gt;What happens, though, when the overlay window scrolls off to one side and
wraps back onto the main window?&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;padding: 25px; float: left&quot;&gt;&lt;caption&gt;Fig 6. Clipping the Overlaid video&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/qoi-debug/overlay.svg&quot; alt=&quot;&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;It might also scroll off the bottom as well.&lt;/p&gt;

&lt;p&gt;In both cases, the overlay video should be clipped.  This is not something my
simulation environment ever really checked, but it is something that gave us no
end of challenges when checking via formal tools.&lt;/p&gt;
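
&lt;p&gt;The geometry of that clipping can be sketched in a few lines of Python.
This is only an illustration of which overlay pixels remain visible.  The
actual overlay module works pixel by pixel on two streams rather than on
rectangles, and the function name and arguments below are hypothetical.&lt;/p&gt;

```python
# Hypothetical sketch: clip an overlay rectangle to the main window.
# All values are in pixels.  (x, y) is the overlay's top-left corner,
# measured from the top-left corner of the main window.

def clip_overlay(main_w, main_h, x, y, ov_w, ov_h):
    """Return (left, top, right, bottom) of the visible part, or None."""
    left = max(x, 0)
    top = max(y, 0)
    right = min(x + ov_w, main_w)
    bottom = min(y + ov_h, main_h)
    width = max(right - left, 0)
    height = max(bottom - top, 0)
    if width == 0 or height == 0:
        return None  # the overlay is entirely off screen
    return (left, top, right, bottom)

print(clip_overlay(800, 600, 100, 100, 200, 150))  # fully inside
print(clip_overlay(800, 600, 700, 500, 200, 150))  # clipped right and bottom
print(clip_overlay(800, 600, 900, 100, 200, 150))  # entirely off screen
```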

&lt;p&gt;These clipped examples are okay.  There’s nothing wrong with them–they just
never look right with only a couple clock cycles of trace.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;Fig 7. Overlay not ready&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/qoi-debug/overlay-block.svg&quot; alt=&quot;&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;There’s also the possibility of what happens when the overlay window isn’t
ready when the main window is, as illustrated in Fig. 7 on the right.&lt;/p&gt;

&lt;p&gt;Remember our video rules.  Together, these rules require that VALID and READY
be propagated through the module–but never dropped internal to the module.
That means there’s no time to wait.  If the overlay module isn’t ready when
it’s time, then the image will be corrupted.  We can’t wait or the hardware
display will lose sync.  The overlay has to be ready or the image will be
corrupted.&lt;/p&gt;

&lt;p&gt;So, how to deal with situations like this?&lt;/p&gt;

&lt;p&gt;Yeah.&lt;/p&gt;

&lt;p&gt;Yes, my helper learned a lot during this process.  Eventually, we got to the
point of pictorially drawing out what was going on each time the formal engine
presented us with another verification failure, just so we could follow what
was going on.  Yes, our drawings started looking like Fig. 5 or 6 above.&lt;/p&gt;

&lt;p&gt;Yes, formal verification is where I turn when things don’t work.  Typically
there’s some hardware path I’m not expecting, and formal tends to find all
such paths to make sure the logic considers them properly.&lt;/p&gt;

&lt;p&gt;In this case, it wasn’t enough.  Even though I formally verified all of these
components, the displays still weren’t working.  Unfortunately, in order to
know this, I had to ask an engineer in a European time zone to connect a
monitor and … he told me it wasn’t working.  Sure, he was more helpful than
that: he provided me pictures of the failures.  (They were nasty.  These were
ugly looking failures.)  Unfortunately, these told me nothing of what needed
to be adjusted, and it was also costly in terms of requiring a team effort–I
would need to arrange for his availability, (potentially) his cost, all for
something that wasn’t (yet) a customer requirement.&lt;/p&gt;

&lt;p&gt;I needed a better approach.&lt;/p&gt;

&lt;p&gt;What I needed was a way to “see” what was going on, without being there.
I needed a digital method of screen capture.&lt;/p&gt;

&lt;p&gt;Building something like this, however, is quite the challenge: the waterfall
displays all use my memory bandwidth–they can even use a (potentially)
significant memory bandwidth.  Debugging meant that I was going to need a
means of capturing the screen headed to the display that wouldn’t
(significantly) impact my memory bandwidth–otherwise my test infrastructure
(i.e. any debugging screen capture) would impact what I was trying to test.
That might lead to chasing down phantom bugs, or believing things were still
broken even after they’d been fixed.&lt;/p&gt;

&lt;p&gt;This left me at an impasse for some time–knowing there were bugs in the video,
but unable to do anything about them.&lt;/p&gt;

&lt;h2 id=&quot;enter-qoi-compression&quot;&gt;Enter QOI Compression&lt;/h2&gt;

&lt;p&gt;Some time ago, I remember reading about &lt;a href=&quot;https://qoiformat.org&quot;&gt;QOI
compression&lt;/a&gt;.  It captured my attention, as a fun
underdog story.&lt;/p&gt;

&lt;p&gt;Yes, I’d implemented my own &lt;a href=&quot;https://en.wikipedia.org/wiki/GIF&quot;&gt;GIF&lt;/a&gt;
compression/decompression in times past.  This was back when I was still focused
on software, and thus before I started doing any hardware design. I’d even
looked up how to compress images with &lt;a href=&quot;https://en.wikipedia.org/wiki/PNG&quot;&gt;PNG&lt;/a&gt;
and how &lt;a href=&quot;https://en.wikipedia.org/wiki/Bzip2&quot;&gt;BZip2&lt;/a&gt; could compress files.
Frankly, over the course of 30 years working in this industry, compression is
kind of hard to avoid.  That said, none of these compression methods is
really suitable for FPGA work.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://qoiformat.org&quot;&gt;QOI&lt;/a&gt; is different.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://qoiformat.org&quot;&gt;QOI&lt;/a&gt; is &lt;em&gt;much&lt;/em&gt; simpler than
&lt;a href=&quot;https://en.wikipedia.org/wiki/GIF&quot;&gt;GIF&lt;/a&gt;,
&lt;a href=&quot;https://en.wikipedia.org/wiki/PNG&quot;&gt;PNG&lt;/a&gt;,
or &lt;a href=&quot;https://en.wikipedia.org/wiki/Bzip2&quot;&gt;BZip2&lt;/a&gt;.  &lt;em&gt;Much&lt;/em&gt; simpler.  It’s so
simple, it can be implemented in hardware without too many challenges.  It’s so
simple, it can be implemented in 700 Xilinx 6-LUTs.  Not only that, it claims
better performance than &lt;a href=&quot;https://en.wikipedia.org/wiki/PNG&quot;&gt;PNG&lt;/a&gt;
across &lt;a href=&quot;https://qoiformat.org/benchmark/&quot;&gt;many (not all) benchmarks&lt;/a&gt;.&lt;/p&gt;
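
&lt;p&gt;To give a feel for how simple the format is: the QOI specification keeps
a 64-entry table of previously seen pixels, indexed by a small hash of the
color channels.  The Python below is a sketch of that hash as given in the
specification.  It is not the Verilog used in the compression module, and
the function name is my own.&lt;/p&gt;

```python
# Sketch of the pixel hash from the QOI specification.  Each channel is an
# 8-bit value, and the result selects one of 64 table entries.

def qoi_index(r, g, b, a=255):
    """Return the table position for the pixel (r, g, b, a)."""
    return (r * 3 + g * 5 + b * 7 + a * 11) % 64

print(qoi_index(0, 0, 0))        # opaque black
print(qoi_index(255, 255, 255))  # opaque white
```

&lt;p&gt;A handful of small constant multiplies, an add, and keeping the low six
bits: nothing here is difficult to build in hardware.&lt;/p&gt;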

&lt;p&gt;Yeah, now I’m interested.&lt;/p&gt;

&lt;p&gt;With a little bit of work, I was able to implement a &lt;a href=&quot;https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_compress.v&quot;&gt;QOI compression
module&lt;/a&gt;.  A
&lt;a href=&quot;https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_encoder.v&quot;&gt;small wrapper&lt;/a&gt;
could encode and attach a small “file” header and trailer onto the compressed
stream.  This could then be followed by a &lt;a href=&quot;https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_recorder.v&quot;&gt;QOI image capture
module&lt;/a&gt;
which I could then use to capture a series of subsequent video frames.&lt;/p&gt;

&lt;p&gt;This led to a debugging plan that was starting to take shape.  You can see how
this plan would work in Fig. 8 below.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 8. Video debug plan using QOI compression&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/qoi-debug/qoiplan.svg&quot; alt=&quot;&quot; width=&quot;780&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;If all went well, video data would be siphoned off from between the video
multiplexer and the display driver generating the HDMI output.  This video
would be (nominally) at around (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;800*600*3*60&lt;/code&gt;) 82MB/s.  If the compression
works well, the data rate should drop to about 1MB/s–but we’ll see.&lt;/p&gt;

&lt;p&gt;Of course, as with anything, nothing works out of the box.  Worse, if you are
going to rely on something for “test”, it really needs to be better than
the device under test.  If not, you’ll never know which item is the cause of
an observation: the device under test, or the test infrastructure used to
measure it.&lt;/p&gt;

&lt;p&gt;Therefore, I set up a basic simulation test on my desktop.  I’d run the
SONAR simulation, visually inspect the HDMI output, and capture three frames
of data.  I’d then &lt;a href=&quot;https://github.com/phoboslab/qoi&quot;&gt;convert&lt;/a&gt; these three
frames of data to &lt;a href=&quot;https://en.wikipedia.org/wiki/PNG&quot;&gt;PNG&lt;/a&gt;s.  If the resulting
&lt;a href=&quot;https://en.wikipedia.org/wiki/PNG&quot;&gt;PNG&lt;/a&gt;s visually matched, then I had
a &lt;del&gt;strong&lt;/del&gt; confidence the
&lt;a href=&quot;https://qoiformat.org&quot;&gt;QOI&lt;/a&gt;
&lt;a href=&quot;https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_compress.v&quot;&gt;compression&lt;/a&gt;,
&lt;a href=&quot;https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_encoder.v&quot;&gt;encoder&lt;/a&gt;, and
&lt;a href=&quot;https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_recorder.v&quot;&gt;recorder&lt;/a&gt;
were working.&lt;/p&gt;

&lt;p&gt;Note that I had to cross out the word “strong” there.  Unless and until an
IP can be tested through &lt;em&gt;every&lt;/em&gt; logic path, you really don’t have any “strong”
confidence something is working.  Still, it was enough to get me off the ground.&lt;/p&gt;

&lt;p&gt;The challenge here is that tracing the design through simulation while it
records three images can generate a 120GB+
&lt;a href=&quot;/blog/2017/07/31/vcd.html&quot;&gt;VCD file&lt;/a&gt;,
and it took longer to test
in simulation than it did to build the hardware design, load the hardware
design, and capture images from hardware.  As a result, I often found myself
debugging both the &lt;a href=&quot;https://github.com/ZipCPU/qoiimg&quot;&gt;QOI processing system&lt;/a&gt;
and the (buggy) video processing system jointly, &lt;a href=&quot;/blog/2017/06/02/design-process.html&quot;&gt;in hardware, at the same
time&lt;/a&gt;.
No, it’s not ideal, but it did work.&lt;/p&gt;

&lt;h2 id=&quot;the-first-bug-never-getting-back-in-sync&quot;&gt;The First Bug: Never getting back in sync&lt;/h2&gt;

&lt;p&gt;I started my debugging with the default display, a &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_split.v&quot;&gt;split
screen&lt;/a&gt;
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_trace.v&quot;&gt;spectrogram&lt;/a&gt;
and &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_waterfall.v&quot;&gt;waterfall&lt;/a&gt;.
Using my newfound capability, I quickly received an image that looked something
like the figure below.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 9. First QOI capture -- no waterfall&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/qoi-debug/20240527-qoi-before.png&quot; alt=&quot;&quot; width=&quot;800&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;This figure shows what &lt;em&gt;should&lt;/em&gt; be a
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_split.v&quot;&gt;split screen&lt;/a&gt;
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_trace.v&quot;&gt;spectrogram&lt;/a&gt;
and
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_waterfall.v&quot;&gt;waterfall&lt;/a&gt;
display.  The
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_trace.v&quot;&gt;spectrum&lt;/a&gt;
on top appears about right; however, the
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/gfx/vid_waterfall.v&quot;&gt;waterfall&lt;/a&gt;
that’s supposed to exist in the bottom half of the display is completely absent.&lt;/p&gt;

&lt;p&gt;Well, the good news is that I could at least capture a bug.&lt;/p&gt;

&lt;p&gt;The next step was to walk this bug backwards through the design.  In this case,
we’re walking backwards through Fig. 2 above and the first component to look at
is the &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v&quot;&gt;overlay
module&lt;/a&gt;.  It
is possible for the &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v&quot;&gt;overlay
module&lt;/a&gt;
to lose synchronization.  This typically means either the overlay isn’t
ready when the primary display is ready for it, or that the overlay is still
displaying some (other) portion of its video.  Once out of sync, you can no
longer merge the two displays.  The two streams then need to be resynchronized.
That is, the
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v&quot;&gt;overlay&lt;/a&gt;
module would need to wait for the end of the secondary image (the image to be
overlaid on top of the primary), and then it would need to stall the secondary
image until the primary display was ready for it again.&lt;/p&gt;

&lt;p&gt;However, the &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v&quot;&gt;overlay
module&lt;/a&gt;
wasn’t losing synchronization.&lt;/p&gt;

&lt;p&gt;No?&lt;/p&gt;

&lt;p&gt;This was a complete surprise to me.  This was where I was expecting the bug,
and where most of my debugging efforts had been (blindly) focused up until this
point.&lt;/p&gt;

&lt;p&gt;Okay, so … let’s move back one more step.  (See Fig. 2)&lt;/p&gt;

&lt;p&gt;It is possible for the &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_r.v&quot;&gt;video waterfall
reader&lt;/a&gt;
to get out of sync between its two clocks.  Specifically, one portion of the
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_r.v&quot;&gt;reader&lt;/a&gt;
reads data, one line at a time, from the bus and stuffs it into
first a synchronous FIFO, and then
an &lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;asynchronous one&lt;/a&gt;.  This
half operates at whatever speed the bus is at, and that’s defined by the
memory’s speed.  The second half of the
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_r.v&quot;&gt;reader&lt;/a&gt;
takes this data from the
&lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;asynchronous FIFO&lt;/a&gt; and attempts
to create an AXI stream video output from it–this time at the pixel clock rate.
Because we are not allowed to stall this video output to wait for memory, it
is possible for the two to get out of sync.  In this case, the
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_r.v&quot;&gt;reader&lt;/a&gt;
(pixel clock domain) is supposed to wait for an end of frame indication from
the memory
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_r.v&quot;&gt;reader&lt;/a&gt;
(bus clock domain, via the
&lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;asynchronous FIFO&lt;/a&gt;),
and then it is to stall the memory
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_r.v&quot;&gt;reader&lt;/a&gt;
(by not reading from the
&lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;asynchronous FIFO&lt;/a&gt;)
until it receives an end of video frame indication from its own video
reconstruction logic.&lt;/p&gt;

&lt;p&gt;A quick check revealed that yes, these two were getting out of sync.&lt;/p&gt;

&lt;p&gt;Here’s how the “out-of-sync” detection was taking place:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;initial&lt;/span&gt;	&lt;span class=&quot;n&quot;&gt;px_lost_sync&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pix_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pix_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;px_lost_sync&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;M_VID_TVALID&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;M_VID_TREADY&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;M_VID_HLAST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// Check when sending the last pixel of a line.  On this last&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// pixel, the data read from memory (px_hlast) must also&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// indicate that it is the last pixel in a line.  Further,&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// if this is also the last line in a frame, then both the&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// memory indicator of the last line in a frame (px_vlast)&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// and the outgoing video indicator (M_VID_VLAST) must match.&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;px_hlast&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;M_VID_VLAST&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;px_vlast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;px_lost_sync&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;px_lost_sync&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;M_VID_VLAST&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;px_vlast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// We can resynchronize once both memory and&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// outgoing video streams have both reached the end of&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// a frame.&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;px_lost_sync&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Following any reset, the entire design should be synchronized.  That’s the
easy part.&lt;/p&gt;

&lt;p&gt;Next, if the output of the &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v&quot;&gt;overlay
module&lt;/a&gt;
(that’s the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;M_VID_*&lt;/code&gt; prefix values) is ready to produce the last pixel of a
line, then we check if the FIFO signals line up.  In our example, we have two
sets of synchronization signals.  First, there are the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;M_VID_HLAST&lt;/code&gt; and
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;M_VID_VLAST&lt;/code&gt; signals.  These are generated blindly based upon the frame size.
These indicate the last pixel in a line (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;M_VID_HLAST&lt;/code&gt;) and the end of a frame
(&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;M_VID_VLAST&lt;/code&gt;) respectively–from the perspective of the video stream.  Two
other signals, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;px_hlast&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;px_vlast&lt;/code&gt;, come through the
&lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;asynchronous FIFO&lt;/a&gt;.  These
are used to indicate the last bus word in a line and the end of a frame from
the perspective of the data found within the
&lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;asynchronous FIFO&lt;/a&gt;
containing the samples read from memory–one bus word (not one pixel) at a time.
If these two ever get out of sync, then perhaps memory hasn’t kept up with the
display or perhaps something else has gone wrong.&lt;/p&gt;

&lt;p&gt;So, to determine if we’ve lost sync, we check for it on the last pixel of any
line.  That is, when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;M_VID_HLAST&lt;/code&gt; is true to indicate the last pixel in a
line, then &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;px_hlast&lt;/code&gt; should also be true–both should be synchronized.
Likewise, when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;M_VID_VLAST&lt;/code&gt; (last line of frame) is true, then &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;px_vlast&lt;/code&gt;
should also be true–or the two have come out of sync.&lt;/p&gt;

&lt;p&gt;Because I’m also doing 128b bus word to 8b pixel conversions here, the two
signals don’t directly correspond.  That is, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;px_hlast&lt;/code&gt; might be true (last
bus word of a line), even though &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;M_VID_HLAST&lt;/code&gt; isn’t true yet (last pixel of a
line).  Hence, I only check these values if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;M_VID_HLAST&lt;/code&gt; is true–on the last
&lt;em&gt;pixel&lt;/em&gt; of the line.&lt;/p&gt;

&lt;p&gt;That’s how we know if we’re out of sync.  But … how do we get synchronized
again?&lt;/p&gt;

&lt;p&gt;For this, the plan is to read from the memory reader as fast as possible until
the end of the frame.  Once we get to the end of the frame, we’ll stop reading
from memory and wait for the video (pixel clock) to get to the end of the
frame.  Once both are synchronized at the end of a frame, the plan is to then
release both together and we’ll be synchronized again.&lt;/p&gt;

&lt;p&gt;At least, that’s how this is &lt;em&gt;supposed&lt;/em&gt; to work.&lt;/p&gt;

&lt;p&gt;The key (broken) signal was the signal to read from the &lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;asynchronous
FIFO&lt;/a&gt;.
This signal, called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;afifo_read&lt;/code&gt;, is shown below.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;afifo_read&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;px_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PW&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;px_valid&lt;/span&gt;
			&lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;M_VID_TVALID&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;M_VID_TREADY&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;M_VID_HLAST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;M_VID_TVALID&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;M_VID_TREADY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;afifo_read&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

		&lt;span class=&quot;c1&quot;&gt;// Always read if we are out of sync&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;px_lost_sync&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;px_hlast&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;px_vlast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;afifo_read&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Basically, we want to read from the
&lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;asynchronous FIFO&lt;/a&gt;
any time we don’t have a full pixel’s width left in our bus width to pixel
gearbox, any time we don’t have a valid buffer, or any time we reach the end
of the line–where we would flush the gearbox’s buffer.  The exception to this
is if the outgoing AXI stream is stalled.  This is how the
&lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;FIFO&lt;/a&gt; read signal is supposed
to work normally.  There’s one exception here, and that is if the two are out
of sync.  In that case, we will always read from
the &lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;FIFO&lt;/a&gt;
until the last pixel in a line on the last line of the frame.&lt;/p&gt;

&lt;p&gt;This all sounds good.  It looked good on a desk check too.  I passed over this
many times, reading it, convincing myself that this was right.&lt;/p&gt;

&lt;p&gt;The problem is this was the logic that was broken.&lt;/p&gt;

&lt;p&gt;If you look closely, you might notice that this logic would never allow us to
get back in sync.  Once we lose synchronization, we’ll read until the end of
the frame and then stop, only to read again when any of the original criteria
are true–the ones assuming synchronization.&lt;/p&gt;

&lt;p&gt;Yeah, that’s not right.&lt;/p&gt;

&lt;p&gt;This also explains why all my hardware traces showed the waterfall never
resynchronizing with the outgoing video stream.&lt;/p&gt;

&lt;p&gt;One missing condition fixes this.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// ...&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;px_lost_sync&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;px_hlast&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;px_vlast&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;M_VID_HLAST&lt;/span&gt;
				&lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;M_VID_VLAST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;afifo_read&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This last condition states that, if we are out of sync and we’ve reached the
last pixel in a frame, then we need to wait until the outgoing frame matches
our sync.  Only then can we read again.&lt;/p&gt;
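&lt;p&gt;For reference, here’s a sketch of how the original logic and the missing
condition might look when combined into one block.  This assumes the new check
is simply appended to the end of the original logic.  Since the two
out-of-sync checks can never both be true at once, their relative order
shouldn’t matter.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	// Sketch only: the placement of the final check is an assumption
	always @(*)
	begin
		// Normal operation: refill the gearbox when it runs low, when
		// it holds nothing valid, or when the outgoing line ends
		afifo_read = (px_count &amp;lt;= PW || !px_valid
			|| (M_VID_TVALID &amp;amp;&amp;amp; M_VID_TREADY &amp;amp;&amp;amp; M_VID_HLAST));
		if (M_VID_TVALID &amp;amp;&amp;amp; !M_VID_TREADY)
			afifo_read = 1'b0;

		// Out of sync: drain the FIFO until it reaches its own
		// end of frame marker
		if (px_lost_sync &amp;amp;&amp;amp; (!px_hlast || !px_vlast))
			afifo_read = 1'b1;

		// Out of sync, and holding the end of frame marker: stall
		// until the outgoing video reaches its end of frame too
		if (px_lost_sync &amp;amp;&amp;amp; px_hlast &amp;amp;&amp;amp; px_vlast &amp;amp;&amp;amp; (!M_VID_HLAST
				|| !M_VID_VLAST))
			afifo_read = 1'b0;
	end&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;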

&lt;p&gt;Once I fixed this, things got better.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 10. QOI capture, showing an attempted waterfall display&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/qoi-debug/20240528-qoi-promising.png&quot; alt=&quot;&quot; width=&quot;800&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;I could now get through a significant fraction of a frame before losing
synchronization for the rest of it.  In other words, I had found and fixed
the reason the design wasn’t recovering, just not the reason it was getting
out of sync in the first place.&lt;/p&gt;

&lt;p&gt;The waterfall background is also supposed to be &lt;em&gt;black&lt;/em&gt;, not &lt;em&gt;blue&lt;/em&gt;–so I
needed to dig into that as well.  (That turned out to be a bug in the &lt;a href=&quot;https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_compress.v&quot;&gt;QOI
compression
module&lt;/a&gt;.  I
could just about guess this bug, if I watched how the official decoder worked.)&lt;/p&gt;

&lt;p&gt;So, back I went to the &lt;a href=&quot;https://github.com/ZipCPU/wbscope&quot;&gt;Wishbone scope&lt;/a&gt;,
this time &lt;a href=&quot;/blog/2017/06/08/simple-scope.html&quot;&gt;triggering the
scope&lt;/a&gt;
on a loss of sync event.  I needed to find out why this design lost sync in
the first place.&lt;/p&gt;

&lt;h2 id=&quot;the-second-bug-how-did-we-lose-sync-in-the-first-place&quot;&gt;The Second Bug: How did we lose sync in the first place?&lt;/h2&gt;

&lt;p&gt;Years ago, I wrote &lt;a href=&quot;/blog/2018/11/29/llvga.html&quot;&gt;an article that argued that good and correct video handling
was all captured by a pair of
counters&lt;/a&gt;.  You needed one
counter for the horizontal pixel, and another for the vertical pixel.  Once
these got to the raw width and height of the image, the counters would be
reset and start over.&lt;/p&gt;

&lt;p&gt;When dealing with memory, things are a touch different–at least for this
design.&lt;/p&gt;

&lt;p&gt;As hinted above, the bus portion of the &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_r.v&quot;&gt;waterfall
reader&lt;/a&gt;
works off of &lt;em&gt;bus words&lt;/em&gt;, not pixels.  It reads one line at a time from the
bus, reading as many bus words as are necessary to make up a line.  In the case
of this system, a bus word on the &lt;a href=&quot;https://store.digilentinc.com/nexys-video-artix-7-fpga-trainer-board-for-multimedia-applications&quot;&gt;Nexys Video
board&lt;/a&gt;
is 128-bits long–the natural width of the DDR3 SDRAM memory.  (Our &lt;a href=&quot;https://www.enclustra.com/en/products/fpga-modules/mercury-kx2/&quot;&gt;next
hardware
platform&lt;/a&gt;
will increase this to 512-bits.)  Likewise, the waterfall pixel size is only
8-bits–since it has no color, and a &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/48de0f29c1cb91fabb0ef4d0cba4829c4a43651c/rtl/gfx/vid_clrmap.v&quot;&gt;false
color&lt;/a&gt;
will be provided later.  Hence, to read an 800 pixel line, the bus master must
read 50 bus words (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;800*8/128&lt;/code&gt;).  The last word will then be marked as the last
in the line, possibly also the last in the frame, and the result will be
stuffed into the &lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;asynchronous
FIFO&lt;/a&gt;.
Once the last word in a line is requested of the bus, the bus master
needs to advance its line pointer address to the next line.&lt;/p&gt;
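&lt;p&gt;As a quick sanity check on that arithmetic, the word count can be computed
at elaboration time.  This is only a sketch.  The module and parameter names
below are made up for illustration, and it assumes a fixed line width, which
may not be how the actual reader is configured.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;// Hypothetical module: none of these names come from the reader itself
module	linewords;
	localparam	BUS_WIDTH   = 128;	// Bits per bus word
	localparam	PIXEL_WIDTH = 8;	// Bits per waterfall pixel
	localparam	LINE_PIXELS = 800;	// Pixels per line

	// Round up, in case a line isn't a whole number of bus words
	localparam	WORDS_PER_LINE = (LINE_PIXELS * PIXEL_WIDTH
					+ BUS_WIDTH - 1) / BUS_WIDTH;

	// (800 * 8 + 127) / 128 = 50, using integer division
	initial	$display(&quot;Bus words per line: %0d&quot;, WORDS_PER_LINE);
endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;With a 512-bit bus word, the same calculation would round up to 13 bus
words per line, with the last word only partially used.&lt;/p&gt;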

&lt;p&gt;However, there’s a problem with bus mastering: the logic that makes &lt;em&gt;requests&lt;/em&gt;
of a bus has to take place many clocks before the logic that &lt;em&gt;receives&lt;/em&gt; the bus
responses.  The exact delay is not really that important, but it typically ends
up around 30 clock cycles or so.  That means this design needs two sets of
X and Y counters: one when making requests, to know when a full line (or frame)
has been requested and that it is time to advance to the next line (or frame),
and a second set to keep track of when the line (or frame) ends with respect
to the values &lt;em&gt;returned&lt;/em&gt; from the bus.  This second set controls the end of
line and frame markers that go into the synchronous and then &lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;asynchronous
FIFO&lt;/a&gt;.&lt;/p&gt;
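&lt;p&gt;To make the idea of two counter sets concrete, here’s a minimal sketch.
It is not the reader’s actual logic.  The module, port and register names are
all hypothetical, and it assumes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_request&lt;/code&gt; is true on every &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;STB &amp;amp;&amp;amp; !STALL&lt;/code&gt;
and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_return&lt;/code&gt; on every &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ACK&lt;/code&gt;.  Here, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;vlast&lt;/code&gt; outputs are true throughout
the last line, so the end of a frame is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hlast &amp;amp;&amp;amp; vlast&lt;/code&gt;.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;// Hypothetical sketch: one X/Y counter pair for requests, one for returns
module	linecounters #(
		parameter	LGW = 12	// Bits in each counter
	) (
		input	wire		i_clk, i_reset,
		input	wire [LGW-1:0]	i_words_per_line, i_lines_per_frame,
		input	wire		i_request,	// STB &amp;amp;&amp;amp; !STALL
		input	wire		i_return,	// ACK
		output	wire		o_req_hlast, o_req_vlast,
		output	wire		o_rtn_hlast, o_rtn_vlast
	);

	reg	[LGW-1:0]	req_x, req_y, rtn_x, rtn_y;

	// Last bus word of a line, and last line of a frame
	assign	o_req_hlast = (req_x + 1 &amp;gt;= i_words_per_line);
	assign	o_req_vlast = (req_y + 1 &amp;gt;= i_lines_per_frame);
	assign	o_rtn_hlast = (rtn_x + 1 &amp;gt;= i_words_per_line);
	assign	o_rtn_vlast = (rtn_y + 1 &amp;gt;= i_lines_per_frame);

	// Request side: advances as each request is accepted by the bus
	always @(posedge i_clk)
	if (i_reset)
		{ req_x, req_y } &amp;lt;= 0;
	else if (i_request &amp;amp;&amp;amp; o_req_hlast)
	begin
		req_x &amp;lt;= 0;
		req_y &amp;lt;= o_req_vlast ? 0 : req_y + 1;
	end else if (i_request)
		req_x &amp;lt;= req_x + 1;

	// Return side: advances, many clocks later, as each response arrives
	always @(posedge i_clk)
	if (i_reset)
		{ rtn_x, rtn_y } &amp;lt;= 0;
	else if (i_return &amp;amp;&amp;amp; o_rtn_hlast)
	begin
		rtn_x &amp;lt;= 0;
		rtn_y &amp;lt;= o_rtn_vlast ? 0 : rtn_y + 1;
	end else if (i_return)
		rtn_x &amp;lt;= rtn_x + 1;
endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;In a sketch like this, the request side would decide when to move the
address to the next line, while the return side would supply the end of line
and end of frame markers written into the FIFO alongside the data.&lt;/p&gt;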

&lt;p&gt;Let’s walk through this logic to see if I can clarify it at all.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;First, there’s both a synchronous FIFO and an
&lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;asynchronous one&lt;/a&gt;–since
it can be a challenge to know the &lt;em&gt;fill&lt;/em&gt; of the
&lt;a href=&quot;/blog/2018/07/06/afifo.html&quot;&gt;asynchronous FIFO&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Once the &lt;em&gt;synchronous&lt;/em&gt; FIFO is at least half empty, the reader begins a bus
transaction.  For a &lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;Wishbone
bus&lt;/a&gt;, this means both &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CYC&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;STB&lt;/code&gt; need to
be raised.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;For every &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;STB &amp;amp;&amp;amp; !STALL&lt;/code&gt;, a request is made of the bus.  At this time, we
also subtract one from a counter keeping track of the number of available
(i.e. uncommitted) entries in the synchronous FIFO.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Likewise, for every &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;STB &amp;amp;&amp;amp; !STALL&lt;/code&gt;, the IP increments the requested memory
address.&lt;/p&gt;

    &lt;p&gt;Once you get to the end of the line, set the next address to the last line
start address &lt;em&gt;minus&lt;/em&gt; one line of memory.  Remember, we are creating a
&lt;em&gt;falling&lt;/em&gt; raster, where we go from most recent
&lt;a href=&quot;/dsp/2018/10/02/fft.html&quot;&gt;FFT&lt;/a&gt; data to oldest
&lt;a href=&quot;/dsp/2018/10/02/fft.html&quot;&gt;FFT&lt;/a&gt; data.
Hence we read &lt;em&gt;backwards&lt;/em&gt; through memory, one line at a time.&lt;/p&gt;

    &lt;p&gt;Once we get to the beginning of our assigned memory area, we wrap back
to the end of our assigned memory area minus one line.&lt;/p&gt;

    &lt;p&gt;Once we get to the end of the &lt;em&gt;frame&lt;/em&gt;, we need to reset the address to
the last line the
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_w.v&quot;&gt;writer&lt;/a&gt;
has just completed.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;On every &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ACK&lt;/code&gt;, the returned data gets stored into the synchronous FIFO.
With each result stored in the FIFO, we also add an indication of whether
this return was associated with the end of a line or the end of a frame.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Once the
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_r.v&quot;&gt;reader&lt;/a&gt;
gets to the end of the line, we restart the (horizontal) &lt;del&gt;pixel&lt;/del&gt; bus
word counter and increment the line counter.  When it gets to the end of
the frame, we reset the line counter as well.&lt;/p&gt;

    &lt;p&gt;Just to make sure that these two sets of counters (request and return)
remain synchronized, the return counters are set to equal the request
counters any time the bus is idle.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The IP then continues making requests until there would be no more room in
the FIFO for the returned data.  At this point, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;STB&lt;/code&gt; gets dropped and we
wait for the last request to be returned.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Once all requests have been returned, drop &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CYC&lt;/code&gt; and wait again.&lt;/p&gt;

    &lt;p&gt;The rule of the bus is also the rule of the boarding house bathroom:
do your business, and get out of there.  Once you are done with any bus
transactions, it’s therefore important to get off the bus.  Even if we could
(now) make more requests, we’ll get off the bus and wait for the FIFO to
become less than half full again–that way other (potential) bus masters
can have a chance to access memory.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And … right there is the foundation for this bug.&lt;/p&gt;

&lt;p&gt;The actual bug was how I determined whether or not the last request was being
returned.  Let’s look at that logic for a moment, shall we?  Here’s what it
looked like (when broken):  (Watch for what clears &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o_wb_cyc&lt;/code&gt; …)&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;initial&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o_wb_cyc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o_wb_stb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;2'b00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;wb_reset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// Halt any requests on reset&lt;/span&gt;
		&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o_wb_cyc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o_wb_stb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;2'b00&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o_wb_cyc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o_wb_stb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_wb_stall&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// Drop the strobe signal on the last request.  Never&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// raise it again during this cycle.&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;o_wb_stb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;last_request&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_wb_ack&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o_wb_stb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_wb_stall&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
					&lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;last_request&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;last_ack&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// Drop CYC once the last return has been received.&lt;/span&gt;
			&lt;span class=&quot;n&quot;&gt;o_wb_cyc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;1'b0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fifo_fill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LGFIFO&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LGBURST&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// Start requests when the FIFO has less than a burst's size&lt;/span&gt;
		&lt;span class=&quot;c1&quot;&gt;// within it.&lt;/span&gt;
		&lt;span class=&quot;o&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o_wb_cyc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;o_wb_stb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mb&quot;&gt;2'b11&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

	&lt;span class=&quot;k&quot;&gt;always&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;posedge&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_clk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_reset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o_wb_cyc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;||&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i_wb_err&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;last_ack&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;last_ack&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;wb_outstanding&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;o_wb_stb&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
				&lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i_wb_ack&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Look specifically at the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;last_ack&lt;/code&gt; signal.&lt;/p&gt;

&lt;p&gt;Depending upon the pipeline, this signal can be off by one clock cycle.&lt;/p&gt;

&lt;p&gt;This was the bug.  Because the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;last_ack&lt;/code&gt; signal, indicating that there’s only
one more acknowledgement left, compared the number of outstanding requests
against &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2&lt;/code&gt; plus the current acknowledgment, and because the signal was
&lt;em&gt;registered&lt;/em&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;last_ack&lt;/code&gt; might be set if there were two requests outstanding
&lt;em&gt;and&lt;/em&gt; nothing was returned on the current cycle.&lt;/p&gt;

&lt;p&gt;Since all requests would’ve been made by this time, the X and Y &lt;del&gt;pixel&lt;/del&gt;
bus word counters for the &lt;em&gt;request&lt;/em&gt; would reflect that we’d just requested a
line of data.  The &lt;em&gt;return&lt;/em&gt; counters, on the other hand, would be off by one
if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CYC&lt;/code&gt; ever dropped a cycle early.  These return counters would then get
reset to equal the &lt;em&gt;request&lt;/em&gt; counters any time &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CYC&lt;/code&gt; was zero.  Hence,
dropping the bus line one cycle early would result in a line
of pixels (well, bus words representing pixels …) going into the FIFO
that didn’t have enough pixels within it–or perhaps the LAST signal might
be missing entirely.  Whatever the case, it didn’t line up.&lt;/p&gt;

&lt;p&gt;This particular design was formally verified.  Shouldn’t this bug have shown
up in a formal test?  Sadly, no.  &lt;a href=&quot;/zipcpu/2017/11/07/wb-formal.html&quot;&gt;It’s &lt;em&gt;legal&lt;/em&gt; to drop &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CYC&lt;/code&gt;
early&lt;/a&gt;, so there’s
no protocol violation there.  Further, my acknowledgment counter was off by
one in such a way that the formal properties allowed it.  If I added an assertion
that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CYC&lt;/code&gt; would never be dropped early (which I did once I discovered this
bug), the design would then immediately (and appropriately) fail.&lt;/p&gt;
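
&lt;p&gt;For reference, a check of this sort might look something like the sketch below.  This is not the property from the actual design.  The module name, its ports, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;f_outstanding&lt;/code&gt; counter are all made up for illustration, and the check is deliberately strict: it assumes a slave that never acknowledges a request on the same clock that request is accepted.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;// Hypothetical stand-alone checker.  All names here are illustrative.
module	f_cyc_early (
		input	wire	i_clk, i_reset,
		input	wire	i_wb_cyc, i_wb_stb, i_wb_stall,
		input	wire	i_wb_ack, i_wb_err
	);

	reg		f_past_valid;
	reg	[7:0]	f_outstanding;	// Assumes fewer than 256 in flight

	initial	f_past_valid = 1'b0;
	always @(posedge i_clk)
		f_past_valid &amp;lt;= 1'b1;

	// Count the requests that have been accepted but not yet returned
	initial	f_outstanding = 0;
	always @(posedge i_clk)
	if (i_reset || !i_wb_cyc || i_wb_err)
		f_outstanding &amp;lt;= 0;
	else case({ (i_wb_stb &amp;amp;&amp;amp; !i_wb_stall), i_wb_ack })
	2'b10: f_outstanding &amp;lt;= f_outstanding + 1;
	2'b01: f_outstanding &amp;lt;= f_outstanding - 1;
	default: ;
	endcase

`ifdef	FORMAL
	// CYC may only fall after a reset, a bus error, or the return of
	// the one and only request still outstanding
	always @(posedge i_clk)
	if (f_past_valid &amp;amp;&amp;amp; $fell(i_wb_cyc))
		assert($past(i_reset) || $past(i_wb_err)
			|| ($past(i_wb_ack) &amp;amp;&amp;amp; !$past(i_wb_stb)
				&amp;amp;&amp;amp; ($past(f_outstanding) == 1)));
`endif
endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Attached to the bus signals of the broken logic above, a property like this should fail as soon as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CYC&lt;/code&gt; falls while two requests are still outstanding.&lt;/p&gt;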

&lt;p&gt;There’s one more surprise to this story though.  Why didn’t this bug show up in
simulation?&lt;/p&gt;

&lt;p&gt;Ahh, now there’s a very interesting lesson to be learned.&lt;/p&gt;

&lt;h2 id=&quot;reality-why-didnt-the-bugs-show-up-in-simulation&quot;&gt;Reality: Why didn’t the bug(s) show up in simulation?&lt;/h2&gt;

&lt;p&gt;Why didn’t the bug show up earlier?  Because of Xilinx’s DDR3 SDRAM controller,
commonly known as “The MIG”.&lt;/p&gt;

&lt;p&gt;I don’t normally simulate DDR3 memories.  A DDR3 SDRAM memory controller
requires a lot of hardware specific components, components that aren’t
necessarily easy to simulate, and it also requires a DDR3 SDRAM simulation
model.  I tend to simplify all of this and just simulate my designs with an
&lt;a href=&quot;https://github.com/ZipCPU/zbasic/blob/e7b39a56ee515d1cabe8427f30c7add0592bfab1/sim/verilated/memsim.cpp&quot;&gt;alternate SDRAM model–a model that looks and acts “about” right, but one that
isn’t exact&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It was the difference between &lt;a href=&quot;https://github.com/ZipCPU/zbasic/blob/e7b39a56ee515d1cabe8427f30c7add0592bfab1/sim/verilated/memsim.cpp&quot;&gt;my simulation
model&lt;/a&gt;,
which wouldn’t trigger any of the bugs, and Xilinx’s MIG reality that
ended up triggering the bug.&lt;/p&gt;

&lt;p&gt;Fig. 11, for example, shows what the &lt;a href=&quot;https://github.com/ZipCPU/wbscope&quot;&gt;Wishbone
scope&lt;/a&gt; returned when documenting the
&lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/vid_waterfall_r.v&quot;&gt;waterfall
reader&lt;/a&gt;’s
transactions with the MIG.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 11.  The Waterfall reader's view of Wishbone bus handshaking when accessing memory&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/qoi-debug/20240605-migfail.png&quot; alt=&quot;&quot; width=&quot;800&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Focus your attention on first the stall (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_stall&lt;/code&gt;) and then the
acknowledgment (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i_ack&lt;/code&gt;) lines.&lt;/p&gt;

&lt;p&gt;First, stall is high immediately as part of the beginning of the transaction.
This is to be expected.  With the exception of filling a minimal buffer, any
bus master requesting transactions of the bus is going to need to wait for
&lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;arbitration&lt;/a&gt;.
This only takes a clock or two.  Once
&lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;arbitration&lt;/a&gt; is received, the
&lt;a href=&quot;/blog/2019/07/17/crossbar.html&quot;&gt;interconnect&lt;/a&gt;
won’t stall the design again during this bus cycle.&lt;/p&gt;

&lt;p&gt;Yet the stall line does get raised again after that–several times even.  These
stalls are all due to the MIG.&lt;/p&gt;

&lt;p&gt;Let’s back up a touch.&lt;/p&gt;

&lt;p&gt;There are a lot of rules to SDRAM interaction.  Most SDRAMs are configured in
memory &lt;em&gt;banks&lt;/em&gt;.  Banks are read and written in &lt;em&gt;rows&lt;/em&gt;.  The data in each row
is stored in a set of capacitors.  This allows for maximum data packing in
minimal area (cost).  However, you can’t read from a row of capacitors.  To
read from the memory, that row first needs to be copied to a row of fast
memory.  This is called
“activating” the row.  Once a row is activated, it can be read from or written
to.  Once you are done with one row, it must be “precharged” (i.e. put back),
before a different row can be activated.  All of this takes time.  If the
row you want isn’t activated, you’ll need to switch rows.  That will cause a
stall as the old row needs to be precharged and the new row activated.  Hence,
when making a long string of read or a long string of write requests, you’ll
suffer from a stall every time you cross rows.&lt;/p&gt;

&lt;p&gt;Xilinx’s MIG has another rule.  Because of how their architecture uses an IO
trained PLL (Xilinx calls this a “phaser”), the MIG needs to regularly read
from memory to keep this PLL trained.  During this time the memory must also
stall.  (Why the MIG can’t train on &lt;em&gt;my&lt;/em&gt; memory reads, but needs its own–I
don’t know.)  These stalls are very periodic, and if you dig a bit you can
find this taking place within their controller.&lt;/p&gt;

&lt;p&gt;Then the part of the trace showing a long stalled section reflects the reality
that, every now and again, the memory needs to be taken entirely off line for
a period of time so that the capacitors can be recharged.  This requires a
longer time period, as highlighted in Fig. 12 below.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 12.  SDRAM refresh cycles force long stalls&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/qoi-debug/20240605-migrefresh.png&quot; alt=&quot;&quot; width=&quot;800&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Once it’s time for a refresh cycle like this, several steps need to take place
in the memory controller–in this case the MIG.  First, any active rows need
to be precharged.  Then, the memory is refreshed.  Finally, you’ll need to
re-activate the row you need.  This takes time as well–as shown in Fig. 12.&lt;/p&gt;

&lt;p&gt;That’s part one–the stall signal.  &lt;a href=&quot;https://github.com/ZipCPU/zbasic/blob/e7b39a56ee515d1cabe8427f30c7add0592bfab1/sim/verilated/memsim.cpp&quot;&gt;My over-simplified SDRAM memory
model&lt;/a&gt;
doesn’t simulate any of these practical memory realities.&lt;/p&gt;

&lt;p&gt;Part two is the acknowledgments.  From these traces, you can see that there’s
about a 30 cycle latency (300ns) from the first request to the first
acknowledgment.  However, unlike my &lt;a href=&quot;https://github.com/ZipCPU/zbasic/blob/e7b39a56ee515d1cabe8427f30c7add0592bfab1/sim/verilated/memsim.cpp&quot;&gt;over-simplified memory
model&lt;/a&gt;,
the acknowledgments also come back broken due to the stalls.  This makes sense.
If every request takes 30 cycles, and some get stalled, then it only makes
sense that the stalled requests would get acknowledged later than the ones that
didn’t get stalled.&lt;/p&gt;

&lt;p&gt;Put together, this is why my waterfall display worked in simulation, but not
in hardware.&lt;/p&gt;
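
&lt;p&gt;A simulation model doesn’t need the MIG’s exact timing to break up the request and acknowledgment streams.  As a minimal sketch, assuming a pipelined Wishbone slave port, something like the following might exercise this kind of logic.  The module name, its parameters, and the LFSR stall pattern are all hypothetical.  This isn’t the memory model linked above, and it makes no attempt to model row changes or refresh timing.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;// Hypothetical Wishbone slave model with injected stalls.  All names
// and parameters here are made up for illustration.
module	wb_stallmem #(
		parameter	AW = 10, DW = 32,
		parameter	LATENCY = 30	// Clocks from accept to ACK, at least 2
	) (
		input	wire			i_clk, i_reset,
		input	wire			i_wb_cyc, i_wb_stb, i_wb_we,
		input	wire	[AW-1:0]	i_wb_addr,
		input	wire	[DW-1:0]	i_wb_data,
		output	wire			o_wb_stall, o_wb_ack,
		output	wire	[DW-1:0]	o_wb_data
	);

	reg	[DW-1:0]	mem		[0:(1&amp;lt;&amp;lt;AW)-1];
	reg	[DW-1:0]	data_pipe	[0:LATENCY-1];
	reg	[LATENCY-1:0]	ack_pipe;
	reg	[15:0]		lfsr;
	wire			accepted;
	integer			k;

	// A 16-bit LFSR makes for a cheap, repeatable stall pattern
	initial	lfsr = 16'hace1;
	always @(posedge i_clk)
		lfsr &amp;lt;= { lfsr[14:0], lfsr[15] ^ lfsr[13] ^ lfsr[12] ^ lfsr[10] };

	// Stall on about one clock in eight
	assign	o_wb_stall = (lfsr[2:0] == 3'b000);
	assign	accepted   = i_wb_cyc &amp;amp;&amp;amp; i_wb_stb &amp;amp;&amp;amp; !o_wb_stall;

	always @(posedge i_clk)
	if (accepted &amp;amp;&amp;amp; i_wb_we)
		mem[i_wb_addr] &amp;lt;= i_wb_data;

	// Delay each ACK by LATENCY clocks, and drop them all if CYC falls
	initial	ack_pipe = 0;
	always @(posedge i_clk)
	if (i_reset || !i_wb_cyc)
		ack_pipe &amp;lt;= 0;
	else
		ack_pipe &amp;lt;= { ack_pipe[LATENCY-2:0], accepted };

	// The read data rides along in a matching delay line
	always @(posedge i_clk)
	begin
		data_pipe[0] &amp;lt;= mem[i_wb_addr];
		for(k=1; k&amp;lt;LATENCY; k=k+1)
			data_pipe[k] &amp;lt;= data_pipe[k-1];
	end

	assign	o_wb_ack  = ack_pipe[LATENCY-1];
	assign	o_wb_data = data_pipe[LATENCY-1];
endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Since a stalled request is accepted late, its acknowledgment comes back late as well, leaving gaps in the return stream much like those in the traces above.&lt;/p&gt;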

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Wow, that was a long story!&lt;/p&gt;

&lt;p&gt;Yeah.  It was long from my perspective too.  Although the “bugs” amounted to
only 2-5 lines of Verilog, it took a lot of work to find those bugs.&lt;/p&gt;

&lt;p&gt;Here are some key takeaways to consider:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;All of this was predicated on a &lt;a href=&quot;/blog/2018/08/04/sim-mismatch.html&quot;&gt;simulation vs hardware
mismatch&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;Because the SDRAM simulation did not match the SDRAM reality, cycle for
cycle, a key hardware reality was missed in testing.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;This should’ve been caught via formal methods.&lt;/p&gt;

    &lt;p&gt;From now on, I’m going to have to make certain I check that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CYC&lt;/code&gt; is only
ever dropped following either a reset, an error, or the last
acknowledgment.  There should be zero requests outstanding when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CYC&lt;/code&gt; is
dropped.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Why wasn’t the pixel resynchronization bug caught via formal?&lt;/p&gt;

    &lt;p&gt;Because … FIFOs.  It can be a challenge to formally verify a design
containing a FIFO.  Rather than deal with this properly, I allowed the two
halves of the design to be somewhat independent–and so the formal tool
never really examined whether or not the design could (or would) properly
recover from a lost sync.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Did formally verifying the &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v&quot;&gt;overlay
module&lt;/a&gt;
help?&lt;/p&gt;

    &lt;p&gt;Yes.  When we went through it, we found bugs in it.  Once the &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v&quot;&gt;overlay
module&lt;/a&gt;
was formally verified, the result stopped &lt;em&gt;jumping&lt;/em&gt;.  Instead, the
overlay might just note a problem and stop showing the overlaid image.
Even better, unlike before the &lt;a href=&quot;https://github.com/ZipCPU/vgasim/blob/master/rtl/axisvoverlay.v&quot;&gt;overlay
module&lt;/a&gt;
was properly verified, I haven’t had any more instances of the top and
bottom pictures getting out of sync with each other.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;What about that blue field?&lt;/p&gt;

    &lt;p&gt;Yes, the waterfall background should be black when no signal was present.
The blue field turned out to be caused by a bug in the &lt;a href=&quot;https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_compress.v&quot;&gt;QOI compression
module&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;Once fixed, the captured image looked like Fig. 13 below.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Fig 13.  The captured image, once the QOI compression bug was fixed&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/qoi-debug/20240528-qoi-working.png&quot; alt=&quot;&quot; width=&quot;800&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;This was easily found and fixed.  (It had to do with a race condition on the
pixel index when writing to the compression table, if I recall correctly …)&lt;/p&gt;

&lt;ol start=&quot;6&quot;&gt;
  &lt;li&gt;
    &lt;p&gt;How about that &lt;a href=&quot;https://github.com/ZipCPU/qoiimg&quot;&gt;QOI module&lt;/a&gt;?&lt;/p&gt;

    &lt;p&gt;The thing worked like a champ!  I love the simplicity of the
&lt;a href=&quot;https://qoiformat.org&quot;&gt;QOI&lt;/a&gt;
encoding, enough so that I’m likely to use it again and again!&lt;/p&gt;

    &lt;p&gt;Okay, perhaps I’m overselling this.  It wasn’t perfect at first.  This is,
in many ways, to be expected–this was the first time it was ever used.
However, it was small and cheap, and worked well enough to get the job done.&lt;/p&gt;

    &lt;p&gt;Some time later, I managed to formally verify the
&lt;a href=&quot;https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_compress.v&quot;&gt;compression&lt;/a&gt;
engine, and I found another bug or two that had been missed in my hardware
testing.&lt;/p&gt;

    &lt;p&gt;That’s &lt;a href=&quot;https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_compress.v&quot;&gt;compression&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;Decompression?  That’s another story.  I think I’ve convinced myself that I
can do decompression in hardware, but the
&lt;a href=&quot;https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_decompress.v&quot;&gt;algorithm&lt;/a&gt;
(while cheap) isn’t
really straightforward any more.  At issue is the reality that it will take
several clock cycles (i.e. pipeline stages) to determine the table index for
storing colors into, yet the very next pixel might be dependent upon the
result of reading from the table.  Scheduling the pipeline isn’t
straightforward.  (Worse, I have simulation test cases showing that the
&lt;a href=&quot;https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_decompress.v&quot;&gt;decompression logic I have&lt;/a&gt;
doesn’t work yet.)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Are the displays ready for prime time?&lt;/p&gt;

    &lt;p&gt;I’d love to say so, but they don’t have labeled axes.  They really need
labeled axes to be proper &lt;em&gt;professional&lt;/em&gt; displays.  Perhaps a
&lt;a href=&quot;https://qoiformat.org&quot;&gt;QOI&lt;/a&gt; &lt;a href=&quot;https://github.com/ZipCPU/qoiimg/blob/master/rtl/qoi_decompress.v&quot;&gt;decompression
algorithm&lt;/a&gt;
can take labeled image data from memory and overlay it onto the display as
well.  However, to do this I’m going to have to redesign how I handle
scaling, otherwise the labels won’t match the image.&lt;/p&gt;

    &lt;p&gt;Worse, &lt;a href=&quot;https://x.com/Dg3Yev/status/1797779997190443498&quot;&gt;[DG3YEV Tobias] recently put my waterfall display to
shame&lt;/a&gt;.  My basic displays
are much too simple.  So, it looks like I might need to up my game.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I should point out, in passing, that the &lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3 SDRAM
controller&lt;/a&gt; doesn’t have nearly as
many stall cycles as Xilinx’s MIG.  It doesn’t use the (undocumented) hardware
phasers, so it doesn’t have to take the memory offline periodically.  Further,
it can schedule the row precharge and activation cycles so as to avoid
bus stalls (when accessing memory sequentially).  As such, it operates about
10% faster than the MIG.  It even gets a lower latency.  These details,
however, really belong in an article to themselves.&lt;/p&gt;

&lt;p&gt;I suppose the bottom line question is whether or not these displays are ready
for our next testing session.  The answer is a solid, No.  Not yet.  I still
need to do some more testing with them.  However, these displays are a lot
closer now than they’ve been for the last two years.&lt;/p&gt;
&lt;hr /&gt;&lt;p&gt;&lt;em&gt;Seest thou a man diligent in his business? he shall stand before kings; he shall not stand before mean men. (Prov 22:29)&lt;/em&gt;</description>
        <pubDate>Sat, 22 Jun 2024 00:00:00 -0400</pubDate>
        <link>https://zipcpu.com/video/2024/06/22/vidbug.html</link>
        <guid isPermaLink="true">https://zipcpu.com/video/2024/06/22/vidbug.html</guid>
        
        
        <category>video</category>
        
      </item>
    
      <item>
        <title>Bringing up Kimos</title>
        <description>&lt;p&gt;Ever had one of those problems where you were stuck for weeks?&lt;/p&gt;

&lt;p&gt;It’s not supposed to happen, but … it does.&lt;/p&gt;

&lt;p&gt;Let me tell you about the Kimos story so far.&lt;/p&gt;

&lt;h2 id=&quot;what-is-kimos&quot;&gt;What is Kimos?&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/kimos&quot;&gt;Kimos is the name of one of the current open source
projects&lt;/a&gt; I’m working on.  The project is
officially named the “Kintex-7 Memory controller, Open Source toolchain”, but
the team shortened that to “KiMOS” and I’ve gotten to the point where I just
call it “Kimos” (pronounced KEE-mos).  The goals of the project are twofold.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Test an &lt;a href=&quot;https://github.com/AngeloJacobo/uberDDR3&quot;&gt;Open Source DDR3 SDRAM memory
controller&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;This includes both performance testing, and performance comparisons against
Xilinx’s MIG controller.&lt;/p&gt;

    &lt;p&gt;Just as a note, &lt;a href=&quot;https://github.com/AngeloJacobo/uberDDR3&quot;&gt;Angelo’s
controller&lt;/a&gt; has a couple of
differences with Xilinx’s controller.  One of them is a simpler
“native” interface: Wishbone, with an option for one (or more)
auxiliary wire(s).  The auxiliary wire(s) are designed to simplify
&lt;a href=&quot;https://github.com/ZipCPU/wb2axip/blob/master/rtl/axim2wbsp.v&quot;&gt;wrapping this controller with a full AXI
interface&lt;/a&gt;.
Another difference is the fact that &lt;a href=&quot;https://github.com/AngeloJacobo/uberDDR3&quot;&gt;Angelo’s
controller&lt;/a&gt; is built using
documented Xilinx IO capabilities only–rather than the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PHY_CONTROL&lt;/code&gt; and
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PHASER*&lt;/code&gt; constructs that Xilinx used and chose not to document.&lt;/p&gt;

    &lt;p&gt;My hypothesis is that these differences, together with some internal
structural differences that I encouraged Angelo to make, will make his a
faster memory controller.  This test will tell.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Once the memory controller works, our goal is to test Kimos using an
entirely open source tool flow.&lt;/p&gt;

    &lt;p&gt;&lt;em&gt;This open source tool flow would replace Vivado.&lt;/em&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The project hardware itself is built by &lt;a href=&quot;https://www.enclustra.com&quot;&gt;Enclustra&lt;/a&gt;.
It consists of two boards: a &lt;a href=&quot;https://www.enclustra.com/en/products/base-boards/mercury-st1/&quot;&gt;Mercury+ ST1
baseboard&lt;/a&gt;,
and an associated &lt;a href=&quot;https://www.enclustra.com/en/products/fpga-modules/mercury-kx2/&quot;&gt;KX2
daughterboard&lt;/a&gt;.
Together, these boards provide some nice hardware capability in one place:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;There’s a large DDR3 SDRAM memory, with a 64b data width.  Ultimately,
this means we should be able to transfer 512b per FPGA clock.  In the case
of this project, that’ll be 512b for every 10ns (i.e. a 100MHz FPGA system
clock)–even though the memory itself can be clocked faster.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The board also has two Gb Ethernet interfaces, although I only have plans
for one of them.&lt;/p&gt;

    &lt;p&gt;Each interface (naturally) includes an &lt;a href=&quot;https://en.wikipedia.org/wiki/Management_Data_Input/Output&quot;&gt;MDIO management
interface&lt;/a&gt;.
Although I might be tempted to take this interface for granted, it
shouldn’t be.  It was via the &lt;a href=&quot;https://en.wikipedia.org/wiki/Management_Data_Input/Output&quot;&gt;MDIO
interface&lt;/a&gt;
that I was able to tell which of the two hardware interfaces corresponded
to ETH0 on the schematic and which was ETH1.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;There’s an SD card slot on the board, so I’ve already started using it to
test my &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO controller&lt;/a&gt; and its new
DMA capability.  Once tested, the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/tree/dev&quot;&gt;dev branch (containing the
DMA)&lt;/a&gt; will have been “tested” and
“hardware proven”, and so I’ll be able to then merge it into the &lt;a href=&quot;https://github.com/ZipCPU/sdspi/tree/master&quot;&gt;master
branch&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;I’m likely to use the FMC interface to test a &lt;a href=&quot;https://github.com/ZipCPU/wbsata&quot;&gt;new SATA
controller&lt;/a&gt; I’m working on.  A nice
&lt;a href=&quot;https://www.fpgadrive.com/&quot;&gt;FPGA Drive daughter board&lt;/a&gt;
from &lt;a href=&quot;https://www.ospero.com&quot;&gt;Ospero Electronic Design, Inc.,&lt;/a&gt; will help to
make this happen.&lt;/p&gt;

    &lt;p&gt;Do note, though, that &lt;a href=&quot;https://github.com/ZipCPU/wbsata&quot;&gt;this controller&lt;/a&gt;,
although posted, is most certainly broken and broken badly at present–it’s
just not that far along in its development to have any reliability to it.
The plan is to first build a SATA Verilog model, get the controller running
in simulation, and then to get it running on this Enclustra hardware.  It’s
just got a long way to go in its process at present.  The good news is that
the project is funded, so if you are interested in it, come back and check
in on it later–after I’ve had the chance to prove (and therefore fix) it.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The device also has some I2C interfaces, which I might investigate for
testing my &lt;a href=&quot;/blog/2021/11/15/ultimate-i2c.html&quot;&gt;ultimate I2C
controller&lt;/a&gt; on.  The
main I2C bus has three chips connected to it: an &lt;a href=&quot;https://media.digikey.com/pdf/Data%20Sheets/Silicon%20Laboratories%20PDFs/Si5338.pdf&quot;&gt;Si5338
clock controller&lt;/a&gt;
(which isn’t needed for any of my applications), an encrypted hash chip
(with … poor documentation–not recommended), and a &lt;a href=&quot;https://www.renesas.com/us/en/document/dst/isl12020m-datasheet&quot;&gt;real time
clock&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The design also has some of the more standard interfaces that everything
relies on, to include
&lt;a href=&quot;/blog/2019/03/27/qflexpress.html&quot;&gt;Flash&lt;/a&gt; and
&lt;a href=&quot;/formal/2019/02/21/txuart.html&quot;&gt;UART&lt;/a&gt;–both
of which I have controllers for already.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Although the
&lt;a href=&quot;https://www.enclustra.com/en/products/base-boards/mercury-st1/&quot;&gt;baseboard&lt;/a&gt;
has HDMI capabilities, Enclustra never connected the HDMI on the
&lt;a href=&quot;https://www.enclustra.com/en/products/base-boards/mercury-st1/&quot;&gt;baseboard&lt;/a&gt;
to the &lt;a href=&quot;https://www.enclustra.com/en/products/fpga-modules/mercury-kx2/&quot;&gt;KX2 daughterboard&lt;/a&gt;.
Hence, if I want video, I’ll need to use the DisplayPort hardware–something
I haven’t done before, but … it does have potential (just not funding).&lt;/p&gt;

    &lt;p&gt;This is a shame, because I have a bunch of live HDMI displays that I’d love
to port to this project that … just aren’t likely to happen.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Eventually, my plan is to port my SONAR work to this hardware–but that remains
a far off vision at this point.&lt;/p&gt;

&lt;p&gt;The project is currently a work in progress, so I have not gotten to the point
of completing either of the open source objectives.  (Since I initially drafted
this, &lt;a href=&quot;https://github.com/AngeloJacobo/uberDDR3&quot;&gt;Angelo’s controller&lt;/a&gt; has
now been ported, and appears to be working–its performance just hasn’t
been measured yet.)&lt;/p&gt;

&lt;p&gt;I have, however, completed a first milestone: getting the design working with
Xilinx’s MIG controller.  For a task that should’ve taken no longer than a
couple of days, this portion of the task has taken a month and a half–leaving
me stuck in &lt;a href=&quot;/fpga-hell.html&quot;&gt;FPGA Hell&lt;/a&gt; for most of this
time.&lt;/p&gt;

&lt;p&gt;Now that I have Xilinx’s MIG working, I’d like to share a brief description of
what went wrong, and why this took so long.  Perhaps others may learn from my
failures as well.&lt;/p&gt;

&lt;h2 id=&quot;the-challenges-with-board-bringup&quot;&gt;The challenges with board bringup&lt;/h2&gt;

&lt;p&gt;The initial steps in board bringup went quickly: I could get the LEDs and
serial port up and running with no problems.  From there I could
&lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/master/sw/board/cputest.c&quot;&gt;test&lt;/a&gt; the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; (running out of block RAM), and
things looked good.  At this point, a year or so ago, I put the board on the
shelf to come back to it later when I had more time and motivation (i.e.
funding).&lt;/p&gt;

&lt;p&gt;I wasn’t worried about the next steps.  I already had controllers for the
main hardware components necessary to move forward.  I had &lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/master/rtl/migsdram.v&quot;&gt;a controller that
would work nicely with Xilinx’s
MIG&lt;/a&gt;, &lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/master/rtl/net/enetstream.v&quot;&gt;another
that would handle the Gb
Ethernet&lt;/a&gt;, &lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/master/rtl/qflexpress.v&quot;&gt;a
flash controller&lt;/a&gt;,
and so on.  These were all proven controllers, so it was just a matter of
integrating them and making sure things worked (again) as expected.&lt;/p&gt;

&lt;p&gt;Once the &lt;a href=&quot;https://github.com/ZipCPU/kimos&quot;&gt;Kimos project&lt;/a&gt; kicked off, with the
goals listed above, I added these components to the project and immediately
had problems.&lt;/p&gt;

&lt;h3 id=&quot;the-done-led&quot;&gt;The DONE LED&lt;/h3&gt;

&lt;p&gt;The first problem was that the “DONE” LED wouldn’t light.  Or, rather, it would
light just fine until I tried to include Xilinx’s MIG controller.  Once I
included Xilinx’s MIG controller into the design the LED would no longer light.&lt;/p&gt;

&lt;p&gt;Now … how do you fix that one?  I mean, where do you even start?&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/kimos/one-bug.svg&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;I started by reducing the design as much as possible.  I removed components
from the design, and adjusted which components were in the design and which
were not.  With a bit of work, I was able to prove–as mentioned above–that
the design would work as long as Xilinx’s MIG (DDR3 SDRAM) controller was not
a part of the design.  The moment I added Xilinx’s MIG, the design stopped
working.&lt;/p&gt;

&lt;p&gt;Ouch.  What would cause that?  Is there a short circuit on the board somewhere?
Did I mess up the XDC file?  The MIG configuration?&lt;/p&gt;

&lt;p&gt;With some help from some other engineers, we traced the first problem to the
open source FPGA loader I was using:
&lt;a href=&quot;https://github.com/trabucayre/openFPGALoader&quot;&gt;openFPGALoader&lt;/a&gt;.  As it turns
out, this &lt;a href=&quot;https://github.com/trabucayre/openFPGALoader/issues/229&quot;&gt;loader struggles to load large/complex designs at high JTAG
frequencies&lt;/a&gt;.  However,
if you drop the frequency down from 4MHz to &lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/804559a307bd58e34527ebb3f791ad414fe71803/Makefile#L283&quot;&gt;3.75MHz, the loader will “just”
work&lt;/a&gt;
and the DONE LED will get lit.&lt;/p&gt;

&lt;p&gt;The problem goes a bit deeper, and highlights a problem I’ve had personally as
well: since the developer of the
&lt;a href=&quot;https://github.com/trabucayre/openFPGALoader&quot;&gt;openFPGALoader&lt;/a&gt;
component can’t replicate the problem with the hardware he has, he can’t really
test fixes.  Hence, although a valid fix has been proposed, the developer
is uncertain of it.  Still, without help, I wouldn’t have made it this far.&lt;/p&gt;

&lt;p&gt;Sadly, now that the DONE LED lit up for my design, it still didn’t work.
Worse, I no longer trusted the
&lt;a href=&quot;https://github.com/trabucayre/openFPGALoader&quot;&gt;FPGA loader&lt;/a&gt;.
This left me always looking over my shoulder for another loading option.&lt;/p&gt;

&lt;p&gt;For example, I tried programming the design into
&lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/master/rtl/qflexpress.v&quot;&gt;flash&lt;/a&gt;
and then using my &lt;a href=&quot;https://github.com/ZipCPU/kimos/rtl/wbicapetwo.v&quot;&gt;internal configuration access port (ICAPE)
controller&lt;/a&gt; to
load the design from
&lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/master/rtl/qflexpress.v&quot;&gt;flash&lt;/a&gt;.
This didn’t work, until I first took the
&lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/master/rtl/qflexpress.v&quot;&gt;flash&lt;/a&gt; out of
eXecute in Place (XiP) mode.  (Would I have known that, if I hadn’t been the
one to build the &lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/master/rtl/qflexpress.v&quot;&gt;flash
controller&lt;/a&gt;
and put it into XiP mode in the first place?  I’m not sure.)
However, if I first told the
&lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/master/rtl/qflexpress.v&quot;&gt;flash&lt;/a&gt;
to leave XiP mode, I could then specify a warm-boot address to my
&lt;a href=&quot;https://github.com/ZipCPU/kimos/rtl/wbicapetwo.v&quot;&gt;ICAPE&lt;/a&gt; controller,
followed by an IPROG command, which could then load any design that …
didn’t include Xilinx’s MIG DDR3 SDRAM controller.&lt;/p&gt;

&lt;p&gt;At this point, I had proved that my problem was no longer the
&lt;a href=&quot;https://github.com/trabucayre/openFPGALoader&quot;&gt;openFPGALoader&lt;/a&gt;.  That was
the good news.  The bad news was that the design still wasn’t working whenever
I included the MIG.&lt;/p&gt;

&lt;h3 id=&quot;jtaguart-not-working&quot;&gt;JTAG/UART not working&lt;/h3&gt;

&lt;p&gt;If the design loads, the place I want to go next is to get an internal logic
analyzer up and running.  Here, I have two options:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Xilinx’s ILA requires a JTAG connection.&lt;/p&gt;

    &lt;p&gt;Without a Xilinx compatible JTAG connector, I can’t use Xilinx’s ILA.&lt;/p&gt;

    &lt;p&gt;At one point I purchased a USB based JTAG controller.  I … just didn’t
manage to purchase the right one, and so the pins never fit.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;I typically do my &lt;a href=&quot;/blog/2017/06/28/dbgbus-goal.html&quot;&gt;debugging over
UART&lt;/a&gt;, using a
&lt;a href=&quot;https://github.com/ZipCPU/wbscope&quot;&gt;Wishbone scope&lt;/a&gt;–something we’ve
&lt;a href=&quot;/blog/2017/07/08/getting-started-with-wbscope.html&quot;&gt;already discussed on the
blog&lt;/a&gt;.
Using this method I can quickly find and debug problems.&lt;/p&gt;

    &lt;p&gt;However, with this particular design, any time I added the MIG SDRAM
controller to the design my &lt;a href=&quot;/blog/2017/06/28/dbgbus-goal.html&quot;&gt;UART debugging
port&lt;/a&gt; would stop
working–together with the rest of the design.  That left me with no UART,
and no JTAG.  Indeed, I could ping the board via the Gb Ethernet right up
until I added the MIG.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Something was seriously wrong.  This is definitely &lt;em&gt;not&lt;/em&gt;
&lt;a href=&quot;https://english.stackexchange.com/questions/25897/origin-of-the-phrase-now-were-cooking-with&quot;&gt;“cooking with gas”&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So how then do you debug something like this?  LEDs!&lt;/p&gt;

&lt;h3 id=&quot;leds-not-working&quot;&gt;LEDs not working&lt;/h3&gt;

&lt;p&gt;Debugging by LED is slow.  It can take 10+ minutes to make a change to a design,
and each LED will only (at best) give you one bit of output.  So the feedback
isn’t that great.  Still, they are an important part of debugging early design
configuration issues.  In this case, the &lt;a href=&quot;https://www.enclustra.com/en/products/fpga-modules/mercury-kx2/&quot;&gt;Enclustra KX2
daughterboard&lt;/a&gt;
has four LEDs on it, and the &lt;a href=&quot;https://www.enclustra.com/en/products/base-boards/mercury-st1/&quot;&gt;Mercury+ ST1
baseboard&lt;/a&gt;
has another 4 LEDs.  Perhaps they could be used to debug the next steps?&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/kimos/side-by-side.svg&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Normally, I build my designs with a &lt;a href=&quot;/blog/2017/05/20-knight-rider.html&quot;&gt;“Knight Rider” themed LED
display&lt;/a&gt;.  This helps me
know that my FPGA design has loaded properly.  There are two parts to this
display.  First, there’s an “active” LED that moves from one end of the LED
string to the other and then back again.  This “active” LED is ON with full
brightness–whatever that means for an individual design.  Then, once the
“active” LED moves on to the next LED in the string, a PWM (actually
&lt;a href=&quot;/dsp/2017/09/04/pwm-reinvention.html&quot;&gt;PDM&lt;/a&gt;)
signal is used to “dim” the LED in a decaying fashion.  Of course, &lt;a href=&quot;/zipcpu/2019/02/09/cpu-blinky.html&quot;&gt;the
CPU can easily override this
display&lt;/a&gt; as necessary.&lt;/p&gt;

&lt;p&gt;My problem was that, even though the “DONE” LED would (now) light up when
loading a design containing the MIG, these user LEDs were not doing anything.&lt;/p&gt;

&lt;p&gt;Curiously, if I overrode the LEDs at the top level of the design, I could make
them turn either on or off.  I just couldn’t get my internal design to control
these LEDs properly.  (I call this an “override” method because the
&lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/804559a307bd58e34527ebb3f791ad414fe71803/rtl/toplevel.v#L474-L477&quot;&gt;top level&lt;/a&gt;
of my design is generated by
&lt;a href=&quot;/zipcpu/2017/10/05/autofpga-intro.html&quot;&gt;AutoFPGA&lt;/a&gt;, and I
wasn’t going so far as to adjust the &lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/804559a307bd58e34527ebb3f791ad414fe71803/autodata/spio.txt#L64-L67&quot;&gt;original source&lt;/a&gt;s
describing how these LEDs should ultimately operate.) Still, using this
top-level override method, I was able to discover that I could see LEDs 4-7
from my desk chair, that these were how I had wired up the LEDs on the
baseboard (a year earlier), and that LEDs 6 and 7 had an opposite polarity
from all of the other LEDs on the board.&lt;/p&gt;

&lt;p&gt;All useful, it just didn’t help.&lt;/p&gt;

&lt;p&gt;At one point, I noticed that the LEDs were configured to use the IO standard
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SSTL15&lt;/code&gt; instead of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LVCMOS15&lt;/code&gt; standard I normally use.  Once I
switched from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SSTL15&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LVCMOS15&lt;/code&gt;, my &lt;a href=&quot;/blog/2017/05/20-knight-rider.html&quot;&gt;knight-rider
display&lt;/a&gt; worked.&lt;/p&gt;

&lt;p&gt;Unfortunately, neither the serial port nor the Ethernet port worked.  Both of
these continued to work if the MIG controller wasn’t included in the design,
just not if the MIG controller was included.&lt;/p&gt;

&lt;h3 id=&quot;voodoo-engineering&quot;&gt;Voodoo Engineering&lt;/h3&gt;

&lt;p&gt;I like to define Voodoo engineering as “Changing what isn’t broken, in an
attempt to fix what is.”  Not knowing what else to try, I spent a lot of time
doing Voodoo engineering just trying to get the design working.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;With the help of a hardware friend and his lab, we examined all of the power
rails.  Could it be that the design was losing power during the startup
sequence, and so not starting properly even though the “DONE” LED was
lighting up?&lt;/p&gt;

    &lt;p&gt;No.&lt;/p&gt;

    &lt;p&gt;After a lot of work with various probes, all we discovered was that the
design used about 50% more power when the MIG was included.  Did this mean
there was a short circuit somewhere?&lt;/p&gt;

    &lt;p&gt;Curiously, it was the FPGA that got warmer, not the DDR3 SDRAM.&lt;/p&gt;

    &lt;p&gt;I left this debug session convinced I needed to look for a bug in my XDC
file somewhere.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;I spent a lot of time comparing the schematic to the XDC file.  I discovered
some rather important things:&lt;/p&gt;

    &lt;ul&gt;
      &lt;li&gt;
        &lt;p&gt;Some banks required internal voltage references.  These were not declared
in any of the reference designs.&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;Two banks needed DCI cascade support, but the reference design only had
one bank using it.&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;The design required a voltage select pin that I wasn’t setting.  This pin
needed to be set to high impedance.&lt;/p&gt;
      &lt;/li&gt;
      &lt;li&gt;
        &lt;p&gt;I had the DDR3 CKE IO mapped to the wrong pin.&lt;/p&gt;
      &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The &lt;a href=&quot;https://www.enclustra.com/en/products/base-boards/mercury-st1/&quot;&gt;Enclustra ST1
baseboard&lt;/a&gt;
can support multiple IO voltages.  These need to be configured via a set of
user jumpers, and the constraints regarding how these IO voltages are to be
set are … complex.  Eventually, I set banks A and B to 1.8V and bank C
to 1.2V.&lt;/p&gt;

    &lt;p&gt;Sadly, nothing but the LEDs was using banks B and C, so … none of these
changes helped.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I suppose I should be careful here: I was probably fixing actual bugs during
these investigations.  However, none of the bugs I fixed actually helped move
me forward.  Fixing these bugs didn’t get the
&lt;a href=&quot;/formal/2019/02/21/txuart.html&quot;&gt;UART&lt;/a&gt;+SDRAM working, nor did
they get the network interface working whenever the SDRAM was included.  Both
of these interfaces worked without the SDRAM as part of the design, they just
didn’t work when connecting the MIG SDRAM controller to the design.&lt;/p&gt;

&lt;p&gt;Was there some short circuit connection between SDRAM pins and something
on the &lt;a href=&quot;/formal/2019/02/21/txuart.html&quot;&gt;UART&lt;/a&gt; or network IO
banks?  There shouldn’t be, I mean, both of these peripherals were on
separate IO banks from the DDR3 SDRAM.&lt;/p&gt;

&lt;h3 id=&quot;reference-design&quot;&gt;Reference design&lt;/h3&gt;

&lt;p&gt;At this point, I needed to use the reference design to make certain the
hardware still worked.  I’d had weeks of problems where the DONE pin wasn’t
going high.  Did this mean I’d short circuited or otherwise damaged the board?
The design was using a lot more power when configured to use the SDRAM.  Did
this mean there was a short circuit damaging the board?  Had my board been
broken?  Was there a manufacturing defect?&lt;/p&gt;

&lt;p&gt;Normally, this is where you’d use a reference design.  Indeed, this was
&lt;a href=&quot;https://www.enclustra.com&quot;&gt;Enclustra&lt;/a&gt;’s recommendation to me.  Normally this
would be a good recommendation.  They recommended I use their reference design,
prove that the hardware worked, and then slowly migrate that design to my
needs.  My problem with this approach was that their reference design wasn’t
written in RTL.  It was written in TCL with a Verilog wrapper.  Worse, their
TCL Ethernet implementation depended upon an Ethernet controller from Xilinx
that … required a license.  Not only that,
&lt;a href=&quot;https://www.enclustra.com&quot;&gt;Enclustra&lt;/a&gt; did not provide any master XDC file(s).
(They did provide schematics and a .PRJ file with many of the IOs declared
within it.)  Still, how do you “slowly migrate” TCL to RTL?  That left me with
just their MIG PRJ file to reference and … I still had a bug.&lt;/p&gt;

&lt;p&gt;There were a couple of differences between &lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/804559a307bd58e34527ebb3f791ad414fe71803/doc/mig.prj&quot;&gt;my MIG PRJ configuration
file&lt;/a&gt;
and their reference MIG configuration.  My &lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/804559a307bd58e34527ebb3f791ad414fe71803/doc/mig.prj&quot;&gt;MIG PRJ configuration
file&lt;/a&gt;
used a 100MHz user clock, and hence a 400MHz DDR3 clock, whereas their
reference file used an 800MHz DDR3 clock.  (My design wouldn’t close timing at
200MHz, so I was backing away to 100MHz.)  Could this be the difference?&lt;/p&gt;

&lt;p&gt;Upon request, one of my teammates built a LiteX design for this board.  (It
took him less than 2hrs.  I’d been stuck for weeks!  How’d he get it going so
fast?  Dare I mention I was jealous?)  This LiteX design had no problems with
the DDR3 SDRAM–although it doesn’t use Xilinx’s MIG.  I even had him
configure this LiteX demo for the 400MHz DDR3 clock, and … there were no
problems.&lt;/p&gt;

&lt;p&gt;Given that the LiteX design “just worked”, I knew the hardware on my board
still worked.  I just didn’t know what I was doing wrong.&lt;/p&gt;

&lt;h3 id=&quot;the-final-bug-the-reset-polarity&quot;&gt;The final bug: the reset polarity&lt;/h3&gt;

&lt;p&gt;One difference between the MIG driven design and the non-MIG design (i.e. my
design without a DDR3 SDRAM controller) is that the MIG controller wants to
deliver both system clock and the system reset to the rest of the design.  Any
failure to get either a system clock or a system reset from the MIG controller
could break the whole design.&lt;/p&gt;

&lt;p&gt;So, I went back to the top level LEDs again.  I re-examined the logic, and
made sure LED[7] would blink if the MIG was held in reset, and LED[6] would
blink if the clocks didn’t lock.  This led me to two problems.  The first
problem was based upon where I had my board set up: I couldn’t see LED[7]
from my desk top with a casual glance.  I had to make sure I leaned forward
in my desk chair to see it.  (Yes, this cost me a couple of debug cycles before
I realized I couldn’t see all of the LEDs without leaning forward.)  Once I
could see it, however, I discovered the system reset wire was being held high.&lt;/p&gt;

&lt;p&gt;Well, that would be a problem.&lt;/p&gt;
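
&lt;p&gt;For reference, a diagnostic of this kind might look something like the
sketch below.  The module and signal names are hypothetical, and not copied
from the Kimos sources.  The one point worth noticing is that the blink counter
runs from a free-running board clock rather than from the clock generated by
the MIG.  An LED that depended upon the MIG’s clock couldn’t report that the
clock was missing.&lt;/p&gt;

&lt;div class=&quot;language-verilog highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// A sketch only: all names below are assumptions made for illustration
module	dbgleds (
		input	wire		i_board_clk,	// Free running, not from the MIG
		input	wire		i_mig_reset,	// User reset, active high
		input	wire		i_clocks_locked,
		output	wire	[1:0]	o_dbg_led
	);

	reg	[26:0]	blink_counter;

	initial	blink_counter = 0;
	always @(posedge i_board_clk)
		blink_counter &amp;lt;= blink_counter + 1;

	// Blink while the design is being held in reset.  (The reset
	// crosses clock domains here, which is harmless for an LED.)
	assign	o_dbg_led[1] = i_mig_reset     ? blink_counter[26] : 1'b0;

	// Blink if the clocks have not locked
	assign	o_dbg_led[0] = i_clocks_locked ? 1'b0 : blink_counter[26];
endmodule
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On a board like this one, where LEDs 6 and 7 have the opposite polarity from
the rest, those outputs would also need to be inverted at the top level.&lt;/p&gt;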

&lt;p&gt;Normally, when I use the MIG controller, I use an active high reset.  This
time, in order to weed out all of the possible bugs, I’d been trying to make my
MIG configuration as close as possible to the example/reference configuration
I’d been given.  That meant I set the design up to use an active-low reset–like the
reference design.  I had assumed that, if the MIG were given an active low
reset it would produce an active low user reset for the design.&lt;/p&gt;

&lt;p&gt;Apparently, I was wrong.  Indeed, after searching out the Xilinx user guide,
I can confirm I was definitely wrong.  The synchronous user reset was active
high.&lt;/p&gt;
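
&lt;p&gt;In Verilog terms, the lesson is to treat the MIG’s user reset as active high
no matter how the MIG’s &lt;em&gt;input&lt;/em&gt; reset has been configured.  Here’s a
minimal sketch of the idea.  It assumes the MIG’s usual
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ui_clk&lt;/code&gt; and
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ui_clk_sync_rst&lt;/code&gt;
user interface port names.  Every other name is hypothetical, and this is not
the logic found in the Kimos repository.&lt;/p&gt;

&lt;div class=&quot;language-verilog highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// A sketch only: names other than ui_clk and ui_clk_sync_rst are assumptions
module	migreset (
		input	wire	ui_clk,			// User clock from the MIG
		input	wire	ui_clk_sync_rst,	// User reset: active HIGH
		output	wire	o_sys_clk,
		output	wire	o_sys_reset,		// For active-high logic
		output	wire	o_sys_reset_n		// For active-low logic
	);

	assign	o_sys_clk     = ui_clk;

	// Pass the active-high reset straight through ...
	assign	o_sys_reset   = ui_clk_sync_rst;
	// ... and invert it only for logic expecting an active-low reset
	assign	o_sys_reset_n = !ui_clk_sync_rst;
endmodule
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;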

&lt;p&gt;Once I switched to an active high reset things started working.  My serial
port now worked.  I could now read from memory over the UART interface, and
“ping” the network interface of the device.  Even better, my debugging
interface now worked.  That meant I could use my
&lt;a href=&quot;https://github.com/ZipCPU/wbscope&quot;&gt;Wishbone scope&lt;/a&gt; again.&lt;/p&gt;

&lt;p&gt;I was now &lt;a href=&quot;https://english.stackexchange.com/questions/25897/origin-of-the-phrase-now-were-cooking-with&quot;&gt;“cooking with gas”&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;cleaning-up&quot;&gt;Cleaning up&lt;/h3&gt;

&lt;p&gt;From here on out, things went quickly.  Sure, there were more bugs, but these
were easily found, identified, and thus fixed quickly.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;While the design came up and I could (now) read from memory, I couldn’t write
to memory without hanging up the design.  &lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/96a24e5756a9e9a363d3d47c7962303afb2f65bd/rtl/migsdram.v#L354&quot;&gt;After tracing it, this bug turned
out to be a simple copy
error&lt;/a&gt;.  It was part of some logic I was getting ready
to test which would’ve run the MIG at 200MHz, and the design at 100MHz–just
in case that was the issue.&lt;/p&gt;

    &lt;p&gt;This bug was found by adding a &lt;a href=&quot;https://github.com/ZipCPU/wbscope&quot;&gt;Wishbone
scope&lt;/a&gt; to the design, and then
seeing the MIG accept a request that never got acknowledged.&lt;/p&gt;

    &lt;p&gt;Yeah, that’d lock a bus up real quick.&lt;/p&gt;

    &lt;p&gt;I should point out that, because I use Wishbone and because Wishbone has the
ability to &lt;em&gt;abort&lt;/em&gt; an ongoing transaction, I was able to rescue my
connection to the board, and therefore my connection to the bus, even after
this fault.  No, I couldn’t rescue my connection to the SDRAM without a
full reset, but I could still talk to the board and hence I could still use
my &lt;a href=&quot;https://github.com/ZipCPU/wbscope&quot;&gt;Wishbone scope&lt;/a&gt; to debug the problem.
Had this been an AXI bus, I would not have had this capability without using
some form of &lt;a href=&quot;/formal/2020/05/16/firewall.html&quot;&gt;protocol
firewall&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Other bugs were found in the &lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/804559a307bd58e34527ebb3f791ad414fe71803/sw/host/nexbus.cpp&quot;&gt;network software&lt;/a&gt;.
This was fairly new software, never used before, so finding bugs here was
not really all that surprising.&lt;/p&gt;

    &lt;p&gt;At least with these bugs, I could use my &lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/804559a307bd58e34527ebb3f791ad414fe71803/sw/host/nexbus.cpp&quot;&gt;network
software&lt;/a&gt;
together with my Verilator-based simulation environment.  Indeed, my &lt;a href=&quot;https://github.com/ZipCPU/kimos/blob/master/sim/netsim.cpp&quot;&gt;C++
network model&lt;/a&gt;
allows me to send/receive UDP packets to the simulated design, and receive
back what the design would return.&lt;/p&gt;

    &lt;p&gt;Like I said, by this point I was &lt;a href=&quot;https://english.stackexchange.com/questions/25897/origin-of-the-phrase-now-were-cooking-with&quot;&gt;“cooking with gas”&lt;/a&gt;.
It took about two days (out of 45) to get this portion up and running.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The one bug that was a bit surprising was due to a network access test that
set the host software into an infinite loop.  During this infinite loop, the
software would keep writing to a debug dump, which I was hoping to later use
to debug any issues.  The surprise came from the fact that I wasn’t expecting
this issue, so I had let the test run while I stepped away for some family
time.  (Supper and a movie with the kids may have been involved here …)
When I discovered the bug, the debug dump file had grown to over 270GB!
Still, fixing this bug was pretty routine, and there’s not a lot to share
other than it was just another bug.&lt;/p&gt;

&lt;h2 id=&quot;lessons-learned&quot;&gt;Lessons learned&lt;/h2&gt;

&lt;p&gt;There are a lot of lessons to be learned here, some of them from problems I
brought upon myself.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;All RTL&lt;/p&gt;

    &lt;p&gt;I like all RTL designs.  I prefer all RTL designs.  I can debug an all RTL
design.  I can adjust an all RTL design.  I can version control an all RTL
design.&lt;/p&gt;

    &lt;p&gt;I can’t do this with a TCL design that references opaque components that
may get upgraded or updated any time I turn around.  Worse, I can’t fix
an opaque component–and Xilinx isn’t known for fixing the bugs in their
designs.  As an example, the following bug has lived in Xilinx’s
Ethernet-Lite controller for years:&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;table align=&quot;center&quot; style=&quot;float: center&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/xilinx-axi-ethernetlite/2022.1-rvalid.png&quot; width=&quot;749&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;I reported this in 2019.  This is only one of several bugs I found.  The logic
above is as of Vivado 2022.1.  In this snapshot, you can see how they commented
the originally broken code.  As a result, the current design now looks like
they tried to fix it and … it’s still broken on its face.  (i.e. RVALID
shouldn’t be adjusted or dropped unless RREADY is known to be true …)&lt;/p&gt;
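
&lt;p&gt;The rule being broken is easy to write down as a formal property.  The
following is a sketch of how it might be stated, and not the check from any
particular project.  The module name and the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;S_AXI_*&lt;/code&gt; port
names are assumptions, and the assertion is written in the SystemVerilog subset
accepted by formal tools such as SymbiYosys.  It is not meant for synthesis.&lt;/p&gt;

&lt;div class=&quot;language-verilog highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;// A sketch only: the module and port names are assumptions
module	f_rvalid_stall (
		input	wire	S_AXI_ACLK,
		input	wire	S_AXI_ARESETN,
		input	wire	S_AXI_RVALID,
		input	wire	S_AXI_RREADY
	);

`ifdef	FORMAL
	// $past() has no meaning on the very first clock tick
	reg	f_past_valid;

	initial	f_past_valid = 1'b0;
	always @(posedge S_AXI_ACLK)
		f_past_valid &amp;lt;= 1'b1;

	always @(posedge S_AXI_ACLK)
	if (f_past_valid)
	begin
		if ($past(S_AXI_ARESETN))
		begin
			if ($past(S_AXI_RVALID))
			begin
				// If the response was valid but not yet
				// accepted, RVALID is not allowed to drop
				if (!$past(S_AXI_RREADY))
					assert(S_AXI_RVALID);
			end
		end
	end
`endif
endmodule
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;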

&lt;p&gt;Or what about RDATA?&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: center&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/xilinx-axi-ethernetlite/2022.1-check.png&quot; width=&quot;749&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;This also violates the first principles of &lt;a href=&quot;/blog/2021/08/28/axi-rules.html&quot;&gt;AXI
handshaking&lt;/a&gt;.  Notice that
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RDATA&lt;/code&gt; might not get set if &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;!RVALID &amp;amp;&amp;amp; !RREADY&lt;/code&gt;–hence the first &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RDATA&lt;/code&gt;
value read from this device might be in error.&lt;/p&gt;

&lt;p&gt;Yeah, … no.  I’m not switching to Xilinx IP any time soon if I can avoid it.
At least with my own IP I can fix any problems–once I find them.&lt;/p&gt;

&lt;p&gt;For all of these reasons, I would want an all HDL reference design from any
vendor I purchase hardware from.  At least in this case, you can now find
an &lt;a href=&quot;https://github.com/ZipCPU/kimos&quot;&gt;all-Verilog reference design for the ST1+KX2 boards in my Kimos
project&lt;/a&gt;–to include a working (and now open
source) &lt;a href=&quot;https://github.com/AngeloJacobo/uberDDR3&quot;&gt;DDR3 SDRAM controller&lt;/a&gt;.&lt;/p&gt;

&lt;ol start=&quot;2&quot;&gt;
  &lt;li&gt;
    &lt;p&gt;Simulation.&lt;/p&gt;

    &lt;p&gt;Perhaps my biggest problem was that I didn’t have an all-Verilog simulation
environment set up for this design from the top level on down.  Such an
environment should’ve found this reset bug at the top level of the design
immediately.  Instead, what I have is a joint Verilog/C++ environment
designed to debug the design from just below the top level using Verilator.
This kept me from finding and identifying the reset bug–something that
could have (and perhaps should have) been found in simulation.&lt;/p&gt;

    &lt;p&gt;In the end, after finding the reset bug, I did break down and I found a
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3/blob/main/testbench/ddr3.sv&quot;&gt;Micron model of a DDR3 memory&lt;/a&gt;.
This was enough to debug some issues associated with getting the &lt;a href=&quot;https://github.com/ZipCPU/wbscope&quot;&gt;Wishbone
scope&lt;/a&gt; working inside the memory
controller, although it’s not really a permanent solution.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/kimos/open-sim.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Still, this is a big enough problem that I’ve been shopping around the idea
   of an open source all-Verilog simulation environment–something faster than
   Iverilog, with more capability.  If you are interested in working on
   building such a capability–let me know.&lt;/p&gt;

&lt;ol start=&quot;3&quot;&gt;
  &lt;li&gt;
    &lt;p&gt;Finger pointing&lt;/p&gt;

    &lt;p&gt;As is always the case, I tend to point the finger everywhere else when I
can’t find a bug.  This seems to be a common trait among engineers.  For
the longest time I was convinced that my design was creating a short
circuit on the board.  As is typically the case, I often have to come back
to reality once I do find the bugs.&lt;/p&gt;

    &lt;p&gt;I guess the bottom line here is that I have more than enough humble pie to
share.  Feel free to join me.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Since writing this, the project has moved forward quite significantly.  The
design now appears to work with both the MIG and with the
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3&lt;/a&gt; controller–although I
made some more beginner mistakes in the clock setup while getting that
controller up and running.  Still, it’s up and running now, so my next task
will be running some performance metrics to see which controller runs
faster/better/cheaper.  (Hint: the
&lt;a href=&quot;https://github.com/AngeloJacobo/UberDDR3&quot;&gt;UberDDR3&lt;/a&gt; controller uses about
30% less logic, so there’s at least one difference right off the bat.)&lt;/p&gt;

&lt;p&gt;Stay tuned, and I’ll keep you posted regarding how the two controllers compare
against each other.&lt;/p&gt;
&lt;hr /&gt;&lt;p&gt;&lt;em&gt;For I am not ashamed of the gospel of Christ: for it is the power of God unto salvation to every one that believeth; to the Jew first, and also to the Greek. (Romans 1:16)&lt;/em&gt;</description>
        <pubDate>Thu, 13 Jun 2024 00:00:00 -0400</pubDate>
        <link>https://zipcpu.com/blog/2024/06/13/kimos.html</link>
        <guid isPermaLink="true">https://zipcpu.com/blog/2024/06/13/kimos.html</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>Chasing resets</title>
        <description>&lt;p&gt;A true story.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right; padding: 20px&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/chasing-resets/cost-estimate.svg&quot; width=&quot;240&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;Some years ago, given a customer’s honest need and request, I proposed a
change to a client’s &lt;a href=&quot;/blog/2021/03/06/asic-lsns.html&quot;&gt;ASIC&lt;/a&gt;
IP.  Specifically, I wanted to add &lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC
checking&lt;/a&gt;,
based upon a &lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
kept in an &lt;a href=&quot;https://en.wikipedia.org/wiki/Out-of-band&quot;&gt;out-of-band memory
region&lt;/a&gt;, to verify the ability
to properly read memory regions error free.  I said the change shouldn’t take
more than about two weeks, and I’d clean up some other problems I was aware of
in the meantime.  This change solved an urgent problem, so he agreed
to it.&lt;/p&gt;

&lt;p&gt;By the time I was done, my 80 hr proposal had turned into 270+ hrs of work.&lt;/p&gt;

&lt;h2 id=&quot;build-it-well&quot;&gt;Build it well&lt;/h2&gt;

&lt;p&gt;I’d like to start my discussion of what went wrong with a list of good
practices to follow.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 20px;&quot;&gt;&lt;caption&gt;Fig 1. Basic test bench components&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/chasing-resets/verilogtb.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;Just as a background, a general test bench follows the format shown in Fig. 1,
on the left.  The “test bench” itself is composed of a series of scripts.
These scripts then interact with a common test bench “library”, which then
makes requests of an AXI bus via a “bus functional model”.  This project
was designed to make minor changes to the device under test.&lt;/p&gt;

&lt;p&gt;With that vocabulary under our belt, here are some of the good practices
I would expect to find in a well built design.&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Avoid &lt;a href=&quot;https://en.wikipedia.org/wiki/Magic_number_(programming)&quot;&gt;magic numbers&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;Yes, I harp on &lt;a href=&quot;https://en.wikipedia.org/wiki/Magic_number_(programming)&quot;&gt;magic
numbers&lt;/a&gt;
a lot.  There’s a reason for it.  While it wasn’t hard at all to make the
requested changes, I had to come back later and spend more than two weeks
chasing down
&lt;a href=&quot;https://en.wikipedia.org/wiki/Magic_number_(programming)&quot;&gt;magic numbers&lt;/a&gt;
buried in the test bench.&lt;/p&gt;

    &lt;p&gt;Specifically, I wanted to add a hardware capability to calculate and store
a &lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt; in an &lt;a href=&quot;https://en.wikipedia.org/wiki/Out-of-band&quot;&gt;out
of band&lt;/a&gt; area on a storage
device, and then to check those values again when reading the data back.
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;s
can be calculated and checked
quickly and efficiently in hardware–especially if the data is already
moving.  Unfortunately, the test bench had hard coded locations where
everything was supposed to land in the hardware, and as a result all of
these locations needed updating in order to add room for the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;I spent quite a bit of time chasing down all of these
&lt;a href=&quot;https://en.wikipedia.org/wiki/Magic_number_(programming)&quot;&gt;magic numbers&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;This applies to register address names as well–but we’ll come back to
these in a moment.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The &lt;a href=&quot;https://en.wikipedia.org/wiki/Rule_of_three_(computer_programming)&quot;&gt;“Rule of three”: If you have to write the same thing three times,
refactor it&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;If the &lt;a href=&quot;https://en.wikipedia.org/wiki/Magic_number_(programming)&quot;&gt;magic
numbers&lt;/a&gt;
were confined to one or two places, that would be
one thing.  Unfortunately, they were found throughout the test library
copied from place to place to place.  Every one of those copies then needed
personal attention to double check, in order to answer the question of
whether or not the “copied” number was truly a copied number that could
be modified or removed.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Name your register addresses.  It makes moving them easier.&lt;/p&gt;

    &lt;p&gt;Or, in this case, four versions of this IP earlier someone had removed a
control register from the IP.  The address was then reallocated for another
purpose.  No one noticed the test scripts were still accessing the old
register until I came along and tried to assign names to all of the
registers within the IP.  I then asked, where is the XYZ register?  It’s
not at this address …&lt;/p&gt;

    &lt;p&gt;I hate coming across situations like this.  “Fixing” such situations
always risks making a change (which needs to be made) that then might break
something else later.  (Yes, that happens too …)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;There’s a benefit to naming even one bit &lt;a href=&quot;https://en.wikipedia.org/wiki/Magic_number_(programming)&quot;&gt;magic
numbers&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;Not to get side tracked, but in another design there was a one-bit
number to indicate data direction.  Throughout the logic, you’d find
expressions like: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;if (direction)&lt;/code&gt;, or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;if (!direction)&lt;/code&gt;.  While you
might think this was okay, the designer wrote the design for the wrong
sense.&lt;/p&gt;

    &lt;p&gt;I then came along and wanted to “fix” things.&lt;/p&gt;

    &lt;p&gt;Not knowing how deep the corruption lay, or whether or not I was getting
the direction mapping right in the first place, I changed all of these
expressions to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;if (direction == DIR_SOURCE)&lt;/code&gt; or
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;if (direction == DIR_SINK)&lt;/code&gt;.  This way, if necessary, I could come back
later and change &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DIR_SOURCE&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DIR_SINK&lt;/code&gt; at one location (okay, one per
file …) and then trust that everything would change consistently
throughout the design.&lt;/p&gt;

    &lt;p&gt;I got things “mostly” right on my first pass.  The place where I struggled
was in the test bench, where things were named backwards.  Why?  Because if
the design was the &lt;em&gt;source&lt;/em&gt;, the test bench needed to be the &lt;em&gt;sink&lt;/em&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;That reset delay.&lt;/p&gt;

    &lt;p&gt;This is really what I want to discuss today.  How long should a design be
held in reset before being released?&lt;/p&gt;

    &lt;p&gt;My personal answer?  No longer than it needs to be.  Xilinx asks for a 16
clock period AXI reset.  Most designs don’t need this.  Indeed, most
digital designs can reset themselves in a single clock period, although some
require two.&lt;/p&gt;

    &lt;p&gt;Some designs do very validly need a long reset.  I’ve come across this often
where an analog tracking circuit needs to start and lock before the digital
logic should start working with the results of that circuit.  This
makes sense, I can understand it, and I’ve built this sort of thing before
when the hardware requires it.  SDRAMs often require long resets as well,
on the order of 200us.&lt;/p&gt;

    &lt;p&gt;In the case of today’s example and lesson learned story, the test bench for
the digital portion of the design was using a 1,000 clock reset.  That is,
the test bench held the design in reset for 1,000 clock cycles.  Why?  That’s
a good question.  Nothing in the IP required such a long reset.  So, I
changed it to 3 cycles.  Three cycles was still overkill–one cycle
should’ve been sufficient, but simulation time can be expensive.  Why
waste simulation time if you don’t need to?&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
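
&lt;p&gt;As a minimal sketch of the naming practices above (items 1, 3, 4 and 5), a test bench might gather its names in one place.  Every name, address and value below is hypothetical, chosen only for illustration, and none of it is taken from the IP in this story.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;module tb_names;
	// Hypothetical register map: these names and addresses are
	// illustrative only
	localparam [7:0]	ADDR_CONTROL = 8'h00,
				ADDR_STATUS  = 8'h04,
				ADDR_IRQ     = 8'h08;

	// Even a one-bit value deserves a name.  Should the sense turn out
	// to be backwards, these two lines are the only ones to change.
	localparam	DIR_SOURCE = 1'b1,
			DIR_SINK   = 1'b0;

	// The reset length is defined once, not copied from place to place
	localparam	RESET_CYCLES = 3;

	reg	clk, reset, direction;

	initial	clk = 1'b0;
	always	#5 clk = !clk;

	initial begin
		reset     = 1'b1;
		direction = DIR_SOURCE;

		// Hold reset across RESET_CYCLES rising edges, then release
		// it on a falling edge, away from the sampling edge
		repeat (RESET_CYCLES)
			@(posedge clk);
		@(negedge clk);
		reset = 1'b0;

		@(negedge clk);
		if (direction == DIR_SOURCE)
			$display(&quot;Design is the source, control register at %02x&quot;, ADDR_CONTROL);
		$finish;
	end
endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Moving a register, flipping the direction sense, or changing the reset length then becomes a one-line change at the top of the file.&lt;/p&gt;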

&lt;table align=&quot;center&quot; style=&quot;float: right; padding: 20px&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/chasing-resets/initial-turn-in.svg&quot; width=&quot;240&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;After changing to a 3 cycle reset, the design worked fine and passed its
   test cases.  I turned my work in, and counted the project done.  All my
   work had been completed in (roughly) the 80 hours I had projected.
   Nice.&lt;/p&gt;

&lt;p&gt;(Okay, my notes say my initial turn in took closer to 120hrs, but I’m going
   to tell the story and pretend my cost estimate was 80hrs.  I can eat a
   40hr overrun on an 80hr contract if I have to–especially if it’s an
   overrun in what I had proposed to do.)&lt;/p&gt;

&lt;ol start=&quot;6&quot;&gt;
  &lt;li&gt;
    &lt;p&gt;Constants should be constant.  Parameters are there for that purpose.&lt;/p&gt;

    &lt;p&gt;If a design has a startup constant, something it depends upon, then that
constant should be set on &lt;em&gt;startup&lt;/em&gt;–before the first clock tick is
over, and not later.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/chasing-resets/parameters.svg&quot; width=&quot;420&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Some engineers like to specify fixed design parameters via input ports
   rather than parameters.  While there are good reasons for doing
   this, especially in
   &lt;a href=&quot;/blog/2021/03/06/asic-lsns.html&quot;&gt;ASIC&lt;/a&gt; designs,
   those fixed constants should be set before the first clock cycle.  If they
   are supposed to be equivalent to wires that are hardwired to either power
   or ground, then they should act like it.&lt;/p&gt;

&lt;p&gt;Personally, I think this purpose is better served by parameters rather
   than hard wired constants, but I can understand a need to build an
   &lt;a href=&quot;/blog/2021/03/06/asic-lsns.html&quot;&gt;ASIC&lt;/a&gt; that
   can then be reconfigured in the field via hard switches.  For example,
   consider how switches can be used to adjust the FPGA wires controlling the
   boot source.  In other words, there is a time for configuring a design via
   input wires.  Just … make those values constants from startup for
   simulation purposes.&lt;/p&gt;
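
&lt;p&gt;A minimal sketch of what this might look like in a test bench follows.  The option name, the input name and the stand-in design logic are all assumptions made for illustration.  The point is only that the “strap” input is driven from a parameter, and so holds its value from time zero.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;module tb_bootcfg;
	// Hypothetical option: whether or not to boot from memory on startup
	parameter [0:0]	OPT_BOOT_ON_STARTUP = 1'b1;

	reg	clk, reset;

	// A strap style input, acting like a wire tied to power or ground.
	// It is valid from time zero, and no task sets it later.
	wire	i_boot_on_startup = OPT_BOOT_ON_STARTUP;

	initial	clk = 1'b0;
	always	#5 clk = !clk;

	initial begin
		reset = 1'b1;
		repeat (3)
			@(posedge clk);
		@(negedge clk);
		reset = 1'b0;

		@(negedge clk);
		$display(&quot;boot_enabled = %b&quot;, boot_enabled);
		$finish;
	end

	// Stand-in for the design, which samples the strap only in reset
	reg	boot_enabled;

	always @(posedge clk)
	if (reset)
		boot_enabled &amp;lt;= i_boot_on_startup;
endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Because the value never depends on when some startup task happens to run, the length of the reset no longer matters to it.&lt;/p&gt;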

&lt;ol start=&quot;7&quot;&gt;
  &lt;li&gt;
    &lt;p&gt;Calculated values should be &lt;em&gt;calculated&lt;/em&gt;, not set in fixed macros.&lt;/p&gt;

    &lt;p&gt;This particular design depended upon a set of macros, and one test
configuration required one set of macros whereas another test configuration
might require another set of macros.&lt;/p&gt;

    &lt;p&gt;These macros contained all kinds of computed constants.  For instance, if
the design had 512 byte ECC blocks, then the block boundaries were things
like bytes 0-511, 512-1023, 1024-1535, etc–all captured in macros used by
the test bench, and all dependent on the device’s page size.  Further
constants captured things like where the ECC would be located in a page, or
how many ECC bytes were used for the given ECC size–which was also a macro.&lt;/p&gt;

    &lt;p&gt;These constants got even worse when it came time to test the ECC.  In this
case, there were macros specifying where to place the errors.  So, for
example, the test bench for a 4bit ECC might generate one error in bytes
0-63, one in bytes 64-127, and macros existed defining these ranges all the
way up to the (macro-defined) size of the page which could be 2kB, 4kB, 8kB,
etc.&lt;/p&gt;

    &lt;p&gt;Sadly, the test script would only run a set of 30 test cases for &lt;em&gt;one&lt;/em&gt; set
of macros.  The design then needed to be reconfigured if you wanted to run
it with another set of macros.  Specifically, every time you needed to change
which ECC option you were testing, or which device model you wished to test
against, then you needed to switch macro sets.  In all, there were over 50
sets of macros, and each macro set contained between 40 and 150 macros the
design required in order to operate.  Worse, many of those macros were
externally calculated.  Running all tests required starting and restarting
the test driver, one macro set at a time.&lt;/p&gt;

    &lt;p&gt;Here was the problem:  What happens when a macro set configures the IP
to run in one fashion, and you need to reconfigure your operations
mid-sim-runtime to another macro set?  More specifically, what happens when
you need to boot with one ECC option (defined as a macro), and then switch
to another?  In this case, the macro set determined how memory was laid
out, and the customer wanted to change the memory layout in the middle of
a test run.  (He then couldn’t figure out why this was a problem for us …)&lt;/p&gt;

    &lt;p&gt;Lesson learned?  When some configuration points are dependent upon others,
use functions and calculate them within the IP.  That way, if you switch
things around later–or even at runtime, those test-library functions can
still capture all the necessary dependencies.&lt;/p&gt;

    &lt;p&gt;Second lesson learned?  IP should be configured via &lt;em&gt;parameters&lt;/em&gt;, not
macros, and those parameters should all be able to be scripted by the test
driver.  Perhaps you may recall how I discussed handling this in an article
discussing an upgrade to the &lt;a href=&quot;/zipcpu/2022/07/04/zipsim.html&quot;&gt;ZipCPU’s test
infrastructure&lt;/a&gt; some time
back.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If requirements are in flux, the IP can’t be delivered.&lt;/p&gt;

    &lt;p&gt;This should be a simple given, a basic no-brainer–it’s really basic
engineering 101.  If you don’t know what you want built, you shouldn’t
hire someone to build it until you have solid requirements.  If you want
to change things mid-task, any rework that will be required is going to be
charged against your bottom line.&lt;/p&gt;

    &lt;p&gt;In this case, the end customer of this IP discovered how I was intending
to meet their requirement by adding a
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;.
They then wanted things done
in a different manner.  Specifically, they wanted the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;s
stored somewhere else.  Of course, this didn’t take place until after I’d
already proposed a fixed price contract based upon 80 hours of work, and
accomplished most
of that work.  Sure, I can support some changes–if the changes are minor.
For example, I initially built a 32b
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
capability and they then wanted a 16b
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
capability.  I figured that’d be a cheap change–since the design was (now)
well parameterized, only two parameters needed to change to adjust.
In this case, however, their simple desire to switch
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
sizes from 32b to 16b now doubled the time spent in verification–since we
now needed to run the verification test suite twice–once for a 32b
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt; and once again
for the 16b &lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt; they
wanted.  Their other change request, moving the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
storage elsewhere, was major enough that it couldn’t be done without
starting the entire update over from scratch.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
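
&lt;p&gt;Returning to item 7 for a moment, here is a minimal sketch of calculating ECC block boundaries rather than listing them in macros.  The parameter and function names are hypothetical.  The block size is passed as an argument, under the assumption that a test may want to change it at run time.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;module tb_ecc_layout;
	// Hypothetical configuration.  These are parameters, so a test
	// driver can override them without swapping macro sets.
	parameter	PAGE_BYTES      = 2048;
	parameter	ECC_BLOCK_BYTES = 512;

	// First byte of ECC block number blk, calculated on demand
	function automatic integer ecc_block_first(input integer blk,
				input integer blk_bytes);
		ecc_block_first = blk * blk_bytes;
	endfunction

	// Last byte of ECC block number blk
	function automatic integer ecc_block_last(input integer blk,
				input integer blk_bytes);
		ecc_block_last = ecc_block_first(blk, blk_bytes) + blk_bytes - 1;
	endfunction

	integer	k;

	initial begin
		// Walk every ECC block in the page
		for (k = 0; k != PAGE_BYTES / ECC_BLOCK_BYTES; k = k + 1)
			$display(&quot;ECC block %0d: bytes %0d-%0d&quot;, k,
				ecc_block_first(k, ECC_BLOCK_BYTES),
				ecc_block_last(k, ECC_BLOCK_BYTES));
		$finish;
	end
endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;With these defaults, the loop walks the same 0-511, 512-1023 and 1024-1535 boundaries mentioned above, followed by 1536-2047.  Changing the page size or the block size changes every boundary with it.&lt;/p&gt;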

&lt;table align=&quot;center&quot; style=&quot;float: right; padding: 20px&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/chasing-resets/crcsz.svg&quot; width=&quot;360&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Change is normal.  Customers don’t always know what they want.  I get that.
   The problem here was that as long as requirements were in flux I wasn’t
   going to deliver any capability.  Let’s agree on what we’re going to deliver
   first, then I’ll deliver that.&lt;/p&gt;

&lt;p&gt;Then the customer started asking why it was taking so long to deliver the
   promised changes, when could we deliver the IP, and they had a hard RTL
   freeze deadline, and …  Yes, this became quite contradictory: 1) They
   wanted me to make a change that would force me to start my work all over
   from scratch, but at the same time 2) wanted all of my changes delivered
   immediately to meet their hard deadline.&lt;/p&gt;

&lt;p&gt;You can’t make this stuff up.&lt;/p&gt;

&lt;ol start=&quot;9&quot;&gt;
  &lt;li&gt;
    &lt;p&gt;If a design can fail, then a simulation test case should exist that can
trigger that failure.&lt;/p&gt;

    &lt;p&gt;This is especially true of &lt;a href=&quot;/blog/2021/03/06/asic-lsns.html&quot;&gt;ASIC&lt;/a&gt; designs, and a lesson I’m needing to learn
in a hard way.  In my case, I knew that I could properly calculate and
detect &lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
errors.  I had formally proven that.&lt;/p&gt;

    &lt;p&gt;However, because I didn’t (initially) generate a simulation test to verify
what would happen on a
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
failure, no one noticed how complicated the register handling for these
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
failures had become.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Test bench drivers should mirror software.&lt;/p&gt;

    &lt;p&gt;At some point in time, someone’s going to need to build control software.
They’ll start with the test bench driver.  The closer that test bench
driver looks to real software, the easier their task will be.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;so-what-happened&quot;&gt;So what happened?&lt;/h2&gt;

&lt;p&gt;Okay, ready for the story?&lt;/p&gt;

&lt;p&gt;Here’s what happened: I made my changes inside my promised two weeks.  I
merged and delivered the changes the customer had requested.  Everything worked.&lt;/p&gt;

&lt;p&gt;Life was good.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right; padding: 20px&quot;&gt;&lt;caption&gt;Fig 2. Everything fell apart when merging&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/chasing-resets/merge-failures.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;Then my client said, oops, we’re sorry, you made the changes to the wrong
version of the IP.  The end customer had asked us to make a simple change to
allow the software to read a sector from non-volatile memory to boot from on
startup.  Here’s the correct version to change.&lt;/p&gt;

&lt;p&gt;The changes appeared minor, so I merged my changes and re-submitted.  This
time, many of the tests now failed.&lt;/p&gt;

&lt;p&gt;What went wrong?&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;caption&gt;Fig 3. I now use watchdog timers in my test benches&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/chasing-resets/watchdogs.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;The first problem was the reset.  Remember how I removed that 1,000 clock reset,
because it wasn’t needed?  One of the test cases was waiting 100 clock cycles,
and then calling a startup task which would then set the “constant” input values
that were only sampled during reset.  This value would determine whether
the new bootloader capability would be run on startup or not.  The test bench
would then wait on the signal that the bootloader had completed its task.
However, with a 3 cycle reset, the boot on startup constant was never set
before the end of the reset period, so the bootloader never started and the
test bench then hung waiting for the bootloader to complete.  (Waiting on a
non-existent boot loader wasn’t a part of the design I started with.)&lt;/p&gt;
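
&lt;p&gt;A watchdog timer, as mentioned in Fig. 3, turns this kind of hang into a prompt failure.  What follows is only a minimal sketch: the timeout value and the signal names are assumptions, and the “boot done” signal is deliberately never set in order to mimic the hang.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;module tb_watchdog;
	// Hypothetical limit: the longest any test may run, in clocks
	parameter	WATCHDOG_CLOCKS = 10_000;

	reg	clk;
	reg	boot_done;	// Stand-in for the bootloader's done signal
	integer	watchdog;

	initial	clk = 1'b0;
	always	#5 clk = !clk;

	// The watchdog: count down on every clock, and fail loudly at zero
	initial	watchdog = WATCHDOG_CLOCKS;

	always @(posedge clk)
	if (watchdog == 0)
	begin
		$display(&quot;ERROR: watchdog timeout at time %0t&quot;, $time);
		$finish;
	end else
		watchdog &amp;lt;= watchdog - 1;

	initial begin
		boot_done = 1'b0;	// Never set, mimicking the hang
		wait(boot_done);
		$display(&quot;Boot complete&quot;);
		$finish;
	end
endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Without the watchdog, this simulation would never end.  With it, the run stops with an error message once the timeout expires.&lt;/p&gt;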

&lt;p&gt;It didn’t help that the test script (in file #1) called a task (in file #2),
that set a value (in file #3), that was checked elsewhere (in file #4), that
was … In other words, there was so much indirection on this reset between
where it was set and its ultimate consequence that it took quite a bit of time
to sort through.  No, it didn’t help that I hadn’t written this IP, nor its
test bench, nor its test scripts, nor its test libraries in the first place.&lt;/p&gt;

&lt;p&gt;Unfortunately, that was only the first problem.&lt;/p&gt;

&lt;p&gt;The second problem was due to an implied requirement that, if your test bench
reads from memory on bootup, there must be an initial set of valid data in
memory for it to boot from–especially if you are checking for valid
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;s and failing a
test if any &lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt; failed.
This requirement didn’t exist in either branch, but became an implied
requirement once the boot up and
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
branches were merged together.  We hadn’t foreseen that one coming either.&lt;/p&gt;

&lt;p&gt;A third problem came from how fault detection was handled.  In the case of
a fault, an &lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt;
 would be generated.  The test bench would wait for that
&lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt;, read the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt; register from the IP, and
then handle each active &lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt;
as appropriate.&lt;/p&gt;

&lt;p&gt;In order to properly handle a
&lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;CRC&lt;/a&gt;
failure, I needed to adjust how
&lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt;s were handled in the test
library.  That’s fair.  Let’s look at that logic for a moment.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;Interrupt&lt;/a&gt;s were handled in the test
library within a Verilog task.  The relevant portion of this task read
something like:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;do&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;wait&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interrupt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interrupt_register&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interrupt_register&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'h01&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// Handle interrupt #1&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interrupt_register&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'h02&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// Handle interrupt #2&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interrupt_register&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;8'h03&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// Handle interrupts #1 and #2&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// ...&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;while&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;task_not_done&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This was a hidden violation of the &lt;a href=&quot;https://en.wikipedia.org/wiki/Rule_of_three_(computer_programming)&quot;&gt;rule of
three&lt;/a&gt;,
since you’d find the same &lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt;
handler for &lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt; #1 following
a check for the &lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt; register
equalling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;8'h01&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;8'h03&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;8'h05&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;8'h07&lt;/code&gt;, etc.&lt;/p&gt;

&lt;p&gt;Worse, the &lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt; handlers didn’t
just handle &lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt;s.  They would
also issue commands, reset the interrupt register, use delays, etc., so that
handling &lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt; #1 wasn’t the
same between a reading of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;8'h01&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;8'h05&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;My solution was to spend about two days refactoring this, so that every
&lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt; would be given its own
independent handler properly.  The result looked something like the logic below.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;	&lt;span class=&quot;k&quot;&gt;do&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;wait&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interrupt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
		&lt;span class=&quot;n&quot;&gt;read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interrupt_register&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interrupt_register&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// Handle interrupt #1&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interrupt_register&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// Handle interrupt #2&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt;

		&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;interrupt_register&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;begin&lt;/span&gt;
			&lt;span class=&quot;c1&quot;&gt;// Handle interrupt #3&lt;/span&gt;
		&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// ...&lt;/span&gt;

		&lt;span class=&quot;n&quot;&gt;clear_interrupts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// and adjust the mask if necessary&lt;/span&gt;
	&lt;span class=&quot;k&quot;&gt;end&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;while&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;task_not_done&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Among other things, I removed all of the register accesses from the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt; “handling” routines,
capturing their needs instead in some registers so the accesses could all
happen at the end.  As a result, &lt;em&gt;nothing&lt;/em&gt; took simulation time during these
handlers and things truly could be merged properly.&lt;/p&gt;
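
&lt;p&gt;A minimal sketch of this structure follows.  It is not the actual test library: the register address, the bus write task and the “restart” follow-up are all stand-ins invented for illustration.  The handlers record what they need, and the only register access happens once at the end.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-verilog&quot; data-lang=&quot;verilog&quot;&gt;module tb_irq_handler;
	// Hypothetical address of the interrupt register
	localparam [7:0]	ADDR_IRQ = 8'h08;

	reg		clk;
	reg	[7:0]	clear_mask;	// Interrupts to acknowledge
	reg		need_restart;	// A follow-up command, noted for later

	initial	clk = 1'b0;
	always	#5 clk = !clk;

	// Stand-in for the bus functional model's write, taking one clock
	task write_register(input [7:0] addr, input [7:0] data);
	begin
		@(posedge clk);
		$display(&quot;WRITE [%02x] = %02x&quot;, addr, data);
	end endtask

	// The handlers only record what they need.  None of them takes
	// simulation time, so any combination of bits may be active.
	task handle_interrupts(input [7:0] irq);
	begin
		clear_mask   = 8'h00;
		need_restart = 1'b0;

		if (irq[0])	// Interrupt #1
			clear_mask[0] = 1'b1;

		if (irq[2])	// Interrupt #3
		begin
			clear_mask[2] = 1'b1;
			need_restart  = 1'b1;
		end

		// All register accesses happen here, at the end
		write_register(ADDR_IRQ, clear_mask);
		if (need_restart)
			$display(&quot;Issuing the (hypothetical) restart command&quot;);
	end endtask

	initial begin
		handle_interrupts(8'h05);	// Interrupts #1 and #3 together
		$finish;
	end
endmodule&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;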

&lt;p&gt;I was proud of this update.  The portion of the test library handling
&lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt;s now “made sense”.&lt;/p&gt;

&lt;p&gt;So, I sent the design off to the test team again only to have it come back to
me again a couple days later.  It had failed another test case.  Where?  In a
second &lt;em&gt;copy&lt;/em&gt; of the same broken
&lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt; handler that I had just
refactored.&lt;/p&gt;

&lt;p&gt;While I might argue that the &lt;a href=&quot;https://en.wikipedia.org/wiki/Rule_of_three_(computer_programming)&quot;&gt;rule of
three&lt;/a&gt;
should’ve applied to this second copy, you could also argue that it didn’t
simply because it was a &lt;em&gt;second&lt;/em&gt; copy of the same
&lt;a href=&quot;https://en.wikipedia.org/wiki/Interrupt&quot;&gt;interrupt&lt;/a&gt; handler and not
a &lt;em&gt;third&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I could go on.&lt;/p&gt;

&lt;p&gt;As I mentioned in the beginning, a basic 80 hour task became a 270+ hour task.
Further, the task went from being &lt;em&gt;on time&lt;/em&gt; to late very suddenly.  Yes,
this was how I spent my Thanksgiving weekend that year.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;A good design plus test bench &lt;em&gt;should&lt;/em&gt; be easy to adjust and modify.&lt;/p&gt;

&lt;p&gt;Building a poor design, a poor test bench, or (worse) both constitutes taking
out a loan from your future self.  This is often called “&lt;a href=&quot;https://en.wikipedia.org/wiki/Technical_debt&quot;&gt;technical
debt&lt;/a&gt;.”
If this is a prototype you are willing to throw away later, then perhaps this
is okay.  If not, then you will end up paying that loan back later, with
interest, at a time you are not expecting to pay it.  It will cost you more
than you want to pay, at a time when you aren’t expecting a delay.&lt;/p&gt;

&lt;p&gt;What about formal methods?  Certainly formal methods might have helped, no?&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/vlog-wait/rule-of-gold.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;I suppose so.  Indeed, all of my updates were formally verified.  Better yet,
everything that was formally verified worked right the first time.  What about
the stuff that failed?  None of it had ever seen a formal tool.  Test bench
scripts, libraries, and device models, for example, tend not to be formally
verified.  Further, why would you formally verify a “working” design that you
were handed?  Unless, of course, it was never truly “working” in the first
place.&lt;/p&gt;

&lt;p&gt;Remember, well verified, well tested RTL designs are gold in this business.
Build them well, and you can sell or re-use them for years to come.&lt;/p&gt;
&lt;hr /&gt;&lt;p&gt;&lt;em&gt;For yet a little while, and he that shall come will come, and will not tarry.  (Heb 10:37)&lt;/em&gt;</description>
        <pubDate>Mon, 01 Apr 2024 00:00:00 -0400</pubDate>
        <link>https://zipcpu.com/blog/2024/04/01/chasing-resets.html</link>
        <guid isPermaLink="true">https://zipcpu.com/blog/2024/04/01/chasing-resets.html</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>2023, Year in review</title>
        <description>&lt;p&gt;It should come as no surprise that a blog with &lt;a href=&quot;https://zipcpu.com/blog/2017/08/01/advertising.html&quot;&gt;no
advertisements&lt;/a&gt; has never
paid my bills–at least not directly.  I blog for fun, and to some extent for
&lt;a href=&quot;https://en.wikipedia.org/wiki/Rubber_duck_debugging&quot;&gt;rubber duck debugging&lt;/a&gt;.
As I learn new concepts, I enjoy sharing them here.  Going through the rigor
to write about a topic also helps to make sure I understand the topic as well.&lt;/p&gt;

&lt;p&gt;Why are there &lt;a href=&quot;https://zipcpu.com/blog/2017/08/01/advertising.html&quot;&gt;no
advertisements&lt;/a&gt;?  For
two reasons.  First, because I’m not doing this to make money.  Second, because
I want more control over any advertising from this site than
most advertisers want to provide.  Perhaps some day the site will be supported
by advertising.  Until then, the web site works fine without advertisements.&lt;/p&gt;

&lt;p&gt;So how then does the blog fit into my business model?  Simply because the blog
helps me find customers via those who read articles here and write to me.&lt;/p&gt;

&lt;h2 id=&quot;business-projects&quot;&gt;Business Projects&lt;/h2&gt;

&lt;p&gt;So, if the blog doesn’t pay my bills, then what does?&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;2023 Projects&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/2023-review/2023-funding.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Well, six projects have paid the bills this year.  Three of these have been
ASIC projects, to include the &lt;a href=&quot;https://www.arasan.com/product/xspi-psram-master/&quot;&gt;PSRAM/NOR flash
controller&lt;/a&gt;, and an &lt;a href=&quot;https://www.arasan.com/product/onfi-4-2-controller-phy/&quot;&gt;ONFI
NAND flash controller&lt;/a&gt;.
Three other projects this year have been FPGA projects, to include an &lt;a href=&quot;/blog/2023/11/25/eth10g.html&quot;&gt;open
source 10Gb Ethernet switch&lt;/a&gt;
and a SONAR front end based upon my
&lt;a href=&quot;https://github.com/ZipCPU/videozip&quot;&gt;VideoZip&lt;/a&gt; design–after including several
very significant upgrades, such as handling ARP and ICMP requests in hardware.
That’s four of the six projects from this year.  Once the other two projects
become a bit more marketable, I may mention them here as well.&lt;/p&gt;

&lt;p&gt;Since I’ve already discussed the &lt;a href=&quot;/blog/2023/11/25/eth10g.html&quot;&gt;10Gb Ethernet
design&lt;/a&gt;, let me take a moment
and discuss the &lt;strong&gt;&lt;a href=&quot;https://github.com/ZipCPU/wbi2c&quot;&gt;I2C controller&lt;/a&gt;&lt;/strong&gt; within it.
The &lt;a href=&quot;https://github.com/ZipCPU/wbi2c&quot;&gt;I2C controller&lt;/a&gt; was originally designed
to support the SONAR project.  Perhaps you may remember the &lt;a href=&quot;/blog/2021/11/15/ultimate-i2c.html&quot;&gt;initial article,
outlining the design goals for this
controller&lt;/a&gt;.  Thankfully,
it’s met all of these goals and more–but we’ll get to that in a moment.  As
part of the SONAR project, its purpose was to sample various non-acoustic
telemetry data: temperature, power supply voltage, current usage, humidity
within the enclosure, and more.  All of these needed to be sampled at regular
intervals.  At first glance, &lt;a href=&quot;https://www.reddit.com/r/FPGA/comments/13ti5zx/when_do_you_solve_a_problem_in_software_instead/&quot;&gt;this sounds like a software
task&lt;/a&gt;–that
is until you start adding real-time requirements to it such as the need to
shut down the SONAR transmitter if it starts overheating, or using so much
power that the FPGA itself will brown out shortly.  So, the
&lt;a href=&quot;https://github.com/ZipCPU/wbi2c&quot;&gt;I2C controller&lt;/a&gt; was designed to generate
(AXI stream) data packets automatically, without CPU intervention, which could
then be forwarded … somewhere.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;caption&gt;An example I2C-driven OLED output&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/2023-review/ssdlogo-demo.jpg&quot; width=&quot;260&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/wbi2c&quot;&gt;This design&lt;/a&gt; was then incorporated into the
&lt;a href=&quot;/blog/2023/11/25/eth10g.html&quot;&gt;10Gb Ethernet design&lt;/a&gt;.  There
it provided the team the ability to 1) read the DDR3 memory stick
configuration–useful for making sure the &lt;a href=&quot;https://github.com/AngeloJacobo/DDR3_Controller&quot;&gt;DDR3
controller&lt;/a&gt; was properly
configured, 2) read the SFP+ configuration–and discover that we were using
1GbE SFP+ connectors initially instead of 10GbE connectors (Oops!), 3) read the
&lt;a href=&quot;https://en.wikipedia.org/wiki/Extended_Display_Identification_Data&quot;&gt;Extended Display Identification Data
(EDID)&lt;/a&gt;
from the downstream &lt;a href=&quot;https://en.wikipedia.org/wiki/HDMI&quot;&gt;HDMI&lt;/a&gt;
monitor, 4) configure and verify the &lt;a href=&quot;https://www.skyworksinc.com/-/media/Skyworks/SL/documents/public/data-sheets/Si5324.pdf&quot;&gt;Si5324&lt;/a&gt;’s register
settings, 5) draw a logo onto a
&lt;a href=&quot;https://www.amazon.com/Teyleten-Robot-Display-SSD1306-Raspberry/dp/B08ZY4YBHL/&quot;&gt;small OLED display&lt;/a&gt;,
all in addition to 6) actively monitoring hardware temperature.&lt;/p&gt;

&lt;p&gt;Supporting these additional tasks required two fundamental changes to the
&lt;a href=&quot;/blog/2021/11/15/ultimate-i2c.html&quot;&gt;initial vision for this I2C
controller&lt;/a&gt;.  First, I
needed an &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/wbi2c/wbi2cdma.v&quot;&gt;I2C
DMA&lt;/a&gt;, to
quietly transfer results read from the device to memory.  Only once I had
&lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/wbi2c/wbi2cdma.v&quot;&gt;this DMA&lt;/a&gt;
could the CPU then inspect and/or report on the results.  (It was probably one
of the easiest DMA’s I’ve written, since &lt;a href=&quot;/blog/2021/11/15/ultimate-i2c.html&quot;&gt;I2C is a rather slow
protocol&lt;/a&gt;.)
Second, each packet needed a designated &lt;em&gt;destination&lt;/em&gt; channel, so the design
could know where to forward the results.  This was useful for knowing if the
I2C information should be forwarded to &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/wbi2c/wbi2cdma.v&quot;&gt;the
DMA&lt;/a&gt;, for
storing in memory, or the &lt;a href=&quot;https://en.wikipedia.org/wiki/HDMI&quot;&gt;HDMI&lt;/a&gt;
slave controller, for forwarding the downstream monitor’s
&lt;a href=&quot;https://en.wikipedia.org/wiki/Extended_Display_Identification_Data&quot;&gt;EDID&lt;/a&gt;
to the upstream monitor.  The fact that &lt;a href=&quot;https://github.com/ZipCPU/wbi2c&quot;&gt;this
controller&lt;/a&gt;, designed for a completely separate
project, in a completely different domain (i.e. SONAR), ended up working so
well in a &lt;a href=&quot;/blog/2023/11/25/eth10g.html&quot;&gt;10Gb Ethernet
design&lt;/a&gt; project
is a basic testament to a well designed interface.&lt;/p&gt;
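
&lt;p&gt;To illustrate the idea of a destination channel, here’s a minimal Python sketch of a router that sorts packets by destination.  The channel numbers and payload bytes are hypothetical, chosen only for this example; they are not taken from the I2C controller itself, where the routing is done in hardware.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical channel numbers, for illustration only.
DEST_DMA = 0    # results to be stored in memory
DEST_EDID = 1   # EDID bytes bound for the HDMI slave controller


def route_packets(packets):
    """Sort (dest, payload) packets into one queue per destination."""
    queues = defaultdict(list)
    for dest, payload in packets:
        queues[dest].append(payload)
    return dict(queues)


# Made-up payloads: two sensor readings and the start of an EDID block.
stream = [
    (DEST_DMA, b"\x19\x80"),
    (DEST_EDID, b"\x00\xff\xff"),
    (DEST_DMA, b"\x1a\x00"),
]
routed = route_packets(stream)
print(routed[DEST_DMA])
print(routed[DEST_EDID])
```

&lt;p&gt;The point is only that the sender, not the receiver, decides where each packet goes, so adding a new consumer doesn’t require changing the existing ones.&lt;/p&gt;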

&lt;p&gt;The year has also included some internally funded projects.  These include
a new &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO/eMMC controller&lt;/a&gt;, a (to-be-posted)
upgrade to &lt;a href=&quot;/blog/2017/06/05/wb-bridge-overview.html&quot;&gt;my standard debugging
bus&lt;/a&gt;, and a
&lt;a href=&quot;/zipcpu/2023/05/29/zipcpu-3p0.html&quot;&gt;ZipCPU upgrade&lt;/a&gt;.  Allow
me to take a moment to discuss these three (unfunded) projects in a bit more
detail.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;&lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO/eMMC controller&lt;/a&gt;&lt;/strong&gt; is new.  By
using all four data lanes and a higher clock rate, this upgrade offers a
minimum 8x transfer rate performance improvement over my prior SPI-only
version.  That’s kind of exciting.  Even better, the IP has been tested on
both an SD card as well as an &lt;a href=&quot;http://www.skyhighmemory.com/download/eMMC_4GB_SML_PKG_S40FC004_002_01112.pdf&quot;&gt;eMMC
chip&lt;/a&gt;
as part of the &lt;a href=&quot;/blog/2023/11/25/eth10g.html&quot;&gt;KlusterLab (i.e. 10Gb Ethernet
board)&lt;/a&gt; design.  The IP, &lt;a href=&quot;https://github.com/ZipCPU/sdspi/tree/master/sw&quot;&gt;plus
software&lt;/a&gt;, is so awesome I’m
likely to add it to any future designs I have with SD cards or eMMC
chips.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;The difference between SPI and SDIO: Speed&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/2023-review/sdiovspi.svg&quot; width=&quot;640&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;That’s just the beginning, too.  Just because &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;this new SDIO
controller&lt;/a&gt; works on hardware, doesn’t mean
it works in all modes.  Since its original posting, I’ve added verification to
support all the modes our hardware doesn’t (yet) support.  I’ve also started
adding eMMC BOOT mode support, and I expect I’ll be (eventually) adding DMA
support to this IP as well.  My goal is also to make sure I can support
multiple sector read or write commands–something the SPI only version couldn’t
support, and something that’s supposed to be supported in this new version but
isn’t tested (yet).  (Remember, &lt;a href=&quot;/zipcpu/2022/07/04/zipsim.html&quot;&gt;if it’s not tested it doesn’t
work&lt;/a&gt;.)  In other
words, despite declaring this IP as “working”, it remains under very active
development.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;caption&gt;I will use Slave/Master Terms where appropriate&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/2023-review/slave.svg&quot; width=&quot;480&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;!-- (COMSONICS, SoundWire) --&gt;

&lt;p&gt;Then there’s the upgrade to the &lt;strong&gt;&lt;a href=&quot;https://github.com/ZipCPU/dbgbus&quot;&gt;debugging
bus&lt;/a&gt;&lt;/strong&gt;.  This has been in the works now for
quite a while.  My current/best debugging bus implementation
uses six printable characters to transmit a control code (read request, write
data, or new address) plus 32-bits of data.  At six data bits per 8-bit
character transmitted, this meant six characters would need to be sent
(minimum) in order to send either a 32-bit address or 32-bit data word,
leading to a 36b internal word.  It also required &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;10*6&lt;/code&gt; baud periods (10 baud
periods times six characters) for every uncompressed 32b of data transferred,
for an efficiency of 32/60, or about 53%, before any compression.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;The debugging bus multiplexes console and bus channels&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/2023-review/dbgbus.svg&quot; width=&quot;640&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Since then, I’ve slowly been working on an upgrade to this protocol that will
use five (not necessarily printable) characters to transmit 32-bits of data
plus a control code.  This upgrade should achieve an overall 64% worst case
(i.e.  uncompressed) efficiency, cutting the transfer time by about 17% (a 20%
throughput gain) over the prior controller in worst case conditions.  The upgrade comes with some
synchronization challenges, but currently passes all of its simulation
checks–so at this point it’s ready for hardware testing.  My only problem
is … this upgrade isn’t paid for.  Inserting it into one of my business
projects is likely to increase the cost of that project–both in terms of
integration time as well as verification while chasing down any new bugs
introduced by this new implementation–at least until the upgraded bus is
verified.  This has kept this debugging bus upgrade at a lower priority than
my paying projects.  Well, that and the fact that I only expect to save about
17% of the transfer time over the prior implementation.  As a result, the upgrade isn’t
likely to pay for itself for a long time.&lt;/p&gt;
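
&lt;p&gt;As a quick check on those figures, here’s a small Python sketch of the arithmetic.  It assumes ten baud periods per character (as with 8N1 serial framing) and 32 payload bits per word, as described above; the function name is my own, for illustration only.&lt;/p&gt;

```python
def link_efficiency(chars_per_word, baud_per_char=10, payload_bits=32):
    """Fraction of baud periods that carry payload bits (no compression)."""
    return payload_bits / (chars_per_word * baud_per_char)


old = link_efficiency(6)   # current protocol: six characters per word
new = link_efficiency(5)   # upgrade: five characters per word

print(f"old efficiency:      {old:.1%}")
print(f"new efficiency:      {new:.1%}")
print(f"time saved per word: {1 - 5 / 6:.1%}")
print(f"throughput gain:     {new / old - 1:.1%}")
```

&lt;p&gt;Dropping from six characters to five removes one sixth of the time spent per word, which is the same thing as a 20% gain in uncompressed throughput.&lt;/p&gt;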

&lt;table align=&quot;center&quot; style=&quot;float: none&quot;&gt;&lt;caption&gt;Moving from 6 characters to 5 characters to send 32bits&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/2023-review/exbus.svg&quot; width=&quot;640&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Finally, let’s discuss the &lt;a href=&quot;/zipcpu/2023/05/29/zipcpu-3p0.html&quot;&gt;ZipCPU’s big
upgrades&lt;/a&gt;.  As with the
other upgrades, these were also internally funded.  However, the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; has now formed a backdrop to a
majority of my projects.  Indeed, it’s &lt;a href=&quot;/zipcpu/2021/07/23/cpusim.html&quot;&gt;helped me verify ASIC IP in both
simulation&lt;/a&gt; and FPGA
contexts.  One upgrade in particular will keep on giving, and that is the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s &lt;a href=&quot;https://github.com/ZipCPU/zipcpu/tree/master/rtl/zipdma&quot;&gt;new DMA
controller&lt;/a&gt;.  I’ve
already managed to integrate it into a &lt;a href=&quot;https://github.com/ZipCPU/wbsata&quot;&gt;(work in progress) SATA
controller&lt;/a&gt;, and I’m likely to retarget this
DMA engine (plus a small state machine) to meet the DMA needs of my new
&lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO/eMMC controller&lt;/a&gt;.  Indeed, it is so
versatile that I’m likely to use this controller across a lot of projects.
Better yet, at this rate, I’m likely to build an AXI version of this new DMA
supporting all of these features as well.  It’s just that good.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;caption&gt;All labour is profitable, whether or not it's paid for&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/tweets/bible/all-labour.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;p&gt;As for dollars?  Well, let’s put it this way: the year is now over, and I’m
still in business.  Not only that, but I’ve also managed to keep two kids in
college this year.  More specifically, I expect my third child to graduate
from college this year.  (Five to go …)  So, I’ve been hanging in there,
and I thank my God that my bills have been paid.&lt;/p&gt;

&lt;h2 id=&quot;articles&quot;&gt;Articles&lt;/h2&gt;

&lt;p&gt;2023 has been a slower year for articles than past years.  Much of this is due
to the fact that my time has been so well spent on other paying projects.
That’s left less time for blogging.  (No, it doesn’t help that my family
has fallen in love with football, and that my major blogging times have been
spent watching my son’s high school games, Air Force Academy Falcons football,
the Kansas City Chiefs, Miami Dolphins, Philadelphia Eagles, and my own home
team–the disappointing Minnesota Vikings.)  Still, I have managed to push out
seven new articles this year.  Let’s look at each, and see how easily they can
be found using DuckDuckGo.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/2023-review/vikings.svg&quot; width=&quot;320&quot; alt=&quot;What does a Vikings fan do after watching the Vikings win the super bowl?  He turns off the play-station 4.&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;/blog/2023/02/13/eccdbg.html&quot;&gt;Debugging the hard stuff&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;This article discusses some of the challenges I went through when debugging
modifications I made to a working ECC algorithm.  ECC, of course, is one of
those “hard” problems to debug since the intermediate data tends to look
meaningless when viewed.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;DuckDuckGo Ranking:&lt;/strong&gt; A search for “FPGA ECC Debugging” brings up the
&lt;a href=&quot;https://zipcpu.com/&quot;&gt;ZipCPU home page&lt;/a&gt; as result #111.&lt;/p&gt;

    &lt;p&gt;That’s kind of disappointing.  Let’s try a search using Google.  Google
finds &lt;a href=&quot;/blog/2023/02/13/eccdbg.html&quot;&gt;the correct page&lt;/a&gt;
immediately as its #1 result.  At first I thought the difference was because
Google knew I was interested in &lt;a href=&quot;https://zipcpu.com/&quot;&gt;ZipCPU&lt;/a&gt; results.  Then
I asked my daughter to repeat my test on her phone in private mode.  (She
has no interest in FPGA anything, so this would be a first for her.)  Her
Google ranking came up identical, so maybe I can trust this Google ranking.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;/zipcpu/2023/03/13/swic.html&quot;&gt;What is a SwiC&lt;/a&gt;?&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/swic/barecpu.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;The &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; was originally designed
  to be a System within a Chip, or a SwiC as I called it.  This article
  discusses what a SwiC is, and tries to answer the question of whether or not
  a SwiC makes sense, or equivalently whether or not the
  &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; made for a good SwiC in the
  first place.  In many ways, this article was a review of whether or not the
  &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s
  design goals were appropriate, and whether or not they’ve been met.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DuckDuckGo Ranking:&lt;/strong&gt; Searches on SwiC return all kinds of
  irrelevant results, and searches on “System within a Chip” return all kinds
  of results for “Systems on a Chip”.  If you cheat and search for “ZipCPU
  SwiC”, you get the &lt;a href=&quot;https://zipcpu.com/&quot;&gt;ZipCPU&lt;/a&gt; web site as the #1 page.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;/blog/2023/04/08/vpktfifo.html&quot;&gt;What is a Virtual Packet FIFO&lt;/a&gt;?&lt;/p&gt;

    &lt;p&gt;A virtual FIFO is a first-in, first-out data structure built in hardware, but
using &lt;em&gt;external&lt;/em&gt; memory–such as a DDR3 SDRAM–for its memory.  A virtual
packet FIFO is a virtual FIFO that guarantees completed packets and packet
boundaries, in spite of any back pressure that might otherwise cause the FIFO
to fill or overflow.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;table align=&quot;center&quot; style=&quot;float: right&quot;&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/vfifo/pktvfifo.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;&lt;a href=&quot;/blog/2023/04/08/vpktfifo.html&quot;&gt;This article&lt;/a&gt;
  goes over the why’s and how’s of a virtual packet FIFO: why you
  might need it, how to use it, and how it works.&lt;/p&gt;

&lt;p&gt;Since writing this article, I’ve now built and tested a &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/rtl/net/pktvfifo.v&quot;&gt;Wishbone based
  virtual packet FIFO as part of the 10Gb Ethernet
  project&lt;/a&gt;.
  Conclusion?  First, verifying the FIFO is a pain.  Second, I might be able to
  tune its memory usage with some better buffering.  But, overall, the FIFO
  itself works quite nicely in all kinds of environments.&lt;/p&gt;
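
&lt;p&gt;If the packet guarantee sounds abstract, here’s a toy Python model of one way to get it: words of a packet stay invisible to the reader until the packet’s last word arrives, and a packet that runs out of room is dropped whole.  The class, its method names, and the drop-on-overflow policy are my assumptions for illustration; the Wishbone implementation linked above works on external memory and may differ in its details.&lt;/p&gt;

```python
class PacketFifo:
    """Toy packet FIFO: a packet becomes visible only once it is complete."""

    def __init__(self, capacity):
        self.capacity = capacity   # total words of backing memory
        self.committed = []        # complete packets, oldest first
        self.used = 0              # words held by committed packets
        self.partial = []          # words of the packet being written
        self.dropping = False      # True once the current packet overflowed

    def write(self, word, last):
        """Accept one word; `last` marks the final word of a packet."""
        if not self.dropping:
            room = self.capacity - self.used - len(self.partial)
            if room == 0:
                # No room left: abandon the whole packet, not just this word
                self.partial = []
                self.dropping = True
            else:
                self.partial.append(word)
        if last:
            if not self.dropping:
                self.committed.append(self.partial)
                self.used += len(self.partial)
            self.partial = []
            self.dropping = False

    def read(self):
        """Return the oldest complete packet, or None if there is none."""
        if not self.committed:
            return None
        packet = self.committed.pop(0)
        self.used -= len(packet)
        return packet


fifo = PacketFifo(capacity=4)
for word, last in [(1, False), (2, True)]:               # fits in two words
    fifo.write(word, last)
for word, last in [(3, False), (4, False), (5, True)]:   # needs three, has two
    fifo.write(word, last)
print(fifo.read())   # [1, 2]
print(fifo.read())   # None: the oversized packet was dropped whole
```

&lt;p&gt;The reader never sees half a packet in this model, which is the property that separates a packet FIFO from a plain one.&lt;/p&gt;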

&lt;p&gt;&lt;strong&gt;DuckDuckGo ranking:&lt;/strong&gt;  The &lt;a href=&quot;https://zipcpu.com/&quot;&gt;ZipCPU blog&lt;/a&gt; comes up as
  the #2 ranking on DuckDuckGo following a search for “Virtual Packet FIFO”.
  The &lt;a href=&quot;https://www.reddit.com/r/ZipCPU&quot;&gt;ZipCPU reddit page&lt;/a&gt; comes up as the #7
  ranking.  The page itself?  Not listed.  However, both of the prior pages
  point to this article, so I’m going to give this a DuckDuckGo ranking of #2.
  Sadly, most of DuckDuckGo’s other results are completely irrelevant to a
  Virtual Packet FIFO.  In general, they’re about Virtual FIFOs–not
  Virtual &lt;em&gt;Packet&lt;/em&gt; FIFOs.  As before, though, Google gets the right article
  as its number one search result.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;/zipcpu/2023/05/29/zipcpu-3p0.html&quot;&gt;Introducing the ZipCPU 3.0&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;After years of updates, &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; 3.0 is
here!  This means that the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
now has support for multiple bus structures, wide bus widths, clock stopping,
and a brand new DMA.  &lt;a href=&quot;/zipcpu/2023/05/29/zipcpu-3p0.html&quot;&gt;The
article&lt;/a&gt; announces this
new release, and discusses the importance of each of these major upgrades.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;DuckDuckGo Ranking:&lt;/strong&gt; A search for “ZipCPU” on DuckDuckGo yields
&lt;a href=&quot;https://zipcpu.com/&quot;&gt;ZipCPU.com&lt;/a&gt; as the #1 search result.  That’s good
enough for me.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;/blog/2023/06/28/sdiopkt.html&quot;&gt;Using a Verilog task to simulate a packet generator for an SDIO
controller&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;I haven’t written a lot about either Verilog test benches, or how to build
them, so this is a bit of a new topic for me.  Specifically, the question
involved was how to make your test bench generate properly synchronous
stimuli.  No, the correct answer is &lt;em&gt;NOT&lt;/em&gt; to generate your stimulus on the
negative edge of the clock.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;DuckDuckGo Ranking:&lt;/strong&gt; A search for “SDIO Verilog Tasks” on DuckDuckGo
yields the &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO repository&lt;/a&gt; as the #31
search result.  (Google returns the correct article at #3 after a search
for “SDIO Verilog”.)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;/formal/2023/07/18/sdrxframe.html&quot;&gt;SDIO RX: Bugs found with formal methods&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;If you’ve read my blog often enough, you’ll know that I’m known for formally
verifying my designs.  In the case of the new &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO/eMMC
controller&lt;/a&gt;, I had it “working” on hardware
before either the formal verification or the full simulation model were
complete.  This leaves open the question, how many bugs were missed by my
hardware and (partial) simulation testing?&lt;/p&gt;

    &lt;p&gt;The article spends a lot of time also discussing “why” proper verification,
whether formal or simulation, is so important.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;DuckDuckGo Ranking:&lt;/strong&gt; A search for “SDIO formal verification” turns up the
&lt;a href=&quot;https://zipcpu.com/&quot;&gt;ZipCPU blog&lt;/a&gt; as result #69.  Adding “verilog” to the
search terms returns the blog as result #46.  As before, Google returns
the right article as the #1 search result after only searching for “SDIO
formal”.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;/blog/2023/11/25/eth10g.html&quot;&gt;An Overview of a 10Gb Ethernet Switch&lt;/a&gt;&lt;/p&gt;

    &lt;p&gt;As I mentioned above, one of the big projects of mine this year was a &lt;a href=&quot;https://github.com/ZipCPU/eth10g&quot;&gt;10Gb
Ethernet switch&lt;/a&gt;.  This article goes over
the basics of the switch, and how the various data paths within the design
move data around.&lt;/p&gt;

    &lt;p&gt;&lt;strong&gt;DuckDuckGo Ranking:&lt;/strong&gt; A search for “10Gb Ethernet Switch FPGA” turns up
the &lt;a href=&quot;https://github.com/ZipCPU/eth10g&quot;&gt;Ethernet design&lt;/a&gt; as the #16 result,
and a search on “10Gb Ethernet Switch Verilog” returns the same github result
as the #1 result.  Curiously, the &lt;a href=&quot;https://github.com/ZipCPU/eth10g/blob/master/bench/rtl/tbenet.v&quot;&gt;10Gb Ethernet test bench
model&lt;/a&gt; for the same
repository comes up as the #2 result.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For all those who like to spam my email account, my conclusions from these
numbers are simple: 1) the &lt;a href=&quot;https://zipcpu.com/&quot;&gt;ZipCPU blog&lt;/a&gt; holds its own just
fine on a Google ranking, and 2) DuckDuckGo’s search engine needs work.  &lt;a href=&quot;/blog/2022/11/12/honesty.html&quot;&gt;If
you want to sell me web-based services and don’t know
this&lt;/a&gt;, I’ll assume you haven’t
done your homework and leave your email in my spam box.&lt;/p&gt;

&lt;h2 id=&quot;upcoming-projects&quot;&gt;Upcoming Projects&lt;/h2&gt;

&lt;p&gt;So, what’s next for 2024?  Here are some of the things I know of.  Some of
these are paid for, others still need funding.&lt;/p&gt;

&lt;table align=&quot;center&quot; style=&quot;float:none&quot;&gt;&lt;caption&gt;2024 Projects&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/2023-review/2024-funding.svg&quot; width=&quot;640&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Still, this is a good list to start from:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;One of my ASIC projects is in the middle of a massive speed upgrade.  This is
not a clock upgrade, or a fastest supported frequency upgrade, but rather an
upgrade to adjust the internal state machine.  I’m anticipating an additional
speed up of between 8x and 256x as a result of this upgrade.&lt;/p&gt;

    &lt;p&gt;Status?  &lt;strong&gt;Funded.&lt;/strong&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;My brand new &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO/eMMC controller&lt;/a&gt;
has neither eMMC boot support, nor DMA support.  Boot support might allow me
to boot the &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt; directly from
an eMMC chip, whereas DMA support would allow the
&lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;
to read lots of data from the card without CPU interaction.
Both may be on the near-term horizon, although neither upgrade is funded.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;table align=&quot;center&quot; style=&quot;float: left; padding: 25px&quot;&gt;&lt;caption&gt;Laptop projects have additional requirements&lt;/caption&gt;&lt;tr&gt;&lt;td&gt;&lt;img src=&quot;/img/2023-review/laptop.svg&quot; width=&quot;320&quot; /&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;

&lt;p&gt;Status?  Not funded.  On the other hand, this project fits quite nicely on
  my laptop for those days when I have the opportunity to take my son to his
  basketball practice … (He’s a 6’4” high school freshman, who is new to
  the sport as of this year …)&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ZipCPU/AutoFPGA&quot;&gt;AutoFPGA&lt;/a&gt; is now, and has been for some time,
a backbone of all of my designs.  I use it for everything.  It makes
adding and removing IP components easy.  One of its key capabilities is
&lt;a href=&quot;/zipcpu/2019/09/03/address-assignment.html&quot;&gt;address assignment (and adjustment)&lt;/a&gt;.
Sadly, it’s worked so well that it now needs some maintenance.  Specifically,
I’d like to upgrade it so that it can handle partially fixed addressing, such
as when some addresses are given and fixed while others are allowed to change
from one design to the next.  This is only a precursor, though, to supporting
2GB memories where the memory address range overlaps one of the ZipSystem’s
fixed address ranges.&lt;/p&gt;

    &lt;p&gt;Status?  A &lt;strong&gt;funded&lt;/strong&gt; (SONAR) project requires these upgrades.  Unlike my
current SONAR project, built around &lt;a href=&quot;https://store.digilentinc.com/nexys-video-artix-7-fpga-trainer-board-for-multimedia-applications/&quot;&gt;Digilent’s Nexys Video
board&lt;/a&gt;,
this one will be built around &lt;a href=&quot;https://www.enclustra.com/en/products/fpga-modules/mercury-kx2/&quot;&gt;Enclustra’s Mercury
KX2&lt;/a&gt;, and
so either &lt;a href=&quot;https://github.com/ZipCPU/AutoFPGA&quot;&gt;AutoFPGA&lt;/a&gt; gets upgraded or
I can’t use the full memory range.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The &lt;a href=&quot;/about/zipcpu.html&quot;&gt;ZipCPU&lt;/a&gt;’s GCC backend urgently
needs a fix.  Specifically, it has a problem with &lt;a href=&quot;https://en.wikipedia.org/wiki/Tail_call&quot;&gt;tail (sibling)
calls&lt;/a&gt; that jump to register
addresses.  This problem was revealed when testing the &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO/eMMC software
drivers&lt;/a&gt;, and needs a proper fix before I
can make any more progress on upgrading the
&lt;a href=&quot;https://zipcpu.com/zipcpu/2021/03/18/zipos.html&quot;&gt;ZipOS&lt;/a&gt;.&lt;/p&gt;

    &lt;p&gt;Did I mention working on the
&lt;a href=&quot;https://zipcpu.com/zipcpu/2021/03/18/zipos.html&quot;&gt;ZipOS&lt;/a&gt;?  Indeed.
Realistically, further work on the &lt;a href=&quot;https://github.com/ZipCPU/sdspi&quot;&gt;SDIO/eMMC
software&lt;/a&gt; really wants a proper OS of some
type, so … this may be a future and upcoming task.&lt;/p&gt;

    &lt;p&gt;Status?  This project isn’t likely to get any funding, but other projects
are likely to require this fix.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;As another potential project, an old friend is looking into building a
“see-in-the-dark” capability–kind of like a “better” version of
night-vision goggles.  He’s currently arranging for funding, and after all of
my video work I might finally find a customer for it.  Yes, his work will
require some secret sauce processing–but it’s all quite doable, and could
easily fit into this year’s upcoming work.&lt;/p&gt;

    &lt;p&gt;Status?  If this moves forward, it will be &lt;strong&gt;funded&lt;/strong&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;I’d also like to continue my work on a &lt;a href=&quot;https://github.com/ZipCPU/wbsata&quot;&gt;Wishbone controlled SATA
controller&lt;/a&gt; this year.  I started working
on this controller under the assumption that it would be required by my
SONAR project, and so funded.  Now it no longer looks like it will be funded
under this vehicle.  Still, the controller is now written, even though the
verification work is far from complete.  Specifically, I’ll need to work on
my &lt;a href=&quot;https://github.com/ZipCPU/wbsata&quot;&gt;SATA (Verilog) Verification IP&lt;/a&gt;, until
it’s good enough to tell me whether or not I have the Xilinx GTX
transceivers modeled correctly.  Once I get that far, I can start
testing against both actual hardware (on my desk) and
&lt;a href=&quot;/blog/2017/06/21/looking-at-verilator.html&quot;&gt;Verilator&lt;/a&gt;
models.&lt;/p&gt;

    &lt;p&gt;Status?  Funding has been applied for.  Sadly, it’s not likely to be enough
to pay for my hours, but perhaps I can have a junior engineer work on this.
Still, whether or not the funding comes through remains to be determined.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Did I mention that the new debugging bus upgrades are on my list to be
tested?  Who knows, I may test their AXI counterparts first, or I may test
the UDP version first, or …  Only the Good Lord knows how this task will
move forward.&lt;/p&gt;

    &lt;p&gt;Status?  Not funded at all.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;I am looking into getting some funding for a second version of an Ethernet
based Memory controller.  The SONAR project required a &lt;a href=&quot;https://zipcpu.com/blog/2022/08/24/protocol-design.html&quot;&gt;first version of this
controller&lt;/a&gt;,
and it smokes &lt;a href=&quot;/blog/2017/06/05/wb-bridge-overview.html&quot;&gt;my serial port based debugging
controller&lt;/a&gt;.  A
second version of this controller, designed for resource constrained FPGAs,
designed for speed, designed for throughput from the ground up … could
easily become a highly desired product.&lt;/p&gt;

    &lt;p&gt;We’ll see.&lt;/p&gt;

    &lt;p&gt;Status?  Sounds fun, but not (yet) funded.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Finally, I have an outstanding task to test an open source memory controller,
using open source synthesis and place-and-route tools, for both Artix-7
and Kintex-7 devices.  I’ll let you know how that works out.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since these are business predictions about the future, I am required by the
Good Lord to add that these are subject to whether or not I live and the
Lord wills.  (See &lt;a href=&quot;https://www.blueletterbible.org/kjv/jam/4/13-15&quot;&gt;James
4:13-15&lt;/a&gt; for an explanation.)&lt;/p&gt;

&lt;p&gt;As always, let me know if you are interested in any of these projects, and
especially let me know if you are interested in funding one or more of them.
Either way, the upcoming year looks like it will be quite busy and it’s only
January.&lt;/p&gt;

&lt;p&gt;“My cup runneth over (&lt;a href=&quot;https://blueletterbible.org/kjv/psa/23/5&quot;&gt;Ps 23:5&lt;/a&gt;)”, and
so I shall also pray that God grants you the many blessings He has given me.&lt;/p&gt;

&lt;hr /&gt;&lt;p&gt;&lt;em&gt;Let every thing that hath breath praise the LORD.  Praise ye the LORD. (Ps 150:6)&lt;/em&gt;</description>
        <pubDate>Sat, 20 Jan 2024 00:00:00 -0500</pubDate>
        <link>https://zipcpu.com/blog/2024/01/20/2023-in-review.html</link>
        <guid isPermaLink="true">https://zipcpu.com/blog/2024/01/20/2023-in-review.html</guid>
        
        
        <category>blog</category>
        
      </item>
    
  </channel>
</rss>
