How To Add UART To Your FPGA Projects

September 6, 2018

Being able to communicate between a host computer and a project is often a key requirement, and for FPGA projects that is easily done by adding a submodule like a UART. A Universal Asynchronous Receiver-Transmitter is the hardware that facilitates communications with a serial port, so you can send commands from a computer and get messages in return.

Last week I wrote about a POV project that makes a good example to learn from. It was both non-trivial and used the board’s features nicely. But the message is hard-coded into the Verilog, which means you need to rebuild the FPGA configuration every time you want to change it. Adding a UART will allow us to update that message.

The good news is the demo is open source, so I forked it on GitHub so you can follow along with my new demo. To illustrate how you can add a UART to this project I made this simple plan:

Store the text in a way that can be changed on the fly
Insert a UART to receive serial data to store as text
Use a carriage return to reset the data pointer for the new text
Use an escape to clear the display and reset the data pointer

Finding and Adding a UART Core

A UART is a fairly common item and you’d think there would be one handy in the Altera IP catalog you see in Quartus. There is, but it is buried under the University Program. It also has a few quirks because it really expects to be part of a whole system, and I just wanted a standalone UART.

Actually, for this application, we only really need the receiver. I’ve written that code plenty of times but lately I’ve been using an MIT licensed UART that I acquired sometime in the past. There are a few versions of this floating around, including at freecores (which has a lot of reusable FPGA bits) and on OpenCores. If you poke around you can find FPGA code ranging from UARTs and PS/2 interfaces to entire CPUs. True, you can also grab a lot of that out of the official IP catalog, but you probably won’t be able to use those on other FPGA families. The UART I’m using, for example, will work just fine on a Lattice IceStick or probably any other FPGA I care to use it on. (The version I use is a little newer than any other copy I could find, even using Google, so check my repo.)

The UART resides in a single file and it was tempting to just plop it into the project, but resist that urge in favor of some better practices. I created a cores directory in the top-level directory and placed it there.

The UART interface is very simple and for this project, we don’t need the transmitter so a lot of it will be empty. Here’s the code:


uart #(
    .baud_rate(9600),          // default is 9600
    .sys_clk_freq(12000000)    // default is 100000000
)
uart0(
    .clk(CLK12M),              // The master clock for this module
    .rst(~nrst),               // Synchronous reset
    .rx(UART_RXD),             // Incoming serial line
    .tx(UART_TXD),             // Outgoing serial line
    .transmit(),               // Signal to transmit
    .tx_byte(),                // Byte to transmit
    .received(isrx),           // Indicates that a byte has been received
    .rx_byte(rx_byte),         // Byte received
    .is_receiving(),           // Low when receive line is idle
    .is_transmitting(),        // Low when transmit line is idle
    .recv_error()              // Indicates error in receiving packet.
);

Simple enough. The baud rate is 9600 baud and the input clock frequency is 12 MHz. The clk argument connects to that 12 MHz clock and the rst argument gets a positive reset signal. The nrst signal is active low, so I invert it on the way in. I connected both pins of the MAX1000 board even though I won’t use the transmitter since I thought I might use it at some point.

The only other two signals connected are rx_byte — you can guess what that is — and isrx which goes high when something has come in on the serial port.
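To make those two parameters concrete: a 12 MHz clock divided by 9600 baud gives 1250 clock cycles per bit. The sketch below is not the core's implementation. It is a hypothetical standalone module (baud_tick and its port names are invented here) that shows one way a bit-period divider can be derived from the same two numbers.

```verilog
// Hypothetical bit-period tick generator (illustration only, not part of uart.v).
module baud_tick #(
    parameter SYS_CLK_FREQ = 12000000,   // same value passed to the UART above
    parameter BAUD_RATE    = 9600
)(
    input  wire clk,
    input  wire rst,    // active-high, synchronous
    output reg  tick    // high for one clock once per bit period
);
    // 12000000 / 9600 = 1250 clocks per bit
    localparam CLKS_PER_BIT = SYS_CLK_FREQ / BAUD_RATE;

    // Wider than needed; synthesis trims the unused upper bits.
    reg [31:0] count = 0;

    always @(posedge clk) begin
        if (rst) begin
            count <= 0;
            tick  <= 1'b0;
        end else if (count == CLKS_PER_BIT - 1) begin
            count <= 0;
            tick  <= 1'b1;
        end else begin
            count <= count + 1'b1;
            tick  <= 1'b0;
        end
    end
endmodule
```

If the clock frequency is not an exact multiple of the baud rate, the integer division truncates, and the resulting timing error is worth checking against what the far end tolerates.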

Constraint Changes

The signals UART_RXD and UART_TXD now appear in the top-level module:

module top (
    input        CLK12M,
    input        USER_BTN,
    output [7:0] LED,
    output       SEN_SDI,
    output       SEN_SPC,
    input        SEN_SDO,
    output       SEN_CS,
    output [8:1] PIO,
    input        UART_RXD,
    output       UART_TXD
);

That’s not enough, though. You need to set the constraints to match the physical pins. The documentation is a bit of a mess in this regard because the serial port is on port B of the USB chip which — in theory — could be anything. It took a little poking into some examples to figure out which pins were which.

In the assignment editor, the two pins we want are marked as BDBUS[0] and BDBUS[1]. I turned the location constraints to Enabled=No on those pins so they wouldn’t conflict with my more descriptive names. Then I made these four entries in the assignment editor:

That will do the trick. Now we need to get those input characters to the LEDs somehow.

Data Storage

The original code set the text to display as an assignment. This is compact at compile time but doesn’t let you change the message at runtime (you’d need to recompile and upload again).


assign foo[0] = "W";

To improve upon this I changed foo to be a register type and initialized it. (Some Verilog compilers won’t synthesize an initial block, but the version of Quartus I’m using will.) This is not the same as responding to the reset, though. It initializes the registers at configuration time and that’s it. For this application, I thought that was fine. Cycle the power or force a reconfigure if you want to reload, but a push of the user button won’t change the message.

Here’s what the change looks like:


reg [7:0] foo [0:15];
initial
begin
    foo[0] = "W";
    foo[1] = "D";
    foo[2] = "5";

. . .

Processing Input

With this new configuration, the text is malleable, and we can hear data coming from the PC. We just have to connect those dots.


// Let's add the ability to change the text!
// Note: I had to change the foo "array"
// To be more than a bunch of assigns, above, too

wire isrx;              // Uart sees something!
wire [7:0] rx_byte;     // Uart data
reg [3:0] fooptr = 0;   // pointer into foo array
integer i;

always @(posedge CLK12M) begin
    if (isrx)    // if you got something
    begin
        if (rx_byte == 8'h0d)
            fooptr <= 0;
        else if (rx_byte == 8'h1b)
        begin
            fooptr <= 0;
            // Note: Verilog will unroll this at compile time
            for (i = 0; i < 16; i = i + 1)
                foo[i] <= " ";
        end
        else begin
            foo[fooptr] <= rx_byte;   // store it
            fooptr <= fooptr + 1;     // natural roll over at 16
        end
    end
end

This is pretty straightforward. On each rising clock edge, we look at isrx. If it is high, we look to see if the character is a carriage return (8'h0d) or an escape (8'h1b). Don’t forget that this line is a nonblocking assignment, not a “less than or equal to” comparison:

fooptr <= 0;

If the character is anything else, the code stores it at foo[fooptr]. The fooptr variable is only 4 bits so the 16 character rollover will take care of itself when we increment it.

The only other oddity in the code is the use of a for loop in FPGA synthesis. Some tools may not support this, but notice that the i variable is an integer. The compiler is smart enough to know that it needs to run that loop at compile time and generate code for the body. So you could easily replace this with 16 lines explicitly naming each element of foo. Of course, the loop bounds have to be constants known at compile time, and a given tool may impose other restrictions, such as a limit on the number of iterations.
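As a small illustration of that unrolling, here is a sketch using a hypothetical 4-entry memory instead of the 16-entry foo array. The module and port names are made up for this example. The comment inside shows the hand-written form that the loop is equivalent to.

```verilog
// Hypothetical demo of a synthesizable for loop (names are illustrative).
module clear_demo (
    input  wire       clk,
    input  wire       clear,   // when high, blank every entry
    input  wire [1:0] addr,    // read address
    output wire [7:0] data     // read data
);
    reg [7:0] mem [0:3];
    integer i;

    always @(posedge clk) begin
        if (clear) begin
            // The bounds are constants, so the tool can expand the loop
            // when it elaborates the design.
            for (i = 0; i < 4; i = i + 1)
                mem[i] <= " ";
            // Equivalent hand-written form:
            //   mem[0] <= " "; mem[1] <= " ";
            //   mem[2] <= " "; mem[3] <= " ";
        end
    end

    assign data = mem[addr];
endmodule
```

Either form describes four registers that all load a space character on the same clock edge; nothing iterates at run time.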

That’s all there is to it. Once you load the FPGA it will look like it always did. But if you open a terminal on the USB serial port (on my machine it is /dev/ttyUSB9 because I have a lot of serial ports; yours will almost certainly be different, like /dev/ttyUSB0) and set it for 9600 baud, you can change the text.

If your terminal doesn’t do local echo you’ll be typing blind. Echoing the characters back from the FPGA would be a good exercise and would make use of the transmitter, too.
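One possible starting point for that exercise is to loop the receiver straight back into the transmitter. The sketch below reuses the instantiation shown earlier and makes two assumptions about the core that this article does not confirm: that received pulses high for one clock per byte, and that transmit accepts a one-clock strobe that latches tx_byte. The wrapper name uart_echo is invented for illustration, and the core's uart.v has to be compiled alongside it.

```verilog
// Hypothetical echo wrapper around the article's UART core (a sketch).
module uart_echo (
    input  wire       CLK12M,
    input  wire       nrst,      // active-low reset, as in the article
    input  wire       UART_RXD,
    output wire       UART_TXD,
    output wire       isrx,      // still available to the text-storage logic
    output wire [7:0] rx_byte
);
    uart #(
        .baud_rate(9600),
        .sys_clk_freq(12000000)
    ) uart0 (
        .clk(CLK12M),
        .rst(~nrst),
        .rx(UART_RXD),
        .tx(UART_TXD),
        .transmit(isrx),         // assumed: a one-clock strobe starts a send
        .tx_byte(rx_byte),       // send back the byte just received
        .received(isrx),
        .rx_byte(rx_byte),
        .is_receiving(),
        .is_transmitting(),
        .recv_error()
    );
endmodule
```

There is no handshake here, so a byte that arrives while the transmitter is still busy may not be echoed. Checking is_transmitting before strobing transmit would be a natural refinement.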

What’s Next?

If you want to experiment, you now have a framework that can read the accelerometer and — with just a little more work — can talk back and forth to the PC. You could PWM the LEDs to control the brightness from the terminal, for example. Or make a longer text string that scrolls over time.

One of the attractive things about modern FPGAs is that they can accommodate CPUs. Even this inexpensive board can host a simple NIOS processor. That makes tasks like serial and host communications much simpler to manage. You can still use the rest of the FPGA for high speed or highly parallel tasks. For example, this project could have easily had a NIOS CPU talking to the PC while the FPGA handled the motion detection and LED driving. Just like I rejected the “stock” UART and got one elsewhere, there are plenty of alternatives to the NIOS that will fit on this board, too.

If you still want more, check out our FPGA boot camps. They aren’t specific to the MAX1000, but most of the material will apply. Intel, of course, has a lot of training available if you are willing to invest the time.

18 thoughts on “How To Add UART To Your FPGA Projects”

jafinch78 says:

September 6, 2018 at 10:12 am

Awesome!

Pat says:

September 6, 2018 at 10:45 am

Xilinx’s PicoBlaze (or the HDL equivalent PacoBlaze, which will run on other FPGAs) is great for interacting over serial – there are also UART macros in the code release, too, resulting in extremely tiny implementations: as in, you could easily have like 50+ UARTs/processors in even the smallest modern device. Plus you’ll (probably) have to learn assembly, so that’s a bonus. :)

Mike Jarabek says:

September 6, 2018 at 6:11 pm

I’d personally be very careful with that UART code… the RX input feeds directly into the state machine next state logic. Since this signal is coming in asynchronously from the external source, it has no fixed relationship to the clock in your FPGA. This means your state machine flip-flops could have their setup or hold time violated, and the next state will be unpredictable. That’s bad news. Before applying the RX line to your input, you need to synchronize it to your clock. Typical designs use at least 3 flip-flops arranged as a shift register to limit the metastability to the first one or two flip-flops, and protect the rest of the circuit. You might also consider adding a noise filter to the line too, that would consist of a shift register, with an output decision that goes high or low depending on the majority of 1’s or 0’s in the registers, maybe with some hysteresis. Otherwise you may be latching in garbage bits in a noisy environment…
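A minimal sketch of the kind of synchronizer described above, assuming a single clock domain and a line that idles high. The module and signal names are invented for illustration and are not part of the UART core.

```verilog
// Hypothetical 3-stage synchronizer for an asynchronous serial input.
module rx_synchronizer (
    input  wire clk,
    input  wire rx_async,    // raw pin, no fixed relationship to clk
    output wire rx_synced    // safe to use in clk-domain logic
);
    // Start at all ones because an idle UART line is high.
    reg [2:0] sync = 3'b111;

    always @(posedge clk)
        sync <= {sync[1:0], rx_async};   // shift the new sample in

    // Only the last stage feeds downstream logic, so any metastability
    // in the first stage has extra clock periods to settle.
    assign rx_synced = sync[2];
endmodule
```

The UART's rx port would then be driven by rx_synced instead of the pin, at the cost of three clocks of latency.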

That code has a synchronous reset… it could be a bad thing, if your clock is not running, the outputs will be in an unknown state. Adding the ‘rst’ signal to the always() sensitivity list, and putting the rest of the block in the else clause of the check for the asserted reset usually gives the synthesis tool the hint that you want an asynchronous reset.
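The coding pattern described above can be sketched with a single flip-flop. This is a generic illustration with made-up names, not a patch to the UART core.

```verilog
// Hypothetical register with an asynchronous, active-high reset.
module async_reset_reg (
    input  wire clk,
    input  wire rst,   // listed in the sensitivity list below
    input  wire d,
    output reg  q
);
    always @(posedge clk or posedge rst) begin
        if (rst)
            q <= 1'b0;   // takes effect without waiting for a clock edge
        else
            q <= d;      // normal clocked behavior
    end
endmodule
```

Synthesis tools generally recognize this shape and map it onto the flip-flop's asynchronous clear input.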

The last thing I’d probably fix in there is that they have mixed up using ‘=’ and ‘<='. In the UART core they are assigning the state variable with '=', which in the simulator will generate extra 'glitch' events on the outputs; this could lead to mismatches between simulation and synthesis.

1. Al Williams says:
  
  September 6, 2018 at 7:51 pm
  
  Like I said I have my own production UART that is significantly more complex (and bigger). It does 16X oversampling on receive among other things. However, the GitHub one seems to work fine despite any possible issues. Even with a 3-stage flip flop synchronizer, you don’t eliminate the possibility of metastability, you just reduce the likelihood, but yes. Altera has a pretty good paper on calculating the MTBF and, of course, you can trade latency for increased reliability.
  
  However, after looking at the code a little, I don’t think it is as bad as you think. The clock is not 1X baud and the start bit has to assert over some period of time. Keep in mind you’ll only have a metastable state on a transition which is relatively slow compared to the operating clock. So you might miss the first slice of the start bit but you’d make it up on the remaining slices. Same for noise filtering. If you notice, he is oversampling. Again, you might miss the first slice if you are unlucky. Then he counts off the bit state and it has to be more than 4 (or more) times in the interval to count. So while he doesn’t use a classic synchronizer nor does he have a noise filter in the way you suggest, he does in fact prevent against spurious events including metastability. If you think about it, his “shift register/majority” noise filter is spread out across the clock cycles implicitly, but it still does the same job.
  
  As for sync reset, granted, although if the clock isn’t running you have more problems in this case.
  
  I had not dug into the code to see that they had the blocking assignments in the code. I wonder if the synthesis tool is correctly inferring a state machine and rewriting the code anyway? Because, again, despite it not being a great practice it does work. I agree it is bad practice.
  
  This is actually a great example of the issues you have grabbing IP from different places. In this case, the UART is floating around several places and I’ve personally synthesized it on at least 3 different FPGA families with no trouble so I never really looked at it other than as a black box. When you get vendor IP you really do usually get a black box (EDIF) although you assume they do a good job. If latches are being inferred that probably limits the speed you can get correct optimization and wastes some resources. However, at the speeds involved it seems to work OK. I might “fix” that and see what the difference is in the synthesis.
  
  Just as a thought. I don’t know this, but I wonder if originally the state machine logic was combinatorial and a later author (note there are two listed and who knows how many anonymous ones) moved the state machine into the clocked logic? Because it seems well-written enough that the author knew the difference between blocking and nonblocking assignment.
  
Mike Jarabek says:

September 6, 2018 at 8:34 pm

Yeah, the code is not so bad, and it does have a lot of room for improvement. Definitely agree that 3 FF’s may not be enough, if the clock rate is high enough, and you are looking for an upper bound on the probability that the metastability will make it all the way through. Crossing clock domains was always my favorite part of designing these things, those circuits always give the timing analyzers headaches.

With this design, I’d say that the synthesis tool will still be able to find the state machine, despite it not following the ‘usual’ pattern. It can then assign the states to patterns in the state register at will. If it picks one-hot or a gray encoding, things will probably work well. If it picks a binary code, which the designer has coded, where more than one bit changes with a state change, then the timing through the next state logic could cause one FF to switch and one not to. This could certainly happen if the paths from the Rx pin to the various state registers are of different lengths. Then you could end up in a state where there is no exit. We saw this kind of behaviour once in a processor on a DTACK signal, if you timed the edge just right, half the chip thought you had completed the cycle, and the other half did not. It locked up solid, only a reset could get it back again.

It doesn’t look like oversampling in the RX_READ_BITS state, there just seems to be the traditional ‘sample in the middle of the bit period’ strategy. Rx _should_ be stable and not toggling then, unless you get a short spike at exactly the wrong moment. The sampling window is narrow, so the probability is low, but not zero. Definitely, like your production version, oversampling would be an excellent idea to weed out noise. The start bit detection does look for continuous assertion, as you note, so the state machine should not get too far without a stable enough signal.

Vendor IP is fun too… The quality does vary significantly. We once replaced a hardware DMA controller with an ‘Exact’ replacement in an FPGA. When we ran the system software, it turned out that they had actually not done such an ‘Exact’ implementation of the controller. It was missing a few features that the code used. It was fun to demonstrate to the vendor that their block didn’t actually meet their published specs… We had logic analyser captures of the real chip, and could compare them with the IP block, doing the ‘same’ thing, they did not match despite the vendor’s assurances that this was an exact match for the chip we were replacing. Fun times.

Interesting thought on it being asynch to start out. Possibly, it does look a lot like test-bench type behavioral code, it could have its origins there. Someone could have written a behavioral model to be used in a test bench, and then someone modified it so that it would synthesize without latches.

1. Al Williams says:
  
  September 6, 2018 at 8:56 pm
  
  Maybe I’m looking at the wrong code. Look at the GitHub link: https://github.com/wd5gnr/max1000-tutorial/blob/master/cores/osdvu/uart.v
  
  The key is that he isn’t firing the clock every bit period. He’s using a variable counter. So state RX_IDLE:
  RX_IDLE: begin
    // A low pulse on the receive line indicates the
    // start of data.
    if (!rx) begin
      // Wait 1/2 of the bit period
      rx_clk = one_baud_cnt / 2;
      recv_state = RX_CHECK_START;
    end
  end
  
  When he sees what MIGHT be a start bit, he delays 1/2 baud. Granted that’s just a sync delay but it still counts. Then if you still have the start bit asserted you get this:
  rx_clk = (one_baud_cnt / 2) + (one_baud_cnt * 3) / 8;
  rx_bits_remaining = 8;
  recv_state = RX_SAMPLE_BITS;
  rx_samples = 0;
  rx_sample_countdown = 5;
  
  So if you look at RX_SAMPLE_BITS, he waits for the time tick which will next time be 8X the baudrate and count until he’s done it 5 times. He counts how many asserted bits he gets in rx_samples.
  RX_SAMPLE_BITS: begin
    // sample the rx line multiple times
    if (!rx_clk) begin
      if (rx) begin
        rx_samples = rx_samples + 1'd1;
      end
      rx_clk = one_baud_cnt / 8;
      rx_sample_countdown = rx_sample_countdown -1'd1;
      recv_state = rx_sample_countdown ? RX_SAMPLE_BITS : RX_READ_BITS;
    end
  end
  
  So after 5 passes he decides if he got 4 or more counts and then sets up to do it again until the word is done:
  RX_READ_BITS: begin
    if (!rx_clk) begin
      // Should be finished sampling the pulse here.
      // Update and prep for next
      if (rx_samples > 3) begin
        rx_data = {1'd1, rx_data[7:1]};
      end else begin
        rx_data = {1'd0, rx_data[7:1]};
      end
  
  So I think it is oversampling, it is just doing it in an “unrolled way.” Unless I’m falling asleep or one of us is looking at different code which is possible.
  
  And yes I’ve had my run arounds with Vendor IP that I had to strong arm the source out of to fix.
  
  1. Mike Jarabek says:
    
    September 7, 2018 at 5:15 am
    
    Aha! We are looking at different code! Indeed the code you are citing is using a much better algorithm… Counting sub-bits and taking a majority vote, you are not sleeping. :) I have been looking at the one linked at: https://github.com/cyrozap/osdvu/blob/d18488c41141cfb1c7b29f5a5840510e727ae5a2/uart.v. The one linked in the 6th paragraph. Now, that explains a lot… The latter version is definitely much better, with using a counter to count the asserted time for a ‘1’, even there, though the rx line is applied unfiltered to the counter “+1” input, so we could have it skip counts, or end up with a messed up counter. Given that, it probably works ‘well enough’ that nobody would notice, under most circumstances, after the start bit is detected, the signal will be stable enough to feed the counter, ignoring noise pulses, or evil people putting in glitches.
    
    1. Al Williams says:
      
      September 7, 2018 at 2:41 pm
      
      Yes it appears to be an earlier version and I use the newer version which is why I didn’t make it a subproject. I think I had mentioned that but it may not have survived the editor’s knife.
      
      Ok, I thought maybe we were looking at different things. Great dialog though. Thank you!
      
    2. Al Williams says:
      
      September 12, 2018 at 6:56 pm
      
      If you check out the GitHub, I just pushed an update to make the UART code nonblocking. It is a bit more complex than you’d hope. There are places where the original author does something like subtracts one from something and then tests it for zero and since it was blocking that all happened together.
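      To illustrate the kind of change involved, here is a generic countdown sketch (not code from the UART core; all names are hypothetical). With blocking assignments a decrement followed by a test for zero sees the new value, but with nonblocking assignments the test still sees the old value, so the comparison has to move to 1.

      ```verilog
      // Hypothetical countdown showing the blocking-to-nonblocking rewrite.
      module countdown_nb (
          input  wire       clk,
          input  wire       load,    // load a new start value
          input  wire [3:0] start,
          output reg        done     // one-clock pulse when count reaches zero
      );
          reg [3:0] count = 0;

          always @(posedge clk) begin
              done <= 1'b0;
              if (load)
                  count <= start;
              else if (count != 0) begin
                  // Blocking style would read:
                  //   count = count - 1'd1;
                  //   if (count == 0) done = 1'b1;   // sees the NEW value
                  // Nonblocking: count still holds the OLD value here,
                  // so test for 1 to detect the step that reaches zero.
                  count <= count - 1'd1;
                  if (count == 4'd1)
                      done <= 1'b1;
              end
          end
      endmodule
      ```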
      
      I did a quick test and it did save a handful of gates but no registers and upped the Fmax a few hundred kHz. However, when it did not work I added a quick and dirty serial echo so it now uses MORE gates than before, and the FMax is very similar (a little faster but not much). You’d have to pull the echo code back out to get a fair comparison.
      
      The echo doesn’t handshake, so if you use a terminal that sends data at maximum throughput, you’ll miss every other echoed character. The board still processes it, the transmitter is just too busy to echo it.
      
Allan H. says:

September 6, 2018 at 11:57 pm

For the Xilinx-heads, Ken Chapman included UART Tx and Rx modules in his Picoblaze (tiny 8 bit microcontroller) source.
There are Verilog and VHDL versions supplied, but he coded it by instantiating low-level Xilinx primitives, meaning it’s small and tight but very non-portable.
You have to register to download. It’s no cost to use, but it’s definitely not open source. Still, it’s worth looking at to see how small a UART can be made.
https://www.xilinx.com/products/intellectual-property/picoblaze.html

1. Allan H. says:
  
  September 7, 2018 at 12:00 am
  
  I missed Pat’s comment above that already mentioned the Picoblaze UART.
  Dang.
  
Nathan McCorkle says:

September 7, 2018 at 12:33 am

I had to comment out the last line in the udev rules file `/etc/udev/rules.d/51-arrow-programmer.rules` because it was disabling the virtual COM port that the FPGA was expecting data on. The line is `RUN="/bin/sh -c 'echo $kernel > /sys/bus/usb/drivers/ftdi_sio/unbind'"`

Nathan McCorkle says:

September 7, 2018 at 12:38 am

Any chance the next installment could be saving the string to flash, and reloading it upon reset? Or is the on-board flash off-limits and only for storing the FPGA gate configuration?

1. Al Williams says:
  
  September 7, 2018 at 6:30 am
  
  It is possible.
  
  Well, I shouldn’t reply before 10AM. I was thinking this was an icestick so I wrote up about that below. I’ll leave it there for posterity, but the answer should be similar for the MAX1000 — I got my boards confused (both are on my desk along with 6 other FPGA boards at the moment).
  
  The equivalent document for the MAX10 is https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/max-10/ug_m10_config.pdf
  
  But the better document for your purpose is
  https://www.intel.com/content/dam/altera-www/global/en_US/pdfs/literature/hb/max-10/ug_m10_ufm.pdf
  
  This is made to interface with a CPU core via the “Avalon” interface but you could do it yourself pretty easily. Maybe even easier than SPI if you use the parallel controller.
  
  I might do this at some point or something similar, but if you beat me to it, be sure to submit to the tip line (write it up on Hackaday.io or GitHub or somewhere) and I have a feeling we’d post it.
  
  ####
  
  Start here: http://www.latticesemi.com/~/media/LatticeSemi/Documents/ApplicationNotes/IK/iCE40ProgrammingandConfiguration.pdf
  
  The SPI EEPROM is normally put to sleep after configuration but you can pass -s to icepack to set a bit to tell the chip not to do that.
  
  Then you need to control the SPI EEPROM lines and read/write above the maximum configuration address. For the HX1K, the configuration bit stream is 34,112 bytes. The SPI flash is 32 megabit/4 MB, so lots of room.
  https://www.micron.com/~/media/documents/products/data-sheet/nor-flash/serial-nor/n25q/n25q_128mb_3v_65nm.pdf
  
  You’ll notice there is a feature to store 5 images (a cold boot and 4 warm boot). I haven’t looked to see if the two select pins are available on the IceStick but a) you know if you are using it or not, b) you can use SB_WARMBOOT regardless, and c) 5 x 34K is a drop in the bucket compared to the size of the EEPROM.
  
  When you configure for multiboot, you can set the different images to different offsets, so you’d avoid those. You could store, for example, at the top of memory and work down if you were super paranoid.
  
  Just remember, you have to write in a page at a time, so it would make sense to align the storage with a page and plan on writing the whole thing every time you write (maybe add a command to store the current string? ^W or something).
  
  Some of the Lattice devices have SPI (and I2C) built in, but I do not think the one on the icestick does. However, SPI is easy to work with or you can grab some IP: https://opencores.org/project/spi_verilog_master_slave and probably many others, too.
  
  I might do this at some point or something similar, but if you beat me to it, be sure to submit to the tip line (write it up on Hackaday.io or GitHub or somewhere) and I have a feeling we’d post it.
  
Dean Macri says:

September 7, 2018 at 8:28 am

The first PoV installment of this interested me enough that I ordered one of the devices from Arrow last weekend. It arrived on Wednesday and I finally got around to trying to use it yesterday. I’m a total noob to FPGAs, so it took me a while to figure out how to get Quartus to talk to the device (maybe I skimmed something somewhere and glossed over it). I figured I’d share the link I eventually found for anyone else who buys the device, installs on Linux (I’m actually running 16.04 via Parallels on an iMac :-P) but can’t get the device recognized. Download the drivers from here: https://shop.trenz-electronic.de/Download/?path=Trenz_Electronic/Software/Drivers/Arrow_USB_Programmer/Arrow_USB_Programmer_2.2 and follow the instructions in the readme.

remydyer says:

September 12, 2018 at 12:56 am

gh://jamesbowman/swapforth
has a forth system on an icestick, which of course talks using a uart.
The code in the above Github repo under the /j1a/verilog/ directory gives a good example of this — plus you can compile it all on a raspberry pi in about 20 to 40 minutes… or in a singularity container on a desktop PC in something like a minute or two.

Want another on 3.3V ttl pins? copy, paste in the top level j1a.v file, give it an io address, and add some pin definitions to j1a.pcf, recompile and you’re basically good to go.

I also do both SPI masters and SPI slave this way. The latter running at 100MHz from another FPGA, and just keeping a 64-bit double-buffered (clock domain crossing) packet nice and fresh for the swapforth systems’ use. (the other FPGA is the one that saves captured data someplace else, but I wanted to check levels of a few channels to safely operate some machinery).

There are subtle tricks to connecting two FPGA boards via 0.1″ headers and getting that kind of bandwidth without interference, but it’s not too hard to understand.

sjw says:

September 12, 2018 at 1:45 am

None of that code is aligned properly and it is horrible

issam qalajy says:

November 17, 2019 at 3:20 pm

Hi everybody,

I was facing a problem and found a solution, so I would like to share it with others.

It was a crane control system. It is a Chinese crane with a protected processor (Freescale MPC….); to work on it you need its specific tool, which costs far too much money! But finally I found this:

[https://www.com-port-monitoring.com/]
It is perfect and robust, and I find it the best. This software solved a big problem for me. I got into a highly protected Chinese industrial kit and made a copy of the flash and the system processor program to solve another machine’s problem, without any interface or the programming software tool.
With the paid software “Serial Port Monitor” I can inject the hex file.

I did this operation successfully 3 times, on 25-ton, 50-ton and 70-ton cranes.

The software I used in the first place was “Advanced Serial Port Terminal”, to get in and obtain the flash copy!
