VLSI Interview Questions and Answers – Part 4

Question 31: Vending Machine FSM

Design a synthesizable Mealy FSM in Verilog for a vending machine that accepts 5-cent (nickel) and 10-cent (dime) coins. It sells pencils that cost 15 cents and should be able to return extra money (e.g., a nickel in change).

A Mealy FSM’s output depends on both the current state and the current inputs. We need states to represent the amount of money deposited so far.

  • States: S_0 (0c), S_5 (5c), S_10 (10c)
  • Inputs: `in_nickel`, `in_dime`
  • Outputs: `dispense_pencil`, `return_change`
Current State Input: Nickel Input: Dime
Next State Dispense Change Next State Dispense Change
S_0 (0c) S_5 0 0 S_10 0 0
S_5 (5c) S_10 0 0 S_0 1 0
S_10 (10c) S_0 1 0 S_0 1 1

module vending_machine (
    input  logic clk, reset,
    input  logic in_nickel, in_dime,
    output logic dispense_pencil, return_change
);
    typedef enum logic [1:0] { S_0, S_5, S_10 } state_t;
    state_t current_state, next_state;

    // State Register
    always_ff @(posedge clk or posedge reset) begin
        if (reset) current_state <= S_0;
        else       current_state <= next_state;
    end

    // Next State and Output Logic
    always_comb begin
        // Default outputs
        dispense_pencil = 1'b0;
        return_change = 1'b0;
        next_state = current_state; // Hold state by default

        case (current_state)
            S_0: begin
                if (in_nickel) next_state = S_5;
                else if (in_dime)   next_state = S_10;
            end
            S_5: begin
                if (in_nickel) next_state = S_10;
                else if (in_dime) begin
                    next_state = S_0;
                    dispense_pencil = 1'b1;
                end
            end
            S_10: begin
                if (in_nickel) begin
                    next_state = S_0;
                    dispense_pencil = 1'b1;
                end
                else if (in_dime) begin
                    next_state = S_0;
                    dispense_pencil = 1'b1;
                    return_change = 1'b1; // 10c + 10c = 20c. Return 5c.
                end
            end
        endcase
    end
endmodule
            

Question 32: FSM for 0-1 vs 1-0 Transitions

A circuit receives an endless stream of 0s and 1s. Build a state machine to determine if there have been more "0-to-1" transitions or "1-to-0" transitions since reset.

We can use a state machine that saturates. Let's define three states: `EQUAL`, `MORE_01`, and `MORE_10`. This approach uses a fixed number of states and is a more literal interpretation of an FSM solution.

Current State Input Transition Next State
EQUAL 0 -> 1 MORE_01
1 -> 0 MORE_10
MORE_01 0 -> 1 MORE_01 (Saturate)
1 -> 0 EQUAL
MORE_10 0 -> 1 EQUAL
1 -> 0 MORE_10 (Saturate)

module transition_fsm (
    input  logic clk, reset,
    input  logic data_in,
    output logic more_01_trans,
    output logic more_10_trans
);
    typedef enum {EQUAL, MORE_01, MORE_10} state_t;
    state_t current_state, next_state;
    logic data_prev;

    // Detect transitions
    wire is_01_trans = (data_prev == 1'b0) && (data_in == 1'b1);
    wire is_10_trans = (data_prev == 1'b1) && (data_in == 1'b0);

    // State register and previous data register
    always_ff @(posedge clk or posedge reset) begin
        if (reset) begin
            current_state <= EQUAL;
            data_prev <= 1'b0;
        end else begin
            current_state <= next_state;
            data_prev <= data_in;
        end
    end

    // Next state logic
    always_comb begin
        next_state = current_state;
        case (current_state)
            EQUAL: begin
                if (is_01_trans) next_state = MORE_01;
                else if (is_10_trans) next_state = MORE_10;
            end
            MORE_01: begin
                if (is_10_trans) next_state = EQUAL;
            end
            MORE_10: begin
                if (is_01_trans) next_state = EQUAL;
            end
        endcase
    end

    // Output logic
    assign more_01_trans = (current_state == MORE_01);
    assign more_10_trans = (current_state == MORE_10);
endmodule
            

Question 33: Divisible-by-5 FSM

Design a logic circuit that mimics an infinitely wide register. It takes a binary number serially, one bit at a time (MSB first). The output should be asserted high when the number held in the "register" is divisible by 5.

This is a classic FSM problem. We can determine if a number is divisible by 5 by tracking the remainder after division by 5. Let the number seen so far be `N`. When a new bit `b` arrives, the new number is `2*N + b`.

The new remainder is `(2*N + b) mod 5`. We can design states based on the current remainder (`N mod 5`). The output is high only when a transition leads to state S0 (remainder 0).

Current State (Remainder) Input: 0 Input: 1
Next State Output Next State Output
S0 (rem=0) S0 1 S1 0
S1 (rem=1) S2 0 S3 0
S2 (rem=2) S4 0 S0 1
S3 (rem=3) S1 0 S2 0
S4 (rem=4) S3 0 S4 0

Question 34: Serial Index Reporter

Design a circuit with inputs `in_vld` and `Data_in[7:0]`. When `in_vld` is high, it accepts `Data_in`. The circuit should then output the index of each set bit in `Data_in` serially, one index per cycle, asserting `out_vld` and providing the index on `index[2:0]`.

Example: `Data_in: 01010010` -> Output should be: index 1, then 4, then 6 over three cycles.

This requires a combination of a priority encoder and a state machine to iterate through the bits.

  1. When `in_vld` is asserted, latch the `Data_in` into an internal register, let's call it `data_reg`.
  2. In each subsequent cycle, feed `data_reg` into a priority encoder to find the index of the lowest-set bit.
  3. Output this index and assert `out_vld`.
  4. To find the *next* set bit in the next cycle, clear the bit that was just found from `data_reg`. A common trick for this is `data_reg <= data_reg & (data_reg - 1)`. This clears the LSB '1'.
  5. Repeat until `data_reg` becomes zero, at which point `out_vld` is de-asserted.

module serial_index_reporter (
    input  logic clk, reset,
    input  logic in_vld,
    input  logic [7:0] data_in,
    output logic out_vld,
    output logic [2:0] index
);
    logic [7:0] data_reg;

    always_ff @(posedge clk or posedge reset) begin
        if (reset) begin
            data_reg <= 8'b0;
        end else if (in_vld) begin
            data_reg <= data_in; // Latch new data
        end else if (out_vld) begin
            // Clear the LSB '1' to find the next one
            data_reg <= data_reg & (data_reg - 1);
        end
    end

    // Combinational logic for output
    assign out_vld = (data_reg != 8'b0);
    // Use a priority encoder to find the index of the LSB '1'
    assign index = $clog2(data_reg & -data_reg);

endmodule
            

Question 35: Power of 2 Detector

Design a combinational circuit to detect if an 8-bit number is a power of 2.

A positive integer is a power of 2 if and only if its binary representation has exactly one bit set to '1'. There are two common ways to detect this:

Method 1: Population Count (Popcount)

Count the number of set bits in the vector. If the count is exactly 1, the number is a power of 2. This can be done with a tree of adders but is often less efficient.

Method 2: Bitwise Trick (More Efficient)

If a number `n` is a power of 2, then `n - 1` will have all the lower bits set to '1'. Therefore, for any power of 2, the bitwise AND of `n` and `n-1` will be zero. We must also handle the case of `n=0`, which is not a power of 2.

Logic Equation: is_power_of_2 = (n != 0) && ((n & (n-1)) == 0)


module power_of_2_detector (
    input  logic [7:0] n,
    output logic       is_power_of_2
);
    assign is_power_of_2 = (n != 8'b0) && ((n & (n - 1)) == 8'b0);
endmodule
            

Question 36: Count Number of 1s in a Vector

Design a circuit to count the number of '1's in a vector (population count).

The most direct way to implement this in hardware is with a tree of adders. For an N-bit vector, this will produce a `log2(N)`-bit sum.

Example for a 7-bit vector using full adders:

  1. Layer 1: Group the 7 bits into two groups of 3 and one leftover. Use two 3-input full adders. Each full adder takes 3 bits and produces a 2-bit sum (Sum, Carry).
  2. Layer 2: Add the results from Layer 1. This will involve adding the two 2-bit sums and the single leftover bit from the input. This can be done with more adders.
  3. The final result will be the total count.

In Verilog, you can simply use the `+` operator in a loop, and the synthesis tool will create an efficient adder tree.


module pop_counter #(parameter WIDTH = 7) (
    input [WIDTH-1:0] data_in,
    output [$clog2(WIDTH):0] count
);
    always_comb begin
        count = '0;
        for (int i = 0; i < WIDTH; i++) begin
            count = count + data_in[i];
        end
    end
endmodule
            

Question 37: Find the Unique Number in an Array

You have an array of `2n+1` non-zero values. In this array, `n` elements appear exactly twice, and one element appears only once. Design a circuit with minimum hardware to find the unique number.

This is a classic problem that can be solved elegantly using the properties of the XOR gate.

The key properties of XOR are:

  • Self-inversing: `A ^ A = 0` (Any number XORed with itself is zero).
  • Identity property: `A ^ 0 = A` (Any number XORed with zero is itself).
  • Commutative & Associative: The order of operations doesn't matter.

Solution: If you XOR all the numbers in the array together, every number that appears twice will cancel itself out (`A ^ A = 0`). The final result of the cumulative XOR operation will be the one number that appeared only once, as it will be XORed with a final result of 0.

Hardware Implementation:

The circuit needs to iterate through the array and accumulate the XOR result.

  1. A register (xor_accumulator), initialized to 0.
  2. A counter to read each address of the memory/array.
  3. An XOR gate whose inputs are the current value from the array and the current value of `xor_accumulator`. The output feeds back into the accumulator on the next clock cycle.

Question 38: Serial Two's Complement FSM

Design an FSM for a serial two's complement block. The input is a binary number, one bit at a time, starting from the LSB.

The algorithm for two's complement is: starting from the LSB, copy all bits up to and including the first '1'. After that, invert all subsequent bits.

This can be modeled with a simple two-state Mealy FSM:

  • S0 (Copy State): The initial state. In this state, we have not yet seen a '1'. The output is the same as the input. If the input is 0, we stay in S0. If the input is 1, we also output 1 but transition to S1.
  • S1 (Invert State): We have already seen the first '1'. In this state, we invert every subsequent input bit. We remain in S1 for all future inputs.
Current State Input Next State Output
S0 (Copy) 0 S0 0
1 S1 1
S1 (Invert) 0 S1 1
1 S1 0

Question 39: Synthesis of Different Assignment Orders

Analyze the synthesis results for the following four Verilog modules. All are intended to create a three-stage shift register.

This problem further explores how assignment type (blocking/non-blocking) and statement order affect synthesis.

Module A (Blocking, Forward Order): q1=d; q2=q1; q=q2;

This will synthesize to a single flop where `q = d`. The assignments execute sequentially in the same timestep, so the value of `d` propagates instantly through `q1` and `q2` to `q`. The intermediate variables are optimized away. Result: One Flop.

Module B (Blocking, Reverse Order): q=q2; q2=q1; q1=d;

This is a classic example of a race condition and incorrect usage. `q` gets the old value of `q2`. Then `q2` gets the old value of `q1`. Finally, `q1` gets `d`. The logic does not form a clean shift register. Synthesis tools may produce unpredictable results or a single flop with complex feedback, but it will not be a three-stage shifter. This coding style should always be avoided for sequential logic. Result: Incorrect Logic.

Module C & D (Non-Blocking, Any Order):

q1<=d; q2<=q1; q<=q2; and q<=q2; q2<=q1; q1<=d;

For non-blocking assignments, the order of statements within the `always` block does not matter. In both cases, all right-hand sides are evaluated first using the values from before the clock edge. Then, all left-hand sides are updated simultaneously. Both modules correctly describe the behavior where `q1` gets `d`, `q2` gets the old `q1`, and `q` gets the old `q2`. Result: Both synthesize to a three-stage shift register. This demonstrates why non-blocking assignments are the standard for modeling sequential logic.

Question 40: Asynchronous vs. Synchronous Resets

Compare asynchronous and synchronous resets. Which one is better and why?

Neither is universally "better"; the choice depends entirely on the design requirements. Both have distinct advantages and disadvantages.

Feature Asynchronous Reset Synchronous Reset
Behavior Resets the flip-flop immediately, independent of the clock. Reset only takes effect on the next active clock edge.
Clock Dependency Works even if the clock is not running. Useful for system-wide power-on reset. Requires a running clock to function.
Timing (STA) The reset signal is not part of the data path, making timing closure easier. The reset logic is part of the data path to the flop's D-input, which can make the combinational path larger and timing harder to meet.
Glitches Highly sensitive to glitches on the reset line, which can cause spurious resets. Not recommended for internally generated resets. Inherently filters out glitches on the reset line because it is only sampled at the clock edge. Safe for internal resets.
Reset Removal (De-assertion) Can cause metastability if the reset is removed too close to an active clock edge (violates recovery/removal times). Requires a reset synchronizer. The reset is removed synchronously with the clock, which is inherently safe and avoids metastability issues related to de-assertion.
Typical Use Case Top-level, chip-wide reset pin that needs to work before clocks are stable. Resets generated internally within a design (e.g., FSM resets, local module resets).

Conclusion: Use asynchronous resets for global, power-on reset conditions. Use synchronous resets for most internal logic to ensure robust, glitch-free, and predictable behavior.

VLSI Interview Questions and Answers – Part 2

Question 11: D-Flip-Flop with Enable (Power Optimized)

Add an enable signal to a D-Flip-Flop. How would you optimize the design for power?

There are three common ways to add an enable signal, but only one is truly power-efficient.

  1. Gating the Input: Place a MUX before the D-input of the flop. If `enable` is high, the MUX passes the new data. If `enable` is low, it feeds the flop’s own output back to its input. The flop still sees a clock edge every cycle, consuming switching power.
  2. Gating the Output: Place a tri-state buffer or MUX on the output. This is generally not done as it doesn’t stop the internal flop from switching and consuming power.
  3. Gating the Clock (Power Optimized): This is the best method for power optimization. The clock signal itself is turned off when the enable signal is low. This prevents the flip-flop from toggling at all, saving significant dynamic power.

However, simple clock gating with an AND gate is dangerous as it can create glitches. The industry-standard solution is to use an Integrated Clock Gating (ICG) Cell. An ICG cell is essentially a latch-based, glitch-free AND gate that ensures the gated clock output is always clean.

Power-Optimized Enable using ICG Cell

ICG Cell

DFF
Clock
Enable
gated_clk

Question 12: Detecting a Previously Seen Number

A 4-bit number arrives every clock cycle. Design a circuit that raises an output signal `high` if the number seen in the current cycle has already appeared in a previous cycle (since the last reset).

This is a classic use case for a direct-mapped lookup table. Since the input is 4 bits, there are only 24 = 16 possible values (0 to 15).

Architecture:

  1. Create a 16-bit register or a small memory (16×1 bit), let’s call it seen_table. Initialize all bits to 0 at reset.
  2. Each bit in this table corresponds to one of the possible input numbers. For example, `seen_table[0]` corresponds to the number `4’b0000`, `seen_table[5]` to `4’b0101`, and so on.
  3. In any given cycle:
    • Use the incoming 4-bit number as an index into the seen_table.
    • Check: Read the value at that index. If the bit is already `1`, it means we have seen this number before. Set the output `seen_before` to high.
    • Update: After the check, write a `1` to that index in the table to mark the current number as “seen” for future cycles.

module seen_detector (
    input  logic clk, reset,
    input  logic [3:0] data_in,
    output logic       seen_before
);
    reg [15:0] seen_table;

    // Check logic is combinational
    assign seen_before = seen_table[data_in];

    // Update logic is sequential
    always_ff @(posedge clk or posedge reset) begin
        if (reset) begin
            seen_table <= 16'b0;
        end else begin
            // Set the bit corresponding to the current data
            seen_table[data_in] <= 1'b1;
        end
    end
endmodule
            

Question 13: STA Path Adjustments

For the circuit below, how would you adjust the delay elements `dly1`, `dly2`, and `dly3` to fix a) a setup violation and b) a hold violation at input B of flip-flop FF2?

FF1
FF2
dly3
dly1
dly2
A
B
O
clk

In this standard timing path, FF1 is the launch flop and FF2 is the capture flop.

  • dly1 affects the launch clock path.
  • dly2 affects the capture clock path.
  • dly3 affects the data path.

a) Fixing a Setup Violation

A setup violation means the data at B is arriving too late. To fix this, we need to make the data arrive earlier relative to the capture clock. We can:

  • Decrease dly3: This speeds up the data path directly.
  • Increase dly2: This delays the capture clock, giving the data more time to arrive and stabilize.
  • Decrease dly1: This makes the launch clock arrive earlier, starting the data’s journey sooner.

b) Fixing a Hold Violation

A hold violation means the data at B is changing too soon after the clock edge, before FF2 can reliably capture it. To fix this, we need to make the data path slower relative to the capture clock. We can:

  • Increase dly3: This slows down the data path, making it stable for longer after the clock edge. This is the most common fix.
  • Decrease dly2: This makes the capture clock arrive earlier, effectively “outrunning” the fast data change.
  • Increase dly1: This delays the launch clock, making the data start its journey later.

Question 14: RTL for a SIPO Register

Write a Verilog RTL module for an 8-bit SIPO (Serial-In, Parallel-Out) shift register.

A SIPO register shifts in one bit at a time and makes all the bits available simultaneously on a parallel output. The most common implementation is a simple shift register. The original code in the PDF was flawed; here is the corrected, standard implementation.


module sipo_8bit (
    input  logic       clk,
    input  logic       reset,
    input  logic       serial_in,
    output logic [7:0] parallel_out
);

    // Internal register to hold the shifted data
    logic [7:0] shift_reg;

    always_ff @(posedge clk or posedge reset) begin
        if (reset) begin
            shift_reg <= 8'b0;
        end else begin
            // Concatenate new input bit with the top 7 bits
            // This shifts the register to the right
            shift_reg <= {shift_reg[6:0], serial_in};
        end
    end

    // Continuously assign the register value to the output
    assign parallel_out = shift_reg;

endmodule
            

Operation: On each rising clock edge, the `serial_in` bit is placed into the least significant bit (LSB) position of the register, and all other bits are shifted one position to the left (towards the MSB). The entire 8-bit value of the internal register is continuously available at `parallel_out`.

Question 15: Resolving a CDC Pulse Synchronization Issue

Given the timing diagram below where a narrow pulse on `data` crosses from `clkA` domain to `clkB` domain, how do you resolve the CDC issue? Assume nothing about the relationship between `clkA` and `clkB`.

clkB
data

A narrow pulse is extremely difficult to synchronize. A standard two-flop synchronizer will likely miss it entirely if the pulse is not active during a rising edge of `clkB`. Even if it is captured, it could be metastable.

The standard solution is to convert the pulse into a level-sensitive signal that can be safely synchronized. A toggle-flop synchronizer is ideal for this.

Architecture:

  1. Source Domain (`clkA`): Use the incoming pulse to toggle a flip-flop. This converts the one-cycle pulse into a level change that will persist until the next pulse arrives.
  2. CDC Path: Synchronize this toggling signal using a standard two-flop synchronizer into the `clkB` domain.
  3. Destination Domain (`clkB`): Detect the change in the synchronized level. An XOR gate comparing the synchronized signal with its one-cycle-delayed version will generate a single-cycle pulse in the `clkB` domain every time the level changes.

This method reliably transfers the pulse event across the domain, though it introduces a constraint: a new pulse cannot be sent until the previous one has been acknowledged by the destination, to prevent missing a toggle.

Question 16: RTL for a D-Latch

Write the Verilog RTL for a D-type latch.

A D-latch is a level-sensitive memory element. It is transparent when its enable signal is high (output follows input) and opaque when the enable is low (output holds its last value). Latches are inferred in Verilog when combinational logic has incomplete specification (e.g., an `if` without an `else`, or a `case` without a `default`).

Here is the RTL for a simple active-high D-latch with an asynchronous reset.


module d_latch (
    input  logic d,       // Data input
    input  logic en,      // Enable (level-sensitive)
    input  logic rst_n,   // Asynchronous active-low reset
    output logic q        // Output
);

    always_latch begin
        if (!rst_n) begin
            q <= 1'b0;
        end else if (en) begin
            q <= d; // Transparent: output follows input
        end
        // When en is low, q holds its value, creating the latch.
    end

endmodule
            

While useful in specific applications, latches are often avoided in synchronous designs because their timing analysis is more complex than for edge-triggered flip-flops.

Question 17: Synthesis of Incomplete Sensitivity List

What will the following code synthesize to? always @(!a) if (b) d=a;

This code has multiple issues that highlight bad RTL coding practices.

  1. Sensitivity List: The sensitivity list `always @(!a)` is nonsensical for synthesis. Synthesis tools create hardware based on logic structure, not arbitrary event triggers. The tool will likely ignore `!a` and interpret the block as combinational, issuing a warning that the sensitivity list is incomplete. The correct way to specify a combinational block is with `always @*` or `always_comb`.
  2. Incomplete Specification: The `if (b)` statement has no corresponding `else` clause. This means the code does not specify what the value of `d` should be when `b` is false. To prevent this ambiguity, a synthesis tool will infer a latch to hold the previous value of `d`.

Synthesis Result: The code will synthesize to a latch. The input to the latch will be `a`, and the enable for the latch will be `b`. This is almost certainly not the designer’s intent and can lead to major timing problems. A linter would flag this as a critical warning.

Question 18: Synthesis of Incomplete Sequential Logic

What will the following code synthesize to? always @(posedge clk) if (reset) q <= 0;

This is a sequential block that only specifies behavior when `reset` is high. It does not define what `q` should do when `reset` is low. In a sequential (`always_ff` or `always @(posedge clk)`) block, if a signal is not assigned a value under all conditions, it is inferred to hold its previous value.

Synthesis Result: The synthesizer will create a D-flip-flop where the output `q` is fed back to its own D-input, effectively holding its value forever once `reset` goes low. The `reset` signal acts as a synchronous load enable for the value `0`.

The hardware will be a D-flip-flop with a MUX at its input. The MUX selects `1’b0` when `reset` is high, and selects the flop’s own output `q` when `reset` is low.

DFF
MUX
D
q
1’b0
10
reset

Question 19: Safely Gating a Clock

How do you safely gate a clock to save power, and what are the risks of doing it incorrectly?

Clock gating is a technique used to turn off the clock to parts of a design to reduce dynamic power consumption. However, doing it naively is dangerous.

The Risk of Incorrect Gating: Using a simple AND or OR gate to combine a clock and a control signal is unsafe. If the control signal changes while the clock is high, it can create a glitch or a shortened pulse on the gated clock output. This glitch can cause downstream flops to either miss a clock edge or capture data incorrectly, leading to functional failure.

The Safe Solution: Integrated Clock Gating (ICG) Cell

The industry-standard method is to use a library-provided Integrated Clock Gating (ICG) cell. An ICG cell is essentially a latch-based, glitch-free circuit.

How it works:

  1. It contains a level-sensitive latch that holds the enable signal.
  2. The latch is transparent only when the clock is low. This ensures that the enable signal can only change its state when the clock is inactive.
  3. When the clock goes high, the latch becomes opaque, holding the enable value stable for the entire duration of the clock pulse.
  4. An AND gate then combines the stable latch output with the original clock.

This guarantees that the gated clock output is always a full, clean pulse without any glitches.

Question 20: Fixing Timing Violations from RTL

How can you fix a setup violation from RTL code only? Can you also fix a hold violation from RTL?

Fixing Setup Violations in RTL

A setup violation means the data signal is arriving too late at a flip-flop’s input relative to the clock edge. This is caused by a long combinational logic path between registers. From RTL, you can:

  • Pipelining: This is the most common and effective method. You break the long combinational path into two or more smaller, faster paths by adding registers in between. This increases latency but allows for a higher clock frequency. For example, changing `out = a + b + c;` to `temp <= a + b;` and `out <= temp + c;` in subsequent cycles.
  • Logic Restructuring: Rewrite the logic to be more efficient. For example, changing a long `if-else-if` chain (which synthesizes to a slow priority encoder) to a `case` statement (which synthesizes to a faster parallel MUX).
  • FSM State Encoding: For Finite State Machines, changing the state encoding can help. A binary encoding is dense but can have complex next-state logic. A one-hot or gray encoding can simplify the next-state logic, reducing the delay and helping fix setup violations.

Fixing Hold Violations in RTL

Generally, no. A hold violation means the data signal is changing too soon after the clock edge, before the flip-flop has had time to reliably capture it. This is usually caused by a data path that is *too fast* relative to the clock path.

Hold violations are considered a physical design issue, not a logical one. They are fixed during the Place & Route stage by the backend team, who add delay to the data path by inserting buffers or using longer wire routes. It is not something an RTL designer can or should try to fix by changing the Verilog code.

VLSI Interview Questions and Answers – Part 3

Question 21: Flops Needed for a Sequence Detector FSM

How many flip-flops are required to implement a Finite State Machine (FSM) that detects the input sequence “10101”?

The number of flip-flops depends on two factors: the number of states in the FSM and the state encoding scheme used.

  1. Number of States: To detect “10101”, we need states to track the progress. For a non-overlapping sequence detector:
    • S0: Idle state (or no part of sequence seen)
    • S1: `1` seen
    • S2: `10` seen
    • S3: `101` seen
    • S4: `1010` seen

    Upon receiving the final `1` in state `S4`, the machine outputs a ‘detect’ signal. This requires a total of 5 states for a Mealy FSM (where output depends on state and input). A Moore FSM would require an additional state for the output, for a total of 6.

  2. State Encoding:
    • Binary Encoding: Requires `ceil(log2(N))` flops, where N is the number of states. For 5 states, this is `ceil(log2(5)) = 3` flip-flops.
    • One-Hot Encoding: Requires `N` flops, one for each state. For 5 states, this is 5 flip-flops.

So, the minimum number of flip-flops required is 3 (using a Mealy FSM with binary encoding).

Question 22: Calculate Maximum Frequency of Operation

Given a sequential path, calculate the maximum possible operating frequency. Use the following values: Clock-to-Q delay (`t_c2q`) = 1ns, Combinational logic delay (`t_comb`) = 3ns, Setup time (`t_setup`) = 1ns, and Clock skew (`t_skew`) = 1ns, where the launch clock is delayed relative to the capture clock.

The minimum clock period (`T_min`) must be long enough for the data to travel from the launch flop to the capture flop and meet the setup time of the capture flop.

The standard setup timing equation is:

T_min ≥ t_c2q + t_comb + t_setup + t_skew

Where `t_skew` is defined as `T_capture_clk – T_launch_clk`. In this problem, the launch clock is delayed by 1ns, so `T_launch_clk` is later than `T_capture_clk`. This results in a negative skew, which helps meet setup time.

  • `t_c2q` = 1 ns
  • `t_comb` = 3 ns
  • `t_setup` = 1 ns
  • `t_skew` = -1 ns

We can calculate the minimum period:

T_min ≥ 1ns + 3ns + 1ns + (-1ns)

T_min ≥ 4 ns

The maximum frequency (`F_max`) is the reciprocal of the minimum period:

F_max = 1 / T_min = 1 / 4 ns = 250 MHz

Therefore, the maximum frequency of operation is 250 MHz.

Question 23: Clock Multiply-by-2 Circuit

Draw a logic diagram for a circuit that multiplies an input clock’s frequency by two.

A clock’s frequency can be doubled by XORing the original clock with a version of itself that has been phase-shifted by 90 degrees (or one-quarter of its period). This creates a pulse on both the rising and falling edges of the original clock.

Clk

Clk_90

Clk_x2

In practice, this is usually done inside a PLL (Phase-Locked Loop) or DLL (Delay-Locked Loop) which can generate precise phase shifts and frequencies.

Question 24: Fixed Priority Arbiter RTL

Provide an optimized RTL implementation for a fixed priority arbiter.

A fixed priority arbiter takes a multi-hot request vector (`req_in`) and produces a one-hot grant vector (`gnt_out`), granting the request with the highest fixed priority (e.g., LSB is highest). The following is a scalable and optimized implementation using a `for` loop with a `genvar` which synthesizes into a chain of priority logic.


module fixed_priority_arbiter #(
    parameter WIDTH = 8
) (
    input  logic [WIDTH-1:0] req_in,
    output logic [WIDTH-1:0] gnt_out
);
    // Grant the LSB if it is requested, as it has the highest priority.
    assign gnt_out[0] = req_in[0];

    // For other bits, grant only if requested AND no lower bit (higher priority) was granted.
    genvar i;
    for (i = 1; i < WIDTH; i = i + 1) begin : grant_logic
        assign gnt_out[i] = req_in[i] & ~|req_in[i-1:0];
    end

endmodule
            

Note: The logic `& ~|req_in[i-1:0]` is more efficient than `& ~|gnt_out[i-1:0]` because it avoids long dependency chains on the grant signals themselves, resulting in a faster circuit.

Question 25: Round Robin Arbiter RTL

Provide an RTL implementation for a Round Robin arbiter.

A Round Robin arbiter gives each requestor an equal chance by rotating priority. After a request is granted, it becomes the lowest priority in the next cycle. A common way to implement this is by using a fixed priority arbiter on a shifted version of the request vector.


module round_robin_arbiter #(
    parameter REQ_WIDTH = 4
) (
    input  logic clk, reset,
    input  logic [REQ_WIDTH-1:0] req,
    output logic [REQ_WIDTH-1:0] gnt
);

    logic [REQ_WIDTH-1:0] priority_ff;
    logic [REQ_WIDTH-1:0] req_masked;
    logic [REQ_WIDTH-1:0] gnt_unmasked;

    always_ff @(posedge clk or posedge reset) begin
        if (reset) begin
            priority_ff <= '1; // Initial priority to requestor 0
        end else if (|gnt) begin
            // Rotate priority to the next requestor
            priority_ff <= {gnt[REQ_WIDTH-2:0], gnt[REQ_WIDTH-1]};
        end
    end

    // Mask requests based on priority. Only higher priority requests are considered.
    // This is a simplified approach. A more robust one would involve shifting.
    // A common implementation uses a fixed priority arbiter on shifted requests.
    // For simplicity, here is a pointer-based approach:
    
    integer i;
    always_comb begin
        gnt = '0;
        // Start checking from the highest priority requestor
        for (i = 0; i < REQ_WIDTH; i = i + 1) begin
            if (req[i] && priority_ff[i]) begin
                gnt[i] = 1'b1;
                break; // Grant first one found
            end
        end
        // If no priority requestor is active, check all requests
        if (|gnt == 1'b0) begin
             for (i = 0; i < REQ_WIDTH; i = i + 1) begin
                if (req[i]) begin
                    gnt[i] = 1'b1;
                    break;
                end
            end
        end
    end

endmodule
            

Question 26: Weighted Round Robin Arbiter

Provide the pseudo-code or RTL structure for a Weighted Round Robin (WRR) arbiter.

A WRR arbiter extends the Round Robin scheme by allowing some requestors to be granted more often than others, based on assigned "weights". A requestor is only skipped in the Round Robin scheme if its weight counter is zero.

Architecture:

  1. Each requestor has a `weight_counter`. These are initialized to their assigned weights.
  2. The arbiter uses a standard Round Robin scheme, but only considers requests where the corresponding `weight_counter` is greater than zero.
  3. When a request is granted, its `weight_counter` is decremented.
  4. When all active requestors have their `weight_counter` at zero, all counters are reloaded with their initial weight values.

// --- Conceptual RTL for a 4-requestor WRR Arbiter ---

// State Elements
reg [3:0] weight_counter_0, weight_counter_1, ...;
reg [3:0] priority_pointer; // As in standard RR

// Inputs: req[3:0], clk, reset
// Outputs: gnt[3:0]

// Combinational Logic to determine which requests are active
wire [3:0] active_req;
assign active_req[0] = req[0] & (weight_counter_0 > 0);
assign active_req[1] = req[1] & (weight_counter_1 > 0);
// ... and so on for all requestors

// Use a standard Round Robin arbiter on `active_req` to generate `gnt`

// Sequential Logic for counter updates
always @(posedge clk or posedge reset) begin
    if (reset) begin
        // Reload all counters to initial weights
        weight_counter_0 <= INITIAL_WEIGHT_0;
        // ...
    end else begin
        // Decrement counter for the granted requestor
        if (gnt[0]) weight_counter_0 <= weight_counter_0 - 1;
        // ...

        // Check if all active counters are zero to reload
        if (reload_condition) begin
             weight_counter_0 <= INITIAL_WEIGHT_0;
             // ...
        end
    end
end
            

Question 27: Frequency Divide-by-3 Circuit

Draw a logic diagram for a circuit that divides an input clock frequency by 3.

A divide-by-3 counter requires 2 flip-flops to achieve the 3 unique states. The output will be high for one cycle and low for two, resulting in a 33% duty cycle. We can derive the logic from a state table.

State Table: (Current State: Qa, Qb | Next State: Da, Db | Output)

  • 00 -> 01 | 0
  • 01 -> 10 | 0
  • 10 -> 00 | 1

From this, we can derive the next-state logic equations:

Da = Qb'
Db = Qa
Output = Qa (if we want the output high in state `10`)

A common implementation uses a slightly different state assignment to simplify logic:

Da = (Qa | Qb)'
Db = Qa
Output = Qb

ADaQa BDbQb out

Question 28: Find Position of First Set Bit

How do you find the index of the first '1' in a binary vector, searching from a) the right (LSB) and b) the left (MSB)?

This is functionally equivalent to designing a priority encoder.

a) From the Right (LSB)

This is a standard priority encoder. It takes a multi-bit vector and outputs the binary index of the highest priority active bit, where the LSB (`bit 0`) has the highest priority.

Example: Input `8'b01101000` -> Output `3'b011` (index 3).

In Verilog, this can be implemented with a `casex` statement or a `for` loop.

b) From the Left (MSB)

This is also a priority encoder, but with the priority reversed (MSB is highest). The easiest way to implement this in RTL is to simply reverse the input vector before feeding it into a standard LSB-first priority encoder.

Example: Input `8'b01101000` -> Output `3'b110` (index 6).

Question 29: Count Trailing Zeros

Design a circuit that takes a binary number and returns the count of trailing zeros (the number of zeros from the LSB up to the first '1'). If the input is all zeros, it should return the width of the word.

This is another application of a priority encoder. The number of trailing zeros is simply the index of the first '1' from the right (LSB).

The most scalable and elegant solution combines a priority encoder with the system function `$clog2`.

  1. Isolate the LSB '1': A common trick is `lsb_one = data_in & -data_in`. This produces a one-hot vector with only the least significant '1' of `data_in` set. For example, `8'b01101000` becomes `8'b00001000`.
  2. Convert to Index: The SystemVerilog function `$clog2(lsb_one)` takes this one-hot number and returns its binary index. For `8'b00001000`, `$clog2` returns `3`.
  3. Handle the All-Zeros Case: An explicit check is needed for an input of `0`.

module trailing_zero_detector #(
    parameter WIDTH = 8
)(
    input  logic [WIDTH-1:0] data_in,
    output logic [$clog2(WIDTH):0] zero_count
);
    logic [WIDTH-1:0] lsb_one;

    assign lsb_one = data_in & -data_in;

    always_comb begin
        if (data_in == '0) begin
            zero_count = WIDTH;
        end else begin
            zero_count = $clog2(lsb_one);
        end
    end
endmodule
            

Question 30: Pulse Stretching Circuit

Draw a diagram for a pulse stretching circuit.

A pulse stretcher extends the duration of a short input pulse. A simple way to do this for a fixed number of cycles is with a shift register and an OR gate.

Architecture: The input pulse is fed into a shift register. The output is the OR-reduction of all the bits in the shift register. If the shift register is N-bits wide, an input pulse of 1 cycle will be stretched to an output pulse of N cycles.

3-Cycle Pulse Stretcher FF1 FF2 FF3 in_pulse out_stretched

For stretching a pulse by a large or variable number of cycles, a counter-based approach is more efficient than a long shift register.

VLSI Interview Questions and Answers – Part 1

 

Question 1: Average of a Data Stream

An incoming data stream has an unknown length and is not in any specific order. Design a hardware architecture to find the average of all data values, excluding the two largest values seen so far. The primary constraint is to minimize data storage.

To solve this without storing the entire stream, we can process the data on the fly. We need a few key storage elements (registers):

  • max_1: To store the largest number seen so far.
  • max_2: To store the second-largest number seen so far.
  • sum: A running total of all numbers *except* for max_1 and max_2.
  • count: A counter for the numbers included in the sum.

The logic operates as follows for each new data value (new_data) that arrives:

  1. If new_data > max_1:
    • The old max_1 is now demoted. Add it to the sum.
    • Update max_2 with the old max_1 value.
    • Update max_1 with new_data.
  2. Else if new_data > max_2:
    • The old max_2 value is demoted. Add it to the sum and increment count.
    • Update max_2 with new_data.
  3. Else (new_data is smaller than both):
    • Add new_data directly to the sum.
    • Increment count.

The final average is sum / count. Hardware division is resource-intensive. In an interview, it’s good to discuss trade-offs. A simple approach is to wait until the count is a power of 2 and then perform a fast right-shift to calculate the average at that point.

Question 2: Sequential Pulse Generation

Write a synthesizable Verilog module for an N-bit signal, `A`. The behavior should be that `A[0]` goes high for 10 clock cycles, then goes low. Immediately after, `A[1]` goes high for 10 clock cycles, then low, and so on for all `N` bits in a cyclical manner.

This describes a “walking 1” pattern where each bit position stays high for a fixed duration. We can achieve this with two counters: one to count the 10 cycles and another to track which bit should be active.


module walking_pulse #(
    parameter N = 8,
    parameter PULSE_WIDTH = 10
) (
    input  logic clk,
    input  logic reset,
    output logic [N-1:0] A
);

    localparam COUNT_WIDTH = $clog2(PULSE_WIDTH);
    localparam BIT_SELECT_WIDTH = (N > 1) ? $clog2(N) : 1;

    logic [COUNT_WIDTH-1:0]      cycle_count;
    logic [BIT_SELECT_WIDTH-1:0] active_bit;

    always_ff @(posedge clk or posedge reset) begin
        if (reset) begin
            cycle_count <= '0;
            active_bit  <= '0;
        end else begin
            if (cycle_count == PULSE_WIDTH - 1) begin
                cycle_count <= '0;
                if (active_bit == N - 1) begin
                    active_bit <= '0; // Wrap around
                end else begin
                    active_bit <= active_bit + 1;
                end
            end else begin
                cycle_count <= cycle_count + 1;
            end
        end
    end

    // Use a one-hot encoding based on the active_bit counter
    assign A = (1'b1 << active_bit);

endmodule
            

Question 3: MUX-Based CDC Synchronizer

Explain the architecture of a MUX-based synchronizer for safely transferring multi-bit data across clock domains, and provide the corresponding Verilog code for the receiver-side logic.

A simple two-flop synchronizer is unsafe for multi-bit data because different bits can transition at slightly different times, leading to an incorrect value being captured in the destination domain. A MUX-based synchronizer (or feedback synchronizer) solves this by only allowing the destination domain to sample the data when it is stable.

Architecture:

  1. A control signal from the source domain (in_ctrl) is first synchronized to the destination domain using a standard two-flop synchronizer.
  2. In the destination domain, a MUX selects between the incoming new data (in_data) and the previously captured (and held) data.
  3. The synchronized control signal (q2) acts as the select line for this MUX. When the control signal indicates new data is available, the MUX passes it to a register. Otherwise, the register feeds its own output back through the MUX, holding its value stable.


Clock Domain A
Clock Domain B

FF

FF


data_in
in_data


ctrl_in
in_ctrl

q1

q2


MUX
1
0


sel

FF
sync_data


clkAclkA
clkBclkBclkB


module mux_synchronizer #(
    parameter WIDTH = 8
) (
    input  logic clkB, reset,
    input  logic [WIDTH-1:0] in_data,
    input  logic             in_ctrl,
    output logic [WIDTH-1:0] sync_data
);

    logic q1, q2;
    logic [WIDTH-1:0] mux_out;

    // 1. Synchronize the control signal into the clkB domain
    always_ff @(posedge clkB or posedge reset) begin
        if (reset) begin
            q1 <= 1'b0;
            q2 <= 1'b0;
        end else begin
            q1 <= in_ctrl;
            q2 <= q1;
        end
    end

    // 2. MUX selects new data or holds the old stable data
    assign mux_out = q2 ? in_data : sync_data;

    // 3. Register the MUX output to get the final synchronized data
    always_ff @(posedge clkB or posedge reset) begin
        if (reset) begin
            sync_data <= '0;
        end else begin
            sync_data <= mux_out;
        end
    end

endmodule
            

Question 4: Handling Glitches in a Circuit

What is a glitch, and what is the standard method to prevent one from propagating through a digital circuit?

A glitch is a short, unwanted pulse or transition on a signal line, typically at the output of combinational logic. It occurs when different signal paths through the logic have different propagation delays, causing the output to momentarily change to an incorrect value before settling.

Glitches can cause serious issues, such as incorrectly clocking a downstream register or triggering spurious events.

The most effective way to eliminate glitches is to register the output of the combinational logic. A flip-flop is inherently immune to glitches on its data input because it only samples the input at the precise moment of the active clock edge. By the time the clock edge arrives, the glitch will have subsided, and the flip-flop will capture the final, stable value. This prevents the glitch from ever propagating to the next stage of the logic.


Problem: Glitch from Logic
FF1
Logic
FF2
Glitch


Solution: Register the Output
FF1
Logic
FF_Fix
FF2

Question 5: Blocking vs. Non-Blocking Assignments

Explain the synthesis result for the following Verilog code, first using blocking assignments (`=`) and then using non-blocking assignments (`<=`).


// Input is 'a', output is 'o'
always @(posedge clk) begin
  b = a;  // or b <= a;
  c = b;  // or c <= b;
  o = c;  // or o <= c;
end
            

This question highlights the fundamental difference between blocking and non-blocking assignments in sequential logic.

1. Using Blocking Assignments (`=`)

Blocking assignments execute sequentially within a block. The next statement only executes after the current one is complete.

b = a; // b gets the value of a.

c = b; // c gets the NEW value of b (which is a).

o = c; // o gets the NEW value of c (which is also a).

Synthesis Result: The synthesizer recognizes that `o` simply gets the value of `a`. The intermediate signals `b` and `c` are optimized away. The final hardware is a single flip-flop where input `a` is connected to the D-input and the output is `o`.

2. Using Non-Blocking Assignments (`<=`)

Non-blocking assignments schedule updates to occur at the end of the time step. All right-hand side (RHS) expressions are evaluated first, using the “old” values of the signals from before the clock edge.

b <= a; // Schedule b to get a’s value.

c <= b; // Schedule c to get b’s OLD value.

o <= c; // Schedule o to get c’s OLD value.

Synthesis Result: This correctly models a chain of registers. The value of `a` propagates through one flip-flop per clock cycle. The hardware is a three-stage shift register.

Question 6: CDC with Plesiochronous Clocks

Data is being sent from a 99 MHz clock domain to a 100 MHz clock domain. Is a standard two-flop synchronizer always sufficient for safe synchronization?

Not necessarily. It depends on the source of the clocks.

Clocks with very close frequencies, like 99 MHz and 100 MHz, are called plesiochronous. The critical factor is whether they are derived from the same original clock source or are completely independent.

  • Derived from Same Source: If both clocks come from the same PLL or oscillator, their phase relationship, while drifting, is not random. The clock edges will align frequently and predictably. This creates “worst-case” scenarios for metastability more often, and a standard two-flop synchronizer’s Mean Time Between Failures (MTBF) might be unacceptably low. More robust solutions like an Asynchronous FIFO are often required.
  • From Different Sources: If the clocks are from two independent crystal oscillators, their phase relationship is truly random. This scenario is no different from any other asynchronous clock crossing, and a standard two-flop synchronizer, designed with appropriate timing constraints, is generally sufficient.

Question 7: Fixing a Combinational CDC Path

The circuit below shows combinational logic in the source clock domain (`clkA`) feeding directly into a flip-flop in the destination domain (`clkB`). How would you modify this to eliminate the Clock Domain Crossing (CDC) issue?

Domain clkA
Domain clkB
FF
FF
OR
FF

Placing combinational logic directly before a synchronizer is a bad design practice. Any glitches or high-frequency toggling from the logic will be presented to the synchronizer, which drastically reduces the MTBF and increases the chance of failure.

The correct way to fix this is a two-step process:

  1. Register the logic output in the source domain: Add a flip-flop, clocked by `clkA`, immediately after the OR gate. This contains any glitches or rapid toggling within the source domain and presents a clean, stable signal to the clock domain boundary.
  2. Synchronize the stable signal: Use a standard two-flop synchronizer, clocked by `clkB`, to safely transfer the now-stable signal from the new register into the destination domain.

This ensures that only a stable, registered signal attempts to cross the clock domain, which is the foundation of robust CDC design.

Question 8: Observing Multiple Clocks on a Single Pin

Imagine a design has hundreds of internal clock signals. How would you design a circuit to observe any one of these clocks on a single physical output pin for debugging?

This is a common requirement for silicon debug. The most straightforward solution is to use a large multiplexer (MUX).

Architecture:

  • MUX Inputs: All the internal clock signals that need to be observed are connected to the data inputs of the MUX.
  • MUX Select Lines: The select lines of the MUX are driven by a control register. This register can be written to via a standard interface like JTAG, APB, or a simple test interface.
  • MUX Output: The single output of the MUX is routed directly to the physical debug pin.

To observe a specific clock, a test engineer would write the corresponding index value into the control register. This selects the desired clock and routes it to the output pin. It is crucial that this MUX is designed to be glitch-free to avoid creating invalid clock signals at the output.

Question 9: Verilog Lint and Synthesis Issues

Find the linting and synthesis issues in the following Verilog code.


module basic(b, a, en);
  output reg b[4:0];
  input [1:0] a;
  input en;
  always @(a, en) begin
    assign b = 4'b0;
    case ({en, a}) begin
      3'b1_00: b = 3'b001;
      3'b1_01: b = 3'b010;
      3'b1_10: b = 3'b100;
      3'b1_11: b = 3'b101;
    endcase
  end
endmodule
            

This code has several significant issues:

  1. Illegal Syntax: The `assign b = …` statement is a continuous assignment and is illegal inside a procedural block like `always`. To assign a value inside a combinational `always` block, a blocking assignment (`b = …;`) must be used.
  2. Incomplete `case` statement: The `case` statement only defines behavior when `en` is `1`. It does not specify what `b` should be when `en` is `0`. Without a `default` case or a default assignment for all conditions, the synthesizer will infer a latch to hold the value of `b` when `en` is `0`. Latches are generally undesirable in synchronous designs.
  3. Bit-width Mismatch: The output `b` is 5 bits wide, but the code assigns 3-bit and 4-bit values to it. This is a common linting warning.

Corrected Code:

A better way to write this to avoid latches is to provide a default value at the beginning of the `always` block and use `always_comb` for clarity.


module basic (
    output reg [4:0] b,
    input      [1:0] a,
    input            en
);
    // Use always_comb for combinational logic.
    // It automatically includes all inputs in the sensitivity list.
    always_comb begin
        b = 5'b00000; // Default assignment to prevent latches
        if (en) begin
            case (a)
                2'b00: b = 5'b00001;
                2'b01: b = 5'b00010;
                2'b10: b = 5'b00100;
                2'b11: b = 5'b01000;
                default: b = 5'b00000; // Good practice
            endcase
        end
    end
endmodule
            

Question 10: Clock Domain of an Async FIFO

In which clock domain does an asynchronous FIFO’s synchronizer logic reside?

This is a trick question. An asynchronous FIFO, by its very nature, bridges two different clock domains. It does not exist in a single domain.

  • The write logic (write pointer, full flag generation) operates entirely in the write clock domain.
  • The read logic (read pointer, empty flag generation) operates entirely in the read clock domain.

The “synchronizer logic” is the crucial part that connects them. Specifically:

  • The binary write pointer is converted to Gray code and then synchronized from the write domain to the read domain to be used in the empty flag logic.
  • The binary read pointer is converted to Gray code and then synchronized from the read domain to the write domain to be used in the full flag logic.

Therefore, the synchronizer logic doesn’t reside in one domain; it is the very mechanism that crosses from one to the other.