Victor edited this page Jan 18, 2023 · 22 revisions

Clock speed

  • NTSC: 23.01136 MHz
  • PAL: 22.801467 MHz

ABI

The SuperH ABI passes the first four parameters in r4 to r7 and any others on the stack. Registers r8 to r14, pr, macl and mach MUST be preserved if you use them, normally by pushing them on the stack (r15 is the stack pointer). r0 to r7 may be changed freely without saving. A result of 4 bytes or less is returned in r0; a result of up to 8 bytes is returned in r0 and r1.
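As a rough illustration of how this convention maps onto C, assuming a compiler that follows the ABI as described above (the function names are hypothetical and only serve the example):

```c
#include <stdint.h>

/* a..d arrive in r4..r7. e is the fifth parameter, so the caller
   passes it on the stack. The 4-byte result goes back in r0. */
int32_t sum5(int32_t a, int32_t b, int32_t c, int32_t d, int32_t e)
{
    return a + b + c + d + e;
}

/* An 8-byte result does not fit in r0 alone, so it is returned
   in the r0/r1 pair. */
int64_t mul_wide(int32_t a, int32_t b)
{
    return (int64_t)a * (int64_t)b;
}
```

Keeping hot functions to four or fewer parameters avoids the extra stack traffic for the fifth and later arguments.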

The SH-2 and 32X hardware is all big-endian.

Clock cycles

The SH2 has a five-stage pipeline, so each instruction takes five cycles to complete (with a few exceptions). However, a new instruction can enter the pipe on every cycle, so after the first instruction's five cycles, each further instruction completes one cycle after the previous one, for an effective cost of one cycle per instruction. A conditional branch can flush the pipe, which costs four more cycles. See the pipeline section (section 7) of the Hitachi SH2 Programming Manual for details. In general, though, you can count most instructions as one cycle long, as long as the code is cached and makes no outside memory fetches or stores.
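The rule of thumb above can be written down as a small estimator. This is only a simplified model under the stated assumptions (cached code, no outside memory access, no multi-cycle instructions); `pipeline_cycles` is a hypothetical helper, not something from the manual:

```c
#include <stdint.h>

enum { PIPELINE_STAGES = 5, FLUSH_PENALTY = 4 };

/* Estimated cycles for `instructions` back-to-back instructions,
   with `flushes` pipeline flushes caused by conditional branches. */
uint32_t pipeline_cycles(uint32_t instructions, uint32_t flushes)
{
    if (instructions == 0)
        return 0;
    /* The first instruction takes five cycles; each later one
       completes one cycle after the previous one. Every flush
       adds four cycles. */
    return PIPELINE_STAGES + (instructions - 1) + flushes * FLUSH_PENALTY;
}
```

For a straight run of 10 instructions this gives 14 cycles, and 22 cycles if two branches in that run flush the pipe.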

Memory access

The 32X hardware manual lists how many cycles it takes to read or write each block in the SH2 address map. For example, reading SDRAM takes 12 cycles because it is a burst read, while a write takes only 2 cycles because writes are not burst. A burst read fetches 8 words (one cache line) in 12 cycles, or 1.5 cycles per word on average, which is the fastest that non-cache memory can be read. However, reading a single uncached word still performs a burst read: 8 words are read in 12 cycles and the other 7 are thrown away. That makes reading an uncached word in SDRAM the slowest thing you can do on the SH-2s. Keeping the burst reads in mind is one of the keys to designing fast code for the 32X.
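To make the difference concrete, here is a minimal cost sketch using only the 12-cycle and 8-word figures quoted above. It counts SDRAM bus time only, assumes the data is not already in the cache and that a cached read starts on a line boundary; the function names are hypothetical:

```c
#include <stdint.h>

enum {
    SDRAM_BURST_WORDS  = 8,   /* one cache line */
    SDRAM_BURST_CYCLES = 12   /* one burst read */
};

/* Reading `words` consecutive words through the cache costs one
   burst per cache line touched. */
uint32_t sdram_cached_read_cycles(uint32_t words)
{
    uint32_t lines = (words + SDRAM_BURST_WORDS - 1) / SDRAM_BURST_WORDS;
    return lines * SDRAM_BURST_CYCLES;
}

/* Reading `words` words uncached, one word per access, costs a
   full burst for every single word. */
uint32_t sdram_uncached_read_cycles(uint32_t words)
{
    return words * SDRAM_BURST_CYCLES;
}
```

Under this model, 8 consecutive words cost 12 cycles through the cache but 96 cycles when read uncached one at a time.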

The division unit (DIVU)

The hardware division unit can work in parallel with the rest of the CPU.

When a read or write to the division unit is issued while it is operating, that read or write is extended until the operation ends. Instructions that do not access the division unit can therefore be processed in parallel with the division.

For 64-bit by 32-bit division, the quotient is accessible from two registers: DVDNT and DVDNTL.

The divider's state can't be saved and restored, so make sure that no function called from an interrupt handler uses the divider.
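A minimal sketch of overlapping a 32-bit division with other work. Several things here are assumptions to check against the SH7604 hardware manual: the DVSR (divisor) register name, its placement directly before DVDNT, the base address 0xFFFFFF00, signed operands, and the behaviour that writing DVDNT starts the division. The struct and function names are hypothetical. The register block is passed as a pointer so the code can also be exercised against an ordinary struct off-target:

```c
#include <stdint.h>

/* Assumed layout of the first two division unit registers. */
typedef struct {
    volatile int32_t dvsr;   /* divisor (assumed name) */
    volatile int32_t dvdnt;  /* dividend on write, quotient on read */
} divu_regs;

/* On hardware (assumed base address):
   #define DIVU ((divu_regs *)0xFFFFFF00) */

/* Load the operands. On hardware, the write to DVDNT is assumed
   to start the division. */
static inline void divu_start(divu_regs *divu, int32_t dividend, int32_t divisor)
{
    divu->dvsr  = divisor;
    divu->dvdnt = dividend;
}

/* Fetch the quotient. On hardware this read is extended until the
   division has finished, so no polling loop is needed. */
static inline int32_t divu_quotient(divu_regs *divu)
{
    return divu->dvdnt;
}
```

The intended pattern is to call `divu_start`, run instructions that do not touch the division unit, and only then call `divu_quotient`.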

DMA

Each SH2 processor has two Direct Memory Access Controller (DMAC) channels. For each channel you set a source from which to fetch data, a destination to store the data to, a count of how much data to transfer, and a control register. The control register tells the channel whether to increment, decrement, or leave unchanged the source and destination addresses, how big the transferred data units are (byte/word/long/16 bytes), and whether to generate an interrupt when the transfer is done. It also reports whether the transfer is done and whether there was an error.
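As a sketch of what the control register setup might look like, the helper below packs the options described above into one value. The bit positions (destination mode at bits 15-14, source mode at 13-12, transfer size at 11-10, interrupt enable at bit 2, transfer-end flag at bit 1, enable at bit 0) are my recollection of the SH7604 channel control register and should be verified against the hardware manual; all names are hypothetical:

```c
#include <stdint.h>

enum dma_addr_mode { DMA_FIXED = 0, DMA_INCREMENT = 1, DMA_DECREMENT = 2 };
enum dma_unit      { DMA_BYTE = 0, DMA_WORD = 1, DMA_LONG = 2, DMA_16BYTE = 3 };

/* Assumed bit positions, see the note above. */
#define CHCR_DM_SHIFT 14          /* destination address mode */
#define CHCR_SM_SHIFT 12          /* source address mode */
#define CHCR_TS_SHIFT 10          /* transfer unit size */
#define CHCR_IE       (1u << 2)   /* interrupt when done */
#define CHCR_TE       (1u << 1)   /* transfer has ended */
#define CHCR_DE       (1u << 0)   /* channel enabled */

/* Build a control value with the channel enabled. */
uint32_t chcr_build(enum dma_addr_mode dst, enum dma_addr_mode src,
                    enum dma_unit unit, int irq_when_done)
{
    uint32_t value = ((uint32_t)dst  << CHCR_DM_SHIFT)
                   | ((uint32_t)src  << CHCR_SM_SHIFT)
                   | ((uint32_t)unit << CHCR_TS_SHIFT)
                   | CHCR_DE;
    if (irq_when_done)
        value |= CHCR_IE;
    return value;
}

/* Non-zero once the transfer-end flag is set in a control value. */
int chcr_done(uint32_t chcr)
{
    return (chcr & CHCR_TE) != 0;
}
```

Real transfers also need the source, destination and count registers set, plus whatever request-source bits the transfer requires; those are left out here.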

DMA in 16-byte mode

Note that the DMA in the SH2 can use SDRAM burst reads when it is put in 16-byte mode. For the best DMA speed, place the source data on a 16-byte boundary and use the 16-byte transfer size.

For a 16-byte transfer, the address is incremented by +16 regardless of the SM1 and SM0 values.
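One way to get the alignment described above is to let the compiler place the buffer, then check it before starting a transfer. This sketch assumes a C11 compiler; the buffer and helper names are hypothetical:

```c
#include <stdint.h>
#include <stdalign.h>

/* Source buffer placed on a 16-byte boundary by the compiler. */
static alignas(16) uint8_t dma_source[256];

/* Non-zero if p sits on a 16-byte boundary. */
static inline int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Number of 16-byte units needed to cover `bytes` bytes. */
static inline uint32_t units16(uint32_t bytes)
{
    return (bytes + 15u) / 16u;
}
```

Padding the data to a multiple of 16 bytes keeps the last unit from reading past the end of the buffer.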

CPU cache bus width

The internal cache bus width isn't specified directly, but two things let you assume that it either IS 32 bits or is fast enough not to matter. First, the HW manual says it takes one cycle to fetch the data for the CPU regardless of the size requested. Second, it says the cache data bus uses four longwords to fill the cache, AND that the cache data bus is what the CPU reads to get the data. Therefore the cache data width is, in effect, 32 bits.

Access Timing of each CPU to 32X Block

The timing sequence in which a CPU accesses a peripheral is called a bus cycle. It takes a minimum of 4 clocks on the 68000 and 2 clocks on the SH2*. In addition, the CPU has to wait because of the difference between the peripheral's speed and its own operating speed. "1 wait" means that the access takes the minimum bus cycle + 1 clock. Every 32X block requires waits (as shown below) when accessed from the 68000 or the SH2, depending on the operation and on the block's status.

* Besides taking a Wait signal from the outside, the SH2 can insert waits by setting its built-in bus state controller, but after the boot ROM has run only the external Wait is set.
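The relationship between wait counts and total access time is simple enough to put in a helper. This is a sketch using only the minimum bus cycle lengths quoted above; the names are hypothetical:

```c
enum bus_cpu { CPU_68000, CPU_SH2 };

/* Total clocks for one access: the CPU's minimum bus cycle
   (4 clocks on the 68000, 2 on the SH2) plus the wait count. */
unsigned bus_access_clocks(enum bus_cpu cpu, unsigned waits)
{
    unsigned min_cycle = (cpu == CPU_68000) ? 4u : 2u;
    return min_cycle + waits;
}
```

For example, an SH2 access with 5 waits takes 7 clocks, and a 68000 access with no waits takes 4.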

32X Mode and Cartridge ROM

CPU               min wait  max wait
SH2 (Read/Write)  6         15
68K (Read/Write)  0         5

Frame Buffer

CPU          min wait  max wait
SH2 (Read)   5         12
SH2 (Write)  1         3
68K (Read)   2         4
68K (Write)  0         0

The SH2 write timing for the frame buffer assumes continuous access with no idle cycles. When idle cycles are inserted between accesses, the next access time is shortened by the number of idle cycles inserted (but it cannot be shorter than the minimum cycle of 3 clocks).

Frame buffer writes go through a 4-word FIFO. A write therefore takes 5 clocks if the FIFO is full and 3 clocks if it is not.
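Read together, the two paragraphs above suggest a simple model for the cost of one SH2 frame buffer write. This is one interpretation of the text, not a verified timing formula, and `fb_write_clocks` is a hypothetical helper:

```c
enum { FB_WRITE_MIN_CLOCKS = 3, FB_WRITE_FULL_CLOCKS = 5 };

/* Clocks for one SH2 frame buffer write. A full FIFO costs 5 clocks
   and a non-full FIFO costs 3. Idle cycles since the previous write
   are subtracted, but the result never drops below 3 clocks. */
unsigned fb_write_clocks(int fifo_full, unsigned idle_cycles)
{
    unsigned clocks = fifo_full ? FB_WRITE_FULL_CLOCKS : FB_WRITE_MIN_CLOCKS;
    if (idle_cycles >= clocks - FB_WRITE_MIN_CLOCKS)
        return FB_WRITE_MIN_CLOCKS;
    return clocks - idle_cycles;
}
```

In this model, two or more idle cycles between writes are enough to hide the full-FIFO penalty.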

Palette

CPU               wait (min ~ max)
SH2 (Read/Write)  5 ~ 64 μsec
68K (Read)        2 ~ 64 μsec
68K (Write)       3 ~ 64 μsec

A wait of 64 μsec means that the CPU has to wait for the time it takes to display one line. (If the CPU and the VDP compete for access to the palette, the CPU side waits for one line.)
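To put the 64 μsec worst case in terms of SH2 clocks, it can be converted with the clock speeds listed at the top of this page. A small sketch follows; the macro and function names are hypothetical:

```c
#include <stdint.h>

#define SH2_CLOCK_NTSC_HZ 23011360u  /* 23.01136 MHz */
#define SH2_CLOCK_PAL_HZ  22801467u  /* 22.801467 MHz */

/* Whole clocks that elapse in `usec` microseconds, rounded down.
   The product is computed in 64 bits to avoid overflow. */
uint32_t usec_to_clocks(uint32_t clock_hz, uint32_t usec)
{
    return (uint32_t)(((uint64_t)clock_hz * usec) / 1000000u);
}
```

A one-line palette wait therefore costs roughly 1472 SH2 clocks on NTSC and 1459 on PAL, which is why palette writes are best kept away from times when the VDP is reading the palette.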

VDP Register

CPU               wait (const)
SH2 (Read/Write)  5
68K (Read)        2
68K (Write)       0

System Register

CPU               wait (const)
SH2 (Read/Write)  1
68K (Read/Write)  0

Boot ROM

CPU         wait (const)
SH2 (Read)  1

SDRAM Access Time

The 32X SDRAM is specialized for line replacement on an SH2 cache miss: reads are transferred in 8-word burst mode*, while writes are transferred in 1-word single mode. The access time is fixed at the following values:

Op     time
Read   12 Clock / 8 Words
Write  2 Clock / 1 Word

* An 8-word burst read fetches data in batches of 8 words, starting from the specified word address. Because 8 words correspond to a single cache line, this matches the line replacement that happens on a cache miss. But when the SDRAM is read using cache-through, the SH2's SDRAM access is still fixed at an 8-word burst read even if only a single word is wanted, and it takes the full burst time.
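For reference, a cache-through access is normally selected by the address used. The sketch below assumes the SH7604 convention that setting address bit 29 (0x20000000) selects the cache-through area, and that 32X SDRAM sits at 0x06000000; neither value comes from this page, so check both against the hardware manuals. The names are hypothetical:

```c
#include <stdint.h>

/* Assumed: OR-ing this into a cached address gives the
   cache-through mirror of the same location. */
#define SH2_CACHE_THROUGH 0x20000000u

/* Cache-through alias of a cached address. */
static inline uint32_t cache_through_addr(uint32_t cached_addr)
{
    return cached_addr | SH2_CACHE_THROUGH;
}

/* Inverse: cached alias of a cache-through address. */
static inline uint32_t cached_addr(uint32_t through_addr)
{
    return through_addr & ~SH2_CACHE_THROUGH;
}
```

Since each cache-through read of SDRAM still pays for a full burst, this alias is best reserved for data that must bypass the cache, such as values shared between the two SH2s.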

Internal I/O Register Access Cycles

32X Technical Bulletin #32 - SH2 Internal IO Register Access Cycles - [1994-12-08]

Module Name       Minimum Number of Cycles
BSC               3
DMAC              3
DIV               3
UBC               3
INTC              4
MDC (CCR, SBYCR)  4
FRT               11
WDT               11
SCI               11

Access to the internal I/O is done in the following sequence:

  1. A wait occurs if the bus is determined to be busy 1 cycle after the internal I/O access begins.
  2. Internal I/O access occurs after the bus master completes the use of the bus.
  3. After access to the internal I/O is completed, bus access is enabled for the other bus master on hold.

Therefore, the access time to the internal I/O = Wait time + minimum number of cycles
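The formula and the table of minimum cycles above can be combined into a small lookup. This is a sketch; the enum and function names are hypothetical, and the wait time has to be supplied by the caller because it depends on what the other bus master is doing:

```c
/* Internal I/O modules, in the order of the table above. */
enum io_module {
    IO_BSC, IO_DMAC, IO_DIV, IO_UBC, IO_INTC,
    IO_MDC, IO_FRT, IO_WDT, IO_SCI, IO_MODULE_COUNT
};

/* Minimum number of cycles per module, from the table above. */
static const unsigned char io_min_cycles[IO_MODULE_COUNT] = {
    [IO_BSC] = 3,  [IO_DMAC] = 3, [IO_DIV] = 3,
    [IO_UBC] = 3,  [IO_INTC] = 4, [IO_MDC] = 4,
    [IO_FRT] = 11, [IO_WDT] = 11, [IO_SCI] = 11
};

/* Access time = wait time + minimum number of cycles. */
unsigned io_access_cycles(enum io_module module, unsigned wait_cycles)
{
    return wait_cycles + io_min_cycles[module];
}
```

Even with no bus contention, a read of the FRT, WDT or SCI registers costs 11 cycles, so polling them in a tight loop is expensive.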

Bus Masters

DMA via DMAC

When cycle stealing, the bus is released for each access. During burst transfers, the bus is released after 1 burst is completed.

Bus Request

For example, while the slave side holds the bus right, the master side's internal I/O access waits until the slave side releases it.
