Some years ago, on our z800 processor, we measured the performance of (in-place) TR against a software-coded loop. We found that the loop was faster than TR for strings shorter than nine bytes. When we spoke to IBM about this, we learned that TR had been partially moved into millicode for the z900/z800, so it ran slower for short strings because of the millicode start/stop (aka “subroutine linkage”) costs.
For strings longer than nine bytes, TR was faster because it had access to a hardware facility that could translate two bytes per cycle. The code fragments we compared were:
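|* Case 1: byte-at-a-time translate loop (in place, 9 bytes)
|CASE1    DS    0H
|         LA    R2,9              R2 = byte count
|         LA    R3,DATA           R3 -> string to translate
|         XR    R4,R4             clear R4; IC sets only the low byte
|CASE1L1  DS    0H
|         IC    R4,0(,R3)         fetch source byte
|         IC    R4,EBCDIC(R4)     look up its translation
|         STC   R4,0(,R3)         store it back in place
|         AHI   R3,1              advance to next byte
|         JCT   R2,CASE1L1        loop until count reaches zero
|CASE1L   EQU   *-CASE1           length of case 1 code
|* Case 2: single TR instruction
|CASE2    DS    0H
|         TR    DATA(9),EBCDIC    translate 9 bytes in place
|CASE2L   EQU   *-CASE2           length of case 2 code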
We later “unrolled” the loop, interleaving the use of three different registers, and found it was now faster than TR for strings of 24 bytes or fewer!
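|* Case 1, unrolled: three bytes per pass, one register per byte
|Stride   EQU   3
|CASE1    DS    0H
|         LA    R0,9/Stride       R0 = number of passes (9 bytes / 3)
|         LA    R3,DATA           R3 -> string to translate
|         XR    R4,R4             clear the three work registers;
|         XR    R5,R5               IC sets only the low byte
|         XR    R6,R6
|CASE1L1  DS    0H
|         IC    R4,0(,R3)         fetch three source bytes
|         IC    R5,1(,R3)
|         IC    R6,2(,R3)
|         IC    R4,EBCDIC(R4)     look up their translations
|         IC    R5,EBCDIC(R5)
|         IC    R6,EBCDIC(R6)
|         STC   R4,0(,R3)         store them back in place
|         STC   R5,1(,R3)
|         STC   R6,2(,R3)
|         AHI   R3,Stride         advance past the three bytes
|         JCT   R0,CASE1L1        loop until pass count reaches zero
|CASE1L   EQU   *-CASE1           length of case 1 code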
The results of the above experiments suggest that your loop has an excellent chance of being faster than *any* sequence involving TR or TRE, for strings shorter than some number of bytes ‘n’, on any given hardware generation supporting z/Architecture.
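If you want to exploit that crossover, one option is to choose the path at run time. What follows is only a sketch and not something we measured. NMAX is a placeholder for whatever crossover you measure on your own machine, and the labels XLATE, XLOOP, XTR, XDONE and TRINST are made up for illustration. It assumes R2 holds the length (0 to 256), R3 points at the data, and EBCDIC is the same addressable 256-byte table used above.
|NMAX     EQU   8                 placeholder crossover; measure your own
|XLATE    DS    0H
|         LTR   R2,R2             anything to do?
|         JNP   XDONE             no: zero or negative length
|         CHI   R2,NMAX           short enough for the loop?
|         JH    XTR               no: let TR handle it
|         XR    R4,R4             clear R4; IC sets only the low byte
|XLOOP    DS    0H
|         IC    R4,0(,R3)         fetch source byte
|         IC    R4,EBCDIC(R4)     look up its translation
|         STC   R4,0(,R3)         store it back in place
|         AHI   R3,1              advance to next byte
|         JCT   R2,XLOOP          loop until count reaches zero
|         J     XDONE
|XTR      DS    0H
|         BCTR  R2,0              EX wants length minus one
|         EX    R2,TRINST         TR with length supplied from R2
|XDONE    DS    0H
|* Target of the EX, kept out of the instruction stream
|TRINST   TR    0(0,R3),EBCDIC
Note that both paths destroy R2, the loop path also advances R3 and uses R4, and the extra compare-and-branch adds its own cost, so the crossover should be re-measured with the dispatch in place.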


