8301 (ZX8301, the QL's Master Chip MC) - facts and figures

Nasta
Gold Card
Posts: 443
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: 8301

Post by Nasta »

Peter wrote:
Nasta wrote:DB0..DB7 - input - RAM (and in most cases 8302) data bus, connects to the CPU data bus using a 74LS245 bus buffer.
In case of a "modern" FPGA 8301 replacement with sufficient RAM, I would make DB0..DB7 bidirectional, not input.
Keeping /RAS, /CAS0, /CAS1 and /WE inactive should be sufficient to keep the QL DRAMs at lowest possible power and off the data lines (except pin capacities).
Quotes 4164 data sheet:
"The data output is in the high−impedance (floating) state until CAS is brought low."
"RAS is similar to a chip enable in that it activates the sense amplifiers as well as the row decoder."
Also have a look at the maximum period required for /RAS. It does have a maximum (long as it is), so it cannot be completely static. But I am sure that at that rate the current consumption would be FAR lower.
/TXOE would only control data buffer direction between CPU and 8301 replacement.
DA0..DA7 would only be used by the FPGA to read the CPU address bus, selecting upper or lower part by /ROW.
This suggests that you would implement the entire 128k inside the FPGA? Otherwise a part of the original RAM or some other RAM would have to be used. I would strongly vote for 'other RAM' as I'm guessing a FPGA with 128k block SRAM would be a huge overkill when it comes to the amount of logic it contains which would basically remain unused. Actually, you could probably implement a 68008 in there too :P
Given the FPGA would come with lots of pins and the whole replacement would have to be a small PCB I vote for a smaller FPGA that can contain SCR0 and 1, and a 128k SRAM on the side to implement the actual 'CPU' RAM, and shadow the video RAM inside the FPGA. This way DB lines could still be kept input only and perhaps even simplify and streamline the logic as one could buffer the write to the screen RAM inside the FPGA and not worry about data being read, since that's provided from the SRAM chip.
Biggest question for me would be if the QL monitor signal pins should be implemented at all, or only a separate VGA signal path.
Perhaps only through a discrete buffer and that should be disabled for VGA compatible timing. This way it would still be possible to remain faithful to the original hardware for users who insist on the original monitor(s). Anything faster would I think need a separate output anyway. VSYNCH and /CSYNCH need to be provided at close to standard TV intervals for SGC compatibility (HSYNCH derived from VSYNCH and /CSYNCH is used to trigger DRAM refresh on the SGC).
Second biggest question whether to use FPGA internal PCI clamp diodes plus series resistors, or level shifters.
Given that most output signals would either remain strapped to a constant level or can be 3V3, there are not that many that need fast level shifting - something like a quickswitch also provides 3V3 clamping and could turn out to be feasible, eg. for DA0..7 and DB0..7. Or even just series resistors and internal PCI clamps with the help of a capacitor or several to filter stuff where it's critical, especially for CPU signals that did not come through a TTL multiplexer and bus buffer. For the latter the added capacitances of the RAM chip pins might even be of some help.


Nasta

Re: 8301

Post by Nasta »

So, to finish off, here is a short recap of some 'quirks' of the 8301, which could probably have been 'differently engineered'.
I should also put in a warning - all of this is (as 'educated' as I can make it) largely conjecture from the available data. Perhaps more could be known by de-capping the chip, though a better idea would be first figuring out the difference between the various versions (2 CLAxxxx and one ZX8301 that I know of).
Some ideas are based on knowledge of PLD hardware but in truth, gate arrays (which were the first custom logic available) have some fairly crucial differences. In particular, routing of signals is a point where huge differences can accrue - migrating from a single layer metal to a double layer metal process can mean the difference between a circuit fitting or not fitting onto the same given 'sea of gates' type chip. However, in most cases I think the points below have merit.

1) The issue of overscan and 7.5 vs 8MHz CPU clock.

As was amply explained, almost the entire operation of the QL RAM to CPU interface is based on how the 8301 accesses screen data.
So, we know that it uses a master 15MHz clock to derive all timings, and that 'time' in CPU sense is divided into 64us intervals, which are further divided into 40x 12-clock chunks. These are further divided into thirds (which are 4 clocks long each). During each 64us interval, 32 12-clock chunks have only one 4-clock period dedicated to CPU data access, while 8 have all 3 4-clock periods dedicated to CPU data access. When you crunch the numbers, you get a dot clock (I will be using MODE4 as reference) running at 15MHz/1.5, so 10MHz and the 32 chunks where screen data is accessed add up to exactly 51.2us of pixels, i.e. 512 pixels per line.
This is even continued during unused lines with the sole difference no actual data is being read to generate the pixels, though the actual RAM access does take place as a refresh cycle.
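For anyone who wants to re-check the figures above, here is a minimal Python sketch. The names are mine and purely illustrative, and it assumes (as the numbers imply) that the 12-clock chunks are counted in 7.5MHz CPU clocks, i.e. the 15MHz master clock divided by 2.

```python
from fractions import Fraction

MASTER_HZ = 15_000_000                  # 8301 master clock
CPU_HZ = MASTER_HZ // 2                 # 7.5 MHz CPU clock (assumed master / 2)
DOT_HZ = Fraction(MASTER_HZ) * 2 / 3    # master / 1.5 = 10 MHz MODE4 dot clock

CHUNK_CLOCKS = 12                       # CPU clocks per timing chunk
CHUNKS_PER_LINE = 40                    # chunks in one scan line
DISPLAY_CHUNKS = 32                     # chunks in which screen data is fetched

# Durations in microseconds, kept exact with Fraction
line_us = Fraction(CHUNKS_PER_LINE * CHUNK_CLOCKS, CPU_HZ) * 1_000_000
active_us = Fraction(DISPLAY_CHUNKS * CHUNK_CLOCKS, CPU_HZ) * 1_000_000

# Pixels shifted out during the active part of the line
pixels = active_us * DOT_HZ / 1_000_000

print(float(line_us), float(active_us), int(pixels))
```

With these assumptions the line comes out at 64us, the active part at 51.2us and the pixel count at 512, matching the description above.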
So, what would we need to do to get our 512 pixels into a 48us period, which is what is defined as 'standard', i.e. no overscan?
Obviously, the clock would have to run faster as more 'chunks' where pixel data is read need to be fitted into a shorter time. Once the math is done, you end up with exactly 8MHz for the CPU clock, which means a 16MHz main clock and still a /1.5 pixel clock which comes to 10.6666MHz.
However, the line interval, 64us, needs to remain unchanged and this is a bit of a problem. Given that the whole logic is based on a 12-clock timing chunk, and there are now 512 clocks per one 64us line, it's rather obvious that 512 is not wholly divisible by 12: we get 42.66666 12-clock chunks.
So we can either extend this to 43, which slightly lengthens the 64us standard line period to 64.5us, which is still just in tolerance for a TV.
We can also shorten it to 42, which shortens the interval to 63us, which might even be more compatible because US NTSC uses a faster line rate and most TVs are more tolerant of faster rather than slower rates.
Or, we can use different logic that makes it possible to have a 4-clock based timing for the period where no pixels are displayed during a scan line - and this would be the most complex change to the logic, as it has a 'special case'. The two former ones involve different logic to detect the number of the chunk where the line period should end, which is a LOT simpler. While it needs very slightly more complex logic for this, the most complex part, i.e. the various counters that generate the timing and the RAM timing state machine remain the same.
Along with a more compatible picture, it comes with a speed increase, which is slightly over that of just increasing the clock to the CPU. The reason is that now we have more timing chunks per 64us line, but the number of them used to access screen data has remained the same, so more are available for CPU access to RAM.
Before there were 40 chunks of 3 4-clock periods, giving us 120 total 4-clock slots, 56 of which could be used by the CPU. This is a 46.66% ratio.
Now we would have 42 chunks (let's take the worst case implementation) of 3 4-clock periods, giving us 126 total 4-clock slots, 62 of which could be used by the CPU. This is a 49.21% ratio, but the CPU clock is also increased by 6.66% so the total increase over the current scheme is about 12.5%.
There is however also a drawback. The CPU clock is used to derive some timing, notably the serial port baud rate and, possibly (which would be more problematic), the microdrive data rate timing in the 8302. So, as easy as this mod would have been for the 8301, it would imply a lot more problems for the 8302 logic.
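The slot counting for the 8MHz variants can be sketched in Python as below. This is only an illustration of the arithmetic in this post; the function names are made up, and it assumes that every chunk outside the 32 display chunks is fully available to the CPU.

```python
def cpu_share(chunks_per_line, display_chunks=32):
    """Return (cpu_slots, total_slots) per scan line, in 4-clock slots."""
    total = chunks_per_line * 3
    # display chunks leave 1 slot for the CPU, the remaining chunks leave all 3
    cpu = display_chunks * 1 + (chunks_per_line - display_chunks) * 3
    return cpu, total


def slots_per_second(chunks_per_line, cpu_hz):
    """CPU access slots per second for a given line length and CPU clock."""
    cpu, _ = cpu_share(chunks_per_line)
    line_seconds = chunks_per_line * 12 / cpu_hz
    return cpu / line_seconds


stock = slots_per_second(40, 7_500_000)      # the 8301 as it is

for chunks in (42, 43):                      # the two simple 8 MHz options
    cpu, total = cpu_share(chunks)
    line_us = chunks * 12 / 8_000_000 * 1e6
    gain = slots_per_second(chunks, 8_000_000) / stock - 1
    print(f"{chunks} chunks: {line_us:.1f} us line, "
          f"{cpu}/{total} slots = {cpu/total:.2%}, {gain:+.1%} vs stock")
```

For the 42-chunk case this reproduces the 62 of 126 slots and the roughly 12.5% gain quoted above; the 43-chunk case can be read off the same way.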

2) What about the 56 unused display lines where the display data is still being 'fake read' and the CPU is being slowed down?

Well, this is a tougher one to crack. The question here is, do we want more CPU speed or perhaps, if we really need to sacrifice memory bandwidth, how about not making the reads 'fake' but real ones, and get a higher vertical resolution?

Let's deal with the second idea first:
Given that there still need to be counter bits to count the lines all the way from 0 up to 311 total, no change is needed to that part of the logic. What would be needed is a different logic that determines when the invisible lines start, which is now trivial, as we have visible lines going from 0 to 255, and need 9 bits (0..8) to count to 311 which is past 255 that 8 bits would give us, we simply use bit 8 to tell us if the lines are visible or not, as well as what type of screen data read should be done (real or refresh).
The PAL standard defines the maximum number of visible lines as 288, but this only gives us 24 lines of retrace which means it's likely to be a problem on some TVs.
However, old QL users will remember there was an 'extended 4' QL emulator which indeed could do 768x288 and in fact Aurora also supports this resolution, as well as 512x288. What needs to be done is more complex logic than just looking at bit 8 of the line counter to determine what the invisible lines are and what the position of the vertical synch pulse needs to be to center the extra 'tall' display inside the monitor or TV screen, and, finally, bit 8 of the line counter must be passed as address bit A15 to the RAM address multiplexer in the 8301, instead of bit 7 of the MCR register. This would have meant that 36864 bytes would be used for display memory, reducing the amount of free RAM by exactly 4k.
It would also require a different approach to switching between screen area 0 and 1, as now it's not neatly fitted into a power of 2 number of bytes - if this feature was deemed important enough to begin with as a trade-off vs having more vertical resolution. The obvious and possibly even more convenient solution would be to move screen 1 to the beginning of the top 64k of RAM, as that only changes the logic used to decode that /CAS1 should be used for screen data reads by the 8301, rather than /CAS0.
On a regular 128k QL (and possibly even on a 640k, don't have the data to calculate the size of the slave block table for 640k RAM) it would fall into the free RAM and could simply be reserved by job 0 using some simple tricks, and used for various things including games. The one game I am aware that used it had to completely use stand-alone code as using screen 1 meant not having any system variables in the usual place, and before Minerva, that meant not having an OS once the game is loaded. Not a huge problem, but if one wanted to use two screen areas WITH the OS, this alternative solution would have probably been better.
Other than some slightly more complex logic and 4k less of free RAM this would have no other serious repercussions to the bare QL as we know it - even if the second screen was not implemented at all as a result, and it could be said that the response would have been positive - it would certainly put the QL at the top of usable graphics resolution at the time.
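The memory cost of the 288-line idea is easy to check with a short Python sketch. It assumes the linear layout used throughout this thread (128 bytes per line, fetched 4 bytes at a time) and works with offsets from the start of display memory; `screen_offset` is just an illustrative helper name.

```python
BYTES_PER_LINE = 128


def screen_offset(line, longword):
    """Byte offset of one 4-byte fetch within display memory."""
    # A2..A6 select the long word, A7 and up carry the line number, so with
    # a 9-bit line counter bit 8 of the line number lands on A15.
    return (line << 7) | (longword << 2)


standard = 256 * BYTES_PER_LINE
extended = 288 * BYTES_PER_LINE

print(standard, extended, extended - standard)
print(hex(screen_offset(287, 31) + 4))            # first byte past the last line
print((screen_offset(256, 0) >> 15) & 1)          # A15 for the first extra line
```

The extra 32 lines cost 4096 bytes, and the first extra line is exactly where A15 becomes 1, which is why that bit can no longer come from the MCR.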

So, back to 'wasting' cycles during the invisible display lines.
Normally there are 56 of them, and some of what will follow has been written about in the previous posts.
If we go back and look at the way address bits are presented as row and column portions to the RAM, and given that the row address counting through all 256 rows is important to ensure proper RAM refresh, we can see the following pattern of address bits:

Row address 7..0 = {A11, A10, A9, A8, A7, A4, A3, A2}.

We know that these are generated from the line and chunk counters that are used to generate the display timing, and that they go sequentially through the 32k screen RAM, 4 bytes at a time. Given that there are 128 bytes per line, or 32 long words, address bits A6 down to A2 form a 5-bit counter that counts the long word to be accessed, from 0 to 31, starting with the lefthand side of the screen. A6 carries over into A7 and A7 to A14 then count the visible line number. We have already established that we need to go through a certain number of addresses within ~2ms which is what determines that A11 should be the top address bit of the row part of the RAM address.
However, we see that the row address bits do not continue down from A11 to A4 but rather A6 and A5 are 'skipped', or invisible. This means that when we look at the row address as a whole, A2..A4 count from 0 to 7 (000 to 111 binary) and then repeat this 4 times before A7..A11 change and count up by one, because A5 and A6 have to go through 00 to 11 (4 states total) to carry into A7, but we do not see this since A5 and A6 are not part of the row address.
What happens is that a sequence of row addresses XXXXX000, XXXXX001, etc, to XXXXX111 where XXXXX counts up for every new display line, appears 4 times within each line. This means that when the RAM is not actually read, but only refreshed during the inactive display lines, it is actually refreshed 4 times over, where once would have been enough.
This could have been used to free up bandwidth for the CPU to use.
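The repetition is easy to see if you model the row address in a few lines of Python. This is a sketch of the mapping as described above, not of the actual 8301 logic, and the function names are only for illustration.

```python
from collections import Counter

# Row address bits 7..0 as listed above: A11, A10, A9, A8, A7, A4, A3, A2
ROW_BITS = (11, 10, 9, 8, 7, 4, 3, 2)


def row_address(addr, bits=ROW_BITS):
    """Assemble the 8-bit DRAM row address from the selected address bits."""
    row = 0
    for b in bits:
        row = (row << 1) | ((addr >> b) & 1)
    return row


def rows_in_line(line, bits=ROW_BITS):
    """Count how often each row address shows up during one display line."""
    base = line * 128                      # 128 bytes per line
    return Counter(row_address(base + 4 * lw, bits) for lw in range(32))


counts = rows_in_line(0)
print(len(counts), set(counts.values()))   # distinct rows, hits per row
```

Each line touches only 8 distinct rows, and every one of them 4 times, which is the 4-fold refresh mentioned above.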
Let's take a step back.
We said there were 40 chunks total of 12 clocks each per line, 8 of which were completely free to be used by the CPU. If you take the above in consideration, the refresh pattern 'just happens to repeat' every 8 chunks. And there are 5 total of these 8-chunk blocks in a scan line, and 1 is always free for the CPU to use, no screen data is read or refreshed. So, the simplest way to free more of them during the 56 invisible lines would have been to actually generate only ONE 8-chunk block with 8301 accesses. And the most logical way would be to simply reverse the logic for the visible lines - when a line is visible, use 4 blocks for the 8301+CPU, 1 block for CPU only. When a line is invisible, use the previously CPU only block for the 8301+CPU in refresh mode, and the previously 8301+CPU blocks for the CPU only - which actually SIMPLIFIES the logic as it re-uses the already existing logic to determine the visible vs invisible part of every line.
What would be the result?
Well, as we said before, each line has 120 4-clock slots, 24+32 used for the CPU, the rest for 8301. Multiply this by 312 lines, so we have 37440 total slots, 17472 used for the CPU (56 per each of 312 lines), the well known 46.66% utilization figure.
In the modified version, we start with the same total of 37440 total slots, 256x56 used by the CPU during visible lines, and 56x104 used by the CPU during invisible lines. This is a total of 20160 slots used by the CPU, 53.85% utilization figure, a 15.4% improvement.
We could push this a little further by shortening the time the 8301 uses to access RAM in refresh mode to one 4-cycle slot rather than 2, because it is using the same timing as if it was accessing 4 bytes of consecutive data, when in fact it is not accessing data at all. One 4-cycle slot would have been enough for refresh.
This would raise the number of slots used by the CPU from 20160 to 20608, giving us a 55% utilization figure and a roughly 18% speed improvement - and this would be about the maximum one can have without seriously changing the hardware.
Things would not be that good if it had been decided to extend the vertical resolution to 288 lines.
Here are the figures: The simple version would get us 49.74% utilization, a mere 6.6% improvement over standard. The more complex version ekes out 1.1% more at 7.7% improvement over standard - but mind you, with 12.5% more pixels on the screen (32 more rows).
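All of the utilization figures in this section follow from the same slot count, so here is a small Python sketch that reproduces them. The function and parameter names are mine; `swapped_blocks` stands for the 'reverse the 8-chunk blocks on invisible lines' idea and `short_refresh` for the one-slot refresh.

```python
LINES = 312
SLOTS_PER_LINE = 120        # 40 chunks x 3 slots
VISIBLE_CPU = 56            # 32 shared chunks x 1 slot + 8 free chunks x 3 slots


def cpu_slots(visible_lines, swapped_blocks=False, short_refresh=False):
    """CPU slots per frame for a given number of visible lines."""
    invisible = LINES - visible_lines
    if not swapped_blocks:
        per_invisible = VISIBLE_CPU                  # stock 8301 behaviour
    else:
        refresh_slots = 1 if short_refresh else 2
        # 4 CPU-only blocks of 8 chunks, plus 1 block shared with refresh
        per_invisible = 4 * 8 * 3 + 8 * (3 - refresh_slots)
    return visible_lines * VISIBLE_CPU + invisible * per_invisible


total = LINES * SLOTS_PER_LINE
stock = cpu_slots(256)

for visible in (256, 288):
    for short in (False, True):
        n = cpu_slots(visible, swapped_blocks=True, short_refresh=short)
        print(visible, short, n, f"{n/total:.2%}", f"{n/stock - 1:+.1%}")
```

The four printed rows correspond to the simple and the 'one slot refresh' versions, first for 256 and then for 288 visible lines.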

* Interesting bug (quirk?) I forgot to mention before:
Astute readers will note that with the row addressing set up as outlined, the 8301 reads or refreshes all rows every 32 display lines. What is not obvious is that there is actually a bug in the refresh scheme because the total number of lines is 312 and this is not wholly divisible by 32 - it comes out as 9.75. This means that all rows get refreshed within 2.048ms which is just to spec, only 9 out of 10 times. The tenth time only 3/4 of the rows get refreshed, so the other 1/4, namely the top 16k of both 64k RAM banks does not get refreshed. Doing the math tells us that this part of the RAM gets refreshed normally 8 times, followed by one refresh every 3.584ms rather than the specified ~2ms. While this may, now that you know about it, seem alarming, in real circumstances DRAM can retain data FAR longer than the refresh spec - as some have noted when the QL is quickly powered off and back on while holding the reset button - often most pixels of the display such as it was at power off remain preserved on power-up. That being said, there was a simple way this could have been avoided, and that is with a different mapping of screen address bits to row and column address.
If the row address was made up as:
Row address 7..0 = {A9, A8, A7, A6, A5, A4, A3, A2} all the rows would be refreshed every 8 lines, and since 312/8=39, there would be no partial refresh.
The above logic to get a better RAM bandwidth utilization figure would be different, though, the simple reversing of 8-chunk blocks could not be used, but it would still not be that complicated.
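The partial refresh pass can also be modelled in Python. This sketch assumes the line counter keeps feeding the same address bits during the invisible lines (0 to 311, as described earlier); the helper names are hypothetical.

```python
def row_address(addr, bits):
    """Assemble the 8-bit row address from the given address bits (MSB first)."""
    row = 0
    for b in bits:
        row = (row << 1) | ((addr >> b) & 1)
    return row


def rows_hit(first_line, n_lines, bits):
    """Set of row addresses touched while scanning n_lines display lines."""
    rows = set()
    for line in range(first_line, first_line + n_lines):
        for lw in range(32):
            rows.add(row_address(line * 128 + lw * 4, bits))
    return rows


def worst_gap_ms(line_in_pass, lines_per_frame=312, pass_lines=32, line_us=64):
    """Longest time between refreshes of the rows hit on a given line of a pass."""
    hits = [l for l in range(lines_per_frame) if l % pass_lines == line_in_pass]
    hits.append(hits[0] + lines_per_frame)          # first hit of the next frame
    return max(b - a for a, b in zip(hits, hits[1:])) * line_us / 1000


ACTUAL = (11, 10, 9, 8, 7, 4, 3, 2)
ALTERNATIVE = (9, 8, 7, 6, 5, 4, 3, 2)

print(len(rows_hit(0, 32, ACTUAL)))                 # a full 32-line pass
print(len(rows_hit(288, 24, ACTUAL)))               # the short pass ending a frame
print(worst_gap_ms(0), worst_gap_ms(24))            # first vs last quarter of rows
print(len(rows_hit(0, 8, ALTERNATIVE)), 312 % 8)    # alternative mapping
```

The short pass at the end of the frame covers 192 of the 256 rows, the skipped quarter waits 3.584ms instead of 2.048ms, and the alternative mapping covers all 256 rows in 8 lines with no remainder.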

* Aside: The presence of the 'screen blanking' bit in the 8301 MCR register is curious to say the least as it has to my knowledge never been used by the OS. The obvious use would be a 'screen saver' though it would actually only display a black screen rather than switch off the video signals, which is perfectly fine given that the monitors, let alone TV sets had no notion of power saving and would not switch themselves into a low power (almost off) mode when there is no signal present for a while.
But there is a much more interesting and not obvious use, that has to do with a lot of what was discussed above, regarding memory bandwidth utilization.
What if the blanking bit switched off the 8301's display data reading mechanism since no data is needed to display a blank screen, with the consequence of freeing clock cycles for the CPU to access RAM during invisible (blank) display lines? Given that there still needs to be a refresh scheme in use, we can use the same idea but extend it to all lines when the blank bit is set. Under such circumstances we still have the aforementioned 37440 total potential CPU access slots, of which only 2496 would be used to refresh the RAM, giving us a 93.33% utilization figure for CPU-RAM bandwidth. Given a program and data in RAM, the QL would have been exactly 2x faster. While it's difficult to imagine a scenario where this could be exploited to execute some program faster, one interesting one does come to mind. Imagine you had some QL's networked to yours but with no screen attached or... simply, running some software remotely? It wouldn't be that bad to have them run rather faster than usual in that sort of configuration.
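The 'exactly 2x' claim can be checked with the same slot counting, sketched here in Python. It assumes one 8-chunk refresh block per line with a single refresh slot per chunk; the variable names are only for illustration.

```python
LINES, SLOTS_PER_LINE = 312, 120

total = LINES * SLOTS_PER_LINE      # all 4-clock slots in a frame
refresh = LINES * 8                 # one 8-chunk block per line, 1 slot per chunk
blanked_cpu = total - refresh       # everything else goes to the CPU
stock_cpu = LINES * 56              # CPU slots on the unmodified 8301

print(refresh, f"{blanked_cpu / total:.2%}", blanked_cpu / stock_cpu)
```

The ratio comes out at exactly 2 because 112 CPU slots per line is twice the stock 56.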

3) Why not 16 colors instead of flash?

The obvious answer would be, we need an extra pin for that. This would mean a different connector for the monitor as well. I'm not going to say anything about adding one more video related signal to the expansion connector J1, as IMHO having the original RGB there was not a very good idea to begin with (but then I would be putting my own foot in my mouth a bit as a long time ago I actually used these signals to drive some highly experimental hardware to get more colors and resolution out of the 8301... but quickly went to picking them off the 8301 socket instead).
In fact, there are several ways a pin could have been freed for other purposes on the 8301.
The most direct one would have been to change the way the 8302 chip enable signal /PCEN is decoded inside the 8301. Choosing address lines that are in the column address part of the RAM address for that was a dubious choice, because it requires a separate signal to control the address multiplexer in order to tell it to pass the required address bits to the DA lines of the 8301. Usually the /RAS signal is used to flip from the row to the column address for the RAM as the row address is required to be stable when /RAS is high and persist for at most a very short time (shorter than it takes the multiplexer to actually replace the row address with the column address) when /RAS goes low, after which the column address can be put on the DA lines. As it is, a separate signal is needed because /RAS must stay high not to start an access cycle in RAM while the 8302 is being accessed.
However, one could argue that /RAS could still have been used as the multiplexer select signal anyway, since there is no /CAS generated when the 8302 is accessed so the RAM would just internally do a refresh while data was actually being read or written from or to the 8302. While it does increase the RAM current consumption a bit, the 8302 is accessed so infrequently compared to anything else that the trade-off would have been well worth the extra pin to use for much more clever things.

Be that as it may, let's explore the possibility of having a way of getting more colors without having to use an extra signal or turning the RGB lines into analog signals - which is actually not possible inside the type of chip used to implement the 8301 ULA.
The clue that points the way is there when one loads a MODE 8 screen but forgets to actually change the screen mode to MODE 8 and leaves it in MODE 4. Surprisingly, the picture is very much recognizable, except for flashing bits if there even are any.
So, the trick would be to display 512 pixels in a line instead of 256 but interpret the 2 bits for each of the 512 pixels using different color components for every even and odd pixel. There are numerous ways this can be implemented.

* Quick aside: such a trick was used on the original Apple II video board, but on the level of luma/chroma component signals rather than RGB signals.

I am not certain that the following is not a particular quirk of the version of the 8301 I have explored, but there is a slight shift in the horizontal position of a MODE8 picture with respect to MODE4 which suggests that MODE8 is actually generated from mode 4 pixels internally to the 8301, rather than implementing a different way to shift bits of the 4-byte data buffer into pixels, depending on the mode selected. The way FLASH works is also an indicator as the flash bit latches the rest of the bits concurrent with itself in the same MODE8 pixel, which ideally means it would have to be present a short while before the others, or there would be a slight delay due to the extra latch - in either case it implies logic that converts two MODE4 pixels to one MODE8 pixel.
Without such logic, the trick I mentioned above to get a semblance of more than 8 colors using only 3 digital signals limits the usability of the display as it actually produces a stippled pixel, which does not look 'smooth' enough for every combination of colors, so we would get a rather odd selection even if extra logic was used to do more complex mapping of bits to RGB components depending on the state of said bits and not only on even or odd pixel.
Depending on how far one can go, the results can actually be quite useful.
One of the more creative ways to do this would be to 'abuse' the way the 10MHz mode 4 pixel clock is generated from the main 15MHz clock of the 8301. I mentioned a while back that 3 consecutive cycles of the 15MHz clock were used to generate 2 cycles of the 10MHz clock by putting the point where the 10MHz clock goes from an even to odd cycle in the middle one of every 3 cycles at 15MHz. When looking at the signals with a scope, one can see that the implementation is partially done with combinatorial logic as the periods of the even and odd 10MHz clock cycles are not exactly the same. A modification of the logic to deliberately produce different length even and odd pixels (such as 2 15MHz clock cycles for even and 1 cycle for odd) can be used along with some mapping logic (usually with one or two pixels total delay) to implement a 'pulse width modulation' scheme on the RGB lines and get a decent representation of 16 colors. Actually 16 out of 64 discrete combinations can be chosen.
Downside: more logic, though not overly complex, and the screen is a stippled one which, without some form of filtering (which would BTW interfere with regular MODE4, making it blurry!), would produce moire patterns on a color monitor and dot crawl on a TV.
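To see where the 64 combinations come from, here is a rough Python sketch of the 'pulse width modulation' idea. It assumes the example split given above (2 clocks for the even pixel, 1 for the odd one) and that each of the three RGB lines can be switched independently for the even and odd pixel; this is a model of the idea, not of any real 8301 logic.

```python
from itertools import product

EVEN_CLOCKS, ODD_CLOCKS = 2, 1      # 15 MHz clocks per even / odd pixel


def level(even_bit, odd_bit):
    """On-time of one RGB line over an even+odd pixel pair, in 15 MHz clocks."""
    return even_bit * EVEN_CLOCKS + odd_bit * ODD_CLOCKS


# Every on/off combination of the two pixels gives a distinct duty cycle
levels = sorted({level(e, o) for e, o in product((0, 1), repeat=2)})

# Three independent lines (R, G, B), each with those levels
colours = set(product(levels, repeat=3))

print(levels, len(colours))
```

That gives 4 levels per line and 64 combinations in total, of which a 4-bit pixel could only select 16, hence the need for some fixed mapping logic.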
While the 'top end' implementation of this logic indeed is capable of generating 64 'colors' given only a single bit for RGB, I am virtually certain there would have been no way to implement palette functionality inside the 8301.
How do we know this? Well, I am sure some of us wished for a 4 out of 8 colors MODE4, and this only requires 4x 3-bit 'memory' to store the 4 palette entries for MODE4, that's 12 bits of storage. If that did not fit, a 16x6 bit storage plus 64x6bit look-up table ('ROM') hardly could :(

All that being said, there is another aspect of MODE8 which is actually very annoying and may well have contributed to the QL's market failure because it makes graphics for games slower if you want to have it smooth at the same time - MODE8 has a very odd bit to pixel mapping. While it could be said that MODE4 mapping makes sense once one looks at the MOVEP assembler instruction (which later comes back to haunt us on 68040 and 68060...) and saves us a number of fairly small lookup tables in the already chock-full original ROMs, it does complicate things for games. A simple chunky 4 consecutive bits per pixel, 'packed nibbles' organization would have made things much easier and faster. One thing that it certainly would have helped is already having a usable 16-color mode for business applications when higher resolution hardware became possible.
One could also argue using the flash bit as a 'hold' (no flashing) instead could have been much more useful, as it would give the QL the ability to very quickly create filled polygons on screen (though not in many colors...). So, I'll leave this bit up for discussion...

4) Why not 4 screen areas given that there are already 2 banks of RAM, 64k each?

Well, this is a true mystery to me because out of all of the improvements or alternative ways of implementing stuff, this one would have been the easiest. It does require an extra control bit in the MCR register (that's very little logic) and a line from it to the already existing decode logic for /CAS0 and /CAS1. The latter might actually be the problem as this supplies an 'internal' bit for A16 to be used when the display data is read in order to refresh the screen. Routing is one of the BIG deciding factors on gate array utilization as it takes the same space that logic would normally take, made out of closely located gates connected by short 'wire' segments. Long wires (or busses) actually pass over un-committed gates that obviously can't then be used for logic, so this is a trade-off.
Granted, this is perhaps the least useful modification to be proposed, but then having two screen areas to begin with did not prove to be very useful.
One could argue, however, that since one is already using a given type of ULA and there is no added charge whatever the mask is that connects its gates into a usable circuit (production costs don't depend on the pattern), one might as well put in small features even if they end up never being used, as the only cost is a slight increase in engineer-hours used. Unfortunately, we all know that despite the huge delay, the QL was rushed to market. While I don't think that any one of the mods proposed here would have made a huge difference, things do add up and... who knows.
For instance, implementing all of the above would get us:
1) No horizontal overscan (though perhaps a more hazy picture in MODE4 on a TV), compatible with many monitors.
2) 63.7% memory speed utilization with the standard screen resolution, 59.95% with vertical resolution extended to 288 lines
3) 6.66% higher CPU speed, resulting in a 45.6% improvement in speed of execution from RAM for standard and 37% for extended resolution, over stock QL.
4) Obviously 32 more display lines (12.5% improvement) and at least a more game friendly 8 color (if not 16 color) display. Pretty much cutting edge graphics in that price bracket at the time (business oriented, not game).
5) Full speed 'dark screen' mode when needed (given that it is a multitasking system, there could always be tasks that do things in the background when the screen saver goes up!).
6) The ability to do double-buffered flicker-free displays without having to dislodge the OS.

Would that have been of interest? Well, I'll leave this to you all :)


Nasta

Re: 8301

Post by Nasta »

BTW I do very much wish I had a comparable collection of data on the 8302. It would be nice to document it comprehensively all in one place.


M68008
Trump Card
Posts: 223
Joined: Sat Jan 29, 2011 1:55 am

Re: 8301

Post by M68008 »

Something related to the ZX8301 that I found surprising is that writing to RAM on the QL seems to take one cycle longer than reading from RAM (according to the CPU user manual, both should be able to take as little as 4 cycles in the ideal case). Might be related to DS being asserted one cycle later than AS for writes. Not sure if there is a good reason for writes being slower, or it's a cost-saving measure to simplify the ZX8301 logic or some other issue.


mk79
QL Wafer Drive
Posts: 1349
Joined: Sun Feb 02, 2014 10:54 am
Location: Esslingen/Germany

Re: 8301

Post by mk79 »

Nasta wrote:BTW I do very much wish I had a comparable collection of data on the 8302. It would be nice to document it comprehensively all in one place.
I tried editing your information into one coherent technical document, but at 16+ densely written A4 pages it was too much to handle right now.


Nasta

Re: 8301

Post by Nasta »

M68008 wrote:Something related to the ZX8301 that I found surprising is that writing to RAM on the QL seems to take one cycle longer than reading from RAM (according to the CPU user manual, both should be able to take as little as 4 cycles in the ideal case). Might be related to DS being asserted one cycle later than AS for writes. Not sure if there is a good reason for writes being slower, or it's a cost-saving measure to simplify the ZX8301 logic or some other issue.
On average it should take the same, as most of the time (4/5ths of it) the 8301 incurs at least an 8-clock wait. However, one could (very carefully) construct a limit case. This would happen every time the CPU attempts to write a byte just before the 8301 starts reading a long word from RAM. At this point it will ignore the CPU some clocks in advance of its read slot if it deems there is no time to finish the CPU access before it has to read screen data. It only looks at /DS, so the fact it goes low later on write will make it skip a potentially valid access slot. A similar situation can happen if the CPU is occasionally writing data to RAM while the visible portion of a line is in progress, interspersed with accesses performed elsewhere - it can miss potentially valid access slots. This is indeed because the logic is based on /DS timing on read, where it comes 1 clock cycle early. On write, the 8301 counts on that one extra clock so it appears to it the cycle will take too long so it stalls the CPU.

Perhaps this is also a point where improvement could have been made, especially since the RAM actually latches the data provided by the CPU on write so looking at address lines and /WR could have been used to start a RAM cycle (/RAS goes low) and if /DS and /WR are found to be low right before /CAS is to be generated, it could safely proceed with the write counting on data having been latched on the falling edge of /CAS - even if the actual CPU cycle then extends into the next screen data read. If not, /CAS is not generated and the cycle ends up being a refresh which does not produce nor alter any data.
However, given how 8302 decoding is handled (as if it was RAM) obviously the logic was simplified as much as possible and re-used for both reading and writing RAM as well as the 8302 and 8301. Making it more sophisticated would have complicated things - but then one could argue there could have been logic to spare had the 8302 decoding been re-done to connect it directly to the CPU bus, as it could have been right from the start. 8301 MCR write would easily be catered for because it's write only and very simple.

In the post above I have only mentioned improvements that can be handled inside the 8301, without breaking compatibility with existing motherboard versions. If one could only count on the 8302 being directly on the CPU bus, there would have been savings in logic in the 8301, not to mention that a 'HAL' logic chip implementing some small parts of the logic would save us pins on the 8301, for even more streamlining. But alas, it is what it is.

If the motherboard were re-spun, even with the original 8301 and 8302, many improvements could be made, one of the better ones being the ability to run the CPU from a clock independent of the 7.5MHz the 8301 provides and screen RAM shadowing.
Replacing the 8301 with an FPGA-based PCB with RAM on-board (either on the PCB or in the FPGA or both) opens up a whole world of possibility, even just with the re-implementation of the basic functions. For instance, it's practically a given that full CPU access speed can be had AND line doubling VGA AND asynchronous/independent CPU clock all at once. Even a simpler non-VGA version based on a 128k x 8 static RAM would do a lot, as the common SRAM chip of that capacity can perform an entire access in a little more than half a CPU clock - so plenty of timing space to time-multiplex CPU and screen access. One could base the logic on alternate clock cycles where even is CPU and odd is screen access; at 7.5MHz a typical 80ns SRAM would cover the standard screen refresh needs without slowing down the CPU. An FPGA with 64k internal SRAM for both screens, with an external 128k SRAM added as shadow and the other 64k of RAM (to emulate a 128k QL), would easily do QL video at VGA compatible timings with line doubling/tripling - and run the CPU with an independent clock and no waiting.
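The "little more than half a CPU clock" figure is easy to check with a bit of arithmetic. The sketch below uses only the two numbers quoted above (7.5MHz and 80ns); the variable and function names are made up for illustration, and it ignores address setup, buffer delays and other real-world margins.

```python
CPU_CLOCK_HZ = 7_500_000   # clock the 8301 provides to the CPU
SRAM_ACCESS_NS = 80        # typical SRAM speed grade mentioned above

# One clock period in nanoseconds: about 133.3 ns at 7.5 MHz.
clock_period_ns = 1e9 / CPU_CLOCK_HZ

# Share of one clock period taken by a full SRAM access: 0.6.
fraction_of_clock = SRAM_ACCESS_NS / clock_period_ns


def fits_in_one_clock(access_ns: float, clock_hz: float) -> bool:
    """True if a complete SRAM access fits inside a single clock period."""
    return access_ns <= 1e9 / clock_hz
```

So an 80ns part uses roughly 60% of each 133ns slot, which is what leaves room for alternating CPU and screen slots on even and odd clocks.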


M68008
Trump Card
Posts: 223
Joined: Sat Jan 29, 2011 1:55 am

Re: 8301

Post by M68008 »

Thank you for the explanation, and for this great series of posts about the ZX8301!

I see the RAM writes being consistently slow, not just in "limit cases". I think the following trace shows that writes take 5 cycles instead of 4 even when not colliding with a screen refresh:
write5.png
This trace was kindly collected years ago by a QL user on my request. My interpretation is that it shows a memory write to RAM (to address $3E8D6) taking 5 cycles, with an extra wait cycle due to the DTACK signal. The write starts 13 cycles before VDA is asserted, too far for the screen refresh to be the cause for the wait (and in fact both this write and the following one are able to complete before the next refresh starts). Not sure what was the revision of this QL.
Similar traces for RAM reads show that they only take 4 cycles on the same QL (unless wait states are inserted to wait for screen refresh).
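For scale, a rough calculation of what the extra cycle costs, assuming the CPU runs from the standard 7.5MHz clock (the clock of the machine in the trace is an assumption here, as is every name in the snippet):

```python
CLOCK_HZ = 7_500_000  # assumed standard QL CPU clock


def bus_cycle_ns(clocks: int, clock_hz: float = CLOCK_HZ) -> float:
    """Duration of a bus cycle of the given length in nanoseconds."""
    return clocks * 1e9 / clock_hz


read_ns = bus_cycle_ns(4)    # 4-cycle read, about 533 ns
write_ns = bus_cycle_ns(5)   # 5-cycle write, about 667 ns

# Relative slowdown of a write versus a read: 0.25, i.e. 25% longer.
write_penalty = write_ns / read_ns - 1
```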


Nasta
Gold Card
Posts: 443
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: 8301

Post by Nasta »

M68008 wrote:Thank you for the explanation, and for this great series of posts about the ZX8301!

I see the RAM writes being consistently slow, not just in "limit cases". I think the following trace shows that writes take 5 cycles instead of 4 even when not colliding with a screen refresh:

write5.png

This trace was kindly collected years ago by a QL user on my request. My interpretation is that it shows a memory write to RAM (to address $3E8D6) taking 5 cycles, with an extra wait cycle due to the DTACK signal. The write starts 13 cycles before VDA is asserted, too far for the screen refresh to be the cause for the wait (and in fact both this write and the following one are able to complete before the next refresh starts). Not sure what was the revision of this QL.
Similar traces for RAM reads show that they only take 4 cycles on the same QL (unless wait states are inserted to wait for screen refresh).
Very interesting. I wonder what revision of the ULA it was - mine is the ceramic CLA one. I don't have the time to re-check right now but it might be worth the trouble. I still have that motherboard wired up for the analyzer, I'll do some more traces when I have the time!


Peter
QL Wafer Drive
Posts: 1953
Joined: Sat Jan 22, 2011 8:47 am

Re: 8301

Post by Peter »

Nasta wrote:Have also a look at the maximum period required for /RAS. It does have a maximum (as long as it is) so it cannot be completely static. But I am sure at that rate the current consumption would be FAR lower.
Not sure what you mean - /RAS pulse width seems irrelevant, /RAS would never be active. Do you think there is a requirement to toggle /RAS if the DRAM contents are never needed?
Nasta wrote:This suggests that you would implement the entire 128k inside the FPGA?
Not necessarily. If it was a PCB for BGA cases anyway, a small external RAM should also fit. But simply spending €4..5 more for 128 KB of FPGA Blockram and saving the extra work is tempting. Total overkill, I know.
Nasta wrote:Actually, you could probably implement a 68008 in there too :P
That temptation needs to be resisted. ;) It would not be easy to get the CPU cycle-accurate. And once faster CPU timings are allowed, throwing a Q68 derivative on a QL form factor mainboard seems easier.
(Side remark: When Dave asked about the Q68 FPGA as a separate chip for re-use on other form factors, I was not totally opposed. I just asked for a separate discussion thread.)
Nasta wrote:Given the FPGA would come with lots of pins and the whole replacement would have to be a small PCB I vote for a smaller FPGA that can contain SCR0 and 1, and a 128k SRAM on the side to implement the actual 'CPU' RAM, and shadow the video RAM inside the FPGA.
An additional point for this would be a 256 KB RAM, then shadow ROM also.
Nasta wrote:(HSYNCH derived from VSYNCH and /CSYNCH is used to trigger DRAM refresh on the SGC).
Good point - I had forgotten that.


QLvsJAGUAR
Gold Card
Posts: 455
Joined: Tue Feb 15, 2011 8:42 am
Location: Lucerne, Switzerland

Re: 8301

Post by QLvsJAGUAR »

Peter wrote:
Nasta wrote:My gripe with the motherboard would be components that are problematic to replace or remove when they fail (in this case it would be RAM), and signal integrity. After all I just re-used the 8302 and IPC on the Aurora :P
Yes, a QL motherboard replacement for obsolescence, signal quality and repair reasons could make more sense than I first thought.

Probably there are a lot of non-working QLs with intact black box, CPU and microdrives. If one can buy a replacement motherboard, simply re-seat the 68008 and voila, system works, this could sell well. However, I think that success would depend on a strict "retro" approach, i.e.
Yes, I think such a replacement motherboard would bring many broken QLs back to life, and with some design fixes and lessons learned applied, the thing would be a more stable computer.
Some ideas for an issue 8 motherboard
Highres Version can be found here:
http://www.sinclairql.net/srv/Sinclair_ ... _ISS_8.jpg

QL forever!
Urs


QL forever!
https://www.sinclairql.net/ - Go and get THE DISTRIBUTION & QL/E!
https://www.youtube.com/QLvsJAGUAR/community - Blog
https://www.youtube.com/QLvsJAGUAR - Dedicated QL videos
Sinclair, QL, ATARI, JAGUAR, NUON, APPLE, NeXT, MiST & much more...
Videos, pictures & information