Re: Q68 speed vs 680X0
Posted: Thu Nov 29, 2018 8:00 pm
I checked again. Copyback cache was active when I measured 18 seconds for the Q60/80.
Nasta wrote: Now the Q68 - The 030 approaches 1 instruction per 2 cycles since the whole benchmark fits inside the cache. If we scale its result to 80MHz (to match 2x 40MHz of the Q68), to compare with Q68 running from SRAM at near 1 instruction per cycle, we get 49.7 secs, so the Q68 is actually faster by about 4% when running from full speed memory. That is an excellent result.
Peter wrote: I am a little surprised myself, I did not expect to (almost or actually) beat the 68040 in a maths-centered benchmark, even though the code only uses plain 68k instructions.
Well, that's what the QL firmware and software uses, so it's a great benchmark for our purposes.
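For anyone who wants to redo the clock scaling used in the comparison above, here is a minimal sketch. It assumes run time is inversely proportional to clock frequency, which only holds while the benchmark runs entirely from cache or zero wait state memory. The function name and the example figures are placeholders, not measurements from this thread.

```python
def scale_time(measured_s: float, measured_mhz: float, target_mhz: float) -> float:
    """Scale a benchmark time to another clock, assuming time ~ 1/frequency."""
    return measured_s * measured_mhz / target_mhz

# Placeholder figures, not real results: 100 s at 40 MHz scaled to 80 MHz.
print(scale_time(100.0, 40.0, 80.0))
```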
Peter wrote: I'm not in the mood for a 40 MHz SRAM based design myself, because it is more PCB layout work than SDRAM and requires clever sourcing of the otherwise expensive fast SRAM chips. But it is amazing how well it would perform.
While the speed would no doubt be a major thrill, SRAM has one disadvantage: the net result is less available RAM for a higher price.
Nasta, if I remember correctly, you were considering an SRAM based design? Time permitting, I'd be willing to create a TQFP144 FPGA-CPU for you for free, with SRAM interface. As long as it is without the Q68 graphics & SDRAM controller (which I'd like to keep confidential for now), other free on-chip peripherals are debatable.
Just an idea, since you seemed to like the performance. I don't know if you have the time...
Peter
Nasta wrote: The main limit is the load the pins of the FPGA can drive, as this determines how many physical chips can be fitted, which in turn defines the maximum feasible amount of RAM, even if price was no object.
The 68030 output voltages for the address lines are specified for 3.2 mA, while the FPGA specifies them for max. 24 mA. This indicates a higher drive strength.
Nasta wrote: Because we are talking full 40MHz speed (25ns cycle), the fastest asynchronous SRAM that is also practically large has a 10ns speed grade. This pretty much precludes any external decoding, so the FPGA needs to provide pre-decoded chip selects.
Pre-decoding to your wishes is no problem. The speed grade could also be 12 ns, depending on details. Alternatively, you could use 10 ns and spend 6€ more for a higher speed grade FPGA running at 50 MHz.
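To put numbers on the speed grade options, here is a rough timing budget sketch. It assumes one SRAM access per CPU clock and lumps everything that is not SRAM access time (FPGA clock-to-output, data setup, PCB delays) into a single "margin" figure. The function names are made up for illustration, and the real FPGA timings are not given in this thread.

```python
def cycle_time_ns(clock_mhz: float) -> float:
    """Length of one clock cycle in nanoseconds."""
    return 1000.0 / clock_mhz

def margin_ns(clock_mhz: float, sram_access_ns: float) -> float:
    """Time left in the cycle after the SRAM access time (assumed 1 access per clock)."""
    return cycle_time_ns(clock_mhz) - sram_access_ns

# The three combinations mentioned above: (clock in MHz, SRAM speed grade in ns).
for clock_mhz, sram_ns in [(40, 10), (40, 12), (50, 10)]:
    print(f"{clock_mhz} MHz, {sram_ns} ns SRAM: "
          f"cycle {cycle_time_ns(clock_mhz):.0f} ns, margin {margin_ns(clock_mhz, sram_ns):.0f} ns")
```

The margin shrinks from 15 ns to 10 ns when going from 40 MHz to 50 MHz with the same 10 ns parts, which is why external decoding in that path looks unattractive.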
Nasta wrote: A 16-bit data bus also has its limits - with it the practical limit to RAM size would be somewhere between 4 and 8 Mbytes if the largest chips were used, and there would be some leeway for a potentially higher clock frequency.
To me 8 MB sounds reasonable for SMSQ/E. I chose 32 MB for the Q68 because there was no significant price difference to smaller size, and some interest in ucLinux.
Nasta wrote: Unfortunately, this sort of RAM is not as suitable for pulling video data from it as SDRAM is, so video would have to be handled separately - but this requires some way to access a 'slower' memory interface at the same time.
Separate VRAM is desirable anyway, because it does not have the Q68 effect of slowing down the CPU at higher resolutions.
Nasta wrote: While very interesting, it seems like a lot of work...
Yes. But compared to a 020/030 SRAM design, less work. For example, it could have the Q68 internal "ROM loader" and SDHC card interface, saving Flash and additional mass storage. Also, the FPGA can generate some chip selects and have all pins exactly where you want them. I don't remember the video circuit on the Aurora, but possibly it could be re-used with some changes.
Nasta wrote: The main limit is the load the pins of the FPGA can drive, as this determines how many physical chips can be fitted, which in turn defines the maximum feasible amount of RAM, even if price was no object.
Peter wrote: The 68030 output voltages for the address lines are specified for 3.2 mA, while the FPGA specifies them for max. 24 mA. This indicates a higher driver strength.
Yes, this implies stronger drive, but of course it comes down to the delay produced by load capacitance and drive current. If the pinout is customizable, this can be kept low and well controlled. Basically, the SRAM chip(s) need to be as close as possible to the FPGA, and along with them there would be a buffer to an 'external' slower bus. Given that the largest feasible SRAM is 2M x 16 (so 4 Mbytes per chip), I am sure 2 or 3 could be fitted. I'll get back to this a bit later.
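The capacity figures follow directly from the chip organisation. A small sketch of the arithmetic (the helper name is just for illustration):

```python
def chip_bytes(words: int, width_bits: int) -> int:
    """Capacity in bytes of a memory chip organised as words x width_bits."""
    return words * width_bits // 8

MB = 1024 * 1024
per_chip = chip_bytes(2 * MB, 16)  # the 2M x 16 SRAM mentioned above

for chips in (1, 2, 3):
    print(f"{chips} chip(s): {chips * per_chip // MB} MB")
```

So 2 or 3 chips would give 8 or 12 Mbytes, provided the FPGA pins can drive that load at full speed.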
Peter wrote: Pre-decoding to your wishes is no problem. Speedgrade could also be 12 ns, depending on details. Alternatively, you could use 10 ns and spend 6€ more for a higher speed grade FPGA running at 50 MHz.
Well (blush) I was kind of counting on something like that.
Peter wrote: To me 8 MB sounds reasonable for SMSQ/E. I chose 32 MB for the Q68 because there was no significant price difference to smaller size, and some interest in
Well, that's the way one does it - if the optimum price point gives 'more than you need' (kind of an oxymoron; history shows us that if it's there, someone will find an application for it), you put on that part. It's a win-win situation.
Peter wrote: Separate VRAM is desirable anyway, because it does not have the Q68 effect of slowing down the CPU at higher resolutions.
Exactly. As it happens I have 256k x 16 VRAM chips. One per board would satisfy the above mentioned spec. Now, there are different ways to actually implement this, and it's not a trivial decision given the problem of VESA timing, not to mention the demise of 4:3 or 5:4 aspect ratio monitors.
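As a rough check of what a single 256k x 16 VRAM chip can hold, here is a frame buffer size sketch. The display modes listed are illustrative assumptions only (the spec referred to above is not restated here), and the helper names are made up.

```python
VRAM_BYTES = 256 * 1024 * 16 // 8  # one 256k x 16 chip = 512 KB

def framebuffer_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Bytes needed for one full frame at the given resolution and depth."""
    return width * height * bits_per_pixel // 8

def fits_in_vram(width: int, height: int, bits_per_pixel: int) -> bool:
    """True if one frame fits into a single VRAM chip."""
    return framebuffer_bytes(width, height, bits_per_pixel) <= VRAM_BYTES

# Illustrative modes (assumptions): width, height, bits per pixel.
for mode in [(512, 256, 2), (1024, 768, 4), (1024, 768, 8)]:
    print(mode, framebuffer_bytes(*mode), "bytes, fits:", fits_in_vram(*mode))
```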
Peter wrote: Different waitstates could either be generated inside the FPGA, or by an external CPU clock enable.
Given that this option is intended for non-SRAM access, which can be of various kinds with different delays, some sort of 'DTACK' signal would have to be provided. Basically, the CPU disables its own clock as soon as a non-SRAM cycle is attempted, until released by an external signal.
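A toy model of the clock-enable handshake described above may help: the CPU completes an SRAM cycle in one clock, while any other cycle is held for as many clocks as the external logic keeps the release signal inactive. This is only a behavioural sketch, not FPGA code; the function name and the wait counts for the example devices are assumptions.

```python
def clocks_for_cycle(wait_clocks: int) -> int:
    """Count system clocks for one bus cycle.

    wait_clocks is how long external logic withholds the release ('DTACK'-like)
    signal; 0 models a full speed SRAM access.
    """
    clocks = 0
    stalled = wait_clocks
    while True:
        clocks += 1          # one system clock elapses
        if stalled == 0:     # CPU clock enabled again: the cycle completes
            return clocks
        stalled -= 1         # CPU clock still gated off, keep waiting

# Hypothetical devices and wait counts, for illustration only.
for name, waits in [("SRAM", 0), ("VRAM", 3), ("QL peripheral", 10)]:
    print(f"{name}: {clocks_for_cycle(waits)} clock(s)")
```

Because the stall length is decided outside the CPU, each kind of slow device can take as long as it needs without the core knowing anything about it.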
Nasta wrote: While very interesting, it seems like a lot of work...
Peter wrote: Yes. But compared to a 020/030 SRAM design, less work. For example, it could have the Q68 internal "ROM loader" and SDHC card interface, saving Flash and additional mass storage. Also, the FPGA can generate some chip selects and have all pins exactly where you want them. I don't remember the video circuit on the Aurora, but possibly it could be re-used with some changes.
Well, that's perfectly adequate, though there would probably have to be some smaller CPLD on the PCB as well if one wanted to convert FPGA core access to 'old style' peripherals, including 16<>8 bit bus conversion and VRAM access, without taking up too much board space. This is one (though not major) advantage of the 030: it has dynamic bus sizing. But then, the conversion problem has already been handled on GC, so no big deal.
The Q68 has no instruction prefetch and no write buffering yet.
All the best
Peter
Nasta wrote: As it happens I have 256k x 16 VRAM chips. One per board would satisfy the above mentioned spec.
Can you tell which part it is?
Nasta wrote: Given that this option is intended for non SRAM access which can be of various kinds with different delays, some sort of 'DTACK' signal would have to be provided, basically the CPU disables it's own clock as soon as a non-SRAM cycle is attempted, until released by an external signal.
Should be no problem.
Nasta wrote: As it happens I have 256k x 16 VRAM chips. One per board would satisfy the above mentioned spec.
Peter wrote: Can you tell which the part it is?
Sure, IBM 025161 I think, also some KM4216C256, 60ns. Not 100% sure as I am not in a position to go look at them at the moment.
Nasta wrote: Given that this option is intended for non SRAM access which can be of various kinds with different delays, some sort of 'DTACK' signal would have to be provided, basically the CPU disables it's own clock as soon as a non-SRAM cycle is attempted, until released by an external signal.
Peter wrote: Should be no problem.
As far as I have understood, it's a 16-bit data bus. Should be no problem to design a converter / bus interface along with other needed decoding in a CPLD alongside, since the FPGA needs an interface to 5V logic anyway, and then there is the matter of driving the VRAM, though I have some (possibly odd) ideas about how that could be simplified.
I might be too lazy to design full bus sizing for the FPGA, so GC style logic that splits (long)word cycles into several separate byte cycles would be external.
Alternatively, if the only purpose is to access QL internal registers and a few known QL peripherals, bytewide register access is probably enough. 16 to 8 bit bus multiplexing could easily be inside the FPGA then. IIRC there's only one place in Minerva where a non-byte access to QL registers is relevant (and fixable). The only "standard" QL peripheral that uses word access is probably QubIDE, where a relatively small change in the QubATA driver could do the byte-split in software. (The FPGA-integrated SDHC interface is much faster than QubIDE, so even the need for that is debatable.)
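To illustrate what the byte-split amounts to, whether done by GC style logic or in a driver, here is a minimal sketch. It relies on the 68k being big-endian, so the most significant byte goes to the lowest address. The ascending order of the byte cycles, the function name and the example address are assumptions for illustration, not taken from the GC or QubATA.

```python
def split_into_byte_cycles(address: int, value: int, size_bytes: int):
    """Split one word (2) or longword (4) write into separate byte writes.

    Returns a list of (address, byte) pairs in 68k big-endian order.
    """
    data = value.to_bytes(size_bytes, "big")
    return [(address + offset, byte) for offset, byte in enumerate(data)]

# Hypothetical peripheral address, word write of 0x1234.
for addr, byte in split_into_byte_cycles(0x18000, 0x1234, 2):
    print(f"write byte {byte:#04x} to {addr:#07x}")
```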
Peter
Silvester wrote: After blowing the dust off my Q40 I tried the benchmark program and got 54 seconds, which sounds OK for 40MHz 68040 against Peter's 80MHz 68060. That was with CACHE_ON, with CACHE_OFF it was 167 seconds! That doesn't compare well with benchmark on my SGC: 176 seconds with CACHE_ON (212 with CACHE_OFF). Has something gone wrong with my Q40? - be interested to hear if other Q40 user gets same results.
BTW SMSQ/E v3.xx doesn't appear to honour CACHE_ON with Q40 (benchmark always 170 seconds). SMSQ/E v2.98 is OK.
Peter, how does your Q60 do with CACHE_OFF (and what SMSQ/E version are you running)?
You should use the commands COPYBACK, WRITETHROUGH and SERIALIZED to select cache modes on Qx0. I was not even aware that CACHE_ON and CACHE_OFF still exist.
Silvester wrote: After blowing the dust off my Q40 I tried the benchmark program and got 54 seconds, which sounds OK for 40MHz 68040 against Peter's 80MHz 68060.
Could you try again after sending the COPYBACK command please?
It is normal that serialized mode (probably selected by CACHE_OFF) is extremely slow - it is more restrictive than just running without cache.
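To see how different the two machines react, the quoted times can be turned into slowdown factors. A small sketch using only the figures reported above (the dictionary layout and function name are just for illustration):

```python
# Benchmark times in seconds as reported above for the Q40 and the SGC.
results = {
    "Q40": {"CACHE_ON": 54, "CACHE_OFF": 167},
    "SGC": {"CACHE_ON": 176, "CACHE_OFF": 212},
}

def slowdown(machine: str) -> float:
    """How many times slower the benchmark runs with CACHE_OFF than with CACHE_ON."""
    times = results[machine]
    return times["CACHE_OFF"] / times["CACHE_ON"]

for machine in results:
    print(f"{machine}: {slowdown(machine):.2f}x slower with CACHE_OFF")
```

The Q40 slows down by roughly a factor of 3 while the SGC loses only about 20%, which fits the explanation that CACHE_OFF probably selects serialized mode on the Q40 rather than simply disabling the cache.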