Q68 speed vs 680X0

Peter
QL Wafer Drive
Posts: 1953
Joined: Sat Jan 22, 2011 8:47 am

Re: Q68 speed vs 680X0

Post by Peter »

I checked again. Copyback cache was active when I measured 18 seconds for the Q60/80.


Nasta
Gold Card
Posts: 443
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: Q68 speed vs 680X0

Post by Nasta »

Peter wrote:
Nasta wrote: Now the Q68 - The 030 approaches 1 instruction per 2 cycles since the whole benchmark fits inside the cache. If we scale its result to 80MHz (to match 2x 40MHz of the Q68), to compare with Q68 running from SRAM at near 1 instruction per cycle, we get 49.7 secs, so the Q68 is actually faster by about 4% when running from full speed memory. That is an excellent result.
I am a little surprised myself, I did not expect to (almost or actually) beat the 68040 in a maths-centered benchmark, even though the code only uses plain 68k instructions.
Well, that's what the QL firmware and software uses so it's great benchmark for our purposes.
I'm not in the mood for a 40 MHz SRAM based design myself, because it is more PCB layout work than SDRAM and requires clever sourcing of the otherwise expensive fast SRAM chips. But it is amazing, how well it would perform.
Nasta, if I remember correctly, you were considering an SRAM based design? Time permitting, I'd be willing to create a TQFP144 FPGA-CPU for you for free, with SRAM interface. As long as it is without the Q68 graphics & SDRAM controller (which I'd like to keep confidential for now) other free on-chip peripherals are debatable.
Just an idea, since you seemed to like the performance. I don't know if you have the time...
Peter
While the speed would no doubt be a major thrill, SRAM has one disadvantage which produces a net result of less available RAM for a higher price.
If we use the most dense static RAM at 16 bit width on a FPGA based on the Q68, we are limited to 4M of total RAM - the largest single chip has a 32Mbit density, 2M x 16 bits. It is very fast and can work at an even faster speed than the current Q68 clock. However, it is slower at accessing sequential data (important for efficient generation of video) - and it is VERY expensive compared to SDRAM as currently used.
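As a quick check of the figures above, here is a minimal Python sketch of the capacity arithmetic. The function name is made up for illustration; only the 2M x 16 organisation comes from the post.

```python
MIB = 1024 * 1024  # bytes in one binary megabyte


def chip_capacity_bytes(words: int, width_bits: int) -> int:
    """Capacity in bytes of a memory chip organised as words x width_bits."""
    return words * width_bits // 8


# Largest single asynchronous SRAM mentioned in the post: 2M x 16 bits.
capacity = chip_capacity_bytes(2 * MIB, 16)
print(capacity * 8 // MIB, "Mbit")    # density in megabits
print(capacity // MIB, "Mbytes")      # size in megabytes
```

With a single chip on a 16-bit bus, this is where the 4M limit comes from.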

Now, to answer the question(s)...

At one point a SRAM based design was on the board for a SGC replacement based on the 68EC020. The main advantages are very simple control/decode logic, low power requirement and somewhat higher speed given that the fastest EC020 could run with zero wait states. The disadvantage is that more space is required for the chips, so more effort with packaging on a PCB. The RAM price per bit is also a LOT higher compared to current offerings, but is in reality offset by the 68EC020 using 5V logic so we are 'stuck' with older RAM of any kind, which has sort-of equalized in price because 5V DRAM has become scarce, while 5V SRAM is still in use, so this lessens the price issue. That being said, the largest commonly available 5V SRAM is half a megabyte, usually 8-bit wide, so RAM has to be provided in increments of 4 chips (to get 32 bits of data bus), meaning 2Mbytes. Here we soon run into a problem, which is the need to buffer address lines going to the SRAM chips, which has a slight speed penalty, but that can actually end up incurring a wait state for larger RAM sizes. Even with only one bank (4 chips at 512k x 8), all chips share common address lines, so 4 loads per relevant address line while presenting only one load on any data line. Without buffering, this is about as much as one can drive with the CPU directly, taking into account there are other devices that also need addressing, so extra loads are present. Obviously, a 50% cut in available RAM with a benefit of slightly higher speed is actually a step back compared to SGC. With buffering things get better, but again, when considering practical space limits, this sort of thing pegs the maximum RAM limit between 8 and 12Mbytes.
Given that the EC020 only has a total address range of 16M, and it would probably be reasonable to keep some of it for other interesting things like the usual QL expansions and some extras like on-board flash, it still sort-of holds together, but is marginally worth it since there is no significant speed advantage (and when I say significant, perceptively, that's at least double speed) as this is limited by the highest speed grade of the EC020 one can get.
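The bank and loading arithmetic in the two paragraphs above can be sketched as follows. This is only an illustration: the helper names are hypothetical, and the load count ignores the other devices on the bus that the post mentions.

```python
KIB = 1024
MIB = 1024 * KIB


def bank_layout(chip_words, chip_width_bits, bus_width_bits):
    """Chips needed to fill the data bus once, and the bank size in bytes."""
    chips = bus_width_bits // chip_width_bits
    bank_bytes = chips * chip_words * chip_width_bits // 8
    return chips, bank_bytes


def line_loads(banks, chips_per_bank):
    """Chip inputs hanging on one address line and on one data line."""
    return {"address": banks * chips_per_bank, "data": banks}


chips, bank_bytes = bank_layout(512 * KIB, 8, 32)  # 512k x 8 parts, 32-bit bus
address_space = 2 ** 24                            # 16M range of the EC020

for banks in (1, 4, 6):                            # 2, 8 and 12 Mbytes of SRAM
    total = banks * bank_bytes
    print(banks, "bank(s):", total // MIB, "Mbytes,",
          line_loads(banks, chips),
          "left for I/O:", (address_space - total) // MIB, "Mbytes")
```

The address-line load grows four times faster than the data-line load, which is why buffering the address lines comes up first.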

At this point, one could say, why not move to a full 68020, you get a full 32-bit address range and clock speed up to 33MHz, perhaps slightly overclockable. While this might be tempting, the chip is larger, not that much faster (at most 30%) but most importantly VERY expensive for what it is, even used. This is actually logical since it was the highest grade, not produced in large numbers and by the time it became available, the 68030 was already out.

Migrating to a EC030 CPU, which is now even lower priced than an EC020, keeps about the same limits on RAM, due to buffering and the need to use 5V parts, only now we can run at over twice the clock of the SGC, but then would prefer faster parts so price can go up. It should be noted that the EC030 and EC020 are virtually identical in terms of performance per clock MHz, but you can get an EC030 CPU that runs at twice the clock speed of the fastest EC020, and in theory, you can get the 030 to run from external memory about as fast (in terms of clock cycles) as the EC020 can only do from cache. Therefore, there is a clear speed advantage - the one remaining advantage, sadly not really usable in this case, is that the EC030 has a full 32-bit address bus, thus 4G address range, compared to the 16M of the EC020. On the other hand, to get this sort of performance, really fast SRAM is needed. It is available but more expensive, though still limited to the same RAM sizes for 5V parts. There are larger and cheaper 3V parts available but that requires 3V to 5V conversion which again puts a speed penalty, so in the end we get to having to use the fastest RAM and to higher prices - and actually beat QXL speed.
However, once 3V conversion is brought into the picture, SDRAM becomes a possibility, and with it a way to cheaply get as much RAM as the OS can actually realistically support. The price is a much more complex RAM control logic, especially if it is to run close to the maximum speed the 030 can do - while this is not exactly 100% it can approach it. The interesting thing is, the way the 030 works, it's prudent to run the SDRAM at twice the clock rate of the CPU (this it can easily do, the maximum CPU clock becomes the bottleneck), which means the SDRAM is in theory (for long bursts of successive data reads) capable of twice the data rate the 030 can take. This can be used for shared RAM based video with fairly high performance in QL terms, but here we move from CPLD territory to FPGA.
One middle way could be to provide an amount of fast SRAM for speed critical applications, and then use dynamic ram for 'regular uses'. At this point we are talking rather complex hardware in terms of board size and complexity, as well as chip count.

Then there is the Q68. As the results in the other thread have shown, it can get close to 1 instruction per cycle, which means that at the same clock frequency it will beat any EC030 and get close to (or perhaps even surpass in a QL application) an 040 CPU. However, if a SRAM version was done, it is subject to the same limitations as the ones outlined above. In its case we need to use 3V logic only, which puts us in more expensive but twice as large SRAM chips, especially since a very high speed grade of SRAM is needed to keep up even with the current 40MHz clock. The main limit is the load the pins of the FPGA can drive, as this determines how many physical chips can be fitted, which in turn defines the maximum feasible amount of RAM, even if price was no object. Because we are talking full 40MHz speed (25ns cycle), the fastest asynchronous SRAM that is also practically large, has a 10ns speed grade. This pretty much precludes any external decoding, so the FPGA needs to provide pre-decoded chip selects. A 16-bit data bus also has its limits - with it the practical limit to RAM size would be somewhere between 4 and 8 Mbytes if the largest chips were used, and there would be some leeway for a potentially higher clock frequency. Twice that would be possible with a 32-bit data bus. Unfortunately, this sort of RAM is not as suitable for pulling video data from it as SDRAM is, so video would have to be handled separately - but this requires some way to access a 'slower' memory interface at the same time. If it can be done within loading constraints, a set of buffers can be used to couple slower RAM to the same SRAM memory interface, under the condition that the CPU can be made to wait for the memory (or peripheral for that matter) to respond when not accessing the SRAM part of the address map. This again depends on the internal decoding that can be done inside the FPGA.
I suppose the set of integrated peripherals would have to be kept at a minimum to provide enough pins for interfacing non-multiplexed and potentially wider bus memory, so again, some sort of bus to access some IO would be a must, either as a separate set of pins or multiplexed with the main interface.
While very interesting, it seems like a lot of work...

Still, I would love to know some more details on the architecture of the FPGA based 68k core (mostly to see if there is any pre-fetch and write buffering, which relates to a possible implementation of a 32-bit wide data bus).


Peter
QL Wafer Drive
Posts: 1953
Joined: Sat Jan 22, 2011 8:47 am

Re: Q68 speed vs 680X0

Post by Peter »

Nasta wrote:The main limit is the load the pins of the FPGA can drive, as this determines how many physical chips can be fitted, which in turn defines the maximum feasible amount of RAM, even if price was no object.
The 68030 output voltages for the address lines are specified for 3.2 mA, while the FPGA specifies them for max. 24 mA. This indicates a higher driver strength.
Nasta wrote:Because we are talking full 40MHz speed (25ns cycle), the fastest asynchronous SRAM that is also practically large, has a 10ns speed grade. This pretty much precludes any external decoding, so the FPGA needs to provide pre-decoded chip selects.
Pre-decoding to your wishes is no problem. Speedgrade could also be 12 ns, depending on details. Alternatively, you could use 10 ns and spend 6€ more for a higher speed grade FPGA running at 50 MHz.
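To put rough numbers on the clock and speed-grade options, here is a small sketch of the timing budget. It assumes a single-cycle access, which the thread implies but does not spell out; whatever is left after the SRAM access time has to cover FPGA output delay, board delay and input setup, none of which are quantified here.

```python
def cycle_ns(clock_mhz: float) -> float:
    """Length of one clock cycle in nanoseconds."""
    return 1000.0 / clock_mhz


def margin_ns(clock_mhz: float, sram_access_ns: float) -> float:
    """Time left in one cycle after the SRAM access time is subtracted."""
    return cycle_ns(clock_mhz) - sram_access_ns


# The three combinations discussed: (clock in MHz, SRAM speed grade in ns).
for clock_mhz, access_ns in ((40, 10), (40, 12), (50, 10)):
    print(f"{clock_mhz} MHz, {access_ns} ns SRAM: "
          f"cycle {cycle_ns(clock_mhz):.1f} ns, "
          f"margin {margin_ns(clock_mhz, access_ns):.1f} ns")
```

Under this assumption the 12 ns part at 40 MHz still leaves more slack than the 10 ns part at 50 MHz.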
Nasta wrote:A 16-bit data bus also has its limits - with it the practical limit to RAM size would be somewhere between 4 and 8 Mbytes if the largest chips were used, and there would be some leeway for a potentially higher clock frequency.
To me 8 MB sounds reasonable for SMSQ/E. I chose 32 MB for the Q68 because there was no significant price difference to smaller size, and some interest in ucLinux.
Nasta wrote:Unfortunately, this sort of RAM is not as suitable for pulling video data from it as SDRAM is, so video would have to be handled separately - but this requires some way to access a 'slower' memory interface at the same time.
Separate VRAM is desirable anyway, because it does not have the Q68 effect of slowing down the CPU at higher resolutions.

Different waitstates could either be generated inside the FPGA, or by an external CPU clock enable.
Nasta wrote:While very interesting, it seems like a lot of work...
Yes. But compared to a 020/030 SRAM design, less work. For example, it could have the Q68 internal "ROM loader" and SDHC card interface, saving Flash and additional mass storage. Also, the FPGA can generate some chip selects and have all pins exactly where you want them. I don't remember the video circuit on the Aurora, but possibly it could be re-used with some changes.

The Q68 has no instruction prefetch and no write buffering yet.

All the best
Peter


Nasta
Gold Card
Posts: 443
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: Q68 speed vs 680X0

Post by Nasta »

Peter wrote:
Nasta wrote: The main limit is the load the pins of the FPGA can drive, as this determines how many physical chips can be fitted, which in turn defines the maximum feasible amount of RAM, even if price was no object.
The 68030 output voltages for the address lines are specified for 3.2 mA, while the FPGA specifies them for max. 24 mA. This indicates a higher driver strength.
Yes, this implies stronger drive. Of course, it comes down to the delay produced by load capacitance and drive current, but if the pinout is customizable, this can be kept low and well controlled. Basically, the SRAM chip(s) need to be as close as possible to the FPGA and along with them there would be a buffer to an 'external' slower bus. Given that the largest feasible SRAM is 2M x 16 (so 4 Mbytes per chip) I am sure 2 or 3 could be fitted. I'll get back to this a bit later.
Pre-decoding to your wishes is no problem. Speedgrade could also be 12 ns, depending on details. Alternatively, you could use 10 ns and spend 6€ more for a higher speed grade FPGA running at 50 MHz.
Well (blush) I was kind of counting on something like that :P
To me 8 MB sounds reasonable for SMSQ/E. I chose 32 MB for the Q68 because there was no significant price difference to smaller size, and some interest in
Well, that's the way one does it - if the optimum price point gives 'more than you need' (kind of an oxymoron, history shows us if it's there, someone will find an application for it :) ) you put on that part, it's a win-win situation.
That being said, while SMSQ/E applications are small, data may become big by virtue of improved graphics. Program's windows get stored in RAM, more resolution and color automatically needs more RAM to do that.
I think Q68 is well balanced (plus extra!) in this regard, even though one could argue at the highest graphical settings the bandwidth penalty is on the high side. If you remember, that's why I argued for Aurora 256 color mode and I think it's 'just right' to get both the resolution and some usable colors. Given that Aurora (in 256 color mode) is borderline on the SGC which has 4Mbytes of RAM, extending the resolution to up to 1024x512 in 256 colors would be about the maximum to implement given 8-12Mbytes of RAM.
Separate VRAM is desirable anyway, because it does not have the Q68 effect of slowing down the CPU at higher resolutions.
Exactly. As it happens I have 256k x 16 VRAM chips. One per board would satisfy the above mentioned spec. Now, there are different ways to implement this in actuality, and it's not a trivial decision given the problem of VESA timing, not to mention the demise of 4:3 or 5:4 aspect ratio monitors.
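The claim that one 256k x 16 VRAM chip covers 1024x512 in 256 colors is easy to check. A minimal sketch, assuming one byte per pixel for 256 colors and no off-screen memory:

```python
def framebuffer_bytes(width: int, height: int, bits_per_pixel: int) -> int:
    """Bytes needed to hold one full screen."""
    return width * height * bits_per_pixel // 8


vram_bytes = 256 * 1024 * 16 // 8          # one 256k x 16 VRAM chip
needed = framebuffer_bytes(1024, 512, 8)   # 1024x512, 256 colors

print("needed:", needed, "bytes")
print("available:", vram_bytes, "bytes")
print("fits:", needed <= vram_bytes)
```

Under these assumptions the screen fills the chip exactly, with nothing to spare.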
Different waitstates could either be generated inside the FPGA, or by an external CPU clock enable.
Given that this option is intended for non SRAM access which can be of various kinds with different delays, some sort of 'DTACK' signal would have to be provided, basically the CPU disables its own clock as soon as a non-SRAM cycle is attempted, until released by an external signal.
If I could make wishes, some sort of 'bus grant' signal would be great (input to the CPU) but not strictly necessary.
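The 'DTACK' idea can be illustrated with a very simplified model counted in clock ticks. This is a sketch, not how the FPGA core actually works: `SRAM_TOP`, the function name and the example addresses are all assumptions made up for the illustration.

```python
SRAM_TOP = 8 * 1024 * 1024   # hypothetical: addresses below this are zero-wait SRAM


def bus_cycle_ticks(address: int, dtack_after: int) -> int:
    """Clock ticks one bus cycle takes in this simplified model.

    dtack_after is the number of ticks the external device needs before it
    releases the CPU; it is ignored for SRAM addresses.
    """
    if address < SRAM_TOP:
        return 1                    # SRAM: the clock is never disabled
    ticks = 1                       # tick in which the non-SRAM cycle starts
    while ticks <= dtack_after:     # clock enable held off until 'DTACK'
        ticks += 1
    return ticks


# Two SRAM accesses with one slow peripheral access in between.
trace = [(0x001000, 0), (0xC00000, 3), (0x002000, 0)]
total = sum(bus_cycle_ticks(address, delay) for address, delay in trace)
print("total ticks:", total)
```

The point of the handshake is that each external device sets its own delay, so the FPGA does not need a fixed wait-state table for everything outside the SRAM.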
Nasta wrote:While very interesting, it seems like a lot of work...
Yes. But compared to a 020/030 SRAM design, less work. For example, it could have the Q68 internal "ROM loader" and SDHC card interface, saving Flash and additional mass storage. Also, the FPGA can generate some chip selects and have all pins exactly where you want them. I don't remember the video circuit on the Aurora, but possibly it could be re-used with some changes.
The Q68 has no instruction prefetch and no write buffering yet.
All the best
Peter
Well, that's perfectly adequate, though there would probably have to be some smaller CPLD on the PCB as well if one wanted to convert FPGA core access to 'old style' peripherals, including 16<>8 bit bus conversion and VRAM access, without taking up too much board space. This is one (though not major) advantage of the 030, it has dynamic bus sizing - but then, the conversion problem has already been handled on GC, so no big deal.
One large portion of a discussion around this sort of hardware would have to be the address map and booting, as it's important for compatibility. I think there are ways to simplify that, so this could be another topic for discussion.


Peter
QL Wafer Drive
Posts: 1953
Joined: Sat Jan 22, 2011 8:47 am

Re: Q68 speed vs 680X0

Post by Peter »

Nasta wrote:As it happens I have 256k x 16 VRAM chips. One per board would satisfy the above mentioned spec.
Can you tell which part it is?
Nasta wrote:Given that this option is intended for non SRAM access which can be of various kinds with different delays, some sort of 'DTACK' signal would have to be provided, basically the CPU disables its own clock as soon as a non-SRAM cycle is attempted, until released by an external signal.
Should be no problem.

I might be too lazy to design full bus sizing for the FPGA, so GC style logic that splits (long)word cycles into several separate byte cycles would be external.

Alternatively, if the only purpose is to access QL internal registers and a few known QL peripherals, bytewide register access is probably enough. 16 to 8 bit bus multiplexing could easily be inside the FPGA then. IIRC there's only one place in Minerva where a non-byte access to QL registers is relevant (and fixable). The only "standard" QL peripheral that uses word access is probably Qubide, where a relatively small change in the QubATA driver could do the byte-split in software. (FPGA-integrated SDHC interface is much faster than QubIDE, so even the need for that is debatable.)
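The byte-split mentioned above, whether done in external GC-style logic or in a driver, amounts to the following. A minimal sketch assuming the usual 68k big-endian order (most significant byte at the lowest address); the function names and the example address are illustrative only.

```python
def split_write(address: int, value: int, size: int):
    """Turn one write of `size` bytes (1, 2 or 4) into byte-wide cycles.

    Returns a list of (address, byte) pairs, most significant byte first.
    """
    data = value.to_bytes(size, "big")
    return [(address + offset, byte) for offset, byte in enumerate(data)]


def join_read(byte_values) -> int:
    """Reassemble bytes read in ascending address order into one value."""
    return int.from_bytes(bytes(byte_values), "big")


cycles = split_write(0x18000, 0x12345678, 4)   # one long-word write
for address, byte in cycles:
    print(f"write ${address:05X} <- ${byte:02X}")

print(hex(join_read(byte for _, byte in cycles)))
```

A word access takes two byte cycles and a long word four, which is the cost being traded against the extra logic for full bus sizing.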

Peter


Nasta
Gold Card
Posts: 443
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: Q68 speed vs 680X0

Post by Nasta »

Peter wrote:
Nasta wrote: As it happens I have 256k x 16 VRAM chips. One per board would satisfy the above mentioned spec.
Can you tell which part it is?
Sure, IBM 025161 I think, also some KM4216C256, 60ns. Not 100% sure as I am not in a position to go look at them at the moment :)
Nasta wrote:Given that this option is intended for non SRAM access which can be of various kinds with different delays, some sort of 'DTACK' signal would have to be provided, basically the CPU disables its own clock as soon as a non-SRAM cycle is attempted, until released by an external signal.
Should be no problem.
I might be too lazy to design full bus sizing for the FPGA, so GC style logic that splits (long)word cycles into several separate byte cycles would be external.
Alternatively, if the only purpose is to access QL internal registers and a few known QL peripherals, bytewide register access is probably enough. 16 to 8 bit bus multiplexing could easily be inside the FPGA then. IIRC there's only one place in Minerva where a non-byte access to QL registers is relevant (and fixable). The only "standard" QL peripheral that uses word access is probably Qubide, where a relatively small change in the QubATA driver could do the byte-split in software. (FPGA-integrated SDHC interface is much faster than QubIDE, so even the need for that is debatable.)
Peter
As far as I have understood, it's a 16-bit data bus. Should be no problem to design a converter / bus interface along with other needed decoding in a CPLD alongside, since the FPGA needs an interface to 5V logic anyway, and then there is the matter of driving the VRAM, though I have some (possibly odd) ideas about how that could be simplified.
Once there is an external bus and sufficient address lines, lots of things can be done outside of the FPGA and free resources inside.
There would be some discussion needed on address maps, boot sequence etc. And, something I would REALLY like to see in Minerva in order to make it fairly easily customizable. Finally... QL compatibility, 8302, IPC, MDVs...

My big wish would be some sort of bus request/bus grant like on the 040/060, where the CPU does not assume bus ownership but is rather given it by external hardware. That would open up all sorts of options... this would even be significantly higher on my priority list than a full 32-bit data bus :P


Silvester
Gold Card
Posts: 436
Joined: Thu Dec 12, 2013 10:14 am
Location: UK

Re: Q68 speed vs 680X0

Post by Silvester »

After blowing the dust off my Q40 I tried the benchmark program and got 54 seconds, which sounds OK for 40MHz 68040 against Peter's 80MHz 68060. That was with CACHE_ON, with CACHE_OFF it was 167 seconds! That doesn't compare well with benchmark on my SGC: 176 seconds with CACHE_ON (212 with CACHE_OFF). Has something gone wrong with my Q40? - I'd be interested to hear if other Q40 users get the same results.

BTW SMSQ/E v3.xx doesn't appear to honour CACHE_ON with Q40 (benchmark always 170 seconds). SMSQ/E v2.98 is OK.

Peter, how does your Q60 do with CACHE_OFF (and what SMSQ/E version are you running)?


David
Peter
QL Wafer Drive
Posts: 1953
Joined: Sat Jan 22, 2011 8:47 am

Re: Q68 speed vs 680X0

Post by Peter »

Silvester wrote:After blowing the dust off my Q40 I tried the benchmark program and got 54 seconds, which sounds OK for 40MHz 68040 against Peter's 80MHz 68060. That was with CACHE_ON, with CACHE_OFF it was 167 seconds! That doesn't compare well with benchmark on my SGC: 176 seconds with CACHE_ON (212 with CACHE_OFF). Has something gone wrong with my Q40? - I'd be interested to hear if other Q40 users get the same results.

BTW SMSQ/E v3.xx doesn't appear to honour CACHE_ON with Q40 (benchmark always 170 seconds). SMSQ/E v2.98 is OK.

Peter, how does your Q60 do with CACHE_OFF (and what SMSQ/E version are you running)?
You should use the commands COPYBACK, WRITETHROUGH and SERIALIZED to select cache modes on Qx0. I was not even aware that CACHE_ON and CACHE_OFF still exist.

It is normal that serialized mode (probably selected by CACHE_OFF) is extremely slow - it is more restrictive than just running without cache.


Peter
QL Wafer Drive
Posts: 1953
Joined: Sat Jan 22, 2011 8:47 am

Re: Q68 speed vs 680X0

Post by Peter »

Silvester wrote:After blowing the dust off my Q40 I tried the benchmark program and got 54 seconds, which sounds OK for 40MHz 68040 against Peter's 80MHz 68060.
Could you try again after sending the COPYBACK command please?

Still quite a sensation for me that an SRAM based Q68 would beat a 68040 at same clock rate.


Silvester
Gold Card
Posts: 436
Joined: Thu Dec 12, 2013 10:14 am
Location: UK

Re: Q68 speed vs 680X0

Post by Silvester »

Q40 SMSQ/E v3.33

COPYBACK 54 seconds
WRITETHROUGH 54 seconds
SERIALIZED 169 seconds
It is normal that serialized mode (probably selected by CACHE_OFF) is extremely slow - it is more restrictive than just running without cache.
:o
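For reference, the timings quoted in this thread can be put side by side. The sketch below only divides the reported numbers; the seconds-times-MHz figure is a crude normalisation added here for illustration, not something the posters computed.

```python
# Benchmark times in seconds, as reported in the posts above.
seconds = {
    "Q60/80 copyback": 18,
    "Q40 COPYBACK": 54,
    "Q40 SERIALIZED": 169,
    "SGC CACHE_ON": 176,
    "SGC CACHE_OFF": 212,
}

# Clock rates in MHz, only for the machines whose clock is stated in the thread.
clock_mhz = {"Q60/80 copyback": 80, "Q40 COPYBACK": 40}


def speedup(slow: str, fast: str) -> float:
    """How many times faster `fast` ran the benchmark than `slow`."""
    return seconds[slow] / seconds[fast]


def mega_cycles(name: str) -> int:
    """Seconds times MHz: millions of clock cycles spent on the benchmark."""
    return seconds[name] * clock_mhz[name]


print(f"Q40 serialized vs copyback: {speedup('Q40 SERIALIZED', 'Q40 COPYBACK'):.2f}x")
print(f"SGC CACHE_ON vs Q40 copyback: {speedup('SGC CACHE_ON', 'Q40 COPYBACK'):.2f}x")
print(f"Q40 vs Q60/80: {speedup('Q40 COPYBACK', 'Q60/80 copyback'):.2f}x")
for name in clock_mhz:
    print(name, mega_cycles(name), "million cycles")
```

On these numbers the Q40 in serialized mode lands close to the SGC with its cache on, which matches the remark that serialized mode is more restrictive than simply running without cache.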


David