Symmetric multiprocessing...

Derek_Stewart
Font of All Knowledge
Posts: 3928
Joined: Mon Dec 20, 2010 11:40 am
Location: Sunny Runcorn, Cheshire, UK

Re: Symmetric multiprocessing...

Post by Derek_Stewart »

Hi Dave,

Looking good. If you ever produce any of these boards, I would be interested.


Regards,

Derek
Nasta
Gold Card
Posts: 443
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: Symmetric multiprocessing...

Post by Nasta »

...needs some refinement to the hardware if it's going to run useful code. Some of it can be implemented using extra features present on DPRAM chips, depending on the actual chip. Also, some form of interrupt is needed if any kind of kernel is to run, to provide a periodic tick. Similar to QDOS, at least the poll interrupt must be implemented.


tofro
Font of All Knowledge
Posts: 2685
Joined: Sun Feb 13, 2011 10:53 pm
Location: SW Germany

Re: Symmetric multiprocessing...

Post by tofro »

Nasta wrote:Also, some form of interrupt is needed if any kind of kernel is to run, to provide a periodic tick. Similar to QDOS, at least the poll interrupt must be implemented.

The periodic tick (the polling interrupt) is the heartbeat of any QDOS-based system. If you ever want to be able to implement anything QDOS-like, you'll need a timer interrupt on all CPUs.

Tobias


ʎɐqǝ ɯoɹɟ ǝq oʇ ƃuᴉoƃ ʇou sᴉ pɹɐoqʎǝʞ ʇxǝu ʎɯ 'ɹɐǝp ɥO
Pr0f
QL Wafer Drive
Posts: 1298
Joined: Thu Oct 12, 2017 9:54 am

Re: Symmetric multiprocessing...

Post by Pr0f »

Also, the 68K vector area and etc. are so small, it might be unnecessary to remap the DPRAMs, IF they do not conflict with a ROM image. I haven't tackled that part of the design yet.
Pr0f wrote:you can use the FC0-FC2 to signal a fetch of restart vector - as it's different to other vector fetches - and could remap the dual port ram when a reset is done on the off board processor, but otherwise leaving it alone.
If you'd like to expand on that I'd be interested to read it.
The 68K series CPUs expect to have the restart vector at address 0 - in fact, it's the stack pointer in the first vector and the initial PC in the 2nd vector - it's the only one that's 2 vectors - and it's the only one in supervisor program space (where FC2=1, FC1=1 and FC0=0); the other vectors are all in Supervisor data space (FC2=1, FC1=0 and FC0=1).

If you had a ROM that was mapped into a high address normally, you could also map that ROM into the first 4 bytes of memory if Supervisor Program space was indicated by FC2-FC0, so that the vectors come out of the ROM. There is nothing stopping you doing much the same with the dual port RAM, having it mapped at $00000000 when you reset by using the FC2-FC0 as well as address lines, but there after mapping it elsewhere in memory. You just lose the first 4 bytes of storage.

The standard QL addressing only ever makes use of FC1 and FC0 and uses those to flag interrupt acknowledge using auto vectors; FC2 is never decoded. Since the combination FC2=0 with FC1=1 and FC0=1 was undefined in the earlier models of 68K, it was assumed that FC1=1 and FC0=1 must therefore be signalling an interrupt acknowledge.
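
As a rough illustration of the decoding described above, here is a minimal software model in Python (not GAL equations). The function-code table is the standard 68000 encoding; the names `fc_space` and `is_reset_vector_fetch` are hypothetical, and the sketch assumes the decoder matches the full two-longword reset vector at addresses 0 to 7.

```python
# Software model of 68000 function-code decoding (illustrative only).
FC_SPACES = {
    0b001: "user data",
    0b010: "user program",
    0b101: "supervisor data",
    0b110: "supervisor program",
    0b111: "CPU space (interrupt acknowledge)",
}


def fc_space(fc2, fc1, fc0):
    """Return the address space named by the FC2..FC0 pins."""
    code = (fc2 << 2) | (fc1 << 1) | fc0
    return FC_SPACES.get(code, "undefined/reserved")


def is_reset_vector_fetch(address, fc2, fc1, fc0):
    """True for the fetch of the initial SSP (0..3) or initial PC (4..7).

    Assumption: the decoder qualifies the low addresses with
    "supervisor program" space, so ordinary vector reads
    (supervisor data space) do not match.
    """
    return fc_space(fc2, fc1, fc0) == "supervisor program" and 0 <= address < 8


print(fc_space(1, 1, 0))                   # supervisor program
print(is_reset_vector_fetch(4, 1, 1, 0))   # True
print(is_reset_vector_fetch(4, 1, 0, 1))   # False
```

A decoder that looks only at FC1 and FC0, as the stock QL does, cannot tell `0b011` from `0b111`, which is the shortcut mentioned above.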


Nasta
Gold Card
Posts: 443
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: Symmetric multiprocessing...

Post by Nasta »

Pr0f wrote:
Also, the 68K vector area and etc. are so small, it might be unnecessary to remap the DPRAMs, IF they do not conflict with a ROM image. I haven't tackled that part of the design yet.
Pr0f wrote:you can use the FC0-FC2 to signal a fetch of restart vector - as it's different to other vector fetches - and could remap the dual port ram when a reset is done on the off board processor, but otherwise leaving it alone.
If you'd like to expand on that I'd be interested to read it.
The 68K series CPUs expect to have the restart vector at address 0 - in fact, it's the stack pointer in the first vector and the initial PC in the 2nd vector - it's the only one that's 2 vectors - and it's the only one in supervisor program space (where FC2=1, FC1=1 and FC0=0); the other vectors are all in Supervisor data space (FC2=1, FC1=0 and FC0=1).

If you had a ROM that was mapped into a high address normally, you could also map that ROM into the first 4 bytes of memory if Supervisor Program space was indicated by FC2-FC0, so that the vectors come out of the ROM. There is nothing stopping you doing much the same with the dual port RAM, having it mapped at $00000000 when you reset by using the FC2-FC0 as well as address lines, but there after mapping it elsewhere in memory. You just lose the first 4 bytes of storage.

The standard QL addressing only ever makes use of FC1 and FC0 and uses those to flag interrupt acknowledge using auto vectors; FC2 is never decoded. Since the combination FC2=0 with FC1=1 and FC0=1 was undefined in the earlier models of 68K, it was assumed that FC1=1 and FC0=1 must therefore be signalling an interrupt acknowledge.
Well, you still need to decode addresses and decoding down to the 4 byte granularity requires a lot of address lines.
Also, there are only a few vectors that actually can and will be used - reset, trace, address error, divide by zero and perhaps one interrupt.
Here is one simple way to do this, since there is only the DPRAM and SRAM:
Map DPRAM and SRAM initially (by reset) at the same address, with SRAM shadowing the DPRAM, so that when the node CPU reads, it reads from DPRAM, but when it writes, it writes to both the DPRAM and SRAM at the same address.
The host CPU will then load the boot code and initial vectors from its side of the DPRAM, and release the node reset.
The node will execute the code, part of which is to first copy the code onto itself, resulting in the contents of the DPRAM being copied into SRAM. It then accesses a 'trigger address' (for instance any address above the top of DPRAM), resulting in a re-map so that the first 512k is SRAM (read and write) and the second 512k is (aliased) DPRAM.
When this happens, the code can happily continue running at the next instruction, as it will just be replaced by a copy of itself from SRAM rather than DPRAM, at the same address. So the following code being executed takes care of the rest of the boot process, including requesting further code to be downloaded from the host.
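
The boot sequence above can be modelled in a few lines of Python. This is only a toy sketch of the idea: the class `NodeMemory`, the function `boot`, and the 2K DPRAM size are assumptions made for illustration, and real hardware would do the same thing with chip selects rather than `if` statements.

```python
class NodeMemory:
    """Toy model of the node CPU's view of memory during a shadow boot."""

    DPRAM_SIZE = 2 * 1024      # assumed 2K dual-port RAM
    SRAM_SIZE = 512 * 1024     # 512K private SRAM

    def __init__(self, boot_image):
        self.dpram = bytearray(self.DPRAM_SIZE)
        self.sram = bytearray(self.SRAM_SIZE)
        self.dpram[:len(boot_image)] = boot_image   # host loads its side
        self.remapped = False                       # state after reset

    def _touch(self, addr):
        # Any access above the top of DPRAM acts as the trigger address.
        if not self.remapped and addr >= self.DPRAM_SIZE:
            self.remapped = True

    def read(self, addr):
        self._touch(addr)
        if not self.remapped:
            return self.dpram[addr]                 # reads come from DPRAM
        if addr < self.SRAM_SIZE:
            return self.sram[addr]
        return self.dpram[(addr - self.SRAM_SIZE) % self.DPRAM_SIZE]

    def write(self, addr, value):
        self._touch(addr)
        if not self.remapped:
            self.dpram[addr] = value                # write lands in both
            self.sram[addr] = value
        elif addr < self.SRAM_SIZE:
            self.sram[addr] = value
        else:
            self.dpram[(addr - self.SRAM_SIZE) % self.DPRAM_SIZE] = value


def boot(mem):
    # Step 1: copy the code onto itself (read DPRAM, write DPRAM and SRAM).
    for addr in range(mem.DPRAM_SIZE):
        mem.write(addr, mem.read(addr))
    # Step 2: touch a trigger address just above the top of DPRAM.
    mem.read(mem.DPRAM_SIZE)


image = bytes([0x4E, 0x71, 0x60, 0xFE])   # arbitrary example bytes
mem = NodeMemory(image)
boot(mem)
print(mem.remapped, bytes(mem.sram[:4]) == image)
```

After `boot` the same bytes are read back from address 0, but they now come from SRAM, and the DPRAM appears (aliased) from 512k upward.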

Interrupt can also be made very simple. The QL uses VSYNCH as the source of the poll tick, so an edge triggered FF can be used to be set when this occurs, and drive the IPL1 line, to cause the usual QDOS level 2 interrupt. Since it's the only one, an interrupt acknowledge cycle (detected by FC0,1=11) simply resets the FF, as does hardware reset.
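
A behavioural sketch of that flip-flop, in Python, might look like the following. The class name `PollInterruptFF` is hypothetical, and the sketch assumes that a VSYNC edge arriving in the same step as an acknowledge wins; the real priority depends on how the FF is wired.

```python
class PollInterruptFF:
    """Edge-triggered flip-flop that latches the poll tick (toy model)."""

    def __init__(self):
        self.q = False             # latched interrupt request
        self._last_vsync = False   # previous VSYNC level, for edge detection

    def reset(self):
        """Hardware reset clears the pending request."""
        self.q = False

    def step(self, vsync, fc1=0, fc0=0):
        # Interrupt acknowledge, detected as FC1=FC0=1, clears the latch.
        if fc1 == 1 and fc0 == 1:
            self.q = False
        # A rising edge on VSYNC sets it.
        if vsync and not self._last_vsync:
            self.q = True
        self._last_vsync = bool(vsync)

    def ipl_level(self):
        """Interrupt level seen by the CPU: 2 while latched (IPL1 driven), else 0."""
        return 2 if self.q else 0


ff = PollInterruptFF()
ff.step(vsync=1)
print(ff.ipl_level())            # 2: request pending
ff.step(vsync=1, fc1=1, fc0=1)   # CPU acknowledges
print(ff.ipl_level())            # 0: cleared until the next VSYNC edge
```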

Some logic is needed to decode all this: about half a small GAL and a HCT74 dual FF chip per node. Things can get a bit more involved if there are more interrupt sources; some DPRAM chips have extra resources to implement interrupt driven mailboxes on-chip, which can be used for such a thing.


Dave
SandySuperQDave
Posts: 2765
Joined: Sat Jan 22, 2011 6:52 am
Location: Austin, TX

Re: Symmetric multiprocessing...

Post by Dave »

Nasta wrote:...needs some refinement to the hardware if it's going to run useful code. Some of it can be implemented using extra features present on DPRAM chips, depending on the actual chip. Also, some form of interrupt is needed if any kind of kernel is to run, to provide a periodic tick. Similar to QDOS, at least the poll interrupt must be implemented.
tofro wrote:the periodic tick (the polling interrupt) is the heartbeat of any QDOS-based system. If you ever want to be able to implement anything QDOS-like, you'll need a timer interrupt on all CPUs.
This, to me, is the most encouraging thing in this thread. Two people who have high status saying "if you want to do X you'll need to implement Y..." Thank you.

Yes, a poll interrupt will be provided. I did not consider it, but will add it to my plans. Do you think I can add a global interrupt, or should it be local to each processor chain? In a case where a 68000 is running just an assembly program with no OS support, is it preferable to have a faster or slower poll interrupt?
pr0f wrote:The 68K series CPUs expect to have the restart vector at address 0 - in fact, it's the stack pointer in the first vector and the initial PC in the 2nd vector - it's the only one that's 2 vectors - and it's the only one in supervisor program space (where FC2=1, FC1=1 and FC0=0); the other vectors are all in Supervisor data space (FC2=1, FC1=0 and FC0=1).
I understand what you're saying. I just don't know how to implement something in hardware to create anything useful for you. My current plan is to ALWAYS have SRAM from $0, and have a toggle state for the 2K DP RAM where it is mapped to $0 or $80000 and will alias every 2K. The only reason to provide the relocation feature is because it might be asking too much for people to edit QDOS or Minerva to create a hole in the bottom 2K for message passing.
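
To make that plan concrete, here is a minimal Python sketch of the guest-side address decode. The function name `decode` and the `dpram_low` flag are hypothetical, and the sketch assumes that the DPRAM overlays the bottom 2K of SRAM when mapped at $0 and that nothing responds from $80000 upward in that state.

```python
SRAM_SIZE = 0x80000     # 512K private SRAM, always based at $0
DPRAM_SIZE = 0x800      # 2K dual-port RAM


def decode(addr, dpram_low):
    """Return (device, offset) for a guest-CPU address in the 1M space.

    dpram_low=True  -> DPRAM overlays the bottom 2K at $0
    dpram_low=False -> DPRAM sits at $80000 and aliases every 2K
    """
    if dpram_low and addr < DPRAM_SIZE:
        return ("dpram", addr)
    if addr < SRAM_SIZE:
        return ("sram", addr)
    if not dpram_low:
        return ("dpram", addr % DPRAM_SIZE)
    return ("none", addr)   # assumption: unmapped while DPRAM is at $0


print(decode(0x00010, dpram_low=True))    # ('dpram', 16)
print(decode(0x00010, dpram_low=False))   # ('sram', 16)
print(decode(0x80810, dpram_low=False))   # ('dpram', 16)
```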

I did consider having a ROM, but adding four ROMs with identical contents struck me as wasteful though simple, and adding a single ROM, shared by the four CPUs is an unsolved problem for me. It does make a lot of sense to have four ROM or FLASH 64K blocks, one on each CPU, mapped from 0 as this does provide a useful feature if there's known code that will go there. However, this board is very speculative experimentation, and the range of code ideas people might want to load is much wider than simply a fixed code base. Since it would not be possible to in system reprogram the ROM/Flash easily, it just creates a bit of an inflexible headache for users while no such code exists.
Nasta wrote:Well, you still need to decode addresses and decoding down to the 4 byte granularity requires a lot of address lines.
You have shared access to my schematic, so you will see that I used two QL-side GALs. One is labeled "decoder" and uses A19..A9. It has outputs identifying accesses in the 2K blocks of each CPU, plus a "carry" to a second GAL, labeled "Reset manager", which has A8..A0 so can resolve an individual byte with a 15ns decode time. That GAL will have four flip flop outputs going to the guest CPU-side GALs indicating the mapping state of the DPRAM as 1 ($0) or 0 ($80000). It's a bit brute force, but it keeps the decoding chain really short. I considered using a small EPROM and four flip flops instead, but I was less confident about its flexibility and adding features later.

My general thoughts are to get basically functional programmed logic, then move that to a single CPLD (well, a Cypress PSoC 5LP)...

I did originally intend to just use one of the unused top address lines A23..A20 as the toggle for the DPRAM position, but realised this could only be done from the guest CPU side, not by the host CPU - so if you had a problem with your code the situation would be unrecoverable except by reset. The lack of transparency bothered me, even though I knew a reset with DPRAM back at $0 would allow someone to go through the memory contents to post-mortem, but they would not be able to recover the CPU register and flag states. I'm not sure that is possible with my approach either, but the possibility exists as long as the CPU is still running - and I haven't provided a mechanism to halt it.

I have a GAL on each CPU that uses local A18..A9 to decode to 2K blocks for chip select, plus an input from the flip flop on the QL-side GAL so the DP RAM can be remapped. Do you think 4 free outputs and three free inputs is enough to implement what you describe?

My final thought is, having four CPUs and supporting hardware working from a common clock is going to be murder on the power supply. I am considering it might be interesting to put the CPU clock through an inverter so two chains are on a beat and the other two are on the opposite beat. I could use a single inverter IC to introduce a ~6 ns delay per stage so all four clocks are out of step with each other.
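
As a back-of-envelope check on the staggering idea, the Python sketch below works out where each clock's rising edge would land within one period. Everything here is an assumption for illustration: the function name `stage_offsets_ns`, the 8 MHz example clock, the choice that the first chain gets the undelayed clock, and the model of each inverter stage as a half-period flip plus a fixed 6 ns delay.

```python
def stage_offsets_ns(clock_mhz, stages=4, delay_ns=6.0):
    """Rising-edge offset of each chained inverter stage, in ns.

    Stage 0 is the raw clock. Each later stage adds delay_ns, and every
    odd stage is inverted, which shifts its rising edge by half a period.
    """
    period = 1000.0 / clock_mhz
    offsets = []
    for k in range(stages):
        inversion = period / 2 if k % 2 else 0.0
        offsets.append((k * delay_ns + inversion) % period)
    return offsets


print(stage_offsets_ns(8))   # [0.0, 68.5, 12.0, 80.5]
```

With these assumed numbers no two chains switch on the same edge, which is the effect being aimed for; the actual delay per stage depends on the inverter family and loading.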

I have definitely decided that, for usefulness, a future board would convert the bus to 3.3V, use a 64Kx16 DPRAM accessed word on the guest side and even/odd on the QL side, with 3.3V SRAM. This is helpful with the power situation, but also will reduce EMI considerations a little. Also, 64K is enough that people may never need to remap it, and is also enough that they might never need to use the 512K or 16M. Interestingly, the people asking questions and making suggestions (thank you pr0f, Tobias and Nasta) haven't suggested or questioned anything about the memory sizes, so I think I am good there. I was worried about it because I see people either need very little (a few K to run a tight loop) or a huge amount to work on larger data sets. One option is to have a 16M DRAM IC, but again that pushes me to 3.3V.

BTW: to be clear, this is an open hardware project licensed under the CERN Open Hardware License v1.2. Firmware and software are expressly included under that license.


Dave
SandySuperQDave
Posts: 2765
Joined: Sat Jan 22, 2011 6:52 am
Location: Austin, TX

Re: Symmetric multiprocessing...

Post by Dave »

And, a pivot!

So, I made up a breadboard single CPU chain with a 68008FN16, and I got it all basically working to where I can put an image in the DPRAM and have the 68008 execute it.

It taught me a couple of things...

2K is a very small window. If you want to copy anything across into the CPU's private RAM, you're looking at copying 1K blocks at a time. With a word pointer, that's 65,536 1K blocks, or 64 MB. The 68SEC000 is permanently locked into 8-bit mode.
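
The arithmetic above, and the block-at-a-time copy it implies, can be sketched in Python as follows. This reads "word pointer" as a 16-bit block number, which is an assumption, and `split_into_blocks` is a hypothetical helper, not part of any existing code.

```python
WINDOW_BYTES = 2 * 1024   # size of the DPRAM window
BLOCK_BYTES = 1024        # transfer unit: half the window
POINTER_BITS = 16         # assumed: a word-sized block number

blocks = 2 ** POINTER_BITS
addressable = blocks * BLOCK_BYTES
print(blocks, addressable // (1024 * 1024))   # 65536 64


def split_into_blocks(data, block_size=BLOCK_BYTES):
    """Yield (block_number, chunk) pairs for a transfer through the window."""
    for offset in range(0, len(data), block_size):
        yield offset // block_size, data[offset:offset + block_size]


for number, chunk in split_into_blocks(bytes(2500)):
    print(number, len(chunk))
```

Using half the window per block leaves the other half free, for example for the block number and a handshake flag, though how the window is actually divided is a design choice not settled here.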

I have some 3.3 volt 64Kx16 DP SRAMs (yes, 128K!) that can use each byte separately, so on one side it can be accessed as 64Kx16 and on the other side as 128Kx8. They're a much smaller QFP package. I think I need to go back and redesign the board around that DPRAM.

This would double the throughput of the CPU, massively increase the size of the portal, and in simpler systems no address decoding is required at all, as the DPRAM could be the entire system memory.

I also looked at 3.3V GALs, but have been playing recently with Cypress PSoC 5LP devices. They have a 1.8V - 5V ARM Cortex M3 and smaller CPLD combined into a single device, with enough functionality to be an excellent candidate to replace the 8302/8049, and it looks like the 5LP might be able to provide some more advanced logic and some interesting interfaces. This might allow IO to be placed under control of an external processor, with good pre-processing capability. So there are spin-off applications of playing with this that might bear fruit, if the DPRAM were mapped in over the IO area.

Anyway, I am going to set this aside for a little while to let the ideas percolate and resolve, and get back to focusing on the serial and Issue 8 board.


Dave
SandySuperQDave
Posts: 2765
Joined: Sat Jan 22, 2011 6:52 am
Location: Austin, TX

Re: Symmetric multiprocessing...

Post by Dave »

On the multi-CPU board, I have reached a decision point. Let me lay it out. I have a choice between:

4-CPU 8-bit bus, 2K per CPU DPRAM and 512K private RAM giving 514K total, DPRAM remap feature needed
Four CPUs have 2K dual port SRAM accessed via 8-bit bus that can be at 0K or 512K, plus a 512K SRAM. ALL exchanges between CPUs are through the 2K dual port RAM (i.e., through the main CPU) and all busses are 8-bit.

OR

2-CPU 16-bit bus, 128K per CPU DPRAM and 128K shared DPRAM giving 256K total, no DPRAM remap needed
Two CPUs have 128K dual port SRAM accessed via 16-bit bus at $0, with a second 128K DPRAM mapped in at 128K, connecting the two CPUs together directly. So, they have 128K shared with the QL and 128K shared with each other - or divisible between them in any manner the programmer wishes to use - but that 128K is transparent to both CPUs.
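
For comparison, the per-CPU memory totals of the two options, and the guest-side map of the second one, can be tabulated with a short Python sketch. The region names and the helpers `total_k` and `contiguous_map` are made up for illustration; the sizes are the ones quoted above.

```python
K = 1024

# (name, size in bytes) for each option, as seen by one guest CPU.
OPTION_1 = [("DPRAM shared with QL", 2 * K), ("private SRAM", 512 * K)]
OPTION_2 = [("DPRAM shared with QL", 128 * K),
            ("DPRAM shared between guest CPUs", 128 * K)]


def total_k(regions):
    """Total memory visible to one guest CPU, in K."""
    return sum(size for _, size in regions) // K


def contiguous_map(regions, base=0):
    """Lay regions out back to back from base; returns (name, start, end)."""
    result = []
    start = base
    for name, size in regions:
        result.append((name, start, start + size - 1))
        start += size
    return result


print(total_k(OPTION_1), total_k(OPTION_2))   # 514 256
for name, start, end in contiguous_map(OPTION_2):
    print(f"${start:05X}-${end:05X}  {name}")
```

Only the second option is laid out contiguously here, since in the first option the DPRAM position depends on the remap state.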

Block example:
IMG_2110.jpg
How the memory sharing is handled is critical to throughput. I fear that trusting a QL CPU at 7.5MHz to make meaningful inroads with four CPUs as their sole means of inter-communication would leave precious few resources available for updating the display, etc.

By my math, the doubled bus width doubles the throughput of the CPUs, so only two are required. As transfers can be made between the CPUs, overhead on the host system can be dramatically reduced for some tasks. But the 128K DPRAM is expensive so the 2 CPU 16-bit system would probably cost about $30 more than the 4 CPU system. CY7C028-20AXI are quite expensive but I have a tray ;)

The other challenge is, on an <=Issue 7 QL, where does one map a 256K expansion that is not part of continuous memory?

What do you think? What implications do the two choices have? What might I have not considered that I should?


Derek_Stewart
Font of All Knowledge
Posts: 3928
Joined: Mon Dec 20, 2010 11:40 am
Location: Sunny Runcorn, Cheshire, UK

Re: Symmetric multiprocessing...

Post by Derek_Stewart »

Hi Dave,

I like both proposals, but could you detail the possible advantages of each system.

Is each CPU able to be programmed for separate tasks?


Regards,

Derek
Dave
SandySuperQDave
Posts: 2765
Joined: Sat Jan 22, 2011 6:52 am
Location: Austin, TX

Re: Symmetric multiprocessing...

Post by Dave »

Derek_Stewart wrote:Hi Dave,
Hi Derek!
Derek_Stewart wrote:I like both proposals, but could you detail the possible advantages of each system.
That I have thought of by myself, without the assistance of the far smarter people here....

Option 1 upsides:
More CPUs. If you're running single threaded, non-multitasking code with no time-sharing capability, this lets you maintain four threads.

Option 1 downsides:
CPUs can only share results or pass them on by passing them to the main processor in the host QL, at 7.5MHz clock.
8-bit bus for the faster guest CPU operations.

Option 2 upsides:
The two CPUs can share information with each other, ignoring the host processor. For example, you could develop a virtual system on CPU1, and monitor its activity with CPU2. Every CPU can exchange data directly with every other.
16-bit data bus - two times the throughput.
It's a faster DPRAM, so higher speeds would be supported.
I can source 512Kx8 SRAMs if more local memory is required.
3.3V - less power draw and heat.

Option 2 downsides:
Fewer CPUs.
64Kx16 dual port SRAMs are VERY expensive. $45 each. $135 for three. Luckily, I have a tray that was acquired some time ago for much less.
Derek_Stewart wrote:Is each CPU able to be programmed for separate tasks?
You're free to use the processors for any purpose and in any way you wish. It's also Open Source hardware, so if you had a specific purpose that needed a particular tweak or extra feature, you have the complete framework to build it.

I'd help.

