Floating a thought to understand the issues...

Dave
SandySuperQDave
Posts: 2765
Joined: Sat Jan 22, 2011 6:52 am
Location: Austin, TX

Re: Floating a thought to understand the issues...

Post by Dave »

Pr0f wrote: Those DUART chips still have the same kind of ISR built in, but those that use Motorola interrupt mode expect the vectored interrupt system, which has basically been disabled on the QL platform.
That gave rise to an earlier question to Nasta about this very point. I mused that it would be useful if IPL0/2 were separated, and the OS massaged a little so it at least understood levels of interrupt and proper use of vectors.

But then I'm a nutjob who thinks keyboard input should create an interrupt instead of being polled constantly. I mean, even if you type as fast as you can, the keys will have changed state on no more than 0.0001% of the occasions the IO routine polls the keyboard.

Another reason why it would do well on a separate CPU.


Peter
QL Wafer Drive
Posts: 1953
Joined: Sat Jan 22, 2011 8:47 am

Re: Floating a thought to understand the issues...

Post by Peter »

Dave wrote:Another reason why it would do well on a separate CPU.
No CPU needed... just a case for hardware FIFOs ;)


tofro
Font of All Knowledge
Posts: 2685
Joined: Sun Feb 13, 2011 10:53 pm
Location: SW Germany

Re: Floating a thought to understand the issues...

Post by tofro »

Dave wrote:
Another reason why it would do well on a separate CPU.
The keyboard is handled by a separate CPU on the QL

Tobias


ʎɐqǝ ɯoɹɟ ǝq oʇ ƃuᴉoƃ ʇou sᴉ pɹɐoqʎǝʞ ʇxǝu ʎɯ 'ɹɐǝp ɥO
Dave
SandySuperQDave
Posts: 2765
Joined: Sat Jan 22, 2011 6:52 am
Location: Austin, TX

Re: Floating a thought to understand the issues...

Post by Dave »

I have 300x 2Kx9 FIFOs... https://4donline.ihs.com/images/VipMast ... 2793-1.pdf

Using one for input. If you read the datasheet it works as an automatic self-managing ring buffer, so I can just read through a 2K loop.

It's nice.


Dave
SandySuperQDave
Posts: 2765
Joined: Sat Jan 22, 2011 6:52 am
Location: Austin, TX

Re: Floating a thought to understand the issues...

Post by Dave »

tofro wrote:The keyboard is handled by a separate CPU on the QL
Haha, well, yes. You got me there!


tofro
Font of All Knowledge
Posts: 2685
Joined: Sun Feb 13, 2011 10:53 pm
Location: SW Germany

Re: Floating a thought to understand the issues...

Post by tofro »

;)


ʎɐqǝ ɯoɹɟ ǝq oʇ ƃuᴉoƃ ʇou sᴉ pɹɐoqʎǝʞ ʇxǝu ʎɯ 'ɹɐǝp ɥO
Nasta
Gold Card
Posts: 443
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: Floating a thought to understand the issues...

Post by Nasta »

Let me backtrack here a bit, re 'blocking' and 'non-blocking' tasks.
Once a second CPU is in the picture to handle IO, almost all tasks become 'non-blocking' in 'Dave-speak' :)
The reasons some things take over the entire machine are mostly historic. The underlying principle, though, is always present, and it rears its head whenever something like interrupt handling overhead takes up more time than is available between data transfers. This of course changes depending on the required data rate and CPU speed.

I will give two examples, one very relevant to the UART questions.

The first is a floppy controller. Since it does not have any sort of read/write buffer, data transfers from RAM to disk and back have to happen completely in real time. In effect, once a 'read sector' or 'write sector' command is given (used here as the quintessential example for the problem at hand), the CPU polls for data by looking at a bit in a status register and then reading or writing data when the bit state says that data is available or required, as the case may be.
All floppy controller chips ever used on the QL can generate interrupts for the same condition, basically an 'I need/have data' interrupt.
Two problems happen here:
1) Interrupt overhead plus processing takes more time than there is between data requests, or close to it. The underlying problem in our example is that different floppy bit densities require different reaction speeds. HD floppy is already fast enough that the old QL would not be able to process it under interrupt. It should be noted that when the interrupt occurs, the handler already knows a few things: whether it is for read or write, where the data is coming from and where it needs to go, and the data count. If one were to write a short snippet of code that does this, it would indeed be very short - a few instructions. But the OS arrives at them only after going through several layers of other code, plus the CPU's own interrupt latency and processing overhead. Compared to all of that, the actual data move takes only a VERY small fraction of the time.
2) Suppose the system CAN in principle cope with the frequency of interrupts. The problem that arises now, given this is a multitasking OS, is that certain operations (mostly resource management or driving other peripherals) need intervals during which the executing code must not be pre-empted, i.e. must not be interrupted, and the most common way to achieve this is to mask the relevant interrupt. Since there is really only one interrupt level on the QL, and masking of it is sometimes abused, this adds an almost completely unpredictable component to interrupt latency (the time from an interrupt being raised to it being handled), which may break normal IO operation for such devices.
Once we are in danger of this, the only available solution is to switch to direct polling. And yes, because we do not want anything to interrupt the data transfer, or data may be lost, interrupts are disabled during that time and the machine is prevented from doing anything except the data transfer. However, if we examine the data transfer loop, it spends most of its time spinning on the relevant status bit, waiting to learn when data is needed or available, and then uses one instruction to transfer the data.
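To make the shape of that loop concrete, here is a minimal sketch in Python rather than 68K assembly. It is a toy model, not any real QL driver: `FakeFloppyController`, `polls_per_byte` and `read_sector_polled` are names made up for illustration, and the "status register" is just a counter that makes a byte ready every few reads.

```python
class FakeFloppyController:
    """Toy model (hypothetical): one data byte becomes ready every
    `polls_per_byte` reads of the status register."""

    def __init__(self, sector: bytes, polls_per_byte: int):
        self._sector = sector
        self._polls_per_byte = polls_per_byte
        self._index = 0
        self._polls_since_byte = 0

    def data_ready(self) -> bool:
        # Stands in for reading the status register and testing the
        # 'data request' bit.
        self._polls_since_byte += 1
        return self._polls_since_byte >= self._polls_per_byte

    def read_data(self) -> int:
        # Stands in for the single move instruction from the data register.
        byte = self._sector[self._index]
        self._index += 1
        self._polls_since_byte = 0
        return byte


def read_sector_polled(fdc: FakeFloppyController, length: int):
    """Spin on the status bit and move one byte whenever it is set.

    On real hardware interrupts would be masked for the whole loop.
    Returns the data plus the number of status reads that were needed.
    """
    buffer = bytearray()
    status_reads = 0
    while len(buffer) < length:
        status_reads += 1
        if fdc.data_ready():
            buffer.append(fdc.read_data())
    return bytes(buffer), status_reads
```

With 8 status reads per byte, a 512 byte sector costs 4096 status reads for 512 actual data moves, which is the point: almost all of the loop is waiting.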

The second example is a serial port. Old type serial chips could also cause an interrupt when data was available or requested, and this was also how handshake was handled. The 16550 is a notorious example, as it is in fact derived from an even older and simpler 8250. Unlike a floppy controller, the serial port can influence data transfer by using handshake (that is of course, if it is used and enabled in the first place). Even so, once the data rate increases, we run into the exact same problem as with the floppy controller, with the one difference that we CAN stop the data flow in order to give the CPU some breathing room, IF we can react soon enough between byte transfers. This in fact still requires the very same guaranteed short interrupt latency as a floppy controller, except it has to operate on a (several) byte level, rather than on a floppy sector level (2^N x 512 bytes). For most cases it's the same problem, by the 'if you can do it for one byte, you can do it for many' principle, as the whole process operates on a byte by byte basis.
The obvious question here is whether there is a way to work on a 'several bytes by several bytes' level. An attempt was made in the update from the 16450 serial port chip to the 16550, where FIFO buffers (16 bytes) were added in the data path. The idea was that when a byte is received, an interrupt is generated, and if data continues to arrive while the CPU has not yet reacted to the interrupt, there is a buffer capable of receiving it without data loss. As long as the CPU reacts before 16 bytes have been received, things will be fine. Also, to make it possible to do things on a several-bytes rather than a per-byte level, a FIFO threshold system was added so that an interrupt does not happen until a number of bytes have been received. This lowers the frequency of the interrupt, at the cost of requiring a shorter latency, because less room is left in the FIFO once the interrupt fires. Unfortunately it was one of the more half-a**ed attempts: it has no hardware handshake, so handshake still has to be handled by interrupt, and that is a problem for transmitting bytes, where the FIFO provided loses practically all utility. Fortunately, even newer versions FINALLY implemented real hardware handshake, preventing data loss while still relaxing latency requirements for the CPU.
The FIFO buffers capitalize on the capability of the CPU to transfer data quickly once it does react to the interrupt - and since the reaction time is usually the major part of the time the CPU uses to transfer X amount of data, you get significantly more X in almost the same time by handling more than one byte of data per interrupt.
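The trade-off can be put into numbers with a back-of-the-envelope sketch. The function name and the figures in the usage note (50 microseconds to get into the handler, 2 microseconds to move a byte) are assumptions picked purely for illustration, not measurements from any QL.

```python
def cpu_time_per_byte(latency_us: float, move_us: float,
                      bytes_per_interrupt: int) -> float:
    """Average CPU time spent per byte when one interrupt moves
    `bytes_per_interrupt` bytes.

    latency_us: fixed cost of reaching the handler (assumed figure)
    move_us:    cost of moving a single byte once inside the handler
    """
    total_us = latency_us + move_us * bytes_per_interrupt
    return total_us / bytes_per_interrupt
```

With the assumed figures, `cpu_time_per_byte(50, 2, 1)` gives 52 microseconds per byte, while `cpu_time_per_byte(50, 2, 14)` gives roughly 5.6, so the fixed reaction cost is spread over the whole FIFO load.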

Now I will return for a short while to the floppy controller example - for one, because 1Mbit/s transfer rates are not unheard of with modern serial ports, and that just happens to be the transfer speed of a HD floppy. The latest generation of floppy controllers (and I think this includes the one on the GC/SGC) also include a FIFO in the data path - considering the floppy works on sector-sized chunks of data, this could be really useful, but it was never used on the QL. Fortunately, when hard drives became available, the data transfer speed was already high enough that the designers thought to include a sector (or multi-sector) buffer, and the CPU got interrupted only when it had been filled with newly read data or when the data filled in by the CPU had been written to disk. There is more, see below.

Finally, let me also write a bit on a method that can also be used to handle the problems above (and this is VERY relevant for the QL).
Given that there is a maximum speed for the above peripherals that we can calculate in advance, we can also calculate the worst-case time it takes to fill up a FIFO. For instance, if we decide the serial port runs at a maximum of 1Mbit/s, that translates to a 100kbytes/s transfer rate (10 bits per byte on the wire, counting start and stop bits), and if the receive FIFO is 16 bytes, it can fill up at most 6250 times a second (100000/16). So, if we generate a periodic interrupt at a rate that is always higher than that, we can use it to poll such devices for a service request - and in fact more than one device, as transferring up to 16 bytes from the FIFO takes only a very short time. You use one single interrupt latency on every poll, then go through a list of devices that may have received data within that time frame, and transfer it if it has indeed been received. As far as I remember, this is the way the Q40/60 and probably the Q68 work, by using a much higher frequency of poll interrupt.
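The arithmetic above, written out as a small sketch so other rates and FIFO depths can be tried. The function names are made up here, and the default of 10 bits per byte assumes one start bit, 8 data bits and one stop bit.

```python
def fifo_fills_per_second(bit_rate: float, bits_per_byte: int = 10,
                          fifo_depth: int = 16) -> float:
    """Worst-case number of times per second the receive FIFO can fill."""
    bytes_per_second = bit_rate / bits_per_byte
    return bytes_per_second / fifo_depth


def max_poll_interval_us(bit_rate: float, bits_per_byte: int = 10,
                         fifo_depth: int = 16) -> float:
    """Longest gap between polls that still empties the FIFO in time."""
    return 1_000_000 / fifo_fills_per_second(bit_rate, bits_per_byte, fifo_depth)
```

For 1Mbit/s and a 16 byte FIFO this gives 6250 fills a second, i.e. the poll interrupt has to come round more often than once every 160 microseconds.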
In fact, I wish more capable QL OS systems had a fast poll interrupt along with a slow poll one. The fast one would ONLY ever do data transfers, and the interrupt handler linked list would basically have a prototype handler that gets a pointer to the register structure of the IO device, knows what to do to check if the device needs to be serviced, transfers X data between data port and memory, and that's it. It has to be as short as possible, especially the 'check for service' part. It can also have a limit on how much data it can transfer in one poll. If finer control is needed over which device gets more or less bandwidth, the maximum amount of data transferred can be adjusted, as well as the number of times and the positions at which the device is linked into the poll list (an interesting technique TT himself discussed once regarding his Stella RTOS).
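A rough model of such a fast poll list, in Python for readability. This is only a sketch of the idea as described, not how any existing QL OS does it: `PolledDevice`, `max_per_poll` and `fast_poll` are hypothetical names, and a `deque` stands in for the hardware FIFO.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class PolledDevice:
    """Hypothetical stand-in for one IO device's register structure."""
    name: str
    max_per_poll: int = 16                                # per-poll transfer limit
    rx_fifo: deque = field(default_factory=deque)         # models the hardware FIFO
    memory: bytearray = field(default_factory=bytearray)  # destination buffer in RAM

    def needs_service(self) -> bool:
        # The 'check for service' part: must be as cheap as possible.
        return bool(self.rx_fifo)

    def service(self) -> int:
        # Move at most max_per_poll bytes from the FIFO to memory.
        moved = 0
        while self.rx_fifo and moved < self.max_per_poll:
            self.memory.append(self.rx_fifo.popleft())
            moved += 1
        return moved


def fast_poll(poll_list) -> int:
    """One tick of the fast poll interrupt: walk the list, transfer, return."""
    total = 0
    for device in poll_list:
        if device.needs_service():
            total += device.service()
    return total
```

Linking a device into the list more than once, e.g. `fast_poll([serial, disk, serial])`, gives it two service slots per tick, which is the bandwidth trick mentioned above.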

One step above that (but possibly still internally using the same principle) is using an entirely separate CPU to do exactly the above, or indeed react to interrupts directly. It is HIGHLY advantageous if that CPU executes its interrupt handling code in separate memory, or even better, on-chip cache. If this is done, the main CPU is only ever bothered for actual data transfers (or control register reads), it never sees IO related interrupts, the system does not see a cost of interrupt reaction latency (it's the other CPU spending time on that), and data transfers can be very fast and also (unlike DMA, which was kind of a beta version of this idea) as clever as you need them to be.

An alternative solution is to again use a second CPU that handles IO entirely, and communicates with the main CPU through a common memory buffer or a set of FIFO buffers - in other words, that second CPU or uC then buffers the data, handles the protocols, and in general takes care of the particulars of the IO devices, while presenting a simplified 'abstraction' to the main CPU, and only bothering it when the low level tasks of assembling data into a buffer or transmitting data from a buffer has been done - or the main CPU can also poll the status of the buffers.
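The 'common memory buffer' between the two processors is typically some form of ring buffer. Below is a minimal single-producer, single-consumer sketch, assuming the IO processor only ever calls `put` and the main CPU only ever calls `get`. The class and method names are made up for illustration, and a real shared-memory version would also have to worry about bus arbitration and access ordering, which this toy ignores.

```python
class RingBuffer:
    """Toy single-producer/single-consumer ring buffer (hypothetical).

    One slot is always left empty so that 'full' and 'empty' can be
    told apart using only the two indices.
    """

    def __init__(self, size: int):
        self._data = bytearray(size)
        self._size = size
        self._head = 0  # next slot the IO processor writes
        self._tail = 0  # next slot the main CPU reads

    def put(self, byte: int) -> bool:
        nxt = (self._head + 1) % self._size
        if nxt == self._tail:
            return False          # full: producer must hold off (handshake)
        self._data[self._head] = byte
        self._head = nxt
        return True

    def get(self):
        if self._tail == self._head:
            return None           # empty: nothing for the main CPU to do
        byte = self._data[self._tail]
        self._tail = (self._tail + 1) % self._size
        return byte

    def available(self) -> int:
        # What the main CPU would read when polling the buffer status.
        return (self._head - self._tail) % self._size
```

The main CPU can either poll `available()` at its leisure or be interrupted once a threshold is crossed; either way it only ever sees assembled data, never the device registers.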

It goes without saying that both of the above approaches can be (and often are) used together (in fact the latter one is precisely how most mass storage devices work internally), and either of them lightens the load on the main CPU in a system quite significantly.


Dave
SandySuperQDave
Posts: 2765
Joined: Sat Jan 22, 2011 6:52 am
Location: Austin, TX

Re: Floating a thought to understand the issues...

Post by Dave »

We need a "like" feature added to the forum. Desperately. Every now and then a post comes along that's so insightful, so informative and so entertaining....

https://github.com/satanasov/postlove

(I'd suggest changing the small heart to a small QFP chip!)

Thanks Nasta.

One of the neat aspects of a multi-processor system would be the option to run different versions of SMSQ or Qdos on each processor. The "IO" processor could run a full featured OS but would most likely run just a few bytes of very tightly written assembly. If it were a 68030 this loop could exist almost entirely inside internal or external cache.

If *I* were building this system I would provide 256K of dual port RAM in the full width of the bus. The A port would belong exclusively to the IO controller, and the B port would be shared between the workload CPUs. Contention would only be managed for that block - all other areas would be private memory and the decoders would not have to take any special measures. Alternatively (or additionally), I would provide each system with its own 10/100 ethernet and allow them to communicate through a switch.

For extra fun factor, I would probably design this as a VMEbus or NuBus system - something 68K friendly, but with wider market applications.

But no matter. This is just musing about what a multi-headed QL might look like.
100x 68SEC000
It's not like I have a lot of CPUs or anything.


Pr0f
QL Wafer Drive
Posts: 1298
Joined: Thu Oct 12, 2017 9:54 am

Re: Floating a thought to understand the issues...

Post by Pr0f »

That dual ported RAM idea is effectively what the TUBE was on the BBC - but that was only a small dual ported RAM with interrupt lines for both sides to indicate that changed data was available.

Some improvement in Interrupt handling can be done with the 'status affects vector' mechanism that z80 I/O and some of the 68K peripherals use.

These rely on parts of the interrupt vector being set by the interrupting condition within the chip (for serial it might be Frame Error, parity error, buffer over and under run etc) and so the generated vector would point to a group of vectors each of which targets the specific code to deal with the condition. Little or no need to interrogate an ISR to find out what caused the interrupt. The down side is that almost all of the user vector space is clobbered by QDOS code in the bbql. But it is possible that FC0-2 decoding could provide a way to map over the top of that...
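As a rough illustration of the mechanism, the sketch below shows a chip replacing the low bits of a programmed base vector with a code for the interrupting condition, so the CPU lands directly in the right handler. The base vector, the 2-bit condition encoding and the handler names are all made up for this example and do not match any particular chip; the only hard facts used are that 68000 user interrupt vectors start at vector 64 (0x40) and that each vector occupies 4 bytes in the table.

```python
BASE_VECTOR = 0x40  # hypothetical base programmed into the chip (first user vector)

# Hypothetical condition codes; real chips define their own encoding.
CONDITION_NAMES = {
    0: "tx buffer empty",
    1: "rx data available",
    2: "framing error",
    3: "parity error",
}


def effective_vector(base: int, condition_code: int, code_bits: int = 2) -> int:
    """Vector the chip would supply: base with its low bits replaced by the code."""
    mask = (1 << code_bits) - 1
    return (base & ~mask) | (condition_code & mask)


def vector_address(vector_number: int) -> int:
    """Address of the vector table entry on a 68000 (4 bytes per vector)."""
    return vector_number * 4


# One entry per condition, so no status register needs to be read on entry.
HANDLERS = {
    effective_vector(BASE_VECTOR, code): name
    for code, name in CONDITION_NAMES.items()
}


def dispatch(vector_number: int) -> str:
    """Stand-in for the CPU jumping through the vector table."""
    return HANDLERS[vector_number]
```

Here a framing error (code 2) produces vector 0x42, fetched from address 0x108, and `dispatch(0x42)` goes straight to the framing error handler. It also shows why this eats vectors quickly: one device with four conditions already takes four table entries.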


Nasta
Gold Card
Posts: 443
Joined: Sun Feb 12, 2012 2:02 am
Location: Zapresic, Croatia

Re: Floating a thought to understand the issues...

Post by Nasta »

Pr0f wrote: Some improvement in Interrupt handling can be done with the 'status affects vector' mechanism that z80 I/O and some of the 68K peripherals use.
...Little or no need to interrogate an ISR to find out what caused the interrupt. The down side is that almost all of the user vector space is clobbered by QDOS code in the bbql. But it is possible that FC0-2 decoding could provide a way to map over the top of that...
Now that's an interesting point: interrupt acknowledge does access 'CPU space' (FC0..2=111), so that could be decoded 'elsewhere'. The only problem with such systems is somehow abstracting them for a configurable system - there are only so many vectors, which is why most OSs don't use this approach any more on 'universal' hardware, although it is still used inside micro-controllers.

