Bad Apple demos

mk79 · Post by **mk79** » Wed Nov 20, 2019 12:25 pm

QLvsJAGUAR wrote:The other day I spent some time to run a Bad Apple video on the Q68, see the result:
https://youtu.be/d7HPuAyHx5A

Ah, I was very surprised that my own player continued to play the sound while skipping frames, but now I see that you downsized the video to 3fps for this.

Cheers, Marcel

stephen_usher · Post by **stephen_usher** » Wed Nov 20, 2019 2:53 pm

QLvsJAGUAR wrote: @Stephen, did you continue to work on your project?

No, I thought it probably a dead end given Marcel's far faster rendering, though I did post a link to the source and test files so anyone who has a C68 cross compiler can play.

stephen_usher · Post by **stephen_usher** » Wed Nov 20, 2019 3:10 pm

mk79 wrote:
stephen_usher wrote:
Derek_Stewart wrote:Nice video, just shows the difference between C and assembler.
Not quite, it's more akin to the difference between interpreted code and compiled.

My code was reading what to put into the bytes on the screen and writing them one at a time. mk79 wrote a program which interpreted the same sort of differencing and translated the changes into a series of 68000 instructions (compiled) which were then loaded from the file and called (I assume) as a subroutine which then acted upon the memory directly. (This is what I gathered from one of his previous posts.)
Well, actually when considering the SuperGoldCard which I used in the end for my video I expect the "interpreter" approach might even be faster than my "compiler" approach. The reason being that the compiler approach essentially executes 33MB of code (that has to come through the 8-bit bus bottleneck), rendering the code cache completely useless. The "interpreter" approach needs to execute MORE code, which on one hand is bad, but with a tight decoder loop (< 256 bytes) the code could be executed entirely from cache, freeing memory bandwidth for the data, which is a huge plus.
The equation is entirely different if you do not have a code cache like on GoldCard or unexpanded QL, which is why I tried the compiler approach first. On the unexpanded QL I currently guess that streaming whole frames directly from SD to screen memory might even be the fastest approach overall. I might try this one day as it's also the simplest approach.

One of the huge advantages that I used compared to your C code is that I fetch the data through low level raw sector access, I don't go through the file system layer.

I was actually running this on one of Tetroid's SGC clones so it was practically the same as yours in terms of processor speed. I did find that doing file data caching sped the rendering up hugely, but that was still nowhere near the speed you were getting. Maybe it didn't help that I got ImageMagick to do dithering, so there were larger numbers of pixels to change per frame too.

There's definitely a lot of optimisation which can be done. There are more memory copies being done than are strictly necessary and then there is indeed the filesystem overhead, but the caching should help with that. I wrote the code with readability in mind as it was a proof of concept and hence I wanted to make sure that the algorithm was sound and understandable.

My file format is a list of <token byte><data> where the token byte describes the new scan line, i.e. is it a set of differences (and if so how many) or is it a complete line of data. It also has a bit saying if it's the first scan-line, allowing an early abort of a frame if there are no more changes below the previous scan line.

A full scan line token is followed by 256 bytes containing the pixel bitmap. A difference line is made up of 'n' two-byte pairs (where 'n' is defined in the line token): <pixel bitmap byte address (0-255)><8 pixel set value>

You can set in the encoder at what point you abandon differences in the line and just dump the whole scanline pixmap.
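As a rough illustration of the format described above, here is a minimal Python sketch of encoding and decoding one scan line. The bit layout of the token byte is not given in the post, so the flag positions (`FULL_LINE`, `FIRST_LINE`, `COUNT_MASK`) and the function names are assumptions made purely for this example; only the 256-byte full line, the two-byte difference pairs and the encoder threshold come from the description.

```python
LINE_BYTES = 256    # bytes per scan line, as stated in the post
FULL_LINE = 0x80    # assumed flag: token is followed by a complete line
FIRST_LINE = 0x40   # assumed flag: marks the first scan line
COUNT_MASK = 0x3F   # assumed: low 6 bits hold the number of difference pairs


def encode_line(prev, new, first=False, max_diffs=32):
    """Encode one scan line as a token byte followed by its data."""
    diffs = [(i, b) for i, (a, b) in enumerate(zip(prev, new)) if a != b]
    flags = FIRST_LINE if first else 0
    # Past the threshold, give up on differences and dump the whole line.
    if len(diffs) > min(max_diffs, COUNT_MASK):
        return bytes([flags | FULL_LINE]) + bytes(new)
    out = bytearray([flags | len(diffs)])
    for offset, value in diffs:
        out += bytes([offset, value])   # <byte address><8 pixel set value>
    return bytes(out)


def decode_line(stream, pos, line):
    """Apply one encoded line to `line` (a bytearray) in place.

    Returns the new stream position and whether the first-line flag was set.
    """
    token = stream[pos]
    pos += 1
    if token & FULL_LINE:
        line[:] = stream[pos:pos + LINE_BYTES]
        pos += LINE_BYTES
    else:
        for _ in range(token & COUNT_MASK):
            line[stream[pos]] = stream[pos + 1]
            pos += 2
    return pos, bool(token & FIRST_LINE)
```

The `max_diffs` parameter plays the role of the encoder setting mentioned above: a full line costs 257 bytes here and a difference line costs 1 + 2n, so differences stop paying off well before n reaches 128.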

I suppose you could increase the speed by replacing the single byte pixel blocks with words containing the QL 8 pixel screen memory blocks and using single move.w instructions, but this doubles the file size.

Peter · Post by **Peter** » Wed Nov 20, 2019 4:50 pm

mk79 wrote:On the unexpanded QL I currently guess that streaming whole frames directly from SD to screen memory might even be the fastest approach overall.

Well, almost the fastest. The only approach I found faster than whole frames (with an unmodified QL-SD) is to store "half" frames with 1 bit per pixel instead of R and G separately. Then use a specialized SD card routine, which writes every byte from the SD card immediately to two consecutive screen bytes, expanding on-the-fly from 16 KB per frame on the card to 32 KB on the screen.

This way I streamed an unfragmented file outside the QL filesystem container, with modified versions of my original FAT32 / SD card routines. Those were still "only" in C language, before Adrian and Wolfgang translated them to assembler for the actual QL-SD and Q68 drivers. Still the result did not look too bad on the original machine. To have the file outside the QLWA container made streaming long files possible for 128K RAM. This way, I could use a small 8 MB QL filesystem container. With large containers, the driver consumed (too) much memory.
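The "half frame" expansion described above can be sketched in a few lines of Python. This only models the data transformation, not the SD card routine itself, and the function name is hypothetical. It assumes that writing the same bit pattern into both bytes of a screen pair lights both colour bits of each pixel, which is what makes a 1 bit per pixel source sufficient for a black and white video.

```python
HALF_FRAME_BYTES = 16 * 1024   # 1 bit per pixel, as stored on the card
FULL_FRAME_BYTES = 32 * 1024   # size of the frame in screen memory


def expand_half_frame(half):
    """Write every source byte to two consecutive screen bytes."""
    screen = bytearray(len(half) * 2)
    screen[0::2] = half   # first byte of each pair
    screen[1::2] = half   # second byte of each pair gets the same bits
    return screen
```

Expanding a `HALF_FRAME_BYTES` buffer this way yields exactly `FULL_FRAME_BYTES`, so only half as much data has to be read from the card per frame.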

By the way, the other demo scenes, like Spectrum, were using original machines just with mass storage! It would be nice if that could be done for the QL as well - otherwise people might blame the success on the SuperGoldCard. Which is an almost completely different machine than the QL, two CPU generations newer, with cache and 32 bit RAM.

I was mentioning "unmodified QL-SD" because there is a relatively simple trick to improve the QL-SD hardware in a way that the "whole frames" approach becomes the fastest possible, even faster than "half frames", despite the doubled video file size. QL-SD reads are equivalent to ROM accesses - and as such faster than instruction fetches from contended 128 KB QL DRAM. To exploit this, I tricked QL-SD into delivering two bytes at once by a word access to $FEE4, simply by ignoring the lowest address bit when decoding. Should be fully backward compatible, although I have no time for any testing. Unfortunately, I decoded transfer speed switching at $FEE6, limiting this trick to word size. Otherwise, even longword access would be an option.

A bit sad I didn't implement this hardware trick, when I designed QL-SD in the first place. Looking back, I even had the idea early enough - but forgot it, because I worked on the Q68 at the same time, where it was not needed. (The Q68 is theoretically fast enough to allow undelayed consecutive SD card reads even at byte access width.)

QLvsJAGUAR · Post by **QLvsJAGUAR** » Thu Nov 21, 2019 8:42 pm

Derek_Stewart wrote:Nice video of the Q68, just shows some of the features that it brings.

Thanks, I appreciate it. Yes, the Q68 is a great machine. QL forever!

Derek_Stewart wrote:I note you have QL/E in DISP_MODE 6 (512x384 Hi-Colour), does it suit QL/E better?

For many moons already, QL/E supports all DISP_MODEs of the Q68 quite nicely.

stephen_usher · Post by **stephen_usher** » Sat Nov 23, 2019 12:15 pm

Well, after the last couple of days' discussions I had another look at my code.

It seems that with the XTCC compiler function calls are extremely expensive in terms of time, so I refactored the code to decrease the number of calls to my file cache read and this sped the code up massively. I also changed the file format slightly so that the line header is now a word instead of a byte. This has two effects. The first is that all data reads are now word aligned, which will help 68000 and above processors as word aligned reads are far faster. Secondly, the top byte can now be used for audio information. I'm thinking the top bit encodes sound on/off and the lower 7 bits the tone. I'm thinking that this information would only be looked at when there's a new frame, which should be frequent enough.
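To make the proposed word header concrete, here is a small Python sketch of packing and unpacking it. The audio layout (top bit for sound on/off, lower 7 bits for the tone) follows the idea floated above; that the low byte keeps the original line token unchanged is an assumption, and the function names are made up for the example. The word is stored big-endian, as the 68000 reads it.

```python
import struct


def pack_header(line_token, sound_on, tone):
    """Build the 16-bit line header: audio in the top byte, token in the low byte."""
    if not 0 <= line_token <= 0xFF:
        raise ValueError("line token must fit in one byte")
    if not 0 <= tone <= 0x7F:
        raise ValueError("tone must fit in 7 bits")
    audio = (0x80 if sound_on else 0) | tone
    return struct.pack(">H", (audio << 8) | line_token)


def unpack_header(data, pos=0):
    """Return (line_token, sound_on, tone) from the header word at `pos`."""
    (word,) = struct.unpack_from(">H", data, pos)
    audio = word >> 8
    return word & 0xFF, bool(audio & 0x80), audio & 0x7F
```

For example, `pack_header(0x42, True, 0x15)` gives the two bytes `95 42`. Since a difference line then occupies 2 + 2n bytes and a full line 2 + 256 bytes, every record has an even length and the next header stays word aligned.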

Anyway, here's a video of how it's looking on a Tetroid SGC clone: https://youtu.be/tFkLyBalMWE

stephen_usher · Post by **stephen_usher** » Sat Nov 23, 2019 2:45 pm

Marcel, I don't suppose you have a file containing the notes and timings for the "Bad Apple" tune do you? Preferably in terms of musical notes (maybe relative to middle C) and timings in seconds or some known fraction?

I just want to try adding some music just with the QL beeper.

Cristian · Post by **Cristian** » Sun Nov 24, 2019 7:53 am

stephen_usher wrote:Secondly the top byte can now be used for audio information

Great improvements Stephen!

mk79 · Post by **mk79** » Sun Nov 24, 2019 7:56 am

stephen_usher wrote:Marcel, I don't suppose you have a file containing the notes and timings for the "Bad Apple" tune do you? Preferably in terms of musical notes (maybe relative to middle C) and timings in seconds or some known fraction?

I just want to try adding some music just with the QL beeper.

If it were that easy I'd have done it.

The PT3 I used is a tracker format, so it kind of contains notes, but it's very complicated, I don't think it's of much use to you.

Marcel

stephen_usher · Post by **stephen_usher** » Sun Nov 24, 2019 9:47 am

mk79 wrote:
stephen_usher wrote:Marcel, I don't suppose you have a file containing the notes and timings for the "Bad Apple" tune do you? Preferably in terms of musical notes (maybe relative to middle C) and timings in seconds or some known fraction?

I just want to try adding some music just with the QL beeper.
If it were that easy I'd have done it. The PT3 I used is a tracker format, so it kind of contains notes, but it's very complicated, I don't think it's of much use to you.

Marcel

I did find a PT3 player for the Atari ST, but it's written in impenetrable Pascal for some reason. Even a tracker format could be analysed.

The Sinclair QL Forum
