The direction is correct. Now replace MDM by IOP. The MDM doesn't know and doesn't care who is using it. It only replies to commands.
The IOP is essentially one specialist ALU with many sets of registers that are used in a round-robin fashion. Each set contains data like "where to get the next IO instruction" and "where to read the output data" or "where to store the input data". It is a bit mystical, but essentially you have two kinds of register sets: the MSC controls the IOP and has a few more registers, while the BCEs handle the IO with a serial shuttle bus. Every BCE is driven by the same ALU, so only one BCE can be active at a time, but switching between BCEs happens so fast that you practically don't notice the difference. The most common BCE programs reside in BCE PROM and can't be changed in flight.
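To make the "one ALU, many register sets" idea more concrete, here is a minimal C++ sketch of how a simulator could model it. Everything in it is my own assumption for illustration: the names `BceRegisters`, `Iop`, `selectNext`, the fields, and the BCE count are hypothetical and not taken from any flight documentation.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical register set of one BCE (field names are illustrative only).
struct BceRegisters {
    std::uint32_t programCounter = 0;  // where to get the next IO instruction
    std::uint32_t dataAddress = 0;     // where to read output / store input data
    bool halted = true;                // a halted BCE is skipped by the ALU
};

// Assumed number of BCEs; adjust to whatever your research says.
constexpr std::size_t kBceCount = 24;

struct Iop {
    std::array<BceRegisters, kBceCount> bce{};
    std::size_t active = 0;  // register set the shared ALU currently works on

    // Round-robin: advance to the next BCE that is not halted.
    // Returns false if every BCE is halted, so there is nothing to run.
    bool selectNext() {
        for (std::size_t i = 1; i <= kBceCount; ++i) {
            const std::size_t candidate = (active + i) % kBceCount;
            if (!bce[candidate].halted) {
                active = candidate;
                return true;
            }
        }
        return false;
    }
};
```

Because there is only one `active` index, only one BCE is ever worked on at a time, which matches the shared-ALU picture; the MSC would be one more register set of a slightly different type.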
A BCE can essentially do the following actions (as far as my research goes):
- Wait for index (#WIX) is a known instruction and means the BCE waits until the first MDM starts to respond on the bus.
- It can read n 16-bit words from the bus.
- It can send a command and write n 16-bit words to the bus.
- Branches in a BCE program are possible, conditional and unconditional.
n can be a large number because the MDM is able to execute special MDM programs from PROM. It's uncommon for a BCE program to read and write at the same time, but it does happen and is perfectly legal. In theory, the maximum possible n should be 512 words. If a single read or write instruction is executed without using the PROM, the maximum is 32 words.
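The action list and the word limits above could be captured in a small data structure like the following. This is only a sketch under my own assumptions: `BceOp`, `BceInstruction`, `isValid` and the two constants are hypothetical names, not the real instruction encoding, and it treats read and write as separate instructions for simplicity.

```cpp
#include <cstdint>

// Hypothetical operation set, mirroring the actions listed above.
enum class BceOp { WaitForIndex, Read, Write, Branch, BranchIf };

// Limits as stated in the text (512 is the theoretical maximum).
constexpr std::uint16_t kMaxWordsDirect = 32;   // without an MDM PROM program
constexpr std::uint16_t kMaxWordsProm   = 512;  // with an MDM PROM program

struct BceInstruction {
    BceOp op;
    std::uint16_t wordCount;  // n, in 16-bit words; only used by Read/Write
    bool usesMdmProm;         // transfer is driven by an MDM PROM program
    std::uint32_t target;     // branch target; only used by Branch/BranchIf
};

// Checks the transfer length against the limit that applies to it.
inline bool isValid(const BceInstruction& ins) {
    const bool isTransfer = ins.op == BceOp::Read || ins.op == BceOp::Write;
    if (!isTransfer) {
        return ins.wordCount == 0;  // non-transfer instructions move no words
    }
    const std::uint16_t limit = ins.usesMdmProm ? kMaxWordsProm : kMaxWordsDirect;
    return ins.wordCount >= 1 && ins.wordCount <= limit;
}
```

For example, `isValid({BceOp::Read, 33, false, 0})` is rejected because 33 words exceed the 32-word limit without PROM, while the same length with `usesMdmProm` set to `true` is accepted.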
Thus, a good implementation would execute a number of BCE/MSC instructions per timestep. A good enough implementation would define the programs in C++ and simply point them to the beginning of the memory used for data transfer.
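A rough sketch of the "good enough" variant could look like this: each BCE program is just a C++ callable that gets a pointer to the start of its transfer area, and the simulator runs a fixed budget of them per timestep. The names `BceProgram`, `IopSim`, `addProgram`, `step` and `word` are made up for this example, and the flat `std::vector` standing in for GPC memory is an assumption as well.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using Word = std::uint16_t;

// A BCE program defined directly in C++ (hypothetical layout).
struct BceProgram {
    std::size_t bufferOffset;  // start of its transfer area in memory
    std::size_t wordCount;     // length of that area in 16-bit words
    std::function<void(Word* buffer, std::size_t wordCount)> run;
};

class IopSim {
public:
    explicit IopSim(std::size_t memoryWords) : memory_(memoryWords, 0) {}

    // Rejects programs without a body or whose buffer leaves the memory.
    bool addProgram(BceProgram program) {
        if (!program.run || program.bufferOffset > memory_.size() ||
            program.wordCount > memory_.size() - program.bufferOffset) {
            return false;
        }
        programs_.push_back(std::move(program));
        return true;
    }

    // Runs `budget` programs this timestep, continuing round-robin where
    // the previous timestep stopped. Returns how many were executed.
    std::size_t step(std::size_t budget) {
        if (programs_.empty()) return 0;
        std::size_t executed = 0;
        while (executed < budget) {
            BceProgram& p = programs_[next_];
            p.run(memory_.data() + p.bufferOffset, p.wordCount);
            next_ = (next_ + 1) % programs_.size();
            ++executed;
        }
        return executed;
    }

    Word word(std::size_t address) const { return memory_.at(address); }

private:
    std::vector<Word> memory_;          // stand-in for the transfer memory
    std::vector<BceProgram> programs_;
    std::size_t next_ = 0;              // next program to run
};
```

Each program only ever sees one contiguous block starting at its own offset, so the sketch never needs to jump between separate memory locations. The more faithful variant would replace the callables with decoded BCE/MSC instructions and count those against the budget instead.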
I can't find any hint that a BCE program can jump over multiple memory locations in GPC memory. It looks like they are always processed in sequence.