
Switch to faster serial #25

Merged

Conversation

carterturn
Contributor

@carterturn carterturn commented Feb 3, 2024

This pull request replaces the Pi Pico SDK serial over USB CDC with a custom implementation in TinyUSB that is typically faster.

It aims to be backwards compatible, though commands terminated with only a newline (\n) are now accepted in addition to the usual carriage return + newline (\r\n) termination.

Not ready to merge yet, as a binary mode should still be added for even faster programming.

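As a hypothetical sketch of what that compatibility means on the host side (the `set` command text follows the protocol used later in this thread; the exact arguments here are illustrative):

```python
# Either line terminator should now be accepted by the firmware.
cmd_crlf = 'set 0 0 100 4\r\n'.encode()  # traditional CR+LF termination
cmd_lf = 'set 0 0 100 4\n'.encode()      # bare LF, newly accepted

# The payload before the terminator is identical:
assert cmd_crlf.rstrip(b'\r\n') == cmd_lf.rstrip(b'\r\n')
```

In practice either byte string would be passed to the serial port's `write` method.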
@philipstarkey
Member

Thanks @carterturn, I'm keen to dig into this once it's ready. The speed improvements shown in the other repo certainly seem to make this worthwhile!

I'd like to merge this after #26, so keep that in mind (it'll be a squash merge, though, so there is no point rebasing yet). If you had time to review that PR too, I would appreciate it. It should mean you won't need to include the uf2 files explicitly: they'll get built by GitHub Actions and uploaded as an artefact for testing, and then built for a release after merging and tagging.

@dihm
Collaborator

dihm commented Feb 5, 2024

Well, I see what the problem was when I tried this: I had included fast_serial as native C++, and that did not work.

Also, I think it is really interesting that the binary blobs are less than half the size when using this serial comms mode. That's kind of wild.

@carterturn
Contributor Author

Binary mode is now implemented, and I have tested programming for a single-clock-pulse sequence.
I will try to set up a more elaborate test environment soon (and then get performance results). If someone else has a Windows test setup, it would probably also be good to check performance there.

Waiting to merge for #26 sounds good. I am a bit pleased to have fewer binary blobs in git repositories.

Curiously, I ran into compilation issues when not using the "extern C" directive, but I suspect there are some CMake options that could fix that. I suppose this serial code lacks many of the features that the Pico SDK version has, though that is indeed a surprising difference in executable size.

@dihm
Collaborator

dihm commented Feb 9, 2024

@carterturn I've had a go at doing some performance tests, and I'm not getting results as impressive as those for the prawn_do.

Basically, there is a floor of ~200 ms on the binary write regardless of the number of instructions, and it appears to be the same floor as for individual serial writes. I don't see increases in timing until at least 150 instructions are sent in the binary block (the individual writes are, obviously, much slower).

Test script, with prawn being the pySerial interface handle:

```python
import struct

import numpy as np

instr = 250
# Two alternating instructions tiled instr times: 2*instr rows total
dummy_data = np.tile([[100, 4],
                      [1000000, 1]],
                     (instr, 1))

# individual serial writes
for i in range(dummy_data.shape[0]):
    prawn.write(f'set 0 {i:d} {dummy_data[i,0]:d} {dummy_data[i,1]:d}\r\n'.encode())
    _ = prawn.readline().decode()

# block binary write of the same data
prawn.write(f'setb 0 0 {dummy_data.shape[0]:d}\r\n'.encode())
serial_buffer = b''
for dur, reps in dummy_data:
    serial_buffer += struct.pack('<I', dur)
    serial_buffer += struct.pack('<I', reps)

prawn.write(serial_buffer)
resp = prawn.readline().decode()
```

Timing summary, using %%time in a Jupyter cell, for each of the above versus the number of instructions. I'm on a Windows machine with Python 3.10.

| instructions | individual time (ms) | block time (ms) |
| --- | --- | --- |
| 10 | 220 | 215 |
| 100 | 297 | 205 |
| 200 | 381 | 216 |
| 500 | 621 | 228 |
| 1000 | 1070 | 262 |
| 2000 | 1880 | 310 |
| 5000 | 4510 | 459 |
| 10000 | 8800 | 697 |

It would appear there is something in the firmware that is really bogging us down, but nothing obvious is jumping out at me. The only real differences I see are that we recalculate address_offset + addr * 2 more than we need to (twice per loop, regardless of branch), and the more complicated else/if tree; neither seems all that likely, though. Note that my dummy data doesn't include any wait or stop instructions, so we aren't even traversing all the branches.

@dihm
Collaborator

dihm commented Feb 14, 2024

I am an idiot. The Jupyter cells I was running those timing snippets in contained a readlines call, and my serial timeout was set to 200 ms. Here is the updated timing table; as expected, the binary writes are much, much faster.

| instructions | individual time (ms) | block time (ms) |
| --- | --- | --- |
| 20 | 6.8 | 1 |
| 200 | 77 | 5.2 |
| 400 | 153 | 8.3 |
| 1000 | 378 | 18 |
| 2000 | 834 | 39 |
| 4000 | 1780 | 96 |
| 10000 | 4220 | 281 |
| 20000 | 8610 | 480 |

Edit: the stupidity continues. My dummy data alternates two instructions, so the total number of instructions written is actually 2*instr. I've updated the instruction numbers in this comment's table to be accurate.
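The pitfall above generalizes: %%time measures the whole cell, so any blocking readline with a timeout inflates the result. A minimal sketch of timing only the transfer itself; the FakeSerial class is a hypothetical stand-in so the pattern runs without hardware (with a real device you would use a pySerial serial.Serial object):

```python
import io
import time

class FakeSerial:
    """Hypothetical stand-in for a pySerial port, for illustration only."""
    def __init__(self):
        self.buffer = io.BytesIO()

    def write(self, data):
        return self.buffer.write(data)

    def readline(self):
        return b'ok\r\n'

prawn = FakeSerial()

# Time only the write itself, not any other calls in the cell:
start = time.perf_counter()
prawn.write(b'setb 0 0 2\r\n')
elapsed_ms = (time.perf_counter() - start) * 1e3
print(f'write took {elapsed_ms:.3f} ms')
```

Bracketing just the call of interest with time.perf_counter avoids accidentally folding timeout-bound reads into the measurement.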

@dihm
Collaborator

dihm commented Feb 14, 2024

As another optimization that may be worth implementing: the for-loop construction of the serial buffer doesn't scale especially well to these very high instruction counts. If the data we want to write is already in a well-ordered numpy array, we can use its tobytes method instead, which scales much better.

```python
import struct

import numpy as np

instr = 4000
seq = np.array([[100, 4], [1000000, 1]], dtype='<u4')
dummy_data = np.tile(seq, (instr // 2, 1))

# for loop
serial_buffer = b''
for i in range(dummy_data.shape[0]):
    serial_buffer += struct.pack('<I', dummy_data[i, 0])
    serial_buffer += struct.pack('<I', dummy_data[i, 1])

# numpy
serial_buffer2 = dummy_data.tobytes()

serial_buffer == serial_buffer2  # returns True
```

Timing using Python 3.8 and the %%timeit magic:

| instructions | loop | numpy |
| --- | --- | --- |
| 10 | 7 µs | 95 ns |
| 100 | 76 µs | 164 ns |
| 4000 | 6.4 ms | 889 ns |
| 20000 | 83 ms | 3.7 µs |

May be worth implementing that in the labscript device.
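One caveat worth noting with tobytes: it emits the array's raw memory, so the dtype and byte order must already match what the firmware expects. A minimal sketch, assuming the '<u4' little-endian uint32 dtype used above, checking that the numpy path produces the same bytes as the struct.pack('<I', ...) loop:

```python
import struct

import numpy as np

# Explicit little-endian uint32 dtype so tobytes() matches struct.pack('<I', ...)
data = np.array([[100, 4], [1000000, 1]], dtype='<u4')

packed = data.tobytes()
loop_packed = b''.join(struct.pack('<II', int(dur), int(reps)) for dur, reps in data)

assert packed == loop_packed  # identical byte streams
```

If the array arrived with a different dtype, an explicit data.astype('<u4') before tobytes would be needed to keep the wire format correct.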

@carterturn
Contributor Author

Thank you for the updated test; that would have been a confusing bug to try to replicate.
I think the numpy-based buffer construction looks great; I added it to labscript-suite/labscript-devices#112.

Collaborator

@dihm dihm left a comment


This is all good to go, but am just now realizing the README hasn't been updated with the new setb command. Probably should do that before merging.

Collaborator

@dihm dihm left a comment


Everything looks good to me.

Collaborator

@dihm dihm left a comment


While checking write speeds, I accidentally broke the binary writes in a way that should be preventable. We should probably address that before merging.

Review comment on prawnblaster/prawnblaster.cpp (outdated, resolved)
Collaborator

@dihm dihm left a comment


Going to go ahead and merge this. Suspect the program command has the same issue as the digital output board, but I say we fix it later.

@dihm dihm merged commit 783634d into labscript-suite:master May 21, 2024
1 check passed