Code challenge to implement a simple transaction processor that takes a CSV of transactions as input and writes the resulting account states as CSV to stdout.
To process inputs you can run in 2 modes (remember to run gen.sh first):
- Single threaded:
cargo run --release inputs/big/random.csv > result.csv
- Multi threaded:
cargo run --features multithreaded --release inputs/big/random.csv > result.csv
You can also enable verbose output of the invalid transactions with --features stderr,
but it slows down processing considerably, so it should only be used on smaller inputs, e.g.:
cargo run --features stderr --release inputs/complicated.csv > result.csv
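Internally this kind of gating can be done with a helper that only emits the log when the feature is enabled; a minimal sketch (the helper name is hypothetical, not necessarily how the project does it):

#[cfg(feature = "stderr")]
fn report_invalid(line: &str, reason: &str) {
    // Only compiled in when built with `--features stderr`.
    eprintln!("invalid transaction ({reason}): {line}");
}

#[cfg(not(feature = "stderr"))]
fn report_invalid(_line: &str, _reason: &str) {
    // No-op when the feature is disabled, so the hot path pays nothing.
}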
You can generate input CSV data with 2 commands, both of which generate 10,000,000 transactions.
- Random
cargo run --release inputs/big/random.csv genrandom
- "Smart" Random
cargo run --release inputs/big/random.csv gen
or gen.sh
The weird ordering of the arguments is because of challenge constraints.
You can generate documentation by running doc.sh; it will automatically open in a browser tab.
You can time the execution of the multi threaded and single threaded processing with time_multi.sh and time.sh respectively.
You can generate a flamegraph profile of the single threaded implementation by running profile.sh; it requires that you have previously run cargo install flamegraph.
Remember to add
[profile.release]
debug = true
to Cargo.toml
or the flamegraph will not have the debug symbols.
Run cargo test to execute unit tests.
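For illustration, here is a self-contained sketch of the kind of behavior such a test can check, assuming the usual semantics of this challenge (a dispute moves funds from available to held, a chargeback withdraws the held funds and locks the account); the Account type and its methods are hypothetical and not the project's API:

#[derive(Default)]
struct Account {
    available: f64,
    held: f64,
    locked: bool,
}

impl Account {
    fn deposit(&mut self, amount: f64) {
        if !self.locked {
            self.available += amount;
        }
    }
    fn dispute(&mut self, amount: f64) {
        // Disputed funds are held until resolved or charged back.
        self.available -= amount;
        self.held += amount;
    }
    fn chargeback(&mut self, amount: f64) {
        // Held funds are withdrawn and the account is frozen.
        self.held -= amount;
        self.locked = true;
    }
}

#[test]
fn chargeback_locks_the_account() {
    let mut acc = Account::default();
    acc.deposit(10.0);
    acc.dispute(10.0);
    acc.chargeback(10.0);
    assert_eq!(acc.available, 0.0);
    assert_eq!(acc.held, 0.0);
    assert!(acc.locked);
}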
- From the start I chose the async approach because, in my experience, it often ends up utilizing system resources better.
- In the beginning I opted for async-std as the runtime because csv-async did not support tokio.
- Also in the beginning I used csv-async + serde for both serialization and deserialization because of its simplicity.
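A rough sketch of what that initial deserialization step might have looked like (the record struct, column names, and function are assumptions for illustration, not the project's code; csv-async needs its with_serde feature for this):

use csv_async::AsyncReaderBuilder;
use futures::StreamExt;
use serde::Deserialize;

// Assumed raw CSV row layout: type,client,tx,amount
#[derive(Deserialize)]
struct RawRecord {
    #[serde(rename = "type")]
    ty: String,
    client: u16,
    tx: u32,
    amount: Option<f64>,
}

async fn read_transactions(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let file = async_std::fs::File::open(path).await?;
    let mut reader = AsyncReaderBuilder::new()
        .trim(csv_async::Trim::All)
        .create_deserializer(file);
    let mut records = reader.deserialize::<RawRecord>();
    while let Some(record) = records.next().await {
        let record = record?;
        // hand `record` to the processing step here
        let _ = record;
    }
    Ok(())
}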
- After I had verified that the processing worked correctly, I started searching for bottlenecks and found the first one in the runtime; switching to smol doubled performance.
- After switching runtimes, the flamegraph made it apparent that there were 2 distinct operations that could happen in parallel, parsing and processing (mostly hashmap lookups), so I implemented the multi threaded version and switched to a faster hashmap.
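The shape of that split is a simple two-stage pipeline: one thread parses lines into transactions while the other applies them to the accounts. Below is a minimal sketch using std threads, a channel, and the standard HashMap; the names are hypothetical, and the real implementation uses async tasks, the full transaction set, and a faster hashmap:

use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

type ClientId = u16;

#[derive(Default)]
struct Account {
    available: f64,
    locked: bool,
}

fn run_pipeline(lines: Vec<String>) -> HashMap<ClientId, Account> {
    let (sender, receiver) = mpsc::channel();

    // Stage 1: parsing, on its own thread.
    let parser = thread::spawn(move || {
        for line in lines {
            if let Some(deposit) = parse_deposit(&line) {
                let _ = sender.send(deposit);
            }
        }
        // Dropping `sender` here closes the channel, which ends stage 2.
    });

    // Stage 2: processing (mostly hashmap lookups), on the current thread.
    let mut accounts: HashMap<ClientId, Account> = HashMap::new();
    for (client, amount) in receiver {
        let account = accounts.entry(client).or_default();
        if !account.locked {
            account.available += amount;
        }
    }

    parser.join().unwrap();
    accounts
}

// Stand-in parser that only understands `deposit,client,tx,amount` lines.
fn parse_deposit(line: &str) -> Option<(ClientId, f64)> {
    let mut fields = line.split(',').map(str::trim);
    if fields.next()? != "deposit" {
        return None;
    }
    let client: ClientId = fields.next()?.parse().ok()?;
    let _tx: u32 = fields.next()?.parse().ok()?;
    let amount: f64 = fields.next()?.parse().ok()?;
    Some((client, amount))
}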
- While testing the multi threaded implementation it was initially slower. I later found out that with enough (10 million) transactions and an even distribution, most accounts end up locked, so the processing step degenerates into a single hashmap lookup. I then reduced the probability of chargeback transactions in the generator, which made the multi threaded approach faster.
- The next bottleneck seemed to be parsing, so I switched to a custom parser. This also allowed me to switch the transaction representation to a better one, from
struct Transaction {
    ty: TransactionType,
    client: ClientId,
    tx: TransactionId,
    amount: Option<f64>,
}
to
pub enum Transaction {
    Deposit {
        client: ClientId,
        tx: TransactionId,
        amount: f64,
    },
    Dispute {
        client: ClientId,
        tx: TransactionId,
    },
    // etc.
}
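A hand-rolled parser for this representation boils down to splitting each line on commas and dispatching on the first field. Here is a simplified sketch using only std parsing; the actual parser builds on atoi, fast-float and parse-display and covers all transaction types:

type ClientId = u16;
type TransactionId = u32;

// Condensed version of the enum above, repeated so the sketch is self-contained.
pub enum Transaction {
    Deposit { client: ClientId, tx: TransactionId, amount: f64 },
    Dispute { client: ClientId, tx: TransactionId },
    // etc.
}

/// Parse one CSV line such as `deposit,1,42,3.5` into a Transaction.
fn parse_line(line: &str) -> Option<Transaction> {
    let mut fields = line.split(',').map(str::trim);
    let ty = fields.next()?;
    let client: ClientId = fields.next()?.parse().ok()?;
    let tx: TransactionId = fields.next()?.parse().ok()?;
    match ty {
        "deposit" => {
            let amount: f64 = fields.next()?.parse().ok()?;
            Some(Transaction::Deposit { client, tx, amount })
        }
        "dispute" => Some(Transaction::Dispute { client, tx }),
        // etc.
        _ => None,
    }
}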
Parsing is one of the most dangerous steps in computing, so if this were a production system, a lot more testing and scrutiny would have to go into the parser as well as its lower level dependencies (atoi, fast-float, parse-display) and of course the smol runtime itself.
- If I had more time I would try to parallelize the parsing step by partitioning the file by lines and feeding a chunk of lines to each task.
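That could look roughly like the following sketch, which splits the lines into chunks and parses each chunk on a scoped thread (Rust 1.63+); in the real program these would be async tasks, and the parsed chunks would still have to reach the processing step in their original file order. It reuses the Transaction enum and the parse_line sketch from above:

use std::thread;

fn parse_parallel(input: &str, workers: usize) -> Vec<Transaction> {
    let lines: Vec<&str> = input.lines().collect();
    // Ceiling division so every line lands in some chunk.
    let chunk_size = ((lines.len() + workers - 1) / workers).max(1);

    thread::scope(|scope| {
        // Spawn one parsing task per chunk of lines.
        let handles: Vec<_> = lines
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk
                        .iter()
                        .filter_map(|line| parse_line(line))
                        .collect::<Vec<Transaction>>()
                })
            })
            .collect();

        // Join in spawn order so transactions stay in file order.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().unwrap())
            .collect()
    })
}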