- Linux OS (we've tested on Ubuntu 18.04 STD)
- Docker version 20+
- 40 GB of memory
To install Docker on Linux, run:
$ sudo apt-get update
$ sudo apt-get install -y docker-ce
Make sure the docker daemon is running, then download the compressed artifact from the provided link and load it like so:
$ docker load < oopsla21ae.tar.gz
This might take some time. Once done, start the docker container:
$ docker run -it --cap-add=sys_nice --name artifact oopsla21ae
And finally test that the artifact works:
$ python3 ExpDriver.py --figure1 --figure59 --figure7table3
This should complete in about an hour.
All commands should be run from /home/oopsla21ae/.
Running all experiments in full takes the better part of a day (~19 hours), so we have implemented a fast path that can run all experiments (on fewer libraries and applications) in about three hours. The fast path is enabled by default; use the '--full' flag to run the full versions of the experiments:
$ python3 ExpDriver.py [OPTIONS] --full
To run all experiments, run:
$ python3 ExpDriver.py --all [--full]
Expected running times for all experiments on our test machine (Ubuntu 18.04 STD, Docker 20.10.2) are listed below:
| | Figure 1 | Table 1 | Figures 5 and 9 | Figure 7 and Table 3 | Table 4 | Figure 8 | Total |
|---|---|---|---|---|---|---|---|
| Fast | 20 min | - | 40 min | 2 min | - | - | 1 hr |
| Full | 7 hrs | 20 min | 9 hrs | 20 min | 1 hr | 1 hr | ~19 hrs |
To run individual experiments, simply replace '--all' with the corresponding experiment's flag, found by running:
$ python3 ExpDriver.py --help
...
--figure1 generate figure 1
--table1 generate table 1
--figure59 generate figures 5 and 9
--figure7table3 generate figure 7 and table 3
--table4 generate table 4
--figure8 generate figure 8
...
To generate Figure 7 and Table 3, for example, run the following:
$ python3 ExpDriver.py --figure7table3 [--full]
Some expected output is in /home/oopsla21ae/example-results/; you can compare your generated plots to those as a sanity check.
Our artifact generates PDFs that can be copied out of the docker container using docker cp:
$ docker cp <container_id>:/home/oopsla21ae/images/ .
- To get the <container_id> of a running container, run:
$ docker container ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
<container_id> oopsla21ae ... ... ... artifact
- To get the <container_id> of a stopped container, run:
$ docker container ps -a
Descriptions of each generated PDF file in /home/oopsla21ae/images/
are listed in the following subsections. Once the generated PDFs have been copied locally, reviewers can view them using their favorite PDF viewer.
In general, the figures and tables produced here are analogous to the figures and tables presented in the paper, but we describe how to interpret results in more detail below.
Generated files:
- figure1_all.pdf
- figure1_histogram.pdf
- figure1_hurt.pdf
- figure1_improved.pdf
- figure1_insignificantly_affected.pdf

figure1_histogram.pdf is analogous to Figure 1 in the paper:
- Clustering around the vertical speedup == 1 line shows that the overhead of checked indexing is insignificant in most cases (~65% of benchmarks)
- The left tail depicts benchmarks where checked indexing did have significant overhead (~24% of benchmarks)
- The right tail depicts benchmarks where checked indexing, surprisingly, improves performance (~11% of benchmarks)
figure1_all.pdf contains the same information available in figure1_histogram.pdf but shows it in a slightly different way:
- Bars clustered around the horizontal speedup == 1 line represent the benchmarks where the overhead of checked indexing is insignificant (~65% of benchmarks)
- Bars below the line represent benchmarks where checked indexing does have significant overhead (~24% of benchmarks)
- Bars above the line represent benchmarks where checked indexing, surprisingly, improves performance (~11% of benchmarks)
figure1_hurt.pdf zooms in on the ~24% of benchmarks where we expect checked indexing to have significant overhead.
figure1_improved.pdf zooms in on the ~11% of benchmarks where we expect checked indexing to, surprisingly, improve performance.
figure1_insignificantly_affected.pdf zooms in on the ~65% of benchmarks where we expect checked indexing to have insignificant overhead.
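For intuition, the three buckets above can be thought of as classifying each benchmark by where its speedup falls relative to 1. The snippet below is a purely illustrative sketch of that classification, not the artifact's code; the ±3% noise band and the function name are assumptions.

```python
# Illustrative only: a hypothetical three-way bucketing of benchmarks by
# speedup, where values below 1 mean checked indexing slowed the benchmark
# down. The +/-3% noise band is an assumed cutoff, not the artifact's.

NOISE_BAND = 0.03

def classify(speedup: float) -> str:
    if speedup < 1 - NOISE_BAND:
        return "hurt"                  # left tail / bars below the line
    if speedup > 1 + NOISE_BAND:
        return "improved"              # right tail / bars above the line
    return "insignificantly affected"  # clustered around speedup == 1

print([classify(s) for s in (0.90, 1.00, 1.10)])
```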
Table 1 produces no generated files; its results are printed to the terminal.
The three contexts are:
- A baseline context: rustc 1.52, compression level = 5
- A different workload: rustc 1.52, compression level = 11
- A different compiler: rustc 1.46, compression level = 5
See the notes at the end of this document for why we do not reproduce the "different architecture" column.
We expect the overhead of checked indexing in each context to be approximately the following:
$ python3 ExpDriver.py --table1
Getting overheads for baseline context... [Context 1]
Overhead == 0.0852062889815508
Getting overheads for different workload... [Context 2]
Overhead == 0.05165770297643811
Getting overheads for different compiler... [Context 3]
Overhead == 0.1482833160361338
The difference in the overhead of checked indexing across these three contexts shows that developers cannot attribute a flat cost to checked indexing in every context in which it is used. Furthermore, we expect reviewers to see different results if any part of their underlying context (architecture, operating system and version, etc.) is different; this is exactly the point we are trying to make.
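For reference, an overhead value like those printed above can be read as a relative slowdown of the fully checked build over the unchecked baseline. The snippet below is a minimal sketch of that arithmetic with made-up timings; it is not the artifact's measurement code.

```python
# Hypothetical sketch: relative overhead computed from wall-clock times.
# The function name and the example timings are illustrative only.

def relative_overhead(checked_secs: float, unchecked_secs: float) -> float:
    return (checked_secs - unchecked_secs) / unchecked_secs

# e.g. 10.85 s with bounds checks vs. 10.00 s without -> 0.085 (~8.5%),
# similar in magnitude to the Context 1 value above.
print(relative_overhead(10.85, 10.00))
```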
Generated files:
- figure5.pdf
- figure9.pdf

figure5.pdf compares four different heuristics for reintroducing bounds checks in the rust-brotli benchmark:
- Random
- Hotness
- One-checked slowdown
- One-unchecked speedup
A more "successful" heuristic reintroduces more bounds checks while staying within a given overhead threshold, i.e., it hugs the black 0% line the longest. We expect the random heuristic (red line) to reintroduce the smallest number of bounds checks before hitting the threshold, followed by one-unchecked (yellow line). Hotness (orange line) should perform best until the very end, where it is surpassed by one-checked (blue line).
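To make the idea of a reintroduction heuristic concrete, here is a minimal sketch of a greedy loop in the spirit of the hotness heuristic, using a precomputed per-site cost model as a stand-in for profiling and re-measurement. Everything in it (names, costs, threshold) is hypothetical; the artifact's real exploration logic lives in /home/oopsla21ae/scripts/Nader.py.

```python
# Hypothetical sketch of a hotness-style greedy loop: reintroduce bounds
# checks at the cheapest (coldest) indexing sites first and stop before the
# cumulative overhead would exceed the threshold. The cost model is fake
# data standing in for profiling; this is not the artifact's code.

def greedy_reintroduction(site_costs, threshold):
    """site_costs maps an indexing site to its estimated overhead if checked."""
    ranked = sorted(site_costs, key=site_costs.get)  # coldest/cheapest first
    reintroduced, total = [], 0.0
    for site in ranked:
        if total + site_costs[site] > threshold:
            break                                    # next check would exceed the budget
        total += site_costs[site]
        reintroduced.append(site)
    return reintroduced, total

# Purely illustrative per-site overheads (fractions of total runtime).
costs = {"hot_loop[i]": 0.04, "copy[k]": 0.005, "parse[j]": 0.001, "init[m]": 0.0001}
print(greedy_reintroduction(costs, threshold=0.01))
```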
figure9.pdf compares the random and hotness heuristics to NADER's combined-heuristic approach on the rust-brotli benchmark. As in figure5.pdf, the hotness line (orange) should be above the random line (red). At the far right of the graph, a dark blue line shows when NADER switches from the hotness heuristic to the one-checked heuristic; it should be above both the hotness and random lines.
Generated files:
- figure7.pdf
- table3.pdf

figure7.pdf shows, for each of the 27 applications we selected, the number of direct and indirect uses of unchecked indexing as a bar chart. On average, we expect there to be 86 times more indirect than direct uses of unchecked indexing, which would be evidenced by bars with much more red than blue (about 86 times more, per application).
table3.pdf presents the results from figure7.pdf in a table, and also includes, per application, the total number of dependencies and the number of dependencies with at least one use of unchecked indexing. Please see the table3.pdf in /home/oopsla21ae/example-results/ for approximate expectations. Reviewers may observe some slight variation in these results due to different dependency versions.
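As a rough illustration of what counts as a direct use of unchecked indexing, the sketch below scans a crate's own sources for Rust's get_unchecked / get_unchecked_mut calls. It is a simplified, hypothetical scan; the artifact's actual pattern matching is in /home/oopsla21ae/scripts/regexify.py, and indirect uses additionally require scanning dependency sources.

```python
# Hypothetical sketch: count direct uses of unchecked indexing in a crate's
# source tree by matching get_unchecked / get_unchecked_mut calls. The
# artifact's real scan (scripts/regexify.py) may use different patterns.

import re
from pathlib import Path

UNCHECKED = re.compile(r"\.get_unchecked(_mut)?\s*\(")

def count_direct_unchecked(crate_root: str) -> int:
    return sum(
        len(UNCHECKED.findall(path.read_text(errors="ignore")))
        for path in Path(crate_root).rglob("*.rs")
    )

# Usage (the path is illustrative):
# print(count_direct_unchecked("/path/to/crate/src"))
```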
Table 4 produces no generated files.
The four steps of NADER are:
1. Check for any unchecked indexing
2. Compare the original binary with one generated after converting all unchecked indexing to checked indexing
3. Measure the overhead of all converted checked indexing (applicable to the current context only)
4. If the overhead is significant, run NADER to reintroduce bounds checks only up to a threshold
We expect the applications we evaluate to stop after the following steps:
- tantivy: after step 2 (binaries are identical)
- rage: after step 2 (binaries are identical)
- swc: after step 3 (checked indexing overhead == 0.13%)
- warp: after step 3 (checked indexing overhead == -0.31%)
- iron: after step 3 (checked indexing overhead == -2.01%)
- RustPython: after step 3 (checked indexing overhead == 0.71%)
- zola: after step 3 (checked indexing overhead == 0.25%)
- COST: after step 4 (not generated here, see figure 8)
- rust-brotli: after step 4 (not generated here, see figure 9)
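To summarize the decision logic behind the four steps and the list above, here is a minimal sketch. The inputs, helper name, and significance threshold are hypothetical placeholders; the artifact's real driver is ExpDriver.py together with /home/oopsla21ae/scripts/Nader.py.

```python
# Hypothetical sketch of the four-step decision logic, taking precomputed
# facts about an application as inputs. The 1% significance threshold is
# illustrative, not the artifact's actual cutoff.

def nader_stop_step(has_unchecked: bool,
                    binaries_identical: bool,
                    checked_overhead: float,
                    significance: float = 0.01) -> int:
    if not has_unchecked:
        return 1  # step 1: no unchecked indexing to convert
    if binaries_identical:
        return 2  # step 2: e.g. tantivy, rage
    if checked_overhead <= significance:
        return 3  # step 3: e.g. swc, warp, iron, RustPython, zola
    return 4      # step 4: threshold-guided reintroduction, e.g. COST, rust-brotli

# e.g. swc: binaries differ, checked-indexing overhead of 0.13% -> step 3
print(nader_stop_step(True, False, 0.0013))
```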
Generated files:
- figure8.pdf

figure8.pdf presents the same information as figure9.pdf (excluding the random line) for the COST benchmark instead of rust-brotli. Specifically, the dark blue line at the far right of the graph shows when NADER switches from the hotness heuristic to the one-checked heuristic and should be above the orange hotness line.
- The "different architecture" column in Table 1 is not supported by our artifact because the reviewers may not have access to two or more different architectures on which to run our experiments.
- The last column of Table 3 is also not supported by our artifact because it was the result of a manual process. We proceeded with applications that had reasonable synthetic profiling workloads, although there is room for a more rigorous process of elimination here.
- The artifact supports all major claims made by the paper (outlined in this document by the corresponding figures and tables)
- The artifact documents detailed steps for reproducing results and lists any potential deviations from what the paper claims
Deviations:
- All but Figure 7 and Table 3 are performance results and will vary, but we describe trends and patterns to look for
- A full evaluation takes almost 19 hours, but we offer reviewers a fast path that can complete in about three hours
- Future researchers can run this artifact on more libraries and applications by cloning their source code
- Future researchers building off this artifact can do so by adding new benchmarks and their arguments
- Future researchers can directly modify /home/oopsla21ae/scripts/Nader.py to improve its exploration algorithm
- Artifact source code can be reused as separate components, much in the same way as the individual plots are generated
- Others can learn about our benchmarking and large-scale application analysis techniques
- Others can extend the artifact beyond bounds checks to other code patterns by modifying /home/oopsla21ae/scripts/regexify.py (see the sketch below)
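As a hypothetical example of the kind of extension described in the last point, the snippet below targets a different code pattern (.unwrap() calls) instead of unchecked indexing; the regex and helper are assumptions, and the actual extension point is /home/oopsla21ae/scripts/regexify.py, whose interface may differ.

```python
# Hypothetical illustration of retargeting a pattern-based scan to another
# code pattern (.unwrap() calls). The regex and helper are assumptions.

import re

UNWRAP = re.compile(r"\.unwrap\s*\(\s*\)")

def count_pattern(source: str, pattern: re.Pattern = UNWRAP) -> int:
    return len(pattern.findall(source))

print(count_pattern("let x = maybe_value.unwrap();"))  # -> 1
```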