Commit: Pushing final changes.
jjhenkel committed Feb 6, 2020
1 parent e8d388f commit 69a8005
Showing 58 changed files with 963 additions and 59 deletions.
15 changes: 14 additions & 1 deletion .vscode/settings.json
@@ -1,5 +1,18 @@
{
    "files.watcherExclude": {
        "datasets/**": true
    },
    "cSpell.ignoreWords": [
        "aix",
        "check",
        "commit",
        "compat",
        "data",
        "hdfs",
        "mockito",
        "mode",
        "output",
        "stream",
        "test"
    ]
}
114 changes: 112 additions & 2 deletions README.md
@@ -1,2 +1,112 @@
# averloc (AdVERsarial Learning On Code)

Repository for Semantic Robustness of Models on Source Code.

## Directory Structure

In this repository, we have the following directories:

### `./datasets`

**Note:** The datasets are all much too large to be included in this GitHub repo. This is simply the
structure as it would exist on disk once our framework is set up.

```bash
./datasets
  + ./raw            # The four datasets in "raw" form
  + ./normalized     # The four datasets in the "normalized" JSON-lines representation
  + ./preprocessed
    + ./tokens       # The four datasets in a representation suitable for token-level models
    + ./ast-paths    # The four datasets in a representation suitable for code2seq
  + ./transformed    # The four datasets transformed via our code-transformation framework
    + ./normalized   # Transformed datasets normalized back into the JSON-lines representation
    + ./preprocessed # Transformed datasets preprocessed into:
      + ./tokens     # ... a representation suitable for token-level models
      + ./ast-paths  # ... a representation suitable for code2seq
  + ./adversarial    # Datasets in the format < source, target, transformed-variant #1, #2, ..., #K >
    + ./tokens       # ... in a token-level representation
    + ./ast-paths    # ... in an ast-paths representation
```
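
For orientation, here is one way to peek at a record in the "normalized" JSON-lines representation. This is a hypothetical sketch: the file path and field names below are assumptions for illustration, not a confirmed schema.

```bash
# Pretty-print the first record of a normalized split (the path and the
# field names in the comment below are illustrative assumptions).
head -n 1 datasets/normalized/c2s/java-small/train.jsonl | python -m json.tool
# One plausible shape of a record:
# { "source": "<tokenized method body>", "target": "<method name>" }
```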

### `./models`

We use two machine-learning-on-code models, both trained on the code summarization task. The
seq2seq model has been modified to include an adversarial training loop and a way to compute Integrated Gradients.
The code2seq model has been modified to include an adversarial training loop and to emit attention weights.

```bash
./models
  + ./code2seq         # code2seq model implementation
  + ./pytorch-seq2seq  # seq2seq model implementation
```

### `./results`

This directory stores results that are small enough to be checked into GitHub. A few utility scripts
live here as well.

### `./scratch`

This directory contains exploratory data analysis and evaluations that did not fit into the overall workflow of our
code-transformation + adversarial training framework. For instance, HTML-based visualizations of Integrated Gradients
and attention exist in this directory.

### `./scripts`

This directory contains a large number of scripts for various chores related to running and maintaining
this code-transformation infrastructure.

### `./tasks`

This directory houses the implementations of various pieces of our core framework:

```bash
./tasks
+ ./astor-apply-transforms
+ ./depth-k-test-seq2seq
+ ./download-c2s-dataset
+ ./download-csn-dataset
+ ./extract-adv-dataset-c2s
+ ./extract-adv-dataset-tokens
+ ./generate-baselines
+ ./integrated-gradients-seq2seq
+ ./normalize-raw-dataset
+ ./preprocess-dataset-c2s
+ ./preprocess-dataset-tokens
+ ./spoon-apply-transforms
+ ./test-model-code2seq
+ ./test-model-seq2seq
+ ./train-model-code2seq
+ ./train-model-seq2seq
```

### `./vendor`

This directory contains dependencies in the form of git submodules.

### `Makefile`

We have one overarching `Makefile` that can be used to drive a number of the data-generation, training, testing, and evaluation tasks.

```
download-datasets (DS-1) Downloads all prerequisite datasets
normalize-datasets (DS-2) Normalizes all downloaded datasets
extract-ast-paths (DS-3) Generate preprocessed data in a form usable by code2seq style models.
extract-tokens (DS-3) Generate preprocessed data in a form usable by seq2seq style models.
apply-transforms-c2s-java-med (DS-4) Apply our suite of transforms to code2seq's java-med dataset.
apply-transforms-c2s-java-small (DS-4) Apply our suite of transforms to code2seq's java-small dataset.
apply-transforms-csn-java (DS-4) Apply our suite of transforms to CodeSearchNet's java dataset.
apply-transforms-csn-python (DS-4) Apply our suite of transforms to CodeSearchNet's python dataset.
apply-transforms-sri-py150 (DS-4) Apply our suite of transforms to SRI Lab's py150k dataset.
extract-transformed-ast-paths (DS-6) Extract preprocessed representations (ast-paths) from our transformed (normalized) datasets
extract-transformed-tokens (DS-6) Extract preprocessed representations (tokens) from our transformed (normalized) datasets
extract-adv-datasets-tokens (DS-7) Extract preprocessed adversarial datasets (representations: tokens)
do-integrated-gradients-seq2seq (IG) Do IG for our seq2seq model
docker-cleanup (MISC) Cleans up old and out-of-sync Docker images.
submodules (MISC) Ensures that submodules are setup.
help (MISC) This help.
test-model-code2seq (TEST) Tests the code2seq model on a selected dataset.
test-model-seq2seq (TEST) Tests the seq2seq model on a selected dataset.
train-model-code2seq (TRAIN) Trains the code2seq model on a selected dataset.
train-model-seq2seq (TRAIN) Trains the seq2seq model on a selected dataset.
```
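
As a sketch of how these targets chain together, the following sequence follows the DS-1 through DS-7 staging above. Target names are taken verbatim from the help text; any dataset-selection variables the targets may require are not shown here.

```bash
make submodules                        # (MISC) ensure submodules are set up
make download-datasets                 # DS-1
make normalize-datasets                # DS-2
make extract-tokens                    # DS-3 (token-level representations)
make apply-transforms-c2s-java-small   # DS-4 (one of the five transform targets)
make extract-transformed-tokens        # DS-6
make extract-adv-datasets-tokens       # DS-7
make train-model-seq2seq               # (TRAIN)
```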
Binary file removed ig-results.tar.xz
40 changes: 40 additions & 0 deletions results/code2seq-csv.sh
@@ -0,0 +1,40 @@
#!/bin/bash

# Builds CSV rows of code2seq exact-match scores: one row per training regime
# (normal, adversarial-one-step, adversarial-all), with columns for the
# no-attack, one-step-attack, and all-attacks evaluations of each dataset.
# Expects ${1} to be the root directory holding the attacked_metrics.txt files.

for MODEL in normal adversarial-one-step adversarial-all; do

  if [ "${MODEL}" = "normal" ]; then
    FULL_STR='code2seq,Natural,'
  elif [ "${MODEL}" = "adversarial-one-step" ]; then
    FULL_STR=',Adv-${\seqs^1}$,'
  else
    FULL_STR=',Adv-${\seqs^{1,5}}$,'
  fi

  for DATASET in c2s/java-small csn/java sri/py150 csn/python; do

    THE_PATH_NORM="${1}/${DATASET}/${MODEL}/normal/attacked_metrics.txt"
    THE_PATH_ONE="${1}/${DATASET}/${MODEL}/just-one-step-attacks/attacked_metrics.txt"
    THE_PATH_ALL="${1}/${DATASET}/${MODEL}/all-attacks/attacked_metrics.txt"

    # Default each score to 0.0 when a metrics file is missing.
    F1_NORM=0.0
    if [ -f "${THE_PATH_NORM}" ]; then
      F1_NORM=$(grep -Po 'exact_match"?: \d+.\d+' "${THE_PATH_NORM}" | awk '{ print $2 }')
    fi

    F1_ONE=0.0
    if [ -f "${THE_PATH_ONE}" ]; then
      F1_ONE=$(grep -Po 'exact_match"?: \d+.\d+' "${THE_PATH_ONE}" | awk '{ print $2 }')
    fi

    F1_ALL=0.0
    if [ -f "${THE_PATH_ALL}" ]; then
      F1_ALL=$(grep -Po 'exact_match"?: \d+.\d+' "${THE_PATH_ALL}" | awk '{ print $2 }')
    fi

    FULL_STR+=$(printf %.2f,%.2f,%.2f, ${F1_NORM} ${F1_ONE} ${F1_ALL})
  done

  # Drop the trailing comma and emit the row.
  echo ${FULL_STR::-1}
done
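
A hypothetical invocation of the script above; the results directory name is made up for illustration, but the layout matches the `${1}/<dataset>/<model>/<attack>/attacked_metrics.txt` paths the script expects:

```bash
bash results/code2seq-csv.sh /path/to/code2seq-results > code2seq-table.csv
```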

Deleted files (1 line each; several additional deleted files are collapsed in this view):

  results/depth-k-results/csn/java/attacked_metrics_Q_0.txt
  results/depth-k-results/csn/java/attacked_metrics_Q_1.txt
  results/depth-k-results/csn/java/attacked_metrics_Q_2.txt
  results/depth-k-results/csn/java/attacked_metrics_Q_3.txt
  results/depth-k-results/csn/java/attacked_metrics_Q_4.txt
  results/depth-k-results/csn/java/attacked_metrics_Q_5.txt
  results/depth-k-results/sri/py150/attacked_metrics_Q_0.txt
  results/depth-k-results/sri/py150/attacked_metrics_Q_1.txt
  results/depth-k-results/sri/py150/attacked_metrics_Q_2.txt
  results/depth-k-results/sri/py150/attacked_metrics_Q_3.txt
  results/depth-k-results/sri/py150/attacked_metrics_Q_4.txt
  results/depth-k-results/sri/py150/attacked_metrics_Q_5.txt

6 changes: 3 additions & 3 deletions results/seq2seq-csv.sh
@@ -20,17 +20,17 @@ for MODEL in normal adversarial-one-step adversarial-all; do

    F1_NORM=0.0
    if [ -f "${THE_PATH_NORM}" ]; then
-     F1_NORM=$(grep -Po 'f1"?: \d+.\d+' "${THE_PATH_NORM}" | awk '{ print $2 }')
+     F1_NORM=$(grep -Po 'exact_match"?: \d+.\d+' "${THE_PATH_NORM}" | awk '{ print $2 }')
    fi

    F1_ONE=0.0
    if [ -f "${THE_PATH_ONE}" ]; then
-     F1_ONE=$(grep -Po 'f1"?: \d+.\d+' "${THE_PATH_ONE}" | awk '{ print $2 }')
+     F1_ONE=$(grep -Po 'exact_match"?: \d+.\d+' "${THE_PATH_ONE}" | awk '{ print $2 }')
    fi

    F1_ALL=0.0
    if [ -f "${THE_PATH_ALL}" ]; then
-     F1_ALL=$(grep -Po 'f1"?: \d+.\d+' "${THE_PATH_ALL}" | awk '{ print $2 }')
+     F1_ALL=$(grep -Po 'exact_match"?: \d+.\d+' "${THE_PATH_ALL}" | awk '{ print $2 }')
    fi

    FULL_STR+=$(printf %.2f,%.2f,%.2f, ${F1_NORM} ${F1_ONE} ${F1_ALL})
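
The change above swaps the metric the script scrapes from `f1` to `exact_match`. For a sense of what the pattern extracts, here it is run against a made-up metrics line:

```bash
echo 'exact_match: 41.37' | grep -Po 'exact_match"?: \d+.\d+' | awk '{ print $2 }'
# -> 41.37
```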
31 changes: 31 additions & 0 deletions scratch/before-after/ex-1-after.java
@@ -0,0 +1,31 @@
public static void quoteHtmlChars(
    OutputStream output, byte[] buffer, int off, int len
) throws IOException {
    System.out.println("usedMergeTransitionUgiSetup");
    System.out.println("statusPipelineUrl");
    System.out.println("shuffleTrashApplicationsCodeRestartSplitAllocatedMaximumCorrupt");
    for (int i = off; i < (off + len); i++) {
        switch (buffer[i]) {
            case '&' :
                output.write(ampBytes);
                break;
            case '<' :
                output.write(ltBytes);
                break;
            case '>' :
                output.write(gtBytes);
                break;
            case '\'' :
                output.write(aposBytes);
                break;
            case '"' :
                output.write(quotBytes);
                break;
            default :
                output.write(buffer, i, 1);
        }
    }
    System.out.println("validMinBalancerUserSkipSyncCodecRename");
    System.out.println("namenodeFsSlashJarFirstHosts");
    System.out.println("badHostsCounterEncryptionEntitiesRenderSortIdentifier");
}
25 changes: 25 additions & 0 deletions scratch/before-after/ex-1-before.java
@@ -0,0 +1,25 @@
public static void quoteHtmlChars(
    OutputStream output, byte[] buffer, int off, int len
) throws IOException {
    for (int i = off; i < (off + len); i++) {
        switch (buffer[i]) {
            case '&' :
                output.write(ampBytes);
                break;
            case '<' :
                output.write(ltBytes);
                break;
            case '>' :
                output.write(gtBytes);
                break;
            case '\'' :
                output.write(aposBytes);
                break;
            case '"' :
                output.write(quotBytes);
                break;
            default :
                output.write(buffer, i, 1);
        }
    }
}
18 changes: 18 additions & 0 deletions scratch/before-after/ex-2-after.java
@@ -0,0 +1,18 @@
public void testCheckCommitAixCompatMode() throws IOException {
    DFSClient dfsClient = Mockito.mock(DFSClient.class);
    Nfs3FileAttributes attr = new Nfs3FileAttributes();
    HdfsDataOutputStream fos = Mockito.mock(HdfsDataOutputStream.class);
    // Last argument "true" here to enable AIX compatibility mode.
    OpenFileCtx ctx = new OpenFileCtx(fos, attr, "/dumpFilePath",
        dfsClient, new IdUserGroup(new NfsConfiguration()), 1 == 1);
    // Test fall-through to pendingWrites check in the event that commitOffset
    // is greater than the number of bytes we've so far flushed.
    Mockito.when(fos.getPos()).thenReturn(((long) (2)));
    COMMIT_STATUS status = ctx.checkCommitInternal(5, null, 1, attr, 0 != 0);
    Assert.assertTrue(status == COMMIT_STATUS.COMMIT_FINISHED);
    // Test the case when we actually have received more bytes than we're trying
    // to commit.
    Mockito.when(fos.getPos()).thenReturn(((long) (10)));
    status = ctx.checkCommitInternal(5, null, 1, attr, 1 != 1);
    Assert.assertTrue(status == COMMIT_STATUS.COMMIT_DO_SYNC);
}
18 changes: 18 additions & 0 deletions scratch/before-after/ex-2-before.java
@@ -0,0 +1,18 @@
public void testCheckCommitAixCompatMode() throws IOException {
    DFSClient dfsClient = Mockito.mock(DFSClient.class);
    Nfs3FileAttributes attr = new Nfs3FileAttributes();
    HdfsDataOutputStream fos = Mockito.mock(HdfsDataOutputStream.class);
    // Last argument "true" here to enable AIX compatibility mode.
    OpenFileCtx ctx = new OpenFileCtx(fos, attr, "/dumpFilePath",
        dfsClient, new IdUserGroup(new NfsConfiguration()), true);
    // Test fall-through to pendingWrites check in the event that commitOffset
    // is greater than the number of bytes we've so far flushed.
    Mockito.when(fos.getPos()).thenReturn(((long) (2)));
    COMMIT_STATUS status = ctx.checkCommitInternal(5, null, 1, attr, false);
    Assert.assertTrue(status == COMMIT_STATUS.COMMIT_FINISHED);
    // Test the case when we actually have received more bytes than we're trying
    // to commit.
    Mockito.when(fos.getPos()).thenReturn(((long) (10)));
    status = ctx.checkCommitInternal(5, null, 1, attr, false);
    Assert.assertTrue(status == COMMIT_STATUS.COMMIT_DO_SYNC);
}
8 changes: 8 additions & 0 deletions scratch/before-after/ex-3-after.py
@@ -0,0 +1,8 @@
def delete_record(scheme_job_scheduler, response_master_running_report_args):
    """Delete a record and it's persistent identifiers."""
    response_master_running_report_args.delete()
    PersistentIdentifier.query.filter_by(object_type='rec', object_uuid=
        response_master_running_report_args.id).update({
        PersistentIdentifier.status: PIDStatus.DELETED})
    scheme_job_scheduler.delete_buckets(response_master_running_report_args)
    db.session.commit()
7 changes: 7 additions & 0 deletions scratch/before-after/ex-3-before.py
@@ -0,0 +1,7 @@
def delete_record(cls, record):
    """Delete a record and it's persistent identifiers."""
    record.delete()
    PersistentIdentifier.query.filter_by(object_type='rec', object_uuid=
        record.id).update({PersistentIdentifier.status: PIDStatus.DELETED})
    cls.delete_buckets(record)
    db.session.commit()
15 changes: 15 additions & 0 deletions scratch/before-after/ex-4-after.py
@@ -0,0 +1,15 @@
def replace(identifier_commit_invalid):
    print('line_maps')
    print('mode_clean_invalid_mask_valid_am_fetch')
    """Replaces a snapshot"""
    app = get_app()
    tracker_or_and_reader_failover = app.get_snapshot(identifier_commit_invalid)
    if not tracker_or_and_reader_failover:
        click.echo("Couldn't find snapshot %s" % identifier_commit_invalid)
        sys.exit(1)
    app.remove_snapshot(tracker_or_and_reader_failover)
    app.create_snapshot(identifier_commit_invalid)
    click.echo('Replaced snapshot %s' % identifier_commit_invalid)
    print('usage_incr_segment_stat_feature_started_apps')
    print('metadata_read_locations_summary_one')
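
Since both halves of each before/after pair are checked in, the injected edits are easy to inspect directly; for instance, ex-2 differs only in its boolean literals (`true` and `false` rewritten to the equivalent `1 == 1`, `0 != 0`, and `1 != 1` comparisons):

```bash
diff scratch/before-after/ex-2-before.java scratch/before-after/ex-2-after.java
```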