Commit: Pushing final changes.
jjhenkel committed Feb 6, 2020
1 parent e8d388f commit 69a8005
Showing 58 changed files with 963 additions and 59 deletions.
15 changes: 14 additions & 1 deletion .vscode/settings.json
@@ -1,5 +1,18 @@
{
    "files.watcherExclude": {
        "datasets/**": true
    },
    "cSpell.ignoreWords": [
        "aix",
        "check",
        "commit",
        "compat",
        "data",
        "hdfs",
        "mockito",
        "mode",
        "output",
        "stream",
        "test"
    ]
}
114 changes: 112 additions & 2 deletions README.md
@@ -1,2 +1,112 @@
# averloc (AdVERsarial Learning On Code)

Repository for Semantic Robustness of Models on Source Code.

## Directory Structure

In this repository, we have the following directories:

### `./datasets`

**Note:** The datasets are all much too large to be included in this GitHub repo. This is simply the
structure as it would exist on disk once our framework is set up.

```bash
./datasets
  + ./raw            # The four datasets in "raw" form
  + ./normalized     # The four datasets in the "normalized" JSON-lines representation
  + ./preprocessed
    + ./tokens       # The four datasets in a representation suitable for token-level models
    + ./ast-paths    # The four datasets in a representation suitable for code2seq
  + ./transformed    # The four datasets transformed via our code-transformation framework
    + ./normalized   # Transformed datasets normalized back into the JSON-lines representation
    + ./preprocessed # Transformed datasets preprocessed into:
      + ./tokens     # ... a representation suitable for token-level models
      + ./ast-paths  # ... a representation suitable for code2seq
  + ./adversarial    # Datasets in the format < source, target, transformed-variant #1, #2, ..., #K >
    + ./tokens       # ... in a token-level representation
    + ./ast-paths    # ... in an ast-paths representation
```
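
For orientation, here is one way to peek at a record in the "normalized" JSON-lines representation. This is a hypothetical sketch: the file path and field names below are assumptions for illustration, not a confirmed schema.

```bash
# Pretty-print the first record of a normalized split (the path and the
# field names in the comment below are illustrative assumptions).
head -n 1 datasets/normalized/c2s/java-small/train.jsonl | python -m json.tool
# One plausible shape of a record:
# { "source": "<tokenized method body>", "target": "<method name>" }
```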

### `./models`

We use two machine-learning-on-code models, both trained on the code summarization task. The
seq2seq model has been modified to include an adversarial training loop and a way to compute Integrated Gradients.
The code2seq model has been modified to include an adversarial training loop and to emit attention weights.

```bash
./models
  + ./code2seq         # code2seq model implementation
  + ./pytorch-seq2seq  # seq2seq model implementation
```

### `./results`

This directory stores results that are small enough to be checked into GitHub. A few utility scripts
live here as well.

### `./scratch`

This directory contains exploratory data analysis and evaluations that did not fit into the overall workflow of our
code-transformation + adversarial training framework. For instance, HTML-based visualizations of Integrated Gradients
and attention exist in this directory.

### `./scripts`

This directory contains a large number of scripts for various chores related to running and maintaining
this code-transformation infrastructure.

### `./tasks`

This directory houses the implementations of various pieces of our core framework:

```bash
./tasks
+ ./astor-apply-transforms
+ ./depth-k-test-seq2seq
+ ./download-c2s-dataset
+ ./download-csn-dataset
+ ./extract-adv-dataset-c2s
+ ./extract-adv-dataset-tokens
+ ./generate-baselines
+ ./integrated-gradients-seq2seq
+ ./normalize-raw-dataset
+ ./preprocess-dataset-c2s
+ ./preprocess-dataset-tokens
+ ./spoon-apply-transforms
+ ./test-model-code2seq
+ ./test-model-seq2seq
+ ./train-model-code2seq
+ ./train-model-seq2seq
```

### `./vendor`

This directory contains dependencies in the form of git submodules.

### `Makefile`

We have one overarching `Makefile` that can be used to drive a number of the data-generation, training, testing, and evaluation tasks.

```
download-datasets (DS-1) Downloads all prerequisite datasets
normalize-datasets (DS-2) Normalizes all downloaded datasets
extract-ast-paths (DS-3) Generate preprocessed data in a form usable by code2seq style models.
extract-tokens (DS-3) Generate preprocessed data in a form usable by seq2seq style models.
apply-transforms-c2s-java-med (DS-4) Apply our suite of transforms to code2seq's java-med dataset.
apply-transforms-c2s-java-small (DS-4) Apply our suite of transforms to code2seq's java-small dataset.
apply-transforms-csn-java (DS-4) Apply our suite of transforms to CodeSearchNet's java dataset.
apply-transforms-csn-python (DS-4) Apply our suite of transforms to CodeSearchNet's python dataset.
apply-transforms-sri-py150 (DS-4) Apply our suite of transforms to SRI Lab's py150k dataset.
extract-transformed-ast-paths (DS-6) Extract preprocessed representations (ast-paths) from our transformed (normalized) datasets
extract-transformed-tokens (DS-6) Extract preprocessed representations (tokens) from our transformed (normalized) datasets
extract-adv-datasets-tokens (DS-7) Extract preprocessed adversarial datasets (representations: tokens)
do-integrated-gradients-seq2seq (IG) Do IG for our seq2seq model
docker-cleanup (MISC) Cleans up old and out-of-sync Docker images.
submodules (MISC) Ensures that submodules are setup.
help (MISC) This help.
test-model-code2seq (TEST) Tests the code2seq model on a selected dataset.
test-model-seq2seq (TEST) Tests the seq2seq model on a selected dataset.
train-model-code2seq (TRAIN) Trains the code2seq model on a selected dataset.
train-model-seq2seq (TRAIN) Trains the seq2seq model on a selected dataset.
```
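
As a sketch of how these targets chain together, the following sequence follows the DS-1 through DS-7 staging above. Target names are taken verbatim from the help text; any dataset-selection variables the targets may require are not shown here.

```bash
make submodules                        # (MISC) ensure submodules are set up
make download-datasets                 # DS-1
make normalize-datasets                # DS-2
make extract-tokens                    # DS-3 (token-level representations)
make apply-transforms-c2s-java-small   # DS-4 (one of the five transform targets)
make extract-transformed-tokens        # DS-6
make extract-adv-datasets-tokens       # DS-7
make train-model-seq2seq               # (TRAIN)
```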
Binary file removed ig-results.tar.xz
40 changes: 40 additions & 0 deletions results/code2seq-csv.sh
@@ -0,0 +1,40 @@
#!/bin/bash

# Builds CSV rows of code2seq exact-match scores: one row per training regime
# (normal, adversarial-one-step, adversarial-all), with columns for the
# no-attack, one-step-attack, and all-attacks evaluations of each dataset.
# Expects ${1} to be the root directory holding the attacked_metrics.txt files.

for MODEL in normal adversarial-one-step adversarial-all; do

  if [ "${MODEL}" = "normal" ]; then
    FULL_STR='code2seq,Natural,'
  elif [ "${MODEL}" = "adversarial-one-step" ]; then
    FULL_STR=',Adv-${\seqs^1}$,'
  else
    FULL_STR=',Adv-${\seqs^{1,5}}$,'
  fi

  for DATASET in c2s/java-small csn/java sri/py150 csn/python; do

    THE_PATH_NORM="${1}/${DATASET}/${MODEL}/normal/attacked_metrics.txt"
    THE_PATH_ONE="${1}/${DATASET}/${MODEL}/just-one-step-attacks/attacked_metrics.txt"
    THE_PATH_ALL="${1}/${DATASET}/${MODEL}/all-attacks/attacked_metrics.txt"

    # Default each score to 0.0 when a metrics file is missing.
    F1_NORM=0.0
    if [ -f "${THE_PATH_NORM}" ]; then
      F1_NORM=$(grep -Po 'exact_match"?: \d+.\d+' "${THE_PATH_NORM}" | awk '{ print $2 }')
    fi

    F1_ONE=0.0
    if [ -f "${THE_PATH_ONE}" ]; then
      F1_ONE=$(grep -Po 'exact_match"?: \d+.\d+' "${THE_PATH_ONE}" | awk '{ print $2 }')
    fi

    F1_ALL=0.0
    if [ -f "${THE_PATH_ALL}" ]; then
      F1_ALL=$(grep -Po 'exact_match"?: \d+.\d+' "${THE_PATH_ALL}" | awk '{ print $2 }')
    fi

    FULL_STR+=$(printf %.2f,%.2f,%.2f, ${F1_NORM} ${F1_ONE} ${F1_ALL})
  done

  # Drop the trailing comma and emit the row.
  echo ${FULL_STR::-1}
done
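
A hypothetical invocation of the script above; the results directory name is made up for illustration, but the layout matches the `${1}/<dataset>/<model>/<attack>/attacked_metrics.txt` paths the script expects:

```bash
bash results/code2seq-csv.sh /path/to/code2seq-results > code2seq-table.csv
```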

Deleted files (1 line each; several additional deleted files are collapsed in this view):

  results/depth-k-results/csn/java/attacked_metrics_Q_0.txt
  results/depth-k-results/csn/java/attacked_metrics_Q_1.txt
  results/depth-k-results/csn/java/attacked_metrics_Q_2.txt
  results/depth-k-results/csn/java/attacked_metrics_Q_3.txt
  results/depth-k-results/csn/java/attacked_metrics_Q_4.txt
  results/depth-k-results/csn/java/attacked_metrics_Q_5.txt
  results/depth-k-results/sri/py150/attacked_metrics_Q_0.txt
  results/depth-k-results/sri/py150/attacked_metrics_Q_1.txt
  results/depth-k-results/sri/py150/attacked_metrics_Q_2.txt
  results/depth-k-results/sri/py150/attacked_metrics_Q_3.txt
  results/depth-k-results/sri/py150/attacked_metrics_Q_4.txt
  results/depth-k-results/sri/py150/attacked_metrics_Q_5.txt

6 changes: 3 additions & 3 deletions results/seq2seq-csv.sh
@@ -20,17 +20,17 @@ for MODEL in normal adversarial-one-step adversarial-all; do

    F1_NORM=0.0
    if [ -f "${THE_PATH_NORM}" ]; then
-     F1_NORM=$(grep -Po 'f1"?: \d+.\d+' "${THE_PATH_NORM}" | awk '{ print $2 }')
+     F1_NORM=$(grep -Po 'exact_match"?: \d+.\d+' "${THE_PATH_NORM}" | awk '{ print $2 }')
    fi

    F1_ONE=0.0
    if [ -f "${THE_PATH_ONE}" ]; then
-     F1_ONE=$(grep -Po 'f1"?: \d+.\d+' "${THE_PATH_ONE}" | awk '{ print $2 }')
+     F1_ONE=$(grep -Po 'exact_match"?: \d+.\d+' "${THE_PATH_ONE}" | awk '{ print $2 }')
    fi

    F1_ALL=0.0
    if [ -f "${THE_PATH_ALL}" ]; then
-     F1_ALL=$(grep -Po 'f1"?: \d+.\d+' "${THE_PATH_ALL}" | awk '{ print $2 }')
+     F1_ALL=$(grep -Po 'exact_match"?: \d+.\d+' "${THE_PATH_ALL}" | awk '{ print $2 }')
    fi

    FULL_STR+=$(printf %.2f,%.2f,%.2f, ${F1_NORM} ${F1_ONE} ${F1_ALL})
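
The change above swaps the metric the script scrapes from `f1` to `exact_match`. For a sense of what the pattern extracts, here it is run against a made-up metrics line:

```bash
echo 'exact_match: 41.37' | grep -Po 'exact_match"?: \d+.\d+' | awk '{ print $2 }'
# -> 41.37
```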
31 changes: 31 additions & 0 deletions scratch/before-after/ex-1-after.java
@@ -0,0 +1,31 @@
public static void quoteHtmlChars(
    OutputStream output, byte[] buffer, int off, int len
) throws IOException {
    System.out.println("usedMergeTransitionUgiSetup");
    System.out.println("statusPipelineUrl");
    System.out.println("shuffleTrashApplicationsCodeRestartSplitAllocatedMaximumCorrupt");
    for (int i = off; i < (off + len); i++) {
        switch (buffer[i]) {
            case '&' :
                output.write(ampBytes);
                break;
            case '<' :
                output.write(ltBytes);
                break;
            case '>' :
                output.write(gtBytes);
                break;
            case '\'' :
                output.write(aposBytes);
                break;
            case '"' :
                output.write(quotBytes);
                break;
            default :
                output.write(buffer, i, 1);
        }
    }
    System.out.println("validMinBalancerUserSkipSyncCodecRename");
    System.out.println("namenodeFsSlashJarFirstHosts");
    System.out.println("badHostsCounterEncryptionEntitiesRenderSortIdentifier");
}
25 changes: 25 additions & 0 deletions scratch/before-after/ex-1-before.java
@@ -0,0 +1,25 @@
public static void quoteHtmlChars(
    OutputStream output, byte[] buffer, int off, int len
) throws IOException {
    for (int i = off; i < (off + len); i++) {
        switch (buffer[i]) {
            case '&' :
                output.write(ampBytes);
                break;
            case '<' :
                output.write(ltBytes);
                break;
            case '>' :
                output.write(gtBytes);
                break;
            case '\'' :
                output.write(aposBytes);
                break;
            case '"' :
                output.write(quotBytes);
                break;
            default :
                output.write(buffer, i, 1);
        }
    }
}
18 changes: 18 additions & 0 deletions scratch/before-after/ex-2-after.java
@@ -0,0 +1,18 @@
public void testCheckCommitAixCompatMode() throws IOException {
    DFSClient dfsClient = Mockito.mock(DFSClient.class);
    Nfs3FileAttributes attr = new Nfs3FileAttributes();
    HdfsDataOutputStream fos = Mockito.mock(HdfsDataOutputStream.class);
    // Last argument "true" here to enable AIX compatibility mode.
    OpenFileCtx ctx = new OpenFileCtx(fos, attr, "/dumpFilePath",
        dfsClient, new IdUserGroup(new NfsConfiguration()), 1 == 1);
    // Test fall-through to pendingWrites check in the event that commitOffset
    // is greater than the number of bytes we've so far flushed.
    Mockito.when(fos.getPos()).thenReturn(((long) (2)));
    COMMIT_STATUS status = ctx.checkCommitInternal(5, null, 1, attr, 0 != 0);
    Assert.assertTrue(status == COMMIT_STATUS.COMMIT_FINISHED);
    // Test the case when we actually have received more bytes than we're trying
    // to commit.
    Mockito.when(fos.getPos()).thenReturn(((long) (10)));
    status = ctx.checkCommitInternal(5, null, 1, attr, 1 != 1);
    Assert.assertTrue(status == COMMIT_STATUS.COMMIT_DO_SYNC);
}
18 changes: 18 additions & 0 deletions scratch/before-after/ex-2-before.java
@@ -0,0 +1,18 @@
public void testCheckCommitAixCompatMode() throws IOException {
    DFSClient dfsClient = Mockito.mock(DFSClient.class);
    Nfs3FileAttributes attr = new Nfs3FileAttributes();
    HdfsDataOutputStream fos = Mockito.mock(HdfsDataOutputStream.class);
    // Last argument "true" here to enable AIX compatibility mode.
    OpenFileCtx ctx = new OpenFileCtx(fos, attr, "/dumpFilePath",
        dfsClient, new IdUserGroup(new NfsConfiguration()), true);
    // Test fall-through to pendingWrites check in the event that commitOffset
    // is greater than the number of bytes we've so far flushed.
    Mockito.when(fos.getPos()).thenReturn(((long) (2)));
    COMMIT_STATUS status = ctx.checkCommitInternal(5, null, 1, attr, false);
    Assert.assertTrue(status == COMMIT_STATUS.COMMIT_FINISHED);
    // Test the case when we actually have received more bytes than we're trying
    // to commit.
    Mockito.when(fos.getPos()).thenReturn(((long) (10)));
    status = ctx.checkCommitInternal(5, null, 1, attr, false);
    Assert.assertTrue(status == COMMIT_STATUS.COMMIT_DO_SYNC);
}
8 changes: 8 additions & 0 deletions scratch/before-after/ex-3-after.py
@@ -0,0 +1,8 @@
def delete_record(scheme_job_scheduler, response_master_running_report_args):
    """Delete a record and it's persistent identifiers."""
    response_master_running_report_args.delete()
    PersistentIdentifier.query.filter_by(object_type='rec', object_uuid=
        response_master_running_report_args.id).update({
        PersistentIdentifier.status: PIDStatus.DELETED})
    scheme_job_scheduler.delete_buckets(response_master_running_report_args)
    db.session.commit()
7 changes: 7 additions & 0 deletions scratch/before-after/ex-3-before.py
@@ -0,0 +1,7 @@
def delete_record(cls, record):
    """Delete a record and it's persistent identifiers."""
    record.delete()
    PersistentIdentifier.query.filter_by(object_type='rec', object_uuid=
        record.id).update({PersistentIdentifier.status: PIDStatus.DELETED})
    cls.delete_buckets(record)
    db.session.commit()
15 changes: 15 additions & 0 deletions scratch/before-after/ex-4-after.py
@@ -0,0 +1,15 @@
def replace(identifier_commit_invalid):
    print('line_maps')
    print('mode_clean_invalid_mask_valid_am_fetch')
    """Replaces a snapshot"""
    app = get_app()
    tracker_or_and_reader_failover = app.get_snapshot(identifier_commit_invalid)
    if not tracker_or_and_reader_failover:
        click.echo("Couldn't find snapshot %s" % identifier_commit_invalid)
        sys.exit(1)
    app.remove_snapshot(tracker_or_and_reader_failover)
    app.create_snapshot(identifier_commit_invalid)
    click.echo('Replaced snapshot %s' % identifier_commit_invalid)
    print('usage_incr_segment_stat_feature_started_apps')
    print('metadata_read_locations_summary_one')
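
Since both halves of each before/after pair are checked in, the injected edits are easy to inspect directly; for instance, ex-2 differs only in its boolean literals (`true` and `false` rewritten to the equivalent `1 == 1`, `0 != 0`, and `1 != 1` comparisons):

```bash
diff scratch/before-after/ex-2-before.java scratch/before-after/ex-2-after.java
```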