Skip to content

CESEL/ace

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

59 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Table of Contents

ACE - Automated Code Element Extractor

ACE extracts code elements (classes, methods, fields, etc) found in freeform ASCII text documents. It does not require any apriori knowledge of the code elements, but does require a large corpus of documents. The island parser is java based. ACE is based on heuristics and is NOT intended to be a re-implementation of the Java Spec. Since the tool is Java dependent, you must ensure that the docs you are passing it don't contain other languages that look like Java (e.g., C#).


There are two stages:

  1. Find code elements that are unambiguous and build an index of valid ce in the document corpus
  2. Find ambiguous code elements, and use the index to resolve them to their declaring type
More details in Rigby and Robillard ICSE 2013.

Required Libraries

  • sudo apt-get install postgresql libconfig-general-perl libregexp-common-perl libdbd-pg-perl libfile-slurp-perl libfile-find-rule-perl libio-all-perl
  • Postgres has ridculously low mem settings. Make sure to tune it for DW using http://pgtune.leopard.in.ua/

Run

  • See the example config file
    • src/example.conf
  • There are a number of main main_'type'.pl, for example, main_stackoverflow.pl. Main files are for different document types and are in main directory to prevent cluttering but must be moved and run from the /src directory to work
    • src/main/main_example.conf
  • resolve.sh will run all the appropriate perl and sql scripts
    •  ./resolve.sh db_name config_file main_file 2> project_name/project.error | tee project_name/project.out 
    •  ./resolve.sh so_2017_march android/android.conf main/main_stackoverflow.pl 2> android/android.error | tee andriod/android.out 
  • certain two camel words are are not valid classes and are project specific. (e.g., HttpClient is both a class and a project)
    • Too fix these terms add them to src/project_specific/<project_name></project_name>.sql
  • if you're going to run more than one project/tag through ACE in the same db,
    • you'll need to modify: src/rename_tables.sql

The Output

  • Just dump the clt table (Code Like Term) to a file
    • select tid, du, pqn, simple, kind from clt where trust = 0 and kind <> 'variable';
    • You can increase the reliability of tool by joining with the index table
  • tid = The thread id or question id
  • du = the document unit or post id
  • pqn = the partially qualified name (we don't resolve all the way back to packages)
  • kind = the type, method, field, package, annotation, etc
  • simple = the name of the code element (variables are not code elements and are only used as intermediaries)
  • trust = only want trust = 0, anything higher has been rejected

NOTES

  • ACE is designed for parsing code snippets and freeform text not entire source code files
    • Scope rules are expanded to include documentation units, etc
  • ACE includes elements from stacktraces

Special Cases

  • Ignore: System.* (e.g., System.out.println)
  • Ignore: Default annotation (e.g., @Override)
  • more details in src/limits.mediawiki

Manual Validation

Here are some scripts to do manual validation:

  • See src/benchmark/README.mediawiki
    • benchmark.sql -> dumps current bench_good.csv and creates a new bench.csv for analysis, also fills out the benchmark table
  • results_ace_tables.sql -> ace result tables including precision and recall
  • Note: If a CE is private (e.g., a method created by the poster) and it is not fully defined in that post, we ignore it because there's no way of knowing its declaring type

Further NLP parsing (experimental)

  • try: ./main/main_nlp_stackoverflow.pl android/android.conf
  • Note: Terms that remain ambiguous will not be highlighted, even if the term is identified later on in the document. This is to avoid terms that are very common. For example if url.get() is identified in a document, we could go back and highlight every instance of the word 'get', which would be silly.

What do the trust values mean?

If you just include code elements with trust 0, you'll be fine, if you want to know where things came from then you can look at the trust_original values:

0 &#45;&gt; naturally good&#58; qualified or defined or a new Constructor()
1&#45;&gt; variable and member, but variable is unresolved
2&#45;&gt; method undefined
3&#45;&gt; type undefined
4&#45;&gt; chain of methods, can have undefined type 
5&#45;&gt; field undefined
6&#45;&gt; declaration of constructor, method, class
7&#45;&gt; second pass from dictionary
8&#45;&gt; stacktrace
9&#45;&gt; ambigious package name
10&#45;&gt; annotation or annotation_package
11&#45;&gt; non&#45;compound second round (may not be processed)
12&#45;&gt; project specific badness

Android project

shams/Android_API_Change_Bug_History/src/combine_with_so