ACE extracts code elements (classes, methods, fields, etc.) found in freeform ASCII text documents. It does not require any a priori knowledge of the code elements, but it does require a large corpus of documents. The island parser is Java-based. ACE relies on heuristics and is NOT intended to be a re-implementation of the Java spec. Because the tool is Java-dependent, you must ensure that the documents you pass it don't contain other languages that look like Java (e.g., C#).
There are two stages:
- Find code elements that are unambiguous and build an index of valid code elements (CEs) in the document corpus
- Find ambiguous code elements, and use the index to resolve them to their declaring type
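The two stages can be illustrated with a toy sketch. This is not ACE's implementation (ACE uses an island parser plus Perl and SQL scripts); the regular expressions, function names, and sample documents are hypothetical and only show how an index built from unambiguous mentions can resolve ambiguous ones.

```python
import re
from collections import defaultdict

# Hypothetical pattern: a qualified mention such as "HttpClient.execute(".
QUALIFIED = re.compile(r"\b([A-Z]\w*)\.([a-z]\w*)\(")
# Hypothetical pattern: a bare call such as "execute(" with no qualifier.
BARE = re.compile(r"(?<![\w.])([a-z]\w*)\(")


def build_index(docs):
    """Stage 1: map each member name to the types it was seen qualified with."""
    index = defaultdict(set)
    for text in docs:
        for type_name, member in QUALIFIED.findall(text):
            index[member].add(type_name)
    return index


def resolve(text, index):
    """Stage 2: resolve bare calls that have exactly one candidate declaring type."""
    resolved = []
    for member in BARE.findall(text):
        types = index.get(member, set())
        if len(types) == 1:  # still ambiguous or unknown otherwise, so skip it
            resolved.append(f"{next(iter(types))}.{member}")
    return resolved


# Made-up corpus of two documents.
docs = [
    "Call HttpClient.execute(request) to send it.",
    "Then execute(request) blocks until done.",
]
index = build_index(docs)
print(resolve(docs[1], index))  # ['HttpClient.execute']
```

A member seen with more than one declaring type stays unresolved in this sketch, which mirrors the idea that terms remaining ambiguous are left alone.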
- sudo apt-get install postgresql libconfig-general-perl libregexp-common-perl libdbd-pg-perl libfile-slurp-perl libfile-find-rule-perl libio-all-perl
- Postgres has ridiculously low default memory settings. Make sure to tune it for a data warehouse (DW) workload using http://pgtune.leopard.in.ua/
- See the example config file
- src/example.conf
- There are a number of main files named main_'type'.pl, for example, main_stackoverflow.pl. Each main file handles a different document type. They live in the main directory to prevent clutter, but they must be moved to and run from the /src directory to work
- src/main/main_example.conf
- resolve.sh will run all the appropriate Perl and SQL scripts
./resolve.sh db_name config_file main_file 2> project_name/project.error | tee project_name/project.out
./resolve.sh so_2017_march android/android.conf main/main_stackoverflow.pl 2> android/android.error | tee android/android.out
- Certain two-word camel-case terms are not valid classes and are project-specific (e.g., HttpClient is both a class and a project)
- To fix these terms, add them to src/project_specific/<project_name>.sql
- If you're going to run more than one project/tag through ACE in the same db, you'll need to modify src/rename_tables.sql
- Just dump the clt table (Code Like Term) to a file
- select tid, du, pqn, simple, kind from clt where trust = 0 and kind <> 'variable';
- You can increase the reliability of the tool by joining with the index table
- tid = The thread id or question id
- du = the document unit or post id
- pqn = the partially qualified name (we don't resolve all the way back to packages)
- kind = the kind of code element: type, method, field, package, annotation, etc.
- simple = the name of the code element (variables are not code elements and are only used as intermediaries)
- trust = you only want trust = 0; anything higher has been rejected
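Once the clt rows are dumped, they can be summarized with a few lines of Python. The sketch below assumes the dump is a CSV file with a header row naming the columns from the query above (tid, du, pqn, simple, kind); that file format, the helper names, and the sample rows are all hypothetical.

```python
import csv
import io
from collections import Counter

# Made-up sample of a clt dump; real rows come from the select query.
SAMPLE_DUMP = """tid,du,pqn,simple,kind
101,5001,HttpClient,execute,method
101,5001,HttpClient,HttpClient,type
102,5002,Activity,onCreate,method
"""


def load_clt(handle):
    """Read dumped clt rows into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))


def count_by_kind(rows):
    """Count how many code elements of each kind were extracted."""
    return Counter(row["kind"] for row in rows)


rows = load_clt(io.StringIO(SAMPLE_DUMP))
for kind, count in count_by_kind(rows).most_common():
    print(kind, count)
```

For a real dump, pass an open file handle to load_clt instead of the in-memory sample.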
- ACE is designed for parsing code snippets and freeform text, not entire source code files
- Scope rules are expanded to include documentation units, etc
- ACE includes elements from stacktraces
- Ignore: System.* (e.g., System.out.println)
- Ignore: Default annotation (e.g., @Override)
- more details in src/limits.mediawiki
Here are some scripts to do manual validation:
- See src/benchmark/README.mediawiki
- benchmark.sql -> dumps current bench_good.csv and creates a new bench.csv for analysis, also fills out the benchmark table
- results_ace_tables.sql -> ace result tables including precision and recall
- Note: If a CE is private (e.g., a method created by the poster) and it is not fully defined in that post, we ignore it because there's no way of knowing its declaring type
- try: ./main/main_nlp_stackoverflow.pl android/android.conf
- Note: Terms that remain ambiguous will not be highlighted, even if the term is identified later on in the document. This is to avoid terms that are very common. For example, if url.get() is identified in a document, we could go back and highlight every instance of the word 'get', which would be silly.
If you just include code elements with trust 0, you'll be fine. If you want to know where things came from, look at the trust_original values:
- 0 -> naturally good: qualified or defined or a new Constructor()
- 1 -> variable and member, but variable is unresolved
- 2 -> method undefined
- 3 -> type undefined
- 4 -> chain of methods, can have undefined type
- 5 -> field undefined
- 6 -> declaration of constructor, method, class
- 7 -> second pass from dictionary
- 8 -> stacktrace
- 9 -> ambiguous package name
- 10 -> annotation or annotation_package
- 11 -> non-compound second round (may not be processed)
- 12 -> project specific badness
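When inspecting dumped rows, it can help to translate trust_original codes back into these descriptions. The lookup below is a small sketch: the descriptions are copied from the list above, while the dictionary and function names are hypothetical and not part of ACE.

```python
# trust_original codes and their meanings, as listed in this README.
TRUST_ORIGINAL = {
    0: "naturally good: qualified or defined or a new Constructor()",
    1: "variable and member, but variable is unresolved",
    2: "method undefined",
    3: "type undefined",
    4: "chain of methods, can have undefined type",
    5: "field undefined",
    6: "declaration of constructor, method, class",
    7: "second pass from dictionary",
    8: "stacktrace",
    9: "ambiguous package name",
    10: "annotation or annotation_package",
    11: "non-compound second round (may not be processed)",
    12: "project specific badness",
}


def describe_trust(code):
    """Return the description for a trust_original code, or a fallback string."""
    return TRUST_ORIGINAL.get(code, "unknown trust_original value")


print(describe_trust(8))  # stacktrace
```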
shams/Android_API_Change_Bug_History/src/combine_with_so