Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transform for Syntactic Construct Extractor #645

Closed
wants to merge 3 commits into from
Closed

Transform for Syntactic Construct Extractor #645

wants to merge 3 commits into from

Conversation

pankajskku
Copy link
Member

This module extracts the base syntactic concepts from the multi-language source codes and represents these concepts in a unified language-agnostic representation that can be further used for multi-language data profiling. While programming languages expose similar syntactic building blocks to represent programming intent, such as importing packages/libraries, functions, classes, loops, conditionals, comments and others, these concepts are expressed through language-specific grammar, defined by distinct keywords and syntactic form.

Why are these changes needed?

Our framework abstracts language-specific concepts by transforming them into a unified, language-agnostic representation called universal base syntactic representation (UBSR), referred to as a concept, which is consistently encoded within the proposed schema structure. The current version supports the base syntactic concept for importing/including package/libraries, comments, and functions.

Data profiling, in the context of machine learning, is the process of examining and analyzing data to create useful statistics. These statistics are used both as an aid for better comprehension of the properties of data as well as for a variety of downstream data processing tasks such as data valuation (assessing the value of data relative to the business objectives at hand) and data curation (filtering and prioritizing training data based on derived thresholds). In the Large Language Model (LLM) setting, training data is typically unstructured in nature comprising natural language text, images, and code. In this work, we specifically focus on code-LLMs, where the quality of code training data substantially affects the model accuracy of LLM-based coding tasks such as code generation and summarization. Therefore, having the capabilities to characterize code data in terms of programming language concepts aids in both deriving insights related to code training/evaluation data and in the downstream curation of code training data. In this work, we address the problem of profiling multi-lingual code datasets by extracting an extensible user-defined set of syntactic concepts over arbitrary programming languages.

Related issue number (if any).

Pankaj Thorat and others added 3 commits September 30, 2024 17:49
This module extracts the base syntactic concepts from the multi-language source codes and represent these concepts in an unified langauge-agnostic representation that can be further used for multi-lnaguage data profiling. While programming languages expose similar syntactic building blocks to represent programming intent, such as importing packages/libraries, functions, classes, loops, conditionals, comments and others, these concepts are expressed through language-specific grammar, defined by distinct keywords and syntactic form.
This tranform extracts the base syntactic concepts from the multi-language source codes and represent these concepts in an unified langauge-agnostic representation that can be further used for multi-lnaguage data profiling. While programming languages expose similar syntactic building blocks to represent programming intent, such as importing packages/libraries, functions, classes, loops, conditionals, comments and others, these concepts are expressed through language-specific grammar, defined by distinct keywords and syntactic form.

Signed-off-by: Pankaj Thorat <[email protected]>
@pankajskku pankajskku closed this Sep 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant