How to construct the data set #115

Open
zhipeng-web opened this issue May 9, 2024 · 2 comments
Comments

@zhipeng-web

Hello author, I have collected some data myself. How can I construct a JSON file in the same format as yours? Thank you in advance.

@FakeEyes2Wo

I think you can do this:

from ember.features import PEFeatureExtractor

extractor = PEFeatureExtractor(feature_version=2)
with open('test.exe', 'rb') as f:
    bytez = f.read()
print(extractor.features)
feature_vector = extractor.feature_vector(bytez)

@zer0daysec

Here is an example script I wrote to generate the features:

main.py

import json
import argparse

from pathlib import Path

import ember

# Start a new output file once the current one would exceed 2 GiB
MAX_FILE_SIZE = 2 * 1024 * 1024 * 1024


def output_path(output_dir, index):
    return output_dir / f"train_features_{index}.jsonl"


if __name__ == "__main__":
    prog = "sample feature"
    description = "Get Sample feature"
    parser = argparse.ArgumentParser(prog=prog, description=description)
    parser.add_argument("-v", "--featureversion", type=int, default=2, help="EMBER feature version")
    # nargs="?" makes the positional argument optional, so the default can apply
    parser.add_argument("folder", type=str, nargs="?", default="./samples", help="samples folder path")
    args = parser.parse_args()

    folder_path = Path(args.folder)
    output_dir = Path("./data")
    output_dir.mkdir(parents=True, exist_ok=True)

    # Build the extractor once and reuse it for every sample
    extractor = ember.PEFeatureExtractor(args.featureversion)

    file_index = 0
    current_file = output_path(output_dir, file_index)
    # The file is opened in append mode, so start from the size already on disk
    current_size = current_file.stat().st_size if current_file.exists() else 0
    f = open(current_file, "a")
    try:
        for binary_path in folder_path.rglob("*"):
            if not binary_path.is_file():
                continue
            file_data = binary_path.read_bytes()
            line = json.dumps(extractor.raw_features(file_data)) + "\n"
            line_size = len(line.encode("utf-8"))
            # Roll over to the next output file when the limit would be exceeded
            if current_size + line_size > MAX_FILE_SIZE:
                f.close()
                file_index += 1
                current_file = output_path(output_dir, file_index)
                current_size = current_file.stat().st_size if current_file.exists() else 0
                f = open(current_file, "a")
            f.write(line)
            current_size += line_size
    finally:
        # Close whichever output file is currently open
        f.close()

However, the generated output has no md5, appeared, label, or avclass attributes.

I don't know how to generate these four missing attribute values. If you know, could you tell me?
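
One possible way to fill in those fields, as a minimal sketch rather than the project's own method: md5 can be computed from the file bytes with `hashlib`, while appeared, label, and avclass cannot be derived from the binary and have to come from your own ground truth. The helper name `add_metadata` is hypothetical, and the value conventions used here (a "YYYY-MM" string for appeared, 0/1/-1 for label, a family name for avclass) are assumptions about the published data that should be checked against an actual record.

```python
import hashlib
import json


def add_metadata(raw_features, bytez, label, appeared, avclass=""):
    """Return a copy of raw_features with the metadata fields filled in.

    Hypothetical helper: label, appeared and avclass must be supplied by
    the caller, because they are not derivable from the binary itself.
    """
    record = dict(raw_features)
    # Hashes are computed directly from the file bytes.
    record["md5"] = hashlib.md5(bytez).hexdigest()
    # Keep an existing sha256 if the feature dict already has one.
    record.setdefault("sha256", hashlib.sha256(bytez).hexdigest())
    record["appeared"] = appeared  # assumed format: "YYYY-MM"
    record["label"] = label        # assumed: 0 benign, 1 malicious, -1 unlabeled
    record["avclass"] = avclass    # assumed: malware family name, "" if unknown
    return record


# Usage with placeholder bytes and a placeholder feature dict
# (in practice, pass the dict returned by extractor.raw_features(file_data)).
sample_bytes = b"MZ placeholder bytes, not a real PE file"
record = add_metadata(
    {"histogram": [0, 1, 2]},
    sample_bytes,
    label=1,
    appeared="2024-05",
    avclass="examplefamily",
)
line = json.dumps(record)
```

In the script above, this would mean calling the helper on the result of `extractor.raw_features(file_data)` before `json.dumps`, with the label taken from wherever your samples are labeled (for example, separate folders for benign and malicious files).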
