Methodology

Below we describe the methodology used to benchmark CLP and other tools.

Test datasets

There are two types of logs used in the benchmark:

  • Unstructured logs don't follow a predefined format or structure, making them difficult to parse or analyze automatically. They are typically free text interspersed with variable values such as the event's timestamp, log level, usernames, and so on. For example:

    2018-06-05 06:15:29,701 DEBUG org.apache.hadoop.util.Shell: setsid exited with exit code 0
    
  • Semi-structured logs do follow a predefined format, making them parseable, but their schema is not rigid. Semi-structured log events usually have a few consistent fields (e.g., the event's timestamp and log level), while the remaining fields may appear and disappear from event to event. For example:

    {
        "t": {
            "$date": "2023-03-21T23:34:54.576-04:00"
        },
        "s": "I",
        "c": "NETWORK",
        "id": 4915701,
        "ctx": "-",
        "msg": "Initialized wire specification",
        "attr": {
            "spec": {
                "incomingExternalClient": {
                    "minWireVersion": 0,
                    "maxWireVersion": 17
                },
                "incomingInternalClient": {
                    "minWireVersion": 0,
                    "maxWireVersion": 17
                },
                "outgoing": {
                    "minWireVersion": 6,
                    "maxWireVersion": 17
                },
                "isInternalClient": true
            }
        }
    }

The unstructured log dataset, hadoop-25GB, contains log events generated by Hadoop while running workloads from the HiBench Benchmark Suite on three Hadoop clusters, each containing 48 data nodes. The dataset is a 258 GiB subset (specifically, workers 1-3) of a larger 14 TiB dataset.

The semi-structured log dataset, mongodb, contains log events generated by MongoDB when running YCSB workloads A-E repeatedly. The dataset is 64.8 GiB and contains 186,287,600 log events.

Benchmark machines

Where we report benchmark results, we use a Linux server with an Intel Xeon E5-2630v3 processor and 128 GB of DDR4 memory. Both the uncompressed and compressed logs are stored on a 7200 RPM SATA HDD.

Test runtime scenarios

We test two possible runtime scenarios for each tool:

  • Hot Run is where we ingest the test dataset and then immediately run the query benchmark.
    • TODO: In the future, we'll warm up the system by running the benchmark until the results remain consistent, and then collect results.
  • Cold Run is where we ingest the data, restart the system, and then run the query benchmark.
    • TODO: In the future, we plan to implement a more comprehensive approach by clearing the OS's caches to simulate a completely cold environment.

Metrics collected

Ingest time

This measures the time taken to ingest the data. Smaller values are better, indicating faster ingestion performance.

Compressed size

This measures the on-disk size of the data, post-ingestion, in the tool being tested. We measure it using du <data> -bc. Smaller values are better, indicating higher compression.
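
As a quick sanity check of the measurement command, the snippet below creates a file of known size and confirms the byte count reported by du -bc (GNU coreutils assumed; the paths are throwaway examples, not the benchmark's actual data directories):

```shell
# Create a throwaway directory with one 5-byte file.
demo_dir=$(mktemp -d)
printf 'hello' > "$demo_dir/a.log"

# du -bc prints a per-file line plus a "total" line; keep the total's byte count.
size=$(du -bc "$demo_dir"/*.log | tail -n 1 | cut -f1)
echo "$size"   # prints 5
```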

Average memory usage

This measures average memory usage, separately, for the ingestion and query stages. Smaller values are better, indicating lower resource usage.

There are two collection methods based on how the tools being tested run:

  • For tools running with multiple microservices (e.g., Loki), we use docker stats to poll the total memory usage of all related containers, then average the results.
  • For other tools, we use ps to poll the RSS (resident-set size) field for all related processes, then average the results.
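
The ps-based method can be sketched as below: at each polling instant, sum the RSS (in KiB) of the relevant processes. Here the current shell ($$) stands in for the benchmarked tool's processes, so the snippet is illustrative rather than the benchmark's actual collector:

```shell
# One polling sample: sum the RSS column for a set of PIDs.
# In the real benchmark, the PIDs would be those of the tool under test.
rss_total=$(ps -o rss= -p $$ | awk '{s += $1} END {print s + 0}')
echo "rss_kib=$rss_total"
```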

Query latency

This measures the time taken to completely execute a query. Smaller values are better, indicating faster query performance.
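
A minimal sketch of measuring wall-clock latency for a single query, here a grep over a tiny throwaway file (GNU date with nanosecond support assumed; the file and pattern are illustrative):

```shell
# Prepare a tiny sample log.
logfile=$(mktemp)
printf 'ERROR disk full\nINFO ok\n' > "$logfile"

# Time one query end to end.
start=$(date +%s%N)
matches=$(grep -c 'ERROR' "$logfile")
end=$(date +%s%N)
latency_ms=$(( (end - start) / 1000000 ))
echo "matches=$matches latency_ms=$latency_ms"
```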

Tested tools

The benchmark currently tests the following tools:

Tool specifics

Each tool requires different setup and configuration steps to run the benchmark.

CLP and CLP-S

To begin, download the latest released binary, or clone the most recent code and compile it locally by following these instructions. Note that the benchmark expects the tools to run in Docker containers, so you should follow the previous instructions inside a Docker container.

We use the glt binary for CLP (unstructured logs) and clp-s for CLP-S (semi-structured logs).

CLP

For CLP, start by creating a yaml configuration file like the example below:

system_metric:
  enable: True
  memory:
    ingest_polling_interval: 5
    run_query_benchmark_polling_interval: 5
    
glt:
  container_id: xiaochong-clp-oss-jammy-xiaochong
  binary_path: /home/xiaochong/develop/clp-bench/tests/clp-unstructured/glt-local-compiled
  data_path: /home/xiaochong/develop/clp-bench/tests/clp-unstructured/offline-mode-data
  dataset_path: /home/xiaochong/develop/clp-bench/tests/datasets/hadoop-small
  queries:
    - '" org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid "'
    - '" org.apache.hadoop.hdfs.server.datanode.DataNode: DataTransfer, at "'
    - '" INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container "'
    - '" DEBUG org.apache.hadoop.mapred.ShuffleHandler: verifying request. enc_str="'
    - '" to pid 21177 as user "'
    - '" 10000 reply: "'
    - '" 10 reply: "'
    - '" 178.2 MB "'
    - '" 1.9 GB "'
    - '"job_1528179349176_24837"'
    - '"blk_1075089282_1348458"'
    - '"hdfs://master:8200/HiBench/Bayes/temp/worddict"'
    - '" abcde "'

Note that in this configuration (which also applies for the following benchmark targets):

  • data_path specifies where the tool will store ingested data.
  • dataset_path refers to the location of data that is ready to be ingested.

With the yaml file configured and the container running, you can execute the following command to run the benchmarks:

clp-bench -t GLT -m {mode} -c {path-to-yaml}

Here, mode can be set to hot, cold, or query-only (which also applies for the following benchmark targets). For more details, run clp-bench --help.

No preprocessing is needed for the raw data. During data ingestion, clp-bench will execute:

docker exec {container_id} {binary_path} c {data_path} {dataset_path}

For query benchmarking, clp-bench runs the following command for each query:

docker exec {container_id} {binary_path} s {data_path} {query}

To verify the results, clp-bench appends | wc -l to count the matched log lines, ensuring the tool correctly identifies matches during the query process (which also applies for the following benchmark targets).
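
The appended | wc -l verification simply counts matched lines; for example, with a throwaway file and query:

```shell
# Two of the three lines match the query string "match".
sample=$(mktemp)
printf 'a match\nno hit\nanother match\n' > "$sample"

count=$(grep 'match' "$sample" | wc -l)
echo "$count"   # prints 2
```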

For memory monitoring, clp-bench periodically executes ps aux (based on the intervals specified under system_metric.memory in the yaml file) within the container, checking the RSS field for processes associated with binary_path and data_path:

docker exec {container_id} ps aux

The averages of the memory usage collected during the ingestion and query benchmark stages are calculated separately.
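
The averaging step can be sketched as below, given one RSS sample (in KiB) per line; the sample values are made up for illustration:

```shell
# Average a series of RSS samples with awk.
avg=$(printf '1024\n2048\n3072\n' | awk '{s += $1; n++} END {print s / n}')
echo "$avg"   # prints 2048
```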

CLP-S

CLP-S' setup is nearly identical to CLP's. Below is an example of a yaml configuration file for CLP-S:

system_metric:
  enable: True
  memory:
    ingest_polling_interval: 5
    run_query_benchmark_polling_interval: 5
    
clp_s:
  container_id: xiaochong-clp-oss-jammy-xiaochong
  binary_path: /home/xiaochong/develop/clp-bench/tests/clp-json/clp-json-x86_64-v0.1.3/bin/clp-s-local-compiled
  data_path: /home/xiaochong/develop/clp-bench/tests/clp-json/offline-mode-data
  dataset_path: /home/xiaochong/develop/clp-bench/tests/datasets/mongodb-single
  queries:
    - 'attr.tickets:*'
    - 'id: 22419'
    - 'attr.message.msg: log_release* AND attr.message.session_name: connection'
    - 'ctx: initandlisten AND (NOT msg: "WiredTigermessage" OR attr.message.msg: log_remove*)'
    - 'c: WTWRTLOG AND attr.message.ts_sec > 1679490000'
    - 'ctx: FlowControlRefresher AND attr.numTrimmed: 0'

The only difference is the query format: queries for JSON logs use KQL.

Similarly, with the yaml file configured and the container running, you can execute the following command to run the benchmarks:

clp-bench -t CLPS -m {mode} -c {path-to-yaml}

Like CLP, no preprocessing of raw data is required. The commands for data ingestion and query benchmarking remain the same. Memory monitoring is also performed using ps aux and checking the RSS field.

grep

grep is a command-line tool on Unix/Linux systems used to search for specific patterns within files or text. We use it as the baseline for the unstructured log query benchmark.

Below is an example yaml configuration file for grep:

system_metric:
  enable: False
    
grep:
  dataset_path: /home/xiaochong/develop/clp-bench/tests/datasets/hadoop-small
  queries:
    - '" org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid "'
    - '" org.apache.hadoop.hdfs.server.datanode.DataNode: DataTransfer, at "'
    - '" INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container "'
    - '" DEBUG org.apache.hadoop.mapred.ShuffleHandler: verifying request. enc_str="'
    - '" to pid 21177 as user "'
    - '" 10000 reply: "'
    - '" 10 reply: "'
    - '" 178.2 MB "'
    - '" 1.9 GB "'
    - '"job_1528179349176_24837"'
    - '"blk_1075089282_1348458"'
    - '"hdfs://master:8200/HiBench/Bayes/temp/worddict"'
    - '" abcde "'

Since grep does not ingest data, there is no difference between modes; you can choose any of them:

clp-bench -t Grep -m {mode} -c {path-to-yaml}

For the query benchmark, clp-bench executes the following command for each query:

grep -r {query} {dataset_path}
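
An illustrative run of this command against a throwaway dataset directory (the directory, file, and query below are stand-ins, not the benchmark's actual hadoop dataset):

```shell
# Build a tiny fake dataset directory with one log file.
dataset_dir=$(mktemp -d)
printf '2018-06-05 DEBUG Shell: setsid exited with exit code 0\n' > "$dataset_dir/worker1.log"

# Run the query recursively and count matched lines, as clp-bench does.
hits=$(grep -r ' exit code 0' "$dataset_dir" | wc -l)
echo "$hits"   # prints 1
```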

Loki

Loki runs as two microservices: Loki itself acts as the backend that ingests data sent by a log collector, and we use Promtail as that collector. The two run in separate containers and communicate through REST APIs. For the query benchmark, we use LogCLI to execute queries.

We haven't integrated launching and ingestion for Loki into clp-bench yet, so you need to launch it and ingest the data manually, then use clp-bench to run the query benchmark.

Launch

To run Loki, create a loki-config.yaml configuration file like the following (see details here):

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  instance_addr: 127.0.0.1
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 1000

schema_config:
  configs:
    - from: 1972-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  reject_old_samples: false
  ingestion_rate_mb: 1000

ruler:
  alertmanager_url: http://localhost:9093

Then launch a container for Loki with the following command (assuming your current directory contains the yaml configuration file above):

docker run \
  --name loki \
  -d \
  -v $(pwd):/mnt/config \
  -p 3100:3100 \
  grafana/loki:3.0.0 \
  -config.file=/mnt/config/loki-config.yaml

To run Promtail, create a promtail-config.yaml file like the following:

server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
- job_name: system
  static_configs:
  - targets:
      - localhost
    labels:
      job: benchlogs
      __path__: /mnt/datasets/hadoop/worker*/*

Note that __path__ should be a glob pattern matching the log files to ingest.

Then launch a container for Promtail with the following command (assuming your current directory contains the yaml configuration file above):

docker run \
  --name promtail \
  -d \
  -v $(pwd):/mnt/config \
  -v /path/to/hadoop-log-datasets:/mnt/datasets/hadoop \
  --link loki \
  grafana/promtail:3.0.0 \
  -config.file=/mnt/config/promtail-config.yaml

Once the Loki and Promtail containers have been launched, run the following command until it prints ready; Loki will then start ingesting data:

curl -G http://localhost:3100/ready
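
Waiting on the readiness endpoint can be automated with a small retry helper. wait_until_ready below is a hypothetical helper, not part of Loki or clp-bench:

```shell
# Retry a command until it succeeds or the attempt budget runs out.
wait_until_ready() {
  attempts=$1; shift
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Example usage (assumes Loki is listening on localhost:3100):
#   wait_until_ready 30 curl -sf -G http://localhost:3100/ready
```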

Ingest

Loki ingests data automatically once Promtail connects to it. We do not do any preprocessing of the dataset.

We use docker stats to poll the memory usage during ingestion (the polling frequency should be consistent with the yaml configuration file for clp-bench, described in the next section) until the ingestion finishes:

docker stats loki promtail --no-stream

Since Loki does not indicate when ingestion finishes, we monitor the amount of ingested data with the following command:

curl -G http://localhost:3100/metrics | grep 'loki_distributor_bytes_received_total'

When the above command reports a size equal to the size of the dataset, ingestion has finished. We then run the following command to get the time spent ingesting the data:

curl -G http://localhost:3100/metrics | \
  grep 'loki_request_duration_seconds_sum{method="POST",route="loki_api_v1_push"'
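
Both commands above scrape Loki's Prometheus-format metrics output; extracting a counter's value can be sketched as below (the metrics line is inlined for illustration instead of piping curl, and the tenant label and value are made up):

```shell
# A Prometheus text-format line: metric{labels} value
metrics='loki_distributor_bytes_received_total{tenant="fake"} 1048576'

# The value is the second whitespace-separated field.
bytes=$(echo "$metrics" | awk '/loki_distributor_bytes_received_total/ {print $2}')
echo "$bytes"   # prints 1048576
```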

Query Benchmarking

Below is an example yaml configuration file for clp-bench to run the Loki query benchmark:

system_metric:
  enable: True
  memory:
    ingest_polling_interval: 5
    run_query_benchmark_polling_interval: 5
    
loki:
  logcli_binary_path: /home/xiaochong/develop/loki/logcli
  job: benchlogs
  limit: 2000000
  batch: 1000
  from: '2024-10-08T10:00:00Z'
  to: '2024-10-09T10:00:00Z'
  interval: 30
  queries:
    - '" org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid "'
    - '" org.apache.hadoop.hdfs.server.datanode.DataNode: DataTransfer, at "'
    - '" INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container "'
    - '" DEBUG org.apache.hadoop.mapred.ShuffleHandler: verifying request. enc_str="'
    - '" to pid 21177 as user "'
    - '" 10000 reply: "'
    - '" 10 reply: "'
    - '" 178.2 MB "'
    - '" 1.9 GB "'
    - '"job_1528179349176_24837"'
    - '"blk_1075089282_1348458"'
    - '"hdfs://master:8200/HiBench/Bayes/temp/worddict"'
    - '" abcde "'

Note that in this configuration:

  • loki.job should match the job of the labels in the promtail-config.yaml (see the example of promtail-config.yaml).
  • limit should be at least the maximum number of matched log lines of the query.
  • batch is the maximum number of matched log lines that Loki will send to the client at once.
  • from and to define the rough wall-clock time range of ingesting data. In this example, it means that data ingestion happened between 2024-10-08T10:00:00Z and 2024-10-09T10:00:00Z.
  • interval defines the length, in minutes, of the time slices used for querying. For example, an interval of 30 means the time range between from and to is divided into 30-minute slices, and Loki runs the query over log lines ingested during each slice. For each query, clp-bench instructs Loki to execute it across all time slices, ensuring the entire dataset is covered.
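
The slicing logic can be sketched as below (GNU date assumed; the 2-hour window is illustrative, and clp-bench's actual implementation may differ):

```shell
# Enumerate 30-minute slices covering [from, to).
from=$(date -u -d '2024-10-08T10:00:00Z' +%s)
to=$(date -u -d '2024-10-08T12:00:00Z' +%s)
step=$((30 * 60))

t=$from
slices=0
while [ "$t" -lt "$to" ]; do
  end=$((t + step))
  if [ "$end" -gt "$to" ]; then end=$to; fi   # clamp the final slice
  echo "$(date -u -d @"$t" +%FT%TZ) -> $(date -u -d @"$end" +%FT%TZ)"
  t=$end
  slices=$((slices + 1))
done
echo "slices=$slices"   # prints slices=4
```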

With the yaml configuration file for clp-bench (distinct from the yaml files for the Loki and Promtail containers) and both containers running (all data must already be ingested), run the following command to execute the query benchmark:

clp-bench -t GrafanaLoki -m query-only -c {path-to-loki-clp-bench-yaml}

For each query, clp-bench runs the following command. Note that time_slice_start and time_slice_to are the start and end timestamps of each time slice; clp-bench iterates over all time slices between from and to in the configuration to cover the entire dataset.

<logcli_binary_path> \
  query '{ job="{job}" } |~ {query}' \
  --limit={limit} \
  --batch={batch} \
  --from="{time_slice_start}" \
  --to="{time_slice_to}"

For measuring memory usage during query execution, we employ the same method used for data ingestion.

Elasticsearch

We benchmark Elasticsearch for both unstructured and semi-structured logs. The two setups are essentially the same, except for the query format and the preprocessing of the semi-structured log dataset. We use a single-node Elasticsearch deployment with the xpack security feature disabled.

For unstructured logs, an example yaml configuration file is:

system_metric:
  enable: True
  memory:
    ingest_polling_interval: 10
    run_query_benchmark_polling_interval: 1

elasticsearch:
  container_id: elasticsearch-xiaochong
  launch_script_path: /home/assets/start-ela.sh
  compress_script_path: /home/assets/compress.py
  search_script_path: /home/assets/query.py
  terminate_script_path: /home/assets/stop-ela.sh
  memory_polling_script_path: /home/assets/poll_mem.py
  data_path: /var/lib/elasticsearch
  log_path: /var/log/elasticsearch
  dataset_path: /home/datasets/worker*/worker*/*log*
  queries:
    - '{"query": {"bool": {"must": {"match_phrase": {"log_line": " org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid "}}}}, "size": 10000}'
    - '{"query": {"bool": {"must": {"match_phrase": {"log_line": " org.apache.hadoop.hdfs.server.datanode.DataNode: DataTransfer, at "}}}}, "size": 10000}'
    - '{"query": {"bool": {"must": {"match_phrase": {"log_line": " INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container "}}}}, "size": 10000}'
    - '{"query": {"bool": {"must": {"match_phrase": {"log_line": " DEBUG org.apache.hadoop.mapred.ShuffleHandler: verifying request. enc_str="}}}}, "size": 10000}'
    - '{"query": {"bool": {"must": {"match_phrase": {"log_line": " to pid 21177 as user "}}}}, "size": 10000}'
    - '{"query": {"bool": {"must": {"match_phrase": {"log_line": " 10000 reply: "}}}}, "size": 10000}'
    - '{"query": {"bool": {"must": {"match_phrase": {"log_line": " 10 reply: "}}}}, "size": 10000}'
    - '{"query": {"bool": {"must": {"match_phrase": {"log_line": " 178.2 MB "}}}}, "size": 10000}'
    - '{"query": {"bool": {"must": {"match_phrase": {"log_line": " 1.9 GB "}}}}, "size": 10000}'
    - '{"query": {"bool": {"must": {"match_phrase": {"log_line": "job_1528179349176_24837"}}}}, "size": 10000}'
    - '{"query": {"bool": {"must": {"match_phrase": {"log_line": "blk_1075089282_1348458"}}}}, "size": 10000}'
    - '{"query": {"bool": {"must": {"match_phrase": {"log_line": "hdfs://master:8200/HiBench/Bayes/temp/worddict"}}}}, "size": 10000}'
    - '{"query": {"bool": {"must": {"match_phrase": {"log_line": " abcde "}}}}, "size": 10000}'

For semi-structured logs, an example yaml configuration file is:

system_metric:
  enable: True
  memory:
    ingest_polling_interval: 10
    run_query_benchmark_polling_interval: 1

elasticsearch:
  container_id: elasticsearch-semi-xiaochong
  launch_script_path: /home/assets/start-ela.sh
  compress_script_path: /home/assets/compress.py
  search_script_path: /home/assets/query.py
  terminate_script_path: /home/assets/stop-ela.sh
  memory_polling_script_path: /home/assets/poll_mem.py
  data_path: /var/lib/elasticsearch
  log_path: /var/log/elasticsearch
  dataset_path: /home/datasets/mongod.log
  queries:
    - '{"query": {"exists": {"field": "attr.tickets"}}, "size": 10000}'
    - '{"query": {"term": {"id": 22419}}, "size": 10000}'
    - '{"query": {"bool": {"must": [{"wildcard": {"attr.message.msg": "log_release*"}}, {"match": {"attr.message.session_name": "connection"}}]}}, "size": 10000}'
    - '{"query": {"bool": {"must": [{"match": {"ctx": "initandlisten"}}], "should": [{"wildcard": {"attr.message.msg": "log_remove*"}}, {"bool": {"must_not": [{"match_phrase": {"msg": "WiredTiger message"}}]}}], "minimum_should_match": 1}}, "size": 10000}'
    - '{"query": {"bool": {"must": [{"match": {"c": "WTWRTLOG"}}, {"range": {"attr.message.ts_sec": {"gt": 1679490000}}}]}}, "size": 10000}'
    - '{"query": {"bool": {"must": [{"match": {"ctx": "FlowControlRefresher"}}, {"match": {"attr.numTrimmed": 0}}]}}, "size": 10000}'

Note that the scripts used by clp-bench can be found under clp-bench/assets/elasticsearch-unstructured (for unstructured logs) and clp-bench/assets/elasticsearch (for semi-structured logs). Each directory also contains a Dockerfile, used to build the containers with docker_build.sh. Once the containers are built, use docker_run.sh to launch them. The script requires two arguments: the first is the absolute path of the dataset on the host, and the second is the absolute path where the ingested data will be stored on the host.

With the yaml file and the running container corresponding to the log dataset type, we can run the following command to benchmark Elasticsearch:

# For unstructured log dataset
clp-bench -t ElasticsearchUnstructured -m {mode} -c {path-to-yaml}
# For semi-structured log dataset
clp-bench -t Elasticsearch -m {mode} -c {path-to-yaml}

For data ingestion, no preprocessing is needed for the unstructured log dataset. However, the JSON logs generated by MongoDB require some adjustments to be searchable by Elasticsearch. For details, refer to the traverse_data function in clp-bench/assets/elasticsearch/compress.py. The general approach is to reorganize certain fields by moving them to outer or inner objects so that the queries function correctly. For both dataset types, clp-bench uses streaming_bulk from elasticsearch.helpers to ingest data (refer to either clp-bench/assets/elasticsearch-unstructured/compress.py or clp-bench/assets/elasticsearch/compress.py).

For query benchmarking, clp-bench also uses the elasticsearch Python package to execute queries. For details, refer to the execute_query_without_cache functions in clp-bench/assets/elasticsearch-unstructured/query.py and clp-bench/assets/elasticsearch/query.py.

For memory monitoring, similar to CLP and CLP-S, clp-bench uses ps aux and checks the RSS field.