Add expand data corpus instructions #8807

Merged
merged 16 commits into from
Dec 16, 2024

Conversation

Naarcha-AWS
Collaborator

Fixes opensearch-project/opensearch-benchmark#672

Checklist

  • By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
    For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@Naarcha-AWS Naarcha-AWS added the 3 - Tech review (PR: Tech review in progress), benchmark, and backport 2.18 (PR: Backport label for 2.18) labels Nov 25, 2024
@Naarcha-AWS Naarcha-AWS self-assigned this Nov 25, 2024

Thank you for submitting your PR. The PR states are In progress (or Draft) -> Tech review -> Doc review -> Editorial review -> Merged.

Before you submit your PR for doc review, make sure the content is technically accurate. If you need help finding a tech reviewer, tag a maintainer.

When you're ready for doc review, tag the assignee of this PR. The doc reviewer may push edits to the PR directly or leave comments and editorial suggestions for you to address (let us know in a comment if you have a preference). The doc reviewer will arrange for an editorial review.


# Expanding the data corpus of a workload

This tutorial shows you how to use the [`expand-data-corpus.py`](https://github.com/opensearch-project/opensearch-benchmark/blob/main/scripts/expand-data-corpus.py) script to increase the size of the data corpus for an OpenSearch Benchmark workload. This can assist in running the `http_logs` benchmark at a larger scale, for instance, with clusters containing multiple data nodes.
Contributor

Nit: We can simplify the last sentence:

This is helpful when running time-series workloads like http_logs against a large-scale OpenSearch cluster.

Contributor

@IanHoang IanHoang left a comment

Would recommend getting feedback from @gkamat as he has more experience with this and might have additional comments

To use this tutorial, make sure you fulfill the following prerequisites:

1. Python 3.x or greater installed.
2. The `http_logs` workload data corpus already in your load generation host where benchmark is running.
Contributor

corpus is already available in your load generation host where OSB is running.

@Naarcha-AWS Naarcha-AWS added the 4 - Doc review (PR: Doc review in progress) label and removed the 3 - Tech review (PR: Tech review in progress) label Dec 11, 2024
Collaborator

@natebower natebower left a comment

@Naarcha-AWS Please see my comments and changes and tag me for approval once addressed. Thanks!

@@ -0,0 +1,83 @@
---
layout: default
title: Expand data corpus
Collaborator

Should this be "Expanding a data corpus"?

grand_parent: User guide
---

# Expanding the data corpus of a workload
Collaborator

"Expanding a workload data corpus"?

To use this tutorial, make sure you fulfill the following prerequisites:

1. Python 3.x or greater installed.
2. The `http_logs` workload data corpus is already in your load generation host where OpenSearch Benchmark is running.
Collaborator

Something like "is already stored on the load generation host running OpenSearch Benchmark"?
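
For anyone trying the tutorial, the two prerequisites listed in the diff can be checked up front. The following is only a minimal sketch, not part of OpenSearch Benchmark or of `expand-data-corpus.py`: the helper name `check_prerequisites` is hypothetical, and the default corpus path is an assumption that should be replaced with wherever the `http_logs` corpus is actually stored on the load generation host.

```python
import sys
from pathlib import Path


def check_prerequisites(corpus_dir, min_python=(3, 0)):
    """Return a list of problems; an empty list means both checks passed.

    Hypothetical helper for illustration only.
    """
    problems = []

    # Prerequisite 1: a Python 3.x interpreter.
    if sys.version_info < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]} or later is required"
        )

    # Prerequisite 2: the workload data corpus is already on this host.
    corpus_dir = Path(corpus_dir)
    if not corpus_dir.is_dir():
        problems.append(f"corpus directory not found: {corpus_dir}")
    elif not any(corpus_dir.iterdir()):
        problems.append(f"corpus directory is empty: {corpus_dir}")

    return problems


if __name__ == "__main__":
    # Assumed location; adjust to match your OpenSearch Benchmark setup.
    assumed_dir = Path.home() / ".benchmark" / "benchmarks" / "data" / "http_logs"
    for problem in check_prerequisites(assumed_dir):
        print(problem)
```

An empty result means the host looks ready; any printed line names the prerequisite that still needs attention.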

Collaborator

@natebower natebower left a comment

@Naarcha-AWS LGTM!

@Naarcha-AWS Naarcha-AWS merged commit fadfee3 into main Dec 16, 2024
7 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Dec 16, 2024
* Add expand data corpus instructions

Signed-off-by: Archer <[email protected]>

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <[email protected]>

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <[email protected]>

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <[email protected]>

* Update _benchmark/user-guide/optimizing-benchmarks/expand-data-corpus.md

Signed-off-by: Naarcha-AWS <[email protected]>

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <[email protected]>

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <[email protected]>

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <[email protected]>

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <[email protected]>

* Apply suggestions from code review

Signed-off-by: Naarcha-AWS <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>

* Update _benchmark/user-guide/optimizing-benchmarks/expand-data-corpus.md

Signed-off-by: Nathan Bower <[email protected]>

* Update _benchmark/user-guide/optimizing-benchmarks/expand-data-corpus.md

Signed-off-by: Nathan Bower <[email protected]>

---------

Signed-off-by: Archer <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>
Signed-off-by: Nathan Bower <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
(cherry picked from commit fadfee3)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@Naarcha-AWS Naarcha-AWS deleted the expand-data-corpus branch December 19, 2024 18:57
Labels
4 - Doc review (PR: Doc review in progress), backport 2.18 (PR: Backport label for 2.18), benchmark
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[DOCUMENTATION] Expand Data Corpus section
4 participants