Definition of buckets in AzurePublicDatasetV2 #15

Open
knodir opened this issue Mar 3, 2021 · 0 comments

Hi,

Could you please add a description of the VM core and memory buckets to the AzurePublicDatasetV2 dataset? It only requires adding two URLs to AzurePublicDatasetLinksV2.txt.

I am aware that the exact number of VM cores is not given, as discussed in issue #5, and that each VM is placed in one of six buckets based on its cores or memory. However, the descriptions of these buckets seem to be "missing", even though they were meant to be released.

I say "missing" (in quotes) because, although the description files are not listed in AzurePublicDatasetLinksV2.txt, they are available for download from Azure Blob Storage. More precisely, schema.csv mentions that the description of the CPU buckets is available in vm_virtual_core_bucket_definition.csv, which has two fields: bucket and definition. I guessed the path for this file by appending the file name vm_virtual_core_bucket_definition.csv to the parent path, and I was able to download the file through the constructed path.

The vm_virtual_core_bucket_definition.csv file describes six buckets. These descriptions match the bucket labels in the "VM Cores Distribution" plot in the Jupyter notebook referenced in the main readme. This match confirms that the file available through Azure Blob Storage is the correct one.

The same applies to the description of the memory buckets: schema.csv mentions vm_memory_bucket_definition.csv, which is not listed in AzurePublicDatasetLinksV2.txt but is available for download from Azure Blob Storage as vm_memory_bucket_definition.csv.
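For anyone who wants to reproduce the path construction described above, here is a minimal Python sketch. The parent URL is a hypothetical placeholder (I am not reproducing the real one here), and `build_definition_urls` is just a helper name I made up; only the two file names come from schema.csv.

```python
from urllib.parse import urljoin

# Hypothetical placeholder: substitute the parent path shared by the
# entries already listed in AzurePublicDatasetLinksV2.txt.
PARENT_URL = "https://example.invalid/azurepublicdatasetv2/"

# File names as mentioned in schema.csv.
DEFINITION_FILES = [
    "vm_virtual_core_bucket_definition.csv",
    "vm_memory_bucket_definition.csv",
]


def build_definition_urls(parent_url, file_names=DEFINITION_FILES):
    """Append each file name to the parent path."""
    # urljoin drops the last path segment unless the base ends with "/".
    base = parent_url if parent_url.endswith("/") else parent_url + "/"
    return [urljoin(base, name) for name in file_names]


if __name__ == "__main__":
    for url in build_definition_urls(PARENT_URL):
        print(url)
```

The printed URLs can then be passed to any downloader; whether they resolve depends entirely on using the real parent path.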

So, it would be great to update AzurePublicDatasetLinksV2.txt to include the URLs for both files, to avoid future guesswork by others.

Let me know if you accept pull requests. I'd be happy to add these two URLs to AzurePublicDatasetLinksV2.txt myself and perhaps add a short description of the buckets to the main readme.

Also, is it accurate to say that

  • core range in bucket 6 is >24 and <=30, and
  • memory range in bucket 6 is >64 and <=70?

I noticed these lines in the Jupyter notebook, which suggest that these ranges are correct:

#Transform vmcorecount '>24' bucket to 30 and '>64' to 70
max_value_vmcorecountbucket = 30
max_value_vmmemorybucket = 70
trace_dataframe = trace_dataframe.replace({'vmcorecountbucket':'>24'},max_value_vmcorecountbucket)
trace_dataframe = trace_dataframe.replace({'vmmemorybucket':'>64'},max_value_vmmemorybucket)

Or is this transformation just a cosmetic step so that the column in the Jupyter table has an int datatype? Having more precise bucket bounds would be helpful.
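To make the question concrete, here is a small plain-Python sketch of what I understand that substitution to do. It is not the notebook's code: the sample rows, the `SENTINELS` mapping, and `to_numeric` are hypothetical, and only the '>24' -> 30 and '>64' -> 70 replacements are taken from the lines quoted above. The conversion to int is my assumption about the purpose.

```python
# Hypothetical rows; only the '>24' and '>64' labels come from the notebook.
rows = [
    {"vmcorecountbucket": "2", "vmmemorybucket": "4"},
    {"vmcorecountbucket": ">24", "vmmemorybucket": ">64"},
]

# Open-ended labels and the numbers the notebook substitutes for them.
SENTINELS = {
    "vmcorecountbucket": {">24": 30},
    "vmmemorybucket": {">64": 70},
}


def to_numeric(row):
    """Replace open-ended labels with their stand-in value, then cast to int."""
    out = {}
    for column, value in row.items():
        replaced = SENTINELS.get(column, {}).get(value, value)
        out[column] = int(replaced)
    return out


if __name__ == "__main__":
    for row in rows:
        print(to_numeric(row))
```

After this step a VM in the '>24' bucket is indistinguishable from one with exactly 30 cores, so the substitution by itself does not tell us whether 30 and 70 are real upper bounds or just stand-in values.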

Finally, is there an external document that describes AzurePublicDatasetV2, like the SOSP 2017 paper that describes AzurePublicDatasetV1? If there is one, it would be useful to reference it in the readme.

Thanks in advance for clarifications!
