Definition of buckets in AzurePublicDatasetV2 #15

knodir · 2021-03-03T21:06:02Z

Hi,

Could you please add the descriptions of the VM core and memory buckets to the AzurePublicDatasetV2 dataset? It only requires adding two URLs to AzurePublicDatasetLinksV2.txt.

I am aware that the exact number of VM cores is not given, as discussed in issue #5, and that VMs are placed in one of six buckets based on their cores or memory. However, the descriptions of these buckets seem to be "missing", even though they were meant to be released.

I say "missing" (in quotes) because, although the description files are not listed in AzurePublicDatasetLinksV2.txt, they are available for download from Azure Blob Storage. More precisely, schema.csv mentions that the description of the CPU buckets is available in vm_virtual_core_bucket_definition.csv, which has two fields: bucket and definition. I constructed a path for this file by simply appending the file name vm_virtual_core_bucket_definition.csv to the parent path, and I was able to download the file through the constructed path (vm_virtual_core_bucket_definition.csv).

The vm_virtual_core_bucket_definition.csv file describes six buckets. These descriptions match the bucket labels in the "VM Cores Distribution" plot in the Jupyter notebook, which is referenced in the main readme. This match confirms that the file available through Azure Blob Storage is the correct one.

The same applies to the description of the memory buckets: schema.csv mentions vm_memory_bucket_definition.csv, which is not listed in AzurePublicDatasetLinksV2.txt but is available for download from Azure Blob Storage (vm_memory_bucket_definition.csv).
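For anyone who wants to reproduce the path construction described above, here is a minimal sketch. The parent URL is a placeholder I made up for illustration, not the real storage location; substitute the parent path of the links in AzurePublicDatasetLinksV2.txt. The helper name `build_definition_urls` is also just illustrative.

```python
from urllib.parse import urljoin

# Placeholder only: replace with the parent path used by the links in
# AzurePublicDatasetLinksV2.txt. This is NOT the real storage URL.
PARENT_URL = "https://example.invalid/dataset-container/trace_data/"

# File names as mentioned in schema.csv.
BUCKET_DEFINITION_FILES = [
    "vm_virtual_core_bucket_definition.csv",
    "vm_memory_bucket_definition.csv",
]


def build_definition_urls(parent_url, file_names):
    """Append each file name to the parent path, ensuring one separating slash."""
    base = parent_url if parent_url.endswith("/") else parent_url + "/"
    return [urljoin(base, name) for name in file_names]


for url in build_definition_urls(PARENT_URL, BUCKET_DEFINITION_FILES):
    print(url)
```

The trailing slash matters here: without it, `urljoin` would replace the last path segment instead of appending to it, which is why the helper normalizes the parent path first.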

So it would be great to update AzurePublicDatasetLinksV2.txt to include the URLs for both files, to avoid future guesswork by others.

Let me know if you accept pull requests. I'd be happy to add these two URLs to AzurePublicDatasetLinksV2.txt myself and perhaps add a short description of the buckets to the main readme.

Also, is it accurate to say that

core range in bucket 6 is >24 and <=30, and
memory range in bucket 6 is >64 and <=70?

I noticed these lines in the Jupyter notebook, which suggest that these ranges are correct:

#Transform vmcorecount '>24' bucket to 30 and '>64' to 70
max_value_vmcorecountbucket = 30
max_value_vmmemorybucket = 70
trace_dataframe = trace_dataframe.replace({'vmcorecountbucket':'>24'},max_value_vmcorecountbucket)
trace_dataframe = trace_dataframe.replace({'vmmemorybucket':'>64'},max_value_vmmemorybucket)

Or is this transformation just a cosmetic step so that the Jupyter table column has an int datatype? Having more precise bucket bounds would be helpful.
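To make the question concrete, here is a plain-Python sketch of what I understand the replacement to do. It is not the notebook's pandas code; the helper `to_numeric_bucket` and the sample labels are hypothetical, and only the values 30 and 70 and the labels '>24' and '>64' come from the notebook lines above.

```python
# Stand-in values copied from the notebook lines quoted above.
MAX_VALUE_VMCORECOUNTBUCKET = 30
MAX_VALUE_VMMEMORYBUCKET = 70


def to_numeric_bucket(label, open_ended_label, stand_in):
    """Map the open-ended label to its stand-in; parse any other label as a number."""
    if label == open_ended_label:
        return stand_in
    return float(label)


# Hypothetical sample labels, not taken from the dataset files.
core_labels = ["2", "4", ">24"]
memory_labels = ["8", "64", ">64"]

core_values = [
    to_numeric_bucket(label, ">24", MAX_VALUE_VMCORECOUNTBUCKET)
    for label in core_labels
]
memory_values = [
    to_numeric_bucket(label, ">64", MAX_VALUE_VMMEMORYBUCKET)
    for label in memory_labels
]

print(core_values)
print(memory_values)
```

If this reading is right, the substitution by itself only gives the open-ended bucket a single numeric value for plotting. It does not tell us whether 30 and 70 are true upper bounds, which is why I am asking.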

Finally, is there an external document that describes AzurePublicDatasetV2, like the SOSP 2017 paper that describes AzurePublicDatasetV1? If one exists, it would be useful to reference it in the readme.

Thanks in advance for clarifications!
