Implement budget shrinkage when bin-packing difficult proposals. (#1759)
Summary:
Pull Request resolved: #1759

Some proposals are much more difficult for the bin-packer than others.
For example, for APF mast_ctr_cvr_conso_cmf_pipeline with a 0.88 storage reservation:

```
buck2 run mode/opt aps_models/ads/common/utils:sharding_planner_util_run[inplace] -- \
    --mode=mast_ctr_cvr_conso_cmf_pipeline \
    --world_size=128 \
    --batch_size=1024 \
    --hbm_mem_gb=80 \
    --embedding_location=UVM_CACHING \
    --auto_tuning_stats_file=manifold://intaik/flat/stats_2024-01-30_254b64e645a747f4024f8faf95a815c3_1a1381a2b29f6931fb7cbcc1f3a3ebb3_5d06dfdf1f616f69c9622f1a1e88ef55_65536_131072_262144.json \
    --pipeline_type=prefetch-sparse-dist \
    --storage_reservation_policy=FixedPercentage \
    --storage_reservation_percentage=0.88
```

Using the scaleup proposer, we see the partitioner has 1 TB of
available budget (10% of total HBM), but every attempt to use it fails
to partition, so we end up falling back to the min-working-set,
non-scaled plan.

```
EmbeddingOffloadScaleupProposer - cache scale up budget=1076.78 GB, exploring [152.02, 1228.8] GB
EmbeddingOffloadScaleupProposer - proposed size=152.02 GB, score=59.371404015505505
EmbeddingOffloadScaleupProposer - proposed size=1102.03 GB, score=None
EmbeddingOffloadScaleupProposer - proposed size=1132.72 GB, score=None
EmbeddingOffloadScaleupProposer - proposed size=989.15 GB, score=None
EmbeddingOffloadScaleupProposer - proposed size=776.23 GB, score=None
EmbeddingOffloadScaleupProposer - proposed size=962.52 GB, score=None
EmbeddingOffloadScaleupProposer - proposed size=535.53 GB, score=None
EmbeddingOffloadScaleupProposer - proposed size=1024.85 GB, score=None
EmbeddingOffloadScaleupProposer - proposed size=444.16 GB, score=None
EmbeddingOffloadScaleupProposer - proposed size=1097.76 GB, score=None
EmbeddingOffloadScaleupProposer - proposed size=1132.72 GB, score=None
EmbeddingOffloadScaleupProposer - proposed size=1098.68 GB, score=None
EmbeddingOffloadScaleupProposer - proposed size=1132.72 GB, score=None
EmbeddingOffloadScaleupProposer - proposed size=908.05 GB, score=None
EmbeddingOffloadScaleupProposer - proposed size=1056.36 GB, score=None
EmbeddingOffloadScaleupProposer - proposed size=801.69 GB, score=None
EmbeddingOffloadScaleupProposer - proposed size=740.11 GB, score=None
```

This diff introduces an optimization: we shrink the search space as we
discover non-partitionable proposals, which lets us focus the search on
more productive areas.

We know the partitioner costs are non-smooth in general (which is why we
switched away from simple binary search). This diff therefore relies on a
narrower assumption: if a scaleup proposal using X GB fails to partition,
any larger proposal will also fail.
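
A minimal sketch of that assumption (a hypothetical helper with a stand-in bisection probe, not the planner's real search policy): a proposal of size x that fails to partition pulls the right boundary down to x, so later probes concentrate on smaller, potentially feasible budgets.

```
# Stand-in sketch: shrink the right boundary on every failed partition.
def shrinking_search(score, partitionable, left, right, iters=16):
    # score(x) -> float, lower is better; partitionable(x) -> bool
    best = None
    for _ in range(iters):
        x = (left + right) / 2  # midpoint stands in for the real probe policy
        if not partitionable(x):
            right = x  # shrink: assume all of [x, right] is infeasible too
            continue
        s = score(x)
        if best is None or s < best[1]:
            best = (x, s)
        left = x  # feasible; probe larger budgets inside the shrunk interval
    return best

GB = 1024**3
# Mirrors the unit test below: best perf near 7.9 GB, partitioning fails >= 8 GB.
print(shrinking_search(lambda x: abs(x - 7.9 * GB),
                       lambda x: x < 8 * GB,
                       left=7.47 * GB, right=100 * GB))
```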

With shrinkage, we discover an optimal plan using 213 GB.
```
EmbeddingOffloadScaleupProposer - cache scale up budget=1076.78 GB, exploring [152.02, 1228.8] GB
EmbeddingOffloadScaleupProposer - proposed size=152.02 GB, score=59.371404015505505
EmbeddingOffloadScaleupProposer - proposed size=1102.03 GB, score=None
EmbeddingOffloadScaleupProposer - proposed size=879.47 GB, score=None
EmbeddingOffloadScaleupProposer - proposed size=728.05 GB, score=None
EmbeddingOffloadScaleupProposer - proposed size=461.63 GB, score=None
EmbeddingOffloadScaleupProposer - proposed size=245.85 GB, score=None
EmbeddingOffloadScaleupProposer - proposed size=233.26 GB, score=None
EmbeddingOffloadScaleupProposer - proposed size=197.23 GB, score=58.653459490303476
EmbeddingOffloadScaleupProposer - proposed size=215.92 GB, score=None
EmbeddingOffloadScaleupProposer - proposed size=167.79 GB, score=59.158962621812364
EmbeddingOffloadScaleupProposer - proposed size=189.74 GB, score=58.77705761067111
EmbeddingOffloadScaleupProposer - proposed size=171.28 GB, score=59.10331765634908
EmbeddingOffloadScaleupProposer - proposed size=167.77 GB, score=59.158962621812364
EmbeddingOffloadScaleupProposer - proposed size=179.19 GB, score=58.95720474703581
EmbeddingOffloadScaleupProposer - proposed size=158.81 GB, score=59.28707837441257
EmbeddingOffloadScaleupProposer - proposed size=203.26 GB, score=58.561385218983325
EmbeddingOffloadScaleupProposer - proposed size=213.36 GB, score=58.42813341117671
```

Reviewed By: henrylhtsang

Differential Revision: D54431752

fbshipit-source-id: 89dfdd8dba945d5a974e5c6d088d74c554e8081e
Damian Reeves authored and facebook-github-bot committed Mar 28, 2024
1 parent cd5212a commit 01b5d34
Showing 3 changed files with 88 additions and 0 deletions.
10 changes: 10 additions & 0 deletions torchrec/distributed/planner/proposers.py
@@ -348,6 +348,16 @@ def feedback(
f"EmbeddingOffloadScaleupProposer - proposed size={round(bytes_to_gb(hbm_used_previously), 2)} GB, score={perf_rating}"
)

if not partitionable:
# Focus our search on smaller plans by assuming plans larger than this
# proposal will also fail to partition.
starting_size = sum(
sharding_option.total_storage.hbm
for sharding_option in self.starting_proposal
)
new_budget = hbm_used_previously - starting_size
self.search.shrink_right(new_budget) # pyre-ignore

assert self.search is not None # keep pyre happy
budget = self.search.next(perf_rating or 1e99)
if budget is not None:
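
The subtraction above deserves a worked example: the search interval tracks scale-up budget on top of the minimum working set (note budget=1076.78 = 1228.8 - 152.02 in the log), while proposals report total HBM, so a failed proposal is converted back into budget before shrinking. A sketch with numbers from the first log:

```
GB = 1024**3
starting_size = 152.02 * GB         # HBM of the min-working-set proposal
hbm_used_previously = 1102.03 * GB  # first proposal that failed to partition
new_budget = hbm_used_previously - starting_size  # 950.01 GB of cache budget
# search.shrink_right(new_budget) then caps future proposals at ~950 GB of
# additional cache, i.e. ~1102 GB of total HBM.
```
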
71 changes: 71 additions & 0 deletions torchrec/distributed/planner/tests/test_proposers.py
@@ -536,6 +536,77 @@ def test_scaleup(self) -> None:
},
)

def test_budget_shrink(self) -> None:
tables = [
EmbeddingBagConfig(
num_embeddings=2_000_000,
embedding_dim=10000,
name="table_0",
feature_names=["feature_0"],
)
]
constraints = {
"table_0": ParameterConstraints(
compute_kernels=[EmbeddingComputeKernel.FUSED_UVM_CACHING.value],
cache_params=CacheParams(
load_factor=0.1,
stats=MockCacheStatistics(expected_lookups=2, cacheability=0.2),
),
),
}

GB = 1024 * 1024 * 1024
storage_constraint = Topology(
world_size=1, compute_device="cuda", hbm_cap=100 * GB, ddr_cap=1000 * GB
)
model = TestSparseNN(tables=tables, sparse_device=torch.device("meta"))
enumerator = EmbeddingEnumerator(
topology=storage_constraint, batch_size=BATCH_SIZE, constraints=constraints
)
search_space = enumerator.enumerate(
module=model,
sharders=[
cast(ModuleSharder[torch.nn.Module], EmbeddingBagCollectionSharder())
],
)
proposer = EmbeddingOffloadScaleupProposer()
proposer.load(search_space, enumerator=enumerator)

proposal = proposer.propose()
best_plan = None
best_perf = 1e99
proposals = -1
initial_mem = None
while proposal is not None:
proposals += 1
mem = sum(so.total_storage.hbm for so in proposal)
if initial_mem is None:
initial_mem = mem
# Budget given constraints:
# cache scale up budget=92.53 GB, exploring [7.47, 100.0] GB
#
# Simple perf model, assume partitioner gives a lowest score at 7.9GB, and
# anything larger than 8GB fails to partition. This is very hard to hit when
# exploring the larger [7.47, 100] range with limited iterations without
# shrinkage.
perf = abs(mem - (7.9 * GB))
partitionable = mem < 8 * GB
if perf < best_perf:
best_plan = mem
best_perf = perf
proposer.feedback(
partitionable=partitionable,
plan=proposal,
perf_rating=perf if partitionable else None,
storage_constraint=storage_constraint,
)
proposal = proposer.propose()

self.assertEqual(proposals, 16)
self.assertNotEqual(initial_mem, best_plan, "couldn't find a better plan")
# goal is 7.9, we get very close
self.assertEqual(best_plan, 7.960684550926089 * GB)

def test_proposers_to_proposals_list(self) -> None:
def make_mock_proposal(name: str) -> List[ShardingOption]:
return [
7 changes: 7 additions & 0 deletions torchrec/distributed/planner/utils.py
@@ -199,6 +199,13 @@ def __init__(
self.fright: Optional[float] = None
self.d: float = self.right - self.left

def shrink_right(self, B: float) -> None:
"Shrink right boundary given [B,infinity) -> infinity"
self.right = B
self.fright = math.inf
self.d = self.right - self.left
self.x = self.clamp(self.x)

def clamp(self, x: float) -> float:
"Clamp x into range [left, right]"
if x < self.left:
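
To make the state update concrete, here is a stand-in object (hypothetical, mirroring only the fields the diff touches) showing what shrink_right does to the search state, with budget numbers consistent with the first log:

```
import math

class IntervalSketch:
    "Stand-in for the search state; only the fields used by shrink_right."
    def __init__(self, left: float, right: float, x: float) -> None:
        self.left = left
        self.right = right
        self.fright = None     # cost observed at the right boundary, if any
        self.d = right - left  # current search diameter
        self.x = x             # current probe point

    def clamp(self, x: float) -> float:
        "Clamp x into range [left, right]"
        return min(max(x, self.left), self.right)

    def shrink_right(self, B: float) -> None:
        "Shrink the right boundary to B, treating cost as infinite on [B, inf)."
        self.right = B
        self.fright = math.inf
        self.d = self.right - self.left
        self.x = self.clamp(self.x)

s = IntervalSketch(left=0.0, right=1076.78, x=1000.0)  # GB of cache budget
s.shrink_right(950.01)  # the 1102.03 GB total proposal failed to partition
assert s.right == 950.01 and s.x == 950.01 and math.isinf(s.fright)
```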
