[CFDFC] CFDFC Extraction & Caching Redesign #262

Jiahui17 · 2024-12-08T13:20:57Z

Jiahui17
Dec 8, 2024
Maintainer

Introduction

Performance optimization algorithms like the one in FPGA20 aim to optimize dataflow subcircuits called CFDFCs, which correspond to one or multiple loops in the control-flow graph. Since then, many papers have built on this idea in the following ways:

1. New performance optimization algorithms: FPL22 (detailed timing model), MapBuf (even more detailed timing model).
2. Passes that need information from buffering: FCCM22 (resource sharing needs the buffering decisions), CRUSH (also resource sharing), LSQ sizing (needs the CFDFC II and buffer positions to estimate the lifetime of memory access entries in the LSQ).
3. New CF to handshake algorithms: Fast token delivery (Implementation of Fast Token Delivery flow within Dynamatic #177: requires a different way of mapping loops in the original program to the CFDFC subgraph), multi-threading (same as fast token).

The Requirements

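Limitations of the current design. So far, the CFDFC data structure suits the first category (new performance optimization algorithms), but it is not designed with the second (passes that consume buffering results) or the third (new CF to handshake algorithms) in mind. I would like to initiate the discussion for a new implementation in this issue.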

Better Caching Mechanism

Currently, the buffer placement pass logs the performance optimization result directly inside the MLIR file as a function attribute. This includes:

For each CFDFC, a list of BBs.
For each CFDFC, the achieved II result.

Note

Duplicated logic. As seen in the LSQ sizing pass: the CFDFC class in the buffering pass is not the same as the CFDFC class in the LSQ sizing pass. Thus, the LSQ sizing pass needs its own logic for recreating the CFDFCs from the function attributes.

Note

Complex instrumentation. As seen from the sharing pass: The buffer placement passes are instrumented to retrieve the performance optimization decision. The buffer passes are internally called by the sharing pass (instead of reading the cached result from somewhere).

This approach goes against the modularity principle (but it is fundamentally due to the fact that caching the performance information is hard).

Note

Reliance on BB organization. As I will discuss later, some handshake circuits omit BB organization. Therefore, a list of BBs is not enough to recover the CFDFC from this kind of circuit.

Thus, I suggest creating a simple Graphviz DOT format for caching the performance optimization decision across the optimization passes. Here is an example:

digraph cfdfc0 {
  // graph attribute
  graph [throughput=0.50];
  // node attribute
  merge1 [BB=1];
  merge1 -> buffer1;
  buffer1 -> branch1;
  ...
}

For each CFDFC, a .DOT file is created and it encodes the following information:

The nodes and edges in the CFDFC.
The throughput reported by the buffer placement pass.
The BB of each node.

The CFDFC class would be extended to:

Export the dot graph.
Parse the dot graph into the CFDFC data structure.
Check if the dot graph is legal.
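
The three operations above can be sketched as follows. This is a minimal illustration, not Dynamatic's implementation: the `CFDFC` class, its fields, and the legality rule are all hypothetical, and the parser only handles the restricted DOT subset that the exporter itself emits.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CFDFC:
    """Hypothetical in-memory CFDFC; not Dynamatic's actual class."""
    name: str
    throughput: float
    bb_of: dict = field(default_factory=dict)  # unit name -> BB index
    edges: list = field(default_factory=list)  # (src unit, dst unit) pairs

    def to_dot(self) -> str:
        """Export in the restricted DOT subset used by the example above."""
        lines = [f"digraph {self.name} {{",
                 f"  graph [throughput={self.throughput}];"]
        lines += [f"  {unit} [BB={bb}];" for unit, bb in self.bb_of.items()]
        lines += [f"  {src} -> {dst};" for src, dst in self.edges]
        return "\n".join(lines + ["}"])

    @staticmethod
    def from_dot(text: str) -> "CFDFC":
        """Parse only what to_dot emits; this is not a general DOT parser."""
        name = re.search(r"digraph\s+(\w+)", text).group(1)
        throughput = float(re.search(r"throughput=([0-9.]+)", text).group(1))
        bb_of = {u: int(bb)
                 for u, bb in re.findall(r"(\w+)\s*\[BB=(\d+)\]", text)}
        edges = re.findall(r"(\w+)\s*->\s*(\w+)", text)
        return CFDFC(name, throughput, bb_of, edges)

    def is_legal(self) -> bool:
        """Assumed rule: positive throughput, every edge endpoint has a BB."""
        endpoints = {unit for edge in self.edges for unit in edge}
        return self.throughput > 0 and endpoints <= set(self.bb_of)


cfdfc = CFDFC("cfdfc0", 0.5,
              {"merge1": 1, "buffer1": 1, "branch1": 1},
              [("merge1", "buffer1"), ("buffer1", "branch1")])
restored = CFDFC.from_dot(cfdfc.to_dot())  # round trip through the cache format
```

With one shared class of this kind, a consumer such as the LSQ sizing or sharing pass could read the cached file instead of re-running or re-implementing the extraction; a real implementation would probably rely on an existing DOT parser rather than regular expressions.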

General CFDFC Extraction Method

More recent CF to handshake conversion passes (#177) omit BB organization for performance merit. The current CFDFC extraction logic does not work on such circuits because it relies on BB information. Yet, if we can find a new way to extract CFDFCs, the performance optimization algorithms should work out of the box.

Current CFDFC extraction logic:

Software profiling returns a list of BB transitions. These transitions are then used to construct a list of "BB cycles". For example (the matvec example):

// An important inner loop
2 -> 2

// A less important outer loop
1 -> 2 -> 3 -> 1

Based on these two BB loops, we identify:

Nodes that are in any of the BBs,
Edges that connect two nodes in the same BB or two nodes that correspond to any BB transition in the BB loop.
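
The two selection rules above can be sketched as follows. This is an illustration of the rules as described here, not the code in the buffering pass; the function name, the unit names, and the small circuit are made up for the example.

```python
def extract_cfdfc(bb_cycle, unit_bb, channels):
    """Select the units and channels of one CFDFC from a BB cycle.

    bb_cycle: list of BB ids; [1, 2, 3] stands for 1 -> 2 -> 3 -> 1.
    unit_bb:  dict mapping each unit name to its BB id.
    channels: iterable of (src unit, dst unit) pairs.
    """
    bbs = set(bb_cycle)
    # Consecutive pairs, wrapping around to close the cycle.
    transitions = set(zip(bb_cycle, bb_cycle[1:] + bb_cycle[:1]))
    nodes = {unit for unit, bb in unit_bb.items() if bb in bbs}
    edges = set()
    for src, dst in channels:
        if src not in nodes or dst not in nodes:
            continue
        s, d = unit_bb[src], unit_bb[dst]
        if s == d or (s, d) in transitions:
            edges.add((src, dst))
    return nodes, edges


# Hypothetical circuit: two units per BB, plus one channel from BB 1 to BB 3.
unit_bb = {"merge1": 1, "branch1": 1, "mux2": 2, "branch2": 2,
           "add3": 3, "branch3": 3}
channels = [("merge1", "branch1"), ("branch1", "mux2"), ("mux2", "branch2"),
            ("branch2", "mux2"), ("branch2", "add3"), ("add3", "branch3"),
            ("branch3", "merge1"), ("branch1", "add3")]

inner_nodes, inner_edges = extract_cfdfc([2], unit_bb, channels)
outer_nodes, outer_edges = extract_cfdfc([1, 2, 3], unit_bb, channels)
```

In this example the channel from `branch1` to `add3` is dropped from the outer CFDFC, because 1 -> 3 is not a transition of the cycle 1 -> 2 -> 3 -> 1.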

An extraction flow. Ideally, we could have:

Extract BB cycles at the cf level.
Find the predicates (i.e., arith.cmpi, arith.cmpf, etc.) at the cf level that keep the BB cycles running.
Retain the names of the predicates when doing FtdCfToHandshake.
Later when doing the buffering, use the relation to extract CFDFCs.

Note

Effect of omitting BB organization. Yet, the fast token method would create "weird connections", like 1 -> 3 for the second loop. These connections would be completely ignored using the current logic, which might cause a performance penalty due to lack of token balancing.

Proposed solution:

(here I am improvising and definitely need your help @AyaElAkhras @paolo-ienne @lana555)

Instead of identifying a list of BB loops, we identify the set of conditions that will "keep the loop running".

For instance, for the same loop 2 -> 2, we might get a hypothetical set of conditions:

// The loop keeps going as long as cmpi0 outputs a true
FORMULA := cmpi0

From here, we can propagate this information to the rest of the circuit to remove inactive parts when the loop is running. For instance:

recursively remove the branch output that is not selected by the condition
recursively remove the mux input that is not selected by the condition
(maybe there are more?)
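
One possible reading of this pruning step is sketched below. It rests on several assumptions that are not settled in this thread: each branch output or mux input is modeled as a channel carrying a guard `(condition name, required value)`, the formula is a plain conjunction stored in a dict, and "recursively remove" means dropping the channels of any unit that has lost all of its inputs. All names are hypothetical.

```python
def prune_inactive(channels, formula):
    """Return the (src, dst) channels that stay active under the formula.

    channels: list of (src, dst, guard); guard is None for an unconditional
              channel, or (condition, value) for a branch output or mux input.
    formula:  dict such as {"cmpi0": True}, read as a conjunction.
    """
    def enabled(guard):
        if guard is None:
            return True
        cond, value = guard
        # Conditions the formula does not mention leave the channel in place.
        return formula.get(cond, value) == value

    had_inputs = {dst for _, dst, _ in channels}
    alive = [c for c in channels if enabled(c[2])]
    while True:
        fed = {dst for _, dst, _ in alive}
        starved = had_inputs - fed  # units whose inputs were all removed
        kept = [c for c in alive if c[0] not in starved]
        if len(kept) == len(alive):
            return [(src, dst) for src, dst, _ in kept]
        alive = kept


# Hypothetical single-loop circuit controlled by cmpi0.
channels = [
    ("entry", "mux", ("cmpi0", False)),    # mux input used outside the loop
    ("mux", "add", None),
    ("add", "cmpi0", None),
    ("add", "branch", None),
    ("cmpi0", "branch", None),
    ("branch", "mux", ("cmpi0", True)),    # branch output closing the loop
    ("branch", "store", ("cmpi0", False)), # branch output leaving the loop
    ("store", "end", None),
]
loop_channels = prune_inactive(channels, {"cmpi0": True})
```

Here the loop exit is removed together with the `store -> end` channel that only the exit feeds. A unit that loses just one of several inputs is kept in this sketch, which may or may not be the desired behavior.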

Remarks

What do you think? I'd like to hear your thoughts on this :D

pcineverdies · 2024-12-09T07:26:54Z

pcineverdies
Dec 9, 2024
Collaborator

Indeed, while we are currently satisfied with FTD functionality-wise, it remains an "abstract" method as long as we cannot really benefit from all the handshake optimizations (especially FPGA'20). I hope the following comment is somewhat relevant to your topic; otherwise, feel free to ignore it :)

One problem we faced for FTD was the lack of a CFG structure at the end of the conversion pass.
We came up with a solution that consists of annotating the CFG information in the handshake::FuncOp operation during the conversion pass, so that it can be retrieved and used later.

See these messages for a full context: 1, 2, 3.

Right now, we want to use the edge information to re-introduce a fake cf-like structure and exploit all the related MLIR features (dominance information, block connections...) rather than doing everything from scratch (the current CFG.h library). You can see what is going on in my branch.

Maybe this is not relevant to your concern at all, but if you feel that such an addition might be beneficial, we can find a way that accommodates both our scopes.

For instance, for the same loop 2 -> 2, we might get a hypothetical set of conditions:
[...]
FORMULA := cmpi0

Is this something you were thinking of discovering during buffering or during a previous pass? Right now, the persistence of names across all passes is not fully guaranteed, so the stored cmpi0 name might be stale by the time it is used...

P.S.

More recent CF to handshake conversion passes omit BB organization for performance merit.

The link here is broken!

Jiahui17 · 2024-12-09T22:47:57Z

Jiahui17
Dec 9, 2024
Maintainer Author

@pcineverdies thanks for the comment!

Is this something you were thinking of discovering during buffering or during a previous pass? Right now, the persistence of names across all passes is not fully guaranteed, so the stored cmpi0 name might be stale by the time it is used...

I updated the flow that I had in mind. But I see your point: although the conversion passes do not change the names (through something called oneToOneConversion, which retains all the attributes), some optimization passes do. For instance, a pattern rewrite creates new operations (which get new names) and destroys the original ops.

AyaElAkhras · 2024-12-09T23:18:32Z

AyaElAkhras
Dec 9, 2024
Maintainer

Thanks @Jiahui17 for initiating this discussion!

I think for buffer placement to work for an arbitrary dataflow circuit (for instance, those produced by the fast token delivery strategy), we need the following:

1. General methodology for choice-free dataflow circuit (CFDFC) extraction. Here is one proposal:
- (cycles, conditions_for_each_cycle) = identifyCycles(CFG): a function that takes the CFG and returns two things: (1) cycles, all the cycles present, e.g., BB2->BB3->BB4->BB2 is one cycle; (2) conditions_for_each_cycle, the condition values under which each cycle executes, e.g., c3=false and c4=true is the condition for executing the cycle above.
- (conditions_of_execution_for_each_edge) = identifyEdgesConditions(cycles, conditions_for_each_cycle): a function that takes the cycles and conditions_for_each_cycle returned by the previous function and returns conditions_of_execution_for_each_edge, a structure mapping every edge in the dataflow circuit to the conditions under which the edge is active, e.g., EdgeX: c1=true and c3=false.
- (edges_belonging_to_every_cycle) = identifyEdgesActiveInEachCycle(cycles, conditions_for_each_cycle, conditions_of_execution_for_each_edge): a function that uses the results of the previous functions to match the activation conditions of each edge against the execution conditions of each cycle, and so decides whether a particular edge belongs to a particular cycle. The return value can take any form, but it should state which edges belong to each cycle.
- edges_belonging_to_every_cycle can then be used to identify which edges belong to which CFDFC.
2. General/standard algorithm for backward edge detection. The requirement is that for every extracted CFDFC, we should identify exactly one edge that is inside the cycle. Since every cycle has a unique backward edge, backward edges have always been used for this requirement, but they have been detected in nonstandard, messy ways. To support arbitrary circuits, we need to employ a proper back edge detection algorithm.
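
A rough sketch of the last matching step and of a textbook back edge detection follows. Both rest on assumptions: conditions are modeled as dicts from condition name to required value, an edge is assigned to a cycle only when the cycle's conditions fix every condition the edge needs, and back edges are found with a plain depth-first search on a successor map. The snake_case function names and the example data are hypothetical.

```python
def identify_edges_active_in_each_cycle(conditions_for_each_cycle,
                                        conditions_for_each_edge):
    """Map each cycle to the edges whose activation conditions it implies."""
    result = {}
    for cycle, cycle_cond in conditions_for_each_cycle.items():
        result[cycle] = [
            edge for edge, edge_cond in conditions_for_each_edge.items()
            if all(cycle_cond.get(c) == v for c, v in edge_cond.items())
        ]
    return result


def find_back_edges(succ, entry):
    """Depth-first search: an edge u -> v is a back edge if v is still open."""
    state, back_edges = {}, []

    def dfs(u):
        state[u] = "open"
        for v in succ.get(u, []):
            if state.get(v) == "open":
                back_edges.append((u, v))
            elif v not in state:
                dfs(v)
        state[u] = "done"

    dfs(entry)
    return back_edges


cycle_conditions = {("BB2", "BB3", "BB4"): {"c3": False, "c4": True}}
edge_conditions = {
    "edgeA": {"c3": False},
    "edgeB": {"c3": True},
    "edgeC": {"c3": False, "c4": True},
    "edgeX": {"c1": True, "c3": False},
}
membership = identify_edges_active_in_each_cycle(cycle_conditions,
                                                 edge_conditions)

# Hypothetical graph with an inner self-loop on 2 and an outer loop 1 -> 2 -> 3 -> 1.
succ = {0: [1], 1: [2], 2: [2, 3], 3: [1, 4], 4: []}
back_edges = find_back_edges(succ, 0)  # [(2, 2), (3, 1)]
```

Whether an edge with an unconstrained condition (such as `edgeX`, which needs c1) should be excluded, as it is here, is exactly the kind of decision the proposal leaves open. On reducible graphs the depth-first result coincides with the dominance-based definition of a back edge.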

I think @pcineverdies's work on propagating information about the CFG in the IR and being able to reconstruct it at any point in the handshake dialect, directly benefits us here in identifying (1) cycles/loops, (2) conditions of activation of the different edges in the circuit.

As for the naming problem in the identification of the conditions of BBs, I think we should be relying on attributes rather than on the names of operations, as I suggest here.

AyaElAkhras · 2024-12-11T11:59:41Z

AyaElAkhras
Dec 11, 2024
Maintainer

Further to the above discussion, @pcineverdies and I were discussing today that it would be useful for everyone to have a single, consistent, and general methodology for referring to control flow information at any point in the handshake dialect. Keep in mind that anyone requiring control flow information is very likely to need dominance analysis, loop analysis, and other similar analyses, which are already performed on the CFG by early passes at the cf dialect level.

In @pcineverdies's work on implementing FTD, he added new annotations to the IR and used them to temporarily reconstruct the entire CFG structure, so as to benefit from all the standard analyses done at the cf level; he destroys this reconstruction before exporting the IR. We think that this way of retrieving control flow information is general enough and is useful here for the CFDFC extraction, and likely for many different passes requiring control flow information.

We are thinking of opening this in a separate issue, but feel free to object or support the idea here early on :)

lucas-rami · 2024-12-12T00:02:01Z

lucas-rami
Dec 12, 2024
Maintainer

We think that this way of retrieving control flow information is general enough and is useful here for the CFDFC extraction, and likely for many different passes requiring control flow information.

Since this is being pushed as some kind of long-term "good" solution for this problem, I need to reiterate that, while functional for this use case, this is very much a hack that will be hard or impossible to maintain in the long run, and it is therefore not desirable as anything more than a short-term solution. The correct way to do this, as I have already alluded to on the issue where this was originally brought up, is to become a client of LLVM's GenericDomTree module, which is built exactly for the purpose of running CFG-style queries on generic graph types. DominanceInfo is a client of this module that operates specifically on MLIR basic blocks. Likewise, we should become a client that operates on a custom and simple graph representation derived from the CFG function attribute.
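
To make the kind of query concrete, here is a small sketch of dominance computed on a plain successor map, such as one that could be derived from the CFG function attribute. It is not LLVM's GenericDomTree API and does not build a dominator tree; it uses the simple iterative set-intersection formulation and assumes every node is reachable from the entry. The graph and all names are hypothetical.

```python
def dominators(succ, entry):
    """Return dom[n], the set of nodes that dominate n (including n itself)."""
    nodes = set(succ) | {v for targets in succ.values() for v in targets}
    preds = {n: set() for n in nodes}
    for u, targets in succ.items():
        for v in targets:
            preds[v].add(u)

    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            if not preds[n]:
                continue  # unreachable node; outside this sketch's assumptions
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


# Hypothetical graph: inner self-loop on 2, outer loop 1 -> 2 -> 3 -> 1.
succ = {0: [1], 1: [2], 2: [2, 3], 3: [1, 4], 4: []}
dom = dominators(succ, 0)

# An edge u -> v is a back edge when its target dominates its source.
dominance_back_edges = sorted((u, v) for u, targets in succ.items()
                              for v in targets if v in dom[u])
```

A client of GenericDomTree would get the same answers from a maintained, much faster implementation, which is the point made above; the sketch only shows that the input can be a very simple graph type.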
