Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Microbatch: Customize event_time_start/event_time_end format for source table filtering #11331

Open
3 tasks done
logicoffee opened this issue Feb 23, 2025 · 0 comments
Labels
enhancement New feature or request triage

Comments

@logicoffee
Copy link

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

It would be great to have the option to customize the format for event_time_start and event_time_end when filtering a source table for microbatch models.
As I'm working with BigQuery, the senarios shown below specifically apply to BigQuery.

Case 1: Source Table Partitioned by a Non-TIMESTAMP Column

With this update, the source table’s event_time column is now cast to TIMESTAMP as shown below:

select
    *
from (
    select * from `project`.`schema`.`source_table`
    where
        cast(event_time as timestamp) >= '2025-02-23 00:00:00+00:00'
        and cast(event_time as timestamp) < '2025-02-24 00:00:00+00:00'
)

However, if the event_time column in the source table is not originally of TIMESTAMP type and the table is partitioned by event_time, this casting in the WHERE clause does not reduce the amount of data scanned.
This issue can be resolved by formatting event_time_start and event_time_end:

where
    event_time >= '2025-02-23'
    and event_time < '2025-02-24'

Case 2: Sharded Tables

For example, GA4's BigQuery Export tables are sharded by date (formatted as YYYYMMDD).
In these cases, filtering should be performed using the _table_suffix pseudo column:

select
    *
from
    `project`.`analytics_xxxxxxxx`.`events_*`
where 
    _table_suffix >= 'YYYYMMDD'
    and _table_suffix <= 'YYYYMMDD'

Describe alternatives you've considered

  • use event_time_start and event_time_end of batch object to manually filter source tables

Who will this benefit?

  • dbt users working with tables that are partitioned by a non-timestamp column as a source table of microbatch models
  • dbt users working with BigQuery sharded tables as a source table of microbatch models

Are you interested in contributing this feature?

Yes, but I'm not very familiar with the codebase.

Anything else?

No response

@logicoffee logicoffee added enhancement New feature or request triage labels Feb 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request triage
Projects
None yet
Development

No branches or pull requests

1 participant