Skip to content

Commit

Permalink
WIP
Browse files Browse the repository at this point in the history
  • Loading branch information
Dandandan committed Oct 17, 2024
2 parents 69416ba + d8e4e92 commit 5f0803b
Show file tree
Hide file tree
Showing 306 changed files with 17,605 additions and 6,756 deletions.
12 changes: 7 additions & 5 deletions .github/actions/setup-builder/action.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -28,16 +28,18 @@ runs:
- name: Install Build Dependencies
shell: bash
run: |
apt-get update
apt-get install -y protobuf-compiler
RETRY="ci/scripts/retry"
"${RETRY}" apt-get update
"${RETRY}" apt-get install -y protobuf-compiler
- name: Setup Rust toolchain
shell: bash
# rustfmt is needed for the substrait build script
run: |
RETRY="ci/scripts/retry"
echo "Installing ${{ inputs.rust-version }}"
rustup toolchain install ${{ inputs.rust-version }}
rustup default ${{ inputs.rust-version }}
rustup component add rustfmt
"${RETRY}" rustup toolchain install ${{ inputs.rust-version }}
"${RETRY}" rustup default ${{ inputs.rust-version }}
"${RETRY}" rustup component add rustfmt
- name: Configure rust runtime env
uses: ./.github/actions/setup-rust-runtime
- name: Fixup git permissions
Expand Down
16 changes: 7 additions & 9 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -44,8 +44,9 @@
DataFusion is an extensible query engine written in [Rust] that
uses [Apache Arrow] as its in-memory format.

The DataFusion libraries in this repository are used to build data-centric system software. DataFusion also provides the
following subprojects, which are packaged versions of DataFusion intended for end users.
This crate provides libraries and binaries for developers building fast and
feature rich database and analytic systems, customized to particular workloads.
See [use cases] for examples. The following related subprojects target end users:

- [DataFusion Python](https://github.com/apache/datafusion-python/) offers a Python interface for SQL and DataFrame
queries.
Expand All @@ -54,13 +55,10 @@ following subprojects, which are packaged versions of DataFusion intended for en
- [DataFusion Comet](https://github.com/apache/datafusion-comet/) is an accelerator for Apache Spark based on
DataFusion.

The target audience for the DataFusion crates in this repository are
developers building fast and feature rich database and analytic systems,
customized to particular workloads. See [use cases] for examples.

DataFusion offers [SQL] and [`Dataframe`] APIs,
excellent [performance], built-in support for CSV, Parquet, JSON, and Avro,
extensive customization, and a great community.
"Out of the box,"
DataFusion offers [SQL] and [`Dataframe`] APIs, excellent [performance],
built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and
a great community.

DataFusion features a full query planner, a columnar, streaming, multi-threaded,
vectorized execution engine, and partitioned data sources. You can
Expand Down
2 changes: 1 addition & 1 deletion benchmarks/src/bin/h2o.rs
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,7 @@ use datafusion::datasource::listing::{
use datafusion::datasource::MemTable;
use datafusion::prelude::CsvReadOptions;
use datafusion::{arrow::util::pretty, error::Result, prelude::SessionContext};
use datafusion_benchmarks::BenchmarkRun;
use datafusion_benchmarks::util::BenchmarkRun;
use std::path::PathBuf;
use std::sync::Arc;
use structopt::StructOpt;
Expand Down
3 changes: 1 addition & 2 deletions benchmarks/src/clickbench.rs
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@
use std::path::Path;
use std::path::PathBuf;

use crate::util::{BenchmarkRun, CommonOpt};
use datafusion::{
error::{DataFusionError, Result},
prelude::SessionContext,
Expand All @@ -26,8 +27,6 @@ use datafusion_common::exec_datafusion_err;
use datafusion_common::instant::Instant;
use structopt::StructOpt;

use crate::{BenchmarkRun, CommonOpt};

/// Run the clickbench benchmark
///
/// The ClickBench[1] benchmarks are widely cited in the industry and
Expand Down
3 changes: 2 additions & 1 deletion benchmarks/src/imdb/run.rs
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@ use std::path::PathBuf;
use std::sync::Arc;

use super::{get_imdb_table_schema, get_query_sql, IMDB_TABLES};
use crate::{BenchmarkRun, CommonOpt};
use crate::util::{BenchmarkRun, CommonOpt};

use arrow::record_batch::RecordBatch;
use arrow::util::pretty::{self, pretty_format_batches};
Expand Down Expand Up @@ -489,6 +489,7 @@ mod tests {

use super::*;

use crate::util::CommonOpt;
use datafusion::common::exec_err;
use datafusion::error::Result;
use datafusion_proto::bytes::{
Expand Down
3 changes: 1 addition & 2 deletions benchmarks/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -21,5 +21,4 @@ pub mod imdb;
pub mod parquet_filter;
pub mod sort;
pub mod tpch;
mod util;
pub use util::*;
pub mod util;
2 changes: 1 addition & 1 deletion benchmarks/src/parquet_filter.rs
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@

use std::path::PathBuf;

use crate::{AccessLogOpt, BenchmarkRun, CommonOpt};
use crate::util::{AccessLogOpt, BenchmarkRun, CommonOpt};

use arrow::util::pretty;
use datafusion::common::Result;
Expand Down
2 changes: 1 addition & 1 deletion benchmarks/src/sort.rs
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@
use std::path::PathBuf;
use std::sync::Arc;

use crate::{AccessLogOpt, BenchmarkRun, CommonOpt};
use crate::util::{AccessLogOpt, BenchmarkRun, CommonOpt};

use arrow::util::pretty;
use datafusion::common::Result;
Expand Down
2 changes: 1 addition & 1 deletion benchmarks/src/tpch/run.rs
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,7 @@ use std::sync::Arc;
use super::{
get_query_sql, get_tbl_tpch_table_schema, get_tpch_table_schema, TPCH_TABLES,
};
use crate::{BenchmarkRun, CommonOpt};
use crate::util::{BenchmarkRun, CommonOpt};

use arrow::record_batch::RecordBatch;
use arrow::util::pretty::{self, pretty_format_batches};
Expand Down
21 changes: 21 additions & 0 deletions ci/scripts/retry
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
#!/usr/bin/env bash

set -euo pipefail

x() {
echo "+ $*" >&2
"$@"
}

max_retry_time_seconds=$(( 3 * 60 ))
retry_delay_seconds=10

END=$(( $(date +%s) + ${max_retry_time_seconds} ))

while (( $(date +%s) < $END )); do
x "$@" && exit 0
sleep "${retry_delay_seconds}"
done

echo "$0: retrying [$*] timed out" >&2
exit 1
12 changes: 5 additions & 7 deletions datafusion-cli/Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion datafusion-cli/src/exec.rs
Original file line number Diff line number Diff line change
Expand Up @@ -383,7 +383,7 @@ pub(crate) async fn register_object_store_and_config_extensions(
ctx.register_table_options_extension_from_scheme(scheme);

// Clone and modify the default table options based on the provided options
let mut table_options = ctx.session_state().default_table_options().clone();
let mut table_options = ctx.session_state().default_table_options();
if let Some(format) = format {
table_options.set_config_format(format);
}
Expand Down
10 changes: 5 additions & 5 deletions datafusion-cli/src/object_storage.rs
Original file line number Diff line number Diff line change
Expand Up @@ -495,7 +495,7 @@ mod tests {

if let LogicalPlan::Ddl(DdlStatement::CreateExternalTable(cmd)) = &mut plan {
ctx.register_table_options_extension_from_scheme(scheme);
let mut table_options = ctx.state().default_table_options().clone();
let mut table_options = ctx.state().default_table_options();
table_options.alter_with_string_hash_map(&cmd.options)?;
let aws_options = table_options.extensions.get::<AwsOptions>().unwrap();
let builder =
Expand Down Expand Up @@ -540,7 +540,7 @@ mod tests {

if let LogicalPlan::Ddl(DdlStatement::CreateExternalTable(cmd)) = &mut plan {
ctx.register_table_options_extension_from_scheme(scheme);
let mut table_options = ctx.state().default_table_options().clone();
let mut table_options = ctx.state().default_table_options();
table_options.alter_with_string_hash_map(&cmd.options)?;
let aws_options = table_options.extensions.get::<AwsOptions>().unwrap();
let err = get_s3_object_store_builder(table_url.as_ref(), aws_options)
Expand All @@ -566,7 +566,7 @@ mod tests {

if let LogicalPlan::Ddl(DdlStatement::CreateExternalTable(cmd)) = &mut plan {
ctx.register_table_options_extension_from_scheme(scheme);
let mut table_options = ctx.state().default_table_options().clone();
let mut table_options = ctx.state().default_table_options();
table_options.alter_with_string_hash_map(&cmd.options)?;
let aws_options = table_options.extensions.get::<AwsOptions>().unwrap();
// ensure this isn't an error
Expand Down Expand Up @@ -594,7 +594,7 @@ mod tests {

if let LogicalPlan::Ddl(DdlStatement::CreateExternalTable(cmd)) = &mut plan {
ctx.register_table_options_extension_from_scheme(scheme);
let mut table_options = ctx.state().default_table_options().clone();
let mut table_options = ctx.state().default_table_options();
table_options.alter_with_string_hash_map(&cmd.options)?;
let aws_options = table_options.extensions.get::<AwsOptions>().unwrap();
let builder = get_oss_object_store_builder(table_url.as_ref(), aws_options)?;
Expand Down Expand Up @@ -631,7 +631,7 @@ mod tests {

if let LogicalPlan::Ddl(DdlStatement::CreateExternalTable(cmd)) = &mut plan {
ctx.register_table_options_extension_from_scheme(scheme);
let mut table_options = ctx.state().default_table_options().clone();
let mut table_options = ctx.state().default_table_options();
table_options.alter_with_string_hash_map(&cmd.options)?;
let gcp_options = table_options.extensions.get::<GcpOptions>().unwrap();
let builder = get_gcs_object_store_builder(table_url.as_ref(), gcp_options)?;
Expand Down
1 change: 1 addition & 0 deletions datafusion-examples/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -62,6 +62,7 @@ dashmap = { workspace = true }
datafusion = { workspace = true, default-features = true, features = ["avro"] }
datafusion-common = { workspace = true, default-features = true }
datafusion-expr = { workspace = true }
datafusion-functions-window-common = { workspace = true }
datafusion-optimizer = { workspace = true, default-features = true }
datafusion-physical-expr = { workspace = true, default-features = true }
datafusion-proto = { workspace = true }
Expand Down
6 changes: 5 additions & 1 deletion datafusion-examples/examples/advanced_udwf.rs
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,7 @@ use datafusion_expr::function::WindowUDFFieldArgs;
use datafusion_expr::{
PartitionEvaluator, Signature, WindowFrame, WindowUDF, WindowUDFImpl,
};
use datafusion_functions_window_common::partition::PartitionEvaluatorArgs;

/// This example shows how to use the full WindowUDFImpl API to implement a user
/// defined window function. As in the `simple_udwf.rs` example, this struct implements
Expand Down Expand Up @@ -74,7 +75,10 @@ impl WindowUDFImpl for SmoothItUdf {

/// Create a `PartitionEvaluator` to evaluate this function on a new
/// partition.
fn partition_evaluator(&self) -> Result<Box<dyn PartitionEvaluator>> {
fn partition_evaluator(
&self,
_partition_evaluator_args: PartitionEvaluatorArgs,
) -> Result<Box<dyn PartitionEvaluator>> {
Ok(Box::new(MyPartitionEvaluator::new()))
}

Expand Down
4 changes: 2 additions & 2 deletions datafusion-examples/examples/custom_file_format.rs
Original file line number Diff line number Diff line change
Expand Up @@ -154,7 +154,7 @@ impl FileFormatFactory for TSVFileFactory {
&self,
state: &SessionState,
format_options: &std::collections::HashMap<String, String>,
) -> Result<std::sync::Arc<dyn FileFormat>> {
) -> Result<Arc<dyn FileFormat>> {
let mut new_options = format_options.clone();
new_options.insert("format.delimiter".to_string(), "\t".to_string());

Expand All @@ -164,7 +164,7 @@ impl FileFormatFactory for TSVFileFactory {
Ok(tsv_file_format)
}

fn default(&self) -> std::sync::Arc<dyn FileFormat> {
fn default(&self) -> Arc<dyn FileFormat> {
todo!()
}

Expand Down
6 changes: 5 additions & 1 deletion datafusion-examples/examples/simplify_udwf_expression.rs
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@ use datafusion_expr::{
expr::WindowFunction, simplify::SimplifyInfo, Expr, PartitionEvaluator, Signature,
Volatility, WindowUDF, WindowUDFImpl,
};
use datafusion_functions_window_common::partition::PartitionEvaluatorArgs;

/// This UDWF will show how to use the WindowUDFImpl::simplify() API
#[derive(Debug, Clone)]
Expand Down Expand Up @@ -60,7 +61,10 @@ impl WindowUDFImpl for SimplifySmoothItUdf {
&self.signature
}

fn partition_evaluator(&self) -> Result<Box<dyn PartitionEvaluator>> {
fn partition_evaluator(
&self,
_partition_evaluator_args: PartitionEvaluatorArgs,
) -> Result<Box<dyn PartitionEvaluator>> {
todo!()
}

Expand Down
27 changes: 0 additions & 27 deletions datafusion/common/src/dfschema.rs
Original file line number Diff line number Diff line change
Expand Up @@ -406,33 +406,6 @@ impl DFSchema {
}
}

/// Check whether the column reference is ambiguous
pub fn check_ambiguous_name(
&self,
qualifier: Option<&TableReference>,
name: &str,
) -> Result<()> {
let count = self
.iter()
.filter(|(field_q, f)| match (field_q, qualifier) {
(Some(q1), Some(q2)) => q1.resolved_eq(q2) && f.name() == name,
(None, None) => f.name() == name,
_ => false,
})
.take(2)
.count();
if count > 1 {
_schema_err!(SchemaError::AmbiguousReference {
field: Column {
relation: None,
name: name.to_string(),
},
})
} else {
Ok(())
}
}

/// Find the qualified field with the given name
pub fn qualified_field_with_name(
&self,
Expand Down
Loading

0 comments on commit 5f0803b

Please sign in to comment.