feat: Add client interface. #16

Merged
merged 88 commits into from
Nov 30, 2024
Commits
6ffea6e
feat: Add client code structure and interface for Data, Future and ex…
sitaowang1998 Oct 31, 2024
94af15b
feat: Split client into two libraries and add interface
sitaowang1998 Oct 31, 2024
f69523a
fix: Add boost library for spider_client_lib
sitaowang1998 Nov 1, 2024
ccf6cc8
style: Improve code style for data based on pr comments
sitaowang1998 Nov 1, 2024
5e26f58
fix: Add absl as public library for core
sitaowang1998 Nov 1, 2024
020093c
style: Improve code style for client interface based on pr reviea pri…
sitaowang1998 Nov 1, 2024
ee222f0
fix: Try fix clang-tidy find nout found
sitaowang1998 Nov 1, 2024
1b0ccac
docs: Add quick start doc
sitaowang1998 Nov 1, 2024
b3a2e1e
style: Change markdown headings to sentence style and hard wrap markd…
sitaowang1998 Nov 3, 2024
ec8f500
docs: Update doc according to pr comments
sitaowang1998 Nov 3, 2024
5dc12cb
docs: Remove the worker note section and put the content in run task …
sitaowang1998 Nov 3, 2024
0d59c62
docs: Return a Job instead of Future for run and support user to pass…
sitaowang1998 Nov 5, 2024
4cd6233
Merge branch 'main' into interface
sitaowang1998 Nov 6, 2024
fec5e73
Change future to job
sitaowang1998 Nov 14, 2024
a5e799b
Change task to context
sitaowang1998 Nov 14, 2024
1b15b5d
Remove TaskGraph::run to simplify interface
sitaowang1998 Nov 16, 2024
98104f1
Add separate key-value store interface
sitaowang1998 Nov 16, 2024
cb369fd
Edit some docstrings.
kirkrodrigues Nov 19, 2024
b4a6f36
Fix include guard
sitaowang1998 Nov 19, 2024
f3de2ca
Merge branch 'main' into interface
sitaowang1998 Nov 19, 2024
70547ae
Add serialzable concept
sitaowang1998 Nov 20, 2024
525311c
Merge remote-tracking branch 'origin/interface' into interface
sitaowang1998 Nov 20, 2024
c776376
Fix clang-tidy
sitaowang1998 Nov 20, 2024
49c571e
Fix typo
sitaowang1998 Nov 20, 2024
43d7e16
Fix clang-tidy
sitaowang1998 Nov 20, 2024
5e8e1dd
Remove macOS build
sitaowang1998 Nov 20, 2024
fe23c3c
Change driver constructor
sitaowang1998 Nov 20, 2024
064edd8
Add exception to interface
sitaowang1998 Nov 20, 2024
e7c5240
Change run to start
sitaowang1998 Nov 20, 2024
b0b414e
Add get jobs to driver
sitaowang1998 Nov 20, 2024
97761e1
Add get jobs in context
sitaowang1998 Nov 20, 2024
91d36f2
Update doc with new interface
sitaowang1998 Nov 20, 2024
84c2f41
Fix clang-tidy
sitaowang1998 Nov 20, 2024
f7ab013
Refactor Context.hpp.
kirkrodrigues Nov 21, 2024
302e68a
style: Fix header guard name
sitaowang1998 Nov 21, 2024
a2dc8bc
style: Rename Context to TaskContext
sitaowang1998 Nov 21, 2024
046e740
style: Add missing class docstring
sitaowang1998 Nov 21, 2024
92d6489
feat: Add concepts for task argument
sitaowang1998 Nov 21, 2024
d27f042
Refactor Context.hpp.
kirkrodrigues Nov 21, 2024
2b49746
feat: Change the arguments from Serializable to TaskArgument
sitaowang1998 Nov 21, 2024
c4ee015
style: Update docstring for Driver
sitaowang1998 Nov 21, 2024
0cc231b
style: Update docstring for Data and Job
sitaowang1998 Nov 21, 2024
069a7a7
style: Update clang-format for library headers
sitaowang1998 Nov 21, 2024
9069030
style: Clean up unused headers and Change TaskGraph template
sitaowang1998 Nov 21, 2024
0fe063f
doc: Update quick start guide
sitaowang1998 Nov 21, 2024
e089107
style: Fix clang-tidy
sitaowang1998 Nov 21, 2024
159aa08
Rename TaskArgument to TaskIo
sitaowang1998 Nov 21, 2024
8e035b7
feat: Add Runnable concept and TaskFunction type
sitaowang1998 Nov 21, 2024
9d90c37
refactor: Rename insert_kv and get_kv to kv_store_insert and kv_store…
sitaowang1998 Nov 21, 2024
8de26ec
fix: Fix the template instantiation of TaskFunction
sitaowang1998 Nov 22, 2024
b9bfcdd
style: Fix clang-tidy
sitaowang1998 Nov 22, 2024
351c8b5
docs: Move cluster setup after run task and change all mentions of da…
sitaowang1998 Nov 22, 2024
7dae8d4
docs: Add task graph to group task example
sitaowang1998 Nov 22, 2024
5d15b22
Refactor Data.hpp
kirkrodrigues Nov 25, 2024
1588b51
Refactor Driver.hpp
kirkrodrigues Nov 25, 2024
1e1e41d
Refactor Exception.hpp
kirkrodrigues Nov 25, 2024
c0a6e6f
Refactor Job.hpp.
kirkrodrigues Nov 25, 2024
99c5935
Refactor TaskContext.hpp.
kirkrodrigues Nov 25, 2024
80f314a
Refactor TaskGraph.hpp.
kirkrodrigues Nov 25, 2024
7796e15
Refactor Concepts.hpp.
kirkrodrigues Nov 25, 2024
036dd51
Add absl to libraray list and sort library list
sitaowang1998 Nov 26, 2024
8affa10
Rename template types to satisfy clang-tidy
sitaowang1998 Nov 26, 2024
77d2458
Change set_cleanup to set_cleanup_func
sitaowang1998 Nov 26, 2024
026b6f1
Change set_cleanup to set_cleanup_func
sitaowang1998 Nov 26, 2024
448f693
Change job state enum name and error docstring
sitaowang1998 Nov 26, 2024
769a708
Restruct all the concepts
sitaowang1998 Nov 26, 2024
3555d2e
Add todo for task registration with timeout
sitaowang1998 Nov 26, 2024
a0c5b3a
Fix circular dependency
sitaowang1998 Nov 26, 2024
e185bf3
Restruct quick start guide
sitaowang1998 Nov 26, 2024
f0d79e9
Fix clang-tidy
sitaowang1998 Nov 26, 2024
f444772
Remove all cpp files in client
sitaowang1998 Nov 26, 2024
530da78
Move driver id section after task restart
sitaowang1998 Nov 26, 2024
db21fe7
Add Job::cancel
sitaowang1998 Nov 28, 2024
165eb84
Fix typo
sitaowang1998 Nov 29, 2024
06de774
Fix clean up function signature
sitaowang1998 Nov 29, 2024
c7a07b1
Fix set_locality argument in docstring example
sitaowang1998 Nov 29, 2024
f8c623a
Add void return type for kv_store_insert
sitaowang1998 Nov 29, 2024
488eaa3
Add noreturn and void return type for TaskContext::abort
sitaowang1998 Nov 29, 2024
f0729d9
Fix some header guards.
kirkrodrigues Nov 29, 2024
4d2aa6c
Edit some docstrings and comments.
kirkrodrigues Nov 29, 2024
88ed638
Fix typo in Data docstring example.
kirkrodrigues Nov 29, 2024
bd55552
Add exception in docstring
sitaowang1998 Nov 29, 2024
9897995
Remove pImpl in interface
sitaowang1998 Nov 29, 2024
73eabef
Fix clang-tidy
sitaowang1998 Nov 29, 2024
85d2475
Fix exception what
sitaowang1998 Nov 29, 2024
61be939
Fix docstring job state name
sitaowang1998 Nov 29, 2024
5bffeee
Refactor exceptions.
kirkrodrigues Nov 29, 2024
22f370d
Remove quick start guide
sitaowang1998 Nov 29, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions CMakeLists.txt
@@ -12,6 +12,12 @@ project(
set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# AppleClang complains that a file has no symbols and aborts the build.
if(APPLE)
set(CMAKE_CXX_ARCHIVE_CREATE "<CMAKE_AR> Scr <TARGET> <LINK_FLAGS> <OBJECTS>")
set(CMAKE_CXX_ARCHIVE_FINISH "<CMAKE_RANLIB> -no_warning_for_no_symbols -c <TARGET>")
endif()
Member

Do we need this anymore considering we've dropped support for building on macOS?


# Enable exporting compile commands
set(CMAKE_EXPORT_COMPILE_COMMANDS
ON
322 changes: 322 additions & 0 deletions docs/quick_start.md
Member

The writing is better, but I still think it would be confusing to a reader unless they read it multiple times. I think there are two things you can do:

  • Read React's quick start to get an idea of what a quick start guide should look like. Notice that they start small and build up. Obviously, Spider is a more complicated system, but I still think we can explain things in a way that's easy to follow without loss of detail.
  • Restructure the guide as follows:
    • Intro Spider---what it is, what it does, and briefly why it exists. This should only be a few sentences.
    • Explain how to write a task.
    • Explain how to run a task---this will require explaining how to write a client.
    • Explain how to start a cluster and run the client.
    • Explain how to compose tasks together, i.e. joining them together and nesting them.
    • Explain how to manage data as well as the basics of fault tolerance.

Member

You still need to remove this, right?

@@ -0,0 +1,322 @@
# Spider quick start guide

## Architecture of Spider

A Spider cluster is made up of three components:

* __Database__: Spider stores all states and data in a fault-tolerant database.
* __Scheduler__: The scheduler makes scheduling decisions when a worker asks for a new task to run.
  It also handles garbage collection and failure recovery.
* __Worker__: A worker executes the tasks it is assigned. Once a task finishes, the worker updates
  the task's output in the database and contacts the scheduler for a new task.

Users create a __client__ to run tasks on a Spider cluster. The client connects to the database to
submit new tasks and fetch their results. Clients _never_ talk directly to the scheduler or workers.

## Set up Spider

To get started:

1. Start a database supported by Spider, e.g., MySQL.
2. Start a scheduler and connect it to the database by running
   `spider start --scheduler --db <db_url> --port <scheduler_port>`.
3. Start some workers and connect them to the database by running
   `spider start --worker --db <db_url>`. To run specific tasks, a worker must be started with the
   libraries that implement them. We'll cover this later.

## Start a client

A client first creates a Spider driver that connects to the database. Spider automatically cleans
up the driver's resources in its destructor. The user can pass in an optional client ID; two
drivers with the same client ID cannot run at the same time.

```c++
#include <spider/Spider.hpp>

auto main(int argc, char** argv) -> int {
    boost::uuids::string_generator gen;
    spider::Driver driver{"db_url", gen("01234567-89ab-cdef-0123-456789abcdef")};
}
```

## Create a task

In Spider, a task is a non-member function whose first parameter is a `spider::Context` object. It
can then take any number of additional arguments. Each task argument must have one of the following
types:

1. A POD type.
2. `spider::Data`, covered in a [later section](#data-on-external-storage).
3. An `std::vector` of a POD type or of `spider::Data`.

A task can return any value of the valid argument types listed above. If a task needs to return
more than one result, use `std::tuple` and make sure all elements of the tuple have valid argument
types.

Spider requires the user to register each task function using the static `spider::register_task`,
which sets up the function inside the Spider library for later use. Spider requires each function's
name to be unique in the cluster.

```c++
// Task that sums two integers
auto sum(spider::Context& context, int x, int y) -> int {
    return x + y;
}

// Task that sorts two integers in non-ascending order
auto sort(spider::Context& context, int x, int y) -> std::tuple<int, int> {
    if (x >= y) {
        return { x, y };
    }
    return { y, x };
}

spider::register_task(sum);
spider::register_task(sort);
```

## Run a task

Spider enables users to run a task on the cluster: simply call `Driver::run` with the task and its
arguments. `Driver::run` returns a `spider::Job` object, which represents the running task.
`spider::Job` takes the output type of the task as its template argument. You can call `Job::state`
to check the state of the running task, and `Job::get_result` to block until the result is
available. Users can send a cancel signal by calling `Job::cancel`, and a client can list all the
running jobs it has submitted by calling `Driver::get_jobs`.

```c++
auto main(int argc, char** argv) -> int {
    // driver initialization skipped
    spider::Job<int> sum_job = driver.run(sum, 2, 2);
    assert(4 == sum_job.get_result());

    spider::Job<std::tuple<int, int>> sort_job = driver.run(sort, 4, 3);
    assert(std::tuple{4, 3} == sort_job.get_result());
}
```

If you try to compile and run the example code directly, you'll find that it fails because the
Spider workers do not know which functions they can run. The user needs to compile all the tasks,
including the calls to `spider::register_task`, into a shared library, and start the workers with
that library by running `spider start --worker --db <db_url> --libs [client_libraries]`.

## Group tasks together

In the real world, a single task is often too simple to be useful. Spider lets you bind the outputs
of tasks to the inputs of another task, similar to `std::bind`. The first argument of `spider::bind`
is the child task. Each later argument is either a `spider::Task` or `spider::TaskGraph`, whose
entire output is used as part of the child task's inputs, or a POD or `spider::Data` value that is
used as an input directly. Spider requires that the output types of the `Task`s and `TaskGraph`s,
and the types of the POD and `spider::Data` values, match the input types of the child task.

Binding tasks together forms dependencies among tasks, which are represented by
Member

  1. How do you plan to communicate (serialize) the task calls (name + args) from the client to the server?
  2. Let's say a task takes two inputs. Does this interface support using a constant for one input and a task for the other?
    1. If so, do you support a task graph that looks like this?

       ```mermaid
       flowchart TD
           leaf["foo(int, int) -> int"]
           parent["bar(int, int) -> int"]
           3 --> leaf
           4 --> parent
           5 --> parent
           parent --> leaf
       ```

    2. If so, that means any arguments the user passes into run get passed to the inputs in the task graph with a kind of DFS ordering. Is that true?

Collaborator Author

  1. Here is where `spider::register_task` comes in: it records the mapping between a function's name and its pointer. When the user calls `run` with the function pointer, the library actually sends the function name to the db. The worker also gets the function name as part of the task metadata. Since the worker links against the library containing the `register_task` call, it has the same mapping and knows which function to call.
    As for arguments, their types are stored in the db (serialized as strings right now), and values are serialized into strings.
  2. Yes.
    i. Yes.
    ii. No. `spider::bind` needs to bind all inputs of the child task, so the value 3 must be an argument to `bind`. 4 and 5 can be passed in `run`. However, it is possible to have multiple first-layer tasks, i.e., tasks with no parents. The input of the task graph is all the first-layer tasks' inputs put together.
    It would be possible to support passing 3 as an argument in `run` for the above example via a special placeholder type for `bind`. In that case, DFS is an intuitive order.

Member

  1. Can you explain how:
    1. register_task maps a function pointer into a function name?
    2. run turns a function pointer into a function name?
  2. Can you explain this point with an example?

    However, it is possible that there are multiple first-layer tasks, i.e., tasks with no parents. The input of the task graph is all the first-layer tasks' inputs put together.

Collaborator Author

  1. For the client and worker, when the library is loaded, it creates a mapping between function names and function pointers.
    i. I am thinking of turning `register_task` into a macro to get the function name at compile time. It then stores the mapping from the function name to the function pointer.
    ii. `run` gets the function name from the mapping stored before.
  2. In the example below, both `bar` and `baz` have no parents, and the task graph's input is their inputs put together, i.e., 1, 2, 3 and 4.

```mermaid
graph TD
    1[1]
    2[2]
    3[3]
    4[4]
    foo["foo(int, int) -> int"]
    bar["bar(int, int) -> int"]
    baz["baz(int, int) -> int"]
    1 --> bar
    2 --> bar
    3 --> baz
    4 --> baz
    bar --> foo
    baz --> foo
```

Member

  1. Makes sense.
  2. Gotcha. So to clarify my understanding:
    • in run, users can only pass arguments to root tasks (tasks with no parents).
    • in run, when the user passes a list of arguments [1, 2, 3, 4], conceptually:
      • the framework will iterate over the root tasks in order;
      • for each task argument, the framework will pop one argument from the list.

Collaborator Author

Yes and yes.

`spider::TaskGraph`. A `TaskGraph` can be further composed into a more complicated `TaskGraph` by
serving as an input to another task. You can run a task graph using `Driver::run` in the same way
as running a single task.

```c++
auto square(spider::Context& context, int x) -> int {
    return x * x;
}

auto square_root(spider::Context& context, int x) -> int {
    return sqrt(x);
}

// task registration skipped
auto main(int argc, char** argv) -> int {
    // driver initialization skipped
    spider::TaskGraph<int(int, int)> sum_of_square = spider::bind(sum, square, square);
    spider::TaskGraph<int(int, int)> rss = spider::bind(square_root, sum_of_square);
    spider::Job<int> job = driver.run(rss, 3, 4);
    assert(5 == job.get_result());
}
```

## Run task inside task

A static task graph is enough to solve many real-world problems, but dynamically adding tasks
on the fly can come in handy. As mentioned before, Spider allows you to add another task as a child
of the running task by calling `Context::add_child`. `Context::add_child` can also add a task graph
as a child. A task graph can be constructed with `Context::bind`, which has the same signature and
semantics as `spider::bind`.

```c++
auto gcd(spider::Context& context, int x, int y) -> std::tuple<int, int> {
    if (x == y) {
        std::cout << "gcd is: " << x << std::endl;
        return { x, y };
    }
    if (x > y) {
        context.add_child(gcd);
        return { x % y, y };
    }
    context.add_child(gcd);
    return { x, y % x };
}
```

However, it is impossible for a client to get the return value of dynamically created tasks. One
solution is to share data through the key-value store, which will be discussed
[later](#data-as-key-value-store). Another solution is to run a task or task graph inside a task
and wait for its result, just like a client does. This solution is closer to conventional
function-call semantics.

```c++
auto gcd(spider::Context& context, int x, int y) -> int {
    while (y != 0) {
        spider::Job<std::tuple<int, int>> job = context.run(gcd_impl, x, y);
        std::tuple<int, int> const result = job.get_result();
        x = std::get<0>(result);
        y = std::get<1>(result);
    }
    return x;
}

auto gcd_impl(spider::Context& context, int x, int y) -> std::tuple<int, int> {
    return { y, x % y };
}
```

## Data on external storage

Often simple POD data is not enough, and passing large amounts of data around is expensive. Usually
such data is stored on disk or in a distributed storage system. For example, an ETL workload
usually reads its input from external storage, writes temporary data to external storage, and
writes its final output to external storage.

Spider lets users pass the metadata of such data around in `spider::Data` objects. A `Data` object
stores the metadata of the external data and provides crucial information to Spider for correct
and efficient scheduling and failure recovery. A `Data` object stores a list of nodes that have
locality for the external data, and the user can specify whether locality is a hard requirement,
i.e., whether the task can only run on the nodes in the locality list. A `Data` object can include
a `cleanup` function, which runs when the `Data` object is no longer referenced by any task or
client. `Data` also has a persist flag indicating that the external data is persisted and does not
need to be cleaned up.

```c++
struct HdfsFile {
    std::string url;
};

/**
 * In this example, we run a filter and a map on input stored in HDFS.
 * Filter writes its output into a temporary HDFS file, which will be cleaned
 * up by Spider when the task graph finishes.
 * Map reads the temporary file and persists its output in an HDFS file.
 */
auto main(int argc, char** argv) -> int {
    // driver initialization skipped
    // Creates an HdfsFile Data to represent the input data stored in HDFS.
    spider::Data<HdfsFile> input = spider::Data<HdfsFile>::Builder()
            .mark_persist(true)
            .build(HdfsFile { "/path/to/input" });
    spider::Job<spider::Data<HdfsFile>> job = driver.run(
            spider::bind(map, filter),
            input);
    std::string const output_path = job.get_result().get().url;
    std::cout << "Result is stored in " << output_path << std::endl;
}

/**
 * Runs the filter on the input data from an HDFS file and writes the output
 * into a temporary HDFS file for later tasks.
 *
 * @param input input file stored in HDFS
 * @return temporary file stored in HDFS
 */
auto filter(spider::Context& context, spider::Data<HdfsFile> input) -> spider::Data<HdfsFile> {
    // We can use the task id as a unique random number.
    std::string const output_path = std::format("/path/{}", context.task_id());
    std::string const input_path = input.get().url;
    // Creates the HdfsFile Data before creating the actual file in HDFS so
    // Spider can clean up the HDFS file on failure.
    spider::Data<HdfsFile> output = spider::Data<HdfsFile>::Builder()
            .cleanup([](HdfsFile const& file) { delete_hdfs_file(file); })
            .build(HdfsFile { output_path });
    auto file = hdfs_create(output_path);
    // HDFS allows reading data from any node, but reading from the nodes
    // where the file is stored and replicated is faster.
    std::vector<std::string> nodes = hdfs_get_nodes(file);
    output.set_locality(nodes, false); // not hard locality

    // Runs the filter
    run_filter(input_path, file);

    return output;
}

/**
 * Runs the map on the input data from an HDFS file and persists the output
 * into an HDFS file.
 *
 * @param input input file stored in HDFS
 * @return persisted output in HDFS
 */
auto map(spider::Context& context, spider::Data<HdfsFile> input) -> spider::Data<HdfsFile> {
    // We use a hardcoded path for simplicity in this example. You can pass
    // the path in as a task input or use the task id as a random name as in
    // filter.
    std::string const output_path = "/path/to/output";
    std::string const input_path = input.get().url;

    spider::Data<HdfsFile> output = spider::Data<HdfsFile>::Builder()
            .cleanup([](HdfsFile const& file) { delete_hdfs_file(file); })
            .build(HdfsFile { output_path });

    run_map(input_path, output_path);

    // Now that map has finished, the file is persisted in HDFS as the output
    // of the job. We need to inform Spider that the file is persisted and
    // should not be cleaned up.
    output.mark_persist();
    return output;
}

```

## Data as key-value store

`Data` can also be used as a key-value store. The user can specify a key when creating the data,
and the data can be accessed later by its key. Note that a task can only access `Data` that it
created or that was passed to it; a client can access any data by its key.

Using the key-value store, we can solve the dynamic-task result problem mentioned
[before](#run-task-inside-task).

```c++
auto gcd(spider::Context& context, int x, int y, char const* key)
        -> std::tuple<int, int, std::string> {
    if (x == y) {
        spider::Data<int>::Builder()
                .set_key(key)
                .build(x);
        return { x, y, key };
    }
    if (x > y) {
        context.add_child(gcd);
        return { x % y, y, key };
    }
    context.add_child(gcd);
    return { x, y % x, key };
}

auto main(int argc, char** argv) -> int {
    // driver initialization skipped
    std::string const key = "random_key";
    driver.run(gcd, 48, 18, key.c_str());
    // Waits until the dynamically created task stores the result.
    while (!driver.get_data_by_key(key)) {
    }
    int const value = driver.get_data_by_key(key).get();
    std::cout << "gcd of 48 and 18 is " << value << std::endl;
}
```

## Straggler mitigation

`spider::register_task` can take a second argument specifying a timeout in milliseconds. If a task
executes for longer than the specified timeout, Spider spawns another instance of the task running
the same function. The instance that finishes first wins; the other running instances are
cancelled, and their associated data is cleaned up.

Each new instance has a different task id, and it is the user's responsibility to avoid data races
and to deduplicate the output if necessary.