Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-14073 dfuse: Move writeback caching from kernel to dfuse. #12729

Merged
merged 40 commits into from
Apr 15, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
40 commits
Select commit Hold shift + click to select a range
1917a8d
DAOS-14073 dfuse: Move writeback caching from kernel to dfuse.
ashleypittman Jul 28, 2023
841ce5f
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Aug 8, 2023
377a6b2
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Aug 10, 2023
9cc1c45
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Sep 7, 2023
6c7fe85
Use a rwlock to flush.
ashleypittman Sep 7, 2023
03b8de6
Add more flush calls.
ashleypittman Sep 7, 2023
4255d3c
Test compile
ashleypittman Sep 7, 2023
bd347ba
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Oct 10, 2023
cfdc60c
Use release define to enable debugging.
ashleypittman Oct 10, 2023
fa2eaa2
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Oct 31, 2023
d59c603
Remove debugging.
ashleypittman Oct 31, 2023
25683d8
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Nov 13, 2023
020767f
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Nov 13, 2023
355d94b
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Nov 17, 2023
65e2455
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Nov 21, 2023
f4d6026
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Dec 8, 2023
d8dc076
Use other macro for new flush functions.
ashleypittman Dec 8, 2023
d6696f4
Fix formatting.
ashleypittman Dec 8, 2023
83431d3
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Dec 21, 2023
6a88f68
Only hold hte lock if wbb cache is being used.
ashleypittman Dec 21, 2023
cf2b80f
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Jan 11, 2024
f213f33
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Jan 12, 2024
1a3d84b
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Jan 15, 2024
5f551ed
Only take the lock once
ashleypittman Jan 15, 2024
29340c9
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Jan 19, 2024
0c77d4e
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Jan 25, 2024
3d2a2cb
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Feb 8, 2024
96d8129
Only flush files.
ashleypittman Feb 13, 2024
071aae8
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Feb 13, 2024
2d85959
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Feb 15, 2024
faa8429
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Feb 16, 2024
8923436
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Feb 23, 2024
3d2e1df
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Feb 27, 2024
a606e40
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Mar 5, 2024
aec576a
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Mar 18, 2024
c113aa8
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Mar 21, 2024
ad68c6b
Fix indentation.
ashleypittman Mar 21, 2024
e9730d4
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Apr 5, 2024
0b6be3d
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Apr 9, 2024
1bdb650
Merge branch 'master' into amd/dfuse-write-cache
ashleypittman Apr 10, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 18 additions & 0 deletions src/client/dfuse/dfuse.h
Original file line number Diff line number Diff line change
Expand Up @@ -526,6 +526,7 @@ struct dfuse_cont {
double dfc_data_timeout;
bool dfc_data_otoc;
bool dfc_direct_io_disable;
bool dfc_wb_cache;

/* Set to true if the inode was allocated to this structure, so should be kept on close*/
bool dfc_save_ino;
Expand Down Expand Up @@ -987,6 +988,10 @@ struct dfuse_inode_entry {
/** File has been unlinked from daos */
bool ie_unlinked;

/* Lock for writes, shared locks are held during write-back reads, exclusive lock is
* acquired and released to flush outstanding writes for getattr, close and forget.
*/
pthread_rwlock_t ie_wlock;
/** Last file closed in this directory was read linearly. Directories only.
*
* Set on close() of a file in the directory to the value of linear_read from the fh.
Expand All @@ -998,6 +1003,19 @@ struct dfuse_inode_entry {
d_list_t ie_evict_entry;
};

/* Flush write-back cache writes to a inode. It does this by waiting for and then releasing an
* exclusive lock on the inode. Writes take a shared lock so this will block until all pending
* writes are complete.
*/
Comment on lines +1006 to +1009
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a best effort flush? since it can be very racy of course depending on when other writes acquire or try to acquire the lock

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The semantics will depend on what rwlocks do but it'll flush at least all writes which have already started. I don't know what it will do for writes that are issued after the flush starts but before active writes are complete. It's probably rare that a process is calling flush and write on the same fd at the same time however so I don't think this is a big issue.

It is per inode though rather than per file so could happen across processes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can change this macro though to only check for files, there's no lock for non-files.


#define DFUSE_IE_WFLUSH(_ie) \
do { \
if ((_ie)->ie_dfs->dfc_wb_cache && S_ISREG((_ie)->ie_stat.st_mode)) { \
D_RWLOCK_WRLOCK(&(_ie)->ie_wlock); \
D_RWLOCK_UNLOCK(&(_ie)->ie_wlock); \
} \
} while (0)

/* Lookup an inode and take a ref on it. */
static inline struct dfuse_inode_entry *
dfuse_inode_lookup(struct dfuse_info *dfuse_info, fuse_ino_t ino)
Expand Down
8 changes: 6 additions & 2 deletions src/client/dfuse/dfuse_core.c
Original file line number Diff line number Diff line change
Expand Up @@ -608,7 +608,7 @@ dfuse_char_disabled(char *addr, size_t len)
* set.
*/
static int
dfuse_cont_get_cache(struct dfuse_cont *dfc)
dfuse_cont_get_cache(struct dfuse_info *dfuse_info, struct dfuse_cont *dfc)
{
size_t sizes[ATTR_COUNT];
char *buff;
Expand Down Expand Up @@ -730,6 +730,9 @@ dfuse_cont_get_cache(struct dfuse_cont *dfc)

if (have_dentry && !have_dentry_dir)
dfc->dfc_dentry_dir_timeout = dfc->dfc_dentry_timeout;

if (dfc->dfc_data_timeout != 0 && dfuse_info->di_wb_cache)
dfc->dfc_wb_cache = true;
rc = 0;
out:
D_FREE(buff);
Expand Down Expand Up @@ -873,7 +876,7 @@ dfuse_cont_open(struct dfuse_info *dfuse_info, struct dfuse_pool *dfp, const cha
uuid_copy(dfc->dfc_uuid, c_info.ci_uuid);

if (dfuse_info->di_caching) {
rc = dfuse_cont_get_cache(dfc);
rc = dfuse_cont_get_cache(dfuse_info, dfc);
if (rc == ENODATA) {
DFUSE_TRA_INFO(dfc, "Using default caching values");
dfuse_set_default_cont_cache_values(dfc);
Expand Down Expand Up @@ -1250,6 +1253,7 @@ dfuse_ie_init(struct dfuse_info *dfuse_info, struct dfuse_inode_entry *ie)
atomic_init(&ie->ie_il_count, 0);
atomic_fetch_add_relaxed(&dfuse_info->di_inode_count, 1);
D_INIT_LIST_HEAD(&ie->ie_evict_entry);
D_RWLOCK_INIT(&ie->ie_wlock, 0);
}

void
Expand Down
51 changes: 40 additions & 11 deletions src/client/dfuse/dfuse_fuseops.c
Original file line number Diff line number Diff line change
Expand Up @@ -62,9 +62,8 @@ dfuse_show_flags(void *handle, unsigned int cap, unsigned int want)
DFUSE_TRA_WARNING(handle, "Unknown requested flags %#x", want);
}

/* Called on filesystem init. It has the ability to both observe configuration
* options, but also to modify them. As we do not use the FUSE command line
* parsing this is where we apply tunables.
/* Called on filesystem init. It has the ability to both observe configuration options, but also to
* modify them. As we do not use the FUSE command line parsing this is where we apply tunables.
*/
static void
dfuse_fuse_init(void *arg, struct fuse_conn_info *conn)
Expand All @@ -76,8 +75,8 @@ dfuse_fuse_init(void *arg, struct fuse_conn_info *conn)
DFUSE_TRA_INFO(dfuse_info, "Proto %d %d", conn->proto_major, conn->proto_minor);

/* These are requests dfuse makes to the kernel, but are then capped by the kernel itself,
* for max_read zero means "as large as possible" which is what we want, but then dfuse
* does not know how large to pre-allocate any buffers.
* for max_read zero means "as large as possible" which is what we want, but then dfuse does
* not know how large to pre-allocate any buffers.
*/
DFUSE_TRA_INFO(dfuse_info, "max read %#x", conn->max_read);
DFUSE_TRA_INFO(dfuse_info, "max write %#x", conn->max_write);
Expand All @@ -91,16 +90,12 @@ dfuse_fuse_init(void *arg, struct fuse_conn_info *conn)
conn->want |= FUSE_CAP_READDIRPLUS;
conn->want |= FUSE_CAP_READDIRPLUS_AUTO;

conn->time_gran = 1;

if (dfuse_info->di_wb_cache)
conn->want |= FUSE_CAP_WRITEBACK_CACHE;

#ifdef FUSE_CAP_CACHE_SYMLINKS
conn->want |= FUSE_CAP_CACHE_SYMLINKS;
#endif
dfuse_show_flags(dfuse_info, conn->capable, conn->want);

conn->time_gran = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC this has something to do with ms or ns time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This isn't a logic change, just grouping the conn->want code together makes it appear new.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I missed that

conn->max_background = 16;
conn->congestion_threshold = 8;

Expand Down Expand Up @@ -170,6 +165,8 @@ df_ll_getattr(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
DFUSE_IE_STAT_ADD(inode, DS_GETATTR);
}

DFUSE_IE_WFLUSH(inode);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more locking on the getattr (and open) path sounds going in the wrong direction for me, performance wise

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, although this should uncontented as it's just flushing the inode. The alternative would be to use atomics to track the in-flight count. That would be extra code but allow this to be a single atomic which wouldn't need to flush the pipeline.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or...

we could remove this flush but then the returned file-size might be "incorrect" with respect to committed writes that have been submitted by the kernel which seems dangerous. We'd at least want to detect this and set a attr_timeout of 0 which again would mean we needed to keep track of the in-flight count using atomics.


if (inode->ie_dfs->dfc_attr_timeout &&
(atomic_load_relaxed(&inode->ie_open_write_count) == 0) &&
(atomic_load_relaxed(&inode->ie_il_count) == 0)) {
Expand Down Expand Up @@ -209,6 +206,8 @@ df_ll_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr, int to_set,
DFUSE_IE_STAT_ADD(inode, DS_SETATTR);
}

DFUSE_IE_WFLUSH(inode);

if (inode->ie_dfs->dfs_ops->setattr)
inode->ie_dfs->dfs_ops->setattr(req, inode, attr, to_set);
else
Expand Down Expand Up @@ -541,6 +540,34 @@ df_ll_statfs(fuse_req_t req, fuse_ino_t ino)
DFUSE_REPLY_ERR_RAW(dfuse_info, req, rc);
}

static void
dfuse_cb_flush(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
{
struct dfuse_obj_hdl *oh;
struct dfuse_inode_entry *inode;

D_ASSERT(fi != NULL);
oh = (struct dfuse_obj_hdl *)fi->fh;
inode = oh->doh_ie;

DFUSE_IE_WFLUSH(inode);
DFUSE_REPLY_ZERO(inode, req);
}

static void
dfuse_cb_fdatasync(fuse_req_t req, fuse_ino_t ino, int datasync, struct fuse_file_info *fi)
{
struct dfuse_obj_hdl *oh;
struct dfuse_inode_entry *inode;

D_ASSERT(fi != NULL);
oh = (struct dfuse_obj_hdl *)fi->fh;
inode = oh->doh_ie;

DFUSE_IE_WFLUSH(inode);
DFUSE_REPLY_ZERO(inode, req);
}

/* dfuse ops that are used for accessing dfs mounts */
const struct dfuse_inode_ops dfuse_dfs_ops = {
.lookup = dfuse_cb_lookup,
Expand Down Expand Up @@ -598,7 +625,9 @@ const struct dfuse_inode_ops dfuse_pool_ops = {
ACTION(write_buf, dfuse_cb_write, true) \
ACTION(read, dfuse_cb_read, false) \
ACTION(readlink, dfuse_cb_readlink, false) \
ACTION(ioctl, dfuse_cb_ioctl, false)
ACTION(ioctl, dfuse_cb_ioctl, false) \
ACTION(flush, dfuse_cb_flush, true) \
ACTION(fsync, dfuse_cb_fdatasync, true)

#define SET_MEMBER(member, fn, ...) ops.member = fn;

Expand Down
2 changes: 2 additions & 0 deletions src/client/dfuse/ops/open.c
Original file line number Diff line number Diff line change
Expand Up @@ -145,6 +145,8 @@ dfuse_cb_release(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)

DFUSE_TRA_DEBUG(oh, "Closing %d", oh->doh_caching);

DFUSE_IE_WFLUSH(oh->doh_ie);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in open.c but actually the function is release which maps to close(), this line is what causes flush-on-close which we do want to keep, and won't affect the performance of open at all.


if (oh->doh_readahead) {
struct dfuse_event *ev;

Expand Down
2 changes: 2 additions & 0 deletions src/client/dfuse/ops/read.c
Original file line number Diff line number Diff line change
Expand Up @@ -186,6 +186,8 @@ dfuse_cb_read(fuse_req_t req, fuse_ino_t ino, size_t len, off_t position, struct

ev->de_complete_cb = dfuse_cb_read_complete;

DFUSE_IE_WFLUSH(oh->doh_ie);

rc = dfs_read(oh->doh_dfs, oh->doh_obj, &ev->de_sgl, position, &ev->de_len, &ev->de_ev);
if (rc != 0) {
D_GOTO(err, rc);
Expand Down
51 changes: 34 additions & 17 deletions src/client/dfuse/ops/write.c
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
/**
* (C) Copyright 2016-2023 Intel Corporation.
* (C) Copyright 2016-2024 Intel Corporation.
*
* SPDX-License-Identifier: BSD-2-Clause-Patent
*/
Expand All @@ -10,10 +10,14 @@
static void
dfuse_cb_write_complete(struct dfuse_event *ev)
{
if (ev->de_ev.ev_error == 0)
DFUSE_REPLY_WRITE(ev->de_oh, ev->de_req, ev->de_len);
else
DFUSE_REPLY_ERR_RAW(ev->de_oh, ev->de_req, ev->de_ev.ev_error);
if (ev->de_req) {
if (ev->de_ev.ev_error == 0)
DFUSE_REPLY_WRITE(ev->de_oh, ev->de_req, ev->de_len);
else
DFUSE_REPLY_ERR_RAW(ev->de_oh, ev->de_req, ev->de_ev.ev_error);
} else {
D_RWLOCK_UNLOCK(&ev->de_oh->doh_ie->ie_wlock);
}
daos_event_fini(&ev->de_ev);
d_slab_release(ev->de_eqt->de_write_slab, ev);
}
Expand All @@ -22,15 +26,15 @@ void
dfuse_cb_write(fuse_req_t req, fuse_ino_t ino, struct fuse_bufvec *bufv, off_t position,
struct fuse_file_info *fi)
{
struct dfuse_obj_hdl *oh = (struct dfuse_obj_hdl *)fi->fh;
struct dfuse_info *dfuse_info = fuse_req_userdata(req);
const struct fuse_ctx *fc = fuse_req_ctx(req);
size_t len = fuse_buf_size(bufv);
struct fuse_bufvec ibuf = FUSE_BUFVEC_INIT(len);
struct dfuse_eq *eqt;
int rc;
struct dfuse_event *ev;
uint64_t eqt_idx;
struct dfuse_obj_hdl *oh = (struct dfuse_obj_hdl *)fi->fh;
struct dfuse_info *dfuse_info = fuse_req_userdata(req);
size_t len = fuse_buf_size(bufv);
struct fuse_bufvec ibuf = FUSE_BUFVEC_INIT(len);
struct dfuse_eq *eqt;
int rc;
struct dfuse_event *ev;
uint64_t eqt_idx;
bool wb_cache = false;

DFUSE_IE_STAT_ADD(oh->doh_ie, DS_WRITE);

Expand All @@ -40,8 +44,13 @@ dfuse_cb_write(fuse_req_t req, fuse_ino_t ino, struct fuse_bufvec *bufv, off_t p

eqt = &dfuse_info->di_eqt[eqt_idx % dfuse_info->di_eq_count];

DFUSE_TRA_DEBUG(oh, "%#zx-%#zx requested flags %#x pid=%d", position, position + len - 1,
bufv->buf[0].flags, fc->pid);
if (oh->doh_ie->ie_dfs->dfc_wb_cache) {
D_RWLOCK_RDLOCK(&oh->doh_ie->ie_wlock);
wb_cache = true;
}

DFUSE_TRA_DEBUG(oh, "%#zx-%#zx requested flags %#x", position, position + len - 1,
bufv->buf[0].flags);

/* Evict the metadata cache here so the lookup doesn't return stale size/time info */
if (atomic_fetch_add_relaxed(&oh->doh_write_count, 1) == 0) {
Expand All @@ -66,7 +75,10 @@ dfuse_cb_write(fuse_req_t req, fuse_ino_t ino, struct fuse_bufvec *bufv, off_t p

ev->de_oh = oh;
ev->de_iov.iov_len = len;
ev->de_req = req;
if (wb_cache)
ev->de_req = 0;
else
ev->de_req = req;
ev->de_len = len;
ev->de_complete_cb = dfuse_cb_write_complete;

Expand All @@ -93,6 +105,9 @@ dfuse_cb_write(fuse_req_t req, fuse_ino_t ino, struct fuse_bufvec *bufv, off_t p
if (rc != 0)
D_GOTO(err, rc);

if (wb_cache)
DFUSE_REPLY_WRITE(oh, req, len);

/* Send a message to the async thread to wake it up and poll for events */
sem_post(&eqt->de_sem);

Expand All @@ -102,6 +117,8 @@ dfuse_cb_write(fuse_req_t req, fuse_ino_t ino, struct fuse_bufvec *bufv, off_t p
return;

err:
if (wb_cache)
D_RWLOCK_UNLOCK(&ev->de_oh->doh_ie->ie_wlock);
DFUSE_REPLY_ERR_RAW(oh, req, rc);
if (ev) {
daos_event_fini(&ev->de_ev);
Expand Down
Loading