Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor FileData to be more modular and more testable #1369

Open
xsdg opened this issue May 28, 2024 · 5 comments
Open

Refactor FileData to be more modular and more testable #1369

xsdg opened this issue May 28, 2024 · 5 comments

Comments

@xsdg
Copy link
Contributor

xsdg commented May 28, 2024

Currently, FileData is implemented as filedata.h (295 lines) and filedata.cc (3,581 lines). filedata.h declares 102 functions, and filedata.cc also includes 57 static functions.

The purpose of this bug is to lay out the desired end state, since it will take jumping through some hoops to actually get there from how things are implemented right now, while keeping everything functioning during the refactor.

The ideal end state has the following traits:

  1. Fundamentally distinct groups of functionality are split into separate implementation files. An old prototype of this refactor created changeinfo.cc, core.cc, filelist.cc, filter.cc, metadata.cc, sidecar.cc, sidecar_change_info.cc, and util.cc. That seems like a reasonable place to start.
  2. FileData becomes a class, in order to take advantage of C++ class access control features.
  3. static functions are generally replaced with protected functions. There may be some functions that still make sense to live in an anonymous namespace, but in general, functions should be unit-testable.
  4. A shim layer is created to translate between the historical C-style FileData API (things like file_data_new_dir(const gchar *path_utf8)) and a more modern C++-style API (something like FileData::new_for_dir(const gchar *path_utf8)
  5. Globals (like static GHashTable *file_data_pool) should at the very least become protected members (so that they're no longer accessible outside of the FileData class), and should ideally be encapsulated in some way, so that unit tests don't interfere with each other and to make it easier to make things thread-safe.
@xsdg
Copy link
Contributor Author

xsdg commented May 28, 2024

As far as strategy this time, I think I'm going to try to wrap all of the filedata functions into a single big class to begin with, and then work on splitting things out of that class into separate files and/or classes. The pieces of filedata.c are so tangled (and depend so much on static functionality) that it's impossible to split things apart without breaking them until the C++ access control is in place anyway.

So I see these as the main steps:

  1. Create the FileData class
    1. Move filedata.cc to filedata/filedata.cc. Put everything in that file inside of a big FileData class
    2. Update wherever the FileData struct is defined so that it also includes the public methods that are now encapsulated
    3. Create a new filedata.cc which trampolines the C-style calls to the C++-style class method calls.
    4. Turn any callback into a static class method to avoid needing to jump through method-versus-function calling hoops.
  2. Create an initial unit test
    1. Sample unit tests from prototype
    2. Examples of unit testing private functions from prototype
  3. Start separating out functionality into separate files
    1. This should just amount to moving functions at this point. Because we stopped using static functions during phase 1, we no longer have constraints around functions all being defined in the same translation unit (i.e. file)
    2. Part of this process should also involve trying to understand how the files relate to each other, and figuring out what would make sense as a private API between each file
  4. Start turning files into their own sub-classes
    1. Here's an example from the prototype
    2. The goal is to further encapsulate the dependencies for each group of functionality, so, for example, every bit of code in FileData no longer depends on UI code.
  5. Consider encapsulating globals into some kind of singleton object that can be dependency-injected for unit tests.
    1. This would go a long way towards allowing us to reliably validate things like corner-case refcounting behaviors.
    2. Beyond that, this would facilitate adding debugging features directly to that object, to make it easier to understand and fix bugs, and to validate those bugfixes.

@xsdg
Copy link
Contributor Author

xsdg commented Jun 28, 2024

I was working on milestone 5 and ran into a conundrum. Gonna write it out here in case other folks have ideas (especially @qarkai, since this fundamentally a C++ lifetime management question):

Background context

I decided to approach this by creating a FileDataContext struct (which holds global_file_data_count, file_data_pool, and file_data_planned_change_map). I also created an IFileDataContext interface (which currently just specifies virtual FileDataContext &context() = 0;, which derived classes need to implement), and a GlobalFileDataContext singleton that derives from IFileDataContext:

struct FileDataContext {
#ifdef DEBUG_FILEDATA
        gint global_file_data_count = 0;
#endif
        GHashTable *file_data_pool = nullptr;
        GHashTable *file_data_planned_change_hash = nullptr;
};

class IFileDataContext {
    public:
        virtual FileDataContext &context() = 0;
};

 class GlobalFileDataContext : public IFileDataContext {
    private:
            static GlobalFileDataContext get_instance() {
…
    public:
        FileDataContext &context() override { return context; }

    private:
        std::mutex instance_lock_;
        GlobalFileDataContext *instance_ = nullptr;  // GUARDED_BY(instance_lock_);
        FileDataContext context;
};

 class FileData {
     private:
        FileData() = delete;
        FileDataContext *context = nullptr;  // TODO(xsdg): Move to bottom of class.

The issue

The file_data_planned_change_remove function

  1. needs to access context->file_data_planned_change_hash, which means that it needs to be a FileData instance method
  2. calls file_data_unref(fd), which means that it needs to not be an instance method, so that it can destruct the FileData when appropriate

Or, more broadly, how should we deal with cases where code within FileData itself might remove the last reference to itself? (This is a possibility anywhere that file_data_unref is called from within src/filedata/filedata.cc).

Some ideas

  1. One obvious idea is to schedule file_data_unref asynchronously. That has all the challenges of doing things asynchronously (like ordering may no longer be guaranteed), but would obviously solve this particular issue

  2. This stuff could move to a friend class of FileData, which is no longer part of a FileData instance. That said, I think if a FileData method is in the call stack, that would still be an issue.

  3. Actually, I might be wrong. It might be the case that this Just Works (although it might also be very error-prone). See https://isocpp.org/wiki/faq/freestore-mgmt#delete-this.

    Also, note that the guarantees from the FAQ above explicitly specify that they only hold when using delete after using new for allocation, which we don't do. But also, as a side experiment, I already have a branch where FileData uses new/delete instead of g_new0 and g_free, so that requirement wouldn't be too difficult to manage if it's actually needed.

@qarkai
Copy link
Contributor

qarkai commented Jun 29, 2024

AFAIU FileDataContext is kind of storage of FileData. Probably file_data_planned_change_remove() should be part of FileDataContext.

@xsdg
Copy link
Contributor Author

xsdg commented Jul 5, 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants