Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Abstract process_record to Separate Content Extraction Step for Reusability and Testing #48

Open
silentninja opened this issue Dec 30, 2024 · 1 comment

Comments

@silentninja
Copy link

silentninja commented Dec 30, 2024

Problem

The process_record function currently tightly couples content extraction and aggregation logic. This makes it difficult to:

  1. Reuse the extraction logic across different parts of the codebase.
  2. Isolate and test the extraction logic effectively.

Proposed Improvement

Introduce a separate step for content extraction. This abstraction will:

  • Encourage Reusability: By decoupling the logic, the content extraction step can be easily shared across modules or extended by the community.
  • Enhance Testability: Since the extraction logic involves mostly pure and idempotent functions, isolating it would simplify testing and debugging.

Implementation Suggestions

  1. Extract the content extraction logic into a dedicated function.
  2. Extract the content aggregation logic into a dedicated function.
  3. Modify process_record to delegate to the new abstractions

This can be implemented at two potential levels:

  1. At the CCSparkJob Level:

    • Establish a standardized approach to content extraction, signifying it as the principal way of handling such tasks in the codebase.
  2. At Specific Examples:

    • Implement the abstraction in specific examples like ExtractLinksJob to showcase the idea as a suggestion.
    • Provides flexibility for contributors to adopt or adapt the approach as needed.
@sebastian-nagel
Copy link
Contributor

Hi @silentninja,

The process_record function currently tightly couples content extraction and aggregation logic.

Isn't the aggregation logic outside of process_record?

  • the method process_record is a generator which takes a single WARC/WAT/WET record and yields any kind of tuple. <String, Long> is the default type of a tuple, but
  • implementations of a CCSparkJob may define a custom output tuple, but then need also to implement a custom aggregation logic by overriding the reduce_by_key method. See word_count.py.
  1. Extract the content extraction logic into a dedicated function.
  2. Extract the content aggregation logic into a dedicated function.
  3. Modify process_record to delegate to the new abstractions

Unluckily, process_record is the only method which must be defined in examples. There is no default implementation. This is something which shouldn't be changed because it would affect all customized tools built with cc-pyspark. Even for the provided examples it would be a substantial change. For very simple examples (e.g. html_tag_count.py), it wouldn't hardly simplify the implementation.

Establish a standardized approach to content extraction

How this should be done? But Extraction of the content depends not only on the use case but on the input. Every type of WARC record can be consumed: WARC, WAT, WET; response, request, metadata, conversion. The payload can be text, JSON, HTML, PDF, etc. There is also an open PR which extends using cc-pyspark to any kind of file, not only WARC files, see #45.

I strongly argue to keep the examples as simple as possible. And keep the complexity down the line and not in the top of the inheritance hierarchy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants