Abstract `process_record` to Separate Content Extraction Step for Reusability and Testing #48

silentninja · 2024-12-30T18:42:32Z

Problem

The process_record function currently tightly couples content extraction and aggregation logic. This makes it difficult to:

- Reuse the extraction logic across different parts of the codebase.
- Isolate and test the extraction logic effectively.

Proposed Improvement

Introduce a separate step for content extraction. This abstraction will:

Encourage Reusability: By decoupling the logic, the content extraction step can be easily shared across modules or extended by the community.
Enhance Testability: Since the extraction logic involves mostly pure and idempotent functions, isolating it would simplify testing and debugging.

Implementation Suggestions

- Extract the content extraction logic into a dedicated function.
- Extract the content aggregation logic into a dedicated function.
- Modify process_record to delegate to the new abstractions.

This can be implemented at two potential levels:

At the CCSparkJob Level:
- Establish a standardized approach to content extraction, making it the primary way of handling such tasks in the codebase.
At Specific Examples:
- Implement the abstraction in specific examples like ExtractLinksJob to showcase the idea as a suggestion.
- This leaves contributors free to adopt or adapt the approach as needed.
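
The example-level variant could look roughly like the sketch below. It is self-contained and does not import cc-pyspark: `ExtractLinksSketch`, `extract_links`, the regular expression, and the dict-shaped record are all hypothetical stand-ins, not the actual ExtractLinksJob code.

```python
import re
from typing import Iterator, Tuple

# Deliberately naive pattern, used only for illustration.
LINK_PATTERN = re.compile(r'href="([^"]+)"')


def extract_links(html: str) -> Iterator[str]:
    """Pure extraction step: yield every href target found in an HTML string."""
    for match in LINK_PATTERN.finditer(html):
        yield match.group(1)


class ExtractLinksSketch:
    """Hypothetical job class; it does NOT derive from the real CCSparkJob."""

    def process_record(self, record: dict) -> Iterator[Tuple[str, int]]:
        # Assumption: the record is a plain dict with an already decoded
        # "payload" string. A real job would receive a WARC record instead.
        for link in extract_links(record["payload"]):
            yield link, 1
```

With this split, `extract_links` can be unit-tested on plain strings without Spark or WARC input, while `process_record` only adapts the record and emits tuples.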

sebastian-nagel · 2025-01-12T13:51:35Z

Hi @silentninja,

The process_record function currently tightly couples content extraction and aggregation logic.

Isn't the aggregation logic outside of process_record?

The method process_record is a generator which takes a single WARC/WAT/WET record and yields any kind of tuple. <String, Long> is the default tuple type. Implementations of a CCSparkJob may define a custom output tuple, but they then also need to implement custom aggregation logic by overriding the reduce_by_key method. See word_count.py.
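
As a rough illustration of that contract, here is a self-contained sketch that imports neither cc-pyspark nor Spark. The function names, the (term frequency, document frequency) value tuple, and the `aggregate` helper that stands in for Spark's reduce-by-key step are assumptions made for the example, not the code in word_count.py.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

Value = Tuple[int, int]  # (term frequency, document frequency)


def process_record(text: str) -> Iterator[Tuple[str, Value]]:
    """Generator: yield one (word, (count_in_record, 1)) tuple per distinct word."""
    for word, count in Counter(text.lower().split()).items():
        yield word, (count, 1)


def reduce_by_key(a: Value, b: Value) -> Value:
    """Custom aggregation for the custom tuple: add the values element-wise."""
    return a[0] + b[0], a[1] + b[1]


def aggregate(records: Iterable[str]) -> Dict[str, Value]:
    """Local stand-in for the reduce step that runs outside process_record."""
    totals: Dict[str, Value] = {}
    for text in records:
        for key, value in process_record(text):
            totals[key] = reduce_by_key(totals[key], value) if key in totals else value
    return totals
```

The point of the sketch is that extraction (`process_record`) and aggregation (`reduce_by_key`) are already separate functions: the first only yields tuples, the second only combines them.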

Extract the content extraction logic into a dedicated function.

Extract the content aggregation logic into a dedicated function.

Modify process_record to delegate to the new abstractions

Unfortunately, process_record is the only method which must be defined in examples. There is no default implementation. This shouldn't be changed because it would affect all customized tools built with cc-pyspark. Even for the provided examples it would be a substantial change. For very simple examples (e.g. html_tag_count.py), it would hardly simplify the implementation.

Establish a standardized approach to content extraction

How should this be done? Extraction of the content depends not only on the use case but also on the input. Every type of WARC record can be consumed: WARC, WAT, WET; response, request, metadata, conversion. The payload can be text, JSON, HTML, PDF, etc. There is also an open PR which extends cc-pyspark to any kind of file, not only WARC files, see #45.
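
To make the concern concrete, a generic extraction layer would need something like a dispatch on the payload type. The sketch below is purely hypothetical (the `EXTRACTORS` table, the function names, and the two content types covered are assumptions) and handles only a tiny fraction of the inputs listed above.

```python
import json
from typing import Callable, Dict, Iterator


def extract_from_text(payload: bytes) -> Iterator[str]:
    """Yield whitespace-separated tokens from a plain-text payload."""
    yield from payload.decode("utf-8", errors="replace").split()


def extract_from_json(payload: bytes) -> Iterator[str]:
    """Yield only the top-level keys; assumes the payload is a JSON object."""
    yield from json.loads(payload).keys()


# Every further payload type (HTML, PDF, ...) would need its own entry.
EXTRACTORS: Dict[str, Callable[[bytes], Iterator[str]]] = {
    "text/plain": extract_from_text,
    "application/json": extract_from_json,
}


def extract(content_type: str, payload: bytes) -> Iterator[str]:
    """Dispatch to an extractor; unknown content types yield nothing."""
    extractor = EXTRACTORS.get(content_type)
    if extractor is None:
        return iter(())
    return extractor(payload)
```

Even this small table already encodes use-case decisions (what counts as "content" in JSON, how to tokenize text), which is why a single standardized extractor is hard to define.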

I would strongly argue for keeping the examples as simple as possible, and for keeping the complexity further down the inheritance hierarchy rather than at the top.
