The process_record function currently tightly couples content extraction and aggregation logic.
Isn't the aggregation logic outside of process_record?
The method process_record is a generator which takes a single WARC/WAT/WET record and yields any kind of tuple. <String, Long> is the default tuple type, but implementations of a CCSparkJob may define a custom output tuple. They then also need to implement custom aggregation logic by overriding the reduce_by_key method. See word_count.py.
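To illustrate the point about custom output tuples, here is a minimal sketch in plain Python. It does not use the cc-pyspark API: records are assumed to be plain strings, and process_record, reduce_pair and reduce_by_key are hypothetical stand-ins written only for this example. The generator yields (word, (term_frequency, document_frequency)) tuples, which is why the default summing of single numbers no longer fits and a custom reducer is needed.

```python
import re
from collections import Counter

WORD_PATTERN = re.compile(r"\w+")


def process_record(text):
    """Generator: yield (word, (term_frequency, document_frequency)) for one record.

    Assumption: `text` is the already-decoded payload of a single record.
    """
    words = (w.lower() for w in WORD_PATTERN.findall(text))
    for word, count in Counter(words).items():
        yield word, (count, 1)


def reduce_pair(a, b):
    """Custom aggregation for the custom tuple: add both components."""
    return (a[0] + b[0], a[1] + b[1])


def reduce_by_key(pairs, func):
    """Plain-Python stand-in for a reduce-by-key step (hypothetical helper)."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


records = ["Spark counts words", "words and more words"]
pairs = [pair for text in records for pair in process_record(text)]
totals = reduce_by_key(pairs, reduce_pair)
```

In this sketch the extraction happens entirely inside process_record, while the aggregation lives in the separate reducer function.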
Extract the content extraction logic into a dedicated function.
Extract the content aggregation logic into a dedicated function.
Modify process_record to delegate to the new abstractions
Unfortunately, process_record is the only method which must be defined in examples. There is no default implementation. This shouldn't be changed because it would affect all customized tools built with cc-pyspark. Even for the provided examples it would be a substantial change. For very simple examples (e.g. html_tag_count.py), it would hardly simplify the implementation.
Establish a standardized approach to content extraction
How should this be done? Extraction of the content depends not only on the use case but also on the input. Every type of WARC record can be consumed: WARC, WAT, WET; response, request, metadata, conversion. The payload can be text, JSON, HTML, PDF, etc. There is also an open PR which extends cc-pyspark to any kind of file, not only WARC files, see #45.
I strongly argue for keeping the examples as simple as possible, and for keeping the complexity further down the line rather than at the top of the inheritance hierarchy.
Problem
The process_record function currently tightly couples content extraction and aggregation logic. This makes it difficult to:
Proposed Improvement
Introduce a separate step for content extraction. This abstraction will:
Implementation Suggestions
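Extract the content extraction logic into a dedicated function.
Extract the content aggregation logic into a dedicated function.
Modify process_record to delegate to the new abstractions.
This can be implemented at two potential levels: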
At the CCSparkJob Level:
At Specific Examples:
ExtractLinksJob to showcase the idea as a suggestion.
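The proposed split (a dedicated extraction function, a dedicated aggregation function, and a process_record that delegates to both) could look roughly like the sketch below. It is only an illustration under stated assumptions: SketchJob is a hypothetical stand-in and not the real CCSparkJob, the method names extract_content and aggregate_content are invented for this example, and a record is assumed to be a plain HTML string rather than a WARC record.

```python
import re
from collections import Counter


class SketchJob:
    """Hypothetical stand-in base class; NOT the real CCSparkJob."""

    def extract_content(self, record):
        """Extraction step: return the items found in one record."""
        raise NotImplementedError

    def aggregate_content(self, items):
        """Per-record aggregation step: yield (item, count) tuples."""
        for item, count in Counter(items).items():
            yield item, count

    def process_record(self, record):
        """Delegate to the two steps instead of mixing them."""
        yield from self.aggregate_content(self.extract_content(record))


class TagCountSketch(SketchJob):
    """Hypothetical example job: count opening HTML tags in a record."""

    tag_pattern = re.compile(r"<([a-z][a-z0-9]*)", re.IGNORECASE)

    def extract_content(self, record):
        # Assumption: `record` is a plain HTML string.
        return [tag.lower() for tag in self.tag_pattern.findall(record)]


html = "<p>Hi</p><P>again</P><a href='#'>x</a>"
counts = dict(TagCountSketch().process_record(html))
```

With this layout a job only overrides the extraction step, and process_record stays a generator of tuples, so the cross-record reduce step would not need to change.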