Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[backend/frontend] add configuration to have better chances to get RSS Feed contents (#8736) #9244

Draft
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

aHenryJard
Copy link
Member

@aHenryJard aHenryJard commented Dec 5, 2024

Proposed changes

  • Do not set proxyAgent header when there is no proxy configured

Adding several configuration to get better chance to avoid HTTP 403 from public RSS Feed:

  • ingestion_manager.rss_feed.min_interval_minutes : wait for at least N minutes before next HTTP call for RSS Feed
  • ingestion_manager.rss_feed.user_agent: allow to overide user agent used in header for RSS Feed requests
  • ingestion_manager.csv_feed.min_interval_minutes : wait for at least N minutes before next HTTP call for CSV Feed

Adding some info and warn level log:

  • Info to know if an HTTP Call has been skipped because of min_interval_minutes or queue states
  • warn in case of 403 coming from a cloudflare challenge to know that it's not a credential issue.

image

Related issues

Checklist

  • I consider the submitted work as finished
  • I tested the code for its functionality
  • I wrote test cases for the relevant uses case (coverage and e2e)
  • I added/update the relevant documentation (either on github or on notion)
  • Where necessary I refactored code to improve the overall quality

Further comments

@github-actions github-actions bot added the filigran team use to identify PR from the Filigran team label Dec 5, 2024
@aHenryJard aHenryJard changed the title Investigate on RSS public Feed 403 issue. Investigate on RSS public Feed 403 issue (#8736) Dec 5, 2024
@aHenryJard aHenryJard changed the title Investigate on RSS public Feed 403 issue (#8736) [backend] add configuration to have better chances to get RSS Feed contents (#8736) Dec 11, 2024
@aHenryJard aHenryJard force-pushed the issue/8736-investigation branch from 2c316f3 to 6b967af Compare December 11, 2024 09:26
@aHenryJard aHenryJard changed the title [backend] add configuration to have better chances to get RSS Feed contents (#8736) [backend/frontend] add configuration to have better chances to get RSS Feed contents (#8736) Dec 11, 2024
@aHenryJard aHenryJard marked this pull request as ready for review December 11, 2024 09:28
Copy link

codecov bot commented Dec 11, 2024

Codecov Report

Attention: Patch coverage is 31.11111% with 31 lines in your changes missing coverage. Please review.

Project coverage is 65.10%. Comparing base (b293e2e) to head (4a7b302).

Files with missing lines Patch % Lines
...rm/opencti-graphql/src/manager/ingestionManager.ts 17.14% 29 Missing ⚠️
...-platform/opencti-graphql/src/utils/http-client.ts 66.66% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9244      +/-   ##
==========================================
- Coverage   65.12%   65.10%   -0.03%     
==========================================
  Files         630      630              
  Lines       60083    60117      +34     
  Branches     6694     6703       +9     
==========================================
+ Hits        39127    39137      +10     
- Misses      20956    20980      +24     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

const { messages_number, messages_size } = await queueDetails(connectorIdFromIngestId(ingestion.id));
if (messages_number === 0) {
const { last_execution_date } = ingestion;
const shouldExecuteIngestion = !last_execution_date || sinceNowInMinutes(last_execution_date) > CSV_FEED_MIN_INTERVAL_MINUTES;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why strict > ? so if CSV_FEED_MIN_INTERVAL_MINUTES is 5min, and sinceNowInMinutes(last_execution_date) is 5, it won't execute until the next minute?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and since this code is the same for RSS, I suggest to maybe refactor it to use a method like :

const shouldExexcuteIngestion = (ingestion, min_interval_minutes) => {
   const { last_execution_date } = ingestion;
   return !last_execution_date || sinceNowInMinutes(last_execution_date) > min_interval_minutes;
}

so we could call it like this shouldExecuteIngestion(ingestion, CSV_FEED_MIN_INTERVAL_MINUTES) or shouldExecuteIngestion(ingestion, RSS_FEED_MIN_INTERVAL_MINUTES)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inerval is 30s so it will wait only 30s (well for default values). Anyway I'm adding an <= and ok for shouldExecuteIngestion

className={classes.bodyItem}
style={{ width: dataColumns.last_execution_date.width }}
>
{nsdt(node.last_execution_date)}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for this || '-' check here?

Comment on lines +41 to +46
const defaultHttpAgent: http.Agent | undefined = undefined;
let defaultHttpsAgent: https.Agent | undefined;

if (cert || key || ca) {
defaultHttpsAgent = new https.Agent({ rejectUnauthorized: rejectUnauthorized === true, cert, key, ca });
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure to understand the logic behind the changes. Moreover, defaultHttpAgent is always undefined

Copy link
Member Author

@aHenryJard aHenryJard Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to testagain because it's the more risky change, but the issue was that defaultHttpAgent was set to new Agent() instead of undefined. Then is there is some proxy configuration the agent is correctly initialize in

export const getPlatformHttpProxyAgent = (uri) => {
const platformProxies = getPlatformHttpProxies();
const targetUrl = new URL(uri);
const targetProxy = platformProxies[targetUrl.protocol]; // Select the proxy according to target protocol
if (targetProxy) {
// If proxy found, check if hostname is not excluded
if (targetProxy.isExcluded(targetUrl.hostname)) {
return undefined;
}
// If not generate the agent accordingly
return targetProxy.build();
}
return undefined;
};

As you see there, if no proxy the return is undefined, which means that in the http call there is no empty httpAgent => from my test having empty httpAgent in request is what is causing being refused by cloudflare.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I should just remove defaultHttpAgent in http-client

@aHenryJard aHenryJard force-pushed the issue/8736-investigation branch from 432ffbf to 4a7b302 Compare January 8, 2025 09:42
@aHenryJard aHenryJard marked this pull request as draft January 8, 2025 09:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
filigran team use to identify PR from the Filigran team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[RSS Feed] Error 403 on accessible public feeds
4 participants