# crawlee-one / Exports

## Interfaces
- ApifyEntryMetadata
- ApifyErrorReport
- CrawleeOneActorDef
- CrawleeOneActorInst
- CrawleeOneArgs
- CrawleeOneConfig
- CrawleeOneConfigSchema
- CrawleeOneConfigSchemaCrawler
- CrawleeOneCtx
- CrawleeOneDataset
- CrawleeOneErrorHandlerInput
- CrawleeOneErrorHandlerOptions
- CrawleeOneIO
- CrawleeOneKeyValueStore
- CrawleeOneRequestQueue
- CrawleeOneRoute
- CrawleeOneTelemetry
- DatasetSizeMonitorOptions
- InputActorInput
- ListingFiltersSetupOptions
- ListingLogger
- ListingPageFilter
- ListingPageScraperContext
- ListingPageScraperOptions
- LoggingActorInput
- MetamorphActorInput
- Migration
- OutputActorInput
- PerfActorInput
- PrivacyActorInput
- ProxyActorInput
- PushDataOptions
- PushRequestsOptions
- RequestActorInput
- RequestQueueSizeMonitorOptions
- RunCrawleeOneOptions
- StartUrlsActorInput

## Type Aliases

- AllActorInputs
- ApifyCrawleeOneIO
- ArrVal
- CaptureError
- CaptureErrorInput
- CrawleeOneActorDefWithInput
- CrawleeOneActorRouterCtx
- CrawleeOneHookCtx
- CrawleeOneHookFn
- CrawleeOneRouteCtx
- CrawleeOneRouteHandler
- CrawleeOneRouteMatcher
- CrawleeOneRouteMatcherFn
- CrawleeOneRouteWrapper
- CrawlerConfigActorInput
- CrawlerType
- CrawlerUrl
- ExtractErrorHandlerOptionsReport
- ExtractIOReport
- GenRedactedValue
- LogLevel
- MaybeArray
- MaybeAsyncFn
- MaybePromise
- Metamorph
- OnBatchAddRequests
- OnBatchAddRequestsArgs
- PickPartial
- PickRequired
- PrivacyFilter
- PrivacyMask
- RunCrawler

## Variables

- LOG_LEVEL
- allActorInputValidationFields
- allActorInputs
- apifyIO
- crawlerInput
- crawlerInputValidationFields
- inputInput
- inputInputValidationFields
- logLevelToCrawlee
- loggingInput
- loggingInputValidationFields
- metamorphInput
- metamorphInputValidationFields
- outputInput
- outputInputValidationFields
- perfInput
- perfInputValidationFields
- privacyInput
- privacyInputValidationFields
- proxyInput
- proxyInputValidationFields
- requestInput
- requestInputValidationFields
- startUrlsInput
- startUrlsInputValidationFields

## Functions

- basicCaptureErrorRouteHandler
- captureError
- captureErrorRouteHandler
- captureErrorWrapper
- cheerioCaptureErrorRouteHandler
- crawleeOne
- createErrorHandler
- createHttpCrawlerOptions
- createLocalMigrationState
- createLocalMigrator
- createMockClientDataset
- createMockClientRequestQueue
- createMockDatasetCollectionClient
- createMockKeyValueStoreClient
- createMockRequestQueueClient
- createMockStorageClient
- createMockStorageDataset
- createSentryTelemetry
- datasetSizeMonitor
- generateTypes
- getColumnFromDataset
- getDatasetCount
- httpCaptureErrorRouteHandler
- itemCacheKey
- jsdomCaptureErrorRouteHandler
- loadConfig
- logLevelHandlerWrapper
- playwrightCaptureErrorRouteHandler
- puppeteerCaptureErrorRouteHandler
- pushData
- pushRequests
- registerHandlers
- requestQueueSizeMonitor
- runCrawleeOne
- runCrawlerTest
- scrapeListingEntries
- setupDefaultHandlers
- setupMockApifyActor
- validateConfig
Ƭ **AllActorInputs**: `InputActorInput & CrawlerConfigActorInput & PerfActorInput & StartUrlsActorInput & LoggingActorInput & ProxyActorInput & PrivacyActorInput & RequestActorInput & OutputActorInput & MetamorphActorInput`
Ƭ **ApifyCrawleeOneIO**: `CrawleeOneIO<ApifyEnv, ApifyErrorReport, ApifyEntryMetadata>`

Integration between CrawleeOne and Apify. This is the default integration.

Defined in: src/lib/integrations/apify.ts:39
Ƭ **ArrVal**<`T`>: `T[number]`

Unwraps an Array type to the type of its items.

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `any[]` \| `readonly any[]` |
Ƭ **CaptureError**: (`input`: `CaptureErrorInput`) => `MaybePromise<void>`

▸ (`input`): `MaybePromise<void>`

**Parameters:**

| Name | Type |
| --- | --- |
| `input` | `CaptureErrorInput` |

**Returns:** `MaybePromise<void>`

Defined in: src/lib/error/errorHandler.ts:24
Ƭ **CaptureErrorInput**: `PickRequired<Partial<CrawleeOneErrorHandlerInput>, "error">`

Defined in: src/lib/error/errorHandler.ts:23
Ƭ **CrawleeOneActorDefWithInput**<`T`>: `Omit<CrawleeOneActorDef<T>, "input"> & { input: T["input"] | null; state: Record<string, unknown> }`

A CrawleeOneActorDef object where the input is already resolved.

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `CrawleeOneCtx` |
Ƭ **CrawleeOneActorRouterCtx**<`T`>: `Object`

Context passed from the actor to route handlers.

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `CrawleeOneCtx` |

**Type declaration:**

| Name | Type | Description |
| --- | --- | --- |
| `actor` | `CrawleeOneActorInst<T>` | - |
| `metamorph` | `Metamorph` | Trigger actor metamorph, using the actor's input as defaults. |
| `pushData` | `<T>(oneOrManyItems: T \| T[], options: PushDataOptions<T>) => Promise<any[]>` | `Actor.pushData` with extra optional features: limit the number of entries pushed to the Dataset based on the actor input; transform and filter entries via the actor input; add metadata to entries before they are pushed to the Dataset; set which (nested) properties are personal data and optionally redact them for privacy compliance. |
| `pushRequests` | `<T>(oneOrManyItems: T \| T[], options?: PushRequestsOptions<T>) => Promise<any[]>` | Similar to `Actor.openRequestQueue().addRequests`, but with extra features: limit the max size of the RequestQueue (no requests are added when the RequestQueue is at or above the limit); transform and filter requests (requests that did not pass the filter are not added to the RequestQueue). |
Ƭ **CrawleeOneHookCtx**<`T`>: `Pick<CrawleeOneActorInst<T>, "input" | "state"> & { io: T["io"]; itemCacheKey: typeof itemCacheKey; sendRequest: typeof gotScraping }`

Context passed to user-defined functions supplied via the actor input.

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `CrawleeOneCtx` |
Ƭ **CrawleeOneHookFn**<`TArgs`, `TReturn`, `T`>: (...`args`: `[...TArgs, CrawleeOneHookCtx<T>]`) => `MaybePromise<TReturn>`

**Type parameters:**

| Name | Type |
| --- | --- |
| `TArgs` | extends `any[]` = `[]` |
| `TReturn` | `void` |
| `T` | extends `CrawleeOneCtx` = `CrawleeOneCtx` |

▸ (...`args`): `MaybePromise<TReturn>`

**Parameters:**

| Name | Type |
| --- | --- |
| `...args` | `[...TArgs, CrawleeOneHookCtx<T>]` |

**Returns:** `MaybePromise<TReturn>`
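
For illustration, a minimal sketch of a hook conforming to this type (not part of the generated reference). The hook name and its logic are hypothetical; `sendRequest` and the trailing `CrawleeOneHookCtx` argument follow the definitions above.

```ts
import type { CrawleeOneHookFn } from 'crawlee-one';

// Hypothetical hook: custom args come first, the CrawleeOneHookCtx comes last.
const isUrlAlive: CrawleeOneHookFn<[url: string], boolean> = async (url, ctx) => {
  // ctx.sendRequest is `typeof gotScraping` (see CrawleeOneHookCtx above).
  const res = await ctx.sendRequest({ url, throwHttpErrors: false });
  return res.statusCode === 200;
};
```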
Ƭ **CrawleeOneRouteCtx**<`T`, `RouterCtx`>: `Parameters<Parameters<CrawlerRouter<T["context"] & RouterCtx>["addHandler"]>[1]>[0]`

Context object provided in `CrawlerRouter`.

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `CrawleeOneCtx` |
| `RouterCtx` | extends `Record<string, any>` = `{}` |
Ƭ **CrawleeOneRouteHandler**<`T`, `RouterCtx`>: `Parameters<CrawlerRouter<T["context"] & RouterCtx>["addHandler"]>[1]`

The function that's passed to `router.addHandler(label, handler)`.

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `CrawleeOneCtx` |
| `RouterCtx` | extends `Record<string, any>` = `CrawleeOneRouteCtx<T>` |
Ƭ **CrawleeOneRouteMatcher**<`T`, `RouterCtx`>: `MaybeArray<RegExp | CrawleeOneRouteMatcherFn<T, RouterCtx>>`

A function or RegExp that checks whether the CrawleeOneRoute this Matcher belongs to should handle the given request. If the Matcher returns a truthy value, the request is passed to the `action` function of the same CrawleeOneRoute.

The Matcher can be:

- A regular expression
- A function
- An array of `RegExp | Function`

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `CrawleeOneCtx` |
| `RouterCtx` | extends `Record<string, any>` = `CrawleeOneRouteCtx<T>` |
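
A short sketch of the three Matcher forms (the URLs and the `any` typing are illustrative):

```ts
import type { CrawleeOneRouteMatcher, CrawleeOneRouteMatcherFn } from 'crawlee-one';

// 1. Regular expression
const matchByRegex: CrawleeOneRouteMatcher<any> = /\/job\/\d+/;

// 2. Function: receives the URL, the route context, the route itself, and all routes
const matchByFn: CrawleeOneRouteMatcherFn<any> = (url, ctx, route, routes) =>
  url.includes('/job/');

// 3. Array mixing both; the route matches if any entry matches
const matchByArray: CrawleeOneRouteMatcher<any> = [/\/job\/\d+/, matchByFn];
```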
Ƭ **CrawleeOneRouteMatcherFn**<`T`, `RouterCtx`>: (`url`: `string`, `ctx`: `CrawleeOneRouteCtx<T, RouterCtx>`, `route`: `CrawleeOneRoute<T, RouterCtx>`, `routes`: `Record<T["labels"], CrawleeOneRoute<T, RouterCtx>>`) => `unknown`

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `CrawleeOneCtx` |
| `RouterCtx` | extends `Record<string, any>` = `CrawleeOneRouteCtx<T>` |

▸ (`url`, `ctx`, `route`, `routes`): `unknown`

The function variant of a Matcher. Checks whether the CrawleeOneRoute this Matcher belongs to should handle the given request. If the Matcher returns a truthy value, the request is passed to the `action` function of the same CrawleeOneRoute.

**Parameters:**

| Name | Type |
| --- | --- |
| `url` | `string` |
| `ctx` | `CrawleeOneRouteCtx<T, RouterCtx>` |
| `route` | `CrawleeOneRoute<T, RouterCtx>` |
| `routes` | `Record<T["labels"], CrawleeOneRoute<T, RouterCtx>>` |

**Returns:** `unknown`
Ƭ **CrawleeOneRouteWrapper**<`T`, `RouterCtx`>: (`handler`: (`ctx`: `CrawleeOneRouteCtx<T, RouterCtx>`) => `Promise<void> | Awaitable<void>`) => `MaybePromise<(ctx: CrawleeOneRouteCtx<T, RouterCtx>) => Promise<void> | Awaitable<void>>`

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `CrawleeOneCtx` |
| `RouterCtx` | extends `Record<string, any>` = `CrawleeOneRouteCtx<T>` |

▸ (`handler`): `MaybePromise<(ctx: CrawleeOneRouteCtx<T, RouterCtx>) => Promise<void> | Awaitable<void>>`

Wrapper that modifies the behavior of a CrawleeOneRouteHandler.

**Parameters:**

| Name | Type |
| --- | --- |
| `handler` | `(ctx: CrawleeOneRouteCtx<T, RouterCtx>) => Promise<void> \| Awaitable<void>` |

**Returns:** `MaybePromise<(ctx: CrawleeOneRouteCtx<T, RouterCtx>) => Promise<void> | Awaitable<void>>`
Ƭ **CrawlerConfigActorInput**: `Pick<CheerioCrawlerOptions, "navigationTimeoutSecs" | "ignoreSslErrors" | "additionalMimeTypes" | "suggestResponseEncoding" | "forceResponseEncoding" | "requestHandlerTimeoutSecs" | "maxRequestRetries" | "maxRequestsPerCrawl" | "maxRequestsPerMinute" | "minConcurrency" | "maxConcurrency" | "keepAlive">`

Crawler config fields that can be overridden from the actor input.
Ƭ **CrawlerType**: `ArrVal<typeof CRAWLER_TYPE>`
Ƭ **CrawlerUrl**: `NonNullable<Parameters<OrigRunCrawler<any>>[0]>[0]`

URL string or object passed to `Crawler.run`.
Ƭ **ExtractErrorHandlerOptionsReport**<`T`>: `T extends CrawleeOneErrorHandlerOptions<infer U> ? ExtractIOReport<U> : never`

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `CrawleeOneErrorHandlerOptions<any>` |

Defined in: src/lib/integrations/types.ts:322
Ƭ **ExtractIOReport**<`T`>: `T extends CrawleeOneIO<object, infer U> ? U : never`

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `CrawleeOneIO<object, object>` |

Defined in: src/lib/integrations/types.ts:325
Ƭ **GenRedactedValue**<`V`, `K`, `O`>: (`val`: `V`, `key`: `K`, `obj`: `O`) => `MaybePromise<any>`

**Type parameters:**

| Name |
| --- |
| `V` |
| `K` |
| `O` |

▸ (`val`, `key`, `obj`): `MaybePromise<any>`

Function that generates a "redacted" version of a value. If it returns a Promise, the Promise will be resolved.

**Parameters:**

| Name | Type |
| --- | --- |
| `val` | `V` |
| `key` | `K` |
| `obj` | `O` |

**Returns:** `MaybePromise<any>`
Ƭ **LogLevel**: `ArrVal<typeof LOG_LEVEL>`
Ƭ **MaybeArray**<`T`>: `T | T[]`

A value or an array thereof.

**Type parameters:**

| Name |
| --- |
| `T` |
Ƭ **MaybeAsyncFn**<`R`, `Args`>: `R | ((...args: Args) => MaybePromise<R>)`

A value, or a sync/async function that returns such a value.

**Type parameters:**

| Name | Type |
| --- | --- |
| `R` | `R` |
| `Args` | extends `any[]` |
Ƭ **MaybePromise**<`T`>: `T | Promise<T>`

A value or a Promise thereof.

**Type parameters:**

| Name |
| --- |
| `T` |
Ƭ **Metamorph**: (`overrides?`: `MetamorphActorInput`) => `Promise<void>`

▸ (`overrides?`): `Promise<void>`

Trigger actor metamorph, using the actor's input as defaults.

**Parameters:**

| Name | Type |
| --- | --- |
| `overrides?` | `MetamorphActorInput` |

**Returns:** `Promise<void>`
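
A sketch of calling a `Metamorph` function, e.g. the one exposed on `CrawleeOneActorRouterCtx`. The override field names come from `MetamorphActorInput` (see `metamorphInput` below); the target actor ID and input are placeholders.

```ts
import type { Metamorph } from 'crawlee-one';

declare const metamorph: Metamorph; // e.g. taken from the route handler context

await metamorph({
  metamorphActorId: 'username/another-actor', // placeholder actor ID
  metamorphActorBuild: 'latest',
  metamorphActorInput: { startUrls: ['https://example.com'] }, // placeholder input
});
```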
Ƭ **OnBatchAddRequests**: (...`args`: `OnBatchAddRequestsArgs`) => `MaybePromise<void>`

▸ (...`args`): `MaybePromise<void>`

**Parameters:**

| Name | Type |
| --- | --- |
| `...args` | `OnBatchAddRequestsArgs` |

**Returns:** `MaybePromise<void>`

Defined in: src/lib/test/mockApifyClient.ts:31
Ƭ **OnBatchAddRequestsArgs**: `[requests: Omit<RequestQueueClientRequestSchema, "id">[], options?: RequestQueueClientBatchAddRequestWithRetriesOptions]`

Defined in: src/lib/test/mockApifyClient.ts:27
Ƭ **PickPartial**<`T`, `Keys`>: `Omit<T, Keys> & Partial<Pick<T, Keys>>`

Makes the selected properties optional.

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `object` |
| `Keys` | extends keyof `T` |
Ƭ **PickRequired**<`T`, `Keys`>: `Omit<T, Keys> & Required<Pick<T, Keys>>`

Makes the selected properties required.

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `object` |
| `Keys` | extends keyof `T` |
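
To make the two helpers concrete, a small sketch with a hypothetical `User` type:

```ts
import type { PickPartial, PickRequired } from 'crawlee-one';

interface User {
  id: string;
  name: string;
  email?: string;
}

// `email` becomes required; `id` and `name` are unchanged.
type UserWithEmail = PickRequired<User, 'email'>;

// `id` becomes optional; `name` and `email` are unchanged.
type DraftUser = PickPartial<User, 'id'>;
```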
Ƭ **PrivacyFilter**<`V`, `K`, `O`>: `boolean | ((val: V, key: K, obj: O, options?: { setCustomRedactedValue: (val: V) => any }) => any)`

Determines whether a property is considered private (and hence may be hidden for privacy reasons).

A PrivacyFilter may be either a boolean, or a function that returns a truthy/falsy value. The property is private if the filter is `true`, or if the function returns a truthy value. The function receives the property value, its key, and the parent object.

By default, when a property is redacted, its value is replaced with a string that informs about the redaction. If you want a different text or value to be used instead, supply it to `setCustomRedactedValue`.

If the function returns a Promise, it will be awaited.

**Type parameters:**

| Name |
| --- |
| `V` |
| `K` |
| `O` |
Ƭ **PrivacyMask**<`T`>: `{ [Key in keyof T]?: T[Key] extends Date | any[] ? PrivacyFilter<T[Key], Key, T> : T[Key] extends object ? PrivacyMask<T[Key]> : PrivacyFilter<T[Key], Key, T> }`

A PrivacyMask determines which (potentially nested) properties of an object are considered private.

The PrivacyMask copies the structure of another object, but each non-object property on the PrivacyMask is a PrivacyFilter: a boolean or function that determines whether the property is considered private. The property is private if the function returns a truthy value. If the function returns a Promise, it will be awaited.

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `object` |
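
A sketch of a PrivacyMask over a hypothetical scraped entry type, showing the boolean form, the function form with `setCustomRedactedValue`, and a nested mask:

```ts
import type { PrivacyMask } from 'crawlee-one';

interface Profile {
  name: string;
  email: string;
  stats: { visits: number };
}

const profileMask: PrivacyMask<Profile> = {
  // Boolean form: always private.
  name: true,
  // Function form: private, with a custom replacement value.
  email: (val, key, obj, options) => {
    options?.setCustomRedactedValue('<redacted email>');
    return true;
  },
  // Nested objects take a nested mask.
  stats: {
    visits: false, // not private
  },
};
```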
Ƭ **RunCrawler**<`Ctx`>: (`requests?`: `CrawlerUrl[]`, `options?`: `Parameters<OrigRunCrawler<Ctx>>[1]`) => `ReturnType<OrigRunCrawler<Ctx>>`

**Type parameters:**

| Name | Type |
| --- | --- |
| `Ctx` | extends `CrawlingContext` = `CrawlingContext<BasicCrawler>` |

▸ (`requests?`, `options?`): `ReturnType<OrigRunCrawler<Ctx>>`

Extended type of the `crawler.run()` function.

**Parameters:**

| Name | Type |
| --- | --- |
| `requests?` | `CrawlerUrl[]` |
| `options?` | `Parameters<OrigRunCrawler<Ctx>>[1]` |

**Returns:** `ReturnType<OrigRunCrawler<Ctx>>`
• Const **LOG_LEVEL**: readonly [`"debug"`, `"info"`, `"warn"`, `"error"`, `"off"`]
• Const **allActorInputValidationFields**: `Object`

| Name | Type |
| --- | --- |
| `additionalMimeTypes` | `ArraySchema<any[]>` |
| `errorReportingDatasetId` | `StringSchema<string>` |
| `errorTelemetry` | `BooleanSchema<boolean>` |
| `forceResponseEncoding` | `StringSchema<string>` |
| `ignoreSslErrors` | `BooleanSchema<boolean>` |
| `includePersonalData` | `BooleanSchema<boolean>` |
| `inputExtendFromFunction` | `StringSchema<string>` |
| `inputExtendUrl` | `StringSchema<string>` |
| `keepAlive` | `BooleanSchema<boolean>` |
| `logLevel` | `StringSchema<string>` |
| `maxConcurrency` | `NumberSchema<number>` |
| `maxRequestRetries` | `NumberSchema<number>` |
| `maxRequestsPerCrawl` | `NumberSchema<number>` |
| `maxRequestsPerMinute` | `NumberSchema<number>` |
| `metamorphActorBuild` | `StringSchema<string>` |
| `metamorphActorId` | `StringSchema<string>` |
| `metamorphActorInput` | `ObjectSchema<any>` |
| `minConcurrency` | `NumberSchema<number>` |
| `navigationTimeoutSecs` | `NumberSchema<number>` |
| `outputCacheActionOnResult` | `StringSchema<string>` |
| `outputCachePrimaryKeys` | `ArraySchema<any[]>` |
| `outputCacheStoreId` | `StringSchema<string>` |
| `outputDatasetId` | `StringSchema<string>` |
| `outputFilter` | `StringSchema<string>` |
| `outputFilterAfter` | `StringSchema<string>` |
| `outputFilterBefore` | `StringSchema<string>` |
| `outputMaxEntries` | `NumberSchema<number>` |
| `outputPickFields` | `ArraySchema<any[]>` |
| `outputRenameFields` | `ObjectSchema<any>` |
| `outputTransform` | `StringSchema<string>` |
| `outputTransformAfter` | `StringSchema<string>` |
| `outputTransformBefore` | `StringSchema<string>` |
| `perfBatchSize` | `NumberSchema<number>` |
| `perfBatchWaitSecs` | `NumberSchema<number>` |
| `proxy` | `ObjectSchema<any>` |
| `requestFilter` | `StringSchema<string>` |
| `requestFilterAfter` | `StringSchema<string>` |
| `requestFilterBefore` | `StringSchema<string>` |
| `requestHandlerTimeoutSecs` | `NumberSchema<number>` |
| `requestMaxEntries` | `NumberSchema<number>` |
| `requestQueueId` | `StringSchema<string>` |
| `requestTransform` | `StringSchema<string>` |
| `requestTransformAfter` | `StringSchema<string>` |
| `requestTransformBefore` | `StringSchema<string>` |
| `startUrls` | `ArraySchema<any[]>` |
| `startUrlsFromDataset` | `StringSchema<string>` |
| `startUrlsFromFunction` | `StringSchema<string>` |
| `suggestResponseEncoding` | `StringSchema<string>` |
• Const **allActorInputs**: `Object`

| Name | Type |
| --- | --- |
| `additionalMimeTypes` | `ArrayField<any[]>` |
| `errorReportingDatasetId` | `StringField<string, string>` |
| `errorTelemetry` | `BooleanField<boolean>` |
| `forceResponseEncoding` | `StringField<string, string>` |
| `ignoreSslErrors` | `BooleanField<boolean>` |
| `includePersonalData` | `BooleanField<boolean>` |
| `inputExtendFromFunction` | `StringField<string, string>` |
| `inputExtendUrl` | `StringField<string, string>` |
| `keepAlive` | `BooleanField<boolean>` |
| `logLevel` | `StringField<"error" \| "off" \| "info" \| "debug" \| "warn", string>` |
| `maxConcurrency` | `IntegerField<number, string>` |
| `maxRequestRetries` | `IntegerField<number, string>` |
| `maxRequestsPerCrawl` | `IntegerField<number, string>` |
| `maxRequestsPerMinute` | `IntegerField<number, string>` |
| `metamorphActorBuild` | `StringField<string, string>` |
| `metamorphActorId` | `StringField<string, string>` |
| `metamorphActorInput` | `ObjectField<{ uploadDatasetToGDrive: boolean = true }>` |
| `minConcurrency` | `IntegerField<number, string>` |
| `navigationTimeoutSecs` | `IntegerField<number, string>` |
| `outputCacheActionOnResult` | `StringField<NonNullable<undefined \| null \| "add" \| "remove" \| "overwrite">, string>` |
| `outputCachePrimaryKeys` | `ArrayField<string[]>` |
| `outputCacheStoreId` | `StringField<string, string>` |
| `outputDatasetId` | `StringField<string, string>` |
| `outputFilter` | `StringField<string, string>` |
| `outputFilterAfter` | `StringField<string, string>` |
| `outputFilterBefore` | `StringField<string, string>` |
| `outputMaxEntries` | `IntegerField<number, string>` |
| `outputPickFields` | `ArrayField<string[]>` |
| `outputRenameFields` | `ObjectField<{ oldFieldName: string = 'newFieldName' }>` |
| `outputTransform` | `StringField<string, string>` |
| `outputTransformAfter` | `StringField<string, string>` |
| `outputTransformBefore` | `StringField<string, string>` |
| `perfBatchSize` | `IntegerField<number, string>` |
| `perfBatchWaitSecs` | `IntegerField<number, string>` |
| `proxy` | `ObjectField<object>` |
| `requestFilter` | `StringField<string, string>` |
| `requestFilterAfter` | `StringField<string, string>` |
| `requestFilterBefore` | `StringField<string, string>` |
| `requestHandlerTimeoutSecs` | `IntegerField<number, string>` |
| `requestMaxEntries` | `IntegerField<number, string>` |
| `requestQueueId` | `StringField<string, string>` |
| `requestTransform` | `StringField<string, string>` |
| `requestTransformAfter` | `StringField<string, string>` |
| `requestTransformBefore` | `StringField<string, string>` |
| `startUrls` | `ArrayField<any[]>` |
| `startUrlsFromDataset` | `StringField<string, string>` |
| `startUrlsFromFunction` | `StringField<string, string>` |
| `suggestResponseEncoding` | `StringField<string, string>` |
• Const **apifyIO**: `ApifyCrawleeOneIO`

Integration between CrawleeOne and Apify. This is the default integration.

Defined in: src/lib/integrations/apify.ts:117
• Const **crawlerInput**: `Object`

Common input fields related to crawler setup.

| Name | Type |
| --- | --- |
| `additionalMimeTypes` | `ArrayField<any[]>` |
| `forceResponseEncoding` | `StringField<string, string>` |
| `ignoreSslErrors` | `BooleanField<boolean>` |
| `keepAlive` | `BooleanField<boolean>` |
| `maxConcurrency` | `IntegerField<number, string>` |
| `maxRequestRetries` | `IntegerField<number, string>` |
| `maxRequestsPerCrawl` | `IntegerField<number, string>` |
| `maxRequestsPerMinute` | `IntegerField<number, string>` |
| `minConcurrency` | `IntegerField<number, string>` |
| `navigationTimeoutSecs` | `IntegerField<number, string>` |
| `requestHandlerTimeoutSecs` | `IntegerField<number, string>` |
| `suggestResponseEncoding` | `StringField<string, string>` |
• Const **crawlerInputValidationFields**: `Object`

| Name | Type |
| --- | --- |
| `additionalMimeTypes` | `ArraySchema<any[]>` |
| `forceResponseEncoding` | `StringSchema<string>` |
| `ignoreSslErrors` | `BooleanSchema<boolean>` |
| `keepAlive` | `BooleanSchema<boolean>` |
| `maxConcurrency` | `NumberSchema<number>` |
| `maxRequestRetries` | `NumberSchema<number>` |
| `maxRequestsPerCrawl` | `NumberSchema<number>` |
| `maxRequestsPerMinute` | `NumberSchema<number>` |
| `minConcurrency` | `NumberSchema<number>` |
| `navigationTimeoutSecs` | `NumberSchema<number>` |
| `requestHandlerTimeoutSecs` | `NumberSchema<number>` |
| `suggestResponseEncoding` | `StringSchema<string>` |
• Const **inputInput**: `Object`

Common input fields related to the actor input.

| Name | Type |
| --- | --- |
| `inputExtendFromFunction` | `StringField<string, string>` |
| `inputExtendUrl` | `StringField<string, string>` |
• Const **inputInputValidationFields**: `Object`

| Name | Type |
| --- | --- |
| `inputExtendFromFunction` | `StringSchema<string>` |
| `inputExtendUrl` | `StringSchema<string>` |
• Const **logLevelToCrawlee**: `Record<LogLevel, CrawleeLogLevel>`

Maps log levels of `crawlee-one` to log levels of `crawlee`.
• Const **loggingInput**: `Object`

Common input fields related to logging setup.

| Name | Type |
| --- | --- |
| `errorReportingDatasetId` | `StringField<string, string>` |
| `errorTelemetry` | `BooleanField<boolean>` |
| `logLevel` | `StringField<"error" \| "off" \| "info" \| "debug" \| "warn", string>` |
• Const **loggingInputValidationFields**: `Object`

| Name | Type |
| --- | --- |
| `errorReportingDatasetId` | `StringSchema<string>` |
| `errorTelemetry` | `BooleanSchema<boolean>` |
| `logLevel` | `StringSchema<string>` |
• Const **metamorphInput**: `Object`

Common input fields related to actor metamorphing.

| Name | Type |
| --- | --- |
| `metamorphActorBuild` | `StringField<string, string>` |
| `metamorphActorId` | `StringField<string, string>` |
| `metamorphActorInput` | `ObjectField<{ uploadDatasetToGDrive: boolean = true }>` |
• Const **metamorphInputValidationFields**: `Object`

| Name | Type |
| --- | --- |
| `metamorphActorBuild` | `StringSchema<string>` |
| `metamorphActorId` | `StringSchema<string>` |
| `metamorphActorInput` | `ObjectSchema<any>` |
• Const **outputInput**: `Object`

Common input fields related to actor output.

| Name | Type |
| --- | --- |
| `outputCacheActionOnResult` | `StringField<NonNullable<undefined \| null \| "add" \| "remove" \| "overwrite">, string>` |
| `outputCachePrimaryKeys` | `ArrayField<string[]>` |
| `outputCacheStoreId` | `StringField<string, string>` |
| `outputDatasetId` | `StringField<string, string>` |
| `outputFilter` | `StringField<string, string>` |
| `outputFilterAfter` | `StringField<string, string>` |
| `outputFilterBefore` | `StringField<string, string>` |
| `outputMaxEntries` | `IntegerField<number, string>` |
| `outputPickFields` | `ArrayField<string[]>` |
| `outputRenameFields` | `ObjectField<{ oldFieldName: string = 'newFieldName' }>` |
| `outputTransform` | `StringField<string, string>` |
| `outputTransformAfter` | `StringField<string, string>` |
| `outputTransformBefore` | `StringField<string, string>` |
• Const **outputInputValidationFields**: `Object`

| Name | Type |
| --- | --- |
| `outputCacheActionOnResult` | `StringSchema<string>` |
| `outputCachePrimaryKeys` | `ArraySchema<any[]>` |
| `outputCacheStoreId` | `StringSchema<string>` |
| `outputDatasetId` | `StringSchema<string>` |
| `outputFilter` | `StringSchema<string>` |
| `outputFilterAfter` | `StringSchema<string>` |
| `outputFilterBefore` | `StringSchema<string>` |
| `outputMaxEntries` | `NumberSchema<number>` |
| `outputPickFields` | `ArraySchema<any[]>` |
| `outputRenameFields` | `ObjectSchema<any>` |
| `outputTransform` | `StringSchema<string>` |
| `outputTransformAfter` | `StringSchema<string>` |
| `outputTransformBefore` | `StringSchema<string>` |
• Const **perfInput**: `Object`

Common input fields related to performance that are not part of the crawler config.

| Name | Type |
| --- | --- |
| `perfBatchSize` | `IntegerField<number, string>` |
| `perfBatchWaitSecs` | `IntegerField<number, string>` |
• Const **perfInputValidationFields**: `Object`

| Name | Type |
| --- | --- |
| `perfBatchSize` | `NumberSchema<number>` |
| `perfBatchWaitSecs` | `NumberSchema<number>` |
• Const **privacyInput**: `Object`

Common input fields related to privacy setup.

| Name | Type |
| --- | --- |
| `includePersonalData` | `BooleanField<boolean>` |
• Const **privacyInputValidationFields**: `Object`

| Name | Type |
| --- | --- |
| `includePersonalData` | `BooleanSchema<boolean>` |
• Const **proxyInput**: `Object`

Common input fields related to proxy setup.

| Name | Type |
| --- | --- |
| `proxy` | `ObjectField<object>` |
• Const **proxyInputValidationFields**: `Object`

| Name | Type |
| --- | --- |
| `proxy` | `ObjectSchema<any>` |
• Const **requestInput**: `Object`

Common input fields related to actor requests.

| Name | Type |
| --- | --- |
| `requestFilter` | `StringField<string, string>` |
| `requestFilterAfter` | `StringField<string, string>` |
| `requestFilterBefore` | `StringField<string, string>` |
| `requestMaxEntries` | `IntegerField<number, string>` |
| `requestQueueId` | `StringField<string, string>` |
| `requestTransform` | `StringField<string, string>` |
| `requestTransformAfter` | `StringField<string, string>` |
| `requestTransformBefore` | `StringField<string, string>` |
• Const **requestInputValidationFields**: `Object`

| Name | Type |
| --- | --- |
| `requestFilter` | `StringSchema<string>` |
| `requestFilterAfter` | `StringSchema<string>` |
| `requestFilterBefore` | `StringSchema<string>` |
| `requestMaxEntries` | `NumberSchema<number>` |
| `requestQueueId` | `StringSchema<string>` |
| `requestTransform` | `StringSchema<string>` |
| `requestTransformAfter` | `StringSchema<string>` |
| `requestTransformBefore` | `StringSchema<string>` |
• Const **startUrlsInput**: `Object`

Common input fields for defining the URLs to scrape.

| Name | Type |
| --- | --- |
| `startUrls` | `ArrayField<any[]>` |
| `startUrlsFromDataset` | `StringField<string, string>` |
| `startUrlsFromFunction` | `StringField<string, string>` |
• Const **startUrlsInputValidationFields**: `Object`

| Name | Type |
| --- | --- |
| `startUrls` | `ArraySchema<any[]>` |
| `startUrlsFromDataset` | `StringSchema<string>` |
| `startUrlsFromFunction` | `StringSchema<string>` |
▸ **basicCaptureErrorRouteHandler**<`T`>(...`args`): `CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>`

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `CrawleeOneCtx<BasicCrawlingContext<Dictionary>, string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T>` |

**Parameters:**

| Name | Type |
| --- | --- |
| `...args` | `[handler: Function, options: CrawleeOneErrorHandlerOptions<T["io"]>]` |

**Returns:** `CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>`

Defined in: src/lib/error/errorHandler.ts:133
▸ **captureError**<`TIO`>(`input`, `options`): `Promise<never>`

Error handling for CrawleeOne crawlers.

By default, error reports are saved to an Apify Dataset. See https://docs.apify.com/academy/node-js/analyzing-pages-and-fixing-errors#error-reporting

**Type parameters:**

| Name | Type |
| --- | --- |
| `TIO` | extends `CrawleeOneIO<object, object, object, TIO>` = `CrawleeOneIO<object, object, object>` |

**Parameters:**

| Name | Type |
| --- | --- |
| `input` | `CaptureErrorInput` |
| `options` | `CrawleeOneErrorHandlerOptions<TIO>` |

**Returns:** `Promise<never>`

Defined in: src/lib/error/errorHandler.ts:33
▸ **captureErrorRouteHandler**<`T`>(`handler`, `options`): `CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>`

Drop-in replacement for a regular request handler callback of a Crawlee route that automatically tracks errors. By default, error reports are saved to an Apify Dataset.

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `CrawleeOneCtx<CrawlingContext<BasicCrawler<BasicCrawlingContext<Dictionary>> \| PuppeteerCrawler \| PlaywrightCrawler \| JSDOMCrawler \| CheerioCrawler \| HttpCrawler<InternalHttpCrawlingContext<any, any, HttpCrawler<any>>>, Dictionary>, string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T>` |

**Parameters:**

| Name | Type |
| --- | --- |
| `handler` | `(ctx: Omit<T["context"] & {}, "request"> & { request: Request<Dictionary> } & { captureError: CaptureError }) => MaybePromise<void>` |
| `options` | `CrawleeOneErrorHandlerOptions<T["io"]>` |

**Returns:** `CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>`

**Example**

```ts
router.addDefaultHandler(
  captureErrorRouteHandler(async (ctx) => {
    const { page, crawler } = ctx;
    const url = page.url();
    // ...
  })
);
```

Defined in: src/lib/error/errorHandler.ts:110
▸ **captureErrorWrapper**<`TIO`>(`fn`, `options`): `Promise<void>`

Error handling for crawlers, as a function wrapper. By default, error reports are saved to an Apify Dataset.

**Type parameters:**

| Name | Type |
| --- | --- |
| `TIO` | extends `CrawleeOneIO<object, object, object, TIO>` = `CrawleeOneIO<object, object, object>` |

**Parameters:**

| Name | Type |
| --- | --- |
| `fn` | `(input: { captureError: CaptureError }) => MaybePromise<void>` |
| `options` | `CrawleeOneErrorHandlerOptions<TIO>` |

**Returns:** `Promise<void>`

Defined in: src/lib/error/errorHandler.ts:77
▸ **cheerioCaptureErrorRouteHandler**<`T`>(...`args`): `CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>`

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `CrawleeOneCtx<CheerioCrawlingContext<any, any>, string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T>` |

**Parameters:**

| Name | Type |
| --- | --- |
| `...args` | `[handler: Function, options: CrawleeOneErrorHandlerOptions<T["io"]>]` |

**Returns:** `CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>`

Defined in: src/lib/error/errorHandler.ts:136
▸ **crawleeOne**<`TType`, `T`>(`args`): `Promise<void>`

**Type parameters:**

| Name | Type |
| --- | --- |
| `TType` | extends `"basic" \| "http" \| "cheerio" \| "jsdom" \| "playwright" \| "puppeteer"` |
| `T` | extends `CrawleeOneCtx<CrawlerMeta<TType>["context"], string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T>` = `CrawleeOneCtx<CrawlerMeta<TType>["context"], string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>>` |

**Parameters:**

| Name | Type |
| --- | --- |
| `args` | `CrawleeOneArgs<TType, T>` |

**Returns:** `Promise<void>`
▸ **createErrorHandler**<`T`>(`options`): `ErrorHandler<T["context"]>`

Creates an `ErrorHandler` function that can be assigned to the `failedRequestHandler` option of `BasicCrawlerOptions`.

The function saves the error to a Dataset, and optionally forwards it to Sentry. By default, error reports are saved to an Apify Dataset.

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `CrawleeOneCtx<CrawlingContext<BasicCrawler<BasicCrawlingContext<Dictionary>> \| PuppeteerCrawler \| PlaywrightCrawler \| JSDOMCrawler \| CheerioCrawler \| HttpCrawler<InternalHttpCrawlingContext<any, any, HttpCrawler<any>>>, Dictionary>, string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T>` |

**Parameters:**

| Name | Type |
| --- | --- |
| `options` | `CrawleeOneErrorHandlerOptions<T["io"]> & { onSendErrorToTelemetry?: T["telemetry"]["onSendErrorToTelemetry"]; sendToTelemetry?: boolean }` |

**Returns:** `ErrorHandler<T["context"]>`

Defined in: src/lib/error/errorHandler.ts:148
▸ **createHttpCrawlerOptions**<`T`, `TOpts`>(`«destructured»`): `Partial<TOpts> & Dictionary<TOpts["requestHandler"] | TOpts["handleRequestFunction"] | TOpts["requestList"] | TOpts["requestQueue"] | TOpts["requestHandlerTimeoutSecs"] | TOpts["handleRequestTimeoutSecs"] | TOpts["errorHandler"] | TOpts["failedRequestHandler"] | TOpts["handleFailedRequestFunction"] | TOpts["maxRequestRetries"] | TOpts["maxRequestsPerCrawl"] | TOpts["autoscaledPoolOptions"] | TOpts["minConcurrency"] | TOpts["maxConcurrency"] | TOpts["maxRequestsPerMinute"] | TOpts["keepAlive"] | TOpts["useSessionPool"] | TOpts["sessionPoolOptions"] | TOpts["loggingInterval"] | TOpts["log"]>`

Given the actor input, creates the common crawler options.

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `CrawleeOneCtx<CrawlingContext<BasicCrawler<BasicCrawlingContext<Dictionary>> \| PuppeteerCrawler \| PlaywrightCrawler \| JSDOMCrawler \| CheerioCrawler \| HttpCrawler<InternalHttpCrawlingContext<any, any, HttpCrawler<any>>>, Dictionary>, string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T>` |
| `TOpts` | extends `BasicCrawlerOptions<T["context"], TOpts>` |

**Parameters:**

| Name | Type | Description |
| --- | --- | --- |
| `«destructured»` | `Object` | - |
| `› defaults?` | `TOpts` | Default config options set by us. These may be overridden by values from the actor input (set by the user). |
| `› input` | `null \| T["input"]` | Actor input. |
| `› overrides?` | `TOpts` | These config options will overwrite both the default and user options. This is useful for hard-setting values, e.g. in tests. |

**Returns:** `Partial<TOpts> & Dictionary<TOpts["requestHandler"] | TOpts["handleRequestFunction"] | TOpts["requestList"] | TOpts["requestQueue"] | TOpts["requestHandlerTimeoutSecs"] | TOpts["handleRequestTimeoutSecs"] | TOpts["errorHandler"] | TOpts["failedRequestHandler"] | TOpts["handleFailedRequestFunction"] | TOpts["maxRequestRetries"] | TOpts["maxRequestsPerCrawl"] | TOpts["autoscaledPoolOptions"] | TOpts["minConcurrency"] | TOpts["maxConcurrency"] | TOpts["maxRequestsPerMinute"] | TOpts["keepAlive"] | TOpts["useSessionPool"] | TOpts["sessionPoolOptions"] | TOpts["loggingInterval"] | TOpts["log"]>`
▸ **createLocalMigrationState**(`«destructured»`): `Object`

**Parameters:**

| Name | Type |
| --- | --- |
| `«destructured»` | `Object` |
| `› stateDir` | `string` |

**Returns:** `Object`

| Name | Type |
| --- | --- |
| `loadState` | `(migrationFilename: string) => Promise<Actor>` |
| `saveState` | `(migrationFilename: string, actor: ActorClient) => Promise<void>` |

Defined in: src/lib/migrate/localState.ts:5
▸ **createLocalMigrator**(`«destructured»`): `Object`

**Parameters:**

| Name | Type | Description |
| --- | --- | --- |
| `«destructured»` | `Object` | - |
| `› delimeter` | `string` | Delimiter between the version and the rest of the file name. |
| `› extension` | `string` | Extension glob. |
| `› migrationsDir` | `string` | - |

**Returns:** `Object`

| Name | Type |
| --- | --- |
| `migrate` | `(version: string) => Promise<void>` |
| `unmigrate` | `(version: string) => Promise<void>` |

Defined in: src/lib/migrate/localMigrator.ts:8
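
A sketch under assumed values; the directory, delimiter, and extension glob are illustrative:

```ts
import { createLocalMigrator } from 'crawlee-one';

const { migrate, unmigrate } = createLocalMigrator({
  migrationsDir: './migrations', // illustrative path
  delimeter: '_', // e.g. for files like "v1_add-field.js" (assumed naming)
  extension: '.js', // extension glob (assumed value)
});

// Apply, then roll back, the migration for version "v1" (illustrative).
await migrate('v1');
await unmigrate('v1');
```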
▸ **createMockClientDataset**(`overrides?`): `Dataset`

**Parameters:**

| Name | Type |
| --- | --- |
| `overrides?` | `Dataset` |

**Returns:** `Dataset`

Defined in: src/lib/test/mockApifyClient.ts:33

▸ **createMockClientRequestQueue**(`overrides?`): `RequestQueue`

**Parameters:**

| Name | Type |
| --- | --- |
| `overrides?` | `RequestQueue` |

**Returns:** `RequestQueue`

Defined in: src/lib/test/mockApifyClient.ts:50

▸ **createMockDatasetCollectionClient**(`«destructured»?`): `DatasetCollectionClient`

**Parameters:**

| Name | Type |
| --- | --- |
| `«destructured»?` | `Object` |
| `› log?` | `(args: any) => void` |

**Returns:** `DatasetCollectionClient`

Defined in: src/lib/test/mockApifyClient.ts:195

▸ **createMockKeyValueStoreClient**(`«destructured»?`): `KeyValueStoreClient`

**Parameters:**

| Name | Type |
| --- | --- |
| `«destructured»?` | `Object` |
| `› log?` | `(args: any) => void` |

**Returns:** `KeyValueStoreClient`

Defined in: src/lib/test/mockApifyClient.ts:71

▸ **createMockRequestQueueClient**(`«destructured»?`): `RequestQueueClient`

**Parameters:**

| Name | Type |
| --- | --- |
| `«destructured»?` | `Object` |
| `› log?` | `(args: any) => void` |
| `› onBatchAddRequests?` | `OnBatchAddRequests` |

**Returns:** `RequestQueueClient`

Defined in: src/lib/test/mockApifyClient.ts:98

▸ **createMockStorageClient**(`«destructured»?`): `StorageClient`

**Parameters:**

| Name | Type |
| --- | --- |
| `«destructured»?` | `Object` |
| `› log?` | `(args: any) => void` |
| `› onBatchAddRequests?` | `OnBatchAddRequests` |

**Returns:** `StorageClient`

Defined in: src/lib/test/mockApifyClient.ts:227

▸ **createMockStorageDataset**(...`args`): `Promise<Dataset<any>>`

**Parameters:**

| Name | Type |
| --- | --- |
| `...args` | `[datasetId?: null \| string, options?: OpenStorageOptions, custom?: Object]` |

**Returns:** `Promise<Dataset<any>>`

Defined in: src/lib/test/mockApifyClient.ts:252
▸ **createSentryTelemetry**<`T`>(`sentryOptions?`): `T`

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `CrawleeOneTelemetry<CrawleeOneCtx<CrawlingContext<BasicCrawler<BasicCrawlingContext<Dictionary>> \| PuppeteerCrawler \| PlaywrightCrawler \| JSDOMCrawler \| CheerioCrawler \| HttpCrawler<InternalHttpCrawlingContext<any, any, HttpCrawler<any>>>, Dictionary>, string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>>, CrawleeOneErrorHandlerOptions<CrawleeOneIO<object, object, object>>, T>` |

**Parameters:**

| Name | Type |
| --- | --- |
| `sentryOptions?` | `NodeOptions` |

**Returns:** `T`

Defined in: src/lib/telemetry/sentry.ts:24
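
A minimal sketch; `dsn` is a standard Sentry `NodeOptions` field and the value is a placeholder. The result would typically be passed to the actor's telemetry option:

```ts
import { createSentryTelemetry } from 'crawlee-one';

const telemetry = createSentryTelemetry({
  dsn: 'https://examplePublicKey@o0.ingest.sentry.io/0', // placeholder DSN
});
```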
▸ **datasetSizeMonitor**(`maxSize`, `options?`): `Object`

Semi-automatic monitoring of Dataset size. Used for limiting the total number of entries scraped per run / Dataset:

- When the Dataset reaches `maxSize`, all remaining Requests in the RequestQueue are removed.
- Pass an array of items to `shortenToSize` to shorten the array to a size that still fits into the Dataset.

By default this uses the Apify Dataset.

**Parameters:**

| Name | Type |
| --- | --- |
| `maxSize` | `number` |
| `options?` | `DatasetSizeMonitorOptions` |

**Returns:** `Object`

| Name | Type |
| --- | --- |
| `isFull` | `() => Promise<boolean>` |
| `isStale` | `() => boolean` |
| `onValue` | `(callback: ValueCallback<number>) => () => void` |
| `refresh` | `() => Promise<number>` |
| `shortenToSize` | `<T>(arr: T[]) => Promise<T[]>` |
| `value` | `() => null \| number \| Promise<number>` |
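
A sketch of capping a run at 1000 Dataset entries (the limit and the data are illustrative), using the `isFull` and `shortenToSize` helpers from the returned object:

```ts
import { datasetSizeMonitor } from 'crawlee-one';

const monitor = datasetSizeMonitor(1000);

// Skip further work once the Dataset is at or above the limit.
if (await monitor.isFull()) {
  // e.g. stop enqueuing requests
}

// Trim a batch so that pushing it stays within the limit.
declare const scrapedEntries: Record<string, unknown>[]; // hypothetical data
const toPush = await monitor.shortenToSize(scrapedEntries);
```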
▸ **generateTypes**(`outfile`, `configOrPath?`): `Promise<void>`

Generates types for CrawleeOne given a config.

The config can be passed directly, or as the path to the config file. If the config is omitted, it is automatically searched for using CosmicConfig.

**Parameters:**

| Name | Type |
| --- | --- |
| `outfile` | `string` |
| `configOrPath?` | `string \| CrawleeOneConfig` |

**Returns:** `Promise<void>`

Defined in: src/cli/commands/codegen.ts:251
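
A minimal sketch; the output path is illustrative, and omitting the second argument means the config is searched for via CosmicConfig:

```ts
import { generateTypes } from 'crawlee-one';

await generateTypes('./src/__generated__/crawleeOne.d.ts');
```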
▸ **getColumnFromDataset**<`T`>(`datasetId`, `field`, `options?`): `Promise<T[]>`

Given a Dataset ID and the name of a field, get the columnar data.

By default this uses the Apify Dataset.

**Example**

```ts
// Given a dataset
// [
//   { id: 1, field: 'abc' },
//   { id: 2, field: 'def' }
// ]
const results = await getColumnFromDataset('datasetId123', 'field');
console.log(results);
// ['abc', 'def']
```

**Type parameters:**

| Name |
| --- |
| `T` |

**Parameters:**

| Name | Type |
| --- | --- |
| `datasetId` | `string` |
| `field` | `string` |
| `options?` | `Object` |
| `options.dataOptions?` | `Pick<DatasetDataOptions, "offset" \| "desc" \| "limit">` |
| `options.io?` | `CrawleeOneIO<object, object, object>` |

**Returns:** `Promise<T[]>`
▸ **getDatasetCount**(`datasetNameOrId?`, `options?`): `Promise<null | number>`

Given a Dataset ID or name, get the number of entries already in the Dataset.

By default this uses the Apify Dataset.

**Parameters:**

| Name | Type |
| --- | --- |
| `datasetNameOrId?` | `string` |
| `options?` | `Object` |
| `options.io?` | `CrawleeOneIO<object, object, object>` |
| `options.log?` | `Log` |

**Returns:** `Promise<null \| number>`
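
A small sketch; the Dataset ID and the threshold are placeholders, and per the signature the result may be `null`:

```ts
import { getDatasetCount } from 'crawlee-one';

const count = await getDatasetCount('datasetId123');
if (count !== null && count >= 1000) {
  // e.g. the Dataset is large enough, stop scraping
}
```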
▸ **httpCaptureErrorRouteHandler**<`T`>(...`args`): `CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>`

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `CrawleeOneCtx<HttpCrawlingContext<any, any>, string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T>` |

**Parameters:**

| Name | Type |
| --- | --- |
| `...args` | `[handler: Function, options: CrawleeOneErrorHandlerOptions<T["io"]>]` |

**Returns:** `CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>`

Defined in: src/lib/error/errorHandler.ts:134
▸ **itemCacheKey**(`item`, `primaryKeys?`): `string`

Serializes a dataset item to a fixed-length hash.

NOTE: Apify (around which this lib is designed) allows a key-value store key to be at most 256 characters long. See https://docs.apify.com/sdk/js/reference/class/KeyValueStore#setValue

**Parameters:**

| Name | Type |
| --- | --- |
| `item` | `any` |
| `primaryKeys?` | `string[]` |

**Returns:** `string`
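
A sketch; the entry shape is hypothetical. Per the cache feature described under `pushData`, the hash is generated from the fields named in `primaryKeys`, so re-scrapes of the same `id` map to the same key:

```ts
import { itemCacheKey } from 'crawlee-one';

const entry = { id: 123, name: 'Alice', scrapedAt: '2024-01-01' };
const key = itemCacheKey(entry, ['id']); // fixed-length hash string
```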
▸ **jsdomCaptureErrorRouteHandler**<`T`>(...`args`): `CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>`

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `CrawleeOneCtx<JSDOMCrawlingContext<any, any>, string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T>` |

**Parameters:**

| Name | Type |
| --- | --- |
| `...args` | `[handler: Function, options: CrawleeOneErrorHandlerOptions<T["io"]>]` |

**Returns:** `CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>`

Defined in: src/lib/error/errorHandler.ts:135
▸ **loadConfig**(`configFilePath?`): `Promise<null | CrawleeOneConfig>`

Loads the CrawleeOne config file. The config is searched for using CosmicConfig. Optionally, you can supply the path to the config file.

Learn more: https://github.com/cosmiconfig/cosmiconfig

**Parameters:**

| Name | Type |
| --- | --- |
| `configFilePath?` | `string` |

**Returns:** `Promise<null \| CrawleeOneConfig>`
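
A sketch combining `loadConfig` with `validateConfig` (documented below); how validation failures are reported is not specified here, so treat that as an assumption:

```ts
import { loadConfig, validateConfig } from 'crawlee-one';

// Search for the config with CosmicConfig (or pass an explicit path).
const config = await loadConfig();
if (config) {
  validateConfig(config);
}
```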
▸ **logLevelHandlerWrapper**<`T`, `RouterCtx`>(`logLevel`): `CrawleeOneRouteWrapper<T, RouterCtx>`

Wrapper for a Crawlee route handler that configures the log level.

Usage with Crawlee's `RouterHandler.addDefaultHandler`:

```ts
const wrappedHandler = logLevelHandlerWrapper('debug')(handler);
await router.addDefaultHandler<Ctx>(wrappedHandler);
```

Usage with Crawlee's `RouterHandler.addHandler`:

```ts
const wrappedHandler = logLevelHandlerWrapper('error')(handler);
await router.addHandler<Ctx>(wrappedHandler);
```

Usage with `createCrawleeOne`:

```ts
const actor = await createCrawleeOne<CheerioCrawlingContext>({
  validateInput,
  router: createCheerioRouter(),
  routes,
  routeHandlers: ({ input }) => createHandlers(input!),
  routeHandlerWrappers: ({ input }) => [
    logLevelHandlerWrapper<CheerioCrawlingContext<any, any>>(input?.logLevel ?? 'info'),
  ],
  createCrawler: ({ router, input }) => createCrawler({ router, input, crawlerConfig }),
});
```

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `CrawleeOneCtx<CrawlingContext<BasicCrawler<BasicCrawlingContext<Dictionary>> \| PuppeteerCrawler \| PlaywrightCrawler \| JSDOMCrawler \| CheerioCrawler \| HttpCrawler<InternalHttpCrawlingContext<any, any, HttpCrawler<any>>>, Dictionary>, string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T>` |
| `RouterCtx` | extends `Record<string, any>` = `CrawleeOneRouteCtx<T>` |

**Parameters:**

| Name | Type |
| --- | --- |
| `logLevel` | `"error" \| "off" \| "info" \| "debug" \| "warn"` |

**Returns:** `CrawleeOneRouteWrapper<T, RouterCtx>`
▸ **playwrightCaptureErrorRouteHandler**<`T`>(...`args`): `CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>`

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `CrawleeOneCtx<PlaywrightCrawlingContext<Dictionary>, string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T>` |

**Parameters:**

| Name | Type |
| --- | --- |
| `...args` | `[handler: Function, options: CrawleeOneErrorHandlerOptions<T["io"]>]` |

**Returns:** `CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>`

Defined in: src/lib/error/errorHandler.ts:137
▸ **puppeteerCaptureErrorRouteHandler**<`T`>(...`args`): `CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>`

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `CrawleeOneCtx<PuppeteerCrawlingContext<Dictionary>, string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T>` |

**Parameters:**

| Name | Type |
| --- | --- |
| `...args` | `[handler: Function, options: CrawleeOneErrorHandlerOptions<T["io"]>]` |

**Returns:** `CrawleeOneRouteHandler<T, CrawleeOneRouteCtx<T>>`

Defined in: src/lib/error/errorHandler.ts:138
▸ **pushData**<`Ctx`, `T`>(`ctx`, `oneOrManyItems`, `options`): `Promise<unknown[]>`

Apify's `Actor.pushData` with extra features:

- Data can be sent elsewhere, not just to Apify. This is set by the `io` options. By default, data is sent using Apify (cloud/local).
- Limit the max size of the Dataset. No entries are added when the Dataset is at or above the limit.
- Redact "private" fields.
- Add metadata to entries before they are pushed to the Dataset.
- Select and rename (nested) properties.
- Transform and filter entries. Entries that did not pass the filter are not added to the Dataset.
- Add/remove entries to/from the KeyValueStore. Entries are saved to the store under a hash generated from the entry fields set by `cachePrimaryKeys`.

**Type parameters:**

| Name | Type |
| --- | --- |
| `Ctx` | extends `CrawlingContext<unknown, Dictionary, Ctx>` |
| `T` | extends `Record<any, any>` = `Record<any, any>` |

**Parameters:**

| Name | Type |
| --- | --- |
| `ctx` | `Ctx` |
| `oneOrManyItems` | `T \| T[]` |
| `options` | `PushDataOptions<T>` |

**Returns:** `Promise<unknown[]>`
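
A sketch from inside a route handler. `cachePrimaryKeys` is the only option named in the description above; the exact `PushDataOptions` shape is otherwise an assumption.

```ts
import { pushData } from 'crawlee-one';
import type { CrawlingContext } from 'crawlee';

declare const ctx: CrawlingContext; // stands in for the route handler context

await pushData(ctx, [{ id: 1, name: 'Alice' }], {
  // Assumed option, per the cache feature described above: entries are keyed
  // in the KeyValueStore by a hash of these fields.
  cachePrimaryKeys: ['id'],
});
```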
▸ **pushRequests**<`T`>(`oneOrManyItems`, `options?`): `Promise<unknown[]>`

Similar to `Actor.openRequestQueue().addRequests`, but with extra features:

- Limit the max size of the RequestQueue. No requests are added when the RequestQueue is at or above the limit.
- Transform and filter requests. Requests that did not pass the filter are not added to the RequestQueue.

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `RequestOptions<Dictionary>` \| `Request<Dictionary>` |

**Parameters:**

| Name | Type |
| --- | --- |
| `oneOrManyItems` | `T \| T[]` |
| `options?` | `PushRequestsOptions<T>` |

**Returns:** `Promise<unknown[]>`
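
A minimal sketch; the URLs are placeholders, and the optional `PushRequestsOptions` (limit/transform/filter, per the list above) is omitted:

```ts
import { pushRequests } from 'crawlee-one';

await pushRequests([
  { url: 'https://example.com/page/1' },
  { url: 'https://example.com/page/2' },
]);
```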
▸ **registerHandlers**<`T`, `RouterCtx`>(`router`, `routes`, `options?`): `Promise<void>`

Registers many handlers at once onto Crawlee's RouterHandler. The labels under which the handlers are registered are the respective object keys.

**Example**

```ts
registerHandlers(router, { labelA: fn1, labelB: fn2 });
```

Which is similar to:

```ts
router.addHandler(labelA, fn1);
router.addHandler(labelB, fn2);
```

You can also specify a list of wrappers to override the behaviour of all handlers at once. A list of wrappers `[a, b, c]` is applied to the handlers right-to-left, as in `a( b( c( handler ) ) )`.

The entries on the `routerContext` object are made available to all handlers.

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `CrawleeOneCtx<CrawlingContext<BasicCrawler<BasicCrawlingContext<Dictionary>> \| PuppeteerCrawler \| PlaywrightCrawler \| JSDOMCrawler \| CheerioCrawler \| HttpCrawler<InternalHttpCrawlingContext<any, any, HttpCrawler<any>>>, Dictionary>, string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T>` |
| `RouterCtx` | extends `Record<string, any>` = `CrawleeOneRouteCtx<T>` |

**Parameters:**

| Name | Type |
| --- | --- |
| `router` | `RouterHandler<T["context"]>` |
| `routes` | `Record<T["labels"], CrawleeOneRoute<T, RouterCtx>>` |
| `options?` | `Object` |
| `options.handlerWrappers?` | `CrawleeOneRouteWrapper<T, RouterCtx>[]` |
| `options.onSetCtx?` | `(ctx: null \| Omit<T["context"] & RouterCtx, "request"> & { request: Request<Dictionary> }) => void` |
| `options.routerContext?` | `RouterCtx` |

**Returns:** `Promise<void>`
▸ **requestQueueSizeMonitor**(`maxSize`, `options?`): `Object`

Semi-automatic monitoring of RequestQueue size. Used for limiting the total number of entries scraped per run / RequestQueue:

- When the RequestQueue reaches `maxSize`, all remaining Requests are removed.
- Pass an array of items to `shortenToSize` to shorten the array to a size that still fits into the RequestQueue.

By default this uses the Apify RequestQueue.

**Parameters:**

| Name | Type |
| --- | --- |
| `maxSize` | `number` |
| `options?` | `RequestQueueSizeMonitorOptions` |

**Returns:** `Object`

| Name | Type |
| --- | --- |
| `isFull` | `() => Promise<boolean>` |
| `isStale` | `() => boolean` |
| `onValue` | `(callback: ValueCallback<number>) => () => void` |
| `refresh` | `() => Promise<number>` |
| `shortenToSize` | `<T>(arr: T[]) => Promise<T[]>` |
| `value` | `() => null \| number \| Promise<number>` |
▸ **runCrawleeOne**<`TType`, `T`>(`args`): `Promise<void>`

Creates an opinionated Crawlee crawler and runs it within Apify's `Actor.main()` context. The Apify context (e.g. calling `Actor.getInput`) can be replaced with a custom implementation using the `actorConfig.io` option.

This function does the following for you:

- Full TypeScript coverage: ensures all components use the same Crawler / CrawlerContext.
- Gets the actor input from `io.getInput()`, which by default corresponds to Apify's `Actor.getInput()`.
- (Optional) Validates the actor input.
- Sets up the router such that requests that reach the default route are redirected to labelled routes based on which item from `routes` they match.
- Registers all route handlers for you.
- (Optional) Wraps all route handlers in a wrapper. Use this e.g. if you want to add a field to the context object, or handle errors from a single place.
- (Optional) Supports transformation and filtering of (scraped) entries, configured via the actor input.
- (Optional) Supports actor metamorphing, configured via the actor input.

**Type parameters:**

| Name | Type |
| --- | --- |
| `TType` | extends `"basic" \| "http" \| "cheerio" \| "jsdom" \| "playwright" \| "puppeteer"` |
| `T` | extends `CrawleeOneCtx<CrawlerMeta<TType>["context"], string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T>` |

**Parameters:**

| Name | Type |
| --- | --- |
| `args` | `RunCrawleeOneOptions<TType, T>` |

**Returns:** `Promise<void>`
▸ **runCrawlerTest**<`TData`, `TInput`>(`«destructured»`): `Promise<void>`

**Type parameters:**

| Name | Type |
| --- | --- |
| `TData` | extends `MaybeArray<Dictionary>` |
| `TInput` | `TInput` |

**Parameters:**

| Name | Type |
| --- | --- |
| `«destructured»` | `Object` |
| `› input` | `TInput` |
| `› log?` | `(...args: any[]) => void` |
| `› onBatchAddRequests?` | `OnBatchAddRequests` |
| `› onDone?` | `(done: () => void) => MaybePromise<void>` |
| `› onPushData?` | `(data: any, done: () => void) => MaybePromise<void>` |
| `› runCrawler` | `() => MaybePromise<void>` |
| `› vi` | `VitestUtils` |

**Returns:** `Promise<void>`
▸ **scrapeListingEntries**<`Ctx`, `UrlType`>(`options`): `Promise<UrlType[]>`

Get entries from a listing page (e.g. URLs of profiles that should be scraped later).

**Type parameters:**

| Name | Type |
| --- | --- |
| `Ctx` | extends `object` |
| `UrlType` | `UrlType` |

**Parameters:**

| Name | Type |
| --- | --- |
| `options` | `ListingPageScraperOptions<Ctx, UrlType>` |

**Returns:** `Promise<UrlType[]>`

Defined in: src/lib/actions/scrapeListing.ts:229
▸ **setupDefaultHandlers**<`T`, `RouterCtx`>(`«destructured»`): `Promise<void>`

Configures the default router handler to redirect URLs to labelled route handlers, based on which route the URL matches first.

NOTE: This means that URLs passed to this default handler are fetched twice (as each URL is re-queued to the correct handler). We recommend using this function only in scenarios where there is a small number of `startUrls`, yet those URLs may need different ways of processing based on different paths, etc.

**Type parameters:**

| Name | Type |
| --- | --- |
| `T` | extends `CrawleeOneCtx<CrawlingContext<BasicCrawler<BasicCrawlingContext<Dictionary>> \| PuppeteerCrawler \| PlaywrightCrawler \| JSDOMCrawler \| CheerioCrawler \| HttpCrawler<InternalHttpCrawlingContext<any, any, HttpCrawler<any>>>, Dictionary>, string, Record<string, any>, CrawleeOneIO<object, object, object>, CrawleeOneTelemetry<any, any>, T>` |
| `RouterCtx` | extends `Record<string, any>` = `CrawleeOneRouteCtx<T>` |

**Parameters:**

| Name | Type |
| --- | --- |
| `«destructured»` | `Object` |
| `› input?` | `null \| T["input"]` |
| `› io` | `T["io"]` |
| `› onSetCtx?` | `(ctx: null \| Omit<T["context"] & RouterCtx, "request"> & { request: Request<Dictionary> }) => void` |
| `› routeHandlerWrappers?` | `CrawleeOneRouteWrapper<T, RouterCtx>[]` |
| `› router` | `RouterHandler<T["context"]>` |
| `› routerContext?` | `RouterCtx` |
| `› routes` | `Record<T["labels"], CrawleeOneRoute<T, RouterCtx>>` |

**Returns:** `Promise<void>`

**Example**

```ts
const routeLabels = {
  MAIN_PAGE: 'MAIN_PAGE',
  JOB_LISTING: 'JOB_LISTING',
  JOB_DETAIL: 'JOB_DETAIL',
  JOB_RELATED_LIST: 'JOB_RELATED_LIST',
  PARTNERS: 'PARTNERS',
} as const;

const router = createPlaywrightRouter();

const routes = createPlaywrightCrawleeOneRouteMatchers<typeof routeLabels>([
  // URLs that match this route are redirected to router.addHandler(routeLabels.MAIN_PAGE)
  {
    route: routeLabels.MAIN_PAGE,
    // Check for main page like https://www.profesia.sk/?#
    match: (url) => url.match(/[\W]profesia\.sk\/?(?:[?#~]|$)/i),
  },
  // Optionally override the logic that assigns the URL to the route by specifying the `action` prop
  {
    route: routeLabels.MAIN_PAGE,
    // Check for main page like https://www.profesia.sk/?#
    match: (url) => url.match(/[\W]profesia\.sk\/?(?:[?#~]|$)/i),
    action: async (ctx) => {
      await ctx.crawler.addRequests([{
        url: 'https://profesia.sk/praca',
        label: routeLabels.JOB_LISTING,
      }]);
    },
  },
]);

// Set up the default route to redirect to the labelled routes
setupDefaultHandlers({ router, routes });

// Now set up the labelled routes
await router.addHandler(routeLabels.JOB_LISTING, async (ctx) => { /* ... */ });
```
▸ **setupMockApifyActor**<`TInput`, `TData`>(`«destructured»`): `Promise<void>`

**Type parameters:**

| Name | Type |
| --- | --- |
| `TInput` | `TInput` |
| `TData` | extends `MaybeArray<Dictionary>` = `MaybeArray<Dictionary>` |

**Parameters:**

| Name | Type |
| --- | --- |
| `«destructured»` | `Object` |
| `› actorInput?` | `TInput` |
| `› log?` | `(...args: any[]) => void` |
| `› onBatchAddRequests?` | `OnBatchAddRequests` |
| `› onGetInfo?` | `(...args: any[]) => MaybePromise<void>` |
| `› onPushData?` | `(data: TData) => MaybePromise<void>` |
| `› vi` | `VitestUtils` |

**Returns:** `Promise<void>`
▸ **validateConfig**(`config`): `void`

Validates the given CrawleeOne config.

The config can be passed directly, or you can specify the path to the config file. In the latter case, the config is loaded using `loadConfig`.

**Parameters:**

| Name | Type |
| --- | --- |
| `config` | `unknown` |

**Returns:** `void`