
Class Scraper

Superclass for all "scrapers"

remarks

This abstract class describes a standardized method of scraping web pages and saving the results. Its structure is specifically engineered to support complex, relational data stored in an RDBMS such as Postgres. A subclass of AbstractScraper generally describes the process of scraping one type of webpage into one database table. Each instance of a class extending AbstractScraper corresponds to the scrape of one specific URL. The general use pattern for an instance of such a class is to first call the constructor, then Scraper.scrape, as sketched below.
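
A minimal sketch of that pattern, assuming a hypothetical AlbumScraper subclass (the class name, URL, and table are illustrative, not part of this API):

    // Hypothetical subclass: one subclass per page type/table,
    // one instance per URL.
    class AlbumScraper extends Scraper {
        public name: string | undefined;

        public constructor(public url: string, verbose?: boolean) {
            super(`album page at ${url}`, verbose);
        }
    }

    // Construct first, then scrape.
    const album = new AlbumScraper('https://www.example.com/album/1');
    await album.scrape();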


Constructors

constructor

  • new Scraper(description: string, verbose?: boolean): Scraper

Properties

dataReadFromLocal

dataReadFromLocal: boolean

Scrapers always check for a local copy of the target resource (using Scraper.checkForLocalRecord) before executing a scrape from an external resource. If the resource was found (and therefore no external calls were made), this is set to true.

description

description: string

A simple, human-readable description of what is being scraped. Used for logging.

results

results: ResultBatch

Contains all results generated by Scraper.scrape, including recursive calls.

scrapeSucceeded

scrapeSucceeded: boolean

Flag indicating a successful scrape; set to true after a non-error-throwing call to Scraper.scrape.

verbose

verbose: boolean

Used to override .env settings and force-log the output of a given scraper.

Methods

checkForLocalRecord

  • checkForLocalRecord(): Promise<boolean>
  • Checks whether a locally stored record exists for this scraper. Should resolve to false if no local record is found. By default, always resolves to false (so the resource is always scraped). A typical override is sketched below.

    Returns Promise<boolean>
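
    A hedged sketch of an override, continuing the hypothetical AlbumScraper above (`db` as a pg-promise-style client and the `albums` table are assumptions, not part of the documented API):

        // Inside the hypothetical AlbumScraper subclass.
        public async checkForLocalRecord(): Promise<boolean> {
            // `db.oneOrNone` resolves to the row or to null.
            const record = await db.oneOrNone(
                'SELECT id FROM albums WHERE url = $1',
                [this.url],
            );
            return record !== null; // true: skip the external request
        }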

Protected extractInfo

  • extractInfo(): void
  • Extracts information from a scraped resource synchronously

    remarks

    Must be called after Scraper.requestScrape.

    Extracted info should be stored in class properties, to be saved later by Scraper.saveToLocal. An implementation should also store constructed (but not yet .scrape()'d) instances of any child scrapers extracted from this resource, to be scraped later by Scraper.scrapeDependencies. By default, this method does nothing. A sketch of an override follows.

    Returns void
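
    A hedged sketch, again using the hypothetical AlbumScraper (the `this.$` Cheerio-style handle, the selectors, and the ArtistScraper child are assumptions about the concrete subclass, not documented members):

        // Inside the hypothetical AlbumScraper subclass.
        protected extractInfo(): void {
            // Assumes requestScrape() stored a parsed document on `this.$`.
            this.name = this.$('h1.album-title').text();
            const artistUrl = this.$('a.artist-link').attr('href');
            if (artistUrl !== undefined) {
                // Constructed but not yet .scrape()'d, per the contract
                // above; scrapeDependencies() will run it later.
                this.artistScraper = new ArtistScraper(artistUrl);
            }
        }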

printInfo

  • printInfo(): void
  • Prints a detailed report of a scraper's local properties; used for debugging

    Returns void

printResult

  • printResult(): void

requestScrape

  • requestScrape(): Promise<void>

Protected saveToLocal

  • saveToLocal(): Promise<void>

scrape

  • scrape(forceScrape?: boolean): Promise<void>
  • Entry point for initiating an asset scrape. General scrape outline/method order:

    1. Scraper.checkForLocalRecord
    2. If a local entity was found, update class props and return.
    3. Scraper.requestScrape
    4. Scraper.extractInfo
    5. Scraper.scrapeDependencies
    6. Scraper.saveToLocal
    7. Update class props and return

    remarks

    This method should be considered unsafe: there are several points where it can throw errors. This is intentional, and allows easier support for relational data scraping/storage. Scraped assets may have a mixture of required and non-required dependencies, which should be kept in mind when implementing Scraper.scrapeDependencies. A subclass should catch and log errors from non-required scrapes; errors from a required scrape, however, should remain uncaught, so that the original call to Scraper.scrape errors out before Scraper.saveToLocal is called with incomplete data. A brief usage sketch follows the parameter list below.

    Parameters

    • Default value forceScrape: boolean = false

      If set to true, scrapes the external resource regardless of any existing local records

    Returns Promise<void>
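
    A short usage sketch (the AlbumScraper subclass is hypothetical); the try/catch reflects the "unsafe by design" note above:

        const album = new AlbumScraper('https://www.example.com/album/1');
        try {
            // forceScrape = true bypasses checkForLocalRecord entirely.
            await album.scrape(true);
        } catch (err) {
            // A required step or dependency threw, so saveToLocal was
            // never reached for this record.
            console.error(`Scrape failed (${album.description}):`, err);
        }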

Protected scrapeDependencies

  • scrapeDependencies(): Promise<void>
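
    No description is given here, but the Scraper.scrape remarks imply the required/optional split below. A hedged sketch, continuing the hypothetical AlbumScraper (treating the artist as required and a labelScraper as non-required is an assumption):

        // Inside the hypothetical AlbumScraper subclass.
        protected async scrapeDependencies(): Promise<void> {
            // Required dependency: let errors propagate so scrape()
            // aborts before saveToLocal() runs on incomplete data.
            if (this.artistScraper !== undefined) {
                await this.artistScraper.scrape();
            }
            // Non-required dependency: catch and log instead of aborting.
            if (this.labelScraper !== undefined) {
                await this.labelScraper
                    .scrape()
                    .catch((err: Error) => this.scrapeErrorHandler(err));
            }
        }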

Protected scrapeErrorHandler

  • scrapeErrorHandler(error: Error): Promise<void>

Static scrapeDependencyArr

  • scrapeDependencyArr<T>(scrapers: T[], forceScrape?: boolean): Promise<ScrapersWithResults<T>>
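
    A hedged usage sketch (ArtistScraper and the URL array are hypothetical; ScrapersWithResults presumably pairs the scrapers with their accumulated ResultBatch data):

        const artistUrls = ['https://www.example.com/artist/1'];
        const scrapers = artistUrls.map((url) => new ArtistScraper(url));
        // Scrapes every element of the array, honoring forceScrape.
        const scraped = await Scraper.scrapeDependencyArr(scrapers, false);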

Generated using TypeDoc