CommonCrawl Extractor 1.0 documentation

Quick Start Guide

Contents:

  • Quick Overview
    • 1. Querying CommonCrawl
    • 2. Downloading a file
    • 3. Choose parser
    • 4. Filtering out the web page
    • 5. Extract fields from the page
    • 6. File saving
  • Quickstart
    • Extractor
    • download_article.py
    • Extracting (Transformations)
    • Extracting (BS4 version)
    • Filtering
    • config.json
    • Testing our extractor
    • Running the extractor
  • Artemis Queue

By Hynek Kydlíček
© Copyright 2022, Hynek Kydlíček.