
Data Processing Patterns

Data processing is the backbone of effective browser automation workflows. This guide covers patterns for handling text, structured, media, and real-time data, along with transformation patterns and quality best practices.

Extract, clean, and transform textual content from web pages, from initial extraction through cleaning and normalization to keyword and sentiment analysis.

  • Article content extraction and cleaning
  • Comment and review processing
  • Social media text analysis
  • Document content extraction
flowchart LR
    A[GetAllText Node] --> B[EditFields Node]
    B --> C[Filter Node]
    C --> D[Transform Data]
    D --> E[Output Results]

    A --> A1[Extract with Context]
    A --> A2[Preserve Formatting]
    A --> A3[Include Metadata]

    B --> B1[Clean Text]
    B --> B2[Normalize Content]
    B --> B3[Calculate Metrics]

    C --> C1[Quality Filter]
    C --> C2[Relevance Filter]
    C --> C3[Length Filter]

    D --> D1[Keyword Extraction]
    D --> D2[Sentiment Analysis]
    D --> D3[Content Classification]

    style A fill:#e3f2fd
    style B fill:#fff3e0
    style C fill:#f3e5f5
    style D fill:#e8f5e8
    style E fill:#e1f5fe
  1. Text Extraction with Context

    // GetAllText node configuration
    {
      "selector": "article, .content, .post-body",
      "preserveFormatting": true,
      "includeMetadata": true,
      "excludeSelectors": [".ads", ".sidebar", ".comments"]
    }
  2. Text Cleaning and Normalization

    // EditFields node - text processing
    {
      "operations": [
        {
          "field": "content",
          "operation": "clean",
          "rules": [
            "removeExtraWhitespace",
            "removeSpecialChars",
            "normalizeLineBreaks"
          ]
        },
        {
          "field": "wordCount",
          "operation": "calculate",
          "formula": "content.split(' ').length"
        }
      ]
    }
  3. Content Analysis

    // Advanced text processing
    {
      "operations": [
        {
          "field": "keywords",
          "operation": "extract",
          "pattern": "\\b[A-Z][a-z]+(?:\\s+[A-Z][a-z]+)*\\b",
          "limit": 10
        },
        {
          "field": "sentiment",
          "operation": "analyze",
          "type": "sentiment"
        }
      ]
    }
{
  "content": "Clean, processed article content...",
  "wordCount": 1250,
  "keywords": ["Technology", "Innovation", "Future"],
  "sentiment": "positive",
  "metadata": {
    "extractedAt": "2024-01-15T10:30:00Z",
    "source": "https://example.com/article"
  }
}
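
The configurations above describe the cleaning and analysis steps declaratively. As a rough illustration of what those operations compute, here is a minimal TypeScript sketch; the function name and rule implementations are assumptions for this example, not part of the GetAllText or EditFields nodes:

    // Illustrative sketch of the clean / wordCount / keywords operations above
    function processArticleText(content: string) {
      const cleaned = content
        .replace(/\r\n?/g, "\n")              // normalizeLineBreaks
        .replace(/[^\w\s.,:;!?'"()-]/g, "")   // removeSpecialChars (approximate)
        .replace(/[^\S\n]+/g, " ")            // removeExtraWhitespace
        .trim();

      // Word count on whitespace boundaries
      const wordCount = cleaned === "" ? 0 : cleaned.split(/\s+/).length;

      // Up to 10 capitalized phrases as candidate keywords, matching the pattern above
      const keywords = Array.from(
        new Set(cleaned.match(/\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b/g) ?? [])
      ).slice(0, 10);

      return { content: cleaned, wordCount, keywords };
    }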

Parse and extract structured data from tables, lists, forms, and other organized content.

  • Financial data tables
  • Product specification lists
  • Directory information
  • Form data extraction
graph TB
    A[GetAllHTML Node] --> B[ProcessHTML Node]
    B --> C[Parse Structure]
    C --> D[Validate Data]
    D --> E[Output Structured Data]

    subgraph "Structure Types"
        F[Tables] --> F1[Header Detection]
        F --> F2[Column Mapping]
        F --> F3[Data Type Conversion]

        G[Lists] --> G1[Item Extraction]
        G --> G2[Hierarchy Detection]
        G --> G3[Metadata Parsing]

        H[Forms] --> H1[Field Identification]
        H --> H2[Value Extraction]
        H --> H3[Validation Rules]
    end

    B --> F
    B --> G
    B --> H

    F1 --> C
    F2 --> C
    F3 --> C
    G1 --> C
    G2 --> C
    G3 --> C
    H1 --> C
    H2 --> C
    H3 --> C

    style A fill:#e3f2fd
    style B fill:#fff3e0
    style E fill:#e8f5e8
  1. HTML Structure Extraction

    // GetAllHTML with structure preservation
    {
      "selector": "table, .data-grid, .spec-list",
      "preserveStructure": true,
      "includeAttributes": ["class", "id", "data-*"]
    }
  2. Table Processing

    // ProcessHTML for table data
    {
      "tableProcessing": {
        "headerRow": 0,
        "skipRows": [],
        "columnMapping": {
          "0": "product_name",
          "1": "price",
          "2": "availability",
          "3": "rating"
        },
        "dataTypes": {
          "price": "currency",
          "rating": "number",
          "availability": "boolean"
        }
      }
    }
  3. List Structure Processing

    // ProcessHTML for list data
    {
      "listProcessing": {
        "itemSelector": "li, .list-item",
        "extractionRules": [
          {
            "field": "title",
            "selector": ".item-title, h3"
          },
          {
            "field": "description",
            "selector": ".item-desc, p"
          },
          {
            "field": "metadata",
            "selector": ".item-meta",
            "parseAs": "keyValue"
          }
        ]
      }
    }
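
To make the table-processing step concrete, here is a minimal TypeScript sketch of how a row of cell text might be mapped and typed according to the columnMapping and dataTypes configuration above; the function and conversion rules are illustrative assumptions, not a specific node API:

    // Map one table row to named, typed fields
    type DataType = "currency" | "number" | "boolean" | "string";

    function mapTableRow(
      cells: string[],
      columnMapping: Record<string, string>,
      dataTypes: Record<string, DataType>
    ): Record<string, unknown> {
      const row: Record<string, unknown> = {};
      for (const [index, field] of Object.entries(columnMapping)) {
        const raw = (cells[Number(index)] ?? "").trim();
        switch (dataTypes[field] ?? "string") {
          case "currency":
          case "number":
            row[field] = Number(raw.replace(/[^0-9.-]/g, "")); // strip symbols, keep digits
            break;
          case "boolean":
            row[field] = /^(yes|true|in stock|available|1)$/i.test(raw);
            break;
          default:
            row[field] = raw;
        }
      }
      return row;
    }

For example, mapping the cells ["Widget", "$19.99", "In Stock", "4.5"] with the configuration above would yield { product_name: "Widget", price: 19.99, availability: true, rating: 4.5 }.
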
// Validation rules
{
  "validationRules": [
    {
      "field": "price",
      "type": "number",
      "min": 0,
      "required": true
    },
    {
      "field": "email",
      "type": "email",
      "pattern": "^[\\w.-]+@([\\w-]+\\.)+[\\w-]{2,}$"
    }
  ]
}
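
A minimal sketch of how such rules could be applied to an extracted record follows; the rule shape mirrors the configuration above, while the function itself is an assumption for illustration:

    // Apply validation rules to one record, returning a list of error messages
    interface ValidationRule {
      field: string;
      type: "number" | "email" | "string";
      min?: number;
      required?: boolean;
      pattern?: string;
    }

    function validateRecord(record: Record<string, unknown>, rules: ValidationRule[]): string[] {
      const errors: string[] = [];
      for (const rule of rules) {
        const value = record[rule.field];
        if (value === undefined || value === null || value === "") {
          if (rule.required) errors.push(`${rule.field} is required`);
          continue;
        }
        if (rule.type === "number") {
          const n = Number(value);
          if (Number.isNaN(n)) errors.push(`${rule.field} must be a number`);
          else if (rule.min !== undefined && n < rule.min) errors.push(`${rule.field} must be >= ${rule.min}`);
        }
        if (rule.type === "email" && rule.pattern && !new RegExp(rule.pattern).test(String(value))) {
          errors.push(`${rule.field} is not a valid email`);
        }
      }
      return errors;
    }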

Extract, process, and analyze images, videos, and other media content from web pages.

  • Image gallery extraction
  • Video metadata collection
  • Media file processing
  • Visual content analysis
sequenceDiagram
    participant W as Workflow
    participant GI as GetAllImages
    participant IP as ImageProcessor
    participant ME as MediaExtractor
    participant A as Analysis Engine
    participant O as Output

    W->>GI: Extract images from page
    GI->>IP: Send image collection
    IP->>IP: Resize & optimize images
    IP->>ME: Process optimized images
    ME->>ME: Extract metadata & properties
    ME->>A: Send for content analysis
    A->>A: Detect objects & extract text
    A->>O: Return processed results

    Note over IP: Image processing: resize, format conversion, quality optimization
    Note over A: AI-powered analysis: object detection, text extraction, content classification
  1. Image Collection

    // GetAllImages with filtering
    {
      "selector": "img, picture source",
      "minWidth": 200,
      "minHeight": 200,
      "excludeTypes": ["svg", "gif"],
      "includeMetadata": true
    }
  2. Image Processing

    // ImageProcessor configuration
    {
      "operations": [
        {
          "type": "resize",
          "width": 800,
          "height": 600,
          "maintainAspect": true
        },
        {
          "type": "format",
          "outputFormat": "webp",
          "quality": 85
        },
        {
          "type": "analyze",
          "extractColors": true,
          "detectObjects": true
        }
      ]
    }
  3. Media Metadata Extraction

    // MediaExtractor for comprehensive metadata
    {
      "extractionTypes": [
        "dimensions",
        "fileSize",
        "format",
        "colorProfile",
        "exifData"
      ],
      "analysisOptions": {
        "detectFaces": true,
        "extractText": true,
        "classifyContent": true
      }
    }
{
  "images": [
    {
      "url": "https://example.com/image.jpg",
      "processedUrl": "data:image/webp;base64,UklGRiIAAABXRUJQVlA4...",
      "metadata": {
        "width": 800,
        "height": 600,
        "format": "webp",
        "size": 45678,
        "colors": ["#FF5733", "#33FF57", "#3357FF"],
        "objects": ["person", "car", "building"]
      }
    }
  ]
}
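
The GetAllImages filtering step above can be pictured as in-page logic like the following TypeScript sketch; it only handles img elements (not picture sources) and the helper name and details are assumptions for illustration:

    // Collect images meeting minimum dimensions, skipping excluded formats
    function collectImages(minWidth = 200, minHeight = 200, excluded = ["svg", "gif"]) {
      return Array.from(document.querySelectorAll<HTMLImageElement>("img"))
        .filter((img) => img.naturalWidth >= minWidth && img.naturalHeight >= minHeight)
        .filter((img) => {
          const ext = new URL(img.currentSrc || img.src, location.href)
            .pathname.split(".").pop()?.toLowerCase() ?? "";
          return !excluded.includes(ext);
        })
        .map((img) => ({
          url: img.currentSrc || img.src,
          width: img.naturalWidth,
          height: img.naturalHeight,
          alt: img.alt,
        }));
    }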

Process continuous data feeds and real-time updates from dynamic web sources.

  • Live chat monitoring
  • Stock price tracking
  • Social media feeds
  • Real-time notifications
timeline
    title Real-time Data Streaming Process

    section Monitoring Phase
        Initial Setup    : Monitor configuration
                        : Target selection
                        : Polling intervals

        Change Detection : Content monitoring
                        : Delta identification
                        : Change classification

    section Processing Phase
        Data Extraction  : Extract new content
                        : Parse changes
                        : Validate data

        Stream Processing: Buffer management
                         : Batch processing
                         : Real-time output

    section Output Phase
        Data Delivery   : Stream to endpoints
                       : Update subscribers
                       : Maintain state
  1. Change Detection Setup

    // Monitor configuration
    {
      "watchSelector": ".live-data, .feed-item",
      "pollInterval": 5000,
      "changeDetection": "content",
      "maxItems": 100
    }
  2. Delta Processing

    // Process only new/changed content
    {
      "deltaProcessing": {
        "trackBy": "id",
        "compareFields": ["content", "timestamp"],
        "onNew": "process",
        "onChange": "update",
        "onDelete": "archive"
      }
    }
  3. Stream Processing

    // Real-time data processing
    {
      "streamConfig": {
        "bufferSize": 50,
        "flushInterval": 10000,
        "processingMode": "batch",
        "errorHandling": "continue"
      }
    }
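
To tie the delta and stream steps together, here is a minimal TypeScript sketch of tracking items by id, detecting new or changed entries, and flushing them in batches; the trackBy/buffer semantics shown are assumptions matching the configurations above, not a specific node implementation:

    // Delta detection keyed by id, with batched output
    interface FeedItem { id: string; content: string; timestamp: string }

    const seen = new Map<string, FeedItem>();
    let buffer: FeedItem[] = [];

    function onPoll(items: FeedItem[], flush: (batch: FeedItem[]) => void, bufferSize = 50) {
      for (const item of items) {
        const previous = seen.get(item.id);
        const changed = previous &&
          (previous.content !== item.content || previous.timestamp !== item.timestamp);
        if (!previous || changed) {
          seen.set(item.id, item);          // onNew / onChange: (re)process the item
          buffer.push(item);
        }
      }
      if (buffer.length >= bufferSize) {    // flush in batches, as in streamConfig
        flush(buffer);
        buffer = [];
      }
    }
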
// Aggregate extracted data
{
  "aggregations": [
    {
      "field": "price",
      "operations": ["min", "max", "avg", "sum"]
    },
    {
      "field": "category",
      "operation": "groupBy",
      "subAggregations": ["count", "avg:price"]
    }
  ]
}
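
A compact TypeScript sketch of what these aggregation rules compute (numeric summaries for price, plus a category groupBy with count and average price); the helper is illustrative and assumes at least one item:

    // min/max/avg/sum for price, and groupBy category with count and avg price
    function aggregatePrices(items: { price: number; category: string }[]) {
      const prices = items.map((i) => i.price);
      const sum = prices.reduce((a, b) => a + b, 0);
      const price = {
        min: Math.min(...prices),
        max: Math.max(...prices),
        avg: prices.length ? sum / prices.length : 0,
        sum,
      };

      const totals = new Map<string, { count: number; total: number }>();
      for (const item of items) {
        const t = totals.get(item.category) ?? { count: 0, total: 0 };
        t.count += 1;
        t.total += item.price;
        totals.set(item.category, t);
      }
      const category: Record<string, { count: number; avgPrice: number }> = {};
      for (const [name, t] of totals) {
        category[name] = { count: t.count, avgPrice: t.total / t.count };
      }

      return { price, category };
    }
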
// Enrich data with external sources
{
  "enrichmentRules": [
    {
      "field": "location",
      "source": "geocoding_api",
      "mapping": {
        "input": "address",
        "output": ["latitude", "longitude", "timezone"]
      }
    }
  ]
}
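
The enrichment rule above maps an address to coordinates and a timezone via an external geocoding source. A sketch of that shape in TypeScript, where geocodeAddress is a hypothetical placeholder for whatever geocoding provider the workflow uses:

    // Enrich a record with geocoded location fields (provider call is a stub)
    async function geocodeAddress(
      address: string
    ): Promise<{ latitude: number; longitude: number; timezone: string }> {
      // Call your geocoding provider here; this stub only illustrates the output shape.
      throw new Error("not implemented");
    }

    async function enrichLocation<T extends { address: string }>(record: T) {
      const { latitude, longitude, timezone } = await geocodeAddress(record.address);
      return { ...record, latitude, longitude, timezone };
    }
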
// Normalize data formats
{
  "normalizationRules": [
    {
      "field": "date",
      "inputFormat": "MM/DD/YYYY",
      "outputFormat": "ISO8601"
    },
    {
      "field": "currency",
      "baseCurrency": "USD",
      "conversionAPI": "exchange_rates_api"
    }
  ]
}
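
For the date rule, the conversion is a simple reshaping of MM/DD/YYYY into ISO 8601, as in the sketch below (the currency rule depends on an external rates API and is not sketched here); the function name is illustrative:

    // Normalize "MM/DD/YYYY" into an ISO 8601 date string
    function normalizeDate(input: string): string {
      const match = /^(\d{2})\/(\d{2})\/(\d{4})$/.exec(input.trim());
      if (!match) throw new Error(`Unexpected date format: ${input}`);
      const [, month, day, year] = match;
      return `${year}-${month}-${day}`; // e.g. "01/15/2024" -> "2024-01-15"
    }
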
  • Process data in chunks to manage memory usage (see the sketch after this list)
  • Implement parallel processing for independent operations
  • Use streaming for large datasets
  • Cache processed results to avoid recomputation
  • Implement intelligent cache invalidation
  • Use persistent storage for long-term caching
  • Clean up temporary data regularly
  • Use efficient data structures
  • Implement garbage collection triggers
  • Implement comprehensive validation rules
  • Handle malformed data gracefully
  • Provide detailed error reporting
  • Implement automatic retry mechanisms
  • Provide fallback processing methods
  • Maintain processing state for recovery
  • Monitor data quality metrics
  • Implement anomaly detection
  • Provide data quality reports
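
As a concrete illustration of two of these practices, chunked processing combined with automatic retries, here is a minimal TypeScript sketch; the chunk size, retry count, and backoff are illustrative defaults, not prescribed values:

    // Process items in fixed-size chunks, retrying each chunk with linear backoff
    async function processInChunks<T, R>(
      items: T[],
      handle: (chunk: T[]) => Promise<R>,
      chunkSize = 100,
      retries = 3
    ): Promise<R[]> {
      const results: R[] = [];
      for (let start = 0; start < items.length; start += chunkSize) {
        const chunk = items.slice(start, start + chunkSize);
        let attempt = 0;
        while (true) {
          try {
            results.push(await handle(chunk));
            break;
          } catch (err) {
            attempt += 1;
            if (attempt >= retries) throw err;                        // give up after N attempts
            await new Promise((r) => setTimeout(r, 1000 * attempt));  // simple backoff
          }
        }
      }
      return results;
    }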