
Data Processing Patterns

Data processing is the backbone of effective browser automation workflows. This guide covers patterns for handling text, structured, media, and real-time data, along with transformation patterns and quality best practices.

Extract, clean, and transform textual content from web pages, from initial extraction through cleaning and normalization to keyword and sentiment analysis.

  • Article content extraction and cleaning
  • Comment and review processing
  • Social media text analysis
  • Document content extraction
flowchart LR
    A[GetAllText Node] --> B[EditFields Node]
    B --> C[Filter Node]
    C --> D[Transform Data]
    D --> E[Output Results]

    A --> A1[Extract with Context]
    A --> A2[Preserve Formatting]
    A --> A3[Include Metadata]

    B --> B1[Clean Text]
    B --> B2[Normalize Content]
    B --> B3[Calculate Metrics]

    C --> C1[Quality Filter]
    C --> C2[Relevance Filter]
    C --> C3[Length Filter]

    D --> D1[Keyword Extraction]
    D --> D2[Sentiment Analysis]
    D --> D3[Content Classification]

    style A fill:#e3f2fd
    style B fill:#fff3e0
    style C fill:#f3e5f5
    style D fill:#e8f5e8
    style E fill:#e1f5fe
  1. Text Extraction with Context

    // GetAllText node configuration
    {
      "selector": "article, .content, .post-body",
      "preserveFormatting": true,
      "includeMetadata": true,
      "excludeSelectors": [".ads", ".sidebar", ".comments"]
    }
  2. Text Cleaning and Normalization

    // EditFields node - text processing
    {
      "operations": [
        {
          "field": "content",
          "operation": "clean",
          "rules": [
            "removeExtraWhitespace",
            "removeSpecialChars",
            "normalizeLineBreaks"
          ]
        },
        {
          "field": "wordCount",
          "operation": "calculate",
          "formula": "content.split(' ').length"
        }
      ]
    }
  3. Content Analysis

    // Advanced text processing
    {
      "operations": [
        {
          "field": "keywords",
          "operation": "extract",
          "pattern": "\\b[A-Z][a-z]+(?:\\s+[A-Z][a-z]+)*\\b",
          "limit": 10
        },
        {
          "field": "sentiment",
          "operation": "analyze",
          "type": "sentiment"
        }
      ]
    }
{
  "content": "Clean, processed article content...",
  "wordCount": 1250,
  "keywords": ["Technology", "Innovation", "Future"],
  "sentiment": "positive",
  "metadata": {
    "extractedAt": "2024-01-15T10:30:00Z",
    "source": "https://example.com/article"
  }
}
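
The configurations above describe the cleaning and analysis steps declaratively. As a rough illustration of what those operations compute, here is a minimal TypeScript sketch; the function name and rule implementations are assumptions for this example, not part of the GetAllText or EditFields nodes:

    // Illustrative sketch of the clean / wordCount / keywords operations above
    function processArticleText(content: string) {
      const cleaned = content
        .replace(/\r\n?/g, "\n")              // normalizeLineBreaks
        .replace(/[^\w\s.,:;!?'"()-]/g, "")   // removeSpecialChars (approximate)
        .replace(/[^\S\n]+/g, " ")            // removeExtraWhitespace
        .trim();

      // Word count on whitespace boundaries
      const wordCount = cleaned === "" ? 0 : cleaned.split(/\s+/).length;

      // Up to 10 capitalized phrases as candidate keywords, matching the pattern above
      const keywords = Array.from(
        new Set(cleaned.match(/\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b/g) ?? [])
      ).slice(0, 10);

      return { content: cleaned, wordCount, keywords };
    }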

Parse and extract structured data from tables, lists, forms, and other organized content.

  • Financial data tables
  • Product specification lists
  • Directory information
  • Form data extraction
graph TB
    A[GetAllHTML Node] --> B[ProcessHTML Node]
    B --> C[Parse Structure]
    C --> D[Validate Data]
    D --> E[Output Structured Data]

    subgraph "Structure Types"
        F[Tables] --> F1[Header Detection]
        F --> F2[Column Mapping]
        F --> F3[Data Type Conversion]

        G[Lists] --> G1[Item Extraction]
        G --> G2[Hierarchy Detection]
        G --> G3[Metadata Parsing]

        H[Forms] --> H1[Field Identification]
        H --> H2[Value Extraction]
        H --> H3[Validation Rules]
    end

    B --> F
    B --> G
    B --> H

    F1 --> C
    F2 --> C
    F3 --> C
    G1 --> C
    G2 --> C
    G3 --> C
    H1 --> C
    H2 --> C
    H3 --> C

    style A fill:#e3f2fd
    style B fill:#fff3e0
    style E fill:#e8f5e8
  1. HTML Structure Extraction

    // GetAllHTML with structure preservation
    {
      "selector": "table, .data-grid, .spec-list",
      "preserveStructure": true,
      "includeAttributes": ["class", "id", "data-*"]
    }
  2. Table Processing

    // ProcessHTML for table data
    {
      "tableProcessing": {
        "headerRow": 0,
        "skipRows": [],
        "columnMapping": {
          "0": "product_name",
          "1": "price",
          "2": "availability",
          "3": "rating"
        },
        "dataTypes": {
          "price": "currency",
          "rating": "number",
          "availability": "boolean"
        }
      }
    }
  3. List Structure Processing

    // ProcessHTML for list data
    {
      "listProcessing": {
        "itemSelector": "li, .list-item",
        "extractionRules": [
          {
            "field": "title",
            "selector": ".item-title, h3"
          },
          {
            "field": "description",
            "selector": ".item-desc, p"
          },
          {
            "field": "metadata",
            "selector": ".item-meta",
            "parseAs": "keyValue"
          }
        ]
      }
    }
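
To make the table-processing step concrete, here is a minimal TypeScript sketch of how a row of cell text might be mapped and typed according to the columnMapping and dataTypes configuration above; the function and conversion rules are illustrative assumptions, not a specific node API:

    // Map one table row to named, typed fields
    type DataType = "currency" | "number" | "boolean" | "string";

    function mapTableRow(
      cells: string[],
      columnMapping: Record<string, string>,
      dataTypes: Record<string, DataType>
    ): Record<string, unknown> {
      const row: Record<string, unknown> = {};
      for (const [index, field] of Object.entries(columnMapping)) {
        const raw = (cells[Number(index)] ?? "").trim();
        switch (dataTypes[field] ?? "string") {
          case "currency":
          case "number":
            row[field] = Number(raw.replace(/[^0-9.-]/g, "")); // strip symbols, keep digits
            break;
          case "boolean":
            row[field] = /^(yes|true|in stock|available|1)$/i.test(raw);
            break;
          default:
            row[field] = raw;
        }
      }
      return row;
    }

For example, mapping the cells ["Widget", "$19.99", "In Stock", "4.5"] with the configuration above would yield { product_name: "Widget", price: 19.99, availability: true, rating: 4.5 }.
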
// Validation rules
{
  "validationRules": [
    {
      "field": "price",
      "type": "number",
      "min": 0,
      "required": true
    },
    {
      "field": "email",
      "type": "email",
      "pattern": "^[\\w.-]+@([\\w-]+\\.)+[\\w-]{2,}$"
    }
  ]
}
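
A minimal sketch of how such rules could be applied to an extracted record follows; the rule shape mirrors the configuration above, while the function itself is an assumption for illustration:

    // Apply validation rules to one record, returning a list of error messages
    interface ValidationRule {
      field: string;
      type: "number" | "email" | "string";
      min?: number;
      required?: boolean;
      pattern?: string;
    }

    function validateRecord(record: Record<string, unknown>, rules: ValidationRule[]): string[] {
      const errors: string[] = [];
      for (const rule of rules) {
        const value = record[rule.field];
        if (value === undefined || value === null || value === "") {
          if (rule.required) errors.push(`${rule.field} is required`);
          continue;
        }
        if (rule.type === "number") {
          const n = Number(value);
          if (Number.isNaN(n)) errors.push(`${rule.field} must be a number`);
          else if (rule.min !== undefined && n < rule.min) errors.push(`${rule.field} must be >= ${rule.min}`);
        }
        if (rule.type === "email" && rule.pattern && !new RegExp(rule.pattern).test(String(value))) {
          errors.push(`${rule.field} is not a valid email`);
        }
      }
      return errors;
    }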

Extract, process, and analyze images, videos, and other media content from web pages.

  • Image gallery extraction
  • Video metadata collection
  • Media file processing
  • Visual content analysis
sequenceDiagram
    participant W as Workflow
    participant GI as GetAllImages
    participant IP as ImageProcessor
    participant ME as MediaExtractor
    participant A as Analysis Engine
    participant O as Output

    W->>GI: Extract images from page
    GI->>IP: Send image collection
    IP->>IP: Resize & optimize images
    IP->>ME: Process optimized images
    ME->>ME: Extract metadata & properties
    ME->>A: Send for content analysis
    A->>A: Detect objects & extract text
    A->>O: Return processed results

    Note over IP: Image processing: resize, format conversion, quality optimization
    Note over A: AI-powered analysis: object detection, text extraction, content classification
  1. Image Collection

    // GetAllImages with filtering
    {
      "selector": "img, picture source",
      "minWidth": 200,
      "minHeight": 200,
      "excludeTypes": ["svg", "gif"],
      "includeMetadata": true
    }
  2. Image Processing

    // ImageProcessor configuration
    {
      "operations": [
        {
          "type": "resize",
          "width": 800,
          "height": 600,
          "maintainAspect": true
        },
        {
          "type": "format",
          "outputFormat": "webp",
          "quality": 85
        },
        {
          "type": "analyze",
          "extractColors": true,
          "detectObjects": true
        }
      ]
    }
  3. Media Metadata Extraction

    // MediaExtractor for comprehensive metadata
    {
      "extractionTypes": [
        "dimensions",
        "fileSize",
        "format",
        "colorProfile",
        "exifData"
      ],
      "analysisOptions": {
        "detectFaces": true,
        "extractText": true,
        "classifyContent": true
      }
    }
{
  "images": [
    {
      "url": "https://example.com/image.jpg",
      "processedUrl": "data:image/webp;base64,UklGRiIAAABXRUJQVlA4...",
      "metadata": {
        "width": 800,
        "height": 600,
        "format": "webp",
        "size": 45678,
        "colors": ["#FF5733", "#33FF57", "#3357FF"],
        "objects": ["person", "car", "building"]
      }
    }
  ]
}
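
The GetAllImages filtering step above can be pictured as in-page logic like the following TypeScript sketch; it only handles img elements (not picture sources) and the helper name and details are assumptions for illustration:

    // Collect images meeting minimum dimensions, skipping excluded formats
    function collectImages(minWidth = 200, minHeight = 200, excluded = ["svg", "gif"]) {
      return Array.from(document.querySelectorAll<HTMLImageElement>("img"))
        .filter((img) => img.naturalWidth >= minWidth && img.naturalHeight >= minHeight)
        .filter((img) => {
          const ext = new URL(img.currentSrc || img.src, location.href)
            .pathname.split(".").pop()?.toLowerCase() ?? "";
          return !excluded.includes(ext);
        })
        .map((img) => ({
          url: img.currentSrc || img.src,
          width: img.naturalWidth,
          height: img.naturalHeight,
          alt: img.alt,
        }));
    }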

Process continuous data feeds and real-time updates from dynamic web sources.

  • Live chat monitoring
  • Stock price tracking
  • Social media feeds
  • Real-time notifications
timeline
    title Real-time Data Streaming Process

    section Monitoring Phase
        Initial Setup    : Monitor configuration
                        : Target selection
                        : Polling intervals

        Change Detection : Content monitoring
                        : Delta identification
                        : Change classification

    section Processing Phase
        Data Extraction  : Extract new content
                        : Parse changes
                        : Validate data

        Stream Processing: Buffer management
                         : Batch processing
                         : Real-time output

    section Output Phase
        Data Delivery   : Stream to endpoints
                       : Update subscribers
                       : Maintain state
  1. Change Detection Setup

    // Monitor configuration
    {
      "watchSelector": ".live-data, .feed-item",
      "pollInterval": 5000,
      "changeDetection": "content",
      "maxItems": 100
    }
  2. Delta Processing

    // Process only new/changed content
    {
      "deltaProcessing": {
        "trackBy": "id",
        "compareFields": ["content", "timestamp"],
        "onNew": "process",
        "onChange": "update",
        "onDelete": "archive"
      }
    }
  3. Stream Processing

    // Real-time data processing
    {
      "streamConfig": {
        "bufferSize": 50,
        "flushInterval": 10000,
        "processingMode": "batch",
        "errorHandling": "continue"
      }
    }
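
To tie the delta and stream steps together, here is a minimal TypeScript sketch of tracking items by id, detecting new or changed entries, and flushing them in batches; the trackBy/buffer semantics shown are assumptions matching the configurations above, not a specific node implementation:

    // Delta detection keyed by id, with batched output
    interface FeedItem { id: string; content: string; timestamp: string }

    const seen = new Map<string, FeedItem>();
    let buffer: FeedItem[] = [];

    function onPoll(items: FeedItem[], flush: (batch: FeedItem[]) => void, bufferSize = 50) {
      for (const item of items) {
        const previous = seen.get(item.id);
        const changed = previous &&
          (previous.content !== item.content || previous.timestamp !== item.timestamp);
        if (!previous || changed) {
          seen.set(item.id, item);          // onNew / onChange: (re)process the item
          buffer.push(item);
        }
      }
      if (buffer.length >= bufferSize) {    // flush in batches, as in streamConfig
        flush(buffer);
        buffer = [];
      }
    }
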
// Aggregate extracted data
{
  "aggregations": [
    {
      "field": "price",
      "operations": ["min", "max", "avg", "sum"]
    },
    {
      "field": "category",
      "operation": "groupBy",
      "subAggregations": ["count", "avg:price"]
    }
  ]
}
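
A compact TypeScript sketch of what these aggregation rules compute (numeric summaries for price, plus a category groupBy with count and average price); the helper is illustrative and assumes at least one item:

    // min/max/avg/sum for price, and groupBy category with count and avg price
    function aggregatePrices(items: { price: number; category: string }[]) {
      const prices = items.map((i) => i.price);
      const sum = prices.reduce((a, b) => a + b, 0);
      const price = {
        min: Math.min(...prices),
        max: Math.max(...prices),
        avg: prices.length ? sum / prices.length : 0,
        sum,
      };

      const totals = new Map<string, { count: number; total: number }>();
      for (const item of items) {
        const t = totals.get(item.category) ?? { count: 0, total: 0 };
        t.count += 1;
        t.total += item.price;
        totals.set(item.category, t);
      }
      const category: Record<string, { count: number; avgPrice: number }> = {};
      for (const [name, t] of totals) {
        category[name] = { count: t.count, avgPrice: t.total / t.count };
      }

      return { price, category };
    }
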
// Enrich data with external sources
{
  "enrichmentRules": [
    {
      "field": "location",
      "source": "geocoding_api",
      "mapping": {
        "input": "address",
        "output": ["latitude", "longitude", "timezone"]
      }
    }
  ]
}
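
The enrichment rule above maps an address to coordinates and a timezone via an external geocoding source. A sketch of that shape in TypeScript, where geocodeAddress is a hypothetical placeholder for whatever geocoding provider the workflow uses:

    // Enrich a record with geocoded location fields (provider call is a stub)
    async function geocodeAddress(
      address: string
    ): Promise<{ latitude: number; longitude: number; timezone: string }> {
      // Call your geocoding provider here; this stub only illustrates the output shape.
      throw new Error("not implemented");
    }

    async function enrichLocation<T extends { address: string }>(record: T) {
      const { latitude, longitude, timezone } = await geocodeAddress(record.address);
      return { ...record, latitude, longitude, timezone };
    }
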
// Normalize data formats
{
  "normalizationRules": [
    {
      "field": "date",
      "inputFormat": "MM/DD/YYYY",
      "outputFormat": "ISO8601"
    },
    {
      "field": "currency",
      "baseCurrency": "USD",
      "conversionAPI": "exchange_rates_api"
    }
  ]
}
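
For the date rule, the conversion is a simple reshaping of MM/DD/YYYY into ISO 8601, as in the sketch below (the currency rule depends on an external rates API and is not sketched here); the function name is illustrative:

    // Normalize "MM/DD/YYYY" into an ISO 8601 date string
    function normalizeDate(input: string): string {
      const match = /^(\d{2})\/(\d{2})\/(\d{4})$/.exec(input.trim());
      if (!match) throw new Error(`Unexpected date format: ${input}`);
      const [, month, day, year] = match;
      return `${year}-${month}-${day}`; // e.g. "01/15/2024" -> "2024-01-15"
    }
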
  • Process data in chunks to manage memory usage (see the sketch after this list)
  • Implement parallel processing for independent operations
  • Use streaming for large datasets
  • Cache processed results to avoid recomputation
  • Implement intelligent cache invalidation
  • Use persistent storage for long-term caching
  • Clean up temporary data regularly
  • Use efficient data structures
  • Implement garbage collection triggers
  • Implement comprehensive validation rules
  • Handle malformed data gracefully
  • Provide detailed error reporting
  • Implement automatic retry mechanisms
  • Provide fallback processing methods
  • Maintain processing state for recovery
  • Monitor data quality metrics
  • Implement anomaly detection
  • Provide data quality reports
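
As a concrete illustration of two of these practices, chunked processing combined with automatic retries, here is a minimal TypeScript sketch; the chunk size, retry count, and backoff are illustrative defaults, not prescribed values:

    // Process items in fixed-size chunks, retrying each chunk with linear backoff
    async function processInChunks<T, R>(
      items: T[],
      handle: (chunk: T[]) => Promise<R>,
      chunkSize = 100,
      retries = 3
    ): Promise<R[]> {
      const results: R[] = [];
      for (let start = 0; start < items.length; start += chunkSize) {
        const chunk = items.slice(start, start + chunkSize);
        let attempt = 0;
        while (true) {
          try {
            results.push(await handle(chunk));
            break;
          } catch (err) {
            attempt += 1;
            if (attempt >= retries) throw err;                        // give up after N attempts
            await new Promise((r) => setTimeout(r, 1000 * attempt));  // simple backoff
          }
        }
      }
      return results;
    }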