Data Processing Patterns
Data processing is the backbone of effective browser automation workflows. This guide covers comprehensive patterns for handling various data types and transformation scenarios.
Text Processing Pattern
Overview
Extract, clean, and transform textual content from web pages with advanced processing capabilities.
Use Cases
- Article content extraction and cleaning
- Comment and review processing
- Social media text analysis
- Document content extraction
Implementation
Workflow Structure
flowchart LR
A[GetAllText Node] --> B[EditFields Node]
B --> C[Filter Node]
C --> D[Transform Data]
D --> E[Output Results]
A --> A1[Extract with Context]
A --> A2[Preserve Formatting]
A --> A3[Include Metadata]
B --> B1[Clean Text]
B --> B2[Normalize Content]
B --> B3[Calculate Metrics]
C --> C1[Quality Filter]
C --> C2[Relevance Filter]
C --> C3[Length Filter]
D --> D1[Keyword Extraction]
D --> D2[Sentiment Analysis]
D --> D3[Content Classification]
style A fill:#e3f2fd
style B fill:#fff3e0
style C fill:#f3e5f5
style D fill:#e8f5e8
style E fill:#e1f5fe
Step-by-Step Implementation
1. Text Extraction with Context
   // GetAllText node configuration
   {
     "selector": "article, .content, .post-body",
     "preserveFormatting": true,
     "includeMetadata": true,
     "excludeSelectors": [".ads", ".sidebar", ".comments"]
   }
2. Text Cleaning and Normalization
   // EditFields node - text processing
   {
     "operations": [
       {
         "field": "content",
         "operation": "clean",
         "rules": ["removeExtraWhitespace", "removeSpecialChars", "normalizeLineBreaks"]
       },
       {
         "field": "wordCount",
         "operation": "calculate",
         "formula": "content.split(' ').length"
       }
     ]
   }
3. Content Analysis
   // Advanced text processing
   {
     "operations": [
       {
         "field": "keywords",
         "operation": "extract",
         "pattern": "\\b[A-Z][a-z]+(?:\\s+[A-Z][a-z]+)*\\b",
         "limit": 10
       },
       {
         "field": "sentiment",
         "operation": "analyze",
         "type": "sentiment"
       }
     ]
   }
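The cleaning, word-count, and keyword steps above can be approximated outside the workflow engine. The following is a minimal Python sketch, not the nodes' actual implementation: the function names are hypothetical, and the meaning given to each cleaning rule (`removeExtraWhitespace`, `removeSpecialChars`, `normalizeLineBreaks`) is an assumption. Only the keyword pattern is taken from the configuration.

```python
import re

# Pattern copied from the Content Analysis step: runs of capitalized words.
KEYWORD_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")


def clean_text(raw):
    """Approximate the three cleaning rules (assumed semantics)."""
    text = raw.replace("\r\n", "\n").replace("\r", "\n")  # normalizeLineBreaks
    text = re.sub(r"[^\w\s.,!?'-]", "", text)             # removeSpecialChars
    return re.sub(r"\s+", " ", text).strip()              # removeExtraWhitespace


def word_count(text):
    # split() without arguments ignores repeated whitespace.
    return len(text.split())


def extract_keywords(text, limit=10):
    """Return unique matches in order of first appearance, capped at limit."""
    seen = []
    for match in KEYWORD_PATTERN.findall(text):
        if match not in seen:
            seen.append(match)
    return seen[:limit]


cleaned = clean_text("New  Browser\r\nTools ©  arrive in   Berlin today")
print(cleaned, word_count(cleaned), extract_keywords(cleaned))
```

Because the pattern matches any capitalized run, sentence-initial words are also reported as keywords; a stop-word filter would be a natural refinement.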
Expected Output
{
  "content": "Clean, processed article content...",
  "wordCount": 1250,
  "keywords": ["Technology", "Innovation", "Future"],
  "sentiment": "positive",
  "metadata": {
    "extractedAt": "2024-01-15T10:30:00Z",
    "source": "https://example.com/article"
  }
}
Structured Data Extraction Pattern
Overview
Parse and extract structured data from tables, lists, forms, and other organized content.
Use Cases
- Financial data tables
- Product specification lists
- Directory information
- Form data extraction
Implementation
Workflow Structure
graph TB
A[GetAllHTML Node] --> B[ProcessHTML Node]
B --> C[Parse Structure]
C --> D[Validate Data]
D --> E[Output Structured Data]
subgraph "Structure Types"
F[Tables] --> F1[Header Detection]
F --> F2[Column Mapping]
F --> F3[Data Type Conversion]
G[Lists] --> G1[Item Extraction]
G --> G2[Hierarchy Detection]
G --> G3[Metadata Parsing]
H[Forms] --> H1[Field Identification]
H --> H2[Value Extraction]
H --> H3[Validation Rules]
end
B --> F
B --> G
B --> H
F1 --> C
F2 --> C
F3 --> C
G1 --> C
G2 --> C
G3 --> C
H1 --> C
H2 --> C
H3 --> C
style A fill:#e3f2fd
style B fill:#fff3e0
style E fill:#e8f5e8
Step-by-Step Implementation
1. HTML Structure Extraction
   // GetAllHTML with structure preservation
   {
     "selector": "table, .data-grid, .spec-list",
     "preserveStructure": true,
     "includeAttributes": ["class", "id", "data-*"]
   }
2. Table Processing
   // ProcessHTML for table data
   {
     "tableProcessing": {
       "headerRow": 0,
       "skipRows": [],
       "columnMapping": {
         "0": "product_name",
         "1": "price",
         "2": "availability",
         "3": "rating"
       },
       "dataTypes": {
         "price": "currency",
         "rating": "number",
         "availability": "boolean"
       }
     }
   }
3. List Structure Processing
   // ProcessHTML for list data
   {
     "listProcessing": {
       "itemSelector": "li, .list-item",
       "extractionRules": [
         {"field": "title", "selector": ".item-title, h3"},
         {"field": "description", "selector": ".item-desc, p"},
         {"field": "metadata", "selector": ".item-meta", "parseAs": "keyValue"}
       ]
     }
   }
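To make the table step concrete, here is a rough Python sketch of what column mapping and data type conversion might do to rows that have already been extracted as lists of cell strings. The mapping and type names mirror the configuration above. The conversion rules (which strings count as `true`, how currency symbols are stripped) and the function names are assumptions, and `skipRows` is left out for brevity.

```python
import re

COLUMN_MAPPING = {0: "product_name", 1: "price", 2: "availability", 3: "rating"}
DATA_TYPES = {"price": "currency", "rating": "number", "availability": "boolean"}


def convert(value, kind):
    """Convert one cell string; the meaning of each type name is assumed."""
    value = value.strip()
    if kind == "currency":
        # Drop symbols and thousands separators: "$1,299.00" -> 1299.0
        return float(re.sub(r"[^\d.-]", "", value))
    if kind == "number":
        return float(value)
    if kind == "boolean":
        return value.lower() in {"yes", "true", "in stock", "available"}
    return value


def parse_table(rows, header_row=0):
    records = []
    for index, cells in enumerate(rows):
        if index == header_row:
            continue  # field names come from COLUMN_MAPPING, not the header
        record = {}
        for position, cell in enumerate(cells):
            field = COLUMN_MAPPING.get(position)
            if field is not None:
                record[field] = convert(cell, DATA_TYPES.get(field))
        records.append(record)
    return records


rows = [
    ["Product", "Price", "Available", "Rating"],
    ["Laptop", "$1,299.00", "In Stock", "4.5"],
]
records = parse_table(rows)
print(records)
```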
Data Validation
// Validation rules
// The email pattern accepts top-level domains of two or more characters,
// so longer TLDs are not rejected.
{
  "validationRules": [
    {
      "field": "price",
      "type": "number",
      "min": 0,
      "required": true
    },
    {
      "field": "email",
      "type": "email",
      "pattern": "^[\\w-\\.]+@([\\w-]+\\.)+[\\w-]{2,}$"
    }
  ]
}
Media Processing Pattern
Overview
Extract, process, and analyze images, videos, and other media content from web pages.
Use Cases
- Image gallery extraction
- Video metadata collection
- Media file processing
- Visual content analysis
Implementation
Workflow Structure
sequenceDiagram
participant W as Workflow
participant GI as GetAllImages
participant IP as ImageProcessor
participant ME as MediaExtractor
participant A as Analysis Engine
participant O as Output
W->>GI: Extract images from page
GI->>IP: Send image collection
IP->>IP: Resize & optimize images
IP->>ME: Process optimized images
ME->>ME: Extract metadata & properties
ME->>A: Send for content analysis
A->>A: Detect objects & extract text
A->>O: Return processed results
Note over IP: Image processing:<br/>resize, format conversion,<br/>quality optimization
Note over A: AI-powered analysis:<br/>object detection,<br/>text extraction,<br/>content classification
Step-by-Step Implementation
1. Image Collection
   // GetAllImages with filtering
   {
     "selector": "img, picture source",
     "minWidth": 200,
     "minHeight": 200,
     "excludeTypes": ["svg", "gif"],
     "includeMetadata": true
   }
2. Image Processing
   // ImageProcessor configuration
   {
     "operations": [
       {"type": "resize", "width": 800, "height": 600, "maintainAspect": true},
       {"type": "format", "outputFormat": "webp", "quality": 85},
       {"type": "analyze", "extractColors": true, "detectObjects": true}
     ]
   }
3. Media Metadata Extraction
   // MediaExtractor for comprehensive metadata
   {
     "extractionTypes": ["dimensions", "fileSize", "format", "colorProfile", "exifData"],
     "analysisOptions": {
       "detectFaces": true,
       "extractText": true,
       "classifyContent": true
     }
   }
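Two parts of the configuration above are easy to misread: the size and type filter, and what `maintainAspect` implies for an 800 by 600 target. The Python sketch below shows one plausible interpretation. It assumes the image type can be read from the URL's file extension and that resizing never upscales; both are assumptions, as are the function names.

```python
def passes_filter(image, min_width=200, min_height=200, exclude_types=("svg", "gif")):
    """Keep images that are large enough and not of an excluded type."""
    # Assumption: the type is taken from the URL's file extension.
    extension = image["url"].rsplit(".", 1)[-1].lower()
    return (
        image["width"] >= min_width
        and image["height"] >= min_height
        and extension not in exclude_types
    )


def fit_within(width, height, max_width=800, max_height=600):
    """Scale to fit inside the box, keeping the aspect ratio, never upscaling."""
    scale = min(max_width / width, max_height / height, 1.0)
    return round(width * scale), round(height * scale)


images = [
    {"url": "https://example.com/photo.jpg", "width": 1600, "height": 900},
    {"url": "https://example.com/logo.svg", "width": 500, "height": 500},
    {"url": "https://example.com/icon.png", "width": 64, "height": 64},
]
kept = [image for image in images if passes_filter(image)]
print([fit_within(image["width"], image["height"]) for image in kept])
```

Under this reading a 1600 by 900 image becomes 800 by 450, not 800 by 600, because the limiting dimension sets the scale.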
Expected Output
{
  "images": [
    {
      "url": "https://example.com/image.jpg",
      "processedUrl": "...",
      "metadata": {
        "width": 800,
        "height": 600,
        "format": "webp",
        "size": 45678,
        "colors": ["#FF5733", "#33FF57", "#3357FF"],
        "objects": ["person", "car", "building"]
      }
    }
  ]
}
Real-time Data Streaming Pattern
Overview
Process continuous data feeds and real-time updates from dynamic web sources.
Use Cases
- Live chat monitoring
- Stock price tracking
- Social media feeds
- Real-time notifications
Implementation
Workflow Structure
timeline
title Real-time Data Streaming Process
section Monitoring Phase
Initial Setup : Monitor configuration
: Target selection
: Polling intervals
Change Detection : Content monitoring
: Delta identification
: Change classification
section Processing Phase
Data Extraction : Extract new content
: Parse changes
: Validate data
Stream Processing: Buffer management
: Batch processing
: Real-time output
section Output Phase
Data Delivery : Stream to endpoints
: Update subscribers
: Maintain state
Step-by-Step Implementation
1. Change Detection Setup
   // Monitor configuration
   {
     "watchSelector": ".live-data, .feed-item",
     "pollInterval": 5000,
     "changeDetection": "content",
     "maxItems": 100
   }
2. Delta Processing
   // Process only new/changed content
   {
     "deltaProcessing": {
       "trackBy": "id",
       "compareFields": ["content", "timestamp"],
       "onNew": "process",
       "onChange": "update",
       "onDelete": "archive"
     }
   }
3. Stream Processing
   // Real-time data processing
   {
     "streamConfig": {
       "bufferSize": 50,
       "flushInterval": 10000,
       "processingMode": "batch",
       "errorHandling": "continue"
     }
   }
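The delta step is the core of this pattern, so a small illustration may help. The Python sketch below compares two polling snapshots and emits the action named in the configuration for each new, changed, or removed item. The function name and the event shape (an action string paired with the item) are hypothetical; the workflow engine may represent changes differently.

```python
def diff_snapshots(previous, current, track_by="id",
                   compare_fields=("content", "timestamp")):
    """Return (action, item) pairs describing how current differs from previous."""
    old = {item[track_by]: item for item in previous}
    new = {item[track_by]: item for item in current}
    events = []
    for key, item in new.items():
        if key not in old:
            events.append(("process", item))   # onNew
        elif any(item.get(name) != old[key].get(name) for name in compare_fields):
            events.append(("update", item))    # onChange
    for key, item in old.items():
        if key not in new:
            events.append(("archive", item))   # onDelete
    return events


previous = [
    {"id": 1, "content": "a", "timestamp": 1},
    {"id": 2, "content": "b", "timestamp": 1},
]
current = [
    {"id": 2, "content": "b (edited)", "timestamp": 2},
    {"id": 3, "content": "c", "timestamp": 2},
]
events = diff_snapshots(previous, current)
print([(action, item["id"]) for action, item in events])
```

Items whose compared fields are unchanged produce no event, which is what keeps downstream processing proportional to the amount of change instead of the size of the feed.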
Advanced Data Transformation Patterns
Data Aggregation
// Aggregate extracted data
{
  "aggregations": [
    {
      "field": "price",
      "operations": ["min", "max", "avg", "sum"]
    },
    {
      "field": "category",
      "operation": "groupBy",
      "subAggregations": ["count", "avg:price"]
    }
  ]
}
Data Enrichment
// Enrich data with external sources
{
  "enrichmentRules": [
    {
      "field": "location",
      "source": "geocoding_api",
      "mapping": {
        "input": "address",
        "output": ["latitude", "longitude", "timezone"]
      }
    }
  ]
}
Data Normalization
// Normalize data formats
{
  "normalizationRules": [
    {
      "field": "date",
      "inputFormat": "MM/DD/YYYY",
      "outputFormat": "ISO8601"
    },
    {
      "field": "currency",
      "baseCurrency": "USD",
      "conversionAPI": "exchange_rates_api"
    }
  ]
}
Performance Optimization
Batch Processing
- Process data in chunks to manage memory usage
- Implement parallel processing for independent operations
- Use streaming for large datasets
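As an illustration of the batching points above, the following Python sketch splits an arbitrary iterable into fixed-size chunks with a generator and processes independent chunks on a thread pool. `process_batch` is a placeholder for real work, and the chunk size and worker count are arbitrary choices for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunked(iterable, size):
    """Yield lists of at most size items without materializing the whole input."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def process_batch(batch):
    # Placeholder for an independent, per-chunk operation.
    return [value * 2 for value in batch]


# Executor.map submits every chunk up front and yields results in input order,
# so this form suits inputs whose chunks all fit in memory at once.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = [
        item
        for output in pool.map(process_batch, chunked(range(10), 4))
        for item in output
    ]
print(results)
```

For inputs too large to submit at once, iterating over `chunked(...)` in a plain loop keeps only one chunk in memory at a time, at the cost of parallelism.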
Caching Strategies
- Cache processed results to avoid recomputation
- Implement intelligent cache invalidation
- Use persistent storage for long-term caching
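One simple way to realize the first two points is to key cached results by a hash of the input and expire entries after a fixed time. The Python sketch below is an in-memory illustration only: the `ResultCache` class is hypothetical, time-based expiry is just one possible invalidation policy, and persistent storage is not covered.

```python
import hashlib
import json
import time


class ResultCache:
    """In-memory cache keyed by a hash of the JSON-serializable input."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # key -> (stored_at, value)

    @staticmethod
    def key_for(payload):
        encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()

    def get_or_compute(self, payload, compute):
        key = self.key_for(payload)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]      # fresh hit: skip recomputation
        value = compute(payload)
        self._entries[key] = (now, value)
        return value


cache = ResultCache(ttl_seconds=60)
length = cache.get_or_compute({"text": "hello"}, lambda payload: len(payload["text"]))
print(length)
```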
Memory Management
- Clean up temporary data regularly
- Use efficient data structures
- Implement garbage collection triggers
Error Handling and Recovery
Data Validation
- Implement comprehensive validation rules
- Handle malformed data gracefully
- Provide detailed error reporting
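A sketch of what rule-based validation with detailed error reporting could look like in Python follows. It reuses the shape of the `price` rule from the structured data pattern. The `validate` function is hypothetical, handles only `required`, `type: number`, and `min`, and collects every problem as a message instead of raising on the first one.

```python
RULES = [
    {"field": "price", "type": "number", "min": 0, "required": True},
]


def validate(record, rules):
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    for rule in rules:
        field = rule["field"]
        value = record.get(field)
        if value is None:
            if rule.get("required"):
                errors.append(f"{field}: missing required value")
            continue
        if rule.get("type") == "number":
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                errors.append(f"{field}: expected a number, got {value!r}")
                continue
            if "min" in rule and value < rule["min"]:
                errors.append(f"{field}: {value} is below minimum {rule['min']}")
    return errors


for record in [{"price": 5}, {}, {"price": -1}, {"price": "abc"}]:
    print(record, validate(record, RULES))
```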
Recovery Strategies
- Implement automatic retry mechanisms
- Provide fallback processing methods
- Maintain processing state for recovery
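Retry and fallback can be combined in a small helper. The Python sketch below retries a callable with exponential backoff and, if every attempt fails, tries an optional fallback before re-raising the last error. The helper name, the attempt count, and the delays are illustrative assumptions, and a production version would likely catch narrower exception types.

```python
import time


def run_with_retry(primary, fallback=None, attempts=3, base_delay=0.5,
                   sleep=time.sleep):
    """Call primary up to attempts times, then fallback, then re-raise."""
    last_error = None
    for attempt in range(attempts):
        try:
            return primary()
        except Exception as error:  # broad on purpose for this sketch
            last_error = error
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    if fallback is not None:
        return fallback()
    raise last_error


state = {"calls": 0}


def flaky_extract():
    # Fails twice, then succeeds, to simulate a transient page error.
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("element not found")
    return "extracted"


delays = []
result = run_with_retry(flaky_extract, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting, as the usage above shows by recording the delays instead of sleeping.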
Quality Assurance
- Monitor data quality metrics
- Implement anomaly detection
- Provide data quality reports
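As one possible starting point for the quality checks above, the Python sketch below computes a completeness metric for a field and flags numeric outliers with a z-score test. Both function names are hypothetical and the threshold of 2.0 is an arbitrary choice; z-scores are a deliberately simple technique that a single extreme value can distort, so robust alternatives may suit real data better.

```python
from statistics import mean, pstdev


def completeness(records, field):
    """Fraction of records with a non-empty value for field."""
    if not records:
        return 0.0
    filled = sum(1 for record in records if record.get(field) not in (None, ""))
    return filled / len(records)


def find_anomalies(values, threshold=2.0):
    """Return values more than threshold standard deviations from the mean."""
    if len(values) < 2:
        return []
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [value for value in values if abs(value - center) / spread > threshold]


prices = [10, 11, 9, 10, 12, 10, 11, 100]
records = [{"price": 10}, {"price": None}, {"price": 12}, {}]
print(find_anomalies(prices), completeness(records, "price"))
```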