
Get HTML From Link

Before using this node, ensure you have:

  • Basic understanding of workflow creation in Agentic WorkFlow
  • Appropriate browser permissions configured (if applicable)
  • Required dependencies installed and configured

The Get HTML From Link node retrieves the complete HTML source code from web pages, providing access to the full document structure, metadata, and embedded content. This node is essential for advanced web automation, content analysis, and workflows that require detailed understanding of page structure and elements.

This node performs comprehensive HTML extraction by:

  • Fetching complete HTML source code from target URLs
  • Preserving document structure, attributes, and embedded content
  • Handling both static and dynamically generated HTML content
  • Providing raw HTML data for parsing, analysis, and manipulation
  • Supporting complex web applications with JavaScript-rendered content
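
For orientation, the sketch below shows the simplest possible retrieval path (a plain fetch of the URL) in TypeScript. It is illustrative only and does not reflect the node's internal implementation; in particular, a bare fetch cannot capture JavaScript-rendered content, which this node handles separately.

```ts
// Illustrative sketch of the simplest retrieval path; not the node's actual code.
interface HtmlResult {
  html: string;
  htmlSize: number;
  metadata: { url: string; contentType: string; timestamp: string };
}

async function fetchHtml(url: string, timeoutMs = 30000): Promise<HtmlResult> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetch(url, { signal: controller.signal });
    const html = await response.text();
    return {
      html,
      htmlSize: html.length,
      metadata: {
        url: response.url,
        contentType: response.headers.get("content-type") ?? "unknown",
        timestamp: new Date().toISOString(),
      },
    };
  } finally {
    clearTimeout(timer);
  }
}
```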

Key features:

  • Complete HTML Retrieval: Captures the full document source including head, body, and all elements
  • Dynamic Content Support: Handles JavaScript-rendered content and single-page applications
  • Metadata Preservation: Maintains all HTML attributes, classes, IDs, and data attributes
  • Security Aware: Implements safe HTML handling with sanitization options

Common use cases:

  • Web Extraction: Extract structured data from HTML elements for database population or analysis
  • Content Analysis: Analyze page structure, SEO elements, and content organization
  • Template Extraction: Capture page layouts and structures for replication or analysis
  • Quality Assurance: Validate HTML structure, accessibility compliance, and content standards
Required Parameters

| Parameter | Type | Description | Example |
|-----------|------|-------------|---------|
| url | string | The target URL from which to extract HTML content | "https://example.com/page" |

Optional Parameters

| Parameter | Type | Default | Description | Example |
|-----------|------|---------|-------------|---------|
| waitForLoad | boolean | true | Wait for complete page load, including dynamic content | true |
| timeout | number | 30000 | Maximum time to wait for page load (milliseconds) | 20000 |
| includeResources | boolean | false | Include inline CSS and JavaScript in the output | true |
| sanitizeHTML | boolean | true | Remove potentially dangerous HTML elements and attributes | false |
| preserveFormatting | boolean | true | Maintain original HTML formatting and whitespace | false |

Complete configuration example:

```json
{
  "url": "https://example.com/page",
  "waitForLoad": true,
  "timeout": 30000,
  "includeResources": false,
  "sanitizeHTML": true,
  "preserveFormatting": true,
  "extractionOptions": {
    "removeComments": false,
    "minifyOutput": false,
    "validateHTML": true
  }
}
```

Required Permissions

| Permission | Purpose | Security Impact |
|------------|---------|-----------------|
| activeTab | Access content of the current active tab | Can read all HTML content from the active webpage |
| scripting | Execute content scripts for HTML extraction | Can run JavaScript and access the DOM in the web page context |
| webRequest | Monitor and modify network requests if needed | Can intercept and analyze HTTP requests and responses |

Browser APIs used:

  • chrome.tabs API: For navigating to target URLs and managing browser tabs
  • chrome.scripting API: For executing content scripts that access document.documentElement.outerHTML
  • Fetch API: For making HTTP requests to retrieve page content when direct DOM access isn’t available
  • Document Object Model (DOM): For accessing complete HTML structure and content
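
As a rough sketch of how these APIs fit together, the following assumes a Manifest V3 extension context with the scripting permission and the @types/chrome type definitions installed; it is not the node's actual source code.

```ts
// Illustrative only; assumes a Manifest V3 extension with the "scripting"
// permission and @types/chrome available.
async function getPageHtml(tabId: number): Promise<string> {
  const [injection] = await chrome.scripting.executeScript({
    target: { tabId },
    // Runs inside the page context and returns the serialized document.
    func: () => document.documentElement.outerHTML,
  });
  return injection.result as string;
}

// Example usage: capture the HTML of the currently active tab.
async function htmlFromActiveTab(): Promise<string> {
  const [tab] = await chrome.tabs.query({ active: true, currentWindow: true });
  if (tab?.id === undefined) {
    throw new Error("No active tab available");
  }
  return getPageHtml(tab.id);
}
```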

Browser Compatibility

| Feature | Chrome | Firefox | Safari | Edge |
|---------|--------|---------|--------|------|
| Basic HTML Extraction | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Dynamic Content | ✅ Full | ✅ Full | ⚠️ Limited | ✅ Full |
| Resource Inclusion | ✅ Full | ✅ Full | ❌ None | ✅ Full |
| HTML Sanitization | ✅ Full | ✅ Full | ⚠️ Limited | ✅ Full |

Security considerations:

  • Cross-Site Scripting (XSS): Always sanitize HTML content before processing to prevent XSS attacks
  • Content Security Policy: Respect CSP headers and avoid executing inline scripts from extracted HTML
  • Data Validation: Validate HTML structure and content before using in workflows
  • Privacy Protection: Be aware that HTML may contain tracking pixels, analytics code, and personal data
  • Malicious Content: Implement content filtering to detect and handle potentially harmful HTML elements
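
The snippet below is one possible way to strip active content from extracted HTML before further processing, using the standard DOMParser API. It is a simplified sketch, not the node's built-in sanitizeHTML behavior; for genuinely untrusted content a vetted library such as DOMPurify is a safer choice.

```ts
// Simplified sanitization sketch; not the node's built-in sanitizer.
function sanitizeHtml(rawHtml: string): string {
  // DOMParser does not execute scripts while parsing, so this is safe to run
  // on untrusted markup in a page or extension context.
  const doc = new DOMParser().parseFromString(rawHtml, "text/html");

  // Remove elements that can execute code or load active content.
  doc.querySelectorAll("script, iframe, object, embed").forEach((el) => el.remove());

  // Strip inline event handlers (onclick, onload, ...) and javascript: URLs.
  doc.querySelectorAll("*").forEach((el) => {
    for (const attr of Array.from(el.attributes)) {
      const name = attr.name.toLowerCase();
      const value = attr.value.trim().toLowerCase();
      if (name.startsWith("on") || value.startsWith("javascript:")) {
        el.removeAttribute(attr.name);
      }
    }
  });

  return doc.documentElement.outerHTML;
}
```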

Input Schema

```json
{
  "url": "string",
  "options": {
    "waitForLoad": "boolean",
    "timeout": "number",
    "includeResources": "boolean",
    "sanitizeHTML": "boolean",
    "preserveFormatting": "boolean"
  }
}
```

Output Schema

```json
{
  "html": "string",
  "htmlSize": "number",
  "elementCount": "number",
  "metadata": {
    "title": "string",
    "url": "string",
    "contentType": "string",
    "timestamp": "ISO_8601_string",
    "extractionTime": "number_ms",
    "pageLoadTime": "number_ms",
    "htmlValidation": {
      "isValid": "boolean",
      "errors": "array"
    }
  }
}
```
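
If a downstream step consumes this output in typed code, interfaces like the following can help catch shape mismatches early. They are a direct transcription of the schemas above; the names GetHtmlInput and GetHtmlOutput are illustrative, not part of the node's API.

```ts
// Typed mirror of the input and output schemas above (names are illustrative).
interface GetHtmlInput {
  url: string;
  options?: {
    waitForLoad?: boolean;
    timeout?: number;
    includeResources?: boolean;
    sanitizeHTML?: boolean;
    preserveFormatting?: boolean;
  };
}

interface GetHtmlOutput {
  html: string;
  htmlSize: number;
  elementCount: number;
  metadata: {
    title: string;
    url: string;
    contentType: string;
    timestamp: string;       // ISO 8601
    extractionTime: number;  // milliseconds
    pageLoadTime: number;    // milliseconds
    htmlValidation: {
      isValid: boolean;
      errors: string[];
    };
  };
}
```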

Practical Examples

Example 1: E-commerce Product Page Analysis

Scenario: Extract HTML structure from product pages to analyze pricing, availability, and product information

Configuration:

```json
{
  "url": "https://shop.example.com/product/laptop-pro-2024",
  "waitForLoad": true,
  "timeout": 20000,
  "sanitizeHTML": true,
  "includeResources": false
}
```

Input Data:

```json
{
  "url": "https://shop.example.com/product/laptop-pro-2024"
}
```

Expected Output:

```json
{
  "html": "<!DOCTYPE html><html lang=\"en\"><head><title>Laptop Pro 2024 - TechStore</title>...</head><body><div class=\"product-container\">...</div></body></html>",
  "htmlSize": 45672,
  "elementCount": 342,
  "metadata": {
    "title": "Laptop Pro 2024 - TechStore",
    "url": "https://shop.example.com/product/laptop-pro-2024",
    "contentType": "text/html; charset=utf-8",
    "timestamp": "2024-01-15T10:30:00Z",
    "extractionTime": 280,
    "pageLoadTime": 3200,
    "htmlValidation": {
      "isValid": true,
      "errors": []
    }
  }
}
```

Step-by-Step Process:

  1. Navigate to the product page URL
  2. Wait for dynamic content (price, availability) to load
  3. Extract complete HTML including product data attributes
  4. Sanitize HTML to remove tracking scripts and ads
  5. Return structured HTML with validation metadata
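
A downstream parsing step might look like the sketch below. The CSS selectors (.product-title, .price, .availability) are hypothetical; real product pages will use different markup, and the sketch assumes a browser or extension environment where DOMParser is available.

```ts
// Hypothetical downstream step that parses the extracted product-page HTML.
// The selectors are placeholders and must be adapted to the actual page.
function parseProductPage(html: string) {
  const doc = new DOMParser().parseFromString(html, "text/html");
  return {
    title: doc.querySelector(".product-title")?.textContent?.trim() ?? null,
    price: doc.querySelector(".price")?.textContent?.trim() ?? null,
    inStock:
      doc.querySelector(".availability")?.textContent?.includes("In stock") ?? false,
  };
}
```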

Example 2: SEO and Content Structure Analysis


Scenario: Analyze website HTML structure for SEO compliance and content organization

Configuration:

```json
{
  "url": "https://blog.example.com/seo-best-practices",
  "waitForLoad": true,
  "sanitizeHTML": false,
  "preserveFormatting": true,
  "includeResources": true
}
```

Workflow Integration:

URL Input (target_url) → Get HTML From Link (complete_html) → HTML Parser (parsed_data) → SEO Analysis (seo_metrics) → Report Generator (final_report)

This configuration preserves all HTML elements, including meta tags, structured data, and inline resources, giving SEO analysis tools comprehensive data for evaluating page optimization and content structure.
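
As an illustration of what the SEO Analysis step might compute from the extracted HTML, the sketch below derives a few common checks with DOMParser. The metric names are illustrative and are not an output defined by this node.

```ts
// Illustrative SEO checks over extracted HTML; metric names are not a defined
// output of any node in this workflow.
function analyzeSeo(html: string) {
  const doc = new DOMParser().parseFromString(html, "text/html");
  return {
    title: doc.title,
    titleLength: doc.title.length,
    metaDescription:
      doc.querySelector('meta[name="description"]')?.getAttribute("content") ?? null,
    h1Count: doc.querySelectorAll("h1").length,
    imagesMissingAlt: doc.querySelectorAll("img:not([alt])").length,
    canonicalUrl:
      doc.querySelector('link[rel="canonical"]')?.getAttribute("href") ?? null,
  };
}
```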

Basic Usage Example

This example demonstrates the fundamental usage of the GetHTMLFromLink node in a typical workflow scenario.

Configuration:

```json
{
  "url": "https://example.com/page",
  "waitForLoad": true,
  "timeout": 30000
}
```

Input Data:

```json
{
  "url": "https://example.com/page"
}
```

Expected Output:

```json
{
  "html": "<!DOCTYPE html><html lang=\"en\">...</html>",
  "htmlSize": 10240,
  "elementCount": 150,
  "metadata": { "title": "Example Page", "url": "https://example.com/page" }
}
```

Advanced Configuration Example

This example shows more complex configuration options and integration patterns.

Configuration:

```json
{
  "url": "https://example.com/page",
  "waitForLoad": true,
  "timeout": 20000,
  "includeResources": true,
  "sanitizeHTML": false,
  "preserveFormatting": true,
  "extractionOptions": {
    "removeComments": false,
    "minifyOutput": false,
    "validateHTML": true
  }
}
```

Example showing how this node integrates with other workflow nodes:

  1. Previous Node → GetHTMLFromLink → Next Node
  2. Data flows through the workflow with appropriate transformations
  3. Error handling and validation at each step

Common workflow patterns:

  • Nodes: Get HTML From Link → HTML Parser → Data Extractor → Database Storage
  • Use Case: Systematic data extraction from websites for business intelligence
  • Configuration Tips: Enable sanitization for security; disable resource inclusion for faster processing

  • Nodes: URL List → Get HTML From Link → HTML Validator → Quality Report
  • Use Case: Automated website quality assurance and compliance checking
  • Data Flow: Multiple URLs are processed sequentially, HTML is validated against standards, and comprehensive quality reports are generated

Best practices:

  • Performance: Use appropriate timeouts based on expected page complexity and load times
  • Security: Always enable HTML sanitization when processing untrusted content
  • Resource Management: Disable resource inclusion unless specifically needed to reduce payload size
  • Error Handling: Implement robust error handling for network failures and invalid HTML
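
For the error-handling recommendation, a generic retry wrapper like the one below can be placed around whichever step fetches the HTML. withRetry is an illustrative helper, not part of the node's API.

```ts
// Illustrative retry helper for transient network failures; not part of the
// node's API. Backs off linearly between attempts.
async function withRetry<T>(operation: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // Wait a little longer after each failure before retrying.
      await new Promise((resolve) => setTimeout(resolve, attempt * 1000));
    }
  }
  throw new Error(`Operation failed after ${attempts} attempts: ${String(lastError)}`);
}

// Example: retry a plain HTML fetch up to three times.
// const html = await withRetry(() => fetch("https://example.com/page").then((r) => r.text()));
```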

Issue: Dynamic Content Missing from Extracted HTML

  • Symptoms: Missing elements or content that appears in the browser but not in the extracted HTML
  • Causes: JavaScript-rendered content not fully loaded, AJAX requests still pending
  • Solutions:
    1. Increase the timeout value and keep waitForLoad enabled to allow complete rendering
    2. Check whether the page uses lazy loading or infinite scroll
    3. Verify that dynamic content has finished loading before extraction (see the wait-for-selector sketch after this issue)
  • Prevention: Test with known static content first, implement proper wait conditions
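
One way to implement such a wait condition is a small polling helper that runs in the page context and resolves once a known element appears. The selector is illustrative and must be adapted to the target page.

```ts
// Illustrative wait-for-selector helper; polls the DOM until the element
// appears or the timeout elapses. Intended to run in the page context.
async function waitForSelector(
  selector: string,
  timeoutMs = 10000,
  intervalMs = 250,
): Promise<Element> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const el = document.querySelector(selector);
    if (el) return el;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Timed out after ${timeoutMs}ms waiting for "${selector}"`);
}
```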

Issue: HTML Sanitization Removes Required Content

  • Symptoms: Important elements or attributes missing from output
  • Causes: Overly aggressive sanitization removing legitimate HTML elements
  • Solutions:
    1. Disable sanitization if content source is trusted
    2. Configure custom sanitization rules to preserve required elements (see the allowlist sketch below)
    3. Use post-processing to restore necessary attributes
  • Prevention: Review sanitization settings and test with sample content
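
A custom rule set could look like the allowlist sketch below, which keeps data-* attributes and a configurable set of other attributes while still dropping script elements. The default allowlist is an assumption for illustration, not the node's built-in rules, and the sketch focuses on preserving content rather than providing complete security filtering.

```ts
// Illustrative allowlist-based sanitizer; defaults are assumptions, not the
// node's built-in rules. Keeps data-* attributes plus a configurable set.
function sanitizeWithAllowlist(
  rawHtml: string,
  keepAttributes: string[] = ["id", "class", "href", "src"],
): string {
  const doc = new DOMParser().parseFromString(rawHtml, "text/html");

  // Still drop script elements, since they are rarely needed downstream.
  doc.querySelectorAll("script").forEach((el) => el.remove());

  doc.querySelectorAll("*").forEach((el) => {
    for (const attr of Array.from(el.attributes)) {
      const name = attr.name.toLowerCase();
      const keep = keepAttributes.includes(name) || name.startsWith("data-");
      if (!keep) el.removeAttribute(attr.name);
    }
  });

  return doc.documentElement.outerHTML;
}
```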

Browser-specific notes:

  • Content Security Policy may prevent access to some dynamically loaded content
  • Use appropriate permissions and handle CSP-related errors gracefully
  • Similar CSP restrictions, may require additional configuration for complex sites
  • WebExtensions API provides equivalent functionality with minor syntax differences

Known limitations:

  • Large HTML Files: Pages with extensive HTML (>1MB) may cause memory issues
  • Complex JavaScript: Heavy client-side rendering can significantly increase extraction time
  • Network Latency: Slow connections may cause timeouts before HTML is fully loaded
  • Client-Side Rendering: Some content may not be available until JavaScript execution completes
  • Authentication Barriers: Cannot extract HTML from pages requiring login credentials
  • Rate Limiting: Target websites may implement rate limiting that blocks rapid requests (see the throttling sketch after this list)
  • Same-Origin Policy: Restrictions on accessing content from different domains
  • Content Security Policy: Strict CSP headers may prevent HTML extraction
  • Memory Constraints: Very large HTML documents may exceed browser memory limits
  • File Size: HTML files larger than 10MB may cause processing issues
  • Character Encoding: Non-UTF-8 content may require special handling
  • Binary Content: Embedded binary data in HTML may not be properly preserved
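
To stay within target-site rate limits when many pages must be fetched, a simple sequential loop with a fixed delay between requests is often enough. In this sketch, extractHtml stands in for whatever function invokes this node and is not a real API.

```ts
// Illustrative sequential throttling loop; extractHtml is a placeholder for
// whichever step actually retrieves the HTML.
async function processUrlsSequentially(
  urls: string[],
  extractHtml: (url: string) => Promise<string>,
  delayMs = 2000,
): Promise<Map<string, string>> {
  const results = new Map<string, string>();
  for (const url of urls) {
    results.set(url, await extractHtml(url));
    // Pause between requests so rapid calls don't trigger rate limiting.
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return results;
}
```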

DOM: Document Object Model - Programming interface for web documents

CORS: Cross-Origin Resource Sharing - Security feature controlling cross-domain requests

CSP: Content Security Policy - Security standard preventing code injection attacks

Browser API: Programming interfaces provided by web browsers for extension functionality

Content Script: JavaScript code that runs in the context of web pages

Web Extraction: Automated extraction of data from websites

  • GetAllTextFromLink: Use when you need clean text content without HTML markup
  • GetLinksFromLink: Use when you specifically need to extract links rather than the full HTML
  • Code: Works well together in workflows
  • EditFields: Works well together in workflows
  • Filter: Works well together in workflows
  • GetHTMLFromLink → Code → EditFields: Common integration pattern
  • GetHTMLFromLink → Filter → DownloadAsFile: Common integration pattern


Version history:

  • Added HTML validation and error reporting in metadata
  • Improved handling of large HTML documents
  • Enhanced sanitization options with custom rule support
  • 1.2.0: Added resource inclusion option and better dynamic content handling
  • 1.1.0: Implemented HTML sanitization and security improvements
  • 1.0.0: Initial release with basic HTML extraction functionality

Last Updated: October 18, 2024
Tested With: Browser Extension v2.1.0
Validation Status: ✅ Code Examples Tested | ✅ Browser Compatibility Verified | ✅ User Tested