Get HTML From Link
Prerequisites
Before using this node, ensure you have:
- A basic understanding of workflow creation in Agentic WorkFlow
- Appropriate browser permissions configured (if applicable)
- Required dependencies installed and configured
Overview
The Get HTML From Link node retrieves the complete HTML source code from web pages, providing access to the full document structure, metadata, and embedded content. This node is essential for advanced web automation, content analysis, and workflows that require detailed understanding of page structure and elements.
Purpose and Functionality
This node performs comprehensive HTML extraction by:
- Fetching complete HTML source code from target URLs
- Preserving document structure, attributes, and embedded content
- Handling both static and dynamically generated HTML content
- Providing raw HTML data for parsing, analysis, and manipulation
- Supporting complex web applications with JavaScript-rendered content
Key Features
- Complete HTML Retrieval: Captures full document source including head, body, and all elements
- Dynamic Content Support: Handles JavaScript-rendered content and single-page applications
- Metadata Preservation: Maintains all HTML attributes, classes, IDs, and data attributes
- Security Aware: Implements safe HTML handling with sanitization options
Primary Use Cases
- Web Extraction: Extract structured data from HTML elements for database population or analysis
- Content Analysis: Analyze page structure, SEO elements, and content organization
- Template Extraction: Capture page layouts and structures for replication or analysis
- Quality Assurance: Validate HTML structure, accessibility compliance, and content standards
Parameters & Configuration
Required Parameters
| Parameter | Type | Description | Example |
|---|---|---|---|
| url | string | The target URL from which to extract HTML content | "https://example.com/page" |
Optional Parameters
| Parameter | Type | Default | Description | Example |
|---|---|---|---|---|
| waitForLoad | boolean | true | Wait for complete page load including dynamic content | true |
| timeout | number | 30000 | Maximum time to wait for page load (milliseconds) | 20000 |
| includeResources | boolean | false | Include inline CSS and JavaScript in output | true |
| sanitizeHTML | boolean | true | Remove potentially dangerous HTML elements and attributes | false |
| preserveFormatting | boolean | true | Maintain original HTML formatting and whitespace | false |
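As a rough illustration of how the documented defaults combine with user-supplied values, the Python sketch below merges and type-checks a configuration before it is handed to the node. The `build_config` helper and its error messages are hypothetical and not part of the node; only the parameter names, types, and defaults are taken from the tables above.

```python
from urllib.parse import urlparse

# Defaults copied from the Optional Parameters table.
DEFAULTS = {
    "waitForLoad": True,
    "timeout": 30000,
    "includeResources": False,
    "sanitizeHTML": True,
    "preserveFormatting": True,
}


def build_config(user_config: dict) -> dict:
    """Merge user values over the documented defaults (hypothetical helper)."""
    url = user_config.get("url")
    if not isinstance(url, str) or urlparse(url).scheme not in ("http", "https"):
        raise ValueError("url is required and must be an http(s) URL")

    config = {"url": url, **DEFAULTS}
    for key, default in DEFAULTS.items():
        if key not in user_config:
            continue
        value = user_config[key]
        if isinstance(default, bool):
            valid = isinstance(value, bool)
        else:
            # bool is a subclass of int in Python, so exclude it explicitly.
            valid = (
                isinstance(value, (int, float))
                and not isinstance(value, bool)
                and value > 0
            )
        if not valid:
            raise ValueError(f"invalid value for {key}: {value!r}")
        config[key] = value
    # Keys not listed in the tables are ignored by this sketch.
    return config
```

Calling `build_config({"url": "https://example.com/page", "timeout": 20000})` keeps the four remaining defaults and overrides only `timeout`.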
Advanced Configuration
```json
{
  "url": "https://example.com/page",
  "waitForLoad": true,
  "timeout": 30000,
  "includeResources": false,
  "sanitizeHTML": true,
  "preserveFormatting": true,
  "extractionOptions": {
    "removeComments": false,
    "minifyOutput": false,
    "validateHTML": true
  }
}
```
Browser API Integration
Required Permissions
| Permission | Purpose | Security Impact |
|---|---|---|
| activeTab | Access content of the current active tab | Can read all HTML content from the active webpage |
| scripting | Execute content scripts for HTML extraction | Can run JavaScript and access DOM in web page context |
| webRequest | Monitor and modify network requests if needed | Can intercept and analyze HTTP requests and responses |
Browser APIs Used
- chrome.tabs API: For navigating to target URLs and managing browser tabs
- chrome.scripting API: For executing content scripts that access document.documentElement.outerHTML
- Fetch API: For making HTTP requests to retrieve page content when direct DOM access isn’t available
- Document Object Model (DOM): For accessing complete HTML structure and content
Cross-Browser Compatibility
| Feature | Chrome | Firefox | Safari | Edge |
|---|---|---|---|---|
| Basic HTML Extraction | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Dynamic Content | ✅ Full | ✅ Full | ⚠️ Limited | ✅ Full |
| Resource Inclusion | ✅ Full | ✅ Full | ❌ None | ✅ Full |
| HTML Sanitization | ✅ Full | ✅ Full | ⚠️ Limited | ✅ Full |
Security Considerations
- Cross-Site Scripting (XSS): Always sanitize HTML content before processing to prevent XSS attacks
- Content Security Policy: Respect CSP headers and avoid executing inline scripts from extracted HTML
- Data Validation: Validate HTML structure and content before using in workflows
- Privacy Protection: Be aware that HTML may contain tracking pixels, analytics code, and personal data
- Malicious Content: Implement content filtering to detect and handle potentially harmful HTML elements
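To make the sanitization and filtering points above more concrete, here is a minimal Python sketch that drops a few dangerous elements, inline event handlers, and `javascript:` URLs from an HTML string. It is not the node's `sanitizeHTML` implementation, which is not documented here; the `SimpleSanitizer` class, the `sanitize` function, and the blocked-tag list are illustrative assumptions, and the filter is deliberately incomplete. For untrusted input in production, a vetted sanitizer is the safer choice.

```python
from html import escape
from html.parser import HTMLParser

# Elements dropped together with everything inside them (illustrative list).
BLOCKED_TAGS = {"script", "iframe", "object"}


class SimpleSanitizer(HTMLParser):
    """Rebuilds HTML while skipping blocked elements and risky attributes."""

    def __init__(self) -> None:
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts: list[str] = []
        self.blocked_depth = 0

    def _render(self, tag, attrs) -> str:
        kept = []
        for name, value in attrs:
            if name.startswith("on"):  # onclick, onload, ...
                continue
            if value is not None and value.strip().lower().startswith("javascript:"):
                continue
            kept.append(name if value is None else f'{name}="{escape(value, quote=True)}"')
        return "<" + " ".join([tag, *kept]) + ">"

    def handle_starttag(self, tag, attrs):
        if tag in BLOCKED_TAGS:
            self.blocked_depth += 1
        elif not self.blocked_depth:
            self.parts.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags have no separate end tag, so the depth is untouched.
        if tag not in BLOCKED_TAGS and not self.blocked_depth:
            self.parts.append(self._render(tag, attrs))

    def handle_endtag(self, tag):
        if tag in BLOCKED_TAGS:
            self.blocked_depth = max(0, self.blocked_depth - 1)
        elif not self.blocked_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.blocked_depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if not self.blocked_depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if not self.blocked_depth:
            self.parts.append(f"&#{name};")


def sanitize(html: str) -> str:
    """Return a filtered copy of html; comments and the doctype are dropped."""
    parser = SimpleSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

The sketch rebuilds the document from parser events rather than pattern-matching on the raw string, which avoids the usual pitfalls of regex-based HTML filtering.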
Input/Output Specifications
Input Data Structure
```json
{
  "url": "string",
  "options": {
    "waitForLoad": "boolean",
    "timeout": "number",
    "includeResources": "boolean",
    "sanitizeHTML": "boolean",
    "preserveFormatting": "boolean"
  }
}
```
Output Data Structure
```json
{
  "html": "string",
  "htmlSize": "number",
  "elementCount": "number",
  "metadata": {
    "title": "string",
    "url": "string",
    "contentType": "string",
    "timestamp": "ISO_8601_string",
    "extractionTime": "number_ms",
    "pageLoadTime": "number_ms",
    "htmlValidation": {
      "isValid": "boolean",
      "errors": "array"
    }
  }
}
```
Practical Examples
Example 1: E-commerce Product Page Analysis
Scenario: Extract HTML structure from product pages to analyze pricing, availability, and product information
Configuration:
```json
{
  "url": "https://shop.example.com/product/laptop-pro-2024",
  "waitForLoad": true,
  "timeout": 20000,
  "sanitizeHTML": true,
  "includeResources": false
}
```
Input Data:
```json
{
  "url": "https://shop.example.com/product/laptop-pro-2024"
}
```
Expected Output:
```json
{
  "html": "<!DOCTYPE html><html lang=\"en\"><head><title>Laptop Pro 2024 - TechStore</title>...</head><body><div class=\"product-container\">...</div></body></html>",
  "htmlSize": 45672,
  "elementCount": 342,
  "metadata": {
    "title": "Laptop Pro 2024 - TechStore",
    "url": "https://shop.example.com/product/laptop-pro-2024",
    "contentType": "text/html; charset=utf-8",
    "timestamp": "2024-01-15T10:30:00Z",
    "extractionTime": 280,
    "pageLoadTime": 3200,
    "htmlValidation": {
      "isValid": true,
      "errors": []
    }
  }
}
```
Step-by-Step Process:
- Navigate to the product page URL
- Wait for dynamic content (price, availability) to load
- Extract complete HTML including product data attributes
- Sanitize HTML to remove tracking scripts and ads
- Return structured HTML with validation metadata
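The last step above returns the HTML together with summary fields. How the node computes `htmlSize` and `elementCount` is not specified on this page, so the following Python sketch makes two assumptions: `htmlSize` is the UTF-8 byte length of the HTML string, and `elementCount` is the number of start tags. The `summarize` function is a hypothetical helper for a downstream step, not part of the node.

```python
from html.parser import HTMLParser


class _PageStats(HTMLParser):
    """Counts start tags and collects the text of the <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self.element_count = 0
        self.title_parts: list[str] = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        self.element_count += 1
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def summarize(html: str) -> dict:
    """Build a partial output record (assumed field semantics, see lead-in)."""
    stats = _PageStats()
    stats.feed(html)
    stats.close()
    return {
        "htmlSize": len(html.encode("utf-8")),
        "elementCount": stats.element_count,
        "metadata": {"title": "".join(stats.title_parts).strip()},
    }
```

A workflow could compare these recomputed values with the ones reported by the node as a quick sanity check on the extraction.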
Example 2: SEO and Content Structure Analysis
Scenario: Analyze website HTML structure for SEO compliance and content organization
Configuration:
```json
{
  "url": "https://blog.example.com/seo-best-practices",
  "waitForLoad": true,
  "sanitizeHTML": false,
  "preserveFormatting": true,
  "includeResources": true
}
```
Workflow Integration:
```
URL Input → Get HTML From Link → HTML Parser → SEO Analysis → Report Generator
    ↓               ↓                 ↓             ↓                ↓
target_url    complete_html      parsed_data   seo_metrics     final_report
```
Complete Example: This configuration preserves all HTML elements including meta tags, structured data, and inline resources, providing comprehensive data for SEO analysis tools to evaluate page optimization and content structure.
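As a sketch of what the SEO Analysis stage might do with the extracted HTML, the Python example below collects the meta description, canonical link, and heading counts, then flags two common problems. The `seo_metrics` function, its output keys, and the two checks are assumptions chosen for illustration; the actual analysis nodes are not described on this page.

```python
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class SeoExtractor(HTMLParser):
    """Collects named <meta> tags, the canonical link, and heading counts."""

    def __init__(self) -> None:
        super().__init__()
        self.meta: dict[str, str] = {}
        self.canonical = None
        self.heading_counts: dict[str, int] = {}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("name") and attr.get("content") is not None:
            self.meta[attr["name"].lower()] = attr["content"]
        elif tag == "link" and (attr.get("rel") or "").lower() == "canonical":
            self.canonical = attr.get("href")
        elif tag in HEADINGS:
            self.heading_counts[tag] = self.heading_counts.get(tag, 0) + 1


def seo_metrics(html: str) -> dict:
    """Return a small, illustrative set of SEO signals for one page."""
    extractor = SeoExtractor()
    extractor.feed(html)
    extractor.close()

    issues = []
    if not extractor.meta.get("description"):
        issues.append("missing meta description")
    h1_count = extractor.heading_counts.get("h1", 0)
    if h1_count != 1:
        issues.append(f"expected exactly one h1, found {h1_count}")

    return {
        "description": extractor.meta.get("description"),
        "canonical": extractor.canonical,
        "headingCounts": extractor.heading_counts,
        "issues": issues,
    }
```

Because meta tags live in the document head, this kind of analysis depends on the full page source that the configuration above preserves.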
Examples
Basic Usage
This example demonstrates the fundamental usage of the GetHTMLFromLink node in a typical workflow scenario. The values shown are placeholders.
Configuration:
```json
{
  "url": "example_value",
  "followRedirects": true
}
```
Input Data:
```json
{
  "data": "sample input data"
}
```
Expected Output:
```json
{
  "result": "processed output data"
}
```
Advanced Usage
This example shows more complex configuration options and integration patterns. The parameter names are placeholders.
Configuration:
```json
{
  "parameter1": "advanced_value",
  "parameter2": false,
  "advancedOptions": {
    "option1": "value1",
    "option2": 100
  }
}
```
Integration Example
Example showing how this node integrates with other workflow nodes:
- Previous Node → GetHTMLFromLink → Next Node
- Data flows through the workflow with appropriate transformations
- Error handling and validation at each step
Integration Patterns
Common Node Combinations
Pattern 1: Web Extraction Pipeline
- Nodes: Get HTML From Link → HTML Parser → Data Extractor → Database Storage
- Use Case: Systematic data extraction from websites for business intelligence
- Configuration Tips: Enable sanitization for security, disable resource inclusion for faster processing
Pattern 2: Content Quality Analysis
- Nodes: URL List → Get HTML From Link → HTML Validator → Quality Report
- Use Case: Automated website quality assurance and compliance checking
- Data Flow: Multiple URLs processed sequentially, HTML validated against standards, comprehensive quality reports generated
Best Practices
- Performance: Use appropriate timeouts based on expected page complexity and load times
- Security: Always enable HTML sanitization when processing untrusted content
- Resource Management: Disable resource inclusion unless specifically needed to reduce payload size
- Error Handling: Implement robust error handling for network failures and invalid HTML
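One way to approach the error-handling advice above is to retry transient failures with exponential backoff. The Python sketch below wraps a caller-supplied `fetch` callable, which is a hypothetical stand-in for whatever executes the node and returns its output; the exception types treated as transient are also an assumption.

```python
import time
from typing import Callable


def fetch_with_retry(
    fetch: Callable[[str], dict],
    url: str,
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch(url), retrying on timeouts and connection errors."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except (TimeoutError, ConnectionError) as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Wait 1x, 2x, 4x ... the base delay between attempts.
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"giving up on {url} after {attempts} attempts") from last_error
```

Passing `sleep` as a parameter keeps the helper easy to test, and other exception types propagate immediately so that genuine bugs are not masked by retries.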
Troubleshooting
Common Issues
Issue: Incomplete HTML Content
- Symptoms: Missing elements or content that appears in the browser but not in the extracted HTML
- Causes: JavaScript-rendered content not fully loaded, AJAX requests still pending
- Solutions:
  - Keep waitForLoad enabled and increase timeout to allow complete rendering
  - Check whether the page uses lazy loading or infinite scroll
  - Verify that dynamic content has finished loading before extraction
- Prevention: Test with known static content first, then implement proper wait conditions
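A simple wait condition for dynamically rendered pages is to poll the HTML until it stops changing. The Python sketch below illustrates the idea under stated assumptions: `get_html` is a hypothetical callable that returns the current page source, and "stable" means a configurable number of consecutive identical snapshots. It is not how the node's `waitForLoad` option is implemented.

```python
import time
from typing import Callable


def wait_for_stable_html(
    get_html: Callable[[], str],
    timeout_ms: int = 30000,
    interval_ms: int = 500,
    stable_polls: int = 2,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_html() until stable_polls consecutive snapshots are identical."""
    deadline = clock() + timeout_ms / 1000
    previous = get_html()
    unchanged = 0
    while clock() < deadline:
        sleep(interval_ms / 1000)
        current = get_html()
        unchanged = unchanged + 1 if current == previous else 0
        if unchanged >= stable_polls:
            return current
        previous = current
    raise TimeoutError(f"HTML did not stabilize within {timeout_ms} ms")
```

Pages that update continuously, for example with clocks or rotating banners, never stabilize under this definition, so a timeout is still required.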
Issue: HTML Sanitization Removes Required Content
- Symptoms: Important elements or attributes missing from output
- Causes: Overly aggressive sanitization removing legitimate HTML elements
- Solutions:
  - Disable sanitization if content source is trusted
  - Configure custom sanitization rules to preserve required elements
  - Use post-processing to restore necessary attributes
- Prevention: Review sanitization settings and test with sample content
Browser-Specific Issues
Chrome
- Content Security Policy may prevent access to some dynamically loaded content
- Use appropriate permissions and handle CSP-related errors gracefully
Firefox
- Similar CSP restrictions, may require additional configuration for complex sites
- WebExtensions API provides equivalent functionality with minor syntax differences
Performance Issues
- Large HTML Files: Pages with extensive HTML (>1MB) may cause memory issues
- Complex JavaScript: Heavy client-side rendering can significantly increase extraction time
- Network Latency: Slow connections may cause timeouts before HTML is fully loaded
Limitations & Constraints
Technical Limitations
- Client-Side Rendering: Some content may not be available until JavaScript execution completes
- Authentication Barriers: Cannot extract HTML from pages requiring login credentials
- Rate Limiting: Target websites may implement rate limiting that blocks rapid requests
Browser Limitations
- Same-Origin Policy: Restrictions on accessing content from different domains
- Content Security Policy: Strict CSP headers may prevent HTML extraction
- Memory Constraints: Very large HTML documents may exceed browser memory limits
Data Limitations
- File Size: HTML files larger than 10MB may cause processing issues
- Character Encoding: Non-UTF-8 content may require special handling
- Binary Content: Embedded binary data in HTML may not be properly preserved
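The size and encoding limitations above can be handled defensively in a downstream step. The Python sketch below assumes the raw response bytes and a `contentType` string are available; the `decode_html` helper is hypothetical, and the fallback to UTF-8 with replacement characters is one possible policy rather than the node's documented behavior.

```python
import re

# The 10MB figure comes from the File Size limitation above.
MAX_HTML_BYTES = 10 * 1024 * 1024


def decode_html(raw: bytes, content_type: str = "") -> str:
    """Decode raw HTML using the charset named in content_type, if any."""
    if len(raw) > MAX_HTML_BYTES:
        raise ValueError(f"HTML is {len(raw)} bytes, above the {MAX_HTML_BYTES} byte limit")

    match = re.search(r'charset="?([\w.-]+)', content_type, re.IGNORECASE)
    encoding = match.group(1) if match else "utf-8"
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong charset: fall back to UTF-8 and mark bad bytes.
        return raw.decode("utf-8", errors="replace")
```

For example, `decode_html("café".encode("latin-1"), "text/html; charset=ISO-8859-1")` returns the original string, whereas decoding the same bytes as UTF-8 would fail.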
Key Terminology
DOM: Document Object Model - Programming interface for web documents
CORS: Cross-Origin Resource Sharing - Security feature controlling cross-domain requests
CSP: Content Security Policy - Security standard preventing code injection attacks
Browser API: Programming interfaces provided by web browsers for extension functionality
Content Script: JavaScript code that runs in the context of web pages
Web Extraction: Automated extraction of data from websites
Learning Path
Skill Level: Intermediate
Related Nodes
Similar Functionality
- GetAllTextFromLink: Use when you need clean text content without HTML markup
- GetLinksFromLink: Use when you only need to extract links rather than the full HTML
Complementary Nodes
- Code: Works well together in workflows
- EditFields: Works well together in workflows
- Filter: Works well together in workflows
Common Workflow Patterns
- GetHTMLFromLink → Code → EditFields: Common integration pattern
- GetHTMLFromLink → Filter → DownloadAsFile: Common integration pattern
See Also
- Browser Content Extraction
- Web Automation Patterns
- Multi-Node Automation
- Integration Patterns
- Browser Security Guide
Version History
Current Version: 1.3.0
- Added HTML validation and error reporting in metadata
- Improved handling of large HTML documents
- Enhanced sanitization options with custom rule support
Previous Versions
- 1.2.0: Added resource inclusion option and better dynamic content handling
- 1.1.0: Implemented HTML sanitization and security improvements
- 1.0.0: Initial release with basic HTML extraction functionality
Additional Resources
- Web Extraction Security Guide
- HTML Processing Workflows
- Advanced Web Automation
- Content Analysis Patterns
Last Updated: October 18, 2024
Tested With: Browser Extension v2.1.0
Validation Status: ✅ Code Examples Tested | ✅ Browser Compatibility Verified | ✅ User Tested