Get All HTML
The Get All HTML node extracts the complete HTML source code from a webpage, giving you all the underlying structure and content for analysis, archiving, or processing. Think of it as having a web developer’s view of how the page is built.
This is perfect for SEO analysis, competitor research, web archiving, or understanding how websites are structured. Instead of just seeing the visual page, you get the complete code that creates it.
How it works
The node captures the complete HTML source code of the webpage, including all tags, attributes, and structure. It can optionally format the code for easier reading and extract metadata like SEO tags and page information.
graph LR
Page[Web Page] --> Extractor{HTML Extractor}
Extractor --> Code[HTML Code]
Extractor --> Meta[Metadata]
style Extractor fill:#6d28d9,stroke:#fff,color:#fff
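
As a rough illustration of the metadata branch in the diagram, the sketch below pulls the page title and `<meta>` tags out of an HTML string. It is not the node's actual implementation, which is not documented here; `MetadataParser`, `extract_metadata`, and the sample markup are hypothetical names chosen for this example, and only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects the <title> text and <meta> name/property -> content pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Standard SEO tags use name="..."; Open Graph tags use property="...".
            key = attrs.get("name") or attrs.get("property")
            if key and attrs.get("content") is not None:
                self.meta[key.lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html_text: str) -> dict:
    """Return the title and meta tags found in an HTML string."""
    parser = MetadataParser()
    parser.feed(html_text)
    parser.close()
    return {"title": parser.title.strip(), "meta": parser.meta}


# Hypothetical sample page, used only to exercise the sketch.
sample_html = """<html><head>
<title>Example Page</title>
<meta name="description" content="A short demo page.">
<meta name="keywords" content="demo, html">
<meta property="og:title" content="Example">
</head><body><p>Hello</p></body></html>"""

metadata = extract_metadata(sample_html)
print(metadata["title"])
print(metadata["meta"])
```

The same approach can be applied to HTML saved from the node's output when you want to post-process it outside the workflow.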
Setup guide
- Navigate to Target Page: Make sure you’re on the webpage whose HTML you want to extract.
- Configure Options: Choose whether to include metadata, exclude scripts, or format the output.
- Run Extraction: The node captures all the HTML source code from the page.
- Process Results: Use the HTML for analysis, archiving, or further processing with other tools.
Practical example: SEO analysis
Let’s extract HTML from a webpage to analyze its SEO structure and meta tags.
What you configure:
- Include Metadata: To capture SEO tags like description and keywords.
- Exclude Scripts: To remove JavaScript code for cleaner output.
- Pretty Print: To format the code with indentation so it’s easier to read.
What you get:
- HTML Code: The full source code of the page.
- Page Stats: Title, size of the page, and number of elements.
- Metadata: Details like description, keywords, and social media tags.
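
To make the "Page Stats" output more concrete, here is a minimal sketch of how a title, page size, and element count could be computed from an HTML string. The exact fields and units the node reports are not specified above, so `page_stats`, its dictionary keys, and the choice of UTF-8 bytes for size are assumptions made for illustration.

```python
from html.parser import HTMLParser


class StatsParser(HTMLParser):
    """Counts elements and captures the <title> text."""

    def __init__(self):
        super().__init__()
        self.element_count = 0
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <br/>.
        self.element_count += 1
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_stats(html_text: str) -> dict:
    """Return basic statistics for an HTML string (keys are hypothetical)."""
    parser = StatsParser()
    parser.feed(html_text)
    parser.close()
    return {
        "title": parser.title.strip(),
        "size_bytes": len(html_text.encode("utf-8")),  # assumed unit: UTF-8 bytes
        "element_count": parser.element_count,
    }


# Hypothetical sample page, used only to exercise the sketch.
page = "<html><head><title>Demo</title></head><body><p>Hi</p><br></body></html>"
stats = page_stats(page)
print(stats)
```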
Common settings
| Setting | Purpose | When to Use |
|---|---|---|
| Include Metadata | Extract SEO and page information | For SEO analysis and content research |
| Exclude Scripts | Remove JavaScript code | For cleaner analysis or security |
| Pretty Print | Format HTML for easier reading | When you need to review the code manually |
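
The effect of "Exclude Scripts" and "Pretty Print" can be approximated outside the node with a short script. The sketch below is an assumption about what such options do, not the node's own code: `HtmlFormatter` and `format_html` are hypothetical names, the formatter expects reasonably well-formed markup, and it does not preserve whitespace inside elements such as `<pre>`.

```python
import html
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the indent depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
RAW_TEXT_TAGS = {"script", "style"}


class HtmlFormatter(HTMLParser):
    """Re-emits HTML one node per line, indented by nesting depth."""

    def __init__(self, exclude_scripts=False, indent="  "):
        super().__init__(convert_charrefs=True)
        self.exclude_scripts = exclude_scripts
        self.indent = indent
        self.lines = []
        self.depth = 0
        self._skipping = False  # inside an excluded <script>
        self._raw = False       # inside <script>/<style>: text is not escaped

    def _emit(self, text):
        self.lines.append(self.indent * self.depth + text)

    def handle_decl(self, decl):
        self._emit(f"<!{decl}>")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")

    def handle_starttag(self, tag, attrs):
        if tag == "script" and self.exclude_scripts:
            self._skipping = True
            return
        # Reuse the original tag text so attributes are kept verbatim.
        self._emit(self.get_starttag_text())
        if tag in RAW_TEXT_TAGS:
            self._raw = True
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: emit once, no depth change.
        if tag == "script" and self.exclude_scripts:
            return
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script" and self._skipping:
            self._skipping = False
            return
        if tag in VOID_TAGS:
            return
        self._raw = False
        self.depth = max(self.depth - 1, 0)
        self._emit(f"</{tag}>")

    def handle_data(self, data):
        if self._skipping:
            return
        text = data.strip()
        if text:
            # Character references were decoded by the parser, so re-escape text.
            self._emit(text if self._raw else html.escape(text, quote=False))


def format_html(html_text: str, exclude_scripts: bool = False) -> str:
    formatter = HtmlFormatter(exclude_scripts=exclude_scripts)
    formatter.feed(html_text)
    formatter.close()
    return "\n".join(formatter.lines)


# Hypothetical sample page, used only to exercise the sketch.
raw = ("<!DOCTYPE html><html><head><title>Demo</title>"
       "<script>var a = 1 < 2;</script></head>"
       "<body><p>Hi &amp; bye</p><br></body></html>")
formatted = format_html(raw, exclude_scripts=True)
print(formatted)
```

Calling `format_html(raw)` without `exclude_scripts=True` keeps the `<script>` element and its contents in the output.
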
Troubleshooting
- HTML looks messy: Enable “Pretty Print” to format it with proper indentation for easier reading
- Too much code: Enable “Exclude Scripts” to focus on content structure rather than functionality
- Missing dynamic content: Content generated by JavaScript after the initial page load may be absent from the extracted HTML; wait for the page to finish rendering before running the extraction
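
One way to "wait before extraction" is to poll the page until its HTML stops changing. The sketch below shows that idea in isolation. `wait_for_stable_html` and its parameters are hypothetical, and `get_html` stands in for whatever mechanism returns the current page source in your setup; none of these are part of the node's documented interface.

```python
import time
from typing import Callable


def wait_for_stable_html(get_html: Callable[[], str],
                         interval: float = 0.5,
                         timeout: float = 10.0,
                         stable_checks: int = 2) -> str:
    """Poll get_html() until it returns identical markup `stable_checks`
    times in a row, or until `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    previous = get_html()
    unchanged = 0
    while unchanged < stable_checks and time.monotonic() < deadline:
        time.sleep(interval)
        current = get_html()
        # Reset the counter whenever the markup changes between polls.
        unchanged = unchanged + 1 if current == previous else 0
        previous = current
    return previous


# Simulated page that gains a paragraph after the first poll (hypothetical data).
snapshots = iter([
    "<p>a</p>",
    "<p>a</p><p>b</p>",
    "<p>a</p><p>b</p>",
    "<p>a</p><p>b</p>",
])
state = {"html": ""}


def fake_get_html() -> str:
    # Return the next snapshot, repeating the last one once the list is exhausted.
    state["html"] = next(snapshots, state["html"])
    return state["html"]


final_html = wait_for_stable_html(fake_get_html, interval=0.0)
print(final_html)
```

If the timeout is reached first, the function returns the most recent snapshot, which may still be incomplete.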