AI-Powered Web Extraction

This example demonstrates how to create intelligent web extraction workflows that combine browser extension nodes with AI models to extract, process, and structure data from websites automatically.

Traditional web extraction relies on fixed selectors and patterns. AI-powered extraction uses machine learning to intelligently identify and extract relevant information, making it more robust and adaptable to different website structures.
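
To make the contrast concrete, here is a minimal sketch; the ".price" selector is an arbitrary example, and the Agent/GetAllHTML calls follow the node API used throughout this page:

// Traditional approach: a fixed selector that silently breaks
// whenever the page markup changes
const price = document.querySelector(".price")?.textContent;

// AI-powered approach: describe the data you want and let the model
// find it in whatever structure the page actually uses
const aiPrice = await Agent.execute({
  input: await GetAllHTML.execute(),
  prompt: "Find the product price on this page and return it as a number"
});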

This workflow combines the following nodes; a short sketch wiring some of them together follows the list:

  • GetAllLinks Node: Discovers all links on a page for navigation
  • GetAllHTML Node: Captures page structure for AI analysis
  • GetAllImages Node: Collects images for visual content analysis
  • Agent Node: Uses AI to identify relevant content patterns
  • Structured Output Parser: Converts unstructured content to structured data
  • Text Classifier: Categorizes extracted content automatically
  • Vector Store: Stores extracted data with semantic search capabilities
  • Memory: Maintains context across multiple pages
  • Workflow Tool: Chains multiple extraction steps together
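
A few of these nodes do not appear in the main example below. As a hedged sketch of how GetAllImages and Memory might be wired together (their exact execute() signatures are assumptions, patterned on the other node calls on this page):

// Collect the page's images and ask the AI which ones matter
const images = await GetAllImages.execute();

const imageSummary = await Agent.execute({
  input: images,
  prompt: "Describe which of these images are product photos"
});

// Persist what was seen so later pages can reference it
// (the operation/key/value shape is an assumption)
await Memory.execute({
  operation: "store",
  key: currentPage.url, // currentPage is provided by the workflow context
  value: imageSummary
});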

The example below puts these nodes together to extract structured product data from an e-commerce page:

// Workflow: Intelligent Product Data Extraction
// 1. Analyze page structure with AI
const pageAnalysis = await Agent.execute({
  input: await GetAllHTML.execute(),
  tools: [ThinkTool],
  prompt: `Analyze this e-commerce page and identify:
- Product name and description
- Price information
- Product images
- Specifications or features
- Customer reviews section`
});

// 2. Extract structured product data
const productData = await StructuredOutputParser.execute({
  input: pageAnalysis,
  schema: {
    name: "string",
    price: "number",
    description: "string",
    features: "array",
    images: "array",
    rating: "number"
  }
});

// 3. Store in vector database for similarity search
await VectorStore.execute({
  documents: [{
    content: JSON.stringify(productData),
    metadata: {
      url: currentPage.url, // currentPage is provided by the workflow context
      category: "product",
      timestamp: Date.now()
    }
  }]
});

Intelligently navigate through paginated content:

// AI-guided pagination
const navigationStrategy = await Agent.execute({
  input: await GetAllLinks.execute(),
  tools: [ThinkTool, CalculatorTool],
  prompt: "Identify pagination links and determine the best navigation strategy"
});

// Execute navigation based on AI recommendations
for (const page of navigationStrategy.pages) {
  await navigateToPage(page.url); // helper assumed to be provided by the workflow
  const content = await extractPageContent();
  await storeContent(content);
}

Handle JavaScript-rendered content and dynamic elements:

// Wait for dynamic content to load
await WaitNode.execute({ seconds: 3 });

// Re-analyze after content loads
const dynamicContent = await Agent.execute({
  input: await GetAllHTML.execute(),
  prompt: "Extract data from dynamically loaded content sections"
});
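
A fixed three-second wait can be too short on slow pages and wasted time on fast ones. A hedged alternative, assuming GetAllHTML returns the serialized DOM as a string, is to poll until the markup stops changing:

// Poll until the DOM stabilizes, up to a bounded number of attempts
let previousHtml = "";
for (let attempt = 0; attempt < 10; attempt++) {
  const html = await GetAllHTML.execute();
  if (html === previousHtml) break; // no change since last poll: done
  previousHtml = html;
  await WaitNode.execute({ seconds: 1 });
}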

Automatically categorize scraped content:

// Classify content type
const contentType = await TextClassifier.execute({
  input: extractedContent,
  categories: ["product", "article", "review", "specification", "pricing"]
});

// Route to appropriate processing workflow
switch (contentType.category) {
  case "product":
    await processProductData(extractedContent);
    break;
  case "article":
    await processArticleContent(extractedContent);
    break;
  // ... other cases
}

Typical applications include:

  • Competitor Analysis: Extract pricing, features, and product information (a short sketch for this case follows the list)
  • Trend Monitoring: Track content changes and new product launches
  • Review Analysis: Collect and analyze customer feedback across platforms
  • News Monitoring: Collect articles from multiple sources on specific topics
  • Research Data: Gather academic papers, reports, and documentation
  • Social Media: Extract posts, comments, and engagement metrics
  • Contact Information: Extract business contact details from directories
  • Company Profiles: Gather company information from various sources
  • Event Data: Collect information about conferences, meetups, and events
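
As a concrete illustration of the competitor-analysis case, here is a minimal sketch that reuses the patterns above; competitorUrls and the navigateToPage helper are assumptions, as in the pagination example:

// Hypothetical list of competitor product pages to monitor
const competitorUrls = [
  "https://example.com/product-a",
  "https://example.com/product-b"
];

for (const url of competitorUrls) {
  await navigateToPage(url);

  // Same Agent + StructuredOutputParser pattern as the main example
  const analysis = await Agent.execute({
    input: await GetAllHTML.execute(),
    prompt: "Identify the product name, current price, and listed features"
  });
  const data = await StructuredOutputParser.execute({
    input: analysis,
    schema: { name: "string", price: "number", features: "array" }
  });

  // Store with metadata so price changes can be tracked over time
  await VectorStore.execute({
    documents: [{
      content: JSON.stringify(data),
      metadata: { url, category: "competitor-pricing", timestamp: Date.now() }
    }]
  });
}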

Handle varying website structures gracefully:

// Fallback extraction strategy
let data; // declared outside the try/catch so the result is usable afterwards
try {
  data = await extractWithPrimaryMethod();
} catch (error) {
  console.log("Primary extraction failed, trying AI-based fallback");
  data = await Agent.execute({
    input: await GetAllHTML.execute(),
    prompt: "Extract relevant data using alternative patterns"
  });
}

Implement respectful extraction practices:

// Intelligent delay between requests
const delay = await Agent.execute({
  input: `Previous request took ${lastRequestTime}ms`,
  prompt: "Calculate appropriate delay for next request to be respectful"
});
await WaitNode.execute({ seconds: delay.seconds });
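
If invoking a model just to choose a delay is too heavyweight, a simple local heuristic achieves the same courtesy. The sketch below scales the pause with the previous response time (the exact formula and 10-second cap are illustrative assumptions):

// Local alternative: pause longer when the server is responding slowly,
// capped at 10 seconds
const waitSeconds = Math.min(Math.ceil(lastRequestTime / 1000) + 1, 10);
await WaitNode.execute({ seconds: waitSeconds });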

Use AI to validate extracted data quality:

// Validate extracted data
const validation = await Agent.execute({
  input: extractedData,
  prompt: `Validate this extracted data for:
- Completeness (are all expected fields present?)
- Accuracy (do the values make sense?)
- Consistency (are related fields consistent?)
Return validation score and issues found.`
});

if (validation.score < 0.8) {
  // Re-extract with different strategy
  await retryExtraction();
}
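
retryExtraction() is left undefined above. One hedged way to implement it is a bounded loop that varies the extraction prompt on each attempt and gives up after a few tries (the strategy prompts and the check.score shape are illustrative assumptions):

// Hypothetical bounded retry: vary the prompt each attempt and stop
// after maxAttempts rather than looping forever
async function retryExtraction(maxAttempts = 3) {
  const strategies = [
    "Extract relevant data using alternative patterns",
    "Focus only on text inside the main content area",
    "Extract whatever structured data is present, field by field"
  ];
  for (let i = 0; i < maxAttempts; i++) {
    const data = await Agent.execute({
      input: await GetAllHTML.execute(),
      prompt: strategies[i % strategies.length]
    });
    const check = await Agent.execute({
      input: data,
      prompt: "Return a validation score between 0 and 1 for this data"
    });
    if (check.score >= 0.8) return data;
  }
  throw new Error("Extraction failed after retries");
}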

Identify and handle duplicate content:

// Check for duplicates using vector similarity
const similarContent = await VectorStore.execute({
  query: extractedData.content,
  topK: 5,
  threshold: 0.9
});

if (similarContent.length > 0) {
  console.log("Similar content found, skipping or merging");
}

Compared with fixed-selector scraping, this AI-powered approach yields extraction workflows that are more robust, more adaptable, and better able to handle structural variation across websites.