AI-Powered Web Extraction
This example demonstrates how to build intelligent web extraction workflows that combine browser extension nodes with AI models to extract, process, and structure data from websites automatically.
Overview
Traditional web extraction relies on fixed selectors and patterns. AI-powered extraction instead uses machine learning to identify and extract relevant information, making it more robust and adaptable to varying website structures.
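For contrast, here is a minimal sketch of the traditional approach: a hand-written pattern tied to one site's markup. The `extractPrice` helper and the class names are illustrative, not taken from any real site.

```javascript
// Traditional extraction: a fixed pattern tied to one site's markup.
// It works until the site changes a class name or restructures the page.
function extractPrice(html) {
  const match = html.match(/<span class="price">\$?([\d.]+)<\/span>/);
  return match ? Number(match[1]) : null;
}

extractPrice('<span class="price">$19.99</span>');      // → 19.99
extractPrice('<span class="sale-price">$19.99</span>'); // → null: a trivial rename breaks it
```

This brittleness is exactly what the AI-driven workflow below is meant to avoid.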
Workflow Components
1. Intelligent Content Discovery
- GetAllLinks Node: Discovers all links on a page for navigation
- GetAllHTML Node: Captures page structure for AI analysis
- GetAllImages Node: Collects images for visual content analysis
2. AI-Driven Extraction
- Agent Node: Uses AI to identify relevant content patterns
- Structured Output Parser: Converts unstructured content to structured data
- Text Classifier: Categorizes extracted content automatically
3. Data Processing
- Vector Store: Stores extracted data with semantic search capabilities
- Memory: Maintains context across multiple pages
- Workflow Tool: Chains multiple extraction steps together
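As a rough sketch, the three stages above chain together like this. The `executeNode` dispatcher is a hypothetical stand-in for the real node API, stubbed as a plain async function so that only the control flow is shown.

```javascript
// Hypothetical pipeline: discovery -> AI extraction -> storage.
// executeNode(name, params) stands in for the workflow engine's node API.
async function runExtraction(executeNode) {
  const html = await executeNode("GetAllHTML", {});           // 1. content discovery
  const parsed = await executeNode("Agent", { input: html }); // 2. AI-driven extraction
  await executeNode("VectorStore", { documents: [parsed] });  // 3. data processing/storage
  return parsed;
}

// Usage with a stub engine (real nodes would talk to the browser extension):
const calls = [];
const stub = async (name, params) => { calls.push(name); return params.input ?? "<html/>"; };
runExtraction(stub).then(() => console.log(calls)); // logs the three node names in order
```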
Example: E-commerce Product Extraction
```javascript
// Workflow: Intelligent Product Data Extraction

// 1. Analyze page structure with AI
const pageAnalysis = await Agent.execute({
  input: await GetAllHTML.execute(),
  tools: [ThinkTool],
  prompt: `Analyze this e-commerce page and identify:
    - Product name and description
    - Price information
    - Product images
    - Specifications or features
    - Customer reviews section`
});

// 2. Extract structured product data
const productData = await StructuredOutputParser.execute({
  input: pageAnalysis,
  schema: {
    name: "string",
    price: "number",
    description: "string",
    features: "array",
    images: "array",
    rating: "number"
  }
});

// 3. Store in a vector database for similarity search
await VectorStore.execute({
  documents: [{
    content: JSON.stringify(productData),
    metadata: {
      url: currentPage.url,
      category: "product",
      timestamp: Date.now()
    }
  }]
});
```

Advanced Extraction Patterns
Multi-Page Navigation
Intelligently navigate through paginated content:
```javascript
// AI-guided pagination
const navigationStrategy = await Agent.execute({
  input: await GetAllLinks.execute(),
  tools: [ThinkTool, CalculatorTool],
  prompt: "Identify pagination links and determine the best navigation strategy"
});

// Execute navigation based on AI recommendations
for (const page of navigationStrategy.pages) {
  await navigateToPage(page.url);
  const content = await extractPageContent();
  await storeContent(content);
}
```

Dynamic Content Handling
Handle JavaScript-rendered content and dynamic elements:
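A fixed delay, as used in this example, is simple but either wastes time or fires too early. A polling helper is often more reliable; here is a sketch, where the `condition` callback is a hypothetical hook wrapping whatever readiness check the workflow exposes (e.g. "does the HTML now contain the product grid?").

```javascript
// Poll until a condition holds or a timeout elapses, instead of sleeping blindly.
async function waitFor(condition, { timeoutMs = 10000, intervalMs = 250 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await condition()) return true;                      // content is ready
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false;                                              // gave up: treat as a soft failure
}
```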
```javascript
// Wait for dynamic content to load
await WaitNode.execute({ seconds: 3 });

// Re-analyze after content loads
const dynamicContent = await Agent.execute({
  input: await GetAllHTML.execute(),
  prompt: "Extract data from dynamically loaded content sections"
});
```

Content Classification
Automatically categorize scraped content:
```javascript
// Classify content type
const contentType = await TextClassifier.execute({
  input: extractedContent,
  categories: ["product", "article", "review", "specification", "pricing"]
});

// Route to the appropriate processing workflow
switch (contentType.category) {
  case "product":
    await processProductData(extractedContent);
    break;
  case "article":
    await processArticleContent(extractedContent);
    break;
  // ... other cases
}
```

Use Cases
Market Research
Section titled “Market Research”- Competitor Analysis: Extract pricing, features, and product information
- Trend Monitoring: Track content changes and new product launches
- Review Analysis: Collect and analyze customer feedback across platforms
Content Aggregation
- News Monitoring: Collect articles from multiple sources on specific topics
- Research Data: Gather academic papers, reports, and documentation
- Social Media: Extract posts, comments, and engagement metrics
Lead Generation
- Contact Information: Extract business contact details from directories
- Company Profiles: Gather company information from various sources
- Event Data: Collect information about conferences, meetups, and events
Error Handling and Resilience
Adaptive Extraction
Handle varying website structures gracefully:
```javascript
// Fallback extraction strategy.
// Declare the result outside try/catch so both paths can assign it.
let data;
try {
  data = await extractWithPrimaryMethod();
} catch (error) {
  console.log("Primary extraction failed, trying AI-based fallback");
  data = await Agent.execute({
    input: await GetAllHTML.execute(),
    prompt: "Extract relevant data using alternative patterns"
  });
}
```

Rate Limiting and Politeness
Implement respectful extraction practices:
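A deterministic baseline is usually enough before adding anything AI-driven: enforce a gap between requests and back off exponentially after failures. A minimal sketch; the `baseMs`/`maxMs` values and the `politeFetch`/`fetchPage` helpers are illustrative.

```javascript
// Exponential backoff with a cap: attempt 0 -> 1s, 1 -> 2s, 2 -> 4s, ... up to maxMs.
function backoffDelay(attempt, baseMs = 1000, maxMs = 30000) {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Retry a page fetch, waiting progressively longer after each failure.
async function politeFetch(fetchPage, url, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    if (attempt > 0) {
      await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt - 1)));
    }
    try {
      return await fetchPage(url);
    } catch (error) {
      // swallow and retry with a longer delay
    }
  }
  throw new Error(`Giving up on ${url} after ${maxAttempts} attempts`);
}
```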
```javascript
// Intelligent delay between requests
const delay = await Agent.execute({
  input: `Previous request took ${lastRequestTime}ms`,
  prompt: "Calculate an appropriate delay before the next request to be respectful"
});

await WaitNode.execute({ seconds: delay.seconds });
```

Data Quality and Validation
AI-Powered Validation
Use AI to validate the quality of extracted data:
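Cheap deterministic checks are worth running first, reserving the AI pass for the fuzzier questions. A sketch using the product schema from the e-commerce example; the specific field rules are illustrative.

```javascript
// Structural validation before any AI judgment: fail fast on obvious gaps.
function validateProduct(data) {
  const issues = [];
  if (typeof data.name !== "string" || data.name.trim() === "") issues.push("missing name");
  if (typeof data.price !== "number" || data.price <= 0) issues.push("invalid price");
  if (!Array.isArray(data.images) || data.images.length === 0) issues.push("no images");
  return { valid: issues.length === 0, issues };
}

validateProduct({ name: "Widget", price: 19.99, images: ["a.jpg"] }); // → { valid: true, issues: [] }
validateProduct({ name: "Widget", price: 0, images: [] });            // → valid: false, two issues
```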
```javascript
// Validate extracted data
const validation = await Agent.execute({
  input: extractedData,
  prompt: `Validate this extracted data for:
    - Completeness (are all expected fields present?)
    - Accuracy (do the values make sense?)
    - Consistency (are related fields consistent?)
    Return a validation score and any issues found.`
});

if (validation.score < 0.8) {
  // Re-extract with a different strategy
  await retryExtraction();
}
```

Duplicate Detection
Identify and handle duplicate content:
```javascript
// Check for duplicates using vector similarity
const similarContent = await VectorStore.execute({
  query: extractedData.content,
  topK: 5,
  threshold: 0.9
});

if (similarContent.length > 0) {
  console.log("Similar content found, skipping or merging");
}
```

This AI-powered approach to web extraction is more robust, adaptable, and intelligent than traditional selector-based methods.