# Intelligent Web Extraction with AI
This guide shows how to build web extraction workflows that use AI to identify, extract, and process web content, making extraction more adaptive and robust than traditional selector-based approaches.
## Core Concepts

### Traditional vs AI-Powered Extraction

**Traditional extraction:**
- Relies on fixed CSS selectors and XPath expressions
- Breaks when website structure changes
- Requires manual updates for each site variation
- Limited to predefined data patterns
**AI-powered extraction:**
- Uses machine learning to identify content patterns
- Adapts to structural changes automatically
- Learns from examples to improve extraction
- Handles unstructured and semi-structured data
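The difference can be sketched as a fallback chain: try the cheap, brittle fixed selector first, then fall back to a looser, more adaptive strategy. A minimal sketch in plain JavaScript, with hypothetical strategy functions standing in for real extraction strategies:

```js
// Fallback chain: each strategy is a function that either returns
// extracted data or throws. Later strategies are more tolerant.
function extractWithFallback(page, strategies) {
  for (const strategy of strategies) {
    try {
      const result = strategy(page);
      if (result != null) return { data: result, strategy: strategy.name };
    } catch (_) {
      // Strategy failed (e.g. selector no longer matches); try the next one.
    }
  }
  return { data: null, strategy: null };
}

// Brittle "traditional" strategy: exact match on a fixed class name.
function fixedSelector(page) {
  const m = page.match(/<h1 class="title">(.*?)<\/h1>/);
  if (!m) throw new Error("selector broke");
  return m[1];
}

// Looser "adaptive" strategy: any h1, regardless of class names.
function anyHeading(page) {
  const m = page.match(/<h1[^>]*>(.*?)<\/h1>/);
  return m ? m[1] : null;
}

// The fixed selector works until the site renames its class...
const oldPage = '<h1 class="title">Widget Pro</h1>';
const newPage = '<h1 class="headline">Widget Pro</h1>';

const a = extractWithFallback(oldPage, [fixedSelector, anyHeading]);
const b = extractWithFallback(newPage, [fixedSelector, anyHeading]);
// a used fixedSelector; b fell back to anyHeading; same data either way.
```

In a real workflow the last link in the chain would be an AI-backed extractor rather than a looser regex, but the control flow is the same.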
## Workflow Architecture

### 1. Intelligent Content Discovery

Use AI to identify relevant content areas on web pages:
```js
// AI-powered content area detection
const contentAnalysis = await Agent.execute({
  input: await GetAllHTML.execute(),
  tools: [ThinkTool, CalculatorTool],
  prompt: `Analyze this HTML structure and identify:
    1. Main content areas (articles, product info, etc.)
    2. Navigation elements
    3. Sidebar content
    4. Footer information
    5. Advertisement sections

    Return a JSON structure mapping content types to CSS selectors.`
});

// Extract content based on AI analysis
const mainContent = await ProcessHTML.execute({
  html: await GetAllHTML.execute(),
  selector: contentAnalysis.selectors.main_content,
  operation: "extract_text"
});
```

### 2. Adaptive Data Extraction

Create extraction workflows that adapt to different website structures:
```js
// Multi-strategy extraction with AI fallback
class AdaptiveExtractor {
  constructor() {
    this.strategies = [
      new SelectorBasedExtraction(),
      new AIPatternExtraction(),
      new SemanticExtraction()
    ];
  }

  async extract(url, dataType) {
    await NavigateToLink.execute({ url });

    for (const strategy of this.strategies) {
      try {
        const result = await strategy.extract(dataType);
        // validateExtraction is an assumed schema/quality check (not shown)
        if (this.validateExtraction(result)) {
          return result;
        }
      } catch (error) {
        console.log(`Strategy ${strategy.name} failed:`, error);
      }
    }

    // Final AI-powered extraction attempt
    return await this.aiEmergencyExtraction(dataType);
  }

  async aiEmergencyExtraction(dataType) {
    const pageContent = await GetAllText.execute();

    return await Agent.execute({
      input: pageContent,
      prompt: `Extract ${dataType} information from this text.
        Return structured JSON with all relevant fields.
        If information is missing, mark fields as null.`
    });
  }
}
```

### 3. Semantic Content Understanding

Use AI to understand content semantics rather than just structure:
```js
// Semantic content classifier
const contentClassifier = await Agent.execute({
  input: await GetAllText.execute(),
  tools: [TextClassifierTool],
  prompt: `Classify the content on this page into categories:
    - Product information (name, price, description, specs)
    - User reviews and ratings
    - Related products or recommendations
    - Company/seller information
    - Shipping and return policies

    For each category found, provide:
    1. Confidence score (0-1)
    2. Key information extracted
    3. Location hints for future extraction`
});

// Process each content type appropriately
for (const category of contentClassifier.categories) {
  if (category.confidence > 0.8) {
    await processContentCategory(category);
  }
}
```

## Advanced Extraction Patterns

### Multi-Page Intelligence

Create AI workflows that intelligently navigate and extract from multiple pages:
```js
// Intelligent pagination handler
class SmartPaginator {
  constructor() {
    this.visitedUrls = new Set();
    this.extractedData = [];
  }

  async scrapeWithPagination(startUrl, maxPages = 50) {
    let currentUrl = startUrl;
    let pageCount = 0;

    while (currentUrl && pageCount < maxPages) {
      if (this.visitedUrls.has(currentUrl)) break;

      await NavigateToLink.execute({ url: currentUrl });
      this.visitedUrls.add(currentUrl);

      // Extract data from the current page
      const pageData = await this.extractPageData();
      this.extractedData.push(...pageData);

      // AI-powered next-page detection
      const nextPageAnalysis = await Agent.execute({
        input: await GetAllLinks.execute(),
        prompt: `Analyze these links to find the next page in pagination.
          Look for patterns like "Next", "Page 2", "→", or numbered links.
          Return the most likely next page URL or null if no pagination found.`
      });

      currentUrl = nextPageAnalysis.nextPageUrl;
      pageCount++;

      // Respectful delay between requests
      await WaitNode.execute({ seconds: 2 });
    }

    return this.extractedData;
  }

  async extractPageData() {
    // Use AI to identify and extract relevant data
    const dataAnalysis = await Agent.execute({
      input: {
        html: await GetAllHTML.execute(),
        text: await GetAllText.execute(),
        links: await GetAllLinks.execute(),
        images: await GetAllImages.execute()
      },
      prompt: `Extract structured data from this page.
        Identify the main data entities and their relationships.
        Return as JSON array with consistent schema.`
    });

    return dataAnalysis.entities || [];
  }
}
```

### Dynamic Content Handling

Handle JavaScript-rendered and dynamically loaded content:
```js
// Dynamic content scraper with AI-guided timing
class DynamicContentScraper {
  async scrapeWithDynamicContent(url) {
    await NavigateToLink.execute({ url });

    // Initial content snapshot
    const initialContent = await GetAllText.execute();

    // Wait for dynamic content with AI-guided timing
    const loadingAnalysis = await Agent.execute({
      input: {
        html: await GetAllHTML.execute(),
        initialTextLength: initialContent.length
      },
      prompt: `Analyze this page for dynamic content loading indicators:
        - Loading spinners or placeholders
        - JavaScript frameworks (React, Vue, Angular)
        - AJAX request patterns
        - Estimated load time needed (1-10 seconds)`
    });

    // Wait based on the AI recommendation
    await WaitNode.execute({ seconds: loadingAnalysis.recommendedWaitTime || 3 });

    // Re-analyze after waiting
    const finalContent = await GetAllText.execute();

    if (finalContent.length > initialContent.length * 1.2) {
      console.log("Dynamic content detected and loaded");
    }

    return await this.extractFinalData();
  }

  async extractFinalData() {
    return await Agent.execute({
      input: {
        html: await GetAllHTML.execute(),
        text: await GetAllText.execute()
      },
      prompt: `Extract all relevant data from this fully loaded page.
        Focus on the main content that appears to be dynamically loaded.`
    });
  }
}
```

### Content Validation and Quality Assurance

Use AI to validate extracted data quality:
```js
// AI-powered data validation
class DataValidator {
  async validateExtractedData(data, expectedSchema) {
    const validation = await Agent.execute({
      input: {
        data: JSON.stringify(data),
        schema: JSON.stringify(expectedSchema)
      },
      prompt: `Validate this extracted data against the expected schema:

        1. Check completeness (all required fields present)
        2. Verify data types and formats
        3. Identify inconsistencies or anomalies
        4. Suggest corrections for invalid data
        5. Rate overall quality (0-1 score)

        Return validation report with issues and suggestions.`
    });

    if (validation.qualityScore < 0.7) {
      console.log("Data quality issues detected:", validation.issues);
      return await this.attemptDataCorrection(data, validation);
    }

    return data;
  }

  async attemptDataCorrection(data, validationReport) {
    return await Agent.execute({
      input: {
        originalData: JSON.stringify(data),
        issues: JSON.stringify(validationReport.issues)
      },
      prompt: `Attempt to correct the identified data quality issues:
        ${validationReport.issues.join('\n')}

        Return corrected data maintaining the original structure.`
    });
  }
}
```

## Real-World Use Cases

### E-commerce Product Intelligence

Comprehensive product data extraction with competitive analysis:
```js
// Intelligent product scraper
class ProductIntelligenceScraper {
  async scrapeProduct(productUrl) {
    await NavigateToLink.execute({ url: productUrl });

    // Multi-modal product analysis
    const productAnalysis = await Agent.execute({
      input: {
        text: await GetAllText.execute(),
        images: await GetAllImages.execute(),
        html: await GetAllHTML.execute()
      },
      tools: [ImageAnalysisTool, PriceExtractorTool, ReviewAnalyzerTool],
      prompt: `Perform comprehensive product analysis:

        1. Extract basic product information (name, brand, model)
        2. Identify pricing information (current, original, discounts)
        3. Analyze product images for features and quality
        4. Extract technical specifications
        5. Analyze customer reviews and ratings
        6. Identify related or competing products
        7. Extract availability and shipping information

        Return structured product intelligence report.`
    });

    // Enhance with competitive analysis
    const competitorAnalysis = await this.analyzeCompetitors(
      productAnalysis.productName,
      productAnalysis.category
    );

    return {
      ...productAnalysis,
      competitorAnalysis,
      scrapedAt: new Date().toISOString(),
      sourceUrl: productUrl
    };
  }

  async analyzeCompetitors(productName, category) {
    // Search for similar products
    const searchResults = await WebSearchTool.execute({
      query: `${productName} ${category} alternatives comparison`
    });

    return await Agent.execute({
      input: searchResults,
      prompt: `Analyze these search results to identify:
        1. Direct competitors and alternatives
        2. Price comparison opportunities
        3. Feature differentiation points
        4. Market positioning insights`
    });
  }
}
```

### News and Content Monitoring

Intelligent content monitoring with sentiment analysis:
```js
// AI-powered news monitoring
class NewsMonitoringScraper {
  async monitorTopic(topic, sources = []) {
    const articles = [];

    for (const source of sources) {
      await NavigateToLink.execute({ url: source });

      const articleAnalysis = await Agent.execute({
        input: {
          html: await GetAllHTML.execute(),
          links: await GetAllLinks.execute()
        },
        prompt: `Find articles related to "${topic}" on this news site:

          1. Identify article links and headlines
          2. Extract publication dates
          3. Determine relevance score (0-1) to the topic
          4. Classify article type (news, opinion, analysis, etc.)

          Return array of relevant articles with metadata.`
      });

      // Process each relevant article
      for (const article of articleAnalysis.articles) {
        if (article.relevanceScore > 0.6) {
          const fullArticle = await this.scrapeFullArticle(article.url);
          articles.push(fullArticle);
        }
      }
    }

    return await this.analyzeArticleCollection(articles, topic);
  }

  async scrapeFullArticle(articleUrl) {
    await NavigateToLink.execute({ url: articleUrl });

    return await Agent.execute({
      input: {
        text: await GetAllText.execute(),
        html: await GetAllHTML.execute()
      },
      tools: [SentimentAnalysisTool, EntityExtractionTool],
      prompt: `Extract comprehensive article information:

        1. Article title and subtitle
        2. Author and publication info
        3. Main content (clean text)
        4. Key entities mentioned (people, organizations, locations)
        5. Sentiment analysis (positive/negative/neutral)
        6. Key topics and themes
        7. Article summary (2-3 sentences)

        Return structured article data.`
    });
  }

  async analyzeArticleCollection(articles, topic) {
    return await Agent.execute({
      input: {
        articles: JSON.stringify(articles),
        topic: topic
      },
      prompt: `Analyze this collection of articles about "${topic}":

        1. Identify trending themes and subtopics
        2. Track sentiment evolution over time
        3. Find key influencers and sources
        4. Detect emerging narratives or controversies
        5. Summarize overall coverage patterns

        Return comprehensive topic analysis report.`
    });
  }
}
```

### Research Data Collection

Academic and research-focused intelligent extraction:
```js
// Research-oriented scraper
class ResearchDataScraper {
  async collectResearchData(researchQuery, sources = []) {
    const researchData = {
      papers: [],
      datasets: [],
      experts: [],
      institutions: []
    };

    for (const source of sources) {
      await NavigateToLink.execute({ url: source });

      const sourceAnalysis = await Agent.execute({
        input: {
          text: await GetAllText.execute(),
          links: await GetAllLinks.execute()
        },
        prompt: `Analyze this academic/research source for "${researchQuery}":

          1. Identify research papers and publications
          2. Find datasets and data sources
          3. Extract researcher and expert information
          4. Identify institutional affiliations
          5. Find related research topics and keywords

          Classify each finding by type and relevance.`
      });

      // Process each type of research data
      await this.processResearchFindings(sourceAnalysis, researchData);
    }

    return await this.synthesizeResearchFindings(researchData, researchQuery);
  }

  async processResearchFindings(analysis, researchData) {
    // Process papers (extractPaperDetails is an assumed helper, not shown)
    for (const paper of analysis.papers || []) {
      const paperDetails = await this.extractPaperDetails(paper.url);
      researchData.papers.push(paperDetails);
    }

    // Process datasets (extractDatasetInfo is an assumed helper, not shown)
    for (const dataset of analysis.datasets || []) {
      const datasetInfo = await this.extractDatasetInfo(dataset.url);
      researchData.datasets.push(datasetInfo);
    }

    // Process experts and institutions
    researchData.experts.push(...(analysis.experts || []));
    researchData.institutions.push(...(analysis.institutions || []));
  }

  async synthesizeResearchFindings(researchData, query) {
    return await Agent.execute({
      input: {
        data: JSON.stringify(researchData),
        query: query
      },
      prompt: `Synthesize this research data collection:

        1. Identify key research themes and trends
        2. Map relationships between papers, authors, and institutions
        3. Find research gaps and opportunities
        4. Suggest follow-up research directions
        5. Create citation network analysis

        Return comprehensive research landscape report.`
    });
  }
}
```

This AI-driven approach yields more robust, adaptable, and insightful extraction pipelines that can handle the complexity and variability of modern web content.