# Intelligent Web Extraction with AI
This guide shows how to build web extraction workflows that use AI to identify, extract, and process web content, making extraction more adaptive and robust than traditional selector-based approaches.
## Core Concepts

### Traditional vs AI-Powered Extraction

**Traditional extraction:**
- Relies on fixed CSS selectors and XPath expressions
- Breaks when website structure changes
- Requires manual updates for each site variation
- Limited to predefined data patterns
**AI-powered extraction:**
- Uses machine learning to identify content patterns
- Adapts to structural changes automatically
- Learns from examples to improve extraction
- Handles unstructured and semi-structured data
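The difference can be sketched as a fallback chain: try the cheap, brittle fixed selector first, then fall back to a looser, more adaptive strategy. A minimal sketch in plain JavaScript, with hypothetical strategy functions standing in for real extraction strategies:

```js
// Fallback chain: each strategy is a function that either returns
// extracted data or throws. Later strategies are more tolerant.
function extractWithFallback(page, strategies) {
  for (const strategy of strategies) {
    try {
      const result = strategy(page);
      if (result != null) return { data: result, strategy: strategy.name };
    } catch (_) {
      // Strategy failed (e.g. selector no longer matches); try the next one.
    }
  }
  return { data: null, strategy: null };
}

// Brittle "traditional" strategy: exact match on a fixed class name.
function fixedSelector(page) {
  const m = page.match(/<h1 class="title">(.*?)<\/h1>/);
  if (!m) throw new Error("selector broke");
  return m[1];
}

// Looser "adaptive" strategy: any h1, regardless of class names.
function anyHeading(page) {
  const m = page.match(/<h1[^>]*>(.*?)<\/h1>/);
  return m ? m[1] : null;
}

// The fixed selector works until the site renames its class...
const oldPage = '<h1 class="title">Widget Pro</h1>';
const newPage = '<h1 class="headline">Widget Pro</h1>';

const a = extractWithFallback(oldPage, [fixedSelector, anyHeading]);
const b = extractWithFallback(newPage, [fixedSelector, anyHeading]);
// a used fixedSelector; b fell back to anyHeading; same data either way.
```

In a real workflow the last link in the chain would be an AI-backed extractor rather than a looser regex, but the control flow is the same.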
## Workflow Architecture

### 1. Intelligent Content Discovery

Use AI to identify relevant content areas on web pages:
```js
// AI-powered content area detection
const contentAnalysis = await Agent.execute({
  input: await GetAllHTML.execute(),
  tools: [ThinkTool, CalculatorTool],
  prompt: `Analyze this HTML structure and identify:
    1. Main content areas (articles, product info, etc.)
    2. Navigation elements
    3. Sidebar content
    4. Footer information
    5. Advertisement sections

    Return a JSON structure mapping content types to CSS selectors.`
});

// Extract content based on AI analysis
const mainContent = await ProcessHTML.execute({
  html: await GetAllHTML.execute(),
  selector: contentAnalysis.selectors.main_content,
  operation: "extract_text"
});
```

### 2. Adaptive Data Extraction

Create extraction workflows that adapt to different website structures:
```js
// Multi-strategy extraction with AI fallback
class AdaptiveExtractor {
  constructor() {
    this.strategies = [
      new SelectorBasedExtraction(),
      new AIPatternExtraction(),
      new SemanticExtraction()
    ];
  }

  async extract(url, dataType) {
    await NavigateToLink.execute({ url });

    for (const strategy of this.strategies) {
      try {
        const result = await strategy.extract(dataType);
        // validateExtraction is an assumed schema/quality check (not shown)
        if (this.validateExtraction(result)) {
          return result;
        }
      } catch (error) {
        console.log(`Strategy ${strategy.name} failed:`, error);
      }
    }

    // Final AI-powered extraction attempt
    return await this.aiEmergencyExtraction(dataType);
  }

  async aiEmergencyExtraction(dataType) {
    const pageContent = await GetAllText.execute();

    return await Agent.execute({
      input: pageContent,
      prompt: `Extract ${dataType} information from this text.
        Return structured JSON with all relevant fields.
        If information is missing, mark fields as null.`
    });
  }
}
```

### 3. Semantic Content Understanding

Use AI to understand content semantics rather than just structure:
```js
// Semantic content classifier
const contentClassifier = await Agent.execute({
  input: await GetAllText.execute(),
  tools: [TextClassifierTool],
  prompt: `Classify the content on this page into categories:
    - Product information (name, price, description, specs)
    - User reviews and ratings
    - Related products or recommendations
    - Company/seller information
    - Shipping and return policies

    For each category found, provide:
    1. Confidence score (0-1)
    2. Key information extracted
    3. Location hints for future extraction`
});

// Process each content type appropriately
for (const category of contentClassifier.categories) {
  if (category.confidence > 0.8) {
    await processContentCategory(category);
  }
}
```

## Advanced Extraction Patterns

### Multi-Page Intelligence

Create AI workflows that intelligently navigate and extract from multiple pages:
```js
// Intelligent pagination handler
class SmartPaginator {
  constructor() {
    this.visitedUrls = new Set();
    this.extractedData = [];
  }

  async scrapeWithPagination(startUrl, maxPages = 50) {
    let currentUrl = startUrl;
    let pageCount = 0;

    while (currentUrl && pageCount < maxPages) {
      if (this.visitedUrls.has(currentUrl)) break;

      await NavigateToLink.execute({ url: currentUrl });
      this.visitedUrls.add(currentUrl);

      // Extract data from the current page
      const pageData = await this.extractPageData();
      this.extractedData.push(...pageData);

      // AI-powered next-page detection
      const nextPageAnalysis = await Agent.execute({
        input: await GetAllLinks.execute(),
        prompt: `Analyze these links to find the next page in pagination.
          Look for patterns like "Next", "Page 2", "→", or numbered links.
          Return the most likely next page URL or null if no pagination found.`
      });

      currentUrl = nextPageAnalysis.nextPageUrl;
      pageCount++;

      // Respectful delay between requests
      await WaitNode.execute({ seconds: 2 });
    }

    return this.extractedData;
  }

  async extractPageData() {
    // Use AI to identify and extract relevant data
    const dataAnalysis = await Agent.execute({
      input: {
        html: await GetAllHTML.execute(),
        text: await GetAllText.execute(),
        links: await GetAllLinks.execute(),
        images: await GetAllImages.execute()
      },
      prompt: `Extract structured data from this page.
        Identify the main data entities and their relationships.
        Return as JSON array with consistent schema.`
    });

    return dataAnalysis.entities || [];
  }
}
```

### Dynamic Content Handling

Handle JavaScript-rendered and dynamically loaded content:
```js
// Dynamic content scraper with AI-guided timing
class DynamicContentScraper {
  async scrapeWithDynamicContent(url) {
    await NavigateToLink.execute({ url });

    // Initial content snapshot
    const initialContent = await GetAllText.execute();

    // Wait for dynamic content with AI-guided timing
    const loadingAnalysis = await Agent.execute({
      input: {
        html: await GetAllHTML.execute(),
        initialTextLength: initialContent.length
      },
      prompt: `Analyze this page for dynamic content loading indicators:
        - Loading spinners or placeholders
        - JavaScript frameworks (React, Vue, Angular)
        - AJAX request patterns
        - Estimated load time needed (1-10 seconds)`
    });

    // Wait based on the AI recommendation
    await WaitNode.execute({ seconds: loadingAnalysis.recommendedWaitTime || 3 });

    // Re-analyze after waiting
    const finalContent = await GetAllText.execute();

    if (finalContent.length > initialContent.length * 1.2) {
      console.log("Dynamic content detected and loaded");
    }

    return await this.extractFinalData();
  }

  async extractFinalData() {
    return await Agent.execute({
      input: {
        html: await GetAllHTML.execute(),
        text: await GetAllText.execute()
      },
      prompt: `Extract all relevant data from this fully loaded page.
        Focus on the main content that appears to be dynamically loaded.`
    });
  }
}
```

### Content Validation and Quality Assurance

Use AI to validate extracted data quality:
```js
// AI-powered data validation
class DataValidator {
  async validateExtractedData(data, expectedSchema) {
    const validation = await Agent.execute({
      input: {
        data: JSON.stringify(data),
        schema: JSON.stringify(expectedSchema)
      },
      prompt: `Validate this extracted data against the expected schema:

        1. Check completeness (all required fields present)
        2. Verify data types and formats
        3. Identify inconsistencies or anomalies
        4. Suggest corrections for invalid data
        5. Rate overall quality (0-1 score)

        Return validation report with issues and suggestions.`
    });

    if (validation.qualityScore < 0.7) {
      console.log("Data quality issues detected:", validation.issues);
      return await this.attemptDataCorrection(data, validation);
    }

    return data;
  }

  async attemptDataCorrection(data, validationReport) {
    return await Agent.execute({
      input: {
        originalData: JSON.stringify(data),
        issues: JSON.stringify(validationReport.issues)
      },
      prompt: `Attempt to correct the identified data quality issues:
        ${validationReport.issues.join('\n')}

        Return corrected data maintaining the original structure.`
    });
  }
}
```

## Real-World Use Cases

### E-commerce Product Intelligence

Comprehensive product data extraction with competitive analysis:
```js
// Intelligent product scraper
class ProductIntelligenceScraper {
  async scrapeProduct(productUrl) {
    await NavigateToLink.execute({ url: productUrl });

    // Multi-modal product analysis
    const productAnalysis = await Agent.execute({
      input: {
        text: await GetAllText.execute(),
        images: await GetAllImages.execute(),
        html: await GetAllHTML.execute()
      },
      tools: [ImageAnalysisTool, PriceExtractorTool, ReviewAnalyzerTool],
      prompt: `Perform comprehensive product analysis:

        1. Extract basic product information (name, brand, model)
        2. Identify pricing information (current, original, discounts)
        3. Analyze product images for features and quality
        4. Extract technical specifications
        5. Analyze customer reviews and ratings
        6. Identify related or competing products
        7. Extract availability and shipping information

        Return structured product intelligence report.`
    });

    // Enhance with competitive analysis
    const competitorAnalysis = await this.analyzeCompetitors(
      productAnalysis.productName,
      productAnalysis.category
    );

    return {
      ...productAnalysis,
      competitorAnalysis,
      scrapedAt: new Date().toISOString(),
      sourceUrl: productUrl
    };
  }

  async analyzeCompetitors(productName, category) {
    // Search for similar products
    const searchResults = await WebSearchTool.execute({
      query: `${productName} ${category} alternatives comparison`
    });

    return await Agent.execute({
      input: searchResults,
      prompt: `Analyze these search results to identify:
        1. Direct competitors and alternatives
        2. Price comparison opportunities
        3. Feature differentiation points
        4. Market positioning insights`
    });
  }
}
```

### News and Content Monitoring

Intelligent content monitoring with sentiment analysis:
```js
// AI-powered news monitoring
class NewsMonitoringScraper {
  async monitorTopic(topic, sources = []) {
    const articles = [];

    for (const source of sources) {
      await NavigateToLink.execute({ url: source });

      const articleAnalysis = await Agent.execute({
        input: {
          html: await GetAllHTML.execute(),
          links: await GetAllLinks.execute()
        },
        prompt: `Find articles related to "${topic}" on this news site:

          1. Identify article links and headlines
          2. Extract publication dates
          3. Determine relevance score (0-1) to the topic
          4. Classify article type (news, opinion, analysis, etc.)

          Return array of relevant articles with metadata.`
      });

      // Process each relevant article
      for (const article of articleAnalysis.articles) {
        if (article.relevanceScore > 0.6) {
          const fullArticle = await this.scrapeFullArticle(article.url);
          articles.push(fullArticle);
        }
      }
    }

    return await this.analyzeArticleCollection(articles, topic);
  }

  async scrapeFullArticle(articleUrl) {
    await NavigateToLink.execute({ url: articleUrl });

    return await Agent.execute({
      input: {
        text: await GetAllText.execute(),
        html: await GetAllHTML.execute()
      },
      tools: [SentimentAnalysisTool, EntityExtractionTool],
      prompt: `Extract comprehensive article information:

        1. Article title and subtitle
        2. Author and publication info
        3. Main content (clean text)
        4. Key entities mentioned (people, organizations, locations)
        5. Sentiment analysis (positive/negative/neutral)
        6. Key topics and themes
        7. Article summary (2-3 sentences)

        Return structured article data.`
    });
  }

  async analyzeArticleCollection(articles, topic) {
    return await Agent.execute({
      input: {
        articles: JSON.stringify(articles),
        topic: topic
      },
      prompt: `Analyze this collection of articles about "${topic}":

        1. Identify trending themes and subtopics
        2. Track sentiment evolution over time
        3. Find key influencers and sources
        4. Detect emerging narratives or controversies
        5. Summarize overall coverage patterns

        Return comprehensive topic analysis report.`
    });
  }
}
```

### Research Data Collection

Academic and research-focused intelligent extraction:
```js
// Research-oriented scraper
class ResearchDataScraper {
  async collectResearchData(researchQuery, sources = []) {
    const researchData = {
      papers: [],
      datasets: [],
      experts: [],
      institutions: []
    };

    for (const source of sources) {
      await NavigateToLink.execute({ url: source });

      const sourceAnalysis = await Agent.execute({
        input: {
          text: await GetAllText.execute(),
          links: await GetAllLinks.execute()
        },
        prompt: `Analyze this academic/research source for "${researchQuery}":

          1. Identify research papers and publications
          2. Find datasets and data sources
          3. Extract researcher and expert information
          4. Identify institutional affiliations
          5. Find related research topics and keywords

          Classify each finding by type and relevance.`
      });

      // Process each type of research data
      await this.processResearchFindings(sourceAnalysis, researchData);
    }

    return await this.synthesizeResearchFindings(researchData, researchQuery);
  }

  async processResearchFindings(analysis, researchData) {
    // Process papers (extractPaperDetails is an assumed helper, not shown)
    for (const paper of analysis.papers || []) {
      const paperDetails = await this.extractPaperDetails(paper.url);
      researchData.papers.push(paperDetails);
    }

    // Process datasets (extractDatasetInfo is an assumed helper, not shown)
    for (const dataset of analysis.datasets || []) {
      const datasetInfo = await this.extractDatasetInfo(dataset.url);
      researchData.datasets.push(datasetInfo);
    }

    // Process experts and institutions
    researchData.experts.push(...(analysis.experts || []));
    researchData.institutions.push(...(analysis.institutions || []));
  }

  async synthesizeResearchFindings(researchData, query) {
    return await Agent.execute({
      input: {
        data: JSON.stringify(researchData),
        query: query
      },
      prompt: `Synthesize this research data collection:

        1. Identify key research themes and trends
        2. Map relationships between papers, authors, and institutions
        3. Find research gaps and opportunities
        4. Suggest follow-up research directions
        5. Create citation network analysis

        Return comprehensive research landscape report.`
    });
  }
}
```

This AI-driven approach yields more robust, adaptable, and insightful extraction pipelines that can handle the complexity and variability of modern web content.