Intelligent Web Extraction with AI

This guide demonstrates how to create web extraction workflows that use AI to intelligently identify, extract, and process web content, making them more adaptive and robust than traditional selector-based approaches.

Traditional Extraction:

  • Relies on fixed CSS selectors and XPath expressions
  • Breaks when website structure changes
  • Requires manual updates for each site variation
  • Limited to predefined data patterns

AI-Powered Extraction:

  • Uses machine learning to identify content patterns
  • Adapts to structural changes automatically
  • Learns from examples to improve extraction
  • Handles unstructured and semi-structured data
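The difference can be seen in a small, self-contained sketch: a fixed selector breaks as soon as a class name changes, while a fallback chain of progressively looser strategies keeps working. The function names and regexes below are illustrative only, not part of any real toolkit:

```javascript
// Try each extraction strategy in order; return the first non-empty result.
function extractWithFallback(html, strategies) {
  for (const strategy of strategies) {
    const result = strategy(html);
    if (result !== null && result !== undefined && result !== "") return result;
  }
  return null; // every strategy failed
}

// Strategy 1: brittle, depends on an exact class name
const byFixedClass = (html) => {
  const m = html.match(/<span class="price">([^<]+)<\/span>/);
  return m ? m[1] : null;
};

// Strategy 2: looser currency pattern that survives a markup change
const byCurrencyPattern = (html) => {
  const m = html.match(/[$€£]\s?\d[\d,]*(?:\.\d{2})?/);
  return m ? m[0] : null;
};
```

When the site still uses the expected class, strategy 1 wins; after a redesign renames it, the currency pattern still recovers the price from `'<b>$19.99</b>'`.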

Use AI to identify relevant content areas on web pages:

// AI-powered content area detection
const contentAnalysis = await Agent.execute({
  input: await GetAllHTML.execute(),
  tools: [ThinkTool, CalculatorTool],
  prompt: `Analyze this HTML structure and identify:
1. Main content areas (articles, product info, etc.)
2. Navigation elements
3. Sidebar content
4. Footer information
5. Advertisement sections
Return a JSON structure mapping content types to CSS selectors.`
});

// Extract content based on AI analysis
const mainContent = await ProcessHTML.execute({
  html: await GetAllHTML.execute(),
  selector: contentAnalysis.selectors.main_content,
  operation: "extract_text"
});

Create extraction workflows that adapt to different website structures:

// Multi-strategy extraction with AI fallback
class AdaptiveExtractor {
  constructor() {
    this.strategies = [
      new SelectorBasedExtraction(),
      new AIPatternExtraction(),
      new SemanticExtraction()
    ];
  }

  async extract(url, dataType) {
    await NavigateToLink.execute({ url });
    for (const strategy of this.strategies) {
      try {
        const result = await strategy.extract(dataType);
        if (this.validateExtraction(result)) {
          return result;
        }
      } catch (error) {
        // Note: constructor.name gives the class name of the instance
        console.log(`Strategy ${strategy.constructor.name} failed:`, error);
      }
    }
    // Final AI-powered extraction attempt
    return await this.aiEmergencyExtraction(dataType);
  }

  async aiEmergencyExtraction(dataType) {
    const pageContent = await GetAllText.execute();
    return await Agent.execute({
      input: pageContent,
      prompt: `Extract ${dataType} information from this text.
Return structured JSON with all relevant fields.
If information is missing, mark fields as null.`
    });
  }
}

Use AI to understand content semantics rather than just structure:

// Semantic content classifier
const contentClassifier = await Agent.execute({
  input: await GetAllText.execute(),
  tools: [TextClassifierTool],
  prompt: `Classify the content on this page into categories:
- Product information (name, price, description, specs)
- User reviews and ratings
- Related products or recommendations
- Company/seller information
- Shipping and return policies
For each category found, provide:
1. Confidence score (0-1)
2. Key information extracted
3. Location hints for future extraction`
});

// Process each content type appropriately
for (const category of contentClassifier.categories) {
  if (category.confidence > 0.8) {
    await processContentCategory(category);
  }
}

Create AI workflows that intelligently navigate and extract from multiple pages:

// Intelligent pagination handler
class SmartPaginator {
  constructor() {
    this.visitedUrls = new Set();
    this.extractedData = [];
  }

  async scrapeWithPagination(startUrl, maxPages = 50) {
    let currentUrl = startUrl;
    let pageCount = 0;
    while (currentUrl && pageCount < maxPages) {
      if (this.visitedUrls.has(currentUrl)) break;
      await NavigateToLink.execute({ url: currentUrl });
      this.visitedUrls.add(currentUrl);

      // Extract data from current page
      const pageData = await this.extractPageData();
      this.extractedData.push(...pageData);

      // AI-powered next page detection
      const nextPageAnalysis = await Agent.execute({
        input: await GetAllLinks.execute(),
        prompt: `Analyze these links to find the next page in pagination.
Look for patterns like "Next", "Page 2", "→", or numbered links.
Return the most likely next page URL or null if no pagination found.`
      });
      currentUrl = nextPageAnalysis.nextPageUrl;
      pageCount++;

      // Respectful delay
      await WaitNode.execute({ seconds: 2 });
    }
    return this.extractedData;
  }

  async extractPageData() {
    // Use AI to identify and extract relevant data
    const dataAnalysis = await Agent.execute({
      input: {
        html: await GetAllHTML.execute(),
        text: await GetAllText.execute(),
        links: await GetAllLinks.execute(),
        images: await GetAllImages.execute()
      },
      prompt: `Extract structured data from this page.
Identify the main data entities and their relationships.
Return as JSON array with consistent schema.`
    });
    return dataAnalysis.entities || [];
  }
}
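A deterministic heuristic can serve as a cheap first pass (or fallback) before calling the AI for next-page detection. This sketch assumes links arrive as `{ text, url }` objects, similar in shape to what a GetAllLinks-style call might return:

```javascript
// Score candidate links by common "next page" signals and return the best,
// or null when nothing looks like pagination.
function findNextPageUrl(links, currentUrl) {
  let best = null;
  let bestScore = 0;
  for (const link of links) {
    let score = 0;
    if (/^next\b/i.test(link.text.trim())) score += 3; // "Next", "Next page"
    if (/»|→|›/.test(link.text)) score += 2;           // arrow glyphs
    if (/[?&]page=\d+/i.test(link.url)) score += 1;    // ?page=N parameter
    if (link.url === currentUrl) score = 0;            // never loop back
    if (score > bestScore) { bestScore = score; best = link.url; }
  }
  return best;
}
```

Only when this returns `null` does the workflow need to spend an AI call on the harder cases (JavaScript-driven pagination, infinite scroll, numbered links without URL patterns).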

Handle JavaScript-rendered and dynamically loaded content:

// Dynamic content scraper with AI timing
class DynamicContentScraper {
  async scrapeWithDynamicContent(url) {
    await NavigateToLink.execute({ url });

    // Initial content snapshot
    const initialContent = await GetAllText.execute();

    // Wait for dynamic content with AI-guided timing
    const loadingAnalysis = await Agent.execute({
      input: {
        html: await GetAllHTML.execute(),
        initialTextLength: initialContent.length
      },
      prompt: `Analyze this page for dynamic content loading indicators:
- Loading spinners or placeholders
- JavaScript frameworks (React, Vue, Angular)
- AJAX request patterns
- Estimated load time needed (1-10 seconds)`
    });

    // Wait based on AI recommendation
    await WaitNode.execute({
      seconds: loadingAnalysis.recommendedWaitTime || 3
    });

    // Re-analyze after waiting
    const finalContent = await GetAllText.execute();
    if (finalContent.length > initialContent.length * 1.2) {
      console.log("Dynamic content detected and loaded");
    }
    return await this.extractFinalData();
  }

  async extractFinalData() {
    return await Agent.execute({
      input: {
        html: await GetAllHTML.execute(),
        text: await GetAllText.execute()
      },
      prompt: `Extract all relevant data from this fully loaded page.
Focus on the main content that appears to be dynamically loaded.`
    });
  }
}
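An alternative to a single AI-estimated wait is to poll until the page's text length stops changing. The sketch below is framework-agnostic: `getText` stands in for a GetAllText-style call, and the timing defaults are arbitrary placeholders, not recommendations:

```javascript
// Poll the page text until its length stabilizes between two samples,
// or give up after maxTries polls.
async function waitForStableContent(getText, { intervalMs = 500, maxTries = 10 } = {}) {
  let previous = (await getText()).length;
  for (let i = 0; i < maxTries; i++) {
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
    const current = (await getText()).length;
    if (current === previous) return current; // content has settled
    previous = current;
  }
  return previous; // still growing after maxTries; return last observed length
}
```

This trades a few extra round-trips for not having to guess a wait time up front, and it degrades gracefully on pages that never stop mutating.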

Use AI to validate extracted data quality:

// AI-powered data validation
class DataValidator {
  async validateExtractedData(data, expectedSchema) {
    const validation = await Agent.execute({
      input: {
        data: JSON.stringify(data),
        schema: JSON.stringify(expectedSchema)
      },
      prompt: `Validate this extracted data against the expected schema:
1. Check completeness (all required fields present)
2. Verify data types and formats
3. Identify inconsistencies or anomalies
4. Suggest corrections for invalid data
5. Rate overall quality (0-1 score)
Return validation report with issues and suggestions.`
    });

    if (validation.qualityScore < 0.7) {
      console.log("Data quality issues detected:", validation.issues);
      return await this.attemptDataCorrection(data, validation);
    }
    return data;
  }

  async attemptDataCorrection(data, validationReport) {
    return await Agent.execute({
      input: {
        originalData: JSON.stringify(data),
        issues: JSON.stringify(validationReport.issues)
      },
      prompt: `Attempt to correct the identified data quality issues:
${validationReport.issues.join('\n')}
Return corrected data maintaining the original structure.`
    });
  }
}
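Simple structural checks don't need an AI round-trip at all. A deterministic pre-check like the hypothetical helper below can catch missing fields and type mismatches cheaply, reserving the AI validation for the fuzzier anomaly detection:

```javascript
// Check a record against a minimal schema of { type, required } specs
// and collect human-readable issues.
function checkAgainstSchema(record, schema) {
  const issues = [];
  for (const [field, spec] of Object.entries(schema)) {
    const value = record[field];
    if (value === undefined || value === null) {
      if (spec.required) issues.push(`missing required field: ${field}`);
      continue;
    }
    if (typeof value !== spec.type) {
      issues.push(`${field}: expected ${spec.type}, got ${typeof value}`);
    }
  }
  return { valid: issues.length === 0, issues };
}
```

For example, a product record whose price was scraped as the string `"9.99"` instead of a number fails this check immediately, without spending an AI call.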

Comprehensive product data extraction with competitive analysis:

// Intelligent product scraper
class ProductIntelligenceScraper {
  async scrapeProduct(productUrl) {
    await NavigateToLink.execute({ url: productUrl });

    // Multi-modal product analysis
    const productAnalysis = await Agent.execute({
      input: {
        text: await GetAllText.execute(),
        images: await GetAllImages.execute(),
        html: await GetAllHTML.execute()
      },
      tools: [ImageAnalysisTool, PriceExtractorTool, ReviewAnalyzerTool],
      prompt: `Perform comprehensive product analysis:
1. Extract basic product information (name, brand, model)
2. Identify pricing information (current, original, discounts)
3. Analyze product images for features and quality
4. Extract technical specifications
5. Analyze customer reviews and ratings
6. Identify related or competing products
7. Extract availability and shipping information
Return structured product intelligence report.`
    });

    // Enhance with competitive analysis
    const competitorAnalysis = await this.analyzeCompetitors(
      productAnalysis.productName,
      productAnalysis.category
    );

    return {
      ...productAnalysis,
      competitorAnalysis,
      scrapedAt: new Date().toISOString(),
      sourceUrl: productUrl
    };
  }

  async analyzeCompetitors(productName, category) {
    // Search for similar products
    const searchResults = await WebSearchTool.execute({
      query: `${productName} ${category} alternatives comparison`
    });

    return await Agent.execute({
      input: searchResults,
      prompt: `Analyze these search results to identify:
1. Direct competitors and alternatives
2. Price comparison opportunities
3. Feature differentiation points
4. Market positioning insights`
    });
  }
}
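The price-comparison step above assumes prices are comparable numbers, but scraped prices usually arrive as strings in varying locale formats. A small normalizer helps; this one is illustrative only and real sites will have further edge cases (ranges, "from" prices, missing decimals):

```javascript
// Normalize a scraped price string like "$1,299.99" or "1.299,99 €"
// into a number, or null when no price is present.
function parsePrice(text) {
  const m = text.match(/\d[\d.,]*/);
  if (!m) return null;
  let digits = m[0];
  // Treat the last . or , as the decimal point when 1-2 digits follow it;
  // all other separators are thousands grouping and get stripped.
  const lastSep = Math.max(digits.lastIndexOf("."), digits.lastIndexOf(","));
  if (lastSep !== -1 && digits.length - lastSep - 1 <= 2) {
    digits = digits.slice(0, lastSep).replace(/[.,]/g, "") + "." + digits.slice(lastSep + 1);
  } else {
    digits = digits.replace(/[.,]/g, "");
  }
  return parseFloat(digits);
}
```

With prices normalized to numbers, the competitive analysis can sort and diff them directly instead of asking the AI to compare strings.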

Intelligent content monitoring with sentiment analysis:

// AI-powered news monitoring
class NewsMonitoringScraper {
  async monitorTopic(topic, sources = []) {
    const articles = [];
    for (const source of sources) {
      await NavigateToLink.execute({ url: source });
      const articleAnalysis = await Agent.execute({
        input: {
          html: await GetAllHTML.execute(),
          links: await GetAllLinks.execute()
        },
        prompt: `Find articles related to "${topic}" on this news site:
1. Identify article links and headlines
2. Extract publication dates
3. Determine relevance score (0-1) to the topic
4. Classify article type (news, opinion, analysis, etc.)
Return array of relevant articles with metadata.`
      });

      // Process each relevant article
      for (const article of articleAnalysis.articles) {
        if (article.relevanceScore > 0.6) {
          const fullArticle = await this.scrapeFullArticle(article.url);
          articles.push(fullArticle);
        }
      }
    }
    return await this.analyzeArticleCollection(articles, topic);
  }

  async scrapeFullArticle(articleUrl) {
    await NavigateToLink.execute({ url: articleUrl });
    return await Agent.execute({
      input: {
        text: await GetAllText.execute(),
        html: await GetAllHTML.execute()
      },
      tools: [SentimentAnalysisTool, EntityExtractionTool],
      prompt: `Extract comprehensive article information:
1. Article title and subtitle
2. Author and publication info
3. Main content (clean text)
4. Key entities mentioned (people, organizations, locations)
5. Sentiment analysis (positive/negative/neutral)
6. Key topics and themes
7. Article summary (2-3 sentences)
Return structured article data.`
    });
  }

  async analyzeArticleCollection(articles, topic) {
    return await Agent.execute({
      input: {
        articles: JSON.stringify(articles),
        topic: topic
      },
      prompt: `Analyze this collection of articles about "${topic}":
1. Identify trending themes and subtopics
2. Track sentiment evolution over time
3. Find key influencers and sources
4. Detect emerging narratives or controversies
5. Summarize overall coverage patterns
Return comprehensive topic analysis report.`
    });
  }
}

Academic and research-focused intelligent extraction:

// Research-oriented scraper
class ResearchDataScraper {
  async collectResearchData(researchQuery, sources = []) {
    const researchData = {
      papers: [],
      datasets: [],
      experts: [],
      institutions: []
    };

    for (const source of sources) {
      await NavigateToLink.execute({ url: source });
      const sourceAnalysis = await Agent.execute({
        input: {
          text: await GetAllText.execute(),
          links: await GetAllLinks.execute()
        },
        prompt: `Analyze this academic/research source for "${researchQuery}":
1. Identify research papers and publications
2. Find datasets and data sources
3. Extract researcher and expert information
4. Identify institutional affiliations
5. Find related research topics and keywords
Classify each finding by type and relevance.`
      });

      // Process each type of research data
      await this.processResearchFindings(sourceAnalysis, researchData);
    }
    return await this.synthesizeResearchFindings(researchData, researchQuery);
  }

  async processResearchFindings(analysis, researchData) {
    // Process papers
    for (const paper of analysis.papers || []) {
      const paperDetails = await this.extractPaperDetails(paper.url);
      researchData.papers.push(paperDetails);
    }

    // Process datasets
    for (const dataset of analysis.datasets || []) {
      const datasetInfo = await this.extractDatasetInfo(dataset.url);
      researchData.datasets.push(datasetInfo);
    }

    // Process experts and institutions
    researchData.experts.push(...(analysis.experts || []));
    researchData.institutions.push(...(analysis.institutions || []));
  }

  async synthesizeResearchFindings(researchData, query) {
    return await Agent.execute({
      input: {
        data: JSON.stringify(researchData),
        query: query
      },
      prompt: `Synthesize this research data collection:
1. Identify key research themes and trends
2. Map relationships between papers, authors, and institutions
3. Find research gaps and opportunities
4. Suggest follow-up research directions
5. Create citation network analysis
Return comprehensive research landscape report.`
    });
  }
}

Combining AI with traditional scraping in this way yields extraction pipelines that are more robust, adaptable, and insightful, and that can handle the complexity and variability of modern web content.