Recursive Character Text Splitter (Smart Chunking)
What It Does
The Recursive Character Text Splitter intelligently breaks long documents into smaller, manageable chunks while keeping related content together. It’s like having a smart librarian who knows exactly where to split a book so each section makes sense on its own.
What Goes In, What Comes Out
| Name | Type | Description | Required | Default |
|---|---|---|---|---|
| text | Text | Document content to split | Yes | - |
| chunk_size | Number | Maximum characters per chunk | Yes | - |
| chunk_overlap | Number | Characters to overlap between chunks | No | 200 |
| separators | Array | How to split text (paragraphs, sentences) | No | ["\n\n", "\n", ". "] |
Output
| Name | Type | Description |
|---|---|---|
| chunks | Array | Smart text chunks ready for AI processing |
| metadata | Object | Information about the splitting process |
Why Use Smart Chunking?
- 🧠 Preserves Meaning: Keeps related sentences and paragraphs together
- 📏 Perfect Sizing: Creates chunks that are just the right size for AI models
- 🔗 Maintains Context: Overlaps chunks so important connections aren’t lost
- 📚 Format Aware: Understands different document types (HTML, markdown, plain text)
- ⚡ AI Ready: Outputs chunks formatted for knowledge bases and AI processing
How It Works
```mermaid
flowchart LR
    A[📄 Long Document] --> B[✂️ Smart Splitting]
    B --> C[📝 Perfect Chunks]
    C --> D[🤖 AI Ready]
    style A fill:#e3f2fd
    style B fill:#fff3e0
    style C fill:#f3e5f5
    style D fill:#e8f5e8
```
Smart Process:
- Analyze Document: Looks for natural break points (paragraphs, sentences)
- Split Intelligently: Breaks text while keeping related content together
- Add Overlap: Makes sure important connections aren’t lost between chunks
- Perfect Size: Creates chunks that are just right for AI processing (a minimal code sketch follows below)
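Here is a minimal TypeScript sketch of the recursive strategy, assuming the standard separator hierarchy. It illustrates the idea only; the node’s internal implementation may differ, and overlap handling is shown separately below.

```typescript
// Minimal recursive character splitting (illustration only).
// Tries coarse separators first (paragraphs), then progressively finer
// ones (lines, sentences, words, single characters).
function recursiveSplit(
  text: string,
  chunkSize: number,
  separators: string[] = ["\n\n", "\n", ". ", " ", ""],
): string[] {
  if (text.length <= chunkSize) return [text];

  const [sep, ...rest] = separators;
  const parts = sep === "" ? Array.from(text) : text.split(sep);

  const chunks: string[] = [];
  let current = "";
  for (const part of parts) {
    const candidate = current === "" ? part : current + sep + part;
    if (candidate.length <= chunkSize) {
      current = candidate; // keep packing related pieces together
    } else {
      if (current !== "") chunks.push(current);
      current = "";
      if (part.length > chunkSize && rest.length > 0) {
        // A single piece is still too big: recurse with finer separators.
        chunks.push(...recursiveSplit(part, chunkSize, rest));
      } else {
        current = part;
      }
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}
```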
Perfect For
- 📚 Preparing Documents for AI: Get documents ready for knowledge bases
- 🔍 Building Search Systems: Create searchable chunks from long content
- 🤖 AI Processing: Split content into AI-friendly sizes
- 📊 Content Organization: Break large documents into manageable pieces
Simple Settings
What You Need to Set ⚙️
- Text: The document you want to split
- Chunk Size: How big each piece should be (1000 characters is usually good)
Optional Settings 🎛️
- Chunk Overlap: How much pieces should overlap (200 characters prevents losing context)
- Separators: Where to split (paragraphs work best for most documents)
- Length Function: How chunk length is measured ("character", "token", or "word"; default "character")
Advanced Configuration
Section titled “Advanced Configuration”{ "text": "{extracted_content}", "chunk_size": 1500, "chunk_overlap": 300, "separators": ["\\n\\n", "\\n", ". ", " ", ""], "keep_separator": true, "is_separator_regex": false, "length_function": "character", "metadata_preservation": { "source_tracking": true, "chunk_numbering": true, "overlap_marking": true }, "content_type_handling": { "markdown": true, "code_blocks": true, "html_tags": false }}Browser API Integration
Required Permissions
| Permission | Purpose | Security Impact |
|---|---|---|
| storage | Cache splitting configurations and processed chunks | Stores text processing data locally |
| activeTab | Access current tab content for text extraction | Can read content from active browser tabs |
Browser APIs Used
- Web Workers: Processes large text splitting operations without blocking the UI (see the sketch after this list)
- IndexedDB: Caches processed chunks and splitting configurations
- TextEncoder/TextDecoder: Handles text encoding and character counting accurately
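For the Web Workers point above, a hedged sketch of how splitting might be offloaded follows. `splitter.worker.js` is a hypothetical script that would run the splitting logic and post the chunks back; the extension’s actual worker setup may differ.

```typescript
// Offload splitting so the page UI stays responsive (sketch only).
const longDocument = "...very large extracted text...";
const worker = new Worker("splitter.worker.js"); // hypothetical worker script

worker.onmessage = (event: MessageEvent<string[]>) => {
  console.log(`Received ${event.data.length} chunks without blocking the UI`);
};

worker.postMessage({ text: longDocument, chunkSize: 1200, chunkOverlap: 200 });
```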
Cross-Browser Compatibility
Section titled “Cross-Browser Compatibility”| Feature | Chrome | Firefox | Safari | Edge |
|---|---|---|---|---|
| Text Processing | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Background Processing | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Large Document Handling | ✅ Full | ✅ Full | ⚠️ Limited | ✅ Full |
Security Considerations
- Local Processing: All text splitting occurs locally without external transmission
- Memory Management: Efficient processing of large documents without memory leaks
- Data Privacy: Text content remains within browser environment during processing
- Resource Monitoring: Tracks processing resources to prevent browser overload
- Content Validation: Validates input text format and encoding before processing (sketched below)
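One plausible shape for that validation step is sketched below; the NFC normalization and byte ceiling are illustrative assumptions, not the node’s documented checks.

```typescript
// Pre-flight validation: reject non-strings, normalize Unicode, and
// bound the encoded size before splitting (illustrative thresholds).
function validateInput(text: unknown): string {
  if (typeof text !== "string" || text.length === 0) {
    throw new Error("Splitter input must be a non-empty string");
  }
  const normalized = text.normalize("NFC");
  const bytes = new TextEncoder().encode(normalized).length;
  if (bytes > 50_000_000) {
    // Arbitrary ceiling chosen here to avoid exhausting browser memory.
    throw new Error(`Input too large to process safely (${bytes} bytes)`);
  }
  return normalized;
}
```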
Input/Output Specifications
Input Data Structure
Section titled “Input Data Structure”{ "text": "string - The text content to split into chunks", "splitting_config": { "chunk_size": "number - Maximum chunk size", "chunk_overlap": "number - Overlap between chunks", "separators": "array - List of separators to use", "content_type": "string - Type of content (markdown, html, code, text)" }, "metadata": { "source_url": "string - URL of source document", "document_title": "string - Title of the document", "timestamp": "string - When content was extracted" }}Output Data Structure
Section titled “Output Data Structure”{ "chunks": [ { "text": "string - The chunk text content", "metadata": { "chunk_id": "string - Unique identifier for this chunk", "chunk_index": "number - Position in the original document", "start_position": "number - Character position where chunk starts", "end_position": "number - Character position where chunk ends", "overlap_with_previous": "number - Characters overlapping with previous chunk", "overlap_with_next": "number - Characters overlapping with next chunk", "source_document": "string - Reference to original document" }, "statistics": { "character_count": "number - Number of characters in chunk", "word_count": "number - Number of words in chunk", "line_count": "number - Number of lines in chunk" } } ], "summary": { "total_chunks": "number - Total number of chunks created", "original_length": "number - Length of original text", "total_chunk_length": "number - Combined length of all chunks", "average_chunk_size": "number - Average size of chunks", "overlap_efficiency": "number - Percentage of content that is overlapped" }, "metadata": { "timestamp": "2024-01-15T10:30:00Z", "processing_time": 450, "splitting_strategy": "recursive_character", "source": "recursive_text_splitter" }}Practical Examples
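For consumers working in code, the following TypeScript interfaces are one possible typing of the structures above, derived from the schema shown here rather than from an official API.

```typescript
// Assumed typings mirroring the documented output structure.
interface ChunkMetadata {
  chunk_id: string;
  chunk_index: number;
  start_position: number;
  end_position: number;
  overlap_with_previous: number;
  overlap_with_next: number;
  source_document: string;
}

interface Chunk {
  text: string;
  metadata: ChunkMetadata;
  statistics: { character_count: number; word_count: number; line_count: number };
}

interface SplitterOutput {
  chunks: Chunk[];
  summary: {
    total_chunks: number;
    original_length: number;
    total_chunk_length: number;
    average_chunk_size: number;
    overlap_efficiency: number;
  };
  metadata: {
    timestamp: string;
    processing_time: number;
    splitting_strategy: string;
    source: string;
  };
}
```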
Practical Examples
Example 1: Document Preparation for Knowledge Base
Scenario: Split a large technical document into chunks for vector storage and a RAG system
Configuration:
{ "text": "{technical_document}", "chunk_size": 1200, "chunk_overlap": 200, "separators": ["\\n\\n", "\\n", ". ", " ", ""], "keep_separator": true, "length_function": "character"}Input Data:
{ "text": "# API Documentation\\n\\nThe REST API provides programmatic access to system resources. Authentication is required for all endpoints.\\n\\n## Authentication\\n\\nUse Bearer tokens in the Authorization header. Tokens expire after 24 hours and must be refreshed.\\n\\n### Getting a Token\\n\\nSend a POST request to /auth/token with valid credentials. The response includes an access token and refresh token.\\n\\n## Endpoints\\n\\n### Users\\n\\nThe /users endpoint allows management of user accounts. Supports GET, POST, PUT, and DELETE operations.",
"splitting_config": { "chunk_size": 1200, "chunk_overlap": 200, "separators": ["\\n\\n", "\\n", ". ", " ", ""], "content_type": "markdown" }, "metadata": { "source_url": "https://docs.example.com/api", "document_title": "API Documentation", "timestamp": "2024-01-15T10:00:00Z" }}Expected Output:
{ "chunks": [ { "text": "# API Documentation\\n\\nThe REST API provides programmatic access to system resources. Authentication is required for all endpoints.\\n\\n## Authentication\\n\\nUse Bearer tokens in the Authorization header. Tokens expire after 24 hours and must be refreshed.\\n\\n### Getting a Token\\n\\nSend a POST request to /auth/token with valid credentials. The response includes an access token and refresh token.",
"metadata": { "chunk_id": "chunk_001", "chunk_index": 0, "start_position": 0, "end_position": 387, "overlap_with_previous": 0, "overlap_with_next": 200, "source_document": "API Documentation" }, "statistics": { "character_count": 387, "word_count": 58, "line_count": 9 } }, { "text": "The response includes an access token and refresh token.\\n\\n## Endpoints\\n\\n### Users\\n\\nThe /users endpoint allows management of user accounts. Supports GET, POST, PUT, and DELETE operations.",
"metadata": { "chunk_id": "chunk_002", "chunk_index": 1, "start_position": 187, "end_position": 387, "overlap_with_previous": 200, "overlap_with_next": 0, "source_document": "API Documentation" }, "statistics": { "character_count": 200, "word_count": 30, "line_count": 6 } } ], "summary": { "total_chunks": 2, "original_length": 587, "total_chunk_length": 587, "average_chunk_size": 293, "overlap_efficiency": 34.1 }, "metadata": { "timestamp": "2024-01-15T10:30:00Z", "processing_time": 450, "splitting_strategy": "recursive_character", "source": "recursive_text_splitter" }}Step-by-Step Process:
- Input text is analyzed for structure and content type
- Recursive splitting algorithm applies separators in hierarchical order
- Chunks are created respecting size limits while preserving semantic boundaries
- Overlap regions are calculated and marked for context continuity
- Metadata is generated for each chunk, including position and statistics (see the sketch below)
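As a rough sketch of that final step, a summary block like the one in the expected output could be derived from the chunk texts as follows. `summarize` is a hypothetical helper, not the node’s actual code.

```typescript
// Derive the summary block from raw chunk texts (illustration only).
function summarize(chunks: string[], originalLength: number) {
  const totalChunkLength = chunks.reduce((sum, c) => sum + c.length, 0);
  return {
    total_chunks: chunks.length,
    original_length: originalLength,
    total_chunk_length: totalChunkLength,
    average_chunk_size: Math.round(totalChunkLength / chunks.length),
    // Share of emitted characters that are duplicated overlap, as a percentage.
    overlap_efficiency:
      Math.round(((totalChunkLength - originalLength) / totalChunkLength) * 1000) / 10,
  };
}
```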
Example 2: Code Documentation Processing
Scenario: Split code documentation with special handling for code blocks and comments
Configuration:
{ "text": "{code_documentation}", "chunk_size": 800, "chunk_overlap": 150, "separators": ["```\\n", "\\n\\n", "\\n", ". ", " "], "keep_separator": true, "content_type": "code"}Workflow Integration:
```text
GetAllTextFromLink → Recursive Character Text Splitter → Ollama Embeddings → LocalKnowledge
        ↓                           ↓                           ↓                  ↓
raw_documentation          structured_chunks               embeddings        vector_storage
```
Complete Example: This pattern creates a complete pipeline for processing technical documentation, ensuring code blocks remain intact while creating searchable knowledge bases.
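A hedged TypeScript sketch of this pipeline is shown below. `extractText`, `embed`, and `storeVectors` are hypothetical stand-ins for the GetAllTextFromLink, Ollama Embeddings, and LocalKnowledge nodes, whose real interfaces may differ; `recursiveSplit` is the sketch from earlier.

```typescript
// Hypothetical node stand-ins; real node interfaces may differ.
declare function extractText(url: string): Promise<string>;
declare function embed(chunk: string): Promise<number[]>;
declare function storeVectors(chunks: string[], vectors: number[][]): Promise<void>;

async function buildKnowledgeBase(url: string): Promise<void> {
  const raw = await extractText(url);                   // GetAllTextFromLink
  const chunks = recursiveSplit(raw, 800, ["```\n", "\n\n", "\n", ". ", " "]);
  const vectors = await Promise.all(chunks.map(embed)); // Ollama Embeddings
  await storeVectors(chunks, vectors);                  // LocalKnowledge
}
```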
Examples
Basic Usage
This example demonstrates the fundamental usage of the RecursiveCharacterTextSplitter node in a typical workflow scenario.
Configuration:
{ "model": "example_value", "enabled": true}Input Data:
{ "data": "sample input data"}Expected Output:
{ "result": "processed output data"}Advanced Usage
This example shows more complex configuration options and integration patterns.
Configuration:
{ "parameter1": "advanced_value", "parameter2": false, "advancedOptions": { "option1": "value1", "option2": 100 }}Integration Example
Example showing how this node integrates with other workflow nodes:
- Previous Node → RecursiveCharacterTextSplitter → Next Node
- Data flows through the workflow with appropriate transformations
- Error handling and validation at each step
Integration Patterns
Common Node Combinations
Pattern 1: Knowledge Base Creation Pipeline
- Nodes: GetAllTextFromLink → Recursive Character Text Splitter → Ollama Embeddings → LocalKnowledge
- Use Case: Process web documents into searchable knowledge bases with optimal chunking
- Configuration Tips: Match chunk sizes to embedding model capabilities and vector store requirements
Pattern 2: Batch Document Processing
- Nodes: Recursive Character Text Splitter → Basic LLM Chain → EditFields → Merge
- Use Case: Process large documents in chunks and combine AI analysis results
- Data Flow: Text splitting → Individual chunk processing → Result formatting → Result combination
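A minimal sketch of this fan-out/merge flow, assuming a hypothetical `analyzeChunk` helper in place of the Basic LLM Chain node:

```typescript
// Hypothetical per-chunk analysis step (e.g. a Basic LLM Chain call).
declare function analyzeChunk(chunk: string): Promise<string>;

async function processDocument(chunks: string[]): Promise<string> {
  // Analyze each chunk independently, then merge the formatted results.
  const results = await Promise.all(chunks.map(analyzeChunk));
  return results.join("\n\n");
}
```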
Best Practices
- Performance: Use appropriate chunk sizes to balance context preservation and processing efficiency
- Error Handling: Validate text encoding and handle malformed content gracefully
- Data Validation: Ensure chunk sizes are appropriate for downstream AI model requirements
- Resource Management: Monitor memory usage when processing very large documents
Troubleshooting
Common Issues
Issue: Chunks Too Large or Too Small
- Symptoms: Downstream AI processing fails due to inappropriate chunk sizes
- Causes: Incorrect chunk_size parameter, inappropriate separators, or content type mismatch
- Solutions:
- Adjust chunk_size parameter based on AI model requirements
- Modify separators to better match content structure
- Use an appropriate length_function (character vs. token counting; see the sketch after this list)
- Test with sample content to optimize parameters
- Prevention: Profile downstream AI model requirements and test with representative content
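To make the character-versus-token point concrete, here is an illustrative set of length functions. Real token counts are model-specific; the four-characters-per-token figure is only a rough English heuristic.

```typescript
// Three ways to measure chunk length (illustrative implementations).
const lengthFunctions = {
  character: (text: string) => text.length,
  word: (text: string) => text.split(/\s+/).filter(Boolean).length,
  token: (text: string) => Math.ceil(text.length / 4), // rough approximation
};

const measure = lengthFunctions["token"];
console.log(measure("The REST API provides programmatic access.")); // 11
```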
Issue: Poor Semantic Preservation
- Symptoms: Related content is split across chunks, breaking context
- Causes: Inappropriate separators, insufficient overlap, or aggressive chunk sizing
- Solutions:
- Increase chunk_overlap to preserve more context
- Adjust separators to respect content structure better
- Use content-type-specific separator configurations
- Implement custom separators for specific document formats (see the sketch after this list)
- Prevention: Analyze document structure and choose separators that respect semantic boundaries
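For the custom-separator suggestion above, one illustrative mapping of separator sets per content type follows; these are assumed defaults for the sketch, not the node’s built-in lists.

```typescript
// Separator hierarchies tuned per content type (illustrative only).
const separatorsByContentType: Record<string, string[]> = {
  markdown: ["\n## ", "\n### ", "\n\n", "\n", ". ", " ", ""],
  code: ["```\n", "\n\n", "\n", " ", ""],
  html: ["</p>", "<br>", "\n\n", "\n", ". ", " ", ""],
  text: ["\n\n", "\n", ". ", " ", ""],
};
```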
Browser-Specific Issues
Chrome
- Large document processing may trigger memory warnings; implement chunked processing
- Use Web Workers for background processing of very large documents
Firefox
- Memory management may differ; monitor resource usage during large document processing
- Implement progress indicators for long-running splitting operations
Performance Issues
- Memory Usage: Very large documents may consume significant browser memory during processing
- Processing Time: Complex documents with many separators may require substantial processing time
- Storage Impact: Generated chunks may consume significant browser storage space
Limitations & Constraints
Technical Limitations
- Document Size: Very large documents may exceed browser memory limits
- Separator Complexity: Complex regex separators may impact processing performance
- Context Preservation: Perfect semantic preservation may not always be possible with size constraints
Browser Limitations
- Memory Constraints: Browser memory limits may restrict maximum document size
- Processing Time: Long-running operations may be interrupted by browser timeouts
- Storage Limits: Generated chunks may exceed browser storage quotas
Data Limitations
- Content Type Support: Some specialized document formats may not split optimally
- Language Support: Separator effectiveness may vary for different languages
- Structure Recognition: Complex document structures may not be perfectly preserved
Key Terminology
LLM: Large Language Model - AI models trained on vast amounts of text data
RAG: Retrieval-Augmented Generation - AI technique combining information retrieval with text generation
Vector Store: Database optimized for storing and searching high-dimensional vectors
Embeddings: Numerical representations of text that capture semantic meaning
Prompt: Input text that guides AI model behavior and response generation
Temperature: Parameter controlling randomness in AI responses (0.0-1.0)
Tokens: Units of text processing used by AI models for input and output measurement
Search & Discovery
Keywords
- artificial intelligence
- machine learning
- natural language processing
- LLM
- AI agent
- chatbot
- text generation
- language model
Common Search Terms
- “ai”
- “llm”
- “gpt”
- “chat”
- “generate”
- “analyze”
- “understand”
- “process text”
- “smart”
- “intelligent”
Primary Use Cases
- content analysis
- text generation
- question answering
- document processing
- intelligent automation
- knowledge extraction