
Character Text Splitter (Simple Document Splitter)

The Character Text Splitter breaks long documents into smaller, roughly equal-sized pieces. Think of it as cutting a long article into pages: each piece is about the same size, which makes the text easier for AI models to process.

| Name | Type | Description | Required | Default |
| --- | --- | --- | --- | --- |
| text | Text | Document to split | Yes | - |
| chunk_size | Number | Size of each piece | Yes | - |
| separator | Text | Where to split (for example, paragraph breaks) | No | "\n\n" |
| chunk_overlap | Number | How much consecutive pieces should overlap | No | 0 |

| Name | Type | Description |
| --- | --- | --- |
| chunks | Array | Text pieces ready for AI processing |
| chunk_count | Number | How many pieces were created |

  • 📄 Simple Document Processing: When you just need to break text into equal pieces
  • ⚡ Quick Setup: The fastest way to prepare documents for AI
  • 📊 Consistent Sizing: All pieces are roughly the same size
  • 🔧 Basic Workflows: A good starting point for document processing

| Parameter | Type | Description | Example |
| --- | --- | --- | --- |
| inputText | string | The text content to be split into chunks | "{{document.text}}" |
| chunkSize | number | Maximum number of characters per chunk | 1000 |

| Parameter | Type | Default | Description | Example |
| --- | --- | --- | --- | --- |
| separator | string | "\n\n" | Character sequence used to split the text | "\n" |
| chunkOverlap | number | 200 | Number of characters to overlap between consecutive chunks | 100 |
| keepSeparator | boolean | false | Whether to keep the separator in the resulting chunks | true |
| stripWhitespace | boolean | true | Whether to strip leading and trailing whitespace from chunks | false |
```json
{
  "inputText": "{{document.content}}",
  "chunkSize": 1000,
  "separator": "\n\n",
  "chunkOverlap": 200,
  "keepSeparator": false,
  "stripWhitespace": true,
  "processingOptions": {
    "minChunkSize": 50,
    "maxChunks": 1000,
    "preserveFormatting": false
  },
  "metadata": {
    "sourceDocument": "{{document.id}}",
    "processingTimestamp": "auto"
  }
}
```

The Character Text Splitter operates entirely within the browser environment and does not require additional browser permissions.

  • String Processing APIs: Native JavaScript string manipulation for efficient text splitting
  • Regular Expression Engine: For advanced separator pattern matching when needed
  • Memory Management: Efficient memory usage for processing large text documents
| Feature | Chrome | Firefox | Safari | Edge |
| --- | --- | --- | --- | --- |
| Text Splitting | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Character Counting | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Separator Processing | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Large Document Handling | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
  • Data Processing: All text processing occurs locally within the browser environment
  • Memory Safety: Efficient memory management prevents memory leaks with large documents
  • Input Validation: Text input is validated and sanitized to prevent processing errors
  • No External Dependencies: No external API calls or data transmission required
  • Content Security: Processed text remains within the secure browser context
Input schema:

```json
{
  "inputText": "string",
  "processingOptions": {
    "chunkSize": "number",
    "separator": "string",
    "chunkOverlap": "number"
  },
  "metadata": {
    "sourceId": "string",
    "documentType": "string"
  }
}
```
Output schema:

```json
{
  "chunks": [
    {
      "text": "string",
      "index": "number",
      "startPosition": "number",
      "endPosition": "number",
      "characterCount": "number"
    }
  ],
  "summary": {
    "totalChunks": "number",
    "totalCharacters": "number",
    "averageChunkSize": "number",
    "separator": "string",
    "chunkOverlap": "number"
  },
  "metadata": {
    "processingTime": "number_ms",
    "timestamp": "ISO_8601_string",
    "splitterType": "character",
    "langchainVersion": "string"
  }
}
```

Scenario: Split a technical documentation file into chunks for embedding generation, using paragraph breaks as natural splitting points.

Configuration:

```json
{
  "inputText": "{{document.content}}",
  "chunkSize": 800,
  "separator": "\n\n",
  "chunkOverlap": 100,
  "stripWhitespace": true
}
```

Input Data:

```json
{
  "inputText": "Introduction\n\nThis document provides comprehensive guidelines for using the platform.\n\nGetting Started\n\nTo begin, create an account and log into the dashboard. The interface consists of several key components that work together to provide a seamless experience.\n\nFeatures Overview\n\nThe platform offers multiple features including workflow, data processing, and integration capabilities.",
  "metadata": {
    "sourceId": "user-guide-v2",
    "documentType": "technical_documentation"
  }
}
```

Expected Output:

```json
{
  "chunks": [
    {
      "text": "Introduction\n\nThis document provides comprehensive guidelines for using the platform.\n\nGetting Started\n\nTo begin, create an account and log into the dashboard. The interface consists of several key components that work together to provide a seamless experience.",
      "index": 0,
      "startPosition": 0,
      "endPosition": 247,
      "characterCount": 247
    },
    {
      "text": "The interface consists of several key components that work together to provide a seamless experience.\n\nFeatures Overview\n\nThe platform offers multiple features including workflow, data processing, and integration capabilities.",
      "index": 1,
      "startPosition": 147,
      "endPosition": 367,
      "characterCount": 220
    }
  ],
  "summary": {
    "totalChunks": 2,
    "totalCharacters": 367,
    "averageChunkSize": 233,
    "separator": "\n\n",
    "chunkOverlap": 100
  },
  "metadata": {
    "processingTime": 15,
    "timestamp": "2024-01-15T10:30:00Z",
    "splitterType": "character",
    "langchainVersion": "0.1.0"
  }
}
```

Step-by-Step Process:

  1. Input text is analyzed for separator occurrences ("\n\n")
  2. Text is split at separator boundaries while respecting chunk size limits
  3. Overlap is applied between consecutive chunks to maintain context
  4. Each chunk is measured by character count and positioned within the original text
  5. Metadata is generated including processing statistics and chunk information
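
The steps above can be sketched in plain JavaScript. This is a simplified illustration of the greedy split-and-merge approach, not the node's actual source or the exact LangChain implementation; the function and option names are illustrative.

```javascript
// Simplified character splitter: split on the separator, then greedily
// merge pieces into chunks no larger than chunkSize, carrying up to
// chunkOverlap trailing characters into the next chunk for context.
function splitText(text, { chunkSize = 1000, separator = "\n\n", chunkOverlap = 0 } = {}) {
  const pieces = text.split(separator);
  const chunks = [];
  let current = "";
  for (const piece of pieces) {
    const candidate = current ? current + separator + piece : piece;
    if (candidate.length <= chunkSize || current === "") {
      // Keep merging; an oversized single piece is kept whole rather
      // than cut mid-piece.
      current = candidate;
    } else {
      chunks.push(current);
      // Carry overlap from the end of the finished chunk.
      const overlap = chunkOverlap > 0 ? current.slice(-chunkOverlap) : "";
      current = overlap ? overlap + separator + piece : piece;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

const chunks = splitText("a\n\nbb\n\nccc\n\ndddd", {
  chunkSize: 6,
  separator: "\n\n",
  chunkOverlap: 0,
});
console.log(chunks);
```

Each chunk stays within the size limit wherever the separator allows it; a piece longer than `chunkSize` comes through as its own oversized chunk rather than being cut mid-piece.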

Example 2: Custom Separator for Structured Content


Scenario: Process a CSV-like structured document where each record should be kept intact, using custom separators and specific formatting requirements.

Configuration:

```json
{
  "inputText": "{{structuredData.content}}",
  "chunkSize": 500,
  "separator": "---",
  "chunkOverlap": 0,
  "keepSeparator": true,
  "stripWhitespace": false
}
```

Workflow Integration:

```
[Document Loader] → [Character Text Splitter] → [Data Validator] → [Embedding Generator]
   raw_document        structured_chunks          validated_data      vector_embeddings
```

Complete Example: This configuration is ideal for processing structured data files where maintaining the exact formatting and separator structure is crucial for downstream processing, such as preparing data for specialized embedding models or maintaining data integrity in analytical workflows.
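
The following sketch shows how the `keepSeparator` and `stripWhitespace` flags plausibly affect record-style input like the example above. The behavior is assumed for illustration (not taken from the node's source), and `splitRecords` is a hypothetical helper name.

```javascript
// Illustrative record splitter: keepSeparator re-attaches the delimiter
// to each record after the first; stripWhitespace trims each record.
function splitRecords(text, separator, { keepSeparator = false, stripWhitespace = true } = {}) {
  let parts = text.split(separator);
  if (keepSeparator) {
    // Each record keeps its "---" delimiter for downstream parsing.
    parts = parts.map((p, i) => (i === 0 ? p : separator + p));
  }
  if (stripWhitespace) parts = parts.map((p) => p.trim());
  return parts.filter((p) => p.length > 0);
}

const records = splitRecords("id,name\n---\n1,Ada\n---\n2,Bob\n", "---", {
  keepSeparator: true,
  stripWhitespace: false,
});
console.log(records);
```

With `stripWhitespace: false`, the newlines around each record survive intact, which matters when the exact formatting carries meaning for downstream parsers.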

Pattern 1: Document Preprocessing Pipeline

  • Nodes: [Document Loader] → [Character Text Splitter] → [Text Cleaner] → [Embedding Generator]
  • Use Case: Prepare documents for AI processing with consistent chunk sizes
  • Configuration Tips: Use paragraph separators ("\n\n") for natural text boundaries

Pattern 2: Multi-Format Content Processing

  • Nodes: [Format Detector] → [Character Text Splitter] → [Format Normalizer] → [Content Analyzer]
  • Use Case: Process various document formats with consistent chunking strategy
  • Data Flow: Detect format, split appropriately, normalize output, analyze content

Pattern 3: RAG Content Preparation

  • Nodes: [Content Extractor] → [Character Text Splitter] → [Metadata Enricher] → [Vector Store]
  • Use Case: Prepare knowledge base content for retrieval-augmented generation
  • Configuration Tips: Balance chunk size with embedding model requirements and retrieval accuracy
  • Performance: Use appropriate chunk sizes (500-1500 characters) for optimal processing speed and memory usage
  • Error Handling: Validate input text length and handle edge cases like empty documents or very short texts
  • Data Validation: Ensure separator characters exist in the input text to enable meaningful splitting
  • Resource Management: Monitor memory usage when processing very large documents (>10MB)
  • Separator Selection: Choose separators that respect natural document structure (paragraphs, sections, sentences)
  • Symptoms: Output contains only one chunk with the entire input text
  • Causes:
    • Separator character sequence not found in the input text
    • Chunk size larger than the entire input text
    • Incorrect separator configuration
  • Solutions:
    1. Verify the separator exists in your input text
    2. Try alternative separators like "\n", ". ", or " "
    3. Reduce chunk size to force splitting
    4. Use a fallback separator strategy
  • Prevention: Analyze input text structure before configuring separators
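
One way to implement the fallback separator strategy from solution 4 is to try a list of candidate separators in order and use the first one that actually occurs in the text. This is an illustrative sketch; the candidate list and function name are assumptions, not part of the node's configuration.

```javascript
// Return the first candidate separator that appears in the text, or
// null if none does (in which case the text stays in one chunk).
function pickSeparator(text, candidates = ["\n\n", "\n", ". ", " "]) {
  for (const sep of candidates) {
    if (text.includes(sep)) return sep;
  }
  return null;
}

console.log(pickSeparator("one long line with spaces only"));
```

Running the chosen separator through this check before configuring the splitter prevents the single-chunk failure mode entirely.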
  • Symptoms: Resulting chunks don’t meet expected size requirements for downstream processing
  • Causes:
    • Inappropriate chunk size configuration
    • Separator placement creating uneven splits
    • Overlap settings affecting effective chunk size
  • Solutions:
    1. Adjust chunk size based on your specific requirements
    2. Experiment with different separators for more even distribution
    3. Modify overlap settings to balance context preservation and chunk independence
    4. Use multiple splitting strategies for different document sections
  • Prevention: Test with sample documents to optimize chunk size settings
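
For the prevention step above, a small helper can summarize the size distribution of chunks produced from a sample document before you commit to a configuration. The helper and its field names are illustrative, not part of the node's output.

```javascript
// Summarize chunk sizes to spot uneven splits before tuning chunkSize,
// separator, and overlap settings.
function chunkStats(chunks) {
  const sizes = chunks.map((c) => c.length);
  const total = sizes.reduce((a, b) => a + b, 0);
  return {
    totalChunks: chunks.length,
    minSize: Math.min(...sizes),
    maxSize: Math.max(...sizes),
    averageSize: Math.round(total / sizes.length),
  };
}

console.log(chunkStats(["aaaa", "bbbbbbbb", "cc"]));
```

A large gap between `minSize` and `maxSize` usually signals that the separator is placed unevenly in the source text.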
  • Symptoms: Important information spans chunk boundaries, reducing effectiveness of downstream processing
  • Causes:
    • Insufficient chunk overlap
    • Poor separator selection that breaks semantic units
    • Chunk size too small for content complexity
  • Solutions:
    1. Increase chunk overlap to preserve more context
    2. Use semantic-aware separators (paragraphs, sections)
    3. Increase chunk size to capture more complete thoughts
    4. Consider using recursive splitting for complex documents
  • Prevention: Design chunking strategy based on content structure and downstream requirements
  • Excellent performance with large documents up to 50MB
  • Efficient memory management for text processing
  • No known compatibility issues
  • Slightly slower processing for very large documents (>20MB)
  • Good overall compatibility with all features
  • May require longer processing time for complex separator patterns
  • Consistent performance across all Safari versions
  • Efficient handling of Unicode and special characters
  • No known limitations for typical use cases
  • Slow Processing: Optimize chunk size and separator complexity, consider processing documents in smaller segments
  • Memory Usage: Monitor browser memory usage with large documents, implement streaming for very large files
  • Character Encoding: Ensure proper UTF-8 encoding for international text and special characters
  • Maximum Input Size: Browser memory limits may restrict processing of extremely large documents (>100MB)
  • Separator Complexity: Simple character-based separators only; no regex or complex pattern matching
  • Character Counting: Counts Unicode characters, which may differ from byte count or token count
  • Processing Speed: Large documents may require several seconds to process depending on complexity
  • Memory Constraints: Available browser memory limits the maximum document size that can be processed
  • String Length Limits: JavaScript string length limitations may affect very large documents
  • Performance Variation: Processing speed varies across different browsers and devices
  • Input Format: Accepts plain text only; binary or encoded content must be converted first
  • Output Size: Large numbers of chunks may impact browser performance and memory usage
  • Character Encoding: Proper handling requires valid UTF-8 encoded text input
  • Separator Requirements: Effective splitting requires appropriate separator characters in the source text
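
The character-counting caveat above is easy to see in practice: JavaScript's `.length` counts UTF-16 code units, so characters outside the Basic Multilingual Plane (such as many emoji) count as two, while iterating the string counts Unicode code points — and token counts for embedding models differ again.

```javascript
// "🌍" is one code point but two UTF-16 code units, so the two counts
// below disagree for the same string.
const text = "hello 🌍";
console.log(text.length);      // UTF-16 code units: 8
console.log([...text].length); // Unicode code points: 7
```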
  • Recursive Text Splitter: More advanced splitting with hierarchical separator fallback
  • Token Text Splitter: Splits text based on token count rather than character count
  • Semantic Text Splitter: AI-powered splitting that respects semantic boundaries
  • Text Cleaner: Preprocesses text by removing unwanted characters and formatting
  • Document Loader: Loads and extracts text from various document formats
  • Embedding Generator: Converts text chunks into vector embeddings for AI processing
  • Metadata Enricher: Adds contextual information to processed text chunks
  • For document processing, consider combining with: Document Loader, Text Cleaner, Metadata Enricher
  • For RAG systems, this node works well before: Embedding Generator, Vector Store Writer, Similarity Search
  • For content analysis, follow this node with: Text Analyzer, Sentiment Processor, Topic Classifier