Recursively split by character
This text splitter is the recommended one for generic text. It is parameterized by a list of characters and tries to split on them in order until the chunks are small enough. The default list of separators is ["\n\n", "\n", " ", ""]. This has the effect of keeping paragraphs (and then lines, and then words) together for as long as possible, since those are generally the most strongly semantically related pieces of text.
- How the text is split: by list of characters
- How the chunk size is measured: by number of characters
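The fallback order described above can be sketched in a few lines. This is a simplified illustration, not LangChain's implementation: the function name `recursiveSplit` is hypothetical, overlap is ignored, and empty pieces are dropped. It only shows how a splitter can try each separator in turn and recurse into pieces that are still too large.

```typescript
// Simplified sketch of recursive splitting (hypothetical helper, no overlap).
function recursiveSplit(
  text: string,
  chunkSize: number,
  separators: string[] = ["\n\n", "\n", " ", ""],
): string[] {
  if (chunkSize < 1) throw new Error("chunkSize must be at least 1");
  // Already small enough: keep the piece whole.
  if (text.length <= chunkSize) return text.length > 0 ? [text] : [];

  const separator = separators[0] ?? "";
  const rest = separators.slice(1);

  // Last resort ("" or no separators left): cut into fixed-width slices.
  if (separator === "") {
    const slices: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      slices.push(text.slice(i, i + chunkSize));
    }
    return slices;
  }

  const chunks: string[] = [];
  let current = "";
  for (const piece of text.split(separator)) {
    if (piece.length === 0) continue;
    // Greedily glue neighbouring pieces back together while they still fit.
    const candidate = current === "" ? piece : current + separator + piece;
    if (candidate.length <= chunkSize) {
      current = candidate;
      continue;
    }
    if (current !== "") chunks.push(current);
    current = "";
    if (piece.length <= chunkSize) {
      current = piece;
    } else {
      // Piece is still too large: retry with the next, finer separator.
      chunks.push(...recursiveSplit(piece, chunkSize, rest));
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

console.log(recursiveSplit("aaa bbb\n\nccc ddd eee", 7));
// [ 'aaa bbb', 'ccc ddd', 'eee' ]
```

The first paragraph fits in one chunk and stays whole; the second does not, so the sketch falls through to the space separator and packs as many words as fit into each chunk.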
The important parameters here are chunkSize and chunkOverlap. chunkSize controls the maximum size (in number of characters) of the final documents. chunkOverlap specifies how much overlap there should be between consecutive chunks, which often helps preserve context where the text is cut. The example below sets both values very small for illustration purposes; in practice they default to 1000 and 200 respectively.
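To make the two parameters concrete, here is a minimal sketch of overlap on a plain fixed-width window. The helper `slidingWindow` is hypothetical and is not how RecursiveCharacterTextSplitter picks its boundaries (the splitter works on separator-delimited pieces); it only illustrates that each chunk is at most chunkSize characters long and repeats the last chunkOverlap characters of the previous one.

```typescript
// Hypothetical helper: fixed-width windows that share `chunkOverlap` characters.
function slidingWindow(
  text: string,
  chunkSize: number,
  chunkOverlap: number,
): string[] {
  if (chunkSize < 1 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("require chunkSize >= 1 and 0 <= chunkOverlap < chunkSize");
  }
  // Each window starts this many characters after the previous one.
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once a window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

console.log(slidingWindow("abcdefghij", 4, 1));
// [ 'abcd', 'defg', 'ghij' ]
```

Each window begins with the final character of the one before it ("d", then "g"), which is the overlap of 1.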
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
const text = `Hi.\n\nI'm Harrison.\n\nHow? Are? You?\nOkay then f f f f.
This is a weird text to write, but gotta test the splittingggg some how.\n\n
Bye!\n\n-H.`;
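const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 10, // deliberately tiny, for illustration only
  chunkOverlap: 1,
});
// createDocuments takes an array of raw strings and returns a list of documents.
const output = await splitter.createDocuments([text]);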
You'll note that in the above example we are splitting a raw text string and getting back a list of documents. We can also split documents directly.
import { Document } from "langchain/document";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
const text = `Hi.\n\nI'm Harrison.\n\nHow? Are? You?\nOkay then f f f f.
This is a weird text to write, but gotta test the splittingggg some how.\n\n
Bye!\n\n-H.`;
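const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 10, // deliberately tiny, for illustration only
  chunkOverlap: 1,
});
// splitDocuments takes existing Document objects instead of raw strings.
const docOutput = await splitter.splitDocuments([
  new Document({ pageContent: text }),
]);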
You can customize the RecursiveCharacterTextSplitter with arbitrary separators by passing a separators parameter like this:
import { Document } from "langchain/document";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
const text = `Some other considerations include:
- Do you deploy your backend and frontend together, or separately?
- Do you deploy your backend co-located with your database, or separately?
**Production Support:** As you move your LangChains into production, we'd love to offer more hands-on support.
Fill out [this form](https://airtable.com/appwQzlErAS2qiP0L/shrGtGaVBVAz7NcV2) to share more about what you're building, and our team will get in touch.
## Deployment Options
See below for a list of deployment options for your LangChain app. If you don't see your preferred option, please get in touch and we can add it to this list.`;
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 50,
  chunkOverlap: 1,
  separators: ["|", "##", ">", "-"],
});
const docOutput = await splitter.splitDocuments([
  new Document({ pageContent: text }),
]);
console.log(docOutput);
/*
  [
    Document {
      pageContent: 'Some other considerations include:',
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: '- Do you deploy your backend and frontend together',
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: 'r, or separately?',
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: '- Do you deploy your backend co',
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: '-located with your database, or separately?\n\n**Pro',
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: 'oduction Support:** As you move your LangChains in',
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: "nto production, we'd love to offer more hands",
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: '-on support.\nFill out [this form](https://airtable',
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: 'e.com/appwQzlErAS2qiP0L/shrGtGaVBVAz7NcV2) to shar',
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: "re more about what you're building, and our team w",
      metadata: { loc: [Object] }
    },
    Document {
      pageContent: 'will get in touch.',
      metadata: { loc: [Object] }
    },
    Document { pageContent: '#', metadata: { loc: [Object] } },
    Document {
      pageContent: '# Deployment Options\n' +
        '\n' +
        "See below for a list of deployment options for your LangChain app. If you don't see your preferred option, please get in touch and we can add it to this list.",
      metadata: { loc: [Object] }
    }
  ]
*/
API Reference:
- Document from langchain/document
- RecursiveCharacterTextSplitter from langchain/text_splitter