Chunks


Methods

Rank Chunks
POST /v4/chunks/rank

Description

Sorts a list of text chunks by similarity to a given query string.

Details

Use this API endpoint to rank which text chunks provide the most relevant responses to a given query string.

This is useful for stuffing chunks into a prompt where order may matter, or for filtering out less relevant chunks according to the ranking strategy. For example, this API can help with retrieval-augmented generation (RAG): vector store similarity search does not always return the best ranking of text chunks, since its quality depends heavily on how the embeddings were generated. This endpoint can act as a post-processing step that re-sorts the given chunks using more sophisticated strategies, which may outperform vector search, and then filters down to the top-k most relevant chunks to stuff into the RAG prompt.
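As a sketch, a rank-then-filter call might look like the following. The base URL, auth header, and request field names (query, relevant_chunks) are assumptions for illustration, not the authoritative schema; only the endpoint path and the RankedChunksResponse shape come from this page.

```python
import requests

API_BASE = "https://api.egp.example.com"  # assumption: replace with your actual base URL
API_KEY = "YOUR_API_KEY"                  # assumption: auth scheme is illustrative

def rank_chunks(query: str, chunks: list[dict]) -> list[dict]:
    """Re-rank candidate chunks against a query via POST /v4/chunks/rank."""
    resp = requests.post(
        f"{API_BASE}/v4/chunks/rank",
        headers={"Authorization": f"Bearer {API_KEY}"},
        # Request field names below are illustrative, not the authoritative schema.
        json={"query": query, "relevant_chunks": chunks},
    )
    resp.raise_for_status()
    # RankedChunksResponse carries the chunks sorted by relevance.
    return resp.json()["relevant_chunks"]

candidate_chunks = [
    {"text": "Rotate API keys from the dashboard under Settings > Keys."},
    {"text": "Our office dog is named Biscuit."},
]

# Keep only the top-k most relevant chunks to stuff into the RAG prompt.
top_k = rank_chunks("How do I rotate my API keys?", candidate_chunks)[:5]
```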

Restrictions and Limits

Ranking can be an intensive and slow process depending on the methodology, and its duration scales with the number of chunks. For best performance, we recommend ranking fewer than 640 chunks at a time; expect latency to increase as the number of chunks ranked grows.
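If you have more chunks than the recommended limit, one workable (if imperfect) pattern is to rank in batches of at most 640, reusing the hypothetical rank_chunks helper sketched above. Note the caveat in the code: without comparable scores across calls, the ordering is only meaningful within each batch.

```python
def rank_in_batches(query: str, chunks: list[dict], batch_size: int = 640) -> list[dict]:
    """Rank chunks in batches that respect the recommended size limit.

    Caveat: each batch is ranked independently, so the concatenated result
    is not a true global ordering across batches.
    """
    ranked = []
    for i in range(0, len(chunks), batch_size):
        ranked.extend(rank_chunks(query, chunks[i:i + batch_size]))
    return ranked
```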

Synthesize Chunks
POST /v4/chunks/synthesis

Description

Synthesizes a response to the given query using the given chunks as context.

Details

This API operates on "chunks," which are the result of querying a vector store. A chunk is simply a fragment of a larger document, and it can optionally carry its own metadata or ID. You can also construct your own chunks from scratch, so long as you provide the text for each chunk.
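For instance, hand-built chunks might look like the following. Per the description above, only the text is required; the key names chunk_id and metadata are assumptions for illustration.

```python
chunks = [
    {
        "text": "Rotate API keys from the dashboard under Settings > Keys.",
        "chunk_id": "admin-guide-p3",           # optional identifier (illustrative)
        "metadata": {"source": "admin-guide"},  # optional metadata (illustrative)
    },
    # The text alone is sufficient to form a valid chunk.
    {"text": "Keys expire 90 days after issuance."},
]
```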

Synthesizing chunks can be thought of as the final step in a retrieval-augmented generation (RAG) system, after querying for chunks and potentially ranking them. Once you have the chunks relevant to the user's query, you'll want to synthesize a readable natural language answer. During this synthesis step, we prompt an LLM with instructions and a set of (possibly transformed) chunks to guide it toward synthesizing a natural language response to the user query.
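A minimal synthesis call might look like the sketch below, following the same illustrative conventions (base URL, auth header, request field names) as the ranking sketch earlier on this page; only the endpoint path and the SynthesizeChunksResponse fields come from this page.

```python
def synthesize(query: str, chunks: list[dict]) -> tuple[str, list]:
    """Generate a natural language answer from chunks via POST /v4/chunks/synthesis."""
    resp = requests.post(
        f"{API_BASE}/v4/chunks/synthesis",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"query": query, "chunks": chunks},  # request field names are illustrative
    )
    resp.raise_for_status()
    body = resp.json()  # SynthesizeChunksResponse
    return body["response"], body["source_nodes"]

answer, sources = synthesize("How do I rotate my API keys?", top_k)
```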

There are several synthesis strategies, whose differences become more apparent as the number of chunks increases. Currently, EGP supports the compact strategy by default: stuff as many chunks into the synthesis LLM's context as possible, produce a best-effort answer, then continue stuffing and answering with the next set of chunks. The answer is carried forward and gradually refined with each round of chunk scanning, until all chunks have been scanned and a final answer is synthesized. More synthesis strategies are currently in development!
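Conceptually, a compact-style strategy behaves like the loop below. This is an illustrative sketch of the refinement pattern, not EGP's actual implementation; the token estimate and the llm callable are placeholders.

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token (placeholder, not a real tokenizer).
    return len(text) // 4

def compact_synthesis(query: str, chunks: list[dict], llm, max_context_tokens: int = 8000) -> str:
    """Illustrative sketch of a compact-style refinement loop (not EGP's code)."""
    answer = ""
    remaining = list(chunks)
    while remaining:
        batch, used = [], estimate_tokens(query) + estimate_tokens(answer)
        # Stuff as many chunks as fit into this round's context window.
        while remaining and used + estimate_tokens(remaining[0]["text"]) <= max_context_tokens:
            used += estimate_tokens(remaining[0]["text"])
            batch.append(remaining.pop(0))
        if not batch:  # a single oversized chunk; take it anyway
            batch.append(remaining.pop(0))
        # Produce a best-effort answer, refining the previous draft with this batch.
        answer = llm(query=query, context=batch, draft_answer=answer)
    return answer
```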

Restrictions and Limits

Generally, chunk synthesis consumes tokens linearly in the total number of chunk tokens. The compact strategy tries to be as token-efficient as possible by packing as many chunks as it can into each round of chunk scanning. Other strategies (in development!) may produce better summaries or more precise answers at the cost of more tokens consumed.

Broadly, we recommend keeping the number of chunks under 100, or the total number of tokens across all chunks under roughly 10,000.
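To stay within those soft limits, you can estimate token counts before calling the endpoint. The sketch below uses tiktoken as one possible tokenizer; this library choice and the encoding name are assumptions, and any tokenizer roughly matched to the synthesis model will do.

```python
import tiktoken

def within_limits(chunks: list[dict], max_chunks: int = 100, max_tokens: int = 10_000) -> bool:
    """Check the recommended chunk-count and token-count limits before synthesizing."""
    enc = tiktoken.get_encoding("cl100k_base")  # assumption: pick an encoding near your model's
    total = sum(len(enc.encode(c["text"])) for c in chunks)
    return len(chunks) <= max_chunks and total <= max_tokens
```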

Domain types

RankedChunksResponse = { relevant_chunks }
SynthesizeChunksResponse = { response, metadata, source_nodes }