Graft supports several embedding strategies for each supported modality. Experimenting with them can help you optimize results for your data.
Embedding strategies are set per embedding configuration, so you can embed the same table columns with different strategies and compare how they perform in the search or chat interface by enabling or disabling embeddings.
Graft currently supports six strategies:
- Chunk Average + Special Token Average
- Chunk Average + Token Average
- Chunk Average + Classification Token
- Truncate + Special Token Average
- Truncate + Token Average
- Truncate + Classification Token
By default, the Chunk Average + Special Token Average strategy is used, but depending on your use case and available data, an alternative strategy may produce better embedding results. You can create multiple Entities, each with a different strategy, to evaluate which works best.
Details of each available strategy and its use can be found in its own article.
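Each strategy name pairs a way of handling long input (Chunk Average or Truncate) with a way of pooling token vectors into one embedding. The sketch below is a generic illustration of that pairing, not Graft's implementation. The function names, the toy vectors, and the treatment of the first vector in each window as the classification token are all assumptions made for the example. Special Token Average is left out because its exact definition is covered in its own article.

```python
from statistics import fmean


def token_average(vectors):
    """Pool one window by averaging all of its token vectors."""
    return [fmean(dim) for dim in zip(*vectors)]


def classification_token(vectors):
    """Pool one window by taking its first vector.

    Assumption: the classification token sits at position 0 of each window.
    """
    return list(vectors[0])


def truncate(vectors, max_tokens, pool):
    """Keep only the first max_tokens vectors, then pool them."""
    return pool(vectors[:max_tokens])


def chunk_average(vectors, max_tokens, pool):
    """Pool each window of max_tokens vectors, then average the pooled results.

    Assumption: every chunk gets equal weight, including a shorter final chunk.
    """
    chunks = [vectors[i:i + max_tokens] for i in range(0, len(vectors), max_tokens)]
    return token_average([pool(chunk) for chunk in chunks])


# Toy input: five 2-dimensional token vectors and a window of two tokens.
tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0], [9.0, 8.0]]

truncated = truncate(tokens, 2, token_average)        # uses only the first two tokens
chunked = chunk_average(tokens, 2, token_average)     # uses all five tokens
chunked_cls = chunk_average(tokens, 2, classification_token)
```

With Truncate, everything after the first window is ignored, which is cheap but discards later content. With Chunk Average, every window contributes to the final vector.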
In general, if you want to use the generated embeddings for search, strategies using "Special Token Average" are recommended. If you are generating embeddings to be used as features for enrichments (for example, predictions), "Classification Token" strategies are preferred.
The choice between Chunk Average and Truncate depends on the length of the text. If most of it is under 500 tokens, Truncate will generate embeddings quickly. If the text is over 500 tokens, Chunk Average is recommended so that context is not lost.
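The guidance above can be written down as a small decision helper. This is only a sketch: `recommend_strategy`, the `"search"` and `"features"` labels, and the reading of "most" as "more than half" are assumptions for illustration and are not part of Graft.

```python
def recommend_strategy(token_counts, use_case, threshold=500):
    """Suggest a strategy name from text lengths and intended use.

    token_counts: token length of each text you plan to embed.
    use_case: "search" or "features" (hypothetical labels for this sketch).
    threshold: the 500-token guideline from the text above.
    """
    if not token_counts:
        raise ValueError("token_counts must not be empty")

    # Assumption: "most" means more than half of the texts are under the threshold.
    short = sum(1 for n in token_counts if n < threshold)
    length_part = "Truncate" if short > len(token_counts) / 2 else "Chunk Average"

    if use_case == "search":
        pooling_part = "Special Token Average"
    elif use_case == "features":
        pooling_part = "Classification Token"
    else:
        raise ValueError(f"unknown use case: {use_case!r}")

    return f"{length_part} + {pooling_part}"


short_docs = recommend_strategy([120, 300, 450, 900], "search")
long_docs = recommend_strategy([800, 1200, 300], "features")
```

A helper like this only gives a starting point. Comparing several strategies on your own data, as described earlier, remains the more reliable way to choose.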
Strategy Legend
Classification Token: a special token introduced to capture the semantics of the entire input.