PREPROCESS

Document preprocessing for LLMs

The API to convert and split any kind of document into optimal chunks of text
without the hassle of building an in-house solution.

Free to use, no credit card required

The problem

Basic chunking: Garbage in, Garbage out

Splitting text based on a fixed word count leads to:

  • Inconsistent Embeddings: arbitrary split points produce inaccurate semantic similarity scores.
  • Context Loss: poor chunking removes vital context, which confuses LLMs.
  • Inaccurate Outputs: without proper context, generative AI produces unreliable results.
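To make the problem concrete, here is a minimal sketch of fixed word count splitting. The function name and the sample text are illustrative assumptions and are not part of Preprocess; the point is only that the split lands mid-sentence, so the second chunk no longer says what "does not apply".

```python
def fixed_word_chunks(text: str, size: int) -> list[str]:
    """Split text into chunks of `size` words, ignoring sentence boundaries."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


# Hypothetical sample text, chosen so the split falls inside a sentence.
text = (
    "Refunds are available within 30 days. "
    "This policy does not apply to digital goods."
)
chunks = fixed_word_chunks(text, size=8)
for chunk in chunks:
    print(repr(chunk))
```

The first chunk ends with "This policy" and the second starts with "does not apply", so neither chunk carries the full statement on its own.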

The solution

Unleash the True Potential of Data

Each element has its own characteristics:

  • Sections: follow the original hierarchical structure.
  • Paragraphs: keep coherent and flowing content together.
  • Lists: group list elements alongside their introductions. Split them if necessary due to lengthy content.
  • Tables: create a single chunk for data tables; split textual tables.
  • Slides: split presentations according to the original flow.
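As an illustration of the list rule above, the following sketch keeps a list together with its introduction and starts a new chunk only when the content grows too long. It is not Preprocess's actual algorithm: the function name, the word-based size limit, and the choice to repeat the introduction in every chunk are all assumptions made for the example.

```python
def count_words(text: str) -> int:
    return len(text.split())


def chunk_list(intro: str, items: list[str], max_words: int) -> list[str]:
    """Group list items with their introduction, splitting only when too long."""
    chunks = []
    current = [intro]
    size = count_words(intro)
    for item in items:
        # Only split if the chunk already holds at least one item.
        if len(current) > 1 and size + count_words(item) > max_words:
            chunks.append("\n".join(current))
            # Assumption: repeat the introduction so each chunk keeps its context.
            current = [intro]
            size = count_words(intro)
        current.append(f"- {item}")
        size += count_words(item)
    chunks.append("\n".join(current))
    return chunks


short = chunk_list("Supported formats:", ["DOCX", "PDF", "ODT"], max_words=10)
split = chunk_list("Supported formats:", ["DOCX", "PDF", "ODT"], max_words=3)
```

With a generous limit the whole list stays in one chunk; with a tight limit each chunk still begins with the introduction, so no item is left without its context.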

Super Simple Ingestion Pipelines

Your documents will be automatically processed according to the file type.
Get the output quality of a custom preprocessing pipeline in a simple API call.

cURL
curl --location \
--request POST 'https://api.preprocess.co/chunk' \
--header 'Content-Type: multipart/form-data' \
--header 'x-api-key: your_api_key' \
--form 'file=@"/your_file.ext"'

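For pipelines written in Python, the same request can be sketched with the standard library alone. The endpoint, the x-api-key header, and the file form field come from the cURL example; the helper names are hypothetical, and the response format is not described here, so the sketch stops at building the request.

```python
import os
import urllib.request
import uuid

API_URL = "https://api.preprocess.co/chunk"  # endpoint from the cURL example


def build_multipart(filename: str, content: bytes, boundary: str) -> tuple[bytes, str]:
    """Encode one file as a multipart/form-data body under the field name 'file'."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def build_request(path: str, api_key: str) -> urllib.request.Request:
    """Prepare (but do not send) the POST request for a local file."""
    with open(path, "rb") as f:
        content = f.read()
    body, content_type = build_multipart(
        os.path.basename(path), content, uuid.uuid4().hex
    )
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": content_type, "x-api-key": api_key},
        method="POST",
    )


# To send it:
# urllib.request.urlopen(build_request("/your_file.ext", "your_api_key"))
```

Building the body by hand avoids a third-party dependency; an HTTP client library with multipart support would be shorter.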
Word-like
.DOCX, .PDF, .ODT, .DOC

During the conversion, Preprocess takes into account all elements and semantics of the content.
It divides the text following the hierarchical structure of the sections and then further divides it into optimal chunks.
It keeps lists together if they are short and splits them if they contain long points.
It divides the text into paragraphs, taking care to keep together what is semantically linked, just like you would.
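The two-stage idea (split by section hierarchy first, then pack paragraphs into chunks) can be sketched as follows. This is an illustration under stated assumptions, not Preprocess's implementation: the input is assumed to be already parsed into heading and paragraph blocks, and the block tuples, the word limit, and the "section" breadcrumb field are all hypothetical.

```python
def chunk_by_sections(blocks, max_words=50):
    """Chunk parsed blocks: ("h", level, title) or ("p", text), in document order."""
    path = []      # current heading titles, outermost first
    chunks = []
    current = []   # paragraphs collected for the chunk being built
    size = 0

    def flush():
        nonlocal current, size
        if current:
            chunks.append(
                {"section": " > ".join(path), "text": "\n\n".join(current)}
            )
        current, size = [], 0

    for block in blocks:
        if block[0] == "h":
            flush()                   # a heading always closes the open chunk
            _, level, title = block
            del path[level - 1:]      # drop headings at this level or deeper
            path.append(title)
        else:
            text = block[1]
            n = len(text.split())
            if current and size + n > max_words:
                flush()               # paragraph would overflow: start a new chunk
            current.append(text)
            size += n
    flush()
    return chunks


blocks = [
    ("h", 1, "Guide"),
    ("h", 2, "Install"),
    ("p", "Download the package."),
    ("p", "Run the installer."),
    ("h", 2, "Usage"),
    ("p", "Call the API."),
]
result = chunk_by_sections(blocks)
```

Paragraphs are never split and never cross a heading, so each chunk stays inside one section and records where it came from.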

Excel-like
.XLSX, .CSV, .ODS, .XLS

This type of file is converted taking into account the writing orientation, the headings and the location of the elements.
Preprocess differentiates data tables from textual ones and treats them differently.
By setting the table_output_format parameter you can decide whether to receive the tables as text, Markdown or HTML.
By setting repeat_table_header = true, the header is included in each chunk.
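To show what a repeated header means in practice, here is a local sketch that renders a table as Markdown chunks. Only the repeat_table_header name comes from the text above; the function, the row-count based splitting, and the sample data are assumptions for illustration, and the sketch does not call the API.

```python
def chunk_table(header, rows, rows_per_chunk, repeat_table_header=True):
    """Render a table as Markdown chunks of at most `rows_per_chunk` data rows."""

    def md_row(cells):
        return "| " + " | ".join(str(c) for c in cells) + " |"

    head = [md_row(header), md_row(["---"] * len(header))]
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        body = [md_row(r) for r in rows[start:start + rows_per_chunk]]
        # The first chunk always carries the header; later ones only if requested.
        with_header = repeat_table_header or start == 0
        chunks.append("\n".join((head if with_header else []) + body))
    return chunks


# Hypothetical sample table.
header = ["sku", "price"]
rows = [["A", 1], ["B", 2], ["C", 3]]
repeated = chunk_table(header, rows, rows_per_chunk=2)
bare = chunk_table(header, rows, rows_per_chunk=2, repeat_table_header=False)
```

Without the repeated header, the second chunk is just "| C | 3 |", which tells a model nothing about what the columns mean.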

PowerPoint-like
.PPTX, .PDF, .ODP, .PPT

Presentations are a visual format that organizes concepts into slides.
Preprocess recognizes which PDFs were originally presentations.
The content is divided by slide and, where a slide contains long text, divided further.
The order of the text is important: each element on the slide is converted into consecutive text.

HTML & Text
.HTML, .EML, .TXT

These are the most widely used formats but also the least consistent ones.
Automatically cleaning HTML files of unwanted elements is essential to obtain processable data.
Recognizing titles and graphic elements is not always easy, especially when complex UX elements come into play.
Similarly, for plain text, identifying titles semantically is essential to divide the text coherently.
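A very small version of the HTML cleaning step can be sketched with Python's standard html.parser module. This is not how Preprocess works internally: the set of tags treated as noise and the two block kinds ("heading" and "text") are assumptions chosen for the example, and real pages need far more heuristics.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, marking headings and skipping non-content tags."""

    SKIP = {"script", "style", "nav", "footer"}  # assumption: treated as noise
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.blocks = []          # (kind, text) pairs in document order
        self._skip_depth = 0      # > 0 while inside a skipped element
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and not self._skip_depth:
            kind = "heading" if self._in_heading else "text"
            self.blocks.append((kind, text))


def extract_blocks(html: str):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


page = (
    "<nav>Home | About</nav><h1>Pricing</h1>"
    "<p>Free to start.</p><script>track()</script>"
)
blocks = extract_blocks(page)
```

The navigation bar and the script are dropped, and the heading is kept apart from the body text so a later step can split on it.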

Already Ready

Integrate Preprocess into your data pipeline with a few lines of code. Check the repositories.

Try it now

Request an API key and start using our chunking API. It's free, and we serve requests on a best-effort basis.
If you need a specific SLA or dedicated support, please reach out at support@preprocess.co