The old adage “garbage in, garbage out” applies to all search systems. Whether you are building for ecommerce, document retrieval, or Retrieval Augmented Generation (RAG), the quality of your search results depends on the quality of your search documents. Downstream, RAG systems improve the quality of generated answers by adding relevant data from other systems to the generative prompt. Most RAG solutions use a search engine to search for this relevant data. To get great responses, you need great search results, and to get great search results, you need great data. If you don’t properly partition, extract, enrich, and clean your data before loading it, your search results will reflect the poor quality of your search documents.
Aryn DocParse segments and labels PDF documents, runs OCR, extracts tables and images, and more. It turns your messy documents into beautiful, structured JSON, which is the first step of document extract, transform, and load (ETL). DocParse runs the open source Aryn Partitioner and its state-of-the-art DETR deep learning model, also open source, trained on over 80,000 enterprise documents. This yields up to 6 times more accurate data chunking and 2 times improved recall on vector search or RAG compared to off-the-shelf systems. The following screenshot is an example of how DocParse would segment a page in an ETL pipeline. You can visualize labeled bounding boxes for each document segment using the Aryn Playground.
In this post, we demonstrate how to use Amazon OpenSearch Service with purpose-built document ETL tools, Aryn DocParse and Sycamore, to quickly build a RAG application that relies on complex documents. We use over 75 PDF reports from the National Transportation Safety Board (NTSB) about aircraft incidents. You can refer to the following example document from the collection. As you can see, these documents are complex, containing tables, images, section headings, and complicated layouts.
Let’s get started!
Prerequisites
Complete the following prerequisite steps:
- Create an OpenSearch Service domain. For more details, see Creating and managing Amazon OpenSearch Service domains. You can create a domain using the AWS Management Console, AWS Command Line Interface (AWS CLI), or SDK. Be sure to choose public access for your domain, and set up a user name and password for your domain’s primary user so that you can run the notebook from your laptop, Amazon SageMaker Studio, or an Amazon Elastic Compute Cloud (EC2) instance. To keep costs low, you can create an OpenSearch Service domain with a single t3.small search node in a dev/test configuration for this example. Take note of the domain’s endpoint to use in later steps.
- Get an Aryn API key.
- You will be using Anthropic’s Claude large language model (LLM) on Amazon Bedrock in the ETL pipeline, so make sure your notebook has access to AWS credentials with the required permissions.
- Have access to a Jupyter environment to open and run the notebook.
Use DocParse and Sycamore to chunk data and load OpenSearch Service
Although you can generate an ETL pipeline to load your OpenSearch Service domain using the Aryn DocPrep UI, we will instead focus on the underlying Sycamore document ETL library and write a pipeline from scratch.
Sycamore was designed to make it straightforward for developers and data engineers to define complex data transformations over large collections of documents. Borrowing some ideas from popular dataflow frameworks like Apache Spark, Sycamore has a core abstraction called the DocSet. Each DocSet represents a collection of unstructured documents, and is scalable from a single document to many thousands. Each document in a DocSet has an arbitrary set of key-value properties as metadata, as well as an ordered list of elements. An Element corresponds to a chunk of the document that can be processed and embedded separately, such as a table, headline, text passage, or image. Like documents, Elements can also contain arbitrary key-value properties to encode domain- or application-specific metadata.
Notebook walkthrough
We’ve created a Jupyter notebook that uses Sycamore to orchestrate data preparation and loading. This notebook uses Sycamore to create a data processing pipeline that sends documents to DocParse for initial document segmentation and data extraction, then runs entity extraction and data transforms, and finally loads data into OpenSearch Service using a connector.
Copy the notebook into your Amazon SageMaker JupyterLab space, launch it using a Python kernel, then walk through the cells along with the following procedures.
To install Sycamore with the OpenSearch Service connector and local inference features necessary to create vector embeddings, run the first cell of the notebook:
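A minimal sketch of that cell, assuming the package extras are named opensearch and local-inference in Sycamore's packaging:

```python
# Cell 1 (sketch): install Sycamore with the OpenSearch connector and
# local-inference extras for creating vector embeddings locally.
!pip install 'sycamore-ai[opensearch,local-inference]'
```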
In the second cell of the notebook, fill in your ARYN_API_KEY. You should be able to complete the example in the notebook for less than $1.
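A sketch of that cell; DocParse reads the key from the ARYN_API_KEY environment variable:

```python
import os

# Cell 2 (sketch): make the Aryn API key available to DocParse.
# Replace the placeholder with your actual key.
os.environ["ARYN_API_KEY"] = "<YOUR-ARYN-API-KEY>"
```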
Cell 3 does the initial work of reading the source data and preparing a DocSet for that data. After initializing the Sycamore context and setting paths, this code calls out to DocParse to create a partitioned_docset:
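The cell looks roughly like the following sketch. The document paths and materialize directory are placeholders, and the partitioner options shown (table structure, OCR, and image extraction) reflect common usage; check the notebook for the exact settings:

```python
import sycamore
from sycamore.transforms.partition import ArynPartitioner

# Hypothetical location of the NTSB PDF reports
paths = ["s3://your-bucket/ntsb-reports/"]

# Initialize the Sycamore context for local execution
ctx = sycamore.init(exec_mode=sycamore.EXEC_LOCAL)

partitioned_docset = (
    ctx.read.binary(paths=paths, binary_format="pdf")
    # Send each PDF to DocParse for segmentation, OCR, and table/image extraction
    .partition(partitioner=ArynPartitioner(
        extract_table_structure=True,
        use_ocr=True,
        extract_images=True,
    ))
    # Checkpoint the partitioned output so later runs can reuse it
    .materialize(path="./materialize/partitioned",
                 source_mode=sycamore.MATERIALIZE_USE_STORED)
)
partitioned_docset.execute()
```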
The previous code uses materialize to create and save a checkpoint. In future runs, the code will use the materialized view to save a few minutes of time. partitioned_docset.execute() forces the pipeline to execute. Sycamore uses lazy execution to create efficient query plans, and would otherwise execute the pipeline at a much later step.
After this step, each document in the DocSet now includes the partitioned output from DocParse, including bounding boxes, text content, and images from that document, stored as elements.
Entity extraction
Part of the key to building good retrieval for RAG is adding structured information that enables accurate filtering for the search query. Sycamore provides LLM-powered transforms that can extract this information and store it as structured properties, enriching the document. Sycamore supports unsupervised schema extraction, where it infers a schema from the documents, as well as supervised extraction, where it pulls out fields based on a JSON schema you provide. When executing these types of transforms, Sycamore takes a specified number of elements from each document, uses an LLM to extract the specified fields, and includes them as properties in the document.
Cell 4 uses supervised schema extraction, setting the schema as the fields you want to extract. You can add additional information that is passed to the LLM performing the entity extraction. The location property is an example of this:
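A sketch of that part of cell 4. The field names in the schema are illustrative, and the Bedrock model enum may differ by Sycamore version; the description on location is the extra hint passed to the LLM:

```python
from sycamore.llms.bedrock import Bedrock, BedrockModels
from sycamore.transforms.extract_schema import LLMPropertyExtractor

# Anthropic's Claude on Amazon Bedrock performs the entity extraction
llm = Bedrock(BedrockModels.CLAUDE_3_5_SONNET)

# Illustrative JSON schema for the NTSB report fields
schema = {
    "type": "object",
    "properties": {
        "accidentNumber": {"type": "string"},
        "dateAndTime": {"type": "string"},
        "location": {
            "type": "string",
            # Additional guidance passed to the LLM for this field
            "description": "US state where the incident occurred",
        },
        "aircraft": {"type": "string"},
    },
}

enriched_docset = partitioned_docset.extract_properties(
    LLMPropertyExtractor(llm=llm, schema=schema, schema_name="entity")
)
```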
The LLMPropertyExtractor uses the schema you provided to add additional properties to the document. Next, summarize the images to add additional information to improve retrieval.
Image summarization
There’s more information in your documents than just text; as the saying goes, a picture is worth 1,000 words! When your documents contain images, you can capture the information in those images using Sycamore’s SummarizeImages transform. SummarizeImages uses an LLM to compute a text summary for the image, then adds the summary to that element. Sycamore will also send related information about the image, like a caption, to the LLM to aid with summarization. The following code (in cell 4) takes advantage of DocParse type labeling to automatically apply SummarizeImages to image elements:
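A sketch of that call, assuming the LLM-backed image summarizer in recent Sycamore versions:

```python
from sycamore.transforms.summarize_images import SummarizeImages, LLMImageSummarizer

# Summarize each element DocParse labeled as an image; the text summary is
# stored on the element so it can be embedded and retrieved like any other chunk
summarized_docset = enriched_docset.transform(
    SummarizeImages,
    summarizer=LLMImageSummarizer(llm=llm),
)
```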
This cell can take up to 20 minutes to complete.
Now that your image elements contain additional retrieval information, it’s time to clean and normalize the text in the elements and extracted entities.
Data cleaning and formatting
Unless you are in direct control of the creation of the documents you are processing, you will likely need to normalize that data and make it ready for search. Sycamore makes it straightforward for you to clean messy data and bring it to a regular form, fixing data quality issues.
For example, in the NTSB data, dates in the incident report are not all formatted the same way, and some US state names are shown as abbreviations. Sycamore makes it straightforward to write custom transformations in Python, and also provides several useful cleaning and formatting transforms. Cell 4 uses two functions in Sycamore to format the state names and dates:
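A sketch of those two calls, assuming Sycamore's standardizer helpers (USStateStandardizer expands state abbreviations, DateTimeStandardizer normalizes dates); the key paths depend on the entity schema name used during extraction:

```python
from sycamore.transforms.standardizer import USStateStandardizer, DateTimeStandardizer

formatted_docset = (
    summarized_docset
    # Expand abbreviations like "TX" into full state names
    .map(lambda doc: USStateStandardizer.standardize(
        doc, key_path=["properties", "entity", "location"]))
    # Normalize incident dates into a consistent format
    .map(lambda doc: DateTimeStandardizer.standardize(
        doc, key_path=["properties", "entity", "dateAndTime"]))
)
```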
The elements are now in normal form, with extracted entities and image descriptions. The next step is to merge together semantically related elements to create chunks.
Create final chunks and vector embeddings
When you prepare for RAG, you create chunks: parts of the full document that contain related information. You design your chunks so that, when returned as a search result, they can be added to the prompt as a coherent unit of meaning and information. There are many ways to approach chunking. If you have small documents, sometimes the whole document is a chunk. If you have larger documents, sentences, paragraphs, or even sections can be a chunk. As you iterate on your end application, it’s common to adjust the chunking strategy to fine-tune the accuracy of retrieval. Sycamore automates the process of building chunks by merging together the elements of the DocSet.
At this stage of the processing in cell 4, each document in our DocSet has a set of elements. The following code merges elements together using a chunking strategy to create larger elements that will improve query results. For instance, the DocSet might have an element that is a table and an element that is a caption for that table. Merging those elements together creates a chunk that’s a better search result.
We will use Sycamore’s Merge transform with the GreedySectionMerger merging strategy to combine elements from the same document section into larger chunks:
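A sketch of the merge step; the tokenizer and the 512-token chunk budget are illustrative choices:

```python
from sycamore.functions.tokenizer import HuggingFaceTokenizer
from sycamore.transforms.merge_elements import GreedySectionMerger

# Merge elements in the same section until a chunk reaches ~512 tokens
merger = GreedySectionMerger(
    tokenizer=HuggingFaceTokenizer("sentence-transformers/all-MiniLM-L6-v2"),
    max_tokens=512,
)
chunked_docset = formatted_docset.merge(merger=merger)
```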
With chunks created, it’s time to add vector embeddings for the chunks.
Create vector embeddings
Use vector embeddings to enable semantic search in OpenSearch Service. Semantic search retrieves documents that are close to the query in a multidimensional embedding space, rather than matching keywords exactly. In RAG systems, it’s common to use semantic search along with lexical search for a hybrid search, giving you best-of-both-worlds retrieval.
The code in cell 4 creates vector embeddings for each chunk. You can use a variety of different AI models with Sycamore’s embed transform to create vector embeddings. You can run these locally or use a service like Amazon Bedrock or OpenAI. The embedding model you choose has a huge impact on your search quality, and it’s common to experiment with this variable as well. In this example, you create embeddings locally using a model called GTE:
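A sketch of the embedding step, using the gte-small model from Hugging Face through Sycamore's SentenceTransformerEmbedder. spread_properties copies document-level metadata onto each element, and explode promotes each element into its own document so every chunk gets its own embedding:

```python
import sycamore
from sycamore.transforms.embed import SentenceTransformerEmbedder

embedded_docset = (
    chunked_docset
    # Copy document-level properties (extracted entities, source path) onto each chunk
    .spread_properties(["entity", "path"])
    # Turn each element into its own top-level document
    .explode()
    # Compute a vector embedding for each chunk locally with GTE
    .embed(SentenceTransformerEmbedder(model_name="thenlper/gte-small", batch_size=100))
    # Checkpoint the fully processed DocSet before loading OpenSearch
    .materialize(path="./materialize/embedded",
                 source_mode=sycamore.MATERIALIZE_USE_STORED)
)
```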
You use materialize again here so you can checkpoint the processed DocSet before loading. If there is an error when loading the indexes, you can retry the load without rerunning the earlier steps of the pipeline.
Load OpenSearch Service
The final ETL step is loading the prepared data into OpenSearch Service vector and keyword indexes to power hybrid search for the RAG application. Sycamore makes loading indexes straightforward with its set of connectors. Cell 5 adds configuration, specifying the OpenSearch Service domain endpoint and what indexes to create. If you’re following along, be sure to replace YOUR-DOMAIN-ENDPOINT, YOUR-OPENSEARCH-USERNAME, and YOUR-OPENSEARCH-PASSWORD in cell 5 with the actual values.
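A sketch of that configuration. The index name is hypothetical, and the embedding dimension (384) matches the gte-small model used earlier:

```python
index_name = "ntsb-rag"  # hypothetical index name

os_client_args = {
    "hosts": [{"host": "YOUR-DOMAIN-ENDPOINT", "port": 443}],
    "http_auth": ("YOUR-OPENSEARCH-USERNAME", "YOUR-OPENSEARCH-PASSWORD"),
    "use_ssl": True,
    "verify_certs": True,
}

index_settings = {
    "body": {
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 384,  # gte-small embedding size
                    "method": {"name": "hnsw", "engine": "faiss"},
                },
            },
        },
    },
}
```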
If you copied your domain endpoint from the console, it will start with the https:// URL scheme. When you replace YOUR-DOMAIN-ENDPOINT, be sure to remove https://.
In cell 6, Sycamore’s OpenSearch connector loads the data into an OpenSearch index:
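That cell is essentially one call to the connector (a sketch; argument names follow recent Sycamore versions):

```python
# Load the embedded chunks into OpenSearch Service, creating the
# vector-enabled index with the settings defined in cell 5
embedded_docset.write.opensearch(
    os_client_args=os_client_args,
    index_name=index_name,
    index_settings=index_settings,
)
```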
Congratulations! You’ve completed some of the core processing steps to take raw PDFs and prepare them as a source for retrieval in a RAG application. In the next cells, you will run a couple of RAG queries.
Run a RAG query on OpenSearch using Sycamore
In cell 7, Sycamore’s query and summarize functions create a RAG pipeline on the data. The query step uses OpenSearch’s vector search to retrieve the relevant passages for RAG. Then, cell 8 runs a second RAG query that filters on metadata that Sycamore extracted in the ETL pipeline, yielding even better results. You could also use an OpenSearch hybrid search pipeline to perform hybrid vector and lexical retrieval.
Cell 7 asks “What was common with incidents in Texas, and how does that differ from incidents in California?” Sycamore’s summarize_data transform runs the RAG query, and uses the LLM specified for generation (in this case, it’s Anthropic’s Claude):
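A sketch of the retrieve-then-summarize flow. The OpenSearchQuery, OpenSearchQueryExecutor, and summarize_data names follow Sycamore's query helpers, and the question is embedded with the same GTE model used for the chunks so the vectors are comparable; check the notebook for exact signatures:

```python
from sentence_transformers import SentenceTransformer

from sycamore.data import OpenSearchQuery
from sycamore.transforms.query import OpenSearchQueryExecutor
from sycamore.query.execution.operations import summarize_data

question = ("What was common with incidents in Texas, and how does that "
            "differ from incidents in California?")

# Embed the question with the same model used to embed the chunks
embedder = SentenceTransformer("thenlper/gte-small")
qvec = embedder.encode(question).tolist()

# k-NN retrieval of the top chunks, excluding raw embeddings from the response
query = OpenSearchQuery()
query["index"] = index_name
query["query"] = {
    "_source": {"excludes": ["embedding"]},
    "query": {"knn": {"embedding": {"vector": qvec, "k": 100}}},
    "size": 20,
}

executor = OpenSearchQueryExecutor(os_client_args)
result = executor.query(query)

# Generate the answer with Claude over the retrieved passages
answer = summarize_data(
    llm=llm,
    question=question,
    result_description="NTSB aircraft incident reports",
    result_data=[result],
)
print(answer)
```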
Using metadata filters in a RAG query
Cell 8 makes a small adjustment to the code to add a filter to the vector search, restricting retrieval to documents from incidents located in California. Filters increase the accuracy of chatbot responses by removing irrelevant data from the results the RAG pipeline passes to the LLM in the prompt.
To add a filter, cell 8 adds a filter clause to the k-nearest neighbors (k-NN) query:
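A sketch of the modified query body. The filter runs inside the k-NN clause (efficient filtering, supported by the faiss engine configured in the index mapping), and the property path assumes the entity schema name used during extraction:

```python
# Restrict retrieval to chunks whose extracted location is California
filter_clause = {
    "bool": {
        "must": [
            {"match_phrase": {"properties.entity.location": "California"}},
        ],
    },
}

query["query"] = {
    "_source": {"excludes": ["embedding"]},
    "query": {
        "knn": {
            "embedding": {"vector": qvec, "k": 100, "filter": filter_clause},
        },
    },
    "size": 20,
}
result = executor.query(query)
```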
The output from the RAG query is as follows:
Clean up
Be sure to clean up the resources you deployed for this walkthrough:
- Delete your OpenSearch Service domain.
- Remove any Jupyter environments you created.
Conclusion
In this post, you used Aryn DocParse and Sycamore to parse, extract, enrich, clean, embed, and load data into vector and keyword indexes in OpenSearch Service. You then used Sycamore to run RAG queries on this data. Your second RAG query used an OpenSearch filter on metadata to get a more accurate result.
The way in which your documents are parsed, enriched, and processed has a significant impact on the quality of your RAG queries. You can use the examples in this post to build your own RAG systems with Aryn and OpenSearch Service, and iterate on the processing and retrieval strategies as you build your generative AI application.
About the Authors
Jon Handler is Director of Solutions Architecture for Search Services at Amazon Web Services, based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads for OpenSearch. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale ecommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master’s of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.
Jon Fritz is the founding Chief Product Officer at Aryn. Prior to that, he was the SVP of Product Management at Dremio, a data lake company. Earlier, Jon was a Director at AWS, where he led product management for in-memory database services (Amazon ElastiCache and Amazon MemoryDB for Redis) and Amazon EMR (Apache Spark and Hadoop), and founded and was GM of the blockchain division. Jon has an MBA from Stanford Graduate School of Business and a BA in Chemistry from Washington University in St. Louis.