# OpenAI Retrieval-Augmented Generation

First we need to import the required modules and set the API key. The assistant will use GPT-4o, the most recent LLM at the time of writing.

In [None]:
import openai
from typing_extensions import override
from openai import AssistantEventHandler, OpenAI
import os
import dotenv

dotenv.load_dotenv(".env", override=True) 
openai.api_key = os.getenv("OPENAI_API_KEY")


# Assistant La Chef

The following section creates the assistant **La Chef**, the expert cook of Bejo Zaden. The assistant requires access to the PDF with the recipes.

In [None]:
client = OpenAI()
 
assistant = client.beta.assistants.create(
  name="La Chef",
  instructions="You are an expert cook. You have access to recipes of Bejo Zaden.",
  model="gpt-4o",
  tools=[{"type": "file_search"}],
)

## Adding the PDF to the assistant

The recipe file needs to be added to a vector store. This database holds the embeddings computed from the data in the file.
Through the API you can select specific types of embeddings, but for now (and in most cases) we will use the default store.
The vector store is used for RAG. In addition, OpenAI uses keyword search in the original document to find relevant snippets.
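
To make the combination of embedding lookup and keyword search more concrete, the sketch below ranks text chunks by cosine similarity to a query vector and by keyword overlap, then averages the two scores. The three-dimensional vectors, the chunk texts, the equal weighting, and the function names (`cosine`, `keyword_score`, `rank_chunks`) are all made up for illustration. OpenAI's hosted vector store computes its own embeddings, and its actual ranking is not shown here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query, text):
    """Fraction of query words that occur in the chunk (case-insensitive)."""
    words = query.lower().split()
    text_words = set(text.lower().split())
    return sum(w in text_words for w in words) / len(words) if words else 0.0


def rank_chunks(query, query_vec, chunks):
    """Return chunk texts sorted by the mean of embedding and keyword scores."""
    scored = [
        (0.5 * cosine(query_vec, vec) + 0.5 * keyword_score(query, text), text)
        for text, vec in chunks
    ]
    return [text for _, text in sorted(scored, reverse=True)]


# Toy data: hand-written vectors standing in for real embeddings.
chunks = [
    ("roast garlic with sweet peppers", [0.9, 0.1, 0.0]),
    ("carrot soup with ginger", [0.1, 0.9, 0.0]),
    ("storing onions over winter", [0.0, 0.2, 0.9]),
]
ranked = rank_chunks("garlic peppers", [1.0, 0.0, 0.0], chunks)
print(ranked[0])
```

With this toy data the garlic-and-peppers chunk ranks first, because it scores highest on both the vector and the keyword side.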

In [None]:
# Create a vector store called "Bejo recipes"
vector_store = client.beta.vector_stores.create(name="Bejo recipes")
 
from contextlib import ExitStack

# Ready the files for upload to OpenAI
file_paths = ["Recipes_Bejo.pdf"]

# Use the upload and poll SDK helper to upload the files, add them to the vector store,
# and poll the status of the file batch for completion. The ExitStack closes the
# file handles once the upload has finished.
with ExitStack() as stack:
    file_streams = [stack.enter_context(open(path, "rb")) for path in file_paths]
    file_batch = client.beta.vector_stores.file_batches.upload_and_poll(
        vector_store_id=vector_store.id, files=file_streams
    )
 
# You can print the status and the file counts of the batch to see the result of this operation.
print(file_batch.status)
print(file_batch.file_counts)

## Add the vector store to the assistant

In [None]:
assistant = client.beta.assistants.update(
  assistant_id=assistant.id,
  tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
)

## Threads

Assistants keep track of the conversation with the user through **threads**. You can also attach (additional) files to the assistant here. These files are removed after 7 days by default. This lets you update the assistant with information supplied by the user or by another automated process. Be aware that you pay for both storing and retrieving data, so check the file size and file type before adding a file to the assistant.
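
A small pre-upload check is one way to act on that advice. The sketch below rejects files with an unexpected extension or above a size limit. The helper name `check_upload_candidate`, the allowed extensions, and the 20 MB limit are illustrative assumptions chosen for this example, not limits taken from the OpenAI documentation.

```python
from pathlib import Path

# Illustrative policy values (assumptions; adjust to your own budget and use case).
ALLOWED_SUFFIXES = {".pdf", ".txt", ".md", ".docx"}
MAX_BYTES = 20 * 1024 * 1024  # 20 MB


def check_upload_candidate(path, allowed=ALLOWED_SUFFIXES, max_bytes=MAX_BYTES):
    """Return (ok, reason) for a file that a user wants to attach."""
    p = Path(path)
    if not p.is_file():
        return False, "not a file"
    if p.suffix.lower() not in allowed:
        return False, f"unsupported type {p.suffix or '(none)'}"
    size = p.stat().st_size
    if size > max_bytes:
        return False, f"too large ({size} bytes)"
    return True, "ok"
```

Calling `check_upload_candidate(path)` before opening the file for upload keeps the decision in one place, so the policy can be tightened later without touching the upload code.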

You can store the IDs of the assistant and the thread and use session cookies to keep track of which user is using which thread. Make sure that it is impossible to access an arbitrary thread: this could expose user data to an attacker.
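
One possible way to keep threads private is to put only a random token in the session cookie and keep the mapping from token to thread ID on the server. The sketch below shows that idea with an in-memory dictionary. The class `ThreadRegistry` and the thread ID `thread_abc123` are hypothetical, and a real application would use persistent storage and its web framework's session handling.

```python
import secrets


class ThreadRegistry:
    """In-memory map from opaque session tokens to thread IDs (sketch only)."""

    def __init__(self):
        self._threads = {}

    def register(self, thread_id):
        """Store a thread ID and return an unguessable token for the cookie."""
        token = secrets.token_urlsafe(32)
        self._threads[token] = thread_id
        return token

    def thread_for(self, token):
        """Return the thread ID for this session token, or None if unknown."""
        return self._threads.get(token)


registry = ThreadRegistry()
token = registry.register("thread_abc123")    # hypothetical thread ID
print(registry.thread_for(token))             # the registered thread ID
print(registry.thread_for("guessed-token"))   # None: unknown tokens map to nothing
```

Because the client never sees the thread ID itself, guessing or enumerating IDs does not give access to another user's conversation.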

In [None]:
# Create a thread
thread = client.beta.threads.create()

## Start discussion

Now that the recipes are ready, we can start asking **La Chef** questions.

In [None]:
class EventHandler(AssistantEventHandler):
    @override
    def on_text_created(self, text) -> None:
        print("\nassistant > ", end="", flush=True)

    @override
    def on_tool_call_created(self, tool_call):
        print(f"\nassistant > {tool_call.type}\n", flush=True)

    @override
    def on_message_done(self, message) -> None:
        # print a citation to the file searched
        message_content = message.content[0].text
        annotations = message_content.annotations
        citations = []
        for index, annotation in enumerate(annotations):
            message_content.value = message_content.value.replace(
                annotation.text, f"[{index}]"
            )
            if file_citation := getattr(annotation, "file_citation", None):
                cited_file = client.files.retrieve(file_citation.file_id)
                citations.append(f"[{index}] {cited_file.filename}")

        print(message_content.value, flush=True)
        print("\n".join(citations), flush=True)



In [None]:
message = client.beta.threads.messages.create(
        role="user",  
        content="Please provide a recipe which includes some garlic and perhaps peppers.",
        thread_id=thread.id,
)
with client.beta.threads.runs.stream(
    thread_id=thread.id,
    assistant_id=assistant.id,
    event_handler=EventHandler(),
) as stream:
    stream.until_done()

In [None]:
message = client.beta.threads.messages.create(
        role="user",  
        content="I don't have a can of peeled tomatoes. Can you suggest a proper substitute?",
        thread_id=thread.id,
)
with client.beta.threads.runs.stream(
    thread_id=thread.id,
    assistant_id=assistant.id,
    event_handler=EventHandler(),
) as stream:
    stream.until_done()
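
The two cells above repeat the same pattern: add a user message to the thread, then stream a run. As a possible refactoring, that pattern could be wrapped in a small helper. The function name `ask` and its `handler_factory` parameter are assumptions made for this sketch; the API calls are the same ones used in the cells above.

```python
def ask(client, thread_id, assistant_id, question, handler_factory):
    """Post a user message to a thread and stream the assistant's reply.

    handler_factory is called once per run so that every stream gets a
    fresh event handler instance.
    """
    client.beta.threads.messages.create(
        thread_id=thread_id,
        role="user",
        content=question,
    )
    with client.beta.threads.runs.stream(
        thread_id=thread_id,
        assistant_id=assistant_id,
        event_handler=handler_factory(),
    ) as stream:
        stream.until_done()
```

With the objects defined earlier, a call would look like `ask(client, thread.id, assistant.id, "Which recipes use onions?", EventHandler)`, where the question is only an example.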

# Research assistant

This assistant will have access to several scientific papers. You can ask the assistant anything about these papers.
Here we focus on applying RAG to multiple documents and on showing the context of the results.

We will be using:
- [Effects of crop rotation on sugar beet growth through improving soil physicochemical properties and microbiome](https://www.sciencedirect.com/science/article/pii/S092666902400308X)
- [Potato yield and quality are linked to cover crop and soil microbiome, respectively](https://link.springer.com/article/10.1007/s00374-024-01813-0)
- [Evolution of microbial community and the volatilome of fresh-cut chili pepper during storage under different temperature conditions: Correlation of microbiota and volatile organic compounds](https://www.sciencedirect.com/science/article/pii/S0308814624010501?casa_token=8kb_Wk5ek8cAAAAA:0e2bf3DZuz6Ez_31G3kv5cBcmR3HPl9u0ehw0vD-DCglcp_SS7RKX3kBMISgc5AViN8FXPqKCw)

The PDFs are also in the current directory.

## Create the Research assistant

In [None]:
assistant = client.beta.assistants.create(
  name="Research assistant",
  instructions="You are a research assistant on microbiome research, specialized in field crops.",
  model="gpt-4o",
  tools=[{"type": "file_search"}],
)

## Add the papers to the vector store

In [None]:
# Create a vector store called "Research papers"
vector_store = client.beta.vector_stores.create(name="Research papers")
 
from contextlib import ExitStack

# Ready the files for upload to OpenAI
file_paths = ["crop_rotation_sugar_beet.pdf", "microbial_pepper.pdf", "soil_microbiome_potato.pdf"]

# Use the upload and poll SDK helper to upload the files, add them to the vector store,
# and poll the status of the file batch for completion. The ExitStack closes the
# file handles once the upload has finished.
with ExitStack() as stack:
    file_streams = [stack.enter_context(open(path, "rb")) for path in file_paths]
    file_batch = client.beta.vector_stores.file_batches.upload_and_poll(
        vector_store_id=vector_store.id, files=file_streams
    )
 
# You can print the status and the file counts of the batch to see the result of this operation.
print(file_batch.status)
print(file_batch.file_counts)

## Add the papers (vector store) to the assistant

In [None]:
assistant = client.beta.assistants.update(
  assistant_id=assistant.id,
  tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
)

## Create a new thread

In [None]:
# Create a thread
thread = client.beta.threads.create()

## Ask some questions about the papers

In [None]:
message = client.beta.threads.messages.create(
        role="user",  
        content="Could you provide a list of the bacterial species identified in the papers?",
        thread_id=thread.id,
)
with client.beta.threads.runs.stream(
    thread_id=thread.id,
    assistant_id=assistant.id,
    event_handler=EventHandler(),
) as stream:
    stream.until_done()

## Getting a quote

The file IDs are handy for seeing which file(s) the output is based on. With a single file this is obvious, but with several files available to the assistant it is not. Also, to verify that the output makes sense, you usually need access to the quoted passage in the file (the context). This way you can check what was retrieved and sent to the model as part of the prompt.

For this, we need to change the event handler a bit:

In [None]:

class EventHandlerContext(AssistantEventHandler):
    @override
    def on_text_created(self, text) -> None:
        print("\nassistant > ", end="", flush=True)

    @override
    def on_tool_call_created(self, tool_call):
        print(f"\nassistant > {tool_call.type}\n", flush=True)

    @override
    def on_message_done(self, message) -> None:
        # print a citation to the file searched
        message_content = message.content[0].text
        annotations = message_content.annotations
        citations = []
        for index, annotation in enumerate(annotations):
            message_content.value = message_content.value.replace(
                annotation.text, f"[{index}]"
            )
            if file_citation := getattr(annotation, "file_citation", None):
                cited_file = client.files.retrieve(file_citation.file_id)
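                # The quote field may be absent in some API/SDK versions, so read it defensively.
                quote = getattr(file_citation, "quote", None)
                suffix = f" '{quote}'" if quote else ""
                citations.append(f"[{index}] {cited_file.filename}{suffix}")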

        print(message_content.value, flush=True)
        print("\n".join(citations), flush=True)
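
The replacement logic in `on_message_done` can be tried out without calling the API. The sketch below applies the same idea to plain stand-in objects: each annotation marker in the text is swapped for a numbered reference, and the references are listed by file name, with the quote when one is present. `Annotation`, `render_citations`, the `<cite-N>` marker strings, the example text, and the `filenames` lookup are hypothetical stand-ins for the SDK objects and for `client.files.retrieve`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    text: str                    # marker as it appears in the message text
    file_id: str                 # ID of the cited file
    quote: Optional[str] = None  # cited passage, when available


def render_citations(value, annotations, filenames):
    """Replace markers with [n] and build a matching reference list."""
    citations = []
    for index, ann in enumerate(annotations):
        value = value.replace(ann.text, f"[{index}]")
        entry = f"[{index}] {filenames.get(ann.file_id, ann.file_id)}"
        if ann.quote:
            entry += f" '{ann.quote}'"
        citations.append(entry)
    return value, citations


# Made-up message text and annotations, for illustration only.
text = "First claim<cite-0> and second claim<cite-1>."
anns = [
    Annotation("<cite-0>", "file-1"),
    Annotation("<cite-1>", "file-2", quote="example quoted passage"),
]
names = {"file-1": "crop_rotation_sugar_beet.pdf", "file-2": "soil_microbiome_potato.pdf"}
body, refs = render_citations(text, anns, names)
print(body)
print("\n".join(refs))
```

Keeping this logic in a plain function makes it easy to check the numbering and the quote handling before wiring it into an event handler.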



## Now we can ask questions and also get the context

In [None]:
message = client.beta.threads.messages.create(
        role="user",  
        content="Are there any bacterial species mentioned in the papers which indicate a positive effect on, for example, plant health or plant growth?",
        thread_id=thread.id,
)

with client.beta.threads.runs.stream(
    thread_id=thread.id,
    assistant_id=assistant.id,
    event_handler=EventHandlerContext(),
) as stream:
    stream.until_done()