## LangChain and large documents

In this notebook we will apply the map-reduce chains of LangChain to summarize and discuss large documents.
We will analyze a scientific paper, because such papers follow a specific format, which makes them relatively easy to process.

First, install the required packages and import the necessary modules.

In [None]:
%pip install langchain langchain-community html2text tiktoken langchain-openai pypdf

In [None]:
import openai
from openai import OpenAI
import os
import dotenv
import time

dotenv.load_dotenv(".env", override=True) 
openai.api_key = os.getenv("OPENAI_API_KEY")

MODEL = "gpt-4o"
chunk = 10000  # maximum number of tokens sent to the LLM per map call
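
A missing API key is easier to diagnose when it is checked right after loading the `.env` file. The sketch below shows one way to do that; `require_env` is a hypothetical helper that is not part of this notebook, and its `environ` parameter only exists so the check can be tried without touching the real environment.

```python
import os


def require_env(name, environ=os.environ):
    """Return the value of an environment variable or raise a clear error."""
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in the shell."
        )
    return value


# Demonstration with a plain dict standing in for the environment.
print(require_env("OPENAI_API_KEY", {"OPENAI_API_KEY": "sk-example"}))

# In the notebook itself this would become:
# api_key = require_env("OPENAI_API_KEY")
```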

### LangChain modules

In [None]:
from langchain_community.document_transformers import Html2TextTransformer
from langchain_community.document_loaders import PyPDFLoader
from langchain.chains.llm import LLMChain
from langchain.prompts import PromptTemplate
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import ReduceDocumentsChain, MapReduceDocumentsChain
from langchain_openai import ChatOpenAI
from langchain.chains.conversation.memory import ConversationBufferMemory
from langchain.chains import ConversationChain


### Read the PDF

PDFs are built for printing, not so much for data processing, so extracting text from a PDF can be challenging. The same holds true for HTML pages.
Some PDFs will therefore parse nicely, while others might create a mess. Given your set of documents, you might need to try out several PDF readers and parsers to find the one that gives the best results. In this case we will be using PyPDFLoader.

After reading the PDF we need to split it up into chunks. Picking the optimal chunk_size is tricky and depends on the context length and the LLM used. Setting it too low, however, will limit the reasoning capabilities of the LLM, because not enough context will then be provided.

In [None]:

loader = PyPDFLoader("crop_rotation_sugar_beet.pdf")
docs = loader.load()

html2text = Html2TextTransformer()
docs = html2text.transform_documents(docs)

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=chunk, chunk_overlap=0
)
split_docs = text_splitter.split_documents(docs)
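
To get a feeling for the effect of the chunk size, here is a minimal sketch of greedy chunking in plain Python. It is not the LangChain splitter: it counts whitespace-separated words instead of tiktoken tokens, and `split_by_budget` is a hypothetical helper introduced only for this illustration.

```python
def split_by_budget(text, budget, separator="\n\n"):
    """Greedily pack paragraphs into chunks of at most `budget` words.

    Words stand in for tokens here. A paragraph that exceeds the budget
    on its own is kept whole rather than cut mid-sentence.
    """
    chunks, current, used = [], [], 0
    for paragraph in text.split(separator):
        size = len(paragraph.split())
        if size == 0:
            continue  # skip empty paragraphs
        if current and used + size > budget:
            # The current chunk is full: close it and start a new one.
            chunks.append(separator.join(current))
            current, used = [], 0
        current.append(paragraph)
        used += size
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "one two three\n\nfour five\n\nsix seven eight nine\n\nten"
for budget in (3, 5, 100):
    print(budget, len(split_by_budget(sample, budget)), "chunk(s)")
```

A small budget produces many chunks with little context each, while a large budget puts the whole text in a single chunk. The same trade-off applies to `chunk_size` above, only measured in tokens.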


### The LLM used for processing the data

In our case we will be using OpenAI GPT-4o.
We also need a template for the mapping phase and a template for the reduce phase.

In [None]:
llm = ChatOpenAI(temperature=0.3, model_name=MODEL, streaming=True)
map_template = """The following is a set of documents which combined form a full scientific paper and should therefore be considered as one long, single paper.
{docs}
{question}
Helpful Answer:"""

reduce_template = """{description}
{doc_summaries}
{question}
Helpful Answer:"""
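
The cells further down fill these templates in two steps: first the question is inserted with `str.format`, while `{docs}` is replaced by the literal text `{docs}` so that the placeholder survives for the chain to fill in later. The sketch below shows this with plain strings; the shortened template and the example question are made up for the illustration.

```python
template = """The following is part of a paper.
{docs}
{question}
Helpful Answer:"""

# First pass: fill in the question and keep the {docs} placeholder alive
# by substituting it with itself.
partially_filled = template.format(question="Who are the authors?", docs="{docs}")
print(partially_filled)

# Second pass: roughly what happens later for each chunk of the document.
final_prompt = partially_filled.format(docs="<text of one chunk>")
print(final_prompt)
```

One caveat of this approach: a question that itself contains curly braces would be read as a placeholder in the second pass, so such braces would need to be doubled (`{{` and `}}`).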


### The map-reduce chaining

This is the most complex part: we need to run the map-reduce approach on our document.
Note: we will add a sleep() to the method to prevent too many calls to the API. It is possible to have a subscription to the OpenAI API with much higher limits, though.
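
A fixed pause is the simplest option. A common alternative is to retry a failed call with an increasing delay. The sketch below illustrates that idea in plain Python; `call_with_backoff` and `flaky` are hypothetical names, the failure is simulated, and real code would catch only the rate-limit error of the client library rather than every exception.

```python
import random
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `func()`; on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:  # in real code, catch only the rate-limit error
            if attempt == max_attempts - 1:
                raise  # out of attempts: pass the error on
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))


# Demonstration with a function that fails twice and then succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("rate limit (simulated)")
    return "ok"


delays = []  # record the delays instead of really sleeping
result = call_with_backoff(flaky, sleep=delays.append)
print(result, "after", len(delays), "retries")
```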

Documentation: [MapReduceDocumentsChain](https://api.python.langchain.com/en/latest/chains/langchain.chains.combine_documents.map_reduce.MapReduceDocumentsChain.html#langchain.chains.combine_documents.map_reduce.MapReduceDocumentsChain)

The map-reduce chains are marked as 'deprecated'. However, no high-level replacement chain is available yet.


In [None]:
def runMapReduce(map_template, reduce_template, docs, llm, model = "gpt-4o"):
    map_prompt = PromptTemplate.from_template(map_template)
    map_chain = LLMChain(llm=llm, prompt=map_prompt )

    # Reduce
    reduce_prompt = PromptTemplate.from_template(reduce_template)
    reduce_chain = LLMChain(llm=llm, prompt=reduce_prompt)

    # Takes a list of documents, combines them into a single string, and passes this to an LLMChain
    combine_documents_chain = StuffDocumentsChain(
        llm_chain=reduce_chain, document_variable_name="doc_summaries"
    )

    #print("Reduce phase")
    # Combines and iteratively reduces the mapped documents
    reduce_documents_chain = ReduceDocumentsChain(
        # This is the final chain that is called.
        combine_documents_chain=combine_documents_chain,
        # Used to collapse the documents if they exceed the context for `StuffDocumentsChain`
        collapse_documents_chain=combine_documents_chain,
        # The maximum number of tokens to group documents into.
        #token_max=tokens,
    )

    #print("Mapping phase")
    # Combining documents by mapping a chain over them, then combining results
    map_reduce_chain = MapReduceDocumentsChain(
        # Map chain
        llm_chain=map_chain,
        # Reduce chain
        reduce_documents_chain=reduce_documents_chain,
        # The variable name in the llm_chain to put the documents in
        document_variable_name="docs",
        # Return the results of the map steps in the output
        return_intermediate_steps=False,
    )

    if model == "gpt-4o": # let's wait for a while
        time.sleep(10)

    return map_reduce_chain.run(docs)
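
Stripped of the LangChain classes, the idea behind the function above can be sketched in a few lines of plain Python. This is an illustration of the map, collapse and combine steps under simplifying assumptions, not LangChain's implementation: `map_reduce`, `fake_map` and `fake_reduce` are hypothetical names, string length stands in for the token count, and the "LLM" is replaced by trivial functions.

```python
def map_reduce(chunks, map_fn, reduce_fn, token_max, length_fn=len):
    """Map every chunk, collapse the results in groups while they are too long,
    then combine whatever is left in one final call."""
    results = [map_fn(chunk) for chunk in chunks]

    while len(results) > 1 and sum(map(length_fn, results)) > token_max:
        # Group consecutive results so that each group fits in token_max.
        groups, current = [], []
        for item in results:
            if current and sum(map(length_fn, current)) + length_fn(item) > token_max:
                groups.append(current)
                current = []
            current.append(item)
        groups.append(current)
        if len(groups) == len(results):
            break  # no two results fit together; stop collapsing
        results = [reduce_fn(group) for group in groups]

    return reduce_fn(results)


reduce_calls = []  # keep track of what each reduce call received


def fake_map(chunk):
    return chunk.split()[0].upper()


def fake_reduce(summaries):
    reduce_calls.append(list(summaries))
    return "+".join(summaries)


chunks = ["alpha text", "beta text", "gamma text", "delta text"]
answer = map_reduce(chunks, fake_map, fake_reduce, token_max=12)
print(answer)
print(len(reduce_calls), "reduce calls")
```

With a small `token_max` the mapped results are first reduced in two groups and then combined, which costs extra calls. With a large `token_max` a single combine call is enough.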



### Create a variable to keep track of all results

In [None]:
summaryDoc = ""

## Basic information on the paper

During the mapping phase, we will go through the entire paper, looking for the author names, the name of the journal, etc. We would also like to know a little more about the journal itself.
After collecting all this information, we will reduce this into a nice list, formatted in markdown.

In [None]:
map_q = "Please identify the publisher of this scientific paper and the authors of this paper. Provide some background on the journal. Is it for example considered high impact? What are generally the topics and results shared in this journal? If you cannot extract this information from this part of the text, just provide an empty string as answer."
reduce_qDescription = "The following contains an author list and information on the journal from a scientific paper:"
reduce_q = "Take these and provide only the author list and information on the first mentioned journal and publisher. The output needs to be in Markdown file format."
result = runMapReduce(map_template.format(question=map_q, docs="{docs}"), reduce_template.format(description=reduce_qDescription, question = reduce_q, doc_summaries="{doc_summaries}"), split_docs, llm=llm, model=MODEL)
print(result)
summaryDoc += result


## Research themes

What is the paper really about?

In [None]:
map_q = "Please identify the main themes of this scientific paper."
reduce_qDescription = "The following is a set of summaries from a scientific paper:"
reduce_q = "Take these and distill it into a final, consolidated summary of the main themes of this scientific paper in Markdown file format."
result = runMapReduce(map_template.format(question=map_q, docs="{docs}"), reduce_template.format(description=reduce_qDescription, question = reduce_q, doc_summaries="{doc_summaries}"), split_docs, llm=llm, model=MODEL)
print(result)
summaryDoc += result



## Summaries

Create a summary of each section of the scientific paper (this might take some time to complete) and discuss the quality of each of these sections.

In [None]:
map_q = "Could you provide a summary of the introduction? If the document contains different sections of the paper you can answer with an empty string."
reduce_qDescription = "The following is a set of summaries of the introduction of a scientific paper:"
reduce_q = "Take these and distill it into a final, consolidated summary of the introduction section of this scientific paper in Markdown file format. Is the introduction complete and concise? Are there, for example, topics introduced which require more elaboration?"
result = runMapReduce(map_template.format(question=map_q, docs="{docs}"), reduce_template.format(description=reduce_qDescription, question = reduce_q, doc_summaries="{doc_summaries}"), split_docs, llm=llm, model=MODEL)
#print(result)
summaryDoc += result


map_q = "Could you provide a summary of the results section? You might need to deduce that the section contains the results if the header is absent or labeled differently. If the document contains different sections of the paper you can answer with an empty string."
reduce_qDescription = "The following is a set of summaries of the results section of a scientific paper:"
reduce_q = "Take these and distill it into a final, consolidated summary of the results section of this scientific paper in Markdown file format. Is the results section concise, are there any conflicting results or are there any comments that should be in either the conclusion or materials and methods section, for example?"
result = runMapReduce(map_template.format(question=map_q, docs="{docs}"), reduce_template.format(description=reduce_qDescription, question = reduce_q, doc_summaries="{doc_summaries}"), split_docs, llm=llm, model=MODEL)
#print(result)
summaryDoc += result

map_q = "Could you provide a summary of the Materials and Method section? Or any other section that might reflect the same content, such as an 'Implementation', 'Methods', 'Experimental Procedures' or 'Methodology' section. If the document contains different sections of the paper you can answer with an empty string."
reduce_qDescription = "The following is a set of summaries of the Materials and Method section of a scientific paper:"
reduce_q = "Take these and distill it into a final, consolidated summary of the Materials and Method section of this scientific paper in Markdown file format. Does the section provide sufficient details on how to replicate the study? Does it, for example, include descriptions on how the data was collected, which software and/or databases were used, etc? "
result = runMapReduce(map_template.format(question=map_q, docs="{docs}"), reduce_template.format(description=reduce_qDescription, question = reduce_q, doc_summaries="{doc_summaries}"), split_docs, llm=llm, model=MODEL)
#print(result)
summaryDoc += result

map_q = "Could you provide a summary of the Conclusion section? You might need to deduce that the section contains the conclusion if the header is absent or labeled differently. If the document contains different sections of the paper you can answer with an empty string."
reduce_qDescription = "The following is a set of summaries of the conclusion section of a scientific paper:"
reduce_q = "Take these and distill it into a final, consolidated summary of the conclusion of this scientific paper in Markdown file format. Is the conclusion section well written? Do the conclusions provided match the results? Are you missing any conclusions? Provide a bullet list of the main conclusions in markdown format."
result = runMapReduce(map_template.format(question=map_q, docs="{docs}"), reduce_template.format(description=reduce_qDescription, question = reduce_q, doc_summaries="{doc_summaries}"), split_docs, llm=llm, model=MODEL)
#print(result)
summaryDoc += result

map_q = "Could you provide a summary of the Discussion section (or that part in the paper which contains a discussion)? You might need to deduce that the section contains the discussion if the header is absent or labeled differently. If the document contains different sections of the paper you can answer with an empty string."
reduce_qDescription = "The following is a set of summaries of the discussion section of a scientific paper:"
reduce_q = "Take these and distill it into a final, consolidated summary of the discussion section of this scientific paper in Markdown file format. How well written is the discussion? Does it indeed contain sufficient topics? Are there any strange, out of context or maybe even wild statements? Does the discussion section align with the rest of the paper?"
result = runMapReduce(map_template.format(question=map_q, docs="{docs}"), reduce_template.format(description=reduce_qDescription, question = reduce_q, doc_summaries="{doc_summaries}"), split_docs, llm=llm, model=MODEL)
#print(result)
summaryDoc += result
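
The five blocks above repeat the same pattern with different wording. As a possible refactoring, the questions could be generated per section and processed in a loop. The sketch below shows the idea with a stand-in for the chain call, so it runs without an API key; `build_questions`, `summarize_sections` and `fake_run` are hypothetical names, and the section-specific quality questions from the cells above are left out for brevity.

```python
SECTIONS = ["introduction", "results", "Materials and Method", "conclusion", "discussion"]


def build_questions(section):
    """Return (map_q, reduce_description, reduce_q) for one section of the paper."""
    map_q = (
        f"Could you provide a summary of the {section} section? "
        "If the document contains different sections of the paper "
        "you can answer with an empty string."
    )
    reduce_description = (
        f"The following is a set of summaries of the {section} section of a scientific paper:"
    )
    reduce_q = (
        "Take these and distill it into a final, consolidated summary of the "
        f"{section} section of this scientific paper in Markdown file format."
    )
    return map_q, reduce_description, reduce_q


def summarize_sections(sections, run):
    """Call `run(map_q, reduce_description, reduce_q)` per section and join the results."""
    parts = [run(*build_questions(section)) for section in sections]
    return "\n\n".join(parts)


# Stand-in for the real chain call (an assumption for this sketch).
def fake_run(map_q, reduce_description, reduce_q):
    return f"(summary produced for: {reduce_description})"


summary = summarize_sections(SECTIONS, fake_run)
print(summary)
```

Joining the parts with a blank line also keeps the Markdown of consecutive sections from running into each other.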


### View the results

In [None]:
print(summaryDoc)
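
Since the results are Markdown, it can be convenient to store them in a file. The sketch below writes titled parts to a Markdown file; `save_report`, the titles and the placeholder bodies are assumptions for this illustration, and a temporary directory is used so that nothing is left behind.

```python
import tempfile
from pathlib import Path


def save_report(parts, path):
    """Write (title, body) pairs to a Markdown file and return the path."""
    lines = [f"## {title}\n\n{body.strip()}\n" for title, body in parts]
    path = Path(path)
    path.write_text("\n".join(lines), encoding="utf-8")
    return path


# Placeholder results; in the notebook these would be the `result` strings.
example_parts = [
    ("Basic information", "placeholder text for the author list"),
    ("Research themes", "placeholder text for the themes"),
]

with tempfile.TemporaryDirectory() as tmp:
    report = save_report(example_parts, Path(tmp) / "summary.md")
    text = report.read_text(encoding="utf-8")

print(text)
```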

## Provide feedback on the paper

Providing feedback is a very important part of being a scientific researcher and/or supervisor. We can ask the LLM to generate a list of tips and tops which you can use as input for the discussion with your peer.  

In [None]:
map_q = "You are a highly skilled researcher and writer. Could you reflect on this part of the paper and provide feedback in the form of 'tips and tops'?"
reduce_qDescription = "The following is a set of tips (things that can be improved) and tops (things that are good about the text) as feedback provided by a highly skilled researcher and writer:"
reduce_q = "Take these and distill it into a final, consolidated summary of tips (things that can be improved) and tops (things that are good about the text) in Markdown file format."
result = runMapReduce(map_template.format(question=map_q, docs="{docs}"), reduce_template.format(description=reduce_qDescription, question = reduce_q, doc_summaries="{doc_summaries}"), split_docs, llm=llm, model=MODEL)
print(result)


## Further feedback and research topics

Now let's ask the LLM to dive a little deeper into the science of the paper. First, we ask it to verify the general outline of the paper. After that, we will ask the LLM to generate a list of similar research topics and finally we would like to have some ideas on follow-up research.

### Conversation chain

This is very similar to a direct call to the OpenAI Assistants API. If you prefer to use a single module instead of two, you can use LangChain as your primary toolkit.

In [None]:
conversation_buf = ConversationChain(
    llm=llm,
    memory=ConversationBufferMemory()
)
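
A buffer memory keeps the complete conversation and adds it to every new prompt. The sketch below mimics that behaviour in plain Python to show why follow-up questions can refer to earlier input. It is not the LangChain implementation: `BufferedConversation` and `fake_llm` are hypothetical names, and the prompt layout is an assumption.

```python
class BufferedConversation:
    """Keep every turn and prepend the full history to each new prompt."""

    def __init__(self, llm):
        self.llm = llm      # any callable: prompt string -> answer string
        self.history = []   # list of (speaker, text) tuples

    def run(self, user_input):
        transcript = "\n".join(f"{who}: {text}" for who, text in self.history)
        prompt = f"{transcript}\nHuman: {user_input}\nAI:".lstrip()
        answer = self.llm(prompt)
        self.history.append(("Human", user_input))
        self.history.append(("AI", answer))
        return answer


# Stand-in model that reports how many lines of prompt it received.
def fake_llm(prompt):
    return f"I saw {len(prompt.splitlines())} lines"


chat = BufferedConversation(fake_llm)
first = chat.run("Here is the summary of the paper.")
second = chat.run("Suggest follow-up research.")
print(first)
print(second)
```

The prompt grows with every turn, so a long first message, such as the complete set of summaries, is sent again with each follow-up question.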

### General outline

In [None]:
answer = conversation_buf("You are provided with summaries of the sections of a scientific paper. Does this paper follow the general outline of a scientific paper? The summaries in Markdown format: " + summaryDoc)
print("\n\n# Does the paper follow the general outline of a scientific paper?  \n\n")
print(answer['response'])

### Similar research topics

In [None]:
answer = conversation_buf.run("Could you provide a list of similar research topics and/or questions which are directly related to the topics described in this paper?")
print("\n\n# Similar research topics and/or questions related to this paper \n\n")
print(answer)

### Follow-up research

In [None]:
answer = conversation_buf.run("Could you suggest follow-up research questions based on the findings and discussion presented in this paper?")
print("\n\n# Suggestions for follow-up research \n\n")
print(answer)

## Clean-up

Normally, you would create the vector store and assistant only once. After creation, you can use the assistant.id for future reference. For now, let's clean everything up.

In [9]:
from openai import OpenAI
client = OpenAI()

# Function to list and delete files from OpenAI
def delete_openai_files(client):
    files = client.files.list()
    print(files.data)
    for file in files.data:
        client.files.delete(file.id)
        print(f"Deleted file: {file.id}")

def delete_openai_vector_stores(client):
    vectors = client.beta.vector_stores.list()
    print(vectors.data)
    for vector in vectors.data:
        client.beta.vector_stores.delete(vector.id)
        print(f"Deleted vector: {vector.id}")


# Execute the deletion functions
delete_openai_files(client)
delete_openai_vector_stores(client)

[]
[VectorStore(id='vs_kftdiCgZIKmvbUn5RPMOj7jk', created_at=1716846470, file_counts=FileCounts(cancelled=0, completed=0, failed=0, in_progress=0, total=0), last_active_at=1716846554, metadata={}, name='Research papers', object='vector_store', status='completed', usage_bytes=0, expires_after=None, expires_at=None), VectorStore(id='vs_s0iNOxcAT4VhquKAzhwDaxgU', created_at=1716846237, file_counts=FileCounts(cancelled=0, completed=0, failed=0, in_progress=0, total=0), last_active_at=1716846320, metadata={}, name='Bejo recipes', object='vector_store', status='completed', usage_bytes=0, expires_after=None, expires_at=None), VectorStore(id='vs_xeTI6iokxD3qyKkLD3aWQFFb', created_at=1716815561, file_counts=FileCounts(cancelled=0, completed=0, failed=0, in_progress=0, total=0), last_active_at=1716815620, metadata={}, name='Bejo recipes', object='vector_store', status='completed', usage_bytes=0, expires_after=None, expires_at=None), VectorStore(id='vs_kNYYucxUxtIQQUwhoTjgQ1wk', created_at=171681