
Process a PDF with the Unstructured Serverless API and OpenAI's Batch API in JavaScript for a RAG Application

Introduction

Let's say you have a recipe book in PDF format on your machine and you want to create a RAG application to extract recipes from it. In this article, I will show you how to use the Unstructured Serverless API and OpenAI's Batch API to pre-process a PDF file for indexing in Chromadb using JavaScript. Let's get started!

Prerequisites

  1. Node.js installed on your machine.

  2. An Unstructured Serverless API key and API URL. To get them, follow the instructions at Unstructured Serverless API.

  3. An OpenAI project key. To get one, follow the instructions at OpenAI API.

  4. A Chromadb database. To create one, follow the instructions at Chromadb.

Processing a PDF with the Unstructured API

Setting up the Project

Assuming you already have Node.js installed on your machine, follow these steps to process a PDF with the Unstructured API:

  1. Create a folder for your project and navigate to it in your terminal.
  2. After navigating into your project's folder, run npm init -y to create a package.json file that will hold your project's dependencies.
  3. Install the unstructured-client library by running npm install unstructured-client.
  4. Create a new JavaScript file and name it process-pdf.js, or give it any name you prefer.

Add the code to process the PDF with the Unstructured API

Open the process-pdf.js file in your code editor and take the following steps to process the PDF with the Unstructured API:

  1. Require the unstructured-client and the Node.js file system (fs) modules. From the unstructured-client module, import the UnstructuredClient class, which you will use to interact with the Unstructured API.
js
const { UnstructuredClient } = require("unstructured-client");
const fs = require("fs");

We will use the Node.js fs module to read the PDF and write the PDF's content parsed by the Unstructured API to a file on your machine. That way you don't have to make a request to the API every time you want to access the processed content.

  2. Create a function called processPDF that takes the path to the PDF file as an argument.
js
async function processPDF(pdfPath) {
  // code here
}
  3. Inside the processPDF function, create a new instance of the UnstructuredClient class and pass it your Unstructured API key and API URL.
js
const apiKey = "your-api-key";
const serverURL = "your-api-url";

async function processPDF(pdfPath) {
  const unstructuredClient = new UnstructuredClient({
    serverURL,
    security: {
      apiKey,
    },
  });
}

The constructor of the UnstructuredClient class accepts an SDKOptions object. The API URL is passed under the serverURL property. The API key is passed as the apiKey property of the security object.

  4. Read the content of the PDF file by calling the readFileSync method on the fs module and passing it the path to the PDF file.
js
async function processPDF(pdfPath) {
  const unstructuredClient = new UnstructuredClient({
    serverURL,
    security: {
      apiKey,
    },
  });

  const data = fs.readFileSync(pdfPath);
}
  5. To process the PDF's content with the Unstructured Serverless API, we use the partition method of the UnstructuredClient instance. The partition method accepts the content of the PDF file as a Uint8Array or string. The readFileSync method returns a Buffer object, so you need to convert it to a Uint8Array as follows:
js
async function processPDF(pdfPath) {
  const unstructuredClient = new UnstructuredClient({
    serverURL,
    security: {
      apiKey,
    },
  });

  const data = fs.readFileSync(pdfPath);

  const content = new Uint8Array(data); //or content = data.toString('base64');
}
  6. To process the PDF file, specify the file name and set the content property of the files object in the PartitionParameters object to the content of the PDF file.
js
async function processPDF(pdfPath) {
  const unstructuredClient = new UnstructuredClient({
    serverURL,
    security: {
      apiKey,
    },
  });

  const data = fs.readFileSync(pdfPath);
  const content = new Uint8Array(data);

  const partitionParameters = {
    files: {
      content,
      fileName: "recipe-book.pdf",
    },
  };
}

Then, call the partition method on the general property of the UnstructuredClient instance and pass it the PartitionParameters object and the API key as follows:

js
async function processPDF(pdfPath) {
  const unstructuredClient = new UnstructuredClient({
    serverURL,
    security: {
      apiKey,
    },
  });

  const data = fs.readFileSync(pdfPath);
  const content = new Uint8Array(data);

  const partitionParameters = {
    files: {
      content,
      fileName: "recipe-book.pdf",
    },
  };

  const res = await unstructuredClient.partition({
    partitionParameters,
    unstructuredApiKey: apiKey,
  });
}

The partition method returns a PartitionResponse object that contains the content of the PDF file parsed by the Unstructured API. If the request was successful (indicated by a 200 status code), you can access the content via the elements property of the PartitionResponse object.

js
async function processPDF(pdfPath) {
  const unstructuredClient = new UnstructuredClient({
    serverURL,
    security: {
      apiKey,
    },
  });

  const data = fs.readFileSync(pdfPath);
  const content = new Uint8Array(data);

  const partitionParameters = {
    files: {
      content,
      fileName: "recipe-book.pdf",
    },
  };

  const res = await unstructuredClient.partition({
    partitionParameters,
    unstructuredApiKey: apiKey,
  });

  if (res.statusCode == 200) {
    console.log(res.elements);
  }
}

Write the Response of the Unstructured API to a File

If the request was a success, we write the content of the PDF file parsed by the Unstructured API to a .json file on your machine. The writeFileSync method of the fs module accepts the path to the file and the content to write to the file.

js
async function processPDF(pdfPath) {
  const unstructuredClient = new UnstructuredClient({
    serverURL,
    security: {
      apiKey,
    },
  });

  const data = fs.readFileSync(pdfPath);
  const content = new Uint8Array(data); //or content = data.toString('base64');

  const partitionParameters = {
    files: {
      content,
      fileName: "recipe-book.pdf",
    },
  };

  const res = await unstructuredClient.partition({
    partitionParameters,
    unstructuredApiKey: apiKey,
  });

  if (res.statusCode == 200) {
    fs.writeFileSync("recipe-book.json", JSON.stringify(res.elements));
  }
}

Once you have finished defining the processPDF function, process the PDF by calling it and pass the path to the PDF file as an argument.

js
processPDF("path-to-pdf-file");

To adapt the partitioning process to your needs, you can tweak the PartitionParameters object by specifying properties such as chunkingStrategy, startingPageNumber, and uniqueElementIds. In this example, I've used the by_page strategy to partition the PDF file by page, so each item in the PartitionResponse object represents a page in the PDF.

js
const partitionParameters = {
  files: {
    content,
    fileName: "recipe-book.pdf",
  },
  chunkingStrategy: "by_page",
  startingPageNumber: 1,
  uniqueElementIds: true,
};

For more parameters, visit the Unstructured API documentation.

Cleaning the Response from Unstructured API with the Batch API

The PDF book I processed is structured in such a way that some of the recipes start with a page that contains only the recipe image. Such pages are parsed as noise by the Unstructured API and you may not want to index that data as it is not useful.

To clean the response, you can use the OpenAI Batch API to filter out recipe text from pure noise by asking an LLM if the text is a recipe or not.

Using the Batch API is a great solution, as it allows us to process all of the book's page data with OpenAI's API at once.

To clean the response from the Unstructured API with the Batch API, follow these steps:

1. Initialise the OpenAI Node.js Library to use the Batch API

  1. Install the OpenAI Node.js Library

    To use the OpenAI Batch API, you need to install the OpenAI Node.js library. To install the library, run the following command in your terminal from your project's root folder:

    bash
    npm install openai
  2. In your JavaScript file, require the OpenAI class from the openai module.

    js
    const { OpenAI } = require("openai");
  3. Create a new instance of the OpenAI class and pass it your OpenAI project key, obtained from your OpenAI dashboard.

    js
    const openai = new OpenAI({ apiKey: "your-openai-key" });

2. Create a JSONL File with the Requests for the Batch API

The Batch API requires you to upload a .jsonl file that contains the batch requests you want to make. For this, we will create two functions. One will append the page elements from the Unstructured API response to the .jsonl file; let's call it writeElementToJSONL. The other will receive the elements array and the path to the .jsonl file and call writeElementToJSONL for each element in the array; let's call it writeElementsToJSONL.

The following is the definition of the writeElementToJSONL function:

js
function writeElementToJSONL(element, filename) {
 return new Promise((resolve, reject) => { //4.
   const stream = fs.createWriteStream(filename, { flags: "a" }); //1.

   const item = { //2.
     custom_id: element.element_id,
     method: "POST",
     url: "/v1/chat/completions",
     body: {
       model: "gpt-4o",
       messages: [
         {
           role: "user",
           content: `Is the following text a recipe? Answer 'Yes' or 'No':\n\n${element.text}`,
         },
       ],
       max_tokens: 1000,
     },
   };

   stream.write(JSON.stringify(item) + "\n", (err) => {
     if (err) {
       reject(err);
     } else {
       stream.end();
       resolve();
     }
   }); //3.

 });
}

About the writeElementToJSONL function:

  1. To write the request object to the .jsonl file, we start by creating a write stream using the createWriteStream method of the fs module. The createWriteStream method accepts the path to the file and an options object that specifies the write mode. In our case, we set the flag to "a" to append to the file instead of overwriting it.
  2. Each line in the .jsonl file is a JSON object representing a request. The item object contains the request data for each element (page data) in the response from the Unstructured API that we want to write to each line in the .jsonl file.
  3. To append the request object to the .jsonl file, we call the write method on the stream object and pass it the JSON-stringified item object. The write method also accepts a callback function that gets called with an error if there is an issue writing to the file.
  4. To wait until all Unstructured API elements are written to the .jsonl file, we return a promise that resolves when the write operation is complete.

The writeElementsToJSONL function is defined as follows:

js
async function writeElementsToJSONL(elements, filename) {
  const writePromises = elements.map((element) =>
    writeElementToJSONL(element, filename)
  );

  await Promise.all(writePromises);
}

The await Promise.all(writePromises); line means that writeElementsToJSONL only returns when all the elements have been written to the .jsonl file.

Finally, call the writeElementsToJSONL function and pass it the elements from the response of the Unstructured API and the path of the .jsonl file.
js
const elements = JSON.parse(fs.readFileSync("recipe-book.json", "utf8"));
writeElementsToJSONL(elements, "recipes.jsonl");

3. Upload the .jsonl File to the Batch API

Before you make the actual request for the batch completion, you need to upload the .jsonl file to the OpenAI Batch API. You can upload the file by calling the create method on the files property of the openai instance and passing it the file stream and the purpose of the file as follows:

js
async function isRecipeBatch() {
  // upload the file to OpenAI
  let file;
  try {
    file = await openai.files.create({
      file: fs.createReadStream("recipes.jsonl"),
      purpose: "batch",
    });
    console.log("File uploaded", file.id);
  } catch (error) {
    console.log("Error uploading", error);
  }
}

4. Make the batch request for the uploaded file

Once the batch file is uploaded successfully, you can create a batch request for the uploaded file by calling the create method on the batches property (Batches) of the openai instance and passing it the uploaded file ID, the endpoint, and the completion window as follows:

js
let completion;

try {
  // create a batch completion
  completion = await openai.batches.create({
    input_file_id: file.id,
    endpoint: "/v1/chat/completions",
    completion_window: "24h",
  });
  console.log(completion.id);
} catch (error) {
  console.log("Error creating batch", error);
}

The create method of the Batches object returns a Batch object (the completion variable above) that contains the ID of the batch request. You can access the ID via the id property of the Batch object. You can then use the ID to check the status of the batch request and retrieve the results.

5. Retrieve the batch request results

To retrieve the results of the batch request, check if the status of the batch request is completed by calling the retrieve method on the batches property of the openai instance and passing it the ID of the batch request shown in the preceding step as follows:

js
try {
  completion = await openai.batches.retrieve(completion.id);
  console.log(completion.status);
} catch (error) {
  console.log("Error retrieving batch", error);
}

The retrieve method, like the create method above, returns a Batch object that contains the status of the batch request. You can access the status via the status property of the Batch object.
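Batch jobs can take anywhere from minutes to hours to finish, so in practice you will re-check the status periodically rather than just once. Here is a minimal polling sketch; the interval and the list of terminal statuses are my assumptions, not part of the code above:

```js
// Poll the Batch API until the batch reaches a terminal status.
// The 60-second default interval is an assumption; adjust to taste.
async function waitForBatch(openai, batchId, intervalMs = 60 * 1000) {
  for (;;) {
    const batch = await openai.batches.retrieve(batchId);
    if (["completed", "failed", "expired", "cancelled"].includes(batch.status)) {
      return batch;
    }
    // wait before checking again
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Because waitForBatch only depends on an object with a batches.retrieve method, you can pass it the openai instance created earlier and the completion.id from the previous step.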

If the status is completed, you can access the results of the batch request by calling the content method on the files property of the openai instance and passing it the ID of the request's output file as follows:

js
try {
  if (completion.status === "completed") {
    content = await openai.files.content(completion.output_file_id);
    console.log(content);
  }
} catch (error) {
  console.log("Error retrieving content", error);
}

6. Save the results of the batch request to a file on your machine

You can write the results of the batch request to a file on your machine to access them later as follows:

js
try {
  if (completion.status === "completed") {
    content = await openai.files.content(completion.output_file_id);

    // create the write stream once, outside the "data" handler,
    // so we don't open a new stream for every chunk
    const stream = fs.createWriteStream("recipes_output.jsonl", {
      flags: "a",
    });

    content.body.on("data", (chunk) => {
      stream.write(chunk);
    });

    content.body.on("end", () => {
      stream.end();
      console.log("File written");
    });

    content.body.on("error", (err) => {
      console.log("Error writing file", err);
    });
  }
} catch (error) {
  console.log("Error writing content", error);
}

7. Filter out the non-recipes

Once you have the results of the batch request, you can filter out the non-recipe text from the Unstructured API response by checking every element from the Unstructured API response against each response from the Batch API. If a response from the Batch API is a Yes, you can keep the text as it is a recipe. If it is a No, you can remove the text from the Unstructured API parsed data.

We filter out the non-recipe elements through the following steps:

  1. Create a function that reads the content of the file containing the results of the batch request. The function takes the path to the file as the first argument and a callback function as the second argument. The callback function is called with the content of the file once it has been read.

Why use a callback function? Because reading a file is an asynchronous operation in Node.js, and we need to wait for the file to be read before processing its content.

js
function readAndParseJSONL(filePath, callback) {
  const readStream = fs.createReadStream(filePath, "utf8");
  let data = "";
  readStream.on("data", (chunk) => {
    data += chunk; // Concatenate the chunk to the data string to collect all the file data
  });

  readStream.on("end", () => {
    const lines = data.split("\n").filter((line) => line.trim()); // Split the data into lines and filter out any empty lines

    const objects = lines
      .map((line) => {
        return JSON.parse(line); // Parse each line
      })
      .filter((value) => value !== undefined);
    // Invoke the callback with Batches API responses
    callback(objects);
  });

  readStream.on("error", (err) => {
    console.error("Error reading the file:", err);
    callback([]); // Invoke the callback with an empty array in case of error
  });
}

The callback function passed to the readAndParseJSONL function looks like this:

js
const collections = { ids: [], documents: [], metadatas: [] };

// This snippet runs once for each `element` in the Unstructured API response,
// inside a promise whose `resolve` signals that the element has been checked.
readAndParseJSONL("recipes_output.jsonl", (dataFromOpenAI) => {
  const dataPromises = dataFromOpenAI.map(async (_element) => {
    const isRecipe =
      _element.response.body.choices[0].message.content === "Yes"; //1.
    if (_element.custom_id === element.element_id && isRecipe) {
      // add to the `collections` object for indexing in Chromadb
      collections.ids.push(element.element_id);
      collections.documents.push(this.cleanElementText(element.text)); // cleanElementText is a custom text-tidying helper
      collections.metadatas.push({
        page_number: element.metadata.page_number,
      });
    }
  });
  Promise.all(dataPromises).then(() => resolve());
});
  • For each element in the response from the Unstructured API, we iterate over the data from the OpenAI Batch API.
  • For each item in the OpenAI Batch API response, we create the isRecipe variable that checks if the response from the OpenAI Batch API is Yes.
  • We then match the custom_id from the OpenAI Batch API response with the element_id of the element from the Unstructured API response.
  • If the custom_id matches the element_id and the response from the OpenAI Batch API is Yes, we add the element to our collections object for indexing in Chromadb.
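The matching logic described above can also be condensed into a single pure helper. This is a sketch of my own based on the object shapes used earlier (custom_id, element_id, and the chat completion response body), not part of the original code:

```js
// Map each Batch API response to its verdict ("Yes" / "No"),
// then keep only the elements whose verdict is "Yes".
function filterRecipes(elements, batchResponses) {
  const verdicts = new Map(
    batchResponses.map((r) => [
      r.custom_id,
      r.response.body.choices[0].message.content.trim(),
    ])
  );
  return elements.filter((el) => verdicts.get(el.element_id) === "Yes");
}
```

Building the Map first means each element lookup is constant time, instead of re-scanning the batch responses for every element.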

8. Index the recipes in a Chromadb vector database

Once you have filtered out the non-recipe elements, you can index the collections object in the Chromadb vector database. To do this, you need to have created a Chromadb database and have it running on your machine. For instructions on how to create a Chromadb database and add a collection to it, visit the Chromadb documentation.
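As a rough sketch of what that indexing step might look like with the chromadb JavaScript client; the collection name "recipes" and the local server on the default port are assumptions:

```js
// Shape the filtered `collections` object into the payload collection.add() expects.
function toAddPayload(collections) {
  return {
    ids: collections.ids,
    documents: collections.documents,
    metadatas: collections.metadatas,
  };
}

// Assumes a Chroma server running locally on the default port (http://localhost:8000).
async function indexRecipes(collections) {
  const { ChromaClient } = require("chromadb"); // loaded lazily so the helper above stays dependency-free
  const client = new ChromaClient();
  const collection = await client.getOrCreateCollection({ name: "recipes" });
  await collection.add(toAddPayload(collections));
}
```

You would call indexRecipes(collections) once the callback from the previous step has finished populating the collections object.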

9. Query the Chromadb database

Once you have indexed the recipes in the Chromadb database, you can query the database to retrieve the relevant recipe, which you can then pass to an LLM to generate a response to the user's query (you or me, in this case).
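A sketch of the retrieval step might look like the following; the collection name, the number of results, and the prompt wording are all assumptions for illustration:

```js
// Build the prompt an LLM would receive from the retrieved documents
// (pure helper, separated out so it is easy to test).
function buildPrompt(question, documents) {
  return `Answer the question using these recipes:\n\n${documents.join(
    "\n---\n"
  )}\n\nQuestion: ${question}`;
}

// Assumes a Chroma server running locally with the "recipes" collection
// created in the previous step.
async function askRecipeBook(question) {
  const { ChromaClient } = require("chromadb"); // loaded lazily
  const client = new ChromaClient();
  const collection = await client.getOrCreateCollection({ name: "recipes" });

  // Chroma embeds the query text and returns the closest documents.
  const results = await collection.query({
    queryTexts: [question],
    nResults: 3,
  });

  // results.documents holds one array per query; we sent a single query.
  return buildPrompt(question, results.documents[0]);
}
```

The string returned by askRecipeBook is what you would send to the LLM as the final generation step of the RAG pipeline.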

Conclusion

That's it from me! Even though this article is about a recipe book, I hope you can see how you can use the Unstructured API and Batch API to process any PDF file for your RAG application.