Process a PDF with the Unstructured Serverless API and Batch APIs in JavaScript for a RAG Application
Introduction
Let's say you have a recipe book in PDF format on your machine, and you want to create a RAG application to extract recipes from it. In this article, I will show you how to use the Unstructured Serverless API and OpenAI's Batch API to pre-process a PDF file for indexing in Chromadb using JavaScript. Let's get started!
Prerequisites
- Node.js installed on your machine.
- An Unstructured Serverless API key and an API URL. To get them, follow the instructions at Unstructured Serverless API.
- An OpenAI project key. To get it, follow the instructions at OpenAI API.
- A Chromadb database. To create one, follow the instructions at Chromadb.
Processing a PDF with the Unstructured API
Setting up the Project
Assuming you already have Node.js installed on your machine, follow these steps to process a PDF with the Unstructured API:
- Create a folder for your project and navigate to it in your terminal.
- After navigating into your project's folder, run `npm init -y` to create a `package.json` file that will hold your project's dependencies.
- Install the `unstructured-client` library by running `npm install unstructured-client`.
- Create a new JavaScript file and name it `process-pdf.js`, or give it any name you prefer.
Add the code to process the PDF with the Unstructured API
Open the `process-pdf.js` file in your code editor and take the following steps to process the PDF with the Unstructured API:
- Require the `unstructured-client` and the Node.js file system (`fs`) modules. From the `unstructured-client` module, import the `UnstructuredClient` class, which you will use to interact with the Unstructured API.
```js
const { UnstructuredClient } = require("unstructured-client");
const fs = require("fs");
```
We will use the Node.js `fs` module to read the PDF and to write the PDF's content parsed by the Unstructured API to a file on your machine. That way, you don't have to make a request to the API every time you want to access the processed content.
- Create a function called `processPDF` that takes the path to the PDF file as an argument.
```js
async function processPDF(pdfPath) {
  // code here
}
```
- Inside the `processPDF` function, create a new instance of the `UnstructuredClient` class and pass it your Unstructured API key and API URL.
```js
const apiKey = "your-api-key";
const serverURL = "your-api-url";

async function processPDF(pdfPath) {
  const unstructuredClient = new UnstructuredClient({
    serverURL,
    security: {
      apiKey,
    },
  });
}
```
The constructor of the `UnstructuredClient` class accepts an `SDKOptions` object. The API URL is passed under the `serverURL` property, and the API key is passed as the `apiKey` property of the `security` object.
- Read the content of the PDF file by calling the `readFileSync` method on the `fs` module and passing it the path to the PDF file.
```js
async function processPDF(pdfPath) {
  const unstructuredClient = new UnstructuredClient({
    serverURL,
    security: {
      apiKey,
    },
  });

  const data = fs.readFileSync(pdfPath);
}
```
- To process the PDF's content with the Unstructured Serverless API, we use the `partition` method of the `UnstructuredClient` instance. The `partition` method accepts the content of the PDF file as a `Uint8Array` or a `string`. The `readFileSync` method returns a `Buffer` object, so you need to convert it to a `Uint8Array` as follows:
```js
async function processPDF(pdfPath) {
  const unstructuredClient = new UnstructuredClient({
    serverURL,
    security: {
      apiKey,
    },
  });

  const data = fs.readFileSync(pdfPath);
  const content = new Uint8Array(data); // or content = data.toString("base64");
}
```
- To process the PDF file, specify the file name and set the `content` property of the `files` object of the `PartitionParameters` object to the content of the PDF file.
```js
async function processPDF(pdfPath) {
  const unstructuredClient = new UnstructuredClient({
    serverURL,
    security: {
      apiKey,
    },
  });

  const data = fs.readFileSync(pdfPath);
  const content = new Uint8Array(data);
  const partitionParameters = {
    files: {
      content,
      fileName: "recipe-book.pdf",
    },
  };
}
```
Then, call the `partition` method on the `general` property of the `UnstructuredClient` instance and pass it the `PartitionParameters` object and the API key as follows:
```js
async function processPDF(pdfPath) {
  const unstructuredClient = new UnstructuredClient({
    serverURL,
    security: {
      apiKey,
    },
  });

  const data = fs.readFileSync(pdfPath);
  const content = new Uint8Array(data);
  const partitionParameters = {
    files: {
      content,
      fileName: "recipe-book.pdf",
    },
  };

  const res = await unstructuredClient.general.partition({
    partitionParameters,
    unstructuredApiKey: apiKey,
  });
}
```
The `partition` method returns a `PartitionResponse` object that contains the content of the PDF file parsed by the Unstructured API. If the request was successful, indicated by a status code of `200`, you can access the content via the `elements` property of the `PartitionResponse` object.
```js
async function processPDF(pdfPath) {
  const unstructuredClient = new UnstructuredClient({
    serverURL,
    security: {
      apiKey,
    },
  });

  const data = fs.readFileSync(pdfPath);
  const content = new Uint8Array(data);
  const partitionParameters = {
    files: {
      content,
      fileName: "recipe-book.pdf",
    },
  };

  const res = await unstructuredClient.general.partition({
    partitionParameters,
    unstructuredApiKey: apiKey,
  });

  if (res.statusCode == 200) {
    console.log(res.elements);
  }
}
```
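For orientation, each entry in `elements` is a plain JSON object. The exact fields vary with the partitioning strategy and the file type, but an element typically looks something like the following; the values here are illustrative, not from a real response:

```js
// Illustrative element shape; the field values are made up for this example.
const sampleElement = {
  type: "CompositeElement",
  element_id: "3a9c1f0b2d4e",
  text: "Pancakes. Ingredients: 2 cups flour, 2 eggs, 1 cup milk.",
  metadata: {
    filename: "recipe-book.pdf",
    filetype: "application/pdf",
    page_number: 12,
  },
};

// Fields this article relies on later:
// - element_id: used as the Batch API request's custom_id
// - text: the content we classify and index
// - metadata.page_number: stored as Chromadb metadata
console.log(sampleElement.element_id, sampleElement.metadata.page_number);
```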
Write the Response of the Unstructured API to a File
If the request was a success, we write the content of the PDF file parsed by the Unstructured API to a `.json` file on your machine. The `writeFileSync` method of the `fs` module accepts the path to the file and the content to write to the file.
```js
async function processPDF(pdfPath) {
  const unstructuredClient = new UnstructuredClient({
    serverURL,
    security: {
      apiKey,
    },
  });

  const data = fs.readFileSync(pdfPath);
  const content = new Uint8Array(data); // or content = data.toString("base64");
  const partitionParameters = {
    files: {
      content,
      fileName: "recipe-book.pdf",
    },
  };

  const res = await unstructuredClient.general.partition({
    partitionParameters,
    unstructuredApiKey: apiKey,
  });

  if (res.statusCode == 200) {
    fs.writeFileSync("recipe-book.json", JSON.stringify(res.elements));
  }
}
```
Once you have finished defining the `processPDF` function, process the PDF by calling it and passing the path to the PDF file as an argument.
```js
processPDF("path-to-pdf-file");
```
To adapt the partitioning process to your needs, you can tweak the `PartitionParameters` object by specifying properties such as `chunkingStrategy`, `startingPageNumber`, and `uniqueElementIds`. In this example, I've used the `by_page` strategy to partition the PDF file by page, so each item in the `PartitionResponse` object represents a page in the PDF.
```js
const partitionParameters = {
  files: {
    content,
    fileName: "recipe-book.pdf",
  },
  chunkingStrategy: "by_page",
  startingPageNumber: 1,
  uniqueElementIds: true,
};
```
For more parameters, visit the Unstructured API documentation.
Cleaning the Response from Unstructured API with the Batch API
The PDF book I processed is structured in such a way that some of the recipes start with a page that contains only the recipe image. Such pages are parsed as noise by the Unstructured API, and you may not want to index that data, as it is not useful.
To clean the response, you can use the OpenAI Batch API to separate recipe text from pure noise by asking an LLM whether the text is a recipe.
Using the Batch API is a great solution, as it allows us to process all the book's pages with OpenAI's API at once.
To clean the response from the Unstructured API with the Batch API, follow these steps:
1. Initialise the OpenAI Node.js Library to use the Batch API
Install the OpenAI Node.js Library
To use the OpenAI Batch API, you need to install the OpenAI Node.js library. To install the library, run the following command in your terminal from your project's root folder:
```bash
npm install openai
```
In your JavaScript file, require the `OpenAI` class from the `openai` module.

```js
const { OpenAI } = require("openai");
```
Create a new instance of the `OpenAI` class and pass it your OpenAI project key, obtained from your OpenAI dashboard.

```js
const openai = new OpenAI({ apiKey: "your-openai-key" });
```
2. Create a JSONL File with the Requests for the Batch API
The Batch API requires you to upload a `.jsonl` file that contains the batch requests you want to make. For this, we will create two functions. One will append a page element from the Unstructured API response to the `.jsonl` file; let's call it `writeElementToJSONL`. The other will receive the elements array and the path to the `.jsonl` file and call `writeElementToJSONL` for each element in the array; let's call it `writeElementsToJSONL`.
The following is the definition of the `writeElementToJSONL` function:
```js
function writeElementToJSONL(element, filename) {
  return new Promise((resolve, reject) => { // 4.
    const stream = fs.createWriteStream(filename, { flags: "a" }); // 1.
    const item = { // 2.
      custom_id: element.element_id,
      method: "POST",
      url: "/v1/chat/completions",
      body: {
        model: "gpt-4o",
        messages: [
          {
            role: "user",
            content: `Is the following text a recipe? Answer 'Yes' or 'No':\n\n${element.text}`,
          },
        ],
        max_tokens: 1000,
      },
    };
    stream.write(JSON.stringify(item) + "\n", (err) => {
      if (err) {
        reject(err);
      } else {
        stream.end();
        resolve();
      }
    }); // 3.
  });
}
```
About the `writeElementToJSONL` function:

1. To write the request object to the `.jsonl` file, we start by creating a write stream using the `createWriteStream` method of the `fs` module. The `createWriteStream` method accepts the path to the file and an options object that specifies the write mode. In our case, we set the flag to `a` to append to the file instead of overwriting it.
2. Each line in the `.jsonl` file is a JSON object representing one request. The `item` object contains the request data for each element (page data) in the response from the Unstructured API that we want to write to the `.jsonl` file.
3. To append the request object to the `.jsonl` file, we call the `write` method on the stream object and pass it the JSON-stringified `item` object. The `write` method also accepts a callback function that gets called with an error if there is an issue writing to the file.
4. To let the caller wait until all Unstructured API elements are written to the `.jsonl` file, we return a promise that resolves when the write operation is complete.
The `writeElementsToJSONL` function is defined as follows:
```js
async function writeElementsToJSONL(elements, filename) {
  const writePromises = elements.map((element) =>
    writeElementToJSONL(element, filename)
  );
  await Promise.all(writePromises);
}
```
The `await Promise.all(writePromises);` line means that `writeElementsToJSONL` only returns once all the elements have been written to the `.jsonl` file.
- Call the `writeElementsToJSONL` function and pass it the elements from the response of the Unstructured API and the path of the `.jsonl` file.
```js
const elements = JSON.parse(fs.readFileSync("recipe-book.json", "utf8"));
writeElementsToJSONL(elements, "recipes.jsonl");
```
3. Upload the `.jsonl` File to the Batch API
Before you make the actual request for the batch completion, you need to upload the `.jsonl` file to OpenAI. You can upload the file by calling the `create` method on the `files` property of the `openai` instance and passing it a file stream and the purpose of the file (`batch`) as follows:
```js
async function isRecipeBatch() {
  // upload the file to OpenAI
  let file;
  try {
    file = await openai.files.create({
      file: fs.createReadStream("recipes.jsonl"),
      purpose: "batch",
    });
    console.log("File uploaded", file.id);
  } catch (error) {
    console.log("Error uploading", error);
    return;
  }
}
```
4. Make the batch request for the uploaded file
Once the batch file is uploaded successfully, you can create a batch request for the uploaded file by calling the `create` method on the `batches` property (`Batches`) of the `openai` instance and passing it the uploaded file's ID, the endpoint, and the completion window as follows:
```js
let completion;
try {
  // create a batch completion
  completion = await openai.batches.create({
    input_file_id: file.id,
    endpoint: "/v1/chat/completions",
    completion_window: "24h",
  });
  console.log(completion.id);
} catch (error) {
  console.log("Error creating batch", error);
}
```
The `create` method of the `Batches` object returns a `Batch` object (the `completion` variable above) that contains the ID of the batch request. You can access the ID via the `id` property of the `Batch` object and then use it to check the status of the batch request and retrieve the results.
5. Retrieve the batch request results
To retrieve the results of the batch request, check whether its status is `completed` by calling the `retrieve` method on the `batches` property of the `openai` instance and passing it the ID of the batch request from the preceding step as follows:
```js
try {
  completion = await openai.batches.retrieve(completion.id);
  console.log(completion.status);
} catch (error) {
  console.log("Error retrieving batch", error);
}
```
The `retrieve` method, like the `create` method above, returns a `Batch` object that contains the status of the batch request. You can access the status via the `status` property of the `Batch` object.
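Batch requests can take anywhere from a few minutes up to the completion window to finish, so in practice you will likely poll until the batch reaches a terminal state. Here is a minimal polling sketch; the `sleep` helper, the interval, and the `waitForBatch` name are my own additions, while the status values are the documented terminal states of the Batch API:

```js
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Terminal states after which the batch will not change any further.
function isTerminal(status) {
  return ["completed", "failed", "expired", "cancelled"].includes(status);
}

// Poll the Batch API every `intervalMs` until the batch settles.
async function waitForBatch(openai, batchId, intervalMs = 60000) {
  let batch = await openai.batches.retrieve(batchId);
  while (!isTerminal(batch.status)) {
    await sleep(intervalMs);
    batch = await openai.batches.retrieve(batchId);
  }
  return batch;
}
```

You could then replace the single `retrieve` call above with `completion = await waitForBatch(openai, completion.id);` to block until the batch finishes.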
If the status is `completed`, you can access the results of the batch request by calling the `content` method on the `files` property of the `openai` instance and passing it the ID of the request's output file as follows:
```js
let content;
try {
  if (completion.status === "completed") {
    content = await openai.files.content(completion.output_file_id);
    console.log(content);
  }
} catch (error) {
  console.log("Error retrieving content", error);
}
```
6. Save the results of the batch request to a file on your machine
You can write the results of the batch request to a file on your machine to access them later as follows:
```js
let content;
try {
  if (completion.status === "completed") {
    content = await openai.files.content(completion.output_file_id);
    const stream = fs.createWriteStream("recipes_output.jsonl", {
      flags: "a",
    });
    content.body.on("data", (chunk) => {
      stream.write(chunk);
    });
    content.body.on("end", () => {
      stream.end();
      console.log("File written");
    });
    content.body.on("error", (err) => {
      console.log("Error writing file", err);
    });
  }
} catch (error) {
  console.log("Error writing content", error);
}
```
7. Filter out the non-recipes
Once you have the results of the batch request, you can filter out the non-recipe text from the Unstructured API response by checking every element from the Unstructured API response against each response from the Batch API. If the response from the Batch API is `Yes`, you keep the text, as it is a recipe. If it is `No`, you remove the text from the data parsed by the Unstructured API.
We filter out the non-recipe elements through the following steps:
- Create a function that reads the content of the file containing the results of the batch request. The function takes the path to the file as the first argument and a callback function as the second argument. The callback function is called with the content of the file once it has been read.
Why use a callback function? Because reading a file is an asynchronous operation in Node.js, and we need to wait for the file to be read before processing its content.
```js
function readAndParseJSONL(filePath, callback) {
  const readStream = fs.createReadStream(filePath, "utf8");
  let data = "";
  readStream.on("data", (chunk) => {
    data += chunk; // Concatenate the chunk to the data string to collect all the file data
  });
  readStream.on("end", () => {
    const lines = data.split("\n").filter((line) => line.trim()); // Split the data into lines and filter out any empty lines
    const objects = lines
      .map((line) => {
        return JSON.parse(line); // Parse each line
      })
      .filter((value) => value !== undefined);
    // Invoke the callback with Batches API responses
    callback(objects);
  });
  readStream.on("error", (err) => {
    console.error("Error reading the file:", err);
    callback([]); // Invoke the callback with an empty array in case of error
  });
}
```
The callback function passed to the `readAndParseJSONL` function looks like this:
```js
const collections = { ids: [], documents: [], metadatas: [] };

readAndParseJSONL("recipes_output.jsonl", (dataFromOpenAI) => {
  elements.forEach((element) => {
    dataFromOpenAI.forEach((_element) => {
      const isRecipe =
        _element.response.body.choices[0].message.content === "Yes"; // 1.
      if (_element.custom_id === element.element_id && isRecipe) {
        // add to the `collections` object for indexing in Chromadb
        collections.ids.push(element.element_id);
        // cleanElementText is a small text-cleanup helper (not shown here)
        collections.documents.push(cleanElementText(element.text));
        collections.metadatas.push({
          page_number: element.metadata.page_number,
        });
      }
    });
  });
});
```
- For each element in the response from the Unstructured API, we iterate over the data from the OpenAI Batch API.
- For each item in the OpenAI Batch API response, we create an `isRecipe` variable that checks whether the response from the OpenAI Batch API is `Yes`.
- We then match the `custom_id` from the OpenAI Batch API response against the `element_id` of the element from the Unstructured API response.
- If the `custom_id` matches the `element_id` and the response from the OpenAI Batch API is `Yes`, we add the element to our `collections` object for indexing in Chromadb.
8. Index the recipes in a Chromadb vector database
Once you have filtered out as many non-recipe elements as possible, you can index the `collections` object in the Chromadb vector database. To do this, you need to have created a Chromadb database and have it running on your machine. For instructions on how to create a Chromadb database and add a collection to it, visit the Chromadb documentation.
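As a rough sketch of what this indexing step could look like with the `chromadb` JavaScript client (the collection name `recipes` is my own choice, and I assume a Chromadb server running locally on its default port):

```js
// Pure helper: shape our collections object into the payload
// that Chromadb's add() method expects.
function toAddPayload(collections) {
  return {
    ids: collections.ids,
    documents: collections.documents,
    metadatas: collections.metadatas,
  };
}

async function indexCollections(collections) {
  // Required lazily so toAddPayload stays usable without chromadb installed.
  const { ChromaClient } = require("chromadb");
  const client = new ChromaClient(); // assumes a local Chromadb server
  const collection = await client.getOrCreateCollection({ name: "recipes" });
  await collection.add(toAddPayload(collections));
}
```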
9. Query the Chromadb database
Once you have indexed the recipes in the Chromadb database, you can query the database to retrieve the relevant recipes, which you can then pass to an LLM to generate a response to the user's query (the user being you or me, in this case).
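For illustration, a query might look like the following; the collection name `recipes`, the prompt wording, and the result count are placeholders of my own, and a Chromadb server is assumed to be running locally:

```js
// Pure helper: turn retrieved documents into a context block for an LLM prompt.
function buildPrompt(question, documents) {
  return `Answer using only these recipes:\n\n${documents.join(
    "\n---\n"
  )}\n\nQuestion: ${question}`;
}

async function askRecipeBook(queryText) {
  // Required lazily so buildPrompt stays usable without chromadb installed.
  const { ChromaClient } = require("chromadb");
  const client = new ChromaClient(); // assumes a local Chromadb server
  const collection = await client.getOrCreateCollection({ name: "recipes" });

  // Retrieve the three recipe chunks most similar to the user's query
  const results = await collection.query({
    queryTexts: [queryText],
    nResults: 3,
  });

  // results.documents holds one list of documents per query text
  return buildPrompt(queryText, results.documents[0]);
}
```

The returned prompt can then be sent to an LLM, for example via `openai.chat.completions.create`, to generate the final answer.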
Conclusion
That's it from me! Even though this article is about a recipe book, I hope you can see how you can use the Unstructured API and Batch API to process any PDF file for your RAG application.