Quickstart for PDF Extract API (Java)
To get started using Adobe PDF Extract API, let's walk through a simple scenario - taking an input PDF document and running PDF Extract API against it. Once the PDF has been extracted, we'll parse the results and report on any major headers in the document. In this guide, we will walk you through the complete process for creating a program that will accomplish this task.
Prerequisites
To complete this guide, you will need:
- Java - Java 11 or higher is required.
- Maven
- An Adobe ID. If you do not have one, the credential setup will walk you through creating one.
- A way to edit code. No specific editor is required for this guide.
Step One: Getting credentials
1) To begin, open your browser to https://acrobatservices.adobe.com/dc-integration-creation-app-cdn/main.html?api=pdf-extract-api. If you are not already logged in to Adobe.com, you will need to sign in or create a new user. Using a personal email account, rather than a federated ID, is recommended.
2) After registering or logging in, you will then be asked to name your new credentials. Use the name, "New Project".
3) Change the "Choose language" setting to "Java".
4) Also note the checkbox by, "Create personalized code sample." This will include a large set of samples along with your credentials. These can be helpful for learning more later.
5) Click the checkbox saying you agree to the developer terms and then click "Create credentials."
6) After your credentials are created, they are automatically downloaded.
Step Two: Setting up the project
1) In your Downloads folder, find the ZIP file with your credentials: PDFServicesSDK-JavaSamples.zip. If you unzip that archive, you will find a folder of samples and the pdfservices-api-credentials.json file.
2) Take the pdfservices-api-credentials.json file and place it in a new directory.
3) In this directory, create a new file named pom.xml and copy the following content:
```xml
<?xml version="1.0" encoding="UTF-8"?>

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.adobe.documentservices</groupId>
    <artifactId>pdfservices-sdk-extract-guide</artifactId>
    <version>1</version>

    <name>PDF Services Java SDK Samples</name>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>11</maven.compiler.source>
        <maven.compiler.target>11</maven.compiler.target>
        <pdfservices.sdk.version>4.0.0</pdfservices.sdk.version>
    </properties>

    <dependencies>

        <dependency>
            <groupId>com.adobe.documentservices</groupId>
            <artifactId>pdfservices-sdk</artifactId>
            <version>${pdfservices.sdk.version}</version>
        </dependency>

        <!-- log4j2 dependency to showcase the use of log4j2 with slf4j API-->
        <!-- https://mvnrepository.com/artifact/org.slf4j/slf4j-log4j12 -->
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-slf4j-impl</artifactId>
            <version>2.21.1</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.8.0</version>
                <configuration>
                    <source>${maven.compiler.source}</source>
                    <target>${maven.compiler.target}</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <version>1.5.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>java</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
```
This file will define what dependencies we need and how the application will be built.
Our application will take a PDF, Adobe Extract API Sample.pdf (downloadable from here), and extract its contents. Save the PDF as src/main/resources/Adobe Extract API Sample.pdf, which is the path the application reads from. The results will be saved as a ZIP file, ExtractTextInfoFromPDF.zip. We will then parse the results from the ZIP and print out the text of any H1 headers found in the PDF.
4) In your editor, open the directory where you previously copied the credentials, and create a new directory, src/main/java. In that directory, create ExtractTextInfoFromPDF.java.
Now you're ready to begin coding.
Step Three: Creating the application
1) We'll begin by including our required dependencies:
```java
import com.adobe.pdfservices.operation.PDFServices;
import com.adobe.pdfservices.operation.PDFServicesMediaType;
import com.adobe.pdfservices.operation.PDFServicesResponse;
import com.adobe.pdfservices.operation.auth.Credentials;
import com.adobe.pdfservices.operation.auth.ServicePrincipalCredentials;
import com.adobe.pdfservices.operation.exception.SDKException;
import com.adobe.pdfservices.operation.exception.ServiceApiException;
import com.adobe.pdfservices.operation.exception.ServiceUsageException;
import com.adobe.pdfservices.operation.io.Asset;
import com.adobe.pdfservices.operation.io.StreamAsset;
import com.adobe.pdfservices.operation.pdfjobs.jobs.ExtractPDFJob;
import com.adobe.pdfservices.operation.pdfjobs.params.extractpdf.ExtractElementType;
import com.adobe.pdfservices.operation.pdfjobs.params.extractpdf.ExtractPDFParams;
import com.adobe.pdfservices.operation.pdfjobs.result.ExtractPDFResult;
import org.apache.commons.io.IOUtils;
import org.json.JSONArray;
import org.json.JSONObject;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Scanner;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
```
2) Now let's define our main class:
```java
public class ExtractTextInfoFromPDF {

    private static final Logger LOGGER = LoggerFactory.getLogger(ExtractTextInfoFromPDF.class);

    public static void main(String[] args) {

    }
}
```
3) Set the environment variables PDF_SERVICES_CLIENT_ID and PDF_SERVICES_CLIENT_SECRET by running the following commands, replacing the placeholders YOUR CLIENT ID and YOUR CLIENT SECRET with the credentials in the pdfservices-api-credentials.json file:
Windows:
set PDF_SERVICES_CLIENT_ID=<YOUR CLIENT ID>
set PDF_SERVICES_CLIENT_SECRET=<YOUR CLIENT SECRET>
macOS/Linux:
export PDF_SERVICES_CLIENT_ID=<YOUR CLIENT ID>
export PDF_SERVICES_CLIENT_SECRET=<YOUR CLIENT SECRET>
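The Java code in the following steps reads these variables with System.getenv, which returns null when a variable is not set. It can help to fail fast with a clear message before creating credentials. Here is a minimal sketch using only the standard library; the class EnvCheck and the helper requireEnv are hypothetical names for illustration and are not part of the PDF Services SDK:

```java
import java.util.Map;

// Hypothetical helper (not part of the PDF Services SDK): fails fast when a
// required environment variable is missing or blank.
final class EnvCheck {

    // Returns the value stored under `name` in `env`, or throws if it is null or blank.
    static String requireEnv(Map<String, String> env, String name) {
        String value = env.get(name);
        if (value == null || value.isBlank()) {
            throw new IllegalStateException("Environment variable " + name + " is not set");
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, String> env = System.getenv();
        requireEnv(env, "PDF_SERVICES_CLIENT_ID");
        requireEnv(env, "PDF_SERVICES_CLIENT_SECRET");
        // Only confirm presence; never print the secret itself.
        System.out.println("Both credential variables are set.");
    }
}
```

Passing the environment in as a Map keeps the helper easy to exercise with test values instead of the real process environment.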
4) Next, we can create our credentials and use them to create a PDF Services instance:
```java
// Initial setup, create credentials instance
Credentials credentials = new ServicePrincipalCredentials(
        System.getenv("PDF_SERVICES_CLIENT_ID"),
        System.getenv("PDF_SERVICES_CLIENT_SECRET"));

// Create PDF Services instance
PDFServices pdfServices = new PDFServices(credentials);
```
5) Now, let's upload the asset:
```java
// Creates an asset(s) from source file(s) and upload
Asset asset = pdfServices.upload(inputStream, PDFServicesMediaType.PDF.getMediaType());
```
Here we define which PDF will be extracted. (You can download the source we used here.) The inputStream is opened on the source PDF, as shown in the complete application at the end of this step. In a real application, these values would typically be dynamic.
6) Now, let's create the job:
```java
// Build ExtractPDF options and set them into the operation
ExtractPDFParams extractPDFParams = ExtractPDFParams.extractPDFParamsBuilder()
        .addGetStylingInfo(false)
        .addElementsToExtract(Arrays.asList(ExtractElementType.TEXT))
        .build();

ExtractPDFJob extractPDFJob = new ExtractPDFJob(asset)
        .setParams(extractPDFParams);
```
This code defines what we're doing (an Extract operation) and the parameters for the Extract job. PDF Extract API has a few different options, but in this example we're simply asking for the most basic of extractions: the textual content of the document.
7) The next code block submits the job and gets the job result:
```java
// Submit the job and get the job result
String location = pdfServices.submit(extractPDFJob);
PDFServicesResponse<ExtractPDFResult> pdfServicesResponse = pdfServices.getJobResult(location, ExtractPDFResult.class);

Asset resultAsset = pdfServicesResponse.getResult().getResource();
StreamAsset streamAsset = pdfServices.getContent(resultAsset);
```
This code runs the extraction process and retrieves the content of the result ZIP as a stream asset.
8) The next code block saves the result at the specified location:
```java
// Creates an output stream and copy stream asset's content to it
Files.createDirectories(Paths.get("output/"));
String zipFileOutputPath = "output/ExtractTextInfoFromPDF.zip";
OutputStream outputStream = Files.newOutputStream(new File(zipFileOutputPath).toPath());
IOUtils.copy(streamAsset.getInputStream(), outputStream);
// Close the stream so the ZIP is fully written before it is read back
outputStream.close();
```
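An alternative way to write a stream to disk is Files.copy from java.nio, which creates the target file and reports how many bytes were written. The following is a minimal sketch using only the standard library, not the SDK's own sample. SaveStreamSketch and saveTo are hypothetical names, and a ByteArrayInputStream stands in for streamAsset.getInputStream() so the sketch runs without calling the API:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration; not part of the PDF Services SDK.
final class SaveStreamSketch {

    // Copies `source` to `target`, creating missing parent directories first
    // and overwriting any existing file. Returns the number of bytes written.
    static long saveTo(InputStream source, Path target) throws IOException {
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder bytes standing in for the real result ZIP content.
        byte[] placeholder = "placeholder".getBytes(StandardCharsets.UTF_8);
        Path target = Files.createTempDirectory("extract-demo").resolve("output/result.bin");
        // try-with-resources closes the source stream even if the copy fails.
        try (InputStream in = new ByteArrayInputStream(placeholder)) {
            long written = saveTo(in, target);
            System.out.println("Wrote " + written + " bytes to " + target);
        }
    }
}
```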
9) In this block, we read in the ZIP file, extract the JSON result file, and parse it:
```java
ZipFile resultZip = new ZipFile(zipFileOutputPath);
ZipEntry jsonEntry = resultZip.getEntry("structuredData.json");
InputStream is = resultZip.getInputStream(jsonEntry);
Scanner s = new Scanner(is).useDelimiter("\\A");
String jsonString = s.hasNext() ? s.next() : "";
s.close();

JSONObject jsonData = new JSONObject(jsonString);
```
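The same ZIP-reading step can also be written with try-with-resources, so that the archive and the entry stream are closed even if reading fails, and with an explicit check for a missing entry. The sketch below is an illustration under stated assumptions, not the SDK's own sample: ZipEntryReader and readEntryAsString are hypothetical names, the JSON file is assumed to be UTF-8 encoded, and main builds a small stand-in archive so the code runs without calling the API:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration; not part of the PDF Services SDK.
final class ZipEntryReader {

    // Reads one entry from a ZIP archive as a UTF-8 string.
    // Throws IOException if the archive has no entry with that name.
    static String readEntryAsString(Path zipPath, String entryName) throws IOException {
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                throw new IOException("No entry named " + entryName + " in " + zipPath);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny stand-in archive with an empty "elements" array.
        Path zipPath = Files.createTempFile("extract-demo", ".zip");
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zipPath))) {
            out.putNextEntry(new ZipEntry("structuredData.json"));
            out.write("{\"elements\":[]}".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        // Prints {"elements":[]}
        System.out.println(readEntryAsString(zipPath, "structuredData.json"));
    }
}
```

The returned string can be passed to new JSONObject(...) exactly as jsonString is in the step above.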
10) Finally, we can loop over the result and print out any found element that is an H1:
```java
JSONArray elements = jsonData.getJSONArray("elements");
for (int i = 0; i < elements.length(); i++) {
    JSONObject element = elements.getJSONObject(i);
    String path = element.getString("Path");
    if (path.endsWith("/H1")) {
        String text = element.getString("Text");
        System.out.println(text);
    }
}
```
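Note that endsWith("/H1") only matches paths whose final segment is exactly H1. If your output contains indexed paths such as //Document/H1[2] for repeated elements (an assumption about the Path format that is worth checking against your own structuredData.json), a regular expression can match both forms. A minimal sketch, where HeadingPaths and isH1 are hypothetical names and the sample paths are made up for illustration:

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the PDF Services SDK.
final class HeadingPaths {

    // Matches a final path segment of H1, optionally followed by an index such as [2].
    private static final Pattern H1_SEGMENT = Pattern.compile("/H1(\\[\\d+\\])?$");

    // Returns true when the last segment of `path` is H1 or H1[n].
    static boolean isH1(String path) {
        return path != null && H1_SEGMENT.matcher(path).find();
    }

    public static void main(String[] args) {
        // Made-up sample paths; real values come from the "Path" field of each element.
        List<String> samplePaths = List.of(
                "//Document/H1",
                "//Document/H1[2]",
                "//Document/P",
                "//Document/H2");
        for (String path : samplePaths) {
            System.out.println(path + " -> " + isH1(path));
        }
    }
}
```

In the loop above, the condition path.endsWith("/H1") could be swapped for HeadingPaths.isH1(path) if indexed headings need to be included.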
Here's the complete application (src/main/java/ExtractTextInfoFromPDF.java):
```java
import com.adobe.pdfservices.operation.PDFServices;
import com.adobe.pdfservices.operation.PDFServicesMediaType;
import com.adobe.pdfservices.operation.PDFServicesResponse;
import com.adobe.pdfservices.operation.auth.Credentials;
import com.adobe.pdfservices.operation.auth.ServicePrincipalCredentials;
import com.adobe.pdfservices.operation.exception.SDKException;
import com.adobe.pdfservices.operation.exception.ServiceApiException;
import com.adobe.pdfservices.operation.exception.ServiceUsageException;
import com.adobe.pdfservices.operation.io.Asset;
import com.adobe.pdfservices.operation.io.StreamAsset;
import com.adobe.pdfservices.operation.pdfjobs.jobs.ExtractPDFJob;
import com.adobe.pdfservices.operation.pdfjobs.params.extractpdf.ExtractElementType;
import com.adobe.pdfservices.operation.pdfjobs.params.extractpdf.ExtractPDFParams;
import com.adobe.pdfservices.operation.pdfjobs.result.ExtractPDFResult;
import org.apache.commons.io.IOUtils;
import org.json.JSONArray;
import org.json.JSONObject;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Scanner;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ExtractTextInfoFromPDF {

    private static final Logger LOGGER = LoggerFactory.getLogger(ExtractTextInfoFromPDF.class);

    public static void main(String[] args) {

        try (InputStream inputStream = Files.newInputStream(new File("src/main/resources/Adobe Extract API Sample.pdf").toPath())) {
            // Initial setup, create credentials instance
            Credentials credentials = new ServicePrincipalCredentials(
                    System.getenv("PDF_SERVICES_CLIENT_ID"),
                    System.getenv("PDF_SERVICES_CLIENT_SECRET"));

            // Creates a PDF Services instance
            PDFServices pdfServices = new PDFServices(credentials);

            // Creates an asset(s) from source file(s) and upload
            Asset asset = pdfServices.upload(inputStream, PDFServicesMediaType.PDF.getMediaType());

            // Create parameters for the job
            ExtractPDFParams extractPDFParams = ExtractPDFParams.extractPDFParamsBuilder()
                    .addElementsToExtract(Arrays.asList(ExtractElementType.TEXT)).build();

            // Creates a new job instance
            ExtractPDFJob extractPDFJob = new ExtractPDFJob(asset)
                    .setParams(extractPDFParams);

            // Submit the job and gets the job result
            String location = pdfServices.submit(extractPDFJob);
            PDFServicesResponse<ExtractPDFResult> pdfServicesResponse = pdfServices.getJobResult(location, ExtractPDFResult.class);

            // Get content from the resulting asset(s)
            Asset resultAsset = pdfServicesResponse.getResult().getResource();
            StreamAsset streamAsset = pdfServices.getContent(resultAsset);

            // Creates an output stream and copy stream asset's content to it
            Files.createDirectories(Paths.get("output/"));
            String zipFileOutputPath = "output/ExtractTextInfoFromPDF.zip";
            OutputStream outputStream = Files.newOutputStream(new File(zipFileOutputPath).toPath());
            LOGGER.info("Saving asset at output/ExtractTextInfoFromPDF.zip");
            IOUtils.copy(streamAsset.getInputStream(), outputStream);
            outputStream.close();

            // Read structuredData.json out of the result ZIP
            ZipFile resultZip = new ZipFile(zipFileOutputPath);
            ZipEntry jsonEntry = resultZip.getEntry("structuredData.json");
            InputStream is = resultZip.getInputStream(jsonEntry);
            Scanner s = new Scanner(is).useDelimiter("\\A");
            String jsonString = s.hasNext() ? s.next() : "";
            s.close();

            // Print the text of every element whose path ends in /H1
            JSONObject jsonData = new JSONObject(jsonString);
            JSONArray elements = jsonData.getJSONArray("elements");
            for (int i = 0; i < elements.length(); i++) {
                JSONObject element = elements.getJSONObject(i);
                String path = element.getString("Path");
                if (path.endsWith("/H1")) {
                    String text = element.getString("Text");
                    System.out.println(text);
                }
            }
        } catch (ServiceApiException | IOException | SDKException | ServiceUsageException e) {
            LOGGER.error("Exception encountered while executing operation", e);
        }
    }
}
```
Next Steps
Now that you've successfully performed your first operation, review the documentation for many other examples and reach out on our forums with any questions. Remember that the samples you downloaded while creating your credentials also include many demos.