Search text from PDF files stored in an S3 bucket
Does your application allow users to upload PDFs? Maybe they upload resumes, waivers, agreements or signed documents. What if they need to search the contents of these PDFs?
As a developer, you have 3 options:
- Search by Filename: Lookup by key/value like filename [Native]
- Search by Metadata: Store the metadata in a separate database to perform queries [Database add-on]
- Full-Text-Search: Extract the contents into a search engine [OCR, Database, Search add-on]
Full-text search provides the most intuitive user experience, but it’s also the most challenging to build, maintain, and enhance.
In this tutorial, we’ll walk you through PDF file retrieval, content extraction via OCR (optical character recognition), and searching, so you can add full-text PDF search to your application.
Bonus: At the end is a link to a GitHub repository so you can import the code directly into your application.
Store the file
First, we need to download the file locally in order to run our OCR extraction logic:
import boto3

# Placeholder credentials: replace with your own values
s3_client = boto3.client(
    's3',
    aws_access_key_id='aws_access_key_id',
    aws_secret_access_key='aws_secret_access_key',
    region_name='region_name'
)

# bucket_name and s3_file_name are assumed to be defined elsewhere
with open(s3_file_name, 'wb') as file:
    s3_client.download_fileobj(
        bucket_name,
        s3_file_name,
        file
    )
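If the download step is needed in more than one place, it could be wrapped in a function. The following is a minimal sketch, not the repository's code: `download_pdf` and its `dest_dir` parameter are hypothetical names, and the client is passed in as an argument so credentials stay outside the function.

```python
import tempfile
from pathlib import Path, PurePosixPath


def download_pdf(s3_client, bucket_name, s3_file_name, dest_dir=None):
    """Download one S3 object to a local file and return the local path.

    Hypothetical helper: s3_client is assumed to be a boto3 S3 client,
    or any object with a compatible download_fileobj method.
    """
    dest_dir = Path(dest_dir or tempfile.gettempdir())
    # Keep only the last segment of the key, so a key such as
    # "uploads/resume.pdf" does not require nested local directories.
    local_path = dest_dir / PurePosixPath(s3_file_name).name
    with open(local_path, 'wb') as file:
        s3_client.download_fileobj(bucket_name, s3_file_name, file)
    return str(local_path)
```

Returning the local path lets the extraction step work on whatever location was chosen, rather than assuming the file sits in the current directory.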
Extract the contents
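We’ll use the open-source Apache Tika library through its Python wrapper, which starts a local Tika server and therefore needs Java installed. Tika’s AutoDetectParser class detects the file type and hands the file to the matching parser; for scanned PDFs, it can run OCR (optical character recognition) when Tesseract is installed: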
from tika import parser
parsed_pdf_content = parser.from_file(s3_file_name)['content']
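The `content` value can be `None` when no text is extracted, and extracted text often carries long runs of blank lines. A cleanup step before indexing might look like the sketch below; `clean_extracted_text` is a hypothetical helper name, not part of Tika.

```python
import re


def clean_extracted_text(content):
    """Normalize text returned by the extraction step (hypothetical helper)."""
    if not content:
        # Nothing was extracted, e.g. a scanned PDF with no OCR available.
        return ""
    # Collapse every run of whitespace, including newlines, into one space.
    return re.sub(r"\s+", " ", content).strip()
```

Indexing an empty string instead of `None` keeps the document searchable by filename while making it obvious that no text was found.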
Insert contents into a search engine
We’re using a self-managed OpenSearch node here, but you can use Lucene, Solr, Elasticsearch or Atlas Search.
Note: if you don’t have OpenSearch installed locally, you must install it first (shown here with Homebrew), then run it:
brew update
brew install opensearch
opensearch
OpenSearch will now be accessible here: http://localhost:9200. Let’s build the index and insert the file contents:
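from opensearchpy import OpenSearch

os = OpenSearch("http://localhost:9200/")
index_name = "pdf-search"

# The document to index: the filename plus the text extracted from the PDF
doc = {
    "filename": s3_file_name,
    "parsed_pdf_content": parsed_pdf_content
}

# refresh=True makes the document searchable immediately
response = os.index(
    index=index_name,
    body=doc,
    id=1,
    refresh=True
)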
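Because the ID is hardcoded to 1, indexing a second PDF would overwrite the first. One possible approach is to derive the ID from the file’s location. This is a sketch under that assumption; `build_index_request` is a hypothetical helper whose return value is meant to be unpacked into the `index` call.

```python
import hashlib


def build_index_request(index_name, bucket_name, s3_file_name, parsed_pdf_content):
    """Build keyword arguments for the client's index() call (hypothetical helper)."""
    # Hash bucket + key so the same file always maps to the same document ID:
    # re-indexing a file updates its document instead of creating a duplicate.
    location = f"{bucket_name}/{s3_file_name}"
    doc_id = hashlib.sha1(location.encode("utf-8")).hexdigest()
    return {
        "index": index_name,
        "id": doc_id,
        "body": {
            "filename": s3_file_name,
            "parsed_pdf_content": parsed_pdf_content,
        },
        "refresh": True,
    }
```

It could then be used as `os.index(**build_index_request(index_name, bucket_name, s3_file_name, parsed_pdf_content))`.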
Creating a PDF search API
We’ll use Flask to create a microservice that searches the index for a term:
from flask import Flask, jsonify, request
from opensearchpy import OpenSearch
from config import *  # assumed to define index_name

app = Flask(__name__)
os = OpenSearch("http://localhost:9200/")


@app.route('/search', methods=['GET'])
def search_file():
    query = request.args.get('q', default=None, type=str)
    if not query:
        # A match query needs a search term, so reject empty requests
        return jsonify({'error': 'missing query parameter: q'}), 400

    # query payload to OpenSearch
    payload = {
        'query': {
            'match': {
                'parsed_pdf_content': query
            }
        }
    }
    response = os.search(
        body=payload,
        index=index_name
    )
    return jsonify(response)


if __name__ == '__main__':
    app.run(host="localhost", port=5011, debug=True)
Now we can call the API via:
GET: http://localhost:5011/search?q=SEARCH_TERM
{
  "_shards": {
    "failed": 0,
    "skipped": 0,
    "successful": 1,
    "total": 1
  },
  "hits": {
    "hits": [
      {
        "_id": "1",
        "_index": "pdf-search",
        "_score": 0.29289162,
        "_source": {
          "filename": "prescription.pdf",
          "parsed_pdf_content": "http://localhost:5011/search?q=SEARCH_TERM"
        }
      }
    ],
    "max_score": 0.29289162,
    "total": {
      "relation": "eq",
      "value": 1
    }
  },
  "timed_out": false,
  "took": 40
}
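The raw response carries shard and timing details that most callers won’t need. As a sketch, the hits could be reduced to filename and score pairs before returning them; `summarize_hits` is a hypothetical helper, and it assumes every hit has the `_source.filename` field indexed earlier.

```python
def summarize_hits(response):
    """Reduce a search response to filename/score pairs (hypothetical helper)."""
    return [
        {"filename": hit["_source"]["filename"], "score": hit["_score"]}
        for hit in response["hits"]["hits"]
    ]
```

In the Flask route, this would mean returning `jsonify(summarize_hits(response))` instead of the full response.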
Whoo, we did it! We’ve successfully created an API that offers full-text PDF search.
You can download the repo here: https://github.com/mixpeek/pdf-search-s3
So what’s next?
- Queuing: Ensuring concurrent file uploads are not dropped
- Security: Adding end to end encryption to the data pipeline
- Enhancements: Including more features like fuzzy matching, highlighting, and autocomplete
- Rate Limiting: Building thresholds so users don’t abuse the system
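As a starting point for the enhancements item, the search payload could be extended with fuzzy matching and highlighting. This is a minimal sketch, not the repository’s code: `build_search_payload` is a hypothetical helper, and it assumes the `parsed_pdf_content` field name used earlier.

```python
def build_search_payload(query, fuzziness="AUTO"):
    """Build a match query with fuzzy matching and highlighting (hypothetical helper)."""
    return {
        "query": {
            "match": {
                "parsed_pdf_content": {
                    "query": query,
                    # Tolerate small typos in the search term
                    "fuzziness": fuzziness,
                }
            }
        },
        # Ask the engine to return the matching snippets for this field
        "highlight": {
            "fields": {"parsed_pdf_content": {}}
        },
    }
```

The result can be passed as the `body` of the search call in place of the plain match payload; matching snippets then appear under a `highlight` key on each hit.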
Everything collapsed into just 2 API calls
If this feels like too much for you to build, maintain, and enhance, Mixpeek has you covered.
Upload
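import requests

url = "https://api.mixpeek.com/upload"

# Open the file in a context manager so the handle is closed after the upload
with open('FILE_NAME.pdf', 'rb') as pdf_file:
    files = [
        ('file', ('FILE_NAME.pdf', pdf_file, 'application/pdf'))
    ]
    response = requests.request("POST", url, files=files)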
Search
import requests
url = "https://api.mixpeek.com/search?q=SEARCH_QUERY"
response = requests.request("GET", url)
print(response.text)
Corresponding Postman Collection for your convenience.
Request an API key for free, and review the docs to get started.