

Extract meaningful content from PDF and PSD files, such as text and images, both linked in a common JSON string.



Content-extractor is a Python-based project. The only reason for this is the availability of libraries (the best ones are in Python, IMO). Currently, this project parses PDF and PSD files to extract meaningful content, such as text and images, both linked under a common JSON string.


Content-extractor is built upon the following:


Since there are a lot of dependencies, and most of them have their own dependencies, I have made a shell script to simplify the installation process.

The script assumes the following:

When you launch the script, it installs pip if it isn't already present on your system.

$ cd install
$ sh

How to use it

For any supported extension (currently pdf/psd), you can use [file_path] [image_path]; it will automatically do the job.

#Will write a metada.json and extract the images into the folder images
./ psdtools/work.psd './images/'
./ pdfreader/book.pdf './images/'
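
The extension-based behaviour described above can be pictured as a small lookup table. The following is only a minimal sketch, not the project's actual code: `extract_pdf`, `extract_psd`, `HANDLERS` and `dispatch` are hypothetical names, and the two extractors are stand-ins that just report which branch was taken.

```python
import os


def extract_pdf(file_path, image_dir):
    # Hypothetical stand-in for the project's PDF extractor.
    return ("pdf", file_path, image_dir)


def extract_psd(file_path, image_dir):
    # Hypothetical stand-in for the project's PSD extractor.
    return ("psd", file_path, image_dir)


# Map each supported extension to the function that handles it.
HANDLERS = {".pdf": extract_pdf, ".psd": extract_psd}


def dispatch(file_path, image_dir):
    """Pick an extractor from the file extension (case-insensitive)."""
    extension = os.path.splitext(file_path)[1].lower()
    if extension not in HANDLERS:
        raise ValueError("unsupported extension: %r" % extension)
    return HANDLERS[extension](file_path, image_dir)


print(dispatch("pdfreader/book.pdf", "./images/"))
print(dispatch("psdtools/work.psd", "./images/"))
```

With a table like this, supporting a new format would only mean registering one more entry.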

You can also import it into your own Python project and use it the following way:

#will return a string containing the json and extract the images into the folder images
from parser import parser
json = parser.parse("psdtools/work.psd", "./images/")
json = parser.parse("pdfreader/book.pdf", "./images/")
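
Since the result is a plain string, it still has to be decoded before you can navigate it. Here is a minimal sketch of that step. It uses a hard-coded stand-in string instead of a real `parser.parse(...)` call (which needs the project and its dependencies installed), and it stores the result under the name `result` so that the standard `json` module is not shadowed. The keys are taken from the JSON format shown in this README.

```python
import json

# Stand-in for the string returned by parser.parse(...); assumed shape,
# based on the "JSON Format" example in this README.
result = (
    '{"pages": [{"images": [], "paragraphs": '
    '[{"string": "Book Title", "font": "Georgia", "size": 98}]}]}'
)

data = json.loads(result)  # decode the JSON string into dicts and lists
first_paragraph = data["pages"][0]["paragraphs"][0]
print(first_paragraph["string"], first_paragraph["font"])
```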

You can also use the pdfreader and psdtools scripts independently, like so:

# Shell:
$ ./psdtools/ psdtools/work.psd './images/'
$ ./pdfreader/ pdfreader/book.pdf './images/'
# Python:
from psdtools import main
json ="psdtools/work.psd", "./images/")
from pdfreader import main
json ="pdfreader/book.pdf", "./images/")
json ="pdfreader/book.pdf", "./images/", "ppm") #will extract the images as ppm/pbm and then convert them to png
json ="pdfreader/book.pdf", "./images/", "jpeg") #default: will extract the images directly as jpg

./pdfreader/ is just a simplified interface to the very powerful pdfreader/util/, which I have rewritten as a class but which originally comes from pdfminer. However, you can still use it as if it were the original tool; here is the documentation.

$ pdfreader/util/ -o output.html samples/naacl06-shinyama.pdf
(extract text as an HTML file whose filename is output.html)

$ pdfreader/util/ -V -c euc-jp -o output.html samples/jo.pdf
(extract a Japanese HTML file in vertical writing, CMap is required)

$ pdfreader/util/ -P mypassword -o output.txt secret.pdf
(extract text from an encrypted PDF file)

The same tool can also be imported into a Python project (fewer options are available this way, because I have not implemented them all):

# @see pdfreader/ as example
from util.convert import converter
convert = converter()
xml = convert.as_xml().add_input_file(fileinput).run()

How does it work

The information is extracted with the goal of preserving parent-child relations.

Below is a simplified example, taken from book.pdf, of what the JSON string looks like.

JSON Format (from pdfreader/book.pdf 'simplified')

{
    "pages": [
        {
            "images": [],
            "paragraphs": [
                {
                    "size": 98,
                    "width": 587,
                    "string": "Book Title",
                    "y": -98,
                    "x": -324,
                    "font": "Georgia",
                    "height": 705
                }
            ]
        },
        {
            "images": [],
            "paragraphs": [
                {
                    "size": 24,
                    "width": 138,
                    "string": "CHAPTER 1",
                    "y": -24,
                    "x": -88,
                    "font": "Georgia",
                    "height": 711
                },
                {
                    "size": 33,
                    "width": 489,
                    "string": "Lorem ipsum dolor sit amet, consectetur \n<i>adipisicing</i> <b>elit, sed</b> <i><b>do eiusmod</i></b>\ntempor incididunt ut labore et dolore \nmagna aliqua. Ut enim ad minim veniam,\nquis nostrud exercitation ullamco laboris \nnisi ut aliquip ex ea commodo\nconsequat.",
                    "y": -229,
                    "x": -439,
                    "font": "Georgia",
                    "height": 269
                }
            ]
        },
        {
            "paragraphs": [
                {
                    "size": 24,
                    "width": 133,
                    "string": "SECTION 1",
                    "y": -24,
                    "x": -83,
                    "font": "Georgia",
                    "height": 711
                }
            ]
        }
    ]
}
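
Because paragraphs are nested under their page, a consumer can walk the structure top-down and always know which page a paragraph belongs to. The sketch below is only an illustration under assumptions: `SAMPLE` is a shortened copy of the format above, `iter_paragraphs` and `TAG` are hypothetical names, and the tag stripping only handles the `<i>` and `<b>` markers that appear in the example.

```python
import json
import re

# Shortened sample in the format shown above (raw string keeps "\n" as a JSON escape).
SAMPLE = r"""
{
  "pages": [
    {"images": [], "paragraphs": [
      {"size": 98, "string": "Book Title", "font": "Georgia"}
    ]},
    {"images": [], "paragraphs": [
      {"size": 24, "string": "CHAPTER 1", "font": "Georgia"},
      {"size": 33, "string": "Lorem <i>ipsum</i> <b>dolor</b>\nsit amet", "font": "Georgia"}
    ]}
  ]
}
"""

# Matches only the inline <i>/<b> markers seen in the example output.
TAG = re.compile(r"</?[ib]>")


def iter_paragraphs(document):
    """Yield (page_number, plain_text, paragraph) for every paragraph."""
    for page_number, page in enumerate(document.get("pages", []), start=1):
        for paragraph in page.get("paragraphs", []):
            yield page_number, TAG.sub("", paragraph["string"]), paragraph


document = json.loads(SAMPLE)
for page_number, text, paragraph in iter_paragraphs(document):
    print(page_number, paragraph["size"], repr(text))
```

Using `.get(..., [])` keeps the walk tolerant of pages that omit a key, as the last page of the example does with "images".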

How to improve it


You're welcome to contribute to this project in any way you can. If you don't know how to code or don't have time, don't worry: you can still post an issue, and I will be happy to answer and fix it as fast as possible. Want to code? Fork it and submit a pull request! Pull requests that come with an example of what has been improved will be merged first.