pdf.tocgen
==========
                         in.pdf
                            |
                            |
     +----------------------+--------------------+
     |                      |                    |
     V                      V                    V
+----------+          +-----------+         +----------+
|          |  recipe  |           |   ToC   |          |
| pdfxmeta +--------->| pdftocgen +-------->| pdftocio +---> out.pdf
|          |          |           |         |          |
+----------+          +-----------+         +----------+
pdf.tocgen is a set of command-line tools for automatically
extracting and generating the table of contents (ToC) of a
PDF file. It uses the embedded font attributes and position
of headings to deduce the basic outline of a PDF file.
It works best for PDF files produced from a TeX document
using pdftex (and its friends pdflatex, pdfxetex, etc.), but
it's designed to work with any *software-generated* PDF
files (i.e. you shouldn't expect it to work with scanned
PDFs). Some examples include troff/groff, Adobe InDesign,
Microsoft Word, and probably more.
Please see the homepage [1] for a detailed introduction.
Installation
------------
pdf.tocgen is written in Python 3. It is known to work with
Python 3.7 to 3.11 on Linux, Windows, and macOS (on BSDs,
you probably need to build PyMuPDF yourself). Use
$ pip install -U pdf.tocgen
to install the latest version systemwide. Alternatively, use
`pipx` or
$ pip install -U --user pdf.tocgen
to install it for the current user. I would recommend the
latter approach to avoid messing up the package manager on
your system.
If you are using an Arch-based Linux distro, the package is
also available on AUR [8]. It can be installed using any AUR
helper, for example yay:
$ yay -S pdf.tocgen
Workflow
--------
The design of pdf.tocgen is influenced by the Unix philosophy [2].
I intentionally split pdf.tocgen into three separate programs.
They work together, but each of them is useful on its own.
1. pdfxmeta: extract the metadata (font attributes, positions)
of headings to build a *recipe* file.
2. pdftocgen: generate a table of contents from the recipe.
3. pdftocio: import the table of contents to the PDF document.
You should read the example [3] on the homepage for a proper
introduction, but the basic workflow goes like this.
First, use pdfxmeta to search for the metadata of headings,
and generate *heading filters* using the automatic setting
$ pdfxmeta -p page -a 1 in.pdf "Section" >> recipe.toml
$ pdfxmeta -p page -a 2 in.pdf "Subsection" >> recipe.toml
Note that `page` needs to be replaced by the number of the
page on which the search keyword appears.
The resulting `recipe.toml` file will contain several heading
filters, each of which specifies the attributes that a heading
at a particular level should have.
An example recipe file would look like this:
[[heading]]
level = 1
greedy = true
font.name = "Times-Bold"
font.size = 19.92530059814453
[[heading]]
level = 2
greedy = true
font.name = "Times-Bold"
font.size = 11.9552001953125
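
To make the idea of a heading filter concrete, here is a minimal
Python sketch of how such a filter could be matched against a piece
of text. It is not pdf.tocgen's actual implementation: the `Span`
type, the `matches` and `heading_level` helpers, and the size
tolerance are all hypothetical, and the `greedy` option is ignored.
It only illustrates that each filter maps a font name and size to a
heading level.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """A run of text with uniform font attributes (hypothetical type)."""
    text: str
    font_name: str
    font_size: float


# The two filters from the example recipe, written as plain dicts.
FILTERS = [
    {"level": 1, "font_name": "Times-Bold", "font_size": 19.92530059814453},
    {"level": 2, "font_name": "Times-Bold", "font_size": 11.9552001953125},
]


def matches(span: Span, flt: dict, tol: float = 1e-3) -> bool:
    """True if the span's font name and size agree with the filter.

    The tolerance is an assumption made for this sketch.
    """
    return (span.font_name == flt["font_name"]
            and abs(span.font_size - flt["font_size"]) <= tol)


def heading_level(span: Span, filters: list) -> Optional[int]:
    """Return the level of the first matching filter, or None for body text."""
    for flt in filters:
        if matches(span, flt):
            return flt["level"]
    return None


# Made-up spans: two headings and one line of body text.
spans = [
    Span("Functions", "Times-Bold", 19.92530059814453),
    Span("2.1 Functions as Data", "Times-Bold", 11.9552001953125),
    Span("Some body text", "Times-Roman", 9.96),
]
levels = [heading_level(s, FILTERS) for s in spans]
```

Text that matches no filter gets no level and is left out of the
table of contents.
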
Then pass the recipe to `pdftocgen` to generate a table of
contents,
$ pdftocgen in.pdf < recipe.toml
"Preface" 5
    "Bottom-up Design" 5
    "Plan of the Book" 7
    "Examples" 9
    "Acknowledgements" 9
"Contents" 11
"The Extensible Language" 14
    "1.1 Design by Evolution" 14
    "1.2 Programming Bottom-Up" 16
    "1.3 Extensible Software" 18
    "1.4 Extending Lisp" 19
    "1.5 Why Lisp (or When)" 21
"Functions" 22
    "2.1 Functions as Data" 22
    "2.2 Defining Functions" 23
    "2.3 Functional Arguments" 26
    "2.4 Functions as Properties" 28
    "2.5 Scope" 29
    "2.6 Closures" 30
    "2.7 Local Functions" 34
    "2.8 Tail-Recursion" 35
    "2.9 Compilation" 37
    "2.10 Functions from Lists" 40
"Functional Programming" 41
    "3.1 Functional Design" 41
    "3.2 Imperative Outside-In" 46
    "3.3 Functional Interfaces" 48
    "3.4 Interactive Programming" 50
[--snip--]
which can be directly imported to the PDF file using
`pdftocio`,
$ pdftocgen in.pdf < recipe.toml | pdftocio -o out.pdf in.pdf
Or if you want to edit the table of contents before
importing it,
$ pdftocgen in.pdf < recipe.toml > toc
$ vim toc # edit
$ pdftocio in.pdf < toc
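
Because the table of contents is plain text, it can also be checked
or post-processed with a short script before importing. The sketch
below parses lines of the form `"Title" page` into entries. It
assumes that nested entries are indented by four spaces per level;
that indentation rule, the `Entry` type, and `parse_toc` are
assumptions of this sketch, not part of pdf.tocgen, so check the
homepage [1] for the exact format.

```python
import re
from typing import List, NamedTuple


class Entry(NamedTuple):
    level: int
    title: str
    page: int


# One entry per line: optional indentation, a quoted title, a page number.
LINE_RE = re.compile(r'^(?P<indent>\s*)"(?P<title>.*)"\s+(?P<page>\d+)')


def parse_toc(text: str, indent_width: int = 4) -> List[Entry]:
    """Parse ToC text into entries; the indent width is an assumption."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip blank or unrecognized lines
        level = len(m.group("indent")) // indent_width + 1
        entries.append(Entry(level, m.group("title"), int(m.group("page"))))
    return entries


sample = '''"Functions" 22
    "2.1 Functions as Data" 22
    "2.2 Defining Functions" 23
"Functional Programming" 41
'''
toc = parse_toc(sample)
```

A script like this could, for example, flag entries whose page
numbers go backwards before the ToC is passed to `pdftocio`.
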
Each of the three programs has some extra features.
Use the -h option to see all the options you can pass in.
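
If you would rather drive the tools from a Python script than from a
shell, the `pdftocgen | pdftocio` pipeline shown earlier can be
reproduced with the standard `subprocess` module. This is only a
sketch: `build_commands` and `add_toc` are hypothetical helper names,
and it uses nothing beyond the command-line options already shown in
this README.

```python
import subprocess
from typing import List, Tuple


def build_commands(in_pdf: str, out_pdf: str) -> Tuple[List[str], List[str]]:
    """Return the argument lists for the pdftocgen and pdftocio steps."""
    gen = ["pdftocgen", in_pdf]
    imp = ["pdftocio", "-o", out_pdf, in_pdf]
    return gen, imp


def add_toc(in_pdf: str, recipe: str, out_pdf: str) -> None:
    """Run `pdftocgen in.pdf < recipe | pdftocio -o out.pdf in.pdf`."""
    gen, imp = build_commands(in_pdf, out_pdf)
    with open(recipe, "rb") as recipe_file:
        # pdftocgen reads the recipe on stdin and prints the ToC on stdout
        toc = subprocess.run(gen, stdin=recipe_file,
                             stdout=subprocess.PIPE, check=True).stdout
    # pdftocio reads the ToC on stdin and writes the output PDF
    subprocess.run(imp, input=toc, check=True)
```

Calling `add_toc("in.pdf", "recipe.toml", "out.pdf")` then has the
same effect as the one-line shell pipeline, provided both tools are
on your PATH.
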
Development
-----------
If you want to modify the source code or contribute anything,
first install poetry [4], which is a dependency and package
manager for Python used by pdf.tocgen. Then run
$ poetry install
in the root directory of this repository to set up
development dependencies.
If you want to test the development version of pdf.tocgen,
use the `poetry run` command:
$ poetry run pdfxmeta in.pdf "pattern"
Alternatively, you could also use the
$ poetry shell
command to open up a virtual environment and run the
development version directly:
(pdf.tocgen) $ pdfxmeta in.pdf "pattern"
Before you send a patch or pull request, make sure the unit
tests pass by running:
$ make test
GUI front end
-------------
If you are an Emacs user, you could install Daniel Nicolai's
toc-mode [9] package as a GUI front end for pdf.tocgen. It
also offers many more features, such as extracting the
(printed) table of contents from a PDF file. Note that it
uses pdf.tocgen under the hood, so you still need to install
pdf.tocgen before using toc-mode.
License
-------
pdf.tocgen itself is free software. The source code of
pdf.tocgen is licensed under the GNU GPLv3 license. However,
the recipes in the `recipes` directory are separately
licensed under the CC BY-NC-SA 4.0 License [7] to prevent
any commercial usage, and are thus not included in the
distribution.
pdf.tocgen is based on PyMuPDF [5], licensed under the GNU
GPLv3 license, which is in turn based on MuPDF [6], licensed
under the GNU AGPLv3 license. A copy of the AGPLv3 license
is included in the repository.
If you want to make any derivatives based on this project,
please follow the terms of the GNU GPLv3 license.
[1]: https://krasjet.com/voice/pdf.tocgen/
[2]: https://en.wikipedia.org/wiki/Unix_philosophy
[3]: https://krasjet.com/voice/pdf.tocgen/#a-worked-example
[4]: https://python-poetry.org/
[5]: https://github.com/pymupdf/PyMuPDF
[6]: https://mupdf.com/docs/index.html
[7]: https://creativecommons.org/licenses/by-nc-sa/4.0/
[8]: https://aur.archlinux.org/packages/pdf.tocgen/
[9]: https://github.com/dalanicolai/toc-mode