# pdf-extractor

One of the source handlers in Gresladix. It processes a folder of PDF files and pushes the raw text and metadata of each page to the `t1` Kafka topic.

## Building

To build the `.jar` and `-standalone.jar`, make sure you have [Leiningen](https://leiningen.org/) installed, then run:

```
lein uberjar
```
This will generate a `.jar` and a `-standalone.jar` in the `target` folder.

*TODO* describe how to deploy releases to esi-private repo.

## Usage

If you have a released version, you can run:
```
java -jar pdf-extractor-x.x.x-standalone.jar -c config.json
```

During development you can run:
```
lein run -c config.json
```

### Configuration options
The configuration file should contain a single JSON object with a field `net.expertsystem.lab/pdf-extractor` whose value is an object with the following fields:
 * `kafka-server` address of the Kafka server to connect to, e.g. `"172.16.32.80:9092"`
 * `raw-doc-ktopic` name of the Kafka topic to which the extracted documents will be sent
 * `pdf-dir` path to a local folder containing PDF files
 * `push-output` boolean value, `true` by default, specifying whether to push the output to the output Kafka topic. Setting it to `false` can be useful for verifying that the service works as expected before actually sending documents to Kafka.
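
As an illustration, a `config.json` for a dry run might look like the sketch below. The values are assumptions: the server address is the example given above, the topic name `t1` is taken from the description at the top of this README, and `/path/to/pdfs` is a placeholder for your own folder.

```json
{
  "net.expertsystem.lab/pdf-extractor": {
    "kafka-server": "172.16.32.80:9092",
    "raw-doc-ktopic": "t1",
    "pdf-dir": "/path/to/pdfs",
    "push-output": false
  }
}
```

With `push-output` set to `false`, nothing should be sent to Kafka; set it to `true` (or omit it) once the output looks right.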
 
## Limitations and Roadmap
The current version must be run manually for each new batch of PDF files by:
 * placing them into a new folder
 * editing or copying `config.json` so that the `pdf-dir` option points to the new folder
 * launching `pdf-extractor`
 
In the future it would be better to make this more automatic:
 * a local folder may still be used, but the component would move PDFs to a separate folder once they have been processed. This way the component can monitor the folder and process documents as they arrive.
 * instead of using a local folder, there may be a new Kafka topic of URLs to be processed. When a URL resolves to a PDF file, the document can be processed.

## License

Copyright © 2018 Expert System

Distributed under the Eclipse Public License either version 1.0 or (at
your option) any later version.
