View Issue Details

ID: 0012319
Project: mantisbt
Category: attachments
View Status: public
Last Update: 2016-07-21 15:15
Reporter: gthomas
Assigned To:
Status: new
Resolution: open
Product Version: 1.2.17
Target Version:
Fixed in Version:
Summary: 0012319: Index attachments' content

I'd like to have a plugin for full-text indexing of attachment content (.doc, .odt, .pdf).

Additional Information

a) Use a separate Apache Lucene instance with some (RESTful HTTP?) interface.
b) Use the Apache Tika parser with PostgreSQL's tsearch2 full-text indexer.

a) seems like hard work and possibly heavyweight (a Java servlet running on some servlet engine).
b) is way easier, at least when you're already using PostgreSQL under your Mantis.
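
To make option b) a bit more concrete, here is a minimal sketch of the two SQL statements such a plugin might issue once Tika has produced plain text. The table name attachment_index, its columns, and the 'english' text search configuration are assumptions for illustration only; to_tsvector, plainto_tsquery and the @@ operator are the PostgreSQL full-text functions (built in since 8.3, previously provided by the tsearch2 contrib module). The %s placeholders follow the style used by common Python PostgreSQL drivers.

```python
# Hypothetical table: attachment_index(file_id integer, content_tsv tsvector).
# Only the PostgreSQL functions and the @@ operator are real names.
INDEX_SQL = (
    "INSERT INTO attachment_index (file_id, content_tsv) "
    "VALUES (%s, to_tsvector('english', %s))"
)
SEARCH_SQL = (
    "SELECT file_id FROM attachment_index "
    "WHERE content_tsv @@ plainto_tsquery('english', %s)"
)


def index_statement(file_id, extracted_text):
    """Return (sql, params) that stores the text of one attachment."""
    return INDEX_SQL, (file_id, extracted_text)


def search_statement(user_query):
    """Return (sql, params) that finds attachments matching a user query."""
    return SEARCH_SQL, (user_query,)
```

Passing the text and the query as parameters rather than interpolating them keeps user input out of the SQL string.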

Tags: attachment, fts, plugin, postgresql




2010-09-05 15:39

reporter   ~0026579

I'd need some suggestions: online or offline indexing of uploaded files?
If offline, then should I call the "java -jar tika-app.jar" directly from PHP, or should that be run from some cron script?

Any other ideas?



2010-09-19 02:58

reporter   ~0026782

This is a big undertaking.

I think you'd ideally want to perform indexing on a cron job cycle at low IO/CPU priority (ionice + renice). If you call an indexing command every time a file is uploaded, you could end up with multiple CPU-intensive processes running at the same time on your server. With a cron job you have much better control over what time of day the CPU-intensive workload runs and how many CPUs are used concurrently.
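
As a rough illustration of the low-priority cron idea, the sketch below builds the command line and crontab entry. It uses nice at launch time instead of renice on a running process, which is a substitution on my part; the schedule and the indexer script path are placeholders, not anything from the actual plugin.

```python
import shlex


def low_priority_command(indexer_cmd):
    """Prefix a command so it runs with idle I/O class and lowest CPU priority."""
    # ionice -c 3 selects the "idle" I/O scheduling class (Linux only);
    # nice -n 19 gives the process the lowest CPU scheduling priority.
    return ["ionice", "-c", "3", "nice", "-n", "19"] + list(indexer_cmd)


def crontab_line(schedule, indexer_cmd):
    """Format a crontab entry that runs the indexer at low priority."""
    parts = low_priority_command(indexer_cmd)
    return schedule + " " + " ".join(shlex.quote(p) for p in parts)


if __name__ == "__main__":
    # Placeholder schedule (02:00 every night) and placeholder script path.
    print(crontab_line("0 2 * * *", ["php", "/path/to/indexer.php"]))
```

Running a single scheduled job also caps how many indexing processes run at once, which is the control the per-upload approach lacks.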

Of course, this would make it Linux-only, which is a potential downside. That said, it is a plugin, and someone could create a Windows-specific version of it if they wanted to, or contribute patches later to add Windows support to the plugin you're proposing.

I'm a little concerned about how this will work when we support many different database types. I guess you could just make a full-text search plugin specific to PostgreSQL, etc., but then you'd be limiting the number of users who can use your plugin.



2010-09-19 02:59

reporter   ~0026783

Not to mention the multiple different ways in which attachments can be stored:

1) On a remote FTP server

2) As a file within the uploads/files directory

3) Within the database as big blobs



2010-09-19 03:47

reporter   ~0026786

This is absolutely a WIP, but things work now:

  • extract with antiword/unzip/pdftotext OR tika
  • indexing backend: PostgreSQL's TSearch2 OR Xapian
  • indexing in a cron job (uses file_api's file_get_content, so the storage method doesn't matter).
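
The extraction step in the first bullet could be sketched roughly as follows. This is not the plugin's code (the plugin itself is PHP); the jar location and the fallback to Tika for unknown extensions are my assumptions. The sketch only builds the command lines, each of which writes the extracted content to standard output.

```python
import os

TIKA_JAR = "tika-app.jar"  # assumed location of the Tika application jar

# Native extractors per extension; each command prints to stdout.
NATIVE_EXTRACTORS = {
    ".doc": lambda path: ["antiword", path],
    # unzip -p prints content.xml from the ODF archive; the result is XML,
    # so the tags would still have to be stripped before indexing.
    ".odt": lambda path: ["unzip", "-p", path, "content.xml"],
    # A "-" output argument makes pdftotext write to stdout.
    ".pdf": lambda path: ["pdftotext", path, "-"],
}


def extraction_command(path, use_tika=False):
    """Return the command that extracts text from the given file."""
    ext = os.path.splitext(path)[1].lower()
    if use_tika or ext not in NATIVE_EXTRACTORS:
        # Assumption: fall back to Tika for anything without a native tool.
        return ["java", "-jar", TIKA_JAR, "--text", path]
    return NATIVE_EXTRACTORS[ext](path)
```

The returned list can be passed to a process runner from the cron job, so the web request that handles the upload never has to wait for extraction.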

So indexing works, but usage (embedding in the "View Issues" page) is missing (hopefully next week), and configuration needs more work, too.




2010-09-19 15:50

reporter   ~0026788

Now search works, but why do I need to set $g_plugin_current[0] = 'AttachmentIndexer' (plugin's name) every time? (not just from the cron job, but from IndexerFilter.class.php, too).



2010-09-20 05:28


attachmentindexer-WIP-0.1.2.tbz2 (7,299 bytes)


2010-09-20 05:28

reporter   ~0026796

Attached a working (at least with TSearch2) version, without tika-app-0.7.jar (17MB).



2010-09-25 07:49

reporter   ~0026857

Since mantisforge doesn't accept my push efforts, uploaded it to

Issue History

Date Modified Username Field Change
2010-09-05 15:36 gthomas New Issue
2010-09-05 15:39 gthomas Note Added: 0026579
2010-09-05 15:40 gthomas Tag Attached: plugin
2010-09-05 15:40 gthomas Tag Attached: attachment
2010-09-05 15:40 gthomas Tag Attached: feature
2010-09-05 15:40 gthomas Tag Attached: fts
2010-09-05 15:40 gthomas Tag Attached: postgresql
2010-09-05 15:40 gthomas Tag Attached: wish
2010-09-19 02:58 dhx Note Added: 0026782
2010-09-19 02:59 dhx Note Added: 0026783
2010-09-19 03:47 gthomas Note Added: 0026786
2010-09-19 15:50 gthomas Note Added: 0026788
2010-09-20 05:28 gthomas File Added: attachmentindexer-WIP-0.1.2.tbz2
2010-09-20 05:28 gthomas Note Added: 0026796
2010-09-25 07:49 gthomas Note Added: 0026857
2014-02-02 11:25 atrol Severity tweak => feature
2014-10-12 18:34 grangeway Product Version git trunk => 1.2.17
2016-07-21 15:14 atrol Tag Detached: feature
2016-07-21 15:15 atrol Tag Detached: wish