View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0012319||mantisbt||attachments||public||2010-09-05 15:36||2016-07-21 15:15|
|Target Version||Fixed in Version|
|Summary||0012319: Index attachments' content|
I'd like to have a plugin for (full-text) indexing attachments' (.doc, .odt, .pdf) content.
a) seems hard work and maybe heavyweight (Java servlet running on some servlet engine)
|Tags||attachment, fts, plugin, postgresql|
I'd need some suggestions: online or offline indexing of uploaded files?
Any other ideas?
This is a big undertaking.
I think you'd ideally want to perform indexing on a cron job cycle at low IO/CPU priority (ionice + renice). Calling an indexing command every time a file is uploaded could leave multiple CPU-intensive processes running at once on your server. With a cron job you have much better control over what times of the day the CPU-intensive workload runs and how many CPUs are used concurrently.
Of course, this would make it Linux-only, which is a potential downside. That said, it is a plugin, and someone could create a Windows-specific version of it if they wanted to, or contribute patches later to add Windows support to the plugin you're proposing.
I'm a little concerned about how this will work given that we support many different database types. I guess you could just make a full-text search plugin specific to PostgreSQL, etc., but then you'd be limiting the number of users who can use your plugin.
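To make the cron suggestion concrete, here is a minimal sketch of how the indexer command could be wrapped with `ionice` and `nice` and turned into a crontab entry. The script path, the schedule and all function names are assumptions for illustration only; nothing here comes from the plugin itself.

```python
import os
import shlex


def low_priority_command(indexer_cmd, nice_level=19, io_class=3):
    """Prefix a command with ionice/nice so it runs at low I/O and CPU priority.

    ionice class 3 is the "idle" I/O class; nice level 19 is the lowest CPU priority.
    """
    return ["ionice", "-c", str(io_class), "nice", "-n", str(nice_level), *indexer_cmd]


def crontab_line(schedule, indexer_cmd):
    """Build one crontab line: the schedule followed by the wrapped, shell-quoted command."""
    wrapped = low_priority_command(indexer_cmd)
    return schedule + " " + " ".join(shlex.quote(part) for part in wrapped)


def worker_count(reserved_cpus=1):
    """Pick how many indexing workers to run, leaving some CPUs free for the web server."""
    return max(1, (os.cpu_count() or 1) - reserved_cpus)


if __name__ == "__main__":
    # Hypothetical indexer script path; adjust to wherever the plugin's batch script lives.
    cmd = ["php", "/opt/mantisbt/index_attachments.php"]
    print(crontab_line("0 2 * * *", cmd))
```

Running the batch from cron at a fixed hour keeps the heavy work off peak times, and `worker_count` is one simple way to cap concurrency.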
Not to mention the multiple different ways in which attachments can be stored:
1) On a remote FTP server
2) As a file within the uploads/files directory
3) Within the database as big blobs
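One way to cope with the three storage methods listed above is to hide them behind a single loader, so the indexer only ever sees raw bytes. This is a rough sketch, not how MantisBT does it: the record keys (`storage`, `content`, `diskfile`) and the `ftp_config` layout are hypothetical names chosen for the example.

```python
import ftplib
import io
from pathlib import Path


def read_from_disk(upload_dir, diskfile):
    """Read an attachment stored as a plain file in the uploads directory."""
    return (Path(upload_dir) / diskfile).read_bytes()


def read_from_ftp(host, user, password, remote_name):
    """Download an attachment from a remote FTP server into memory."""
    buf = io.BytesIO()
    with ftplib.FTP(host, user, password) as ftp:
        ftp.retrbinary("RETR " + remote_name, buf.write)
    return buf.getvalue()


def load_attachment(record, upload_dir=None, ftp_config=None):
    """Return the attachment bytes regardless of where the file is stored.

    `record` is a hypothetical dict describing one attachment row.
    """
    method = record["storage"]
    if method == "database":
        # Blob already came back with the row.
        return bytes(record["content"])
    if method == "disk":
        return read_from_disk(upload_dir, record["diskfile"])
    if method == "ftp":
        return read_from_ftp(remote_name=record["diskfile"], **ftp_config)
    raise ValueError(f"unknown storage method: {method!r}")
```

With something like this in place, the text extraction and indexing steps would not need to know which storage method is configured.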
This is absolutely a WIP, but things work now:
So indexing works, but usage (embedding in the "View Issues" page) is missing (hopefully next week), and configuration needs more work, too.
Now search works, but why do I need to set $g_plugin_current = 'AttachmentIndexer' (plugin's name) every time? (not just from the cron job, but from IndexerFilter.class.php, too).
attachmentindexer-WIP-0.1.2.tbz2 (7,299 bytes)
Attached a working (at least with TSearch2) version, without tika-app-0.7.jar (17MB).
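For readers who have not used tika-app: text extraction with it usually comes down to running the jar on one file and reading plain text from standard output. The sketch below assumes the `--text` option of the Tika command-line tool and a `java` binary on the PATH; whether the attached plugin calls Tika this way is not stated in this issue, and the function names are made up for the example.

```python
import subprocess


def tika_text_command(jar_path, document_path):
    """Build the command that asks tika-app to print a document's plain text."""
    return ["java", "-jar", str(jar_path), "--text", str(document_path)]


def extract_text(jar_path, document_path, timeout=120):
    """Run tika-app on one file and return the extracted text.

    Raises subprocess.CalledProcessError if Tika exits with an error, and
    subprocess.TimeoutExpired if extraction takes longer than `timeout` seconds.
    """
    result = subprocess.run(
        tika_text_command(jar_path, document_path),
        capture_output=True,
        check=True,
        timeout=timeout,
    )
    # The output encoding depends on the platform; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

The returned text could then be handed to whatever full-text backend is in use, such as TSearch2.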
Since mantisforge doesn't accept my push efforts, uploaded it to
|2010-09-05 15:36||gthomas||New Issue|
|2010-09-05 15:39||gthomas||Note Added: 0026579|
|2010-09-05 15:40||gthomas||Tag Attached: plugin|
|2010-09-05 15:40||gthomas||Tag Attached: attachment|
|2010-09-05 15:40||gthomas||Tag Attached: feature|
|2010-09-05 15:40||gthomas||Tag Attached: fts|
|2010-09-05 15:40||gthomas||Tag Attached: postgresql|
|2010-09-05 15:40||gthomas||Tag Attached: wish|
|2010-09-19 02:58||dhx||Note Added: 0026782|
|2010-09-19 02:59||dhx||Note Added: 0026783|
|2010-09-19 03:47||gthomas||Note Added: 0026786|
|2010-09-19 15:50||gthomas||Note Added: 0026788|
|2010-09-20 05:28||gthomas||File Added: attachmentindexer-WIP-0.1.2.tbz2|
|2010-09-20 05:28||gthomas||Note Added: 0026796|
|2010-09-25 07:49||gthomas||Note Added: 0026857|
|2014-02-02 11:25||atrol||Severity||tweak => feature|
|2014-10-12 18:34||grangeway||Product Version||git trunk => 1.2.17|
|2016-07-21 15:14||atrol||Tag Detached: feature|
|2016-07-21 15:15||atrol||Tag Detached: wish|