View Issue Details

ID: 0031837
Project: mantisbt
Category: bugtracker
View Status: public
Last Update: 2022-12-20 08:52
Reporter: ricardoalonsos
Assigned To:
Priority: normal
Severity: feature
Reproducibility: always
Status: acknowledged
Resolution: open
Summary: 0031837: Using disk to store bug/project files for large environments will put thousands of files into the same folder
Description

If the database is very large (>200 projects, >300000 issues), storing files in the DB is expensive, so using the disk is a nice option. But there's no simple way to move from DB to disk with the files already organized per project (it requires updating the database directly). In addition, projects with a massive number of attached files will still end up with a single folder holding possibly thousands of files.

Dividing the files into folders is the best solution. And because the file names are already randomly generated, it will be easy to hash them into folders, using the first 2 or 3 letters of their names.

I have already implemented this, but it may need further testing, or improvement in how the files are hashed into the folders. The patches I used are attached.
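
The idea of keying folders on the first letters of the stored file name can be sketched as follows. This is a minimal Python illustration, not MantisBT code (the attached patch implements the idea in PHP as `adjust_filepath`). The names `sharded_dir`, `sharded_path` and `prefix_len` are hypothetical, and the prefix length is a parameter only to show how 2 versus 3 letters would differ.

```python
import os


def sharded_dir(base_dir: str, disk_name: str, prefix_len: int = 2) -> str:
    """Return the subfolder of base_dir keyed by the first prefix_len characters of disk_name."""
    if prefix_len < 1 or len(disk_name) < prefix_len:
        raise ValueError("file name is shorter than the prefix length")
    return os.path.join(base_dir, disk_name[:prefix_len])


def sharded_path(base_dir: str, disk_name: str, prefix_len: int = 2) -> str:
    """Return the full path of disk_name inside its prefix subfolder."""
    return os.path.join(sharded_dir(base_dir, disk_name, prefix_len), disk_name)


if __name__ == "__main__":
    # Hypothetical upload folder and stored file name, for illustration only.
    print(sharded_path("uploads", "a3f09c1e", prefix_len=2))
    print(sharded_path("uploads", "a3f09c1e", prefix_len=3))
```

Because the folder is derived purely from the file name, the same function can be used when storing, downloading and deleting a file, provided the prefix length never changes for files already on disk.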

Tags: patch
Attached Files
0001-changed-file-structure-creating-folders-to-avoid-too.patch (3,523 bytes)   
From f79953d38e0032ea4e0e6112a5558547ac6cd332 Mon Sep 17 00:00:00 2001
From: Ricardo Alonso <ralonso@redhat.com>
Date: Thu, 24 Nov 2022 16:44:15 +0000
Subject: [PATCH] changed file structure, creating folders to avoid too many
 files into a single folder

---
 .gitignore                   |  1 +
 admin/move_attachments.php   | 12 +++++-
 config/config_inc.php.sample | 83 ------------------------------------
 core/file_api.php            | 24 ++++++++++-
 4 files changed, 34 insertions(+), 86 deletions(-)
 create mode 100644 .gitignore
 delete mode 100644 config/config_inc.php.sample

diff --git a/admin/move_attachments.php b/admin/move_attachments.php
index 682da75..df34fb3 100644
--- a/admin/move_attachments.php
+++ b/admin/move_attachments.php
@@ -194,7 +194,9 @@ function move_attachments_to_disk( $p_type, array $p_projects ) {
 			$t_data = array();
 
 			while( $t_row = db_fetch_array( $t_result ) ) {
-				$t_disk_filename = $t_upload_path . $t_row['diskfile'];
+				# first check if filename is on new format already
+				$t_filepath = adjust_filepath($t_upload_path, $t_row['diskfile']); 
+				$t_disk_filename = $t_filepath . $t_row['diskfile'];
 				if ( file_exists( $t_disk_filename ) ) {
 					$t_status = 'Disk File Already Exists \'' . $t_disk_filename . '\'';
 					$t_failures++;
@@ -217,7 +219,7 @@ function move_attachments_to_disk( $p_type, array $p_projects ) {
 						}
 						$t_update_result = db_query(
 							$t_update_query,
-							array( $t_upload_path, $t_row['id'] )
+							array( $t_filepath, $t_row['id'] )
 						);
 
 						if( !$t_update_result ) {
@@ -242,6 +244,9 @@ function move_attachments_to_disk( $p_type, array $p_projects ) {
 					$t_file['bug_id'] = $t_row['bug_id'];
 				}
 				$t_data[] = $t_file;
+
+				$t_row = null;
+				gc_collect_cycles();
 			}
 		}
 
@@ -253,6 +258,9 @@ function move_attachments_to_disk( $p_type, array $p_projects ) {
 			'data'       => $t_data,
 		);
 
+		$t_result = null;
+		gc_collect_cycles();
+
 	}
 	return $t_moved;
 }
diff --git a/core/file_api.php b/core/file_api.php
index 9ccaa70..e344965 100644
--- a/core/file_api.php
+++ b/core/file_api.php
@@ -934,6 +934,9 @@ function file_add( $p_bug_id, array $p_file, $p_table = 'bug', $p_title = '', $p
 	$t_unique_name = file_generate_unique_name( $t_file_path );
 	$t_method = config_get( 'file_upload_method' );
 
+	# adjust the path to accomodate the files into smaller folders
+	$t_file_path = adjust_filepath( $t_file_path, $t_unique_name);
+
 	switch( $t_method ) {
 		case DISK:
 			file_ensure_valid_upload_path( $t_file_path );
@@ -1409,4 +1412,23 @@ function file_get_content_type_override( $p_filename ) {
  */
 function file_get_max_file_size() {
 	return (int)min( ini_get_number( 'upload_max_filesize' ), ini_get_number( 'post_max_size' ), config_get( 'max_file_size' ) );
-}
\ No newline at end of file
+}
+
+/**
+ * Adjust the file path to subdivide the uploaded files into folders. 
+ * 
+ * @param string $p_filepath the path to store the files
+ * @param string $p_filename the name of the file to store
+ * @return string the adjusted file path if necessary. 
+ * 
+ */
+function adjust_filepath($p_filepath, $p_filename){
+	$t_search = DIRECTORY_SEPARATOR . substr( $p_filename, 0, 2 ); 
+	if (strpos( $p_filepath, $t_search) === false ){
+		$t_filepath = $p_filepath . substr( $p_filename, 0, 2 ) . DIRECTORY_SEPARATOR;
+		if ( !file_exists( $t_filepath ) )
+			mkdir( $t_filepath, 0700 );
+		return $t_filepath;
+	}
+	return $p_filepath;
+}
-- 
2.38.1

0003-fixing-upload-download-with-new-folder-structure.patch (2,635 bytes)   
From 1a07a9b95f1c6de3145063ecfe2218bd41e05739 Mon Sep 17 00:00:00 2001
From: Ricardo Alonso <ricardo.alonso@niit.com>
Date: Wed, 14 Dec 2022 09:50:46 +0200
Subject: [PATCH] fixing upload/download with new folder structure

---
 core/bug_api.php  | 2 +-
 core/file_api.php | 3 ++-
 file_download.php | 3 ++-
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/core/bug_api.php b/core/bug_api.php
index 9cc5711..2f2be56 100644
--- a/core/bug_api.php
+++ b/core/bug_api.php
@@ -1763,7 +1763,7 @@ function bug_get_bugnote_stats( $p_bug_id ) {
  */
 function bug_get_attachments( $p_bug_id ) {
 	db_param_push();
-	$t_query = 'SELECT id, title, diskfile, filename, filesize, file_type, date_added, user_id, bugnote_id
+	$t_query = 'SELECT id, title, concat(folder, diskfile) as diskfile, filename, filesize, file_type, date_added, user_id, bugnote_id
 		                FROM {bug_file}
 		                WHERE bug_id=' . db_param() . '
 		                ORDER BY date_added';
diff --git a/core/file_api.php b/core/file_api.php
index e344965..c25c42e 100644
--- a/core/file_api.php
+++ b/core/file_api.php
@@ -671,6 +671,7 @@ function file_delete( $p_file_id, $p_table = 'bug', $p_bugnote_id = 0 ) {
 
 	$c_file_id = (int)$p_file_id;
 	$t_filename = file_get_field( $p_file_id, 'filename', $p_table );
+	$t_folder = file_get_field( $p_file_id, 'folder', $p_table );
 	$t_diskfile = file_get_field( $p_file_id, 'diskfile', $p_table );
 
 	if( $p_table == 'bug' ) {
@@ -681,7 +682,7 @@ function file_delete( $p_file_id, $p_table = 'bug', $p_bugnote_id = 0 ) {
 	}
 
 	if( DISK == $t_upload_method ) {
-		$t_local_disk_file = file_normalize_attachment_path( $t_diskfile, $t_project_id );
+		$t_local_disk_file = file_normalize_attachment_path( $t_folder . $t_diskfile, $t_project_id );
 		if( file_exists( $t_local_disk_file ) ) {
 			file_delete_local( $t_local_disk_file );
 		}
diff --git a/file_download.php b/file_download.php
index 005fe4d..87fa5f3 100644
--- a/file_download.php
+++ b/file_download.php
@@ -102,6 +102,7 @@ if( false === $t_row ) {
 /**
  * @var int    $v_bug_id
  * @var int    $v_project_id
+ * @var string $v_folder
  * @var string $v_diskfile
  * @var string $v_filename
  * @var int    $v_filesize
@@ -177,7 +178,7 @@ $t_file_info_type = false;
 
 switch( $t_upload_method ) {
 	case DISK:
-		$t_local_disk_file = file_normalize_attachment_path( $v_diskfile, $t_project_id );
+		$t_local_disk_file = file_normalize_attachment_path( $v_folder . $v_diskfile, $t_project_id );
 		if( file_exists( $t_local_disk_file ) ) {
 			$t_file_info_type = file_get_mime_type( $t_local_disk_file );
 		}
-- 
2.38.1

Activities

dregad

2022-12-19 10:45

developer   ~0067227

Just checking, are you aware that you can already define a distinct directory to store attachments for each individual project? This may be sufficient to reduce the number of files in the directory to an acceptable level. See Upload File Path in manage_proj_edit_page.php: by default it's blank, i.e. the project uses the globally defined directory ($g_absolute_path_default_upload_folder).

there's no simple way to move from DB to disk

There's an admin script to do just that (admin/move_attachments_page.php), but admittedly it's quite basic and probably a bit outdated too as it does not see much usage. That being said, if the projects' attachment paths are already set, I believe it will store the files in the configured directories.

Out of curiosity, how many attachments are we talking about here? This is the first time I have heard of this being a problem. And what would be the maximum acceptable number of files in a given directory?

I am asking because I wonder if a simple approach like the one you propose, i.e. using the first 2 or 3 letters of [the attachment file] names, will only delay the problem and may not be enough to actually limit the number of files to an acceptable level. Maybe this needs to be configurable, I don't know.

Anyway, thanks for your contribution. This would require review and some testing to ensure it does not break anything for existing systems.

ricardoalonsos

2022-12-19 11:48

reporter   ~0067229

I'm aware of the option to separate per project. But the problem is that we are migrating from an old version, where the storage was in the DB. We have 200+ projects and it's manual work to update every project to use its own folder. We have 100000+ attachments, but some projects have none and others have 30000+, so they are still not evenly distributed.

Using 2 letters (246 folders on the first level), the files were better distributed, with fewer than 400 per folder.
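
As a rough check of these numbers, the average folder size for a given prefix length can be estimated as below. The sketch assumes the stored names are hexadecimal strings (16 possible characters per position) spread uniformly over the prefixes; neither assumption is stated in this thread. The function names are hypothetical, and Python is used only for the arithmetic.

```python
def folder_count(prefix_len: int, alphabet_size: int = 16) -> int:
    """Maximum number of first-level folders for a given prefix length."""
    return alphabet_size ** prefix_len


def average_files_per_folder(total_files: int, prefix_len: int, alphabet_size: int = 16) -> float:
    """Average folder size if files spread uniformly over every possible prefix."""
    return total_files / folder_count(prefix_len, alphabet_size)


if __name__ == "__main__":
    # 100000 is the attachment count mentioned in this comment.
    for prefix_len in (2, 3):
        folders = folder_count(prefix_len)
        average = average_files_per_folder(100_000, prefix_len)
        print(f"{prefix_len} letters: up to {folders} folders, about {average:.0f} files each")
```

Under those assumptions, 100000 attachments over 2 letters gives at most 256 folders and an average of about 391 files each, in line with the roughly 400 per folder reported above. With 3 letters it would be at most 4096 folders and about 24 files each.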

But it would be interesting to have some tool to manipulate and reorganize/rearrange this structure if necessary.

  • move from 2 to 3 letters
  • automatically set the project folder to be the project name (add a system name e.g.
dregad

2022-12-20 08:52

developer   ~0067231

Last edited: 2022-12-20 08:52

We have 200+ projects and it's manual work to update every project to use its own folder.

Actually, it's not such a big effort; something along these lines should be enough to do the trick:

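update mantis_project_table
set file_path = concat('/default_upload_folder/', id, '/')
where file_path = '';

# run from within /default_upload_folder/, since mkdir creates the directories relative to the current directory
mysql bugtracker -e "select id from mantis_project_table" --batch --skip-column-names | xargs mkdir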

it would be interesting to have some tool to manipulate and reorganize/rearrange this structure if necessary

I agree this might be useful functionality.
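
Such a tool could look roughly like the sketch below, which moves files from one prefix length to another (for example from 2 to 3 letters). It is a hypothetical Python illustration, not an existing MantisBT script: the function name `reshard` is invented, it only touches the filesystem, and any folder value stored in the database would have to be updated separately.

```python
import os
import shutil


def reshard(base_dir: str, new_len: int) -> int:
    """Move each file under base_dir into a subfolder named after its first new_len characters.

    Returns the number of files moved. Emptied folders are left in place.
    """
    # Collect all paths first so folders created during the move are not walked again.
    files = [
        os.path.join(root, name)
        for root, _dirs, names in os.walk(base_dir)
        for name in names
    ]
    moved = 0
    for src in files:
        name = os.path.basename(src)
        if len(name) < new_len:
            continue  # name too short to derive a prefix; leave it where it is
        dest_dir = os.path.join(base_dir, name[:new_len])
        dest = os.path.join(dest_dir, name)
        if os.path.abspath(src) == os.path.abspath(dest):
            continue  # already in the right folder
        os.makedirs(dest_dir, exist_ok=True)
        shutil.move(src, dest)
        moved += 1
    return moved
```

Running it a second time with the same prefix length moves nothing, so it can be re-run safely after an interruption.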

automatically set the project folder to be the project name

This might be worth implementing as default behavior actually.