Upload File and Test Dataset Db

Upload a file and the test dataset database

This page analyses the main procedures of a typical PS workflow, from data uploading to creating the environment for R execution and gathering the data back.


Upload a file and get all necessary information

The first step in PS is file uploading, done in Step 1, which automatically retrieves all necessary information: the experiment type (metabolically labelled (L), isobarically labelled (IL) or label-free (LF)), the file type (MQ for MaxQuant evidence files, PD for Proteome Discoverer PSM files and MQP for MaxQuant proteinGroups files), the labels, if any are contained in the file, and the raw files of the experiment, one per MS run.

Hitting the upload files button in Step 1 performs a click on a hidden upload button, __s2btnupld, letting the user choose one or more files to upload. This fires the change event, which is bound to the uploadButtonStatusChanged function in JS. uploadButtonStatusChanged takes the selected files as an argument and, after validating that at least one file was selected and that none of the files is oversized, fills the uploadingFiles array with them and executes uploadFiles, defining a postSuccess anonymous function to be executed at the end of each successful file upload. uploadFiles creates one row per file in an invisible table, s2uluploaders, and executes postFile once per file. Note that nToUpload is a variable that always stores the number of files pending upload. postFile asynchronously calls upload_files.php, adding a listener to its progress events to monitor and refresh a progress bar located in the respective row of the s2uluploaders table. upload_files.php uploads the file and then determines the file type, the experiment type, the raw files and the labels of the uploaded file.
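For orientation, here is a minimal sketch of the first stage of such an upload endpoint; the form field name fileToUpload and the per-session staging directory are assumptions for illustration, not the actual upload_files.php code:

	<?php
	// Sketch: stage the posted file and read its header line, which drives
	// all the detection steps described below.
	session_start();
	$uploadDir = 'uploads/' . session_id() . '/'; // assumed staging directory
	if (!is_dir($uploadDir)) {
		mkdir($uploadDir, 0755, true);
	}
	$file = $_FILES['fileToUpload'];              // hypothetical field name
	$dest = $uploadDir . basename($file['name']);
	if (!move_uploaded_file($file['tmp_name'], $dest)) {
		http_response_code(500);
		exit(json_encode(array('error' => 'upload failed')));
	}
	// PS input files are tab-separated; the headers sit in the first row.
	$handle     = fopen($dest, 'r');
	$first_line = fgetcsv($handle, 0, "\t");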

upload_files.php is based on two main arrays:

	$_get_labels_aguments_sets = array(
		'PD' => array(
			'L' => array(
				array('/^([^\s]+)\/([^\s]+)$/i', '/Modifications/i', '/\((?:[^:]+?:)?(.+?)\)(;|$)/'), //Proteome Discoverer
			)
		),
		'MQ' => array(
			'L' => array(
				array('/^Ratio ([^\s]+)\/([^\s]+)/i', null, null), //MaxQuant
				array('/^Reporter intensity ([0-9]+)/i', null, null), //MaxQuant, MS/MS-multiplexing (reporter ions), e.g. iTRAQ
			)
		)
	);
	// and
	$unique_patterns = array(
		'PD' => array(
			'L' => array('/Quan Channel/i'),
			'LF' => array(null), // there is no unique header for PD-LF
			'IL' => array('/\d+\/\d+/')
		),
		'MQ' => array(
			'L' => array('/Labeling State/i'),
			'LF' => array(null), // there is no unique header for MQ-LF
			'IL' => array('/Reporter intensity/i')
		)
	);

$_get_labels_aguments_sets is a three-dimensional array: the first dimension corresponds to different processing programs, the second to different filetypes (at the moment only L, i.e. labelled filetypes, is supported, but in the future more specific triads might be introduced, responding e.g. to label-free or isobarically labelled data) and the third is a pattern triad that is specific to the processing program and filetype. Each triad of patterns corresponds to: (1) a pattern for headers containing labels, (2) a pattern for headers containing the peptide modifications and (3) a pattern to discover all modifications applied to the peptides in this experiment. The two last elements of the triad can be used to find all peptide modifications contained in the experiment, a somewhat useful feature that is currently commented out in PS. To enable it, simply comment in the following block in upload_files.php:

// The following block checks for the validity of label definitions, comment in in case you want to use label definitions in ProteoSign:
						/*
							if (count($tmp[0]) > 0)
							{
							(...)

and the following one in get_labels.php:

// Search for all modifications in the file:
					// This feature is currently commented out in ProteoSign:
					/*
						while ($labeldefcol > -1 && ($data = fgetcsv($handle, 0, "\t")) !== FALSE)
						{
						(...)

This will return the variable "peptide_labels" to JS, containing all found modifications. At the moment, upload_files.php searches only for the labels contained in the headers of the uploaded file, using the first element of the triad as a regex pattern.
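As an illustration, a minimal sketch of that header search, assuming an MQ metabolically labelled file whose header row has already been read into $first_line; the pattern is the first element of the corresponding triad above:

	// Sketch: extract label names from headers such as "Ratio H/L" using the
	// first pattern of $_get_labels_aguments_sets['MQ']['L'][0].
	$pattern = '/^Ratio ([^\s]+)\/([^\s]+)/i';
	$labels  = array();
	foreach ($first_line as $header) {
		if (preg_match($pattern, $header, $m)) {
			// The pattern returns two labels per matched header, e.g. "H" and "L".
			$labels[$m[1]] = true;
			$labels[$m[2]] = true;
		}
	}
	$peptide_labels_names = array_keys($labels); // e.g. array('H', 'L')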

$unique_patterns is a three-dimensional array as well. The first dimension is the processing program, the second is a filetype and the third holds the patterns; for example, $unique_patterns['PD']['L'] = array('/Quan Channel/i'), meaning that, among all datasets derived from PD, this pattern can be found only in metabolically labelled ones. Notice that there are no unique patterns for label-free datasets in either MQ or PD, so if no unique pattern for metabolically labelled or isobarically labelled datasets is found, the dataset is considered to be label-free.

upload_files.php starts by guessing the file type (MQ, PD or MQP) by searching for patterns in the file's headers: after loading the first line of the file, the following code finds the file type and stores it in $dtype:

$tmp = preg_grep("/Spectrum File/i", $first_line);
if (count($tmp) > 0) {
	$dtype = 'PD';
	} else {
	$tmp = preg_grep("/Raw file/i", $first_line);
	if (count($tmp) > 0){
		$dtype = 'MQ';
	}
	else{
		$tmp = preg_grep("/Peptide IDs/i", $first_line);
		if (count($tmp) > 0){
			$dtype = 'MQP';
		}
		else{
			$dtype = 'unknown';
		}
	}
}

After deciding the file type, the following file-type-specific lines use the aforementioned unique patterns to find the experiment type:

$filetype = "";
if ($dtype == 'MQ' || $dtype == 'PD')
{
	foreach ($unique_patterns[$dtype]['L'] as &$Lpattern)
	{
		$tmp = preg_grep($Lpattern, $first_line);
		if (count($tmp) > 0)
		{
			$filetype = "L";
		}
	}
	if ($filetype == "")
	{
		foreach ($unique_patterns[$dtype]['IL'] as &$ILpattern)
		{
			$tmp = preg_grep($ILpattern, $first_line);
			if (count($tmp) > 0)
			{
				$filetype = "IL";
			}
		}
		if ($filetype == "")
		{
			$filetype = "LF";
		}
	}				
}
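If a new MQ or PD version ships slightly different headers, a file can slip through these checks; as explained next, the fix is to register one more unique header, for example (hypothetical header name):

	// Hypothetical example: one extra pattern makes metabolically labelled
	// PD files with a new header detectable again, instead of falling
	// through to LF.
	$unique_patterns['PD']['L'][] = '/New Quan Header/i';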

If a file is misdetected by the code above (usually as LF), e.g. because a new version of MQ or PD has slightly different headers, a new unique header can be added to $unique_patterns at the corresponding array dimension, e.g. $unique_patterns['PD']['L'] for a unique pattern of a PD file from metabolically labelled experiments.

After that, get_labels (stored in get_labels.php) is called, which searches for header patterns containing labels. For example, if the file was detected to be PD - L, get_labels will search (according to $_get_labels_aguments_sets as defined above) for all headers matching the pattern /^([^\s]+)\/([^\s]+)$/i, which must return two labels per matched header, and return all found labels. In a similar way get_rawfiles_names is called, which searches the raw-files column of the file cell by cell, finding all raw files, one per MS run (this is the most time-consuming part of the process). Finally, the file type, the experiment type, the labels and the raw files are returned to JS via the variables file_type, exp_type, peptide_labels_names and raw_filesnames respectively.

Back in JS, the .done anonymous function in postFile is called. It sets the variables isLabelFree and isIsobaricLabel according to the experiment type (if both are false, the experiment is metabolically labelled) and tweaks the environment to meet the experiment type's needs: e.g. if the experiment type is IL, Replication Multiplexing is enabled, and if it is LF, the Assign condition option is enabled. AddedLabels is a global flag that ensures PS will try to populate the conditions list from the files' labels or tags just once. postFile's .done function fills the conditions list with the conditions, as well as the expquantfiltlbl dropdown list, which should contain all possible labels. Before finishing with post-upload file processing, postFile decrements nToUpload by one.

After each successful file upload, postFile executes the postSuccess function that was defined in uploadButtonStatusChanged. In this function, if there are no more files to be uploaded (nToUpload == 0), PS checks all file types that were uploaded and, if the combination is valid (e.g. MQ and MQP, or a PD file alone), it guesses the processing program from the combination and allows the user to move to the next step. Otherwise it prompts the user and forces them to restart the process.

The aforementioned procedure is described in the diagrams below: "Uploading a file in PS" and "The algorithm of upload_files.php".
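To give an idea of the server side of this step, here is a minimal sketch of get_rawfiles_names' core loop and of the final response, assuming the raw-file column index ($rawfile_col) has already been located in the header and that the response is sent as JSON; both assumptions are for illustration only:

	// Sketch: collect the distinct raw file names (one per MS run) by scanning
	// the raw-file column cell by cell, the slowest part of the upload step.
	$raw_filesnames = array();
	while (($data = fgetcsv($handle, 0, "\t")) !== FALSE) {
		if (isset($data[$rawfile_col]) && !in_array($data[$rawfile_col], $raw_filesnames)) {
			$raw_filesnames[] = $data[$rawfile_col];
		}
	}
	// Everything the client needs is then sent back to JS in one response.
	echo json_encode(array(
		'file_type'            => $dtype,                // MQ, PD or MQP
		'exp_type'             => $filetype,             // L, IL or LF
		'peptide_labels_names' => $peptide_labels_names, // e.g. array('H', 'L')
		'raw_filesnames'       => $raw_filesnames
	));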

Load a test file

In PS a user might opt to load a test dataset from a predefined list and preview it with predefined options loaded from a database. All dataset files are stored in the 'test data' folder, where testdatadb is also located: an SQLite3 database containing all the necessary information.

testdatadb structure

Along with testdatadb we provide some tools to edit it, located under the edit_testdatadb folder. These are PHP scripts executable from the command line; the one we will use at the moment is show_all_values.php. If you execute it on the command line you will see all values of all tables in testdatadb. It is not necessary to fully understand the structure of testdatadb, since you can easily edit it with import_dataset.php and discard_dataset.php, which are described below. Nevertheless, let's examine the tables one by one.

  • dataset (columns: id, desc): the main table in the database; it stores all the available datasets, with a description (desc) and a primary key (id)
  • files (columns: id, file): a table with all dataset file names (file) and a primary key for them (id). The files must reside inside the test data folder under the exact file name recorded in this table
  • dataset_files (columns: id, dataset_id, file_id): a table matching the keys of the two tables above. id is another primary key for each match, dataset_id is the id of the dataset from the dataset table and file_id is the key from the files table. Notice that one file might be matched to more than one dataset
  • param (columns: id, selector): this table matches a primary key to each jQuery selector in order to easily assign values to DOM objects
  • param_value (columns: id, param_id, value, dataset_id): one of the most cryptic tables in PS, param_value matches PS parameters to their predefined values for a dataset. dataset_id is the key from the dataset table, matching a record to a specific dataset. id is a primary key. param_id is the key in the param table that matches one jQuery selector (e.g. 1 -> input[name='exppddata']). The value column stores the value to assign to the parameter. All values are sent as strings; however, true-false data, such as the checked state of check boxes, are encoded as 0-1. The value of the selected DOM object is then set to value, unless the object is a check box, in which case its checked state is set to true (1) or false (0)
  • processed_files (columns: id, name): a table matching each raw file name to a primary key
  • experimental_structure (columns: id, processed_file_id, brep, trep, frac, dataset_id, used, condition): a table matching a raw file to an experimental-structure coordinate, that is, the brep, the trep, the fraction, the condition and whether it is used in the analysis. dataset_id is the key of the dataset the raw file belongs to and processed_file_id is the key of the raw file in the processed_files table. If a condition is not applicable because the dataset is not LF, set the value to '-'
  • extra_options (columns: id, dataset_id, option, opt_value): extra configuration options may be applicable to a dataset; this is mainly done in PS by setting some arrays in JS. The id column is a primary key and dataset_id is the key of the dataset the extra option applies to. option is a description of the array that will be set and opt_value is the value as a one-line encoded array (a parsing sketch follows the comment block below). The following comment from our JavaScript describes this feature better:
		//	  Typically opt_value is a one-line encoded array that is parsed to a global array of PS
		// 
		// | Possible options |		  Description		       |
		// |------------------|--------------------------------|
		// | Rename		      | Renames (merges) conditions	   |
		// |				  | by editing the RenameArray	   |
		// | LS_Array		  | Sets Label Swap by editing the |
		// |				  | LS_Array					   |
		// | LS_c_p_Add	      | When a test dataset has	       |
		// |				  | Label Swaping LS_c_p_Add	   |
		// |				  | defines the structure of	   |
		// |				  | LS_counters_per_Add			   |
		// | Select_Labels	  | Selects a subset of the		   |
		// |				  | available conditions		   |
		//
		// opt_value: the argument of the corresponding option, usually denoting an array. opt_value is transformed
		// to an array by breaking the first dimension every time a || is found and the second dimension every time
		// a single | is found.
		//
		// e.g. the value 0|Lung_SCC||1|Lung_SCC||2|Lung_SCC||3|Lung_ADC||4|Lung_ADC||5|Lung_ADC will become this array:
		//
		// +---+----------+
		// | 0 | Lung_SCC |
		// +---+----------+
		// | 1 | Lung_SCC |
		// +---+----------+
		// | 2 | Lung_SCC |
		// +---+----------+
		// | 3 | Lung_ADC |
		// +---+----------+
		// | 4 | Lung_ADC |
		// +---+----------+
		// | 5 | Lung_ADC |
		// +---+----------+
		//
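For consistency with the rest of this page's snippets, the decoding idea is sketched below in PHP; in PS itself this parsing happens in JavaScript, and the helper name is hypothetical:

	// Sketch: decode a one-line encoded opt_value into a two-dimensional array.
	// '||' separates rows (first dimension), '|' separates columns (second).
	function decode_opt_value($opt_value) { // hypothetical helper name
		$rows = array();
		foreach (explode('||', $opt_value) as $row) {
			$rows[] = explode('|', $row);
		}
		return $rows;
	}
	// decode_opt_value('0|Lung_SCC||1|Lung_SCC||2|Lung_SCC||3|Lung_ADC||4|Lung_ADC||5|Lung_ADC')
	// returns: array(array('0','Lung_SCC'), array('1','Lung_SCC'), ..., array('5','Lung_ADC'))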

Notice that test datasets do not currently support replication multiplexing.

testdatadb tools

To edit the database you can simply open it with any SQLite3 client; however, we built some tools to make your life easier:

show_all_values.php

Execute show_all_values.php to display all values of all tables in tabular format.

extend_table.php

Gets data from a tabular file and imports them into a table. Running it on the command line will prompt for two arguments: table to extend, where you should type a valid table name, and file to get rows from, where you should type a relative or absolute path pointing to the tabular file to import rows from.

execute_query.php

Simply type an SQL query and execute it. Keep in mind that execute_query.php does not return data from SELECT queries, so use it only to insert or delete data from the db.
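Since execute_query.php cannot display SELECT results, you can query the database directly instead; below is a minimal sketch, assuming it is run from inside the 'test data' folder where the testdatadb file lives:

	// Sketch: list every dataset together with its associated files by
	// joining dataset -> dataset_files -> files.
	$db  = new SQLite3('testdatadb');
	$res = $db->query(
		'SELECT d.id, d."desc", f.file
		 FROM dataset d
		 JOIN dataset_files df ON df.dataset_id = d.id
		 JOIN files f ON f.id = df.file_id
		 ORDER BY d.id'
	);
	while ($row = $res->fetchArray(SQLITE3_ASSOC)) {
		echo $row['id'] . "\t" . $row['desc'] . "\t" . $row['file'] . "\n";
	}
	$db->close();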

import_dataset.php

Maybe the most useful db editing tool, import_dataset.php adds a test dataset to the database automatically, given a description, the associated file names and a parameters file from a previous PS run. To set it up, first run a normal PS session with your dataset and download the zipped results folder. Inside it you will find the parameters file; copy it to the edit_testdatadb folder. Then copy the dataset files (the PSM file for PD datasets, the evidence and proteinGroups files for MQ datasets) into the "test data" folder and run import_dataset.php on the command line. You will be asked to give a name for the dataset; this will be the name displayed in the test dataset list in Step 1. After that, you should type the name of the parameters file you just copied. Finally, type the names of the dataset files one by one. When you are finished, simply hit Enter and the dataset will be imported automatically. The next time you launch PS you should see the dataset in the test dataset list in Step 1, and it should load automatically with the options you chose in the PS run above.

discard_dataset.php

In case you want to discard a test dataset from the list, execute discard_dataset.php. The script will show you all datasets currently in the list and ask you to choose the index of the one you want to erase. It will discard the dataset and prepare the db environment for a future test dataset import.