Upload a file and the test dataset database
This page describes the main procedures of a typical PS workflow, from data uploading to creating the environment for R execution and gathering the data back.
Upload a file and get all necessary information
The first step in PS is file uploading, done in Step 1, which automatically retrieves all necessary information: the type of the experiment (metabolically labelled (L), isobarically labelled (IL) or label-free (LF)), the type of the file (MQ for MaxQuant evidence files, PD for PD PSM files and MQP for MaxQuant proteinGroups files), the labels (if any) contained in the file, and the raw files of the experiment, one for each MS run. Hitting the upload files button in Step 1 performs a click on a hidden upload button, __s2btnupld, letting the user choose one or more files to upload. This fires the change event that is bound to the uploadButtonStatusChanged function in JS. uploadButtonStatusChanged takes the selected files as an argument and, after validating that at least one file was selected and that none of the files is oversized, fills the uploadingFiles array with them and executes uploadFiles, defining a postSuccess anonymous function to be executed at the end of each successful file upload. uploadFiles creates a row in an invisible table, s2uluploaders, one per file, and executes postFile once per file. Note that nToUpload is a variable that always stores the number of files still pending upload. postFile asynchronously calls upload_files.php, adding a listener to its progress to monitor and refresh a progress bar located in the respective row of the s2uluploaders table. upload_files.php uploads the file and then determines the file type, the experiment type, the raw files and the labels of the uploaded file.
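The per-file upload call might look roughly like the following (a minimal sketch, not the actual PS source; the form field name, updateProgressBar and the exact jQuery options are assumptions):

function postFile(file, postSuccess) {
    var formData = new FormData();
    formData.append('file', file);                 // form field name is an assumption
    $.ajax({
        url: 'upload_files.php',
        type: 'POST',
        data: formData,
        processData: false,                        // let the browser serialise FormData
        contentType: false,
        xhr: function () {
            var xhr = $.ajaxSettings.xhr();
            // Refresh the progress bar in this file's row of the s2uluploaders table
            xhr.upload.addEventListener('progress', function (e) {
                if (e.lengthComputable) {
                    updateProgressBar(file, e.loaded / e.total); // hypothetical helper
                }
            });
            return xhr;
        }
    }).done(function (response) {
        // post-upload processing (described below), then the postSuccess callback
        postSuccess(response);
    });
}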
upload_files.php is based on two main arrays:
$_get_labels_aguments_sets = array(
'PD' => array(
'L' => array(
array('/^([^\s]+)\/([^\s]+)$/i', '/Modifications/i', '/\((?:[^:]+?:)?(.+?)\)(;|$)/'), //Proteome Discoverer
)
),
'MQ' => array(
'L' => array(
array('/^Ratio ([^\s]+)\/([^\s]+)/i', null, null), //MaxQuant
array('/^Reporter intensity ([0-9]+)/i', null, null), //MaxQuant, MS/MS-multiplexing (reporter ions), e.g. iTRAQ
)
)
);
// and
$unique_patterns = array(
'PD' => array(
'L' => array('/Quan Channel/i'),
'LF' => array(null), // there is no unique header for PD-LF
'IL' => array('/\d+\/\d+/')
),
'MQ' => array(
'L' => array('/Labeling State/i'),
'LF' => array(null), // there is no unique header for MQ-LF
'IL' => array('/Reporter intensity/i')
)
);
$_get_labels_aguments_sets is a three-dimensional array: the first dimension corresponds to the different processing programs, the second one to the different filetypes (at the moment only L, i.e. labelled, filetypes are supported, but in the future more specific triads might be introduced, responding e.g. to LFQ or isobarically labelled data) and the third one is a pattern triad that is processing-program and filetype specific.
Each triad of patterns corresponds to: (1) a pattern for headers containing labels, (2) a pattern for headers containing the peptide modifications and (3) a pattern to discover all modifications applied to the peptides in this experiment. The last two elements of the triad can be used to find all peptide modifications contained in the experiment, a somewhat useful feature that is currently commented out in PS. To enable it, simply comment in the following block in upload_files.php:
// The following block checks for the validity of label definitions, comment in in case you want to use label definitions in ProteoSign:
/*
if (count($tmp[0]) > 0)
{
(...)
and the following one in get_labels.php:
// Search for all modifications in the file:
// This feature is currently commented out in ProteoSign:
/*
while ($labeldefcol > -1 && ($data = fgetcsv($handle, 0, "\t")) !== FALSE)
{
(...)
This will return the variable peptide_labels to JS, containing all found modifications.
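For illustration, this is how the third pattern of the PD triad extracts modification names, shown here as a JavaScript sketch (the sample cell value is hypothetical):

// The PD modification pattern extracts modification names from a
// Modifications cell; the sample value below is hypothetical
var modPattern = /\((?:[^:]+?:)?(.+?)\)(;|$)/g;
var cell = 'N-Term(TMT6plex); K11(TMT6plex); M5(Oxidation)';
var mods = [], m;
while ((m = modPattern.exec(cell)) !== null) {
    mods.push(m[1]); // the first capture group holds the modification name
}
// mods is now ['TMT6plex', 'TMT6plex', 'Oxidation']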
At the moment, upload_files.php searches only for the labels contained in the headers of the uploaded file, using the first element of each triad as a regex pattern.
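For example, applying the first MQ 'L' pattern to a set of headers yields two labels as capture groups per matched header (a JavaScript sketch; the header names are an assumed example of MaxQuant SILAC ratio columns):

// The MQ 'L' pattern /^Ratio ([^\s]+)\/([^\s]+)/i returns two labels per match
var labelPattern = /^Ratio ([^\s]+)\/([^\s]+)/i;
var headers = ['Sequence', 'Ratio H/L', 'Ratio H/L normalized', 'Intensity'];
var labels = [];
headers.forEach(function (h) {
    var m = h.match(labelPattern);
    if (m) {
        labels.push(m[1], m[2]); // e.g. 'H' and 'L'
    }
});
// labels is now ['H', 'L', 'H', 'L']; removing duplicates (an assumption about
// PS's post-processing) leaves the label set {H, L}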
unique_patterns is also a three-dimensional array. The first dimension is the processing program, the second is a filetype and the third holds the patterns; for example unique_patterns['PD']['L'] = array('/Quan Channel/i'), which means that this pattern can be found only in metabolically labelled datasets among all datasets derived from PD. Notice that there are no unique patterns for LFQ datasets in either MQ or PD, so if no unique patterns for metabolically labelled or isobarically labelled datasets are found, the dataset is considered to be an LFQ dataset.
upload_files.php starts by guessing the file type (MQ, PD or MQP) by searching for patterns in the file's headers: after loading the first line of the file, the following code finds the file type and stores it in dtype:
$tmp = preg_grep("/Spectrum File/i", $first_line);
if (count($tmp) > 0) {
    $dtype = 'PD';
} else {
    $tmp = preg_grep("/Raw file/i", $first_line);
    if (count($tmp) > 0) {
        $dtype = 'MQ';
    } else {
        $tmp = preg_grep("/Peptide IDs/i", $first_line);
        if (count($tmp) > 0) {
            $dtype = 'MQP';
        } else {
            $dtype = 'unknown';
        }
    }
}
After deciding the file type, the following file-type-specific lines use the aforementioned unique patterns to find the experiment type:
$filetype = "";
if ($dtype == 'MQ' || $dtype == 'PD')
{
foreach ($unique_patterns[$dtype]['L'] as &$Lpattern)
{
$tmp = preg_grep($Lpattern, $first_line);
if (count($tmp) > 0)
{
$filetype = "L";
}
}
if ($filetype == "")
{
foreach ($unique_patterns[$dtype]['IL'] as &$ILpattern)
{
$tmp = preg_grep($ILpattern, $first_line);
if (count($tmp) > 0)
{
$filetype = "IL";
}
}
if ($filetype == "")
{
$filetype = "LF";
}
}
}
If a file is misdetected by the code above (usually as LF, e.g. because a new version of MQ or PD has slightly different headers), a new unique header pattern can be added to the corresponding dimension of $unique_patterns, e.g. to $unique_patterns['PD']['L'] for a file from PD from a metabolically labelled experiment.
After that, get_labels (defined in get_labels.php) is called, which searches for header patterns containing labels. For example, if the file was detected to be PD - L, get_labels will search (according to $_get_labels_aguments_sets as defined above) for all headers matching the pattern /^([^\s]+)\/([^\s]+)$/i, which must return two labels per matched header, and return all found labels. In a similar way, get_rawfiles_names is called, which searches the raw files column of the file cell by cell, finding all raw files, one per MS run (this is the most time-consuming part of the process).
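The column scan can be sketched as follows (for illustration only, in JavaScript rather than the actual PHP implementation; the column name lookup and the splitting logic are assumptions):

// Illustrative column scan: collect the unique values of the 'Raw file' column
// from a tab-separated file already loaded into memory (names are assumptions)
function getRawfileNames(tsvText) {
    var lines = tsvText.split('\n');
    var headers = lines[0].split('\t');
    var col = headers.indexOf('Raw file'); // MQ evidence column; PD uses 'Spectrum File'
    var seen = {};
    // Scanning every row is what makes this the slowest part of the process
    for (var i = 1; i < lines.length; i++) {
        var cells = lines[i].split('\t');
        if (col > -1 && cells[col]) {
            seen[cells[col]] = true;        // one raw file per MS run
        }
    }
    return Object.keys(seen);
}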
Finally, the file type, the experiment type, the labels and the raw files are returned to JS via the variables file_type, exp_type, peptide_labels_names and raw_filesnames respectively.
Back in JS, the .done anonymous function in postFile is called, setting the variables isLabelFree and isIsobaricLabel according to the experiment type (if both are false, the experiment is metabolically labelled) and tweaking the environment to meet the experiment type's needs; e.g. if the experiment type is IL, Replication Multiplexing is enabled, and if it is LF, the Assign condition option is enabled. AddedLabels is a global flag ensuring that PS tries to populate the conditions list from the files' labels or tags just once. postFile's .done function fills the conditions list with the conditions and also the expquantfiltlbl dropdown list, which should contain all possible labels. Before finishing with post-upload file processing, postFile decrements nToUpload by one.
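That handler's logic might look roughly like this (a sketch only; postFileRequest and the helper functions are hypothetical, and the response fields mirror the PHP variables above):

// Sketch of postFile's .done logic; 'resp' stands for the parsed reply
// of upload_files.php carrying the variables listed above (names assumed)
postFileRequest.done(function (resp) {
    isLabelFree     = (resp.exp_type === 'LF');
    isIsobaricLabel = (resp.exp_type === 'IL'); // both false => metabolically labelled
    if (isIsobaricLabel) {
        enableReplicationMultiplexing();        // hypothetical helper
    } else if (isLabelFree) {
        enableAssignConditionOption();          // hypothetical helper
    }
    if (!AddedLabels) {
        // Populate the conditions list and the expquantfiltlbl dropdown only once
        fillConditionsList(resp.peptide_labels_names);
        fillLabelDropdown(resp.peptide_labels_names);
        AddedLabels = true;
    }
    nToUpload--; // one fewer file pending upload
});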
After each successful file upload, postFile executes the postSuccess function that was defined in uploadButtonStatusChanged. In this function, if there are no other files left to upload (nToUpload == 0), PS checks all filetypes that were uploaded and, if the combination is valid (e.g. MQ and MQP together, or a PD file alone), it guesses the processing program from the combination and allows the user to move to the next step. Otherwise it prompts the user and forces them to restart the process.
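A sketch of that check (the valid combinations shown are only the two mentioned above; uploadedTypes and the helpers are assumptions):

// Sketch of the postSuccess check; uploadedTypes is assumed to collect the
// file type of every uploaded file (e.g. ['MQ', 'MQP'])
function postSuccess() {
    if (nToUpload > 0) {
        return; // wait until every pending upload has finished
    }
    var hasMQ  = uploadedTypes.indexOf('MQ')  > -1;
    var hasMQP = uploadedTypes.indexOf('MQP') > -1;
    var hasPD  = uploadedTypes.indexOf('PD')  > -1;
    if (hasMQ && hasMQP && !hasPD) {
        processingProgram = 'MaxQuant';           // guessed from the combination
        allowNextStep();                          // hypothetical helper
    } else if (hasPD && !hasMQ && !hasMQP) {
        processingProgram = 'Proteome Discoverer';
        allowNextStep();
    } else {
        promptRestart();                          // invalid combination: start over
    }
}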
The aforementioned procedure is described in the first diagram below, and the upload_files.php algorithm in the second.
Load a test file
In PS, a user may opt to load a test dataset from a predefined list and preview it with predefined options loaded from a database. All dataset files are stored in the 'test data' folder, where testdatadb, an SQLite3 database containing all the necessary information, is also located.
testdatadb structure
Along with testdatadb we provide some tools to edit it, located under the edit_testdatadb folder. These are PHP scripts executable from the command line; the one we will use at the moment is show_all_values.php. If you execute it on the command line you will see all values of all tables in testdatadb. It is not necessary to fully understand the structure of testdatadb, since you can easily edit it with import_dataset.php and discard_dataset.php, which are described below. Nevertheless, let's examine the tables one by one.
- dataset (columns: id, desc): the main table in the database; it stores all the available datasets with a description (desc) and a primary key under id
- files (columns: id, file): a table with all dataset file names (file) and a primary key for them (id). The files must be inside the 'test data' folder with the exact file name described in this table
- dataset_files (columns: id, dataset_id, file_id): a table matching the keys of the two tables above. id is another primary key for each match, dataset_id is the id of the dataset from the dataset table and file_id is the key from the files table. Notice that one file might be matched to more than one dataset
- param (columns: id, selector): this table matches a primary key to each jQuery selector in order to easily assign values to DOM objects
- param_value (columns: id, param_id, value, dataset_id): one of the most cryptic tables in PS, param_value matches PS parameters to their predefined values for a given dataset. dataset_id is the primary key in the dataset table matching a record to a specific dataset, and id is a primary key. param_id is the key in the param table that matches one jQuery selector (e.g. 1 -> input[name='exppddata']). The value column stores the value to assign to the parameter. All values are sent as strings; however, true/false data such as the checked state of check boxes are encoded as 0/1. The value of the selected DOM object is then set to value, except that if the object is a check box its checked state is set to true (1) or false (0); a minimal sketch of this assignment appears below
- processed_files (columns: id, name): a table matching a raw file to a primary key
- experimental_structure (columns: id, processed_file_id, brep, trep, frac, dataset_id, used, condition): a table matching a raw file with an experimental structure coordinate, that is, the brep (biological replicate), the trep (technical replicate), the frac (fraction), the condition and whether or not it is used in the analysis. dataset_id is the key of the dataset the raw file belongs to and processed_file_id is the key of the raw file in the processed_files table. If a condition is not applicable because the dataset is not LFQ, set the value to '-'
- extra_options (columns: id, dataset_id, option, opt_value): extra configuration options that may be applicable to the dataset. This is mainly done in PS by setting some arrays in JS. The id column is a primary key and dataset_id the key of the dataset the extra option applies to. option is a description of the array that will be set and opt_value the value as a one-line encoded array. The following comment from our JavaScript describes this function better:
// Typically opt_value is a one-line encoded array that is parsed to a global array of PS
//
// | Possible options | Description                    |
// |------------------|--------------------------------|
// | Rename           | Renames (merges) conditions    |
// |                  | by editing the RenameArray     |
// | LS_Array         | Sets Label Swap by editing the |
// |                  | LS_Array                       |
// | LS_c_p_Add       | When a test dataset has        |
// |                  | Label Swapping, LS_c_p_Add     |
// |                  | defines the structure of       |
// |                  | LS_counters_per_Add            |
// | Select_Labels    | Selects a subset of the        |
// |                  | available conditions           |
//
// opt_value: the argument of the corresponding option, usually denoting an array; opt_value is transformed to
// an array by breaking the first dimension every time a | is found, the second dimension every time a || is
// found etc.
//
// e.g. the value 0|Lung_SCC||1|Lung_SCC||2|Lung_SCC||3|Lung_ADC||4|Lung_ADC||5|Lung_ADC will become this array:
//
// +---+----------+
// | 0 | Lung_SCC |
// +---+----------+
// | 1 | Lung_SCC |
// +---+----------+
// | 2 | Lung_SCC |
// +---+----------+
// | 3 | Lung_ADC |
// +---+----------+
// | 4 | Lung_ADC |
// +---+----------+
// | 5 | Lung_ADC |
// +---+----------+
//
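Consistent with the example above, the decoding itself can be sketched like this (illustrative only; decodeOptValue is a hypothetical name, not the actual PS function):

// Decode an opt_value string: '||' separates rows, '|' separates columns
function decodeOptValue(optValue) {
    return optValue.split('||').map(function (row) {
        return row.split('|');
    });
}

// decodeOptValue('0|Lung_SCC||1|Lung_SCC||2|Lung_SCC||3|Lung_ADC||4|Lung_ADC||5|Lung_ADC')
// yields [['0','Lung_SCC'], ['1','Lung_SCC'], ..., ['5','Lung_ADC']]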
Notice that test datasets do not currently support replication multiplexing.
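The param_value assignment referenced in the table list above could look roughly like this (a minimal sketch only; applyParamValue and the way rows are fetched from testdatadb are assumptions, not PS source):

// Apply one param_value row: 'selector' comes from the param table,
// 'value' from param_value (strings; check boxes encoded as 0/1)
function applyParamValue(selector, value) {
    var obj = $(selector);
    if (obj.is(':checkbox')) {
        obj.prop('checked', value === '1'); // 0/1 encodes the checked state
    } else {
        obj.val(value);                     // plain value assignment
    }
}

// e.g. applyParamValue("input[name='exppddata']", '1');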
testdatadb tools
To edit the database you can simply open it with SQLite3; however, we have built some tools to make your life easier:
show_all_values.php
Execute show_all_values.php to display all values of all tables in tabular format
extend_table.php
Gets data from a tabular file and imports them into a table. Running this on the command line will prompt for two arguments: 'table to extend', where you should type a valid table name, and 'file to get rows from', where you should type a relative or absolute path pointing to the tabular file to import rows from.
execute_query.php
Simply type an SQL query and execute it. Bear in mind that execute_query.php does not return data from SELECT queries, so use it only to insert or delete data from the db.
import_dataset.php
Perhaps the most useful db editing tool, import_dataset.php allows you to add a test dataset to the database automatically, given a description, the associated file names and a parameters file from a corresponding PS run. To set up the tool, first run a normal PS session with your dataset and download the zipped results folder. Inside it you will find the parameters file; copy it to the edit_testdatadb folder. Then copy the dataset files (the PSM file for PD datasets, and the evidence and proteinGroups files for MQ datasets) inside the 'test data' folder and run import_dataset.php on the command line. You will be asked to give a name for the dataset; this will be the name displayed in the test dataset list in Step 1. After that, you should type the name of the parameters file you just copied. Finally, type the names of the dataset files one by one. When you are finished, simply press Enter and the dataset will be imported automatically. The next time you launch PS you should see the dataset in the test dataset list in Step 1, and it should load automatically with the options you chose in the PS run above.
discard_dataset.php
In case you want to discard a test dataset from the list, execute discard_dataset.php. This script will show you all datasets currently in the list and ask you to choose the index of the one you want to erase. It will then discard the dataset and prepare the db environment for a future test dataset import.