This chapter introduces some general concepts around the i2MassChroQ program, gives the reference to use when citing the software in publications, and describes the building and installation procedures.
i2MassChroQ is the successor of the X!TandemPipeline-Java project, which has undergone the following changes over the years:
Full rewrite of the X!TandemPipeline-Java program from Java to C++17. The Java-based software program was published in: Olivier Langella, Benoît Valot, Thierry Balliau, Mélisande Blein-Nicolas, Ludovic Bonhomme, and Michel Zivy (2017) X!TandemPipeline: A Tool to Manage Sequence Redundancy for Protein Inference and Phosphosite Identification. J. Proteome Res., 16, 2, 494–503. https://doi.org/10.1021/acs.jproteome.6b00632.
Before the integrations described below, the product of the rewrite was temporarily called X!TandemPipeline++ (or xtpcpp). That name might still appear in some places while the code and documentation are being revised to use the i2MassChroQ name.
Integration of the MassChroQ software project, originally developed as a standalone C++ program, into the new software. MassChroQ was developed to perform quantitative proteomics in a variety of modes (label-free or with labelling).
Ongoing (not yet finalized) integration of the MCQ~R project, which was developed as a standalone project. MCQ~R is a GNU R project aimed at performing bio-statistical analyses on the quantification results produced by MassChroQ.
The i2MassChroQ project encompasses three main quantitative proteomics fields of endeavour:
Database search, peptide identification and protein inference. The database search is actually performed by X!Tandem and is started seamlessly by i2MassChroQ. Protein grouping is performed by original code in i2MassChroQ.
Quantitative proteomics, mainly based on area-under-the-curve processes (requires the full mass data set to extract ion current chromatograms, XIC). This part was historically performed by the MassChroQ software program.
Bio-statistical analysis of the quantification data. This part was historically performed by the MCQ~R GNU R-based package (unpublished software as of yet).
The i2MassChroQ software project aims at providing users with an integrated software solution for quantitative proteomics. As described in detail in another chapter of this book, quantitative proteomics involves a number of steps that are listed in sequence below:
Search databases to connect MS/MS spectra to peptide sequences. This step is called identification;
Apply logic to reliably identify proteins based on the peptides identified at the previous step. This step is called inference;
Optionally perform quantification of the identified peptides and inferred proteins. i2MassChroQ has area-under-the-curve quantitative proteomics capabilities that are based on precursor peptide ion current extraction from the mass spectrometric data. The extracted ion currents are then plotted like chromatograms: intensity as a function of retention time. This analytical process thus involves “Mass Chromatograms” for the Quantification.
From the sequence above, the i2MassChroQ name becomes self-explanatory!
It is however possible (and encouraged) to mentally read i2MassChroQ as “I too MassChroQ!”
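The area-under-the-curve idea mentioned in the quantification step above can be illustrated with a minimal sketch. It is not the MassChroQ implementation: the function names, the ppm tolerance, and the toy spectra are all made up for the illustration, and the integration is a plain trapezoidal rule with no peak detection or smoothing.

```python
# Illustrative sketch only: names, tolerance and data are hypothetical,
# not taken from the MassChroQ code base.

def extract_xic(spectra, target_mz, ppm_tolerance=10.0):
    """Build an extracted ion chromatogram (XIC).

    spectra: list of (retention_time, [(mz, intensity), ...]) MS spectra.
    Returns a list of (retention_time, summed intensity in the m/z window).
    """
    half_width = target_mz * ppm_tolerance / 1e6
    xic = []
    for rt, peaks in spectra:
        intensity = sum(i for mz, i in peaks if abs(mz - target_mz) <= half_width)
        xic.append((rt, intensity))
    return xic


def area_under_curve(xic):
    """Integrate intensity over retention time with the trapezoidal rule."""
    area = 0.0
    for (rt0, i0), (rt1, i1) in zip(xic, xic[1:]):
        area += (rt1 - rt0) * (i0 + i1) / 2.0
    return area


# Toy MS spectra (made-up values): a precursor ion near m/z 500.25
# eluting between 61 s and 63 s, plus an unrelated ion at m/z 612.30.
toy_spectra = [
    (60.0, [(500.2501, 0.0), (612.30, 50.0)]),
    (61.0, [(500.2499, 100.0), (612.30, 40.0)]),
    (62.0, [(500.2500, 200.0)]),
    (63.0, [(500.2502, 100.0)]),
    (64.0, [(612.30, 30.0)]),
]

xic = extract_xic(toy_spectra, target_mz=500.25)
area = area_under_curve(xic)
print(xic)
print(area)
```

The XIC is simply intensity as a function of retention time for one m/z window, and the quantitative value is the area under that trace.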
The previous X!TandemPipeline version of this software stored configuration data in the local configuration directory, in the PAPPSO/xtpcpp.conf file. In order to preserve these configuration data after having transitioned from X!TandemPipeline to i2MassChroQ, please rename that configuration file to PAPPSO/i2masschroq.conf.
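The renaming can of course be done by hand in a file manager or a shell. For users who prefer to script it, here is a minimal sketch. The helper name migrate_config is hypothetical, and the location of the local configuration directory is left as a parameter because it depends on the platform (on GNU/Linux it is typically ~/.config, which is an assumption, not something stated above). The demonstration below runs against a throwaway temporary directory, not against a real configuration.

```python
import tempfile
from pathlib import Path


def migrate_config(config_root: Path) -> bool:
    """Rename PAPPSO/xtpcpp.conf to PAPPSO/i2masschroq.conf under config_root.

    Returns True if the file was renamed. Returns False if the old file is
    absent or if the new file already exists (it is never overwritten).
    """
    old = config_root / "PAPPSO" / "xtpcpp.conf"
    new = config_root / "PAPPSO" / "i2masschroq.conf"
    if not old.is_file() or new.exists():
        return False
    old.rename(new)
    return True


# Demonstration in a temporary directory standing in for the local
# configuration directory (the file content is made up).
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "PAPPSO").mkdir()
    (demo_root / "PAPPSO" / "xtpcpp.conf").write_text("demo=1\n")

    first_run = migrate_config(demo_root)    # renames the file
    second_run = migrate_config(demo_root)   # nothing left to rename
    migrated_text = (demo_root / "PAPPSO" / "i2masschroq.conf").read_text()
    old_still_there = (demo_root / "PAPPSO" / "xtpcpp.conf").exists()

# Real-world call, ASSUMING ~/.config is the local configuration directory:
# migrate_config(Path.home() / ".config")
```

The helper refuses to overwrite an existing i2masschroq.conf, so running it twice is harmless.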
This section describes the general concepts at the basis of the analysis of proteomics data that one needs to grok in order to properly assimilate the workings of the i2MassChroQ software.
Proteomics is a mass spectrometry-based field of endeavour that is aimed at characterizing the “protein complement” of a given genome. The protein complement of a genome is the set of proteins that are expressed at a given instant in the life of a cell, a tissue or an organ, for example. Characterizing that protein complement actually means identifying the proteins expressed by a given living cell or tissue or organ. Optionally, if feasible, the characterization of post-translational modifications might be desirable.
There are two main variants of proteomics: “bottom-up” proteomics and “top-down” proteomics:
The first variant, bottom-up proteomics, identifies proteins on the basis of the identification of all the peptides obtained by first digesting all the proteins of the sample using an enzyme of known specificity. In this variant, the sample that is injected into the mass spectrometer is the resulting peptide mixture (first resolved by high performance liquid chromatography). The identification of the proteins contained in the initial sample is performed in a number of steps that are actually the focus of i2MassChroQ. Indeed, the i2MassChroQ software is a bottom-up-oriented software program.
The second variant, top-down proteomics, identifies proteins on the basis of intact proteins directly injected into the mass spectrometer. Of course, it might be necessary to fragment the proteins in the mass spectrometer and to use the fragments to actually identify the protein. However, the fact that the protein is first detected and analyzed as one entity (and not as a set of peptides) allows for some very useful discoveries, like the identity and number of post-translational modifications, for example.
At the moment, i2MassChroQ does not handle top-down proteomics data: it is a bottom-up proteomics software project.
Once the initial sample, containing all the proteins to identify, has been digested using a protease of known cleavage specificity (trypsin, typically), the peptidic mixture (that might be highly complex) needs to be resolved as much as possible using chromatography. In the vast majority of proteomics experimental settings, the chromatography setup is connected to the mass spectrometer so that, when the gradient is developed, all the peptides are immediately injected “on line” into the mass spectrometer ion source.
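To make the notion of “known cleavage specificity” concrete, here is a minimal in-silico digestion sketch. It assumes the rule commonly applied for trypsin in database search software (cleavage after K or R, except when the next residue is P). That rule, the function name, and the toy sequence are not taken from the text above, and the code is not how i2MassChroQ or X!Tandem perform digestion.

```python
def tryptic_digest(sequence, missed_cleavages=0):
    """Return the peptides produced by an idealized tryptic digestion.

    ASSUMED rule: cleave on the C-terminal side of K or R unless the
    following residue is P. Up to `missed_cleavages` consecutive sites
    may be left uncleaved.
    """
    if not sequence:
        return []

    # Positions where the chain is cut (as slice boundaries).
    sites = [0]
    for i in range(len(sequence) - 1):
        if sequence[i] in "KR" and sequence[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(sequence))

    peptides = []
    for start in range(len(sites) - 1):
        last_end = min(start + 1 + missed_cleavages, len(sites) - 1)
        for end in range(start + 1, last_end + 1):
            peptides.append(sequence[sites[start]:sites[end]])
    return peptides


# Toy sequence (made up). The R followed by P is not a cleavage site.
toy_protein = "MKWVTFRPLLKAAR"
fully_cleaved = tryptic_digest(toy_protein)
one_missed = tryptic_digest(toy_protein, missed_cleavages=1)
print(fully_cleaved)
print(one_missed)
```

Because the cleavage rule is known, the list of expected peptides can be computed from any protein sequence, which is what makes database searching possible in the first place.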
The mass spectrometer runs an analysis cycle that can be summarized like the following:
Acquire a full scan mass spectrum of the whole set of ions at a given chromatography retention time. This kind of mass spectrum is called an MS spectrum;
Enter a loop during which ions having the most intense signal are subjected in turn to collision-induced dissociation (CID), that is, are fragmented by accelerating them against gas molecules in a fragmentation cell. The mass spectra that are collected at each one of these fragmentation acquisitions are called MS/MS spectra because they are obtained after two mass analysis events: the first event is the measurement of the intact peptide ion's m/z value (full scan mass spectrum) and the second event is the measurement of all the obtained fragments' m/z values (MS/MS scan).
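The selection step of this cycle can be sketched as follows. This is a deliberately simplified illustration: the function name, the top_n and min_intensity parameters, and the peak values are hypothetical, and real acquisition methods typically apply further criteria that are ignored here.

```python
def select_precursors(ms_peaks, top_n=3, min_intensity=0.0):
    """Pick the m/z values of the most intense ions of one MS spectrum.

    ms_peaks: list of (mz, intensity) pairs from a full scan MS spectrum.
    Returns at most `top_n` m/z values, most intense first; each of them
    would then be fragmented in turn to produce one MS/MS spectrum.
    """
    candidates = [peak for peak in ms_peaks if peak[1] >= min_intensity]
    candidates.sort(key=lambda peak: peak[1], reverse=True)
    return [mz for mz, _ in candidates[:top_n]]


# Toy full scan MS spectrum (made-up values).
toy_ms_spectrum = [(400.2, 10.0), (500.3, 90.0), (600.4, 50.0), (700.5, 70.0)]

selected = select_precursors(toy_ms_spectrum, top_n=2)
print(selected)
```

With top_n set to 2, one MS spectrum is followed by two MS/MS spectra before the cycle starts again with a new full scan.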
Each instrument records all the MS and MS/MS spectra in a raw data format file that is specific to the vendor. Free Software developers cannot know the internal structure of these files. To use the mass spectrometric data, they need to rely on a dedicated program that performs the conversion from the raw data format to an open data format (mzML). That program is called msconvert, from the ProteoWizard project.
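Once data are in the open mzML format, any program can inspect them. As an illustration, the sketch below counts MS and MS/MS spectra in a heavily stripped-down mzML-like fragment. The fragment is made up and is not a complete, valid mzML file. It only keeps the spectrum elements and the “ms level” controlled vocabulary term (accession MS:1000511). The function name is hypothetical and this is not the reader used by i2MassChroQ.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# XML namespace used by mzML documents.
MZML_NS = "{http://psi.hupo.org/ms/mzml}"

# Made-up, minimal fragment: one MS spectrum followed by two MS/MS spectra.
TOY_MZML = """
<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <run id="toy_run">
    <spectrumList count="3">
      <spectrum index="0" id="scan=1" defaultArrayLength="0">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="1"/>
      </spectrum>
      <spectrum index="1" id="scan=2" defaultArrayLength="0">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>
      </spectrum>
      <spectrum index="2" id="scan=3" defaultArrayLength="0">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>
      </spectrum>
    </spectrumList>
  </run>
</mzML>
"""


def count_ms_levels(mzml_text):
    """Count spectra per MS level (1 = MS, 2 = MS/MS) in an mzML string."""
    root = ET.fromstring(mzml_text.strip())
    counts = Counter()
    for spectrum in root.iter(MZML_NS + "spectrum"):
        for param in spectrum.findall(MZML_NS + "cvParam"):
            if param.get("accession") == "MS:1000511":
                counts[int(param.get("value"))] += 1
    return counts


levels = count_ms_levels(TOY_MZML)
print(levels)
```

A file in which either level is missing would not provide everything that identification (MS/MS spectra) and XIC-based quantification (MS spectra) require.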
Mass spectrometrists used to call ions that were analyzed in full scan mass spectra “parent ions”. They also used to call fragment ions arising upon fragmentation of a parent ion “daughter ions”. This terminology has been deprecated and has been replaced with “precursor ion” and “product ion”, respectively. In our document, we thus use the new terminology.
i2MassChroQ loads mzXML- and mzML-formatted files and needs access to all the MS and MS/MS spectra for its operations. Once data files have been loaded, i2MassChroQ allows the user to perform the following tasks, which will be detailed in later chapters:
Configure the X!Tandem database searching software (that is, the software, external to i2MassChroQ that actually performs the peptide-mass spectrum matches);
Run the X!Tandem software and load its results;
Display the results to the user in a way that they can be scrutinized and checked. The peptide identification results serve as the basis for another processing step that is integrally performed by i2MassChroQ: the “protein inference”. That step aims at using the peptide identifications to actually craft a list of protein identities. The user is provided with various means to control that step.
Optionally start the MassChroQ module to perform the quantitative proteomics on the identification data checked at the previous step.
Optionally start the MCQ~R module to perform the bio-statistical analysis of the quantitative proteomics data obtained at the previous step.
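The protein inference step listed above can be illustrated with a toy example. The sketch below is a simplified, parsimony-style grouping written for this illustration only. It is not the grouping algorithm implemented in i2MassChroQ, and the function name and the protein and peptide labels are made up. Proteins supported by exactly the same peptides are merged into one group, and proteins whose peptides are all explained by a larger group are dropped.

```python
from collections import defaultdict


def group_proteins(protein_to_peptides):
    """Group proteins by their set of identified peptides.

    protein_to_peptides: dict mapping a protein name to its peptides.
    Returns a sorted list of (protein names, peptides) tuples.
    """
    # Proteins sharing exactly the same peptide set cannot be told apart.
    by_peptide_set = defaultdict(list)
    for protein, peptides in protein_to_peptides.items():
        by_peptide_set[frozenset(peptides)].append(protein)

    groups = []
    for peptide_set, proteins in by_peptide_set.items():
        # Skip a group when another group explains all of its peptides
        # and more (strict subset test).
        if any(peptide_set < other for other in by_peptide_set):
            continue
        groups.append((sorted(proteins), sorted(peptide_set)))
    return sorted(groups)


# Toy identification results (made-up labels).
toy_identifications = {
    "P1": ["pepA", "pepB", "pepC"],
    "P2": ["pepA", "pepB", "pepC"],  # indistinguishable from P1
    "P3": ["pepA"],                  # fully explained by P1/P2
    "P4": ["pepD"],
}

protein_groups = group_proteins(toy_identifications)
print(protein_groups)
```

Even in this toy case, four candidate proteins reduce to two reportable groups, which shows why the peptide identifications need to be checked before inference and quantification.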
The i2MassChroQ software is not yet published. A great part of the features found in i2MassChroQ were initially available in the X!TandemPipeline-Java software, which can be cited using the following reference: Olivier Langella, Benoît Valot, Thierry Balliau, Mélisande Blein-Nicolas, Ludovic Bonhomme, and Michel Zivy (2017) X!TandemPipeline: A Tool to Manage Sequence Redundancy for Protein Inference and Phosphosite Identification. J. Proteome Res., 16, 2, 494–503. DOI: https://doi.org/10.1021/acs.jproteome.6b00632.
The installation material is available at http://pappso.inrae.fr/en/bioinfo/xtandempipeline/download/.
The installation of the software is extremely easy on the MS-Windows and macOS platforms. In both cases, the installation programs are standard and require no explanation.
The installation on Debian- and Ubuntu-based GNU/Linux platforms is also extremely easy (even more so than in the above situations). i2MassChroQ is indeed packaged and released in the official repositories of these distributions, and the only command to run to install it is:
$ sudo apt install <package_name> RETURN
In the command above, the typical package_name is in the form i2masschroq for the program package and i2masschroq-doc for the user manual package.
Once the package has been installed the program shows up in the Science menu. It can also be launched from the shell using the following command:
$ i2masschroq RETURN