MetaX Cookbook

This guidebook is for the MetaX GUI version. If you are using the CLI, we recommend reading the documentation for instructions on how to use each MetaX module from the command line.

Overview

MetaX is a novel tool for linking peptide sequences with taxonomic and functional information in metaproteomics. It introduces the Operational Taxon-Function (OTF) concept to explore microbial roles and interactions ("who is doing what and how") within ecosystems.

MetaX also features statistical modules and plotting tools for analyzing peptides, taxa, functions, proteins, and taxon-function contributions across groups.

Project Page

Visit GitHub to get more information:

https://github.com/byemaxx/MetaX

Getting Started

The main window of MetaX

Click 'Tools Menu' to switch between modules.

Exploring Data with MetaX

See the Preparing Your Data section to build the database and annotate peptides to OTFs before starting.

Module 1. OTF Analyzer

After obtaining the Operational Taxa-Functions (OTF) Table using the Peptide Annotator, you can perform downstream analysis with the OTF Analyzer.

1. Data Preparation

OTFs (Operational Taxa-Functions) Table: Obtained from the Peptide Annotator module.

Meta Table: The first column contains the sample names, and each remaining column defines one grouping of the samples. If no meta table is provided, two groupings are generated automatically: (1) all samples in the same group, and (2) each sample as its own group.

Example Meta Table:

samples	Individuals	Treatment	Sweetener
sample_1	V1	Treatment	XYL
sample_2	V1	Treatment	XYL
sample_3	V1	Treatment	XYL
sample_4	V1	Control	PBS
sample_5	V1	Control	PBS
sample_6	V1	Control	PBS
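The automatic fallback described above can be pictured with a minimal sketch. The column names `all_in_one` and `each_sample` are hypothetical and chosen only for illustration; MetaX may label the generated columns differently.

```python
def default_meta(samples):
    """Build fallback meta rows when no meta table is supplied.

    Two groupings are produced: one group shared by every sample,
    and one group per sample. Column names are illustrative assumptions.
    """
    return [
        {"samples": s, "all_in_one": "all", "each_sample": s}
        for s in samples
    ]


meta = default_meta(["sample_1", "sample_2", "sample_3"])
for row in meta:
    print(row)
```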

You can load example data by clicking the button.

Then, click Go to start the analysis.

Advanced Settings
Peptide Column Name: Specifies the column in the OTF table that contains peptide information.
Protein Column Name: Specifies the column in the OTF table that contains protein information (only required if protein summation is performed in downstream analysis).
Sample Column Prefix: Identifies the prefix of sample columns to determine intensity columns in the OTF table.
Any Data Mode: Allows analysis of any table using MetaX, not limited to OTF tables (only partial tool functionality is available).
- Customized Table Item Column Name: Specifies the column containing item names in any data mode. If left empty, the first column will be selected by default.
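As a rough illustration of how a sample column prefix can separate intensity columns from the rest of the table, here is a small sketch. The prefix `Intensity` and the header are taken from the example peptide table in this guide; the helper name `intensity_columns` is hypothetical and is not part of MetaX.

```python
def intensity_columns(header, prefix="Intensity"):
    """Return the columns whose name starts with the sample prefix."""
    return [col for col in header if col.startswith(prefix)]


header = ["Sequence", "Proteins", "Intensity_V1_01", "Intensity_V1_02", "Intensity_V1_04"]
sample_cols = intensity_columns(header)
print(sample_cols)
```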

2. Data Overview

The Data Overview provides basic information about your data, such as the number of taxa, functions, and proportions.

Set the threshold for linked peptides and the differences between them to plot figures.

Select different functions to plot the proportion distribution.

Filter out samples for downstream analysis.

3. Set TaxaFunc

Data Selection

Function: Select a function for downstream analysis (None in the list means no function is selected, focusing only on peptides and taxa).
Function Filter Threshold: If a specific function within a protein group of a peptide has the highest proportion, it will be considered the representative function for that peptide. The default threshold is 1.00 (100%).

Taxa Level: Select a taxa level for downstream analysis (Life in the list means no filtering by taxa, and the following analysis focuses on functions).
Peptide Number Threshold: Only keep taxa, functions, or OTFs that have at least the specified number of peptides.
Split Function: Split the annotations with multi-functions.
KO	Intensity
ko:K00625,ko:K13788	10

to

KO	Intensity
ko:K00625	10
ko:K13788	10

If Share Intensity is checked, the intensity above will be split equally, giving 5 to each KO.
Remove unknown taxa: Checked by default. When enabled, peptides that are not annotated to the selected taxonomic level will be removed. When unchecked, such peptides will be retained and labeled as unknown, for example:

d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Oscillospirales;f__Ruminococcaceae;g__UMGS363;s_

to

d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Oscillospirales;f__Ruminococcaceae;g__UMGS363;s_unknown
Create Taxa and Func only from OTFs:
Without selection (checkbox not checked):
- Taxa table: Peptides are filtered based solely on taxa levels, without considering any functional categories.
- Function table: Peptides are filtered solely by functional categories and thresholds, regardless of their taxa levels.
- Taxa-Function (OTFs) table: Peptides are filtered by both taxa levels and functional categories simultaneously.
With selection (checkbox checked):
- Taxa table: Peptides are filtered by both taxa levels and functional categories simultaneously.
- Function table: Peptides are filtered by both taxa levels and functional categories simultaneously.
- Taxa-Function (OTFs) table: Peptides are filtered by both taxa levels and functional categories simultaneously.
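The Split Function and Share Intensity options can be sketched as follows. This is a simplified illustration on (annotation, intensity) pairs, assuming a comma separator as in the KO example; it is not MetaX's own implementation, and `split_functions` is a hypothetical name.

```python
def split_functions(rows, share_intensity=False, sep=","):
    """Split multi-function annotations into one row per function.

    With share_intensity=False every function keeps the full intensity;
    with share_intensity=True the intensity is divided equally.
    """
    out = []
    for annotation, intensity in rows:
        parts = [p.strip() for p in annotation.split(sep) if p.strip()]
        if not parts:
            # Nothing to split: keep the row unchanged.
            out.append((annotation, intensity))
            continue
        value = intensity / len(parts) if share_intensity else intensity
        out.extend((p, value) for p in parts)
    return out


rows = [("ko:K00625,ko:K13788", 10)]
print(split_functions(rows))                        # each KO keeps 10
print(split_functions(rows, share_intensity=True))  # each KO gets 5.0
```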


Sum Proteins Intensity

Click Generate Protein Intensity Table to sum peptides to proteins if the Protein column is in the original table.

Occam's Razor, Anti-Razor and Rank: Methods available for assigning shared peptides to proteins.
Razor:
1. Build a minimal set of proteins to cover all peptides.
2. For each peptide, choose the protein with the most peptides (if multiple proteins have the same number of peptides, the intensity is shared among them).
Anti-Razor:
- All proteins share the intensity of each peptide.
Rank:
1. Build the rank of proteins.
2. Choose the protein with a higher rank for the shared peptide.
Methods to Build Protein Rank:
- unique_counts: Use the counts of proteins inferred by unique peptides.
- all_count: Use the counts of all proteins.
- unique_intensity: Use the intensity of proteins inferred by unique peptides.
- shared_intensity: Use the intensity divided by the number of shared peptides for each protein.
Minimum peptide number per protein: Filters out proteins that contain fewer peptides than the specified threshold.
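To make the difference between Razor and Anti-Razor concrete, here is a minimal sketch on toy data. It assumes that the "minimal set" is approximated with a greedy set cover, that shared intensity is split equally, and that every peptide lists at least one protein; MetaX's actual implementation may differ in all three points. The peptide and protein names are invented.

```python
from collections import defaultdict


def anti_razor(peptides):
    """Every protein of a peptide receives an equal share of its intensity."""
    totals = defaultdict(float)
    for proteins, intensity in peptides.values():
        for prot in proteins:
            totals[prot] += intensity / len(proteins)
    return dict(totals)


def razor(peptides):
    """Greedy minimal protein set, then assign each peptide to the best protein."""
    prot_to_peps = defaultdict(set)
    for pep, (proteins, _) in peptides.items():
        for prot in proteins:
            prot_to_peps[prot].add(pep)

    # Step 1: greedy set cover as an approximation of the minimal protein set.
    uncovered, kept = set(peptides), []
    while uncovered:
        best = max(sorted(prot_to_peps), key=lambda p: len(prot_to_peps[p] & uncovered))
        kept.append(best)
        uncovered -= prot_to_peps[best]

    # Step 2: each peptide goes to the kept protein with the most peptides;
    # ties share the intensity equally.
    totals = defaultdict(float)
    for pep, (proteins, intensity) in peptides.items():
        candidates = [p for p in proteins if p in kept]
        top = max(len(prot_to_peps[p]) for p in candidates)
        winners = [p for p in candidates if len(prot_to_peps[p]) == top]
        for prot in winners:
            totals[prot] += intensity / len(winners)
    return dict(totals)


# Toy input: peptide -> (protein group, intensity)
peptides = {
    "PEP1": (["A", "B"], 10.0),
    "PEP2": (["A"], 4.0),
    "PEP3": (["B", "C"], 6.0),
    "PEP4": (["C"], 2.0),
}
print(razor(peptides))
print(anti_razor(peptides))
```

In this toy case Razor drops protein B entirely because A and C already explain every peptide, while Anti-Razor spreads intensity across all three proteins. The total intensity is the same in both results.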

Data preprocessing

Quantitative Method:
Sum: Sum peptide intensities directly to obtain the intensity of each Taxon, Function, or OTF.
DirectLFQ: Use DirectLFQ to normalize peptides and then estimate intensity using intensity traces.
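The Sum method amounts to a group-by over peptides. The sketch below shows the idea on invented toy rows; the field names (`taxon`, `function`, `intensity`) and the helper `sum_intensity` are illustrative assumptions, not MetaX identifiers. DirectLFQ is not covered here.

```python
def sum_intensity(peptide_rows, key_fields):
    """Sum per-sample peptide intensities for each distinct key."""
    totals = {}
    for row in peptide_rows:
        key = tuple(row[field] for field in key_fields)
        current = totals.setdefault(key, [0.0] * len(row["intensity"]))
        for i, value in enumerate(row["intensity"]):
            current[i] += value
    return totals


# Toy peptide rows with intensities for two samples.
rows = [
    {"taxon": "g__Blautia_A", "function": "ko:K00625", "intensity": [10.0, 0.0]},
    {"taxon": "g__Blautia_A", "function": "ko:K13788", "intensity": [5.0, 2.0]},
    {"taxon": "g__Alistipes", "function": "ko:K00625", "intensity": [1.0, 1.0]},
]
taxa_table = sum_intensity(rows, ["taxon"])
func_table = sum_intensity(rows, ["function"])
otf_table = sum_intensity(rows, ["taxon", "function"])
print(taxa_table)
print(func_table)
print(otf_table)
```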
Outlier handling:

There are several methods for detecting and handling outliers.

Two steps will be applied:
Outlier Detection: Users can select a method to mark outlier values as NaN. Rows that contain only NaN and 0 values are then removed. The remaining NaN values are handled in the next step.
Outlier Handling: Users can choose a method to fill the remaining NaN values.
Outlier Detection:
IQR: In a group, if the value is greater than Q3+1.5*IQR or less than Q1-1.5*IQR, the value will be marked as NaN.
Missing-Value: Only values that are already missing (NaN) in the data are marked as NaN; no other values are flagged.
Half-Zero:

Applies to grouped data.
- If more than half of the values in a group are zero, all non-zero values are replaced with NaN.
- If fewer than half of the values are zero, all zero values are replaced with NaN.
- If the number of zero and non-zero values is equal, all values in the group are replaced with NaN.
Zero-Dominant:

Applies to grouped data.
- If more than half of the values in a group are zero, all non-zero values are replaced with NaN.
- Otherwise, the group remains unchanged.
Zero-Inflated Poisson: Based on the Zero-Inflated Poisson (ZIP) model, which is used when the data contain more zeros than a standard Poisson model would expect. The ZIP model is fitted to the data and then used to predict the data values. If the predicted value is less than 0.01, the data point is marked as an outlier (NaN).
Negative Binomial: Based on the Negative Binomial model, which is used when the variance of the data is greater than the mean. As with the ZIP method, the model is fitted to the data and then used to predict the data values. If the predicted value is less than 0.01, the data point is marked as an outlier (NaN).
Z-Score: Z-score is a statistical measure that tells how far a data point is from the mean in terms of standard deviations. Outliers are often identified as points with Z-scores greater than 2.5 or less than -2.5.
Mahalanobis Distance: Mahalanobis distance measures the distance between a point and a distribution, considering the correlation among variables. Outliers can be identified as points with a Mahalanobis distance that exceeds a certain threshold.

In all methods, you can choose one meta column for outlier detection and another meta column for handling outliers.
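The IQR, Half-Zero and Zero-Dominant rules are simple enough to sketch for the values of a single item within one group. The quartile convention used below (`method="inclusive"`) is an assumption, and the function names are hypothetical; MetaX may compute quartiles differently.

```python
import math
import statistics

NAN = float("nan")


def iqr_mark(values):
    """Mark values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as NaN."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v if low <= v <= high else NAN for v in values]


def half_zero(values):
    """Keep the majority state (zero or non-zero); NaN everything on a tie."""
    zeros = sum(1 for v in values if v == 0)
    nonzeros = len(values) - zeros
    if zeros > nonzeros:
        return [v if v == 0 else NAN for v in values]
    if zeros < nonzeros:
        return [NAN if v == 0 else v for v in values]
    return [NAN] * len(values)


def zero_dominant(values):
    """Only act when more than half of the values are zero."""
    zeros = sum(1 for v in values if v == 0)
    if zeros > len(values) / 2:
        return [v if v == 0 else NAN for v in values]
    return list(values)


print(iqr_mark([1, 2, 3, 4, 100]))
print(half_zero([0, 0, 5]))
print(half_zero([0, 3, 5]))
print(zero_dominant([0, 3, 5]))
```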

Outliers Imputation:
Drop: Remove peptides that contain any NaN values.
Original: Keep the remaining NaN values as-is.
Mean: Outliers will be imputed by the mean.
Median: Outliers will be imputed by the median.
KNN: Outliers will be imputed by KNN (K=5). The K-Nearest Neighbors algorithm uses the mean or median of the nearest neighbours to fill in missing values.
Regression: Outliers will be imputed using IterativeImputer with a regression estimator. This method uses round-robin linear regression, modelling each feature with missing values as a function of other features.
Multiple: Outliers will be imputed using IterativeImputer in multiple-imputation mode, restricted to a specified number (K=5) of the nearest features.

You can choose outlier imputation by each group or by all samples.
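The difference between imputing by group and imputing across all samples is easiest to see with the Mean and Median methods. The following is a minimal sketch for one row of intensities, with invented values; `impute` is a hypothetical helper, and the KNN and IterativeImputer-based methods are not reproduced here.

```python
import math
import statistics


def impute(values, groups, method="mean"):
    """Fill NaN values with the mean or median of the same group."""
    stat = statistics.mean if method == "mean" else statistics.median
    filled = list(values)
    for group in set(groups):
        idx = [i for i, label in enumerate(groups) if label == group]
        observed = [values[i] for i in idx if not math.isnan(values[i])]
        if not observed:
            continue  # no observed value in this group: leave NaN in place
        fill = stat(observed)
        for i in idx:
            if math.isnan(values[i]):
                filled[i] = fill
    return filled


nan = float("nan")
row = [10.0, nan, 14.0, 2.0, 4.0, nan]
groups = ["Treatment"] * 3 + ["Control"] * 3

by_group = impute(row, groups)             # each NaN uses its own group
all_samples = impute(row, ["all"] * 6)     # every NaN uses the overall mean
print(by_group)
print(all_samples)
```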

Remove Batch Effect:
Here, you can choose a meta column that represents the batch and then use reComBat to remove the batch effect.
Data Transformation:
Log2, Log10, square root, cube root, and Box-Cox transformations.
Data Normalization:
Trace Shifting: Reframes the normalization problem using intensity traces (inspired by DirectLFQ).
- Note: If both trace shifting and transformation are applied, normalization will be done before transformation.
Standard Scaling (Z-Score), Min-Max Scaling, Pareto Scaling, Mean centring, and normalization by percentage.

Z-Score, Mean centring, and Pareto Scaling can produce negative values, so after any of these methods the data are shifted by a minimum offset to avoid negative values.
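The scaling methods and the offset step can be sketched for a single vector of values. Two points are assumptions here: the offset is taken to be a shift that moves the minimum to zero, and the scaling is applied to one vector at a time without stating whether that vector is a row or a column in MetaX. The function names are hypothetical.

```python
import math
import statistics


def scale(values, method="zscore"):
    """Scale one vector of values (requires a non-zero standard deviation)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if method == "zscore":
        return [(v - mean) / sd for v in values]
    if method == "pareto":
        # Pareto scaling divides by the square root of the standard deviation.
        return [(v - mean) / math.sqrt(sd) for v in values]
    if method == "mean_centring":
        return [v - mean for v in values]
    raise ValueError(f"unknown method: {method}")


def shift_non_negative(values):
    """Shift the vector so that its minimum is zero (assumed offset rule)."""
    offset = min(values)
    return [v - offset for v in values] if offset < 0 else list(values)


raw = [2, 4, 6]
print(scale(raw))                               # contains negative values
print(shift_non_negative(scale(raw)))           # shifted to be non-negative
print(shift_non_negative(scale(raw, "pareto")))
```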

Drag the item's name to change the order of data preprocessing.

Then, click Go to create a TaxaFunc object for analysis.

Then you can check the tables in the Table Review section and export them.

4. Basic Stats

PCA, Correlation and Box Plot

You can select meta groups or samples (default: all) to plot PCA, Correlation, and Box Plot for Taxa, Function, Taxa-Func, Peptide, and Protein tables.

Setting and modifying the plot
Show or hide labels in the figure by checking Show Labels.
Select Sub Meta to plot with two meta columns.
Change settings in the PLOT PARAMETER tab.
Select specific groups with a condition.

For example: select PBS, BAS, and other groups only in Individual V1.
Select specific samples to analyze.
Number stats
Plot the counts for each table by groups or by samples.
Taxa Specific
Alpha/Beta Diversity
Sunburst
TreeMap
Sankey

Heatmap and Bar Plot

Select items (Taxa, Function, Taxa-Func, and Peptide) to plot:
Add All Taxa, or select only the ones you are interested in.

Add items to Top List: Select the top items to plot using a statistical method.
Clicking filter with threshold filters by the adjusted p-value of ANOVA and T-TEST, and by the adjusted p-value and Log2FC of DESeq2 results (configured on the corresponding page).

Add a list for plotting:
Make sure there is one item per row.

Setting:
Change the settings to fit your data.
Rename Samples: Add group info to each sample name.
Rename Taxa: Keep only the last taxonomic level to shorten the name.
Plot Mean: Calculate the mean of each group before plotting.
Sub Meta: Select a second meta column; the two meta columns are then combined by mean for the Heatmap and 3D bar plot.
View all color maps by right-clicking Theme.
Plot:

Resize the figure to fit the window to get the best-looking picture:
Bar Plot:

Interactive functions:

Change to a line plot:
3D Bar Plot
Plot a 3D bar chart by selecting a sub meta.

Peptide Query

Query all information linked to a peptide.

5. Cross Test

T-TEST

Select two groups for T-test analysis on Taxa, Function, Taxa-Func, Peptide, and Protein tables.

ANOVA-TEST

Select some groups or all groups to run ANOVA on Taxa, Function, Taxa-Func, and Peptide tables.

Significant Taxa-Func

Significant comparison helps identify cases where taxa show no significant differences between two groups, while their related functions are significantly different, and vice versa.
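The idea behind this comparison can be sketched with two sets of adjusted p-values. The sketch assumes that "related functions" refers to the Taxa-Func (OTF) items of a taxon and that a single significance cutoff is used for both tables; the p-values are invented and the function name is hypothetical.

```python
def taxa_unchanged_function_changed(taxa_padj, otf_padj, alpha=0.05):
    """Return (taxon, function) pairs that are significant although the taxon is not."""
    return sorted(
        (taxon, func)
        for (taxon, func), p in otf_padj.items()
        if p < alpha and taxa_padj.get(taxon, 1.0) >= alpha
    )


# Toy adjusted p-values from a two-group test.
taxa_padj = {"g__Blautia_A": 0.60, "g__Alistipes": 0.01}
otf_padj = {
    ("g__Blautia_A", "ko:K00625"): 0.001,
    ("g__Blautia_A", "ko:K13788"): 0.40,
    ("g__Alistipes", "ko:K00625"): 0.002,
}
hits = taxa_unchanged_function_changed(taxa_padj, otf_padj)
print(hits)
```

Swapping the two conditions gives the "vice versa" case, where the taxon differs but the function does not.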

Plot Cross Heatmap

The results of the T-test and ANOVA test will appear in a new window.

Plot Heatmap for results
Choose a table to plot a top differences heatmap or export the top table.

Taxa-Func cross heatmap:
Orange cells indicate that the corresponding function (X-axis) of the corresponding taxon (Y-axis) is significantly different between groups.

Func(Taxa) Heatmap:
The colour shows the intensity of the significant Func(Taxa) between groups.

Significant Taxa-Func Heatmap:
The colored tiles represent taxa that were not significantly different between groups while their related functions were.

Group-Control TEST

Dunnett's Test

Set a group as "Control", then compare all other groups to Control.

Comparing in Each Condition: Select a meta such as individual, then compare groups to control in each individual.
DESeq2 Test

Bingo! You noticed the hidden function of MetaX. Click Help -> About -> Like 3 times to unlock comparing all groups to the control.

Result of Dunnett's Test:
- The T-statistic value is shown in the heatmap.

DESeq2

Select two groups to calculate fold change with PyDESeq2.

Select p-adjust, log2FC to plot

(Ultra-Up(Down): |log2FC| > Max log2FC)
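One way to read these thresholds is as a small classification rule per item. The sketch below is an assumption-heavy illustration: the label "NS", the default cutoffs, and the use of a minimum log2FC alongside the maximum are all hypothetical and may not match MetaX's plotting code.

```python
def classify(log2fc, padj, p_cut=0.05, min_log2fc=1.0, max_log2fc=5.0):
    """Label an item for a volcano plot from its log2FC and adjusted p-value."""
    if padj is None or padj >= p_cut or abs(log2fc) < min_log2fc:
        return "NS"  # not significant under the chosen thresholds
    direction = "Up" if log2fc > 0 else "Down"
    # Items beyond the maximum log2FC are flagged as "Ultra".
    return f"Ultra-{direction}" if abs(log2fc) > max_log2fc else direction


for fc, p in [(6.0, 0.01), (2.0, 0.01), (-2.0, 0.01), (-7.5, 0.001), (3.0, 0.2)]:
    print(fc, p, classify(fc, p))
```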

Volcano:
Sankey:
- The last node level is the functions linked to each Taxon (When plotting Taxa-Func)

Tukey Test

Select a function:
Test the significant groups in this function.
Select a Taxon:
Test the significant groups in this taxon.
Select both function and taxon:
Test the significant groups in this function and this taxon.

Show Linked Taxa Only: Only shows the taxa linked with the current function in the taxa combo box.
Show Linked Func Only: Only shows the functions linked with the current taxon in the function combo box.

Do not forget to click Reset Function Taxa List to restore all items after filtering.
Tukey result plot:
The dots and lines show the difference in the mean value of the Tukey test

6. Expression Analysis

Co-Expression Networks & Heatmap

Select groups or samples to calculate correlations and plot the network.

Select a table, then set the correlation method and threshold.

Add some items to the focus list (Optional)

Network Plot
The red dots are focus items.
The depth of color and the width of edges represent the correlation value.
The size of a dot indicates its number of connections.
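How a correlation threshold turns a table into a network can be sketched as follows. The sketch uses Pearson correlation and keeps an edge when the absolute correlation reaches the threshold; both choices are assumptions, since the correlation method is selectable in MetaX. The item names and values are invented.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equally long vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def build_network(table, threshold=0.8):
    """Return edges above the threshold and the connection count per item."""
    edges = []
    for a, b in combinations(sorted(table), 2):
        r = pearson(table[a], table[b])
        if abs(r) >= threshold:
            edges.append((a, b, r))  # r can drive edge width and color depth
    degree = {name: 0 for name in table}
    for a, b, _ in edges:
        degree[a] += 1
        degree[b] += 1               # degree can drive the dot size
    return edges, degree


table = {
    "taxon_A": [1, 2, 3, 4],
    "taxon_B": [2, 4, 6, 8],
    "taxon_C": [4, 1, 3, 2],
}
edges, degree = build_network(table)
print(edges)
print(degree)
```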

Expression correlation

Expression Trends

Add items to the list window to plot the clusters with similar trends of intensity

Clusters plot (clustered by k-means)
The coloured line is the average.

Select a specific cluster to plot interactive Lines or get the table
The dashed red line is the average

7. Taxa-Func Link

Taxa-Func Link Plot

Check all taxa in one function (or all functions in one taxon).
Select a function and click the Show Linked Taxa Only button.
- Linked Number: The number of taxa linked to this function.
- The number at the start of each taxon name: The number of peptides in this Taxa-Func.
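The two numbers shown in this view can be derived from peptide-level annotations, as in this minimal sketch. The input layout (peptide, taxon, function) and the names are invented for illustration and do not reflect MetaX's internal tables.

```python
from collections import defaultdict


def link_summary(peptide_rows):
    """Count linked taxa per function and peptides per taxon-function pair."""
    taxa_per_function = defaultdict(set)
    peptides_per_otf = defaultdict(int)
    for _peptide, taxon, function in peptide_rows:
        taxa_per_function[function].add(taxon)
        peptides_per_otf[(taxon, function)] += 1
    linked_number = {f: len(taxa) for f, taxa in taxa_per_function.items()}
    return linked_number, dict(peptides_per_otf)


rows = [
    ("PEP1", "g__Blautia_A", "ko:K00625"),
    ("PEP2", "g__Blautia_A", "ko:K00625"),
    ("PEP3", "g__Alistipes", "ko:K00625"),
    ("PEP4", "g__Alistipes", "ko:K13788"),
]
linked_number, peptides_per_otf = link_summary(rows)
print(linked_number)
print(peptides_per_otf)
```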

Filter items of the Taxa and Func list

Plot Heatmap or Bar
Select some groups (default: all) to get the intensity of each taxon for this function.

Plot peptides in one Function of a Taxon

Switch the bars (or lines) between stacked and unstacked.

Change Bar plot to Lines

Taxa-Func Network

Select some groups or samples (default: all).
Add some taxa, functions, or taxa-func items to focus the view (optional).

Plot list only
Plot List Only: Show only the items in the list and the items linked to them.
Without Links: Only show the items in the focus list.
Network plot
The yellow dots are taxa and the grey dots are functions; the size of each dot represents its intensity.
The red dots are the taxa we focused on
The green dots are the functions we focused on
More parameters can be set in Dev->Settings->Others (e.g. Nodes Shape, color, Line Style)

8. Restore Last TaxaFunc Object

Once you create TaxaFunc, the TaxaFunc Object is saved automatically, and you can restore it next time.
You can also export the current MetaX object to a file and reload it later.

Preparing Your Data

Module 2. Database Builder

Note: The results from MetaLab v2.3 MaxQuant workflow do not require database building. However, we do not recommend using these results as input to MetaX, as many peptides may be discarded.

Build the database for the first time using the Database Builder.

Option 1: Build Database Using MGnify Data

Ensure you download the correct database type corresponding to your data.

Option 2: Build Database Using Own Data

Annotation Table: A tab-separated (TSV) table whose first column is the protein name, formed by joining the genome name and the protein ID with "_" (e.g., "Genome1_protein1"); the other columns contain annotation information.

Taxa Table: A TSV table (tab-separated), with the first column as Genome name, e.g., "Genome1", and the second column as taxa.

Example Annotation Table:

Query	Preferred_name	EC	KEGG_ko
MGYG000000001_00696	mfd	-	ko:K03723
MGYG000000001_02838	hxlR	-	-
MGYG000000001_01674	ispG	1.17.7.1,1.17.7.3	ko:K03526
MGYG000000001_02710	glsA	3.5.1.2	ko:K01425
MGYG000000001_01356	mutS2	-	ko:K07456
MGYG000000001_02630	-	-	-
MGYG000000001_02418	ackA	2.7.2.1	ko:K00925
MGYG000000001_00728	atpA	3.6.3.14	ko:K02111
MGYG000000001_00695	pth	3.1.1.29	ko:K01056
MGYG000000001_02907	-	-	ko:K03086
MGYG000000001_02592	rplC	-	ko:K02906
MGYG000000001_00137	-	-	ko:K03480,ko:K03488

Example Taxa Table:

Genome	Lineage
MGYG000000001	d_Bacteria;p_Firmicutes_A;c_Clostridia;o_Peptostreptococcales;f_Peptostreptococcaceae;g_GCA-900066495;s_GCA-900066495 sp902362365
MGYG000000002	d_Bacteria;p_Firmicutes_A;c_Clostridia;o_Lachnospirales;f_Lachnospiraceae;g_Blautia_A;s_Blautia_A faecis
MGYG000000003	d_Bacteria;p_Bacteroidota;c_Bacteroidia;o_Bacteroidales;f_Rikenellaceae;g_Alistipes;s_Alistipes shahii
MGYG000000004	d_Bacteria;p_Firmicutes_A;c_Clostridia;o_Oscillospirales;f_Ruminococcaceae;g_Anaerotruncus;s_Anaerotruncus colihominis
MGYG000000005	d_Bacteria;p_Firmicutes_A;c_Clostridia;o_Peptostreptococcales;f_Peptostreptococcaceae;g_Terrisporobacter;s_Terrisporobacter glycolicus_A
MGYG000000006	d_Bacteria;p_Firmicutes;c_Bacilli;o_Staphylococcales;f_Staphylococcaceae;g_Staphylococcus;s_Staphylococcus xylosus
MGYG000000007	d_Bacteria;p_Firmicutes;c_Bacilli;o_Lactobacillales;f_Lactobacillaceae;g_Lactobacillus;s_Lactobacillus intestinalis
MGYG000000008	d_Bacteria;p_Firmicutes;c_Bacilli;o_Lactobacillales;f_Lactobacillaceae;g_Lactobacillus;s_Lactobacillus johnsonii
MGYG000000009	d_Bacteria;p_Firmicutes;c_Bacilli;o_Lactobacillales;f_Lactobacillaceae;g_Ligilactobacillus;s_Ligilactobacillus murinus
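The two tables are connected through the genome part of the protein name. The sketch below joins the first rows of the example tables above; it assumes that the genome name is everything before the last underscore, which holds for the MGnify-style IDs shown here but is only an assumption for custom databases. The helper names are hypothetical.

```python
import csv
import io

ANNOTATION_TSV = (
    "Query\tPreferred_name\tEC\tKEGG_ko\n"
    "MGYG000000001_00696\tmfd\t-\tko:K03723\n"
    "MGYG000000001_02838\thxlR\t-\t-\n"
)
TAXA_TSV = (
    "Genome\tLineage\n"
    "MGYG000000001\td_Bacteria;p_Firmicutes_A;c_Clostridia;o_Peptostreptococcales;"
    "f_Peptostreptococcaceae;g_GCA-900066495;s_GCA-900066495 sp902362365\n"
)


def genome_of(protein_id):
    """Assumption: the genome name is the part before the last underscore."""
    return protein_id.rsplit("_", 1)[0]


def read_tsv(text):
    """Parse tab-separated text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


lineage_by_genome = {row["Genome"]: row["Lineage"] for row in read_tsv(TAXA_TSV)}
annotated = read_tsv(ANNOTATION_TSV)
for row in annotated:
    # Proteins from genomes missing in the taxa table get an empty lineage.
    row["Lineage"] = lineage_by_genome.get(genome_of(row["Query"]), "")

for row in annotated:
    print(row["Query"], row["KEGG_ko"], row["Lineage"])
```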

Module 3. Database Updater

The Database Updater allows updating the database built by the Database Builder or adding more annotations. This step is optional.

Update the built database and extend annotations.

Option 1: Built-in Mode

We recommend some extended databases, such as dbCAN_seq.

Option 2: TSV Table

Extend the database by adding annotations from a new TSV table. Ensure the column separator is a tab and the first column is the protein name, with other columns containing function annotations.

Example:

Protein ID	COG	KEGG	...
MGYG000000001_02630	Function 1	Function 1	...
MGYG000000001_01475	Function 2	Function 1	...
MGYG000000001_01539	Function 3	Function 1	...

Module 4. Peptide Annotator

1. Results from MAG Workflow

These peptide results come from workflows that use metagenome-assembled genomes (MAGs) as the reference database for protein searches, such as DIA-NN, MetaLab-MAG, MetaLab-DIA, and other workflows that use MAG databases like MGnify or custom MAG databases.

Before analysis, annotate the peptides to the Operational Taxa-Functions (OTF) Table using the Peptide Annotator.

Required:

Database: The database created by Database Builder

Peptide Table:

Option 1: From a search engine that uses metagenome-assembled genomes (MAGs) as the database (e.g., final_peptides.tsv in MetaLab-MAG, xxx_report.pr_matrix.tsv in DIA-NN results).
Option 2: Manually create a table with one column for the peptide sequence and another column for the protein group (e.g., MGYG000003683_00301; MGYG000001490_01143) from the MGnify or your own database. The remaining columns should contain the intensity values for each sample.

Example:

Sequence	Proteins	Intensity_V1_01	Intensity_V1_02	Intensity_V1_04
(Acetyl)KGGVEPQSETVWR	MGYG000002716_01681;MGYG000000195_00452;MGYG000001616_00519;MGYG000002926_00231;...	714650	0	0
(Acetyl)KVIPELNGK	MGYG000003589_01892;MGYG000001560_01812;MGYG000001789_00244;...	0	0	0
(Acetyl)LAELGAKAVTLSGPDGYIYDPDGITTK	MGYG000001199_02893	0	0	0
(Acetyl)LLTGLPDAYGR	MGYG000001757_01206;MGYG000004547_02135;MGYG000001283_00124	0	307519	0
(Acetyl)MDFTLDKK	MGYG000000076_01275;MGYG000003694_00879;MGYG000000312_02425;MGYG000000271_02102	306231	0	1214497

Output Save Path: The location to save the result table.
LCA Threshold: The proportion threshold used when finding the lowest common ancestor (LCA) for each peptide. The default is 1.00 (100%).
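A proportion-based LCA can be sketched as follows. This is one plausible reading, assuming the proportion is counted over the lineages of the proteins in a peptide's protein group and that the deepest taxon reaching the threshold is kept; it is not taken from MetaX's source, and the function name is hypothetical. The lineages are shortened toy examples.

```python
from collections import Counter


def lca_with_threshold(lineages, threshold=1.0):
    """Deepest lineage prefix supported by at least `threshold` of the lineages."""
    split = [lineage.split(";") for lineage in lineages]
    best = []
    depth = max(len(levels) for levels in split)
    for level in range(1, depth + 1):
        # Only consider prefixes that extend the ancestor accepted so far.
        counts = Counter(
            tuple(levels[:level])
            for levels in split
            if len(levels) >= level and levels[: level - 1] == best
        )
        if not counts:
            break
        prefix, count = counts.most_common(1)[0]
        if count / len(split) < threshold:
            break
        best = list(prefix)
    return ";".join(best)


a = "d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Oscillospirales"
b = "d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Lachnospirales"
print(lca_with_threshold([a, a, a, b], threshold=1.0))   # stops at the class level
print(lca_with_threshold([a, a, a, b], threshold=0.75))  # reaches the order level
```

With the default of 1.00 every lineage must agree, so a single divergent protein pushes the result up to a higher rank. Lowering the threshold tolerates a minority of disagreeing proteins.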

2. Results from MaxQuant Workflow

These peptide results come from the MetaLab 2.3 MaxQuant workflow.

Select the MetaLab result folder, which contains the maxquant_search folder.

The Peptide Annotator will automatically find the peptides_report.txt, BuiltIn.pepTaxa.csv, and functions.tsv in the maxquant_search folder. Alternatively, you can select the files manually.
Select OTFs Save To to set the location to save the result table.

Developer Tools

Export Log
You can export the log file for debugging or for reporting an issue.
Show or Hide the Console

Settings
Check Auto Check Update to enable or disable update checks on launch.
Choose whether to update from the stable version or beta version in Settings.
Other Options Settings

Enjoy MetaX

If you have any issues or suggestions, please open a new issue on GitHub.