Background

With the recent explosion of genomic data, researchers are storing, cleaning, processing, and analyzing increasingly large volumes of data [1]. Differential expression studies account for a large portion of this data, and entire analysis pipelines have been built specifically for them. These pipelines often end at the differential expression analysis step, the output of which is a set of lists of differentially expressed entities (e.g. genes or proteins) [2]. Each list represents all the entities that were differentially expressed between two conditions (e.g. control vs. drug A). In addition to the entity names themselves, differential expression files commonly list fold changes, p-values, and many other categories of information [3,4,5,6]. Filtering and comparing files of differentially expressed entities is a common task that is usually done by writing custom Perl, Python, or R scripts. This task grows cumbersome when many pairwise comparisons are involved. A-Lister addresses that concern by allowing the user to filter and compare any number of pairwise comparisons across any number of differential expression files within a single command. A-Lister accepts text files that use the most common delimiters (tab, comma, colon, semicolon, and space) and is thus compatible with the output of most differential expression tools.

A-Lister is intended for use within bioinformatics analysis pipelines between the differential expression (DE) analysis and the visualization/pathway analysis steps. A-Lister narrows down lists of differentially expressed entities produced by differential expression analysis tools. These entities can then be further analyzed using visualization, pathway, and other bioinformatics software.

Implementation

A-Lister is written in Python 3.7 and is freely available on GitHub [7]. The command line interface (CLI) version can be run on Windows, Mac, and Unix operating systems. The graphical user interface (GUI) version can be used to generate and launch A-Lister commands.

Workflow and output

An A-Lister command can be written and executed directly at a command line, or the command can be generated and executed through the GUI. All relevant input is supplied within a single A-Lister command. There are two commands available: diff-expression and name-list. The diff-expression command is used to execute A-Lister in differential expression (DE) mode. The name-list command is used to execute A-Lister in generic name list mode.

Below is a description of how A-Lister executes diff-expression and name-list commands (Fig. 1). Once the command is entered, A-Lister validates it. If the command is valid, the program reads in the input files provided by the user. If specified by the user, the data within the input files may be filtered by any column. Furthermore, in DE mode the individual pairwise comparisons can be filtered by direction (sign of fold change). Set operations are then performed on the groups (name list mode) or pairwise comparisons (DE mode) as specified within the query. Once the query is executed, a delimited list of the resultant entity names, together with their count, is written to the result file. A system dump file is also written; it contains additional information about A-Lister’s execution that can be helpful for debugging or validation. Additionally, in DE mode, filtered copies of the original input files are output. These copies are obtained by restricting the original input files to the entities in the result.

Fig. 1 Data and control flow diagram of A-Lister

Input files

Input files for the name list command

In this mode, A-Lister accepts files containing columns of names delimited by tabs or commas. The header row must contain the group name for each column. An example of a name list file with three groups: control, treated1, and treated2 is shown below (Table 1). A further example of this format is supplied as an additional file (see Additional file 1: Sample input file S1).

Table 1 Example of a Name List File

Input files for the differential expression command

In this mode, A-Lister accepts differential expression files containing a primary ID column (e.g. gene name), fold change column(s), and any other columns present. The columns in these files must be delimited by tab, comma, colon, semicolon, or space. A-Lister supports two types of differential expression file formats described below.

Differential Expression Sample Format (DE-Sample) (Row-Format) File: This is a delimited text file containing a primary ID column, a single Fold Change column, one Sample1 column, and one Sample2 column. The Sample1 and Sample2 columns identify the pairwise comparison to which each row belongs. In this way, multiple pairwise comparisons can be listed within a single DE-Sample file using a single fold change column (Table 2). The .diff files that are output from Cuffdiff follow this format [8] (see Additional file 2: Sample input file S2).

Table 2 Example of a DE-Sample File

Differential Expression Series Format (DE-Series) (Column-Format) File: This is a delimited text file containing a single ID column and multiple Fold Change columns. Each Fold Change column contains data pertaining to a single pairwise comparison. In this way, multiple pairwise comparisons can be listed within a single file using multiple fold change columns (Table 3). This is the most common format for differential expression files (see Additional file 3: Sample input file S3).

Table 3 Example of a DE-Series File
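
To illustrate how the two layouts relate, the following Python sketch reads each format into the same structure: a mapping from pairwise comparison name to {ID: fold change}. It is not A-Lister's own parser. The function names, column names, and sample values are hypothetical, and naming a comparison as Sample1*Sample2 simply mirrors the notation used in the queries later in this article.

```python
import csv
import io
from collections import defaultdict


def read_de_sample(text, id_col, s1_col, s2_col, fc_col, delimiter="\t"):
    """Row format: one fold change column; Sample1/Sample2 name the comparison."""
    comparisons = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        name = f"{row[s1_col]}*{row[s2_col]}"
        comparisons[name][row[id_col]] = float(row[fc_col])
    return dict(comparisons)


def read_de_series(text, id_col, fc_cols, delimiter="\t"):
    """Column format: fc_cols maps a comparison name to its fold change column."""
    comparisons = {name: {} for name in fc_cols}
    for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        for name, column in fc_cols.items():
            comparisons[name][row[id_col]] = float(row[column])
    return comparisons


# Hypothetical miniature files (tab-delimited).
DE_SAMPLE = (
    "gene\tsample_1\tsample_2\tlog2(fold_change)\n"
    "G1\tA\tB\t1.5\n"
    "G2\tA\tB\t-2.0\n"
    "G1\tA\tC\t0.3\n"
)
DE_SERIES = (
    "gene\tfc_AB\tfc_AC\n"
    "G1\t1.5\t0.3\n"
    "G2\t-2.0\t0.0\n"
)

sample_view = read_de_sample(
    DE_SAMPLE, "gene", "sample_1", "sample_2", "log2(fold_change)"
)
series_view = read_de_series(
    DE_SERIES, "gene", {"A*B": "fc_AB", "A*C": "fc_AC"}
)
```

Once both formats are in this shape, filtering and set operations can treat every pairwise comparison uniformly, regardless of which file layout it came from.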

A-Lister filtering

A-Lister filtering is performed if the user specifies the optional filter by column (-f) parameter (Table 4) for any column (attribute) within a differential expression file. When a DE-Sample file is filtered by an attribute, the entire file is filtered. When a DE-Series file is filtered by an attribute, there are two possible behaviors. First, if the filter attribute belongs to a single pairwise comparison (e.g. p-value2), then only that pairwise comparison is filtered. Second, if the filter attribute belongs to the entire file (e.g. ID column), then the entire file is filtered. Additionally, pairwise comparisons can be filtered by direction (sign of fold change) using the directional query (-dq) argument described below (Table 4).

Table 4 A-Lister Command Line Interface (CLI)

A-Lister directional query

An A-Lister directional query is composed of pairwise comparisons, set operators, and optional directions. The pairwise comparison names are derived from the pairwise comparison mapping argument (-pc) (Table 4). The permitted set operators are AND, FAND, OR, and DIFF. Additionally, parentheses can be used to nest operations and to set the order in which they are applied. A directional query is specified with the -dq argument of the diff-expression command (Table 4).

Set operations

The AND operator applied to two sets returns all the elements that are present in both sets. The FAND (fuzzy AND) operator applied to two sets returns all the elements of either set that have a similar counterpart in the other set. A customized Jaro-Winkler algorithm is used to calculate similarity. To be considered similar, two strings must have a Jaro-Winkler score > 0.84 [9]. The OR operator applied to two sets returns all the elements present in either set. The DIFF operator applied to two sets returns all the elements present in the first set but not in the second. All set operations are implemented using the standard Python library.
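
The following Python sketch illustrates the four operators. AND, OR, and DIFF map directly onto Python's built-in set type. For FAND, the sketch uses the textbook Jaro-Winkler measure (prefix scale 0.1, prefix length capped at 4) rather than A-Lister's customized variant, whose details are not given here. It also assumes that FAND returns every name from either set that has a partner in the other set scoring above 0.84. The function names and gene sets are hypothetical.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity between two strings (0.0 to 1.0)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro score boosted for a shared prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def fand(first, second, threshold=0.84):
    """Assumed FAND semantics: names with a similar partner in the other set."""
    similar = set()
    for a in first:
        for b in second:
            if jaro_winkler(a, b) > threshold:
                similar.update((a, b))
    return similar


set1 = {"BRCA1", "TP53", "MYC"}
set2 = {"BRCA2", "EGFR", "MYC"}

and_result = set1 & set2      # AND: present in both sets
or_result = set1 | set2       # OR: present in either set
diff_result = set1 - set2     # DIFF: in the first set but not the second
fand_result = fand(set1, set2)
```

Under these assumptions, BRCA1 and BRCA2 score 0.92 and are therefore reported by FAND together with the exact match MYC, whereas AND reports only MYC.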

Directionality

Specifying the UP keyword in a query selects all entities whose fold change values are positive for a given pairwise comparison. Specifying DOWN in a query selects all entities whose fold change values are negative for a given pairwise comparison. ALL is a special modifier that expands into multiple queries: results are returned as if every combination of UP and DOWN had been specified in place of ALL. For example, a query containing N ALL directions is transformed into 2^N queries. Each query is then executed, and the results for each query are written to the output files in separate directories. NONE is the default direction for all pairwise comparisons. Pairwise comparisons with NONE direction are not filtered by direction.
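
The directional behavior can be sketched in Python as follows. This is an illustration only, not A-Lister's implementation: the function names and the comparison labels A*B and C*D are hypothetical, and the ALL expansion is done by naive text substitution on the ":ALL" suffix.

```python
from itertools import product


def select_direction(fold_changes, direction="NONE"):
    """Return the IDs of one pairwise comparison that match a direction."""
    if direction == "UP":
        return {name for name, fc in fold_changes.items() if fc > 0}
    if direction == "DOWN":
        return {name for name, fc in fold_changes.items() if fc < 0}
    if direction == "NONE":
        return set(fold_changes)  # no directional filtering
    raise ValueError(f"unsupported direction: {direction}")


def expand_all(query):
    """Expand a query with N ':ALL' directions into 2**N UP/DOWN queries."""
    parts = query.split(":ALL")
    n_all = len(parts) - 1
    expanded = []
    for combo in product(("UP", "DOWN"), repeat=n_all):
        pieces = [parts[0]]
        for direction, tail in zip(combo, parts[1:]):
            pieces.append(":" + direction + tail)
        expanded.append("".join(pieces))
    return expanded


fold_changes = {"G1": 1.8, "G2": -2.4, "G3": 0.0}
up_ids = select_direction(fold_changes, "UP")
down_ids = select_direction(fold_changes, "DOWN")
queries = expand_all("A*B:ALL-AND-C*D:ALL")
```

With two ALL directions, `queries` holds the four UP/DOWN combinations; a query without any ALL direction is returned unchanged as a single-element list.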

A-Lister query (non-directional)

A non-directional query is composed of group names and set operators. The set operators are the same as in the directional query (AND, FAND, OR, and DIFF), and parentheses can likewise be used to nest and order the operations. The group names are derived from the first (header) row of the name list files. The non-directional query argument is used in the name-list command (Table 4).
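
As a rough illustration of how such a query could be evaluated, the Python sketch below tokenizes a query on parentheses and on the -AND-, -OR-, and -DIFF- separators and evaluates it against named sets. It is not A-Lister's parser. It assumes that operators outside parentheses have equal precedence and are applied left to right, which may differ from A-Lister's actual rule, and it omits FAND for brevity. The function name and example sets are hypothetical.

```python
import operator
import re

# Operator tokens as they appear in a query string, mapped to set operations.
OPERATORS = {
    "-AND-": operator.and_,   # intersection
    "-OR-": operator.or_,     # union
    "-DIFF-": operator.sub,   # difference
}
TOKEN_SPLIT = re.compile(r"(\(|\)|-AND-|-OR-|-DIFF-)")


def evaluate_query(query, named_sets):
    """Evaluate a set query, left to right, honoring parentheses."""
    tokens = [tok for tok in TOKEN_SPLIT.split(query) if tok]
    position = 0

    def operand():
        nonlocal position
        token = tokens[position]
        position += 1
        if token == "(":
            value = expression()
            if position >= len(tokens) or tokens[position] != ")":
                raise ValueError("missing closing parenthesis")
            position += 1
            return value
        return set(named_sets[token])

    def expression():
        nonlocal position
        value = operand()
        while position < len(tokens) and tokens[position] in OPERATORS:
            apply = OPERATORS[tokens[position]]
            position += 1
            value = apply(value, operand())
        return value

    result = expression()
    if position != len(tokens):
        raise ValueError("unexpected token: " + tokens[position])
    return result


groups = {
    "control": {"g1", "g2", "g3"},
    "treated1": {"g2"},
    "treated2": {"g3", "g4"},
}
result = evaluate_query("control-DIFF-(treated1-OR-treated2)", groups)
```

The same evaluator works for pairwise comparison names such as M2Low*M2High, since only the parentheses and the three operator separators are treated as special.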

Results

A-Lister can be executed through a command line interface (CLI) or a graphical user interface (GUI). Both interfaces are organized around two commands, name-list and diff-expression, which correspond to the generic name list mode and the differential expression mode of execution. Each command has its own set of arguments (Table 4) (see Additional file 4: User Manual). We will first describe the CLI through example use cases to illustrate the parameters and functionality, and then give an overview of the GUI version.

Use case 1: analysis of name list files and fuzzy intersection (FAND) operation

Suppose the user wants to identify all identical and similar genes within two sets of genes (Table 5). The first set is contained in file A and the second set is contained in file B (see Additional file 5: Name-List file A, Additional file 6: Name-List file B).

Table 5 Example of Name-List Command with Intersection (AND) and Fuzzy Intersection (FAND) Query

The A-Lister command listed below returns the genes that are identical in the two files by using the AND operator:

python ALister_CLI.py name-list "Set1-AND-Set2" FileA.txt FileB.txt -o E:/Data/Sample_Output

The A-Lister command listed below returns the genes that are similar in the two files by using the FAND operator:

python ALister_CLI.py name-list "Set1-FAND-Set2" FileA.txt FileB.txt -o E:/Data/Sample_Output

The output of these commands is shown in Table 5.

Use case 2: analysis of differential expression using a complex query

The data for this use case can be downloaded from NCBI’s Gene Expression Omnibus (GEO) database [10]. The series number is GSE126785 [11]. There are three groups of samples in the study: two types of induced pluripotent stem cells (iPSCs) and embryonic stem cells (ESCs). The gene expression of each group was measured under 5% oxygen and under 20% oxygen. The published files are three DESeq2 files, each containing genes differentially expressed for a single cell line between the 5% oxygen and 20% oxygen conditions. M2 is the ESC line. M4 and M5 are the iPSC lines. Suppose the user wants to know which genes are significantly differentially expressed in the embryonic stem cells (under different oxygen conditions) but are not significantly differentially expressed in either of the induced pluripotent stem cell lines (Fig. 2).

Fig. 2 Data and process flow chart for use case 2. Input Files M2, M4, and M5 each contain 65,678 rows of differentially expressed genes. A-Lister is used to filter these files by abs(log2(foldchange)) > 1.0. The filtered files are then processed by A-Lister with the OR and DIFF set operators, resulting in 36 genes

The A-Lister command listed below will provide the resulting genes:

python ALister_CLI.py diff-expression GSE126785_M2.txt GSE126785_M4.txt GSE126785_M5.txt -pc "M2Low*M2High->3.log2(FC)" "M4Low*M4High->3.log2(FC)" "M5Low*M5High->3.log2(FC)" -dq "M2Low*M2High-DIFF-(M5Low*M5High-OR-M4Low*M4High)" -o E:/Data/Sample_Output/ -n "1.GeneID" -f "3.log2(FC):agt1.0" -fc "3.log2(FC)" -v

There are three input files. Each file contains a single pairwise comparison that is mapped to its corresponding fold change column within the -pc argument. An output directory is specified using the optional -o argument. The ID and fold change columns are identified for each file using the -n and -fc arguments. Each file is filtered on its log2 fold change column, keeping only values greater than 1 or less than -1. The A-Lister directional query is specified within the -dq argument. The result file, which contains the thirty-six genes that satisfied the query and passed the filters, is provided in the additional files (see Additional file 7: Use case 2 result).

Use case 3: analysis of differential expression using directionality patterns

The data for this use case can be downloaded from the National Center for Biotechnology Information’s (NCBI’s) Gene Expression Omnibus (GEO) database [10]. The series number is GSE108643 [12]. There are two groups of participants in the study: lean individuals and overweight/obese individuals. Muscle biopsies were collected from both groups before and after exercise. RNA-seq data was generated on the Illumina platform, TopHat was used for sequence alignment, and Cuffdiff was used for differential gene expression analysis.

The Cuffdiff files contain four conditions: LeanPre, LeanPost, OvobPre, OvobPost. Each condition is compared to every other condition resulting in six pairwise comparisons: LeanPre vs. LeanPost, LeanPre vs. OvobPre, LeanPre vs. OvobPost, LeanPost vs. OvobPre, LeanPost vs. OvobPost, and OvobPre vs. OvobPost. Suppose the user wants to examine which genes are significantly upregulated in both lean and overweight/obese individuals post exercise. The A-Lister command listed below will provide the resulting genes:

python ALister_CLI.py diff-expression GSE108643_Cuffdiff.txt -pc "LeanPre->LPE,LeanPost->LPO,OvobPre->OPE,OvobPost->OPO" -dq "LPE*LPO:UP-AND-OPE*OPO:UP" -f "log2(fold_change):agt1.0,q_value:lt0.05,value_1:gt1.0,value_2:gt1.0" -s1 "sample_1" -s2 "sample_2" -n "gene"

A diff-expression command will be executed with the GSE108643_Cuffdiff.txt input file. Each file-specific condition label is mapped to a globally unique label within the -pc argument. This mapping is important when dealing with multiple files that contain the same condition label names (e.g. q1, q2, q3, etc.) or, as in this example, when the user would like to shorten the names to avoid typing long group names. The -s1, -s2, and -n arguments specify the names of the sample1, sample2, and ID columns. In this example, the -f argument is used to filter the file by absolute value of log2(fold change) greater than 1.0 (agt1.0), q-value less than 0.05 (lt0.05), and values 1 and 2 greater than 1.0 (gt1.0). The A-Lister query is specified within the -dq argument, where LPE*LPO represents lean pre-exercise vs. lean post-exercise, and OPE*OPO represents overweight/obese pre-exercise vs. overweight/obese post-exercise. Since no output directory was specified, the result is output in the result.txt file within the current working directory.
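
The filter specification used above can be read as a comma-separated list of column:condition pairs. The Python sketch below shows one way such a specification could be parsed and applied to rows. It is an illustration rather than A-Lister's implementation: only the three condition codes that appear in this article (agt, gt, lt) are handled, and the function names and sample rows are hypothetical.

```python
import re

# Condition codes seen in the example commands; other codes may exist.
CONDITIONS = {
    "agt": lambda value, threshold: abs(value) > threshold,
    "gt": lambda value, threshold: value > threshold,
    "lt": lambda value, threshold: value < threshold,
}
CONDITION_PATTERN = re.compile(r"(agt|gt|lt)(-?\d+(?:\.\d+)?)")


def parse_filters(spec):
    """Turn 'col:agt1.0,col2:lt0.05' into (column, test, threshold) triples."""
    filters = []
    for item in spec.split(","):
        column, _, condition = item.rpartition(":")
        match = CONDITION_PATTERN.fullmatch(condition)
        if not column or match is None:
            raise ValueError(f"cannot parse filter: {item!r}")
        filters.append((column, CONDITIONS[match.group(1)], float(match.group(2))))
    return filters


def passes(row, filters):
    """A row is kept only if it satisfies every filter."""
    return all(test(float(row[column]), threshold)
               for column, test, threshold in filters)


filters = parse_filters("log2(fold_change):agt1.0,q_value:lt0.05")
rows = [
    {"gene": "G1", "log2(fold_change)": "-2.3", "q_value": "0.01"},
    {"gene": "G2", "log2(fold_change)": "0.4", "q_value": "0.01"},
    {"gene": "G3", "log2(fold_change)": "1.7", "q_value": "0.20"},
]
kept = [row["gene"] for row in rows if passes(row, filters)]
```

Here only G1 is kept: G2 fails the absolute fold change threshold and G3 fails the q-value threshold.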

Now, suppose the user wants to examine all possible directionality patterns for the above-mentioned query. The four possible patterns are up, up; up, down; down, up; and down, down. This could be accomplished by changing the directions within the -dq argument from UP to ALL. This would result in the following A-Lister command:

python ALister_CLI.py diff-expression GSE108643_Cuffdiff.txt -pc "LeanPre->LPE,LeanPost->LPO,OvobPre->OPE,OvobPost->OPO" -dq "LPE*LPO:ALL-AND-OPE*OPO:ALL" -f "log2(fold_change):agt1.0,q_value:lt0.05,value_1:gt1.0,value_2:gt1.0" -s1 "sample_1" -s2 "sample_2" -n "gene"

This is an ALL query (a query containing an ALL directionality) with two ALL directions, so it is effectively translated into four queries: LPE*LPO:UP-AND-OPE*OPO:UP, LPE*LPO:UP-AND-OPE*OPO:DOWN, LPE*LPO:DOWN-AND-OPE*OPO:UP, LPE*LPO:DOWN-AND-OPE*OPO:DOWN. Since no output directory was specified, the results for all four queries are output in the result.txt file within the current working directory.

This example found that one hundred seven genes are differentially expressed in both the LPE*LPO and OPE*OPO pairwise comparisons (Fig. 3). One hundred genes are upregulated and seven genes are downregulated in both of these pairwise comparisons. Zero genes are upregulated in one of these pairwise comparisons while being downregulated in the other. The result file (see Additional file 8: Use case 3 result) containing the upregulated and downregulated genes can be found in the additional files.

Fig. 3 Heatmap visualization of significantly differentially expressed genes for use case 3. (a) All genes that are significantly differentially expressed for both LeanPre vs. LeanPost and OvobPre vs. OvobPost pairwise comparisons. (b) All genes that are significantly upregulated in both pairwise comparisons. (c) All genes that are significantly downregulated in both pairwise comparisons

Graphical user interface (GUI)

The GUI guides the user through creating the command necessary to run A-Lister with the desired settings. After the user selects the mode (differential expression or name list), the parameters for that mode are presented. The user can browse for files and preview the column headings and the first few lines of each input file, which facilitates setting the filtering and mapping parameters. If appropriate for the mode, the GUI also enables selecting the comparison groups, directionality, and set operators necessary for creating the query. Once the parameters for all files are set, the user can generate and launch the command. Detailed instructions on the use of the GUI can be found in the ReadMe.pdf file (Additional file 4), and an example screenshot of the GUI is shown (Fig. 4).

Fig. 4 Example screenshot of the graphical user interface (GUI) version of A-Lister

Discussion

Although several existing bioinformatics tools have some overlapping functionality with A-Lister, none fills the same role. Several such tools are listed in Table 6 and are described below. Intervene is a tool that can compute and visualize intersections of gene sets (or genomic regions) using multiple visualization techniques such as Venn diagrams, UpSet plots, and heatmaps [15]. VennPainter and InteractiVenn are similar to Intervene [16, 17]. The statistical R packages SuperExactTest and GeneOverlap can also be used to compute and visualize intersections of sets [18, 19]. Galaxy suite text manipulation tools can be used to filter and compare tab-delimited text files [14]. The key limitation of these tools is that they are not built to deal with differential expression data specifically. As such, they lack the means to filter individual pairwise comparisons.

Table 6 Comparison to existing software

Functional Heatmap is another novel tool that seeks to make filtering and comparison of lists of differentially expressed entities less cumbersome [13]. Similar to A-Lister, Functional Heatmap allows the user to filter differential expression data by columns (e.g. p-value) and by direction (sign and magnitude of fold change). However, Functional Heatmap is specialized for analyzing time-series data, specifically patterns of fold change direction across time. A-Lister, on the other hand, can be used to analyze any pairwise comparison differential expression data across conditions, tissues, and timepoints. Moreover, unlike Functional Heatmap, A-Lister supports the notion of queries. The queries allow the user to quickly examine complex relationships between pairwise comparisons (Table 6).

In the future we plan to add ID validation and mapping in order to enable integration of different types of differentially expressed entities. Studies containing multiple -omics types are increasingly common, and we would like to be able to seamlessly compare genes, proteins, and methylation markers with each other. However, the names used as IDs for the different data types (genes, proteins, methylation markers) are generally not the same; they depend on the naming convention and database of each data type. The IDs often do not map one to one, but rather one to many or one to none. Even within one data type, such as gene expression data, there are differences in naming due to annotation version, platform used, species, and other characteristics. Currently, UniProt provides a web-based tool (Retrieve/ID mapping tool) to convert IDs between different annotations [20]. To maintain A-Lister’s lightweight requirements (e.g., the user does not have to download datasets) and its offline capabilities for secure human data processing, we did not implement linking with databases through local or outside connections to web services to check or query IDs. We recommend that users interested in such capability first process the names in their files through a service such as UniProt before executing A-Lister. In the future, we may implement and host a web server to facilitate this functionality and to make A-Lister more accessible. However, these features will require the addition of complex name mapping functionality and a web back-end to A-Lister, which we propose as a future enhancement.

Conclusions

A-Lister allows the user to quickly filter and compare any number of pairwise comparisons across multiple heterogeneous differential expression files. Additionally, A-Lister can be used to examine patterns of fold change direction and to execute complex queries across multiple pairwise comparisons. This tool may be especially useful in data mining applications, where dealing with many heterogeneous files is common. A-Lister will help researchers save time otherwise spent on writing, maintaining, and adjusting custom differential expression analysis scripts.

Availability and requirements

Project name: A-Lister.

Project home page: https://github.com/staslist/A-Lister

Operating system(s): Windows, Mac OS (10.10.x+), Unix.

Programming language: Python.

Other requirements: Python 3.7+, Google Chrome 76.0+, Eel 0.10.4.

License: MIT.

Any restrictions to use by non-academics: None.