Tutorials
In this page, a list of tutorials is provided to learn how to use the software. First, the input data used for this tutorial will be described. Then, a tutorial for each mode will be presented.
Input data: Description of the input data.
Running build mode: Tutorial for the build mode.
Running samplespecif mode: Tutorial for the samplespecif mode.
Running cancerspecif mode: Tutorial for the cancerspecif mode.
Running mhcbind mode: Tutorial for the mhcbind mode.
Running pepquery mode: Tutorial for the pepquery mode.
Note
The folder with the tutorial results will be generated in the directory where immunopepper is run, not inside the immunopepper folder.
Input data
The input data used for this tutorial can be found in the folder ‘tests/data_simulated/data/build_mode’. The data is composed of the following files:
genes_graph_conf3.merge_graphs.pickle: Splice graphs obtained from SplAdder. The file will contain one splice graph per gene.
This file contains information for 9 genes.
The object obtained from SplAdder contains data and metadata. However, in this software, only the data is used, which contains different fields. The whole list of the fields can be explored in the SplAdder documentation. Here, only the relevant fields for this tutorial are shown.
chr: Chromosome information. In this example all the genes belong to chromosome 1. The chromosomes have the notation “x”, x being a valid chromosome number. Note: The chromosomes in the reference genome and in the reference annotation file must have the same notation.
name: Gene ID.
start: Start coordinate of the genes.
stop: Stop coordinate of the genes.
strand: Strand where the genes are placed. It can be either “+” or “-“.
Therefore, the data contained in the splice graph file for this particular example is the following:
chr
name
start
stop
strand
1
gene1
50
750
“+”
1
gene3
1050
1600
“+”
1
gene4
2100
2850
“+”
1
gene6
4050
4500
“+”
1
gene7
4550
5000
“-”
1
gene8
5050
5500
“+”
1
gene9
5550
6000
“-”
1
gene10| 6050
7000
“+”
1
gene15| 9050
9450
“+”
genes_graph_conf3.merge_graphs.count.hdf5: Count file obtained from SplAdder.
It contains a field “strains” that stores the sample names of the processed data. In this case the samples are: “simulated_Ipp_1_samplex”, with x being a number from 1 to 20.
These sample names are the ones accepted by the software wherever a sample id is required. For example, if one wants to set a –mutation sample in the command line, the sample name must be one of the “strains” names.
genome.fa: Reference genome. It is a fasta file containing the reference genome. For this example, the genome is generated randomly, so it does not corresponds to a biologically accurate genome. The chromosome names must be the same as in the annotation file and in the splice graph file.
genome.fa.fai: Genome index. The file is useful to access different positions of the genome faster.
simulated_Ipp.gtf: Reference annotation file. It is a gtf file containing the reference annotation used for this example. The annotation file needs to have information about the gene, gene transcripts, gene exons and gene CDS.
variants_somatic.vcf: VCF file containing the somatic variants. These somatic variants are randomly generated for the purpose of the example.
variants_germline.vcf: VCF file containing the germline variants. These germline variants are randomly generated for the purpose of the example.
Running build mode
Build mode without mutations
The main mode of immunopepper is the build mode. This mode traverses the splice graph and generates all possible peptides/kmers. It is the base of the other modes. The build mode accepts many different arguments, which add different functionalities to the mode.
Command
First of all, we will show an example with the basic commands, without any additional functionalities such as mutations. The command used for this part of the tutorial is:
immunopepper build --output-dir immunopepper_usecase/ --ann-path immunopepper/tests/data_simulated/data/build_mode/simulated_Ipp.gtf --splice-path immunopepper/tests/data_simulated/data/build_mode/genes_graph_conf3.merge_graphs.pickle --ref-path immunopepper/tests/data_simulated/data/build_mode/genome.fa --kmer 9 --count-path immunopepper/tests/data_simulated/data/build_mode/genes_graph_conf3.merge_graphs.count.hdf5 --parallel 1 --batch-size 1 --start-id 0 --process-num 0 --output-fasta --verbose 2
By calling this command, the software will generate the possible kmers/peptides for each of the 9 genes in the splice graph. It will take into account the reference genome and the annotation file, and it will generate an output for both the background and foreground peptides. The command is run on the Input data described in the section above. Moreover, the output directory is set to a folder called immunopepper_usecase, located on the directory where the command is executed. The kmer length is set to 9, as it is a common kmer length selected in clinical applications.
Terminal output
The output displayed in the command line is the following:
2023-06-22 12:48:54,100 INFO Command lineNamespace(output_dir='immunopepper_usecase/', ann_path='immunopepper/tests/data_simulated/data/build_mode/simulated_Ipp.gtf', splice_path='immunopepper/tests/data_simulated/data/build_mode/genes_graph_conf3.merge_graphs.pickle', ref_path='immunopepper/tests/data_simulated/data/build_mode/genome.fa', kmer=9, libsize_extract=False, all_read_frames=False, count_path='immunopepper/tests/data_simulated/data/build_mode/genes_graph_conf3.merge_graphs.count.hdf5', output_samples=[], heter_code=0, compressed=True, parallel=1, batch_size=1, pickle_samples=[], process_chr=None, complexity_cap=None, genes_interest=None, start_id=0, process_num=0, skip_annotation=False, libsize_path=None, output_fasta=True, force_ref_peptides=False, filter_redundant=False, kmer_database=None, gtex_junction_path=None, disable_concat=False, disable_process_libsize=False, mutation_sample=None, germline='', somatic='', sample_name_map=None, use_mut_pickle=False, verbose=2)
2023-06-22 12:48:54,100 INFO >>>>>>>>> Build: Start Preprocessing
2023-06-22 12:48:54,100 INFO Building lookup structure ...
2023-06-22 12:48:54,101 INFO Time spent: 0.000 seconds
2023-06-22 12:48:54,102 INFO Memory usage: 0.159 GB
2023-06-22 12:48:54,102 INFO Loading count data ...
2023-06-22 12:48:54,104 INFO Time spent: 0.002 seconds
2023-06-22 12:48:54,104 INFO Memory usage: 0.160 GB
2023-06-22 12:48:54,104 INFO Loading splice graph ...
2023-06-22 12:48:54,105 INFO Time spent: 0.000 seconds
2023-06-22 12:48:54,105 INFO Memory usage: 0.161 GB
2023-06-22 12:48:54,105 INFO Add reading frame to splicegraph ...
2023-06-22 12:48:54,107 INFO Time spent: 0.002 seconds
2023-06-22 12:48:54,107 INFO Memory usage: 0.161 GB
2023-06-22 12:48:54,107 INFO >>>>>>>>> Finish Preprocessing
2023-06-22 12:48:54,107 INFO >>>>>>>>> Start traversing splicegraph
2023-06-22 12:48:54,107 INFO >>>> Processing output_sample cohort, there are 9 graphs in total
2023-06-22 12:48:54,108 INFO Saving results to immunopepper_usecase/cohort_mutNone
2023-06-22 12:48:54,108 INFO Not Parallel
2023-06-22 12:48:54,108 INFO >>>>>>>>> Start Background processing
2023-06-22 12:48:54,111 INFO Saved ref_annot_peptides.fa.gz with 40 lines in 0.0003s
2023-06-22 12:48:54,111 INFO Saved ref_annot_kmer.gz with 294 lines in 0.0002s
2023-06-22 12:48:54,113 DEBUG ....cohort: annotation graph from batch all/9 processed, max time cost: 0.0, memory cost: 0.16 GB
2023-06-22 12:48:54,113 INFO >>>>>>>>> Start Foreground processing
2023-06-22 12:48:54,175 INFO Saved gene_expression_detail.gz with 9 lines in 0.0006s
2023-06-22 12:48:54,176 INFO Saved ref_sample_peptides.fa.gz with 88 lines in 0.0004s
2023-06-22 12:48:54,177 INFO Saved ref_sample_peptides_meta.gz with 44 lines in 0.0005s
2023-06-22 12:48:54,177 DEBUG ....cohort: output_sample graph from batch all/9 processed, max time cost: 0.02, memory cost: 0.16 GB
2023-06-22 12:48:54,188 INFO Saved library size results to immunopepper_usecase/expression_counts.libsize.tsv
Output files
The output files are saved in the directory immunopepper_usecase/cohort_mutNone. The output files are:
ref_annot_peptides.fa.gz:
This is a fasta file containing the background peptides for each gene transcript. The file contains a header, that is the transcript id, and the sequence of the corresponding peptide. The name also shows the mutation mode, which in this case is reference. Note: As this genome is simulated, there is a higher frequency of stop codons than in nature, that explains the existence of some short peptides.
This file shows the full transcripts found in the organism as described by the annotation file provided under –ann-path. The sequence of exons for each given transcript is obtained from the annotation file, and the regions corresponding to this exons are taken from the reference genome file provided under –ref-path, and translated to create the set of background peptides or kmers.
Output example:
fasta >gene8.t1 TSSRTMETLVP >gene1.t1 LQHNSTRSIFWH >gene10.t1 LSLVHPGTRRITKRRRQYPYVIASCQREAGCRGIICS ...
ref_annot_kmer.gz: This file contains the kmers of length 9 obtained from the background peptides. The kmers are obtained by passing a sliding window through the peptides contained in ref_annot_peptides.fa.gz. Note: For peptides that are shorter than 9 aminoacids, the kmers are not obtained.
In the example below, one can see the kmers obtained from gene8.t1.
Output example:
kmer TSSRTMETL SSRTMETLV SRTMETLVP ...
gene_expression_detail.gz: File containing gene expression information for each gene and sample. The output file is a table containing the coding genes in the rows and the samples in the columns. The gene expresison is displayed for each combination. An output example for this tutorial is:
Output example:*
gene
simulatedIpp1sample1
simulatedIpp1sample2
simulatedIpp1sample3
simulatedIpp1sample4
simulatedIpp1sample5
simulatedIpp1sample6
simulatedIpp1sample7
simulatedIpp1sample8
simulatedIpp1sample9
simulatedIpp1sample10
simulatedIpp1sample11
simulatedIpp1sample12
simulatedIpp1sample13
simulatedIpp1sample14
simulatedIpp1sample15
simulatedIpp1sample16
simulatedIpp1sample17
simulatedIpp1sample18
simulatedIpp1sample19
simulatedIpp1sample20
gene1
20688.0
17791.0
33285.0
23488.0
46584.0
31986.0
20888.0
32585.0
5499.0
13193.0
13595.0
11495.0
3199.0
25493.0
1899.0
10395.0
8496.0
20993.0
22495.0
5199.0
gene3
29585.0
58264.0
56063.0
36978.0
85934.0
52469.0
28083.0
51664.0
12189.0
21579.0
33179.0
32075.0
3496.0
31883.0
7594.0
31576.0
16291.0
54973.0
23487.0
9587.0
gene4
13643.0
28684.0
57713.0
46866.0
34632.0
43720.0
22390.0
44068.0
11195.0
16097.0
27289.0
17843.0
5592.0
19243.0
9098.0
51766.0
33586.0
12245.0
5948.0
19243.0
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
ref_sample_peptides.fa.gz: Fasta file containing the foreground peptides obtained by traversing the splice graph and identifying “short-range” novelty. This file is obtained as output because the command –output-fasta is passed to the program. The file contains a header that is the transcript id and the sequence of the corresponding peptide. The name also shows the mutation mode, which in this case is reference. Note: As this genome is simulated, there is a higher frequency of stop codons than in nature, that explains the existence of some short peptides.
Output example
fasta >gene3:1_3:0:1302:2-exons VSDGWACRGSATARPPNPRRAVLCKSIEPTYGRPSV >gene7:3_1:0:4980:2-exons FGRVPC >gene10:1_6:0:6112:2-exons YPYVIASCQREAGCRGIICS >gene15:0_2:0:9061:2-exons LS >gene10:1_5:0:6111:2-exons ISLCDRKLSEGGGLSRYNLLINCRKGFLGVINRTHVHSLPFRVLIILEPATSLDFRQPGTIDARHCFTMLTGIGNRG >gene10:0_7:0:6061:2-exons LSLVHPGTRRITKRRRQYPYVIASCQREAGCRGIICS ...
This example contains several important things to note.
First of all, it is important understand the transcript id. More information about it can be obtained in the metadata output file section. In this example, the variant id will always be 0 because there is no mutation information.
Secondly, it is important to note some short peptides such as gene7:3_1:0:4980:2-exons or gene15:0_2:0:9061:2-exons. The corresponding peptides are shorter than 9 amino acids, so they will not be shown in the kmer file. This is happening because the translation encountered a stop codon.
Finally, in this example, we get three peptides coming from gene 10. However, they are made from different vertex combinations, which results in different peptide sequences.
ref_graph_kmer_SegmExpr: This is a folder with different files. Each file contains information for the kmers derived from a specific gene, as well as the expression levels of the kmers in each sample. Kmers shown in this file are generated from a single exon and are not located across an exon junction. For a detailed description of the different fields of this file one can refer to the file 5 of the output section.
Output example:
kmer
coord
isCrossJunction
junctionAnnotated
readFrameAnnotated
simulatedIpp1sample1
simulatedIpp1sample2
simulatedIpp1sample3
simulatedIpp1sample4
simulatedIpp1sample5
simulatedIpp1sample6
simulatedIpp1sample7
simulatedIpp1sample8
simulatedIpp1sample9
simulatedIpp1sample10
simulatedIpp1sample11
simulatedIpp1sample12
simulatedIpp1sample13
simulatedIpp1sample14
simulatedIpp1sample15
simulatedIpp1sample16
simulatedIpp1sample17
simulatedIpp1sample18
simulatedIpp1sample19
simulatedIpp1sample20
RTMETLVP
5067:5094:nan:nan:None:None
False
False
True
45.64
115.36
105.56
47.54
91.91
101.54
61.33
54.53
18.4
44.64
72.92
96.13
5.88
90.36
6.78
43.46
20.87
95.22
54.16
22.57
TSSRTMETL
5061:5088:nan:nan:None:None
False
False
True
45.64
115.36
105.56
47.54
91.91
101.54
61.33
54.53
18.4
44.64
72.92
96.13
5.88
90.36
6.78
43.46
20.87
95.22
54.16
22.57
SSRTMETLV
5064:5091:nan:nan:None:None
False
False
True
45.64
115.36
105.56
47.54
91.91
101.54
61.33
54.53
18.4
44.64
72.92
96.13
5.88
90.36
6.78
43.46
20.87
95.22
54.16
22.57
LFSDAIRTS
4063:4090:nan:nan:None:None
False
False
True
56.17
101.9
112.0
127.29
142.99
112.17
53.95
103.04
35.6
54.26
59.72
83.53
10.57
86.79
11.46
51.51
37.02
133.1
73.39
28.55
HLFSDAIRT
4060:4087:nan:nan:None:None
False
False
True
56.17
101.9
112.0
127.29
142.99
112.17
53.95
103.04
35.6
54.26
59.72
83.53
10.57
86.79
11.46
51.51
37.02
133.1
73.39
28.55
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
This example contains several important things to note.
The folder contains 9 files. This is because in the file ref_sample_peptides.fa.gz there are peptides for the 9 different genes.
In this example the results for two different genes are shown. One can see that by looking at the expression levels. The expression levels for the first three kmers are the same, and the same happens for the two last kmers. This is because the gene expression is obtained at a per-gene basis. Therefore, all kmers derived from the same gene will have the same expression level.
As we are dealing with segment kmers, the fields isCrossJunction and junctionAnnotated are always False.
The field readFrameAnnotated shows whether the kmers were obtained from a read frame present in the annotation file or if they were obtained by reading frame propagation.
ref_graph_kmer_JuncExpr: This is a folder containing different files. In this case, it there are three files. Each file shows the expression levels for different kmers, across the 20 samples. For a detailed description of the different fields of this file one can refer to the file 6 of the output section.
Output example:
kmer
coord
isCrossJunction
junctionAnnotated
readFrameAnnotated
simulatedIpp1sample1
simulatedIpp1sample2
simulatedIpp1sample3
simulatedIpp1sample4
simulatedIpp1sample5
simulatedIpp1sample6
simulatedIpp1sample7
simulatedIpp1sample8
simulatedIpp1sample9
simulatedIpp1sample10
simulatedIpp1sample11
simulatedIpp1sample12
simulatedIpp1sample13
simulatedIpp1sample14
simulatedIpp1sample15
simulatedIpp1sample16
simulatedIpp1sample17
simulatedIpp1sample18
simulatedIpp1sample19
simulatedIpp1sample20
SSSLVSDGW
1140:1150:1300:1317:None:None
True
False
True
92.0
183.0
175.0
119.0
285.0
153.0
85.0
170.0
40.0
54.0
95.0
99.0
10.0
93.0
24.0
87.0
41.0
172.0
69.0
32.0
EPPTYGRPSV
1383:1400:1500:1510:None:None
True
False
False
54.0
117.0
75.0
77.0
138.0
70.0
33.0
93.0
23.0
31.0
46.0
75.0
6.0
42.0
4.0
61.0
26.0
83.0
30.0
6.0
RESSSLVSD
1134:1150:1300:1311:None:None
True
False
True
92.0
183.0
175.0
119.0
285.0
153.0
85.0
170.0
40.0
54.0
95.0
99.0
10.0
93.0
24.0
87.0
41.0
172.0
69.0
32.0
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
In this file, all the kmers appearing belong to a junction between exons. Therefore, the field isCrossJunction will always have the value True.
expression_counts.libsize.tsv: File containing 75% of expression and total expression for each sample. For a given sample, the “75% of expression” is defined as the 75th quantile of the gene expression distribution across coding genes. For a given sample the “total expression” is defined as the total gene expression across coding genes. Generated only if –-disable-libsize is set to False and if -–count-path file is provided. It is computed from the file gene_expression_detail.gz.
Output example:
sample
libsize_75percent
libsize_total_count
simulatedIpp1sample1
29374.0
212925.0
simulatedIpp1sample2
43077.0
361533.0
simulatedIpp1sample3
57713.0
485196.0
simulatedIpp1sample4
51055.0
383447.0
simulatedIpp1sample5
73240.0
541376.0
simulatedIpp1sample6
52469.0
417486.0
simulatedIpp1sample7
26440.0
239553.0
simulatedIpp1sample8
44205.0
390951.0
simulatedIpp1sample9
13190.0
103124.0
simulatedIpp1sample10
20537.0
180067.0
simulatedIpp1sample11
45020.0
305882.0
simulatedIpp1sample12
37876.0
300323.0
simulatedIpp1sample13
6248.0
48668.0
simulatedIpp1sample14
40640.0
339159.0
simulatedIpp1sample15
6696.0
49765.0
simulatedIpp1sample16
24288.0
196125.0
simulatedIpp1sample17
12985.0
110283.0
simulatedIpp1sample18
54973.0
442504.0
simulatedIpp1sample19
33586.0
263805.0
simulatedIpp1sample20
13190.0
109355.0
Annot_IS_SUCCESS: This is an empty file. It is obtained because the generation of the background (or annotation) files was successful. If the generation of the background files was not successful, this file would not be generated.
output_sample_IS_SUCCESS: This is an empty file. It is obtained because the generation of the foreground (or sample) files was successful. If the generation of the foreground files was not successful, this file would not be generated.
somatic_and_germline_sample_peptides_meta.gz: File containing details for each peptide. A detailed explanation of the output can be seen in metadata output file.
Output example:
peptide
id
readFrame
readFrameAnnotated
geneName
geneChr
geneStrand
mutationMode
hasStopCodon
isInJunctionList
isIsolated
variantComb
variantSegExpr
modifiedExonsCoord
originalExonsCoord
vertexIdx
kmerType
VSDGWACRGSATARPPNPRRAVLCKSIEPTYGRPSV
gene3:1_3:0:1302:2-exons
2
False
gene3
1
ref
1
nan
0
nan
nan
1302;1400;1500;1600
1300;1400;1500;1600
1;3
2-exons
FGRVPC
gene7:3_1:0:4980:2-exons
2
True
gene7
1
ref
1
nan
1
nan
nan
4900;4980;4700;4800
4900;5000;4700;4800
3;1
2-exons
YPYVIASCQREAGCRGIICS
gene10:1_6:0:6112:2-exons
0
True
gene10
1
ref
1
nan
1
nan
nan
6112;6250;6400;6748
6100;6250;6400;6750
1;6
2-exons
YPYVIASCQREAGCRGIICS
gene10:1_5:0:6112:2-exons
0
True
gene10
1
ref
1
nan
1
nan
nan
6112;6250;6400;6498
6100;6250;6400;6500
1;5
2-exons
ISLCDRKLSEGGGLSRYNLLINCRKGFLGVINRTHVHSLPFRVLIILEPATSLDFRQPGTIDARHCFTMLTGIGNRG
gene10:1_5:0:6111:2-exons
1
True
gene10
1
ref
1
nan
0
nan
nan
6111;6250;6400;6498
6100;6250;6400;6500
1;5
2-exons
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
Things to note from the table above:
All the ids have a variant number equal to 0. The same happens with VariantComb and VariantSegExpr, which are nan. This is because mutations are not provided.
In the third and fourth column, one can see two peptides that have the same sequence but come from different vertices. Moreover, they both have the same reading frame. This is because the sequence has a stop codon in exon 1 if reading frame 0 is used, so that in both cases only the part of exon 1 up to the mutation is translated.
On the other hand, on the fifth line, we can see how a shift of 1 nucleotide in the reading frame leads to the disappearance of the stop codon, so that the whole sequence is translated.
Build mode with mutations
In the second part of the example, we introduce somatic and germline mutations in the analysis. The command used in this tutorial to run the build mode is:
Command
immunopepper build --output-dir immunopepper_usecase/ --ann-path immunopepper/tests/data_simulated/data/build_mode/simulated_Ipp.gtf --splice-path immunopepper/tests/data_simulated/data/build_mode/genes_graph_conf3.merge_graphs.pickle --ref-path immunopepper/tests/data_simulated/data/build_mode/genome.fa --kmer 9 --count-path immunopepper/tests/data_simulated/data/build_mode/genes_graph_conf3.merge_graphs.count.hdf5 --parallel 1 --batch-size 1 --start-id 0 --process-num 0 --output-fasta --somatic immunopepper/tests/data_simulated/data/build_mode/variants_somatic.vcf --germline immunopepper/tests/data_simulated/data/build_mode/variants_germline.vcf --mutation-sample simulated_Ipp_1_sample3 --verbose 2
In this command, the build mode of immunopepper is run on the Input data described in the section above. Moreover, the output directory is set to a folder called immunopepper_usecase/cohort_mutsimulated_Ipp_1_sample3, located on the directory where the command is executed. The kmer length is set to 9, as it is a common kmer length selected in clinical applications. Finally, there are also two mutation files provided, a somatic and a germline file. These files will apply the existing mutations and take them into account when computing the output. One important thing to note is that, if mutations are provided, an extra filter layer is included. This layer will ensure that only peptides different to the reference (base genome + germline) are included in the output.
Terminal output
The output displayed in the terminal is the following:
2023-06-20 19:21:51,580 INFO Command lineNamespace(output_dir='immunopepper_usecase/', ann_path='immunopepper/tests/data_simulated/data/build_mode/simulated_Ipp.gtf', splice_path='immunopepper/tests/data_simulated/data/build_mode/genes_graph_conf3.merge_graphs.pickle', ref_path='immunopepper/tests/data_simulated/data/build_mode/genome.fa', kmer=9, libsize_extract=False, all_read_frames=False, count_path='immunopepper/tests/data_simulated/data/build_mode/genes_graph_conf3.merge_graphs.count.hdf5', output_samples=[], heter_code=0, compressed=True, parallel=1, batch_size=1, pickle_samples=[], process_chr=None, complexity_cap=None, genes_interest=None, start_id=0, process_num=0, skip_annotation=False, keep_tmpfiles=False, libsize_path=None, output_fasta=True, force_ref_peptides=False, filter_redundant=False, kmer_database=None, gtex_junction_path=None, disable_concat=False, disable_process_libsize=False, mutation_sample='simulated_Ipp_1_sample3', germline='immunopepper/tests/data_simulated/data/build_mode/variants_germline.vcf', somatic='immunopepper/tests/data_simulated/data/build_mode/variants_somatic.vcf', sample_name_map=None, use_mut_pickle=False, verbose=2)
2023-06-20 19:21:51,580 INFO >>>>>>>>> Build: Start Preprocessing
2023-06-20 19:21:51,580 INFO Building lookup structure ...
2023-06-20 19:21:51,581 INFO Time spent: 0.000 seconds
2023-06-20 19:21:51,581 INFO Memory usage: 0.146 GB
2023-06-20 19:21:51,581 INFO Loading count data ...
2023-06-20 19:21:51,584 INFO Time spent: 0.003 seconds
2023-06-20 19:21:51,584 INFO Memory usage: 0.147 GB
2023-06-20 19:21:51,585 INFO Loading splice graph ...
2023-06-20 19:21:51,586 INFO Time spent: 0.000 seconds
2023-06-20 19:21:51,586 INFO Memory usage: 0.148 GB
2023-06-20 19:21:51,586 INFO Add reading frame to splicegraph ...
2023-06-20 19:21:51,588 INFO Time spent: 0.002 seconds
2023-06-20 19:21:51,588 INFO Memory usage: 0.148 GB
2023-06-20 19:21:51,588 INFO >>>>>>>>> Finish Preprocessing
2023-06-20 19:21:51,588 INFO >>>>>>>>> Start traversing splicegraph
2023-06-20 19:21:51,588 INFO >>>> Processing output_sample cohort, there are 9 graphs in total
2023-06-20 19:21:51,588 INFO Saving results to immunopepper_usecase/cohort_mutsimulated_Ipp_1_sample3
2023-06-20 19:21:51,588 INFO Not Parallel
2023-06-20 19:21:51,588 INFO >>>>>>>>> Start Background processing
2023-06-20 19:21:51,591 INFO Saved somatic_and_germline_annot_peptides.fa.gz with 40 lines in 0.0004s
2023-06-20 19:21:51,591 INFO Saved somatic_and_germline_annot_kmer.gz with 298 lines in 0.0003s
2023-06-20 19:21:51,592 DEBUG ....cohort: annotation graph from batch all/9 processed, max time cost: 0.0, memory cost: 0.15 GB
2023-06-20 19:21:51,592 INFO >>>>>>>>> Start Foreground processing
2023-06-20 19:21:51,632 INFO Saved gene_expression_detail.gz with 9 lines in 0.0004s
2023-06-20 19:21:51,632 INFO Saved somatic_and_germline_sample_peptides.fa.gz with 46 lines in 0.0003s
2023-06-20 19:21:51,633 INFO Saved somatic_and_germline_sample_peptides_meta.gz with 23 lines in 0.0004s
2023-06-20 19:21:51,633 DEBUG ....cohort: output_sample graph from batch all/9 processed, max time cost: 0.01, memory cost: 0.15 GB
2023-06-20 19:21:51,645 INFO Saved library size results to immunopepper_usecase/expression_counts.libsize.tsv
Output files
somatic_and_germline_annot_peptides.fa.gz: This is a fasta file containing the background peptides for each gene transcript. The file contains a header that is the transcript id and the sequence of the corresponding peptide. The name also shows the mutation mode, which in this case is somatic and germline. Note: As this genome is simulated, there is a higher frequency of stop codons than in nature, that explains the existence of some short peptides.
Output example:
>gene8.t1 TSSRTMETLVP >gene4.t2 SVRRTPRFRRTAEAPVSRSLIITHLGDGGWEP >gene15.t1 TFLKKSLRLNSI ...
The peptides of the annotation will aready have the germline variants included.
somatic_and_germline_annot_kmer.gz: This file contains the kmers of length 9 obtained from the background peptides. The kmers are obtained by passing a sliding window through the peptides contained in somatic_and_germline_annot_peptides.fa.gz. Note: For that peptides that are shorter than 9, the kmers are not obtained.
The kmers in this example are obtained from gene8.t1 and gene4.t2.
Output example:
kmer TSSRTMETL SSRTMETLV SRTMETLVP SVRRTPRFR VRRTPRFRRT ...
gene_expression_detail.gz: File containing gene expression information for each gene and sample. The output file is a table containing the coding genes in the rows and the samples in the columns. The gene expresison is displayed for each combination. An output example for this tutorial is:
Output example:*
gene
simulatedIpp1sample1
simulatedIpp1sample2
simulatedIpp1sample3
simulatedIpp1sample4
simulatedIpp1sample5
simulatedIpp1sample6
simulatedIpp1sample7
simulatedIpp1sample8
simulatedIpp1sample9
simulatedIpp1sample10
simulatedIpp1sample11
simulatedIpp1sample12
simulatedIpp1sample13
simulatedIpp1sample14
simulatedIpp1sample15
simulatedIpp1sample16
simulatedIpp1sample17
simulatedIpp1sample18
simulatedIpp1sample19
simulatedIpp1sample20
gene1
20688.0
17791.0
33285.0
23488.0
46584.0
31986.0
20888.0
32585.0
5499.0
13193.0
13595.0
11495.0
3199.0
25493.0
1899.0
10395.0
8496.0
20993.0
22495.0
5199.0
gene3
29585.0
58264.0
56063.0
36978.0
85934.0
52469.0
28083.0
51664.0
12189.0
21579.0
33179.0
32075.0
3496.0
31883.0
7594.0
31576.0
16291.0
54973.0
23487.0
9587.0
gene4
13643.0
28684.0
57713.0
46866.0
34632.0
43720.0
22390.0
44068.0
11195.0
16097.0
27289.0
17843.0
5592.0
19243.0
9098.0
51766.0
33586.0
12245.0
5948.0
19243.0
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
somatic_and_germline_sample_peptides.fa.gz: Fasta file containing the foreground peptides obtained by traversing the splice graph and taking into account the mutations. This file is obtained as output because the command –output-fasta is passed to the program. The file contains a header that is the transcript id and the sequence of the corresponding peptide. The name also shows the mutation mode, which in this case is somatic and germline. Note: As this genome is simulated, there is a higher frequency of stop codons than in nature, that explains the existence of some short peptides.
Output example
>gene4:0_1:3:2112:2-exons SVRRTPRFRRTAEAPVSRSLIITHLGDEGWEP >gene4:0_2:6:2112:2-exons SVRRTPRFRRTAEAPVSRSLIITHLGDEGWEP >gene4:0_1:5:2112:2-exons SVRRTPRFRRTAEAPVSRSLIITHLGDEGWEP >gene3:0_1:0:1062:2-exons NEVIGECIACSASFDATTIGRSRHRESSSLVSDGWACRGSATARPPNPRRAVLCKSIEPTYA >gene3:0_2:3:1062:2-exons NEVIGECIACSASFDATTIGRSRHRESSSLVSDGWACRGSATARPPNPRRAVLCKSIEPTYAR ...
- Important things to note:
When the somatic and germline files are included, only peptides belonging to gene4 and gene3 are given as output. This is because only the peptides that are different to the reference are given as an output. For many of the genes and exons, there are not mutations present, which means that the peptides will be the same as in the reference.
The peptides coming from gene4 are equal to each other. This is because there is a stop codon in exon 0, and the peptide is truncated there. It can be checked by looking at the isIsolated field in the metadata.
somatic_and_germline_graph_kmer_SegmExpr: This is a folder with different files. Each file contains information for different kmers derived from somatic_and_germline_sample_peptides.fa.gz. The files contain information about the expression levels of kmers found in an exon. Kmers shown in this file are generated from a single exon and are not located across an exon junction. For a detailed description of the different fields of this file one can refer to the file 5 of the output section.
In this example one can see three peptides derived from the same gene. This explains why the expression level is the same in each sample for the three kmers, as gene expression is computed per gene.
Output example:
kmer
coord
isCrossJunction
junctionAnnotated
readFrameAnnotated
simulatedIpp1sample1
simulatedIpp1sample2
simulatedIpp1sample3
simulatedIpp1sample4
simulatedIpp1sample5
simulatedIpp1sample6
simulatedIpp1sample7
simulatedIpp1sample8
simulatedIpp1sample9
simulatedIpp1sample10
simulatedIpp1sample11
simulatedIpp1sample12
simulatedIpp1sample13
simulatedIpp1sample14
simulatedIpp1sample15
simulatedIpp1sample16
simulatedIpp1sample17
simulatedIpp1sample18
simulatedIpp1sample19
simulatedIpp1sample20
RFRRTAEAP
2130:2157:nan:nan:None:None
False
False
True
36.98
79.75
161.4
128.52
101.0
119.43
62.57
119.59
29.76
48.62
77.01
51.13
15.85
125.49
15.11
54.82
23.91
141.25
95.49
33.13
APVSRSLII
2151:2178:nan:nan:None:None
False
False
True
36.98
79.75
161.4
128.52
101.0
119.43
62.57
119.59
29.76
48.62
77.01
51.13
15.85
125.49
15.11
54.82
23.91
141.25
95.49
33.13
RRTPRFRRT
2118:2145:nan:nan:None:None
False
False
True
36.98
79.75
161.4
128.52
101.0
119.43
62.57
119.59
29.76
48.62
77.01
51.13
15.85
125.49
15.11
54.82
23.91
141.25
95.49
33.13
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
somatic_and_germline_graph_kmer_JuncExpr: This is a folder containing different files. Each file has information for different kmers derived from somatic_and_germline_sample_peptides.fa.gz. The files contain information about the expression levels of kmers found across an exon junction. For a detailed description of the different fields of this file one can refer to the file 6 of the output section.
Output example:
kmer
coord
isCrossJunction
junctionAnnotated
readFrameAnnotated
simulatedIpp1sample1
simulatedIpp1sample2
simulatedIpp1sample3
simulatedIpp1sample4
simulatedIpp1sample5
simulatedIpp1sample6
simulatedIpp1sample7
simulatedIpp1sample8
simulatedIpp1sample9
simulatedIpp1sample10
simulatedIpp1sample11
simulatedIpp1sample12
simulatedIpp1sample13
simulatedIpp1sample14
simulatedIpp1sample15
simulatedIpp1sample16
simulatedIpp1sample17
simulatedIpp1sample18
simulatedIpp1sample19
simulatedIpp1sample20
RHRESSSLV
1128:1150:1300:1305:None:None
True
False
True
92.0
183.0
175.0
119.0
285.0
153.0
85.0
170.0
40.0
54.0
95.0
99.0
10.0
93.0
24.0
87.0
41.0
172.0
69.0
32.0
EPTYARPSV
1383:1400:1500:1510:None:None
True
False
True
54.0
117.0
75.0
77.0
138.0
70.0
33.0
93.0
23.0
31.0
46.0
75.0
6.0
42.0
4.0
61.0
26.0
83.0
30.0
6.0
KSIEPTYAR
1374:1400:1500:1501:None:None
True
False
False
54.0
117.0
75.0
77.0
138.0
70.0
33.0
93.0
23.0
31.0
46.0
75.0
6.0
42.0
4.0
61.0
26.0
83.0
30.0
6.0
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
expression_counts.libsize.tsv: File containing 75% of expression and total expression for each sample. For a given sample, the “75% of expression” is defined as the 75th quantile of the gene expression distribution across coding genes. For a given sample the “total expression” is defined as the total gene expression across coding genes. Generated only if –disable-libsize is set to False and if –count-path file is provided. It is computed from the file gene_expression_detail.gz.
The output is the same as in the tutorial without mutations. This is because this file,as well as the file gene_expression_detail.gz, are generated from the same count file. Therefore, including mutations does not have an influence on the output of this file.
Output example:
sample
libsize_75percent
libsize_total_count
simulatedIpp1sample1
29374.0
212925.0
simulatedIpp1sample2
43077.0
361533.0
simulatedIpp1sample3
57713.0
485196.0
simulatedIpp1sample4
51055.0
383447.0
simulatedIpp1sample5
73240.0
541376.0
simulatedIpp1sample6
52469.0
417486.0
simulatedIpp1sample7
26440.0
239553.0
simulatedIpp1sample8
44205.0
390951.0
simulatedIpp1sample9
13190.0
103124.0
simulatedIpp1sample10
20537.0
180067.0
simulatedIpp1sample11
45020.0
305882.0
simulatedIpp1sample12
37876.0
300323.0
simulatedIpp1sample13
6248.0
48668.0
simulatedIpp1sample14
40640.0
339159.0
simulatedIpp1sample15
6696.0
49765.0
simulatedIpp1sample16
24288.0
196125.0
simulatedIpp1sample17
12985.0
110283.0
simulatedIpp1sample18
54973.0
442504.0
simulatedIpp1sample19
33586.0
263805.0
simulatedIpp1sample20
13190.0
109355.0
Annot_IS_SUCCESS: This is an empty file. It is obtained because the generation of the background (or annotation) files was successful. If the generation of the background files was not successful, this file would not be generated.
output_sample_IS_SUCCESS: This is an empty file. It is obtained because the generation of the foreground (or sample) files was successful. If the generation of the foreground files was not successful, this file would not be generated.
somatic_and_germline_sample_peptides_meta.gz: File containing details for each peptide. A detailed explanation of the output can be seen in metadata output file.
Output example:
Important things to note:
In this case, the mutation index in the id is not always 0, since we have a mutation file and we study several mutations.
The field variantComb shows the variant combination used to create the peptide. Variants in the software are SNP at DNA level.
The field variantSegExpr shows the expression of the gene with the variant combination used to create the peptide.
We can see once more that the sequence for the gene4 peptides is the same. This is because there is a stop codon in the exon 0 of the gene, so that the peptide is truncated.
Running samplespecif mode
In this case, we will apply this mode to the output of build mode, in order to remove the background kmers from the foreground samples.
Command
immunopepper samplespecif --annot-kmer-files immunopepper_usecase/cohort_mutNone/ref_annot_kmer.gz --output-dir immunopepper_usecase/samplespecif --junction-kmer-files immunopepper_usecase/cohort_mutNone/ref_graph_kmer_JuncExpr --bg-file-path immunopepper_usecase/samplespecif/bg-file.gz --output-suffix trial
Terminal output
immunopepper_usecase/cohort_mutNone/ref_graph_kmer_JuncExpr --bg-file-path immunopepper_usecase/samplespecif/bg-file.gz --output-suffix trial
2023-07-10 22:05:57,679 INFO Command lineNamespace(annot_kmer_files=['immunopepper_usecase/cohort_mutNone/ref_annot_kmer.gz'], output_dir='immunopepper_usecase/samplespecif', junction_kmer_files=['immunopepper_usecase/cohort_mutNone/ref_graph_kmer_JuncExpr'], bg_file_path='immunopepper_usecase/samplespecif/bg-file.gz', output_suffix='trial', remove_bg=False, verbose=1)
2023-07-10 22:05:57,679 INFO >>>>>>>>> Remove annotation: Start
2023-07-10 22:05:57,679 INFO ...consider annotation file:immunopepper_usecase/cohort_mutNone/ref_annot_kmer.gz
2023-07-10 22:05:57,709 INFO generated unique background kmer file in immunopepper_usecase/samplespecif/bg-file.gz
2023-07-10 22:05:57,710 INFO ...consider foreground file:['immunopepper_usecase/cohort_mutNone/ref_graph_kmer_JuncExpr/part-b7a3198e-37eb-453b-8d92-b2dd5b855d58.gz', 'immunopepper_usecase/cohort_mutNone/ref_graph_kmer_JuncExpr/part-83795625-497b-4cf8-be61-cc363197f88d.gz', 'immunopepper_usecase/cohort_mutNone/ref_graph_kmer_JuncExpr/part-ac0ec9d0-6599-4ff6-a066-7b4538007680.gz']
2023-07-10 22:05:57,731 INFO output bg-removed kmer file : immunopepper_usecase/samplespecif/ref_graph_kmer_JuncExpr_trial.gz
2023-07-10 22:05:57,731 INFO >>>>>>>>> Remove annotation: Finish
Output files
For the purpose of this example, we provided an empty path in –bg-file-path option. Therefore, this mode will also generate the intermediate file containing the unique set of background peptides. Moreover, as –remove-bg is set to False, the mode will generate a file identical to ref_graph_kmer_JuncExpr from build mode, but with an extra column is_neo_flag indicating whether the kmer is a neoantigen or not i.e. If the kmer was present in the background set. Finally, as we set the suffix to be trial, the output file will be named ref_graph_kmer_JuncExpr_trial.gz.
ref_graph_kmer_JuncExpr_trial.gz: This file contains the kmers from the splice graph, with an extra column is_neo_flag indicating whether the kmer is a neoantigen or not.
kmer
coord
isCrossJunction
junctionAnnotated
readFrameAnnotated
simulatedIpp1sample1
simulatedIpp1sample2
simulatedIpp1sample3
simulatedIpp1sample4
simulatedIpp1sample5
simulatedIpp1sample6
simulatedIpp1sample7
simulatedIpp1sample8
simulatedIpp1sample9
simulatedIpp1sample10
simulatedIpp1sample11
simulatedIpp1sample12
simulatedIpp1sample13
simulatedIpp1sample14
simulatedIpp1sample15
simulatedIpp1sample16
simulatedIpp1sample17
simulatedIpp1sample18
simulatedIpp1sample19
simulatedIpp1sample20
is_neo_flag
SSSLVSDGW
1140:1150:1300:1317:None:None
True
False
True
92.0
183.0
175.0
119.0
285.0
153.0
85.0
170.0
40.0
54.0
95.0
99.0
10.0
93.0
24.0
87.0
41.0
172.0
69.0
32.0
True
EPPTYGRPSV
1383:1400:1500:1510:None:None
True
False
False
54.0
117.0
75.0
77.0
138.0
70.0
33.0
93.0
23.0
31.0
46.0
75.0
6.0
42.0
4.0
61.0
26.0
83.0
30.0
6.0
True
RESSSLVSD
1134:1150:1300:1311:None:None
True
False
True
92.0
183.0
175.0
119.0
285.0
153.0
85.0
170.0
40.0
54.0
95.0
99.0
10.0
93.0
24.0
87.0
41.0
172.0
69.0
32.0
True
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
…
bg-file.gz: This file contains the set of unique kmers found in the background file provided.
kmer VRSSGTSKW GYKFPLSSD RRVTNSLIM RHYTLGIIA ...
Running cancerspecif mode
Simulating normal data
The cancerspecif mode takes as input the kmer files derived from the normal and cancer samples. As this example is based on simulated data, we compute the build mode on our splice graph and count information, which will correspond to the cancer data.
Therefore, we will then simulate the normal data from the cancer data. This can be done by running the code generate_normal_data.py. The intuition behind the simulation is as follows:
Drop some kmers appearing in the cancer data. These dropped kmers will be present in the cancer files but not in the normal files, the first condition for a kmer to be considered a neopeptide. Each kmer is dropped with a probability of 30%.
Once the kmers are dropped, we will have the full set of normal peptides. The next thing to do is to change the expression level of the kmers.
For some of the kmers, the expression across all the samples will be set to 0. If this is the case, the kmer will be considered as part of the annotation only and it will be removed from the final normal kmer set.
For the rest of the kmers, the expression is set to a random number between 0 and 105 for every sample.
Once the expression is set, we will have the full set of normal peptides that will be used for the filtering.
Running cancerspecif mode
In this mode, the user can perform different filtering steps to keep only the kmers that are specific to a cancer sample or a cancer cohort.
For this example, the filtering will be performed in the files generated without mutations. Moreover, junction kmers of both cancer and normal samples will be used. There are different thresholds and filters used in this example. The different options selected will be explained below:
Normal filtering: For the normal filtering, three filters were used:
Filter on presence in sample: The kmers that are only present in the annotation but not in the samples are removed from the normal kmer set.
Filter on the number of samples: –n-samples-lim-normal = 15. This filter checks in how many samples (out of 20) each kmer is expressed. If it is expressed in 15 or more samples, it will be considered as a normal kmer and it will be added to the final normal kmer set.
Filter on the expression: –cohort-expr-support-normal = 100. This filter checks the expression of the kmer across all the samples. If the expression is higher than 100 in at least one sample, the kmer will be considered as a normal kmer and it will be added to the final normal kmer set.
Cancer filtering: For the cancer filtering, three types of filtering were used:
NeoJunctions filtering: By setting –filterNeojuncCoord = C, the cancer kmers are first filtered by JunctionAnnotated. Only those kmers where JunctionAnnotated = False will be considered, since we are interested in kmers resulting from novel splicing junctions.
Sample based filtering: We select a sample of interest –ids-cancer-samples* = simulated_Ipp_1_sample3. This sample will be subjected to an individual filtering on its expression. The filter is set with the argument –sample-expr-support-cancer = 20. This filter checks the expression of the kmer in the sample of interest. If the expression is higher than 20, the kmer will be considered as a cancer kmer and it will be added as a possible cancer kmer.
Cohort based filtering: For the rest of the samples, excluding the sample of interest, we will perform a number of samples and expession filtering. The values are set with –n-samples-lim-cancer = 2 and –cohort-expr-support-cancer = 110. This filter checks in how many samples (out of 20) each kmer is expressed with an expression level higher than 110. Those kmers expressed in 2 or more samples with an expression level >= 110 will be considered as cancer kmers and they will be added as a possible cancer kmer.
Finally, a differential filtering between cancer kmers and normal kmers is performed. We will consider cancer kmers the ones fulfilling all the conditions (intersection of the three filters) and normal kmers the ones fulfilling at least one of the normal conditions (union of the two last filters). In the end, only the kmers belonging to the cancer set, and not present in the normal set, will be kept.
Command
The command to run the cancerspecif mode is:
immunopepper cancerspecif --cores 2 --mem-per-core 2048 --parallelism 3 --kmer 9 --output-dir immunopepper_usecase/filter_case/ --interm-dir-norm immunopepper_usecase/filter_case --interm-dir-canc immunopepper_usecase/filter_case --ids-cancer-samples simulated_Ipp_1_sample3 --mut-cancer-samples ref --output-count immunopepper_usecase/filter_case/output-count.txt --path-normal-matrix-edge immunopepper/tests/data_simulated/data/cancerspecif_mode/ref_graph_kmer_NormalExpr/normal_junctions.gz --n-samples-lim-normal 15 --cohort-expr-support-normal 100 --sample-expr-support-cancer 20 --cohort-expr-support-cancer 110 --n-samples-lim-cancer 2 --path-cancer-matrix-edge immunopepper/tests/data_simulated/data/cancerspecif_mode/ref_graph_kmer_CancerExpr/cancer_junctions.gz --filterNeojuncCoord C --verbose 1
Terminal output
The terminal output for this mode is:
2023-06-26 12:20:51,722 INFO Command lineNamespace(cores=2, mem_per_core=2048, parallelism=3, out_partitions=None, scratch_dir='', interm_dir_norm='immunopepper_usecase/filter_case', interm_dir_canc='immunopepper_usecase/filter_case', kmer='9', ids_cancer_samples=['simulated_Ipp_1_sample3'], mut_cancer_samples=['ref'], whitelist_normal=None, whitelist_cancer=None, path_cancer_libsize=None, path_normal_libsize=None, normalizer_cancer_libsize=None, normalizer_normal_libsize=None, output_dir='immunopepper_usecase/filter_case/', output_count='immunopepper_usecase/filter_case/output-count.txt', tag_normals='', tag_prefix='', path_normal_matrix_segm=None, path_normal_matrix_edge=['immunopepper/tests/data_simulated/data/cancerspecif_mode/ref_graph_kmer_NormalExpr/normal_junctions.gz'], n_samples_lim_normal=15, cohort_expr_support_normal=100.0, sample_expr_support_cancer=20.0, cohort_expr_support_cancer=110.0, n_samples_lim_cancer=2, path_cancer_matrix_segm=None, path_cancer_matrix_edge=['immunopepper/tests/data_simulated/data/cancerspecif_mode/ref_graph_kmer_CancerExpr/cancer_junctions.gz'], cancer_support_union=False, path_normal_kmer_list=None, uniprot=None, filterNeojuncCoord='C', filterAnnotatedRF='', tot_batches=None, batch_id=None, on_the_fly=False, verbose=1)
driver_mem 3072
memory_per_executor_mb 80% 1638
parallelism_ 3
shuffle_partitions 3
permsize 1024M
2023-06-26 12:21:06,642 INFO
>>>>>>>> Preprocessing libsizes
2023-06-26 12:21:06,642 INFO At least one intermediate normals filtering file is missing.
2023-06-26 12:21:06,642 INFO Will compute full filtering steps according to user input parameters
2023-06-26 12:21:06,642 INFO
>>>>>>>> Preprocessing Normal samples
2023-06-26 12:21:06,642 INFO Load input ['immunopepper/tests/data_simulated/data/cancerspecif_mode/ref_graph_kmer_NormalExpr/normal_junctions.gz']
2023-06-26 12:21:12,558 INFO ...partitions: 1
2023-06-26 12:21:12,698 INFO ...partitions: 1
2023-06-26 12:21:12,698 INFO Isolating kmers only in backbone annotation
2023-06-26 12:21:13,043 INFO >>>> Save to immunopepper_usecase/filter_case/kmers_derived_solely_from_annotation.tsv.gz
2023-06-26 12:21:13,880 INFO Cast types
2023-06-26 12:21:14,101 INFO ...partitions: 1
2023-06-26 12:21:14,101 INFO
>>>>>>>> Normals: Perform Hard Filtering
(expressed in 15 samples with 100.0 normalized counts
2023-06-26 12:21:14,102 INFO expression filter
2023-06-26 12:21:14,102 INFO Filter matrix with cohort expression support >= 100.0 in 1 sample
2023-06-26 12:21:14,540 INFO Save intermediate 1/2 normals filtering file to immunopepper_usecase/filter_case/interm_normals_combiExprCohortLim100.0Across1.tsv.gz
2023-06-26 12:21:15,655 INFO Filter matrix with cohort expression support > 0.0 in 1 sample
2023-06-26 12:21:16,001 INFO Save intermediate 2/2 normals filtering file to immunopepper_usecase/filter_case/interm_normals_combiExprCohortLim0.0Across1.tsv.gz
2023-06-26 12:21:16,615 INFO Filter matrix with cohort expression support > 0 in 15 sample(s)
2023-06-26 12:21:16,665 INFO Load input immunopepper_usecase/filter_case/kmers_derived_solely_from_annotation.tsv.gz
2023-06-26 12:21:16,976 INFO At least one intermediate cancer_ref filtering file is missing.
2023-06-26 12:21:16,977 INFO Will compute full filtering steps according to user input parameters
2023-06-26 12:21:16,977 INFO
>>>>>>>> Preprocessing Cancer sample simulated_Ipp_1_sample3
2023-06-26 12:21:16,977 INFO Load input ['immunopepper/tests/data_simulated/data/cancerspecif_mode/ref_graph_kmer_CancerExpr/cancer_junctions.gz']
2023-06-26 12:21:17,263 INFO ...partitions: 1
2023-06-26 12:21:17,311 INFO ...partitions: 1
2023-06-26 12:21:17,341 INFO Cast types
2023-06-26 12:21:17,473 INFO ...partitions: 1
2023-06-26 12:21:17,609 INFO ...partitions: 1
2023-06-26 12:21:17,609 INFO Filter with simulatedIpp1sample3 > 0
2023-06-26 12:21:18,270 INFO # Init_cancer n = 59 kmers
2023-06-26 12:21:18,348 INFO ...partitions: 1
2023-06-26 12:21:18,348 INFO Filter with simulatedIpp1sample3 >= 20.0
2023-06-26 12:21:18,660 INFO # Filter_Sample n = 59 kmers
2023-06-26 12:21:18,660 INFO >>>> Save to immunopepper_usecase/filter_case/condition2
2023-06-26 12:21:18,931 INFO Target sample simulatedIpp1sample3 not included in the cohort filtering
2023-06-26 12:21:18,931 INFO Filter matrix with cohort expression support >= 110.0 in 1 sample
2023-06-26 12:21:19,196 INFO Save intermediate 1/2 cancer_ref filtering file to immunopepper_usecase/filter_case/interm_cancer_ref_combiExprCohortLim110.0Across1ExceptsimulatedIpp1sample3.tsv.gz
2023-06-26 12:21:19,375 INFO Filter matrix with cohort expression support > 0.0 in 1 sample
2023-06-26 12:21:19,650 INFO Save intermediate 2/2 cancer_ref filtering file to immunopepper_usecase/filter_case/interm_cancer_ref_combiExprCohortLim0.0Across1ExceptsimulatedIpp1sample3.tsv.gz
2023-06-26 12:21:19,889 INFO Filter matrix with cohort expression support >= 110.0 in 2 sample(s)
2023-06-26 12:21:19,994 INFO support intersect
2023-06-26 12:21:20,778 INFO # Filter_Sample_Cohort n = 13 kmers
2023-06-26 12:21:20,778 INFO
>>>>>>>> Cancers: Perform differential filtering
2023-06-26 12:21:21,071 INFO partitions: 1
2023-06-26 12:21:21,071 INFO Filtering normal background
2023-06-26 12:21:21,822 INFO partitions: 1
2023-06-26 12:21:21,822 INFO >>>> Save to immunopepper_usecase/filter_case/simulated_Ipp_1_sample3_ref_SampleLim20.0CohortLim110.0Across2_FiltNormalsCohortlim100.0Across15.tsv.gz
2023-06-26 12:21:23,072 INFO # Filter_Sample_Cohort_CohortNormal n = 3 kmers
2023-06-26 12:21:24,044 INFO Save intermediate info to immunopepper_usecase/filter_case/output-count.txt
2023-06-26 12:21:24,879 INFO Closing down clientserver connection
Output files
In this mode, there are intermediate files generated for cancer and normal samples. These intermediate files will help speed up re-runs, as filtering steps can be computationally expensive for large cohorts.
Normal samples
For normal samples, only the intermediate files will be generated. These files will be later used to exclude some kmers as cancer candidates, as they will not be cancer-specific kmers.
kmers_derived_solely_from_annotation.tsv.gz: This is a folder containing the kmers that are only present in the annotation and not in the samples. This means that these kmers have a zero expression in all the samples, and therefore will be excluded from the normal dataset.
Format: The files contain a header, “kmer”, and then a kmer in each row.
kmer FLGVLPNAY LPFRVLIIL GVCTLGILR NCRKGFLGV GFLGVLPNA SSSLVSDGW ...
interm_normals_combiExprCohortLim0.0Across1.tsv.gz: This folder contains the kmers that are present with an expression bigger than zero in at least one sample. It is an intermediate file that will be later used for the filter based on the number of samples. The rest of the filtering, in which –n-lim-samples-normal threshold is applied will be performed on the fly.
Format: The files obtained in this folder will be tab-separated, and they will have two columns, with the first column showing the kmer and the second column showing the number of samples in which that kmer appears with more expression than 0.
KGFLGVCTL 19 LEPATSLDF 20 GFLGVCTLG 19 RKGFLGVCT 20 VLPNAYALI 19 PFRVLIILE 20 NCRKGFLGV 19 KGFLGVLPN 20 ...
interm_normals_combiExprCohortLim100.0Across1.tsv.gz: This folder contains the kmers that appear with an expression >= 100 in at least one sample. This is the file that will be directly used for the expression based filtering. The kmers passing the threshold will be marked as normal kmers and will be removed from the possible cancer kmers.
Format: The file is a tab seperated file with two columns, with the first column showing the kmer and the second column showing the number of samples in which that kmer appears with more or equal expression than 100.
LEPATSLDF 1 GFLGVCTLG 3 RKGFLGVCT 2 VLPNAYALI 3 PFRVLIILE 1 NCRKGFLGV 1 ...
Cancer samples
interm_cancer_ref_combiExprCohortLim0.0Across1ExceptsimulatedIpp1sample3.tsv.gz: This folder will contain the intermediate files showing the kmers having an expression bigger than 0.0 in at least 1 sample, without taking into account the target sample.
Format: The files obtained in this folder will be tab separated, and they will have two columns, with the first column showing the kmer and the second column showing the number of samples in which that kmer appears with more expression than 0.
KGFLGVCTL 19 VCTLGILRV 19 FLGVCTLGI 19 LEPATSLDF 19 GFLGVCTLG 19 FLGVLPNAY 19 ...
interm_cancer_ref_combiExprCohortLim110.0Across1ExceptsimulatedIpp1sample3.tsv.gz: This folder will contain the intermediate files showing the kmers that have an expression level higher than the expression threshold, 110 in this case, in at least one sample, without taking into account the target sample.
Format: The files obtained in this folder will be tab separated, and they will have two columns, with the first column showing the kmer and the second column showing the number of samples in which that kmer appears with more expression than 110.
SSSLVSDGW 6 EPTYGRPSV 2 RESSSLVSD 6 SRHRESSSL 6 ...
simulated_Ipp_1_sample3_ref_SampleLim20.0CohortLim110.0Across2_FiltNormalsCohortlim100.0Across15.tsv.gz: This is the file containing the cancer specific kmers, after application of the different thresholds and differential filtering against normal kmers. If an external database is not provided under –uniprot, this is the final output of the cancerspecif mode.
Format: The output is a tab separated file made up of 5 different columns. The first column contains the cancer kmer, the second column contains the coordinates of the kmer, the third column contains the expression level of the kmer in the sample of interest, the fourth column shows whether the junction is annotated or if it is a novel junction, and the last column shows whether the read frame is annotated or not.
kmer
coord
simulatedIpp1sample3
junctionAnnotated
readFrameAnnotated
HRESSSLVS
1131:1150:1300:1308:None:None
175.0
False
True
ESSSLVSDG
1137:1150:1300:1314:None:None
175.0
False
True
SLVSDGWAC
1146:1150:1300:1323:None:None
175.0
False
True
output-count.txt: This file will be generated because –output-count was provided in the arguments. It contains the number of remaining kmers after each filtering step for each sample of interest.
sample
mutation_mode
min_sample_reads
#_of_cohort_samples
reads_per_cohort_sample
#_normal_samples_allowed
normal_cohort_id
reads_per_normal_sample
Init_cancer
Filter_Sample
Filter_Sample_Cohort
Filter_Sample_Cohort_CohortNormal
simulated_Ipp_1_sample3
ref
20.0
2
110.0
15
100
59
59
13
3
Running mhcbind mode
This mode is a wrapper tool for the mhctools package. It allows to do mhc binding predictions on the output kmers from cancerspecif mode. Therefore, there are some input parameters specific to this software and some input parameters that will depend on the mhc prediction tool that the user chooses to use.
In this example, a binding affinity prediction on the three kmers given as output from the cancerspecif mode will be done. The affinity prediction will be done using the tool mhcflurry, and a threshold for affinity >=1000 is set to select the final kmers.
Note
The mhctools package requires the mhc binding affinity tools to be locally installed in the user’s computer. For more information on how to install the mhc binding affinity tools, please refer to the mhctools documentation.
Command
The command to run the mhcbind mode is the following:
Note
In this case you need to set the –mhc-software-path argument to the path where your cloned “mhctools” repository is.
immunopepper mhcbind --mhc-software-path ./immunopepper/mhctools/ --argstring "--mhc-predictor mhcflurry --mhc-alleles HLA-A*02:01 --output-csv immunopepper_usecase/mhc_bind/predictions.csv --input-peptides-file immunopepper_usecase/mhc_bind/input_peptides.csv" --partitioned-tsv immunopepper_usecase/filter_case/simulated_Ipp_1_sample3_ref_SampleLim20.0CohortLim110.0Across2_FiltNormalsCohortlim100.0Across15.tsv.gz --output-dir immunopepper_usecase/mhc_bind --bind-score-method affinity --bind-score-threshold 1000 --verbose 1
All the arguments outside the –argstring belong to the immunopepper software, while the ones inside the argstring are used by the mhctools.
Immunopepper arguments:
–mhc-software-path: Path to the mhctools package. This is a required argument.
–partitioned-tsv: Path to the output folder of the cancerspecif mode. If the user wants to run the mhcbind mode on the output of the cancerspecif mode, this argument is required.
output-dir: Path to the output folder. This is a required argument. It should match the directory given under –output-csv in the –argstring argument.
–bind-score-method: Metric used for filtering. The filtering will be done on the results of the mhc binding prediction, so the score method should be an output of the selected mhc binding prediction tool.
–bind-score-threshold: Threshold for the filtering. The filtering will be done on the results of the selected metric in the previous argument. The threshold should be a number.
–verbose: Verbosity level. This is an optional argument. The default value is 1.
Mhctools arguments:
The arguments for mhctools are the ones contained in the –argstring:
–mhc-predictor: Name of the mhc binding prediction tool to use.
–output-csv: Path to the output file and name of the output file. The format should be .csv.
–input-peptides-file: Path to the input file containing the kmers to predict. The kmers are derived from the file provided under –partitioned-tsv. The format should be .csv.
–mhc-alleles: Alleles to use for the prediction.
Terminal output
The output displayed in the terminal for the mhcbind mode is the following:
2023-06-27 15:53:05,501 INFO Command lineNamespace(mhc_software_path='./immunopepper/mhctools/', argstring='--mhc-predictor mhcflurry --mhc-alleles HLA-A*02:01 --output-csv immunopepper_usecase/mhc_bind/predictions.csv --input-peptides-file immunopepper_usecase/mhc_bind/input_peptides.csv', output_dir='immunopepper_usecase/mhc_bind', partitioned_tsv='immunopepper_usecase/filter_case/simulated_Ipp_1_sample3_ref_SampleLim20.0CohortLim110.0Across2_FiltNormalsCohortlim100.0Across15.tsv.gz', bind_score_method='affinity', bind_score_threshold=1000.0, less_than=False, verbose=2)
2023-06-27 15:53:05,892 INFO Process the outputs from cancerspecif mode
2023-06-27 15:53:05,905 INFO Launch MHC Tools with command --mhc-predictor mhcflurry --mhc-alleles HLA-A*02:01 --output-csv immunopepper_usecase/mhc_bind/predictions.csv --input-peptides-file immunopepper_usecase/mhc_bind/input_peptides.csv
2023-06-27 15:53:05,906 - mhctools.cli.args - INFO - Building MHC binding prediction type for alleles ['HLA-A*02:01'] and epitope lengths None
2023-06-27 15:53:05,906 INFO Building MHC binding prediction type for alleles ['HLA-A*02:01'] and epitope lengths None
2023-06-27 15:53:11,257 INFO Loaded 10 class1 pan allele predictors, 14839 allele sequences, 6308 percent rank distributions, and 0 allele specific models:
Wrote: immunopepper_usecase/mhc_bind/predictions.csv
2023-06-27 15:53:21,936 INFO Perform filtering with affinity >= 1000.0
2023-06-27 15:53:21,938 INFO Saving to immunopepper_usecase/mhc_bind/predictions_WithaffinityMoreLim1000.0.tsv
Output files
This mode creates files that are unique to immunopepper or files that are created by the mhctools package. This will be indicated in the output file description.
input_peptides.csv: This is the file created from –partitioned-tsv. The name is given by the user, when setting the argument –input-peptides-file in the argstring. It is a file unique to immunopepper, as it is the intermediate file that makes compatible the output of cancerspecif mode and the input to mhctools.
Format: .csv file with one column and without header. It contains a kmer per row. The kmers are the ones that passed the filtering steps in the cancerspecif mode.
HRESSSLVS ESSSLVSDG SLVSDGWAC
predictions.csv: This is the output of the mhc binding tool selected. In this case, the selected tool was mhcflurry. This is the file generated by the mhctools package. The name is selected by the user when setting –output-csv in the argstring.
Format: The format will depend on the prediction tool selected. The different prediction scores are the ones that can be used in –bind-score-method. In this case the options would be: score, affinity, percentile_rank.
source_sequence_name
offset
peptide
allele
score
affinity
percentile_rank
prediction_method_name
length
None
0
HRESSSLVS
HLA-A*02:01
0.0446471281349
30844.2694544
49.96275
mhcflurry
9
None
0
ESSSLVSDG
HLA-A*02:01
0.0493709236383
29307.415067
36.933125
mhcflurry
9
None
0
SLVSDGWAC
HLA-A*02:01
0.367525291443
937.518149464
1.700125
mhcflurry
9
predictions_WithaffinityMoreLim1000.0.tsv: This is the output of the mhcbind mode. This output combines the mhc binding predictions of mhctools with the output of cancerspecif mode. Moreover, it performs filteirng if selected. In this case, filtering was selected, so the name of the file reflects the metric and the threshold used. By looking at predictions.csv one can see that the affinity of the last kmer is 937.518149464. This is below the threshold of 1000, so this kmer is not included in the output file.
Format: The file will be in .tsv format. It will contain the columns of predictions.csv, merged with the output of cancerspecif mode.
kmer
simulatedIpp1sample3
junctionAnnotated
readFrameAnnotated
source_sequence_name
offset
allele
score
affinity
percentile_rank
prediction_method_name
length
HRESSSLVS
175
False
True
None
0
HLA-A*02:01
0.0446471
30844.3
49.9628
mhcflurry
9
ESSSLVSDG
175
False
True
None
0
HLA-A*02:01
0.0493709
29307.4
36.9331
mhcflurry
9
Running pepquery mode
This mode is a wrapper tool for the PepQuery software. It allows to do proteomic validations on the output kmers from cancerspecif or mhc_bind mode. Therefore, there are some input parameters specific to this software and some input parameters that are required in the PepQuery software tool.
Important
For this tutorial you need to first decompress the spectra file located under immunopepper/tests/data_simulated/data/pepquery_mode . One can do so by running:
tar -xzvf TCGA_E2-A10A_BH-A18Q_C8-A130_117C_W_BI_20130222_H-PM_fA.mgf.tar.gz
Pepquery mode starting from kmers
Important
This tutorial will not generate a full output and it will rise an error. This is because the input peptides are not contained in the spectra file, so no matches are found. To see an example of output go to the tutorial Pepquery mode starting from peptides below this one.
This tutorial is created to showcase how the kmers from cancerspecif mode can be provided as an input for pepQuery.
Note
This mode requires pepQuery to be installed locally on the user’s computer. PepQuery requires enough memory per core for the indexing step. We recommand to run the tutorial with 1 core and 4GB.
Command
Note
In order to properly run the tutorial, one should change the –pepquery-software-path argument to the location where pepQuery is installed.
The command used to run this tutorial is:
immunopepper pepquery --output-dir immunopepper_usecase/pepquery/ --partitioned-tsv immunopepper_usecase/filter_case/simulated_Ipp_1_sample3_ref_SampleLim20.0CohortLim110.0Across2_FiltNormalsCohortlim100.0Across15.tsv.gz/ --metadata-path immunopepper_usecase/cohort_mutNone/ref_sample_peptides_meta.gz --kmer-type junctions --verbose 1 --pepquery-software-path pepQuery/pepquery-2.0.2/pepquery-2.0.2.jar --argstring "-o immunopepper_usecase/pepquery/results/ -i immunopepper_usecase/pepquery/pepquery_input.txt -db immunopepper/tests/data_simulated/data/pepquery_mode/uniprot-proteome_UP000005640.fasta -ms immunopepper/tests/data_simulated/data/pepquery_mode/TCGA_E2-A10A_BH-A18Q_C8-A130_117C_W_BI_20130222_H-PM_fA.mgf -cpu 16"
Terminal output
The terminal output that you will see is:
2023-07-26 05:19:00,067 INFO Command lineNamespace(output_dir='immunopepper_usecase/pepquery/', argstring='-o immunopepper_usecase/pepquery/results/ -i immunopepper_usecase/pepquery/pepquery_input.txt -db immunopepper/tests/data_simulated/data/pepquery_mode/uniprot-proteome_UP000005640.fasta -ms immunopepper/tests/data_simulated/data/pepquery_mode/TCGA_E2-A10A_BH-A18Q_C8-A130_117C_W_BI_20130222_H-PM_fA.mgf', pepquery_software_path='pepQuery/pepquery-2.0.2/pepquery-2.0.2.jar', partitioned_tsv='immunopepper_usecase/filter_case/simulated_Ipp_1_sample3_ref_SampleLim20.0CohortLim110.0Across2_FiltNormalsCohortlim100.0Across15.tsv.gz/', metadata_path='immunopepper_usecase/cohort_mutNone/ref_sample_peptides_meta.gz', kmer_type='junctions', verbose=1)
2023-07-26 05:19:00,067 INFO >>>>> Running immunopepper in pepquery mode
2023-07-26 05:19:00,067 INFO >>>>> Extracting whole peptides from the filtered kmers file obtained in cancerspecif mode, and provided under immunopepper_usecase/filter_case/simulated_Ipp_1_sample3_ref_SampleLim20.0CohortLim110.0Across2_FiltNormalsCohortlim100.0Across15.tsv.gz/
2023-07-26 05:19:00,098 INFO >>>>> Extracted peptides. Saved to immunopepper_usecase/pepquery/pepquery_input.txt
2023-07-26 05:19:00,098 INFO >>>>> Launching PepQuery with command -o immunopepper_usecase/pepquery/results/ -i immunopepper_usecase/pepquery/pepquery_input.txt -db immunopepper/tests/data_simulated/data/pepquery_mode/uniprot-proteome_UP000005640.fasta -ms immunopepper/tests/data_simulated/data/pepquery_mode/TCGA_E2-A10A_BH-A18Q_C8-A130_117C_W_BI_20130222_H-PM_fA.mgf
2023-07-26 05:19:01 [INFO ] main.java.pg.CParameter[updateCParameter:873] - Task type: novel peptide identification
2023-07-26 05:19:01 [INFO ] uk.ac.ebi.pride.utilities.pridemod.io.unimod.xml.unmarshaller.UnimodUnmarshallerFactory[initializeUnmarshaller:41] - Unmarshaller Initialized
2023-07-26 05:19:01 [INFO ] main.java.OpenModificationSearch.ModificationDB[importPTMsFromUnimod:355] - All modifications in unimod:1375
2023-07-26 05:19:01 [INFO ] main.java.pg.PeptideSearchMT[search:243] - Start analysis
#############################################
PepQuery parameter:
2023-07-26 05:19:01 [INFO ] main.java.pg.DatabaseInput[getEnzymeByIndex:263] - Use enzyme:Trypsin
PepQuery version: 2.0.2
PepQuery command line: -o immunopepper_usecase/pepquery/results/ -i immunopepper_usecase/pepquery/pepquery_input.txt -db immunopepper/tests/data_simulated/data/pepquery_mode/uniprot-proteome_UP000005640.fasta -ms immunopepper/tests/data_simulated/data/pepquery_mode/TCGA_E2-A10A_BH-A18Q_C8-A130_117C_W_BI_20130222_H-PM_fA.mgf
Fixed modification: 1 = Carbamidomethylation of C
Variable modification: 2 = Oxidation of M
Max allowed variable modification: 3
Add AA substitution: false
Enzyme: 1 = Trypsin
Max Missed cleavages: 2
Precursor mass tolerance: 10.0
Range of allowed isotope peak errors: 0
Precursor ion mass tolerance unit: ppm
Fragment ion mass tolerance: 0.6
Fragment ion mass tolerance unit: Da
Scoring algorithm: 1 = Hyperscore
Min score: 12.0
Min peaks: 10
Min peptide length: 7
Max peptide length: 45
Min peptide mass: 500.0
Max peptide mass: 10000.0
Random peptide number: 10000
Fast mode: false
CPU: 16
#############################################
2023-07-26 05:19:01 [INFO ] main.java.pg.PeptideSearchMT[search:250] - Spectrum ID type:1, use 1-based number as index for a spectrum.
2023-07-26 05:19:01 [INFO ] main.java.pg.PeptideSearchMT[search:253] - Step 1: target peptide sequence preparation and initial filtering ...
2023-07-26 05:19:01 [INFO ] main.java.pg.PeptideSearchMT[search:330] - Input peptide sequences:2
2023-07-26 05:19:01 [INFO ] main.java.pg.InputProcessor[searchRefDB:413] - Don't find indexed database:immunopepper/tests/data_simulated/data/pepquery_mode/uniprot-proteome_UP000005640.fasta.sqldb
2023-07-26 05:19:01 [INFO ] main.java.pg.InputProcessor[searchRefDB:414] - Use database:immunopepper/tests/data_simulated/data/pepquery_mode/uniprot-proteome_UP000005640.fasta
2023-07-26 05:19:01 [INFO ] main.java.pg.PepMapping[init:62] - Load db file: immunopepper/tests/data_simulated/data/pepquery_mode/uniprot-proteome_UP000005640.fasta
2023-07-26 05:19:01 [INFO ] main.java.pg.PepMapping[init:78] - Start indexing fasta file
2023-07-26 05:19:04 [INFO ] main.java.pg.PepMapping[init:88] - Indexing took 0.495450139 seconds and consumes 112.873096 MB
2023-07-26 05:19:04 [INFO ] main.java.pg.PeptideSearchMT[search:407] - Valid target peptides: 2
2023-07-26 05:19:04 [INFO ] main.java.pg.PeptideSearchMT[generatePeptideInputs:1330] - Generate peptide objects ...
2023-07-26 05:19:04 [INFO ] main.java.pg.PeptideSearchMT[generatePeptideInputs:1341] - CPU: 2
2023-07-26 05:19:04 [INFO ] main.java.pg.PeptideSearchMT[generatePeptideInputs:1359] - Generate peptide objects done.
2023-07-26 05:19:04 [INFO ] main.java.pg.PeptideSearchMT[search:413] - Step 1: target peptide sequence preparation and initial filtering done: time elapsed = 0.07 min
2023-07-26 05:19:04 [INFO ] main.java.pg.PeptideSearchMT[search:418] - Step 2: candidate spectra retrieval and PSM scoring ...
2023-07-26 05:19:04 [INFO ] main.java.pg.SpectraInput[readMSMSfast:111] - Build target peptide index done.
2023-07-26 05:19:06 [INFO ] main.java.pg.SpectraInput[readMSMSfast:191] - Finished:10000
2023-07-26 05:19:08 [INFO ] main.java.pg.SpectraInput[readMSMSfast:191] - Finished:20000
2023-07-26 05:19:08 [INFO ] main.java.pg.SpectraInput[readMSMSfast:220] - Matched spectra: 0
2023-07-26 05:19:08 [INFO ] main.java.pg.SpectraInput[readMSMSfast:221] - Matched PSMs: 0
2023-07-26 05:19:10 [INFO ] main.java.pg.SpectraInput[readMSMSfast:257] - Finished:10000
2023-07-26 05:19:11 [INFO ] main.java.pg.SpectraInput[readMSMSfast:257] - Finished:20000
2023-07-26 05:19:12 [INFO ] main.java.pg.SpectraInput[readMSMSfast:272] - Saved spectra: 0
2023-07-26 05:19:12 [INFO ] main.java.pg.SpectraInput[readMSMSfast:274] - Get matched spectra done!
2023-07-26 05:19:12 [INFO ] main.java.pg.PeptideSearchMT[search:456] - Time elapsed: 0.19 min
2023-07-26 05:19:12 [INFO ] main.java.pg.PeptideSearchMT[search:476] - Step 2: candidate spectra retrieval and PSM scoring done: time elapsed = 0.19 min
2023-07-26 05:19:12 [INFO ] main.java.pg.PeptideSearchMT[search:480] - Step 3-4: competitive filtering based on reference sequences and statistical evaluation ...
2023-07-26 05:19:12 [INFO ] main.java.pg.PeptideSearchMT[search:515] - Don't find indexed database:immunopepper/tests/data_simulated/data/pepquery_mode/uniprot-proteome_UP000005640.fasta.sqldb
2023-07-26 05:19:12 [INFO ] main.java.pg.PeptideSearchMT[search:522] - Use database:immunopepper/tests/data_simulated/data/pepquery_mode/uniprot-proteome_UP000005640.fasta
2023-07-26 05:19:12 [INFO ] main.java.pg.DatabaseInput[getEnzymeByIndex:263] - Use enzyme:Trypsin
2023-07-26 05:19:20 [INFO ] main.java.pg.DatabaseInput[protein_digest:474] - Protein sequences:96808, total unique peptide sequences:2873094
2023-07-26 05:19:20 [INFO ] main.java.pg.DatabaseInput[protein_digest:475] - Time used for protein digestion:8 s.
2023-07-26 05:19:30 [INFO ] main.java.pg.DatabaseInput[generate_reference_peptide_index:522] - Generating peptide forms is done!
2023-07-26 05:19:31 [INFO ] main.java.pg.DatabaseInput[generate_reference_peptide_index:539] - Sort peptide index ...
2023-07-26 05:19:31 [INFO ] main.java.pg.DatabaseInput[generate_reference_peptide_index:543] - Sort peptide index done
2023-07-26 05:19:31 [INFO ] main.java.pg.DatabaseInput[generate_reference_peptide_index:548] - Total unique peptide: 2873094, total peptide forms:0
2023-07-26 05:19:31 [INFO ] main.java.pg.DatabaseInput[generate_reference_peptide_index:553] - Index block size:0
2023-07-26 05:19:31 [INFO ] main.java.pg.DatabaseInput[generate_reference_peptide_index:554] - Time used for building peptide index:19 s.
2023-07-26 05:19:31 [INFO ] main.java.pg.DatabaseInput[db_search:694] - Searching reference database for spectra:0
2023-07-26 05:19:31 [INFO ] main.java.pg.DatabaseInput[db_search:756] - Time used for searching reference peptides:19 s.
2023-07-26 05:19:31 [INFO ] main.java.pg.PeptideSearchMT[search:536] - Time elapsed: 0.51 min
2023-07-26 05:19:31 [INFO ] main.java.pg.PeptideSearchMT[search:557] - Peptides with matched spectra: 0
2023-07-26 05:19:31 [INFO ] main.java.pg.PeptideSearchMT[search:560] - Done!
2023-07-26 05:19:31,220 INFO >>>>> PepQuery finished successfully
2023-07-26 05:19:31,227 ERROR >>>>> PepQuery failed to generate the psm_rank.txt file. Please check the output directory. Maybe the provided spectra file does not contain any of the input peptides
ERROR conda.cli.main_run:execute(47): `conda run python /home/pballesteros/immunopepper/immunopepper/get_parsers.py pepquery --output-dir immunopepper_usecase/pepquery/ --partitioned-tsv immunopepper_usecase/filter_case/simulated_Ipp_1_sample3_ref_SampleLim20.0CohortLim110.0Across2_FiltNormalsCohortlim100.0Across15.tsv.gz/ --metadata-path immunopepper_usecase/cohort_mutNone/ref_sample_peptides_meta.gz --kmer-type junctions --verbose 1 --pepquery-software-path pepQuery/pepquery-2.0.2/pepquery-2.0.2.jar --argstring -o immunopepper_usecase/pepquery/results/ -i immunopepper_usecase/pepquery/pepquery_input.txt -db immunopepper/tests/data_simulated/data/pepquery_mode/uniprot-proteome_UP000005640.fasta -ms immunopepper/tests/data_simulated/data/pepquery_mode/TCGA_E2-A10A_BH-A18Q_C8-A130_117C_W_BI_20130222_H-PM_fA.mgf` failed. (See above for error)
Output files
pepquery_input.txt:
This is the input file for PepQuery. It is generated by expanding the provided kmers under –partitioned-tsv to their peptide context. This is done by matching the coordinates in the output kmers with the peptides coordinates provided in the metadata.
Format: It contains one peptide per line with no header.
Output example:
NEVIGECIACSASFDATTIGRSRHRESSSLVSDGWACRGSATARPPNPRRAVLCKSIEPTYG NEVIGECIACSASFDATTIGRSRHRESSSLVSDGWACRGSATARPPNPRRAVLCKSIEPTYGR
results:
It is a folder where the raw output of pepQuery should be contained. As the tool was not able to find any PSM, the folder does not contain the result file called psm_rank.txt. The folder contains some metadata files. The exact content of those files can be checked in the PepQuery documentation.
Pepquery mode starting from peptides
In this tutorial, a list of peptides is provided. The tool will run PepQuery and it will return the raw and formatted output files.
Command
The command needed to run this tutorial is:
pepquery --output-dir immunopepper_usecase/pepquery_peptides/ --verbose 1 --pepquery-software-path pepQuery/pepquery-2.0.2/pepquery-2.0.2.jar --argstring "-o immunopepper_usecase/pepquery/results_peptides/ -i immunopepper/tests/data_simulated/data/pepquery_mode/targetlist -db immunopepper/tests/data_simulated/data/pepquery_mode/uniprot-proteome_UP000005640.fasta -ms immunopepper/tests/data_simulated/data/pepquery_mode/TCGA_E2-A10A_BH-A18Q_C8-A130_117C_W_BI_20130222_H-PM_fA.mgf -cpu 16"
Terminal output
The output one will see in the terminal is:
2023-07-26 05:45:18,084 INFO Command lineNamespace(output_dir='immunopepper_usecase/pepquery_peptides/', argstring='-o immunopepper_usecase/pepquery/results_peptides/ -i immunopepper/tests/data_simulated/data/pepquery_mode/targetlist -db immunopepper/tests/data_simulated/data/pepquery_mode/uniprot-proteome_UP000005640.fasta -ms immunopepper/tests/data_simulated/data/pepquery_mode/TCGA_E2-A10A_BH-A18Q_C8-A130_117C_W_BI_20130222_H-PM_fA.mgf', pepquery_software_path='pepQuery/pepquery-2.0.2/pepquery-2.0.2.jar', partitioned_tsv=None, metadata_path=None, kmer_type=None, verbose=1)
2023-07-26 05:45:18,084 INFO >>>>> Running immunopepper in pepquery mode
2023-07-26 05:45:18,085 INFO >>>>> Launching PepQuery with command -o immunopepper_usecase/pepquery/results_peptides/ -i immunopepper/tests/data_simulated/data/pepquery_mode/targetlist -db immunopepper/tests/data_simulated/data/pepquery_mode/uniprot-proteome_UP000005640.fasta -ms immunopepper/tests/data_simulated/data/pepquery_mode/TCGA_E2-A10A_BH-A18Q_C8-A130_117C_W_BI_20130222_H-PM_fA.mgf
2023-07-26 05:45:19 [INFO ] main.java.pg.CParameter[updateCParameter:873] - Task type: novel peptide identification
2023-07-26 05:45:19 [INFO ] uk.ac.ebi.pride.utilities.pridemod.io.unimod.xml.unmarshaller.UnimodUnmarshallerFactory[initializeUnmarshaller:41] - Unmarshaller Initialized
2023-07-26 05:45:20 [INFO ] main.java.OpenModificationSearch.ModificationDB[importPTMsFromUnimod:355] - All modifications in unimod:1375
2023-07-26 05:45:20 [INFO ] main.java.pg.PeptideSearchMT[search:243] - Start analysis
#############################################
PepQuery parameter:
2023-07-26 05:45:20 [INFO ] main.java.pg.DatabaseInput[getEnzymeByIndex:263] - Use enzyme:Trypsin
PepQuery version: 2.0.2
PepQuery command line: -o immunopepper_usecase/pepquery/results_peptides/ -i immunopepper/tests/data_simulated/data/pepquery_mode/targetlist -db immunopepper/tests/data_simulated/data/pepquery_mode/uniprot-proteome_UP000005640.fasta -ms immunopepper/tests/data_simulated/data/pepquery_mode/TCGA_E2-A10A_BH-A18Q_C8-A130_117C_W_BI_20130222_H-PM_fA.mgf
Fixed modification: 1 = Carbamidomethylation of C
Variable modification: 2 = Oxidation of M
Max allowed variable modification: 3
Add AA substitution: false
Enzyme: 1 = Trypsin
Max Missed cleavages: 2
Precursor mass tolerance: 10.0
Range of allowed isotope peak errors: 0
Precursor ion mass tolerance unit: ppm
Fragment ion mass tolerance: 0.6
Fragment ion mass tolerance unit: Da
Scoring algorithm: 1 = Hyperscore
Min score: 12.0
Min peaks: 10
Min peptide length: 7
Max peptide length: 45
Min peptide mass: 500.0
Max peptide mass: 10000.0
Random peptide number: 10000
Fast mode: false
CPU: 16
#############################################
2023-07-26 05:45:20 [INFO ] main.java.pg.PeptideSearchMT[search:250] - Spectrum ID type:1, use 1-based number as index for a spectrum.
2023-07-26 05:45:20 [INFO ] main.java.pg.PeptideSearchMT[search:253] - Step 1: target peptide sequence preparation and initial filtering ...
2023-07-26 05:45:20 [INFO ] main.java.pg.PeptideSearchMT[search:330] - Input peptide sequences:2000
2023-07-26 05:45:20 [INFO ] main.java.pg.InputProcessor[searchRefDB:413] - Don't find indexed database:immunopepper/tests/data_simulated/data/pepquery_mode/uniprot-proteome_UP000005640.fasta.sqldb
2023-07-26 05:45:20 [INFO ] main.java.pg.InputProcessor[searchRefDB:414] - Use database:immunopepper/tests/data_simulated/data/pepquery_mode/uniprot-proteome_UP000005640.fasta
2023-07-26 05:45:20 [INFO ] main.java.pg.PepMapping[init:62] - Load db file: immunopepper/tests/data_simulated/data/pepquery_mode/uniprot-proteome_UP000005640.fasta
2023-07-26 05:45:20 [INFO ] main.java.pg.PepMapping[init:78] - Start indexing fasta file
2023-07-26 05:45:22 [INFO ] main.java.pg.PepMapping[init:88] - Indexing took 0.580613138 seconds and consumes 112.873096 MB
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): VDECRERGPALCGSQRCENSPGSYRCVRDCDPGYHAGPEGTCDDVNECETLQGVCGAALCENV
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): VDECQEYGPEICGAQRCENTPGSYRCTPACDPGYQPTPGGGCQDVNECETLQGVCGAALCENV
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): YEGARDGRHCVDVNECETLQGVCGAALCENV
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): VNECETLQGVCGAALCENV
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): EIVVGSNMDKIYIVMNYVEHDLKSLMETMKQPFLPGEVKTLMIQLLRGVKHLHDNWILHRDLKTSNLLLSHAGILKVSPPPSGPSQGDPPGPTHSRPSVVAGG
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): LSPTQPCPEGSGDCRKQCEPDYYLDEAGRCTACVSCSRDDLVEKTPCAWNSSRTCECRPGMICATSATNSCARCVPYPICAAETVTKPQ
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): DLVEKTPCAWNSSRTCECRPGMICATSATNSCARCVPYPICAAETVTKPQDMAEKDTTFEAPPLGTQPDCNPTPENGEAPA
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): LGDDI
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): MIFEKLRICSMPQFFCFMQDLPPLKYDPDVVVTDFRFGTIPVKLYQPKASTCTLKPGIVYYHGGGGVMGSLKTHHGICSRLCKESDSVVLAV
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): GLPYKHLITHHQEPPHRYLISTYDDHYNRHGYNPGLPPLRTWNGQKLLWLPEKSDFPLLAPPTNYGLYEQLKQRQLTPKAGLKQSTYTSSYPRPPLCAMSWREHAVPVPPHRLHPLPHF
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): MSLQAPPRLLELAEQSLLRDRALAIPTLEELPRELFPPLFMEAFTRRCCETLTTMVQAWPFTCLPLGSLMKSCNLEIFRAVLEGLDALLAQKVRPRRWKLQVLDLRNVDENFWGIWSGASALSPEALSKRRTAGNCPRPGGQQPLMVILDLCFKNGTLDECLTHFLEWGKQRKGLLHVCCKELQIFGIAIHRIIEVLNTVELDCIQEVEVCCPWELSILIRFAPYLGQMRNLRKLVLFNIHVSACIPLDRKEQFVIQFTSQFLKLDYFQKLYMHSVSFLEGHLDQLL
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): RWKLQVLDLRNVDENFWGIWSGASALSPEALSKRRTAGNCPRPGGQQPLMVILDLCFKNGTLDECLTHFLEWGKQRKGLLHVCCKELQIFGIAIHRIIEVLNTVELDCIQEVEVCCPWELSILIRFAPYLGQMRNLRKLVLFNIHVSACIPLDRKEQFVIQFTSQFLKLDYFQKLYMHSVSFLEGHLDQLLRCLQAPLETVVMTECLLSESDLKHLSWCPSIRQLKELDLRGITLTHFSPEPLSVLLEQAEATLQTLDLEDCGIVDSQLSAILPALSRCSQLSTFSFCGNLISMAALENLLRHTVGLSKLSLELYPAPLESYDAQGALCWGRFSQLGAELMKTLRDLRQPKIIVFSTVPCPRCGIRASYDLEPSHCLLNACCQGGFI
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): MSIQAPPRLLELAGQSLLRDQALSISAMEELPRVLYLPLFMEAFSRRHFQTLTVMVQAWPFTCLPLGSLMKTLHLETLKALLEGLHMLLTQKDRPRRWKLQVLD
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): GNPISMATLENLLSHTIILKNLCVEVYPAPRESYGADGTLCWSRFAQIRAELMNRVRDLRHPKRIFFCIDNCPDCGNRSFYDLEADQYCC
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): ILLYSSVDTGTQCLVSCRSPGLQPVLCLRHSPFHLLAGLQDGTLAAYPRTSGGVLWDLESPPVCLTVGPGPVRTLLSLEDAVWASCGPRVTVLEATTLQPQ
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): GVLWDLESPPVCLTVGPGPVRTLLSLEDAVWASCGPRVTVLEATTLQPQQSFEAHQDEAVSVTHMVKAGSGVWMAFSSGTSIRLFHTETLEHLQEINIATRTTFLLP
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): VLWDLESPPVCLTVGPGPVRTLLSLEDAVWASCGPRVTVLEATTLQPQQSFEAHQDEAVSVTHMVKAGSGVWMAFSSGTSIRLFHTETLEHLQEINIATRTTFLLP
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): MNLSLSQKTVKIHRLFPMLAFSEPYEFVSLEWLQKWLDESTPTKPIDNHACLCSHDKLHPDKISIMKRISEYAADIFYSRYGGGPRLTVKALCKECVVERCRILRLKNQLNEDYKTVNNLLKAAVK
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): MSSKYPRSVRRCLPLWALTLEAALILLFYFFTHYDASLEDQKGLVASYQVGQDLTVMAALGLGFLTSNFRRHSWSSVAFNLFMLALGVQWAILLDGFLSQFPPGKVVITLF
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): MAQEEEDVRDYNLTEEQKAIKAKYPPVNRKYEYQQTHLLQSPHRIYSHLHSSAIWNPYLHLMGT
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): GGSRVSNPAVMAQEEEDVRDYNLTEEQKAIKAKYPPVNRKYEYQQTHLLQSPHRIYSHLHSSAIWNPYLHLMGT
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): MKGGSRVSNPAVMAQEEEDVRDYNLTEEQKAIKAKYPPVNRKYEYQQTHLLQSPHRIYSHLHSSAIWNPYLHLMGT
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): HSVDPGLAGLLGQRAPRSQQPFVVTFFRASPSPIRTPRAVRPLRRRQPKKSNELPQANRLPGIFDDVHGSHGRQVCRRHELYVSFQDLGWL
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): DVHGSHGRQVCRRHELYVSFQDLGWLDWVIAPQGYSAYYCEGECSFPLDSCMNATNHAILQSL
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): MILGHCSLDLLGSSNPPTSASQAATGFHVRGRWRTEDCHLRTKAIETLRVAGRHQLPDRSFISFGISSLQMVSSQKLEKPI
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): FHVRGRWRTEDCHLRTKAIETLRVAGRHQLPDRSFISFGISSLQMVSSQKLEKPIEMGSSEPLPIADGDRRRKKKRRGRATDSLPGKFE
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): MKRRNLFALETGVPPWGLLWVRS
2023-07-26 05:45:23 [WARN ] main.java.pg.PeptideSearchMT[search:342] - Ignore peptide (reason: exist in reference database): TRSSGSGRQPCRMRRCGRRSRTGGTWTSWTFSWVPGMKMTSNCQMQTSGLKWTHSCLKAMTPPPVVSPGFSTA
2023-07-26 05:45:23 [INFO ] main.java.pg.PeptideSearchMT[search:407] - Valid target peptides: 1972
2023-07-26 05:45:23 [INFO ] main.java.pg.PeptideSearchMT[generatePeptideInputs:1330] - Generate peptide objects ...
2023-07-26 05:45:23 [INFO ] main.java.pg.PeptideSearchMT[generatePeptideInputs:1341] - CPU: 16
2023-07-26 05:45:24 [INFO ] main.java.pg.PeptideSearchMT[generatePeptideInputs:1359] - Generate peptide objects done.
2023-07-26 05:45:24 [INFO ] main.java.pg.PeptideSearchMT[search:413] - Step 1: target peptide sequence preparation and initial filtering done: time elapsed = 0.10 min
2023-07-26 05:45:24 [INFO ] main.java.pg.PeptideSearchMT[search:418] - Step 2: candidate spectra retrieval and PSM scoring ...
2023-07-26 05:45:29 [INFO ] main.java.pg.SpectraInput[readMSMSfast:111] - Build target peptide index done.
2023-07-26 05:45:32 [INFO ] main.java.pg.SpectraInput[readMSMSfast:191] - Finished:10000
2023-07-26 05:45:34 [INFO ] main.java.pg.SpectraInput[readMSMSfast:191] - Finished:20000
2023-07-26 05:45:34 [INFO ] main.java.pg.SpectraInput[readMSMSfast:220] - Matched spectra: 2
2023-07-26 05:45:34 [INFO ] main.java.pg.SpectraInput[readMSMSfast:221] - Matched PSMs: 2
2023-07-26 05:45:37 [INFO ] main.java.pg.SpectraInput[readMSMSfast:257] - Finished:10000
2023-07-26 05:45:38 [INFO ] main.java.pg.SpectraInput[readMSMSfast:257] - Finished:20000
2023-07-26 05:45:38 [INFO ] main.java.pg.SpectraInput[readMSMSfast:272] - Saved spectra: 2
2023-07-26 05:45:38 [INFO ] main.java.pg.SpectraInput[readMSMSfast:274] - Get matched spectra done!
2023-07-26 05:45:38 [INFO ] main.java.pg.PeptideSearchMT[search:456] - Time elapsed: 0.34 min
2023-07-26 05:45:38 [INFO ] main.java.pg.PeptideSearchMT[search:476] - Step 2: candidate spectra retrieval and PSM scoring done: time elapsed = 0.34 min
2023-07-26 05:45:38 [INFO ] main.java.pg.PeptideSearchMT[search:480] - Step 3-4: competitive filtering based on reference sequences and statistical evaluation ...
2023-07-26 05:45:38 [INFO ] main.java.pg.PeptideSearchMT[search:515] - Don't find indexed database:immunopepper/tests/data_simulated/data/pepquery_mode/uniprot-proteome_UP000005640.fasta.sqldb
2023-07-26 05:45:38 [INFO ] main.java.pg.PeptideSearchMT[search:522] - Use database:immunopepper/tests/data_simulated/data/pepquery_mode/uniprot-proteome_UP000005640.fasta
2023-07-26 05:45:38 [INFO ] main.java.pg.DatabaseInput[getEnzymeByIndex:263] - Use enzyme:Trypsin
2023-07-26 05:45:47 [INFO ] main.java.pg.DatabaseInput[protein_digest:474] - Protein sequences:96808, total unique peptide sequences:2873094
2023-07-26 05:45:47 [INFO ] main.java.pg.DatabaseInput[protein_digest:475] - Time used for protein digestion:9 s.
2023-07-26 05:46:04 [INFO ] main.java.pg.DatabaseInput[generate_reference_peptide_index:522] - Generating peptide forms is done!
2023-07-26 05:46:04 [INFO ] main.java.pg.DatabaseInput[generate_reference_peptide_index:539] - Sort peptide index ...
2023-07-26 05:46:05 [INFO ] main.java.pg.DatabaseInput[generate_reference_peptide_index:543] - Sort peptide index done
2023-07-26 05:46:05 [INFO ] main.java.pg.DatabaseInput[generate_reference_peptide_index:548] - Total unique peptide: 2873094, total peptide forms:951099
2023-07-26 05:46:05 [INFO ] main.java.pg.DatabaseInput[generate_reference_peptide_index:553] - Index block size:7861
2023-07-26 05:46:05 [INFO ] main.java.pg.DatabaseInput[generate_reference_peptide_index:554] - Time used for building peptide index:26 s.
2023-07-26 05:46:05 [INFO ] main.java.pg.DatabaseInput[db_search:694] - Searching reference database for spectra:2
2023-07-26 05:46:05 [INFO ] main.java.pg.DatabaseInput[db_search:756] - Time used for searching reference peptides:27 s.
2023-07-26 05:46:05 [INFO ] main.java.pg.PeptideSearchMT[search:536] - Time elapsed: 0.79 min
2023-07-26 05:46:05 [INFO ] main.java.pg.PeptideSearchMT[search:557] - Peptides with matched spectra: 2
2023-07-26 05:46:05 [INFO ] main.java.pg.PeptideSearchMT[search:574] - Step 3-4: competitive filtering based on reference sequences and statistical evaluation done: time elapsed = 0.79 min
2023-07-26 05:46:05 [INFO ] main.java.pg.PeptideSearchMT[exportMGF:1309] - All spectra have been exported to file: /home/pballesteros/immunopepper_usecase/pepquery/results_peptides//psm_rank.mgf
2023-07-26 05:46:05 [INFO ] main.java.pg.PeptideSearchMT[search:580] - Step 5: competitive filtering based on unrestricted post-translational modification searching ...
2023-07-26 05:46:05 [INFO ] main.java.OpenModificationSearch.ModificationDB[doPTMValidation:1464] - Time used for PTM validation:0 s.
2023-07-26 05:46:05 [INFO ] main.java.pg.PeptideSearchMT[summariseResult:857] - Identified PSMs: 0
2023-07-26 05:46:05 [INFO ] main.java.pg.PeptideSearchMT[summariseResult:858] - Identified peptides: 0
2023-07-26 05:46:05 [INFO ] main.java.pg.PeptideSearchMT[search:589] - Step 5: competitive filtering based on unrestricted post-translational modification searching done: time elapsed = 0.79 min
2023-07-26 05:46:05 [INFO ] main.java.pg.PeptideSearchMT[search:617] - Time elapsed: 0.79 min
2023-07-26 05:46:05 [INFO ] main.java.pg.PeptideSearchMT[search:618] - End.
2023-07-26 05:46:05,880 INFO >>>>> PepQuery finished successfully
2023-07-26 05:46:05,935 INFO >>>>> Processed output file saved to immunopepper_usecase/pepquery_peptides/peptides_validated.tsv.gz
2023-07-26 05:46:05,935 INFO >>>>> Finished running immunopepper in pepquery mode
Output files
results_peptides: This is a folder containing the raw output of the PepQuery software. For more information on the different files contained in this folder one can look the PepQuery documentation.
peptides_validated.tsv.gz: This file contains the results obtained from the PepQuery software but formatted. It explains the filtering step at which each peptide-spectra match (PSM) failed. The file will only show the peptides involved in a PSM.
The output file looks like this:
peptide
modification
spectrum
score
confident
pvalue
# of reference DB peptides matching spectrum better than study peptide
# random shuffled peptides matching spectrum better that study peptide
# ptm-modified proteins matching better the spectra filtering summary than study peptide
filtering summary
GSPGIRGPQGITGPKGATGSAGQAGRPGSPGHQGVA
20517
14.876393794788058
No
NaN
27
NaN
NaN
Failed at competitive filtering based on reference sequences (step 3).
LDGSLPPDSRLENNMLMLPSVQPQD
Oxidation of M@15[15.9949];Oxidation of M@17[15.9949]
2757
12.5103564275612
No
NaN
177
NaN
NaN
Failed at competitive filtering based on reference sequences (step 3).
In this result, one can only see two peptides. This is because PSM were only found for these two peptides. The first peptide does not contain any modification, while the second one contains two modifications. Both peptides failed in the filtering step 3: spectra matching against reference peptides. In both cases, there is still at least one reference peptide that match the spectra better. As the peptides fail the filtering step, the identification is not confident, meaning that one cannot assure the identified peptide is expressed in our sample at the protein level.