bcell-sequence-based - version 0.1-beta
============================================

Introduction
------------

This package contains a collection of methods to predict linear B-cell
epitopes based on sequence characteristics of the antigen using amino acid
scales and HMMs. The collection is a mixture of Python scripts and Linux
environment-specific binaries for the Bepipred method.

Release Notes
-------------

v0.1 Initial public beta release

Prerequisites
-------------

- Python 3.10 or higher
  http://www.python.org/

Installation
------------

Below, we use the example of installing to ~/iedb_tools.

1. Extract the code and change directory:

   mkdir ~/iedb_tools
   tar -xvzf IEDB_NG_BCELL-SEQUENCE-BASED-0.1-beta.tar.gz -C ~/iedb_tools
   cd ~/iedb_tools/ng_bcell-sequence-based-0.1-beta

2. (Optional) Create and activate a Python 3.10+ virtual environment.
   Example using venv:

   python3 -m venv ~/venvs/bcellseq
   source ~/venvs/bcellseq/bin/activate

3. Install Python requirements:

   pip3 install --upgrade pip
   pip3 install -r requirements.txt

4. Run the configure script to set up path variables:

   ./configure

   Note: This script creates a .env file in the project root, which is
   critical for the application to function correctly.

Quick Start
-----------

The simplest way to run a prediction:

  # Inline sequence (shortest): method + sequence as positional argument
  python3 src/run_b_cell_sequence_based.py predict -m Chou-Fasman -w 6 ADVAGHGQDILIRLFKSHPETLEKFD

  # Using a sequence file and method name:
  python3 src/run_b_cell_sequence_based.py predict -i examples/single_sequence.fasta -m Emini

  # Using a UniProt ID:
  python3 src/run_b_cell_sequence_based.py predict -u P02185 -m Emini

  # Using a JSON configuration file:
  python3 src/run_b_cell_sequence_based.py predict -j examples/bepipred3_single_sequence.json

For help on any command:

  python3 src/run_b_cell_sequence_based.py --help

Basic Usage - Predict Command
------------------------------

The predict command is the main way to run predictions.
It supports four input formats:

1. Sequence File Input (-i)
------------------------

Run predictions directly on FASTA or text sequence files.

Single sequence:

  python3 src/run_b_cell_sequence_based.py predict -i examples/single_sequence.fasta -m Emini
  python3 src/run_b_cell_sequence_based.py predict -i examples/single_sequence.fasta -m Bepipred3 -o output.json

Multiple sequences:

  python3 src/run_b_cell_sequence_based.py predict -i examples/multiple_sequences.fasta -m Emini
  python3 src/run_b_cell_sequence_based.py predict -i examples/multiple_sequences.txt -m Bepipred3

2. UniProt ID Input (-u)
----------------------

Fetch sequence(s) from UniProt and run prediction. Pass one ID or multiple
comma-separated IDs:

  python3 src/run_b_cell_sequence_based.py predict -u P02185 -m Emini
  python3 src/run_b_cell_sequence_based.py predict -u P02185 -m Bepipred3
  python3 src/run_b_cell_sequence_based.py predict -u P02185,P29320 -m Emini

3. JSON Configuration File (-j)
------------------------------

Run predictions using a JSON configuration file (see JSON Format section
below):

  python3 src/run_b_cell_sequence_based.py predict -j examples/bepipred3_single_sequence.json
  python3 src/run_b_cell_sequence_based.py predict -j examples/bepipred3_single_sequence.json -o results/output.json

4. Inline Sequence (positional)
------------------------------

Pass the sequence directly as the last argument (raw amino acids). A method
(-m) is required; the window size (-w) is optional.
For multiple sequences, separate them with a comma (,) on the command line:

  python3 src/run_b_cell_sequence_based.py predict -m Chou-Fasman -w 6 ADVAGHGQDILIRLFKSHPETLEKFD
  python3 src/run_b_cell_sequence_based.py predict -m Emini VLSEGEWQLVLHVWAK
  python3 src/run_b_cell_sequence_based.py predict -m Emini VLSEGEWQLVLHVWAK,MKTIIALSYIFCLVFADYKDDDDK

Predict Command Arguments
--------------------------

Required:
  -m, --method                 Method name (required when using -i, -u, or
                               positional sequence)
                               Examples: Emini, Bepipred3, Chou-Fasman,
                               Kolaskar-Tongaonkar, Parker

Optional:
  -o, --output-prefix          Output path (file or directory). Format inferred
                               from extension if -f not specified. If not
                               specified, results print to console.
  -f, --output-format          Output format (tsv, json, stdout). Default is
                               stdout.
  -w, --window-size            Window size (uses method default if not
                               specified)
  -t, --threshold              Threshold value (uses method default if not
                               specified)
  --no-plot                    Disable plot generation
  -r, --display-all-residues   When using multiple methods, include all
                               residues in output (outer merge; missing method
                               scores shown as blank).
  -a, --display-all-rows       Same as -r; display all rows including those
                               with missing data from other methods.
  -j, --input-json             Path to JSON configuration file (alternative to
                               -i, -u, or positional sequence)

Note: When using -i, -u, or inline sequence, you must specify a method with -m.
When using -j, the method(s) are specified in the JSON file.
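
The argument rules above can be sketched as a small Python helper that
assembles a predict command line. The helper name build_predict_command and its
keyword arguments are hypothetical and not part of this package; only the
script path and the flags (-m, -i, -u, -j, -w, -t, -o) are taken from this
README. Requiring exactly one input source is an assumption made for the
sketch. The helper only builds the argument list and does not run it.

```python
def _join(value):
    # Accept a single string or a list of strings; the CLI expects commas.
    return value if isinstance(value, str) else ",".join(value)


def build_predict_command(method=None, sequence=None, input_file=None,
                          uniprot_ids=None, json_config=None,
                          window_size=None, threshold=None, output=None):
    """Build an argv list for the predict command (hypothetical helper)."""
    sources = [s for s in (sequence, input_file, uniprot_ids, json_config)
               if s is not None]
    # Assumption: exactly one input source is given per invocation.
    if len(sources) != 1:
        raise ValueError("give exactly one of: sequence, input_file, "
                         "uniprot_ids, json_config")
    # -m is required unless the methods come from the JSON file.
    if json_config is None and method is None:
        raise ValueError("-m is required with -i, -u, or an inline sequence")

    cmd = ["python3", "src/run_b_cell_sequence_based.py", "predict"]
    if json_config is not None:
        cmd += ["-j", json_config]
    else:
        cmd += ["-m", method]
        if input_file is not None:
            cmd += ["-i", input_file]
        if uniprot_ids is not None:
            cmd += ["-u", _join(uniprot_ids)]
    if window_size is not None:
        cmd += ["-w", str(window_size)]
    if threshold is not None:
        cmd += ["-t", str(threshold)]
    if output is not None:
        cmd += ["-o", output]
    if sequence is not None:
        # The inline sequence is passed as the last positional argument.
        cmd.append(_join(sequence))
    return cmd


example = build_predict_command(method="Chou-Fasman", window_size=6,
                                sequence="ADVAGHGQDILIRLFKSHPETLEKFD")
print(" ".join(example))
```

The resulting list could be passed to subprocess.run(example, check=True) from
the project root, once the installation steps have been completed.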

Available Methods
-----------------

The following prediction methods are available (case-insensitive, accepts
hyphens or underscores, with optional version suffix):

- bepipred3 (or bepipred, bepipred-3.0) - default version 3.0
- chou_fasman (or chou-fasman, Chou-Fasman, chou-fasman-1.0) - default version 1.0
- emini (or Emini, emini-1.0) - default version 1.0
- karplus_schulz (or karplus-schulz, Karplus-Schulz, karplus-schulz-1.0) - default version 1.0
- kolaskar_tongaonkar (or kolaskar-tongaonkar, Kolaskar-Tongaonkar, kolaskar-tongaonkar-1.0) - default version 1.0
- parker (or Parker, parker-1.0) - default version 1.0

Method names are normalized automatically. If a version is not specified, the
default version for that method will be used.

If an invalid version is specified, the tool will display an error message
showing the invalid version and all available versions for that method. For
example:

  $ python3 src/run_b_cell_sequence_based.py predict -i examples/single_sequence.fasta -m emini-1.3
  ValueError: Version '1.3' is not available for method 'Emini'. Available versions: 1.0

Advanced Usage - Distributed Processing
---------------------------------------

For processing multiple sequences or methods, you can use a three-step
workflow that allows distributed processing:

1. Preprocess: Split input into job units

   python3 src/run_b_cell_sequence_based.py preprocess -j examples/bepipred3_single_sequence.json

   This creates individual parameter files in predict-inputs/params/ and a
   job_descriptions.json file listing all jobs to be executed.

2. Predict: Run each command listed in job_descriptions.json (each command
   runs predict -j predict-inputs/params/.json -o ...). These jobs can be
   distributed across multiple machines.

3. Postprocess: Aggregate results

   python3 src/run_b_cell_sequence_based.py postprocess -j job_descriptions.json -o results/output

   Options:
     -j, --job-desc-file    Path to job_descriptions.json file (required)
     -o, --output-prefix    Output file prefix (e.g., results/output)
     -f, --output-format    Output format (json, default: json)

Example Files (examples/)
-------------------------

Naming convention for example JSON configs:

- Single method, single sequence from file: {method}_single_sequence.json
  (e.g. bepipred3_single_sequence.json, chou_fasman_single_sequence.json)
- Single method, single UniProt ID: {method}_single_uniprot.json
  (e.g. chou_fasman_single_uniprot.json)
- Single method, multiple UniProt IDs: {method}_multiple_uniprot.json
  (e.g. chou_fasman_multiple_uniprot.json)
- Single method, multiple sequences: {method}_multiple_sequences.json
  (e.g. chou_fasman_multiple_sequences.json)
- Multiple methods, single sequence: multiple_methods_single_sequence.json
- Multiple methods, single UniProt ID: multiple_methods_single_uniprot.json
- Multiple methods, multiple UniProt IDs: multiple_methods_multiple_uniprot.json
- Multiple methods, multiple sequences: multiple_methods_multiple_sequences.json

JSON Input Format
-----------------

When using JSON input (-j), the file must include one of:
input_sequence_text_file_path, uniprot_id, or input_sequence_text. The file
must be formatted as described below.

Example 1: Single method with sequence file

  {
    "input_sequence_text_file_path": "examples/single_sequence.txt",
    "predictors": [
      {
        "method": "bepipred3",
        "window_size": 9,
        "threshold": 0.1512,
        "plot_data": true
      }
    ]
  }

Example 2: Multiple methods

  {
    "input_sequence_text_file_path": "examples/multiple_sequences.fasta",
    "predictors": [
      {
        "method": "bepipred3",
        "threshold": 0.1512,
        "plot_data": true
      },
      {
        "method": "chou_fasman",
        "window_size": 7,
        "threshold": 0.903,
        "plot_data": true
      },
      {
        "method": "emini",
        "plot_data": true
      }
    ]
  }

Example 3: Using UniProt ID

  {
    "uniprot_id": "P02185",
    "predictors": [
      {
        "method": "bepipred3"
      }
    ]
  }

JSON Field Descriptions
------------------------

- input_sequence_text_file_path
    Path to a file containing sequences. The file can be in one of two
    formats:
    - Text file: One sequence per line (no FASTA headers),
      e.g., examples/multiple_sequences.txt
    - FASTA file: Sequences with headers (lines starting with >),
      e.g., examples/multiple_sequences.fasta
    Required if uniprot_id and input_sequence_text are not provided.

- input_sequence_text
    Inline sequence(s) as a single string. Multiple sequences can be separated
    by newline (\\n).
    Required if input_sequence_text_file_path and uniprot_id are not provided.

- uniprot_id
    One UniProt accession ID or multiple comma-separated IDs
    (e.g. "P02185" or "P02185,P29320"). Sequence(s) will be fetched
    automatically.
    Required if input_sequence_text_file_path and input_sequence_text are not
    provided.

- predictors
    An array of predictor configurations. Each predictor requires a "method"
    field and may include optional parameters.

    - method
        Method name (see Available Methods section above).
    - plot_data
        Whether to generate plotted output. Default: true
    - Other parameters
        Method-specific parameters are specified directly in the predictor
        object (not nested). If not provided, defaults are used (see the
        Method Parameters section).
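
As an illustration of the fields described above, the following Python sketch
builds a configuration and serializes it to JSON. The helper make_config and
its argument names are hypothetical and not part of this package; the JSON keys
and example values are taken from this README. Treating the three sequence
sources as mutually exclusive is an assumption of the sketch.

```python
import json


def make_config(predictors, input_file=None, uniprot_id=None,
                sequence_text=None):
    """Build a JSON-ready config dict (hypothetical helper)."""
    sources = {
        "input_sequence_text_file_path": input_file,
        "uniprot_id": uniprot_id,
        "input_sequence_text": sequence_text,
    }
    given = {key: value for key, value in sources.items()
             if value is not None}
    # Assumption: exactly one sequence source is supplied per config.
    if len(given) != 1:
        raise ValueError("provide exactly one of: " + ", ".join(sources))
    for predictor in predictors:
        # Each predictor needs a "method"; other keys are optional and flat.
        if "method" not in predictor:
            raise ValueError("each predictor needs a 'method' field")
    config = dict(given)
    config["predictors"] = [dict(p) for p in predictors]
    return config


config = make_config(
    predictors=[
        {"method": "bepipred3", "threshold": 0.1512},
        {"method": "chou_fasman", "window_size": 7},
    ],
    uniprot_id="P02185,P29320",
)
text = json.dumps(config, indent=2)
print(text)
```

The printed text can be saved to a file and passed to the predict command
with -j.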

Method Parameters
-----------------

Scale-Based Methods (chou_fasman, emini, karplus_schulz, kolaskar_tongaonkar, parker)
-------------------------------------------------------------------------------------

window_size
    Sliding window size for score calculation.
    Defaults: chou_fasman=7, emini=6, karplus_schulz=7,
    kolaskar_tongaonkar=7, parker=7

threshold
    Threshold for epitope prediction.
    Defaults: chou_fasman=0, emini=0, karplus_schulz=0,
    kolaskar_tongaonkar=0, parker=0.
    When 0 (or not set), the effective threshold is the average of the scores
    for that sequence and method (for Emini and Kolaskar-Tongaonkar epitope
    prediction; assignment columns use score >= threshold).

plot_data
    Whether to generate plotted output. Default: true

Note: Classical methods (Parker, Chou-Fasman, Karplus-Schulz) produce scores
and statistics but do not generate explicit epitope predictions. Only Emini
and Kolaskar-Tongaonkar produce epitope tables.

Bepipred3.0 Parameters
----------------------

pred {mjv_pred, vt_pred}
    Majority vote ensemble prediction or variable threshold prediction based
    on average ensemble positive probabilities. Default: vt_pred

threshold
    Threshold for predictions based on average ensemble positive probability.
    Default: 0.1512

window_size
    Window size for rolling average on linear epitope scores. Default: 9

top_cands
    Top percentage of candidate residues to display. Default: 0.2 (20%)

add_seq_len
    Add sequence lengths to ESM encodings. Default: false

esm_dir
    Directory to save ESM encodings.
    Default: false (uses current working directory)

plot_linear_epitope_scores
    Use linear B-cell epitope probability scores for plotting. Default: false

zip_results
    Create a ZIP archive of results (excluding interactive HTML figure).
    Default: false

plot_data
    Whether to generate plotted output. Default: true

Output Format
-------------

The tool generates JSON output with the following structure:

- residue_table
    Contains per-residue predictions with scores, percentile ranks, harmonic
    mean, and assignments for each method.

- linear_epitope_table
    Contains predicted epitopes (for Emini and Kolaskar-Tongaonkar methods
    only). Columns include:
    - core.sequence_number: Sequence identifier
    - core.start: Epitope start position
    - core.end: Epitope end position
    - core.peptide: Epitope sequence
    - core.length: Epitope length

- predictor_threshold
    Lists the threshold values used for each method.

All numeric values (scores, percentile ranks, harmonic mean) are rounded to 2
decimal places. Percentile ranks are calculated such that higher scores
correspond to higher percentile ranks.

Output files are saved to the path specified by -o (or printed to console if
not specified). Format is inferred from the file extension if -f is not
specified. For distributed processing workflows, intermediate results are
stored in the 'aggregate' subdirectory.

Caveats
-------

All IEDB next-generation standalone tools are developed primarily to support
the website. Some user-facing features may be limited but will improve as the
tools mature.

Contact
-------

Please contact us with any issues or questions using the channels below:

IEDB Help Desk: https://help.iedb.org/
Email: help@iedb.org