bcell-sequence-based - version 0.1-beta
============================================

Introduction
------------
This package contains a collection of methods that predict linear B-cell epitopes
from sequence characteristics of the antigen, using amino acid scales and HMMs.
The collection is a mixture of Python scripts and Linux-specific binaries for the
Bepipred method.


Release Notes
-------------
v0.1
  Initial public beta release


Prerequisites
-------------
- Python 3.10 or higher
  http://www.python.org/


Installation
------------
Below, we use the example of installing to ~/iedb_tools.

1. Extract the code and change directory:
     mkdir ~/iedb_tools
     tar -xvzf IEDB_NG_BCELL-SEQUENCE-BASED-0.1-beta.tar.gz -C ~/iedb_tools
     cd ~/iedb_tools/ng_bcell-sequence-based-0.1-beta

2. (Optional) Create and activate a Python 3.10+ virtual environment.
   Example using venv:
     python3 -m venv ~/venvs/bcellseq
     source ~/venvs/bcellseq/bin/activate

3. Install Python requirements:
     pip3 install --upgrade pip
     pip3 install -r requirements.txt

4. Run the configure script to set up path variables:
     ./configure

   Note: This script creates a .env file in the project root. The application
   requires this file to function correctly.


Quick Start
-----------
The simplest ways to run a prediction:

  # Inline sequence (shortest): method + sequence as positional argument
  python3 src/run_b_cell_sequence_based.py predict -m Chou-Fasman -w 6 ADVAGHGQDILIRLFKSHPETLEKFD

  # Using a sequence file and method name:
  python3 src/run_b_cell_sequence_based.py predict -i examples/single_sequence.fasta -m Emini

  # Using a UniProt ID:
  python3 src/run_b_cell_sequence_based.py predict -u P02185 -m Emini

  # Using a JSON configuration file:
  python3 src/run_b_cell_sequence_based.py predict -j examples/bepipred3_single_sequence.json

For help on any command:
  python3 src/run_b_cell_sequence_based.py <command> --help


Basic Usage - Predict Command
------------------------------
The predict command is the main way to run predictions. It supports four input
formats:

1. Sequence File Input (-i)
   ------------------------
   Run predictions directly on FASTA or text sequence files.

   Single sequence:
     python3 src/run_b_cell_sequence_based.py predict -i examples/single_sequence.fasta -m Emini
     python3 src/run_b_cell_sequence_based.py predict -i examples/single_sequence.fasta -m Bepipred3 -o output.json

   Multiple sequences:
     python3 src/run_b_cell_sequence_based.py predict -i examples/multiple_sequences.fasta -m Emini
     python3 src/run_b_cell_sequence_based.py predict -i examples/multiple_sequences.txt -m Bepipred3


2. UniProt ID Input (-u)
   ---------------------
   Fetch one or more sequences from UniProt and run a prediction. Pass one ID or
   multiple comma-separated IDs:

     python3 src/run_b_cell_sequence_based.py predict -u P02185 -m Emini
     python3 src/run_b_cell_sequence_based.py predict -u P02185 -m Bepipred3
     python3 src/run_b_cell_sequence_based.py predict -u P02185,P29320 -m Emini


3. JSON Configuration File (-j)
   ----------------------------
   Run predictions using a JSON configuration file (see the JSON Input Format section below):

     python3 src/run_b_cell_sequence_based.py predict -j examples/bepipred3_single_sequence.json
     python3 src/run_b_cell_sequence_based.py predict -j examples/bepipred3_single_sequence.json -o results/output.json


4. Inline Sequence (positional)
   ----------------------------
   Pass the sequence directly as the last argument (raw amino acids). A method (-m) is
   required; the window size (-w) is optional. For multiple sequences, separate them
   with a comma (,) on the command line:

     python3 src/run_b_cell_sequence_based.py predict -m Chou-Fasman -w 6 ADVAGHGQDILIRLFKSHPETLEKFD
     python3 src/run_b_cell_sequence_based.py predict -m Emini VLSEGEWQLVLHVWAK
     python3 src/run_b_cell_sequence_based.py predict -m Emini VLSEGEWQLVLHVWAK,MKTIIALSYIFCLVFADYKDDDDK


Predict Command Arguments
--------------------------
Required:
  -m, --method        Method name (required when using -i, -u, or positional sequence)
                     Examples: Emini, Bepipred3, Chou-Fasman, Kolaskar-Tongaonkar, Parker

Optional:
  -o, --output-prefix          Output path (file or directory). Format is inferred
                               from the extension if -f is not specified. If -o is
                               not specified, results print to the console.
  -f, --output-format          Output format (tsv, json, stdout). Default is stdout.
  -w, --window-size            Window size (uses method default if not specified)
  -t, --threshold              Threshold value (uses method default if not specified)
  --no-plot                    Disable plot generation
  -r, --display-all-residues   When using multiple methods, include all residues in
                               the output (outer merge; missing method scores are
                               shown as blank).
  -a, --display-all-rows       Same as -r; display all rows, including those with
                               missing data from other methods.
  -j, --input-json             Path to JSON configuration file (alternative to -i,
                               -u, or positional sequence)

Note: When using -i, -u, or inline sequence, you must specify a method with -m. When using -j,
the method(s) are specified in the JSON file.
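
The argument rules above can be sketched as a small Python helper that assembles
a predict command line. This is only an illustration: build_predict_command is a
hypothetical name that is not part of this package, and the rule that exactly one
input source is given per call is an assumption, not documented behavior. The
flags themselves are the ones listed above.

```python
import shlex

SCRIPT = "src/run_b_cell_sequence_based.py"  # script path used throughout this README


def build_predict_command(method=None, *, sequence=None, input_file=None,
                          uniprot_ids=None, input_json=None,
                          window_size=None, threshold=None, output=None):
    """Return an argv list for one predict call (hypothetical helper)."""
    sources = [sequence, input_file, uniprot_ids, input_json]
    # Assumption: the four input formats are mutually exclusive.
    if sum(source is not None for source in sources) != 1:
        raise ValueError("give exactly one of: sequence, input_file, "
                         "uniprot_ids, input_json")
    if input_json is None and method is None:
        raise ValueError("-m is required with -i, -u, or an inline sequence")

    cmd = ["python3", SCRIPT, "predict"]
    if input_json is not None:
        # With -j the methods come from the JSON file, so -m is not passed.
        cmd += ["-j", input_json]
    else:
        cmd += ["-m", method]
    if input_file is not None:
        cmd += ["-i", input_file]
    if uniprot_ids is not None:
        cmd += ["-u", ",".join(uniprot_ids)]
    if window_size is not None:
        cmd += ["-w", str(window_size)]
    if threshold is not None:
        cmd += ["-t", str(threshold)]
    if output is not None:
        cmd += ["-o", output]
    if sequence is not None:
        cmd.append(sequence)  # the inline sequence is the last positional argument
    return cmd


# Print the command as a shell-ready string instead of running it.
print(shlex.join(build_predict_command(
    "Chou-Fasman", sequence="ADVAGHGQDILIRLFKSHPETLEKFD", window_size=6)))
```

The returned list can be passed to subprocess.run(cmd, check=True) from the
project root once the package is installed and configured.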


Available Methods
-----------------
The following prediction methods are available (case-insensitive, accepts hyphens
or underscores, with optional version suffix):

  - bepipred3 (or bepipred, bepipred-3.0) - default version 3.0
  - chou_fasman (or chou-fasman, Chou-Fasman, chou-fasman-1.0) - default version 1.0
  - emini (or Emini, emini-1.0) - default version 1.0
  - karplus_schulz (or karplus-schulz, Karplus-Schulz, karplus-schulz-1.0) - default version 1.0
  - kolaskar_tongaonkar (or kolaskar-tongaonkar, Kolaskar-Tongaonkar, kolaskar-tongaonkar-1.0) - default version 1.0
  - parker (or Parker, parker-1.0) - default version 1.0

Method names are normalized automatically. If a version is not specified, the
default version for that method will be used.

If an invalid version is specified, the tool will display an error message
showing the invalid version and all available versions for that method. For
example:

  $ python3 src/run_b_cell_sequence_based.py predict -i examples/single_sequence.fasta -m emini-1.3
  ValueError: Version '1.3' is not available for method 'Emini'. Available versions: 1.0
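
The normalization described above can be sketched as follows. This is not the
tool's actual implementation: normalize_method, DEFAULT_VERSIONS, and ALIASES are
hypothetical names, and the sketch assumes that only the default version of each
method is available. The names and default versions are taken from the list above.

```python
import re

# Canonical method names and their default versions, as listed above.
DEFAULT_VERSIONS = {
    "bepipred3": "3.0",
    "chou_fasman": "1.0",
    "emini": "1.0",
    "karplus_schulz": "1.0",
    "kolaskar_tongaonkar": "1.0",
    "parker": "1.0",
}
# "bepipred" (with or without a version suffix) maps to the bepipred3 method.
ALIASES = {"bepipred": "bepipred3"}


def normalize_method(name):
    """Turn a name such as 'Chou-Fasman-1.0' into ('chou_fasman', '1.0')."""
    text = name.strip().lower().replace("-", "_")
    # Optional trailing version suffix such as "_1.0" or "_3.0".
    match = re.fullmatch(r"(.+?)(?:_(\d+(?:\.\d+)*))?", text)
    if match is None:
        raise ValueError(f"Unknown method: {name!r}")
    base, version = match.group(1), match.group(2)
    base = ALIASES.get(base, base)
    if base not in DEFAULT_VERSIONS:
        raise ValueError(f"Unknown method: {name!r}")
    default = DEFAULT_VERSIONS[base]
    # Assumption: the default version is the only available version.
    if version is not None and version != default:
        raise ValueError(
            f"Version '{version}' is not available for method '{base}'. "
            f"Available versions: {default}")
    return base, default


print(normalize_method("Kolaskar-Tongaonkar"))
```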


Advanced Usage - Distributed Processing
---------------------------------------
For processing multiple sequences or methods, you can use a three-step workflow
that allows distributed processing:

1. Preprocess: Split input into job units
     python3 src/run_b_cell_sequence_based.py preprocess -j examples/bepipred3_single_sequence.json

   This creates individual parameter files in predict-inputs/params/ and a
   job_descriptions.json file listing all jobs to be executed.

2. Predict: Run each command listed in job_descriptions.json. Each command
   runs predict -j predict-inputs/params/<N>.json -o ... and the jobs can be
   distributed across multiple machines.

3. Postprocess: Aggregate results
     python3 src/run_b_cell_sequence_based.py postprocess -j job_descriptions.json -o results/output

   Options:
     -j, --job-desc-file    Path to job_descriptions.json file (required)
     -o, --output-prefix     Output file prefix (e.g., results/output)
     -f, --output-format     Output format (json, default: json)
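
Step 2 of the workflow above can be sketched as a short Python script that reads
the job list and runs each command in order on one machine. The layout of
job_descriptions.json is not described in this document, so the sketch assumes a
top-level JSON array of objects that each store their command under a "shell_cmd"
key. Inspect the generated file and adjust the key if it differs.
load_job_commands and run_jobs are hypothetical helper names.

```python
import json
import subprocess


def load_job_commands(job_desc_path, command_key="shell_cmd"):
    """Return the shell commands found in a job description file.

    Assumption: the file is a JSON array of objects, and each job keeps its
    command string under `command_key`. Entries without that key are skipped.
    """
    with open(job_desc_path, encoding="utf-8") as handle:
        jobs = json.load(handle)
    return [job[command_key] for job in jobs if command_key in job]


def run_jobs(commands):
    """Run the commands one after another and stop on the first failure."""
    for command in commands:
        subprocess.run(command, shell=True, check=True)
```

Calling run_jobs(load_job_commands("job_descriptions.json")) runs the jobs
sequentially. To distribute the work, hand each command string to a scheduler or
to a separate machine instead.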


Example Files (examples/)
-------------------------
Naming convention for example JSON configs:
- Single method, single sequence from file:  {method}_single_sequence.json (e.g. bepipred3_single_sequence.json, chou_fasman_single_sequence.json)
- Single method, single UniProt ID:         {method}_single_uniprot.json (e.g. chou_fasman_single_uniprot.json)
- Single method, multiple UniProt IDs:      {method}_multiple_uniprot.json (e.g. chou_fasman_multiple_uniprot.json)
- Single method, multiple sequences:        {method}_multiple_sequences.json (e.g. chou_fasman_multiple_sequences.json)
- Multiple methods, single sequence:        multiple_methods_single_sequence.json
- Multiple methods, single UniProt ID:      multiple_methods_single_uniprot.json
- Multiple methods, multiple UniProt IDs:   multiple_methods_multiple_uniprot.json
- Multiple methods, multiple sequences:     multiple_methods_multiple_sequences.json


JSON Input Format
-----------------
When using JSON input (-j), the file must include one of: input_sequence_text_file_path,
uniprot_id, or input_sequence_text. The file must be formatted as described below.

Example 1: Single method with sequence file
     {
         "input_sequence_text_file_path": "examples/single_sequence.txt",
         "predictors": [
             {
                 "method": "bepipred3",
                 "window_size": 9,
                 "threshold": 0.1512,
                 "plot_data": true
             }
         ]
     }

Example 2: Multiple methods
     {
         "input_sequence_text_file_path": "examples/multiple_sequences.fasta",
         "predictors": [
             {
                 "method": "bepipred3",
                 "threshold": 0.1512,
                 "plot_data": true
             },
             {
                 "method": "chou_fasman",
                 "window_size": 7,
                 "threshold": 0.903,
                 "plot_data": true
             },
             {
                 "method": "emini",
                 "plot_data": true
             }
         ]
     }

Example 3: Using UniProt ID
     {
         "uniprot_id": "P02185",
         "predictors": [
             {
                 "method": "bepipred3"
             }
         ]
     }

JSON Field Descriptions
------------------------
- input_sequence_text_file_path
    Path to a file containing sequences. The file can be in one of two formats:
    - Text file: One sequence per line (no FASTA headers), e.g., examples/multiple_sequences.txt
    - FASTA file: Sequences with headers (lines starting with >), e.g., examples/multiple_sequences.fasta
    Required if uniprot_id and input_sequence_text are not provided.

- input_sequence_text
    Inline sequence(s) as a single string. Multiple sequences can be separated by a newline (\n).
    Required if input_sequence_text_file_path and uniprot_id are not provided.

- uniprot_id
    One UniProt accession ID or multiple comma-separated IDs (e.g. "P02185" or "P02185,P29320").
    Sequence(s) will be fetched automatically. Required if input_sequence_text_file_path and input_sequence_text are not provided.

- predictors
    An array of predictor configurations. Each predictor requires a "method"
    field and may include optional parameters.

- method
    Method name (see Available Methods section above).

- plot_data
    Whether to generate plotted output. Default: true

- Other parameters
    Method-specific parameters are specified directly in the predictor object
    (not nested). If not provided, defaults are used (see the Method Parameters
    section below).
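
The fields described above can be assembled in Python and written out with the
json module. make_config is a hypothetical helper, not part of this package, and
the check for exactly one input field is an assumption based on the "one of"
wording above.

```python
import json

INPUT_KEYS = ("input_sequence_text_file_path", "uniprot_id", "input_sequence_text")


def make_config(predictors, **source):
    """Build a config dict with one input field and a list of predictors."""
    unknown = set(source) - set(INPUT_KEYS)
    if unknown:
        raise ValueError(f"unknown input field(s): {sorted(unknown)}")
    given = [key for key in INPUT_KEYS if source.get(key)]
    # Assumption: exactly one input field is provided per file.
    if len(given) != 1:
        raise ValueError(f"provide exactly one of: {', '.join(INPUT_KEYS)}")
    for predictor in predictors:
        if "method" not in predictor:
            raise ValueError('each predictor requires a "method" field')
    return {given[0]: source[given[0]], "predictors": list(predictors)}


# Two inline sequences separated by a newline, scored with two methods.
# Method-specific parameters sit directly in each predictor object.
config = make_config(
    [{"method": "emini"}, {"method": "chou_fasman", "window_size": 7}],
    input_sequence_text="\n".join(["VLSEGEWQLVLHVWAK", "MKTIIALSYIFCLVFADYKDDDDK"]),
)
print(json.dumps(config, indent=4))
```

Saving the printed text to a file gives an input that can be passed to
predict -j.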


Method Parameters
-----------------

Scale-Based Methods (chou_fasman, emini, karplus_schulz, kolaskar_tongaonkar, parker)
-------------------------------------------------------------------------------------
  window_size
      Sliding window size for score calculation.
      Defaults: chou_fasman=7, emini=6, karplus_schulz=7,
                kolaskar_tongaonkar=7, parker=7

  threshold
      Threshold for epitope prediction.
      Defaults: chou_fasman=0, emini=0, karplus_schulz=0,
                kolaskar_tongaonkar=0, parker=0.
      When the threshold is 0 (or not set), the effective threshold is the
      average of the scores for that sequence and method. Emini and
      Kolaskar-Tongaonkar use this threshold for epitope prediction; the
      assignment columns mark residues with score >= threshold.

  plot_data
      Whether to generate plotted output. Default: true

Note: Classical methods (Parker, Chou-Fasman, Karplus-Schulz) produce scores
and statistics but do not generate explicit epitope predictions. Only Emini
and Kolaskar-Tongaonkar produce epitope tables.
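
The default threshold rule for the scale-based methods can be sketched as
follows. effective_threshold and assign are hypothetical names, and using
booleans for the assignment is an assumption made for illustration. The tool's
own output columns may use a different representation.

```python
def effective_threshold(scores, threshold=0):
    """Return the given threshold, or the mean score when it is 0 or unset."""
    if threshold:
        return threshold
    return sum(scores) / len(scores)


def assign(scores, threshold=0):
    """Mark each residue whose score is >= the effective threshold."""
    cutoff = effective_threshold(scores, threshold)
    return [score >= cutoff for score in scores]


scores = [0.5, 1.0, 1.5, 2.0]
print(effective_threshold(scores))    # mean of the scores: 1.25
print(assign(scores))                 # [False, False, True, True]
print(assign(scores, threshold=1.0))  # [False, True, True, True]
```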


Bepipred3.0 Parameters
----------------------
  pred {mjv_pred, vt_pred}
      Majority vote ensemble prediction or variable threshold prediction
      based on average ensemble positive probabilities.
      Default: vt_pred

  threshold
      Threshold for predictions based on average ensemble positive probability.
      Default: 0.1512

  window_size
      Window size for rolling average on linear epitope scores.
      Default: 9

  top_cands
      Top percentage of candidate residues to display.
      Default: 0.2 (20%)

  add_seq_len
      Add sequence lengths to ESM encodings.
      Default: false

  esm_dir
      Directory to save ESM encodings.
      Default: false (uses current working directory)

  plot_linear_epitope_scores
      Use linear B-cell epitope probability scores for plotting.
      Default: false

  zip_results
      Create a ZIP archive of results (excluding interactive HTML figure).
      Default: false

  plot_data
      Whether to generate plotted output.
      Default: true
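
The window_size parameter above refers to a rolling average over per-residue
scores. A minimal sketch of such an average is shown below. How Bepipred3.0
centers the window and handles the sequence ends is not described in this
document, so the centered window that shrinks at both ends is an assumption, and
rolling_average is a hypothetical name.

```python
def rolling_average(scores, window_size=9):
    """Average each score with its neighbors inside a centered window.

    Assumption: near the ends of the sequence the window is truncated to the
    residues that exist, rather than padded.
    """
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    before = (window_size - 1) // 2  # residues taken to the left of position i
    after = window_size // 2         # residues taken to the right of position i
    smoothed = []
    for i in range(len(scores)):
        window = scores[max(0, i - before): i + after + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


print(rolling_average([1.0, 2.0, 3.0, 4.0, 5.0], window_size=3))
# [1.5, 2.0, 3.0, 4.0, 4.5]
```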


Output Format
-------------
The tool generates JSON output with the following structure:

- residue_table
    Contains per-residue predictions with scores, percentile ranks, harmonic
    mean, and assignments for each method.

- linear_epitope_table
    Contains predicted epitopes (for Emini and Kolaskar-Tongaonkar methods
    only). Columns include:
    - core.sequence_number: Sequence identifier
    - core.start: Epitope start position
    - core.end: Epitope end position
    - core.peptide: Epitope sequence
    - core.length: Epitope length

- predictor_threshold
    Lists the threshold values used for each method.

All numeric values (scores, percentile ranks, harmonic mean) are rounded to
2 decimal places. Percentile ranks are calculated such that higher scores
correspond to higher percentile ranks.

Output files are saved to the path specified by -o (or printed to console if
not specified). Format is inferred from the file extension if -f is not
specified. For distributed processing workflows, intermediate results are
stored in the 'aggregate' subdirectory.


Caveats
-------
All IEDB next-generation standalone tools are developed primarily to support
the website. Some user-facing features may be limited but will improve as the
tools mature.


Contact
-------
Please contact us with any issues or questions using the channels below:

IEDB Help Desk:
  https://help.iedb.org/

Email:
  help@iedb.org