IEDB Next-Generation Tools PEPMatch - version 0.1.2-beta
========================================================

Introduction
------------
PEPMatch takes a list of peptides as input and searches a reference proteome
for peptides that match with no more than a user-specified number of
substitutions (the "mismatch" parameter described under Input formats).  This
command-line tool drives the PEPMatch tool at https://nextgen-tools.iedb.org/pipeline?tool=pepmatch.

It is a wrapper layer around version 0.9.6 of PEPMatch, which is available on
GitHub: https://github.com/IEDB/PEPMatch/.


Release Notes
-------------
v0.1.2-beta - Initial public beta release


Prerequisites
-------------

The following prerequisites must be met before installing the tools:

+ Python 3.7 or higher
  * http://www.python.org/

The following prerequisites must be met before running the tools:

+ Preprocessed proteome files.  Download the archive from
https://downloads.iedb.org/datasets/pepmatch-proteome/preprocessed/LATEST/proteomes-20240313.tgz
and extract it to a directory of your choice.  For example:
  $ wget https://downloads.iedb.org/datasets/pepmatch-proteome/preprocessed/LATEST/proteomes-20240313.tgz
  $ tar -xvzf proteomes-20240313.tgz -C /your/target/directory

+ 8GB RAM or more
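
The checks above can be sketched as a small preflight script.  This is only an
illustration: `check_prerequisites` is a hypothetical helper, not part of the
tool, and it assumes that a usable proteome directory is simply one that exists
and is not empty.  It does not check available RAM.

```python
import os
import sys


def check_prerequisites(proteomes_path, min_python=(3, 7)):
    """Return a list of human-readable problems (empty if none were found).

    Hypothetical helper: it only checks the Python version and that the
    extracted proteome directory exists and contains at least one entry.
    """
    problems = []
    if sys.version_info < min_python:
        problems.append(
            "Python {}.{} or higher is required".format(*min_python)
        )
    if not os.path.isdir(proteomes_path):
        problems.append("proteome directory not found: " + proteomes_path)
    elif not os.listdir(proteomes_path):
        problems.append("proteome directory is empty: " + proteomes_path)
    return problems


if __name__ == "__main__":
    # /your/target/directory is the placeholder used in the tar command above.
    for problem in check_prerequisites("/your/target/directory"):
        print("WARNING:", problem)
```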


Installation
------------

Below, we will use the example of installing to ~/iedb_tools.

1. Extract the code and change directory: 
  $ mkdir ~/iedb_tools
  $ tar -xvzf IEDB_NG_PEPMATCH-0.1.2-beta.tar.gz -C ~/iedb_tools
  $ cd ~/iedb_tools/ng_pepmatch-0.1.2-beta

2. Optionally, create and activate a Python 3.7+ virtual environment using your favorite virtual environment manager.  Here, we will assume the virtual environment is at ~/venvs/pepmatch:
  $ python3 -m venv ~/venvs/pepmatch
  $ source ~/venvs/pepmatch/bin/activate

3. Install Python requirements:
  $ python3 -m pip install --upgrade pip
  $ pip install -r requirements.txt


Usage
-----
1. Set the environment variable `PEPMATCH_PROTEOMES_PATH` to specify the path to the proteome files folder:
  ```sh
  export PEPMATCH_PROTEOMES_PATH=<proteomes_path>
  ```

  Then run the tool with:
  ```sh
  python3 src/match.py -j <input_json_file> [-o <output_prefix>] [-f <output_format>]
  ```

2. Alternatively, specify the proteome path directly via the command line using the `--proteomes-path` parameter:
  ```sh
  python3 src/match.py -j <input_json_file> --proteomes-path <proteomes_path> \
  [-o <output_prefix>] [-f <output_format>]
  ```

The format of the input JSON file is described below.

The output_prefix and output_format arguments are optional.  By default, the
output is printed to the screen in TSV format.  Valid values for output_format
are 'tsv' and 'json'.
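
For scripted runs, the two invocation styles above can be assembled in Python.
The sketch below is illustrative only: `build_command` and `run_pepmatch` are
hypothetical helpers, not part of the tool.  They use only the flags shown in
this README and assume the script is started from the extracted tool directory,
so that the relative path src/match.py resolves.

```python
import os
import subprocess

VALID_FORMATS = ("tsv", "json")  # output formats named in this README


def build_command(input_json, proteomes_path=None,
                  output_prefix=None, output_format=None):
    """Assemble the match.py command line as a list of arguments."""
    cmd = ["python3", "src/match.py", "-j", input_json]
    if proteomes_path is not None:
        cmd += ["--proteomes-path", proteomes_path]
    if output_prefix is not None:
        cmd += ["-o", output_prefix]
    if output_format is not None:
        if output_format not in VALID_FORMATS:
            raise ValueError("output_format must be 'tsv' or 'json'")
        cmd += ["-f", output_format]
    return cmd


def run_pepmatch(input_json, proteomes_path, **options):
    """Run the tool with PEPMATCH_PROTEOMES_PATH set for the child process only."""
    env = dict(os.environ, PEPMATCH_PROTEOMES_PATH=proteomes_path)
    return subprocess.run(build_command(input_json, **options),
                          env=env, check=True)
```

Passing the path through the child environment keeps the setting local to one
run, which may be convenient when several proteome directories are in use.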

Input formats
-------------
Currently, only JSON input is supported.  The input file must be formatted as
in the following example:

{
    "input_sequence_text": "DDEDSKQNIFHFLYR\nADPGPHLMGGGGRAK\nKAVELGVKLLHAFHT\nQLQNLGINPANIGLS\nHEVWFFGLQYVDSKG",
    "mismatch": 3,
    "proteome": "human",
    "best_match": true
}

* input_sequence_text: a FASTA-formatted string.  To create an appropriate string
    from a FASTA file:
      awk '{printf "%s\\n", $0}' <fasta_file>
* mismatch: the maximum number of mismatches to allow in the search
* proteome: the reference proteome to search.  Custom proteomes are not yet
    supported (future versions of this tool will allow them), so this must be
    one of the following preprocessed proteome names:
      - cow
      - dog
      - horse
      - human
      - mouse
      - pig
      - rabbit
      - rat
* best_match: if true, return only the best match per peptide.  If false, all
    matches at or below the mismatch threshold will be returned.
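
As an alternative to hand-editing the file, the input JSON can be generated
with Python's json module, which takes care of escaping the newlines.  This is
a minimal sketch: `build_input` is a hypothetical helper, not part of the tool,
and it joins bare peptide sequences with newlines, as in the example above.

```python
import json

# Preprocessed proteome names listed in this README.
PROTEOMES = ("cow", "dog", "horse", "human", "mouse", "pig", "rabbit", "rat")


def build_input(peptides, mismatch=3, proteome="human", best_match=True):
    """Return a dict with the four input fields described above."""
    if proteome not in PROTEOMES:
        raise ValueError("unknown proteome: " + repr(proteome))
    if mismatch < 0:
        raise ValueError("mismatch must be zero or greater")
    return {
        "input_sequence_text": "\n".join(peptides),
        "mismatch": mismatch,
        "proteome": proteome,
        "best_match": best_match,
    }


if __name__ == "__main__":
    peptides = ["DDEDSKQNIFHFLYR", "ADPGPHLMGGGGRAK"]
    # Redirect this output to a file to use it with the -j option.
    print(json.dumps(build_input(peptides), indent=4))
```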


Caveats
-------
All IEDB next-generation standalones have been developed with the primary
focus of supporting the website.  Some user-facing features may be lacking,
but will be improved as these tools mature.


Contact
-------
Please contact us with any issues encountered or questions about the software
through any of the channels listed below.

IEDB Help Desk: https://help.iedb.org/
Email: help@iedb.org
