Genepidgin repackaging

Genepidgin is a utility belt of gene naming tools. It helps compare, clean, and select names intended for coding sequences. Its primary draw is turning an almost-great name:

BT002689 glycine/betaine/L-proline ABC transport protein, periplasmic-binding protein

into a great name:

glycine/betaine/L-proline ABC transporter, periplasmic-binding protein

It'll even offer detail overload in doing so:

$ echo "sample_name BT002689 glycine/betaine/L-proline ABC transport protein, periplasmic-binding protein" > input.txt
$ genepidgin cleaner input.txt output.txt
Cleaning up names found in: input.txt
sample_name filtered name in 1 step:
0) original: BT002689 glycine/betaine/L-proline ABC transport protein, periplasmic-binding protein
1)   reason: delete id
    pattern: \b[A-Za-z0-9]+\d{4,}(?<!\b(?:DUF|UPF)\d{4})\b(?!\s*(kD(a)?|-like|family|protein\s+family))
   filtered:  glycine/betaine/L-proline ABC transport protein, periplasmic-binding protein
2)   reason: transport protein -> transporter
    pattern: \btransport(er)?\s+protein\b
   filtered:  glycine/betaine/L-proline ABC transporter, periplasmic-binding protein
3)   reason: delete spaces at beginning of name
    pattern: ^\s+
   filtered: glycine/betaine/L-proline ABC transporter, periplasmic-binding protein

(It also has a silent-execution option.)
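
The three steps in the transcript above can be reproduced with Python's standard re module. This is only a minimal sketch, not Genepidgin's actual implementation: the patterns are copied from the transcript, while the replacement strings, the RULES list, and the clean_name helper are assumptions inferred from the "filtered" lines.

```python
import re

# (reason, pattern, replacement) triples. The patterns are copied from the
# transcript above; the replacement strings are assumptions inferred from
# the "filtered" lines, not taken from Genepidgin's source.
RULES = [
    ("delete id",
     r"\b[A-Za-z0-9]+\d{4,}(?<!\b(?:DUF|UPF)\d{4})\b"
     r"(?!\s*(kD(a)?|-like|family|protein\s+family))",
     ""),
    ("transport protein -> transporter",
     r"\btransport(er)?\s+protein\b",
     "transporter"),
    ("delete spaces at beginning of name",
     r"^\s+",
     ""),
]


def clean_name(name):
    """Apply each rule in order; return the final name and the steps that changed it."""
    steps = []
    for reason, pattern, replacement in RULES:
        filtered = re.sub(pattern, replacement, name)
        if filtered != name:
            # Record only the rules that actually altered the name.
            steps.append((reason, filtered))
            name = filtered
    return name, steps


original = ("BT002689 glycine/betaine/L-proline ABC transport protein, "
            "periplasmic-binding protein")
cleaned, steps = clean_name(original)
print(cleaned)
for number, (reason, filtered) in enumerate(steps, start=1):
    print(f"{number}) {reason}: {filtered!r}")
```

Applying the rules in a fixed order matters here: deleting the ID leaves a leading space behind, which the last rule then strips. The negative lookbehind and lookahead in the first pattern are what keep identifiers such as DUF1234 from being deleted.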

This open-source (BSD) project is the product of several man-years of arguments and a couple of months of labor. It's not representative of current practices, and better approaches are available, but for broad-stroke categorization it still does a good job.

Many smart people worked on this project, and Matthew Pearson and I wrote it. Check Genepidgin's credits for more information.

This update changes no functionality; it merely brings the project into the modern era of documentation and Python packaging. The last significant update to this project was in 2010.

If you are a large-scale sequencing and annotation house that needs to assign names to programmatically generated annotations of genetic sequence, and your preferred method of doing so is raw sequence alignment against disparate protein libraries, then Genepidgin might help you. For more information, please consult Genepidgin's documentation.