MistDBParse


Download Link — Scripts written for Python 3.1.


Description


Two main scripts need to be run, in order, to retrieve and sort the data: MistParser.py and then Aggregator.py. MistParser takes a MistDB URL and scrapes it for all of the domain architecture data for a given classification. If the result set spans additional pages (page 2, 3, etc.), it requests and scrapes those as well. Aggregator parses the output files from MistParser and pulls out specific domain architectures in the specified protein classification, with the option to filter based on collocation of domains.


Note: In order for Aggregator to find the files created by MistParser, Aggregator must be run at the root of a folder structure in which each subfolder is named for a species and contains the domain information that MistParser created for that species. Preferably, the output files should be named in the format 'SpeciesName-Classification', e.g. Bacillus.Anthracis-HK.txt and Bacillus.Anthracis-HK.pkl.
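The folder layout described in the note can be illustrated with a short sketch. This is not Aggregator's actual code; the function name find_classification_files and its parameters are hypothetical, and the only assumptions taken from the note are that each species has its own subfolder and that file names end in '-Classification' plus an extension.

```python
import os


def find_classification_files(root, classification, extension=".pkl"):
    """Map each species subfolder of `root` to its '<Species>-<classification>' files.

    Hypothetical helper: it only illustrates the layout described in the note,
    it is not taken from Aggregator.py.
    """
    suffix = "-" + classification + extension
    found = {}
    for entry in sorted(os.listdir(root)):
        species_dir = os.path.join(root, entry)
        if not os.path.isdir(species_dir):
            continue  # skip loose files sitting at the root
        matches = [
            os.path.join(species_dir, name)
            for name in sorted(os.listdir(species_dir))
            if name.endswith(suffix)
        ]
        if matches:
            found[entry] = matches
    return found


if __name__ == "__main__":
    # Run from the root of the species folder structure.
    for species, paths in find_classification_files(".", "HK").items():
        print(species, paths)
```

Species subfolders that hold no file for the requested classification are simply left out of the result.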


Example


Note: the arguments for each script are briefly described if the script is run without any arguments.


Retrieves the HK proteins for Clostridium acetobutylicum ATCC 824 and saves them to C.acetobutylicum-HK.txt (and C.acetobutylicum-HK.pkl).


MistParser.py http://mistdb.com/proteins/slice/repcon_id:702/class:2cp+hk C.acetobutylicum-HK


Move the resulting C.acetobutylicum-HK.txt and .pkl files to the appropriate subfolder (see Note above) and run Aggregator.py to filter and format the results.


Aggregator.py HK HisKA


Produces an output file called "HK-HisKA.fasta" that contains all of the HisKA domains within HK proteins for every species for which data exists. The file is FASTA-formatted and can be imported into any alignment program or similar tool.
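For readers unfamiliar with the FASTA layout, the sketch below shows how a list of domain sequences could be written in that format. It is an illustration only: the function format_fasta, the header text, and the 60-character line width are assumptions, not details taken from Aggregator.py.

```python
def format_fasta(records, width=60):
    """Return FASTA text for an iterable of (header, sequence) pairs.

    Hypothetical helper; the header layout and line width are assumptions.
    """
    lines = []
    for header, sequence in records:
        lines.append(">" + header)
        # Wrap the sequence so that no line exceeds `width` characters.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Made-up header and sequence, for illustration only.
    print(format_fasta([("ExampleSpecies|HK|HisKA", "MKVLAAGE" * 10)]), end="")
```

Each record starts with a '>' header line followed by the sequence, which is the minimum that alignment programs expect.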


Optionally, you can add the 'lookFor' flag to retrieve only domains that are collocated with another domain on a protein. The collocation criterion is loose: if trans and HisKA occur on the same protein (whether or not they are separated by another domain), they are considered collocated.


Aggregator.py HK -l trans HisKA
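The loose collocation rule described above amounts to a membership test. The sketch below assumes a protein is represented as a list of its domain names in order; that representation and the function name is_collocated are hypothetical, not taken from Aggregator.py.

```python
def is_collocated(domains_on_protein, look_for, target):
    """Return True if both domains occur anywhere on the same protein.

    Order and adjacency are ignored, which matches the loose criterion:
    other domains may sit between the two.
    """
    names = set(domains_on_protein)
    return look_for in names and target in names


if __name__ == "__main__":
    # Made-up domain architectures, for illustration only.
    print(is_collocated(["trans", "PAS", "HisKA"], "trans", "HisKA"))
    print(is_collocated(["PAS", "HisKA"], "trans", "HisKA"))
```

A stricter rule, such as requiring the two domains to be adjacent, would need the domain order or positions rather than a plain set.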