Modules¶
SPIDER-WEB package consists of four modules:
biochemical constraint module (biofilter.py): implementation of ‘biochemical constraint filter’.
transcoding process module (spiderweb.py): implementation of ‘coding algorithm generation’, ‘encoding process’, ‘decoding process’, ‘shuffled matrix creation for mapping shuffling’, ‘Varshamov-Tenengolts-based path check set’, and ‘obtained DNA string correction’.
graph-based operation module (graphized.py): implementation of ‘data structure transformation between different graph representations’, ‘vertex search’, ‘path search’, and ‘capacity approximation’.
fundamental operation module (operation.py): implementation of ‘process monitor’, ‘large integer basic operation’, and ‘conversion between binary message, decimal number, and DNA string’.
Biochemical Constraint Module¶
- class dsw.biofilter.LocalBioFilter(observed_length, max_homopolymer_runs=None, gc_range=None, undesired_motifs=None)[source]¶
Bases:
DefaultBioFilter- valid(dna_sequence, only_last=True)[source]¶
Judge whether the DNA sequence meets the local biochemical constraints.
- Parameters:
dna_sequence (str) – DNA sequence to be judged.
only_last (bool) – only check the DNA sequence of the last observed window.
- Returns:
judgement.
- Return type:
bool
Note
“only_last” parameter is used to save time. For most tree-based coding algorithms, it is not necessary to detect the sub DNA sequences observed in each window from scratch every time.
Transcoding Process Module¶
- dsw.spiderweb.find_vertices(observed_length, bio_filter, verbose=False)[source]¶
Find valid vertices based on the given the biochemical constraints.
- Parameters:
observed_length (int) – length of the DNA sequence in a vertex.
bio_filter (dsw.biofilter.DefaultBioFilter) – screening operation for identifying the valid DNA sequence (required the given constraints).
verbose (bool) – need to print log.
- Returns:
available vertices.
- Return type:
numpy.ndarray
- Raises:
ValueError – if no valid vertices are available under the required biochemical constraints.
- Example
>>> from dsw import LocalBioFilter, find_vertices >>> bio_filter = LocalBioFilter(observed_length=2, max_homopolymer_runs=2, gc_range=[0.5, 0.5]) >>> # "1" refers to the available index of vertex, otherwise "0" refers to unavailable index of vertex. >>> find_vertices(observed_length=2, bio_filter=bio_filter).astype(int) array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0])
Note
Reference [1] Florent Capelli and Yann Strozecki (2019) Discrete Applied Mathematics
- dsw.spiderweb.connect_valid_graph(observed_length, vertices, verbose=False)[source]¶
Connect a valid graph by valid vertices.
- Parameters:
observed_length (int) – length of the DNA sequence in a vertex.
vertices (numpy.ndarray) – vertex accessor, in each cell, True is valid vertex and False is invalid vertex.
verbose (bool) – need to print log.
- Returns:
accessor of the valid graph.
- Return type:
numpy.ndarray
- Raises:
ValueError – if no valid vertices are available.
- Example
>>> from numpy import array >>> from dsw import connect_valid_graph >>> vertices = array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0]) >>> connect_valid_graph(observed_length=2, vertices=vertices) array([[-1, -1, -1, -1], [ 4, -1, -1, 7], [ 8, -1, -1, 11], [-1, -1, -1, -1], [-1, 1, 2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1], [-1, 1, 2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1], [-1, -1, -1, -1], [ 4, -1, -1, 7], [ 8, -1, -1, 11], [-1, -1, -1, -1]])
Note
Reference [1] Nicolaas Govert de Bruijn (1946) Indagationes Mathematicae
- dsw.spiderweb.connect_coding_graph(observed_length, vertices, threshold, verbose=False)[source]¶
Connect a coding algorithm by valid vertices and the threshold for minimum out-degree.
- Parameters:
observed_length (int) – length of the DNA sequence in a vertex.
vertices (numpy.ndarray) – vertex accessor, in each cell, True is valid vertex and False is invalid vertex.
threshold (int) – threshold for minimum out-degree.
verbose (bool) – need to print log.
- Returns:
coding vertices and coding accessor.
- Return type:
(numpy.ndarray, numpy.ndarray)
- Raises:
ValueError – if no coding graph are created because of the vertex set and/or the trimming requirement.
- Example
>>> from numpy import array >>> from dsw import connect_coding_graph >>> vertices = array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0]) >>> vertices, accessor = connect_coding_graph(observed_length=2, vertices=vertices, threshold=2) >>> accessor array([[-1, -1, -1, -1], [ 4, -1, -1, 7], [ 8, -1, -1, 11], [-1, -1, -1, -1], [-1, 1, 2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1], [-1, 1, 2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1], [-1, -1, -1, -1], [ 4, -1, -1, 7], [ 8, -1, -1, 11], [-1, -1, -1, -1]]) >>> vertices.astype(int) array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0])
Note
Reference [1] Nicolaas Govert de Bruijn (1946) Indagationes Mathematicae
- dsw.spiderweb.create_random_shuffles(observed_length, random_seed=None, verbose=False)[source]¶
Create the shuffles for accessor through the random mechanism.
- Parameters:
observed_length (int) – length of the DNA sequence in a vertex.
random_seed (int) – random seed for creating shuffles.
verbose (bool) – need to print log.
- Returns:
shuffles for accessor.
- Return type:
numpy.ndarray
- Example
>>> from dsw import create_random_shuffles >>> create_random_shuffles(observed_length=2, random_seed=2021) array([[3, 2, 1, 0], [2, 3, 1, 0], [3, 1, 0, 2], [0, 3, 1, 2], [3, 2, 0, 1], [1, 0, 3, 2], [0, 3, 1, 2], [2, 0, 1, 3], [2, 3, 0, 1], [1, 0, 3, 2], [2, 0, 1, 3], [0, 1, 3, 2], [2, 3, 1, 0], [2, 0, 3, 1], [0, 1, 3, 2], [0, 3, 2, 1]])
Note
The mapping shuffle strategy disrupts the digit-to-nucleotide mapping order, so it can be used as an privacy protection mechanism.
Value 0 ~ 3 in each line shuffle only describes the relationship between progressive order and position. The original position is [0, 1, 2, 3]. If the follow-up nucleotides in a vertex are [“A”, “G”] when the line shuffle is [3, 2, 1, 0], This line shuffle can be regarded as [1, -, 0, -] for [“A”, -, “G”, -].
Ths shuffle will not disclose the topology information of accessor.
- dsw.spiderweb.encode(binary_message, accessor, start_index, is_faster=False, vt_length=0, shuffles=None, need_path=False, verbose=False)[source]¶
Encode a bit array by the specific accessor.
- Parameters:
binary_message (numpy.ndarray) – binary message.
accessor (numpy.ndarray) – accessor of the coding algorithm.
start_index (int) – virtual vertex to start encoding.
is_faster (bool) – encode in a faster way.
vt_length (int or None) – length of Varshamov-Tenengolts code for error-correction.
shuffles – shuffle relationships for bit-nucleotide mapping.
need_path (bool) – need to record the restricted state transition path.
verbose (bool) – need to print log.
- Type:
numpy.ndarray
- Returns:
DNA sequence encoded by this graph (and VT check sequence if required).
- Return type:
str or (str, str)
Note
If the parameter “is_faster” is set as True, the out-degree of coding digraph cannot contains 3.
- Example
>>> from numpy import array >>> from dsw import encode >>> # accessor with GC-balanced >>> accessor = array([[-1, -1, -1, -1], [ 4, -1, -1, 7], [ 8, -1, -1, 11], [-1, -1, -1, -1], [-1, 1, 2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1], [-1, 1, 2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1], [-1, -1, -1, -1], [ 4, -1, -1, 7], [ 8, -1, -1, 11], [-1, -1, -1, -1]]) >>> binary_message = array([0, 1, 0, 1, 0, 1, 0, 1]) >>> encode(accessor=accessor, binary_message=binary_message, start_index=1) 'TCTCTCT' >>> encode(accessor=accessor, binary_message=binary_message, start_index=1, vt_length=5) ('TCTCTCT', 'TAAGC') >>> shuffles = array([[3, 2, 1, 0], [2, 3, 1, 0], [3, 1, 0, 2], [0, 3, 1, 2], [3, 2, 0, 1], [1, 0, 3, 2], [0, 3, 1, 2], [2, 0, 1, 3], [2, 3, 0, 1], [1, 0, 3, 2], [2, 0, 1, 3], [0, 1, 3, 2], [2, 3, 1, 0], [2, 0, 3, 1], [0, 1, 3, 2], [0, 3, 2, 1]]) >>> encode(accessor=accessor, binary_message=binary_message, shuffles=shuffles, start_index=1) 'AGAGAGA'
- dsw.spiderweb.decode(dna_sequence, bit_length, accessor, start_index, is_faster=False, vt_check=None, shuffles=None, verbose=False)[source]¶
Decode a DNA sequence by the specific accessor.
- Parameters:
dna_sequence (str) – DNA sequence encoded by this graph.
bit_length (int) – length of the bit array.
accessor (numpy.ndarray) – accessor of the coding algorithm (consistent with the encoding process).
start_index (int) – virtual vertex to start decoding (consistent with the encoding process).
is_faster (bool) – encode in a faster way.
vt_check (str or None) – check sequence of Varshamov-Tenengolts code.
shuffles (numpy.ndarray) – shuffle relationships for bit-nucleotide mapping.
verbose (bool) – need to print log.
- Returns:
binary message decoded by this graph.
- Return type:
numpy.ndarray
- Raises:
ValueError – if one or more errors are found.
Note
If the parameter “is_faster” is set as True, the out-degree of coding digraph cannot contains 3.
- Example
>>> from numpy import array >>> from dsw import decode >>> # accessor with GC-balanced >>> accessor = array([[-1, -1, -1, -1], [ 4, -1, -1, 7], [ 8, -1, -1, 11], [-1, -1, -1, -1], [-1, 1, 2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1], [-1, 1, 2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1], [-1, -1, -1, -1], [ 4, -1, -1, 7], [ 8, -1, -1, 11], [-1, -1, -1, -1]]) >>> dna_sequence = "TCTCTCT" >>> decode(accessor=accessor, dna_sequence=dna_sequence, start_index=1, bit_length=8) array([0, 1, 0, 1, 0, 1, 0, 1]) >>> vt_check = "TAAGC" >>> decode(accessor=accessor, dna_sequence=dna_sequence, start_index=1, vt_check=vt_check, bit_length=8) array([0, 1, 0, 1, 0, 1, 0, 1]) >>> shuffles = array([[3, 2, 1, 0], [2, 3, 1, 0], [3, 1, 0, 2], [0, 3, 1, 2], [3, 2, 0, 1], [1, 0, 3, 2], [0, 3, 1, 2], [2, 0, 1, 3], [2, 3, 0, 1], [1, 0, 3, 2], [2, 0, 1, 3], [0, 1, 3, 2], [2, 3, 1, 0], [2, 0, 3, 1], [0, 1, 3, 2], [0, 3, 2, 1]]) >>> decode(accessor=accessor, dna_sequence="TCTCTCT", start_index=1, shuffles=shuffles, bit_length=8) array([0, 0, 0, 0, 0, 0, 0, 0])
- dsw.spiderweb.set_vt(dna_sequence, vt_length)[source]¶
Set Varshamov-Tenengolts-based path check string (‘salt-protected’) from DNA (payload) sequence.
- Parameters:
dna_sequence (str) – DNA sequence encoded through SPIDER-WEB.
vt_length (int or None) – length of DNA sequence (path check).
- Returns:
path check DNA sequence with required length.
- Example
>>> from dsw import set_vt >>> dna_sequence = "TCTCTCT" >>> vt_length = 5 >>> set_vt(dna_sequence=dna_sequence, vt_length=vt_length) 'TAAGC'
Note
Reference [1] Rom R. Varshamov and Grigory M. Tenengolts (1965) Avtomat. i Telemekh
Reference [2] Grigory Tenengolts (1984) IEEE Transactions on Information Theory
Reference [3] William H. Press et al. (2020) Proceedings of the National Academy of Sciences
Reference [4] A. Xavier Kohll et al. (2020) Chemical Communications
- dsw.spiderweb.repair_dna(dna_sequence, accessor, start_index, observed_length, vt_check=None, has_indel=False, heap_size=1000.0)[source]¶
Repair the DNA sequence containing one (or more) errors.
- Parameters:
dna_sequence (str) – DNA sequence waiting for recovery.
accessor (numpy.ndarray) – accessor of the coding algorithm.
start_index (int) – virtual vertex to start encoding.
observed_length (int) – length of the DNA sequence in a vertex.
vt_check (str or None) – check sequence of Varshamov-Tenengolts code.
has_indel (bool) – consider insertion and/or deletion error.
heap_size (int) – maximum heap size.
- Returns:
repaired DNA sequence set and additional information.
- Return type:
(list, (bool, bool, int))
- Example
>>> from numpy import array >>> from dsw import repair_dna >>> # accessor with GC-balanced >>> accessor = array([[-1, -1, -1, -1], [ 4, -1, -1, 7], [ 8, -1, -1, 11], [-1, -1, -1, -1], [-1, 1, 2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1], [-1, 1, 2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1], [-1, -1, -1, -1], [ 4, -1, -1, 7], [ 8, -1, -1, 11], [-1, -1, -1, -1]]) >>> dna_sequence = "TCTCTATCTCTC" # "TCTCTCTCTCTC" is original DNA sequence >>> repair_dna(dna_sequence=dna_sequence, accessor=accessor, start_index=1, observed_length=2, has_indel=True) (['TCTCTCTCTCTC', 'TCTCTGTCTCTC'], (1, False, 2, 14)) >>> vt_check = "AACGC" # check list of Varshamov-Tenengolts code. >>> repair_dna(dna_sequence=dna_sequence, accessor=accessor, start_index=1, observed_length=2, vt_check=vt_check, has_indel=True) (['TCTCTCTCTCTC'], (1, True, 2, 14))
- dsw.spiderweb.remove_nasty_arc(accessor, latter_map, iteration=0, has_insertion=True, has_deletion=True, verbose=False)[source]¶
Remove the nasty arc.
- Parameters:
accessor (numpy.ndarray) – accessor of the coding algorithm.
latter_map (dict) – latter map of the coding algorithm.
iteration (int) – current round if required.
has_insertion (bool) – need to repair insertion error.
has_deletion (bool) – need to repair deletion error.
verbose (bool) – need to print log.
- Returns:
adjusted accessor, adjusted latter map, removed arc, and maximum intersection score.
- Return type:
(numpy.ndarray, dict, tuple, int)
- Example
>>> from numpy import array >>> from dsw import accessor_to_latter_map, remove_nasty_arc >>> # accessor with GC-balanced >>> accessor = array([[-1, -1, -1, -1], [ 4, -1, -1, 7], [ 8, -1, -1, 11], [-1, -1, -1, -1], [-1, 1, 2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1], [-1, 1, 2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1], [-1, -1, -1, -1], [ 4, -1, -1, 7], [ 8, -1, -1, 11], [-1, -1, -1, -1]]) >>> latter_map = accessor_to_latter_map(accessor) >>> result = remove_nasty_arc(accessor=accessor, latter_map=latter_map, iteration=0, verbose=False, has_insertion=True, has_deletion=True) >>> result[0] array([[-1, -1, -1, -1], [-1, -1, -1, 7], [ 8, -1, -1, 11], [-1, -1, -1, -1], [-1, 1, 2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1], [-1, 1, 2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1], [-1, -1, -1, -1], [ 4, -1, -1, 7], [ 8, -1, -1, 11], [-1, -1, -1, -1]]) >>> result[1] {1: [7], 2: [8, 11], 4: [1, 2], 7: [13, 14], 8: [1, 2], 11: [13, 14], 13: [4, 7], 14: [8, 11]} >>> result[2] (1, 4) >>> result[3] [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16]
Note
The graph information contained in “accessor” and “latter_map” is consistent.
It is a gift for the follow-up investigation. That is, removing arc to improve the capability of the probabilistic error correction.
Graph-based Operation Module¶
- dsw.graphized.approximate_capacity(accessor, tolerance_level=-10, repeats=1, maximum_iteration=500, process=False, verbose=False)[source]¶
Approximate the capacity of the specific graph through Perron–Frobenius theorem.
- Parameters:
accessor (numpy.ndarray) – accessor of graph.
tolerance_level (int) – error tolerance of power iteration.
repeats (int) – random repeats for approximating the capacity.
maximum_iteration (int) – maximum iteration in the power method.
process (bool) – need eigenvalue in the process.
verbose (bool) – need to print log.
- Returns:
capacity of this graph (accessor) and process values if required.
- Return type:
float or (float, list)
- Example
>>> from numpy import array, random >>> from dsw import approximate_capacity >>> # accessor with GC-balanced >>> accessor = array([[-1, -1, -1, -1], [ 4, -1, -1, 7], [ 8, -1, -1, 11], [-1, -1, -1, -1], [-1, 1, 2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1], [-1, 1, 2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1], [-1, -1, -1, -1], [ 4, -1, -1, 7], [ 8, -1, -1, 11], [-1, -1, -1, -1]]) >>> approximate_capacity(accessor=accessor, tolerance_level=-10, repeats=2, process=False) 1.0 >>> random.seed(0) >>> capacity, processes = approximate_capacity(accessor=accessor, tolerance_level=-10, repeats=2, process=True) >>> capacity 1.0 >>> ["%.5f" % _ for _ in processes[0]] ['0.57779', '0.91175', '1.00000', '1.00000'] >>> ["%.5f" % _ for _ in processes[1]] ['0.81488', '0.68189', '1.00000', '1.00000']
Note
Reference [1] Oskar Perron (1907) Mathematische Annalen
Reference [2] Georg Frobenius (1912) Sitzungsberichte der Königlich Preussischen Akademie der Wissenschaften
Reference [3] Brian H. Marcus et al. (2001) Lecture notes
Reference [4] Nabil Kahale (1995) Journal of the ACM
Reference [5] William Ford (2014) Academic Press
- dsw.graphized.path_matching(dna_sequence, accessor, previous_index, occur_location, has_indel=False, nucleotides=None)[source]¶
Perform saturation repair at the selected position and obtain the DNA sequences matching the path of accessor.
- Parameters:
dna_sequence (str) – DNA sequence waiting for saturation substitution in the specific location.
accessor (numpy.ndarray) – accessor.
previous_index (int) – previous vertex index before the occurred error location.
occur_location (int) – the location that may occurring substitution (or specific location).
has_indel (bool) – consider insertion and/or deletion error.
nucleotides (str or None) – usage of nucleotides.
- Returns:
repaired DNA sequences (may contain multiple repair show) and visited count.
- Return type:
list, int
- Example
>>> from numpy import array >>> from dsw import path_matching >>> # accessor with GC-balanced >>> accessor = array([[-1, -1, -1, -1], [ 4, -1, -1, 7], [ 8, -1, -1, 11], [-1, -1, -1, -1], [-1, 1, 2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1], [-1, 1, 2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1], [-1, -1, -1, -1], [ 4, -1, -1, 7], [ 8, -1, -1, 11], [-1, -1, -1, -1]]) >>> dna_sequence = "TCTCTATCTCT" # "TCTCTCTCTCT" is original DNA sequence >>> path_matching(dna_sequence=dna_sequence, accessor=accessor, previous_index=7, occur_location=5, has_indel=True) ([(('S', 5, 'C'), 'TCTCTCTCTCT'), (('S', 5, 'G'), 'TCTCTGTCTCT')], 12)
- dsw.graphized.calculate_intersection_score(latter_map, observed_length=10, has_insertion=True, has_deletion=True, verbose=False)[source]¶
Calculate the intersection score based on the breach-first search.
- Parameters:
latter_map (dict) – latter vertex map of graph.
observed_length (int) – length of the DNA sequence in a vertex.
has_insertion (bool) – consider to repair insertion errors.
has_deletion (bool) – consider to repair deletion errors.
verbose (bool) – need to print log.
- Returns:
intersection scores for each arc in the coding graph (consistent with the shape of accessor).
- Return type:
numpy.ndarray
- Example
>>> from dsw import calculate_intersection_score >>> # latter_map with GC-balanced >>> latter_map = {1: [4, 7], 2: [8, 11], 4: [1, 2], 7: [13, 14], 8: [1, 2], 11: [13, 14], 13: [4, 7], 14: [8, 11]} >>> calculate_intersection_score(latter_map, observed_length=10, has_insertion=True, has_deletion=True, verbose=False) array([[ 0, 0, 0, 0], [28, 0, 0, 28], [28, 0, 0, 28], ..., [ 0, 0, 0, 0], [ 0, 0, 0, 0], [ 0, 0, 0, 0]])
Note
It is a gift for the follow-up investigation. That is, removing arc to improve the capability of the probabilistic error correction.
- dsw.graphized.obtain_formers(current, observed_length)[source]¶
Obtain former vertex given_amino_acids based on the current vertex index.
- Parameters:
current (int) – current vertex index.
observed_length (int) – length of the DNA sequence in a vertex.
- Returns:
former vertex given_amino_acids.
- Return type:
list
- Example
>>> from dsw import obtain_formers >>> current = 0 >>> obtain_formers(current=current, observed_length=2) [0, 4, 8, 12] >>> obtain_formers(current=current, observed_length=3) [0, 16, 32, 48] >>> obtain_formers(current=current, observed_length=4) [0, 64, 128, 192] >>> obtain_formers(current=current, observed_length=5) [0, 256, 512, 768] >>> obtain_formers(current=current, observed_length=6) [0, 1024, 2048, 3072] >>> obtain_formers(current=current, observed_length=7) [0, 4096, 8192, 12288] >>> obtain_formers(current=current, observed_length=8) [0, 16384, 32768, 49152] >>> obtain_formers(current=current, observed_length=9) [0, 65536, 131072, 196608]
- dsw.graphized.obtain_latters(current, observed_length)[source]¶
Obtain latter vertex given_amino_acids based on the current vertex index.
- Parameters:
current (int) – current vertex index.
observed_length (int) – length of the DNA sequence in a vertex.
- Returns:
latter vertex given_amino_acids.
- Return type:
list
- Example
>>> from dsw import obtain_latters >>> current = 0 >>> obtain_latters(current=current, observed_length=2) [0, 1, 2, 3] >>> obtain_latters(current=current, observed_length=3) [0, 1, 2, 3] >>> obtain_latters(current=current, observed_length=4) [0, 1, 2, 3] >>> obtain_latters(current=current, observed_length=5) [0, 1, 2, 3] >>> obtain_latters(current=current, observed_length=6) [0, 1, 2, 3] >>> obtain_latters(current=current, observed_length=7) [0, 1, 2, 3] >>> obtain_latters(current=current, observed_length=8) [0, 1, 2, 3] >>> obtain_latters(current=current, observed_length=9) [0, 1, 2, 3]
- dsw.graphized.obtain_leaf_vertices(vertex_index, depth, accessor=None, latter_map=None)[source]¶
Obtain leaf vertices in required depth of the tree with the rooted vertex.
- Parameters:
vertex_index (int) – vertex index in the graph.
depth (int) – stride of the breadth-first search.
accessor (numpy.ndarray) – accessor of graph.
latter_map (dict) – latter vertex map of graph.
- Returns:
given_amino_acids of required leaf vertex.
- Return type:
numpy.ndarray
- Example
>>> from numpy import array >>> from dsw import obtain_leaf_vertices >>> # accessor with GC-balanced >>> accessor = array([[-1, -1, -1, -1], [ 4, -1, -1, 7], [ 8, -1, -1, 11], [-1, -1, -1, -1], [-1, 1, 2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1], [-1, 1, 2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1], [-1, -1, -1, -1], [ 4, -1, -1, 7], [ 8, -1, -1, 11], [-1, -1, -1, -1]]) >>> obtain_leaf_vertices(vertex_index=1, depth=1, accessor=accessor) array([4, 7]) >>> obtain_leaf_vertices(vertex_index=1, depth=2, accessor=accessor) array([ 1, 2, 13, 14]) >>> latter_map = {1: [4, 7], 2: [8, 11], 4: [1, 2], 7: [13, 14], 8: [1, 2], 11: [13, 14], 13: [4, 7], 14: [8, 11]} >>> obtain_leaf_vertices(vertex_index=1, depth=1, latter_map=latter_map) array([4, 7]) >>> obtain_leaf_vertices(vertex_index=1, depth=2, latter_map=latter_map) array([ 1, 2, 13, 14])
Note
Either the parameter “accessor” or the parameter “latter_map” must occur, but not both.
- dsw.graphized.remove_useless(latter_map, threshold, verbose=False)[source]¶
Remove useless vertices (the out-degree of witch less than threshold) in the latter map.
- Parameters:
latter_map (dict) – latter vertex map of graph.
threshold (int) – minimum out-degree threshold.
verbose (bool) – need to print log.
- Returns:
useful latter map.
- Return type:
dict
- Example
>>> from dsw import remove_useless >>> latter_map_1 = {0: [1, 2], 1: [], 2: [3, 4], 3: [0]} >>> remove_useless(latter_map=latter_map_1, threshold=1) {0: [2], 2: [3], 3: [0]} >>> latter_map_2 = {1: [4, 7], 2: [8, 11], 4: [1, 2], 7: [13, 14], 8: [1, 2], 11: [13, 14], 13: [4, 7], 14: [8, 11]} >>> remove_useless(latter_map=latter_map_2, threshold=1) {1: [4, 7], 2: [8, 11], 4: [1, 2], 7: [13, 14], 8: [1, 2], 11: [13, 14], 13: [4, 7], 14: [8, 11]}
- dsw.graphized.adjacency_matrix_to_accessor(matrix, verbose=False)[source]¶
Convert the adjacency matrix to the equivalent accessor (compressed matrix).
- Parameters:
matrix (numpy.ndarray) – adjacency matrix.
verbose (bool) – need to print log.
- Raises:
ValueError – when you input an adjacency matrix with wrong format.
- Returns:
equivalent compressed matrix (accessor), compress rate = len(n_system) / len(matrix).
- Return type:
numpy.ndarray
- Example
>>> from dsw import adjacency_matrix_to_accessor >>> adjacency_matrix = array([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]]) >>> adjacency_matrix_to_accessor(matrix=adjacency_matrix) array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11], [12, 13, 14, 15], [ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11], [12, 13, 14, 15], [ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11], [12, 13, 14, 15], [ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11], [12, 13, 14, 15]])
Note
The size of accessor is 4 ^ l * 4 and that of corresponding adjacency matrix is 4 ^ l * 4 ^ l.
- dsw.graphized.accessor_to_adjacency_matrix(accessor, maximum_length=8, verbose=False)[source]¶
Convert the accessor (compressed matrix) to its equivalent adjacency matrix.
- Parameters:
accessor – accessor (compressed matrix).
maximum_length (int) – maximum vertex length (like 8 in general) of the adjacency matrix.
verbose (bool) – need to print log.
- Type:
numpy.ndarray
- Raises:
MemoryError – when you generate a large adjacency matrix that your memory cannot allocate.
ValueError – when you input a graph of DNA Spider-Web with wrong format.
- Returns:
adjacency matrix of the uncompressed graph.
- Return type:
numpy.ndarray
- Example
>>> from dsw import accessor_to_adjacency_matrix, adjacency_matrix_to_accessor, get_complete_accessor >>> accessor = array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15], [0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15], [0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15], [0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]) >>> accessor_to_adjacency_matrix(accessor=accessor) array([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])
Note
The size of accessor is 4 ^ l * 4 and that of corresponding adjacency matrix is 4 ^ l * 4 ^ l.
- dsw.graphized.get_complete_accessor(observed_length, verbose=False)[source]¶
Get a complete accessor with the required observed length.
- Parameters:
observed_length (int) – length of the DNA sequence in a vertex.
verbose (bool) – need to print log.
- Returns:
complete accessor.
- Return type:
numpy.ndarray
- Example
>>> from dsw import get_complete_accessor >>> get_complete_accessor(observed_length=2) array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11], [12, 13, 14, 15], [ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11], [12, 13, 14, 15], [ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11], [12, 13, 14, 15], [ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11], [12, 13, 14, 15]])
Note
The size of accessor is 4 ^ l * 4 and that of corresponding adjacency matrix is 4 ^ l * 4 ^ l.
- dsw.graphized.latter_map_to_accessor(latter_map, observed_length, threshold=None, verbose=False)[source]¶
Convert the latter map to the equivalent accessor.
- Parameters:
latter_map (dict) – latter vertex map of graph.
observed_length (int) – length of the DNA sequence in a vertex.
threshold (int or None) – minimum out-degree threshold.
verbose (bool) – need to print log.
- Returns:
equivalent accessor.
- Return type:
numpy.ndarray
- Example
>>> from dsw import latter_map_to_accessor >>> latter_map = {1: [4, 7], 2: [8, 11], 4: [1, 2], 7: [13, 14], 8: [1, 2], 11: [13, 14], 13: [4, 7], 14: [8, 11]} >>> latter_map_to_accessor(latter_map=latter_map, observed_length=2) array([[-1, -1, -1, -1], [ 4, -1, -1, 7], [ 8, -1, -1, 11], [-1, -1, -1, -1], [-1, 1, 2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1], [-1, 1, 2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1], [-1, -1, -1, -1], [ 4, -1, -1, 7], [ 8, -1, -1, 11], [-1, -1, -1, -1]])
Note
The size of accessor is 4 ^ l * 4.
The size of corresponding latter map is further reduced, which only retains available information of follow-up vertices. However, latter map is not suitable for matrix calculation.
- dsw.graphized.accessor_to_latter_map(accessor, verbose=False)[source]¶
Convert the accessor to its equivalent latter map.
- Parameters:
accessor (numpy.ndarray) – accessor of graph.
verbose (bool) – need to print log.
- Returns:
latter vertex map of graph.
- Return type:
dict
- Example
>>> from numpy import array >>> from dsw import get_complete_accessor, accessor_to_latter_map >>> # accessor with GC-balanced >>> accessor = array([[-1, -1, -1, -1], [ 4, -1, -1, 7], [ 8, -1, -1, 11], [-1, -1, -1, -1], [-1, 1, 2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1], [-1, 1, 2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1], [-1, -1, -1, -1], [ 4, -1, -1, 7], [ 8, -1, -1, 11], [-1, -1, -1, -1]]) >>> accessor_to_latter_map(accessor=accessor) {1: [4, 7], 2: [8, 11], 4: [1, 2], 7: [13, 14], 8: [1, 2], 11: [13, 14], 13: [4, 7], 14: [8, 11]}
Note
The size of accessor is 4 ^ l * 4.
The size of corresponding latter map is further reduced, which only retains available information of follow-up vertices. However, latter map is not suitable for matrix calculation.
Fundamental Operation Module¶
- dsw.operation.calculus_addition(number, base)[source]¶
Do huge number addition calculus with a small base value, as number + base.
- Parameters:
number (str) – huge number.
base (str) – base value:
- Returns:
number + base.
- Return type:
str
Note
The integer of parameter “base” must less be than 10 in decimal system.
- Example
>>> from dsw import calculus_addition >>> calculus_addition(number="99999999999999999999999999999999999999999999999999", base="2") '100000000000000000000000000000000000000000000000001'
- dsw.operation.calculus_subtraction(number, base)[source]¶
Do huge number subtraction calculus with a small base value, as number - base.
- Parameters:
number (str) – huge number.
base (str) – base value:
- Returns:
number - base.
- Return type:
str
Note
The integer of parameter “base” must less be than 10 in decimal system.
- Example
>>> from dsw import calculus_subtraction >>> calculus_subtraction(number="10000000000000000000000000000000000000000000000001", base="2") '9999999999999999999999999999999999999999999999999'
- dsw.operation.calculus_multiplication(number, base)[source]¶
Do huge number multiplication calculus with a small base value, as number * base.
- Parameters:
number (str) – huge number.
base (str) – base value:
- Returns:
number * base.
- Return type:
str
Note
The integer of parameter “base” must less be than 10 in decimal system.
- Example
>>> from dsw import calculus_multiplication >>> calculus_multiplication(number="9999999999999999999999999999999999999999999999999", base="2") '19999999999999999999999999999999999999999999999998'
- dsw.operation.calculus_division(number, base)[source]¶
Do huge number division calculus with a small base value, as number / base and number % base.
- Parameters:
number (str) – huge number.
base (str) – base value:
- Returns:
number // base and number % base.
- Return type:
(str, str)
- Example
>>> from dsw import calculus_division >>> calculus_division(number="9999999999999999999999999999999999999999999999999", base="2") ('4999999999999999999999999999999999999999999999999', '1')
Note
The integer of parameter “base” must less be than 10 in decimal system.
When the divisor (base) is “0”, the program will return “0”, “0”.
- dsw.operation.dna_to_number(dna_sequence, is_string=True)[source]¶
Transform a DNA string to the equivalent decimal number.
- Parameters:
dna_sequence (str) – required DNA string.
is_string – type of equivalent decimal number is str.
- Type:
is_string: bool
- Returns:
equivalent decimal number (may huge) of the inputted DNA string.
- Return type:
str or int
- Example
>>> from dsw import dna_to_number >>> dna_to_number(dna_sequence="ACGTACGT", is_string=True) '6939' >>> dna_to_number(dna_sequence="ACGTACGT", is_string=False) 6939
- dsw.operation.number_to_dna(decimal_number, dna_length)[source]¶
Transform a decimal number to the equivalent DNA string with specific length.
- Parameters:
decimal_number (str or int) – decimal number (may huge) of the DNA string.
dna_length (int) – default length of the DNA string.
- Returns:
equivalent DNA string of the decimal number.
- Return type:
str
- Example
>>> from dsw import number_to_dna >>> number_to_dna(decimal_number=6939, dna_length=8) 'ACGTACGT' >>> number_to_dna(decimal_number="6939", dna_length=8) 'ACGTACGT'
- dsw.operation.bit_to_number(bit_array, is_string=True, verbose=False)[source]¶
Transform a bit array to the equivalent decimal number.
- Parameters:
bit_array (list) – bit array.
is_string – type of equivalent decimal number is str.
verbose (bool) – need to print log.
- Type:
is_string: bool
- Returns:
equivalent decimal number (may huge) of the inputted bit array.
- Return type:
str or int
- Example
>>> from dsw import bit_to_number >>> bit_to_number(bit_array=[1, 1, 1, 1, 1, 0, 0, 1, 1, 1], is_string=True) '999' >>> bit_to_number(bit_array=[1, 1, 1, 1, 1, 0, 0, 1, 1, 1], is_string=False) 999
- dsw.operation.number_to_bit(decimal_number, bit_length)[source]¶
Transform a decimal number to the equivalent bit array with specific length.
- Parameters:
decimal_number (str or int) – decimal number (may huge) of the bit array.
bit_length (int) – default length of the bit array.
- Returns:
bit array.
- Return type:
list
- Example
>>> from dsw import number_to_bit >>> number_to_bit(decimal_number="999", bit_length=10) [1, 1, 1, 1, 1, 0, 0, 1, 1, 1] >>> number_to_bit(decimal_number=999, bit_length=10) [1, 1, 1, 1, 1, 0, 0, 1, 1, 1]