Modules¶

SPIDER-WEB package consists of four modules:

biochemical constraint module (biofilter.py): implementation of ‘biochemical constraint filter’.
transcoding process module (spiderweb.py): implementation of ‘coding algorithm generation’, ‘encoding process’, ‘decoding process’, ‘shuffled matrix creation for mapping shuffling’, ‘Varshamov-Tenengolts-based path check set’, and ‘obtained DNA string correction’.
graph-based operation module (graphized.py): implementation of ‘data structure transformation between different graph representations’, ‘vertex search’, ‘path search’, and ‘capacity approximation’.
fundamental operation module (operation.py): implementation of ‘process monitor’, ‘large integer basic operation’, and ‘conversion between binary message, decimal number, and DNA string’.

Biochemical Constraint Module¶

class dsw.biofilter.DefaultBioFilter(screen_name)[source]¶

Bases: object

valid(dna_string)[source]¶

Judge whether the DNA string meets the requirements.

Parameters:: dna_string (str) – DNA string to be judged.
Raise:: this interface needs to be implemented.
Returns:: judgement.
Return type:: bool

class dsw.biofilter.LocalBioFilter(observed_length, max_homopolymer_runs=None, gc_range=None, undesired_motifs=None)[source]¶

Bases: DefaultBioFilter

valid(dna_sequence, only_last=True)[source]¶

Judge whether the DNA sequence meets the local biochemical constraints.

Parameters:

dna_sequence (str) – DNA sequence to be judged.
only_last (bool) – only check the DNA sequence of the last observed window.

Returns:

judgement.

Return type:

bool

Note

“only_last” parameter is used to save time. For most tree-based coding algorithms, it is not necessary to detect the sub DNA sequences observed in each window from scratch every time.

Transcoding Process Module¶

dsw.spiderweb.find_vertices(observed_length, bio_filter, verbose=False)[source]¶

Find valid vertices based on the given the biochemical constraints.

Parameters:

observed_length (int) – length of the DNA sequence in a vertex.
bio_filter (dsw.biofilter.DefaultBioFilter) – screening operation for identifying the valid DNA sequence (required the given constraints).
verbose (bool) – need to print log.

Returns:

available vertices.

Return type:

numpy.ndarray

Raises:

ValueError – if no valid vertices are available under the required biochemical constraints.

Example

>>> from dsw import LocalBioFilter, find_vertices
>>> bio_filter = LocalBioFilter(observed_length=2, max_homopolymer_runs=2, gc_range=[0.5, 0.5])
>>> # "1" refers to the available index of vertex, otherwise "0" refers to unavailable index of vertex.
>>> find_vertices(observed_length=2, bio_filter=bio_filter).astype(int)
array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0])

Note

Reference [1] Florent Capelli and Yann Strozecki (2019) Discrete Applied Mathematics

dsw.spiderweb.connect_valid_graph(observed_length, vertices, verbose=False)[source]¶

Connect a valid graph by valid vertices.

Parameters:

observed_length (int) – length of the DNA sequence in a vertex.
vertices (numpy.ndarray) – vertex accessor, in each cell, True is valid vertex and False is invalid vertex.
verbose (bool) – need to print log.

Returns:

accessor of the valid graph.

Return type:

numpy.ndarray

Raises:

ValueError – if no valid vertices are available.

Example

>>> from numpy import array
>>> from dsw import connect_valid_graph
>>> vertices = array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0])
>>> connect_valid_graph(observed_length=2, vertices=vertices)
array([[-1, -1, -1, -1],
       [ 4, -1, -1,  7],
       [ 8, -1, -1, 11],
       [-1, -1, -1, -1],
       [-1,  1,  2, -1],
       [-1, -1, -1, -1],
       [-1, -1, -1, -1],
       [-1, 13, 14, -1],
       [-1,  1,  2, -1],
       [-1, -1, -1, -1],
       [-1, -1, -1, -1],
       [-1, 13, 14, -1],
       [-1, -1, -1, -1],
       [ 4, -1, -1,  7],
       [ 8, -1, -1, 11],
       [-1, -1, -1, -1]])

Note

Reference [1] Nicolaas Govert de Bruijn (1946) Indagationes Mathematicae

dsw.spiderweb.connect_coding_graph(observed_length, vertices, threshold, verbose=False)[source]¶

Connect a coding algorithm by valid vertices and the threshold for minimum out-degree.

Parameters:

observed_length (int) – length of the DNA sequence in a vertex.
vertices (numpy.ndarray) – vertex accessor, in each cell, True is valid vertex and False is invalid vertex.
threshold (int) – threshold for minimum out-degree.
verbose (bool) – need to print log.

Returns:

coding vertices and coding accessor.

Return type:

(numpy.ndarray, numpy.ndarray)

Raises:

ValueError – if no coding graph are created because of the vertex set and/or the trimming requirement.

Example

>>> from numpy import array
>>> from dsw import connect_coding_graph
>>> vertices = array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0])
>>> vertices, accessor = connect_coding_graph(observed_length=2, vertices=vertices, threshold=2)
>>> accessor
array([[-1, -1, -1, -1],
       [ 4, -1, -1,  7],
       [ 8, -1, -1, 11],
       [-1, -1, -1, -1],
       [-1,  1,  2, -1],
       [-1, -1, -1, -1],
       [-1, -1, -1, -1],
       [-1, 13, 14, -1],
       [-1,  1,  2, -1],
       [-1, -1, -1, -1],
       [-1, -1, -1, -1],
       [-1, 13, 14, -1],
       [-1, -1, -1, -1],
       [ 4, -1, -1,  7],
       [ 8, -1, -1, 11],
       [-1, -1, -1, -1]])
>>> vertices.astype(int)
array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0])

Note

Reference [1] Nicolaas Govert de Bruijn (1946) Indagationes Mathematicae

dsw.spiderweb.create_random_shuffles(observed_length, random_seed=None, verbose=False)[source]¶

Create the shuffles for accessor through the random mechanism.

Parameters:

observed_length (int) – length of the DNA sequence in a vertex.
random_seed (int) – random seed for creating shuffles.
verbose (bool) – need to print log.

Returns:

shuffles for accessor.

Return type:

numpy.ndarray

Example

>>> from dsw import create_random_shuffles
>>> create_random_shuffles(observed_length=2, random_seed=2021)
array([[3, 2, 1, 0],
       [2, 3, 1, 0],
       [3, 1, 0, 2],
       [0, 3, 1, 2],
       [3, 2, 0, 1],
       [1, 0, 3, 2],
       [0, 3, 1, 2],
       [2, 0, 1, 3],
       [2, 3, 0, 1],
       [1, 0, 3, 2],
       [2, 0, 1, 3],
       [0, 1, 3, 2],
       [2, 3, 1, 0],
       [2, 0, 3, 1],
       [0, 1, 3, 2],
       [0, 3, 2, 1]])

Note

The mapping shuffle strategy disrupts the digit-to-nucleotide mapping order, so it can be used as an privacy protection mechanism.

Value 0 ~ 3 in each line shuffle only describes the relationship between progressive order and position. The original position is [0, 1, 2, 3]. If the follow-up nucleotides in a vertex are [“A”, “G”] when the line shuffle is [3, 2, 1, 0], This line shuffle can be regarded as [1, -, 0, -] for [“A”, -, “G”, -].

Ths shuffle will not disclose the topology information of accessor.

dsw.spiderweb.encode(binary_message, accessor, start_index, is_faster=False, vt_length=0, shuffles=None, need_path=False, verbose=False)[source]¶

Encode a bit array by the specific accessor.

Parameters:

binary_message (numpy.ndarray) – binary message.
accessor (numpy.ndarray) – accessor of the coding algorithm.
start_index (int) – virtual vertex to start encoding.
is_faster (bool) – encode in a faster way.
vt_length (int or None) – length of Varshamov-Tenengolts code for error-correction.
shuffles – shuffle relationships for bit-nucleotide mapping.
need_path (bool) – need to record the restricted state transition path.
verbose (bool) – need to print log.

Type:

numpy.ndarray

Returns:

DNA sequence encoded by this graph (and VT check sequence if required).

Return type:

str or (str, str)

Note

If the parameter “is_faster” is set as True, the out-degree of coding digraph cannot contains 3.

Example

>>> from numpy import array
>>> from dsw import encode
>>> # accessor with GC-balanced
>>> accessor = array([[-1, -1, -1, -1], [ 4, -1, -1,  7], [ 8, -1, -1, 11], [-1, -1, -1, -1],                               [-1,  1,  2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1],                               [-1,  1,  2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1],                               [-1, -1, -1, -1], [ 4, -1, -1,  7], [ 8, -1, -1, 11], [-1, -1, -1, -1]])
>>> binary_message = array([0, 1, 0, 1, 0, 1, 0, 1])
>>> encode(accessor=accessor, binary_message=binary_message, start_index=1)
'TCTCTCT'
>>> encode(accessor=accessor, binary_message=binary_message, start_index=1, vt_length=5)
('TCTCTCT', 'TAAGC')
>>> shuffles = array([[3, 2, 1, 0], [2, 3, 1, 0], [3, 1, 0, 2], [0, 3, 1, 2],                               [3, 2, 0, 1], [1, 0, 3, 2], [0, 3, 1, 2], [2, 0, 1, 3],                               [2, 3, 0, 1], [1, 0, 3, 2], [2, 0, 1, 3], [0, 1, 3, 2],                               [2, 3, 1, 0], [2, 0, 3, 1], [0, 1, 3, 2], [0, 3, 2, 1]])
>>> encode(accessor=accessor, binary_message=binary_message, shuffles=shuffles, start_index=1)
'AGAGAGA'

dsw.spiderweb.decode(dna_sequence, bit_length, accessor, start_index, is_faster=False, vt_check=None, shuffles=None, verbose=False)[source]¶

Decode a DNA sequence by the specific accessor.

Parameters:

dna_sequence (str) – DNA sequence encoded by this graph.
bit_length (int) – length of the bit array.
accessor (numpy.ndarray) – accessor of the coding algorithm (consistent with the encoding process).
start_index (int) – virtual vertex to start decoding (consistent with the encoding process).
is_faster (bool) – encode in a faster way.
vt_check (str or None) – check sequence of Varshamov-Tenengolts code.
shuffles (numpy.ndarray) – shuffle relationships for bit-nucleotide mapping.
verbose (bool) – need to print log.

Returns:

binary message decoded by this graph.

Return type:

numpy.ndarray

Raises:

ValueError – if one or more errors are found.

Note

If the parameter “is_faster” is set as True, the out-degree of coding digraph cannot contains 3.

Example

>>> from numpy import array
>>> from dsw import decode
>>> # accessor with GC-balanced
>>> accessor = array([[-1, -1, -1, -1], [ 4, -1, -1,  7], [ 8, -1, -1, 11], [-1, -1, -1, -1],                               [-1,  1,  2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1],                               [-1,  1,  2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1],                               [-1, -1, -1, -1], [ 4, -1, -1,  7], [ 8, -1, -1, 11], [-1, -1, -1, -1]])
>>> dna_sequence = "TCTCTCT"
>>> decode(accessor=accessor, dna_sequence=dna_sequence, start_index=1, bit_length=8)
array([0, 1, 0, 1, 0, 1, 0, 1])
>>> vt_check = "TAAGC"
>>> decode(accessor=accessor, dna_sequence=dna_sequence, start_index=1, vt_check=vt_check, bit_length=8)
array([0, 1, 0, 1, 0, 1, 0, 1])
>>> shuffles = array([[3, 2, 1, 0], [2, 3, 1, 0], [3, 1, 0, 2], [0, 3, 1, 2],                               [3, 2, 0, 1], [1, 0, 3, 2], [0, 3, 1, 2], [2, 0, 1, 3],                               [2, 3, 0, 1], [1, 0, 3, 2], [2, 0, 1, 3], [0, 1, 3, 2],                               [2, 3, 1, 0], [2, 0, 3, 1], [0, 1, 3, 2], [0, 3, 2, 1]])
>>> decode(accessor=accessor, dna_sequence="TCTCTCT", start_index=1, shuffles=shuffles, bit_length=8)
array([0, 0, 0, 0, 0, 0, 0, 0])

dsw.spiderweb.set_vt(dna_sequence, vt_length)[source]¶

Set Varshamov-Tenengolts-based path check string (‘salt-protected’) from DNA (payload) sequence.

Parameters:

dna_sequence (str) – DNA sequence encoded through SPIDER-WEB.
vt_length (int or None) – length of DNA sequence (path check).

Returns:

path check DNA sequence with required length.

Example

>>> from dsw import set_vt
>>> dna_sequence = "TCTCTCT"
>>> vt_length = 5
>>> set_vt(dna_sequence=dna_sequence, vt_length=vt_length)
'TAAGC'

Note

Reference [1] Rom R. Varshamov and Grigory M. Tenengolts (1965) Avtomat. i Telemekh

Reference [2] Grigory Tenengolts (1984) IEEE Transactions on Information Theory

Reference [3] William H. Press et al. (2020) Proceedings of the National Academy of Sciences

Reference [4] A. Xavier Kohll et al. (2020) Chemical Communications

dsw.spiderweb.repair_dna(dna_sequence, accessor, start_index, observed_length, vt_check=None, has_indel=False, heap_size=1000.0)[source]¶

Repair the DNA sequence containing one (or more) errors.

Parameters:

dna_sequence (str) – DNA sequence waiting for recovery.
accessor (numpy.ndarray) – accessor of the coding algorithm.
start_index (int) – virtual vertex to start encoding.
observed_length (int) – length of the DNA sequence in a vertex.
vt_check (str or None) – check sequence of Varshamov-Tenengolts code.
has_indel (bool) – consider insertion and/or deletion error.
heap_size (int) – maximum heap size.

Returns:

repaired DNA sequence set and additional information.

Return type:

(list, (bool, bool, int))

Example

>>> from numpy import array
>>> from dsw import repair_dna
>>> # accessor with GC-balanced
>>> accessor = array([[-1, -1, -1, -1], [ 4, -1, -1,  7], [ 8, -1, -1, 11], [-1, -1, -1, -1],                               [-1,  1,  2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1],                               [-1,  1,  2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1],                               [-1, -1, -1, -1], [ 4, -1, -1,  7], [ 8, -1, -1, 11], [-1, -1, -1, -1]])
>>> dna_sequence = "TCTCTATCTCTC"  # "TCTCTCTCTCTC" is original DNA sequence
>>> repair_dna(dna_sequence=dna_sequence, accessor=accessor, start_index=1, observed_length=2, has_indel=True)
(['TCTCTCTCTCTC', 'TCTCTGTCTCTC'], (1, False, 2, 14))
>>> vt_check = "AACGC"  # check list of Varshamov-Tenengolts code.
>>> repair_dna(dna_sequence=dna_sequence, accessor=accessor, start_index=1, observed_length=2,                        vt_check=vt_check, has_indel=True)
(['TCTCTCTCTCTC'], (1, True, 2, 14))

dsw.spiderweb.remove_nasty_arc(accessor, latter_map, iteration=0, has_insertion=True, has_deletion=True, verbose=False)[source]¶

Remove the nasty arc.

Parameters:

accessor (numpy.ndarray) – accessor of the coding algorithm.
latter_map (dict) – latter map of the coding algorithm.
iteration (int) – current round if required.
has_insertion (bool) – need to repair insertion error.
has_deletion (bool) – need to repair deletion error.
verbose (bool) – need to print log.

Returns:

adjusted accessor, adjusted latter map, removed arc, and maximum intersection score.

Return type:

(numpy.ndarray, dict, tuple, int)

Example

>>> from numpy import array
>>> from dsw import accessor_to_latter_map, remove_nasty_arc
>>> # accessor with GC-balanced
>>> accessor = array([[-1, -1, -1, -1], [ 4, -1, -1,  7], [ 8, -1, -1, 11], [-1, -1, -1, -1],                               [-1,  1,  2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1],                               [-1,  1,  2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1],                               [-1, -1, -1, -1], [ 4, -1, -1,  7], [ 8, -1, -1, 11], [-1, -1, -1, -1]])
>>> latter_map = accessor_to_latter_map(accessor)
>>> result = remove_nasty_arc(accessor=accessor, latter_map=latter_map, iteration=0, verbose=False,                                       has_insertion=True, has_deletion=True)
>>> result[0]
array([[-1, -1, -1, -1],
       [-1, -1, -1,  7],
       [ 8, -1, -1, 11],
       [-1, -1, -1, -1],
       [-1,  1,  2, -1],
       [-1, -1, -1, -1],
       [-1, -1, -1, -1],
       [-1, 13, 14, -1],
       [-1,  1,  2, -1],
       [-1, -1, -1, -1],
       [-1, -1, -1, -1],
       [-1, 13, 14, -1],
       [-1, -1, -1, -1],
       [ 4, -1, -1,  7],
       [ 8, -1, -1, 11],
       [-1, -1, -1, -1]])
>>> result[1]
{1: [7], 2: [8, 11], 4: [1, 2], 7: [13, 14], 8: [1, 2], 11: [13, 14], 13: [4, 7], 14: [8, 11]}
>>> result[2]
(1, 4)
>>> result[3]
[16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16]

Note

The graph information contained in “accessor” and “latter_map” is consistent.

It is a gift for the follow-up investigation. That is, removing arc to improve the capability of the probabilistic error correction.

Graph-based Operation Module¶

dsw.graphized.approximate_capacity(accessor, tolerance_level=-10, repeats=1, maximum_iteration=500, process=False, verbose=False)[source]¶

Approximate the capacity of the specific graph through Perron–Frobenius theorem.

Parameters:

accessor (numpy.ndarray) – accessor of graph.
tolerance_level (int) – error tolerance of power iteration.
repeats (int) – random repeats for approximating the capacity.
maximum_iteration (int) – maximum iteration in the power method.
process (bool) – need eigenvalue in the process.
verbose (bool) – need to print log.

Returns:

capacity of this graph (accessor) and process values if required.

Return type:

float or (float, list)

Example

>>> from numpy import array, random
>>> from dsw import approximate_capacity
>>> # accessor with GC-balanced
>>> accessor = array([[-1, -1, -1, -1], [ 4, -1, -1,  7], [ 8, -1, -1, 11], [-1, -1, -1, -1],                               [-1,  1,  2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1],                               [-1,  1,  2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1],                               [-1, -1, -1, -1], [ 4, -1, -1,  7], [ 8, -1, -1, 11], [-1, -1, -1, -1]])
>>> approximate_capacity(accessor=accessor, tolerance_level=-10, repeats=2, process=False)
1.0
>>> random.seed(0)
>>> capacity, processes = approximate_capacity(accessor=accessor, tolerance_level=-10, repeats=2, process=True)
>>> capacity
1.0
>>> ["%.5f" % _ for _ in processes[0]]
['0.57779', '0.91175', '1.00000', '1.00000']
>>> ["%.5f" % _ for _ in processes[1]]
['0.81488', '0.68189', '1.00000', '1.00000']

Note

Reference [1] Oskar Perron (1907) Mathematische Annalen

Reference [2] Georg Frobenius (1912) Sitzungsberichte der Königlich Preussischen Akademie der Wissenschaften

Reference [3] Brian H. Marcus et al. (2001) Lecture notes

Reference [4] Nabil Kahale (1995) Journal of the ACM

Reference [5] William Ford (2014) Academic Press

dsw.graphized.path_matching(dna_sequence, accessor, previous_index, occur_location, has_indel=False, nucleotides=None)[source]¶

Perform saturation repair at the selected position and obtain the DNA sequences matching the path of accessor.

Parameters:

dna_sequence (str) – DNA sequence waiting for saturation substitution in the specific location.
accessor (numpy.ndarray) – accessor.
previous_index (int) – previous vertex index before the occurred error location.
occur_location (int) – the location that may occurring substitution (or specific location).
has_indel (bool) – consider insertion and/or deletion error.
nucleotides (str or None) – usage of nucleotides.

Returns:

repaired DNA sequences (may contain multiple repair show) and visited count.

Return type:

list, int

Example

>>> from numpy import array
>>> from dsw import path_matching
>>> # accessor with GC-balanced
>>> accessor = array([[-1, -1, -1, -1], [ 4, -1, -1,  7], [ 8, -1, -1, 11], [-1, -1, -1, -1],                               [-1,  1,  2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1],                               [-1,  1,  2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1],                               [-1, -1, -1, -1], [ 4, -1, -1,  7], [ 8, -1, -1, 11], [-1, -1, -1, -1]])
>>> dna_sequence = "TCTCTATCTCT"  # "TCTCTCTCTCT" is original DNA sequence
>>> path_matching(dna_sequence=dna_sequence, accessor=accessor, previous_index=7, occur_location=5,                           has_indel=True)
([(('S', 5, 'C'), 'TCTCTCTCTCT'), (('S', 5, 'G'), 'TCTCTGTCTCT')], 12)

dsw.graphized.calculate_intersection_score(latter_map, observed_length=10, has_insertion=True, has_deletion=True, verbose=False)[source]¶

Calculate the intersection score based on the breach-first search.

Parameters:

latter_map (dict) – latter vertex map of graph.
observed_length (int) – length of the DNA sequence in a vertex.
has_insertion (bool) – consider to repair insertion errors.
has_deletion (bool) – consider to repair deletion errors.
verbose (bool) – need to print log.

Returns:

intersection scores for each arc in the coding graph (consistent with the shape of accessor).

Return type:

numpy.ndarray

Example

>>> from dsw import calculate_intersection_score
>>> # latter_map with GC-balanced
>>> latter_map = {1: [4, 7], 2: [8, 11], 4: [1, 2], 7: [13, 14],                           8: [1, 2], 11: [13, 14], 13: [4, 7], 14: [8, 11]}
>>> calculate_intersection_score(latter_map, observed_length=10,                                          has_insertion=True, has_deletion=True, verbose=False)
array([[ 0,  0,  0,  0],
       [28,  0,  0, 28],
       [28,  0,  0, 28],
       ...,
       [ 0,  0,  0,  0],
       [ 0,  0,  0,  0],
       [ 0,  0,  0,  0]])

Note

It is a gift for the follow-up investigation. That is, removing arc to improve the capability of the probabilistic error correction.

dsw.graphized.obtain_formers(current, observed_length)[source]¶

Obtain former vertex given_amino_acids based on the current vertex index.

Parameters:

current (int) – current vertex index.
observed_length (int) – length of the DNA sequence in a vertex.

Returns:

former vertex given_amino_acids.

Return type:

list

Example

>>> from dsw import obtain_formers
>>> current = 0
>>> obtain_formers(current=current, observed_length=2)
[0, 4, 8, 12]
>>> obtain_formers(current=current, observed_length=3)
[0, 16, 32, 48]
>>> obtain_formers(current=current, observed_length=4)
[0, 64, 128, 192]
>>> obtain_formers(current=current, observed_length=5)
[0, 256, 512, 768]
>>> obtain_formers(current=current, observed_length=6)
[0, 1024, 2048, 3072]
>>> obtain_formers(current=current, observed_length=7)
[0, 4096, 8192, 12288]
>>> obtain_formers(current=current, observed_length=8)
[0, 16384, 32768, 49152]
>>> obtain_formers(current=current, observed_length=9)
[0, 65536, 131072, 196608]

dsw.graphized.obtain_latters(current, observed_length)[source]¶

Obtain latter vertex given_amino_acids based on the current vertex index.

Parameters:

current (int) – current vertex index.
observed_length (int) – length of the DNA sequence in a vertex.

Returns:

latter vertex given_amino_acids.

Return type:

list

Example

>>> from dsw import obtain_latters
>>> current = 0
>>> obtain_latters(current=current, observed_length=2)
[0, 1, 2, 3]
>>> obtain_latters(current=current, observed_length=3)
[0, 1, 2, 3]
>>> obtain_latters(current=current, observed_length=4)
[0, 1, 2, 3]
>>> obtain_latters(current=current, observed_length=5)
[0, 1, 2, 3]
>>> obtain_latters(current=current, observed_length=6)
[0, 1, 2, 3]
>>> obtain_latters(current=current, observed_length=7)
[0, 1, 2, 3]
>>> obtain_latters(current=current, observed_length=8)
[0, 1, 2, 3]
>>> obtain_latters(current=current, observed_length=9)
[0, 1, 2, 3]

dsw.graphized.obtain_leaf_vertices(vertex_index, depth, accessor=None, latter_map=None)[source]¶

Obtain leaf vertices in required depth of the tree with the rooted vertex.

Parameters:

vertex_index (int) – vertex index in the graph.
depth (int) – stride of the breadth-first search.
accessor (numpy.ndarray) – accessor of graph.
latter_map (dict) – latter vertex map of graph.

Returns:

given_amino_acids of required leaf vertex.

Return type:

numpy.ndarray

Example

>>> from numpy import array
>>> from dsw import obtain_leaf_vertices
>>> # accessor with GC-balanced
>>> accessor = array([[-1, -1, -1, -1], [ 4, -1, -1,  7], [ 8, -1, -1, 11], [-1, -1, -1, -1],                               [-1,  1,  2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1],                               [-1,  1,  2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1],                               [-1, -1, -1, -1], [ 4, -1, -1,  7], [ 8, -1, -1, 11], [-1, -1, -1, -1]])
>>> obtain_leaf_vertices(vertex_index=1, depth=1, accessor=accessor)
array([4, 7])
>>> obtain_leaf_vertices(vertex_index=1, depth=2, accessor=accessor)
array([ 1,  2, 13, 14])
>>> latter_map = {1: [4, 7], 2: [8, 11], 4: [1, 2], 7: [13, 14],                           8: [1, 2], 11: [13, 14], 13: [4, 7], 14: [8, 11]}
>>> obtain_leaf_vertices(vertex_index=1, depth=1, latter_map=latter_map)
array([4, 7])
>>> obtain_leaf_vertices(vertex_index=1, depth=2, latter_map=latter_map)
array([ 1,  2, 13, 14])

Note

Either the parameter “accessor” or the parameter “latter_map” must occur, but not both.

dsw.graphized.remove_useless(latter_map, threshold, verbose=False)[source]¶

Remove useless vertices (the out-degree of witch less than threshold) in the latter map.

Parameters:

latter_map (dict) – latter vertex map of graph.
threshold (int) – minimum out-degree threshold.
verbose (bool) – need to print log.

Returns:

useful latter map.

Return type:

dict

Example

>>> from dsw import remove_useless
>>> latter_map_1 = {0: [1, 2], 1: [], 2: [3, 4], 3: [0]}
>>> remove_useless(latter_map=latter_map_1, threshold=1)
{0: [2], 2: [3], 3: [0]}
>>> latter_map_2 = {1: [4, 7], 2: [8, 11], 4: [1, 2], 7: [13, 14],                             8: [1, 2], 11: [13, 14], 13: [4, 7], 14: [8, 11]}
>>> remove_useless(latter_map=latter_map_2, threshold=1)
{1: [4, 7], 2: [8, 11], 4: [1, 2], 7: [13, 14], 8: [1, 2], 11: [13, 14], 13: [4, 7], 14: [8, 11]}

dsw.graphized.adjacency_matrix_to_accessor(matrix, verbose=False)[source]¶

Convert the adjacency matrix to the equivalent accessor (compressed matrix).

Parameters:

matrix (numpy.ndarray) – adjacency matrix.
verbose (bool) – need to print log.

Raises:

ValueError – when you input an adjacency matrix with wrong format.

Returns:

equivalent compressed matrix (accessor), compress rate = len(n_system) / len(matrix).

Return type:

numpy.ndarray

Example

>>> from dsw import adjacency_matrix_to_accessor
>>> adjacency_matrix = array([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],                                       [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],                                       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0],                                       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],                                       [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],                                       [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],                                       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0],                                       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],                                       [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],                                       [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],                                       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0],                                       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],                                       [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],                                       [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],                                       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0],                                       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])
>>> adjacency_matrix_to_accessor(matrix=adjacency_matrix)
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

Note

The size of accessor is 4 ^ l * 4 and that of corresponding adjacency matrix is 4 ^ l * 4 ^ l.

dsw.graphized.accessor_to_adjacency_matrix(accessor, maximum_length=8, verbose=False)[source]¶

Convert the accessor (compressed matrix) to its equivalent adjacency matrix.

Parameters:

accessor – accessor (compressed matrix).
maximum_length (int) – maximum vertex length (like 8 in general) of the adjacency matrix.
verbose (bool) – need to print log.

Type:

numpy.ndarray

Raises:

MemoryError – when you generate a large adjacency matrix that your memory cannot allocate.
ValueError – when you input a graph of DNA Spider-Web with wrong format.

Returns:

adjacency matrix of the uncompressed graph.

Return type:

numpy.ndarray

Example

>>> from dsw import accessor_to_adjacency_matrix, adjacency_matrix_to_accessor, get_complete_accessor
>>> accessor = array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15],                               [0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15],                               [0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15],                               [0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]])
>>> accessor_to_adjacency_matrix(accessor=accessor)
array([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
       [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
       [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
       [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]])

Note

The size of accessor is 4 ^ l * 4 and that of corresponding adjacency matrix is 4 ^ l * 4 ^ l.

dsw.graphized.get_complete_accessor(observed_length, verbose=False)[source]¶

Get a complete accessor with the required observed length.

Parameters:

observed_length (int) – length of the DNA sequence in a vertex.
verbose (bool) – need to print log.

Returns:

complete accessor.

Return type:

numpy.ndarray

Example

>>> from dsw import get_complete_accessor
>>> get_complete_accessor(observed_length=2)
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

Note

The size of accessor is 4 ^ l * 4 and that of corresponding adjacency matrix is 4 ^ l * 4 ^ l.

dsw.graphized.latter_map_to_accessor(latter_map, observed_length, threshold=None, verbose=False)[source]¶

Convert the latter map to the equivalent accessor.

Parameters:

latter_map (dict) – latter vertex map of graph.
observed_length (int) – length of the DNA sequence in a vertex.
threshold (int or None) – minimum out-degree threshold.
verbose (bool) – need to print log.

Returns:

equivalent accessor.

Return type:

numpy.ndarray

Example

>>> from dsw import latter_map_to_accessor
>>> latter_map = {1: [4, 7], 2: [8, 11], 4: [1, 2], 7: [13, 14],                           8: [1, 2], 11: [13, 14], 13: [4, 7], 14: [8, 11]}
>>> latter_map_to_accessor(latter_map=latter_map, observed_length=2)
array([[-1, -1, -1, -1],
       [ 4, -1, -1,  7],
       [ 8, -1, -1, 11],
       [-1, -1, -1, -1],
       [-1,  1,  2, -1],
       [-1, -1, -1, -1],
       [-1, -1, -1, -1],
       [-1, 13, 14, -1],
       [-1,  1,  2, -1],
       [-1, -1, -1, -1],
       [-1, -1, -1, -1],
       [-1, 13, 14, -1],
       [-1, -1, -1, -1],
       [ 4, -1, -1,  7],
       [ 8, -1, -1, 11],
       [-1, -1, -1, -1]])

Note

The size of accessor is 4 ^ l * 4.

The size of corresponding latter map is further reduced, which only retains available information of follow-up vertices. However, latter map is not suitable for matrix calculation.

dsw.graphized.accessor_to_latter_map(accessor, verbose=False)[source]¶

Convert the accessor to its equivalent latter map.

Parameters:

accessor (numpy.ndarray) – accessor of graph.
verbose (bool) – need to print log.

Returns:

latter vertex map of graph.

Return type:

dict

Example

>>> from numpy import array
>>> from dsw import get_complete_accessor, accessor_to_latter_map
>>> # accessor with GC-balanced
>>> accessor = array([[-1, -1, -1, -1], [ 4, -1, -1,  7], [ 8, -1, -1, 11], [-1, -1, -1, -1],                               [-1,  1,  2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1],                               [-1,  1,  2, -1], [-1, -1, -1, -1], [-1, -1, -1, -1], [-1, 13, 14, -1],                               [-1, -1, -1, -1], [ 4, -1, -1,  7], [ 8, -1, -1, 11], [-1, -1, -1, -1]])
>>> accessor_to_latter_map(accessor=accessor)
{1: [4, 7], 2: [8, 11], 4: [1, 2], 7: [13, 14], 8: [1, 2], 11: [13, 14], 13: [4, 7], 14: [8, 11]}

Note

The size of accessor is 4 ^ l * 4.

The size of corresponding latter map is further reduced, which only retains available information of follow-up vertices. However, latter map is not suitable for matrix calculation.

Fundamental Operation Module¶

class dsw.operation.Monitor[source]¶: Bases: object

dsw.operation.calculus_addition(number, base)[source]¶

Do huge number addition calculus with a small base value, as number + base.

Parameters:

number (str) – huge number.
base (str) – base value:

Returns:

number + base.

Return type:

str

Note

The integer of parameter “base” must less be than 10 in decimal system.

Example

>>> from dsw import calculus_addition
>>> calculus_addition(number="99999999999999999999999999999999999999999999999999", base="2")
'100000000000000000000000000000000000000000000000001'

dsw.operation.calculus_subtraction(number, base)[source]¶

Do huge number subtraction calculus with a small base value, as number - base.

Parameters:

number (str) – huge number.
base (str) – base value:

Returns:

number - base.

Return type:

str

Note

The integer of parameter “base” must less be than 10 in decimal system.

Example

>>> from dsw import calculus_subtraction
>>> calculus_subtraction(number="10000000000000000000000000000000000000000000000001", base="2")
'9999999999999999999999999999999999999999999999999'

dsw.operation.calculus_multiplication(number, base)[source]¶

Do huge number multiplication calculus with a small base value, as number * base.

Parameters:

number (str) – huge number.
base (str) – base value:

Returns:

number * base.

Return type:

str

Note

The integer of parameter “base” must less be than 10 in decimal system.

Example

>>> from dsw import calculus_multiplication
>>> calculus_multiplication(number="9999999999999999999999999999999999999999999999999", base="2")
'19999999999999999999999999999999999999999999999998'

dsw.operation.calculus_division(number, base)[source]¶

Do huge number division calculus with a small base value, as number / base and number % base.

Parameters:

number (str) – huge number.
base (str) – base value:

Returns:

number // base and number % base.

Return type:

(str, str)

Example

>>> from dsw import calculus_division
>>> calculus_division(number="9999999999999999999999999999999999999999999999999", base="2")
('4999999999999999999999999999999999999999999999999', '1')

Note

The integer of parameter “base” must less be than 10 in decimal system.

When the divisor (base) is “0”, the program will return “0”, “0”.

dsw.operation.dna_to_number(dna_sequence, is_string=True)[source]¶

Transform a DNA string to the equivalent decimal number.

Parameters:

dna_sequence (str) – required DNA string.
is_string – type of equivalent decimal number is str.

Type:

is_string: bool

Returns:

equivalent decimal number (may huge) of the inputted DNA string.

Return type:

str or int

Example

>>> from dsw import dna_to_number
>>> dna_to_number(dna_sequence="ACGTACGT", is_string=True)
'6939'
>>> dna_to_number(dna_sequence="ACGTACGT", is_string=False)
6939

dsw.operation.number_to_dna(decimal_number, dna_length)[source]¶

Transform a decimal number to the equivalent DNA string with specific length.

Parameters:

decimal_number (str or int) – decimal number (may huge) of the DNA string.
dna_length (int) – default length of the DNA string.

Returns:

equivalent DNA string of the decimal number.

Return type:

str

Example

>>> from dsw import number_to_dna
>>> number_to_dna(decimal_number=6939, dna_length=8)
'ACGTACGT'
>>> number_to_dna(decimal_number="6939", dna_length=8)
'ACGTACGT'

dsw.operation.bit_to_number(bit_array, is_string=True, verbose=False)[source]¶

Transform a bit array to the equivalent decimal number.

Parameters:

bit_array (list) – bit array.
is_string – type of equivalent decimal number is str.
verbose (bool) – need to print log.

Type:

is_string: bool

Returns:

equivalent decimal number (may huge) of the inputted bit array.

Return type:

str or int

Example

>>> from dsw import bit_to_number
>>> bit_to_number(bit_array=[1, 1, 1, 1, 1, 0, 0, 1, 1, 1], is_string=True)
'999'
>>> bit_to_number(bit_array=[1, 1, 1, 1, 1, 0, 0, 1, 1, 1], is_string=False)
999

dsw.operation.number_to_bit(decimal_number, bit_length)[source]¶

Transform a decimal number to the equivalent bit array with specific length.

Parameters:

decimal_number (str or int) – decimal number (may huge) of the bit array.
bit_length (int) – default length of the bit array.

Returns:

bit array.

Return type:

list

Example

>>> from dsw import number_to_bit
>>> number_to_bit(decimal_number="999", bit_length=10)
[1, 1, 1, 1, 1, 0, 0, 1, 1, 1]
>>> number_to_bit(decimal_number=999, bit_length=10)
[1, 1, 1, 1, 1, 0, 0, 1, 1, 1]