Fingerprints Module

FP Annotation

FP_annotation

annotate_motifs

annotate_motifs(smiles_per_motifs, fp_type='maccs', threshold=0.8)

Runs the steps needed to generate the selected fingerprint for each motif:

  • smiles2mol: convert SMILES to mol objects
  • mols2fps: convert mol objects to the selected fingerprint
  • scale_fps: compute how often each fingerprint bit is present across a motif
  • fps2motif: binarize the motif fingerprint using the given threshold
  • fps2smarts: retrieve SMARTS for the bits set in the motif fingerprint
  • motifs2tanimotoScore: calculate motif similarity from the motif fingerprints using Tanimoto similarity

Parameters:

Name Type Description Default
smiles_per_motifs list(list(str))

one list of SMILES strings per motif

required
fp_type str

name of the fingerprint type to calculate (see mols2fps for the accepted names)

'maccs'
threshold float; 0 < x <= 1

cutoff that decides whether a bit in the motif fingerprint will be set to zero (below the threshold) or to one (at or above the threshold)

0.8

Returns:

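Name Type Description
fps_motifs list(list(array))

binary fingerprint for each motif; a bit is kept when the fraction of the motif's molecules that set it reaches the threshold

smarts_per_motifs list(list(object))

mol objects (SMARTS patterns) for the bits set in fps_motifs; listed in the docstring, but the source shown for this function returns only fps_motifs

motifs_similarities list

Tanimoto score for every motif combination; listed in the docstring, but the source shown for this function returns only fps_motifs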
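
The per-motif loop can be illustrated without RDKit. The following is a minimal sketch, assuming the molecular fingerprints are already available as 0/1 rows; `motif_fingerprint` and the 4-bit example data are hypothetical and not part of MS2LDA.

```python
import numpy as np


def motif_fingerprint(fps_per_motif, threshold=0.8):
    """Hypothetical helper: bit frequency across a motif, binarized at `threshold`."""
    fps = np.asarray(fps_per_motif, dtype=float)  # rows: molecules, columns: bits
    bit_frequency = fps.mean(axis=0)              # fraction of molecules setting each bit
    return (bit_frequency >= threshold).astype(int)


# Illustrative 4-bit fingerprints for two motifs (made-up data)
motif_a = [[1, 0, 1, 1], [1, 0, 0, 1], [1, 1, 0, 1]]
motif_b = [[0, 1, 1, 0], [0, 1, 1, 0]]

fps_motifs = [motif_fingerprint(m, threshold=0.8) for m in (motif_a, motif_b)]
print(fps_motifs)
```

With a threshold of 0.8, only bits set in at least 80% of a motif's molecules survive, so bits that appear in one of three molecules are dropped.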

Source code in MS2LDA/Add_On/Fingerprints/FP_annotation.py
# Relies on names imported at module level and not shown in this excerpt:
# chain (presumably itertools.chain), Chem (presumably rdkit.Chem) and generate_fingerprint.
def annotate_motifs(
    smiles_per_motifs, fp_type="maccs", threshold=0.8
):  # can be simplified
    """Runs the steps needed to generate the selected fingerprint for each motif

    - smiles2mol: convert SMILES to mol objects
    - mols2fps: convert mol objects to the selected fingerprint
    - scale_fps: compute how often each fingerprint bit is present across a motif
    - fps2motif: binarize the motif fingerprint using the given threshold
    - fps2smarts: retrieve SMARTS for the bits set in the motif fingerprint

    - motifs2tanimotoScore: calculate motif similarity from the motif fingerprints using Tanimoto similarity


    ARGS:
        smiles_per_motifs (list(list(str))): one list of SMILES strings per motif
        fp_type (str): name of the fingerprint type to calculate (see mols2fps for the accepted names)
        threshold (float; 0 < x <= 1): cutoff that decides whether a bit in the motif fingerprint will be set to zero (below the threshold) or to one (at or above the threshold)

    RETURNS:
        fps_motifs (list(list(np.array))): binary fingerprint for each motif; a bit is kept when the fraction of the motif's molecules that set it reaches the threshold
        smarts_per_motifs (list(list(rdkit.mol.object))): mol objects (SMARTS patterns) for the bits set in fps_motifs (not returned by the code below)
        motifs_similarities (list): Tanimoto score for every motif combination (not returned by the code below)
    """
    fps_motifs = []
    all_mols = list(chain(*smiles_per_motifs))
    smarts = generate_fingerprint([Chem.MolFromSmiles(mol) for mol in all_mols])

    for smiles_per_motif in smiles_per_motifs:
        fps_per_motif = mols2fps(smiles_per_motif, fp_type, smarts)
        # print(fps_per_motif)
        scaled_fps = scale_fps(fps_per_motif)
        # print(scaled_fps)
        fps_motif = fps2motif(scaled_fps, threshold)
        # print(fps_motif)
        fps_motifs.append(fps_motif)

    return fps_motifs

fps2motif

fps2motif(scaled_fps, threshold)

overlaps fingerprints of compounds allocated to the same topic/motif

Parameters:

Name Type Description Default
scaled_fps array

a fingerprint array with values between 0 and 1 giving the presence of each substructure within a motif

required
threshold float; 0 < x <= 1

cutoff that decides whether a bit in the fingerprint will be set to zero (below the threshold) or to one (at or above the threshold)

required

Returns:

Name Type Description
scaled_fps array

binary motif fingerprint (could also be called motif_fps): bits at or above the threshold are 1, all others are 0; the input array is modified in place and returned
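
Because the function overwrites its input, callers that still need the bit frequencies afterwards may prefer a non-mutating variant. The following is a minimal sketch of the same thresholding rule using `np.where`; the example values are made up.

```python
import numpy as np

scaled_fps = np.array([0.2, 0.8, 1.0, 0.79])  # illustrative bit frequencies
threshold = 0.8

# Values at or above the threshold become 1, everything else 0.
# np.where builds a new array, so scaled_fps keeps its original values.
motif_fp = np.where(scaled_fps >= threshold, 1.0, 0.0)
print(motif_fp)
```

A value exactly equal to the threshold counts as present, matching the `>=` comparison in the source.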

Source code in MS2LDA/Add_On/Fingerprints/FP_annotation.py
def fps2motif(scaled_fps, threshold):
    """overlaps fingerprints of compounds allocated to the same topic/motif

    ARGS:
        scaled_fps (np.array): a fingerprint array with values between 0 and 1 giving the presence of each substructure within a motif
        threshold (float; 0 < x <= 1): cutoff that decides whether a bit in the fingerprint will be set to zero (below the threshold) or to one (at or above the threshold)

    RETURNS:
        scaled_fps (np.array): binary motif fingerprint (could also be called motif_fps); the input array is modified in place and returned
    """
    # above_threshold_indices = np.where(scaled_fps > threshold)[0] # useful for retrieval, but maybe you can do it in another function
    # maybe you can use the masking also for the retrieveal of SMARTS patterns

    lower_as_threshold = scaled_fps < threshold
    higher_as_threshold = scaled_fps >= threshold

    scaled_fps[lower_as_threshold] = 0
    scaled_fps[higher_as_threshold] = 1

    return scaled_fps

mols2fps

mols2fps(smiles_per_motif, selected_fp_type, smarts=None)

Calculates the selected fingerprint for the SMILES strings of one motif

Parameters:

Name Type Description Default
smiles_per_motif list(str)

list of SMILES strings associated with one motif

required
selected_fp_type str

name of the fingerprint to calculate (case-insensitive)

required
smarts

passed to calc_adaptive when selected_fp_type is 'adaptive'; not used by the other fingerprint types

None

Returns:

Name Type Description
fps array

a multi-dimensional numpy array containing all fingerprints
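
Selecting a calculator by name can also be expressed as a dictionary lookup, which avoids keeping list slices and if/elif branches in sync. The following is a minimal sketch of that alternative, not the MS2LDA implementation; `calc_fingerprints` and the toy calculator are hypothetical stand-ins.

```python
def calc_fingerprints(smiles, fp_type, calculators):
    """Hypothetical dispatcher: look up a calculator by case-insensitive name."""
    key = fp_type.lower()
    if key not in calculators:
        raise ValueError(
            f"Unknown fingerprint type {fp_type!r}; choose one of {sorted(calculators)}"
        )
    return calculators[key](smiles)


# Toy stand-in calculator (not a real fingerprint): string length per SMILES
calculators = {
    "length": lambda smiles: [[len(s)] for s in smiles],
}

result = calc_fingerprints(["CCO", "c1ccccc1"], "Length", calculators)
print(result)
```

With this layout, a name that has no calculator raises an error immediately instead of reaching the final `return` without a value.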

Source code in MS2LDA/Add_On/Fingerprints/FP_annotation.py
def mols2fps(smiles_per_motif, selected_fp_type, smarts=None):
    """Calculates the selected fingerprint for the SMILES strings of one motif

    ARGS:
        smiles_per_motif (list(str)): list of SMILES strings associated with one motif
        selected_fp_type (str): name of the fingerprint to calculate (case-insensitive)
        smarts: passed to calc_adaptive when selected_fp_type is "adaptive"; unused otherwise

    RETURNS:
        fps (numpy.array): a multi-dimensional numpy array containing all fingerprints
    """

    fps_type = [
        "adaptive",
        "pubchem",
        "daylight",
        "kr",
        "lingo",
        "estate",
        "dfs",
        "asp",
        "lstar",
        "rad2d",
        "ph2",
        "ph3",
        "ecfp",
        "avalon",
        "tt",
        "maccs",
        "fcfp",
        "ap",
        "rdkit",
        "map4",
        "mhfp",
    ]

    selected_fp_type = selected_fp_type.lower()

    # self developed dynamic fingerprint
    if selected_fp_type == "adaptive":
        from FP_calculation.adaptive_fps import calc_adaptive

        mols_per_motif = smiles2mols(smiles_per_motif)
        fps = calc_adaptive(mols_per_motif, smarts)

    # cdk based fingerprints
    elif selected_fp_type in fps_type[1:6]:
        from FP_calculation.cdk_fps import (
            calc_PUBCHEM,
            calc_DAYLIGHT,
            calc_KR,
            calc_LINGO,
            calc_ESTATE,
        )

        if selected_fp_type == "pubchem":
            fps = calc_PUBCHEM(smiles_per_motif)
        elif selected_fp_type == "daylight":
            fps = calc_DAYLIGHT(smiles_per_motif)
        elif selected_fp_type == "kr":
            fps = calc_KR(smiles_per_motif)
        elif selected_fp_type == "lingo":
            fps = calc_LINGO(smiles_per_motif)
        elif selected_fp_type == "estate":
            fps = calc_ESTATE(smiles_per_motif)

    # jmap based fingerprints
    elif selected_fp_type in fps_type[6:12]:  # dfs through ph3 (indices 6-11)
        from FP_calculation.jmap_fps import (
            calc_DFS,
            calc_ASP,
            calc_LSTAR,
            calc_RAD2D,
            calc_PH2,
            calc_PH3,
        )

        if selected_fp_type == "dfs":
            fps = calc_DFS(smiles_per_motif)
        elif selected_fp_type == "asp":
            fps = calc_ASP(smiles_per_motif)
        elif selected_fp_type == "lstar":
            fps = calc_LSTAR(smiles_per_motif)
        elif selected_fp_type == "rad2d":
            fps = calc_RAD2D(smiles_per_motif)
        elif selected_fp_type == "ph2":
            fps = calc_PH2(smiles_per_motif)
        elif selected_fp_type == "ph3":
            fps = calc_PH3(smiles_per_motif)

    # rdkit based fingerprints
    elif selected_fp_type in fps_type[12:19]:  # ecfp through rdkit (indices 12-18)
        mols_per_motif = smiles2mols(smiles_per_motif)

        if selected_fp_type == "ecfp":
            fps = calc_ECFP(mols_per_motif)
        elif selected_fp_type == "avalon":
            fps = calc_AVALON(mols_per_motif)
        elif selected_fp_type == "maccs":
            fps = calc_MACCS(mols_per_motif)
        elif selected_fp_type == "fcfp":
            fps = calc_FCFP(mols_per_motif)
        elif selected_fp_type == "ap":
            fps = calc_AP(mols_per_motif)
        elif selected_fp_type == "rdkit":
            fps = calc_RDKIT(mols_per_motif)

    # minhash based fingerprints
    elif selected_fp_type in fps_type[19:]:
        from FP_calculation.minhash_fps import calc_MAP4, calc_MHFP

        mols_per_motif = smiles2mols(smiles_per_motif)
        if selected_fp_type == "map4":
            fps = calc_MAP4(mols_per_motif)
        elif selected_fp_type == "mhfp":
            fps = calc_MHFP(mols_per_motif)

    else:
        raise Exception(
            f"One of the following fingerprint types need to be selected: {fps_type}"
        )

    return fps

scale_fps

scale_fps(fps_per_motif)

Calculates, for every fingerprint bit, the fraction of molecules in a motif that set it

Parameters:

Name Type Description Default
fps_per_motif dataframe

a dataframe (rows are molecules and columns are fingerprint bit) for all molecular fingerprints

required

Returns:

Name Type Description
scaled_fps array

a fingerprint array with values between 0 and 1 giving the presence of each substructure within a motif
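
The scaling is a column-wise mean over the molecules of a motif. The following is a minimal sketch that adds an explicit check for an empty motif; `bit_frequencies` is a hypothetical name, and the input is assumed to be a 2D array-like of 0/1 values.

```python
import numpy as np


def bit_frequencies(fps_per_motif):
    """Hypothetical helper: fraction of molecules that set each fingerprint bit."""
    fps = np.asarray(fps_per_motif, dtype=float)  # rows: molecules, columns: bits
    if fps.shape[0] == 0:
        raise ValueError("motif contains no fingerprints")
    return fps.sum(axis=0) / fps.shape[0]


freq = bit_frequencies([[1, 0, 1], [1, 1, 0], [1, 0, 0], [1, 0, 1]])
print(freq)
```

Raising early for an empty motif gives a clearer message than a failed division further down the pipeline.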

Source code in MS2LDA/Add_On/Fingerprints/FP_annotation.py
def scale_fps(fps_per_motif):
    """Calculates, for every fingerprint bit, the fraction of molecules in a motif that set it

    ARGS:
        fps_per_motif (pandas.dataframe): a dataframe (rows are molecules, columns are fingerprint bits) holding all molecular fingerprints of the motif

    RETURNS:
        scaled_fps (np.array): a fingerprint array with values between 0 and 1 giving the presence of each substructure within a motif

    """
    n_fps_per_motif = len(fps_per_motif)
    combined_fps = sum(fps_per_motif)

    scaled_fps = combined_fps / n_fps_per_motif
    # breaks if the motif is empty: n_fps_per_motif is 0, so the division has no valid result
    return scaled_fps

smiles2mols

smiles2mols(smiles)

converts SMILES to rdkit mol objects

Parameters:

Name Type Description Default
smiles list

list of SMILES strings

required

Returns:

Name Type Description
mols list

list of rdkit mol objects; SMILES strings that cannot be parsed are skipped, so the result can be shorter than the input

Source code in MS2LDA/Add_On/Fingerprints/FP_annotation.py
def smiles2mols(smiles):
    """converts SMILES to rdkit mol objects

    ARGS:
        smiles (list): list of SMILES strings

    RETURNS:
        mols (list): list of rdkit mol objects
    """
    mols = []

    for smi in smiles:
        mol = MolFromSmiles(smi)
        if mol:
            mols.append(mol)

    return mols

tanimoto_similarity

tanimoto_similarity(fps_1, fps_2)

Compute Tanimoto similarity for two sets of fingerprints using NumPy.

Parameters:

Name Type Description Default
fps_1 List[List[int]]

The first set of binary fingerprints.

required
fps_2 List[List[int]]

The second set of binary fingerprints.

required

Returns:

Type Description
np.ndarray

A 2D array where the element at (i, j) is the Tanimoto similarity between fps_1[i] and fps_2[j]; pairs whose union is empty get a score of 0.
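
For binary fingerprints, the Tanimoto score of two rows A and B is |A ∩ B| / (|A| + |B| - |A ∩ B|). The following is a small worked sketch of the same NumPy steps on made-up 4-bit fingerprints, including all-zero rows to show the zero-union case.

```python
import numpy as np

# Illustrative binary fingerprints (made-up data)
a = np.array([[1, 1, 0, 0], [0, 0, 0, 0]], dtype=np.int32)
b = np.array([[1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 0, 0]], dtype=np.int32)

intersection = a @ b.T  # pairwise count of shared on-bits
union = a.sum(axis=1, keepdims=True) + b.sum(axis=1, keepdims=True).T - intersection

# Where the union is 0 (both fingerprints empty), keep the preset score of 0.0
scores = np.divide(
    intersection,
    union,
    out=np.zeros_like(intersection, dtype=float),
    where=(union != 0),
)
print(scores)
```

The first row of `a` shares one of three distinct bits with the first row of `b` (score 1/3) and is identical to the second row of `b` (score 1.0). Every comparison involving an all-zero row scores 0.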

Source code in MS2LDA/Add_On/Fingerprints/FP_annotation.py
def tanimoto_similarity(fps_1, fps_2):
    """
    Compute Tanimoto similarity for two sets of fingerprints using NumPy.

    Args:
    fps_1: List[List[int]] - The first set of binary fingerprints.
    fps_2: List[List[int]] - The second set of binary fingerprints.

    Returns:
    np.ndarray: A 2D array where the element at (i, j) represents the Tanimoto similarity
                between `fps_1[i]` and `fps_2[j]`.
    """
    # Convert to NumPy arrays
    pair1 = np.array(fps_1, dtype=np.int32)
    pair2 = np.array(fps_2, dtype=np.int32)

    # Compute intersection and union for each pair of fingerprints
    intersection = np.dot(pair1, pair2.T)  # Dot product gives pairwise |A ∩ B|
    sum1 = pair1.sum(axis=1, keepdims=True)  # Row-wise sums (|A|)
    sum2 = pair2.sum(axis=1, keepdims=True)  # Row-wise sums (|B|)
    union = sum1 + sum2.T - intersection  # Pairwise |A ∪ B|

    # Handle cases where union is zero (to avoid division by zero)
    tanimoto_scores = np.divide(
        intersection,
        union,
        out=np.zeros_like(intersection, dtype=float),
        where=(union != 0),
    )

    return tanimoto_scores

Substructure Retrieval

Substructure_retrieval

retrieve_substructures

retrieve_substructures(fp_per_motifs, smiles_per_motifs)

Retrieves the SMARTS patterns from the adaptive fingerprint for the bits set in each motif fingerprint

Parameters:

Name Type Description Default
fp_per_motifs list of arrays

list of np.arrays, where every array is a motif fingerprint

required
smiles_per_motifs list of lists

one list of SMILES per motif

required

Returns:

Name Type Description
substructure_matches list of lists

substructures retrieved from the adaptive fingerprint at the positions where the motif fingerprint bit is 1
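
The lookup amounts to indexing a list of substructures with the positions of the set bits. The following is a minimal sketch, assuming the substructure list is already available; `smarts_library` is a hypothetical stand-in for the output of `generate_fingerprint`, and its SMARTS strings are illustrative only.

```python
import numpy as np

# Hypothetical substructure list, one entry per fingerprint bit
smarts_library = ["[OX2H]", "c1ccccc1", "C(=O)O", "[NX3]"]

# Illustrative binary motif fingerprints over the same four bits
motif_fps = [np.array([1, 0, 1, 0]), np.array([0, 1, 0, 0])]

# For each motif, collect the substructures whose bit equals 1
matches = [
    [smarts_library[i] for i in np.flatnonzero(fp == 1)]
    for fp in motif_fps
]
print(matches)
```

This only works when the motif fingerprints and the substructure list use the same bit order, which is why the documented function regenerates the substructures from the same SMILES.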

Source code in MS2LDA/Add_On/Fingerprints/Substructure_retrieval.py
def retrieve_substructures(fp_per_motifs, smiles_per_motifs):
    """Retrieves the SMARTS patterns from the adaptive fingerprint for the bits set in each motif fingerprint

    ARGS:
        fp_per_motifs (list of arrays): list of np.arrays, where every array is a motif fingerprint
        smiles_per_motifs (list of lists): one list of SMILES per motif

    RETURNS:
        substructure_matches (list of lists): substructures retrieved from the adaptive fingerprint at the positions where the motif fingerprint bit is 1
    """

    all_mols = list(chain(*smiles_per_motifs))
    frequent_substructures = generate_fingerprint(
        [Chem.MolFromSmiles(mol) for mol in all_mols]
    )

    substructure_matches = [list() for i in range(len(fp_per_motifs))]
    for i, fp_per_motif in enumerate(fp_per_motifs):
        substructures_per_motif_indices = np.where(fp_per_motif == 1)[0]
        for idx in substructures_per_motif_indices:
            substructure_match = frequent_substructures[idx]
            substructure_matches[i].append(substructure_match)

    return substructure_matches