|
1.INTRODUCTIONAs is reported in the global cancer report, in 2020, the number of breast cancer patients came to 2.26 million, which made breast cancer the most serious cancer in the world. Studies have found that the advancement of breast cancer is tightly associated with the estrogen receptor. Early researches shows that estrogen receptors alpha (ERα) is expressed in about fifty to eighty percent of the tumour cells but only about ten percent in normal cells. Experiments on lab rats around ERα also supported that it does play an essential role in the mammary gland development1-3. At present, anti-hormone treatment is commonly adopted for breast cancer patients, whose body has high estrogen level and needs to regulate the activity of estrogen receptors. Thus, researches soon regard ERα as the target of breast cancer therapy, and start trying to find compounds which can reduce ERα activity as the medicine candidates. As an example, some classic medicines in treating breast cancer such as raloxifene and tamoxifen are ERα antagonists. At present, in pharmaceutical researches, modeling to predict the activity of the compound is widely used to search potentially active compounds. To do so, researches collect the data of compounds acting on ERα and their bioactivity. These data are used to construct a quantitative structure-activity relationship model in which molecular structure descriptors are the independent variables and the bioactivity are the dependent. The model can make the prediction of new compound molecules which have better bioactivity or guide the structural optimization of existing compounds. Also, compounds can become suitable pharmaceutical candidates not only need to have better bioactivity, but also required safety and pharmacokinetic properties, which can be explained as ADMET properties. ADMET here means absorption, distribution, metabolism, excretion and toxicity. In particular, absorption, distribution, metabolism and excretion refer to the pharmacokinetic properties describing the pattern of concentration of the compound in the organism over time, and toxicity refers to the toxic effects of the compounds4. If a compound with good activity preforms bad in ADMET properties, such as being difficult to be absorbed, or metabolized too fast in vivo, or having some kind of toxicity, is still difficult to become a suitable pharmaceutical candidate. Thus, the ADMET property optimization is also needed. To facilitate modeling, this thesis considers only five ADMET properties as below: (1) Caco-2, which measures how the human body absorb the compound; (2) CYP3A4, which measures the compounds’ metabolic stability; (3) hERG, which refers to the cardiotoxicity; (4) HOB, which shows the proportion absorbed into blood after the medicine entering human body; (5) MN, which determines whether the compound is genotoxic. In order to screen suitable active compounds as pharmaceutical candidates and to determine the range of major molecular descriptors in the compounds (as variables), this paper combines machine learning algorithms with PSO algorithms to develop a new compound optimization model for anti-breast cancer pharmaceutical candidates. The data used in this paper are from the files “Molecular_Descriptor.xlsx”, “ERα_activity.xlsx” and “ADMET.xlsx”, all from the Chinese National Post-Graduate Mathematic Contest in Modeling 20215. Firstly, according to the files “Molecular_Descriptor.xlsx” and “ERα_activity.xlsx”, the gray correlation analysis6, Pearson correlation coefficient7 and cosine similarity8-9 are jointly used to screen the most significant effective to the bioactivity molecular descriptors. And the regression model between molecular descriptors and biological activity are built using BP neural networks10. Then, this paper uses the random forest classification algorithm11 to construct a regression model between molecular descriptors and ADMET data based on “Molecular_Descriptor.xlsx”, “ERα_activity.xlsx” and “ADMET.xlsx”. And finally, the final compound optimization model for anti-breast cancer medicine candidates was obtained by combining classification and regression models and combining with PSO algorithm12. It is shown experimentally that the model developed in this paper can better determine the range of molecular descriptors, which ensures the compounds to have better biological activity for ERα inhibition and better ADMET properties. 2.DETERMINE MODELLING VARIABLESThe data files used in this paper “Molecular_Descriptor.xlsx”, “ERα_activity.xlsx” and “ADMET.xlsx” are all from the Chinese National Post-Graduate Mathematic Contest in Modeling 2021. Among them, the file “ERα_activity.xlsx” provides data on the biological activity of 1974 compounds against ERα; the file “Molecular_Descriptor.xlsx” gives information on 729 molecular descriptors (i.e., independent variables) for the above-mentioned 1974 compounds; and the file “ADMET.xlsx” provides data on the five ADMET properties for the 1974 compounds. As the compounds provided in the above document contain a large number of variables that are troublesome in analysis and modelling, feature screening is required to filter out the 20 main variables that are most representative and independent. Because different feature extraction methods extract features from different angles of thinking, the main features obtained are also different. To improve the precision of the screened variables, we first screen the main 20 variables affecting biological activity by combining three methods: Pearson correlation coefficient, cosine similarity, and grey correlation analysis. Then, the 20 variables with the greatest influence on ADMET properties are screened by using the feature screening method. And finally, the variables with greater influence on both biological activity and ADMET properties are obtained by scoring strategy. 2.1Pearson correlation coefficientTo calculate the correlation between variables, our first idea comes from Pearson correlation coefficients (PCCs), which are mainly used to calculate the linear correlation between operational variables1. In particular, suppose there are two sets of data, A: {A1, A2, ‧ ‧ ‧ An} and B: {B1, B2, ‧ ‧ ‧ Bn}, and the mean of the sample is: The covariance of the sample is: So the Pearson correlation coefficients between A and В is: in which SA means the standard deviation of sample A, and in the same way, SB means the standard deviation of sample B. In this paper, we take the bioactive pIC50 as В and 729 molecular descriptors as A, and separately calculate the Pearson correlation coefficients of molecular descriptors on bioactivity. 2.2Cosine similarityCosine similarity (cos) was initially used to calculate the similarity between two vectors by calculating the cosine similarity between two vectors to digitize the similarity between vectors, which related expression is: Suppose the vector X is {X1, X2, ‧ ‧ ‧ Xn} and the vector Y is {Y1, Y2, ‧ ‧ ‧ Yn}, then the cosine of the angle θ between the vectors X and Y is: Then, the cosine values are normalized for the similarity: In this case, the closer the similarity is to 0, the more similar the vectors are to each other. Similar to the Pearson correlation coefficient, we took the bioactive pIC50 as separate vectors. Next, select the main variables according to the cosine similarity separately calculated between the 729 molecular descriptors and pIC50. 2.3Grey relation analysisGrey Relation Analysis (GRA) is widely used and is very suitable for calculating correlations between data with unknown relationships between them. It refers to the process of unifying the data with a uniform magnitude and then using one set of sequences as the parent and the rest of the sequences as the subsequence. Then determine the degree of similarity in geometry between the parent and subsequence to find whether the two sets of sequences are closely related2. The relevant steps are as follows:
2.4SolutionsIn this paper, we first extracted the top 20 main variables using each of the three feature extraction methods mentioned above. Considering that only the correlation calculated by the Pearson correlation coefficient is meaningful whether linear relationship exists between two variables, the scatter plot matrix shown in Figure 1 expressed the screened main molecular descriptors after obtaining the results calculated by Pearson correlation coefficient. The figure which is used to verify the usability of the results clearly shows that variables screened are basically linearly correlated with bioactivity. Then the molecular descriptors that make ADMET have better properties were screened on this basis, which means the desired molecular identifiers that ultimately have a greater effect on the biological activity and ADMET properties were obtained. We assigned scores to the molecular descriptors obtained by each method according to their ranking, with the first ranking given 20 points and decrease when the ranking goes ahead. Table 1 shows the 20 molecular descriptors which are most significant effective on bioactivity and their corresponding weights judging by the score, from which we can find that the most significant effect on bioactivity is MDEC-23, following by MLogP, LipoaffinityIndex, nC, LipoaffinityIndex, maxsOH, and minsOH. It is known that MDEC-23 represents the edge of molecular distance between secondary carbons and tertiary carbons, which is important in determining the bioactive properties. Also, variables such as MLogP, CrippenLogP are also important for bioactivity. These facts proved that the variables we screened are of strong practical significance. Table 1.The 20 variables with the most significant effects on biological activity.
In order to filter out the molecular descriptors that make ADMET have better properties, we supposed a molecular descriptor need at least three good properties to be regard as suitable. In the file “ADMET.xlsx”, the better data of Caco-2, CYP3A4 and HOB are already marked as 1. But for hERG and MN, we can find 0 means the better properties, so we need to normalize the data, turn 0 in hERG and MN to 1. After normalizing, we counted the number of compounds with ADMET properties of 1 for each compound as count, and used count = 3 as the dividing line to divide the data into two parts, and Figure 2 shows the histogram of this result. We calculate the mean value of each molecular descriptor of each part separately after normalization to get two matrices of size 1×729, and then subtract the two matrices to plot the difference of the two matrices and plot the difference into the difference histogram shown in Figure 3. Figure 3a shows the histogram plotted by the original difference. Considering that we mainly pick out the numerator descriptors with larger difference values as the variables that affect the larger nature of ADMET exist negative values which are not easy to operate. We plotted the graph (b) in Figure 3 with all values taken as absolute values. The graph shows obviously that some molecular descriptors have a larger difference between the count ≥ 3 and count < 3 groups. In order to find the better molecular descriptors that affect the properties of ADMET, we plot the values of the differences after taking the absolute values as the scatter plot in Figure 4a. It is obvious to find a lot of discrete points in the high out. In order to pick out all these discrete ones as much as possible without introducing too many molecular descriptors, we screened the difference of 0.11 as the dividing line and drew Figure 4b. All molecular descriptors above the black line in Figure 4b were picked out to obtain a primary sieve of better molecular descriptors affecting the properties of ADMET, and a total of 52 primary molecular descriptors are obtained. Since our first step was only a primary screening of the main molecular descriptors affecting the properties of ADMET, molecular descriptors were sieved out are more than needed. To further reduce the number of variables, the same 20 main variables were extracted using the previously mentioned method. And the molecular descriptors which are most significant effective on biological activity extracted earlier were intersected to obtain the desired final molecular identifiers with a high impact on biological activity and ADMET properties. Theses molecular descriptors consist of Table 2. Table 2.Molecular descriptors effect in both ADMET and bioactivity.
3.CONSTRUCT OPTIMIZATION MODELSIn this section, a regression model of the molecular descriptors and pIC50 was established using BP neural network by regarding the seven major molecular descriptors screened in the previous section and the pIC50 values of the compounds as the independent and dependent variables. Also, we use the random forest classification algorithm to build a classification model between bioactivity and ADMET data. Then, he above regression model, classification model and PSO algorithm were combined to build the compound optimization model to obtain anti-breast cancer medicine candidates. 3.1Construct regression modelA BP neural network with 30 hidden layers was first built using the neural network toolbox in MATLAB, the regression model of molecular descriptors with pIC50 is modeled with the Bayesian Regularization training algorithm. This algorithm has good generalization and can deal with small or noisy datasets by taking more time. Back Propagation Neural Network (BP neural network) is a multilayer feedforward neural network trained by error back propagation using gradient descent algorithm, which belongs to a kind of artificial neural network. It calculates the gradient of the loss function for individual weights by the chain rule, which, unlike the direct calculation, is valid for one layer at a time, and its associated schematic is shown in Figure 5. 3.2Construct classification modelRandom Forest (RF) is an integrated learning algorithm which consists decision trees and is capable of performing classification and regression. For regression tasks, Random Forest usually takes the mean value of the output of each decision tree as the final result. Compared with decision trees, random forests are more accurate and more stable. The stochastic process of random forest is shown in Figure 6. The decision tree itself can be used to classify and regress the data. The decision tree for regression is mainly divided into cells in the data feature space, and each cell will have a corresponding output. The regression decision tree mainly answers the “yes” and “no” questions, so its structure is a binary tree, in which there are mainly root nodes, decision nodes and leaf nodes, where leaf nodes refer to the output of a specific category. Figure 7 illustrates the combination strategy in random forest, and the three main methods commonly used are averaging, voting and learning. For numerical regression prediction problems, the averaging method is generally used as a combination strategy, and the averaging method has arithmetic averaging and weighting. In this case, we mainly weight the decision tree, and the final prediction is: in which hi regards to the weak learning machine, wi ≥ 0, . In this paper, the data obtained after feature filtering is fed into a decision tree classifier to train. And we can classify and predict the unknown data by the trained random forest classification algorithm. 3.3Construct optimization modelParticle swarm optimization’s (PSO) central idea is derived from a group collaboration-based search algorithm developed to simulate the foraging behaviour of a flock of birds13. The PSO algorithm designs a massless particle to simulates the birds in a flock, which regards velocity and position as its attributes. Each particle individually records its current individual extreme value by searching optimal solution in the space, and they share their own extreme value to find the global optimal position. By continuously iterating and updating, the global optimal solution is found after the departure stopping condition are met. Figure 8 illustrates the overall flow of the PSO algorithm. The first step is to initialize the particle swarm. Suppose the swarm size is n, the position of the particle is xi, the initial velocity of the particle vi, and the initial adaptability value of the particle Pi. Then set the swarm iterates m times, denotes the best position that the ith particle passes through after d iterations; gbestd denotes the best position that all particle pass through in d iterations. When , is replaced by . Similarly, when , the current gbestd is replaced by . The update equations for the particle positions and velocities in the swarm are shown below: where r1, r2 are random numbers between [0,1], w represents the inertia weights, and c1, c2 represent the learning factors. The above steps are repeated until the maximum number of iterations is reached. The optimal solution of the objective function and the optimal position of the particle xi are the output. We use the regression model established by BP neural network, assuming the output of the random forest regression algorithm is f (x). After that, we use the classification model established by the random forest classification algorithm, assuming that the output of the random forest classification algorithm is h(x). So the constrained optimization problem established is as follows: 4.EXPERIMENTAL RESULTS4.1The evaluation criteriaThe evaluation indexes of the regression model mainly include MAE, MSE and R2. The model’s regression effect is better if MAE and MSE is closer to 0 and R2 is to 1. MSE and R2 are selected as the evaluation indexes to evaluate the three models. where Pre denotes the predicted value, and Obj denotes the actual value. where y is the data to be fitted, denotes the mean of the data, denotes the fitted data, SST denotes the sum of total squares, SSR denotes the sum of the regression squares, SSE denotes the sum of squared residuals. Equations (15)-(18) show the evaluation metrics for the classification problem. In which True Positive (TP) refers to the correct rate of true judgment; True Negative (TN) refers to the correct rate of false judgement; False Positive (FP) refers to the false alarm rate; True Negative (TN) refers to the miss rate. ROC represents the perception curve, and the value closer to 1 means the result is better. 4.2Experimental resultsTo facilitate the operation, we copy the attributes pIC50 and the extracted features from the “ERα_activity.xlsx” file into one table, and take the extracted features as independent variables and pIC50 as dependent variables. Then it is input into MATLAB’s neural network toolbox to fit using BP neural network, and the results of the training and test sets obtained at the end of the fitting are shown in Figure 9. Figure 10 shows the effect of fitting the predicted and true values of the model created by BP neural network, and it proves the satisfactory of the results. To visually compare the advantages and disadvantages of the regression prediction effects of the three models, we calculate MSE and R2 for the BP neural network and the other two comparison models. The relevant data are shown in Table 3. Table 3.The comparison of the three models.
By comparing the three models, it can be concluded that the BP neural network not only has the best fit closest to 1, but also the mean square error can be kept at a low level. The data obtained from the previous feature extraction is fed into the random forest classifier to obtain its classification results, and the results are plotted in Figure 11. The experimental results of some other methods are also shown in Figure 11 in order to reflect the superiority of the method used in this paper. It can be clearly seen that the random forest algorithm has the best classification results from Figure 11. We use the PSO algorithm for modelling analysis. Since the default PSO algorithm in sklearn is to solve the minimum value problem but the optimization problem requires the solution of the maximum value problem, we make slight changes to optimize our constructed constraint. Figure 12 shows the change curve of the fitness of the PSO algorithm during the solution process. As the result the optimization algorithm get is the optimal value point, not a range that we want, we repeated iterations of the PSO algorithm for many times to get all the optimal results in the fluctuation of each variable maximum and minimum value. These are regarded as the value range fluctuations of the seven molecular descriptors. The seven molecular descriptors optimal value fluctuation graph are as shown in Figure 13. The values of the optimal fitness obtained after each run of the particle swarm optimizer are as shown in Figure 13, which is to verify the accuracy of the range obtained. Also, we selected two molecular descriptors, LipoaffinityIndex and CrippenLogP, and plotted the values of fitness and the corresponding molecular descriptors for each particle swarm training in Figure 14, in which red line represents the change in the optimal fitness obtained each train, and the purple and green represent the upper and lower bounds. It is obvious from Figures 13 and 14 that the optimal fitness of the output obtained from each data entry into the particle swarm optimizer varies very little and can be assumed as constant. On the contrary, the corresponding numerator descriptors have relatively large variations. So, after various iterations with the optimal fitness remaining basically unchanged, we can assume that the maximum and minimum values obtained for each numerator descriptor in multiple iterations are an approximate range of values for the variable. From the above analysis, it is known that our method is effective in solving for the approximate range of values when the seven molecular descriptors can make the compound biologically more active for inhibiting ERα while having better ADMET properties, and finally we filled in the obtained range of values in Table 4. Table 4.The range of the molecular descriptors.
5.CONCLUSIONIn this paper, we propose a method based on optimization to solve the optimal range of molecular descriptors for anti-breast cancer medicine candidates. First, a regression prediction model between biological activity and molecular descriptors was built. Then a classification prediction model between molecular descriptors and ADMET properties was developed, an optimization equation was established using the PSO algorithm to optimize them. Our experiments show that the method can more accurately determine the range of values when the compound has better biological activity for inhibiting ERα and also has better ADMET properties. The compound optimization model of anti-breast cancer medicine candidates proposed in this paper can scientifically explain the biological activity and ADMET properties of the compounds, which is a strong guide to optimize the biological activity of ERα antagonists. It also has a greater clinical significance in finding the molecular descriptors that potentially affect the activity of compounds for anti-breast cancer medicine reduce the lethality of breast cancer. In addition, this paper is innovative in terms of data processing and model prediction, which has certain reference significance for other related studies. Although there are still many shortcomings in this paper, we hope to make some contributions to the manufacture of anti-breast cancer medicine and promote the treatment of breast cancer through the research in this paper. REFERENCESBernabe, T., Sastre-Serra, J., Ciobu, N., et al.,
“Estrogen receptor beta (ERβ) maintains mitochondrial network regulating invasiveness in an obesity-related inflammation condition in breast cancer,”
Antioxidants, 10
(9), 1371
(2021). https://doi.org/10.3390/antiox10091371 Google Scholar
Amaral, C., Trouille, F. M., Almeida, C. F., et al.,
“Unveiling the mechanism of action behind the anti-cancer properties of cannabinoids in ER+ breast cancer cells: Impact on aromatase and steroid receptors,”
Journal of Steroid Biochemistry and Molecular Biology, 210 105876
(2021). https://doi.org/10.1016/j.jsbmb.2021.105876 Google Scholar
Tavianatou, A. G., Piperigkou, Z., Koutsakis, C., et al.,
“The action of hyaluronan in functional properties, morphology and expression of matrix effectors in mammary cancer cells depends on its molecular size,”
FEBS Journal, 288
(14), 4291
–4310
(2021). https://doi.org/10.1111/febs.v288.14 Google Scholar
Muchtaridi, M., Dermawan, D. and Yusuf, M.,
“Molecular docking, 3D structure-based pharmacophore modeling, and ADME prediction of alpha mangostin and its derivatives against estrogen receptor alpha,”
Journal of Young Pharmacists, 10
(3),
(2018). https://doi.org/10.5530/jyp Google Scholar
, (Chinese National Post-Graduate Mathematic Contest in Modeling,”
(2021) https://cpipc.acge.org.cn/ Google Scholar
Prakash, K. B., Amarkarthik, A., Ravikumar, M., et al.,
“Optimizing performance characteristics of blower for combustion process using taguchi based grey relational analysis,”
Advances in Materials Research, 155
–163 Springer, Singapore
(2021). https://doi.org/10.1007/978-981-15-8319-3 Google Scholar
Deng, J., Deng, Y. and Cheong, K. H.,
“ombining conflicting evidence based on Pearson correlation coefficient and weighted graph,”
International Journal of Intelligent Systems, 36
(12), 7443
–7460
(2021). https://doi.org/10.1002/int.v36.12 Google Scholar
Deng, H., Feng, Z., Qian, G., et al.,
“MFCosface: A masked-face recognition algorithm based on large margin cosine loss,”
Applied Sciences, 11
(16), 7310
(2021). https://doi.org/10.3390/app11167310 Google Scholar
Bae, J., Chatzidakis, S. and Bean, R.,
“Effective solid angle model and monte carlo method: Improved estimations to measure cosmic muon intensity at sea level in all zenith angles,”
Inter. Conf. on Nuclear Engineering, 85277
(2021). Google Scholar
Jin, W., Li, Z. J., Wei, L. S., et al.,
“The improvements of BP neural network learning algorithm,”
Proc. of 5th Inter. Conf. on Signal Processing and 16th World Computer Congress, 1647
–1649
(2000). https://doi.org/10.1109/ICOSP.2000.893417 Google Scholar
Dharumarajan, S. and Hegde, R.,
“Digital mapping of soil texture classes using random forest classification algorithm,”
Soil Use and Management, 38
(1), 135
–149
(2022). https://doi.org/10.1111/sum.v38.1 Google Scholar
Wang, F., Zhang, H., Zhou, A.,
“A particle swarm optimization algorithm for mixed-variable optimization problems,”
Swarm and Evolutionary Computation, 60
(2), 100808
(2021). https://doi.org/10.1016/j.swevo.2020.100808 Google Scholar
Huang, J., Zhong, M. and Hu, Q.,
“LSTM stock prediction model based on improved particle swarm optimization,”
Journal of East China University of Science and Technology, 1
–12
(2021). https://doi.org/10.14135/j.cnki.1006-3080.20210616001 Google Scholar
|