Introduction

The transition from fossil fuels to environment-friendly and renewable sources of energy has become one of the central challenges for today’s society and across scientific disciplines1. Though global energy studies report that there has been record-breaking growth in sustainable energy production, further improvements are desirable2,3,4. Solar energy is a promising solution for the near future, owing to its availability and potentially high output. Since the first practical silicon photovoltaic cells were developed more than 50 years ago5, several methods of harvesting energy from sunlight have been investigated. Today, four generations of photovoltaic cells are recognised6,7, which are categorised by factors such as the absorbed spectral bandwidth, the power conversion efficiency and its thermodynamic upper limit. However, although record conversion efficiencies are reported frequently8,9, many of these prototypes are not suitable for mass production because they use rare or toxic chemicals or have short operational lifetimes. Considering the demand for a global solution, an upgrade to mass-producible first-generation technology could be valuable.

One such strategy aims to exceed the Shockley-Queisser (SQ) limit in single junction cells10 by applying a multiple exciton generation technique called singlet fission (SF)11,12,5 in the ESI). Hence, while smaller groups at the R1/R10 positions correlate with SF-PV, the exact cause for this remains to be uncovered in future work.

Next, to include two-body relationships, a connectivity diagram was used to illustrate the 200 most commonly co-occurring combinations of groups and positions (see Fig. 5a)34. To facilitate the discussion, we introduce the shorthand notation Grp1Pos1/Grp2Pos2, indicating chemical group Grp1 located at position Pos1 and chemical group Grp2 located at position Pos2, respectively. In this diagram, the brightness of connecting lines represents the number of occurrences for each such combination of sites and groups, given that the probed molecule satisfies the SF-PV criteria (i.e. conditional probability). Owing to the nature of the artificial randomness of structures, the balance between symmetrically equivalent modifications (e.g. OMe7/Hyd4 and OMe4/Hyd7) was not exactly equal. A list of the ten most and least common co-occurring pairs can be found in Supplementary Table 1 of the ESI.
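The counting behind such a connectivity diagram can be illustrated with a minimal sketch. It assumes that each SF-PV candidate is available as a mapping from position to group; the function name `pair_frequencies`, the toy molecules and the alphabetical ordering of the two sites within a label are illustrative assumptions and are not taken from the original workflow.

```python
from collections import Counter
from itertools import combinations


def pair_frequencies(molecules):
    """Relative frequency of every co-occurring pair of (group, position) sites.

    Each molecule is a dict such as {"R4": "OMe", "R7": "Hyd"}. The result maps
    a label in the Grp1Pos1/Grp2Pos2 shorthand to its share of all molecules.
    """
    counts = Counter()
    for mol in molecules:
        # Build site labels such as "OMe4" and sort them for a canonical order.
        sites = sorted(f"{group}{pos[1:]}" for pos, group in mol.items())
        for first, second in combinations(sites, 2):
            counts[f"{first}/{second}"] += 1
    n_molecules = len(molecules)
    return {pair: count / n_molecules for pair, count in counts.items()}


# Hypothetical toy set restricted to positions R4 and R7.
toy_candidates = [
    {"R4": "OMe", "R7": "Hyd"},
    {"R4": "OMe", "R7": "Hyd"},
    {"R4": "Hyd", "R7": "OMe"},
]
frequencies = pair_frequencies(toy_candidates)
most_common_pair = max(frequencies, key=frequencies.get)
```

Because only SF-PV candidates enter the count, the resulting shares correspond to the conditional frequencies discussed above. The toy set also shows how symmetrically equivalent modifications can end up with unequal counts in a randomly generated set.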

Fig. 5: Most commonly co-occurring combinations of groups and positions.

a Plot showing connectivity between pairs of groups and positions, generated using the MNE Python programme package34. The brightness of each node indicates how strongly it is included in the connections. b Graphical representation of the learned feature importance of the Random Forest Classifier, generated using the MNE Python programme package34. The three groups of features, explained in the Methods section, are depicted as follows: Node brightness indicates the importance of individual groups at specific positions (first group of features). Lines connecting two nodes represent the 'and/or' symmetry-equivalent position information for a chemical group (e.g. hydrogen at R4 and/or R7; second group of features). Finally, the brightness of the colour-coded node borders specifies the importance of typical, average, or uncommon groups (blue, green and red, respectively; third group of features). The exact importance values can be found in Supplementary Table 2 of the ESI.

As shown, the most common modification (4.74%) contains a methoxy-group at position R4 and hydrogen at position R7. This percentage was significantly elevated, considering that the mean conditional frequency for all possibilities was 1.56%. In accordance with the previous analysis, there were many methoxy-, fluoro- and hydrogen modifications among the high-percentage co-occurring entries at positions R4 and R7. However, small groups with positive mesomeric effects, such as the methoxy-, fluoro- and chloro-groups, also occurred together in pairs relatively often (cf. connectivity diagram of chemical groups, Supplementary Fig. 4b in the ESI). Similarly, by analysing the least common combinations, we confirmed that there were only few structures among the SF-PV candidates that featured any two of the electron-withdrawing cyano- or bulky thiophenyl-groups simultaneously.

Focusing on just positional arguments in these combinations, the symmetry-equivalent positions R4 and R7 displayed the strongest overall connection and had the highest summed density of connections to any other node (cf. Supplementary Fig. 4a in the ESI). Conversely, R3 and R8 displayed the lowest density of connections, and the weakest connection was found between R3 and R1. Likewise focusing only on the chemical groups regardless of their position, mostly pairs of hydrogen were observed, while there were almost no combinations containing pairs of any of the larger phenyl- and thiophenyl-substituents, or the strongly electron withdrawing cyano-group (cf. Supplementary Fig. 4b in the ESI).

In summary, these considerations indicate that the R4 and R7 positions can be expected to play a significant role in potential design principles for SF-PV in 2,2′-diethenyl cibalackrot. Further, it is implied that such a design principle favours small, electron donating groups at these positions (methoxy- and hydrogen groups in 42.9% of cases, cf. Fig. 1c). However, these positions are also strongly connected to all remaining positions in the molecule (cf. Supplementary Fig. 4a), likely owing to the non-local nature of ππ* excitations. Hence, a clear physicochemical origin may be difficult to pinpoint without additional data. Note that a full ‘top-down search’ for a potential origin property requires that the probed property is available for all structures of the database (as was the case for the repulsion of ketone groups above). To work around this problem, we instead choose a ‘bottom-up approach’, in which we first train a classification model to identify SF-PV structures from the available information and then search for correlations in the physicochemical properties of a smaller group of representative molecules chosen by the classifier logic.

Random forest classification

To identify generally applicable structural design principles, a random forest classifier was trained using the predicted excitation energies of the database. The features were based exclusively on the chemical groups and their positions, with the aim of translating the energetic region of interest directly into structural design rules relevant to synthesis. A detailed description of the construction of the three groups of features can be found in the Methods section. For training and testing purposes, two types of datasets were created from the full database of predictions. Both types contained all 393,841 SF-PV candidate structures and an equally large amount of non-SF-PV molecules. In the first type, all the non-SF-PV structures were taken from one of the energetic quadrants (cf. Fig. 1c), and in the second type, those structures were taken randomly from anywhere. In the following discussion, the former is called a balanced quadrant set, and the latter is called a balanced random set.
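The construction of the two dataset types can be sketched as follows. This is a minimal illustration under stated assumptions: molecules are represented as hypothetical (E(S1), E(T1)) tuples, the function `balanced_set` and the predicate `in_upper_right` are invented for this example, and the quadrant boundaries used here are placeholders, since the actual quadrants are defined in Fig. 1c.

```python
import random


def balanced_set(candidates, non_candidates, keep=None, seed=0):
    """Combine all SF-PV candidates with an equally large sample of negatives.

    `keep` is an optional predicate that restricts the negatives, for example
    to one energetic quadrant. With keep=None they are drawn from anywhere.
    Returns a list of (molecule, label) pairs with label 1 for SF-PV.
    """
    pool = [mol for mol in non_candidates if keep is None or keep(mol)]
    if len(pool) < len(candidates):
        raise ValueError("not enough non-SF-PV molecules to balance the set")
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    negatives = rng.sample(pool, len(candidates))
    return [(mol, 1) for mol in candidates] + [(mol, 0) for mol in negatives]


def in_upper_right(mol, s_ref=2.42, t_ref=1.21):
    """Placeholder quadrant test; the reference energies are assumptions."""
    e_s1, e_t1 = mol
    return e_s1 > s_ref and e_t1 > t_ref


# Hypothetical toy energies in eV, not taken from the database.
candidates = [(2.40, 1.20), (2.44, 1.22)]
non_candidates = [(3.00, 1.00), (2.90, 1.60), (3.10, 1.50), (2.80, 1.40), (1.80, 0.90)]

quadrant_set = balanced_set(candidates, non_candidates, keep=in_upper_right)
random_set = balanced_set(candidates, non_candidates)
```

In both cases every SF-PV candidate is retained and only the origin of the non-SF-PV half differs, which is the distinction between the balanced quadrant set and the balanced random set.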

The random forest model was found to be most suitable for classification when it was trained using the upper-right balanced quadrant set (i.e. energetically close to the SF-PV distribution itself). Using this training method, an accuracy of 80.3% was achieved when performing classification on a random balanced test set. From the confusion matrix of this test set, equally high precision and sensitivity were found (cf. Supplementary Fig. 6 in the ESI). When the balanced random set was applied for training, the resulting random forest model achieved a slightly better accuracy of 82.3%. However, in this case, the precision and sensitivity of the model were unfavourable for the desired application, because there were far more false-positive classifications. In particular, because the actual chemical space is highly imbalanced (91.2% non-SF-PV and 8.8% SF-PV entities), a high sensitivity is desirable. Finally, when the model was trained in any of the other quadrants, the performance in the respective quadrant was highest individually, but it failed to generalise towards any other quadrant. The largest discrepancy was observed when training was conducted using the lower-left balanced quadrant set (testing accuracy of 97.7%), and the resulting model was tested using the upper-right balanced quadrant set (testing accuracy 55.2%). The training statistics for all quadrants can be found in Supplementary Table 3 of the ESI.
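The quantities quoted above follow directly from the confusion matrix, and the effect of class imbalance can be made explicit with a short sketch. The confusion-matrix counts below are hypothetical and only chosen to give round numbers; the second function additionally assumes that sensitivity and specificity measured on a balanced test set carry over unchanged to the imbalanced space.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision and sensitivity from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
    }


def precision_at_prevalence(sensitivity, specificity, prevalence):
    """Expected precision when the positive class has the given prevalence."""
    true_positive_share = sensitivity * prevalence
    false_positive_share = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive_share / (true_positive_share + false_positive_share)


# Hypothetical balanced test set with 100 molecules per class.
balanced = classification_metrics(tp=80, fp=20, fn=20, tn=80)

# Same hypothetical classifier applied to a space with 8.8% SF-PV molecules.
expected_precision = precision_at_prevalence(0.8, 0.8, 0.088)
```

In this hypothetical case the precision drops from 0.80 on the balanced set to roughly 0.28 in the imbalanced space, which illustrates why false positives weigh heavily when only 8.8% of the structures are SF-PV candidates.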

From a chemical informatics perspective, these findings can be explained by the fact that the different energetic quadrants do not share enough similarities to learn anything other than the difference between the quadrants themselves. In the previous extreme case, the model practically learnt whether molecules were part of the bottom-left quadrant or not by judging the presence or absence of cyano-groups. However, this approach to classification cannot be used to make decisions about the upper-right quadrant, where few cyano-groups are featured anyway (see chemical composition of energetic quadrants, Supplementary Fig. 2 in the ESI).

To understand the meaning behind the learned design principles, the importance of the features can be interpreted together with the distribution of the observed chemical residues. Figure 5b is a graphical representation of the importance of all the features. The brightness of individual nodes GrpPos reflects how important the singular combination of group and position is for classification. The opacity of the lines between two opposing nodes gives the importance of a positional either–or presence \(\mathrm{Grp}_{\mathrm{Pos}_{1},\mathrm{Pos}_{2}}\) (i.e. whether the group is found at either or both of the two positions). Finally, the intensity of the differently coloured edges around the nodes represents the importance of typical, common, or uncommon groups of chemical residues for a specific position (blue, green and red, respectively). A list of the actual importance values can be found in Supplementary Table 2 in the ESI, and a more detailed explanation of the features is given in the ‘Methods’ section.

The most important feature for classification was the presence (or absence) of a thiophenyl group at position R10 (i.e. unc10 in 5b), followed by the inversion-symmetric Boolean of whether or not there was a thiophenyl substituent at position R4 or R7 (i.e. Thi4,7). Furthermore, the chemical distribution of the SF-PV molecules was devoid of thiophenyl across all these positions (cf. Fig. 5a). Comparing in this fashion what features are important for classification and what groups were present in the SF-PV molecules, the absence of Thi4,7,10 can be considered as a generalisable design rule for the specific set of molecules and all 2,2′-diethenyl cibalackrot derivatives. By going through the list of importances in this way, it can be seen that similar clear generalisations include the absence of Ph3,8 and the presence of typ7 (i.e., hydrogen or methoxy substituents at position R7).

In summary, bulky substituents like the phenyl- and thiophenyl-groups are generally confirmed to be unfavourable close to the central positions, whereas the small electron-donating methoxy-group is favoured on the R4 and R7 positions. These findings are consistent with the actual distribution of groups in the energetic region, and were identified as global classification rules to distinguish from the non-SF-PV molecules. Further, note that because this random forest model does not possess chemical intuition (e.g. concepts of electron donating/withdrawing or bulky groups) the design principles learnt by the model can be understood as purely data-driven and free of human bias except for the nature of features that the model learned from.

Transfer towards wavefunction-based design strategies

When the trained random forest model was used to predict classes for all 4,463,586 PM6-converged molecules, each structure was assigned a class probability for each of the two classes (SF-PV and non-SF-PV). Interpreting this probability as a measure of how representative a structure is of its class, and thus to what extent it represents the related design rules, we selected the 500 most confident SF-PV and the 500 most confident non-SF-PV structures for an a posteriori in-depth analysis of their wavefunction characteristics. In this way, we attempt to transfer the structural design rules of the random forest classifier towards more general electronic structure characteristics that might also be applicable to other core structures and substituents.

For all selected structures, the SF criterion energy E(SF) = E(S1) − 2E(T1) was calculated from the respective TDDFT and Δ-SCF results after performing a B3LYP structure optimisation, as for the training data. Inspired by the work of Zeng et al.27, we extracted several different wavefunction-related features from the individual rings A–C and A’–C’ and from the α-carbon atoms (i.e., the 2,2′ positions which connect to the ethenyl moieties) and investigated these local properties for correlations with singlet fission. To ensure the analysis was insensitive to inversion symmetry on the artificially random cases of either image or inverted image, all features were summed up inside every pair of inversion-symmetric regions of the core scaffold (i.e. rings A+A’, B+B’ and C+C’ and the 2+2’ carbon atoms that connect to the ethenyl groups). Note that features were only extracted on the original cibalackrot core to improve transferability (i.e. including the ketone oxygen atoms, but excluding the 2,2′-diethenyl substituents).
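The symmetrisation of the local features can be sketched in a few lines. The region labels and the numerical values below are hypothetical placeholders; the sketch only illustrates that summing over inversion-symmetric pairs makes the resulting features identical for a structure and its inverted image.

```python
# Pairs of inversion-symmetric regions of the core scaffold.
SYMMETRIC_PAIRS = [("A", "A'"), ("B", "B'"), ("C", "C'"), ("2", "2'")]


def sum_symmetric_regions(values):
    """Sum a local property over each pair of inversion-symmetric regions."""
    return {f"{a}+{b}": values[a] + values[b] for a, b in SYMMETRIC_PAIRS}


# Hypothetical Mulliken charges per region for one structure ...
image = {"A": 0.10, "A'": 0.25, "B": -0.20, "B'": -0.05,
         "C": 0.30, "C'": 0.15, "2": 0.02, "2'": -0.01}
# ... and for its inverted image, where primed and unprimed regions swap.
inverted = {"A": 0.25, "A'": 0.10, "B": -0.05, "B'": -0.20,
            "C": 0.15, "C'": 0.30, "2": -0.01, "2'": 0.02}

features_image = sum_symmetric_regions(image)
features_inverted = sum_symmetric_regions(inverted)
```

Both dictionaries contain the same four summed features, so the analysis does not depend on which of the two equivalent orientations was generated.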

Besides these local properties, a connection towards the biradical character y0 of each structure was investigated, since this property had been reported to show correlations with singlet fission in earlier works as well27. Here, we evaluated y0 as the occupation number of the lowest unoccupied natural orbital (LUNO) from a broken-symmetry unrestricted Hartree–Fock (BS-UHF) calculation25.

A consistent gradient in the SF criterion energy can be observed with respect to the sum of ground state Mulliken charges and spin densities in specific regions of the core structure (cf. Fig. 6). As shown in the upper row of graphs, charge neutral 2+2’-carbon atoms tended to fulfil the SF condition, with a further separation becoming apparent when the charge in other core regions (A+A’, B+B’ and C+C’) was considered at the same time. Similarly, a separation into clusters was obtained when considering the spin densities of the C+C’ region against the other core regions. Here, low spin densities in both the C+C’ region and other regions are desirable.

Fig. 6: Distributions of Mulliken charges and spin densities colour coded by the SF criterion energy.

(Top) Distribution of summed Mulliken charge over different parts of the molecule and at different excitation energies, colour coded by the SF criterion energy. (Bottom) Distribution of summed spin densities over different parts of the molecule and at different excitation energies, colour coded by the SF criterion energy.

We examined these local features more closely by considering their correlations with the individual E(S1) and E(T1) excitation energies and found that more charge neutral 2+2’-carbon atoms led to lower triplet energies, while the singlet energies remained mostly unaffected. A similar independence of the excited singlet energies was observed for the sum of spins in the C+C’ rings. These observations show that these features may be useful for tuning the energetics in favour of singlet fission, and they generalise our previously suspected design concept of energetically asymmetric shifts to the other substituent positions as well.

It is assumed that the independence of E(S1) with respect to the charge on the 2,2′-carbon atoms can be explained by the electro-neutrality of the 2,2′-carbon atoms and their attached ethenyl substituents. Considering that the first singlet excitation is of global ππ* nature, the charge of two singular carbon atoms inside the conjugated network will be efficiently re-distributed among the rest of the core π-network upon excitation anyway, regardless of their initial charge on those specific atoms - thus being of little importance to the overall E(S1) excitation energy. However, since the electronic (spin-)density in the triplet state does not have the same global extent, but is instead accumulated strongly on the C,C’-rings and the two ethenyl substituents (cf. Fig. 3a, b), the observed negative correlation can be related to a situation where more negative local charge on the 2,2′-carbon atoms could lead to a higher energy requirement for E(T1).

As for the second correlated feature, higher spin densities forming in any part of the molecule can be practically associated with an equally larger excitation energy for their formation, explaining the origin of the positive correlation between density and E(T1). This is indirectly confirmed by noting that any combination of axes (i.e. plotting A+A’ against B+B’ instead) would result in an equally correlated picture. The reason why there is no correlation with E(S1), however, is likely connected to the same asymmetric red-shift of E(T1) versus E(S1) as observed for the example study earlier (cf. Fig. 3c–f). While both excitation energies change with the structure in a similar way due to the practically identical hole and particle wavefunctions, only the triplet energy will additionally benefit from spin-related effects.

Finally, the evaluation of the biradical characters y0 of 921 molecules that converged to a broken-symmetry solution implies that a third correlation with respect to the singlet fission energy criterion E(SF) is contained in the structural design principles of the random forest classifier (cf. Fig. 7). Since radical stabilisation was found to be beneficial for SF for substituents at the 2,2′-positions of cibalackrot27, we conclude that other positions may also be used to improve the radical stabilisation further. Note, however, that even if correlations with the biradical value y0 exist within the chosen samples, it is not straightforward to identify which groups and positions caused which shift, owing to the non-locality of the excited state wavefunction that was briefly mentioned at the end of the first half of the results section.

Fig. 7: Correlation diagram between the singlet fission energy criterion E(SF) and the biradical value y0.

Correlation diagram between the singlet fission energy criterion E(SF) and the biradical value y0. Note that 48.3% of the 921 molecules fulfil SF conditions when values of up to E(SF) = −0.06 eV are allowed (see dashed line), which represents two times the standard deviation of the energy prediction models that the classifier learned from.

In summary, the sampled designs follow an asymmetric shift of the T1 versus S1 states’ excitation energies. More specifically, this asymmetric shift is found to be correlated to local changes in the spin density of the C,C’-ring regions and the charges on the 2,2′-carbon atoms. Both findings imply that the purely data-driven classification model has learned a design strategy, where local changes in the electronic structure close to the central part of the molecule (i.e. in the region where most of the spin density is accumulated) play a key role for designing singlet fission molecules. The design principles of the model are further backed up by a simultaneous increase in the radical stabilisation through the biradical character (i.e. y0-value). Overall, this means that our envisioned three-step procedure was successful in learning structural design rules that could be transferred into relevant underlying wavefunction-related properties.

Discussion

Machine learning has become an invaluable asset when designing materials. On one hand, it can be used to predict key properties of molecules in an efficient way to construct databases of millions of molecules in a relatively short amount of time. This makes it possible to screen for candidate molecules in silico and identify those that satisfy the desired criteria, such as the energetic condition for SF35,36. On the other hand, by analysing large databases, it is possible to search for correlations with specific properties to help identify potential chromophores and broaden our physicochemical understanding25,37. In this work, these directions were rigorously combined to search for potential SF structures from a generated database, investigate the common design principles inside their molecular structure and finally translate them into more basic properties, that is, their wavefunction and electronic structure.

For this purpose, we first trained a kernel ridge regression (KRR) model to predict the first excited singlet and triplet excitation energies of the 4,573,800 chemical derivatives of 2,2′-diethenyl cibalackrot in TDDFT quality (MAEs of 0.025 and 0.026 eV, respectively) based on features obtained from semi-empirical PM6 calculations. Using this database of excitation energies, the suitability of each candidate structure for SF-PV was assessed based on two factors: (a) whether the predicted singlet excitation energy was within the range of twice the predicted triplet excitation energy (i.e. ∣E(S1) − 2E(T1)∣ < 0.03 eV allowing only for slightly exoergic candidates); and (b), whether the predicted triplet excitation energy E(T1) was larger than the band gap energy of silicon (1.12 eV29) to avoid voltage penalties in hypothetical devices21. We found that 30.15% of the 2,2′-diethenyl subspace fulfilled the SF condition (a), and 8.82% of the overall chemical space also fulfilled the photovoltaic condition (b). The comparably high percentage of SF candidates was attributed specifically to the double ethenyl functionalisation of the core structure, which greatly (and almost exclusively) decreased the triplet excitation energy compared to the unfunctionalised cibalackrot. This observation is in line with previous studies on the 2,2′-substituent position and was previously attributed to the radical stabilising effect of the ethenyl substituent27. A collection of ten SF-PV structures that were identified within our predictions along with their a posteriori confirmed TDDFT/Δ-SCF excitation energies can be found in Fig. 8. Note that these structures were specifically selected from only symmetric molecules to increase their potential for synthesis with varying degrees of functionalisation as a preview of the diversity of the database. 
Further, since cibalackrot is a well-known indigoid vat dye with more than 100 years of history and former commercial use38,39, its modifications may be equally promising and available for applications.
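The two screening conditions can be written compactly as a pair of functions. This is a minimal sketch that applies conditions (a) and (b) exactly as stated above; the function names and the example energies are hypothetical and are not entries of the database.

```python
SF_TOLERANCE_EV = 0.03      # tolerance on |E(S1) - 2E(T1)| from condition (a)
SILICON_BAND_GAP_EV = 1.12  # band gap of silicon from condition (b)


def sf_energy(e_s1, e_t1):
    """SF criterion energy E(SF) = E(S1) - 2E(T1), all values in eV."""
    return e_s1 - 2.0 * e_t1


def is_sf_pv(e_s1, e_t1):
    """True if a structure fulfils both condition (a) and condition (b)."""
    fulfils_sf = abs(sf_energy(e_s1, e_t1)) < SF_TOLERANCE_EV
    fulfils_pv = e_t1 > SILICON_BAND_GAP_EV
    return fulfils_sf and fulfils_pv


# Hypothetical (E(S1), E(T1)) pairs in eV.
examples = {
    "both conditions": (2.42, 1.20),
    "SF condition violated": (2.42, 1.10),
    "triplet below silicon gap": (2.22, 1.10),
}
results = {name: is_sf_pv(*energies) for name, energies in examples.items()}
```

The third example shows why condition (b) is tested separately: a structure can satisfy the energy matching of singlet fission and still be unsuitable for a silicon-based device.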

Fig. 8: Selection of ten symmetric SF-PV candidate structures.

Selection of ten symmetric structures (a–j) with varying degrees of substitution that may be targets for synthesis. Excitation energies below the molecules were obtained at TDDFT/Δ-SCF level of theory. Structural data for these examples are available in the ESI.

To extract which chemical modifications were beneficial for singlet fission photovoltaics and to generalise towards wavefunction-based properties, a three-step extraction procedure was performed on the predicted database. In the first step, the distribution of chemical groups at different substituent positions in the set of predicted SF-PV candidate molecules was screened. Here, it was found that electron withdrawing groups (most specifically cyano-groups) were less common in the predicted SF-PV designs at practically any position, whereas electron donating groups (most specifically methoxy- and hydrogen substituents at the central positions) were present more often overall (cf. Fig. 9a). However, even if these observations from inside the SF-PV set are understood as characteristic of the subspace itself, we caution that they remain difficult to generalise towards global design rules for the full 2,2′-diethenyl cibalackrot space. For example, while roughly 280,000 out of 400,000 SF-PV structures featured an electron donating group at position R4, another 2,250,000 non-SF-PV structures fall into this same category. Further, owing to the highly non-local nature of the electronic structure, formulating design recommendations for several positions becomes increasingly difficult, since placing a specific group at one position changes the situation for the remaining positions (example shown in Fig. 9b).
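The limited selectivity of such a single-position rule can be quantified with the rounded numbers quoted above. The following sketch is a simple illustration; the function name `hit_rate` is an assumption, and the result inherits the rounding of the quoted counts.

```python
def hit_rate(n_sf_pv_with_feature, n_non_sf_pv_with_feature):
    """Share of SF-PV structures among all structures that show a feature."""
    total = n_sf_pv_with_feature + n_non_sf_pv_with_feature
    return n_sf_pv_with_feature / total


# Rounded counts for an electron donating group at position R4 (see text).
donor_at_r4 = hit_rate(280_000, 2_250_000)

# Share of SF-PV structures in the full 2,2'-diethenyl space, for comparison.
base_rate = 0.0882
enrichment = donor_at_r4 / base_rate
```

With these numbers, only about 11% of the structures carrying an electron donating group at R4 are SF-PV candidates, compared with a base rate of 8.82%. The feature is therefore common among SF-PV molecules but hardly exclusive to them, which motivates the classification step that follows.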

Fig. 9: Structural occurrences of electron withdrawing and donating groups.

a Probabilities of finding an electron withdrawing (blue left number) and donating group (red right number) at each position of the 393,841 predicted SF-PV candidates. b Conditional probabilities for encountering a second group, after an electron withdrawing group has been chosen for position R2. The substituents were assigned to the respective groups in accordance with the colour scheme of Fig. 2.

To account for this problem, the second step of the procedure is aimed at extracting design principles that apply to SF-PV molecules while excluding non-SF-PV structures. For this purpose, a random forest classification approach was chosen to treat the non-local nature of the problem. The model achieved an overall classification accuracy of 80.3% with equally high sensitivity and precision when classifying a balanced test set of SF-PV molecules and otherwise random non-SF-PV structures. By analysing the importance values of the classification features (Fig. 5b) together with the chemical distributions of the previous analysis step (Fig. 5a), we further deduced that bulky substituents such as phenyl- and thiophenyl-groups are disfavoured, especially at positions R1 and R10, which are close to the ketone oxygen. We also confirmed that cyano-groups are not only rare in SF-PV molecules, but also carry significant exclusivity towards the non-SF-PV structures. While we are aware that this is a somewhat unusual perspective, we consider this combined analysis of which groups were typically present (or absent) and which factors were important to the classifier to be our most compact attempt at capturing all possible structural recommendations in one image. It respects the non-locality of the problem and balances the aforementioned generality and exclusivity that are necessary for formulating a valid recommendation.

In the last step of the analysis, we transferred the structural logic learnt by the random forest model towards wavefunction-based properties that are related to SF-PV. This was done by selecting the 500 most likely SF-PV molecules and the 500 most likely non-SF-PV molecules according to the model and subsequently performing (TD)DFT, UDFT and broken-symmetry UHF calculations. Relating the SF criterion energy E(SF) = E(S1) − 2E(T1) to different wavefunction-related quantities (cf. Figs. 6 and 7), three meaningful correlated features were found. These are the sum of the ground state charges at the 2,2’-carbon atoms, the sum of T1 spin density inside the C,C’-rings and the occupation numbers of the LUNO of the broken-symmetry UHF solution (i.e. the y0-value as used in Minami et al.25). Note that these are identical or at least in part related to previously reported correlations for singlet fission in a chemical space of cibalackrot that focused on 2,2′-derivatives27. Thus, our results not only serve as a proof of concept for the purely data-driven extraction method of design principles for future properties and chemical spaces, but further generalise design rules known to be beneficial for singlet fission towards a much larger chemical space when following the design rules provided by the random forest model.

Methods

Generation of chemical space

For the generation of cibalackrot derivatives, the core structure was decorated with eight different substituent groups (cf. Fig. 2a, b) that are commonly used in SF studies27,36,40. The chloro- and fluoro-groups were chosen to represent small substituents with mesomeric electron-donating and inductive withdrawing effects, the cyano-group was used to represent a medium-sized electron-withdrawing substituent, and the methoxy- and ethenyl-groups were considered as electron-donating medium-sized groups. For eventual tuning of the crystal structure packing41,42, the bulky phenyl-group was added. Finally, the 5-methylthiophenyl substituent (in the following just referred to as thiophenyl) was introduced. For simplicity of comparison with their study on aromaticity, we adopted the naming convention for 5- and 6-membered rings used by Zeng et al.27.

We deemed it difficult to attach large numbers of phenyl- and thiophenyl-groups simultaneously from steric hindrance and potential synthetic perspectives, so we limited the number of such groups on either side of the half-structures (R1–R5 and R6–R10, respectively) to one at a time. Following these rules for chemical space generation and considering the inversion symmetry of the core scaffold, the full catalogue of structures included 215,001,216 molecules.

To identify potential SF candidate molecules from this pool of possibilities, a pre-screening was performed to find a chemical subspace in which to carry out the machine learning study (cf. ‘Results’). For this purpose, a set of 2000 structures was randomly selected from the full space, and it was investigated using quantum chemistry calculations. After this initial screening, the chemical space was narrowed down to 4,573,800 molecules, in which the central residues (R5 and R6) were fixed to ethenyl-groups. From this reduced space, a set of 7500 molecules was randomly selected to serve as the training set for machine learning.
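The sizes of the full and the reduced chemical space can be reproduced by direct enumeration. The sketch below rests on two assumptions that are not spelled out above: that hydrogen counts as one of the eight substituent groups, and that a molecule is an unordered pair of half-structures, so that a structure and its inverted image are counted once. The group labels are illustrative shorthand.

```python
from itertools import product

# Assumed set of eight groups: hydrogen plus the seven named substituents.
GROUPS = ["H", "F", "Cl", "CN", "OMe", "ethenyl", "Ph", "Thi"]
BULKY = {"Ph", "Thi"}  # phenyl and thiophenyl, at most one per half-structure


def count_half_structures(n_free_positions):
    """Number of allowed decorations of one half with the given free positions."""
    return sum(
        1
        for combo in product(GROUPS, repeat=n_free_positions)
        if sum(group in BULKY for group in combo) <= 1
    )


def count_molecules(n_half_structures):
    """Unordered pairs of half-structures, identical halves included."""
    return n_half_structures * (n_half_structures + 1) // 2


# Full space: five free positions per half (R1-R5 and R6-R10).
full_space = count_molecules(count_half_structures(5))

# Reduced space: R5 and R6 fixed to ethenyl, leaving four free positions.
reduced_space = count_molecules(count_half_structures(4))
```

Under these assumptions, the enumeration yields 20,736 and 3,024 allowed half-structures, which combine to 215,001,216 and 4,573,800 molecules, in agreement with the numbers given in the text.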

Utilising the inversion symmetry of the core scaffold, initial guess structures for the calculations were efficiently generated from combinations of the half-structures (indicated by the dashed line in Fig. 2a). The half-structures themselves were constructed using the Schrodinger Materials Science Suite43. Note that, for the initial guess, no structural optimisation other than that provided by the Schrodinger Suite was performed43. In some instances when the substituents were too close or overlapping, the guess structures were adjusted prior to any quantum chemistry calculations by rotating the substituents slightly apart.

Quantum chemistry calculations

To generate the excitation energies of the chemical space, a machine learning model was developed. For this purpose, both labels (i.e. answer data, E(S1) and E(T1)) and features (i.e. describing data) for the excitation energies of the first singlet and triplet states had to be obtained for a set of molecules.

The label excitation energies of the 7500 selected training set structures were obtained with the following quantum chemistry protocol. After pre-optimisation of the guess structures using the semi-empirical PM6 method44, the remaining 7214 converged structures were used in restricted density functional theory (RDFT)45,46 calculations for final structure optimisation. The excited singlet energies E(S1) were calculated using TDDFT47,48 for the RDFT optimised ground state structures. Previous benchmark studies reported that triplet energies from TDDFT calculations tend to be unreliable for the study of SF materials, so the Δ-SCF method was used to calculate the E(T1) energy labels instead32,36.

All quantum chemistry calculations were conducted using the Gaussian programme package49 and utilised the B3LYP functional in the respective restricted or unrestricted version50,51,52,53. Similar to other studies36, we found that combinations with the 6-31G(d’) basis set54,55 had a favourable balance between accuracy (compared to available data from other studies) and computation time for several available standard methods. It should be mentioned that, although we expected the reported combination of the M06-2X functional and the 6-311+G(d,p) basis set to give a more accurate image27, the trade-off with respect to the higher calculation time was considered unfavourable for the high-throughput nature of our study. Calculations of the biradical character y0 were performed by running BS-UHF calculations on the B3LYP-optimised ground state structures with the larger 6-311+G(d,p) basis set56.

The features required to train the machine learning model were obtained from the semi-empirical PM6 structure optimisation calculations (which also served as pre-optimisation for the training set). On average, one of these calculations took less than 40 CPU seconds per structure, so they presented a balanced way of obtaining useful electronic structure information and structural features for the 4,573,800 molecules considered. In the following discussion, the generation of the features will be considered together with the central aspects of the machine learning model. For a general overview, Westermayr et al. recently compiled a list of state-of-the-art machine learning models for excited state properties57.

General machine learning model specifications

The machine learning technique applied in this work was kernel ridge regression (KRR)58, as implemented in the scikit-learn Python package59. This is a non-linear method where a sum of products is used to predict a desired property (here, energy Epred) connected to a molecule \(M^{\prime}\) with respect to a training set M with known reference energies. That is,

$${E}^{{{{\rm{pred}}}}}(M^{\prime} ,{{{\bf{M}}}})=\mathop{\sum}\limits_{i}{\alpha }_{i}f(M^{\prime} ,{M}_{i}).$$
(1)

Here, αi are the coefficients to be trained, and the function \(f(M^{\prime} ,{M}_{i})\) determines how similar molecule \(M^{\prime}\) is to the i-th molecule of the training set Mi. Training of the coefficients is performed by minimising the expression

$${\min }_{\alpha }\mathop{\sum}\limits_{j}{({E}^{{{{\rm{pred}}}}}({M}_{j})-{E}_{j}^{{{{\rm{lab}}}}})}^{2}+\lambda \mathop{\sum}\limits_{j}{\alpha }_{j}^{2},$$
(2)

for the training subset. Here the superscript ‘lab’ indicates the label values for the respective excitation energies E. To increase the generality, a regularisation term with respect to the norm of the coefficients was added. The corresponding prefactor λ is a hyperparameter of the model itself, and it does not change during training; instead, it needs to be optimised separately via cross-validation. Through kernelisation, the minimisation problem can be expressed as the matrix inversion problem

$${{{\boldsymbol{\alpha }}}}={({{{\bf{K}}}}+\lambda {{{\bf{I}}}})}^{-1}{{{{\bf{E}}}}}^{{{{\rm{ref}}}}},$$
(3)

in which I is the identity matrix, K is the so-called kernel matrix, and \({{\bf{E}}}^{{\rm{ref}}}\) is the vector of label (reference) energies of the training set. Individual elements of the kernel matrix are obtained as Kij = f(Mi, Mj). To design a machine learning model suitable for predicting excitation energies, different descriptive expressions for f correlated to the desired quantity must be defined.
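As a minimal sketch of Eqs. 1 and 3, the coefficients can be obtained by solving the regularised linear system rather than forming the inverse explicitly. The function names, the one-dimensional toy descriptor, and the energy values below are illustrative assumptions and are not taken from this work.

```python
import numpy as np

def train_krr(K, E_ref, lam):
    """Solve (K + lam * I) alpha = E_ref for the coefficients alpha (Eq. 3)."""
    n = K.shape[0]
    # Solving the linear system avoids computing the matrix inverse explicitly.
    return np.linalg.solve(K + lam * np.eye(n), E_ref)

def predict_krr(K_new, alpha):
    """Evaluate Eq. 1; K_new[a, i] = f(M'_a, M_i) has shape (n_new, n_train)."""
    return K_new @ alpha

def gaussian_kernel(a, b, sigma=1.0):
    """Toy similarity function f on scalar descriptors (assumption)."""
    d = a[:, None] - b[None, :]
    return np.exp(-d**2 / (2.0 * sigma**2))

# Hypothetical toy data: one scalar descriptor per "molecule", made-up energies.
x_train = np.array([0.0, 1.0, 2.0, 3.0])
E_train = np.array([2.0, 2.5, 2.2, 1.8])

K = gaussian_kernel(x_train, x_train)
alpha = train_krr(K, E_train, lam=1e-8)
E_fit = predict_krr(K, alpha)  # with tiny lam, the training labels are reproduced
```

With a very small λ the model interpolates the training labels, which is why λ has to be chosen by cross-validation rather than by the training error.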

In this work, f took the form of products of Gaussians with respect to the abstract distance measures between two molecules dg(Mi, Mj),

$$f({M}_{i},{M}_{j})=\exp \left(-\left(\mathop{\sum}\limits_{g}\frac{{d}_{g}{({M}_{i},{M}_{j})}^{2}}{2{\sigma }_{g}^{2}}\right)\right).$$
(4)

Here, the width of each respective Gaussian σg is a hyperparameter of the machine learning model. Note that in practice, however, we made an initial guess for the individual σg based on the data distribution in histograms of the training set’s distances dg, and then scaled all σg according to a single, collective hyperparameter σ. Therefore, the model has only two hyperparameters, λ and σ, that need to be optimised through cross-validation.
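The kernel of Eq. 4 with collectively scaled widths could be sketched as follows. This assumes that the collective hyperparameter σ acts as a multiplicative factor on the initial guesses, and it uses the median non-zero distance as a stand-in for the histogram-based guess; both choices, as well as the function name and toy data, are assumptions for illustration.

```python
import numpy as np

def product_gaussian_kernel(distances, sigma_guess, sigma_scale):
    """Eq. 4 with widths sigma_g = sigma_scale * sigma_guess[g] (assumed scaling).

    distances[g, i, j] holds d_g(M_i, M_j) for descriptor type g.
    """
    sigma_g = sigma_scale * np.asarray(sigma_guess, dtype=float)
    # Sum the scaled squared distances over all descriptor types g.
    exponent = np.sum(distances**2 / (2.0 * sigma_g[:, None, None] ** 2), axis=0)
    return np.exp(-exponent)

# Hypothetical toy data: 2 scalar descriptors for 3 molecules.
rng = np.random.default_rng(0)
feats = rng.normal(size=(2, 3))
distances = np.abs(feats[:, :, None] - feats[:, None, :])

# Stand-in for the histogram-based initial guess: median non-zero distance.
sigma_guess = np.array([np.median(d[d > 0]) for d in distances])

K = product_gaussian_kernel(distances, sigma_guess, sigma_scale=1.0)
```

Because the exponent is a sum over g, the result is identical to multiplying one Gaussian per distance measure, so only the single scale factor remains to be tuned.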

Depending on the functional choice of the distance measures dg, the machine learning model will be more or less capable of relating the features of a set of molecules to their respective labels. To prevent the model from learning the training set without being able to generalise to unknown structures (i.e., to prevent overfitting), we performed stratified 5-fold cross-validation for the selection of hyperparameters and benchmarked models that were trained with differently split compositions of 80% training and 20% testing sets.
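A hedged sketch of such a hyperparameter search with scikit-learn is given below. Since the labels are continuous, stratification is done here on quantile bins of the label, which is an assumption, as are the built-in RBF kernel used in place of the custom kernel, the synthetic data, and the grid values.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import StratifiedKFold

def cv_mae(X, y, lam, sigma, n_splits=5, n_bins=5, seed=0):
    """Mean absolute error averaged over stratified folds."""
    # Bin the continuous labels into quantiles so that folds can be stratified.
    edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
    strata = np.digitize(y, edges)
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    errors = []
    for train_idx, val_idx in folds.split(X, strata):
        # The RBF kernel exp(-gamma * |x - x'|^2) matches a Gaussian of width sigma.
        model = KernelRidge(kernel="rbf", alpha=lam, gamma=1.0 / (2.0 * sigma**2))
        model.fit(X[train_idx], y[train_idx])
        errors.append(np.mean(np.abs(model.predict(X[val_idx]) - y[val_idx])))
    return float(np.mean(errors))

# Synthetic stand-in data (not from this work).
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=100)

grid = [(lam, sigma) for lam in (1e-3, 1e-1, 10.0) for sigma in (0.3, 1.0, 3.0)]
scores = {(lam, sigma): cv_mae(X, y, lam, sigma) for lam, sigma in grid}
best_lam, best_sigma = min(scores, key=scores.get)
```

The pair with the lowest cross-validated error would then be used to train the final model on the full training portion.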

Overall, the final models use a total of 54 features belonging to fifteen types of distance measure descriptors dg, the detailed descriptions of which can be found in the ESI. Note that, even though the features were designed to support machine learning correlations, not all of them necessarily reflected physico- and quantum chemical behaviours. Further, the landscape of machine learning assisted chemical design studies is growing rapidly, and by the nature of the task, the developed features are often custom-made solutions for a given problem, rather than universally applicable quantities whose true origin can be easily tracked. Therefore, although we reference special cases where we know of an earlier application or definition of a feature60,61, the list of references is almost certainly incomplete.

Finally, note that the models for energy prediction designed here are not transferable to other chemical spaces in their current form, since they were only developed as a tool to predict the excitation energies of cibalackrot-type molecules.

Random forest classifier

To determine a mapping between the predicted excitation energies and chemical structures, a random forest classifier62,63, as implemented in the scikit-learn Python programme package59, was trained to distinguish SF-PV structures from non-SF-PV structures using only information regarding the chemical substituents and their positions. In total, 120 features were used for training, which were subdivided into three groups of increasing generality as follows.

In the first group of features, each possible combination of the eight chemical substituents at the eight possible functionalisation sites R1–R4 and R7–R10 was evaluated. This resulted in 8² = 64 different Boolean features (e.g., checking for specific modifications such as Ph1). Similarly, a second group was constructed such that any of the pairwise inversion-symmetry equivalent positions counted for the determination of the Boolean. In other words, for all eight chemical groups, the four pairs of positions were evaluated in the same way as for group one (e.g., specific modification Ph1,10 at R1 or R10). This treatment resulted in 4 ⋅ 8 = 32 features for the second group.
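The first two feature groups can be illustrated with a short sketch. The group labels G1 to G8 are placeholders for the eight substituents, and the pairing of sites 2/9 and 3/8 is assumed by analogy with the pairs 1/10 and 4/7; the molecule representation as a dictionary is likewise hypothetical.

```python
POSITIONS = [1, 2, 3, 4, 7, 8, 9, 10]        # sites R1-R4 and R7-R10
PAIRS = [(1, 10), (2, 9), (3, 8), (4, 7)]    # assumed inversion-equivalent pairs
GROUPS = [f"G{k}" for k in range(1, 9)]      # placeholder substituent labels

def site_features(mol):
    """Group 1: one Boolean per (substituent, site) combination, 8 * 8 = 64."""
    return [mol[pos] == grp for grp in GROUPS for pos in POSITIONS]

def pair_features(mol):
    """Group 2: True if either site of a symmetric pair carries the group, 4 * 8 = 32."""
    return [mol[a] == grp or mol[b] == grp for grp in GROUPS for a, b in PAIRS]

def invert(mol):
    """Swap the substituents of inversion-equivalent sites."""
    partner = {}
    for a, b in PAIRS:
        partner[a], partner[b] = b, a
    return {pos: mol[partner[pos]] for pos in POSITIONS}

# Hypothetical molecule: maps each site to the substituent it carries.
mol = {1: "G3", 2: "G1", 3: "G1", 4: "G5", 7: "G1", 8: "G1", 9: "G1", 10: "G1"}
group1 = site_features(mol)
group2 = pair_features(mol)
```

By construction, the second group is unchanged when the molecule is inverted, whereas the first group is not.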

Finally, a third group of features was generated based on the predicted chemical group distribution of SF-PV molecules (cf. Fig. 1c). For this purpose, the substituent sites of each molecule were evaluated based on whether they were carrying a substituent that was typical, common, or uncommon in SF-PV molecules. The three cases were distinguished as follows.

In the predicted distribution of chemical groups in SF-PV molecules (cf. Fig. 1c), for each position, a specific portion of molecules was found to carry a chemical group at given pairs of inversion-symmetry equivalent sites, expressed as pPos(Grp). For each site, the mean percentage \({\tilde{p}}^{{{{\rm{Pos}}}}}\) and standard deviation \({\sigma }_{p}^{{{{\rm{Pos}}}}}\) were calculated. The portion of phenyl- and thiophenyl groups was different owing to the choice of structure generation rules; therefore, to equalise the picture somewhat, the portion of phenyl- and thiophenyl groups was doubled for the purpose of pPos(Grp). For each position, the distinction between typical, common and uncommon groups was then performed according to:

$${{{\rm{Type}}}}\,\,{{{\rm{of}}}}\,\,{{{{\rm{Grp}}}}}_{{{{\rm{Pos}}}}}=\left\{\begin{array}{l}{{{\rm{typical}}}}\,\,\,\,{{{\rm{for}}}}\,\,{p}^{{{{\rm{Pos}}}}}({{{\rm{Grp}}}}) \,>\, {\tilde{p}}^{{{{\rm{Pos}}}}}+{\sigma }_{p}^{{{{\rm{Pos}}}}}/2,\quad \\ {{{\rm{uncommon}}}}\,\,\,\,{{{\rm{for}}}}\,\,{p}^{{{{\rm{Pos}}}}}({{{\rm{Grp}}}}) \,<\, {\tilde{p}}^{{{{\rm{Pos}}}}}-{\sigma }_{p}^{{{{\rm{Pos}}}}}/2,\quad \\ {{{\rm{common}}}}\,\,\,\,\,{{\mbox{for all remaining cases.}}}\,\quad \end{array}\right.$$
(5)

Following this scheme, for each position of each molecule, one of three cases was assigned. These three cases were subsequently one-hot encoded to obtain three Boolean features for each of the eight positions, giving a total of 24 features for this last group.
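The assignment of Eq. 5 and the subsequent one-hot encoding can be sketched as follows. The portions, the group labels A to D, and the use of the population standard deviation are assumptions for illustration, and the doubling of the phenyl and thiophenyl portions is omitted here.

```python
import numpy as np

LABELS = ("typical", "common", "uncommon")

def classify_groups(portions):
    """Assign a label to each group at one position according to Eq. 5."""
    values = np.array(list(portions.values()), dtype=float)
    mean, std = values.mean(), values.std()  # population std (assumption)
    labels = {}
    for group, p in portions.items():
        if p > mean + std / 2:
            labels[group] = "typical"
        elif p < mean - std / 2:
            labels[group] = "uncommon"
        else:
            labels[group] = "common"
    return labels

def one_hot(label):
    """Three Boolean features per position: typical, common, uncommon."""
    return [label == name for name in LABELS]

# Hypothetical portions of four groups at one position (made-up numbers).
portions = {"A": 0.50, "B": 0.25, "C": 0.20, "D": 0.05}
labels = classify_groups(portions)
encoded = one_hot(labels["B"])
```

Applying `one_hot` to the label of the substituent found at each of the eight positions yields the 8 ⋅ 3 = 24 features of the third group.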

To check the generality of the resulting random forest classifier, all structures were randomly subjected to an inversion operation before feature generation. The model used 300 different decision trees with a maximum depth of 20 distinguishing steps. It was confirmed that for different randomisation seeds, training-testing splits and slightly changed model size and depth parameters, the arising classification rules were approximately identical, within reasonable margins. The training itself was conducted with a balanced data subset of 50% SF-PV and 50% non-SF-PV molecules. From the full chemical space, different subsets were constructed. First, a balanced set with non-SF-PV structures taken from all energetic regions of the E(S1)/E(T1) distribution was considered. Second, non-SF-PV structures were taken from specific energetic quadrants with respect to the mean energies \(\tilde{E}({{{{\rm{S}}}}}_{{{{\rm{1}}}}})\) and \(\tilde{E}({{{{\rm{T}}}}}_{{{{\rm{1}}}}})\) (cf. grey lines on the left of Fig. 1c). From each of the quadrants, one additional balanced set was generated. Training was conducted using each of the subsets, which resulted in different classification rules each time. To test the quality of the resulting models, each dataset was split 80:20 into training and testing sets. The quality of each model was determined using the test set of the balanced random set. The best performing model was selected based on accuracy and sensitivity with an emphasis on the latter, since in the actual chemical space many more non-SF-PV structures are expected. It should be noted that although all the datasets used the same SF-PV molecules before the training/testing split, the non-SF-PV entries of each quadrant may or may not have randomly overlapped with the balanced random set.
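A minimal sketch of this training and evaluation procedure with scikit-learn is shown below. The random Boolean feature matrix and the labelling rule are synthetic stand-ins and carry no chemical meaning; only the forest size, the maximum depth, the class balancing, and the 80:20 split follow the description above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 120 Boolean features (assumption).
rng = np.random.default_rng(0)
X_all = rng.integers(0, 2, size=(4000, 120)).astype(bool)
y_all = (X_all[:, 0] & X_all[:, 1]).astype(int)  # hypothetical "SF-PV" rule

# Balance the classes by undersampling the majority (negative) class.
pos = np.flatnonzero(y_all == 1)
neg = rng.choice(np.flatnonzero(y_all == 0), size=pos.size, replace=False)
idx = np.concatenate([pos, neg])
X, y = X_all[idx], y_all[idx]

# 80:20 split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=300, max_depth=20, random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
sensitivity = recall_score(y_test, y_pred)  # recall of the positive class
```

Repeating the fit with other values of `random_state` is one way to check that the learned rules are stable with respect to the seed.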

Once the model was chosen, the probability of each classification, ‘SF-PV’ or ‘non-SF-PV’, was calculated for all 4,463,586 molecules. From these molecules, the ones with the highest probabilities for SF-PV and non-SF-PV were treated as the most confident; i.e., most congruent with the learned classifier. Consequently, these confident cases were expected to yield the best separation in subsequent investigations of the quantum chemical properties, as performed in the results section.
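The selection of the most confident cases can be sketched as a simple ranking of the class probabilities, for example those returned by the `predict_proba` method of a scikit-learn classifier. The function name, the number of selected molecules `n`, and the probability values are hypothetical.

```python
import numpy as np

def most_confident(p_sfpv, n):
    """Return indices of the n most confident SF-PV and non-SF-PV cases.

    p_sfpv holds the predicted SF-PV probability per molecule; for a binary
    classifier, the non-SF-PV probability is 1 - p_sfpv.
    """
    order = np.argsort(p_sfpv)          # ascending in P(SF-PV)
    top_sfpv = order[::-1][:n]          # highest P(SF-PV) first
    top_non_sfpv = order[:n]            # lowest P(SF-PV), i.e. highest P(non-SF-PV)
    return top_sfpv, top_non_sfpv

# Made-up probabilities for six molecules.
p = np.array([0.91, 0.15, 0.55, 0.98, 0.03, 0.47])
top_sfpv, top_non_sfpv = most_confident(p, n=2)
```

Molecules with probabilities near 0.5, such as the third and sixth entries here, are the ones the classifier is least certain about and would be excluded from both confident sets.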