Introduction

Deep learning (DL) algorithms are a subdomain of artificial intelligence (AI) that use a high-generalization approach to recognize and interpret images,1 enabling efficient identification of the properties of materials.2 AI research has been applied to two-dimensional (2D) materials to analyze their optical,2,3 physical,4 and electronic properties.5,6,7 Electronic properties, such as bandgaps and electron affinities, have been predicted using machine learning (ML) and DL models based on the structure–property relationship. Segmentation8,14 has been studied to identify and segment MoS2 flakes with mono-, bi-, tri-, and multilayers. An encoder-decoder semantic segmentation network15 has been configured for pixel-wise identification of optical images of 2D materials along with graphical features, such as contrast, color, edges, shapes, flake sizes, and their distributions. Similarly, a DL-based atomic defect detection framework (DL-ADD)13 has been demonstrated to efficiently detect atomic defects in MoS2 and to generalize defect detection to other TMD materials. DL architectures such as DenseNet3 have also been used to characterize optical images, although they demand many data points to train the networks. An ML-based solution19 has been modeled to map simulation results from indentation pillar-splitting experiments and predict the critical indentation load of fracture instability using Gaussian process regression. Notably, image-to-image translation20,21,22,23 using conditional generative adversarial networks (cGANs) has been studied for translating optically sectioned structured illumination microscopy (SIM) images, for semantic segmentation,24,25 and for image processing.26 A game theory-based cGAN26 has also been demonstrated to predict physical fields, such as stress or strain, from the material microstructure geometry. Although cGANs work well with limited data to capture complex information from pixels, the application of pix2pix, a cGAN for image-to-image translation, to characterizing TMDs remains unexplored to date.

Here, we demonstrate a DL-based image-to-image translation approach using cGANs, trained on labeled optical images, to enable intelligent characterization of mechanically exfoliated and CVD-grown TMDs. Unlike other AI-based research on TMDs, this method requires only limited data to train and evaluate the model. To ensure that our DL model effectively learns the complex variations of pixels and accurately maps them to TMD thicknesses, we utilize experimental data obtained from Raman and photoluminescence (PL) spectroscopy. These data associate layer information with individual pixels and assign specific colors to represent different layers. We preprocess the data and train a pix2pix model to generate labeled images from optical images, thereby identifying the number of layers in TMDs. To assess the performance of the model, we conduct quantitative measurements using structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), and mean squared error (MSE) scores. We further investigate the generalization ability of the model by training it on MoS2 and WS2 samples and successfully testing it on WSe2 samples, demonstrating its capability to adapt to different materials. Finally, we apply the model to characterize heterostructures, highlighting its ability to analyze complex material structures.

Results and discussion

Synthesis and characterizations of TMDs

Figure 1 illustrates the workflow of multimodal analysis of TMDs using DL-based cGANs. TMDs with varying numbers of layers were transferred (via mechanical exfoliation) or synthesized (via low-pressure CVD [LPCVD]) on 300-nm SiO2/Si substrates (see the "Materials and methods" section). To verify the growth of the materials and determine the number of layers, we characterized the samples using Raman spectroscopy, PL spectroscopy, and atomic force microscopy (AFM). Figure 1c–d shows, as an example, the Raman and PL spectra of three-layer, four-layer, and bulk (more than four layers) mechanically exfoliated MoS2. All spectra were taken with 532-nm excitation. Figure 1c shows the E12g (in-plane vibration of Mo and S atoms) and A1g (out-of-plane vibration of S atoms) phonon modes of mechanically exfoliated MoS2. The phonon modes of three-layer (3L) MoS2 are located at 382.02 cm−1 (E12g) and 405.54 cm−1 (A1g).

Figure 1

Process of multimodal analysis of transition-metal dichalcogenides. (a) Optical image of mechanically exfoliated MoS2. (b) Labeled image of MoS2. (c) Raman spectra of three-layer, four-layer, and bulk MoS2. (d) PL spectra of three-layer, four-layer, and bulk MoS2. (e) Workflow of image-to-image translation using conditional generative adversarial networks.

Similarly, the phonon modes of bulk MoS2 (more than four layers) are located at E12g = 381.62 cm−1 and A1g = 407.09 cm−1. The out-of-plane (A1g) peak blue-shifts from 405.54 cm−1 to 407.09 cm−1 as the thickness increases from three layers to bulk, reflecting the increased number of layers.27 The Raman shift between the in-plane and out-of-plane peaks increases from ∼23.52 cm−1 for three layers (3L) to ∼25.47 cm−1 for the bulk (more than four layers).27 Figure 1d presents the PL spectra of mechanically exfoliated MoS2. As observed by others, the PL intensity of the three-layer sample was much higher than that of the other two samples (four or more layers).27,28 Figure 1e shows the architecture of the cGAN model used to characterize TMDs. The labeled images section shows labeled images based on the number of layers identified using the Raman and PL spectra. A cGAN comprises a generator and a discriminator. The generator takes optical images as input and generates images, which are subsequently fed to the discriminator along with the labeled images. The discriminator then compares both images and returns its output to the generator.

Data preprocessing

Multiple preprocessing steps were applied to the collected optical images to improve image quality before they were fed to the model. This procedure addressed potential deterioration of images captured by an optical microscope, including uneven lighting and gradual degradation of the camera sensor. First, we applied median filtering,29 a denoising technique, to smooth the optical images and generate denoised images, followed by Gaussian filtering30 to smooth the images further. The Gaussian average of the neighboring pixels of each pixel is calculated by

$$\text{Gaussian}(x,y) = \frac{1}{2\pi\sigma^{2}}\,e^{-\frac{x^{2} + y^{2}}{2\sigma^{2}}},$$
(1)

where x and y refer to the pixel location in the image and σ is the standard deviation of the Gaussian kernel. This process generated blurred images and removed high-frequency noise. Figure 2a, d shows the denoised images of CVD-grown WS2 and exfoliated MoS2 after applying the median and Gaussian filtering operations. We then produced sharpened images by blending the denoised image with a positive weight of 1.5 and the blurred image with a negative weight of 0.5, which enhanced the edges and other details in the images. Finally, we normalized the pixel values of the sharpened image to the range (0, 255), calculated by

$$\text{pixel Normalized} = \frac{\text{pixel Val} - \text{pixel Min}}{\text{pixel Max} - \text{pixel Min}},$$
(2)

where pixel Val is the actual value of the pixel, and pixel Min and pixel Max are the minimum and maximum values over all pixels in the image, respectively. Figure 2b, e shows the final normalized images after the median filtering, Gaussian filtering, sharpening, and normalization steps. This process keeps the data within a consistent scale, which speeds up convergence during training, reduces training time, and improves the generalization of the model.
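As an illustration, the full preprocessing chain can be sketched in Python with OpenCV and NumPy as follows. The blending weights of 1.5 and −0.5 follow the description above, whereas the kernel sizes, σ, and the final rescaling to (0, 255) are assumed values for this sketch rather than the exact settings used in this work.

```python
import cv2
import numpy as np

def preprocess_optical_image(path, median_ksize=5, gaussian_ksize=(5, 5), sigma=0):
    """Denoise, sharpen, and normalize an optical image of a TMD flake.

    median_ksize, gaussian_ksize, and sigma are assumed values; the exact
    kernel parameters are not specified in the text.
    """
    image = cv2.imread(path)  # optical image (BGR)

    # 1. Median filtering removes impulsive sensor noise (denoised image).
    denoised = cv2.medianBlur(image, median_ksize)

    # 2. Gaussian filtering (Equation 1) removes remaining high-frequency noise.
    blurred = cv2.GaussianBlur(denoised, gaussian_ksize, sigma)

    # 3. Blend the denoised image (weight 1.5) with the blurred image
    #    (weight -0.5) to sharpen edges and fine details.
    sharpened = cv2.addWeighted(denoised, 1.5, blurred, -0.5, 0)

    # 4. Min-max normalization (Equation 2), rescaled to the (0, 255) range.
    pixel_min, pixel_max = sharpened.min(), sharpened.max()
    normalized = (sharpened.astype(np.float32) - pixel_min) / (pixel_max - pixel_min)
    return (normalized * 255).astype(np.uint8)
```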

Figure 2

Optical image preprocessing steps of chemical vapor deposition (CVD)-grown WS2 and mechanically exfoliated MoS2. (a, d) Denoised images of CVD-grown WS2 and exfoliated MoS2, respectively. (b, e) Normalized images of CVD-grown WS2 and exfoliated MoS2, respectively. (c, f, i) HSV (H-hue, S-saturation, V-value) color space of the optical images in (a), (d), and (g). (g) Optical image with a reference line over pixels containing substrate, monolayers, and bilayers. (h) Detection of flakes. (j) Color profile along the reference line in (g). (k) Area distribution of the optical image (top right).

Color-based segmentation for detecting TMDs

We further performed color-based segmentation to verify the presence of TMDs in the optical images by converting them from RGB (R-red, G-green, B-blue) to HSV31 (H-hue, S-saturation, V-value) color space, separating each image into hue, saturation, and value components. We then created a mask based on the hue component, retaining the color information of the original image for pixels that meet the hue criteria while setting all other pixels to zero. Figure 2c, f, and i shows the HSV color space of the optical images, with the scale bars displaying the hue component. Figure 2h shows the detected flakes bounded by rectangles, while the reference line in Figure 2g crosses pixels containing substrate, monolayers, and bilayers. We generated color profiles of the red, green, and blue channels along the reference line. Figure 2j shows these color profiles; the deviation of the red channel within the small circle indicates the presence of a bilayer.
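A minimal sketch of this hue-based masking, assuming OpenCV's HSV convention and placeholder hue bounds (the actual thresholds depend on the substrate and material contrast and are not reproduced here), is given below.

```python
import cv2
import numpy as np

def hue_mask(optical_bgr, hue_low=90, hue_high=130):
    """Keep the original color only for pixels whose hue meets the criteria.

    hue_low and hue_high are hypothetical bounds chosen for illustration.
    """
    hsv = cv2.cvtColor(optical_bgr, cv2.COLOR_BGR2HSV)
    hue = hsv[:, :, 0]

    # Boolean mask on the hue channel; all other pixels are set to zero.
    keep = (hue >= hue_low) & (hue <= hue_high)
    masked = np.where(keep[..., None], optical_bgr, 0)
    return masked, keep
```

Bounding boxes such as those in Figure 2h can then be obtained by running contour or connected-component analysis on the resulting boolean mask.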

Data labeling

For layer identification in TMDs, the collected optical images were manually annotated using Labelbox,32 a web-based labeling tool, following a well-defined ontology. Labelbox provides a set of built-in web services that can be used to automate the process on batches of data. Five classes were used to assign each image pixel to mono-, bi-, tri-, or four layers, or bulk (more than four layers), and each class was labeled with one specific color: monolayer, bilayer, three-layer, four-layer, and bulk pixels were colored blue, green, red, cyan, and light gray, respectively. Figure 3a shows a set of labeled images. Figure 3b–c shows an optical image and the corresponding labeled image of mechanically exfoliated MoS2. Figure 3d shows the mask images for each class and the legend used for coloring the image pixels. Figure 3e–f depicts 3D plots of pixel intensities to visualize the pixels before and after labeling. Once labeled, each optical image was paired with its respective labeled image to be fed into the model (Figure 4).
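The five-class color convention described above can be captured in a simple lookup table, as in the sketch below; the RGB triplets are representative values for the named colors, not the exact codes exported from Labelbox.

```python
# Layer class -> RGB color used in the labeled images (representative values).
LAYER_COLORS = {
    "monolayer":  (0, 0, 255),      # blue
    "bilayer":    (0, 255, 0),      # green
    "trilayer":   (255, 0, 0),      # red
    "four_layer": (0, 255, 255),    # cyan
    "bulk":       (211, 211, 211),  # light gray (more than four layers)
}
```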

Figure 3

Optical image labeling. (a) Grid of labeled optical images. (b) Optical image of MoS2. (c) Manually labeled image of MoS2. (d) Mask images of each layer in the image (c). (e) Scatterplot of pixels of the optical image in red-green-blue color space shown in the image (c). (f) Scatterplot of pixels of the labeled image.

Figure 4

Model architecture of the conditional generative adversarial network. The architecture includes a generator and a discriminator in an adversarial training framework, trained on an NVIDIA GeForce GTX 1080 graphics processing unit (GPU). The generator (based on the U-Net architecture) transforms input images into labeled images with a resolution of 780 × 588 pixels. The discriminator (based on the PatchGAN architecture) distinguishes between actual labeled images and generated images and provides an output of 0 (fake) or 1 (real), which is subsequently backpropagated to the generator.

Model architecture and training

The pix2pix model, a cGAN designed explicitly for image-to-image translation, comprises two models: a generator and a discriminator. The generator takes an optical image as input and transforms it into another image, which, along with the corresponding labeled image, is fed into the discriminator, which compares the similarity between the two images. The generator is an encoder-decoder model based on the U-Net17 architecture: the encoder encodes the input image and extracts its features, while the decoder maps the features back to the size of the image. The discriminator is based on the PatchGAN23 architecture, which provides a binary output of 0 or 1 to indicate whether the generated image is fake or real. PatchGAN discriminates local image patches rather than the entire image, which allows finer-grained analysis of image details and provides more precise feedback to the generator. In image-to-image translation tasks like ours, where optical features such as contrast variations, thickness variations, and colors are crucial, this localized discrimination helps capture intricate features accurately. Operating on patches also enables the discriminator to handle high-resolution outputs; in our case, where the goal is to generate detailed annotations for optical images of TMDs, producing high-resolution output maps is essential for preserving image quality and capturing optical features. The generator and discriminator are stacked together and updated jointly during training (a sketch of both networks and their losses is given at the end of this section). The generator is updated continuously to minimize the adversarial loss returned by the discriminator together with an L1 loss between the generated image and the labeled image, calculated by

$$\begin{aligned} G_{\text{loss}} &= \text{Adversarial loss} + \lambda \times L1\ \text{loss} \\ &= \frac{1}{m}\sum_{i=1}^{m}\log\left(1 - G(x)\right) + \lambda \times \text{MAE}\left(G(x), y\right), \end{aligned}$$
(3)

where λ is a hyperparameter, L1 loss is the mean absolute error (MAE) between the generated and labeled images, y is the labeled image, G(x) is the generated image, Gloss is the generator loss, and the adversarial loss is the sigmoid cross-entropy loss. Similarly, the discriminator loss is calculated by

$$D_{\text{loss}} = \frac{1}{m}\sum_{i=1}^{m}\log\left(1 - y\right) + \log\left(G(x)\right),$$
(4)

where Dloss is the discriminator loss, y is the labeled image, and G(x) is the generated image. Overall, the cGAN objective tries to maximize the discriminator loss and minimize the generator loss simultaneously to generate the required result. The combined loss function is given by

$$\text{Combined loss}_{G,D} = E_{x,y}\left[\log D(x,y)\right] + E_{x}\left[\log\left(1 - D\left(x, G(x)\right)\right)\right],$$
(5)

where Ex,y is the expected value of log D(x,y), with D(x,y) representing the discriminator output when a pair consisting of an input image and its corresponding labeled image is provided, and Ex is the expected value of log(1 − D(x,G(x))), with D(x,G(x)) representing the discriminator output when a pair consisting of an input image and the generator output image is provided. The final loss function of the network is calculated by

$$G^{*}, D^{*} = \arg\min_{G}\max_{D}\ \text{Combined loss}_{G,D} + \lambda_{2}\,\text{MSE}(G),$$
(6)

where MSE(G) is the mean squared error loss of the generator, λ2 is a hyperparameter, and the min and max operators indicate that the generator loss is minimized and the discriminator loss is maximized simultaneously.
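To make the architecture concrete, the following TensorFlow/Keras sketch outlines a U-Net-style generator and a PatchGAN discriminator in the spirit of the standard pix2pix recipe; the number of blocks and the filter counts are illustrative assumptions, not the exact configuration used for the results reported here.

```python
import tensorflow as tf
from tensorflow.keras import layers

def downsample(filters):
    """Encoder block: strided convolution -> batch norm -> LeakyReLU."""
    return tf.keras.Sequential([
        layers.Conv2D(filters, 4, strides=2, padding="same", use_bias=False),
        layers.BatchNormalization(),
        layers.LeakyReLU(),
    ])

def upsample(filters):
    """Decoder block: transposed convolution -> batch norm -> ReLU."""
    return tf.keras.Sequential([
        layers.Conv2DTranspose(filters, 4, strides=2, padding="same", use_bias=False),
        layers.BatchNormalization(),
        layers.ReLU(),
    ])

def build_generator(image_size=512):
    """U-Net generator: optical image in, labeled image out."""
    inputs = layers.Input(shape=(image_size, image_size, 3))
    skips, x = [], inputs
    for f in (64, 128, 256, 512):                  # encoder extracts features
        x = downsample(f)(x)
        skips.append(x)
    for f, skip in zip((512, 256, 128), reversed(skips[:-1])):
        x = upsample(f)(x)                          # decoder maps features back
        x = layers.Concatenate()([x, skip])         # U-Net skip connection
    outputs = layers.Conv2DTranspose(3, 4, strides=2, padding="same",
                                     activation="tanh")(x)
    return tf.keras.Model(inputs, outputs)

def build_discriminator(image_size=512):
    """PatchGAN: scores local patches of (optical, labeled) image pairs."""
    optical = layers.Input(shape=(image_size, image_size, 3))
    labeled = layers.Input(shape=(image_size, image_size, 3))
    x = layers.Concatenate()([optical, labeled])
    for f in (64, 128, 256):
        x = downsample(f)(x)
    patch_logits = layers.Conv2D(1, 4, padding="same")(x)  # one logit per patch
    return tf.keras.Model([optical, labeled], patch_logits)
```

Likewise, Equations 3 and 4 can be written roughly as follows; computing the adversarial terms as sigmoid cross-entropy from logits and the value of LAMBDA are implementation assumptions on our part.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
LAMBDA = 100  # assumed weight of the L1 term; the exact value of lambda is not stated here

def generator_loss(disc_generated_output, generated_image, labeled_image):
    """Adversarial loss plus lambda times the L1 (MAE) loss, as in Equation 3."""
    adversarial = bce(tf.ones_like(disc_generated_output), disc_generated_output)
    l1 = tf.reduce_mean(tf.abs(labeled_image - generated_image))
    return adversarial + LAMBDA * l1

def discriminator_loss(disc_real_output, disc_generated_output):
    """Real pairs should be scored as 1 and generated pairs as 0 (Equation 4)."""
    real_loss = bce(tf.ones_like(disc_real_output), disc_real_output)
    generated_loss = bce(tf.zeros_like(disc_generated_output), disc_generated_output)
    return real_loss + generated_loss
```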

Prediction and performance evaluation

A set of 100 preprocessed optical images and their corresponding manually labeled images was fed to the model for training. Before feeding the images to the model, we resized them to 512 × 512 pixels to ensure compatibility with the model architecture. Additionally, we employed augmentation techniques, such as flipping and rotation, to further enhance the diversity and robustness of the training data set. We chose the adaptive moment estimation (Adam) optimizer to train the model.

Materials and methods

Data acquisition

We utilized an optical microscope together with PL and Raman spectroscopy to characterize the layers of the deposited TMDs. Specific areas containing crystals with different layer numbers were captured using the optical microscope. Raman and PL spectra were obtained with a 532-nm excitation laser at a power of <500 μW to avoid damage to the samples. A spectral grating with 1800 lines/mm was used for both measurements. We used a 100× objective lens to capture the images.14

Data processing

The mask of the optical image is calculated by

$$\text{mask} = \text{uppermask} \times \text{lowermask} \times \text{valuemask},$$
(7)

where the values of uppermask and lowermask depend on the hue component, and valuemask depends on the value component of the HSV image. The resultant mask was converted to grayscale to create a binary image, in which pixels with values greater than 0 were set to true and all remaining pixels to false. Each connected component was then assigned a unique label, and the resulting labeled image was displayed using a colormap to visualize the different flakes present in the image.
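A sketch of Equation 7 and the subsequent connected-component labeling, assuming OpenCV's HSV channel ordering and placeholder hue/value thresholds, might look like this:

```python
import cv2
import numpy as np
import matplotlib.pyplot as plt

def label_flakes(optical_bgr, hue_low=90, hue_high=130, value_min=40):
    """Combine hue and value masks (Equation 7) and label connected flakes.

    hue_low, hue_high, and value_min are placeholder thresholds for this sketch.
    """
    hsv = cv2.cvtColor(optical_bgr, cv2.COLOR_BGR2HSV)
    hue, value = hsv[:, :, 0], hsv[:, :, 2]

    upper_mask = (hue <= hue_high).astype(np.uint8)
    lower_mask = (hue >= hue_low).astype(np.uint8)
    value_mask = (value >= value_min).astype(np.uint8)
    mask = upper_mask * lower_mask * value_mask      # Equation 7

    # Binarize and assign a unique label to each connected component (flake).
    binary = (mask > 0).astype(np.uint8)
    num_labels, labels = cv2.connectedComponents(binary)

    # Visualize the labeled flakes with a colormap.
    plt.imshow(labels, cmap="nipy_spectral")
    plt.title(f"{num_labels - 1} connected regions detected")
    plt.show()
    return labels
```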

Model setup and training

The pix2pix model was implemented using the TensorFlow open-source DL package. The training data set comprised 35 optical images of WS2 and 65 images of MoS2, each paired with its corresponding manually labeled image. Training was performed on a system with an NVIDIA GeForce GTX 1080 graphics card with CUDA version 10.1. Training for 200 epochs on the 100 image pairs took 3 h and was stopped once the difference between the actual labeled image and the generated image became negligible. Based on this observation, we chose 200 as the optimized number of epochs.
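Assuming the generator, discriminator, and loss functions sketched earlier are in scope, a minimal training loop might look like the following; the Adam hyperparameters and the commented data pipeline are assumptions based on common pix2pix practice rather than settings reported above.

```python
import tensorflow as tf

generator = build_generator()          # U-Net generator sketched earlier
discriminator = build_discriminator()  # PatchGAN discriminator sketched earlier
gen_optimizer = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)   # assumed settings
disc_optimizer = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)

@tf.function
def train_step(optical_image, labeled_image):
    """One adversarial update of the generator and the discriminator."""
    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        generated = generator(optical_image, training=True)
        disc_real = discriminator([optical_image, labeled_image], training=True)
        disc_fake = discriminator([optical_image, generated], training=True)
        g_loss = generator_loss(disc_fake, generated, labeled_image)
        d_loss = discriminator_loss(disc_real, disc_fake)
    gen_grads = gen_tape.gradient(g_loss, generator.trainable_variables)
    disc_grads = disc_tape.gradient(d_loss, discriminator.trainable_variables)
    gen_optimizer.apply_gradients(zip(gen_grads, generator.trainable_variables))
    disc_optimizer.apply_gradients(zip(disc_grads, discriminator.trainable_variables))
    return g_loss, d_loss

# Training: 200 epochs over the 100 paired (optical, labeled) images,
# resized to 512 x 512 and augmented by flipping and rotation beforehand.
# for epoch in range(200):
#     for optical_image, labeled_image in paired_dataset:
#         g_loss, d_loss = train_step(optical_image, labeled_image)
```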