Technical note: colab_zirc_dims: a Google Colab-compatible toolset for automated and semi-automated measurement of mineral grains in laser ablation–inductively coupled plasma–mass spectrometry images using deep learning models
Michael C. Sitar
Ryan J. Leary
Interactive discussion
Status: closed
RC1: 'Comment on gchron-2022-12', Simon Nachtergaele, 01 Jun 2022
Review of “Technical Note: colab_zirc_dims: a Google-Colab-based toolset for automated and semi-automated measurement of mineral grains in LA-ICP-MS images using deep learning models” by Sitar and Leary, submitted to Geochronology
As a junior researcher interested in the combination of geology and artificial intelligence, I must admit that I read this article with great pleasure. The work of MC Sitar and RJ Leary is highly appreciated, and there is still a lot of work to do in this field of research. It seems like a rapidly expanding field of research and a hot topic. I learned a lot, but I also have many suggestions for potential improvement, although the manuscript already looks very good. Some of my suggestions are major comments (MajC) and some are minor comments (MinC).
Major comments:
- MajC1: From experience with LA-ICP-MS I know that the laser ablation system unfortunately only takes images using reflected light. In my opinion, many of the segmentation errors are actually caused by the reflected light images, which are too sensitive to scratches or cracked grains. This paper finds a solution to a problem that is more or less induced by using (low-quality) reflected light images. However, it would be interesting to use images taken without reflected light: either (option A) transmitted light from an optical light microscope, or (option B) SEM images using a CL detector. These would give you fewer segmentation problems and also textural (or, with CL, even chemical zoning) information.
- MajC2: Line 154: Figure 1a: in this figure it is quite obvious that different minerals (each with a different reflectivity) are shown. My best guess would be that there is some apatite present, and this “zircon generalization” troubles me a lot.
- MajC3: Line 154: Figure 1c: explain why the red sticks extend out of the mineral. The segmentation seems quite good, but the red sticks are longer. So, I cannot judge if there is a problem with the segmentation; maybe the problem lies in the calculation of the radius or perhaps in the entire image calibration (!). Also, for figure 1 it would be appreciated if many more results were shown, for example of images that include air bubbles or cracked grains.
- MajC4: The resolution of the Youtube tutorial video is (for some reason) not sufficient and needs to be improved.
- MajC5: the paper would definitely benefit from an additional application (such as done by AnalyZr (Scharf et al., 2022)) that illustrates the strength and usefulness of the developed method. An additional data visualisation plot in the notebook where you can compare the zircon U-Pb age with the computed grain size metrics would be amazing (see figure 12 in Scharf et al. (2022)).
Minor comments:
- MinC1: Line 32: name the “published studies” that you mentioned.
- MinC2: Line 113: which GPUs could you use in Google Colab? K80? T4? P100? It is an interesting detail. And please mention the training time for a particular GPU and network as well.
- MinC3: Line 148: you jumped from Swin to Swin-T in some lines without explaining why. Some literature research taught me that this Swin-T variant of Swin is about 4 times smaller and that its complexity is about the same as the ResNet50 network architecture. These light versions are often called “tiny” variants and are a lot quicker than the original “full-option” network. This should be mentioned in order to let the reader realize that these network architectures are very large and that you’re trying to solve that problem by using a tiny variant.
- MinC4: Line 188: why did you not try resizing for the largest images (1280p on 1024p)? It would save a lot of computing time.
- MinC5: Line 178: following the book of Russell and Norvig (2002) (4th edition, page 832, section 22.7.2), it seems not so smart to start training models from scratch. You would need a lot more training time or a very small network in order to overcome this trouble. So I think this is not surprising and I do not think the “start-from-scratch” models are of much added value.
- MinC6: Line 192-193: please use “heavy mineral” instead of “zircon” because this “zircon” class is incorporating apatites and monazites as well. Perhaps also change the name of the Colab notebook in that case.
- MinC7: Line 203: describe the learning rate in more detail in the text.
- MinC8: Figure 3: why not use cropping and scaling as well for data augmentation?
- MinC9: Figure 4: what were your criteria to prevent over-fitting your model to the data? Did you just pick the most performant network?
- MinC10: Table 3: line 326: Otsu thresholding has a failure rate of 0.00%. This is contradictory to what you state in line 326 and, on top of that, failure rate is never explained in the text.
- MinC11: Table 3 below: please provide a metric to compare both GPUs against each other. A logical question that comes up in a reader’s mind would be: “which one is the best?”
- MinC12: Figure 6: in these figures I need to see a 1:1 line, which indicates the ideal ratio of an automated measurement to a manual measurement. On the horizontal axis, you need to add “manual” before “measurement (µm)” in each of the two figures.
- MinC13: Line 335: explain “negative skew” in plain language.
- MinC14: Figure 7: mention in the text that this “grain merging” problem can perhaps be solved with the NMS threshold that you mentioned earlier in line 133.
- MinC15: caption Figure 8: the median value is displayed by the black horizontal lines inside the boxes (!).
- MinC16: please add the scale on the upper panel of the figure, instead of the lower panel.
- MinC17: Lines 77-94: it is indeed important to emphasize the history of this method development and to clearly give the (dis)advantages of both methods.
Once these comments are incorporated (where possible), I would certainly want to see this published in Geochronology.
Simon Nachtergaele
Department of Geology, Ghent University, Belgium
References:
Russell, S., Norvig, P., 2002. Artificial Intelligence: A Modern Approach.
Scharf, T., Kirkland, C.L., Daggitt, M.L., Barham, M., Puzyrev, V., 2022. AnalyZr: A Python application for zircon grain image segmentation and shape analysis. Comput. Geosci. 162, 105057. https://doi.org/10.1016/j.cageo.2022.105057
Citation: https://doi.org/10.5194/gchron-2022-12-RC1
AC1: 'Reply on RC1', Michael Sitar, 22 Jul 2022
We would like to thank S. Nachtergaele for his thoughtful and extremely helpful comments on our manuscript and appreciate his willingness to apply his experience from the nascent intersection between AI and geology to reviewing our work. While we cannot incorporate all of the suggested changes and additions, the comments do identify many areas where changes and/or elaboration will improve the manuscript, and we have responded to each comment (italicized) in the text below. We use “revised manuscript” below to refer to a revised copy of our original manuscript that we will submit if invited to do so by the editors.
Major comments:
- MajC1: From experience with LA-ICP-MS I know that the laser ablation system unfortunately only takes images using reflected light. In my opinion, many of the segmentation errors are actually caused by the reflected light images, which are too sensitive to scratches or cracked grains. This paper finds a solution to a problem that is more or less induced by using (low-quality) reflected light images. However, it would be interesting to use images taken without reflected light: either (option A) transmitted light from an optical light microscope, or (option B) SEM images using a CL detector. These would give you fewer segmentation problems and also textural (or, with CL, even chemical zoning) information.
We fully agree that the use of other grain imaging techniques (e.g., transmitted light and CL) could mitigate some of the problems that our method struggles with and provide important additional data for analysis of the relationships between age and grain size and shape. We also agree that support for said image types would be valuable here or in any automated grain measurement/characterization workflow developed in the future, though AnalyZr (Scharf et al., 2022) notably already allows for thresholding-based segmentation of grains in transmitted light images. We are, however, unable to implement transmitted light image segmentation in colab_zirc_dims at present for lack of applicable training data, and we feel that implementation of CL image grain/zone segmentation is beyond the scope of this technical note.
The main objective of colab_zirc_dims is to allow semi-automated, RCNN-based measurement of grains from images captured directly by LA-ICP-MS systems. We have consequently trained our models only on such reflected light images and would hope to do the same if training new models to segment grains in transmitted light images. Though many LA-ICP-MS systems (e.g., at UCSB and ALC) are in fact capable of capturing transmitted light images, this is (to the best of our knowledge) rarely if ever done in practice due to the superiority of reflected light for identifying suitable, exposed grain surfaces for ablation. This was the case during collection of our grain-image-age datasets, and as a result we currently lack a corpus of appropriate transmitted light images to train/retrain new models and implement support for transmitted light images in colab_zirc_dims. Due to the rarity of transmitted light image capture during LA-ICP-MS analysis, we do not feel that the present lack of support for transmitted light image segmentation in colab_zirc_dims significantly detracts from its utility, but we do hope to implement said support should we acquire or gain access to a suitable training image dataset in the future.
Given sufficient spatial resolution, algorithmic segmentation of CL images of detrital mineral grains might allow for more accurate classification of grain sizes and shapes. An algorithm additionally capable of accurately identifying, segmenting, and/or characterizing intra-grain zoning from CL images would enable rapid acquisition of data previously obtained qualitatively through human observation. We completely agree with the reviewer that this would be fantastic, both in and of itself and because it could be incorporated in an end-to-end, fully automated spot picking algorithm (as speculatively mentioned at line 390 of our pre-print manuscript) to allow for parameterizable, informed intra-grain spot localization. The mineral zone segmentation process developed by Sheldrake and Higgins (2021) may be adaptable to CL images but is (to the best of our knowledge) untested in this use-case. If this (i.e., Sheldrake and Higgins, 2021) method is not applicable, solving this problem would likely require acquisition of new CL image datasets and development of new processing workflows and/or models (deep-learning-based or otherwise). As such, and because CL images are not acquired directly during LA-ICP-MS analyses, we feel that developing a methodology for CL image grain segmentation and/or characterization is beyond the scope of the colab_zirc_dims package and our technical note.
- MajC2: Line 154: Figure 1a: in this figure it is quite obvious that different minerals (each with a different reflectivity) are shown. My best guess would be that there is some apatite present, and this “zircon generalization” troubles me a lot.
The images certainly do include (probable) apatite grains labelled as zircon, which is expected because we only trained our models to identify and segment heavy mineral grains and not to distinguish between minerals. We attempted to explain our reasoning for doing so at Line 191 – our models that are trained to segment all grains from images seem to be fairly robust to variations in image quality, brightness, and exposure, but a model also trained to distinguish mineral phases might confuse different minerals given new (e.g., much brighter) images. We agree that our labelling of all grains as “zircon” in <v1.0.9 colab_zirc_dims visualizations and in our initial manuscript obscures what the algorithm is actually doing (segmenting all heavy mineral grains).
To address the reviewer’s comment, we have changed the code and processing notebooks for the v1.0.9 colab_zirc_dims release to clarify that our code and models are not distinguishing zircon grains from other heavy mineral grains. Segmented grains in visualizations are now labelled with “grain” instead of with “zircon”, mosaic_info.csv headers now include “Max_grain_size” instead of “Max_zircon_size”, and measured grain dimensions are now saved to “grain_dimensions” rather than “zircon_dimensions” folders and .csv files. Explanatory text in our processing notebooks has also been updated to reflect these changes.
We have concurrently made the following changes to our revised manuscript:
- Changed the Table 2, column 3 header to “Grains” from “Zircon grains”
- Revised lines 191-193 to: “Some training and validation images contain likely detrital apatite grains in addition to zircon, and we segmented all visible mineral grains into a single class to avoid harming our models’ generalization abilities in the presence of varying image exposure and brightness levels.”
- Updated colab_zirc_dims file and parameter names throughout the manuscript to be consistent with v1.0.9
- Revised Figure 2 (see attached) to reflect new segmentation visualization labels
- MajC3: Line 154: Figure 1c: explain why the red sticks extend out of the mineral. The segmentation seems quite good, but the red sticks are longer. So, I cannot judge if there is a problem with the segmentation; maybe the problem lies in the calculation of the radius or perhaps in the entire image calibration (!). Also, for figure 1 it would be appreciated if many more results were shown, for example of images that include air bubbles or cracked grains.
We are glad that the reviewer noticed the apparent mismatch between the lengths of plotted axes (red sticks) in Fig. 2c because it points to an algorithmic detail that we failed to explain in the manuscript and neglected to note as a likely source of error in our test results. We have no reason to believe that the axial measurements plotted on colab_zirc_dims verification images (e.g., Fig. 2c) are mis-scaled or otherwise decoupled from their calculated values, and so we can confidently attribute discrepancies between said axes and the actual grain masks to their calculation algorithm. This algorithm (implemented in the scikit-image .measure module; van der Walt et al., 2014) calculates axes using the normalized second-order central moments of grain masks, and the resulting axes consequently fit elliptical grains better than rectangular ones (see attached, revised Fig. 2c).
These metrics have been used as representative measurements of grain length and width in analysis of detrital zircon grain size/shape vs. age by other researchers (i.e., Scharf et al., 2022). In addition to producing moment-based per-grain axial measurements, though, the AnalyZr software (Scharf et al., 2022) does additionally output major and minor “Feret diameters”; in their implementation these are respectively the Feret diameter and the width of a minimum-area circumscribing rectangle for a grain mask. We certainly see the value in providing measurements without any built-in error for rectangular grains, and measurement results from colab_zirc_dims processing in version v1.0.9 now include additional long and short axis “rectangular diameters”. These diameters respectively correspond to the long and short axial lengths of the minimum-area circumscribing rectangle (i.e., Fig. 2C, revised) for a grain mask and are calculated using the OpenCV minAreaRect function (Bradski, 2000). We opt to use minimum-area rectangle measurements exclusively here rather than take the same approach as Scharf et al. (2022) in order to maintain orthogonality between reported axial measurements.
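For illustration, the relationship between the two kinds of axial measurement can be sketched in a few lines of Python using the scikit-image and OpenCV functions named above. This is a simplified example rather than the colab_zirc_dims implementation; the function name and the scale handling are our own for this sketch, and a single-grain mask is assumed.

```python
import numpy as np
import cv2
from skimage import measure

def axial_measurements(mask, scale_um_per_px=1.0):
    """Sketch: moment-based and minimum-area-rectangle axial lengths
    for a single boolean grain mask (assumes the mask is non-empty)."""
    # Moment-based axes: lengths of the ellipse with the same normalized
    # second central moments as the mask region (skimage.measure.regionprops).
    props = measure.regionprops(mask.astype(np.uint8))[0]
    moment_long = props.major_axis_length * scale_um_per_px
    moment_short = props.minor_axis_length * scale_um_per_px

    # Rectangle-based axes: side lengths of the minimum-area circumscribing
    # rectangle fitted to the mask outline with OpenCV's minAreaRect.
    contours, _ = cv2.findContours(mask.astype(np.uint8), cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    points = np.concatenate(contours)
    (_, _), (w, h), _ = cv2.minAreaRect(points)
    rect_long = max(w, h) * scale_um_per_px
    rect_short = min(w, h) * scale_um_per_px
    return moment_long, moment_short, rect_long, rect_short
```

For an elliptical mask the two long-axis values will nearly coincide, whereas for a rectangular mask the moment-based ellipse over- or under-shoots the grain outline while the circumscribing rectangle fits it exactly.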
The reviewer may wonder whether the new measurement algorithm is more accurate. This was a concern for us too, and one with potentially major implications for the utility of our (moment-based) error evaluation data. We conducted an additional evaluation of error on our full test dataset using segmentation masks generated by model M-R101-C and minimum-area-rectangle-based axial measurements. Our evaluation results are attached for the reviewer’s perusal – see the “full_circumscribing_rectangle_measure” vs. “full_moment_measure” .xlsx files and “rectangular_measurement_scatter” vs. “fig6_revised” .png plots.
Using rectangular measurements instead of moment-based ones very marginally decreases evaluated average absolute measurement error along grain long axes (by 0.038 μm / 0.08%) and marginally increases it along grain short axes (by 0.382 μm / 1.03%). Both differences are deep in the sub-pixel range for our dataset images. A comparison of the error scatter plots (i.e., points near the 1:1 line, which presumably have accurate segmentation masks) suggests some more substantial differences. The moment-based axial measurement algorithm does seem to be the source of fairly consistent, ~low-level (2-7 μm) positive measurement error for longer (>100 μm) accurately segmented grains; these grains are probably more likely to have ~rectangular masks and so be poorly fit by an ellipse. The plots also suggest that the rectangular measurement algorithm introduces some (generally positive) error when calculating grains’ short axis lengths, possibly in part because minimum-area rectangles may be completely misaligned from what a human would interpret as axial orientation for grains with low aspect ratios.
Our assessment of our results here is that accuracy on moment-based calculations is still the better metric for evaluating our models on the full test dataset. A major caveat is that the axes of minimum-area circumscribing rectangles may be better measurements for grains that have high aspect ratios; these grains are less likely to be fit with poorly oriented rectangles and are themselves more likely to be ~rectangular. We have added the following text to our draft revised manuscript to correct our omissions and to explain the new measurements and the cases where they are indicated:
In section 3.3:
“Major and minor axis lengths are calculated from the moments of the grain mask image and reported axes thus correspond to “the length of the… axis of the ellipse that has the same normalized second central moments as the region” (van der Walt et al., 2014). Calculated axes will consequently fit exactly to perfectly elliptical and circular grain masks but may be more approximate in the cases of rectangular and irregularly shaped grains (e.g., Fig. 2c). Rectangular diameter measurements correspond to the long and short axes of the minimum-area circumscribing rectangle that can be fitted to a grain mask using the OpenCV minAreaRect function (Bradski, 2000). Minimum-area rectangles will exactly fit to rectangular grain masks, but in the case of more equant grains may be grossly misaligned from the grain axes that a human researcher would interpret. The two types of calculated axial measurement parameters each have drawbacks, but we suggest that researchers who want to use both define an aspect ratio (i.e., major axis length divided by minor axis length) threshold (e.g., 2.0) above which to treat rectangle-based measurements as representative and below which to treat moment-based measurements as representative.”
In section 5:
“Moment-based axial measurements rather than rectangle-based measurements were used for the purposes of these evaluations in order to avoid evaluating measurements from misaligned circumscribing rectangles (i.e., Sect. 3.3).”
In section 5.2:
“Another source for low-magnitude error throughout the test dataset is the moment-based axial length calculation algorithm (i.e., Sect. 3.3). This algorithm may slightly overestimate the lengths of grains’ long axes depending on the shape of the mask; such errors will likely be negligible in the case of circular and elliptical grains but may be more pronounced for rectangular grains. Axial length calculation error is the likely reason that a significant population of longer (>100 μm) grains that presumably have accurate segmentation masks plot several (~2-8) μm above the 1:1 line in Figure 6A.”
In section 5.3:
“Differences between axial length measurements done by hand and those produced through moment-based calculation (i.e., Sect. 3.3) probably contribute some level of baseline error for both manual re-segmentation and automated segmentation (Table 4), but we are unable to conclusively quantify this error.”
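The aspect-ratio rule suggested in the Sect. 3.3 addition above could be applied in practice with a small helper like the hypothetical one below (not part of colab_zirc_dims); the 2.0 threshold is simply the example value from the quoted text.

```python
def representative_axes(moment_long, moment_short, rect_long, rect_short,
                        aspect_ratio_threshold=2.0):
    """Hypothetical selection rule: prefer rectangle-based axes for elongate
    grains and moment-based axes for more equant grains."""
    aspect_ratio = moment_long / moment_short
    if aspect_ratio >= aspect_ratio_threshold:
        return rect_long, rect_short
    return moment_long, moment_short
```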
We have also revised Figure 2c (see attached) to show the ellipse corresponding to the normalized second central moments of the example grain mask as well as the minimum-area circumscribing rectangle.
With regards to adding more examples of segmentations using different algorithms: our revised version of Figure 1 shows (in component D) examples of image artefacts with some problematic and non-problematic Otsu thresholding segmentation results. Results of colab_zirc_dims (M-R101-C) segmentation of the same images are shown in our revised version of figure 2 (Fig. 2d).
- MajC4: The resolution of the Youtube tutorial video is (for some reason) not sufficient and needs to be improved.
The reviewer is correct in noting that the Youtube video resolution is fairly low; this is unfortunately the same resolution that it was recorded at. We plan to record a new, higher resolution video tutorial for colab_zirc_dims v1.0.9 and if we are invited to submit a revised manuscript will include this video in its assets.
- MajC5: the paper would definitely benefit from an additional application (such as done by AnalyZr (Scharf et al., 2022)) that illustrates the strength and usefulness of the developed method. An additional data visualisation plot in the notebook where you can compare the zircon U-Pb age with the computed grain size metrics would be amazing (see figure 12 in Scharf et al. (2022)).
We agree that it is important to place the current manuscript in a provenance analysis context. However, because the manually measured grain-dimension data to which our current automated dataset is compared has been presented and interpreted in another paper (Leary et al., in press), we refer readers to that publication and provide only a summary of the conclusions in the current technical note. To add this context, we have added the following text to the current technical note (added in the new section 7):
“The ability to generate grain-dimension data for large detrital datasets has major implications for improving the robustness of provenance interpretations and for generating new provenance interpretations. Because few large (n > several thousand) detrital geochronology studies include grain-dimensional data (cf. Lawrence et al., 2011; Leary et al., 2020a, b; Cantine et al., 2021; Scharf et al., 2022; Leary et al., in press), much of the interpretive power of large, geochronologic-grain-dimension datasets remains to be discovered. However, one recent example of the increased interpretive power of such an approach is presented in Leary et al. (in press). That study used zircon grain-dimension data to reinterpret the provenance and transport mechanism of 500-800 Ma zircons within the Pennsylvanian-Permian Ancestral Rocky Mountains system in southwest Laurentia. Based on the arrival of dominantly small (< 60 µm), 500-800 Ma zircons in that study area at the Pennsylvanian-Permian boundary, Leary et al. (in press) interpreted these grains as having been transported into the study area principally by wind and reinterpreted their provenance as Gondwanan (as opposed to Arctic and/or northern Appalachian as previously interpreted by Leary et al., 2020b). Our hope is that the increased ability to generate large grain-dimension datasets from toolsets such as those presented here and by Scharf et al. (2022) will improve future provenance interpretations, specifically as they relate to grain transport processes (e.g. Lawrence et al., 2011; Ibañez-Mejia et al., 2018; Leary et al., 2020a, b).”
Because we hope to limit the scope of the colab_zirc_dims package to measurement-related functions and utilities, we do not plan to implement parsing/visualization/interpretation functions involving actual age data in the colab_zirc_dims code or notebooks. That said, the reviewer has identified a deficiency in the functionality of colab_zirc_dims for exploratory analysis (as of v1.0.8): users can collect measurements rapidly but have no way of quickly viewing or evaluating dataset-scale measurement results. To remedy this, we have added a new exploratory measurement data visualization module (colab_zirc_dims.expl_vis) to the v1.0.9 colab_zirc_dims code and notebooks. While this module strictly deals with colab_zirc_dims measurement results, it does allow interactive and parameterizable loading, filtering (e.g., such that shots on standards grains are ignored), and plot-based (i.e., bar-whisker, histogram, or X-Y scatter) visualization of colab_zirc_dims measurement datasets within the Colab notebooks. We hope that the addition of this module and its constituent features at least partially satisfies the reviewer’s suggestion.
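As a rough indication of the kind of dataset-scale exploration this enables, the short script below loads a saved measurement table and plots a grain-size histogram using generic pandas/matplotlib calls. It does not use the expl_vis API itself, and the file path, column names, and standard-grain labels are assumed for illustration only.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Generic sketch, not the colab_zirc_dims.expl_vis API; the path, column
# names, and standard identifiers below are hypothetical examples.
df = pd.read_csv("outputs/grain_dimensions/SAMPLE-1_grain_dimensions.csv")
df = df[~df["Analysis"].astype(str).str.contains("FC5z|R33")]  # drop shots on standards
ax = df["Long axis length (um)"].plot.hist(bins=30)
ax.set_xlabel("Long axis length (µm)")
plt.show()
```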
Minor comments:
- MinC1: Line 32: name the “published studies” that you mentioned.
We have added these citations, and the text now reads:
“A principal challenge in collecting such data has been that few automated approaches have been published (e.g. Scharf et al., 2022), and the time required to manually collect grain dimensions from large detrital datasets is a substantial barrier to widespread application of these methods (e.g. Leary et al., 2020a).”
- MinC2: Line 113: which GPUs could you use in Google Colab? K80? T4? P100? It is an interesting detail. And please mention the training time for a particular GPU and network as well.
To provide readers with this important information, we have:
Revised line 57 of the manuscript to read:
“Google Colab is a free service that allows users to run Jupyter notebooks (i.e., Kluyver et al., 2016) on cloud-based virtual machines with variably high-end GPUs from the NVIDIA Tesla series (i.e., K80, T4, P100, and V100) that are allocated based on availability.”
Added the following sentence to section 3.2.3:
“Training a Mask RCNN ResNet-FPN model from a pre-trained ResNet-101 base (i.e., as in M-R101-C) for 11,000 iterations on a Google Colab virtual machine equipped with an NVIDIA Tesla P100 GPU using the provided notebook takes about 1.2 hours.”
- MinC3: Line 148: you jumped from Swin to Swin-T in some lines without explaining why. Some literature research taught me that this Swin-T variant of Swin is about 4 times smaller and that its complexity is about the same as the ResNet50 network architecture. These light versions are often called “tiny” variants and are a lot quicker than the original “full-option” network. This should be mentioned in order to let the reader realize that these network architectures are very large and that you’re trying to solve that problem by using a tiny variant.
We agree that these are important details that should be conveyed to the reader, and have added the following text to section 3.2.1 of our revised manuscript:
“As in the case of ResNet, different and variably complex variants of the Swin network architecture exist (Liu et al., 2021). The largest Swin network variant, Swin-large (Swin-L), has 197 million trainable parameters and is both computationally expensive to train and prohibitively large for application in a Google Colab virtual machine environment (Liu et al., 2021). The smallest Swin network variant, Swin-tiny (Swin-T), however, has a much more manageable 29 million trainable parameters (comparable to a ResNet-50 network; Liu et al., 2021) and is consequently more appropriate for Colab-based training and implementation for relatively fast image segmentation.”
- MinC4: Line 188: why did you not try resizing for the largest images (1280p on 1024p)? It would save a lot of computing time.
We did resize the images during training to reduce compute time and (in the cases of our Swin-T and Centermask2 models) as a random augmentation method. Though we mention this in the caption for Fig. 3, we realize that this information, along with information about minor cropping augmentation performed during training (which we originally failed to note), would be more appropriately situated in our figure and, for resizing, within the manuscript text. As such, we have revised figure 3 (attached) and added the following text to section 3.2.3 of our revised manuscript:
“As per the default training settings for Detectron2 implementation, we uniformly resized training image inputs to our Mask RCNN ResNet-FPN models such that their shortest edges were 800 pixels in length (Detectron2). For our Mask RCNN Swin-T-FPN and Centermask2 models, we randomly resized the short edges of training image inputs to between 400 and 800 pixels as an additional augmentation.”
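For concreteness, the Detectron2-style resizing and cropping augmentations described above (see also the MinC8 reply below) can be expressed roughly as follows. This is an illustrative sketch rather than our exact training code, and the max_size value is an assumed default.

```python
from detectron2.data import transforms as T

# Fixed short-edge resize to 800 px (Mask RCNN ResNet-FPN models, Detectron2 default).
resize_fixed = T.ResizeShortestEdge(short_edge_length=800, max_size=1333)

# Random short-edge resize between 400 and 800 px (Swin-T and Centermask2 models),
# used as an additional training augmentation.
resize_random = T.ResizeShortestEdge(short_edge_length=(400, 800), max_size=1333,
                                     sample_style="range")

# Random crop to 0.95 x the original image dimensions (all models).
crop = T.RandomCrop("relative", (0.95, 0.95))
```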
- MinC5: Line 178: following the book of Russell and Norvig (2002) (4th edition, page 832, section 22.7.2), it seems not so smart to start training models from scratch. You would need a lot more training time or a very small network in order to overcome this trouble. So I think this is not surprising and I do not think the “start-from-scratch” models are of much added value.
We completely agree that training from scratch is not well suited to our small training dataset and relatively large models, and we did not have any great expectations for these models’ performance. We have added the following text to section 3.2.3 of our revised manuscript to clarify our intent:
“In some cases, however, randomly initialized models can match the performance of those initialized from pretrained weights during training on non-augmented datasets that are relatively small, albeit much larger than ours (He et al., 2018). When pretraining datasets are sufficiently different from target data (e.g., natural image versus medical CT), transfer learning can also be of limited utility (Karimi et al., 2021).”
And changed the last sentence of section 5.1 to clarify what we learned:
“It is clear that our training dataset was not large enough and the task of segmenting grains from reflected light images not distinct enough from natural image segmentation (e.g., in MS COCO) for random initialization to be useful (i.e., He et al., 2018; Karimi et al., 2021), though image augmentation did notably push the test dataset accuracies of our randomly initialized models significantly closer to those of pre-trained models.”
We do think that our results here are worth reporting if only to establish the usefulness of transfer learning for researchers working on very similar problems in the future.
- MinC6: Line 192-193: please use “heavy mineral” instead of “zircon” because this “zircon” class is incorporating apatites and monazites as well. Perhaps also change the name of the Colab notebook in that case.
As noted in response to MajC2, we have changed this to ‘grain’ in our revised manuscript, code, and figures.
- MinC7: Line 203: describe the learning rate in more detail in the text.
We have added the following text to section 3.2.3 of our revised manuscript to expand on our discussion of learning rate:
“We trained each of our models in Google Colab using Detectron2 for at least 11,000 total two-image iterations with model-dependent learning rate schedules, all of which incorporated a 1,000 iteration warmup period and stepped 50% learning rate reductions at variable (generally 1,000 iteration) intervals starting at 1,500 (for M-ST-C and C-V-C) or 2,000 iterations (for all other models). Peak learning rates of 0.02, 0.0005, and 0.00025 were respectively used for our randomly initialized models, for M-ST-C, and for our Mask RCNN ResNet-FPN models and C-V-C; these rates were modified empirically from those included in default training configurations (e.g., Lee and Park, 2020; Ye et al., 2021; Detectron2) based on training and validation curves in trial training sessions.”
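In Detectron2 configuration terms, a schedule of the kind described above looks roughly like the sketch below. The values shown correspond to the Mask RCNN ResNet-FPN / C-V-C case and the step placement is simplified, so this should be read as an illustration rather than our exact configuration.

```python
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.SOLVER.IMS_PER_BATCH = 2     # two-image iterations
cfg.SOLVER.MAX_ITER = 11000      # at least 11,000 iterations
cfg.SOLVER.BASE_LR = 0.00025     # peak LR for the ResNet-FPN models and C-V-C
cfg.SOLVER.WARMUP_ITERS = 1000   # 1,000-iteration warmup period
cfg.SOLVER.GAMMA = 0.5           # stepped 50% learning rate reductions
cfg.SOLVER.STEPS = (2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000)
```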
- MinC8: Figure 3: why not use cropping and scaling as well for data augmentation?
As we mentioned in our response to MinC4, we did use scaling (dependent on model) and cropping (random, to 0.95 X original image size, for all models). Both are noted in the revised text and in our revision to Fig. 3.
- MinC9: Figure 4: what were your criteria to prevent over-fitting your model to the data? Did you just pick the most performant network?
Roughly, yes. We have revised figure 4 to include average absolute long axis error results from running each saved model checkpoint against the full Leary et al. (in press) dataset (see attached). We were unable to fit test results for our model trained without image augmentation (M-R50-S-NA) on our revised version of Fig. 4, but for the reviewer’s benefit we have attached a version of the figure that does plot these data (which notably do suggest overfitting after 4000 iterations) as “revised fig 4 with unaugmented model test results.png”. We have also revised the text at the end of section 5 of our revised manuscript to explain:
“We picked “best” model checkpoints (Table 1) at training iterations beyond 3000 where models achieved apparent local maxima in validation accuracies (i.e., Fig. 4) and local minima or plateaus in various measurement error metrics (e.g., failure rate and absolute long axis error; Table 3; Fig. 4) when evaluated on the full Leary et al. (in press) test dataset. We set our threshold (greater than 3000 iterations) for checkpoint picking based on qualitative observations that grain masks for all models appeared to be more “blobby” (i.e., less refined to actual grain areas) at lower training iterations, though it is worth noting that we fail to see conclusive evidence for this relationship in training mask loss, validation mask loss, or test accuracy curves (Fig. 4). Changes in most evaluation accuracy metrics (roughly represented by average absolute long axis error in Fig. 4) for the models trained with image augmentations were largely stochastic after ~2000 (for the pretrained models) to ~3000 iterations (for the randomly initialized models; Fig. 4). This suggests a lack of meaningful overfitting (possibly attributable to a combination of learning rate drawdown and training image augmentation) in relation to our test dataset and probable negligible negative effects on model generalization abilities from our selecting models at relatively high training iterations.”
- MinC10: Table 3: line 326: Otsu thresholding has a failure rate of 0.00%. This is contradictory to what you state in line 326 and, on top of that, failure rate is never explained in the text.
We are grateful to the reviewer for pointing this out. We did not originally believe the 0.00% failure rate was significant in and of itself because Otsu thresholding is inherently indiscriminate and thus may segment non-grain objects/artefacts as ‘grains’ and so erroneously pass our central grain identification algorithm. In verifying the 0.00% failure rate to respond to this correction, however, we did find and correct a bug in our code (previously fixed, but re-introduced during refactoring) that resulted in incomplete removal of background in our Otsu segmentation function. After fixing the bug and re-running evaluation on the test dataset, our Otsu segmentation algorithm has a ‘fail rate’ of 0.02% and significantly (i.e., several percent) better error evaluation metrics. These new data are incorporated into our revision of Table 3 (attached); we regret the error.
We have also added the following text to section 3.3 of our revised manuscript to explain our “failure rate” metric:
“To avoid erroneously returning significantly off-central (i.e., non-target) grains, the algorithm is considered to have “failed” if it cannot find a grain mask after this search, and null values are returned for the spot instead of grain size and shape parameters.”
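To make the “failure” criterion concrete, a simplified sketch of an Otsu-based central-grain search is given below. It is an assumption-laden illustration (the centre-distance tolerance and function name are arbitrary), not the colab_zirc_dims function that was debugged.

```python
import numpy as np
from skimage import filters, measure

def central_grain_or_none(image_gray, max_offset_frac=0.25):
    """Simplified illustration: Otsu-threshold a reflected light shot image and
    return the mask of the grain nearest the image centre, or None ("failure")
    if no segmented region lies acceptably close to the centre."""
    binary = image_gray > filters.threshold_otsu(image_gray)  # grains brighter than epoxy
    labels = measure.label(binary)
    if labels.max() == 0:
        return None                                           # nothing segmented
    cy, cx = image_gray.shape[0] / 2.0, image_gray.shape[1] / 2.0
    regions = measure.regionprops(labels)
    dists = [np.hypot(r.centroid[0] - cy, r.centroid[1] - cx) for r in regions]
    nearest = int(np.argmin(dists))
    if dists[nearest] > max_offset_frac * min(image_gray.shape):
        return None                                           # no near-central grain: "failure"
    return labels == regions[nearest].label
```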
- MinC11: Table 3 below: please provide a metric to compare both GPUs against each other. A logical question that comes up in a reader’s mind would be: “which one is the best?”
This is a good point, but we do feel that a comparison of GPUs would be somewhat superfluous in our technical note. We have re-run the dataset with model C-V-C in our Colab notebook after being allocated an NVIDIA Tesla T4 GPU and have included the results (identical, obviously, except for the segmentation time metric) in our revised version of Table 3 (attached). As this change enables 1:1 comparison of all metrics for all the models, we hope that this satisfies the reviewer’s request.
- MinC12: Figure 6: in these figures I need to see a 1:1 line, which indicates the ideal ratio of an automated measurement to a manual measurement. On the horizontal axis, you need to add “manual” before “measurement (µm)” in each of the two figures.
These issues have been fixed in our revised version of figure 6 (attached).
- MinC13: Line 335: explain “negative skew” in plain language.
This is best illustrated with a figure, and we have added two histogram plots (for long and short axis error) to our figure 6 as Fig. 6b (see attached). We have also added an equation (which will be rendered using the MS Word equation formatter as Equation 2) for Pearson’s skewness coefficient, as follows:
Pearson's skewness coefficient = 3 × (mean − median) / standard deviation
Additionally, we have revised the text at the beginning of section 5.2 to read:
“Per-grain automated (M-R101-C) measurements for the full Leary et al. (in press) dataset generally hew close to ground-truth measurements but with a significant number of datapoints plotting well below the 1:1 measured versus ground truth (i.e., Leary et al., in press) line (Fig. 6). The apparent dominant cause of this negative skew (i.e., Equation 2, Fig. 6B) is…”
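Expressed in code, Equation 2 applied to a set of per-grain measurement errors is simply the following (a minimal numpy illustration; the function name is hypothetical):

```python
import numpy as np

def pearson_skew(errors_um):
    """Pearson's skewness coefficient: 3 * (mean - median) / standard deviation."""
    errors = np.asarray(errors_um, dtype=float)
    return 3.0 * (errors.mean() - np.median(errors)) / errors.std()
```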
- MinC14: Figure 7: mention in the text that this “grain merging” problem can perhaps be solved with the NMS threshold that you mentioned earlier in line 133.
A good point. We have amended text in section 5.2 of our draft revised manuscript to read:
“Major positive measurement errors are relatively rare (Fig. 6) but are probably mainly attributable to segmentation masks that merge different grains (Fig. 7). The occurrence rates of these errors may be reducible through tuning of our models’ respective NMS thresholds, although we believe that our current chosen settings are fairly optimal for eliminating undesirable masks.”
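For reference, in a Detectron2-based Mask RCNN configuration the test-time NMS threshold mentioned above can be adjusted via the ROI heads setting shown below; the value given is an arbitrary example, not one of our chosen settings.

```python
from detectron2.config import get_cfg

cfg = get_cfg()
# The NMS threshold controls how aggressively overlapping detections are
# suppressed at inference time (lower values = more suppression).
cfg.MODEL.ROI_HEADS.NMS_THRESH_TEST = 0.3   # example value only
```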
- MinC15: caption Figure 8: the median value is displayed by the black horizontal lines inside the boxes (!).
We changed the caption of Figure 8 in our revised manuscript to read:
“A sample-by-sample boxplot comparison of human (Leary et al., in press) and automated (M-R101-C) measurements along long and short grain axes. Boxes extend from Q1 to Q3, and whiskers extend from Q1 - 1.5 * (Q3 - Q1) to Q3 + 1.5 * (Q3 - Q1); sample medians are indicated by black horizontal lines within each box.”
- MinC16: please add the scale on the upper panel of the figure, instead of the lower panel.
This has been corrected in the revised figure (attached).
- MinC17: Lines 77-94: it is indeed important to emphasize the history of this method development and to clearly give the (dis)advantages of both methods.
We agree, and hope that we did so adequately in our manuscript!
Additional corrections:
Line 110: We erroneously state that PyTorch is developed by Google:
Corrected to “…also developed by Facebook…”
References:
Bradski, G.: The OpenCV Library, Dr Dobbs J. Softw. Tools, 2000.
Cantine, M. D., Setera, J. B., Vantongeren, J. A., Mwinde, C., and Bergmann, K. D.: Grain size and transport biases in an Ediacaran detrital zircon record, J. Sediment. Res., 91, 913–928, https://doi.org/10.2110/jsr.2020.153, 2021.
He, K., Girshick, R., and Dollár, P.: Rethinking ImageNet Pre-training, https://doi.org/10.48550/arXiv.1811.08883, 21 November 2018.
Ibañez-Mejia, M., Pullen, A., Pepper, M., Urbani, F., Ghoshal, G., and Ibañez-Mejia, J. C.: Use and abuse of detrital zircon U-Pb geochronology—A case from the Río Orinoco delta, eastern Venezuela, Geology, 46, 1019–1022, https://doi.org/10.1130/G45596.1, 2018.
Karimi, D., Warfield, S. K., and Gholipour, A.: Transfer Learning in Medical Image Segmentation: New Insights from Analysis of the Dynamics of Model Parameters and Learned Representations, Artif. Intell. Med., 116, 102078, https://doi.org/10.1016/j.artmed.2021.102078, 2021.
Lawrence, R. L., Cox, R., Mapes, R. W., and Coleman, D. S.: Hydrodynamic fractionation of zircon age populations, GSA Bull., 123, 295–305, https://doi.org/10.1130/B30151.1, 2011.
Leary, R., Smith, M. E., and Umhoefer, P.: Mixed eolian-longshore sediment transport in the Late Paleozoic Arizona Pedregosa basin, USA: a case study in grain-size analysis of detrital zircon datasets, J. Sediment. Res., in press.
Leary, R. J., Smith, M. E., and Umhoefer, P.: Grain-Size Control on Detrital Zircon Cycloprovenance in the Late Paleozoic Paradox and Eagle Basins, USA, J. Geophys. Res. Solid Earth, 125, e2019JB019226, https://doi.org/10.1029/2019JB019226, 2020a.
Leary, R. J., Umhoefer, P., Smith, M. E., Smith, T. M., Saylor, J. E., Riggs, N., Burr, G., Lodes, E., Foley, D., Licht, A., Mueller, M. A., and Baird, C.: Provenance of Pennsylvanian–Permian sedimentary rocks associated with the Ancestral Rocky Mountains orogeny in southwestern Laurentia: Implications for continental-scale Laurentian sediment transport systems, Lithosphere, 12, 88–121, https://doi.org/10.1130/L1115.1, 2020b.
Lee, Y. and Park, J.: CenterMask: Real-Time Anchor-Free Instance Segmentation, arXiv:1911.06667 [cs], 2020.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B.: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, https://doi.org/10.48550/ARXIV.2103.14030, 2021.
Scharf, T., Kirkland, C. L., Daggitt, M. L., Barham, M., and Puzyrev, V.: AnalyZr: A Python application for zircon grain image segmentation and shape analysis, Comput. Geosci., 162, 105057, https://doi.org/10.1016/j.cageo.2022.105057, 2022.
Sheldrake, T. and Higgins, O.: Classification, segmentation and correlation of zoned minerals, Comput. Geosci., 156, 104876, https://doi.org/10.1016/j.cageo.2021.104876, 2021.
van der Walt, S., Schönberger, J. L., Nunez-Iglesias, J., Boulogne, F., Warner, J. D., Yager, N., Gouillart, E., Yu, T., and contributors, the scikit-image: scikit-image: Image processing in Python, PeerJ, 2, e453, https://doi.org/10.7717/peerj.453, 2014.
Detectron2: https://github.com/facebookresearch/detectron2.
Ye, H., Yang, Y., and L3str4nge: SwinT_detectron2: v1.2, Zenodo, https://doi.org/10.5281/ZENODO.6468976, 2021.
RC2: 'Comment on gchron-2022-12', Taryn Scharf, 12 Jun 2022
The manuscript of Sitar and Leary is well written and includes thorough explanations of methods and results, with appropriate support from the literature. The construction of colab_zirc_dims demonstrates good working knowledge of those sectors of deep learning and computer vision that are applicable to the segmentation of reflected light images of zircon grains. This work thus provides the geological community with a valuable step forward in the development of highly accurate, rapid and automated tools for zircon image segmentation and shape measurement. I believe this manuscript should be published once comments have been addressed.
Specific Comments
- Line 45-46: Colab_zirc_dims works exclusively on reflected light (RL) images of zircons mounted in resin. The authors mention that zircon shape may be partially obscured by resin, resulting in minimum shape measurements instead of true dimensions. As colab_zirc_dims is presented as a tool for zircon shape measurement, could the authors kindly expand the discussion to cover whether the error introduced by reflected light images is significant, and whether or not it predisposes colab_zirc_dims to certain use cases? For example, do we know what proportion of a dataset is typically affected by this phenomenon? Is there any risk in comparative studies in which mounts have been differently handled (e.g. ground to different depths) or where shape measurements have been extracted from a variety of image types?
- Line 72-76: Please include an image that compares the segmentation achieved by traditional methods such as Otsu thresholding against those of colab_zirc_dims, when artefacts are present (e.g. anomalous bright spots, bubbles), to support this assertion. Alternatively, please include supporting literature references.
- Line 88: I am perhaps confused by the term “zonal area” – does this refer to the banding seen in cathodoluminescence (CL) images of zircon? Have the authors used AnalyZr to extract bands from within grains? Unfortunately, as AnalyZr was not developed for CL images, full grain segmentation from CL images is expected to fail. Perhaps reword the text to clarify, as it might mistakenly be interpreted as a recommendation to use AnalyZr for CL image grain segmentation.
- Potentially inconsistent terminology. Do the authors intend these terms to have different meanings, or do they all refer to a MaskRCNN implementation with FPN, using a ResNet backbone? If the latter, I’d recommend that terminology be standardised throughout the paper.
Mask RCNN FPN (line 139 & 147)
Mask RCNN (line 151, Fig 2 caption)
Mask RCNN ResNet-FPN (line 178)
- Line 155: Table 1: The authors have selected training iterations of 4000-7000. Fig 4 shows that models stabilise at approximately 2000 iterations. Could the authors please include their reasoning for selecting model checkpoints at such high iterations? Is there any risk that these models are comparatively overtrained (meaningfully less generalised) than those around ~2000 iterations (e.g. how do they compare on the test dataset used for Table 3)?
- Line 184-185: The meaning of “sample-dependent…resolutions…(194 by 194 pixels…)” was not clear to me. Does this refer to the “max_zircon_size” criterion in the “mosaic_info” data table of the “Data Matching and Preparation” Colab notebook (lines 262-266)? Consider rewording the text to clarify.
- Lines 184-188: Could the authors please provide an indication of the nature of these zircon grains (e.g. sources, ages, sedimentary environment, histograms of shape parameter variation etc.) so that the reader has an understanding of how diverse the image test and validation datasets are? As the authors are using a small dataset to train deep convolutional neural networks, a reader may wonder how generalised the trained models are, and whether they will perform as well on zircon grains from different regions. Alternatively, if the authors feel that the small training dataset inhibits generalisation, please expand on this in the discussion.
- Line 190: Kindly indicate which of the images were hand-selected (perhaps rename the image files and refer the reader to the supplementary data).
- Lines 202-204: Please clarify what an iteration refers to, in this training regime. Additionally, consider specifying epochs and batch size in Table 1, for those readers who may wish to test the reported training strategy within their own Python implementation of MaskRCNN.
- Line 216: Kindly provide definitions, using simple terminology or mathematical formula, for the training mask loss and average precisions shown in Fig 4. Is there a reference for the source of these definitions that could be provided?
- Line 259, Section 4.3.1 Dataset Preparation tools: Colab_zirc_dims has a unique work flow with specific data and metadata requirements. I suggest including a flow diagram illustrating the segmentation and shape measurement procedure, inputs and outputs, including detail on image size and channels. This would help the reader understand the data requirements and process flow of colab_zirc_dims.
- Lines 277-280: Colab_zirc_dims is designed for data output by facilities using the Chromium software. It therefore has specific data and metadata requirements. There appears some flexibility around metadata, which the authors touch on in lines 277-280. However, after reading this text I felt I lacked a clear understanding of (1) whether or not any reflected light image dataset could be adapted for use in colab_zirc_dims and (2) how this may be done (e.g. automatically generating the necessary metadata files from user inputs via script). Please could the authors expand on the flexibility and limitations surrounding the application of this tool to reflected light datasets in general. This would help readers quickly identify whether this tool can be used on their datasets. The flow diagram suggested in the previous comment may help in this regard.
- Line 305, Table 3. Are the authors using the term “spot” as a synonym for “grain” in “Average segmentation time per spot”? Additionally, the footnote describes the metric as the time required for image segmentation. Does the reader interpret this metric as segmentation time per image, per grain in the image, or per analytical spot (potentially more than one spot per grain) in the image?
- Line 305, Table 3: Please provide definitions, using simple terminology or mathematical formula, for each of the tabulated metrics. Is there an appropriate literature reference for the definitions, which could be provided?
- Line 305, Table 3: Please could the authors add a description of the test dataset to help the reader understand how similar the test dataset is to the training and validation datasets. This adds additional context to the performance evaluation results.
- Line 309: The authors refer the reader to Fig 4 and Table 3, in which models are differently named. Kindly standardise model names throughout the paper, thus facilitating quick comparison of models across tables and figures.
- Line 310: Consider amending “training loss” to “training mask loss”, to be consistent with Fig 4.
- Line 335: Please clarify the meaning of “skew slightly negative”.
Technical Corrections
- Line 68: “with via” amend to “via”.
- Line 269: “…allows to users to generate…” amend to “…allows users to generate”.
- Line 292: “…are can…” amend to “can”.
- Line 323: “Centermaks2” amend to “Centermask2”
Citation: https://doi.org/10.5194/gchron-2022-12-RC2
AC2: 'Reply on RC2', Michael Sitar, 22 Jul 2022
We greatly appreciate T. Scharf’s constructive comments and corrections regarding our manuscript and are especially glad that she was able to evaluate our work given her unique expertise on this subject. The comments (italicized) highlight many places where we can improve or clarify our manuscript, and we have included a response below each one. We use “revised manuscript” below to refer to a revised copy of our original manuscript that we will submit if invited to do so by the editors.
- Line 45-46: Colab_zirc_dims works exclusively on reflected light (RL) images of zircons mounted in resin. The authors mention that zircon shape may be partially obscured by resin, resulting in minimum shape measurements instead of true dimensions. As colab_zirc_dims is presented as a tool for zircon shape measurement, could the authors kindly expand the discussion to cover whether the error introduced by reflected light images is significant, and whether or not it predisposes colab_zirc_dims to certain use cases? For example, do we know what proportion of a dataset is typically affected by this phenomenon? Is there any risk in comparative studies in which mounts have been differently handled (e.g. ground to different depths) or where shape measurements have been extracted from a variety of image types?
This information is certainly important to convey to colab_zirc_dims users. We believe that a study in which new mounts are prepared and imaged using different techniques, in order to precisely quantify the degree to which this impacts grain shape estimation accuracy, would be very valuable, especially with growing interest in grain size/shape versus age relationships. Though such a study is obviously beyond the scope of our technical note, we think that the data that we have on hand in some cases approximate the errors that the reviewer comments on. We have consequently added the following text to section 5.2 of our draft revised manuscript:
“Because these images are of sufficiently high quality that subsurface grain extents were interpretable by Leary et al. (in press), and because model M-R101-C generally only segments grain areas above resin surfaces, errors in these samples can also be used as a rough proxy for dimensional data loss from using reflected light versus transmitted light images to measure shapes of very poorly exposed grains in cases where reflected light images do not reveal any information about subsurface grain extents (Sect. 1; Leary et al., 2020a). In the worst-evaluated sample, 1WM-302 (n=180), M-R101-C produces axial measurements that underestimated manually measured grain axes by at least 20% 58.3% of the time, with average grain measurement errors of -16.1% and -21.0% along long and short axes, respectively. Treating these automatically generated axial measurements as ground truth data could result in significantly flawed analysis of relationships between grain size and age. Such shape parameter underestimates present only a minor (though potentially time-consuming) problem for colab_zirc_dims users with poorly exposed grains whose actual areas are still interpretable by humans (e.g., in the case of 1WM-302); erroneous segmentation masks can simply be corrected manually using the GUI. Users who observe that their mounted crystals are both very poorly exposed and invisible below the resin surface in their reflected light images, though, may consider re-imaging their samples using transmitted light and then measuring grains using a different program (e.g., AnalyZr; Scharf et al., 2022) to avoid collecting flawed data.”
To address the question of comparison between facilities and/or datasets, we have added the following text:
“Because most facilities aspire to polish their laser ablation zircon mounts to half the thickness of the zircons, we expect that there will be no systematic differences between measurements of grains analysed at different laser ablation labs. However, because there is some variability in the quality of polish achieved at ALC in the test dataset (Leary et al., 2020a; see above discussion of samples 1WM-302, 5PS-58, 2QZ-9, and 2QZ-272), careful manual checking of polish quality will be required in any dataset, as described above. Ultimately, a study in which pre- (e.g., Finzel, 2017) and post-mount (Leary et al., 2020a; Scharf et al., 2022; current study) grain dimension measurements can be collected on the same samples will be the best way to quantify the bias introduced by polishing and/or by different facilities. However, such a test is well beyond the scope of the current study.”
- Line 72-76: Please include an image that compares the segmentation achieved by traditional methods such as Otsu thresholding against those of colab_zirc_dims, when artefacts are present (e.g. anomalous bright spots, bubbles), to support this assertion. Alternatively, please include supporting literature references.
We have amended revised versions of Figs. 1 and 2 (see attached) to include selections of images displaying artefacts and fractured grains (e.g., mosaic stitching boundaries, anomalous bright spots, and a bubble), along with basic Otsu thresholding masking results (Fig. 1d) and colab_zirc_dims segmentation and measurement results (Fig. 2d). We hope that these images adequately show differences in performance, which are significant in the cases of some atypical images (e.g., the top row in Figs. 1d and 2d), but minor in others (e.g., the bottom row in Figs. 1d and 2d).
- Line 88: I am perhaps confused by the term “zonal area” – does this refer to the banding seen in cathodoluminescence (CL) images of zircon? Have the authors used AnalyZr to extract bands from within grains? Unfortunately, as AnalyZr was not developed for CL images, full grain segmentation from CL images is expected to fail. Perhaps reword the text to clarify, as it might mistakenly be interpreted as a recommendation to use AnalyZr for CL image grain segmentation.
Reply from the first author: Due to a combination of a) my misunderstanding of the intended use of the AnalyZr spot picking/CL texture recording UI in the Scharf et al. (2022) paper and b) a superficially successful test of AnalyZr with a CL zircon grain image (loaded as _RL), I was in fact under the mistaken impression that AnalyZr could be used for full grain segmentation in CL images. “Zonal area” is vague but refers here to the CL image zoning textures that users can localize to analytical spots using the UI; I erroneously believed that users could identify and mark these textures in-situ while segmenting grains in CL images. I apologize for mischaracterizing your work and should note that the Scharf et al. (2022) paper is quite unambiguous with regard to the intended use cases and files for AnalyZr.
To eliminate any possibility of confusion, we have removed “…zonal area from cathodoluminescence images…” from our revised manuscript and added the following text:
“Analytical spot identification and localization in AnalyZr is done manually through an interface which also allows input of spot-specific comments and qualitative internal grain zoning descriptors that persist into the program’s exports (Scharf et al., 2022).”
- Potentially inconsistent terminology. Do the authors intend these terms to have different meanings, or do they all refer to a MaskRCNN implementation with FPN, using a ResNet backbone? If the latter, I’d recommend that terminology be standardised throughout the paper.
We agree that our terminology here is inconsistent and potentially confusing. These terms are used somewhat interchangeably in deep learning literature and documentation, but the technically correct terminology as per He et al. (2018) should be “Mask RCNN ResNet-FPN” when referring to specific model designs. We have updated our manuscript for consistency as follows:
- Mask RCNN FPN (line 139 & 147)
Changed both to “Mask RCNN” (reference to general model architecture)
- Mask RCNN (line 151, Fig 2 caption)
Kept as is due to reference to specific model in Table 1.
- Mask RCNN ResNet-FPN (line 178)
Kept as is (already correct).
We have additionally revised all backbone names in Table 1 (see attached) to reflect the fact that all the backbone networks that we used incorporate FPNs, and revised the text of our manuscript to state this specifically where appropriate.
- Line 155: Table 1: The authors have selected training iterations of 4000-7000. Fig 4 shows that models stabilise at approximately 2000 iterations. Could the authors please include their reasoning for selecting model checkpoints at such high iterations? Is there any risk that these models are comparatively overtrained (meaningfully less generalised) than those around ~2000 iterations (e.g. how do they compare on the test dataset used for Table 3)?
We agree that these questions should be answered for readers in the manuscript and have consequently revised figure 4 (attached) to include average absolute long axis error from evaluations of model checkpoints saved during training, and revised text at the end of section 5 to read:
“We picked “best” model checkpoints (Table 1) beyond 3000 iterations where models achieved apparent local maxima in validation accuracies (Fig. 4) and local minima or plateaus in various measurement error metrics (e.g., failure rate and absolute long axis error; Table 3; Fig. 4) when evaluated on the full Leary et al. (in press) test dataset. We set our threshold (greater than 3000 iterations) for checkpoint picking based on qualitative observations that grain masks for all models appeared to be more “blobby” (i.e., less closely fitted to actual grain areas) at lower training iterations, though it is worth noting that we fail to see conclusive evidence for this relationship in the training mask loss, validation mask loss, or test accuracy curves (Fig. 4). Changes in most evaluation accuracy metrics (roughly represented by average absolute long axis error in Fig. 4) for the models trained with image augmentations were largely stochastic after ~2000 (for the pretrained models) to ~3000 iterations (for the randomly initialized models; Fig. 4). This suggests a lack of meaningful overfitting (possibly attributable to a combination of learning rate drawdown and training image augmentation) in relation to our test dataset and probably negligible negative effects on model generalization ability from selecting models at relatively high training iterations.”
For the reviewer’s benefit, we would like to note that we do see significant evidence for overfitting against our test dataset for our single model trained without image augmentation (M-R50-S-NA) after ~4000 iterations. Due to much higher overall test dataset error values than for other models, however, we are unable to fit it on the new panel of our revised Fig. 4 without obscuring test results for our better-performing models. We have attached a version of Figure 4 that includes model M-R50-S-NA test results for the reviewer to evaluate – see “revised fig 4 with unaugmented model test results.png”.
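For clarity, the checkpoint-picking heuristic described in the revised text above amounts to something like the following sketch (the per-checkpoint metric values are placeholders, not our actual results):

# Placeholder per-checkpoint metrics for illustration only.
checkpoints = {
    3200: {"mask_ap": 71.2, "abs_long_axis_err": 4.1},
    4000: {"mask_ap": 72.8, "abs_long_axis_err": 3.6},
    5200: {"mask_ap": 72.5, "abs_long_axis_err": 3.8},
    6800: {"mask_ap": 73.0, "abs_long_axis_err": 3.7},
}

# Restrict to checkpoints beyond the 3000-iteration threshold, then keep the
# checkpoint with the highest validation mask AP (measurement error metrics
# could be used as a secondary criterion or tie-breaker).
eligible = {it: m for it, m in checkpoints.items() if it > 3000}
best_iteration = max(eligible, key=lambda it: eligible[it]["mask_ap"])
print(best_iteration, eligible[best_iteration])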
- Line 184-185: The meaning of “sample-dependent…resolutions…(194 by 194 pixels…)” was not clear to me. Does this refer to the “max_zircon_size” criterion in the “mosaic_info” data table of the “Data Matching and Preparation” Colab notebook (lines 262-266)? Consider rewording the text to clarify.
This does indeed refer to the “max_zircon_size” (“Max_grain_size” as of colab_zirc_dims v1.0.9). To clarify our meaning, we have revised the sentence to read:
“ALC images (Table 2) were algorithmically extracted from mosaic images at scales and resolutions (194 by 194 pixels to 398 by 398 pixels) that varied from sample to sample based on, respectively, imaging parameters during analysis and grain size (i.e., “Max_grain_size”; Sect. 4.3.1).”
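As a rough illustration of this kind of extraction (not the colab_zirc_dims implementation; the function, mosaic array, shot coordinates, and scale value are placeholders), a shot-centred crop whose size is set by the per-sample maximum grain size could be written as:

import numpy as np

def crop_shot_centred(mosaic, shot_row, shot_col, max_grain_size_um, um_per_pixel):
    """Crop a square sub-image centred on an analytical shot location.

    The crop width is set from the per-sample maximum expected grain size,
    so different samples yield different sub-image resolutions.
    """
    half = int(round(max_grain_size_um / um_per_pixel)) // 2
    r0, r1 = max(shot_row - half, 0), min(shot_row + half, mosaic.shape[0])
    c0, c1 = max(shot_col - half, 0), min(shot_col + half, mosaic.shape[1])
    return mosaic[r0:r1, c0:c1]

# Placeholder mosaic and shot location for illustration only.
mosaic = np.zeros((4000, 6000), dtype=np.uint8)
sub_image = crop_shot_centred(mosaic, shot_row=1500, shot_col=2200,
                              max_grain_size_um=400, um_per_pixel=2.06)
print(sub_image.shape)  # roughly 194 x 194 pixels at this scale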
- Lines 184-188: Could the authors please provide an indication of the nature of these zircon grains (e.g. sources, ages, sedimentary environment, histograms of shape parameter variation etc.) so that the reader has an understanding of how diverse the image test and validation datasets are? As the authors are using a small dataset to train deep convolutional neural networks, a reader may wonder how generalised the trained models are, and whether they will perform as well on zircon grains from different regions. Alternatively, if the authors feel that the small training dataset inhibits generalisation, please expand on this in the discussion.
We have added the following texts to section 3.2.3 to provide additional information on the provenance of training dataset images:
“These samples contain a wide range of zircon ages from Proterozoic to late Palaeozoic and represent a variety of terrestrial and marine depositional environments including fluvial, delta plain, nearshore, and continental shelf environments. See Leary et al. (2020a; 2020b) for detailed discussion of these samples.”
and
“These images are of grains derived from samples of Late Mesozoic-Early Cenozoic rocks interpreted to have been deposited by braided stream systems. Dated zircon grains from these samples indicate the presence of mixed populations of Proterozoic grains that likely record long-range tectonic-fluvial transport (e.g., from the Grenville Orogen to modern day Nevada, USA; Rainbird et al., 1997; Gehrels et al., 2000) and iterative recycling prior to their most recent deposition. These grains are combined in approximately equal proportions with minimally transported Early Cretaceous grains presumably sourced from the ancient Sierran Arc. Images from the UCSB training set consequently include variable mixtures of very well-rounded and relatively fresh, euhedral grains.”
We have added the following text to the end of section 5.2 of our revised manuscript in order to better inform readers on potential uncertainties related to our small training dataset:
“Though most of our models evidently generalize well to our test set, and we believe that they will most likely generalize well to other datasets, they are still untested on data from facilities not represented in their training dataset (i.e., besides ALC and UCSB). And, though they have been exposed to some relatively euhedral detrital zircon grains in the UCSB training images, our models are notably also untested on crystals derived from primary igneous and volcanic rocks. Some uncertainty remains as to how well our models will work when applied to more diverse data by colab_zirc_dims users. Since our training dataset is quite small and lacking in diversity of image sources, increasing the size and diversity of our training dataset before training updated models will likely yield some improvements in model generalization ability. We plan to expand our training dataset and release new models as we maintain colab_zirc_dims and will make this our priority should users inform us that our current models fail to generalize well.”
- Line 190: Kindly indicate which of the images were hand-selected (perhaps rename the image files and refer the reader to the supplementary data).
We appreciate this suggestion and certainly plan to incorporate it. We are working on a Python script to persist filename changes into training annotation .json files and will get this done in advance of submitting supplementary data for a revised manuscript or possibly sooner (e.g., for colab_zirc_dims 1.0.10).
- Lines 202-204: Please clarify what an iteration refers to, in this training regime. Additionally, consider specifying epochs and batch size in Table 1, for those readers who may wish to test the reported training strategy within their own Python implementation of MaskRCNN.
We are glad that the reviewer pointed this out – a Google search reveals several completely different definitions of “iteration” in the context of deep learning, which will surely be confusing to readers. We have added the following text to clarify what the term means for Detectron2 training:
“Detectron2 exclusively uses the term “iteration” to define the extent of a model’s exposure to training data. To avoid semantic confusion, it is worth noting that a Detectron2 “iteration” is synonymous with a “batch” in other deep learning libraries (e.g., PyTorch and TensorFlow), and the number of iterations in an “epoch” is thus equivalent to training set size divided by iteration batch size (Paszke et al., 2019; TensorFlow Developers, 2022; Detectron2).”
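A concrete (hypothetical) example of this arithmetic, with placeholder values rather than our actual training configuration:

training_set_size = 400      # images in the training dataset
images_per_iteration = 2     # e.g., Detectron2's cfg.SOLVER.IMS_PER_BATCH

iterations_per_epoch = training_set_size / images_per_iteration   # 200.0
epochs_in_6000_iterations = 6000 / iterations_per_epoch            # 30.0
print(iterations_per_epoch, epochs_in_6000_iterations)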
- Line 216: Kindly provide definitions, using simple terminology or mathematical formula, for the training mask loss and average precisions shown in Fig 4. Is there a reference for the source of these definitions that could be provided?
We have added the following text to our revised manuscript draft to clarify training mask loss and COCO AP metrics:
“…Mask RCNN mask loss, which is defined by He et al. (2018) as average binary cross entropy loss calculated over each sigmoid-activated mask prediction for each class in each ROI of each image in a batch, is a component of the loss functions for all models and thus can be compared one-to-one between them (e.g., Fig. 4).”
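A simplified, stand-alone illustration of this mask loss definition, using PyTorch with random placeholder tensors rather than real model outputs, is:

import torch
import torch.nn.functional as F

# Placeholder tensors: raw (pre-sigmoid) mask logits and binary ground-truth
# masks for N ROIs at the 28 x 28 mask resolution typical of Mask R-CNN heads.
n_rois, mask_size = 8, 28
mask_logits = torch.randn(n_rois, mask_size, mask_size)
gt_masks = torch.randint(0, 2, (n_rois, mask_size, mask_size)).float()

# Average binary cross-entropy over all pixels of all ROI masks in the batch;
# binary_cross_entropy_with_logits applies the sigmoid internally. (The real
# Mask R-CNN head first selects the mask channel of each ROI's class, which is
# omitted here for simplicity.)
mask_loss = F.binary_cross_entropy_with_logits(mask_logits, gt_masks)
print(mask_loss.item())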
“Model performance on the validation set was evaluated using bounding box and mask MS COCO AP (mean average precision) values, which are themselves arithmetic means of mean average precisions calculated at 10 segmentation intersection over union thresholds between 0.5 and 0.95 (Lin et al., 2015; COCO - Common Objects in Context, 2022). These evaluations (Fig. 4) were run every 200 iterations…”
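Schematically, the reported COCO AP value is then just the arithmetic mean of AP values at those ten IoU thresholds (placeholder per-threshold values shown; a real evaluator such as pycocotools' COCOeval computes them from matched predictions and ground truth):

import numpy as np

# The ten IoU thresholds used by the MS COCO AP metric: 0.50, 0.55, ..., 0.95.
iou_thresholds = np.arange(0.50, 1.00, 0.05)

# Placeholder AP values, one per threshold, for illustration only.
ap_at_threshold = np.array([0.92, 0.91, 0.90, 0.88, 0.85,
                            0.81, 0.74, 0.63, 0.45, 0.18])
assert len(ap_at_threshold) == len(iou_thresholds)

coco_ap = ap_at_threshold.mean()
print(f"COCO AP (IoU=0.50:0.95): {coco_ap:.3f}")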
- Line 259, Section 4.3.1 Dataset Preparation tools: Colab_zirc_dims has a unique work flow with specific data and metadata requirements. I suggest including a flow diagram illustrating the segmentation and shape measurement procedure, inputs and outputs, including detail on image size and channels. This would help the reader understand the data requirements and process flow of colab_zirc_dims.
We have significantly expanded Figure 5 (revised version attached) to provide a full overview of potential colab_zirc_dims workflows and potential dataset inputs. We hope that this diagram addresses the reviewer’s concerns raised here and in comment 11 below. We have also added the following text to section 4.3.2:
“Researchers with datasets comprising reflected light images that are not shot-centred and lack Chromium metadata can adapt their image datasets for use with colab_zirc_dims (Fig. 5a) by either using Chromium Offline (Teledyne Photon Machines, 2020) to generate scaling and/or shot placement metadata or by manually cropping shot-centred images from mosaics (e.g., using ImageJ’s “multicrop” function; Schindelin et al., 2012). Such a workflow will, however, bypass most of the automation in the colab_zirc_dims data loading process, and potential users are advised that collecting grain measurements using existing software (i.e., AnalyZr; Scharf et al., 2022) will likely be less arduous.”
- Lines 277-280: Colab_zirc_dims is designed for data output by facilities using the Chromium software. It therefore has specific data and metadata requirements. There appears some flexibility around metadata, which the authors touch on in lines 277-280. However, after reading this text I felt I lacked a clear understanding of (1) whether or not any reflected light image dataset could be adapted for use in colab_zirc_dims and (2) how this may be done (e.g. automatically generating the necessary metadata files from user inputs via script). Please could the authors expand on the flexibility and limitations surrounding the application of this tool to reflected light datasets in general. This would help readers quickly identify whether this tool can be used on their datasets. The flow diagram suggested in the previous comment may help in this regard.
We agree that more specific information on the input requirements for colab_zirc_dims processing notebooks is in order, and hope that our revisions to components A and B of Figure 5 (see response to comment 10) adequately cover this. We also hope that the reviewer’s concerns regarding a lack of discussion of limitations related to different reflected light datasets (e.g., ones varying in image quality and/or grain exposure) are addressed by our additions to section 5.2 of our revised manuscript, as described in response to comment 1.
- Line 305, Table 3. Are the authors using the term “spot” as a synonym for “grain” in “Average segmentation time per spot”? Additionally, the footnote describes the metric as the time required for image segmentation. Does the reader interpret this metric as segmentation time per image, per grain in the image, or per analytical spot (potentially more than one spot per grain) in the image?
We were in this case using spot as a synonym for grain (i.e., an analytical spot placed on a grain, with one image corresponding to that grain). This is clearly confusing and inconsistent, and we have consequently revised the table column name to “Average segmentation time per image” and the footnote to “Average time for model/method to successfully segment an image and return a measurable mask. Actual per-image processing times will be higher due to additional automated mask measurement and verification image saving time. Measured in Colab notebook with NVIDIA T4 GPU.”
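For transparency, the timing reported in this column can be thought of as the following measurement (a minimal sketch; segment_image is a placeholder for whichever model or method is being benchmarked, not a colab_zirc_dims function):

import time

def mean_segmentation_time(images, segment_image):
    """Average wall-clock time per successfully segmented image (seconds)."""
    times = []
    for image in images:
        start = time.perf_counter()
        mask = segment_image(image)   # placeholder for the model/method call
        elapsed = time.perf_counter() - start
        if mask is not None:          # count only successful segmentations
            times.append(elapsed)
    return sum(times) / len(times) if times else float("nan")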
- Line 305, Table 3: Please provide definitions, using simple terminology or mathematical formula, for each of the tabulated metrics. Is there an appropriate literature reference for the definitions, which could be provided?
We have revised this table to include in its footnotes written definitions for metrics “n” and “Average segmentation time per image” and formulas for each of the other calculated metrics. See the attached file ‘Table3_revised_as_word’ for optimal viewing of the formulas.
- Line 305, Table 3: Please could the authors add a description of the test dataset to help the reader understand how similar the test dataset is to the training and validation datasets. This adds additional context to the performance evaluation results.
We have added the following text to the beginning of section 5 to address this comment:
“We assessed the accuracy of our segmentation models by comparing a manually generated grain-dimension dataset (Leary et al., in press) to automatically generated grain dimensions from the same samples measured using colab_zirc_dims. The test dataset from Leary et al. (in press) consists of samples collected from late Palaeozoic strata exposed across Arizona, USA. These samples were deposited in the same orogenic system—the Ancestral Rocky Mountains—as the Leary et al. (2020a) training dataset images, and the grain ages and depositional environments are largely similar. The test dataset is unrelated to the training dataset images from UCSB (see above).”
- Line 309: The authors refer the reader to Fig 4 and Table 3, in which models are differently named. Kindly standardise model names throughout the paper, thus facilitating quick comparison of models across tables and figures.
We have standardized the names in our revised figures (i.e., Fig. 4; attached) and in the text of our revised manuscript where appropriate.
- Line 310: Consider amending “training loss” to “training mask loss”, to be consistent with Fig 4.
We have made this correction to our revised manuscript.
- Line 335: Please clarify the meaning of “skew slightly negative”.
We are referring here to skewness in our error results that is quantifiable using Pearson’s skewness coefficient; this can be illustrated by the two histogram plots (for long and short axis error) that have been added to a revised version of Figure 6 as Fig. 6b (see attached). To define Pearson’s skewness coefficient, we have added the following equation (which will be rendered by the MS Word equation formatter as Equation 2) to section 5.2 of our revised manuscript:
Pearson's skewness coefficient = 3(mean-median)/(standard deviation)
We have also revised the beginning of section 5.2 to read:
“Per-grain automated (M-R101-C) measurements for the full Leary et al. (in press) dataset generally hew close to ground-truth measurements but with a significant number of datapoints plotting well below the 1:1 measured versus ground truth (i.e., Leary et al., in press) line (Fig. 6). The apparent dominant cause of this negative skew (i.e., Equation 2, Fig. 6B) is…”
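For completeness, Pearson’s second skewness coefficient (Equation 2) can be computed directly from per-grain error values, e.g. (placeholder error values, NumPy only):

import numpy as np

def pearson_second_skewness(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std."""
    values = np.asarray(values, dtype=float)
    return 3.0 * (values.mean() - np.median(values)) / values.std()

# Placeholder per-grain long-axis errors (%); a few large underestimates pull
# the distribution's tail toward negative values, so the coefficient is < 0.
long_axis_errors = np.array([-2.1, 0.5, -1.3, -35.0, -0.8, 1.2, -22.5, -0.4])
print(pearson_second_skewness(long_axis_errors))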
Technical Corrections
- Line 68: “with via” amend to “via”.
- Line 269: “…allows to users to generate…” amend to “…allows users to generate”.
- Line 292: “…are can…” amend to “can”.
- Line 323: “Centermaks2” amend to “Centermask2”
The technical corrections above have been incorporated in our draft revised manuscript. Thank you to the reviewer for identifying them!
Additional corrections:
Line 110: We erroneously state that PyTorch is developed by Google:
Corrected to “…also developed by Facebook…”
References:
COCO - Common Objects in Context: https://cocodataset.org/#detection-eval, last access: 14 July 2022.
Finzel, E. S.: Detrital zircon microtextures and U-Pb geochronology of Upper Jurassic to Paleocene strata in the distal North American Cordillera foreland basin, Tectonics, 36, 1295–1316, https://doi.org/10.1002/2017TC004549, 2017.
Gehrels, G. E., Dickinson, W. R., Riley, B. C. D., Finney, S. C., and Smith, M. T.: Detrital zircon geochronology of the Roberts Mountains allochthon, Nevada, in: Special Paper 347: Paleozoic and Triassic paleogeography and tectonics of western Nevada and Northern California, vol. 347, Geological Society of America, 19–42, https://doi.org/10.1130/0-8137-2347-7.19, 2000.
He, K., Gkioxari, G., Dollár, P., and Girshick, R.: Mask R-CNN, arXiv:1703.06870 [cs], 2018.
Leary, R., Smith, M., and Umhoefer, P.: Grain‐Size Control on Detrital Zircon Cycloprovenance in the Late Paleozoic Paradox and Eagle Basins, USA, J. Geophys. Res. Solid Earth, 125, https://doi.org/10.1029/2019JB019226, 2020a.
Leary, R., Smith, M. E., and Umhoefer, P.: Mixed eolian-longshore sediment transport in the Late Paleozoic Arizona Pedregosa basin, USA: a case study in grain-size analysis of detrital zircon datasets, J. Sediment. Res., in press.
Leary, R. J., Umhoefer, P., Smith, M. E., Smith, T. M., Saylor, J. E., Riggs, N., Burr, G., Lodes, E., Foley, D., Licht, A., Mueller, M. A., and Baird, C.: Provenance of Pennsylvanian–Permian sedimentary rocks associated with the Ancestral Rocky Mountains orogeny in southwestern Laurentia: Implications for continental-scale Laurentian sediment transport systems, Lithosphere, 12, 88–121, https://doi.org/10.1130/L1115.1, 2020b.
Lin, T.-Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L., and Dollár, P.: Microsoft COCO: Common Objects in Context, arXiv:1405.0312 [cs], 2015.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S.: PyTorch: An Imperative Style, High-Performance Deep Learning Library, arXiv:1912.01703 [cs, stat], 2019.
Rainbird, R. H., McNicoll, V. J., Thériault, R. J., Heaman, L. M., Abbott, J. G., Long, D. G. F., and Thorkelson, D. J.: Pan-Continental River System Draining Grenville Orogen Recorded by U-Pb and Sm-Nd Geochronology of Neoproterozoic Quartzarenites and Mudrocks, Northwestern Canada, J. Geol., 105, 1–17, https://doi.org/10.1086/606144, 1997.
Scharf, T., Kirkland, C. L., Daggitt, M. L., Barham, M., and Puzyrev, V.: AnalyZr: A Python application for zircon grain image segmentation and shape analysis, Comput. Geosci., 162, 105057, https://doi.org/10.1016/j.cageo.2022.105057, 2022.
Schindelin, J., Arganda-Carreras, I., Frise, E., Kaynig, V., Longair, M., Pietzsch, T., Preibisch, S., Rueden, C., Saalfeld, S., Schmid, B., Tinevez, J.-Y., White, D. J., Hartenstein, V., Eliceiri, K., Tomancak, P., and Cardona, A.: Fiji: an open-source platform for biological-image analysis, Nat. Methods, 9, 676–682, https://doi.org/10.1038/nmeth.2019, 2012.
TensorFlow Developers: TensorFlow, Zenodo, https://doi.org/10.5281/zenodo.5949169, 2022.
Detectron2: https://github.com/facebookresearch/detectron2.