Technical note: A prototype transparent-middle-layer data management and analysis infrastructure for cosmogenic-nuclide exposure dating

Geologic dating methods for the most part do not directly measure ages. Instead, interpreting a geochemical observation as a geologically useful parameter – an age or a rate – requires an interpretive middle layer of calculations and supporting data sets. These are the subject of active research and evolve rapidly, so any synoptic analysis requires complete, repeated recalculation of large numbers of ages from a growing data set of raw observations, using a constantly improving calculation method. Many important applications of geochronology involve regional or global analyses of large and growing data sets, so this characteristic is an obstacle to progress in these applications. This paper describes the ICE-D database project, a prototype computational infrastructure for dealing with this obstacle in one geochronological application – cosmogenic-nuclide exposure dating – that aims to enable visualization or analysis of large, diverse data sets by making middle-layer calculations dynamic and transparent to the user. An important aspect of this infrastructure concept is that it is designed as a forward-looking research tool rather than a backward-looking archive: only observational data (which do not become obsolete) are stored, and derived data (which become obsolete as soon as the middle-layer calculations are improved) are not stored, but instead calculated dynamically at the time data are needed by an analysis application. This minimizes "lock-in" effects associated with archiving derived results subject to rapid obsolescence, and allows assimilation of both new observational data and improvements to middle-layer calculations without creating additional overhead at the level of the analysis application.

1 Interpretive middle layer calculations in geochronology

Geologic dating methods, with a few exceptions such as varve or tree-ring counting, do not directly measure ages or timespans.
Instead, the actual observation is typically a geochemical measurement, like a nuclide concentration or isotope ratio. Interpreting the measurement as a geologically useful parameter such as an age or rate then requires some sort of calculation and a variety of independently measured or assumed data such as radioactive decay constants, initial compositions or ratios, nuclide production rates, or nuclear cross-sections (Figure 1). These elements form a "middle layer" between the direct observations and the geological information derived from the observations. Middle-layer calculations present a problem for management and analysis of geochemical data because they constantly change as the calculation methods improve and new measurements of the other parameters needed for the calculation become available. Even though the geochemical measurements themselves in archived or previously published studies are valid indefinitely, the derived ages become obsolete. This is an obstacle for analysis of geochronological data collected over a long period of time or, sometimes, from multiple laboratories or research groups who have different approaches to middle-layer calculations, because any comparison requires repeatedly recalculating all the derived ages from source data using a common method. This paper describes a prototype computational infrastructure for dealing with this obstacle in one geochronological application – cosmogenic-nuclide exposure dating – that is intended to enable synoptic analysis of large, diverse data sets by making middle-layer calculations dynamic and transparent to the user.

2 Middle-layer calculations in cosmogenic-nuclide exposure dating

Cosmogenic-nuclide exposure dating is a geologic dating method that relies on the production of rare nuclides by cosmic-ray interactions with rocks and minerals at Earth's surface. As the cosmic-ray flux is nearly entirely stopped in the first few meters below the surface, the nuclide concentration in a surface sample is related to the length of time that the sample has been exposed at the surface. This enables many applications in dating geologic events and measuring rates of geologic processes that transport rocks or minerals from the subsurface to the surface, or from the surface into the subsurface (see review in Dunai, 2010). The most common of these is "exposure dating" of landforms and surficial deposits to determine, for example, the timing of glacier and ice sheet advances and retreats (e.g., Balco, 2011; Jomelli et al., 2014; Johnson et al., 2014; Schaefer et al., 2016) or fault slip rates and earthquake recurrence intervals (e.g., Mohadjer et al., 2017; Cowie et al., 2017; Blisniuk et al., 2010).
The observable data for exposure-dating applications are (i) measurements of the concentrations in common minerals of trace nuclides that are diagnostic of cosmic-ray exposure, for example beryllium-10, aluminum-26, or helium-3, and (ii) ancillary data describing the location, geometry, and physical and chemical properties of the sample. Interpreting these measurements as the exposure age of a rock surface is simple in principle: one measures the concentration of one of these nuclides, estimates the rate at which it is produced by cosmic-ray interactions, and divides the concentration (atoms g⁻¹) by the production rate (atoms g⁻¹ yr⁻¹) to obtain the exposure age (yr). It is much more complex in practice, because the cosmic-ray flux, and therefore the production rate, varies with position in the atmosphere and the Earth's magnetic field, and the production rate also depends on the chemistry and physical properties of the mineral and the rock matrix. Production rate calculations are geographically specific, temporally implicit (because the Earth's magnetic field changes over time), and require not only a model of the cosmic-ray flux throughout the Earth's atmosphere, but an array of additional data including atmospheric density models, paleomagnetic field reconstructions, nuclear interaction cross-sections, and others. In addition, production rate models are empirically tuned using sets of "calibration data," which are nuclide concentration measurements from sites whose true exposure age is independently known.
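
The "simple in principle" calculation described above can be written in a few lines. The following is a minimal sketch only, not the method used by any of the online calculators: it assumes a constant production rate and ignores radioactive decay, erosion, and all spatial and temporal variation. The function name and the numerical values are hypothetical and chosen purely for illustration.

```python
def simple_exposure_age(concentration_atoms_g, production_rate_atoms_g_yr):
    """Return an exposure age in years, assuming a constant production rate.

    Hypothetical helper for illustration only: it ignores radioactive decay,
    erosion, and all spatial and temporal variation in the production rate.
    """
    if production_rate_atoms_g_yr <= 0:
        raise ValueError("production rate must be positive")
    # age (yr) = concentration (atoms/g) / production rate (atoms/g/yr)
    return concentration_atoms_g / production_rate_atoms_g_yr


# Made-up values: 1.0e5 atoms/g accumulated at 5 atoms/g/yr.
age_yr = simple_exposure_age(1.0e5, 5.0)
print(f"{age_yr:.0f} yr")  # prints: 20000 yr
```

All of the practical complexity described in the text enters through the production rate, which in a real calculation is itself a function of location, time, and sample properties rather than a single number.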
The interpretive middle layer for exposure-dating, therefore, includes physical models for geographic and temporal variation in the production rate, numerical solution methods, geophysical and climatological data sets, physical constants measured in laboratory experiments, and calibration data. All these elements are the subject of active research: new production rate scaling models and magnetic field reconstructions are developed every 1-3 years, and several new calibration data sets are published each year. The result of this continuous development is that nearly all cosmogenic-nuclide exposure ages in published literature have been calculated with production rate models, physical parameters, or calibration data sets that are now obsolete.
Figure 1. Conceptual workflow for applications of cosmogenic-nuclide exposure dating (or, in principle, nearly any other field of geochronology). Any large-scale analysis of ages or process rates needs to continually assimilate a growing observational data set and improving middle-layer calculations, or else it will be immediately obsolete.
At present, middle-layer calculations for exposure dating most commonly utilize "online exposure age calculators" developed by, e.g., Balco et al. (2008), Ma et al. (2007), Marrero et al. (2016), or Martin et al. (2017). These are online services in which a web server executes a script that carries out production rate and exposure-age calculations, and returns results formatted so as to be easily pasted into a spreadsheet. The typical workflow for comparison or analysis of exposure-age data relies on manual, asynchronous use of one or more of these services, in which researchers: (i) maintain a spreadsheet of their own and previously published observational/analytical data; (ii) cut-and-paste from this spreadsheet into an online calculator; (iii) cut-and-paste calculator results back into the spreadsheet; and (iv) proceed with analysis of the results. The ability to use the online calculators in this way to produce an internally consistent set of results has been valuable in making synthesis of large data sets drawn from multiple sources possible at all. However, this procedure creates redundancy and inconsistency among separate compilations by many researchers; relies on proprietary data compilations that are, in general, not available for public access and validation; interposes many manual data manipulation steps between data acquisition and downstream analysis; creates a "lock-in" effect in which the effort required to recalculate hundreds or thousands of exposure ages using one scaling method is a disincentive to experimenting with others; and makes it difficult and time-consuming to assimilate new data into either the source data set or the middle-layer calculations.

3 A transparent-middle-layer infrastructure
These disadvantages of the current best-practice approach of manual, asynchronous use of the online exposure age calculators could be corrected, and synoptic visualization and analysis of exposure-age data better enabled, by a data management and computational infrastructure having the following elements.

1. A data layer: a single source of observational data that can be publicly viewed and evaluated, is up to date, is programmatically accessible to a wide variety of software using a standard application program interface (API), and is generally agreed upon to be a fairly complete and accurate record of past studies and publications, beneath:

2. A "transparent" middle layer that dynamically calculates geologically useful results, in this case exposure ages, from observational data using an up-to-date calculation method or methods, and serves these results via a simple API to:

3. An analysis layer, which could be any Earth science application that needs the complete data set of exposure ages for analysis, visualization, or interpretation.
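
The three elements above can be sketched in code to show how they interact. This is a toy illustration under stated assumptions, not the ICE-D implementation: the in-memory list stands in for a database API, the "method" is the constant-production-rate division used only as a placeholder for a real production rate model, and every name and number is hypothetical.

```python
from typing import Callable, Dict, List

# --- Data layer (hypothetical stand-in): observations only, no derived ages.
OBSERVATIONS: List[Dict] = [
    {"sample": "A-1", "n10_atoms_g": 1.0e5},
    {"sample": "A-2", "n10_atoms_g": 2.0e5},
]


def get_observations() -> List[Dict]:
    """Serve stored observational data (stand-in for a database API)."""
    return [dict(obs) for obs in OBSERVATIONS]


# --- Middle layer (hypothetical): interchangeable age calculation methods.
def make_constant_rate_method(production_rate: float) -> Callable[[Dict], float]:
    """Build a toy method: age = concentration / constant production rate."""
    def method(obs: Dict) -> float:
        return obs["n10_atoms_g"] / production_rate
    return method


def calculate_ages(method: Callable[[Dict], float]) -> Dict[str, float]:
    """Compute ages on demand from the current observations; nothing is stored."""
    return {obs["sample"]: method(obs) for obs in get_observations()}


# --- Analysis layer: requests ages when they are needed.
def oldest_sample(method: Callable[[Dict], float]) -> str:
    """Return the name of the sample with the greatest calculated age."""
    ages = calculate_ages(method)
    return max(ages, key=ages.get)


old_method = make_constant_rate_method(5.0)  # made-up production rates
new_method = make_constant_rate_method(4.0)
print(calculate_ages(old_method))
print(calculate_ages(new_method))
print(oldest_sample(new_method))
```

In this sketch, replacing `old_method` with `new_method` changes every age without touching the stored observations or the analysis function, and a newly added observation appears in the next analysis automatically.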

This eliminates unnecessary effort and the associated lock-in effect created by manual, asynchronous application of the middle-layer calculations to locally stored data by individual users, and allows continual assimilation of new data or methods into both the data layer and middle layer without creating additional overhead at the level of the analysis application. Potentially, this structure also removes the necessity for redundant data compilation by individual researchers by decoupling agreed-upon observational data (which are the same no matter the opinions or goals of the individual researcher and therefore can be incorporated into a single shared compilation) from calculations or analyses based on those data (which require judgements and decisions on the part of researchers, and therefore would not typically be agreed upon by all users). The subsequent sections of this paper describe the ICE-D (Informal Cosmogenic-Nuclide Exposure-age Database) infrastructure, a prototype implementation of this concept.

4 The ICE-D implementation
The ICE-D transparent-middle-layer infrastructure prototype includes example implementations of all three layers in the transparent-middle-layer architecture. It consists of (i) a networked database server storing observational data needed to compute exposure ages, (ii) a networked Linux server that performs middle-layer calculations with MATLAB/Octave code used in version 3 of the online exposure age calculator described by Balco et al. (2008) and subsequently updated, and (iii) a web server that responds to user requests by acquiring data from the database server, passing the data to the middle-layer server for calculation of exposure ages, and returning observations, derived exposure ages, and some related interpretive information to the user (Figure 2). The effect is that a user interacting with the web server can browse and work with large data sets of exposure ages, originally collected and published by many researchers over several decades, without the necessity of managing the data set or repeatedly recalculating all the exposure ages using a common method. Data management and middle-layer calculations are transparent to the user, allowing focus on data visualization, discovery, and analysis.
The ICE-D prototype relies on cloud computing services available at low or zero cost from Google, Amazon Web Services, or other vendors; the current implementation uses Google Cloud Services (https://cloud.google.com). The data layer is a MySQL database server provided by the Google Cloud SQL service. The middle layer is a virtual machine on the Google Compute Engine service, running CentOS 7 and the Octave code implementation of the online exposure age calculator, with a new API that facilitates programmatic use of the server. The web server that provides an example of a visualization/analysis layer is Python code running on the Google App Engine framework.

Figure 2. Generalized topology of the prototype ICE-D infrastructure compared to conventional manual, asynchronous use of online exposure age calculators. Cloud computing services interact with each other to supply raw data, calculated exposure ages, and other derived products to users as needed for different levels of analysis.

4.1 The example data layer

The purpose of the data layer is to store and serve observational data needed to calculate exposure ages, mainly including nuclide concentrations and the location, physical properties, and chemical properties of samples. It also includes some information useful for downstream analysis: for example, in a database containing exposure ages from glacial landforms, multiple samples from the same landform are grouped so as to signal that multiple ages can be averaged or otherwise combined to yield a better exposure age for the landform. The example database has a standard relational database structure, with a series of tables containing information about landforms, samples collected from landforms, and geochemical measurements on samples. Additional data tables relate samples to publications, sources of research funding, and any digital resource with a URL (e.g., field and laboratory photos or detailed reports of laboratory analyses). It is similar to the database for cosmogenic-nuclide production rate calibration data already described by Martin et al. (2017).
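
The relational structure described above (landforms, samples collected from landforms, and measurements on samples) might be laid out roughly as follows. This is a sketch under stated assumptions, not the ICE-D schema: it uses Python's built-in `sqlite3` module as a stand-in for the MySQL server, and all table names, column names, and values are hypothetical.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the MySQL data layer.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE landforms (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE samples (
    id INTEGER PRIMARY KEY,
    landform_id INTEGER NOT NULL REFERENCES landforms(id),
    name TEXT NOT NULL,
    lat_dd REAL, lon_dd REAL, elv_m REAL
);
CREATE TABLE measurements (
    id INTEGER PRIMARY KEY,
    sample_id INTEGER NOT NULL REFERENCES samples(id),
    nuclide TEXT NOT NULL,
    conc_atoms_g REAL NOT NULL
);
""")

# Made-up example rows: two samples from one landform, one measurement each.
conn.execute("INSERT INTO landforms VALUES (1, 'Example moraine')")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?, ?, ?, ?)",
    [(1, 1, "EX-01", -77.5, 160.0, 1200.0),
     (2, 1, "EX-02", -77.5, 160.1, 1210.0)],
)
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?, ?)",
    [(1, 1, "Be-10", 1.0e5), (2, 2, "Be-10", 1.2e5)],
)

# Join the three tables to list every measurement with its sample and landform.
rows = conn.execute("""
    SELECT l.name, s.name, m.nuclide, m.conc_atoms_g
    FROM landforms l
    JOIN samples s ON s.landform_id = l.id
    JOIN measurements m ON m.sample_id = s.id
    ORDER BY s.name
""").fetchall()
for row in rows:
    print(row)
```

Note that no table in this sketch has a column for an exposure age: consistent with the design described in the text, only observations are stored, and ages are left to the middle layer.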
In contrast to other services that aim to archive geochemical or geochronological data, the ICE-D database is not structured as a single entity designed to store any cosmogenic-nuclide exposure age data regardless of application, but instead consists of several separate focus area databases designed to contain restricted collections of exposure-age data needed for specific synoptic analyses. For example, ICE-D:ANTARCTICA (http://antarctica.ice-d.org) contains nearly all known exposure-age data collected from the Antarctic continent, the complete data set of which is important in reconstructing past changes in the extent and thickness of the Antarctic ice sheets. ICE-D:GREENLAND (http://greenland.ice-d.org) has a similar collection of data applicable to reconstructing past changes in the Greenland Ice Sheet. ICE-D:ALPINE (http://alpine.ice-d.org) contains the majority of published exposure-age data from mountain glacier landforms worldwide, which in the aggregate are useful for paleoclimate reconstruction or diagnosis. The advantage of this focus-area approach is that developing relatively small (∼500 measurements for ICE-D:GREENLAND; ∼4000 for ICE-D:ANTARCTICA; ∼10,000 for ICE-D:ALPINE) data sets tailored to specific synoptic analysis applications enables a database project to become scientifically useful relatively quickly. The same number of measurements distributed among all possible global applications of exposure-dating research would likely result in many incomplete and not-particularly-useful data sets.

4.3 The example analysis and visualization layer

The prototype ICE-D web server is a simple example of the type of tool that could occupy the analysis and visualization layer.
For the ICE-D:ANTARCTICA, ICE-D:GREENLAND, and ICE-D:ALPINE databases, the website provides a browse tree that allows one to view observational data and derived exposure ages for samples individually or grouped by, for example, geographic region, landform, or publication. Views of samples or groups of samples include, in various combinations, detailed reports of observational data recorded in the database, exposure ages calculated using one or more production rate scaling methods, and some examples of interpretive products such as analysis of the distribution of exposure ages on a particular landform (as is, for example, useful for glacial moraines in the ICE-D:ALPINE database) or age-elevation relationships for clusters of samples (as is useful for ice sheet thickness change reconstructions using the ICE-D:ANTARCTICA database). Thus, the prototype transparent-middle-layer implementation replaces many aspects of the conventional practice of manual, asynchronous use of the online exposure age calculators with locally stored spreadsheets, while also enabling continuous data assimilation and removing the need for each user to maintain a separate copy of the data set or keep exposure-age calculations up to date.
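
As one illustration of what an interpretive product for a landform could look like, the sketch below groups exposure ages by landform and reports a count, mean, and sample standard deviation for each. This is an assumed, deliberately simple analysis, not a description of what the ICE-D web server computes; the record layout, landform names, and ages are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical records, as they might arrive from a middle-layer service.
records = [
    {"landform": "Moraine A", "age_yr": 11800.0},
    {"landform": "Moraine A", "age_yr": 12000.0},
    {"landform": "Moraine A", "age_yr": 12200.0},
    {"landform": "Moraine B", "age_yr": 15000.0},
    {"landform": "Moraine B", "age_yr": 17000.0},
]


def summarize_by_landform(recs):
    """Group ages by landform; report count, mean, and sample standard deviation."""
    groups = defaultdict(list)
    for rec in recs:
        groups[rec["landform"]].append(rec["age_yr"])
    summary = {}
    for name, ages in groups.items():
        summary[name] = {
            "n": len(ages),
            "mean_yr": mean(ages),
            # stdev needs at least two values; report None for a single age
            "sd_yr": stdev(ages) if len(ages) > 1 else None,
        }
    return summary


for name, stats in summarize_by_landform(records).items():
    print(name, stats)
```

Because the function takes freshly calculated ages as input, it would not need to change when the underlying calculation method or the data set is updated.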
The prototype infrastructure also allows use of the transparent-middle-layer architecture for many other analysis applications. Any analysis of exposure-age data that would conventionally operate on a static, locally stored spreadsheet or data file of previously calculated ages can instead interact via standard APIs with the database and middle-layer servers to obtain an up-to-date data set of exposure ages at the time of analysis.

5 Social engineering aspects of the transparent-middle-layer concept

An often noted obstacle to participation in community data management infrastructure (e.g., Fleischer and Jannaschk, 2011; Van Noorden, 2013; Fowler, 2016) is the conflict between the broad, generalized incentive for an overall research community to develop centralized infrastructure, and the immediate incentives of researchers who might, for example, view individually authored publications as more critical to career development objectives. The transparent-middle-layer model for data management has several features that could contribute to resolving this conflict. First, as discussed above, the separation of agreed-upon observational data from interpretive calculations or analysis makes the data compilation itself agnostic with respect to differences of approach or opinion among researchers, thereby reducing potential disincentives to participation in database development. Researchers with different approaches could simply develop different middle-layer and analysis-layer codes. Second, from the perspective of an individual researcher, the transparent-middle-layer infrastructure can make it substantially faster and easier to carry out time-consuming or difficult tasks (e.g., statistical analysis, generating statistical or graphical comparisons of new and existing data, comparing data with model predictions) that are required to achieve individual goals (e.g., writing successful proposals or publishing high-impact papers).
In fact, more than 25% of sample records in the ICE-D:ANTARCTICA database at this writing are unpublished data incorporated at the request of a number of researchers, and this may be evidence that the ability to use the analysis layer in tasks such as paper writing, proposal preparation, or sharing data with collaborators provides a positive incentive for user engagement with the project. User engagement with centralized data management systems should represent a trade: users provide a service to the community by making data available, and in exchange are provided with services that help them to fulfill their own individual goals faster, better, and more easily. A transparent-middle-layer infrastructure can facilitate this exchange.
Perry Spector as well as the Polar Geospatial Center at the University of Minnesota contributed to developing geographic browsing interfaces for both databases.