From Slide Rule to Big Data: How Data Science is Changing Water Science and Engineering

Director, Swiss Federal Institute for Aquatic Science and Technology (Eawag), CH-8600 Dübendorf, Switzerland; Professor, Institute of Biogeochemistry and Pollutant Dynamics, Swiss Federal Institute of Technology Zürich, CH-8092 Zürich, Switzerland; Professor, School of Architecture, Civil and Environmental Engineering, Swiss Federal Institute of Technology Lausanne, CH-1015 Lausanne, Switzerland. Email: janet.hering@eawag.ch


Introduction
My university cohort was one of the first to be allowed to use handheld calculators (replacing the slide rules that had been used previously) in our exams (B.A. 1979) and to create our figures and write our doctoral dissertations using graphics software and word processing programs (Ph.D. 1988). I distinctly remember going to the library to consult the printed version of Chemical Abstracts as well as the period when the online version of Chemical Abstracts went back only to the mid-1980s (resulting in a steep decline in referencing of older literature). For most of my career, it seemed that these developments were incremental, and my colleagues and I adjusted to them without major changes in our approaches and expectations. Over time, however, the developments in information technology (IT) and data science have reached the point where the field of water science and engineering (like many others) is confronted with a bewildering array of options and opportunities. This is challenging our fundamental approaches and assumptions about how to do our science and bringing about cultural changes in our expectations regarding the roles of individuals and institutions in the production and sharing of knowledge.
I started to pay serious attention to these issues a few years ago in my capacity as Director of the Swiss Federal Institute of Aquatic Science and Technology (Eawag). In addition to my own personal struggles to keep abreast of the exploding amount of information relevant to Eawag's mandate and positioning, I also have to make budgetary decisions regarding investments in IT infrastructure, research data management, and open-access publications and to respond to pleas from our researchers for scientific IT services. I used an invitation to write a book chapter to engage two of my colleagues (from our IT department and library) in addressing issues related to knowledge management. In that chapter, we were able to make some inroads in addressing issues relating to research data management and open access and to lay out the special challenges posed by experiential and practical knowledge, which are highly context-dependent. We stopped well short, however, of grappling with the complexities inherent in the volumes of heterogeneous data with which we are increasingly confronted and which I address in this article.
Here, I highlight the opportunities and challenges associated with
• rapidly increasing availability of voluminous, high-resolution data on water systems,
• web-based access to information and the consequent opportunities to contribute to online data sets and/or to develop models and software collaboratively,
• applications of computational science (especially machine learning) to environmental data, and
• emerging challenges associated with open data and open science.
Although this is not a review, I have tried to reference the literature that addresses big data challenges in water science and engineering, including some of the broader literature on environmental applications. I follow the 4V concept of defining big data by volume, variety, veracity, and velocity (Farley et al. 2018). Data can be big with regard to one or more of these aspects (Fig. 1). Volume and heterogeneity (i.e., variety) of data are the most commonly considered aspects, but challenges also arise from the quality, reliability, and uncertainty of data (veracity) as well as the rates at which data are acquired or must be processed for particular applications (velocity). With this background, I illustrate some ways in which individual scientists and academic research institutions are taking advantage of new data-driven opportunities and accommodating the demands that accompany them. I also hope to be able to endorse some further steps we could take to promote the "move from data to information to knowledge and, ultimately, to action for ... sustainability and human well-being" (Ramaswami et al. 2016).

Fig. 1. Four axes along which big data can be defined: volume (more data), veracity (more uncertainty), velocity (more real-time), and variety (more heterogeneity). For a given (big) data set, a spider graph can be used to illustrate which of the 4Vs contributes the most to the bigness of the data set, whether this is simply the amount of data (volume), their heterogeneity (variety), uncertainty, and related aspects of quality and reliability (veracity), and/or the rates at which data are acquired or must be processed for particular applications (velocity). (Adapted from Farley et al. 2018.)

Data Fire Hose
For most of my career, the environmental sciences, particularly in the water domain, were rather data-poor. Aquatic scientists looked enviously at atmospheric scientists, who benefited from continuous online measurements of gases conducted from aircraft or balloons as well as ground-based (and later satellite) spectroscopic measurements that integrate over a column of air. Today, aquatic scientists and engineers are being flooded with data (pun intended). This flood has three main sources: omics, online and remotely deployed sensors, and remote sensing (Table 1). What these three sources have in common is the sheer volume of data; temporal and spatial resolution are additional challenges of the latter two sources. Omics data have expanded well beyond their origins in genomics to include high-throughput analyses of proteins (proteomics) and metabolites (metabolomics). Analysis of omics data (as well as other high-volume data) requires the development of data pipelines that automate the processes of extracting, transforming, combining, validating, and loading data for further analysis and visualization (Alley 2018). As the frequency of monitoring and/or the scale of experiments increases, data sets that have traditionally been analyzed manually also require automated pipelines for data handling and analysis (Durden et al. 2017; Farley et al. 2018; Pennekamp et al. 2017, 2018; Thomas et al. 2018a, b). Satellite observations, which at previous levels of spatial resolution were relevant mainly for marine systems, are, with improved resolution, increasingly relevant for lakes (Matthews and Odermatt 2015; Odermatt et al. 2018). Other spatially explicit data sources include remote sensing from drones and information collected by citizen scientists using mobile devices (McCabe et al. 2017).
The spatial and temporal resolution of data from remotely deployed and online sensors and from remote sensing from aircraft and satellites poses additional challenges related to linking data to their time and location as well as to visualizing data, for example, in animated maps.
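The pipeline stages named above (extracting, transforming, combining, validating, and loading) can be illustrated with a minimal sketch. It is not taken from any of the cited studies: the record layout, the station names, the plausibility bounds, and all function names are hypothetical, and the "load" step simply returns a dictionary where a real pipeline would write to a database. Each record is kept linked to its time and location, and validation is applied before the records are combined, which is one reasonable ordering among several.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass(frozen=True)
class Reading:
    """One sensor record tied to its time and location (hypothetical schema)."""
    station: str
    time: datetime
    lat: float
    lon: float
    temperature_c: float


def extract(raw_rows):
    """Split raw text rows of the form 'station,ISO time,lat,lon,value'."""
    for row in raw_rows:
        fields = [field.strip() for field in row.split(",")]
        if len(fields) == 5:
            yield fields


def transform(rows):
    """Convert text fields to typed values; unparseable rows are skipped."""
    for station, stamp, lat, lon, value in rows:
        try:
            yield Reading(station, datetime.fromisoformat(stamp),
                          float(lat), float(lon), float(value))
        except ValueError:
            continue


def is_valid(reading, t_min=-5.0, t_max=40.0):
    """Plausibility checks; the temperature bounds are illustrative only."""
    return (-90.0 <= reading.lat <= 90.0
            and -180.0 <= reading.lon <= 180.0
            and t_min <= reading.temperature_c <= t_max)


def combine(readings):
    """Group validated readings by station."""
    grouped = {}
    for reading in readings:
        grouped.setdefault(reading.station, []).append(reading)
    return grouped


def load(grouped):
    """Stand-in for a database write: mean temperature per station."""
    return {station: mean(r.temperature_c for r in items)
            for station, items in grouped.items()}


def run_pipeline(raw_rows):
    """Chain the stages: extract, transform, validate, combine, load."""
    readings = [r for r in transform(extract(raw_rows)) if is_valid(r)]
    return load(combine(readings))


# Invented example rows; one has an implausible value, one is unparseable.
RAW_ROWS = [
    "lake_a,2018-07-01T12:00:00+00:00,47.40,8.61,18.0",
    "lake_a,2018-07-01T13:00:00+00:00,47.40,8.61,20.0",
    "lake_a,2018-07-01T14:00:00+00:00,47.40,8.61,999.0",
    "lake_b,2018-07-01T12:00:00+00:00,46.45,6.55,n/a",
    "lake_b,2018-07-01T13:00:00+00:00,46.45,6.55,16.5",
]

print(run_pipeline(RAW_ROWS))  # {'lake_a': 19.0, 'lake_b': 16.5}
```

Writing each stage as a small function means that a check added for one sensor type, or a different storage backend, touches only one stage; whether this granularity suits a given monitoring program is a design decision that the sketch does not settle.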
In engineering practice, water-treatment and wastewater-treatment plants are becoming more highly automated, and remote monitoring is increasingly used in distribution and/or conveyance systems, resulting in a substantial increase in the amount of data generated during system operation. These developments offer opportunities for performance optimization (Corominas et al. 2018; Ingildsen and Olsson 2016). They may also allow for novel management strategies, such as using excess sewer capacity to reduce overflows at wastewater-treatment plants (Zhang et al. 2018). Risks associated with vulnerability to cyber-attacks may, however, be increased (Taormina and Galelli 2018; Taormina et al. 2017).

Web-Based Collaboration
Web-based access to observational databases builds on a long historical tradition of monitoring data curated by (often governmental) institutions. The incorporation of well-defined data into online databases has been relatively straightforward, but even governmental agencies face challenges of curating and conserving legacy data. This challenge has been addressed by programs to preserve "data at risk" (Griffin 2015; USGS n.d.). Formal and/or informal scientific consortia have also formed to contribute to these efforts. The Force 11 consortium works to establish norms and standards, specifically the findable, accessible, interoperable, and reusable (FAIR) data principles (Wilkinson et al. 2016). The Research Data Alliance provides a neutral space where its more than 7,000 members can "come together to develop and adopt infrastructure that promotes data-sharing and data-driven research" (RDA n.d.). A large consortium of researchers from almost 200 institutions acquired funding from a variety of sources to assemble the BioTIME database, which includes over 8 million species abundance records (Dornelas et al. 2018). Web-based collaboration can also facilitate citizen science initiatives.

Table 1. Illustrative, noncomprehensive list of relevant online resources, with entries ordered alphabetically by name
Name | Description | URL
BioTIME | The BioTIME database contains raw data on species identities and abundances in ecological assemblages through time. |
ESDL | ESDL provides access to a series of highly-curated data cubes containing preprocessed data that are ready for analysis. A framework is provided to map user-defined functions to a data cube. |
In the omics and remote-sensing domains, data have been produced in a context in which the need for online storage and access quickly became obvious. With support from the NIH, GenBank (Table 1) was established in 1982. Today, sequence data deposition is a routine aspect of publication in the molecular biology community, although questions have been raised recently about how this may be affected by the Nagoya Protocol (Deplazes-Zemp et al. 2018). With satellite data, national and international space agencies have a vested interest in improving the accessibility and usability of their data and downstream data products.
Through such online resources, individual scientists or scientific consortia have the opportunity both to contribute to and exploit the wealth of web-accessible data. Models and tools for modeling are also increasingly available through online platforms (Table 1). Online platforms provide support for collaborative and/or participatory modeling (Basco-Carrera et al. 2017; Gaudard et al. 2019; Langsdale et al. 2013), although platforms and models may be less important than trustful interpersonal interactions and adequate governance structures (Parrott 2017).

Applications of Data Sciences
Increasingly, the analysis of big environmental data (in the sense of one or more of the 4Vs in Fig. 1) relies on data science methods, particularly machine learning. In some approaches, hypotheses are generated and then tested using big data (Peters et al. 2014, 2018), which can also provide useful benchmarking for mechanistic models. Other approaches employ machine learning to extract trends or even elucidate hypotheses or model structures from data that are not biased by expectations (Ilie et al. 2017; Thomas et al. 2018a). Although this approach can be compromised by spurious correlations in the data (N. Schuwirth, "How to make ecological models useful for environmental management," submitted, Eawag, Dübendorf, Switzerland), this problem can be minimized if sampling is informed by knowledge about the system (Strobl et al. 2008) and/or if appropriate tests are applied (Broadhurst and Kell 2006). Potential problems have been illustrated by the prevalence of false positives in a study investigating the possible use of variance and/or autocorrelation as early warning indicators for the abundance of aquatic taxa (Burthe et al. 2016).
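The false-positive problem with variance and autocorrelation as early warning indicators can be made concrete with a small simulation. The sketch below is not the analysis of Burthe et al. (2016): it computes the two indicators in sliding windows and then counts how often pure random noise, which contains no approaching transition at all, nevertheless shows a rise in variance between the first and last window. The function names, the window length, and the threshold ratio of 1.5 are arbitrary assumptions chosen for illustration.

```python
import random
from statistics import mean, variance


def lag1_autocorrelation(values):
    """Lag-1 autocorrelation; returns nan for a constant series."""
    centre = mean(values)
    denominator = sum((v - centre) ** 2 for v in values)
    if denominator == 0:
        return float("nan")
    numerator = sum((a - centre) * (b - centre)
                    for a, b in zip(values, values[1:]))
    return numerator / denominator


def rolling_indicators(series, window):
    """Return (variance, lag-1 autocorrelation) for each sliding window."""
    return [(variance(series[i:i + window]),
             lag1_autocorrelation(series[i:i + window]))
            for i in range(len(series) - window + 1)]


def false_alarm_rate(n_series=200, length=100, window=25, ratio=1.5, seed=0):
    """Fraction of pure-noise series flagged by a naive variance rule.

    A series is flagged when the variance in its last window exceeds
    `ratio` times the variance in its first window. The rule and the
    default ratio are illustrative assumptions, not a published test.
    """
    rng = random.Random(seed)
    flagged = 0
    for _ in range(n_series):
        noise = [rng.gauss(0.0, 1.0) for _ in range(length)]
        variances = [v for v, _ in rolling_indicators(noise, window)]
        if variances[-1] > ratio * variances[0]:
            flagged += 1
    return flagged / n_series


print(f"false alarms in pure noise: {false_alarm_rate():.0%}")
```

Any nonzero rate here is a false positive by construction, since the simulated series are stationary. Screening many taxa or sites with such a rule multiplies the number of spurious warnings, which is why tests that account for sampling variability are needed.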
Application of data science methods is necessitated when multiple types of data inputs must be combined (e.g., data from remote sensing and high-throughput DNA analysis) and interpreted using multiple modeling frameworks, especially when there is a goal of producing near-real-time predictions as the basis for decision making (Bush et al. 2017; Dafforn et al. 2016). Real-time data analysis can also support adaptive operation of the data acquisition system, as illustrated by a recent study of turbidity currents (Paull et al. 2018). Even the sheer size of environmental data sets may preclude conventional statistical analysis and necessitate data analysis based on machine learning, which does not require assumptions regarding data distributions, shape, and covariance structure (Cox 2015). The assumptions of common statistical methods (e.g., linearity and independence of variables) are unlikely to be applicable to large, multidimensional environmental data sets.
One recognized limitation of machine-learning approaches is their lack of interpretability (Pearl 2018), which raises important questions of accountability when decision making is based on such approaches (EPFL IRGC 2018). This issue is a topic of intensive research in the data science community, although it has only begun to be addressed in the environmental research application area. In this domain, integration of mechanistic models and/or inclusion of prior knowledge may offer insights into patterns derived from computational data analysis. Methods such as gene expression programming (GEP) generate explicit model structures from a specified set of operators applied to predictor variables and can be used in a reverse engineering approach (Ilie et al. 2017). Visualization of network activations can help to identify key forcing inputs triggering specific responses.
The requirement of machine-learning approaches for sufficient data also constitutes a limitation that has been addressed by using generative adversarial networks (GANs) to generate training data sets (Li et al. 2018).
A few examples clearly demonstrate the value of the analysis and interpretation of big data on aquatic systems. At the level of process understanding, the combination of remote-sensing data on temperature and chlorophyll with three-dimensional lake modeling allows surface biomass variations to be interpreted in relation to wind-driven transient upwelling and basin-scale internal waves (Bouffard et al. 2018). Analysis of historical records has demonstrated the legacy effects of deforestation (with consequent increases in discharge and infiltration) on wetland development (Woodward et al. 2014). Improved estimates of global river runoff have indicated that rivers play a larger role in the exchange of carbon dioxide between the land surface and the atmosphere than had previously been realized (Allen and Pavelsky 2018). Concerted efforts to compile and harmonize data on dams and their impacts have provided important insights into the aggregate impacts of dams on surface freshwater storage, run-off, nutrient and sediment transport, and sea-level rise as well as the consequences for aquatic ecosystems (Chao et al. 2008; Doell et al. 2009; Grill et al. 2015; Kondolf et al. 2014; Lehner et al. 2011; Maavara et al. 2015). With the planned and anticipated increases in dam construction, such an evidence base is needed to inform decision making (Fan et al. 2015; Zarfl et al. 2015).

Open Data and Open Science
The preceding discussion was based on the presumption that there is a common understanding of what data should be deposited online. This makes sense in the context of historical monitoring data or supporting data for journal publications but becomes blurred in the emerging context of open science, which incorporates the entire research cycle (Bueno de la Fuente n.d.). The caching of intermediate results, such as outputs of simulation runs, has been explicitly recommended (Peters et al. 2014), although this is widely considered to be impracticable. Although the depositing of genomic data is well-established, the increasing trend toward resequencing (from which the DNA of a specific individual can be compared against a composite reference genome) raises the question of what data must be stored: the full resequenced genome or a compressed version based on the reference genome (Pinho et al. 2012). At the other extreme, data produced by detectors at the Large Hadron Collider (LHC) at CERN are subjected to real-time analysis to reduce data volumes by factors of 1,000-10,000 before data storage (Gligorov 2015). The demands for data storage and speed of data transmission are two of the most visible challenges for academic research institutions.
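The storage question raised by resequencing can be illustrated with a toy example of reference-based encoding, in which only the positions where an individual's sequence differs from the reference are kept. This is a deliberately simplified sketch and not the method of Pinho et al. (2012): it assumes sequences of equal length that differ by substitutions only (no insertions or deletions), and the function names and example sequences are invented.

```python
def encode_against_reference(reference, sample):
    """Return (position, base) pairs where the sample differs from the reference.

    Toy assumption: both sequences have the same length, so only
    substitutions can be represented.
    """
    if len(reference) != len(sample):
        raise ValueError("this toy encoder handles substitutions only")
    return [(position, base)
            for position, (ref_base, base) in enumerate(zip(reference, sample))
            if base != ref_base]


def decode_with_reference(reference, differences):
    """Rebuild the full sample sequence from the reference and the differences."""
    bases = list(reference)
    for position, base in differences:
        bases[position] = base
    return "".join(bases)


reference = "ACGTACGTAC"   # invented reference fragment
sample = "ACGTTCGTAA"      # invented individual sequence

differences = encode_against_reference(reference, sample)
print(differences)  # [(4, 'T'), (9, 'A')]
assert decode_with_reference(reference, differences) == sample
```

The compressed form is only usable together with the exact reference it was built against, so storing differences shifts part of the archiving burden to the long-term preservation and versioning of the reference itself.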

Institutional Challenges and Opportunities
There is no shortage of papers promising that big data will provide the basis for a profound improvement in our understanding of environmental systems and our capacity to manage them (Dafforn et al. 2016; Durden et al. 2017; Farley et al. 2018; Peters et al. 2014, 2018). Activities in synthesis centers such as The National Center for Ecological Analysis and Synthesis (NCEAS) and The National Socio-Environmental Synthesis Center (SESYNC) have demonstrated the power of data sharing in posing and answering previously intractable questions (Farley et al. 2018). The caveat is the level of investment that will be needed to capture these benefits. Needs for data storage and transmission will require upgrading of IT infrastructure. Support from informatics and data science experts will be needed for environmental scientists to apply computational methods to their data and models. But cultural changes in the attitudes and expectations of environmental scientists will also be needed to support the sharing of data as well as their collaborative use, interpretation, and presentation (Dafforn et al. 2016; Durden et al. 2017; Peters et al. 2014, 2018). Application of data sciences further imposes the need to share code and workflows, which requires proper annotation to support reproducibility (Hutton et al. 2016).
Research institutions must be aware of how their incentive systems (i.e., hiring, promotion, and tenure) may bias against data sharing and collaborative activities, issues that are particularly problematic for junior researchers (Gewin 2016). Even decisions about using proprietary or open-source software, which are often made at the level of an individual investigator or research group, can have important implications for further collaborative use of research products. At the same time, institutions have the capacity to support platforms for collaboration (such as the Swiss Data Science Center; SDSC n.d.) and to promote collaborative activities as exemplified by the July 2018 call for a biodiversity knowledge alliance (GBIF n.d.). Simply keeping abreast of all these developments poses its own challenges. Here, institutions can promote the FAIR data principles (Wilkinson et al. 2016) and encourage cross-referencing, harmonization, and (when appropriate) consolidation of platforms (Hering and Vairavamoorthy 2018). Funding agencies, in particular, should pay attention to the inherently transient nature of project-based platforms and take steps to ensure that successful platforms are embedded in an institutional structure. In general, successful platforms could be considered as small wins (Termeer and Dewulf 2018) whose aggregation could help to increase the visibility, accessibility, and reuse of environmental data.
I am convinced that the ability to access big data on water systems, combine these data with modeling, and update models (i.e., data assimilation) will dramatically expand our understanding of these systems and provide a robust basis for real-time prediction and systems control and/or management. The water sector is well known for its long time horizons (i.e., accompanying major infrastructure investments) and consequent inflexibility. The ability to monitor and model water systems more accurately and respond more quickly to observed changes could provide a basis for adaptive management. Allowing for more variance in water systems could help to improve their resilience (Carpenter et al. 2015). The effective use of big data could also provide the basis for balancing trade-offs in integrated land and water management (Davis et al. 2015) and for adaptive management in the restoration of aquatic ecosystems (Geist and Hawkins 2016). Big data offer an exciting opportunity to make our management of water systems more sustainable. As the capstone of my professional journey through the evolving landscape of data science, I hope to foster the cooperation and focus on outcomes and impacts that will be needed to realize this promise.