Big Data Analytics system for costing power transmission projects

1. Juan Manuel Davila Delgado, PhD., manuel.daviladelgado@uwe.ac.uk, Associate Professor, University of the West of England, UK
2. Lukumon Oyedele, PhD., l.oyedele@uwe.ac.uk, Professor, University of the West of England, UK
3. Muhammad Bilal, PhD., muhammad.bilal@uwe.ac.uk, Associate Professor, University of the West of England, UK
4. Anuoluwapo Ajayi, PhD., anuoluwapo.ajayi@uwe.ac.uk, Associate Professor, University of the West of England, UK
5. Lukman Akanbi, PhD., lukman.akanbi@uwe.ac.uk, Associate Professor, University of the West of England, UK
6. Olugbenga Akinade, PhD., olugbenga.akinade@uwe.ac.uk, Associate Professor, University of the West of England, UK
*Corresponding author.


Introduction
Reliable cost estimation is a critical factor for the successful delivery of construction projects. Poor cost estimates have been identified as a significant factor contributing to cost overruns and profit erosion (Ahiaga-Dagbui and Smith, 2014; Sridarran et al., 2017). Accurate and systematic cost estimates are of utmost importance for the construction sector, where low profit margins are typical. The average profit margin of the top 100 UK construction companies was 1.5% in 2016 and 2.5% in 2017 (TCI, 2018). Cost estimation for construction projects has been a topic of intense study for many years. Estimating costs in construction is challenging because of many factors, including the limited information available at early phases of the project, the many different clients, the dissimilarity among projects, the diverse contexts and sites, and the large labour force required (Wilson, 2005). Cost estimation of power transmission projects has specific challenges such as large and complex construction sites, difficult accessibility (e.g. highway and river crossings), varying contexts (urban, rural, and protected natural areas), and unknown soil conditions. However, compared with traditional construction projects, power transmission projects also offer advantages for implementing data-driven cost estimation methods, including a smaller number of potential clients, considerable similarity among projects, and limited types of materials and plant equipment, all of which facilitate data collection and management.
Many data-driven cost estimation methods have been developed and tested (e.g. Hwang, 2009; Lowe et al., 2006; Sonmez, 2008), and most research efforts have focused on developing more accurate methods and comparing their performance, i.e. on finding the best possible method (e.g. G.-H. Kim et al., 2004). However, there is still no consensus on the best method for cost estimation. More importantly, most construction companies have not adopted advanced predictive models for cost estimation, and, in practice, cost estimation still relies heavily on human expertise rather than on systematic data-driven methods (Carr, 1989; Meredith et al., 2014).
A major indication that current cost estimation practices in construction are still unreliable is the number of projects experiencing cost overruns, for which the construction industry is well known. For example, a study found that large infrastructure projects across the world are on average 80% over budget (Agarwal et al., 2016). Many factors contribute to cost overruns, but poor cost estimation practices are a major influencing factor (e.g. Adam et al., 2017; Larsen et al., 2016). Aljohani et al. (2017) point out that estimation methods used in practice are still affected by the estimator's bias and varying degrees of experience. They also note that, in practice, data-driven methods are not fully used because data sources are unavailable or unreliable. The many different clients, project types, sites, materials and subcontractors complicate data collection. If data is available, it is usually not accessible through a single interface, and it is stored in different locations and formats, which limits its use.
Although data unavailability and poor data management practices are major factors limiting the use of data-driven cost estimation methods (Aljohani et al., 2017), methods reported in the literature usually do not present the data management frameworks and approaches required to enable them. This is becoming more relevant as the amount of data collected by construction companies is increasing substantially and traditional data management methods cannot cope with it. For example, none of the traditional methods provide a way to integrate different types of data (Gerrish et al., 2015), support dynamic visualisations (Davila Delgado et al., 2018; Mousa et al., 2016), or provide real-time links with Big Data repositories (Bilal et al., 2016).
Equally important is the lack of comparisons between the performance of predictive data-driven methods and that of the methods usually used in practice. Most studies in the literature only indicate how accurate the proposed methods are and present no comparison with the methods actually used in practice. There is therefore no clear indication of how much better predictive data-driven methods are than the ones used in practice, or of whether the increase in performance justifies the investment required to implement them.
This study seeks to address two gaps: the lack of comparisons between data-driven cost estimation methods and manual methods, and the lack of demonstrations of data management approaches that cope with large amounts of diverse data. The objectives of this study are:
(1) To obtain a quantitative indication of the difference in performance between the most common predictive data-driven cost estimation methods and the traditional manual methods used in practice.
(2) To demonstrate the implementation of a Big Data architecture that can manage large amounts of data from diverse sources and in different formats.
This paper presents the Predictive Analytics and Modelling System (PAMS), which integrates and uses historical financial and project data to predict costs. PAMS enables the extraction of valuable insights by integrating large and varied datasets and performing predictive analytics.

Big Data in Architecture, Engineering and Construction (AEC)
Big Data is the term coined to describe sets of data that are so large, complex, and heterogeneous that traditional software applications cannot process them. The main defining attributes of Big Data are (i) volume, i.e. the amount of data; (ii) variety, i.e. different file formats and structures; and (iii) velocity, i.e. the speed at which the data is queried and processed (Erl et al., 2016).
Irrespective of the term "Big Data", the size of the datasets is only one challenging aspect of many, and it is, in most cases, not the most significant one (Boyd and Crawford, 2011).
Additional attributes have been identified such as value, vision, validation; but for the AEC industry veracity (i.e. the consistency, completeness, and reliability of data) and variety (i.e. different file formats, e.g. 2D drawings, 3D models, pictures, animations, spreadsheets) are the key Big Data attributes that constitute a significant challenge for Big Data adoption (Bilal et al., 2016).
Compared with other more modern and less established industries, the amount of data generated in the AEC industry is smaller by some orders of magnitude. Nevertheless, the industry must also deal with increasing amounts of data generated during the entire life cycle of built assets. The increasing amounts of data are driven by the push towards the adoption of the Smart Cities (Batty et al., 2012; Zanella et al., 2014) and Smart Infrastructure (Al-Hader and Rodzi, 2009; Hoult et al., 2009) paradigms, both of which rely on sensing and monitoring systems and on 3D digital representations of built assets that include performance and condition data (Khan and Hornbaek, 2011). Built assets have been instrumented with various types of sensors and embedded devices, which generate large and dynamic sets of data. For example, sensors are used in buildings to monitor temperature variations (Chen et al., 2014), indoor air quality (Kumar et al., 2016), and occupancy (Akkaya et al., 2015). They are also used to monitor power consumption (Suryadevara et al., 2015), structural condition (Davila Delgado et al., 2017), and surrounding environmental conditions (Martín-Garín et al., 2018). However, current data management frameworks used in the AEC industry cannot handle these ever-increasing and diverse datasets. Big Data management frameworks and programming models must be adopted to handle and process the data effectively. Otherwise, relevant insights and value cannot be extracted from the generated data.
Big Data analytics is the broad term that refers to the various methods used to extract insights from data. It draws techniques from existing fields such as statistics, data mining, and business analytics, and applies them to large and diverse datasets. The types of analyses carried out in Big Data analytics can be classified into the following categories (Figure 1 presents the categories and lists examples of methods used in each category):
(i) Descriptive analytics, which uses statistical methods to describe and quantify basic features of the dataset.
(ii) Predictive analytics, which uses diverse techniques to analyse current data and make predictions about future events.
(iii) Prescriptive analytics, which uses the insights gained by descriptive and predictive analytics to devise actions that lead to a defined goal, e.g. cost reduction.
All these types of analyses can support the AEC industry in general and have the potential to unleash a new period of productivity improvement, which is a crucial interest for the sector as a whole (Abdel-Wahab and Vogl, 2011; Liberda et al., 2003).
Research efforts reported in literature have been focused mainly on identifying challenges and potential architectures. For example, the Big Data challenges for storing and visualising massive BIM models have been studied (Chen et al., 2016;Gao et al., 2017). Challenges for handling geospatial data (Yang et al., 2017), Earth observation data (Xia et al., 2018), challenges regarding building energy efficiency (Koseleva and Ropaite, 2017), and for managing big visual data and BIM (Han and Golparvar-Fard, 2017) have been reported as well.
Nonetheless, there is a significant interest in leveraging Big Data technologies to improve a wide variety of tasks in the AEC industry. These tasks range from support at the conceptual design stage, construction and planning, to tasks related to operations and facility management.
However, work has focused on operations and facility management due to easier access to data.
For example, Big Data analytics have been used to predict air passenger demands in airports (Kim and Shin, 2016), to model commuting patterns (Wan et al., 2018), and to infer transport mode using data from mobile devices (Semanjski et al., 2017). Also, Jeong et al. (2017;2019) presented a cloud-based Big Data management and analytics framework that handles the massive and diverse datasets used for bridge monitoring. Wang et al. (2018) presented a Big Data approach to identify potential quality issues in construction components.
However, Big Data has not been widely employed to support tasks in the preconstruction stage such as bidding, tendering, costing, and scheduling. This stage represents a massive opportunity for cost reduction, as flawed decision-making and planning lead to mounting costs and delays during construction (Chan et al., 2004; Olawale and Sun, 2010). Poor cost estimation often determines whether a project is profitable or not. This is the focus of the study presented in this paper, as illustrated in Figure 1.

Predictive analytics
Numerous efforts have been made to apply intelligent systems to the increasingly complex problems of the AEC area (Irani and Kamal, 2014). A significant application is predictive analytics, a process that uses data and statistical algorithms to predict future outcomes. It is mostly used in the financial, healthcare and marketing sectors. The most widely used predictive technique is linear regression, a simple approach to modelling the relationship between two variables. The relationship is modelled using linear predictor functions whose unknown parameters are estimated from available data. More sophisticated techniques also exist, such as Decision Trees, Support Vector Machines (SVM), and Artificial Neural Networks (ANNs). Predictive analytics has not been used widely in the AEC industry, where simulation approaches are preferred. For example, energy consumption is often predicted by creating buildings in virtual environments and simulating their potential energy use (Elbeltagi et al., 2017). However, the idea of generating solutions to given problems based on data has been broadly investigated. For example, ANNs, a resurgent and prominent method for predictive analytics, have been used to aid in the design of water harvesting structures (Chandwani et al., 2016), to predict and control cooling loads in buildings (Venkatesan and Ramachandraiah, 2018), to predict risks for building maintenance (de Silva et al., 2013), and to predict the escalation of highway construction cost over time (Wilmot and Mei, 2005). Generative and Genetic Algorithms (GAs) have been used to predict structural design solutions given limited spatial data (Hofmeyer et al., 2013). GAs have also been used to analyse the static security of electric power systems (Canto dos Santos et al., 2015) and to generate optimal construction schedules (Faghihi et al., 2014).
GAs have been combined with ANNs to optimise environmentally friendly buildings (Sun et al., 2015). SVM has been used to predict failures of construction companies (Horta and Camanho, 2013). Linear interpolation methods have been used to predict the annual electricity consumption of elevators (Tukia et al., 2016) and cooling loads (Geekiyanage and Ramachandra, 2018). Intelligent decision support systems have been developed to facilitate the management of construction processes (Hajdasz, 2014) and rule-based support systems have been used to check regulatory compliance (Beach et al., 2015). Simulation approaches have been combined with ANN models for managing Heating, Ventilation and Air Conditioning (HVAC) systems (Faizollahzadeh Ardabili et al., 2016).

Cost Estimation
Cost estimation for construction has been thoroughly investigated. To date, there is no consensus on the superiority of any single method, even though the performance of different methods has been compared and evaluated (e.g. Shane et al., 2009; Trost & Oberlender, 2003). This subsection presents a brief overview of the construction cost estimation landscape.
(1) Analytical and numerical approaches. These approaches define cost estimation as a regression problem and use different techniques of varying complexity to solve it (e.g. Hwang, 2009;Lowe et al., 2006;Sonmez, 2008). For example, ANNs have been used to predict cost of road tunnel construction based on geological attributes (Petroutsatou et al., 2012), to predict cost of retrofitting due to earthquakes (Jafarzadeh et al., 2014), and to predict costs of irrigation improvement projects (ElMousalami et al., 2018). Firouzi et al. (2016) present a method to predict total cost with dependent cost items, and Dursun & Stoy (2016) present a method that stacks predictive models to create a multistep estimation method.
Approaches that use existing BIM models or design documents to produce cost estimates exist as well. For example, methods to automate the manual process of generating a cost estimate of structural elements (Jadid and Idrees, 2007) and of steel frames (Barg et al., 2018) have been reported in the literature. An approach to cost estimation based on the level of detail of the BIM model (Cheung et al., 2012) and a method that uses Industry Foundation Classes (IFC) data to estimate costs (Ma et al., 2013) have been reported as well. Lastly, Asmar et al. (2011) present an approach to systematise the cost estimation of highway projects at planning stages. The disadvantage of these methods is that they require an existing BIM model, which in most cases is not available at early tendering stages.
(2) Knowledge-based approaches. These approaches use codified expert knowledge to support cost estimation. For example, Choi et al. (2014) present a method to estimate the cost of roads using case-based reasoning. Yildiz et al. (2014) present a knowledge-based tool that maps risks to support cost estimation. Ahn et al. (2014) present a case-based reasoning approach to identify attributes that impact cost estimation, and Ahn et al. (2017) present a method to determine the effect of covariance in case-based reasoning approaches for cost estimation. The main disadvantages of these methods are that their performance is restricted by the quality of the encoded expert knowledge and cases, and that their transferability is limited.
(3) Improvement of cost estimation factors and processes. These approaches seek to improve the factors affecting the accuracy of cost estimation methods. For example, Yu et al. (2006) present an approach to define cost indexes in real time; and Cheng et al. (2013) present a model that uses economic and financial data (e.g. Consumer Price Indexes, oil prices, stock market indexes) to predict variation in costs using a modified version of SVM. Hwang (2011) presents a method that uses time series indexes to consider the trends of costs in the market during construction.
More research efforts are required to address cost overruns in the construction industry. While various factors are at play, accurate cost estimates are essential to reduce cost overruns. More importantly, in the existing cost estimation literature for construction, few studies address the use of Big Data architectures and approaches to deal with large amounts of diverse data for cost estimation. Also, there is an insufficient number of comparisons between existing methods used in practice and data-driven methods. The motivation of this study is to address these gaps and contribute to the reduction of cost overruns in the construction industry.

Predictive Analytics and Modelling System (PAMS)
Variety. The data is stored as graph triples using the Resource Description Framework (RDF) (Klyne and Carroll, 2006). RDF makes it possible to merge and model data from different sources without a predefined schema, which facilitates the quick loading of data from varied sources and with different structures. A number of domain ontologies were developed for the graph database. These ontologies standardised the data extracted from the diverse data sources.
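To illustrate why a triple-based model eases the merging of sources with different schemas, the following minimal sketch stores records from two hypothetical sources as (subject, predicate, object) triples in plain Python. The project identifier, predicate names, and values are illustrative assumptions and are not taken from the PAMS ontologies; a production system would use an RDF store rather than an in-memory set.

```python
from typing import Iterable, Optional, Set, Tuple

Triple = Tuple[str, str, object]


def record_to_triples(subject: str, record: dict) -> Set[Triple]:
    """Turn one flat record into (subject, predicate, object) triples, skipping empty fields."""
    return {(subject, key, value) for key, value in record.items() if value is not None}


def match(store: Iterable[Triple], s: Optional[str] = None, p: Optional[str] = None) -> list:
    """Return the triples matching a subject and/or predicate (None acts as a wildcard)."""
    hits = (t for t in store if (s is None or t[0] == s) and (p is None or t[1] == p))
    return sorted(hits, key=lambda t: (t[0], t[1], str(t[2])))


# Two hypothetical sources describing the same project with different columns.
finance_row = {"actualCost": 1_250_000, "estimatedCost": 1_100_000}
project_row = {"distanceKm": 12.4, "region": "South West", "actualCost": None}

# No shared schema is needed: every field simply becomes one more triple.
store: Set[Triple] = set()
store |= record_to_triples("project:001", finance_row)
store |= record_to_triples("project:001", project_row)

for triple in match(store, s="project:001"):
    print(triple)
```

Because each field is an independent triple, a new source with additional columns can be loaded without altering the records already stored.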
Volume. The Cloudera Distribution including Apache Hadoop (CDH), a collection of software built around the Hadoop Distributed File System (HDFS), was used for Big Data management because it provides reliable and high-performance access to large datasets in distributed file systems and enables parallel processing. The graph database was implemented using Apache HBase, an open-source, non-relational, distributed database that runs on top of HDFS and provides a robust way of storing and querying large quantities of sparse data.

Velocity. A vital requirement of PAMS is the ability to analyse the stored historical data quickly. The Big Data programming model Spark was used to process the developed predictive models. Spark is at the core of the Berkeley Data Analytics Stack (BDAS), which is regarded as a next-generation framework for large-scale data processing (Ryza et al., 2015). BDAS is an open-source data analytics framework that provides fast response times for complex computations on large datasets. Compared with traditional analytics frameworks, BDAS achieves very fast processing times by enabling large-scale in-memory data processing. The analytical pipelines for processing the predictive models were implemented in the R language and environment through SparkR, an R package that provides a frontend to Apache Spark, and MLlib, a Spark machine learning library with an R interface. R was selected because it is widely used in academia, is open source, and supports multicore task distribution. The popularity of R for Big Data analytics is growing, and it is becoming a de facto standard, which facilitates further developments. The results of the predictive models are mapped to objects in the Data Services Layer, which exposes them to the client application through a REST (Representational State Transfer) API (Application Programming Interface). This enables the client application to query data via the web robustly and quickly. Data requests and responses are exchanged in the JSON (JavaScript Object Notation) format.
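As an illustration of the JSON request and response exchange described above, the following sketch parses a request, calls a model, and serialises the reply. The field names (`distance_km`, `predicted_cost_gbp`), the handler, and the placeholder model with made-up coefficients are all hypothetical; the actual PAMS API is not documented here.

```python
import json
from typing import Callable


def handle_prediction_request(body: str, predict: Callable[[float], float]) -> str:
    """Parse a JSON request, call the supplied model, and return a JSON response."""
    request = json.loads(body)
    distance_km = float(request["distance_km"])  # hypothetical field name
    response = {
        "distance_km": distance_km,
        "predicted_cost_gbp": round(predict(distance_km), 2),  # hypothetical field name
    }
    return json.dumps(response)


def toy_model(distance_km: float) -> float:
    """Placeholder model with made-up coefficients, used only to exercise the handler."""
    return 50_000.0 + 100_000.0 * distance_km


reply = handle_prediction_request('{"distance_km": 2.5}', toy_model)
print(reply)
```

Passing the model in as a function keeps the serialisation logic independent of whichever predictive model produces the estimate.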
The Big Data Analytics Environment also deals with veracity, another attribute of Big Data. Methods to address the missing and unreliable data that are common in the construction sector have been implemented and are presented in this paper. While these aspects are not the most commonly addressed in Big Data studies, they are nevertheless essential to the construction industry, which lags in the adoption of Big Data technologies.

Big Data Analytics
Data from overhead power transmission projects constructed in the United Kingdom in the last ten years has been compiled and used as a basis to develop the predictive models in the Big Data Analytics Environment. It includes financial data, i.e. the total actual cost, the estimated cost, profit, profit margin, distance, and region of each project. This section focuses on describing the analysis carried out to predict the total cost of projects using the financial data.

Data Pre-processing
The data was collected from various heterogeneous sources, i.e. Microsoft Excel files, CSV (comma-separated values) files, and relational databases. Incorrect, incomplete, and inconsistent records were identified and removed, e.g. typos and values that were many orders of magnitude larger than the mean of that variable. Missing and removed profit values were resolved using mean imputation, i.e. substituting the missing values with the mean of that variable over all other cases (Roderick and Rubin, 2002; Scheffer, 2002). This method was selected because it was assumed that the missing data was not a structural characteristic of the dataset and there were no indications of a strong correlation between profit and the other available variables, e.g. cost, distance, and region; for example, see [...] distance variables while providing variation to the generated data. The same approach was used for missing cost values. Once the data was cleansed, it was condensed and loaded into the graph database.
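The cleaning steps described above can be sketched as follows. This is a minimal illustration on made-up values, not the authors' pipeline: the cut-off factor is an assumption, and the median is used as the reference for detecting extreme values (rather than the mean mentioned in the text) so that the tiny example is not distorted by the outlier itself.

```python
from statistics import mean, median
from typing import List, Optional


def drop_extreme(values: List[Optional[float]], factor: float = 1000.0) -> List[Optional[float]]:
    """Mark values whose magnitude exceeds `factor` times the median magnitude as missing."""
    observed = [abs(v) for v in values if v is not None]
    limit = factor * median(observed)  # assumed cut-off, not taken from the paper
    return [v if v is None or abs(v) <= limit else None for v in values]


def impute_mean(values: List[Optional[float]]) -> List[float]:
    """Replace missing values with the mean of the observed values (mean imputation)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Made-up profit values: one missing entry and one obvious recording error.
profits = [10.0, 12.0, None, 14.0, 1e9]
cleaned = impute_mean(drop_extreme(profits))
print(cleaned)
```

Mean imputation keeps the variable's mean unchanged but reduces its variance, which is one reason it is usually reserved for cases where the missing values are assumed to be unrelated to the other variables.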

Characteristics of the dataset
The dataset used for this study is a compilation of financial data from overhead power transmission projects constructed in the United Kingdom in the last ten years, resulting in over 2.75 million data points. This dataset is not particularly large when compared with datasets used for Big Data analytics in other fields, such as marketing and customer analytics, which use data from hundreds of millions of users. Nevertheless, it is very large when compared with the typical datasets used for cost estimation in construction, which usually range from a few dozen to around a hundred projects (e.g. ElMousalami et al., 2018; Petroutsatou et al., 2012). Moreover, the Big Data architecture proposed in this paper can handle significantly larger datasets and scales seamlessly as more data becomes available. Figure 3 shows a sample of the frequency distribution of the cost of the projects. Project costs range from £10k to £50m. The data has a positively skewed distribution, and 94% of all the projects cost less than £4.8m. Figures 4 and 5 present the relative distribution of the region and cost for projects costing up to £100k and more than £100k, respectively. Based on these characteristics, only the projects with a cost between £50k and £4.8m were used to develop the predictive models.
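The descriptive checks mentioned above (skewness, the share of projects below a threshold, and the selection of a cost range) can be sketched as follows. The cost values are synthetic and chosen only to exercise the functions; they are not drawn from the study's dataset.

```python
from typing import List


def skewness(values: List[float]) -> float:
    """Moment-based sample skewness; a positive result indicates a long right tail."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def share_below(values: List[float], threshold: float) -> float:
    """Fraction of values strictly below the threshold."""
    return sum(v < threshold for v in values) / len(values)


def within_range(values: List[float], low: float, high: float) -> List[float]:
    """Keep only the values inside the closed interval [low, high]."""
    return [v for v in values if low <= v <= high]


# Synthetic project costs in GBP (illustrative only).
costs = [20_000, 60_000, 90_000, 150_000, 400_000, 1_200_000, 4_000_000, 30_000_000]

print(skewness(costs) > 0)                      # positively skewed
print(share_below(costs, 4_800_000))            # share of projects under GBP 4.8m
print(within_range(costs, 50_000, 4_800_000))   # sample used for modelling
```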

Predictive analytics
Three predictive models, namely linear regression (LR), support vector regression (SVR), and an artificial neural network (ANN), were implemented to predict the total cost of a project given its distance. These models were selected because they are the most prevalent models used for cost estimation (Deng and Yeh, 2011). Five per cent of the projects were used to test the predictive models, and the rest were used to develop them. The traditional 80/20 split between training and test data was not used because of the large size of the dataset, and to ensure a balanced variance between the parameter estimates and the performance statistic. The test projects were selected so that they were representative of the total cost distribution.
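One possible way to hold out five per cent of the projects while keeping the test set representative of the cost distribution is systematic sampling over the projects sorted by cost. The sketch below is an assumption about how such a split could be made, since the selection procedure is not detailed in the text; the project records are synthetic.

```python
from typing import Dict, List, Tuple

Project = Dict[str, float]


def systematic_split(projects: List[Project], test_fraction: float = 0.05) -> Tuple[List[Project], List[Project]]:
    """Sort by cost and send every k-th project to the test set, where k = 1 / test_fraction."""
    step = round(1 / test_fraction)
    ordered = sorted(projects, key=lambda p: p["cost"])
    # Start in the middle of the first block so both tails are sampled evenly.
    test_idx = set(range(step // 2, len(ordered), step))
    train = [p for i, p in enumerate(ordered) if i not in test_idx]
    test = [p for i, p in enumerate(ordered) if i in test_idx]
    return train, test


# Synthetic projects: distance in km and cost in GBP (illustrative only).
projects = [{"distance": float(d), "cost": 50_000.0 + 10_000.0 * d} for d in range(100)]
train, test = systematic_split(projects)
print(len(train), len(test))
```

Because the test projects are spread evenly over the sorted costs, cheap and expensive projects appear in the test set in roughly the same proportions as in the full sample.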
An LR model ($\hat{y} = wx + b$) has been implemented, in which $\hat{y}$ is the predicted value of $y$ given [...]. Note that [...] is computed using an element-wise product. Note also that time-dependent factors that affect project cost, such as inflation, are not accounted for in the predictive models; the resulting cost predictions should therefore be adjusted for inflation at the time the prediction is carried out.
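As a hedged illustration of a linear model of cost as a function of distance, the sketch below fits $\hat{y} = wx + b$ by ordinary least squares in closed form and then rescales a prediction with a price index. Ordinary least squares, the helper names, the data, and the index values are assumptions made for the example; the estimation procedure used in the study is not recoverable from the text.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Closed-form ordinary least squares for y = w * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


def adjust_for_inflation(cost: float, index_then: float, index_now: float) -> float:
    """Rescale a cost by the ratio of a price index now to the index at data collection."""
    return cost * index_now / index_then


# Synthetic data: distance in km, cost in thousands of GBP (illustrative only).
distances = [1.0, 2.0, 3.0, 4.0]
costs = [150.0, 250.0, 350.0, 450.0]

w, b = fit_line(distances, costs)
predicted = w * 5.0 + b
print(w, b, predicted)
print(adjust_for_inflation(predicted, index_then=100.0, index_now=110.0))
```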

Comparison of predictive models
An indication of the performance of the models is given by the coefficient of determination $r^2$. The three models overestimate the cost as the distance increases. All the models perform better with projects costing less than £1.5 million. A possible reason for this behaviour is that 78% of all the projects cost less than £2.5 million (Figure 3). This indicates that there is not enough data to develop accurate predictions for projects costing more than £2.5 million, because few data points delineate the models' behaviour in that range.
Table 1 presents the coefficient of determination $r^2$, mean absolute error (MAE), and root mean square error (RMSE) for the three predictive models. It presents results using only the distance as the predictor (rows 1-3) and using the region in which the line is located as an additional variable to predict the cost (rows 4-6). SVR achieved the best $r^2$ and the lowest MAE and RMSE for both sets of results, while LR and ANN achieved similar results. Adding the region as an additional variable did not significantly improve the coefficient of determination of any of the models, and the MAE and RMSE improved only marginally. Therefore, in this case, the region is not a relevant variable to predict cost as the initial analysis suggested (Figures 4 and 5). Note that this study minimised the number of variables included in the analysis, as it has been found that using more variables does not necessarily increase accuracy (Gardner et al., 2016).
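For reference, the three reported metrics can be computed as sketched below, assuming their standard definitions: $r^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$, MAE as the mean of $|y_i - \hat{y}_i|$, and RMSE as the square root of the mean of $(y_i - \hat{y}_i)^2$. Whether the study used exactly these forms cannot be confirmed from the text, and the values below are synthetic.

```python
from math import sqrt
from typing import List


def r_squared(actual: List[float], predicted: List[float]) -> float:
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mae(actual: List[float], predicted: List[float]) -> float:
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual: List[float], predicted: List[float]) -> float:
    """Root mean square error."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Synthetic actual and predicted costs in thousands of GBP (illustrative only).
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

print(r_squared(actual, predicted), mae(actual, predicted), rmse(actual, predicted))
```

The same functions can be applied to planners' estimates and final costs to obtain a comparable score for manual estimation.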

Discussion
This paper presented an approach to developing a Big Data system to support cost estimation for power transmission projects. The proposed Big Data architecture proved to be fit for purpose and facilitated the integration of large and diverse data sources. The Big Data Analytics Environment enables the use of the most prevalent cost estimation models used in construction. The presented approach has the three principal attributes of Big Data (Erl et al., 2016):
(1) Variety. The approach uses various types of structured and semi-structured data, i.e. financial data and project data in diverse file formats. It employs a standard model for merging data with different underlying schemas.
(2) Volume. The approach uses a large dataset (over 2.75 million data points) and employs a Big Data platform (Cloudera Distribution) that uses a distributed file system and non-relational databases for high-performance access to large datasets and parallel processing.
(3) Velocity. The approach uses a Big Data framework and programming model (BDAS and Spark) that facilitates in-memory processing to process complex models very quickly.
There are many challenges in estimating the costs of power transmission projects in a traditional manner, such as the extensive construction sites. Estimators usually have to survey routes that are many kilometres long in remote and inaccessible places. The construction sites are complex and located in different contexts (urban, rural, and protected natural areas). Power transmission lines usually cross highways or rivers, which complicates construction and increases risks. Limited information and unknown soil conditions hinder reliable cost prediction. Many resources must be employed to achieve accurate cost prediction and thorough risk identification to support planning tasks effectively. In practice, this is prohibitively expensive, and usually only rough estimates are carried out. However, power transmission projects have characteristics that facilitate the implementation of data-driven cost estimation methods, such as a smaller number of potential clients, considerable similarity among projects, and limited types of materials and plant equipment. The main factor limiting the adoption of data-driven methods is the considerable effort required to integrate diverse data. Power transmission projects have substantially less diverse data than traditional construction projects, which facilitates the implementation of data-driven methods.
An indication of the human-level performance for estimating costs of power transmission projects was obtained by calculating $r^2$ using the estimates calculated by planners and the actual cost at the end of the projects. The calculated human-level performance is $r^2$ = 63%. All three predictive models presented in this study perform better than the human level. SVR shows the largest increase in performance (17 percentage points, from 63% to 80%). It is widely acknowledged that data-driven methods could outperform traditional manual methods. However, there was no empirical evidence of how large the difference in performance would be. This study presents an empirical comparison between traditional manual methods and data-driven methods and provides a quantitative indication of the difference in performance. The results indicate that the average difference in performance between predictive models is approximately 7.5%, while the average difference in performance between the data-driven methods and the manual method is approximately 13.5%, almost twice as much. This insight has a relevant implication for practice: stakeholders can start by adopting simpler predictive models that are easier to implement and still obtain a significant increase in performance.
This paper also presented a Big Data architecture to manage the vast and varied datasets required to process the predictive models that traditional methods cannot handle. Note that the main difference between Big Data Analytics and traditional data analytics is the manner in which the data is managed and processed. Algorithms used in Big Data Analytics are, in essence, the same algorithms used for traditional data analytics. The key difference is that the Big Data algorithms run on distributed file systems and non-relational databases using new computing frameworks. These new frameworks enable large-scale in-memory data processing.
Traditional data analytics frameworks are very slow in handling queries because they have to sift through large amounts of data stored on disk and are not suitable for complex computations such as advanced predictive models. These limitations prevent extracting value or new insights from data.

Conclusions
A Predictive Analytics and Modelling System (PAMS) has been presented that generates cost estimates using data from previously constructed designs. The presented Big Data Analytics Environment manages large and diverse datasets for predictive analytics. The proposed Big Data architecture is fit for purpose. The distributed file systems and Big Data frameworks and programming models employed can handle the large and diverse datasets used in the construction sector.
A 2.75 million-point dataset of power transmission projects has been used as a case study. The three most prevalent cost estimation models were implemented (linear regression, support vector regression, and artificial neural networks). Their $r^2$ values were 73%, 80% and 74% respectively, in line with other results reported in the literature. All the implemented models performed better than the estimated human-level performance (63%). Data-driven methods for cost estimation have been studied extensively. However, due to the complexities of construction projects, there is no consensus regarding the best method. Incremental improvements to performance are being achieved constantly; however, adoption in practice remains very low. The intention of this paper is to provide an indication of the potential benefits of adopting predictive data-driven cost estimation methods and to present an approach that can be useful in practice to support the planning of power transmission projects.
The main contribution of this paper to the body of knowledge is an indication that data-driven cost estimation methods are on average 13.5% better than traditional manual methods for power transmission projects. This study has significant implications for practice because it enables data-driven decisions during preconstruction. In a highly competitive sector such as construction, making correct decisions can determine both the successful delivery of a project and the survival of the company. For example, the presented approach will support stakeholders in making accurate cost estimates, defining accurate profit margins, and deciding whether it is worthwhile to bid for a project. Future steps to improve the presented approach should focus on the following aspects: (i) to investigate whether the missing data in the dataset is a structural characteristic that reflects an underlying attribute of the dataset or is simply the result of input and recording errors; (ii) to compile more granular data, for example the breakdown of the cost by labour, materials and plant; and (iii) to develop an automatic adjustment of the predicted costs that takes into account cost increases due to inflation at the time the prediction is carried out.

Data Availability Statement
Data generated or analysed during the study are available from the corresponding author by request.