Workflows are a way to describe a series of computations on raw e-Science data. These data may be MRI brain scans, data from a high-energy physics detector or metric data from an earth observation project. In order to derive meaningful knowledge from the data, they must be processed and analysed. Workflows have emerged as the principal mechanism for describing and enacting complex e-Science analyses on distributed infrastructures such as grids. Scientific users face a number of challenges when designing workflows. These challenges include selecting appropriate components for their tasks, specifying dependencies between them and selecting appropriate parameter values. These tasks become especially challenging as workflows grow larger. For example, the CIVET workflow consists of up to 108 components. Building such a workflow by hand and specifying all the links can become quite cumbersome for scientific users.
Traditionally, recommender systems have been employed to assist users in such time-consuming and tedious tasks. One approach taken by recommender systems is to predict what the user is attempting to do, using workflow semantics on the one hand and historical usage patterns on the other. Semantics-based systems attempt to infer a user’s intentions based on the available semantics. Pattern-based systems attempt to extract usage patterns from previously constructed workflows and match those patterns to the workflow under construction. The use of historical patterns adds dynamism to the suggestions, as the system can learn and adapt with “experience”. However, in cases where there are no previous patterns to draw upon, pattern-based systems fail to perform. Semantics-based systems, on the other hand, infer from static information, so they always have something to draw upon. However, that information first has to be encoded into the semantic repository, which is a time-consuming and tedious task in itself. Moreover, semantics-based systems do not learn and adapt with experience. The two approaches have distinct but complementary features and drawbacks. By combining them, the drawbacks of each approach can be addressed.
This thesis presents HyDRA, a novel hybrid framework that combines frequent usage patterns and workflow semantics to generate suggestions. The functions performed by the framework include: a) extracting frequent functional usage patterns; b) identifying the semantics of unknown components; and c) generating accurate and meaningful suggestions. Challenges in mining frequent patterns include ensuring that meaningful and useful patterns are extracted. For this purpose, only patterns that occur above a minimum frequency threshold are mined. Moreover, instead of considering just groups of specific components, the pattern mining algorithm takes into account workflow component semantics. This allows the system to identify different types of components that perform a single composite function. One of the challenges in maintaining a semantic repository is keeping it up to date. This involves identifying new items and inferring their semantics. In this regard, a minor contribution of this research is a semantic inference engine that is responsible for function b). This engine also uses pre-defined workflow component semantics to infer new semantic properties and generate more accurate suggestions. The overall suggestion generation algorithm is also presented.
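The idea of mining patterns over component semantics rather than specific components, subject to a minimum frequency threshold, can be illustrated with a minimal sketch. This is not HyDRA's algorithm: it assumes workflows are simplified to linear sequences (real workflows are graphs), and every name in it (`mine_frequent_patterns`, `suggest_next`, `semantic_type`, and the component names) is hypothetical and chosen only for illustration.

```python
from collections import Counter

def mine_frequent_patterns(workflows, semantic_type, min_support, max_len=3):
    """Return contiguous semantic-type patterns that occur in at least
    min_support workflows (illustrative simplification, not HyDRA itself)."""
    support = Counter()
    for components in workflows:
        # Lift each concrete component to its semantic type;
        # components with no known type keep their own name.
        types = [semantic_type.get(c, c) for c in components]
        seen = set()
        for n in range(2, max_len + 1):
            for i in range(len(types) - n + 1):
                seen.add(tuple(types[i:i + n]))
        # Count each pattern at most once per workflow.
        support.update(seen)
    return {p: s for p, s in support.items() if s >= min_support}

def suggest_next(partial, patterns, semantic_type):
    """Rank candidate next semantic types by the total support of the
    patterns whose prefix matches the tail of the partial workflow."""
    types = [semantic_type.get(c, c) for c in partial]
    scores = Counter()
    for pattern, s in patterns.items():
        prefix = pattern[:-1]
        if tuple(types[-len(prefix):]) == prefix:
            scores[pattern[-1]] += s
    return [t for t, _ in scores.most_common()]

# Hypothetical components and semantic types, for illustration only.
semantic_type = {
    "strip_a": "skull_strip", "strip_b": "skull_strip",
    "align_a": "registration", "align_b": "registration",
    "segment": "segmentation", "smooth": "smoothing",
}
workflows = [
    ["strip_a", "align_a", "segment"],
    ["strip_b", "align_b", "segment"],
    ["strip_a", "align_b", "smooth"],
]

patterns = mine_frequent_patterns(workflows, semantic_type, min_support=2)
suggestions = suggest_next(["strip_b", "align_b"], patterns, semantic_type)
```

In this toy data set no pair of concrete components occurs in more than one workflow, so mining at the component level with a threshold of 2 would yield nothing. Once components are lifted to their semantic types, the pattern skull stripping followed by registration is found in all three workflows, which is the kind of generalisation the semantic layer is meant to provide.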
HyDRA has been evaluated using workflows from the Laboratory of Neuro Imaging (LONI) repository. These workflows have been chosen for their structural and functional characteristics that help to evaluate the framework in different scenarios. The system is also compared with another existing pattern-based system to show a clear improvement in the accuracy of the suggestions generated.
Soomro, K. HyDRA Hybrid workflow Design Recommender Architecture. (Thesis). University of the West of England