Semantic Web & Big Data

Georgia State University
Topics in Semantic Web
Paper Review Report

Author: Ankush Chauhan

August 14, 2016

Supervisor: Dr. Liyang Yu

Abstract

Big Data and the Semantic Web are among the most prominent recent research topics in Computer Science. Although both fields are more than a decade old, recent work on integrating the two technologies has produced a scalable approach to data analytics. The Semantic Web also gives Big Data the agility and scalability that Business Intelligence requires. This report consists of a brief introduction to Big Data concepts, followed by reviews of 5 papers, each with a summary and a paper evaluation in concise tabular form under 4 broad categories. The last section concludes the report with personal ideas on merging these two fields and on how they relate.

1 Introduction to Big Data

Although the term “Big Data” is relatively new, the act of collecting and storing large amounts of information for eventual analysis is ages old. The concept gained momentum in the early 2000s, when industry analyst Doug Laney articulated the now-mainstream definition of big data as the three Vs.

1. Volume. This concerns the scale of data. Organizations collect data from a variety of sources, including business transactions, social media, and sensor or machine-to-machine data. In the past, storing it would have been a problem, but new technologies (such as Hadoop) have eased the burden. It is estimated that by the year 2020, 40 Zettabytes (43 Trillion Gigabytes) of data will be created.

2. Velocity. This deals with the analysis of streaming data. Data streams in at an unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time. For example, the New York Stock Exchange generates 1 TB of trade information in each trading session. Moreover, modern cars have close to 100 sensors that monitor items such as fuel level and tire pressure.

3. Variety. Data comes in all types of formats, from structured, numeric data in traditional databases to unstructured text documents, email, video, audio, stock ticker data and financial transactions. By 2014 there were 420 million wearable and wireless health monitor devices. More than 4 billion hours of video are watched on YouTube every month, and 30 billion pieces of content are shared on Facebook every month.

In addition to the above, two more Vs are associated these days with “Big Data”.

4. Veracity. This encompasses the uncertainty associated with the data. According to research, 1 in 3 business leaders does not trust the information they use to make decisions. In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent, with periodic peaks, for example when something is trending in social media. Daily, seasonal and event-triggered peak data loads can be challenging to manage, even more so with unstructured data. Poor data quality costs the US economy 1.3 Trillion USD a year.

5. Value. It is all well and good to have access to big data, but unless we can turn it into value it is useless. One can therefore argue that “Value” is the most important V of “Big Data”. It is important that businesses make a business case for any attempt to collect and leverage big data. It is easy to fall into the buzz trap and embark on big data initiatives without a clear understanding of costs and benefits.

2 Paper Survey

2.1 Integration of Big Data Using Semantic Web Technologies [1]

In their paper published at IEEE ICSC 2016, Ostrowski et al. introduced a framework for integrating data for “Big Data” from heterogeneous data sources that overcomes the scalability shortcomings of traditional ETL (Extraction-Transformation-Loading) operations. They illustrated the framework with a case study on Risk Detection in Supply Chain Management for the Automotive Sector. They also identified several challenges in their application space that must be met for a “Big Data” application to reach its full potential: support for Parallelization and Real-Time Streaming, the need for more domain-specific Ontologies, and data incompatibility.
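The idea of lifting heterogeneous sources into one shared vocabulary can be illustrated with a minimal sketch. It is not the framework of Ostrowski et al.; it assumes two made-up sources (`erp` and `news_feed`), a hypothetical namespace `http://example.org/supply/`, and hand-written field mappings that stand in for a domain ontology.

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]

BASE = "http://example.org/supply/"  # hypothetical namespace, illustration only

# Assumed per-source mapping from local field names to shared predicate URIs.
MAPPINGS: Dict[str, Dict[str, str]] = {
    "erp": {"country": BASE + "locatedIn", "part": BASE + "supplies"},
    "news_feed": {"nation": BASE + "locatedIn", "event": BASE + "affectedBy"},
}

# Assumed name of the field that identifies the supplier in each source.
KEY_FIELD: Dict[str, str] = {"erp": "supplier_id", "news_feed": "company"}


def to_triples(source: str, records: Iterable[Dict[str, str]]) -> Set[Triple]:
    """Translate one source's records into (subject, predicate, object) triples."""
    mapping = MAPPINGS[source]
    key = KEY_FIELD[source]
    triples: Set[Triple] = set()
    for record in records:
        subject = BASE + "supplier/" + record[key]
        for field, value in record.items():
            if field in mapping:  # unmapped fields are skipped, not guessed
                triples.add((subject, mapping[field], value))
    return triples


# Made-up rows: the same supplier described with different field names.
erp_rows = [{"supplier_id": "S42", "country": "JP", "part": "brake-pad"}]
feed_rows = [{"company": "S42", "nation": "JP", "event": "flood"}]

graph = to_triples("erp", erp_rows) | to_triples("news_feed", feed_rows)

# Both sources now describe one subject URI through shared predicates.
for triple in sorted(graph):
    print(triple)
```

Because both sources resolve to the same subject URI and the same `locatedIn` predicate, the duplicate statement collapses in the set union, and a risk query can join the supplier's parts with the events that affect it without a source-specific schema.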


For this paper, the following tables present the research evaluation on various parameters.

Summary of Evaluation: Excellent
Recommendation: Accept if certain minor changes are made
Theoretical Contribution: Significant
Application Contribution: Tutorial
Overall Contribution: Excellent

Table 2.1.1: Summary of Evaluation.

Current Interest: Reasonable Segment
Likeliness of topic interest to change in 5 years: Growing interest
Importance in Semantic Web: Yes
Overall Contribution: Good

Table 2.1.2: Paper Overview.

Technical soundness: High
Technical Depth: Field Expertise required
Tangible Contribution to the State-of-the-Art: Yes
Likeliness extent for use by other researchers: High

Table 2.1.3: Content Review.

Abstract an appropriate digest: Yes
Understandable clarity for the non-specialist in Introduction: Yes
Overall Organization: Good
Appropriate Paper Length: Should be lengthened
Satisfactory English: Yes
Readability for non-specialist: Readable with ordinary effort
Quality disregarding technical content: Excellent

Table 2.1.4: Presentation Review.

2.2 Semantics for Big Data Integration and Analysis [2]

This technical report by Knoblock & Szekely, published at the AAAI Fall Symposium Series 2013, presents another approach to ETL operations: generating plans for integrating and restructuring data for scalable “Big Data” analytics, with support for data visualization, in order to reduce the overhead time of model development for analytics. They used KARMA, an interactive ETL tool, for restructuring input and for data-cleaning jobs. They demonstrated their experiments on several bioinformatics data sources (KEGG and PharmGKB), loading the data into KARMA and generating restructuring plans while looking at only partial datasets.

For this paper, the following tables present the research evaluation on various parameters.

Summary of Evaluation: Excellent
Recommendation: Accept with minor changes
Theoretical Contribution: Tutorial
Application Contribution: Tutorial
Overall Contribution: Good

Table 2.2.1: Summary of Evaluation.

Current Interest: Reasonable Segment
Importance in Semantic Web: Yes
Overall Contribution: Good

Table 2.2.2: Paper Overview.

Technical soundness: High
Likeliness of topic interest to change in 5 years: Growing interest
Technical Depth: Field Expertise required
Tangible Contribution to the State-of-the-Art: To a limited extent
Likeliness extent for use by other researchers: High

Table 2.2.3: Content Review.

Abstract an appropriate digest: Yes
Understandable clarity for the non-specialist in Introduction: Yes
Overall Organization: Good
Appropriate Paper Length: Yes
Satisfactory English: Yes
Readability for non-specialist: Ordinary effort required
Quality disregarding technical content: Excellent

Table 2.2.4: Presentation Review.

2.3 State of the Art in Ontology Design [3]

In their AI Magazine 1997 article, Noy & Hafner developed a comparative framework for various ontologies, encompassing general, domain-specific and knowledge representation systems. They differentiated the ontologies by identifying the similarities and differences among them. They compared the ontologies of 8 novel projects, and the article presents the comparison with a thorough, self-contained description. A comparative summary contrasts these ontologies on various parameters, namely Size, Formalism, Implementation, and Published. They conclude by summarizing the strengths and contributions of these projects.

For this paper, the following tables present the research evaluation on various parameters.

Summary of Evaluation: Excellent
Recommendation: Accept without changes
Theoretical Contribution: Significant
Application Contribution: Possible
Overall Contribution: Good

Table 2.3.1: Summary of Evaluation.


Current Interest: Reasonable Segment
Likeliness of topic interest to change in 5 years: Growing interest
Importance in Semantic Web: Yes
Overall Contribution: Excellent

Table 2.3.2: Paper Overview.

Technical soundness: High
Technical Depth: Suitable for the non-specialist
Tangible Contribution to the State-of-the-Art: Yes
Likeliness extent for use by other researchers: High

Table 2.3.3: Content Review.

Abstract an appropriate digest: Yes
Understandable clarity for the non-specialist in Introduction: Yes
Overall Organization: Good
Appropriate Paper Length: Should be lengthened
Satisfactory English: Yes
Readability for non-specialist: Readable with ordinary effort
Quality disregarding technical content: Excellent

Table 2.3.4: Presentation Review.

2.4 MAD Skills: New Analysis Practices for Big Data [4]

In their August 2009 paper published in the Proceedings of the VLDB Endowment, Jeffrey Cohen et al. presented a design technique for EDW (Enterprise Data Warehousing) and BI (Business Intelligence) for massive data acquisition and storage, using parallel algorithms with support for agility. Their work draws on Fox Interactive Media and the Greenplum parallel database system, which supports both SQL and MapReduce. They gave detailed background on OLAP (Online Analytical Processing) and Data Cubes for multidimensional data analytics, and described various databases and statistical packages, namely SAS, R, Matlab and ScaLAPACK, for data-intensive statistical analysis. They presented various algorithms under Matrix-Based Analytical Methods and presented Resampling Techniques for parameterized modeling.
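As an illustration of what a resampling technique looks like, the following is a minimal sketch of one common member of that family, the percentile bootstrap. It is an in-memory Python version written for this report, not the in-database formulation of Cohen et al.; the function name `bootstrap_mean_ci`, its parameters, and the sample values are assumptions.

```python
import random
import statistics
from typing import List, Tuple


def bootstrap_mean_ci(sample: List[float], n_resamples: int = 2000,
                      alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `sample`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        # Draw n values with replacement and record the resample's mean.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(resample))
    means.sort()
    # Read the interval bounds off the sorted bootstrap distribution.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


revenue = [12.0, 15.5, 9.8, 20.1, 14.3, 11.7, 18.4, 13.9]  # made-up values
low, high = bootstrap_mean_ci(revenue)
print(f"95% bootstrap CI for the mean: [{low:.2f}, {high:.2f}]")
```

Each resample is independent of the others, which is why this kind of computation parallelizes well across the nodes of a parallel database.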

For this paper, the following tables present the research evaluation on various parameters.

Summary of Evaluation: Excellent
Recommendation: Accept without changes
Theoretical Contribution: Tutorial
Application Contribution: Significant
Overall Contribution: Good

Table 2.4.1: Summary of Evaluation.

Current Interest: Reasonable Segment
Likeliness of topic interest to change in 5 years: Growing interest
Importance in Semantic Web: Yes
Overall Contribution: Good

Table 2.4.2: Paper Overview.

Technical soundness: High
Technical Depth: Field Expertise required
Tangible Contribution to the State-of-the-Art: To a limited extent
Likeliness extent for use by other researchers: Average

Table 2.4.3: Content Review.

Abstract an appropriate digest: Yes
Understandable clarity for the non-specialist in Introduction: Yes
Overall Organization: Good
Appropriate Paper Length: Should be lengthened
Satisfactory English: Yes
Readability for non-specialist: Readable with considerable effort
Quality disregarding technical content: Excellent

Table 2.4.4: Presentation Review.

2.5 Biological data integration using Semantic Web technologies [5]

In April 2008, C. Pasquier published a paper in the journal Biochimie presenting a novel approach in Bioinformatics: using the Semantic Web to represent biological data. Given the sheer volume of biological data and the interchangeability issues in Biology, Semantic Web concepts can help Bioinformatics make new discoveries by finding relationships among diverse biological data. He also argued, 8 years ago, for the maturity of the Semantic Web and noted the social hindrances to its acceptance in the Life Sciences. He clearly demonstrated the use of RDF/OWL statements for ontology creation, along with SPARQL queries and the unification of different concepts using a class hierarchy, covering topics that range from Data Gathering and Data Conversion to the use of metadata with an organized Data Repository. All of this corresponds to the ETL step required for data cleaning discussed in the previous paper sections, and shows a correlation between “Big Data” and the field of Bioinformatics.

For this paper, the following tables present the research evaluation on various parameters.

Summary of Evaluation: Excellent
Recommendation: Accept without changes
Theoretical Contribution: Significant
Application Contribution: Tutorial
Overall Contribution: Excellent

Table 2.5.1: Summary of Evaluation.

Current Interest: Domain Specific
Likeliness of topic interest to change in 5 years: Growing interest
Importance in Semantic Web: Yes
Overall Contribution: Excellent

Table 2.5.2: Paper Overview.


Technical soundness: High
Likeliness of topic interest to change in 5 years: Growing interest
Technical Depth: Self-Contained
Tangible Contribution to the State-of-the-Art: High
Likeliness extent for use by other researchers: High

Table 2.5.3: Content Review.

Abstract an appropriate digest: Yes
Understandable clarity for the non-specialist in Introduction: Somewhat
Overall Organization: Excellent
Appropriate Paper Length: Yes
Satisfactory English: Yes
Readability for non-specialist: Readable with considerable effort
Quality disregarding technical content: Excellent

Table 2.5.4: Presentation Review.
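The unification of concepts through a class hierarchy, as reviewed in this section, can be illustrated with a minimal sketch. It assumes a tiny in-memory triple set in plain Python rather than a real RDF store; the `ex:` names and the helpers `match`, `subclasses`, and `instances_of` are hypothetical and are not taken from Pasquier's paper.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]

RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

# Hypothetical mini-ontology and instance data, for illustration only.
triples: Set[Triple] = {
    ("ex:Enzyme", SUBCLASS_OF, "ex:Protein"),
    ("ex:Protein", SUBCLASS_OF, "ex:Biomolecule"),
    ("ex:P1", RDF_TYPE, "ex:Enzyme"),
    ("ex:P2", RDF_TYPE, "ex:Protein"),
    ("ex:G1", RDF_TYPE, "ex:Gene"),
}


def match(s: Optional[str], p: Optional[str], o: Optional[str]) -> Iterator[Triple]:
    """Yield triples matching a pattern; None plays the role of a query variable."""
    for triple in triples:
        if all(q is None or q == v for q, v in zip((s, p, o), triple)):
            yield triple


def subclasses(cls: str) -> Set[str]:
    """Return cls plus every class that reaches it through rdfs:subClassOf."""
    found = {cls}
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for child, _, _ in match(None, SUBCLASS_OF, current):
            if child not in found:
                found.add(child)
                frontier.append(child)
    return found


def instances_of(cls: str) -> Set[str]:
    """Instances typed with cls or with any of its subclasses."""
    return {s for c in subclasses(cls) for s, _, _ in match(None, RDF_TYPE, c)}


print(sorted(instances_of("ex:Protein")))
```

Asking for `ex:Protein` also returns the resource typed only as `ex:Enzyme`, which is the effect a SPARQL 1.1 query would obtain with a property path such as `?x rdf:type/rdfs:subClassOf* ex:Protein`.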

3 Ideas on the Usage of Semantic Web Technology in “Big Data” and Their Relation

Since the advent of big data analytics, developers have spent a significant amount of time on data preparation, i.e. ETL operations in “Big Data”. These work hours could have been channeled into the development of actual statistical models for analytics. Technologies such as the Semantic Web provide the necessary data vision templates, using scalable concepts of standardized ontologies for data representation. Today the sheer Volume of “Big Data” makes analytics far too complicated for traditional OLAP tools, which excel at analytics using data cubes over multidimensional datasets. When it comes to Knowledge Discovery and Feature Engineering, however, using the Semantic Web for URI-based resource identification can help with schema re-usability among domain-specific datasets and provide a scalable framework for such operations. Moreover, the current industry culture of Agile Development and fast-changing business requirements calls for a robust yet flexible solution for Business Intelligence and Data Warehousing, which can be incorporated easily using shared enterprise-level ontologies. Also, parallelization techniques such as message passing and multi-threading over shared memory can clearly reap the benefits of today's prevalent multi-core architectures. This node architecture can be used to apply MapReduce algorithms to VLDB datasets using large-scale semantic reasoners such as LarKC (the Large Knowledge Collider) [6]. Effectively, linked data can be used as a broker: mapping, interconnecting, indexing and feeding real-time information from a variety of sources. We can infer relationships from big data analysis that might otherwise have been discarded, and then potentially run further analysis on the linked data to derive even more insight.
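The combination of URI-based identification and MapReduce suggested above can be sketched in a few lines. This is a single-process simulation written for illustration, not LarKC or a real MapReduce framework; the function names `map_phase`, `shuffle`, and `reduce_phase`, the `ex:` URIs, and the sample triples are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Set, Tuple

Triple = Tuple[str, str, str]
Statement = Tuple[str, str]  # (predicate, object)


def map_phase(triples: Iterable[Triple]) -> Iterator[Tuple[str, Statement]]:
    """Emit (subject URI, (predicate, object)) pairs, one per triple."""
    for s, p, o in triples:
        yield s, (p, o)


def shuffle(pairs: Iterable[Tuple[str, Statement]]) -> Dict[str, List[Statement]]:
    """Group values by key, as a framework would do between the two phases."""
    groups: Dict[str, List[Statement]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(subject: str, values: List[Statement]) -> Tuple[str, Dict[str, Set[str]]]:
    """Merge all statements about one URI into a single description."""
    description: Dict[str, Set[str]] = defaultdict(set)
    for p, o in values:
        description[p].add(o)
    return subject, dict(description)


# Illustrative triples from two sources that share the URI ex:gene/BRCA1.
source_a = [("ex:gene/BRCA1", "ex:locatedOn", "chr17")]
source_b = [
    ("ex:gene/BRCA1", "ex:associatedWith", "ex:disease/breast-cancer"),
    ("ex:gene/TP53", "ex:locatedOn", "chr17"),
]

grouped = shuffle(map_phase(source_a + source_b))
merged = dict(reduce_phase(subject, values) for subject, values in grouped.items())
print(merged["ex:gene/BRCA1"])
```

Because the subject URI is the shuffle key, statements from independent sources meet in the same reducer without any source-specific join logic, which is the sense in which linked data acts as a broker.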

References

[1] D. Ostrowski, N. Rychtyckyj, P. MacNeille, and M. Kim, “Integration of big data using semantic web technologies,” in 2016 IEEE Tenth International Conference on Semantic Computing (ICSC), pp. 382–385, Feb 2016.
[2] C. A. Knoblock and P. Szekely, “Semantics for big data integration and analysis,” in Proceedings of the AAAI Fall Symposium on Semantics for Big Data, 2013.
[3] N. F. Noy and C. D. Hafner, “The state of the art in ontology design: A survey and comparative review,” AI Magazine, vol. 18, no. 3, p. 53, 1997.
[4] J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton, “Mad skills: New analysis practices for big data,” Proc. VLDB Endow., vol. 2, pp. 1481–1492, Aug. 2009.
[5] C. Pasquier, “Biological data integration using semantic web technologies,” Biochimie, vol. 90, no. 4, pp. 584–594, 2008.
[6] D. Fensel, F. van Harmelen, B. Andersson, P. Brennan, H. Cunningham, E. D. Valle, F. Fischer, Z. Huang, A. Kiryakov, T. K. i. Lee, L. Schooler, V. Tresp, S. Wesner, M. Witbrock, and N. Zhong, “Towards larkc: A platform for web-scale reasoning,” in Semantic Computing, 2008 IEEE International Conference on, pp. 524–529, Aug 2008.
