Deep Product Comparison on the Semantic Web


Universität der Bundeswehr München
Fakultät für Wirtschafts- und Organisationswissenschaften
Institut für Management marktorientierter Wertschöpfungsketten

Dissertation

Deep Product Comparison on the Semantic Web

Alex Stolz

Neubiberg, March 2016

Complete reprint of the dissertation approved by the Fakultät für Wirtschafts- und Organisationswissenschaften of the Universität der Bundeswehr München for the award of the academic degree of a Doktor der Wirtschafts- und Sozialwissenschaften (Dr. rer. pol.).

Reviewers:
1. Univ.-Prof. Dr. Martin Hepp
2. Univ.-Prof. Dr. Michael Koch

The dissertation was submitted to the Universität der Bundeswehr München on . . . . . . . . . . . . . . . and accepted by the Fakultät für Wirtschafts- und Organisationswissenschaften on . . . . . . . . . . . . . . . . The oral examination took place on . . . . . . . . . . . . . . . .

Abstract

Search plays a major role in today's information systems. It facilitates finding information on our desktop computers and mobile devices, in enterprise intranets, and on the Web. Yet as the volume of data grows, it becomes increasingly difficult to obtain the required information. Problems arise in particular with regard to search efficiency (“Can the information be procured at low cost?”) and search effectiveness (“Are the returned results satisfying?”). An important use case in this context is the discovery of products on the Web. Product search is challenging for several reasons: (1) the amount of product-related documents has increased over time; (2) the data contained in those documents is mostly unstructured and heterogeneous; (3) products are multi-dimensional objects; and (4) users often have complex information needs. On that account, the quality and granularity of data are critical requirements for product search algorithms on the Web.

This thesis contributes a search framework for product offers on the Semantic Web, also known as the Web of Data. Structured data on the Web has grown rapidly over the last five years, driven primarily by Linked Data sources and by Web pages with embedded Microdata or RDFa markup. Structured data can mitigate many of the limitations of traditional Web searches for products. For instance, global resource identifiers on the Web ease product data integration, making it possible to augment product data with fine-grained, high-quality product descriptions. In our work, this authoritative data is supplied via manufacturer datasheets and product classification systems. These granular product descriptions enable deep product comparison over several product dimensions.

A crucial component of our solution is the implementation of a faceted search interface. Faceted search is a suitable way to deal with the iterative and incremental nature of search. It engages the user in the search process, letting him continually learn about the option space in an exploratory fashion. As an important innovation, our approach is data- or instance-driven, i.e. the availability of data determines the options presented to the user. This is in stark contrast to traditional search interfaces, which typically rely on a system-wide, domain-specific, rigid conceptual structure. Our design choice eases searching within the often sparse graph of product information on the Web of Linked Data. Furthermore, it extends the applicability of our approach to other application areas outside the narrow scope of e-commerce.


Zusammenfassung

Search is one of the most common functions of information systems. Search systems are used on desktop computers and mobile devices, but especially also in corporate networks and on the World Wide Web. For the latter systems in particular, however, the growing volume of data is becoming an ever greater problem. Difficulties arise with respect to both search efficiency (“Can the information be procured at low cost?”) and search effectiveness (“Are the returned results satisfying?”). The importance of information search, and its growing problems, become particularly evident in product search on the Web. Several factors play a role here: (1) the number of Web pages with product descriptions has risen considerably over recent years; (2) the contents of such Web documents are mostly unstructured and heterogeneous; in addition, (3) the multi-dimensional character of products and (4) the complex information needs of users complicate conventional product search. For these reasons, common search algorithms place high demands on the quality and granularity of data.

This dissertation presents a framework for product search on the Semantic Web, or Web of Data. The Semantic Web already provides a comprehensive amount of structured offer data. Its rapid growth over the last five years was driven in particular by the idea of Linked Open Data and was characterized by a massive spread of data in the syntactic variants Microdata and RDFa. Product search over the Semantic Web can remedy numerous limitations of traditional Web search. First, structured data and globally valid identifiers on the Web simplify data integration. Product data can thereby be augmented with granular, high-quality product descriptions. In this thesis, such additional data is supplied via manufacturer catalogs and product classification standards. It enables a more detailed product comparison that takes several product dimensions into account simultaneously. Second, this thesis develops a user interface for filter-based product search. The faceted search employed here describes an iterative and incremental search paradigm with active user involvement. In this way, users can continually explore the option space in an exploratory fashion. An important innovation of our contribution is that the search interface is data- or instance-driven, i.e. the choices offered for further navigation are generated from the data available at that moment. This stands in contrast to the usual user interfaces for product search, which typically follow a system-wide, domain-specific, rigidly predefined structure. Our design decision eases search in the often sparsely populated graphs of product data on the Web of Linked Data. Our approach can thus readily be extended to other application areas outside of e-commerce.

Contents

List of Tables
List of Figures
List of Listings
List of Abbreviations

1 Introduction
1.1 State of the Art
1.2 Problem Statement
1.3 Motivation
1.4 Research Questions
1.5 Research Method and Contributions
1.6 Publications
1.7 Thesis Outline

2 Background and Related Work
2.1 Relevant Economic Concepts
2.2 E-Business and E-Commerce
2.3 Semantic Web and Linked Data
2.4 Semantic Data Interoperability
2.5 Product Search

3 Data Collection
3.1 Problem Statement
3.2 State of the Art and Related Work
3.3 Sweet-Spot Deep Crawling Approach
3.4 Evaluation and Analysis
3.5 Conclusion

4 Product Model Master Data
4.1 Problem Statement
4.2 State of the Art and Related Work
4.3 Product Model Master Data for the Semantic Web
4.4 Evaluation
4.5 Conclusion

5 Product Type Information
5.1 Problem Statement
5.2 State of the Art and Related Work
5.3 Deriving Product Ontologies from Knowledge Organization Systems
5.4 Evaluation
5.5 Discussion
5.6 Conclusion

6 Cleansing and Enrichment
6.1 Problem Statement
6.2 Typology of Obstacles
6.3 Techniques
6.4 Evaluation
6.5 Implementation of a Data Management Web User Interface
6.6 Conclusion

7 Faceted Product Search on the Semantic Web
7.1 Problem Statement
7.2 State of the Art and Related Work
7.3 Adaptive Faceted Search Interface for Product Offers
7.4 Evaluation
7.5 Conclusion

8 Discussion and Conclusion
8.1 Summary
8.2 Contributions and Findings
8.3 Impact
8.4 Critical Review and Limitations
8.5 Future Work
8.6 Conclusion

A User Survey
A.1 System Usability Scale
A.2 Instructions
A.3 Results

B Index of DVD Contents

C Online Tools and Web Resources

Bibliography

List of Tables

2.1 Transaction activities [cf. PRW08, p. 42]
2.2 Categories of e-commerce transactions by entities involved [cited from Gri03]
2.3 Characterization of MDM, PDM, PLM, PIM, and ERP
2.4 Date and time formats as defined by ISO 8601
2.5 Selected snippet related to “kilogram” from the UN/CEFACT Common Code [Uni09a]
2.6 Characteristics of product identifier types
2.7 High-level comparison of product categorization standards
2.8 (X)HTML attributes defined for RDFa [based on Adi+13, Section 5]
2.9 Most important contributors to the definition of the term ontology
3.1 Structured data in the Web Data Commons [based on WebNDa; WebNDd; WebNDc]
3.2 Instance count and average number of properties in crawl dataset
3.3 Comparison of entity frequency in WDC and in GRC
3.4 Comparison of the amount of RDF triples in shops for WDC and GRC, sorted in descending order by the number of triples in GRC
3.5 Comparison of the amount of RDF triples in shops for WDC and GRC, filtered by domains for which WDC contains more triples than GRC
4.1 Comparison of product features between manufacturers and retailers
4.2 Mapping of product details from BMEcat to GoodRelations
4.3 Mapping of product features from BMEcat to GoodRelations
4.4 Mapping of a catalog group system in BMEcat to a rdfs:subClassOf hierarchy
4.5 Validation of BMEcat conversions
4.6 Product features in BSH BMEcat versus data from retailers publishing GoodRelations markup
4.7 Product searches for a digital camera model on popular e-marketplaces
5.1 Statistics of product classification standards and category systems
6.1 Obstacles with respective solutions
6.2 Statistics of entities in the crawl corpus
6.3 Data quality problems in the crawl corpus
7.1 Variety of properties and values in an automotive dataset
7.2 Results of SUS experiments
7.3 Web shops with number of matching items

List of Figures

1.1 Google rich snippet
1.2 Multi-parametric view of a powered hedge trimmer
1.3 Deep product comparison on the Web
1.4 Contributions of this thesis
2.1 Chapter outline
2.2 B2B and B2C e-commerce [adapted from Cha09a, p. 65]
2.3 Models of master data exchange: (a) bilateral versus (b) multilateral [from SLÖ08]
2.4 Structural complexity increase of knowledge organization systems [adapted from Nat05, p. 17]
2.5 Semantic Web layer cake [adapted from DFH11, p. 20]
2.6 Evolution of the LOD cloud diagram
2.7 Relationship between URI, URL, and URN [based on BFM05]
2.8 RDF triple represented as a graph
2.9 Example as an RDF graph
2.10 RDF graph that corresponds to the Turtle example
2.11 RDFS language additions
2.12 Google rich snippet
2.13 Agent-Promise-Object principle [based on Hep15b]
2.14 Side-by-side comparison of RDF graph and SPARQL graph
2.15 Taxonomy of schema matching approaches [from RB01]
2.16 Ontology matching process [from ES07, p. 44]
2.17 Classification of matching techniques for ontology matching [from ES07, p. 65]
2.18 Side-by-side comparison of precision and recall
2.19 Faceted search interface
2.20 Processes of an NLP system [from Bat95]
2.21 Match classes [based on Col+06]
3.1 Distribution of syntaxes in the Web Data Commons
3.2 Average number of entities per domain in Common Crawl corpora (log-scaled y-axis)
3.3 Share of structured product offer data with respect to domains with RDFa (for gr:Offering) and Microdata (for s:Offer) in the Web Data Commons from 2012–2014
3.4 Share of structured product offer data with respect to all structured data markup in the Web Data Commons from 2012–2014
3.5 Flowchart of the crawling algorithm
3.6 Distribution of items per shop (log-scaled y-axis)
3.7 Boxplot of the distribution of items per shop
3.8 Ten most represented shops by offer count
3.9 Frequency of offer properties in crawl (upper 90% – 20 out of 55)
3.10 Frequency of flat offer properties in crawl (upper 90% – 24 out of 46)
3.11 Frequency of product properties in crawl (upper 90% – 11 out of 43)
3.12 Frequency of product model properties in crawl (upper 90% – 17 out of 24)
3.13 Comparison of the E/D-ratios for WDC and GRC
4.1 Enriching shop pages with product master data from manufacturers based on “strong identifiers” [from Hep12a]
4.2 Retailer and manufacturer data in GoodRelations
4.3 BMEcat 2005 skeleton [based on SLK05a]
4.4 Boxplot of the product offer count (with EANs) across Web shops in the crawl
4.5 Boxplots of the distribution of shop offers per EAN
4.6 Frequency distribution of EANs with respect to the number of product offers for a particular EAN in the dataset
5.1 Conceptual dynamics of the eCl@ss product categorization standard [based on ECl14]
5.2 Conceptual architecture of PCS2OWL
5.3 GenTax applied to a subset of the Google product taxonomy [cf. HdB07]
5.4 Reverse-engineering of the Google product taxonomy
5.5 Examples of valid and invalid subsumption relations from the GPC hierarchy when interpreted as product classes
6.1 Property hierarchy of quantitative values in GoodRelations [based on Hep11]
6.2 Base and derived units in QUDT linked via a common type qudt:LengthUnit [based on NAS10]
6.3 Data management Web user interface
6.4 Axiom replacement mechanism
7.1 Mock-up of a faceted search interface for e-commerce
7.2 Screenshot of our faceted product search prototype
7.3 Incremental search cycle among multiple search paradigms
7.4 Screenshot of a product details modal window with instance-based search filtering
7.5 Change of option space with 100 random walk iterations over a decision tree for 875 automobile offers
7.6 Screenshot of the search interface with real data from a household crawl

List of Listings

2.1 Example in RDF/XML
2.2 Example in N-Triples
2.3 Example in Turtle/N3
2.4 Example in RDFa
2.5 Example in JSON-LD
2.6 Example in Microformats
2.7 Example in Microdata
2.8 Schema.org in Microdata
2.9 Example query in SPARQL
4.1 Example of product details in Turtle/N3
4.2 Example of product features in Turtle/N3
4.3 Example of catalog group information in Turtle/N3
5.1 Calculating the number of hierarchy levels of product classification systems
5.2 Annotation example in Microdata syntax
6.1 Categorizing products with textual properties
6.2 SPARQL SELECT query to retrieve products with prices in “Euros”
6.3 Linking two entities with the gr:includesObject modeling pattern
6.4 Linking two entities with the gr:includes modeling shortcut
6.5 Modeling of intervals in GoodRelations
6.6 Price modeling patterns in schema.org
6.7 Namespace declarations used for Turtle/N3 and SPARQL examples
6.8 Adding RDF datatypes to plain literals
6.9 OWL definition of the gr:hasValueInteger property
6.10 SPARQL CONSTRUCT query to recover the correct datatype from schema information
6.11 Assigning correct RDF datatypes to literals with incorrect datatypes
6.12 Converting the invalid data value “5,0” to “5.0”
6.13 SPARQL CONSTRUCT query to convert invalid numerical values
6.14 Redundant product models with the same EAN
6.15 SPARQL CONSTRUCT query for product models based on arbitrary product identifiers
6.16 owl:sameAs links between redundant product model entities
6.17 Redundant product models based on the combination of manufacturer name and MPN
6.18 SPARQL CONSTRUCT query for consolidating redundant product models based on identical pairs of brand names and MPNs
6.19 Product offering definition in schema.org and GoodRelations
6.20 SPARQL CONSTRUCT query to convert a product offer in schema.org to the respective offer in GoodRelations
6.21 Product model definition in schema.org (actually schema.rdfs.org) and GoodRelations
6.22 Axiom to translate among two product model classes and product model instances
6.23 SPARQL SELECT query and triples returned by selecting all GoodRelations product models
6.24 Two equivalent modeling patterns for prices in schema.org
6.25 SPARQL CONSTRUCT query to translate between two equivalent modeling patterns within the same schema
6.26 Modeling shortcut and expanded version for attaching a product to an offer
6.27 SPARQL CONSTRUCT query to expand a shortcut pattern for products to its canonical long variant
6.28 Quantitative values as point values and intervals
6.29 SPARQL CONSTRUCT query to convert point values to intervals
6.30 Product model information based on matching EANs
6.31 SPARQL CONSTRUCT query to establish a link between products and product models with matching EANs
6.32 Product with and without product features from the product model
6.33 SPARQL CONSTRUCT query for the inheritance of product features from the product model
6.34 Product variant with and without product features from a related product model
6.35 SPARQL CONSTRUCT query for the inheritance of product features from product variants
6.36 Products where one (a toner cartridge) is a consumable for another (a printer)
6.37 SPARQL CONSTRUCT query to add a gr:isConsumableFor link based on the MPN of one product contained in the product name of the other product
6.38 Two examples where some unit codes (code of the unit of measurement and the currency code) are missing for quantitative values
6.39 SPARQL CONSTRUCT query to assign the default currency value wherever missing
6.40 SPARQL CONSTRUCT query to recover missing unit codes in quantitative values
6.41 Comparison of non-granular and granular quantitative value descriptions
6.42 SPARQL CONSTRUCT query that applies a heuristic to extract a value and a unit code from a free-text field
6.43 Intervals modeled in text as compared to individual intervals
6.44 SPARQL CONSTRUCT query to convert intervals in text (decimals or integers) to intervals modeled using appropriate properties
6.45 Base and derived units in QUDT represented in N3 syntax [adapted from Mas+11]
6.46 Unit conversion of quantitative values in GoodRelations
6.47 Provision of additional axioms that are not covered by QUDT
6.48 Example of a populated currency exchange rate instance
6.49 SPARQL CONSTRUCT rule for currency conversion with SPARQL
6.50 SPARQL SELECT with owl:deprecated
6.51 Interchangeable execution of two cleansing rules

List of Abbreviations

AI  Artificial Intelligence
AP  Application Protocol
API  Application Programming Interface
ASCII  American Standard Code for Information Interchange
ASIN  Amazon Standard Identification Number
Atom  Atom Syndication Format
B2B  Business-to-Business
B2C  Business-to-Consumer
B2G  Business-to-Government
BIM  Binary Independence Model
BME  Bundesverband Materialwirtschaft, Einkauf und Logistik (Engl.: Federal Association of Materials Management, Purchasing and Logistics)
BPREF  Binary Preference
C2B  Consumer-to-Business
C2C  Consumer-to-Consumer
C2G  Consumer-to-Government
CAD  Computer-aided Design
CAE  Computer-aided Engineering
CAM  Computer-aided Manufacturing
CET  Central European Time
CMS  Content Management System
COINS  COmmon INterest Seeker
CPA  Classification of Products by Activity
CPC  Central Product Classification
CPU  Central Processing Unit
CPV  Common Procurement Vocabulary
CSV  Comma-separated Values
CTR  Click-through Rate
CURIE  Compact URI
CWA  Closed-World Assumption
cXML  Commerce XML
DAML  DARPA Agent Markup Language
DAML-ONT  DAML Ontology Language
DBMS  Database Management System
EAN  European Article Number
ebXML  Electronic Business using XML
EDI  Electronic Data Interchange
EDM  Engineering Data Management
ELMAR  Electronic Market Data Feed
eOTD  ECCMA Open Technical Dictionary
EPC  Electronic Product Code
ERP  Enterprise Resource Planning
ETIM  ElektroTechnisches InformationsModell (Engl.: Electro-Technical Information Model)
F-Logic  Frame Logic
FOAF  Friend of a Friend
FTP  File Transfer Protocol
G2B  Government-to-Business
G2C  Government-to-Consumer
G2G  Government-to-Government
GATE  General Architecture for Text Engineering
GDSN  Global Data Synchronization Network
GPC  Global Product Classification
GRAPPA  Generic Request Architecture for Passive Provider Agents
GTIN  Global Trade Item Number
HCI  Human-Computer Interaction
HTML  Hypertext Markup Language
HTTP  Hypertext Transfer Protocol
ID3  Iterative Dichotomizer 3
IMEI  International Mobile Station Equipment Identity
IoT  Internet of Things
IP  Internet Protocol
IR  Information Retrieval
IRI  Internationalized Resource Identifier
ISBN  International Standard Book Number
ISO  International Organization for Standardization
JSON  JavaScript Object Notation
JSON-LD  JSON for Linked Data
KOS  Knowledge Organization System
LARKS  Language for Advertisement and Request for Knowledge Sharing
LDF  Linked Data Fragment
LOD  Linked Open Data
LSI  Latent Semantic Indexing
MAP  Mean Average Precision
MDM  Master Data Management
MPN  Manufacturer Part Number
N3  Notation 3
NER  Named Entity Recognition
NLP  Natural Language Processing
NLTK  Natural Language Toolkit
OAGIS  Open Applications Group Integration Specification
OCML  Operational Conceptual Modeling Language
OEM  Original Equipment Manufacturer
OGP  Open Graph Protocol
OIL  Ontology Inference Layer
OM  Ontology for Units of Measure and Related Concepts
OPDM  Ontology-based Product Data Management
OSI  Open Systems Interconnection
OWA  Open-World Assumption
OWL  Web Ontology Language
PDM  Product Data Management
PHP  Hypertext Preprocessor
PIM  Product Information Management
PLM  Product Lifecycle Management
POS  Part-of-Speech
PRICAT  Price Catalog Message
PRODAT  Product Data Message
PTO  Product Types Ontology
PZN  Pharmazentralnummer (Engl.: Central Pharmaceutical Number)
QA  Question Answering
QUDT  Quantities, Units, Dimensions and Types
RDF  Resource Description Framework
RDFa  Resource Description Framework in Attributes
RDFS  RDF Schema
RDQL  RDF Data Query Language
REST  Representational State Transfer
RFC  Request for Comments
RFID  Radio-Frequency Identification
RIF  Rule Interchange Format
RNTD  RosettaNet Technical Dictionary
RQL  RDF Query Language
RSS  Really Simple Syndication, sometimes Rich Site Summary or RDF Site Summary
RuleML  Rule Markup Language
SaaS  Software as a Service
SCM  Supply Chain Management
SEO  Search Engine Optimization
SERP  Search Engine Results Page
SeRQL  Sesame RDF Query Language
SHADE  SHared DEpendency Engineering
SHOE  Simple HTML Ontology Extension
SI  International System of Units
SKOS  Simple Knowledge Organization System
SKU  Stock Keeping Unit
SPARQL  SPARQL Protocol and RDF Query Language
SPIN  SPARQL Inferencing Notation
SQL  Structured Query Language
STEP  Standard for the Exchange of Product Model Data
SUS  System Usability Scale
SVG  Scalable Vector Graphics
SWRL  Semantic Web Rule Language
TCP  Transmission Control Protocol
TED  Tenders Electronic Daily
TF-IDF  Term Frequency-Inverse Document Frequency
TREC  Text REtrieval Conference
Turtle  Terse RDF Triple Language
UBL  Universal Business Language
UCUM  Unified Code for Units of Measure
UDF  User-defined Function
UNA  Unique-Names Assumption
UNSPSC  United Nations Standard Products and Services Code
UPC  Universal Product Code
URI  Uniform Resource Identifier
URL  Uniform Resource Locator
URN  Uniform Resource Name
VIN  Vehicle Identification Number
VSO  Vehicle Sales Ontology
W3C  World Wide Web Consortium
WDC  Web Data Commons
WSD  Word Sense Disambiguation
WWW  World Wide Web
WZ  Klassifikation der Wirtschaftszweige (Engl.: German Classification of Economic Activities)
xCBL  XML Common Business Library
XHTML  Extensible Hypertext Markup Language
XML  Extensible Markup Language
XOL  XOL Ontology Exchange Language
XRO  Exchange Rate Ontology

1 Introduction

1.1 State of the Art
1.2 Problem Statement
    1.2.1 Large Quantity of Unstructured and Heterogeneous Product Data on the Web
    1.2.2 Complex Information Needs
    1.2.3 Insufficient Support for User Interaction
    1.2.4 Research Problem and Research Hypothesis
1.3 Motivation
    1.3.1 Implications of Constrained Web Searches
    1.3.2 Economic Relevance
1.4 Research Questions
1.5 Research Method and Contributions
1.6 Publications
1.7 Thesis Outline

The World Wide Web (WWW) has undergone a remarkable evolution over the last two decades, the second half of which was characterized by new ways of humans interacting with Web applications, commonly known as the Web 2.0 [ORe05; ORe07]. Its novel practices and principles also accelerated the dynamics of content publishing. Static Web pages have gradually been replaced by dynamic Web content based on information from databases and other Web services (Web application programming interfaces (APIs)). In concrete terms, Web users set up blogs (also “Weblogs”) [e.g. Nar+04; Her+04; Sch+04] and wikis [e.g. LC01] almost effortlessly in order to discuss topics of interest and to foster collaboration. As a result, a lot of useful content has been generated, culminating in the considerable amount of Web documents that we are facing today. The number of documents has quickly grown beyond what humans can process, especially where there is no mature search or recommendation engine in place that is able to cope with the large quantity, diversity, and granularity of the data.

This problem is well illustrated by e-commerce. There is an ongoing trend towards purchasing goods online, which has caused numerous e-marketplaces to emerge over time.


Consequently, more and more sellers have been pushing their products for sale into electronic marketplaces and, ultimately, onto the Web. From 2005 to 2014, for example, retail e-commerce sales in the USA grew almost three-fold relative to total retail sales, totaling 7.7 percent (96 billion U.S. dollars) of the entire U.S. retail sales market in the fourth quarter of 2014 [Uni14]. Between 2013 and 2014 alone, e-commerce sales grew by 14.7% [Uni14]. By comparison, total retail sales increased by only 3.8% over the same time period [Uni14]. In Germany, the overall e-commerce turnover for 2015 is estimated at 43 billion Euros [Han14] (roughly 49 billion U.S. dollars at an exchange rate of 1.1346 EUR/USD [1]), which is three times as much as only ten years earlier [Han14].

This increase in the value of the product volume traded online has an immediate impact on the efficiency and effectiveness of product searches and product recommendations on the Web, because the higher the degree of choice, the more prevalent the problem of product search and discovery. In other words, a growing share of e-commerce increases the problem of consumer choice and all associated information processing on the Web. In fact, it is currently very difficult and time-consuming to search for products and services on the Web. For example, consider the following information needs:

• Who can I employ to do the brickwork for my house?
• What is the best hotel for my weekend trip to Paris?
• What restaurant shall I choose for having dinner tonight?
• Which book should I read to learn Python programming?
• Which digital camera shall I buy?

Although traditional Web searches would return results for some of these questions, substantial human effort is often required to satisfy the user's actual information needs. Product search on the Web is complicated because the necessary information is often spread across many different sites and pages, and because the final choice is based on multi-dimensional trade-off decisions [cf. BLP98], e.g. between sometimes conflicting features. There are also learning effects in the search process, and there is an additional trade-off between the costs of investing more search effort and the expectable gain in the form of a better final choice [cf. Sti61; BSV12].

[1] Exchange rate as of February 25, 2015. Available online at http://www.currency2currency.org/EUR/USD/20150225 (accessed on February 25, 2015)


1.1 State of the Art

Search systems nowadays play a major role in helping to find information on our desktop computers and mobile devices, in enterprise intranets, or on the Web; for an overview of the history of information retrieval, see e.g. [SC12]. While it is relatively straightforward to develop custom search systems for well-controlled collections [BP98], e.g. document searches on personal computers with rigid data and file structures and moderate amounts of data, the situation is quite different for more heterogeneous systems like the open Web. On the Web, virtually everyone is free to publish content from anywhere and in whatever language or data format is preferred [cf. BP98]. Classical search can only provide limited capabilities for fulfilling the complex information needs of today's Internet users.

Traditional, document-based Web search has its origins in information retrieval (IR) [MRS09] research and thus operates over a full-text index of documents found on the Web. Its core task is to find a number of relevant resources among a collection of documents, regardless of whether their contents are structured or not; it is also an open research challenge to combine information from multiple documents. Traditional search determines the relevance of documents relative to a user's keyword query based on the frequency of the search terms appearing in the document and in the document collection, e.g. using a cosine similarity [2] score based on term frequency–inverse document frequency (TF-IDF) [3]. In addition, other algorithms may be applied to complement IR metrics, but they often vary from search engine to search engine. PageRank [Pag+98], for example, a prominent ranking measure, considers the link structure of documents. The algorithm computes the importance of a document from a statistical analysis of its incoming and outgoing links, and this value is further propagated to the adjacent nodes in the link graph. Accordingly, a Web page is generally considered the more relevant, the more popular the Web pages linking to it are. Modern search solutions like Google use a combination of very sophisticated approaches for keyword-based search; for an overview, see e.g. [BP98; Bif+05; Eva07; Su+14; Goo15c].

[2] The angle between two term vectors q and d determines their similarity. q is the query term vector, whereas d denotes the document term vector.
[3] Product of term frequency and inverse document frequency, thus adding relevance weights to the term vectors.
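To make the ranking mechanism concrete, the following minimal sketch (in Python; the document collection and query are invented for illustration and are not from the thesis) computes TF-IDF weights and ranks documents by the cosine similarity of their term vectors against a keyword query:

```python
import math
from collections import Counter

# Toy document collection (invented for illustration).
docs = [
    "digital camera with optical zoom and image stabilization",
    "cheap digital watch with leather strap",
    "camera bag for digital slr camera",
]
query = "digital camera"

def tokenize(text):
    return text.lower().split()

tokenized = [tokenize(d) for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

def idf(term):
    # Inverse document frequency: rare terms get higher weights.
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(len(docs) / df) if df else 0.0

def tfidf_vector(tokens):
    tf = Counter(tokens)
    return [tf[t] / len(tokens) * idf(t) for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

qvec = tfidf_vector(tokenize(query))
for doc, vec in zip(docs, (tfidf_vector(t) for t in tokenized)):
    print(f"{cosine(qvec, vec):.3f}  {doc}")
```

Note how the term "digital", which occurs in every document, receives an IDF weight of zero and therefore contributes nothing to the ranking; the discriminating term "camera" decides the order.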
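The link-analysis idea behind PageRank can be illustrated just as briefly. The sketch below is a simplified power iteration over an invented four-page link graph, with the damping factor set to the commonly cited 0.85; it is an illustration of the principle, not the production algorithm:

```python
# Simplified PageRank by power iteration (illustrative only).
links = {  # page -> pages it links to (invented link graph)
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
damping = 0.85
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):  # iterate until scores (approximately) converge
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outgoing in links.items():
        share = rank[page] / len(outgoing)  # score is split over out-links
        for target in outgoing:
            new_rank[target] += damping * share
    rank = new_rank

for page in sorted(rank, key=rank.get, reverse=True):
    print(f"{page}: {rank[page]:.3f}")  # C ranks highest: most pages link to it
```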

Since the early 2000s, structured data [4] has found its way into the document-based Web in the form of the Semantic Web. The Semantic Web is the name of an extension of the current Web that assigns well-defined meaning to information, allowing computers and individuals to better cooperate, and in particular allowing computers to support humans in the task of combining and interpreting information on the Web [BHL01]. The Semantic Web relies on the Resource Description Framework (RDF) data model, which was proposed as a W3C Recommendation in 2004 [MM04], and on Uniform Resource Identifiers (URIs) for uniquely identifying resources [BFM05]. The first applications built on top of RDF encoded meaning in an XML-based RDF syntax called RDF/XML [GS14]. Later on, the dissemination of Semantic Web content profited from the emergence of data formats like the Resource Description Framework in Attributes (RDFa) and Microdata. These data formats make it possible to embed structured data in traditional HTML Web content, which caused the number of Web pages exposing structured data to increase significantly, as surveyed in [MP12; Biz+13; MPB14]. In the domain of e-commerce, a large body of online product data is already expressed this way using the GoodRelations [Hep08a; Hep12b] (mainly in RDFa syntax) and schema.org [SchND] (mainly in Microdata syntax) vocabularies [cf. MPB14]. Such structured information can help improve the search experience and the accuracy of product searches.

Market-leading search engines have already started to make sense of this structured data. One effect that becomes immediately apparent is that they reward product pages featuring structured data markup by nicely decorating the search results [cf. Mik08; Haa+11] and thus highlighting them prominently on the search engine results pages (SERPs); these decorated results are commonly referred to as rich snippets [GGH09] (Google) or rich captions [MicND] (Bing). A search result as displayed on Google, annotated with product ratings, price details, and stock availability, is illustrated in Figure 1.1. In addition, search engines benefit from structured product descriptions by obtaining relevance signals from the shop pages, which they might use to better assess the relevance of a page for a particular query. Moreover, the structured markup can be combined with knowledge that the search engines have gathered over time to accomplish novel and more useful forms of SERPs. Google Inc., for example, announced the Knowledge Graph [5] in 2012, a knowledge base that is currently used to augment traditional search results with supplementary info boxes presenting summaries for certain entities in response to user queries [6]. While this knowledge base is up to now largely backed by third-party data sources (e.g. Freebase [7], Wikipedia [8], or the CIA World Factbook [9]) [Sin12], search engines obtain a considerable part of their valuable information from their large body of crawled

6

https://www.google.com/intl/es419/insidesearch/features/search/knowledge.html (accessed

on February 22, 2016)

http://www.business2community.com/online-marketing/strings-things-quick-primersemantic-search-0621611 (accessed on July 22, 2014) 7 http://www.freebase.com/ (accessed on May 12, 2014) 8 http://www.wikipedia.org/ (accessed on July 22, 2014) 9 https://www.cia.gov/library/publications/the-world-factbook/ (accessed on July 22, 2014)

1.2 Problem Statement

5

Figure 1.1: Google rich snippet

documents from the Web, where they infer knowledge by utilizing advanced techniques such as natural language processing (NLP) or machine learning [cf. Don+ 14]. With the increased availability of structured data, building up such knowledge bases is simplified a lot, which promises to be very useful for the dynamic domain of products and services.

1.2 Problem Statement With respect to state-of-the-art solutions, there are three main aspects that complicate product searches on the traditional, document-based Web, namely 1. the growing amount of product data published online which is distributed, weakly structured, and heterogeneous, 2. complex information needs of Web users that are constrained by keyword-based search user interfaces, and 3. insufficient support for user interaction and limited opportunities for the user to learn about the option space. In the following, we discuss the characteristics of these three problems in more detail. Thereupon, we identify the research problem and define the research hypothesis for this thesis.

1.2.1 Large Quantity of Unstructured and Heterogeneous Product Data on the Web The essential characteristics of content published on the Web that prevent realizing deep product comparison, search, and discovery are • the vast amount of product data, • the data being mostly raw and unstructured, • distributed content, and


• heterogeneous descriptions and representations.

The publication of more and more product data on the Web leads to a greater variety of searchable products. At the same time, it makes finding and comparing products on the Web increasingly challenging. For instance, the introduction of a new cell phone model by a manufacturer implies the publication of several respective offers by numerous online vendors with varying descriptions. This seems desirable at first glance, because goods supplied by many vendors give customers more choice and increase the chance of a better match; besides, prices are lowered as a result of competition. However, the addition of product data increases the search space (or option space, information space), which makes it more difficult and expensive to find product offers relevant to a particular query. The increasing specificity [cf. PRW08, p. 43] and variety of these products further cause the option space to grow. While years ago there existed only a small number of cell phones with very similar specifications, the present market for electronics offers a wealth of mobile devices with ample product configurations, ranging from classical mobile phones (i.e. feature phones) over intelligent smartphones and “phablets” [10] to tablet computers.

[10] A class of mobile devices positioned in between mobile phones and tablet computers.

Raw and unstructured data further inhibits product search and comparability on the Web, especially for very specific products, whose quality is not easy to grasp for potential buyers. A considerable part of the product data on the Web is maintained in relational databases, where it is typically stored in a well-structured form. However, once put online, much of the original information gets lost, because Web markup languages are not able to preserve the data structure [e.g. Haa+11]. Markup languages like the Hypertext Markup Language (HTML) [Hic+14] are mainly designed for presenting information to humans via their Web browsers, and machines are not able to easily extract information out of this semi-structured Web content.

Product data on the Web resides in distributed data silos that are largely disconnected [cf. BHB09]. Since the Web is a distributed system, content can be created from anywhere by various people or systems, and consolidating it is very difficult. Moreover, the linkage among these disparate Web sites is generally weak, because Web links between documents carry no precise meaning about the types of relationships that hold between resources [BHB09].

Another problem of the Web is the high degree of heterogeneity, which is caused by the distributed nature of the WWW and the lack of consensus between different publishers of product data. An important example of heterogeneity in natural languages is the existence

of homonyms, i.e. terms with different meanings that are spelled the very same way, and synonyms, i.e. distinct terms with identical or similar meanings [cf. NO95]. Homonyms and synonyms are very frequent in natural languages [cf. RB01], which complicates the reliable distinction of entities for IR-based search algorithms. Entity recognition becomes particularly challenging if search engines are unaware of the context (environmental variables), i.e. they are missing important contextual information either about the data item itself (the topic the data is about) or about the user intention (the topic the information need is about).

Furthermore, heterogeneity is often attributable to different languages and standards for products. A few examples in the field of e-commerce are product descriptions in English versus in German, units of measurement in inches versus in centimeters, price specifications given in U.S. dollars versus Euros, or the use of different classification systems for the organization of products. In business-to-business (B2B) scenarios, it is very common to categorize products according to product categorization standards (e.g. eCl@ss [11] or the United Nations Standard Products and Services Code (UNSPSC) [12]) and proprietary catalog group systems, whereas in business-to-consumer (B2C) situations custom product category systems and product taxonomies are prevalent. On the Web, for example, there exist category systems like the Google product taxonomy [13] and custom taxonomies to better organize Web shop items. The problem with classifications is that they are often designed for specific purposes and thus apply to different contexts [Hep06], and categories may be arranged in distinct structures and expressed using different terminology [cf. RB01]. Finding automatic ways to harmonize such schemas is a difficult endeavor tackled by the schema matching [RB01] and ontology matching [SE13] research communities.

[11] http://www.eclass.de/ (accessed on May 16, 2014)
[12] http://www.unspsc.org/ (accessed on May 16, 2014)
[13] http://www.google.com/basepages/producttype/taxonomy.en-US.txt (accessed on July 22, 2014)

1.2.2 Complex Information Needs

People who consider purchasing products online typically have varying information needs, often of a complex nature. Sometimes they are already familiar with the characteristics of the products they are looking for (known-item seeking [MR06, pp. 33–35]). More often, however, they have only vague knowledge about the option space, which creates the need for exploratory or research searches [MR06, pp. 33–35].

Since products are multi-dimensional objects (see Figure 1.2), current approaches that operate on unstructured, uni-dimensional data fail to support multi-parametric searches. Keyword searches, for example, are often inappropriate with regard to users' complex and varying information needs.

To illustrate their main shortcoming, imagine the following multi-parametric keyword query for a specific product (in this case, a powered hedge trimmer for gardening): “Hedge trimmer, powered, light-weight, blade length of about 20 inches, at a price lower than 200 U.S. dollars, sorted by cheapest first.”

[Figure 1.2: Multi-parametric view of a powered hedge trimmer, spanning dimensions such as blade length, warranty duration, length, price, weight, displacement, no load speed, and engine power]
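For contrast, over structured data such an information need maps naturally onto a multi-parametric query. The following sketch (Python with rdflib; the ex: vocabulary, property names, and the two offers are invented stand-ins, not actual GoodRelations markup) filters a toy dataset of hedge trimmer offers by blade length and price and sorts by price:

```python
from rdflib import Graph

# Invented toy dataset: two hedge trimmer offers with numeric features.
data = """
@prefix ex: <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
ex:offer1 ex:name "Hedge Trimmer A" ;
    ex:bladeLengthInches "20.0"^^xsd:decimal ;
    ex:priceUSD "179.00"^^xsd:decimal .
ex:offer2 ex:name "Hedge Trimmer B" ;
    ex:bladeLengthInches "26.0"^^xsd:decimal ;
    ex:priceUSD "149.00"^^xsd:decimal .
"""

g = Graph()
g.parse(data=data, format="turtle")

# Multi-parametric query: blade length about 20 inches (here +/- 2),
# price below 200 USD, cheapest first.
query = """
PREFIX ex: <http://example.org/>
SELECT ?name ?price WHERE {
    ?offer ex:name ?name ;
           ex:bladeLengthInches ?blade ;
           ex:priceUSD ?price .
    FILTER (?blade >= 18 && ?blade <= 22 && ?price < 200)
}
ORDER BY ASC(?price)
"""
for name, price in g.query(query):
    print(name, price)  # only "Hedge Trimmer A" satisfies all constraints
```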

The limited capabilities of keyword-based search engines only allow for searching documents that contain exactly the same terms as specified in the query (i.e. computing the cosine similarity of term vectors), potentially expanding the query with synonyms or with past search results that turned out to be relevant based on user clicks. Implementing highly desirable features like currency conversion (U.S. dollars to Euros) and unit conversion (inches to centimeters), or correctly interpreting fuzzy and ambiguous terms like light-weight, at about, or sort by cheapest first, are non-trivial tasks. Notwithstanding the fact that search engines are continually making progress at better understanding user inputs in the form of keyword queries, there is still room for improvement, especially with respect to product search and discovery over structured data.
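On structured data, by contrast, unit and currency normalization reduces to simple arithmetic once values and unit codes are explicit. A minimal sketch (the inch-to-centimeter factor is exact by definition; the exchange rate is an invented placeholder that a real system would fetch from a live source):

```python
# Normalize quantitative values once value and unit code are explicit.
INCH_TO_CM = 2.54          # exact by definition
USD_PER_EUR = 1.1346       # placeholder rate; fetch live rates in practice

def to_centimeters(value, unit):
    # "INH" is the UN/CEFACT Common Code for inch (cf. Table 2.5).
    return value * INCH_TO_CM if unit == "INH" else value

def usd_to_eur(amount):
    return amount / USD_PER_EUR

blade_length_cm = to_centimeters(20, "INH")   # 50.8 cm
price_eur = usd_to_eur(200.0)                 # ~176.27 EUR
print(f"{blade_length_cm:.1f} cm, {price_eur:.2f} EUR")
```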

1.2.3 Insufficient Support for User Interaction

Product search is not a static, one-turn search task that can be translated into a single query; instead, it includes a learning effect about the option space, and a relaxation or refinement of constraints and preferences. Current approaches do not provide sufficient support for this kind of user interaction. In general, users are too much involved in the process of integrating information from multiple sources (e.g. combining product feature data from one site with reviews from a second and offers from multiple others), and too little able to contribute human intelligence and judgment to the process, or to adjust their search based on the outcome of the last iteration.

1.2.4 Research Problem and Research Hypothesis

Research problem. The traditional Web has limitations regarding deep product comparison, primarily due to the vast amount of unstructured and heterogeneous data, limited capabilities for data integration, and the missing support for advanced Web searches.

Our research strives to overcome these shortcomings by taking advantage of the Semantic Web, Linked Data, and related technologies. The Semantic Web is characterized by structured data with well-defined meaning formalized using ontologies [Gru93; Bor97; SBF98; GOS09]. By means of the GoodRelations ontology and schema.org, a lot of structured e-commerce data has already been made available on the Web. Accordingly, the research hypothesis is defined as follows:

Research hypothesis. The Semantic Web, with its underlying data model (RDF), the notion of unique global identifiers for describing entities (URIs), and widely accepted vocabularies for e-commerce (GoodRelations and schema.org), allows for the fine-grained description of products and product offers and facilitates the integration of different information sources. It thus constitutes a suitable infrastructure for product search and discovery on the Web.

This thesis proposes a solution to help overcome the main drawbacks of shallow product searches over the traditional, document-based Web. Our approach enables deep product comparison based on granular product descriptions published as Linked Data on the Web. In other words, we develop a search framework over real, structured product data available online as GoodRelations in RDFa and/or Microdata. Moreover, with the inherent ability to integrate Web resources on the Semantic Web, we are able to improve the visibility of product offers with sparse or low-quality product details, as often published by Web shop owners.

1.3 Motivation

Our research was motivated by the unsatisfying situation of current product searches on the Web, and by the economic relevance of product search and discovery on markets.

1.3.1 Implications of Constrained Web Searches

As a result of the limited capabilities of current search approaches, users frequently base their buying decisions on sparse information rather than taking comprehensive product details into account for product comparison. More specifically, very early in the search process people tend to

1. reduce searches to one or two dimensions of the product, i.e. they merely rely on the price tag and on product names or descriptions instead of more detailed qualitative and quantitative product characteristics,
2. make a preliminary selection in favor of a small number of products, i.e. they narrow down the option space very quickly in order to save time-consuming manual comparisons of product items over multiple product dimensions, and
3. further investigate and compare among the selected products, e.g. by visiting the manufacturer Web sites and doing manual side-by-side comparisons of product datasheets.

This approach, unfortunately, makes it hard to find close-to-optimal product offers on the Web, because it prematurely reduces the option space on the basis of incomplete information. It also risks unfairly favoring low-priced but potentially suboptimal goods over the best product for a given need. Furthermore, the effort needed to find and visit the pages with the relevant information, usually deep links of Web sites, is substantial, as illustrated in Figure 1.3.

[Figure 1.3: Deep product comparison on the Web. A search engine result page links to deep pages (Pages 1-5) spread across several Web sites (Site 1, Site 2).]

Until the user is satisfied with the collected results (if ever), he first needs to visit a number of Web sites listed on the SERPs and navigate through deep Web links in order


to gather the necessary information. This task possibly spans a series of Web searches. Even if a decision is made at some point, chances are that somewhere else, in one of the many data silos, there would have been relevant information leading to superior results.

1.3.2 Economic Relevance

Suboptimal product searches may create a series of problems, among others considerable time wasted on finding answers to an information need, wrong buying decisions because of incorrect, incomplete, or outdated information, or lost revenue due to unsatisfied customers. From an economic point of view, many of the possible shortcomings of poor product searches are connected with transaction cost theory [PRW08, p. 42]. In particular, excessively high search costs (i.e. the costs that accrue during the information-gathering phase of a transaction, the initial phase of a transfer of property rights; see Chapter 2) constitute a considerable share of the overall expenses in the market economy of today's information age. According to an estimate from 2005, knowledge workers spend about 38% of their working time searching for information [McD05]. Similarly, people looking for products and services on the Web often dedicate precious time and money to their searches.

The extent of search costs is marked by the difficulty of finding relevant information. The more specific the products, the more effort is usually needed to find the right product offer. Another important driver of search costs are information asymmetries between market participants. In the worst case, a lack of information might lead to adverse selection scenarios. A classical example is the "market for lemons" [Ake70], where in a used-car market good cars remain unsold because their quality is uncertain and not visible to the customers. A similar situation may arise for products offered on the Web. If searches on the Web are very shallow, additional value propositions are not rewarded properly and hence there is no incentive for vendors of high-quality products to participate in the market. Because the prices of qualitatively superior products are too high to attract potential customers, they are deselected very early in the search process in favor of low-quality products. The products remaining on the market are those with older technology, inferior specifications, or fewer product features. In other words, the technical limitations of the Web that impede a vendor's ability to articulate the value proposition of a superior product properly can lead to a market in which such a product will no longer be offered.

The field of search theory has dedicated extensive research to quantifying the effect of uncertainty on searches in markets. In his seminal work "The Economics of Information",


Stigler [Sti61] developed a formal model to derive the optimal number of searches for products on a market for goods. According to him, because there is price dispersion (an effect of information asymmetries) in the market, the optimal number of searches is a function of the search costs in relation to the expected gain of lowering the price with every additional search iteration [Sti61].

In view of the economic implications, an important quality criterion of search is to minimize search costs by fostering transparency and user engagement. Instead of restricting users to shallow keyword searches over text corpora, allowing them to compare products over multiple product dimensions would better account for their individual preferences (see Figure 1.2). By means of the multi-dimensionality of products (see Section 1.2.2), users could more easily identify and sort out "lemons" offered on the Web, which would ultimately lower search frictions by preventing unfair discrimination against high-quality products. More precisely, it may empower users to make rational choices by sacrificing the marginally lower prices of goods for the significantly higher quality of substitute products.

Up until now, however, buying decisions on the Web have mostly been based on the price tag of products and to a lesser degree on product features and quality [cf. Kar+ 05; Cha09b; Nel70]. This makes it extremely difficult for manufacturers and retailers to express the value proposition of their products, especially if there are multiple competing products on the market. In other words, they cannot easily communicate the comparative advantages of their items to potential customers. Hence, if searches operated on richer product descriptions, then even very specific products could benefit, because additional relevance signals could be conveyed to search engines and other consuming applications, which would render them more visible to potential customers.

1.4 Research Questions

In order to verify our research hypothesis stated in Section 1.2.4, we identified five research questions (RQ), each of which is dedicated a chapter of this thesis.

RQ 1. How can we obtain structured product offer data from the Web and what are its main characteristics? (Chapter 3)

RQ 2. How can we enrich product offers from the Web with granular, high-quality, and comprehensive product details from product model master data? (Chapter 4)

RQ 3. How can we supply product category information in order to support the organization and aggregation of, and the navigation over, product data? (Chapter 5)


RQ 4. What are the major gaps in the product data obtained in RQ 1, RQ 2, and RQ 3? How can we cleanse and integrate these data sources into a consolidated, enriched, and augmented view on product offers? (Chapter 6)

RQ 5. How can we realize product search with support for deep product comparison and incremental learning that is based on SPARQL queries over RDF data? (Chapter 7)
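To give an impression of what RQ 5 entails, the following is a minimal, illustrative SPARQL query of the kind a single step of a faceted product search could issue. The GoodRelations property names are real, whereas the currency and the price threshold of 500 are arbitrary examples:

    PREFIX gr: <http://purl.org/goodrelations/v1#>

    SELECT ?offer ?name ?price
    WHERE {
      ?offer a gr:Offering ;
             gr:name ?name ;
             gr:hasPriceSpecification ?ps .
      ?ps a gr:UnitPriceSpecification ;
          gr:hasCurrency "EUR" ;
          gr:hasCurrencyValue ?price .
      # restrict results to offers priced at most 500 EUR
      FILTER (?price <= 500.0)
    }
    ORDER BY ASC(?price)

Each interaction with a faceted interface would add or remove such triple patterns and filters, thereby narrowing or relaxing the option space.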

1.5 Research Method and Contributions

In this thesis, we develop a framework for incremental product search based on product and offer data from the Semantic Web. We predominantly use quantitative research methods, namely in Chapter 3, Chapter 4, and Chapter 6, where we analyze different aspects of the data collected from a Web crawl and from BMEcat catalogs, and in Chapter 5, where we report on structured data derived from product classification systems. In Chapter 7, we rely on experimental and qualitative research methods when we empirically evaluate our research prototype with a usability survey. Furthermore, as most of our work is concerned with developing software artifacts to demonstrate the viability of our approach, we touch on principles from design science research for information systems, as described by Hevner et al. [Hev+ 04]:

"[The design science paradigm] seeks to create innovations that define the ideas, practices, technical capabilities, and products through which the analysis, design, implementation, management, and use of information systems can be effectively and efficiently accomplished (Denning 1997; Tsichritzis 1998)." [Hev+ 04]

As a proof of concept, we show that our proposal is able to cope with real structured e-commerce data from the Web.

In the following, we present the main contributions of this thesis (see Figure 1.4). They are aligned with the aforementioned research questions (see Section 1.4) and, accordingly, correspond to Chapters 3-7 of the thesis.

[Figure 1.4: Contributions of this thesis. Contributions 1-3 supply product and offer data, product category data, and product model master data; Contribution 4 covers their integration, cleansing, and enrichment; Contribution 5 realizes deep product search and discovery on top of the consolidated data.]

Contribution 1. Collect real structured product and offer data from the Web.
We collect a sample of product offers from real Web shops that provides valuable insights into the nature of the structured data published on the Web, e.g., how many of the offered items feature product identifiers like Global Trade Item Numbers (GTINs) or manufacturer part numbers (MPNs), or how much granularity is supplied by the structured product descriptions of product offers on the Web. More precisely, the contribution is composed of the following tasks:

1. Identify and develop a catalog of data sources on the Web that contain structured product data.
2. Extract structured content from the Web pages found.
3. Analyze the nature of the data.

In this context, a parallelized Web crawler for semantic e-commerce data is presented.

Contribution 2. Develop a methodology to generate high-quality product model master data for the Semantic Web.
The granularity of the structured product data published by shop owners as RDFa or Microdata on the Web is limited, which is considered a serious bottleneck for deep product comparison. We thus describe a novel integration approach to support online product offers with the addition of authoritative product model master data from manufacturers and large retailers. In this regard, we provide a command-line tool to convert product catalogs in the established BMEcat catalog exchange format [SLK05a] to the GoodRelations e-commerce vocabulary [Hep08a] for the Semantic Web.

Contribution 3. Derive product type information from existing product categorization standards and proprietary product category systems.
Products can be further described semantically by product categories. This allows for more intelligent processing of product data; e.g., related products can be grouped together for accounting purposes. Product categories from product classification systems also represent important distinguishing characteristics for product search. In this contribution,


we apply an enhanced version of the GenTax algorithm [HdB07] to convert classification systems and taxonomies into respective Web Ontology Language (OWL) hierarchies that are compatible with the GoodRelations vocabulary for e-commerce.

Contribution 4. Analyze conceptual gaps between product offer data on the Web and product-related information of derived data sources. Collate and combine the product data into a consolidated data space.
The data sources obtained in Contributions 1-3 need to be integrated and cleansed in order to obtain a consolidated data space of product offers, as necessary for product search. More precisely, the data is collated in an RDF store with a SPARQL endpoint. After the conceptual gaps and the gaps in the product data have been identified, the data sources are consolidated, enriched, and augmented using respective SPARQL CONSTRUCT queries (a simplified example is sketched below, after Contribution 5).

Contribution 5. Build a faceted search interface which supports deep product comparison and incremental learning via user interaction.
We develop a prototype as a proof of concept that exemplifies product search over Semantic Web data. The search system is realized as a faceted search interface over product offer data, which allows users to compare product offers based on product details rather than the price tag only. The incremental search strategy lets the user gradually refine and relax the search scope, thereby providing a means to learn about the option space. To our knowledge, this is the first comprehensive attempt to support deep product comparison over Semantic Web data sources.
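To make the consolidation step of Contribution 4 more tangible, the following is a minimal, hypothetical SPARQL CONSTRUCT query that attaches product model master data to offered items via shared GTIN-13 identifiers. The GoodRelations terms are real vocabulary elements; the simple equi-join on the identifier is an illustrative assumption, not the exact query used in Chapter 6:

    PREFIX gr: <http://purl.org/goodrelations/v1#>

    # Link each offered item to the matching product model master data
    CONSTRUCT {
      ?item gr:hasMakeAndModel ?model .
    }
    WHERE {
      ?offer a gr:Offering ;
             gr:includes ?item .
      ?item gr:hasEAN_UCC-13 ?gtin .
      ?model a gr:ProductOrServiceModel ;
             gr:hasEAN_UCC-13 ?gtin .
    }

In this fashion, sparse shop data can inherit the granular features attached to the authoritative product model.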

1.6 Publications

With permission by the PhD committee and in accordance with the regulations at the Universität der Bundeswehr München, parts of the work presented in this thesis have already been published at peer-reviewed conferences. In the following, we list the relevant publications along with the topics they address:

• Product model master data for the Semantic Web:
  A. Stolz, B. Rodriguez-Castro, and M. Hepp: "Using BMEcat Catalogs as a Lever for Product Master Data on the Semantic Web". In: Proceedings of the 10th Extended Semantic Web Conference (ESWC 2013). Montpellier, France: Springer Berlin Heidelberg, 2013, pp. 623–638.

• Product category information for the Semantic Web:
  A. Stolz, B. Rodriguez-Castro, A. Radinger, and M. Hepp: "PCS2OWL: A Generic Approach for Deriving Web Ontologies from Product Classification Systems". In: Proceedings of the 11th Extended Semantic Web Conference (ESWC 2014). Anissaras/Hersonissou, Crete, Greece: Springer Berlin Heidelberg, 2014, pp. 644–658.

• Currency conversion in SPARQL as a contribution to data cleansing:
  A. Stolz and M. Hepp: "Currency Conversion the Linked Data Way". In: Proceedings of the First Workshop on Services and Applications over Linked APIs and Data (SALAD2013). Montpellier, France: CEUR Workshop Proceedings, 2013, pp. 44–55.

• Faceted search for deep product comparison on the Semantic Web:
  – A. Stolz and M. Hepp: "Adaptive Faceted Search for Product Comparison on the Web of Data". In: Proceedings of the 15th International Conference on Web Engineering (ICWE 2015). Rotterdam, The Netherlands: Springer Berlin Heidelberg, 2015, pp. 420–429.
  – A. Stolz and M. Hepp: "An Adaptive Faceted Search Interface for Structured Product Offers on the Web". In: Proceedings of the 4th International Workshop on Intelligent Exploration of Semantic Data (IESD 2015). Bethlehem, PA, USA, 2015, no pages.

The author of this thesis further authored and co-authored publications and technical reports during the work on this thesis that were not directly included in this work. Where relevant, they are cited as external references:

• A. Stolz, M. Ge, and M. Hepp: "GR4PHP: A Programming API for Consuming E-Commerce Data from the Semantic Web". In: Proceedings of the First Workshop on Programming the Semantic Web (PSW 2012). Boston, MA, USA, 2012, no pages.

• A. Stolz and M. Hepp: "From RDF to RSS and Atom: Content Syndication with Linked Data". In: Proceedings of the 24th ACM Conference on Hypertext and Social Media (Hypertext 2013). Paris, France: ACM, 2013, pp. 236–241.

• A. Radinger, B. Rodriguez-Castro, A. Stolz, and M. Hepp: "BauDataWeb: The Austrian Building and Construction Materials Market as Linked Data". In: Proceedings of the 9th International Conference on Semantic Systems (I-SEMANTICS 2013). Graz, Austria: ACM, 2013, pp. 25–32.

• A. Stolz, B. Rodriguez-Castro, and M. Hepp: RDF Translator: A RESTful Multi-Format Syntax Converter for the Semantic Web. Technical Report TR–2013–1. E-Business and Web Science Research Group, Universität der Bundeswehr München, 2013.

• A. Stolz and M. Hepp: GR2RSS: Publishing Linked Open Commerce Data as RSS and Atom Feeds. Technical Report TR–2014–1. E-Business and Web Science Research Group, Universität der Bundeswehr München, 2014.

• A. Stolz and M. Hepp: "Towards Crawling the Web for Structured Data: Pitfalls of Common Crawl for E-Commerce". In: Proceedings of the 6th International Workshop on Consuming Linked Data (COLD 2015). Bethlehem, PA, USA, 2015, no pages.

1.7 Thesis Outline

This thesis is organized as follows:

• Chapter 2 introduces background and related work relevant to the topic of this thesis.
• Chapter 3 describes the implementation of a Web crawler and the data collection process, analyzes the nature of the instance data, and summarizes relevant statistics.
• Chapter 4 presents a converter from product catalogs encoded in the XML-based BMEcat format to GoodRelations in RDF.
• Chapter 5 details an approach for deriving Web ontologies for products and services, with product classes and features, from product classification systems.
• In Chapter 6, we develop a typology of common data quality problems and gaps in product data, report statistics on the prevalence of obstacles in the Web crawl from Chapter 3, and propose a data management interface for SPARQL-compliant RDF stores.
• Building on the preceding work, we propose a faceted product search interface over RDF data in Chapter 7.
• Finally, Chapter 8 concludes our work by summarizing the contributions, discussing the results and limitations, and pointing to future work.


2 Background and Related Work

2.1 Relevant Economic Concepts
    2.1.1 Transaction Cost Theory
        2.1.1.1 Asset Specificity
        2.1.1.2 Bounded Rationality
    2.1.2 Information Economics and Search Theory
    2.1.3 Utility and Preferences
2.2 E-Business and E-Commerce
    2.2.1 Types of E-Commerce Transactions
    2.2.2 Types of Goods for E-Commerce
    2.2.3 Product Master Data
    2.2.4 Product-related Information Systems
        2.2.4.1 Product Data Management
        2.2.4.2 Product Lifecycle Management
        2.2.4.3 Product Information Management
        2.2.4.4 Enterprise Resource Planning
    2.2.5 Content Integration
    2.2.6 Standards for B2B Data Interchange
        2.2.6.1 Transaction and Catalog Exchange Standards
        2.2.6.2 Code Standards
        2.2.6.3 Product Identifiers for Electronic Business
    2.2.7 Product and Services Classification Systems
        2.2.7.1 Knowledge Organization
        2.2.7.2 Product Categorization Standards
        2.2.7.3 Proprietary Product Classification Systems and Taxonomies
    2.2.8 Electronic Marketplaces
    2.2.9 Electronic Tendering
2.3 Semantic Web and Linked Data
    2.3.1 Web
        2.3.1.1 World Wide Web
        2.3.1.2 Semantic Web
        2.3.1.3 Linked Data
    2.3.2 Unique Identifiers
    2.3.3 Resource Description Framework
    2.3.4 RDF Serialization Formats
        2.3.4.1 RDF/XML
        2.3.4.2 Turtle
        2.3.4.3 RDFa
        2.3.4.4 JSON-LD
        2.3.4.5 Non-RDF Syntaxes for the Semantic Description of Data
    2.3.5 Ontology Languages
        2.3.5.1 RDF Schema
        2.3.5.2 OWL Web Ontology Language
    2.3.6 Ontologies and Global Schemas
        2.3.6.1 Schema.org
        2.3.6.2 GoodRelations
        2.3.6.3 Simple Knowledge Organization System
    2.3.7 Query and Rule Languages
    2.3.8 Storage and Reasoning
        2.3.8.1 RDF Stores
        2.3.8.2 SPARQL Endpoints
        2.3.8.3 Reasoning
        2.3.8.4 Open-World and Closed-World Assumptions
        2.3.8.5 Non-Unique-Names Assumption
2.4 Semantic Data Interoperability
    2.4.1 Data Integration and Heterogeneity
    2.4.2 Schema and Ontology Matching
        2.4.2.1 Schema Matching
        2.4.2.2 Ontology Matching
    2.4.3 Data and Instance Matching
    2.4.4 String Matching
    2.4.5 Data Cleansing
2.5 Product Search
    2.5.1 Information Need
    2.5.2 Search Types
    2.5.3 Information Retrieval
        2.5.3.1 Approaches
        2.5.3.2 Ranking
        2.5.3.3 Evaluation Criteria for Information Retrieval
        2.5.3.4 Tools
    2.5.4 Human-Computer Interaction
        2.5.4.1 Static versus Dynamic Search
        2.5.4.2 Lookup versus Learning
        2.5.4.3 Searching versus Browsing
        2.5.4.4 Interaction Paradigms for Search
        2.5.4.5 Faceted Search Interfaces
        2.5.4.6 Design Guidelines for Search Interfaces
    2.5.5 Recommender Systems
        2.5.5.1 Approaches
        2.5.5.2 Limitations
        2.5.5.3 Tools
    2.5.6 Natural Language Processing
        2.5.6.1 Approaches
        2.5.6.2 Tools
    2.5.7 Matchmaking
        2.5.7.1 Characteristics of Matchmaking
        2.5.7.2 Matchmaking and Information Retrieval
        2.5.7.3 Ranking with Match Degrees
        2.5.7.4 Related Research Fields and Application Areas

This thesis focuses on product search and discovery on the Semantic Web. Hence, relevant theoretical background and related work include the economic significance of search, related aspects of e-business and e-commerce, the concepts of the Semantic Web and Linked Data, aspects of semantic data integration, as well as the main principles of the disciplines at the intersection with product search, namely information retrieval (IR), human-computer interaction (HCI), recommender systems, natural language processing (NLP), and matchmaking. See Figure 2.1 for a chapter outline.


[Figure 2.1: Chapter outline, arranging the five main sections: 2.1 Relevant Economic Concepts, 2.2 E-Business and E-Commerce, 2.3 Semantic Web and Linked Data, 2.4 Semantic Data Interoperability, and 2.5 Product Search.]

2.1 Relevant Economic Concepts

In the following, we introduce relevant economic concepts related to product search and discovery, including transaction cost theory, information economics, search theory, and utility.

2.1.1 Transaction Cost Theory

The trading of products and services over a market not only involves the costs of the goods per se, but also additional costs that accrue due to the exchange of property rights on a good that is the subject of a transaction [PRW08, p. 42]. These costs are referred to as transaction costs, sometimes also coordination costs [cf. Coa60; Wil81; PRW08, p. 42].

Transaction costs were first addressed in 1937 by Ronald Coase in his seminal article "The Nature of the Firm", albeit not yet termed as such. In seeking to explain the existence of the institution of the firm as an alternative to purchasing everything on the market, Coase argued that using the market mechanism is not free of cost [Coa37]. According to him, a market transaction encompasses the costs of the good, but also the costs of using the market [Coa37]. The amount of transaction costs that accrue by using the market mechanism can thus be used as a theoretical model to explain the existence of firms.

Oliver E. Williamson raised the public awareness of transaction costs. While Williamson principally shared Coase's conceptions, he added that internal costs, such as for governance and coordination, are present within firms [Wil81].


In other words, organizations have to expend resources that go beyond the costs of the transformation process itself, e.g. time and money for instructing an inexperienced workforce or overhead to control opportunistic behavior [cf. Wil81].

Transaction costs accumulate as part of the activities of a transaction. In Table 2.1, we illustrate typical transaction activities using the example of purchasing a laptop. These activities occur in chronological order: search, selection, negotiation, contract set-up, exchange, supervision, and enforcement [cf. PRW08, p. 42].

Table 2.1: Transaction activities [cf. PRW08, p. 42]

   Activity           Example
1. search             Seek and compare relevant laptop models
2. select             Choose suppliers that offer potentially interesting laptops
3. negotiate          Negotiate terms and conditions with one of the suppliers
4. set up contract    Set up a contract with the supplier
5. exchange           Get laptop from supplier and transfer money in return for the laptop
6. supervise          Check product quality and adherence to the contract
7. enforce            Exercise consumer rights (e.g. require a refund)

In his organizational failure framework, Oliver E. Williamson identifies four factors that influence the level of these transaction costs [cited from PRW08, p. 43]:

1. Bounded rationality: Actors are not perfectly rational because they are missing important information, either because they do not have access to the relevant information or because they are incapable of processing it.
2. Opportunism: Opportunistic behavior, such as individuals' profit maximization, causes markets to set up control mechanisms.
3. Specificity: Specific investments create dependencies between business partners, which can lead to opportunistic exploitation (opportunism).
4. Uncertainty/complexity: Environmental variables like prices, conditions, and quantities are not easily predictable (bounded rationality).

In the following, we discuss specificity and bounded rationality as relevant factors for search costs.

2.1.1.1 Asset Specificity

One of the key drivers of transaction costs is the specificity of assets [PRW08, p. 43, p. 46]. Specificity denotes the loss in value a good incurs if it is put to its next-best use [PRW08, p. 43].


For instance, a wedding cake is a very specific good: if it is not served at the particular wedding event, it immediately becomes worthless. Similarly, investments in highly specialized machines are considered very specific. By contrast, a kilogram of potatoes is not very specific, because its use is not confined to the purpose for which it was initially acquired [cf. PRW08, p. 43].

Williamson [Wil83] distinguishes four dimensions of asset specificity:

1. Site specificity, e.g. location-sensitive investments, such as a manufacturer processing copper and located next to a copper mine.
2. Physical asset specificity, e.g. the investment in a fully customized ERP system.
3. Human asset specificity, e.g. the investment in specialized employees.
4. Dedicated asset specificity, e.g. the investment in plants serving a single purpose.

Malone et al. [MYB87] identified yet another dimension, namely

5. Time specificity, e.g. time-critical assets such as perishable goods or a newspaper that quickly becomes outdated.

The specificity of goods has been increasing for decades, favored by the globalization and mass customization of products, which fostered the diversity of products and services available on markets [cf. PRW08, p. 9, p. 306]. Entering a bakery today, in contrast to a hundred years ago, we find a large variety of different sorts of bread: rather than being restricted to only one type of bread, today we can select between dozens of them.

As a general rule of thumb, the higher the specificity of a product, the harder it is to procure it on a market. The rationale can be found in higher transaction costs, caused by specific investments and the associated risk of opportunistic behavior [PRW08, p. 44]. Nonetheless, the Web is often considered to facilitate transactions over the market [MYB87]. While in the past significant effort was necessary to find and select suppliers (e.g. using yellow pages), this has meanwhile changed, because vendors increasingly use the Web as their primary sales channel.

2.1.1.2 Bounded Rationality

Many economic models make simplifying assumptions about rational actors with perfect information (known as homo economicus [Per95]), i.e. utility-maximizing individuals and profit-maximizing firms [cf. Sim59]. In reality, though, the information of individuals is incomplete, i.e. actors have to deal with environmental factors like uncertainty and


complexity, as well as cognitive limitations in accessing and processing information [cf. PRW08, p. 43]. Bounded rationality [Sim97, p. 118], as Herbert A. Simon refers to this problem, causes people to make decisions under uncertainty (e.g. by applying heuristics), which at some point might lead to satisfying but not necessarily optimal solutions [cf. Sim97, p. 119]. Bounded rationality is thus considered a trigger for higher transaction costs and, more specifically, for search costs.

2.1.2 Information Economics and Search Theory

As Stigler [Sti61] pointed out, the value of information was for a long time insufficiently considered by economists:

"One should hardly have to tell academicians that information is a valuable resource: knowledge is power. And yet it occupies a slum dwelling in the town of economics." [Sti61]

The field of information economics studies the relevance, value, and characteristics of information in economies and economic decisions. As such, it deals with topics like information as a good, information asymmetry, and the price mechanism [cf. Sti61; SV99]. Because information influences the decision-making of economic subjects, it plays an important role throughout the transformation process of a company. Information is considered a value-adding good just like physical goods that can be purchased and sold [cf. SV99, pp. 3f.]. Indeed, "people are willing to pay for information" [SV99, p. 3]. For a company, an information advantage often translates into a competitive advantage.

An information advantage is determined by information asymmetries [cf. Ake70], a situation where one party has more or better information than the other party. Information asymmetries feature prominently in the theoretical model of the principal-agent problem [PRW08, pp. 47–51; Ros73; JM76]. In a principal-agent problem, one party (agent) acts on behalf of a second party (principal) [e.g. Ros73; JM76]. The agent has an information advantage over the principal, e.g. being aware of its own strengths and weaknesses (hidden characteristics), actions (hidden action), and goals (hidden intention) [cf. Spr90]. The agent could thus behave opportunistically by concealing certain information. Typical principal-agent situations occur in employer-employee relationships, doctor-patient relationships, or insurance contracts [e.g. Ros73]. But they may equally appear in electronic business transactions, for example when a customer depends on the goodwill of vendors and the accuracy and transparency of their product descriptions.


In general, the roles of the principal and the agent can be bidirectional; they either depend on the point of view or change during a transaction (e.g. when the agent makes transaction-specific investments). The risk of opportunistic behavior inherent in principal-agent relationships often leads to the problems of adverse selection [Ake70], moral hazard [e.g. Arr63; Hol79], and hold-up [Gol76].

An often-discussed problem occurring in (online) marketplaces is price dispersion [e.g. PRS04; Hop08; BS00; Sti61]. Price dispersion results from consumers being underinformed about the prices of goods [Hop08], which can be attributed to information asymmetries. The different prices set for the same goods by different vendors lead not only to market inefficiencies but also to considerable search effort. Thus, lowering the search costs plays a crucial role for reducing price dispersion in online markets [cf. BS00]. For a few categories of otherwise homogeneous goods, this might be a simple challenge, and price comparison services are trying to fulfill that need. In a market with an increasing option space of product choices and a large number of relevant product dimensions, this turns into a very hard problem.

Search theory studies the economic behavior in markets with search frictions, i.e. where individuals have imperfect information and invest time and effort in searching [Pis01]. While Pissarides [Pis01] focused in his research particularly on the labor market, using search theory to explain frictional unemployment [cf. Pis01], Stigler studied price dispersion. Stigler [Sti61] developed a function for the optimal number of searches based on search frictions due to price dispersion in markets. According to this, the optimal amount of search is reached when the marginal cost of an additional search (i.e., requesting a price quotation from an additional seller) matches or exceeds the expected returns [Sti61]. For lower-priced products, this can mean that even a low-cost search does not necessarily pay off [Nel70].
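For illustration, Stigler's stopping rule can be sketched as follows (our notation, not taken from [Sti61]): assume nonnegative price quotations are drawn independently from a distribution with cumulative distribution function F, and each additional quotation costs c. The expected minimum price after n quotations, and the expected gain of one further search, are then

    \mathbb{E}[p_{\min}(n)] = \int_0^{\infty} (1 - F(p))^{n} \, dp

    g(n) = \mathbb{E}[p_{\min}(n)] - \mathbb{E}[p_{\min}(n+1)]
         = \int_0^{\infty} (1 - F(p))^{n} F(p) \, dp

Since g(n) decreases in n, it pays off to request another quotation as long as g(n) > c; the optimal number of searches is the largest n for which g(n) \geq c still holds.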

2.1.3 Utility and Preferences

Jeremy Bentham defined utility as the "property in any object, whereby it tends to produce benefit, advantage, pleasure, good, or happiness" [Ben23]. The school of hedonism, which dates back to ancient Greece, was first concerned with utility maximization, more precisely with individuals that seek happiness (or pleasure) while minimizing pain [Wei12]. Utilitarianism is a prominent school of thought grounded in the principles of hedonism, represented by Jeremy Bentham and John Stuart Mill [Wei12]. Bentham was the founder of the greatest-happiness principle, denoting that an action is considered moral if it creates happiness for society as a whole, for the greatest number of people [Ben23].


Unlike Bentham, who considered all differences among pleasures to be quantifiable, Mill argued that pleasures exhibit different qualities, treating physical forms of pleasure as inferior to higher, intellectual pleasures [Mil06].

In microeconomics, utility is typically modeled as a utility function [NS08, p. 87]. The utility function for a bundle consisting of two goods can be written as

    utility = U(x, y)                                                  (2.1)

The utility function U describes the preference structure of an individual over a combination of goods, i.e. x and y [cf. NS08, p. 89]. Preferences are assumed to be complete, transitive, and continuous [NS08, pp. 87f.]. Completeness means that for each pair of goods there must exist a preference relation, i.e. either x is preferred to y or vice versa, or both alternatives are equally preferred [NS08, p. 87]. Transitivity requires that preferences are consistent, i.e. that if x is preferred to y and y to z, then x is preferred to z [NS08, p. 87]. Finally, continuity describes the requirement that similar goods must exhibit a similar utility, i.e. if x and y are similar and x is preferred to z, then y is also preferred to z [NS08, pp. 87f.]. (The first two assumptions are stated compactly at the end of this subsection.)

More contemporary research efforts in psychology indicate that increasing numbers of options make people feel less confident and unhappy. In The Paradox of Choice: Why More Is Less, Barry Schwartz addresses the problem of too much choice in today's society: it raises the individual's expectations to levels that are not easily satisfiable, and people are unhappy because they have the impression of not having made the best decision [Sch04]. Oulasvirta et al. [OHS09] argue that this effect is also present in today's search engines, evaluating the effect of the number of search results displayed to the user. As already noted for bounded rationality [Sim59], people frequently have imperfect information; thus they tend to choose options that to some degree match their expectations rather than trying to find the optimum. In other words, because the price of finding the optimum is often considerable, in some cases it may be preferable to satisfice (e.g. to settle for a "fair price") than to maximize or optimize (e.g. to aim for the "best price") [cf. Sim59; cf. Sim97, p. 119].
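Stated compactly, with x ⪰ y denoting weak preference of x over y (our notation, not from [NS08]), completeness and transitivity read:

    \forall x, y:\quad x \succsim y \;\vee\; y \succsim x

    \forall x, y, z:\quad (x \succsim y \,\wedge\, y \succsim z) \Rightarrow x \succsim z

Continuity is commonly formalized by requiring that, for every bundle x, the sets \{y : y \succsim x\} and \{y : x \succsim y\} be closed.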

2.2 E-Business and E-Commerce

When referring to business activities over the Internet, the two terms e-business and e-commerce are frequently used. Their relationship is not always clear-cut [cf. PRW08, p. 274; cf. Cha09a, p. 13]. Quite often, e-commerce is considered a subdiscipline of e-business [Cha09a, pp. 13f.].


The two other prevalent perceptions are that e-business and e-commerce are treated as equivalent and used synonymously, or that they are considered as partially overlapping [Cha09a, pp. 13f.]. Similarly, the term e-government is often used to refer to e-commerce applied to the public sector [Cha09a, pp. 28f.].

The term e-business was coined by the International Business Machines Corporation (IBM) and made public in 1997 as part of an advertisement campaign for their Internet-based transaction services. Consequently, IBM was the first to give a definition of e-business, stating that "e-business is about transforming key business processes by using Internet technologies" [IBM11]. According to this, e-business entails all business-related activities that are conducted over the Internet. This may in particular include activities such as the collaboration between business partners, customer service, or intra-organizational transactions. Under this premise, concepts like e-procurement, e-commerce, e-sales, or e-marketplaces fall under the broad definition of e-business. We adopt this broad meaning of e-business in our subsequent discussion of e-commerce.

In contrast to e-business, e-commerce is about trading goods over electronic systems. It generally puts more emphasis on the transactional aspect, i.e. the processes and activities related to purchasing and selling products and services online: "Electronic commerce is the exchange, distribution, or marketing of goods or services over the Internet" [Gol08]. The affected activities may encompass procurement, sales, advertising, service provisioning, and payment. E-commerce helps to augment the quality of the decision-making process, to lower the costs, and to increase the speed of transactions [KR03].

2.2.1 Types of E-Commerce Transactions

Electronic transactions involve entities that act as either selling or buying parties [KW97, pp. 4f.; Cha09a, p. 11]. Sellers are often referred to as provider, producer, supplier, or server, whereas buyers are also known as consumer, customer, or client. This distinction of sell-side and buy-side e-commerce serves to define the types of e-commerce transactions with respect to the business entities involved.

The three most established and widely known transaction categories are business-to-business (B2B), business-to-consumer (B2C), and consumer-to-consumer (C2C) [Cha09a, p. 26; BGG01]. Table 2.2 shows all possible combinations of transaction categories.

Table 2.2: Categories of e-commerce transactions by entities involved [cited from Gri03]

                        Buyer
  Seller      Business   Government   Consumer
  Business    B2B        B2G          B2C
  Government  G2B        G2G          G2C
  Consumer    C2B        C2G          C2C

B2B e-commerce describes exchanges of products and services between supplier and manufacturer, manufacturer and wholesaler, or wholesaler and retailer. B2C, on the other hand, denotes transactions between retailer and consumer, less often between wholesaler and consumer (see Figure 2.2). By virtue of the Internet, traditional distribution channels in which intermediaries are chained up, as in Figure 2.2, gradually become less rigid [Cha09a, p. 65]; e.g., a manufacturer could open an online store where consumers purchase products directly.

[Figure 2.2: B2B and B2C e-commerce [adapted from Cha09a, p. 65]. The supply chain runs from supplier via manufacturer, wholesaler, and retailer to consumer; B2B transactions occur along the chain, B2C at the link between retailer (or wholesaler) and consumer.]

The most prominent example of a B2C marketplace is Amazon, a large online retailer whose business model relies on selling products to end customers. C2C, having been less common in the past, gained a lot of interest through eBay, an electronic auction and classifieds platform that, as an intermediary, facilitates transactions between individuals [cf. BGG01]. On eBay, people and organizations can both advertise and purchase goods, whereas the transaction fulfillment is controlled and handled by the platform provider.

Besides those addressed so far, there exist additional, less evident forms of transaction relationships (see Table 2.2), namely between business and government (B2G), consumer and government (C2G), consumer and business (C2B), government and government (G2G), government and consumer (G2C), and government and business (G2B) [e.g. Gri03].

New application areas for e-business have recently been gaining traction. Smartphones and tablet computers not only account for the largest number of computing devices sold in recent years [cf. Int15], but they also provide novel ways of conducting e-commerce, termed mobile commerce, mobile e-commerce, or m-commerce [e.g. VVK00; Sen00; BPJ02; TBH06]. Similarly, the new distribution channels opened up by social media are commonly referred to as social commerce [e.g. TBL10; WZ12] or, more specifically, in the realm of Facebook, f-commerce [e.g. FH13].


2.2.2 Types of Goods for E-Commerce

A good in the context of e-commerce is known as a commodity or product that can be traded [cf. Mar04; Hil99]. The term good is often used synonymously with the terms commodity and product. The term commodity was used in particular by early economists to refer to goods [cf. Hil99]. A commodity, according to Marx, is "'any thing necessary, useful or pleasant in life,' an object of human wants, a means of existence in the broadest sense of the word" [Mar04, p. 20]. Commodities are determined by their use-value and exchange-value, meaning that they can satisfy human needs and can be exchanged for another commodity in a market [Mar04, pp. 19–21]. Hill [Hil99] characterizes goods as follows:

"Goods are entities of economic value over which ownership rights can be established. If ownership rights can be established they can also be exchanged, so that goods must be tradable." [Hil99]

There are several possible ways of classifying products or goods, of which we subsequently present some popular distinctions relevant for electronic search and discovery. One possibility is to group products by their physical properties [Hil99], i.e.

• tangible goods, and
• intangible goods.

In his article clarifying the difference between intangible goods and services (both of intangible nature), Hill argues that intangible goods, unlike services, have all essential economic characteristics of goods [Hil99]. Hence, "the traditional dichotomy between goods and services should be replaced by a breakdown between tangible goods, intangible goods and services" [Hil99]. While tangible goods are physical goods, products that can be touched like food, clothing, electronic devices, or sports equipment, intangible goods are non-physical products [cf. Hil99]. Information goods, which play an important role in information economics, are a special subset of intangible goods with a particular economic value [cf. Hil99]. The value of information goods is determined by the information they provide. Classical examples of this kind of goods are digital photographs, music, movies, spreadsheets, or software. Unlike physical goods, the reproduction of information goods does not cost significant additional amounts of money [SV99, p. 3] (the cost structure of information goods usually consists of high fixed costs, namely the costs for designing and producing the first copy, and rather low variable costs, namely the costs for producing additional copies).

Nelson [Nel70] makes a distinction between

• search goods, and
• experience goods.

Search goods are those goods that are easy to evaluate without having seen or experienced them. By contrast, it is harder to assess the quality and characteristics of experience goods unless they have been used [Nel70]. This is particularly relevant for information goods, because for them it is generally difficult to assess their quality before having consumed them.

From a microeconomic point of view, products can be classified according to their relationships to each other. Economists dealing with microeconomics [e.g. NS08, p. 185] distinguish two important groups of products, namely

• complements, and
• substitutes.

Complements, on the one hand, are products that complement each other, e.g. product accessories like cream for coffee or sugar for tea. Substitutes, on the other hand, are products that can be replaced by one another, i.e. they are of similar use (they create similar utility, e.g. coffee and tea) or stem from different manufacturers with the same or a similar functionality (e.g. Pepsi Cola and Coca Cola) [cf. NS08, p. 185]. Further, it is possible that products exhibit no obvious relationship, e.g. milk and automobiles. This does not mean, however, that they are necessarily independent; in the given example, there is at least no evident relationship, unless milk were primarily consumed while driving or had been invented as fuel for automobiles.

2.2.3 Product Master Data

Master data refers to information artifacts about business entities, such as parties, places, and things [Whi+ 06a]. Accordingly, product master data is master data related to products, such as manufacturer information, product features, and product images. As its main characteristics, master data is (1) shared between applications and business processes among one or more departments or organizations; (2) relatively static and infrequently changed during business processes; and (3) potentially arranged hierarchically (e.g. an employee is a member of a department within a firm) [Los09, p. 8]. The information stored about such objects comprises attributes, definitions, roles, connections, categories, and metadata [Los09, p. 6]. Because master data represents the core business entities of an organizational unit that persist for extended periods of time, it is often stored centrally, and referenced and accessed by various departments when needed.


Transactional and analytical systems use master data for executing business operations and reporting business information. The data generated by these systems is referred to as transactional (or operational) and analytical data [Ora11]. Transactional systems create, modify, archive, or delete operational data around existing master data; this comprises, e.g., invoices and orders, which refer to product information and customer details, respectively. Analytical systems, meanwhile, compile decision-supporting reports relying on metrics about suppliers, customers, products, or employees, like revenues, costs, and performance indicators [MV05]. The efforts to set up policies, methods, and infrastructure to capture, integrate, and share relevant master data between organizations' stakeholders and information systems are called master data management (MDM) [Los09, pp. 8f.].

2.2.4 Product-related Information Systems

For quite some time, data has been stored in a decentralized fashion within companies. The increase in functionality and applications for desktop computers made it common for departments and employees to manage information autonomously, e.g. as text files, spreadsheets, or even printed on paper [cf. Los09, p. 3; Cha09a, pp. 165f.]. This led to many disparate data silos with heterogeneous information, which made it almost impossible to share information in an efficient way [Cha09a, pp. 165f.]. Nowadays, there exist various approaches and systems for the centralized organization and synchronization of product data within and across company borders. The most prominent concepts are product data management (PDM), product lifecycle management (PLM), product information management (PIM), and enterprise resource planning (ERP).

2.2.4.1 Product Data Management

Product data management (PDM) first arose in the 1980s and 1990s as a concept derived from engineering data management (EDM) [SI05, p. 1]. PDM describes the technology and software systems to integrate all information artifacts related to products. This includes capturing information from other systems like computer-aided design (CAD), computer-aided engineering (CAE), and computer-aided manufacturing (CAM), as well as providing interfaces to enterprise resource planning (ERP) and supply chain management (SCM) systems [cf. LMS07]. As such, PDM was a good fit for small to medium-sized companies seeking to maintain and share product information within an organization's borders.


2.2.4.2 Product Lifecycle Management

Product lifecycle management (PLM) describes an integrated management approach that spans the entire product lifecycle [LMS07]. The idea mainly emerged at the end of the twentieth century, focusing on product and support innovation in an Internet-enabled globalization of markets and mass customization of products [Liu+ 09]. PLM aims at raising companies' global competitiveness by facilitating the information flow within and across companies at every stage of the product lifecycle [e.g. SI05, p. 1; Liu+ 09; LMS07]. In other words, the challenge of PLM is to manage product information along the full product lifecycle [Sta11, p. 3; LMS07]. This specifically implies supporting the five phases of product development, namely imagination, definition, realization, use/support, and retirement/disposal [Sta11, p. 2] (i.e. the full path "from cradle to grave" [Sta11, p. 1]).

PLM is sometimes considered an advancement over PDM [SI05, p. 244]. In addition to PDM, it can contribute useful product-related information that emerges in later stages of the product development. Such additional information complements the product specifications and product metadata already present in PDM systems [SI05, pp. 7f.].

2.2.4.3 Product Information Management

The goal of product information management (PIM) is to create "one shared source of product information" from where it can be distributed to different sales channels [Abr14, p. 3]. Online vendors face several challenges for which spreadsheets are inappropriate [Abr14, p. 1]. The global market poses special requirements, forcing companies to sell products to stakeholders from different countries, at varying prices, and under differing brands [Abr14, p. 2]. In particular, this involves the need to offer a multitude and variety of products online, customers' increasing need for detailed product information, and cross-media publishing [Abr14, p. 2].

PIM systems store product information in a central repository where heterogeneous sources of product master data are reconciled, and create a single product view for all stakeholders [cf. Whi07]. This single product view aims at serving various distribution channels such as printed product catalogs, flyers, Web pages, online stores, mobile phones, tablets, etc. [Abr14, p. 1]. For instance, a PIM system can be used simultaneously to empower the content management system (CMS) of an online shop, to create electronic


product data feeds for business partners, and to compile a print catalog for customer home delivery.

2.2.4.4 Enterprise Resource Planning

Enterprise resource planning (ERP) describes a system integration approach to overcome the fragmented application infrastructures that governed companies in the 1990s [Cha09a, p. 166]. ERP typically refers to a monolithic system that functionally integrates the departments within a company, e.g. procurement, production, marketing, logistics, finance, and human resources [Cha09a, p. 167]. Unlike the previously presented approaches, ERP is not product-centered, but also encompasses master data of other business entities, along with operational and analytical data, e.g. material requirements, orders, and invoices. Nevertheless, ERP is often used together with other systems like PDM or PIM. Among the most popular ERP system vendors are SAP, Baan, PeopleSoft, and Oracle [BSG99].

In Table 2.3, we contrast MDM, PDM, PLM, PIM, and ERP along several dimensions that we distilled from our previous descriptions.

Table 2.3: Characterization of MDM, PDM, PLM, PIM, and ERP

  Focus                            MDM   PDM               PLM   PIM          ERP
  Systems approach                  −     +                 −     +            +
  Product-data-oriented             −     +                 +     +            −
  Master-data-oriented              +     +                 −     +            −
  Cross-corporate integration       −     −                 +     −            −
  Full product lifecycle support    na    − (engineering)   +     − (sales)    na

For some concepts a specific dimension is not applicable, which we have marked by "na" for "not available" or "not applicable".

2.2.5 Content Integration

As corporate environments grow more global, the exchange of product information has to take place between organizations. Hence, the integration problem is aggravated, because several heterogeneous applications and data sources lacking standardization are forced to interoperate [cf. Fen+ 01].

Within a single company's boundaries, the problem can be solved using data warehousing systems [Los09, pp. 3f.; cf. Inm02, pp. 31–33]. The employment of data warehouses reflects the traditional view of data integration, which is about integrating data from a variety of databases into a unified one [DHI12, p. 272].


Even if there exist other approaches (portals, operational data stores, federated database systems, peer-to-peer integration, among others [ZD04]), data warehouses are still very popular in business contexts [cf. DHI12, pp. 272f.]. Traditional data warehouses, however, have an important limitation: they collect historical data and generate reports from various applications, which might not properly reflect the live status of the data in the original systems [Inm02, p. 35].

In business settings with many organizations, integration becomes even more challenging because of the variety of formats, schemas, etc. This problem is addressed by content integration, i.e. the "integration of operational information across enterprises" [SH01]. Content integration strives to consolidate potentially volatile content available from disparate organizations, in different formats, and at different levels of granularity. In particular, this includes, besides rich data sets, also weakly structured information objects such as images, scans, videos, PDF files, presentations, or informal notes and reports. Data warehousing techniques are too limited for integrating content from different data providers [SH01]. While Stonebraker and Hellerstein [SH01] propose a data federation approach with materialized views as a possible solution for content integration, the problem is still prevalent in many enterprises today.

2.2.6 Standards for B2B Data Interchange

In e-business and e-commerce, there is a need to establish consensus between business parties regarding the structure and semantics of the exchanged business documents and the identification of business objects. Otherwise, misunderstandings or transaction failures may arise that ultimately create unnecessary costs and hamper the willingness of organizations to continue doing business with each other. Standardization is thus a frequently mentioned requirement for the automated exchange of product information [e.g. HT05, p. 191]. By means of standardization, positive network effects can accrue [FS85; KS85] that raise compatibility and interoperability within and across enterprises by reducing internal process costs as well as external transaction costs. There is a series of important standards relevant for e-business that can roughly be grouped into process standards, transaction standards, catalog standards, classification standards, and identification standards [GI12, p. 6]. In the following, we take a closer look at these standards. In addition, we present code standards that ensure frictionless electronic communication between trading organizations from different domains and countries.


2.2.6.1 Transaction and Catalog Exchange Standards

In general, there are two established models of exchanging product information between corporate entities, as depicted in Figure 2.3. The first is bilateral exchange [SLÖ08], in which vendors set up individual information channels with all their suppliers. This type of exchange becomes difficult to manage with a growing number of transaction partners. In particular, the number of mappings required to harmonize different product descriptions increases exponentially [Fen+ 01]. The alternative is a master data pool, provided by a data intermediary that procures master data from various manufacturers, potentially curates it, and furnishes it to different vendors. This is known as multilateral exchange [SLÖ08] and is the key idea underlying business-to-business marketplaces [Fen+ 01]. Instead of up to m × n, as few as m + n individual communications or mappings need to be set up [Fen+ 01].
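
To make the difference tangible, consider a purely illustrative scenario (the numbers are ours, not from [Fen+ 01]) with m = 100 manufacturers and n = 50 vendors:

\[
m \times n = 100 \times 50 = 5000 \ \text{bilateral mappings}
\qquad \text{versus} \qquad
m + n = 100 + 50 = 150 \ \text{mappings via a master data pool.}
\]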

Figure 2.3: Models of master data exchange: (a) Bilateral versus (b) multilateral [from SLÖ08]

The Global Data Synchronization Network (GDSN) [SLÖ08; GS1ND ] goes beyond these traditional exchange models, aiming to set up a network of data pools that are synchronized in such a way that the supplier and customer sides obtain real-time, high-quality product data. The data pools are controlled and consistently updated by means of a central global registry. The registry ensures that, whatever data pool has been selected by the trading partners, they always obtain the most recent product information via publish-subscribe [cf. SLÖ08]. Every change to product data is automatically propagated to all participating data pools in the network.

Transaction Formats

For the bilateral exchange of data, standard data exchange formats have been established in the past. Long before the existence of the World Wide Web (WWW), in the 1960s, companies had already started to replace paper-based communication with electronic exchanges of business documents relying on electronic data interchange (EDI) [Cha09a, pp. 176f.]. EDI uses a set of standard messages for automated processing by computers, which, however, are difficult for humans to grasp in terms of syntax, structure, and semantics. More recently, the most popular parts of the EDI standard have been mapped to Extensible Markup Language (XML) syntax in order to suit modern software systems and to reduce communication costs [cf. Hue00; PW97]. With XML, EDI-based data exchange can further benefit from the broad tooling support for XML, i.e. the syntactical correctness and completeness of a transmitted document can be validated much more easily than for a message stream [cf. PW97; cf. PRW08, p. 150]. Over time, the EDI communities have developed a number of industry-specific subsets, such as ODETTE, focusing on the automotive sector in Europe, or UN/EDIFACT, covering the fields of administration, commerce, and transport [cf. PRW08, p. 150]. Besides EDI-based standards, a series of other data exchange formats obtained wide acceptance, such as Electronic Business using XML (ebXML), Commerce XML (cXML), XML Common Business Library (xCBL), Universal Business Language (UBL), Open Applications Group Integration Specification (OAGIS), RosettaNet, or OpenTRANS [cf. HT05, pp. 205–208; SLK04]. These exchange formats support the typical processes of a business transaction, namely the automated exchange of product catalogs, quotations, customer orders, delivery notes, invoices, and payments.

Standards for Representing and Exchanging Product Details

The exchange standards presented so far are rather generic and often provide insufficient support for describing products and services in much detail. For this reason, there exist complementing extensions and subsets of these standards. Product Data Message (PRODAT) and Price Catalog Message (PRICAT) are two message types in the EANCOM subset of UN/EDIFACT that can be used to exchange product information [DLS01, p. 1532; HT05, p. 205]. PRODAT describes a message type for EDI that supports the exchange of product master data. Among others, it contains segments to indicate product-related details like product group information (segment code “PGI”), currencies (“CUX”), and physical measurement values (“MEA”) [UN 14]. PRICAT defines a set of messages for exchanging product catalogs between trading partners, permitting them to indicate commercial details, such as price information, terms for payment and transport, and packaging information, but also product characteristics and categories [UN 12].


The Standard for the Exchange of Product Model Data (STEP), also known as ISO 10303 [Pra05], is a family of standards for the exchange of product master data [SLK04]. The STEP standard is intended for representing relevant product information throughout the full product life-cycle, including for example the specification of CAD/CAE/CAM objects as generated and used by enterprises involved in the design, engineering, and manufacturing of a product [cf. Pra05]. The standard has evolved into a modular architecture facilitating the development of application protocols (APs) to serve product representation among several applications [cf. Pra05]. STEP is formalized using a system-neutral and machine-understandable data modeling language, EXPRESS [Int04], and usually exchanged via a STEP file [Int02a]. EXPRESS schemas can also be illustrated using a human-friendly graphical notation (EXPRESS-G) [Int04], or serialized as XML (e.g. STEP-XML), as proposed in ISO 10303-28 [Int07a]. This neutrality of its specification language makes the STEP standard appropriate for both single- and cross-domain usage, e.g. integrating multiple ERP systems as well as connecting CAD systems with ERP systems. Nevertheless, STEP is not appropriate for describing commercial product properties, such as those used for e-sales and e-procurement [SLK04].

Product catalog exchange formats can be considered standards for the exchange of product data via electronic product catalogs, mostly between PIM system operators (typically manufacturers and suppliers) and clients (retailers and customers). As opposed to transaction formats, pure product catalog exchange formats concern product-related information rather than transaction-related information in general. Numerous transaction formats have built-in support for exchanging product specifications; popular examples in XML are RosettaNet, cXML, and xCBL [cf. SLK04]. BMEcat [SLK05a; HS00] is an XML-based exchange format for product catalog data developed by the eBusiness Standardization Committee consortium in Germany, consisting of members like Fraunhofer IAO, Universität Duisburg-Essen, and industrial organizations. The standard has also proved successful in other European countries [SLK04]. It can be used together with B2B XML standards such as OpenTRANS, which defines business documents for the transactional part, whereas BMEcat focuses on the product catalog to be transmitted [SLK05a, p. 7].

2.2.6.2 Code Standards

Trading partners often speak different languages, in a sense that goes beyond the use of compatible data formats. Consensus needs to be established about date formats, locations, languages, currencies, and units of measurement. In the following, we summarize code standards for e-commerce that many transaction standards adhere to.

Date and Time Formats

The use of compatible formats for date and time is critical in many applications. For example, a date “12/10/2015” can mean two different things, namely a) October 12, 2015 (DD/MM/YYYY), or b) December 10, 2015 (MM/DD/YYYY). The ability to reliably distinguish these two variants is critical in business contexts. The potential for misinterpretation is further aggravated if the date format is of the following form: “12/10/15”. To overcome this problem, the International Organization for Standardization (ISO) suggests a standard, ISO 8601 [Int88], for representing dates and times in a uniform way. It is a powerful standard that covers a variety of possible formats (localized time and date, calendar week numbers, time intervals, recurring time intervals, durations). The basic date and time formats are outlined in Table 2.4. In the notation used there, YYYY denotes the year in the Gregorian calendar (referred to as CCYY, with CC for the century digits, in [Int88]), MM the month, and DD the day of the month; similarly, we refer to hours with hh, minutes with mm, and seconds with ss. The separating hyphens and colons denote the more readable extended format [Int88]; they could also be omitted. ISO 8601 also allows more granular descriptions of dates and times. For instance, fractions of a second can be attached to the basic formats for date and time, e.g. “14:30:00.05”. Alternatively, time zone information can be supplied, e.g. “14:30:00+01:00” for Central European Time (CET).

Table 2.4: Date and time formats as defined by ISO 8601

Concept         Format                Example
Date            YYYY-MM-DD            2015-10-12
Time            hh:mm:ss              14:30:00
Date and time   YYYY-MM-DDThh:mm:ss   2015-10-12T14:30:00

Country Codes

ISO 3166 is a standard code for indicating geographical or geopolitical areas (i.e. countries, states, provinces). The standard is divided into three parts, namely ISO 3166-1 [Int13b], ISO 3166-2 [Int13c], and ISO 3166-3 [Int13d]:

ISO 3166-1: Regions, further divided into three different types of codings:
a) alpha-2: Two-letter country codes, e.g. “DE” for Germany
b) alpha-3: Three-letter country codes, e.g. “DEU” for Germany
c) numeric: Three-digit country codes, e.g. “276” for Germany
ISO 3166-2: Countries and administrative subdivisions (regions, districts, provinces, federated states), e.g. “DE-BY” for Bavaria, Germany
ISO 3166-3: Deprecated country names, e.g. “YUCS” for Yugoslavia

The two-letter country codes, defined in ISO 3166-1 alpha-2, are well-established and serve as the basis for the subdivisions specified in ISO 3166-2 [Hep08b, p. 24]. Two-letter geographical codes are used for example for country-code top-level domains of Web addresses, e.g. http://www.example.de for a German Web site address.

Language Codes

Language codes are defined by ISO 639 [IntND ]. ISO 639 entails six standards (part 1 to part 6). Their specifications differ by the length of the letter code, ranging from two to four letters (alpha-2, alpha-3, and alpha-4). The number of languages encoded varies accordingly: ISO 639-1 [Int02b] covers most international languages, whereas ISO 639-2 [Int98], whose codes consist of one additional letter, covers many more languages [IntND ]. Examples for ISO 639-1 are “de” for German and “en” for English. In ISO 639-2, for legacy reasons, some languages have two three-letter codes; e.g., “deu” and “ger” are both valid codes for German. ISO 639-3 [Int07b] preserves only the native code of each such language, i.e. “deu”. Furthermore, it covers not only living languages but also dead and ancient languages [IntND ]; e.g., Old High German and Middle High German are represented by “goh” and “gmh”.

Codes for Units of Measure

The UN/CEFACT Common Code standard defines codes consisting of up to three alphanumeric characters for the description of units of measurement [Uni09b]. The codes defined in the standard allow for the automated exchange of physical measures in international trading. The standard provides codes for both base and derived International System of Units (SI) units [Nat08, p. 9]. For mass, the base unit is the kilogram4. Accordingly, gram is a derived unit for mass. The standard includes conversion factors that relate derived units to their compatible base units. The conversion factor for gram with respect to kilogram is, as expected, “0.001” (see Table 2.5).

4 The Bureau International des Poids et Mesures5 (BIPM) keeps a reference mass prototype of exactly one kilogram, made of platinum-iridium, in Sèvres near Paris, France [Nat08, p. 18].
5 Engl.: International Bureau of Weights and Measures


Table 2.5: Selected snippet related to “kilogram” from the UN/CEFACT Common Code [Uni09a]

Common Code   Name                 Conversion Factor   Symbol
MC            microgram            10^-9 kg            µg
DJ            decagram             10^-2 kg            dag
DG            decigram             10^-4 kg            dg
KGM           kilogram             kg                  kg
GRM           gram                 10^-3 kg            g
CGM           centigram            10^-5 kg            cg
TNE           tonne (metric ton)   10^3 kg             t
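
To illustrate how the conversion factors of Table 2.5 can be applied, here is a minimal Python sketch (the dictionary is hand-copied from the table and is not an official code list):

# Conversion factors to the SI base unit kilogram, taken from Table 2.5.
TO_KG = {
    "MC": 1e-9,   # microgram
    "DJ": 1e-2,   # decagram
    "DG": 1e-4,   # decigram
    "KGM": 1.0,   # kilogram
    "GRM": 1e-3,  # gram
    "CGM": 1e-5,  # centigram
    "TNE": 1e3,   # tonne (metric ton)
}

def convert(value: float, from_code: str, to_code: str) -> float:
    """Convert a mass value between two UN/CEFACT common codes via kilograms."""
    return value * TO_KG[from_code] / TO_KG[to_code]

print(convert(500, "GRM", "KGM"))  # 500 g -> 0.5 kg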

Currency Codes

Currency codes are the “units of measurement” for monetary amounts. They greatly facilitate international electronic transactions by eliminating potential ambiguities about prices. ISO 4217 [Int08] is an international three-letter code standard for currencies. Currency codes are composed of the two-letter ISO 3166-1 [Int13b] country code and, where possible, the initial letter of the currency name (e.g. “USD” for US dollars, “CHF” for Swiss francs) [Int08]. There are exceptions, though, as for example with the euro (“EUR”). Furthermore, ISO 4217 defines corresponding three-digit numeric codes [Int08]. E-business heavily relies on currency codes, e.g. e-commerce platforms and Web shops need to show price details in different currencies and calculate price values based on currency exchange rates. But even in our everyday lives, we are often confronted with currency codes, e.g. on passenger transport tickets of trains and airlines.

2.2.6.3 Product Identifiers for Electronic Business

Electronic systems use product identifiers for the seamless integration and reliable electronic exchange of product information. The most relevant types of product identifiers for e-business are product item identifiers, product model identifiers, custom product identifiers, and product class identifiers. For the subsequent discussion of the product identifier types, let us consider an example featuring the following product information about (a particular instance of) a car radio:

JVC KD-R741BTE Auto CD-Receiver
GPC: 10001527
Brand: JVC
Serial no.: 01-12-14-R741BTE-171
EAN-13: 4975769403750
MPN: KD-R741BTE
SKU: A2318110


Product Item Identifiers

A product item identifier globally and uniquely identifies an individual instance of a particular product type. Examples of product item identifiers are serial numbers, either for tangible or intangible products. This includes software product keys, Vehicle Identification Numbers (VINs), or International Mobile Station Equipment Identity (IMEI) numbers to identify mobile phones. The Electronic Product Code (EPC) is a universal product identifier to identify virtually every physical object [GS1ND , p. 22]. EPC codes are typically encoded into radio-frequency identification (RFID) chips [GS1ND , p. 22]. The serial number of our example globally identifies a unique instance of a car radio, e.g. the one that we just purchased from a retail store. The serial number is typically assigned by the manufacturer, often obeying a generally accepted standard (e.g. VIN).

Product Model Identifiers

A product model identifier globally and uniquely identifies the make and model of a product, i.e. a bundle of identical trade items, but not an individual product item. It can be considered an identifier for the prototype or blueprint of a product [cf. GS116, p. 25]. An important product model identifier developed by GS1 is the Global Trade Item Number (GTIN) [GS115]. GTIN distinguishes four important codes with different lengths, namely GTIN-8, GTIN-12, GTIN-13, and GTIN-14 [GS115]; the trailing number indicates the number of digits that the identifier contains. On trade items, the GTIN is encoded into EAN/UPC barcodes [GS1ND , p. 17] or, alternatively, into EPC codes used with RFID tags [GS1ND , pp. 21f.]. While the European Article Number (EAN) is mainly used in Europe, the Universal Product Code (UPC) is more popular in North America and other English-speaking countries. Both EAN-13 (13 digits) and UPC (12 digits) product codes can be mapped to the newer GTIN-14 number by padding them with zeros to the left until 14 digits are reached [GS115; GS116, p. 26]. Amazon employs a proprietary product model identifier, the Amazon Standard Identification Number (ASIN) [Ama16]. Some industries even developed their own identification standards: Books, for instance, can be identified using the well-established International Standard Book Number (ISBN) [Int05], and the pharmaceutical industry standardized the PharmaCode (or Pharmaceutical Binary Code) for medicine and health-care products. The German equivalent is the Pharmazentralnummer (Engl.: Central Pharmaceutical Number) (PZN) [Inf15]. In our example, a 13-digit EAN code of the product model is provided (“4975769403750”), which is equivalent to the GTIN-13 code. The corresponding ASIN code that Amazon would assign is “B00B59P4YW”.
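
As a small illustration of the zero-padding rule mentioned above, the following Python sketch (the helper name to_gtin14 is ours) maps an EAN-13 or UPC digit string to its GTIN-14 form; check-digit validation is deliberately omitted:

def to_gtin14(code: str) -> str:
    """Pad a GTIN-8/12/13 digit string with leading zeros to 14 digits."""
    if not (code.isdigit() and len(code) in (8, 12, 13, 14)):
        raise ValueError("expected a GTIN-8/12/13/14 digit string")
    return code.zfill(14)

print(to_gtin14("4975769403750"))  # EAN-13 from our example -> 04975769403750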

Custom Product Identifiers

Manufacturers and vendors frequently assign custom codes to their products for internal reference. In practical scenarios, and to a lesser degree in academia, the terms manufacturer part number (MPN) and stock keeping unit (SKU) are widely used for product part numbers within the scope of manufacturer and vendor catalogs, respectively [e.g. EBa15; GooND ]. More precisely, when manufacturers design and produce product parts, they often assign an MPN in order to refer to them internally. Similarly, vendors use SKUs to identify the products of their inventories. Nonetheless, unlike GTIN numbers for example, these identifiers are generally not designed for use as globally unique identifiers. Multiple MPNs may be assigned to one and the same product make and model. This holds especially true if more than one manufacturer is producing parts with respect to the same product specification. These manufacturers are considered to produce non-OEM parts, as opposed to original equipment manufacturer (OEM) parts only produced by the original manufacturer. We can see this for example in the automotive sector, where spare parts are often manufactured by various companies. For the product item in our example, both the MPN (“KD-R741BTE”) and the SKU (“A2318110”) are given.

Product Class Identifiers

In e-business contexts, similar products (or product models) are often grouped into categories (see Section 2.2.7), which can be assigned global identifiers themselves. The GPC code [GS105], as used in our example, is part of a multi-level hierarchical classification standard that, read from top to bottom, is divided into the four tiers Segment > Family > Class > Brick. The lowest level of the Global Product Classification (GPC) standard denotes the most specific category. Each category is codified using an eight-digit number. Our car radio is characterized by a very specific category (“10001527” – “Car Audio CD Players/Changers”), i.e. it belongs to the lowest possible level of the GPC hierarchy:

Segment   68000000 (Audio Visual/Photography)
Family    68030000 (In-car Electronics)
Class     68030200 (Car Audio)
Brick     10001527 (Car Audio CD Players/Changers)
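
As a minimal sketch of how such a brick code could be resolved to its full category path, consider the following Python snippet (the tiny in-memory hierarchy is copied from the example above and is, of course, not the official GPC data set):

# Parent links and labels for the example GPC categories only.
PARENT = {
    "10001527": "68030200",  # Brick -> Class
    "68030200": "68030000",  # Class -> Family
    "68030000": "68000000",  # Family -> Segment
}
LABEL = {
    "68000000": "Audio Visual/Photography",
    "68030000": "In-car Electronics",
    "68030200": "Car Audio",
    "10001527": "Car Audio CD Players/Changers",
}

def gpc_path(code: str) -> list:
    """Walk the parent links from a brick code up to its segment."""
    path = [code]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return list(reversed(path))

print(" > ".join(LABEL[c] for c in gpc_path("10001527")))
# -> Audio Visual/Photography > In-car Electronics > Car Audio > Car Audio CD Players/Changers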


Comparison

In Table 2.6, we summarize the main characteristics of the product-related identifier types, comparing them by their usual scope (locally or globally valid), their range (a single item, or a group of the same or similar items), and the issuer of the identifier.

Table 2.6: Characteristics of product identifier types

Identifier   Scope             Range           Issuer
Item         global            single item     manufacturer
Model        global            same items      manufacturer
Custom       local             same items      manufacturer, vendor
Class        local or global   similar items   maintainer of the classification

2.2.7 Product and Services Classification Systems

This section presents some classification systems for organizing products and services. In this context, let us first clarify the term classification and introduce related terms that, even though often used interchangeably, generally mean different things.

2.2.7.1 Knowledge Organization

Organization and classification are inherent to humans [Hod00, p. 4]. From the beginning of our lives, we consciously and unconsciously categorize observations that we have made and compare them with one another. Just think of toy building blocks of different shapes and colors that a young child tries to match into given forms. In the literature, there is no single, common understanding of classification or knowledge organization [Hep03, pp. 50–52; Hjø08]. Hjørland, for example, divides the organization of knowledge into a narrower and a broader perspective [cf. Hjø08]. While the broader meaning focuses on the social aspects of knowledge organization, such as the organization of reality (e.g. biological systematics, geographical classification, etc.), the narrower meaning refers to the technical challenges of easing the storage and retrieval of information (e.g. the building of taxonomies and thesauri) [Hjø08]. Many of the techniques for knowledge organization arose from the field of library science [Gar04; cf. Hod00, p. 3; cf. Hed10, pp. 1f.]. In library and information science, a classification represents a type of Knowledge Organization System (KOS) [Hod00, p. 3]. KOSs, according to Hodge, “encompass all types of schemes for organizing information and promoting knowledge management” [Hod00, p. 3]. Their purpose is to group the information assets of digital libraries in a way that facilitates the information discovery process. The three main groups of KOSs are [Hod00, pp. 5–7; Hed10, p. 2]:

1. Term lists: Authority files, glossaries, dictionaries, and gazetteers.
2. Classifications and categories: Subject headings, categorization schemes, and taxonomies.
3. Relationship lists: Thesauri, semantic networks, and ontologies.

Hedden condenses these kinds of KOSs into controlled vocabularies, taxonomies, thesauri, and ontologies [Hed10, pp. 3–15]. The expressivity and complexity of these KOSs increase in the given order, as illustrated in Figure 2.4.

Figure 2.4: Structural complexity increase of knowledge organization systems, from simple term lists via taxonomies and thesauri to expressive, complex ontologies [adapted from Nat05, p. 17]

Controlled Vocabulary

A controlled vocabulary, as understood by Hedden, “may cover any kind of knowledge organization system, with the possible exclusion of highly structured semantic networks or ontologies” [Hed10, p. 3]. The understanding of a controlled vocabulary would thus typically range from term lists to simple relationship lists. In its simplest and most common form, a controlled vocabulary describes a closed list of terms [Hed10, p. 3; Gar04], e.g. a predefined list of category or concept names. With the limited range of available terms, the freedom to choose between arbitrary terms is constrained, which makes the use of a controlled vocabulary more deterministic and manageable, also in collaborative environments. Controlled vocabularies help ensure consistency, e.g. by indexing or tagging the documents of a digital library more homogeneously and unambiguously [Hed10, p. 3]. More sophisticated controlled vocabularies also provide cross-referencing across terms and offer synonym sets or synonym rings [Hed10, p. 4; Nat05, p. 18], e.g. to help disambiguate terms and categories or to improve recall during information retrieval [cf. Nat05, p. 18].

Taxonomy

The National Information Standards Organization (NISO) defines a taxonomy as a “collection of controlled vocabulary terms organized into a hierarchical structure” [Nat05, p. 9]. For Hedden, the term taxonomy carries a broader meaning [Hed10, p. 1, p. 6], namely representing “any means of organizing concepts of knowledge” [Hed10, p. 1]. Instead, Hedden uses the more accurate term hierarchical taxonomies to refer to the narrower understanding of taxonomies [Hed10, p. 6]. Considering a taxonomy as a hierarchical arrangement of terms for concepts or categories, the hierarchy is built up of parent-child/broader-narrower relationships [Nat05, p. 9; Gar04], or is-a relations. In a biological hierarchy, for example, a cow is a mammal, whereas a mammal is an animal (potentially skipping some hierarchy levels). However, different interpretations of hierarchical relationships are possible, i.e. is-a relations do not necessarily mean the same in different contexts and in different classification schemes. Brachman [Bra83] addressed this problem, summarizing and analyzing a number of possible interpretations of the is-a relation.

Thesaurus

By the definition of NISO, a thesaurus is a “controlled vocabulary arranged in a known order and structured so that the various relationships among terms are displayed clearly and identified by standardized relationship indicators” [Nat05, p. 9]. In a thesaurus, a collection of terms can be organized using different standard types of relationships [e.g. Gar04; Hed10, p. 10; Hod00, p. 6]:

• Hierarchical (e.g. is-a relations),
• associative (e.g. related terms), and
• equivalence (e.g. synonyms) [Hed10, p. 10; Hod00, p. 6].

These characteristics imply that a thesaurus merely extends a taxonomy by additional relationships with richer semantics [Gar04; Hed10, p. 10]. However, unlike a taxonomy, a thesaurus is less strict about a hierarchical order between its terms [Hed10, pp. 10f.]. In a thesaurus, if a broader/narrower relationship between terms does not hold, the relationship is simply omitted [Hed10, pp. 10f.]. In the ISO 25964 standard, ISO publishes guidelines for the development and maintenance of thesauri (part 1 [Int11]) and for establishing interoperability with other vocabularies (part 2 [Int13a]).

Ontology

Semantic networks6 [e.g. Sow13; Leh92] and ontologies [e.g. Gru93] are more powerful than thesauri, because they add further semantic richness to facilitate information sharing about a domain of discourse [Hed10, pp. 12f.; Gru93]. An ontology possibly entails all relationships of taxonomies and thesauri, plus it defines additional elements in order to express facts of a specific knowledge domain in greater detail [Hed10, p. 12]. These additions can be domain-specific (because terms and relationships can be defined at will), which permits for example to express the following sentences: A human is a mammal. All mammals are mortals.

6 “A semantic network or net is a graph structure for representing knowledge in patterns of interconnected nodes and arcs.” [Sow13]

A proper definition and a more in-depth discussion about ontologies follow in Section 2.3.6.
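
To make the different relationship types concrete, the following Python sketch uses the rdflib library to encode a tiny concept scheme in SKOS, a W3C vocabulary commonly used for taxonomies and thesauri (the namespace, concepts, and labels are invented for illustration):

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://example.org/categories#")  # hypothetical namespace
g = Graph()

# Taxonomy-style hierarchical relations (broader/narrower).
g.add((EX.CarAudio, RDF.type, SKOS.Concept))
g.add((EX.CDPlayer, RDF.type, SKOS.Concept))
g.add((EX.CDPlayer, SKOS.broader, EX.CarAudio))

# Thesaurus-style associative and equivalence (synonym) relations.
g.add((EX.CDPlayer, SKOS.related, EX.CarSpeaker))
g.add((EX.CDPlayer, SKOS.prefLabel, Literal("car audio CD player", lang="en")))
g.add((EX.CDPlayer, SKOS.altLabel, Literal("in-car CD receiver", lang="en")))

print(g.serialize(format="turtle"))  # emits the graph in Turtle syntax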

2.2.7.2 Product Categorization Standards

The aforementioned classification schemes (especially taxonomies and thesauri) provide the basis for product (or services) categorization. Product categorization standards are widely accepted knowledge structures that often consist of thousands of hierarchically arranged categories. As already noted in Section 2.2.6.3, categories are typically assigned product class identifiers in order to be able to globally and unambiguously reference them. Product categories basically permit business partners to trade and negotiate based on common definitions of product types [GS105, p. 4]. More specifically, a product classification standard can have several benefits, among others [cf. GS105, p. 6]:

• more accurate product descriptions through shared product features, attributes, and multi-lingual support,
• a lower likelihood of redundant redefinitions,
• seamless integration of product data between business partners, and
• improved search and discovery of product items.

In summary, it strengthens the competitiveness of a company, because betting on a widely used product classification standard facilitates collaboration on a global scale and increases the likelihood of the products and services being shown to target audiences [cf. HLS07]. The structure of product categorization standards is usually hierarchical [HLS07], i.e. categories (or classes) are arranged as in a taxonomy. Hence, product categorization standards are often referred to as product taxonomies. The branches of categorization standards can have variable or fixed depth, sometimes with varying conceptual coverage and level of detail [HLS07]. Furthermore, many standards provide means to describe categories in greater detail, namely by specifying product features, feature values, and translations of categories into various languages [HLS07]. Among the most relevant product categorization standards in industry are, in alphabetical order, the Common Procurement Vocabulary (CPV) [Eur08a], eCl@ss7, the ECCMA Open Technical Dictionary (eOTD) [EleND ], the ElektroTechnisches InformationsModell (Engl.: Electro-Technical Information Model) (ETIM)8, GPC [GS105], the RosettaNet Technical Dictionary (RNTD), and the United Nations Standard Products and Services Code (UNSPSC)9. Their main properties are outlined in Table 2.7. As all these standards are available in multiple languages, we decided against including this information in the table.

Table 2.7: High-level comparison of product categorization standards

Standard   Industrial Scope          Number of Hierarchy Levels   Includes Features
CPV        Public procurement (EU)   4                            −
eCl@ss     Cross-industrial          4                            +
eOTD       Cross-industrial          1                            +
ETIM       Electronics               2                            +
GPC        Cross-industrial          4                            +
RNTD       Electronics               1                            +
UNSPSC     Cross-industrial          4–5                          −

Since requirements differ, it is quite natural that, over time, multiple product categorization standards evolved for different domains and purposes [cf. HLS07]. The level of detail and the scope determine whether we refer to a standard as vertical or horizontal [cf. HLS07]. A vertical standard aims to cover a specific domain or industry accurately and in great depth (e.g. CPV, ETIM, and RNTD), whereas a horizontal standard captures a broad spectrum of domains or industries (e.g. eCl@ss, eOTD, GPC, and UNSPSC), but, of course, at the sacrifice of detail (see the second column in Table 2.7). Despite the sheer number of available product classification standards, they also reveal weaknesses. Fensel et al. [Fen+ 01] noted that UNSPSC as a horizontal standard is not very handy, i.e. it is not very descriptive, and it is unintuitive and shallow. Hepp et al. [HLS07] provided metrics confirming that UNSPSC is not only shallow, but that its level of detail also varies between different branches of the standard. Hepp et al. [HLS07] defined a set of metrics whereupon they quantitatively analyzed four product and services classification standards, namely the horizontal standards eCl@ss, UNSPSC, and eOTD, and one vertical standard, RNTD. Their major finding was that each standard is associated with trade-offs (the results indicate that only very few branches show comprehensive coverage [HLS07]), which is potentially attributable to the difficulty of catching up with real-world changes in huge knowledge bases, but also to differing requirements and scopes [cf. HLS07].

7 http://www.eclass.de/ (accessed on May 16, 2014)
8 http://www.etim.de/ (accessed on May 16, 2014)
9 http://www.unspsc.org/ (accessed on May 16, 2014)

2.2.7.3 Proprietary Product Classification Systems and Taxonomies

The use of product categorization standards might not be feasible in some cases, in particular when

• there is already a legacy (e.g. a proprietary) classification system in place,
• adopting a product classification standard would be beyond the immediate business needs for a given use case, or
• a product classification standard that would accommodate the requirements imposed by the particular domain is missing.

The last item addresses the problem outlined at the end of Section 2.2.7.2, i.e. that the coverage of an existing product classification standard may not be comprehensive enough for a given use case [cf. HLS07]. If any of the aforementioned conditions are met, then a supplier, manufacturer, retailer, or vendor could opt for a proprietary classification system to organize their products. This has the advantage that it requires less commitment and is more flexible with respect to custom requirements. The downside is the lack of consensus among users, which makes proprietary classification systems fairly hard to use in collaborative settings. People and organizations frequently employ proprietary classification systems and taxonomies to empower custom-tailored solutions. Examples are application-specific category hierarchies for indexing and retrieval, for example navigation structures in Web shops or hierarchical directories of search engines. Among the prominent examples of a proprietary classification system on the Web is the Google product taxonomy [Goo13]. It is a publicly available category structure released by Google Inc. that, among others, has the goal of supporting shop owners in specifying the correct Google product category for their products submitted to Google Shopping10. Many Web shops interested in having their products properly indexed by Google Shopping are thus incentivized to adhere to this proprietary category system.

10 http://www.google.com/shopping (accessed on May 8, 2014)


2.2.8 Electronic Marketplaces

Literature about electronic markets reveals at least two distinct conceptions of markets:

1. A general view that concerns the market as a governance mechanism, in contrast to the hierarchical organization [MYB87]11.
2. A narrower, system-oriented view with the market as an intermediary and facilitator that matches sellers and buyers, facilitates transactions, and provides the legal and regulatory infrastructure [Bak98].

In the context of this thesis, the system-oriented view of a market is the more interesting one, because it matches the idea of electronic marketplaces as places for trading. “Electronic markets [...] form a single selected institutional and technical platform for electronic commerce. Their common feature is the market coordination mechanism” [PRW08, p. 274]. An electronic marketplace, e-marketplace, or online market brings together multiple buyers and sellers in a virtual market [Gri03]. Grieger [Gri03] conducted a rich survey about electronic marketplaces, summarizing selected definitions from the literature. On the basis of this work, Grieger describes a marketplace as follows:

“A marketplace as a historically evolved institution allows customers and suppliers to meet at a certain place and at a certain time in order to communicate and to announce buying or selling intentions, which eventually match and may be settled. Today the institution market still does the same, but has occasionally been remodelled due to the evolution of media.” [Gri03]

According to Grieger [Gri03], the understanding of a market today largely correlates with the market definition from centuries ago, when people converged on market squares to trade their goods. In comparison with traditional markets, however, the geographical conditions have essentially changed for electronic marketplaces. Nowadays, trading is ubiquitous, i.e. virtually conducted everywhere. Schmid and Lindemann [SL98] suggest a division of marketplace transactions that is principally compatible with the transaction activities we introduced in Section 2.1.1, even if more coarse-grained. They group transactions into three consecutive market transaction phases, namely information, agreement, and settlement [SL98]. Product search naturally belongs to the first of these phases, i.e. the information phase.

11 Malone et al. [MYB87] argue that information technology reduces coordination costs on markets, which favors markets over hierarchies.


2.2.9 Electronic Tendering

Tendering is a process that has traditionally been known for being lengthy and complex [Du+ 04]. In general, it implies that the party demanding a good or service advertises tendering documents (a formal request for tenders, comprising an early specification of the expected deliverables or terms and conditions), and companies that are able to deliver the good or service can bid by sending respective offers (known as tenders) [TB96, p. 66f.]. Usually, the cheapest offer with sufficiently good quality is accepted from the list of bidders, given that both sides agree upon the terms and conditions [TB96, pp. 69f.]. In a strict sense, tendering is conducted by governmental institutions that outsource tasks to companies, frequently construction and engineering work [TB96, p. 66; Ker+ 00; TS08]. In particular, public institutions act as the contracting entities (principals), whereas enterprises are the potential contractors that bid for the contract [Du+ 04]. With the open bidding process, governments seek to foster transparency and save costs by intensifying competition. Electronic tendering (e-tendering) can further enhance the classical tendering process [e.g. TS08]. It makes sense to regard e-tendering as a special case of electronic procurement (e-procurement) that mainly represents transaction activities involving businesses and/or public authorities (see Table 2.2 in Section 2.2.1). While traditional tendering is largely paper-based, time-intensive, and often geographically restricted (because of the difficulty of effectively disseminating the tendering details) [TS08], electronic tendering not only enhances the overall procurement process for the contractee, but also facilitates the bidding process for the contractor, leading to lower transaction costs. In a study from 2008, Tindsley and Stephenson surveyed experts in the field of construction in the United Kingdom and reported enhanced communication, time-savings, and cost reductions through e-tendering as compared to traditional tendering [TS08]. The complex electronic tendering process requires sophisticated system support. It is especially critical that electronic tendering systems be able to support the full electronic tendering process [Ker+ 00] while implementing the high security standards that are needed to fulfill legal requirements [Du+ 04]. One prominent online service where public tenders can be looked up is provided by the European Union (EU) via Tenders Electronic Daily (TED)12. TED is regularly updated and ensures easy access to tenders for goods and services by taking advantage of the CPV categorization standard (see Section 2.2.7.2).

12 http://ted.europa.eu/ (accessed on May 8, 2014)


2.3 Semantic Web and Linked Data

As a second pillar of this thesis (besides the business-related aspects discussed previously, see Figure 2.1), the Semantic Web and Linked Data constitute the technological underpinnings of our approach. In the following, we thus summarize the main ideas, principles, and technologies related to these two topics.

2.3.1 Web

Before getting more technical, it is necessary to become acquainted with the key terminology. For this reason, let us briefly outline the three most popular Web-related concepts, namely the WWW, the Semantic Web, and Linked Data.

2.3.1.1 World Wide Web

The World Wide Web (WWW) represents a huge global repository of resources, like documents and services, organized in a decentralized fashion, that avails itself of the networked infrastructure and protocols provided by the Internet [cf. BF99, pp. 16–23]. An important characteristic of the Web is its openness [AH11, p. 2, pp. 6f.], i.e. everyone is able to publish and disseminate thoughts, ideas, and contributions. The term World Wide Web is commonly abbreviated as WWW and frequently shortened to Web. It was invented by Tim Berners-Lee in 1989 while working at the Centre Européen pour la Recherche Nucléaire13 (CERN) in Switzerland, and it is nowadays developed by the World Wide Web Consortium (W3C), which Tim Berners-Lee is currently heading as its director. The development of the Web was primarily motivated by a social requirement rather than a technical need [BF99, p. 123]: Its initial aim was to facilitate the collaboration between research scientists. In this regard, Tim Berners-Lee stated in his book Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by Its Inventor:

“The Web is more a social creation than a technical one. I designed it for a social effect – to help people work together – and not as a technical toy.” [BF99, p. 123]

The WWW embodies the marriage of two ground-breaking technologies, the Internet and hypertext [BF99, p. 6]. While the Internet is a large network of computers with a basic protocol stack (e.g. TCP/IP as the network protocol family for the message transport), hypertext concerns the linkage between documents [Nel65]. In the context of the Web, the idea of hypertext has resulted in the concept of hyperlinks.

13 Engl.: European Organization for Nuclear Research


The three fundamental building blocks of the WWW are Uniform Resource Identifiers (URIs), the Hypertext Transfer Protocol (HTTP), and the Hypertext Markup Language (HTML) [BF99, p. 36]:

• URIs [BFM05] uniquely identify Web resources, as we will discuss in more detail in Section 2.3.2.
• HTTP denotes an application layer protocol traditionally specified in RFC 2616 [Fie+ 99], which has been superseded by multiple Requests for Comments (RFCs) in 2014, namely RFC 7230–7235: RFC 7230 [FR14c] defines standard messages for client and server communications over the Web; RFC 7231 [FR14d] describes methods, status codes, and message headers; RFC 7232 [FR14b] specifies conditional requests; RFC 7233 [FLR14] addresses range requests to get partial content; RFC 7234 [FNR14] covers relevant aspects of caching (e.g. browser and proxy caches); and RFC 7235 [FR14a] discusses authentication.
• HTML [Hic+ 14] is a semi-structured markup language intended for human consumption via Web browsers. Anchor links (or hyperlinks) defined within HTML documents make it possible to navigate across different Web documents. For Web resources, these links are typically represented using HTTP URIs.

2.3.1.2 Semantic Web

From the early twenty-first century onwards, much of the W3C’s effort has gone into advancing the idea of the Semantic Web [SHB06]. Meanwhile, many people world-wide have started working and conducting research in the field of the Semantic Web. The Semantic Web basically constitutes an extension of the traditional Web:

“The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.” [BHL01]

Unlike the traditional Web, which helps people publish Web pages and consume information in a human-friendly way (i.e. using HTML) via their Web browsers, the Semantic Web strives to help machines understand and process the content published on the Web more easily [cf. BF99, p. 177; cf. AH11, pp. 5f.]. As opposed to the traditional Web, which is commonly known as the document-based Web, the Semantic Web is also referred to as the Web of Data, where data becomes a first-class citizen. As Tim Berners-Lee stated in an interview in 2006, “[...] data is a precious thing and will last longer than the systems themselves” [Bri06], meaning that with a focus on data it is possible to survive the decline of applications; it also gives rise to novel applications and mash-ups that can rely on the readily available data. Similarly, data, once connected and integrated in an intelligent way, is able to unleash unprecedented potential [AH11, pp. 3f.]. The Semantic Web follows the separation of concerns design principle of software engineering [Dij82; cf. Par72], featuring a high level of modularization: the data model and the semantics are separate from the syntax, and the syntax is separate from the technologies used to identify resources. Similarly, because data is kept independent of its applications, it is possible to realize multiple application scenarios with the same data.

The Semantic Web avails itself of the same technology stack as the Web, but with a few important additions necessary to better preserve meaning in the exchange of data. Figure 2.5 depicts an adapted version of the Semantic Web layer cake14 [cf. DFH11, p. 20] that visualizes the core technology stack of the Semantic Web. In our depiction, we omitted advanced topics addressed by the Semantic Web stack in [DFH11, p. 20] but partly irrelevant to this thesis, namely cryptography, rule languages like the Semantic Web Rule Language (SWRL) [Hor+ 04] and the Rule Interchange Format (RIF) [Bol+ 07], unifying logic, proof, and trust. Instead, we summarized rules and logical inferences under the common term reasoning [e.g. KD11, pp. 245–257]. Also, since XML had from the beginnings of the Semantic Web been considered the de-facto serialization standard for the Resource Description Framework (RDF) [SR14], we have updated this to the more general term data formats, which now properly entails current serialization formats like XML [Bra+ 08], the Terse RDF Triple Language (Turtle) [PC14], JavaScript Object Notation (JSON) [Bra14], etc.

Figure 2.5: Semantic Web layer cake, comprising (bottom-up) Unicode and URI/IRI, data formats (XML, Turtle, JSON, ...), RDF, RDFS/OWL, SPARQL, reasoning, and user interface and applications [adapted from DFH11, p. 20]

14 http://www.w3.org/2007/03/layerCake.svg (accessed on May 6, 2014)


The Semantic Web layer cake is organized hierarchically, with every layer abstracting from the layers underneath, similar to the OSI reference model for networking [II94, p. 28] (i.e. the upper layers rely on the technology provided by the lower layers). The lowest layer in Figure 2.5 represents the character set (Unicode) that every layer on top of it is supposed to adhere to. Furthermore, it defines URIs [BFM05] and Internationalized Resource Identifiers (IRIs) [DS05] as identifiers for Web resources. On the second-lowest layer of Figure 2.5, the RDF data model is introduced, along with the data formats that can be used to encode it. RDF Schema (RDFS) [BG14] and the Web Ontology Language OWL [DS04], as ontology languages, complement the RDF model with capabilities to realize advanced modeling patterns, such as subsumption relationships or constraints. The SPARQL Protocol and RDF Query Language (SPARQL) [HS13] denotes a protocol and query language that make it possible to query RDF datasets, typically via a SPARQL endpoint [e.g. Bui+ 13]. Based on logical axioms defined at the RDFS and OWL levels, reasoners can infer additional, implicit knowledge [e.g. KD11, pp. 245–257]. Finally, a user interface or application can take advantage of the capabilities offered by the lower levels of the Semantic Web stack. In a nutshell, the Semantic Web makes the following contributions:

1. It facilitates data integration through global Web resource identifiers.
2. It provides a data model for making assertions about real-world objects.
3. It adds meaning (disambiguation) through ontologies.
4. It offers query mechanisms and reasoning capabilities to consume data that was published on the Semantic Web.
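
As a minimal illustration of contributions 2 and 4, the following Python sketch uses the rdflib library to build a two-triple RDF graph and query it with SPARQL (the URIs are invented for the example):

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/shop#")  # hypothetical namespace
g = Graph()

# Two assertions about a real-world object, expressed as RDF triples.
g.add((EX.radio1, RDF.type, EX.CarRadio))
g.add((EX.radio1, EX.name, Literal("JVC KD-R741BTE")))

# A SPARQL query over the same graph.
query = """
SELECT ?name WHERE {
  ?item a <http://example.org/shop#CarRadio> ;
        <http://example.org/shop#name> ?name .
}
"""
for row in g.query(query):
    print(row.name)  # -> JVC KD-R741BTE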

2.3.1.3 Linked Data

The Linked Data movement has gained momentum over the past decade. Once again, it was first proposed by Tim Berners-Lee [cf. Ber06], who had already pioneered the WWW and the Semantic Web. Bizer et al. [BHB09] characterize the idea of Linked Data as follows:

“Linked Data is simply about using the Web to create typed links between data from different sources.” [BHB09]

In comparison to the Semantic Web, which largely deals with meaning and logic, the Linked Data idea focuses on moving towards a single huge data space of linked data expressed using RDF, in particular putting emphasis on the publishing aspects. The main building blocks of Linked Data are URIs and HTTP [BHB09]: URIs uniformly refer to data at global scale, and HTTP makes URIs dereferenceable so that information can be looked up easily. In order to make Linked Data become a reality, Berners-Lee suggested in 2006 four rules that every data provider should adhere to in order to accomplish the goal of a global data space of interconnected data. These Linked Data principles [e.g. Ber06; HB11, p. 7; BHB09; BK12, p. 34] read as follows [Ber06]:

“1. Use URIs as names for things
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)
4. Include links to other URIs. so that they can discover more things.” [Ber06]
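
As a small illustration of principles 2 and 3, the following Python sketch uses the requests library to dereference an HTTP URI with content negotiation, asking for RDF instead of HTML (the DBpedia URI merely serves as a well-known example; availability of the service is assumed):

import requests

resp = requests.get(
    "http://dbpedia.org/resource/Berlin",  # an HTTP URI used as a name (principles 1 and 2)
    headers={"Accept": "text/turtle"},     # content negotiation for an RDF serialization
)
# Principle 3: the server returns useful, machine-readable information.
print(resp.status_code, resp.headers.get("Content-Type"))
print(resp.text[:300])  # the first few lines of the returned Turtle data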

Every dataset encompassed by these four principles falls under the term Linked Data, including proprietary and unpublished datasets that are linked in a certain way. In contrast, Linked Open Data (LOD) describes that part of Linked Data that represents freely available data (especially governmental data [cf. BK12]). Based on their level of publishing quality, LOD datasets can be classified as 1-star (★) to 5-star (★★★★★) LOD15 [e.g. Ber06; HB11, p. 26; BK12, p. 17], where 5-star data considerably eases consumption:

★ Available on the Web (irrespective of format) under an open license (e.g. the picture of a chart)
★★ Machine-readable structured data (e.g. an Excel spreadsheet)
★★★ Non-proprietary data format (e.g. comma-separated values (CSV))
★★★★ Open W3C standards to identify things (e.g. URIs and RDF)
★★★★★ Links to other datasets

Following these guidelines, many tools have been developed to quickly bring the data of various data silos into the Web of Data (“to bootstrap the Web of Data” [HB11, p. 30]), among others D2R Server, Triplify, and Virtuoso Universal Server [BHB09]. To track the progress of these efforts, people from the Linked Data community early on started to visualize the interlinked datasets in the LOD cloud diagram [cf. Sch+ 14]. Figure 2.6 documents the evolution of LOD from 2007 to 2014 [cf. Sch+ 14]. Updated over time, this diagram has constituted a popular indicator for the growth of the LOD graph. In the meantime, however, LOD has grown so big that the resulting graph has become difficult to fit on a single diagram.

15 http://5stardata.info/ (accessed on May 13, 2014)

Figure 2.6: Evolution of the LOD cloud [from Sch+ 14]: (a) LOD graph as of November 10, 2007; (b) LOD graph as of 2014

GovUK Societal Wellbeing Deprv. imd Score '10

GovUK Societal Wellbeing Deprivation Imd Rank 2010

2001 Spanish Census to RDF

Openly Local

Cornetto

Tags2con Delicious

ineverycrea

Deusto Tech

Didactalia

Chronicling America

DBTropes

Semantic XBRL

EEA

GovUK Input ind. Local Authority Funding From Government Grant

GovUK Households Accommodated per 1000

Enakting NHS

Eurostat FU-Berlin

GovUK Impact Indicators Planning Applications Granted

UK Legislation API GovUK Societal Wellbeing Deprv. Imd Empl. Rank La 2010

GovUK Societal Wellbeing Deprivation Imd Income Rank La 2010

GovUK societal wellbeing deprv. imd rank '07

GovUK Imd Income Rank La 2010

lingvoj

flickr wrappr

Opendata Scotland Simd Employment Rank

Statistics data.gov.uk

German Labor Law Thesaurus

Pokepedia

Enakting Crime

GovUK Housing Market

Bootsnall

DBnary

Flickr Wrappr

Nextweb GNOSS

Datahub.io

Lotico

Currency Designators

Linked NUTS

GovUK Transparency Impact Indicators Housing Starts

GovUK Societal Wellbeing Deprivation Imd Health Rank la 2010

GovUK Impact Indicators Housing Starts

DBLP Berlin

GovUK societal wellbeing deprivation imd employment score 2010

GovUK Societal Wellbeing Deprivation Imd Health Score 2010

GovUK Societal Wellbeing Deprivation Imd Education Rank La 2010

Opendata Scotland Simd Education Rank

DBpedia

GovUK Households 2008

Opendata Scotland Graph Education Pupils by School and Datazone

Ctic Public Dataset

DBTune John Peel Sessions

Indymedia

Zaragoza Datos Abiertos

Eurovoc in SKOS

Enakting Energy

GovUK Societal Wellbeing Deprivation imd Employment Rank La 2010

Acorn Sat

Linked Mark Mail

Linked User Feedback

Red Uno Internacional GNOSS

Proyecto Apadrina

Museos Espania GNOSS

Umthes

RKB Explorer Crime

GovUK Transparency Impact Indicators Neighbourhood Plans GovUK Transparency Impact Indicators Affordable Housing Starts

GovUK Societal Wellbeing Deprivation Imd Housing Rank la 2010

GovUK Impact Indicators Affordable Housing Starts

GovUK Wellbeing lsoa Happy Yesterday Mean

GovUK Homelessness Households Accommodated Temporary Housing Types

GovUK Households Social Lettings General Needs Lettings Prp Household Composition

GovUK Wellbeing Worthwhile Mean

GovUK Households Social lettings General Needs Lettings Prp Number Bedrooms

Opendata Scotland Simd Income Rank

GovUK Transparency Impact Indicators Planning Applications Granted

Opendata Scotland Simd Crime Rank

GovUK impact indicators energy efficiency new builds

GovUK Societal Wellbeing Deprivation Imd Crime Rank 2010

GovUK imd env. rank 2010

Opendata Scotland Simd Housing Rank

ODCL SOA

Opendata Scotland Graph Simd Rank

OpenGuides

Eurostat

GovTrack

Opendata Scotland Simd Geographic Access Rank GovUK Transparency Impact Indicators Energy Efficiency new Builds

Debian Package Tracking System

Jugem

Vulnerapedia

Artenue Vosmedios GNOSS

Elviajero DBTune Magnatune

DBTune artists last.fm

GovUK Imd Crime Rank 2010

Prefix.cc

Courts Thesaurus

Web Nmasuno Traveler

Gem. Thesaurus Audiovisuele Archieven

GovUK Societal Wellbeing Deprivation Imd Environment Rank 2010

Zaragoza Turruta

FOAF

Geonames

Prospects and Trends GNOSS

Athelia RFID

NEW!

Revyu Jamendo

57

Bio2RDF Biomodels

Nobel Prizes

SISVU

Linked Datasets as of August 2014

StatusNet Equestriarp

(b) LOD graph as of August 30, 2014 [from Sch+ 14]

Figure 2.6: Evolution of the LOD cloud diagram

A few prominent examples of LOD sources on the Web are DBpedia (http://dbpedia.org/) [Aue+ 07], Wikidata (http://www.wikidata.org/) [VK14], and Freebase (http://www.freebase.com/) [Bol+ 08]. Furthermore, in the context of e-commerce, the review platform revyu.com (http://revyu.com/) [HM07] and productdb.org (http://productdb.org/) are very useful. More examples of LOD datasets can be looked up online (http://linkeddata.org/data-sets).

2.3.2 Unique Identifiers

Uniform Resource Identifiers (URIs) serve a similar purpose for Web resources as product identifiers do for products (see Section 2.2.6.3): They uniquely identify them. URIs are specified in the Request for Comments (RFC) 3986 [BFM05]. A general distinction is made between two kinds of URIs: the Uniform Resource Locator (URL) and the Uniform Resource Name (URN) [BFM05, Section 1.1.3] (see Figure 2.7). The URL syntax defines the location of a resource and the way it can be accessed [BFM05, Section 1.1.3], e.g. via HTTP or the File Transfer Protocol (FTP) for the URLs provided in the example in Figure 2.7. A URN, on the other hand, only assigns a name to a resource, without specifying how the resource might be accessed or where it is located [BFM05, Section 1.1.3].

URI (e.g. mailto:[email protected], tel:+49-89-12345678)
  URL (e.g. http://www.example.org/, ftp://ftp.example.org/)
  URN (e.g. urn:isbn:123-4-567-89012-X, urn:ietf:rfc:3986)

Figure 2.7: Relationship between URI, URL, and URN [based on BFM05]

Accordingly, a URI is composed of the following component parts (this pattern shows only the most common component parts of a URI; further refinements are possible) [BFM05, Section 3]:

URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

We further exemplify a URI's components on a fictive geocoding Web page, represented (a) by a dereferenceable HTTP URL, and (b) by a URN as it could be used by an application.

a) http://www.example.org:8080/geocode?s=germany&c=munich&a=marienplatz#canvas

   scheme:    http
   authority: www.example.org:8080
   path:      /geocode
   query:     s=germany&c=munich&a=marienplatz
   fragment:  canvas

b) urn:example:geocode:germany:munich:marienplatz

As an advancement of URIs, Internationalized Resource Identifiers (IRIs) have been introduced in RFC 3987 [DS05]. They support resource identifiers with special characters. IRIs are based on the Universal Character Set (Unicode/ISO 10646) [DS05] and thus allow for a much wider range of characters than the American Standard Code for Information Interchange (ASCII) code table that underlies URIs. For example, a German blog could choose to use the following IRI for a blog post entry

http://www.example.org/blog/2014/März/01.html

whereas with URIs we either need to avoid the special character altogether, namely

http://www.example.org/blog/2014/Maerz/01.html

or we have to percent-encode the special character properly ("ä" is represented in UTF-8 by the two bytes 0xC3 0xA4, hence %C3%A4):

http://www.example.org/blog/2014/M%C3%A4rz/01.html


To retain compatibility between IRIs and URIs, the specification in RFC 3987 defines bidirectional mapping algorithms [DS05, Section 3]. RDF, for example, added support for IRIs in recommendation version 1.1 [SR14].

As a design decision for URIs, Tim Berners-Lee postulated in 1998 the use of cool URIs. A cool URI is "one which does not change" [Ber98], i.e. a permalink. The idea was to omit any details that might be subject to future changes, e.g. status information about the document (e.g. "draft", "final"), underlying software mechanisms (e.g. ".php", "cgi-bin"), or even metadata (e.g. author information, or storage details like disk names). In this sense, cool URIs represent well-conceived and sustainable Web identifiers that aim to be simple, stable, and manageable [SC08]. Otherwise, every technological upgrade would involve significant maintenance overhead in the best case (e.g. setting up appropriate redirects). A poor URI design carries the risk of losing users and breaking existing applications.

URIs on the Web can be used to refer to both information resources and non-information resources [JW04, Section 2.2]. This distinction becomes relevant when talking about resources on the Semantic Web. Web pages in the traditional sense, i.e. documents or information artifacts that can be retrieved (or dereferenced), constitute information resources [JW04, Section 2.2]: "[T]heir essential characteristics can be conveyed in a message" [JW04, Section 2.2]. By contrast, non-information resources describe entities whose essential characteristics cannot be transferred over a medium [JW04, Section 2.2]. This includes real-world objects like people, cars, books, clothing, etc. Identifiers for such entities on the Semantic Web can be regarded as pointing to non-information resources [cf. SC08, Section 3]. For instance, the author of this thesis can be identified on the Web by a URI, which, however, is different from a Web document that presents information about the author. The technical problems arising from the need to keep identifiers for real-world objects and their representations on the Web apart were debated in the context of the httpRange-14 issue [SC08, Section 4.2].
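One practice documented in [SC08] for telling the two kinds of resources apart is to answer a request for a non-information resource URI with an HTTP 303 redirect pointing to a document about it. The exchange below is a minimal illustrative sketch; the host and paths are hypothetical:

GET /id/alex HTTP/1.1
Host: www.example.org

HTTP/1.1 303 See Other
Location: http://www.example.org/doc/alex

The 303 status code signals that the requested thing itself cannot be transferred, but that a description of it is available at the URI given in the Location header.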

2.3.3 Resource Description Framework

The Resource Description Framework (RDF) is a framework for representing resources on the Web [SR14, Section 1]. It constitutes a basic data model that facilitates the exchange of knowledge representations in a distributed manner, based on the publication of graphs and on URIs for identifying nodes and edge types in these graphs. In 2004, RDF became a W3C recommendation [MM04] and was thus accepted as the official standard for modeling data on the Semantic Web. In 2014, the W3C Working Group obsoleted the original specification in favor of RDF 1.1 [SR14], which suggested minor but important advancements to RDF, namely

• support for IRIs,
• a simpler mechanism for datatypes that applies to all literals, even those with language tags (plain literals have become obsolete), and
• a number of new serialization formats for RDF [cf. Woo14].

The main building blocks of the RDF data model are triples [SR14, Section 3.1], sometimes referred to as statements [AvH08, p. 68]. Each triple is composed of three elements in a given order: subject, predicate, and object. A triple can be formally defined as follows:

(s, p, o) ∈ (R ∪ B) × R × (R ∪ B ∪ L)    (2.2)

In this formula, R denotes a set of resource identifiers (e.g. URIs or IRIs), B represents blank nodes, and L stands for literal values (or constant values). Likewise, triples can be graphically represented as illustrated in Figure 2.8. (To draw this and the upcoming RDF graphs, we used a slightly customized version of the TextMate bundle Turtle.tmbundle developed by Peter Geil (https://github.com/peta/turtle.tmbundle), which can generate graphs from Turtle code.)

(a) Object is a URI or blank node: s --p--> o    (b) Object is an RDF literal: s --p--> "o"

Figure 2.8: RDF triple represented as a graph

As in the formal definition above, the object can be either an addressable node (i.e. a URI or blank node) or a literal value. A literal value is an atomic value, i.e. either a textual label (or string value), a date value, or a numeric value [SR14, Section 3.3]. With RDF 1.0, it was optional to assign a datatype to a literal. With RDF 1.1, datatypes are mandatory for literals; if the datatype is omitted, a triple store should assume a string literal (i.e. a plain or simple literal is treated as "syntactic sugar" for a string-typed literal) [CWL14, Section 3.3]. Textual literals may additionally be associated with a language tag, in which case the datatype is implicitly known and need not be supplied [CWL14, Section 3.3].

The RDF data model allows the interlinking of RDF triples based on resource identifiers and blank nodes. (Blank nodes have the limitation that their locally scoped identifiers cannot be linked from outside the graph. Their advantage is, however, that modelers do not need to care about minting identifiers for secondary resources, which is especially useful for graphs not meant to be published or linked from externally [cf. CWL14, Sections 3.4 and 3.5].) Because URIs are used to identify nodes and relationship types, a computer can reliably combine multiple statements into a consolidated graph by simple string comparison, i.e. multiple statements referring to the same subject or object can be collated. Herein lies the actual power of the Semantic Web, namely in gathering context information from various datasets and thus gradually connecting the dots. Figure 2.9 in Section 2.3.4 exemplifies such a graph.

Multiple RDF statements can themselves be made a resource and be grouped together as a single RDF graph [SR14, Section 3.5]. An RDF graph provides context information about a group of RDF triples. It can also be identified by a URI. A collection of RDF graphs in turn forms an RDF dataset. RDF datasets generally consist of one default graph and an arbitrary number of named graphs [CWL14, Section 4; Car+ 05].
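To make the merging by simple string comparison tangible, consider the following minimal Turtle sketch; the two source datasets are hypothetical:

@prefix gr: <http://purl.org/goodrelations/v1#> .

# Stated in dataset A (e.g. the shop's own markup):
<http://www.example.com/#OfferIcecream> gr:name "Scoop of ice cream"@en .

# Stated in dataset B (e.g. a third-party aggregator):
<http://www.example.com/#OfferIcecream> gr:hasBusinessFunction gr:Sell .

After loading both datasets, a triple store holds a single consolidated node for the offer, because the two subject URIs are character-for-character identical.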

2.3.4 RDF Serialization Formats

In the beginnings of the Semantic Web, the RDF/XML serialization format was introduced as a normative data format for RDF [SR14, Section 5.4]. Many tools were capable of handling XML documents, and thus the XML toolset (parsing, transformation, search, etc.) could be applied to RDF/XML as well. Unfortunately, the fact that people had to learn XML before being able to work with RDF introduced an unnecessary degree of separation. To mitigate this issue, a more human-friendly alternative syntax was proposed and developed by Tim Berners-Lee: Notation 3 (N3) [Ber05; cf. BC11]. Since then, various additional data formats for RDF have evolved, most notably N-Triples [CS14a], Turtle [PC14], RDFa [Adi+ 13], and JSON-LD [SKL14].

In the following, let us introduce an example based on the GoodRelations [Hep08a] vocabulary for e-commerce (see Section 2.3.6.2) that will subsequently serve as our baseline for explaining the different syntax variants.

Example. Imagine your child is begging for some ice cream. Luckily, not too far from you there is an ice cream shop, which makes the following offer: "A single scoop of ice cream for only €1.10".



Our simple example can be visualized using the RDF graph shown in Figure 2.9. The top node represents an instance of an offer, whose name property is "Scoop of ice cream". Please note also the English language tag supplied with the literal. The type of the offer is made explicit using the rdf:type property. Furthermore, the offer is intended for sale (using the business function property together with the instance gr:Sell). The tricky part is the modeling of the price. For our purpose, we decided to model the price specification as a blank node, since the price is only related to this offer, and it thus usually makes no sense to refer to it independently. The price is indicated to be calculated per unit (via gr:UnitPriceSpecification), and the price value and price currency are modeled as individual properties.

ex:OfferIcecream  --rdf:type-->                 gr:Offering
ex:OfferIcecream  --gr:name-->                  "Scoop of ice cream"@en
ex:OfferIcecream  --gr:hasBusinessFunction-->   gr:Sell
ex:OfferIcecream  --gr:hasPriceSpecification--> (blank node)
(blank node)      --rdf:type-->                 gr:UnitPriceSpecification
(blank node)      --gr:hasCurrency-->           "EUR"^^xsd:string
(blank node)      --gr:hasCurrencyValue-->      "1.10"^^xsd:float

Figure 2.9: Example as an RDF graph

In the upcoming sections, we present the same example in different RDF syntaxes. To convert between those syntaxes, the author of this thesis has developed an online tool (http://rdf-translator.appspot.com/) that can translate between the most common data formats for RDF [SRH13a].

2.3.4.1 RDF/XML

RDF/XML is an XML-based data format for RDF. For a long time, it was regarded as the normative syntax for RDF. Because of the wide tool support for XML, RDF/XML was expected to be understood by most Semantic Web tools. RDF 1.1 XML Syntax [GS14] provides a syntax and grammar definition for RDF/XML. Without going into details, Listing 2.1 encodes our ice cream example based on those definitions.


<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:gr="http://purl.org/goodrelations/v1#">
  <gr:Offering rdf:about="http://www.example.com/#OfferIcecream">
    <gr:name xml:lang="en">Scoop of ice cream</gr:name>
    <gr:hasBusinessFunction
        rdf:resource="http://purl.org/goodrelations/v1#Sell"/>
    <gr:hasPriceSpecification>
      <gr:UnitPriceSpecification>
        <gr:hasCurrencyValue
            rdf:datatype="http://www.w3.org/2001/XMLSchema#float">1.10</gr:hasCurrencyValue>
        <gr:hasCurrency
            rdf:datatype="http://www.w3.org/2001/XMLSchema#string">EUR</gr:hasCurrency>
      </gr:UnitPriceSpecification>
    </gr:hasPriceSpecification>
  </gr:Offering>
</rdf:RDF>

Listing 2.1: Example in RDF/XML

2.3.4.2 Turtle

Turtle is short for Terse RDF Triple Language. Its spelled-out name indicates why it came into existence, namely the aim of being a terse data format for RDF. The following data formats pertain to the Turtle family of RDF languages [cf. SR14, Section 5]:

• N-Triples [CS14a] is an RDF syntax where RDF triples are written line by line. URIs are surrounded by angle brackets. Furthermore, language tags are appended to literals separated by an @-symbol, and datatypes are specified by attaching them to the literal separated by two consecutive caret (^) symbols. N-Triples is the simplest form of serializing an RDF graph and thus straightforward to process, although not the most compact and readable one. Our example in N-Triples looks as shown in Listing 2.2. Lines 5–7 are statements that belong to the price specification. As we know from before, the price specification is described by a blank node, which can be assigned an arbitrary identifier with a local scope; in our case, it is _:ub22bL7C28.


1 <http://www.example.com/#OfferIcecream> <http://purl.org/goodrelations/v1#hasBusinessFunction> <http://purl.org/goodrelations/v1#Sell> .
2 <http://www.example.com/#OfferIcecream> <http://purl.org/goodrelations/v1#name> "Scoop of ice cream"@en .
3 <http://www.example.com/#OfferIcecream> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/goodrelations/v1#Offering> .
4 <http://www.example.com/#OfferIcecream> <http://purl.org/goodrelations/v1#hasPriceSpecification> _:ub22bL7C28 .
5 _:ub22bL7C28 <http://purl.org/goodrelations/v1#hasCurrency> "EUR"^^<http://www.w3.org/2001/XMLSchema#string> .
6 _:ub22bL7C28 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/goodrelations/v1#UnitPriceSpecification> .
7 _:ub22bL7C28 <http://purl.org/goodrelations/v1#hasCurrencyValue> "1.10"^^<http://www.w3.org/2001/XMLSchema#float> .

Listing 2.2: Example in N-Triples

• Turtle [PC14] is a compact, human-readable syntax. It is mainly used to explain RDF content to people. Turtle designates an extension of N-Triples, i.e. every valid N-Triples document is also a valid Turtle document. Compared to N-Triples, Turtle permits prefix directives that can be used to shorten otherwise lengthy URIs (e.g. http://www.example.com/#OfferIcecream) to terse Compact URIs (CURIEs) [BM10] (e.g. ex:OfferIcecream). Other syntactical improvements encompass shorthand notations like "a" instead of rdf:type, predicate lists (multiple predicate-object pairs separated by semicolons ";"), object lists (multiple objects separated by commas ","), and statements contained within square brackets to delimit a blank node [PC14]. Consequently, the following two examples represent one and the same RDF graph, as illustrated in Figure 2.10:

s --p1--> o1,  s --p2--> o1,  s --p2--> o2,  s --p3--> o3

Figure 2.10: RDF graph that corresponds to the Turtle example

a) Simple Turtle notation (equals N-Triples)

_:b <p1> <o1> .
_:b <p2> <o1> .
_:b <p2> <o2> .
_:b <p3> <o3> .


b) Turtle notation

[ <p1> <o1> ;
  <p2> <o1> , <o2> ;
  <p3> <o3> ] .

Following this short introduction into the basics of the Turtle syntax, Listing 2.3 outlines our example in Turtle. Please note the prefix declarations for the vocabularies at the beginning of the code section.

@prefix gr: <http://purl.org/goodrelations/v1#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ex: <http://www.example.com/#> .

ex:OfferIcecream a gr:Offering ;
    gr:hasBusinessFunction gr:Sell ;
    gr:hasPriceSpecification [ a gr:UnitPriceSpecification ;
        gr:hasCurrency "EUR"^^xsd:string ;
        gr:hasCurrencyValue "1.10"^^xsd:float ] ;
    gr:name "Scoop of ice cream"@en .

Listing 2.3: Example in Turtle/N3

• Notation 3 (N3) [BC11] is a superset of Turtle, i.e. every valid Turtle document is also a valid N3 document. In comparison to Turtle, N3 provides even more syntactic simplifications that aim to facilitate readability (e.g. "=" in place of owl:sameAs). Hence, Listing 2.3, being valid Turtle, also represents a valid snippet in N3.

In addition to the data formats outlined above, there are other syntaxes derived from Turtle that permit describing multiple named graphs within one document. They have been added as valid RDF data formats in the new RDF 1.1 W3C recommendation [SR14]:

• N-Quads [Car14] is a line-based RDF syntax that extends N-Triples with support for encoding multiple named graphs.

• TriG [CS14b] is an extension of Turtle and defines named graphs as block sections of RDF triples surrounded by curly braces; a short sketch follows below.
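To illustrate the TriG notation, here is a minimal sketch based on our running example; the graph name ex:offers is a hypothetical choice:

@prefix ex: <http://www.example.com/#> .
@prefix gr: <http://purl.org/goodrelations/v1#> .

ex:offers {
    ex:OfferIcecream a gr:Offering ;
        gr:hasBusinessFunction gr:Sell .
}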


2.3.4.3 RDFa

The Resource Description Framework in Attributes (RDFa) [Adi+ 13] is a data format for encoding structured information in Web pages. It avails itself of (X)HTML (in fact, RDFa 1.1 is compatible with any XML document, e.g. Scalable Vector Graphics (SVG) pictures [cf. Her+ 13, Section 1.1], although XHTML and HTML are the most common uses), which, thanks to its wide dissemination, is a natural carrier for embedding data from the Semantic Web. As such, RDFa constitutes a lightweight alternative to RDF/XML, because it does not require complicated server configurations in order to deliver Semantic Web content. Since becoming a W3C recommendation in 2008 [Adi+ 08], RDFa has continually grown in popularity, which has been confirmed by several research studies [e.g. Mik11; MP12; Biz+ 13].

RDFa was, contrary to common belief, not the first syntax to embed RDF in Web documents. Similar approaches have been proposed in the past. For example, the Simple HTML Ontology Extension (SHOE) can be considered a predecessor of RDFa and of similar data formats for markup languages [e.g. LSR96; HH00]. As opposed to SHOE, RDFa is less intrusive, for it makes use of existing Extensible Hypertext Markup Language (XHTML) and HTML elements and attributes and needs to define only a few additional attributes. Table 2.8 summarizes all relevant (X)HTML attributes used by RDFa [Adi+ 13, Section 5], and Listing 2.4 outlines an RDFa snippet corresponding to our example.

Table 2.8: (X)HTML attributes defined for RDFa [based on Adi+ 13, Section 5]

Category    Attribute    Explanation
syntax      prefix       space-separated list of prefix-IRI pairs used for defining CURIEs
            vocab        IRI/URI mapping for locally scoped attribute values
subject     about        IRI/URI/CURIE of the RDF subject
            typeof       space-separated list of RDF types for the RDF subject
predicate   rel          relationship between two resources (RDF predicate)
            rev          inverse relationship between two resources (inverse of rel)
            property     RDF predicate referring to a literal value
object      resource     IRI/URI/CURIE of the RDF object
            href         IRI/URI of the RDF object if the resource is navigable
            src          IRI/URI of the RDF object if the resource is embedded
            content      if supplied, it takes precedence over the element content (literal value)
            datatype     datatype of the literal value
            (xml:)lang   language tag of the literal value

The snippet in Listing 2.4 encodes all its RDF content inside HTML attributes. This technique is referred to as RDFa in "Snippet Style" [HGR09], meaning that a snippet with hidden markup is created that can be placed almost anywhere in an HTML document.


1  <div xmlns="http://www.w3.org/1999/xhtml"
2       prefix="gr: http://purl.org/goodrelations/v1#
3               xsd: http://www.w3.org/2001/XMLSchema#"
4       about="http://www.example.com/#OfferIcecream" typeof="gr:Offering">
5    <div property="gr:name" content="Scoop of ice cream" lang="en"></div>
6    <div rel="gr:hasBusinessFunction"
7         resource="http://purl.org/goodrelations/v1#Sell"></div>
8    <div rel="gr:hasPriceSpecification">
9      <div typeof="gr:UnitPriceSpecification">
10       <div property="gr:hasCurrency" content="EUR" datatype="xsd:string"></div>
11       <div property="gr:hasCurrencyValue" content="1.10" datatype="xsd:float"></div>
12     </div>
13   </div>
14 </div>

Listing 2.4: Example in RDFa

This technique has both benefits and limitations. A remarkable advantage is that it is much easier to generate, i.e. an application programming interface (API) with a template engine or a converter can create the snippet automatically. Furthermore, it is fairly straightforward to afterwards incorporate the generated snippet into a Web page. The drawback of this technique is that the structured data is decoupled from the visible content on the Web page, which adds unnecessary redundancy. To give an example, line 5 of Listing 2.4 could equally be written as

<div property="gr:name" lang="en">Scoop of ice cream</div>

which would reuse the text inside the HTML element as metadata for RDF.

2.3.4.4 JSON-LD

JSON for Linked Data (JSON-LD, http://json-ld.org/) [SKL14] is a data format for serializing Linked Data as JSON. JSON is a light-weight, text-based, and language-independent data interchange format [Bra14]. It is widely used in software engineering projects as an alternative to XML [cf. Bra+ 08]. Being based on JSON makes JSON-LD easy to work with for humans [LG12]. Listing 2.5 encodes our example as JSON-LD, whose meaning should by now be self-explanatory. In this example, all URIs are written out entirely, even though JSON-LD provides a mechanism equivalent to the prefix directives in Turtle or RDFa, i.e. one can define shorthand names (known as terms) within a context construct [SKL14, Section 5.1].

{
  "@id": "http://www.example.com/#OfferIcecream",
  "@type": "http://purl.org/goodrelations/v1#Offering",
  "http://purl.org/goodrelations/v1#hasBusinessFunction": {
    "@id": "http://purl.org/goodrelations/v1#Sell"
  },
  "http://purl.org/goodrelations/v1#hasPriceSpecification": {
    "@type": "http://purl.org/goodrelations/v1#UnitPriceSpecification",
    "http://purl.org/goodrelations/v1#hasCurrency": {
      "@type": "http://www.w3.org/2001/XMLSchema#string",
      "@value": "EUR"
    },
    "http://purl.org/goodrelations/v1#hasCurrencyValue": {
      "@type": "http://www.w3.org/2001/XMLSchema#float",
      "@value": "1.10"
    }
  },
  "http://purl.org/goodrelations/v1#name": {
    "@language": "en",
    "@value": "Scoop of ice cream"
  }
}

Listing 2.5: Example in JSON-LD
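As a minimal sketch of such a context, the beginning of Listing 2.5 could be compacted as follows; the term name gr is our own (hypothetical) choice:

{
  "@context": {
    "gr": "http://purl.org/goodrelations/v1#"
  },
  "@id": "http://www.example.com/#OfferIcecream",
  "@type": "gr:Offering",
  "gr:hasBusinessFunction": { "@id": "gr:Sell" }
}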

2.3.4.5 Non-RDF Syntaxes for the Semantic Description of Data

In the context of facilitating the distribution and consumption of data on the Web, alternative syntaxes have been suggested that do not, however, offer full compliance with RDF. The ones that we cover here are Microformats, Microdata, and the Open Graph Protocol (OGP). These formats have enjoyed considerable adoption in the past years [e.g. MP12; Biz+ 13], mainly pushed by large companies like Google, Yahoo!, and Facebook.

• Microformats (http://microformats.org/) [cf. Kha06] reuses existing HTML attributes, like e.g. class, href, etc. Microformats is not only a data format but also a vocabulary, because the eligible schema elements are specified from within Microformats [cf. Kha06]. Accordingly, for the product domain, predefined vocabulary terms exist such as the

h-product class for products, the p-name property for the name of a product, and the p-price property for its price [Çel14]. The adoption of new vocabularies into the Microformats namespace is a non-trivial, tedious, and slow process [ABH11, pp. 160f.]: to prevent potential interferences between schemes or clashes between vocabulary terms, the Microformats standardization has to be centralized [ABH11, pp. 160f.]. In Listing 2.6, we detail our ongoing example in Microformats, this time modeled as a product rather than an offer, because the Microformats vocabulary makes no conceptual distinction between offers and products (unlike GoodRelations, see Section 2.3.6.2).

<div class="h-product">
  <span class="p-name">Scoop of ice cream</span>
  <span>Price: <span class="p-price">€1.10</span></span>
</div>

Listing 2.6: Example in Microformats

• Microdata [Hic13] is a syntax to annotate Web content with structured data. Similar to RDFa, it can be used to add metadata based on custom vocabulary terms to HTML markup. Unlike RDFa, which is closely related to the RDF data model, Microdata does not represent an RDF graph. It rather describes a special type of graph, i.e. a tree of nested groups (items) of name-value pairs [Hic13, Section 4]. In a way, it resembles the structure of a JSON document [cf. Hic13, Section 7]. Nonetheless, it requires some effort to map from HTML+Microdata to RDF [cf. Kel14], mainly because of edge cases like the special treatment of non-URI property names in Microdata or the handling of ordered lists of name-value pairs from Microdata in RDF [Kel14, Section 1.1]. Generally, Microdata is considered simpler than RDFa, with a less steep learning curve. Listing 2.7 illustrates our previous example in Microdata. As you can see, Microdata is a more compact syntax than RDFa because local property identifiers inside an item inherit the scope of the corresponding item type (provided an itemscope attribute is in place). However, when using externally defined properties, the full URI of the property must be provided.


<div itemscope itemtype="http://purl.org/goodrelations/v1#Offering">
  <span itemprop="name">Scoop of ice cream</span>
  <span>Price: €1.10</span>
  <link itemprop="hasBusinessFunction"
        href="http://purl.org/goodrelations/v1#Sell" />
  <div itemprop="hasPriceSpecification" itemscope
       itemtype="http://purl.org/goodrelations/v1#UnitPriceSpecification">
    <meta itemprop="hasCurrency" content="EUR" />
    <meta itemprop="hasCurrencyValue" content="1.10" />
  </div>
</div>

Listing 2.7: Example in Microdata

• The Open Graph Protocol (OGP) [Fac12] is a simple data format and vocabulary that was inspired by RDFa, whose attributes it leverages [Fac12]. OGP allows supplying additional information in the HTML header about the object described on a Web page [Fac12]. It has been developed and is used by Facebook to enrich its social graph, e.g. to ease finding out more about the topics that people are interested in. The downside is that only a restricted set of terms is available; in other words, OGP is not suitable for sophisticated use cases such as describing product offers as in our example.

2.3.5 Ontology Languages

RDF is a data model that does not target a specific application domain [cf. AvH08, p. 84]. It consists of RDF triples to describe basic relationships between resources or to specify the type of a resource. This is not very powerful, though. What is missing is a schema language that offers means to specify a blueprint for making assertions about real-world phenomena, comparable to a database schema that precisely describes how data can be stored in a database (e.g. compare the CREATE TABLE and INSERT INTO statements of the Structured Query Language (SQL)). Ontology languages serve exactly this purpose: They are the toolset for building ontologies. Corcho et al. [CFG03] conducted a comprehensive literature review on ontology languages. In their work, they systematically analyzed and compared Ontolingua, the Operational Conceptual Modeling Language (OCML), LOOM, Frame Logic (F-Logic), the XOL Ontology Exchange Language (XOL), SHOE, RDF(S), the Ontology Inference Layer (OIL), and the DARPA Agent Markup Language (DAML) [CFG03]. In the following, we are concerned with RDFS and OWL, which have become the most popular ontology languages on the Web.


2.3.5.1 RDF Schema

The RDF Schema (RDFS) language adds a semantic layer on top of RDF, thus empowering RDF with a minimal set of modeling primitives [BG14, Section 1]. RDFS introduces the notion of classes and properties as specializations of resources [cf. BG14, Sections 2 and 3]. Diverse classes and properties can be organized as hierarchies of subclasses and subproperties [e.g. AvH08, pp. 85–87; AH11, pp. 130f.; BG14]. For example, one might want to state that a car radio is a more specific type of electronic device, just like a smartphone is. This simple hierarchy could be set up with RDFS using rdfs:subClassOf relationships [cf. AH11, pp. 131f.], as illustrated in Figure 2.11a. Furthermore, RDFS makes it possible to formalize basic constraints whereupon inferencing (or reasoning) over existing data can be applied. For instance, the domain (allowed subject types) and range (allowed object types) of a property can be supplied [e.g. AvH08, p. 90; AH11, pp. 130f.; BG14]. Figure 2.11b shows a property ex:hasPowerConsumption with a domain of ex:ElectronicDevice, i.e. the property is meant to be applied to instances of that specific type, including those class instances subsumed by ex:ElectronicDevice (i.e. instances of type ex:CarRadio and ex:Smartphone).

(a) Class hierarchy: ex:CarRadio --rdfs:subClassOf--> ex:ElectronicDevice <--rdfs:subClassOf-- ex:Smartphone
(b) Domain value restriction: ex:hasPowerConsumption --rdfs:domain--> ex:ElectronicDevice

Figure 2.11: RDFS language additions

Unfortunately, the mechanism for domain and range information for properties in RDFS differs fundamentally from similar mechanisms in traditional database systems, because it triggers the automatic addition of further type information instead of reporting a violation. Only in combination with disjointness axioms can this mechanism be used for actually constraining the use of properties to particular types; for an overview see [dBru+ 05].
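The following minimal Turtle sketch, reusing the vocabulary of Figure 2.11 together with a hypothetical instance ex:MyNovel, makes this behavior concrete:

@prefix ex: <http://www.example.com/#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:hasPowerConsumption rdfs:domain ex:ElectronicDevice .

# Asserted, presumably by mistake, for a paper book:
ex:MyNovel ex:hasPowerConsumption "0.0"^^xsd:float .

# An RDFS reasoner does not report a violation; instead it infers:
ex:MyNovel rdf:type ex:ElectronicDevice .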

2.3.5.2 OWL Web Ontology Language

The Web Ontology Language (OWL) is an even more powerful ontology language than RDFS. Historically, OWL emerged from an effort by a working group formed around


DAML+OIL to extend RDFS by more descriptive elements [AvH08, p. 113; cf. AH11, pp. 369f.]. DAML+OIL itself resulted from unifying two previously separate languages, the U.S. DAML Ontology Language (DAML-ONT) and the European language OIL [AvH08, p. 113]. OWL already exists in its second version (OWL 2); in the context of this thesis, however, we refer to OWL 1, as it is the version still in wide use. OWL 1 became a W3C recommendation in 2004 [cf. DS04].

OWL addresses use cases that cannot be fulfilled by the primitive features provided by RDF and RDFS. For example, it extends RDFS, among others, by

• value and cardinality constraints on classes (e.g. max 2),
• class operations (union, intersection, and complement),
• enumerations,
• class relations (e.g. equivalent class),
• advanced property types (e.g. object, datatype, functional, symmetric, annotation),
• property relations (e.g. equivalent property),
• individuals, and
• individual identity (e.g. same as or different from) [cf. DS04].

Accordingly, OWL facilitates the modeling of advanced patterns from real-world use cases, such as:

Example (Class A is equivalent to class B).

ex:ClassA a owl:Class .
ex:ClassB a owl:Class .
ex:ClassA owl:equivalentClass ex:ClassB .

Example (Individual I describes the same entity as individual J).

ex:I owl:sameAs ex:J .

Example (It is only possible to have exactly 2 parents).

ex:hasParent a owl:ObjectProperty .
[] a owl:Restriction ;
   owl:onProperty ex:hasParent ;
   owl:cardinality "2"^^xsd:nonNegativeInteger .


Without going into detail, based on restrictions of the features outlined above, three general languages (OWL layers) for OWL 1 were suggested. By decreasing expressivity, they are [DS04, Section 8]:

1. OWL Full: Covers the full range of OWL constructs.
2. OWL DL: Restricted to primitives as found in description logics.
3. OWL Lite: Very light-weight and even more restricted than OWL DL, providing basic modeling primitives.

The choice of OWL layer requires the ontology engineer to make a trade-off between expressive power and computational overhead (undecidability, at the extreme) [DS04, Section 8]. While for practical concerns it might be perfectly reasonable to trade OWL DL for OWL Full, from a decidability point of view it is recommendable to remain within OWL DL, which allows OWL reasoners to infer knowledge in a deterministic way [DS04, Section 8]. For exactly this reason, many ontology engineers decided in the past to restrict their ontologies to OWL DL, which is known to be a fair compromise between expressivity and computational complexity. However, since real-world scenarios (especially Web ontologies) often require more powerful features from OWL Full but are less prone to reasoning, it is tempting to trade away OWL DL compliance for the sake of the better modeling capabilities provided by OWL Full.

2.3.6 Ontologies and Global Schemas

Although the term ontology (or rather, Ontology (sic!)) also plays a major role in philosophy, being a branch that studies the role of being in the world [e.g. SBF98; GOS09], we are subsequently concerned with the term in the field of artificial intelligence (AI). In Table 2.9, we list the three most-cited definitions of ontologies.

Table 2.9: Most important contributors to the definition of the term ontology

Authors          Definition
[Gru93]          "An ontology is an explicit specification of a conceptualization."
[Bor97, p. 12]   "An ontology is a formal specification of a shared conceptualization."
[SBF98]          "An ontology is a formal, explicit specification of a shared conceptualisation."

Gruber [Gru93] was the first to provide a succinct and generally accepted definition for what an ontology describes in the field of AI. The other authors borrowed from this initial definition and slightly modified it. Compared to Gruber’s definition from 1993, Borst [Bor97, p. 12] concluded that an ontology is formalized in order to make it accessible to


a computer. Furthermore, the real-world phenomena encoded within an ontology need to be based on consensus [Bor97, p. 12]. Studer et al. [SBF98] merged the previous two definitions into a third one, and stated that an ontology promises "a shared and common understanding of some domain that can be communicated across people and computers" [SBF98].

As indicated in the preceding section, ontology languages are used for building ontologies, whereas ontologies denote the schemas necessary to give meaning to data. (The word ontology is often used synonymously with the terms vocabulary, schema, or data dictionary; unless otherwise noted, for the rest of the thesis we will keep referring interchangeably to these terms.) Data without any schema or ontology is of limited value, because nobody is able to interpret it. A schema defines a number of classes and properties (schema data) that help disambiguate information expressed in terms of these classes and properties (instance data). Gruninger and Lee summarize the main uses of an ontology as communication, computational inference, and reuse (and organization) of knowledge [GL02]. Consequently, an ontology facilitates the interaction between both human agents and systems [GL02]. Furthermore, it can improve systems (e.g. assist their specification) and help to better organize knowledge (e.g. by re-using or extending ontologies) [GL02].

There exist different ontologies for varying scopes. One possible distinction is made between upper ontologies (also upper-level or top-level ontologies) and domain ontologies [Gri+ 11, pp. 522f.]. Upper ontologies try to cover a large spectrum of domains [Gri+ 11, p. 523], but at the cost of not being able to provide detailed descriptions. Top-level ontologies are thus often extended by domain ontologies [Gri+ 11, p. 523]. Domain ontologies narrow in on a particular application domain, and consequently it is easier for them to capture concepts at greater detail [Gri+ 11, p. 523].

The two most notable conceptual schemas in the field of e-commerce are schema.org and GoodRelations. The first covers the most popular application domains on the Web. However, it is not an upper ontology, because the domains it supports are covered in detail; rather, it can be considered an accumulation of multiple domain ontologies. The second is a comprehensive domain ontology for e-commerce on the Web.

2.3.6.1 Schema.org

Schema.org (http://schema.org/) is an ongoing joint initiative of the leading search engine operators Google, Yahoo!, Bing, and Yandex. It strives to compile a single vocabulary that unifies a collection of popular Web schemas under a consolidated namespace (i.e. http://schema.org/).

The key goals of schema.org are to retain simplicity and to incrementally add further interesting domains by relying on feedback from a large community formed around the development and maintenance of the vocabulary. The vocabulary is intended to be understood by all search engines in order to provide the greatest benefit to users. This way, developers and Web site owners have a clear benefit from annotating their Web documents with schema.org. Such benefits might be search engine result snippets, feeding Google's Knowledge Graph, or a better search experience due to additional relevance signals sent to search engines. Especially the value proposition in the form of rich snippets [Goo16] has attracted attention from various people and organizations that run their own Web sites. As of February 2016, Google communicates that it will display rich snippets on search engine results pages (SERPs) for

• products, including offering details and ratings,
• reviews about products, restaurants, movies, and stores,
• recipes,
• events, and
• software applications [Goo16].

Figure 2.12 illustrates a rich snippet as generated by Google, indicating rating, reviews, price, and stock availability of a product. The corresponding schema.org Microdata snippet is outlined in Listing 2.8. It is worth mentioning that schema.org presumes certain default values, so the user does not need to specify that the worst possible rating is 1 ("If worstRating is omitted, 1 is assumed.", schema.org documentation for http://schema.org/Rating) or that the product is intended for sale (the default business function is http://purl.org/goodrelations/v1#Sell, schema.org documentation for http://schema.org/businessFunction). These defaults can easily be guessed by search engines and hence do not burden Web masters.

Figure 2.12: Google rich snippet

<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Power Sander</span>
  <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    <span itemprop="price">$80.30</span>
    <link itemprop="availability" href="http://schema.org/InStock" />Available
  </div>
  <div itemprop="review" itemscope itemtype="http://schema.org/Review">
    <span itemprop="reviewBody">Great value for money!</span>
  </div>
  <div itemprop="aggregateRating" itemscope
       itemtype="http://schema.org/AggregateRating">
    <span itemprop="ratingValue">4</span> out of
    <span itemprop="bestRating">5</span> stars
  </div>
</div>

Listing 2.8: Schema.org in Microdata

In terms of data formats, Google has historically promoted Microdata as the preferred syntax for schema.org. Notwithstanding, it meanwhile explicitly claims to also support RDFa and JSON-LD [Goo15a].

2.3.6.2 GoodRelations

GoodRelations [Hep08a] is a standardized, light-weight vocabulary for e-commerce on the Semantic Web. The model is defined in OWL DL, which allows any RDFS-style reasoner to compute valuable inferences on GoodRelations data [Hep08a]. Although work on the ontology had started a couple of years earlier, the ontology first went public in 2008. In late 2012, it was incorporated into schema.org, as was publicly announced on the schema.org blog [Guh12]. Nonetheless, development on GoodRelations in its own namespace still continues in parallel [Hep15b]. This makes GoodRelations and schema.org good complements: While structured data markup in GoodRelations can take advantage of schema.org elements, schema.org conversely can be enhanced with existing individuals defined by GoodRelations.

GoodRelations aims to describe offers for products or services and their related resources on the Web [e.g. Hep08a; Hep15b]. A design decision behind the development of the ontology was to be simple yet flexible, i.e. to be an attractive option for the small Web shop owner while leaving room for advanced modeling requirements imposed by industry. In particular, GoodRelations foregoes the detailed specification of products, because there exist classification standards and ontologies that can readily contribute them


[cf. Hep08a]. The core part of the ontology entails the relationships between business entity (gr:BusinessEntity), offer (gr:Offering), and product or service (gr:ProductOrService), henceforth referred to as the agent-promise-object principle [Hep15b; Hep11]. As detailed in Figure 2.13, this structure allows modeling a business party (agent) that makes an offer (promise) for a product or service (object), which a second party can purchase in return for some compensation, followed by the transferral of the property rights from the first party to the purchasing party [Hep15b]. Given this separation of offer and product or service, it is possible to define multiple offerings (or promises) for a single product or service; in practice, this could be used to model bulk prices or to effectively enforce price differentiation (see the sketch after Figure 2.13). GoodRelations makes no prior assumptions about the characteristics of the promise, so in theory an offer could consist of a bundle of items, its validity could be temporally restricted, and on top of that a good could even be traded for good karma instead of in return for a monetary amount [cf. Hep15b].

GoodRelations entities: Agent 1 (gr:BusinessEntity) makes a Promise (gr:Offering) about an Object (gr:ProductOrService); Agent 2 provides a Compensation and receives a Transfer of Rights.

Figure 2.13: Agent-Promise-Object principle [based on Hep15b]
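The separation between offer and product can be sketched in a few lines of Turtle; the URIs ex:SingleScoop, ex:RegularOffer, and ex:BulkOffer are hypothetical names chosen for illustration:

@prefix ex: <http://www.example.com/#> .
@prefix gr: <http://purl.org/goodrelations/v1#> .

ex:SingleScoop a gr:Individual .      # the product

# Two independent promises for the very same product:
ex:RegularOffer a gr:Offering ;
    gr:includes ex:SingleScoop .

ex:BulkOffer a gr:Offering ;          # e.g. a discounted bulk price
    gr:includes ex:SingleScoop .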

The ontology provides basic support for the most frequently used properties and individuals in offering descriptions, such as product details, prices, and terms and conditions [Hep15b]. The GoodRelations ontology allows extending products (gr:SomeItems or gr:Individual, both subclasses of gr:ProductOrService) with product models (gr:ProductOrServiceModel, likewise a subclass of gr:ProductOrService), or datasheets, that can contribute detailed product information like product features. For that purpose, it defines a fully-fledged meta-model for expressing quantitative and qualitative product properties in OWL. In addition, GoodRelations allows refining product and product model descriptions with classes and features of comprehensive product classification standards and product ontologies. In this regard, a number of GoodRelations-compliant product ontologies


exist that can provide a more detailed description of products; they are eClassOWL (product types and features) [Hep05b], FreeClass (construction and building materials) [cf. Rad+ 13], the consumer product ontologies developed in the context of the Ontology-based Product Data Management (OPDM) project (http://www.ebusiness-unibw.org/ontologies/opdm/), and the Product Types Ontology (PTO, http://www.productontology.org/) with over 600,000 precise classes from Wikipedia. Moreover, the GoodRelations ontology was extended by a number of e-commerce verticals, including vocabularies for

• consumer electronics (http://www.ebusiness-unibw.org/ontologies/consumerelectronics/v1.html),
• tickets (http://www.heppnetz.de/ontologies/tio/ns.html): concert, museum, airfare, and train tickets,
• accommodation (http://ontologies.sti-innsbruck.at/acco/ns.html): hotels, camping sites, vacation homes, etc.,
• vehicles in general (http://www.heppnetz.de/ontologies/vso/ns.html): cars, boats, bikes, etc., and
• the automotive industry: car options (http://www.volkswagen.co.uk/vocabularies/coo/ns.html), Volkswagen vehicles (http://www.volkswagen.co.uk/vocabularies/vvo/ns.html), and used cars (http://ontologies.makolab.com/uco/ns.html).

A number of tools have evolved since 2008 to complement the GoodRelations vocabulary. The tool chain consists of:

Publishing Tools

• RDF Book Mashup (http://wifo5-03.informatik.uni-mannheim.de/bizer/bookmashup/): Book offers from Amazon annotated with GoodRelations and published as Linked Data on the Web.

• GoodRelations Snippet Generator (http://www.ebusiness-unibw.org/tools/grsnippetgen/): This form-based online tool helps small businesses and Web site owners with moderate Semantic Web experience to quickly generate custom RDFa snippets for embedding in their Web pages.

• Shop extensions: Over time, a significant number of plug-ins and modules for various Web shops have been developed. So far, there exist extensions for osCommerce (https://code.google.com/p/goodrelations-for-oscommerce/), Magento Commerce (http://www.magentocommerce.com/magento-connect/msemantic-semantic-seo-for-rich-snippets-in-google-and-yahoo.html), Joomla/VirtueMart (https://code.google.com/p/goodrelations-for-joomla/), xt:Commerce (https://code.google.com/p/semantic-seo-for-xt-commerce/), Oxid eShop (http://wiki.oxidforge.org/Features/RDFa), WordPress/WP e-Commerce (http://wordpress.org/plugins/wpec-goodrelations/), PrestaShop (http://addons.prestashop.com/en/seo-prestashop-modules/3866-rich-snippets-and-semantic-seo-with-goodrelations.html), and Drupal Commerce (https://drupal.org/project/commerce_goodrelations).

• GR-Notify ping service (http://gr-notify.appspot.com/): In order to see whether problems exist with the shop extensions, it was key to keep track of new installations; this has been realized with a registration service where Web shops and site owners can submit their URIs. It was also useful for monitoring the adoption of the shop extensions.

Converters

• Google Product Feed Converter (http://www.ebusiness-unibw.org/tools/google-product-feed-converter/): Converts Google product feeds to GoodRelations.

• BMEcat2GoodRelations (http://wiki.goodrelations-vocabulary.org/Tools/BMEcat2GR): Converter from BMEcat documents to GoodRelations. First developed as an online service [cf. Mat09], it was then replaced by a more scalable command-line tool (see [SRH13b] and Chapter 4).

• ELMAR2GoodRelations (https://code.google.com/p/elmar-to-goodrelations/): Command-line converter from Electronic Market Data Feed (ELMAR) product data feeds to GoodRelations.

• PCS2OWL (http://wiki.goodrelations-vocabulary.org/Tools/PCS2OWL): Generic converter from product category systems to GoodRelations-compatible product ontologies (see [Sto+ 14] and Chapter 5).

Consuming Tools

• GR4PHP (https://code.google.com/p/gr4php/): A programming API intended for Hypertext Preprocessor (PHP) developers to fetch information about businesses and products from RDF stores [SGH12].


• GR2RSS (http://www.stalsoft.com/gr2rss/): This tool combines query results over GoodRelations e-commerce datasets with content syndication, thus empowering Web site owners who run off-the-shelf CMSs with support for syndication formats (Really Simple Syndication, sometimes Rich Site Summary or RDF Site Summary (RSS), or the Atom Syndication Format (Atom)) to enhance their Web pages with Semantic Web content [SH13b].

• GoodRelations Validator (http://www.ebusiness-unibw.org/tools/goodrelations-validator/): Modeling mistakes on the Semantic Web are quite common. This online service validates RDF documents for semantic validity in multiple check steps, i.e. it checks whether all ontology constraints are satisfied.

• GRCrawler (http://wiki.goodrelations-vocabulary.org/Tools/GRCrawler): The GoodRelations crawler is a focused Web crawler that extracts structured data in GoodRelations from a given set of Web resources (see Chapter 3).

• Browser extensions and mobile applications: Browser extensions and mobile applications are promising tools because they can effortlessly take into account a user's context information. A browser extension, for example, has knowledge about the page a user is currently visiting, whereas a mobile application can draw on geopositional information. In the context of GoodRelations, the following extensions were showcased:

– A Firefox extension (https://addons.mozilla.org/en-us/firefox/addon/goodrelations-extension/) that looks up product data from an RDF store based on some selected text on a Web page (e.g. a product name to compare with products in an RDF store).

– A Google Chrome extension (http://www.stalsoft.com/grome) that shows the presence of rich GoodRelations markup in the current Web page and notifies the GR-Notify registration service accordingly.

– A mobile application based on a large dataset about points of interest in Ravensburg city (http://wiki.goodrelations-vocabulary.org/Case_studies/Ravensburg), letting the user explore surrounding restaurants or public institutions that are currently open.

Furthermore, a sophisticated vocabulary documentation system has been established for GoodRelations, which features examples in different syntaxes and considers the social aspect of ontology development, i.e. users can interact with developers using different

social media channels like Twitter, Facebook, Google+, Quora, Stackoverflow, or Delicious [cf. Hep11]. The GoodRelations ecosystem also includes an information platform (static Web page and Wiki, http://www.goodrelations-vocabulary.org/) with collected knowledge and external links to additional material (slides, videos, etc.), as well as a community mailing list accompanied by a corresponding list archive (http://ebusiness-unibw.org/pipermail/goodrelations/).

2.3.6.3 Simple Knowledge Organization System

The Simple Knowledge Organization System (SKOS) is a W3C recommendation and constitutes an OWL Full ontology for representing controlled vocabularies such as classification schemes, taxonomies, thesauri, or subject heading systems [MB09, Section 1.2]. SKOS is complementary to ISO 25964 [Int11]: whereas ISO 25964 gives general advice on how to build a decent thesaurus [Int11] (see Section 2.2.7.1), SKOS deals with the machine-readable representation of thesauri on the Web [MB09, Section 1.2].

SKOS sets itself apart from RDFS and OWL in being a data model to informally represent a thesaurus or classification scheme rather than a formal ontology or ontology language [MB09, Section 1.3]. Unlike SKOS structures, ontologies and ontology languages generally define axioms and facts [MB09, Section 1.3]. Knowledge representation languages like RDFS and OWL are further limited to a few very strong semantic relationships, i.e. rdfs:subClassOf and rdfs:subPropertyOf. This means that an instance of class A that is subsumed by class B is also an instance of class B. This does not pose a problem for well-organized hierarchies, such as a car being a subclass of a vehicle, but the assertion will not hold e.g. for a garage being subsumed by a class building, even if a real estate broker might disagree (a garage is typically part of a building, i.e. a meronomy, parthood, or whole-part relationship [cf. MR06, p. 216], but not a building on its own; on paper, however, both might well be treated similarly).

SKOS defines elements better suited for representing light-weight KOSs than RDFS or OWL. The main element of SKOS is the notion of a concept (skos:Concept), which is an instance of an OWL class. A SKOS concept is "an idea or notion" [MB09, Section 3.1]. Each concept can be equipped with textual labels. SKOS distinguishes between three types of labels: a preferred one (skos:prefLabel), an arbitrary number of alternative ones (skos:altLabel), and hidden labels (skos:hiddenLabel) [MB09, Sections 5.1 and 5.2]. Hidden labels are supposed not to be visible to the user, but to be considered for indexing purposes instead, e.g. by a digital library or an IR system [MB09, Section 5.1]. Concepts

can be hierarchically organized with SKOS by taking advantage of the object properties skos:narrower and skos:broader [MB09, Section 8.1]. These and other additional semantic relationships (among others associative links like skos:related ) allow for a more finegrained description of the relations between concepts [cf. MB09, Section 8.1]. Accordingly, a concept A linked to another concept B via skos:narrower does not necessarily imply that A is a specialization of B. So a garage in this case could be modeled as a narrower concept of a building.
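To make the label and hierarchy elements concrete, the following is a minimal sketch using the Python rdflib library (the choice of rdflib, the concept URIs, and the labels are assumptions of this illustration) that encodes the building/garage example in SKOS and retrieves the narrower concepts of the building concept:

from rdflib import Graph

# Hypothetical mini concept scheme illustrating SKOS labels and hierarchy.
turtle = """
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex:   <http://example.org/kos#> .

ex:Building a skos:Concept ;
    skos:prefLabel "building"@en ;
    skos:altLabel  "edifice"@en ;
    skos:narrower  ex:Garage .

ex:Garage a skos:Concept ;
    skos:prefLabel "garage"@en ;
    skos:broader   ex:Building .
"""

g = Graph()
g.parse(data=turtle, format="turtle")

# skos:narrower carries no subclass semantics; this merely follows the links.
query = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?narrower ?label WHERE {
    ?broader skos:prefLabel "building"@en ;
             skos:narrower ?narrower .
    ?narrower skos:prefLabel ?label .
}
"""
for concept, label in g.query(query):
    print(concept, label)   # http://example.org/kos#Garage garage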

2.3.7 Query and Rule Languages

The SPARQL Protocol and RDF Query Language (SPARQL) is a SQL-like query language and a protocol for RDF. (SQL is the standard query language for relational databases; SPARQL adopted its SELECT-FROM-WHERE structure.) It was standardized by a W3C working group and received recommendation status in 2008 [PS08]. In 2013, it was updated to SPARQL 1.1 [HS13], which essentially added query federation support and SPARQL Update functionality. SPARQL was not the first attempt to define a query language for RDF. For example, it was preceded by RDF Query Language (RQL), Sesame RDF Query Language (SeRQL), TRIPLE, RDF Data Query Language (RDQL), N3, and Versa [cf. Haa+ 04].

The query language borrows much of its syntax from the Turtle language for RDF [cf. PC14]. In particular, the basic graph pattern [HS13, Section 5.1] used to query an RDF graph consists of a set of triple patterns with variables, structured the same way as RDF triples [HS13, Section 5.1]. A variable is given by a string of characters introduced with a question mark, e.g. ?var.

Every SPARQL query follows the same basic structure, whose components are prefix declarations, query result form, dataset definition, graph pattern, and solution modifiers [cf. HS13]. Some of these constituents are required, others are optional: Prefix declarations may be omitted if an abbreviated syntax with CURIEs is not needed. The dataset definition is not mandatory, because without supplying a named graph URI of a specific RDF subgraph, the query is executed on the default graph (which most often encompasses all named graphs as well). Finally, the solution modifiers, which permit to limit or manipulate the results, are optional as well [cf. HS13].

The SPARQL query language supports four types of queries [HS13, Section 16]:

• SELECT: This query type retrieves a result list by binding values to the variables supplied with the query result clause (the variables in the SELECT clause are called the projection; the wildcard symbol "*" (asterisk) can be used to select all variables bound in the graph pattern).
• CONSTRUCT: This query type works like SELECT, but generates triples from a custom triple pattern template that is populated with the retrieved results.
• ASK: This query type returns the boolean value "true" if some data in the dataset matches the given basic graph pattern, otherwise "false".
• DESCRIBE: This query type extracts all triples in the dataset for a particular resource.

With SPARQL, it is possible to explicitly query named RDF graphs identified by graph URIs. Named graphs are either selected using the dataset definition FROM NAMED in front of the basic graph pattern [HS13, Section 13.2.2], or inside the graph pattern block [HS13, Section 13.3]. The syntax is then as follows:

GRAPH <graph_uri> { ... }

It is also possible to make a triple pattern an optional match. This can be done by enclosing the triple pattern within an OPTIONAL clause [HS13, Section 6]. Furthermore, filters on results are possible from within the graph pattern section [HS13, Section 5.2.2].

SPARQL 1.1 incorporated numerous new functionalities that were highly demanded by academia and industry. It contributed, among others, subqueries, the property path feature, variable assignments (BIND and VALUES), better filtering opportunities, like FILTER NOT EXISTS ... for negation, and additional aggregate functions like MIN, MAX, or AVG [cf. HS13]. Furthermore, it added support for SPARQL Update queries, for which the most important applications are graph manipulations [GPP13, Section 3.1]: (1) Insert data into the graph; (2) delete data from the graph; (3) insert and/or delete certain data if the specified graph pattern matches; (4) load data from a Web resource into the graph; and, (5) empty an entire graph. SPARQL-1.1-capable endpoints support federated queries by virtue of the SERVICE keyword [PB13, Section 2]. Federated queries permit to delegate portions of a query to remote SPARQL endpoints and to combine the results locally afterwards [PB13; Bui+ 13].

Recall the ongoing ice cream example. Listing 2.9 shows a corresponding SPARQL query. It combines many of the SPARQL features presented herein, as itemized after the listing.


1  PREFIX gr: <http://purl.org/goodrelations/v1#>
2  PREFIX ex: <http://example.org/>
3
4  # query name, price, and currency of offer ex:OfferIcecream
5  SELECT DISTINCT ?name ?price ?currency {
6      ex:OfferIcecream
7          gr:hasPriceSpecification [
8              gr:hasCurrency ?currency ;
9              gr:hasCurrencyValue ?price ] ;
10         gr:name ?name .
11 }
12 ORDER BY ?price
13 LIMIT 10

Listing 2.9: Example query in SPARQL

In particular,

• two prefix bindings are provided (lines 1–2);
• the query form is of type SELECT (line 5);
• the solution modifiers tell the query to remove duplicate results (keyword DISTINCT on line 5), to rank the results by the price (line 12), and to limit the returned result set to ten results (line 13);
• in the projection three variables are selected, namely name, price, and currency (line 5);
• abbreviated Turtle syntax is used inside the basic graph pattern block (lines 6–10); and
• values are bound to variables for the name of the offer, its price, and the related currency (lines 8–10).

The graph pattern used in the query can also be visualized as a graph (see Figure 2.14). A side-by-side comparison of the RDF graph (see Figure 2.14a) and the corresponding graph for the SPARQL query (see Figure 2.14b) reveals what values will be bound to variables when the query is executed.

[Figure 2.14: Side-by-side comparison of RDF graph and SPARQL graph. (a) RDF graph (see Figure 2.9), with the nodes ex:OfferIcecream, gr:Offering, gr:Sell, gr:UnitPriceSpecification, "Scoop of ice cream"@en, "EUR"^^xsd:string, and "1.10"^^xsd:float; (b) SPARQL graph pattern, with the literals replaced by the variables ?name, ?currency, and ?price.]
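As an aside, the query in Listing 2.9 can be tried out locally. The following is a minimal sketch using the Python rdflib library (the choice of rdflib is an assumption of this illustration; the sample data is reconstructed from the RDF graph in Figure 2.14a):

from rdflib import Graph

# Sample data corresponding to the RDF graph in Figure 2.14a.
data = """
@prefix gr:  <http://purl.org/goodrelations/v1#> .
@prefix ex:  <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:OfferIcecream a gr:Offering ;
    gr:name "Scoop of ice cream"@en ;
    gr:hasBusinessFunction gr:Sell ;
    gr:hasPriceSpecification [
        a gr:UnitPriceSpecification ;
        gr:hasCurrency "EUR"^^xsd:string ;
        gr:hasCurrencyValue "1.10"^^xsd:float ] .
"""

query = """
PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX ex: <http://example.org/>
SELECT DISTINCT ?name ?price ?currency {
    ex:OfferIcecream
        gr:hasPriceSpecification [
            gr:hasCurrency ?currency ;
            gr:hasCurrencyValue ?price ] ;
        gr:name ?name .
}
ORDER BY ?price
LIMIT 10
"""

g = Graph()
g.parse(data=data, format="turtle")
for name, price, currency in g.query(query):
    print(name, price, currency)   # Scoop of ice cream 1.10 EUR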


Rule languages make it possible to specify rules that would otherwise be impossible, or at least difficult, to express with OWL terminology. In logics, rules consist of an antecedent (or body) and a consequent (or head) [e.g. Hor+ 04, Section 1; GR98, p. 29; AvH08, p. 162]. In other words, a conclusion C is drawn from a set of premises P [cf. HR04, p. 5]. A rule consisting of a set of n premises and a conclusion can be formally represented as follows:

P_1, ..., P_n → C    (2.3)

On the Semantic Web, this functionality is offered by rule languages like the Semantic Web Rule Language (SWRL) [Hor+ 04]. Due to the vast number of available rule languages and engines (e.g. Datalog, Rule Markup Language (RuleML), SWRL), the W3C chartered a working group that elaborated a format with a set of dialects charged with the interchange of rules across various rule languages, the Rule Interchange Format (RIF) [Bol+ 07]. Yet, even with SPARQL it is possible to execute a number of logical rules via SPARQL CONSTRUCT queries [e.g. AH11, pp. 88f., pp. 115f.]. SPARQL Inferencing Notation (SPIN, http://spinrdf.org/, accessed on May 25, 2014), for example, avails itself of this capability and defines inference rules based on SPARQL CONSTRUCT query forms [AH11, p. 116].
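To illustrate the idea of executing rules via CONSTRUCT, the following is a minimal sketch (Python with rdflib; the example vocabulary is hypothetical, and the sketch mimics the style of SPIN rules rather than reproducing SPIN itself) that materializes the classic subsumption rule "an instance of a subclass is also an instance of the superclass":

from rdflib import Graph, URIRef
from rdflib.namespace import RDF

g = Graph()
g.parse(data="""
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/> .

ex:Car rdfs:subClassOf ex:Vehicle .
ex:myCar a ex:Car .
""", format="turtle")

# Rule as CONSTRUCT: the WHERE part is the antecedent, the CONSTRUCT
# template the consequent (cf. Equation 2.3).
rule = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT { ?i a ?super }
WHERE     { ?i a ?sub . ?sub rdfs:subClassOf ?super }
"""

for triple in g.query(rule):
    g.add(triple)   # materialize the inferred triples

EX = "http://example.org/"
print((URIRef(EX + "myCar"), RDF.type, URIRef(EX + "Vehicle")) in g)   # True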

2.3.8 Storage and Reasoning

In this section, we address the technologies to store and retrieve RDF data. In particular, we delineate the notions of a triple store and of a SPARQL endpoint. Then, we discuss the most relevant details and idiosyncrasies of reasoning over RDF data on the Semantic Web.

2.3.8.1 RDF Stores

An RDF store is a system that "allows storage of RDF data and schema information, and provides methods to access that information" [HBS09, p. 490]. Its essential components are a repository for the storage and a middleware for the access of the data [HBS09, p. 490]. Triple stores and quad stores are sometimes used as alternative terms for RDF stores, even though, strictly speaking, they describe specific types of RDF stores, i.e. RDF engines that store only triples or triples with context information (graph name), known as quads [cf. HBS09, p. 494]. Most of the RDF stores available today can be classified as either native RDF stores, DBMS-backed stores, or RDF wrappers [e.g. Has+ 11; FCB12]:

• A native store implements an RDF-compliant storage layout, usually either persistent or in-memory [FCB12].

• A DBMS-backed store maps RDF triples to relational database tables while taking advantage of existing and well-proven database technologies. There are three common ways of storing RDF triples in relational databases, namely using vertical tables, property tables, and horizontal tables [SN10].

• An RDF wrapper creates an RDF view over otherwise RDF-agnostic data sources. Typically, such RDF views are created from structured relational data or made accessible via Web APIs. Hence, this kind of RDF repository provides read access only [Has+ 11]. A tool for generating such RDF views over relational databases is D2RQ (http://d2rq.org/, accessed on May 26, 2014).

2.3.8.2 SPARQL Endpoints

SPARQL endpoints complement RDF stores with support for SPARQL queries over HTTP.

"SPARQL endpoints are RESTful services that accept queries over HTTP written in the SPARQL language [...] adhering to the SPARQL protocol [...]" [Bui+ 13]

According to this definition, the capabilities of SPARQL endpoints are determined by the SPARQL query language and by the SPARQL protocol definition. The protocol definition standardizes the communication with SPARQL endpoints, namely by specifying the messages that every endpoint shall understand, the URI pattern every endpoint should adhere to, and the supported HTTP request methods [cf. Fei+ 13], i.e. GET and POST and, for certain SPARQL Update operations, PUT and DELETE.
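The following is a minimal sketch of the protocol in action, issuing a SELECT query over HTTP with the Python requests library (the DBpedia endpoint is used merely as an example of a publicly available endpoint):

import requests

endpoint = "http://dbpedia.org/sparql"   # example public SPARQL endpoint
query = "SELECT ?s WHERE { ?s a <http://xmlns.com/foaf/0.1/Person> } LIMIT 3"

# Per the SPARQL protocol, the query is sent as the "query" parameter of an
# HTTP GET request; the Accept header selects the result serialization.
response = requests.get(
    endpoint,
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
)
response.raise_for_status()

for binding in response.json()["results"]["bindings"]:
    print(binding["s"]["value"])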

2.3.8.3 Reasoning

Reasoning, or inferencing, is a common term for the task of inferring new, implicit facts from existing information.

"In the context of the Semantic Web, inferencing simply means that given some stated information, we can determine other, related information that we can also consider as if it had been stated." [AH11, p. 114]

Many RDF stores include reasoning capabilities [FCB12]. Some only support basic inferences over RDFS axioms (e.g. rdfs:subClassOf), while others implement very sophisticated reasoning including OWL axioms (e.g. owl:sameAs) and custom rules (e.g. SWRL rules; Stardog, for instance, implements SWRL rules: http://docs.stardog.com/owl2/, accessed on May 26, 2014) (see Section 2.3.7). For an overview of reasoning in semantic repositories, see [KD11, pp. 245–257]. To infer additional knowledge, rule-based reasoners generally apply one of two inferencing strategies [e.g. KD11, p. 249]:

• Forward-chaining concludes the goal from (a series of) facts. Thus, it reasons from antecedent to consequent. It is also known as data-driven or bottom-up reasoning [GR98, p. 145; cf. KD11, p. 249], and its exhaustive application is known as materialization [e.g. KD11, p. 249].

• Backward-chaining starts from the goal (hypothesis) and tries to derive assertions that hold true, eventually drawing on supporting facts (evidence). Hence, it reasons from consequent to antecedent. Backward-chaining is also called goal-driven or top-down reasoning [GR98, pp. 145f.; cf. KD11, p. 249]. A reasoner with a backward-chaining strategy fundamentally performs query rewriting [e.g. KD11, p. 249].

A more formal treatment of reasoning in description logics can be found in [MH09]; an overview of reasoning with the Web Ontology Language (OWL) is given in [HP11].
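The following is a minimal, generic forward-chaining sketch in Python (the triple-based fact and rule encoding is a hypothetical simplification): rules of the form P_1, ..., P_n → C from Equation 2.3 are applied repeatedly until a fixpoint is reached, i.e. the inferences are materialized:

# Facts and rules over simple (subject, predicate, object) triples;
# strings starting with "?" act as variables.
facts = {("myCar", "type", "Car"), ("Car", "subClassOf", "Vehicle")}
rules = [
    # antecedent (premises P_1, ..., P_n) and consequent C
    ([("?i", "type", "?sub"), ("?sub", "subClassOf", "?super")],
     ("?i", "type", "?super")),
]

def matches(pattern, fact, bindings):
    # Extend the bindings so that pattern equals fact, or return None.
    b = dict(bindings)
    for p, f in zip(pattern, fact):
        if p.startswith("?"):
            if b.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return b

def apply_rule(premises, conclusion, facts):
    # Naively join all premises against the fact base.
    candidates = [dict()]
    for premise in premises:
        extended = []
        for b in candidates:
            for fact in facts:
                b2 = matches(premise, fact, b)
                if b2 is not None:
                    extended.append(b2)
        candidates = extended
    return {tuple(b.get(t, t) for t in conclusion) for b in candidates}

changed = True
while changed:                  # iterate to a fixpoint (materialization)
    new_facts = set()
    for premises, conclusion in rules:
        new_facts |= apply_rule(premises, conclusion, facts)
    new_facts -= facts
    changed = bool(new_facts)
    facts |= new_facts

print(("myCar", "type", "Vehicle") in facts)   # True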

2.3.8.4 Open-World and Closed-World Assumptions

Many software systems, specifically database systems, are based on the premise that their data is complete and that everything not known to the system is deemed false. This principle, where the absence of information is regarded as negative information, is known as the closed-world assumption (CWA) [e.g. AvH08, p. 151; DFH11, p. 21]. The CWA describes the situation where a statement is regarded as false when it is not supported by another statement, i.e. "[i]f a fact is not evaluated to be true [...], it is assumed to be false" [DFH11, p. 21].

By contrast, the Semantic Web is based on the open-world assumption (OWA) [AvH08, p. 151; DFH11, p. 21], which acknowledges that in an open environment like the Web it is unrealistic to have full control over all the data. In such an environment it is possible that a dataset makes assertions about another dataset without the second taking note of it. In the simplest case, a person claims to be a friend of a second person. Unless the dataset with this claim is confirmed by an RDF store, it would be wrong to assume the claim to be false just because it is not there. The OWA thus implies that "a statement cannot be assumed true on the basis of a failure to prove it" [AvH08, p. 151].

The OWA as a characteristic of the Semantic Web has significant practical impact. Simply put, absent information like "noisy street next to the hotel" does not mean that there is indeed no highway next to the hotel. This information could simply be missing, either because the assertion was omitted or because it was not detected, for it was made available from an unknown information source. As a potential solution to the OWA on the Semantic Web, Polleres et al. proposed contextually scoped negation [PFH06]. The authors' idea was to allow scoped negation, i.e. to define rules that are valid only in a particular context (with respect to a specific rule base).

2.3.8.5 Non-Unique-Names Assumption

A further peculiarity of the Semantic Web is that OWL does not make the unique-names assumption (UNA): without explicit statements like owl:sameAs or owl:differentFrom, it cannot be assumed that individuals with different names indeed represent different entities [cf. AvH08, p. 151]. In short, the absence of the UNA implies that, unless stated otherwise, one must always expect that people could use different terminology on the Web for referring to the same thing.

2.4 Semantic Data Interoperability

A consolidated view on the data is crucial for querying large data sets like the Semantic Web. To accomplish this goal, a considerable number of heterogeneous data sources need to be integrated. The Semantic Web and Linked Data have already made a significant step towards global data integration on the Web [cf. Gan+ 11, pp. 138f.]. The two design decisions of declaring a standard data model (RDF) and minting globally unique identifiers (URIs or IRIs) for conceptual entities greatly facilitate the interlinking of datasets. Still, it proves challenging to establish an integrated view over structured data on the Semantic Web, among others because of the variety of syntaxes for RDF (although the variety of syntaxes does not pose a huge problem, because most syntaxes are designed around the RDF model), the multitude of conceptual models (even for describing similar things), the different levels of data granularity, and the lack of common standards. In this section, we discuss the main topics linked to the problem of semantic data interoperability, which, although being an old and well-known research problem, is still far from solved [ZD04; Hal05].

2.4.1 Data Integration and Heterogeneity

Data integration is "a pervasive challenge faced in applications that need to query across multiple autonomous and heterogeneous data sources" [HRO06]. The need for data integration emerges whenever different data sources, such as information systems within enterprises, need to be consolidated in order to create a uniform view on the data [ZD04; cf. HRO06; Len02]. The complexity of the data integration task increases as more heterogeneous data sources need to be merged [DHI12, p. 6; cf. Fen+ 01]. Some of these data sources are structured, having a well-defined schema (e.g. relational databases), while others are semi-structured (e.g. XML, HTML) or unstructured (e.g. plain, unorganized text) [DHI12, p. 6]. Because schemas are often developed independently, it can further occur that similar things are described differently, or that the level of detail differs across descriptions even though they rely on the same schema.

The heterogeneity of information manifests itself in a number of ways. Sheth [She99], for example, outlined three classes of information heterogeneity, namely syntactic, structural, and semantic heterogeneity. Similarly, Euzenat and Shvaiko consider syntactic (e.g. ontology language mismatch), terminological (e.g. product versus article), conceptual (intended meaning), and semiotic heterogeneity (subjective interpretation) [cf. ES07, pp. 40–42]. These differences in data quality and granularity pose serious challenges for data integration, both for databases and the Web.

In general, data integration can take place at two conceptual levels, i.e. at the schema and at the instance level. Subsequently, we address both cases, i.e. integration at the schema level in the form of schema and ontology alignment, and integration at the instance level by means of data or instance matching.


2.4.2 Schema and Ontology Matching

Schema and ontology matching constitute approaches for finding mappings between two schemas [e.g. RB01] or ontologies [e.g. SE13] based on similarity measures. While schema matching is a branch of database research, ontology matching describes the respective counterpart for the Semantic Web.

2.4.2.1 Schema Matching

Schema matching deals with identifying correspondences between different database schemas [RB01]. A match "takes two schemas as input and produces a mapping between elements of the two schemas that correspond semantically to each other" [RB01]. Rahm and Bernstein [RB01] elaborated a categorization of schema matching approaches, as illustrated in Figure 2.15. Their three main distinctions of schema matching approaches are as follows [RB01]:

• Schema-level versus instance-level: Besides matches that consider schema information only, schema matches relying on instance data are also possible.

• Element-level versus structure-level: Matches can span one or multiple elements.

• Linguistic versus constraint-based: Matches are computed based on linguistic features only or by taking into account restrictions like data types, integrity constraints, etc.

[Figure 2.15: Taxonomy of schema matching approaches [from RB01]. Individual matcher approaches are split into schema-only and instance/contents-based ones, each subdivided into element-level and structure-level, and further into linguistic and constraint-based techniques (sample approaches include name similarity, description similarity, global namespaces, type similarity, key properties, graph matching, IR techniques over word frequencies and key terms, and value patterns and ranges). Matchers can also be combined, either as hybrid matchers or as composite matchers with manual or automatic composition; further criteria are match cardinality and auxiliary information.]

Furthermore, it is possible to rely not only on an individual but on multiple matching criteria (hybrid matcher), or to combine the results of several matching algorithms (composite matcher) [RB01]. Many schema matching approaches also take advantage of additional input in the form of auxiliary information like dictionaries, thesauri, or previously computed mappings (e.g. to reapply a concatenation of first name and last name developed for the author name to the editor name of a book) [RB01]. Moreover, the mappings are not always simple one-to-one mappings but may also involve multiple schema elements, known as one-to-many and many-to-many mappings, respectively [DHI12, pp. 123f.; RB01]. For example, a simple one-to-one mapping is

Book.author = Author.name

whereas a more complex one-to-many mapping would be

Location.address = compose_address(Place.city, Place.zip, Place.street)
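As a toy illustration of such a one-to-many mapping, the following Python sketch (the record layouts and the compose_address helper are hypothetical) transforms a source record from the Place schema into the Location schema:

def compose_address(city: str, zip_code: str, street: str) -> str:
    """Hypothetical helper merging three source fields into one target field."""
    return f"{street}, {zip_code} {city}"

# Source record following the Place schema.
place = {"city": "Sample City", "zip": "12345", "street": "Main Street 1"}

# Apply the mapping Location.address = compose_address(Place.city, Place.zip, Place.street)
location = {"address": compose_address(place["city"], place["zip"], place["street"])}

print(location)   # {'address': 'Main Street 1, 12345 Sample City'}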

Another categorization of matching techniques distinguishes between rule-based and learning-based solutions [DH05]. Rule-based solutions define rules based on schema information, which makes them straightforward and fast. As a downside, rule-based solutions do not exploit valuable information from instance data, nor do they include past matches. Learning-based approaches aim at addressing these drawbacks.

2.4.2.2 Ontology Matching

Ontology matching is a discipline related to schema matching, for it applies established techniques from database schema matching to ontologies [cf. ES07, p. 61, p. 63; Cas+ 11]. Moreover, it extends schema matching with novel approaches peculiar to ontologies, e.g. reasoning-based matching that relies on ontologies [e.g. Cas+ 11]. The field of ontology matching is sometimes also referred to as ontology alignment [cf. Ehr07].

The variety of terms for ontology matching and their inconsistent use among scientific works is sometimes confusing. The authors in [Ehr07, pp. 23f.; ES07, p. 42] thus provided a terminological distinction. In the following, we briefly summarize the key terms in the context of ontology matching as found in [ES07, pp. 42f.]:

• A correspondence is the relationship that holds between concepts of different ontologies according to a matching algorithm [ES07, p. 42].
• Matching is the process of finding correspondences [ES07, p. 42].
• An alignment is a set of correspondences [ES07, p. 42].
• A mapping is the directed counterpart of an alignment [ES07, pp. 42f.].
• Merging is the creation of a new ontology from two others [ES07, p. 43].
• Integration is the inclusion of one ontology into another [ES07, p. 43].

The ontology matching process produces an alignment from two ontologies [e.g. Cas+ 11]. We can formalize the structure of the ontology matching process using the following equation [ES07, p. 44]:

A' = f(o, o', A, p, r)    (2.4)

The matching process is given by a function f that creates an output alignment A' from the following input parameters: two ontologies o and o', an existing alignment A that needs to be extended, some relevant parameters p for the match operation (e.g. similarity thresholds), and additional resources r for inclusion (e.g. external knowledge bases and thesauri) [ES07, p. 44]. Figure 2.16 provides a graphical representation of the ontology matching process taken from [ES07, p. 44]. Of course, this matching process provides only the baseline for more advanced matching strategies, among others the combination of several matchers or similarities, learning strategies from instance data, probabilistic methods, or user involvement [ES07, p. 117]. A rough sketch of the basic process follows below.

[Figure 2.16: Ontology matching process [from ES07, p. 44]. The matching step takes the two ontologies o and o' and an input alignment A, together with parameters and resources, and produces the output alignment A'.]
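The following Python sketch is a minimal, hypothetical instance of Equation 2.4: the two ontologies are reduced to lists of concept names, the parameter p is a similarity threshold, and the resource r is a tiny synonym table; real matchers combine far more evidence:

from difflib import SequenceMatcher

def match(o, o_prime, a=(), p=0.8, r=None):
    """Toy matcher f(o, o', A, p, r) returning an alignment A' as a list of
    correspondences (concept, concept', similarity)."""
    r = r or {}
    alignment = list(a)                     # extend the input alignment A
    for c1 in o:
        for c2 in o_prime:
            # A resource lookup (e.g. a synonym table) beats string similarity.
            if r.get(c1.lower()) == c2.lower():
                sim = 1.0
            else:
                sim = SequenceMatcher(None, c1.lower(), c2.lower()).ratio()
            if sim >= p:                    # parameter: similarity threshold
                alignment.append((c1, c2, round(sim, 2)))
    return alignment

print(match(["Product", "Price"], ["Article", "PriceSpecification"],
            p=0.7, r={"product": "article"}))
# [('Product', 'Article', 1.0)]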

As already mentioned before, ontology matching has produced several matching techniques, a large part of which originates from schema matching. Euzenat and Shvaiko thus extended the taxonomy of automatic schema matching approaches, as presented in Figure 2.15, with further categories and aspects that apply to ontology matching [ES07, p. 65]. The resulting classification of matching techniques is given in Figure 2.17. Relevant changes to Figure 2.15 are highlighted in bold font face and/or emphasized with a shaded background color. The biggest difference to Figure 2.15 is the layered organization of the classification (three layers). The taxonomy is read either top-down or bottom-up towards the middle. The upper layer distinguishes first by the granularity of the input, followed by the interpretation of the input. The bottom layer makes the distinction based on the kind of input used by the matching approaches [ES07, p. 64].

[Figure 2.17: Classification of matching techniques for ontology matching [from ES07, p. 65]. The middle layer lists the basic techniques, divided into element-level techniques (string-based: name similarity, description similarity, global namespaces; language-based: tokenisation, lemmatisation, morphology, elimination; constraint-based: type similarity, key properties; linguistic resources: lexicons, thesauri; alignment reuse: entire schemas or ontologies, fragments; upper-level and domain-specific ontologies: SUMO, DOLCE, FMA) and structure-level techniques (graph-based: graph homomorphism, paths, children, leaves; taxonomy-based: taxonomy structure; repositories of structures: structure metadata; data analysis and statistics: frequency distributions; model-based: SAT solvers, DL reasoners). The top layer groups these by granularity and input interpretation (syntactic, external, semantic), the bottom layer by the kind of input (terminological, structural/relational, extensional, semantics).]

As compared to schema matching approaches, the auxiliary information is now encoded in the form of alignments that can be reused, ontologized thesauri and knowledge bases like WordNet RDF (http://wordnet-rdf.princeton.edu/, accessed on June 4, 2014; derived from WordNet [Mil95]) or DBpedia (http://dbpedia.org/, accessed on May 12, 2014; derived from Wikipedia), and upper-level and domain-specific ontologies. Furthermore, semantic characteristics can be taken advantage of, e.g. via model-based reasoning (see Figure 2.17).

An important question that arises is how to represent the alignments detected between ontologies. First of all, it depends on the kind of relationship that holds between the mapped concepts. The simplest relationship is equivalence, which can be represented using the owl:equivalentClass property from OWL [ES07, p. 45]. But other axioms, like disjointness or specialization, could be valid mappings as well. E.g., a concept cat in one ontology could be subsumed by a concept pet in a second ontology. Similarly, alignments could carry additional information, such as the confidence that a relationship holds [ES07, p. 46].

2.4.3 Data and Instance Matching

Data matching and instance matching are two terms for the same problem and the counterparts of schema and ontology matching. Doan et al. [DHI12] define data matching as "the problem of finding structured data items that describe the same real-world entity" [DHI12, p. 173]. Other terms referring to the same task can sometimes be found in database, artificial intelligence (AI), and Web literature, namely record linkage, tuple deduplication, duplicate identification, entity consolidation, co-reference resolution, object matching, and link discovery [cf. DH05; Cas+ 11]. In the following, we stick to the term instance matching.

Instance matching copes with the problem of duplicate representations of identical objects. This problem often arises when the same real-world entities appear in heterogeneous data sources, especially in multiple databases [cf. DHI12, p. 173] or in different documents on the Web [cf. Cas+ 11]. Let us consider a data integration example of two databases: The synchronization of two customer databases reveals customer entries for "James Doe" and "Jim Doe". "Jim" is a common shorthand for "James". A closer investigation of the two customer entries reveals that they have the same date of birth, which leads us to conclude that they most likely refer to one and the same customer.

Several techniques have been investigated to identify whether two representations match. We are not going to cover them in detail here, but the most prominent approaches for instance matching encompass rule-based, learning-based, clustering-based, probabilistic, and collective solutions [cf. DHI12, p. 174]. On the Semantic Web, instance matching tries to reconcile individuals or instances represented by URIs. To consolidate these entities, the owl:sameAs property has the same function for instances as the owl:equivalentClass property has for matching classes [Hog+ 10].

The Silk framework is a tool that employs instance matching techniques to discover links between entity pairs on the Web of Data [cf. Vol+ 09]. To achieve this, it offers a declarative language to specify the data sources and SPARQL queries for data retrieval, the type of the resulting links between entities, as well as conditions for these links in the form of similarity metrics, whose scores may be weighted and aggregated [Vol+ 09]. Based on the resulting scores, the links are either created or not [Vol+ 09]. The supported similarity metrics used for discovering the links are various string similarity metrics (including exact string matches); similarities between numerals, dates, and URIs; distance metrics between concepts in a taxonomy; and set comparison [Vol+ 09].
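To make the rule-based flavor concrete in the spirit of the customer example, the following sketch (hypothetical records and an ad-hoc rule; not the declarative language of the Silk framework) marks two records as a match if the dates of birth agree and the names are sufficiently similar after canonicalizing nicknames:

from difflib import SequenceMatcher

NICKNAMES = {"jim": "james", "bob": "robert"}   # tiny, hypothetical lookup table

def canonical(name: str) -> str:
    first, *rest = name.lower().split()
    return " ".join([NICKNAMES.get(first, first)] + rest)

def is_match(rec1: dict, rec2: dict, threshold: float = 0.9) -> bool:
    # Rule: equal date of birth AND name similarity above the threshold.
    if rec1["dob"] != rec2["dob"]:
        return False
    sim = SequenceMatcher(None, canonical(rec1["name"]),
                          canonical(rec2["name"])).ratio()
    return sim >= threshold

a = {"name": "James Doe", "dob": "1970-01-01"}
b = {"name": "Jim Doe",   "dob": "1970-01-01"}
print(is_match(a, b))   # True, since "Jim" canonicalizes to "james"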

2.4.4 String Matching

Doan et al. [DHI12] define string matching as "the problem of finding strings that refer to the same real-world entity" [DHI12, p. 95]. String matching is central to many data integration tasks, including schema, ontology, or instance matching [cf. DHI12, p. 95]. Two strings can be matched using a string similarity measure [cf. DHI12, p. 95]. Cohen et al. [CRF03] provide a comparison of various string distance metrics, ranging from simple edit distance metrics over token-based to hybrid forms. Their comprehensive list of considered string distance metrics includes, among others, the Levenshtein, Jaro, Jaro-Winkler, Jaccard, Jensen-Shannon, and TF-IDF metrics [CRF03].

In the following, we describe the Levenshtein string distance [Lev66] in more detail, a popular yet very simple edit distance metric. The Levenshtein string distance computes the string similarity based on the minimum number of character edits (insertions, deletions, and substitutions) required to transform one sequence of characters into another. Returning to our previous example with the name mismatch between two customer entries, the edit distance between "James Doe" and "Jim Doe" is three (i.e. one character substitution and two deletions).


A similarity score is computed as one minus the distance function between two objects o_1 and o_2 (provided that the distance is calculated as a normalized value in the range [0, 1]) [cf. BR11, p. 223]:

sim(o_1, o_2) = 1 − d(o_1, o_2)    (2.5)

For the Levenshtein string distance, this means that the similarity can be computed using a normalized string distance [cf. DHI12, p. 97] as follows:

sim(s_1, s_2) = 1 − d(s_1, s_2) / max(length(s_1), length(s_2))    (2.6)

In our case, we obtain a similarity score of 0.67:

sim("James Doe", "Jim Doe") = 1 − 3 / max(9, 7) = 0.67    (2.7)
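A straightforward dynamic-programming implementation of the Levenshtein distance and the normalized similarity from Equation 2.6 might look as follows (a minimal sketch; production code would typically use an optimized library):

def levenshtein(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    # dp[j] holds the distance between s1[:i] and s2[:j] for the current row i.
    dp = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        prev_diag, dp[0] = dp[0], i
        for j, c2 in enumerate(s2, start=1):
            prev_diag, dp[j] = dp[j], min(
                dp[j] + 1,                         # deletion
                dp[j - 1] + 1,                     # insertion
                prev_diag + (c1 != c2),            # substitution (or match)
            )
    return dp[-1]

def similarity(s1: str, s2: str) -> float:
    """Normalized similarity according to Equation 2.6."""
    if not s1 and not s2:
        return 1.0
    return 1 - levenshtein(s1, s2) / max(len(s1), len(s2))

print(levenshtein("James Doe", "Jim Doe"))           # 3
print(round(similarity("James Doe", "Jim Doe"), 2))  # 0.67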

2.4.5 Data Cleansing

Data cleansing (or data cleaning, scrubbing) [RD00] is a technique related to the methods of schema matching (ontology matching), instance matching, and other activities that improve the quality of the data [cf. RD00]. While the data integration tasks presented so far are primarily concerned with finding correspondences between related concepts or schemas [e.g. RB01; ES07; Cas+ 11], data cleansing "deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data" [RD00]. These errors may be caused by data quality problems like spelling mistakes, missing or invalid data, or inconsistencies. Rahm and Do identified two general dimensions for sources of data quality problems, i.e. problems caused by single versus multiple data sources (single-source versus multi-source scenarios), and problems arising either at the schema or at the instance level [RD00].

In a data warehouse environment, the general approach to data cleansing consists of an initial data analysis to spot frequent anomalies, followed by the definition of a transformation workflow and mapping rules for the data, a verification step to evaluate the transformation, the execution of the data transformation, and the replacement of the dirty data by the cleansed data [RD00]. In relational database management systems (DBMSs) with SQL support, this data transformation task is often conducted using user-defined functions (UDFs) [cf. RD00]. On the Semantic Web, by comparison, some obvious data quality problems can be spotted using SPARQL queries, as suggested in [FH10]. These data quality problems could then be partially curated by materializing the corrections via SPARQL Update queries.
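As a rough illustration of the latter point, the following sketch (Python with rdflib; the offending data and the normalization rule are hypothetical) replaces a non-standard currency label with its ISO 4217 code by means of a SPARQL DELETE/INSERT update:

from rdflib import Graph, URIRef

g = Graph()
g.parse(data="""
@prefix gr: <http://purl.org/goodrelations/v1#> .
@prefix ex: <http://example.org/> .

ex:SomeOffer gr:hasPriceSpecification ex:SomePrice .
ex:SomePrice gr:hasCurrency "Euro" .
""", format="turtle")

# Cleansing rule: normalize the literal "Euro" to the ISO 4217 code "EUR".
g.update("""
PREFIX gr: <http://purl.org/goodrelations/v1#>
DELETE { ?ps gr:hasCurrency "Euro" }
INSERT { ?ps gr:hasCurrency "EUR" }
WHERE  { ?ps gr:hasCurrency "Euro" }
""")

print(list(g.objects(URIRef("http://example.org/SomePrice"),
                     URIRef("http://purl.org/goodrelations/v1#hasCurrency"))))
# [rdflib.term.Literal('EUR')]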


Standardization and normalization (or canonicalization) represent important cleansing and pre-processing steps for data. Creating and adhering to standards ensures that the data is represented in a uniform and consistent way [RD00]. E.g., relying on a code standard like UN/CEFACT [Uni09b] allows to seamlessly convert between metric (e.g. "centimeter") and imperial units (e.g. "inch"), and vice versa. Normalization (or canonicalization) is essential to address the variety of representations in textual descriptions, and thus to support instance matching. Normalization operations can be applied on words or sentences (word normalization, or token normalization [MRS09, pp. 28–34]) to turn them into a canonical form for easier comparison. Some relevant techniques are summarized below (a small pipeline sketch follows below):

• String-based normalization [cf. ES07, pp. 76f.]:
  – Case normalization: E.g., convert everything to lower-case letters.
  – Removal of blank spaces, links, digits, and punctuation: E.g., "U.S.A." becomes "USA".

• Linguistic normalization [cf. ES07, pp. 84f.]:
  – Tokenization: E.g., split sentences into tokens.
  – Stemming and lemmatization: E.g., stemming [e.g. Lov68; Por80] would reduce "actor", "actress", "action" etc., to their basic stem "act"; lemmatization would further consider contextual information to derive lemmata, i.e. "performer" in addition to "actor".
  – Removal of stop words: E.g., "the", "a", "for", etc.

• Extrinsic linguistic techniques:
  – Usage of dictionaries and thesauri to match or disambiguate various terms [cf. ES07, p. 86].
  – Expansion of common abbreviations and acronyms [cf. Sor+ 10]: E.g., "IR" to "information retrieval" or "Thu" to "Thursday".

Culotta et al., e.g., suggest a learning-based method based on edit distances to canonicalize data records [Cul+ 07]. Mauge et al. [MRR12] apply this method to e-commerce inventory. In essence, they use the method to find and cluster synonyms among properties that they have first extracted from unstructured textual product descriptions [MRR12].
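A minimal token normalization pipeline combining several of the above steps could look as follows (a sketch; the stop-word list and abbreviation table are hypothetical stand-ins for real lexical resources):

import re

STOP_WORDS = {"the", "a", "for", "of"}
ABBREVIATIONS = {"ir": "information retrieval", "thu": "thursday"}

def normalize(text: str) -> list:
    text = text.lower()                                  # case normalization
    text = re.sub(r"[^\w\s]", "", text)                  # strip punctuation ("U.S.A." -> "usa")
    tokens = text.split()                                # tokenization
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]   # expand abbreviations
    return [t for t in tokens if t not in STOP_WORDS]    # remove stop words

print(normalize("An Introduction to IR for Thu."))
# ['an', 'introduction', 'to', 'information retrieval', 'thursday']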


2.5 Product Search

In the following, we discuss relevant concepts and topics for product search.

2.5.1 Information Need

The information need is the lack of information that users seek to compensate for by searching. Manning et al. [MRS09] paraphrase it as "the topic about which the user desires to know more" [MRS09, p. 5]. Morville and Rosenfeld mention four types of information needs [MR06, pp. 34f.]:

1. Known-item seeking ("the right thing"),
2. exploratory seeking ("a few good things"),
3. exhaustive research ("everything"), and
4. re-finding ("need it again").

Information needs are expressed using different types of queries. A query is how the user expresses the information need when looking for answers [MRS09, p. 5].

2.5.2 Search Types

According to the source of information considered, searches can be roughly classified into three main approaches:

1. Classical search (or traditional search) is the most common search paradigm and describes keyword searches over the document-based Web, based on IR techniques applied to textual descriptions [e.g. BP98].

2. Semantic search goes beyond keyword search by trying to better understand the intended meaning of the terms. Because semantic search approaches typically rely on structured data from the Semantic Web, they can take into account context information (e.g. to better understand queries) and augment traditional search results with additional information from the Semantic Web [e.g. GMM03; Hen10].

3. Hybrid search combines the two distinct search types, i.e. the flexibility of keyword-based search with context-aware semantic search [e.g. Bha+ 08; RSA04].


As a complement to these main search types, personalized searches take advantage of user preferences, interests, and context information [e.g. Law00; Pit+ 02]. Based on this knowledge, search systems can then provide custom-tailored search results.

2.5.3 Information Retrieval

Modern information retrieval (IR) describes a research field in computer science that aims to facilitate the access to information objects for users [BR11, p. 1]. Compared to computer science, IR has a very long tradition that can be traced back 5,000 years, when the Sumerians started to organize information on clay tablets for later retrieval [Sin01; BR11, pp. 1f.]. With the invention of paper, the amount of information further increased, which made storage and retrieval even more important [Sin01]. Precursors of modern IR systems were built as mechanical and electro-mechanical devices starting in the 1920s, when Emanuel Goldberg was able to build a system that searched for patterns in a catalog of entries stored on a roll of film [SC12]. A particularly relevant work was published in 1945 by Vannevar Bush, who described in his essay "As We May Think" the idea of the memex, an assistive mechanical device in the form of a desk that was intended to help users organize knowledge by means of a persistent storage (augmenting what the human memory is capable of storing) and associations (resembling the functioning of the human brain; the idea of associations formed the basis of modern hypertext systems such as the WWW) [Bus45]. The history of IR research and development is more comprehensively reviewed in [SC12].

The wide application of modern, computer-based IR systems emerged from library science and digital libraries [BR11, p. 3]. The actual breakthrough of IR came with the WWW and search engines, where the corpus of information to be handled grew very quickly, which challenged traditional IR techniques [BR11, p. 3].

A general and widely accepted definition of IR is provided in the book Modern Information Retrieval: The Concepts and Technology behind Search by Baeza-Yates and Ribeiro-Neto:

"Information retrieval deals with the representation, storage, organization of, and access to information items such as documents, Web pages, online catalogs, structured and semi-structured records, multimedia objects. The representation and organization of the information items should be such as to provide the users with easy access to information of their interest." [BR11, p. 1]

In the book Introduction to Information Retrieval, Manning et al. give a narrower definition of IR:


“Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).” [MRS09, p. 1]

Following this definition (without the notes within the parentheses), IR addresses the finding of unstructured information artifacts from large collections. Taking into account the content within the parentheses, the definition becomes even narrower, focusing on the retrieval of digital, text-based documents. Compared to the previous definition by Baeza-Yates and Ribeiro-Neto [BR11], which also covers non-textual content, this second definition frames IR from a natural language processing (NLP) point of view (which also reflects the primary research focus of the book's authors).

2.5.3.1 Approaches

In general, IR models can be classified by the textual properties of documents and, for retrieval on the Web, the link structure [BR11, p. 59]. Sometimes, multimedia retrieval is also taken into account, which, of course, has different characteristics than text retrieval [BR11, p. 59]. IR models based on text are further distinguished by the level of structure, i.e. whether the text is unstructured or semi-structured [BR11, pp. 59f.] (e.g. HTML elements). Classical IR models are based on unstructured text; these are the boolean model, the vector-based model, and probabilistic models [BR11, p. 60]. In the following, we briefly sketch the main ideas of these classical IR models.

Boolean Model

The boolean model of information retrieval treats documents as sequences of words, represented by an inverted index [ZM06]. An inverted index, inverted file, or inverted file index is a dictionary-like data structure that stores terms alongside a list of the documents they appear in, see e.g. [ZM06; MRS09, p. 6]. An inverted index is a very space-efficient way of storing sparse term-document matrices [MRS09, p. 6]. With boolean retrieval, queries are executed as boolean expressions (linked via AND, OR, or NOT connectors) evaluated over these dictionaries of index terms [MRS09, p. 4] (a minimal sketch of an inverted index with boolean retrieval is given at the end of this subsection). However, the boolean retrieval model has some downsides [SFW83]:

1. The number of results returned can vary greatly based on the search terms and boolean connectors used in the query [SFW83].

2. Retrieved results are not ranked, as a query either matches the document terms or does not [SFW83].

3. Terms that appear in a query or document are not weighted and thus have equal importance, i.e. matches between queries and documents are exact [SFW83].

4. The results are often counterintuitive, because for the disjunction of query terms (e.g. information OR retrieval) a document with one of these terms is given the same relevance as a document containing both terms; and similarly, for the conjunction (e.g. information AND retrieval) a document that contains one of these terms is considered just as useless as a document that does not contain any of them [SFW83].

Alternative models are fuzzy or extended boolean retrieval models [cf. SFW83]. The former extends the boolean retrieval model by giving weights to terms within documents, which allows for a ranked retrieval of documents [SFW83]. In the extended boolean retrieval model, it is further possible to assign weights to query terms, whereby a user can indicate mixed importance values for various search terms [SFW83].

Vector Space Model

Salton et al. formalized the vector space model when they described an automated way to cluster documents for indexing [SWY75]. Each term in a document constitutes a single dimension in the vector space (and optionally might be weighted), i.e. a t-dimensional vector for t distinct terms in a text document [SWY75]. If documents are jointly relevant to a given user query, then they are nearby in the vector space, otherwise their vectors are distant [SWY75]. The similarity between two term vectors is usually determined by calculating the angle (cosine similarity) or the inner product between them [SWY75; Sin01]. In the vector space model, it is common that, in addition to documents, queries are also represented as term vectors, since a query, composed of a sequence of words or a phrase, is ultimately text [Sin01]. Contrary to the boolean retrieval model, the vector space model provides a partial matching and ranking of documents with respect to queries, which is conducted based on the proximity of the term vectors in the vector space and the weights optionally assigned to the terms [BR11, p. 77].

Probabilistic Model

An early proposal to use probability theory for establishing a ranking of documents was published in 1960 by Maron and Kuhns [MK60]. The probabilistic model for information retrieval was presented in 1976 by Robertson and Sparck Jones [RS76]. A probabilistic model uses statistical methods to estimate the probability of a document being relevant to a specific query [cf. Fuh92; BR11, p. 80f.]. Probabilistic models are generally based on the probability ranking principle [Rob77], meaning that a system's performance is optimal if the documents are ranked by decreasing probability of relevance for a specific request on the basis of relevance judgments available to the system [e.g. Rob77; Cre+ 98; RZ09]. The binary independence model (BIM), a simple probabilistic model, assumes a binary index description of documents and that the terms in the documents are independently distributed [RS76]. Additional probabilistic models that are frequently mentioned in IR literature are, among others, BM25 [RZ09], but also binary independence indexing, staged logistic regression, 2-Poisson, or inference network models. For an overview of these probabilistic models, see [Fuh92; Cre+ 98].
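The following is the minimal inverted-index sketch announced above, supporting conjunctive (AND) boolean retrieval over a toy document collection (the documents are hypothetical):

from collections import defaultdict

docs = {
    1: "information retrieval on the web",
    2: "product information on the semantic web",
    3: "semantic search for products",
}

# Build the inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def boolean_and(*terms):
    """Evaluate 'term1 AND term2 AND ...' by intersecting posting sets."""
    postings = [index[t] for t in terms]
    return set.intersection(*postings) if postings else set()

print(boolean_and("information", "web"))       # {1, 2}
print(boolean_and("semantic", "retrieval"))    # set()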

Other Approaches

Latent semantic indexing (LSI) (or latent semantic analysis) [Dee+ 90] is a further vector-based approach that seeks to improve retrieval performance. It is based on a singular-value decomposition of a term-document association matrix; in other words, it leverages the latent semantic structure of documents and semantically indexes associated terms in a vector space [Dee+ 90]. This way it is possible to consider documents that at first glance seem irrelevant, yet are implicitly related to relevant documents based on some shared concepts. Similarly, irrelevant documents that appear relevant can be eliminated with higher confidence from the result set. Consequently, LSI is able to ameliorate the problems of synonymy (i.e. different words that mean identical or similar things [NO95], e.g. "car" versus "automobile") and polysemy (or homonymy, i.e. identical or similar words that carry different meanings [NO95], e.g. "Jaguar" the big cat versus "Jaguar" the automobile brand name), which usually create problems for the conventional boolean and vector-based retrieval models [Dee+ 90]. While synonyms can negatively impact the recall of IR systems, homonyms usually account for poor precision [Dee+ 90] (see Section 2.5.3.3 for a discussion of precision and recall in IR).

2.5.3.2 Ranking

IR algorithms generally rank results based on previously computed scores associated with documents [Sin01]. The result set might further be pruned to those documents with scores above a specified threshold value [BR11, p. 78].

Term weighting is a method to attach weights to terms relative to their importance in documents or document collections [cf. SB88]. Its goal is to improve retrieval effectiveness by retrieving as many relevant documents as possible and rejecting irrelevant documents [SB88]. The term frequency–inverse document frequency (TF-IDF) model is the most popular weighting scheme for IR [BR11, p. 68]. It relates the term frequency within a document (TF, term frequency) [Luh57] to the frequency of the term over the whole document collection (IDF, inverse document frequency) [Spa72]. Accordingly, a document is relevant with respect to a particular query term if the term is frequent in that document. At the same time, the relevance of the document is lower if the term exists in several documents of a collection instead of solely occurring in a single document.

In large document collections, there is a risk of term weighting schemes favoring longer documents, because long documents typically expose a richer set of terms and a higher term frequency [SBM96]. To address this unfair treatment of shorter documents, IR algorithms employ document length normalization functions to correct for documents of different lengths. Popular techniques mentioned in [SBM96] are cosine normalization, maximum TF normalization, and byte length normalization.

The TF-IDF weighting scheme calculates ranking scores for documents as follows: Given a query q and a document d, the score associated with the document d is the sum of the TF-IDF weights of all query terms t [cf. MRS09, p. 119]:

score(q, d) = Σ_{t ∈ q} tfidf_{t,d}    (2.8)

In the vector space model, the score is computed as the similarity between the query and the document vector [Sin01]. The similarity of two vectors can be determined by the cosine similarity (normalized dot product) [cf. MRS09, p. 124], i.e. the angle between the two vectors q and d in the t-dimensional space. The vectors are represented as vectors of weighted terms, in the basic form TF-IDF weights associated with terms [BR11, p. 78]:

score(q, d) = (q · d) / (|q| |d|)    (2.9)

As already discussed, probabilistic models rank documents according to the probability ranking principle [Rob77], which essentially states that for maximal effectiveness, systems should rank documents by their probability of being relevant to a given query.
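To make Equations 2.8 and 2.9 concrete, here is a small sketch computing TF-IDF weights and cosine scores over a toy collection (a simplified TF-IDF variant; real systems add smoothing and the length normalization discussed above):

import math
from collections import Counter

docs = {
    1: "ice cream shop munich",
    2: "ice cream recipes",
    3: "shop for semantic web tools",
}
N = len(docs)
tokenized = {d: text.split() for d, text in docs.items()}

def idf(term):
    df = sum(1 for terms in tokenized.values() if term in terms)
    return math.log(N / df) if df else 0.0

def tfidf_vector(terms):
    tf = Counter(terms)
    return {t: tf[t] * idf(t) for t in tf}

def cosine(v1, v2):
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm = math.sqrt(sum(w * w for w in v1.values())) * \
           math.sqrt(sum(w * w for w in v2.values()))
    return dot / norm if norm else 0.0

query_vec = tfidf_vector("ice cream shop".split())
for doc_id, terms in tokenized.items():
    print(doc_id, round(cosine(query_vec, tfidf_vector(terms)), 3))
# Document 1 scores highest, document 2 next, document 3 lowest.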

2.5.3.3 Evaluation Criteria for Information Retrieval

In Modern Information Retrieval: The Concepts and Technology behind Search, Baeza-Yates and Ribeiro-Neto outline the central criterion for the success of an IR algorithm, namely the relevance of the presented results with respect to a given information need:

"[T]he primary goal of an IR system is to retrieve all the documents that are relevant to a user query while retrieving as few non-relevant documents as possible." [BR11, p. 4]

Precision and recall are two fundamental metrics for evaluating the quality of IR systems [e.g. Cle67]. They were initially defined as part of the Cranfield experiments in the 1950s [Cle67], a precursor of modern IR evaluation, where indexing systems were compared and evaluated with the help of systematically created reference collections. These test collections contain documents manually labelled as relevant or non-relevant relative to particular queries. Nowadays, similar test collections (http://trec.nist.gov/data.html, accessed on June 23, 2014) are provided in the context of the Text REtrieval Conference (TREC), addressing various information needs from different application domains [e.g. Sin01; SC12], including Web search, medical search, enterprise search, and others.

In the following, we formally define precision, recall, and the F_1-measure as three popular evaluation criteria for information retrieval algorithms. Let D_retrieved denote the set of documents retrieved from a document collection D, and D_relevant the set of documents deemed relevant (see Figure 2.18).

[Figure 2.18: Side-by-side comparison of precision and recall, shown as Venn diagrams of D_retrieved and D_relevant within the document collection D: (a) Precision; (b) Recall.]

The precision, aiming for the highest possible proportion of relevant documents in a returned result set [Cle67], is calculated as the number of relevant documents retrieved divided by the number of all retrieved documents from a document collection [cf. MRS09, p. 155]:

precision = |D_relevant ∩ D_retrieved| / |D_retrieved|    (2.10)

The recall, which strives to retrieve as many relevant documents as possible from a document collection [Cle67], is formalized as the proportion of relevant documents that could be retrieved with respect to all relevant documents in the document collection [cf. MRS09, p. 155]:

recall = |D_relevant ∩ D_retrieved| / |D_relevant|    (2.11)

The values for precision and recall are rational numbers between 0 and 1. The higher the precision, the more of the retrieved documents are relevant. The higher the recall, the more of all the relevant documents could be retrieved. Precision and recall, however, often describe an inverse relationship: improving one value usually comes at the cost of the other [Cle67]. Finally, the F_1-measure relates these two values [cf. MRS09, p. 156]. It is given by

F_1 = (2 · precision · recall) / (precision + recall)    (2.12)
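A direct translation of Equations 2.10–2.12 into code might look as follows (a minimal sketch with made-up document ids):

def precision(relevant: set, retrieved: set) -> float:
    return len(relevant & retrieved) / len(retrieved)

def recall(relevant: set, retrieved: set) -> float:
    return len(relevant & retrieved) / len(relevant)

def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r) if p + r else 0.0

relevant  = {1, 2, 3, 4}        # documents judged relevant
retrieved = {2, 3, 5}           # documents returned by the system

p, r = precision(relevant, retrieved), recall(relevant, retrieved)
print(round(p, 2), round(r, 2), round(f1(p, r), 2))   # 0.67 0.5 0.57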

In addition to the F_1-measure, there exist other measures that integrate precision and recall into a single number, e.g. mean average precision (MAP) or 11-point interpolated average precision [MRS09, pp. 159–161]. The 11-point interpolated average precision, e.g., calculates the arithmetic mean of the interpolated precision at the recall levels 0%, 10%, ..., 100% (eleven points) [MRS09, p. 159].

Often, however, especially for Web searches, it is difficult to get hold of all the documents in a collection, which at least hampers the computation of the recall measure. If the proportion of retrieved documents is very large, then it turns out difficult to compute the precision metric, too. For this reason, other metrics are sometimes more appropriate. Subsequently, we delineate a non-exhaustive selection of these metrics:

• Precision at n (P@n), where P@5 (precision at 5), P@10 (precision at 10), and P@20 (precision at 20) are possible instances, is a very popular measure for Web searches [BR11, p. 140]. P@5 means that the result set is cut off after the fifth result, and the portion of relevant documents in this reduced result set gives the precision. This metric can be used to compare different IR systems, or to observe the behavior of a single system over a number of queries [BR11, pp. 139f.].

• R-precision computes the precision at position R, given that R is the number of relevant documents for a particular query [BR11, p. 141]. This measure assumes that R is known. Thus it addresses a potential drawback of P@n, i.e. it relates the measure to the number of relevant documents, which can positively affect the measure for queries with many relevant results [MRS09, p. 161].

• Binary preference (BPREF) [BV04] is used if relevance judgements are incomplete in document collections, as in large corpora like the Web, where other measures generally treat the unjudged documents as non-relevant [BR11, p. 151]. In short, BPREF computes a metric over preference relations among documents, and is based on the number of documents judged as non-relevant by human experts that show up ahead of relevant documents [BR11, p. 151].


In addition to the retrieval quality metrics, there are other important factors for evaluating IR systems, such as system quality assessment and user utility tests (e.g., the time required for completing a task, or usability scores) [MRS09, p. 168]. Accordingly, an IR system can be systematically evaluated in terms of usability by means of user-based experiments. A non-exhaustive list of popular methods for usability testing includes [cf. BR11, pp. 168–173]:

• Side-by-side panels: With this method, the results of two retrieval algorithms are compared next to each other [TH06]. Users are then asked for comparative judgements on the retrieval qualities of the two algorithms and, possibly, their interactions are logged [TH06].

• A/B testing: A/B testing is a controlled experiment where users are typically split into two evenly distributed groups (but other splits are possible as well [e.g. MRS09, p. 70]), namely a control group A and a treatment group B, and randomly assigned to them [cf. Koh+ 09]. While for one user segment the conditions stay the same, the other group is faced with a different version with slightly changed parameters [Koh+ 09]. Running this kind of test can provide useful insights prior to adding new features, changing the user interface design, or renewing the underlying retrieval algorithm of a system.

• Click-through data: User clicks are recorded in the background while users are interacting with the system [Joa02]. From such data about the relevance of documents to a particular query, a retrieval algorithm can learn better rankings for future requests [Joa02].

In general, at least three possible settings for conducting user experiments can be imagined. The classical example is a lab setting where users are observed during task execution and interviewed [BR11, p. 168]. An alternative method is to collect usage data by logging user interaction with the system [cf. Joa02]. This requires additional implementation effort. The third option is to outsource the fulfillment of user tasks to people on the Web, known as crowdsourcing. Usability testing with crowdsourcing can be regarded as similar to a lab setting, with the important difference that a bigger audience can be reached at lower cost and time overhead [Liu+ 12].

2.5.3.4 Tools
There are many tools and applications that have been developed in the context of IR, among others commercial search engines like Google, Yahoo!, Bing, etc.


Furthermore, there exist several open source projects. One of them is Apache Lucene76, an industrial-strength open source project that brings many of the benefits of IR algorithms to smaller development projects as well. Lucene is a software library that adds full-text search and indexing capabilities to applications, supporting a range of different query types, several ranking models, and many configuration options [cf. MHG10].
76 http://lucene.apache.org/core/ (accessed on June 24, 2014)

2.5.4 Human-Computer Interaction
Human-computer interaction (HCI) describes an interdisciplinary field dealing with methods for how people can interact with computer systems. For an overview, see e.g. [Car97; Mye98; Gru12]. Even though HCI also deals with past and future advances in input and output devices like the computer mouse or motion tracking devices [Mye98], the research direction we are most interested in here is user interface design and interaction models for search systems. White et al. [Whi+ 06b] frame the challenges of search systems as "[r]ather than just providing search results, search systems should help users explore, overcome uncertainty, and learn" [Whi+ 06b]. Morville and Callender [MC10] characterize search as follows:

"[S]earch at its best is a conversation. It's an iterative, interactive process where we find as we learn. The answer changes the question. The process moves the goal. Search has the power to suggest, define, refine, cross-sell, upsell, relate, and educate. In fact, search is already among the most influential ways we learn." [MC10, p. 9]

In the following three sections, we elaborate on the common distinguishing characteristics of search. After that, we summarize interaction paradigms for search, dedicating more space to faceted search interfaces, and conclude with design guidelines for search interfaces.

2.5.4.1 Static versus Dynamic Search
Search is traditionally oversimplified as a task accompanied by a static information need [Hea11]. In this view, search is about (a) identifying a problem, (b) articulating the information need, (c) formulating a query, and (d) evaluating the results [Hea11, p. 23]. While this model fits known-item seeking (see Section 2.5.4.2) scenarios well, it is not suitable for searches where the user does not yet know exactly what he is looking for. In fact, search is very often a dynamic process, where the information need adapts over time as users learn and collect new information [Hea11, p. 23]. At best, the design of user interfaces caters for these circumstances.


Contemporary search interfaces hence try to continually engage the user in the search process. Information seekers revise queries in an iterative manner according to their changing information needs based on past results [Mar06]. While searching, they collect, compare, filter, and digest pieces of information. This is sometimes referred to as the berrypicking model of search [Bat89]. In accordance with the dynamic search model, incremental search strategies are quite common. One such strategy is that users pose an initial query without fully specifying the information need ("testing the water"), which is only refined later in subsequent search iterations [Hea11, p. 26]. Another possibility is to do a series of smaller searches instead of writing one long, complex query, in the hope of gradually approaching the final answer. This kind of incremental search strategy has been termed orienteering in [OJ93] [Hea11, p. 23]. One important problem with current systems is that the temporary storage and later integration of intermediate results is mostly left to the human and only lightly supported by the systems.

2.5.4.2 Lookup versus Learning
Depending on the information need and the user's prior knowledge of a domain, different search tasks may be most appropriate. Morville and Callender discern two of them, i.e.

• lookup (or known-item search), and
• learn (or exploratory search) [MC10, pp. 27f.].

Further, Marchionini [Mar06] makes a more subtle distinction by regarding lookup, learn, and investigate. With regard to Morville and Callender [MC10], the latter two tasks (learn and investigate) can be summarized under the common term exploratory search [Mar06]. Lookup "is the most basic kind of search task" [Mar06] and involves fact retrieval and question answering (QA), i.e. querying information about items that are already known [cf. MC10, p. 28]. On the other hand, learning and investigative searches are multi-step approaches that comprise knowledge discovery, comparison, and analysis, to name but a few [cf. MC10, p. 28]. Investigative search differs from learning in that it is a long-term process that aims to gather new knowledge, update existing knowledge, or fill potential gaps in knowledge [Mar06]. It hence constitutes an in-depth research process.


2.5.4.3 Searching versus Browsing
In general, we can characterize two main activities of an information-seeking agent, i.e.

1. searching (or querying) and
2. browsing (or navigating) [BR11, p. 4].

Morville and Rosenfeld [MR06, p. 35] further mention asking as a third information-seeking activity. Sometimes, if an information need is vague, it may not be appropriate or feasible to formulate a query. For example, it might be difficult to recall the right terms or descriptions of an information need [Hea11, p. 24]. In such cases, browsing or navigating the option space is more promising than querying. Nonetheless, these information-seeking tasks are most commonly combined, i.e. a user is able to alternate between searching, browsing, and asking [MR06, pp. 35f.].

2.5.4.4 Interaction Paradigms for Search
The optimal search interface for user interaction depends on several factors. Hearst mentions three situations that affect the type of applicable search interface, namely

1. the kind of search task,
2. the time and effort that can be spent on the task, and
3. the past experience of the information-seeking agent [Hea11, p. 22].

For example, imagine someone wants to know more about a topic he currently lacks expertise in. Recall that in this case, navigating the option space and possibly finding new interesting things might be easier and lead to more promising results than having to write complex queries. In the following, we outline the two most common search paradigms in use today:

1. Keyword search: Widely employed by search engines and digital libraries, the prevalent query method is to enter query terms into a search box, referred to as keyword search. Keyword search takes advantage of techniques from IR [cf. BR11, p. 3]. Keyword searches are intended to be easy to grasp for the vast majority of information seekers. A related but less intuitive approach is described by form-based query interfaces, i.e. guided search interfaces with multiple input fields for more expressive queries [Wei+ 13].


Due to their complexity for the average user, form-based interfaces are typically reserved for expert searchers or narrowly defined information needs.

2. Navigational search: Before the rise of keyword-based search engines, links were collected and manually organized into Web directories to facilitate the discovery of documents via category navigation [Din+ 05; Wei+ 13] (e.g. by Yahoo! Directory77, Open Directory78, etc.). This navigational search has become less popular for Web searches with the enormous growth of the Web and improvements to search engine accuracy. However, navigational search is implemented by many Web sites, e.g. news portals and e-commerce platforms, to facilitate page navigability or the discovery of products.

Well-designed search interfaces usually integrate both keyword-based and navigational searches. Amazon, for example, effectively combines keyword searches with navigational capabilities to let the user narrow down the information space (i.e. the product catalog). Exploratory search goes one step further by blending querying and browsing strategies into a highly interactive user interface [Mar06]. Faceted search interfaces constitute an important instance of exploratory search.

2.5.4.5 Faceted Search Interfaces
Faceted search is a specific type of exploratory search [Wei+ 13] which, in addition to traditional keyword search, allows users to navigate the option space via browsing and filtering [Tun09, p. 24]. Faceted search interfaces are quite common on e-commerce platforms like eBay or Amazon. Faceted search is a multi-dimensional search paradigm based on the concepts of facets79 and their facet values or terms [Tun09, pp. 7f.]. Facets can be roughly compared to categories orthogonal to each other. Consequently, to give an example, possible facets for beer would be "style", "location", and "brewery". Similarly, examples of respective facet values would be "wheat beer" for the style, "Munich" for the location, and "Paulaner" for the brewery (see Figure 2.19). Faceted search is sometimes used interchangeably with the terms faceted navigation, faceted browsing, or guided navigation [Wei+ 13; cf. MC10, p. 95].

77 http://dir.yahoo.com/ (accessed on June 18, 2014)
78 http://www.dmoz.org/ (accessed on June 18, 2014)

79 Important contributions to faceted classification were made by Shiyali Ramamrita Ranganathan, an Indian librarian, in the first half of the twentieth century, by proposing his extensible colon classification scheme for library science [Tun09, pp. 7f.].

Figure 2.19: Faceted search interface (example facets and values: Style: Wheat beer, Pale beer, Dark beer; Location: Munich, Erding; Brewery: Paulaner, Erdinger, Hacker-Pschorr, Bitburger)

Tunkelang, however, makes a distinction between faceted search and faceted navigation (or browsing): he considers faceted navigation, like parametric search, a predecessor of faceted search [Tun09, p. 21]. Parametric searches are search interfaces based on Boolean algebra, where the interface components essentially describe a combination of logical AND and OR connectors [Tun09, p. 21]. Multiple facets are connected by ANDs, whereas the values within a facet are connected by ORs [Tun09, p. 21]. Selecting both the "Paulaner" and "Bitburger" breweries and the location "Munich" from Figure 2.19 thus yields the following Boolean query:

Location(Munich) AND (Brewery(Paulaner) OR Brewery(Bitburger))
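The semantics of such a parametric query can be sketched in a few lines of Python (our own illustration; the item dictionaries mirror the beer example from Figure 2.19): values selected within a facet are combined by OR (set membership), while the facets themselves are combined by AND:

def matches(item, selections):
    # item: facet -> value; selections: facet -> set of chosen values
    # AND across facets, OR within a facet (set membership)
    return all(item.get(facet) in values for facet, values in selections.items())

beers = [
    {"style": "Wheat beer", "location": "Munich",  "brewery": "Paulaner"},
    {"style": "Pale beer",  "location": "Bitburg", "brewery": "Bitburger"},
    {"style": "Dark beer",  "location": "Erding",  "brewery": "Erdinger"},
]

# Location(Munich) AND (Brewery(Paulaner) OR Brewery(Bitburger))
selections = {"location": {"Munich"}, "brewery": {"Paulaner", "Bitburger"}}
print([b["brewery"] for b in beers if matches(b, selections)])  # ['Paulaner']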

Faceted navigation adds guidance to parametric searches [Tun09, p. 23]. While parametric searches are executed as a one-shot query, faceted navigation typically provides immediate feedback by adapting the choices available in other facets to the currently selected options [Tun09, p. 23]. So, in the example above, after selecting "Munich" for the location, it would no longer be possible to choose the brewery "Bitburger". In other words, since Bitburger beer is brewed in Bitburg, Rhineland-Palatinate, and not in Munich, the option for Bitburger has disappeared. Finally, a faceted search interface combines faceted navigation with keyword search [Tun09, p. 24]. Pioneering work on this type of search interface was conducted by Hearst et al., who suggested the Flamenco faceted search framework80 [Hea+ 02]. Wei et al. [Wei+ 13] compare a number of faceted search systems that have been presented in the past. The research on dynamic taxonomies [Sac00] is also closely linked to faceted search, because faceted search usually displays facets and facet values dynamically rather than showing rigid category structures that rely on static taxonomies. Faceted search interfaces have a number of characteristics that permit contrasting them with other search paradigms:

80 http://flamenco.berkeley.edu/ (accessed on June 27, 2014)


(1) They do not require the user to manually formulate complex queries, but allow refining and relaxing search results based on facets and facet values; (2) they require little knowledge about the underlying data schema; (3) they make it possible to explore the option space in a guided and incremental fashion, where the possible options for the next navigation steps depend on the current selection; and, (4) faceted search interfaces solve the problem of "dead ends" [FH11], i.e. they eliminate unsatisfiable choices that could otherwise lead to empty result sets [cf. Wei+ 13].
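The adaptive behavior described in characteristics (3) and (4) can likewise be sketched (our own illustration, reusing the beers list and the matches predicate from the previous sketch): for every facet, only those values are offered that still yield a non-empty result set under the selections made in the other facets, so unsatisfiable choices never appear:

def available_values(items, selections):
    options = {}
    for facet in {f for item in items for f in item}:
        # evaluate against the selections in all *other* facets, so that
        # alternative values within the same facet remain selectable
        others = {f: v for f, v in selections.items() if f != facet}
        options[facet] = {item[facet] for item in items if matches(item, others)}
    return options

# After selecting "Munich", "Bitburger" disappears from the brewery facet:
print(available_values(beers, {"location": {"Munich"}}))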

2.5.4.6 Design Guidelines for Search Interfaces
A simple but very important rule in search interface design is simplicity, meaning that a search interface should be as simple as possible, avoiding any unnecessary complexity that could distract users [SBC97]. Here, we briefly summarize eight guidelines for good search user interface design. They were initially proposed in the context of IR systems in the work of Shneiderman et al. [SBC97], and were further discussed in [Hea09]:

1. Consistency: Search user interfaces need to be consistent regarding terminology, layout, etc.; otherwise, usability might suffer [SBC97].

2. Shortcuts: Highly repetitive tasks or well-known navigation paths should be supported by shortcuts. Shortcuts can be realized in the form of keyboard shortcuts for better interaction, but also as clickable page links to facilitate navigation (e.g. page breadcrumbs or deep links in search engine result snippets) [Hea09].

3. Informative feedback: The feedback provided during the search process should be informative to the user, so that the context of the particular query becomes clear [SBC97]. In this regard, it is necessary that a search interface quickly returns results, that search results are augmented with a short summary or snippet (possibly with the search terms highlighted), or that search terms are suggested to the user during query formulation [Hea09].

4. Design for closure: For a user, it is usually hard to estimate the size of the option space. Hence, a system should be designed to relieve users of tedious and unnecessary guesswork regarding their search progress [SBC97].

5. Error handling: Simple errors should be reported as long as the reports give helpful insights and do not distract users [SBC97]. Furthermore, a system should take care to avoid certain errors, such as presenting users with an empty result set, or take action to make the search system more resilient against several different variants of the same query intent [Hea09].


6. Reversal of actions: It should be possible for users to undo actions in order to be able to return to past queries, which can become very powerful in combination with relevance feedback [SBC97].

7. User control: In general, users should feel like they have control over a search system [SBC97]. However, many search systems use search algorithms that a user does not necessarily have to understand in much detail. To some extent, a system could work autonomously, e.g. it could improve recall by eliminating stop words or expanding acronyms, or by clustering results [Hea09]. Nevertheless, sometimes the user should take over control, e.g. the system could ask whether a spelling correction is desired [Hea09].

8. Short-term memory load: Any search interface design that poses a high cognitive load on users is generally discouraged. For instance, information in one place in compact form is often better than the same piece of information distributed among multiple resources (e.g. many slides in a presentation for one idea) [SBC97]. In principle, users should not have to remember too much of the relevant information and should instead be supported by the interface [Hea09]. This can include affordances for interaction (e.g. pre-filled default values in text fields, or tooltips), but also the maintenance of a search history [Hea09].

2.5.5 Recommender Systems
Nowadays, we find ourselves in a culture of mass customization rather than mass production [cf. FFS07; SKR99]. Mass production was the leading business model of the twentieth century; it was introduced in its early years by Henry Ford for producing millions of standardized Ford Model T automobiles [FFS07]. Mass customization is the imperative that companies are facing today [FFS07]. Customers have individual needs that can largely be fulfilled by the provision of multiple product variants. However, the vast array of options available for purchase quickly makes people feel overwhelmed [e.g. FFS07; SKR99]. Recommender systems are a means to help solve the problem of information overload that accrues from this variety of products [SKR99]. They facilitate decision-making and are based on the same principle as traditional word-of-mouth recommendations via other people's experiences [Bob+ 13].

"Recommender systems support users by identifying interesting products and services in situations where the number and complexity of offers outstrips the user's capability to survey them and reach a decision." [FFS07]


A good recommender system tries to understand the user's information need and to suggest the right products based on it. Possible business benefits of an accurate recommender system are turning visitors into buyers, stimulating cross-selling of products81, and improving customer loyalty [SKR99].

2.5.5.1 Approaches
Recommendation approaches are usually distinguished by the type of information source. The most prominent information sources are user ratings and item82 characteristics. Adomavicius and Tuzhilin distinguish three main categories of recommendation approaches [AT05]:

1. Content-based filtering considers item characteristics and suggests similar items based on prior interests or past purchases of the user, i.e. matching items to the user profile.

2. Collaborative filtering takes into account the profiles, preferences, and interests of other users. Based on similar profiles, a user is recommended items that other users have already preferred, rated, or visited previously.

3. Hybrid recommendation approaches combine different recommendation methods [Bur07; AT05], e.g. content-based and collaborative filtering, to overcome the individual limitations of the single approaches.

There is more explicit and implicit information about users and items that can be gathered and utilized, namely user behavior, various types of context information (e.g. demographics), social profiles, or the Internet of Things (IoT) [Bob+ 13]. Accordingly, we can augment the aforementioned categorization of recommendation approaches. Burke mentions the following two additional recommendation approaches [Bur07]:

4. Demographic filtering takes into account the demographic profile of the user, i.e. the similarity of user profiles is calculated based on demographic attributes like location, gender, or age.

5. Knowledge-based filtering recommends items based on domain knowledge and user preferences.

Further, there is at least one other approach that became attractive with social networks:

81 Cross-selling describes additional sales generated by related products, e.g. complementary goods.
82 "Items are the objects that are recommended." [RRS11]


6. Community-based filtering takes advantage of social networks, e.g. by limiting recommendations to the circle of friends whom the user typically trusts more than the anonymous crowd [e.g. RRS11, p. 13].
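To make the collaborative filtering idea from approach 2 above concrete, consider this minimal user-based sketch (our own illustration with hypothetical rating data, not taken from any cited system): users are compared via the cosine similarity of their rating vectors, and the unseen items of the most similar peers are suggested:

from math import sqrt

ratings = {  # hypothetical user -> {item: rating} data
    "alice": {"camera": 5, "lens": 4, "tripod": 1},
    "bob":   {"camera": 5, "lens": 5, "bag": 4},
    "carol": {"tripod": 5, "bag": 2},
}

def cosine(u, v):
    # cosine similarity of two sparse rating vectors
    num = sum(u[i] * v[i] for i in set(u) & set(v))
    den = sqrt(sum(r * r for r in u.values())) * sqrt(sum(r * r for r in v.values()))
    return num / den if den else 0.0

def recommend(user, k=1):
    # pick the k most similar peers and collect their items the user has not seen
    peers = sorted((cosine(ratings[user], ratings[p]), p)
                   for p in ratings if p != user)[-k:]
    seen = set(ratings[user])
    return {item for _, p in peers for item in ratings[p] if item not in seen}

print(recommend("alice"))  # bob is most similar to alice, so: {'bag'}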

2.5.5.2 Limitations
There are several challenges that need to be solved for recommender systems. In the following, we outline selected problems that are repeatedly mentioned in the literature [e.g. FFS07; Bob+ 13; RRS11; AT05].

Cold-Start Problem A common limitation of recommender systems is the cold-start problem [e.g. Bob+ 13; RRS11, p. 13; Bur07; LKH14]. The cold-start problem describes the problem of an insufficient number of initial ratings that recommendations could be based on [Bob+ 13]. It emerges either when a new recommender system is launched for the first time, or when a new user or a new item is added to an existing recommender system [Bob+ 13]. While content-based filtering generally suffers from the new user problem only (i.e. the system has no prior knowledge about a user's preferences), the collaborative method also faces the new item challenge (i.e. there are not sufficiently many user ratings for a given item) [AT05]. The new user and new item problems are usually addressed by employing hybrid recommendation strategies, e.g. by combining content-based and collaborative filtering techniques [e.g. AT05; Bur07].

Sparsity Sparsity of user ratings is a problem of collaborative filtering techniques [AT05]. Especially if the number of items is large, it can happen that several of them have not been rated yet. This problem can be mitigated by drawing on demographic filtering techniques, i.e. recommending items based on user groups that share the same demographic attributes [AT05].

Diversity A feature that is sometimes disregarded but still desirable for recommender systems is to establish some degree of diversity between recommended items [AT05]: if systems strive to suggest items that match user profiles as closely as possible, then users will never be recommended serendipitous items, i.e. novel items that might be surprisingly interesting to them. To tackle this issue, recommender systems could avoid suggesting items that are too similar to the items seen before (instead of only those that are too dissimilar) [AT05].


2.5.5.3 Tools
Numerous commercial Web portals integrate recommendation engines to recommend items to users, e.g. Netflix, Youtube, Amazon, eBay, etc. But there also exist several non-commercial tools for building recommendation engines. In the course of their research (e.g. MovieLens83, a movie recommender system), the GroupLens research laboratory at the University of Minnesota developed an open source toolset for building and studying recommender systems, named LensKit84 [Eks+ 11]. In their accompanying research paper, Ekstrand et al. [Eks+ 11] also give an overview of other popular recommender toolkits, among others Apache Mahout85, which is a machine learning library that implements recommenders based on clustering, classification, and collaborative filtering techniques.
83 http://movielens.org/ (accessed on June 27, 2014)
84 http://lenskit.grouplens.org/ (accessed on June 27, 2014)
85 http://mahout.apache.org/ (accessed on June 27, 2014)

2.5.6 Natural Language Processing
Natural language processing (NLP) is one branch of speech and language processing [JM09, p. 1]. Speech and language processing is a research area which can be subdivided into the fields of computational linguistics (linguistics), NLP (computer science), speech recognition (electrical engineering), and computational psycholinguistics (psychology) [JM09, p. 9]. According to Manaris, NLP spans the research disciplines of computer science, electrical engineering, mathematics, statistics, linguistics, psychology, philosophy, and biology [Man98, pp. 6f.]. Early research on NLP reaches back to the first years after World War II (the late 1940s), when English scientists worked on machine translation with the aim of translating Soviet papers on physics from Russian to English [Man98, p. 7]. Beginning in the 1990s, the field of NLP has attracted a lot of commercial and academic interest due to the dawn of the Web, an increasing amount of text corpora made available for analysis (e.g. the Penn Treebank project with tag sets of annotated English text [MSM93]), significantly more powerful computers, and improvements to NLP models (e.g. the application of probabilistic models) [cf. JM09, pp. 12f.]. For a more detailed overview of NLP history, see e.g. [Bat95; Man98, pp. 7–12; cf. JM09, pp. 9–14]. It is not easy to find a universal definition of NLP, because of the many facets of language and its interdisciplinary nature [Man98, pp. 4f.]. Manaris defines NLP within the scope of HCI as


“[...] the discipline that studies the linguistic aspects of human-human and humanmachine communication, develops models of linguistic competence and performance, employs computational frameworks to implement processes incorporating such models, identifies methodologies for iterative refinement of such processes/models, and investigates techniques for evaluating the resultant systems.” [Man98, p. 6]

NLP deals with text [cf. BKL09] and as such borrows techniques from IR, machine learning, or text categorization [MS99, p. xxxi]. In IR, for example, NLP attracted wide attention with respect to better grasping the intent behind natural language queries, and the indexing of unstructured text for better retrieval [LS96]. The principal problem in dealing with natural language texts is the ambiguity of language and the variety of sentence structures. Thus, "a practical NLP system must be good at making disambiguation decisions of word sense, word category, syntactic structure, and semantic scope" [MS99, p. 18].

2.5.6.1 Approaches
NLP entails various processes, each of which relies on different kinds of knowledge sources. A general model of NLP systems was suggested in [Bat95]. Figure 2.20 shows the classes of processes that operate at varying knowledge levels [see also Man98, pp. 30–32] in order to interpret input text. NLP processors apply at the lexical (morphological features), syntactic (word structure), and semantic (meaning) levels, as well as at the levels of domain (known concepts and relations), discourse (sequence of sentences), and pragmatic (purpose or intended goal) knowledge.

Figure 2.20: Processes of an NLP system [from Bat95] (knowledge sources shown: words, lexicon, syntax, semantics, discourse/pragmatics, and domain model, feeding an understanding/search process that produces the outputs)

In Section 2.4 (semantic data interoperability), we have already encountered some NLP techniques. Below, we extend them by briefly summarizing some of the most popular NLP approaches:


• Regular expressions for pattern matching and information extraction [cf. BKL09, pp. 97–106].

• (Pre-)processing steps based on words and sentences like stop word elimination [cf. BKL09, p. 60], tokenization [cf. BKL09, pp. 109–112], lemmatization [cf. BKL09, p. 108], sentence segmentation [cf. BKL09, pp. 112f.], or stemming (e.g. the Lovins [Lov68] and Porter [Por80] stemmers) [cf. BKL09, pp. 107f.].

• Word sense disambiguation (WSD) [Nav09] to find out the intended meaning of a word [cf. BKL09, p. 28].

• Part-of-speech (POS) tagging as a method to categorize words based on the role they play in a sentence (i.e. noun, verb, etc.) [cf. BKL09, pp. 179–189]. POS tagging algorithms are typically based either on stochastic or on rule-based approaches (e.g. the Brill tagger) [Bri79]. By that, it is possible to distinguish the sense of the word limit in "price limit" and "to limit a risk".

• Named entity recognition (NER) to detect interesting named entities in text, such as organizations, persons, or locations [cf. BKL09, pp. 281–283]. On top of that, relation extraction then tries to extract existing relationships between the named entities [cf. BKL09, pp. 283–285].

• A grammar to define the structure of natural language. Given an input sentence, a parser generates a parse tree based on the supplied grammar. A probabilistic context-free grammar is able to produce parse trees with associated probabilities that can be ranked [cf. BKL09, pp. 291–321].

In the course of this thesis, we will not use advanced NLP techniques. It is promising future work to integrate them into our approach.

2.5.6.2 Tools
There exist a number of tools and programming libraries for NLP. Stanford NER86 is an NER tool written in Java. The Natural Language Toolkit (NLTK)87 is a powerful Python framework for NLP. For example, it supports NLP features like POS tagging, implements various classifiers, and comes with an interface to Stanford NER. The library is complemented by an accompanying book about NLP in Python [BKL09].
86 http://nlp.stanford.edu/software/CRF-NER.shtml (accessed on June 20, 2014)
87 http://www.nltk.org/ (accessed on June 20, 2014)
88 https://gate.ac.uk/ (accessed on June 20, 2014)


General Architecture for Text Engineering (GATE)88 [Cun02] is another popular framework that implements several features for text processing, including information extraction and NLP features. Finally, Apache OpenNLP89 is an open-source machine learning library for NLP with support for the most common NLP tasks.
89 https://opennlp.apache.org/ (accessed on June 28, 2014)
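As a brief illustration of the preprocessing steps summarized in the previous section, the following NLTK snippet (our own example sentence; the tokenizer and tagger models must be fetched once via nltk.download()) applies tokenization, Porter stemming, and POS tagging:

import nltk
from nltk.stem import PorterStemmer

sentence = "The price limit restricts how far we can limit the risk."
tokens = nltk.word_tokenize(sentence)

# Porter stemming reduces inflected word forms to their stems
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])

# POS tagging separates the noun in "price limit" from the verb "to limit"
print(nltk.pos_tag(tokens))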

2.5.7 Matchmaking
The term matchmaking is widely used in e-commerce literature and describes the problem of

1. searching for potential matches between supply90 and demand91, and
2. selecting the best fitting object among a set of candidates, even if no perfect match is available [e.g. Hof+ 00; Vei03; AL05; Di+ 03].

Di Noia et al. [Di+ 03] define the matchmaking problem as follows:

"The process of searching the space of possible matches between demand and supplies can be defined as Matchmaking. Notice that this process is quite different from simply finding, given a demand a perfectly matching supply (or vice versa). Instead it includes finding all those supplies that can to some extent fulfill a demand, eventually propose the best ones." [Di+ 03]
90 also: resource description, offer, advertisement
91 also: request, query

While the matchmaking problem has existed ever since people have been trading goods, it has become more complex, challenging, and important since market transactions are taking place in electronic networks, and recently on the WWW [cf. Di+ 03; DR10]. Freuder and Wallace already commented in 1998 on the prospective role of matchmaking, namely that "[i]ntelligent matchmakers can be regarded as a third generation tool for Internet accessibility, where hypertext constitutes the first generation, and search engines the second." [FW98]

2.5.7.1 Characteristics of Matchmaking
Some aspects of matchmaking are often not sufficiently addressed by existing matchmaking approaches. The key characteristics are:

1. Matchmaking is a learning process where the user learns about the option space during search.

2. Matchmaking is an iterative process spanning multiple cycles.


3. Matchmaking is bi-directional, where both buyer and supplier have preferences that they might specify.

4. Matchmaking should support the gradual disclosure of information, i.e. only provide as much information as necessary and only when needed.

5. Matching items are multi-dimensional objects that need to be considered by the matchmaker and the corresponding resource specification language.

6. Matches are rarely perfect and most often partial.

7. Human involvement is key to successful matchmaking, which includes query refinement/relaxation and match explanation.

8. Matchmaking has to deal with multiple, heterogeneous sources and differing modalities.

9. Context (e.g. past queries, location, skills) is a relevant criterion for matchmaking.

10. Matchmaking often needs to deal with dynamic information.

11. Matchmaking should regard temporal constraints, i.e. be responsive to short-term needs.

In the following, we summarize how these characteristics of matchmaking have been tackled in the literature.

Learning Process Matchmaking is a learning process [e.g. Hof+ 00; Vei+ 01; FW98]. Especially for complex products, a multi-phased dialog with intermediate feedback is indispensable [Hof+ 00]. Colucci et al. [Col+ 06] propose gradual refinement of the query in a multi-stage dialog.

Multi-Dimensional Matchmaking Matchmaking is not a one-dimensional but a multi-dimensional problem, where complex request and resource profiles need to be matched. On many Web portals, for example, products are compared according to their prices, while product characteristics have to be compared manually [AL05]. Agarwal and Lamparter suggest a semantic approach to match product descriptions from different providers in e-marketplaces based on multiple criteria [AL05]. In [Vei+ 01; Eit+ 01], the authors treat matchmaking as a multi-dimensional matching problem and suggest an XML-based matchmaking approach (GRAPPA) that is able to match heterogeneous and multi-dimensional items. They apply their matchmaker to the domain of human resources, where they match job applicant profiles to job advertisements.


Human Involvement Many approaches from the literature underrate the importance of human intervention in the matchmaking process. The user should be able to explore the option space [e.g. Hof+ 00; Col+ 06]. Colucci et al. [Col+ 06] suggest a visual approach to matchmaking that lets users explore the marketplace option space. This approach permits setting negotiable characteristics and selecting bonus characteristics that the matchmaker should possibly take into account [Col+ 06].

Trust and Confidence Some authors address the problem of trust and confidence in matchmaking [e.g. Col+ 06; Hof+ 00]. Hoffner et al. [Hof+ 00] propose a staged dialog between electronic marketplace customers and providers where trust builds up gradually, as sensitive information is only disclosed as necessary. Colucci et al. [Col+ 06] show the user a semantic-based explanation of the match degree.

Bidirectional Nature Many matchmaking proposals disregard the bi-directional nature of matchmaking. That is, matchmaking should consider the symmetric exchange of information between customer and provider, in order to supply the provider as well with necessary details about the customer's promises, and not only the customer with information about the supplier [Hof+ 00]. Ragone et al. [Rag+ 08] merge the discovery and negotiation phases into a bilateral matchmaking approach to increase the utility of both transaction partners.

Temporal Aspect Finally, time can be an important criterion for successful matchmaking. One challenge are interrupted sessions. Hoffner et al. [Hof+ 00] noted that a virtual marketplace should provide long-lived sessions that can be interrupted and restarted or resumed, so that a matchmaking process is not limited in time [Hof+ 00]. Another aspect is the duration of the matchmaking process. Stollberg et al. [SHH07] suggested a caching mechanism to improve Web service discovery. The proposal reduces the search space for successive discovery operations [SHH07].

2.5.7.2 Matchmaking and Information Retrieval
Kuokka and Harada [KH96] address the problem of matchmaking in the context of the increasing availability of information on the Web. They propose matchmaking as a possible solution to notify users about information that appears and changes dynamically in a large information environment. They implemented and evaluated two matchmakers, SHared DEpendency Engineering (SHADE) and COmmon INterest Seeker (COINS),


one of which matches on formal representations, while the other matches free text [KH96]. To facilitate matchmaking on the Internet, Sycara et al. specified a language for describing agent capabilities, the Language for Advertisement and Request for Knowledge Sharing (LARKS) [Syc+ 02]. A matchmaking agent based on LARKS was proposed whose matching process supports context, profile, similarity, signature, and constraint matching [Syc+ 02]. Di Noia et al. [Di+ 03] argue that classical IR techniques, as suggested in [KH96] and [Syc+ 99], are not appropriate for matchmaking. IR is mostly about matching free-text queries against document collections with textual content. When considering the vector space model of IR, its limitations with regard to matchmaking can be revealed by an example consisting of a demand "apartment with 2 rooms in soho pets allowed no smokers" and a supply "apartment with 2 rooms in soho no pets smokers allowed" [Di+ 03]. In the vector space, both sentences are represented by two identical term vectors, although their meanings are obviously contradictory.
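This limitation is easy to reproduce (our own sketch): with a simple bag-of-words representation, the two sentences from [Di+ 03] yield exactly the same term vector:

from collections import Counter

demand = "apartment with 2 rooms in soho pets allowed no smokers"
supply = "apartment with 2 rooms in soho no pets smokers allowed"

# identical term frequency vectors despite contradictory meanings
print(Counter(demand.split()) == Counter(supply.split()))  # True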

2.5.7.3 Ranking with Match Degrees
The ranking of matches within matchmaking is multi-dimensional, taking into account multiple attributes that a user might choose from [Col+ 06]. The problem with many matchmaking approaches is that they lack a classification and ranking of matches [Di+ 03]. Di Noia et al. propose a ranking mechanism based on description logics, the family of logical languages that OWL is based on. The ranking of supplies is determined by the type of relation: exact, potential, and partial matches [Di+ 03]. Colucci et al. [Col+ 06] provide a more detailed categorization of matches: exact, full, plug-in, potential, and partial matches [Col+ 06]. Figure 2.21 illustrates the possible match types given a demand D and a supply S [Col+ 06]:

• Exact: D = S (D is equivalent to S)
• Full: D ⊂ S (D is more general than S)
• Plug-in: D ⊃ S (D is more specific than S)
• Potential: D ∩ S ≠ ∅ and D \ S ≠ ∅ and S \ D ≠ ∅ (D is compatible with S)
• Partial: D ∩ S ≠ ∅ and ∃d ∈ D, ∃s ∈ S, where d is incompatible with s (D is not compatible with S)

Figure 2.21: Match classes [based on Col+ 06] (examples for the demand D = {car, red}: exact match with S = {car, red}; full match with S = {car, red, convertible}; plug-in match with S = {car}; potential match with S = {car, convertible}; partial match with S = {car, blue})

The ranking is defined in the order partial → potential → full → exact. Special care is required for the ranking of plug-in matches, because they favor underspecified supplies [Col+ 06]. It must be guaranteed that more specific resource descriptions are not treated as inferior to unfairly generic or less specific descriptions. Imagine a demand for a red car, D = {car, red}, and supplies S as in Figure 2.21. It is easy to see that by revising the request, the match classes can change. E.g., a refinement to D = {car, red, convertible} would turn a previous exact match (S = {car, red}) into a plug-in match, whereas a relaxation to D = {car} transforms it into a full match. Buyer and seller preferences in e-marketplaces are often vague [e.g. Rag+ 08; AL05]. Consider a buyer looking for a car that costs less than 15,000 Euros and has a mileage of 50,000 kilometers. Now, a seller offers the car for 15,500 Euros and a mileage of 45,000 kilometers. This, in fact, would be an interesting offer. Ragone et al. [Rag+ 08] propose a fuzzy matchmaking solution based on description logics that allows for conditional preferences by supporting hard and soft constraints. Agarwal and Lamparter [AL05] present an intelligent matchmaking portal where fuzzy user requests are relaxed to interval queries that are further matched against a marketplace. Stuckenschmidt and Kolb [SK08] suggest a partial matchmaking approach that uses approximate subsumption to match complex product descriptions. Based on this partial matchmaking approach, Nowakowski and Stuckenschmidt [NS10] present a demonstrator where a user can gradually relax constraints in a product catalog based on eClassOWL.
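The set-based reading of the match classes above can be sketched as follows (our own simplified illustration of the categorization from [Col+ 06], not their description-logics formalization; the incompatibility relation is approximated by an explicit, hypothetical conflict set):

CONFLICTS = {frozenset({"red", "blue"})}  # hypothetical incompatible feature pairs

def match_class(D, S):
    if any(frozenset({d, s}) in CONFLICTS for d in D for s in S):
        return "partial"    # D contains a feature incompatible with one in S
    if D == S:
        return "exact"
    if D < S:
        return "full"       # D is a proper subset of S (D is more general)
    if D > S:
        return "plug-in"    # D is a proper superset of S (D is more specific)
    return "potential"      # compatible, with unmatched features on either side

D = {"car", "red"}
print(match_class(D, {"car", "red"}))                 # exact
print(match_class(D, {"car", "red", "convertible"}))  # full
print(match_class(D, {"car"}))                        # plug-in
print(match_class(D, {"car", "convertible"}))         # potential
print(match_class(D, {"car", "blue"}))                # partial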

2.5.7.4 Related Research Fields and Application Areas
In fact, the problem of matchmaking can be found in many research fields and application domains outside the narrower context of business. Its interdisciplinary relevance has led to a rich body of work on matchmaking from various angles, like

• intelligent, agent-based systems [e.g. Syc+ 99; KH96],


• Web service discovery [e.g. Pao+ 02; LH03; GTB01; Sto+ 07; Wei+ 11],
• human resource matching (e.g. match job profiles with applicants) [e.g. VWM02],
• Grid resource matching (e.g. allocate computational resources for tasks) [e.g. RLS98; Har+ 04; LR05; Ama+ 09] and Cloud service discovery [e.g. Zar+ 13],
• real estate matching [e.g. Poo+ 11],
• recommendations in multiplayer games (e.g. suggestion of proper competitors) [e.g. JJD11; Man+ 11],
• genetic matchmaking (e.g. mating in biology) [e.g. Wed+ 95], or
• marital matchmaking (i.e. finding a spouse) [e.g. BD08].

3 Data Collection

3.1 Problem Statement . . . 126
3.2 State of the Art and Related Work . . . 129
    3.2.1 Approaches for the Collection of Structured Data . . . 129
        3.2.1.1 Standalone Crawlers . . . 129
        3.2.1.2 Embedded Crawlers . . . 129
        3.2.1.3 Extractors . . . 130
    3.2.2 Web Data Commons . . . 130
3.3 Sweet-Spot Deep Crawling Approach . . . 133
    3.3.1 Ping-based Discovery of Relevant Web Sites . . . 133
    3.3.2 Crawling Strategy and Implementation . . . 134
        3.3.2.1 Sitemap-based Crawling . . . 136
        3.3.2.2 Spider-based Crawling . . . 138
    3.3.3 Extraction of Structured Data . . . 138
        3.3.3.1 Identification . . . 139
        3.3.3.2 Politeness . . . 139
        3.3.3.3 Storage . . . 140
3.4 Evaluation and Analysis . . . 140
    3.4.1 Method . . . 140
    3.4.2 Results . . . 141
        3.4.2.1 Shop Statistics . . . 141
        3.4.2.2 Property Statistics . . . 142
    3.4.3 Comparison with Web Data Commons . . . 145
        3.4.3.1 Quantitative Comparison of Entities in WDC and GRC . . . 146
        3.4.3.2 Quantitative Comparison of Structured Data in Web Shops . . . 146
3.5 Conclusion . . . 150

In recent years, the publication of structured data inside the Hypertext Markup Language (HTML) content of Web sites has become a mainstream feature of commercial Web sites. In particular, e-commerce sites have started to add Resource Description Framework in Attributes (RDFa) or Microdata markup based on schema.org [SchND] and GoodRelations [Hep08a].


For many potential usages of this huge body of data, we need to crawl the sites and extract the data from the markup. Unfortunately, a lot of markup can be found in very deep branches of the sites, like product detail pages. Such pages are difficult to crawl because of their sheer number and because they often lack links pointing to them, which means they cannot be found by a spider-based crawling approach [e.g. BP98; CC03; Ara+ 01]. In this chapter, we analyze the approach taken by the Web Data Commons (WDC) initiative, propose an alternative, ping-based crawling strategy that focuses on the deep detail pages of e-commerce Web sites, and compare the results from our crawl of 2,628 shops with the WDC corpus, in particular with regard to the amount of data and vocabulary usage. We provide evidence that popular Web crawlers like Common Crawl fail to detect most of the product detail pages that hold a majority of the data, and show that the statistics of structured data in Web shops from our deep Web crawl differ significantly from the WDC dataset. Our crawl is used in the following chapters as data for our prototype.

3.1 Problem Statement
Today, ever more vendors are offering and selling their products on the Web. The spectrum of online vendors ranges from a few very large retailers like Amazon1 and Best Buy2 featuring thousands of goods to many small Web shops with much smaller assortments. Consequently, the amount of product information available online is very significant. While product descriptions on the Web were mostly unstructured in the past, the situation has meanwhile changed. Numerous online shops have started to expose product offers using structured data markup in RDFa, Microdata, or Microformats. Repeated statistics of the Common Crawl corpus generally affirm this ongoing trend [MPB14] (see also Section 3.2.2 for more details). To some extent, the semantic annotation of product offers on the Web has been promoted by several search engine operators that offer tangible benefits and incentives to Web shop owners. For instance, Goel et al. publicly declared in 2009 via the Google Webmaster Central blog:
1 http://www.amazon.com/ (accessed on October 19, 2015)
2 http://www.bestbuy.com/ (accessed on October 19, 2015)


"Today, we're announcing Rich Snippets, a new presentation of snippets that applies Google's algorithms to highlight structured data embedded in web pages. [...] When searching for a product or service, users can easily see reviews and ratings [...] our experiments have shown that users find the new data valuable – if they see useful and relevant information from the page, they are more likely to click through." [GGH09]

Similarly, Microsoft has put it on the Bing Ads Web site: “Drive traffic to your sites for free with Bing Rich Captions. When customers search and your product pages appear in the organic search results, the price and ratings information may appear below the blue links in the search results. These details help potential customers make informed decisions about the organic results they want to click on and may move them closer to a purchase decision.” [MicND ]

To summarize, search engines claim that additional data granularity can increase the visibility of single Web pages in the form of rich snippets (or rich captions, in Bing terminology) displayed on search engine results pages (SERPs). Early attempts with enhanced result snippets were conducted in the SearchMonkey project [Mik08], where the potential benefits of this technique had already been laid out, namely increased and higher-quality traffic for Web site owners, and a better search experience for users [Mik08]. Haas et al. have likewise surveyed a general preference of users for enhanced search results and have measured a higher click-through rate (CTR) [Haa+ 11]. In addition to those immediate effects for publishers and consumers, structured data can provide useful relevance signals that search engines can take into account in their ranking algorithms for the delivery of more appropriate search results. These generic prospects obviously hold for the concrete case of structured product data as well. Unfortunately, for regular data consumers other than Google or Bing3, it is difficult to make use of the wealth of product information on the Web. There are at least two problems that arise in this context:

1. Crawling the Web for structured product offers is very resource-consuming due to the number of relevant pages.

2. The data granularity of product offers supplied by Web shops is relatively low.

As problem 1 suggests, an obvious challenge is related to crawling the Web. The sheer size of the Web and its dynamics make it infeasible to conduct an exhaustive crawl that would entail a comprehensive and current snapshot of all product information published by Web shops. Due to scarce crawling resources, established crawling strategies usually prioritize the visiting of Web pages. For example, they take into account the contents of sitemaps [SS09] or the page load time [SC10; cf. Cas+ 04], or they honor the link structure of Web pages (i.e. PageRank [Pag+ 98; BP98]) [SH15c; cf. MPB14; MMB14]. Thus, from a crawler point of view, landing pages of Web sites constitute the more popular part of the Web, whereas product item pages generally are of lower priority.
3 Popular search engines are often directed to Web pages proactively by shop owners.


On that account, we claim that popular crawling strategies often fail to reach the product detail pages where the interesting product data resides. This problem was publicly debated in a World Wide Web Consortium (W3C) mailing list thread from 20124, where concerns were raised that the usual way of crawling the Web, as performed by Common Crawl, misses a major part of the relevant product item pages. We have shown in [SH15c] that Common Crawl systematically misses a large share of the deep detail pages of Web sites, and should thus not be thought of as comprising a representative sample of the Web for e-commerce. Problem 2 refers to the issue that structured product information on the Web is generally not sufficient for deep product comparison, because its level of detail is very limited. The majority of structured product data on the Web reflects only very basic commercial details of product offers. A likely reason is that the immediately beneficial effects of adding structured data to major search engine results (like Google rich snippets) do not require sophisticated annotations of products – which, undoubtedly, would come at the cost of the simplicity of publishing structured data on the Web. To be valuable for regular data consumers, though, unstructured product data would need to be cleansed and lifted to a higher data granularity (e.g. using natural language processing (NLP) techniques or machine learning [cf. PBB14]), or existing structured data would need to be enriched via additional data sources (see Chapters 4 and 5). For testing the aforementioned assumptions, we use a "positive list" of e-commerce sites with data markup, which we gained from a site registration service that is pinged by many popular extension modules for shop software applications. We then propose a crawling strategy for e-commerce, conduct a deep Web crawl based on our seed list, and analyze the quantity and structure of the extracted data. Finally, we determine the overlap of our crawl with the WDC corpus, whereby we challenge the representativeness of Common Crawl with respect to e-commerce data on the Web. Our goal is first to provide a corpus of data for our further work, but also to understand the usability of the WDC corpus for real-world applications in the e-commerce domain. The rest of this chapter is structured as follows: Section 3.2 presents related work on Semantic Web crawlers and structured data statistics; in Section 3.3, we detail our deep crawling approach; in Section 3.4, we analyze our crawl and compare our work with the WDC dataset; and Section 3.5 concludes the chapter.
4 http://lists.w3.org/Archives/Public/public-vocabs/2012Mar/0095.html (accessed on July 15, 2014) and http://lists.w3.org/Archives/Public/public-vocabs/2012Apr/0016.html (accessed on July 15, 2014)



3.2 State of the Art and Related Work
In this section, we present current approaches for the collection of Semantic Web data and existing statistics about structured data on the Web.

3.2.1 Approaches for the Collection of Structured Data
Harvesting structured content from the Web poses specific challenges to Web crawlers, spiders, or sniffers, e.g. in terms of link traversal across resources and the indexing of structured content in heterogeneous formats from various data sources [HUD06]. Various approaches have been suggested for collecting structured data from the Web. Hereinafter, we distinguish between standalone crawlers [e.g. Dod06; Ise+ 10; MMB14], crawlers as part of semantic search engines [e.g. Din+ 04; Ore+ 08; Hog+ 11; DM11], and extractors [e.g. MPB14]. Even though we tried to cover the most prominent and widely used tools, it is beyond the scope of this work to offer a comprehensive survey of Semantic Web crawlers.

3.2.1.1 Standalone Crawlers
Some popular standalone Semantic Web crawlers are Slug [Dod06] and LDSpider [Ise+ 10]. Both are configurable crawling frameworks targeted at harvesting Semantic Web data. In particular, they follow semantic links (i.e. RDF links between resources), extract structured data, and store metadata about visited pages. Meusel et al. [MMB14] describe Anthelion, a focused crawler targeted at structured data in Web pages that uses an intelligent selection strategy based on a combination of an online classifier and a trade-off approach between exploitation and exploration of Web page candidates. As an alternative to dedicated Semantic Web crawlers, general-purpose Web crawling frameworks like Scrapy5 can be configured to crawl the Web for structured data, e.g. to scrape Web pages for RDFa and Microdata content.
5 http://scrapy.org/ (accessed on August 6, 2014)
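As a hedged example of the latter option, a minimal Scrapy spider could report the Microdata item types it encounters while following links within one site (our own sketch, not the crawler used later in this thesis; the seed URL and domain are placeholders):

import scrapy

class MicrodataSpider(scrapy.Spider):
    name = "microdata"
    allowed_domains = ["example-shop.com"]      # hypothetical shop domain
    start_urls = ["https://example-shop.com/"]  # hypothetical seed

    def parse(self, response):
        # report every Microdata item type found on the page
        for itemtype in response.css("[itemtype]::attr(itemtype)").getall():
            yield {"url": response.url, "itemtype": itemtype}
        # follow links to continue the crawl
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)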

3.2.1.2 Embedded Crawlers
Semantic search engines typically extract and index metadata from various data sources, e.g. annotated Web pages, published Resource Description Framework (RDF) files, or online RDF repositories.


Examples of popular semantic search engines are Swoogle [Din+ 04], Sindice [Ore+ 08], SWSE [Hog+ 11], or Watson [DM11]. On top of this, the crawling components of major commercial search engines that have recently announced to process structured data could likewise be counted among this kind of crawler (e.g. Google, Bing, Yahoo!, or Yandex).

3.2.1.3 Extractors
Extractors are software components that aim at extracting relevant information from previously conducted crawls or other data sources. In other words, extractors do not crawl autonomously. For example, WDC [e.g. MPB14] is a project that focuses on extracting structured data in RDFa, Microdata, and Microformats from the Common Crawl corpus. Similarly, the Virtuoso Sponger6 can be prompted to derive RDF data from a variety of different data sources, such as Web application programming interfaces (APIs), comma-separated values (CSV) files, or annotated Web pages.
6 http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtSponger (accessed on May 26, 2014)

3.2.2 Web Data Commons
Various studies indicate a recent increase of structured data markup on the Web. Based on a Bing crawler sample from early 2012, Mika and Potter reported a significant ratio of over 30% of the Web pages in their collection featuring structured data markup in a broad Web crawl [MP12]. The findings in the context of the Web Data Commons (WDC) project show similar results. WDC is an effort by a research group headed by Christian Bizer; they regularly extract metadata from the Common Crawl corpus. Common Crawl is a project that aims at making Web content accessible to the public for research purposes and practical applications alike, so that people and organizations do not have to conduct costly crawls by themselves. A study based on a Common Crawl corpus from February 2012 revealed that about 12% of all HTML pages already contain structured data markup (in contrast to 6% in 2010) [MB12]. For a corpus of November 2013, these figures are even higher, at 26% [MPB14; WebND d]. Figure 3.1 illustrates how RDFa and Microdata have evolved throughout three different WDC datasets from 2012 to 2014 [WebND a; WebND d; WebND c]. Figure 3.1a shows the share of the formats with respect to the domains (or Web sites), whereas Figure 3.1b outlines it relative to all extracted entities. Although there was significant growth of structured data (Microdata in particular) from 2012 to 2013, the increase in 2014 slowed down.


Figure 3.1: Distribution of syntaxes in the Web Data Commons: (a) distinct domains with structured data; (b) number of entities from structured data (each panel plots the percentages for Microdata and RDFa across the Common Crawl corpora 2012-08, 2013-11, and 2014-12)

Table 3.1 gives a detailed overview of the main statistics of the three WDC datasets [WebND b; WebND a; WebND d]. With regard to the total number of crawled domains and entities, the amount of structured data even decreased a bit from 2013 to 2014 (see Figure 3.1). Nonetheless, it is interesting to see that Microdata has outperformed RDFa, which is likely due to Microdata being perceived as the simpler data format, as well as the pushing by search engines, which have long indicated a slight preference towards Microdata. Figure 3.2 shows the number of entities per domain based on the numbers from Table 3.1.

Table 3.1: Structured data in the Web Data Commons [based on WebND a; WebND d; WebND c]

                          August 2012                November 2013              December 2014
                          Domains     Entities       Domains     Entities       Domains     Entities
Total markup            2,286,277   1,811,471,956   1,779,935   4,264,562,758   2,722,425   5,516,068,263
Format   RDFa             519,379     188,243,535     471,406     436,100,210     571,581     405,541,283
         Microdata        140,312     266,169,151     463,539   1,964,777,851     819,990   2,209,497,281
Class    gr:Offering        1,342         371,864       2,199         498,825       2,196         440,403
         s:Offer            8,456      13,725,226      35,635     154,407,699      62,849     236,952,507

With respect to markup for product offers, the results are similarly promising as the overall growth of structured data markup. In the analysis of the Common Crawl corpus from August 2012, Bizer et al. [Biz+ 13] detected 1,342 sites with structured product offers in RDFa (gr:Offering entities), and 8,465 sites with structured product offers in Microdata syntax (s:Offer entities). This amounts to 0.26% of the sites containing markup in RDFa and 6.03% in Microdata, respectively [Biz+ 13]. A more recent extract from the Common Crawl corpus of December 2014 further reports 2,196 sites (0.38%7) with gr:Offering entities in RDFa, and 62,849 sites (7.66%8) with s:Offer entities in Microdata.

7 domains(gr:Offering) / domains(RDFa) = 2,196 / 571,581 = 0.38%
8 domains(s:Offer) / domains(Microdata) = 62,849 / 819,990 = 7.66%

[Figure 3.2: Average number of entities per domain in Common Crawl corpora (log-scaled y-axis); series: Microdata, RDFa; x-axis: Common Crawl corpus 2012-08, 2013-11, 2014-12]

[Figure 3.3: Share of structured product offer data with respect to domains with RDFa (for gr:Offering) and Microdata (for s:Offer) in the Web Data Commons from 2012–2014]

The respective trends are depicted in Figure 3.3. Figure 3.4 relates the share of structured product offers to the total amount of structured data (RDFa, Microdata, and Microformats) in the WDC. In particular, it shows the share of gr:Offering and s:Offer entities at domain level (see Figure 3.4a) and entity level (see Figure 3.4b) relative to all structured data found in the Common Crawl corpora from 2012, 2013, and 2014 [WebND a; WebND d; WebND c]. According to Figure 3.4a, more than two percent of the domains with structured data contained product offer data in 2013 and 2014, as opposed to half a percent in 2012. Similarly, as per Figure 3.4b, the amount of structured e-commerce data increased fivefold (in relative figures) between the 2012 and 2013 datasets.

[Figure 3.4: Share of structured product offer data with respect to all structured data markup in the Web Data Commons from 2012–2014; (a) percentage of domains, (b) percentage of entities; series: gr:Offering, s:Offer]

To sum up, a potential bias in these statistics cannot be ruled out (e.g. due to the varying composition of the crawl samples), but multiple independent studies have provided more than anecdotal evidence to conclude that structured e-commerce data on the Web is on the rise.

3.3 Sweet-Spot Deep Crawling Approach

In this section, we present a focused, depth-first Web crawler for structured e-commerce data. A focused Web crawler is capable of quickly collecting pages of a specific topic [CvBD99]. In our case, the crawler focuses on the product detail pages of e-commerce Web sites that contain descriptions of product offers in GoodRelations. Using a depth-first crawling strategy permits us to quickly reach the deep levels of Web sites that otherwise would often not be judged relevant by search algorithms. This is in contrast with breadth-first crawling, which aims at quickly finding the most relevant pages according to a high link density [cf. NW01].

3.3.1 Ping-based Discovery of Relevant Web Sites

A central question for every Web crawler is where to start the crawling process. Usually, an initial list of seed Uniform Resource Identifiers (URIs) is provided that can be extended or updated continuously as the crawl progresses. We prepared such an initial seed list for our GoodRelations crawl. Our primary data sources come from a list of URIs that we collected using a central registry component and notification service for GoodRelations-empowered Web pages and Web shops, namely

GR-Notify9. Over time, the list of GoodRelations resources contained in the registration service has been augmented by ping submissions from the following channels:

• Notifications by popular open source Web shop extensions, e.g. for Magento, Joomla/VirtueMart, or PrestaShop. Some of them submit the shop URI autonomously after successful installation, others ask shop owners to manually register their shop URI via form-based submission.

• Log file analysis of our tool chain (e.g. the GoodRelations Snippet Generator10), and crowdsourcing approaches like Grome11, a Google Chrome browser plug-in that pings GR-Notify whenever a user with the active browser plug-in visits a page containing GoodRelations. The idea is similar to the one presented in [SM12].

• Human users pinging the registry service themselves via a form-based user interface for URI submission, e.g. implementers that followed the GoodRelations Quickstart Guide12.

Furthermore, we added selected data sources from a manually maintained list of Web shops that we are aware of exposing GoodRelations data13. This includes e-commerce service providers (e.g. www.rakuten.de, formerly www.tradoria.de), large retailers (e.g. www.sears.com), and small Web shops.

3.3.2 Crawling Strategy and Implementation

Our crawler extracts data from Web pages annotated with GoodRelations markup in RDFa and Microdata, including data about product offers, product instances, and product models [cf. Hep08a]. Furthermore, the crawler accepts the following types of input URIs (raw and compressed using gzip): XML sitemaps (or sitemap indexes for larger shops) [Sit08], single item pages, product category pages, and shop main pages. The implementation of our crawler was realized as a Python module with parallelization, i.e. several domains can be crawled concurrently. The overall crawling task is split into smaller chunks that are individually distributed to a pool of processes14. Each process manages its own process stack, i.e. a distinct area of memory allocated to the program code and the variables of the respective process. This ensures the autonomy of every process.

9 http://gr-notify.appspot.com/ (accessed on May 22, 2014)
10 http://www.ebusiness-unibw.org/tools/grsnippetgen/ (accessed on May 22, 2014)
11 http://www.stalsoft.com/grome (accessed on May 22, 2014)
12 http://wiki.goodrelations-vocabulary.org/Quickstart (accessed on August 19, 2014)
13 http://wiki.goodrelations-vocabulary.org/Datasets (accessed on July 31, 2014)
14 On our machine, we spawned 32 processes, which is twice the number of central processing unit (CPU) cores.

Algorithm 1: Crawling algorithm
Input: List of seed URIs (S)
 1  for uri ∈ S do                                  // parallelized execution
 2      extract the domain part from uri
 3      check_robots(domain)
 4      if found sitemap URIs (M) in robots.txt then
 5          for sitemap_uri ∈ M do
 6              read_sitemap(sitemap_uri, domain)
 7          end
 8      else
 9          if crawling of domain is allowed then
10              crawl(uri, domain)
11          end
12      end
13  end

The general steps of the crawling process are as follows:

1. Load and compile the list of seed URIs from text files stored in a local folder and filter out duplicate URIs.

2. Allocate a pool of worker processes.

3. Start a parallel crawl and continually assign one seed URI at a time to an idle worker process until finished.

The flowchart in Figure 3.5 depicts the crawling algorithm in detail. Algorithm 1 formalizes the main program flow. The algorithm starts with a list of seed URIs, each of which is assigned a free process from the process pool. If there is no idle process available (e.g. because there are more seed URIs than processes), then the URI stays in the pipeline (implemented as a list of URIs) until it becomes eligible for processing. Lines 2–13 describe the code section that every process needs to run through before being ready to accept a new task from the seed list. The details of the crawling algorithm are as follows (a condensed sketch of the dispatch loop is given below):

1. Try to fetch robots.txt from the root directory [cf. Kos07; Kos95]. If successful, adhere to the politeness policy and check if Sitemap: directives with references to sitemap files exist [cf. Sit08]. Otherwise, try to locate sitemap.xml in the root directory of the Web site.

2. From here on, the algorithm branches into one of two possible paths, based on whether a sitemap file was detected or not, as outlined in Figure 3.5.
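The following minimal Python sketch illustrates how such a process-pool dispatch can be realized with the standard library; the worker process_seed and the file name seeds.txt are hypothetical placeholders, not the thesis's actual module:

import multiprocessing
from urllib.parse import urlparse

def process_seed(uri):
    # Placeholder for steps 1-2 above: check robots.txt, then enter
    # sitemap-based or spider-based crawling for this seed URI.
    domain = urlparse(uri).netloc
    print("crawling", domain)

if __name__ == "__main__":
    with open("seeds.txt") as f:                      # seed list from local text files
        seeds = list(dict.fromkeys(l.strip() for l in f if l.strip()))  # de-duplicate
    with multiprocessing.Pool(processes=32) as pool:  # 2x the 16 CPU cores (footnote 14)
        pool.map(process_seed, seeds)                 # idle workers pick up pending URIs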

3.3.2.1 Sitemap-based Crawling

When crawling a sitemap, a process extracts metadata from the list of URIs provided by the sitemap file (see Function read_sitemap). Besides sitemaps15, the function also handles sitemap index files, which are collections of sitemaps [Sit08]. A heuristic distinction between sitemaps and sitemap index files can be made based on the Extensible Markup Language (XML) root element names (sitemapindex versus urlset). Once the algorithm reaches a sitemap, the function stops the recursion and starts to extract metadata iteratively from the linked Web page URIs. Although not shown in the code snippet, read_sitemap is able to extract gzip-compressed sitemaps, it defines an upper threshold to prevent visiting too many pages with no metadata, and it adheres to crawl delays. As an aside, our approach is currently limited to discovering only those pages explicitly listed in a sitemap. Hence, a partial sitemap pointing at category pages or a subset of the available product item pages yields an incomplete crawl.

Function read_sitemap
Input: sitemap_uri, domain
 1  if crawling of domain is allowed then
 2      read sitemap_uri contents and store found URIs in U
 3      if is sitemap index file then
 4          for uri ∈ U do                  // recursion
 5              read_sitemap(uri, domain)
 6          end
 7      else
 8          for uri ∈ U do
 9              extract_metadata(uri, domain)
10          end
11      end
12  end

15 http://www.sitemaps.org/ (accessed on August 21, 2014)
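A hedged Python sketch of the root-element heuristic mentioned above, using only the standard library (the helper name parse_sitemap is ours):

import gzip
import xml.etree.ElementTree as ET

def parse_sitemap(raw_bytes):
    if raw_bytes[:2] == b"\x1f\x8b":            # gzip magic number
        raw_bytes = gzip.decompress(raw_bytes)
    root = ET.fromstring(raw_bytes)
    kind = root.tag.rsplit("}", 1)[-1]          # strip the sitemap XML namespace
    locs = [el.text for el in root.iter() if el.tag.endswith("loc")]
    return kind, locs   # "sitemapindex": recurse into sitemaps; "urlset": extract metadata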

[Figure 3.5: Flowchart of the crawling algorithm; sitemap-based and spider-based crawling parts, with parallelized paths highlighted; legend: S = list of seed URIs, M = list of sitemap URIs, U = list of URIs]

3.3.2.2 Spider-based Crawling

Our spider-based crawling approach, as indicated in Function crawl, performs a depth-first crawl. It starts by extracting metadata from a Web page and collects all outgoing links. The function then enters a recursion and crawls all discovered URIs unless a configurable maximum crawl depth is reached. Outgoing links that either have been visited before or pertain to a different domain than the one the current process is in charge of are skipped.

Function crawl
Input: uri, domain, [depth=0], MAX_CRAWL_DEPTH
 1  content ← extract_metadata(uri, domain)
 2  if content then
 3      read content, look for anchor links, and store found URIs in U
 4      filter outgoing links (U) pertaining to the same domain as domain
 5      filter outgoing links (U) that have not been visited yet
 6      if depth < MAX_CRAWL_DEPTH then
 7          for uri ∈ U do                  // recursion
 8              crawl(uri, domain, depth+1)
 9          end
10      end
11  end
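A condensed Python rendition of Function crawl under stated assumptions: extract_metadata is a hypothetical stand-in for the thesis's extractor, the visited set is shared per process, and the anchor extraction is deliberately naive:

import re
from urllib.parse import urljoin, urlparse

MAX_CRAWL_DEPTH = 3
visited = set()

def extract_metadata(uri, domain):
    # Hypothetical stand-in: would fetch the page, extract GoodRelations
    # markup, write N-Triples, and return the page HTML (or None).
    return None

def crawl(uri, domain, depth=0):
    content = extract_metadata(uri, domain)
    if not content:
        return
    links = []
    for href in re.findall(r'href="([^"]+)"', content):
        target = urljoin(uri, href)
        if urlparse(target).netloc == domain and target not in visited:
            visited.add(target)                 # same domain, not yet visited
            links.append(target)
    if depth < MAX_CRAWL_DEPTH:
        for target in links:
            crawl(target, domain, depth + 1)    # recursion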

3.3.3 Extraction of Structured Data

Finally, the function extract_metadata, invoked by both the read_sitemap and crawl functions, harvests metadata content and serializes the extracted triples to N-Triples files. The detection of structured markup within an HTML page is implemented using regular expressions. The regular expressions for matching RDFa and Microdata are as follows:

RDFa:

# matches characteristic strings such as
# - xmlns:gr="http://purl.org/goodrelations/v1#" or
# - prefix="gr: http://purl.org/goodrelations/v1#"
regex_rdfa = re.compile(
    "(xmlns:([a-z0-9]+)\s*=\s*[\"']{0,1}http://purl\.org/goodrelations/v1#[\"']{0,1}"
    "|prefix\s*=\s*[\"']{0,1}[^\"']*([a-z0-9]+)\s*:\s*http://purl\.org/goodrelations/v1#[^\"']*[\"']{0,1})",
    re.IGNORECASE|re.DOTALL)

Microdata:

# matches characteristic strings such as
# - itemtype="http://purl.org/goodrelations/v1#SomeItems" or
# - itemprop="http://purl.org/goodrelations/v1#name"
regex_md = re.compile(
    "item[a-z]+\s*=\s*[\"']{0,1}http://purl\.org/goodrelations/v1#[a-z0-9\-\_]+",
    re.IGNORECASE|re.DOTALL)

The Python RDFLib library further takes care of parsing the content and loads the extracted triples into an in-memory graph, which is later serialized to N-Triples. The algorithm is able to decompress (or, more specifically, gunzip) pages ending in ".gz" or carrying the header "Content-Encoding: gzip" [cf. FR14d, Section 3.1.2.2].
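A hedged usage sketch that ties these pieces together: fetch a page, transparently gunzip it if needed, and test for GoodRelations markup with the regular expressions regex_rdfa and regex_md defined above (the real crawler's URL handling may differ in detail):

import gzip
import urllib.request

def fetch_page(uri):
    req = urllib.request.Request(uri, headers={"User-Agent": "python-grcrawler/"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = resp.read()
        if uri.endswith(".gz") or resp.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)        # urllib does not gunzip automatically
    return body.decode("utf-8", errors="replace")

def has_goodrelations(html):
    return bool(regex_rdfa.search(html) or regex_md.search(html))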

3.3.3.1 Identification

The crawler identifies itself in the Hypertext Transfer Protocol (HTTP) request header with the user agent field [cf. FR14d, Section 5.5.3]:

python-grcrawler/ (http://wiki.goodrelations-vocabulary.org/Tools/GRCrawler)

This way, every Web site owner is able to contact us at any time with questions regarding our crawler.

3.3.3.2 Politeness

A Web site owner can control the interaction behavior of our focused crawler with a robots.txt file [cf. Kos07]. Our crawler obeys all important robots.txt directives. That means it abandons sites if it is not allowed to crawl them, skips directories that are explicitly excluded for proprietary crawlers, and avoids overloading servers by respecting the indicated crawl delay. For instance, a crawl delay of one second, and thus a maximum of 60 requests per minute, can be obtained as follows:

User-agent: python-grcrawler
Disallow:
Crawl-delay: 1

For those Web sites that lack a robots.txt file we apply our own politeness policy, i.e. we set the crawl delay to a default value of five seconds. The architecture of the crawler further ensures that no site is simultaneously hit by more than one process, as the policy constraints would otherwise be violated.
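A hedged sketch of these politeness checks using Python's standard library; DEFAULT_DELAY mirrors the five-second default from the text, and the helper name is ours:

import urllib.robotparser

DEFAULT_DELAY = 5.0

def robots_policy(domain, agent="python-grcrawler"):
    rp = urllib.robotparser.RobotFileParser("http://%s/robots.txt" % domain)
    try:
        rp.read()                                   # a missing robots.txt allows everything
    except OSError:
        return (lambda uri: True), DEFAULT_DELAY    # unreachable host: our own policy
    delay = rp.crawl_delay(agent) or DEFAULT_DELAY  # None if no Crawl-delay directive
    return (lambda uri: rp.can_fetch(agent, uri)), delay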

3.3.3.3 Storage

The purpose of our crawler is to find and store contents of Web pages that contain GoodRelations data. Up to now, we are able to extract GoodRelations content encoded as RDFa and Microdata16. At the moment, we temporarily store all structured content we could gather in N-Triples files that are later uploaded to a private SPARQL endpoint for our research.

3.4 Evaluation and Analysis

In the following, we report statistics on the dataset obtained in late 2011/early 2012 by conducting a focused Web crawl relying on the aforementioned crawling strategy.

3.4.1 Method

We applied our focused Web crawl to a variety of data sources mentioned in Section 3.3.1, namely (1) entries of a central URI submission service17, where shop extensions and shop operators could register their shop URIs, and (2) selected data sources from a manually maintained collection of datasets18. We spidered Web sites up to a maximum crawl depth of three hops. Some of the bigger datasets thereby obtained were not crawled entirely. For example, we could not conduct an exhaustive crawl of a large online retailer like wayfair.com, thus we stopped the crawling at some point. Similarly, huge datasets like sears.com or bestbuy.com were not considered due to resource limitations19.

We stored the crawl data in an RDF store that exposes a SPARQL Protocol and RDF Query Language (SPARQL) endpoint. More exactly, we loaded it into a Virtuoso Open Source instance, which has some important characteristics that affected our choice. It provides the necessary scalability in order to load large amounts of structured data into a database. It is fault-tolerant with respect to imperfect data, where other RDF stores would abort the loading process with an error message. And finally, it supports executing SPARQL queries to comfortably access selected data that we need for our analysis. For the statistics, we took advantage of the excellent Pandas20 library for data analysis with the Python programming language [McK12], together with the IPython Notebook interactive development environment [cf. PG07] that encourages literate programming [cf. Knu84].

16 In fact, we added support for schema.org in Microdata and RDFa in the meantime, which, however, we did not yet support at the time of the crawl.
17 http://gr-notify.appspot.com/ (accessed on May 22, 2014)
18 http://wiki.goodrelations-vocabulary.org/Datasets (accessed on July 31, 2014)
19 Crawling all ∼15 million pages from sears.com would take roughly 174 days with a crawl delay of one second.
20 http://pandas.pydata.org/ (accessed on July 30, 2014)
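A hedged sketch of the kind of per-shop analysis this setup enables; the file and column names are ours, not the thesis's:

import pandas as pd

# One row per shop with its offer count, e.g. exported from the SPARQL endpoint
shops = pd.read_csv("offers_per_shop.csv")        # columns: shop, num_offers
print(shops["num_offers"].describe())             # count, mean, quartiles, max
print(shops["num_offers"].quantile([0.25, 0.5, 0.75]))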

3.4.2 Results

Overall, our focused crawl yielded 20 GB of raw structured e-commerce data (i.e. files in N-Triples syntax) with a total of 188,765,183 triples.

3.4.2.1 Shop Statistics

Our dataset reflects information from 2,628 shops that expose structured e-commerce data. Among them, 2,314 shops provide product offers contributing to a total of 3,197,130 offerings (see Table 3.2 in Section 3.4.2.2 for the full statistics). Consequently, a shop in our dataset consists of 1,382 offers on average. The distribution of the numbers of items offered by the Web shops follows a power law, i.e. a small number of Web shops account for a large quantity of products, whereas the majority of the shops offer only very few items. For better readability, we used a logarithmic-scaled instead of a linear-scaled y-axis in Figure 3.6 to convey this information. Accordingly, about 500 shops offer 1,000 or more products, and over 800 shops offer less than 100 products. The first quartile (25%) is at 37, the median (50%) at 145, and the third quartile (75%) at 530 product offers, as shown in Figure 3.7. The vertical lines at the edges in Figure 3.7 represent the whiskers, which are defined as 1.5 times the interquartile range (1.5 × (75% − 25%)). The maximum number of product offers gathered for a shop was 174,487 (not visible in Figure 3.7 due to outlier correction), and the minimum was one item. Figure 3.8 shows the ten shops with the most product offers in our crawl dataset. We assigned to each shop a distinct Uniform Resource Name (URN) identifier that we generated based on the domain names of the Web shops, e.g. www.example.org resulted in urn:www.example.org.

[Figure 3.6: Distribution of items per shop (log-scaled y-axis); x-axis: shops (0–2000), y-axis: number of items (10^0–10^6)]

[Figure 3.7: Boxplot of the distribution of items per shop; quartile marks at 37.0, 145.0, and 529.0 items]


3.4.2.2 Property Statistics

To examine the nature of properties used to describe products, product models, and product offers, we first measured the average number of properties across the entire dataset. This gave us the results outlined in the last column of Table 3.2.

Table 3.2: Instance count and average number of properties in crawl dataset

Type            GoodRelations Concept                            No. of Instances   Avg. Properties
Offers          gr:Offering                                      3,197,130          13.43
Flat offers     gr:Offering w/o gr:includes/gr:includesObject    421,125            11.11
Products        gr:ProductOrService (incl. subclasses)           2,772,951          7.14
Product models  gr:ProductOrServiceModel                         82,173             11.93
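Averages of this kind can be obtained with a query of roughly the following shape; this is a hedged sketch against the crawl's SPARQL endpoint (the endpoint URL is a placeholder, and counting property occurrences rather than distinct properties is our assumption):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:8890/sparql")
sparql.setQuery("""
PREFIX gr: <http://purl.org/goodrelations/v1#>
SELECT (COUNT(?s) AS ?instances) (AVG(?props) AS ?avgProps)
WHERE {
  { SELECT ?s (COUNT(?p) AS ?props)
    WHERE { ?s a gr:Offering ; ?p ?o }
    GROUP BY ?s }
}
""")
sparql.setReturnFormat(JSON)
print(sparql.query().convert())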

[Figure 3.8: Ten most represented shops by offer count; urn:wptuning.tradoria.de, urn:it.bestshopping.com, urn:www.macmall.com, urn:www.arzneimittel.de, urn:www.shopforia.com, urn:akku24.tradoria.de, urn:Strumpfmoden-Discount.de, urn:www.salmanbilgisayar.com.tr, urn:www.youboutiques.com, urn:computeronlineshop.tradoria.de; x-axis: number of items (0–180,000)]

Table 3.2 covers all classes of GoodRelations that allow for supplying product features, i.e. product offers, instances, and models. Not very surprisingly, we found that only very few product features are specified beyond those typically sufficient to satisfy search engines.

Offers The bar chart in Figure 3.9 displays the most frequent properties (the upper 90%, i.e. 20 out of 55) used for product offers in the dataset. Most offers include general properties from the GoodRelations namespace, as indicated by the orange-colored bars. Image and page links from external vocabularies are represented by red bars. The single black bar for gr:hasBrand denotes the erroneous usage of this property with a product offer, which is not allowed in GoodRelations. The strongest property almost always present with offers is gr:hasPriceSpecification.

Flat Offers As illustrated in Figure 3.10, the statistics look a bit different for flat offers than for offers with products attached. While the strongest properties remain the same, there are noticeably more external vocabularies involved, e.g. properties from the Open Graph Protocol (OGP) and review information. Also, many flat offers seem to rely on the outdated GoodRelations modeling pattern that uses the rdfs:label and rdfs:comment annotation properties rather than the newer properties gr:name and gr:description. In total, the upper 90% of the most frequent properties for flat offers comprise 24 out of 46 properties. Again, there is some wrong usage of properties. Instead of attaching gr:hasCurrency to the offer directly, it should be attached to the price specification node referenced via gr:hasPriceSpecification. Furthermore, the GoodRelations class gr:BusinessEntity is mistakenly used as a property.

[Figure 3.9: Frequency of offer properties in crawl (upper 90% – 20 out of 55); led by gr:hasPriceSpecification, gr:name, gr:description, gr:hasBusinessFunction, and gr:validThrough, followed by foaf:page, gr:acceptedPaymentMethods, gr:eligibleRegions, and others]

[Figure 3.10: Frequency of flat offer properties in crawl (upper 90% – 24 out of 46); led by gr:hasPriceSpecification, foaf:depiction, gr:name, gr:description, and gr:hasBusinessFunction, with OGP, rdfs:label/rdfs:comment, and review properties further down the list]

Products Contrary to our expectations, there is less variety of properties for product instances than for product offers in the crawl. Out of the 43 product properties observed, only 11 properties account for the upper 90% shown in Figure 3.11. Apart from generic product properties from GoodRelations, the most frequent properties are foaf:depiction, foaf:page, and yahoo:image. But even those properties are not specific to products.

[Figure 3.11: Frequency of product properties in crawl (upper 90% – 11 out of 43); led by gr:name, gr:hasStockKeepingUnit, foaf:depiction, foaf:page, and yahoo:image]

Product Models The situation for product models is even more surprising than for product instances. In the entire crawl dataset, only 24 different properties appear with product models. Out of these 24 properties, the upper 90% are represented by 17 properties (see Figure 3.12). Most of them stem from external vocabularies, i.e. Friend of a Friend (FOAF) [BM14], RDF Schema (RDFS), the RDF Review vocabulary, and the Yahoo! Searchmonkey vocabulary. Nonetheless, the properties used are again not very specific.

3.4.3 Comparison with Web Data Commons

We evaluate our approach by comparing our findings to the results from the WDC initiative. In particular, we are interested in whether our focused GoodRelations crawl dataset (henceforth GRC) is more complete and covers a larger amount of useful product data than the latest Common Crawl, which is without a doubt several times bigger than our crawl.

[Figure 3.12: Frequency of product model properties in crawl (upper 90% – 17 out of 24); led by foaf:depiction, rdfs:comment, foaf:thumbnail, rdfs:seeAlso, and rdfs:label]

3.4.3.1 Quantitative Comparison of Entities in WDC and GRC

In [MPB14], Meusel et al. present a table with the 30 most frequent RDFa classes in WDC. Some GoodRelations classes listed there allow a comparison with our crawl (GRC), as detailed in Table 3.3. Although the number of domains featuring GoodRelations data is similar in both datasets, we found many more entities in GRC than in WDC. The considerable differences in the entity/domain-ratios21 of the two datasets underline this impression, as depicted in Figure 3.13.

Table 3.3: Comparison of entity frequency in WDC and in GRC

                             WDC 2013                           GRC
GoodRelations Class          Domains   Entities   E/D-Ratio     Domains   Entities    E/D-Ratio
gr:Offering                  2,199     498,333    226.62        2,314     3,197,130   1,381.65
gr:BusinessEntity            2,155     394,556    183.09        1,031     425,769     412.97
gr:UnitPriceSpecification    1,681     429,409    255.45        2,271     3,557,590   1,566.53
gr:SomeItems                 1,429     235,785    165.00        2,137     2,558,844   1,197.40
gr:TypeAndQuantityNode       1,221     187,865    153.86        455       637,409     1,400.90
gr:QuantitativeValue         1,032     192,560    186.59        1,259     1,553,239   1,233.71

21 The E/D-ratio denotes the average number of entities per Web site.

[Figure 3.13: Comparison of the E/D-ratios for WDC and GRC for the classes in Table 3.3]

3.4.3.2 Quantitative Comparison of Structured Data in Web Shops

In addition to the quantitative comparison of the total numbers of entities in the datasets, we were interested in whether there are differences in the amount of structured data of individual Web shops.

For this purpose, we required access to the raw data from WDC (in our case, the RDFa datasets were sufficient, because at the time of our crawl GoodRelations was used almost exclusively with RDFa). WDC publishes the datasets as a list of gzip-archived N-Quad files. In total, the uncompressed files for RDFa amount to 695 GB, each of them with a file size of about one gigabyte. Our idea was to match graph names from GRC to graph names from WDC. With a simple query (prefix declaration omitted) like

SELECT DISTINCT ?g
WHERE { GRAPH ?g { ?s a gr:Offering } }

against a SPARQL endpoint with the GRC dataset, we could obtain a list of all 2,314 named graphs [Car+ 05] in the form of URNs that expose offers in our dataset. We saved them as a list of graph names (without the urn: prefix) into a text file. Afterwards, we ran a grep command against WDC counting the number of times a domain name appears in the dataset. Since N-Quads is a line-based syntax, the obtained result denotes the number of triples (or quads) in WDC. More precisely, we executed the following command to extract the graph name part from the gzipped N-Quads files (i.e. every fourth term/column in a line) and to write it into a text file. The second command selects the first one of the text files created before.

$ ls *.nq.gz | sed 's/.nq.gz//' | parallel -j64 \
>   "zcat {}.nq.gz | awk '{if(\$(NF)){print \$(NF-1)}}' >> {}.wdc_graphs.txt"

$ ls *.wdc_graphs.txt | head -n 1
ccrdf.html-rdfa.0.wdc_graphs.txt


We used the GNU parallel command line tool to distribute the task over several processes. It is run with the option "-j64" to spawn 64 processes, i.e. four processes per CPU core on our machine with 16 cores. The next command shows how we used grep to count the number of lines from the previously created graph name files. Without going into detail, it uses the graph name file from GRC (grc_graph_uris.txt) as input to a parallelized grep. Thus, every process is assigned a graph name from GRC to search for within the collection of graph name files from WDC. The processed graph name together with the number of quads found in WDC are stored in a text file (num_triples.txt), for which the first two lines are displayed below.

$ cat grc_graph_uris.txt | parallel -j64 \
>   'grep {} *.wdc_graphs.txt | (var=$(wc -l); if [ "$var" -ne "0" ]; then echo {} $var; fi)' \
>   >> num_triples.txt

$ head -n 2 num_triples.txt
sanita.ro 29
shop.goodrelations.pl 11

There is a remarkable overlap of 214 shops between the two datasets. After some cleansing steps that involved consolidating www. and non-www. variants and eliminating graph names without a dot (e.g. "goodrelations" and "localhost"), we retained 205 distinct shops that appear in both datasets. Subsequently, we refer to these reduced datasets as WDC205 and GRC205. Table 3.4 presents the ten domains with the largest number of triples in GRC and compares them with the number of triples in WDC. One shop in this list (www.eastwood.com) is represented with more triples in WDC than in GRC. We attribute this divergence to the crawls having been conducted at different points in time (with possible interim changes), and to the fact that we occasionally had to abandon the crawling of a Web site prematurely, e.g. because of interruptions like an imminent server restart or downtime. It could also be that the internal link structure of this Web site has more hops than the threshold of three hops used for our spider-based crawling approach. Yet another explanation is that WDC may sometimes spot additional URIs from external links that our crawler cannot find because they are neither in the sitemap nor reachable via internal links from the very same domain, or that some sites have duplicate content under multiple URIs. Nevertheless, out of the 205 shops, we could only find ten shops where our dataset exhibited fewer triples than WDC (see Table 3.5 for a complete list). Intuitively, following our observations from above, there are on average more triples for every shop in our GRC than in the WDC dataset. To verify this assumption, we stated the following null hypothesis:

Table 3.4: Comparison of the amount of RDF triples in shops for WDC and GRC, sorted in descending order by the number of triples in GRC

Domain                     Triples in WDC   Triples in GRC
www.shopforia.com          226              5,469,335
www.speichermarkt.de       702              4,999,534
www.macmall.com            570,200          4,305,817
www.mts-shop.eu            42,226           3,984,312
www.wayfair.com            116,623          1,817,762
www.parfumerie.nl          63               1,703,252
www.eastwood.com           3,329,977        1,136,916
www.takatomo.de            23,706           1,047,689
www.radonline.de           865              883,364
www.puissance-moteur.fr    71               695,354
...
Total: 205 sites in the sample

Table 3.5: Comparison of the amount of RDF triples in shops for WDC and GRC, filtered by domains for which WDC contains more triples than GRC

Domain                        Triples in WDC   Triples in GRC
www.eastwood.com              3,329,977        1,136,916
www.threadless.com            1,828,048        71,341
www.demartina.com             14,355           12,854
www.simplyglobo.com           5,845            3,820
www.pauladeenstore.com        35,308           2,900
www.foodnetworkstore.com      439,303          2,276
www.antfuel.com               24,708           316
www.heppnetz.de               314              310
franz.com                     1,587            113
www.christian-junghanns.de    116              83

Null hypothesis. The amount of data for every single Web shop in GRC is not larger than in the WDC dataset.

As the matching samples in the two datasets (WDC205 and GRC205) are not normally distributed22 and the domains appearing in the datasets are the same (and thus comparable to repeated measurements), we carried out a paired samples test on the number of triples, more precisely a Wilcoxon signed-rank test [Wil45]. As a result, the number of triples in GRC205 (median = 17,520) was significantly higher than in WDC205 (median = 98), t(205) = 970, p = 1.76 × 10^-29, r = −0.79. We thus conclude that our crawl collected on average significantly more structured product data per Web shop than the Common Crawl.

22 A Shapiro-Wilk test [SW65] with a null hypothesis of a normally distributed sample gave us the following p-values: WDC205: p = 4.58 × 10^-30; GRC205: p = 4.55 × 10^-28.
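The test procedure can be reproduced along the following lines; this is a hedged sketch with synthetic placeholder data, not the thesis's actual per-shop counts:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
wdc = rng.lognormal(mean=4, sigma=2, size=205)    # placeholder triple counts per shop
grc = rng.lognormal(mean=9, sigma=2, size=205)

print(stats.shapiro(wdc), stats.shapiro(grc))     # normality check (Shapiro-Wilk)
stat, p = stats.wilcoxon(wdc, grc)                # paired, non-parametric comparison
print(stat, p)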


3.5 Conclusion

In this chapter, we have provided evidence that popular Web crawlers like Common Crawl fail to detect the product detail pages that contain the majority of the product data. We have proposed an alternative, ping-based crawling strategy that focuses on the deep detail pages of e-commerce Web sites, and compared the results from our crawl of 2,628 shop sites with the WDC corpus. Our statistics of structured data in Web shops differ significantly from the WDC dataset. In addition, we have shown that structured e-commerce data on the Web lacks data granularity, which for example limits its usefulness for deep product comparison and calls for additional techniques for enriching product data; this is addressed in Chapter 6 of this thesis.

In the course of our research, we identified the following potential limitations in our approach that open up opportunities for future work:

• The two compared datasets were created at disparate points in time. Our crawl was conducted in late 2011/early 2012, while the latest Common Crawl snapshot originates from November 2013. During that period, the implementation of structured data on these Web sites might have changed. Conceivable changes include the addition and removal of pages, newly added or deleted triples, data format changes (e.g. from RDFa to Microdata), or migrations to other vocabularies (e.g. from GoodRelations to schema.org).

• Even though the coverage between the two compared datasets was already considerable (∼10%), a crawl over exactly the shop URIs from the WDC dataset (i.e. 100% coverage) would allow an even better evaluation of our work.

• In this work, we by and large focused on GoodRelations and RDFa. Repeating the same experiments while taking into account schema.org and Microdata would be a worthwhile exercise that might lead to additional insights.

• A more comprehensive analysis would be possible with only a few additions to the crawler. These improvements include generating and storing further metadata (e.g. a timestamp of the extraction, or details about the detected data format and vocabulary), making it possible to resume interrupted crawls, utilizing sitemaps (e.g. the lastmod element) for enhancing the crawling plan, and specifying a SPARQL endpoint location to which the crawled data is instantly uploaded by means of SPARQL Update [GPP13] queries.

4 Product Model Master Data

4.1 Problem Statement
4.2 State of the Art and Related Work
    4.2.1 BMEcat
    4.2.2 Other Approaches
4.3 Product Model Master Data for the Semantic Web
    4.3.1 Aligning BMEcat with GoodRelations
        4.3.1.1 Product Details
        4.3.1.2 Product Features
        4.3.1.3 Catalog Group Systems
        4.3.1.4 Product and Catalog Group Map
    4.3.2 Selected Modeling Problems
        4.3.2.1 Datatype versus Object Properties
        4.3.2.2 Float Value Ranges in Datatype Properties
        4.3.2.3 Units of Measurement
    4.3.3 Scalability
4.4 Evaluation
    4.4.1 Coverage of Use Cases
    4.4.2 Missing Product Features on the E-Commerce Web of Data
    4.4.3 Leverage Effect of Product Master Data on the Web
4.5 Conclusion

For any business taking part in supply chain networks, it is crucial for reasons of efficiency and competitiveness to maintain a consolidated view of all its core business entities such as products, customers, or places. This data that is shared across different applications is generally referred to as master data [e.g. Ora11; MV05; Los09, p. 6; Whi+ 06a; Dre+ 08, p. 1]. For example, Loshin puts it as follows: “Master data objects are those core business objects used in the different applications across the organization, along with their associated metadata, attributes, definitions, roles, connections, and taxonomies” [Los09, p. 6]; and Oracle defines master data as “the business objects that are shared


across more than one transactional application. This data represents the business objects around which the transactions are executed. This data also represents the key dimensions around which analytics are done. Master data creates a single version of the truth about these objects across the operational IT landscape" [Ora11]. For a literature review, see [Sil+ 11].

To date, the automatic exchange of product information between business partners in a value chain is typically done using business-to-business (B2B) catalog exchange standards such as Price Catalog Message (PRICAT) [UN 12], Commerce XML (cXML) [Ari16], BMEcat [SLK05a], or master data pools based on the Global Data Synchronization Network (GDSN) standard [SLÖ08]. At the same time, the Web of Data, in particular the GoodRelations [Hep08a] vocabulary, offers the necessary means to publish highly structured product data in a machine-readable format. The advantages of publishing rich product descriptions can be manifold, including better integration and exchange of information between Web applications, high-quality data along the various stages of the value chain, or the opportunity to support more precise and more effective searches.

In this chapter, we show that existing product catalogs can provide a huge lever for product offering descriptions on the Web. Initially, we (1) stress the importance of rich product master data for e-commerce on the Semantic Web, and then we (2) present a tool to convert BMEcat Extensible Markup Language (XML) data sources into an RDF-based data model anchored in the GoodRelations vocabulary. The benefits of our proposal are tested using product data collected from a set of ∼2,500 online retailers of varying sizes and domains, as described in Chapter 3.

4.1 Problem Statement

Online shopping has experienced significant growth during the last decade. Preliminary estimates of retail e-commerce sales in the USA show an increase of 14.7% between Q41 of 2013 and Q4 of 2014, while they grew to almost three times 2005 levels, totaling 7.7 percent (96 billion U.S. dollars) of the entire U.S. retail sales market [Uni14]. These recent statistics indicate a large body of different-sized online stores ranging from major retailers like Amazon, Best Buy or Sears to small Web shops offering only tens or hundreds of products. Hence it comes as no surprise that instances of popular commodities are offered by a fairly large number of shopping sites. Many of those online shops maintain databases where they can store information and data to describe their goods. Nonetheless, for site owners it proves difficult to get hold of rich and high-quality product data for all of their items over time, especially if their specifications originate from product catalogs by different manufacturers.

1 Q4 = Fourth quarter

Large-size retailers might obtain this information in a semi-automated fashion via some form of catalog exchange format. However, small shop owners might have to enter products and feature data manually. This scenario produces repeated definitions of the same product features, but mainly with incomplete, inconsistent and outdated information across various online retailers. Little and inaccurate information about products ultimately hampers the effective matchmaking of products.

Another major source of product data for commodities are their manufacturers. These compile and maintain specifications of all of their products. Typically, their product catalogs are managed in product information management (PIM) systems that can export content to different types of media, e.g. electronic product catalogs as seen on many manufacturer sites or printed catalogs [Abr14, p. 1]. PIM systems host essential and core product data, also known as product master data [Whi07].

Table 4.1 presents a simple illustration of the situation using the example of three random products. The table compares the number of features provided by the goods' manufacturers with the features found at a large leading online retailer's Web site (i.e. Amazon) and other online merchants of various sizes selected arbitrarily via the "Shopping" service of Google Germany2. Unless otherwise specified, by "features" we mean structured product specifications (i.e. datasheets in tabular form published on the shop pages), not counting product pictures, product name, and product description. It can be seen that the richness of product data provided across the different sources varies significantly, but also that the manufacturers expose much more detailed product information than the retailers.

Table 4.1: Comparison of product features between manufacturers and retailers

Product                  Manufacturer Features   Retailer                  Retailer Features   Coverage (a)
Samsung LED TV ES6300    89                      amazon.de                 15                  28.09%
                                                 notebooksbilliger.de      39
                                                 conrad-electronics.de     22
                                                 voelkner.de               24
Siemens Kettle TW86103   25                      amazon.de                 10                  49%
                                                 redcoon.de                22
                                                 quickshopping.de          4
                                                 elektro-artikel-shop.de   13
Suunto M5 Running Pack   33                      amazon.de                 12                  23.64%
                                                 sportscheck.com           3
                                                 otto.de                   1
                                                 klepsoo.com               15
                                                 tictactime.de             8

a "Coverage" = ratio of the average number of retailer features to manufacturer features
2 http://www.google.de/shopping/ (accessed on May 8, 2014)


To date, product master data is typically passed along the value chain using B2B channels based on electronic data interchange (EDI) standards and catalog exchange formats such as BMEcat (catalog from the German Bundesverband Materialwirtschaft, Einkauf und Logistik (Engl.: Federal Association of Materials Management, Purchasing and Logistics) (BME)) [SLK05a]. Such standards can significantly help improve the automatic exchange of data. However, trading partners still have to negotiate and set up information channels bilaterally, which prevents them from establishing ad-hoc business relationships and raises the barriers for potential business partners that either do not have the means or the budget to connect via imposed B2B standards. Similarly, this neglects end users, who could benefit from enterprise data liberalization through better search and matchmaking services for products [Di+ 03].

An approach to tackle this issue is to publish rich product master data from the PIM systems of manufacturers on the Web of Data, so that it can be electronically consumed by other merchants intending to trade these goods. Under this premise, retailers and Web shop owners could then rely on widely used product "strong identifiers" such as European Article Numbers (EANs), Universal Product Codes (UPCs), Global Trade Item Numbers (GTINs), or manufacturer part numbers (MPNs), to establish a link to this rich data straight from the manufacturers. Figure 4.1 illustrates an example of this approach, where the data from three different online merchants can be augmented with product descriptions and features as published by the manufacturer, on the basis of the corresponding product strong identifier. Each online merchant can then use this rich manufacturer information to augment and personalize their own offering of the product in question.

In this chapter, we outline the potential leverage of manufacturer datasheets from PIM systems for product model master data on the Web. In particular, we propose to use the XML-based BMEcat standard in order to make highly structured product feature data available on the Web of Data. We describe a conceptual mapping and the implementation of a respective software tool for automatically converting BMEcat documents into Resource Description Framework (RDF) data based on the GoodRelations vocabulary for e-commerce [Hep08a]. This is attractive because most PIM software applications can export content to BMEcat. With our approach, a single tool can bring the wealth of data from established B2B environments to the Web of Data. Our proposal can be applied at Web scale and is suitable for every PIM system or catalog management software that can create BMEcat XML product data, which holds for about 82% of all such software systems that we are aware of, as surveyed in [Web11]. Furthermore, it can minimize the proliferation of repeated, incomplete, or outdated definitions of the same product master data across various online retailers by means of simplifying the consumption of authoritative product master data from manufacturers by online retailers of any size.

[Figure 4.1: Enriching shop pages with product master data from manufacturers based on "strong identifiers" [from Hep12a]; a manufacturer datasheet (weight, color, EAN, GTIN14, MPN, brand) enriches the sparse offer pages of several shops (price only) via shared identifiers, e.g. for consumption by a search engine or browser plug-in]

It is also expected that the use of structured data in terms of the GoodRelations vocabulary by manufacturers and online retailers will bring additional benefits derived from being part of the Web of Data, such as search engine optimization (SEO) in the form of rich snippets [Goo16], or the possibility of better articulating the value proposition of products on the Web.

Figure 4.2 indicates how the modeling of the approach looks in GoodRelations. The upper part of the figure denotes the concepts that can generally be provided by Web shops, namely the description of the business entity, the offering description, and possibly a basic product description. The lower part depicts the product model master data as provided by manufacturers, which might include much more comprehensive and granular product details than those supplied by retailers. Via the GoodRelations predicate gr:hasMakeAndModel (e.g. materialized according to the implicit link of product strong identifiers), it is possible to add a link from a product to its product model and thus enrich the basic product descriptions with high-quality product model master data, e.g. from BMEcat catalogs. A hedged sketch of such a materialization step is given below.
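The following Python sketch illustrates the idea under stated assumptions: the file names are placeholders, retailer products are assumed to carry gr:hasEAN_UCC-13 values, and rdflib is used merely for illustration:

from rdflib import Graph

g = Graph()
g.parse("retailer_products.nt", format="nt")      # retailer-side product data
g.parse("manufacturer_models.nt", format="nt")    # manufacturer product models

q = """
PREFIX gr: <http://purl.org/goodrelations/v1#>
CONSTRUCT { ?product gr:hasMakeAndModel ?model }
WHERE {
  ?product a gr:ProductOrService ; gr:hasEAN_UCC-13 ?ean .
  ?model a gr:ProductOrServiceModel ; gr:hasEAN_UCC-13 ?ean .
}
"""
for triple in g.query(q):                         # materialize the implicit link
    g.add(triple)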

[Figure 4.2: Retailer and manufacturer data in GoodRelations; Web shops provide gr:BusinessEntity, linked via gr:offers to gr:Offering and via gr:includes to gr:ProductOrService; manufacturers (BMEcat) provide gr:ProductOrServiceModel with product features; the explicit link gr:hasMakeAndModel mirrors the implicit link via product "strong identifiers" (e.g. gr:hasEAN_UCC-13, gr:hasMPN)]

To test our proposal, we converted representative real-world BMEcat catalogs of two well-known manufacturers and analyzed whether the results validate as correct RDF/XML datasets grounded in the GoodRelations ontology. Additionally, we identified examples that illustrate the described problem scenario, relying on structured data collected from ∼2,500 online shops together with their product offerings. Our tests allowed us to confirm the immediate benefits and impact that adopting our approach can bring to both manufacturers and retailers.

The rest of this chapter is structured as follows: Section 4.2 reviews previous efforts on product data management (PDM) with Semantic Web technologies; Section 4.3 covers the key concepts that form the basis of the BMEcat2GoodRelations tool; Section 4.4 focuses on the evaluation of our overall approach; and finally, the conclusions and future opportunities of our work are discussed in Section 4.5.

4.2 State of the Art and Related Work

In this section, we briefly describe the main characteristics of the BMEcat format and summarize related work.

4.2.1 BMEcat

BMEcat is a sophisticated XML standard for the exchange of electronic product catalogs between suppliers and purchasing companies in B2B settings [HS00]. The current release is BMEcat 2005 [SLK05a], a fully downwards-compatible update of BMEcat 1.2 [SLK05a, p. 17]. The most notable improvements over previous versions are the support of external catalogs and multiple languages, and the consistent renaming of the ambiguous term ARTICLE to PRODUCT [SLK05a, p. 17].

Figure 4.3 presents a high-level view of the document structure for the transmission of a catalog using BMEcat 2005.

HEADER
    CATALOG
    AGREEMENT
    SUPPLIER
    BUYER
T_NEW_CATALOG
    CATALOG_GROUP_SYSTEM
        CATALOG_STRUCTURE
    PRODUCT
        PRODUCT_DETAILS
        PRODUCT_FEATURES
        PRODUCT_ORDER_DETAILS
        PRODUCT_PRICE_DETAILS
    PRODUCT_TO_CATALOGGROUP_MAP

Figure 4.3: BMEcat 2005 skeleton [based on SLK05a]

A BMEcat document comprises a header and a transaction part [HS00]:

• The header part defines global settings such as defaults for currency, eligible regions or catalog language, and specifies seller and buyer parties involved in the transaction [HS00]. It further may state the agreement or contract that the document is based on [HS00]. The default values specified in the document header may be overridden by values defined at product instance level in the document [cf. SLK05a, p. 14].

• The transaction part consists of a product data section and data related to classification standards (e.g. eCl@ss3, United Nations Standard Products and Services Code (UNSPSC)4) or vendor-specific catalog group systems [cf. SLK05a, p. 14]. Product data sections consist of product-related information, feature data, price details, and order details [HS00]. The element name of the payload part determines the transaction type and can be one of T_NEW_CATALOG (new catalog), T_UPDATE_PRODUCTS (update of product data), and T_UPDATE_PRICES (update of price data) [HS00]; a small sketch after this list illustrates this convention.
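A minimal, hand-written illustration (heavily abridged and not schema-validated; the element selection follows Figure 4.3 rather than the full BMEcat 2005 specification):

import xml.etree.ElementTree as ET

BMECAT = """<BMECAT version="2005">
  <HEADER>
    <CATALOG><LANGUAGE>eng</LANGUAGE><CURRENCY>EUR</CURRENCY></CATALOG>
    <SUPPLIER><SUPPLIER_NAME>ACME</SUPPLIER_NAME></SUPPLIER>
  </HEADER>
  <T_NEW_CATALOG>
    <PRODUCT>
      <SUPPLIER_PID>ACME123</SUPPLIER_PID>
      <PRODUCT_DETAILS><DESCRIPTION_SHORT>Widget</DESCRIPTION_SHORT></PRODUCT_DETAILS>
    </PRODUCT>
  </T_NEW_CATALOG>
</BMECAT>"""

root = ET.fromstring(BMECAT)
payload = next(el for el in root if el.tag.startswith("T_"))
print(payload.tag)   # transaction type, here: T_NEW_CATALOG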

4.2.2 Other Approaches

The rise of B2B e-commerce revealed a series of new information management challenges in the area of product data integration [e.g. Fen+ 01; SH01]. Separately, the gradual realization of the Semantic Web vision has motivated significant efforts aimed at representing existing e-commerce-related data and product classification standards using open semantic technologies and data models [e.g. Hep06; Hep07b; BM08].

3 http://www.eclass.de/ (accessed on May 16, 2014)
4 http://www.unspsc.org/ (accessed on May 16, 2014)


Yet, in the particular context of managing product master data, two previous solutions [Bru+ 07; Wan+ 09] stand out based on their similarities with respect to our problem scenario. The study in [Bru+ 07] presents a meta-model in OWL DLP as part of a semantic application framework that can provide semantic capabilities to a generic PIM system. Wang et al. [Wan+ 09] have developed an extension that allows lifting the data from existing relational databases of leading master data management (MDM) systems to the RDF format, enabling semantic interoperability across organizations' core data, applications, and systems. Both solutions share our reliance on Semantic Web technologies to facilitate product master data integration and consistency across separate data sources. However, there are several aspects where they deviate from the proposal that we are going to present in the upcoming sections, most notably: (a) their scope focuses on closed corporate environments, which may involve proprietary applications or standards rather than open technologies at the scale of an open Web of Data; and (b) being aimed at generic PIM and MDM systems, their level of abstraction is rather high, introducing additional degrees of separation with respect to the applicability to the problem scenario targeted by our conversion approach. In that sense, our approach is, to the best of our knowledge, the only solution developed on the basis of open standards, readily available to both manufacturers and retailers, to convert product master data from BMEcat into structured RDF data suitable for publication and consumption on the Web of Data.

4.3 Product Model Master Data for the Semantic Web

The implementation of the logic behind the alignments presented herein resulted in the BMEcat2GoodRelations converter. In 2009, Mark Mattern, a master student supervised by Martin Hepp at the University of Innsbruck, developed a first version of an online converter that implemented an extensive mapping from BMEcat to GoodRelations as part of his master thesis [Mat09]. Yet, the online converter turned out to be impractical for extremely large BMEcat files. In the following, we give proper credit to the valuable work of Mattern [Mat09] by extending it with additional mappings, especially for product master data, and by suggesting a more robust tool architecture that can accommodate conversions of large BMEcat files. Our tool, BMEcat2GoodRelations, is a portable command line application written in Python. It converts BMEcat XML files into a corresponding RDF representation anchored in the GoodRelations ontology for e-commerce, and it scales well to file sizes of several hundred megabytes. For more information about the project, we refer to the project landing page hosting the open source code repository5, where one can find a detailed overview of all the features of the converter, including a comprehensive user’s guide. A round-tripping toy example that describes the file structure of the converter output is also available online6.

4.3.1 Aligning BMEcat with GoodRelations

In the following, we outline correspondences between elements of BMEcat and GoodRelations and propose a mapping between the BMEcat XML format and the GoodRelations vocabulary. Given their inherent overlap, a mapping between the models is reasonable, with some exceptions that require special attention. We will highlight these cases; nonetheless, we cannot cover the full alignment here. For the mapping between the two schemas we considered the following aspects:

• company details (address, contact details, etc.),
• product offer details,
• catalog group structures,
• product features including links to media objects, and
• references to external product classification standards.

Furthermore, multi-language descriptions in BMEcat are handled properly, namely by assigning corresponding language tags to RDF literals. An illustrative example of a catalog and its respective conversion is available online7. However, within the scope of this work we focus mainly on product model data. Also, we do not provide alignments for full classification standards, which can be exchanged starting with BMEcat 2005, primarily because of their complexity and for legal reasons that are especially relevant when converting licensed classification standards. Moreover, there already exist proposals that focus on the conversion and publication of product classification standards (e.g. eClassOWL [Hep05b]).

5 http://code.google.com/p/bmecat2goodrelations/ (accessed on August 23, 2014)
6 http://www.ebusiness-unibw.org/projects/bmecat2goodrelations/example/ (accessed on October 19, 2014)
7 http://www.ebusiness-unibw.org/projects/bmecat2goodrelations/example/ (accessed on August 23, 2014)


4.3.1.1 Product Details

At the center of the proposed alignments are product details and product-related business details. Table 4.2 shows the BMEcat-2005-compliant mapping for product-specific details.

Table 4.2: Mapping of product details from BMEcat to GoodRelations

BMEcat                                   | GoodRelations
PRODUCT                                  | gr:Offering, gr:Individual/gr:SomeItems, gr:ProductOrServiceModel
  SUPPLIER_PID type={ean, gtin}          | gr:hasEAN_UCC-13, gr:hasGTIN-14
  PRODUCT_DETAILS                        |
    DESCRIPTION_SHORT lang={en, de, ...} | gr:name with language en, de, ...
    DESCRIPTION_LONG lang={en, de, ...}  | gr:description with language en, de, ...
    INTERNATIONAL_PID type={ean, gtin}   | gr:hasEAN_UCC-13, gr:hasGTIN-14
    MANUFACTURER_PID                     | gr:hasMPN
    MANUFACTURER_NAME                    | gr:hasManufacturer → gr:BusinessEntity gr:legalName
    PRODUCT_STATUS type={new, used, ...} | gr:condition

Table 4.2 adds a further level of detail to the PRODUCT → PRODUCT_DETAILS structure introduced in Figure 4.3. An indented element name denotes a new nesting level, e.g. PRODUCT consists of an attribute for the product identifier of the supplier and a subelement PRODUCT_DETAILS. The elements discussed in the present context are all mapped to properties of product instances, product models, and offers in GoodRelations; however, our main interest lies in the alignment to gr:ProductOrServiceModel. The product identifier can be mapped in two different ways, at product level or at product details level, whereby the second takes precedence over the first (in other words, the globally scoped value is overwritten by the locally scoped value). Whether the EAN or the GTIN is mapped depends on the type attribute supplied with the BMEcat element. Furthermore, the mapping at product level allows specifying the MPN, product name and description, and condition of the product. Depending on the language attribute supplied along with the DESCRIPTION_SHORT and DESCRIPTION_LONG elements in BMEcat 2005, multiple translations of product name and description can be obtained. Lastly, the manufacturer name is mapped to a slightly more complex pattern in GoodRelations: the value of MANUFACTURER_NAME maps to the name of the legal entity attached to the product model via gr:hasManufacturer. Listing 4.1 gives an example of product details in Terse RDF Triple Language (Turtle) after having been mapped to GoodRelations.


samsung:LEDTV_ES6300 a gr:ProductOrServiceModel ;
    gr:name "Samsung LED TV ES6300"@en ;
    gr:hasEAN_UCC-13 "1234567890123"^^xsd:string ;
    gr:hasMPN "ledtv_es6300"^^xsd:string ;
    gr:hasManufacturer samsung:Samsung .

samsung:Samsung a gr:BusinessEntity ;
    gr:legalName "Samsung Group"@en .

Listing 4.1: Example of product details in Turtle/N3
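To make the mapping in Table 4.2 concrete, the following minimal sketch lifts a hand-written BMEcat PRODUCT fragment to the triples of Listing 4.1, using Python with rdflib and the standard library XML parser. The namespace URI, the minted resource URIs, and the input fragment are illustrative assumptions, not the actual BMEcat2GoodRelations code.

import xml.etree.ElementTree as ET
from rdflib import Graph, Literal, Namespace, RDF, XSD

GR = Namespace("http://purl.org/goodrelations/v1#")
SAMSUNG = Namespace("http://example.org/samsung#")  # hypothetical base URI

# Hypothetical BMEcat 2005 fragment shaped after Table 4.2
BMECAT_PRODUCT = """
<PRODUCT>
  <SUPPLIER_PID type="ean">1234567890123</SUPPLIER_PID>
  <PRODUCT_DETAILS>
    <DESCRIPTION_SHORT lang="en">Samsung LED TV ES6300</DESCRIPTION_SHORT>
    <MANUFACTURER_PID>ledtv_es6300</MANUFACTURER_PID>
    <MANUFACTURER_NAME>Samsung Group</MANUFACTURER_NAME>
  </PRODUCT_DETAILS>
</PRODUCT>
"""

def lift_product(xml_fragment):
    """Map one BMEcat PRODUCT element to a gr:ProductOrServiceModel."""
    product = ET.fromstring(xml_fragment)
    details = product.find("PRODUCT_DETAILS")
    g = Graph()
    g.bind("gr", GR)
    g.bind("samsung", SAMSUNG)

    model = SAMSUNG["LEDTV_ES6300"]  # URI minting simplified for the example
    g.add((model, RDF.type, GR["ProductOrServiceModel"]))

    pid = product.find("SUPPLIER_PID")
    if pid is not None and pid.get("type") == "ean":
        g.add((model, GR["hasEAN_UCC-13"],
               Literal(pid.text, datatype=XSD.string)))

    name = details.find("DESCRIPTION_SHORT")
    g.add((model, GR["name"], Literal(name.text, lang=name.get("lang"))))
    g.add((model, GR["hasMPN"],
           Literal(details.findtext("MANUFACTURER_PID"), datatype=XSD.string)))

    # MANUFACTURER_NAME maps to a gr:BusinessEntity linked via gr:hasManufacturer
    maker = SAMSUNG["Samsung"]
    g.add((model, GR["hasManufacturer"], maker))
    g.add((maker, RDF.type, GR["BusinessEntity"]))
    g.add((maker, GR["legalName"],
           Literal(details.findtext("MANUFACTURER_NAME"), lang="en")))
    return g

print(lift_product(BMECAT_PRODUCT).serialize(format="turtle"))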

4.3.1.2 Product Features

BMEcat allows specifying products using vendor-specific catalog groups and features, or referring to classification systems with externally defined categories and features. The mapping of product classes and features is shown in Table 4.3.

Table 4.3: Mapping of product features from BMEcat to GoodRelations

BMEcat                          | GoodRelations
PRODUCT_FEATURES                |
  REFERENCE_FEATURE_SYSTEM_NAME | referenced classification system identifier
  REFERENCE_FEATURE_GROUP_ID    | rdf:type (class id of classification system)
  REFERENCE_FEATURE_GROUP_NAME  | gr:category
  FEATURE                       |
    FNAME                       | rdfs:label and property name in GoodRelations
    FDESCR                      | rdfs:comment
    FVALUE                      | gr:hasValueFloat
    FUNIT                       | gr:hasUnitOfMeasurement
    FREF                        | feature id of referenced classification system, property name in the GoodRelations context

The target GoodRelations property of the REFERENCE_FEATURE_GROUP_NAME element is gr:category. REFERENCE_FEATURE_SYSTEM_NAME (e.g. ECLASS-5.1) and REFERENCE_FEATURE_GROUP_ID have no direct mapping; rather, their combination unambiguously determines the class identifier of a reference classification system (e.g. eClassOWL [Hep05b]). Likewise, the FREF element can be used together with FVALUE and an optional FUNIT element to specify a feature whose property is referenced externally. Otherwise, if no FREF is available for a feature, the feature is defined locally. The FUNIT element can be used to distinguish property types in GoodRelations, i.e. to assign a quantitative object property to the product model in RDF if a value for FUNIT is given, and a datatype property otherwise. The distinction will be addressed in more detail in Section 4.3.2. Listing 4.2 gives an example of product features in Turtle after having been mapped to GoodRelations (prefix declarations omitted).


samsung:LEDTV_ES6300 a gr:ProductOrServiceModel ;
    samsung:compatible_3D "true"^^xsd:boolean ;
    samsung:screen_size [ a gr:QuantitativeValueFloat ;
        gr:hasValueFloat "101.6"^^xsd:float ;
        gr:hasUnitOfMeasurement "CMT"^^xsd:string ] .

samsung:compatible_3D a owl:DatatypeProperty ;
    rdfs:subPropertyOf gr:datatypeProductOrServiceProperty ;
    rdfs:label "Plays 3D content"@en ;
    rdfs:domain gr:ProductOrService ;
    rdfs:range xsd:boolean .

samsung:screen_size a owl:ObjectProperty ;
    rdfs:subPropertyOf gr:quantitativeProductOrServiceProperty ;
    rdfs:label "Screen size"@en ;
    rdfs:domain gr:ProductOrService ;
    rdfs:range gr:QuantitativeValue .

Listing 4.2: Example of product features in Turtle/N3


4.3.1.3 Catalog Group Systems

Catalog groups are hierarchical structures that are used to facilitate the navigation and finding of products in a catalog [HS00]. They often group products not only by their similarity but also by typical usages or target audiences. With catalog groups, it is possible to further refine product descriptions (see Chapter 5). A catalog group system is mapped using an rdfs:subClassOf hierarchy based on the GenTax algorithm [HdB07], which can create meaningful ontology classes for a specific context while preserving the original hierarchy, i.e. the catalog group taxonomy. Table 4.4 outlines the mapping of catalog groups in BMEcat to RDF. The hierarchy is determined by the group identifier of the catalog structure, which refers to the identifier of its parent group.

Table 4.4: Mapping of a catalog group system in BMEcat to an rdfs:subClassOf hierarchy

BMEcat                                 | GoodRelations
CATALOG_GROUP_SYSTEM                   |
  CATALOG_STRUCTURE                    | owl:Class
    GROUP_ID                           | class name of owl:Class
    GROUP_NAME lang={en, de, ...}      | rdfs:label with language en, de, ...
    GROUP_DESCRIPTION lang={en, de, ...} | rdfs:comment with language en, de, ...
    PARENT_ID                          | rdfs:subClassOf (class id of superclass)


Listing 4.3 provides an example in Turtle of a catalog group structure built up according to the GenTax algorithm [HdB07] (lines 2–12).

 1 # GenTax mapping
 2 samsung:TV-tax a owl:Class ;
 3     rdfs:label "Samsung TV [Category]"@en .
 4 samsung:TV-gen a owl:Class ;
 5     rdfs:subClassOf gr:ProductOrService ;
 6     rdfs:label "Samsung TV [Product type]"@en .
 7 samsung:LED_TV-tax a owl:Class ;
 8     rdfs:label "Samsung LED TV [Category]"@en .
 9 samsung:LED_TV-gen a owl:Class ;
10     rdfs:subClassOf gr:ProductOrService ;
11     rdfs:label "Samsung LED TV [Product type]"@en .
12 samsung:LED_TV-tax rdfs:subClassOf samsung:TV-tax .
13
14 # Assignment of product type to product model
15 samsung:LEDTV_ES6300 a gr:ProductOrServiceModel, samsung:LED_TV-gen .

Listing 4.3: Example of catalog group information in Turtle/N3

4.3.1.4 Product and Catalog Group Map

In order to link catalog groups and products, BMEcat maps group identifiers to product identifiers using PRODUCT_TO_CATALOGGROUP_MAP [SLK05a, pp. 220f.]. Accordingly, products in GoodRelations are assigned corresponding classes from the catalog group system, i.e. they are defined as instances (rdf:type) of classes derived from the catalog group hierarchy. In Listing 4.3 (line 15), a product type is assigned to a product model based on a mapping rule between product identifiers and catalog group identifiers specified in the BMEcat catalog.

4.3.2 Selected Modeling Problems

In the following, we cover aspects of the conversion where the alignment of the two schemas turned out to be challenging.

4.3.2.1 Datatype versus Object Properties

The Web Ontology Language OWL distinguishes between object properties and datatype properties [DS04]. The former category describes properties that link individuals to each other, whereas the latter links individuals to data values (literals), e.g. an entity with a numeric value or a textual description. The GoodRelations vocabulary further refines the categorization made by OWL by separating qualitative and quantitative object properties. BMEcat, on the other hand, does not explicitly discriminate types of features, so features (FEATURE) typically consist of FNAME, FVALUE and, optionally, an FUNIT element [cf. SLK05a, pp. 138–143]. The presence of the FUNIT element helps to distinguish quantitative properties from datatype and qualitative properties, because quantitative values are determined by numeric values and units of measurement, e.g. “150 millimeters” or “1 bar”. Thus, any other feature is either a qualitative or a datatype property. It is impossible to define a rule that reliably distinguishes qualitative properties from datatype properties in an automated way during the conversion (e.g. are “S”, “M”, and “L” qualitative values describing garment sizes or rather simple literal values?), so we defer this task to the RDF world (where additional knowledge can potentially be brought in) and declare all such properties as datatype properties with a range of type string. For features whose values likely qualify as boolean values we provide a simple heuristic: if the feature value is one of “y”, “n”, “yes”, “no”, “true”, or “false”, the property is treated as a boolean datatype property. Similarly, all rules that apply to properties also apply to their respective values, i.e. a quantitative property implies quantitative values, and so forth.
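The decision logic described above can be condensed into a few lines of Python; the function name and the returned labels are our own illustration, not the converter's API:

BOOLEAN_TOKENS = {"y", "n", "yes", "no", "true", "false"}

def classify_feature(fvalue, funit=None):
    """Decide the GoodRelations property type for a BMEcat FEATURE."""
    if funit:
        # a unit of measurement implies a quantitative object property
        return "quantitative"
    if fvalue.strip().lower() in BOOLEAN_TOKENS:
        # heuristically a boolean datatype property
        return "datatype/boolean"
    # everything else is deferred to the RDF world as a string literal
    return "datatype/string"

assert classify_feature("101.6", funit="CMT") == "quantitative"
assert classify_feature("yes") == "datatype/boolean"
assert classify_feature("S") == "datatype/string"  # garment size stays a literal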

4.3.2.2 Float Value Ranges in Datatype Properties

By definition, BMEcat does not allow modeling range values, unlike GoodRelations. There are two possible workarounds, though: either the BMEcat supplier defines two separate features, or the range values are encoded in the FVALUE element of a single feature. The first option defines one feature for the lower bound and one for the upper bound; its downside is that two unrelated GoodRelations properties arise. The second alternative, i.e. range values encoded as single feature values, leads to invalid literals (e.g. gr:hasValueFloat “10–20”^^xsd:float) when mapped to GoodRelations. For that reason, typical value patterns describing lower and upper bounds (like an operating temperature of “5–40” degrees Celsius) are mapped during conversion to pairs of gr:hasMinValueFloat and gr:hasMaxValueFloat properties with the respective values in GoodRelations. This approach, however, works only for the most prevalent syntactical patterns for range values in text fields.
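A minimal sketch of how such range patterns can be detected during conversion; the exact set of patterns recognized by the converter may be broader:

import re

# recognizes the most prevalent pattern, e.g. "5-40" or "10 - 20"
RANGE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*[-–]\s*(\d+(?:\.\d+)?)\s*$")

def split_range(fvalue):
    """Return (min, max) for a textual range value, or None if no match."""
    m = RANGE.match(fvalue)
    return (float(m.group(1)), float(m.group(2))) if m else None

print(split_range("5-40"))            # (5.0, 40.0) -> gr:hasMinValueFloat/gr:hasMaxValueFloat
print(split_range("34 x 28 x 33.5"))  # None: not a recognized range pattern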


4.3.2.3 Units of Measurement

BMEcat and GoodRelations recommend the use of UN/CEFACT Common Codes [Uni06] to describe units of measurement. In reality, though, suppliers of BMEcat catalogs commonly export raw unit of measurement codes, i.e. just as they are found in their PIM systems. Instead of adhering to the standard three-letter UN/CEFACT Common Code, they often provide different representations of unit symbols, e.g. “cm”, “centimeters”, etc. in place of “CMT”. This is inconvenient for applications that are supposed to consume the data and compare products based on feature descriptions. As a means to enhance data quality already during the conversion process, our tool allows for the provision of a mapping table with invalid unit codes and their respective UN/CEFACT counterparts.
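A mapping table of this kind might look as follows; the entries are hypothetical examples, since the real table is supplied by the user as converter configuration:

# hypothetical excerpt; the actual table is user-supplied configuration
UNIT_CODE_MAP = {
    "cm": "CMT", "centimeters": "CMT",
    "m": "MTR", "meters": "MTR",
    "v": "VLT", "volt": "VLT",
}

def normalize_unit(raw_code):
    """Replace a raw PIM unit symbol by its UN/CEFACT Common Code."""
    return UNIT_CODE_MAP.get(raw_code.strip().lower(), raw_code)

print(normalize_unit("cm"))   # -> "CMT"
print(normalize_unit("CMT"))  # unknown to the table, passed through unchanged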

4.3.3 Scalability

BMEcat files, especially those of large industrial companies, can easily exceed 100 MB in size. With such file sizes, an online tool like the one proposed in [Mat09] quickly reaches its limits. The tool presented herein advances the work by Mark Mattern and incorporates some lessons learned, i.e. (a) it realizes a decentralized architecture that allows running the tool offline via a command line interface, and (b) it parses file contents intelligently using an event-based parsing strategy, without having to hold the complete XML tree in main memory. This made it possible to convert a 500 MB input file in less than two hours on an Apple MacBook from 2008 with 4 GB of main memory and an Intel Core 2 Duo processor running at 2.4 GHz. Even more importantly, the memory consumption during processing remained fairly low.
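The event-based strategy can be sketched with the Python standard library's iterparse, which emits elements as they are completed and lets the application discard them immediately; element and file names are illustrative:

import xml.etree.ElementTree as ET

def iter_products(bmecat_path):
    """Stream PRODUCT elements from a (potentially huge) BMEcat file
    without materializing the whole XML tree in main memory."""
    for _, elem in ET.iterparse(bmecat_path, events=("end",)):
        if elem.tag == "PRODUCT":
            yield elem      # map the element to RDF triples here
            elem.clear()    # release the subtree to keep memory flat

# Hypothetical usage: count products in a large catalog
count = sum(1 for _ in iter_products("catalog.xml"))
print(count, "products processed")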

4.4 Evaluation

To evaluate our proposal, we implemented two use cases that allowed us to produce a large quantity of product model data from BMEcat catalogs. We tested the two BMEcat conversions using standard validators for the Semantic Web, presented in the upcoming Section 4.4.1. In Section 4.4.2, we then compare the product models obtained from one of the BMEcat catalogs with products collected from Web shops through our focused Web crawl from Chapter 3. Finally, we show the potential leverage of product master data from manufacturers with regard to products offered on the Web.


4.4.1 Coverage of Use Cases

We tested our conversion using BMEcat files from two manufacturers, one in the domain of high-tech electronic components (Weidmüller Interface GmbH und Co. KG8), the other a supplier of white goods (BSH Hausgeräte GmbH9). In the case of Weidmüller, the conversion results are available online10. While the Weidmüller catalog comes with its own proprietary catalog group system, the products in the BSH catalog were classified according to eCl@ss 6.1. This allowed us to validate the BMEcat converter comprehensively. Although the conversions completed without errors, we could still detect a few issues in each dataset, covered subsequently.

To validate the output of our conversion, we used publicly available online and offline validators. In addition, our converter prints helpful warning messages to the standard output. In summary, the converter was tested using the following validation steps: (1) BMEcat2GoodRelations converter output (including error and warning messages, if any); (2) RDF/XML syntax validity11; (3) Pellet validation12 for spotting semantic and logical inconsistencies; and (4) GoodRelations-specific compliance tests13 to spot data model inconsistencies. The converter has built-in checks that detect common irregularities in the BMEcat data, such as wrong unit codes or invalid feature values. In Table 4.5, we list a number of warning messages that were output during the conversion of the BMEcat files, together with the validation results of the different validation tools. As shown in the table, the two conversions pass most validation checks, with a few data quality issues reported by some validators. In the BSH catalog, for example, some fields that require floating point values contain non-numeric values like “/”, “0.75/2.2”, “3*16”, or “34 x 28 x 33.5”, which originates from improper values in the BMEcat. Another reported data quality problem is the usage of non-uniform codes for units of measurement instead of the recommended 3-letter UN/CEFACT Common Codes (e.g. “MTR” for meters, “VLT” for Volt, etc.).

8 http://www.weidmueller.com/ (accessed on August 23, 2014)
9 http://www.bsh-group.com/ (accessed on August 23, 2014)
10 http://catalog.weidmueller.com/semantic/sitemap.xml (accessed on August 23, 2014)
11 http://www.rdfabout.com/demo/validator/ (discontinued as of August 23, 2014; but the source code is still available), http://www.w3.org/RDF/Validator/ (accessed on August 23, 2014)
12 http://clarkparsia.com/pellet/ (accessed on August 23, 2014)
13 http://www.ebusiness-unibw.org/tools/goodrelations-validator/ (accessed on May 22, 2014)


Table 4.5: Validation of BMEcat conversions

Validation                     | BSH                                                                                                 | Weidmüller
BMEcat2GoodRelations converter | warnings: (a) wrong values where numeric values were expected; (b) non-standard unit codes detected | warnings: (a) non-standard unit codes detected
RDF Validator                  | valid; warning: invalid lexical value for literal                                                   | valid
W3C RDF Validation             | valid                                                                                               | valid
Pellet                         | valid; warning: malformed xsd:float detected                                                        | valid
GoodRelations Validator        | step 32 failed: non-compliance of float literal with xsd:float                                      | valid

4.4.2 Missing Product Features on the E-Commerce Web of Data

Table 4.1 from this chapter’s introduction has revealed a mismatch between the features published by manufacturers and those published by online retailers via offering descriptions. In this section, we describe one additional example that uses structured data on the Web of Data. In addition to the manufacturer BMEcat files, we took a real dataset obtained from a focused Web crawl whereby we collected product data from ~2,500 shops (see Chapter 3). Figure 4.4 depicts the distribution of the product offer count across Web shops in the crawl. For this figure, we only considered product offers with EANs, which appeared in 847 shops. Furthermore, in order to remove any potential bias caused by multiple definitions of the same product on different pages (because of non-canonical Uniform Resource Identifiers (URIs) containing query strings like prod_id=1&sess_id=XYZ), the boxplot was generated using the count of product offers per shop with distinct EANs. Hence, according to Figure 4.4, only 25% of the Web shops offer more than 526 products14 with distinct EANs, half of the shops offer fewer than 91 products, and one quarter of the shops offer fewer than 15 products. There is even one shop that offers 79,076 products with distinct EANs. In Table 4.6, we complement the example given in the introduction with insights from our collected data. The products listed in the first column of the table represent product models from the BSH dataset that match product instances from Web shops based on identical EANs. In the current dataset, there exist 95 such matches based on EANs.

14 The exact number of the upper quartile is 526.5, but since the number of products is discrete, we herein refer to 526 products.


[Figure 4.4: Boxplot of the product offer count (with EANs) across Web shops in the crawl (min = 1, q1 = 15, median = 91, q3 = 526.5, max = 79,076)]

The comparison of the number of properties from the manufacturer with the number of properties from the retailers shows a significant gain from augmenting retail product data with manufacturers’ product model master data. For instance, take the vacuum cleaner (German: Bodenstaubsauger) in row 2 of Table 4.6. It has 30 product properties coming from the manufacturer and an average of nine properties across the three shops offering that product. The properties in the shops thus only amount to a fraction (30%) of the properties available from the manufacturer.

Table 4.6: Product features in BSH BMEcat versus data from retailers publishing GoodRelations markup

Product                                                       | BSH Features | Retailer                        | Retailer Features | Coverage*
TW86103 Wasserkocher (EAN: 4242003535615)                     | 25           | marketplace.b2b-discount.de     | 10                | 40%
Bodenstaubsauger Beutel VS06G2410 2400 W (EAN: 4242003356364) | 30           | www.ay-versand.de               | 10                | 30%
                                                              |              | www.megashop-express.de         | 9                 |
                                                              |              | fairplaysport.tradoria-shop.at  | 8                 |
Mikrowelle HF25M5L2 Edelstahl (EAN: 4242003429303)            | 51           | www.european-gate.com           | 7                 | 13.73%

* “Coverage” = ratio of the average number of retailer features to the number of BSH features

The relatively constant number of product features contributed by retailers may be explained by the shop extensions that typically expose only standard features like product name, GTIN, EAN, stock keeping unit (SKU), product weight, and product dimensions. Although this helps to explain the numbers to some extent, it does not change our premise that structured product master data is still lacking on the Web. We gathered all the data in a SPARQL-capable RDF store and extrapolated some statistics to substantiate the potential of our approach. The number of product models in the BSH dataset was 1,376 with an average count of 29 properties, while the Weidmüller BMEcat consisted of 32,585 product models with 47 properties on average created by our converter. By contrast, the product instances from the crawl contain only seven properties on average, the product offers 13, and the product models nearly twelve (see Table 3.2 in Chapter 3).
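The per-model property statistics can be computed with a query of roughly the following shape, sketched here with rdflib; the file name is a placeholder for the converter output loaded into the store:

from rdflib import Graph

g = Graph()
g.parse("bsh.rdf")  # placeholder path for the converter output

# Average number of distinct properties per gr:ProductOrServiceModel
QUERY = """
PREFIX gr: <http://purl.org/goodrelations/v1#>
SELECT (AVG(?cnt) AS ?avgProperties) WHERE {
  {
    SELECT ?model (COUNT(DISTINCT ?p) AS ?cnt) WHERE {
      ?model a gr:ProductOrServiceModel ;
             ?p ?o .
    }
    GROUP BY ?model
  }
}
"""
for row in g.query(QUERY):
    print("average properties per product model:", row.avgProperties)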


4.4.3 Leverage Effect of Product Master Data on the Web

Table 4.6 from Section 4.4.2 confirmed the scenario presented in Table 4.1 in the introduction of this chapter by comparing BSH product model data and structured product data from a sample of ~2,500 online shops. In this section, we present some specific examples of the number of online retailers that could readily benefit from leveraging our approach. To remain within the scope of the use cases discussed, the examples are chosen from the BSH BMEcat product catalog within the German e-commerce market. We checked the number of shops offering products using a sample of 90 random product EANs from the BSH BMEcat. The sample size was selected based on a 95% confidence level and a 10% confidence interval (margin of error), i.e. requiring a minimum of 90 samples given the population of 1,376 products in the BMEcat. Using the sample of EANs, we then looked up the number of vendors that offer the products by entering the EAN in the search boxes on Amazon.de15, Google Shopping Germany16, and the German comparison shopping site preissuchmaschine.de17. This gave us a distribution of shops grouped by EAN as outlined in the boxplots in Figure 4.5.
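As a sanity check on the stated sample size (a standard finite-population calculation, not spelled out in the text), with z = 1.96 for a 95% confidence level, p = 0.5, e = 0.1, and N = 1376:

n_0 = \frac{z^2\,p(1-p)}{e^2} = \frac{1.96^2 \cdot 0.25}{0.1^2} \approx 96.04,
\qquad
n = \frac{n_0}{1 + (n_0 - 1)/N} = \frac{96.04}{1 + 95.04/1376} \approx 89.8
\;\Longrightarrow\; n = 90.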

[Figure 4.5: Boxplots of the distribution of shop offers per EAN for http://www.preissuchmaschine.de/, http://www.amazon.de/, and http://www.google.de/shopping/]

The numbers that we got from this experiment were surprisingly small. For example, the maximum number of sellers offering a specific product was 48. For half of the products that we tested, at least 16 offers appeared in the price comparison search engine preissuchmaschine.de. In the Amazon.de and Google Shopping Germany marketplaces, by comparison, the number of offers for a product among the sample of product EANs was even lower. We can think of various explanations for this, namely that the marketplace regulations try to limit competition among market participants and, more importantly, that adding products to the marketplace presents a barrier to smaller shop owners (in the case of Google Shopping, a shop is asked to upload product data using a populated product feed or an application programming interface (API) [cf. Goo15b]). Furthermore, the small numbers may be due to (1) localized searches (all shopping comparison engines in the .de-domain), (2) the fact that shops rarely populate their products with EAN identifiers, or (3) the type of products in our sample, in this case from the domain of white goods, which are likely not the most popular product category for selling online. More precisely, unsupported small shop owners may not find it very attractive to sell dishwashers online given the logistical effort involved.

To put Figure 4.5 (boxplots) in perspective, we did a comparison with a more popular product, i.e. “Canon PowerShot A2300 schwarz” (with EAN “8714574578828”). We repeated the above searches with the same online services, but now using (a) the EAN of this digital camera model and (b) the product name, suspecting that many retailers do not populate their products with EANs but use other “strong identifiers” instead. Table 4.7 summarizes the results of this analysis. Amazon.de and preissuchmaschine.de consistently returned 45 and 233 results, respectively. Google Shopping Germany, however, gave only four results when searching by EAN, but 144 results for a search by product name. These results indicate that using a combination of different types of “strong identifiers” could leverage product master data on the Semantic Web.

Table 4.7: Product searches for a digital camera model on popular e-marketplaces

Marketplace             | EAN Hits | Product Name Hits
preissuchmaschine.de    | 233      | 233
Amazon.de               | 45       | 45
Google Shopping Germany | 4        | 144

15 http://www.amazon.de/ (accessed on August 23, 2014)
16 http://www.google.de/shopping/ (accessed on May 8, 2014)
17 http://www.preissuchmaschine.de/ (accessed on August 23, 2014)

Figure 4.6 shows the frequency distribution of EANs among the ~2,500 shops of our Web crawl. This information also contributes to an estimate of the impact of our approach. Almost half a million of the EANs are unique. For those unique products there is no big gain from manufacturers publishing product model master data, except for the high-quality product descriptions. The benefit becomes clearer for the 101 EANs that appear between 100 and 250 times. If we managed to persuade a single manufacturer of these products to publish product model master data on the Web of Data, then at least 100 retailers could benefit immediately. The expected lever is even higher given that manufacturers usually produce more than one good, and that the effort for publishing the full product catalog as structured data is similar to that for a single product, since in either case there remains one BMEcat file to convert. Moreover, most retailers do not offer only one but many product items by various manufacturers.


[Figure 4.6: Frequency distribution of EANs with respect to the number of product offers for a particular EAN in the dataset (log-scaled y-axis; annotations: about 6,000 EANs are offered exactly 5 times, and 1,230 EANs are offered between 11 and 25 times)]


4.5 Conclusion

The proliferation of online retailers in recent years was accompanied by a growing number of products being offered on the Web. Such a substantial increase of online goods introduces new data management challenges. More specifically, it concerns how information, in particular products, features, or descriptions, can be processed by stakeholders along the product lifecycle. Our experience from a survey of ~2,500 different-sized online merchants indicates that, under current conditions, product data from retail sites suffers from incomplete, inconsistent, or outdated product detail information.

In this chapter, we have presented a conceptual mapping and workflow for lifting product model master data from the popular BMEcat standard to the GoodRelations data model and the RDF meta-model, allowing the Semantic Web approach to be used for augmenting product information on the Web. As a practical solution to mitigate the shortage of product master data in the context of e-commerce on the Web of Data, we have proposed the BMEcat2GoodRelations converter. This ready-to-use solution comes as a portable command line tool that converts product master data from BMEcat XML files into a corresponding OWL representation on the basis of the GoodRelations ontology. All interested merchants then have the possibility of electronically publishing and consuming this authoritative manufacturer data to enhance their product offerings, relying on widely adopted product “strong identifiers” such as EAN, UPC, GTIN, or MPN. Alternatively, consumers of retail site markup could augment the raw data therewith. We argue that the construction of a firm basis of product master data is the prerequisite for useful product discovery and matchmaking scenarios. The data we have collected and analyzed should motivate manufacturers to release their product master data and encourage retailers to attach strong identifiers to their products. The immediate impact would be a huge lever for enriching online offers with product features and reduced data cleansing effort thanks to more high-quality data. Both factors would pave the way for more granular data analysis and a better search experience for organizations and individuals.

5 Product Type Information

5.1 Problem Statement
5.2 State of the Art and Related Work
    5.2.1 Product Classification Standards
    5.2.2 Proprietary Product Category Systems
    5.2.3 Other Approaches
5.3 Deriving Product Ontologies from Knowledge Organization Systems
    5.3.1 Conceptual Approach
    5.3.2 Transformation of a Product Classification System
    5.3.3 Converting Property Types, Range Information, and Enumerated Values
    5.3.4 Serialization and Deployment
5.4 Evaluation
    5.4.1 Correctness of the Derived Product Ontologies
    5.4.2 Statistics on New Product Classes and Properties
5.5 Discussion
    5.5.1 Classification of Product and Offer Descriptions
    5.5.2 Navigation over Product and Offer Data
    5.5.3 Semantic Annotation of Products and Offers on the Web
5.6 Conclusion

The classification of products and services greatly facilitates reliable and efficient electronic exchanges of product data between organizations. Many companies classify products (a) according to generic or industry-specific product classification standards, or (b) by using proprietary category systems. Such classification systems often contain thousands of product classes that are updated as needed (e.g. to cover new types of products), which implies a large quantity of potentially useful product category information for e-commerce applications on the Web of Data. Thus, instead of engineering product ontologies from scratch, which is costly, tedious, error-prone, and requires maintenance effort, it is generally attractive to derive them from existing classifications. This approach has been studied in the literature before, e.g. [CG01; Kle02; ZL03; Hep05b; Hep06].


In this chapter, we (1) describe a generic, semi-automated method for deriving Web Ontology Language (OWL) ontologies from product classification standards and proprietary category systems, which is conceptually based on the GenTax algorithm [HdB07] and the approach used for eClassOWL [Hep05b], but extended and updated to match Linked Open Data (LOD) principles [Ber06]. Moreover, we (2) show that our approach generates logically and semantically correct vocabularies, and (3) demonstrate the practical benefit of our approach. The resulting product ontologies are compatible with the GoodRelations vocabulary for e-commerce and schema.org, and they can be used to enrich product and offer descriptions on the Semantic Web with granular product type information from existing data sources.

5.1 Problem Statement

The categorization of products and services plays a crucial role for many businesses and business applications [Fen+01]. It enables reliable and efficient electronic transactions on product data between organizations in a dynamic domain characterized by innovation and a high degree of product specificity. In concrete terms, product classes allow for intelligent decision-making and operations over aggregated data, e.g. facilitating spend analysis [cf. HLS07] or enhanced navigability and search within product catalogs [cf. HS00]. The ability to operate on groups of products is often superior to applying heuristics to unstructured product descriptions, especially in tasks that require abstractions over individual product models or that depend on accessing subtle differences between competing products. For instance, a search for a personal computer relying on textual matches will not only return personal computers but probably related accessories or books as well that discuss the broad topic of personal computers. With class membership information, by contrast, it is possible to reliably distinguish between personal computers and related, but not necessarily relevant, products. Moreover, it facilitates querying an exhaustive set of personal computers, which would otherwise, with heuristics, be difficult and expensive.

In practice, organizations often arrange products and services according to informal product classification systems that are not based on knowledge representation principles, e.g. eCl@ss [EClND] or UNSPSC [UniND]. At the same time, the number of high-quality, practically relevant product ontologies on the Web is still limited [Hep07a], among other reasons because most ontology engineering work is done in the context of academic research projects where efforts rarely go beyond early prototypes [Hep05b]. For serious e-commerce applications on the Web of Data, though, we need a broad domain coverage of specific classes, properties, and enumerated values for describing products and services. For this reason, a cost-efficient solution able to accommodate business needs on the Web of Data would be very useful.

Product classification systems are suitable candidates for creating high-quality and low-cost product ontologies for the Web [e.g. Hep06]. In many areas of e-commerce, where domains are typically composed of thousands of classes and properties, it proves difficult to engineer domain ontologies manually, because that would imply getting hold of a large number of concepts [Hep06]. Moreover, the conceptual dynamics [Hep07a] underlying the domain of products and services, determined by ongoing product innovation and a high degree of product specificity, make the manual creation of product ontologies even more problematic [Hep06]. Let us exemplify the situation by comparing the release sizes [ECl14] of different versions of eCl@ss [EClND], a comprehensive industry standard for the classification and description of products and services (see Figure 5.1): While eCl@ss 5.1.4 had defined 30,329 classes in 2007, eCl@ss 6.1, announced only two years later, already counted 32,795 classes. The differences become even more evident between eCl@ss 6.1 and eCl@ss 9.0 BASIC, with an increase of 25%, reaching 40,870 concepts within only five years.

[Figure 5.1: Conceptual dynamics of the eCl@ss product categorization standard [based on ECl14]: number of classes, properties, and values per release, from version 3.0 (30.04.00) through 9.0 BASIC (08.12.14); annotated class counts: 30,329 (5.1.4), 32,795 (6.1), 40,870 (9.0 BASIC)]


In place of engineering new domain ontologies, it is often more practical to derive product ontologies from works already in place, i.e. to reuse existing industrial taxonomies, as argued in [Hep06]. This has several benefits:

1. The product classifications provide a comprehensive coverage of the conceptual domains [Hep06], often in multiple languages.

2. There is no significant overhead involved in maintaining derived product ontologies; on the contrary, they are automatically kept up-to-date with amendments to the classifications conducted by domain experts in response to changes in the real world [cf. Hep06].

3. Existing industrial standards are popular and thus already in wide use to classify product instance data [cf. Hep06]. In other words, a large amount of products in relational databases are already classified according to such product categorization standards.

At the same time, numerous Web shops create and maintain proprietary category systems along with their product catalogs. Hence, instead of manually crafting complex domain ontologies and thereby in a sense reinventing the wheel, it appears sensible to unlock the potential of existing, well-maintained knowledge organization structures and to classify products on the Semantic Web according to them.

Unfortunately, product classifications are difficult to use for the Semantic Web in their raw form. They generally offer weak, ambiguous “topic” semantics, i.e. the same category can be used for very different types of entities. Furthermore, they only define an informal hierarchy, so that it is unclear whether a subsumption hierarchy should be described using rdfs:subClassOf, skos:broader/skos:narrower, skos:broaderTransitive/skos:narrowerTransitive, etc. Even if a hierarchical relationship between classes in a classification system could be described using the rdfs:subClassOf property (e.g. ex:Car rdfs:subClassOf ex:Vehicle), inconsistencies may still arise when the same relationship is applied to another subsumption path of the same classification system (e.g. ex:Gearbox rdfs:subClassOf ex:Vehicle).

In this chapter, we present a generic approach and a fully-fledged, modular, and largely automated tool for deriving Web ontologies from product classification systems. We show that our approach generates logically and semantically correct domain ontologies in OWL DL that

1. establish canonical Uniform Resource Identifiers (URIs) for every conceptual element in the original schema,

2. preserve the taxonomic structure of the original classification while making its categories usable in multiple contexts,

3. comply with the GoodRelations vocabulary for e-commerce [Hep08a] and schema.org, and

4. can be readily deployed according to LOD principles [HB11] on the Web of Data.

The results of our transformation unlock additional semantics that enable novel Web applications. Thanks to the enrichment of product master data and a more granular description of offers by virtue of product ontologies, search engines and other consumers of structured data can take advantage of product type information for product search, comparison, and matchmaking.

The rest of this chapter is structured as follows: Section 5.2 covers relevant background about product classification systems and compares our proposal with relevant alternatives in the literature; Section 5.3 introduces our approach, whose results are then evaluated and discussed in Sections 5.4 and 5.5; and finally, Section 5.6 concludes our work and discusses future extensions.

5.2 State of the Art and Related Work

For the scope of this research, we distinguish two groups of classification schemes (systems) relevant to the domain of commercial products and services: product classification standards and proprietary product category systems (or structures). The main aspects of both groups are discussed in this section. Additionally, there is further relevant information that cannot be included here due to space limitations but is available online1. This supplementary material gathers a series of key attributes for every classification system, comprising version, organization(s) authoring and managing the classification, available data sources, official report, target usage domain, intended regional use, and level of multilingual support.

5.2.1 Product Classification Standards

Product classification standards (or product categorization standards) are widely accepted knowledge structures often consisting of thousands of categories [cf. HLS07]. They typically comprise: (a) hierarchical structures for the aggregation of products, which for example allow for spend analysis or reasoning over hierarchical relations; (b) common features and values related to product categories; and (c) multilingual descriptions of the elements that constitute the standard.

1 http://www.ebusiness-unibw.org/ontologies/pcs2owl/ (accessed on September 16, 2014)

The product classification standards that we considered at the time of this research are:

• Classification of Products by Activity (CPA) [Eur08b],
• Central Product Classification (CPC)2,
• Common Procurement Vocabulary (CPV) [Eur08a],
• eCl@ss3,
• ElektroTechnisches InformationsModell (Engl.: Electro-Technical Information Model) (ETIM)4,
• FreeClass [Han07],
• Global Product Classification (GPC) [GS105],
• proficl@ss5, and
• Klassifikation der Wirtschaftszweige (Engl.: German Classification of Economic Activities) (WZ) [Sta08].

The featured standards are based on industry consensus and exist for various business fields, both in horizontal and vertical industries. eCl@ss, proficl@ss, and GPC, for example, describe a wide range of products from multiple industrial sectors. By contrast, CPV is intended for the procurement domain, whereas ETIM is focused on the field of electronics. Two standards, CPA and WZ, put forward classifications of comprehensive economic activities rather than product classifications per se. Nonetheless, commercial products can be classified according to them, and their use is common among governmental publishers of statistical data. To solve potential ambiguity problems of product names, standards such as eCl@ss, ETIM, and proficl@ss include synonyms to provide discriminatory features [cf. Nav09] and to retain higher recall in product search scenarios. Furthermore, many standards (CPA, CPV, FreeClass, and WZ) feature translations in various languages.

2 http://unstats.un.org/unsd/cr/registry/cpc-2.asp (accessed on September 16, 2014)
3 http://www.eclass.de/ (accessed on May 16, 2014)
4 http://www.etim.de/ (accessed on May 16, 2014)
5 http://www.proficlass.de/ (accessed on September 16, 2014)

5.2.2 Proprietary Product Category Systems

Proprietary product category systems (or catalog group systems, category structures) are also suited for organizing products and services. Unlike product classification standards, catalog group systems are generally characterized by little community agreement. Instead of communities or standardization bodies, single organizations or small interest groups take the lead in the development of such category structures. Thus, they are accepted only by a relatively small number of stakeholders, and their usage is often limited to a narrow context, e.g. to represent a navigational structure in a Web shop. Some examples of catalog group hierarchies considered in the context of this work are proprietary product taxonomies like the Google product taxonomy [Goo13] and the productpilot6 category system (the proprietary category structure of a subsidiary of Messe Frankfurt), as well as product categories transmitted via catalog exchange formats like BMEcat7 [SLK05a]. The latter can take advantage of both product categorization standards and catalog group structures in order to organize types of products and services and to contribute additional granularity in terms of semantic descriptions, as previously covered in Chapter 4 [see also SRH13b].

6 http://www.productpilot.com/ (accessed on September 16, 2014)
7 Developed by the German Bundesverband Materialwirtschaft, Einkauf und Logistik (Engl.: Federal Association of Materials Management, Purchasing and Logistics) (BME).

5.2.3 Other Approaches

This research work partially builds upon previous works in the area of transforming classification standards into Web ontologies. The challenges in the conversion of product classification standards were already discussed in [Hep05a; Hep06], whose findings led to the development of the GenTax algorithm in [HdB07], still a core component of our solution. The subsequent initial release of the GoodRelations ontology [Hep08a] motivated the first transformation of the eCl@ss standard [cf. Hep05b] (version 5.1.48) into a GoodRelations-compliant ontology relying on the GenTax methodology. In addition, there have been previous efforts to convert other product classification schemes that are also supported by our tool, most notably CPV ([PAA08], and another effort in the context of a project concerned with the publishing of open government data9), primarily used to streamline the procurement and tendering process in the public sector.

On a broader scope, the research in [Vil11] provides the most recent and comprehensive survey of methods and tools for refactoring most types of non-ontological resources into ontological resources, i.e. Web ontologies. Villazon-Terrazas [Vil11] developed a comprehensive qualitative framework to categorize non-ontological resources based on their characteristics. One of the types of non-ontological resources acknowledged in his work are in fact general classification schemes for any given domain, such as those for products considered in the current research. Notably, two methods, [Hak+06] and, again, GenTax, as well as one tool, SKOS2OWL10 [HR09], are identified as focusing on the conversion of classification schemes into Web ontologies.

Yet, in summary, to the best of our knowledge, the approach described in this chapter is the only methodology with mature tool support that extends the features and capabilities of all the conversion efforts previously mentioned on at least one, if not several, of the following fronts: (1) the level of automation; (2) a modular, extensible architecture supporting the conversion of an arbitrary number of classification systems; (3) the applicability to a broad set of non-ontological resources, i.e. almost all relevant classification schemes; (4) traceability, including preservation of the taxonomic structure between the elements in the original classification scheme and those in the derived Web ontology; (5) improved support for properties and enumerations; (6) a high degree of configuration options aimed at deployment on the Web of Linked Open Data (LOD); and, lastly, (7) compliance with the GoodRelations and schema.org vocabularies, which currently allows for the publication of product information in various Web data formats (e.g. Microdata and Resource Description Framework in Attributes (RDFa)).

8 http://www.heppnetz.de/projects/eclassowl/ (accessed on September 16, 2014)
9 http://linked.opendata.cz/resource/dataset/cpv-2008 (accessed on September 16, 2014)
10 http://www.heppnetz.de/projects/skos2owl/ (accessed on September 16, 2014)

5.3 Deriving Product Ontologies from Knowledge Organization Systems

In this section, we present a generic, semi-automated approach to turn standards and proprietary product classification systems into respective product ontologies. Subsequently, we outline the conceptual architecture of our proposal, followed by a description of the conceptual transformation.

5.3.1 Conceptual Approach

Figure 5.2 depicts the conceptual approach of PCS2OWL11. The tool consists of a modular architecture that builds upon three layers, namely parser, transformation process, and serializer. Prior to executing the script, a moderate amount of initial human labor is needed, mainly to prepare the import modules (parsers) for the respective classification systems, as indicated by the dashed rectangle in Figure 5.2. This preliminary task includes providing the essentials for mapping the taxonomy and setting up the handling of property types. Apart from defining these details, the parsers’ purpose is to load categories, features, and values of product classification systems into an internal model, which specifies ontology classes, properties, and individuals.

11 http://wiki.goodrelations-vocabulary.org/Tools/PCS2OWL (accessed on May 22, 2014)

[Figure 5.2: Conceptual architecture of PCS2OWL: product categorization standards (CPA 2008, CPC Ver.2, CPV 2008, eCl@ss 5.1.4 and 6.1, ETIM 4.0, FreeClass, GPC, proficl@ss 4.0, WZ 2008) and proprietary hierarchies and catalog group structures (Google product taxonomy, productpilot, BMEcat catalog groups) in formats such as .xls, .csv, .xml, .txt, .mdb are read by custom parsers into objects (classes, properties, individuals), transformed into RDF, and serialized as RDF/XML, HTML, and sitemap.xml]

The next steps, the transformation and serialization processes, are fully automated. In the transformation step, the internal model, consisting of entities for classes, properties, and individuals, is turned into a Resource Description Framework (RDF) graph that describes the final ontology. At this stage, the logical rules from the parsers are also applied to the internal model. Finally, the RDF model is serialized as RDF/XML, and all other files required for the online deployment of the product ontologies are created accordingly.

In the context of this work, we have developed custom parsers for a number of popular categorization standards and proprietary taxonomies for products and services, previously introduced in Sections 5.2.1 and 5.2.2 and outlined in Figure 5.2. The input formats of the source files of the classification systems are irrelevant to the converter, since the parsers have to be hand-crafted anyway. For our conversions, we had to deal with Excel spreadsheets (files ending in “.xls”), comma-separated values (CSV) files (“.csv”), Extensible Markup Language (XML) files (“.xml”), database tables (“.mdb”), and plain text files (“.txt”). The effort necessary for developing a parser module is negligible in comparison to hand-crafting a product ontology from scratch. For simple classification systems such as GPC or the Google product taxonomy, which merely comprise classes and do not define properties, it was for example sufficient to extend an empty parser template by only twenty lines of additional code. Even the most complex parser module that we have created so far required less than 200 lines of custom code. This module, for the FreeClass classification standard, includes sophisticated rules for raising the data quality of the resulting product ontology. A minimal parser skeleton of this kind is sketched below.
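The following sketch suggests what such a minimal parser module could look like for a classes-only taxonomy; class and method names are hypothetical and do not reproduce the tool's actual interfaces:

import io

class GoogleTaxonomyParser:
    """Hypothetical minimal parser: reads 'A > B > C' lines into an
    internal model of categories (full path -> label and parent path)."""

    def __init__(self, source):
        self.source = source  # any iterable of lines

    def parse(self):
        categories = {}
        for line in self.source:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            parts = [p.strip() for p in line.split(">")]
            for depth in range(len(parts)):
                path = " > ".join(parts[: depth + 1])
                parent = " > ".join(parts[:depth]) or None
                categories[path] = (parts[depth], parent)
        return categories

lines = io.StringIO("Cameras & Optics > Cameras > Digital Cameras\n")
print(GoogleTaxonomyParser(lines).parse())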

5.3.2 Transformation of a Product Classification System

A core aspect of the transformation step is the creation of the product classes in the resulting ontology based on the source product classification system. To create the ontology classes, the PCS2OWL tool relies on the GenTax approach introduced in [HdB07]. GenTax allows generating a consistent OWL DL ontology while preserving the taxonomic structure of the original categories in the product classification system. To do so, the GenTax method creates, for each category in the product classification system, two corresponding OWL classes in the target ontology:

1. Broad topic: a broader taxonomic class that represents the category from the product classification system.

2. Specific type: a context-specific class, in our case in the domain of products and services.

For a given category identified as “ID” in the original product classification system, let us hereinafter refer to the pair of OWL classes that GenTax creates as C_ID-gen and C_ID-tax, following the naming convention of the original GenTax specification [HdB07]. There are additional design decisions that are applied in the conversion process to create the classes and the class structure of the resulting ontology [cf. HdB07]:

1. All taxonomic classes (C_ID-tax) are arranged in a subsumption class hierarchy via the rdfs:subClassOf relation to preserve the hierarchical structure of the corresponding categories in the original product classification system.

2. Every context-specific class (C_ID-gen) is defined as a subclass of the common GoodRelations product class gr:ProductOrService via the rdfs:subClassOf property to state that it represents entities that are products.

3. Every context-specific class (C_ID-gen) is at the same time also a subclass of the corresponding C_ID-tax taxonomic class, to preserve its traceability to the category in the original product classification system that it was derived from.

4. There exist no subsumption relationships between context-specific classes (C_ID-gen), because it is not possible to determine automatically whether a subsumption relation between two C_ID-gen classes holds, due to frequent anomalies and ambiguities in the original categorization schemas [HdB07].

Figure 5.3 shows an example that results from the conversion of the following fragment of the English version of the Google product taxonomy [Goo13]:

Cameras & Optics > Cameras > Digital Cameras
Cameras & Optics > Cameras > Disposable Cameras
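Under the naming conventions just introduced, the pair creation and the four design decisions can be sketched as follows, in Python with rdflib; the ontology namespace and the local-name helper are our own illustrative choices:

from rdflib import Graph, Literal, Namespace, RDF, RDFS
from rdflib.namespace import OWL

GR = Namespace("http://purl.org/goodrelations/v1#")
EX = Namespace("http://example.org/pto#")  # hypothetical ontology namespace

PATHS = [  # the Google taxonomy fragment from above
    ["Cameras & Optics", "Cameras", "Digital Cameras"],
    ["Cameras & Optics", "Cameras", "Disposable Cameras"],
]

def local_name(label):
    """Illustrative local-name minting: 'Digital Cameras' -> 'C_DigitalCameras'."""
    return "C_" + "".join(ch for ch in label.title() if ch.isalnum())

g = Graph()
g.bind("gr", GR)
for path in PATHS:
    parent_tax = None
    for label in path:
        tax = EX[local_name(label) + "-tax"]   # broad topic
        gen = EX[local_name(label) + "-gen"]   # specific product type
        g.add((tax, RDF.type, OWL.Class))
        g.add((gen, RDF.type, OWL.Class))
        g.add((tax, RDFS.label, Literal(label + " [Category]", lang="en")))
        g.add((gen, RDFS.label, Literal(label + " [Product type]", lang="en")))
        if parent_tax is not None:
            g.add((tax, RDFS.subClassOf, parent_tax))          # decision 1
        g.add((gen, RDFS.subClassOf, GR["ProductOrService"]))  # decision 2
        g.add((gen, RDFS.subClassOf, tax))                     # decision 3
        # decision 4: deliberately no gen-gen subsumption axioms
        parent_tax = tax

print(g.serialize(format="turtle"))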

[Figure 5.3: GenTax applied to a subset of the Google product taxonomy [cf. HdB07] — left panel: “Generic Concepts (Objects of a clearly defined type)”, the “-gen” classes below gr:ProductOrService; right panel: “Category Concepts (Topics, Categories)”, the “-tax” classes.]

Figure 5.3 exhibits all four design decisions of the GenTax algorithm outlined previously. The right side shows the taxonomic class hierarchy, whereas the left part describes the context-specific class hierarchy. The black solid arrows stand for the rdfs:subClassOf relationships. As indicated, (1) the taxonomic classes represent the categories in the Google product taxonomy and preserve the same hierarchical structure; (2) the context-specific classes represent actual products and services and, hence, are subsumed by gr:ProductOrService; (3) all context-specific (product) classes are at the same time subclasses of their respective taxonomic class, e.g. the product class C_Cameras-gen is a subclass of the broad topic C_Cameras-tax; and (4) no subsumption relation is imposed upfront between the product classes, thus, in visual terms, they are arranged as pair-wise siblings. This design decision is based on the observation that the hierarchical relationships in informal category systems frequently suffer from modeling anomalies attributable to their specific intended usage; for a deeper analysis, see [Hep05a; Hep06; HdB07]. Nonetheless, a hierarchy between product classes might still be established, either manually or automatically. If set up automatically, a statistical test might be necessary to safeguard the reliability of the conceptual choice (e.g. the use of rdfs:subClassOf relationships to represent the hierarchy), i.e. by taking random samples whose validity is checked.

The adoption of the GenTax approach provides several features to the resulting ontologies produced by the PCS2OWL tool. GenTax creates meaningful, practically useful product classes (i.e. “-gen” classes on the left side of Figure 5.3) by defining these as subclasses of gr:ProductOrService, which at the same time renders the resulting OWL DL product ontology compatible with GoodRelations and schema.org. By preserving the hierarchical structure of the product classification system (i.e. “-tax” classes on the right side of Figure 5.3), GenTax allows the execution of generalization/specialization queries based on the original product classification system. For example, it makes it possible to query the common category C_Cameras-tax in order to get the union of all instances of the classes C_DigitalCameras-gen and C_DisposableCameras-gen. The use of the rdfs:subClassOf relationship in the taxonomic classes means that no reasoning capabilities beyond the widely supported RDF Schema (RDFS) inferencing are required to navigate through the taxonomic structure of the original product classification system in the generated ontology. Additionally, for traceability and provenance purposes, every class indicates the ontology that it is described by through the rdfs:isDefinedBy property; moreover, every taxonomic class specifies a hierarchy code (materialized via an annotation property :hierarchyCode) to link it to the corresponding category code used in the source classification system.
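To make the generalization query concrete, the following SPARQL sketch retrieves all product instances below the “Cameras” category. The class name follows the illustrative naming of Figure 5.3 (the deployed ontologies use hierarchy codes instead, see Section 5.3.4), with the empty prefix assumed to be bound to the ontology namespace; the property path rdfs:subClassOf* stands in for RDFS inferencing:

# Union of the instances of C_DigitalCameras-gen, C_DisposableCameras-gen,
# etc., obtained by walking the taxonomic hierarchy downward.
SELECT DISTINCT ?product WHERE {
  ?product a ?type .
  ?type rdfs:subClassOf* :C_Cameras-tax .
}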

5.3.3 Converting Property Types, Range Information, and Enumerated Values

Our approach extends the GenTax algorithm of the previous section. In addition to the extraction of OWL classes from hierarchical classifications, PCS2OWL converts features and feature values of product classification systems, thus contributing additional semantics to categories. The different types of properties that are supported by the tool are in line with the GoodRelations ontology and consist of

• qualitative properties (gr:qualitativeProductOrServiceProperty),
• quantitative properties (gr:quantitativeProductOrServiceProperty), and
• datatype properties (gr:datatypeProductOrServiceProperty).

Similarly, our tool distinguishes between enumerations or qualitative values (gr:QualitativeValue, e.g. the color “red”), quantitative values (gr:QuantitativeValue of type xsd:float or xsd:integer, e.g. values that indicate ranges like “500 milliliters”), and literal values with datatypes (xsd:float, xsd:integer, xsd:boolean, or xsd:string). Custom rules and heuristics guide the distinction of the property types and related values. They have to be provided with the parser modules so that they can be applied in the subsequent transformation step, where the respective OWL properties are generated automatically. The quality of the conversion therefore strongly depends on the correctness of these rules: As a general rule of thumb, a numerical value accompanied by a unit code in the product classification system will yield a quantitative value in the resulting product ontology, and not a qualitative value or a datatype literal. Similarly, if there are predefined values in a classification system, the corresponding properties and the individuals created from the values themselves will be of qualitative nature. Sometimes, classification standards even provide additional information to facilitate the distinction of the intended type of features and values. For example, ETIM indicates logical values with an “L” metadata flag, which are hence best mapped to boolean literals in RDF. A corresponding guideline for eCl@ss was detailed in [Hep08b].
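To sketch what these typed properties enable downstream, consider the following hedged SPARQL query, which selects products whose weight lies within a range; the property ex:P_weight and the concrete values are illustrative assumptions, not part of any converted standard:

SELECT ?product WHERE {
  ?product ex:P_weight ?q .          # hypothetical generated quantitative property
  ?q a gr:QuantitativeValueFloat ;
     gr:hasUnitOfMeasurement "GRM"^^xsd:string ;  # UN/CEFACT code for gram
     gr:hasValueFloat ?v .
  FILTER (?v >= 3.0 && ?v <= 5.0)
}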

5.3.4 Serialization and Deployment

In this section, we describe the serialization and deployment of the resulting product ontologies. This includes deciding on a canonical URI pattern for publishing the entities on the Web, and providing alternative ways to support standards-compliant Web ontology deployments. The product classes and related entities in the ontologies obey a common URI pattern, which is comprised of

1. the base URI of the ontology;

2. a prefix to help humans distinguish URIs of different entity types, namely “C_” for classes, “P_” for properties, and “V_” for values;

3. an identifier that is unique in the context of the category system, which for categories is typically the hierarchy code; and, for classes,

4. a suffix to distinguish context-specific or generic (“-gen”) classes from taxonomic (“-tax”) classes.

Following this pattern, the URI of the context-specific class “Disposable Cameras” (hierarchy code “10001488”) in the resulting GPC product ontology is

http://www.ebusiness-unibw.org/ontologies/pcs2owl/gpc/C_10001488-gen

PCS2OWL offers two deployment alternatives for product ontologies, namely based on hash and slash URIs. Hash URIs use the number sign character (“#”) to refer to entities, whereas slash URIs use a namespace that ends with a slash character (“/”) and address entities directly. In the latter case, the difference between entities and their respective document representations is established using Hypertext Transfer Protocol (HTTP) forwarding with the status code 303 See Other [SC08, Section 4.2]. The two deployment alternatives are described in [SC08, Section 4]. The first option generates a single comprehensive dump of the RDF graph, which is serialized as RDF/XML. The downside of this approach is the potentially huge file size, which can make it infeasible for large classification systems: Every HTTP request against the URI of a single element from the resulting ontology requires the transmission of the entire representation, because the hash fragment part is not sent to the server according to the specification for URIs [BFM05, Section 3.5; SC08, Section 4.1]. By contrast, the slash-based option generates a series of small RDF files, comprising separate files for all taxonomic and generic classes and, if available, also for properties and individuals. This has the advantage of serving smaller chunks of data for individual elements compared to the full dump. Moreover, with this option the tool creates a navigable documentation consisting of a set of interlinked Hypertext Markup Language (HTML) pages that mimic the hierarchical relationships. The two deployment alternatives imply different URI patterns of the following form:

http://example.org/pcs#C_1234-gen → hash-based
http://example.org/pcs/C_1234-gen → slash-based

In addition to the creation of RDF/XML and HTML files, PCS2OWL also generates a Semantic Sitemap [Cyg+ 08] and an .htaccess file (see https://httpd.apache.org/docs/current/howto/htaccess.html) for the easy deployment on an Apache Web server. Content negotiation [FR14d, Sections 3.4 and 5.3] for the delivery of the Web resources is ensured using best practice patterns described online [BP08, Recipe 5]. For slash URIs, this means that by dereferencing an arbitrary entity URI (e.g. a class URI), an HTML-preferring client is redirected to a respective HTML document using the HTTP response status code 303 See Other. Similarly, the client retrieves RDF/XML if the media type supplied with the HTTP Accept header is application/rdf+xml. In this sense, our approach constitutes a fully LOD-compliant deployment [HB11].

5.4 Evaluation

In the following, we evaluate our approach. We focus on two key aspects, namely on the correctness of the conversion results and on the amount of new product classes, properties, and enumerations obtained.

5.4.1 Correctness of the Derived Product Ontologies

In this part of the evaluation, we analyze whether the product ontologies properly reflect the elements and the hierarchical structure provided by the original product classification systems. We first did a quantitative comparison of the conceptual elements in the product classification systems and all classes, properties, and individuals of the corresponding product ontologies. For this purpose, we examined the number of concepts in the source files or database tables and contrasted them to the number of files produced for related types of concepts, e.g. the number of taxonomic classes in the ontologies. If the numbers matched, it implied that the concepts were successfully converted and are contained in the product ontologies, which was actually the case for all ontologies that we have generated.

We complemented and further confirmed our previous findings by an experiment conducted on a product ontology derived from the Google product taxonomy [Goo13]. The taxonomy file is publicly available online (http://www.google.com/basepages/producttype/taxonomy.en-US.txt) as plain text. It is line-based and characterized by a category tree whose hierarchical structure is expressed using delimiting angle brackets as follows:

Food, Beverages & Tobacco > Beverages > Coffee > Coffee Pods

The taxonomy is read from the left, starting with the most generic concept and getting more specific moving to the right. Accordingly, “Coffee” is a more specific concept than “Beverages” with respect to the Google product taxonomy. Our idea was basically to reverse-engineer the original taxonomy starting from the product ontology, which we loaded into a SPARQL Protocol and RDF Query Language (SPARQL) endpoint. A set of appropriate SPARQL queries allowed us to build up the whole hierarchy in a top concept → ... → bottom concept fashion. We then concatenated the respective RDFS labels using the exact same delimiters as advocated by the Google product taxonomy file format. And finally, the results of the concatenation were compared to the contents of the original source file. This way we were able to recreate an equivalent copy of the original file, which confirms the completeness and reliability of our conversion. Figure 5.4 illustrates the single steps of our evaluation approach, which are described in more detail online (http://www.ebusiness-unibw.org/ontologies/pcs2owl/evaluation/).

[Figure 5.4: Reverse-engineering of the Google product taxonomy — from the original text file via conversion to the class hierarchy (top node “Food, Beverages & Tobacco” down to leaf node “Coffee Pods”), and back to the textual form via SPARQL, concatenation, and validation.]
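The core of the reconstruction can be sketched as a SPARQL query of the following kind (a simplified sketch; the actual queries are documented on the evaluation page cited above). It collects the parent–child label pairs of the taxonomic classes, from which the delimited lines of the taxonomy file are rebuilt:

SELECT ?parentLabel ?childLabel WHERE {
  ?child rdfs:subClassOf ?parent ;
         rdfs:label ?childLabel .
  ?parent rdfs:label ?parentLabel .
}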

The last step in the evaluation of the conceptual correctness comprised a qualitative assessment of the consistency of the classification hierarchies. An important observation is that taxonomies are mostly created for a special purpose, e.g. to provide a navigational structure for products in a Web shop or to classify products from a procurement perspective. If we were able to show, though, that a significant share of the taxonomic classes form a valid subsumption hierarchy in the context of products and services, then the taxonomic relationships could be adopted for the product classes as well. This would add a lot of value to e-commerce scenarios, e.g. by enhancing reasoning capabilities over products. For a related experiment, we chose the product ontology of GPC. We looked for inconsistent paths in the subsumption hierarchies with respect to the domain of products and services. Figure 5.5 exemplifies (a) a valid and (b) an invalid subsumption path.

[Figure 5.5: Examples of valid and invalid subsumption relations from the GPC hierarchy when interpreted as product classes — (a) valid: [50000000] Food/Beverage/Tobacco > [50200000] Beverages > [50202300] Non Alcoholic Beverages Ready to Drink > [10000223] Juice Drinks Ready to Drink (Shelf Stable); (b) invalid: [65000000] Computing > [65010000] Computers/Video Games > [65010300] Computer Drives > [10001134] Swap Drives.]

As Figure 5.5a demonstrates, “Juice Drinks” represents a valid subsumption path, both in the original context and in the target domain of products or services. Likewise, the subsumption path for swap drives is perfectly valid in the original context, since GPC is a classification standard for trading in the supply chain. However, it is invalid with regard to products and services (see Figure 5.5b): “Computer Drives” and therefore “Swap Drives” are not subclasses (or specializations) of “Computers/Video Games”; rather, they are parts of them. Many such inconsistent examples can be found in GPC (we drew a representative random sample of subsumption paths as explained in [HdB07], and a large number of relationships were identified as invalid with respect to the domain of products and services), which led us to conclude that the product taxonomy for the GPC product ontology cannot simply be derived from the original classification system automatically. A similar example of an inconsistent subsumption relationship, encountered above, are the “Coffee Pods” from the Google product taxonomy, which are not true specializations of “Beverages” and “Coffee”, but consumables for coffee machines.

5.4.2 Statistics on New Product Classes and Properties

In Section 5.1, we have argued that our approach produces a large number of readily usable product classes for the Web that would be infeasible to craft and maintain manually. In order to support this claim, we will now analyze relevant statistics about the derived product ontologies (available at http://www.ebusiness-unibw.org/ontologies/pcs2owl/). As a preliminary step, we loaded all product ontologies into a SPARQL endpoint. Storing each product ontology as a different named graph [Car+ 05] (urn:cpa, urn:gpc, etc.) allowed us later on to execute SPARQL queries based on their graph names. To give an example, we used the SPARQL 1.1 query of Listing 5.1 (prefix declarations omitted) to determine the number of hierarchy levels in the product ontologies.

SELECT (COUNT(DISTINCT ?c) AS ?num_classes) WHERE {
  GRAPH <urn:gpc> {
    ?c a owl:Class .
    ?c rdfs:subClassOf{3} ?sc .
    FILTER NOT EXISTS { ?c rdfs:subClassOf gr:ProductOrService }
  }
}

Listing 5.1: Calculating the number of hierarchy levels of product classification systems

We executed the query repeatedly, where in every step we incremented the property path length by one unit until we obtained no results. Increasing the property path length from three to four in the provided example yields zero results, meaning that the hierarchy depth of the product ontology is four, i.e. the longest existing path consists of four classes linked by three consecutive rdfs:subClassOf predicates. The FILTER statement of the query assures that only taxonomic classes are regarded, excluding those classes defined as products or services, which would otherwise lead to incorrect results.

As reported in Section 5.3, our research took into account ten popular product classification standards, among them two different versions of eCl@ss, and three proprietary category structures. The common abbreviations of the product classification systems together with the versions that have been converted are given in the first column of Table 5.1. The upper part of the table lists the statistics for the product categorization standards, whereas the lower three rows represent the proprietary category systems. For BMEcat, we cannot report specific numbers, since the standard supports the transmission of catalog group structures of various sizes and types.

Table 5.1: Statistics of product classification standards and category systems

Classification System   Levels   Classes   Properties   Individuals   Top-Level Classes   Class Distr. (%)
CPC Version 2           5        4,409     0            0             10                  18
CPA 2008                6        5,429     0            0             21                  53
CPV 2008                4        10,419    0            0             254                 6
eCl@ss 5.1.4            4        30,329    7,136        4,720         25                  18
eCl@ss 6.1              4        32,795    9,910        7,531         27                  16
ETIM 4.0                2        2,213     6,346        7,001         54                  8
FreeClass 2012          4        2,838     174          1,423         11                  21
GPC 2012                4        3,831     1,710        9,562         37                  17
proficl@ss 4.0          ≤6       4,617     4,243        6,815         17                  36
WZ 2008                 5        1,835     0            0             21                  33
Google prod. tax.       ≤7       5,508     0            0             21                  17
productpilot            ≤8       7,970     0            0             20                  28
BMEcat                  na       na        0            0             na                  na

Columns two to six capture the number of hierarchy levels, product classes, properties, value instances, and top-level classes for each product ontology. It is worth noting that some of the product ontologies have a fixed number of hierarchy levels (e.g. eCl@ss has four levels), while for others the numbers vary (e.g. proficl@ss, which has up to six levels). Similarly, some of them are quite shallow (e.g. ETIM with two levels), while others provide deep hierarchies (e.g. CPA with six levels), sometimes with redundant concept names at consecutive levels. The large quantity of entities (classes, properties, individuals) implies an extensive coverage of the products or services domain, which, if built up manually, would be prohibitively expensive and time-consuming. Besides product classes, some product ontologies also contain properties and individuals that contribute valuable product details for the Semantic Web.

Lastly, the seventh column (“Class Distr. (%)”) indicates the distribution of classes within the derived product ontology [cf. HLS07, Table 2]. This distribution is measured as the percentage of classes that belong to the largest top-level class with respect to the total number of classes in the ontology. This value describes the topology of the hierarchical structure and is thus an indicator for the quality of the product ontology. For example, in CPA, one (“manufactured products”) of the 21 top-level classes contains more than half of all the classes in the standard, while the classes in ETIM are more evenly distributed across various branches (only 8% of all classes belong to the largest class, “hand tools”). For a similar analysis, which also inspired our approach, see [HLS07].

Among the classification systems with multilingual support, CPA is the one with the most translations, featuring class labels in 26 languages on average. Other product ontologies that also support multiple languages are CPV with an average of 22.9 languages, FreeClass with 6.9 languages, and WZ and the productpilot category system, both having two translations. The variety of languages supported increases the chance of finding products annotated with product classes on the Web more easily. Translations can also be valuable input to future hybrid product search approaches that combine formal knowledge representations with natural language processing.

5.5 Discussion

The following section presents a series of e-commerce use cases that embody some of the novel opportunities that search engines and other consumers of structured data can exploit in areas such as product search, comparison, and matchmaking. These opportunities arise from using the now available Web product ontologies from PCS2OWL, which make it possible to articulate more granular product descriptions across both the Web of Documents and the Web of Data.

Let us consider, for instance, an online retailer interested in improving its product trading and data management processes. One enhancement consists in the adoption of the GPC classification standard instead of developing a custom scheme from scratch, leveraging the GPC ontology on the Web. Now further imagine that our retailer has published on the Web a snippet in Microdata syntax as in Listing 5.2, describing an offer for a specific disposable camera in GoodRelations.

<!-- Markup reconstructed from the description in Section 5.5.1;
     the name and price values are illustrative. -->
<div itemscope itemtype="http://schema.org/Offer" itemid="http://example.org/#offer">
  <div itemprop="priceSpecification" itemscope
       itemtype="http://schema.org/UnitPriceSpecification">
    <span itemprop="price">9.99</span>
    <span itemprop="priceCurrency">USD</span>
  </div>
  <div itemprop="itemOffered" itemscope itemid="http://example.org/#product"
       itemtype="http://schema.org/SomeProducts">
    <link itemprop="additionalType"
          href="http://www.ebusiness-unibw.org/ontologies/pcs2owl/gpc/C_10001488-gen" />
    <span itemprop="name">Disposable camera</span>
  </div>
</div>

Listing 5.2: Annotation example in Microdata syntax

For readability, the qualified names of the vocabulary URIs involved are used hereinafter. They rely on the prefix declarations of gr: for GoodRelations [Hep08a], gpc: for the GPC product ontology (http://www.ebusiness-unibw.org/ontologies/pcs2owl/gpc/), s: for schema.org (http://schema.org/), and ex: (http://example.org/#) for the product data traded by our online merchant.

5.5.1 Classification of Product and Offer Descriptions

Listing 5.2 specifies a disposable camera and the associated offer via the URIs ex:product and ex:offer, respectively. The offer is declared to be an instance of s:Offer (equivalent to gr:Offering) and is accompanied by a price specification consisting of a price value and a currency. The product is defined as an instance of the class s:SomeProducts (equivalent to gr:SomeItems) and, thanks to the additionalType property in schema.org/Microdata, it is an instance of the class gpc:C_10001488-gen as well. This definition, together with the existing linkage across the classes gpc:C_10001488-gen, gpc:C_10001488-tax, and the property gpc:hierarchyCode in the GPC Web ontology, materializes the product ex:product on the Web as an instance of the category “10001488”, labeled “Disposable Cameras” in the original GPC classification standard.
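Once such markup is extracted as RDF, a data consumer can aggregate the offers of all vendors that use the same GPC class. A minimal SPARQL sketch, assuming price specifications modeled as in Listing 5.2:

SELECT ?offer ?price ?currency WHERE {
  ?offer a s:Offer ;
         s:itemOffered ?product ;
         s:priceSpecification ?spec .
  ?spec s:price ?price ;
        s:priceCurrency ?currency .
  ?product a gpc:C_10001488-gen .
}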

5.5.2 Navigation over Product and Offer Data

The adoption of the GPC Web ontology would allow our online retailer to navigate along the product categories of the original GPC standard. Applied to the example in Listing 5.2, this navigation path is determined by the super- and subclasses of gpc:C_10001488-tax, which are defined via the rdfs:subClassOf relationship. For example, the immediate parent class of gpc:C_10001488-tax (the category of our camera) is gpc:C_68020100-tax (http://www.ebusiness-unibw.org/ontologies/pcs2owl/gpc/C_68020100-tax). Or, in terms of the original schema, the GPC product category “68020100 Photography” is the parent category of “10001488 Disposable Cameras”.
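In SPARQL, this navigation amounts to traversing the transitive closure of rdfs:subClassOf; the following sketch lists all ancestor categories of “Disposable Cameras” together with their labels:

SELECT ?ancestor ?label WHERE {
  gpc:C_10001488-tax rdfs:subClassOf+ ?ancestor .
  ?ancestor rdfs:label ?label .
}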

5.5.3 Semantic Annotation of Products and Offers on the Web

The fact that product classes are published on the Web using URIs renders them applicable for use with common Web data formats, such as Microdata, RDFa, and the Facebook Open Graph Protocol (OGP). Product annotations in those syntaxes can lead to improvements with regard to the current state of the document-based Web, namely in the form of search engine result snippets (known as rich snippets or rich captions, respectively) [Goo16; MicND], and other mid-term benefits may arise from providing more computer-accessible meaning. In this context, it is also worth noting that the existing alignment between the GoodRelations ontology and schema.org (http://wiki.goodrelations-vocabulary.org/Cookbook/Schema.org) allows products to be annotated using URIs in both the gr: and the s: namespaces.

5.6 Conclusion

The ontology engineering task in the domain of products and services is typically tedious, costly, and time-consuming. To master this problem, we presented a generic method and a toolset for deriving product ontologies in a semi-automatic way from existing product classification standards and proprietary category systems, which is superior to building them up manually in several respects. For example, it successfully addresses the generally large number of concepts in product categorization standards and the conceptual dynamics inherent to the domain of products and services. We have supported our contribution by converting 13 practically relevant product classification systems of different scopes, sizes, and structures, and have shown that we can generate useful product ontologies while effectively preserving the original taxonomic relationships. These ontologies are ready for deployment on the Web of LOD. Furthermore, we exemplified how products can be annotated using the derived product ontologies, rendering them more visible and discernible on the Web. In particular, employing product classes to semantically annotate product instances empowers product data consumers to find and aggregate products and respective offers with less effort. For example, they could readily be used for assisting faceted search over semantic e-commerce data.

As future work, we plan to extend the set of available parsers by additional product classification systems, and to publish all converted product ontologies, including those for which, at the time of writing this chapter, we had not yet been granted permission due to lack of copyright clearance. Moreover, we think that our product ontologies could attract related research fields, such as finding correspondences across product classification systems by means of ontology matching techniques. Similarly, we should point out that our generic toolset could easily be adapted to convert classification systems and taxonomies not necessarily connected to the domain of products and services.

6 Cleansing and Enrichment

6.1 Problem Statement
6.2 Typology of Obstacles
    6.2.1 Redundant Entity Definitions
    6.2.2 Schema Heterogeneity
    6.2.3 Unit of Measurement Mismatches
    6.2.4 Missing, Invalid, and Inconsistent Data
        6.2.4.1 Missing Data
        6.2.4.2 Invalid Data
        6.2.4.3 Inconsistent Data
    6.2.5 Data Granularity Mismatches
    6.2.6 Natural Language Issues
6.3 Techniques
    6.3.1 Overview
    6.3.2 Preprocessing
        6.3.2.1 RDF Datatype Cleansing
        6.3.2.2 Other Cleansing Heuristics
    6.3.3 Entity Consolidation
        6.3.3.1 Entity Consolidation Based on Identifiers
        6.3.3.2 Entity Consolidation Based on Proper Names
    6.3.4 Schema Alignment
        6.3.4.1 Two Schemas with Structural Mismatches
        6.3.4.2 Two Schemas with Direct Correspondences
        6.3.4.3 One Schema with Multiple Patterns
        6.3.4.4 One Schema with Modeling Shortcuts
    6.3.5 Missing Relationships
        6.3.5.1 Product Model Information Based on Identifiers
        6.3.5.2 Product Model Information Based on Proper Names
        6.3.5.3 Product Feature Inheritance from Product Model to Product
        6.3.5.4 Product Feature Inheritance from Product Variants
        6.3.5.5 Consumables, Accessories, and Spare Parts
    6.3.6 Missing Attributes
    6.3.7 Data Lifting and Enrichment
    6.3.8 Unit Conversion and Canonicalization
        6.3.8.1 Conversion of Units of Measurement
        6.3.8.2 Currency Conversion
6.4 Evaluation
6.5 Implementation of a Data Management Web User Interface
    6.5.1 User Interface Tabs
        6.5.1.1 Loading Data
        6.5.1.2 Cleansing Rules
        6.5.1.3 Dynamic Rules
    6.5.2 Data Management with RDF Graphs
    6.5.3 Execution Order of Cleansing Rules
    6.5.4 Translation versus Canonicalization
6.6 Conclusion

Despite the existence of ontology languages and top-level and domain ontologies, e-commerce data at Web scale will typically exhibit a significant degree of structural and semantic heterogeneity, and suffer from data quality problems such as omissions or contradictions. Both will hamper the direct consumption of data for automated information processing. In this chapter, we analyze this problem, develop a typology of such obstacles, and design and implement prototypical solutions for selected problems.

6.1 Problem Statement

With the Web of Data [BHB09], there exists a rich source of product data that consuming clients, at least in theory, can immediately benefit from. In particular, the Web of Data promises sophisticated Web product search and matchmaking opportunities. Yet, in practice, many data-consuming applications find it very challenging to process raw product data from the Web, because it is heterogeneous and exhibits a range of data quality problems. Generally speaking, not the amount of data, but the variety in the data is the main bottleneck of the current e-commerce Web of Data. For research on data quality problems on the Web of Data, see e.g. [MP15; SH15c].

The diversity in the data is mainly linked to the fact that disparate data sources are developed independently by different parties and serve distinct purposes. Heterogeneity and data quality problems can already be observed at small scale. For example, they predominate within corporate settings whenever enterprises or departments have different business needs, or following mergers and acquisitions that require data integration between various information systems [cf. Hal05]. However, the situation is getting much more complex and critical at larger scale, such as when harvesting e-commerce data from the Web. This implies dealing with varying data formats and vocabularies, often at different levels of granularity, competing modeling patterns among vendors and manufacturers, inconsistent use of units of measurement and currency units, or linguistic idiosyncrasies (e.g. homonyms and synonyms) [SH01].

One could argue that there exist standards (e.g. data formats, ontology languages, product ontologies, or code standards) that, if everybody adhered to them, could solve the heterogeneity problem. Unfortunately, this premise is unrealistic for a large body of distributed data sources like the Web. There are too many stakeholders, systems, and applications involved, each having individual requirements.

A consolidated view on the data is a prerequisite for effective product searches and matchmaking [e.g. Di+ 03]. To give an example, when looking for a new car, the query ought to include not only entities typed as cars, but also those typed as automobiles. Furthermore, in a search for automobiles manufactured by BMW, we require all corresponding cars (1) to feature a property that links to the respective car manufacturer, and (2) to have a consistent entity representation for BMW. In fact, the great variety of product data on the Web complicates these data integration tasks. In the following, we analyze the types of problems and their causes and sketch techniques for using the data for product search despite the underlying deficiencies.

The rest of this chapter is structured as follows: In Section 6.2, we elaborate on a typology of obstacles relevant to the domain of e-commerce on the Web of Data; Section 6.3 presents viable techniques for overcoming these obstacles; as an evaluation, we demonstrate in Section 6.4 how frequent selected issues are in our real Web crawl (see also Chapter 3); in Section 6.5, we showcase a prototypical implementation that eases the cleansing and enrichment process; and finally, Section 6.6 concludes this chapter.

6.2 Typology of Obstacles

In this section, we present a categorization of prevalent obstacles that are regularly found in structured e-commerce data on the Web. In a sense, these obstacles are special kinds of data quality problems. Data quality problems are a result of poor data quality. It is widely accepted that data quality can be defined with regard to consumers, e.g. Wang and Strong [WS96] characterize data quality as “data that are fit for use by data consumers” [WS96]. Some relevant dimensions for evaluating data quality are accuracy, completeness, relevancy, timeliness, ease of understanding, and accessibility of data [WS96]. Consequently, if any of these dimensions is not satisfied by the data, then we are facing a data quality problem. Many data quality problems arise from data misinterpretation, i.e. the semantics of the data is not always clear in every context [Mad03]. E.g., an attribute “price”, if not further specified, can be understood as with or without taxes included. This does not necessarily pose a problem as long as price values are viewed in isolation, but it might lead to incorrect results once prices are aggregated or compared.

Rahm and Do [RD00] developed a comprehensive categorization of data quality problems concerning data at schema and instance level that may appear in single-source or multi-source environments [RD00] (see Section 2.4.5). The problem categories that they mention are also relevant for e-commerce on the Web of Data, including integrity constraint violations, data entry errors, heterogeneous data models, and inconsistent data [cf. RD00]. In the context of this thesis, we focus on the following types of problems:

• Redundant entity definitions,
• schema heterogeneity,
• unit of measurement mismatches,
• missing, invalid, and inconsistent data,
• data granularity mismatches, and
• natural language issues.

6.2.1 Redundant Entity Definitions

In a distributed system like the World Wide Web (WWW), it is very likely that the exact same entities are defined more than once. This repeated definition of entities is commonly termed redundancy [cf. RD00]. A proper synchronization between data providers on the Web fails due to the vast number of data sources being created and maintained autonomously and in parallel. It is also generally easier to define entities locally instead of looking up an authoritative Uniform Resource Identifier (URI) on the Web, if such exists. Unfortunately, this makes query formulation more tedious and challenging as compared to textbook-style examples in the SPARQL Protocol and RDF Query Language (SPARQL) documentation or respective tutorials [cf. HS13]. For certain entity types, we observed substantial redundancy. In the context of e-commerce, these entity types are

• manufacturer,
• brand,
• product model,
• dealer or vendor, and
• location.

This is because for these types of objects, there will typically be multiple Web resources that expose identical or near-identical information without explicit links. In some cases, the very same information is included in most or all pages of the same Web site (e.g. dealer master data might be contained in every single offer page of a shop), or product details are part of both manufacturer and dealer Web sites.

In the context of e-commerce scenarios, many of these entities can be considered master data. For an overview of the notion of master data, see e.g. [Los09; Dre+ 08]. Regardless of the business case, the specification of master data is usually accepted among various systems, applications, and processes. Product model data published by a manufacturer, e.g., can be used by a great number of retailers and vendors. Therefore, repeated definitions of these kinds of entity types can be consolidated without introducing conflicts. Other entities on the Web shall not be consolidated, though. For example, product offers are specific to a single vendor because of their unique prices and conditions. Consolidating two offers that belong to the same product, e.g. based on their product identifiers, would inevitably lead to inconsistencies and contradictions. In summary, entity consolidation based on strong identifiers is only applicable to those objects that are actually representations of the very same object based on some identity criterion. For an overview of the problem of identity in the context of knowledge representation, see the OntoClean approach in [GW02] and [GW09].
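Anticipating the techniques of Section 6.3.3, the following hedged SPARQL CONSTRUCT sketch illustrates identifier-based consolidation: it links product model entities that share the same EAN/GTIN-13 code, and deliberately restricts itself to gr:ProductOrServiceModel so that vendor-specific offers remain untouched:

CONSTRUCT { ?a owl:sameAs ?b . }
WHERE {
  ?a a gr:ProductOrServiceModel ;
     gr:hasEAN_UCC-13 ?gtin .
  ?b a gr:ProductOrServiceModel ;
     gr:hasEAN_UCC-13 ?gtin .
  FILTER (?a != ?b)
}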


6.2.2 Schema Heterogeneity

Redundant entity definitions, as addressed in the previous section, are a form of heterogeneity at the instance level, because a canonical representation of entities is missing. However, many data quality issues arise from differences at the schema level. The problem of schema heterogeneity has persisted for decades, and one reason why “schema heterogeneity is difficult and time-consuming is that it requires both domain and technical expertise” [Hal05]. In other words, it requires a lot of human effort to create mappings between different schemas. Schema-level heterogeneity on the Web is characterized by structural and semantic discrepancies between ontologies [cf. Hal05]. For product data on the Semantic Web, we need to distinguish between two important levels of schema heterogeneity:

1. The top-level e-commerce data model: This level of schema heterogeneity refers to the conceptual differences due to competing data formats and vocabularies. For example, there exist several alternatives for modeling e-commerce data on the Semantic Web, namely

a) the h-product Microformat standard [Çel14], i.e. a particular data format and vocabulary for embedding product data in Hypertext Markup Language (HTML);

b) data-vocabulary.org (http://www.data-vocabulary.org/), an earlier attempt by Google to establish a vocabulary on the Web, meanwhile deprecated by schema.org;

c) GoodRelations [Hep08a] in its original (gr:) namespace, e.g. where an offer is linked to a business entity via gr:offers, and a product is linked to an offer via gr:includes;

d) schema.org [SchND] prior to the GoodRelations integration [Guh12], e.g. where a product had to be linked to its offer via s:offers;

e) schema.org extended with GoodRelations, where among others a new property s:itemOffered was added to schema.org that corresponds to gr:includes in GoodRelations [cf. Hep15a].

2. Product types, features, and enumerations: This level of schema heterogeneity entails structural and semantic differences between domain ontologies. For instance, there exist competing classifications that define product classes, properties, and individuals using distinct naming and structure.
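As an example of bridging variant (d) and the GoodRelations idiom of variant (c), a simple SPARQL CONSTRUCT rule can materialize the corresponding gr:includes triples (a sketch, assuming the legacy data uses schema:offers from product to offer):

CONSTRUCT { ?offer gr:includes ?product . }
WHERE     { ?product schema:offers ?offer . }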

In addition to that, even within a single schema there can be heterogeneity in the form of multiple patterns or idioms for the same type of information. E.g., GoodRelations defines two classes for location, namely the older and meanwhile deprecated class gr:LocationOfSalesOrServicesProvisioning, and a newer class gr:Location. Or, textual descriptions are attached to product-related items sometimes using rdfs:label and rdfs:comment, sometimes using domain-specific attributes like gr:name/gr:description or schema:name/schema:description.
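A rule of the kind used later in this chapter can normalize such idioms; for example, the following SPARQL CONSTRUCT sketch retypes instances of the deprecated location class so that consumers only need to match gr:Location:

CONSTRUCT { ?x a gr:Location . }
WHERE     { ?x a gr:LocationOfSalesOrServicesProvisioning . }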

6.2.3 Unit of Measurement Mismatches

In Section 2.2.6.2, we learned about a number of different code standards, i.e. date and time formats, country codes, language codes, codes for units of measure, and currency codes. In this section, we by and large focus on mismatches of codes for units of measurement and of currency codes, since the latter can be regarded as a special type of unit of measure. The other code standards will not be considered here, first due to their general relevance beyond the narrow scope of product search, and second because they can mostly be handled using heuristics with countless degrees of freedom (e.g. to convert a date string into its canonical form). Mismatches of units of measurement can generally manifest in two ways, namely

1. as various unit standards that one might choose from:
• UN/CEFACT [Uni09b],
• Unified Code for Units of Measure (UCUM) [SM13],
• Quantities, Units, Dimensions and Types (QUDT) [Hod+ 14],
• Ontology for Units of Measure and Related Concepts (OM) [RvAT13], and

2. as one standard, but with differing unit codes for describing the same physical dimensions (e.g. metric and imperial units like “gram” versus “pound”, or base and derived units such as “kilogram” versus “gram”).

In addition, data providers often use units of measure that are not standards-compliant. This is the case, for example, when manufacturers fail to curate quantities using standardized units in their product information management (PIM) system, e.g. using “Volts” or “V” rather than the UN/CEFACT Common Code “VLT”, or when vendors price products with “Euros” or the currency symbol “€” instead of adhering to the ISO 4217 currency code “EUR”. This variety adds substantial complexity for data consumers; a cleansing rule for the currency case is sketched below.
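This rule maps common non-standard currency labels to the ISO 4217 code via SPARQL CONSTRUCT; the label list is an illustrative assumption:

CONSTRUCT { ?o gr:hasCurrency "EUR"^^xsd:string . }
WHERE {
  ?o gr:hasCurrency ?c .
  FILTER (str(?c) IN ("Euro", "Euros", "EURO", "€"))
}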


6.2.4 Missing, Invalid, and Inconsistent Data

As completeness and consistency of data are major factors for data quality, we will subsequently discuss them in more detail. The related problems in this context are

1. missing data,
2. invalid data, and
3. inconsistent data.

6.2.4.1 Missing Data

As we have described in Chapters 3 and 4, many product offers published by Web shops lack granular product data for enabling deep product comparison. Missing data can actually mean missing data in the representation, or data that is available, albeit only implicitly. By implicit data we mean latent information that is not explicitly stated, e.g. published in the form of a free-text field or image. In the following, we outline different kinds of missing data.

Missing Entity Type Information A lot of product data on the Web is not categorized. A possible reason is that most Web shops neither have the means to link their products to classification standards, nor do they have a mechanism to mint URIs for every category from their category system. So it is very common to publish products without further specifying their kind, or to attach a textual description, e.g. using a property like gr:category as shown in Listing 6.1.

ex:GoldenNecklace a gr:SomeItems ;
  # name and description of the item
  gr:name "18K yellow-gold necklace"@en ;
  gr:description "Necklace of yellow gold with a metal purity of 18K weighs 4 grams."@en ;
  # provide some category information
  gr:category "Jewelry > Necklace"@en ;
  # additional product features
  gr:hasWeight [ a gr:QuantitativeValueFloat ;
    gr:hasValueFloat "4.0"^^xsd:float ;
    gr:hasUnitOfMeasurement "GRM"^^xsd:string ] .

Listing 6.1: Categorizing products with textual properties


Missing Relationships Although related entities could well be interlinked, conceptual gaps are often present in the data. For example, relationships are often missing between products and their respective product models, among products and product variants, or between products and their consumables, accessories, and spare parts.

Missing Attributes Many data quality problems originate from missing attributes in product data. Publishers often omit unit codes for quantitative values, or publish prices without currency units. Furthermore, statements about entities are often underspecified and thus ambiguous, in particular if value-added taxes are not detailed for prices, or if validity durations of product or price offers are incomplete, e.g. when either start or end dates are missing. The same holds for opening hours of physical stores.

Missing Datatype Information Especially when harvesting data from Microdata markup [Hic13], Resource Description Framework (RDF) datatype information will often be missing [cf. Kel14, Section 3.2]. Then, SPARQL queries and ordering operations will not work properly. For example, examine the SPARQL query in Listing 6.2. Imagine that some product data specifies the currency unit without the datatype xsd:string. In that case, the query would not match the data. Thus, the fixing of missing datatype information constitutes an important preprocessing step for product search.

SELECT ?product ?price
WHERE {
  ?product a gr:SomeItems ;
    gr:hasCurrency "EUR"^^xsd:string ;
    gr:hasCurrencyValue ?price .
}

Listing 6.2: SPARQL SELECT query to retrieve products with prices in “Euros”
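A possible preprocessing rule, in the spirit of Section 6.3.2.1, re-emits currency codes as typed literals so that queries like Listing 6.2 also match data published without the datatype (a sketch; duplicates of already typed literals are harmless):

CONSTRUCT { ?x gr:hasCurrency ?typed . }
WHERE {
  ?x gr:hasCurrency ?c .
  FILTER isLiteral(?c)
  BIND (STRDT(STR(?c), xsd:string) AS ?typed)
}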

6.2.4.2 Invalid Data

Invalid data comprises unexpected data values that a SPARQL processor cannot handle properly. In real-world data, possible problems of this kind are the presence of

• data entry errors (e.g. misspelled product names, or erroneous product identifiers),
• string values where numerals are expected (e.g. a currency value that contains alphanumerical characters),
• wrong datatypes used for literals (e.g. xsd:string instead of xsd:float), or
• inconsistently used units of measurement (see Section 6.2.3).

6.2.4.3 Inconsistent Data

In addition to missing and invalid data, data can also be inconsistent. Such inconsistencies are often either formal conflicts or conflicts at the business logic level.

Integrity Constraint Violations A source of formal inconsistencies are integrity constraint violations [e.g. RD00; Hal05]. This includes violated domain or range constraints, non-compliance with cardinality constraints, or minimum values greater than maximum values. In this case, the conflict is between the instance data and the schema [cf. RD00].

Redundant Data with Conflicts When data sources are merged, assertions coming from different data sources are often conflicting [cf. RD00]. E.g., the same triples might appear multiple times with different values. In this case, the conflict is on the instance level between multiple instances.

Business Logic Violations Another source of problems in the domain of e-commerce is that the more general and formal data quality problems are often complemented by specific inconsistencies at the business logic level. This entails cases such as the validity of a product offer ending before it starts (the same holds for opening hours), price specifications (or opening hours) partially overlapping, the list price being set much lower than the retail price, the price tag being an extreme outlier, or multiple list prices being specified for the same product offer.
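Some of these violations are easy to detect with plain SPARQL; for instance, the following sketch flags offers whose validity interval ends before it starts:

SELECT ?offer WHERE {
  ?offer gr:validFrom ?from ;
         gr:validThrough ?through .
  FILTER (?through < ?from)
}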

6.2.5 Data Granularity Mismatches

Even within a single schema, there often exist multiple patterns for the same type of information that differ in the amount of structure. Without proper handling, this will limit the recall of a query, because the respective SPARQL patterns will not be found. Basically, two types of data granularity mismatches stand out, namely

1. differing modeling patterns, and
2. weakly structured information.


GoodRelations offers some modeling shortcuts to ease the publication of recurring patterns in data, namely the properties gr:includes, gr:hasValue, gr:hasValueFloat, and gr:hasValueInteger [Hep11]. Without these, it has e.g. always been tedious to model a product offer and its respective product instance. To link these two entities, the code in Listing 6.3 was needed [cf. Hep11].

ex:Offer gr:includesObject [ a gr:TypeAndQuantityNode ;
  gr:amountOfThisGood "1"^^xsd:float ;
  gr:hasUnitOfMeasurement "C62"^^xsd:string ;
  gr:typeOfGood ex:Product ] .

Listing 6.3: Linking two entities with the gr:includesObject modeling pattern

The rationale was that by means of this flexible and powerful pattern one could model product offers consisting of a single product item or bundles with multiple items. However, most product offers do not require such complex modeling, as they include only one product item. Thus, a shortcut was added to GoodRelations that allows this information to be expressed more elegantly (see Listing 6.4) [cf. Hep11].

ex:Offer gr:includes ex:Product .

Listing 6.4: Linking two entities with the gr:includes modeling shortcut

Prior to consuming data based on this shortcut, it needs to be expanded to the canonical long form shown above [Hep13]. Similar rules hold for the shorthand properties gr:hasValue, gr:hasValueFloat, and gr:hasValueInteger: they describe point values and can easily be expanded to the respective intervals (e.g. gr:hasMinValue and gr:hasMaxValue, with both set to the same value) in order to simplify query formulation.
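Such an expansion can itself be expressed as a SPARQL CONSTRUCT rule; the following sketch rewrites the gr:includes shortcut into the canonical long form (a fresh blank node is created per matched offer):

CONSTRUCT {
  ?offer gr:includesObject [ a gr:TypeAndQuantityNode ;
    gr:amountOfThisGood "1"^^xsd:float ;
    gr:hasUnitOfMeasurement "C62"^^xsd:string ;
    gr:typeOfGood ?product ] .
}
WHERE { ?offer gr:includes ?product . }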

It is common practice on the Web to publish data at varying levels of granularity. Some databases do not have a sophisticated data model and are thus only capable of exporting weakly structured information; occasionally it is the other way around, albeit less frequently. Another reason why data providers publish weakly structured information is that there exist no adequate property definitions in the existing ontologies that they could rely on. Therefore, product data, especially product features, are often only available as plain string literals. For example, imagine a data provider using a property ex:hasOperatingVoltage with a value “220–240 V”. To unleash the semantics, the value should better be split into its value and unit constituents, and the interval should be modeled via the two datatype properties gr:hasMinValue and gr:hasMaxValue, as indicated in Listing 6.5.

ex:Product a gr:SomeItems ;
  ex:hasOperatingVoltage [ a gr:QuantitativeValue ;
    gr:hasMinValue 220 ;
    gr:hasMaxValue 240 ;
    gr:hasUnitOfMeasurement "VLT" ] .

Listing 6.5: Modeling of intervals in GoodRelations

To give yet another example of weakly structured information, it is difficult for many systems (e.g. shop software) to keep price and currency values apart when exposing product offer data on the Web. For this reason, schema.org accepts four different modeling patterns for price specifications (see Listing 6.6). Please note that variants 1 and 2 outlined in Listing 6.6 represent shortcuts of variants 3 and 4 in schema.org.

# 1. Price attached directly to the offer as a textual property
ex:Offer a schema:Offer ; schema:price "100.0 USD" .

# 2. Price attached directly to the offer but in a more granular way
ex:Offer a schema:Offer ; schema:price "100.0" ; schema:priceCurrency "USD" .

# 3. Price modeled via a detailed price specification node and as a textual property
ex:Offer a schema:Offer ;
  schema:priceSpecification [ a schema:UnitPriceSpecification ;
    schema:price "100.0 USD" ] .

# 4. Price modeled via a detailed price specification node but in a more granular way
ex:Offer a schema:Offer ;
  schema:priceSpecification [ a schema:UnitPriceSpecification ;
    schema:price "100.0" ;
    schema:priceCurrency "USD" ] .

Listing 6.6: Price modeling patterns in schema.org

In summary, it is difficult for a data consumer to take advantage of weakly structured information and to cater for the variety of possible modeling patterns; a rule that lifts the flatter price patterns to the canonical form is sketched below.
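Under the assumption that the source data uses variant 2, this SPARQL CONSTRUCT sketch rewrites it into the detailed variant 4, so that consumers only need to query the price specification node (offers that already carry one are skipped):

CONSTRUCT {
  ?offer schema:priceSpecification [ a schema:UnitPriceSpecification ;
    schema:price ?price ;
    schema:priceCurrency ?currency ] .
}
WHERE {
  ?offer a schema:Offer ;
         schema:price ?price ;
         schema:priceCurrency ?currency .
  FILTER NOT EXISTS { ?offer schema:priceSpecification ?ps . }
}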

6.2.6 Natural Language Issues

Despite an ontology-based representation, queries will frequently be hybrid in nature, i.e. include the search for certain keywords in string values. When dealing with natural language, though, we need to be aware of some caveats. It is possible that entities are described using multiple, different languages, which is not inherently bad, since it can increase recall. A keyword that fails to match items based on a specific language might still match them based on a translation. Unfortunately, for the same language a myriad of dialects and regional differences often exists. This can lead to ambiguity of terms and incompatible or incorrect spellings. When speaking of ambiguous terminology, we frequently mean the problems connected with homonyms and synonyms, which in information retrieval (IR) can lead to poor precision and recall ratios, respectively [cf. Dee+ 90]. Homonyms, on one hand, are terms that share the same name but carry different meanings [NO95]. An example thereof is “chair” (cf. http://www.merriam-webster.com/dictionary/chair), which, as a noun, can mean either a piece of furniture or a professorship, and, as a verb, the act of leading a meeting, event, or discussion. On the other hand, synonyms are words with the same meaning even though their spellings are different [NO95]. “Car” and “automobile” are two different terms that refer to a motorized vehicle with four wheels. Similarly, “bicycle”, “cycle”, and “bike” all denote the same kind of objects. The American English term “elevator” is known as “lift” in British English, etc. Finally, an important class of problems with natural language are incorrect and incompatible spellings, as well as acronyms [cf. RD00]. The first group, spelling mistakes, are due to data entry errors (e.g. “childern” instead of “children”, or “compair” instead of “compare”), while incompatible spellings are due to differences in the language, i.e. dialects or regional differences. In English, an important distinction is made between American English and British English, where “fiber” becomes “fibre” or “categorization” becomes “categorisation”. The third group, acronyms, are abbreviations for words. E.g., “Serial ATA” is often abbreviated as “SATA”, or “European Union” as “EU”. Simplistic SPARQL queries will fail as soon as a single one of the aforementioned problems is present in the corpus of data.
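A lightweight countermeasure on the query side is keyword expansion; the following SPARQL sketch matches product names against a small, illustrative synonym set:

SELECT DISTINCT ?item WHERE {
  ?item gr:name ?name .
  VALUES ?kw { "car" "automobile" }
  FILTER (CONTAINS(LCASE(STR(?name)), ?kw))
}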

6.3 Techniques

In this section, we sketch potential solutions to the obstacles outlined in the preceding section.


6.3.1 Overview

Table 6.1 gives an overview of the challenges and a short list of possible approaches. In the following, we address some of them in more detail.

Table 6.1: Obstacles with respective solutions

Challenge                                 → Approaches
Redundant entity definitions              → Entity consolidation / instance matching [e.g. RD00; DH05]
Schema heterogeneity                      → Schema alignment [e.g. RD00; RB01]
Unit of measurement mismatches            → Unit conversion and canonicalization [cf. Cul+ 07; SH13a]
Missing, invalid, and inconsistent data   → RDF datatype cleansing, cleansing heuristics, data mining, enrichment [e.g. RD00; DH05; Cul+ 07]
Data granularity mismatches               → Data lifting heuristics, enrichment, schema matching [e.g. RD00; RB01]
Natural language issues                   → Word sense disambiguation (WSD), named entity recognition (NER) and named entity disambiguation (NED) [e.g. Nav09; MRN14], query expansion [e.g. Vor94]

In general, one might choose between multiple basic techniques for data cleansing and enrichment. The most popular techniques are summarized below along with their typical use cases:

• OWL and RDFS reasoning: Reasoners allow to infer knowledge from implicit facts using logical inferences (see Section 2.3.8.3). On the Semantic Web, reasoners for different ontology languages are available. An RDF Schema (RDFS) reasoner draws conclusions from RDFS statements, mainly rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain, and rdfs:range. A Web Ontology Language (OWL) reasoner, in its simplest form, infers knowledge from axioms including owl:equivalentClass, owl:equivalentProperty, or owl:sameAs. Reasoners can either materialize inferred triples, or perform reasoning tasks at query time [e.g. KD11, p. 249].

• SPARQL: With SPARQL, there exist basically two approaches for cleansing and enrichment. The first option is to formulate SPARQL queries, possibly using nested SELECT queries. This is very cumbersome, though, because it is limited by (a) the cognitive complexity of formulating such one-turn SPARQL queries, and (b) endpoint restrictions like the execution time and the length of SPARQL queries, which cannot be arbitrarily long. More promising is the SPARQL CONSTRUCT feature, which provides a convenient mechanism for defining custom rules. With SPARQL CONSTRUCT queries, new triples can be created (consequent) based on a graph pattern match (antecedent) [cf. AH11, pp. 88f., pp. 115f.]. Subsequent queries then execute over the original graph plus the newly materialized data. Another alternative for defining custom rules is the use of a dedicated rule language (e.g. the Semantic Web Rule Language (SWRL) [Hor+04]), which we are not going to discuss in more detail here.

• Script-based approaches: For complex functions that are inefficient or impossible to achieve with OWL and RDFS inferencing and SPARQL rules, it is sometimes superior to export data in order to process it offline using scripts, or to take advantage of user-defined functions (UDFs) or functions built into a SPARQL endpoint. Viable use cases are geocoding and currency conversion.

Most of our solutions presented in this section rely on SPARQL CONSTRUCT rules. We use them to define production rules whose results are materialized as new data. For the rest of this chapter, we omit the prefix declarations in the header of the Terse RDF Triple Language (Turtle)/Notation 3 (N3) examples and SPARQL queries for brevity. Unless otherwise stated, we assume the namespace declarations for Turtle and SPARQL as provided in Listing 6.7 (with SPARQL, the syntax is slightly different: "@prefix" is replaced by the keyword "PREFIX" and the trailing dot is omitted).

# example namespace (URI illustrative)
@prefix ex: <http://example.org/#> .

# domain ontologies
@prefix gr: <http://purl.org/goodrelations/v1#> .
@prefix schema: <http://schema.org/> .
@prefix vso: <http://purl.org/vso/ns#> .
@prefix pto: <http://www.productontology.org/id/> .

# ontology languages
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

# datatypes
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

Listing 6.7: Namespace declarations used for Turtle/N3 and SPARQL examples

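To make the rule character of SPARQL CONSTRUCT mentioned above concrete, the following minimal sketch materializes rdfs:subClassOf entailments, i.e. it emulates one piece of RDFS reasoning as a production rule (a generic illustration, not one of the later cleansing rules):

CONSTRUCT {?instance a ?superClass}
WHERE {
  ?instance a ?class .
  # the + property path follows rdfs:subClassOf chains transitively
  ?class rdfs:subClassOf+ ?superClass .
}

Executing such a rule once and adding its output to the store corresponds to the materialization strategy of an RDFS reasoner for this particular axiom.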


6.3.2 Preprocessing

In the following, we consider some problems with the representation of structured data on the Web. These can mainly be fixed by simple preprocessing steps.

6.3.2.1 RDF Datatype Cleansing

Missing RDF datatypes can be added to simple, plain, or untyped literals based on the schema definition. The first variant in Listing 6.8 shows an example where the RDF datatypes are missing. The second variant constitutes the same data, but with the correct datatypes, as intended. Though the difference seems minimal, even such slight representational discrepancies can lead to a loss of recall in SPARQL queries over RDF data.

# without datatypes
ex:NumSeats a gr:QuantitativeValueInteger ;
  gr:hasValueInteger "9" ;
  gr:hasUnitOfMeasurement "C62" .

# with datatypes
ex:NumSeats a gr:QuantitativeValueInteger ;
  gr:hasValueInteger "9"^^xsd:int ;
  gr:hasUnitOfMeasurement "C62"^^xsd:string .

Listing 6.8: Adding RDF datatypes to plain literals

For historical reasons, RDF expects type information to be stated in each single literal rather than being taken from the underlying schema, which is burdensome for data publishers and thus often lacking in the resulting data. According to the RDF 1.1 specification [CWL14, Section 3.3], simple literals without a datatype or language tag are syntactic sugar for literals with the xsd:string datatype. In the worst case, this could turn numeric literals into xsd:string literals as well. Thus, it is better to add the right datatype information to untyped RDF literals based on the datatype indicated in the schema. The GoodRelations vocabulary definition for the property gr:hasValueInteger is given in Listing 6.9.

gr:hasValueInteger a owl:DatatypeProperty ;
  rdfs:label "has value integer (0..1)"@en ;
  rdfs:domain gr:QuantitativeValueInteger ;
  rdfs:range xsd:int .

Listing 6.9: OWL definition of the gr:hasValueInteger property

With the SPARQL CONSTRUCT rule outlined in Listing 6.10, we can generate triples that add the RDF datatype to untyped literals based on the schema definition, leading to the graph on the right-hand side of Listing 6.8.


CONSTRUCT {?s ?p ?new_o}
WHERE {
  ?s ?p ?o .
  ?p rdfs:range ?range .
  FILTER(datatype(?o) != ?range)
  # str(?o) yields the plain lexical form, as required by STRDT
  BIND(STRDT(str(?o), ?range) AS ?new_o)
}

Listing 6.10: SPARQL CONSTRUCT query to recover the correct datatype from schema information

The query in Listing 6.10 can also be used to fix literals with wrong datatypes by taking into account the schema constraints (e.g. range restrictions), as the example in Listing 6.11 exhibits. Note that this approach assumes that the vocabulary is known and defines exactly one datatype per property. As soon as complex OWL class definitions are used as the ranges of a property (e.g. text or number), additional heuristics have to be employed.

# incorrect datatypes
ex:NumSeats a gr:QuantitativeValueInteger ;
  gr:hasValueInteger "9"^^xsd:string ;
  gr:hasUnitOfMeasurement "C62"^^xsd:int .

# correct datatypes
ex:NumSeats a gr:QuantitativeValueInteger ;
  gr:hasValueInteger "9"^^xsd:int ;
  gr:hasUnitOfMeasurement "C62"^^xsd:string .

Listing 6.11: Assigning correct RDF datatypes to literals with incorrect datatypes

6.3.2.2 Other Cleansing Heuristics

Invalid data values are very common, especially differences in the formatting of numerical values. Regional conventions affect the use of the decimal point and the thousands separator, e.g. "1,200.5", "1.200,5", or "1200.5". The latter is the default format required and understood by most modern computers, i.e. no thousands separator and a decimal period. In Listing 6.12, we contrast incorrect and correct numerical values.

# invalid decimal separator
ex:Weight a gr:QuantitativeValueFloat ;
  gr:hasValueFloat "5,0"^^xsd:float ;
  gr:hasUnitOfMeasurement "GRM"^^xsd:string .

# valid decimal separator
ex:Weight a gr:QuantitativeValueFloat ;
  gr:hasValueFloat "5.0"^^xsd:float ;
  gr:hasUnitOfMeasurement "GRM"^^xsd:string .

Listing 6.12: Converting the invalid data value “5,0” to “5.0”

The approach for transforming invalid data values may differ by use case. For the specific use case presented herein, we supply a suitable SPARQL CONSTRUCT query in Listing 6.13 that replaces the decimal comma by a decimal point in any data values of type float, double, or decimal.

CONSTRUCT {?s ?p ?new_o}
WHERE {
  ?s ?p ?o .
  FILTER(datatype(?o) = xsd:float || datatype(?o) = xsd:double || datatype(?o) = xsd:decimal)
  # simplistic; a more generic solution would use REGEX
  FILTER(CONTAINS(str(?o), ",") && !CONTAINS(str(?o), "."))
  BIND(STRDT(REPLACE(str(?o), ",", "."), datatype(?o)) AS ?new_o)
}

Listing 6.13: SPARQL CONSTRUCT query to convert invalid numerical values

A more robust solution would be to employ regular expressions (REGEX) instead of the CONTAINS and REPLACE functions in Listing 6.13. Although computationally more expensive, a regular expression pattern such as the following is capable of matching numerical values like "+1.602e-19":

[-+]?[0-9]*\.?[0-9]+([Ee][-+]?[0-9]+)?
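As a hedged sketch of how this pattern could be put to work, the following SELECT query flags numeric literals whose lexical form does not conform to the canonical format (note the added anchors ^ and $, and the doubled backslash required inside a SPARQL string literal):

SELECT ?s ?p ?o
WHERE {
  ?s ?p ?o .
  FILTER(datatype(?o) IN (xsd:float, xsd:double, xsd:decimal))
  # report literals whose lexical form is NOT a canonical number
  FILTER(!REGEX(str(?o), "^[-+]?[0-9]*\\.?[0-9]+([Ee][-+]?[0-9]+)?$"))
}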

6.3.3 Entity Consolidation

A viable solution to redundant entity definitions is entity consolidation (or instance matching, see also Section 2.4.3). Redundantly defined entities share certain properties that are unique to them. For product models, e.g., these unique properties are product identifiers. For business entities, even the legal name of the business might be sufficiently distinctive. Yet, if no such unique properties exist, it is often still possible to find a unique combination of properties for an entity, as we will see shortly.

6.3.3.1 Entity Consolidation Based on Identifiers

As already stated in Chapter 4, product identifiers are particularly suitable for entity consolidation. For product models, duplicate entity definitions can be consolidated using product identifiers such as the European Article Number (EAN) or the Global Trade Item Number (GTIN). Listing 6.14 shows two product models with identical EANs. To consolidate these two product model entities, a SPARQL CONSTRUCT query as in Listing 6.15 can be issued. In addition to EAN-13, the SPARQL CONSTRUCT query also captures other product identifiers, i.e. GTIN-8 and GTIN-14. The graph pattern in the query matches any combination of two product models with different Web identifiers (i.e. URIs) but the same product identifiers.


ex:Model1 a gr:ProductOrServiceModel ;
  gr:name "Siemens Silence Pro 1800"@en ;
  gr:hasEAN_UCC-13 "1234567890123"^^xsd:string .
ex:Model2 a gr:ProductOrServiceModel ;
  gr:name "Siemens 1800W Vacuum Cleaner"@en ;
  gr:hasEAN_UCC-13 "1234567890123"^^xsd:string .

Listing 6.14: Redundant product models with the same EAN

CONSTRUCT {?model2 owl:sameAs ?model1}
WHERE {
  ?model1 a gr:ProductOrServiceModel ;
    ?hasProductId ?productId1 .
  ?model2 a gr:ProductOrServiceModel ;
    ?hasProductId ?productId2 .
  FILTER (?hasProductId IN (gr:hasEAN_UCC-13, gr:hasGTIN-14, gr:hasGTIN-8))
  FILTER (?model1 != ?model2 && str(?productId1) = str(?productId2) && str(?productId1) != "")
}

Listing 6.15: SPARQL CONSTRUCT query for product models based on arbitrary product identifiers

For every matching pair of product models, the query generates a triple using the owl:sameAs property, denoting that the two entity definitions should be considered the same. Keep in mind that consolidation of product entities on this basis is valid only for product models (e.g. two datasheets for the same consumer electronics commodity). Actual products of the same kind, e.g. on two different eBay auctions, typically do not refer to the very same object, but to two objects of the same make and model. The previous query, executed on the data in Listing 6.14, yields the two triples outlined in Listing 6.16. Recall from Chapter 2 that the equals sign ("=") is syntactic sugar in N3 that represents the owl:sameAs property.

ex:Model1 = ex:Model2 .
ex:Model2 = ex:Model1 .

Listing 6.16: owl:sameAs links between redundant product model entities

The consolidation of other types of entities resembles the consolidation of product models. For these entity types, some identifiers are particularly suitable for consolidation, namely

• for business entities, identifiers attached using the GoodRelations properties gr:hasDUNS, gr:hasNAICS, gr:hasGlobalLocationNumber, gr:hasISICv4, and
• for locations, identifiers supplied with gr:hasGlobalLocationNumber or gr:hasISICv4.

At a more general level, this approach can be used for any pair of entities of the same type that share the same property value for a property that exposes a reliable identity criterion, as the sketch below illustrates.
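As an illustration of this generalization, consider the following minimal sketch along the lines of Listing 6.15; it assumes that the Global Location Number is a reliable identity criterion for the data at hand:

CONSTRUCT {?company2 owl:sameAs ?company1}
WHERE {
  ?company1 a gr:BusinessEntity ;
    gr:hasGlobalLocationNumber ?gln1 .
  ?company2 a gr:BusinessEntity ;
    gr:hasGlobalLocationNumber ?gln2 .
  # same GLN, different URIs, and a non-empty identifier
  FILTER(?company1 != ?company2 && str(?gln1) = str(?gln2) && str(?gln1) != "")
}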

6.3.3.2 Entity Consolidation Based on Proper Names

As an alternative to strong identifiers, entity consolidation can also be based on

• proper names,
• combinations of multiple weak identifiers,
• combinations of proper names and weak identifiers, and
• other criteria.

When proper names are used, they need to be distinctive enough to reliably consolidate entities. Examples thereof are

• for product models, the product name contained in gr:name,
• for business entities, the legal name (gr:legalName), or
• for brands, the brand name in gr:name.

Proper names can be compared based on varying matching criteria. For example, consolidation with brand names could require an exact match of proper names. With the legal name, a small gazetteer of acronyms could help to match variants of business entity types like "Limited" and "Ltd.". Moreover, product names could be compared using string distance metrics like the Levenshtein string distance [Lev66], the Jaccard coefficient [Jac12], or other popular methods [cf. CRF03], in combination with a similarity threshold. Furthermore, combinations of names and weak identifiers can help to increase the reliability of proper name matches. E.g., a brand or manufacturer name could be used together with a manufacturer part number (MPN) to unambiguously identify a product model. Listing 6.17 gives an example where two models are defined. Although their product names are different, they could be consolidated, as they have the same brand name and MPN. The consolidation rule is defined as a SPARQL CONSTRUCT query and outlined in Listing 6.18. Note that the current rule ignores capitalization. The normalization step


ex:Model1 a gr:ProductOrServiceModel ;
  gr:name "Bosch 9-Gallon Dust Extractor with Semi-Auto Filter Clean VAC090S"@en ;
  gr:hasBrand [ a gr:Brand ;
    gr:name "Bosch"@en ] ;
  gr:hasMPN "VAC090S"^^xsd:string .
ex:Model2 a gr:ProductOrServiceModel ;
  gr:name "Bosch Carpet Extractor 9 Gallon"@en ;
  gr:hasBrand [ a gr:Brand ;
    gr:name "Bosch"@en ] ;
  gr:hasMPN "VAC090S"^^xsd:string .

Listing 6.17: Redundant product models based on the combination of manufacturer name and MPN

CONSTRUCT {?model1 owl:sameAs ?model2}
WHERE {
  ?model1 a gr:ProductOrServiceModel ;
    gr:hasBrand [ gr:name ?brandName1 ] ;
    gr:hasMPN ?mpn1 .
  ?model2 a gr:ProductOrServiceModel ;
    gr:hasBrand [ gr:name ?brandName2 ] ;
    gr:hasMPN ?mpn2 .
  FILTER(?model1 != ?model2 &&
    lcase(?brandName1) = lcase(?brandName2) &&
    lcase(?mpn1) = lcase(?mpn2))
}

Listing 6.18: SPARQL CONSTRUCT query for consolidating redundant product models based on identical pairs of brand names and MPNs

could be further adjusted by stripping whitespace characters, punctuation, dashes, etc. (see the sketch below). The output of the query is the same as already indicated in Listing 6.16. As a third possibility, consider consolidating entities based on other criteria, e.g. based on matching addresses or geo coordinates between entities, or based on singular properties. Although this can be a quite powerful mechanism, it often involves fuzzy operations that might cause unwanted side effects (e.g. the consolidation of two distinct companies located in the same building).
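A minimal sketch of such an extended normalization, intended as a drop-in replacement for the plain lcase comparison inside the WHERE clause of Listing 6.18 (the character class is an assumption about which separators are safe to drop):

# normalize MPNs: lower-case, then strip spaces, dots, and dashes
BIND(REPLACE(lcase(str(?mpn1)), "[\\s.-]", "") AS ?normMpn1)
BIND(REPLACE(lcase(str(?mpn2)), "[\\s.-]", "") AS ?normMpn2)
FILTER(?model1 != ?model2 && ?normMpn1 = ?normMpn2)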

6.3.4 Schema Alignment

A comprehensive list of approaches to overcome schema heterogeneity has already been outlined in Section 2.4.2, where we discussed schema and ontology matching. In this section, we elaborate on working solutions for consolidating different e-commerce schemas on the Semantic Web. For schema alignment tasks, OWL reasoning is of particular importance, because it constitutes a simple way to exploit OWL and RDFS axioms without the need to define SPARQL CONSTRUCT rules.

6.3.4.1 Two Schemas with Structural Mismatches

Two schemas that represent the same fact but with different structures need to be harmonized in order to allow for comfortable querying. Listing 6.19 presents a structural mismatch between product offers in schema.org and GoodRelations.

# schema.org: the product points to the offer
ex:Product a schema:SomeProducts ;
  schema:offers ex:Offer .
ex:Offer a schema:Offer .

# GoodRelations: the offering points to the product
ex:Offer a gr:Offering ;
  gr:includes ex:Product .
ex:Product a gr:SomeItems .

Listing 6.19: Product offering definition in schema.org and GoodRelations

Listing 6.20 shows how the entities between these two schemas can be mediated using a SPARQL CONSTRUCT query. The query translates the first variant of Listing 6.19 into the second one.

CONSTRUCT {
  ?offer a gr:Offering ;
    gr:includes ?product .
  ?product a gr:SomeItems .
}
WHERE {
  ?product a schema:SomeProducts ;
    schema:offers ?offer .
  ?offer a schema:Offer .
}

Listing 6.20: SPARQL CONSTRUCT query to convert a product offer in schema.org to the respective offer in GoodRelations

6.3.4.2 Two Schemas with Direct Correspondences

Some classes, properties, or instances are either equivalent or at least very similar across different schemas. E.g., the schema:Person class defined in schema.org is more specific than the gr:BusinessEntity class defined in GoodRelations. Thus, the former can be defined as a subclass of the latter. Likewise, schema:ProductModel and gr:ProductOrServiceModel are equivalent classes. Listing 6.21 shows their class definitions in the respective vocabularies.

# schema.org (actually schema.rdfs.org)
schema:ProductModel a rdfs:Class ;
  rdfs:label "Product Model"@en ;
  rdfs:subClassOf schema:Product .

# GoodRelations
gr:ProductOrServiceModel a owl:Class ;
  rdfs:label "Product or service model"@en ;
  rdfs:subClassOf gr:ProductOrService .

Listing 6.21: Product model definition in schema.org (actually schema.rdfs.org) and GoodRelations

The RDFS and OWL ontology languages offer properties to define simple alignment axioms between related concepts of different schemas [HB11, pp. 24f.]. For example, the properties to express equality in OWL are owl:equivalentClass between classes [HB11, p. 24], owl:equivalentProperty between properties [HB11, p. 24], and owl:sameAs between instances [cf. Vol+09]. Furthermore, somewhat weaker properties that most RDFS and OWL reasoners are capable of handling are rdfs:subClassOf and rdfs:subPropertyOf [cf. HB11, p. 24]. At line 1 in Listing 6.22, the two product model definitions are aligned using an owl:equivalentClass property. Furthermore, lines 3–4 define instances of these two classes.

1 schema:ProductModel owl:equivalentClass gr:ProductOrServiceModel .
2
3 ex:Model1 a gr:ProductOrServiceModel .
4 ex:Model2 a schema:ProductModel .

Listing 6.22: Axiom to translate among two product model classes and product model instances

The SPARQL SELECT query in Listing 6.23, by itself, only matches ex:Model1 of the knowledge base in Listing 6.22, since only that instance is explicitly defined as a GoodRelations product model. However, by employing an OWL reasoner capable of handling the owl:equivalentClass relationship, the same query would also return the product model defined in schema.org. The query in Listing 6.23 thus yields both product models, as shown in the result table below the query.

SELECT ?s
WHERE {
  ?s a gr:ProductOrServiceModel .
}

Result:
  s
1 ex:Model2
2 ex:Model1

Listing 6.23: SPARQL SELECT query and triples returned by selecting all GoodRelations product models
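If no OWL reasoner is at hand, the owl:equivalentClass semantics used above can also be approximated by materialization, in the spirit of the CONSTRUCT rules employed throughout this chapter; a minimal sketch:

CONSTRUCT {?x a ?class2}
WHERE {
  # owl:equivalentClass is symmetric; the ^ inverse path covers both directions
  ?class1 (owl:equivalentClass|^owl:equivalentClass) ?class2 .
  ?x a ?class1 .
}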


6.3.4.3 One Schema with Multiple Patterns

Within the same schema, multiple patterns to represent the same information may exist. In schema.org, the price information can, among others, be attached to the product offer directly, or indirectly via a price specification entity. The two alternatives are outlined in Listing 6.24. Obviously, a query that respects only one type of modeling would miss entities expressed with the other pattern.

# no intermediate price specification node
ex:Offer a schema:Offer ;
  schema:price "100.0" ;
  schema:priceCurrency "USD" .

# with an intermediate price specification node
ex:Offer a schema:Offer ;
  schema:priceSpecification [ a schema:UnitPriceSpecification ;
    schema:price "100.0" ;
    schema:priceCurrency "USD" ] .

Listing 6.24: Two equivalent modeling patterns for prices in schema.org

Under the assumption that a query is formulated that matches only the second pattern of Listing 6.24, we can use the SPARQL CONSTRUCT query in Listing 6.25 to bring the non-matching pattern into the required form.

CONSTRUCT {
  ?offer schema:priceSpecification [ a schema:UnitPriceSpecification ;
    schema:price ?price ;
    schema:priceCurrency ?currency ] .
}
WHERE {
  ?offer a schema:Offer ;
    schema:price ?price ;
    schema:priceCurrency ?currency .
  FILTER NOT EXISTS {?offer schema:priceSpecification ?pspec}
}

Listing 6.25: SPARQL CONSTRUCT query to translate between two equivalent modeling patterns within the same schema

6.3.4.4 One Schema with Modeling Shortcuts

As modeling patterns are often complex although in frequent use, some schemas define modeling shortcuts for them. GoodRelations, e.g., defines handy shortcuts like gr:includes, gr:hasValue, gr:hasValueFloat, and gr:hasValueInteger. Listing 6.26 contrasts the shortcut gr:includes with its longer form. Comparing these two variants, it is easy to notice that the shortcut assumes some implicit defaults, such as that the offer is composed of only one item.

# shortcut: sufficient for describing an offer
# that includes a single product item
ex:Offer a gr:Offering ;
  gr:includes ex:Product .
ex:Product a gr:SomeItems .

# expanded version
ex:Offer a gr:Offering ;
  gr:includesObject [ a gr:TypeAndQuantityNode ;
    gr:amountOfThisGood "1.0"^^xsd:float ;
    gr:hasUnitOfMeasurement "C62"^^xsd:string ;
    gr:typeOfGood ex:Product ] .
ex:Product a gr:SomeItems .

Listing 6.26: Modeling shortcut and expanded version for attaching a product to an offer

In a SPARQL endpoint, the modeling shortcut can easily be expanded by issuing a SPARQL CONSTRUCT query as illustrated in Listing 6.27. The advantage of shortcut expansion is that it facilitates query formulation, essentially because queries can then safely rely on the full modeling patterns.

CONSTRUCT {
  ?offer a gr:Offering ;
    gr:includesObject [ a gr:TypeAndQuantityNode ;
      gr:amountOfThisGood "1.0"^^xsd:float ;
      gr:hasUnitOfMeasurement "C62"^^xsd:string ;
      gr:typeOfGood ?product ] .
}
WHERE {
  ?offer gr:includes ?product .
  ?product a ?ptype .
  FILTER (?ptype != gr:ProductOrServiceModel)
  FILTER NOT EXISTS {?offer gr:includesObject [ gr:typeOfGood ?product ]}
}

Listing 6.27: SPARQL CONSTRUCT query to expand a shortcut pattern for products to its canonical long variant

In addition to product instances, the gr:includes shortcut also works with product models attached to product offers. In that case, the SPARQL CONSTRUCT query in Listing 6.27 would unfold one additional link, gr:hasMakeAndModel, between the product instance and the product model. Another modeling shortcut exists for point values. They can be expanded to intervals so that they are matched by interval queries, e.g. queries over the lower and upper limits of a quantitative value. A simple example of such an expansion is provided in Listing 6.28, where the first variant denotes the point value and the second constitutes the corresponding interval definition.


# point value
ex:ScreenSize a gr:QuantitativeValueFloat ;
  gr:hasValueFloat "7"^^xsd:float ;
  gr:hasUnitOfMeasurement "INH"^^xsd:string .

# rewritten as an interval
ex:ScreenSize a gr:QuantitativeValueFloat ;
  gr:hasMinValueFloat "7"^^xsd:float ;
  gr:hasMaxValueFloat "7"^^xsd:float ;
  gr:hasUnitOfMeasurement "INH"^^xsd:string .

Listing 6.28: Quantitative values as point values and intervals

The translation of point values to intervals for the example at hand can be achieved by employing the SPARQL CONSTRUCT query in Listing 6.29. The query ensures that the shortcut is only expanded if neither a minimum value nor a maximum value exists in the data. At the same time, it prevents duplicate expansion in case the same SPARQL CONSTRUCT query is executed more than once.

CONSTRUCT {
  ?qv gr:hasMinValueFloat ?v ;
    gr:hasMaxValueFloat ?v
}
WHERE {
  ?qv gr:hasValueFloat ?v .
  # RDF graph does not contain a property for the min value
  FILTER NOT EXISTS {?qv gr:hasMinValueFloat ?minv}
  # RDF graph does not contain a property for the max value
  FILTER NOT EXISTS {?qv gr:hasMaxValueFloat ?maxv}
}

Listing 6.29: SPARQL CONSTRUCT query to convert point values to intervals

6.3.5 Missing Relationships

In the following, we elaborate on rules and axioms and their usage for solving the problem of missing relationships in the data.

6.3.5.1 Product Model Information Based on Identifiers

Many Web shops lack granular product information, which could easily be supplied by manufacturers (see Chapter 4). To equip them with product model information, though, either an explicit, typed relationship or a shared identifier between product offers (or product instances) and product models needs to be in place. The idea is similar to the one for entity consolidation between the same concepts, as discussed in Section 6.3.3.1. The first part of Listing 6.30 shows a product item and a product model with the same product


identifier. Consequently, it would make sense to materialize the correspondence between the product and the product model in the data. This assertion can be represented using the gr:hasMakeAndModel relationship, as indicated in the second part of Listing 6.30.

# no link to make and model given
ex:Product a gr:SomeItems ;
  gr:hasEAN_UCC-13 "1234567890123"^^xsd:string .
ex:Model a gr:ProductOrServiceModel ;
  gr:hasEAN_UCC-13 "1234567890123"^^xsd:string .

# with a materialized link
ex:Product a gr:SomeItems ;
  gr:hasEAN_UCC-13 "1234567890123"^^xsd:string ;
  gr:hasMakeAndModel ex:Model .
ex:Model a gr:ProductOrServiceModel ;
  gr:hasEAN_UCC-13 "1234567890123"^^xsd:string .

Listing 6.30: Product model information based on matching EANs

The respective SPARQL CONSTRUCT query to generate the bridge axiom between the product item and the product model is given in Listing 6.31. 1 CONSTRUCT {?product gr:hasMakeAndModel ?model} 2 WHERE { 3

?product gr:hasEAN_UCC-13 ?ean1 .

4

FILTER NOT EXISTS {?product a gr:ProductOrServiceModel}

5

?model a gr:ProductOrServiceModel ; gr:hasEAN_UCC-13 ?ean2 .

6 7

FILTER NOT EXISTS {?product gr:hasMakeAndModel ?model2}

8

FILTER(str(?ean1) = str(?ean2) && str(?ean1) != "")

9 }

Listing 6.31: SPARQL CONSTRUCT query to establish a link between products and product models with matching EANs

6.3.5.2 Product Model Information Based on Proper Names

Corresponding products and product models could be linked based on proper names as well, or based on combinations of proper names and weak identifiers, or other criteria. This becomes especially relevant if no shared product identifier is available between a product and a product model. How links between products and product models are established differs by the type of unique key, i.e. product identifier or proper name, and is equivalent to what we have already seen in the context of entity consolidation with proper names in Section 6.3.3.2.

6.3.5.3 Product Feature Inheritance from Product Model to Product

Once a link between a product item and a product model is established, product features defined for the product model can be used to augment the product item and ultimately the product offer. Consider the example in Listing 6.32, where the first part describes the source data with a product linking to a product model via gr:hasMakeAndModel, and where the second part shows the product data augmented by the product information from the product model (i.e. the triple that links to the weight of the product).

# source data
ex:Product a gr:SomeItems ;
  gr:hasMakeAndModel ex:Model ;
  gr:name "Galaxy S5 - White"@en .
# no weight given
ex:Model a gr:ProductOrServiceModel ;
  gr:hasEAN_UCC-13 "1234567890123"^^xsd:string ;
  gr:name "Samsung Galaxy S5 W"@en ;
  gr:weight ex:WeightSGS5 .
# the weight the product model refers to
ex:WeightSGS5 a gr:QuantitativeValueFloat ;
  gr:hasValueFloat "145.0"^^xsd:float ;
  gr:hasUnitOfMeasurement "GRM"^^xsd:string .

# augmented data
ex:Product a gr:SomeItems ;
  gr:hasMakeAndModel ex:Model ;
  gr:name "Galaxy S5 - White"@en ;
  gr:weight ex:WeightSGS5 .
ex:Model a gr:ProductOrServiceModel ;
  gr:hasEAN_UCC-13 "1234567890123"^^xsd:string ;
  gr:name "Samsung Galaxy S5 W"@en ;
  gr:weight ex:WeightSGS5 .
# the weight both entities refer to
ex:WeightSGS5 a gr:QuantitativeValueFloat ;
  gr:hasValueFloat "145.0"^^xsd:float ;
  gr:hasUnitOfMeasurement "GRM"^^xsd:string .

Listing 6.32: Product with and without product features from the product model

The transformation in Listing 6.32, where the product item inherits product features of the product model, is achieved by the SPARQL CONSTRUCT query in Listing 6.33. Note that the query ensures that (1) the product itself is not a make and model (line 5); (2) the expanded properties are actual product features and not arbitrary non-technical properties (lines 7–9), since that could have unwanted side-effects (e.g. consider inheriting the name of the make and model); and, (3) the very same properties (with possibly different values) do not yet exist for the specific product item (line 11). 1 CONSTRUCT {?product ?property ?modelValue} 2 WHERE { 3

?model a gr:ProductOrServiceModel .

4

?product gr:hasMakeAndModel ?model .

5

FILTER NOT EXISTS {?product a gr:ProductOrServiceModel}

6

?model ?property ?modelValue .

7

VALUES ?superProperty {gr:qualitativeProductOrServiceProperty

8

gr:quantitativeProductOrServiceProperty gr:datatypeProductOrServiceProperty}

9

?property rdfs:subPropertyOf ?superProperty .

10

# product does not have this property yet (also covers rdf:type statements)

11 12

FILTER NOT EXISTS {?product ?property ?productValue} }

Listing 6.33: SPARQL CONSTRUCT query for the inheritance of product features from the product model


6.3.5.4 Product Feature Inheritance from Product Variants

Some product models are variants of other product models, i.e. they have most features in common. Therefore, a product variant will usually inherit all product-related properties from the base product model, save for those that are already defined by the variant, and those that are constituent parts of, and thus unique to, the other product model (e.g. product identifiers). In Listing 6.34, we specify a product model for a Ford T, and a variant thereof, a red Ford T model. While it is pretty safe for the variant to inherit the technical features (like engine power and displacement, in our example), other details like the model date (or similar identifiers like Vehicle Identification Numbers (VINs) or GTINs) shall not be adopted, as indicated in the second part of Listing 6.34. Further, the color is not inherited, because the product variant already defines this property itself, which would otherwise lead to conflicts.

# source data
ex:RedFordTModel a gr:ProductOrServiceModel, vso:Automobile ;
  gr:isVariantOf ex:FordTModel ;
  vso:color "red"@en .
# no engine displacement given
# no engine power given
ex:FordTModel a gr:ProductOrServiceModel, vso:Automobile ;
  gr:name "Ford T Model - Black"@en ;
  vso:modelDate "2002-01-01"^^xsd:date ;
  vso:color "black"@en ;
  vso:engineDisplacement ex:DisplacementFT ;
  vso:enginePower ex:PowerFT .
ex:PowerFT a gr:QuantitativeValueFloat ;
  gr:hasValueFloat "15.0"^^xsd:float ;
  gr:hasUnitOfMeasurement "KWT"^^xsd:string .
ex:DisplacementFT a gr:QuantitativeValueFloat ;
  gr:hasValueFloat "2.9"^^xsd:float ;
  gr:hasUnitOfMeasurement "LTR"^^xsd:string .

# after inheritance of product features
ex:RedFordTModel a gr:ProductOrServiceModel, vso:Automobile ;
  gr:isVariantOf ex:FordTModel ;
  vso:color "red"@en ;
  vso:engineDisplacement ex:DisplacementFT ;
  vso:enginePower ex:PowerFT .
ex:FordTModel a gr:ProductOrServiceModel, vso:Automobile ;
  gr:name "Ford T Model - Black"@en ;
  vso:modelDate "2002-01-01"^^xsd:date ;
  vso:color "black"@en ;
  vso:engineDisplacement ex:DisplacementFT ;
  vso:enginePower ex:PowerFT .
ex:PowerFT a gr:QuantitativeValueFloat ;
  gr:hasValueFloat "15.0"^^xsd:float ;
  gr:hasUnitOfMeasurement "KWT"^^xsd:string .
ex:DisplacementFT a gr:QuantitativeValueFloat ;
  gr:hasValueFloat "2.9"^^xsd:float ;
  gr:hasUnitOfMeasurement "LTR"^^xsd:string .

Listing 6.34: Product variant with and without product features from a related product model

The logic behind the inheritance of product features based on the product model variant is defined by the SPARQL CONSTRUCT query in Listing 6.35. The query makes sure that (1) the expanded properties are actual product features and not arbitrary non-technical properties (lines 7–9); (2) certain properties (especially variant-specific identifiers) not already filtered out in step (1) are excluded (line 11); and (3) properties that already exist for the variant (e.g. vso:color in Listing 6.34) are not overwritten (line 13).


 1 CONSTRUCT {?model2 ?property ?value1}
 2 WHERE {
 3   ?model1 a gr:ProductOrServiceModel .
 4   ?model2 a gr:ProductOrServiceModel ;
 5     gr:isVariantOf ?model1 .
 6   ?model1 ?property ?value1 .
 7   VALUES ?superProperty {gr:qualitativeProductOrServiceProperty
 8     gr:quantitativeProductOrServiceProperty gr:datatypeProductOrServiceProperty}
 9   ?property rdfs:subPropertyOf ?superProperty .
10   # exclude model-specific identifiers
11   FILTER(?property NOT IN (vso:modelDate))
12   # do not override existing properties
13   FILTER NOT EXISTS {?model2 ?property ?value2}
14 }

Listing 6.35: SPARQL CONSTRUCT query for the inheritance of product features from product variants

6.3.5.5 Consumables, Accessories, and Spare Parts

A lot of product data on the Semantic Web is published independently, so a proper linkage between products and their consumables, accessories, or spare parts is often missing. On the other hand, with this additional information, valuable product recommendations could be made. Let us exemplify the problem using the example of a consumable (the situation is similar for accessories and spare parts, thus we will not cover them here). Consider a retailer that publishes data about a laser printer. Now, it would be interesting to see suitable toner cartridges listed together with the printer. However, as toner cartridges are sold by many different vendors, this appears to be a difficult endeavor. Listing 6.36 shows the status quo first, and the target state second.

# status quo: no link for consumable given
ex:Printer a gr:SomeItems,
    pto:Printer_(computing) ;
  gr:hasMPN "7800V_DN"^^xsd:string ;
  gr:name "Phaser 7800"@en .
ex:TonerCartridge a gr:SomeItems,
    pto:Toner_cartridge ;
  gr:name "Black Hi Capacity Toner Cartridge for Phaser 7800 (7800V_DN ...)"@en .

# target state
ex:Printer a gr:SomeItems,
    pto:Printer_(computing) ;
  gr:hasMPN "7800V_DN"^^xsd:string ;
  gr:name "Phaser 7800"@en .
ex:TonerCartridge a gr:SomeItems,
    pto:Toner_cartridge ;
  gr:name "Black Hi Capacity Toner Cartridge for Phaser 7800 (7800V_DN ...)"@en ;
  gr:isConsumableFor ex:Printer .

Listing 6.36: Products where one (a toner cartridge) is a consumable for another (a printer)

An important observation is that products on the Web often use a product model identifier of the consumer product in the product name or description of the consumable.

In particular, the MPN is often part of the product name of the consumable. This small detail, fuzzy as it is, allows us to establish the relationship between consumer products and consumables. Listing 6.37 details a SPARQL CONSTRUCT rule to generate the gr:isConsumableFor link between a product and its consumable.

VALUES ?ptype {gr:ProductOrService gr:ProductOrServiceModel gr:SomeItems gr:Individual} # omit

4

VALUES ?ctype {gr:ProductOrService gr:ProductOrServiceModel gr:SomeItems gr:Individual}

5

?product a ?ptype ;

deprecated classes

6

gr:hasMPN ?mpn .

7

?consumable a ?ctype ;

8

gr:name ?consumableName .

9

FILTER(?product != ?consumable && CONTAINS(str(?consumableName), str(?mpn))) # comparison based on str(literal) is more robust than literal!

10

}

Listing 6.37: SPARQL CONSTRUCT query to add gr:isConsumableFor link based on the MPN of one product contained in the product name of the other product

Alternatively, if there is no MPN or other product identifier available, one could try to find the product name of the consumer product in the product name or description of the consumable. This method is, of course, not as accurate as relying on product identifiers. The sketched approach is of course very simplistic and will suffer from false positives (e.g. textual content like “not compatible with XYZ” or “supersedes XYZ” will also create a respective gr:isConsumableFor statement). In real-world applications, the pattern should be complemented by more advanced natural language processing approaches.
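As a small step in that direction, the FILTER in Listing 6.37 could be tightened so that the MPN only matches as a separate token and obvious negations are skipped; a sketch, assuming MPNs contain no regular expression metacharacters:

# replacement for the FILTER in Listing 6.37
FILTER(?product != ?consumable
  # the MPN must appear as a whole token, not as a substring of a longer code
  && REGEX(str(?consumableName), CONCAT("(^|[^A-Za-z0-9])", str(?mpn), "([^A-Za-z0-9]|$)"), "i")
  # skip obvious negations in the consumable's name
  && !REGEX(str(?consumableName), "not compatible", "i"))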

6.3.6 Missing Attributes

While relevant attributes like the price value or the value of a quantitative characteristic are mostly present, other attributes are sometimes missing in the data, for example the unit of measurement of a quantitative value, the currency code of a price, details about value-added taxes, or the start and/or end date of a validity period. Unfortunately, quantitative characteristics of products can neither be compared nor filtered by as long as the unit of measurement is missing, and with missing validity dates in an open world, a product offer may be either valid or not. That is, the open-world assumption (OWA) [e.g. DFH11, p. 21] that GoodRelations, as an OWL ontology, is based on does not allow drawing conclusions from the absence of a statement. However, to some extent, we can recover missing attributes by assuming implicit defaults. For the unit code, we can assume that if no unit code is present, there exists no relevant unit for the quantitative value. Thus we could assume "C62", which means "one unit or piece" [cf. Hep08b; Uni09a]. By comparison, the currency code could be determined based on the shop location. E.g., if the shop is located in the USA, it is likely that the prices are given in U.S. dollars, whereas a shop in Germany will likely publish prices in Euros. Similarly, implicit default values might apply to value-added taxes based on experience. Finally, a heuristic for a missing start date of the price validity could be to rely on the date of the Hypertext Transfer Protocol (HTTP) request for the data (if available), and otherwise to assume the current date (see the sketch at the end of this section). Listing 6.38 illustrates two examples of quantitative values with lacking unit codes. The first example specifies a price without a currency code, and the second example models a quantitative value for the number of central processing unit (CPU) cores where the unit of measurement code is absent.

# price without currency code
ex:Price a gr:UnitPriceSpecification ;
  gr:hasCurrencyValue "49.99"^^xsd:float .

# quantitative value without unit of measurement code
ex:NumCPUCores a gr:QuantitativeValueInteger ;
  gr:hasValueInteger "4"^^xsd:int .

Listing 6.38: Two examples of where some unit codes (code of the unit of measurement and the currency code) are missing for quantitative values

If we knew, for example, that the price specification was published by a Web shop located in the USA, we could issue the SPARQL CONSTRUCT query shown in Listing 6.39.

CONSTRUCT {?price gr:hasCurrency "USD"^^xsd:string}
WHERE {
  ?price gr:hasCurrencyValue ?v .
  FILTER NOT EXISTS {?price gr:hasCurrency ?currency}
  FILTER(isNumeric(?v))
}

Listing 6.39: SPARQL CONSTRUCT query to assign the default currency value wherever missing

The SPARQL CONSTRUCT query generates the missing triple with the currency code defaulting to "USD", as shown below:

ex:Price gr:hasCurrency "USD"^^xsd:string .

If we assume that the default code for the unit of measurement is "C62", meaning "no unit code" or "a piece" [cf. Hep08b; Uni09a], the respective SPARQL CONSTRUCT query looks as indicated in Listing 6.40. For the sake of simplicity, let us assume that reasoning support is enabled for this query. Starting from the properties gr:hasMinValue and gr:hasMaxValue, an RDFS-style reasoner can then infer triples using the subproperties outlined in Figure 6.1. Similarly, the gr:QuantitativeValue class subsumes the classes gr:QuantitativeValueFloat and gr:QuantitativeValueInteger [cf. Hep11]. In addition to quantitative values, the query is also able to cope with entities typed as gr:TypeAndQuantityNode and the respective property gr:amountOfThisGood, which is used in combination with the gr:includesObject modeling pattern [Hep08b].

CONSTRUCT {?qv gr:hasUnitOfMeasurement "C62"^^xsd:string}
WHERE {
  VALUES ?qvt {gr:QuantitativeValue gr:TypeAndQuantityNode}
  ?qv a ?qvt .
  ?qv gr:hasMinValue|gr:hasMaxValue|gr:amountOfThisGood ?v .
  FILTER NOT EXISTS {?qv gr:hasUnitOfMeasurement ?uom}
}

Listing 6.40: SPARQL CONSTRUCT query to recover missing unit codes in quantitative values

[Figure: diagram of the GoodRelations property hierarchy; gr:hasMinValue subsumes gr:hasMinValueFloat and gr:hasMinValueInteger, gr:hasMaxValue subsumes gr:hasMaxValueFloat and gr:hasMaxValueInteger, and gr:hasValue subsumes gr:hasValueFloat and gr:hasValueInteger.]

Figure 6.1: Property hierarchy of quantitative values in GoodRelations [based on Hep11]

Executing the SPARQL CONSTRUCT query in Listing 6.40 on the data in Listing 6.38 yields the following RDF triple:

ex:NumCPUCores gr:hasUnitOfMeasurement "C62"^^xsd:string .
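The start-date heuristic mentioned at the beginning of this section can be sketched along the same lines; the rule below is a minimal sketch that falls back to the current date and time via the SPARQL built-in NOW() whenever no crawl timestamp is available:

# assign the current date and time as the default start of the price validity
CONSTRUCT {?price gr:validFrom ?now}
WHERE {
  ?price a gr:UnitPriceSpecification .
  FILTER NOT EXISTS {?price gr:validFrom ?from}
  BIND(NOW() AS ?now)
}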


6.3.7 Data Lifting and Enrichment

Data lifting and enrichment consist of improving the data granularity (i.e. the degree of explicit structure) by (1) rules to derive structured data from textual properties, and (2) enrichment using external resources. The data lifting task essentially comprises heuristics like regular expressions to get structured data from text. Listing 6.41 exemplifies the situation with a quantitative value where, as happens quite often, the unit code is included in the value property (i.e. "60.0 LTR"). The second quantitative value in the listing outlines the target state.

# unit of measurement code embedded in the value
ex:Capacity a gr:QuantitativeValue ;
  gr:hasValue "60.0 LTR"^^xsd:string .

# lifted to a granular description
ex:Capacity a gr:QuantitativeValue ;
  gr:hasValue "60.0"^^xsd:float ;
  gr:hasUnitOfMeasurement "LTR"^^xsd:string .

Listing 6.41: Comparison of non-granular and granular quantitative value descriptions

In this specific case, it is sufficient to employ a simple heuristic to split the string into a value part and a code part. This is exactly what the SPARQL CONSTRUCT rule in Listing 6.42 does. It binds the term before the whitespace character as the numeric part, and the term after the whitespace character as the unit code.

CONSTRUCT {
  ?qv ?qvp ?numericPart ;
    gr:hasUnitOfMeasurement ?uomPart .
}
WHERE {
  VALUES ?qvt {gr:QuantitativeValue gr:QuantitativeValueFloat gr:QuantitativeValueInteger
    gr:TypeAndQuantityNode}
  ?qv a ?qvt .
  VALUES ?qvp {gr:hasValue gr:hasValueFloat gr:hasValueInteger gr:amountOfThisGood}
  ?qv ?qvp ?v .
  FILTER NOT EXISTS {?qv gr:hasUnitOfMeasurement ?uom}
  BIND(STRBEFORE(?v, " ") AS ?numericPart)
  BIND(STRAFTER(?v, " ") AS ?uomPart)
}

Listing 6.42: SPARQL CONSTRUCT query that applies a heuristic to extract a value and a unit code from a free-text field

Another important class of data values often modeled as one textual property is that of open and closed intervals. Consider, e.g., the seating capacity of a van, which is typically between one and nine passengers. Listing 6.43 gives a comparison of the same information (1) encoded as text, and (2) modeled in a more granular way using the gr:hasMinValue and gr:hasMaxValue properties.

# interval provided in text
ex:SeatingCapacity a gr:QuantitativeValue ;
  gr:hasValue "1-9"^^xsd:string ;
  gr:hasUnitOfMeasurement "C62"^^xsd:string .

# interval modeled with individual limits
ex:SeatingCapacity a gr:QuantitativeValue ;
  gr:hasMinValue "1"^^xsd:int ;
  gr:hasMaxValue "9"^^xsd:int ;
  gr:hasUnitOfMeasurement "C62"^^xsd:string .

Listing 6.43: Intervals modeled in text as compared to individual intervals

Notice that the property type used to attach the textual label is arbitrary. Instead of gr:hasValue, we could likewise have used the textual properties gr:name or gr:description; the approach is essentially the same. To lift integer intervals encoded in text, the SPARQL CONSTRUCT query in Listing 6.44 could be used. A regular expression is applied to the textual label to match possible interval definitions, and the extracted numbers are assigned suitable datatypes.

CONSTRUCT {
  ?qv gr:hasMinValue ?minv ;
    gr:hasMaxValue ?maxv
}
WHERE {
  VALUES ?qvt {gr:QuantitativeValue gr:QuantitativeValueFloat gr:QuantitativeValueInteger}
  ?qv a ?qvt ;
    gr:hasValue ?v .
  FILTER(REGEX(?v, "^[0-9]+-[0-9]+$")) # matches integer intervals
  BIND(STRDT(STRBEFORE(?v, "-"), xsd:int) AS ?minv)
  BIND(STRDT(STRAFTER(?v, "-"), xsd:int) AS ?maxv)
}

Listing 6.44: SPARQL CONSTRUCT query to convert intervals in text (decimals or integers) to intervals modeled using appropriate properties

Enrichment, unlike data lifting, takes advantage of external resources. In particular, product data can be augmented by additional data from

• product classifications (for an overview of classification systems, see Chapter 5),
• linguistic and lexical databases (e.g. WordNet [Mil95], the Product Types Ontology (PTO), http://www.productontology.org/, accessed on February 20, 2016), and
• existing Linked Data sources (e.g. WordNet RDF, http://wordnet-rdf.princeton.edu/, accessed on June 4, 2014; DBPedia, http://dbpedia.org/, accessed on May 12, 2014; Freebase, http://www.freebase.com/, accessed on May 12, 2014; PTO).

Besides the higher recall obtained through the additional data from these data sources, an important advantage is the linguistic aspect. The synonyms and translations found in these data sources can help to disambiguate terms with often conflicting meanings, such as those caused by homonyms and synonyms in natural language. A federated query, sketched below, illustrates how such an external source can be tapped.
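The following sketch assumes an endpoint with SPARQL 1.1 federation support and reuses pto:Toner_cartridge from Listing 6.36 for illustration; since PTO class names mirror Wikipedia article names, the corresponding DBPedia resource is straightforward to address:

# enrich locally typed products with multilingual labels from DBPedia
CONSTRUCT {?product rdfs:label ?label}
WHERE {
  ?product a pto:Toner_cartridge .
  SERVICE <http://dbpedia.org/sparql> {
    <http://dbpedia.org/resource/Toner_cartridge> rdfs:label ?label .
  }
}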

6.3.8 Unit Conversion and Canonicalization

Quantitative information is often based on different units. The GoodRelations ontology permits publishers to supply unit codes with quantitative values and price specifications, which need to be canonicalized prior to comparing them. In the following, we discuss possible solutions for the conversion of units of measurement and monetary amounts.

6.3.8.1 Conversion of Units of Measurement

For the representation of units of measure, there exist different code standards and ontologies. A popular code standard for units of measurement are the UN/CEFACT Common Codes. GoodRelations, for example, suggests using these three-letter unit codes for representing quantitative values. At the same time, the QUDT collection of ontologies (http://www.qudt.org/, accessed on September 30, 2014), developed by TopQuadrant and the National Aeronautics and Space Administration (NASA), also has an attribute for UN/CEFACT Common Codes. QUDT is an ontology that encodes the knowledge needed to convert between different units of measure. Hence, quantitative values expressed in GoodRelations can be converted using the QUDT ontology [Hod+14] and its complementing vocabularies for unit conversion [e.g. Mas+11]. In [AH11, pp. 289–294], Allemang and Hendler give a nice overview of how unit conversion with QUDT works for GoodRelations. Listing 6.45 highlights (in N3 syntax) the relevant parts from QUDT needed for the conversion between centimeters and meters. Each unit exhibits a conversion multiplier and an offset to convert it into a corresponding unit. The supply of a dimensional unit type (here: qudt:LengthUnit) allows to establish the necessary linkage between compatible units, as depicted in Figure 6.2. A second unit type further describes whether the corresponding unit represents a base unit or a derived unit.

@prefix qudt: <http://qudt.org/schema/qudt#> .
@prefix unit: <http://qudt.org/vocab/unit#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

unit:Centimeter a qudt:LengthUnit, qudt:SIDerivedUnit ;
  qudt:uneceCommonCode "CMT" ;
  qudt:conversionMultiplier "0.01"^^xsd:double ;
  qudt:conversionOffset "0.0"^^xsd:double .

unit:Meter a qudt:LengthUnit, qudt:SIBaseUnit ;
  qudt:uneceCommonCode "MTR" ;
  qudt:conversionMultiplier "1"^^xsd:double ;
  qudt:conversionOffset "0.0"^^xsd:double .

Listing 6.45: Base and derived units in QUDT represented in N3 syntax [adapted from Mas+ 11]

[Figure: unit:Centimeter (rdf:type qudt:DerivedUnit) and unit:Meter (rdf:type qudt:SIBaseUnit) are both linked to the common dimensional type qudt:LengthUnit via rdf:type.]

Figure 6.2: Base and derived units in QUDT linked via a common type qudt:LengthUnit [based on NAS10]

The formula for the conversion between two compatible values is as follows:

\[
\mathit{value}_{target} = \frac{(\mathit{offset}_{source} + \mathit{value}_{source} \times \mathit{multiplier}_{source}) - \mathit{offset}_{target}}{\mathit{multiplier}_{target}}
\tag{6.1}
\]
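For instance, converting a value of 250 expressed in unit:Centimeter (multiplier 0.01, offset 0.0) to the base unit unit:Meter (multiplier 1, offset 0.0) per Equation 6.1 yields:

\[
\mathit{value}_{target} = \frac{(0.0 + 250 \times 0.01) - 0.0}{1} = 2.5 \ \text{(meters)}
\]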

As said before, GoodRelations and QUDT commonly refer to the UN/CEFACT Common Code standard. While GoodRelations defines the gr:hasUnitOfMeasurement property, QUDT provides the attribute qudt:uneceCommonCode. The SPARQL CONSTRUCT query in Listing 6.46 outlines a generic conversion of quantitative values to their reference or base units (e.g. centimeters to meters). Executing this query yields new quantitative values carrying the same dimensions as before, but canonicalized to their base units. Note that the expansion of the shortcut for point values to the long, canonical form of an interval with matching lower and upper boundaries (see Listing 6.29) must have been executed before applying this heuristic to point values.

PREFIX qudt: <http://qudt.org/schema/qudt#>
PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

CONSTRUCT {
  ?s ?prop [ a gr:QuantitativeValue ;
    gr:hasMinValue ?min_value_new ;
    gr:hasMaxValue ?max_value_new ;
    gr:hasUnitOfMeasurement ?base_uom ] .
}
WHERE {
  ?s ?prop ?qv .
  VALUES ?qvt {gr:QuantitativeValue gr:QuantitativeValueFloat gr:QuantitativeValueInteger}
  ?qv a ?qvt ;
    gr:hasMinValue|gr:hasMinValueFloat|gr:hasMinValueInteger ?min_value ;
    gr:hasMaxValue|gr:hasMaxValueFloat|gr:hasMaxValueInteger ?max_value ;
    gr:hasUnitOfMeasurement ?uom .
  ?current_unit a ?unit_type ;
    qudt:uneceCommonCode ?qudt_uom ;
    qudt:conversionMultiplier ?multiplier ;
    qudt:conversionOffset ?offset .
  ?base_unit a qudt:SIBaseUnit, ?unit_type ;
    qudt:uneceCommonCode ?qudt_base_uom .
  FILTER (str(?uom) = str(?qudt_uom) && str(?uom) != str(?qudt_base_uom))
  BIND ((xsd:float(?min_value)*xsd:float(?multiplier))+xsd:float(?offset) AS ?min_value_new)
  BIND ((xsd:float(?max_value)*xsd:float(?multiplier))+xsd:float(?offset) AS ?max_value_new)
  BIND (STRDT(str(?qudt_base_uom), xsd:string) AS ?base_uom) # convert to xsd:string datatype
}

Listing 6.46: Unit conversion of quantitative values in GoodRelations

As an aside, when investigating QUDT we found units in the vocabulary for which no unit code was provided. For the most relevant ones, we thus decided to manually supply the missing axioms, as indicated in Listing 6.47.


@prefix qudt: <http://qudt.org/schema/qudt#> .
@prefix unit: <http://qudt.org/vocab/unit#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

unit:Inch qudt:uneceCommonCode "INH"^^xsd:string .
unit:Inch qudt:uneceCommonCode "INC"^^xsd:string . # wrongly used unit code for inch
unit:OunceMass qudt:uneceCommonCode "ONZ"^^xsd:string .
unit:Gram qudt:uneceCommonCode "GRM"^^xsd:string .
unit:Kilogram qudt:uneceCommonCode "KGM"^^xsd:string .
unit:Kilometer qudt:uneceCommonCode "KMT"^^xsd:string .
unit:MilePerHour qudt:uneceCommonCode "HM"^^xsd:string .
unit:Liter qudt:uneceCommonCode "LTR"^^xsd:string .

conversion application programming interfaces (APIs) on the Web, their integration into operations over RDF data is still burdensome and requires proprietary code. To the best of our knowledge, there exist no RDF-based Web services dedicated to exchange rates. QUDT as a vocabulary for quantities, units, dimensions and types, entails all of the world’s currencies. Nonetheless, QUDT does not offer currency conversion, because it is a relatively static document but exchange rates change very frequently, at least on a daily basis. For currency conversion, it is thus necessary to call an external Web service that retrieves the most current exchange rates (e.g. by invoking a SPARQL Inferencing Notation (SPIN) function) [cf. Knu09]. To fill this gap, we proposed in a scientific publication [SH13a] a conceptually clean and scalable way to add currency conversion functionality to the Web of Linked Data. In a nutshell, we 1. defined an OWL ontology11 for modeling currency exchange rates in RDF, and 2. put online a RESTful [Fie00] Web service to serve RDF representations populated with the latest currency exchange rates from open Web APIs or data feeds. Our service is available online12 . By having the exchange rates available as triples in RDF, it renders our approach applicable to standard SPARQL processors. In a blog post, Knublauch [Knu13] picked up our online service to showcase how to implement currency conversion by defining SPIN functions. Exchange Rate Ontology (XRO), prefixed with xro:, and available online at http://purl.org/xro/ (accessed on September 30, 2014) 12 http://www.currency2currency.org/ (accessed on September 30, 2014) 11

234

6 Cleansing and Enrichment

Listing 6.48 describes a sample currency exchange rate between Euros and U.S. dollars as published by our Web service. 1 @prefix dcterms: . 2 @prefix rdfs: . 3 @prefix xch_EUR: . 4 @prefix xro: . 5 @prefix xsd: . 6 7 xch_EUR:USD a xro:ExchangeRateInfo ; 8 9

rdfs:label "Euro to US Dollar"@en ; rdfs:comment "1 EUR = ? USD"@en ;

10

xro:base ;

11

xro:counter ;

12

xro:rate "1.2436"^^xsd:decimal ;

13

xro:inverseRate "0.804117079447"^^xsd:decimal ;

14

dcterms:source ;

15

xro:timeOfConversion "2014-11-16T23:00:00+00:00"^^xsd:dateTime .

Listing 6.48: Example of a populated currency exchange rate instance

In Listing 6.49, we demonstrate a SPARQL CONSTRUCT query for currency conversion with our online service. The query shows how the prices of product offers are canonicalized to a base price expressed in Euros, eventually materialized by the query as an RDF graph consisting of the new price specification. Note that the currencies are modeled as DBPedia currencies (see Listing 6.48), which specify a property dbpprop:isoCode for holding the three-letter ISO 4217 currency codes. The general formula to convert between prices is

\[
\mathit{price}_A = \mathit{rate}_{A2B} \cdot \mathit{price}_B
\tag{6.2}
\]

where A and B represent an arbitrary currency pair. In this formula, the price expressed in currency A (price_A) is calculated from the price according to currency B (price_B). The conversion factor is described by the exchange rate between those two currencies. More precisely, rate_A2B means the rate of currency A with regard to currency B. Accordingly, the formal currency conversion of five Euros into U.S. dollars with respect to the exchange rate from Listing 6.48 is:

\[
\mathit{price}_{USD} = \mathit{rate}_{USD2EUR} \cdot \mathit{price}_{EUR} = 1.2436 \cdot 5 = 6.22
\tag{6.3}
\]


PREFIX gr: <http://purl.org/goodrelations/v1#>
PREFIX xro: <http://purl.org/xro/>
PREFIX dbpprop: <http://dbpedia.org/property/>

CONSTRUCT {
  ?s gr:hasPriceSpecification [ a ?ptype ;
    gr:hasCurrencyValue ?base_price ;
    gr:hasCurrency ?base_code ;
    gr:hasUnitOfMeasurement ?uom ] .
}
WHERE {
  ?s gr:hasPriceSpecification [ a ?ptype ;
    gr:hasCurrency ?code ;
    gr:hasCurrencyValue ?price ;
    # carries over the unit of measurement the price refers to
    gr:hasUnitOfMeasurement ?uom ] .
  ?xrate xro:rate ?rate ;
    xro:base [ dbpprop:isoCode ?base_code ] ;
    xro:counter [ dbpprop:isoCode ?counter_code ] .
  FILTER (str(?counter_code) = str(?code) && str(?base_code) = "EUR" && ?rate != 0)
  BIND (?price/?rate AS ?base_price)
}

Listing 6.49: SPARQL CONSTRUCT rule for currency conversion with SPARQL

6.4 Evaluation

Data cleansing and enrichment are of significant importance for the Web of Linked Data, and in particular for deep product comparison over structured data. So far, we have only dealt with toy examples to show how data cleansing techniques can be used to solve data quality problems. In this section, we demonstrate that these data quality problems are also prevalent in our crawl; hence, we aim to substantiate our work from the previous sections with real numbers. In the following, we answer these questions with regard to our Web crawl from Chapter 3:

• How many redundantly defined product models are in the crawl?
• How many attributes incompatible with the UN/CEFACT Common Codes standard can be found in the data?
• How many invalid currency codes with respect to the ISO 4217 standard are used?
• How many products did we encounter that lack type information?
• How many products could potentially be linked to their product models?


• How many products could inherit product features from their respective product models?
• How many unit codes for quantitative data are missing?
• How many price specifications do not feature currency codes?
• How many price specifications lack information about value-added taxes?
• How many prices do not exhibit validity durations?
• How many entities use shortcuts (e.g. gr:includes, gr:hasValue) rather than the full modeling pattern?

To find answers to these questions, we executed a number of SPARQL SELECT queries on our crawl data. The absolute frequency of some interesting entities in the crawl corpus is shown in Table 6.2. Our evaluation takes into account two different datasets, namely the full crawl and, whenever the full crawl is not feasible (e.g. if the query is too complex), a subset of the crawl. We obtained this subset by drawing a random sample of 100 offerings from the crawl. This gives us the means to test the aforementioned techniques under realistic conditions. Table 6.3 shows the results of our analysis by outlining the problems with their frequency in the data, complemented by a short explanation.

Table 6.2: Statistics of entities in the crawl corpus

Entity Type              Number of Instances (a)   GoodRelations Concepts
Product offers           3,097,631                 gr:Offering class
Product items            2,674,366                 gr:ProductOrService (b), gr:SomeItems, and gr:Individual classes
Product models           72,982                    gr:ProductOrServiceModel class
Price specifications     3,517,854                 gr:UnitPriceSpecification (c) class
Quantitative values      1,525,063                 gr:QuantitativeValue class
Quant. float values      18,705                    gr:QuantitativeValueFloat class
Quant. integer values    627                       gr:QuantitativeValueInteger class
Units of measurement     5,189,941                 gr:hasUnitOfMeasurement property
Currencies               3,379,722                 gr:hasCurrency property

(a) Note that these statistics somewhat deviate from the statistics reported in Table 3.2 of Chapter 3, because this time we counted distinct instances.
(b) The class gr:ProductOrService was here considered as a product item, although in a strict sense it subsumes classes for product items (gr:SomeItems and gr:Individual) and product models (gr:ProductOrServiceModel).
(c) GoodRelations also defines other types of price specifications [cf. Hep11], whereas we limit ourselves to gr:UnitPriceSpecification for our analysis.
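The entity statistics above were obtained with simple SPARQL aggregate queries. The following is a minimal sketch for the first row of Table 6.2, not a reproduction of our exact queries; the remaining counts were computed analogously:

    PREFIX gr: <http://purl.org/goodrelations/v1#>

    # Count distinct product offers in the crawl
    SELECT (COUNT(DISTINCT ?offer) AS ?numOffers)
    WHERE {
      ?offer a gr:Offering .
    }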

Table 6.3 reveals the percentage of invalid currency codes and invalid codes for the units of measurement, together with examples of wrong instances. However, most of the codes used are correct three-letter unit/currency codes.


Table 6.3: Data quality problems in the crawl corpus

Problem                                              Frequency            Ratio (a)
Redundant product model entity definitions           167                  0.23%
    (EAN comparison based on perfect matches, including identical datatypes)
Invalid codes for the unit of measurement            686                  0.01%
    (Instances: “KG”, “” (empty string))
Invalid currency codes                               11,014               0.33%
    (Instances: “฿”, “RO”, “EURO”, “” (empty string))
Missing product type information                     2,575,279            96.29%
    (By contrast, only 99,087 products have type information)
Missing links between products and make and models   47,830               na (b)
    (Missing links according to identical EANs. In comparison, 71,957 gr:hasMakeAndModel links are available in the crawl right away)
Products that could inherit product features         55,233               2.07%
    (Based on the presence of explicit gr:hasMakeAndModel links)
Missing unit codes for quantitative values           0 (c)                na
    (Due to its complexity, the query was executed over a random sample of 100 offering instances only)
Missing currency codes in price specifications       167,956              4.77%
    (We counted based on the absence of a gr:hasCurrency relationship)
No indication of gr:valueAddedTaxIncluded            962,270              27.35%
    (The price specification lacks a statement about value-added taxes)
Missing validity durations for prices                2,471,861            70.27%
    (missing gr:validFrom)
                                                     679,791              19.32%
    (missing gr:validThrough)
                                                     678,000              19.27%
    (missing both validity start and end)
gr:includes shortcuts                                products: 2,078,480  77.72%
    (gr:includesObject → gr:typeOfGood)
                                                     product models: 565  0.77%
    (gr:includesObject → gr:typeOfGood → gr:hasMakeAndModel)
gr:hasValue shortcuts                                18,586               1.22%
    (Shortcut for gr:hasMinValue and gr:hasMaxValue)
gr:hasValueFloat shortcuts                           18,588               99.37%
    (Shortcut for gr:hasMinValueFloat and gr:hasMaxValueFloat)
gr:hasValueInteger shortcuts                         0                    0.00%
    (Shortcut for gr:hasMinValueInteger and gr:hasMaxValueInteger)

(a) The ratio is calculated with respect to the value for an appropriate entity from Table 6.2.
(b) For entity links it was not useful to calculate a ratio (thus “na” for “not available”), because potentially there could exist m × n links, where m is the number of product models and n the number of products.
(c) This number reflects the analysis of a random sample of 100 offering instances, but in reality the value is expected to be much higher.


For example, we could detect ten different codes for the units of measurement. Of those, nine are correct, i.e. “C62”, “CMT”, “GRM”, “INH”, “KGM”, “LBR”, “MGM”, “MTR”, and “ONZ”. “LBM” does not exist in the UN/CEFACT code table [cf. Uni09a]. Furthermore, we found 36 correct currency codes [cf. Int08], namely “ARS”, “AUD”, “BGN”, “BRL”, “CAD”, “CHF”, “CLP”, “CNY”, “COP”, “CZK”, “DKK”, “EUR”, “GBP”, “HUF”, “IDR”, “ILS”, “INR”, “IRR”, “JPY”, “KES”, “LTL”, “LVL”, “MKD”, “MXN”, “MYR”, “PHP”, “PLN”, “RON”, “RUB”, “SEK”, “TND”, “TRY”, “UAH”, “USD”, “VND”, and “ZAR”. Among them, two have meanwhile been replaced by the Euro, i.e. “LTL” (Lithuanian Litas) and “LVL” (Latvian Lats). In addition, we encountered 62 different language tags in the crawl.

Further, we evaluated whether wrong datatypes are present in the data. Out of the sample of 100 random product offers, we detected the following instances whose literals have wrong datatypes:

    ex:Price1 gr:hasCurrencyValue "139.00"^^xsd:string .
    ex:Price2 gr:valueAddedTaxIncluded "1"^^xsd:integer .
    ex:Price3 gr:valueAddedTaxIncluded "true"^^xsd:string .

The correct instances would be:

    ex:Price1 gr:hasCurrencyValue "139.00"^^xsd:float .
    ex:Price2 gr:valueAddedTaxIncluded "true"^^xsd:boolean .
    ex:Price3 gr:valueAddedTaxIncluded "true"^^xsd:boolean .
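Such datatype violations can be located with a generic filter on the literal's datatype. The following is a minimal sketch for currency values, not the exact query we used:

    PREFIX gr: <http://purl.org/goodrelations/v1#>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    # Find currency values whose literal is not typed as xsd:float
    SELECT ?ps ?value
    WHERE {
      ?ps gr:hasCurrencyValue ?value .
      FILTER (datatype(?value) != xsd:float)
    }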

Our statistics indicate that the data quality problems presented in this chapter are also prevalent in the crawl. We thus conclude that data cleansing and enrichment are essential for the Web as a whole.

6.5 Implementation of a Data Management Web User Interface

To ease the maintenance of the cleansing rules presented in this chapter, we developed a Web user interface for data management. The user interface incorporates three important functionalities, namely

• the loading and unloading of data from RDF files into an RDF store,
• rules for the cleansing and enrichment of data, and
• dynamic rules.


The data management user interface is illustrated in Figure 6.3. With tabbing, it is possible to switch between the three functions. In the example given, the tab for managing the loading and unloading of data is active. Data can be conveniently added to and removed from a SPARQL endpoint, and a tabular view allows the user to examine the data of every RDF graph currently available in the RDF store.

Figure 6.3: Data management Web user interface

6.5.1 User Interface Tabs

In the following, we describe the three tabs of the user interface in Figure 6.3 in more detail. We arranged them by their natural order of execution (e.g. the data loading precedes the cleansing task).

6.5.1.1 Loading Data

First of all, some data needs to be present in a SPARQL endpoint before it can be cleansed or enriched. The data management section allows adding triples from a set of local RDF files placed in a particular folder. They are uploaded to a SPARQL endpoint using the insert functionality of SPARQL Update queries13. Similarly, triples can be deleted from the SPARQL endpoint at any time.

13 For selected SPARQL endpoints such as Stardog, Fuseki, and Virtuoso Open Source, we provided implementations relying on their native APIs to obtain better throughput.
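Conceptually, each file upload amounts to a SPARQL Update request of the following form. This is a minimal sketch; the graph name and the example triples are hypothetical:

    PREFIX gr: <http://purl.org/goodrelations/v1#>

    # Load triples from one data source into a dedicated named graph
    INSERT DATA {
      GRAPH <urn:shop-data.nt> {
        <http://example.org/offer/1> a gr:Offering ;
            gr:name "Example offer"@en .
      }
    }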


6.5.1.2 Cleansing Rules

The next step is to cleanse the data and prepare it for product search. The cleansing and enrichment rules implemented in the prototype for the most part align with the techniques presented earlier in this chapter. After executing the cleansing rules, the products in the SPARQL endpoint are ideally canonicalized for effective querying.

6.5.1.3 Dynamic Rules

In addition to the cleansing rules executed prior to product search, some triples may be added to the RDF graph during the search process. We refer to this functionality as dynamic rules. They could be added in response to user interaction or an operating recommender system, e.g. by taking into account user settings, preferences, past purchases, etc. Up to now, we have implemented a simple mechanism where the user is asked whether he would like to expand the search to products that belong to more generic categories than the current category. It would be straightforward to extend this mechanism to user-assisted ontology or instance matching.

6.5.2 Data Management with RDF Graphs

For effective data management, it must be possible to seamlessly add and remove triples from a SPARQL endpoint. In particular, it is a key requirement to always be able to return to the previous or initial state. For this reason, the amendments to the SPARQL endpoint need to be traced in a reliable way.

As data from specific data sources is added to a SPARQL endpoint, the new triples are assigned a named graph [Car+05] reflecting their provenance. The named graph is represented by a Uniform Resource Name (URN) composed of the name of the data source, i.e. the name of the RDF file or the cleansing rule that was applied. An example of such a graph name is urn:0-2-1-unit-conversion, which denotes the RDF graph with triples generated by applying the cleansing rule for unit conversion. This simple technique allows us to comfortably delete any previously added RDF triples. Nonetheless, it does not support the distinction between novel and prior statements, so conflicting assertions could arise.

In order to avoid conflicting assertions, we add some meta information together with the newly created RDF graph. That is, we replace an entity by redefining an updated entity, and deprecate the old entity by attaching some meta information to the new graph. Our mechanism is illustrated in Figure 6.4.


More precisely, we use the owl:deprecated annotation property to flag entities as deprecated if they were replaced by a newer version:

“An annotation with the owl:deprecated annotation property and the value equal to "true"^^xsd:boolean can be used to specify that an IRI is deprecated.” [MPP12, Section 5.5]

Hence, instead of replacing RDF triples in an RDF store, we only invalidate the old triples, which later permits reverting to the previous state. Since the meta information is part of the new graph, it will be discarded as well on deletion of the respective RDF graph.

Figure 6.4: Axiom replacement mechanism. (a) urn:original-graph.nt, weight in gram: ex:Product gr:weight ex:Weight, a gr:QuantitativeValue with gr:hasValue 250 and unit "GRM", flagged with owl:deprecated "true"^^xsd:boolean; (b) urn:new-graph.nt, weight in kilogram: ex:Product gr:weight ex:Weight2, a gr:QuantitativeValue with gr:hasValue 0.25 and unit "KGM"
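The replacement step can be realized as a single SPARQL Update request that creates the new graph, including the deprecation statement for the old entity. This is a minimal sketch following Figure 6.4; the ex: namespace is the hypothetical one from the figure:

    PREFIX ex:  <http://example.org/ns#>
    PREFIX gr:  <http://purl.org/goodrelations/v1#>
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    # Add the updated entity and the deprecation flag in one named graph,
    # so that deleting the graph restores the previous state
    INSERT DATA {
      GRAPH <urn:new-graph.nt> {
        ex:Weight owl:deprecated "true"^^xsd:boolean .
        ex:Product gr:weight ex:Weight2 .
        ex:Weight2 a gr:QuantitativeValue ;
            gr:hasValue 0.25 ;
            gr:hasUnitOfMeasurement "KGM"^^xsd:string .
      }
    }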

In a SPARQL query, data marked as deprecated can be conveniently filtered out, as illustrated in Listing 6.50. The query in Listing 6.50 is intended to be applied to the joint RDF graph in Figure 6.4. Because the query filters out any deprecated entities, it does not match the definition of ex:Weight from Figure 6.4.

    SELECT ?value ?uom
    WHERE {
        ex:Product gr:weight ?weight .
        ?weight gr:hasValue ?value ;
            gr:hasUnitOfMeasurement ?uom .
        FILTER NOT EXISTS { ?weight owl:deprecated true }
    }

Listing 6.50: SPARQL SELECT with owl:deprecated

In a production environment, more advanced approaches for capturing additions and removals can be considered, like deltas on RDF graphs [BC04] or relevant parts from the PubSubHubbub protocol [PM10].


6.5.3 Execution Order of Cleansing Rules

Sometimes, we encounter non-trivial dependencies when executing inference rules for cleansing and enrichment. To give an example, it is safe to interchange the cleansing rules for expanding and fixing numeric values as shown in Listing 6.51, because neither rule makes assumptions about the other.

    # Variant A: expansion first, then fixing of numeric values
    ex:QV gr:hasValue "11990,0"^^xsd:float .
    # 1. expansion:
    ex:QV gr:hasMinValue "11990,0"^^xsd:float .
    ex:QV gr:hasMaxValue "11990,0"^^xsd:float .
    # 2. fix numeric values:
    ex:QV gr:hasMinValue "11990.0"^^xsd:float .
    ex:QV gr:hasMaxValue "11990.0"^^xsd:float .

    # Variant B: fixing of numeric values first, then expansion
    ex:QV gr:hasValue "11990,0"^^xsd:float .
    # 1. fix numeric values:
    ex:QV gr:hasValue "11990.0"^^xsd:float .
    # 2. expansion:
    ex:QV gr:hasMinValue "11990.0"^^xsd:float .
    ex:QV gr:hasMaxValue "11990.0"^^xsd:float .

Listing 6.51: Interchangeable execution of two cleansing rules

However, for other cleansing rules that rely on a certain pattern in the data, the order is non-trivial. It matters, for example, whether we replace deprecated classes by their new counterparts before we execute rules that rely on these new classes. Similarly, the unit conversion functionality assumes unit values to be expressed as intervals, i.e. modeled as values with lower and upper boundaries. The rule to convert point values to intervals therefore needs to be executed beforehand. It is thus extremely important to arrange cleansing rules such that the execution order respects potential interdependencies between rules.

6.5.4 Translation versus Canonicalization

In general, there are two possible approaches for unit conversion in a SPARQL endpoint: either on-the-fly translation of values during query execution, or canonicalization. The latter can be accomplished via materialization of values in a given base unit (e.g. “KGM”). Compared to real-time translation, it is computationally less expensive for a SPARQL endpoint to compare values based on uniform units than to translate them on the fly. For applications, it is furthermore straightforward to define a function that translates values into their preferred units (e.g. to “GRM”): basically, a single conversion formula needs to be applied, e.g. the value in “KGM” has to be multiplied by exactly one thousand to obtain the value in “GRM”.


To consolidate various units, our data management interface executes the cleansing rule shown in Listing 6.46 that updates14 a quantitative value with the one represented in its base unit.

14 Please note that in the example given in Listing 6.46 and in all previous examples of this chapter, the triple with the owl:deprecated property was omitted, since its rationale was only explained later.
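The application-side translation from the canonical unit back to a preferred unit then reduces to a single arithmetic expression. The following is a minimal sketch, assuming weights have already been canonicalized to “KGM”:

    PREFIX gr: <http://purl.org/goodrelations/v1#>

    # Convert canonicalized kilogram values to gram for display
    SELECT ?qv (?kg * 1000 AS ?grams)
    WHERE {
      ?qv gr:hasValue ?kg ;
          gr:hasUnitOfMeasurement "KGM" .
    }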

6.6 Conclusion

In this chapter, we have developed a typology of data quality problems for e-commerce and presented techniques to address these problems. We further analyzed the prevalence of these obstacles in the Web crawl from Chapter 3, whereby we confirmed an urgent need for data cleansing on the e-commerce Web of Data. We complemented our work with a data management Web user interface that facilitates the maintenance of data management, data cleansing, and enrichment rules.

The cleansing and enrichment rules introduced in this chapter are meant to be neither comprehensive nor exhaustive. Without any doubt, there exist many more cleansing rules that could be applied to e-commerce data at different levels of complexity. Besides the rules presented herein, we envisage more complex rules that take advantage of translations into various languages, natural language processing (NLP) techniques, or information extraction tasks. For example, missing price specifications could be obtained by employing a simple extraction rule that creates the respective properties from prices in text.



7 Faceted Product Search on the Semantic Web

7.1 Problem Statement . . . 247
7.2 State of the Art and Related Work . . . 249
    7.2.1 Faceted Search . . . 249
    7.2.2 Other Approaches . . . 250
        7.2.2.1 Adaptive Faceted Search . . . 251
        7.2.2.2 Faceted Search over RDF Data . . . 251
        7.2.2.3 Faceted Search over Structured E-Commerce Data . . . 251
        7.2.2.4 Our Approach . . . 252
7.3 Adaptive Faceted Search Interface for Product Offers . . . 252
    7.3.1 Faceted Search User Interface . . . 253
        7.3.1.1 Keyword Search . . . 255
        7.3.1.2 Faceted Navigation . . . 255
    7.3.2 Implementation . . . 256
    7.3.3 Incremental Search Strategy . . . 257
    7.3.4 Instance-based Search Filtering . . . 258
    7.3.5 User Feedback . . . 259
7.4 Evaluation . . . 260
    7.4.1 Impact of Search Specificity on the Size of the Result Set in Product Search . . . 260
        7.4.1.1 Method . . . 260
        7.4.1.2 Results . . . 261
        7.4.1.3 Discussion . . . 261
    7.4.2 Usability Studies of Faceted Search Interfaces for Products . . . 262
        7.4.2.1 Method . . . 263
        7.4.2.2 Results . . . 264
        7.4.2.3 Discussion . . . 265
    7.4.3 Proof of Concept with Real Product Data from the Web . . . 266
7.5 Conclusion . . . 267


Linked Open Data (LOD) [e.g. BHB09; HB11] has become a popular paradigm for publishing and consuming data on the Web. In the past few years, a growing amount of e-commerce information has been published either as LOD or as embedded Microdata [cf. Hic13] or Resource Description Framework in Attributes (RDFa) [cf. Adi+13] markup in Hypertext Markup Language (HTML) that can be easily converted into Resource Description Framework (RDF) and combined with LOD sources. Unfortunately, the usage of such data for product search and comparison remains an open challenge, for the following reasons: First, the products and services are themselves specific and heterogeneous with regard to their relevant characteristics. Second, the search process involves learning about the option space, i.e. it is difficult to formulate queries without knowing how well conceptual elements from the schemas are populated and how much they influence the size and characteristics of the result set. Third, there is also learning about correspondences in the underlying product features, i.e. approximate ontology alignments. For instance, the product features “input voltage” and “supply voltage” might be equivalent in the context of a particular product search, while they are not exactly equivalent in the general case. A human user might discover, refine, and revise such context-bound correspondences in the process of searching.

Search interfaces for e-commerce data from the LOD paradigm have so far either tried to consolidate the data into a single set of product features and product categories or confronted the human users with the raw data and its inherent conceptual heterogeneities.

In this chapter, we present an adaptive faceted search interface over RDF data for deep product comparison on the Web that is directly based on the popularity of schema elements in the data and does not rely on a rigid, conceptual schema with hard-wired product features, thereby being suitable for arbitrary product domains and product evolution. We (1) present a proof of concept and demonstrate it with real product data from the Web, (2) provide some preliminary evidence that the product space in a sample dataset narrows down logarithmically with the number of product features used in a query, and (3) show that the usability of our approach is comparable with approaches with hard-wired product features, while improving the depth and breadth of product search and comparison.

Our findings suggest that an instance-driven faceted search system for LOD, which dynamically adapts to user requirements and patterns in data, is a promising direction for future search interfaces for e-commerce and other application domains, and a precondition for meaningful product search on the Web of Linked Open Data.


7.1 Problem Statement

In recent years, companies have increasingly added structured e-commerce data published as Microdata or RDFa markup to HTML Web pages [MP12; MB12; MPB14]. Such product, store, and offer data is primarily based on the GoodRelations and schema.org vocabularies and, while mainly provided for major search engines like Google, forms a promising data source for novel Web applications and services.

Unfortunately, the available means for exploring this giant RDF graph of e-commerce information are limited. The diversity of products and data sources, the inherent learning effects during search, the heterogeneity in terms of data semantics with the resulting need to align data schema elements on the go, and the sparsity of the graph of product information create special requirements for product comparison solutions that are currently not met. On top of the technical challenges, products and services are typically characterized by a vast variety of product features that influence the overall utility of a certain product, trade-offs between such features, and a significant variation in item prices. Consequently, product comparison includes multi-dimensional, non-linear trade-off decisions. In essence, there are at least the following fundamental requirements that a product comparison solution on the Web should fulfill:

1. Multi-dimensional views on products. The complexity and dynamics of products and services necessitate multi-parametric search models based on distinguishing properties and attributes of product entities, which, on the Web of Data, can be realized by considering the structure of the available data. In other words, it is important to allow arbitrary paths for narrowing down the set of candidate products instead of hard-wired, sequential search processes, and to integrate the product feature perspective with the product price perspective at any given time in the search task at hand.

2. Support learning about the option space. Search is an iterative, incremental learning process [e.g. MC10, p. 9] rather than a static, one-shot query. For example, users grasp new information about the option space in every single search turn [Bat89], possibly leading to changes in their price expectations or the ranking of their product feature preferences. Thus, users need a way to relax or refine their constraints and preferences based on how those modify the size of the option space.

3. Facilitate incremental, user-driven schema alignment. For product search with incremental learning, it is not only vital to assist in navigating and pruning the option space, but also to actively engage the user in extending and consolidating


the schema level of the underlying data. Since users are likely to learn about correspondences in the underlying product features during the user interaction, the approximate alignment of conceptual elements should be integrated in the iterative search process, and be fed back to the graph. E.g., a user interface could ask the user for approval of a possible match between two product features.

4. Take into account the popularity of conceptual elements in the instance data. A user interface that is solely based on the schema elements defined in the underlying ontologies is likely inefficient, because the user lacks information about the availability of matching data (e.g. whether a property is used at all) and the relevance of a constraint on the option space (e.g. whether products differ in that property). Due to the sparsely populated graph of product information on the Web, efficient user interfaces should thus adapt to the actual usage of schema elements in the data rather than be based on predefined, rigid schema definitions.

5. Utilize metrics for the efficiency of the search process. An efficient search interface presents choices to the user that help to quickly narrow down the option space, e.g. by proposing discerning features that partition the option space in the best possible way, or by suggesting properties that promise the highest utility to a given user need. Note that such metrics should not simply look at the effect on the size of the option space (efficient partitioning into subsets), but also at the effects with regard to the quality of the fit to the users' preference structures (maximize utility in the economic sense).

Established search approaches fall short of deep product comparison at Web scale. Information retrieval (IR) (e.g. keyword searches over document collections) essentially flattens multi-dimensional product descriptions to a simple, one-dimensional term match. At the other extreme, query formulation with the SPARQL Protocol and RDF Query Language (SPARQL) is generally too complex for the average user and lacks mediation between the conceptual models of the data and the mental models of human users. Other approaches suggested for browsing RDF data (e.g. Tabulator [Ber+06]) are too low-level for serious product search. As a result of these shortcomings, consumers tend to narrow down the set of candidate offers very early in the search process, which bears the risk that potentially interesting product offers are eliminated prematurely. Also, results are highly biased towards a single product or offer dimension (e.g. low prices) [Sac05].

In this chapter, we assume that faceted search [Tun09], a special form of exploratory search [Mar06], is an appropriate approximation for deep product comparison on the Web


of Data. Faceted search is well established both in practice (e.g. eBay1 and Amazon2) and in academia as a way to guide users through option spaces [e.g. Wei+13; FH11; ODD06]. In a nutshell, it constitutes a multi-dimensional interaction paradigm based on facet-value pairs, e.g. product dimensions, where facet views and the result set dynamically adapt to the actual data. I.e., irrelevant options are hidden each time a user selects a facet or facet value.

As the key contribution of our work, (1) we propose an instance-driven, adaptive faceted search interface for deep product comparison on the Web of Linked Data. In particular, we describe the main components of the faceted search interface and detail the iterative, incremental search strategy applied to the product domain. In this regard, we also introduce an instance-based search filtering approach and highlight the role of user feedback in RDF environments. We then evaluate our approach in the following ways: (2) We provide evidence that faceted product search leads to a logarithmic reduction of the result set; (3) we empirically show that our dynamic faceted search interface has no significant negative impact on usability; and (4) we showcase our approach using some real e-commerce data that we have collected from the Web.

The rest of this chapter is structured as follows: In Section 7.2, we compare our work with related works from the literature; Section 7.3 describes our faceted search interface over structured product data; in Section 7.4, we evaluate our approach; and finally, Section 7.5 concludes our work and discusses future directions.

1 http://www.ebay.com/ (accessed on January 14, 2015)
2 http://www.amazon.com/ (accessed on January 14, 2015)

7.2 State of the Art and Related Work

In this section, we summarize the theoretical background of faceted search and point to related research on faceted search interfaces over structured data.

7.2.1 Faceted Search

Faceted search is a multi-dimensional search paradigm based on facet-value pairs. It offers interactive guidance to users via progressively refining a query against a collection of items [Tun09, p. 23; Wei+13]. In practice, in every selection step a user sees only those facet-value pairs for which further interaction is reasonable [e.g. Wei+13]. Thus, faceted search effectively eliminates the risk of hitting dead ends, i.e. the return of empty result sets [Tun09, p. 23].


While the term faceted search is sometimes equated with faceted browsing or faceted navigation [e.g. Wei+13; MC10, p. 95], it is often understood as an interaction paradigm consisting of faceted navigation complemented with keyword search functionality [e.g. Tun09, p. 24]. Substantial research related to faceted search was also conducted under the term dynamic taxonomies [Sac00; Sac05; ST09].

A faceted search interface is based on facets and facet values, or terms [cf. Wei+13]. Facets can be compared to mutually orthogonal categories (i.e. terms cannot appear in multiple categories) [e.g. Yee+03; ODD06; Wei+13], whereas facet values are instances of these categories. In the context of product search, facets represent product dimensions (e.g. features “color” or “material”) and facet values correspond to instances of these dimensions (e.g. “brown” or “wood”). Products are usually represented by multiple facets and facet values.

User selections in faceted search interfaces map to boolean expressions for the filtering of the option space. While a selection of multiple facets generally leads to their conjunction (e.g. color “brown” and material “wood”), multiple facet values are mostly combined using disjunction (e.g. color “brown” or “red”) [cf. Hea+02]. In set-theoretic terms, conjunction corresponds to the intersection of items, and disjunction to their union.

Faceted search interfaces dynamically adapt to changes in selection. In other words, in response to user interaction, the facet views update on reducing or expanding the option space. Furthermore, the faceted navigation paradigm3 does not lead to dead ends or empty results [e.g. FH11], because a user is always presented facets for which instance data is available. Unlike parametric searches, where a user is forced into a sequential search order (e.g. choose camera type, then focal length, after that picture resolution, and finally color), faceted search allows users to drill down the search space in any preferred order that best suits their individual learning abilities. A few commercially available faceted search solutions include simple mechanisms for indicating the effect of a facet or facet value on the size of the option space.

3 Please note, however, that keyword searches as part of faceted search interfaces can clearly cause empty results.

7.2.2 Other Approaches

We position our work at the intersection of human-computer interaction (HCI), Semantic Web, and e-commerce. Accordingly, we deem three research directions relevant, namely (1) adaptive faceted search interfaces, (2) faceted search over RDF data, and (3) faceted product search on the Semantic Web.


7.2.2.1 Adaptive Faceted Search

In adaptive faceted search interfaces, user controls dynamically conform to the underlying data constrained by the current selection. An adaptive faceted search interface was proposed in [Abe+11] to investigate content within Twitter streams. Facets and facet values are computed based on semantic enrichment of Twitter messages. The search interface adapts according to frequency, user profile, temporal context, and diversification. In [Tva11], the author combines approaches from the Semantic Web, the Adaptive Web, and the Social Web. The goal is to facilitate information access on the Web via an adaptive, exploratory search approach relying on multiple search paradigms like keyword search and faceted navigation. Facets are generated automatically and customized based on the user's individual relevance judgement. Another work related to personalized faceted search over Web document metadata was proposed in [KZL08], where the facet views adapt according to user ratings.

7.2.2.2 Faceted Search over RDF Data

As an easy-to-use alternative to SPARQL querying, faceted search has gained wide traction as a search paradigm for RDF data. Faceted search as a means to navigate over arbitrary datasets with structured data was formalized in [ODD06]. Unlike conventional faceted search, their approach introduces additional expressivity that allows to navigate different types of resources (e.g. not only product offers). A similar approach develops a formal model for question answering (QA) based on faceted queries and also regards ontological reasoning [Are+14a]. The work in [FH11] combines the ease-of-use of faceted search with the expressive power of the SPARQL query language. In comparison to the two other works that operate on set operations over resources, this approach provides navigation through query transformations at the syntactic level. Some large-scale faceted search interfaces over real RDF datasets were suggested in [Hah+10] and [Are+14b]. In [Hah+10], the authors built a faceted search interface over structured Wikipedia infobox data (i.e. DBPedia [Aue+07]). The work in [Are+14b] studies limitations of conventional faceted search systems, and presents a faceted search interface over Yago [cf. SKW07] that incorporates full-text search on top of Lucene [cf. MHG10].

7.2.2.3 Faceted Search over Structured E-Commerce Data

A similar approach to ours was suggested in [VvDF12]. The authors demonstrate an implementation of a faceted product search interface over structured e-commerce data


from the Web. The data store4 presently contains a selection of product offers along with review data from selected online stores. The authors claim that their RDF database can be extended with additional products by submitting the respective Uniform Resource Identifiers (URIs) of Web pages featuring product data in RDFa or Microformats. Albeit constituting a valuable contribution, their research does not address a couple of problems that we are tackling in our work. As the main difference, they only support basic commercial properties of product offers, whereas we provide deep product comparison via product features. Our faceted search interface is more directed towards user-friendliness, as it supports all search paradigms (i.e. keyword search, faceted browsing, and instance-based search filtering) throughout the whole search process. Furthermore, unlike their approach, which categorizes products into a rigid category structure, our faceted search interface is fully instance-driven.

4 http://xploreproducts.com/ (accessed on December 30, 2014)

7.2.2.4 Our Approach

In summary, the work presented in this chapter differs from the aforementioned works in at least one, if not several, of the following dimensions:

• Our application area is the domain of products,
• the search user interface is instance-driven,
• our approach is versatile, i.e. it supports arbitrary e-commerce data in RDF, as long as it adheres to the GoodRelations meta-model,
• the search user interface is designed to be fully iterative, i.e. user interaction feeds knowledge base augmentation and refinement, and
• the implementation is fully RDF- and SPARQL-1.1-based, i.e. constituting a native Semantic Web approach.

Some interesting properties presented in other works that we currently do not support include, among others, personalization and ontology-based adaptive facet generation.

7.3 Adaptive Faceted Search Interface for Product Offers

In the following, we describe an adaptive faceted search interface for product offers over RDF data.


7.3.1 Faceted Search User Interface

In Figure 7.1, we propose a mock-up of a general faceted search interface over e-commerce data. The faceted search interface combines the two interaction paradigms keyword search and faceted navigation. As illustrated in the graphic, the keyword search field is placed prominently at the top of the search interface, and the boxes surrounding the result list in the middle represent the faceted navigation controls. The User Dialog box, displayed on the upper right part, can further serve as a means to non-intrusively incorporate user feedback.

Figure 7.1: Mock-up of a faceted search interface for e-commerce (keyword search field at the top; result list with product details, images, prices, and pagination in the middle; facet boxes for product features, categories, commercial properties, vendors, prices, and additional configurations; and a user dialog suggesting, e.g., the consolidation of two seemingly equivalent features)

Figure 7.2 depicts a screenshot of the faceted search prototype that we developed as the main contribution of this chapter. The demonstrated results are based on toy data modeled using the Vehicle Sales Ontology (VSO)5. Our online tool6 effectively integrates product details, product category information, and commercial properties related to product offers. Thereby, it is possible to conduct deep product comparison without having to overly focus on the price.

The user interface corresponds in large part to the mock-up from Figure 7.1. For every result in the result list, a link is provided that, when clicked, opens a modal window where the full product details show up. Nonetheless, the most interesting facts such as product image, name, description, and price are summarized in the result list.

5 http://purl.org/vso/ (accessed on January 15, 2015)
6 http://www.ebusiness-unibw.org/tools/product-search/ (accessed on January 15, 2015)


Figure 7.2: Screenshot of our faceted product search prototype (annotated elements: SPARQL endpoint selection, tooltips, expandable facets, product details link, normalized price values, price ranges, and pagination)

If the result set exceeds ten items, the remaining results are moved to additional result pages that can be accessed via pagination controls [cf. MC10, pp. 110–116].

The part to the left of the result list is dedicated to product-related details and commercial properties. It mainly includes product features and prices, but also manufacturers, vendors, payment options, or business functions. The right part features a category filter, along with additional filter configurations, such as excluding product offers without images or reversing the result order. The search user interface makes heavy use of tooltips that help the user better understand the option space.

Our search interface is instance-driven. By that we mean that it is directly based on the constraints imposed on the underlying RDF data, i.e. it dynamically adapts to the availability of the data, such as the presence of product features or categories. Accordingly, no fixed schema is necessary for generating the views; rather, the availability of the data determines the appearance. In other words, our approach is able to flexibly cope with structured product data from various data sources. Only the high-level product offer data builds upon the GoodRelations meta-model [Hep08a], which has also been made


the core meta-model for e-commerce data in schema.org and is thus officially endorsed by Google, Microsoft, Yahoo!, and Yandex [Guh12]. This allows for useful guided navigation paths even in the absence of richly axiomatized products.

7.3.1.1 Keyword Search

We incorporated two kinds of keyword searches into our faceted search prototype: the first for product offers, and the second for searching within product categories. The keyword searches match terms within textual properties attached to objects, which for product offers includes names and descriptions of product offers, instances, and models, and for product categories labels and comments. An autocomplete feature assists the user in finding the right terminology. It is based on a light-weight SPARQL query executed over the product names and product category labels, respectively.

Simple keyword search functionality can be obtained using the SPARQL CONTAINS function [HS13], even though such queries are commonly costly for large datasets, as most SPARQL implementations iterate over all relevant objects. Some SPARQL endpoints thus support operations over optimized full-text indexes built from textual properties in the available data. Search indexes like Apache Lucene [MHG10] even support wildcard queries or fuzzy string matches based on a given threshold limit [MHG10, pp. 99–101]. If supported by the underlying SPARQL endpoint, our prototype relies on Lucene for keyword searches. Otherwise, it falls back to the much slower SPARQL CONTAINS function.
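The fallback variant can be expressed directly in SPARQL 1.1. The following is a minimal sketch, not the prototype's actual query; the search term “notebook” and the restriction to offer names are hypothetical:

    PREFIX gr: <http://purl.org/goodrelations/v1#>

    # Case-insensitive substring match over offer names
    SELECT ?offer ?name
    WHERE {
      ?offer a gr:Offering ;
             gr:name ?name .
      FILTER (CONTAINS(LCASE(STR(?name)), "notebook"))
    }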

7.3.1.2 Faceted Navigation

The exploratory search capability is provided through the faceted navigation controls that complement the keyword searches. Faceted navigation uses boolean constraint filtering based on product dimensions, commercial properties of product offers, and product type information. As the product features facet view in Figure 7.2 suggests, product features are initially displayed in compact form and expanded to their corresponding values as the user clicks on them. The numbers given in square brackets indicate the quantity of instances in the result set affected by applying the respective filter. A selection of multiple product features leads to their conjunction (logical and), i.e. items to appear in the result list need to match on every selected feature. For the remaining facet views, a disjunctive approach is used (logical or). If, for instance, a payment option has already been selected


before, then the user could add a second payment option. Matching candidate offers then have to accept either one or both of the selected payment options.

The facet values displayed in the faceted search interface correspond to qualitative, quantitative, and datatype properties in the GoodRelations meta-model. While qualitative or datatype values may be implemented as checkboxes that a user can click on, quantitative values work differently. One possible alternative is to group quantitative values into classes given as range intervals (e.g. “$ 0–20”, “$ 20–50”, etc.). Our approach, however, uses a range slider, as illustrated on the price view in Figure 7.2. To build up the range slider, we need to generate a useful number of classes, each having the same width. To obtain the class width, the interval between the minimum and maximum value is divided by the number of classes, which corresponds to the amount of values, capped at a specified upper limit (e.g. a maximum of 30 classes). The height of the bars is calculated relative to the class with the highest frequency of values, where the scale is logarithmic and the maximum possible height is predefined. By displaying the frequency of values in every class, a user can quickly gauge the possible outcomes of his decision. Even though this approach works well for point values, we need to rely on a heuristic for range intervals: for closed intervals, we consider the lower bound for the classification; and for open intervals, we take whatever value is available.
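In SPARQL terms, conjunctively selected facets become additional triple patterns, while disjunctively selected facet values become UNION branches. The following is a minimal sketch with hypothetical facet selections over VSO properties from our toy data; the literal values are made up for illustration:

    PREFIX vso: <http://purl.org/vso/ns#>

    # Conjunction across facets, disjunction within the color facet
    SELECT DISTINCT ?car
    WHERE {
      ?car vso:transmission "manual" .
      { ?car vso:color "brown" . }
      UNION
      { ?car vso:color "red" . }
    }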

7.3.2 Implementation

From a conceptual point of view, the front-end of the application is backed by data from an RDF store accessible through a SPARQL endpoint, which can be configured via the endpoint selection dropdown menu in Figure 7.2. As of writing this chapter, we had tested our prototype with three different SPARQL endpoints compliant with the World Wide Web Consortium (W3C) SPARQL 1.1 standard, namely Virtuoso Open Source Edition7, the Jena-TDB-based Fuseki SPARQL server8, and Stardog9. Every single facet view of the user interface is generated by executing its own, unique SPARQL query, each of which takes into account the current set of constraints. Facet-value pairs in RDF are represented by properties and instances (or values). Similarly, conjunction of facets and disjunction of facet values within a facet are realized in SPARQL using a sequence of triple patterns and UNION clauses, respectively. The human-readable labels shown throughout the user interface in place of URIs are, whenever available in the RDF store, extracted from product vocabularies and instance labels.

7 http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/ (accessed on May 26, 2014)
8 http://jena.apache.org/documentation/serving_data/ (accessed on May 26, 2014)
9 http://stardog.com/ (accessed on May 26, 2014)


The front-end of the application was realized using HTML 5 [Hic+14] for static content, JavaScript [Eic05] and jQuery10 for user interaction, and Twitter Bootstrap11 for page responsiveness. The back-end is based on the Python programming language and the Jinja 212 templating engine for generating custom-tailored SPARQL queries that can later be submitted to the SPARQL endpoint. The application can be run on any general-purpose Apache Web server instance where the mod_python and mod_wsgi modules are enabled. If supported by the SPARQL endpoint, keyword searches execute over a full-text search engine such as Lucene [cf. MHG10].

10 http://jquery.com/ (accessed on February 19, 2016)
11 http://getbootstrap.com/ (accessed on February 19, 2016)
12 http://jinja.pocoo.org/ (accessed on February 19, 2016)

7.3.3 Incremental Search Strategy

At every search step, the user is presented with dynamic facet and result views that represent the search space pruned according to the current selection. Product search ends once the user is satisfied with the results or unable to find a path leading him or her to better search results. Until then, the search process may repeatedly switch between multiple search paradigms, as illustrated in Figure 7.3.

Figure 7.3: Incremental search cycle among multiple search paradigms (between search start and search stop, the user alternates between keyword search, faceted navigation, instance-based search filtering, and user feedback/alignments)

The user dialog in faceted search is fundamentally a decision tree problem, where the user interaction steps are branches of the tree. Because the facets are orthogonal to each other, the decision tree can be constructed in any order [ODD06]. However, if we want to optimize the search efficiency for the user, we have to create and, if necessary, update the resulting tree based on a “best split” strategy known from decision tree research in


data mining [TSK05, p. 158]. The literature has produced popular algorithms for iteratively choosing attributes that maximize the information gain, above all the Iterative Dichotomizer 3 (ID3) algorithm [Qui86] and its extensions C4.5 [Qui93] and ID5R [Utg89]. In this context, [KZL08] mention some popular facet-pair suggestion strategies, namely relying on frequency, probability, and the information gain. The authors in [VFK13] further give an overview of different metrics appropriate for product search to help decide which facets shall be presented to the user. So, a user would ideally be presented in every search step with the options (facets and facet values) that best partition the search space.

At the time of writing this chapter, the splitting strategy employed by our prototype relied on presenting facets and facet values in descending order of their frequency in the data [cf. KZL08]. Although it is a very simple strategy, we are well aware that this methodology suffers from some important weaknesses. For example, if all or only very few of the available items exhibit a certain feature or feature-value pair, then the current algorithm tends to over- or underrate their value for the search progress. Similarly, the algorithm is easily misled when multiple instances of the same feature (or interrelated features) belong to a single item (e.g. the feature-value pairs “varnish: red” and “color: red”).

7.3.4 Instance-based Search Filtering

As a complement to faceted search filters over aggregated data, we herein present a novel idea for supporting the incremental search process, namely instance-based search filtering. Sometimes, a user might be looking at the product details of a particular product offer (see Figure 7.4) and discover features that he or she intends to consider for the next filtering steps in the search process. By going back to the result list, however, the user would lose track of the respective features and find them only by chance in the list of displayed product features, namely if they are ranked high with respect to the current items in the option space. Of course, the same holds true for feature values and individuals as well.

From a user interaction point of view, we can prevent this level of indirection by letting the user apply filters directly from the product details page (see Figure 7.4). As a nice side effect, this solution facilitates the comparison among similar products via pivoting [Ste+09, pp. 83f.] across a collection of items that share the same features. As a requirement for this to work well, however, the properties in the RDF graph need to be consolidated first.


Figure 7.4: Screenshot of a product details modal window with instance-based search filtering

7.3.5 User Feedback

For product search with incremental learning, it is not only vital to assist in navigating and pruning the option space, but also to actively engage the user in the search process and to incorporate user feedback. As already depicted in the mock-up in Figure 7.1, a viable approach is to integrate a user dialog in the search interface. This approach goes beyond the traditional relevance feedback methods known from IR (e.g. Rocchio [Roc71; SB90]), where a system would typically consider explicit user feedback (via selection of relevant documents or click feedback) or implicit system feedback (via local or global analysis) to revise a query towards improving query results [BR11, pp. 178–180; cf. Hea+02]. Mentioning here all possibilities of providing explicit user feedback in search systems is outside the scope of this work. Yet, a user interface could, e.g., ask the user to approve a possible match between two features. The current status of our implementation takes into account user feedback in the form of a dialog box that pops up informing the user about potentially interesting super-concepts with respect to the concept at hand. On that account, a user is able to expand the search scope.


In an RDF environment, axioms reflecting such newly gathered knowledge can easily be added to the existing graph of data as named RDF graphs [Car+05], potentially managed on a per-user basis. This way, the newly created named graphs can persist in the RDF store in order to reflect past user experiences, or be cleared at some point, e.g. if the materialized axioms were highly specific to a particular information need and a user intends to restart the search from scratch. The basic idea behind data management with RDF graphs was already explained in Chapter 6.
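For instance, a user-approved correspondence between two product features could be persisted with a SPARQL Update request. The following is a minimal sketch; the property IRIs and the per-user graph name are hypothetical:

    PREFIX owl: <http://www.w3.org/2002/07/owl#>

    # Persist an approved feature alignment in a per-user named graph
    INSERT DATA {
      GRAPH <urn:user42-alignments> {
        <http://example.org/features#inputVoltage>
            owl:equivalentProperty
            <http://example.org/features#supplyVoltage> .
      }
    }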

7.4 Evaluation

This chapter investigates the appropriateness of faceted search interfaces for the Web of Data. To test for two fundamental aspects of search interfaces, namely search efficiency and usability, we first measure the impact of specificity in product search on the size of the result set using a simulation of random walks. Then, we conduct a usability study where we contrast our fully dynamic, data-driven faceted search interface with an altered instance of our search interface with hard-wired product features. Finally, we showcase our faceted product search approach on real e-commerce data from the Web.

7.4.1 Impact of Search Specificity on the Size of the Result Set in Product Search

We simulated a number of product searches to find out how dispersed the search space for products is and how well a faceted search approach on average performs at partitioning the option space.

7.4.1.1 Method

We took a random sample of 875 automobile offers13 from the mobile.de car listing Web site. We extracted the product features from the respective Web pages and populated an RDF graph by mapping the product features to properties from the VSO ontology14. For the sake of simplicity, we did not consider quantitative values for our simulation, but only qualitative and datatype properties. The variety of qualitative and datatype properties over the whole dataset is shown in Table 7.1.

13 More precisely, we took random result page numbers between 1 and 100 for random price ranges between 1 and 100,000 Euros.
14 http://purl.org/vso/ns (accessed on October 19, 2014)


Table 7.1: Variety of properties and values in an automotive dataset

Property                                         Variety of Values
http://purl.org/vso/ns#bodyStyle                 6
http://purl.org/vso/ns#color                     24
http://purl.org/vso/ns#condition                 5
http://purl.org/vso/ns#feature                   60
http://purl.org/vso/ns#fuelType                  10
http://purl.org/vso/ns#meetsEmissionStandard     5
http://purl.org/vso/ns#transmission              3

These numbers give a total of 113 possible property-value pairs. From this range of possible property-value combinations, we drew one entry at random and from there started a random walk simulating ten consecutive selection steps. After every selection step, we randomly picked a property-value pair from the reduced option space, which we obtained by issuing a SPARQL query.
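A query of the following kind can determine the reduced option space. This is a minimal sketch, not the exact query from our simulation; the two selection steps shown are hypothetical:

    PREFIX vso: <http://purl.org/vso/ns#>

    # Property-value pairs still available after two selection steps
    SELECT DISTINCT ?p ?v
    WHERE {
      ?car vso:transmission "manual" ;
           vso:fuelType "diesel" ;
           ?p ?v .
    }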

7.4.1.2 Results

Figure 7.5 outlines the results of our simulation. At the beginning (step 0), the option space always entails the full range of 875 car offers. In search step 1, the median already goes down to circa 150 results, i.e. in 50% of the cases the first filtering step sorts out more than 700 out of the 875 automobiles. After three product features have been selected, the median of the option space decreases to only three items. For the sake of simplicity, our random walk does not include UNION clauses, i.e. the disjunctive selection of multiple facet values, which would expand the option space (e.g. select a car that offers either manual or automatic transmission). However, we argue that in a real setting, where users are seeking interesting product offers, this operation is rare anyway.

7.4.1.3 Discussion

We can see clearly from the analysis that the space of possibly matching products decreases logarithmically with the number of features specified in a query. This confirms our assumption that learning about the option space, i.e. learning how to relax and refine requirements and preferences based on the set of remaining choices, is a critical part of product search interaction. It also highlights that in specific branches of product search, and thus sparsely populated decision trees, a search interface can benefit from being generated dynamically and directly from the data about products and their characteristics.


[Figure 7.5: Change of option space with 100 random walk iterations over a decision tree for 875 automobile offers. The y-axis shows the number of matching products in the dataset (log-scaled, 10^0 to 10^3); the x-axis shows the number of randomly selected product features in the query (0 to 10).]

Of course, the findings presented are based on a single sample dataset of 875 cars, albeit one drawn randomly from a sizable real dataset of a car sales portal. The effect of the number of features might be less pronounced if we took into account the correlation of features (e.g. that a stronger engine is likely to be found in combination with more seating capacity), from which we deliberately abstracted by selecting the features randomly. We would counter, however, that exactly these correlations between product features are unknown ex ante to a person exploring a product space, which stresses the importance of the learning effect of iterative product search.

7.4.2 Usability Studies of Faceted Search Interfaces for Products

Faceted search interfaces have recently attracted significant research interest. Various demonstrators, user studies, and evaluations repeatedly attest to their superior usability in contrast with other search paradigms [e.g. Yee+ 03; KZL08; ODD06; FH11]. In a survey in [Wei+ 13], the authors systematically compare faceted search with other popular search paradigms. Here, we conduct a usability study in order to find out whether our instance-driven search interface has a negative impact on user satisfaction: hard-wired, consolidated user interfaces, as found in today’s commercial faceted search applications, have the advantage that the facets presented can be based on popular mental models of human users. An instance-driven, adaptive faceted search interface bears the risk of confusing users, because the facets being presented and their names may change based on the available data.

7.4.2.1 Method

In order to evaluate that potentially negative effect, we prepared a second variant of our search interface that relies on hard-wired product features15. As the data to present in our search interfaces, we used a random subset of 25 car offers out of the random sample of 875 car offers from mobile.de. We set up a usability study based on the System Usability Scale (SUS) score [Bro96]. The questionnaire encompasses ten brief questions, where each response is given on a five-point Likert scale ranging from strongly agree to strongly disagree. SUS questions are designed to alternate between positive and negative statements. In addition, we included a gold question to filter out unreliable participants based on an incorrect response. We placed the gold question at the end of the questionnaire; had we placed it earlier, we feared that participants would give up too early, because answering it required some effort to look at the information displayed in the search interface. Finally, we asked for optional feedback, which turned out to be valuable for interpreting results in a later analysis. We put the questionnaire online so that participants could test the search interface and answer the questions remotely.

We conducted two separate usability studies. The first one we ran with undergraduate students from our university who specialize in business management or related fields. They were asked to assess the usability of the original search interface and, later, to repeat the same task with the amended search interface. Our second experiment harnessed crowd workers from the CrowdFlower platform, similar to the experimental setup in [Liu+ 12]. In contrast to the student experiment, we ran the usability test for both search interfaces in parallel with two distinct groups of participants.

15 http://www.ebusiness-unibw.org/tools/product-search-static/ (accessed on February 20, 2015)
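For reference, the standard SUS scoring scheme used throughout this evaluation can be computed as in the following minimal sketch; the response vector is an illustrative example, not actual survey data.

    # Minimal sketch of standard SUS scoring; the responses are
    # illustrative 1-5 Likert codes (1 = strongly disagree,
    # 5 = strongly agree) for the ten SUS items in order.
    def sus_score(responses):
        assert len(responses) == 10
        raw = 0
        for i, r in enumerate(responses, start=1):
            # Odd (positively worded) items contribute r - 1,
            # even (negatively worded) items contribute 5 - r.
            raw += (r - 1) if i % 2 == 1 else (5 - r)
        return raw * 2.5  # scale the 0-40 raw sum to 0-100

    print(sus_score([4, 2, 4, 1, 4, 2, 5, 2, 4, 2]))  # -> 80.0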


7.4.2.2 Results

In the following, we report on the empirical results obtained from the two usability studies, as summarized in Table 7.2.

Table 7.2: Results of SUS experiments

                                 Students           Crowdsourcing
                                 A        B         A        B
    No. participants             39       29        50       50
    No. incorrect answers         5        3        13        9
    No. answers considered       39       29        37       41
    Avg. SUS score               66.54    72.59     65.00    68.75

Usability Experiment with Students

For the students’ ratings, we did not eliminate incorrect answers based on the gold question, because after closely investigating their individual responses we found that they did not fall into the trap of the alternating pattern of the SUS questions. The task completion rate [cf. SL05] for students was thus 34/39 = 87% for search interface A, and 26/29 = 90% for search interface B. From the feedback of the students, we further conclude that they sometimes had overly high expectations of our academic search interface, as car sales portals like mobile.de are already very mature.

Our dynamic search interface (search interface A) achieved an average SUS score of 66.54, which is slightly below 6816, the mean SUS score among 500 system usability studies. Using the qualitative “adjective” rating introduced in [BKM09], the search interface is considered “good” (SUS score close to 71.4). By comparison, the alternative, static search interface (search interface B) obtained an average SUS score of 72.59. We stated the following null hypothesis to test the difference in the usability scores for significance:

Null hypothesis. There is no difference among SUS scores for search interfaces A and B obtained through two samples from the same population of students.

A Shapiro-Wilk test [SW65] revealed that we cannot assume that both SUS score samples are normally distributed (p-values of 0.03 and 0.06), so we compared the two samples using a non-parametric statistical test, i.e. the Wilcoxon rank-sum test [Wil45]. The average usability scores assigned by our students to search interface A (median = 70.00) did not differ significantly from the usability scores assigned to search interface B (median = 75.00), W = −1.45, p = 0.15, r = −0.18.

16 http://www.measuringu.com/sus.php (accessed on January 29, 2015)
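The test pipeline can be reproduced along the lines of the following SciPy sketch; the two score arrays are placeholders, not the actual samples collected in the studies.

    # Minimal sketch of the significance tests with SciPy; the SUS score
    # samples below are illustrative placeholders.
    from scipy.stats import ranksums, shapiro

    sus_a = [70.0, 65.0, 72.5, 55.0, 80.0, 62.5]  # interface A (placeholder)
    sus_b = [75.0, 72.5, 67.5, 85.0, 70.0, 77.5]  # interface B (placeholder)

    # Shapiro-Wilk: small p-values speak against normality and hence
    # against a t-test, motivating a non-parametric alternative.
    for sample in (sus_a, sus_b):
        stat, p = shapiro(sample)
        print("Shapiro-Wilk: p = %.3f" % p)

    # Wilcoxon rank-sum test for two independent samples; the reported
    # statistic is a z-value and can therefore be negative.
    z, p = ranksums(sus_a, sus_b)
    print("W = %.2f, p = %.2f" % (z, p))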


Usability Experiment with Crowdsourcing

Unlike in the previous experiment, we only accepted contributions by crowd workers who correctly answered the gold question. The task completion rate [cf. SL05] for crowd workers was 37/50 = 74% for search interface A, and 41/50 = 82% for search interface B. Search interface A achieved an average SUS score of 65.00, which is below 68, but still “good” according to [BKM09]. Search interface B obtained an average SUS score of 68.75. The null hypothesis below was used to test whether the two usability scores differ significantly:

Null hypothesis. There is no difference among SUS scores for search interfaces A and B obtained through two different samples of crowd workers.

A Shapiro-Wilk test [SW65] revealed that we cannot assume that both SUS score samples are normally distributed (p-values of 0.13 and 0.01), so we compared the two samples using a non-parametric statistical test, i.e. the Wilcoxon rank-sum test [Wil45]. The average usability scores assigned by the first group of crowd workers to search interface A (median = 65.00) did not differ significantly from the usability scores assigned to search interface B by the second group of crowd workers (median = 73.75), W = −1.30, p = 0.19, r = −0.15.

7.4.2.3 Discussion

This analysis shows that, in principle, a fully dynamic search interface based directly on the product features found in the data is not systematically less satisfying for users than one based on established, hard-wired product features as used in existing car portals. However, we see a small negative effect on usability, which we expected, because a static, hard-wired set of search dimensions allows users a higher degree of familiarity with the terminology and conceptual model of a search interface. We conclude from that small negative effect that a data-driven search interface for products comes at a cost, which must be compensated for by additional gains in precision, recall, and eventually the utility of the finally selected product. We would also like to stress that a usability-based evaluation of novel search interfaces has a systematic weakness, because it only analyzes how well a user can handle the interface, not the quality of the choices eventually made (e.g. how well the finally selected product meets the user’s needs). As we have shown in the first part of the evaluation, the sparsity and heterogeneity of the product space indicate that a more precise navigation in the option space can return much better product matches.


7.4.3 Proof of Concept with Real Product Data from the Web

In this section, we provide a proof of concept by presenting a use case of our faceted product search interface with real e-commerce data from the Web. For setting up our product search demo, we conducted another focused Web crawl over Web shops featuring household appliances and added the data to our crawl dataset from Chapter 3. More precisely, we selected shops from the Rakuten Deutschland platform17 that were classified into categories related to the general topic household. By that we obtained 23 Web shops, of which 17 contained GoodRelations markup. Thanks to our earlier research on BMEcat catalogs (see Chapter 4), we were already in possession of high-quality, structured product model master data by BSH18. We used that data source to augment the product offer data from the Web crawl with product features. Overall, we found matching product models and product offers in 15 Web shops, as illustrated in Table 7.3, where we list all graph names (i.e. Uniform Resource Names (URNs) of named graphs [Car+ 05]) in the RDF store along with the number of matching items in each Web shop. Twelve shops were already contained in the big crawl, whereas three shops were added by the household crawl (their provenance is discernible from the slightly different graph naming pattern that we used).

Table 7.3: Web shops with number of matching items

    Graph Name                                        Number of Matches
    urn:www.outdoorfurniture.ie                                       1
    urn:kitchenking.tradoria-shop.de                                  1
    urn:futterkiste.tradoria-shop.de                                  2
    urn:www.european-gate.com                                         1
    urn:elektrotresen.tradoria-shop.de                               20
    urn:www.ay-versand.de                                             3
    urn:fairplaysport.tradoria-shop.at                                1
    urn:marketplace.b2b-discount.de                                   3
    urn:data-filter-direkt.rakuten-shop.de.rdfa.nt                    4
    urn:heimundbuero.tradoria-shop.de                                 8
    urn:data-portens.rakuten-shop.de.rdfa.nt                          3
    urn:computeronlineshop.tradoria.de                                2
    urn:data-www.staubsaugerbedarf24.de.rdfa.nt                       1
    urn:www.megashop-express.de                                       3
    urn:top-und-preiswert.tradoria-shop.de                           50

After that, we executed cleansing and consolidation rules on the data, such as expanding gr:includes shortcuts to gr:includesObject patterns (see Section 6.2.5), or converting all price specifications to a common currency “EUR” (see Section 6.3.8.2). Our demonstrator is accessible online19 by selecting the SPARQL endpoint that contains the data from the household crawl. Figure 7.6 depicts a screenshot snippet of our search interface comprising real Web data from our crawl of household appliances. It becomes immediately apparent that the product categories are missing for this dataset, because neither the Web crawl nor the BMEcat catalog for BSH features a product classification. Furthermore, the graphic indicates that four shops are selling the vacuum cleaner with the European Article Number (EAN) “4242003551202”. This example nicely shows the great potential of faceted search interfaces and their appropriateness for deep product comparison, because product offers can be compared directly based on their features, which we obtained by integrating product model data from a manufacturer.

17 Rakuten Deutschland was formerly known as Tradoria and has been acquired by Rakuten, a Japanese e-commerce company. The platform offers software as a service (SaaS), specifically shop software as a service.
18 BSH Hausgeräte GmbH, a manufacturer specializing in household appliances.
19 http://www.ebusiness-unibw.org/tools/product-search/ (accessed on January 15, 2015)

Figure 7.6: Screenshot of the search interface with real data from a household crawl
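The consolidation rules mentioned above are plain SPARQL CONSTRUCT queries; the following hedged sketch of a currency-normalization rule in the spirit of Section 6.3.8.2 runs one via SPARQLWrapper, where the endpoint URL and the fixed exchange rate are assumptions.

    # Minimal sketch of a currency-consolidation rule; endpoint URL and
    # exchange rate are illustrative assumptions.
    from SPARQLWrapper import SPARQLWrapper

    rule = """
    PREFIX gr: <http://purl.org/goodrelations/v1#>
    CONSTRUCT {
      ?ups gr:hasCurrency "EUR" ;
           gr:hasCurrencyValue ?eur .
    }
    WHERE {
      ?ups gr:hasCurrency "USD" ;
           gr:hasCurrencyValue ?usd .
      BIND (?usd * 0.92 AS ?eur)   # assumed fixed exchange rate
    }
    """

    endpoint = SPARQLWrapper("http://localhost:8890/sparql")  # hypothetical
    endpoint.setQuery(rule)
    graph = endpoint.queryAndConvert()  # rdflib.Graph holding the new triples
    # The materialized triples would then be loaded into a named graph,
    # keeping cleansed data separate from the raw crawl data.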

7.5 Conclusion

In this chapter, we have proposed an instance-driven, adaptive faceted search interface as a way to navigate the sparse graph of LOD for e-commerce on the Web, with explicit support for user learning about the option space. To support the viability of our approach, we have provided evidence that the selection steps in faceted search interfaces drill down the option space logarithmically, and we have shown that the usability loss of our dynamic, data-driven approach in comparison to an alternative with hard-wired product features is insignificant. Furthermore, we have demonstrated a proof of concept of our solution with real product data collected from the Web.

The small-scale usability study in this chapter also indicates that users seem to have gotten used to search interfaces that expose rigid navigation structures optimized for individual application domains. While this technique works well at a smaller scale, it is not feasible for e-commerce at Web scale over LOD, where diverse and dynamic product domains need to be consolidated. From the insights gained in this chapter, we think that the ideal solution would be a search interface that, instead of being tied to the specification of the system or data structure, strives to adapt to user needs on a data-driven basis in the best possible way. As future work, we envision enhancing our work with more accurate and context-sensitive user dialogs, personalization and diversification of facet and result views (e.g. personalized result ranking that goes beyond the simple sorting by price), and better algorithms for partitioning the option space that would eventually lead to fewer search iterations. The current way our search interface presents facet suggestions, ordered by their frequency, is still not optimal and needs further elaboration.
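One conceivable improvement over frequency-ordered suggestions is to prefer facets that induce balanced partitions of the current option space, for instance by an entropy measure as in the following sketch; the facet names and counts are illustrative.

    # Minimal sketch of an entropy-based "best split" heuristic; the
    # facet-value counts are illustrative.
    from math import log2

    def split_entropy(value_counts):
        """Entropy of the partition a facet induces on the option space;
        balanced (high-entropy) splits promise fewer search iterations."""
        total = sum(value_counts)
        return -sum((c / total) * log2(c / total) for c in value_counts if c)

    facets = {
        "vso:fuelType": [400, 380, 95],  # fairly balanced -> good split
        "vso:condition": [870, 5],       # lopsided -> poor split
    }
    for name, counts in sorted(facets.items(), key=lambda f: -split_entropy(f[1])):
        print(name, round(split_entropy(counts), 3))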

8 Discussion and Conclusion

8.1 Summary
8.2 Contributions and Findings
8.3 Impact
8.4 Critical Review and Limitations
    8.4.1 Prevalence and Validity of Web Ontologies for Products
    8.4.2 Product Data Dynamics
    8.4.3 Representativeness of Data Sources
    8.4.4 Faceted Search Interaction and Evaluation
    8.4.5 Scalability of SPARQL Queries over Large RDF Datasets
8.5 Future Work
    8.5.1 Data Collection with External Data Sources
    8.5.2 Ontology Matching
    8.5.3 Personalization and Recommendation
    8.5.4 Machine Learning
    8.5.5 Crowdsourcing
8.6 Conclusion

In this thesis, we have studied product search over structured e-commerce data published online. Our research hypothesis stated that structured product data is appropriate for realizing deep product comparison on the Web. In other words, the Semantic Web promises better integration of distributed and heterogeneous content than state-of-the-art solutions, via resource identifiers and shared, partly formalized conceptual data models, and it offers data formats and vocabularies for the fine-grained description of products and product offers. This way, the multi-dimensional nature of products can be accommodated, which is a necessary condition for deep product comparison. In the following, we summarize the main achievements of our work. Then, we outline the individual contributions along with the core findings. Finally, we point to known limitations of our approach, and conclude with a prospective view on open challenges for future work.


8.1 Summary

In the introduction of this thesis, we have characterized the main bottlenecks associated with current product searches on the Web, which include the lack of precision of IR-based approaches over a large body of distributed, unstructured, and heterogeneous Web content, the complex and changing information needs of product searchers, and the insufficient support for user interaction. On that premise, we have put forward the argument that Semantic Web technologies can help to address these problems. The increasing popularity of and growing interest in e-commerce over the Web have recently generated a considerable amount of useful structured product data, piggybacked as Resource Description Framework in Attributes (RDFa) and Microdata in Hypertext Markup Language (HTML) Web pages. This structured data unlocks great potential for multi-parametric, exploratory search paradigms like faceted search.

However, as we have learned in this thesis, it is difficult to obtain the consolidated view over product information on the Web that product search requires. First of all, we have provided evidence that the large graph of structured product information on the Web is sparsely populated. In order to fill these gaps to some extent, we have suggested (a) to integrate product model master data derived from manufacturer datasheets based on strong identifiers, and (b) to foster the annotation of products with product categories from product classification systems. This altogether adds numerous product features and product type information that facilitate the comparison of product offers. Another important problem is the heterogeneity of the product data; e.g. there is little consensus on the naming of product features or the usage of standards for product type information. In this context, we have proposed a generic and light-weight method to cleanse and enrich Resource Description Framework (RDF) data directly in a SPARQL Protocol and RDF Query Language (SPARQL) endpoint, using a set of SPARQL CONSTRUCT rules and graph names.

Finally, we have developed an adaptive, instance-driven faceted search interface for product offers over structured e-commerce data. In comparison with state-of-the-art search solutions, which often suffer from a lack of precision (e.g. keyword searches typically offer high recall but at the cost of precision) and do not support incremental learning, faceted search interfaces empower users to learn about the option space via user interaction along several product dimensions. Hence, by the detailed comparison of products on a much wider range of product dimensions, faceted search helps to reduce the risk that a searcher prunes the option space too early in the search process. Moreover, a faceted search interface that can adapt to the actual data is able to deal with the conceptual gaps commonly found in the large graph of product information on the Web. In respect of future search paradigms for e-commerce data on the Web, we further envision search scenarios where consolidation rules are triggered by user feedback on the go and immediately fed back into the graph of product data.

8.2 Contributions and Findings

In the following, we summarize the core contributions and findings of our thesis:

Contribution 1 (Web Crawl of Product Offer Data). We have designed and implemented an e-commerce spider, which enabled us to conduct a substantial Web crawl over structured e-commerce data starting from a positive list of seed Uniform Resource Identifiers (URIs).

Findings. Our analysis of the crawl of 2,628 Web shops revealed that the graph of product information is sparsely populated, i.e. it exhibits very few product features and is thus of little help for deep product comparison on the Web. Furthermore, by comparing our dataset with an existing, open-source Web crawl (Web Data Commons (WDC)), we found that harvesting product data from Web pages is not straightforward, as the product details are typically located in the deep Web pages of a Web site. We drew this conclusion from the fact that, for the same pay-level domains, our depth-first crawl was able to collect significantly more structured product data than is available in the WDC dataset.

Contribution 2 (BMEcat Converter for Product Model Master Data). We have defined mappings and developed a command-line tool for converting product catalogs in BMEcat Extensible Markup Language (XML) format [SLK05b] to GoodRelations-compliant [Hep08a] product catalogs for the Semantic Web. We have tested and evaluated our conversion tool with product datasheets from two large manufacturers, i.e. Weidmüller and BSH, where the latter was further used to analyze the overlap with data from the Web crawl.

Findings. By analyzing the two BMEcat-derived datasets, we discovered that product model master data by manufacturers is, at least at the time of writing, much more granular and complete than product data found on the Web. In order to fill the existing gap of sparse and low-quality product details supplied by vendors, we have evaluated the feasibility of equipping retailer product data with high-quality product model master data from manufacturers, and have provided a preliminary estimate of its potential leverage. Based on our insights, our recommendation is to rely on strong identifiers, such as European Article Numbers (EANs), Universal Product Codes (UPCs), Global Trade Item Numbers (GTINs), or combinations of brand names and manufacturer part numbers (MPNs), to integrate product model master data into the sparse graph of product data from the Web.
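Integration via strong identifiers presupposes normalized, valid codes; the sketch below pads EAN/UPC codes to a common GTIN-14 key and validates them with the standard GS1 check-digit algorithm (the helper names are ours).

    # Minimal sketch of strong-identifier normalization for joining offer
    # and product model data; the helper names are illustrative.
    def gtin_check_digit_ok(code):
        digits = [int(c) for c in code]
        # From the right, weight the digits before the check digit
        # alternately by 3 and 1 (GS1 scheme).
        weighted = sum(d * (3 if i % 2 == 0 else 1)
                       for i, d in enumerate(reversed(digits[:-1])))
        return (10 - weighted % 10) % 10 == digits[-1]

    def normalize_gtin(code):
        code = code.strip().replace(" ", "")
        if not (code.isdigit() and len(code) in (8, 12, 13, 14)):
            return None  # not a GTIN-8/UPC-12/EAN-13/GTIN-14 candidate
        code = code.zfill(14)  # pad to a common GTIN-14 join key
        return code if gtin_check_digit_ok(code) else None

    print(normalize_gtin("4242003551202"))  # EAN from Section 7.4.3 -> 04242003551202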


Contribution 3 (Product Type Information from Product Classification Systems). We have proposed a semi-automatic tool for generating Web ontologies from product categorization standards and proprietary product category systems. We have supported our contribution by converting 13 product classification systems of different sizes, scopes, and structures, some of which either comprise or are available in multiple languages. These classification systems provide up to tens of thousands of classes, properties, and instances that can readily be used for annotating and categorizing products.

Findings. The statistics that we have reported on the converted product classification systems indicate a rich array of more or less specific product categories, which would be prohibitively expensive to maintain manually. Due to the inherent context-sensitivity of classification systems, the plain adoption of the subsumption hierarchy from the original, narrow context to the generic context of products and services is usually discouraged, unless no inconsistencies are introduced. Where it is feasible, this offers the opportunity for reasoning over products. We have illustrated how product annotations by virtue of product classes render products more visible and discernible on the Web, which ultimately promotes multi-parametric product searches. Also, many of the available standards provide multiple translations into various languages and/or synonyms, which are useful features for improving recall in product searches.

Contribution 4 (Cleansing and Enrichment). We have developed a typology of obstacles for product search, and have sketched techniques to overcome them, mainly relying on SPARQL CONSTRUCT rules. We have incorporated these rules into a prototype for data management with RDF graphs. We have complemented our contribution with statistics on the prevalence of the obstacles in our Web crawl.

Findings. The conclusions drawn from the statistics on our Web crawl suggest an urgent need for data cleansing on the e-commerce Web of Data, as product models are defined redundantly, units of measurement and currency codes are used incorrectly, links between products and product models are missing, product type information is not indicated, and wrong datatypes are assigned. Accordingly, substantial preliminary cleansing effort is necessary in order to enable product search over RDF data.

Contribution 5 (Faceted Product Search). We have suggested an adaptive, faceted search interface over structured e-commerce data as an appropriate approximation of the essential requirements of product search, where the user is ideally able to learn about the option space during the search process. We have used a sample dataset to demonstrate the discriminatory value of faceted search terms using a random walk simulation, and have tested the usability of our search interface with student participants and crowd workers. As a proof of concept, we have shown that our prototype can master real structured product data from the Web.

Findings. Our results indicate that faceted search interfaces are well suited to quickly narrowing down the option space. After only three iterations (i.e. filtering steps), the option space of previously 875 items shrinks to three items on average. Furthermore, our proposed dynamic, instance-driven search interface has shown itself to be user-friendly, albeit the user survey pointed at some issues that would need to be solved when migrating from an academic prototype to a commercial product, e.g. the technical terminology used on the interface, the excessive load time, or the suboptimal presentation of facet-value pairs in the search interface.

8.3 Impact

As has become apparent from our Web crawl, a great portion of Semantic Web adopters in the product domain are small- to medium-sized Web shops that represent the long tail of the market for products and services [cf. And04]. With standard Web searches, it is very hard for them to communicate their value proposition over the Web, given the limited precision of keyword searches and, at the same time, the enormous marketing budgets of big competitors. By contrast, the semantic annotation of product offers with granular product details from manufacturers helps average Web shops to increase the visibility of their products almost for free; instead, they can concentrate on emphasizing the unique value propositions of their product offers. In addition, fine-grained product descriptions enhance the comparability of product items offered by different retailers along multiple product dimensions, which is demonstrated by the very principle of how faceted search interfaces work.

On the basis of the foregoing considerations, we see an important, positive impact of our research on end customers and product vendors. An instance-driven, faceted product search approach essentially helps to bring the value proposition of each vendor more easily to consumers’ attention. Nonetheless, we think that other stakeholders will benefit indirectly as well, e.g. manufacturers and wholesalers through additional sales of their products. From an economic point of view, successful product searches can help to mitigate search frictions in the market and eventually improve overall economic output by a reduction of price dispersion [cf. GQ02]. More precisely, product comparison over structured data along multiple product dimensions can drastically reduce the search effort and search costs generally needed to find well-fitting candidate offers on the Web.


8.4 Critical Review and Limitations

This thesis has contributed a novel, instance-driven faceted search approach for e-commerce over the Web of Data. Nonetheless, there remain some shortcomings, which we summarize in the following.

8.4.1 Prevalence and Validity of Web Ontologies for Products

Despite the fact that we have presented in this thesis a powerful method to derive Web ontologies for products and services from product categorization standards, and have delivered ready-to-use online deployments1 of product ontologies for the Classification of Products by Activity (CPA), the Common Procurement Vocabulary (CPV), the Global Product Classification (GPC), and the Klassifikation der Wirtschaftszweige (Engl.: German Classification of Economic Activities) (WZ), their adoption on the Web remains scarce. Products on the Web are, if at all, primarily classified according to light-weight product ontologies, e.g. the Product Types Ontology (PTO)2, or informally with proprietary category structures, most notably the Google product taxonomy [Goo13]. Based on our findings, we firmly believe that product ontologies derived from product categorization standards will not become popular on the Web very soon, unless BMEcat catalogs, which make extensive use of product categorization standards, are increasingly exposed to the Web of Linked Data. Furthermore, it was outside the scope of this thesis to carry out a systematic, in-depth analysis of whether, or to what extent, the subsumption relationships between classes in the original context also hold for the domain of products and services (see Chapter 5).

1 http://www.ebusiness-unibw.org/ontologies/pcs2owl/ (accessed on September 16, 2014)
2 http://www.productontology.org/ (accessed on May 8, 2014)

8.4.2 Product Data Dynamics

In general, product data tends to become outdated very quickly. Not only does the quantity of product offers fluctuate steadily, but related terms and conditions might also change on a daily, if not hourly, basis. These dynamics of product data on the Web pose serious challenges for the data management part of deep product comparison engines. Furthermore, considerable parts of our work refer to a Web crawl dating back to late 2011/early 2012, when the distribution of Microdata markup with schema.org was still in its infancy. In late 2012, the concepts and properties of the GoodRelations vocabulary were largely integrated into schema.org [Guh12], which has led to an accelerated publication of product data using schema.org in Microdata. Because of this, the data reported in our evaluations has meanwhile become somewhat outdated, as becomes evident from recent statistics about the deployment of structured data on the Web [MPB14].

8.4.3 Representativeness of Data Sources

The proof of concept of our approach was demonstrated using real data collected from the Web, which was matched against the conversion results of a single BMEcat catalog from BSH. However, this small-scale experiment is not unconditionally representative with respect to the general feasibility of our approach, for at least two reasons: First, the data structure and quality of the Web crawl are arguably somewhat biased, as most data is created by Web shop extensions in an almost uniform fashion. This means that certain data quality problems may not surface until more generators of structured data become available on the Web. Second, as the BMEcat catalog matches merely 94 product offers from the Web crawl, we can only provide preliminary evidence that our approach works at large scale. To further strengthen the credibility of our approach, we would have to gather additional BMEcat catalogs from different manufacturers and match them against data from a Web crawl.

8.4.4 Faceted Search Interaction and Evaluation

We have proposed a faceted search interface for product search over structured e-commerce data on the Web. Faceted search essentially allows users to progressively narrow down the option space, relying on multi-dimensional product descriptions as provided by the Web of Data. For this purpose, faceted search paradigms use a boolean filtering mechanism where queries are refined and relaxed while navigating the option space. However, products have high-dimensional utility functions with strong non-linear components. For example, an eco-conscious car buyer might well trade ten percent of a car’s engine power for half its fuel consumption. Unfortunately, this leads to problems, because classical faceted search cannot meet such requirements. At the same time, it is in the nature of matchmaking (see Chapter 2) to assume multi-dimensional, non-linear utility functions that require the parallel consideration of many dimensions. The result of matchmaking is a ranked list of potentially interesting candidates that fulfill the requirements to some extent. Unlike faceted search, matchmaking is able to return items that are underspecified with respect to a given demand. Accordingly, a search for red convertibles with a mileage of less than 50,000 kilometers and a price lower than 10,000 Euros would most likely also match black convertibles that ran 20,000 kilometers and cost 9,000 Euros. Similarly, matchmaking would yield a red convertible that matches on all dimensions but has a price of 10,001 Euros, whereas faceted search is doomed to fail in that case. Nevertheless, we would hold against matchmaking that, in its popular understanding, it requires significant annotation effort in the form of machine-understandable specifications, is computationally expensive, and describes an automatic, largely autonomous process without human intervention, all of which limit its practicality. Mainly for these reasons, we regard faceted search as an appropriate approximation for product search.

A crucial requirement for user interaction during product searches is to partition the option space according to a “best split” strategy (see Chapter 7). Our current solution presents facet-value pairs based on the frequency of instance matches in the data, which is suboptimal. Another related challenge is to find a ranking strategy for presenting results to the users, which necessitates the exploration of ranking strategies beyond the order of prices. One option would be to rank results based on the match degree, as reviewed for matchmaking (see Chapter 2). Thanks to the implementation of a truly operational software artifact, we were able to showcase how faceted search interfaces can be used for advanced user interaction over product data in RDF. In a user study, we gained insights into the usability of our tool; what is missing, however, is a measurement of user satisfaction, in particular judging whether the presented ranking places relevant results first with respect to a given information need. Established, objective evaluation metrics for retrieval algorithms that could be useful for this task include precision, recall, the F1-measure, binary preference (BPREF), or similar measures (see Chapter 2).
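The sketch below contrasts the two mechanisms on the convertible example; the cars, the equal weighting, and the particular scoring function are illustrative assumptions rather than the match-degree measures reviewed in Chapter 2.

    # Minimal sketch: boolean facet filtering vs. a graded match degree;
    # data and weights are illustrative.
    cars = [
        {"color": "red",   "mileage": 20000, "price": 10001},
        {"color": "black", "mileage": 20000, "price": 9000},
    ]

    def boolean_filter(car):  # classical faceted search
        return (car["color"] == "red" and car["mileage"] < 50000
                and car["price"] < 10000)

    def match_degree(car):  # graded, matchmaking-style score
        score = (car["color"] == "red") + (car["mileage"] < 50000)
        score += min(1.0, 10000 / car["price"])  # near price misses degrade softly
        return score / 3

    print([boolean_filter(c) for c in cars])  # [False, False]: both get pruned
    best = max(cars, key=match_degree)
    print(best["price"])  # 10001: the red convertible still ranks first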

8.4.5 Scalability of SPARQL Queries over Large RDF Datasets

Faceted search over RDF data often translates into expensive SPARQL SELECT queries, for instance in the presence of multiple filter constraints that are applied in parallel. But SPARQL CONSTRUCT queries, which we used to materialize the results of data cleansing and consolidation rules as RDF data, can also lead to scalability issues, especially if executed over large datasets. The setting presented in our thesis required loading all data from various Web shops into a central, consolidated SPARQL endpoint. From a performance point of view, it would be wiser to distribute the load, e.g. to use several federated SPARQL endpoints, each holding a portion of the data. Alternatively, once the technology stack of Linked Data Fragments (LDFs) [Ver+ 14] becomes more pervasive on the Web and can properly cope with even very complex SPARQL queries, we could think of submitting queries via the LDF client-server architecture, which takes care of splitting them into chunks of triple patterns, thus hitting an endpoint with many cheap, light-weight queries rather than a single complex one at great cost.
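Federation could, for instance, rely on the SPARQL 1.1 SERVICE keyword to join local offer data with remote product model master data, as in the following sketch; both endpoint URLs are hypothetical placeholders.

    # Minimal sketch of a federated query via SPARQL 1.1 SERVICE; the
    # endpoint URLs are hypothetical.
    from SPARQLWrapper import SPARQLWrapper, JSON

    query = """
    PREFIX gr: <http://purl.org/goodrelations/v1#>
    SELECT ?offer ?model WHERE {
      ?offer a gr:Offering ;
             gr:includesObject/gr:typeOfGood/gr:hasEAN_UCC-13 ?ean .
      SERVICE <http://models.example.org/sparql> {   # remote master data
        ?model a gr:ProductOrServiceModel ;
               gr:hasEAN_UCC-13 ?ean .
      }
    }
    LIMIT 10
    """

    endpoint = SPARQLWrapper("http://shops.example.org/sparql")  # local shop data
    endpoint.setQuery(query)
    endpoint.setReturnFormat(JSON)
    for row in endpoint.queryAndConvert()["results"]["bindings"]:
        print(row["offer"]["value"], "->", row["model"]["value"])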

8.5 Future Work

In addition to the more concrete limitations mentioned in the previous section, our work has produced several more general ideas for future work. We currently use standard methods and relatively simple heuristics for the fulfillment of our tasks, yet taking into account more advanced techniques from other research areas could bring considerable improvements.

8.5.1 Data Collection with External Data Sources

Within the scope of this work, we have integrated diverse data sources to create a uniform view over e-commerce data on the Web. What is still missing are novel ways to gather additional, granular data for product comparison. This could further increase the visibility of product offers and pave the way for more advanced features like product recommendations. In particular, we envision three alternative, although complementary, methods for data collection:

1. Extend the current data with external knowledge bases, e.g. review data3 [HM07], Freebase4 [Bol+ 08], Open Icecat5, DBpedia6 [Aue+ 07], etc.

2. Develop and apply a range of simple, yet powerful heuristics for data lifting in order to extract price details or other quantitative information out of raw text (see the sketch after this list).

3. Employ natural language processing (NLP) techniques, e.g. named entity recognition (NER) and relation extraction, to extract product features from unstructured or semi-structured text, tabular data, etc.

If available, it also makes sense to utilize the multi-language support provided by many data sources, which is especially common in the context of product ontologies.

3 http://revyu.com/ (accessed on May 12, 2014)
4 https://www.freebase.com/ (accessed on May 12, 2014)
5 http://www.icecat.biz/ (accessed on April 10, 2015)
6 http://dbpedia.org/ (accessed on May 12, 2014)
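As an illustration of item 2 above, the following sketch lifts price details out of raw text with a small regular-expression heuristic; the covered notations are deliberately few, and the function names are ours.

    # Minimal sketch of a data-lifting heuristic for prices in raw text;
    # it covers only a few common German/English notations.
    import re

    PRICE = re.compile(r"(€|EUR|\$|USD)?\s*([\d.,]+\d)\s*(€|EUR|\$|USD)?")

    def parse_amount(s):
        # A trailing separator with exactly two digits is the decimal
        # part; any other separators group thousands.
        m = re.match(r"^(\d{1,3}(?:[.,]\d{3})*)(?:[.,](\d{2}))?$", s)
        if not m:
            return None
        integer = m.group(1).replace(".", "").replace(",", "")
        return float(integer + "." + (m.group(2) or "00"))

    def lift_prices(text):
        for cur1, amount, cur2 in PRICE.findall(text):
            currency = cur1 or cur2
            value = parse_amount(amount)
            if currency and value is not None:  # bare numbers are too ambiguous
                yield ("EUR" if currency in ("€", "EUR") else "USD", value)

    print(list(lift_prices("Sonderpreis: 1.299,00 € statt 1.499 EUR")))
    # -> [('EUR', 1299.0), ('EUR', 1499.0)]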

8.5.2 Ontology Matching

Ontology matching tackles the problem of the semantic heterogeneity of ontologies on the Semantic Web. It computes alignments between related concepts, which can help to improve the effectiveness of product searches. Despite significant research effort in the past, ontology matching is still an open research challenge. For instance, while automatic alignments work fairly well for some application domains, they still perform poorly when applied to large-scale practical applications [ORG15]. Also, the complexity of the matching problem grows proportionally with the size of the ontologies [SE13]. Thus, additional research has to be devoted to improving the matching quality and efficiency necessary to cope with large classification systems from the product domain.

8.5.3 Personalization and Recommendation

Although we have attached importance to personalized and context-aware searches in this thesis, we did not elaborate further on this topic. The consideration of additional metadata can greatly enhance the search experience, for instance by augmenting product search engines with recommender systems that can act on behalf of users. In practice, such relevant background knowledge for recommendation algorithms can be obtained from information about similar item descriptions, user profiles, and user demographics, or from historical data about previous searches or past purchases (see Chapter 2). The e-commerce vocabularies on the Web, namely GoodRelations and schema.org, already offer ways to materialize useful axioms for item-based recommendations [TH14]. GoodRelations, for example, defines a rich variety of convenient object properties to express relationships between related products (gr:isSimilarTo, gr:isVariantOf, gr:predecessorOf, gr:successorOf), accessories and spare parts (gr:isAccessoryOrSparePartFor), and consumables (gr:isConsumableFor) [cf. Hep11]. These properties could serve as starting points for product recommendations within product search systems.
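A minimal sketch of how these object properties could seed item-based recommendations is given below; the catalog file and the example model URI are illustrative.

    # Minimal sketch of recommendation seeds from GoodRelations links;
    # the catalog file and the URI are illustrative.
    from rdflib import Graph, Namespace, URIRef

    GR = Namespace("http://purl.org/goodrelations/v1#")
    g = Graph().parse("catalog.ttl")  # hypothetical product model dump

    RELATED = [GR.isSimilarTo, GR.isVariantOf,
               GR.isAccessoryOrSparePartFor, GR.isConsumableFor]

    def recommend(model):
        """Yield (reason, item) pairs for models linked to the given one."""
        for prop in RELATED:
            reason = str(prop).split("#")[-1]
            for other in g.objects(model, prop):   # outgoing links
                yield reason, other
            for other in g.subjects(prop, model):  # incoming links
                yield reason, other

    for reason, item in recommend(URIRef("http://example.com/model/X100")):
        print(reason, item)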

8.5.4 Machine Learning

Machine learning in the context of product data is another interesting line of research for the future. Among the manifold areas of application, one possible approach would be to assist in the classification of unlabeled products by relying on supervised learning methods that use training data, or alternatively to employ unsupervised learning to find candidate clusters for new product classes within a collection of products. Machine learning is also a promising enabler for better ontology matching [cf. ORG15]. Yet another application domain for machine learning is learning to rank, a relatively novel approach with the goal of learning ranking models that can be used for ranking items [e.g. Li11].

8.5.5 Crowdsourcing

Finally, while many of the aforementioned tasks can be done automatically or semi-automatically, human intelligence could be greatly beneficial for improving the outcomes of data cleansing (e.g. solving common data quality problems), NLP (e.g. manually labeling named entities or relationships), machine learning (e.g. preparing training data for supervised classification), or ontology matching (e.g. approving a potential match, or evaluating the mapping accuracy between two concepts [cf. SE13]). While human intelligence can also be provided by a community or interest group, it is often much easier to temporarily recruit workers from crowdsourcing platforms for a moderate payment. On that premise, many evaluation tasks in our thesis could be enhanced with crowdsourcing, e.g. by asking crowd workers to label product classes as appropriate for a product or not, to test Web sites and user interfaces for usability [Liu+ 12], as we did in Chapter 7 for our search interface, or to make relevance judgments for a list of search results returned for a query.

8.6 Conclusion

Deep product comparison at Web scale promises to be one of the next big milestones for e-commerce, given the increasing availability of structured online markup for products and services in recent years. However, as has been shown in this thesis, product search is a non-trivial task that poses special challenges for data management and the design of search interfaces. At the time of writing this thesis, the contribution presented herein has been the first serious attempt to develop a faceted search interface for deep product searches with user interaction over a large set of real structured e-commerce data on the Web. Although originally intended as a proposal for e-commerce, selected parts of our approach could just as well be applied to other application domains where searches over structured online data are in high demand, e.g. events, recipes, online library catalogs, and the like.


A User Survey

A.1 System Usability Scale
    A.1.1 Experiment Setup
    A.1.2 Questionnaire
A.2 Instructions
    A.2.1 Student Invitation E-Mails
        A.2.1.1 Variant 1
        A.2.1.2 Variant 2
    A.2.2 Crowdsourcing
A.3 Results
    A.3.1 Student Feedback
        A.3.1.1 Variant 1
        A.3.1.2 Variant 2
    A.3.2 Crowd Workers
        A.3.2.1 Variant 1
        A.3.2.2 Variant 2

To assess the usability of our search system, we prepared two slightly different variants of it:

1. Variant 1: Instance-driven search interface.
2. Variant 2: Rigid search interface with hard-wired product features.

For setting up the experiment, some preliminary work was necessary to prepare the data to show up in the demonstrators:

1. Randomly select 50 result pages from mobile.de, each consisting of 20 results (Uniform Resource Identifier (URI) pattern template with random page, minimum price, and maximum price parameters).
2. Crawl individual product pages. We obtained data from 875 (out of 1,000) product pages from mobile.de.
3. Pick 25 automobile offers at random to make them available in the data management interface, and load them into the Resource Description Framework (RDF) stores of the two demonstrators.

A.1 System Usability Scale

A.1.1 Experiment Setup

We conducted the same experiment with two different groups of people. Experiment 1 was conducted with students from our university, and experiment 2 with crowd workers from the CrowdFlower crowdsourcing platform. In experiment 1, we asked the same group of students to first evaluate the usability of the first variant of our search interface (variant 1), and then to evaluate a second variant of our search system (variant 2). In experiment 2, we designed a user study with crowd workers:

• The same conditions applied as for the students.
• We simultaneously assigned 50 tasks for variant 1 and 50 tasks for variant 2 to two independent groups of crowd workers.
• We paid ten dollar cents per judgment.
• We only accepted responses by workers from German-speaking countries (i.e. Germany, Austria, and Switzerland).
• We only accepted high-quality judgments in terms of the CrowdFlower platform.
• We only took into account judgments where the gold question (see below) was answered correctly.

A.1.2 Questionnaire

A System Usability Scale (SUS) questionnaire was used to assess the usability of our search systems. It consisted of the following ten questions, administered in German1 and translated here:

1. I think that I would like to use this system frequently.
2. I found the system unnecessarily complex.
3. I thought the system was easy to use.
4. I think that I would need the support of a technical person to be able to use this system.
5. I found the various functions in this system were well integrated.
6. I thought there was too much inconsistency in this system.
7. I would imagine that most people would learn to use this system very quickly.
8. I found the system very cumbersome to use.
9. I felt very confident using the system.
10. I needed to learn a lot of things before I could get going with this system.

Gold question for variant 1:

11. What is the power (in kilowatts) of the most expensive product in our system?

Gold question for variant 2:

11. What is the mileage of the only one-year-old car (“Jahreswagen”) in the system?

Furthermore, an optional twelfth question asked for feedback regarding future enhancements to the usability of our search systems (see Section A.3).

1 http://minds.coremedia.com/2013/09/18/sus-scale-an-improved-german-translation-questionnaire/ (accessed on January 28, 2015)

A.2 Instructions

A.2.1 Student Invitation E-Mails

To reach a critical mass of students, we sent out invitation e-mails to student mailing lists of our university (in total covering roughly 150 students).

A.2.1.1 Variant 1

For the first round, we prepared an e-mail with the following content (originally in German, translated here).


From: Alex Stolz
Subject: Evaluation of a product search system
Date: Tue, 27 Jan 2015 16:18:39 +0100
To:

Dear students,

I hereby invite you to take part in a short survey on the usability of a system for product search. As part of my research, I have developed a prototype for product search over the World Wide Web, whose usability we would now like to test with users. Since the tool is available online, I can conduct this evaluation remotely.

I would be delighted if many of you participated in the survey. The questionnaire takes no more than 5-10 minutes of your time, yet you can make a valuable scientific contribution. You can find the instructions and the questionnaire at the following address:

Thank you very much in advance for your help!

Kind regards,
Alex Stolz

A.2.1.2 Variant 2

For the second round, we sent the e-mail presented below to the same students (originally in German, translated here). Please note that even though we reached 29 of the 39 students from the first round, there is no one-hundred-percent guarantee that only students from the first round participated in the second round, since responses were accepted anonymously. Nonetheless, we trust our results, mainly for two reasons: first, because we selected students from past courses, so that we already knew each other; and second, because the number of participants decreased from the assessment of search interface variant 1 to search interface variant 2. Observing the opposite would have been highly suspicious.


From: Alex Stolz
Subject: Re: Evaluation of a product search system
Date: Fri, 30 Jan 2015 11:53:52 +0100
To:

Dear students,

thank you for the ultimately numerous participation in my survey on the usability of the product search system, and for all the constructive criticism I have received.

If you already took part in the previous study (and please only then!), I would like to ask you to take another 5 minutes. I have now reworked the user interface of the product search system somewhat, so your opinion on the usability of the modified system would be very important to me. You can now find the instructions and the questionnaire at the following Web address:

Thank you for helping out once again!

Kind regards,
Alex Stolz

PS: I will not bother you with the evaluation of a third search system, I promise :-)

A.2.2 Crowdsourcing

For the crowdsourcing experiment, we provided task instructions upfront. The instructions presented to crowd workers are indicated below (originally in German, translated here):

Please answer the following 11 short questions to rate the user-friendliness of our product search system. First, visit the Web page indicated below, where you will find our search prototype. The product catalog currently contains only vehicle offers from mobile.de (hence the data, unlike the language of the user interface, is in German), but in principle offers of any product type are possible. Please note that this is an academic prototype that is still under development. Accordingly, queries may take a while, and some information may be displayed incorrectly. We will be working on improving these points in the future. In this short survey, we would like to learn your opinion on the usability of the system. After you have familiarized yourself with the user interface, how would you rate the handling of the search system?

A.3 Results

The optional twelfth question,

12. Do you have any further suggestions for improving the usability of the system? (optional)

led to valuable critiques and feedback for possible improvements to our search interfaces.

A.3.1 Student Feedback

In the following, we outline the responses obtained from the students for search interface variant 1 and variant 2 (originally in German, translated here).

A.3.1.1 Variant 1

• On a mobile phone, the functions (full display of the features, price search) are only usable to a limited extent.
• Menu navigation on one side only; options adjustable on both sides appear cluttered/unfamiliar. With more than about five selectable options, preferably a drop-down menu. A function to display all results, ideally with sorting tabs (price ascending/descending), etc. I find the graphical representation of the distribution of offers within the price range very good.
• At the bottom there should be a range of page numbers and not just the number 1; as it is, it easily creates the impression that there is only one page.
• Being able to select the individual bars of the price distribution directly.
• A “rounder” / more user-friendly / more appealing interface.
• A button for clearing all selection modifiers at once.


• Maybe it is because mobile.de already has a mature and well-known search system (so the source was perhaps rather unfavorable), but I found the UI rather complicated, or not intuitive. The attributes one can search by are sometimes a somewhat odd fit for the topic of cars, for example the engine displacement: here the slider makes strange jumps from 1871.466667 ccm to 2542.15385 ccm, which does not really fit and makes the whole thing appear complicated. The short info displayed in the results overview does not help much with the further selection; horsepower, displacement, and consumption are entirely missing from the overview. The selection criteria are partly superfluous: who searches for cars by searching for car dealerships? The design is very sober; it resembles a classical search engine rather than an attractive way to search for “my car”.
• There should be a button for sorting the results by different categories. I did not find one at first glance.
• When the price filter is reset, only vehicles within the given interval should appear. But nothing changes.
• Unfortunately, I cannot handle it at all: the settings and options are buried too deep, the input and the constant reloading of the page elements are confusing, and the search does not react to simple search terms like “bmw”. Bottom line: not really usable, in my opinion.
• A function for sorting the search results according to certain preferences would be pleasant, e.g. sortable by price or displacement, ascending or descending.
• The constant reloading of the product details disturbed me a lot while using the system: one small change, and everything has to be refreshed... no problem if it were fast, but I counted 5 seconds while I was waiting (product features)! That is too long (from the university network; from outside it may take even longer). And one should settle on one language: the headings are in English and the options in German, and the selection of, e.g., power again has no German unit such as PS or the like.
• It would make sense to be able to sort the search suggestions by various criteria (alphabetically, most expensive first, cheapest first, frequently searched topics, etc.).
• Support for different, and then consistently applied, languages for the system would be desirable; at the moment there is a mix of German and English. I also think most people cannot make sense of the “SPARQL” line, but I assume it only serves testing purposes anyway. It should, however, make clear that one should not use overly specific terms.
• Good idea! I recently finished my own car search. So far, no well-known platform allows searching by the number of cylinders; that would be a great feature. Especially with regard to the downsizing trend, a search for, e.g., 6 cylinders and diesel, or more than 4 cylinders, generates real added value. Keep it up.
• An overview of the number of pages.
• It would be easier if one could click on the bars of the price overview to get to the vehicle.
• An option to sort by colors.
• Enable sorting by descending price.


• I find it cumbersome that after every change of settings the whole page reloads, and only then can a restriction be set. For example, to set the power range I have to wait twice until the entire page has reloaded. The search field may not work optimally: if you search for “Porsche” there, no product appears, although there apparently is one in the list (see the most powerful vehicle).
• On a mobile phone, the system cannot really be used. The price bar can only be moved very awkwardly, and the complete details of the Dodge could not be viewed either.
• For better readability, the price should have a separator after the thousands digit (25.000,00 €). When setting the price range, numbers with up to 5 decimal places occasionally appeared (2 should suffice ;-) ). Once the lower bound of the price range was set equal to the upper bound (e.g. to answer question no. 11), the two bounds could no longer be separated or adjusted. The slider for the price range is easy to overlook at first glance; the same holds for the price input field below it (it is not recognizable as a text field). Despite these small flaws, I consider the system an interesting and pleasant-to-use solution. I hope development will continue & wish you every success :-)
• 1. I would personally give the sellers/vendors less space and rather let the radius (distance in km) decide; from my own experience, this plays a much bigger role in a serious vehicle search! 2. The bar chart for the prices disturbs me a little. Two remarks on this: a) make the respective count visible above each bar; b) drop the bar chart, since in my opinion a statement purely about the price offers no insight when searching precisely for one model, unless the search is made so precise that only exactly one model with exactly the same equipment is compared. As an example, a click on the category “Cabrio/Roadster”, after which this appears in the bars on the left; I do not find it meaningful.2 3. From my multi-year search for vehicles, especially motorcycles, here are my constant search criteria, in case you want to give them more attention: a) price, b) make, c) mileage, d) power, e) distance, f) displacement (motorcycle), g) make, h) extra equipment (ABS). I hope I could help a little. Good luck with your further work.
• The left subwindow “vendors” is uninteresting for me; I would leave it out. The bullet-point detail overview for each vehicle is good!!

2 http://www.fotos-hochladen.net/view/beispielcp4x321bj8.jpg (accessed on March 31, 2015)


A.3.1.2 Variant 2

• A clear improvement. The many sliders on the left are by now somewhat confusing, but overall the system has become much more usable for me. I could not read the mileage on my mobile phone, since the display, or rather scrolling within the display, still does not work on the iPhone.
• The system is practical, but nothing “fundamentally new”.
• When opening a vehicle, for example the one-year-old car (“Jahreswagen”), many of its extras are listed. Because of their number and the plain list layout it is hard to get an overview there. I therefore suggest changing the order: first the most important key figures (power, mileage, etc.), and only then present the optional equipment in a clearer way. One option would be to group the optional equipment into categories, e.g. interior, safety, dashboard, etc. This gives a better overview. Overall, I still find the system very helpful and easy to handle.
• A button at the top right above the car display panel for sorting by the most relevant categories would be helpful, since I would normally first specify which properties I want the car to have, and then sort the results by price, distance from home, age, and so on, according to my current needs.
• I do not yet see an immediate, convincing argument that sets this system apart from other sites; that may of course be down to me.
• On the mileage slider, no numbers with up to 5 decimal places (only round numbers, possibly steps of 100 or similar).
• If I narrow down the power with the slider and only one car remains, I can no longer change the power, e.g. adjust it back down. I can only select something like ABS and deselect it again so that everything is reset to the initial situation. Once the power restriction has been selected, it cannot be deselected again the way equipment options can. The performance has improved a lot compared to the first version!
• Allow the desired engine power range to be set in PS (metric horsepower) as well; let the date/period of the first registration also be set with a slider; for better readability, add a thousands separator to the price (e.g. 25.000,00 €).
• Possibly introduce a “back button” that resets the selection in the respective field, such as “Ausstattung” (equipment) or “Kraftstoff” (fuel), and removes all check marks. Otherwise very nice.
• I really like that a single click in the list now suffices to narrow down the results! This has made the system more user-friendly.
• The “Jahreswagen” button only appeared after “Vorführwagen” (demonstration car) had been checked.
• Better presentation of the equipment instead of one very long list.
• It would be good if one could zoom into the images on vehicle platforms like this.


• 1. Instead of having to use only the “Close” button to close an opened listing (e.g. the Jahreswagen one was looking for), I would additionally add the option of closing it by clicking next to the opened page, which increases user-friendliness. 2. Should the system be tested or put into operation on a larger scale, a simple listing by “years” for the first registration seems most sensible to me; possibly another bar chart would also fit. 3. I would place “vehicle type” (one-year-old, used, new) on the right side below the “category” field. 4. Important: the only essential feature one recognizes when looking at the middle column (the search results) are the cars. Unfortunately, because of misplaced (or perhaps deliberate) information, only the brand and the model are shown there. At first glance it is not possible to see the first registration, mileage, power, etc.; instead the car’s extra equipment is displayed arbitrarily and without any fixed order, i.e. sometimes air conditioning is at the very top, sometimes further down, so the results cannot be compared. For my taste, the equipment should therefore be swapped with the vehicle properties, and the latter shown below the vehicle instead of the equipment. Continued good luck :)
• The first version was clearer and easier to use! One found one’s way around faster than in the current second version.
• The date format would be better as DD.MM.YYYY.

A.3.2 Crowd Workers

In the following, we outline the responses obtained from crowd workers for search interface variant 1 and variant 2.

A.3.2.1 Variant 1

• Simpler filter deselection: with multiple filters, each one has to be “clicked out” individually. Normally one can delete/reset the filters (possibly per category).
• Quick tips that appear when the cursor is moved over a function.
• A bit plain.
• No.
• No, everything is fine.
• “Used” and “new cars” should be selection options of their own, not just subcategories. I would also have liked the individual brands to be available for selection right away; that did not work even when selecting “Manufacturers”. The bars on the price slider were also somewhat irritating at first.
• No.


• Less information at once.
• Nope.
• The controls (radio buttons) are extremely small and therefore tiring to read.
• Too cumbersome.
• No.
• More images; not so static and cluttered.
• The languages should both be the same. The selection options should all be on the same side, not a few on the left and a few on the right.
• No, it is OK as it is.
• The most expensive product has a power of 85.00 [KWT], which is not even among the choices in question 11.
• Overall, the page still needs to be structured much better.
• No.
• A way of ordering the results, e.g. from cheap at the start to increasingly expensive, or similar.

A.3.2.2 Variant 2

• No.
• No.
• No, good system!!!
• None.
• This page does not explain what it does. I found this quite confusing when I saw the page for the first time.
• When the article page is open: simply clicking outside the article box to close it would be a good feature.
• Good.
• Please improve everything.
• Too little content.
• No, none!
• Nothing.
• Move the bar chart to the bottom.
• A close button should always be visible on a detail page. Currently there is no button to be found at the top or at the bottom once one has scrolled a bit; one first has to scroll all the way up or down again to be able to close the window.


• Completely rework the presentation.
• None.
• Apart from the still relatively unappealing visual design, I found the search system very pleasant and innovative.
• efw
• No.
• Everything is okay as it is!!!!!!!!
• The search does not work!
• No, I like it.
• I would appreciate being able to see right away how many pages the offers are spread across.
• No.
• 15380.00 [KMT]
• I think it is quite OK the way it currently is.

B Index of DVD Contents

The attached DVD contains source code, data, and other materials. Subsequently, we reproduce the table of contents of the index file (INDEX.txt) provided on the DVD.

1 datasets (This directory contains all datasets created and used in the thesis.)
  1.1 bmecat
    1.1.1 bsh
    1.1.2 weidmueller
  1.2 crawl
    1.2.1 big crawl
    1.2.2 household crawl
  1.3 pcs
  1.4 product search
    1.4.1 bsh and household crawl overlap
    1.4.2 mobile.de samples
    1.4.3 vso toy example
  1.5 unit conversion
    1.5.1 currencies
    1.5.2 units of measure
2 demos (This directory contains the prototypes and demos developed and presented in the thesis.)
  2.1 bmecat conversions
    2.1.1 bmecat2goodrelations
    2.1.2 bp-feelthedifference
    2.1.3 bsh
    2.1.4 weidmueller
  2.2 grcrawler household
  2.3 pcs2owl conversions
  2.4 pcs2owl landing page
  2.5 product search interfaces
    2.5.1 data supply for household crawl demo
    2.5.2 product-search
    2.5.3 product-search-static


3 evaluations (This directory contains all source code and data related to the evaluations of the thesis.)
  3.1 bmecat leverage
  3.2 cleansing
  3.3 crawl
  3.4 lib
  3.5 listings
  3.6 pcs2owl
    3.6.1 reverse engineering approach
    3.6.2 subsumption relationship evaluation
  3.7 product search interfaces
    3.7.1 decision tree analysis
    3.7.2 sus score
    3.7.3 usability raw data
  3.8 release sizes eclass
4 thesis (This directory contains the thesis document along with supplementary material.)
  4.1 authorship declarations
  4.2 publications
5 tools and libraries (This directory contains the source code of tools and libraries implemented in the thesis.)
  5.1 thesis contributions
    5.1.1 bmecat2goodrelations
    5.1.2 currency2currency
    5.1.3 grcrawler
    5.1.4 grsnippetgen logreader
    5.1.5 mobile.de scraper
    5.1.6 pcs2owl
    5.1.7 product-search
    5.1.8 rdf-translator
    5.1.9 rdflib serializers
  5.2 third-party software packages

C Online Tools and Web Resources

The work on this thesis led to the publication of numerous tools, services, demonstrators, documentations, and source code repositories on the Web, whose links are itemized below (accessed on April 15, 2015):

• http://wiki.goodrelations-vocabulary.org/Tools/GRCrawler
• https://bitbucket.org/alexstolz/grcrawler
• http://wiki.goodrelations-vocabulary.org/Tools/BMEcat2GR
• https://github.com/alexstolz/bmecat2goodrelations
• https://github.com/alexstolz/bmecat2goodrelations/wiki/Usage
• http://www.ebusiness-unibw.org/projects/bmecat2goodrelations/example/
• http://wiki.goodrelations-vocabulary.org/Tools/PCS2OWL
• http://www.ebusiness-unibw.org/ontologies/pcs2owl/
• http://www.ebusiness-unibw.org/ontologies/pcs2owl/evaluation/
• https://bitbucket.org/alexstolz/pcs2owl/wiki/Home
• http://www.currency2currency.org/
• https://bitbucket.org/alexstolz/currency2currency
• http://www.ebusiness-unibw.org/tools/product-search/
• http://www.ebusiness-unibw.org/tools/product-search-static/
• https://bitbucket.org/alexstolz/product-search
• http://rdf-translator.appspot.com/
• http://www.stalsoft.com/publications/rdf-translator-TR.pdf
• https://bitbucket.org/alexstolz/rdf-translator
• https://github.com/alexstolz/rdflib-rdfa-serializer
• https://github.com/alexstolz/rdflib-microdata-serializer

295

Bibliography Note: Missing publication year information is indicated by superscript “ ND ” (no date) in the citation label. [Abe+ 11]

F. Abel, I. Celik, G.-J. Houben, and P. Siehndel: “Leveraging the Semantics of Tweets for Adaptive Faceted Search on Twitter”. In: Proceedings of the 10th International Semantic Web Conference (ISWC 2011). Bonn, Germany, 2011, pp. 1–17.

[ABH11]

B. Adida, M. Birbeck, and I. Herman: “Semantic Annotation and Retrieval: Web of Hypertext – RDFa and Microformats”. In: Handbook of Semantic Web Technologies. Ed. by J. Domingue, D. Fensel, and J. A. Hendler. Springer Berlin Heidelberg, 2011. Chap. 5, pp. 157–190.

[Abr14]

J. Abraham: Product Information Management: Theory and Practice. Springer International Publishing Switzerland, 2014.

[Adi+ 08]

B. Adida, M. Birbeck, S. McCarron, and S. Pemberton: RDFa in XHTML: Syntax and Processing: A Collection of Attributes and Processing Rules for Extending XHTML to Support RDF. W3C Recommendation 14 October 2008. 2008. url: http://www.w3.org/TR/2008/REC-rdfa-syntax-20081014/ (accessed on May 16, 2014).

[Adi+ 13]

B. Adida, M. Birbeck, S. McCarron, and I. Herman: RDFa Core 1.1 – Second Edition: Syntax and Processing Rules for Embedding RDF through Attributes. W3C Recommendation 22 August 2013. 2013. url: http://www.w3.org/ TR/2013/REC-rdfa-core-20130822/ (accessed on May 16, 2015).

[AH11]

D. Allemang and J. Hendler: Semantic Web for the Working Ontologist. 2nd ed. Morgan Kaufmann Publishers, 2011.

[Ake70]

G. A. Akerlof: “The Market for “Lemons”: Quality Uncertainty and the Market Mechanism”. In: Quarterly Journal of Economics 84 (3) (1970), pp. 488–500.

297

298

[AL05]

Bibliography

S. Agarwal and S. Lamparter: “SMART – A Semantic Matchmaking Portal for Electronic Markets”. In: Proceedings of the Seventh IEEE International Conference on E-Commerce Technology (CEC 2005). Munich, Germany, 2005, pp. 405–408.

[Ama+ 09]

B. R. Amarnath, T. S. Somasundaram, M. Ellappan, and R. Buyya: “Ontologybased Grid Resource Management”. In: Software Practice and Experience 39 (17) (2009), pp. 1419–1438.

[Ama16]

Amazon: Was sind EANs, UPCs, ISBNs und ASINs? 2016. url: http: //www.amazon.de/gp/seller/asin-upc-isbn-info.html (accessed on

February 3, 2016). [And04]

C. Anderson: “The Long Tail”. In: Wired Magazine 12 (10) (2004), pp. 170– 177.

[Ara+ 01]

A. Arasu, J. Cho, H. Garcia-Molina, A. Paepke, and S. Raghavan: “Searching the Web”. In: ACM Transactions on Internet Technology 1 (1) (2001), pp. 2– 43.

[Are+ 14a]

M. Arenas, B. Cuenca Grau, E. Kharlamov, S. Marciuska, and D. Zheleznyakov: “Faceted Search over Ontology-enhanced RDF Data”. In: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM 2014). Shanghai, China, 2014, pp. 939–948.

[Are+ 14b]

M. Arenas, B. Cuenca Grau, E. Kharlamov, S. Marciuska, and D. Zheleznyakov: “Towards Semantic Faceted Search”. In: Poster Proceedings of the 23rd International World Wide Web Conference (WWW 2014), Companion Volume. Seoul, Korea, 2014, pp. 219–220.

[Ari16]

Ariba: cXML User’s Guide. Version 1.2.029. 2016.

[Arr63]

K. J. Arrow: “Uncertainty and the Welfare Economics of Medical Care”. In: The American Economic Review 53 (5) (1963), pp. 941–973.

[AT05]

G. Adomavicius and A. Tuzhilin: “Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions”. In: IEEE Transactions on Knowledge and Data Engineering 17 (6) (2005), pp. 734–749.

[Aue+ 07]

S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives: “DBpedia: A Nucleus for a Web of Open Data”. In: Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (ISWC 2007 + ASWC 2007). Busan, Korea, 2007, pp. 722–735.

Bibliography

[AvH08]

299

G. Antoniou and F. van Harmelen: A Semantic Web Primer. 2nd ed. The MIT Press, 2008.

[Bak98]

Y. Bakos: “The Emerging Role of Electronic Marketplaces on the Internet”. In: Communications of the ACM 41 (8) (1998), pp. 35–42.

[Bat89]

M. J. Bates: “The Design of Browsing and Berrypicking Techniques for the Online Search Interface”. In: Online Information Review 13 (5) (1989), pp. 407–424.

[Bat95]

M. Bates: “Models of Natural Language Understanding”. In: Proceedings of the National Academy of Sciences of the United States of America 92 (22) (1995), pp. 9977–9982.

[BC04]

T. Berners-Lee and D. Connolly: Delta: An Ontology for the Distribution of Differences between RDF Graphs. Technical Report. MIT Computer Science and Artificial Intelligence Laboratory, 2004.

[BC11]

T. Berners-Lee and D. Connolly: Notation3 (N3): A Readable RDF Syntax. W3C Team Submission 28 March 2011. 2011. url: http://www.w3.org/ TeamSubmission/2011/SUBM-n3-20110328/ (accessed on May 16, 2014).

[BD08]

A. A. Batabyal and G. J. DeAngelo: “To Match or Not to Match: Aspects of Marital Matchmaking under Uncertainty”. In: Operations Research Letters 36 (1) (2008), pp. 94–98.

[Ben23]

J. Bentham: An Introduction to the Principles of Morals and Legislation. Clarendon Press, Oxford, 1823.

[Ber+ 06]

T. Berners-Lee, Y. Chen, L. Chilton, D. Connolly, R. Dhanaraj, J. Hollenbach, A. Lerer, and D. Sheets: “Tabulator: Exploring and Analyzing Linked Data on the Semantic Web”. In: Proceedings of the 3rd International Semantic Web User Interaction Workshop (SWUI 2006). Athens, GA, USA, 2006.

[Ber05]

T. Berners-Lee: Notation 3 Logic: An RDF Language for the Semantic Web. 2005. url: http://www.w3.org/DesignIssues/Notation3 (accessed on May 15, 2014).

[Ber06]

T. Berners-Lee: Linked Data – Design Issues. 2006. url: http://www.w3. org/DesignIssues/LinkedData.html (accessed on May 8, 2014).

[Ber98]

T. Berners-Lee: Cool URIs Don’t Change. 1998. url: http://www.w3.org/ Provider/Style/URI (accessed on May 13, 2014).

300

[BF99]

Bibliography

T. Berners-Lee and M. Fischetti: Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web by Its Inventor. HarperCollins Publishers, 1999.

[BFM05]

T. Berners-Lee, R. T. Fielding, and L. Masinter: Uniform Resource Identifier (URI): Generic Syntax. Request for Comments 3986. 2005. url: http : //www.ietf.org/rfc/rfc3986.txt (accessed on May 7, 2014).

[BG14]

D. Brickley and R. V. Guha: RDF Schema 1.1. W3C Recommendation 25 February 2014. 2014. url: http : / / www . w3 . org / TR / 2014 / REC - rdf schema-20140225/ (accessed on May 20, 2014).

[BGG01]

R. Bapna, P. Goes, and A. Gupta: “Insights and Analyses of Online Auctions”. In: Communications of the ACM 44 (11) (2001), pp. 42–50.

[Bha+ 08]

R. Bhagdev, S. Chapman, F. Ciravegna, V. Lanfranchi, and D. Petrelli: “Hybrid Search: Effectively Combining Keywords and Semantic Searches”. In: Proceedings of the 5th European Semantic Web Conference (ESWC 2008). Tenerife, Spain, 2008, pp. 554–568.

[BHB09]

C. Bizer, T. Heath, and T. Berners-Lee: “Linked Data – The Story So Far”. In: International Journal on Semantic Web and Information Systems (IJSWIS) 5 (3) (2009), pp. 1–22.

[BHL01]

T. Berners-Lee, J. Hendler, and O. Lassila: “The Semantic Web”. In: Scientific American 284 (5) (2001), pp. 34–43.

[Bif+ 05]

A. Bifet, C. Castillo, P.-A. Chirita, and I. Weber: “An Analysis of Factors Used in Search Engine Ranking”. In: Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb 2005). Chiba, Japan, 2005, pp. 48–57.

[Biz+ 13]

C. Bizer, K. Eckert, R. Meusel, H. Mühleisen, M. Schuhmacher, and J. Völker: “Deployment of RDFa, Microdata, and Microformats on the Web – A Quantitative Analysis”. In: Proceedings of the 12th International Semantic Web Conference (ISWC 2013). Sydney, Australia, 2013, pp. 17–32.

[BK12]

F. Bauer and M. Kaltenböck: Linked Open Data: The Essentials. Vienna, Austria: edition mono/monochrom, 2012.

[BKL09]

S. Bird, E. Klein, and E. Loper: Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media, 2009.

[BKM09]

A. Bangor, P. Kortum, and J. Miller: “Determining What Individual SUS Scores Mean: Adding an Adjective Rating Scale”. In: Journal of Usability Studies 4 (3) (2009), pp. 114–123.

Bibliography

[BLP98]

301

J. R. Bettman, M. F. Luce, and J. W. Payne: “Constructive Consumer Choice Processes”. In: Journal of Consumer Research 25 (3) (1998), pp. 187– 217.

[BM08]

D. Beneventano and D. Montanari: “Ontological Mappings of Product Catalogues”. In: Poster Proceedings of the 3rd International Workshop on Ontology Matching (OM 2008). Karlsruhe, Germany, 2008.

[BM10]

M. Birbeck and S. McCarron: CURIE Syntax 1.0: A Syntax for Expressing Compact URIs. W3C Working Group Note 16 December 2010. 2010. url: https://www.w3.org/TR/2010/NOTE- curie- 20101216/ (accessed on

February 9, 2016). [BM14]

D. Brickley and L. Miller: FOAF Vocabulary Specification 0.99. Namespace Document 14 January 2014 - Paddington Edition. 2014. url: http : / / xmlns.com/foaf/spec/20140114.html (accessed on April 20, 2015).

[Bob+ 13]

J. Bobadilla, F. Ortega, A. Hernando, and A. Gutierrez: “Recommender Systems Survey”. In: Knowledge-Based Systems 46 (2013), pp. 109–132.

[Bol+ 07]

H. Boley, M. Kifer, P.-L. Patranjan, and A. Polleres: “Rule Interchange on the Web”. In: Reasoning Web. Ed. by G. Antoniou, U. Aßmann, C. Baroglio, S. Decker, N. Henze, P.-L. Patranjan, and R. Tolksdorf. Vol. 4636. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2007, pp. 269–309.

[Bol+ 08]

K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor: “Freebase: A Collaboratively Created Graph Database for Structuring Human Knowledge”. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD 2008). Vancouver, BC, Canada, 2008, pp. 1247–1250.

[Bor97]

W. N. Borst: “Construction of Engineering Ontologies for Knowledge Sharing and Reuse”. PhD thesis. University of Twente, Enschede, The Netherlands, 1997.

[BP08]

D. Berrueta and J. Phipps: Best Practice Recipes for Publishing RDF Vocabularies. W3C Working Group Note 28 August 2008. 2008. url: https: //www.w3.org/TR/2008/NOTE-swbp-vocab-pub-20080828/ (accessed

on February 19, 2016). [BP98]

S. Brin and L. Page: “The Anatomy of a Large-Scale Hypertextual Web Search Engine”. In: Proceedings of the Seventh International World Wide Web Conference (WWW 1998). Brisbane, Australia, 1998, pp. 107–117.

302

[BPJ02]

Bibliography

S. Balasubramanian, R. A. Peterson, and S. L. Jarvenpaa: “Exploring the Implications of M-Commerce for Markets and Marketing”. In: Journal of the Academy of Marketing Science 30 (4) (2002), pp. 348–361.

[BR11]

R. A. Baeza-Yates and B. A. Ribeiro-Neto: Modern Information Retrieval: The Concepts and Technology behind Search. 2nd ed. Addison-Wesley, 2011.

[Bra+ 08]

T. Bray, J. Paoli, C. M. Sperberg-McQueen, E. Maler, and F. Yergeau: Extensible Markup Language (XML) 1.0 (Fifth Edition). W3C Recommendation 26 November 2008. 2008. url: http://www.w3.org/TR/2008/RECxml-20081126/ (accessed on May 15, 2014).

[Bra14]

T. Bray: The JavaScript Object Notation (JSON) Data Interchange Format. Request for Comments 7159. 2014. url: http : / / www . ietf . org / rfc / rfc7159.txt (accessed on May 16, 2014).

[Bra83]

R. J. Brachman: “What IS-A Is and Isn’t: An Analysis of Taxonomic Links in Semantic Networks”. In: IEEE Computer 16 (10) (1983), pp. 30–36.

[Bri06]

British Computer Society: Isn’t It Semantic? Interview. 2006. url: http: //www.bcs.org/content/conWebDoc/3337 (accessed on May 16, 2014).

[Bri79]

E. Brill: “A Simple Rule-based Part of Speech Tagger”. In: Proceedings of the Third Conference on Applied Natural Language Processing (ANLC 1992). Trento, Italy, 1979, pp. 152–155.

[Bro96]

J. Brooke: “SUS – A Quick and Dirty Usability Scale”. In: Usability Evaluation in Industry. Ed. by P. Jordan, B. Thomas, B. Weerdmeester, and I. McClelland. Taylor & Francis, 1996, pp. 189–194.

[Bru+ 07]

J.-S. Brunner, L. Ma, C. Wang, L. Zhang, D. C. Wolfson, Y. Pan, and K. Srinivas: “Explorations in the Use of Semantic Web Technologies for Product Information Management”. In: Proceedings of the 16th International World Wide Web Conference (WWW 2007). Banff, Alberta, Canada, 2007, pp. 747–756.

[BS00]

E. Brynjolfsson and M. Smith: “Frictionless Commerce? A Comparison of Internet and Conventional Retailers”. In: Management Science 46 (4) (2000), pp. 563–585.

[BSG99]

P. Bingi, M. K. Sharma, and J. K. Godla: “Critical Issues Affecting an ERP Implementation”. In: Information Systems Management 16 (3) (1999), pp. 7–14.

[BSV12]

F. Branco, M. Sun, and J. M. Villas-Boas: “Optimal Search for Product Information”. In: Management Science 58 (11) (2012), pp. 2037–2056.

Bibliography

[Bui+ 13]

303

C. Buil-Aranda, M. Arenas, O. Corcho, and A. Polleres: “Federating Queries in SPARQL 1.1: Syntax, Semantics and Evaluation”. In: Web Semantics: Science, Services and Agents on the World Wide Web 18 (1) (2013), pp. 1–17.

[Bur07]

R. Burke: “Hybrid Web Recommender Systems”. In: The Adaptive Web. Ed. by P. Brusilovsky, A. Kobsa, and W. Nejdl. Vol. 4321. Lecture Notes of Computer Science. Springer Berlin Heidelberg, 2007. Chap. 12, pp. 377–408.

[Bus45]

V. Bush: “As We May Think”. In: The Atlantic Monthly 176 (1) (1945), pp. 101–108.

[BV04]

C. Buckley and E. M. Voorhees: “Retrieval Evaluation with Incomplete Information”. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2004). Sheffield, UK, 2004, pp. 25–32.

[Car+ 05]

J. J. Carroll, C. Bizer, P. Hayes, and P. Stickler: “Named Graphs”. In: Journal of Web Semantics 3 (4) (2005), pp. 247–267.

[Car14]

G. Carothers: RDF 1.1 N-Quads: A Line-based Syntax for RDF Datasets. W3C Recommendation 25 February 2014. 2014. url: http://www.w3.org/ TR/2014/REC-n-quads-20140225/ (accessed on May 16, 2014).

[Car97]

J. M. Carroll: “Human-Computer Interaction: Psychology as a Science of Design”. In: Annual Review of Psychology 48 (1) (1997), pp. 61–83.

[Cas+ 04]

C. Castillo, M. Marin, A. Rodriguez, and R. Baeza-Yates: “Scheduling Algorithms for Web Crawling”. In: Proceedings of the Joint Conference 10th Brazilian Symposium on Multimedia and the Web & 2nd Latin American Web Congress (WebMedia & LA-Web 2004). Ribeirao Preto-SP, Brazil, 2004, pp. 10–17.

[Cas+ 11]

S. Castano, A. Ferrara, S. Montanelli, and G. Varese: “Ontology and Instance Matching”. In: Knowlege-Driven Multimedia Information Extraction and Ontology Evolution. Ed. by G. Paliouras, C. D. Spyropoulos, and G. Tsatsaronis. Vol. 6050. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2011, pp. 167–195.

[CC03]

M. Chau and H. Chen: “Personalized and Focused Web Spiders”. In: Web Intelligence. Ed. by N. Zhong, J. Liu, and Y. Yao. Springer Berlin Heidelberg, 2003. Chap. 10, pp. 197–217.

[Çel14]

T. Çelik: h-product. 2014. url: http : / / microformats . org / wiki / h product (accessed on February 9, 2016).

304

[CFG03]

Bibliography

O. Corcho, M. Fernandez-Lopez, and A. Gomez-Perez: “Methodologies, Tools and Languages for Building Ontologies. Where is Their Meeting Point?” In: Data & Knowledge Engineering 46 (1) (2003), pp. 41–64.

[CG01]

O. Corcho and A. Gomez-Perez: “Solving Integration Problems of E-Commerce Standards and Initiatives through Ontological Mappings”. In: Proceedings of the Workshop on Ontologies and Information Sharing. Seattle, Washington, USA, 2001, pp. 131–140.

[Cha09a]

D. Chaffey: E-Business and E-Commerce Management: Strategy, Implementation and Practice. 4th ed. Prentice Hall, 2009.

[Cha09b]

C.-Y. C. Chang: “Does Price Matter? How Price Influences Online Consumer Decision-Making”. In: Japanese Journal of Administrative Science 22 (3) (2009), pp. 245–254.

[Cle67]

C. Cleverdon: “The Cranfield Tests on Index Language Devices”. In: ASLIB Proceedings 19 (6) (1967), pp. 173–194.

[Coa37]

R. Coase: “The Nature of the Firm”. In: Economica 4 (16) (1937), pp. 386– 405.

[Coa60]

R. Coase: “The Problem of Social Cost”. In: The Journal of Law & Economics 3 (1) (1960), pp. 1–44.

[Col+ 06]

S. Colucci, T. Di Noia, E. Di Sciascio, F. M. Donini, A. Ragone, and R. Rizzi: “A Semantic-based Fully Visual Application for Matchmaking and Query Refinement in B2C E-Marketplaces”. In: Proceedings of the 8th International Conference on Electronic Commerce (ICEC 2006). Fredericton, Canada, 2006, pp. 174–184.

[Cre+ 98]

F. Crestani, M. Lalmas, C. J. van Rijsbergen, and I. Campbell: “"Is This Document Relevant? ... Probably": A Survey of Probabilistic Models in Information Retrieval”. In: ACM Computing Surveys 30 (4) (1998), pp. 528– 552.

[CRF03]

W. W. Cohen, P. D. Ravikumar, and S. E. Fienberg: “A Comparison of String Distance Metrics for Name-Matching Tasks”. In: Proceedings of IJCAI-03 Workshop on Information Integration on the Web (IIWeb 2003). Acapulco, Mexico, 2003, pp. 73–78.

[CS14a]

G. Carothers and A. Seaborne: RDF 1.1 N-Triples: A Line-based Syntax for an RDF Graph. W3C Recommendation 25 February 2014. 2014. url: http://www.w3.org/TR/2014/REC-n-triples-20140225/ (accessed on

May 16, 2014).

Bibliography

[CS14b]

305

G. Carothers and A. Seaborne: RDF 1.1 TriG: RDF Dataset Language. W3C Recommendation 25 February 2014. 2014. url: http://www.w3.org/ TR/2014/REC-trig-20140225/ (accessed on May 16, 2014).

[Cul+ 07]

A. Culotta, M. Wick, R. Hall, M. Marzilli, and A. McCallum: “Canonicalization of Database Records Using Adaptive Similarity Measures”. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2007). San Jose, California, USA, 2007, pp. 201–209.

[Cun02]

H. Cunningham: “GATE, a General Architecture for Text Engineering”. In: Computers and the Humanities 36 (2002), pp. 223–254.

[CvBD99]

S. Chakrabarti, M. van den Berg, and B. Dom: “Focused Crawling: A New Approach to Topic-specific Web Resource Discovery”. In: Computer Networks 31 (11-16) (1999), pp. 1623–1640.

[CWL14]

R. Cyganiak, D. Wood, and M. Lanthaler: RDF 1.1 Concepts and Abstract Syntax. W3C Recommendation 25 February 2014. 2014. url: http://www. w3.org/TR/2014/REC-rdf11-concepts-20140225/ (accessed on May 14,

2014). [Cyg+ 08]

R. Cyganiak, H. Stenzhorn, R. Delbru, S. Decker, and G. Tummarello: “Semantic Sitemaps: Efficient and Flexible Access to Datasets on the Semantic Web”. In: Proceedings of the 5th European Semantic Web Conference (ESWC 2008). Tenerife, Spain, 2008, pp. 690–704.

[dBru+ 05]

J. de Bruijn, A. Polleres, R. Lara, and D. Fensel: “OWL DL vs. OWL Flight: Conceptual Modeling and Reasoning for the Semantic Web”. In: Proceedings of the 14th International World Wide Web Conference (WWW 2005). Chiba, Japan, 2005, pp. 623–632.

[Dee+ 90]

S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman: “Indexing by Latent Semantic Analysis”. In: Journal of the American Society for Information Science 41 (6) (1990), pp. 391–407.

[DFH11]

J. Domingue, D. Fensel, and J. A. Hendler: “Introduction to the Semantic Web Technologies”. In: Handbook of Semantic Web Technologies. Ed. by J. Domingue, D. Fensel, and J. A. Hendler. Springer Berlin Heidelberg, 2011. Chap. 1, pp. 3–41.

[DH05]

A. Doan and A. Y. Halevy: “Semantic-Integration Research in the Database Community”. In: AI Magazine 26 (1) (2005), pp. 83–94.

306

[DHI12]

Bibliography

A. Doan, A. Halevy, and Z. Ives: Principles of Data Integration. Morgan Kaufmann Publishers, 2012.

[Di + 03]

T. Di Noia, E. Di Sciascio, F. M. Donini, and M. Mongiello: “A System for Principled Matchmaking in an Electronic Marketplace”. In: Proceedings of the 12th International World Wide Web Conference (WWW 2003). Budapest, Hungary, 2003, pp. 321–330.

[Dij82]

E. W. Dijkstra: “EWD 447: On the Role of Scientific Thought”. In: Selected Writings on Computing: A Personal Perspective. Springer New York, 1982, pp. 60–66.

[Din+ 04]

L. Ding, T. Finin, A. Joshi, Y. Peng, R. Scott Cost, J. Sachs, R. Pan, P. Reddivari, and V. Doshi: “Swoogle: A Semantic Web Search and Metadata Engine”. In: Proceedings of the 13th ACM International Conference on Information and Knowledge Management (CIKM 2004). Washington, DC, USA, 2004, pp. 652–659.

[Din+ 05]

L. Ding, T. Finin, A. Joshi, Y. Peng, R. Pan, and P. Reddivari: “Search on the Semantic Web”. In: IEEE Computer 38 (10) (2005), pp. 62–69.

[DLS01]

F.-D. Dorloff, J. Leukel, and V. Schmitz: “Standards für den Austausch von elektronischen Produktkatalogen”. In: WISU 30 (11) (2001), pp. 1528–1536.

[DM11]

M. D’Aquin and E. Motta: “Watson, More Than a Semantic Web Search Engine”. In: Semantic Web 2 (1) (2011), pp. 55–63.

[Dod06]

L. Dodds: Slug: A Semantic Web Crawler. 2006. url: http://www.ldodds. com/projects/slug/slug-a-semantic-web-crawler.pdf (accessed on

July 23, 2014). [Don+ 14]

X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang: “Knowledge Vault: A Web-scale Approach to Probabilistic Knowledge Fusion”. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2014). New York, NY, USA, 2014, pp. 601–610.

[DR10]

T. Di Noia and A. Ragone: “Electronic Markets, a Look Behind the Curtains: How Can Semantic Matchmaking and Negotiation Boost E-Commerce?” In: Proceedings of the 11th International Conference on Electronic Commerce and Web Technologies (EC-Web 2010). Bilbao, Spain, 2010, pp. 241–252.

[Dre+ 08]

A. Dreibelbis, E. Hechler, I. Milman, M. Oberhofer, P. van Run, and D. Wolfson: Enterprise Master Data Management: An SOA Approach to Managing Core Information. IBM Press, 2008.

Bibliography

[DS04]

307

M. Dean and G. Schreiber: OWL Web Ontology Language Reference. W3C Recommendation 10 February 2004. 2004. url: http://www.w3.org/TR/ 2004/REC-owl-ref-20040210/ (accessed on May 20, 2014).

[DS05]

M. Dürst and M. Suignard: Internationalized Resource Identifiers (IRIs). Request for Comments 3987. 2005. url: http : / / www . ietf . org / rfc / rfc3987.txt (accessed on May 7, 2014).

[Du+ 04]

R. Du, E. Foo, C. Boyd, and B. Fitzgerald: “Defining Security Services for Electronic Tendering”. In: Proceedings of the Second Australasian Information Security Workshop (AISW 2004). Dunedin, New Zealand, 2004, pp. 43–52.

[EBa15]

EBay: What Is a Manufacturer Part Number? 2015. url: http://www.ebay. com/gds/What-Is-a-Manufacturer-Part-Number-/10000000177404842/ g.html (accessed on February 16, 2016).

[EClND ]

ECl@ss e. V.: eCl@ss Classification and Product Description. url: http: //www.eclass.de/ (accessed on May 16, 2014).

[ECl14]

ECl@ss e. V.: Category:Products – wiki.eclass.eu. 2014. url: http://wiki. eclass.eu/wiki/Category:Products (accessed on September 17, 2014).

[Ehr07]

M. Ehrig: Ontology Alignment: Bridging the Semantic Gap. Springer US, 2007.

[Eic05]

B. Eich: “JavaScript at Ten Years”. In: Proceedings of the Tenth ACM SIGPLAN International Conference on Functional Programming (ICFP 2005). Tallinn, Estonia, 2005, pp. 129–129.

[Eit+ 01]

T. Eiter, D. Veit, J. P. Müller, and M. Schneider: “Matchmaking for Structured Objects”. In: Proceedings of the Third International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2001). Munich, Germany, 2001, pp. 186–194.

[Eks+ 11]

M. D. Ekstrand, M. Ludwig, J. A. Konstan, and J. T. Riedl: “Rethinking the Recommender Research Ecosystem: Reproducibility, Openness, and LensKit”. In: Proceedings of the Fifth ACM Conference on Recommender Systems (RecSys 2011). Chicago, IL, USA, 2011, pp. 133–140.

[EleND ]

Electronic Commerce Code Management Association: Why eOTD? url: http://www.eccma.org/whyeotd.php (accessed on May 8, 2014).

[ES07]

J. Euzenat and P. Shvaiko: Ontology Matching. Springer Berlin Heidelberg, 2007.

308

[Eur08a]

Bibliography

European Commission: “Commission Regulation (EC) No 213/2008 of 28 November 2007 Amending Regulation (EC) No 2195/2002 of the European Parliament and of the Council on the Common Procurement Vocabulary (CPV) and Directives 2004/17/EC and 2004/18/EC of the European Parliament”. In: Official Journal of the European Union L 74 (2008).

[Eur08b]

European Commission: “Regulation (EC) No 451/2008 of the European Parliament and of the Council of 23 April 2008 Establishing a New Statistical Classification of Products by Activity (CPA) and Repealing Council Regulation (ECC) No 3696/93”. In: Official Journal of the European Union L 145 (2008).

[Eva07]

M. P. Evans: “Analysing Google Rankings Through Search Engine Optimization Data”. In: Internet Research 17 (1) (2007), pp. 21–37.

[Fac12]

Facebook Inc.: The Open Graph Protocol. 2012. url: http :/ / ogp .me/ (accessed on May 16, 2014).

[FCB12]

D. C. Faye, O. Cure, and G. Blin: “A Survey of RDF Storage Approaches”. In: ARIMA Journal 15 (2012), pp. 11–35.

[Fei+ 13]

L. Feigenbaum, G. T. Williams, K. G. Clark, and E. Torres: SPARQL 1.1 Protocol. W3C Recommendation 21 March 2013. 2013. url: http : //www.w3.org/TR/2013/REC-sparql11-protocol-20130321/ (accessed

on May 26, 2014). [Fen+ 01]

D. Fensel, Y. Ding, B. Omelayenko, E. Schulten, G. Botquin, M. Brown, and A. Flett: “Product Data Integration in B2B E-Commerce”. In: IEEE Intelligent Systems 16 (4) (2001), pp. 54–59.

[FFS07]

A. Felfernig, G. Friedrich, and L. Schmidt-Thieme: “Recommender Systems”. In: IEEE Intelligent Systems 22 (3) (2007), pp. 18–21.

[FH10]

C. Fürber and M. Hepp: “Using Semantic Web Resources for Data Quality Management”. In: Proceedings of the 17th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2010). Lisbon, Portugal, 2010, pp. 211–225.

[FH11]

S. Ferré and A. Hermann: “Semantic Search: Reconciling Expressive Querying and Exploratory Search”. In: Proceedings of the 10th International Semantic Web Conference (ISWC 2011). Bonn, Germany, 2011, pp. 177–192.

Bibliography

[FH13]

309

C. Fantapié Altobelli and D. Hilger: “F-Commerce – Möglichkeiten und Grenzen von Facebook als Vertriebskanal am Beispiel von Dienstleistern”. In: Dienstleistungsmanagement und Social Media: Potenziale, Strategien und Instrumente. Ed. by M. Bruhn and K. Hadwich. Springer Fachmedien Wiesbaden, 2013. Chap. 6, pp. 469–489.

[Fie+ 99]

R. T. Fielding, J. Gettys, J. C. Mogul, H. F. Nielsen, L. Masinter, P. J. Leach, and T. Berners-Lee: Hypertext Transfer Protocol – HTTP/1.1. Request for Comments 2616. 1999. url: http://www.ietf.org/rfc/rfc2616.txt (accessed on May 7, 2014).

[Fie00]

R. T. Fielding: “Architectural Styles and the Design of Network-based Software Architectures”. PhD thesis. University of California, Irvine, 2000.

[FLR14]

R. T. Fielding, Y. Lafon, and J. Reschke: Hypertext Transfer Protocol (HTTP/1.1): Range Requests. Request for Comments 7233. 2014. url: http: //www.ietf.org/rfc/rfc7233.txt (accessed on February 5, 2016).

[FNR14]

R. T. Fielding, M. Nottingham, and J. Reschke: Hypertext Transfer Protocol (HTTP/1.1): Caching. Request for Comments 7234. 2014. url: http:// www.ietf.org/rfc/rfc7234.txt (accessed on February 5, 2016).

[FR14a]

R. T. Fielding and J. Reschke: Hypertext Transfer Protocol (HTTP/1.1): Authentication. Request for Comments 7235. 2014. url: http://www.ietf. org/rfc/rfc7235.txt (accessed on February 5, 2016).

[FR14b]

R. T. Fielding and J. Reschke: Hypertext Transfer Protocol (HTTP/1.1): Conditional Requests. Request for Comments 7232. 2014. url: http://www. ietf.org/rfc/rfc7232.txt (accessed on February 5, 2016).

[FR14c]

R. T. Fielding and J. Reschke: Hypertext Transfer Protocol (HTTP/1.1): Message Syntax and Routing. Request for Comments 7230. 2014. url: http: //www.ietf.org/rfc/rfc7230.txt (accessed on February 5, 2016).

[FR14d]

R. T. Fielding and J. Reschke: Hypertext Transfer Protocol (HTTP/1.1): Semantics and Content. Request for Comments 7231. 2014. url: http : //www.ietf.org/rfc/rfc7231.txt (accessed on February 5, 2016).

[FS85]

J. Farrell and G. Saloner: “Standardization, Compatibility, and Innovation”. In: Rand Journal of Economics 16 (1) (1985), pp. 70–83.

[Fuh92]

N. Fuhr: “Probabilistic Models in Information Retrieval”. In: The Computer Journal 35 (3) (1992), pp. 243–255.

310

[FW98]

Bibliography

E. C. Freuder and R. J. Wallace: “Suggestion Strategies for Constraint-based Matchmaker Agents”. In: Proceedings of the 4th International Conference on Principles and Practice of Constraint Programming (CP 1998). Pisa, Italy, 1998, pp. 192–204.

[Gan+ 11]

F. L. Gandon, R. Krummenacher, S.-K. Han, and I. Toma: “Semantic Annotation and Retrieval: RDF”. In: Handbook of Semantic Web Technologies. Ed. by J. Domingue, D. Fensel, and J. A. Hendler. Springer Berlin Heidelberg, 2011. Chap. 4, pp. 117–155.

[Gar04]

L. M. Garshol: “Metadata? Thesauri? Taxonomies? Topic Maps! Making Sense of It All”. In: Journal of Information Science 30 (4) (2004), pp. 378– 391.

[GGH09]

K. Goel, R. V. Guha, and O. Hansson: Introducing Rich Snippets. Google Webmaster Central Blog. 2009. url: http://googlewebmastercentral. blogspot.de/2009/05/introducing- rich- snippets.html (accessed

on August 11, 2014). [GI12]

GS1 Germany GmbH and Institut der deutschen Wirtschaft Köln Consult GmbH: Economic Success Thanks to eBusiness Standards: Entrepreneurs Show How It Works. Cologne, Germany, 2012.

[GL02]

M. Gruninger and J. Lee: “Ontology Applications and Design”. In: Communications of the ACM 45 (2) (2002), pp. 39–41.

[GMM03]

R. V. Guha, R. McCool, and E. Miller: “Semantic Search”. In: Proceedings of the Twelfth International World Wide Web Conference (WWW 2003). Budapest, Hungary, 2003, pp. 700–709.

[Gol08]

A. Goldfarb: “Electronic Commerce”. In: The New Palgrave Dictionary of Economics. Ed. by S. N. Durlauf and L. E. Blume. 2nd ed. Basingstoke: Palgrave Macmillan, 2008.

[Gol76]

V. P. Goldberg: “Regulation and Administered Contracts”. In: The Bell Journal of Economics 7 (2) (1976), pp. 426–448.

[GooND ]

Google: About Unique Product Identifiers. url: https://support.google. com / merchants / answer / 160161 ? hl = en % 7B % 5C & %7Dref _ topic = 6244294 (accessed on February 16, 2016).

[Goo13]

Google: The Google Product Taxonomy. Google Merchant Center Help. 2013. url: https://www.google.com/basepages/producttype/taxonomy. en-US.txt (accessed on May 16, 2014).

Bibliography

[Goo15a]

311

Google: About schema.org. 2015. url: https://developers.google.com/ structured-data/schema-org (accessed on February 10, 2016).

[Goo15b]

Google: Content API for Shopping – Best Practices. Google Developers. 2015. url: https://developers.google.com/shopping-content/v2/bestpractices (accessed on February 19, 2016).

[Goo15c]

Google: Inside Search: Algorithms. 2015. url: https : / / www . google . com / insidesearch / howsearchworks / algorithms . html (accessed on

January 26, 2016). [Goo16]

Google: Rich Snippets. 2016. url: https://developers.google.com/ structured-data/rich-snippets/ (accessed on February 10, 2016).

[GOS09]

N. Guarino, D. Oberle, and S. Staab: “What Is an Ontology?” In: Handbook on Ontologies. Ed. by S. Staab and R. Studer. 2nd ed. Springer Berlin Heidelberg, 2009, pp. 1–17.

[GPP13]

P. Gearon, A. Passant, and A. Polleres: SPARQL 1.1 Update. W3C Recommendation 21 March 2013. 2013. url: http://www.w3.org/TR/2013/RECsparql11-update-20130321/ (accessed on May 24, 2014).

[GQ02]

T. Gupta and A. Qasem: “Reduction of Price Dispersion through Semantic ECommerce: A Position Paper”. In: Proceedings of the Semantic Web Workshop. Hawaii, USA, 2002, pp. 1–2.

[GR98]

J. C. Giarratano and G. D. Riley: Expert Systems: Principles and Programming. Boston, Massachusetts, USA: PWS Publishing Company, 1998.

[Gri+ 11]

S. Grimm, A. Abecker, J. Völker, and R. Studer: “Ontologies and the Semantic Web”. In: Handbook of Semantic Web Technologies. Ed. by J. Domingue, D. Fensel, and J. A. Hendler. Springer Berlin Heidelberg, 2011. Chap. 13, pp. 507–579.

[Gri03]

M. Grieger: “Electronic Marketplaces: A Literature Review and a Call for Supply Chain Management Research”. In: European Journal of Operational Research 144 (2) (2003), pp. 280–294.

[Gru12]

J. Grudin: “A Moving Target – The Evolution of Human-Computer Interaction”. In: The Human-Computer Interaction Handbook. Ed. by J. A. Jacko. Vol. 3. CRC Press, 2012. Chap. Introducti, pp. xxvii–lxi.

[Gru93]

T. R. Gruber: “A Translation Approach to Portable Ontology Specifications”. In: Knowledge Acquisition 5 (2) (1993), pp. 199–220.

[GS1ND ]

GS1 AISBL: The Value and Benefits of the GS1 System of Standards. Brussels, Belgium: GS1.

312

[GS105]

Bibliography

GS1 AISBL: Global Product Classification (GPC): The Global Language for Classifying Goods. 3rd ed. GS1, 2005.

[GS115]

GS1 AISBL: Global Trade Item Number (GTIN). GS1, 2015.

[GS116]

GS1 AISBL: GS1 General Specifications. GS1, 2016.

[GS14]

F. Gandon and G. Schreiber: RDF 1.1 XML Syntax. W3C Recommendation 25 February 2014. 2014. url: http://www.w3.org/TR/2014/REC-rdfsyntax-grammar-20140225/ (accessed on May 15, 2014).

[GTB01]

J. Gonzalez-Castillo, D. Trastour, and C. Bartolini: Description Logics for Matchmaking of Services. Technical Report HPL–2001–265. HP Laboratories Bristol, 2001.

[Guh12]

R. V. Guha: Good Relations and Schema.org. 2012. url: http://blog. schema.org/2012/11/good-relations-and-schemaorg.html (accessed

on April 15, 2015). [GW02]

N. Guarino and C. Welty: “Evaluating Ontological Decisions with OntoClean”. In: Communications of the ACM 45 (2) (2002), pp. 61–65.

[GW09]

N. Guarino and C. Welty: “An Overview of OntoClean”. In: Handbook on Ontologies. Ed. by S. Staab and R. Studer. 2nd ed. Springer Berlin Heidelberg, 2009, pp. 201–220.

[Haa+ 04]

P. Haase, J. Broekstra, A. Eberhart, and R. Volz: “A Comparison of RDF Query Languages”. In: Proceedings of the Third International Semantic Web Conference (ISWC 2004). Hiroshima, Japan, 2004, pp. 502–517.

[Haa+ 11]

K. Haas, P. Mika, P. Tarjan, and R. Blanco: “Enhanced Results for Web Search”. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2011). Beijing, China, 2011, pp. 725–734.

[Hah+ 10]

R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel: “Faceted Wikipedia Search”. In: Proceedings of the 13th International Conference on Business Information Systems (BIS 2010). Berlin, Germany, 2010, pp. 1–11.

[Hak+ 06]

S. Hakkarainen, L. Hella, D. Strasunskas, and S. Tuxen: “A Semantic Transformation Approach for ISO 15926”. In: Proceedings of the First International Workshop on Ontologizing Industrial Standards (OIS 2006). Tucson, Arizona, USA, 2006, pp. 281–290.

[Hal05]

A. Y. Halevy: “Why Your Data Won’t Mix”. In: ACM Queue 3 (8) (2005), pp. 50–58.

Bibliography

[Han07]

313

O. Handle: “Konzeption und Realisierung eines branchenübergreifenden Produktklassifikationssystems für das Bauwesen unter Nutzung der produktspezifischen Fachkompetenz der Baustoffindustrie”. Master thesis. MCI Management Center Innsbruck, Innsbruck, Austria, 2007.

[Han14]

Handelsverband Deutschland: E-Commerce-Umsätze. 2014. url: http:// www . einzelhandel. de / index. php / presse /zahlenfaktengrafiken / item/110185-e-commerce-umsaetze (accessed on February 26, 2015).

[Har+ 04]

A. Harth, S. Decker, Y. He, H. Tanmunarunkit, and C. Kesselman: “A Semantic Matchmaker Service on the Grid”. In: Poster Proceedings of the 13th International World Wide Web Conference (WWW 2004), Alternate Track. New York, NY, USA, 2004, pp. 326–327.

[Has+ 11]

B. Haslhofer, E. Momeni, B. Schandl, and S. Zander: Europeana RDF Store Report: The Results of Qualitative and Quantitative Study of Existing RDF Stores in the Context of Europeana. Technical Report. EuropeanaConnect, 2011.

[HB11]

T. Heath and C. Bizer: Linked Data: Evolving the Web into a Global Data Space. 1st ed. Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan & Claypool, 2011.

[HBS09]

A. Hertel, J. Broekstra, and H. Stuckenschmidt: “RDF Storage and Retrieval Systems”. In: Handbook on Ontologies. Ed. by S. Staab and R. Studer. 2nd ed. Springer Berlin Heidelberg, 2009, pp. 489–508.

[HdB07]

M. Hepp and J. de Bruijn: “GenTax: A Generic Methodology for Deriving OWL and RDF-S Ontologies from Hierarchical Classifications, Thesauri, and Inconsistent Taxonomies”. In: Proceedings of the 4th European Semantic Web Conference (ESWC 2007). Innsbruck, Austria, 2007, pp. 129–144.

[Hea+ 02]

M. A. Hearst, A. Elliott, J. English, R. Sinha, K. Searingen, and K.-P. Yee: “Finding the Flow in Web Site Search”. In: Communications of the ACM 45 (9) (2002), pp. 42–49.

[Hea09]

M. A. Hearst: “The Design of Search User Interfaces”. In: Search User Interfaces. Cambridge University Press, 2009. Chap. 1.

[Hea11]

M. A. Hearst: “User Interfaces for Search”. In: Modern Information Retrieval: The Concepts and Technology behind Search. Vol. 2. Addison-Wesley, 2011. Chap. 2, pp. 21–55.

[Hed10]

H. Hedden: The Accidental Taxonomist. Information Today, 2010.

314

[Hen10]

Bibliography

J. Hendler: “Web 3.0: The Dawn of Semantic Search”. In: Computer 43 (1) (2010), pp. 77–80.

[Hep03]

M. Hepp: “Güterklassifikation als semantisches Standardisierungsproblem”. PhD thesis. Universität Würzburg, Würzburg, Germany, 2003.

[Hep05a]

M. Hepp: “A Methodology for Deriving OWL Ontologies from Products and Services Categorization Standards”. In: Proceedings of the 13th European Conference on Information Systems (ECIS 2005). Regensburg, Germany, 2005, pp. 1–12.

[Hep05b]

M. Hepp: “eClassOWL: A Fully-fledged Products and Services Ontology in OWL”. In: Poster and Demo Proceedings of the 4th International Semantic Web Conference (ISWC 2005). Galway, Ireland, 2005.

[Hep06]

M. Hepp: “Products and Services Ontologies: A Methodology for Deriving OWL Ontologies from Industrial Categorization Standards”. In: International Journal on Semantic Web and Information Systems (IJSWIS) 2 (1) (2006), pp. 72–99.

[Hep07a]

M. Hepp: “Possible Ontologies: How Reality Constrains the Development of Relevant Ontologies”. In: IEEE Internet Computing 11 (1) (2007), pp. 90–96.

[Hep07b]

M. Hepp: “ProdLight: A Lightweight Ontology for Product Description Based on Datatype Properties”. In: Proceedings of the 10th International Conference on Business Information Systems (BIS 2007). Poznan, Poland, 2007, pp. 260–272.

[Hep08a]

M. Hepp: “GoodRelations: An Ontology for Describing Products and Services Offers on the Web”. In: Proceedings of the 16th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2008). Acritezza, Italy, 2008, pp. 329–346.

[Hep08b]

M. Hepp: GoodRelations: An Ontology for Describing Web Offerings. Technical Report pp. 2008–05–15. SEBIS, 2008.

[Hep11]

M. Hepp: GoodRelations Language Reference. V 1.0, Release 2011-10-01. 2011. url: http://www.heppnetz.de/ontologies/goodrelations/v1. html (accessed on May 22, 2014).

[Hep12a]

M. Hepp: GoodRelations for Manufacturers of Commodities. 2012. url: http://wiki.goodrelations-vocabulary.org/GoodRelations _ for _ manufacturers (accessed on November 12, 2015).

Bibliography

[Hep12b]

315

M. Hepp: “The Web of Data for E-Commerce in Brief”. In: Proceedings of the 12th International Conference on Web Engineering (ICWE 2012). Berlin, Germany, 2012, pp. 510–511.

[Hep13]

M. Hepp: Useful Rules, Axioms, and Mappings for GoodRelations. 2013. url: http://wiki.goodrelations-vocabulary.org/Axioms (accessed on February 20, 2016).

[Hep15a]

M. Hepp: GoodRelations as Part of Schema.org. 2015. url: http://wiki. goodrelations- vocabulary.org/Cookbook/Schema.org (accessed on

February 19, 2016). [Hep15b]

M. Hepp: “The Web of Data for E-Commerce: Schema.org and GoodRelations for Researchers and Practitioners”. In: Proceedings of the 15th International Conference on Web Engineering (ICWE 2015). Rotterdam, The Netherlands, 2015, pp. 723–727.

[Her+ 04]

S. C. Herring, L. A. Scheidt, S. Bonus, and E. Wright: “Bridging the Gap: A Genre Analysis of Weblogs”. In: Proceedings of the 37th Hawaii International Conference on System Sciences (HICCS 2004). Big Island, Hawaii, USA, 2004.

[Her+ 13]

I. Herman, B. Adida, M. Sporny, and M. Birbeck: RDFa 1.1 Primer – Second Edition: Rich Structured Data Markup for Web Documents. W3C Working Group Note 22 August 2013. 2013. url: http://www.w3.org/TR/2013/ NOTE-rdfa-primer-20130822/ (accessed on May 17, 2014).

[Hev+ 04]

A. R. Hevner, S. T. March, J. Park, and S. Ram: “Design Science in Information Systems Research”. In: MIS Quarterly 28 (1) (2004), pp. 75– 105.

[HGR09]

M. Hepp, R. Garcıa, and A. Radinger: “RDF2RDFa: Turning RDF into Snippets for Copy-and-Paste”. In: Poster and Demo Proceedings of the 8th International Semantic Web Conference (ISWC 2009). Washington, DC, USA, 2009.

[HH00]

J. Hefflin and J. Hendler: “Searching the Web with SHOE”. In: Artificial Intelligence for Web Search. Papers from the AAAI Workshop. Menlo Park, CA, USA, 2000, pp. 35–40.

[Hic+ 14]

I. Hickson, R. Berjon, S. Faulkner, T. Leithead, E. Doyle Navara, E. O’Connor, and S. Pfeiffer: HTML5: A Vocabulary and Associated APIs for HTML and XHTML. W3C Recommendation 28 October 2014. 2014. url: http:

316

Bibliography //www.w3.org/TR/2014/REC-html5-20141028/ (accessed on April 13,

2015). [Hic13]

I. Hickson: HTML Microdata. W3C Working Group Note 29 October 2013. 2013. url: http://www.w3.org/TR/2013/NOTE-microdata-20131029/ (accessed on May 16, 2014).

[Hil99]

P. Hill: “Tangibles, Intangibles and Services: A New Taxonomy for the Classification of Output”. In: The Canadian Journal of Economics 32 (2) (1999), pp. 426–446.

[Hjø08]

B. Hjørland: “What is Knowledge Organization (KO)?” In: Knowledge Organization. International Journal devoted to Concept Theory, Classification, Indexing and Knowledge Representation 35 (2/3) (2008), pp. 86–101.

[HLS07]

M. Hepp, J. Leukel, and V. Schmitz: “A Quantitative Analysis of Product Categorization Standards: Content, Coverage, and Maintenance of eCl@ss, UNSPSC, eOTD, and the RosettaNet Technical Dictionary”. In: Knowledge and Information Systems 13 (1) (2007), pp. 77–114.

[HM07]

T. Heath and E. Motta: “Revyu.com: A Reviewing and Rating Site for the Web of Data”. In: Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (ISWC 2007 + ASWC 2007). Busan, Korea, 2007, pp. 895–902.

[Hod+ 14]

R. Hodgson, P. J. Keller, J. Hodges, and J. Spivak: QUDT – Quantities, Units, Dimensions and Data Types Ontologies. 2014. url: http://qudt.org/ (accessed on October 30, 2014).

[Hod00]

G. Hodge: Systems of Knowledge Organization for Digital Libraries: Beyond Traditional Authority Files. Washington, DC, USA: Council on Library and Information Resources, 2000.

[Hof+ 00]

Y. Hoffner, C. Facciorusso, S. Field, and A. Schade: “Distribution Issues in the Design and Implementation of a Virtual Market Place”. In: Computer Networks 32 (6) (2000), pp. 717–730.

[Hog+ 10]

A. Hogan, A. Polleres, J. Umbrich, and A. Zimmermann: “Some Entities Are More Equal than Others: Statistical Methods to Consolidate Linked Data”. In: Proceedings of the Workshop on New Forms of Reasoning for the Semantic Web: Scalable & Dynamic (NeFoRS 2010). Heraklion, Greece, 2010.

Bibliography

[Hog+ 11]

317

A. Hogan, A. Harth, J. Umbrich, S. Kinsella, A. Polleres, and S. Decker: “Searching and Browsing Linked Data with SWSE: The Semantic Web Search Engine”. In: Web Semantics: Science, Services and Agents on the World Wide Web 9 (4) (2011), pp. 365–401.

[Hol79]

B. Holmström: “Moral Hazard and Observability”. In: The Bell Journal of Economics 10 (1) (1979), pp. 74–91.

[Hop08]

E. Hopkins: “Price Dispersion”. In: The New Palgrave Dictionary of Economics. Ed. by S. N. Durlauf and L. E. Blume. 2nd ed. Basingstoke: Palgrave Macmillan, 2008.

[Hor+ 04]

I. Horrocks, P. F. Patel-Schneider, H. Boley, S. Tabet, B. Grosof, and M. Dean: SWRL: A Semantic Web Rule Language Combining OWL and RuleML. W3C Member Submission 21 May 2004. 2004. url: http://www.w3.org/ Submission/2004/SUBM-SWRL-20040521/ (accessed on May 26, 2014).

[HP11]

I. Horrocks and P. F. Patel-Schneider: “KR and Reasoning on the Semantic Web: OWL”. In: Handbook of Semantic Web Technologies. Ed. by J. Domingue, D. Fensel, and J. A. Hendler. Springer Berlin Heidelberg, 2011. Chap. 9, pp. 365–398.

[HR04]

M. Huth and M. Ryan: Logic in Computer Science: Modelling and Reasoning about Systems. 2nd ed. Cambridge University Press, 2004.

[HR09]

M. Hepp and A. Radinger: “SKOS2OWL: An Online Tool for Deriving OWL and RDF-S Ontologies from SKOS Vocabularies”. In: Poster and Demo Proceedings of the 8th International Semantic Web Conference (ISWC 2009). Washington, DC, USA, 2009.

[HRO06]

A. Y. Halevy, A. Rajaraman, and J. J. Ordille: “Data Integration: The Teenage Years”. In: Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB 2006). Seoul, Korea, 2006, pp. 9–16.

[HS00]

C. Hümpel and V. Schmitz: “BMEcat – An XML Standard for Electronic Product Data Interchange”. In: Proceedings of the First German Conference XML 2000. Heidelberg, Germany, 2000, pp. 1–11.

[HS13]

S. Harris and A. Seaborne: SPARQL 1.1 Query Language. W3C Recommendation 21 March 2013. 2013. url: http://www.w3.org/TR/2013/RECsparql11-query-20130321/ (accessed on May 23, 2014).

318

[HT05]

Bibliography

M. Hepp and R. Thome: “XML-Spezifikationen und Standards für den Datenaustausch”. In: Electronic Commerce und Electronic Business. Mehrwert durch Integration und Automation. Ed. by R. Thome, H. Schinzer, and M. Hepp. 3rd ed. Vahlen, München, 2005, pp. 191–216.

[HUD06]

A. Harth, J. Umbrich, and S. Decker: “MultiCrawler: A Pipelined Architecture for Crawling and Indexing Semantic Web Data”. In: Proceedings of the 5th International Semantic Web Conference (ISWC 2006). Athens, GA, USA, 2006, pp. 258–271.

[Hue00]

C. Huemer: “XML vs. UN/EDIFACT or Flexibility vs. Standardisation”. In: Proceedings of the 13th International Bled Electronic Commerce Conference. Bled, Slovenia, 2000.

[IBM11]

IBM: IBM100 – E-Business. 2011. url: http://www-03.ibm.com/ibm/ history/ibm100/us/en/icons/ebusiness/transform/ (accessed on

May 16, 2014). [II94]

International Organization for Standardization and International Electrotechnical Commission: Information Technology – Open Systems Interconnection – Basic Reference Model: The Basic Model. ISO/IEC 74. 1994.

[Inf15]

Informationsstelle für Arzneispezialitäten: Technische Hinweise zur PZNCodierung im Code 39. 2015.

[Inm02]

W. H. Inmon: Building the Data Warehouse. 3rd ed. John Wiley & Sons, Inc., 2002.

[IntND ]

International Organization for Standardization: Language Codes – ISO 639. url: http://www.iso.org/iso/home/standards/language _ codes. htm (accessed on May 16, 2014).

[Int02a]

International Organization for Standardization: ISO 10303-21:2002: Industrial Automation Systems and Integration – Product Data Representation and Exchange – Part 21: Implementation Methods: Clear Text Encoding of the Exchange Structure. 2002.

[Int02b]

International Organization for Standardization: ISO 639-1:2002: Codes for the Representation of Names of Languages – Part 1: Alpha-2 Code. 2002.

[Int04]

International Organization for Standardization: ISO 10303-11:2004: Industrial Automation Systems and Integration – Product Data Representation and Exchange – Part 11: Description Methods: The EXPRESS Language Reference Manual. 2004.

[Int05]

International Organization for Standardization: ISO 2108:2005: Information and Documentation – International Standard Book Number (ISBN). 2005.

[Int07a]

International Organization for Standardization: ISO 10303-28:2007: Industrial Automation Systems and Integration – Product Data Representation and Exchange – Part 28: Implementation Methods: XML Representations of EXPRESS Schemas and Data, Using XML Schemas. 2007.

[Int07b]

International Organization for Standardization: ISO 639-3:2007: Codes for the Representation of Names of Languages – Part 3: Alpha-3 Code for Comprehensive Coverage of Languages. 2007.

[Int08]

International Organization for Standardization: ISO 4217:2008: Codes for the Representation of Currencies and Funds. 2008.

[Int11]

International Organization for Standardization: ISO 25964-1:2011: Information and Documentation – Thesauri and Interoperability with Other Vocabularies – Part 1: Thesauri for Information Retrieval. 2011.

[Int13a]

International Organization for Standardization: ISO 25964-2:2013: Information and Documentation – Thesauri and Interoperability with Other Vocabularies – Part 2: Interoperability with Other Vocabularies. 2013.

[Int13b]

International Organization for Standardization: ISO 3166-1:2013: Codes for the Representation of Names of Countries and Their Subdivisions – Part 1: Country Codes. 2013.

[Int13c]

International Organization for Standardization: ISO 3166-2:2013: Codes for the Representation of Names of Countries and Their Subdivisions – Part 2: Country Subdivision Code. 2013.

[Int13d]

International Organization for Standardization: ISO 3166-3:2013: Codes for the Representation of Names of Countries and Their Subdivisions – Part 3: Code for Formerly Used Names of Countries. 2013.

[Int15]

International Data Corporation: As Tablets Slow and PCs Face Ongoing Challenges, Smartphones Grab an Ever-Larger Share of the Smart Connected Device Market Through 2019, According to IDC. IDC Press Release. 2015. url: http://www.idc.com/getdoc.jsp?containerId=prUS25500515 (accessed on November 5, 2015).

[Int88]

International Organization for Standardization: ISO 8601:1988: Data Elements and Interchange Formats – Information Interchange – Representation of Dates and Times. 1988.

[Int98]

International Organization for Standardization: ISO 639-2:1998: Codes for the Representation of Names of Languages – Part 2: Alpha-3 Code. 1998.

[Ise+ 10]

R. Isele, J. Umbrich, C. Bizer, and A. Harth: “LDSpider: An Open-Source Crawling Framework for the Web of Linked Data”. In: Poster and Demo Proceedings of the 9th International Semantic Web Conference (ISWC 2010). Shanghai, China, 2010.

[Jac12]

P. Jaccard: “The Distribution of the Flora in the Alpine Zone”. In: New Phytologist 11 (1912), pp. 37–50.

[JJD11]

J. Jimenez-Rodriguez, G. Jimenez-Diaz, and B. Diaz-Agudo: “Matchmaking and Case-based Recommendations”. In: Proceedings of the Workshop on Case-based Reasoning for Computer Games. Greenwich, London, UK, 2011, pp. 53–62.

[JM09]

D. Jurafsky and J. H. Martin: Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2nd ed. Prentice Hall, 2009.

[JM76]

M. C. Jensen and W. H. Meckling: “Theory of the Firm: Managerial Behavior, Agency Costs and Ownership Structure”. In: Journal of Financial Economics 3 (4) (1976), pp. 305–360.

[Joa02]

T. Joachims: “Optimizing Search Engines Using Clickthrough Data”. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2002). Edmonton, Alberta, Canada, 2002, pp. 133–142.

[JW04]

I. Jacobs and N. Walsh: Architecture of the World Wide Web, Volume One. W3C Recommendation 15 December 2004. 2004. url: http://www.w3.org/TR/2004/REC-webarch-20041215/ (accessed on May 13, 2014).

[Kar+ 05]

T. Karlsson, C. Kuttainen, L. Pitt, and S. Spyropoulou: “Price as a Variable in Online Consumer Trade-offs”. In: Marketing Intelligence & Planning 23 (4) (2005), pp. 350–358.

[KD11]

A. Kiryakov and M. Damova: “Storing the Semantic Web: Repositories”. In: Handbook of Semantic Web Technologies. Ed. by J. Domingue, D. Fensel, and J. A. Hendler. Springer Berlin Heidelberg, 2011. Chap. 7, pp. 231–297.

[Kel14]

G. Kellogg: Microdata to RDF – Second Edition: Transformation from HTML+Microdata to RDF. W3C Interest Group Note 16 December 2014. 2014. url: https://www.w3.org/TR/2014/NOTE-microdata-rdf-20141216/ (accessed on February 19, 2016).

[Ker+ 00]

S. Kerridge, C. Halaris, G. Mentzas, and S. Kerridge: “Virtual Tendering and Bidding in the Construction Sector”. In: Proceedings of the First International Conference on Electronic Commerce and Web Technologies (EC-Web 2000). London, UK, 2000, pp. 379–388.

[KH96]

D. Kuokka and L. Harada: “Integrating Information via Matchmaking”. In: Journal of Intelligent Information Systems: Integrating Artificial Intelligence and Database Technologies (JIIS) 6 (2-3) (1996), pp. 261–279.

[Kha06]

R. Khare: “Microformats: The Next (Small) Thing on the Semantic Web?” In: IEEE Internet Computing 10 (1) (2006), pp. 68–75.

[Kle02]

M. Klein: DAML+OIL and RDF Schema Representation of UNSPSC. 2002. url: http://www.cs.vu.nl/~mcaklein/unspsc/ (accessed on February 19, 2016).

[Knu09]

H. Knublauch: Currency Conversion with the Units Ontology, SPARQLMotion and SPIN. 2009. url: http://composing-the-semantic-web.blogspot.it/2009/09/currency-conversion-with-units-ontology.html (accessed on September 30, 2014).

[Knu13]

H. Knublauch: Defining SPARQL Functions with SWP. 2013. url: http://composing-the-semantic-web.blogspot.de/2013/06/defining-sparql-functions-with-swp.html (accessed on November 17, 2014).

[Knu84]

D. E. Knuth: “Literate Programming”. In: The Computer Journal 27 (2) (1984), pp. 97–111.

[Koh+ 09]

R. Kohavi, R. Longbotham, D. Sommerfield, and R. M. Henne: “Controlled Experiments on the Web: Survey and Practical Guide”. In: Data Mining and Knowledge Discovery 18 (1) (2009), pp. 140–181.

[Kos07]

M. Koster: A Standard for Robot Exclusion. The Web Robots Pages. 2007. url: http://www.robotstxt.org/orig.html (accessed on February 18, 2016).

[Kos95]

M. Koster: Robots in the Web: Threat or Treat? The Web Robots Pages. 1995. url: http://www.robotstxt.org/threat-or-treat.html (accessed on February 18, 2016).

[KR03]

R. Kalakota and M. Robinson: “Electronic Commerce”. In: Encyclopedia of Computer Science. Vol. 4. Chichester, UK: John Wiley and Sons Ltd., 2003, pp. 628–634.

[KS85]

M. L. Katz and C. Shapiro: “Network Externalities, Competition, and Compatibility”. In: The American Economic Review 75 (3) (1985), pp. 424–440.

[KW97]

R. Kalakota and A. B. Whinston: Electronic Commerce: A Manager’s Guide. Addison-Wesley, 1997.

[KZL08]

J. Koren, Y. Zhang, and X. Liu: “Personalized Interactive Faceted Search”. In: Proceedings of the 17th International World Wide Web Conference (WWW 2008). Beijing, China, 2008, pp. 477–485.

[Law00]

S. Lawrence: “Context in Web Search”. In: IEEE Data Engineering Bulletin 23 (3) (2000), pp. 25–32.

[LC01]

B. Leuf and W. Cunningham: The Wiki Way: Quick Collaboration on the Web. Addison-Wesley, 2001.

[Leh92]

F. Lehmann: “Semantic Networks”. In: Computers & Mathematics with Applications 23 (2-5) (1992), pp. 1–50.

[Len02]

M. Lenzerini: “Data Integration: A Theoretical Perspective”. In: Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 2002). Madison, Wisconsin, USA, 2002, pp. 233–246.

[Lev66]

V. I. Levenshtein: “Binary Codes Capable of Correcting Deletions, Insertions, and Reversals”. In: Soviet Physics Doklady 10 (8) (1966), pp. 707–710.

[LG12]

M. Lanthaler and C. Gütl: “On Using JSON-LD to Create Evolvable RESTful Services”. In: Proceedings of the Third International Workshop on RESTful Design (WS-REST 2012). Lyon, France, 2012, pp. 25–32.

[LH03]

L. Li and I. Horrocks: “A Software Framework for Matchmaking Based on Semantic Web Technology”. In: Proceedings of the Twelfth International World Wide Web Conference (WWW 2003). Budapest, Hungary, 2003, pp. 331–339.

[Li11]

H. Li: “A Short Introduction to Learning to Rank”. In: IEICE Transactions on Information and Systems E94-D (10) (2011), pp. 1854–1862.

[Liu+ 09]

W. Liu, Y. Zeng, M. Maletz, and D. Brisson: “Product Lifecycle Management: A Review”. In: Proceedings of the ASME 2009 International Design Engineering Technical Conferences and Computers & Information in Engineering Conference (IDETC/CIE 2009). San Diego, California, USA, 2009, pp. 1213–1225.

[Liu+ 12]

D. Liu, R. G. Bias, M. Lease, and R. Kuipers: “Crowdsourcing for Usability Testing”. In: Proceedings of the 75th Annual Meeting of the Association for Information Science and Technology (ASIST 2012) 49 (1) (2012).

[LKH14]

B. Lika, K. Kolomvatsos, and S. Hadjiefthymiades: “Facing the Cold Start Problem in Recommender Systems”. In: Expert Systems with Applications 41 (4) (2014), pp. 2065–2073.

[LMS07]

M. Lenders, J. Müller, and G. Schuh: “PLM mit Modellcharakter”. In: CADplus Business+Engineering 46 (6) (2007), pp. 32–35.

[Los09]

D. Loshin: Master Data Management. Morgan Kaufmann Publishers, 2009.

[Lov68]

J. B. Lovins: “Development of a Stemming Algorithm”. In: Mechanical Translation and Computational Linguistics 11 (1/2) (1968), pp. 22–31.

[LR05]

S. A. Ludwig and S. Reyhani: “Introduction of Semantic Matchmaking to Grid Computing”. In: Journal of Parallel and Distributed Computing 65 (12) (2005), pp. 1533–1541.

[LS96]

D. D. Lewis and K. Sparck Jones: “Natural Language Processing for Information Retrieval”. In: Communications of the ACM 39 (1) (1996), pp. 92–101.

[LSR96]

S. Luke, L. Spector, and D. Rager: “Ontology-based Knowledge Discovery on the World-Wide Web”. In: Proceedings of the Workshop on Internet-based Information Systems at the 13th National Conference on Artificial Intelligence (AAAI-96). 1996, pp. 96–102.

[Luh57]

H. P. Luhn: “A Statistical Approach to Mechanized Encoding and Searching of Literary Information”. In: IBM Journal of Research and Development 1 (4) (1957), pp. 309–317.

[Mad03]

S. E. Madnick: “Oh, so That Is What You Meant! The Interplay of Data Quality and Data Semantics”. In: Proceedings of the 22nd International Conference on Conceptual Modeling (ER 2003). Chicago, IL, USA, 2003, pp. 3–13.

[Man+ 11]

J. Manweiler, S. Agarwal, M. Zhang, R. Roy Choudhury, and P. Bahl: “Switchboard: A Matchmaking System for Multiplayer Mobile Games”. In: Proceedings of the 9th International Conference on Mobile Systems Applications and Services (MobiSys 2011). Bethesda, MD, USA, 2011, pp. 71–84.

[Man98]

B. Manaris: “Natural Language Processing: A Human-Computer Interaction Perspective”. In: Advances in Computers. Ed. by M. V. Zelkowitz. Vol. 47. Academic Press, 1998, pp. 1–66.

[Mar04]

K. Marx: A Contribution to the Critique of Political Economy. Chicago, IL, USA: Charles H. Kerr & Company, 1904.

[Mar06]

G. Marchionini: “Exploratory Search: From Finding to Understanding”. In: Communications of the ACM 49 (4) (2006), pp. 41–46.

[Mas+ 11]

J. E. Masters, I. Polikoff, R. Hodgson, and D. Mekonnen: QUDT Units Vocabulary (without Dimensions) Version 1.1. Turtle File. 2011. url: http://qudt.org/1.1/vocab/OVG_units-qudt-(v1.1).ttl (accessed on December 18, 2015).

[Mat09]

M. Mattern: “Transforming BMEcat Catalogs into Semantic Web Annotation Data for Offerings”. Master thesis. University of Innsbruck, Innsbruck, Austria, 2009.

[MB09]

A. Miles and S. Bechhofer: SKOS Simple Knowledge Organization System Reference. W3C Recommendation 18 August 2009. 2009. url: http://www.w3.org/TR/2009/REC-skos-reference-20090818/ (accessed on May 22, 2014).

[MB12]

H. Mühleisen and C. Bizer: “Web Data Commons – Extracting Structured Data from Two Large Web Corpora”. In: Proceedings of the WWW2012 Workshop on Linked Data on the Web (LDOW 2012). Lyon, France, 2012.

[MC10]

P. Morville and J. Callender: Search Patterns. O’Reilly Media, 2010.

[McD05]

M. McDermott: “Knowledge Workers: You Can Gauge Their Effectiveness”. In: Leadership Excellence 22 (10) (2005), pp. 15–17.

[McK12]

W. McKinney: Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O’Reilly Media, 2012.

[MH09]

R. Möller and V. Haarslev: “Tableau-based Reasoning”. In: Handbook on Ontologies. Ed. by S. Staab and R. Studer. 2nd ed. Springer Berlin Heidelberg, 2009, pp. 509–528.

[MHG10]

M. McCandless, E. Hatcher, and O. Gospodnetic: Lucene in Action. 2nd ed. Manning Publications Co., 2010.

[MicND ]

Microsoft: Bing Rich Captions. Bing Ads. url: http://advertise.bingads.microsoft.com/en-us/bing-rich-captions (accessed on February 17, 2016).

[Mik08]

P. Mika: “Anatomy of a SearchMonkey”. In: Nodalities Magazine (2008), pp. 1–7.

[Mik11]

P. Mika: Microformats and RDFa Deployment across the Web. 2011. url: http://tripletalk.wordpress.com/2011/01/25/rdfa-deployment-across-the-web/ (accessed on May 16, 2014).

[Mil06]

J. S. Mill: Utilitarianism. University of Chicago Press, 1906.

[Mil95]

G. A. Miller: “WordNet: A Lexical Database for English”. In: Communications of the ACM 38 (11) (1995), pp. 39–41.

[MK60]

M. E. Maron and J. L. Kuhns: “On Relevance, Probabilistic Indexing and Information Retrieval”. In: Journal of the ACM 7 (3) (1960), pp. 216–244.

[MM04]

F. Manola and E. Miller: RDF Primer. W3C Recommendation 10 February 2004. 2004. url: http://www.w3.org/TR/2004/REC-rdf-primer-20040210/ (accessed on May 16, 2014).

[MMB14]

R. Meusel, P. Mika, and R. Blanco: “Focused Crawling for Structured Data”. In: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM 2014). Shanghai, China, 2014, pp. 1039–1048.

[MP12]

P. Mika and T. Potter: “Metadata Statistics for a Large Web Corpus”. In: Proceedings of the WWW2012 Workshop on Linked Data on the Web (LDOW 2012). Lyon, France, 2012.

[MP15]

R. Meusel and H. Paulheim: “Heuristics for Fixing Common Errors in Deployed schema.org Microdata”. In: Proceedings of the 12th European Semantic Web Conference (ESWC 2015). Portoroz, Slovenia, 2015, pp. 152–168.

[MPB14]

R. Meusel, P. Petrovski, and C. Bizer: “The WebDataCommons Microdata, RDFa and Microformat Dataset Series”. In: Proceedings of the 13th International Semantic Web Conference (ISWC 2014). Riva del Garda, Trentino, Italy, 2014, pp. 277–292.

[MPP12]

B. Motik, P. F. Patel-Schneider, and B. Parsia: OWL 2 Web Ontology Language: Structural Specification and Functional-Style Syntax (Second Edition). W3C Recommendation 11 December 2012. 2012. url: http://www.w3.org/TR/2012/REC-owl2-syntax-20121211/ (accessed on October 1, 2014).

[MR06]

P. Morville and L. Rosenfeld: Information Architecture for the World Wide Web. 3rd ed. O’Reilly Media, 2006.

[MRN14]

A. Moro, A. Raganato, and R. Navigli: “Entity Linking Meets Word Sense Disambiguation: A Unified Approach”. In: Transactions of the Association for Computational Linguistics (TACL) 2 (2014), pp. 231–244.

[MRR12]

K. Mauge, K. Rohanimanesh, and J.-D. Ruvini: “Structuring E-Commerce Inventory”. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012). Jeju, Republic of Korea, 2012, pp. 805–814.

[MRS09]

C. D. Manning, P. Raghavan, and H. Schütze: An Introduction to Information Retrieval. Cambridge University Press, 2009.

[MS99]

C. D. Manning and H. Schütze: Foundations of Statistical Natural Language Processing. The MIT Press, 1999.

[MSM93]

M. Marcus, B. Santorini, and M. A. Marcinkiewicz: Building a Large Annotated Corpus of English: The Penn Treebank. Technical Report MS-CIS-93-87. University of Pennsylvania, Department of Computer & Information Science, 1993.

[MV05]

H. D. Morris and D. Vesset: Managing Master Data for Business Performance Management: The Issues and Hyperion’s Solution. IDC White Paper. IDC, 2005.

[MYB87]

T. W. Malone, J. Yates, and R. I. Benjamin: “Electronic Markets and Electronic Hierarchies”. In: Communications of the ACM 30 (6) (1987), pp. 484–497.

[Mye98]

B. A. Myers: “A Brief History of Human-Computer Interaction Technology”. In: Interactions 5 (2) (1998), pp. 44–54.

[Nar+ 04]

B. A. Nardi, D. J. Schiano, M. Gumbrecht, and L. Swartz: “Why We Blog”. In: Communications of the ACM 47 (12) (2004), pp. 41–46.

[NAS10]

NASA: The NASA Quantity – Unit – Dimension – Type Ontology. Ontology Documentation. 2010. url: http://www.qudt.org/qudt/owl/1.0.0/qudt/index.html (accessed on January 22, 2015).

[Nat05]

National Information Standards Organization: Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies. Technical Report ANSI/NISO Z39.19–2005 (R2010). National Information Standards Organization, 2005.

[Nat08]

National Institute of Standards and Technology (NIST): The International System of Units (SI) – NIST Special Publication 330. Ed. by B. N. Taylor and A. Thompson. Gaithersburg, MD, USA: National Institute of Standards and Technology, 2008.

[Nav09]

R. Navigli: “Word Sense Disambiguation: A Survey”. In: ACM Computing Surveys 41 (2) (2009), 10:1–10:69.

[Nel65]

T. Nelson: “Complex Information Processing: A File Structure for the Complex, the Changing and the Indeterminate”. In: Proceedings of the 20th ACM National Conference. Cleveland, Ohio, USA, 1965, pp. 84–100.

[Nel70]

P. Nelson: “Information and Consumer Behavior”. In: Journal of Political Economy 78 (2) (1970), pp. 311–329.

[NO95]

C. F. Naiman and A. M. Ouksel: “A Classification of Semantic Conflicts in Heterogeneous Database Systems”. In: Journal of Organizational Computing 5 (2) (1995), pp. 167–193.

[NS08]

W. Nicholson and C. Snyder: Microeconomic Theory: Basic Principles and Extensions. 10th ed. Thomson South-Western, 2008.

[NS10]

P. Nowakowski and H. Stuckenschmidt: “Ontology-based Product Catalogues: An Example Implementation”. In: Proceedings of Multikonferenz Wirtschaftsinformatik (MKWI 2010). Göttingen, Germany, 2010, pp. 15–25.

[NW01]

M. Najork and J. L. Wiener: “Breadth-First Search Crawling Yields High-Quality Pages”. In: Proceedings of the Tenth International World Wide Web Conference (WWW 2001). Hong Kong, China, 2001, pp. 114–118.

[ODD06]

E. Oren, R. Delbru, and S. Decker: “Extending Faceted Navigation for RDF Data”. In: Proceedings of the 5th International Semantic Web Conference (ISWC 2006). Athens, GA, USA, 2006, pp. 559–572.

[OHS09]

A. Oulasvirta, J. P. Hukkinen, and B. Schwartz: “When More is Less: The Paradox of Choice in Search Engine Use”. In: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009). Boston, Massachusetts, USA, 2009, pp. 516–523.

[OJ93]

V. L. O’Day and R. Jeffries: “Orienteering in an Information Landscape: How Information Seekers Get from Here to There”. In: Proceedings of the INTERCHI Conference on Human Factors in Computing Systems (INTERCHI 1993). Amsterdam, The Netherlands, 1993, pp. 438–445.

[Ora11]

Oracle: Master Data Management. Oracle White Paper. Oracle, 2011.

[Ore+ 08]

E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello: “Sindice.com: A Document-oriented Lookup Index for Open Linked Data”. In: International Journal of Metadata, Semantics and Ontologies (IJMSO) 3 (1) (2008), pp. 37–52.

[ORe05]

T. O’Reilly: What is Web 2.0: Design Patterns and Business Models for the Next Generation of Software. 2005. url: http://oreilly.com/web2/archive/what-is-web-20.html (accessed on January 25, 2016).

[ORe07]

T. O’Reilly: “What Is Web 2.0: Design Patterns and Business Models for the Next Generation of Software”. In: International Journal of Digital Economics 65 (2007), pp. 17–37.

[ORG15]

L. Otero-Cerdeira, F. J. Rodriguez-Martinez, and A. Gomez-Rodriguez: “Ontology Matching: A Literature Review”. In: Expert Systems with Applications 42 (2) (2015), pp. 949–971.

[PAA08]

L. Polo Paredes, J. M. Alvarez Rodriguez, and E. R. Azcona: “Promoting Government Controlled Vocabularies for the Semantic Web: The EUROVOC Thesaurus and the CPV Product Classification System”. In: Proceedings of the Semantic Interoperability in the European Digital Library Workshop (SIEDL 2008). Tenerife, Spain, 2008, pp. 111–122.

[Pag+ 98]

L. Page, S. Brin, R. Motwani, and T. Winograd: The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66. Stanford InfoLab, 1998.

[Pao+ 02]

M. Paolucci, T. Kawamura, T. R. Payne, and K. P. Sycara: “Semantic Matching of Web Services Capabilities”. In: Proceedings of the First International Semantic Web Conference (ISWC 2002). Chia, Sardinia, Italy, 2002, pp. 333–347.

[Par72]

D. L. Parnas: “On the Criteria to Be Used in Decomposing Systems into Modules”. In: Communications of the ACM 15 (12) (1972), pp. 1053–1058.

[PB13]

E. Prud’hommeaux and C. Buil-Aranda: SPARQL 1.1 Federated Query. W3C Recommendation 21 March 2013. 2013. url: http://www.w3.org/TR/2013/REC-sparql11-federated-query-20130321/ (accessed on May 24, 2014).

[PBB14]

P. Petrovski, V. Bryl, and C. Bizer: “Integrating Product Data from Websites Offering Microdata Markup”. In: Proceedings of the 23rd World Wide Web Conference (WWW 2014), Companion Volume: Workshop on Data Extraction and Object Search (DEOS 2014). Seoul, Korea, 2014, pp. 1299–1304.

[PC14]

E. Prud’hommeaux and G. Carothers: RDF 1.1 Turtle: Terse RDF Triple Language. W3C Recommendation 25 February 2014. 2014. url: http://www.w3.org/TR/2014/REC-turtle-20140225/ (accessed on May 15, 2014).

[Per95]

J. Persky: “Retrospectives: The Ethology of Homo Economicus”. In: Journal of Economic Perspectives 9 (2) (1995), pp. 221–231.

[PFH06]

A. Polleres, C. Feier, and A. Harth: “Rules with Contextually Scoped Negation”. In: Proceedings of the 3rd European Semantic Web Conference (ESWC 2006). Budva, Montenegro, 2006, pp. 332–347.

[PG07]

F. Pérez and B. E. Granger: “IPython: A System for Interactive Scientific Computing”. In: Computing in Science and Engineering 9 (3) (2007), pp. 21–29.

[Pis01]

C. Pissarides: “Search, Economics of”. In: International Encyclopedia of the Social & Behavioral Sciences. Ed. by N. J. Smelser and P. B. Baltes. Oxford, UK: Elsevier, 2001, pp. 13760–13768.

[Pit+ 02]

J. Pitkow, H. Schütze, T. Cass, R. Cooley, D. Turnbull, A. Edmonds, E. Adar, and T. Breuel: “Personalized Search”. In: Communications of the ACM 45 (9) (2002), pp. 50–55.

[PM10]

A. Passant and P. N. Mendes: “SparqlPuSH: Proactive Notification of Data Updates in RDF Stores Using PubSubHubbub”. In: Proceedings of the Sixth Workshop on Scripting and Development for the Semantic Web (SFSW 2010). Heraklion, Greece, 2010.

[Poo+ 11]

F. Poon, T. Chin, M. Bentrovato, O. Shafiq, A. Chen, F. Triant, J. Rokne, and R. Alhajj: “Semantically Enhanced Matchmaking of Consumers and Providers: A Canadian Real Estate Case Study”. In: Proceedings of the 13th International Conference on Information Integration and Web-based Applications and Services (iiWAS 2011). Ho Chi Minh City, Vietnam, 2011, pp. 198–205.

[Por80]

M. F. Porter: “An Algorithm for Suffix Stripping”. In: Program 14 (3) (1980), pp. 130–137.

[Pra05]

M. J. Pratt: “ISO 10303, the STEP Standard for Product Data Exchange, and Its PLM Capabilities”. In: International Journal of Product Lifecycle Management 1 (1) (2005), pp. 86–94.

[PRS04]

X. Pan, B. T. Ratchford, and V. Shankar: “Price Dispersion on the Internet: A Review and Directions for Future Research”. In: Journal of Interactive Marketing 18 (4) (2004), pp. 116–135.

[PRW08]

A. Picot, R. Reichwald, and R. T. Wigand: Information, Organization and Management. Springer Berlin Heidelberg, 2008.

[PS08]

E. Prud’hommeaux and A. Seaborne: SPARQL Query Language for RDF. W3C Recommendation 15 January 2008. 2008. url: http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/ (accessed on May 23, 2014).

[PW97]

B. Peat and D. Webber: Introducing XML/EDI... “the e-Business framework”. 1997. url: http://web.archive.org/web/20011005233701/http://www.geocities.com/WallStreet/Floor/5815/start.htm (accessed on October 29, 2015).

[Qui86]

J. R. Quinlan: “Induction of Decision Trees”. In: Machine Learning 1 (1) (1986), pp. 81–106.

[Qui93]

J. R. Quinlan: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.

[Rad+ 13]

A. Radinger, B. Rodriguez-Castro, A. Stolz, and M. Hepp: “BauDataWeb: The Austrian Building and Construction Materials Market as Linked Data”. In: Proceedings of the 9th International Conference on Semantic Systems (I-SEMANTICS 2013). Graz, Austria, 2013, pp. 25–32.

[Rag+ 08]

A. Ragone, U. Straccia, F. Bobillo, T. Di Noia, and E. Di Sciascio: “Fuzzy Bilateral Matchmaking in E-Marketplaces”. In: Proceedings of the 12th International Conference on Knowledge-based Intelligent Information and Engineering Systems (KES 2008). Zagreb, Croatia, 2008, pp. 293–301.

[RB01]

E. Rahm and P. A. Bernstein: “A Survey of Approaches to Automatic Schema Matching”. In: The VLDB Journal 10 (4) (2001), pp. 334–350.

[RD00]

E. Rahm and H. H. Do: “Data Cleaning: Problems and Current Approaches”. In: IEEE Data Engineering Bulletin 23 (4) (2000), pp. 3–13.

[RLS98]

R. Raman, M. Livny, and M. Solomon: “Matchmaking: Distributed Resource Management for High Throughput Computing”. In: Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing (HPDC 1998). Chicago, IL, USA, 1998, pp. 140–146.

[Rob77]

S. E. Robertson: “The Probability Ranking Principle in IR”. In: Journal of Documentation 33 (4) (1977), pp. 294–304.

[Roc71]

J. J. Rocchio: “Relevance Feedback in Information Retrieval”. In: The SMART Retrieval System – Experiments in Automatic Document Processing. Ed. by G. Salton. Englewood Cliffs, New Jersey: Prentice Hall, 1971, pp. 313–323.

[Ros73]

S. A. Ross: “The Economic Theory of Agency: The Principal’s Problem”. In: American Economic Review 63 (2) (1973), pp. 134–139.

[RRS11]

F. Ricci, L. Rokach, and B. Shapira: “Introduction to Recommender Systems Handbook”. In: Recommender Systems Handbook. Ed. by F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor. Springer US, 2011. Chap. 1, pp. 1–35.

[RS76]

S. E. Robertson and K. Sparck Jones: “Relevance Weighting of Search Terms”. In: Journal of the American Society for Information Science 27 (3) (1976), pp. 129–146.

[RSA04]

C. Rocha, D. Schwabe, and M. P. Aragao: “A Hybrid Approach for Searching in the Semantic Web”. In: Proceedings of the 13th International World Wide Web Conference (WWW 2004). New York, NY, USA, 2004, pp. 374–383.

[RvAT13]

H. Rijgersberg, M. van Assem, and J. Top: “Ontology of Units of Measure and Related Concepts”. In: Semantic Web – Linked Data for Science and Education 4 (1) (2013), pp. 3–13.

[RZ09]

S. E. Robertson and H. Zaragoza: “The Probabilistic Relevance Framework: BM25 and Beyond”. In: Foundations and Trends in Information Retrieval 3 (4) (2009), pp. 333–389.

[Sac00]

G. M. Sacco: “Dynamic Taxonomies: A Model for Large Information Bases”. In: IEEE Transactions on Knowledge and Data Engineering 12 (3) (2000), pp. 468–479.

[Sac05]

G. M. Sacco: “The Intelligent E-Store: Easy Interactive Product Selection and Comparison”. In: Proceedings of the Seventh IEEE International Conference on E-Commerce Technology (CEC 2005). Munich, Germany, 2005, pp. 240–248.

[SB88]

G. Salton and C. Buckley: “Term-weighting Approaches in Automatic Text Retrieval”. In: Information Processing and Management: An International Journal 24 (5) (1988), pp. 513–523.

[SB90]

G. Salton and C. Buckley: “Improving Retrieval Performance by Relevance Feedback”. In: Journal of the American Society for Information Science 41 (4) (1990), pp. 288–297.

[SBC97]

B. Shneiderman, D. Byrd, and W. B. Croft: “Clarifying Search: A User-Interface Framework for Text Searches”. In: D-Lib Magazine 3 (1) (1997).

[SBF98]

R. Studer, R. Benjamins, and D. Fensel: “Knowledge Engineering: Principles and Methods”. In: Data & Knowledge Engineering 25 (1-2) (1998), pp. 161–197.

[SBM96]

A. Singhal, C. Buckley, and M. Mitra: “Pivoted Document Length Normalization”. In: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1996). Zurich, Switzerland, 1996, pp. 21–29.

[SC08]

L. Sauermann and R. Cyganiak: Cool URIs for the Semantic Web. W3C Interest Group Note 03 December 2008. 2008. url: http://www.w3.org/TR/2008/NOTE-cooluris-20081203/ (accessed on May 13, 2014).

[SC10]

A. Singhal and M. Cutts: Using Site Speed in Web Search Ranking. Google Webmaster Central Blog. 2010. url: https://googlewebmastercentral.blogspot.de/2010/04/using-site-speed-in-web-search-ranking.html (accessed on February 18, 2016).

[SC12]

M. Sanderson and W. B. Croft: “The History of Information Retrieval Research”. In: Proceedings of the IEEE 100 (Centennial Issue) (2012), pp. 1444–1451.

[Sch+ 04]

D. J. Schiano, B. A. Nardi, M. Gumbrecht, and L. Swartz: “Blogging by the Rest of Us”. In: Extended Abstracts of the 2004 Conference on Human Factors in Computing Systems (CHI EA 2004). Vienna, Austria, 2004, pp. 1143–1146.

[Sch+ 14]

M. Schmachtenberg, C. Bizer, A. Jentzsch, and R. Cyganiak: Linking Open Data Cloud Diagram 2014. 2014. url: http://lod-cloud.net/ (accessed on February 26, 2015).

[SchND ]

Schema.org: Welcome to Schema.org. url: http://schema.org/ (accessed on October 19, 2015).

[Sch04]

B. Schwartz: The Paradox of Choice: Why More Is Less. Harper Perennial, 2004.

[SE13]

P. Shvaiko and J. Euzenat: “Ontology Matching: State of the Art and Future Challenges”. In: IEEE Transactions on Knowledge and Data Engineering 25 (1) (2013), pp. 158–176.

[Sen00]

J. A. Senn: “The Emergence of M-Commerce”. In: Computer 33 (12) (2000), pp. 148–150.

[SFW83]

G. Salton, E. A. Fox, and H. Wu: “Extended Boolean Information Retrieval”. In: Communications of the ACM 26 (12) (1983), pp. 1022–1036.

[SGH12]

A. Stolz, M. Ge, and M. Hepp: “GR4PHP: A Programming API for Consuming E-Commerce Data from the Semantic Web”. In: Proceedings of the First Workshop on Programming the Semantic Web (PSW 2012). Boston, MA, USA, 2012.

[SH01]

M. Stonebraker and J. M. Hellerstein: “Content Integration for E-Business”. In: Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (SIGMOD 2001). Santa Barbara, California, USA, 2001, pp. 552–560.

[SH13a]

A. Stolz and M. Hepp: “Currency Conversion the Linked Data Way”. In: Proceedings of the First Workshop on Services and Applications over Linked APIs and Data (SALAD2013). Montpellier, France, 2013, pp. 44–55.

[SH13b]

A. Stolz and M. Hepp: “From RDF to RSS and Atom: Content Syndication with Linked Data”. In: Proceedings of the 24th ACM Conference on Hypertext and Social Media (Hypertext 2013). Paris, France, 2013, pp. 236–241.

[SH14]

A. Stolz and M. Hepp: GR2RSS: Publishing Linked Open Commerce Data as RSS and Atom Feeds. Technical Report TR–2014–1. E-Business and Web Science Research Group, Universität der Bundeswehr München, 2014.

[SH15a]

A. Stolz and M. Hepp: “Adaptive Faceted Search for Product Comparison on the Web of Data”. In: Proceedings of the 15th International Conference on Web Engineering (ICWE 2015). Rotterdam, The Netherlands, 2015, pp. 420–429.

[SH15b]

A. Stolz and M. Hepp: “An Adaptive Faceted Search Interface for Structured Product Offers on the Web”. In: Proceedings of the 4th International Workshop on Intelligent Exploration of Semantic Data (IESD 2015). Bethlehem, PA, USA, 2015.

[SH15c]

A. Stolz and M. Hepp: “Towards Crawling the Web for Structured Data: Pitfalls of Common Crawl for E-Commerce”. In: Proceedings of the 6th International Workshop on Consuming Linked Data (COLD 2015). Bethlehem, PA, USA, 2015.

[SHB06]

N. Shadbolt, W. Hall, and T. Berners-Lee: “The Semantic Web Revisited”. In: IEEE Intelligent Systems 21 (3) (2006), pp. 96–101.

[She99]

A. P. Sheth: “Changing Focus on Interoperability in Information Systems: From System, Syntax, Structure to Semantics”. In: Interoperating Geographic Information Systems. Ed. by M. F. Goodchild, M. J. Egenhofer, R. Fegeas, and C. A. Kottman. Vol. 495. The Springer International Series in Engineering and Computer Science. Springer US, 1999, pp. 5–29.

[SHH07]

M. Stollberg, M. Hepp, and J. Hoffmann: “A Caching Mechanism for Semantic Web Service Discovery”. In: Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference (ISWC 2007 + ASWC 2007). Busan, Korea, 2007, pp. 480–493.

[SI05]

A. Saaksvuori and A. Immonen: Product Lifecycle Management. 2nd ed. Springer Berlin Heidelberg, 2005.

[Sil+ 11]

R. Silvola, O. Jaaskelainen, H. Kropsu-Vehkapera, and H. Haapasalo: “Managing One Master Data – Challenges and Preconditions”. In: Industrial Management & Data Systems 111 (1) (2011), pp. 146–162.

[Sim59]

H. A. Simon: “Theories of Decision-Making in Economics and Behavioral Science”. In: The American Economic Review 49 (3) (1959), pp. 253–283.

[Sim97]

H. A. Simon: Administrative Behavior: A Study of Decision-making Processes in Administrative Organisations. 4th ed. New York, NY, USA: The Free Press, 1997.

[Sin01]

A. Singhal: “Modern Information Retrieval: A Brief Overview”. In: IEEE Data Engineering Bulletin 24 (4) (2001), pp. 35–43.

[Sin12]

A. Singhal: Introducing the Knowledge Graph: Things, Not Strings. Google Official Blog. 2012. url: http://googleblog.blogspot.com/2012/05/introducing-knowledge-graph-things-not.html (accessed on September 24, 2015).

[Sit08]

Sitemaps.org: Sitemaps XML format. 2008. url: http://www.sitemaps.org/protocol.html (accessed on October 19, 2015).

[SK08]

H. Stuckenschmidt and M. Kolb: “Partial Matchmaking for Complex Product and Service Descriptions”. In: Proceedings of Multikonferenz Wirtschaftsinformatik (MKWI 2008). Munich, Germany, 2008.

[SKL14]

M. Sporny, G. Kellogg, and M. Lanthaler: JSON-LD 1.0: A JSON-based Serialization for Linked Data. W3C Recommendation 16 January 2014. 2014. url: http://www.w3.org/TR/2014/REC-json-ld-20140116/ (accessed on May 16, 2014).

[SKR99]

J. B. Schafer, J. Konstan, and J. Riedl: “Recommender Systems in E-Commerce”. In: Proceedings of the 1st ACM Conference on Electronic Commerce (EC 1999). Denver, Colorado, USA, 1999, pp. 158–166.

[SKW07]

F. M. Suchanek, G. Kasneci, and G. Weikum: “YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia”. In: Proceedings of the 16th International World Wide Web Conference (WWW 2007). Banff, Alberta, Canada, 2007, pp. 697–706.

[SL05]

J. Sauro and J. R. Lewis: “Estimating Completion Rates from Small Samples Using Binomial Confidence Intervals: Comparisons and Recommendations”. In: Proceedings of the Human Factors and Ergonomics Society 49th Annual Meeting (HFES 2005) 49 (24) (2005), pp. 2100–2104.

[SL98]

B. F. Schmid and M. A. Lindemann: “Elements of a Reference Model for Electronic Markets”. In: Proceedings of the 31st Hawaii International Conference on System Sciences (HICSS 1998). Kohala Coast, HI, 1998, pp. 193–201.

[SLK04]

V. Schmitz, J. Leukel, and O. Kelkar: “XML-based Data Exchange of Product Model Data in E-Procurement and E-Sales: The Case of BMEcat 2.0”. In: Proceedings of the International Conference on Economic, Technical and Organisational Aspects of Product Configuration Systems (PETO 2004). Copenhagen, Denmark, 2004, pp. 97–107.

[SLK05a]

V. Schmitz, J. Leukel, and O. Kelkar: Specification BMEcat 2005. Frankfurt am Main, Germany: Bundesverband Materialwirtschaft, Einkauf und Logistik e.V., 2005.

[SLK05b]

V. Schmitz, J. Leukel, and O. Kelkar: Specification BMEcat 2005. Frankfurt am Main, Germany: Bundesverband Materialwirtschaft, Einkauf und Logistik e.V., 2005.

[SLÖ08]

J. W. Schemm, C. Legner, and H. Österle: “Global Data Synchronization – Lösungsansatz für das überbetriebliche Produktstammdatenmanagement zwischen Konsumgüterindustrie und Handel?” In: Wertschöpfungsnetzwerke. Ed. by J. Becker, R. Knackstedt, and D. Pfeiffer. Physica-Verlag Heidelberg, 2008, pp. 173–192.

[SM12]

T. Steiner and S. Mirea: “SEKI@home, or Crowdsourcing an Open Knowledge Graph”. In: Proceedings of the First International Workshop on Knowledge Extraction and Consolidation from Social Media (KECSM 2012). Boston, USA, 2012.

[SM13]

G. Schadow and C. J. McDonald: The Unified Code for Units of Measure. Version: 1.9. 2013. url: http://unitsofmeasure.org/ucum.html (accessed on October 30, 2014).

[SN10]

S. Sakr and G. Al-Naymat: “Relational Processing of RDF Queries: A Survey”. In: SIGMOD Record 38 (4) (2010), pp. 23–28.

[Sor+ 10]

S. Sorrentino, S. Bergamaschi, M. Gawinecki, and L. Po: “Schema Label Normalization for Improving Schema Matching”. In: Data & Knowledge Engineering 69 (12) (2010), pp. 1254–1273.

[Sow13]

J. F. Sowa: Semantic Networks. 2013. url: http://www.jfsowa.com/pubs/semnet.htm (accessed on May 16, 2014).

[Spa72]

K. Sparck Jones: “A Statistical Interpretation of Term Specificity and Its Application in Retrieval”. In: Journal of Documentation 28 (1) (1972), pp. 11– 21.

[Spr90]

K. Spremann: “Asymmetrische Information”. In: Zeitschrift für Betriebswirtschaft 60 (5/6) (1990), pp. 561–586.

[SR14]

G. Schreiber and Y. Raimond: RDF 1.1 Primer. W3C Working Group Note 25 February 2014. 2014. url: http://www.w3.org/TR/2014/NOTE-rdf11-primer-20140225/ (accessed on May 16, 2014).

[SRH13a]

A. Stolz, B. Rodriguez-Castro, and M. Hepp: RDF Translator: A RESTful Multi-Format Syntax Converter for the Semantic Web. Technical Report TR–2013–1. E-Business and Web Science Research Group, Universität der Bundeswehr München, 2013.

[SRH13b]

A. Stolz, B. Rodriguez-Castro, and M. Hepp: “Using BMEcat Catalogs as a Lever for Product Master Data on the Semantic Web”. In: Proceedings of the 10th Extended Semantic Web Conference (ESWC 2013). Montpellier, France, 2013, pp. 623–638.

[SS09]

U. Schonfeld and N. Shivakumar: “Sitemaps: Above and Beyond the Crawl of Duty”. In: Proceedings of the 18th International World Wide Web Conference (WWW 2009). Madrid, Spain, 2009, pp. 991–1000.

[ST09]

G. M. Sacco and Y. Tzitzikas: Dynamic Taxonomies and Faceted Search: Theory, Practice, and Experience. Springer Berlin Heidelberg, 2009.

[Sta08]

Statistisches Bundesamt: Klassifikation der Wirtschaftszweige: Mit Erläuterungen. Wiesbaden, Germany: Statistisches Bundesamt, 2008.

[Sta11]

J. Stark: Product Lifecycle Management: 21st Century Paradigm for Product Realisation. 2nd ed. Springer-Verlag London, 2011.

[Ste+ 09]

M. Stefaner, S. Ferré, S. Perugini, J. Koren, and Y. Zhang: “User Interface Design”. In: Dynamic Taxonomies and Faceted Search. Ed. by G. M. Sacco and Y. Tzitzikas. Springer Berlin Heidelberg, 2009. Chap. 4, pp. 75–112.

[Sti61]

G. J. Stigler: “The Economics of Information”. In: The Journal of Political Economy 69 (3) (1961), pp. 213–255.

[Sto+ 07]

M. Stollberg, U. Keller, H. Lausen, and S. Heymans: “Two-Phase Web Service Discovery Based on Rich Functional Descriptions”. In: Proceedings of the 4th European Semantic Web Conference (ESWC 2007). Innsbruck, Austria, 2007, pp. 99–113.

[Sto+ 14]

A. Stolz, B. Rodriguez-Castro, A. Radinger, and M. Hepp: “PCS2OWL: A Generic Approach for Deriving Web Ontologies from Product Classification Systems”. In: Proceedings of the 11th Extended Semantic Web Conference (ESWC 2014). Anissaras/Hersonissou, Crete, Greece, 2014, pp. 644–658.

[Su+ 14]

A.-J. Su, Y. C. Hu, A. Kuzmanovic, and C.-K. Koh: “How to Improve Your Search Engine Ranking: Myths and Reality”. In: ACM Transactions on the Web (TWEB) 8 (2) (2014), 8:1–8:25.

[SV99]

C. Shapiro and H. R. Varian: Information Rules: A Strategic Guide to the Network Economy. Harvard Business School Press, 1999.

[SW65]

S. S. Shapiro and M. B. Wilk: “An Analysis of Variance Test for Normality (Complete Samples)”. In: Biometrika 52 (3-4) (1965), pp. 591–611.

[SWY75]

G. Salton, A. Wong, and C. Yang: “A Vector Space Model for Automatic Indexing”. In: Communications of the ACM 18 (11) (1975), pp. 613–620.

[Syc+ 02]

K. Sycara, S. Widoff, M. Klusch, and J. Lu: “LARKS: Dynamic Matchmaking Among Heterogeneous Software Agents in Cyberspace”. In: Autonomous Agents and Multi-Agent Systems 5 (2) (2002), pp. 173–203.

[Syc+ 99]

K. Sycara, J. Lu, M. Klusch, and S. Widoff: “Matchmaking among Heterogeneous Agents on the Internet”. In: Proceedings of the AAAI Spring Symposium on Intelligent Agents in Cyberspace. Stanford, USA, 1999, pp. 152–164.

[TB96]

C. P. Thorpe and J. C. L. Bailey: Commercial Contracts: A Practical Guide to Deals, Contracts, Agreements and Promises. Woodhead Publishing, 1996.

[TBH06]

R. Tiwari, S. Buse, and C. Herstatt: “From Electronic to Mobile Commerce: Opportunities through Technology Convergence for Business Services”. In: Asia Pacific Tech Monitor 23 (5) (2006), pp. 38–45.

[TBL10]

E. Turban, N. Bolloju, and T.-P. Liang: “Social Commerce: An E-Commerce Perspective”. In: Proceedings of the 12th International Conference on Electronic Commerce: Roadmap for the Future of Electronic Business (ICEC 2010). Honolulu, Hawaii, 2010, pp. 33–42.

[TH06]

P. Thomas and D. Hawking: “Evaluation by Comparing Result Sets in Context”. In: Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM 2006). Arlington, Virginia, USA, 2006, pp. 94–101.

[TH14]

L. Török and M. Hepp: “Towards Portable Shopping Histories: Using GoodRelations to Expose Ownership Information to E-Commerce Sites”. In: Proceedings of the 11th Extended Semantic Web Conference (ESWC 2014). Anissaras/Hersonissou, Crete, Greece, 2014, pp. 691–705.

[TS08]

G. Tindsley and P. Stephenson: “E-Tendering Process within Construction: A UK Perspective”. In: Tsinghua Science and Technology 13 (S1) (2008), pp. 273–278.

[TSK05]

P.-N. Tan, M. Steinbach, and V. Kumar: “Classification: Basic Concepts, Decision Trees, and Model Evaluation”. In: Introduction to Data Mining. 1st ed. Addison-Wesley, 2005. Chap. 4, pp. 145–205.

[Tun09]

D. Tunkelang: Faceted Search. Synthesis Lectures on Information Concepts, Retrieval, and Services. Morgan & Claypool, 2009.

[Tva11]

M. Tvarožek: “Exploratory Search in the Adaptive Social Semantic Web”. In: Information Sciences and Technologies Bulletin of the ACM Slovakia 3 (1) (2011), pp. 42–51.

[UN 12]

UN Economic Commission for Europe: UN/EDIFACT – Price/Sales Catalogue Message. United Nations Directories for Electronic Data Interchange for Administration, Commerce and Transport. 2012. url: http://www.unece.org/trade/untdid/d12b/trmd/pricat_c.htm (accessed on May 16, 2014).

[UN 14]

UN Economic Commission for Europe: UN/EDIFACT – Product Data Message. United Nations Directories for Electronic Data Interchange for Administration, Commerce and Transport. 2014. url: http://www.unece.org/fileadmin/DAM/trade/untdid/d13b/trmd/prodat_c.htm (accessed on May 16, 2014).

[UniND ]

United Nations Development Programme: The United Nations Standard Products and Services Code (UNSPSC). url: http://www.unspsc.org/ (accessed on May 16, 2014).

[Uni06]

United Nations Economic Commission for Europe: Recommendation No. 20: Codes for Units of Measure Used in International Trade. Revision 4. UNECE, 2006.

[Uni09a]

United Nations Economic Commission for Europe: Codes for Units of Measure Used in International Trade Revision 6 – Annex II & Annex III. UN/ECE CEFACT Trade Facilitation Recommendation No.20. UNECE, 2009.

[Uni09b]

United Nations Economic Commission for Europe: Recommendation No. 20: Codes for Units of Measure Used in International Trade. Revision 6. UNECE, 2009.

[Uni14]

United States Census Bureau: Quarterly Retail E-Commerce Sales: 4th Quarter 2014. Washington, DC, USA, 2014. url: http://www2.census.gov/retail/releases/historical/ecomm/14q4.pdf (accessed on February 26, 2015).

[Utg89]

P. E. Utgoff: “Incremental Induction of Decision Trees”. In: Machine Learning 4 (2) (1989), pp. 161–186.

[Vei+ 01]

D. Veit, J. P. Müller, M. Schneider, and B. Fiehn: “Matchmaking for Autonomous Agents in Electronic Marketplaces”. In: Proceedings of the Fifth International Conference on Autonomous Agents (AGENTS 2001). Montreal, Canada, 2001, pp. 65–66.

[Vei03]

D. Veit: Matchmaking in Electronic Markets: An Agent-based Approach towards Matchmaking in Electronic Negotiations. Springer Berlin Heidelberg, 2003.

[Ver+ 14]

R. Verborgh, O. Hartig, B. De Meester, G. Haesendonk, L. De Vocht, M. Vander Sande, R. Cyganiak, P. Colpaert, E. Mannens, and R. Van de Walle: “Querying Datasets on the Web with High Availability”. In: Proceedings of the 13th International Semantic Web Conference (ISWC 2014). Riva del Garda, Trentino, Italy, 2014, pp. 180–196.

[VFK13]

D. Vandic, F. Frasincar, and U. Kaymak: “Facet Selection Algorithms for Web Product Search”. In: Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM 2013). San Francisco, CA, USA, 2013, pp. 2327–2332.

[Vil11]

B. Villazon-Terrazas: “A Method for Reusing and Re-engineering Nonontological Resources for Building Ontologies”. PhD thesis. Universidad Politécnica de Madrid, 2011.

[VK14]

D. Vrandečić and M. Krötzsch: “Wikidata: A Free Collaborative Knowledgebase”. In: Communications of the ACM 57 (10) (2014), pp. 78–85.

[Vol+ 09]

J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov: “Silk – A Link Discovery Framework for the Web of Data”. In: Proceedings of the WWW2009 Workshop on Linked Data on the Web (LDOW 2009). Madrid, Spain, 2009.

[Vor94]

E. M. Voorhees: “Query Expansion Using Lexical-Semantic Relations”. In: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1994). Dublin, Ireland, 1994, pp. 61–69.

[VvDF12]

D. Vandic, J.-W. van Dam, and F. Frasincar: “Faceted Product Search Powered by the Semantic Web”. In: Decision Support Systems 53 (3) (2012), pp. 425–437.

[VVK00]

U. Varshney, R. J. Vetter, and R. Kalakota: “Mobile Commerce: A New Frontier”. In: Computer 33 (10) (2000), pp. 32–38.

[VWM02]

D. Veit, C. Weinhardt, and J. P. Müller: “Multidimensional Matchmaking for Electronic Markets”. In: Applied Artificial Intelligence 16 (9-10) (2002), pp. 853–869.

[Wan+ 09]

X. Wang, X. Sun, F. Cao, L. Ma, N. Kanellos, K. Zhang, Y. Pan, and Y. Yu: “SMDM: Enhancing Enterprise-wide Master Data Management Using Semantic Web Technologies”. In: Proceedings of the VLDB Endowment 2 (2) (2009), pp. 1594–1597.

[WebND a]

Web Data Commons: Web Data Commons Extraction Report – August 2012 Corpus. url: http://www.webdatacommons.org/structureddata/2012-08/stats/stats.html (accessed on July 22, 2014).

[WebND b]

Web Data Commons: Web Data Commons Extraction Report – February 2012 Corpus. url: http://www.webdatacommons.org/structureddata/2012-02/stats/stats.html (accessed on July 22, 2014).

[WebND c]

Web Data Commons: Web Data Commons – RDFa, Microdata, and Microformats Data Sets – December 2014. url: http://www.webdatacommons.org/structureddata/2014-12/stats/stats.html (accessed on June 1, 2015).

[WebND d]

Web Data Commons: Web Data Commons – RDFa, Microdata, and Microformats Data Sets – November 2013. url: http://www.webdatacommons.org/structureddata/2013-11/stats/stats.html (accessed on July 22, 2014).

[Web11]

A. Weber: “Marktanalyse von Software für Produkt-Informations-Management (PIM)”. Bachelor thesis. Universität der Bundeswehr München, Neubiberg, Germany, 2011.

[Wed+ 95]

C. Wedekind, T. Seebeck, F. Bettens, and A. J. Paepke: “MHC-dependent Mate Preferences in Humans”. In: Biological Sciences 260 (1359) (1995), pp. 245–249.

[Wei+ 11]

D. Wei, T. Wang, J. Wang, and A. Bernstein: “SAWSDL-iMatcher: A Customizable and Effective Semantic Web Service Matchmaker”. In: Web Semantics: Science, Services and Agents on the World Wide Web 9 (4) (2011), pp. 402–417.

[Wei+ 13]

B. Wei, J. Liu, Q. Zheng, W. Zhang, X. Fu, and B. Feng: “A Survey of Faceted Search”. In: Journal of Web Engineering 12 (1) (2013), pp. 41–64.

[Wei12]

D. M. Weijers: “Hedonism and Happiness in Theory and Practice”. PhD thesis. Victoria University of Wellington, 2012.

[Whi+ 06a]

A. White, D. Newman, D. Logan, and J. Radcliffe: Mastering Master Data Management. Research Report. Stamford: Gartner, 2006.

[Whi+ 06b]

R. W. White, B. Kules, S. M. Drucker, and M. Schraefel: “Supporting Exploratory Search”. In: Communications of the ACM 49 (4) (2006), pp. 37–39.

[Whi07]

A. White: Magic Quadrant for Product Information Management, 2Q07. Research Report. Stamford: Gartner, 2007.

[Wil45]

F. Wilcoxon: “Individual Comparisons by Ranking Methods”. In: Biometrics Bulletin 1 (6) (1945), pp. 80–83.

[Wil81]

O. E. Williamson: “The Economics of Organization: The Transaction Cost Approach”. In: The American Journal of Sociology 87 (3) (1981), pp. 548–577.

[Wil83]

O. E. Williamson: “Credible Commitments: Using Hostages to Support Exchange”. In: American Economic Review 73 (4) (1983), pp. 519–538.

[Woo14]

D. Wood: What’s New in RDF 1.1. W3C Working Group Note 25 February 2014. 2014. url: http://www.w3.org/TR/2014/NOTE-rdf11-new-20140225/ (accessed on May 8, 2014).

[WS96]

R. Y. Wang and D. M. Strong: “Beyond Accuracy: What Data Quality Means to Data Consumers”. In: Journal of Management Information Systems 12 (4) (1996), pp. 5–34.

[WZ12]

C. Wang and P. Zhang: “The Evolution of Social Commerce: The People, Management, Technology, and Information Dimensions”. In: Communications of the Association for Information Systems 31 (5) (2012), pp. 105–127.

[Yee+ 03]

K.-P. Yee, K. Swearingen, K. Li, and M. A. Hearst: “Faceted Metadata for Image Search and Browsing”. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 2003). Fort Lauderdale, Florida, USA, 2003, pp. 401–408.

[Zar+ 13]

M. Zaremba, S. Bhiri, T. Vitvar, and M. Hauswirth: “Matchmaking of IaaS Cloud Computing Offers Leveraging Linked Data”. In: Proceedings of the 28th Annual ACM Symposium on Applied Computing (SAC 2013). Coimbra, Portugal, 2013, pp. 383–388.

[ZD04]

P. Ziegler and K. R. Dittrich: “Three Decades of Data Integration - All Problems Solved?” In: Proceedings of the 18th IFIP World Computer Congress (WCC 2004). Toulouse, France, 2004, pp. 3–12.

[ZL03]

Y. Zhao and J. Lövdahl: “A Reuse-based Method of Developing the Ontology for E-Procurement”. In: Proceedings of the Nordic Conference on Web Services (NCWS). Växjö, Sweden, 2003.

[ZM06]

J. Zobel and A. Moffat: “Inverted Files for Text Search Engines”. In: ACM Computing Surveys 38 (2) (2006).
