Here's a list of some large datasets, and the kind of software that can handle these sets on a laptop.
I'll be back with more; there is some amazing stuff out here now.
Resources
1. Open Data
DBpedia - Querying Wikipedia Like a Database
DBpedia is a community effort to extract structured information from Wikipedia editions in over 90 languages and to make the resulting knowledge base available on the Web. The DBpedia knowledge base currently describes more than 3.5 million things, out of which 1.6 million are classified in a consistent ontology. It is one of the most comprehensive multi-lingual knowledge bases that currently exist and has developed into an interlinking hub for the Web of Data. The knowledge base is widely used by research projects as well as in industry. More information about the project is found on the DBpedia website.
Duration: Active since 2007
Project partners: Universität Leipzig, OpenLink Software, and a world-wide community of developers and mapping editors.
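To make "querying Wikipedia like a database" a bit more concrete, here is a minimal sketch that builds a SPARQL request URL for DBpedia. The endpoint address `https://dbpedia.org/sparql`, the `format` parameter, and the ontology terms `dbo:City` and `dbo:populationTotal` are my assumptions and are not taken from the description above; `build_sparql_url` is a made-up helper name. Nothing is sent over the network unless you uncomment the last lines.

```python
from urllib.parse import urlencode

# Assumed public endpoint; check the DBpedia website for the current address.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

# Assumed ontology terms (dbo:City, dbo:populationTotal), for illustration only.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?city ?population WHERE {
  ?city a dbo:City ;
        dbo:populationTotal ?population .
}
ORDER BY DESC(?population)
LIMIT 5
"""

def build_sparql_url(endpoint: str, query: str) -> str:
    """Return a GET URL asking the endpoint for JSON results (hypothetical helper)."""
    params = {"query": query.strip(), "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)

url = build_sparql_url(DBPEDIA_ENDPOINT, QUERY)

# To actually run the query (needs network access):
# import json, urllib.request
# with urllib.request.urlopen(url) as response:
#     rows = json.load(response)["results"]["bindings"]
```

Building the URL separately from sending it keeps the sketch easy to inspect, and the same helper should work against any other endpoint that accepts the `query` parameter.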
W3C Linking Open Data
The W3C Linking Open Data community project supports and loosely coordinates the extension of the Web with a global data space by publishing open-license datasets as RDF and by setting data links between data items within different data sources. The project maintains the LOD dataset catalogue on CKAN as well as tool listings in the W3C ESW wiki. It regularly publishes statistics about the LOD data cloud and maintains the LOD cloud diagram. More information about the project is found on the LOD website.
Duration: Active since 2007
Project partners: Over 100 world-wide including the Massachusetts Institute of Technology (USA), DERI (Ireland), Talis (UK), University of Southampton (UK), Open University (UK), OpenLink Software (USA), BBC (UK), Geonames (USA).
Web Data Commons
More and more websites embed structured data describing, for instance, products, people, organizations, places, events, resumes, and cooking recipes into their HTML pages using encoding standards such as Microformats, Microdata and RDFa. The Web Data Commons project extracts all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is currently available to the public, and provides the extracted data for download in the form of RDF-quads and also in the form of CSV-tables for common entity types. More information about the project is found at WebDataCommons.org.
Duration: Active since March 2012
Project partner: Karlsruhe Institute of Technology (Germany)
Web Data Commons - Hyperlink Graph
The project provides a large hyperlink graph for public download and analyses the topology of the graph. The WDC Hyperlink Graph has been extracted from the Common Crawl 2012 web corpus and covers 3.5 billion web pages and 128 billion hyperlinks between these pages. To the best of our knowledge, this graph is the largest hyperlink graph that is available to the public outside companies such as Google, Yahoo, and Microsoft. The graph and the results of the analysis are found at http://webdatacommons.org/hyperlinkgraph
Duration: Active since November 2013
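A graph like this is usually processed as a stream of edges rather than loaded whole. The sketch below counts incoming links per page from an edge list. It assumes one edge per line in the form `source<TAB>target` with integer node ids, which is my assumption about the file layout and should be checked against the download page; `in_degrees` and the sample data are made up for illustration.

```python
from collections import Counter
from typing import Iterable

def in_degrees(lines: Iterable[str]) -> Counter:
    """Count incoming links per node from 'source<TAB>target' lines, one edge per line."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _source, target = line.split("\t")
        counts[int(target)] += 1
    return counts

# Tiny made-up sample; a real file would be streamed with open(path) instead.
sample = ["0\t1", "0\t2", "1\t2", "3\t2", ""]
degrees = in_degrees(sample)
print(degrees.most_common(1))  # [(2, 3)]: node 2 has three incoming links
```

At the full 3.5 billion pages a plain `Counter` will most likely not fit into laptop memory, so treat this as something to run on a sample or on an aggregated version of the graph.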
JudaicaLink
Scholarly reference works like encyclopedias, glossaries, or catalogs function as guides to a scholarly domain and as anchor points and manifestations of scholarly work. On the web of Linked Data, they can take on a key function to interlink resources related to the described concepts. Within the context of JudaicaLink, we provide support to publish and interlink existing reference works on Jewish culture and history as Linked Data. More information about the project is found at JudaicaLink.org.
Duration: Active since May 2013
Project partner: European Association for Jewish Culture (France, UK)
2. Open Source Software
Silk - Link Discovery Framework
The Silk framework is a tool for discovering relationships between data items within different Linked Data sources. Data publishers can use Silk to set RDF links from their data sources to other data sources on the Web. Silk can also be used as an identity resolution component within Linked Data applications. Silk provides a declarative language for expressing identity resolution heuristics and implements a sophisticated blocking method (MultiBlock). Both a single-machine and a Hadoop-based implementation are available. More information about the project is found on the Silk website.
Duration: Active since 2009
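The general idea behind blocking plus an identity resolution heuristic can be sketched in a few lines. This is not Silk's declarative language and not MultiBlock; it is plain key-based blocking with a string similarity threshold, and every name in it (`block_key`, `find_links`, the sample records, the 0.85 threshold) is hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def block_key(label: str) -> str:
    """Hypothetical blocking key: the lower-cased first letter of the label."""
    return label.strip().lower()[:1]

def find_links(source: dict, target: dict, threshold: float = 0.85) -> list:
    """Compare only records sharing a block; return (source_id, target_id, score) links."""
    blocks = defaultdict(list)
    for target_id, label in target.items():
        blocks[block_key(label)].append((target_id, label))

    links = []
    for source_id, label in source.items():
        # Only candidates in the same block are compared, not the full cross product.
        for target_id, other in blocks.get(block_key(label), []):
            score = SequenceMatcher(None, label.lower(), other.lower()).ratio()
            if score >= threshold:
                links.append((source_id, target_id, round(score, 2)))
    return links

source = {"a:1": "Berlin", "a:2": "Leipzig"}
target = {"b:1": "berlin", "b:2": "Leipzg", "b:3": "Mannheim"}
links = find_links(source, target)
print(links)
```

Blocking trades a little recall for a large drop in comparisons: records whose labels start with different letters are never compared here, which is exactly the kind of loss a more careful blocking method tries to avoid.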
RapidMiner Linked Open Data Extension
The RapidMiner Linked Open Data Extension is an extension to the open source data mining software RapidMiner. It allows using data from Linked Open Data both as an input for data mining and for enriching existing datasets with background knowledge. More information about the extension and its use cases is found on the project's website.
Duration: Active since 2013
D2RQ Platform - Accessing Relational Databases as Virtual RDF Graphs
The D2RQ Platform is a system for accessing relational databases as virtual, read-only RDF graphs. It offers RDF-based access to the content of relational databases without having to replicate it into an RDF store. Using D2RQ you can: 1. query a non-RDF database using SPARQL; 2. access the content of the database as Linked Data over the Web; 3. create custom dumps of the database in RDF formats for loading into an RDF store; 4. access information in a non-RDF database using the Apache Jena API. The D2RQ Platform has been downloaded over 15,000 times from SourceForge. More information about the platform is found on the D2RQ website.
Duration: Active since 2004
Project partners: DERI (Ireland)
OEM distributor: TopBraid (USA)
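The notion of a "virtual, read-only RDF graph" can be illustrated without D2RQ itself. The sketch below is not the D2RQ mapping language or API; it only shows the idea of computing triples from table rows on demand instead of copying them into a store. The table `person`, the base URI `http://example.org/`, and the function `virtual_triples` are all hypothetical.

```python
import sqlite3
from typing import Iterator, Tuple

BASE = "http://example.org/"  # hypothetical base URI for generated resources

def virtual_triples(conn: sqlite3.Connection) -> Iterator[Tuple[str, str, str]]:
    """Yield (subject, predicate, object) triples computed from table rows on demand."""
    rows = conn.execute("SELECT id, name, city FROM person ORDER BY id")
    for row_id, name, city in rows:
        subject = f"<{BASE}person/{row_id}>"
        yield subject, f"<{BASE}vocab/name>", f'"{name}"'
        yield subject, f"<{BASE}vocab/city>", f'"{city}"'

# Made-up in-memory database standing in for an existing relational source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany("INSERT INTO person VALUES (?, ?, ?)",
                 [(1, "Ada", "London"), (2, "Kai", "Mannheim")])

# A custom dump (use 3 in the list above) is then just a serialization of the view.
dump = "\n".join(f"{s} {p} {o} ." for s, p, o in virtual_triples(conn))
print(dump)
```

The sketch skips literal escaping and datatypes, so it is only safe for simple values like the ones shown; a real mapping layer has to handle both.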
LDIF - Linked Data Integration Framework
The LDIF – Linked Data Integration Framework is a Hadoop-based framework for integrating and cleansing large amounts of web and enterprise data. LDIF provides an expressive mapping language, an identity resolution component, as well as data quality assessment and data fusion modules. More information about the project is found on the LDIF website.
Duration: Active since June 2011
ALCOMO - Applying Logical Constraints to Match Ontologies
ALCOMO is a project developed by Christian Meilicke in the context of his PhD. It is a debugging system that transforms incoherent alignments into coherent alignments by removing some correspondences from the alignment. The removed part of the alignment is called a diagnosis. The system is complete in the sense that it detects any kind of incoherence in SHIN(D) ontologies. At the same time, a computed diagnosis is always minimal in the sense that the tool never removes too much, i.e., the removed subset of the alignment is always a minimal hitting set over all conflicts. The system is available under the MIT license and can be downloaded here.
Duration: Available since 2012
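The "minimal hitting set over all conflicts" can be made concrete with a small brute-force sketch. This is not ALCOMO's algorithm, which relies on ontology reasoning to find the conflicts in the first place; here the conflict sets are simply given, and the names `minimal_hitting_set` and `c1` to `c4` are hypothetical. The search tries subsets in order of increasing size, so the first hit is a smallest one and therefore also minimal.

```python
from itertools import combinations

def minimal_hitting_set(conflicts: list) -> set:
    """Return a smallest set of correspondences that intersects every conflict set."""
    if any(not conflict for conflict in conflicts):
        raise ValueError("an empty conflict set cannot be hit")
    candidates = sorted(set().union(*conflicts))
    # Try subsets by increasing size; the first one hitting all conflicts is smallest.
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            chosen = set(subset)
            if all(chosen & conflict for conflict in conflicts):
                return chosen
    return set(candidates)  # not reached for non-empty conflict sets

# Each conflict is a set of correspondences that cannot all hold together.
conflicts = [{"c1", "c2"}, {"c2", "c3"}, {"c4"}]
diagnosis = minimal_hitting_set(conflicts)
print(sorted(diagnosis))  # ['c2', 'c4']
```

Removing `c2` resolves the first two conflicts at once and `c4` resolves the third, so nothing smaller would do. The search is exponential in the number of correspondences and only meant for toy inputs.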
Semtinel - Thesaurus analysis beyond numbers
Semtinel is a graphical thesaurus analysis and maintenance system developed mainly by Kai Eckert as part of his dissertation. It also formed the technical basis for several master's theses and student research projects. The software is available at Semtinel.org.
Duration: Available since 2008
WDC - Extraction Framework
The Web Data Commons - Extraction Framework is used by the Web Data Commons project to extract Microdata, Microformats and RDFa data, Web graphs, and HTML tables from the web crawls provided by the Common Crawl Foundation. The framework provides an easy-to-use basis for the distributed processing of large web crawls using Amazon EC2 cloud services. The framework is published under the terms of the Apache license and can easily be customized to perform different data extraction tasks. More information and the download instructions can be found on the Web Data Commons website.
Duration: Available since July 2014
3. Benchmarks
Berlin SPARQL Benchmark (BSBM)
The SPARQL Query Language for RDF and the SPARQL Protocol for RDF are implemented by a growing number of storage systems. As SPARQL is taken up by the community there is a growing need for benchmarks to compare the performance of storage systems that expose SPARQL endpoints via the SPARQL protocol. The Berlin SPARQL Benchmark (BSBM) defines a suite of benchmarks for comparing the performance of these systems across architectures. The benchmark is built around an e-commerce use case in which a set of products is offered by different vendors and consumers have posted reviews about products. The benchmark query mix illustrates the search and navigation pattern of a consumer looking for a product. More information about the benchmark is found on the BSBM website.
Duration: Active since 2008
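The measuring side of such a benchmark can be sketched as a small timing harness. This is not the BSBM test driver; it only shows how running a fixed query mix repeatedly yields throughput figures such as query mixes per hour and queries per second. The function `run_query_mixes`, its `clock` parameter, and the stand-in queries are hypothetical, and real queries would be calls against a SPARQL endpoint.

```python
import time
from typing import Callable, Sequence

def run_query_mixes(queries: Sequence[Callable[[], object]], mixes: int,
                    clock: Callable[[], float] = time.perf_counter) -> dict:
    """Run the whole query mix `mixes` times and report throughput figures."""
    start = clock()
    for _ in range(mixes):
        for run_query in queries:
            run_query()
    elapsed = clock() - start
    return {
        "elapsed_seconds": elapsed,
        "query_mixes_per_hour": mixes * 3600 / elapsed,
        "queries_per_second": mixes * len(queries) / elapsed,
    }

# Stand-in "queries": plain functions doing a bit of work instead of hitting a store.
query_mix = [lambda: sum(range(10_000)), lambda: sorted(range(1_000), reverse=True)]
report = run_query_mixes(query_mix, mixes=50)
print(report)
```

The `clock` parameter exists so the harness can be checked with a fake clock; for real measurements you would also want warm-up runs before timing starts.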
OAEI Anatomy and Library Track
The Ontology Alignment Evaluation Initiative (OAEI) is a coordinated international initiative to assess strengths and weaknesses of alignment/matching systems and to compare the performance of techniques. We first offered the Anatomy track in 2006. This track consists of finding alignments between the Adult Mouse Anatomy and a part of the NCI Thesaurus (describing the human anatomy). The task is placed in a domain where we find large, carefully designed ontologies that are described in technical terms. Since 2012 we have offered a second track, called the Library track. The Library track is a real-world task to match the STW and the TheSoz thesauri. They provide vocabularies for economics and social science subjects, respectively, and are used by libraries for indexing and retrieval. The latest versions of the datasets as well as the tools to process them are available via http://oaei.ontologymatching.org/2012/.
Duration: Since 2006 as part of the annual OAEI campaign
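Comparing matching systems usually comes down to scoring a produced alignment against a reference alignment. The sketch below computes precision, recall, and F-measure over sets of correspondences. That these are the measures of interest is my assumption, not something stated above, and the concept identifiers and the function name `evaluate_alignment` are made up.

```python
def evaluate_alignment(found: set, reference: set) -> dict:
    """Score a set of (source_concept, target_concept) pairs against a reference set."""
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f_measure": f_measure}

# Hypothetical correspondences between a mouse and a human anatomy vocabulary.
reference = {
    ("mouse:heart", "human:Heart"),
    ("mouse:lung", "human:Lung"),
    ("mouse:liver", "human:Liver"),
    ("mouse:kidney", "human:Kidney"),
}
found = {
    ("mouse:heart", "human:Heart"),
    ("mouse:lung", "human:Lung"),
    ("mouse:tail", "human:Spine"),  # a wrong correspondence
}
scores = evaluate_alignment(found, reference)
print(scores)
```

With two of three found correspondences correct and two of four reference correspondences recovered, precision is 2/3 and recall is 1/2.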