Opendata, Software, OpenEdu -- Life is good

Here's a list of some large datasets, and the kind of software that can handle these sets on a laptop.

I'll be back with more; there is some amazing stuff out there now.

Resources

1. Open Data

DBpedia - Querying Wikipedia Like a Database

DBpedia is a community effort to extract structured information from Wikipedia editions in over 90 languages and to make the resulting knowledge base available on the Web. The DBpedia knowledge base currently describes more than 3.5 million things, out of which 1.6 million are classified in a consistent ontology. It is one of the most comprehensive multi-lingual knowledge bases that currently exist and has developed into an interlinking hub for the Web of Data. The knowledge base is widely used by research projects as well as in industry. More information about the project is found on the DBpedia website.

Duration: Active since 2007
Project partners: Universität Leipzig, OpenLink Software, and a world-wide community of developers and mapping editors.
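
To make "querying Wikipedia like a database" a bit more concrete, here is a minimal sketch that builds a SPARQL request URL for DBpedia. It assumes the public endpoint lives at https://dbpedia.org/sparql, that it accepts `query` and `format` parameters, and that cities carry a `dbo:populationTotal` property. The helper name `build_query_url` is made up for this post. The sketch only builds the URL; fetching it is left to the reader.

```python
from urllib.parse import urlencode

# Assumption: DBpedia's public SPARQL endpoint (check the DBpedia website).
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_query_url(sparql, endpoint=DBPEDIA_ENDPOINT):
    """Return a GET URL that asks the endpoint to run `sparql`.

    Hypothetical helper: the parameter names `query` and `format`
    are assumptions about how the endpoint is configured.
    """
    params = {
        "query": sparql,
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urlencode(params)


# Ask for five cities with the largest recorded population.
# Assumption: dbo:City and dbo:populationTotal are used as shown.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?city ?population WHERE {
  ?city a dbo:City ;
        dbo:populationTotal ?population .
}
ORDER BY DESC(?population)
LIMIT 5
"""

url = build_query_url(QUERY)
print(url)
```

Pasting the printed URL into a browser, or handing it to `urllib.request.urlopen`, should return the matching rows if the assumptions above hold.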

W3C Linking Open Data

The W3C Linking Open Data community project supports and loosely coordinates the extension of the Web with a global data space by publishing open-license datasets as RDF and by setting data links between data items within different data sources. The project maintains the LOD dataset catalogue on CKAN as well as tool listings in the W3C ESW wiki. It regularly publishes statistics about the LOD data cloud and maintains the LOD cloud diagram. More information about the project is found on the LOD website.

Duration: Active since 2007
Project partners: Over 100 world-wide including the Massachusetts Institute of Technology (USA), DERI (Ireland), Talis (UK), University of Southampton (UK), Open University (UK), OpenLink Software (USA), BBC (UK), Geonames (USA).

Web Data Commons

More and more websites embed structured data describing, for instance, products, people, organizations, places, events, resumes, and cooking recipes into their HTML pages using encoding standards such as Microformats, Microdata and RDFa. The Web Data Commons project extracts all Microformat, Microdata and RDFa data from the Common Crawl web corpus, the largest and most up-to-date web corpus that is currently available to the public, and provides the extracted data for download in the form of RDF-quads and also in the form of CSV-tables for common entity types. More information about the project is found at WebDataCommons.org.

Duration: Active since March 2012
Project partner: Karlsruhe Institute of Technology (Germany)

Web Data Commons - Hyperlink Graph

The project provides a large hyperlink graph for public download and analyses the topology of the graph. The WDC Hyperlink Graph has been extracted from the Common Crawl 2012 web corpus and covers 3.5 billion web pages and 128 billion hyperlinks between these pages. To the best of our knowledge, this graph is the largest hyperlink graph that is available to the public outside companies such as Google, Yahoo, and Microsoft. The graph and the results of the analysis are found at http://webdatacommons.org/hyperlinkgraph

Duration: Active since November 2013
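
A first step in analysing the topology of a graph like this is usually its in-degree distribution: how many pages receive how many links. The sketch below shows the idea on a toy edge list. The input format (one "source target" pair of integer node ids per line) is an assumption made for the example, not necessarily the format of the actual download, and the function names are made up.

```python
from collections import Counter


def parse_edges(lines):
    """Yield (source, target) integer pairs from 'source target' lines.

    Assumed format for this sketch; lines that do not have exactly
    two fields are skipped.
    """
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            yield int(parts[0]), int(parts[1])


def in_degrees(edges):
    """Count incoming links per node."""
    return Counter(target for _source, target in edges)


def degree_histogram(degrees):
    """Map each in-degree value to the number of nodes that have it.

    Nodes with no incoming links never appear in `degrees`, so they
    are not counted here.
    """
    return Counter(degrees.values())


# Toy graph: five links between four pages.
SAMPLE = ["0 1", "0 2", "1 2", "3 2", "3 1"]

edges = list(parse_edges(SAMPLE))
indeg = in_degrees(edges)
hist = degree_histogram(indeg)

for degree in sorted(hist):
    print(degree, hist[degree])
```

For the real graph the same counting would have to be streamed from disk rather than held in a list, but the logic stays the same.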

JudaicaLink

Scholarly reference works like encyclopedias, glossaries, or catalogs function as guides to a scholarly domain and as anchor points and manifestations of scholarly work. On the web of Linked Data, they can take on a key function to interlink resources related to the described concepts. Within the context of JudaicaLink, we provide support to publish and interlink existing reference works on Jewish culture and history as Linked Data. More information about the project is found at JudaicaLink.org.

Duration: Active since May 2013
Project partner: European Association for Jewish Culture (France, UK)

2. Open Source Software

Silk - Link Discovery Framework

The Silk framework is a tool for discovering relationships between data items within different Linked Data sources. Data publishers can use Silk to set RDF links from their data sources to other data sources on the Web. Silk can also be used as an identity resolution component within Linked Data applications. Silk provides a declarative language for expressing identity resolution heuristics and implements a sophisticated blocking method (MultiBlock). Both a single-machine and a Hadoop-based implementation are available. More information about the project is found on the Silk website.

Duration: Active since 2009
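
The general idea behind link discovery with blocking can be sketched in a few lines. This is not Silk's declarative language and not its MultiBlock method; it is a deliberately naive stand-in that blocks on the first letter of a label and compares labels with Python's `difflib`. All URIs, labels, and function names are invented for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def block_key(label):
    """Naive blocking key: the first character of the lowercased label."""
    return label.strip().lower()[:1]


def find_links(source, target, threshold=0.85):
    """Return (source_uri, 'owl:sameAs', target_uri) candidate links.

    `source` and `target` map URIs to labels. Only items that share a
    blocking key are compared, which avoids the full cross product.
    """
    blocks = defaultdict(list)
    for uri, label in target.items():
        blocks[block_key(label)].append((uri, label))

    links = []
    for s_uri, s_label in source.items():
        for t_uri, t_label in blocks.get(block_key(s_label), []):
            score = SequenceMatcher(None, s_label.lower(), t_label.lower()).ratio()
            if score >= threshold:
                links.append((s_uri, "owl:sameAs", t_uri))
    return links


# Hypothetical data sources.
source = {"ex:a1": "Berlin", "ex:a2": "Leipzig"}
target = {"ex:b1": "berlin", "ex:b2": "Bern", "ex:b3": "Mannheim"}

links = find_links(source, target)
print(links)
```

A first-letter key will miss matches whose labels start differently (typos, prefixes such as "The"), which is exactly the kind of weakness a more sophisticated blocking method is meant to address.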

RapidMiner Linked Open Data Extension

The RapidMiner Linked Open Data Extension is an extension to the open source data mining software RapidMiner. It allows using data from Linked Open Data both as an input for data mining and for enriching existing datasets with background knowledge. More information about the extension and its use cases is found on the project's website.

Duration: Active since 2013

D2RQ Platform - Accessing Relational Databases as Virtual RDF Graphs

The D2RQ Platform is a system for accessing relational databases as virtual, read-only RDF graphs. It offers RDF-based access to the content of relational databases without having to replicate it into an RDF store. Using D2RQ you can: 1. query a non-RDF database using SPARQL; 2. access the content of the database as Linked Data over the Web; 3. create custom dumps of the database in RDF formats for loading into an RDF store; 4. access information in a non-RDF database using the Apache Jena API. The D2RQ Platform has been downloaded over 15,000 times from SourceForge. More information about the platform is found on the D2RQ website.

Duration: Active since 2004
Project partners: DERI (Ireland)
OEM distributor: TopBraid (USA)

LDIF - Linked Data Integration Framework

The LDIF – Linked Data Integration Framework is a Hadoop-based framework for integrating and cleansing large amounts of web and enterprise data. LDIF provides an expressive mapping language, an identity resolution component, as well as data quality assessment and data fusion modules. More information about the project is found on the LDIF website.

Duration: Active since June 2011

ALCOMO - Applying Logical Constraints to Match Ontologies

ALCOMO is a project that has been developed by Christian Meilicke in the context of his PhD. It is a debugging system that transforms incoherent alignments into coherent alignments by removing some correspondences from the alignment. The removed part of the alignment is called a diagnosis. It is complete in the sense that it detects any kind of incoherence in SHIN(D) ontologies. At the same time a computed diagnosis is always minimal in the sense that the tool never removes too much, i.e., the removed subset of the alignment is always a minimal hitting set over all conflicts. The system is available under the MIT license and can be downloaded here.

Duration: Available since 2012
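
The "minimal hitting set over all conflicts" idea is easy to show on a toy case. Each conflict is a set of correspondences that cannot all be kept; a diagnosis must remove at least one correspondence from every conflict. The sketch below is a brute-force search written for this post, not ALCOMO's actual algorithm, and the correspondence names are made up. It returns a smallest hitting set, which is one particular kind of minimal hitting set.

```python
from itertools import combinations


def smallest_hitting_set(conflicts):
    """Return a smallest set that shares an element with every conflict.

    Brute force: tries candidate sets in order of increasing size, so
    the first hit is minimal. Exponential in the number of distinct
    elements, so only suitable for tiny examples.
    """
    if any(len(conflict) == 0 for conflict in conflicts):
        raise ValueError("an empty conflict can never be hit")

    universe = sorted(set().union(*conflicts)) if conflicts else []
    for size in range(len(universe) + 1):
        for candidate in combinations(universe, size):
            chosen = set(candidate)
            if all(chosen & conflict for conflict in conflicts):
                return chosen
    return set()


# Hypothetical conflicts between correspondences c1..c4.
conflicts = [{"c1", "c2"}, {"c2", "c3"}, {"c4"}]

diagnosis = smallest_hitting_set(conflicts)
print(sorted(diagnosis))
```

Here removing c2 resolves the first two conflicts at once and c4 has to go in any case, so nothing smaller than two correspondences can work.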

Semtinel - Thesaurus analysis beyond numbers

Semtinel is a graphical thesaurus analysis and maintenance system developed mainly by Kai Eckert as part of his dissertation. It also formed the technical basis for several master's theses and student research projects. The software is available at Semtinel.org.

Duration: Available since 2008

WDC - Extraction Framework

The Web Data Commons - Extraction Framework is used by the Web Data Commons project to extract Microdata, Microformats and RDFa data, Web graphs, and HTML tables from the web crawls provided by the Common Crawl Foundation. The framework provides an easy-to-use basis for the distributed processing of large web crawls using Amazon EC2 cloud services. The framework is published under the terms of the Apache license and can easily be customized to perform different data extraction tasks. More information and the download instructions can be found on the Web Data Commons website.

Duration: Available since July 2014

3. Benchmarks

Berlin SPARQL Benchmark (BSBM)

The SPARQL Query Language for RDF and the SPARQL Protocol for RDF are implemented by a growing number of storage systems. As SPARQL is taken up by the community there is a growing need for benchmarks to compare the performance of storage systems that expose SPARQL endpoints via the SPARQL protocol. The Berlin SPARQL Benchmark (BSBM) defines a suite of benchmarks for comparing the performance of these systems across architectures. The benchmark is built around an e-commerce use case in which a set of products is offered by different vendors and consumers have posted reviews about products. The benchmark query mix illustrates the search and navigation pattern of a consumer looking for a product. More information about the benchmark is found on the BSBM website.

Duration: Active since 2008

OAEI Anatomy and Library Track

The Ontology Alignment Evaluation Initiative (OAEI) is a coordinated international initiative to assess strengths and weaknesses of alignment/matching systems and to compare the performance of techniques. The Anatomy track was offered for the first time in 2006. This track consists of finding alignments between the Adult Mouse Anatomy and a part of the NCI Thesaurus (describing the human anatomy). The task is placed in a domain where we find large, carefully designed ontologies that are described in technical terms. Since 2012 we have offered a second track, called the Library track. The Library track is a real-world task to match the STW and the TheSoz thesaurus. They provide vocabularies for economics and social science subjects, respectively, and are used by libraries for indexing and retrieval. The latest versions of the datasets as well as the tools to process them are available via http://oaei.ontologymatching.org/2012/.

Duration: Since 2006 as part of the annual OAEI campaign

Thoughts on Bullshit

Das ist nicht nur nicht richtig, es ist nicht einmal falsch!
(It is not only not right, it is not even wrong)

Some big minds tell us that it is impossible for someone to lie unless he thinks he knows the truth. This makes bullshit different from lying: producing bullshit requires no such conviction. A person who lies is responding to the truth and, for his part, believes he understands what the truth is.

When an honest man speaks, he says only what he believes to be true; the liar, correspondingly, believes his statements to be false. For the bullshitter, however, all these bets are off:

Authoring Tools and a Sadistic Utility You'll Probably Try Anyway

Over the last couple of weeks I've been delving deep into linguistics and grammar parsing. I learned some great stuff about Sentiment Programming and analysis strategy. In doing all of this I've gathered up a long list of software utilities, which I'm now trying to catalog and comment on in case you would like to try some of them out. They were an amazing help, and without them I certainly would not have learned as much as I did in the short time I gave myself to understand these areas of research.

What did I learn? Well, I learned that I had spent the last couple of weeks delving deep into exactly the wrong areas of research. However, I never would have found the right area if I hadn't gone there.

I also learned some useful aspects of Sentiment, and Big Data, both of which I'll be posting on as well over the next couple of weeks.

This first list is a collection of some Authoring software which I have enjoyed on various levels. Gir will show you the ones I find most useful.

Spreadsheets are essential tools for enterprises. Other financial and business intelligence applications have emerged over the years, but the spreadsheet remains the fundamental tool for financial reporting and analysis and for sharing numerical or tabular data. Spreadsheet applications have grown more powerful, and spreadsheets themselves have grown bigger and more complex, making it more difficult for spreadsheet users and internal auditors to identify risks and errors in spreadsheet data and formulas. If uncorrected, these risks and errors could lead to strategic missteps, erroneous financial reporting, regulatory fines, and other costly outcomes. To reduce risks and errors while keeping spreadsheet users productive, enterprises need a spreadsheet risk management solution that is:

Comprehensive
Fast
Accurate
Scalable
Easy to use
Easy to learn
Supportive of best practices

A Look at Propaganda in the Ukraine

The New York Times has an interview with Mr. Pomerantsev which is certainly worth looking into if you are interested in the use of propaganda on cable news. Aggressive persuasion works differently on TV than in text, of course, because of the constraints of each medium. For example, guilt as a means of motivation can be evoked 17 ways (that I know of) through video, and only 4 ways through text. Text wins out in other respects.

Mr. Pomerantsev’s book, “Nothing Is True and Everything Is Possible,” has particular resonance, describing a world where laws change at the whim of the powerful and where television provides an ever-present, entertaining and emotionally charged distortion of reality.

Mr. Pomerantsev’s area of study is propaganda, and he believes he saw many classic techniques at work in Moscow. He says one favorite trick was to put a credible expert next to a neo-Nazi, juxtaposing fact with fiction so as to encourage so much cynicism that viewers believed very little. Another was to give credence to conspiracy theories — by definition difficult to rebut because their proponents are immune to reasoned debate.

Debunk from F.E.E. on the Common Core Deniers

The Foundation for Excellence in Education states, in the Background area of their article below, that they respect much of the work of the American Principles Project and the work of Ms. Gallagher.

I don't.

First off, why is a 501(c)(3) non-profit commenting on and campaigning against government policy in the first place?

So far every claim they have made against Common Core has been (dis)information. There is a very important difference between "misinformation" and "disinformation". Misinformation means that you didn't know that what you were propagating was inaccurate. It means that you had no agenda other than attempting to provide the best information you had available, and that you believed what you knew was of importance.

DisInformation is something else entirely,

Republican Propaganda
Editing the State of the Union Speech

Republicans Post Doctored Version Of State Of The Union, Censor Facts On Climate Change
BY EMILY ATKIN POSTED ON JANUARY 21, 2015 AT 10:42 AM

The official website for House Republicans has posted on YouTube a version of President Obama’s State of the Union address which cuts out comments where the President was critical of Republican rhetoric on climate change, ThinkProgress has learned.

NeuroRomancer
