Statistical database in DBMS

A statistical database in a DBMS (database management system) is a database used for statistical analysis. It is an online analytical processing (OLAP) system rather than an online transaction processing (OLTP) system. Both modern decision-support databases and classical statistical databases are often closer to the relational model than to the multidimensional model commonly used in OLAP systems today.

A statistical database typically contains parameter data and the measured data for those parameters. For instance, the parameter data consist of the values of the conditions varied in an experiment (e.g., temperature, time). The measured data (or variables) are the measurements taken in the experiment under these varying conditions.

Many statistical databases are sparse, with many null or zero values. It is not uncommon for a statistical database to be 40% to 50% sparse. There are two options for dealing with the sparseness: (1) leave the null values in place and use compression techniques to squeeze them out, or (2) remove the entries that contain only null values.
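The two options can be sketched in a few lines of Python. The sample table and the helper names `drop_empty_rows`, `to_sparse`, and `sparsity` are assumptions made up for this illustration; a real DBMS would apply its own storage-level compression rather than a dictionary.

```python
# Hypothetical measurements: each row is one experimental run,
# and None marks a value that was never recorded.
rows = [
    [20.5, None, 3.1],
    [None, None, None],
    [None, 7.2, None],
]

def drop_empty_rows(table):
    """Option 2: remove entries that contain only null values."""
    return [row for row in table if any(v is not None for v in row)]

def to_sparse(table):
    """Option 1, in its simplest form: keep the nulls logically but
    store only the non-null cells, keyed by (row, column)."""
    return {
        (i, j): v
        for i, row in enumerate(table)
        for j, v in enumerate(row)
        if v is not None
    }

def sparsity(table):
    """Fraction of cells that are null."""
    cells = [v for row in table for v in row]
    return sum(v is None for v in cells) / len(cells)

print(drop_empty_rows(rows))
print(to_sparse(rows))
print(round(sparsity(rows), 2))
```

The sparse form keeps row and column positions, so the original table can be rebuilt, whereas dropping all-null rows is simpler but loses the fact that those runs existed.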

Statistical databases often incorporate support for advanced statistical analysis techniques, such as correlations, which go beyond SQL. They also pose unique security concerns, which were the focus of much research, particularly in the 1970s and 1980s.
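As an illustration of the kind of analysis meant here, the following is a minimal sketch of a Pearson correlation between one parameter and one measured variable. The function name `pearson` and the sample values are assumptions for the example, not taken from any particular system.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical experiment: temperature is the parameter,
# growth is the value measured at each temperature.
temperature = [10.0, 20.0, 30.0, 40.0]
growth = [1.0, 2.1, 2.9, 4.2]

print(pearson(temperature, growth))
```

The function divides by zero if either sequence is constant, so a production version would need to handle that case.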

Statistical database security

A statistical database often allows query access only to aggregate data, not to individual records. Securing such a database is a difficult problem, since intelligent users can combine aggregate queries to derive information about a single individual.
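A minimal sketch of such an inference, using Python's built-in `sqlite3` module, is shown here. The `staff` table, its contents, and the `total_salary` helper are hypothetical; the point is only that two permitted SUM queries can be subtracted to recover one person's value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?)",
    [
        ("Ann", "R&D", 52000),
        ("Bob", "R&D", 48000),
        ("Cy", "R&D", 61000),
        ("Dee", "Sales", 45000),
    ],
)

def total_salary(where, params=()):
    """Aggregate-only interface: callers get a SUM, never a row.
    (The WHERE text is interpolated only to keep the sketch short.)"""
    sql = f"SELECT SUM(salary) FROM staff WHERE {where}"
    return conn.execute(sql, params).fetchone()[0]

# Both queries return aggregates over more than one person...
everyone_in_rd = total_salary("dept = ?", ("R&D",))
rd_without_cy = total_salary("dept = ? AND name <> ?", ("R&D", "Cy"))

# ...but their difference is exactly one individual's salary.
inferred = everyone_in_rd - rd_without_cy
print(inferred)
```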

Some common approaches are:

  • permit only aggregate queries (SUM, COUNT, AVG, STDEV, etc.)
  • rather than returning exact values for sensitive data such as income, return only the partition the value belongs to (e.g., 35k-40k)
  • return imprecise counts (e.g., rather than reporting that 141 records met the query, indicate only that 130-150 records met it)
  • don’t permit overly selective WHERE clauses
  • audit all user queries, so that users who misuse the system can be investigated
  • use intelligent agents to automatically detect inappropriate system use
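Several of these approaches can be sketched together in Python. Everything named here (`income_band`, `fuzzy_count`, `restricted_avg`, `QueryRefused`, the threshold of 3, and the sample incomes) is an assumption chosen for illustration, not a description of how any particular DBMS implements these controls.

```python
MIN_QUERY_SET_SIZE = 3  # assumed threshold, chosen only for illustration
audit_log = []          # (user, query-set size, answered?) tuples

class QueryRefused(Exception):
    """Raised when a query is too selective to answer."""

def income_band(income, width=5000):
    """Return the partition an income falls into, not the exact value."""
    low = (income // width) * width
    return f"{low // 1000}k-{(low + width) // 1000}k"

def fuzzy_count(exact_count, step=10):
    """Return a range around the count rounded to the nearest `step`."""
    centre = (exact_count + step // 2) // step * step
    return (max(0, centre - step), centre + step)

def restricted_avg(user, values, predicate):
    """Answer AVG only when enough records match; log every attempt."""
    selected = [v for v in values if predicate(v)]
    answered = len(selected) >= MIN_QUERY_SET_SIZE
    audit_log.append((user, len(selected), answered))
    if not answered:
        raise QueryRefused("query is too selective")
    return sum(selected) / len(selected)

incomes = [31000, 37500, 42000, 58000, 64000]

print(income_band(37500))
print(fuzzy_count(141))
print(restricted_avg("u1", incomes, lambda x: x > 35000))
try:
    restricted_avg("u1", incomes, lambda x: x > 60000)
except QueryRefused as exc:
    print("refused:", exc)
```

The audit log records refused queries as well as answered ones, since a pattern of near-threshold queries is what a reviewer or automated agent would look for.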

Research in this field has largely stalled; reference 3 below showed that, in general, securing statistical databases was an impossible aim: if they were open to legitimate use, they were also open to abuse; and if they were restricted so tightly as to be incapable of abuse, they would then be useless for practical statistical purposes.

To illustrate with a quote:
The conclusion is that statistical databases are almost always subject to compromise. Severe restrictions on allowable query set sizes will render the database useless as a source of statistical information but will not secure the confidential records.
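The quoted conclusion can be illustrated with a small sketch in the style of what the literature usually calls a tracker attack: a refused query is padded with extra records until it is large enough to be answered, and the padding is then subtracted out. The records, the limit `K`, and the function names are hypothetical, and the rule that a query set must hold between `K` and `n - K` records is an assumed form of the size restriction.

```python
K = 2  # assumed limit: a query set must hold between K and n - K records

records = [
    {"sex": "F", "dept": "R&D", "salary": 52000},
    {"sex": "M", "dept": "R&D", "salary": 48000},
    {"sex": "F", "dept": "Sales", "salary": 45000},
    {"sex": "M", "dept": "Sales", "salary": 50000},
    {"sex": "F", "dept": "Legal", "salary": 61000},  # only Legal employee
    {"sex": "M", "dept": "Ops", "salary": 47000},
    {"sex": "F", "dept": "Ops", "salary": 43000},
    {"sex": "M", "dept": "Ops", "salary": 55000},
]

def guarded_sum(predicate):
    """Return SUM(salary), or None if the query set size is not allowed."""
    selected = [r["salary"] for r in records if predicate(r)]
    if not (K <= len(selected) <= len(records) - K):
        return None
    return sum(selected)

def target(r):
    """Condition that singles out one individual."""
    return r["dept"] == "Legal"

def tracker(r):
    """Padding condition that matches about half of the records."""
    return r["sex"] == "F"

# The direct query matches one record and is refused.
direct = guarded_sum(target)

# Four queries of permitted size rebuild the refused answer.
total = guarded_sum(tracker) + guarded_sum(lambda r: not tracker(r))
padded_a = guarded_sum(lambda r: target(r) or tracker(r))
padded_b = guarded_sum(lambda r: target(r) or not tracker(r))
inferred = padded_a + padded_b - total

print(direct, inferred)
```

The two padded query sets together cover every record and overlap only in the target, so subtracting the overall total leaves the target's salary. Raising `K` far enough to block this also blocks most legitimate queries on a table this small.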