Counting Queries: Extracting Key Business Metrics from Datasets
This post is part of a series on differential privacy. Learn more and browse all the posts published to date on the differential privacy blog series page in NIST’s Privacy Engineering Collaboration Space.
How many people drink pumpkin spice lattes in October, and how would you calculate this without learning specifically who is drinking them, and who is not?
While they seem simple or trivial, counting queries are used extremely often. Counting queries such as histograms can express many useful business metrics. How many transactions took place last week? How did this compare to the previous week? Which market has produced the most sales? In fact, one paper showed that more than half of queries written at Uber in 2016 were counting queries.
Counting queries are often the basis for more complicated analyses, too. For example, the U.S. Census releases data that is constructed essentially by issuing many counting queries over sensitive raw data collected from residents. Each of these queries belongs in the class of counting queries we will discuss below, and computes the number of people living in the U.S. with a particular set of properties (e.g., living in a certain geographic area, having a particular income, belonging to a particular demographic).
Counting queries over tabular data are the simplest kind of query to answer with differential privacy, and have the support of a large amount of research. Informally, counting queries have the form: “how many rows in the database have the property X?” For example, each row could correspond to a survey respondent, and the property X could be “replied ‘yes’ to the survey”. As we mentioned above, this simple form can be extended to fit a number of common use cases when analyzing data.
Our treatment of counting queries requires some background knowledge of tabular data and database queries—ideally, the basics of either SQL or Pandas.
In SQL, counting queries look like the following:
GROUP BY A1, A2, …
Here, T is a database table we want to search, P1… are the desired properties of the rows to be counted, and A1… are the columns of the data to group by. The number and complexity of the properties in the query does not affect the difficulty of answering the query with differential privacy (as long as the properties don’t depend on other rows in the database). When the list A1… is non-empty, the query is called a histogram.
In Pandas, the same class of queries can be expressed using a series of filters over a dataframe, followed by the use of len or shape:
df_1 = d_f[P1]
df_2 = df_1[P2]
df_3 = df_2[P3]
The class of simple counting queries excludes queries that perform other operations (e.g., SUM and AVG), and queries with joins and semijoins—more on these later in the series.
To achieve differential privacy for counting queries, we add noise to the query’s final answer. To correctly satisfy the definition of differential privacy, we scale the noise to the sensitivity of the query. In general, determining the sensitivity of an arbitrary query is not an easy task. However, determining the sensitivity of counting queries is essentially a solved problem, and what follows is a summary of the general solution.
As described in our first post, we add noise drawn from the Laplace distribution to the answer of a counting query to achieve differential privacy. The noise needs to be scaled to the sensitivity of the query. Sensitivity measures how much the answer changes based on a single user’s contribution of data. For counting queries, this value is always 1: the final count can only change by 1 when a single user’s data is added or removed. Crucially, this argument holds no matter what the property is, or the columns being grouped. As a rule of thumb, counting queries have a sensitivity of 1. This means we can easily determine the scale of the noise required. The formula for the noise as Python code is:
len(d_f[P1]) + np.random.laplace(loc=0, scale=1/ε)
For counting queries, this is all that’s required to achieve differential privacy! In fact, some of the software tools described below take exactly this approach.
Differential privacy works extremely well for counting large groups (e.g., residents of NYC), but as the group size gets smaller, accuracy degrades. In the worst case—a group of size 1—the “signal” (the exact count) and the noise (the added Laplacian noise) have the same scale. This is exactly what we want in order to protect privacy—a query that examines a single individual would definitely violate that individual’s privacy, so it should return a useless result in order to guarantee protection of privacy.
A common pitfall to avoid when using differential privacy: ensure the signal is significantly larger than the noise in order to achieve useful results.
Let’s look at two of the more actively-supported and accessible tools for answering counting queries with differential privacy. The first, Diffprivlib, is best-suited for datasets that fit in memory; queries are written as Python programs. The second, Google’s differential privacy library, integrates with PostgreSQL to support larger datasets; queries are written in SQL. Both tools can be found in NIST’s Privacy Engineering Collaboration Space, where you are invited to share feedback or use cases related to these tools.
IBM recently released an open source Python library for enforcing differential privacy, called Diffprivlib. Diffprivlib integrates with other commonly-used Python libraries like NumPy and scikit-learn, and makes it easy for Python programmers to add differential privacy to their programs. For example, the following Python code constructs a differentially private histogram with an ε value of 0.1:
from diffprivlib import tools as dp
dp_hist, _ = dp.histogram(data, epsilon=0.1)
Diffprivlib is suitable for differentially private analyses on any data that will easily fit in memory. Diffprivlib relies on libraries like NumPy to perform the underlying query tasks, and these libraries are not well-suited to large-scale data that does not fit in memory. For any kind of analysis you might normally perform on your laptop, Diffprivlib is likely to work well; for larger problems, other solutions are likely to be faster.
Programmers track the privacy budget in Diffprivlib explicitly, using a BudgetAccountant—which provides considerable flexibility, but also requires the programmer to understand privacy budgets. The following example will throw an error while computing the second histogram, since it exceeds the total allowed privacy budget.
acc = BudgetAccountant(epsilon=1, delta=0)
dp_hist1, _ = dp.histogram(data, epsilon=0.5, accountant=acc)
dp_hist2, _ = dp.histogram(data, epsilon=0.8, accountant=acc)
Google’s differential privacy library is an open source library written in C++ for high-performance, differentially private analytics. The library includes other functionality in addition to the C++ API, for example, an extension to PostgreSQL for executing differentially private database queries.
Through this integration, the library is capable of directly answering queries written in SQL. In most cases, the programmer can simply replace the COUNT aggregation function with the ANON_COUNT function (passing a value for ε), and the system will output differentially private results. For example, the following SQL query will compute a differentially private histogram with ε=0.1:
SELECT ANON_COUNT(*, 0.1)
GROUP BY age
Since Google’s library integrates directly with a commonly-used database system, it is likely to scale more easily to large datasets than Diffprivlib. Unlike NumPy, PostgreSQL is designed to work well even when the database being queried does not fit in memory, and Google’s approach retains this benefit.
Data environments which use another DBMS would need to switch to PostgreSQL or implement a similar integration for their own DBMS. For environments already using PostgreSQL, Google’s library provides a directly deployable solution for answering queries written in SQL. The library also provides APIs in Go and Java for building more complex differentially private algorithms.
The two tools discussed here are both robust and well-maintained, but both require at least some expertise in differential privacy. Systems with additional tools to aid in exploring data and understanding the implications of differential privacy have been proposed in academic projects, but have not yet been developed into practical, well-supported tools. For example, the Ψ system, developed as part of the Harvard Privacy Tools project, has a user interface designed specifically around helping the analyst to understand how best to allocate the privacy budget and how much error to expect from each query.
What if instead of the total, you’d like to know the average number of people drinking pumpkin spice lattes across different geographic areas? Our next post will go beyond counting queries into other kinds of aggregations—like sums and averages—to answer questions like these while maintaining differential privacy. These kinds of queries can be more challenging to answer with differential privacy, and require special care to ensure privacy and accuracy in the results.