Configuration File
The quickstart shows you the smallest useful configuration file. That is a nice way to get started, but there is more you can tell Anonimatron to do. This page walks through the XML configuration file and the options you can use when databases, files, custom anonymizers, and a few special cases come together.
If you ever get lost, ask Anonimatron to print an example:
./anonimatron.sh -configexample
Once you have a configuration file, run it like this:
./anonimatron.sh -config config.xml -synonyms synonyms.xml
The synonym file is where Anonimatron remembers that value A was replaced with value X. If you use the same synonym file in the next run, the same input values will get the same replacement values.
A Bigger Example
Here is a configuration file that shows most of the moving parts. You probably do not need all of this on day one, but it is useful to see how the pieces fit together.
<?xml version="1.0" encoding="UTF-8"?>
<configuration jdbcurl="jdbc:mysql://localhost:3306/mydb"
               userid="myuser"
               password="mypassword"
               salt="example-salt">
  <anonymizerclass>my.package.FixedValueAnonymizer</anonymizerclass>
  <filefilterclass>my.package.CsvFileFilter</filefilterclass>
  <table name="userdata" fetchSize="500">
    <column name="firstname" type="ROMAN_NAME" />
    <column name="lastname" type="ELVEN_NAME" />
    <column name="creditcardnr" type="RANDOMDIGITS">
      <parameter id="mask">1111********</parameter>
    </column>
    <column name="session_token" type="STRING" shortlived="true" />
    <discriminator columnname="contact_type" value="email">
      <column name="contact_value" type="EMAIL_ADDRESS" />
    </discriminator>
    <discriminator columnname="contact_type" value="phone">
      <column name="contact_value" type="RANDOMDIGITS" />
    </discriminator>
  </table>
  <file inFile="customers.csv"
        outFile="customers-anonymized.csv"
        reader="com.rolfje.anonimatron.file.CsvFileReader"
        writer="com.rolfje.anonimatron.file.CsvFileWriter"
        encoding="UTF-8">
    <readerParameter id="delimiter">;</readerParameter>
    <writerParameter id="delimiter">;</writerParameter>
    <column name="1" type="ROMAN_NAME" />
    <column name="2" type="EMAIL_ADDRESS" />
  </file>
</configuration>
When a configuration contains both <table> and <file> elements, Anonimatron processes all tables
first and then all files. The same synonym cache is used for both, so a value anonymized in a table
can be replaced consistently in a file later in the same run.
The Configuration Element
Everything starts with the <configuration> element. For database anonymization, this is where the
database connection lives:
<configuration jdbcurl="jdbc:mysql://localhost:3306/mydb"
               userid="myuser"
               password="mypassword">
The jdbcurl, userid, and password attributes are only needed when you anonymize database
tables. You can also provide them on the command line with -jdbcurl, -userid, and -password.
Command line values override the values in the XML file, which is handy when you do not want to put
passwords in a configuration file.
There is one other root attribute worth knowing about: salt. When you set it, Anonimatron hashes
the original values in the synonym file using that salt. Keep the salt stable if you want old synonym
files to remain usable across runs.
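As a minimal sketch, a configuration that sets a salt and keeps the password out of the file could look like this. It assumes the password is supplied with -password on the command line, as described above. The salt value and the single column are placeholders.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: the password attribute is left out on the assumption
     that it is passed with -password on the command line. -->
<configuration jdbcurl="jdbc:mysql://localhost:3306/mydb"
               userid="myuser"
               salt="example-salt">
  <!-- Keep this salt unchanged between runs if an existing synonym
       file should stay usable. -->
  <table name="userdata">
    <column name="firstname" type="ROMAN_NAME" />
  </table>
</configuration>
```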
You can also register extra classes at the top level:
<anonymizerclass>my.package.FixedValueAnonymizer</anonymizerclass>
<filefilterclass>my.package.CsvFileFilter</filefilterclass>
Use anonymizerclass for your own anonymizers. Use filefilterclass when a file input points to a
directory and you want to decide which files in that directory should be processed.
Database Tables
To anonymize a table, add a <table> element and then list the columns Anonimatron should touch:
<table name="userdata">
  <column name="firstname" type="ROMAN_NAME" />
  <column name="lastname" type="ELVEN_NAME" />
</table>
This tells Anonimatron to read the userdata table and replace the values in the firstname and
lastname columns. The type attribute points to an anonymizer type. You can find the built-in
types in the available anonymizers page.
Tables must have a primary key. Anonimatron uses that primary key to update rows in place, which also means you cannot anonymize a primary key column.
For large tables, you can set a JDBC fetch size:
<table name="userdata" fetchSize="500">
Whether this helps depends on the JDBC driver, but it can reduce memory usage for large result sets.
Schema-qualified table names such as myschema.userdata are supported when the database exposes the
right schema metadata.
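For illustration, a schema-qualified name and a fetch size can be combined on one table element. The schema name myschema is the one used in the text above. Whether the lookup succeeds depends on your database and driver, so treat this as a sketch rather than a guaranteed setup.

```xml
<!-- Sketch: schema-qualified table with a JDBC fetch size hint. -->
<table name="myschema.userdata" fetchSize="500">
  <column name="firstname" type="ROMAN_NAME" />
  <column name="lastname" type="ELVEN_NAME" />
</table>
```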
Files
Anonimatron can anonymize files too. A file configuration tells Anonimatron where to read from, where to write to, and which reader and writer classes to use:
<file inFile="customers.csv"
      outFile="customers-anonymized.csv"
      reader="com.rolfje.anonimatron.file.CsvFileReader"
      writer="com.rolfje.anonimatron.file.CsvFileWriter">
  <column name="1" type="ROMAN_NAME" />
  <column name="2" type="EMAIL_ADDRESS" />
</file>
The built-in CSV reader does not use header names. It numbers fields from left to right, starting at 1. So name="1" means the first field, name="2" means the second field, and so on.
CSV files are read and written as UTF-8 by default. If your file uses another encoding, set it with a Java charset name:
<file inFile="customers.csv"
      outFile="customers-anonymized.csv"
      reader="com.rolfje.anonimatron.file.CsvFileReader"
      writer="com.rolfje.anonimatron.file.CsvFileWriter"
      encoding="ISO-8859-1">
The built-in CSV reader and writer also support a delimiter parameter:
<readerParameter id="delimiter">;</readerParameter>
<writerParameter id="delimiter">;</writerParameter>
If you leave this out, the reader treats comma, semicolon, and tab as delimiters. The writer uses comma by default. The CSV support is intentionally simple and does not try to be a full spreadsheet parser.
The inFile attribute may point to a single file or a directory. When it points to a directory,
Anonimatron considers the direct child files in that directory. If outFile exists as a directory,
output files are written there with their original file names. Anonimatron will refuse to overwrite
an existing output file or use the same file as both input and output.
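A directory-based setup might look like the sketch below. The directory names input-dir and output-dir are hypothetical. The sketch assumes output-dir already exists and does not yet contain files with the same names as the inputs. The file filter is the my.package.CsvFileFilter placeholder used earlier on this page, not a class that ships with Anonimatron.

```xml
<configuration>
  <!-- Placeholder class: decides which files in input-dir are processed. -->
  <filefilterclass>my.package.CsvFileFilter</filefilterclass>

  <!-- inFile and outFile both point to directories (hypothetical names). -->
  <file inFile="input-dir"
        outFile="output-dir"
        reader="com.rolfje.anonimatron.file.CsvFileReader"
        writer="com.rolfje.anonimatron.file.CsvFileWriter">
    <column name="1" type="ROMAN_NAME" />
    <column name="2" type="EMAIL_ADDRESS" />
  </file>
</configuration>
```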
Columns
Columns are where you tell Anonimatron what kind of replacement data you want:
<column name="creditcardnr" type="RANDOMDIGITS" />
The most important attributes are:
| Attribute | Meaning |
|---|---|
| name | Database column name, or CSV field number for the built-in CSV reader. |
| type | Anonymizer type, such as ROMAN_NAME, EMAIL_ADDRESS, or RANDOMDIGITS. |
| size | Optional size passed to the anonymizer. For database columns, JDBC metadata supplies this. |
| shortlived | Set to true when the generated synonym should not be stored in the synonym file. |
For database columns, type may be left out. In that case Anonimatron asks JDBC for the Java class
name of the column and tries to use a default anonymizer mapping. In practice, it is usually clearer
to set the type yourself.
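The sketch below contrasts a column without a type with one that sets it explicitly, and shows size on a file column. The column name nickname and the size value 20 are made up for illustration. Whether a given anonymizer honours size is an assumption here.

```xml
<configuration jdbcurl="jdbc:mysql://localhost:3306/mydb"
               userid="myuser"
               password="mypassword">
  <table name="userdata">
    <!-- No type: Anonimatron tries a default mapping based on the
         Java class name that JDBC reports for this column. -->
    <column name="nickname" />
    <!-- Explicit type: usually the clearer choice. -->
    <column name="firstname" type="ROMAN_NAME" />
  </table>

  <file inFile="customers.csv"
        outFile="customers-anonymized.csv"
        reader="com.rolfje.anonimatron.file.CsvFileReader"
        writer="com.rolfje.anonimatron.file.CsvFileWriter">
    <!-- Files have no JDBC metadata, so size is given by hand
         (illustrative value). -->
    <column name="1" type="ROMAN_NAME" size="20" />
  </file>
</configuration>
```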
Short-lived columns are useful for values that just need to be filled with plausible junk and do not need to be stable between runs:
<column name="session_token" type="STRING" shortlived="true" />
Use this with care. Because short-lived synonyms are not stored, repeated runs will generate new values for the same input.
Parameters
Some anonymizers accept extra parameters. Parameters are written inside a column:
<column name="creditcardnr" type="RANDOMDIGITS">
  <parameter id="mask">1111********</parameter>
</column>
For RANDOMDIGITS, the mask parameter tells Anonimatron which characters to keep. Digits in the
mask keep the original character at that position; other positions are replaced by random digits.
So a mask such as 1111******** keeps the first four characters and replaces the rest.
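To make the rule concrete, here is the twelve-character mask from the column example with a made-up input value described in the comment. The sample value is hypothetical, and the generated digits will differ for every new synonym.

```xml
<column name="creditcardnr" type="RANDOMDIGITS">
  <!-- Mask positions 1 to 4 hold digits, so the original characters
       at those positions are kept. Positions 5 to 12 hold '*', so
       they are replaced by random digits.
       Hypothetical input 432198765432: the prefix 4321 is kept and
       the remaining eight positions become random digits. -->
  <parameter id="mask">1111********</parameter>
</column>
```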
RANDOMCHARACTERS accepts a characters parameter:
<column name="code" type="RANDOMCHARACTERS">
  <parameter id="characters">ABC123</parameter>
</column>
This makes Anonimatron generate replacement values using only those characters. Custom anonymizers can define their own parameters.
Discriminators
Sometimes one column contains different kinds of data. A contact_value column might contain a phone
number for one row and an email address for another. Discriminators handle that case:
<table name="contacts">
  <column name="contact_value" type="RANDOMDIGITS" />
  <discriminator columnname="contact_type" value="email">
    <column name="contact_value" type="EMAIL_ADDRESS" />
  </discriminator>
</table>
Here, contact_value is treated as random digits by default. But when contact_type is email,
Anonimatron uses the EMAIL_ADDRESS anonymizer instead. If the discriminator has no value
attribute, it matches null.
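As a sketch of the null case, the table below adds a second discriminator without a value attribute. Using the STRING type for rows where contact_type is null is only an illustrative choice.

```xml
<table name="contacts">
  <!-- Default when no discriminator matches. -->
  <column name="contact_value" type="RANDOMDIGITS" />

  <!-- Applies to rows where contact_type is "email". -->
  <discriminator columnname="contact_type" value="email">
    <column name="contact_value" type="EMAIL_ADDRESS" />
  </discriminator>

  <!-- No value attribute: applies to rows where contact_type is null. -->
  <discriminator columnname="contact_type">
    <column name="contact_value" type="STRING" />
  </discriminator>
</table>
```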
Discriminators are applied to database tables. The XML mapping also accepts them under <file>, but
the current file anonymization flow does not apply file-level discriminators.
Custom Classes
Anonimatron can load your own anonymizers. Put the jar on the Anonimatron classpath. In the binary
distribution, the usual place is the anonymizers directory. Then register the class:
<configuration>
  <anonymizerclass>my.package.ToLowerAnonymizer</anonymizerclass>
  <table name="userdata">
    <column name="firstname" type="TO_LOWER_CASE" />
  </table>
</configuration>
The important part is that the anonymizer returns a type, and that the column uses that same type.
You can also plug in your own file readers and writers. Implement RecordReader or RecordWriter.
If your reader or writer needs configuration, implement ParameterizedRecordReader or
ParameterizedRecordWriter and use readerParameter or writerParameter in the XML.
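Wiring a custom reader and writer into a file element could look like the sketch below. The classes my.package.FixedWidthReader and my.package.FixedWidthWriter and the width parameter are hypothetical. They stand in for your own implementations of the parameterized interfaces and are not part of Anonimatron.

```xml
<file inFile="customers.txt"
      outFile="customers-anonymized.txt"
      reader="my.package.FixedWidthReader"
      writer="my.package.FixedWidthWriter">
  <!-- Passed to the hypothetical reader and writer; the id and value
       are whatever your own classes expect. -->
  <readerParameter id="width">20</readerParameter>
  <writerParameter id="width">20</writerParameter>
  <!-- What name refers to depends on how your reader names its fields. -->
  <column name="1" type="ROMAN_NAME" />
</file>
```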
Command Line Options
The XML file does not contain everything. These runtime options are provided on the command line:
| Option | What it does |
|---|---|
| -config config.xml | Reads the XML configuration file. |
| -synonyms synonyms.xml | Reads and writes the synonym file. If it does not exist, it will be created. |
| -dryrun | Runs database anonymization without writing changes to the database. It can still write a synonym file. |
| -jdbcurl, -userid, -password | Override the matching XML attributes. |
| -configexample | Prints supported JDBC URL formats and a generated example configuration. |
When building your configuration file, start small, anonymize one or two columns, inspect the result, and then add the rest. It is much easier to fix a tiny configuration than a giant one that tries to anonymize everything at once.