Computer programs have trouble interpreting data intended for humans. Humans try to get around
this problem by coming up with conventions, languages, and binary formats to capture meaning for a subset of data.
This does not solve the problem for good, just lets us mostly get on with our work.
Such behaviour is illustrated by the creation of large databases filled with information distilled from information-rich, readily-available sources.
This idea is analogous to Yahoo! in the early days of the internet. Yahoo! attempted to classify the web as curated collections of logical links. The approach was doomed to fail, mostly because online search (e.g. Google) greatly out-performed link aggregators in both performance and relevancy: search could automatically distill the data down to whatever the users asked for.
I've written some software ("libExplicator" and "libDemarcator") that attacks this issue in a specific domain: radiotherapy patient data. Instead of attempting to classify and aggregate patient data into logical links, my software lets users perform online search for whatever they are looking for. (We're starting somewhere small, but well-defined and practical.)
Contour data is noisy. Humans have all sorts of naming schemes for things. We don't like to type out "Left Parotid" and so instead invent clever nicknames like "l_par" or "lfet partoid" (sic). When a researcher needs to process thousands of patient files, it becomes impractical to deal manually with the clever names in each data set. This is the problem that libExplicator and libDemarcator can solve. Here's a workflow diagram showing how my software fits in:
The flow of data from a DICOM file -- geometrical and lexicographical contour data, in this case -- is shown to confuse the analysis program on the right. It is a one-off program which performs some wonderfully complex dosimetric computations. It just can't handle things like "lfet partoid". And it shouldn't: implementing clever-name-deciphering logic in every one-off program would be even worse than sorting out the data manually.
My solution is to ask the user to provide some domain knowledge up front, draw as much information from that domain knowledge as reasonably possible, provide an easy library interface for using that knowledge to translate messy data, and try to improve the translation as additional information is added.
Then the one-off program above will effectively be able to read any DICOM-format data given to it, without having to be explicitly programmed to perform any nitty-gritty translations. In a nutshell, libExplicator and libDemarcator are like a fuzzy Rosetta Stone for radiotherapy applications. If somebody, somewhere along the line, accidentally puts a "lfet partoid" into the data, the one-off program will easily be able to handle it. In fact, the translation filter means the "lfet partoid" will be completely hidden from the one-off program. The only thing it will encounter is a nicely-formatted "Left Parotid".
LibExplicator and libDemarcator are part of the DICOMautomaton software suite, and can identify unknown contoured structures in DICOM files. LibExplicator works on contour labels; libDemarcator works on contoured volumes. They are both written in C++, and are both GPLv3 software.
LibExplicator (the contour name recognizer) is available on GitHub here. The libExplicator approach is lightweight and generally works quite well. LibDemarcator (the geometrical tool) works better in some cases, but requires drastically more storage space and computation time. Its code may be released in the future; please enquire for early access or comments.