Résumé. Updated August 2015.

David Soergel, Ph.D.

28 Mountain Laurel Path
Northampton, MA 01062
(650) 303-5324 (c)
(413) 282-9961 (h)



Software engineer and research scientist with recent experience in computational biology, metagenomics, machine learning, natural language processing, bibliometrics, and open access advocacy. Expert in large-scale data management, database design, and cluster computing. Experienced in project management and system administration. Proficient with a wide variety of computing technologies and platforms. Effective team player; also able to complete entire projects independently, from conception through launch.


A software engineering position involving design and implementation of systems for data management and analysis, ideally regarding natural sciences, environmental conservation, or renewable energy. Alternatively, an academic position in computer science, particularly involving computational biology and large-scale computing.

Education and Appointments

California Institute of Technology (1993 - 1995)

Stanford University (1995 - 1998)

University of California, Berkeley (2003 - 2010)

University of Colorado, Boulder and University of California, Berkeley (2011)


Effective communicator in spoken and written English. Skilled at discussing technology projects with non-technical clients. Fluent in German.

Work Experience

2013 -

Software Engineer, Google, Inc.

2011 - 2013

Research Scientist and Software Engineer, Information Extraction and Synthesis Laboratory (McCallum lab), School of Computer Science, University of Massachusetts Amherst.

Large-scale machine learning infrastructure for natural language processing. Advocate for open access and open evaluation of research literature; principal architect, openreview.net.

2001 - 2003

Lead Bioinformatics Developer, The Molecular Sciences Institute.

Databases and web applications supporting basic research in biology.

1999 - 2003

Founder and Principal, Asha Technologies.

Consulting firm focusing on database-driven web applications for socially beneficial purposes.

2000 - 2001

Co-founder and Director of Research and Development, Little Engine, Inc.

Information technology for preschool teachers and parents.

1999 - 2000

Research Associate, Cavalli-Sforza lab, Department of Genetics, Stanford University School of Medicine.

Databases and software for analyzing the geographic distributions of human genes.

1998 - 1999

Vice President for Technology, Padra.org

1997 - 1999

Software Developer, Science and Technology in the Making, Stanford University Libraries

Summer 1996

Research Assistant, Institute for Scientific Computing Research, Lawrence Livermore National Laboratory

Summer 1994

Samuel P. and Frances Krown Summer Undergraduate Research Fellow, San Onofre/Palo Verde Neutrino-Oscillation Experiment, Caltech

Summer 1992

Summer Intern, Deutsches Elektronen-Synchrotron (DESY), Hamburg, Germany

Selected Open-Source Software (Scala)

worldmake. WorldMake is a system for executing computational workflows, tracking provenance, and keeping derived results up to date. Its initial purpose is to provide for reproducibility in scientific research. It can also be used for software compilation, package management, continuous integration, and testing.

iesl-sbt-base. SBT plugin providing all manner of boilerplate, so that the Build.scala file for a project can be trivially short. Includes simplified dependency resolution with automatic updating; clarity on what transitive dependencies are used; and unified logging configuration.

namejuggler. Normalizer for person names, parsing a wide variety of name formats into constituent components.

bibmogrify (open-source release planned). A general framework for translating and processing citation metadata, with plugins for reading and writing a variety of formats. Streaming and concurrent operation allows rapid processing of very large datasets (commonly runs on tens of millions of records).

pdf2meta (open-source release planned). Extracts citation metadata from PDFs of scholarly articles, on the basis of numerous text and layout features.

Selected Open-Source Software (Java)

jLibSVM. Heavily refactored Java port of LIBSVM, providing efficient training of Support Vector Machines. Provides many new features, including a fully generified API; the ability to add custom kernels for arbitrary data types; and integrated scaling and normalization.

ml. Generic machine learning package. Provides a framework for supervised and unsupervised clustering (both online and batch), and currently implements naive Bayesian, k-NN, K-means, and Kohonen SOM clustering. Computes Variable Memory Markov models (aka Probabilistic Suffix Trees) on strings. Also, implements various Monte Carlo methods, including Metropolis-coupled MCMC.

conja. Library providing functional concurrency in Java. Conja lets code take advantage of multicore processors with no configuration and minimal code changes. Schedules nested concurrent tasks in a memory-efficient depth-first manner.

phyloutils. Provides data structures for weighted phylogenetic trees, and various operations on such trees. Includes phylogenetic alpha and beta diversity measures such as Weighted UniFrac.

pdftank. Automatically navigate journal web sites to download and cache full-text PDFs.

Selected Open-Source Software (Perl)

s3napback. Cycling, incremental, compressed, encrypted backups to Amazon S3.

RTAX. Rapid and accurate taxonomic classification of short paired-end sequence reads from the 16S ribosomal RNA gene. Available as part of the QIIME microbial ecology pipeline.

Teaching and Mentoring

Research Mentor for graduate rotation students, undergraduate research assistants, and software developers. (5 total, 2005-2010)

Graduate Student Instructor for Microbial Genetics and Genomics, U.C. Berkeley (2007)


Grants and Awards

Chang-Lin Tien Scholar in Environmental Sciences and Biodiversity, UC Berkeley. (2008-2010)

Contributing author to a successful NIH R01 grant to Rob Knight. (2011)

Predoctoral Fellow, Howard Hughes Medical Institute. (2003-2008)

National Defense Science and Engineering Graduate Fellowship. (2003, declined)

Caltech and Stanford Summer Undergraduate Research Fellowships. (1994, 1995, 1997)

Caltech Merit Awards. (1994, 1995)

Robert Andrews Millikan Scholar, Caltech. (1993)

Travel Awards. NAS Sackler Colloquium on Tapestry of Life, Irvine, CA (2005); 14th International Conference on Microbial Genomes, Lake Arrowhead, CA (2006).


Publications

Soergel DAW. (2015). Rampant software errors may undermine scientific results. F1000Research 3: 303. Full Text, PDF, Reviews and Discussion

Soergel DAW, Saunders AC, McCallum A. (2013). Open Scholarship and Peer Review: a Time for Experimentation. ICML Workshop on Peer Reviewing and Publishing Models (PEER). PDF, Discussion

Dey N, Soergel DAW, Repo S, Brenner SE. (2013). Association of gut microbiota with post-operative clinical course in Crohn's disease. BMC Gastroenterology 13: 131. Full text, PDF

F1000 Recommended

Soergel DAW, Dey N, Knight R, Brenner SE. (2012). Selection of primers for optimal taxonomic classification of environmental 16S rRNA gene sequences. The ISME Journal 6: 1440-1444. Full text, PDF

Yooseph S, Sutton G, Rusch DB, … Soergel DAW, … Venter JC. (2007). The Sorcerer II global ocean sampling expedition: expanding the universe of protein families. PLoS Biology 5: e16. Full text

Lareau LF, Brooks AN, Soergel DAW, Meng Q, Brenner SE. (2007). The coupling of alternative splicing and nonsense-mediated mRNA decay. In Blencowe B and Graveley B, ed., Alternative splicing in the post-genomic era (pp. 190-211), Landes Bioscience. PDF

Soergel DAW, Lareau LF, Brenner SE. (2006). Regulation of gene expression by the coupling of alternative splicing and nonsense-mediated mRNA decay. In Maquat L, ed., Nonsense-mediated mRNA decay (pp. 175-196), Landes Bioscience. PDF

Soergel DAW, Choi K, Thomson T, Doane J, George B, Morgan-Linial R, Brent R, Endy D. (2004). MONOD, a collaborative tool for manipulating biological knowledge. Working paper

Posters and Presentations

Computational approaches to evaluating microbial diversity. UC Berkeley campus seminar in environmental microbiology. (2008)

Sequence compositional biases and microbial diversity. UC Berkeley Graduate Group in Genomic and Computational Biology Retreat. (2008)

Explorations in environmental and medical metagenomics. Metagenomics 2007, San Diego, CA. (2007)

Explorations in environmental and medical metagenomics. HHMI predoctoral fellows meeting, Chevy Chase, MD. (2006)

Interpreting metagenomic data using oligonucleotide signatures. Metagenomics 2006, San Diego, CA. (2006)

Interpreting metagenomic data using oligonucleotide signatures. California Metagenomics Workshop, Berkeley, CA. (2006)

Interpreting environmental sequence data using oligonucleotide signatures. 14th International Conference on Microbial Genomes, Lake Arrowhead, CA. (2006)

MONOD, a collaborative tool for manipulating biological knowledge. Formal Languages for Biological Processes, CSHL Banbury Center, Cold Spring Harbor, NY. (2003)

MONOD, the modeller's notebook and datastore. DARPA BioComp PI meeting, Washington, DC. (2002)

Human Gene Geography: a database of human genome variation. CSHL Conference on Human Evolution, Cold Spring Harbor, NY. (1999)