California Institute of Technology (1993 - 1995)
Stanford University (1995 - 1998)
University of California, Berkeley (2003 - 2010)
University of Colorado, Boulder and University of California, Berkeley (2011)
Research Scientist and Software Engineer, Information Extraction and Synthesis Laboratory (McCallum lab), School of Computer Science, University of Massachusetts Amherst.
Large-scale machine learning for natural language processing. Open access and open evaluation of research literature; project lead, OpenReview.net.
Lead Bioinformatics Developer, The Molecular Sciences Institute.
Research Associate, Cavalli-Sforza lab, Department of Genetics, Stanford University School of Medicine.
Software Developer, Science and Technology in the Making, Stanford University Libraries.
Research Assistant, Institute for Scientific Computing Research, Lawrence Livermore National Laboratory.
Summer Intern, Deutsches Elektronen-Synchrotron (DESY), Hamburg, Germany.
Open scholarship and reproducible research.
OpenReview.net, a platform for open evaluation of scholarly articles. OpenReview.net aims to promote openness in scientific communication, particularly regarding the peer review process. We are implementing a platform for peer review that generalizes over many subtle gradations of openness, allowing conference organizers, journals, and other “reviewing entities” to configure the specific policy of their choice. We intend to act as a testbed for different policies, to help scientific communities experiment with open scholarship while addressing legitimate concerns regarding confidentiality, attribution, and bias. We are collaborating with sociologists in this investigation. Our initial focus is on computer science conferences; to date our system has provided paper submission, reviewing, and public discussion for ICLR 2013, ICLR 2014, ICML/Inferning 2013, ICML/Peer Review 2013, and AKBC 2013.
WorldMake.org, a versioned data analysis tool for reproducible research. WorldMake is a system for describing, sharing, and executing computational workflows in a manner that guarantees reproducible results. It provides a means of ensuring that a set of computational results is up to date with respect to the inputs, that the results are internally consistent, and that their provenance is rigorously tracked. It also provides a means of sharing inputs, intermediate results, and final outputs, so as to facilitate collaboration while avoiding redundant computation. A predecessor of this system drove all of the computations for my dissertation, involving on the order of one million digital artifacts (i.e., files containing inputs, intermediate results, and outputs), and requiring weeks of computation on a large cluster.
MONOD, a collaborative tool for manipulating biological knowledge. MONOD (for “Modeler's Notebook and Datastore”) was a web application designed to capture and communicate knowledge generated during the process of building models of many-component biological systems. We used MONOD to construct a model of the pheromone response signaling pathway of Saccharomyces cerevisiae. MONOD allowed the accumulation, documentation, and exchange of data, valuations, assumptions, and decisions generated during the model building process. MONOD thus helped preserve a record of the steps taken on the path from the experimental data to the computable model. Our goals were to streamline the processes of building models, communicating with other researchers, and managing and manipulating biological knowledge. Once fully realized, “collaborative annotation”—fine-grained, structured, searchable communication enabled by software tools of this type—promises to enhance the practice of research in every field of science and engineering.
Information extraction from scholarly literature.
Extraction of citation metadata from PDFs of scholarly articles, using numerous text and layout features.
Frameworks for translating and processing citation metadata, with plugins for reading and writing a variety of formats. Streaming and concurrent operation allows rapid processing of very large datasets (commonly tens of millions of records).
Normalizing person names, parsing a wide variety of name formats into constituent components.
Concurrent programming.
Microbial ecology and metagenomics.
Research Mentor for graduate rotation students, undergraduate research assistants, and software developers. (5 total, 2005-2010)
Graduate Student Instructor for Microbial Genetics and Genomics, UC Berkeley. (2007)
Chang-Lin Tien Scholar in Environmental Sciences and Biodiversity, UC Berkeley. (2008-2010)
Contributing author to a successful NIH R01 grant to Rob Knight. (2011)
Predoctoral Fellow, Howard Hughes Medical Institute. (2003-2008)
Caltech and Stanford Summer Undergraduate Research Fellowships. (1994, 1995, 1997)
Caltech Merit Awards. (1994, 1995)
Robert Andrews Millikan Scholar, Caltech. (1993)
Travel Awards. NAS Sackler Colloquium on Tapestry of Life, Irvine, CA (2005); 14th International Conference on Microbial Genomes, Lake Arrowhead, CA (2006).
Soergel DAW. (2015). Rampant software errors may undermine scientific results. F1000Research 3: 303.
Soergel DAW, Saunders AC, McCallum A. (2013). Open Scholarship and Peer Review: a Time for Experimentation. ICML Workshop on Peer Reviewing and Publishing Models (WPEER).
Yooseph S, Sutton G, Rusch DB, … Soergel DAW, … Venter JC. (2007). The Sorcerer II global ocean sampling expedition: expanding the universe of protein families. PLoS Biology 5: e16.
Lareau LF, Brooks AN, Soergel DAW, Meng Q, Brenner SE. (2007). The coupling of alternative splicing and nonsense-mediated mRNA decay. In Blencowe B and Graveley B, eds., Alternative splicing in the post-genomic era (pp. 190-211), Landes Bioscience.
Soergel DAW, Lareau LF, Brenner SE. (2006). Regulation of gene expression by the coupling of alternative splicing and nonsense-mediated mRNA decay. In Maquat L, ed., Nonsense-mediated mRNA decay (pp. 175-196), Landes Bioscience.
Soergel DAW, Choi K, Thomson T, Doane J, George B, Morgan-Linial R, Brent R, Endy D. (2004). MONOD, a collaborative tool for manipulating biological knowledge. Working paper
Computational approaches to evaluating microbial diversity. UC Berkeley campus seminar in environmental microbiology. (2008)
Sequence compositional biases and microbial diversity. UC Berkeley Graduate Group in Genomic and Computational Biology Retreat. (2008)
Explorations in environmental and medical metagenomics. Metagenomics 2007, San Diego, CA. (2007)
Explorations in environmental and medical metagenomics. HHMI predoctoral fellows meeting, Chevy Chase, MD. (2006)
Interpreting metagenomic data using oligonucleotide signatures. Metagenomics 2006, San Diego, CA. (2006)
Interpreting metagenomic data using oligonucleotide signatures. California Metagenomics Workshop, Berkeley, CA. (2006)
Interpreting environmental sequence data using oligonucleotide signatures. 14th International Conference on Microbial Genomes, Lake Arrowhead, CA. (2006)
MONOD, a collaborative tool for manipulating biological knowledge. Formal Languages for Biological Processes, CSHL Banbury Center, Cold Spring Harbor, NY. (2003)
MONOD, the modeller's notebook and datastore. DARPA BioComp PI meeting, Washington, DC. (2002)
Human Gene Geography: a database of human genome variation. CSHL Conference on Human Evolution, Cold Spring Harbor, NY. (1999)