With the increasing use of next-generation sequencing, the problem of effectively storing and transmitting such massive amounts of data will need to be addressed. Repositories such as the Sequence Read Archive (SRA) currently use the FASTQ format and a general-purpose compression system (gzip) for data archiving. In this work, we investigate how gzip (and bzip2) can be made more effective for read archiving by pre-sorting the reads. Pre-sorting reduces the compressed size of the sequences alone by up to 12%, and by up to 6% when the original FASTQ data is considered.
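As an illustration of why pre-sorting helps, the sketch below (with synthetic reads; the read length, duplication rate, and counts are assumptions, not figures from the paper) compares gzip's output size on shuffled versus sorted reads. Shuffling pushes duplicate reads outside gzip's 32 KB window, while sorting moves identical and similar reads next to each other:

```python
import gzip
import random

def compressed_size(reads):
    """gzip-compress the newline-joined reads and return the byte count."""
    return len(gzip.compress("\n".join(reads).encode("ascii")))

# Hypothetical data: 1000 distinct random reads, each "sequenced" three
# times, then shuffled so duplicates end up far apart in the file.
random.seed(42)
distinct = ["".join(random.choice("ACGT") for _ in range(50))
            for _ in range(1000)]
reads = distinct * 3
random.shuffle(reads)

# Sorting clusters duplicates (and shared prefixes) within gzip's
# 32 KB sliding window, where the shuffled order often does not.
unsorted_size = compressed_size(reads)
sorted_size = compressed_size(sorted(reads))
```

On this synthetic input the sorted ordering compresses noticeably better; real FASTQ files add quality strings and headers, which is why the full-file gain is smaller than the sequence-only gain.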
New DNA sequencing technologies have achieved breakthroughs in throughput, at the expense of higher error rates. The primary way of interpreting biological sequences is via alignment, but standard alignment methods assume the sequences are accurate. Here, we describe how to incorporate the per-base error probabilities reported by sequencers into alignment. Unlike existing tools for DNA read mapping, our method models both sequencer errors and real sequence differences. This approach consistently improves mapping accuracy, even when the rate of real sequence difference is only 0.2%. Furthermore, when mapping Drosophila melanogaster reads to the Drosophila simulans genome, it increased the proportion of correctly mapped reads from 49% to 66%. This approach enables more effective use of DNA reads from organisms that lack reference genomes, are extinct, or are highly polymorphic.
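The idea of folding sequencer error probabilities into alignment can be sketched as a per-base log-odds score. This is a simplified illustration, not the authors' actual model: the way the mismatch rate is blended with the error probability and the uniform 0.25 background are assumptions made here for clarity:

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred quality score to a probability of sequencing error."""
    return 10 ** (-q / 10)

def base_score(read_base, ref_base, q, diff_rate=0.002):
    """Log-odds score for aligning one read base to one reference base,
    blending sequencer error (from the quality score) with the rate of
    real sequence difference (e.g. 0.2% as in the abstract)."""
    e = phred_to_error_prob(q)
    # Simplified: bases disagree through either a sequencing error
    # or a genuine difference (ignoring errors that match by chance).
    p_mismatch = e * (1 - diff_rate) + (1 - e) * diff_rate
    if read_base == ref_base:
        return math.log((1 - p_mismatch) / 0.25)  # 0.25 = chance agreement
    return math.log((p_mismatch / 3) / 0.25)      # 3 alternative bases

# A low-quality mismatch is penalized less than a high-quality one,
# so unreliable base calls no longer dominate the alignment score.
low_q_mismatch = base_score("A", "C", q=5)
high_q_mismatch = base_score("A", "C", q=40)
match = base_score("A", "A", q=40)
```

The key behaviour is visible in the three scores: matches are rewarded, and mismatches at confidently called bases cost far more than mismatches at dubious ones.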
Background: Visualization tools allow researchers to obtain a global view of the interrelationships between the probes or experiments of a gene expression (e.g. microarray) data set. Some existing methods include hierarchical clustering and k-means. In recent years, others have proposed applying minimum spanning trees (MST) for microarray clustering. Although MST-based clustering is formally equivalent to the dendrograms produced by hierarchical clustering under certain conditions, the two can appear quite different visually.
Methods: HAMSTER (Helpful Abstraction using Minimum Spanning Trees for Expression Relations) is an open source system for generating a set of MSTs from the experiments of a microarray data set. While previous works have generated a single MST from a data set for data clustering, we recursively merge experiments and repeat this process to obtain a set of MSTs for data visualization. Depending on the parameters chosen, each tree is analogous to a snapshot of one step of the hierarchical clustering process. We scored and ranked these trees using one of three proposed schemes. HAMSTER is implemented in C++ and makes use of Graphviz for laying out each MST.
Results: We report on the running time of HAMSTER and demonstrate, using data sets from the NCBI Gene Expression Omnibus (GEO), that the images created by HAMSTER offer insights that differ from the dendrograms of hierarchical clustering. In addition to the C++ program, which is available as open source, we also provide a web-based version (HAMSTER+) that allows users to apply our system through a web browser without any computer programming knowledge.
Conclusions: Researchers may find it helpful to include HAMSTER in their microarray analysis workflow, as it can offer insights that differ from hierarchical clustering. We believe that HAMSTER would be useful for certain types of gradient data sets (e.g., time-series data) and data that indicate relationships between cells/tissues. Both the source code and the web server variant of HAMSTER are available from http://hamster.cbrc.jp/.
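To make the MST idea concrete, here is a minimal sketch (not HAMSTER's actual implementation) that builds a single MST over experiments using Kruskal's algorithm on 1 - Pearson correlation distances; the toy expression profiles are invented:

```python
import math
from itertools import combinations

def pearson_distance(x, y):
    """1 - Pearson correlation, a common dissimilarity for expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1 - cov / (sx * sy)

def minimum_spanning_tree(profiles):
    """Kruskal's algorithm over all pairs of experiments; returns tree edges
    as (i, j, weight) tuples, added in increasing order of weight."""
    parent = list(range(len(profiles)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    edges = sorted((pearson_distance(profiles[i], profiles[j]), i, j)
                   for i, j in combinations(range(len(profiles)), 2))
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:            # keep only edges that join two components
            parent[ri] = rj
            tree.append((i, j, w))
    return tree

# Three made-up experiments: 0 and 1 are strongly correlated, 2 is not.
profiles = [[1.0, 2.0, 3.0, 4.0],
            [1.1, 2.2, 2.9, 4.1],
            [4.0, 1.0, 3.5, 0.5]]
tree = minimum_spanning_tree(profiles)
```

An MST over n experiments always has n - 1 edges; here the first edge joins the two correlated experiments, which is exactly the kind of neighbourhood structure the visualization exposes.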
Probabilistic latent semantic analysis (PLSA) is considered an effective technique for information retrieval, but has one notable drawback: its dramatic consumption of computing resources, in terms of both execution time and internal memory. This drawback limits practical application of the technique to document collections of modest size.
In this paper, we look into the practice of implementing PLSA with the aim of improving its efficiency without changing its output. Recently, Hong et al. have shown how the execution time of PLSA can be improved by employing OpenMP for shared memory parallelization. We extend their work by also studying the effect of combining it with the Message Passing Interface (MPI) for distributed memory parallelization. We show how a more careful implementation of PLSA reduces execution time and memory costs by applying our method to several text collections commonly used in the literature.
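For reference, the EM iterations whose cost motivates parallelization can be sketched as follows. This is a plain serial baseline, not the paper's optimized implementation, and the tiny document-word count matrix is hypothetical; the nested loop over (document, word) pairs is the part that OpenMP threads or MPI ranks would partition:

```python
import random

def plsa(counts, num_topics, iterations=20):
    """Serial EM for PLSA. counts[d][w] is the count of word w in document d.
    Returns P(z|d) and P(w|z) as nested lists."""
    random.seed(0)
    D, W = len(counts), len(counts[0])
    p_z_d = [[random.random() for _ in range(num_topics)] for _ in range(D)]
    p_w_z = [[random.random() for _ in range(W)] for _ in range(num_topics)]
    for row in p_z_d + p_w_z:               # normalize the random start
        s = sum(row)
        row[:] = [v / s for v in row]
    for _ in range(iterations):
        new_z_d = [[0.0] * num_topics for _ in range(D)]
        new_w_z = [[0.0] * W for _ in range(num_topics)]
        for d in range(D):                  # the O(D * W * Z) hot loop
            for w in range(W):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(z|d,w), up to normalization.
                post = [p_z_d[d][z] * p_w_z[z][w] for z in range(num_topics)]
                s = sum(post)
                # M-step accumulation, weighted by the observed count.
                for z in range(num_topics):
                    r = counts[d][w] * post[z] / s
                    new_z_d[d][z] += r
                    new_w_z[z][w] += r
        for row in new_z_d + new_w_z:
            s = sum(row)
            row[:] = [v / s for v in row]
        p_z_d, p_w_z = new_z_d, new_w_z
    return p_z_d, p_w_z

# Two tiny hypothetical documents over a three-word vocabulary.
counts = [[3, 0, 1], [0, 4, 1]]
p_z_d, p_w_z = plsa(counts, num_topics=2)
```

Because each (d, w) cell is processed independently within an iteration, the E/M accumulation distributes naturally, which is why both shared- and distributed-memory parallelization apply.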
This chapter examines some of the available techniques for analyzing a protein interaction network (PIN) when depicted as an undirected graph. Within this graph, algorithms have been developed that identify "notable" smaller building blocks called network motifs. We examine these algorithms by dividing them into two broad categories based on their definition of "notable": (a) statistically-based methods and (b) frequency-based methods. We describe how these two classes of algorithms differ not only in terms of efficiency, but also in terms of the type of results that they report. Some publicly-available programs are demonstrated as part of our comparison. While most of the techniques are generic and were originally proposed for other types of networks, the focus of this chapter is on the application of these methods and software tools to PINs.
(In X.-L. Li and S.-K. Ng, editors, Biological Data Mining in Protein Interaction Networks)
The BM25 similarity computation has been shown to provide effective document retrieval. In operational terms, the formulae which form the basis for BM25 employ both term frequency and document length normalization. This paper considers an alternative form of normalization using document-centric impacts, and shows that the new normalization simplifies BM25 and reduces the number of tuning parameters. Motivation is provided by a preliminary analysis of a document collection which shows that impacts are more likely to identify documents whose lengths resemble those of the relevance judgments. Experiments on TREC data demonstrate that impact-based BM25 is as good as or better than the original term frequency-based BM25 in terms of retrieval effectiveness.
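For context, a standard BM25 term weight looks like the sketch below; the k1 and b defaults are conventional values and the exact idf form varies between presentations. The paper's impact-based variant replaces the term-frequency and document-length normalization (and with it the parameter b) with a precomputed document-centric impact:

```python
import math

def bm25_weight(tf, doc_len, avg_doc_len, df, num_docs, k1=1.2, b=0.75):
    """Standard BM25 weight for one term in one document:
    idf times a saturating, length-normalized term frequency."""
    idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1)
    norm_tf = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf

# Same term frequency, different document lengths: the longer document
# is normalized downward, which is the behaviour impacts replace.
short_doc = bm25_weight(tf=3, doc_len=50, avg_doc_len=100, df=10, num_docs=1000)
long_doc = bm25_weight(tf=3, doc_len=200, avg_doc_len=100, df=10, num_docs=1000)
```

Removing the length term collapses the query-time formula to idf times a stored impact, which is where the reduction in tuning parameters comes from.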
Microarrays are high-throughput technologies whose data are known to be noisy. In this work, we propose a graph-based method which first identifies the extent to which a single microarray experiment is noisy and then applies an error function to clean individual expression levels. These two steps are unified within a framework based on a graph representation of a separate data set obtained from a repository.
We demonstrate the utility of our method by comparing our results against statistical methods by applying both techniques to simulated microarray data. Our results are encouraging and indicate one potential use of microarray data from past experiments.
This report describes the joint work by Kyoto University and the University of Melbourne for the TREC Genomics Track in 2007. As with 2006, the task for this year was the retrieval of passages from a biomedical document collection. The overall framework of our system from 2006 remains unchanged and is comprised of two parts: a paragraph-level retrieval system and a passage extraction system. These two systems are based on the vector space model and a probabilistic word-based aspect model, respectively. This year, we made numerous changes to our 2006 system which we believe corrected some of its problems.
To bound memory consumption, most compression systems provide a facility that controls the amount of data that may be processed at once -- usually as a block size, but sometimes as a direct megabyte limit. In this work we consider the Re-Pair mechanism of Larsson and Moffat, which processes large messages as disjoint blocks in order to limit memory consumption. We show that the blocks emitted by Re-Pair can be post-processed to yield further savings, and describe techniques that allow files of 500 MB or more to be compressed in a holistic manner using less than that much main memory. The block merging process we describe has the additional advantage of allowing new text to be appended to the end of the compressed file.
Note: Extended and updated version of the CPM 2001 paper and the PhD thesis (Chapter 5).
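A minimal sketch of the core Re-Pair idea follows (ignoring the priority-queue machinery and the memory bounds that the actual mechanism of Larsson and Moffat depends on): repeatedly replace the most frequent adjacent pair of symbols with a fresh nonterminal, recording each replacement as a grammar rule.

```python
from collections import Counter

def repair(text):
    """Toy Re-Pair: replace the most frequent adjacent pair with a new
    symbol until no pair occurs more than once. Returns the reduced
    symbol sequence and the dictionary of rules."""
    tokens = list(text)
    rules = {}
    next_id = 0
    while True:
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break
        new = ("R", next_id)        # fresh nonterminal symbol
        next_id += 1
        rules[new] = pair
        out, i = [], 0
        while i < len(tokens):      # left-to-right replacement pass
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(new)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens, rules

def expand(tokens, rules):
    """Undo the replacements to recover the original symbols."""
    out = []
    for t in tokens:
        if t in rules:
            out.extend(expand(rules[t], rules))
        else:
            out.append(t)
    return out

sequence, rules = repair("abracadabra")
restored = "".join(expand(sequence, rules))
```

Each block of a large message yields such a dictionary plus a reduced sequence; the block merging described above works because rules from adjacent blocks overlap and can be consolidated.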
This report summarizes the work done at Kyoto University and the University of Melbourne for the TREC 2006 Genomics Track. The single task for this year was to retrieve passages from a biomedical document collection. We devised a system made of two parts to deal with this problem. The first part was an existing IR system based on the vector-space model. The second part was a newly developed probabilistic word-based aspect model for identifying passages within relevant documents (or paragraphs).
Biological data presents unique problems for data analysis due to its high dimensionality. Microarray data is one example of such data and has received much attention in recent years. Machine learning algorithms such as support vector machines (SVMs) are ideal for microarray data due to their high classification accuracy. However, sometimes the information being sought is a list of genes which best separates the classes, and not a classification rate.
Decision trees are one alternative; they do not perform as well as SVMs, but their output is easily understood by non-specialists. A major obstacle to applying current decision tree implementations to high-dimensional data sets is their tendency to assign the same scores to multiple attributes. In this paper, we propose two distribution-dependent criteria for decision trees to improve their usefulness for microarray classification.
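The tie problem is easy to reproduce: with thousands of genes and only a handful of samples, many genes induce exactly the same partition of the samples and therefore receive identical split scores. A small sketch with invented labels, using the usual information-gain criterion:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(l) for l in set(labels)))

def information_gain(values, labels):
    """Gain from splitting the labels on a discrete attribute."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Four hypothetical samples; both genes separate the classes perfectly,
# so a standard criterion cannot choose between them.
labels = ["tumor", "tumor", "normal", "normal"]
gene_a = [1, 1, 0, 0]   # discretized expression of one gene
gene_b = [0, 0, 1, 1]   # a different gene inducing the identical partition
gain_a = information_gain(gene_a, labels)
gain_b = information_gain(gene_b, labels)
```

Both genes score an identical (here maximal) gain, so the tree builder must break the tie arbitrarily; criteria that also consider the attribute's value distribution can separate such candidates.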
This paper proposes a method for cleaning the noise found in microarray expression data sets. While other methods either concentrate on the image processing phase, or apply normalization techniques to the resulting microarray raw expression values, we introduce a method based on Markov random fields for the data set. The cleaning process is guided by genes with similar expression profiles. As a result, data cleaning is conducted using biological similarity of genes.
Compression and information retrieval are two areas of document management that exist separately due to the conflicting methods of achieving their goals. This research examines a mechanism which provides lossless compression and phrase-based browsing and searching of large document collections. The framework for the investigation is an existing off-line dictionary-based compression algorithm.
An analysis of the algorithm, supported by previous work and experiments, highlights two factors that are important for retrieval: efficient decoding, and a separate dictionary stream. However, three areas of improvement are necessary prior to the inclusion of the algorithm in a browsing system.
First, in order to accommodate retrieval, the algorithm must produce a dictionary built up on words, rather than characters. A pre-processing stage is introduced which separates the message into words and non-words, along with word modifiers.
Second, the memory requirements of the algorithm prevent the processing of large documents. Earlier work has proposed a solution which separates the message into individual blocks prior to compression. Here, a post-processing stage is proposed which combines the blocks in a series of phases. Experiments show the trade-offs between the number of phases performed and the improvements in compression levels.
Organisations that provide users with access to an information retrieval system are sometimes concerned with the amount of disk space available. Users, however, have a different point of view and may place response time at a higher priority. Indeed, faster computers and network connections translate into plummeting patience levels among users. The final improvement to the compression algorithm consists of two new coding schemes that replace the entropy coder used in previous work. While deploying them sacrifices some compression effectiveness, these two mechanisms offer improved efficiency, as shown through experiments.
With the enhancements to the compression algorithm in hand, a technique to efficiently support phrase browsing is presented. Phrase contexts can be searched and progressively refined through the word modifiers. Because of the three changes to the algorithm, phrases are more visually appealing, larger documents can be processed, and response times are improved.
An experimental framework for the evaluation of statistically generated phrases is described. Two methods are considered. The first compares the phrases with those found by a natural language processing system, while the second examines the ability to locate all instances of each phrase. The motivation for this work is to determine the usefulness of statistically generated phrases in a document retrieval system.
To bound memory consumption, most compression systems provide a facility that controls the amount of data that may be processed at once. In this work we consider the Re-Pair mechanism of Larsson and Moffat, which processes large messages as disjoint blocks. We show that the blocks emitted by Re-Pair can be post-processed to yield further savings, and describe techniques that allow files of 500 MB or more to be compressed in a holistic manner using less than that much main memory. The block merging process we describe has the additional advantage of allowing new text to be appended to the end of the compressed file.
We describe a software system for managing text files of up to several hundred megabytes that combines a number of useful facilities. First, the text is stored compressed using a variant of the Re-Pair mechanism described by Larsson and Moffat, with space savings comparable to those obtained by other widely used general-purpose compression systems. Second, we provide, as a byproduct of the compression process, a phrase-based browsing tool that allows users to explore the contents of the source text in a natural and useful manner. And third, once a set of desired phrases has been determined through the use of the browsing tool, the compressed text can be searched to determine locations at which those phrases appear, without decompressing the whole of the stored text, and without use of an additional index. That is, we show how the Re-Pair compression regime can be extended to allow phrase-based browsing and fast interactive searching.
Providing the infrastructure that supports the World-Wide Web is expensive. The costs incurred in setting up a web site include those associated with the content being served; those associated with the hardware necessary to support the site; and the network costs incurred in transmitting that content to the end consumers. In this work we examine mechanisms for compressing web content so as to reduce the third of these three costs, and describe a scheme that exploits the known connectivities between web pages to derive improved transmission cost savings compared to the obvious approach of simply compressing each page on the site using a standard tool such as gzip. Experiments on a medium-sized web site confirm our claims that considerable reductions in network bandwidth requirements can be achieved.
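The benefit of exploiting relationships between pages can be demonstrated with deflate's preset-dictionary facility; this is only an analogy to the scheme described above, and the page contents are invented. Priming the compressor with a linked page lets it reference that page's shared boilerplate instead of re-encoding it:

```python
import zlib

def compress_with_dict(page, dictionary=None):
    """Deflate `page`, optionally priming the window with a preset
    dictionary (here, a related page with similar boilerplate)."""
    if dictionary:
        c = zlib.compressobj(level=9, zdict=dictionary)
    else:
        c = zlib.compressobj(level=9)
    return c.compress(page) + c.flush()

# Two hypothetical pages from one site, sharing navigation boilerplate.
template = b"<html><head><title>%s</title></head><body>" + b"menu " * 200
page_a = template % b"Home" + b"Welcome to the site.</body></html>"
page_b = template % b"About" + b"All about us.</body></html>"

plain = compress_with_dict(page_b)
primed = compress_with_dict(page_b, dictionary=page_a)
```

The primed stream is smaller because the shared template is encoded as back-references into the dictionary; a decompressor must of course be given the same dictionary, which is the cost of connectivity-aware schemes.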