This is a discussion group for people working in information retrieval, data mining, document computing, social media, and similar fields. Everyone's welcome.
We meet every second Monday, from 4-5pm, in the CSIRO seminar room in the ANU CS&IT building. CSIRO are kind enough to provide wine (and soft drinks) and cheese. We occasionally have seminars and events at other times as well.
"IR and friends" aims to encourage discussion between people working in similar fields; to provide a venue for feedback on work in progress; and to get people from different groups talking to each other.
Want to present something? If there's any work in progress (or completed, or not really started) which you would like to share over wine and cheese, please let Paul or Gaya know. You don't need finished work, and you don't need an hour's worth of fancy slides, just a willingness to talk about what you're up to. The idea is to get discussion going, not to present eternal verities.
Query-biased summaries, where document summaries are modified to take into account a user's query, have been very useful when searching amongst collections of text. However, in applications such as data portals the collections are of tabular data—e.g. spreadsheets—and there is no similar support.
We describe a method for producing query-biased summaries of tabular data, which aims to support a user's decision whether or not to download a data set, or even to answer the question on the spot, with no further interaction. The method infers simple types in the data and query; automatically refines queries, where that makes sense; extracts relevant subsets of the complete table; and generates both graphical and tabular summaries of what remains. A small-scale user study suggests this both helps users identify useful results (fewer false negatives), and reduces wasted downloads (fewer false positives).
This is joint work with Vincent Au (ANU) and Gaya Jayasinghe (CSIRO).
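To give a flavour of the idea in the abstract above, here is a minimal sketch of type inference plus subset extraction on a small table. It is not the authors' implementation: the function names (`infer_type`, `cell_matches`, `query_biased_summary`), the two-type scheme (number or text), and the rule that a row is kept only if every non-header query term matches one of its cells are all assumptions made for illustration. Query refinement and graphical summaries are left out.

```python
def infer_type(value):
    """Guess a simple type for a cell or query token (assumed scheme: number or text)."""
    try:
        float(value)
        return "number"
    except ValueError:
        return "text"


def cell_matches(cell, token):
    """Compare numerically when both sides look like numbers, else as lower-cased text."""
    if infer_type(cell) == "number" and infer_type(token) == "number":
        return float(cell) == float(token)
    return cell.strip().lower() == token


def query_biased_summary(header, rows, query):
    """Return (sub_header, sub_rows): the part of the table that the query touches."""
    tokens = query.lower().split()
    header_names = {name.lower() for name in header}
    # Columns named in the query are always shown.
    col_hits = {i for i, name in enumerate(header) if name.lower() in tokens}
    # The remaining tokens are treated as values to look for in the cells.
    value_tokens = [t for t in tokens if t not in header_names]

    kept_rows = []
    for row in rows:
        # Keep a row only if every value token matches some cell in it.
        if all(any(cell_matches(c, t) for c in row) for t in value_tokens):
            kept_rows.append(row)
            # Also show the columns where the matches occurred.
            col_hits |= {
                i for i, c in enumerate(row)
                if any(cell_matches(c, t) for t in value_tokens)
            }

    cols = sorted(col_hits) or list(range(len(header)))
    return [header[i] for i in cols], [[r[i] for i in cols] for r in kept_rows]


# Made-up example data, for illustration only.
header = ["state", "year", "population"]
rows = [
    ["ACT", "2015", "390000"],
    ["NSW", "2015", "7600000"],
    ["ACT", "2016", "400000"],
]
summary = query_biased_summary(header, rows, "population act 2016")
```

With the made-up table above, the query names one column and two values, so the summary keeps the single row that matches both values. Because the comparison is type-aware, a cell holding "2016.0" would match the query term "2016" as well.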
Vertical scrolling is the standard method of exploring search results pages. For touch-enabled mobile devices that are not equipped with a mouse or keyboard, we adopt other methods of controlling the viewport with the aim of investigating search interaction. From the intuition that people are used to reading books by turning pages horizontally, we conducted a user experiment to investigate the effects of horizontal and vertical control types (pagination versus scrolling) on a touch-enabled mobile phone. Our findings suggest that pagination improves search over scrolling, despite scrolling being more familiar. The main reason for this is the time taken for the scroll itself. Participants using scrolling also spend less time reading lower-ranked results, with lower search accuracy, even if this is where relevant documents are found. We conclude that search engines need to provide different viewport controls to allow a better search experience on touch-enabled mobile devices.
Here's what we did earlier:
We propose a method to generate simulated text corpora of arbitrary size. Such corpora are potentially useful when working with private data or as a means to reproducible studies of the efficiency and scalability of retrieval algorithms. For eight different corpora we extract attributes and model the distributions of both term frequencies (piecewise linear with special treatment of head and tail) and document lengths (Gaussian). We model how those attributes and distributions change across samples of a corpus as the samples grow from 1% to 100% of the parent.
We use the above models and a synthetic collection generator (code to be made available as open source) to emulate each of the corpora. Our generator creates documents comprising synthetic words in random order and very accurately mimics the vocabulary size and term probability distribution of the base corpus. Using the static model for a 1% corpus and applying a generic growth model derived from multiple corpora, we are able to emulate key parameters of the original 100% corpus with reasonable accuracy.
Synthetic collections from our generator potentially allow exactly reproducible efficiency experiments and accurate study of algorithmic scalability. They avoid the normal confounds of differences in tokenization and character set conversion of normal text. Our generator mimics a real corpus with sufficient fidelity to evaluate core aspects of indexing efficiency around postings list lengths, document table and term table. With important provisos, this can be done with negligible interference to CPU and memory caches, allowing on-the-fly generation internal to an indexer. Generated corpora can be tailored to the sizes and characteristics needed for specific experiments. They can be shared with other researchers by communicating less than a kilobyte, even if derived from a private corpus.
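The general shape of such a generator can be sketched in a few lines. This is an illustration under stated assumptions, not the generator described in the abstract: it swaps in a plain Zipf-style rank distribution for the piecewise linear term-frequency model, keeps the Gaussian document lengths, and omits the growth model entirely. The function name `generate_corpus` and all of its parameters are hypothetical.

```python
import itertools
import random


def generate_corpus(num_docs, vocab_size, mean_len, sd_len, zipf_alpha=1.0, seed=0):
    """Emit num_docs documents, each a list of synthetic words in random order.

    Assumptions for this sketch: term probabilities follow 1 / rank**zipf_alpha
    (a stand-in for a fitted model), and document lengths are Gaussian,
    clipped so every document has at least one word.
    """
    rng = random.Random(seed)  # a fixed seed makes the corpus exactly reproducible
    vocab = [f"t{rank}" for rank in range(1, vocab_size + 1)]  # synthetic words
    weights = [1.0 / rank ** zipf_alpha for rank in range(1, vocab_size + 1)]
    cum_weights = list(itertools.accumulate(weights))  # computed once, reused per document

    docs = []
    for _ in range(num_docs):
        length = max(1, round(rng.gauss(mean_len, sd_len)))
        docs.append(rng.choices(vocab, cum_weights=cum_weights, k=length))
    return docs


docs = generate_corpus(num_docs=100, vocab_size=50, mean_len=20, sd_len=5, seed=42)
```

The whole corpus is determined by the six arguments, which is the property that lets a generated collection be shared by passing on a handful of parameters rather than the text itself.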
"IR and friends" first met on 26 April 2006, when Peter Christen and Tom Rowlands talked about their work-in-progress. Since then we've had over 150 regular talks from universities, government, and industry; plus special talks from visiting colleagues around the world. Help us celebrate ten years of research, discussion, and community.
We will mark the occasion with short talks, reflecting on research and practice, from Tom Rowlands (Australian Crime Commission), Simon Kravis (2XX), Robert Power (CSIRO), Tom Gedeon (ANU), and David Hawking (Microsoft). There'll also be cake.
We also met from 2006 to 2015.