This is a discussion group for people working in information retrieval, data mining, document computing, social media, and similar fields. Everyone's welcome.
We meet every second Monday, from 4-5pm, in the CSIRO seminar room in the ANU CS&IT building. CSIRO are kind enough to provide wine (and soft drinks) and cheese. We occasionally have seminars and events at other times as well.
"IR and friends" aims to encourage discussion between people working in similar fields; to provide a venue for feedback on work in progress; and to get people from different groups talking to each other.
Want to present something? If there's any work in progress (or completed, or not really started) which you would like to share over wine and cheese, please let Paul or Gaya know. You don't need finished work, and you don't need an hour's worth of fancy slides, just a willingness to talk about what you're up to. The idea is to get discussion going, not to present eternal verities.
Computer malware has been around since the floppy disc era, surged when email became commonplace, and received a further boost from the possibility of anonymous Internet financial transactions. Malware’s latest variant, encrypting ransomware, has proved a cybercrime gold mine, by restricting computer users' access to information. This talk examines a particular ransomware infection, how it was (fortuitously) dealt with, and what protective measures can be taken.
Vertical scrolling is the standard method of exploring search results pages. For touch-enabled mobile devices that are not equipped with a mouse or keyboard, we adopt other methods of controlling the viewport with the aim of investigating search interaction. From the intuition that people are used to reading books by turning pages horizontally, we conducted a user experiment to investigate the effects of horizontal and vertical control types (pagination versus scrolling) on a touch-enabled mobile phone. Our findings suggest that pagination improves search over scrolling, despite scrolling being more familiar. The main reason for this is the time taken for the scroll itself. Participants using scrolling also spend less time reading lower-ranked results, with lower search accuracy, even if this is where relevant documents are found. We conclude that search engines need to provide different viewport controls to allow a better search experience on touch-enabled mobile devices.
Here's what we did earlier:
We propose a method to generate simulated text corpora of arbitrary size. Such corpora are potentially useful when working with private data or as a means of enabling reproducible studies of the efficiency and scalability of retrieval algorithms. For eight different corpora we extract attributes and model the distributions of both term frequencies (piecewise linear with special treatment of head and tail) and document lengths (Gaussian). We model how those attributes and distributions change across samples of a corpus as the samples grow from 1% to 100% of the parent.
We use the above models and a synthetic collection generator (code to be made available as open source) to emulate each of the corpora. Our generator creates documents comprising synthetic words in random order and very accurately mimics the vocabulary size and term probability distribution of the base corpus. Using the static model for a 1% corpus and applying a generic growth model derived from multiple corpora we are able to emulate key parameters of the original 100% corpus with reasonable accuracy.
Synthetic collections from our generator potentially allow exactly reproducible efficiency experiments and accurate study of algorithmic scalability. They avoid the normal confounds of differences in tokenization and character set conversion of normal text. Our generator mimics a real corpus with sufficient fidelity to evaluate core aspects of indexing efficiency around postings list lengths, the document table, and the term table. With important provisos, this can be done with negligible interference to CPU and memory caches, allowing on-the-fly generation internal to an indexer. Generated corpora can be tailored to the sizes and characteristics needed for specific experiments. They can be shared with other researchers by communicating less than a kilobyte, even if derived from a private corpus.
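For readers who want a feel for the idea, here is a minimal sketch, not the generator described in the talk. It assumes a plain Zipf-like term distribution in place of the piecewise linear model with head and tail treatment, and it draws document lengths from a Gaussian as the abstract describes. The function names, the "t0, t1, ..." word scheme, and all parameter values are made up for illustration.

```python
import random
from itertools import accumulate


def make_vocabulary(size):
    """Return synthetic words t0, t1, ... ordered by frequency rank."""
    return [f"t{rank}" for rank in range(size)]


def zipf_cumulative_weights(size, exponent=1.0):
    """Cumulative weights proportional to 1 / rank**exponent.

    Assumption: a plain Zipf-like curve stands in for the piecewise
    linear term-frequency model mentioned in the abstract.
    """
    return list(accumulate(1.0 / (rank ** exponent) for rank in range(1, size + 1)))


def generate_corpus(num_docs, vocab_size, mean_len, sd_len, seed=0):
    """Generate documents as lists of synthetic words in random order.

    Document lengths are drawn from a Gaussian (rounded, at least 1);
    words are drawn independently from the Zipf-like distribution.
    """
    rng = random.Random(seed)  # fixed seed makes the corpus reproducible
    vocab = make_vocabulary(vocab_size)
    cum_weights = zipf_cumulative_weights(vocab_size)
    corpus = []
    for _ in range(num_docs):
        length = max(1, round(rng.gauss(mean_len, sd_len)))
        corpus.append(rng.choices(vocab, cum_weights=cum_weights, k=length))
    return corpus


if __name__ == "__main__":
    docs = generate_corpus(num_docs=5, vocab_size=1000, mean_len=20, sd_len=5, seed=42)
    for doc in docs:
        print(" ".join(doc))
```

In this sketch the whole corpus is determined by a handful of numbers (document count, vocabulary size, length mean and standard deviation, exponent, seed), which illustrates how a generated collection could be shared by passing on only its parameters.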
"IR and friends" first met on 26 April 2006, when Peter Christen and Tom Rowlands talked about their work in progress. Since then we've had over 150 regular talks from universities, government, and industry, plus special talks from visiting colleagues from around the world. Help us celebrate ten years of research, discussion, and community.
We will mark the occasion with short talks, reflecting on research and practice, from Tom Rowlands (Australian Crime Commission), Simon Kravis (2XX), Robert Power (CSIRO), Tom Gedeon (ANU), and David Hawking (Microsoft). There'll also be cake.
Query-biased summaries, where document summaries are modified to take into account a user's query, have been very useful when searching amongst collections of text. However, in applications such as data portals the collections are of tabular data—e.g. spreadsheets—and there is no similar support.
We describe a method for producing query-biased summaries of tabular data, which aims to support a user's decision whether or not to download a data set, or even to answer the question on the spot, with no further interaction. The method infers simple types in the data and query; automatically refines queries, where that makes sense; extracts relevant subsets of the complete table; and generates both graphical and tabular summaries of what remains. A small-scale user study suggests this both helps users identify useful results (fewer false negatives), and reduces wasted downloads (fewer false positives).
This is joint work with Vincent Au (ANU) and Gaya Jayasinghe (CSIRO).
Laws are pervasive in human society. Yet most members of the community find laws very difficult, or even impossible, to understand. Although extensive work has been carried out to improve how law is written, this remains true. How might we employ computational tools to enhance the communication of law? This presentation will outline some of the areas explored in PhD research on this topic. Corpus linguistics is applied to better understand the linguistic characteristics of law. Novel communication of information contained in legal documents is explored through application of graph techniques and visualization. Automating the visualization of selected contract clauses is prototyped. The state of the art in online publication of law is investigated. The research led to a collaborative online citizen science project between the Australian National University and Cornell University collecting tens of thousands of user evaluations of the readability and usability of law. Machine learning and correlation studies were applied to develop an initial model for detection of language difficulty in legislative sentences. The research also reflected on the nature of law from a multidisciplinary perspective. Law is often traditionally understood as a form of command (a concept having resonance within the imperative programming paradigm). Multidisciplinary investigation of law allows us to conclude that its character is far more complex and provides new tools for its investigation.
Michael Curtotti has recently submitted his PhD through the ANU Research School of Computer Science. His research has been presented at the International Conference on Artificial Intelligence and Law and at the Law via the Internet conferences, and is available in publication. Michael is Principal Lawyer with the ANU Students Association and the ANU Postgraduate & Research Students Association.
We also met from 2006 to 2015.