Data Wrangling and Processing for Genomics

By Adam Thomas, Ahmed R. Hasan, Aniello Infante, Anita Schürch, dbmarchant, Dev Paudel, Erin Alison Becker, Fotis Psomopoulos, François Michonneau, Gaius Augustus, Gregg TeHennepe, Jason Williams, Jessica Elizabeth Mizzi, Karen Cranston, Kari L Jordan, Kate Crosby, Kevin Weitemier, Lex Nederbragt, Luis Avila, Peter R. Hoyt, Rayna Michelle Harris, Ryan Peek, Sheldon John McKay, Sheldon McKay, Taylor Reiter, Tessa Pierce, Toby Hodges, Tracy Teal, Vasilis Lenis, Winni Kretzschmar.



Abstract

A Data Carpentry lesson on using command-line tools to perform quality control, align reads to a reference genome, and identify and visualize between-sample variation. Much genomics analysis is done with command-line tools, for three reasons: 1) you will often be working with a large number of files, and working at the command line rather than through a graphical user interface (GUI) lets you automate repetitive tasks; 2) you will often need more compute power than is available on your personal computer, and connecting to and interacting with remote computers requires a command-line interface; and 3) you will often need to customize your analyses, and command-line tools often allow more customization than the corresponding GUI tools (if a GUI tool exists at all). In a previous lesson, you learned how to use the bash shell to interact with your computer through a command-line interface. In this lesson, you will apply that knowledge to carry out a common genomics workflow: identifying variants among sequencing samples taken from multiple individuals within a population. We will start with a set of sequenced reads (.fastq files), perform some quality control steps, align those reads to a reference genome, and end by identifying and visualizing variations among these samples. As you progress through this lesson, keep in mind that, even if you aren’t going to use this same workflow in your research, you will be learning important lessons about using command-line bioinformatics tools. What you learn here will help you use a variety of bioinformatics tools with confidence and improve your research efficiency and productivity.
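To give a feel for what a quality control step on .fastq reads can involve, here is a minimal Python sketch. It is not taken from the lesson, which uses dedicated command-line tools. The function names (`parse_fastq`, `mean_phred`, `passing_reads`), the sample reads, and the quality threshold of 20 are all hypothetical, and the sketch assumes simple four-line FASTQ records with Phred+33 quality encoding.

```python
from statistics import mean

# Two made-up reads: read1 has high quality scores ('I' = Q40 in Phred+33),
# read2 has very low ones ('#' = Q2). This data is illustrative only.
EXAMPLE_FASTQ = """\
@read1
GATTACA
+
IIIIIII
@read2
GATTACA
+
#######
"""


def parse_fastq(text):
    """Yield (read_id, sequence, quality) tuples.

    Assumes each record is exactly four lines (no wrapped sequences).
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    for i in range(0, len(lines), 4):
        header, seq, _separator, qual = lines[i:i + 4]
        if not header.startswith("@"):
            raise ValueError(f"bad FASTQ header: {header!r}")
        yield header[1:], seq, qual


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return mean(ord(ch) - offset for ch in quality)


def passing_reads(text, min_mean_quality=20):
    """Return IDs of reads whose mean quality meets the (hypothetical) threshold."""
    return [
        read_id
        for read_id, _seq, qual in parse_fastq(text)
        if mean_phred(qual) >= min_mean_quality
    ]


print(passing_reads(EXAMPLE_FASTQ))
```

In practice a step like this runs over many files at once, which is where the automation argument for command-line tools comes in.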

Link to resource: https://datacarpentry.org/wrangling-genomics/

Type of resources: Module

Education level(s): Graduate / Professional

Primary user(s): student, teacher

Subject area(s): Computer Science, Information Science, Genetics, Measurement and Data

Language(s): English