In my master Data Science for Life sciences I worked on a personal health project with Behzad Barati, Kylie Keijzer, André de la Rambelje and Henrike Vaartstra. In this project we looked at the effect of a sudden change in diet to a fully plant-based diet
This project was used as a way to improve in our data visualization skills. We created a Dashboard in Python with Panel and used the Bokeh visualization library for scientific plots.
Veganism is an increasingly popular phenomenon in the western world since the 2010s (Burt, 2012; Pendergrast, 2015; Talia, 2015). It was kickstarted by the vegetarian food movement of the counterculture of the 1960s in the United States, whose members were concerned about their diet and the effect of mass production of meat on the environment (Iacobbo & Iacobbo, 2004; Smith, 2011). Even though veganism is a drastic change in diet, there has been only limited study into the effect of veganism on body composition, mental health and gut microbiome (David et al., 2013; Iguacel et al., 2020; Le & Sabaté, 2014; Zimmer et al., 2012). There has been plenty of research done on longterm diet influencing the gut microbiome composition but there is substantially less research on the effect of short-term diet on the gut microbiome composition (Muegge et al., 2011; Wu et al., 2011). Here we show the effects of an abrupt change from an omnivore diet to a vegan diet on the human mental state, body composition and gut microbiome. We found that a short-time vegan diet does not significantly influence the mental health nor the body composition. Furthermore, we could not find significant evidence for a shift in the gut microbiome composition which subverted our expectations. There was expected that following a vegan diet for four weeks would be enough time to see a significant change in the gut microbiome, based on previous studies (David et al., 2014; Turnbaugh et al., 2009; Zimmer et al., 2012). Our results represent the outcomes of the collected measurement data compared between the omnivore diet and the vegan diet. There are several data visualizations created in an interactive dashboard. We anticipate our study to be a starting point to more research on the short-term effects of a vegan diet on the gut microbiome, using alternative techniques for wetand dry laboratory work.
At naturalis I did my graduation internship titled “automation of the analysis of insect camera data”. In this project I helped in developing a pipeline that uses deep learning in order to detect, track (deduplicate), segment, identify and estimate the biomass of terrestrial arthropods photographed by insect cameras.
Insects are vital to many ecosystems as pollinators and as a food source for insec- tivores. Recent studies have indicated a worldwide decline in biodiversity, biomass and occurrence of many different insect species. Traditional methods to measure these insect trends are very time consuming, labour intensive and are lethal to the insects. We propose a novel method of measuring these trends with insect cameras. In the summer of 2019 in total eighty-seven insect cameras have taken over thirteen million pictures, of which eight million pictures were labelled to contain activity. The analysis of all these pictures would take entomologists very long.
In this study, we present a pipeline that automates the analysis of insect camera data. This pipeline detects, tracks (deduplicates), segments, identifies and estimates the biomass of terrestrial arthropods with help of different convolutional neural networks.
Finally, we present a discussion about the possibility of using these cameras as an insect monitoring network for the Netherlands.
Falling is the most common type of injury for construction workers. When a fall has taken effect, it can be difficult for others to notice on time. The victim of the fall might not be able to contact others to ask for help. By making a fall detection system a phone application, the construction worker avoids carrying additional equipment. Although there are plenty of options for iOS, the choices on Android are lacking. Here we present Android Fall Detection System (AFDS): an application written in Python to detect and notify others of a fall happening in a construction environment. AFDS works by using the accelerometer to detect a fall in three phases: free fall, impact and the lack of movement. After a fall has been detected, a chosen phone contact gets notified by means of a call and GPS location through SMS. We show that the algorithm for the detection of falls above half a meter has an accuracy of 92% with n=100. Contrarily the separation of daily activities of construction workers (true negatives) with falls (false positives) has an identification accuracy of 95% with n=100. Although these tests have proven to be reliable the validity has to be taken in question. There is still a need for further tests that more truthfully simulate a falling person. Also noted is the need of improvement in the usability of the application in the form of making the fall detection proceed while the phone is locked.
Construction workers are exposed to severe hazards. The most serious hazards these construction workers are exposed to include: falls, being struck by an object, electrocutions, and being caught in between two objects. (Occupational Safety and Health Administration, 2017).
Current regulations regarding fall protection state that employers are responsible for providing fall protection. This fall protection can be provided by guardrails, safety nets, personal fall arrest systems, positioning device systems, and warning line systems (Occupational Safety and Health Administration, 2016).
Despite European and Dutch norms and regulations, construction workers are exposed to possible falls. This represents an important hazard, therefore there are specific norms to prevent accidents to happen. One possible additional approach is providing a fall detection system that provides an alarm and contacts the appropriate person in case of an emergency. For the target group previously mentioned, currently the market provides the following products:
Available products in the market intended for the aforementioned target group are limited, and they work only in Apple devices. These products are relatively expensive and have limited functionality considering that most people in the Netherlands uses Android as mobile operating system (StatCounter, 2019) (Appendix A).
The goal of this project is to create a fall detection system that detects falls of construction workers by using two sensors and sending a signal to another device, which receives and reads the signal in case of an emergency. The main goal is to detect falls within an acceptable level of accuracy, keeping both the number of false negatives and false positives as low as possible. The prototype cannot exceed the cost of 50 euros, it needs to be wearable and ergonomic, and its battery needs to be rechargeable (Appendix B).
Construction workers already carry a lot of equipment for their day-to-day tasks. Adding even more devices to carry would be inconvenient or even dangerous. However, what this group of people often do carry outside of their work-related tools is their personal mobile phone. Therefore, it has been decided to design and develop a sensor system as an android application that uses the sensors within a mobile phone running on android to detect falls of construction workers. A lower height limit of half a metre has been set as a minimum to detect falls.
Here we present the Android Fall Detection System (AFDS). AFDS uses the accelerometer and GPS of an android phone to detect and locate falls of construction workers. When a fall has been detected the application informs a previously added contact that a construction worker may need help by making a phone call and sending an SMS containing the location of the fall to the contact.
Android Fall Detection System was written in Python 2.7.13 with help of the Kivy library (Kivy.org, 2019). Kivy is an open source library for Python to develop cross platform applications for Linux, Windows, OS X, Android, iOS and Raspberry Pi. Kivy includes an Application Programming Interface (API) called Plyer (Plyer.readthedocs.io, 2019). Plyer was used to access sensors and other features of the android phone such as calling and SMS (Appendix C). The chosen features include the use of the accelerometer for detecting falls, the GPS for locating where a worker has fallen and SMS and calling for contacting help. The source code of Android Fall Detection System is available under Appendix D.
The Python code of the application is split up into four classes in which each class represents a screen. By using the kv design language the User Interface is split up into a main, contacts, accelerometer and timer screen (Kivy.org, 2019). In this kv-file all the graphical elements are defined to adopt to the separation of concerns principle (Laplante, 2007).
The fall is detected by the accelerometer, the detection of the fall is determined in three stages. The first stage is when the G-force of the device is lower than 0.1 for more than 319 milliseconds, this means that the device is in freefall from a height of half a metre (Formula 1). The second stage is completed when the accelerometer measures a force around 1,5G, this means the device has had an impact with a surface. For the last stage the accelerometer measures if the G-force is between 0,75G and 1,5G. If the accelerometer measures values outside of these thresholds it is deduced that the person is still moving or getting up resulting in the algorithm concluding that the fall was only minor. The previously mentioned thresholds have been based on literature of reading the values of the accelerometer in free fall (Sposaro and Tyson, 2009) (Appendix F).
Formula 1: Time t taken for an object to fall distance d. This formula was used to calculate that a fall of half a metre takes roughly 319 milliseconds.
After the fall detection algorithm concludes a fall has taken place the timer code activates. This code counts down from 30 seconds in which every five seconds a Waveform Audio File (WAV) is played asking the construction worker to click a button confirming that he does not need help. If the construction worker fails to do this within 25 seconds a second audio clip is played, and the GPS of the android phone is turned on to start calibrating the exact location of the construction worker. When the timer reaches zero the location of the construction worker is finalized and gets send as a Google maps location to a phone number that the construction worker has specified under the contacts section of the application. This phone number will also be called to both notify and communicate with the helping party.
To save phone numbers into memory the Python csv library was used. This library provides the ability to write and read text to csv-files (comma-separated values) such that when the application is closed the phone number is still saved. When the user enters a phone number it gets written to a csv-file hidden in its application folder. Once the contacts screen is loaded the csv-file is read and the phone number is being shown to the user.
Mobile applications on the Android operating system are often programmed in programming languages such as Java, Kotlin or C++. To be able to create an Android application package (APK) from Python code Buildozer was used (Buildozer.readthedocs.io, 2019). Buildozer is a tool from Kivy that automates the python-for-android compiler with help of a buildozer.spec file (Python-for-android.readthedocs.io, 2019). Although the Kivy developer team has made no official statement it has been determined that python-for-android, and thus Buildozer, has no windows support at time of writing. To solve this limitation a virtual machine running Xubuntu 17.04 has been used to compile the Python source code into compatible code for android (Xubuntu.org, 2019). To be able to detect android phones inside the virtual machine it was needed to install the Android Debug Bridge (adb) (Android Developers, 2019).
This section of the technical report aims to show the results of this project, starting with a short explanation of the functionality of the application and then providing a test of its accuracy. AFDS (Android Fall Detection System) allows the user to start/stop detecting falls and add an emergency contact that would be contacted if a fall is detected and the user does not stop the alarm.
Figure 1: Android Fall Detection System
After opening the application (Figure 1) the user has access to the main menu (Figure 2). “Edit Contacts” allows the user to add an emergency contact through the contact screen (Figure 3). The “Back” button in this screen makes the application show the main screen again, from where the user can start or stop the fall detection system. To start detecting falls the user can press this button, which leads to the fall detecting screen. From this screen the user can go back to the main menu with the “back” button (Figure 4). In case a fall is detected a countdown is activated, as it can be seen in the fall detection screen (Figure 5). If the construction worker needs no assistance, the button “I am okay” stops the countdown. If this button is not pressed within thirty seconds, the application makes a call to the emergency contact (Figure 8) and sends an SMS with the location of the fallen construction worker (Figure 7).
Figure 2: Main screen of AFDS
Figure 3: Contact screen android AFDS
Figure 4: Detecting fall screen android AFDS
Figure 5: Detected fall screen android AFDS
Figure 6: Cancelled countdown screen android AFDS
Figure 7: Missed call after fall detection
Figure 8: SMS sent after fall detection
The GPS test has been performed in three different environments to test the difference in accuracy between inside and outside (Table 1). A Samsung Galaxy S9 was used in all GPS tests. The data has been split into the degree of longitude and latitude and the test was performed in Assen, The Netherlands (53,00°, 6,57°). In these tests the manually curated location of the phone, which has been deduced as the most accurate estimation, is referred to as the true location. Each guess is what the Android Fall Detection System chooses as the location of the fall. The errors of the true and guessed locations with the average errors have been calculated and are shown in Table 2 and Table 3. To be able to interpret this data more easily, the errors have been converted from degrees latitude and longitude into meters (Table 4) (Veness, 2019).
To better visualize the errors of the GPS both the true and guessed locations have been plotted onto a satellite image of the building in which the tests have been run (Figure 9a and 9b).
Table 1: Longitude and latitude given by the GPS of three different locations compared to the true location.
Table 2: Errors between the true and the guessed locations based on the GPS accuracy test. The table states the absolute differences between each guess and the true value of the three determined locations. Also included is the average error per location.
Table 3: The total average of all errors of all locations.
Table 4: Errors between the true and guess locations based on the GPS accuracy test in meters. The Haversine formula has been used to calculate the great-circle distance between two GPS points. The great-circle distance is the shortest distance over the Earth’s surface.
Figure 9a: True and guessed locations plotted on a satellite image. The locations shown as a star icon are the true locations and the marker icons represent the guessed locations. The icons highlighted in orange represent tests done outside, the blue icons represent tests done under a roof and icons in green represent tests done under two ceilings and a roof.
Figure 9b: Zoomed in version of Figure 9a
To test the fall detection algorithm, both true falls that should be detected and false falls that represent daily activities of a construction worker have been tested (Table 5 and Table 6). A Samsung Galaxy S7 was used in all fall detection tests. The testing of true fall detection was performed by letting the phone fall from a specified height. To protect the phone, a protective case and a landing platform consisting out of a summer jacket was used. In order to test the daily activities, a test subject placed the phone into their front pocket while performing said activities. All tests were categorized depending on whether the falls have been detected or not (Table 7).
Table 5: Detection of true falls from different heights.
Table 6: Detection of false falls while performing activities.
Table 7: Overall accuracy of fall detection.
To be able to test what a typical freefall would look like the values of the accelerometer in the android phone were read (Figure 10). Based on these values and literature it has been concluded that a fall consists of three phases (Sposaro and Tyson, 2009). The first phase can be seen as a freefall, in which the values go below 1 m/s2. Following a freefall is an impact with the ground, causing the total acceleration to spike to well over 15 m/s2. The third phase is the detection of movement. If the total acceleration goes outside of 6 and 15 m/s2 for more than two and a half seconds it is concluded that significant motion has taken place and the user is okay.
Figure 10: Total acceleration over time of a free fall of one meter. Phone was let go from one meter and landed onto a blanket. The total acceleration of the X, Y and Z axis combined have been plotted in m/s2.
Android Fall Detection System can be used to detect falls of construction workers and contact help if needed. AFDS does however come with some disadvantages.
Because of the need of the Kivy Buildozer to create Android-compatible code, AFDS has been limited to Python 2.7. This limits the maintainability of the source code since Python 2.7 will be deprecated in the beginning of 2020 (Python.org, 2019). AFDS could also benefit from some extra features such as a back button after stopping the countdown and the ability to add multiple phone numbers in case the first person is not available. The most necessary feature AFDS is missing is arguably the ability to keep the fall detection algorithm running while the phone is locked.
Part of the Android Fall Detection System is to send the location of the fall to get help. This location is determined by the GPS of the phone (Table 1, Table 2 and Table 3). As seen in Figure 9a and Figure 9b the results significantly vary based on the environment of the fall.
In the case of a fall happening in an area without a roof the guessed locations seem to be the most accurate of the three different locations. However, the precision is less than under a single roof. This could have been by stochastic error since the sample size of these tests were limited to three.
Although the tests under a single roof are the most precise, they seem to be the most inaccurate. The hypothesis is that this is caused by the many windows of this location with the reflection causing the GPS location to be moved up north.
As expected, the tests under two ceilings and a roof seem to be both somewhat inaccurate and not precise. The environment of this location is excluded from the open air which makes determining the GPS location more difficult.
The test to see whether the Android Fall Detection System can detect true or false falls has been carried out. When referring to Table 4, fall detection at variable heights impacts the accuracy of the algorithm. The application has detected falls 65% of the time at half a metre with a sample size of twenty. However, the accuracy was at 98.75% from one to two and a half metres which is significantly higher than at half a metre. Thus, it has been concluded that a fall above at least half a metre significantly improves the accuracy of the fall detection.
Another test has been performed to detect if daily activities of a construction worker would set off the fall detection algorithm (Table 5). Five activities that could be misunderstood by the algorithm as a fall, which is referred to as a ‘false fall’, have been performed. These tests have been conducted twenty times each for a total of one hundred trials to see how accurate the Android Fall Detection System is in distinguishing true negatives from false positives. When the phone was dropped from 0,4 metre, the application identified the drop as a fall 25% of the time resulting in a 75% accuracy. In the other activities the application did not give a false conclusion giving a 100% accuracy.
The conditions of testing the accelerometer were however less than ideal. All tests were performed by letting the phone fall freely onto a summer jacket to lessen the impact. This could have impacted the resulting data and made the results less valid. In an optimal test it would be best to have a human or test dummy fall from the specified heights onto a concrete floor with the phone being placed in the pocket of some pants.
Universally a fall consists of a period of low acceleration while falling and a spike in high acceleration at impact. Sposaro and Tyson, 2009 describe a typical fall to have these low values under 0,6G and high values above 3G (Figure 11). However, in said paper a fall is defined as a standing person falling backwards onto a surface of the same height as they are standing on. Since Android Fall Detection System defines a fall as a free fall above at least half a metre the thresholds diverge from this paper. The algorithm described in this report uses a lower threshold of 0,1G for more than 319 milliseconds with an upper threshold of at least 1,5G.
Figure 11: Total acceleration over time of a typical standing fall. An example of total acceleration values of a backwards fall of a person that was standing. Based on these falls the deduced thresholds are around 0.6G and 3G. Graph adapted from (Sposaro and Tyson, 2009).
Within our test conditions Android Fall Detection System has been deemed accurate enough to reliably detect falls of construction workers. Further testing is however needed to make a definite conclusion about the appliance of AFDS in a real-world setting. For further research it would be beneficial to test the fall detection algorithm by letting a human or test dummy fall on a more typical work floor such as one made from concrete.
The locations found by the GPS have been regarded to be accurate enough to locate where the fall has happened. The sending of the location to another phone number has proven to be reliable. It can be said that GPS is not an ideal way of locating a fall since construction sites often have multiple floors and GPS does not detect height.
A major downside of this version of the application is that the detection of falls is paused when the phone is locked. Having the phone unlocked in their pocket can cause the user to accidently press other buttons while working and make the app not detect falls anymore. It is therefore recommended that this is at the top of the list of features to work on in further development.
The use of Kivy and Buildozer to create Android applications from Python code is troublesome and time-consuming. Because of poor documentation and many outdated dependencies, it is not recommended to use these libraries and a better alternative is to write the code directly in Java.
Android Developers. (2019). Android Debug Bridge (adb) Android Developers. [online] Available at: https://developer.android.com/studio/command-line/adb [Accessed 19 Jun. 2019].
Apple Watch Series 4. (2019). Apple (América Latina). Retrieved 9 May 2019, from https://www.apple.com/la/apple-watch-series-4/
Buildozer.readthedocs.io. (2019). Welcome to Buildozer’s documentation! — Buildozer 0.11 documentation. [online] Available at: https://buildozer.readthedocs.io/en/latest/ [Accessed 19 Jun. 2019].
Veness, C. (2019). Calculate distance and bearing between two Latitude/Longitude points using haversine formula in JavaScript. [online] Movable-type.co.uk. Available at: https://www.movable-type.co.uk/scripts/latlong.html [Accessed 26 Jun. 2019].
Distribution dashboard Android Developers. (2019). Android Developers. Retrieved 9 May 2019, from https://developer.android.com/about/dashboard
Kivy on Android — Kivy 1.10.1 documentation. (2019). Kivy.org. Retrieved 9 May 2019, from https://kivy.org/doc/stable/guide/android.htm
Kivy.org. (2019). Kv Design Language — Kivy 1.11.0 documentation. [online] Available at: https://kivy.org/doc/stable/gettingstarted/rules.html [Accessed 19 Jun. 2019].
Kivy.org. (2019). Welcome to Kivy — Kivy 1.11.0 documentation. [online] Available at: https://kivy.org/doc/stable/ [Accessed 19 Jun. 2019].
Laplante, P. (2007). What Every Engineer Should Know about Software Engineering.
Mobile Operating System Market Share Netherlands StatCounter Global Stats. (2019). StatCounter Global Stats. Retrieved 18 June 2019, from http://gs.statcounter.com/os-market-share/mobile/netherlands
Python.org. (2019). PEP 373 – Python 2.7 Release Schedule. [online] Available at: https://www.python.org/dev/peps/pep-0373/ [Accessed 20 Jun. 2019].
Plyer.readthedocs.io. (2019). Welcome to Plyer — Plyer 1.4.1.dev0 documentation. [online] Available at: https://plyer.readthedocs.io/en/latest/ [Accessed 19 Jun. 2019].
Python-for-android.readthedocs.io. (2019). python-for-android — python-for-android 0.1 documentation. [online] Available at: https://python-for-android.readthedocs.io/en/latest/ [Accessed 19 Jun. 2019].
Raadgevendbureauspeelman. (2019). Raadgevendbureauspeelman.nl. Retrieved 9 May 2019, from https://www.raadgevendbureauspeelman.nl/wp-content/uploads/2016/05/1.1-Cursusboek-VCA-versie-8.pdf
Sposaro, F. and Tyson, G. (2009). iFall: An android application for fall monitoring and response. 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society.
Welcome to Plyer — Plyer 1.4.1.dev0 documentation. (2019). Plyer.readthedocs.io. Retrieved 9 May 2019, from https://plyer.readthedocs.io/en/latest/
WorkerSafety Pro — Protects, detects, & alerts organizations automatically. (2019). Health & Safety solutions for Work and for Home. Retrieved 9 May 2019, from
Xubuntu.org. (2019). Xubuntu 17.04 released! « Xubuntu. [online] Available at: https://xubuntu.org/news/xubuntu-17-04-release/ [Accessed 19 Jun. 2019].
Highlighted in green are APIs that could be interesting for a fall detection system for construction workers that sends a signal to their foreman or colleagues. Edited from https://github.com/kivy/plyer.
https://github.com/Magiduck/AFDS2
https://drive.google.com/open?id=1OglZ2_CQCEmmW9vKLWWSUKcVu_fdZfcH&usp=sharing
We chose apart from the potentiometer to also use the on-board photosensor of the Innoesys educational shield. The resulting values are displayed on three on-board LEDs and a servomotor acting as an analogue dial.
The right button is used to switch between measuring the sensors. The left button toggles the LED display on and off.
We also established communication to and from the Arduino. The user can send the following commands:
Below you can find a live demo in video format.
The Arduino code can be found below:
/*
* Final assignment made by Magor Katay and Tijs van Lieshout
*
*/
// Servo library
#include <Servo.h>
// Pins for the Innoesys shield on-board LEDs + variable for LED values
const int redPin = 6;
const int greenPin = 9;
const int bluePin = 10;
int LEDValue = 0;
// Pin for the Innoesys shield on-board for the Potentiometer
const int potentioMeter = A1;
int potentioValue = 0;
// Pin for the Innoesys shield on-board for the Photosensor
const int photoSensor = A2;
int photoValue = 0;
// Servo object from the servo library with value
Servo myservo;
int servoValue = 0;
// boolean to allow servo to be active (for the "S" command)
boolean allowServo = true;
// Used to switch between sensors
const int buttonRight = 7;
int buttonRightState;
int previousRight = 0;
boolean isPotentioOn = true;
// Used to turn on/off the LED display
const int buttonLeft = 8;
int buttonLeftState;
int previousLeft = 0;
boolean isDisplayOn = true;
// Used for blinking led at min and max values
boolean isBlinkOn = false;
unsigned long previousMillis2 = 0;
const long interval2 = 500;
// Used for both buttons
long time = 0;
long debounce = 200;
// Used for sending data every 1Hz with the "C" command
unsigned long previousMillis = 0;
const long interval = 1000;
boolean keepReturningValues = false;
// Pin of the buzzer for the "B" command (also used for the right button)
const int buzzer = 5;
void setup() {
// Outputs
pinMode(redPin, OUTPUT);
pinMode(greenPin, OUTPUT);
pinMode(bluePin, OUTPUT);
pinMode(buzzer, OUTPUT);
myservo.attach(3);
// Inputs
pinMode(potentioMeter, INPUT);
pinMode(photoSensor, INPUT);
pinMode(buttonRight, INPUT);
pinMode(buttonLeft, INPUT);
// Baud rate for the serial monitor
Serial.begin(9600);
}
void loop() {
// For switching between Potentiometer and Photosensor with the right button on the shield
buttonRightState = digitalRead(buttonRight);
// Determine if button has been pressed and if <isPotentioOn> should be true (use potentiometer) or false (use photosensor)
determineButtonRightPress(buttonRightState);
// Set the value of the LEDs based on the measured sensor value
setLEDValueBasedOnActiveSensor();
// For moving the servo (servo won't be allowed if the user toggled it off with a "S" command)
if(allowServo){
myservo.write(servoValue);
delay(15);
}
// For turning the LED display on/off again
buttonLeftState = digitalRead(buttonLeft);
// For switching with the left button
determineButtonLeftPress(buttonLeftState);
// For handling all the commands the user can send to the arduino
handleCommands(potentioMeter, photoSensor);
// For turning leds on/off, triggered with the left button or the "L" command
if(isDisplayOn) {
displayValuesToLED(LEDValue);
} else {
turnOffLEDs();
}
// If the "C" command has been send then keep sending values at a speed of 1Hz
if(keepReturningValues) {
returnValuesContiniously();
}
}
// For determining if the right button was pressed and if the potentiometer should be read or the photosensor.
void determineButtonRightPress(int buttonRightState) {
if(buttonRightState == 1 && previousRight == 0 && millis() - time > debounce) {
digitalWrite(buzzer, 1);
delay(50);
digitalWrite(buzzer, 0);
if(isPotentioOn) {
isPotentioOn = false;
} else {
isPotentioOn = true;
}
time = millis();
}
previousRight = buttonRightState;
}
// Checking which sensor should be measured and converting that value to a value for the LEDs
void setLEDValueBasedOnActiveSensor() {
// for checking which sensor should be measured
if(isPotentioOn) {
potentioValue = analogRead(potentioMeter);
// mapping sensor value appropiately for the LED and servo
LEDValue = map(potentioValue, 0, 1023, 0, 255);
servoValue = map(potentioValue, 0, 1023, 180, 0);
} else {
photoValue = analogRead(photoSensor);
// mapping sensor value appropiatley for the LED and servo
LEDValue = map(photoValue, 0, 1023, 0, 255);
servoValue = map(photoValue, 0, 1023, 180, 0);
}
}
// For determining if the left button was pressed and if the LED display should be on or not
void determineButtonLeftPress(int buttonLeftState) {
if(buttonLeftState == 1 && previousLeft == 0 && millis() - time > debounce) {
if(isDisplayOn) {
isDisplayOn = false;
} else {
isDisplayOn = true;
}
time = millis();
}
previousLeft = buttonLeftState;
}
// For handling commands from the serial monitor:
// L turning on/off LED display
// M Sending both sensor values once
// C Sending both sensor values at a speed of 1Hz until toggling it off again
// B Play a note through the buzzer based on the LEDvalue
// S Toggling the use of the servo as a display
void handleCommands(int potentioMeter, int photoSensor) {
int command;
if (Serial.available() > 0) {
command = Serial.read();
if(command == 'L') {
if (isDisplayOn) {
isDisplayOn = false;
} else {
isDisplayOn = true;
}
} else if(command == 'M') {
potentioValue = analogRead(potentioMeter);
photoValue = analogRead(photoSensor);
Serial.println("Potentiometer value: " + String(potentioValue) + " Photometer value: " + String(photoValue));
} else if(command == 'C') {
if(keepReturningValues) {
keepReturningValues = false;
} else {
keepReturningValues = true;
}
} else if(command == 'B') {
if(LEDValue >= 0 && LEDValue <= 85) {
tone(buzzer, 31);
delay(50);
noTone(buzzer);
}
else if(LEDValue > 85 && LEDValue <= 170){
tone(buzzer, 494);
delay(50);
noTone(buzzer);
}
else if(LEDValue > 170 && LEDValue <= 255){
tone(buzzer, 3951);
delay(50);
noTone(buzzer);
}
} else if(command == 'S') {
if(allowServo){
allowServo = false;
Serial.println("You turned the Servo off");
} else {
allowServo = true;
Serial.println("You turned the Servo on");
}
}
}
}
// Displaying values between 0 and 255 to three LEDs (digitally, it was at some point analog but the servo library causes problems)
void displayValuesToLED(int value) {
if(value <= 0) {
digitalWrite(greenPin, 0);
digitalWrite(bluePin, 0);
// For blinking at min and max value
unsigned long currentMillis2 = millis();
if (currentMillis2 - previousMillis2 >= interval2) {
previousMillis2 = currentMillis2;
if(isBlinkOn){
digitalWrite(redPin, 0);
isBlinkOn = false;
} else {
digitalWrite(redPin, 1);
isBlinkOn = true;
}
}
}
else if(value > 0 && value <= 85) {
digitalWrite(greenPin, 0);
digitalWrite(bluePin, 0);
digitalWrite(redPin, 1);
}
else if(value > 85 && value <= 170){
digitalWrite(redPin, 0);
digitalWrite(bluePin, 0);
digitalWrite(greenPin, 1);
}
else if(value > 170 && value < 255){
digitalWrite(greenPin, 0);
digitalWrite(redPin, 0);
digitalWrite(bluePin, 1);
}
else if(value >= 255) {
digitalWrite(greenPin, 0);
digitalWrite(redPin, 0);
// For blinking at min and max value
unsigned long currentMillis2 = millis();
if (currentMillis2 - previousMillis2 >= interval2) {
previousMillis2 = currentMillis2;
if(isBlinkOn){
digitalWrite(bluePin, 0);
isBlinkOn = false;
} else {
digitalWrite(bluePin, 1);
isBlinkOn = true;
}
}
}
}
// Turning all LEDs off at the same time, can be triggered through the "L" command or through the left button press on the shield
void turnOffLEDs() {
digitalWrite(greenPin, 0);
digitalWrite(bluePin, 0);
digitalWrite(redPin, 0);
}
// Used for sending both values at 1Hz if the user has sent the "C" command
void returnValuesContiniously() {
unsigned long currentMillis = millis();
if (currentMillis - previousMillis >= interval) {
previousMillis = currentMillis;
potentioValue = analogRead(potentioMeter);
photoValue = analogRead(photoSensor);
Serial.println("Potentiometer value: " + String(potentioValue) + " Photosensor value: " + String(photoValue));
}
}
The variability between genome architectures, the collection of non-random arrangements of functional elements within a genome, can be used to determine the similarities and differences in a genome comparison to study genome evolution. Long-read sequencing brings the benefit of a more accurate assembly in regions of the genome with repetitive sequences and complex structural variation. Long-reads can be more easily aligned to each other which leads to more accurate comparisons of genome architectures. Instead of aligning sequences, a single graph-based data structure can offer more information of similarities and differences in genome architecture in variable regions. Ptolemy is a tool that is able to compare genome architectures without the need of a reference in a gene graph with vertices representing genes. The interpretation of these gene graphs can however be difficult without visualization, and currently few tools provide biological depth when representing these graphs. In this study we present Directed Acyclic Graphs for GEnomic readability (DAGGER), a tool with a graphical user interface for the interactive visualization of gene graphs from GFA1-formatted files. We show the features of DAGGER by visualizing the genome architecture comparison of three novel Escherichia phages (most closely related to vB_EcoM_AYO145A) in a single gene graph using Ptolemy. We also present the stand-alone tool POET (Ptolemy Output Enhancement Tool). This tool can be used for the further analysis of genome graph data generated by Ptolemy. With help of Ptolemy and POET we partially analyse the genome architecture of a dataset of long-sequencing reads of a chimeric generated bacteriophage of E. coli bacteria BL21phi10. Finally, we present a discussion about the advantages and disadvantages of different layout algorithms to visualize gene graphs.
BANDAGE a Bioinformatics Application for Navigating De novo Assembly Graphs Easily
DAGGER Directed Acyclic Graphs for GEnomic Readability
GFA1 Graphical Fragment Assembly format 1
JUNG Java Universal Network/Graph framework
POET Ptolemy Output Enhancement Tool
The first sequencing of a genome (bacteriophage MS2) was reported in 1976 (Fiers et al., 1976). The upcoming of comparative genomics would however be much later, starting with the whole genome sequencing of Mycoplasma genitalium (Fraser et al., 1995) and Haemophilus influenza (Fleischmann et al., 1995). Currently, comparative genomics is widely used when analysing sequenced genomes (Koonin & Galperin, 2013). The field of comparative genomics aims to compare the genetic content of two or more species to identify the similarities and differences between genomes. The totality of non-random arrangements of functional elements in a genome can be described as genome architecture (Koonin, 2009). The study of genome evolution (the process by which a genome changes in structure over time through mutation, horizontal gene transfer and sexual reproduction) compares genome architectures between closely and distantly related genomes. This comparison can indicate the similarities and differences through evolution and lead to findings of how a genome is able to evolve through time.
A sequencing run generates data in the form of reads. A read is an inferred sequence of base pairs of as given molecule. Reads that share an overlap can be assembled into contigs. A contig can be described as a collection of partially or entirely overlapping reads that together represent a consensus sequence of a genetic fragment. The genome assembly can contain multiple contigs, which indicates the presence of a gap. To bridge the gap between these contigs it is needed to create a scaffold (Figure 1) (Fierst, 2015; Waterston, Lander, & Sulston, 2002). Outside of these contigs and scaffolds it is also possible to visualize the result of the assembly as a graph. Assembly graphs provide extra layer of information, as it contains connections between reads (Wick, Schultz, Zobel, & Holt, 2015).
Figure 1: Schematic representation of sequence information. Reads that have a partial or complete overlap with another read makes a contig. A contig, a collection of overlapping reads, starts at the lowest base pair position of said reads and ends at the highest base pair positions of said reads. If a single contig does not cover all the reads than there is a need for scaffolds. Adapted from https://www.biostars.org/p/253222/.
When comparing genome architectures one of the genomes is selected as a reference genome. Marschall et al. describes that a reference genome can be built up in multiple ways: (1) Getting the genome of one single individual as a reference; (2) a consensus genome of an entire population; (3) use the functional elements of a genome (with mutations); and (4) aligning all sequences ever detected to analyse jointly, defined as an “maximal genome”. The representation of this “maximal genome” in a “pan-genome” instead of a single reference genome proves to contain a more complete set of functional elements to analyse (Marschall et al., 2018). A pan-genome can be defined as the core genome containing all the genes that are found in all strains and a dispensable, accessory genome containing all the genes that are only found in a sub set of strains (Tettelin et al., 2005). This pan-genome creates a collection of reference genomes to be analysed jointly (Marschall et al., 2018). The storage of these collections of genomic sequences into graph-based data structures has proven to improve the analysis of complex genomic regions like the major histocompatibility complex (Dilthey, Cox, Iqbal, Nelson, & McVean, 2015).
Most Next Generation Sequencing (NGS) methods, often referred to as high-throughput sequencing, e.g. Illumina, produce a great number of short reads (Fierst, 2015). These short reads have a read length between 100 to 500 base pairs (Jünemann et al., 2013). While the accuracy of genotyping is fairly high with these reads in non-repetitive regions, they cannot provide a contiguous de novo assembly. These short reads are not accurate enough to assemble a genome with repetitive sequences or complex structural variation (Jain et al., 2018).
More recently came the technology of long-read sequencing. Single-molecule real-time sequencers from Pacific Biosciences are able to produce reads of over 10 kilo base pairs long (Gordon et al., 2016). These long reads can greatly help with the assembly of a genome. Although it has been reported that these single-molecule real-time sequencers suffer from significantly higher error rates in the reads compared to shorter read sequencers from Illumina. To create an accurate assembly the use of de novo assembly algorithms and accurate short reads mapping is necessary (Chaisson, Wilson, & Eichler, 2015; Fierst, 2015).
An alternative recently developed technology in sequencing is “pore sequencing”, e.g. nanopore sequencing from Oxford Nanopore Technologies. In this electrochemical method, a nanopore detects single-molecules by electrophoretically driving said molecules in solution (Branton et al., 2008). Recent improvements lead to ultra-long-reads vastly helping in the assembly of complete genomes (Jain et al., 2018). By only using a single sequencing technology (Loman, Quick, & Simpson, 2015), Jain et al. has been able to produce at 30× coverage an human genome assembly with accuracy to 99.44%. Because of this the reconstruction of genomes containing complex structural variation is highly improved, facilitating the comparison of genomic architectures (Jain et al., 2018). Long reads provide a substantial improvement when compared to short reads in assembly contiguity, creating a more accurate assembly.
A gene graph can be defined as a graph-based data structure that describes the order of genes in a genome. The gene graph is an ordered pair G = (V,E) with the vertex set V representing the genes and the edge set E representing the order of which genes follow each other. Similar to the definition of a computational pan-genome, a gene graph comparing multiple genome architectures also represents a collection of genomic sequences to be analysed (Marschall et al., 2018). The goal of a gene graph is to reduce the reference bias for a more accurate analysis of genome comparisons.
The construction of gene graphs can be done from whole-genome alignments by detecting the relationship in homology between genome sequences. An example of a tool that tackles this graph construction problem is Ptolemy (Salazar & Abeel, 2018). This tool is able to compare genome architectures of multiple microbial assemblies without the need of a reference genome. Ptolemy can be used to study structural variation and pan-genomes from a ‘top-down’ approach, it uses gene annotations instead of DNA sequences (Salazar & Abeel, 2018).
Contrary to this is the ‘bottom-up’ approach that compares genome architectures based on alignments (Angiuoli & Salzberg, 2010). An example of this is NovoGraph, which performs a global pairwise alignment of each contig relative to the reference genome. A global multiple sequence alignment follows to embed the homology between contigs and the reference genome. Finally an acyclic graph-based data structure is computed to connect the contigs at homologous-identical positions (Biederstedt, Oliver, Hansen, Jajoo, Dunn, et al., 2018).
Another graph-based tool for genome comparison is the Variation Graph toolkit (Garrison et al., 2018). This tool benefits most for short-read data and in which known variation is assembled into a graph to align the short-reads against (Garrison et al., 2018).
Although tools such as Ptolemy and NovoGraph create the comparison of the genome architectures in a graph-based data structure, they do not provide a visualization. The interpretation of a gene graph purely from a non-visual point of view can be significantly restricted. There is a need for the visualization of these gene graphs to get an overview and understanding of their contents since existing graph toolkits do not support gene graphs based on whole-genome alignments (Biederstedt, Oliver, Hansen, Jajoo, Olson, et al., 2018).
Gephi is an example of a widely used visualization tool for networks and graph-based data structures (et al. M. Bastian, S. Heymann, 2009). However, it’s broad applicability, difficults the understanding of complex biological data sets.
A specific tool created for visualizing de novo assembly graphs is BANDAGE. Here, the vertices of the graph are single assembled contigs. The edges are the connection between these (Wick et al., 2015). Nowadays, BANDAGE is a widely used tool to assess the quality of an assembly that is graph assisted. However, it is not designed to compare genome architectures, as it lacks the ability to represent individual paths, compare multiple genome architectures or analyse pan-genomes.
In this study, we aim to build a tool to study graph-based visualization of genome architectures. This tool aims to show Directed Acyclic Graphs for Genomic Readability (DAGGER). DAGGER was specifically developed for Ptolemy, where vertices represent genes in GFA1-formatted files (Li, 2016). Here, we demonstrate the use of DAGGER by the comparison of three novel coliphages. However, DAGGER struggles with bigger datasets (bigger datasets contain more cycles in a graph causing problems with a Sugiyama layout), as visualization is still troublesome. For that we developed Ptolemy Output Enhancement Tool (POET). This standalone tool was intended to datamine Ptolemy, to improve performance of DAGGER. For this, we have used a chimeric phage dataset (long read data) resulting from evolutionary experiments with Escherichia Coli BL21(DE3).
This tool is written in Java 8 with help of the JavaFX GUI toolkit, and uses the Java Universal Network/Graph framework (JUNG) for the in-memory storage and layout of graphs. JUNG was chosen to separate the Application Programming Interface between the layout and visualization algorithms, which allowed full customization when drawing with JavaFX. The use of the GUI toolkit JavaFX instead of Swing in DAGGER was chosen based on the recommendation from Oracle to migrate Java Applets to Java Web Start and JNLP (docs.oracle.com, 2018). JavaFX provides the ability to write the structure of the graphical user interface separate from the logic of the application in FXML, an XML-based language. In DAGGER the visualization of graphs is drawn with primitives on the graphical context of a canvas instead of shapes on a pane to provide better performance when looking at graphs with many edges and/or vertices.
DAGGER takes as input a GFA1-formatted file (Li, 2016) in which the gene graph is build up from all the lines in the file that describe a path of vertices (Figure 2). The edges and vertices of all paths are added to a graph implementation of JUNG for in-memory storage and layout algorithms.
Figure 2: An example of a minimal GFA1-formatted file. Fields of lines are separated by tabular characters and can be seen as light orange arrows. The first line starts with the character “H” and describes the header of a GFA1 file. Vertices are represented in lines with the tag “S” for segment. In such a line the “S” character is followed by a unique identifier and an optional nucleotide sequence (an “N” can be placed to signify no available DNA sequence). Lines starting with “L” are Links and represent edges. A Link line contains the identifier of the outgoing segment followed by its orientation and the identifier for the incoming segment again followed by its orientation. The last field of these lines is an optional CIGAR string to describe the overlap between the vertices (Li et al., 2009). The paths are described by lines starting with a “P”. In this line a unique path identifier has to be given with a comma-separated list of segment identifiers. An optional list of CIGAR strings can be provided to describe the individual segments.
The first of the two layout options in DAGGER is the Sugiyama layout, a hierarchical layered layout in which all edges traverse from the left to the right based on the methods developed by Sugiyama et al. (1981). The Sugiyama method of laying out graphs has five principles for making a layout of a multilevel digraph readable: (1) The layout should be hierarchical, with each edge pointing in the same direction. (2) There should be as few as possible edge crossings. (3) The edges connecting vertices should be as straight as possible. (4) The edges connecting vertices should be as short as possible. (5) Vertices should be balanced and evenly distributed to visualize the structural information of diverging and converging paths (Sugiyama, Tagawa, & Toda, 1981). DAGGER uses an implementation of the Sugiyama layout that is compatible with the Java Universal Network/Graph framework. The algorithm can be divided into three main steps: (1) The removal of cycles in the graph-based data structure. Here, a depth-first search is performed, in which a graph is traversed over a branch, visiting vertices and backtracking when it reaches the end of a branch. Visiting the same vertex more than once indicates the presence of a cycle. (2) Layer assignment, in which the vertices are assigned to a level from left to right based on the incoming and outgoing edges of vertices. (3) Reduction of edge crossings, where vertices are moved up or down within a Barycenter level (Sugiyama et al., 1981). This method determines the vertical position of vertices at the same level, by the average vertical position of vertices in their neighbouring levels.
In JavaFX a canvas is fully rendered, causing memory-related issues when a canvas has large dimensions. The implementation of the Sugiyama layout is visualized in a JavaFX ListView of a canvas per cell (Figure 3). The number of cells is first determined as described in Figure 4. A ListView is linked to an ObservableList which allows the canvases to update based on if their cell is visible for the user. The vertices and edges are assigned to canvases based on the coordinates assigned by the Sugiyama layout. Cells which contain canvases that are not in view will not be rendered, reducing memory allocation during visualization.
Figure 3: An abstract flowchart of the algorithm to display the Sugiyama layout into a ListView of CanvasCells.
Figure 4: Getting the number of canvases for the ListView to visualize the Sugiyama layout on.
A circular visualization is achieved onto a JavaFX canvas. The position of vertices and edges are calculated by the CircleLayout algorithm provided with JUNG. Depending on the size of the gene graph, the circular layout can get too large to view in detail when representing the entire figure. To solve this problem, zooming and panning are possible with mouse controls on the canvas on which the circular layout is drawn (Figure 5).
Figure 5: An abstract flowchart of handling zooming and panning of the Circular Layout. The zooming of the visualization is done by setting the scale of the contents of the pane depending on the position of the cursor and a scrolling event. Similarly, the panning of the visualization is handled by getting the start and end position of the cursor after a dragging event.
To easily extract biological relevant data from big data sets analysed by Ptolemy, a data enhancement tool was created. POET is a standalone application that can be run from the command line or terminal. POET was written in Java 8 without GUI support. POET is designed to analyse read-based gene graphs. This tool takes as input the GFA1-formatted file and a tab-delimited file containing annotation of gene identifiers generated by Ptolemy. The lines within the GFA1-formatted file that describe a path are parsed and added into a list. With this list of all paths containing at least one read, the user can choose to determine the number of unique reads and their frequency. Annotation of genes is based off the annotation file generated by Ptolemy (Salazar & Abeel, 2018). POET can also quantify reads containing interesting gene identifiers based on the users’ input.
A provided data set of three novel Coliphages, have been used to demonstrate the features of DAGGER. These three novel Coliphages came from a larger dataset of twelve viral samples consisting of ten Escherichia coli BL21 isolates along with a Bacillus subtilis and a Klebsiella pneumonia isolate. The three novel Coliphages have been determined to be most closely related to Escherichia phage vB_EcoM_AYO145A (accession code: NC_028825) with the help of BLAST (Altschul, Gish, Miller, Myers, & Lipman, 1990).
This dataset consists of long-sequencing reads generated by MinION of a chimeric generated virus of bacteria (bacteriophage or in short phage), BL21phi10 with 20423 reads after polishing. In order to study the genome architectures contained in this dataset a gene graph was created. To create this Ptolemy was used on the reads in combination with Escherichia coli BL21(DE3), complete genome (accession code: NC_012892.2). POET was used to get the most common unique paths of this gene graph. Annotations of the prophage region of this genome were re-annotated based on a Enterobacteria phage DE3, complete genome (accession code: EU078592.1) reference, BLASTp (Altschul et al., 1990) and HHphred web server (Zimmermann et al., 2018). POET was used with the gene graph generated by Ptolemy (Salazar & Abeel, 2018) to determine which reads contained at least all essential genes.
This subset of reads that contain all essential genes were further analysed in Geneious Prime version 2019.0.4 (Kearse et al., 2012). In Geneious Prime these reads were manually reversed if needed and zeroed to make alignments for each of these reads against the Enterobacteria phage DE3, complete genome reference. The whole genome alignment algorithm chosen to align these reads to the reference was the progressiveMauve algorithm (Darling, Mau, Blattner, & Perna, 2004). After confirming that all of these reads were in the same order and zeroed properly, they were annotated with the automatic annotation pipeline myRAST (Overbeek et al., 2013).
Here we present Directed Acyclic Graphs for GEnomic Readability – DAGGER. This tool with a graphical user interface (GUI) can be used for the visualizations of gene graphs. DAGGER was designed with gene graphs that are generated by Ptolemy (Salazar & Abeel, 2018) in mind. The GUI of DAGGER contains three main regions (Figure 6). In the options menu region, the user can load a gene graph with, optionally, an annotation file to visualize (Figure 7). From this file DAGGER takes all the paths of vertices and combines them into one labelled directed ordered multigraph, also referred to as a multilevel digraph, or simply a hierarchy. This gene graph can be visualized in two different layouts: (1) A Sugiyama layout, a hierarchical layered layout in which all edges traverse from the left to the right (Sugiyama et al., 1981). (2) A circular layout in which edges traverse clockwise around the circle except for when there is an alternative sub-path in a region of common vertices between multiple paths.
Figure 6: The Graphical User Interface of DAGGER. The GUI of DAGGER can be split into three main regions: (1) The options menu region contains all the options for graphs. The options are divided into submenus categorized on functionality. (2) The view is where the graph is visualized. This region is interactive and depending on the layout offers various controls. (3) The info panel shows information about a vertex after user interaction in the view.
Figure 7: Loading a gene graph with annotation and choosing a layout. (A) DAGGER accepts a gene graph as a GFA1-formatted file (Li, 2016) created by Ptolemy (Salazar & Abeel, 2018). (B) The user can choose to load a directory with annotation files and FASTA files created by Ptolemy. (C) To visualize a gene graph the user has to select either the Sugiyama layout or Circular layout.
In the implementation of the Sugiyama method a user can get a “local” overview of the graph to easily see the branching and joining of paths (Figure 8). To get an overview of which paths contain which vertices the user can select the option to highlight one or multiple paths. Paths that are close to each other and are overlaid on top of the same vertices share those vertices. Vertices and edges can be coloured by their depth (Figure 9). The depth is the number of traversing paths over a certain edge or vertex, which can also be called the coverage. The user can also choose to colour the edges based on which paths traverse them, with each path represented by a unique colour that the user is able to edit (Figure 10).
Figure 8: a Sugiyama layout of three coliphages visualized in DAGGER.
Figure 9: A region of a gene graph of three coliphages coloured by coverage/depth, visualized in DAGGER with the Sugiyama layout. (A) Vertices are drawn as circles representing genes with a gene identifier assigned by Ptolemy on top. The edges are drawn as lines between the circles. The coverage/depth can be described as the number of paths traversing an edge or vertex. The coloration scales from black to red with black being the lowest coverage/depth and red the highest. In this example the vertices and edges coloured in black have a coverage/depth of one. The vertices and edges with a brighter red colouration have a coverage/depth of two. (B) The options to select to create this coloration.
Figure 10: A region of a gene graph of three coli phages, visualized in DAGGER with the Sugiyama layout. (A) Vertices are drawn as circles representing genes with a gene identifier assigned by Ptolemy on top. Edges between the vertices are visualized in colour based on which path the edge belongs to. In this case each path represents an E. coli phage genome. In this region the blue path does not have common vertices with the red and yellow path. The red and yellow paths do have common vertices between them, which are visualized by the edges being close by and laid over those vertices. (B) The options to select to create this highlighting
DAGGER also provides a circular layout to get a more global overview of the entire gene graph. In this layout the vertices and edges are laid out clockwise around the circle until there is an alternative sub-path that is shared in a region of common vertices of multiple paths. In that case an edge could go from a vertex on one side of the circle to another vertex anywhere else in the circle (Figure 11). This circular layout can be zoomed in on areas of the graph. Clicking on one of these vertices and selecting the Sugiyama layout brings that area in view to get a better understanding of what is happening in that region of the graph.
Figure 11: Gene graph of three coliphages visualized in DAGGER with a circular layout. Vertices are drawn as circles representing genes with edges being drawn as coloured lines between the vertices. In this case there are multiple red lines going from one side of the circle to another side of the circle indicating a region of alternative sub-paths within a region of vertices that are common in multiple paths.
In both of these layouts the user can click on any vertex to show more information about a vertex such as outgoing and incoming vertices, traversing paths. Identifiers of the traversing paths can be exported to a text file. This can be used to quickly get identifiers of reads that contain a specific gene. Switching to the Sugiyama layout brings the user a local overview around the clicked vertex. Based on if the user has provided an annotation file generated by Ptolemy, a FASTA-format sequence and gene product/description can also be seen when clicking on any vertex. With this annotation information added it is also possible to search for gene identifiers that were assigned by Ptolemy and gene product/descriptions. To search for vertices with the same exact annotation as the query the user can surround the query by double quotes. It is also possible to submit a query with multiple annotations by using either the
Figure 12: The functionality of DAGGER to highlight vertices based on a query. In this case the search query tail AND capsid returns a list of all gene identifiers given by Ptolemy with an annotation that contains either “tail” or “capsid”. All the query results annotated with tail and capsid are shown in yellow and orange, respectively.
A graph can also be filtered based on how many times a vertex or edge is found in the paths. The user can specify this cut-off N for both vertices and edges to make a subgraph containing vertices and edges that are contained in at least N paths. After filtering it is possible to export this new graph-based data structure as a GFA1-formatted file.
In this study we also present POET (Ptolemy Output Enhancement Tool). POET is a tool used for getting more data out of Ptolemy (Salazar & Abeel, 2018). POET accepts gene graphs with vertices representing genes in a GFA1-formatted file (Li, 2016). With this GFA1-formatted file provided POET can calculate the frequency of all unique paths of vertices. If these paths represent the gene order of reads this step will give a simplistic overview of gene content of the unique reads with frequency. This data can also be visualized into an image with the paths of vertices with frequency in descending order.
With POET it is also possible to get a list of paths containing duplicate vertices and self-looping edges in which an edge starts and ends at the same vertex. When working with reads as paths this list will show paths containing one or multiple duplications of individual and multiple genes.
It can also be interesting to get statistics of individual genes such as number of genes in reads and frequency of genes overall. The user can also provide a manually curated list of vertex identifiers that represent genes of interest. With this list it is possible to analyse which combinations of these interesting genes show up in reads, which reads contain at least 2 of these interesting genes and finally which reads contain all interesting reads.
The choices that the user can make within POET to perform these tasks is visualized in Figure 13.
Figure 13: Workflow overview of POET. This workflow shows all the choices the user can make while using POET.
Three Coliphages from the species Escherichia phage vB_EcoM_AYO145A (accession code: NC_028825) have been visualized in DAGGER. The genomes have been given the identifiers 01, 05 and 07 in a previous study. Genomes 01, 05 and 07 contained 159, 317 and 168 genes respectively. The Sugiyama layout showed that there is an overlap of 116 genes between 01 and 07 (Figure 14). The overlap between these two genomes is not constant but equally distributed over their genomes from start to end. The genome 05 does not overlap with the other two genomes.
Figure 16: Overlap in genes of the starting region of the three coliphage genomes. Circles represent genes with their gene identifier given by Ptolemy above. A line between two circles signify the order of genes following each other (from left to right). The traversal over the genes of each of the three coliphage genomes is represented by their own colour (red, blue and yellow).
To study the genome architecture of the BL21phi10 chimeric phage dataset a gene graph was created with Ptolemy. The resulting gene graph is represented in a GFA-1 formatted file (Li, 2016) with each of the 20423 reads represented as a path of vertices and edges. All of these 20423 paths combined form a single gene graph. The most frequent unique paths of an average size over 6 kilobase pairs were determined and visualized with POET (Figure 15). This gene graph contained 171 unique genes, the distribution of reads containing a certain number of genes is seen in Figure 16. The frequency of each unique gene in the reads has been plotted (Figure 17). The annotation of genes that show up more than once in the set of 20423 reads has also been plotted (Figure 18). Through manual annotation based on the Enterobacteria phage DE3, complete genome (accession code: EU078592.1) reference it was determined which of these genes that show up more than once are essential for a phage genome to be able to replicate (Table 1). There were twenty-three genes of essentiality found of which five have a role in DNA packaging or maturation cluster and the other eighteen had a role in a phage structural cluster.
Figure 15: The most frequent genome architectures in BL21Phi10 using Ptolemy found by POET. In this figure each arrow represents a gene. Arrows that represent the same gene have the same identifier and colour. The different genome architectures are aligned to the genome architecture of the reference Escherichia coli BL21(DE3), complete genome. (accession number: NC_012892.2). This figure shows only a part, the prophage region, of the larger Escherichia coli BL21(DE3), complete genome. The number of genes outside of this prophage region can be found on the dots on the top left and top right. The relative frequency of how many reads contain a particular genome architecture can be found in blue underlined left of the genome architectures. A difference in gene order compared to the reference is visualized as a square containing the differently ordered genes.
Figure 16: The frequency of number of genes per read of 20423 reads in BL21Phi10.
Figure 17: The frequency of genes of 20423 reads in BL21Phi10. The genes are laid out in order of the reference Escherichia coli BL21(DE3), complete genome (accession code: NC_012892.2) over the x-axis. All 171 genes in this plot show up at least once, but the genes which show up once are not represented by a bar because of the logarithmic y-axis.
Figure 18: The frequency of genes that show up more than once with their annotations in BL21Phi10. Data from the chimeric phage dataset of 20423 reads. The genes are laid out in order of the reference Escherichia coli BL21(DE3), complete genome (accession code: NC_012892.2) over the y-axis, skipping genes that only show up once in the entire dataset.
Table 1: The annotation, essentiality and role of genes that show up more than once in BL21Phi10.
POET was used to determine the 46 reads which contain at least all essential genes. The size of said reads were compared to the reference Enterobacteria phage DE3, complete genome (Figure 19).
Figure 19: The size of raw reads in BL21Phi10 containing at least all essential genes compared to an Enterobacteria phage DE3, complete genome. 46 of the 20423 reads contain at least all determined essential genes to create a “minimal” phage genome. All 46 of these reads contained a larger sequence than the reference Enterobacteria phage DE3, complete genome (EU078592.1).
DAGGER was used to visualize gene graphs in which vertices represents genes as created by Ptolemy. BANDAGE was used previously to visualize Ptolemy but would lead to a poorly readable and difficult to interpret layout (Figure 20).
Figure 20: A gene graph of a pool of bacterial and phage DNA generated by Ptolemy and then visualized in BANDAGE. The vertices (squares) represent genes, the edges (black stripes connecting two vertices) represent a connection between two genes. The coloration of the vertices is random and has no biological relevance.
A major disadvantage of DAGGER is the use of directed acyclic graphs. A directed acyclic graph-based data structure has proven to be a not ideal data structure to represent multiple genomes or reads. There are mutations in the genome which can cause a cycle when representing the genome in a graph-based data structure (Figure 21). Gene graphs containing multiple genomes or reads wherein vertices represent genes can contain also contain cycles even when they do not have a duplication themselves, but have an inverse order of genes (Figure 21). The use of a hierarchical (layered) layout such as a Sugiyama layout (Sugiyama et al., 1981) (Figure 8) proved to have some issues in some cases of gene graphs. The construction of these layout algorithms starts with making a given graph-based data structure acyclical. In the case of gene graphs biological information such as the mutations described in Figure 21 could be lost to the user by this strategy.
Figure 21: The types of architectures that can cause incompatibility with the Sugiyama layout algorithm. In these three cases shapes with the same coloration represent the same gene. The Sugiyama layout works by making graphs acyclic, which can cause lost biological relevance when using the algorithm with a gene graph. The first case of a cycle in a gene graph is a gene duplication. A gene duplication can cause a cycle in a graph-based data structure when each vertex represents a unique gene. A duplicated gene which follows immediately after the original gene causes a similar problem. In this case the gene graph would contain a self-looping gene/vertex, which also is a circle and not permitted in the Sugiyama layout. If a gene graph consists of multiple paths it can also happen that a path with an inverse gene order is compared to another path in the same graph-based data structure. In this case there are incoming and outgoing edges of all the vertices they share, causing many circles in the gene graph.
The circular layout that is used in DAGGER can give a global overview of the entire gene graph, but it can be confusing to interpret diverging paths since they are not always easily visible. The crossing of edges from one part of the circle to another part of the circle does not necessarily have a biologically relevant meaning.
An alternative strategy for laying out these gene graphs could be using force-directed graph drawing algorithms. Force-directed graph drawings do come with the downside of a high running time and are best suited for graphs with under 500 vertices, but the combination of force-directed graphs with multilevel methods haven proven to be efficient in large graphs (Hachul & Junger, 2005). The genome assembly viewer BANDAGE uses this algorithm (Wick et al., 2015). However, the visualization of gene graphs generated by Ptolemy in BANDAGE still causes many overlapping edges and vertices, making the visualization hard to interpret without a great amount of manual correction. Because of this it could be interesting to build a graph up starting with only the most supported edges, adding the less supported edges manually and as a result create an easier to interpret layout with minimal edge crossings. This strategy is based on the assumption that there is a difference of support in edges between regions within the graph and that an area with the most supported edges would be fairly linear.
From the issues with the Sugiyama layout and circular layout came the need for another way of analysing the gene graphs generated by Ptolemy. POET (Ptolemy Output Enhancement Tool) was created specifically for this task. Some features of POET have however been made unnecessary with the new release of Ptolemy, e.g., the frequency of individual genes in the set of paths is now included in the GFA1-formatted file that Ptolemy produces as output.
POET is able to extract and visualize all unique paths of reads from a Ptolemy generated GFA-1 formatted file. However, this data should not be interpreted as the frequency of all unique genome architectures, since this simply calculates the frequency of unique gene orders in paths from the first gene to last gene. This would mean that a subset within a unique gene order set would not be detected or counted towards that subset of gene order e.g., given these two paths of vertices in order:
Path 1: 0,1,2,3,4,5
Path 2: 1,2,3,4,5,6
In this case these two paths would both be counted as unique. Although they have a subset (of 1,2,3,4,5) that is overlapping, this subset is not detected as a genome architecture that occurs twice by POET.
The most frequently found genome architectures in the case of the BL21phi10 dataset (Figure 15) looks to have three majorly different groups of genome architectures. These genome architectures are however only the most frequently found exact gene order with a cut-off of over 6 kilobase pairs long and it cannot be fully established that this represents the entire dataset. It would be interesting for a further study to compare the overlap of all unique paths and base the groups of unique genome architectures on the differently located overlapping regions within the gene order.
The frequency of genes in the 20423 reads of the BL21phi10 dataset (Figure 17) indicates that the entire gene set of the E. coli bacteria is present (counted at least once) with the prophage region showing up far more. This is expected as the dataset consist mainly of phage DNA that was released through lysis.
The whole genome alignment of the forty-six annotated reads containing at least all essential genes had very poor result. Of these forty-six reads, the progressiveMauve algorithm was able to only get a consensus of twelve base pairs between two reads.
Comparing genome architectures with a gene graph can be useful, but the best way to visualize the resulting gene graph is not trivial.
The assumption that a gene graph is acyclical is false. The use of an implementation of the Sugiyama (Sugiyama et al., 1981) method for laying out gene graphs in which vertices represent genes leaves out valuable biological information. There are very common cases in which mutations on gene level can cause a gene graph to contain circles which is not compatible with a Sugiyama layout. The single gene graph consists of multiple genomes when comparing genome architectures. The result can also cause cycles between those paths, without the paths themselves containing any cycles.
The implementation of this circular layout used in this study can give an overview of an entire gene graph, but lacks the ability to easily interpret the non-linear regions of such a gene graph.
Force-directed graph drawing of gene graphs generally suffers from too much edge crossings. Building a graph up from most to least supported edges could lead to less edge crossings and a better interpretation in a force-directed graph, although this strategy assumes that there is a variety of support within the gene graph.
When analysing unique genome architectures from a gene graph one should also look at the subsets that are contained in the individual paths. A partial overlap between paths should also be interpreted as a shared genome architecture. Simply determining the frequency of unique gene ordering based on the exact path does not outright prove the groups of unique genome architectures.
I want to thank Stan for allowing me to do my bachelor internship in his lab.
I am grateful to Thomas for making me more critical in asking myself what I do and do not know.
Thanks to Alex for learning me to communicate more clearly in non-technical terms.
I am also incredibly grateful for Franklin who helped me become a better researcher, and a more biological-aware computist. Although this project had it’s ups and downs I not only learned a lot but also grew as a person thanks to you.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403–410.
Angiuoli, S. V, & Salzberg, S. L. (2010). Mugsy: fast multiple alignment of closely related whole genomes. Bioinformatics, 27(3), 334–342.
Biederstedt, E., Oliver, J. C., Hansen, N. F., Jajoo, A., Dunn, N., Olson, A., … Dilthey, A. T. (2018). NovoGraph: Genome graph construction from multiple long-read de novo assemblies. F1000Research, 7(0), 1391. https://doi.org/10.12688/f1000research.15895.1
Biederstedt, E., Oliver, J. C., Hansen, N. F., Jajoo, A., Olson, A., Busby, B., & Dilthey, A. T. (2018). NovoGraph : Human genome graph construction from multiple long-read de novo assemblies [ version 2 ; referees : 2 approved ] A genome graph representation of seven ethnically diverse whole human genomes Referee Status :
Branton, D., Deamer, D. W., Marziali, A., Bayley, H., Benner, S. A., Butler, T., … Schloss, J. A. (2008). The potential and challenges of nanopore sequencing. Nature Biotechnology, 26(10), 1146–1153. https://doi.org/10.1038/nbt.1495
Chaisson, M. J. P., Wilson, R. K., & Eichler, E. E. (2015). Genetic variation and the de novo assembly of human genomes. Nature Reviews Genetics, 16, 627. Retrieved from https://doi.org/10.1038/nrg3933
Darling, A. C. E., Mau, B., Blattner, F. R., & Perna, N. T. (2004). Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Research, 14(7), 1394–1403.
Dilthey, A., Cox, C., Iqbal, Z., Nelson, M. R., & McVean, G. (2015). Improved genome inference in the MHC using a population reference graph. Nature Genetics, 47, 682. Retrieved from https://doi.org/10.1038/ng.3257
et al. M. Bastian, S. Heymann, M. J. (2009). Gephi: an open source software for exploring and manipulating networks. Proceedings of International AAAI Conference on Web and Social Media, 3, 361–362. Retrieved from http://bit.ly/1SVVbop
Fiers, W., Contreras, R., Duerinck, F., Haegeman, G., Iserentant, D., Merregaert, J., … Ysebaert, M. (1976). Complete nucleotide sequence of bacteriophage MS2 RNA: primary and secondary structure of the replicase gene. Nature, 260, 500. Retrieved from https://doi.org/10.1038/260500a0
Fierst, J. L. (2015). Using linkage maps to correct and scaffold de novo genome assemblies: Methods, challenges, and computational tools. Frontiers in Genetics, 6(JUN), 1–8. https://doi.org/10.3389/fgene.2015.00220
Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F., Kerlavage, A. R., … Merrick, J. M. (1995). Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science, 269(5223), 496–512.
Fraser, C. M., Gocayne, J. D., White, O., Adams, M. D., Clayton, R. A., Fleischmann, R. D., … Kelley, J. M. (1995). The minimal gene complement of Mycoplasma genitalium. Science, 270(5235), 397–404.
Garrison, E., Sirén, J., Novak, A. M., Hickey, G., Eizenga, J. M., Dawson, E. T., … Durbin, R. (2018). Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature Biotechnology, 36, 875. Retrieved from https://doi.org/10.1038/nbt.4227
Gordon, D., Huddleston, J., Chaisson, M. J. P., Hill, C. M., Kronenberg, Z. N., Munson, K. M., … Eichler, E. E. (2016). Long-read sequence assembly of the gorilla genome. Science, 352(6281), aae0344. https://doi.org/10.1126/science.aae0344
Hachul, S., & Junger, M. (2005). Large-Graph Layout with the Fast Multipole Multilevel Method. ACM Transactions on Graphics, V(December), 1–27. https://doi.org/10.7155/jgaa.00150
Jain, M., Koren, S., Miga, K. H., Quick, J., Rand, A. C., Sasani, T. A., … Loose, M. (2018). Nanopore sequencing and assembly of a human genome with ultra-long reads. Nature Biotechnology, 36, 338. Retrieved from https://doi.org/10.1038/nbt.4060
Jünemann, S., Sedlazeck, F. J., Prior, K., Albersmeier, A., John, U., Kalinowski, J., … Harmsen, D. (2013). Updating benchtop sequencing performance comparison. Nature Biotechnology, 31, 294. Retrieved from https://doi.org/10.1038/nbt.2522
Kearse, M., Moir, R., Wilson, A., Stones-Havas, S., Cheung, M., Sturrock, S., … Duran, C. (2012). Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics, 28(12), 1647–1649.
Koonin, E. V. (2009). Evolution of genome architecture. International Journal of Biochemistry and Cell Biology, 41(2), 298–306. https://doi.org/10.1016/j.biocel.2008.09.015
Koonin, E. V, & Galperin, M. (2013). Sequence—evolution—function: computational approaches in comparative genomics. Springer Science & Business Media.
Li, H. (2016). Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32(14), 2103–2110. Retrieved from http://dx.doi.org/10.1093/bioinformatics/btw152
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., … Durbin, R. (2009). The sequence alignment/map format and SAMtools. Bioinformatics, 25(16), 2078–2079.
Loman, N. J., Quick, J., & Simpson, J. T. (2015). A complete bacterial genome assembled de novo using only nanopore sequencing data. Nature Methods, 12, 733. Retrieved from https://doi.org/10.1038/nmeth.3444
Marschall, T., Marz, M., Abeel, T., Dijkstra, L., Dutilh, B. E., Ghaffaari, A., … Schönhuth, A. (2018). Computational pan-genomics: Status, promises and challenges. Briefings in Bioinformatics, 19(1), 118–135. https://doi.org/10.1093/bib/bbw089
Overbeek, R., Olson, R., Pusch, G. D., Olsen, G. J., Davis, J. J., Disz, T., … Shukla, M. (2013). The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Research, 42(D1), D206–D214.
Salazar, A. N., & Abeel, T. (2018). Approximate, simultaneous comparison of microbial genome architectures via syntenic anchoring of quiver representations. Bioinformatics, 34(17), i732–i742. https://doi.org/10.1093/bioinformatics/bty614
Sugiyama, K., Tagawa, S., & Toda, M. (1981). Methods for visual understanding of hierarchical system structures. IEEE Transactions on Systems, Man, and Cybernetics, 11(2), 109–125.
Tettelin, H., Masignani, V., Cieslewicz, M. J., Donati, C., Medini, D., Ward, N. L., … Fraser, C. M. (2005). Genome analysis of multiple pathogenic isolates of <em>Streptococcus agalactiae</em>: Implications for the microbial “pan-genome.” Proceedings of the National Academy of Sciences of the United States of America, 102(39), 13950 LP-13955. https://doi.org/10.1073/pnas.0506758102
Waterston, R. H., Lander, E. S., & Sulston, J. E. (2002). On the sequencing of the human genome. Proceedings of the National Academy of Sciences, 99(6), 3712 LP-3716. https://doi.org/10.1073/pnas.042692499
Wick, R. R., Schultz, M. B., Zobel, J., & Holt, K. E. (2015). Bandage: Interactive visualization of de novo genome assemblies. Bioinformatics, 31(20), 3350–3352. https://doi.org/10.1093/bioinformatics/btv383
Zimmermann, L., Stephens, A., Nam, S.-Z., Rau, D., Kübler, J., Lozajic, M., … Alva, V. (2018). A completely Reimplemented MPI bioinformatics toolkit with a new HHpred server at its Core. Journal of Molecular Biology, 430(15), 2237–2243.
]]>