Superstring Assembly Suppose that multiple copies of a long string are cut up in
ID: 3746894 • Letter: S
Question
Superstring Assembly Suppose that multiple copies of a long string are cut up into much smaller pieces. If you are given the available fragments, is it possible to rebuild the larger string? For example, if the original string was "algorithmsarefunfunfun", and you got given a list of fragments like "refun" and "unfun" "rith" and "lgo" and so on, plus lots more small fragments, could you rebuild the original string? Probably you could, right? But what about if the original string was millions of characters long, like War and Peace? Or billions of characters long, like your genetic sequence? Let's ask a slightly different question now: if you got given a set of small string fragments each from a larger string that you didn't know anything about, could you at least find the shortest string in which every supplied fragment occurs at least once? This problem is known as the shortest superstring problem, and can be thought of as having parameters n, the number of string fragments provided, and m, the total length of those string fragments. The shortest superstring problem is very important in bioinformatics, where genome reassembly from short fragments called reads helps contribute to our understanding of the behavior of organisms. Unfortunately, finding optimal solutions to the shortest superstring problem has been shown to be very challenging, and falls into a family of intractable problems that we'll talk about later in the semester. In this project you will implement some approximation techniques that generate non-optimal superstrings from a collection of string fragments Input Data Input to your program will (always) come from stdin, and will (always) consist of strictly lower-case alphabetic strings, one per input line, each line terminated by a single newline 'n' character. Be sure to read the message on the FAQ page about newlines. For example, the following file testo.txt containing six fragments is a valid input accgtcgatg gcctag gtacctacta cgatgcc tcgatgccgca atgagaccgtc As you develop your program according to the stages listed below, the output will evolve. Full output examples for this and two other test files are linked from the FAQ page. You should also check your program against other inputs too; and can easily generate your own tests using a text editor. Testing and debugging is your responsibilityExplanation / Answer
#include <bits/stdc++.h>
using namespace std;
//compares and returns the smaller number
int getSmallerNumber(int a, int b)
{
return (a < b) ? a : b;
}
//Calculate maximum overlap between two strings
int calculateTheLengthOfMaximumOverlappingPair(string input1, string input2, string &result)
{
//maximum length of the matching prefix and suffix
int maxOverlapLength = INT_MIN;
int strLen1 = input1.length();
int strLen2 = input2.length();
// compare suffix of input1 with prefix of input2
for (int i = 1; i <= getSmallerNumber(strLen1, strLen2); i++)
{
// comparing input1's ending characters of length i with starting characters of input2
if (input1.compare(strLen1-i, i, input2, 0, i) == 0)
{
if (maxOverlapLength < i)
{
//update max and result
maxOverlapLength = i;
result = input1 + input2.substr(i);
}
}
}
// similarly compare prefix of input1 with suffix of input2
for (int i = 1; i <= min(strLen1, strLen2); i++)
{
// comparing input1's starting i characters with ending i characters in input2
if (input1.compare(0, i, input2, strLen2-i, i) == 0)
{
if (maxOverlapLength < i)
{
//update max and result
maxOverlapLength = i;
result = input2 + input1.substr(i);
}
}
}
return maxOverlapLength;
}
//Calculate shortest superstring in an array
string calculateShortestSuperstringInArray(string fragments[], int len)
{
while(len != 1)
{
int maxOverlapLength = INT_MIN;
int index1, index2; // to store array index of strings involved in maximum overlap
string result; // to store resultant string after maximum overlap
for (int i = 0; i < len; i++)
{
for (int j = i + 1; j < len; j++)
{
string str;
// res = maximum overlap length between two strings
// str = resultant overalap string of the two given strings
int res = calculateTheLengthOfMaximumOverlappingPair(fragments[i], fragments[j], str);
// check for maximum overlap
if (maxOverlapLength < res)
{
maxOverlapLength = res;
result.assign(str);
index1 = i, index2 = j;
}
}
}
//moving on to the next cycle, exempting the last element
len--;
// if no overlap, append the complete string to the previous string
if (maxOverlapLength == INT_MIN)
fragments[0] += fragments[len];
else
{
fragments[index1] = result; // copy resultant string to index1
fragments[index2] = fragments[len]; // copy string at last index to index2
}
}
return fragments[0];
}
//Main Function
int main()
{
string fragments[100];
//file reading to get the fragments into an array
ifstream test("sample.txt");
//check if file is present or not
if(!test) {
cout << "Cannot open the file. ";
return 1;
}
char str[255];
int i = 0;
while(test) {
test.getline(str, 255); // delim defaults to ' '
if(test){
fragments[i] = str; //adding to fragments array one by one
}
i++;
}
test.close(); //closing the file
//Stage 0 - Number of fragments and total number of charcters
cout << "Total number of fragments are " << i << endl;
cout << "The Shortest Superstring for the given fragments is "
<< calculateShortestSuperstringInArray(fragments, i);
return 0;
}
Related Questions
drjack9650@gmail.com
Navigate
Integrity-first tutoring: explanations and feedback only — we do not complete graded work. Learn more.