Academic Integrity: tutoring, explanations, and feedback — we don’t complete graded work or submit on a student’s behalf.

As some of you know well, and others of you may be interested to learn, a number

ID: 3607414 • Letter: A

Question

As some of you know well, and others of you may be interested to learn, a number of languages (including Chinese and Japanese) are written without spaces between the words. Consequently, software that works with text written in these languages must address the word segmentation problem--inferring li_kely boundaries between consecutive words in the text. If English were written without spaces, the analogous problem would consist of taking a string like "meetateight" and deciding that the best segmentation is "meet at eight" (and not "me et at eight," or "meet ate ight," or any of a huge number of even less plausible alternatives). How could we automate this process? A simple approach that is at least reasonably effective is to find a segmentation that simply maximizes the cumulative "quality" of its individual constituent words. Thus, suppose you are given a black box that, for any string of letters x = xlx2-., xk, will return a number quality(x). This number can be either positive or negative; larger numbers correspond to more plausible English words. (So quaIity("rne") would be positive, while quality("ght") would be negative.) Given a long string of letters y = YlY2 ¯" "Yn, a segmentation of y is a partition of its letters into contiguous blocks of letters; each block corresponds to a word in the segmentation. The total quality of a segmentation is determined by adding up the qualities of each of its blocks. (So we’d get the right answer above provided that quaIity("rneet") + quality("at") + quality(" eight") was greater than the total quality of any other segmentation of the string.) Give an efficient algorithm that takes a string y and computes a segmentation of maximum total quality. (You can treat a single call to the black box computing quality(x) as a single computational step.) (A final note, not necessary for solving the problem: To achieve better performance, word segmentation software in practice works with a more complex formulation of the problem--for example, incorporating the notion that solutions should not only be reasonable at the word level, but also form coherent phrases and sentences. If we consider the example "theyouthevent," there are at least three valid ways to segment this into common English words, but one constitutes a much more coherent phrase than the other two. If we think of this in the terminology of formal languages, this broader problem is like searching for a segmentation that also can be parsed well according to a grammar for the underlying language. But even with these additional criteria and constraints, dynamic programming approaches lie at the heart of a number of successful segmentation systems.)

Explanation / Answer

Given :-- a string , quality function that given an sequence of letters returns the quality of the sequence (the higher the value returned , the better is the quality.

To Calculate :-- The problem is to compute the optimal segmentation of the input string into words such the sum of these word's quality is the highest possible.

Solution :--   We will use dynamic programming approach to compute the optimal segmentation. Let getRes(j) be the following function defined recursively , for 0<= j <= n

For j = 0 : getRes(0) = 0

For j>0 : getRes(j) = max {getRes(k-1) + quality(yk...yj)} ( 1<=k<= j and yk...yj means sequence of characters from yk to yj)

This recursive definition states that the optimal way to segment the input string is by finding the best possible split of the input string into two parts: a right-hand side part that will consist of just one word, and a left-hand side part that will be a sequence which will be segmented recursively .

Algorithm of DP approach :-

     

{
arr[0] := 0;

for (j := 1 to n) {
temp := -infinity
for (k = 1 to j) {
if temp < arr[k-1] + quality(yk...yj)) then {
temp := arr[k-1] + quality(yk...yj))
}
arr[j] := temp
}
}

The above algorithm works with 0(n2) time complexity . We will take an array of size n+1 (where n is length of string). We first do initialization with arr[0] = 0 . After that we will run a loop from 1 to n and take the variable temp as -infinity . We will take another loop inside the previous loop that will run from 1 to j (where j is value of outer loop while iteration) . Inside second loop, we will check the expression temp < arr[k-1] + quality(yk..yj) . Here quality(yk...yj) means sum of quality of sequence of letters from k to j . If expression is true , then we will update the value of temp and at last we assging arr[j] = temp . At the end arr[n] will hold the required result.

// Please let me know , if you have any doubts .