


Question

need to construct a class that defines a parser. This parser will be able to break up a string into tokens. A token is a distinct string of text – it can be a word, a symbol, or a combination of these. Your parser will eliminate whitespace and break string snippets into individual tokens (where single-character tokens are defined at runtime).

Your parser should behave as follows:

1) Whitespace (spaces, tabs, and new lines) is used only to separate tokens – otherwise, it is ignored.

2) Most tokens are made up of strings of text, which may include punctuation and other symbols.
3) In addition, single-character tokens (punctuation and/or other symbols) may be defined.

For example, consider the following string:

If no symbols are considered tokens, the string would be parsed like this:

If the parser instead considered parentheses to be tokens, they would be broken out too:

...And if punctuation were added, the result would be this:

The parser for this assignment will, by default, treat symbols like any other text character – in other words, the default behavior will be that all characters not separated by whitespace will be part of the same token. However, the parser will provide several functions (detailed below) that allow the user to define additional, single-character symbols as tokens as well.

The class TokenParser should be declared in TokenParser.h and contain the following methods:

TokenParser() Default constructor for this class; creates a parser with no initial single-character tokens.

TokenParser(const vector<char>& tokens) Constructor for this class; creates a parser with the provided single-character tokens defined.

void setCharacterTokens(const vector<char>& tokens) Resets this parser’s single-character tokens to the provided list of tokens.

void addCharacterTokens(const vector<char>& newTokens) Adds the provided single-character tokens to the set of tokens this parser recognizes.

void addCharacterToken(char newToken) Adds the provided single-character token to the set of tokens this parser recognizes.

vector<string> parse(string input) const Parses the string and returns a vector of tokens in string form.

Explanation / Answer

The first step in writing a parser is to tokenize the input string. This means to separate the input string into short bits that represent the basic entities in the expression. We could do this by hand, reading one character at a time and assembling the tokens character by character. This would be tedious, difficult to maintain and hard to extend should we decide to add more features to the tokenizer.

Instead we decide to use the java.util.regex package to do the work for us. This will allow us to implement quite a general tokenizer that can be used for any kind of grammar.

Let’s start by creating a class called Tokenizer and defining an internal class TokenInfo that holds information about the individual tokens.

As you can see, TokenInfo contains two fields. The regular expression that is used to match the input string against the token is stored in the Pattern regex. We store Pattern objects instead of the regular expression strings to improve performance. Regular expressions have to be compiled, which takes time; Pattern stores the regular expression in compiled form. The code of the token is given by the integer value token. Each type of token should have its own code.

We define a linked list, called tokenInfos, to hold the information for all the tokens. The linked list is created in the constructor of our Tokenizer class. Next we need to add the token information in our list. We do this via the add() method.

The user can pass a regular expression string and a token code to the method. The method will then prepend the “^” character to the user-supplied regular expression. It causes the regular expression to match only the beginning of a string. This is needed because we will be removing each token from the input as we find it, always looking for the next token at the beginning of the remaining string.

We also want to store the information about the tokens we have seen. We need to store the token code and the string that corresponds to the token. The string is needed because the token code does not retain the full information about the input. When we have found a variable we will give the token a special code for variable tokens but we need to also keep the name of the variable so that we can later use it to store or retrieve information. To this end we define another internal class called Token and a linked list of these tokens.

In our constructor of Tokenizer we now also have to initialize the tokens list.
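The original code listings are not reproduced in this answer, but based on the description so far, the skeleton of the class might look roughly like this (the class, field, and method names follow the prose; everything else is an assumption):

```java
import java.util.LinkedList;
import java.util.regex.Pattern;

class Tokenizer {
    // Pattern/code pair describing one type of token.
    static class TokenInfo {
        final Pattern regex; // compiled pattern, stored compiled for speed
        final int token;     // unique code for this token type
        TokenInfo(Pattern regex, int token) {
            this.regex = regex;
            this.token = token;
        }
    }

    // A token actually found in the input: its code plus the matched text.
    static class Token {
        final int token;       // token code
        final String sequence; // the matched string
        Token(int token, String sequence) {
            this.token = token;
            this.sequence = sequence;
        }
    }

    final LinkedList<TokenInfo> tokenInfos;
    final LinkedList<Token> tokens;

    Tokenizer() {
        tokenInfos = new LinkedList<TokenInfo>();
        tokens = new LinkedList<Token>();
    }

    // Prepend "^" so the pattern can only match at the start of the
    // remaining input, then store the compiled pattern with its code.
    void add(String regex, int token) {
        tokenInfos.add(new TokenInfo(Pattern.compile("^(" + regex + ")"), token));
    }
}
```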

We can now add token information in our main method.

The code above adds regular expressions that match functions, brackets, mathematical operators, integer numbers and variables. Note how each type of token gets a unique code. Note also how we have to escape special characters in the regular expressions with a double backslash.
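A sketch of what those registrations could look like, using a compact stand-in for the Tokenizer class the text describes (the exact pattern strings and token codes are illustrative assumptions, not the original listing):

```java
import java.util.LinkedList;
import java.util.regex.Pattern;

public class Main {
    // Compact stand-in for the Tokenizer class built up in the text.
    static class Tokenizer {
        static class TokenInfo {
            final Pattern regex; final int token;
            TokenInfo(Pattern regex, int token) { this.regex = regex; this.token = token; }
        }
        final LinkedList<TokenInfo> tokenInfos = new LinkedList<TokenInfo>();
        void add(String regex, int token) {
            tokenInfos.add(new TokenInfo(Pattern.compile("^(" + regex + ")"), token));
        }
    }

    public static void main(String[] args) {
        Tokenizer tokenizer = new Tokenizer();
        tokenizer.add("sin|cos|exp|ln|sqrt", 1);   // function names
        tokenizer.add("\\(", 2);                   // open bracket (escaped)
        tokenizer.add("\\)", 3);                   // close bracket (escaped)
        tokenizer.add("[+-]", 4);                  // plus or minus
        tokenizer.add("[*/]", 5);                  // multiply or divide
        tokenizer.add("\\^", 6);                   // raise to a power (escaped)
        tokenizer.add("[0-9]+", 7);                // integer number
        tokenizer.add("[a-zA-Z][a-zA-Z0-9_]*", 8); // variable
        System.out.println(tokenizer.tokenInfos.size() + " token types registered");
    }
}
```

Note how the special regex characters (, ), and ^ are escaped with a double backslash in the Java source.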

We are now ready to tokenize an input string. This is done in the tokenize method. In this method we first define a local string that contains the input string but without any leading or trailing spaces. We also clear the tokens list of any previous data.

The main loop is carried out until the local input string is empty. When we find a token we will remove it from the front of the string. If we are successful and the whole string could be tokenized then we will eventually end up with an empty string.

The match variable indicates if any of the tokens provided a match with the beginning of the input string. Initially we have no match. We now loop through all the token infos and create a Matcher for the input string from each pattern.

If the pattern is found in the input string then match is set to true. The group() method of the Matcher returns the string of the last match. A new Token object is appended to the list of tokens. The token object contains the code of the token that resulted in the match and the matched string. In the next step, the matcher is used to remove the token from the beginning of the input string. We do this with the replaceFirst() method which replaces the first (and only) match with an empty string. Finally we break out of the loop over the token infos because we don’t need to check any other token types in this round.

After the loop has finished we check if a match was found. If not, we throw an exception. We are using a ParserException, which extends RuntimeException without adding new functionality to it.

This is it! We are done with tokenizing the user input.
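Pulling the description together, the tokenize() method might look like the following sketch (class and method names follow the text; trimming whitespace after each removed token is an added assumption so that spaces between tokens are skipped):

```java
import java.util.LinkedList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Thrown when no token pattern matches the remaining input.
class ParserException extends RuntimeException {
    ParserException(String message) { super(message); }
}

class Tokenizer {
    static class TokenInfo {
        final Pattern regex; final int token;
        TokenInfo(Pattern regex, int token) { this.regex = regex; this.token = token; }
    }
    static class Token {
        final int token; final String sequence;
        Token(int token, String sequence) { this.token = token; this.sequence = sequence; }
    }

    final LinkedList<TokenInfo> tokenInfos = new LinkedList<TokenInfo>();
    final LinkedList<Token> tokens = new LinkedList<Token>();

    void add(String regex, int token) {
        tokenInfos.add(new TokenInfo(Pattern.compile("^(" + regex + ")"), token));
    }

    // Strip surrounding whitespace, then repeatedly match the registered
    // patterns against the front of the string, removing each match,
    // until the input is consumed or no pattern matches.
    void tokenize(String str) {
        String s = str.trim();
        tokens.clear();
        while (!s.equals("")) {
            boolean match = false;
            for (TokenInfo info : tokenInfos) {
                Matcher m = info.regex.matcher(s);
                if (m.find()) {
                    match = true;
                    tokens.add(new Token(info.token, m.group().trim()));
                    // remove the matched token (and any whitespace that
                    // follows it) from the front of the input
                    s = m.replaceFirst("").trim();
                    break; // earlier patterns take precedence
                }
            }
            if (!match) {
                throw new ParserException("Unexpected character in input: " + s);
            }
        }
    }

    LinkedList<Token> getTokens() {
        return tokens;
    }
}
```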

If you paid close attention you might have realized that the regular expression for variable tokens also matches any function token. This is not a bug. By storing the token information in a linked list, and always ensuring that we look for matches from the beginning of the list, we are giving precedence to patterns added first. If the input sequence starts with alphabetic characters, the function pattern will be tested first and will match any one of the specified functions. Only if there is no match will the variable pattern be tested.

To access the result of tokenizing an input string we define the getter for the tokens list.

Now we are ready to tokenize the input. In our main method we add the following code:
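A complete, runnable version of the whole example as the text describes it might look like this (the patterns, token codes, and sample input are illustrative assumptions):

```java
import java.util.LinkedList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    static class ParserException extends RuntimeException {
        ParserException(String message) { super(message); }
    }

    static class Tokenizer {
        static class TokenInfo {
            final Pattern regex; final int token;
            TokenInfo(Pattern regex, int token) { this.regex = regex; this.token = token; }
        }
        static class Token {
            final int token; final String sequence;
            Token(int token, String sequence) { this.token = token; this.sequence = sequence; }
        }

        final LinkedList<TokenInfo> tokenInfos = new LinkedList<TokenInfo>();
        final LinkedList<Token> tokens = new LinkedList<Token>();

        void add(String regex, int token) {
            tokenInfos.add(new TokenInfo(Pattern.compile("^(" + regex + ")"), token));
        }

        void tokenize(String str) {
            String s = str.trim();
            tokens.clear();
            while (!s.equals("")) {
                boolean match = false;
                for (TokenInfo info : tokenInfos) {
                    Matcher m = info.regex.matcher(s);
                    if (m.find()) {
                        match = true;
                        tokens.add(new Token(info.token, m.group().trim()));
                        s = m.replaceFirst("").trim();
                        break;
                    }
                }
                if (!match) throw new ParserException("Unexpected character in input: " + s);
            }
        }

        LinkedList<Token> getTokens() { return tokens; }
    }

    public static void main(String[] args) {
        Tokenizer tokenizer = new Tokenizer();
        tokenizer.add("sin|cos|exp|ln|sqrt", 1);   // function
        tokenizer.add("\\(", 2);                   // open bracket
        tokenizer.add("\\)", 3);                   // close bracket
        tokenizer.add("[+-]", 4);                  // plus or minus
        tokenizer.add("[*/]", 5);                  // multiply or divide
        tokenizer.add("\\^", 6);                   // raise to a power
        tokenizer.add("[0-9]+", 7);                // integer number
        tokenizer.add("[a-zA-Z][a-zA-Z0-9_]*", 8); // variable

        tokenizer.tokenize("sin(x) + 10");
        for (Tokenizer.Token tok : tokenizer.getTokens()) {
            System.out.println(tok.token + "\t" + tok.sequence);
        }
        // Output: six tokens — sin ( x ) + 10, with codes 1 2 8 3 4 7
    }
}
```

Note that "sin" is reported with the function code, not the variable code, because the function pattern was added first, as discussed above.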