C and unix runnable . Warning: As you will see below, the descriptions of the as
ID: 3802777 • Letter: C
Question
C and unix runnable . Warning: As you will see below, the descriptions of the assignments will be increasingly complex because we are asking you to build increasingly bigger programs. Make sure to read the assignment carefully! This is critical because this document describes the requirements for your program. 0. Introduction In this assignment you will practice using the file system API (as well as pointers in different data structures). In particular you will be creating, opening, reading, writing, and deleting files. Your task is to write an indexing program, called an indexer. Given a set of files, an indexer will parse the files and create an inverted index, which maps each token found in the files to the subset of files that contain that token. In your indexer, you will also maintain the frequency with which each token appears in each file. The indexer should tokenize the files and produce an inverted index of how many times the word occurred in each file, sorted by word. Your output should be in the following format: count0 count1 count2 count3 count4 The above depiction gives a logical view of the inverted index. In your program, you have to define data structures to hold the mappings (token to list) and the records (file name, count). An inverted index is a sequence of mappings where each mapping maps a token (e.g., “dog”) to a list of records, with each record containing the name of a file whose content contains the token and the frequency with which the token appears in that filename. Here is an example of how the indexer should work. If you are given the following set of files: File Path File Content /adir/boo A dog named named Boo /adir/baa A cat named Baa /adir/bdir/baa Cat cat Your indexer should output: 1 1 1 1 3 1 2 1 The inverted index file that your indexer writes must follow the XML format defined above. Words must be sorted in alphanumeric order. All characters of a word should be first converted to lowercase before the word is counted. Your output should print with the lists arranged in alphanumeric order (a to z, 0 to 9) of the tokens. The filenames in your output should be in descending order by frequency count (highest frequency to lowest frequency).If there is a word with the same frequency in two or more files, order them by path name alphanumerically (a to z, 0 to 9). After constructing the entire inverted index in memory, the indexer will save it to a file. 2. Implementation Your program must implement the following command-line interface: invertedIndex The first argument, , gives the name of a file that you should create to hold your inverted index. The second argument, , gives the name of the directory or file that your indexer should index. If the second argument is a directory, you need to recursively index all files in the directory (and its sub-directories). If the second argument is a file, you just need to index that single file. When indexing files in a directory, you may have files that have the same name in separate directories. You should combine all token frequencies for the same filename regardless of which directory it appears in. We define tokens as any sequence of consecutive alphanumeric characters (a- z, A-Z, 0-9) starting with an alphabetic character. Examples of tokens according to the above definition include: a, aba, c123 If a file contains This an$example12 mail@rutgers it should tokenize to this an example12 mail rutgers The XML format lets us easily read the inverted index for debugging. You should carefully consider how the program may break and code it robustly. You should outline and implement a strategy to deal with potential problems and inconsistencies. For example, if a file already exists with the same name as the inverted-index file name, you should ask the user if they wish to overwrite it. If the name of the directory or file you want to index does not exist, your indexer should print an error message and exit gracefully rather than crash. There are many other error cases that you ought to consider. 3. Hints Data structures that might be useful include a list that sorts as you insert and/or a hash table. A custom record type (e.g., a record ({"baa" : 3}) that can be inserted into multiple data structures, such as a sorted list and/or a hash table). You should probably approach this in steps: o First, build a simple tokenizer to parse tokens from a file. o Next, get your program to walk through a directory. o Next, implement a data structure that allows you to count the number of occurrences of each unique token in a file. And so on ...
Explanation / Answer
General Syntax
A string is a series of characters, and array, followed by a null byte. Before there were no classes, there were structs. No visibility classes. No message passing, methods.
Pointers
A pointer variable contains an address. If I say "pointer to" and "address of", it's equivalent.
Subscripts can be positive, negative, and zero. When you use arrays in C, there is no array bounds checking. If you have a ten element array, C does not mind if you just meander in either direction.
Pointers to Functions
September 18th, 2013 Programming Assignment 1: Tokenizer
Introduction
In this assignment, you will practice programming with C pointers. Much of the pointer manipulation will come in the form of operating on C strings, although you will be dealing with some pointers to structs as well.
Your task is to write a type and a set of functions in essence, the equivalent of a Java class that implements a tokenizer. The tokenizer should accept two strings, the first of which will contain a set of separator characters while the second will contain a set of terms separated by one or more separator characters. The tokenizer should return the terms in the second string one at a time, each term is called a token, hence your program is called a tokenizer. For example, when given the following two strings:
your tokenizer should return:
When given the following two strings:
your tokenizer should return:
A string is a sequence of characters delimeted by double quotes ("). Strings can contain newline or double-quote characters, but special syntax is required to contain them and certain other characters. These special characters are represented with escape sequences:
Implementation
Your implementation needs to export the interface given in the attached tokenizer.c file. In pa
rticular, you need to define the type needed to represent a tokenizer and three functions for creating and destroying tokenizer objects and getting the next token. Note that we have only defined the minimal interface needed for external code (e.g., our testing code) to use your tokenizer. You will likely need to design and implement additional types and functions.
A token is a sequence of any ASCII character that does not contain a separator character. Separator characters are provided as a string of one or more ASCII characters. Each pair of tokens are separated by one or more separator characters. Multiple separators may be next to each other (see second example above), and/or at the beginning and/or end of the term string. When this happens, your tokenizer should discard all separators.
Your implementation must not modify the two original strings in any way. Further, your implementation must return each token as a C string in a character array of the exact right length. For example, the token usr should be returned in a character array with 4 elements (the last holds the character ’’ to signify the end of a C string).
You may use string functions from the standard C library accessible through string.h (e.g, strlen()). However, you may not use strtok(), strsep() or any similar function that already performs the complete tokenization process.
Related Questions
Navigate
Integrity-first tutoring: explanations and feedback only — we do not complete graded work. Learn more.