Week 9: Files

Week 9: In-Class#

Coding Practice#

Code 9.1: Surname Percentage#

Here you will use the file week_09_files/efternavne.csv, which you have read in Prep 9.8: CSV Files.

This file contains a list of surnames registered in Denmark and the number of people with that surname. The file is a CSV file, which means that each line contains a surname and the number of people with that surname separated by a comma. Names with letters not in the English alphabet have been removed from the file.

Using this file, we wish to solve the following problem: For a given surname, what percentage of people in Denmark have that surname? For example, if we look up a certain surname and find that 10 thousand people have that surname, and the total number of people is 5 million, the percentage of people with that surname is \((10\,000 \cdot 100) / 5\,000\,000 = 0.2\) percent. If the name is not in the file, the percentage is 0.

Problem Analysis

First, inspect the content of the file. You can open a CSV file in any text editor, and since VS Code has a built-in text editor, you can open it directly in VS Code.

  1. Think of a surname you are interested in, and find it in the file. What is the number of people with that surname?

  2. Next, consider the other number you need to compute the percentage: the total number of people. Could you compute it by hand if you received the entire file printed on paper? Would that be boring or error-prone?

  3. Instead, you will test your solution on an input where you know the correct answer. Create a dummy file dummy_efternavne.csv with a few surnames and their counts. Use numbers that are easy to compute with. An example of the file content could be:

Adams,4000
Brown,1000
Clark,3000
Davis,2000

Now calculate by hand what the expected output for Adams and Brown would be.
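Once you have your hand-computed answers, you can compare them with a program. The following is a minimal sketch, not the official solution. The function name surname_percentage is hypothetical, and the sketch assumes that the file has no header line and that every non-empty line has exactly two fields. The dummy file is written to a temporary folder so that the sketch does not touch your own files.

```python
import csv
import os
import tempfile


def surname_percentage(filename, surname):
    """Return the percentage of people in the file with the given surname."""
    total = 0
    matching = 0
    with open(filename, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if not row:
                continue  # skip empty lines
            name = row[0]
            count = int(row[1])
            total += count
            if name == surname:
                matching = count
    if total == 0:
        return 0.0  # an empty file gives 0 percent
    return matching * 100 / total


# Write the dummy data from the text to a temporary folder.
dummy = os.path.join(tempfile.mkdtemp(), "dummy_efternavne.csv")
with open(dummy, "w", encoding="utf-8") as f:
    f.write("Adams,4000\nBrown,1000\nClark,3000\nDavis,2000\n")

print(surname_percentage(dummy, "Adams"))  # 40.0
print(surname_percentage(dummy, "Brown"))  # 10.0
print(surname_percentage(dummy, "Evans"))  # 0.0 (name not in the file)
```

The total is 10 000 people, so Adams accounts for 4000 · 100 / 10 000 = 40 percent and Brown for 10 percent.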

Code 9.2: Get Texts#

In this week’s coding practice, we’ll walk you through the process of writing a Python function that reads a text file and counts how many times each letter appears in the text. Writing such a function is a complex task, and we will break it down into smaller steps here. If you follow the steps, you will have a working function by the end of the practice. Before we start, consider how you would approach this problem. You can discuss your ideas with a few classmates.

We will work on several test files. You should download the zip file texts.zip, place it in your current working directory (CWD), and unzip it. Inspect the files to understand their content.

Code 9.3: Count a Letter#

In this practice, you will write code that counts how many times a certain letter appears in a text file. We will break down the task into smaller steps.

Define a variable filename with the value texts/quick_fox.txt. Write the code that reads the file and saves its content in a variable content. Print the content or its length, just to make sure everything is working.

Define a variable letter with the value of a letter, for example, 'a'. Define also a variable count with the value 0. You will use count to store how many times the letter appears in the text.

Write a loop that goes through each character in the content. If the character is equal to the letter, increment the count by 1. Outside the loop, print the count.

Make a modification, such that your code counts both uppercase and lowercase letters. A way of accomplishing this is to convert each character you read to lowercase just before comparing it to the letter. You should use a string method for this.

Based on the previous code, write a function count_letter(filename, letter) that takes two arguments, a filename and a letter to count. The function should return the count of the letter in the file, as shown below.

>>> count_letter("texts/quick_fox.txt", "r")
2
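The steps above could be put together roughly as follows. This is only one possible sketch: it assumes the letter argument is given in lowercase, and it runs on a small sample file created in a temporary folder (sample.txt is a made-up file, not one of the course files).

```python
import os
import tempfile


def count_letter(filename, letter):
    """Count how often letter (assumed lowercase) occurs in the file."""
    with open(filename, encoding="utf-8") as f:
        content = f.read()
    count = 0
    for character in content:
        # Convert to lowercase so that 'A' and 'a' are counted together.
        if character.lower() == letter:
            count += 1
    return count


# A made-up sample file, used only to try out the function.
sample = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample, "w", encoding="utf-8") as f:
    f.write("Anna ate an apple.\n")

print(count_letter(sample, "a"))  # 5
print(count_letter(sample, "n"))  # 3
```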

Code 9.4: Letter Frequency#

The goal of this practice is to count how many times each letter of the English alphabet appears in a text file. The final result should be a dictionary where each key is the letter of the English alphabet and the corresponding value is the count of that letter in the text saved in filename. We will break down the task into smaller steps.

First, define a string letters with all the letters of the English alphabet, that is, 'abcdefghijklmnopqrstuvwxyz'. Then, define a dictionary letter_counts as an empty dictionary. Now write a loop that goes through each letter in letters, and adds the letter as a key to letter_counts with the value 0.

Read the content of the file filename and save it in a variable content, as you did in the previous code. Write a loop that goes through each character in the content. In the body of the loop, check whether the character is a letter by checking whether it is a key in the dictionary letter_counts. If the character is a letter, access the dictionary value for that letter and increment it by 1.

Finally, make the modification, such that both uppercase and lowercase letters are counted.

Based on the previous code, write a function count_letters(filename) that takes a filename as input. The function should return the dictionary containing the counts of the letters in the file. You should only count letters from the English alphabet. Both lowercase and uppercase letters should be counted as the same (lowercase) letter.
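One way the steps in this practice might fit together is sketched below. It is not the only valid approach, and the sample file is a made-up example written to a temporary folder.

```python
import os
import tempfile


def count_letters(filename):
    """Return a dictionary with the count of each English letter in the file."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    letter_counts = {}
    for letter in letters:
        letter_counts[letter] = 0  # every letter starts at zero

    with open(filename, encoding="utf-8") as f:
        content = f.read()

    for character in content:
        character = character.lower()
        # Only characters that are keys in the dictionary are letters.
        if character in letter_counts:
            letter_counts[character] += 1
    return letter_counts


# A made-up sample file, used only to try out the function.
sample = os.path.join(tempfile.mkdtemp(), "hello.txt")
with open(sample, "w", encoding="utf-8") as f:
    f.write("Hello, World!\n")

counts = count_letters(sample)
print(counts["l"])  # 3
print(counts["o"])  # 2
print(counts["z"])  # 0
```

Because every letter is added to the dictionary up front, letters that never occur in the text still appear in the result with the value 0.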

Code 9.5: Combining Files#

For this practice, you should download the zip file number_lines.zip, place it in your current working directory (CWD), and unzip it. You will use the same files in one of the problem-solving exercises.

The task is to write the code which reads all the .txt files in a given folder and combines them into one file. We will use the code to combine the files in the folder number_lines.

First, get the list of all files in the folder number_lines. Loop through the list of files and print the name of each file. All the files in the folder are probably .txt files, but add an if statement to check that each filename ends with .txt. Now add the code that reads the content of each .txt file and prints the length of that content.

Initialize an empty list content_list before the loop. In the loop, add the content of each file as an element to the list.

Decide what will be printed between the content of each file. For example, you can make a string consisting of 2 newline characters, followed by 15 dashes, followed by 2 newline characters. Save this string in a variable separator. Now combine the content of all files into one string, where the content of each file is separated by the separator, and write this to a new file combined.txt. Inspect the file to see that the content is as expected.
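The whole procedure could look roughly like the sketch below. The function name combine_files is hypothetical, and the sketch uses os.listdir from the standard library to get the file names. It sorts the names because os.listdir returns them in arbitrary order. To keep the example self-contained, it runs on a few made-up files in a temporary folder instead of the folder number_lines.

```python
import os
import tempfile


def combine_files(folder, output_filename, separator):
    """Combine all .txt files in folder into one file; return how many were used."""
    content_list = []
    for name in sorted(os.listdir(folder)):
        if name.endswith(".txt"):
            with open(os.path.join(folder, name), encoding="utf-8") as f:
                content_list.append(f.read())
    combined = separator.join(content_list)
    with open(output_filename, "w", encoding="utf-8") as f:
        f.write(combined)
    return len(content_list)


# Made-up input files, used only to try out the function.
folder = tempfile.mkdtemp()
for name, text in [("a.txt", "first"), ("b.txt", "second"), ("notes.md", "ignored")]:
    with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
        f.write(text)

separator = "\n\n" + "-" * 15 + "\n\n"
# The output is written outside the input folder.
output = os.path.join(tempfile.mkdtemp(), "combined.txt")
print(combine_files(folder, output, separator))  # 2
```

Writing combined.txt into the same folder as the input files would cause it to be picked up as an input the next time the code runs, which is why the sketch writes it elsewhere.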

Problem Solving#

Problem 9.6: Lix Number#

LIX is a readability measure indicating the difficulty of reading a text. It is defined as the sum of two numbers: the average sentence length (the number of words per sentence) and the percentage of words of more than six letters.

Write a function that takes a filename as argument and computes the lix number of the text in the file. Words in the text are separated by spaces. All sentences end with a period (.), an exclamation mark (!), or a question mark (?). If there are no sentences, the function should return 0.

The function specification is shown below.

lix_number.py

lix_number(filename)

Compute the lix number of the text in the given file.

Parameters:

  • filename

str

The path of the text file.

Returns:

  • float

The lix number of the text.

Here are some expected lix numbers for some of the provided files.

>>> lix_number('texts/text_lix_a.txt')
3.2
>>> lix_number('texts/text_lix_g.txt')
45.52702702702703

Test your function using the two test cases given above and the remaining files in the provided folder texts whose filenames start with text_lix. You should get lix numbers between 3.2 (the smallest) and 91.4 (the largest).
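As a starting point, the definition could be translated into code along the lines of the sketch below. Two choices here are assumptions that the problem text does not settle: words are split on any whitespace (not only spaces), and punctuation attached to a word is stripped before its letters are counted. The sketch has not been checked against the provided text_lix files, so its results may differ from the expected values listed above.

```python
import os
import tempfile


def lix_number(filename):
    """Compute the lix number of the text in the given file (a sketch)."""
    with open(filename, encoding="utf-8") as f:
        content = f.read()

    # Every '.', '!' or '?' is taken to end one sentence.
    sentence_count = 0
    for character in content:
        if character in ".!?":
            sentence_count += 1
    if sentence_count == 0:
        return 0

    words = content.split()
    long_words = 0
    for word in words:
        # Assumption: attached punctuation does not count as letters.
        if len(word.strip(".,!?;:")) > 6:
            long_words += 1

    average_sentence_length = len(words) / sentence_count
    long_word_percentage = 100 * long_words / len(words)
    return average_sentence_length + long_word_percentage


# A made-up sample file, used only to try out the function.
sample = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample, "w", encoding="utf-8") as f:
    f.write("I really like cats. Programming is wonderful fun!")

print(lix_number(sample))  # 29.0
```

The sample has 8 words in 2 sentences, which gives an average sentence length of 4. Two of the 8 words have more than six letters, which gives 25 percent. The sum is 29.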

Problem 9.7: Number Lines#

You have a collection of song lyrics in several files. You want to number the lines in each song text. All songs have the same format. In the first line, the title of the song is written, followed by an empty line. Then follows the song text written in lines, with verses separated by empty lines.

You want to number only the lines with lyrics, not the title of the song, or the empty lines. The numbers should take up two characters, i.e. single-digit numbers should have a leading space. After the number, there should be two spaces, and then the text of the line.

You want to keep the original files and create new files with the numbered lines. The new files should be saved in the same directory with the same name as the original files, but with the suffix _numbered added to the name, just before the extension txt.

For example, consider the file bro_bro_brille.txt with the content shown below.

Bro bro brille

Bro, bro, brille!
Klokken ringer el’ve.
Kejseren står på sit høje hvide slot,
så hvidt som et kridt,
så sort som et kul.

Fare, fare, krigsmand,
døden skal du lide,
den, som kommer allersidst,
skal i den sorte gryde.

Første gang så la’r vi ham gå,
anden gang så lige så,
tredie gang så ta’r vi ham
og putter ham i gryden!

After calling number_lines('bro_bro_brille.txt'), the file bro_bro_brille_numbered.txt with the content shown below should be created.

Bro bro brille

 1  Bro, bro, brille!
 2  Klokken ringer el’ve.
 3  Kejseren står på sit høje hvide slot,
 4  så hvidt som et kridt,
 5  så sort som et kul.

 6  Fare, fare, krigsmand,
 7  døden skal du lide,
 8  den, som kommer allersidst,
 9  skal i den sorte gryde.

10  Første gang så la’r vi ham gå,
11  anden gang så lige så,
12  tredie gang så ta’r vi ham
13  og putter ham i gryden!

The specifications are:

number_lines.py

number_lines(filename)

Create a new file with numbered song text lines.

Parameters:

  • filename

str

Filename of the original file.

You can test your code with the files from the folder number_lines provided in the coding practice. Look at the created files to confirm that the result is correct.
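A possible way to structure the function is sketched below. It relies on the format described above: the first two lines are the title and an empty line. The two-character number is produced with the format specification :2d in an f-string, and os.path.splitext is used to insert _numbered before the extension. The sample file is a made-up song written to a temporary folder.

```python
import os
import tempfile


def number_lines(filename):
    """Create a copy of the song file in which the lyric lines are numbered."""
    with open(filename, encoding="utf-8") as f:
        lines = f.read().split("\n")

    numbered = lines[:2]  # the title and the empty line after it stay as they are
    number = 0
    for line in lines[2:]:
        if line.strip() == "":
            numbered.append(line)  # empty lines are not numbered
        else:
            number += 1
            numbered.append(f"{number:2d}  {line}")

    root, extension = os.path.splitext(filename)
    with open(root + "_numbered" + extension, "w", encoding="utf-8") as f:
        f.write("\n".join(numbered))


# A made-up sample file, used only to try out the function.
sample = os.path.join(tempfile.mkdtemp(), "song.txt")
with open(sample, "w", encoding="utf-8") as f:
    f.write("Title\n\nfirst line\nsecond line\n\nthird line\n")

number_lines(sample)
with open(sample.replace(".txt", "_numbered.txt"), encoding="utf-8") as f:
    print(f.read())
```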

Problem 9.8: Nitrate Levels#

Once a week, samples of drinking water are tested for nitrate. The test results are stored in a file where each line contains a floating-point number representing one nitrate level measurement. Nitrate levels are categorized as:

  • Very low: Nitrate levels less than or equal to 4.0 mg/l.

  • Low: Nitrate levels above 4.0 but less than or equal to 9.0 mg/l.

  • Normal: Nitrate levels above 9.0 and below 40.0 mg/l.

  • High: Nitrate levels greater than or equal to 40.0 but below 50.0 mg/l.

  • Very high: Nitrate levels greater than or equal to 50.0 mg/l.

Note here that when the nitrate level falls on the border between two categories, it is included in the category further from normal. For example, a nitrate level of 4.0 mg/l is very low, and a nitrate level of 40.0 mg/l is high.
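The boundary rule above can be checked with a small helper that classifies a single measurement. The name nitrate_category is hypothetical and not part of the required interface. It only illustrates which comparisons use <= and which use <.

```python
def nitrate_category(level):
    """Return the category name for one nitrate measurement in mg/l."""
    if level <= 4.0:
        return "very low"
    elif level <= 9.0:
        return "low"
    elif level < 40.0:
        return "normal"
    elif level < 50.0:
        return "high"
    else:
        return "very high"


print(nitrate_category(4.0))   # very low
print(nitrate_category(9.0))   # low
print(nitrate_category(34.5))  # normal
print(nitrate_category(40.0))  # high
print(nitrate_category(50.0))  # very high
```

Because the conditions are tested in order, each elif only needs the upper limit of its category. The lower limit is already guaranteed by the conditions that failed before it.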

Write a function that takes a string containing the file name with the nitrate levels as input. The function should return the number of weeks where the nitrate levels were very low, low, normal, high, and very high, respectively, as shown in the example below.

Consider the file week_09_files/nitrate_data_A.txt with the content below.

34.5
34.9
36.7
29.9
34.5
44.5
34.5
46.5
29.9
34.5

None of the values are 9.0 or below, so none belong to the two lowest categories. Eight values are above 9.0 and below 40.0, classifying them as normal. Two values are between 40.0 and 50.0, placing them in the high category. No values are classified as very high. The function therefore returns 0, 0, 8, 2, 0.

The expected output may be seen in the example.

>>> nitrate_levels('week_09_files/nitrate_data_A.txt')
(0, 0, 8, 2, 0)

The filename and requirements are in the box below:

nitrate_levels.py

nitrate_levels(filename)

Return the number of weekly measurements in each category.

Parameters:

  • filename

str

Filename of the data file.

Returns:

  • tuple

Number of measurements in each of five categories for nitrate levels.

Use the script test_nitrate_levels.py to check your function. If your function fails the test in this script, it will also fail when you hand it in.

Problem 9.9: Count Differences#

The results of an experiment are recorded by two independent observers. The observers record the results as a sequence of comma-separated integers, which is saved in a file containing one line of text. We need to count the number of differences between the recorded results of the two observers.

Write a function that takes as input two strings containing the names of the files with the experiment results. If the number of results in one file is different from the number of results in the second file, the function should return -1. If the number of results is the same in the two files, the function should return the number of results that the two observers have recorded differently. Consequently, the function should return 0 if the results in both files are the same.

As an example, consider the two files below.

>>> filename1 = 'week_09_files/results_A1.txt'
>>> filename2 = 'week_09_files/results_A2.txt'

The content of the first file is:

345, 349, 367, 299, 345, 445, 345, 465, 299, 345

The content of the second file is:

345, 349, 367, 300, 354, 445, 345, 465, 300, 345

Both files contain 10 recorded results, so we inspect each pair of recorded results. The first three pairs are the same (345, 349, 367) but the fourth pair is different (299 and 300). Furthermore, the fifth and ninth pairs are different. The function should therefore return 3, as shown in the code cell below.

>>> count_differences(filename1, filename2)
3
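The example can be reproduced with the sketch below, which is one possible approach and not the reference solution. The helper read_results is hypothetical, and the sketch assumes that each file contains at least one integer. The two files are recreated in a temporary folder with the content shown above.

```python
import os
import tempfile


def read_results(filename):
    """Read one line of comma-separated integers into a list (hypothetical helper)."""
    with open(filename, encoding="utf-8") as f:
        line = f.read().strip()
    # int() ignores the spaces around each number.
    return [int(item) for item in line.split(",")]


def count_differences(filename1, filename2):
    """Return the number of differing results, or -1 if the lengths differ."""
    results1 = read_results(filename1)
    results2 = read_results(filename2)
    if len(results1) != len(results2):
        return -1
    differences = 0
    for value1, value2 in zip(results1, results2):
        if value1 != value2:
            differences += 1
    return differences


# Recreate the two example files in a temporary folder.
folder = tempfile.mkdtemp()
file1 = os.path.join(folder, "results_A1.txt")
file2 = os.path.join(folder, "results_A2.txt")
with open(file1, "w", encoding="utf-8") as f:
    f.write("345, 349, 367, 299, 345, 445, 345, 465, 299, 345\n")
with open(file2, "w", encoding="utf-8") as f:
    f.write("345, 349, 367, 300, 354, 445, 345, 465, 300, 345\n")

print(count_differences(file1, file2))  # 3
print(count_differences(file1, file1))  # 0
```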

The filename and requirements are in the box below:

count_differences.py

count_differences(filename1, filename2)

Number of differences in recorded results.

Parameters:

  • filename1

str

Filename of the first file.

  • filename2

str

Filename of the second file.

Returns:

  • int

Number of differences in recorded results.

Use the script test_count_differences.py to check your function. If your function fails the test in this script, it will also fail when you hand it in.