

Question

From the table "Appendix A. Metrics defined" that I attached, taken from the research paper "Open data quality measurement framework: Definition and application", extract the following, based on Figure 4.2:

2 Base measures

2 Derived measures

2 Indicators

2 Information products

2 Attributes

2 Variables


Columns: Characteristic, Metric, Variables, Formula, Scale, Normalization, Alternative in literature

Characteristic: Traceability
Metrics: Track of creation; Track of updates
Variables:
s: Source
dc: Date of creation
lu: List of updates
du: Dates of updates
Formulas:
tc = 2s + dc
tu = lu + du
Scales: tc in [0, 3]; tu in [0, 2]
Normalization:
tcn = tc/3
tun = tu/2
Alternative in literature: -
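The two traceability formulas can be sketched in a few lines of Python. This is only an illustration: it assumes that s, dc, lu and du are 0/1 flags (1 when the corresponding metadata is present), which is consistent with the scales [0, 3] and [0, 2] but is not stated in the table. The function names are made up for the example.

```python
def track_of_creation(s, dc):
    """Return (tc, tcn). Assumes s and dc are 0/1 presence flags."""
    tc = 2 * s + dc      # scale [0, 3]
    return tc, tc / 3    # normalized to [0, 1]


def track_of_updates(lu, du):
    """Return (tu, tun). Assumes lu and du are 0/1 presence flags."""
    tu = lu + du         # scale [0, 2]
    return tu, tu / 2    # normalized to [0, 1]


if __name__ == "__main__":
    print(track_of_creation(s=1, dc=0))
    print(track_of_updates(lu=1, du=1))
```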

Characteristic: Currentness
Metric: Percentage of current rows
Variables:
ncr: Number of not current rows
nr: Number of rows
Formula: pcr = (1 - ncr/nr) * 100
Scale: [0%, 100%]
Normalization: pcrn = pcr/100
Alternative in literature: Several authors give different definitions of timeliness and currency (Heinrich, Klier and Kaiser, 2009). One of the most used (adopted by the DQA, COLDQ and CDQ methodologies) defines timeliness as Timeliness = max(0, 1 - Currency/Volatility) (Batini, Cappiello, Francalanci and Maurino, 2009). Other references: Heinrich (2002); Ballou, Wang, Pazer and Tayi (1998).
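As a rough sketch, the percentage of current rows and the alternative timeliness definition quoted above could be computed as follows. The function names are hypothetical, and the example assumes that currency and volatility are given in the same time unit.

```python
def percentage_current_rows(ncr, nr):
    """Return (pcr, pcrn) from the count of not-current rows and total rows."""
    if nr <= 0:
        raise ValueError("nr must be positive")
    pcr = (1 - ncr / nr) * 100   # scale [0%, 100%]
    return pcr, pcr / 100        # normalized to [0, 1]


def timeliness(currency, volatility):
    """Timeliness = max(0, 1 - Currency/Volatility); same time unit assumed."""
    return max(0.0, 1 - currency / volatility)


if __name__ == "__main__":
    print(percentage_current_rows(ncr=25, nr=100))
    print(timeliness(currency=30, volatility=60))
```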

Metric: Delay in publication
Variables:
da: Date of information availability
dp: Date of publication
sd: Start date of the period of time referred to by the dataset
ed: End date of the period of time referred to by the dataset
Formulas:
da = ed + 1
dp = 1 - (dp - da)/(ed - sd)
Scale: (-∞, 1)
Normalization: dpn = dp

Characteristic: Expiration
Metric: Delay after expiration
Variables:
ed: Expiration date
cd: Current date
sd: Start date of the period of time referred to by the dataset
ed: End date of the period of time referred to by the dataset
Formula: dae = 1 - (cd - ed)/(ed - sd)
Scale: (-∞, +∞)
Normalization:
if (dae <= 0) daen = 0
else if (dae <= 1) daen = dae
else if (dae > 1) daen = 1
Alternative in literature: -
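The two date-based metrics can be illustrated with Python's standard datetime module. This is a sketch under several assumptions that the table does not spell out: "ed + 1" is read as one day after the end date, the middle branch of the normalization returns dae unchanged, and the same ed is used as both end date and expiration date. The function and parameter names are invented for the example.

```python
from datetime import date, timedelta


def delay_in_publication(published, sd, ed):
    """dp = 1 - (published - da)/(ed - sd), with da assumed to be ed + 1 day."""
    da = ed + timedelta(days=1)
    # Dividing two timedeltas yields a plain float ratio.
    return 1 - (published - da) / (ed - sd)


def delay_after_expiration(cd, sd, ed):
    """Return (dae, daen); daen clamps dae to [0, 1] (assumed middle branch)."""
    dae = 1 - (cd - ed) / (ed - sd)
    if dae <= 0:
        daen = 0.0
    elif dae <= 1:
        daen = dae
    else:
        daen = 1.0
    return dae, daen


if __name__ == "__main__":
    sd, ed = date(2023, 1, 1), date(2023, 12, 31)
    print(delay_in_publication(date(2024, 7, 1), sd, ed))
    print(delay_after_expiration(date(2026, 1, 1), sd, ed))
```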

Characteristic: Completeness
Metric: Percentage of complete cells
Variables:
nr: Number of rows
nc: Number of columns
ic: Number of incomplete cells
ncl: Number of cells
Formulas:
ncl = nr * nc
pcc = (1 - ic/ncl) * 100
Scale: [0%, 100%]
Normalization: pccn = pcc/100
Alternative in literature: Completeness with the "open world" assumption (i.e., the assumption that not all real-world entities are represented in the schema) (Batini & Scannapieco, 2006).
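A small Python sketch of the cell-completeness metric, dividing the incomplete cells by the total number of cells ncl = nr * nc. It assumes a dataset held as a list of equal-length rows and treats None and the empty string as incomplete; both choices are illustrative and not taken from the paper.

```python
def percentage_complete_cells(rows):
    """Return (pcc, pccn) for a rectangular table given as a list of rows."""
    nr = len(rows)
    nc = len(rows[0]) if rows else 0
    ncl = nr * nc                      # total number of cells
    if ncl == 0:
        raise ValueError("the table has no cells")
    # Assumption: a cell is incomplete when it is None or an empty string.
    ic = sum(1 for row in rows for cell in row if cell is None or cell == "")
    pcc = (1 - ic / ncl) * 100         # scale [0%, 100%]
    return pcc, pcc / 100


if __name__ == "__main__":
    table = [["a", 1], [None, 2], ["c", ""], ["d", 4]]
    print(percentage_complete_cells(table))
```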

Characteristic: Compliance
Metrics: Percentage of complete rows; Percentage of standard columns; eGMS compliance
Variables:
nr: Number of rows
nir: Number of incomplete rows
ns: Number of columns with associated standards
nsc: Number of standardized columns
s: Source
dc: Date of creation
c: Category
t: Title
d: Description (if applicable)
id: Identifier (if applicable)
pb: Publisher (if applicable)
cv: Coverage (recommended only)
l: Language (recommended only)
Formulas:
pcpr = (1 - nir/nr) * 100
psc = (nsc/ns) * 100
egmsc = s + dc + c + t + 0.2(d + id + pb + cv + l)
Scales: pcpr in [0%, 100%]; psc in [0%, 100%]; egmsc in [0, 5]
Normalization:
pcpn = pcpr/100
egmscn = egmsc/5
Alternative in literature: Interpretability (metric used in the Data Warehouse Quality (DWQ) methodology), defined as "Number of tuples with interpretable data, documentation for key values" (Batini et al., 2009; Jeusfeld et al., 1998).

Metric: Five star Open Data
Formula: This metric does not require any formula; the value assigned depends on the level of the scheme at which the dataset is.
Scale: [0, 5]
Normalization: fsodn = fsod/5
Alternative in literature: -

Characteristic: Understandability
Metrics: Percentage of columns with metadata; Percentage of columns in comprehensible format
Variables:
ncm: Number of columns with metadata
nc: Number of columns
ncuf: Number of columns in understandable format
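For the row-completeness and eGMS formulas listed in the compliance group, a minimal Python sketch might look like this. It assumes every eGMS variable is a 0/1 flag (1 when the metadata element is present), which matches the [0, 5] scale but is not stated explicitly; the function names and the parameter name identifier (standing in for id) are hypothetical.

```python
def percentage_complete_rows(nir, nr):
    """Return (pcpr, normalized value) from incomplete-row and row counts."""
    if nr <= 0:
        raise ValueError("nr must be positive")
    pcpr = (1 - nir / nr) * 100
    return pcpr, pcpr / 100


def egms_compliance(s, dc, c, t, d=0, identifier=0, pb=0, cv=0, l=0):
    """Return (egmsc, egmscn). All arguments are assumed to be 0/1 flags."""
    # The first four elements weigh 1 each; the other five weigh 0.2 each.
    egmsc = s + dc + c + t + 0.2 * (d + identifier + pb + cv + l)
    return egmsc, egmsc / 5


if __name__ == "__main__":
    print(percentage_complete_rows(nir=10, nr=40))
    print(egms_compliance(1, 1, 1, 1))
```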

Characteristic: Accuracy
Metric: Percentage of syntactically accurate cells
Variables:
nce: Number of cells with errors
nci: Number of cells
Formula: pac = (1 - nce/nci) * 100
Scale: [0%, 100%]
Normalization: pacn = pac/100
Alternative in literature: Semantic accuracy, which considers not only the values that do not belong to a certain domain but also all the values that do not represent the real-world entity correctly, e.g. incoherent values and typos in names (Batini & Scannapieco, 2006; Heinrich, 2002; Kaiser et al., 2007). The metric "derivation integrity" in the TIQM framework calculates the same thing in a broader way; it is defined as "percentage of correct calculations of derived data according to the integrity derivation formula or calculation definition" (Batini et al., 2009; English, 1999).

Metric: Accuracy in aggregation
Variables:
e: Errors sum
s: Scale
oav: Own aggregation value
dav: Dataset aggregation value
Formulas:
$e = \sum_{i=1}^{n} |dav_i - oav_i|$
ea = 1 - (e/s)
Scale: (-∞, 1]
Normalization:
if (ea <= 0) ean = 0
else if (ea <= 0.9) ean = 0.25 * ea
else if (ea <= 0.95) ean = 0.5 * ea
else if (ea <= 0.999) ean = 0.75 * ea
else if (ea > 0.999) ean = ea
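The aggregation-accuracy formula and its stepwise normalization can be sketched in Python as below. The table does not explain how the scale s is chosen, so the example simply takes it as a positive number supplied by the caller; the function names are illustrative.

```python
def accuracy_in_aggregation(dav, oav, s):
    """ea = 1 - e/s, where e is the sum of |dav_i - oav_i| over all pairs."""
    if len(dav) != len(oav):
        raise ValueError("dav and oav must have the same length")
    e = sum(abs(d - o) for d, o in zip(dav, oav))
    return 1 - e / s


def normalize_ea(ea):
    """Stepwise normalization: lower accuracy bands are penalized more."""
    if ea <= 0:
        return 0.0
    if ea <= 0.9:
        return 0.25 * ea
    if ea <= 0.95:
        return 0.5 * ea
    if ea <= 0.999:
        return 0.75 * ea
    return ea


if __name__ == "__main__":
    ea = accuracy_in_aggregation(dav=[100, 200], oav=[100, 140], s=300)
    print(ea, normalize_ea(ea))
```

The penalty factors mean that a dataset with ea just below 0.9 ends up with a normalized value of at most 0.225, so only near-exact aggregates score close to 1.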

Explanation / Answer

2 Base Measures -

da = ed + 1
dp = 1 - (dp - da)/(ed - sd)

pcpr = (1 - nir/nr) * 100
psc = (nsc/ns) * 100
egmsc = s + dc + c + t + 0.2(d + id + pb + cv + l)

2 Derived Measures -

tcn = tc/3
tun = tu/2

$e = \sum_{i=1}^{n} |dav_i - oav_i|$
ea = 1 - (e/s)

2 Indicators -

if (dae <= 0) daen = 0
else if (dae <= 1) daen = dae
else if (dae > 1) daen = 1

if (ea <= 0) ean = 0
else if (ea <= 0.9) ean = 0.25 * ea

2 Information Products -

2 Attributes -

2 Variables -

Percentage of complete cells
nr: Number of rows
nc: Number of columns
ic: Number of incomplete cells
ncl: Number of cells
ncl = nr * nc
pcc = (1 - ic/ncl) * 100

Percentage of syntactically accurate cells
nce: Number of cells with errors
nci: Number of cells
pac = (1 - nce/nci) * 100