A022_Consultes_PandasScimago

Scimago file with Pandas

Exercises

We start from the downloaded file scimago-medicine.csv.

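import numpy as np
import pandas as pd

# Read the Scimago ranking (adjust the file name if your download is named differently)
entries: pd.DataFrame = pd.read_csv("scimagomedicine.csv", sep=";")
entries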
Rank Sourceid Title Type Issn SJR SJR Best Quartile H index Total Docs. (2020) Total Docs. (3years) Total Refs. Total Cites (3years) Citable Docs. (3years) Cites / Doc. (2years) Ref. / Doc. Country Region Publisher Coverage Categories
0 1 28773 Ca-A Cancer Journal for Clinicians journal 15424863, 00079235 62,937 Q1 168 47 119 3452 15499 80 126,34 73,45 United States Northern America Wiley-Blackwell 1950-2020 Hematology (Q1); Oncology (Q1)
1 2 19434 MMWR Recommendations and Reports journal 10575987, 15458601 40,949 Q1 143 10 9 1292 492 9 50,00 129,20 United States Northern America Centers for Disease Control and Prevention (CDC) 1990-2020 Epidemiology (Q1); Health Information Manageme...
2 3 18991 Nature Reviews Genetics journal 14710056, 14710064 26,214 Q1 365 106 325 7332 6348 149 21,22 69,17 United Kingdom Western Europe Nature Publishing Group 2000-2020 Genetics (Q1); Genetics (clinical) (Q1); Molec...
3 4 21318 Nature Reviews Immunology journal 14741741, 14741733 20,529 Q1 390 230 436 9421 8200 202 17,33 40,96 United Kingdom Western Europe Nature Publishing Group 2001-2020 Immunology (Q1); Immunology and Allergy (Q1); ...
4 5 71056 MMWR. Surveillance summaries : Morbidity and m... journal 15458636, 15460738 19,961 Q1 100 32 48 499 2235 48 57,77 15,59 United States Northern America Centers for Disease Control and Prevention (CDC) 2002-2020 Epidemiology (Q1); Health Information Manageme...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7120 7121 25412 Zhonghua kou qiang yi xue za zhi = Zhonghua ko... journal 10020098 NaN - 14 150 0 0 0 0 0,00 0,00 China Asiatic Region Zhonghua Yixuehui Zazhishe 1987-2016, 2020 Medicine (miscellaneous)
7121 7122 21485 Zhonghua liu xing bing xue za zhi = Zhonghua l... journal 02546450 NaN - 31 292 0 0 0 0 0,00 0,00 China Asiatic Region Zhonghua Yixuehui Zazhishe 1982-2016, 2020 Medicine (miscellaneous)
7122 7123 26726 Zhonghua nei ke za zhi [Chinese journal of int... journal 05781426 NaN - 18 5 0 0 0 0 0,00 0,00 China Asiatic Region Zhonghua Yixuehui Zazhishe 1957-1959, 1979-1997, 1999-2016, 2020 Medicine (miscellaneous)
7123 7124 19324 Zhonghua wai ke za zhi [Chinese journal of sur... journal 05295815 NaN - 16 5 0 0 0 0 0,00 0,00 China Asiatic Region Zhonghua Yixuehui Zazhishe 1957, 1959-1964, 1979-2016, 2020 Medicine (miscellaneous)
7124 7125 20906 Zhurnal Mikrobiologii Epidemiologii i Immunobi... journal 03729311 NaN - 12 53 0 1264 0 0 0,00 23,85 Russian Federation Eastern Europe Izdatel'stvo S-Info 1945-1947, 1954-2016 Immunology; Medicine (miscellaneous); Microbio...

7125 rows × 20 columns

Show all rows, but only their Rank column.

entries.loc[:,"Rank"]

0          1
1          2
2          3
3          4
4          5
        ... 
7120    7121
7121    7122
7122    7123
7123    7124
7124    7125
Name: Rank, Length: 7125, dtype: int64

Show the data type of the Rank column, and then the data type of every column.

entries.loc[:,"Rank"].dtype

dtype('int64')

entries.dtypes

Rank                       int64
Sourceid                   int64
Title                     object
Type                      object
Issn                      object
SJR                       object
SJR Best Quartile         object
H index                    int64
Total Docs. (2020)         int64
Total Docs. (3years)       int64
Total Refs.                int64
Total Cites (3years)       int64
Citable Docs. (3years)     int64
Cites / Doc. (2years)     object
Ref. / Doc.               object
Country                   object
Region                    object
Publisher                 object
Coverage                  object
Categories                object
dtype: object
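
The listing above reports SJR, Cites / Doc. (2years) and Ref. / Doc. as object, which is what pandas does when numbers use a decimal comma (62,937). As a minimal sketch, assuming you want those columns as floats, `read_csv` accepts a `decimal` argument. The two-row sample below is hypothetical and only mimics the layout of the Scimago file.

```python
import io

import pandas as pd

# Hypothetical sample that mimics the layout of the Scimago export
sample_csv = io.StringIO(
    "Rank;Title;SJR;H index\n"
    "1;Journal A;62,937;168\n"
    "2;Journal B;40,949;143\n"
)

# decimal="," tells pandas to read "62,937" as the float 62.937
sample = pd.read_csv(sample_csv, sep=";", decimal=",")
print(sample.dtypes)
```

The rest of this notebook keeps the original read without `decimal`, so the outputs shown here still list these columns as object.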

# Show the row index
entries.index

RangeIndex(start=0, stop=7125, step=1)

We can filter rows based on the contents of a column. Example: show all entries whose H index is at least 450.

# Select and show the entries with an H index of at least 450
entries_high = entries.loc[:,"H index"] >= 450
entries_ok = entries.loc[entries_high,:]
entries_ok
Rank Sourceid Title Type Issn SJR SJR Best Quartile H index Total Docs. (2020) Total Docs. (3years) Total Refs. Total Cites (3years) Citable Docs. (3years) Cites / Doc. (2years) Ref. / Doc. Country Region Publisher Coverage Categories
5 6 15847 New England Journal of Medicine journal 00284793, 15334406 19,889 Q1 1030 1671 4312 15715 82469 1842 19,08 9,40 United States Northern America Massachussetts Medical Society 1945-2020 Medicine (miscellaneous) (Q1)
7 8 15819 Nature Medicine journal 1546170X, 10788956 19,536 Q1 547 452 953 10601 22548 664 23,52 23,45 United Kingdom Western Europe Nature Publishing Group 1995-2020 Biochemistry, Genetics and Molecular Biology (...
13 14 16590 Lancet, The journal 01406736, 1474547X 13,103 Q1 762 1488 4593 16580 45581 1227 9,45 11,14 United Kingdom Western Europe Elsevier Ltd. 1823-2020 Medicine (miscellaneous) (Q1)
19 20 29949 Journal of Clinical Oncology journal 15277755, 0732183X 10,482 Q1 548 583 1890 17448 23642 1221 12,29 29,93 United States Northern America American Society of Clinical Oncology 1983-2020 Cancer Research (Q1); Medicine (miscellaneous)...
43 44 22581 Circulation journal 00097322, 15244539 7,795 Q1 607 778 2685 22242 26532 1702 9,48 28,59 United States Northern America Lippincott Williams and Wilkins Ltd. 1950-2020 Cardiology and Cardiovascular Medicine (Q1); P...
69 70 15870 Journal of Clinical Investigation journal 00219738, 15588238 6,278 Q1 488 611 1446 32961 16569 1418 10,27 53,95 United States Northern America The American Society for Clinical Investigation 1945-2020 Medicine (miscellaneous) (Q1)
89 90 25454 Blood journal 15280020, 00064971 5,515 Q1 465 853 2755 26498 22558 2041 7,41 31,06 United States Northern America American Society of Hematology 1946-2020 Biochemistry (Q1); Cell Biology (Q1); Hematolo...
113 114 85291 JAMA - Journal of the American Medical Associa... journal 15383598, 00987484, 00029955 4,688 Q1 680 1793 5000 14369 30016 2627 5,46 8,01 United States Northern America American Medical Association 1883-2020 Medicine (miscellaneous) (Q1)
# Show the top 5
# Sort by values; axis=0 reorders the rows according to the given column
entries_top = entries_ok.sort_values(by=['H index'], 
                                    axis=0, 
                                    ascending=False)
entries_top.head(5)
Rank Sourceid Title Type Issn SJR SJR Best Quartile H index Total Docs. (2020) Total Docs. (3years) Total Refs. Total Cites (3years) Citable Docs. (3years) Cites / Doc. (2years) Ref. / Doc. Country Region Publisher Coverage Categories
5 6 15847 New England Journal of Medicine journal 00284793, 15334406 19,889 Q1 1030 1671 4312 15715 82469 1842 19,08 9,40 United States Northern America Massachussetts Medical Society 1945-2020 Medicine (miscellaneous) (Q1)
13 14 16590 Lancet, The journal 01406736, 1474547X 13,103 Q1 762 1488 4593 16580 45581 1227 9,45 11,14 United Kingdom Western Europe Elsevier Ltd. 1823-2020 Medicine (miscellaneous) (Q1)
113 114 85291 JAMA - Journal of the American Medical Associa... journal 15383598, 00987484, 00029955 4,688 Q1 680 1793 5000 14369 30016 2627 5,46 8,01 United States Northern America American Medical Association 1883-2020 Medicine (miscellaneous) (Q1)
43 44 22581 Circulation journal 00097322, 15244539 7,795 Q1 607 778 2685 22242 26532 1702 9,48 28,59 United States Northern America Lippincott Williams and Wilkins Ltd. 1950-2020 Cardiology and Cardiovascular Medicine (Q1); P...
19 20 29949 Journal of Clinical Oncology journal 15277755, 0732183X 10,482 Q1 548 583 1890 17448 23642 1221 12,29 29,93 United States Northern America American Society of Clinical Oncology 1983-2020 Cancer Research (Q1); Medicine (miscellaneous)...
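
The filter-then-sort pattern above can also be written with `nlargest`, which returns the rows with the highest values in a column. This is a small sketch on a hypothetical toy DataFrame (the titles and values are made up for illustration), comparing both approaches.

```python
import pandas as pd

# Hypothetical toy data standing in for the Scimago entries
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "H index": [1030, 465, 762, 120],
})

# Boolean mask: keep only the rows with an H index of at least 450
high = toy.loc[toy["H index"] >= 450, :]

# Two ways to get the top 2 rows by H index
top2_sorted = high.sort_values(by="H index", ascending=False).head(2)
top2_nlargest = high.nlargest(2, "H index")

print(top2_nlargest)
```

Both variants keep the original row labels, so the result can still be traced back to the source DataFrame.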

Set the H index to 0 for all entries whose H index is below 750.

import copy
# Set the H index to 0 for every entry whose H index is below 750
entries2 = copy.deepcopy(entries)

bad_entries_mask = (entries2.loc[:,"H index"] < 750)
entries2.loc[bad_entries_mask,"H index"] = 0
entries2.sort_values(by=["H index"],
                     axis=0,
                     ascending=False).head(5)
Rank Sourceid Title Type Issn SJR SJR Best Quartile H index Total Docs. (2020) Total Docs. (3years) Total Refs. Total Cites (3years) Citable Docs. (3years) Cites / Doc. (2years) Ref. / Doc. Country Region Publisher Coverage Categories
5 6 15847 New England Journal of Medicine journal 00284793, 15334406 19,889 Q1 1030 1671 4312 15715 82469 1842 19,08 9,40 United States Northern America Massachussetts Medical Society 1945-2020 Medicine (miscellaneous) (Q1)
13 14 16590 Lancet, The journal 01406736, 1474547X 13,103 Q1 762 1488 4593 16580 45581 1227 9,45 11,14 United Kingdom Western Europe Elsevier Ltd. 1823-2020 Medicine (miscellaneous) (Q1)
0 1 28773 Ca-A Cancer Journal for Clinicians journal 15424863, 00079235 62,937 Q1 0 47 119 3452 15499 80 126,34 73,45 United States Northern America Wiley-Blackwell 1950-2020 Hematology (Q1); Oncology (Q1)
4747 4748 21100896985 2018 IEEE Biomedical Circuits and Systems Conf... conference and proceedings - 0,266 - 0 0 178 0 216 177 1,21 0,00 United States Northern America NaN 2018 Biomedical Engineering; Electrical and Electro...
4758 4759 21100901159 International Journal of Child Care and Educat... journal 19765681, 22886729 0,265 Q3 0 11 40 419 42 40 0,77 38,09 Singapore Asiatic Region Springer Open 2007-2020 Sociology and Political Science (Q2); Communit...

Negate the H index of the entries whose H index is below 350.

# Replace the H index of every entry below 350 with its negative value
entries3 = copy.deepcopy(entries)

bad_entries_mask = (entries3.loc[:,"H index"] < 350)
entries3.loc[bad_entries_mask,"H index"] = entries3.loc[bad_entries_mask,"H index"]*(-1)
# sort_values returns a new DataFrame instead of sorting in place,
# so head(5) here shows the first rows in their original order
entries3.head(5)
Rank Sourceid Title Type Issn SJR SJR Best Quartile H index Total Docs. (2020) Total Docs. (3years) Total Refs. Total Cites (3years) Citable Docs. (3years) Cites / Doc. (2years) Ref. / Doc. Country Region Publisher Coverage Categories
0 1 28773 Ca-A Cancer Journal for Clinicians journal 15424863, 00079235 62,937 Q1 -168 47 119 3452 15499 80 126,34 73,45 United States Northern America Wiley-Blackwell 1950-2020 Hematology (Q1); Oncology (Q1)
1 2 19434 MMWR Recommendations and Reports journal 10575987, 15458601 40,949 Q1 -143 10 9 1292 492 9 50,00 129,20 United States Northern America Centers for Disease Control and Prevention (CDC) 1990-2020 Epidemiology (Q1); Health Information Manageme...
2 3 18991 Nature Reviews Genetics journal 14710056, 14710064 26,214 Q1 365 106 325 7332 6348 149 21,22 69,17 United Kingdom Western Europe Nature Publishing Group 2000-2020 Genetics (Q1); Genetics (clinical) (Q1); Molec...
3 4 21318 Nature Reviews Immunology journal 14741741, 14741733 20,529 Q1 390 230 436 9421 8200 202 17,33 40,96 United Kingdom Western Europe Nature Publishing Group 2001-2020 Immunology (Q1); Immunology and Allergy (Q1); ...
4 5 71056 MMWR. Surveillance summaries : Morbidity and m... journal 15458636, 15460738 19,961 Q1 -100 32 48 499 2235 48 57,77 15,59 United States Northern America Centers for Disease Control and Prevention (CDC) 2002-2020 Epidemiology (Q1); Health Information Manageme...
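
The same mask-based updates can be expressed without modifying a copy in place by using `Series.mask`, which replaces the values where a condition is true. The sketch below reuses the H index values of the first five rows shown above and applies the 350 threshold to both variants, purely for illustration.

```python
import pandas as pd

# H index of the first five journals in the ranking shown above
h_index = pd.Series([168, 143, 365, 390, 100], name="H index")

# True where the H index is below the threshold
low = h_index < 350

# mask() replaces the values where the condition holds and keeps the rest
zeroed = h_index.mask(low, 0)
negated = h_index.mask(low, -h_index)

print(negated)
```

`mask` returns a new Series, so the original `h_index` stays unchanged.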

Replace every Publisher value that is currently null with np.nan.

# Replace every null Publisher value with np.nan.
# Clean NAs

entries4 = copy.deepcopy(entries)

# Step 1. Find the null values with a mask.
print("Null or empty Publisher values?")

entries4.loc[:,"Publisher"].isnull().value_counts()
null_publisher_mask = entries4.loc[:,"Publisher"].isnull()

# Step 2. Check the result of the mask.
# In general: df.loc[MASK, FIELD]
print(entries4.loc[null_publisher_mask,"Publisher"])

Null or empty Publisher values?
62      NaN
485     NaN
662     NaN
1481    NaN
1545    NaN

...

# Step 3. Replace the nulls with np.nan by applying the mask.
# In general: df.loc[MASK, FIELD] = VALUE
entries4.loc[null_publisher_mask,"Publisher"] = np.nan

# Step 4. Show one row to check the result.
print(entries4.iloc[644,:])

Rank                                                                    645
Sourceid                                                              22549
Title                                                 Public Health Reviews
Type                                                                journal
Issn                                                     21076952, 03010422
SJR                                                                   1,692
SJR Best Quartile                                                        Q1
H index                                                                  34
Total Docs. (2020)                                                       31
Total Docs. (3years)                                                     68
Total Refs.                                                            1891
Total Cites (3years)                                                    376
Citable Docs. (3years)                                                   65
Cites / Doc. (2years)                                                  5,76
Ref. / Doc.                                                           61,00
Country                                                      United Kingdom
Region                                                       Western Europe
Publisher                                                               NaN
Coverage                                    1973-1980, 1982-2003, 2010-2020
Categories                Community and Home Care (Q1); Public Health, E...
Name: 644, dtype: object

The same steps in compact form: find every Publisher that is null and set it to NaN.

# Clean NAs

entries4 = copy.deepcopy(entries)

entries4.loc[:,"Publisher"].isnull().value_counts()

null_publisher_mask = entries4.loc[:,"Publisher"].isnull()

entries4.loc[null_publisher_mask,"Publisher"] = np.nan
entries4.iloc[644,:]
Rank                                                                    645
Sourceid                                                              22549
Title                                                 Public Health Reviews
Type                                                                journal
Issn                                                     21076952, 03010422
SJR                                                                   1,692
SJR Best Quartile                                                        Q1
H index                                                                  34
Total Docs. (2020)                                                       31
Total Docs. (3years)                                                     68
Total Refs.                                                            1891
Total Cites (3years)                                                    376
Citable Docs. (3years)                                                   65
Cites / Doc. (2years)                                                  5,76
Ref. / Doc.                                                           61,00
Country                                                      United Kingdom
Region                                                       Western Europe
Publisher                                                               NaN
Coverage                                    1973-1980, 1982-2003, 2010-2020
Categories                Community and Home Care (Q1); Public Health, E...
Name: 644, dtype: object
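
As a small sketch of what the mask above detects: `isnull()` flags both Python `None` and `np.nan`, so assigning `np.nan` through the mask mainly normalizes the missing-value marker. The publisher list below is a hypothetical sample, not taken from the full file.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two kinds of missing values
publishers = pd.Series(["Elsevier Ltd.", None, np.nan, "Wiley-Blackwell"])

# isnull() is True for both None and np.nan
mask = publishers.isnull()
print(mask.value_counts())

# Assigning through the mask leaves every missing entry as np.nan
publishers.loc[mask] = np.nan
```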

Update all records whose Publisher is null or NaN with the fixed string "Unknown Publisher".

# Manage NA's

entries5 = copy.deepcopy(entries4)
# Option 1: fillna returns a new Series, which is then assigned back to the column
update_publisher = entries5.loc[:,"Publisher"].fillna(value="Unknown Publisher")
entries5.loc[:,"Publisher"] = update_publisher
# Option 2: fill in place on the DataFrame, restricted to the Publisher column
entries5.fillna(value={"Publisher": "Unknown Publisher"}, inplace=True)
entries5.iloc[644,:]

Rank                                                                    645
Sourceid                                                              22549
Title                                                 Public Health Reviews
Type                                                                journal
Issn                                                     21076952, 03010422
SJR                                                                   1,692
SJR Best Quartile                                                        Q1
H index                                                                  34
Total Docs. (2020)                                                       31
Total Docs. (3years)                                                     68
Total Refs.                                                            1891
Total Cites (3years)                                                    376
Citable Docs. (3years)                                                   65
Cites / Doc. (2years)                                                  5,76
Ref. / Doc.                                                           61,00
Country                                                      United Kingdom
Region                                                       Western Europe
Publisher                                                 Unknown Publisher
Coverage                                    1973-1980, 1982-2003, 2010-2020
Categories                Community and Home Care (Q1); Public Health, E...
Name: 644, dtype: object

Replace NA values with 0, or drop the records that contain an NA value.

ser1: pd.Series = pd.Series([0,1,2,3,np.nan,5.6])
ser1.fillna(value=0,inplace=True)
ser1
0    0.0
1    1.0
2    2.0
3    3.0
4    0.0
5    5.6
dtype: float64
# Drop the records with NA values
ser2: pd.Series = pd.Series([0,1,2,3,np.nan,5.6])
ser2=ser2.dropna()
ser2
0    0.0
1    1.0
2    2.0
3    3.0
5    5.6
dtype: float64
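
On a DataFrame, the same two strategies can be limited to one specific column. The sketch below uses a hypothetical three-row DataFrame: `dropna(subset=...)` only drops rows whose NA is in the listed columns, and `fillna` with a dict only fills the named column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in two different columns
df = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "SJR": [1.5, np.nan, 0.3],
    "Publisher": [None, "P2", "P3"],
})

# Drop only the rows where SJR is missing; the missing Publisher is ignored
kept = df.dropna(subset=["SJR"])

# Fill only the SJR column with 0; Publisher keeps its missing value
filled = df.fillna({"SJR": 0})

print(kept)
print(filled)
```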

MAP, APPLYMAP, APPLY

The map method

Apply a transformation (here, doubling the value) to every element of the Series.

#1 Map
ser3: pd.Series = pd.Series([0,1,2,3])
ser3.map(lambda x:x*2)

0    0
1    2
2    4
3    6
dtype: int64

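Optimization

Passing a lambda to map is not slower than passing a named function: in both cases pandas calls a Python function once per element, and creating the function is a one-off cost. A named function is mainly easier to read, test and reuse. On very large Series or DataFrames, the real gain comes from vectorized operations (for example ser * 5), which avoid the per-element call altogether.

The following code defines the functions separately for readability.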

#1 Map
def mult5(num: int)-> int:
    return num * 5

ser3: pd.Series = pd.Series([2,4,6,8,10,12])
ser3 = ser3.map(mult5)

print(ser3)
ser4: pd.Series = pd.Series(["John","Lucy","Mary","Peter"])
ser4.map(lambda x: "Hello " + x)
def helloName(name: str)-> str:
    return "Hello " + name

ser4: pd.Series = pd.Series(["John","Lucy","Mary","Peter"])
ser4 = ser4.map(helloName)
print(ser4)

0     Hello John
1     Hello Lucy
2     Hello Mary
3    Hello Peter
dtype: object
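
For simple arithmetic or string concatenation like the examples above, pandas can usually apply the operation to the whole Series at once, without calling a Python function per element. A minimal sketch comparing both styles on the same data:

```python
import pandas as pd

numbers = pd.Series([2, 4, 6, 8, 10, 12])
names = pd.Series(["John", "Lucy", "Mary", "Peter"])

# Element-by-element with map
times5_map = numbers.map(lambda n: n * 5)
hello_map = names.map(lambda name: "Hello " + name)

# Vectorized equivalents: the operator is applied to the whole Series
times5_vec = numbers * 5
hello_vec = "Hello " + names

print(times5_vec)
print(hello_vec)
```

`map` remains the natural choice when the transformation has no vectorized counterpart, for example a dictionary lookup or custom branching logic.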

applymap
# DataFrame.applymap() works element-wise on every cell
# (pandas 2.1 and later offer DataFrame.map as the preferred name)
data = {"A": [1,2,3,9,6],
       "B": [3,4,8,6,9]}
df3 = pd.DataFrame(data)
df3.applymap(lambda x:x*2)
    A   B
0   2   6
1   4   8
2   6  16
3  18  12
4  12  18

The apply function

Here we sum the values of each column.

#Works column wise
df3.apply(lambda column:column.sum())

A    21
B    30
dtype: int64
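
By default `apply` passes each column to the function. As a short sketch with the same data as `df3`, passing `axis=1` makes it pass each row instead:

```python
import pandas as pd

df3 = pd.DataFrame({"A": [1, 2, 3, 9, 6],
                    "B": [3, 4, 8, 6, 9]})

# Default (axis=0): the function receives one column at a time
col_sums = df3.apply(lambda column: column.sum())

# axis=1: the function receives one row at a time
row_sums = df3.apply(lambda row: row.sum(), axis=1)

print(col_sums)
print(row_sums)
```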

Create a new column in your DataFrame.
df4 = copy.deepcopy(df3)
df4.loc[:,"C"] = df3.B.map(lambda x:x*2)
df4
   A  B   C
0  1  3   6
1  2  4   8
2  3  8  16
3  9  6  12
4  6  9  18

Change the column order of a DataFrame.
df5 = df4.loc[:,["C","A","B"]]
df5
    C  A  B
0   6  1  3
1   8  2  4
2  16  3  8
3  12  9  6
4  18  6  9
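
Selecting with a list of column names is one of several ways to reorder columns. The sketch below, on a hypothetical two-row DataFrame, shows three spellings that give the same result: `.loc`, plain brackets, and `reindex`.

```python
import pandas as pd

# Hypothetical small DataFrame with columns in the order A, B, C
df4 = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [6, 8]})

new_order = ["C", "A", "B"]

by_loc = df4.loc[:, new_order]               # label-based selection
by_brackets = df4[new_order]                 # shorthand for column selection
by_reindex = df4.reindex(columns=new_order)  # explicit reindexing of the columns

print(by_reindex)
```

All three return a new DataFrame and leave `df4` in its original column order.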