- Read - Read the Scimago file
- Select records with an h index above 450
- Change values - Change the h_index of all records where it is below 750
- Change values to a negative h_index whenever the h_index is below 750
- Set every Publisher that is currently null to np.nan.
- NaN - Set every record whose publisher is null to NaN
- Series - Change the values of a Series that contains NaN
- The map, apply, and applymap methods
- Add columns to an existing DataFrame
- Change the column order of a DataFrame
We start from the downloaded file scimago-medicine.csv
# Read the Scimago ranking
import numpy as np
import pandas as pd
entries: pd.DataFrame = pd.read_csv("scimagomedicine.csv", sep=";")
entries
| | Rank | Sourceid | Title | Type | Issn | SJR | SJR Best Quartile | H index | Total Docs. (2020) | Total Docs. (3years) | Total Refs. | Total Cites (3years) | Citable Docs. (3years) | Cites / Doc. (2years) | Ref. / Doc. | Country | Region | Publisher | Coverage | Categories |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 28773 | Ca-A Cancer Journal for Clinicians | journal | 15424863, 00079235 | 62,937 | Q1 | 168 | 47 | 119 | 3452 | 15499 | 80 | 126,34 | 73,45 | United States | Northern America | Wiley-Blackwell | 1950-2020 | Hematology (Q1); Oncology (Q1) |
1 | 2 | 19434 | MMWR Recommendations and Reports | journal | 10575987, 15458601 | 40,949 | Q1 | 143 | 10 | 9 | 1292 | 492 | 9 | 50,00 | 129,20 | United States | Northern America | Centers for Disease Control and Prevention (CDC) | 1990-2020 | Epidemiology (Q1); Health Information Manageme... |
2 | 3 | 18991 | Nature Reviews Genetics | journal | 14710056, 14710064 | 26,214 | Q1 | 365 | 106 | 325 | 7332 | 6348 | 149 | 21,22 | 69,17 | United Kingdom | Western Europe | Nature Publishing Group | 2000-2020 | Genetics (Q1); Genetics (clinical) (Q1); Molec... |
3 | 4 | 21318 | Nature Reviews Immunology | journal | 14741741, 14741733 | 20,529 | Q1 | 390 | 230 | 436 | 9421 | 8200 | 202 | 17,33 | 40,96 | United Kingdom | Western Europe | Nature Publishing Group | 2001-2020 | Immunology (Q1); Immunology and Allergy (Q1); ... |
4 | 5 | 71056 | MMWR. Surveillance summaries : Morbidity and m... | journal | 15458636, 15460738 | 19,961 | Q1 | 100 | 32 | 48 | 499 | 2235 | 48 | 57,77 | 15,59 | United States | Northern America | Centers for Disease Control and Prevention (CDC) | 2002-2020 | Epidemiology (Q1); Health Information Manageme... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7120 | 7121 | 25412 | Zhonghua kou qiang yi xue za zhi = Zhonghua ko... | journal | 10020098 | NaN | - | 14 | 150 | 0 | 0 | 0 | 0 | 0,00 | 0,00 | China | Asiatic Region | Zhonghua Yixuehui Zazhishe | 1987-2016, 2020 | Medicine (miscellaneous) |
7121 | 7122 | 21485 | Zhonghua liu xing bing xue za zhi = Zhonghua l... | journal | 02546450 | NaN | - | 31 | 292 | 0 | 0 | 0 | 0 | 0,00 | 0,00 | China | Asiatic Region | Zhonghua Yixuehui Zazhishe | 1982-2016, 2020 | Medicine (miscellaneous) |
7122 | 7123 | 26726 | Zhonghua nei ke za zhi [Chinese journal of int... | journal | 05781426 | NaN | - | 18 | 5 | 0 | 0 | 0 | 0 | 0,00 | 0,00 | China | Asiatic Region | Zhonghua Yixuehui Zazhishe | 1957-1959, 1979-1997, 1999-2016, 2020 | Medicine (miscellaneous) |
7123 | 7124 | 19324 | Zhonghua wai ke za zhi [Chinese journal of sur... | journal | 05295815 | NaN | - | 16 | 5 | 0 | 0 | 0 | 0 | 0,00 | 0,00 | China | Asiatic Region | Zhonghua Yixuehui Zazhishe | 1957, 1959-1964, 1979-2016, 2020 | Medicine (miscellaneous) |
7124 | 7125 | 20906 | Zhurnal Mikrobiologii Epidemiologii i Immunobi... | journal | 03729311 | NaN | - | 12 | 53 | 0 | 1264 | 0 | 0 | 0,00 | 23,85 | Russian Federation | Eastern Europe | Izdatel'stvo S-Info | 1945-1947, 1954-2016 | Immunology; Medicine (miscellaneous); Microbio... |
7125 rows × 20 columns
Show all rows, but only the Rank column
entries.loc[:,"Rank"]
0 1
1 2
2 3
3 4
4 5
...
7120 7121
7121 7122
7122 7123
7123 7124
7124 7125
Name: Rank, Length: 7125, dtype: int64
Show the data type of the Rank column, then the data types of all columns
entries.loc[:,"Rank"].dtype
dtype('int64')
entries.dtypes
Rank int64
Sourceid int64
Title object
Type object
Issn object
SJR object
SJR Best Quartile object
H index int64
Total Docs. (2020) int64
Total Docs. (3years) int64
Total Refs. int64
Total Cites (3years) int64
Citable Docs. (3years) int64
Cites / Doc. (2years) object
Ref. / Doc. object
Country object
Region object
Publisher object
Coverage object
Categories object
dtype: object
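The SJR, Cites / Doc. and Ref. / Doc. columns come out as text because the file writes decimals with a comma (for example 62,937). A minimal sketch of how the `decimal` parameter of `pd.read_csv` handles this, using a hypothetical two-row sample rather than the real file:

```python
import io

import pandas as pd

# Hypothetical sample that mimics the file layout:
# semicolon separator and decimal comma.
sample_csv = (
    "Rank;Title;SJR;H index\n"
    "1;Journal A;62,937;168\n"
    "2;Journal B;40,949;143\n"
)

# Without decimal=",", "62,937" cannot be parsed as a number and stays text.
as_text = pd.read_csv(io.StringIO(sample_csv), sep=";")

# With decimal=",", the same column is parsed as float64.
as_float = pd.read_csv(io.StringIO(sample_csv), sep=";", decimal=",")

print(as_text["SJR"].dtype)
print(as_float["SJR"].dtype)
```

Whether the numeric form is wanted depends on what is done with those columns later; the rest of this notebook works with the default read.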
# Show the index of the rows
entries.index
RangeIndex(start=0, stop=7125, step=1)
We can filter rows based on the content of a column. Example: show all entries whose H index is at least 450
# Select and show the entries with an H index of at least 450
entries_high = entries.loc[:,"H index"] >= 450
entries_ok = entries.loc[entries_high,:]
entries_ok
| | Rank | Sourceid | Title | Type | Issn | SJR | SJR Best Quartile | H index | Total Docs. (2020) | Total Docs. (3years) | Total Refs. | Total Cites (3years) | Citable Docs. (3years) | Cites / Doc. (2years) | Ref. / Doc. | Country | Region | Publisher | Coverage | Categories |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 6 | 15847 | New England Journal of Medicine | journal | 00284793, 15334406 | 19,889 | Q1 | 1030 | 1671 | 4312 | 15715 | 82469 | 1842 | 19,08 | 9,40 | United States | Northern America | Massachussetts Medical Society | 1945-2020 | Medicine (miscellaneous) (Q1) |
7 | 8 | 15819 | Nature Medicine | journal | 1546170X, 10788956 | 19,536 | Q1 | 547 | 452 | 953 | 10601 | 22548 | 664 | 23,52 | 23,45 | United Kingdom | Western Europe | Nature Publishing Group | 1995-2020 | Biochemistry, Genetics and Molecular Biology (... |
13 | 14 | 16590 | Lancet, The | journal | 01406736, 1474547X | 13,103 | Q1 | 762 | 1488 | 4593 | 16580 | 45581 | 1227 | 9,45 | 11,14 | United Kingdom | Western Europe | Elsevier Ltd. | 1823-2020 | Medicine (miscellaneous) (Q1) |
19 | 20 | 29949 | Journal of Clinical Oncology | journal | 15277755, 0732183X | 10,482 | Q1 | 548 | 583 | 1890 | 17448 | 23642 | 1221 | 12,29 | 29,93 | United States | Northern America | American Society of Clinical Oncology | 1983-2020 | Cancer Research (Q1); Medicine (miscellaneous)... |
43 | 44 | 22581 | Circulation | journal | 00097322, 15244539 | 7,795 | Q1 | 607 | 778 | 2685 | 22242 | 26532 | 1702 | 9,48 | 28,59 | United States | Northern America | Lippincott Williams and Wilkins Ltd. | 1950-2020 | Cardiology and Cardiovascular Medicine (Q1); P... |
69 | 70 | 15870 | Journal of Clinical Investigation | journal | 00219738, 15588238 | 6,278 | Q1 | 488 | 611 | 1446 | 32961 | 16569 | 1418 | 10,27 | 53,95 | United States | Northern America | The American Society for Clinical Investigation | 1945-2020 | Medicine (miscellaneous) (Q1) |
89 | 90 | 25454 | Blood | journal | 15280020, 00064971 | 5,515 | Q1 | 465 | 853 | 2755 | 26498 | 22558 | 2041 | 7,41 | 31,06 | United States | Northern America | American Society of Hematology | 1946-2020 | Biochemistry (Q1); Cell Biology (Q1); Hematolo... |
113 | 114 | 85291 | JAMA - Journal of the American Medical Associa... | journal | 15383598, 00987484, 00029955 | 4,688 | Q1 | 680 | 1793 | 5000 | 14369 | 30016 | 2627 | 5,46 | 8,01 | United States | Northern America | American Medical Association | 1883-2020 | Medicine (miscellaneous) (Q1) |
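The filter above works in two steps: a comparison produces a boolean Series (the mask), and `.loc` keeps the rows where the mask is True. A small sketch on a hypothetical four-row table, which also shows how two masks can be combined:

```python
import pandas as pd

# Hypothetical miniature of the ranking table.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "H index": [1030, 547, 168, 450],
    "Country": ["United States", "United Kingdom",
                "United States", "United Kingdom"],
})

# Boolean Series with one True/False value per row.
high = toy["H index"] >= 450

# Masks combine with & (and) and | (or); each comparison needs parentheses.
high_us = high & (toy["Country"] == "United States")

print(toy.loc[high, "Title"].tolist())     # ['A', 'B', 'D']
print(toy.loc[high_us, "Title"].tolist())  # ['A']
```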
# Show the top 5
# Sort by value; axis=0 reorders the rows according to the given column
entries_top = entries_ok.sort_values(by=['H index'],
axis=0,
ascending=False)
entries_top.head(5)
| | Rank | Sourceid | Title | Type | Issn | SJR | SJR Best Quartile | H index | Total Docs. (2020) | Total Docs. (3years) | Total Refs. | Total Cites (3years) | Citable Docs. (3years) | Cites / Doc. (2years) | Ref. / Doc. | Country | Region | Publisher | Coverage | Categories |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 6 | 15847 | New England Journal of Medicine | journal | 00284793, 15334406 | 19,889 | Q1 | 1030 | 1671 | 4312 | 15715 | 82469 | 1842 | 19,08 | 9,40 | United States | Northern America | Massachussetts Medical Society | 1945-2020 | Medicine (miscellaneous) (Q1) |
13 | 14 | 16590 | Lancet, The | journal | 01406736, 1474547X | 13,103 | Q1 | 762 | 1488 | 4593 | 16580 | 45581 | 1227 | 9,45 | 11,14 | United Kingdom | Western Europe | Elsevier Ltd. | 1823-2020 | Medicine (miscellaneous) (Q1) |
113 | 114 | 85291 | JAMA - Journal of the American Medical Associa... | journal | 15383598, 00987484, 00029955 | 4,688 | Q1 | 680 | 1793 | 5000 | 14369 | 30016 | 2627 | 5,46 | 8,01 | United States | Northern America | American Medical Association | 1883-2020 | Medicine (miscellaneous) (Q1) |
43 | 44 | 22581 | Circulation | journal | 00097322, 15244539 | 7,795 | Q1 | 607 | 778 | 2685 | 22242 | 26532 | 1702 | 9,48 | 28,59 | United States | Northern America | Lippincott Williams and Wilkins Ltd. | 1950-2020 | Cardiology and Cardiovascular Medicine (Q1); P... |
19 | 20 | 29949 | Journal of Clinical Oncology | journal | 15277755, 0732183X | 10,482 | Q1 | 548 | 583 | 1890 | 17448 | 23642 | 1221 | 12,29 | 29,93 | United States | Northern America | American Society of Clinical Oncology | 1983-2020 | Cancer Research (Q1); Medicine (miscellaneous)... |
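Sorting in descending order and taking the head is one way to get the top rows. As an alternative sketch (not the approach used in this notebook), `DataFrame.nlargest` returns the same rows in one call. The data is hypothetical:

```python
import pandas as pd

# Hypothetical miniature of the ranking table.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "H index": [547, 1030, 168, 762],
})

# Approach used above: sort descending, then keep the first rows.
top_sorted = toy.sort_values(by="H index", ascending=False).head(2)

# Alternative: ask directly for the rows with the 2 largest values.
top_nlargest = toy.nlargest(2, "H index")

print(top_sorted["Title"].tolist())    # ['B', 'D']
print(top_nlargest["Title"].tolist())  # ['B', 'D']
```

Both results keep the original row labels, so the index still identifies each journal in the full table.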
Set the H index to 0 for all entries whose H index is below 750.
import copy
# Set the H index to 0 for all entries below 750
entries2 = copy.deepcopy(entries)
bad_entries_mask = (entries2.loc[:,"H index"] < 750)
entries2.loc[bad_entries_mask,"H index"] = 0
entries2.sort_values(by=["H index"],
axis=0,
ascending=False).head(5)
| | Rank | Sourceid | Title | Type | Issn | SJR | SJR Best Quartile | H index | Total Docs. (2020) | Total Docs. (3years) | Total Refs. | Total Cites (3years) | Citable Docs. (3years) | Cites / Doc. (2years) | Ref. / Doc. | Country | Region | Publisher | Coverage | Categories |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 6 | 15847 | New England Journal of Medicine | journal | 00284793, 15334406 | 19,889 | Q1 | 1030 | 1671 | 4312 | 15715 | 82469 | 1842 | 19,08 | 9,40 | United States | Northern America | Massachussetts Medical Society | 1945-2020 | Medicine (miscellaneous) (Q1) |
13 | 14 | 16590 | Lancet, The | journal | 01406736, 1474547X | 13,103 | Q1 | 762 | 1488 | 4593 | 16580 | 45581 | 1227 | 9,45 | 11,14 | United Kingdom | Western Europe | Elsevier Ltd. | 1823-2020 | Medicine (miscellaneous) (Q1) |
0 | 1 | 28773 | Ca-A Cancer Journal for Clinicians | journal | 15424863, 00079235 | 62,937 | Q1 | 0 | 47 | 119 | 3452 | 15499 | 80 | 126,34 | 73,45 | United States | Northern America | Wiley-Blackwell | 1950-2020 | Hematology (Q1); Oncology (Q1) |
4747 | 4748 | 21100896985 | 2018 IEEE Biomedical Circuits and Systems Conf... | conference and proceedings | - | 0,266 | - | 0 | 0 | 178 | 0 | 216 | 177 | 1,21 | 0,00 | United States | Northern America | NaN | 2018 | Biomedical Engineering; Electrical and Electro... |
4758 | 4759 | 21100901159 | International Journal of Child Care and Educat... | journal | 19765681, 22886729 | 0,265 | Q3 | 0 | 11 | 40 | 419 | 42 | 40 | 0,77 | 38,09 | Singapore | Asiatic Region | Springer Open | 2007-2020 | Sociology and Political Science (Q2); Communit... |
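The assignment through `.loc` with a mask overwrites only the selected cells. A minimal sketch on hypothetical data, next to `Series.mask`, which expresses the same replacement as an expression that returns a new Series:

```python
import pandas as pd

# Hypothetical miniature of the ranking table.
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "H index": [1030, 168, 762],
})
low = toy["H index"] < 750

# Approach used above: assign through .loc on a copy.
with_loc = toy.copy()
with_loc.loc[low, "H index"] = 0

# Alternative: Series.mask replaces values where the condition is True.
with_mask = toy.copy()
with_mask["H index"] = toy["H index"].mask(low, 0)

print(with_loc["H index"].tolist())   # [1030, 0, 762]
print(with_mask["H index"].tolist())  # [1030, 0, 762]
print(toy["H index"].tolist())        # original untouched: [1030, 168, 762]
```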
Negate the H index of the entries whose value is below a threshold. The exercise statement says 750, but the code uses 350, and the output below reflects 350.
# Negate the H index of the entries whose value is below 350.
entries3 = copy.deepcopy(entries)
bad_entries_mask = (entries3.loc[:,"H index"] < 350)
entries3.loc[bad_entries_mask,"H index"] = entries3.loc[bad_entries_mask,"H index"]*(-1)
# sort_values returns a new DataFrame; the result is not assigned here,
# so entries3.head(5) still shows the original row order.
entries3.sort_values(by=["H index"],
axis=0,
ascending=False)
entries3.head(5)
| | Rank | Sourceid | Title | Type | Issn | SJR | SJR Best Quartile | H index | Total Docs. (2020) | Total Docs. (3years) | Total Refs. | Total Cites (3years) | Citable Docs. (3years) | Cites / Doc. (2years) | Ref. / Doc. | Country | Region | Publisher | Coverage | Categories |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 28773 | Ca-A Cancer Journal for Clinicians | journal | 15424863, 00079235 | 62,937 | Q1 | -168 | 47 | 119 | 3452 | 15499 | 80 | 126,34 | 73,45 | United States | Northern America | Wiley-Blackwell | 1950-2020 | Hematology (Q1); Oncology (Q1) |
1 | 2 | 19434 | MMWR Recommendations and Reports | journal | 10575987, 15458601 | 40,949 | Q1 | -143 | 10 | 9 | 1292 | 492 | 9 | 50,00 | 129,20 | United States | Northern America | Centers for Disease Control and Prevention (CDC) | 1990-2020 | Epidemiology (Q1); Health Information Manageme... |
2 | 3 | 18991 | Nature Reviews Genetics | journal | 14710056, 14710064 | 26,214 | Q1 | 365 | 106 | 325 | 7332 | 6348 | 149 | 21,22 | 69,17 | United Kingdom | Western Europe | Nature Publishing Group | 2000-2020 | Genetics (Q1); Genetics (clinical) (Q1); Molec... |
3 | 4 | 21318 | Nature Reviews Immunology | journal | 14741741, 14741733 | 20,529 | Q1 | 390 | 230 | 436 | 9421 | 8200 | 202 | 17,33 | 40,96 | United Kingdom | Western Europe | Nature Publishing Group | 2001-2020 | Immunology (Q1); Immunology and Allergy (Q1); ... |
4 | 5 | 71056 | MMWR. Surveillance summaries : Morbidity and m... | journal | 15458636, 15460738 | 19,961 | Q1 | -100 | 32 | 48 | 499 | 2235 | 48 | 57,77 | 15,59 | United States | Northern America | Centers for Disease Control and Prevention (CDC) | 2002-2020 | Epidemiology (Q1); Health Information Manageme... |
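The table above is still in the original row order because `sort_values` returns a new DataFrame instead of sorting in place. A short sketch on hypothetical data that negates the low values and then shows the difference between discarding and keeping the sorted result:

```python
import pandas as pd

# Hypothetical miniature of the ranking table.
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "H index": [168, 365, 100],
})

# Negate the values below the threshold, in the masked rows only.
low = toy["H index"] < 350
toy.loc[low, "H index"] *= -1

# The sorted result is discarded here, so toy keeps its row order.
toy.sort_values(by="H index", ascending=False)
print(toy["Title"].tolist())     # ['A', 'B', 'C']

# Assigning the result keeps the sorted view.
ranked = toy.sort_values(by="H index", ascending=False)
print(ranked["Title"].tolist())  # ['B', 'C', 'A']
```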
Set every Publisher that is currently null to np.nan. pandas already treats the empty fields read from the CSV as NaN, so this step mainly illustrates how to locate and overwrite values with a mask.
# Set every Publisher that is currently null to np.nan.
# Clean NAs
entries4 = copy.deepcopy(entries)
# Step 1. Find the null values with a mask.
print("Publisher values null or empty??")
entries4.loc[:,"Publisher"].isnull().value_counts()
null_publisher_mask = entries4.loc[:,"Publisher"].isnull()
# Step 2. Check the result of the mask.
# In general: df.loc[MASK, FIELD]
print(entries4.loc[null_publisher_mask,"Publisher"])
Publisher values null or empty??
62 NaN
485 NaN
662 NaN
1481 NaN
1545 NaN
...
# Step 3. Replace the nulls with np.nan, applying the mask.
# In general: df.loc[MASK, FIELD] = VALUE
entries4.loc[null_publisher_mask,"Publisher"] = np.nan
# Step 4. Show one record to check (row 644, as in the output below).
print(entries4.iloc[644,:])
Rank 645
Sourceid 22549
Title Public Health Reviews
Type journal
Issn 21076952, 03010422
SJR 1,692
SJR Best Quartile Q1
H index 34
Total Docs. (2020) 31
Total Docs. (3years) 68
Total Refs. 1891
Total Cites (3years) 376
Citable Docs. (3years) 65
Cites / Doc. (2years) 5,76
Ref. / Doc. 61,00
Country United Kingdom
Region Western Europe
Publisher NaN
Coverage 1973-1980, 1982-2003, 2010-2020
Categories Community and Home Care (Q1); Public Health, E...
Name: 644, dtype: object
The same operation in condensed form: set every Publisher that is null to NaN
# Clean NAs
entries4 = copy.deepcopy(entries)
entries4.loc[:,"Publisher"].isnull().value_counts()
null_publisher_mask = entries4.loc[:,"Publisher"].isnull()
entries4.loc[null_publisher_mask,"Publisher"] = np.nan
entries4.iloc[644,:]
Rank 645
Sourceid 22549
Title Public Health Reviews
Type journal
Issn 21076952, 03010422
SJR 1,692
SJR Best Quartile Q1
H index 34
Total Docs. (2020) 31
Total Docs. (3years) 68
Total Refs. 1891
Total Cites (3years) 376
Citable Docs. (3years) 65
Cites / Doc. (2years) 5,76
Ref. / Doc. 61,00
Country United Kingdom
Region Western Europe
Publisher NaN
Coverage 1973-1980, 1982-2003, 2010-2020
Categories Community and Home Care (Q1); Public Health, E...
Name: 644, dtype: object
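To see which values `isnull()` (an alias of `isna()`) treats as missing, here is a minimal sketch with a hypothetical three-row sample. With the default settings of `pd.read_csv`, an empty field and the literal text NULL are both read as NaN:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample: one real publisher, one empty field, one literal NULL.
sample_csv = (
    "Title;Publisher\n"
    "Journal A;Wiley\n"
    "Journal B;\n"
    "Journal C;NULL\n"
)
df = pd.read_csv(io.StringIO(sample_csv), sep=";")

# Both the empty field and "NULL" are detected as missing.
print(df["Publisher"].isna().tolist())  # [False, True, True]

# None and np.nan are both treated as missing in a Series.
mixed = pd.Series(["x", None, np.nan])
print(mixed.isna().tolist())            # [False, True, True]
```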
Update all records that are null or NaN with the fixed string "Unknown Publisher"
# Manage NAs
entries5 = copy.deepcopy(entries4)
# Option 1: fillna returns a new Series, which is assigned back to the column.
update_publisher = entries5.loc[:,"Publisher"].fillna(value="Unknown Publisher")
entries5.loc[:,"Publisher"] = update_publisher
# Option 2: fill in place on the DataFrame itself, limited to the Publisher column.
# (Calling fillna(inplace=True) on a .loc selection may act on a temporary copy.)
entries5.fillna(value={"Publisher": "Unknown Publisher"}, inplace=True)
entries5.iloc[644,:]
Rank 645
Sourceid 22549
Title Public Health Reviews
Type journal
Issn 21076952, 03010422
SJR 1,692
SJR Best Quartile Q1
H index 34
Total Docs. (2020) 31
Total Docs. (3years) 68
Total Refs. 1891
Total Cites (3years) 376
Citable Docs. (3years) 65
Cites / Doc. (2years) 5,76
Ref. / Doc. 61,00
Country United Kingdom
Region Western Europe
Publisher Unknown Publisher
Coverage 1973-1980, 1982-2003, 2010-2020
Categories Community and Home Care (Q1); Public Health, E...
Name: 644, dtype: object
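Passing a dictionary to `DataFrame.fillna` limits the replacement to the named columns and leaves missing values elsewhere alone. A minimal sketch on hypothetical data, with a placeholder string chosen for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical table with missing values in two different columns.
toy = pd.DataFrame({
    "Title": ["A", "B"],
    "Publisher": ["Wiley", np.nan],
    "Coverage": [np.nan, "2018"],
})

# Only the Publisher column is filled; Coverage keeps its NaN.
filled = toy.fillna({"Publisher": "Unknown Publisher"})

print(filled["Publisher"].tolist())        # ['Wiley', 'Unknown Publisher']
print(filled["Coverage"].isna().tolist())  # [True, False]
print(toy["Publisher"].isna().sum())       # the original still has 1 NaN
```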
Replace NA values with 0, or drop the records that have an NA value in a given column.
ser1: pd.Series = pd.Series([0,1,2,3,np.nan,5.6])
ser1.fillna(value=0,inplace=True)
ser1
0 0.0
1 1.0
2 2.0
3 3.0
4 0.0
5 5.6
dtype: float64
# Drop the records with NA values
ser2: pd.Series = pd.Series([0,1,2,3,np.nan,5.6])
ser2=ser2.dropna()
ser2
0 0.0
1 1.0
2 2.0
3 3.0
5 5.6
dtype: float64
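The Series examples above carry over to a DataFrame, where `dropna` accepts a `subset` argument to look at specific columns only. A minimal sketch on hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical table with missing values in different columns.
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Publisher": ["Wiley", np.nan, "Elsevier"],
    "SJR": [1.5, 2.0, np.nan],
})

# Drop only the rows whose Publisher is missing.
with_publisher = toy.dropna(subset=["Publisher"])

# Drop every row that has a missing value in any column.
complete = toy.dropna()

print(with_publisher["Title"].tolist())  # ['A', 'C']
print(complete["Title"].tolist())        # ['A']
```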
Apply a transformation (here, doubling the value) to every element of the Series
#1 Map
ser3: pd.Series = pd.Series([0,1,2,3])
ser3.map(lambda x:x*2)
0 0
1 2
2 4
3 6
dtype: int64
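Optimization
A lambda and an equivalent named function perform about the same with map, because either way pandas calls a Python function once per element. On very large Series or DataFrames that per-element call is what costs time, so a vectorized expression (for example ser3 * 5) is usually much faster. A named function is still worth defining for readability and reuse.
The code below therefore defines the function separately.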
#1 Map
def mult5(num: int) -> int:
    return num * 5
ser3: pd.Series = pd.Series([2,4,6,8,10,12])
ser3 = ser3.map(mult5)
print(ser3)
# Same idea with strings, first with a lambda
ser4: pd.Series = pd.Series(["John","Lucy","Mary","Peter"])
ser4.map(lambda x: "Hello " + x)
# ...and then with a named function
def helloName(name: str) -> str:
    return "Hello " + name
ser4: pd.Series = pd.Series(["John","Lucy","Mary","Peter"])
ser4 = ser4.map(helloName)
print(ser4)
0 Hello John
1 Hello Lucy
2 Hello Mary
3 Hello Peter
dtype: object
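For simple arithmetic or string concatenation, the same result can be obtained without `map` by operating on the whole Series at once. A minimal sketch comparing both forms, using the same sample values as above:

```python
import pandas as pd


def mult5(num: int) -> int:
    return num * 5


numbers = pd.Series([2, 4, 6, 8, 10, 12])
mapped = numbers.map(mult5)  # calls mult5 once per element
vectorized = numbers * 5     # one operation on the whole Series

names = pd.Series(["John", "Lucy", "Mary", "Peter"])
greeted_map = names.map(lambda x: "Hello " + x)
greeted_vec = "Hello " + names

print(mapped.tolist())       # [10, 20, 30, 40, 50, 60]
print(vectorized.tolist())   # [10, 20, 30, 40, 50, 60]
print(greeted_vec.tolist())  # ['Hello John', 'Hello Lucy', 'Hello Mary', 'Hello Peter']
```

`map` remains the right tool when the transformation has no vectorized equivalent, for example custom per-element logic.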
# DataFrame.applymap() works element-wise on every cell
# (deprecated since pandas 2.1 in favor of DataFrame.map)
data = {"A": [1,2],
        "B": [3,4]}
df3 = pd.DataFrame(data)
df3.applymap(lambda x:x*2)
| | A | B |
---|---|---|
0 | 2 | 6 |
1 | 4 | 8 |
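`applymap` was deprecated in pandas 2.1 in favor of `DataFrame.map`, which does the same element-wise work. A hedged sketch that runs on either side of that change by picking whichever method the installed version offers; the small table is an assumption chosen to match the output above:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# DataFrame.map exists from pandas 2.1; older versions only have applymap.
elementwise = df.map if hasattr(df, "map") else df.applymap
doubled = elementwise(lambda x: x * 2)

# For plain arithmetic, the vectorized form gives the same table.
vectorized = df * 2

print(doubled.values.tolist())     # [[2, 6], [4, 8]]
print(vectorized.values.tolist())  # [[2, 6], [4, 8]]
```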
In this case we sum the values of each column
# apply() works column-wise by default (axis=0)
df3.apply(lambda column:column.sum())
A 3
B 7
dtype: int64
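`apply` passes one column at a time to the function by default; with `axis=1` it passes one row at a time instead. A minimal sketch on a small table assumed to match the sums shown above:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Default (axis=0): the function receives each column as a Series.
col_sums = df.apply(lambda column: column.sum())

# axis=1: the function receives each row as a Series.
row_sums = df.apply(lambda row: row.sum(), axis=1)

print(col_sums.to_dict())  # {'A': 3, 'B': 7}
print(row_sums.tolist())   # [4, 6]
```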
# Add a new column C to an existing DataFrame
df4 = copy.deepcopy(df3)
df4.loc[:,"C"] = df3.B.map(lambda x:x*2)
df4
| | A | B | C |
---|---|---|---|
0 | 1 | 3 | 6 |
1 | 2 | 4 | 8 |
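Besides assigning through `.loc`, a column can be added with plain bracket assignment or with `DataFrame.assign`, which returns a new DataFrame and leaves the original unchanged. A minimal sketch; the column D is a hypothetical extra for illustration:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# Bracket assignment adds the column to df itself.
df["C"] = df["B"] * 2

# assign returns a new DataFrame; df keeps only A, B and C.
extended = df.assign(D=df["A"] + df["B"])

print(list(df.columns))        # ['A', 'B', 'C']
print(list(extended.columns))  # ['A', 'B', 'C', 'D']
print(extended["D"].tolist())  # [4, 6]
```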
# Change the column order by selecting the columns in the desired order
df5 = df4.loc[:,["C","A","B"]]
df5
| | C | A | B |
---|---|---|---|
0 | 6 | 1 | 3 |
1 | 8 | 2 | 4 |
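Selecting with a list of names is the approach used above. As a sketch of two equivalent spellings, bracket selection and `reindex(columns=...)` produce the same column order on a small table assumed to match the one shown:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [6, 8]})
new_order = ["C", "A", "B"]

# Bracket selection with a list of column names.
by_list = df[new_order]

# reindex with the columns keyword gives the same layout.
by_reindex = df.reindex(columns=new_order)

print(list(by_list.columns))     # ['C', 'A', 'B']
print(list(by_reindex.columns))  # ['C', 'A', 'B']
print(list(df.columns))          # original order kept: ['A', 'B', 'C']
```

One difference worth knowing: a name missing from the table raises a KeyError with bracket selection, while `reindex` creates that column filled with NaN.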